

8.4 Parallel Computing in Statistics

8.4.1 Parallel Applications in Statistical Computing

The most important issue in parallel computing is how to divide a job into small tasks for parallel execution. The amount of independent parallel processing that can occur before some sort of communication or synchronization is required is called the "granularity". Fine granularity may allow only a few arithmetic operations between processing one message and the next, whereas coarse granularity may allow millions. Although the parallel computing techniques described above can support programming of any granularity, coarse granularity is preferable for many statistical tasks: fine granularity requires frequent information exchange among processors and makes the required programs difficult to write. Fortunately, many statistical tasks are easily divided into coarse-grained tasks, and some of them are embarrassingly parallel.

In data analysis, we often wish to perform the same statistical calculation on many data sets. The calculation for each data set is performed independently of the others, so the calculations can proceed simultaneously. For example, [16] implemented the backfitting algorithm to estimate a generalized additive model for a large data set by dividing it into small data sets, fitting functions in parallel and merging them together. [3] performed parallel multiple correspondence analysis by dividing the original data set and merging the partial results.
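As a minimal sketch of this pattern, assuming the R package snow introduced in Sect. 8.4.2 and a purely illustrative list of simulated data sets, the same regression could be fitted to every data set in parallel as follows; the data, cluster size and variable names are hypothetical.

library(snow)

## illustrative data: ten small data sets receiving the same analysis
datasets <- lapply(1:10, function(i)
    data.frame(x = rnorm(100), y = rnorm(100)))

cl <- makeCluster(4, type = "PVM")             # assumes four PVM hosts
fits <- clusterApplyLB(cl, datasets,           # one data set per task,
    function(d) coef(lm(y ~ x, data = d)))     # load-balanced over workers
stopCluster(cl)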

Another embarrassingly parallel example is a simulation or resampling computation, which generates new data sets by a random number generating mechanism based on a given data set or parameters. We calculate some statistics for each of these data sets, repeat the operation many times and summarize the results to show the empirical distribution of the statistics. In this case, all calculations except the final summary can be performed simultaneously. [3] provided an example of bootstrapping in parallel multiple correspondence analysis.
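A simple parallel bootstrap of a sample mean, again only an illustration built on snow and SPRNG (Sect. 8.4.2) rather than the analysis of [3], might look like this; the sample x and the helper boot.mean are hypothetical.

library(snow)

x <- rnorm(200)                                # observed sample (illustrative)
boot.mean <- function(n, data)
    replicate(n, mean(sample(data, replace = TRUE)))

cl <- makeCluster(2, type = "PVM")
clusterSetupSPRNG(cl)                          # independent streams per worker
res <- clusterCall(cl, boot.mean, 500, x)      # 500 resamples on each worker
boot.dist <- unlist(res)                       # 1000 bootstrap means in total
quantile(boot.dist, c(0.025, 0.975))           # empirical 95% interval
stopCluster(cl)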

We must be careful that random numbers are generated appropriately in parallel execution; at a minimum, the random seeds of the processes should all be different. SPRNG ([23]) is a useful random number generator for parallel programming: it allows the dynamic creation of independent random number streams on parallel machines without interprocessor communication. It is available in the MPI environment; the macro SIMPLE_SPRNG should be defined to invoke the simple interface, and the macro USE_MPI should be defined to instruct SPRNG to make MPI calls during initialization. Fortran users include the header file sprng_f.h and call sprng() to obtain a double precision random number in (0,1). When compiling, the SPRNG library (for example liblcg.a) and the MPI library must be linked.
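In R, for comparison, a naive scheme (a sketch only, weaker than a dedicated parallel generator such as SPRNG) is simply to give each snow worker a different seed:

library(snow)

cl <- makeCluster(2, type = "PVM")
## different seeds per worker, but the streams are not guaranteed independent
clusterApply(cl, seq(along = cl), function(i) set.seed(1000 + i))
clusterCall(cl, runif, 3)
stopCluster(cl)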

The maximum likelihood method requires a large amount of computation and can be parallelized. [18] described a parallel implementation of maximum likelihood estimation using the EM algorithm for positron emission tomography image reconstruction. [35] showed maximum likelihood estimation for a simple econometric problem, with Fortran code and a full explanation of MPI. [22] solved a restricted maximum likelihood estimation of a variance-covariance matrix by using freely available toolkits built on MPI: the portable extensible toolkit for scientific computation (PETSc) and the toolkit for advanced optimization (TAO) ([2]).
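The cited implementations use Fortran or C with MPI; purely as a sketch of the underlying idea in R with snow, the log-likelihood can be split over workers because it is a sum of independent contributions, and the optimizer on the master only needs the total. The data, chunking and parametrization below are illustrative assumptions.

library(snow)

x <- rnorm(10000, mean = 2, sd = 3)            # illustrative data

cl <- makeCluster(2, type = "PVM")
chunks <- clusterSplit(cl, x)                  # partition the data over workers

## negative log-likelihood for a normal model, theta = (mean, log sd);
## each worker evaluates its own chunk (the chunks are resent at every
## call in this simple sketch)
negloglik <- function(theta) {
    parts <- clusterApply(cl, chunks,
        function(d, m, s) -sum(dnorm(d, mean = m, sd = s, log = TRUE)),
        theta[1], exp(theta[2]))
    sum(unlist(parts))
}

fit <- optim(c(0, 0), negloglik)               # maximum likelihood estimates
stopCluster(cl)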

Optimization with dynamic programming requires much computation and is suitable for parallel computing. [15] used this technique to solve sequential allocation problems involving three Bernoulli populations. [9] applied it to the problem of discretizing multidimensional probability functions.

[31] demonstrated that kernel density estimation can also be calculated efficiently in parallel.
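One natural decomposition, sketched here in R with snow and not necessarily the scheme of [31], is to split the grid of evaluation points over the workers while each worker keeps the full data; the bandwidth and kernel below are illustrative.

library(snow)

x    <- rnorm(5000)                            # data (illustrative)
h    <- 0.2                                    # bandwidth
grid <- seq(-4, 4, length.out = 512)           # evaluation points

## Gaussian kernel density estimate at the points in g
kde <- function(g, data, h)
    sapply(g, function(t) mean(dnorm((t - data) / h)) / h)

cl    <- makeCluster(2, type = "PVM")
parts <- clusterSplit(cl, grid)                # each worker gets part of the grid
dens  <- unlist(clusterApply(cl, parts, kde, x, h))
stopCluster(cl)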

8.4.2 Parallel Software for Statistics

Several commercial and non-commercial parallel linear algebra packages that are useful for statistical computation are available for Fortran and/or C. We mention two non-commercial packages with freely available source code: ScaLAPACK ([4]), which supports MPI and PVM, and PLAPACK ([40]), which supports MPI. [26] described the work of porting sequential libraries (Gram-Schmidt orthogonalization and linear least squares with equality constraints) to parallel systems by using Fortran with MPI.

Although we have many statistical software products, few of them have parallel features; parallel statistical systems are still at the research stage. [5] ported the multilevel modeling package MLn to a shared memory system by using C++ with threads. [41] described a system for time series analysis that can use several computers via Tkpvm, an implementation of PVM for the Tcl/Tk language.

The statistical systems R ([38]) and S ([7]) have several projects to add parallel computing features. [37] added thread functions to S. PVM and MPI are directly available from R via the rpvm ([21]) and Rmpi ([42]) packages. These packages are used to implement the package "snow" ([32]), which provides simple commands for using a workstation cluster for embarrassingly parallel computations in R. A simple example session is:


> cl <- makeCluster(2, type = "PVM")
> clusterSetupSPRNG(cl)
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.749391854 0.007316102 0.152742874

[[2]]
[1] 0.8424790 0.8896625 0.2256776
where the first command starts a PVM cluster on two computers, the second prepares the SPRNG library on the cluster, and the third generates three uniform random numbers on each computer and prints the results.
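Higher level snow functions such as parSapply build on the same mechanism, and the cluster should be shut down when the computation is finished; a minimal continuation of the session above would be:

> parSapply(cl, 1:2, function(i) mean(runif(1000)))
> stopCluster(cl)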

The statistical system "Jasp" ([27]) is implementing experimental parallel computing functions via the network functions of the Java language (see also http://jasp.ism.ac.jp/).

