# 1.3 Why This Handbook

The purpose of this handbook is to provide a survey of the basic concepts of computational statistics; that is, Concepts and Fundamentals. A glance at the table of contents reveals a wide range of articles written by experts in various subfields of computational statistics. The articles are generally expository, taking the reader from the basic concepts to the current research trends. The emphasis throughout, however, is on the concepts and fundamentals. Most chapters have extensive and up-to-date references to the relevant literature (with, in many cases, perhaps a preponderance of self-references!).

We have organized the topics into Part II on "statistical computing", that is, the computational methodology, and Part III on "statistical methodology", that is, the techniques of applied statistics that are computer-intensive, or otherwise make use of the computer as a tool of discovery, rather than as just a large and fast calculator. The final part of the handbook, Part IV, surveys a number of application areas in which computational statistics plays a major role.

## 1.3.1 Summary and Overview; Part II: Statistical Computing

The thirteen chapters of Part II, Statistical Computing, cover areas of numerical analysis and computer science or informatics that are relevant for statistics. These areas include computer arithmetic, algorithms, database methodology, languages and other aspects of the user interface, and computer graphics.

In the first chapter of this part, Monahan describes how numbers are stored on the computer, how the computer does arithmetic, and more importantly what the implications are for statistical (or other) computations. In this relatively short chapter, he then discusses some of the basic principles of numerical algorithms, such as divide and conquer. Although many statisticians do not need to know the details, it is important that all statisticians understand the implications of computations within a system of numbers and operators that is not the same system that we are accustomed to in mathematics. Anyone developing computer algorithms, no matter how trivial the algorithm may appear, must understand the details of the computer system of numbers and operators.
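To make the point about computer arithmetic concrete, the following is a minimal sketch in Python, assuming standard IEEE 754 double precision. The function names and the data are illustrative choices, not taken from Monahan's chapter. It shows that floating-point addition is not associative, and that an algebraically correct one-pass variance formula can lose all accuracy to cancellation while a two-pass formula does not.

```python
def naive_variance(xs):
    """One-pass 'textbook' formula: (sum of squares - n * mean^2) / (n - 1).

    Algebraically correct, but it subtracts two huge, nearly equal numbers.
    """
    n = len(xs)
    total = 0.0
    total_sq = 0.0
    for x in xs:
        total += x
        total_sq += x * x
    mean = total / n
    return (total_sq - n * mean * mean) / (n - 1)


def two_pass_variance(xs):
    """Two-pass formula: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Addition of floating-point numbers is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print("associative?", left == right)

# Hypothetical data: small deviations (4, 7, 13, 16) around a large offset.
# The exact sample variance is 30.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]
print("two-pass variance:", two_pass_variance(data))
print("naive variance:   ", naive_variance(data))
```

The two formulas are identical in exact arithmetic; they differ only because the computer's numbers and operators do not form the real number system.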

One of the important uses of computers in statistics, and one that is central to computational statistics, is the simulation of random processes. This is a theme we will see in several chapters of this handbook. In Part II, the basic numerical methods relevant to simulation are discussed. First, L'Ecuyer describes the basics of random number generation, including assessing the quality of random number generators, and simulation of random samples from various distributions. Next Chib describes one special use of computer-generated random numbers in a class of methods called Markov chain Monte Carlo. These two chapters describe the basic numerical methods used in computational inference. Statistical methods using simulated samples are discussed further in Part III.
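As a small illustration of simulating random samples from a given distribution, here is a hedged sketch of the inverse-transform method for the exponential distribution. It relies on Python's built-in uniform generator (`random.Random`); the function name `sample_exponential`, the seed, and the parameter values are assumptions made for the example and are not drawn from L'Ecuyer's chapter.

```python
import math
import random


def sample_exponential(rate, n, rng):
    """Draw n exponential(rate) variates by inverting the CDF.

    If U is uniform on [0, 1), then -log(1 - U) / rate has the
    exponential distribution with the given rate.
    """
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]


# A seeded generator makes the simulation reproducible.
rng = random.Random(12345)
draws = sample_exponential(rate=2.0, n=20000, rng=rng)

# The sample mean should be close to the theoretical mean 1 / rate = 0.5.
print("sample mean:", sum(draws) / len(draws))
```

Using `1 - U` rather than `U` inside the logarithm avoids taking the logarithm of zero, since `rng.random()` can return 0.0 but never 1.0.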

The next four chapters of Part II address specific numerical methods. The first of these, on methods for linear algebraic computations, is by Cízková and Cízek. These basic methods are used in almost all statistical computations. Optimization is another basic method used in many statistical applications. Chapter II.5 on the EM algorithm and its variations by Ng, Krishnan, and McLachlan, and Chap. II.6 on stochastic optimization by Spall address two specific areas of optimization. Finally, in Chap. II.7, Vidakovic discusses transforms that effectively restructure a problem by changing the domain. These transforms are statistical functionals, the best known of which are Fourier transforms and wavelet transforms.

The next two chapters focus on efficient usage of computing resources. For numerically-intensive applications, parallel computing is both the most efficient and the most powerful approach. In Chap. II.8 Nakano describes for us the general principles, and then some specific techniques for parallel computing. Understanding statistical databases is important not only because of the enhanced efficiency that appropriate data structures allow in statistical computing, but also because of the various types of databases the statistician may encounter in data analysis. In Chap. II.9 on statistical databases, Boyens, Günther, and Lenz give us an overview of the basic design issues and a description of some specific database management systems.

The next two chapters are on statistical graphics. The first of these chapters, by Symanzik, spans our somewhat artificial boundary of Part II (statistical computing) and Part III (statistical methodology, the real heart and soul of computational statistics). This chapter covers some of the computational details, but also addresses the usage of interactive and dynamic graphics in data analysis. Wilkinson, in Chap. II.11, describes a paradigm, the grammar of graphics, for developing and using systems for statistical graphics.

In order for statistical software to be usable and useful, it must have a good user interface. In Chap. II.12 on statistical user interfaces, Klinke discusses some of the general design principles of a good user interface and describes some interfaces that are implemented in current statistical software packages. In the development and use of statistical software, an object oriented approach provides a consistency of design and allows for easier software maintenance and the integration of software developed by different people at different times. Virius discusses this approach in the final chapter of Part II, on object oriented computing.

## 1.3.2 Summary and Overview; Part III: Statistical Methodology

Part III covers several aspects of computational statistics. In this part the emphasis is on the statistical methodology that is enabled by computing. Computers are useful in all aspects of statistical data analysis, of course, but in Part III, and generally in computational statistics, we focus on statistical methods that are computationally intensive. Although a theoretical justification of these methods often depends on asymptotic theory, in particular, on the asymptotics of the empirical cumulative distribution function, asymptotic inference is generally replaced by computational inference.

The first three chapters of this part deal directly with techniques of computational inference; that is, the use of cross validation, resampling, and simulation of data-generating processes to make decisions and to assign a level of confidence to the decisions. Wang opens Part III with a discussion of model choice. Selection of a model implies consideration of more than one model. As we suggested above, this is one of the hallmarks of computational statistics: looking at data through a variety of models. Wang begins with the familiar problem of variable selection in regression models, and then moves to more general problems in model selection. Cross validation and generalizations of that method are important techniques for addressing the problems. Next, in Chap. III.2 Mammen and Nandi discuss a class of resampling techniques that have wide applicability in statistics, from estimating variances and setting confidence regions to larger problems in statistical data analysis. Computational inference depends on simulation of data-generating processes. Any such simulation is an experiment. In the third chapter of Part III, Kleijnen discusses principles for design and analysis of experiments using computer models.
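The idea of replacing asymptotic inference by computational inference can be sketched with one resampling scheme. The example below assumes the nonparametric bootstrap (the text above does not name a specific method), and the function name `bootstrap_se`, the simulated data, and the number of replicates are illustrative choices. It estimates the standard error of a statistic by recomputing it on samples drawn with replacement from the data.

```python
import math
import random
import statistics


def bootstrap_se(data, statistic, n_boot, rng):
    """Estimate the standard error of `statistic` by resampling.

    Each replicate draws len(data) observations with replacement
    from the original sample and recomputes the statistic.
    """
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)


rng = random.Random(2024)
# Hypothetical data: 50 draws from a normal distribution.
data = [rng.gauss(10.0, 2.0) for _ in range(50)]

# For the mean, a closed-form standard error is available for comparison.
se_boot = bootstrap_se(data, statistics.fmean, 2000, rng)
se_formula = statistics.stdev(data) / math.sqrt(len(data))
print("bootstrap SE of mean:", se_boot)
print("formula SE of mean:  ", se_formula)

# For the median, no simple formula exists, but the same code applies.
print("bootstrap SE of median:", bootstrap_se(data, statistics.median, 2000, rng))
```

The same function works for any statistic that can be computed from a sample, which is what makes this kind of resampling widely applicable.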

In Chap. III.4, Scott considers the general problem of estimation of a multivariate probability density function. This area is fundamental in statistics, and it utilizes several of the standard techniques of computational statistics, such as cross validation and visualization methods.

The next four chapters of Part III address important issues for discovery and analysis of relationships among variables. First, Loader discusses local smoothing using a variety of methods, including kernels, splines, and orthogonal series. Smoothing is fitting of asymmetric models, that is, models for the effects of a given set of variables ("independent variables") on another variable or set of variables. The methods of Chap. III.5 are generally nonparametric, and will be discussed from a different standpoint in Chap. III.10. Next, in Chap. III.6 Mizuta describes ways of using the relationships among variables to reduce the effective dimensionality of a problem. The next two chapters return to the use of asymmetric models: Müller discusses generalized linear models, and Cízek describes computational and inferential methods for dealing with nonlinear regression models.

In Chap. III.9, Gather and Davies discuss various issues of robustness in statistics. Robust methods are important in such applications as those in financial modeling, discussed in Chap. IV.2. One approach to robustness is to reduce the dependence on parametric assumptions. Horowitz, in Chap. III.10, describes semiparametric models that make fewer assumptions about the form.

One area in which computational inference has come to play a major role is in Bayesian analysis. Computational methods have enabled a Bayesian approach in practical applications, because no longer is this approach limited to simple problems or conjugate priors. Robert, in Chap. III.11, describes ways that computational methods are used in Bayesian analyses.

Survival analysis, with applications in both medicine and product reliability, has become more important in recent years. Kamakura, in Chap. III.12, describes various models used in survival analysis and the computational methods for analyzing such models.

The final four chapters of Part III address an exciting area of computational statistics. The general area may be called "data mining", although this term has a rather anachronistic flavor because of the hype of the mid-1990s. Other terms such as "knowledge mining" or "knowledge discovery in databases" ("KDD") are also used. To emphasize the roots in artificial intelligence, which is a somewhat discredited area, the term "computational intelligence" is also used. This is an area in which machine learning from computer science and statistical learning have merged. In Chap. III.13 Wilhelm provides an introduction and overview of data and knowledge mining, as well as a discussion of some of the vagaries of the terminology as researchers have attempted to carve out a field and to give it scientific legitimacy. Subsequent chapters describe specific methods for statistical learning: Zhang discusses recursive partitioning and tree-based methods; Mika, Schäfer, Laskov, Tax, and Müller discuss support vector machines; and Bühlmann describes various ensemble methods.

## 1.3.3 Summary and Overview; Part IV: Selected Applications

Finally, in Part IV, there are five chapters on various applications of computational statistics. The first, by Weron, discusses stochastic modeling of financial data using heavy-tailed distributions. Next, in Chap. IV.2 Bauwens and Rombouts describe some problems in economic data analysis and computational statistical methods to address them. Some of the problems, such as nonconstant variance, discussed in this chapter on econometrics are also important in finance.

Human biology has become one of the most important areas of application, and many computationally-intensive statistical methods have been developed, refined, and brought to bear on problems in this area. First, Vaisman describes approaches to understanding the geometrical structure of protein molecules. While much is known about the order of the components of the molecules, the three-dimensional structure for most important protein molecules is not known, and the tools for discovery of this structure need extensive development. Next, Eddy and McNamee describe some statistical techniques for analysis of MRI data. The important questions involve the functions of the various areas in the brain. Understanding these will allow more effective treatment of diseased or injured areas and the resumption of more normal activities by patients with neurological disorders.

Finally, Marchette discusses statistical methods for computer network intrusion detection. Because of the importance of computer networks around the world, and because of their vulnerability to unauthorized or malicious intrusion, detection has become one of the most important - and interesting - areas for data mining.

The articles in this handbook cover the important subareas of computational statistics and give some flavor of the wide range of applications. While the articles emphasize the basic concepts and fundamentals of computational statistics, they provide the reader with tools and suggestions for current research topics. The reader may turn to a specific chapter for background reading and references on a particular topic of interest, but we also suggest that the reader browse and ultimately peruse articles on unfamiliar topics. Many surprising and interesting tidbits will be discovered!

## 1.3.4 The Ehandbook

A unique feature of this handbook is the supplemental ebook format. Our ebook design offers an HTML file with links to worldwide computing servers. This HTML version can be downloaded onto a local computer via a licence card included in this handbook.

## 1.3.5 Future Handbooks in Computational Statistics

This handbook on concepts and fundamentals sets the stage for future handbooks that go more deeply into the various subfields of computational statistics. These handbooks will each be organized around either a specific class of theory and methods, or else around a specific area of application.

The development of the field of computational statistics has been rather fragmented. We hope that the articles in this handbook series can provide a more unified framework for the field.
