13.4 Data Mining Tasks

The cycle of data and knowledge mining comprises various analysis steps, each step focusing on a different aspect or task. [13] propose the following categorization of data mining tasks.


13.4.1 Description and Summarization

At the beginning of every data analysis there is the wish, and the need, to get a quick overview of the data: to see general trends as well as extreme values. It is important to become familiar with the data, to get an idea of what the data might be able to tell you, where its limitations lie, and which further analysis steps might be suitable. Typically, getting this overview will at the same time point the analyst towards particular features, data quality problems, and additional background information that is required. Summary tables, simple univariate descriptive statistics, and simple graphics are extremely valuable tools for this task.

[33] report on a study of $ 50{,}000$ car insurance policies in which, among others, the following difficulties emerged (see Fig. 13.1).

(a)
Barcharts of the categorical variables revealed that several had too many categories. Sex had seven, of which four were so rare as to presumably be unknowns or errors of some kind. The third-largest category turned out to be very reasonable: if a car was insured by a firm, the variable sex was coded as ''firm''. This had not been explained in advance and was obviously useful for a better grasp of the data.

(b)
A histogram of date of birth showed missing values, a fairly large number (though small percentage) of underage insured persons, and a largish number born in 1900, who had perhaps been originally coded as ''0'' or ''00'' for unknown. Any analytic method using such a variable could have given misleading results.

(c)
Linking the barchart of gender from (a) and the histogram of age from (b) showed quite plausibly that many firms had date of birth coded as missing, but not all. This led to further informative discussions with the data set owners.

Figure 13.1: Linked highlighting reveals structure in the data and explains unusual results of one variable quite reasonably. Barchart of Sex of car insurance policy holders on the left, Histogram of year of birth of policy holders on the right. Highlighted are cases with $ {\textit {Sex}} = 4$ (firm). The lines under some of the bins in the histogram indicate small counts of highlighted cases that can't be displayed proportionally
\includegraphics[width=\textwidth]{text/3-13/bars.eps}

Checking data quality is by no means a negative part of the process. It leads to deeper understanding of the data and to more discussions with the data set owners. Discussions lead to more information about the data and the goals of the study.

Speed of the data processing is an important issue at this step. For simple tasks - and data summary and description are typically considered to be simple tasks, although this is generally not true - users are not willing to spend much time. A frequency table or a scatterplot must be visible within a fraction of a second, even when it comprises a million observations. Only some computer programs are able to achieve this. Another point is a fast scan through all the variables: if a program requires an explicit and lengthy specification of the graph or table to be created, a user will typically abandon this tedious endeavor after a few instances. Generic functions with context-sensitive and variable-type-dependent responses provide a viable solution to this task. On the level of standard statistical data sets this is provided by software like XploRe, S-Plus and R with their generic functions summary and plot. Generic functions of this kind can be enhanced by a flexible and interactive user environment which allows the analyst to navigate through the mass of data and to extract the variables that show interesting information at first glance and call for further investigation. Currently, no system comes close to meeting these demands; future systems hopefully will.
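For a standard rectangular data set, such a quick, type-dependent first pass can be sketched in a few lines. The fragment below uses Python with pandas instead of the generic summary and plot functions mentioned above; the file name "policies.csv" and the column handling are hypothetical placeholders, not the insurance data discussed earlier.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input file; not the data set analysed in the text.
df = pd.read_csv("policies.csv")

# Compact numerical summary of every variable at once.
print(df.describe(include="all"))

# Fast scan over all variables: barchart for categorical columns,
# histogram for numeric ones -- the kind of variable-type-dependent
# generic behavior described above.
for col in df.columns:
    if df[col].dtype == "object":
        df[col].value_counts(dropna=False).plot(kind="bar", title=col)
    else:
        df[col].plot(kind="hist", title=col)
    plt.show()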


13.4.2 Descriptive Modeling

General descriptions and summaries are an important starting point, but usually more exploration of the data is desired. While the tasks in the previous section were guided by the goal of summary and data reduction, descriptive modeling tries to find models for the data. In contrast to the subsequent section, the aim of these models is to describe the data, not to predict. As a consequence, descriptive models are used in the setting of unsupervised learning. Typical methods of descriptive modeling are density estimation, smoothing, data segmentation, and clustering. There are by now some classics in the literature on density estimation ([27]) and smoothing ([14]). Clustering is a well-studied and well-known technique in statistics; many different approaches and algorithms, distance measures and clustering schemes have been proposed. With large data sets, however, all hierarchical methods run into severe performance problems. The method of choice is most often $ k$-means clustering. Although $ k$-means is not particularly tailored to a large number of observations, it is currently the only clustering scheme that has gained a positive reputation in both the computer science and the statistics communities. The reasoning behind cluster analysis is the assumption that the data set contains natural clusters which, when discovered, can be characterized and labeled. While for some cases it might be difficult to decide to which group they belong, we assume that the resulting groups are clear-cut and carry an intrinsic meaning. In segmentation analysis, in contrast, the user typically sets the number of groups in advance and tries to partition all cases into homogeneous subgroups.
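To make the $ k$-means idea concrete, the basic Lloyd iteration can be written down in a few lines of Python/NumPy. This is a minimal sketch of the algorithm for illustration only (random initialization, no handling of empty clusters), not an implementation suited to very large data sets.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X is an (n, p) array of n observations with p variables.
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned observations
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy example: three well-separated groups in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(100, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, k=3)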


13.4.3 Predictive Modeling

Predictive modeling falls into the category of supervised learning; hence, one variable is clearly labeled as the target variable $ Y$ and will be explained as a function of the other variables $ X$. The nature of the target variable determines the type of model: a classification model if $ Y$ is a discrete variable, a regression model if it is a continuous one. Such models are typically built to predict the behavior of new cases and to extend the knowledge to objects that are new or not yet as well understood. Predicting the value of the stock market, the outcome of the next governmental election, or the health status of a person are typical examples. Banks, for instance, use classification schemes to group their customers into different categories of risk.

Classification models follow one of three different approaches: the discriminative approach, the regression approach, or the class-conditional approach. The discriminative approach aims at directly mapping the explanatory variables $ X$ to one of the $ k$ possible target categories $ y_1, \ldots, y_k$. The input space $ X$ is hence partitioned into different regions, each with a unique class label assigned. Neural networks and support vector machines are examples of this approach. The regression approach (e.g. logistic regression) calculates the posterior class distribution $ P(Y\mid x)$ for each case and chooses the class for which the maximum probability is reached. Decision trees (CART, C5.0, CHAID) can be counted towards both the discriminative and the regression approach, because typically the posterior class probabilities at each leaf are calculated as well as the predicted class. The class-conditional approach starts by specifying the class-conditional distributions $ P(X\mid y_i, \theta_i)$ explicitly. After estimating the marginal distribution $ P(Y)$, Bayes' rule is used to derive the conditional distribution $ P(Y\mid x)$. The name Bayesian classifiers is widely used for this approach, erroneously suggesting a Bayesian as opposed to a frequentist approach. Mostly, plug-in estimates $ \hat{\theta}_i$ are derived via maximum likelihood. The class-conditional approach is particularly attractive because it allows for general forms of the class-conditional distributions: parametric, semi-parametric, and non-parametric methods can all be used to estimate them. The class-conditional approach is the most complex modeling technique for classification; the regression approach requires fewer parameters to fit, but still more than a discriminative model. There is no general rule as to which approach works best; it is mainly a question of the goal of the analysis whether posterior probabilities are useful, e.g. to see how likely the ''second best'' class would be.
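As an illustration of the class-conditional approach, the following Python/NumPy sketch fits Gaussian class-conditional densities with plug-in maximum likelihood estimates and applies Bayes' rule to obtain $ P(Y\mid x)$. It deliberately assumes independent coordinates (diagonal covariances) to keep the example short; it is not a general-purpose classifier.

import numpy as np

class GaussianClassConditional:
    # Class-conditional classifier: P(X | y_i) Gaussian with diagonal
    # covariance, priors P(Y) and parameters estimated by maximum
    # likelihood, posterior P(Y | x) obtained via Bayes' rule.

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
        return self

    def posterior(self, X):
        # log P(x | y_i), summed over the (assumed independent) coordinates.
        log_lik = -0.5 * (np.log(2 * np.pi * self.vars_[None])
                          + (X[:, None, :] - self.means_[None]) ** 2
                          / self.vars_[None]).sum(axis=2)
        log_post = log_lik + np.log(self.priors_)[None]
        # Normalize (Bayes' rule) to obtain P(y_i | x) for every case.
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[self.posterior(X).argmax(axis=1)]

# Usage with made-up data: two overlapping Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
clf = GaussianClassConditional().fit(X, y)
print(clf.posterior(X[:3]))  # posterior probabilities, e.g. for the ''second best'' class
print(clf.predict(X[:3]))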


13.4.4 Discovering Patterns and Rules

The previous tasks are very much within the statistical tradition of describing a functional relationship between explanatory variables and target variables. There are situations where such a functional relationship is either not appropriate or too hard to achieve in a meaningful way. Nevertheless, there might be a pattern in the sense that certain items, values or measurements occur frequently together. Association rules are a method, originating in market basket analysis, to elicit such patterns of common behavior. Let us consider an example based on data that is available as one of the example data files for the SAS Enterprise Miner. For these data (in the following referred to as the SAS Assocs Data), the output of an association query with $ {\text{minconf}} = 50$ and $ {\text{minsup}} = 3$, limited to a maximum of $ 4$ items per rule and generated by the SAS Enterprise Miner, consists of $ 1000$ lines of the form shown in Table 13.1.


Table 13.1: Examples of association rules as found in the SAS Assocs Data by the SAS Enterprise Miner Software. $ 1000$ rules have been generated: $ 47$ including $ 2$ items, $ 394$ with $ 3$ items and $ 559$ with $ 4$ items
# items  conf (%)  supp (%)  count  Rule
2          82.62     25.17     252  artichok $ \rightarrow $ heineken
2          78.93     25.07     251  soda $ \rightarrow $ cracker
2          78.09     22.08     221  turkey $ \rightarrow $ olives
...
3          95.16      5.89      59  soda & artichok $ \rightarrow $ heineken
3          94.31     19.88     199  avocado & artichok $ \rightarrow $ heineken
3          93.23     23.38     234  soda & cracker $ \rightarrow $ heineken
...
4         100.00      3.10      31  ham & corned beef & apples $ \rightarrow $ olives
4         100.00      3.10      31  ham & corned beef & apples $ \rightarrow $ hering
4         100.00      3.80      38  steak & soda & heineken $ \rightarrow $ cracker
...

The practical use of association rules is not restricted to finding general trends and norm behavior; association rules have also been used successfully to detect unusual behavior in fraud detection.
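As a rough sketch of how the support and confidence figures in Table 13.1 arise, the following Python fragment computes both quantities for a candidate rule over a list of market baskets; the toy transactions are made up for illustration and are not the SAS Assocs Data.

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # supp(A and B) / supp(A): how often the consequent occurs
    # among the transactions that contain the antecedent.
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Made-up baskets, not the SAS example data.
transactions = [
    {"artichok", "heineken", "soda"},
    {"artichok", "heineken"},
    {"soda", "cracker"},
    {"turkey", "olives", "heineken"},
]
lhs, rhs = {"artichok"}, {"heineken"}
print(support(lhs | rhs, transactions))    # support of the rule artichok -> heineken
print(confidence(lhs, rhs, transactions))  # confidence of the rule

A rule is reported only if its support and confidence exceed the thresholds minsup and minconf, which are given as percentages in the query above.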


13.4.5 Retrieving Similar Objects

The World Wide Web contains an enormous amount of information in electronic journal articles, electronic catalogs, and private and commercial homepages. Having found an interesting article or picture, it is a common desire to find similar objects quickly. Based on key words and indexed meta-information, search engines provide us with this desired information. They work not only on text documents, but to a certain extent also on images. Semi-automated picture retrieval combines the ability of the human visual system with the search capacities of the computer to find similar images in a database.
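A minimal sketch of the keyword-based part of this task: documents are represented as term-frequency vectors and ranked by their cosine similarity to a query. The documents and the query below are made up for illustration; real search engines add indexing, term weighting and much more.

from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors (Counters).
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

documents = {
    "doc1": "data mining with association rules and clustering",
    "doc2": "clustering of insurance policy holders",
    "doc3": "image retrieval based on keywords and meta information",
}
query = "keyword based image retrieval"

vectors = {name: Counter(text.split()) for name, text in documents.items()}
q = Counter(query.split())

# Rank the documents by similarity to the query, most similar first.
for name in sorted(vectors, key=lambda n: cosine(q, vectors[n]), reverse=True):
    print(name, round(cosine(q, vectors[name]), 3))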

