The cycle of data and knowledge mining comprises various analysis steps, each step focusing on a different aspect or task. [13] propose the following categorization of data mining tasks.
At the beginning of every data analysis stands the wish, and the need, to get an overview of the data, to see general trends as well as extreme values rather quickly. It is important to familiarize oneself with the data, to get an idea of what the data might be able to tell you, where its limitations lie, and which further analysis steps might be suitable. Typically, getting the overview will at the same time point the analyst towards particular features, data quality problems, and additional required background information. Summary tables, simple univariate descriptive statistics, and simple graphics are extremely valuable tools for this task.
[33] report on a study of car insurance policies during which, amongst others, the following difficulties emerged (see Fig. 13.1).
[Fig. 13.1: difficulties encountered in the car insurance study (figure not reproduced).]
Checking data quality is by no means a negative part of the process. It leads to a deeper understanding of the data and to more discussions with the data set owners, which in turn yield more information about the data and the goals of the study.
Speed of data processing is an important issue at this step. For simple tasks - and data summary and description are typically considered simple tasks, although this is generally not true - users are not willing to spend much time. A frequency table or a scatterplot must be visible in a fraction of a second, even when it comprises a million observations. Only some computer programs are able to achieve this. Another point is a fast scan through all the variables: if a program requires an explicit and lengthy specification of the graph or table to be created, a user will typically abandon this tedious endeavor after a few instances. Generic functions with context-sensitive, variable-type-dependent responses provide a viable solution to this task. For standard statistical data sets this is provided by software like XploRe, S-Plus and R with their generic functions summary and plot. Generic functions of this kind can be enhanced by a flexible and interactive user environment which allows the analyst to navigate through the mass of data and to extract the variables that show interesting information at first glance and call for further investigation. Currently, no system comes close to meeting these demands; hopefully future systems will.
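The idea of a generic, type-dependent summary in the spirit of R's summary can be sketched in a few lines. The tiny data set and variable names below are made up for illustration; numeric variables get descriptive statistics, categorical variables a frequency table.

```python
# A sketch of a generic summary function that dispatches on variable type,
# as summary() in R does. Data and variable names are illustrative only.
from collections import Counter
from statistics import mean

def summary(column):
    """Numeric columns: descriptive statistics; others: frequency table."""
    if all(isinstance(v, (int, float)) for v in column):
        return {"min": min(column), "mean": mean(column), "max": max(column)}
    return dict(Counter(column))  # counts per category

price = [12000, 15000, 9000, 22000]
fuel = ["petrol", "diesel", "petrol", "petrol"]
print(summary(price))  # {'min': 9000, 'mean': 14500, 'max': 22000}
print(summary(fuel))   # {'petrol': 3, 'diesel': 1}
```

A real system would add a matching generic plot (histogram for numeric, barchart for categorical) so that all variables can be scanned quickly.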
General descriptions and summaries are an important starting point but
more exploration of the data is usually desired. While the tasks in
the previous section have been guided by the goal of summary and data
reduction, descriptive modeling tries to find models for the data. In
contrast to the subsequent section, the aim of these models is to
describe the data, not to predict. As a consequence, descriptive models
are used in the setting of unsupervised learning. Typical methods of
descriptive modeling are density estimation, smoothing, data
segmentation, and clustering. There are by now some classics in the
literature on density estimation ([27]) and smoothing
([14]). Clustering is a well-studied and well-known
technique in statistics. Many different approaches and algorithms,
distance measures and clustering schemes
have been proposed. With large data sets all hierarchical methods have
extreme difficulties with performance. The most widely used method is
k-means clustering. Although k-means is not particularly
tailored for a large number of observations, it is currently the only
clustering scheme that has gained positive reputation in both the
computer science and the statistics community. The reasoning behind
cluster analysis is the assumption that the data set contains natural
clusters which, when discovered, can be characterized and
labeled. While for some cases it might be difficult to decide to which
group they belong, we assume that the resulting groups are clear-cut
and carry an intrinsic meaning. In segmentation analysis, in contrast,
the user typically sets the number of groups in advance and tries to
partition all cases into homogeneous subgroups.
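The k-means scheme mentioned above alternates between assigning each point to its nearest center and moving each center to the mean of its members. The two-dimensional toy data and the choice k=2 below are assumptions made for illustration.

```python
# A compact sketch of k-means clustering (Lloyd's algorithm).
# Toy data and k=2 are illustrative assumptions.
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct observations.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        # Update step: each center moves to the mean of its members.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)  # the two point pairs end up in separate clusters
```

Production implementations add multiple random restarts and a convergence check instead of a fixed iteration count.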
Predictive modeling
falls into the category of supervised learning; hence, one variable is
clearly labeled as the target variable Y and will be explained as
a function of the other variables X_1, ..., X_p. The nature of the target
variable determines the type of model: a classification model if Y is
a discrete variable, or a regression model if it is a continuous
one. Many models are typically built to predict the behavior of new
cases and to extend the knowledge to objects that are new or not yet
as widely understood. Predicting the value of the stock market, the
outcome of the next governmental election, or the health status of
a person are typical examples. Banks use classification schemes to group
their customers into different categories of risk.
Classification models follow one of three different approaches: the
discriminative approach, the regression approach, or the
class-conditional approach. The discriminative approach aims at
directly mapping the explanatory variables to one of the
possible target categories. The input space is
hence partitioned into different regions, each of which has a unique class
label assigned. Neural networks and support vector machines are
examples of this. The regression approach (e.g. logistic regression)
calculates the posterior class distribution p(y | x) for each case
and chooses the class for which the maximum probability is
reached. Decision trees
(CART, C5.0, CHAID) fall under both the discriminative approach and
the regression approach, because typically the posterior class
probabilities at each leaf are calculated as well as the predicted
class. The class-conditional approach starts with specifying the
class-conditional distributions p(x | y) explicitly. After estimating
the marginal distribution p(y), Bayes'
rule is used to derive the conditional distribution p(y | x). The
name Bayesian classifiers
is widely used for this approach, erroneously pointing to a Bayesian
approach versus a frequentist approach. Mostly, plug-in estimates
are derived via maximum likelihood. The
class-conditional approach is particularly attractive because it
allows for general forms of the class-conditional
distributions. Parametric, semi-parametric, and non-parametric methods
can be used to estimate the class-conditional distribution. The
class-conditional approach is the most complex modeling technique for
classification. The regression approach requires fewer parameters to
fit, but still more than a discriminative model. There is no general
rule as to which approach works best; it mainly depends on the goal of
the researcher, e.g. whether posterior probabilities are useful to see
how likely the ''second best'' class would be.
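The class-conditional approach can be sketched concretely. The choice of a one-dimensional Gaussian for p(x | y) and the tiny data set are assumptions made for illustration: the class-conditional densities are specified, the marginal p(y) is estimated from class frequencies, and Bayes' rule yields the posterior.

```python
# Class-conditional classification sketch: specify p(x|y) (here Gaussian,
# an assumption), estimate p(y) from class frequencies, apply Bayes' rule.
# The data are made up for illustration.
from math import exp, sqrt

def fit(xs, ys):
    """Per class: plug-in maximum-likelihood-style mean, sd, and prior p(y=c)."""
    params = {}
    for c in set(ys):
        xc = [x for x, y in zip(xs, ys) if y == c]
        m = sum(xc) / len(xc)
        s = sqrt(sum((x - m) ** 2 for x in xc) / (len(xc) - 1))
        params[c] = (m, s, len(xc) / len(xs))
    return params

def posterior(params, x):
    # Bayes' rule: p(y=c|x) is proportional to p(x|y=c) * p(y=c).
    num = {c: exp(-0.5 * ((x - m) / s) ** 2) / s * p
           for c, (m, s, p) in params.items()}
    total = sum(num.values())
    return {c: v / total for c, v in num.items()}

xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
ys = [0, 0, 0, 1, 1, 1]
post = posterior(fit(xs, ys), 1.1)  # posterior heavily favors class 0
```

The posterior returned here is exactly the quantity that distinguishes this approach from a purely discriminative one: the ''second best'' class and its probability remain visible.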
The previous tasks have been much within the statistical
tradition of describing functional relationships between explanatory
variables and target variables. There are situations where such
a functional relationship is either not appropriate or too hard to
achieve in a meaningful way. Nevertheless, there might be a pattern in
the sense that certain items, values or measurements occur frequently
together. Association Rules
are a method originating from market basket analysis to elicit
patterns of common behavior. Let us consider an example originating
from data that is available as one of the example data files for the
SAS Enterprise Miner. For this data (in the following referred to as
the SAS Assocs Data), the output of an association query with given
minimum support and minimum confidence thresholds, limited by
a maximum number of items per rule, generated by the SAS Enterprise Miner
consists of lines of the form shown in
Table 13.1.
# items | conf | supp | count | Rule
... | 82.62 | 25.17 | ... | artichok ...
... | 78.93 | 25.07 | ... | soda ...
... | 78.09 | 22.08 | ... | turkey ...
... | 95.16 | 5.89 | ... | soda & artichok ...
... | 94.31 | 19.88 | ... | avocado & artichok ...
... | 93.23 | 23.38 | ... | soda & cracker ...
... | 100.00 | 3.1 | ... | ham & corned beef & apples ...
... | 100.00 | 3.1 | ... | ham & corned beef & apples ...
... | 100.00 | 3.8 | ... | steak & soda & heineken ...

Table 13.1. Excerpt of the association rule output for the SAS Assocs Data; the item counts, frequency counts, and parts of each rule were not recoverable and are marked "...".
The practical use of association rules is not restricted to finding general trends and norm behavior; association rules have also been used successfully to detect unusual behavior in fraud detection.
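The two measures in Table 13.1 are easy to compute from raw transaction data: the support of a rule A ==> B is the fraction of transactions containing all items of A and B, and its confidence is support(A and B) / support(A). The tiny basket data below are made up for illustration.

```python
# Computing support and confidence for an association rule A ==> B
# from transaction data; the baskets are illustrative only.
baskets = [
    {"soda", "artichok", "avocado"},
    {"soda", "cracker"},
    {"soda", "artichok"},
    {"turkey", "cracker"},
]

def support(itemset, baskets):
    # Fraction of transactions that contain the whole itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    # Conditional frequency of the consequent given the antecedent.
    return support(lhs | rhs, baskets) / support(lhs, baskets)

sup = support({"soda", "artichok"}, baskets)        # 2 of 4 baskets -> 0.5
conf = confidence({"soda"}, {"artichok"}, baskets)  # 0.5 / 0.75 = 2/3
```

Algorithms such as Apriori organize exactly this computation efficiently by pruning itemsets that fall below the minimum support threshold.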
The World Wide Web contains an enormous amount of information in electronic journal articles, electronic catalogs, and private and commercial homepages. Having found an interesting article or picture, it is a common desire to find similar objects quickly. Based on keywords and indexed meta-information, search engines provide us with this desired information. They can work not only on text documents but, to a certain extent, also on images. Semi-automated picture retrieval combines the ability of the human vision system with the search capacities of the computer to find similar images in a database.
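The search side of such a retrieval system can be sketched as nearest-neighbor ranking over precomputed feature vectors. The three-component vectors below are made-up stand-ins for real image features (e.g. color histograms), and cosine similarity is one common, but not the only, choice of similarity measure.

```python
# Similarity search over feature vectors, as used in content-based image
# retrieval; vectors and names are illustrative stand-ins for image features.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def most_similar(query, library):
    # Rank all stored feature vectors against the query; return the best match.
    return max(library, key=lambda name: cosine(query, library[name]))

library = {
    "sunset":   [0.9, 0.4, 0.1],
    "forest":   [0.1, 0.8, 0.2],
    "seascape": [0.2, 0.3, 0.9],
}
best = most_similar([0.8, 0.5, 0.2], library)  # "sunset"
```

In a semi-automated system, the user would then inspect the top-ranked candidates and refine the query, combining human judgment with the machine's search capacity.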