Given a data matrix and basic information about the values in it, the next analysis step is to compute some characteristic numbers from the data.
We will start our first steps in statistical analysis by investigating the data matrix salpro, which contains the variables sales and profits from the U.S. companies database. If you want to check the contents of this matrix, just issue salpro at the command line. Recall that this is a small data set of only 79 rows, containing realizations of two metric variables. For all observations, we have information about the branch in which the company is working. This information is contained in the text vector branch. To give you a first impression of the data, Figure 2.1 shows a scatter plot of sales against profits.
Later in this section, we will continue by studying the data matrix earn, which contains data of a different structure. Some of its variables are continuous, but most are discrete (i.e. can take on only a few values) or are in fact nonmetric.
In the following subsections, we will compute different characteristics of the data matrices salpro and earn. All XploRe code can be found in the quantlets XLGdesc04.xpl and XLGdesc05.xpl, respectively.
Basic information about a data set includes the minimum and maximum values of its variables. For the variables sales and profits of the U.S. companies data set, we obtain these values from

  min(salpro)
  max(salpro)

which computes

  Contents of min
  [1,]   176  -771.5
  Contents of max
  [1,] 50056  6555

In XploRe, the minima and maxima can be computed for all rows or columns of a matrix in one step. By default, the columnwise minima and maxima are computed. For the above output, this means that 176 is the minimum of sales and -771.5 the minimum of profits. The maximum of sales is 50056 and the maximal profit is 6555.
For rowwise computation of minima and maxima, a second parameter needs to be given to min and max. For example,

  min(salpro,2)

would compute the rowwise minima of salpro. Let us remark that rowwise computations of course make no sense in the case of salpro, since the variables (sales and profits) are stored columnwise in this data matrix.
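For readers without access to XploRe, the columnwise/rowwise distinction can be sketched in plain Python. The matrix below is a made-up stand-in for salpro, not the actual data:

```python
# Hypothetical two-column data: each row is one observation,
# columns play the roles of sales and profits.
salpro = [
    [176.0, -771.5],
    [50056.0, 6555.0],
    [1754.0, 70.5],
]

# Columnwise extremes (the XploRe default): one value per variable
col_min = [min(col) for col in zip(*salpro)]
col_max = [max(col) for col in zip(*salpro)]

# Rowwise extremes (like XploRe's min(x,2)): one value per observation
row_min = [min(row) for row in salpro]

print(col_min)  # [176.0, -771.5]
print(col_max)  # [50056.0, 6555.0]
```

As in the text, the rowwise minima mix the two variables and are therefore not meaningful for this kind of data matrix.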
To describe numeric data, it is useful to study the empirical values for the moments of the underlying distribution. Suppose we have a data vector x which contains n observations x_1, ..., x_n. The mean

  \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

and the variance

  s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

are computed by the functions mean and var, respectively.

  mean(salpro)
  var(salpro)

give the means and variances for the variables sales and profits of the U.S. companies data set:

  Contents of mean
  [1,] 4178.3  209.84
  Contents of var
  [1,] 4.9163e+07  6.3517e+05

The standard deviations of sales and profits can be obtained by taking the square root of the variances: sqrt(var(salpro)).
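The mean, variance, and standard deviation formulas above can be sketched in plain Python (using the n - 1 denominator for the sample variance; the data vector here is illustrative, not the salpro data):

```python
import math

def mean(x):
    """Arithmetic mean: sum of observations divided by their number."""
    return sum(x) / len(x)

def var(x):
    """Sample variance with denominator n - 1."""
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(data))            # 5.0
print(var(data))
print(math.sqrt(var(data)))  # standard deviation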
The skewness measures the asymmetry of the distribution, while the kurtosis measures its departure from the normal distribution:

  skewness(salpro)
  kurtosis(salpro)

resulting in

  Contents of s
  [1,] 4.2659  6.5908
  Contents of k
  [1,] 25.426  51.577

These values are all far away from 0 (the skewness of a symmetric distribution) and 3 (the kurtosis of the normal distribution). Hence, both variables are clearly nonnormal.
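These standardized third and fourth moments can be sketched in plain Python. The sketch uses the simple 1/n-normalized definitions; XploRe's exact normalization may differ slightly:

```python
def skewness(x):
    """Third standardized moment (1/n normalization); 0 for symmetric data."""
    n = len(x)
    m = sum(x) / n
    s = (sum((xi - m) ** 2 for xi in x) / n) ** 0.5
    return sum((xi - m) ** 3 for xi in x) / (n * s ** 3)

def kurtosis(x):
    """Fourth standardized moment; approximately 3 for normal data."""
    n = len(x)
    m = sum(x) / n
    s = (sum((xi - m) ** 2 for xi in x) / n) ** 0.5
    return sum((xi - m) ** 4 for xi in x) / (n * s ** 4)

print(skewness([1.0, 2.0, 3.0]))  # 0.0 for this symmetric sample
print(kurtosis([1.0, 2.0, 3.0]))
```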
An alternative way to characterize numeric data are the median and the quantiles of the data. The empirical median of a data set is given by the value that separates the data into the 50% smallest and the 50% largest values. Formally, if x_{(1)} \le \ldots \le x_{(n)} denote the ordered observations, the empirical median is calculated as

  x_{med} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even.} \end{cases}
If the distribution of the data were symmetric, arithmetic mean and median should roughly coincide. Let us check this for sales and profits of the U.S. companies:
  mean(salpro)
  median(salpro)

result in the following output:

  Contents of mean
  [1,] 4178.3  209.84
  Contents of med
  [1,]   1754    70.5

We observe a mean of 4178.3 and a median of 1754 for the sales, as well as a mean of 209.84 and a median of 70.5 for the profits. Obviously, the mean clearly exceeds the median for both variables, which indicates that both distributions are skewed to the right.
While the median reflects the central value which partitions the data into two 50% portions, we might also be interested in partitions into fractions of other sizes. The corresponding values from the data are then empirical values for the quantiles of the distribution.
Theoretically, the \alpha-quantile of a random variable X is the value x_\alpha such that

  P(X \le x_\alpha) \ge \alpha \quad \text{and} \quad P(X \ge x_\alpha) \ge 1 - \alpha.

The empirical quantiles of a data set can be found by sorting the data and then finding the sample value that partitions the data into portions of size \alpha and 1 - \alpha.
Together with the median (the 50% quantile of the data), the quartiles (the 25% and 75% quantiles) are often computed for a data set.
  quantile(salpro,0.25)
  median(salpro)
  quantile(salpro,0.75)

computes both quartiles and the median for our running example and gives

  Contents of q
  [1,]  749  37.8
  Contents of med
  [1,] 1754  70.5
  Contents of q
  [1,] 4781 195.3

As you can easily calculate, the absolute difference between the median and the lower quartile (25% quantile) is less than the absolute difference between the median and the upper quartile (75% quantile):

  abs(median(salpro)-quantile(salpro,0.25))
  abs(median(salpro)-quantile(salpro,0.75))

yields

  Contents of abs
  [1,] 1005  32.7
  Contents of abs
  [1,] 3027 124.8

This indicates that the densities of both variables are flatter at the right tail and steeper at the left tail. Recall also the scatter plot of both variables in Figure 2.1.
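An empirical quantile function takes only a few lines of Python. The sketch below uses linear interpolation between order statistics, which is one common convention and not necessarily the one XploRe implements internally:

```python
def quantile(x, alpha):
    """Empirical alpha-quantile by linear interpolation of the sorted sample."""
    xs = sorted(x)
    n = len(xs)
    k = alpha * (n - 1)       # fractional position in the ordered data
    lo = int(k)
    hi = min(lo + 1, n - 1)
    frac = k - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

def median(x):
    """The median is simply the 50% quantile."""
    return quantile(x, 0.5)

print(median([1.0, 3.0, 2.0]))               # 2.0
print(quantile([1.0, 2.0, 3.0, 4.0], 0.25))  # 1.75
```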
Recall the scatter plot of the data from the matrix salpro which was given in Figure 2.1. The figure shows that both variables sales and profits are highly dependent. The dependence of metric continuous variables (to be exact, the linear dependence) can be measured by the covariance or correlation between the variables.
Suppose we have realizations x_i and y_i (i = 1, \ldots, n) for two variables X and Y. The empirical covariance between X and Y is then defined as

  s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).
To compute the covariance matrix of a data matrix, XploRe provides the function cov. For example, from

  cov(salpro)

we obtain the covariance matrix

  Contents of s
  [1,] 4.9163e+07 4.5475e+06
  [2,] 4.5475e+06 6.3517e+05

On the diagonal of the covariance matrix, we find the variances of sales and profits; the off-diagonal elements contain the covariance between the two variables. A single element of the matrix, here the covariance, can be extracted by indexing:

  covmat=cov(salpro)
  covmat[1,2]
As mentioned above, the covariance measures the (linear) dependence between two variables. If many or all of the terms (x_i - \bar{x}) have the same sign as the corresponding (y_i - \bar{y}), we observe positive dependence (Y increases if X increases and vice versa). If the signs typically differ, we observe negative dependence (Y increases if X decreases and vice versa). However, it is difficult to assess the magnitude of dependence from the covariance alone.
For measuring the strength of dependence, the covariance is considered relative to the variances of X and Y. The resulting coefficient is the correlation coefficient, which is defined as

  r_{XY} = \frac{s_{XY}}{s_X \, s_Y},

where s_X and s_Y denote the empirical standard deviations of X and Y.
Similar to the covariance matrix, the correlation coefficients for a data matrix can be stored in a correlation matrix. This matrix has the value 1 on the diagonal and the correlation coefficients as its off-diagonal elements. Of course, correlation matrices are symmetric matrices, too. XploRe provides the function corr to compute the correlation matrix:
  corr(salpro)

results in

  Contents of r
  [1,]       1 0.81378
  [2,] 0.81378       1

which means that there is a correlation of 0.81378 between sales and profits. As we expected, both variables are highly correlated.
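The covariance and correlation formulas can be sketched in plain Python as follows (the two data vectors are made-up stand-ins, chosen to be exactly proportional so that the correlation is 1):

```python
def mean(x):
    return sum(x) / len(x)

def cov(x, y):
    """Empirical covariance with denominator n - 1."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def corr(x, y):
    """Correlation: covariance relative to both standard deviations."""
    return cov(x, y) / (cov(x, x) * cov(y, y)) ** 0.5

sales = [1.0, 2.0, 4.0]
profits = [2.0, 4.0, 8.0]    # exactly twice sales
print(corr(sales, profits))  # approximately 1, up to floating-point error
```

Note that cov(x, x) is just the variance of x, which is why the same function serves for the denominators.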
Up to now, we have dealt mainly with continuous (metric) variables. In the applied sciences, however, dichotomous or categorical variables often play an important role. The first question in the descriptive analysis of such variables is which categories occur and with which frequencies. This information is available from the function discrete. Let us consider the third column of the matrix earn, which is 1 for females and 0 for males:

  {cat,freq}=discrete(earn[,3])
  cat
  freq

  Contents of cat
  [1,] 0
  [2,] 1
  Contents of freq
  [1,] 289
  [2,] 245

As can be seen, the data contain 245 observations of females and 289 observations of males.
The function discrete can be used for text matrices as well. Let us consider the observations in the branch vector from the U.S. companies:

  {cat,freq}=discrete(branch)
  cat
  freq

  Contents of cat
  [1,] Communication
  [2,] Energy
  [3,] Finance
  [4,] HiTech
  [5,] Manufacturing
  [6,] Medical
  [7,] Other
  [8,] Retail
  [9,] Transportation
  Contents of freq
  [1,]  2
  [2,] 15
  [3,] 17
  [4,]  8
  [5,] 10
  [6,]  4
  [7,]  7
  [8,] 10
  [9,]  6
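The same category/frequency counting can be done in Python with collections.Counter. The short branch vector below is illustrative, not the full 79-observation data:

```python
from collections import Counter

# A small, made-up sample of branch labels
branch = ["Energy", "Finance", "Energy", "HiTech", "Finance", "Energy"]

freq = Counter(branch)            # maps each category to its frequency

for cat in sorted(freq):          # categories in alphabetical order
    print(cat, freq[cat])
```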
We will come back to categorical variables in Subsection 2.3.2, which presents a more convenient way to summarize categories and frequencies in a frequency table. We will also see how to tabulate two variables and study the dependence between them.
Numeric data sometimes contain missing values. Also, operations on the data may lead to missing or infinite values. For a subsequent statistical analysis, it is important to identify these values, to delete them or replace them by other values. XploRe encodes missing values by NaN, positive infinite values by Inf and negative infinite values by -Inf.
By means of countNaN and countNotNumber, the existence of values that are NaN or not numbers (NaN, Inf, -Inf) can be investigated. Let us check our data matrix earn:

  countNaN(earn)
  countNotNumber(earn)

Both give the result 0, which tells us that the matrix earn contains neither missing values nor infinite values.
Consider now a small matrix that contains such values. The XploRe code for the following examples can be found in the quantlet XLGdesc07.xpl.

  matnan=#(NaN, 1, 2) ~ #(Inf, -Inf, 3) ~ #(4, 5, 6)
  matnan

generates the following matrix

  Contents of matnan
  [1,] NaN  Inf 4
  [2,]   1 -Inf 5
  [3,]   2    3 6

that has one missing and two infinite values. The lines
  countNaN(matnan)
  countNotNumber(matnan)

consequently result in

  Contents of n
  [1,] 1
  Contents of n
  [1,] 3

To identify the location of the problematic values, we can use the functions isNaN, isInf, and isNumber. For example,

  isNaN(matnan)

shows

  Contents of y
  [1,] 1 0 0
  [2,] 0 0 0
  [3,] 0 0 0

Note that isNaN(x) is equivalent to x==NaN.
We can now use the results from isNaN, isInf, and isNumber to delete rows of our data matrix that contain these problematic values. For example, to extract all rows without missing or infinite numbers, type

  inum=prod(isNumber(matnan),2)
  inum
  matnew=paf(matnan,inum)
  matnew

The first line computes the rowwise product of isNumber(matnan) by means of the function prod, such that inum equals 1 exactly for those rows that contain only numbers. The function paf then extracts these rows:

  Contents of inum
  [1,] 0
  [2,] 0
  [3,] 1
  Contents of matnew
  [1,] 2 3 6

Note that only the third row of matnan survives, since it is the only row that contains neither missing nor infinite values.
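The counting and row-filtering steps above can be sketched in plain Python using the math module (math.isnan and math.isfinite play the roles of isNaN and isNumber):

```python
import math

matnan = [
    [float("nan"), float("inf"), 4.0],
    [1.0, float("-inf"), 5.0],
    [2.0, 3.0, 6.0],
]

# count NaNs and non-numbers, mirroring countNaN / countNotNumber
n_nan = sum(math.isnan(v) for row in matnan for v in row)
n_notnum = sum(not math.isfinite(v) for row in matnan for v in row)

# keep only rows whose entries are all finite numbers (cf. isNumber + paf)
matnew = [row for row in matnan if all(math.isfinite(v) for v in row)]

print(n_nan, n_notnum)  # 1 3
print(matnew)           # [[2.0, 3.0, 6.0]]
```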
Another possibility for handling missing and infinite values is to replace them by some other values. Suppose, for example, that we want to replace all NaN, Inf and -Inf values in matnan by 0. This is done by the function replace:

  matgood=replace(matnan,#(NaN,Inf,-Inf),0)
  matgood

gives

  Contents of matgood
  [1,] 0 0 4
  [2,] 1 0 5
  [3,] 2 3 6
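The replacement step can be sketched in plain Python as well, substituting 0 for every entry that is not a finite number:

```python
import math

matnan = [
    [float("nan"), float("inf"), 4.0],
    [1.0, float("-inf"), 5.0],
    [2.0, 3.0, 6.0],
]

# replace every NaN, Inf and -Inf entry by 0, mirroring XploRe's replace()
matgood = [[v if math.isfinite(v) else 0.0 for v in row] for row in matnan]

print(matgood)  # [[0.0, 0.0, 4.0], [1.0, 0.0, 5.0], [2.0, 3.0, 6.0]]
```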