2.2 Computing Statistical Characteristics

Given a data matrix and basic information about the values in it, the next analysis step is to compute some characteristic numbers from the data.

We will take our first steps in statistical analysis by investigating the data matrix salpro, which contains the variables sales and profits from the U.S. companies database. If you want to check the contents of this matrix, just issue salpro at the command line. Recall that this is a small data set of only 79 rows, containing realizations of two metric variables. For all observations, we have information about the branch in which the company operates. This information is contained in the text vector branch. To give you a first impression of the data, Figure 2.1 shows a scatter plot of sales against profits.

Figure 2.1: Scatter plot of sales against profits for the U.S. companies data.
XLGdesc03.xpl

Later in this section, we will continue our course by studying the data matrix earn. earn contains data of a different structure. Some of the variables are continuous but most of the variables are discrete (i.e. can take on only a few values) or are in fact nonmetric.

In the following subsections, we will compute different characteristics of the data matrices salpro and earn. All XploRe code can be found in the quantlets XLGdesc04.xpl and XLGdesc05.xpl, respectively.


2.2.1 Minimum and Maximum


mx = min (x {,d})
computes the minimum of an array x, optionally with respect to dimension d
mx = max (x {,d})
computes the maximum of an array x, optionally with respect to dimension d

Basic information about a data set is given by the minimum and maximum values of its variables. For the variables sales and profits of the U.S. companies data set, we obtain these values from

  min(salpro)
  max(salpro)
which computes
  Contents of min
  [1,]      176   -771.5 
  Contents of max
  [1,]    50056     6555
In XploRe , the minima and maxima can be computed for all rows or columns of a matrix in one step. By default, the columnwise minima and maxima are computed. This means for the above output, 176 is the minimum of sales and -771.5 the minimum of profits. The maximum of sales is 50056 and the maximal profit is 6555.

For rowwise computation of minima and maxima, a second parameter needs to be given to min and max. For example, min(salpro,2) would compute the rowwise minimum of salpro. Let us remark that rowwise computations of course make no sense in the case of salpro, since the variables (sales and profits) are stored columnwise in this data matrix.
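The columnwise versus rowwise convention can be illustrated outside XploRe as well. The following plain-Python sketch (with a small made-up matrix, not the actual salpro data, and function names of our own choosing) mimics min(x) and min(x,2):

```python
# Plain-Python sketch of columnwise vs. rowwise minima/maxima.
# The toy matrix below is illustrative, not the real salpro data.
data = [
    [176.0, -771.5],
    [50056.0, 6555.0],
    [1754.0, 70.5],
]

def col_min(matrix):
    # Default behavior of min(x): one minimum per column
    return [min(col) for col in zip(*matrix)]

def col_max(matrix):
    # Default behavior of max(x): one maximum per column
    return [max(col) for col in zip(*matrix)]

def row_min(matrix):
    # Analogue of min(x, 2): one minimum per row
    return [min(row) for row in matrix]

print(col_min(data))  # [176.0, -771.5]
print(col_max(data))  # [50056.0, 6555.0]
print(row_min(data))  # [-771.5, 6555.0, 70.5]
```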


2.2.2 Mean, Variance and Other Moments


mx = mean (x {,d})
computes the mean of an array x, optionally with respect to dimension d
vx = var (x {,d})
computes the variance of an array x, optionally with respect to dimension d
kx = kurtosis (x)
computes the (columnwise) kurtosis of an array x
sx = skewness (x)
computes the (columnwise) skewness of an array x

To describe numeric data, it is useful to study the empirical values for the moments of the underlying distribution. Suppose we have a data vector $ x$ which contains $ n$ observations $ x_1,\ldots,x_n$. The mean $ \overline{x}$ and the variance $ v^2$

$\displaystyle \overline{x} = \frac1n \sum_{i=1}^n x_i, \quad
v^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\overline{x})^2$

are the average (arithmetic mean) of all observations and average quadratic deviation from the mean, respectively. Analogous to minimum and maximum, mean and variance can be computed for all rows or columns of a matrix in one step. By default, columnwise means and variances are computed. For example, the commands
  mean(salpro)
  var(salpro)
give the means and variances for the variables sales and profits of the U.S. companies data set:
  Contents of mean
  [1,]   4178.3   209.84 
  Contents of var
  [1,]  4.9163e+07  6.3517e+05
The standard deviations of sales and profits can be obtained by taking the square root of the variances: sqrt(var(salpro)).
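The two formulas above translate directly into code. Here is a plain-Python sketch with a made-up data vector (not the salpro data):

```python
import math

# Toy data vector for illustration
x = [2.0, 4.0, 6.0, 8.0]
n = len(x)

mean = sum(x) / n                                  # arithmetic mean
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # note the n-1 denominator
std = math.sqrt(var)                               # standard deviation

print(mean, var, std)
```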

The skewness $ s$ and the kurtosis $ k$ measure the asymmetry of the distribution and its departure from the normal distribution, respectively:

$\displaystyle s = \frac1n \sum_{i=1}^n \frac{(x_i-\overline{x})^3}{v^3}, \quad
k = \frac1n \sum_{i=1}^n \frac{(x_i-\overline{x})^4}{v^4}.$

The skewness should be close to 0 for a distribution that is symmetric around $ \overline{x}$. The kurtosis should be close to $ 3$ for a distribution resembling the normal. For the variables sales and profits from the U.S. companies data we find
  skewness(salpro)
  kurtosis(salpro)
resulting in
  Contents of s
  [1,]   4.2659   6.5908 
  Contents of k
  [1,]   25.426   51.577
which are all far from 0 and $ 3$, respectively. We can conclude that both variables have skewed distributions. See Figure 2.1 for a scatter plot of both variables. Of course, this skewness also implies the observed departure from normality.
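For readers who want to verify such values by hand, here is a plain-Python sketch of the two formulas, with $v$ taken as the square root of the $n-1$ variance as defined above (the function names and the toy data are ours, not XploRe's):

```python
import math

def skewness(x):
    # Empirical skewness per the formula above (v uses the n-1 variance)
    n = len(x)
    m = sum(x) / n
    v = math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))
    return sum((xi - m) ** 3 for xi in x) / (n * v ** 3)

def kurtosis(x):
    # Empirical kurtosis; close to 3 for roughly normal data
    n = len(x)
    m = sum(x) / n
    v = math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))
    return sum((xi - m) ** 4 for xi in x) / (n * v ** 4)

print(skewness([1.0, 2.0, 3.0]))            # 0.0 (symmetric data)
print(skewness([1.0, 2.0, 2.0, 3.0, 10.0])) # positive (right-skewed data)
```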


2.2.3 Median and Quantiles


mx = median (x)
computes the (columnwise) median of an array x
qx = quantile (x, alpha)
computes the (columnwise) quantile of an array x at level alpha

An alternative way to characterize numeric data is given by the median and quantiles of the data. The empirical median of a data set is the value that separates the data into the 50% smallest and the 50% largest values. Formally, the empirical median is calculated as

\begin{displaymath}x_{\textrm{med}}=\left\{
\begin{array}{ll}
x_{[(n+1)/2]} & \quad \textrm{ if } n \textrm{ is odd,}\\
\frac{1}{2}\left(x_{[n/2]}+x_{[(n/2)+1]}\right)
& \quad \textrm{ if } n \textrm{ is even,}
\end{array}\right.\end{displaymath}

where $ x_{[(n+1)/2]}$ and $ x_{[(n/2)+1]}$ are the elements at position $ (n+1)/2$ and $ (n/2)+1$ of the ranked data. In simpler words, the median gives the central value of the data.
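The case distinction above can be sketched in plain Python (the function name is ours):

```python
def median(x):
    """Median per the formula above: the middle value for odd n,
    the average of the two middle values for even n (1-based ranks)."""
    s = sorted(x)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]            # x_[(n+1)/2]
    return 0.5 * (s[n // 2 - 1] + s[n // 2])  # (x_[n/2] + x_[(n/2)+1]) / 2

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```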

If the distribution of the data were symmetric, arithmetic mean and median should roughly coincide. Let us check this for sales and profits of the U.S. companies:

  mean(salpro)
  median(salpro)
result in the following output
  Contents of mean
  [1,]   4178.3   209.84 
  Contents of med
  [1,]     1754     70.5
We observe a mean of 4178.3 and a median of 1754 for the sales as well as a mean of 209.84 and a median of 70.5 for the profits. Obviously, the mean $ \overline{x}$ and median $ x_{\textrm{med}}$ for both variables are quite different. This confirms the conclusions we had already obtained from the computation of the skewness for both variables.

While the median reflects the central value which partitions the data into two 50% portions, we might also be interested in partitioning into fractions of other size. The corresponding values from the data are then empirical values for the quantiles of the distribution.

Theoretically, the $ \alpha$-quantile of a random variable $ X$ is the value $ q_\alpha$, such that

$\displaystyle q_\alpha=F^{-1}(\alpha) \quad \textrm{ or } \quad
F(q_\alpha)=\alpha \quad \textrm{ or } \quad
\int_{-\infty}^{q_\alpha} f(x)\,dx =\alpha.$

Here, $ F(\bullet)$ and $ f(\bullet)$ denote the cumulative distribution function (cdf) and the probability density function (pdf) of $ X$, respectively. If $ \alpha=0.5=50\%$, then $ q_{0.5}$ is called the (theoretical) median of the distribution. For illustration, Figure 2.2 visualizes the $ 0.95$ quantile $ q_{0.95}\cong 1.64$ of a Gaussian distribution.

Figure 2.2: The $ q_{0.95}$ quantile of a Gaussian distribution.

The empirical quantiles of a data set can be found by sorting the data and then finding the sample value that partitions the data into portions of size $ \alpha$ and $ 1-\alpha$. Together with the median ($ 0.5$ quantile of the data), the quartiles ($ 0.25$ and $ 0.75$ quantiles) are often computed for a data set.

  quantile(salpro,0.25)
  median(salpro)
  quantile(salpro,0.75)
compute both quartiles and the medians for our running example and give
  Contents of q
  [1,]      749     37.8 
  Contents of med
  [1,]     1754     70.5 
  Contents of q
  [1,]     4781    195.3
As you can easily calculate, the absolute difference between median and the lower quartile (25% quantile) is less than the absolute difference between median and the upper quartile (75% quantile):
  abs(median(salpro)-quantile(salpro,0.25))
  abs(median(salpro)-quantile(salpro,0.75))
yields
  Contents of abs
  [1,]     1005     32.7 
  Contents of abs
  [1,]     3027    124.8
This indicates that the densities of both variables are flatter at the right tail and steeper at the left tail. Recall also the scatter plot of both variables in Figure 2.1.
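Empirical quantiles can be computed in several slightly different ways. The following plain-Python sketch uses one simple convention, the order statistic at rank $\lceil \alpha n \rceil$, which need not match XploRe's quantile exactly (other conventions interpolate between neighboring order statistics):

```python
import math

def quantile(x, alpha):
    # Empirical alpha-quantile: the sample value at rank ceil(alpha * n).
    # This is one common convention; others interpolate between ranks.
    s = sorted(x)
    k = max(1, math.ceil(alpha * len(s)))
    return s[k - 1]

x = list(range(1, 11))    # toy data 1, 2, ..., 10
print(quantile(x, 0.25))  # 3   (lower quartile)
print(quantile(x, 0.5))   # 5   (median under this convention)
print(quantile(x, 0.75))  # 8   (upper quartile)
```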


2.2.4 Covariance and Correlation


cx = cov (x)
computes the covariance matrix of a data matrix x (the covariance matrices of an array x, respectively)
rx = corr (x)
computes the correlation matrix of a data matrix x (the correlation matrices of an array x, respectively)

Recall the scatter plot of the data from the matrix salpro which was given in Figure 2.1. The figure shows that both variables sales and profits are highly dependent. The dependence of metric continuous variables (to be exact, the linear dependence) can be measured by the covariance or correlation between the variables.

Suppose we have realizations $ x_{1i}$ and $ x_{2i}$ ( $ i=1,\ldots,n$) for two variables $ X_1$ and $ X_2$. The empirical covariance between $ X_1$ and $ X_2$ is then defined as

$\displaystyle \textrm{cov}(X_1,X_2)=\frac{1}{n-1} \sum_{i=1}^n \,
(x_{1i}-\overline{x}_1)\,(x_{2i}-\overline{x}_2).$

For a data matrix consisting of several variables, often the covariance matrix is given. This matrix contains the variances in the diagonal and the covariances $ \textrm{cov}(X_\ell,X_k)$ in the off-diagonal elements. Note that $ \textrm{cov}(X_\ell,X_k)=\textrm{cov}(X_k,X_\ell)$, so that covariance matrices are always symmetric matrices.

To compute the covariance matrix of a data matrix, XploRe provides the function cov. For example, from

  cov(salpro)
we obtain the covariance matrix
  Contents of s
  [1,]  4.9163e+07  4.5475e+06 
  [2,]  4.5475e+06  6.3517e+05
On the diagonal of the $ 2\times 2$ matrix we discover the variances of the variables sales and profits. The covariance between sales and profits is the off-diagonal element 4.5475e+06, which we can access by
  covmat=cov(salpro)
  covmat[1,2]

As mentioned above, the covariance measures the (linear) dependence between two variables. If many or all terms $ (x_{1i}-\overline{x}_1)$ have the same sign as $ (x_{2i}-\overline{x}_2)$, we observe positive dependence ($ X_1$ increases if $ X_2$ increases and vice versa). If the signs are typically different, we observe negative dependence ($ X_1$ increases if $ X_2$ decreases and vice versa). However, it is difficult to assess the magnitude of dependence from the covariance alone. For measuring the strength of dependence, the covariance is considered in relative value to the variances of $ X_1$ and $ X_2$. The resulting coefficient is the correlation coefficient which is defined as

$\displaystyle \rho(X_1,X_2)=\frac{\textrm{cov}(X_1,X_2)}{v_1\cdotp v_2}$

with $ v_1$ and $ v_2$ denoting the standard deviations of $ X_1$ and $ X_2$, respectively. For the correlation coefficient, it always holds that $ -1\le \rho(X_1,X_2) \le 1$. The values -1 or 1 indicate perfect negative or positive correlation, respectively.

Similar to the covariance matrix, the correlation coefficients for a data matrix can be stored in a correlation matrix. This matrix has values 1 on the diagonal and the correlation coefficients $ \rho(X_\ell,X_k)$ on the off-diagonal elements. Of course, correlation matrices are symmetric matrices too.

XploRe provides the function corr to compute the correlation matrix:

  corr(salpro)
results in
  Contents of r
  [1,]        1  0.81378 
  [2,]  0.81378        1
which means that there is correlation of 0.81378 between sales and profits. As we expected, both variables are highly correlated.
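The definitions of covariance and correlation given above can be sketched as follows in plain Python (function names and the toy data pair are ours; y is chosen perfectly linear in x, so the correlation is exactly 1):

```python
import math

def cov(x, y):
    # Empirical covariance with n-1 denominator, as defined above
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def corr(x, y):
    # Correlation = covariance scaled by both standard deviations
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # y = 2x + 1, perfectly linearly dependent
print(corr(x, y))         # 1.0 (up to rounding)
```

Note that cov(x, x) reproduces the variance of x, just as the diagonal of the covariance matrix does.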


2.2.5 Categorical Data


{xr, r} = discrete (x {,y})
reduces a matrix to its distinct rows and gives the number of replications of each row in the original data set; optionally a second matrix y can be given, which is summed up accordingly

Up to now we have dealt mainly with continuous (metric) variables. In applied sciences, however, dichotomous or categorical variables often play an important role. The first question in a descriptive analysis of such variables is which categories occur and with which frequencies. This information is available from the function discrete. Let us consider the third column of the matrix earn, which is 1 for females and 0 for males:

  {cat,freq}=discrete(earn[,3])
  cat
  freq
XLGdesc06.xpl

gives
  Contents of cat
  [1,]        0 
  [2,]        1 
  Contents of freq
  [1,]      289 
  [2,]      245
As can be seen, the data contain 245 observations of females and 289 observations of males.
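The effect of discrete corresponds to a simple frequency count. In plain Python one could sketch it with collections.Counter (the short code vector below is made up for illustration):

```python
from collections import Counter

codes = [0, 1, 0, 0, 1, 1, 0]     # toy 0/1 gender codes
freq = Counter(codes)
cats = sorted(freq)               # distinct values, like discrete()'s first output
counts = [freq[c] for c in cats]  # frequencies, like its second output
print(cats, counts)               # [0, 1] [4, 3]
```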

The function discrete can be used for text matrices as well. Let us consider the observations in the branch vector from the U.S. companies data:

  {cat,freq}=discrete(branch)
  cat
  freq
XLGdesc06.xpl

results in
  Contents of cat
  [1,] Communication
  [2,] Energy
  [3,] Finance
  [4,] HiTech
  [5,] Manufacturing
  [6,] Medical
  [7,] Other
  [8,] Retail
  [9,] Transportation
  Contents of freq
  [1,]        2 
  [2,]       15 
  [3,]       17 
  [4,]        8 
  [5,]       10 
  [6,]        4 
  [7,]        7 
  [8,]       10 
  [9,]        6

We will come back to categorical variables in Subsection 2.3.2. This subsection will present a more convenient way to summarize categories and frequencies in a frequency table. We will also see how to tabulate two variables and to study the dependence between them.


2.2.6 Missing Values and Infinite Values


nx = countNaN (x)
counts missing values in an array x
nx = countNotNumber (x)
counts missing and infinite values in an array x
ix = isNaN (x)
determines whether the elements of an array x are missing values
ix = isInf (x)
determines whether the elements of an array x are infinite values
ix = isNumber (x)
determines whether the elements of an array x are regular numeric values
y = paf (x, i)
deletes all rows of x for which the corresponding element of i equals 0
y = replace (x, w, b)
replaces all elements of x which equal w by the value b

Numeric data sometimes contain missing values. Also, operations on the data may lead to missing or infinite values. For a subsequent statistical analysis, it is important to identify these values, to delete them or replace them by other values. XploRe encodes missing values by NaN, positive infinite values by Inf and negative infinite values by -Inf.

By means of countNaN and countNotNumber, the existence of values that are NaN or not numbers (NaN, Inf, -Inf) can be investigated. Here, let us check our data matrix earn:

  countNaN(earn)
  countNotNumber(earn)
both give the result 0, which tells us that the matrix earn contains neither missing values nor infinite values.

Consider now a small matrix that contains such values. The XploRe code for the following examples can be found in the quantlet XLGdesc07.xpl.

  matnan=#(NaN, 1, 2) ~ #(Inf, -Inf, 3) ~ #(4, 5, 6)
  matnan
generates the following $ 3\times 3$ matrix
  Contents of matnan
  [1,]      NaN      Inf        4 
  [2,]        1     -Inf        5 
  [3,]        2        3        6
that has one missing and two infinite values. The lines
  countNaN(matnan)
  countNotNumber(matnan)
result consequently in
  Contents of n
  [1,]        1 
  Contents of n
  [1,]        3
To identify the location of the problematic values, we can use the functions isNaN, isInf, and isNumber. isNaN produces a matrix of the same size as the input matrix, indicating NaN values by 1 and all other values by 0:
  isNaN(matnan)
shows
  Contents of y
  [1,]        1        0        0 
  [2,]        0        0        0 
  [3,]        0        0        0
Note that isNaN(x) is equivalent to x==NaN. isInf searches in the same way for Inf and -Inf values. isNumber indicates numbers by 1 and missing and infinite values by 0.

We can now use the results from isNaN, isInf, and isNumber to delete rows of our data matrix that contain these problematic values. For example, to extract all rows without missing or infinite numbers, type

  inum=prod(isNumber(matnan),2)
  inum
  matnew=paf(matnan,inum)
  matnew
The first line computes the rowwise product of isNumber(matnan) by means of the function prod. The result inum indicates all rows with missing or infinite numbers by the value 0; otherwise the value 1 is produced. The third line of code uses paf to extract all rows from matnan for which the corresponding element of inum is 1. Hence, paf extracts only the last line of matnan:
  Contents of inum
  [1,]        0 
  [2,]        0 
  [3,]        1 
  Contents of matnew
  [1,]        2        3        6
Note that paf only operates on the rows of a matrix. To extract columns of a matrix, paf needs to be applied to the transposed matrix. See Matrix Handling (16) for more information on matrix operations.

Another possibility for handling missing and infinite values is to replace them by other values. For example, suppose we want to replace all NaN, Inf and -Inf values in matnan by 0. This is done by the function replace:

  matgood=replace(matnan,#(NaN,Inf,-Inf),0)
  matgood
gives
  Contents of matgood
  [1,]        0        0        4 
  [2,]        1        0        5 
  [3,]        2        3        6
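The whole workflow of this subsection — locating non-numbers, dropping incomplete rows, and replacing bad values — has a compact plain-Python analogue; math.isfinite returns False for NaN, Inf and -Inf, so it plays the role of isNumber here (the variable names mirror the XploRe example but are our own):

```python
import math

nan, inf = float("nan"), float("inf")
matnan = [[nan, inf, 4.0],
          [1.0, -inf, 5.0],
          [2.0, 3.0, 6.0]]

# Analogue of prod(isNumber(x), 2) followed by paf:
# keep only rows whose entries are all regular numbers.
matnew = [row for row in matnan if all(math.isfinite(v) for v in row)]

# Analogue of replace(x, #(NaN, Inf, -Inf), 0):
# substitute 0 for every NaN/Inf/-Inf entry.
matgood = [[v if math.isfinite(v) else 0.0 for v in row] for row in matnan]

print(matnew)   # [[2.0, 3.0, 6.0]]
print(matgood)  # [[0.0, 0.0, 4.0], [1.0, 0.0, 5.0], [2.0, 3.0, 6.0]]
```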