2.3 Summarizing Statistical Information

All functions introduced in Section 2.2 compute individual statistical characteristics directly from the data. Calling them one by one can be cumbersome in everyday work, so XploRe offers additional functions that summarize the statistical characteristics of data sets efficiently.


2.3.1 Summarizing Metric Data


s = summarize (x {,xvars})
computes a short summary of descriptive statistics for each column of a matrix x; optionally a vector of variable names xvars can be given
s = fivenum (x {,xvars})
computes the five number summary for each column of a matrix x; optionally a vector of variable names xvars can be given
s = descriptive (x {,xvars})
computes detailed descriptive statistics for each column of a matrix x; optionally a vector of variable names xvars can be given

summarize is a tool for obtaining a quick overview of a (metric) data matrix. It reports the most important statistical characteristics in a short table. The only required input is the data matrix itself. Optionally, variable names can be provided as a text vector.

The following code for the data matrix earn is available from the quantlet XLGdesc08.xpl. For example,

  vearn="educ"|"south"|"female"|"exp"|"union"|"lnwage"|"age"
  summarize(earn, vearn)
shows the summary of the earn data together with their variable names:
  [ 1,]         
  [ 2,]        Minimum  Maximum     Mean   Median   Std.Error
  [ 3,]        -----------------------------------------------
  [ 4,] educ         2       18   13.019       12      2.6154
  [ 5,] south        0        1  0.29213        0     0.45517
  [ 6,] female       0        1   0.4588        0     0.49877
  [ 7,] exp          0       55   17.822       15       12.38
  [ 8,] union        0        1  0.17978        0     0.38436
  [ 9,] lnwage       0   3.7955   2.0592   2.0513     0.52773
  [10,] age         18       64   36.833       35      11.727
  [11,]
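
To make the columns of this table concrete, the following minimal Python sketch computes the same five characteristics per column using only the standard library. It is not XploRe code: the function name summarize_columns and the toy data are hypothetical, and the sample standard deviation (divisor n - 1) is an assumption standing in for the Std.Error column, since the divisor XploRe uses is not stated here.

```python
import statistics

def summarize_columns(rows, names):
    """Return {name: (min, max, mean, median, std)} for each column."""
    columns = list(zip(*rows))  # transpose the rows into columns
    summary = {}
    for name, col in zip(names, columns):
        summary[name] = (
            min(col),
            max(col),
            statistics.mean(col),
            statistics.median(col),
            statistics.stdev(col),  # assumption: sample std (divisor n - 1)
        )
    return summary

# Hypothetical toy data: three observations of two variables.
toy = [(12, 0), (16, 1), (14, 1)]
result = summarize_columns(toy, ["educ", "female"])
for name, stats in result.items():
    print(name, stats)
```

Each key of the returned dictionary corresponds to one row of the summary table.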

An alternative to summarize is fivenum, which reports the five number summary of all columns of a data set. These five numbers are the minimum, the 25% quartile, the median, the 75% quartile, and the maximum of the data. As with summarize, the only required input is the data matrix itself, and variable names can be provided optionally. For the sake of brevity, we show fivenum applied only to column seven of the earn matrix:

  fivenum(earn[,7],vearn[7])
reports
Contents of five
[ 1,]  
[ 2,] ==================================================
[ 3,]  Five number summary: age
[ 4,] --------------------------------------------------
[ 5,]    Minimum                    18
[ 6,]    25% Quartile               28
[ 7,]    Median                     35
[ 8,]    75% Quartile               44
[ 9,]    Maximum                    64
[10,] ==================================================
[11,]
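
As a rough cross-check outside XploRe, a five number summary can be sketched in Python with the standard library. The function name five_number_summary and the toy data are hypothetical. Quartile conventions differ between packages, so the "inclusive" interpolation used here is an assumption and need not match the rule fivenum applies.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, 25% quartile, median, 75% quartile, maximum)."""
    # Assumption: linear interpolation between order statistics
    # ("inclusive" method); other packages may use a different rule.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

# Hypothetical toy data, not taken from the earn matrix.
toy_ages = [18, 22, 28, 35, 44, 51, 64]
summary = five_number_summary(toy_ages)
print(summary)
```

For these seven toy values the quartiles fall halfway between neighbouring observations, which is where different quartile rules would start to disagree.
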
The function descriptive provides detailed information about the statistical characteristics of all columns of a data matrix. As with the previous tools, the input for descriptive is the data matrix and, optionally, variable names:
  descriptive(earn[,7],vearn[7])
produces
Contents of desc
[ 1,]  
[ 2,] =========================================================
[ 3,]  Variable age
[ 4,] =========================================================
[ 5,]  
[ 6,]  Mean              36.8333
[ 7,]  Std.Error         11.7266     Variance          137.513
[ 8,]  
[ 9,]  Minimum                18     Maximum                64
[10,]  Range                  46
[11,]  
[12,]  Lowest cases                  Highest cases 
[13,]         350:            18             368:           64
[14,]          94:            18             212:           64
[15,]          48:            18             223:           64
[16,]          78:            18             125:           64
[17,]         298:            19             501:           64
[18,]  
[19,]  Median                 35
[20,]  25% Quartile           28     75% Quartile           44
[21,]  
[22,]  Skewness         0.545221     Kurtosis        -0.595615
[23,]  
[24,]  Observations                    534
[25,]  Distinct observations            47
[26,]  
[27,]  Total number of {-Inf,Inf,NaN}    0
[28,]  
[29,] =========================================================
[30,]
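
In the output above, the square of the Std.Error entry equals the reported variance (11.7266^2 is approximately 137.513), so this entry is a standard deviation. The shape measures can be illustrated with a small Python sketch. The function name moment_summary and the toy data are hypothetical, and the moment-based definitions below (with kurtosis reported as excess over the normal value 3) are an assumption, since the exact estimators used by descriptive are not stated here.

```python
def moment_summary(values):
    """Moment-based shape measures for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n (an assumption, see lead-in).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
        "distinct": len(set(values)),
    }

# Hypothetical symmetric toy data: skewness should be zero.
shape = moment_summary([1, 2, 3, 4, 5])
print(shape)
```

A positive skewness, as reported for age, indicates a longer right tail; a negative excess kurtosis indicates tails lighter than those of a normal distribution.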


2.3.2 Summarizing Categorical Data


s = frequency (x {, xvars {, outwidth}})
computes a frequency table for each column of a matrix x; optionally a vector of variable names xvars and a maximal string length outwidth for the categories can be given
s = crosstable (x {,xvars})
computes pairwise cross tables from all columns of a data matrix x and computes the result of a $\chi^2$ independence test; optionally a vector of variable names xvars can be given

The functions frequency and crosstable can be applied to numeric as well as to text matrices. The XploRe code for this section is available from the quantlet XLGdesc09.xpl.

frequency produces a text matrix containing the categories and frequencies as well as cumulative frequencies for all columns of a data matrix. We apply this function to the first and third columns of the earn data.

  frequency(earn[,1|3], vearn[1|3])
frequency reports information about categories and frequencies differently from the function discrete used in Section 2.2.5. In its output, the Percent and Cumulative columns are given as proportions between 0 and 1:
  Contents of freq
  [ 1,]  
  [ 2,] ==================================================
  [ 3,]  Variable educ
  [ 4,] ==================================================
  [ 5,]                 |  Frequency  Percent  Cumulative 
  [ 6,] --------------------------------------------------
  [ 7,]               2 |          1    0.002      0.002
  [ 8,]               3 |          1    0.002      0.004
  [ 9,]               4 |          1    0.002      0.006
  [10,]               5 |          1    0.002      0.007
  [11,]               6 |          3    0.006      0.013
  [12,]               7 |          5    0.009      0.022
  [13,]               8 |         15    0.028      0.051
  [14,]               9 |         12    0.022      0.073
  [15,]              10 |         17    0.032      0.105
  [16,]              11 |         27    0.051      0.155
  [17,]              12 |        219    0.410      0.566
  [18,]              13 |         37    0.069      0.635
  [19,]              14 |         56    0.105      0.740
  [20,]              15 |         13    0.024      0.764
  [21,]              16 |         71    0.133      0.897
  [22,]              17 |         24    0.045      0.942
  [23,]              18 |         31    0.058      1.000
  [24,] --------------------------------------------------
  [25,]                 |        534    1.000
  [26,] ==================================================
  [27,]  
  [28,] ==================================================
  [29,]  Variable female
  [30,] ==================================================
  [31,]                 |  Frequency  Percent  Cumulative
  [32,] --------------------------------------------------
  [33,]               0 |        289    0.541      0.541
  [34,]               1 |        245    0.459      1.000
  [35,] --------------------------------------------------
  [36,]                 |        534    1.000
  [37,] ==================================================
  [38,]
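
The arithmetic behind such a table can be sketched in a few lines of Python. The function name frequency_table is hypothetical, and this is not the XploRe implementation. The input reconstructs the female column from the counts shown above (289 zeros and 245 ones).

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, frequency, proportion, cumulative proportion)."""
    counts = Counter(values)
    n = len(values)
    table, cumulative = [], 0
    for category in sorted(counts):
        cumulative += counts[category]
        table.append((
            category,
            counts[category],
            round(counts[category] / n, 3),
            round(cumulative / n, 3),
        ))
    return table

# Reconstructed from the counts reported above for the variable female.
female = [0] * 289 + [1] * 245
table = frequency_table(female)
for row in table:
    print(row)
```

The two rows reproduce the proportions 0.541 and 0.459 and the cumulative value 1.000 shown for the variable female.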

To study the dependence between two categorical variables, one typically analyzes the contingency table (or cross table). crosstable provides the cross tables for all pairs of columns of a data matrix and additionally computes the $\chi^2$ statistic for testing independence as well as contingency coefficients. For example,

  crosstable(earn[,1|3], vearn[1|3])
gives
  Contents of cross
  [ 1,]  
  [ 2,]                    
  [ 3,] Crosstable for variables educ, female
  [ 4,]  
  [ 5,]           |      0.0000  1.0000 |
  [ 6,] ----------|---------------------|---------
  [ 7,]   2.0000  |      1       0      |       1 
  [ 8,]   3.0000  |      1       0      |       1 
  [ 9,]   4.0000  |      1       0      |       1 
  [10,]   5.0000  |      1       0      |       1 
  [11,]   6.0000  |      1       2      |       3 
  [12,]   7.0000  |      4       1      |       5 
  [13,]   8.0000  |      6       9      |      15 
  [14,]   9.0000  |      7       5      |      12 
  [15,]  10.0000  |     12       5      |      17 
  [16,]  11.0000  |     16      11      |      27 
  [17,]  12.0000  |    109     110      |     219 
  [18,]  13.0000  |     21      16      |      37 
  [19,]  14.0000  |     33      23      |      56 
  [20,]  15.0000  |      7       6      |      13 
  [21,]  16.0000  |     37      34      |      71 
  [22,]  17.0000  |     10      14      |      24 
  [23,]  18.0000  |     22       9      |      31 
  [24,] ----------|---------------------|---------
  [25,]           |    289     245      |     534 
  [26,]                    
  [27,] Chi^2 test of independence
  [28,]  
  [29,]   chi^2 statistic:                    16.15
  [30,]   degrees of freedom:                    16
  [31,]   significance level for rejection:  0.4427
  [32,]  
  [33,]   contingency coefficient:             0.17
  [34,]   corrected contingency coefficient:   0.24
  [35,]
The significance level reported for the $\chi^2$ test of independence between the two variables, which is the p-value of the test, is 0.4427 here. Independence therefore cannot be rejected at the usual 5% or 10% levels.
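
The reported statistic and coefficients can be recomputed from the cross table with a short Python sketch. The function name chi2_independence is hypothetical. The formulas are the textbook Pearson $\chi^2$ statistic, the contingency coefficient $C = \sqrt{\chi^2/(\chi^2+n)}$, and the correction $C/\sqrt{(k-1)/k}$ with $k$ the smaller of the number of rows and columns. That crosstable uses exactly these formulas is an assumption, although they agree with the printed values. The p-value is omitted because it needs the $\chi^2$ distribution function, which the Python standard library does not provide.

```python
import math

def chi2_independence(table):
    """Return (chi2, degrees of freedom, C, corrected C) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    contingency = math.sqrt(chi2 / (chi2 + n))
    k = min(len(row_totals), len(col_totals))
    corrected = contingency / math.sqrt((k - 1) / k)
    return chi2, df, contingency, corrected

# Counts copied from the cross table of educ (rows 2..18) by female (0, 1).
educ_by_female = [
    (1, 0), (1, 0), (1, 0), (1, 0), (1, 2), (4, 1), (6, 9), (7, 5),
    (12, 5), (16, 11), (109, 110), (21, 16), (33, 23), (7, 6),
    (37, 34), (10, 14), (22, 9),
]
chi2, df, contingency, corrected = chi2_independence(educ_by_female)
print(round(chi2, 2), df, round(contingency, 2), round(corrected, 2))
```

With 17 rows and 2 columns the test has (17 - 1)(2 - 1) = 16 degrees of freedom, matching the output above.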