All functions introduced in Section 2.2 are intended to directly compute statistical characteristics from data. These direct computations may be quite cumbersome for everyday work. XploRe offers additional functions which summarize the statistical characteristics of data sets in an efficient way.
summarize is a tool to obtain a quick overview of a (metric) data matrix. It gives the most important statistical characteristics in the form of a short table. The only required input is the data matrix itself. Optionally, variable names can be provided as a text vector.
The following code for the data matrix earn is available from the quantlet XLGdesc08.xpl. For example,
  vearn="educ"|"south"|"female"|"exp"|"union"|"lnwage"|"age"
  summarize(earn, vearn)
shows the summary of the earn data together with the variable names:
[ 1,]
[ 2,]          Minimum  Maximum     Mean   Median  Std.Error
[ 3,] -----------------------------------------------
[ 4,] educ           2       18   13.019       12     2.6154
[ 5,] south          0        1  0.29213        0    0.45517
[ 6,] female         0        1   0.4588        0    0.49877
[ 7,] exp            0       55   17.822       15      12.38
[ 8,] union          0        1  0.17978        0    0.38436
[ 9,] lnwage         0   3.7955   2.0592   2.0513    0.52773
[10,] age           18       64   36.833       35     11.727
[11,]
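To make the columns of this table concrete, here is a minimal sketch in Python (not XploRe) that recomputes the educ and female rows. The helper names `expand` and `summarize` are hypothetical, and the two columns are rebuilt from the category counts that frequency reports for these variables later in this section. The sketch assumes that the Std.Error column holds the sample standard deviation (divisor n - 1), which is what the printed values suggest.

```python
from statistics import mean, median, stdev


def expand(counts):
    """Rebuild a raw data column from {value: frequency} pairs."""
    return [value for value, n in counts.items() for _ in range(n)]


def summarize(columns):
    """Return one (name, min, max, mean, median, std) row per column."""
    rows = []
    for name, values in columns.items():
        rows.append((name, min(values), max(values),
                     mean(values), median(values), stdev(values)))
    return rows


# Category counts as printed by frequency for educ and female (534 cases each).
educ = expand({2: 1, 3: 1, 4: 1, 5: 1, 6: 3, 7: 5, 8: 15, 9: 12, 10: 17,
               11: 27, 12: 219, 13: 37, 14: 56, 15: 13, 16: 71, 17: 24, 18: 31})
female = expand({0: 289, 1: 245})

for name, lo, hi, avg, med, sd in summarize({"educ": educ, "female": female}):
    print(f"{name:8s}{lo:8g}{hi:8g}{avg:10.5g}{med:8g}{sd:10.5g}")
```

Up to rounding, the printed rows agree with the educ and female rows of the summarize output.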
An alternative to summarize is fivenum, which reports the five-number summary of all columns of a data set. These five numbers are the minimum, the maximum, the median, and the 25% and 75% quartiles of the data. As with summarize, the only required input is the data matrix itself; variable names can be provided optionally. For the sake of brevity we show fivenum applied only to column seven of the earn matrix:
  fivenum(earn[,7],vearn[7])
reports
Contents of five
[ 1,]
[ 2,] ==================================================
[ 3,]  Five number summary: age
[ 4,] --------------------------------------------------
[ 5,]  Minimum               18
[ 6,]  25% Quartile          28
[ 7,]  Median                35
[ 8,]  75% Quartile          44
[ 9,]  Maximum               64
[10,] ==================================================
[11,]
The function
  descriptive(earn[,7],vearn[7])
produces
Contents of desc
[ 1,]
[ 2,] =========================================================
[ 3,]  Variable age
[ 4,] =========================================================
[ 5,]
[ 6,]  Mean             36.8333
[ 7,]  Std.Error        11.7266     Variance         137.513
[ 8,]
[ 9,]  Minimum               18     Maximum               64
[10,]  Range                 46
[11,]
[12,]  Lowest cases                 Highest cases
[13,]         350:           18             368:          64
[14,]          94:           18             212:          64
[15,]          48:           18             223:          64
[16,]          78:           18             125:          64
[17,]         298:           19             501:          64
[18,]
[19,]  Median                35
[20,]  25% Quartile          28     75% Quartile          44
[21,]
[22,]  Skewness        0.545221     Kurtosis        -0.595615
[23,]
[24,]  Observations                    534
[25,]  Distinct observations            47
[26,]
[27,]  Total number of {-Inf,Inf,NaN}    0
[28,]
[29,] =========================================================
[30,]
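The quantities in these two listings can be sketched in a few lines of Python (not XploRe). The function names `five_number` and `skewness_kurtosis` and the sample values are hypothetical. The quartile rule (linear interpolation, Python's "inclusive" method) and the plain moment estimators without small-sample correction are assumptions, so results for other data may differ slightly from what fivenum and descriptive print. The negative kurtosis in the listing suggests that excess kurtosis (kurtosis minus 3) is reported, which is what the sketch computes.

```python
from statistics import quantiles


def five_number(values):
    """Return (minimum, 25% quartile, median, 75% quartile, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mu) ** 4 for x in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical ages, not taken from the earn data.
sample = [18, 19, 28, 35, 35, 44, 64]
print(five_number(sample))
print(skewness_kurtosis(sample))
```

A right-skewed sample such as this one gives a positive skewness, in line with the sign reported for age.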
The functions frequency and crosstable can be applied to numeric as well as to text matrices. The XploRe code for this section is available from the quantlet XLGdesc09.xpl.
frequency produces a text matrix containing the categories and frequencies as well as cumulative frequencies for all columns of a data matrix. We apply this function to the first and third columns of the earn data:
  frequency(earn[,1|3], vearn[1|3])
Contents of freq
[ 1,]
[ 2,] ==================================================
[ 3,]  Variable educ
[ 4,] ==================================================
[ 5,]      |  Frequency   Percent   Cumulative
[ 6,] --------------------------------------------------
[ 7,]    2 |          1     0.002        0.002
[ 8,]    3 |          1     0.002        0.004
[ 9,]    4 |          1     0.002        0.006
[10,]    5 |          1     0.002        0.007
[11,]    6 |          3     0.006        0.013
[12,]    7 |          5     0.009        0.022
[13,]    8 |         15     0.028        0.051
[14,]    9 |         12     0.022        0.073
[15,]   10 |         17     0.032        0.105
[16,]   11 |         27     0.051        0.155
[17,]   12 |        219     0.410        0.566
[18,]   13 |         37     0.069        0.635
[19,]   14 |         56     0.105        0.740
[20,]   15 |         13     0.024        0.764
[21,]   16 |         71     0.133        0.897
[22,]   17 |         24     0.045        0.942
[23,]   18 |         31     0.058        1.000
[24,] --------------------------------------------------
[25,]      |        534     1.000
[26,] ==================================================
[27,]
[28,] ==================================================
[29,]  Variable female
[30,] ==================================================
[31,]      |  Frequency   Percent   Cumulative
[32,] --------------------------------------------------
[33,]    0 |        289     0.541        0.541
[34,]    1 |        245     0.459        1.000
[35,] --------------------------------------------------
[36,]      |        534     1.000
[37,] ==================================================
[38,]
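The structure of such a frequency table can be illustrated with a short Python sketch (not XploRe). The function name `frequency` is reused here only for readability and is a hypothetical re-implementation. The female column is rebuilt from the counts in the listing above. Note that the column labelled Percent in the listing holds proportions (fractions of 1), and the sketch follows that convention.

```python
from collections import Counter


def frequency(values):
    """Return (category, count, proportion, cumulative proportion) rows."""
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for category in sorted(counts):  # works for numbers and for text
        running += counts[category]
        rows.append((category, counts[category],
                     counts[category] / total, running / total))
    return rows


# female column rebuilt from the printed counts: 289 zeros and 245 ones.
female = [0] * 289 + [1] * 245
for category, count, share, cumulative in frequency(female):
    print(f"{category!s:>6} | {count:9d} {share:9.3f} {cumulative:9.3f}")
```

Because the categories are only counted and sorted, the same function handles text data, for example `frequency(["b", "a", "b"])`.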
To study the dependence between two categorical variables, one typically analyzes the contingency table (or cross table). crosstable provides the cross tables of all columns of a data matrix and additionally computes the χ² statistic for testing independence as well as contingency coefficients.
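How these quantities follow from a cross table can be sketched in Python (not XploRe). The function name `chi2_independence` is hypothetical. The cell counts are those of educ by female in the earn data, as printed in the crosstable output for these two variables. The formulas are the textbook ones: Pearson's χ² with (r - 1)(c - 1) degrees of freedom, the contingency coefficient C = sqrt(χ² / (χ² + n)), and, as an assumption about what crosstable reports, the corrected coefficient C · sqrt(k / (k - 1)) with k = min(r, c).

```python
from math import sqrt


def chi2_independence(table):
    """Return (chi^2, degrees of freedom, C, corrected C) for a count table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n  # count under independence
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    c = sqrt(chi2 / (chi2 + n))
    k = min(len(row_sums), len(col_sums))
    c_corrected = c * sqrt(k / (k - 1))  # assumed correction, see lead-in
    return chi2, df, c, c_corrected


# Rows: educ = 2, ..., 18; columns: female = 0, 1.
educ_by_female = [
    (1, 0), (1, 0), (1, 0), (1, 0), (1, 2), (4, 1), (6, 9), (7, 5), (12, 5),
    (16, 11), (109, 110), (21, 16), (33, 23), (7, 6), (37, 34), (10, 14), (22, 9),
]
chi2, df, c, c_corr = chi2_independence(educ_by_female)
print(f"chi^2 = {chi2:.2f}, df = {df}, C = {c:.2f}, corrected C = {c_corr:.2f}")
```

Up to rounding, the results agree with the statistic, degrees of freedom and coefficients that crosstable prints for these two variables. The significance level is omitted because it needs the χ² distribution function, which the Python standard library does not provide.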
For example,
  crosstable(earn[,1|3], vearn[1|3])
gives
Contents of cross
[ 1,]
[ 2,]
[ 3,] Crosstable for variables educ, female
[ 4,]
[ 5,]           |    0.0000    1.0000 |
[ 6,] ----------|---------------------|---------
[ 7,]    2.0000 |         1         0 |        1
[ 8,]    3.0000 |         1         0 |        1
[ 9,]    4.0000 |         1         0 |        1
[10,]    5.0000 |         1         0 |        1
[11,]    6.0000 |         1         2 |        3
[12,]    7.0000 |         4         1 |        5
[13,]    8.0000 |         6         9 |       15
[14,]    9.0000 |         7         5 |       12
[15,]   10.0000 |        12         5 |       17
[16,]   11.0000 |        16        11 |       27
[17,]   12.0000 |       109       110 |      219
[18,]   13.0000 |        21        16 |       37
[19,]   14.0000 |        33        23 |       56
[20,]   15.0000 |         7         6 |       13
[21,]   16.0000 |        37        34 |       71
[22,]   17.0000 |        10        14 |       24
[23,]   18.0000 |        22         9 |       31
[24,] ----------|---------------------|---------
[25,]           |       289       245 |      534
[26,]
[27,] Chi^2 test of independence
[28,]
[29,] chi^2 statistic:                      16.15
[30,] degrees of freedom:                   16
[31,] significance level for rejection:     0.4427
[32,]
[33,] contingency coefficient:              0.17
[34,] corrected contingency coefficient:    0.24
[35,]
The significance value for the