2.1 Data Matrices

In XploRe, data can be stored in matrices ($n\times p$) or arrays ($n\times p\times\ldots$). Here, we will concentrate on data matrices. Small data matrices can be created directly from the command line or within an XploRe quantlet. Large data matrices are typically read from data files.

The following subsections give a short introduction to matrix and data handling. Consult Read and Write (15) to learn more about loading data files into XploRe. More details on data and matrix manipulation can be found in Matrix handling (16).


2.1.1 Creating Data Matrices


z = # (x1, x2, ..., xn)
creates a column vector z from scalar numbers x1, x2,..., xn
z = x | y
concatenates two arrays x and y rowwise, i.e. stacks y below x
z = x ~ y
concatenates two arrays x and y columnwise, i.e. places y to the right of x

Small data matrices can be entered directly at the command line or within an XploRe program. The XploRe code that follows is available in the quantlet XLGdesc01.xpl. As a first example, consider the data matrix

\begin{displaymath}\left(
\begin{array}{llr@{.}l}
1& 2.0 & 3&4\\
5& 6.0 & 7&8\\
9& 0.0 & 1&44\\
8& 7.0 & 10&432\\
\end{array}\right)\end{displaymath}

which has dimension $4\times 3$, i.e. four rows and three columns. To construct this matrix in XploRe, we create each column vector separately and then concatenate these column vectors. A column vector can be created by means of the # or the | operator. The following two lines are equivalent:
  col1=#(1,5,9,8)
  col1=1|5|9|8
Both create the column vector

\begin{displaymath}\left(
\begin{array}{l}
1\\
5\\
9\\
8\\
\end{array}\right).\end{displaymath}

We can check the contents of col1 by issuing just
  col1
at the command line, which results in
  Contents of col1
  [1,]        1 
  [2,]        5 
  [3,]        9 
  [4,]        8
in the output window. In the same way as for col1, we build the second and third columns:
  col2=#(2.0,6.0,0.0,7.0)
  col3=#(3.4,7.8,1.44,10.432)
and group all three vectors together by means of the ~ operator:
  mat=col1~col2~col3
When we check the contents of mat we see
  Contents of mat
  [1,]        1        2      3.4 
  [2,]        5        6      7.8 
  [3,]        9        0     1.44 
  [4,]        8        7   10.432
Note that we could have created mat in a single step:
  mat= #(1,5,9,8) ~ #(2.0,6.0,0.0,7.0) ~ #(3.4,7.8,1.44,10.432)
Let us also remark that XploRe does not distinguish between integer and float values. Therefore, the first two columns of the matrix mat appear in the same format.
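
For readers who think in terms of a general-purpose language, the behaviour of the two concatenation operators can be mimicked in plain Python. This is only an illustrative sketch and not XploRe code: matrices are represented as lists of rows, and the helper names column, vconcat and hconcat are hypothetical.

```python
# Illustrative Python analogue of XploRe's #, | and ~ (not XploRe code).
# Matrices are lists of rows; all helper names here are hypothetical.

def column(*values):
    """Analogue of #(x1, ..., xn): build an n x 1 column vector."""
    return [[v] for v in values]

def vconcat(x, y):
    """Analogue of x | y: stack y below x (column counts must match)."""
    if x and y and len(x[0]) != len(y[0]):
        raise ValueError("column counts differ")
    return x + y

def hconcat(x, y):
    """Analogue of x ~ y: place y to the right of x (row counts must match)."""
    if len(x) != len(y):
        raise ValueError("row counts differ")
    return [rx + ry for rx, ry in zip(x, y)]

col1 = column(1, 5, 9, 8)
col2 = column(2.0, 6.0, 0.0, 7.0)
col3 = column(3.4, 7.8, 1.44, 10.432)

# Same construction as mat=col1~col2~col3
mat = hconcat(hconcat(col1, col2), col3)

for row in mat:
    print(row)
```

The size checks in vconcat and hconcat reflect the fact that concatenation only makes sense when the shared dimension agrees; how XploRe itself reports a mismatch is not covered here.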

It is also possible to create text matrices. For example,

  textmat= #("aa","c") ~ #("b","d2")
creates the text matrix

\begin{displaymath}\left(
\begin{array}{ll}
\textrm{\tt aa}& \textrm{\tt b}\\
\textrm{\tt c}& \textrm{\tt d2}\\
\end{array}\right)\end{displaymath}

Note that text and numeric values need to be stored in different matrices.


2.1.2 Loading Data Files


x = read("file")
reads numeric data from file.dat
x = readm("file")
reads mixed text and numeric data from file.dat

Large data sets are usually stored in data files. XploRe can read ASCII files containing numeric data, text data, or both. In the following we will use two data sets: cps85 and uscomp2 (see Data Sets (B.2)).

The file cps85.dat consists of a subsample of the 1985 U.S. Current Population Survey. The file contains only numeric data. We will assign columns 1 (years of education), 2 ($=1$ if living in south), 5 ($=1$ if female), 8 (years of labor market experience), 10 ($=1$ if working on a union job), 11 (natural logarithm of average hourly earnings) and 12 (age in years) to the XploRe variable earn:

  earn=read("cps85")
  earn=earn[,1|2|5|8|10|11|12]
XLGdesc02.xpl

The file uscomp2.dat contains information on 79 U.S. companies. The data set has 8 columns, two of them text (columns 1,8) and six numeric (columns 2 to 7). We will only use column 8 (branch, text) and columns 3 and 5 (sales and profits, both numeric) and assign them to the XploRe variables branch and salpro:
  uscomp=readm("uscomp2")
  branch=uscomp.text[,2]
  salpro=uscomp.double[,2|4]
XLGdesc02.xpl

Since text and numeric data are stored in different XploRe objects, we find column 8 of uscomp2 as the second text column and columns 3 and 5 as the second and fourth numeric columns, respectively. The text part is accessed as uscomp.text and the numeric part as uscomp.double. readm is a function written in the XploRe language that reads mixed text and numeric data.
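
The renumbering of columns after the split into a text part and a numeric part is easy to get wrong, so a small Python sketch may help. It is not the readm implementation: the column layout follows the description of uscomp2 above, while the row values and the helper names split_mixed and position_within_kind are hypothetical placeholders.

```python
# Illustrative sketch of how mixed columns map to separate text and numeric
# parts (not the actual readm implementation). The layout follows the text:
# columns 1 and 8 are text, columns 2 to 7 are numeric. Row values are
# hypothetical placeholders, not data from uscomp2.dat.

TEXT_COLUMNS = [1, 8]                  # 1-based original text columns
NUMERIC_COLUMNS = [2, 3, 4, 5, 6, 7]   # 1-based original numeric columns

def split_mixed(rows):
    """Return (text, double): the text and numeric columns as two matrices."""
    text = [[row[c - 1] for c in TEXT_COLUMNS] for row in rows]
    double = [[float(row[c - 1]) for c in NUMERIC_COLUMNS] for row in rows]
    return text, double

def position_within_kind(original_column):
    """1-based position of an original column inside its text or numeric part."""
    for kind, cols in (("text", TEXT_COLUMNS), ("double", NUMERIC_COLUMNS)):
        if original_column in cols:
            return kind, cols.index(original_column) + 1
    raise ValueError("no such column")

rows = [
    ["name_a", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, "branch_a"],
    ["name_b", 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, "branch_b"],
]
text, double = split_mixed(rows)

print(position_within_kind(8))  # ('text', 2)
print(position_within_kind(3))  # ('double', 2)
print(position_within_kind(5))  # ('double', 4)
```

The three printed positions correspond to the indices used in uscomp.text[,2] and uscomp.double[,2|4].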


2.1.3 Matrix Operations


d = dim(x)
shows the dimension of an array x
n = rows(x)
shows the number of rows of an array x
p = cols(x)
shows the number of columns of an array x
y = x[i,j] or y = x[i,] or y = x[,j]
extracts element i,j or row i or column j from x

The first step in data analysis is to determine the dimensions of the data. In general, this can be done with the function dim. We now apply this function to the data matrices mat, earn, branch, and salpro that we specified in Subsections 2.1.1 and 2.1.2. The code for this section is available in the quantlet XLGdesc02.xpl.

  dim(mat)
  dim(earn)
  dim(branch)
  dim(salpro)
yields
  Contents of dim
  [1,]        4 
  [2,]        3 
  Contents of dim
  [1,]      534 
  [2,]        7 
  Contents of dim
  [1,]       79 
  Contents of dim
  [1,]       79 
  [2,]        2
and tells us that mat is a $4\times 3$ matrix, earn is $534\times 7$, branch is a $79\times 1$ vector (dim reports a single entry for it) and salpro is $79\times 2$. If we are only interested in the number of rows or columns, we can use the commands rows and cols. For example,
  rows(earn)
  cols(earn)
gives
  Contents of rows
  [1,]      534 
  Contents of cols
  [1,]        7
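
The relationship between dim, rows and cols can be sketched in plain Python. This is an analogue for illustration only, not XploRe code: arrays are nested lists, the vector vec is a hypothetical example, and treating a vector as having one column is an assumption of the sketch.

```python
# Illustrative Python analogue of dim, rows and cols (not XploRe code).
# Arrays are nested lists; a flat list of scalars stands for a vector.

def dim(x):
    """Return one extent per dimension, e.g. [4, 3] for a 4 x 3 matrix."""
    extents = []
    while isinstance(x, list):
        extents.append(len(x))
        x = x[0] if x else None
    return extents

def rows(x):
    """Number of rows: the first extent."""
    return dim(x)[0]

def cols(x):
    """Number of columns: the second extent, or 1 for a vector (assumption)."""
    d = dim(x)
    return d[1] if len(d) > 1 else 1

mat = [[1, 2.0, 3.4], [5, 6.0, 7.8], [9, 0.0, 1.44], [8, 7.0, 10.432]]
vec = ["a", "b", "c"]  # hypothetical text vector

print(dim(mat))              # [4, 3]
print(dim(vec))              # [3]
print(rows(mat), cols(mat))  # 4 3
print(cols(vec))             # 1
```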

To extract elements or submatrices of a matrix, we can use the subarray operator []. For example, the following three lines extract the first row, the second column and the $(4,3)$ element (fourth row, third column):

  mat[1,]
  mat[,2]
  mat[4,3]
This operator can also be used to extract several rows and columns at once. The statement mat[1:3,1|3] extracts the elements that lie in rows 1 to 3 and in columns 1 and 3 of mat. The operator : specifies a range of consecutive integers.
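
To make the index semantics concrete, the selections above can be reproduced with a small Python helper that uses 1-based indices. This is a sketch under stated assumptions, not XploRe code: the helper name subarray is hypothetical, and a single element is returned here as a $1\times 1$ nested list.

```python
# Illustrative Python analogue of the subarray operator [] with 1-based
# indices (not XploRe code). The helper name subarray is hypothetical.

def subarray(x, row_idx=None, col_idx=None):
    """Select rows and columns by 1-based index lists; None keeps all."""
    if row_idx is None:
        row_idx = range(1, len(x) + 1)
    if col_idx is None:
        col_idx = range(1, len(x[0]) + 1)
    return [[x[i - 1][j - 1] for j in col_idx] for i in row_idx]

mat = [[1, 2.0, 3.4], [5, 6.0, 7.8], [9, 0.0, 1.44], [8, 7.0, 10.432]]

# mat[1:3,1|3]: rows 1 to 3, columns 1 and 3
print(subarray(mat, range(1, 4), [1, 3]))  # [[1, 3.4], [5, 7.8], [9, 1.44]]
# mat[1,]: first row
print(subarray(mat, row_idx=[1]))          # [[1, 2.0, 3.4]]
# mat[,2]: second column
print(subarray(mat, col_idx=[2]))          # [[2.0], [6.0], [0.0], [7.0]]
# mat[4,3]: single element, kept here as a 1 x 1 matrix
print(subarray(mat, [4], [3]))             # [[10.432]]
```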