In this section we review some concepts that are necessary to understand the developments later in the chapter. The purpose is to highlight some of the more important theoretical results in probability, in particular the concept of a random variable, the probability distribution, and some related results. Note, however, that we keep the exposition at an introductory level. For a more formal and detailed exposition of these concepts see Härdle and Simar (1999), Mantzapoulus (1995), Newbold (1996) and Wonacot and Wonacot (1990).
A random variable is a function that assigns (real) numbers to the results of an experiment. Each possible outcome of the experiment (i.e. value of the corresponding random variable) occurs with a certain probability. This outcome variable, X, is a random variable because, until the experiment is performed, it is uncertain what value X will take. Probabilities are associated with outcomes to quantify this uncertainty.
A random variable is called discrete if the set of all possible outcomes is finite or countable. For a discrete random variable X, the probability density function f(x) is defined as the function that, for any real number x that X can take, gives the probability that the random variable is equal to x: f(x) = P(X = x). If x is not one of the values that X can take, then f(x) = 0.
A continuous random variable X can take any value in at least one interval on the real number line. Since the possible values of X are uncountable, the probability associated with any particular point is zero. Unlike the situation for discrete random variables, the density function of a continuous random variable will not give the probability that X takes the value x. Instead, the density function f(x) of a continuous random variable will be such that areas under f(x) give probabilities associated with the corresponding intervals. The probability density function is defined so that f(x) ≥ 0 and, for any a ≤ b, P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
This is the area under f(x) in the range from a to b. For a continuous variable, P(X = a) = ∫_a^a f(x) dx = 0.
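As a quick numerical illustration of "probability as area", here is a minimal sketch with NumPy, assuming a standard normal density (a hypothetical example, not a distribution used in the text):

```python
import numpy as np

# Density of a standard normal random variable (an assumed example,
# not data from the chapter).
def f(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# P(a <= X <= b) is the area under f between a and b; approximate
# the integral on a fine grid with the trapezoidal rule.
a, b = -1.0, 1.0
grid = np.linspace(a, b, 100_001)
prob = np.trapz(f(grid), grid)
print(round(prob, 4))  # about 0.6827 for the standard normal

# For a continuous variable a single point carries zero probability:
# the "interval" [c, c] has zero width, hence zero area.
c = 0.5
point_prob = np.trapz(f(np.array([c, c])), np.array([c, c]))
print(point_prob)  # 0.0
```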
Cumulative Distribution Function
A function closely related to the probability density function of a random variable is the corresponding cumulative distribution function. This function of a discrete random variable X is defined as follows: F(x) = P(X ≤ x) = Σ_{k ≤ x} f(k). That is, F(x) is the probability that the random variable X takes a value less than or equal to x.
The cumulative distribution function for a continuous random variable X is given by F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt, where f is the probability density function. In both the continuous and the discrete case, F(x) must satisfy the following properties: F(x) is non-decreasing, lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
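The discrete case can be sketched concretely; the example below assumes a fair six-sided die (a hypothetical distribution, chosen only for illustration), builds F(x) by summing the mass function, and checks the properties above:

```python
from fractions import Fraction

# Probability (mass) function of a fair six-sided die
# (a hypothetical discrete example).
f = {k: Fraction(1, 6) for k in range(1, 7)}

# Cumulative distribution function: F(x) = P(X <= x).
def F(x):
    return sum(p for k, p in f.items() if k <= x)

print(F(3))  # P(X <= 3) = 1/2
print(F(0))  # below the support: 0
print(F(6))  # whole support covered: 1

# F must be non-decreasing, as required of any distribution function.
values = [F(x) for x in range(0, 7)]
assert all(a <= b for a, b in zip(values, values[1:]))
```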
Expectations of Random Variables
The expected value of a random variable X is the value that we, on average, expect to obtain as an outcome of the experiment. It is not necessarily a value actually taken by the random variable. The expected value, denoted by E(X) or μ, is a weighted average of the values taken by the random variable X, where the weights are the respective probabilities.
Let us consider the discrete random variable X with outcomes x_1, …, x_n and corresponding probabilities f(x_1), …, f(x_n). Then, the expression E(X) = Σ_{i=1}^{n} x_i f(x_i) defines the expected value of the discrete random variable. For a continuous random variable X with density f(x), we define the expected value as E(X) = ∫_{−∞}^{+∞} x f(x) dx.
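As a small numerical sketch of the discrete formula (again using a hypothetical fair die, not an example from the text), the expectation is just this weighted sum:

```python
from fractions import Fraction

# Outcomes of a fair six-sided die and their common probability
# (a hypothetical example).
outcomes = range(1, 7)
p = Fraction(1, 6)

# E(X) = sum of x_i * f(x_i): a weighted average of the outcomes.
E = sum(x * p for x in outcomes)
print(E)  # 7/2: the expected value 3.5 is not itself a possible outcome
```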
Joint Distribution Function
We consider an experiment that consists of two parts, each of which leads to the occurrence of specified events. We could study both events separately; however, we might be interested in analyzing them jointly. The probability function defined over a pair of random variables is called the joint probability distribution. The joint probability distribution function of two discrete random variables X and Y is defined as the probability that X is equal to x at the same time that Y is equal to y: f(x, y) = P(X = x, Y = y).
If X and Y are continuous random variables, then the bivariate probability density function f(x, y) satisfies P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx.
The counterparts of the requirements for a probability density function are: f(x, y) ≥ 0 for all (x, y), with Σ_x Σ_y f(x, y) = 1 in the discrete case and ∫∫ f(x, y) dx dy = 1 in the continuous case.
The cumulative joint distribution function, in the case that both X and Y are discrete random variables, is F(x, y) = P(X ≤ x, Y ≤ y).
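A minimal sketch of a discrete joint distribution, assuming two independent fair coin flips coded 0/1 (a hypothetical example, not data from the chapter):

```python
from fractions import Fraction

# Joint probability function of two independent fair coin flips
# (hypothetical): each of the four (x, y) pairs has probability 1/4.
f = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}

# The joint probabilities are non-negative and sum to one.
assert all(p >= 0 for p in f.values())
assert sum(f.values()) == 1

# Cumulative joint distribution: F(x, y) = P(X <= x, Y <= y).
def F(x, y):
    return sum(p for (i, j), p in f.items() if i <= x and j <= y)

print(F(0, 0))  # 1/4
print(F(1, 1))  # 1
```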
Marginal Distribution Function
Consider now that we know a bivariate random variable (X, Y) and its probability distribution, and suppose we simply want to study the probability distribution of X alone, say f_X(x). How can we use the joint probability density function f(x, y) to obtain f_X(x)?
The marginal distribution, f_X(x), of a discrete random variable X provides the probability that the variable X is equal to x in the joint probability f(x, y), without considering the variable Y. Thus, if we want to obtain the marginal distribution of X from the joint density, it is necessary to sum out the other variable Y: f_X(x) = Σ_y f(x, y). The marginal distribution for the random variable Y, f_Y(y) = Σ_x f(x, y), is defined analogously. The resulting marginal distributions are one-dimensional.
Similarly, we obtain the marginal densities for a pair of continuous random variables X and Y: f_X(x) = ∫ f(x, y) dy and f_Y(y) = ∫ f(x, y) dx.
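The discrete "sum out the other variable" step can be sketched directly; the joint table below is hypothetical (deliberately not a product of its marginals, so X and Y are dependent):

```python
from fractions import Fraction

# A hypothetical joint distribution of two discrete variables X and Y.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

# Marginal of X: sum the joint probabilities over all values of Y.
def f_X(x):
    return sum(p for (i, j), p in f.items() if i == x)

# Marginal of Y: sum out X instead.
def f_Y(y):
    return sum(p for (i, j), p in f.items() if j == y)

print(f_X(0), f_X(1))  # 1/2 1/2
print(f_Y(0), f_Y(1))  # 3/8 5/8
```

Each marginal is a one-dimensional distribution in its own right: its values are non-negative and sum to one.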
Conditional Probability Distribution Function
In the setting of a joint bivariate distribution f(x, y), consider the case when we have partial information about X. More concretely, we know that the random variable X has taken some value x. We would like to know the conditional behavior of Y given that X has taken the value x. The resultant probability distribution of Y given X = x is called the conditional probability distribution function of Y given X, f(y|x). In the discrete case it is defined as f(y|x) = f(x, y) / f_X(x),
where f(y|x) is the conditional probability density function and x must be such that f_X(x) > 0. The continuous case is defined in the same way, f(y|x) = f(x, y) / f_X(x), again for x such that f_X(x) > 0.
The concept of mathematical expectation can be applied regardless of the kind of probability distribution. Thus, for a pair of random variables (X, Y) with conditional probability density function f(y|x), the conditional expectation is defined as the expected value of the conditional distribution, i.e. E(Y|X = x) = Σ_{j=1}^{m} y_j f(y_j|x) in the discrete case and E(Y|X = x) = ∫ y f(y|x) dy in the continuous case. Note that for the discrete case, y_1, …, y_m are values such that f(y_j|x) > 0.
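Putting the last two definitions together, a sketch using the same style of hypothetical joint table as before: compute f(y|x) from the joint and the marginal, then average to get E(Y|X = x):

```python
from fractions import Fraction

# A hypothetical joint distribution of two discrete variables X and Y.
f = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8),
}

# Marginal of X, obtained by summing out Y.
def f_X(x):
    return sum(p for (i, j), p in f.items() if i == x)

# Conditional probability of Y given X = x: f(y|x) = f(x, y) / f_X(x),
# defined only where f_X(x) > 0.
def f_cond(y, x):
    return f[(x, y)] / f_X(x)

# Conditional expectation: the expected value of the conditional
# distribution of Y given X = x.
def E_Y_given(x):
    return sum(y * f_cond(y, x) for y in (0, 1))

print(E_Y_given(0))  # 3/4: given X = 0, Y equals 1 with probability 3/4
print(E_Y_given(1))  # 1/2
```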
The Regression Function
Let us define a pair of random variables (X, Y) with a range of possible values such that the conditional expectation of Y given X = x is well defined for several values of x. Then, a regression is just a function that relates different values of X, say x, and their corresponding values in terms of the conditional expectation E(Y|X = x).
The main objective of regression analysis is to estimate and predict the mean value (expectation) of the dependent variable on the basis of the given (fixed) values of the explanatory variable. The regression function describes the dependence of a quantity Y on the quantity X; a one-directional dependence is assumed. The random variable X is referred to as the regressor, explanatory variable or independent variable; the random variable Y is referred to as the regressand or dependent variable.
In the following Quantlet, we generate a two-dimensional random variable (X, Y), calculate the conditional expectation, and produce a line by merging the values of the conditional expectation at each value of x. The result is identical to the regression of Y on X.
Let us consider households as the whole population. We want to know the relationship between net income and household expenditure; that is, we want a prediction of the expected expenditure, given the level of net income of the household. In order to do so, we separate the households into groups with the same income and then calculate the mean expenditure for every level of income.
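The grouping step just described can be sketched as follows. The Quantlet itself is written in XploRe, so this is only a Python analogue, and the (income, expenditure) pairs are hypothetical, not the data behind Figure 1.1:

```python
from collections import defaultdict

# Hypothetical (income, expenditure) pairs for a small "population"
# of households; the figures are invented for illustration.
households = [
    (1, 0.8), (1, 1.0), (1, 1.2),
    (2, 1.7), (2, 1.9), (2, 2.1),
    (3, 2.6), (3, 3.0),
]

# Group households by income level and average expenditure within
# each group: the conditional mean E(expenditure | income).
groups = defaultdict(list)
for income, expenditure in households:
    groups[income].append(expenditure)

cond_mean = {income: sum(v) / len(v) for income, v in groups.items()}
for income in sorted(cond_mean):
    print(income, round(cond_mean[income], 3))
```

Joining the points (income, conditional mean) with a line produces exactly the regression curve of expenditure on income described above.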
This program produces the output presented in Figure 1.1
The function m(x) = E(Y|X = x) is called a regression function. This function expresses only the fact that the (population) mean of the distribution of Y given X = x has a functional relationship with respect to x.
One of the major tasks of statistics is to obtain information about populations. A population is defined as the set of all elements that are of interest for a statistical analysis, and it must be defined precisely and comprehensively so that one can immediately determine whether an element belongs to the population or not. We denote the population size by N. In fact, in most cases the population is unknown and, for the sake of analysis, we suppose that it is characterized by a joint probability distribution function. What is known to the researcher is a finite subset of observations drawn from this population. This is called a sample, and we will denote the sample size by n. The main aim of the statistical analysis is to obtain information about the population (its joint probability distribution) through the analysis of the sample.
Unfortunately, in many situations the aim of obtaining information about the whole joint probability distribution is too ambitious, and we have to orient our objective towards more modest goals. Instead of characterizing the whole joint distribution function, one may be interested in investigating one particular feature of this distribution, such as the regression function. In this case we will call it the Population Regression Function (PRF), a statistical object that has already been defined in sections 1.1.1 and 1.1.2.
Since very little is known about the population characteristics, one has to make some assumptions about the behavior of this unknown quantity. Then, if we consider the observations in Figure 1.1 as the whole population, we can state that the PRF is a linear function of the different values of X, i.e. E(Y|X = x) = α + βx,
where α and β are fixed unknown parameters known as regression coefficients. Note the crucial point that, once we have determined the functional form of the regression function, estimating the parameter values is tantamount to estimating the entire regression function. Therefore, once a sample is available, our task is considerably simplified: in order to analyze the whole population, we only need correct estimates of the regression parameters.
One important issue related to the Population Regression Function is the so-called error term in the regression equation. For a pair of realizations (x_i, y_i) from the random variable (X, Y), we note that y_i will not coincide with E(Y|X = x_i). We define the error term e_i as
e_i = y_i − E(Y|X = x_i), the error term in the regression function, which indicates the divergence between an individual value y_i and its conditional mean, E(Y|X = x_i). Taking into account equations (1.19) and (1.20), we can write the following equalities: y_i = E(Y|X = x_i) + e_i = α + βx_i + e_i and E(e|X = x_i) = 0.
This result implies that for X = x_i, the divergences of all values of Y with respect to the conditional expectation E(Y|X = x_i) are averaged out. There are several reasons for the existence of the error term in the regression.
The PRF is a feature of the so-called Data Generating Process (DGP). This is the joint probability distribution that is supposed to characterize the entire population from which the data set has been drawn. Now, assume that from the population of N elements characterized by a bivariate random variable (X, Y), a sample of n elements, (x_1, y_1), …, (x_n, y_n), is selected. If we assume that the Population Regression Function (PRF) that generates the data is y_i = α + βx_i + e_i,
then, given any estimators of α and β, namely α̂ and β̂, we can substitute these estimators into the regression function, ŷ_i = α̂ + β̂x_i, obtaining the sample regression function (SRF). The relationship between the PRF and the SRF is y_i = ŷ_i + ê_i, where ê_i is called the residual.
Just to illustrate the difference between Sample Regression Function and Population Regression Function, consider the data shown in Figure 1.1 (the whole population of the experiment). Let us draw a sample of 9 observations from this population.
This is shown in Figure 1.2. If we assume that the model which generates the data is y = α + βx + e, then using the sample we can estimate the parameters α and β.
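Under the assumed linear model, the coefficients can be estimated by ordinary least squares; here is a minimal sketch with NumPy. The parameter values, the uniform range for x, and the sample itself are all assumptions of this sketch, not the values behind Figure 1.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical PRF y = alpha + beta * x + e, with assumed values
# alpha = 1, beta = 2 (not those of the book's example).
alpha, beta = 1.0, 2.0
x = rng.uniform(0, 10, size=9)              # a sample of 9 observations
y = alpha + beta * x + rng.normal(0, 1, size=9)

# Ordinary least squares estimates of (alpha, beta): regress y on a
# constant and x.
X = np.column_stack([np.ones_like(x), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample regression function and residuals: y_i = y_hat_i + e_hat_i.
y_hat = alpha_hat + beta_hat * x
residuals = y - y_hat
print(alpha_hat, beta_hat)  # close to (1, 2), but sample-dependent
```

Note that with an intercept in the model, the residuals sum to zero by construction, mirroring the averaging-out property of the error term discussed above.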
In Figure 1.3 we present the sample, the population regression function (thick line), and the sample regression function (thin line). For fixed values of x in the sample, the Sample Regression Function will depend on the sample, whereas, on the contrary, the Population Regression Function will always take the same values regardless of the sample values.
With a data generating process (DGP) at hand, it is possible to create new simulated data. If the parameter values, the distribution of the error term e, and the vector of exogenous variables x are known (fixed), a sample of size n is created by obtaining n values of the random variable e and then using these values, in conjunction with the rest of the model, to generate n values of y. This yields one complete sample of size n. Note that this artificially generated set of sample data could be viewed as an example of real-world data that a researcher would be faced with when dealing with the kind of estimation problem this model represents. Note especially that the set of data obtained depends crucially on the particular set of error terms drawn. A different set of error terms would create a different data set of y values for the same problem (see Kennedy (1998) for more details).
In order to show how a DGP works, we implement the following experiment. We generate three replicate samples from the following data generating process: y_i = α + βx_i + e_i, where x is generated from a uniform distribution.
This code produces the values of x, which are the same for the three samples, and the corresponding values of y, which of course differ from one sample to the other.
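The experiment can be sketched as follows. The book's own code is an XploRe Quantlet, and its parameter values and uniform bounds are not reproduced here, so everything numeric below is an assumption of this Python sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed DGP: y = alpha + beta * x + e, with e standard normal
# (hypothetical parameter values).
alpha, beta, n = 1.0, 2.0, 50

# x is drawn once from a uniform distribution and then held fixed,
# so it is identical across the three replicate samples.
x = rng.uniform(0, 1, size=n)

samples = []
for _ in range(3):
    e = rng.normal(0, 1, size=n)        # a fresh set of error terms
    samples.append(alpha + beta * x + e)

# The x values are shared; the y values differ across replicates
# only through the error terms drawn.
print(np.allclose(samples[0], samples[1]))  # False: different error draws
```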