3.5 Simple Analysis of Variance

In a simple (i.e., one-factorial) analysis of variance (ANOVA), it is assumed that the average values of the response variable $y$ are induced by one simple factor. Suppose that this factor takes on $p$ values and that for each factor level, we have $m=n/p$ observations. The sample is of the form given in Table 3.1, where all of the observations are independent.


Table 3.1: Observation structure of a simple ANOVA.

\begin{tabular}{c|ccccc}
sample element & \multicolumn{5}{c}{factor levels $l$}\\
\hline
$1$ & $y_{11}$ & $\cdots$ & $y_{1l}$ & $\cdots$ & $y_{1p}$\\
$2$ & $\vdots$ &          & $\vdots$ &          & $\vdots$\\
$\vdots$ & $\vdots$ &     & $\vdots$ &          & $\vdots$\\
$k$ & $y_{k1}$ & $\cdots$ & $y_{kl}$ & $\cdots$ & $y_{kp}$\\
$\vdots$ & $\vdots$ &     & $\vdots$ &          & $\vdots$\\
$m=n/p$ & $y_{m1}$ & $\cdots$ & $y_{ml}$ & $\cdots$ & $y_{mp}$
\end{tabular}


The goal of a simple ANOVA is to analyze the observation structure

\begin{displaymath}
y_{kl}= \mu_{l}+ \varepsilon_{kl} \textrm{ for } k=1,\ldots,m,\textrm{ and }l=1,\ldots,p.
\end{displaymath} (3.41)

Each factor level $l$ has a mean value $\mu_{l}$. Each observation $y_{kl}$ is assumed to be the sum of the corresponding factor-level mean $\mu_{l}$ and a zero-mean random error $\varepsilon_{kl}$. The linear regression model falls into this scheme with $m=1$, $p=n$ and $\mu_i = \alpha+\beta x_i$, where $x_i$ is the $i$-th level value of the factor.

EXAMPLE 3.14   The ``classic blue'' pullover company analyzes the effect of three marketing strategies:
\begin{enumerate}
\item advertisement in a local newspaper,
\item presence of a sales assistant,
\item luxury presentation in shop windows.
\end{enumerate}

Each of these strategies is tried in 10 different shops. The resulting sales observations are given in Table 3.2.


Table 3.2: Pullover sales as a function of marketing strategy.

\begin{tabular}{c|ccc}
shop & \multicolumn{3}{c}{marketing strategy (factor $l$)}\\
$k$ & 1 & 2 & 3\\
\hline
1 & 9 & 10 & 18\\
2 & 11 & 15 & 14\\
3 & 10 & 11 & 17\\
4 & 12 & 15 & 9\\
5 & 7 & 15 & 14\\
6 & 11 & 13 & 17\\
7 & 12 & 7 & 16\\
8 & 10 & 15 & 14\\
9 & 11 & 13 & 17\\
10 & 13 & 10 & 15
\end{tabular}


There are $p=3$ factor levels and $n=mp=30$ observations in the data. The ``classic blue'' pullover company wants to know whether all three marketing strategies have the same mean effect or whether there are differences. Having the same effect means that all $\mu_{l}$ in (3.41) are equal to one value, $\mu$. The hypothesis to be tested is therefore

\begin{displaymath}H_{0}:\mu_{l}=\mu \textrm{ for } l=1,\ldots,p. \end{displaymath}

The alternative hypothesis, that the marketing strategies have different effects, can be formulated as

\begin{displaymath}H_{1}:\mu_{l}\neq\mu_{l'}\textrm{ for some } l\textrm{ and }l'.\end{displaymath}

This means that at least one marketing strategy differs in its effect from the others.

The method used to test this hypothesis is to compute, as in (3.38), the total variation and to decompose it into its sources. This gives:

\begin{displaymath}
\sum_{l=1}^{p}\sum_{k=1}^{m}(y_{kl}-\bar{y})^2 =
m\sum_{l=1}^{p}(\bar{y}_{l}-\bar{y})^2 +
\sum_{l=1}^{p}\sum_{k=1}^{m}(y_{kl}-\bar{y}_{l})^2.
\end{displaymath} (3.42)
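As a quick numerical check of this decomposition, the following sketch (in Python with NumPy, which the text itself does not use) verifies the identity in (3.42) on randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 3, 10                       # factor levels and observations per level
y = rng.normal(size=(m, p))        # y[k, l] = observation k at factor level l

ybar = y.mean()                    # overall mean
ybar_l = y.mean(axis=0)            # mean of each factor level

total = ((y - ybar) ** 2).sum()               # left-hand side of (3.42)
between = m * ((ybar_l - ybar) ** 2).sum()    # variation between the levels
within = ((y - ybar_l) ** 2).sum()            # variation within the levels

assert np.isclose(total, between + within)    # the decomposition holds
```

The identity holds for any data set, since the cross terms cancel when summing over $k$ within each level.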

The total variation (sum of squares=SS) is:

\begin{displaymath}
SS(\textrm{reduced})=\sum_{l=1}^p \sum_{k=1}^{m}(y_{kl}-\bar{y})^2
\end{displaymath} (3.43)

where $\bar{y}={n}^{-1}\sum_{l=1}^p\sum_{k=1}^{m} y_{kl}$ is the overall mean. Here the total variation is denoted $SS(\textrm{reduced})$, since in comparison with the model under the alternative $H_1$ we have a reduced set of parameters: there is only one parameter, $\mu = \mu_l$, under $H_0$. Under $H_1$, the ``full'' model, we have three parameters, namely the three different means $\mu_l$.

The variation under $H_{1}$ is therefore:

\begin{displaymath}
SS(\textrm{full})=\sum_{l=1}^p\sum_{k=1}^{m}(y_{kl}-\bar{y}_{l})^2
\end{displaymath} (3.44)

where $\bar{y}_{l}= m^{-1}\sum_{k=1}^{m}y_{kl}$ is the mean at factor level $l$. The hypothetical model $H_{0}$ is called reduced, since it has (relative to $H_{1}$) fewer parameters.

The $F$-test of the linear hypothesis is used to compare the difference in the variations under the reduced model $H_{0}$ (3.43) and the full model $H_{1}$ (3.44) to the variation under the full model $H_{1}$:


\begin{displaymath}
F = \frac{ \{SS(\textrm{reduced}) - SS(\textrm{full})\} / \{ df(r) -
df(f)\}}{SS (\textrm{full}) / df(f)}.
\end{displaymath} (3.45)

Here $df(f)$ and $df(r)$ denote the degrees of freedom under the full model and the reduced model respectively. The degrees of freedom are essential in specifying the shape of the $F$-distribution. They have a simple interpretation: $df(\cdot )$ is equal to the number of observations minus the number of parameters in the model.

From Example 3.14, $p=3$ parameters are estimated under the full model, i.e., $df(f) = n- p=30 - 3 = 27$. Under the reduced model, there is one parameter to estimate, namely the overall mean, i.e., $df(r)=n-1=29$. We can compute

\begin{displaymath}
SS(\textrm{reduced}) = 260.3
\end{displaymath}

and

\begin{displaymath}SS (\textrm{full}) = 157.7.\end{displaymath}

The $F$-statistic (3.45) is therefore

\begin{displaymath}
F = \frac{(260.3 - 157.7) /2}{157.7 /27} = 8.78.
\end{displaymath}

This value is to be compared to the quantiles of the $F_{2,27}$ distribution. The 95% quantile of this distribution is approximately 3.35, so the test statistic above is highly significant. We conclude that the marketing strategies have different effects.
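The whole computation for Example 3.14 can be reproduced directly from the data in Table 3.2. The following sketch (in Python with NumPy, which the text itself does not use) computes $SS(\textrm{reduced})$, $SS(\textrm{full})$, and the $F$-statistic (3.45):

```python
import numpy as np

# Sales from Table 3.2: one row per shop, one column per marketing strategy
y = np.array([
    [ 9, 10, 18],
    [11, 15, 14],
    [10, 11, 17],
    [12, 15,  9],
    [ 7, 15, 14],
    [11, 13, 17],
    [12,  7, 16],
    [10, 15, 14],
    [11, 13, 17],
    [13, 10, 15],
])
m, p = y.shape                 # m = 10 shops, p = 3 strategies
n = m * p

ss_reduced = ((y - y.mean()) ** 2).sum()        # (3.43): one common mean
ss_full = ((y - y.mean(axis=0)) ** 2).sum()     # (3.44): one mean per strategy

df_r, df_f = n - 1, n - p
F = ((ss_reduced - ss_full) / (df_r - df_f)) / (ss_full / df_f)   # (3.45)

print(round(ss_reduced, 1), round(ss_full, 1), round(F, 2))
# 260.3 157.7 8.78
```

The result matches the values in the text: $F = 8.78$ on $(2, 27)$ degrees of freedom.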

The $F$-test in a linear regression model

The $t$-test of a linear regression model can be put into this framework. For the linear regression model (3.27), the reduced model is the one with $\beta = 0$:

\begin{displaymath}y_i = \alpha + 0 \cdot x_i + \varepsilon_i.\end{displaymath}

The reduced model has $n-1$ degrees of freedom and one parameter, the intercept $\alpha$.

The full model is given by $\beta \neq 0$,

\begin{displaymath}y_i = \alpha + \beta \cdot x_i + \varepsilon_i,\end{displaymath}

and has $n-2$ degrees of freedom, since there are two parameters $(\alpha,\beta)$.

The $SS(\textrm{reduced})$ equals

\begin{displaymath}SS(\textrm{reduced}) = \sum_{i=1}^{n}(y_i-\bar{y})^2=\textrm{\it total variation}.\end{displaymath}

The $SS(\textrm{full})$ equals

\begin{displaymath}SS(\textrm{full}) = \sum_{i=1}^{n}(y_i-\hat{y}_i)^{2} = \textrm{RSS}=\textrm{\it unexplained variation}.\end{displaymath}

The $F$-test is therefore, from (3.45),
\begin{eqnarray*}
F &=& \frac{\left(\textrm{\it total variation} - \textrm{\it unexplained variation}\right)/1}
{(\textrm{\it unexplained variation})/(n-2)} \qquad (3.46)\\[2mm]
  &=& \frac{\textrm{\it explained variation}}{(\textrm{\it unexplained variation})/(n-2)}. \qquad (3.47)
\end{eqnarray*}

Using the estimators $\hat{\alpha}=\bar{y}-\widehat{\beta}\bar{x}$ and $\widehat{\beta}$, the explained variation is:
\begin{eqnarray*}
\sum_{i=1}^{n}\left(\hat{y}_i-\bar{y}\right)^2
&=& \sum_{i=1}^{n}\left(\hat{\alpha}+\widehat{\beta}x_{i}-\bar{y}\right)^2\\
&=& \sum_{i=1}^{n}\left\{(\bar{y}-\widehat{\beta}\bar{x})+\widehat{\beta}x_{i}-\bar{y}\right\}^2\\
&=& \sum_{i=1}^{n}\widehat{\beta}^2(x_{i}-\bar{x})^2\\
&=& \widehat{\beta}^2\, n\, s_{XX}.
\end{eqnarray*}

From (3.32) the $F$-ratio (3.46) is therefore:
\begin{eqnarray*}
F &=& \frac{\widehat{\beta}^2 n s_{XX}}{RSS/(n-2)} \qquad (3.48)\\[2mm]
  &=& \left(\frac{\widehat{\beta}}{SE(\widehat{\beta})}\right)^2. \qquad (3.49)
\end{eqnarray*}

The $t$-test statistic (3.33) is just the square root of the $F$-statistic (3.49).
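The relation $F = t^2$ can be checked numerically. The following sketch (in Python with NumPy, on synthetic data, since the example data are not reproduced here) fits a simple regression by least squares and compares the $F$-ratio (3.48) with the squared $t$-statistic; the formula for $SE(\widehat{\beta})$ below is written to be consistent with (3.48)--(3.49):

```python
import numpy as np

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(1)
n = 10
x = np.arange(1.0, n + 1)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

xbar, ybar = x.mean(), y.mean()
s_xx = ((x - xbar) ** 2).mean()                  # s_XX = n^{-1} sum (x_i - xbar)^2
beta = ((x - xbar) * (y - ybar)).sum() / (n * s_xx)
alpha = ybar - beta * xbar
rss = ((y - (alpha + beta * x)) ** 2).sum()      # unexplained variation (RSS)

F = (beta ** 2 * n * s_xx) / (rss / (n - 2))     # F-ratio (3.48)
se_beta = np.sqrt(rss / ((n - 2) * n * s_xx))    # SE(beta), matching (3.49)
t = beta / se_beta                               # t-statistic for beta

assert np.isclose(F, t ** 2)                     # (3.49): F equals t squared
```

The assertion holds exactly (up to floating-point error) for any data set, since (3.49) is an algebraic identity.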

Note, using (3.39) the $F$-statistic can be rewritten as

\begin{displaymath}F= \frac{r^2/1}{(1-r^2)/(n-2)}.\end{displaymath}

In the pullover Example 3.11, we obtain $F= \frac{0.028}{0.972}\cdot\frac{8}{1}= 0.2305$, so that the null hypothesis $\beta=0$ cannot be rejected. We conclude therefore that there is no significant influence of price on sales.
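This arithmetic is easy to verify with the values quoted in the text ($r^2 = 0.028$ and $n = 10$ from Example 3.11), for instance with a few lines of Python:

```python
# r^2 and n are taken from the pullover example in the text
r2, n = 0.028, 10
F = (r2 / 1) / ((1 - r2) / (n - 2))   # F = (r^2/1) / ((1-r^2)/(n-2))
print(round(F, 4))
# 0.2305
```

This is far below the 95% quantile of the $F_{1,8}$ distribution, so the null hypothesis is not rejected.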

Summary
$\ast$
Simple ANOVA models an output $Y$ as a function of one factor.
$\ast$
The reduced model is the hypothesis of equal means.
$\ast$
The full model is the alternative hypothesis of different means.
$\ast$
The $F$-test is based on a comparison of the sum of squares under the full and the reduced models.
$\ast$
The degrees of freedom are calculated as the number of observations minus the number of parameters.
$\ast$
The $F$-statistic is

\begin{displaymath}
F = \frac{ \{SS(\textrm{reduced}) - SS(\textrm{full})\} / \{ df(r) -
df(f)\}}{SS (\textrm{full}) / df(f)}.
\end{displaymath}

$\ast$
The $F$-test rejects the null hypothesis if the $F$-statistic is larger than the 95% quantile of the $F_{df(r)-df(f),df(f)}$ distribution.
$\ast$
The $F$-test statistic for the slope of the linear regression model $y_i = \alpha + \beta x_i + \varepsilon_i$ is the square of the $t$-test statistic.