I start with the simplest metamodel, namely a first-order polynomial with a single factor. An example is the 'Student' simulation in Sect. 3.2, where I now assume that we are interested only in the power, so the dependent variable in (3.4) now denotes the type II error predicted through the regression model. I further assume a single factor, say $z$, the 'relative' variability (i.e., the absolute variability corrected for sample size); see (3.4). Elementary mathematics proves that - to fit a straight line - it suffices to have two input/output observations; see 'local area 1' in Fig. 3.1. It is simple to prove that the 'best' estimators of the regression parameters in (3.4) result if those two input values are as far apart as 'possible'.
[Fig. 3.1: the single-factor example, showing 'local area 1', the larger 'local area 2', and the global experimental area.]
In practice, the analysts do not know over which experimental area a first-order polynomial is a 'valid' model. This validity depends on the goals of the simulation study; see Kleijnen and Sargent (2000).
So the analysts may start with a local area, and simulate the two (locally) extreme input values. Let's denote these two extreme values of the 'coded' variable $x$ by $-1$ and $+1$, which implies the following standardization of the original variable $z$:

$$x = \frac{z - (z_{\min} + z_{\max})/2}{(z_{\max} - z_{\min})/2} \qquad (3.5)$$
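To make this coding concrete, here is a minimal Python sketch of (3.5); the function name and the numerical bounds are mine, chosen only for illustration:

```python
import numpy as np

def code_factor(z, z_min, z_max):
    """Standardize an original factor value z into the coded value x of (3.5),
    so that z_min maps to -1 and z_max maps to +1."""
    center = (z_min + z_max) / 2.0      # midpoint of the experimental area
    half_range = (z_max - z_min) / 2.0  # half the length of that area
    return (z - center) / half_range

# Illustration with made-up bounds: the two extreme scenarios map to -1 and +1.
print(code_factor(np.array([0.2, 0.5, 0.8]), z_min=0.2, z_max=0.8))  # [-1. 0. 1.]
```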
The Taylor series argument implies that - as the experimental area gets bigger (see 'local area 2' in Fig. 3.1) - a better metamodel may be a second-order polynomial:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + e \qquad (3.6)$$
I emphasize that the second-order polynomial in (3.6) is nonlinear in $x$ (the regression variable), but linear in the $\beta$'s (the regression parameters or factor effects to be estimated). Consequently, such a polynomial is a type of linear regression model (also see Chap. II.8).
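To illustrate that linearity in the parameters is all that ordinary least squares needs, the following sketch fits the second-order polynomial (3.6); the data are synthetic and the coefficient values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 20)  # coded factor values
w = 2.0 + 1.5 * x - 0.8 * x**2 + rng.normal(scale=0.1, size=x.size)  # toy outputs

# The metamodel is nonlinear in x but linear in the betas, so ordinary
# least squares applies to the regression matrix [1, x, x^2].
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, w, rcond=None)
print(beta_hat)  # estimates of (beta_0, beta_1, beta_2)
```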
Finally, when the experimental area covers the whole area in which the simulation model is valid (see again Fig. 3.1), then other global metamodels become relevant. For example, Kleijnen and Van Beers (2003a) find that Kriging (discussed in Sect. 3.5) outperforms second-order polynomial fitting.
Note that Zeigler, Praehofer, and Kim (2000) call the experimental area the 'experimental frame'. I call it the domain of admissible scenarios, given the goals of the simulation study.
I conclude that the lessons learned from the simple example in Fig. 3.1 are:
Table 3.1. A one-factor-at-a-time design for two factors, and the corresponding regression variables

| Scenario | $x_0$ | $x_1$ | $x_2$ | $x_1x_2$ |
|----------|-------|-------|-------|----------|
| 1        | 1     | 0     | 0     | 0        |
| 2        | 1     | 1     | 0     | 0        |
| 3        | 1     | 0     | 1     | 0        |
Let's now consider a regression model with $k$ factors; for example, $k = 2$. The design that is still most popular - even though it is inferior - changes one factor at a time. For $k = 2$ such a design is shown in Fig. 3.2 and Table 3.1; in this table the factor values for the various factor combinations are shown in the columns denoted by $x_1$ and $x_2$; the 'dummy' column $x_0$ corresponds with the polynomial intercept $\beta_0$ in (3.4). In this design the analysts usually start with the 'base' scenario, denoted by the factor combination $(0, 0)$; see scenario 1 in the table. Next they run the two scenarios $(1, 0)$ and $(0, 1)$; see the scenarios 2 and 3 in the table.
In a one-factor-at-a-time design, the analysts cannot estimate the interaction between the two factors. Indeed, Table 3.1 shows that the estimated interaction (say) $\hat{\beta}_{1;2}$ is confounded with the estimated intercept $\hat{\beta}_0$; i.e., the columns for the corresponding regression variables are linearly dependent. (Confounding remains when the base values are denoted not by zero but by one; then these two columns become identical.)
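This confounding can be checked numerically. In the following sketch, the regression matrix implied by Table 3.1 has rank three, so the four parameters (intercept, two first-order effects, and the interaction) cannot all be estimated:

```python
import numpy as np

# The regression matrix implied by Table 3.1: columns x0, x1, x2, x1*x2.
x1 = np.array([0, 1, 0])
x2 = np.array([0, 0, 1])
X = np.column_stack([np.ones(3), x1, x2, x1 * x2])

# The interaction column is linearly dependent on the other columns
# (here it is all zeros), so beta_{1;2} cannot be estimated separately.
print(np.linalg.matrix_rank(X))  # 3, not 4
```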
In practice, analysts often study each factor at three levels (which may be denoted by $-1$, $0$, and $+1$) in their one-at-a-time design. However, two levels suffice to estimate the parameters of a first-order polynomial (see again Sect. 3.4.1).
To enable the estimation of interactions, the analysts must change factors simultaneously. An interesting problem arises if $k$ increases from two to three. Then Fig. 3.2 becomes Fig. 3.3, which does not show the output $w$, since that would require a fourth dimension (instead, $x_3$ replaces $w$); the asterisks are explained in Sect. 3.4.3. And Table 3.1 becomes Table 3.2. The latter table shows the $2^3$ factorial design; i.e., in the experiment each of the three factors has two values, and all their combinations of values are simulated. To simplify the notation, the table shows only the signs of the factor values, so $-$ means $-1$ and $+$ means $+1$. The table further shows the eight possible regression variables, using the symbols '0' through '1.2.3' to denote the indexes of the regression variables $x_0$ (the dummy, always equal to $+1$) through $x_1x_2x_3$ (third-order interaction). Further, I point out that each column is balanced; i.e., each column has four plusses and four minuses - except for the dummy column.
The $2^3$ design enables the estimation of all eight parameters of the following regression model, which is a third-order polynomial that is incomplete; i.e., some parameters are assumed zero:

$$y = \beta_0 + \sum_{j=1}^{3} \beta_j x_j + \sum_{j=1}^{2}\sum_{j'=j+1}^{3} \beta_{j;j'}\, x_j x_{j'} + \beta_{1;2;3}\, x_1 x_2 x_3 + e \qquad (3.7)$$

Indeed, the $2^3$ design implies a matrix of regression variables $\mathbf{X}$ that is orthogonal:

$$\mathbf{X}'\mathbf{X} = n\mathbf{I} \qquad (3.8)$$

where $n$ denotes the number of scenarios ($n = 8$ in this example).
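Both the balance of the columns and the orthogonality claim (3.8) are easy to verify numerically; a minimal sketch, building the eight scenarios and the eight regression variables of (3.7):

```python
import itertools
import numpy as np

# The 2^3 = 8 scenarios with coded factor levels -1 and +1.
design = np.array(list(itertools.product([-1, 1], repeat=3)))
x1, x2, x3 = design.T

# The eight regression variables of (3.7): dummy, first-order terms,
# two-factor interactions, and the third-order interaction.
X = np.column_stack([np.ones(8), x1, x2, x3,
                     x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])

print(np.allclose(X.T @ X, 8 * np.eye(8)))  # True: (3.8) holds with n = 8

# Under white noise with variance sigma_w^2, (3.10) then reduces to
# cov(beta_hat) = (sigma_w^2 / 8) * I.
```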
The covariance matrix of the (linear) OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{w}$ given by (3.9) is

$$\mathrm{cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\sigma_w^2 \qquad (3.10)$$

in case of white noise; i.e., the simulation outputs $w$ have a common variance $\sigma_w^2$ and are statistically independent. However, I claim that in practice this white noise assumption does not hold: the output variances change with the scenario, and the use of common pseudorandom numbers across scenarios makes the outputs correlated. Therefore I conclude that the analysts should choose between the following two options.
The variances of the estimated regression parameters - which are on the main diagonal of $\mathrm{cov}(\hat{\boldsymbol{\beta}})$ in (3.10) - can be used to test statistically whether some factors have zero effects. However, I emphasize that a significant factor may be unimportant - practically speaking. If the factors are scaled between $-1$ and $+1$ (see the transformation in (3.5)), then the estimated effects quantify the order of importance. For example, in a first-order polynomial regression model the factor estimated to be the most important is the one with the highest absolute value for its estimated effect. See Bettonvil and Kleijnen (1990).
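As a toy illustration of this ranking, the following sketch sorts hypothetical estimated effects (the numbers are made up) by absolute value:

```python
import numpy as np

# Hypothetical first-order effects, estimated from factors coded in [-1, +1].
beta_hat = np.array([0.3, -2.1, 0.9])        # effects of factors 1, 2, 3
ranking = np.argsort(-np.abs(beta_hat)) + 1  # factor labels, most important first
print(ranking)  # [2 3 1]: factor 2 has the largest absolute estimated effect
```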
The incomplete third-order polynomial in (3.7) included a third-order effect, namely $\beta_{1;2;3}$. Standard DOE textbooks include the definition and estimation of such high-order interactions. However, such high-order interactions are hard to interpret, and in practice they are often unimportant; so the analysts may simulate only a fraction of the $2^3$ scenarios, namely the four scenarios marked by asterisks in Fig. 3.3, which form a $2^{3-1}$ fractional factorial design. In this design two columns are identical, namely the $x_1x_2x_3$ column (with four plusses) and the dummy column. Hence, the corresponding two effects are confounded - but the high-order interaction $\beta_{1;2;3}$ is assumed zero, so this confounding can be ignored!
Sometimes a first-order polynomial suffices. For example, in the (sequential) optimization of black-box simulation models the analysts may use a first-order polynomial to estimate the local gradient; see Angün et al. (2002). Then it suffices to take a $2^{k-p}$ design with the biggest $p$ value that makes the following condition hold: $2^{k-p} \geq 1 + k$. An example is $k = 7$ and $p = 4$, so only $2^{7-4} = 8$ scenarios are simulated; see Table 3.3. This table shows that the first three factors (labeled 1, 2, and 3) form a full $2^3$ factorial design; the symbol '$4 = 1.2$' means that the values for factor 4 are selected by multiplying the elements of the columns for the factors 1 and 2. Note that the design is still balanced and orthogonal. Because of this orthogonality, it can be proven that the estimators of the regression parameters have smaller variances than one-factor-at-a-time designs give. How to select scenarios in $2^{k-p}$ designs is discussed in many DOE textbooks, including Kleijnen (1975, 1987).
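Such a $2^{7-4}$ design can be constructed and checked in a few lines. The sketch below assumes the standard generators $4 = 1.2$, $5 = 1.3$, $6 = 2.3$, and $7 = 1.2.3$; the text above confirms only $4 = 1.2$, so the others are a common choice from the DOE literature rather than a quotation of Table 3.3:

```python
import itertools
import numpy as np

# Full 2^3 factorial in the first three factors.
base = np.array(list(itertools.product([-1, 1], repeat=3)))
x1, x2, x3 = base.T

# Assumed generators: 4 = 1.2, 5 = 1.3, 6 = 2.3, 7 = 1.2.3.
D = np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])

# With the dummy column added, the 8 x 8 matrix is balanced and orthogonal,
# so all k + 1 = 8 first-order-polynomial parameters can be estimated.
X = np.column_stack([np.ones(8), D])
print(np.allclose(X.T @ X, 8 * np.eye(8)))  # True
```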
Actually, these designs - i.e., fractional factorial designs of the $2^{k-p}$ type with the biggest $p$ value still enabling the estimation of first-order polynomial regression models - are a subset of Plackett-Burman designs. The latter designs consist of $n$ scenarios with $n$ equal to $k + 1$ rounded upwards to a multiple of four; for example, if $k = 11$, then Table 3.4 applies. If $k = 7$, then the Plackett-Burman design is a $2^{7-4}$ fractional factorial design; see Kleijnen (1975, pp. 330-331). Plackett-Burman designs are tabulated in many DOE textbooks, including Kleijnen (1975). Note that designs for first-order polynomial regression models are called resolution III designs.
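As a sketch of the construction, the 12-run Plackett-Burman design (which handles up to $k = 11$ factors) can be generated from a single row by cyclic shifts; the generator row below is quoted from standard DOE tables, and the code verifies the resulting orthogonality at run time:

```python
import numpy as np

# First row of the 12-run Plackett-Burman design, as tabulated in the DOE
# literature; rows 2-11 are cyclic shifts of it, and row 12 is all minuses.
row = np.array([+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1])
D = np.array([np.roll(row, i) for i in range(11)] + [-np.ones(11, dtype=int)])

# Appending the dummy column gives a 12 x 12 orthogonal matrix: X'X = 12 I.
X = np.column_stack([np.ones(12), D])
print(np.allclose(X.T @ X, 12 * np.eye(12)))  # True
```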
Resolution IV designs enable unbiased estimators of first-order effects - even if two-factor interactions are important. These designs require double the number of scenarios required by resolution III designs; i.e., after simulating the scenarios of the resolution III design, the analysts simulate the mirror scenarios; i.e., they multiply by $-1$ the factor values in the original scenarios.
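The mirror-scenario (fold-over) construction is a one-liner; in this sketch I apply it to a $2^{3-1}$ resolution III fraction, which recovers the full $2^3$ factorial:

```python
import numpy as np

def fold_over(design):
    """Augment a resolution III design with its mirror scenarios:
    multiply all factor values in the original scenarios by -1."""
    return np.vstack([design, -design])

# The 2^(3-1) resolution III fraction with generator 3 = 1.2:
half = np.array([[-1, -1,  1],
                 [ 1, -1, -1],
                 [-1,  1, -1],
                 [ 1,  1,  1]])

full = fold_over(half)  # 8 scenarios: here, the complete 2^3 factorial
print(full.shape)       # (8, 3)
```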
Resolution V designs enable unbiased estimators of first-order effects plus all two-factor interactions. To this class belong certain $2^{k-p}$ designs with small enough $p$ values. These designs often require rather many scenarios to be simulated. Fortunately, there are also saturated designs; i.e., designs with the minimum number of scenarios that still allow unbiased estimators of the regression parameters. Saturated designs are attractive for expensive simulations; i.e., simulations that require relatively much computer time per scenario. Saturated resolution V designs were developed by Rechtschaffner (1967).
Central composite designs (CCDs) are meant for the estimation of second-order polynomials. These designs augment resolution V designs with the base scenario and with scenarios that change the factors one at a time; this changing increases and decreases each factor in turn. Saturated variants (smaller than CCDs) are discussed in Kleijnen (1987, pp. 314-316).
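A minimal sketch of this construction follows; for simplicity it uses a full two-level factorial instead of a resolution V fraction, and an axial distance of 1, both of which are illustrative choices only:

```python
import itertools
import numpy as np

def ccd(k, alpha=1.0):
    """Minimal central composite design sketch for k factors:
    a two-level factorial part, the base (center) scenario, and
    2k axial scenarios that increase and decrease each factor in turn.
    (For large k, practice replaces the full factorial part by a
    resolution V fraction; alpha = 1 is just an illustrative choice.)"""
    factorial = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
    center = np.zeros((1, k))
    axial = np.vstack([alpha * np.eye(k), -alpha * np.eye(k)])
    return np.vstack([factorial, center, axial])

print(ccd(3).shape)  # (15, 3): 8 factorial + 1 center + 6 axial scenarios
```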
The main conclusion is that incomplete designs for low-order polynomial regression are plentiful in both the classic DOE literature and the simulation literature. (The designs in the remainder of this chapter are more challenging.)
Most practical, non-academic simulation models have many factors; for example, Kleijnen et al. (2003b) experiment with a supply-chain simulation model with nearly one hundred factors. Even a Plackett-Burman design would then take roughly one hundred scenarios. Because each scenario needs to be replicated several times, the total computer time may then be prohibitive. For that reason, many analysts keep a lot of factors fixed (at their base values), and experiment with only a few remaining factors. An example is a military (agent-based) simulation that was run millions of times for just a few scenarios - changing only a few factors; see Horne and Leonardi (2001).
However, statisticians have developed designs that require even fewer than $k$ scenarios - called supersaturated designs; see Yamada and Lin (2002). Some designs aggregate the $k$ individual factors into groups of factors. It may then happen that the effects of individual factors cancel out, so the analysts would erroneously conclude that all factors within that group are unimportant. The solution is to define the $-1$ and $+1$ levels of the individual factors such that all first-order effects $\beta_j$ ($j = 1, \ldots, k$) are non-negative. My experience is that in practice the users do know the direction of the first-order effects of individual factors.
There are several types of group screening designs; for a recent survey including references, I refer to Kleijnen et al. (2003b). Here I focus on the most efficient type, namely Sequential Bifurcation designs.
This design type is so efficient because it proceeds sequentially. It starts with only two scenarios, namely one scenario with all individual factors at $-1$, and a second scenario with all factors at $+1$. Comparing the outputs of these two extreme scenarios requires only two replications, because the aggregated effect of the group factor is huge compared with the intrinsic noise (caused by the pseudorandom numbers). The next step splits - bifurcates - the factors into two groups. There are several heuristic rules to decide on how to assign factors to groups (again see Kleijnen et al. 2003b). Comparing the outputs of the third scenario with the outputs of the preceding scenarios enables the estimation of the aggregated effect of the individual factors within a group. Groups - and all their individual factors - are eliminated from further experimentation as soon as the group effect is statistically unimportant. Obviously, the groups get smaller as the analysts proceed sequentially. The analysts stop once the first-order effects of all the important individual factors are estimated. In their supply-chain simulation, Kleijnen et al. (2003b) classify only a small fraction of the factors as important. (Next, this shortlist of important factors is further investigated to find a robust solution.)
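The following is a deliberately simplified sketch of sequential bifurcation under the assumptions above (known effect signs, a first-order polynomial, small noise). The toy 'simulation', the threshold, and the halving rule for splitting groups are mine; a real application would reuse scenario outputs, replicate, and apply statistical tests as in Kleijnen et al. (2003b):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy 'simulation': a first-order polynomial in k = 16 individual factors,
# where all first-order effects are non-negative (a key SB assumption) and
# only factors 2 and 11 (0-based labels) are truly important.
beta = np.zeros(16)
beta[[2, 11]] = [4.0, 6.0]

def simulate(x):
    return beta @ x + rng.normal(scale=0.01)  # small intrinsic noise

def bifurcate(lo, hi, threshold=1.0):
    """Estimate the aggregated effect of the factor group lo..hi-1 by switching
    that group to +1 (all other factors at -1) and comparing against the
    all-minus scenario; keep splitting the group while its effect is important."""
    x_on = -np.ones(16)
    x_on[lo:hi] = 1.0
    x_off = -np.ones(16)
    group_effect = (simulate(x_on) - simulate(x_off)) / 2.0
    if group_effect < threshold:
        return []                # eliminate the whole group
    if hi - lo == 1:
        return [lo]              # an important individual factor
    mid = (lo + hi) // 2         # bifurcate: split the group into two halves
    return bifurcate(lo, mid, threshold) + bifurcate(mid, hi, threshold)

print(bifurcate(0, 16))  # [2, 11]: the shortlist of important factors
```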