
Latin Hypercube Sampling


Latin hypercube sampling

From Wikipedia, the free encyclopedia

Latin hypercube sampling (LHS) is a statistical method for generating a sample of plausible collections of parameter values from a multidimensional distribution. The sampling method is often applied in uncertainty analysis.

The technique was first described by McKay in 1979.[1] It was further elaborated by Ronald L. Iman, and others[2] in 1981. Detailed computer codes and manuals were later published.[3]

In the context of statistical sampling, a square grid containing sample positions is a Latin square if (and only if) there is only one sample in each row and each column. A Latin hypercube is the generalisation of this concept to an arbitrary number of dimensions, whereby each sample is the only one in each axis-aligned hyperplane containing it.

When sampling a function of N variables, the range of each variable is divided into M equally probable intervals. M sample points are then placed to satisfy the Latin hypercube requirements; note that this forces the number of divisions, M, to be equal for each variable. Also note that this sampling scheme does not require more samples for more dimensions (variables); this independence is one of the main advantages of this sampling scheme. Another advantage is that random samples can be taken one at a time, remembering which samples were taken so far.

The maximum number of combinations for a Latin hypercube of M divisions and N variables (i.e., dimensions) can be computed with the following formula:

$$\left(\prod_{n=0}^{M-1} (M - n)\right)^{N-1} = (M!)^{N-1}$$

For example, a Latin hypercube of M = 4 divisions with N = 2 variables (i.e., a square) will have 24 possible combinations. A Latin hypercube of M = 4 divisions with N = 3 variables (i.e., a cube) will have 576 possible combinations.
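The two counts quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the count for M divisions and N variables is (M!)^(N-1), which is consistent with the values 24 (square) and 576 (cube) given in the text; the function name `lhs_combinations` is hypothetical.

```python
from math import factorial


def lhs_combinations(m: int, n: int) -> int:
    """Number of Latin hypercube layouts for m divisions and n variables.

    Assumes the count is (m!)^(n-1): the first variable fixes the ordering,
    and each further variable contributes one free permutation of m intervals.
    """
    return factorial(m) ** (n - 1)


print(lhs_combinations(4, 2))  # 24  (a square)
print(lhs_combinations(4, 3))  # 576 (a cube)
```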

Orthogonal sampling adds the requirement that the entire sample space must be sampled evenly. Although more efficient, an orthogonal sampling strategy is more difficult to implement, since all random samples must be generated simultaneously.


In two dimensions the difference between random sampling, Latin Hypercube sampling and orthogonal sampling can be explained as follows:

1. In random sampling, new sample points are generated without taking into account the previously generated sample points. Thus one does not necessarily need to know beforehand how many sample points are needed.

2. In Latin Hypercube sampling one must first decide how many sample points to use and for each sample point remember in which row and column the sample point was taken.

3. In Orthogonal sampling, the sample space is divided into equally probable subspaces. All sample points are then chosen simultaneously making sure that the total ensemble of sample points is a Latin Hypercube sample and that each subspace is sampled with the same density.

Thus, orthogonal sampling ensures that the ensemble of random numbers is a very good representative of the real variability, and LHS ensures that the ensemble is representative of the real variability, whereas traditional random sampling (sometimes called brute force) is just an ensemble of random numbers without any guarantees.
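The two-dimensional orthogonal case described in item 3 can be sketched as follows. This is only an illustration under stated assumptions, not a reference implementation: the unit square is split into k x k equal blocks (the "subspaces"), k*k points are generated together, and each block receives exactly one point while every fine row and fine column is also used exactly once. The function name and the choice of the unit square are hypothetical.

```python
import random


def orthogonal_sample_2d(k, rng=None):
    """Return k*k points in the unit square (illustrative sketch).

    The square has k x k blocks and k*k fine rows and columns. Each block
    gets one point, and each fine row/column gets one point, so the result
    is also a Latin hypercube sample.
    """
    rng = rng or random.Random()
    n = k * k
    # For block column i, col_offsets[i][j] says which of its k fine columns
    # the block in block row j uses; a permutation uses each exactly once.
    col_offsets = [rng.sample(range(k), k) for _ in range(k)]
    # Same idea for block row j across the block columns i.
    row_offsets = [rng.sample(range(k), k) for _ in range(k)]

    points = []
    for i in range(k):          # block column
        for j in range(k):      # block row
            col = i * k + col_offsets[i][j]   # fine column index, 0..n-1
            row = j * k + row_offsets[j][i]   # fine row index, 0..n-1
            # Place the point uniformly inside its fine cell.
            points.append(((col + rng.random()) / n, (row + rng.random()) / n))
    return points


for x, y in orthogonal_sample_2d(3, random.Random(0)):
    print(f"{x:.3f} {y:.3f}")
```

All points have to be generated in one call because the permutations couple every block, which is the practical difficulty the text mentions.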

References

1. McKay, M.D.; Beckman, R.J.; Conover, W.J. (May 1979). "A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code" (JSTOR Abstract). Technometrics (American Statistical Association) 21 (2): 239–245. doi:10.2307/1268522. ISSN 0040-1706. JSTOR 1268522. OSTI 5236110.

2. Iman, R.L.; Helton, J.C.; Campbell, J.E. (1981). "An approach to sensitivity analysis of computer models, Part 1. Introduction, input variable selection and preliminary variable assessment". Journal of Quality Technology 13 (3): 174–183.

3. Iman, R.L.; Davenport, J.M.; Zeigler, D.K. (1980). Latin hypercube sampling (program user's guide). OSTI 5571631.


Further reading

Tang, B. (1993). "Orthogonal Array-Based Latin Hypercubes". Journal of the American Statistical Association 88 (424): 1392–1397. doi:10.2307/2291282. JSTOR 2291282.

Owen, A.B. (1992). "Orthogonal arrays for computer experiments, integration and visualization". Statistica Sinica 2: 439–452.

Ye, K.Q. (1998). "Orthogonal column Latin hypercubes and their application in computer experiments". Journal of the American Statistical Association 93 (444): 1430–1439. doi:10.2307/2670057. JSTOR 2670057.

.1.11 Latin Hypercube Design

Table 3.3 shows a Latin Hypercube design with three parameters. Here the values (A, B and C) correspond to three diffusion recipes and the parameters (p1 to p3) correspond to three furnaces. This results in a scheme where each recipe is tested once in each furnace. In practice, the furnaces should be assigned at random to the columns and the experiment numbers at random to the rows.

Table 3.3: Latin Hypercube design.

Experiment  p1  p2  p3
1           A   B   C
2           C   A   B
3           B   C   A

For numerical simulation this strategy is adapted slightly. It is very rare that the equipment is parametrized in enough detail for this kind of fine tuning to be done at the simulation level.

The way this method is used in simulation is for the column (in the example the furnace) to be replaced by a control parameter. The recipe is replaced by a subregion of the parameter and the row is again an experiment.

The number of experiments is equal to the number of subregions of each parameter range and can be larger than the number of control parameters.

The algorithm works in two loops:

1. For every parameter of an experiment, one of the m intervals is chosen at random, under the condition that each interval is taken only once across all m experiments.

2. For each parameter value, a random number is drawn from the selected interval.
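The two loops can be sketched in Python as follows. This is a minimal illustration, assuming equal-width intervals and a uniform draw inside each interval (the text does not state the distribution); the function name `latin_hypercube` and its arguments are hypothetical.

```python
import random


def latin_hypercube(m, bounds, rng=None):
    """Return m experiments, one value per parameter in `bounds`.

    bounds is a list of (low, high) pairs, one per control parameter.
    """
    rng = rng or random.Random()

    # Loop 1: for each parameter, assign one of the m intervals to every
    # experiment. A random permutation takes each interval exactly once.
    intervals = [rng.sample(range(m), m) for _ in bounds]

    # Loop 2: draw a random value inside each selected interval.
    design = []
    for experiment in range(m):
        row = []
        for p, (low, high) in enumerate(bounds):
            width = (high - low) / m
            start = low + intervals[p][experiment] * width
            row.append(start + rng.random() * width)
        design.append(row)
    return design


# Four experiments, four parameters, each ranging from -1 to 1.
for row in latin_hypercube(4, [(-1.0, 1.0)] * 4, random.Random(1)):
    print(" ".join(f"{value: .4f}" for value in row))
```

Because each column is built from its own permutation, the number of experiments m is independent of the number of parameters.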


Table 3.4 shows the output of a four-dimensional Latin Hypercube design in which each parameter ranges from -1 to 1.

Table 3.4: Latin Hypercube design.

Experiment  p1           p2           p3           p4
1            0.5669910   -0.2465637   -0.3702434   -0.7918264
2           -0.8699596   -0.7146524    0.1408818    0.8313969
3           -0.2900985    0.4635731    0.5991107    0.09391561
4            0.09718282   0.6744157   -0.7708488   -0.09295038

The values of input parameters p1 and p2 are shown in Figure 3.4. The small boxes represent the subsections of the control parameter ranges for p1 and p2, and the numbers represent evaluation points located in the subregion corresponding to Table 3.4.

Figure 3.4: Positions of the four design points in the subspace spanned by p1 and p2.
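The Latin Hypercube property of Table 3.4 can be checked numerically. The sketch below assumes that the range -1 to 1 is split into four equal-width intervals (one per experiment) and confirms that every parameter column visits each interval exactly once; the helper name `interval_index` is hypothetical.

```python
# Values copied from Table 3.4 (rows are experiments, columns are p1..p4).
table_3_4 = [
    [0.5669910, -0.2465637, -0.3702434, -0.7918264],
    [-0.8699596, -0.7146524, 0.1408818, 0.8313969],
    [-0.2900985, 0.4635731, 0.5991107, 0.09391561],
    [0.09718282, 0.6744157, -0.7708488, -0.09295038],
]


def interval_index(value, low=-1.0, high=1.0, m=4):
    """Index (0..m-1) of the equal-width interval that contains value."""
    return min(int((value - low) / (high - low) * m), m - 1)


# One list of interval indices per parameter column.
indices = [[interval_index(v) for v in column] for column in zip(*table_3_4)]

for name, column in zip(["p1", "p2", "p3", "p4"], indices):
    print(name, column, "Latin:", sorted(column) == [0, 1, 2, 3])
# p1 [3, 0, 1, 2] Latin: True
# p2 [1, 0, 2, 3] Latin: True
# p3 [1, 2, 3, 0] Latin: True
# p4 [0, 3, 2, 1] Latin: True
```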

Latin Hypercube Designs

In a Latin Hypercube, each factor has as many levels as there are runs in the design. The levels are spaced evenly from the lower bound to the upper bound of the factor. Like the sphere-packing method, the Latin Hypercube method chooses points to maximize the minimum distance between design points, but with a constraint. The constraint maintains the even spacing between factor levels.

Creating a Latin Hypercube Design

To use the Latin Hypercube method:

1. Select DOE > Space Filling Design.

2. Enter responses, if necessary, and factors. (See Enter Responses and Factors into the Custom Designer.)

3. Alter the factor level values, if necessary. For example, Space-Filling Dialog for Four Factors shows adding two factors to the two existing factors and changing their values to 1 and 8 instead of the default –1 and 1.

Space-Filling Dialog for Four Factors

4. Click Continue.

5. In the design specification dialog, specify a sample size (Number of Runs). This example uses a sample size of eight.

6. Click Latin Hypercube (see Space-Filling Design Dialog). Factor settings and design diagnostics results appear similar to those in Latin Hypercube Design for Four Factors and Eight Runs with Eight Levels, which shows the Latin Hypercube design with four factors and eight runs.

Note: The purpose of this example is to show that each column (factor) is assigned each level only once, and each column is a different permutation of the levels.

Latin Hypercube Design for Four Factors and Eight Runs with Eight Levels

Visualizing the Latin Hypercube Design


To visualize the nature of the Latin Hypercube technique, create an overlay plot, adjust the plot’s frame size, and add circles using the minimum distance from the diagnostic report as the radius for the circle.

1. First, create another Latin Hypercube design using the default X1 and X2 factors.

2. Be sure to change the factor values so they are 0 and 1 instead of the default –1 and 1.

3. Click Continue.

4. Specify a sample size of eight (Number of Runs).

5. Click Latin Hypercube. Factor settings and design diagnostics are shown in Latin Hypercube Design with two Factors and Eight Runs.

Latin Hypercube Design with two Factors and Eight Runs

6. Click Make Table.

7. Select Graph > Overlay Plot.

8. Specify X1 as X and X2 as Y, then click OK.

9. Right-click the plot and select Size/Scale > Size to Isometric to adjust the frame size so that the frame is square.

10. Right-click the plot, select Customize from the menu. In the Customize panel, click the large plus sign to see a text edit area, and enter the following script:

For Each Row(Circle({:X1, :X2}, 0.404/2))

where 0.404 is the minimum distance number you noted in the Design Diagnostics panel (Latin Hypercube Design with two Factors and Eight Runs). This script draws a circle centered at each design point with radius 0.202 (half the diameter, 0.404), as shown on the left in Comparison of Latin Hypercube Designs with Eight Runs (left) and 10 Runs (right). This plot shows the efficient way JMP packs the design points.

11. Repeat the above procedure exactly, but with 10 runs instead of eight (step 5). Remember to change 0.404 in the graphics script to the minimum distance produced by 10 runs.

You should see a graph similar to the one on the right in Comparison of Latin Hypercube Designs with Eight Runs (left) and 10 Runs (right). Note the irregular nature of the sphere packing. In fact, you can repeat the process to get a slightly different picture because the arrangement is dependent on the random starting point.

Comparison of Latin Hypercube Designs with Eight Runs (left) and 10 Runs (right)


Note that the minimum distance between each pair of points in the Latin Hypercube design is smaller than that for the Sphere-Packing design. This is because the Latin Hypercube design constrains the levels of each factor to be evenly spaced. The Sphere-Packing design maximizes the minimum distance without any constraints.
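The idea of maximizing the minimum distance while keeping factor levels evenly spaced can be illustrated outside JMP. The sketch below is not JMP's optimizer: it assumes a plain random search over permutation-based designs on the unit interval and keeps the candidate with the largest minimum pairwise distance. All function names are hypothetical.

```python
import itertools
import math
import random


def min_distance(points):
    """Smallest Euclidean distance between any pair of design points."""
    return min(math.dist(a, b) for a, b in itertools.combinations(points, 2))


def random_lhs_levels(runs, factors, rng):
    """One design whose factors each use the evenly spaced levels once."""
    levels = [i / (runs - 1) for i in range(runs)]   # 0, ..., 1
    columns = [rng.sample(levels, runs) for _ in range(factors)]
    return list(zip(*columns))                        # rows are design points


def maximin_lhs(runs, factors, tries=2000, rng=None):
    """Random-search sketch: keep the candidate with the largest min distance."""
    rng = rng or random.Random()
    candidates = (random_lhs_levels(runs, factors, rng) for _ in range(tries))
    return max(candidates, key=min_distance)


design = maximin_lhs(runs=8, factors=2, rng=random.Random(0))
print("minimum distance:", round(min_distance(design), 3))
```

Because the levels are fixed, the search can only reorder them, which is why such a design cannot reach the minimum distance of an unconstrained sphere-packing design.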

Monte Carlo Simulation Basics

Part 1 of "A Practical Guide to Monte Carlo Simulation", by Jon Wittwer, PhD


A Monte Carlo method is a technique that involves using random numbers and probability to solve problems. The term Monte Carlo Method was coined by S. Ulam and Nicholas Metropolis in reference to games of chance, a popular attraction in Monte Carlo, Monaco (Hoffman, 1998; Metropolis and Ulam, 1949).

Computer simulation has to do with using computer models to imitate real life or make predictions. When you create a model with a spreadsheet like Excel, you have a certain number of input parameters and a few equations that use those inputs to give you a set of outputs (or response variables). This type of model is usually deterministic, meaning that you get the same results no matter how many times you re-calculate. [ Example 1 : A Deterministic Model for Compound Interest ]

Figure 1: A parametric deterministic model maps a set of input variables to a set of output variables.


Monte Carlo simulation is a method for iteratively evaluating a deterministic model using sets of random numbers as inputs. This method is often used when the model is complex, nonlinear, or involves more than just a couple of uncertain parameters. A simulation can typically involve over 10,000 evaluations of the model, a task which in the past was only practical using supercomputers.

Example 2: A Stochastic Model

By using random inputs, you are essentially turning the deterministic model into a stochastic model. Example 2 demonstrates this concept with a very simple problem.

[ Example 2: A Stochastic Model for a Hinge Assembly ]

In Example 2, we used simple uniform random numbers as the inputs to the model. However, a uniform distribution is not the only way to represent uncertainty. Before describing the steps of the general MC simulation in detail, a little word about uncertainty propagation:

The Monte Carlo method is just one of many methods for analyzing uncertainty propagation, where the goal is to determine how random variation, lack of knowledge, or error affects the sensitivity, performance, or reliability of the system that is being modeled.

Monte Carlo simulation is categorized as a sampling method because the inputs are randomly generated from probability distributions to simulate the process of sampling from an actual population. So, we try to choose a distribution for the inputs that most closely matches data we already have, or best represents our current state of knowledge. The data generated from the simulation can be represented as probability distributions (or histograms) or converted to error bars, reliability predictions, tolerance zones, and confidence intervals. (See Figure 2).


Uncertainty Propagation

Figure 2: Schematic showing the principle of stochastic uncertainty propagation. (The basic principle behind Monte Carlo simulation.)

If you have made it this far, congratulations! Now for the fun part! The steps in Monte Carlo simulation corresponding to the uncertainty propagation shown in Figure 2 are fairly simple, and can be easily implemented in Excel for simple models. All we need to do is follow the five simple steps listed below:

Step 1: Create a parametric model, y = f(x1, x2, ..., xq).

Step 2: Generate a set of random inputs, x_i1, x_i2, ..., x_iq.

Step 3: Evaluate the model and store the results as y_i.

Step 4: Repeat steps 2 and 3 for i = 1 to n.

Step 5: Analyze the results using histograms, summary statistics, confidence intervals, etc.
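The five steps above can be sketched in a few lines of Python instead of a spreadsheet. The model y = x1 + 2*x2 and the uniform input ranges are placeholders chosen only for illustration; they are assumptions and do not come from the guide.

```python
import random
import statistics


def model(x1, x2):
    """Step 1: a hypothetical parametric model y = f(x1, x2)."""
    return x1 + 2.0 * x2


def monte_carlo(f, samplers, n, rng):
    """Run n evaluations of f with inputs drawn from the given samplers."""
    results = []
    for _ in range(n):                               # Step 4: i = 1 to n
        inputs = [draw(rng) for draw in samplers]    # Step 2: random inputs
        results.append(f(*inputs))                   # Step 3: evaluate, store
    return results


rng = random.Random(42)
# Assumed input distributions: both inputs uniform on [0, 1].
samplers = [lambda r: r.uniform(0.0, 1.0), lambda r: r.uniform(0.0, 1.0)]
ys = monte_carlo(model, samplers, 10_000, rng)

# Step 5: analyze the results with summary statistics.
print("mean :", round(statistics.mean(ys), 3))
print("stdev:", round(statistics.stdev(ys), 3))
```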

On to an example problem

Sales Forecasting Example

Part 2 of "A Practical Guide to Monte Carlo Simulation", by Jon Wittwer, PhD


Our example of Monte Carlo simulation in Excel will be a simplified sales forecast model. Each step of the analysis will be described in detail.

The Scenario: Company XYZ wants to know how profitable it will be to market their new gadget, realizing there are many uncertainties associated with market size, expenses, and revenue.

The Method: Use a Monte Carlo Simulation to estimate profit and evaluate risk.

You can download the example spreadsheet by following the instructions below. You will probably want to refer to the spreadsheet occasionally as we proceed with this example.

MCExample_SalesForecast.zip
File Size: ~420 kB
Requirements: Excel 97 or Later
No Macros Used

Download the Sales Forecast Example (.zip)

Download Instructions:

1. Download: Click on the download link or image to the left and save the file on your computer. Depending on your browser and operating system, you may need to "right-click" on the link and select "Save Target As..."

2. Unzip: You will need to extract the Excel file from the .zip archive. Most computers come with utilities for extracting ZIP files.

3. Open in Excel: After extracting the spreadsheet from the .zip file, open the .xls file in Excel. If calculation is too slow on your machine, you may need to remove or disable the COUNTIF formulas (used in the cumulative probability calculations).

Step 1: Creating the Model

We are going to use a top-down approach to create the sales forecast model, starting with:

Profit = Income - Expenses

Both income and expenses are uncertain parameters, but we aren't going to stop here, because one of the purposes of developing a model is to try to break the problem down into more fundamental quantities. Ideally, we want all the inputs to be independent. Does income depend on expenses? If so, our model needs to take this into account somehow.

We'll say that Income comes solely from the number of sales (S) multiplied by the profit per sale (P) resulting from an individual purchase of a gadget, so Income = S*P. The profit per sale takes into account the sale price, the initial cost to manufacture or purchase the product wholesale, and other transaction fees (credit cards, shipping, etc.). For our purposes, we'll say that P may fluctuate between $47 and $53.

We could just leave the number of sales as one of the primary variables, but for this example, Company XYZ generates sales through purchasing leads. The number of sales per month is the number of leads per month (L) multiplied by the conversion rate (R) (the percentage of leads that result in sales). So our final equation for Income is:


Income = L*R*P

We'll consider the Expenses to be a combination of fixed overhead (H) plus the total cost of the leads. For this model, the cost of a single lead (C) varies between $0.20 and $0.80. Based upon some market research, Company XYZ expects the number of leads per month (L) to vary between 1200 and 1800. Our final model for Company XYZ's sales forecast is:

Profit = L*R*P - (H + L*C)

Y = Profit
X1 = L
X2 = C
X3 = R
X4 = P

Notice that H is also part of the equation, but we are going to treat it as a constant in this example. The inputs to the Monte Carlo simulation are just the uncertain parameters (Xi).
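For readers who prefer code to a spreadsheet, the model Profit = L*R*P - (H + L*C) can be sketched as below. The ranges for L, C and P are the ones stated in the text; the uniform distributions, the range used for R, and the value of H are assumptions made only so the sketch runs, since this section does not give them.

```python
import random
import statistics

H_ASSUMED = 800.0            # fixed overhead: placeholder value (assumption)
R_RANGE_ASSUMED = (0.01, 0.05)   # conversion rate range: placeholder (assumption)


def profit(L, C, R, P, H=H_ASSUMED):
    """Profit = L*R*P - (H + L*C), as defined in the text."""
    return L * R * P - (H + L * C)


def simulate_profit(n, rng):
    """Draw n sets of inputs and return the resulting profits."""
    profits = []
    for _ in range(n):
        L = rng.uniform(1200, 1800)        # leads per month (range from text)
        C = rng.uniform(0.20, 0.80)        # cost per lead (range from text)
        R = rng.uniform(*R_RANGE_ASSUMED)  # conversion rate (assumed range)
        P = rng.uniform(47, 53)            # profit per sale (range from text)
        profits.append(profit(L, C, R, P))
    return profits


profits = simulate_profit(10_000, random.Random(7))
print("mean profit:", round(statistics.mean(profits), 2))
print("share of runs with a loss:",
      round(sum(p < 0 for p in profits) / len(profits), 3))
```

Only the uncertain parameters are sampled; H enters as a constant, matching the treatment described above.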

This is not a comprehensive treatment of modeling methods, but I used this example to demonstrate an important concept in uncertainty propagation, namely correlation. After breaking Income and Expenses down into more fundamental and measurable quantities, we found that the number of leads (L) affected both income and expenses. Therefore, income and expenses are not independent. We could probably break the problem down even further, but we won't in this example. We'll assume that L, R, P, H, and C are all independent.
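The dependence between income and expenses can be made visible with a small simulation. This sketch draws L, R, P and C independently, then computes the sample correlation between Income = L*R*P and Expenses = H + L*C. The uniform distributions, the range for R, and the value of H are assumptions for illustration, so the size of the correlation printed here is not a result from the guide.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def income_and_expenses(n, rng, H=800.0):
    """Return paired lists of income and expenses from independent inputs."""
    incomes, expenses = [], []
    for _ in range(n):
        L = rng.uniform(1200, 1800)   # leads per month (range from text)
        R = rng.uniform(0.01, 0.05)   # conversion rate (assumed range)
        P = rng.uniform(47, 53)       # profit per sale (range from text)
        C = rng.uniform(0.20, 0.80)   # cost per lead (range from text)
        incomes.append(L * R * P)
        expenses.append(H + L * C)    # H is an assumed constant
    return incomes, expenses


incomes, expenses = income_and_expenses(20_000, random.Random(3))
# Positive, because L appears in both quantities even though the inputs
# themselves are independent.
print("correlation:", round(pearson(incomes, expenses), 3))
```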

Note: In my opinion, it is easier to decompose a model into independent variables (when possible) than to try to mess with correlation between random inputs.