
Dr James Elliott

Model-fitting and Data Analysis

Graduate Lecture Course

Michaelmas Term 2006

Lecture 1 (of 2)

13/11/2006

Copyright © 2002 University of Cambridge. Not to be quoted or copied without permission.

Introduction

Response to a need identified by the Graduate School Committee for some formal instruction to graduates on the subject of data handling and analysis

During your PhDs (and future research careers!) you will need to be able to handle large volumes of complex data, and extract information to support your scientific hypotheses


Reading list for lecture 1

G. E. P. Box, W. G. Hunter, and J. S. Hunter, Statistics for experimenters (Wiley, New York, 1978). [UL Classmark 202.c.97.667]

G. L. Squires, Practical physics (CUP, New York, 1985). [UL Classmark 352:1.b.200.4, MSM Classmark LcZ26]


Some software packages for data analysis

Microsoft Excel (Office 2000 version)
– Excel 2003 available from UCS (£32.00 + media, ID 6924)
– Should be readily available on most laboratory computers

MathCad Professional (Mathsoft 2001 version)
– Version 12 available from UCS (£44.50 + media, ID 7208)
– Less common, but useful for manipulating units
– Also capable of limited symbolic algebra via Maple

gnuplot (MS Windows version 3.7)
– Free of charge!
– Available from http://www.gnuplot.info/

Other packages that we will not discuss
– IDL, PV-WAVE, Semper, Fit2D, ImageJ, Mathematica, MatLab…


Summary of today’s lecture

1. Basic data handling and recording
1.1 Numerical hygiene
1.2-1.4 Units and dimensional analysis
1.5-1.6 Distinguishing and estimating magnitude of errors

2. Quantitative treatment of random and systematic errors
2.1 Refresher on continuous and discrete probability
2.2 Central Limit Theorem and Gaussian distribution
2.3 Standard error, and confidence limits
2.4 Theory of small changes and errors, combining random errors

3. Fitting models to experimental and simulated data
3.1 Graphical methods
3.2 Linear least-squares regression analysis
3.3 Fitting data to more complex functional forms


Look-ahead to next week’s lecture

3. Fitting models to experimental and simulated data
3.3 The χ² test for goodness of fit

4. Maximum likelihood methods
4.1 The maximum likelihood principle
4.2 Bayes’ theorem and Jaynes’ principle

5. Artificial Neural Networks (ANNs)
5.1 Structure of simple ANNs
5.2 Training of ANNs using back propagation and Bayesian methods
5.3 Applications of ANNs

6. Design of Experiments (DoE) method
6.1 Experimental design
6.2 Identification of variable factors and objectives
6.3 Response contour plots


1.1 Numerical “hygiene”

Numerical “hygiene” – what does this mean?

The vast majority of your data will be in the form of numbers, often recorded in digital format

The way in which these are recorded should reflect your confidence in their precision – zeroes to the right of the decimal point have a significance in science!

For example, 1.000 carries an implicit assumption that the quantity is known to at least 4 significant figures

Similarly, writing 8.230842372190 … , or however many decimal places your calculator runs to, is unjustified if you cannot be sure about the magnitude of any error

Thinking carefully will improve the quality of your science!


1.2 Units and dimensional analysis

In addition to their numerical magnitude, physical quantities must also possess well-defined units: [m], [m s⁻¹], etc.

Without units, a quantity can have no physical meaning!

As well as being essential for meaning, units can also be very helpful for keeping track of how physical quantities are combined – this is called dimensional analysis

In particular, physical quantities with differing units should not be added to or subtracted from one another

Also, you should not take the logarithm of, exponentiate, or otherwise evaluate a function of quantities with units

Quantities must be non-dimensionalised before they are used in such functions or combined with others of a different type


1.3.1 SI prefixes and base units

The Système International d’Unités (SI) has a set of seven base units from which all others are derived

Base quantity    Unit name    Unit symbol
length           metre        m
mass             kilogram     kg
time             second       s
current          ampere       A
temperature      kelvin       K
amount           mole         mol
intensity        candela      cd



Other quantities, called derived quantities, are defined in terms of the seven base quantities via a system of quantity equations

For ease of understanding and convenience, twenty-two SI derived units are given special names and symbols

For example, newton (force) [m kg s⁻²], joule (energy) [m² kg s⁻² = N m], coulomb (charge) [A s], etc. (for the others, see link to NIST website below)

A couple of common mistakes regarding units from personal experience:
– never refer to ‘degrees’ K, since kelvin is an absolute scale
– some quantities with the same units are not equivalent, e.g. torque and energy [N m], pressure and energy density [N m⁻² = Pa]

[1] http://physics.nist.gov/cuu/Units/units.html


1.3.4 Relationships among the SI units

[1] http://physics.nist.gov/cuu/Units/units.html



There are 20 SI prefixes used to form decimal multiples and submultiples of SI units

Symbol  Name   Factor    Symbol  Name   Factor
d       deci   10⁻¹      Y       yotta  10²⁴
c       centi  10⁻²      Z       zetta  10²¹
m       milli  10⁻³      E       exa    10¹⁸
µ       micro  10⁻⁶      P       peta   10¹⁵
n       nano   10⁻⁹      T       tera   10¹²
p       pico   10⁻¹²     G       giga   10⁹
f       femto  10⁻¹⁵     M       mega   10⁶
a       atto   10⁻¹⁸     k       kilo   10³
z       zepto  10⁻²¹     h       hecto  10²
y       yocto  10⁻²⁴     da      deka   10¹



It is important to note that the kilogram is the only SI base unit with a prefix as part of its name and symbol

Because the SI prefixes strictly represent powers of 10, they should not be used to represent powers of 2

In order to resolve this problem, in 1998 a new set of prefixes for binary multiples was introduced by the International Electrotechnical Commission (IEC)

However, it should be noted that binary multiples are not part of the International System of Units (SI), and have yet to enjoy widespread usage or recognition

So, beware when buying your new hard drive, computer memory, or broadband connection!
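To make the hard-drive warning concrete, here is a minimal Python sketch comparing decimal (SI) and binary (IEC) multiples. The dictionaries and the helper `si_to_iec` are names invented for this illustration; the 500 GB drive is a made-up example.

```python
# Decimal (SI) prefixes are powers of 10; binary (IEC) prefixes are powers of 2.
SI = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
IEC = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def si_to_iec(value, si_prefix, iec_prefix):
    """Convert a count quoted with an SI prefix into IEC binary multiples."""
    return value * SI[si_prefix] / IEC[iec_prefix]


# A drive sold as "500 GB" (decimal) holds fewer than 500 GiB (binary).
drive_gib = si_to_iec(500, "G", "Gi")
print(f"500 GB = {drive_gib:.1f} GiB")
```

The gap between the two conventions grows with the size of the prefix, which is why it is most noticeable for large storage devices.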


1.4.1 Dimensional analysis

Let’s look at an example with MathCad

MathCad will not let you combine quantities with different units – subtracting a force from a power in this case


1.4.2 Dimensional analysis

What about if we have a formula combining together quantities with different units?

MathCad will calculate the units of resulting expression
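The MathCad screens are not reproduced here, but the same behaviour can be sketched in Python. The `Quantity` class below is a hypothetical, minimal illustration (it tracks only exponents of metre, kilogram and second), not a substitute for a real units library: subtraction of mismatched units is refused, while multiplication works out the units of the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of (m, kg, s)

    def _check(self, other):
        if self.dims != other.dims:
            raise TypeError(f"incompatible units: {self.dims} vs {other.dims}")

    def __add__(self, other):
        self._check(other)
        return Quantity(self.value + other.value, self.dims)

    def __sub__(self, other):
        self._check(other)
        return Quantity(self.value - other.value, self.dims)

    def __mul__(self, other):
        # Multiplying quantities adds the exponents of their units
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)

    def __truediv__(self, other):
        # Dividing quantities subtracts the exponents of their units
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)


force = Quantity(10.0, (1, 1, -2))  # 10 N = 10 kg m s^-2
speed = Quantity(3.0, (1, 0, -1))   # 3 m s^-1
power = force * speed               # units come out as kg m^2 s^-3, i.e. W

try:
    force - power                   # subtracting a power from a force
    mismatch_caught = False
except TypeError:
    mismatch_caught = True

print(power, mismatch_caught)
```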


1.4.3 Applying dimensional analysis

You can apply dimensional analysis to your own formulae, either to check those you have derived for consistency, or to determine the units of quantities you are unsure of

If the units of d and t are length and time, respectively, are the following valid expressions?

[Three example expressions involving P, t, E, v, d, F, V, ω and k_B T; not recoverable from the extracted slide]

If Re is dimensionless, what are the units of viscosity η?

$$ R_e = \frac{\rho v d}{\eta} $$
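As a worked sketch of the last question, take the Reynolds number as Re = ρvd/η and assume ρ is a mass density, v a speed and d a length (the slide does not state these explicitly). Requiring Re to be dimensionless then gives:

$$ [\eta] = [\rho]\,[v]\,[d] = \left(\mathrm{kg\,m^{-3}}\right)\left(\mathrm{m\,s^{-1}}\right)\left(\mathrm{m}\right) = \mathrm{kg\,m^{-1}\,s^{-1}} = \mathrm{Pa\,s} $$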


1.5.1 Distinguishing experimental errors

Contrary to what you may have learned at university or high school, an error is not the difference between your answer and the solution to a problem in the back of a textbook or examiner’s handbook!

In fact, an error is a measure of the confidence that you have in the quantity you have measured or determined from some experiment or simulation

Just as physical quantities are not complete without units, so experimental data are not meaningful without a quoted error, even if this is just an estimate

For example, (6.50 ± 0.05)×10⁻¹¹ m³ kg⁻¹ s⁻² could change the laws of physics, but (6.5 ± 0.3)×10⁻¹¹ m³ kg⁻¹ s⁻² is just a badly designed experiment
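The point of the gravitational-constant example can be checked with a few lines of Python. The reference value of G used below (about 6.674×10⁻¹¹ m³ kg⁻¹ s⁻²) is an assumption brought in for illustration; it is not quoted on the slide, and `discrepancy_in_sigma` is a hypothetical helper.

```python
G_ACCEPTED = 6.674e-11  # m^3 kg^-1 s^-2, approximate accepted value (assumed, not from the slide)


def discrepancy_in_sigma(value, error, reference=G_ACCEPTED):
    """Distance between a measurement and a reference, in units of the quoted error."""
    return abs(value - reference) / error


precise = discrepancy_in_sigma(6.50e-11, 0.05e-11)  # small quoted error
rough = discrepancy_in_sigma(6.5e-11, 0.3e-11)      # large quoted error

print(f"(6.50 +/- 0.05)e-11 lies {precise:.1f} quoted errors from the reference")
print(f"(6.5  +/- 0.3 )e-11 lies {rough:.1f} quoted errors from the reference")
```

The same central value is several quoted errors away from the reference in the first case, but well within one quoted error in the second, which is why only the first would be a surprising result.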


1.5.2 Distinguishing experimental errors

It is conventional to classify experimental errors into two different categories:

– Random errors : those causing readings to be scattered randomly about some mean value corresponding to the ‘true’ quantity

– Systematic errors : those which are constant throughout a set of readings (although not necessarily between different experiments) leading to systematic deviation from the ‘true’ value

Most experiments contain both types of error, which are defined according to the effect that they produce

It is convenient to make a distinction between the words accurate and precise in the context of discussing errors

A result is said to be accurate if it is relatively free from systematic error, or precise if the random error is small

[1] G. L. Squires, Practical physics (CUP, New York, 1985).


1.6 Estimating magnitude of errors

Random errors are always present in any experiment, and their effect may be reduced by taking repeated readings, with the mean of the readings approaching the true value

They can be estimated with statistical methods, as we shall discover shortly

Systematic errors are more insidious, and arise because of physical effects (desired or otherwise) which change the results of the measurement

Repeated measurements neither reveal nor eliminate systematic errors, and there is no general method for detecting their presence

Your best course is to be familiar with common sources of systematic error, and gain experience with the apparatus


2.0 Quantitative treatment of errors

The quantitative treatment of random errors is based on the statistical theory of discrete distributions

Essentially, we are saying that the true value of a physical quantity has a certain probability of lying within a given range of values

Depending on what we want our experiment or model to achieve, we will demand a more or less precise answer (we always strive for accuracy!)

In a world of finite time and resources, we need to make sure that we are not measuring quantities either too roughly, or too precisely

We therefore devote some time to discrete distributions


2.1.1 Properties of discrete distributions

Let’s say we have a set of n measurements denoted by:

$$ x_i = x_1, x_2, x_3, \ldots, x_n $$

Then the arithmetic and geometric mean values of this set are defined as follows:

$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad x_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n} $$

A histogram plot gives a distribution curve, which in this case takes discrete values
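As a small illustration of the two means, the following Python sketch evaluates both for a made-up set of five readings (the data are hypothetical).

```python
import math

x = [9.8, 10.1, 10.0, 9.9, 10.2]  # hypothetical repeated readings
n = len(x)

arithmetic_mean = sum(x) / n
geometric_mean = math.prod(x) ** (1.0 / n)

print(f"arithmetic mean = {arithmetic_mean:.4f}")
print(f"geometric mean  = {geometric_mean:.4f}")
```

For positive data the geometric mean never exceeds the arithmetic mean, and the two are close when the scatter is small compared with the mean.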



Now, assuming there are no systematic errors, the deviation between each datum and the true value X is the random error in that measurement:

$$ e_i = x_i - X $$

We then define the root-mean-square (r.m.s.) of the errors to be the standard deviation of the distribution:

$$ \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - X\right)^2}, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X\right)^2 $$

This is equivalent to the standard error in each datum

The quantity σ² is known as the variance, and both s.d. and variance are measures of the distribution width



One way of reducing the standard error is to take samples of fixed size n from the distribution, and calculate the standard error in the mean, σ_m, from each set

Averaging σ_m over many samples, it can be shown that it is related to the standard deviation of the distribution and the sample size in the following way:

$$ \sigma_m = \frac{\sigma}{\sqrt{n}} $$

Hence, the standard error in the mean of n measurements is √n times smaller than the standard error in a single observation

This comes from the fact that individual errors are independent and average to zero [1]




At first sight, calculating the random error in our data seems straightforward – however, remember we don’t know the true value of X in advance!

Hence, we can only estimate σ from the residuals of the measurements, which are defined as follows:

$$ d_i = x_i - \bar{x} $$

Notice that the mean of x is used here as an unbiased estimator for the true value X. Unlike errors, the residuals are known

We define the s.d. and variance of the sample as follows:

$$ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 $$



Now, to link s and σ, we must consider what happens if we average s² over a large number of samples drawn from the distribution

The result we obtain when using the residuals is:

$$ \langle s^2 \rangle = \sigma^2 - \sigma_m^2 $$

Hence, combining this with our previous formula for σ_m:

$$ \sigma^2 = \frac{n}{n-1}\langle s^2 \rangle, \qquad \sigma_m^2 = \frac{1}{n-1}\langle s^2 \rangle $$

However, ⟨s²⟩ is not known, so we must use s²


2.2.1 Quantitative estimate for random errors

Using our value for s² (which is a biased estimator for σ²), we obtain the following estimates for the standard error in each datum and in the mean

$$ \sigma \approx \left(\frac{n}{n-1}\right)^{1/2} s, \qquad \sigma_m \approx \left(\frac{1}{n-1}\right)^{1/2} s $$

Notice that σ and s become identical as n → ∞, in other words, s becomes an unbiased estimator for the standard error in the limit of large sample size

The different standard errors are evaluated with separate functions in Excel: STDEV (for σ) and STDEVP (for s)

Make sure you know the difference!
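The distinction between the two Excel functions has a direct analogue in Python’s standard library, sketched below with hypothetical readings: `statistics.pstdev` divides by n (like STDEVP, giving s), while `statistics.stdev` divides by n − 1 (like STDEV, giving the estimate of σ).

```python
import math
import statistics

x = [9.8, 10.1, 10.0, 9.9, 10.2]  # hypothetical repeated readings
n = len(x)

s = statistics.pstdev(x)        # divides by n       (cf. Excel STDEVP)
sigma = statistics.stdev(x)     # divides by n - 1   (cf. Excel STDEV)
sigma_m = sigma / math.sqrt(n)  # standard error in the mean

print(f"s       = {s:.4f}")
print(f"sigma   = {sigma:.4f}")
print(f"sigma_m = {sigma_m:.4f}")
print(f"result: {statistics.fmean(x):.2f} +/- {sigma_m:.2f}")
```

For a sample this small the two estimates differ by more than ten per cent, so it matters which one is quoted.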



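The variance can be expressed in a slightly different way, as follows:

$$
\begin{aligned}
s^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \\
    &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 + \bar{x}^2 - 2 x_i \bar{x}\right) \\
    &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 + \bar{x}^2 - 2\bar{x}\,\frac{1}{n}\sum_{i=1}^{n} x_i \\
    &= \overline{x^2} - \bar{x}^2
\end{aligned}
$$

Hence

$$ \sigma \approx \left(\frac{n}{n-1}\right)^{1/2}\left(\overline{x^2} - \bar{x}^2\right)^{1/2} $$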




2.3.1 Discrete probability distributions

A discrete probability distribution is just a frequency distribution (histogram) normalised such that the total area under the curve sums to unity

Two common discrete probability distributions are the Binomial and Poisson distributions [1]

As the number of samples n increases, there is a tendency towards a continuous distribution



2.3.2 Discrete probability distributions

Binomial – characterised by mean Np, variance Np(1–p)

Arises from N independent events, each with just two outcomes of probability p and (1–p), respectively

$$ P(n) = \frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n} $$

Poisson – characterised only by mean a = Np

Arises from counting events that arrive at a constant average rate, and as the limit of the binomial when N → ∞ and p → 0 (with Np fixed)

$$ P(n) = \frac{a^n}{n!}\exp(-a) $$
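The Poisson limit of the binomial distribution can be seen numerically. The Python sketch below evaluates both formulae for a hypothetical case with large N and small p; the function names are chosen for this illustration only.

```python
import math


def binomial_pmf(n, N, p):
    """P(n) = N! / (n! (N - n)!) * p^n * (1 - p)^(N - n)."""
    return math.comb(N, n) * p**n * (1.0 - p) ** (N - n)


def poisson_pmf(n, a):
    """P(n) = a^n / n! * exp(-a)."""
    return a**n / math.factorial(n) * math.exp(-a)


N, p = 1000, 0.003  # many trials, each with a small probability
a = N * p           # Poisson mean

for n in range(7):
    print(f"n={n}: binomial {binomial_pmf(n, N, p):.5f}  poisson {poisson_pmf(n, a):.5f}")

# Largest pointwise difference over the range that carries essentially all the probability
max_diff = max(abs(binomial_pmf(n, N, p) - poisson_pmf(n, a)) for n in range(20))
```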


2.4.1 Continuous probability distributions

The continuous limit of the Binomial and Poisson distributions is of course the normal (or Gaussian) distribution [1]

In fact, there is an amazing theorem (early versions of which are due to de Moivre and Laplace) which states that the distribution of the mean of a large number of independent random variables, drawn from any probability distribution with finite variance, converges to the normal distribution

This result is known as the Central Limit Theorem [2], and it is for this reason that the distribution of residuals about the mean is usually assumed to follow a normal distribution

“Everybody believes in the exponential law of errors: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation”

[1] http://www.stat.sc.edu/~west/javahtml/CLT.html
[2] http://mathworld.wolfram.com/CentralLimitTheorem.html
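The Central Limit Theorem is easy to try out by simulation. The sketch below (sample size, number of trials and random seed are arbitrary choices) averages n uniform random numbers many times over; although the parent distribution is flat, the means cluster about 0.5 with a spread close to σ/√n and roughly 68% of them fall within one s.d. of the centre, as for a normal distribution.

```python
import random
import statistics

random.seed(0)     # fixed seed so that the run is repeatable
n = 12             # size of each sample
n_trials = 20000   # number of samples drawn

# Mean of n uniform random numbers on [0, 1), repeated n_trials times
means = [sum(random.random() for _ in range(n)) / n for _ in range(n_trials)]

mu = statistics.fmean(means)
sd = statistics.pstdev(means)
expected_sd = (1.0 / 12.0) ** 0.5 / n**0.5  # sigma / sqrt(n) for a uniform parent

within_one_sd = sum(abs(m - mu) < sd for m in means) / n_trials

print(f"mean of means = {mu:.4f}")
print(f"s.d. of means = {sd:.4f} (sigma/sqrt(n) = {expected_sd:.4f})")
print(f"fraction within one s.d. = {within_one_sd:.3f}")
```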



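Definitions for discrete distributions can readily be generalised to the continuous case:

$$ P(a \le x \le b) = \int_a^b p(x)\,\mathrm{d}x, \qquad \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1, \qquad \bar{x} = \int_{-\infty}^{\infty} x\,p(x)\,\mathrm{d}x $$

$$ \sigma^2 = \int_{-\infty}^{\infty} (x - \bar{x})^2\,p(x)\,\mathrm{d}x = \int_{-\infty}^{\infty} x^2\,p(x)\,\mathrm{d}x - \left(\int_{-\infty}^{\infty} x\,p(x)\,\mathrm{d}x\right)^2 = \overline{x^2} - \bar{x}^2 $$

Normal distribution with mean µ and s.d. σ given by

$$ N(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad \int_{-\infty}^{\infty} N(x)\,\mathrm{d}x = 1 $$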



Usually, we refer to p(x) as the probability density function (p.d.f.) and the integrated probability as the cumulative distribution function (c.d.f.)

For the normal distribution, the c.d.f. is expressed in terms of a special function: the Gaussian error function, erf(x)

$$ D(x) = \int_{-\infty}^{x} N(x')\,\mathrm{d}x' = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{x}\exp\left(-\frac{(x'-\mu)^2}{2\sigma^2}\right)\mathrm{d}x' = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right] $$

erf(x) can be evaluated in Excel via the ERF function, or the complementary error function ERFC(x) = 1 – ERF(x)
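Outside Excel, the same c.d.f. can be evaluated with `math.erf` from Python’s standard library. In the sketch below the mean, s.d. and evaluation point are hypothetical, and the result is cross-checked against `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mu=0.0, sigma=1.0):
    """D(x) = 1/2 [1 + erf((x - mu) / (sigma * sqrt(2)))]."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


mu, sigma = 10.0, 0.2  # hypothetical mean and standard deviation
x = 10.3               # 1.5 standard deviations above the mean

ours = normal_cdf(x, mu, sigma)
reference = NormalDist(mu, sigma).cdf(x)

print(f"D({x}) = {ours:.4f} (library value {reference:.4f})")
```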


2.5.1 Confidence limits and error bounds

So, if we assume our experimental results are sampled from a normal distribution, then writing something like µ ± σ implies that we think a repeated measurement will lie within one s.d. of µ just over 68% of the time

$$ P(\mu - \sigma < x < \mu + \sigma) = \operatorname{erf}\left(\frac{1}{\sqrt{2}}\right) \cong 68\% $$

If we want more confidence in our answer, then we must widen the range of our interval, giving rise to the term confidence limits or confidence interval

Some common (and arbitrary!) choices of confidence limits are 95% and 99%


2.5.2 Confidence limits and error bounds

A confidence interval of 99% corresponds to a probability of finding a result within about 2.6σ of the mean value

An alternative way of stating this is that 1 in 100 measurements will lie outside this range by pure chance

Some people refer to this interpretation of the probability as a frequentist approach, as it assumes that the measurements are performed many times

Others prefer to adopt an inferential approach, in which the confidence limits are interpreted as the likelihood of a single particular value lying within the given range

A third group feel that the above interpretations are rather philosophical, and prefer to forget about them!
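The link between a confidence level and the width of the interval can be tabulated with a short Python sketch, assuming normally distributed results. The function names are illustrative; `NormalDist.inv_cdf` from the standard library is used to invert the c.d.f.

```python
import math
from statistics import NormalDist


def coverage(k):
    """Probability that a normal variate lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


def half_width(confidence):
    """Number of standard deviations needed for a given two-sided confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


for k in (1.0, 2.0, 2.5, 3.0):
    print(f"within {k} sigma: {100 * coverage(k):.2f}%")

for level in (0.95, 0.99):
    print(f"{100 * level:.0f}% confidence: +/- {half_width(level):.3f} sigma")
```

Running this shows that one s.d. covers just over 68%, 95% needs slightly less than 2σ, and 99% needs a little more than 2.5σ.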


2.6.1 Theory of small changes and errors

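If we know the functional dependence of a quantity on a variable, we can estimate the error resulting from a small change in this variable as follows:

$$ \frac{\Delta y}{\Delta x} \cong \frac{\mathrm{d}y}{\mathrm{d}x} \quad\therefore\quad \Delta y \cong \frac{\mathrm{d}y}{\mathrm{d}x}\,\Delta x $$

Similarly, for functions of more than one variable, y(x_1, x_2, …, x_n):

$$ \frac{\Delta y}{\Delta x_1} \cong \frac{\partial y}{\partial x_1} \quad\therefore\quad \Delta y \cong \frac{\partial y}{\partial x_1}\,\Delta x_1 $$

Here ∆x and dx refer to a small change and an infinitesimal change in quantity x, respectively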


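For example, consider the error in measuring a Bragg spacing from a powder X-ray diffractometer trace

From Bragg’s law at first order, we know:

$$ \lambda = 2 d \sin\theta \;\Rightarrow\; d = \frac{\lambda}{2\sin\theta} $$

Hence:

$$ \frac{\Delta d}{\Delta\theta} \cong \frac{\partial d}{\partial\theta} \quad\therefore\quad \Delta d \cong \frac{\partial d}{\partial\theta}\,\Delta\theta = -\frac{\lambda\cos\theta}{2\sin^2\theta}\,\Delta\theta = -d\cot\theta\cdot\Delta\theta $$

Using λ = 1.5418 Å, θ = 70°, ∆θ = 0.05°, we obtain a value of d = (0.8204 ± 0.0003) Å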
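The numbers in the Bragg example can be reproduced with a few lines of Python, using d = λ/(2 sin θ) and |∆d| = d cot θ ∆θ. Note that ∆θ must be converted to radians before it is used in the derivative; the variable names are chosen for this sketch.

```python
import math

wavelength = 1.5418          # X-ray wavelength in angstroms
theta = math.radians(70.0)   # Bragg angle
dtheta = math.radians(0.05)  # uncertainty in the angle, converted to radians

d = wavelength / (2.0 * math.sin(theta))  # Bragg's law at first order
delta_d = d / math.tan(theta) * dtheta    # |delta d| = d * cot(theta) * delta theta

print(f"d = ({d:.4f} +/- {delta_d:.4f}) angstrom")
```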



For functions of more than one variable, we can combine errors resulting from each variable in the following way:

$$ (\Delta y)^2 = (\Delta y_1)^2 + (\Delta y_2)^2 + \ldots \quad\text{where}\quad \Delta y_i = \left(\frac{\partial y}{\partial x_i}\right)\Delta x_i $$

However, it is usual that just one or two errors dominate the whole expression for the total error

For example, going back to our powder diffractometer example, consider the effect of ∆λ = 0.0001 Å

$$ \left(\frac{\Delta d}{d}\right)^2 = \left(\frac{\Delta\lambda}{\lambda}\right)^2 + \left(\cot\theta\,\Delta\theta\right)^2 = 4.2\times10^{-9} + 1.0\times10^{-7} $$
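A short Python sketch of this combination for the diffractometer example (λ = 1.5418 Å, ∆λ = 0.0001 Å, θ = 70°, ∆θ = 0.05°), assuming the two errors are independent so that their fractional contributions add in quadrature:

```python
import math

wavelength, d_wavelength = 1.5418, 0.0001  # angstroms
theta = math.radians(70.0)
dtheta = math.radians(0.05)

rel_lambda = d_wavelength / wavelength   # fractional error from the wavelength
rel_theta = dtheta / math.tan(theta)     # fractional error from the angle, cot(theta) * dtheta

rel_d = math.hypot(rel_lambda, rel_theta)  # quadrature sum of the two contributions
d = wavelength / (2.0 * math.sin(theta))

print(f"wavelength term: {rel_lambda**2:.1e}")
print(f"angle term:      {rel_theta**2:.1e}")
print(f"delta d = {d * rel_d:.5f} angstrom")
```

The angular term is more than twenty times larger than the wavelength term, so the total error is barely changed by including ∆λ.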



The following table shows how to combine the errors of some common functions
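The table itself is not reproduced here. As a sketch, a few standard results that follow from adding the contributions of independent random errors in quadrature are listed below; they are textbook results rather than a transcription of the slide, and A, B, n and k denote illustrative variables and constants.

$$
\begin{aligned}
Z = A \pm B &: \quad (\Delta Z)^2 = (\Delta A)^2 + (\Delta B)^2 \\
Z = AB \ \text{or}\ A/B &: \quad \left(\frac{\Delta Z}{Z}\right)^2 = \left(\frac{\Delta A}{A}\right)^2 + \left(\frac{\Delta B}{B}\right)^2 \\
Z = A^n &: \quad \frac{\Delta Z}{Z} = |n|\,\frac{\Delta A}{A} \\
Z = \ln A &: \quad \Delta Z = \frac{\Delta A}{A} \\
Z = \exp(kA) &: \quad \frac{\Delta Z}{Z} = |k|\,\Delta A
\end{aligned}
$$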


3.1.1 Fitting models to experimental data

Often, we would like to fit a set of numerical data with a particular model function, perhaps a straight line


3.1.2 Fitting models to experimental data

Graphical methods can be quite effective!

Can fit by eye to a straight line, or transform the function to a straight line and then fit, e.g.

$$ y = A\exp\left(-\frac{Q}{k_B T}\right) \;\Rightarrow\; \ln y = \ln A - \frac{Q}{k_B T} $$

$$ \text{Let } z = \ln y \text{ and fit } z = \frac{m}{T} + c \text{ to obtain } m = -Q/k_B \text{ and } c = \ln A $$

Estimate the error in gradient and intersection point from the distribution of data around the fitted line

Test suitability of the fit by computing the residuals about the fitted line – their histogram should follow a normal distribution

Any systematic deviations from a normal distribution indicate that the fitted function is not suitable
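The straight-line transformation of an exponential law y = A exp(−Q/k_B T) can be sketched in Python as follows. The prefactor, activation energy and temperatures are hypothetical, and the data are generated without noise so that the fit should return the input values; a least-squares line is used here in place of fitting by eye.

```python
import math

K_B = 8.617333e-5            # Boltzmann constant in eV/K
A_TRUE, Q_TRUE = 2.0e3, 0.5  # hypothetical prefactor and activation energy (eV)

T = [300.0, 350.0, 400.0, 450.0, 500.0]  # temperatures in K
y = [A_TRUE * math.exp(-Q_TRUE / (K_B * t)) for t in T]

# Linearise: z = ln y plotted against u = 1/T is the straight line z = m*u + c
u = [1.0 / t for t in T]
z = [math.log(v) for v in y]

n = len(u)
u_bar, z_bar = sum(u) / n, sum(z) / n
m = sum((ui - u_bar) * zi for ui, zi in zip(u, z)) / sum((ui - u_bar) ** 2 for ui in u)
c = z_bar - m * u_bar

Q_fit = -m * K_B     # from m = -Q / k_B
A_fit = math.exp(c)  # from c = ln A

print(f"Q = {Q_fit:.4f} eV, A = {A_fit:.1f}")
```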


3.2.1 Linear least-squares regression analysis

We can be more quantitative by using the method of least-squares, which defines the ‘best’ fit as the minimum of the sum of squares of the error between the line and each datum

$$ S = \sum_i \left(y_i - m x_i - c\right)^2 $$

Differentiating partially w.r.t. each parameter:

$$ \frac{\partial S}{\partial m} = -2\sum_i x_i\left(y_i - m x_i - c\right) = 0, \qquad \frac{\partial S}{\partial c} = -2\sum_i \left(y_i - m x_i - c\right) = 0 $$

yields two simultaneous equations giving m and c



Solving from the preceding equations, we have:

$$ m = \frac{\sum_i (x_i - \bar{x})\,y_i}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i x_i y_i - \frac{1}{n}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2} $$

$$ c = \bar{y} - m\bar{x} = \frac{1}{n}\sum_i y_i - \frac{m}{n}\sum_i x_i \qquad [\text{line passes through } (\bar{x}, \bar{y})] $$

The latter expressions for m and c are suitable for evaluation in Excel or on a calculator – in fact, Excel has an inbuilt linear regression function (→ Add Trendline)

When the ‘best’ values of m and c are found, then the residuals can be calculated from

$$ d_i = y_i - m x_i - c $$
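The least-squares expressions for the gradient m, intercept c and residuals d_i translate directly into Python. The data below are hypothetical, and both forms of the gradient formula (the one written with deviations from the mean, and the one written with raw sums) are evaluated to show that they agree.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical data lying close to y = 2x
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Gradient from deviations about the mean of x
m = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)

# Equivalent form using raw sums, convenient for a calculator or spreadsheet
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)
m_alt = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x**2 / n)

c = y_bar - m * x_bar  # the line passes through (x_bar, y_bar)
residuals = [yi - m * xi - c for xi, yi in zip(x, y)]

print(f"m = {m:.3f}, c = {c:.3f}")
print("residuals:", [round(d, 2) for d in residuals])
```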



Most importantly, error estimates for m and c follow from the residuals, which we will simply quote here:

$$ (\Delta m)^2 \approx \frac{1}{\sum_i (x_i - \bar{x})^2}\,\frac{\sum_i d_i^2}{n-2} $$

$$ (\Delta c)^2 \approx \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right)\frac{\sum_i d_i^2}{n-2} $$

Note that the best fit line should strictly be written as:

$$ y = (m \pm \Delta m)(x - \bar{x}) + (b \pm \Delta b) \quad\text{where}\quad (\Delta c)^2 = (\Delta b)^2 + \bar{x}^2 (\Delta m)^2 $$

since m and c are not independent
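The error estimates for the gradient and intercept can be evaluated with the short Python sketch below, using the same kind of hypothetical data as a straight-line fit would. The final lines check the relation (∆c)² = (∆b)² + x̄²(∆m)², taking ∆b as the error in the value of the line at x = x̄.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical data
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
D = sum((xi - x_bar) ** 2 for xi in x)  # sum of squared deviations of x

m = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / D
c = y_bar - m * x_bar

# Residual variance with n - 2 degrees of freedom (two fitted parameters)
sum_d2 = sum((yi - m * xi - c) ** 2 for xi, yi in zip(x, y))
resid_var = sum_d2 / (n - 2)

delta_m = math.sqrt(resid_var / D)
delta_c = math.sqrt((1.0 / n + x_bar**2 / D) * resid_var)
delta_b = math.sqrt(resid_var / n)  # error in the line at x = x_bar

print(f"m = {m:.3f} +/- {delta_m:.3f}")
print(f"c = {c:.3f} +/- {delta_c:.3f}")
```

Because the intercept is an extrapolation away from the centre of the data, its error is noticeably larger than that of the value at x̄.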



Note that Excel will not calculate the standard errors in m and c for you with the Trendline option, although more advanced regression is available via the ‘Analysis Toolpak’

Instead, you are presented with a ‘goodness of fit’ parameter R (or R²) which is related to the amount of sample variance explained by the model

However, R² cannot tell you whether the model is physically meaningful or even appropriate to use!

Generally speaking, R² is rather useless for scientific studies, so it is much better to deal in standard errors

There are better regression tools that use non-linear methods, such as the freely available gnuplot package


3.3.1 Non-linear least squares analysis

We will not discuss the theory of non-linear techniques, but they can be used to fit multi-parameter functions directly without transforming into a straight line

We will cover an example used in polymer science to fit X-ray diffraction data with a triple Gaussian profile

– Scattering data is assumed to consist of two crystalline reflections superimposed on a broad amorphous halo
– All peaks are modelled by Gaussian functions
– We want to deconvolute the scattering into separate peaks



Start by defining a Gaussian function to represent the 110 crystal reflection:
gnuplot> crystal_110(x) = a*exp(-(x-b)**2/(2*c**2))

Then define two more functions to represent the 200 crystal reflection and the amorphous phase:
gnuplot> crystal_200(x) = d*exp(-(x-e)**2/(2*f**2))
gnuplot> amorph(x) = g*exp(-(x-h)**2/(2*i**2))

Combine these all into a total intensity function:
gnuplot> total(x) = crystal_110(x)+crystal_200(x)+amorph(x)

Fit this to the data set, first setting some sensible initial guesses for your parameters (we have 9 in total!):
gnuplot> a=100;b=2.1;c=0.5;d=40;e=2.5;f=0.02;g=30;h=2.78;i=0.02
gnuplot> fit total(x) "data.dat" via a,b,c,d,e,f,g,h,i



Output from gnuplot fitting routine gives parameters and their standard errors

Also of interest is the correlation matrix of fit parameters, which reveals any interdependencies

Final set of parameters Asymptotic Standard Error

======================= ==========================

a = 109.232 +/- 2.01 (1.84%)

b = 2.33536 +/- 0.01051 (0.45%)

c = 0.31653 +/- 0.006204 (1.96%)

d = 37.1998 +/- 3.693 (9.927%)

e = 2.53854 +/- 0.008686 (0.3422%)

f = 0.0853875 +/- 0.009819 (11.5%)

g = 90.408 +/- 3.604 (3.986%)

h = 2.78811 +/- 0.003543 (0.1271%)

i = -0.0832852 +/- 0.004017 (4.823%)


Summary of today’s lecture

1. Basic data handling and recording
1.1 Numerical hygiene
1.2-1.4 Units and dimensional analysis
1.5-1.6 Distinguishing and estimating magnitude of errors

2. Quantitative treatment of random and systematic errors
2.1 Refresher on continuous and discrete probability
2.2 Central Limit Theorem and Gaussian distribution
2.3 Standard error, and confidence limits
2.4 Theory of small changes and errors, combining random errors

3. Fitting models to experimental and simulated data
3.1 Graphical methods
3.2 Linear least-squares regression analysis
3.3 Fitting data to more complex functional forms
