Table of contents
Statistical analysis
  Measures of statistical central tendencies
  Measures of variability
  Aleatory uncertainties
  Epistemic uncertainties
Measures of statistical dispersion or deviation
  The range
  The mean difference
  The variance
  The standard deviation
  The coefficient of variation
Measures of uncertainty
  Systems of events
  Entropy
Random (stochastic) variables
  Discontinuous (discrete) random variables
    Moments of discrete random variables
    Probability distributions of discrete random variables
      Binomial distribution
      Poisson distribution
  Continuous random variables
    Probability Density Function
    Cumulative Distribution Function
    Probability distributions of continuous random variables
      Uniform distribution
      Simpson's (triangular) distribution
      Normal distribution
      Lognormal distribution
      Shifted exponential distribution
      Gamma distribution
      Shifted Rayleigh distribution
      Type I Largest value (Gumbel) distribution
      Type III Smallest values (for ε = 0 known as the Weibull distribution)
      Beta distribution
      Type I Smallest values distribution
Combinations of random variables
Measures of statistical central tendencies Measures of central tendency of a set of data x1, x2, ..., xN locate only the centre of a distribution of measures. Other measures often are needed to describe data. The mean is often used to describe central tendencies. Mean has two related meanings in statistics:
• the arithmetic mean • the expected value of a random variable.
In mathematics and statistics, the arithmetic mean is often referred to simply as the mean or average; the term "arithmetic mean" is preferred in mathematics and statistics. For a data set x1, x2, ..., xN the arithmetic mean is defined as follows:
$$\mu = \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
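The definition can be checked numerically; a minimal sketch in Python (the data set is made up for illustration):

```python
# Arithmetic mean of a data set x1, x2, ..., xN: mu = (1/N) * sum(x_i)
data = [4.0, 7.0, 13.0, 16.0]   # illustrative values only
N = len(data)
mu = sum(data) / N              # (4 + 7 + 13 + 16) / 4 = 10.0
```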
Measures of variability Statistics uses summary measures to describe the amount of variability or spread in a set of data x1, x2, ..., xN. Variability refers to the extent to which data points in a statistical distribution or data set diverge from the average or mean value, and also to the extent to which these data points differ from each other. There are several commonly used measures of variability: the range, the mean difference, the variance and the standard deviation, as well as the combined measure of variability defined as the coefficient of variation with respect to the mean value.
Uncertainty represents a state of having limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome.
The uncertainty (doubt) in statistics and probability theory represents the estimated amount or percentage by which an observed or calculated value may differ from the true value.
Uncertainties can be distinguished as being either aleatory or epistemic.
Aleatory uncertainties
Objective or external or irreducible uncertainty arises because of natural, unpredictable variability of the wave and wind climate or of ship operations. The inherent randomness normally cannot be reduced although the knowledge of the phenomena may help in quantifying the uncertainty.
Epistemic uncertainties
Uncertainty is due to a lack of knowledge about the climate properties. The epistemic (or subjective or internal or modelling) uncertainty can be reduced with sufficient study, better measurement facilities, more observations or improved modelling and, therefore, expert judgments may be useful in its reduction.
Measures of statistical dispersion or deviation A measure of statistical dispersion or deviation is a real number that is zero if all the data are identical, and increases as the data become more diverse. It cannot be less than zero. Most measures of dispersion have the same scale as the quantity being measured. In other words, if the measurements have units, such as metres or seconds, the measure of dispersion has the same units. Basic measures of dispersion include:
• Range • Mean difference • Variance
Additional measures are: • Standard deviation – the square root of the variance • Coefficient of variation – the standard deviation divided by the mean value
(See Excel example: GraduationRateofNavalArchitectureinZagreb) The example presents the statistical properties of the enrolment and graduation numbers of naval architecture students at the Faculty of Mechanical Engineering and Naval Architecture at the University of Zagreb.
[Figure: enrolled (Upisano) and graduated (Diplomiralo) students per year (Godina), naval architecture programme (Studij brodogradnje)]
The range In descriptive statistics, the range is the length of the smallest interval which contains all the data of a data set x1, x2, ..., xN. The range is calculated by subtracting the smallest observation (sample minimum Smin) from the greatest (sample maximum Smax) and is an indicator of statistical dispersion: R = Smax − Smin. The range, in the sense of the difference between the highest and lowest scores, is also called the crude range. The midrange point, i.e. the point halfway between the two extremes, is an indicator of the central tendency of the data. It is not appropriate for small samples.

The mean difference In probability theory and statistics, the mean difference is used as a measure of how far the numbers of a data set x1, x2, ..., xN are spread out from each other. It is one of several descriptors of a probability distribution, describing how far the numbers lie from the mean (expected value). For a random variable X = x1, x2, ..., xN with mean value μ the mean difference of X is:
$$MD(x) = \frac{1}{N}\sum_{i=1}^{N} \left| x_i - \mu \right|$$

or the relative mean difference is then

$$RMD(x) = \frac{MD(x)}{\mu}$$
The variance In probability theory and statistics, the variance is another indicator used as a measure of how far a set of numbers are spread out from each other. For random variable X with expected value (mean) μ = E[X], the variance of X is:
$$Var(x) = \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
Proof:

$$Var(x) = \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \frac{2\mu}{N}\sum_{i=1}^{N} x_i + \mu^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
The standard deviation The widely used measure of variability or diversity in statistics and probability theory is the standard deviation. It shows how much variation or "dispersion" there is from the "average". The standard deviation is the square root of its variance:
$$\sigma(X) = \sqrt{Var(X)}$$

The standard deviation, unlike the variance, is expressed in the same units as the data.

The coefficient of variation Other measures of dispersion are dimensionless (scale-free): they have no units even if the variable itself has units. In widest use is the coefficient of variation, defined as follows:

$$COV(X) = \frac{\sigma(X)}{\mu(X)}$$
For measurements with percentage as unit, the coefficient of variation and the standard deviation will have percentage points as unit.
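All of the dispersion measures above can be sketched in a few lines of Python (the data set is made up for illustration; the population form of the variance is used, as in the formulas above):

```python
import math

# Illustrative data set (made up for this sketch)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(data)
mu = sum(data) / N                                # arithmetic mean
R = max(data) - min(data)                         # range: Smax - Smin
MD = sum(abs(x - mu) for x in data) / N           # mean difference from mu
var = sum((x - mu) ** 2 for x in data) / N        # variance, definition form
var_alt = sum(x * x for x in data) / N - mu ** 2  # equivalent: E[x^2] - mu^2
sigma = math.sqrt(var)                            # standard deviation
COV = sigma / mu                                  # coefficient of variation
```

Note that `var` and `var_alt` agree, illustrating the identity proved above.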
Measures of uncertainty Systems of events
Random events are in general considered as abstract concepts and the relations among events are characterized axiomatically. The algebraic structure of the set of events turns out to be a Boolean algebra.
The disjoint random events $E_j$ with probabilities $p_j = p(E_j)$, $j = 1, 2, \ldots, N$, configure a system $\mathbf{S}_N$ in the form of an N-element finite scheme:

$$\mathbf{S}_N = \begin{pmatrix} E_1 & E_2 & \cdots & E_j & \cdots & E_N \\ p_1 = p(E_1) & p_2 = p(E_2) & \cdots & p_j = p(E_j) & \cdots & p_N = p(E_N) \end{pmatrix}$$
The probability of a system of events $\mathbf{S}_N$ is then in general $p(\mathbf{S}_N) = \sum_{i=1}^{N} p_i \le 1$. For a complete distribution, $p(\mathbf{S}_N) = P(\mathbf{S}_N) = 1$. A system of N events E1, E2, ..., EN is called a complete system of events if the following axioms hold:

(a) $E_k \ne \emptyset$, $k = 1, 2, \ldots, N$
(b) $E_j E_k = \emptyset$ for $j \ne k$
(c) $E_1 + E_2 + \cdots + E_N = I$

The "∅" in (a) and (b) means an impossible event and "I" in (c) denotes a sure event. The fact that Ej and Ek are exclusive is expressed in (b). Axiom (c) states that at least one of the events Ek, k = 1, 2, ..., N, occurs.
Entropy
Uncertainty of a single stochastic event E with known probability p = p(E) ≠ 0 plays a fundamental role in information theory. To each probability can be assigned the equivalent number of events $\nu(E) = 1/p(E)$. The entropy of a single stochastic event E can be interpreted according to Wiener (1948) either as a measure of the information yielded by the event or of how unexpected the event was, and can be defined as the logarithm of the equivalent number of events $\nu(E)$ as follows:

$$H(E) = \log_2 \nu(E) = \log_2\left[1/p(E)\right] = -\log_2 p(E)$$

The unit of unexpectedness, $H(1/2) = 1$, expresses, for example, how unexpected it is to get a tail when flipping a coin. More important than the unexpectedness of a single stochastic event are the uncertainties of systems of N events. The uncertainty of a complete system $\mathbf{S}$ of N events can be expressed as the weighted sum of the unexpectedness of all events by Shannon's entropy (Shannon and Weaver, 1949), as follows:
$$H_N(\mathbf{S}) = \sum_{j=1}^{N} p_j \log \nu_j = \sum_{j=1}^{N} p_j \log\left(1/p_j\right) = -\sum_{j=1}^{N} p_j \log p_j$$
The uncertainty of an incomplete system $\mathbf{S}$ of N events can be defined as the limiting case of Renyi's entropy (1970) of order 1, as shown:

$$H_N^{R}(\mathbf{S}) = -\frac{1}{p(\mathbf{S})}\sum_{j=1}^{N} p_j \log p_j$$
The definition of the unit of uncertainty according to Renyi (1970) is no more and no less arbitrary than the choice of the unit of some physical quantity. E.g., if the logarithm applied is of base two, the unit of entropy is denoted as one "bit". One bit is the uncertainty of a system of two equally probable events. If the natural logarithm is applied, the unit is denoted as one "nit". Outcomes with zero probability do not change the uncertainty; by convention, 0 log 0 = 0.

Some characteristics of the probabilistic uncertainty measures and properties of the entropy are summarized next. The entropy HN(S) is equal to zero when the state of the system S can be surely predicted, i.e., no uncertainty exists at all. This occurs when one of the probabilities of events pi, i = 1, 2, ..., N is equal to one, say pk = 1, and all other probabilities are equal to zero, pj = 0, j ≠ k. The entropy is maximal when all events are equally probable, pi = 1/N for i = 1, 2, ..., N, and it amounts to HN(S)max = log N, which is Hartley's entropy (1928). Hartley's entropy corresponds to Renyi's entropy of order 0 (1970). The entropy increases as the number of events increases. The entropy does not depend on the sequence of events: HN(p1, p2, ..., pN) = HN(pk(1), pk(2), ..., pk(N)), where k is an arbitrary permutation of (1, 2, ..., N). The uniqueness theorem by Khinchin (1957) states that the entropy is the only function that measures the probabilistic uncertainty of systems of events in agreement with human experience of uncertainty.
(see Excel example U1-EntropyDieCoin)
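The die and coin cases from the Excel example can be reproduced with a short Python sketch of Shannon's entropy in bits:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum p_j log2 p_j; 0 log 0 = 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_coin = entropy_bits([0.5, 0.5])    # fair coin: exactly 1 bit
H_die = entropy_bits([1 / 6] * 6)    # fair die: log2(6), about 2.585 bits
H_sure = entropy_bits([1.0, 0.0])    # sure outcome: zero uncertainty
```

The fair coin gives exactly one bit, the maximal entropy of a two-event system, and a sure outcome gives zero, matching the properties listed above.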
Random (stochastic) variables Deterministic variables are normally described by their properties:

• N – nominal value or the exact value

and possibly with tolerances:

• T – tolerance • t = T/N – relative tolerance
Description of characteristics of random variables:
• μ = N(1 + O) – mean value • O = μ/N − 1 – mean deviation from the nominal value (bias) • Var = σ² – variance • σ = Var^(1/2) – standard deviation • COV = σ/μ – coefficient of variation • F – probability distribution:
PDF – probability density function; CDF – cumulative distribution function. Empirically, it is possible for practical purposes to relate the tolerance of deterministic variables to the standard deviation of random variables: T = nσ. For example, supposing a normal probability distribution, for n = 3 fewer than 27 samples out of 10000 are expected to fall outside the tolerable margins. In other words, the confidence is 99.73% that a random sample will be within the prescribed tolerance interval of ±3σ.
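The 27-out-of-10000 figure can be checked directly from the standard normal CDF, which is available in closed form via the error function:

```python
import math

def normal_cdf(u):
    """Standard normal CDF Phi(u), expressed via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

# Probability that a normal sample falls inside mu +/- n*sigma, here n = 3
n = 3
p_inside = normal_cdf(n) - normal_cdf(-n)    # about 0.9973
expected_outside = (1 - p_inside) * 10000    # about 27 samples per 10000
```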
(See Excel example: MSproperty-plating-statistics) The example presents the statistical analysis of the mechanical properties of mild shipbuilding steel (MS) rolled plates and profiles, obtained by tensile testing in the Laboratory for experimental mechanics at the Faculty of Mechanical Engineering and Naval Architecture at the University of Zagreb.
(See Excel example: MSproperty-profils-statistics)
Discontinuous (discrete) random variables Definition: A discontinuous (discrete) random variable takes the values x1, x2, ... with probabilities p(x1), p(x2), ..., which have the property $\sum_i p(x_i) = 1$.
Moments of discrete random variables

$$m_r = \sum_i x_i^r\, p(x_i)$$

$$M_r = \sum_i (x_i - \mu)^r\, p(x_i)$$

Expectation: $\mu = \sum_i x_i\, p(x_i)$

Variance: $V(x) = \sigma^2 = \sum_i (x_i - \mu)^2\, p(x_i)$
Probability distributions of discrete random variables Binomial distribution

$$P(x) = \binom{n}{x} p^x q^{n-x}, \qquad q = 1 - p$$

Mean $= np$, Sigma $= \sqrt{npq}$
(see Excel example DD1-DistributionBinomial)
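The PMF and the moment formulas can be cross-checked in Python for the n = 25, p = 0.5 case plotted in the Excel example:

```python
import math

def binom_pmf(x, n, p):
    """P(x) = C(n, x) p^x q^(n-x), with q = 1 - p."""
    q = 1.0 - p
    return math.comb(n, x) * p ** x * q ** (n - x)

n, p = 25, 0.5
mean = n * p                          # 12.5, as in the plotted example
sigma = math.sqrt(n * p * (1 - p))    # 2.5
total = sum(binom_pmf(x, n, p) for x in range(n + 1))  # PMF sums to 1
```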
[Figure: binomial distribution CDF for n = 25, p = 0.5 (Mean = 12.5, Sigma = 2.5), and binomial PMFs for n = 2, 5, 10, 20, 50 with p = 0.25, 0.50, 0.75]
Poisson distribution

$$P(x) = \frac{m^x}{x!}\, e^{-m}, \qquad m = np > 0$$

Mean $= m$, Variance $= m$, Sigma $= \sqrt{m}$, $COV = 1/\sqrt{m}$
(see Excel example DD2-DistributionPoisson)
[Figure: Poisson distribution PMF and cumulative distribution function]
Continuous random variables A random variable is called continuous if it can assume all possible values in the possible range of the random variable. For a continuous random variable the value of the variable is never an exact point; it always lies in an interval, however small that interval may be. Probability Density Function (PDF) The probability function of a continuous random variable is called the probability density function, or briefly p.d.f. It is denoted by f(x) and represents the probability that the random variable X takes a value between x and x+Δx, where Δx is a very small change in X. Cumulative Distribution Function (CDF) In terms of the probability density function the cumulative distribution function is defined as:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt$$
(see Excel example DC3-DistributionNormal)
[Figure: normal distribution CDF, Mean = 2, Sigma = 1]
Moments of continuous random variables

$$m_r = \int_{-\infty}^{+\infty} x^r f(x)\, dx$$

$$M_r = \int_{-\infty}^{+\infty} (x-\mu)^r f(x)\, dx$$

Expectation: $\mu = \int_{-\infty}^{+\infty} x f(x)\, dx$

Variance: $V(x) = \sigma^2 = \int_{-\infty}^{+\infty} (x-\mu)^2 f(x)\, dx$
Probability distributions of continuous random variables Uniform distribution

$$f(x) = \frac{1}{b-a}, \qquad a \le x \le b$$

$$F(x) = \frac{x-a}{b-a}$$

$$\mu = \frac{a+b}{2}, \qquad \sigma = \frac{b-a}{2\sqrt{3}}$$
(see Excel example DC6-DistributionUniform)
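A minimal numeric sketch of the uniform distribution (the interval [2, 10] is chosen arbitrarily):

```python
import math

# Uniform distribution on [a, b]: constant density and closed-form moments
a, b = 2.0, 10.0
pdf = 1.0 / (b - a)                        # constant density 1/(b-a)
mu = (a + b) / 2.0                         # mean: midpoint of the interval
sigma = (b - a) / (2.0 * math.sqrt(3.0))   # equivalently (b-a)/sqrt(12)

def cdf(x):
    """F(x) = (x - a) / (b - a) for a <= x <= b."""
    return (x - a) / (b - a)
```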
Simpson's (triangular) distribution For the symmetric triangular distribution on [a, b], on the rising half $a \le x \le (a+b)/2$:

$$f(x) = \frac{4(x-a)}{(b-a)^2}, \qquad F(x) = \frac{2(x-a)^2}{(b-a)^2}$$

with the mirrored expressions on the upper half and the peak density

$$f_{\max} = f\!\left(\frac{a+b}{2}\right) = \frac{2}{b-a}$$

$$\mu = \frac{a+b}{2}, \qquad \sigma = \frac{b-a}{2\sqrt{6}}$$
(see Excel example DC7-DistributionSimpson)
[Figure: Simpson's (triangular) distribution CDF]
Normal distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

$$F(x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right)$$

Standard normal density and cumulative probability:

$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}, \qquad \Phi(u) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{u} e^{-\frac{1}{2}t^2}\, dt$$

$$u = \frac{x-\mu}{\sigma}, \qquad \sigma > 0$$
(see Excel example DC3-DistributionNormal)
[Figure: normal distribution PDF and CDF, Mean = 2, Sigma = 1]
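The density, the standard normal CDF and the substitution u = (x − μ)/σ can be sketched in Python (Φ is expressed through the error function, which is exact for this purpose):

```python
import math

def phi(u):
    """Standard normal density phi(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal CDF Phi(u), via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """F(x) = Phi((x - mu) / sigma)."""
    return Phi((x - mu) / sigma)

# Parameters from the plotted example: Mean = 2, Sigma = 1
F_at_mean = normal_cdf(2.0, 2.0, 1.0)   # 0.5 by symmetry
```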
Lognormal distribution

$$f(x) = \frac{1}{x\,\sigma_y\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln x-\mu_y}{\sigma_y}\right)^2}$$

$$F(x) = \Phi\!\left(\frac{\ln x-\mu_y}{\sigma_y}\right), \qquad u = \frac{\ln x-\mu_y}{\sigma_y}$$

$$\mu_y = \ln\!\left[\mu_x^2\Big/\sqrt{\mu_x^2+\sigma_x^2}\right], \qquad \sigma_y^2 = \ln\!\left[1+\sigma_x^2/\mu_x^2\right]$$

$$\mu_x = e^{\mu_y+\sigma_y^2/2}, \qquad \sigma_x = e^{\mu_y+\sigma_y^2/2}\sqrt{e^{\sigma_y^2}-1}$$

(see Excel example DC4-DistributionLogNormal)
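The conversion between the x-moments and the underlying normal parameters can be verified numerically for the Mean-x = 8, Sigma-x = 5 case plotted in the example; converting back must reproduce the original mean and standard deviation:

```python
import math

def lognormal_params(mu_x, sigma_x):
    """Underlying normal parameters (mu_y, sigma_y) from mean/std of x."""
    sigma_y2 = math.log(1.0 + (sigma_x / mu_x) ** 2)
    mu_y = math.log(mu_x ** 2 / math.sqrt(mu_x ** 2 + sigma_x ** 2))
    return mu_y, math.sqrt(sigma_y2)

mu_y, sigma_y = lognormal_params(8.0, 5.0)     # Mean-x = 8, Sigma-x = 5
# Round trip back to the x-moments:
mu_x = math.exp(mu_y + sigma_y ** 2 / 2.0)
sigma_x = mu_x * math.sqrt(math.exp(sigma_y ** 2) - 1.0)
```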
[Figure: lognormal distribution CDF, Mean-x = 8, Sigma-x = 5 (Mean-y = 1.91, Sigma-y = 0.547)]
(see Excel example DC4-DistributionLogNormal-MildSteel)
[Figure: lognormal distribution of the yield stress of mild shipbuilding steel, Mean-x = 268, Sigma-x = 30.5 (Mean-y = 5.58, Sigma-y = 0.114)]
Shifted exponential distribution

$$f(x) = \lambda\, e^{-\lambda(x-x_o)}, \qquad F(x) = 1-e^{-\lambda(x-x_o)}, \qquad \lambda > 0$$

$$\mu = x_o + \frac{1}{\lambda}, \qquad \lambda = \frac{1}{\mu-x_o}, \qquad \sigma = \frac{1}{\lambda}$$
Gamma distribution

$$f(x) = \frac{\lambda(\lambda x)^{k-1}}{\Gamma(k)}\, e^{-\lambda x}, \qquad F(x) = \frac{\gamma(k,\lambda x)}{\Gamma(k)}, \qquad \lambda > 0, \quad k > 0$$

Gamma function: $\Gamma(k) = \int_0^{\infty} e^{-u}\, u^{k-1}\, du$, with $\Gamma(k+1) = k\,\Gamma(k) = k!$ for integer $k$.

Incomplete gamma function: $\gamma(k,x) = \int_0^{x} e^{-u}\, u^{k-1}\, du$

$$\mu = \frac{k}{\lambda}, \qquad \sigma = \frac{\sqrt{k}}{\lambda}, \qquad k = \left(\frac{\mu}{\sigma}\right)^2, \qquad \lambda = \frac{\mu}{\sigma^2}$$

In the equivalent scale parametrization with $\theta = 1/\lambda$:

$$f(x) = \frac{x^{k-1}\, e^{-x/\theta}}{\theta^k\,\Gamma(k)}, \qquad F(x) = \frac{\gamma(k, x/\theta)}{\Gamma(k)}$$

$$\mu = k\theta, \qquad \sigma = \sqrt{k}\,\theta, \qquad k = \left(\frac{\mu}{\sigma}\right)^2, \qquad \theta = \frac{\sigma^2}{\mu}$$
(see Excel example DC9-DistributionGamma)
[Figure: gamma distribution CDF]
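The moment relations above make it easy to fit a gamma distribution to a given mean and standard deviation; a sketch in Python (μ = 6, σ = 3 chosen for illustration):

```python
import math

# Fit shape k and scale theta from a given mean and standard deviation
mu, sigma = 6.0, 3.0
k = (mu / sigma) ** 2        # shape: (mu/sigma)^2 = 4.0
theta = sigma ** 2 / mu      # scale: sigma^2/mu = 1.5

def gamma_pdf(x, k, theta):
    """f(x) = x^(k-1) e^(-x/theta) / (theta^k Gamma(k))."""
    return x ** (k - 1) * math.exp(-x / theta) / (theta ** k * math.gamma(k))

mean_back = k * theta                 # recovers mu = 6.0
sigma_back = math.sqrt(k) * theta     # recovers sigma = 3.0
```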
Shifted Rayleigh distribution

$$f(x) = \frac{x-x_o}{\alpha^2}\, e^{-\frac{1}{2}\left(\frac{x-x_o}{\alpha}\right)^2}$$

$$F(x) = 1-e^{-\frac{1}{2}\left(\frac{x-x_o}{\alpha}\right)^2}$$

$$\mu = x_o + \alpha\sqrt{\frac{\pi}{2}}, \qquad \sigma = \alpha\sqrt{2-\frac{\pi}{2}}$$
Type I Largest value (Gumbel) distribution

$$f(x) = \alpha_n\, e^{-\alpha_n(x-u_n)}\, e^{-e^{-\alpha_n(x-u_n)}}$$

$$F(x) = e^{-e^{-\alpha_n(x-u_n)}}, \qquad \alpha_n > 0$$

$$\mu = u_n + \frac{0.5772}{\alpha_n}, \qquad \sigma = \frac{\pi}{\alpha_n\sqrt{6}}$$

$$u_n = \mu - \frac{0.5772}{\alpha_n}, \qquad \alpha_n = \frac{\pi}{\sigma\sqrt{6}}$$
(see Excel example DC5-DistributionGumbel)
[Figure: Type I largest value (Gumbel) PDF and CDF, Mean = 100, Sigma = 20]
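The method-of-moments fit from the formulas above can be sketched for the Mean = 100, Sigma = 20 case; note that F(u_n) = e^(−1) ≈ 0.368 at the mode, a quick sanity check on the CDF:

```python
import math

def gumbel_params(mu, sigma):
    """Method of moments: alpha_n = pi/(sigma sqrt(6)), u_n = mu - 0.5772/alpha_n."""
    alpha_n = math.pi / (sigma * math.sqrt(6.0))
    u_n = mu - 0.5772 / alpha_n
    return u_n, alpha_n

def gumbel_cdf(x, u_n, alpha_n):
    """F(x) = exp(-exp(-alpha_n (x - u_n)))."""
    return math.exp(-math.exp(-alpha_n * (x - u_n)))

# Mean = 100, Sigma = 20 as in the plotted example
u_n, alpha_n = gumbel_params(100.0, 20.0)   # alpha_n is about 0.064
F_mode = gumbel_cdf(u_n, u_n, alpha_n)      # exactly e^-1
```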
Type III Smallest values (for ε = 0 it is known as the Weibull distribution) For ε = 0, with scale parameter $u$:

$$f(x) = \frac{k}{u}\left(\frac{x}{u}\right)^{k-1} e^{-(x/u)^k}, \qquad F(x) = 1-e^{-(x/u)^k}, \qquad k > 0$$

$$\mu = u\,\Gamma\!\left(1+\frac{1}{k}\right), \qquad \sigma = u\sqrt{\Gamma\!\left(1+\frac{2}{k}\right)-\Gamma^2\!\left(1+\frac{1}{k}\right)}$$
Type III Smallest values, general three-parameter form with lower bound ε:

$$f(x) = \frac{k}{u-\varepsilon}\left(\frac{x-\varepsilon}{u-\varepsilon}\right)^{k-1} e^{-\left(\frac{x-\varepsilon}{u-\varepsilon}\right)^k}$$

$$F(x) = 1-e^{-\left(\frac{x-\varepsilon}{u-\varepsilon}\right)^k}, \qquad k > 0$$

$$\mu = \varepsilon + (u-\varepsilon)\,\Gamma\!\left(1+\frac{1}{k}\right)$$

$$\sigma = (u-\varepsilon)\sqrt{\Gamma\!\left(1+\frac{2}{k}\right)-\Gamma^2\!\left(1+\frac{1}{k}\right)}$$
Beta distribution

$$f(x) = \frac{(x-a)^{q-1}(b-x)^{r-1}}{B(q,r)\,(b-a)^{q+r-1}}, \qquad q > 0, \quad r > 0$$

Beta function: $B(q,r) = \dfrac{\Gamma(q)\,\Gamma(r)}{\Gamma(q+r)}$

$$\mu = a + (b-a)\frac{q}{q+r}$$

$$\sigma = \frac{b-a}{q+r}\sqrt{\frac{qr}{q+r+1}}$$
Type I Smallest values distribution

$$f(x) = \alpha_1\, e^{\alpha_1(x-u_1)}\, e^{-e^{\alpha_1(x-u_1)}}$$

$$F(x) = 1-e^{-e^{\alpha_1(x-u_1)}}, \qquad \alpha_1 > 0$$

$$\mu = u_1 - \frac{0.5772}{\alpha_1}, \qquad \sigma = \frac{\pi}{\alpha_1\sqrt{6}}$$
Type II Largest value distribution

$$f(x) = \frac{k}{u_n}\left(\frac{u_n}{x}\right)^{k+1} e^{-\left(u_n/x\right)^k}$$

$$F(x) = e^{-\left(u_n/x\right)^k}, \qquad k > 0$$

$$\mu = u_n\,\Gamma\!\left(1-\frac{1}{k}\right)$$

$$\sigma^2 = u_n^2\left[\Gamma\!\left(1-\frac{2}{k}\right)-\Gamma^2\!\left(1-\frac{1}{k}\right)\right]$$
Combinations of random variables For linear combinations $y = a_1X_1 + a_2X_2 + \cdots + a_kX_k$ of random variables $X_1, X_2, \ldots, X_k$ with given arithmetic means $\mu_1, \mu_2, \ldots, \mu_k$ and standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_k$:

Theorem 1: The mean value of the linear combination of random variables is the weighted sum of the mean values of the components:

$$\mu_y = E(a_1X_1 + a_2X_2 + \cdots + a_kX_k) = a_1E(X_1) + a_2E(X_2) + \cdots + a_kE(X_k) = a_1\mu_1 + a_2\mu_2 + \cdots + a_k\mu_k$$

Theorem 2: For mutually independent random variables, the variance of the linear combination is the weighted sum of the variances of the components:

$$\sigma_y^2 = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_k^2\sigma_k^2$$
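Both theorems can be checked directly for a two-variable combination (the coefficients, means and standard deviations are made up for illustration; Theorem 2 assumes independence):

```python
import math

# y = a1*X1 + a2*X2 with independent X1, X2 (illustrative parameters)
a = [2.0, -1.0]
mu = [10.0, 4.0]
sigma = [3.0, 2.0]

mu_y = sum(ai * mi for ai, mi in zip(a, mu))                # 2*10 - 1*4 = 16
var_y = sum(ai ** 2 * si ** 2 for ai, si in zip(a, sigma))  # 4*9 + 1*4 = 40
sigma_y = math.sqrt(var_y)
```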