Fitting probability distributions to data
PhD Cyber Physical Systems

M. Dolfin

University of Messina

5th and 12th June 2018

M. Dolfin Dep. of Engineering - University of Messina


Course Outline

Exploring Data
Visualize data sets, calculate descriptive group statistics, and explore data distributions.

Fitting a Curve to Data
Perform linear and nonlinear regression to fit a curve to data, and evaluate the quality of the fit.


Readings

W. J. Palm III, Introduction to Matlab for Engineers (3rd ed.), McGraw-Hill (2011).

www.mathworks.com

P. Brandimarte, Numerical Methods in Finance and Economics. A Matlab-based introduction, Wiley-Interscience (2006).


Motivation

In today’s data-driven world, we collect information from all sorts of systems. Statistical methods can help us understand the data and draw conclusions.

Learn statistical methods in Matlab to perform common data-driven tasks, such as exploring, visualizing and summarizing the data, resampling the data and estimating missing values, and fitting the data to build predictive models.


Data Visualizations

Before trying to perform calculations or draw conclusions from data, it is helpful to get a qualitative feel for the data. Visualization is often a useful method when beginning to explore data.

Histograms
A histogram provides quick visual insight into how a data set is distributed. The range of possible values is divided into intervals, or bins. Then a bar chart is created, where the height of each bar corresponds to how frequently values in that bin appear in the data.

histogram
A function that creates a histogram with the bins chosen automatically based on the data.
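A minimal sketch of the histogram call. The course data set is not reproduced here, so the vector height below is synthetic and purely illustrative (an assumption, not the course data).

```matlab
% Synthetic stand-in for the course height data (assumption: values in cm).
rng(0);                               % make the random numbers reproducible
height = 170 + 10*randn(500,1);       % 500 normally distributed values

% Let histogram choose the bins automatically from the data.
h = histogram(height);
xlabel('Height (cm)');
ylabel('Count');

% The bin counts add up to the number of observations.
totalCount = sum(h.Values);
```

The bin edges chosen automatically can be inspected afterwards through h.BinEdges.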

[Figure: histogram function]

Box Plots
A box plot is another way to visualize the distribution of a data set. The central box represents the middle 50% of observations, with the red line at the median. The "whisker" lines extend to the most extreme values that are not flagged as outliers, which covers about 99% of the data when it is normally distributed. Remaining outliers are shown individually with red crosses.

boxplot
Creates a box plot.
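A short sketch of a box plot, again on synthetic data (an assumption). One extreme value is added by hand so that an outlier marker appears; the quantile call is only there to show which numbers the box is drawn from.

```matlab
% Synthetic heights in cm (assumption, not the course data).
rng(0);
height = 170 + 10*randn(500,1);
height(end) = 230;                    % one extreme value, drawn as an outlier

boxplot(height);                      % Statistics and Machine Learning Toolbox
ylabel('Height (cm)');

% The box spans the 25th to 75th percentiles; its centre line is the median.
q = quantile(height, [0.25 0.5 0.75]);
```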

[Figure: boxplot function]

Scatter Plots
A scatter plot explores how two variables are related to each other.

You can use the scatter function or the plot function to create a scatter plot.

To view how one variable changes in response to two other variables, you can use scatter3 or plot3.
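A sketch of a labeled scatter plot. Both vectors are synthetic (an assumption), built so that weight increases with height; the correlation coefficient is computed only to put a number on the trend visible in the plot.

```matlab
% Synthetic, linearly related data (assumption, not the course data).
rng(0);
height = 170 + 10*randn(200,1);                     % cm
weight = 0.9*(height - 170) + 70 + 5*randn(200,1);  % kg

scatter(height, weight);
xlabel('Height (cm)');
ylabel('Weight (kg)');

% Sample correlation coefficient between the two variables.
R = corrcoef(height, weight);
r = R(1,2);
```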

[Figure: scatter function]

Exercise

Create a histogram for each of the variables height and weight;
visualize possible correlation between height and weight data using a scatter plot (add appropriate labels to the axes);
create a box plot of the height measurements (add an appropriate label to the axis).


Measures of Centrality

From the histograms of height data shown, it appears that women’s heights are centered on approximately 160 cm, whereas men’s heights are centered on approximately 175 cm.


Quantifying these statements requires calculating some measure of central tendency. There are several standard measures of "average" (or centrality).

Mean
The mean (also referred to as the arithmetic mean, or often simply the average) is a common measure of centrality. The mean is useful for symmetric distributions, but notoriously sensitive to outliers. If your data set is not distributed symmetrically or has extreme outliers, you will need to consider how these factors will affect the calculation of the mean.

mean
The function mean calculates the mean of a data set.


Median
The median gives the midpoint of the sorted data, so half the data is greater than the median and half is smaller. The median is much more resistant than the mean to changes in a few data values, and is an especially useful center for nonsymmetric (skewed) distributions, like the distribution of weight data.

Median function
    medWeight = median(weight)
    meanWeight = mean(weight)

Additional Measures of Centrality
mode      Most frequent value in a data set.
trimmean  Mean computed after excluding a given percentage of the highest and lowest values (outliers).
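A small worked example may help to see how differently these measures react to a single outlier. The ten values below are made up for illustration (they are not the course weight data).

```matlab
% Illustrative weights in kg; the last value is a deliberate outlier.
weight = [62 65 68 70 71 73 75 78 80 250];

meanWeight = mean(weight);          % pulled upward by the outlier: 89.2
medWeight  = median(weight);        % midpoint of the sorted data: 72
trimWeight = trimmean(weight, 20);  % drops the lowest and highest 10%: 72.5

% mode returns the most frequent value.
modeExample = mode([1 2 2 3 3 3]);  % 3
```

The mean moves by almost 17 kg because of one value, while the median and the trimmed mean stay close to the bulk of the data.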


Exercise

Evaluate the mean of the data on weights and save it in a variable named meanW;
evaluate the median of the data on weights and save it in a variable named medW.


Measures of Spread

A measure of centrality alone does not tell us whether all individuals have the same height or whether they are equally likely to be any height between 135 and 200 cm.


Standard deviation

Typically used to measure the spread of symmetric distributions, in particular those that follow a normal distribution ("bell curve").

It is the square root of the variance, which is the average of the squared distances of the data values from the mean (std and var divide the sum of squares by n − 1 by default). Because the distances are squared, it tends to amplify the effect of outliers.

std
Calculates the standard deviation of a data set.


Interquartile Range

It is based on the median, i.e. the 50th percentile point. It gives the distance between the 25th percentile and the 75th percentile, that is, the width of the region that contains the middle 50% of the data values.

Useful for nonsymmetric distributions, being resistant to outliers (like the median).

iqr
Calculates the interquartile range of a data set.


Additional measures of Spread

range
Difference between maximum and minimum values.

var
Variance of a data set.
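The measures of spread can be compared on a tiny data set. The eight values below are illustrative only; the range is computed with max and min so that the line also runs without the Statistics toolbox.

```matlab
x = [2 4 4 4 5 5 7 9];      % illustrative data, mean = 5

v = var(x);                 % sample variance: sum of squared deviations / (n-1) = 32/7
s = std(x);                 % sample standard deviation, equal to sqrt(v)
r = max(x) - min(x);        % same value as range(x)
w = iqr(x);                 % width of the middle 50% of the data
```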


Data Distributions

A histogram can give a qualitative feel for the shape of a data set. The exact shape of the distribution is given by the distribution’s probability density function (pdf).

The pdf is a mathematical function f(x) such that the area under the curve between two values of x is the probability that a randomly selected x will fall between those two values.
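The "area under the curve" statement can be checked numerically. The sketch below writes out one particular bell-shaped pdf by hand (chosen here only as an example) and integrates it with trapz, so it needs no toolbox.

```matlab
% A bell-shaped pdf with centre mu and width sigma, written out explicitly.
mu = 0; sigma = 1;
f = @(x) exp(-(x - mu).^2 ./ (2*sigma^2)) ./ (sigma*sqrt(2*pi));

% Probability that x falls between -1 and 1 = area under f on [-1, 1].
x = linspace(-1, 1, 2001);
pInside = trapz(x, f(x));

% The total area under any pdf is 1 (the tails beyond +/-8 are negligible here).
xAll = linspace(-8, 8, 4001);
totalArea = trapz(xAll, f(xAll));
```

For this pdf, pInside comes out close to 0.68.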


Discrete Distributions

The pdf for discrete data is a normalized histogram, where the area of a given bar is the probability that a randomly selected data point falls in that interval.

    histogram(weight)
    histogram(weight, 'Normalization', 'pdf')
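To see what the 'pdf' normalization does, the bar heights can be obtained without plotting through histcounts, which accepts the same option. The data below is synthetic (an assumption).

```matlab
% Synthetic weights in kg (assumption, not the course data).
rng(0);
weight = 70 + 12*randn(1000,1);

% Bar heights of the normalized histogram and the bin edges.
[density, edges] = histcounts(weight, 'Normalization', 'pdf');

% Area of each bar = height * width; the areas add up to 1.
binWidths = diff(edges);
totalArea = sum(density .* binWidths);
```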


Continuous Distributions

Uniform;
Normal;
Gamma;
Exponential;
Chi-square;
Student's t;
Fisher-Snedecor F.


Uniform distribution

Uniformly distributed data has an equal probability of appearing anywhere in an interval. The pdf is shaped like a rectangle. The pdf of the standard uniform distribution is 1 on the interval from 0 to 1, and 0 everywhere else.


Uniform probability Distribution

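$$y = f(x \mid a,b) = \frac{1}{b-a}\, I_{(a,b)}(x)$$

Standard uniform distribution
a = 0 and b = 1

Uniform cumulative distribution function

$$p = F(x \mid a,b) = \frac{x-a}{b-a}\, I_{(a,b)}(x), \qquad F(x \mid a,b) = 1 \ \text{for } x \ge b$$

Here $I_{(a,b)}(x)$ is the indicator function, equal to 1 when x lies in the interval and 0 otherwise.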
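As a worked step, the first two moments of the uniform distribution follow directly from the pdf $f(x \mid a,b) = 1/(b-a)$ on the interval. This is the standard textbook derivation, added here for reference:

```latex
\begin{aligned}
E[X] &= \int_a^b \frac{x}{b-a}\,dx = \frac{b^2-a^2}{2(b-a)} = \frac{a+b}{2},\\
E[X^2] &= \int_a^b \frac{x^2}{b-a}\,dx = \frac{b^3-a^3}{3(b-a)} = \frac{a^2+ab+b^2}{3},\\
\operatorname{Var}(X) &= E[X^2]-E[X]^2 = \frac{a^2+ab+b^2}{3}-\frac{(a+b)^2}{4} = \frac{(b-a)^2}{12}.
\end{aligned}
```

For the standard uniform distribution (a = 0, b = 1) this gives a mean of 1/2 and a variance of 1/12.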


Uniform probability distribution
    unifpdf(x,a,b)

Cumulative distribution function
    unifcdf(x,a,b)

Moments
    [M,V] = unifstat(a,b)
returns the mean M and the variance V.
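A quick sketch of the three functions on one interval. The interval [2,4] and the evaluation point 2.5 are arbitrary choices for illustration.

```matlab
a = 2; b = 4;

y = unifpdf(2.5, a, b);      % density inside the interval: 1/(b-a) = 0.5
p = unifcdf(2.5, a, b);      % P(X < 2.5) = (2.5-a)/(b-a) = 0.25
[M, V] = unifstat(a, b);     % mean (a+b)/2 = 3, variance (b-a)^2/12 = 1/3

yOutside = unifpdf(5, a, b); % the density is 0 outside the interval
```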


Exercise

What is the probability that an observation from a standard uniform distribution will be less than 0.75?
    probability = unifcdf(0.75)

What is the probability that an observation from a uniform distribution with a = -1 and b = 1 will be less than 0.75?
    probability = unifcdf(0.75,-1,1)


Exercise

Generate and plot in the same graphic window four uniform pdfs on the intervals [0,1], [-3,2], [2,4] and [0.8,2.5];
generate in the same graphic window, below the first graph, the corresponding cdfs;
evaluate the moments of the distributions;
publish the script.


Normal distribution

A normal distribution is the classic "bell curve" distribution. The most probable values are near the mean, and values further from the mean are less probable.

The normal distribution is defined by two numbers: the mean µ and the standard deviation σ. The standard normal distribution has a mean of 0 and a standard deviation of 1.


Fit and plot a Normal distribution

Recall that normal distributions are defined by the mean and standard deviation. To fit a normal distribution, you first need to calculate these two values.

Exercise
Find the mean and the standard deviation of the height data and assign the values to meanHeight and stdHeight, respectively.

To fit a normal distribution to the height data, you also need to create a vector of x-values over which to calculate the pdf. Create a vector x of 100 evenly spaced values between 135 and 200 using the linspace function.


The normpdf function calculates the normal probability density function from three inputs, the x-values, the mean and the standard deviation:
    pdfHeight = normpdf(x, meanHeight, stdHeight)

Visualize the probability density function by plotting pdfHeight vs x. Enter hold on so the next plot is added to the current figure.

You can overlay the discrete pdf of the height data by using the 'Normalization' option of the histogram function:

    histogram(data, 'Normalization', 'pdf')
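Put together, the steps above might look like the following sketch. The height vector is synthetic (an assumption standing in for the course data); the variable names follow the exercise text.

```matlab
% Synthetic heights in cm (assumption, not the course data).
rng(0);
height = 170 + 10*randn(500,1);

% Step 1: the two parameters of the normal distribution.
meanHeight = mean(height);
stdHeight  = std(height);

% Step 2: x-values and the fitted pdf evaluated on them.
x = linspace(135, 200, 100);
pdfHeight = normpdf(x, meanHeight, stdHeight);

% Step 3: plot the fitted pdf and overlay the normalized histogram.
plot(x, pdfHeight, 'LineWidth', 2);
hold on
histogram(height, 'Normalization', 'pdf');
hold off
xlabel('Height (cm)');
ylabel('Probability density');
```

Because both curves are on the pdf scale, they can be compared directly on the same axes.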


The height data is close to normally distributed, so the normal pdf closely fits the data.
Now, follow the steps of the exercise to overlay a normal distribution on the weight data.
Does the weight data look normally distributed?


disttool

Interactive plots of probability distributions. This is an example that displays a plot of the cdf or pdf of the distributions in the Statistics and Machine Learning Toolbox.

Use popup menus to change the distribution (Normal to Binomial) or the function (cdf to pdf).

Change the parameters of the distribution by typing a new value or by moving a slider.
Interactively calculate new values by dragging a reference line across the plot.


Fitting Distributions

The Distribution Fitter app allows you to plot your own data with different common distributions overlaid to help you visualize the best fit.


Parametric Fitting

If you know your distribution type, you can create a probability distribution object allowing you to access distribution parameters as well as other relevant properties related to the fit.


fitdist function

pd = fitdist(x, distname)

Outputs: pd
A probability distribution object. The distribution specified by distname determines the class type of the returned probability distribution object.

Inputs: x, distname
x: a column vector of data used to calculate the fit parameters.
distname: a string which identifies the probability distribution to model. There are several probability distributions that you can select from.
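A minimal sketch of a parametric fit, using the 'Normal' distribution name on synthetic data (both are choices made here for illustration). It shows how the fitted parameters and an inverse cdf value can be read from the returned object.

```matlab
% Synthetic column vector of data (assumption): true mean 5, true std 2.
rng(0);
x = 5 + 2*randn(2000,1);

pd = fitdist(x, 'Normal');        % probability distribution object
params = pd.ParameterValues;      % fitted parameters, here [mu sigma]
names  = pd.ParameterNames;       % names matching the values above

% Inverse cdf: the value below which 1% of the fitted distribution lies.
q01 = icdf(pd, 0.01);
```

Other distribution names can be passed in the same way; the properties available on pd depend on the distribution chosen.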


Task

Try to create a t Location-Scale probability distribution object named tFit from the data in FTSEreturns, using the distribution name 'tLocationScale'.
Try to create a variable named msn to hold the ParameterValues property of the t Location-Scale probability distribution object.
Find the inverse cdf value at 0.01. Name the result paramVaR99.


Generating Random Numbers

You can generate random numbers based on a distribution. The rand function generates uniformly distributed random numbers. rand(n,1) generates a column vector of n uniformly distributed random numbers.

Exercise
Generate a column vector of 1000 uniformly distributed random numbers. Name the output xUniform.

The uniformly distributed numbers should range between 0 and 1, and each number has an equal probability of appearing in that interval. You can confirm this by creating a histogram.


The randn function generates normally distributed random numbers with mean 0 and standard deviation 1. randn(n,1) generates a column vector of n normally distributed random numbers.

Exercise
Generate a column vector of 1000 normally distributed random numbers. Name the output xNormal.

The normally distributed numbers should have the highest probability of being near the mean. The probability should decrease as the distance from the mean increases.

Create a histogram of the numbers in xNormal.
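The two generators can be compared side by side. The sketch below uses its own variable names (uDemo and nDemo, chosen here so as not to clash with the exercise variables) and fixes the seed so the figure is reproducible.

```matlab
rng(0);                       % reproducible random numbers
uDemo = rand(1000,1);         % uniform on the interval (0,1)
nDemo = randn(1000,1);        % normal with mean 0 and standard deviation 1

% Flat histogram for rand, bell-shaped histogram for randn.
subplot(2,1,1);
histogram(uDemo);
title('rand: uniformly distributed');
subplot(2,1,2);
histogram(nDemo);
title('randn: normally distributed');
```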


Generating Random Numbers: Extreme Value Distribution

You can actually generate random numbers from your data using your probability distribution object. The simulated values are not simply sampled from the data, but generated from the fitted distribution.

[Figures: Generating Random Numbers for the Generalized Extreme Value, Non-Parametric, Logistic, Normal, Rayleigh, and t Location-Scale distributions]

You can generate random numbers from the probability distribution object using the random function. The first input is the probability distribution object, and the second and third inputs represent the number of rows and columns, respectively.
    randomVector = random(distributionObject,1,4)

Task
Try to create a 3-by-4 matrix of random numbers named randMat using the generated distribution object.
Again, try to create a 3-by-4 matrix of random numbers named randMat2 using the generated distribution object.
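A self-contained sketch of the random call. The distribution object is created here from synthetic data with a 'Normal' fit, which is an assumption made only so that the example runs on its own.

```matlab
% Fit a distribution object to synthetic data (assumption).
rng(0);
data = 5 + 2*randn(500,1);
distributionObject = fitdist(data, 'Normal');

% One row, four columns of numbers drawn from the fitted distribution.
randomVector = random(distributionObject, 1, 4);

% A second call draws new numbers from the same fitted distribution.
secondVector = random(distributionObject, 1, 4);
```

The two vectors generally differ, because each call continues the random number stream rather than restarting it.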


Exercise

Compare the probability distributions and the cumulative distributions of the following uniform continuous random variables, and evaluate the first two moments of each distribution:

$$X_1 \sim U(0,1), \quad X_2 \sim U(-3,2), \quad X_3 \sim U(2,4), \quad X_4 \sim U(0.8,1.5).$$
