Fitting probability distributions to data
PhD Cyber Physical Systems
M. Dolfin
University of Messina
5th and 12th June 2018
M. Dolfin Dep. of Engineering - University of Messina
Course Outline
Exploring Data
Visualize data sets, calculate descriptive group statistics, and explore data distributions.
Fitting a Curve to Data
Perform linear and nonlinear regression to fit a curve to data, and evaluate the quality of the fit.
Readings
W. J. Palm III, Introduction to Matlab for Engineers (3rd ed.), McGraw-Hill (2011).
www.mathworks.com
P. Brandimarte, Numerical Methods in Finance and Economics: A Matlab-based Introduction, Wiley-Interscience (2006).
Motivation
In today's data-driven world, we collect information from all sorts of systems. Statistical methods can help us understand the data and draw conclusions.
Learn statistical methods in Matlab to perform common data-driven tasks: exploring, visualizing, and summarizing the data; resampling the data and estimating missing values; and fitting the data to build predictive models.
Data Visualizations
Before trying to perform calculations or draw conclusions from data, it is helpful to get a qualitative feel for the data. Visualization is often a useful first step when beginning to explore data.
Histograms
A histogram provides quick visual insight into how a data set is distributed. The range of possible values is divided into intervals, or bins. Then a bar chart is created, where the height of each bar corresponds to how frequently values in that bin appear in the data.
histogram
The histogram function creates a histogram with the bins chosen automatically based on the data.
histogram function
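The call described above can be sketched as follows. The height vector here is synthetic (drawn with randn) and only stands in for the course data, which is not reproduced in these slides; the mean of 170 cm and spread of 8 cm are assumptions chosen for illustration.

```matlab
% Synthetic stand-in for the course's height data (an assumption, in cm)
rng(0);                               % fix the seed so the figure is reproducible
height = 170 + 8*randn(500, 1);

% Let histogram choose the bins automatically
h = histogram(height);
xlabel('Height (cm)');
ylabel('Count');

% The automatically chosen binning can be inspected on the returned object
fprintf('%d bins of width %.2f cm\n', h.NumBins, h.BinWidth);
```

The returned object h also accepts changes after plotting, for example h.NumBins = 20, which is handy when the automatic choice hides detail.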
Data Visualizations
Box Plots
A box plot is another way to visualize the distribution of a data set. The central box represents the middle 50% of observations, with the red line at the median. The "whisker" lines show the extent of approximately 99% of the data (when the data is normally distributed). Remaining outliers are shown individually with red crosses.
boxplot
Creates a box plot.
boxplot function
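A minimal sketch of a box plot, again on synthetic heights that are an assumption rather than the course data. The quartiles are also computed separately so the box edges can be compared with numbers.

```matlab
% Synthetic stand-in for the height data (an assumption, in cm)
rng(0);
height = 170 + 8*randn(500, 1);

% Draw the box plot; by default the whiskers reach out to 1.5 times
% the interquartile range beyond the box edges
boxplot(height);
ylabel('Height (cm)');

% Quartiles of the same data: the box should span q(1) to q(3),
% with the median line at q(2)
q = quantile(height, [0.25 0.5 0.75]);
fprintf('Q1 = %.1f, median = %.1f, Q3 = %.1f\n', q(1), q(2), q(3));
```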
Data Visualizations
Scatter Plots
A scatter plot explores how two variables are related to each other.
You can use the scatter function or the plot function to create a scatter plot.
To view how one variable changes in response to two other variables, you can use scatter3 or plot3.
scatter function
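A short sketch of a scatter plot of two related variables. Both vectors are synthetic: the linear relation between height and weight and the noise level are assumptions made only to produce a visible trend.

```matlab
% Synthetic, loosely related height (cm) and weight (kg) data (assumptions)
rng(0);
height = 170 + 8*randn(200, 1);
weight = 0.9*(height - 100) + 6*randn(200, 1);

% Scatter plot with labelled axes
scatter(height, weight);
xlabel('Height (cm)');
ylabel('Weight (kg)');

% plot with a marker and no line gives an equivalent picture:
% plot(height, weight, 'o');

% Quantify the visual impression with the correlation coefficient
r = corrcoef(height, weight);         % 2-by-2 correlation matrix
fprintf('Correlation coefficient: %.2f\n', r(1, 2));
```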
Exercise
Create a histogram for each of the variables height and weight;
visualize possible correlation between height and weight data using a scatter plot (add appropriate labels to the axes);
create a boxplot of the height measurements (add an appropriate label to the axis).
Measures of Centrality
From the histograms of height data shown, it appears that women's heights are centered on approximately 160 cm, whereas men's heights are centered on approximately 175 cm.
Measures of Centrality
Quantifying these statements requires calculating some measure of central tendency. There are several standard measures of "average" (or centrality).
Mean
The mean (also referred to as the arithmetic mean, or often simply the average) is a common measure of centrality. The mean is useful for symmetric distributions, but notoriously sensitive to outliers. If your data set is not distributed symmetrically or has extreme outliers, you will need to consider how these factors affect the calculation of the mean.
mean
The mean function calculates the mean of a data set.
Measures of Centrality
Median
The median gives the midpoint of the sorted data, so half the data is greater than the median and half is smaller. The median is much more resistant than the mean to changes in a few data values, and is an especially useful center for nonsymmetric (skewed) distributions, like the distribution of weight data.
Median function
medWeight = median(weight)
meanWeight = mean(weight)
Additional Measures of Centrality
mode: most frequent value in a data set.
trimmean: mean computed after excluding a given percentage of the most extreme values (outliers).
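To see how differently these measures react to an outlier, the following sketch applies them to a small hypothetical weight vector (the values are made up for illustration; the last one is a deliberate outlier).

```matlab
% Hypothetical weights in kg; the last value is a deliberate outlier
weight = [60 62 65 65 68 70 72 75 80 250];

meanWeight = mean(weight);          % 86.7, pulled upward by the outlier
medWeight  = median(weight);        % 69, midpoint of the sorted values
modeWeight = mode(weight);          % 65, the only repeated value

% trimmean(x, 20) discards 20% of the data in total, here the single
% lowest and the single highest value, before averaging: 69.625
trimWeight = trimmean(weight, 20);

fprintf('mean %.1f, median %.1f, mode %.1f, trimmed mean %.3f\n', ...
        meanWeight, medWeight, modeWeight, trimWeight);
```

The median and the trimmed mean stay close to the bulk of the data, while the mean lands above every value except the outlier.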
Exercise
Evaluate the mean of the data on weights and save it in a variable named meanW;
evaluate the median of the data on weights and save it in a variable named medW.
Measures of Spread
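A measure of centrality alone does not tell whether all individuals have the same height or whether they are equally likely to be any height between 135 and 200 cm. A measure of spread is needed to distinguish these cases.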
Standard deviation
Typically used to measure the spread of symmetric distributions that follow a normal distribution ("bell curve").
It is the square root of the variance, which is the average of the squared distances of the data values from the mean (Matlab's std and var divide by n − 1 by default). Because the distances are squared, it tends to amplify the effect of outliers.
std
Calculates the standard deviation of a data set.
Interquartile Range
Like the median (the 50th percentile), the interquartile range is based on percentiles. It gives the distance between the 25th percentile and the 75th percentile, i.e. the width of the region that contains the middle 50% of the data values.
It is useful for nonsymmetric distributions, being resistant to outliers (like the median).
iqr
Calculates the interquartile range of a data set.
Additional measures of Spread
range
Difference between maximum and minimum values.
var
Variance of a data set.
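The spread measures listed so far can be compared on one small data set. The vector below is illustrative only; it was chosen so that the mean is exactly 5 and the arithmetic is easy to follow by hand.

```matlab
% Illustrative data with mean 5; the squared deviations sum to 32
x = [2 4 4 4 5 5 7 9];

v = var(x);      % 32/7: squared deviations divided by n - 1
s = std(x);      % square root of the variance
w = range(x);    % maximum minus minimum: 9 - 2 = 7
q = iqr(x);      % width of the middle 50% of the data

fprintf('var %.4f, std %.4f, range %g, iqr %g\n', v, s, w, q);
```

Replacing the 9 with a much larger value changes var, std, and range considerably, while iqr is unaffected because the quartiles do not move.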
Data Distributions
A histogram can give a qualitative feel for the shape of a data set. The exact shape of the distribution is given by the distribution's probability density function (pdf).
The pdf is a mathematical function f(x) such that the area under the curve between two values of x is the probability that a randomly selected x will fall between those two values.
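The area interpretation of the pdf can be checked numerically. This sketch uses the standard normal distribution as an example (the choice of distribution and of the interval from -1 to 1 is an assumption for illustration) and compares the integral of the pdf with the difference of cdf values.

```matlab
% Standard normal pdf as an anonymous function
f = @(x) normpdf(x, 0, 1);

% Probability of falling between -1 and 1, as the area under the pdf
pArea = integral(f, -1, 1);

% The same probability from the cumulative distribution function
pCdf = normcdf(1) - normcdf(-1);

fprintf('area = %.4f, cdf difference = %.4f\n', pArea, pCdf);
```

Both numbers should agree at about 0.6827, the familiar "68% within one standard deviation" figure.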
Discrete Distributions
The pdf for discrete data is a normalized histogram, where the area of a given bar is the probability that a randomly selected data point falls in that interval.
histogram(weight)
histogram(weight, 'Normalization', 'pdf')
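That the bar areas of a pdf-normalized histogram add up to 1 can be verified without plotting. The sketch below uses histcounts, which applies the same binning and normalization as histogram but returns the numbers instead of drawing; the weight vector is synthetic and is an assumption.

```matlab
% Synthetic stand-in for the weight data (an assumption, in kg)
rng(0);
weight = 70 + 12*randn(1000, 1);

% Bar heights and bin edges of the pdf-normalized histogram
[values, edges] = histcounts(weight, 'Normalization', 'pdf');

% Area of each bar = height times bin width; each area is the
% fraction of data points falling in that bin
barArea = values .* diff(edges);
totalArea = sum(barArea);

fprintf('Total area under the normalized histogram: %.6f\n', totalArea);
```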
Continuous Distributions
Uniform;
Normal;
Gamma;
Exponential;
Chi-square;
Student's t;
F (Fisher-Snedecor).
Uniform distribution
Uniformly distributed data has an equal probability of appearing anywhere in an interval. The pdf is shaped like a rectangle. The pdf of the standard uniform distribution is 1 on the interval from 0 to 1, and 0 everywhere else.
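Uniform probability distribution
$$y = f(x \mid a,b) = \frac{1}{b-a}\, I_{(a,b)}(x)$$
where $I_{(a,b)}(x)$ is the indicator function, equal to 1 for $a < x < b$ and 0 otherwise.
Standard uniform distribution
$a = 0$ and $b = 1$
Uniform cumulative distribution function
$$p = F(x \mid a,b) = \frac{x-a}{b-a}\, I_{(a,b)}(x)$$
with $F = 1$ for $x \ge b$.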
Uniform probability distribution
unifpdf(x,a,b)
Cumulative distribution function
unifcdf(x,a,b)
Moments
[E,V] = unifstat(a,b)
unifstat returns the mean E and the variance V; the standard deviation is sqrt(V).
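A brief sketch of the three functions on one interval. The interval [0, 2] and the evaluation point 0.5 are arbitrary choices for illustration; the comments give the closed-form values that the calls should reproduce.

```matlab
% Illustrative interval (an arbitrary choice)
a = 0;
b = 2;

y = unifpdf(0.5, a, b);    % 1/(b - a) = 0.5 inside the interval
p = unifcdf(0.5, a, b);    % (0.5 - a)/(b - a) = 0.25

% unifstat returns the mean and the variance (not the standard deviation)
[m, v] = unifstat(a, b);   % mean (a + b)/2 = 1, variance (b - a)^2/12 = 1/3
sigma = sqrt(v);           % standard deviation, if needed

fprintf('pdf %.2f, cdf %.2f, mean %.2f, variance %.4f\n', y, p, m, v);
```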
Exercise
What is the probability that an observation from a standard uniform distribution will be less than 0.75?
probability = unifcdf(0.75)
What is the probability that an observation from a uniform distribution with a = -1 and b = 1 will be less than 0.75?
probability = unifcdf(0.75,-1,1)
Exercise
Generate and plot in the same graphics window four uniform pdfs on the intervals [0,1], [-3,2], [2,4] and [0.8,2.5];
generate in the same graphics window, below the first graph, the corresponding cdfs;
evaluate the moments of the distributions;
publish the script.
Normal distribution
A normal distribution is the classic "bell curve" distribution. The most probable values are near the mean, and values further from the mean are less probable.
The normal distribution is defined by two numbers: the mean µ and the standard deviation σ. The standard normal distribution has a mean of 0 and a standard deviation of 1.
Fit and plot a Normal distribution
Recall that normal distributions are defined by the mean and standard deviation. To fit a normal distribution, you first need to calculate these two values.
Exercise
Find the mean and the standard deviation of the height data and assign the values to meanHeight and stdHeight, respectively.
To fit a normal distribution to the height data, you also need to create a vector of x-values over which to calculate the pdf. Create a vector x of 100 evenly spaced values between 135 and 200 using the linspace function.
Fit and plot a Normal distribution
The normpdf function calculates the normal probability density function from three inputs: the x-values, the mean, and the standard deviation.
normpdf(x, mean, std)
Assign the result to pdfHeight and visualize the probability density function by plotting pdfHeight against x. Enter hold on so that the next plot is added to the current figure.
You can overlay the discrete pdf of the height data by using the 'Normalization' option of the histogram function:
histogram(data, 'Normalization', 'pdf')
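Put together, the steps of this exercise might look like the sketch below. The height vector is synthetic and is an assumption; with the course data, only the first two lines would change.

```matlab
% Synthetic stand-in for the height data (an assumption, in cm)
rng(0);
height = 170 + 8*randn(500, 1);

% Step 1: estimate the two parameters of the normal distribution
meanHeight = mean(height);
stdHeight  = std(height);

% Step 2: evaluate the fitted pdf on 100 evenly spaced x-values
x = linspace(135, 200, 100);
pdfHeight = normpdf(x, meanHeight, stdHeight);

% Step 3: plot the fitted pdf and overlay the normalized histogram
plot(x, pdfHeight, 'LineWidth', 1.5);
hold on
histogram(height, 'Normalization', 'pdf');
hold off
xlabel('Height (cm)');
ylabel('Probability density');
legend('Fitted normal pdf', 'Normalized histogram');
```

Because both curves are on the probability density scale, they can be compared directly; a histogram of raw counts would dwarf the pdf.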
Fit and plot a Normal distribution
The height data is close to normally distributed, so the normal pdf closely fits the data.
Now, follow the steps of the exercise to overlay a normal distribution on the weight data.
Does the weight data look normally distributed?
disttool
disttool opens an interactive tool that displays a plot of the cdf or pdf of the distributions in the Statistics and Machine Learning Toolbox.
Use the pop-up menus to change the distribution (for example, Normal to Binomial) or the function (cdf to pdf).
Change the parameters of the distribution by typing a new value or by moving a slider.
Interactively calculate new values by dragging a reference line across the plot.
Fitting Distributions
The Distribution Fitter app allows you to plot your own data with different common distributions overlaid to help you visualize the best fit.
Parametric Fitting
If you know your distribution type, you can create a probability distribution object, which gives you access to the distribution parameters as well as other relevant properties related to the fit.
fitdist function
pd = fitdist(x, distname)
Outputs: pd
A probability distribution object. The distribution specified by distname determines the class type of the returned probability distribution object.
Inputs: x, distname
x: a column vector of data used to calculate the fit parameters.
distname: a string that identifies the probability distribution to model. There are several probability distributions that you can select from.
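A minimal sketch of fitdist with the 'Normal' distribution name, run on synthetic data (an assumption; any column vector would do). It shows how the fitted parameters and a few derived quantities are read from the returned object.

```matlab
% Synthetic column vector of data (an assumption)
rng(0);
x = 170 + 8*randn(500, 1);

% Fit a normal distribution; pd is a probability distribution object
pd = fitdist(x, 'Normal');

% Fitted parameters are available as properties of the object
mu    = pd.mu;
sigma = pd.sigma;

% Confidence intervals for the parameters, one column per parameter
ci = paramci(pd);

% Functions such as mean, std, pdf, cdf, and icdf accept the object
m = mean(pd);

fprintf('mu = %.2f, sigma = %.2f\n', mu, sigma);
```

For other distribution names the property names differ, which is where the generic ParameterNames and ParameterValues properties become useful.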
Task
Try to create a t Location-Scale probability distribution object named tFit from the data in FTSEreturns using tLocationScale.
Try to create a variable named msn to hold the ParameterValues property of the t Location-Scale probability distribution object.
Find the inverse cdf value at 0.01. Name the result paramVaR99.
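One possible way to approach this task is sketched below. The FTSEreturns data set is not included in these slides, so the vector here is a synthetic stand-in drawn from a scaled Student's t distribution; the 5 degrees of freedom and the 1% scale are assumptions.

```matlab
% Synthetic stand-in for FTSEreturns (an assumption): heavy-tailed
% daily returns drawn from a scaled t distribution
rng(0);
FTSEreturns = 0.01 * trnd(5, 2000, 1);

% Fit a t Location-Scale distribution to the returns
tFit = fitdist(FTSEreturns, 'tLocationScale');

% Fitted parameters in the order [mu sigma nu]
% (location, scale, degrees of freedom)
msn = tFit.ParameterValues;

% Inverse cdf at 0.01: the return that is undercut with 1% probability
paramVaR99 = icdf(tFit, 0.01);

fprintf('mu = %.4f, sigma = %.4f, nu = %.2f\n', msn(1), msn(2), msn(3));
fprintf('1%% quantile of the fitted distribution: %.4f\n', paramVaR99);
```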
Generating Random Numbers
You can generate random numbers based on a distribution. The rand function generates uniformly distributed random numbers. rand(n,1) generates a column vector of n uniformly distributed random numbers.
Exercise
Generate a column vector of 1000 uniformly distributed random numbers. Name the output xUniform.
The uniformly distributed numbers should range between 0 and 1, and each number has an equal probability of appearing in that interval. You can confirm this by creating a histogram.
Generating Random Numbers
The randn function generates normally distributed random numbers with mean 0 and standard deviation 1. randn(n,1) generates a column vector of n normally distributed random numbers.
Exercise
Generate a column vector of 1000 normally distributed random numbers. Name the output xNormal.
The normally distributed numbers should have the highest probability of being near the mean. The probability should decrease as the distance from the mean increases.
Create a histogram of the numbers in xNormal.
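The two generators can be compared side by side. This sketch follows the variable names used in the exercises; fixing the seed with rng is an addition made here only so that the figures are reproducible.

```matlab
% Fix the seed so that repeated runs give the same numbers
rng(0);

xUniform = rand(1000, 1);     % uniform on the open interval (0,1)
xNormal  = randn(1000, 1);    % standard normal: mean 0, std 1

% Upper panel: roughly flat histogram
subplot(2, 1, 1);
histogram(xUniform);
title('rand: uniform on (0,1)');

% Lower panel: bell-shaped histogram centered near 0
subplot(2, 1, 2);
histogram(xNormal);
title('randn: standard normal');
```

To obtain normal numbers with another mean and standard deviation, scale and shift the output, for example 170 + 8*randn(1000,1).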
Generating Random Numbers: Extreme Value Distribution
You can also generate random numbers from your data using your probability distribution object. The simulated values are not simply sampled from the data, but generated from the fitted distribution.
[Figure slides: Generating Random Numbers with the Generalized Extreme Value, Non-Parametric, Logistic, Normal, Rayleigh, and t Location-Scale distributions]
You can generate random numbers from the probability distribution object using the random function. The first input is the probability distribution object, and the second and third inputs represent the number of rows and columns, respectively.
randomVector = random(distributionObject,1,4)
Task
Try to create a 3-by-4 matrix of random numbers named randMat using the generated distribution object.
Again, try to create a 3-by-4 matrix of random numbers named randMat2 using the generated distribution object.
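A sketch of the random function applied to a fitted object. The distribution object here is obtained by fitting a normal distribution to synthetic data, which is an assumption; any probability distribution object returned by fitdist could be used in its place.

```matlab
% Build a distribution object from synthetic data (an assumption)
rng(0);
data = 170 + 8*randn(500, 1);
pd = fitdist(data, 'Normal');

% Draw a 3-by-4 matrix of random numbers from the fitted distribution
randMat = random(pd, 3, 4);

% A second call draws new values, so the two matrices differ
randMat2 = random(pd, 3, 4);

disp(randMat);
disp(randMat2);
```

The two matrices differ because each call advances the random number generator; resetting the seed with rng before each call would reproduce the same values.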
Exercise
Compare the probability density functions and the cumulative distribution functions of the following uniform continuous random variables, and evaluate the first two moments of each distribution:
X1 ∼ U(0,1), X2 ∼ U(−3,2), X3 ∼ U(2,4), X4 ∼ U(0.8,1.5).