Visual Displays of Data and Basic Descriptive Statistics http://www.halcyonmaps.com


• Where to get information on R:
  • R: http://www.r-project.org/
    • Just need the base
  • RStudio: http://rstudio.org/
    • A great IDE for R
    • Works on all platforms
    • Sometimes slows down performance…
  • CRAN: http://cran.r-project.org/
    • Library repository for R
    • Click on Search on the left of the website to search for packages and information on packages

Finding our way around R/RStudio

[Figure: RStudio screenshot with the Script Window and the Command Line labeled]

Handy Commands:

• Basic Input and Output
  • Variables store information
  • <- is the assignment operator

x <- 4                        # numeric input
x <- "text goes in quotes"    # text (character) input
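As a minimal sketch of the commands above (the variable name `label` is illustrative only and does not come from the slides), the following stores a number and a character string and prints them back:

```r
# Numeric input: store the number 4 in a variable named x
x <- 4

# Text (character) input: use plain straight quotes, not curly ones
label <- "text goes in quotes"

# Output: typing a variable's name at the command line prints it;
# inside a script, print() does the same thing explicitly
print(x)
print(label)

# class() reports what kind of information a variable stores
print(class(x))      # "numeric"
print(class(label))  # "character"
```

Curly quotes pasted from a word processor or a slide are a common source of syntax errors in R, which is why the example uses straight quotes.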

Handy Commands:

• Get help on an R command:
  • If you know the name: ?command name
    • ?plot brings up the HTML help page for the plot command
  • If you don’t know the name:
    • Use Google (my favorite)
    • ??key word
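The help commands above can be sketched as follows. The search term "histogram" is just an example, and `apropos()` is an extra base R helper not mentioned on the slide. The calls that open help pages are wrapped in `interactive()` so the script also runs unattended:

```r
# Help pages are meant to be read in an interactive session,
# so the calls that open them are guarded here.
if (interactive()) {
  ?plot                     # name known: open the help page for plot
  help("plot")              # the same lookup written as a function call
  ??"histogram"             # name unknown: search the help pages for a key word
  help.search("histogram")  # the same search written as a function call
}

# apropos() lists attached objects whose names contain a pattern
matches <- apropos("hist")
print(matches)
```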

First Thing: Look at your Data!

• Histograms:
  • Histograms: “bin” a variable and plot frequencies

[Figure: two histograms of nD, one titled "Counts" and one titled "Relative Frequencies"]
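A minimal sketch of both kinds of histogram, assuming simulated data: the `nD` values below are made up with `rnorm()` and are not the data behind the original figure.

```r
# Simulated stand-in data: 200 made-up nD values
set.seed(1)
nD <- rnorm(200, mean = 1.5190, sd = 0.0001)

# First plot: counts per bin (the default for hist)
h <- hist(nD, main = "Counts", xlab = "nD")

# Relative frequency of a bin = bin count / total number of observations
rel_freq <- h$counts / sum(h$counts)

# Second plot: same bins, bar heights rescaled to relative frequencies
barplot(rel_freq,
        names.arg = signif(h$mids, 6),  # label each bar with its bin midpoint
        space = 0,                      # no gaps, so it reads like a histogram
        main = "Relative Frequencies",
        xlab = "nD", ylab = "Relative frequency")
```

Note that `hist(nD, freq = FALSE)` plots densities (total bar area equals one) rather than relative frequencies, which is why the rescaling is done by hand here.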


Histograms

First Thing: Look at your Data!

• Box and Whiskers Plots:

[Figure: box-and-whiskers plot on a horizontal axis running from 1.5188 to 1.5192, with these parts labeled:]
  – 25th percentile (1st quartile)
  – median (50th percentile)
  – 75th percentile (3rd quartile)
  – range
  – possible outliers (at both ends)
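The parts of a box-and-whiskers plot can be computed directly. This is a sketch on simulated data (the `nD` values are made up, with two extreme values added on purpose), not the data from the slide:

```r
# Simulated stand-in data plus two deliberately extreme values
set.seed(2)
nD <- c(rnorm(100, mean = 1.5190, sd = 0.00005), 1.5188, 1.5192)

# Horizontal box-and-whiskers plot
boxplot(nD, horizontal = TRUE, xlab = "nD")

# The numbers behind the picture: quartiles, median and range
q <- quantile(nD, c(0.25, 0.50, 0.75))
print(q)
print(range(nD))

# Points that boxplot() draws individually as possible outliers
flagged <- boxplot.stats(nD)$out
print(flagged)
```

`boxplot()` builds the box from hinges, which can differ slightly from the `quantile()` quartiles in small samples.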

Box-and-Whiskers

• Note the relationship:

Box-and-Whiskers

[Figure: two box-and-whiskers plots, one titled "With Outliers" and one titled "Without Outliers"]
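One way to draw the same box-and-whiskers plot with and without its outlier points is the `outline` argument of `boxplot()`. The data below are simulated for illustration only:

```r
# Simulated data plus two extreme values
set.seed(3)
x <- c(rnorm(50), 6, 7)

par(mfrow = c(1, 2))                                    # two plots side by side
boxplot(x, outline = TRUE,  main = "With Outliers")     # the default
boxplot(x, outline = FALSE, main = "Without Outliers")  # outlier points not drawn
par(mfrow = c(1, 1))                                    # restore the layout
```

`outline = FALSE` only hides the points in the drawing; the values are still part of `x`.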


Box-and-Whiskers

Stem-and-Leaf Displays

• Consider a numerical data set x1, x2, x3,…, xn
  – Each xi consists of at least two digits.
  – An informative visual representation is a stem-and-leaf display.

Stem-and-Leaf Displays

[Figure: stem-and-leaf display with the columns labeled "Stems" and "Leaves for each stem"]
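Base R draws a stem-and-leaf display as text with `stem()`. The data set below is made up for illustration:

```r
# Made-up data set in which every value has two digits
x <- c(12, 15, 21, 23, 23, 27, 31, 34, 38, 42, 45, 47, 47, 53)

# Stems are printed to the left of the "|", leaves to the right
stem(x)
```

The `scale` argument of `stem()` stretches or compresses the display if the default grouping of stems is too coarse or too fine.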

Dotplots

• Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
  – When a value occurs more than once, there is a dot for each occurrence.
  – Dots are stacked vertically.
• A dotplot is useful when:
  – there is not a large set of data
  – there are relatively few distinct values.
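A dotplot of this kind can be sketched in base R with `stripchart()` and `method = "stack"`. The data are made up for illustration:

```r
# Small made-up data set with a few repeated values
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 8)

# One dot per observation; repeated values are stacked vertically
stripchart(x, method = "stack", pch = 16, xlab = "Measured value")

# The height of each stack is how often that value occurs
print(table(x))
```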


Dotplots

Measures of Location

• Given a sample from some population:
  • What is a good “summary” value which well describes the sample?
• We will look at:
  • Average (arithmetic mean)
  • Median
  • Mode

For reference see (available on-line): L. A. Mohammed, B. Found, M. Caligiuri and D. Rogers, “The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures”, J Forensic Sci 56(1), S136-S141 (2011)


Histogram Points of Interest

• Velocity for the first segment of genuine signatures in (soon to be classic) Mohammed et al. study.

• What is a good summary number?

• How spread out is the data? (We will talk about this later)

Measures of Location

• Arithmetic sample mean (average):
  • The sum of data divided by number of observations:

    intuitive formula: $$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} $$

    fancy formula: $$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

Measures of Location

• Example from LAM study:
  • Compute the average absolute size of segment 1 for the genuine signature of subject 2:

    Subj. 2; Gen; Seg. 1    Absolute Size (cm)
     1                      0.0548
     2                      0.2951
     3                      0.1026
     4                      0.1005
     5                      0.2491
     6                      0.1287
     7                      0.0496
     8                      0.2299
     9                      0.256
    10                      0.0538
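The average for this example can be computed in R from the ten tabulated values. The variable names are illustrative:

```r
# Absolute size (cm), subject 2, genuine signature, segment 1
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.256, 0.0538)

# Sum of the data divided by the number of observations
n <- length(abs_size)
avg_by_hand <- sum(abs_size) / n

# The built-in function gives the same number
avg_builtin <- mean(abs_size)

print(avg_by_hand)
print(avg_builtin)
```

Both come to about 0.152 cm (1.5201 / 10).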

Measures of Location

• Example:
  • More useful: Consider again Absolute Average Velocity for Genuine Signatures across all writers in the LAM study:
    92 subjects × 10 measurements/subject = 920 velocity measurements
  • Average Absolute Average Velocity:

Measures of Location

• Follow up question:
  • Is there a difference in the Abs. Avg. Veloc. for Genuine signatures vs. Disguised signatures (DWM and DNM)?

[Figure: three panels titled "Genuine", "DWM" and "DNM"]

• We will learn how to answer this, but not yet.

Measures of Location

• Sample median:
  • Ordering the n pieces of data from smallest value to largest value, the median is the “middle value”:
  • If n is odd, median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ largest data point.
  • If n is even, median is the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ largest data points.
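The two cases of the median rule can be sketched as a small function and compared with R's built-in `median()`. The function name `manual_median` and the sample values are hypothetical, chosen for illustration:

```r
# Median by the "middle value" rule (illustrative helper, not a base R function)
manual_median <- function(x) {
  s <- sort(x)      # order the data from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # n odd: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # n even: average of the two middle values
  }
}

odd_sample  <- c(7, 1, 5, 3, 9)      # n = 5
even_sample <- c(7, 1, 5, 3, 9, 11)  # n = 6

print(manual_median(odd_sample))   # 5
print(manual_median(even_sample))  # 6
```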

Measures of Location

• Example:
  • Median of Average Absolute Velocity for Genuine Signatures, LAM:

[Figure: plot with the average (Avg) marked]

Measures of Location

• Sample mode:
  • Needs careful definition but basically:
  • The data value that occurs the most

[Figure: plot with the average (Avg), the median (Med) and mode = 9.2541 marked]
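Base R has no function for the statistical mode (its `mode()` reports how an object is stored). A minimal sketch of "the data value that occurs the most" follows. The helper name `sample_mode` is hypothetical, and it assumes discrete or rounded data where values actually repeat:

```r
# Most frequent value(s) in a sample (illustrative helper)
sample_mode <- function(x) {
  counts <- table(x)                                # how often each value occurs
  as.numeric(names(counts)[counts == max(counts)])  # keep all values tied for most
}

x <- c(1, 2, 2, 3, 3, 3, 4)
print(sample_mode(x))  # 3

# Ties are possible, so more than one value can come back
y <- c(1, 1, 2, 2, 3)
print(sample_mode(y))  # 1 2
```

For continuous measurements, exact repeats are rare, which is part of why the mode needs a careful definition; in practice it is often read off a binned histogram instead.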

Measures of Location

• Some trivia:
  • Nice and symmetric: Mean = Median = Mode

[Figure: a symmetric distribution, and a second distribution with the Mean and Modes labeled]


Measures of Location

Measures of Location

• Toss out the largest 5% and smallest 5% of the data
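The slide does not name the statistic, but assuming the idea is to average what remains after the tossing, this is a trimmed mean. In R that is the `trim` argument of `mean()`. The data below are made up to show the effect of one wild value:

```r
# Made-up data: the integers 1 to 19 plus one wild value
x <- c(1:19, 1000)

# Ordinary average: pulled far upward by the single extreme value
print(mean(x))               # 59.5

# Drop the smallest 5% and largest 5% (here: one value from each end),
# then average what is left
print(mean(x, trim = 0.05))  # 10.5
```

With 20 observations, 5% is exactly one value from each end, so the trimmed mean is the average of 2 through 19.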

Measures of Data Spread

• Sample variance:
  • (Almost) the average of squared deviations from the sample mean:

    $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

    where $x_i$ is data point i, $\bar{x}$ is the sample mean, and there are n data points.

• Standard deviation is $s = \sqrt{s^2}$
• The sample average and standard dev. are the most common measures of central tendency and spread
• Sample average and standard dev have the same units
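The variance and standard deviation formulas can be checked against R's `var()` and `sd()`, which also divide by n - 1. The data values are made up for illustration:

```r
# Made-up sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)  # there are n data points
x_bar <- mean(x)    # sample mean

# Sample variance: sum of squared deviations divided by (n - 1)
s2_by_hand <- sum((x - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance
s_by_hand <- sqrt(s2_by_hand)

# Hand computation next to the built-in functions
print(c(s2_by_hand, var(x)))
print(c(s_by_hand, sd(x)))
```

Dividing by n - 1 rather than n is what makes the variance only "(almost)" an average of the squared deviations.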

Measures of Data Spread

• If you have “enough” data, you can fit a smooth probability density function to the histogram
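Two common ways to put a smooth curve on a histogram are a kernel density estimate (`density()`) and a normal curve using the sample mean and standard deviation. This sketch uses simulated data, and neither choice is claimed to be the method behind the original figure:

```r
# Simulated sample standing in for "enough" data
set.seed(4)
v <- rnorm(1000, mean = 10, sd = 2)

# Histogram on the density scale so the curves are comparable to the bars
hist(v, freq = FALSE, breaks = 30, main = "Histogram with fitted densities")

# Option 1: kernel density estimate (no distribution assumed)
d <- density(v)
lines(d, lwd = 2)

# Option 2: normal curve with the sample mean and standard deviation
m <- mean(v)
s <- sd(v)
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)
```

The histogram must be drawn with `freq = FALSE`; on the count scale the density curves would sit flat along the bottom of the plot.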

Measures of Data Spread

• Trivia: The famous (standardized) “Bell Curve”
  • Also called “normal” and “Gaussian”
  • Mean = 0
  • Std Dev = 1
  • Units are in Std Devs
• Fraction of the data within a given number of standard deviations of the mean:
  • ~68% within ±1s
  • ~95% within ±2s
  • ~99.7% within ±3s
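The coverage figures for the standardized bell curve can be reproduced with `pnorm()`, the cumulative distribution function of the normal distribution:

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Area under the standard normal curve between -k and +k
coverage <- pnorm(k) - pnorm(-k)

print(round(100 * coverage, 2))  # 68.27 95.45 99.73
```

These percentages apply to data that really follow a normal distribution; for skewed or heavy-tailed data they are only a rough guide.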


Measures of Data Spread

Measures of Data Spread

• Sample range:
  • The difference between the largest and smallest value in the sample
  • Very sensitive to outliers (extreme observations)
• Percentiles:
  • The pth percentile data value, x, means that p-percent of the data are less than or equal to x.
  • Median = 50th percentile
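A short sketch of the range and percentiles in R, using made-up data. `quantile()` takes probabilities between 0 and 1, so the 50th percentile is `0.5`:

```r
# Made-up sample
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)

# Sample range: largest value minus smallest value
sample_range <- max(x) - min(x)
print(sample_range)  # 18

# One extreme observation changes the range drastically
x_out <- c(x, 100)
print(max(x_out) - min(x_out))  # 97

# 25th, 50th and 75th percentiles
p <- quantile(x, c(0.25, 0.50, 0.75))
print(p)

# The 50th percentile is the median
print(median(x))  # 12
```

`quantile()` offers several interpolation rules through its `type` argument, so percentiles of small samples can differ slightly between software packages.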

Measures of Data Spread

[Figure: 1st percentile marked at 1.52003 and 99th percentile marked at 1.52008]


Measures of Data Spread

Measures of Data Spread

• Sample relative standard deviation:
  • Ratio of standard dev to the average:

    $$ \%\mathrm{RSD} = 100 \times \frac{s}{\bar{x}} $$

  • Also called coefficient of variation

• Data quality-outliers:
  • Rule of thumb, if:

    xi > 75th-%tile + g×(75th-%tile - 25th-%tile)
    or
    xi < 25th-%tile - g×(75th-%tile - 25th-%tile)

  • xi is an outlier for g = 1.5
  • xi is an extreme outlier for g = 3
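Both quantities can be sketched in a few lines of R. The helper name `flag_outliers` and the data are hypothetical, and the quartiles come from `quantile()` with its default settings:

```r
# TRUE for every value outside the rule-of-thumb fences (illustrative helper)
flag_outliers <- function(x, g = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
  iqr <- q[2] - q[1]                                # 75th-%tile - 25th-%tile
  x > q[2] + g * iqr | x < q[1] - g * iqr
}

# Made-up sample with one moderate and one very large value
x <- c(10, 11, 12, 12, 13, 13, 14, 15, 20, 60)

# Percent relative standard deviation (coefficient of variation)
rsd <- 100 * sd(x) / mean(x)
print(rsd)

print(x[flag_outliers(x, g = 1.5)])  # outliers: 20 60
print(x[flag_outliers(x, g = 3)])    # extreme outliers: 60
```

Here the 25th percentile is 12 and the 75th is 14.75, so the fences are 7.875 and 18.875 for g = 1.5, and 3.75 and 23 for g = 3.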