Visual Displays of Data and Basic Descriptive Statistics



• Where to get information on R:

• R: http://www.r-project.org/
– Just need the base

• RStudio: http://rstudio.org/
– A great IDE for R
– Works on all platforms
– Sometimes slows down performance…

• CRAN: http://cran.r-project.org/
– Library repository for R
– Click on Search on the left of the website to search for packages or info on packages


Finding our way around R/RStudio

Script Window

Command Line

• Basic Input and Output

Handy Commands:

• Variables store information; <- is the assignment operator

• Numeric input: x <- 4

• Text (character) input: x <- "text goes in quotes"
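The commands above can be tried directly at the command line. This is a minimal sketch; the variable names `y` and `total` are illustrative and do not come from the slides.

```r
# Numeric input: store the number 4 in the variable x
x <- 4

# Text (character) input: text goes in quotes
y <- "text goes in quotes"

# print() (or just typing the variable's name) shows what is stored
print(x)
print(y)

# Stored values can be reused in later calculations
total <- x + 1
print(total)

# class() reports what kind of information a variable holds
class(x)   # "numeric"
class(y)   # "character"
```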

• Get help on an R command:

• If you know the name: ?command name
– ?plot brings up the html help page for the plot command

• If you don’t know the name:
– Use Google (my favorite)
– ??key word
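As a short sketch of both help routes: `hist` and the keyword `histogram` are just example search targets chosen here, not ones named in the slides.

```r
# If you know the name of the command:
?plot              # same as help("plot")
help("hist")       # the function form works for any command name

# If you don't know the name, search the installed documentation by keyword:
??histogram        # same as help.search("histogram")
```

In RStudio the results open in the Help pane; at a plain command line they are shown as text.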


First Thing: Look at your Data!

• Histograms: “bin” a variable and plot frequencies

[Figure: histograms of nD in two panels, Counts and Relative Frequencies]
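A minimal sketch of both kinds of histogram in R. The data are simulated and purely hypothetical (they only mimic refractive index values, nD); the variable names are illustrative.

```r
# Hypothetical data: 200 simulated refractive index (nD) values.
# These numbers are made up for illustration only.
set.seed(1)
nD <- rnorm(200, mean = 1.5190, sd = 0.0001)

# Histogram of counts: number of observations falling in each bin
h <- hist(nD, main = "Counts", xlab = "nD")

# Relative frequencies: each bin's count divided by the total count
rel_freq <- h$counts / sum(h$counts)
barplot(rel_freq, names.arg = signif(h$mids, 6),
        main = "Relative Frequencies", xlab = "nD (bin midpoint)")
```

Note that `hist(nD, freq = FALSE)` plots densities (bar areas sum to 1), which is not quite the same as relative frequencies, so the relative frequencies are computed explicitly here.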

• Box and Whiskers Plots:

[Figure: box-and-whiskers plot on an axis running from 1.5188 to 1.5192. Labeled parts: 25th percentile (1st quartile), median (50th percentile), 75th percentile (3rd quartile), range, and possible outliers beyond each whisker]


Box-and-Whiskers

• Note the relationship:

[Figure panels: With Outliers; Without Outliers]
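A rough sketch of a box-and-whiskers plot in R, using made-up data. It assumes the "with/without outliers" panels differ only in whether the outlying points are drawn; the slides do not say how their panels were produced.

```r
# Hypothetical data (made up): 100 simulated values plus two extreme ones
set.seed(2)
x <- c(rnorm(100, mean = 1.5190, sd = 0.0001), 1.5180, 1.5200)

# Five-number summary: minimum, 1st quartile, median, 3rd quartile, maximum
summary_5 <- fivenum(x)
print(summary_5)

# Two panels side by side
par(mfrow = c(1, 2))
boxplot(x, main = "With Outliers")                      # outlying points are drawn
boxplot(x, outline = FALSE, main = "Without Outliers")  # outlying points are hidden
par(mfrow = c(1, 1))
```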

Stem-and-Leaf Displays

• Consider a numerical data set x1, x2, x3,…, xn

– each xi consists of at least two digits.

– An informative visual representation of such data is a stem-and-leaf display.

[Figure: stem-and-leaf display with a column of stems and the leaves for each stem]
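Base R can print a stem-and-leaf display with `stem()`. A small sketch with made-up two-digit values (the `scores` data are hypothetical):

```r
# Hypothetical data (made up): 20 two-digit values
scores <- c(52, 55, 61, 63, 64, 68, 70, 71, 71, 74,
            75, 77, 78, 80, 82, 83, 85, 88, 91, 96)

# Stems (leading digits) are printed left of the bar,
# leaves (trailing digits) to the right
stem(scores)
```

The optional `scale` argument of `stem()` stretches or compresses the display if the default grouping of stems is too coarse or too fine.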


Dotplots

• Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
– When a value occurs more than once, there is a dot for each occurrence.
– Dots are stacked vertically.

• A dotplot is useful when:
– there is not a large set of data
– there are relatively few distinct values.
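One way to draw a dotplot in base R is `stripchart()` with stacking turned on. This is a sketch on made-up data, not necessarily how the slide's figure was made.

```r
# Hypothetical data (made up): a small sample with few distinct values
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 8)

# method = "stack" piles repeated values vertically, one dot per occurrence
stripchart(x, method = "stack", pch = 16, offset = 0.5,
           xlab = "Measurement")

# The height of each stack is the count of that value
counts <- table(x)
print(counts)
```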

Measures of Location

• Given a sample from some population:
– What is a good “summary” value which well describes the sample?

• We will look at:
– Average (arithmetic mean)
– Median
– Mode

For reference see (available on-line): “The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures”

LA Mohammed, B Found, M Caligiuri and D Rogers, J Forensic Sci 56(1), S136-S141 (2011)

Histogram Points of Interest

• Velocity for the first segment of genuine signatures in the (soon to be classic) Mohammed et al. study.

• What is a good summary number?

• How spread out is the data? (We will talk about this later)

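• Arithmetic sample mean (average):
– The sum of the data divided by the number of observations:

Intuitive formula:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Fancy formula:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$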

• Example from LAM study:
– Compute the average absolute size of segment 1 for the genuine signature of subject 2:

Subj. 2; Gen; Seg. 1

No.   Absolute Size (cm)
1     0.0548
2     0.2951
3     0.1026
4     0.1005
5     0.2491
6     0.1287
7     0.0496
8     0.2299
9     0.256
10    0.0538
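The average for this example can be checked in R. The sketch below uses the ten absolute-size values listed above; the variable names are illustrative.

```r
# Absolute size (cm) for Subj. 2; Gen; Seg. 1, as listed above
abs_size <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
              0.1287, 0.0496, 0.2299, 0.256, 0.0538)

n <- length(abs_size)

# "Intuitive" formula: add everything up and divide by how many there are
avg_by_hand <- sum(abs_size) / n

# Built-in equivalent
avg_builtin <- mean(abs_size)

print(avg_by_hand)   # 0.15201
```

So the average absolute size of these ten measurements is about 0.152 cm.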


• Example:
– More useful: Consider again Absolute Average Velocity for Genuine Signatures across all writers in the LAM study:

92 subjects × 10 measurements/subject = 920 velocity measurements

Average Absolute Average Velocity:


• Follow up question:
– Is there a difference in the Abs. Avg. Veloc. for Genuine signatures vs. Disguised signatures (DWM and DNM)?

[Figure panels: Genuine; DWM; DNM]

• We will learn how to answer this, but not yet.


• Sample median:
– Ordering the n pieces of data from smallest value to largest value, the median is the “middle value”:

• If n is odd, the median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ largest data point.

• If n is even, the median is the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ largest data points.
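The odd/even rule above can be written out in a few lines of R. `median_by_hand` is a hypothetical helper name, and the two small samples are made up.

```r
# Hypothetical helper: median by the ordering rule described above
median_by_hand <- function(x) {
  s <- sort(x)                       # order from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # n odd: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # n even: average the two middle values
  }
}

odd_sample  <- c(7, 1, 5)       # made-up data, n = 3
even_sample <- c(7, 1, 5, 3)    # made-up data, n = 4

median_by_hand(odd_sample)     # 5
median_by_hand(even_sample)    # 4
```

In practice the built-in `median()` gives the same answers.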


• Example:
– Median of Average Absolute Velocity for Genuine Signatures, LAM:

[Figure: the average (Avg) is marked]


• Sample mode:
– Needs careful definition but basically: the data value that occurs the most

[Figure: average (Avg), median (Med) and mode = 9.2541 are marked]
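Base R has no built-in function for the sample mode, so one simple approach is to tabulate the values. This is a sketch on made-up, rounded data; for continuous measurements the answer depends on how the values are rounded or binned, which is part of why the mode "needs careful definition".

```r
# Hypothetical data (made up), rounded so that repeated values can occur
x <- c(9.3, 9.1, 9.3, 8.7, 9.3, 9.1, 10.2, 8.7)

# Count how often each distinct value occurs
counts <- table(x)

# Keep every value that reaches the highest count (there may be ties)
sample_mode <- as.numeric(names(counts)[counts == max(counts)])
print(sample_mode)   # 9.3
```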


• Some trivia:
– Nice and symmetric: Mean = Median = Mode

[Figure labels: Mean; Modes]

– Toss out the largest 5% and smallest 5% of the data
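Tossing out the largest and smallest 5% before averaging is commonly called a trimmed mean. The slide does not name it, so treat that label as an assumption. In R the `trim` argument of `mean()` does this; the data below are made up.

```r
# Hypothetical data (made up): 20 values, one of them an extreme observation
x <- c(1:19, 1000)

mean(x)                  # pulled upward by the extreme value

# trim = 0.05 drops the smallest 5% and the largest 5% of the sorted data
# before averaging (here: 1 value from each end)
trimmed <- mean(x, trim = 0.05)
print(trimmed)           # 10.5
```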

Measures of Data Spread

• Sample variance:
– (Almost) the average of squared deviations from the sample mean.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

where $x_i$ is data point i, $\bar{x}$ is the sample mean, and there are n data points.

• Standard deviation is $s = \sqrt{s^2}$

• The sample average and standard dev. are the most common measures of central tendency and spread

• Sample average and standard dev. have the same units
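As a quick check of the definitions, the sample variance and standard deviation can be computed by hand and compared with R's built-ins. The sketch reuses the absolute-size values from the LAM example (Subj. 2; Gen; Seg. 1); the variable names are illustrative.

```r
# Absolute size (cm) values from the LAM example (Subj. 2; Gen; Seg. 1)
x <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
       0.1287, 0.0496, 0.2299, 0.256, 0.0538)

n     <- length(x)
x_bar <- mean(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2_by_hand <- sum((x - x_bar)^2) / (n - 1)

# Standard deviation: square root of the variance (same units as the data)
s_by_hand <- sqrt(s2_by_hand)

# The built-ins var() and sd() use the same n - 1 definition
all.equal(s2_by_hand, var(x))   # TRUE
all.equal(s_by_hand, sd(x))     # TRUE
```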

• If you have “enough” data, you can fit a smooth probability density function to the histogram
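One possible way to overlay a smooth curve on a histogram is a kernel density estimate via `density()`. The slides do not say which fitting method they used, so this is only an illustrative choice, and the data are simulated.

```r
# Hypothetical data (made up)
set.seed(3)
x <- rnorm(500, mean = 10, sd = 2)

# freq = FALSE puts the histogram on the density scale (total bar area = 1)
hist(x, freq = FALSE, main = "Histogram with smooth density", xlab = "x")

# density() computes a kernel density estimate; lines() overlays it
d <- density(x)
lines(d, lwd = 2)
```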


• Trivia: The famous (standardized) “Bell Curve”
– Also called “normal” and “Gaussian”
– Mean = 0
– Std Dev = 1
– Units are in Std Devs
– ~68% of the data lie within ±1s, ~95% within ±2s and ~99% within ±3s
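These percentages can be reproduced from the standard normal cumulative distribution function, `pnorm()` in R. A short sketch (the variable names are illustrative):

```r
# Fraction of a standard normal distribution within k standard deviations
# of the mean: P(-k < Z < k) = pnorm(k) - pnorm(-k)
within_1 <- pnorm(1) - pnorm(-1)
within_2 <- pnorm(2) - pnorm(-2)
within_3 <- pnorm(3) - pnorm(-3)

round(c(within_1, within_2, within_3), 4)   # 0.6827 0.9545 0.9973
```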


• Sample range:
– The difference between the largest and smallest value in the sample
– Very sensitive to outliers (extreme observations)

• Percentiles:
– The pth percentile data value, x, means that p percent of the data are less than or equal to x.
– Median = 50th percentile

[Figure: 1st percentile = 1.52003; 99th percentile = 1.52008]
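A minimal sketch of the range and percentiles in R, reusing the absolute-size values from the LAM example. Note that `quantile()` interpolates between data points by default, so for small samples its answers can differ slightly from other percentile definitions.

```r
# Absolute size (cm) values from the LAM example (Subj. 2; Gen; Seg. 1)
x <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
       0.1287, 0.0496, 0.2299, 0.256, 0.0538)

# range() returns c(smallest, largest); the sample range is their difference
sample_range <- diff(range(x))
print(sample_range)   # 0.2455

# 1st, 50th and 99th percentiles (the 50th percentile is the median)
pct <- quantile(x, probs = c(0.01, 0.50, 0.99))
print(pct)
```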



• Sample relative standard deviation:
– Ratio of standard dev to the average
– Also called coefficient of variation

$$\%\mathrm{RSD} = 100 \times \frac{s}{\bar{x}}$$

• Data quality-outliers:
– Rule of thumb, if:

xi > 75th-%tile + g×(75th-%tile - 25th-%tile)

xi < 25th-%tile - g×(75th-%tile - 25th-%tile)

– xi is an outlier for g = 1.5
– xi is an extreme outlier for g = 3
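A short sketch of the %RSD and the outlier rule of thumb in R. The data and the helper name `flag` are hypothetical. The upper fence adds g times the interquartile range to the 75th percentile, and the lower fence subtracts it from the 25th percentile.

```r
# Hypothetical data (made up): nine similar values and one suspicious one
x <- c(10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.0, 12.5, 9.7, 10.1)

# Relative standard deviation (coefficient of variation), in percent
rsd <- 100 * sd(x) / mean(x)
print(rsd)

# 25th and 75th percentiles and the distance between them
q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# TRUE for points above the upper fence or below the lower fence
flag <- function(g) x > q[2] + g * iqr | x < q[1] - g * iqr

outliers <- x[flag(1.5)]   # outliers: g = 1.5
extreme  <- x[flag(3)]     # extreme outliers: g = 3

print(outliers)   # 12.5
print(extreme)    # 12.5
```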
