Visual Displays of Data and Basic Descriptive Statistics
http://www.halcyonmaps.com
• Where to get information on R:
– R: http://www.r-project.org/
   Just need the base
– RStudio: http://rstudio.org/
   A great IDE for R
   Works on all platforms
   Sometimes slows down performance…
– CRAN: http://cran.r-project.org/
   Library repository for R
   Click on Search on the left of the website to search for packages or info on packages
Finding our way around R/RStudio
Script Window
Command Line
• Basic Input and Output
Handy Commands:
• Variables store information; <- is the assignment operator:
– Numeric input: x <- 4
– Text (character) input: x <- "text goes in quotes"
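A minimal sketch of the two kinds of input, typed at the command line or run from the script window. The second variable name, y, is introduced here only for illustration.

```r
# Numeric input: store the number 4 in x.
x <- 4

# Text (character) input: text goes in straight double quotes.
y <- "text goes in quotes"

# Basic output: print the values and check what kind of data each holds.
print(x)
print(y)
class(x)   # type of x
class(y)   # type of y
```

Typing a variable name on its own at the command line also prints its value.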
• Get help on an R command:
– If you know the name: ?commandname
   e.g. ?plot brings up the HTML help page for the plot command
– If you don't know the name:
   Use Google (my favorite)
   ??keyword
Handy Commands:
• Histograms:
– Histograms "bin" a variable and plot frequencies
[Figure: two histograms of nD; panels: Counts, Relative Frequencies]
First Thing: Look at your Data!
Histograms
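A short sketch of both histogram flavors in base R. The data are simulated here purely for illustration (they are not the data shown on the slide), and the variable names are assumptions.

```r
# Hypothetical data: 200 simulated measurements (NOT the slide's data).
set.seed(1)
x <- rnorm(200, mean = 1.5190, sd = 0.0001)

# "Bin" the variable without drawing, so the bins can be inspected.
h <- hist(x, plot = FALSE)
counts <- h$counts                  # number of observations per bin
rel_freq <- counts / sum(counts)    # fraction of observations per bin

# Histogram of counts.
hist(x, main = "Counts", xlab = "x")

# Relative-frequency version: bar heights are fractions that sum to 1.
barplot(rel_freq, names.arg = signif(h$mids, 6),
        main = "Relative Frequencies", xlab = "x",
        ylab = "Relative frequency")
```

Note that hist(x, freq = FALSE) plots densities (bar areas sum to 1), which is not the same as relative frequencies unless the bin width is 1; that is why the fractions are computed by hand above.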
• Box and Whiskers Plots:
[Figure: box-and-whiskers plot on an axis from 1.5188 to 1.5192, annotated with the 25th percentile (1st quartile), the median (50th percentile), the 75th percentile (3rd quartile), the range, and possible outliers]
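As a rough sketch, base R's boxplot() draws this kind of plot and also returns the numbers behind it. The data below are simulated, with one extreme value planted on purpose; none of it comes from the slides.

```r
# Hypothetical data: 50 simulated values plus one planted extreme point.
set.seed(2)
x <- c(rnorm(50, mean = 1.5190, sd = 0.0001), 1.5196)

# Draw the plot; the return value holds the summary statistics.
b <- boxplot(x, horizontal = TRUE, main = "Box-and-Whiskers")

# Lower whisker, lower hinge (~25th %tile), median, upper hinge (~75th %tile),
# upper whisker.
five <- b$stats[, 1]

# Points drawn individually as possible outliers.
flagged <- b$out
```

By default the whiskers extend at most 1.5 box lengths beyond the box (the range argument of boxplot), and anything farther out is drawn as a separate point.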
First Thing: Look at your Data!
• Note the relationship:
Box-and-Whiskers
With Outliers:
Without Outliers:
Box-and-Whiskers
Stem-and-Leaf Displays
• Consider a numerical data set x1, x2, x3,…, xn
– each xi consists of at least two digits.
– an informative visual representation is a stem-and-leaf display.
[Figure: stem-and-leaf display, labeled with the stems and the leaves for each stem]
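A minimal sketch using base R's stem() function, which prints a stem-and-leaf display to the console. The two-digit data set is made up for illustration.

```r
# Hypothetical two-digit data set (not from the slides).
x <- c(12, 15, 21, 23, 23, 27, 34, 38, 41, 45, 45, 46, 52)

# Print the display: stems to the left of the bar, leaves to the right.
stem(x)

# The scale argument stretches or compresses the display if it is too coarse.
stem(x, scale = 2)
```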
Stem-and-Leaf Displays
Dotplots
• Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
– When a value occurs more than once, there is a dot for each occurrence
– Dots are stacked vertically.
• A dotplot is useful when:
– there is not a large set of data
– there are relatively few distinct values.
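One way to sketch a dotplot in base R is stripchart() with method = "stack", which stacks repeated values vertically. The small data set is invented for illustration.

```r
# Hypothetical small data set with few distinct values.
x <- c(3, 4, 4, 5, 5, 5, 6, 6, 7, 9)

# One dot per observation; repeated values are stacked.
stripchart(x, method = "stack", pch = 16, offset = 0.5, xlab = "value")

# The height of each stack is the number of times the value occurs.
dots_per_value <- table(x)
```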
Dotplots
• Given a sample from some population:
– What is a good "summary" value which well describes the sample?
• We will look at:
• Average (arithmetic mean)
• Median
• Mode
Measures of Location
For reference see (available on-line): “The Dynamic Character of Disguised Behaviour for Text-based, Mixed and Stylized Signatures”
LA Mohammed, B Found, M Caligiuri and D Rogers, J Forensic Sci 56(1), S136-S141 (2011)
Histogram Points of Interest
• Velocity for the first segment of genuine signatures in (soon to be classic) Mohammed et al. study.
• What is a good summary number?
• How spread out is the data? (We will talk about this later)
• Arithmetic sample mean (average):
– The sum of data divided by number of observations:
intuitive formula: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
fancy formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Measures of Location
• Example from LAM study:
– Compute the average absolute size of segment 1 for the genuine signature of subject 2:
Subj. 2; Gen; Seg. 1
      Absolute Size (cm)
 1    0.0548
 2    0.2951
 3    0.1026
 4    0.1005
 5    0.2491
 6    0.1287
 7    0.0496
 8    0.2299
 9    0.256
10    0.0538
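The example can be worked in R as follows. The ten values are the ones tabulated above; the variable names are chosen here for illustration.

```r
# Absolute size (cm), subject 2, genuine signature, segment 1.
size_cm <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
             0.1287, 0.0496, 0.2299, 0.256, 0.0538)

# Sum of the data divided by the number of observations.
n <- length(size_cm)
avg_by_hand <- sum(size_cm) / n

# The built-in function gives the same answer.
avg_builtin <- mean(size_cm)

print(avg_by_hand)   # about 0.152 cm
```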
Measures of Location
• Example:
– More useful: Consider again Absolute Average Velocity for Genuine Signatures across all writers in the LAM study:
92 subjects × 10 measurements/subject = 920 velocity measurements
Average Absolute Average Velocity:
Measures of Location
• Follow up question:
– Is there a difference in the Abs. Avg. Veloc. for Genuine signatures vs. Disguised signatures (DWM and DNM)?
[Figure panels: Genuine, DWM, DNM]
• We will learn how to answer this, but not yet.
Measures of Location
• Sample median:
– Ordering the n pieces of data from smallest value to largest value, the median is the "middle value":
• If n is odd, median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ largest data point.
• If n is even, median is the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ largest data points.
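The odd/even rule can be sketched directly in R and compared with the built-in median(). The function name sample_median is hypothetical; the ten values are the absolute sizes from the subject 2 example.

```r
# Median by the "middle value" rule on the sorted data.
sample_median <- function(x) {
  s <- sort(x)          # order from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # n odd: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # n even: average of the two middle values
  }
}

size_cm <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
             0.1287, 0.0496, 0.2299, 0.256, 0.0538)

med_even <- sample_median(size_cm)     # n = 10, average of 5th and 6th values
med_odd  <- sample_median(c(5, 1, 3))  # n = 3, middle value is 3
```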
Measures of Location
• Example:
– Median of Average Absolute Velocity for Genuine Signatures, LAM:
[Figure label: Avg]
Measures of Location
• Sample mode:
– Needs careful definition but basically:
– The data value that occurs the most
[Figure labels: Avg, Med, mode = 9.2541]
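Base R has no built-in function for the statistical mode (its mode() function reports a storage type instead), so here is one possible sketch built on table(). The function name sample_mode and the data are hypothetical; for continuous measurements the data would usually need to be binned or rounded first, which is part of the "careful definition".

```r
# Return the most frequent value(s); ties give more than one mode.
sample_mode <- function(x) {
  counts <- table(x)                              # occurrences of each value
  as.numeric(names(counts)[counts == max(counts)])
}

one_mode  <- sample_mode(c(1, 2, 2, 3, 3, 3))   # 3 occurs most often
two_modes <- sample_mode(c(1, 1, 2, 2, 5))      # 1 and 2 tie
```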
Measures of Location
• Some trivia:
Nice and symmetric: Mean = Median = Mode
[Figure labels: Mean, Modes]
Measures of Location
Toss out the largest 5% and smallest 5% of the data
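The note above reads like a description of a trimmed mean. Assuming that is what is meant, a minimal sketch uses the trim argument of mean(), which is the fraction of observations dropped from each end. The data are invented to show the effect of one extreme value.

```r
# Hypothetical data: 19 ordinary values and one extreme value.
x <- c(1:19, 1000)

ordinary_mean <- mean(x)               # pulled upward by the value 1000
trimmed_mean  <- mean(x, trim = 0.05)  # drops the smallest and largest 5%

# With n = 20, 5% from each end is one observation: 1 and 1000 are removed.
by_hand <- mean(sort(x)[2:19])
```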
• Sample variance:
– (Almost) the average of squared deviations from the sample mean.
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$
where $x_i$ is data point i, $\bar{x}$ is the sample mean, and there are n data points.
Measures of Data Spread
• Standard deviation is $s = \sqrt{s^2}$
• The sample average and standard dev. are the most common measures of central tendency and spread
• Sample average and standard dev have the same units
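A short sketch of the variance and standard deviation computed by hand and checked against R's var() and sd(), both of which divide by n - 1. The data are made up so the arithmetic is easy to follow.

```r
# Hypothetical data with mean 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
x_bar <- mean(x)

# Squared deviations from the sample mean: they sum to 32 here.
sq_dev <- (x - x_bar)^2

# Divide by n - 1 (hence "almost" the average), then take the square root.
s2 <- sum(sq_dev) / (n - 1)   # 32 / 7
s  <- sqrt(s2)

# Built-in equivalents.
s2_builtin <- var(x)
s_builtin  <- sd(x)
```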
Measures of Data Spread
• If you have "enough" data, you can fit a smooth probability density function to the histogram
Measures of Data Spread
• Trivia: The famous (standardized) "Bell Curve"
– Also called "normal" and "Gaussian"
– Mean = 0
– Std Dev = 1
– Units are in Std Devs
– ~68% of the data lie within ±1s
– ~95% of the data lie within ±2s
– ~99% of the data lie within ±3s
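The coverage percentages for the standardized bell curve can be checked with pnorm(), the cumulative distribution function of the normal distribution in base R. This is only a numerical check of the rule of thumb.

```r
# Fraction of a standard normal (mean 0, std dev 1) within +/- k std devs.
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)

# Show as percentages, rounded to one decimal place.
print(round(100 * coverage, 1))

# Draw the standardized curve itself.
curve(dnorm(x), from = -4, to = 4, xlab = "Std Devs", ylab = "density")
```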
Measures of Data Spread
• Sample range:
– The difference between the largest and smallest value in the sample
– Very sensitive to outliers (extreme observations)
• Percentiles:
– The pth percentile data value, x, means that p percent of the data are less than or equal to x.
– Median = 50th percentile
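A minimal sketch of both quantities in R, reusing the ten absolute-size values from the subject 2 example. Note that quantile() interpolates between data points by default, so for small samples its answers can differ slightly from the plain "p percent at or below" definition.

```r
size_cm <- c(0.0548, 0.2951, 0.1026, 0.1005, 0.2491,
             0.1287, 0.0496, 0.2299, 0.256, 0.0538)

# Sample range: largest value minus smallest value.
sample_range <- diff(range(size_cm))   # 0.2951 - 0.0496

# Percentiles: probs are given as fractions, so 0.50 is the 50th percentile.
p50 <- quantile(size_cm, probs = 0.50, names = FALSE)
p25_p75 <- quantile(size_cm, probs = c(0.25, 0.75), names = FALSE)

# Check the definition: fraction of the data at or below the 50th percentile.
frac_at_or_below <- mean(size_cm <= p50)
```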
Measures of Data Spread
[Figure: 1st percentile = 1.52003, 99th percentile = 1.52008]
Measures of Data Spread
• Sample relative standard deviation:
– Ratio of standard dev to the average
– Also called coefficient of variation
$\%RSD = 100 \times \frac{s}{\bar{x}}$
• Data quality-outliers:
– Rule of thumb, flag xi if either:
xi > 75th-%tile + g×(75th-%tile - 25th-%tile)
xi < 25th-%tile - g×(75th-%tile - 25th-%tile)
– xi is an outlier for g = 1.5
– xi is an extreme outlier for g = 3
Measures of Data Spread
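Both ideas can be sketched in a few lines of R. The function names rsd_percent and flag_outliers and the data are hypothetical; the fences use the 25th and 75th percentiles as returned by quantile(), with the lower fence placed below the 25th percentile.

```r
# Relative standard deviation in percent: 100 * s / x-bar.
rsd_percent <- function(x) 100 * sd(x) / mean(x)

# Flag points beyond g box-lengths outside the 25th/75th percentiles.
flag_outliers <- function(x, g = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]                    # 75th-%tile - 25th-%tile
  x > q[2] + g * iqr | x < q[1] - g * iqr
}

# Hypothetical data with one suspicious value.
x <- c(10, 11, 12, 13, 14, 15, 16, 100)

outlier <- flag_outliers(x, g = 1.5)    # outliers
extreme <- flag_outliers(x, g = 3)      # extreme outliers
rsd <- rsd_percent(c(2, 4, 4, 4, 5, 5, 7, 9))
```

For example, x[outlier] returns the flagged values, and x[!outlier] the data with them set aside.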