R Programming Guy Lebanon September 22, 2015

Data Analysis with R (combined slides)


Page 1: Data Analysis with R (combined slides)

R Programming

Guy Lebanon

September 22, 2015

Page 2: Data Analysis with R (combined slides)

Goals

- Understand when to use R and when not to use it
- Understand basic syntax and be able to write short programs
- Understand scalability issues in R and different ways to resolve them
- Prepare for the next module: visualizing data with R

The module is divided into four parts: (a) getting started, (b) data types, (c) control flow and functions, and (d) scalability and interfaces.

Page 3: Data Analysis with R (combined slides)

R, Matlab, and Python

R is similar to Matlab and Python:

- They run inside an interactive shell or graphical user interface
- They emphasize storing and manipulating data as multidimensional arrays
- They include many general-purpose and specialized packages (linear algebra, statistics, ML, etc.)
- They are typically slower than C, C++, and Fortran (though vectorization can help)
- They can interface with native C++ code for speeding up bottlenecks

Page 4: Data Analysis with R (combined slides)

R, Matlab, and Python

The three languages differ:

- R and Python are open-source and free. Matlab is not.
- It is easier to contribute packages to R
- R has a large group of motivated contributors who contribute high quality packages
- R syntax is more suitable for statistics and data
- R has better graphics capabilities
- R is popular in statistics, biostatistics, and social sciences. Matlab is popular in engineering and applied math. Python is popular in web development and scripting.

Page 5: Data Analysis with R (combined slides)

[Figure: number of contributed packages per year, 2002 to 2014, with a logarithmic vertical axis from 100 to 10000.]

Near exponential growth in contributed packages

Page 6: Data Analysis with R (combined slides)

Running R

- Interactively:
  - Type R at the prompt (type q() to quit)
  - R graphic application
  - R-Studio
  - Within Emacs
- Non-interactively:
  - call a script from R: source("foo.R")
  - call a script from the shell: R CMD BATCH foo.R
  - call a script from the shell: Rscript foo.R
  - executable script whose first line is #!/usr/bin/Rscript, run as ./foo.R < inFile > outFile

Page 7: Data Analysis with R (combined slides)
Page 8: Data Analysis with R (combined slides)

R Language

- ignores whitespace; semicolons are optional but are needed for multiple commands on the same line
- comments: #
- case sensitive
- functional and object-oriented programming (a=b can be rephrased as '='(a,b))
- interpreted, but with lazy evaluation
- not strongly typed
- help() displays help on a function, dataset, etc.

a = 3.2

a = "a string"; b = 2 # no strong typing

print(a)

## [1] "a string"
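Two of the points above, assignment as a function call and lazy evaluation, can be sketched as follows. The function f and its arguments are made up for illustration and do not appear in the slides.

```r
# Assignment is an ordinary function call: both lines bind the value 5.
a = 5
`=`(b, 5)
print(identical(a, b))

# Lazy evaluation: an argument is evaluated only when it is actually used.
# f is a hypothetical function that never touches its second argument,
# so the stop() call passed as y is never executed.
f = function(x, y) {
  x * 2
}
print(f(3, stop("y was evaluated")))
```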

Page 9: Data Analysis with R (combined slides)

ls() # list variable names in workspace memory

# save all variables to a file

save.image(file = "R_workspace")

# save specified variables

save(new.var, legal.var.name, file = "R_workspace")

# load variables saved in file

load("R_workspace")

help("load")

install.packages("ggplot2")

library(ggplot2)

system("ls -al")

Page 10: Data Analysis with R (combined slides)

Scalars

Major scalar types: numeric, integer, logical, string, dates, and factors (NA: not available)

a = 3.2; b = 3 # double types

c = as.integer(b) # cast to integer type

d = TRUE

e = as.numeric(d) # casting to numeric

f = "this is a string" # string

ls.str() # show variables and their types

## a: num 3.2

## b: num 3

## c: int 3

## d: logi TRUE

## e: num 1

## f: chr "this is a string"

Page 11: Data Analysis with R (combined slides)

Factors can be ordered or unordered

# ordered factor

current.season = factor("summer",

levels = c("summer", "fall", "winter", "spring"),

ordered = TRUE)

# unordered factor

my.eye.color = factor("brown",

levels = c("brown", "blue", "green"), ordered = FALSE)
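A short sketch of what the ordering buys in practice; the variable names season and eye are illustrative. Ordered factors support order comparisons that follow the level order, while unordered factors only support equality tests.

```r
season = factor("fall",
                levels = c("summer", "fall", "winter", "spring"),
                ordered = TRUE)
as.integer(season)   # position of the value within the levels
season > "summer"    # order comparison follows the level order

eye = factor("blue", levels = c("brown", "blue", "green"))
is.ordered(eye)      # FALSE: only == and != are meaningful here
eye == "blue"
```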

Page 12: Data Analysis with R (combined slides)

Vectors and Arrays

x = c(4, 3, 3, 4, 3, 1) # c for concatenate

length(x) # return length

2*x+1 # element-wise arithmetic

# Boolean vector (default is FALSE)

y = vector(mode = "logical", length = 4)

# numeric vector (default is 0)

z = vector(length = 3, mode = "numeric")

Page 13: Data Analysis with R (combined slides)

q = rep(3.2, times = 10) # repeat value multiple times

w = seq(0, 1, by = 0.1) # values in [0,1] in 0.1 increments

w = seq(0, 1, length.out = 11) # equally spaced values

w <= 0.5 # boolean vector

any(w <= 0.5) # is it true for some elements?

all(w <= 0.5) # is it true for all elements?

which(w <= 0.5) # for which elements is it true?

w[w <= 0.5] # extracting from w entries for which w<=0.5

subset(w, w <= 0.5) # an alternative with the subset function

w[w <= 0.5] = 0 # zero out all components <= 0.5

Page 14: Data Analysis with R (combined slides)

Arrays are a multidimensional generalization of vectors.

z = seq(1, 20,length.out = 20) # create a vector 1,2,..,20

x = array(data = z, dim = c(4, 5)) # create a 2-d array

x[2,3] # refer to the second row and third column

x[2,] # refer to the entire second row

x[-1,] # all but the first row - same as x[c(2,3,4),]

y = x[c(1,2),c(1,2)] # 2x2 top left sub-matrix

2 * y + 1 # element-wise operation

y %*% y # matrix product (both arguments are matrices)

x[1,] %*% x[1,] # inner product

t(x) # matrix transpose

outer(x[,1], x[,1]) # outer product

rbind(x[1,], x[1,]) # vertical concatenation

cbind(x[1,], x[1,]) # horizontal concatenation

Page 15: Data Analysis with R (combined slides)

Lists

Lists are ordered collections of possibly different types. Named positions allow creating self-describing data.

L=list(name = 'John', age = 55,

no.children = 2, children.ages = c(15, 18))

names(L) # displays all position names

L[[2]] # second element

L[2] # list containing second element

L$name # value in list corresponding to name

L['name'] # list containing that element; L[['name']] gives the value itself

L$children.ages[2] # same as L[[4]][2]
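The difference between single and double brackets is easy to miss, so here is a small check that reuses the list defined on this slide. Single brackets return a sub-list, while double brackets and $ return the stored value.

```r
L = list(name = 'John', age = 55,
         no.children = 2, children.ages = c(15, 18))

class(L[2])                      # "list": single brackets keep the list wrapper
class(L[[2]])                    # "numeric": double brackets return the value
identical(L[["name"]], L$name)   # the two access forms agree
L$age = L$age + 1                # elements can be updated by name
```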

Page 16: Data Analysis with R (combined slides)

Dataframes

A dataframe is an ordered sequence of lists sharing the same signature. A popular use case is a table where rows correspond to data examples and columns correspond to dimensions or features.

vecn = c("John Smith","Jane Doe")

veca = c(42, 45)

vecs = c(50000, 55000)

R = data.frame(name = vecn, age = veca, salary = vecs)

R

## name age salary

## 1 John Smith 42 50000

## 2 Jane Doe 45 55000

names(R) = c("NAME", "AGE", "SALARY") # modify column names

R

## NAME AGE SALARY

## 1 John Smith 42 50000

## 2 Jane Doe 45 55000

Page 17: Data Analysis with R (combined slides)

Datasets

Example: Iris dataset (in datasets package)

names(iris) # lists the dimension (column) names

head(iris, 4) # show first four rows

iris[1,] # first row

iris$Sepal.Length[1:10] # sepal length of first ten samples

# allow replacing iris$Sepal.Length with shorter Sepal.Length

attach(iris, warn.conflicts = FALSE)

mean(Sepal.Length) # average of Sepal.Length across all rows

colMeans(iris[,1:4]) # means of all four numeric columns

subset(iris, Sepal.Length < 5 & Species != "setosa")

# count number of rows corresponding to setosa species

dim(subset(iris, Species == "setosa"))[1]

summary(iris)

Page 18: Data Analysis with R (combined slides)

If-Else

a = 10; b = 5; c = 1

if (a < b) {
  d = 1
} else if (a == b) {
  d = 2
} else {
  d = 3
}
print(d)

## [1] 3

AND: &&, OR: ||, equality: ==, inequality: !=

Page 19: Data Analysis with R (combined slides)

Loops

For, repeat, and while loops:

sm=0

# repeat for 100 iterations, with num taking values 1:100
for (num in seq(1, 100, by = 1)) {
  sm = sm + num
}
repeat {
  sm = sm - num
  num = num - 1
  if (sm == 0) break # if sm == 0 then stop the loop
}
a = 1; b = 10
while (b > a) {
  sm = sm + 1
  a = a + 1
  b = b - 1
}

Page 20: Data Analysis with R (combined slides)

Functions

By default, arguments flow into the parameters according to their order at the call site. Providing parameter names allows out-of-order binding.

foo(10, 20, 30) # parameter bindings by order

foo(y = 20, x = 10, z = 30) # out of order parameter bindings

foo(z = 30) # missing parameters assigned default values
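The function foo is not defined in these slides. A minimal hypothetical definition, assuming three parameters x, y, z with default values, makes the three calls above runnable and shows how each argument was bound.

```r
# Hypothetical definition: foo simply reports how its arguments were bound.
foo = function(x = 1, y = 2, z = 3) {
  c(x = x, y = y, z = z)
}

foo(10, 20, 30)              # bound by position
foo(y = 20, x = 10, z = 30)  # same binding, given by name
foo(z = 30)                  # x and y fall back to their defaults
```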

Page 21: Data Analysis with R (combined slides)

# myPower(.,.) raises the first argument to the power of the

# second. The first argument is named bas and has default value 10.

# The second parameter is named pow and has default value 2.

myPower = function(bas = 10, pow = 2) {
  res = bas^pow # raise base to a power
  return(res)
}
myPower(2, 3) # 2 is bound to bas and 3 to pow (in-order)

# same binding as above (out-of-order parameter names)

myPower(pow = 3, bas = 2)

myPower(bas = 3) # default value of pow is used

Page 22: Data Analysis with R (combined slides)

Vectorized Code

Vectorized code runs much faster than loops due to R interpreter overhead.

a = 1:10000000; res = 0

system.time(for (e in a) res = res + e^2)

## user system elapsed

## 3.742 0.029 3.800

system.time(sum(a^2))

## user system elapsed

## 0.180 0.032 0.250
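As a sanity check that the two forms compute the same quantity, here is a smaller version of the comparison above. The vector length and the names loop.res and vec.res are illustrative choices.

```r
a = 1:100000

# explicit loop: one interpreted iteration per element
loop.res = 0
for (e in a) loop.res = loop.res + e^2

# vectorized: the squaring and the summation each run as a single call
vec.res = sum(a^2)

all.equal(loop.res, vec.res)
```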

Page 23: Data Analysis with R (combined slides)

External/Native API

Often, 10% of the code is responsible for 90% of the computing time. Implementing bottlenecks in C/C++ allows staying mostly within the R environment.

dyn.load("fooC2.so") # load compiled C code

A = seq(0, 1, length = 10)

B = seq(0, 1, length = 10)

.Call("fooC2", A, B)

Newer packages: Rcpp, RcppArmadillo, RcppEigen

Page 24: Data Analysis with R (combined slides)

## [1] 13.34 17.48 21.21 24.71 28.03 31.24 34.34 37.37 40.33 43.24

## [1] 13.34 17.48 21.21 24.71 28.03 31.24 34.34 37.37 40.33 43.24

[Figure: computation time (sec) versus array size (0 to 1000), one curve per language (C and R).]

Page 25: Data Analysis with R (combined slides)

Graphing Data with R

Guy Lebanon

September 22, 2015

Page 26: Data Analysis with R (combined slides)

Goals

- Learn how to use base graphics
- Learn how to use ggplot2
- Understand basic graph types and when to use them

The module is divided into four parts: (a) base graphics, (b) ggplot2, (c) datasets, (d) basic graph types and case studies.

Page 27: Data Analysis with R (combined slides)

Base Graphics

Base graphics syntax: plot function followed by helper functions for annotating the graph.

plot(x = dataframe$col_1, y = dataframe$col_2)

title(main = "figure title") # add title

Examples of low-level functions in the graphics package are:

- title adds or modifies labels of title and axes,
- grid adds a grid to the current figure,
- legend displays a legend connecting symbols, colors, and line-types to descriptive strings, and
- lines adds a line plot to an existing graph.

Page 28: Data Analysis with R (combined slides)

GGPLOT2

Philosophy: (a) Grammar of graphics, (b) logical separation of graphics and data, (c) concise and maintainable code.

Option 1: Use the qplot function. Pass dataframe column names, dataframe name, geometry, and graphing options.

qplot(x = x1,

y = x2,

data = DF,

main = "figure title",

geom = "point")

Remember to install and load package using

install.packages('ggplot2')

library(ggplot2)

Page 29: Data Analysis with R (combined slides)

Option 2: Use the ggplot function. Pass the dataframe, and pass the column names through the aes function. Compose the function output with additional layers using the + operator.

ggplot(dataframe, aes(x = x, y = y)) +

geom_line() + geom_point()

The function (and the addition operator) returns an object that can be printed (using the print function) or saved for later.

Page 30: Data Analysis with R (combined slides)

Datasets

We will use the three datasets below.

- faithful: eruption time and waiting time to next eruption (both in minutes) of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
- mtcars: model name, weight, horsepower, fuel efficiency, and transmission type of cars from 1974 Motor Trend magazine.
- mpg: fuel economy and other car attributes from http://fueleconomy.gov (similar to mtcars but larger and newer).

Page 31: Data Analysis with R (combined slides)

names(faithful)

## [1] "eruptions" "waiting"

names(mtcars)

## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"

## [11] "carb"

names(mpg)

## [1] "manufacturer" "model" "displ" "year"

## [5] "cyl" "trans" "drv" "cty"

## [9] "hwy" "fl" "class"

Page 32: Data Analysis with R (combined slides)

Strip Plot

Strip plots graph one-dimensional numeric data as points in a two-dimensional space, with one coordinate corresponding to the index of the data point, and the other coordinate corresponding to its value.

plot(faithful$eruptions, xlab = "sample number",

ylab = "eruption times (min)",

main = "Old Faithful Eruption Times")

Page 33: Data Analysis with R (combined slides)

[Figure: strip plot titled "Old Faithful Eruption Times"; x-axis: sample number (0 to 250); y-axis: eruption times (min), 1.5 to 4.5.]

Page 34: Data Analysis with R (combined slides)

- We conclude from the figure above that Old Faithful has two typical eruption times: a long eruption time around 4.5 minutes, and a short eruption time around 1.5 minutes.
- It also appears that the order in which the dataframe rows are stored is not related to the eruption variable.

Page 35: Data Analysis with R (combined slides)

Histograms

Histograms graph one-dimensional numeric data by dividing the range into bins and counting the number of occurrences in each bin. It is critical to set the bin width value correctly.

qplot(x = waiting,

data = faithful,

binwidth = 3,

main = "Waiting time to next eruption (min)")

ggplot(faithful ,aes(x = waiting)) +

geom_histogram(binwidth = 1)

Page 36: Data Analysis with R (combined slides)

[Figure: histogram titled "Waiting time to next eruption (min)"; x-axis: waiting (40 to 100); y-axis: count (0 to 40).]

Page 37: Data Analysis with R (combined slides)

There are clearly two typical eruption times: one around 2 minutes and one around 4.5 minutes.

y values can be replaced with probability/frequency using the following syntax.

ggplot(faithful, aes(x = waiting, y = ..density..)) +

geom_histogram(binwidth = 4)

Selecting the best bin width to use when graphing a specific dataset is difficult and usually requires some trial and error.
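The effect of the bin width can also be inspected numerically, without drawing anything. The sketch below uses the base hist function with plot = FALSE; the candidate widths 1, 5, and 20 and the break range 40 to 100 are illustrative choices that cover the waiting values.

```r
w = faithful$waiting

for (bw in c(1, 5, 20)) {
  breaks = seq(40, 100, by = bw)               # bin edges for this width
  h = hist(w, breaks = breaks, plot = FALSE)   # count only, no graph
  cat("bin width", bw, ":", length(h$counts), "bins, largest count",
      max(h$counts), "\n")
}
```

Narrow bins give many sparsely filled bins, while wide bins may merge the two modes into very few bars.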

Page 38: Data Analysis with R (combined slides)

[Figure: histogram titled "Waiting time to next eruption (min)"; x-axis: waiting (40 to 100); y-axis: count (0 to 40).]

Page 39: Data Analysis with R (combined slides)

[Figure: histogram titled "Waiting time to next eruption (min)"; x-axis: waiting (40 to 80); y-axis: count (0 to 15).]

Page 40: Data Analysis with R (combined slides)

[Figure: histogram titled "Waiting time to next eruption (min)"; x-axis: waiting (50 to 100); y-axis: count (0 to 80).]

Page 41: Data Analysis with R (combined slides)

Line Plot

Line plot: a graph displaying a relation between x and y as a line in a Cartesian coordinate system. The relation may correspond to an abstract mathematical function or to a relation between two samples (for example, dataframe columns).

x = seq(-2, 2, length.out = 30)

y = x^2

qplot(x, y, geom = "line") # line plot

qplot(x, y, geom = c("point", "line")) # line and point plot

dataframe = data.frame(x = x, y = y)

ggplot(dataframe, aes(x = x, y = y)) +

geom_line() + geom_point() # same as above but with ggplot

Page 42: Data Analysis with R (combined slides)

S = sort.int(mpg$cty, index.return = TRUE)

# x: city mpg

# ix: indices of sorted values of city mpg

plot(S$x, # plot sorted city mpg values with a line plot

type = "l",

lty = 2,

xlab = "sample number (sorted by city mpg)",

ylab = "mpg")

lines(mpg$hwy[S$ix], lty = 1) # add solid line of hwy mpg (lty = 2 above is dashed)

legend("topleft", c("highway mpg", "city mpg"),

lty = c(1, 2))

Page 43: Data Analysis with R (combined slides)

[Figure: line plot of highway mpg and city mpg; x-axis: sample number (sorted by city mpg), 0 to 200; y-axis: mpg (10 to 35).]

Page 44: Data Analysis with R (combined slides)

Smoothed Histograms

Denoting n values by $x^{(1)}, \ldots, x^{(n)}$, the smoothed histogram is the following function $f_h : \mathbb{R} \to \mathbb{R}_+$

$$f_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x^{(i)})$$

where the kernel function $K_h : \mathbb{R} \to \mathbb{R}$ typically achieves its maximum at 0, and decreases as $|x - x^{(i)}|$ increases. We also assume that the kernel function integrates to one, $\int K_h(x)\,dx = 1$, and satisfies the relation

$$K_h(r) = h^{-1} K_1(r/h).$$

We refer to $K_1$ as the base form of the kernel and denote it as $K$.
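The definition of the smoothed histogram translates almost directly into code. The following is a sketch, assuming a Gaussian base kernel (dnorm); the function name smoothed.hist, the bandwidth h = 4, and the evaluation grid are illustrative choices, not taken from the slides.

```r
# f_h(x) = (1/n) * sum_i K_h(x - x_i), with K_h(r) = K(r / h) / h
# and K the standard normal density.
smoothed.hist = function(x, data, h) {
  sapply(x, function(x0) mean(dnorm((x0 - data) / h) / h))
}

grid = seq(0, 150, by = 0.1)
fh = smoothed.hist(grid, faithful$waiting, h = 4)

sum(fh) * 0.1   # Riemann sum of f_h over the grid; should be close to 1
```

Because every kernel integrates to one and f_h averages n of them, f_h integrates to one as well, which the Riemann sum checks numerically.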

Page 45: Data Analysis with R (combined slides)

Four popular kernel choices are the tricube, triangular, uniform, and Gaussian kernels, defined as $K_h(r) = h^{-1} K(r/h)$ where the $K(\cdot)$ functions are respectively

$$K(r) = (1 - |r|^3)^3 \cdot 1_{\{|r|<1\}} \quad \text{(Tricube)}$$

$$K(r) = (1 - |r|) \cdot 1_{\{|r|<1\}} \quad \text{(Triangular)}$$

$$K(r) = 2^{-1} \cdot 1_{\{|r|<1\}} \quad \text{(Uniform)}$$

$$K(r) = \exp(-r^2/2)/\sqrt{2\pi} \quad \text{(Gaussian)}.$$

As h increases the kernel functions $K_h$ become wider.
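The four kernels can be written as one-line R functions and checked numerically. One assumption is made explicit here: the tricube formula as printed on the slide integrates to 81/70 rather than 1, so the sketch multiplies it by the normalizing constant 70/81, which is not in the slide. The helper name scaled is also made up for illustration.

```r
# Base kernels K(r); the indicator 1{|r| < 1} becomes (abs(r) < 1).
# The factor 70/81 is an added normalizing constant (an assumption, see above).
tricube    = function(r) (70 / 81) * (1 - abs(r)^3)^3 * (abs(r) < 1)
triangular = function(r) (1 - abs(r)) * (abs(r) < 1)
uniform    = function(r) 0.5 * (abs(r) < 1)
gaussian   = function(r) exp(-r^2 / 2) / sqrt(2 * pi)

# Scaled kernel K_h(r) = K(r / h) / h; a larger h gives a wider kernel.
scaled = function(K, h) function(r) K(r / h) / h

integrate(tricube, -1, 1)$value
integrate(scaled(triangular, 2), -2, 2)$value
integrate(scaled(uniform, 2), -2, 2)$value
integrate(gaussian, -Inf, Inf)$value
```

Each integral should come out close to 1, for the base kernels and for their scaled versions alike.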

Page 46: Data Analysis with R (combined slides)

[Figure: the gaussian, triangular, tricube, and uniform kernels K_h(x) for h=1 and h=2; x-axis: x (-2 to 2); y-axis: K_h(x), 0 to 1.]

Page 47: Data Analysis with R (combined slides)

[Figure: smoothed histogram (h=1/6); x-axis: x (-2 to 8); y-axis: f_h(x), 0.00 to 0.30.]

Page 48: Data Analysis with R (combined slides)

[Figure: smoothed histogram (h=1/3); x-axis: x (-2 to 8); y-axis: f_h(x), 0.00 to 0.25.]

Page 49: Data Analysis with R (combined slides)

[Figure: smoothed histogram (h=1); x-axis: x (-2 to 8); y-axis: f_h(x), 0.05 to 0.15.]

Page 50: Data Analysis with R (combined slides)

In ggplot2:

ggplot(faithful, aes(x = waiting, y = ..density..)) +

geom_histogram(alpha = 0.3) +

geom_density(size = 1.5, color = "red")

Page 51: Data Analysis with R (combined slides)

[Figure: histogram with overlaid density curve; x-axis: waiting (40 to 100); y-axis: density (0.00 to 0.04).]

Page 52: Data Analysis with R (combined slides)

Scatter Plot

A scatter plot graphs the relationship between two numeric variables. It graphs each pair of values as a point in a two-dimensional space whose coordinates are the corresponding x, y values.

plot(faithful$waiting,

faithful$eruptions,

pch = 17,

col = 2,

cex = 1.2,

xlab = "waiting times (min)",

ylab = "eruption time (min)")

Page 53: Data Analysis with R (combined slides)

[Figure: scatter plot; x-axis: waiting times (min), 50 to 90; y-axis: eruption time (min), 1.5 to 4.5.]

Page 54: Data Analysis with R (combined slides)

- We conclude from the two clusters in the scatter plot above that there are two distinct cases: short eruptions and long eruptions.
- Furthermore, the waiting times for short eruptions are typically short, while the waiting times for the long eruptions are typically long.
- This is consistent with our intuition: it takes longer to build the pressure for a long eruption than it does for a short eruption.

Page 55: Data Analysis with R (combined slides)

The relationship between two numeric variables and a categorical variable can be graphed using a scatter plot where the categorical variable controls the size, color, or shape of the markers.

plot(mtcars$hp,

mtcars$mpg,

pch = mtcars$am,

xlab = "horsepower",

cex = 1.2,

ylab = "miles per gallon",

main = "mpg vs. hp by transmission")

legend("topright", c("automatic", "manual"), pch = c(0, 1))

Page 56: Data Analysis with R (combined slides)

[Figure: scatter plot titled "mpg vs. hp by transmission"; x-axis: horsepower (50 to 300); y-axis: miles per gallon (10 to 30); legend: automatic, manual.]

Page 57: Data Analysis with R (combined slides)

We draw several conclusions from this graph.

- There is an inverse relationship between horsepower and mpg.
- For a given horsepower amount, manual transmission cars are generally more fuel efficient.
- Cars with the highest horsepower tend to be manual (the two highest horsepower cars in the dataset are Maserati Bora and Ford Pantera, both sports cars with manual transmissions).

Page 58: Data Analysis with R (combined slides)

Changing marker size in a scatter plot

qplot(x = wt,

y = mpg,

data = mtcars,

size = cyl,

main = "MPG vs. weight (x1000 lbs) by cylinder")

Page 59: Data Analysis with R (combined slides)

[Figure: scatter plot titled "MPG vs. weight (x1000 lbs) by cylinder"; x-axis: wt (2 to 5); y-axis: mpg (10 to 35); marker size legend: cyl (4 to 8).]

Page 60: Data Analysis with R (combined slides)

- When data is noisy, it is useful to add a smoothed line curve to visualize median trends
- One technique to address this issue is to add a smoothed line curve $y_S$, which is a weighted average of the original data $(y^{(i)}, x^{(i)})$, $i = 1, \ldots, n$:

$$y_S(x) = \sum_{i=1}^{n} \frac{K_h(x - x^{(i)})}{\sum_{j=1}^{n} K_h(x - x^{(j)})} \, y^{(i)}$$

  where the $K_h$ functions above are the kernel functions described earlier
- $y_S(x)$ is an average of the $y^{(i)}$ values, weighted in a way that emphasizes $y^{(i)}$ values whose corresponding $x^{(i)}$ values are close to $x$.
- The denominator in the definition of $y_S$ ensures that the weights defining the weighted average sum to 1.
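The weighted average above can be sketched in a few lines of base R. This is an illustration of the formula with a Gaussian kernel, not the loess procedure used by ggplot2; the function name smooth.curve and the bandwidth h = 30 are assumptions made for the example.

```r
# Kernel-weighted average y_S(x0) of the y values, using a Gaussian K_h.
smooth.curve = function(x0, x, y, h) {
  w = dnorm((x0 - x) / h) / h   # K_h(x0 - x_i) for every data point i
  w = w / sum(w)                # normalize so that the weights sum to 1
  sum(w * y)
}

grid = seq(min(mtcars$disp), max(mtcars$disp), length.out = 50)
yS = sapply(grid, smooth.curve, x = mtcars$disp, y = mtcars$mpg, h = 30)
range(yS)   # smoothed values stay within the range of the observed mpg
```

Since the weights are non-negative and sum to 1, each smoothed value is a convex combination of the observed y values.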

Page 61: Data Analysis with R (combined slides)

qplot(disp,

mpg,

data = mtcars,

main = "MPG vs Eng. Displacement") +

stat_smooth(method = "loess",
  method.args = list(degree = 0), # loess options go in method.args (ggplot2 >= 2.0.0)
  span = 0.2,
  se = TRUE)

The span parameter influences the value of h in the slide before and can make the line more or less smooth. The optional argument se adds standard errors as a shaded region.

Page 62: Data Analysis with R (combined slides)

[Figure: scatter plot with smoothed curve titled "MPG vs Eng. Displacement"; x-axis: disp (100 to 400); y-axis: mpg (10 to 35).]

Page 63: Data Analysis with R (combined slides)

Facets

- Facets are a way to display multiple graphs next to each other in the same scale with shared axes.
- This is an effective way to visualize data that has higher dimensionality than 2 (mixed numeric-categorical).
- The argument facets in qplot or ggplot takes a formula a ~ b where a, b specify the variables according to which the rows and columns are organized.

qplot(x = wt,

y = mpg,

facets = .~amf,

data = mtcars,

main = "MPG vs. weight by transmission")
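The column amf used in the call above is not part of the built-in mtcars dataset, and its definition does not appear in these slides. A plausible sketch, assuming it is a labeled factor version of the am column (which the mtcars documentation codes as 0 for automatic and 1 for manual):

```r
# Work on a copy (the name mt is illustrative) so mtcars itself is untouched.
mt = mtcars

# amf: assumed factor version of am, with readable panel labels.
mt$amf = factor(mt$am, levels = c(0, 1),
                labels = c("automatic", "manual"))

table(mt$amf)   # number of cars in each facet panel
```

Passing data = mt to the faceting call would then produce one panel per level of amf.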

Page 64: Data Analysis with R (combined slides)

[Figure: "MPG vs. weight by transmission"; two panels (automatic, manual); x-axis: wt (2 to 5); y-axis: mpg (10 to 35).]

Page 65: Data Analysis with R (combined slides)

[Figure: "MPG vs. weight by engine"; two panels (flat, V-shape); x-axis: wt (2 to 5); y-axis: mpg (10 to 35).]

Page 66: Data Analysis with R (combined slides)

[Figure: "MPG vs. weight by transmission and engine"; panels by transmission (automatic, manual) and engine (flat, V-shape); x-axis: wt (2 to 5); y-axis: mpg (10 to 35).]

Page 67: Data Analysis with R (combined slides)

- Manual transmission cars tend to have lower weights and be more fuel efficient
- Cars with V-shape engines tend to weigh less and be more fuel efficient
- Manual transmission and V-engine cars tend to be lighter and more fuel efficient. Automatic transmission and non V-engine cars are heavier and less fuel efficient.

"All pairs" plot:

DF = mpg[, c("cty", "hwy", "displ")]

library(GGally)

ggpairs(DF)

Page 68: Data Analysis with R (combined slides)

[Figure: all-pairs plot of cty, hwy, and displ. Reported correlations: cty vs. hwy 0.956, cty vs. displ -0.799, hwy vs. displ -0.766.]

Page 69: Data Analysis with R (combined slides)

Contour Plots

Contour plots graph the relationship between three numeric variables: z as a function of x, y. Steps: (a) create a grid for x values, (b) create a grid for y values, (c) create an expanded x × y grid, (d) compute values of z on the expanded grid, (e) graph the data.

x_grid = seq(-1, 1, length.out = 100)

y_grid = x_grid

R = expand.grid(x_grid, y_grid)

names(R) = c('x', 'y')

R$z = R$x^2 + R$y^2

ggplot(R, aes(x = x,y = y, z = z)) + stat_contour()

Page 70: Data Analysis with R (combined slides)

[Figure: contour plot of z; x-axis: x (-1.0 to 1.0); y-axis: y (-1.0 to 1.0).]

Page 71: Data Analysis with R (combined slides)

Quantiles and Box-Plots

Box plots are an alternative to histograms that are usually more "lossy" but emphasize quantiles and outliers in a way that a histogram cannot.

- The r-percentile of a numeric dataset is the point at which approximately r percent of the data lie underneath, and approximately 100 - r percent lie above.
- Another name for the r-percentile is the r/100 quantile.
- The median or 50-percentile is the point at which half of the data lies underneath and half above.
- The 25-percentile and 75-percentile are the values below which 25% and 75% of the data lie. These points are also called the first and third quartiles (the second quartile is the median).
- The interval between the first and third quartiles is called the inter-quartile range (IQR) (region covering the central 50% of data).
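These definitions map onto the base R functions quantile, median, and IQR. A small sketch on the faithful waiting times (the choice of dataset and the variable names are illustrative):

```r
w = faithful$waiting

# 0.25, 0.5, and 0.75 quantiles: first quartile, median, third quartile
q = quantile(w, probs = c(0.25, 0.5, 0.75))
q

q[3] - q[1]       # width of the inter-quartile range
IQR(w)            # built-in equivalent
mean(w <= q[2])   # roughly half of the data lies at or below the median
```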

Page 72: Data Analysis with R (combined slides)

The box plot is composed of:

- a box denoting the IQR,
- an inner line bisecting the box denoting the median,
- whiskers extending to the most extreme point no further than 1.5 times the IQR length away from the edges of the box,
- points outside the box and whiskers marked as outliers.

ggplot(mpg, aes("",hwy)) +

geom_boxplot() +

coord_flip() +

scale_x_discrete("")
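The numbers behind a box plot can be inspected with the base function boxplot.stats. The sketch below uses mtcars$hp as an illustrative variable (it is not the variable plotted above); note that boxplot.stats uses Tukey's hinges, which are close to but not always identical to the quartiles.

```r
x = mtcars$hp
b = boxplot.stats(x, coef = 1.5)   # coef: whisker length in units of box height

b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points beyond the whiskers, drawn individually as outliers
```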

Page 73: Data Analysis with R (combined slides)

[Figure: horizontal box plot of hwy; axis range 20 to 40.]

Page 74: Data Analysis with R (combined slides)

ggplot(mpg, aes(reorder(class, -hwy, median), hwy)) +

geom_boxplot() +

coord_flip() +

scale_x_discrete("class")

Page 75: Data Analysis with R (combined slides)

[Figure: horizontal box plots of hwy (20 to 40) by class: compact, midsize, subcompact, 2seater, minivan, suv, pickup.]

Page 76: Data Analysis with R (combined slides)

- The graph suggests the following fuel efficiency order among vehicle classes, from least to most efficient: pickups, SUVs, minivans, 2-seaters, sub-compacts, midsizes, and compacts.
- The compact and midsize categories have almost identical box and whiskers but the compact category has a few high outliers.
- The spread of subcompact cars is substantially higher than the spread in all other categories.
- We also note that SUVs and two-seaters have almost disjoint values (the box and whisker ranges are completely disjoint) leading to the observation that almost all 2-seater cars in the survey have a higher highway mpg than SUVs.

Page 77: Data Analysis with R (combined slides)

QQ-Plots

- Quantile-quantile plots are useful for comparing two datasets, one of which may be sampled from a certain distribution.

ggplot(R, aes(sample = samples)) +

stat_qq(distribution = qt, dparams = pm)

- They are essentially scatter plots of the quantiles of one dataset vs. the quantiles of another dataset.
- The shape of the scatter plot implies the following conclusions (the proofs are straightforward applications of probability theory).

Page 78: Data Analysis with R (combined slides)

- A straight line with slope 1 that passes through the origin implies that the two datasets have identical quantiles, and therefore that they are sampled from the same distribution.
- A straight line with slope 1 that does not pass through the origin implies that the two datasets have distributions of similar shape and spread, but that one is shifted with respect to the other.
- A straight line with slope different from 1 that does not pass through the origin implies that the two datasets have distributions possessing similar shapes but that one is translated and scaled with respect to the other.
- A non-linear S shape implies that the dataset corresponding to the x-axis is sampled from a distribution with heavier tails than the other dataset.
- A non-linear reflected S shape implies that the dataset whose quantiles correspond to the y-axis is drawn from a distribution having heavier tails than the other dataset.
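The translated-and-scaled case can be checked numerically without drawing the plot. In this sketch the second dataset is an exact affine transformation of the first (the constants 2 and 1, the seed, and the sample size are arbitrary choices), so the quantile pairs fall on a straight line whose slope and intercept recover the scale and shift.

```r
set.seed(1)
x = rnorm(500)
y = 2 * x + 1   # same shape as x, but scaled by 2 and shifted by 1

p = seq(0.01, 0.99, by = 0.01)   # probabilities at which to compare quantiles
qx = quantile(x, p)
qy = quantile(y, p)

coef(lm(qy ~ qx))   # intercept (shift) and slope (scale) of the QQ line
```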

Page 79: Data Analysis with R (combined slides)

D = data.frame(samples = c(rnorm(200, 1, 1),

rnorm(200, 0, 1),

rnorm(200, 0, 2)))

D$parameter[1:200] = 'N(1,1)';

D$parameter[201:400] = 'N(0,1)';

D$parameter[401:600] = 'N(0,2)';

qplot(samples,

facets = parameter~.,

geom = 'histogram',

data = D)

Page 80: Data Analysis with R (combined slides)

[Figure: histograms in three panels (N(0,1), N(0,2), N(1,1)); x-axis: samples (-4 to 8); y-axis: count (0 to 50).]

Page 81: Data Analysis with R (combined slides)

D = data.frame(samples = c(rnorm(200, 1, 1),

rnorm(200, 0, 1),

rnorm(200, 0, 2)));

D$parameter[1:200] = 'N(1,1)';

D$parameter[201:400] = 'N(0,1)';

D$parameter[401:600] = 'N(0,2)';

ggplot(D, aes(sample = samples)) +

stat_qq() +

facet_grid(.~parameter)

Page 82: Data Analysis with R (combined slides)

[Figure: QQ plots in three panels (N(0,1), N(0,2), N(1,1)); x-axis: theoretical (-3 to 3); y-axis: sample (-5.0 to 5.0).]

Page 83: Data Analysis with R (combined slides)

x_grid = seq(-6, 6, length.out = 200)

R = data.frame(density = dnorm(x_grid, 0, 1))

R$tdensity = dt(x_grid, 1.5)

R$x = x_grid

ggplot(R, aes(x = x, y = density)) +

geom_area(fill = I('grey')) +

geom_line(aes(x = x, y = tdensity)) +

labs(title = "N(0,1) (shaded) and t-distribution (1.5 dof)")

Page 84: Data Analysis with R (combined slides)

[Figure: "N(0,1) (shaded) and t-distribution (1.5 dof)"; x-axis: x (-6 to 6); y-axis: density (0.0 to 0.4).]

Page 85: Data Analysis with R (combined slides)

x_grid = seq(-6, 6, length.out = 200)

R = data.frame(density = dnorm(x_grid, 0, 1))

R$samples = rnorm(200, 0, 1)

pm = list(df = 1.5)

ggplot(R, aes(sample = samples)) +

stat_qq(distribution = qt, dparams = pm)

Page 86: Data Analysis with R (combined slides)

[Figure: QQ plot; x-axis: theoretical (-30 to 30); y-axis: sample (-2 to 2).]

Page 87: Data Analysis with R (combined slides)

Preprocessing Data

Guy Lebanon

September 30, 2015


Goals

- Learn how to handle missing data

- Learn how to handle outliers

- Learn when and how to transform data

- Learn standard data manipulation techniques

The module is divided into four parts, one for each of the goals above.


Missing Data

Data may be missing for a variety of reasons:

- It was corrupted during transfer or storage.

- Some instances in the data collection process were skipped due to the difficulty or price associated with obtaining the data.

Different features may be missing in different samples: the first sample (row) may have the third feature (column) missing, while the second sample may have the fifth feature missing.


Examples of Missing Data

- Recommendation systems recommend items from a catalog to users based on historical user ratings. Often, there are a lot of items in the catalog and each user typically indicates their star ratings for only a small subset of them.

- In longitudinal studies, some of the subjects may not be able to attend each of the surveys throughout the study period. The study organizers may also have lost contact with some of the subjects, in which case all measurements beyond a certain time point are missing.

- In sensor data, some of the measurements may be missing due to sensor failure, battery discharge, or electrical interference.

- In user surveys, users may choose not to respond to some of the questions for privacy reasons.


Missing Completely at Random

- If a variable (dataframe column) is as likely to be missing as all other variables, we say that it is missing completely at random (MCAR).

- For example, in the case of users rating movies using 1-5 stars, we consider ratings of specific movies as dataframe columns and ratings associated with specific users as dataframe rows. Since some movies are more popular than others, some columns are more likely to be missing than others, violating the MCAR definition.


Missing at Random (MAR)

- MAR occurs when the probability that a variable is missing depends only on the other information available in the dataset.

- For example, in a survey recording gender, race, and income, gender and race are not very objectionable questions, so we assume for now that the survey respondents answer these questions fully. The income question is more sensitive and users may choose not to respond for privacy reasons.

- The tendency to report or not report income typically varies from person to person. If it depends only on gender and race, then the data is MAR.

- If the decision whether or not to report income also depends on other variables that are not in the dataframe (such as age or profession), the data is not MAR.
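
The survey example can be simulated in a few lines. The sketch below is illustrative only: the variable names, income distributions, and response probabilities are made-up assumptions, not values from a real survey. Missingness of income depends only on the fully observed gender column, so the data is MAR but not MCAR.

```r
# Simulated survey: income is missing with a probability that depends only on
# gender (a fully observed column), so the data is MAR but not MCAR.
# All numbers below are hypothetical and chosen only for illustration.
set.seed(1)
n <- 10000
gender <- sample(c("F", "M"), n, replace = TRUE)
income <- ifelse(gender == "F", rnorm(n, 60, 10), rnorm(n, 50, 10))

# Probability of NOT reporting income: 0.6 for F, 0.1 for M
pMissing <- ifelse(gender == "F", 0.6, 0.1)
observed <- income
observed[runif(n) < pMissing] <- NA

trueMean <- mean(income)                     # mean of the complete data
naiveMean <- mean(observed, na.rm = TRUE)    # ignores the missingness mechanism

# Within each gender the observed incomes are still representative,
# so reweighting the group means by the known group sizes removes the bias
groupMeans <- tapply(observed, gender, mean, na.rm = TRUE)
groupSizes <- tapply(gender, gender, length)
adjustedMean <- sum(groupMeans * groupSizes) / n

print(c(true = trueMean, naive = naiveMean, adjusted = adjustedMean))
```

In this setup the naive mean underestimates the true mean because the higher-income group responds less often, while the reweighted estimate uses the fully observed column to correct for that.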


Handling Missing Data

Most methods are designed to work with fully observed data. Below are some general ways to convert missing data to non-missing data.

- Remove all data instances (for example, dataframe rows) containing missing values.

- Replace all missing entries with a substitute value, for example the mean of the observed instances of the missing variable.

- Estimate a probability model for the missing variable and replace the missing value with one or more samples from that probability model.

In the case of MCAR, all three techniques above are reasonable in that they may not introduce systematic errors. In the more likely case of MAR or non-MAR data, the methods above may introduce systematic bias into the data analysis process.
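
The three approaches can be sketched on a toy dataframe as follows. The data values are made up, and the normal model in step (3) is an assumption chosen for simplicity rather than a recommendation from the slides.

```r
# Toy dataframe with missing entries (values are made up for illustration)
D <- data.frame(age = c(23, 35, NA, 41, 29),
                income = c(50, NA, 62, 58, NA))

# (1) Remove every row that contains a missing value
rowsRemoved <- D[complete.cases(D), ]

# (2) Replace each missing entry with the mean of the observed values
#     in the same column
meanImputed <- D
for (col in names(meanImputed)) {
  isMissing <- is.na(meanImputed[[col]])
  meanImputed[[col]][isMissing] <- mean(meanImputed[[col]], na.rm = TRUE)
}

# (3) Replace each missing entry with a draw from a simple probability model;
#     here, a normal distribution fitted to the observed values of the column
set.seed(1)
modelImputed <- D
for (col in names(modelImputed)) {
  isMissing <- is.na(modelImputed[[col]])
  observedValues <- modelImputed[[col]][!isMissing]
  modelImputed[[col]][isMissing] <- rnorm(sum(isMissing),
                                          mean(observedValues),
                                          sd(observedValues))
}

print(rowsRemoved)
print(meanImputed)
print(modelImputed)
```

Row removal keeps only two of the five rows here, which shows how quickly it can discard data when missing values are spread across columns.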


Missing Data and R

- R represents missing data using the NA symbol.

- The function is.na returns a data structure having TRUE values where the corresponding data is missing and FALSE otherwise.

- complete.cases() returns a vector whose components are FALSE for all samples (dataframe rows) containing missing values and TRUE otherwise.

- na.omit() returns a new dataframe omitting all samples (dataframe rows) containing missing values.

- Some functions have an na.rm argument, which if set to TRUE changes the function behavior so that it operates on the supplied data after removing the missing values.
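
A minimal sketch of these functions on a small made-up dataframe (the column names x and y are arbitrary):

```r
# Small dataframe with missing values (made-up data)
D <- data.frame(x = c(1, NA, 3, 4),
                y = c("a", "b", NA, "d"))

is.na(D$x)           # FALSE  TRUE FALSE FALSE
complete.cases(D)    #  TRUE FALSE FALSE  TRUE
na.omit(D)           # keeps rows 1 and 4 only

mean(D$x)                  # NA, since one value is missing
mean(D$x, na.rm = TRUE)    # (1 + 3 + 4) / 3
colMeans(is.na(D))         # fraction of missing values per column: 0.25, 0.25
```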


The code below analyzes the dataframe movies in the ggplot2 package (moved to the separate ggplot2movies package as of ggplot2 2.0.0), which contains 24 attributes (genre, year, budget, user ratings, etc.) for 58788 movies obtained from the website http://www.imdb.com, with some missing values.

mean(movies$length) # average length

## [1] 82.34

mean(movies$budget) # average budget

## [1] NA

# average budget (removing missing values)

mean(movies$budget, na.rm = TRUE)

## [1] 13412513

mean(is.na(movies$budget)) # frequency of missing budget

## [1] 0.9113


moviesNoNA = na.omit(movies)
qplot(rating, budget, data = moviesNoNA, size = I(1.2)) +
  stat_smooth(color = "red", size = I(2), se = F)

[Figure: scatter plot of budget (y-axis) versus rating (x-axis) for the movies without missing values, with a red smoothed curve]

moviesNoNA = na.omit(movies)
qplot(rating, votes, data = moviesNoNA, size = I(1.2))

[Figure: scatter plot of votes (y-axis) versus rating (x-axis) for the movies without missing values]

- The number of votes (which can be used as a surrogate for popularity) tends to increase as the average rating increases.

- The spread in the number of votes increases with the average rating.

- Movies featuring the highest average ratings have a very small number of votes.

Note that users tend to see movies that they think they will like, and thus the observed ratings tend to be higher than ratings gathered after showing users random movies.


Outliers

There are two types of outliers:

- Corrupted values, for example, human errors during a manual process of entering measurements in a spreadsheet.

- Substantially unlikely values given our modeling assumptions, for example the Black Monday stock crash on October 19, 1987, when the Dow Jones Industrial Average lost 22% in one day.

In both cases, data analysis based on outliers may result in drastically wrong conclusions.


library(Ecdat)
data(SP500, package = 'Ecdat')
qplot(r500,
      main = "Histogram of log(P(t)/P(t-1)) for SP500 (1981-91)",
      xlab = "log returns",
      data = SP500)

[Figure: "Histogram of log(P(t)/P(t-1)) for SP500 (1981-91)"; x-axis: log returns, y-axis: count]

qplot(seq(along = r500),
      r500,
      data = SP500,
      geom = "line",
      xlab = "trading days since January 1981",
      ylab = "log returns",
      main = "log(P(t)/P(t-1)) for SP500 (1981-91)")

[Figure: "log(P(t)/P(t-1)) for SP500 (1981-91)", line plot; x-axis: trading days since January 1981, y-axis: log returns]

Robustness

Robustness describes a lack of sensitivity of data analysis procedures to outliers.

- The mean of n numbers is a non-robust procedure, while the median is a robust procedure.

- Assuming a symmetric distribution of samples around 0, we expect the mean to be zero, or at least close to it. But the presence of a single outlier (a very positive or very negative value) may substantially affect the mean calculation and drive it far away from zero, even for large n.

- In contrast, the median changes very little, if at all.
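
The contrast between the mean and the median can be checked numerically. The sketch below uses simulated data; the sample size and the size of the corrupted value are arbitrary choices.

```r
set.seed(1)
x <- rnorm(1000)        # samples distributed symmetrically around 0
xOutlier <- x
xOutlier[1] <- 1e6      # corrupt a single value

# Change in each statistic caused by the single outlier
meanShift <- abs(mean(xOutlier) - mean(x))
medianShift <- abs(median(xOutlier) - median(x))
print(c(meanShift = meanShift, medianShift = medianShift))
```

The mean moves by roughly the outlier divided by n, while the median moves by no more than the gap between neighboring order statistics near the center of the sample.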


Dealing with Outliers

Truncating. Remove all values deemed as outliers.

Winsorization. Replace outliers with the most extreme of the remaining values.

Robustness. Analyze the data using a robust procedure.


Removing Outliers

To remove outliers we need to first detect them.

- Values below the $\alpha$ percentile or above the $100 - \alpha$ percentile, for some small $\alpha > 0$.

- Values more than c standard deviations away from the mean.

- This is a chicken-and-egg problem, since the standard deviation and mean calculations above will be corrupted by outliers. One solution is to compute the mean and standard deviation after removing the most extreme values. Alternatively, percentiles (which are more robust) can be used.
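
The percentile rule can be sketched with the base R function quantile(). The data below is simulated, and the value of alpha (in percent) is a hypothetical choice.

```r
set.seed(1)
x <- c(rnorm(98), 1000, -1000)    # 98 ordinary values and 2 corrupted ones

alpha <- 2                         # hypothetical choice, in percent
limits <- quantile(x, probs = c(alpha, 100 - alpha) / 100, names = FALSE)

# Keep only the values strictly inside the percentile limits
notOutlier <- (limits[1] < x) & (x < limits[2])
xWithoutOutliers <- x[notOutlier]

print(limits)
print(range(xWithoutOutliers))
```

Note that a percentile rule always discards a fixed fraction of the data, so a few legitimate extreme values are removed along with the corrupted ones.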


originalData = rnorm(20)originalData[1] = 1000sortedData = sort(originalData)originalData = originalData[3:18]lowerLimit = mean(sortedData) - 5 * sd(sortedData)upperLimit = mean(sortedData) + 5 * sd(sortedData)noOutlierInd = (lowerLimit < originalData) &

(originalData < upperLimit)dataWithoutOutliers = originalData[noOutlierInd]
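The percentile-based alternative can be sketched with the base R function quantile. The data, the seed, and the cutoff alpha = 2 below are hypothetical choices, not values taken from the slides.

```r
# Sketch: percentile-based outlier removal (illustrative values only).
set.seed(2)
x = c(1000, rnorm(99))            # hypothetical data with one planted outlier
alpha = 2                         # percentile cutoff, an arbitrary small value

# quantile() expects probabilities in [0, 1], so percentiles are divided by 100
limits = quantile(x, probs = c(alpha, 100 - alpha) / 100)
keep = (limits[1] <= x) & (x <= limits[2])
xWithoutOutliers = x[keep]
print(range(xWithoutOutliers))
```

Unlike the mean and standard deviation, the percentile limits are not pulled toward the planted value, so no preliminary trimming step is needed. Note that this rule always discards some of the most extreme values, whether or not they are true outliers.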

library(robustHD)
originalData = c(1000, rnorm(10))
print(originalData[1:5])

## [1] 1000.0000 -0.6265 0.1836 -0.8356 1.5953

print(winsorize(originalData[1:5]))

## [1] 3.2060 -0.6265 0.1836 -0.8356 1.5953

Data Transformations: Skewness and Power Transformation

- In many cases, data is drawn from a highly skewed distribution that is not well described by one of the common statistical distributions.

- A simple transformation may map the data to a form that is well described by common distributions, such as the Gaussian or Gamma distributions.

- A suitable model can then be fitted to the transformed data (if necessary, predictions can be made on the original scale by inverting the transformation).

Power Transformation Family: replace non-negative data x by

$$
f_\lambda(x) =
\begin{cases}
(x^\lambda - 1)/\lambda & \lambda > 0 \\
\log x & \lambda = 0 \\
-(x^\lambda - 1)/\lambda & \lambda < 0
\end{cases}
\qquad x > 0,\ \lambda \in \mathbb{R}.
$$

- Intuitively, the power transform maps x to $x^\lambda$, up to multiplication by a constant and addition of a constant.

- This mapping is convex for $\lambda > 1$ and concave for $\lambda < 1$.

- A choice of $\lambda < 1$ removes right-skewness (data has a heavy tail to the right), with smaller values of $\lambda$ resulting in a more aggressive removal of skewness. Similarly, a choice of $\lambda > 1$ removes left-skewness.

- Subtracting 1 and dividing by $\lambda$ makes $f_\lambda(x)$ continuous in $\lambda$ as well as in x.

- One way to select $\lambda$ is to try different values, graph the resulting histograms, and select one of them. There are also more sophisticated methods for selecting $\lambda$ based on the maximum likelihood method.
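The effect of the exponent on right-skewed data can be explored numerically. The sketch below is restricted to exponents between 0 and 1; powerTransform and skewness are hypothetical helper names introduced here, and the log-normal sample is simulated rather than taken from the slides.

```r
# Hypothetical helper: power transform, restricted here to lambda >= 0.
powerTransform = function(x, lambda) {
  stopifnot(all(x > 0), lambda >= 0)
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical helper: sample skewness (third standardized moment).
skewness = function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(3)
x = rlnorm(1000)                       # simulated right-skewed sample
lambdas = c(1, 0.5, 0.25, 0)
skew = sapply(lambdas, function(l) skewness(powerTransform(x, l)))
print(data.frame(lambda = lambdas, skewness = skew))
```

For this kind of sample the skewness should shrink as the exponent decreases toward 0, where the transform reduces to the logarithm. Evaluating powerTransform(2, 1e-6) and log(2) side by side illustrates the continuity at 0.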

print(diamonds[1:10,1:8])

##    carat       cut color clarity depth table price    x
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89
## 3   0.23      Good     E     VS1  56.9    65   327 4.05
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20
## 5   0.31      Good     J     SI2  63.3    58   335 4.34
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00

diamondsSubset = diamonds[sample(dim(diamonds)[1], 1000),]
qplot(price, data = diamondsSubset)

[Figure: histogram of price; x-axis: price, y-axis: count]

qplot(log(price), size = I(1), data = diamondsSubset)

[Figure: histogram of log(price); x-axis: log(price), y-axis: count]

- Power transformations are also useful for examining the relationship between two or more data variables.

- The following plot shows the relationship between diamond price and diamond carat. It is hard to draw much information from that plot beyond the fact that there is a non-linear increasing trend.

- Transforming both variables using a logarithm shows a striking linear relationship on a log-log scale.


qplot(carat, price, size = I(1), data = diamondsSubset)

[Figure: scatter plot of price (y-axis) against carat (x-axis)]

qplot(carat, log(price), size = I(1), data = diamondsSubset)

[Figure: scatter plot of log(price) (y-axis) against carat (x-axis)]

qplot(log(carat), price, size = I(1), data = diamondsSubset)

[Figure: scatter plot of price (y-axis) against log(carat) (x-axis)]

qplot(log(carat), log(price), size = I(1), data = diamondsSubset)

[Figure: scatter plot of log(price) (y-axis) against log(carat) (x-axis)]
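A linear trend on a log-log scale corresponds to a power-law relationship on the original scale, since log y = log a + b log x is equivalent to y = a x^b. The sketch below illustrates this on simulated data rather than on the diamonds dataset; the constants 4000 and 1.7 and the noise level are arbitrary assumptions.

```r
# Sketch: a power law becomes linear after taking logs of both variables.
# All constants below are arbitrary illustrative choices.
set.seed(4)
x = runif(500, min = 0.2, max = 3)              # simulated predictor
y = 4000 * x^1.7 * exp(rnorm(500, sd = 0.2))    # power law with multiplicative noise

# Fit a straight line on the log-log scale.
fit = lm(log(y) ~ log(x))
slope = unname(coef(fit)[2])        # estimates the exponent 1.7
intercept = unname(coef(fit)[1])    # estimates log(4000)
print(c(intercept = intercept, slope = slope))
```

The same call with log(price) and log(carat) on diamondsSubset would give the slope of the linear trend seen in the log-log scatter plot.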

library(MASS)
print(Animals[1:12,])

##                     body  brain
## Mountain beaver     1.35    8.1
## Cow               465.00  423.0
## Grey wolf          36.33  119.5
## Goat               27.66  115.0
## Guinea pig          1.04    5.5
## Dipliodocus     11700.00   50.0
## Asian elephant   2547.00 4603.0
## Donkey            187.10  419.0
## Horse             521.00  655.0
## Potar monkey       10.00  115.0
## Cat                 3.30   25.6
## Giraffe           529.00  680.0

qplot(brain, body, data = Animals)

[Figure: scatter plot of body (y-axis) against brain (x-axis)]

qplot(brain, body, log = "xy", data = Animals)

[Figure: scatter plot of body (y-axis) against brain (x-axis), both on logarithmic axes]

Data Transformations: Binning

- A numeric variable represents real-valued measurements whose values are ordered in a manner consistent with the natural ordering of the real line. Dissimilarity between two measurements a, b is described by the Euclidean distance $|b - a|$. For example, height and weight are numeric variables.

- An ordinal variable represents measurements in a certain range R for which we have a well-defined order relation. Numeric variables are special cases of ordinal variables. For example, the seasons of the year are ordinal measurements.

- A categorical variable represents measurements that do not satisfy the ordinal or numeric assumption. For example, food items on a restaurant's menu are categorical variables.

- Binning (also known as discretization): taking a numeric variable $x \in \mathbb{R}$ (typically a real value, though it may be an integer), dividing its range into several bins, and replacing it with a number representing the corresponding bin.

- It is useful to bin values in order to accomplish data reduction, improve scalability for big data, or capture non-linear effects in linear models.

- Binarization is a special case (replaces a variable with either 0 or 1 depending on whether the variable is greater or smaller than a certain threshold).

- For example, suppose x represents the tenure of an employee (in years) and ranges from 0 to 50.

- A binning process may divide the range [0, 50] into the ranges (0, 10], (10, 20], ..., (40, 50] and use corresponding replacement values of 5, 15, ..., 45, respectively.

- The notation (a, b] corresponds to all values larger than a and smaller than or equal to b.

Discretization in R can be done via the function cut.
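The tenure example can be sketched with cut as follows. The tenure values are hypothetical, and replacing each bin by its midpoint is one possible choice of representative value.

```r
# Sketch: binning employee tenure (hypothetical values) into ten-year bins.
tenure = c(0.5, 3, 10, 10.5, 27, 41, 50)
breaks = seq(0, 50, by = 10)

# cut() returns a factor whose levels are the intervals (0,10], (10,20], ...
bins = cut(tenure, breaks = breaks)

# Replace each value by the midpoint of its bin: 5, 15, 25, 35, 45.
midpoints = (head(breaks, -1) + tail(breaks, -1)) / 2
binned = midpoints[as.integer(bins)]
print(data.frame(tenure, bins, binned))
```

By default the intervals are open on the left, so a tenure of exactly 0 would map to NA; passing include.lowest = TRUE to cut makes the first interval closed on the left.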

Data Transformations: Indicator Variables

- Replace a variable x (numeric, ordinal, or categorical) taking k values with a binary k-dimensional vector v, such that v[i] (or $v_i$ in mathematical notation) is one if and only if x takes on the i-th value in its range.

- In other words, replace the variable by a vector that is all zeros, except for one component that equals one.

- Often, indicator variables are used in conjunction with binning: bin the variable into k bins and then create a k-dimensional indicator variable.

- High-dimensional indicator vectors may be easily handled in computations by taking advantage of their extreme sparsity.
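One way to build indicator vectors in base R is through model.matrix, shown here as a minimal sketch on a hypothetical categorical variable. Removing the intercept with "- 1" yields one column per level.

```r
# Sketch: indicator (one-hot) vectors for a hypothetical categorical variable.
x = c("apple", "pear", "apple", "fig")
f = factor(x)                      # levels are sorted: apple, fig, pear

# One row per observation, one 0/1 column per level of f.
V = model.matrix(~ f - 1)
print(V)
```

Each row contains exactly one entry equal to one, in the column of the level taken by the corresponding observation.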

Uses of Indicator Variables

- Models for numeric or binary data cannot directly model ordinal or categorical data. Using indicator variables can mitigate this problem.

- Transform the data using several non-linear transformations (for example, multiple power transformations), bin the transformed data, and create indicator vectors. Training a linear model on such vectors may capture complex non-linear relationships.

- It is often much easier to compute with indicator functions since they are binary, and thus replacing numeric variables with indicator vectors may improve scalability.
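The claim that a linear model on bin indicators can capture a non-linear relationship can be illustrated on simulated data. In the sketch below the sine-shaped relationship, the noise level, and the choice of 20 bins are all arbitrary assumptions; lm expands the factor returned by cut into indicator variables internally.

```r
# Sketch: a linear model on bin indicators versus a linear model on x itself.
# The simulated relationship and the number of bins are illustrative choices.
set.seed(5)
x = runif(2000, 0, 2 * pi)
y = sin(x) + rnorm(2000, sd = 0.1)

fitLinear = lm(y ~ x)                      # straight line in x
fitBinned = lm(y ~ cut(x, breaks = 20))    # one coefficient per bin

r2Linear = summary(fitLinear)$r.squared
r2Binned = summary(fitBinned)$r.squared
print(c(linear = r2Linear, binned = r2Binned))
```

The binned model fits a piecewise-constant curve, so its fit should be markedly better here than that of the straight line, at the cost of estimating one parameter per bin.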

Data Manipulations: Shuffling

- A common operation in data analysis is to select a random subset of the rows of a dataframe, with or without replacement.

- sample() accepts a vector of values from which to sample (typically a vector of row indices), the number of samples, whether the sampling is done with or without replacement, and the probability of sampling different values.

- sample(k,k) generates a random permutation of order k.

- After obtaining the indices that we wish to sample, we form a new array or dataframe containing the sampled rows of the original dataframe.

D = array(data = seq(1, 20, length.out = 20), dim = c(4, 5))
D_shuffled = D[sample(4, 4),]
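The remaining arguments of sample described above (replacement and sampling probabilities) can be sketched on a small hypothetical dataframe; the column names and values are made up for illustration.

```r
# Sketch: three ways of sampling rows from a hypothetical dataframe.
df = data.frame(id = 1:6, value = c(2.5, 3.1, 4.7, 5.0, 6.2, 7.8))
set.seed(6)

rowsNoRepl = sample(nrow(df), 3)                       # 3 distinct rows
rowsRepl = sample(nrow(df), 10, replace = TRUE)        # rows may repeat
rowsWeighted = sample(nrow(df), 3, prob = df$value)    # larger value, more likely

subsetNoRepl = df[rowsNoRepl, ]
subsetRepl = df[rowsRepl, ]
print(subsetNoRepl)
```

The weights passed through prob do not need to sum to one; sample normalizes them.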

Data Manipulations: Partitioning

- In some cases, we need to partition the dataset's rows into two or more collections of rows.

- Generate a random permutation of k objects (using sample(k,k)), where k is the number of rows in the data, then divide the permutation vector into two or more parts based on the prescribed sizes, and form new dataframes whose rows correspond to the divided permutation vector.

D = array(data = seq(1, 20, length.out = 20), dim = c(4, 5))
rand_perm = sample(4,4)
first_set_of_indices = rand_perm[1:floor(4*0.75)]
second_set_of_indices = rand_perm[(floor(4*0.75)+1):4]
D1 = D[first_set_of_indices,]
D2 = D[second_set_of_indices,]
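The same idea extends to more than two parts. The sketch below partitions a hypothetical dataframe into three parts using the base R function split; the part names and sizes are assumptions chosen for illustration.

```r
# Sketch: random three-way partition of a hypothetical dataframe.
set.seed(7)
n = 10
D = data.frame(id = 1:n, x = rnorm(n))
perm = sample(n, n)                                   # random permutation of rows

sizes = c(train = 6, validation = 2, test = 2)        # hypothetical sizes, sum to n
groups = rep(names(sizes), times = sizes)             # part label per permuted row

# split() returns a list with one dataframe per part.
parts = split(D[perm, ], factor(groups, levels = names(sizes)))
print(sapply(parts, nrow))
```

Every row of D ends up in exactly one of the parts, since the labels are assigned to a permutation of all row indices.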

Tall Data

2015/01/01  apples   200
2015/01/01  oranges  150
2015/01/02  apples   220
2015/01/02  oranges  130

- Data in tall format is an array or dataframe containing multiple columns, where one or more columns act as a unique identifier and an additional column represents a value.

- This format is convenient for adding new records incrementally (e.g., adding sales transactions as they occur) and for removing old records.

- A disadvantage of the tall data format is that it is not convenient for conducting analysis or summarizing the data (e.g., computing average daily sales).

Wide Data

Date        apples  oranges
---------------------------
2015/01/01  200     150
2015/01/02  220     130

- Represents in multiple columns the information that tall data holds in multiple rows

- Simpler to analyze

- Harder to add/remove entries

When converting tall data to wide data, we need to specify ID variables that define the row and column structure (date and item in the example above).
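The sales example can be converted from tall to wide with the base R function reshape, used here instead of an add-on package to keep the sketch self-contained. The column names date, item, and sales are assumptions; the values are those of the tables above.

```r
# Sketch: tall-to-wide conversion of the sales example with base R reshape().
tall = data.frame(
  date = c("2015/01/01", "2015/01/01", "2015/01/02", "2015/01/02"),
  item = c("apples", "oranges", "apples", "oranges"),
  sales = c(200, 150, 220, 130)
)

# date defines the rows and item defines the columns of the wide table.
wide = reshape(tall, idvar = "date", timevar = "item", direction = "wide")
print(wide)

# Summaries such as average daily sales per item are simple in wide form.
avgDaily = colMeans(wide[, c("sales.apples", "sales.oranges")])
print(avgDaily)
```

reshape names the new columns by joining the value column name and the item with a dot, giving sales.apples and sales.oranges.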

Reshaping Data

The R package reshape2 converts data between tall and wide formats. The melt function accepts a dataframe in a wide format and the indices of the columns that act as unique identifiers (the remaining columns act as measurements or values), and returns a tall version of the data.

library(reshape2)
print(smiths)

##      subject time age weight height
## 1 John Smith    1  33     90   1.87
## 2 Mary Smith    1  NA     NA   1.54

smiths_tall = melt(smiths, id = 1)
print(smiths_tall[1:4,])

##      subject variable value
## 1 John Smith     time     1
## 2 Mary Smith     time     1
## 3 John Smith      age    33
## 4 Mary Smith      age    NA

acast/dcast is the inverse of melt. The arguments are a dataframe in tall form, a formula a ~ b ~ ··· where each of a, b, ... represents a list of variables whose values will be displayed along the dimensions of the returned array or dataframe (a for rows, b for columns, etc.), and a function fun.aggregate that aggregates multiple values into a single value.

qplot(total_bill, tip, facets = sex~time, size = I(1.5), data = tips)

[Figure: scatter plots of tip (y-axis) against total_bill (x-axis), faceted by sex (rows: Female, Male) and time (columns: Dinner, Lunch)]

tipsm = melt(tips, id = c("sex","smoker","day","time","size"))
# Mean of measurement variables broken by sex
dcast(tipsm,
      sex~variable,
      fun.aggregate = mean)

##      sex total_bill   tip
## 1 Female      18.06 2.833
## 2   Male      20.74 3.090

# Number of occurrences for measurement variables broken by sex
dcast(tipsm, sex~variable, fun.aggregate = length)

##      sex total_bill tip
## 1 Female         87  87
## 2   Male        157 157

# Average total bill and tip for different times
dcast(tipsm, time~variable, fun.aggregate = mean)

##     time total_bill   tip
## 1 Dinner      20.80 3.103
## 2  Lunch      17.17 2.728

# Similar to above with breakdown for sex and time:
dcast(tipsm, sex+time~variable, fun.aggregate = length)

##      sex   time total_bill tip
## 1 Female Dinner         52  52
## 2 Female  Lunch         35  35
## 3   Male Dinner        124 124
## 4   Male  Lunch         33  33

# Similar to above, but with mean and added margins
dcast(tipsm, sex+time~variable, fun.aggregate = mean, margins = TRUE)

##      sex   time total_bill   tip  (all)
## 1 Female Dinner      19.21 3.002 11.108
## 2 Female  Lunch      16.34 2.583  9.461
## 3 Female  (all)      18.06 2.833 10.445
## 4   Male Dinner      21.46 3.145 12.303
## 5   Male  Lunch      18.05 2.882 10.465
## 6   Male  (all)      20.74 3.090 11.917
## 7  (all)  (all)      19.79 2.998 11.392

Observations:

1. On average, males pay higher total bills and tips than females.

2. Males pay more frequently than females.

3. Dinner bills and tips are generally higher than lunch bills and tips.

4. Males pay disproportionately more often for dinner than they do for lunch (this holds much less for females).

5. Even accounting for (4) by conditioning on paying for lunch or dinner, males still pay higher total bills and tips than females.

Split-Apply-Combine

Many data analysis operations on dataframes can be decomposed into three stages:

1. splitting the dataframe along some dimensions to form smaller arrays or dataframes,

2. applying some operation to each of the smaller arrays or dataframes, and

3. combining the results of the application stage into a single meaningful array or dataframe.

Repeatedly programming all three stages whenever we need to compute a data summary may be tedious and can lead to errors. The plyr package automates this process, letting the analyst concentrate on the data analysis rather than the three stages.
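To make the three stages concrete, they can be written out by hand in base R. This sketch does not use plyr; the dataframe, its column names, and the choice of the mean as the applied operation are hypothetical.

```r
# Sketch: split-apply-combine written out manually in base R.
df = data.frame(group = c("a", "b", "a", "b", "c"),
                value = c(1, 2, 3, 4, 5))

# 1. split: one smaller dataframe per group
pieces = split(df, df$group)

# 2. apply: summarize each piece as a one-row dataframe
summaries = lapply(pieces, function(p) {
  data.frame(group = as.character(p$group[1]), mean.value = mean(p$value))
})

# 3. combine: stack the one-row summaries into a single dataframe
result = do.call(rbind, summaries)
print(result)
```

Each stage is a separate line that has to be kept consistent with the others, which is the repetitive bookkeeping that a single call to a split-apply-combine function removes.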

The plyr package implements the following functions, which differ in the type of input arguments they receive and the type of output they provide.

input \ output   array   dataframe   list    discarded
array            aaply   adply       alply   a_ply
dataframe        daply   ddply       dlply   d_ply
list             laply   ldply       llply   l_ply

Arguments: data, dimensions/columns used to split the data, function to execute in the apply stage.

library(plyr)
names(baseball)

##  [1] "id"    "year"  "stint" "team"  "lg"    "g"     "ab"    "r"
##  [9] "h"     "X2b"   "X3b"   "hr"    "rbi"   "sb"    "cs"    "bb"
## [17] "so"    "ibb"   "hbp"   "sh"    "sf"    "gidp"

# count number of players recorded for each year
bbPerYear = ddply(baseball, "year", "nrow")
head(bbPerYear)

##   year nrow
## 1 1871    7
## 2 1872   13
## 3 1873   13
## 4 1874   15
## 5 1875   17
## 6 1876   15

qplot(x = year, y = nrow,
      data = bbPerYear, geom = "line",
      ylab = "number of player seasons")

[Figure: line plot of number of player seasons (y-axis) against year (x-axis)]

# compute mean rbi (batting attempts resulting in runs) for each year.
# summarise is the apply function, which takes as argument
# an expression that computes the mean rbi
bbMod = ddply(baseball, "year", summarise,
              mean.rbi = mean(rbi, na.rm = TRUE))
qplot(x = year, y = mean.rbi, data = bbMod,
      geom = "line", ylab = "mean RBI")

[Figure: line plot of mean RBI (y-axis) against year (x-axis)]

# add a column career.year which measures the number of years
# passed since each player started batting
bbMod2 = ddply(baseball, "id", transform,
               career.year = year - min(year) + 1)
# sample a random subset of 3000 rows to avoid over-plotting
bbSubset = bbMod2[sample(dim(bbMod2)[1], 3000),]
qplot(career.year,
      rbi,
      data = bbSubset,
      size = I(0.8),
      geom = "jitter",
      ylab = "RBI",
      xlab = "years of playing") +
  geom_smooth(color = "red", se = FALSE, size = 1.5)

Page 145: Data Analysis with R (combined slides)

●

●●

●

●

●

● ● ●● ●

●

●

●

●

●

●

●●●

●

●

●

●

●

●

●

● ● ●

●●

●

●

●

●

●

●

●

●

●

●

●

●●

●

● ●

●

●

●

●●

●

●

●

●

●

●● ● ●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●● ● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●●

●

●●

●

●

●

●●

●

●

●

●

●

●

●

●●

●

●●

●●

●

●●

●

●

●

●

●●

●

●

●

● ●● ●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●●●

●●

●

●

●

●●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●●

●

●● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●●●

●

●

●

●

●●

●

●

●

● ●

●

●

●●

●

●

●

●

●

●

●

●

●●

● ●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●● ●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

●

●●

●

●

●

●

●●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●●

●

●

●

●

●●

●

●

●●

●●

●

●

●

●

●●

●

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●

●

●

● ●

●

●

●

●

●

●

●

●

[Figure: scatter plot of RBI (y-axis, 0 to 150) against years of playing (x-axis, 0 to 30)]

Page 146: Data Analysis with R (combined slides)

The ozone dataset contains a 3-dimensional array of ozone measurements varying by latitude, longitude, and time.

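library(plyr)     # provides aaply and the ozone dataset
library(ggplot2)  # provides qplot
# average over all other dimensions for each latitude, longitude, and time index
latitude.mean = aaply(ozone, 1, mean)
longitude.mean = aaply(ozone, 2, mean)
time.mean = aaply(ozone, 3, mean)
longitude = seq(along = longitude.mean)
qplot(x = longitude,
      y = longitude.mean,
      ylab = "mean ozone level",
      geom = "line")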
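To make the role of the margin argument concrete, here is a minimal sketch on a small made-up array (the name toy and its values are assumptions for illustration, not part of the ozone data). It uses base R's apply, which for a plain numeric array and mean should give the same per-slice averages as aaply with the same margin.

```r
# Toy 3-d array: 2 "latitudes" x 3 "longitudes" x 4 "time points" (made-up values).
toy <- array(1:24, dim = c(2, 3, 4))

# Margin 1: one mean per latitude index, averaging over longitude and time.
toy.lat.mean <- apply(toy, 1, mean)
# Margin 2: one mean per longitude index, averaging over latitude and time.
toy.lon.mean <- apply(toy, 2, mean)
# Margin 3: one mean per time index, averaging over latitude and longitude.
toy.time.mean <- apply(toy, 3, mean)

print(toy.lat.mean)   # length 2
print(toy.lon.mean)   # length 3
print(toy.time.mean)  # length 4
```

Each result has one entry per index of the chosen dimension, which is why the ozone means above can be plotted directly against seq(along = ...).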

Page 147: Data Analysis with R (combined slides)

[Figure: line plot of mean ozone level (y-axis, about 266 to 269) against longitude (x-axis, 0 to 25)]

Page 148: Data Analysis with R (combined slides)

latitude = seq(along = latitude.mean)
qplot(x = latitude,
      y = latitude.mean,
      ylab = "mean ozone level",
      geom = "line")

Page 149: Data Analysis with R (combined slides)

[Figure: line plot of mean ozone level (y-axis, about 260 to 310) against latitude (x-axis, 0 to 25)]

Page 150: Data Analysis with R (combined slides)

months = seq(along = time.mean)
qplot(x = months,
      y = time.mean,
      geom = "line",
      ylab = "mean ozone level",
      xlab = "months since January 1985")

Page 151: Data Analysis with R (combined slides)

[Figure: line plot of mean ozone level (y-axis, about 260 to 275) against months since January 1985 (x-axis, 0 to 60)]

Page 152: Data Analysis with R (combined slides)

I The mean ozone level has a clear minimum at longitude 19 and latitude 12

I The ozone level shows a temporal periodicity superimposed on a slight increasing trend

I The periodicity coincides with the annual seasonal cycle (each period is 12 months)

I The functions in the plyr package are very general and simplify the coding of many data analysis tasks
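The observations above can also be checked numerically rather than read off the plots. The following is a rough sketch in base R on made-up stand-in data (toy.profile and toy.series are hypothetical and are not the ozone means): which.min locates the index of a minimum, and the sample autocorrelation at lag 12 indicates a 12-month period.

```r
# Made-up mean profile standing in for longitude.mean (not the real ozone data).
toy.profile <- c(268.1, 267.4, 266.2, 266.9, 268.0)
min.index <- which.min(toy.profile)  # index of the smallest mean

# Made-up monthly series: a 12-month cycle plus a slight upward trend.
toy.months <- 1:72
toy.series <- 267 + 3 * sin(2 * pi * toy.months / 12) + 0.02 * toy.months

# Sample autocorrelation; element 1 is lag 0, so lag k is element k + 1.
toy.acf <- acf(toy.series, lag.max = 24, plot = FALSE)$acf
lag6 <- toy.acf[7]    # half a period apart: values tend to be opposite
lag12 <- toy.acf[13]  # a full period apart: values tend to agree

print(min.index)
print(c(lag6 = lag6, lag12 = lag12))
```

Under these assumptions, a strongly positive autocorrelation at lag 12 together with a negative one at lag 6 is what an annual cycle would produce; applying the same calls to time.mean and longitude.mean is left as a check on the real data.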