19
Principal Component Analysis Principal Component Analysis in R 2018 Ontario Summer School on HPC Marcelo Ponce May 2018 2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis in Rmponce/courses/London2018/PCA.pdf · Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert

  • Upload
    others

  • View
    19

  • Download
    0

Embed Size (px)

Citation preview

Principal Component Analysis

Principal Component Analysis in R2018 Ontario Summer School on HPC

Marcelo Ponce

May 2018

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Principal Component AnalysisPrincipal component analysis (PCA) is a statistical procedure that uses anorthogonal transformation to convert a set of observations of possiblycorrelated variables into a set of values of linearly uncorrelated variablescalled principal components.

PCA is mostly used as a tool inexploratory data analysis and formaking predictive models.

Unsupervised.

PCA is sensitive to the relative scalingof the original variables.

SVD, dimensionality reduction, ...

Also related to clusterization algs.(k-means) ...

“PCA ≈ fitting an n-dim

ellipsoid: PC;axes”

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

It’s often used to visualize genetic distance and relatedness betweenpopulations.

PCA has successfully foundlinear combinations of thedifferent markers, thatseparate out different clusterscorresponding to differentlines of individuals’Y-chromosomal geneticdescent.

A principal components analysis scatterplot of Y-STR haplotypes calculated from

repeat-count values for 37 Y-chromosomal STR markers from 354 individuals.

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Computing PCA in Rß pricomp

uses eigen-values/vectors andcovariance matrix

à prcomp

uses singular value decomposition(SVD)

prcomp(x, retx = TRUE, center = TRUE, scale = FALSE, tol = NULL, ...)

# USArrests data vary by orders of

magnitude, so scaling is appropriate

> prcomp(USArrests) # inappropriate

> pc1 <- prcomp(USArrests, scale = T)

> pc2 <- prcomp(~Murder + Assault +

Rape, data = USArrests, scale = TRUE)

> summary(pc1)

> summary(pc2)

> plot(pc1)

> plot(pc2)

Importance of components: Comp.1 Comp.2

Comp.3 Comp.4

Standard deviation 1.5748783 0.9948694

0.5971291 0.41644938

Proportion of Variance 0.6200604

0.2474413 0.0891408 0.04335752

Cumulative Proportion 0.6200604 0.8675017

0.9566425 1.00000000

Importance of components: PC1 PC2 PC3

PC4

Standard deviation 83.7324 14.21240

6.4894 2.48279

Proportion of Variance 0.9655 0.02782

0.0058 0.00085

Cumulative Proportion 0.9655 0.99335

0.9991 1.00000

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Computing PCA in Rß pricomp

uses eigen-values/vectors andcovariance matrix

à prcomp

uses singular value decomposition(SVD)

prcomp(x, retx = TRUE, center = TRUE, scale = FALSE, tol = NULL, ...)

# USArrests data vary by orders of

magnitude, so scaling is appropriate

> prcomp(USArrests) # inappropriate

> pc1 <- prcomp(USArrests, scale = T)

> pc2 <- prcomp(~Murder + Assault +

Rape, data = USArrests, scale = TRUE)

> summary(pc1)

> summary(pc2)

> plot(pc1)

> plot(pc2)

Importance of components: Comp.1 Comp.2

Comp.3 Comp.4

Standard deviation 1.5748783 0.9948694

0.5971291 0.41644938

Proportion of Variance 0.6200604

0.2474413 0.0891408 0.04335752

Cumulative Proportion 0.6200604 0.8675017

0.9566425 1.00000000

Importance of components: PC1 PC2 PC3

PC4

Standard deviation 83.7324 14.21240

6.4894 2.48279

Proportion of Variance 0.9655 0.02782

0.0058 0.00085

Cumulative Proportion 0.9655 0.99335

0.9991 1.00000

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

Computing PCA in Rß pricomp

uses eigen-values/vectors andcovariance matrix

à prcomp

uses singular value decomposition(SVD)

prcomp(x, retx = TRUE, center = TRUE, scale = FALSE, tol = NULL, ...)

# USArrests data vary by orders of

magnitude, so scaling is appropriate

> prcomp(USArrests) # inappropriate

> pc1 <- prcomp(USArrests, scale = T)

> pc2 <- prcomp(~Murder + Assault +

Rape, data = USArrests, scale = TRUE)

> summary(pc1)

> summary(pc2)

> plot(pc1)

> plot(pc2)

Importance of components: Comp.1 Comp.2

Comp.3 Comp.4

Standard deviation 1.5748783 0.9948694

0.5971291 0.41644938

Proportion of Variance 0.6200604

0.2474413 0.0891408 0.04335752

Cumulative Proportion 0.6200604 0.8675017

0.9566425 1.00000000

Importance of components: PC1 PC2 PC3

PC4

Standard deviation 83.7324 14.21240

6.4894 2.48279

Proportion of Variance 0.9655 0.02782

0.0058 0.00085

Cumulative Proportion 0.9655 0.99335

0.9991 1.00000

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: PCA

> library(ggfortify)

# select only numerical data...

> iris data <- iris[c(1, 2, 3, 4)]

# compute PCA...

> iris pca <- prcomp(iris data)

# visualize PCA

> autoplot(iris pca, data = iris, colour =

’Species’)

# draw eigenvectors...

> autoplot(iris pca, data = iris, colour =

’Species’, loadings = TRUE)

# attach eigenvector labels and options...

> autoplot(prcomp(df), data = iris, colour =

’Species’, loadings = TRUE, loadings.colour

= ’blue’, loadings.label = TRUE, load-

ings.label.size = 3)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

Basics

The iris dataset: Cluster and Local Fisher Analyses

> library(cluster)

# cluster analysis...

> autoplot(clara(iris data,3))

# draw convex for each cluster...

> autoplot(fanny(iris data,3), frame=TRUE)

# draw probability ellipse

> autoplot(pam(iris pca,3), frame=TRUE,

frame.type=’norm’)

# Local Fisher Discriminant Analysis (LFDA)

> library(lfda)

> model <- lfda(iris[-5], iris[,5], r=3, met-

ric=’plain’)

> autoplot(model, data = iris, frame=TRUE,

frame.colour = ’Species’)

# Semi-supervised LFDA (SELF)

> model <- self(iris[-5], iris[, 5], beta =

0.1, r = 3, metric="plain")

> autoplot(model, data = iris, frame = TRUE,

frame.colour = ’Species’)

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

3D PCA

3D PCAAdditional Packages required:

install.packages("rgl")

install.packages("pca3d")

> library(rgl)

> library(pca3d)

# PCA analysis of ’metabo’ data

# relative abundances of metabolites

from serun samples of three groups

> data(metabo)

> dim(metabo) # 136 424

# PCA analysis, including all rows but

the ’group’ column

> pca <- prcomp(metabo[,-1],

scale=TRUE)

# 2D PCA

> pca2d(pca, group=metabo[,1])

# 3D PCA

> pca3d(pca, group=metabo[,1])

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

3D PCA

3D PCAAdditional Packages required:

install.packages("rgl")

install.packages("pca3d")

> library(rgl)

> library(pca3d)

# PCA analysis of ’metabo’ data

# relative abundances of metabolites

from serun samples of three groups

> data(metabo)

> dim(metabo) # 136 424

# PCA analysis, including all rows but

the ’group’ column

> pca <- prcomp(metabo[,-1],

scale=TRUE)

# 2D PCA

> pca2d(pca, group=metabo[,1])

# 3D PCA

> pca3d(pca, group=metabo[,1])

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)

Principal Component Analysis

3D PCA

References

PCA

http://uc-r.github.io/pca

http://genomicsclass.github.io/book/pages/pca_svd.html

https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html

https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_principal_components_analysis.pdf

2018 Ontario Summer School: PCA in R M.Ponce (SciNet HPC / UofT)