Clustering and Visualisation using R programming


CLUSTERING AND

VISUALIZATION USING R

Nixon Mendez

Department of Bioinformatics

OUTLINE

Microarray Data of Yeast Cell Cycle

Clustering Analysis:

Principal Component Analysis (PCA)

Multidimensional Scaling (MDS)

K-Means

Self-Organizing Maps (SOM)

Hierarchical Clustering

CLUSTERING

Microarray Data of Yeast Cell Cycle

Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9, 3273-3297.

The authors found 800 yeast genes whose transcripts oscillate through one peak per cell cycle.

These 800 genes were identified using an objective, empirical model of cell-cycle regulation, although its threshold was somewhat arbitrary.

They examined the effects of inducing either the cyclin Cln3p or the B-type cyclin Clb2p on more than half of these 800 genes.

A full description and complete data sets are available at http://cellcycle-www.stanford.edu

Loading the data

> mic <- read.delim("C:/Users/Nixon/Desktop/R prog/mic.txt")

> View(mic)

> cell.matrix <- mic

> n <- dim(cell.matrix)[1]              # number of genes

> p <- dim(cell.matrix)[2] - 2          # number of experiments

> cell.data <- cell.matrix[, 3:(p+2)]   # note the parentheses: 3:p+2 would parse as (3:p)+2

> gene.name <- cell.matrix[,1]

> gene.phase <- cell.matrix[,2]

> phase <- unique(gene.phase)

> phase.name <- c("G1", "S", "S/G2", "G2/M", "M/G1")

## standardize each gene (row) to mean 0, variance 1

> cell.sdata <- (cell.data-apply(cell.data, 1, mean))/sqrt(apply(cell.data, 1, var))
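The same row-standardization can be sanity-checked on a small synthetic matrix (the `toy` names below are illustrative, not from the slides): after the transformation, every row should have mean 0 and variance 1.

```r
## Row-standardization check: 4 "genes" x 5 "experiments"
## standing in for cell.data.
set.seed(1)
toy <- matrix(rnorm(20), nrow = 4)
## same formula as cell.sdata: subtract row means, divide by row sds
toy.sdata <- (toy - apply(toy, 1, mean)) / sqrt(apply(toy, 1, var))
apply(toy.sdata, 1, mean)   # all (numerically) zero
apply(toy.sdata, 1, var)    # all one
```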

View Microarray Data

Before visualization we must set the colors: a green-to-red palette is created with maPalette() from the Bioconductor marray package.

##CODE

> library(marray)   # provides maPalette()

> cell.image <- as.matrix(t(cell.sdata[n:1,]))

> RGcol <- maPalette(low = "green", high = "red", k = 50)

> image(cell.image, xlab="Exp.", ylab="Genes", col = RGcol)

OUTPUT

Principal Component Analysis (PCA)

PCA summarizes the dispersion of the data cloud along a small number of major axes of variation (the principal components) among the variables.

Principal Component Analysis (PCA)

Syntax:

# entering raw data and extracting PCs from the correlation matrix

fit <- princomp(mydata, cor=TRUE)

# scree plot

plot(fit, type="lines")
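To make the scree plot concrete, here is a self-contained sketch on synthetic data (the `mydata` and `prop.var` names are illustrative): the squared component standard deviations, normalized, give the proportion of variance captured by each component, which is exactly what the scree plot displays.

```r
## princomp on synthetic data; proportion of variance per component
set.seed(1)
mydata <- matrix(rnorm(100 * 4), ncol = 4)
fit <- princomp(mydata, cor = TRUE)
prop.var <- fit$sdev^2 / sum(fit$sdev^2)   # decreasing, sums to 1
round(prop.var, 3)
```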

Principal Component Analysis (PCA)

> cell.pca <- princomp(cell.sdata, cor=TRUE, scores=TRUE)

# 2D plot of the first two components

> pca.dim1 <- cell.pca$scores[,1]

> pca.dim2 <- cell.pca$scores[,2]

> plot(pca.dim1, pca.dim2,
       main="PCA for Cell Cycle Data on Genes",
       xlab="1st PCA Component", ylab="2nd PCA Component",
       col=gene.phase+1, pch=as.character(gene.phase))

> legend(0.8, 1, phase.name, pch="01234", col=c(1,2,3,4,5))

PCA OUTPUT

Multidimensional Scaling (MDS)

Multidimensional scaling takes a set of dissimilarities and returns a set of points such that the distances between the points are approximately equal to the dissimilarities.

Multidimensional Scaling (MDS)

# Classical MDS

# N rows (objects) x p columns (variables)

# each row identified by a unique row name

d <- dist(mydata) # euclidean distances between the rows

fit <- cmdscale(d,eig=TRUE, k=2) # k is the number of dim

fit # view results

# plot solution

x <- fit$points[,1]

y <- fit$points[,2]

plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2",
     main="Metric MDS", type="n")

text(x, y, labels = row.names(mydata), cex=.7)
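The claim above can be verified on synthetic 2-D points (illustrative data, not the yeast set): classical MDS applied to their Euclidean distances reproduces those distances essentially exactly, up to rotation and reflection.

```r
## cmdscale recovers distances exactly when the data are truly 2-D
set.seed(1)
pts <- matrix(rnorm(10 * 2), ncol = 2)   # 10 points in the plane
d   <- dist(pts)
fit <- cmdscale(d, eig = TRUE, k = 2)
max(abs(as.vector(dist(fit$points)) - as.vector(d)))   # ~0
```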

Multidimensional Scaling (MDS)

#correlation matrix

> cell.cor<- cor(t(cell.sdata))

#distance matrix

> cell.dist<- sqrt(2*(1-cell.cor))

> cell.mds<- cmdscale(cell.dist)

> mds.dim1 <- cell.mds[,1]

> mds.dim2 <- cell.mds[,2]

> plot(mds.dim1, mds.dim2, type="n", xlab="MDS-1", ylab="MDS-2",
       main="MDS for Cell Cycle Data")

> text(mds.dim1, mds.dim2, gene.phase, cex=0.8, col=gene.phase+1)

> legend(0.7, 0.8, phase.name, pch="01234", col=c(1,2,3,4,5))

MDS OUTPUT

K-means Clustering

K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (k), each represented by its centroid.

K-means Clustering

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)

Arguments

x          numeric matrix of data, or an object that can be coerced to
           such a matrix (such as a numeric vector or a data frame with
           all numeric columns).

centers    either the number of clusters, say k, or a set of initial
           (distinct) cluster centres. If a number, a random set of
           (distinct) rows in x is chosen as the initial centres.

iter.max   the maximum number of iterations allowed.

nstart     if centers is a number, how many random sets should be chosen?

algorithm  character: may be abbreviated. Note that "Lloyd" and "Forgy"
           are alternative names for one algorithm.
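A minimal, self-contained run of kmeans on two well-separated synthetic groups (illustrative data, not the yeast set): with clear separation, each generated group should land in a single cluster.

```r
## kmeans on two separated 2-D groups
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # rows 1-20
           matrix(rnorm(40, mean = 6), ncol = 2))   # rows 21-40
km <- kmeans(x, centers = 2, nstart = 5)
km$size   # two clusters of 20
```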

K-means Clustering

> no.group <- 5

> no.iter <- 20

> cell.kmeans <- kmeans(cell.sdata, no.group, no.iter)

> plot(cell.sdata[,1:4], col = cell.kmeans$cluster)

K-means Output

Self-Organizing Maps (SOM)

SOM is unique in that it combines both aspects: it can be used at the same time to reduce the amount of data by clustering and to construct a nonlinear projection of the data onto a low-dimensional display.

Self-Organizing Maps (SOM)

som(data, xdim, ydim, init="linear", neigh="gaussian", topol="rect",
    radius=NULL, rlen=NULL)

ARGUMENTS:

neigh  - a character string specifying the neighborhood function type.
         The following are permitted: "bubble", "gaussian".

topol  - a character string specifying the topology type used when
         measuring distance in the map. The following are permitted:
         "hexa", "rect".

radius - a vector of initial radii of the training area in the SOM
         algorithm for the two training phases. Decreases linearly
         to one during training.

rlen   - a vector of running lengths (number of steps) for the two
         training phases.
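To make the neigh and radius arguments concrete, here is a minimal base-R sketch of a single SOM training step on a 1-D map of 5 units (all names are illustrative; this is not the som package's internal code): find the best-matching unit, then pull every unit toward the sample, weighted by a Gaussian neighborhood.

```r
## One SOM update step: 1-D map of 5 units, 3-D inputs
set.seed(1)
codes <- matrix(runif(5 * 3), nrow = 5)        # codebook vectors, one per unit
x     <- runif(3)                              # one training sample
bmu   <- which.min(colSums((t(codes) - x)^2))  # best-matching unit
alpha <- 0.5                                   # learning rate
sigma <- 1                                     # neighborhood radius
h     <- exp(-((1:5 - bmu)^2) / (2 * sigma^2)) # gaussian neighborhood weights
## each unit moves toward x, scaled by alpha and its neighborhood weight
codes.new <- codes + alpha * h * sweep(-codes, 2, x, "+")
```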

Self-Organizing Maps (SOM)

> library(som)

> cell.som <- som(cell.sdata, xdim=5, ydim=4, topol="rect", neigh="gaussian")

> plot(cell.som)

SOM OUTPUT

Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Hierarchical Clustering

dist(as.matrix(mtcars))  # compute the distance matrix

hclust(d)                # apply hierarchical clustering

plot(hc)                 # plot the dendrogram

hang   - the fraction of the plot height by which labels should hang
         below the rest of the plot.

method - the agglomeration method to be used.
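The calls above can be run end to end on synthetic data; cutree() (not shown in the syntax list) then extracts flat cluster labels from the dendrogram. All data and names here are illustrative.

```r
## average-linkage clustering of two separated groups, then cutree
set.seed(1)
x  <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # rows 1-10
            matrix(rnorm(20, mean = 8), ncol = 2))   # rows 11-20
hc <- hclust(dist(x), method = "average")
groups <- cutree(hc, k = 2)   # cut the dendrogram into 2 clusters
```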

Hierarchical Clustering

## Hierarchical Clustering on Experiments

> cell.exp.hc.ave <- hclust(dist(t(cell.sdata)), method = "ave")

> plot(cell.exp.hc.ave, cex=0.8)

## Hierarchical Clustering on Genes

> cell.gene.hc.ave <- hclust(dist(cell.sdata), method = "ave")

> plot(cell.gene.hc.ave, hang = -1, cex=0.5, labels=gene.name)

Hierarchical Clustering Output

THANK YOU!!
