Christine Gläßer ---- ZMBH ---- Room 504 ---- +49(0)6221-54 6824 ---- c.glaesser@zmbh.uni-heidelberg.de
R course for beginners
Session 2: Statistics
Based on R Lecture by Juan Luis Mateo, COS
Session 1 – Recap commands
Working directory: getwd(), setwd(), dir()
I/O: read.delim(), write.table()
Check data: data[row, column], colnames(), rownames(), length()
Create data: rbind(), <-, c(), 1:10, seq(), rep(), array()
Mathematical operations: +, -, *, /, ^, sum(), mean(), sd()
What are we going to do?
Data description: descriptive statistics
Statistical tests: Student's t-test, Chi square test, Fisher's exact test
Estimation of variance: correlation, PCA, clustering
Data description
1) Measures of centrality
- Mean: estimate of the average value of a variable in your sample
- Median: the value separating the higher half of your data from the lower half
- Quantiles: values dividing your data so that x% lie below and the rest above ---> the median is also the 2-quantile
---> in most cases, the 25% and 75% quantiles are of interest
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Data description
2) Measures of spread
- Range: difference between the minimum and maximum value in your data
- Variance: shows how far values spread around the mean value
---> the standard deviation, the square root of the variance, is an equivalent measure
Standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
Data description – descriptive statistics
Commands: mean, median, min, max, quantile, sd, range
- mean / sd: cf. Session 1
- median: computes the sample median
- min / max: return the minimum / maximum of the input values
- quantile: calculates sample quantiles
OR: use one of the various summary commands!
Command: summary
Data description – descriptive statistics
1) Load the data "sleep_data_simple.txt"
Remember from session 1: where is your data stored? Direct R to that folder, then load the data
2) Describe your data: mean, median, 25th and 75th percentiles, min, max
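A minimal sketch of this exercise. It uses R's built-in `sleep` dataset as a stand-in, since the exact contents of sleep_data_simple.txt are not shown here; for the real exercise you would load the course file with `read.delim()` instead:

```r
# Stand-in for the course file: R's built-in 'sleep' dataset
# (column 'extra' holds the measured increase in hours of sleep).
# For the actual exercise you would first run, e.g.:
#   setwd("path/to/your/data")                 # hypothetical folder
#   sleep <- read.delim("sleep_data_simple.txt")
data(sleep)
mean(sleep$extra)                              # mean
median(sleep$extra)                            # median
quantile(sleep$extra, probs = c(0.25, 0.75))   # 25th and 75th percentiles
range(sleep$extra)                             # min and max in one call
summary(sleep$extra)                           # all of the above at once
```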
Statistical tests – Recap
In theory, we have no noise and events follow a precise law
(e.g. free fall: $h = h_0 - \frac{1}{2} g t^2$)
In reality, measurements are not precise
Averaging gets rid of noise and smooths the data
---> the idea of statistics
---> the more data, the better
Copyright Juan L. Mateo, COS
Statistical tests – Recap
Modelling the probability of specific types of events:
a) Binomial distribution: repeated trials with a binary outcome
http://en.wikipedia.org/wiki/Binomial_distribution
Statistical tests – Recap
b) Poisson distribution: probability of a given number of events in a fixed amount of time, when we know the average rate
http://en.wikipedia.org/wiki/Poisson_distribution
Statistical tests – Recap
c) Exponential distribution: processes with exponential behaviour, e.g. waiting times between events
http://en.wikipedia.org/wiki/Exponential_distribution
Statistical tests – Recap
d) Normal distribution: probability of a value falling a given distance from the expected value
http://en.wikipedia.org/wiki/Normal_distribution
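Each of these distributions is available in R through the d/p/q/r function families (density, cumulative probability, quantile, random draws). A few illustrative calls:

```r
# Density / cumulative probability for the four distributions above
dbinom(3, size = 10, prob = 0.5)  # P(X = 3) in 10 fair coin flips
dpois(2, lambda = 4)              # P(X = 2) when 4 events are expected on average
pexp(1, rate = 0.5)               # P(waiting time <= 1)
pnorm(1.96)                       # P(X <= 1.96) for the standard normal
rnorm(5)                          # five random draws from N(0, 1)
```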
Statistical tests – Recap
Hypothesis testing: test whether our data follow a given distribution
● Null hypothesis H0: the statement we want to test ---> standard case: two things are comparable
● Alternative hypothesis H1: H0 is false
● Result: a probability
Statistical tests – Recap
Probability: a measure of uncertainty ---> the p-value
"Gold standard": a p-value threshold of 0.05, i.e. a result this extreme would arise by chance in fewer than 5% of cases if H0 were true
Statistical tests – Recap
HOWEVER... false positives!
Imagine: 10,000 genes
None is differentially expressed, but you think there are some
Assume a p-value threshold of 0.05
P-value definition: if a result is assigned the value p, then the probability of seeing a result this strong due to noise alone is p
---> 5% of the genes will have a p-value < 0.05 (500 genes!)
---> now assume 1,000 genes have a p-value < 0.05; those contain 500 false positives (50%!)
---> techniques exist to adjust p-values for multiple testing ---> Benjamini-Hochberg is the most common
Thanks to Simon Anders, EMBL
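In R, multiple-testing adjustment is a one-liner via `p.adjust`. A sketch with made-up raw p-values:

```r
# Benjamini-Hochberg adjustment of a vector of (hypothetical) raw p-values
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
p.adjust(pvals, method = "BH")  # adjusted values are never smaller than the raw ones
```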
Statistical tests – Recap
● Focus on continuous samples
● Parametric tests: require assumptions about the data distribution
● Non-parametric tests: require no assumptions about the data distribution
Statistical tests – Student's t-test
● Parametric test
● Assumes normally distributed data
● Either one sample ---> H0: the mean has a specific value
or two samples ---> H0: the samples have equal mean values
● Additional assumption: equal variances ---> if not: Welch's t-test
● Paired tests (e.g. same proband, different arms) give more statistical power; a paired t-test is possible
Statistical tests – Student's t-test
Q: Is there a significant difference between group X and Y?
Do a t-test with the sleep data of groups X and Y
Command: t.test
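A sketch of the exercise with R's built-in `sleep` dataset standing in for the course data (note that `t.test` performs Welch's t-test by default, i.e. it does not assume equal variances):

```r
# Two-sample t-test: does 'extra' sleep differ between the two drug groups?
data(sleep)
t.test(extra ~ group, data = sleep)                    # Welch's t-test (default)
t.test(extra ~ group, data = sleep, var.equal = TRUE)  # classical Student's t-test
# The same subjects were measured under both drugs ---> a paired test has more power:
t.test(sleep$extra[sleep$group == 1],
       sleep$extra[sleep$group == 2], paired = TRUE)
```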
Statistical tests – Chi square test
● Nominal data
● H0: the frequencies of the values in our samples are independent
● Samples must be sufficiently large
Command: chisq.test
A warning message indicates that the sample is too small!
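A sketch on a hypothetical 2x2 contingency table (the counts are invented for illustration):

```r
# Chi square test of independence on a made-up 2x2 table
counts <- matrix(c(30, 10, 15, 25), nrow = 2,
                 dimnames = list(treatment = c("treated", "control"),
                                 outcome   = c("improved", "not improved")))
chisq.test(counts)  # small p-value ---> reject independence
```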
Statistical tests – Fisher's exact test
● Equivalent to the Chi square test, but for ...
● ... small samples
Command: fisher.test
Two-sided by default: deviations in both directions are considered
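A sketch on a small hypothetical table, the kind of sample size where `chisq.test` would warn about its approximation:

```r
# Fisher's exact test on a small made-up 2x2 table
counts <- matrix(c(8, 2, 1, 5), nrow = 2)
fisher.test(counts)                           # two-sided by default
fisher.test(counts, alternative = "greater")  # one-sided variant
```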
What are we going to do?
Data description: descriptive statistics
Statistical tests: Student's t-test, Chi square test, Fisher's exact test
Estimation of variance: correlation, PCA, clustering
Estimation of variance – Recap
Correlation: statistical relationships involving dependence
● Pearson's correlation coefficient works for linear relationships
● Spearman's rank correlation coefficient works for non-linear (monotonic) relationships
● Anti-correlation: negative values; correlation: positive values
Estimation of variance – Recap
PCA: Principal Component Analysis
● Conversion of a set of (possibly correlated) variables into a set of linearly uncorrelated variables = principal components
● The first principal component has the largest possible variance
(Figure: clownfish example with axes Comp. 1 and Comp. 2; image: http://www.bestcoloringpagesforkids.com/nemo-coloring-pages.html)
Estimation of variance – Recap
Clustering: finding similarities
● Taxonomy in biology is based on clustering (cladistics)
● Different methods: hierarchical clustering, k-means clustering...
a) Hierarchical clustering
● Known from taxonomy
● Builds a hierarchy of clusters
● Agglomerative: each observation starts in its own cluster, then clusters are merged step by step
● Divisive: all observations start in one cluster, which is then split
(Figure: taxonomy tree; image: http://www.scholarsjunction.com/Taxonomy.aspx)
Estimation of variance – Recap
b) K-means clustering
● The number of clusters is known in advance
● Each cluster has a center (= centroid)
● Iteratively: 1) choose initial centroids 2) assign each data point to its closest centroid 3) recalculate the centroids
---> repeat until the centroids no longer change
Estimation of variance – Recap
(Animation of the k-means iterations: http://www-m9.ma.tum.de/material/felix-klein/clustering/Methoden/K-Means.php)
Estimation of variance – Correlation
Now: using a default data set provided by R ---> mtcars
Estimation of variance – Correlation
Command: cor() ---> check ?cor for settings
---> calculate the correlation using Pearson
---> then calculate the correlation using Spearman
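A sketch of both calls on the built-in mtcars data:

```r
# Correlations in the built-in mtcars data
data(mtcars)
cor(mtcars$mpg, mtcars$wt)                       # Pearson (default): strong negative
cor(mtcars$mpg, mtcars$wt, method = "spearman")  # rank-based alternative
cor(mtcars, method = "pearson")                  # full pairwise correlation matrix
```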
Estimation of variance – PCA
Command: prcomp() ---> check ?prcomp for settings
---> calculate a PCA for the data set "USArrests"
---> prcomp advises scaling the data before calculating the PCA; the data will have unit variance afterwards
---> thus, our command is: prcomp(USArrests, scale=TRUE)
---> look at the summary to see how much of the variance falls into each component
Command: summary(prcomp(USArrests, scale=TRUE))
OR store the result of prcomp(USArrests, scale=TRUE) in a variable:
pr <- prcomp(USArrests, scale=TRUE)
summary(pr)
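The steps above, runnable end to end:

```r
# PCA of the built-in USArrests data, scaled to unit variance
pr <- prcomp(USArrests, scale = TRUE)
summary(pr)   # proportion of variance explained per component
pr$rotation   # loadings: how each original variable enters the components
head(pr$x)    # coordinates of the observations in PC space
```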
Estimation of variance – PCA
---> a plot would be more informative! ---> biplot(pr)
Estimation of variance – Clustering
Hierarchical clustering
First, we need to calculate the distances in our data
Command: dist() ---> distance <- dist(USArrests)
Then, we can go on with the clustering
Command: hclust() ---> hclust(distance)
Again, a plot would be nicer...
Command: plot(hclust(distance))
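Put together, with one extra step (`cutree`, not shown on the slide) to extract flat clusters from the tree:

```r
# Hierarchical clustering of the built-in USArrests data
distance <- dist(USArrests)  # Euclidean distances between the 50 states
hc <- hclust(distance)       # agglomerative clustering, complete linkage by default
plot(hc)                     # dendrogram
cutree(hc, k = 8)            # cut the tree into 8 flat clusters
```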
Estimation of variance – Clustering
K-means clustering
How many centroids?
---> use e.g. the cluster structure derived by hclust...
---> ... or do a scree plot ...
Assuming 8 clusters
Command: kmeans() ---> read ?kmeans ---> kmeans(USArrests, 8)
(Scree plot example: http://www.janda.org/workshop/factor%20analysis/SPSS%20run/SPSS08.htm)
Estimation of variance – Clustering
K-means clustering ---> where are the clusters?
---> kmeans(USArrests, 8)$cluster
A plot would be cool...
Outlook: kmeans with random starts, hclust with different methods
Estimation of variance – Clustering
K-means clustering
Command: plot(USArrests, kmeans(USArrests, 8)$cluster)
With colors? ---> plot(USArrests, col=kmeans(USArrests, 8)$cluster)
With names? ---> advanced; special libraries can simplify your task...
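The plotting steps as one runnable sketch (`set.seed` and `nstart` are added here because k-means starts from random centroids, so results vary between runs otherwise):

```r
# K-means with 8 clusters on USArrests, plotted with cluster colors
set.seed(1)                        # k-means initialisation is random
km <- kmeans(USArrests, centers = 8, nstart = 25)
km$cluster                         # cluster assignment for each state
plot(USArrests, col = km$cluster)  # scatterplot matrix colored by cluster
```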
(Figure: example cluster plot, http://www.statmethods.net/advstats/images/cluster5.jpg)
Recommended