Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics

Learning Objectives

Python matplotlib library to visualize data:
• Scatter plot
• Histogram
• Kernel density estimate
• Box plots

Descriptive statistics:
• Mean and median
• Standard deviation and inter quartile range
• Central limit theorem

An Example Data Set

0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094

Scatter Plot

[Figure: scatter plots of the 20 measurements from the example data set; axis labels: Order, Measurement]

Histogram

[Figure: three histograms of the measurements with bin size = 0.1, bin size = 0.05, and bin size = 0.025; x-axis: Measurement]
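The effect of the bin size can also be checked numerically on the example data set. The following is a minimal sketch, assuming NumPy's `np.histogram` and the range [-1, 1] used in Exercise 1; the helper name `histogram_counts` is illustrative and not part of the lecture scripts.

```python
import numpy as np

# The 20 measurements from the example data set
data = np.array([0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147,
                 0.033, 0.080, 0.073, 0.016, 0.148, 0.135, 0.006, -0.089,
                 0.165, -0.088, -0.137, 0.094])

def histogram_counts(values, bin_size, lo=-1.0, hi=1.0):
    """Count the values falling into bins of width bin_size over [lo, hi]."""
    n_bins = int(round((hi - lo) / bin_size))
    counts, edges = np.histogram(values, bins=n_bins, range=(lo, hi))
    return counts, edges

for bin_size in [0.1, 0.05, 0.025]:
    counts, edges = histogram_counts(data, bin_size)
    # Smaller bins give more bins and fewer points per bin
    print(bin_size, len(counts), counts.max())
```

With only 20 points, the smallest bin size leaves most bins empty, which is why the histogram looks noisier as the bins shrink.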

Cumulative Distributions

[Figure: cumulative distribution of the measurements; x-axis: Measurement]

Kernel Density Estimate

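[Figure: kernel density estimate of the measurements; x-axis: Measurement]

[Figure: the original distribution; x-axis: Measurement]

[Figure: three panels comparing the original distribution, the kernel density estimate, and a histogram with bin size = 0.05; x-axis: Measurement]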
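A kernel density estimate places a smooth kernel on every data point and averages the kernels. The following is a minimal sketch, assuming a Gaussian kernel and a hand-picked bandwidth; the function name `gaussian_kde_manual` and the bandwidth value 0.05 are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Average of Gaussian kernels centred on each data point, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z has shape (len(grid), len(data)): scaled distance from each grid point to each datum
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
sample = 0.1 * rng.normal(size=20)       # 20 points, mean 0, standard deviation 0.1
grid = np.linspace(-1, 1, 1000)
density = gaussian_kde_manual(sample, grid, bandwidth=0.05)

# The estimate is a density, so its area should be close to 1
area = np.sum(density) * (grid[1] - grid[0])
```

`scipy.stats.gaussian_kde`, used in Exercise 1(e), follows the same idea but chooses the bandwidth automatically from the data.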

More Data

[Figure: the same three panels with more data: original distribution, kernel density estimate, and histogram with bin size = 0.05; x-axis: Measurement]

Exercise 1

Download ibb2015_7_exercise1.py

(a) Draw 20 points from a normal distribution with mean=0 and standard deviation=0.1.

import numpy as np

points = 20  # number of data points
y = 0.1 * np.random.normal(size=points)  # mean 0, standard deviation 0.1
print(y)

[-0.09946073 -0.19612617 0.03442682 0.02622746 -0.28418124 -0.04245968 0.05922837 0.01199874 0.13454915 -0.07482707 -0.11688758 0.01714036 0.03280043 0.01356022 0.09128649 -0.18923468 0.14536047 -0.07764629 -0.0349553 0.04300367]

Exercise 1

(b) Make scatter plot of the 20 points.

import matplotlib.pyplot as plt

# points is the number of data points (20 in part (a))
x = range(1, points + 1)
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.scatter(x, y, color='red', lw=0, s=40)
ax1.set_xlim([0, points + 1])
ax1.set_ylim([-1, 1])
fig.savefig('ibb2015_7_exercise1_scatter_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
plt.close(fig)

Exercise 1

(c) Plot histograms.

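# 20, 40 and 80 bins over the range [-1, 1] give bin sizes of 0.1, 0.05 and 0.025
for bins in [20, 40, 80]:
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    # density=True normalizes the histogram; normed=True was removed from newer matplotlib releases
    ax1.hist(y, bins=bins, histtype='step', color='black', range=[-1, 1], lw=2, density=True)
    ax1.set_xlim([-1, 1])
    fig.savefig('ibb2015_7_exercise1_bin' + str(bins) + '_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
    plt.close(fig)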

Exercise 1

(d) Plot cumulative distribution.

# Sorted values on the x-axis, cumulative fraction of points on the y-axis
y_cumulative = np.linspace(0, 1, points)
x_cumulative = np.sort(y)
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.plot(x_cumulative, y_cumulative, color='black', lw=2)
ax1.set_xlim([-1, 1])
ax1.set_ylim([0, 1])
fig.savefig('ibb2015_7_exercise1_cumulative_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
plt.close(fig)

Exercise 1

(e) Plot kernel density estimate.

import scipy.stats as stats

kde_points = 1000
kde_x = np.linspace(-1, 1, kde_points)  # grid on which the estimate is evaluated
fig, ax1 = plt.subplots(1, figsize=(6, 6))
kde_y = stats.gaussian_kde(y)
ax1.plot(kde_x, kde_y(kde_x), color='black', lw=2)
ax1.set_xlim([-1, 1])
fig.savefig('ibb2015_7_exercise1_kde_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
plt.close(fig)

Comparing Measurements

Comparing Measurements – Cumulative distributions

Systematic Shifts

Exercise 2

(a) Generate 5 data sets with 20 data points each from normal distributions with means = 0, 0, 0.1, 0.5 and 0.3 and standard deviation=0.1.

y = []
for j in range(5):
    y.append(0.1 * np.random.normal(size=20))
# Shift data sets 3, 4 and 5 to means of 0.1, 0.5 and 0.3
y[2] += 0.1
y[3] += 0.5
y[4] += 0.3
print(y)

Exercise 2

(b) Make scatter plots for the 5 data sets.

sixcolors=['#D4C6DF','#8968AC','#3D6570','#91732B','#963725','#4D0132']

fig, ax1 = plt.subplots(1, figsize=(6, 6))
for j in range(5):
    # Spread the 20 points of data set j horizontally around x = j + 1
    ax1.scatter(np.linspace(j + 1 - 0.2, j + 1 + 0.2, 20), y[j], color=sixcolors[6 - (j + 1)], lw=0, alpha=1)
ax1.set_xlim([0, 6])
ax1.set_ylim([-1, 1])
fig.savefig('ibb2015_7_exercise2_scatter_sample' + str(20), dpi=300, bbox_inches='tight')
plt.close(fig)

Correlation Between Two Variables

Data Visualization

http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html

Process of Statistical Analysis

Population

Random Sample

Sample Statistics

Describe

Make Inferences

Distributions

[Figure: four example distributions: Complex, Normal, Skewed, Long tails]

Sample: $x_1, x_2, \ldots, x_n$

Mean - Sample Size

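[Figure: mean as a function of sample size for the normal distribution; x-axis: Sample Size]

Mean – Sample Size

[Figure: mean as a function of sample size for the Complex, Normal, Skewed, and Long tails distributions; x-axis: Sample Size]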

Mode, Maximum and Minimum

Sample: $x_1, x_2, \ldots, x_n$

Maximum: $\max(x_1, x_2, \ldots, x_n)$

Minimum: $\min(x_1, x_2, \ldots, x_n)$

Mode: the most common value
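These three statistics can be read off directly from a sample. The following is a minimal sketch, assuming a small made-up list of integer values (the data are illustrative only) and using `collections.Counter` to find the most common value.

```python
import numpy as np
from collections import Counter

# Illustrative sample (not from the lecture)
sample = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]

maximum = np.max(sample)
minimum = np.min(sample)
# most_common(1) returns a list with the (value, count) pair of the most frequent value
mode, mode_count = Counter(sample).most_common(1)[0]

print(maximum, minimum, mode, mode_count)  # 9 1 5 3
```

For continuous measurements, exact repeats are rare, so the mode is usually read from the peak of a histogram or kernel density estimate instead.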

Median, Quartiles and Percentiles

Sample: $x_1, x_2, \ldots, x_n$

Quartiles

$Q_1$: $x_i < Q_1$ for 25% of the sample

$Q_2$ (median): $x_i < Q_2$ for 50% of the sample

$Q_3$: $x_i < Q_3$ for 75% of the sample

Percentiles

$P_m$: $x_i < P_m$ for m% of the sample
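Quartiles are the 25th, 50th and 75th percentiles, so one function covers both. The following is a minimal sketch, assuming `np.percentile` with its default interpolation and a simulated normal sample; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=0.1, size=1000)

# Quartiles are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(sample, [25, 50, 75])
p90 = np.percentile(sample, 90)

# Fraction of the sample lying below each value
frac_below_q1 = np.mean(sample < q1)
frac_below_p90 = np.mean(sample < p90)
```

Here `q2` coincides with `np.median(sample)`, and the two fractions come out close to 0.25 and 0.90.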

Median and Mean – Sample Size

[Figure: median (gray) and mean as a function of sample size for the Complex, Normal, Skewed, and Long tails distributions; x-axis: Sample Size]

Variance

Sample: $x_1, x_2, \ldots, x_n$

Variance
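The variance is the average squared deviation from the mean. The following is a minimal sketch showing both common normalizations, dividing by n and by n - 1; which of the two the lecture uses is not stated here, so showing both is an assumption. The five values are the first five measurements of the example data set.

```python
import numpy as np

sample = np.array([0.022, -0.083, 0.048, -0.010, -0.125])
n = len(sample)
mean = sample.sum() / n

# Divide by n: NumPy's default (ddof=0)
var_n = np.sum((sample - mean) ** 2) / n
# Divide by n - 1: the usual unbiased sample estimator (ddof=1)
var_n1 = np.sum((sample - mean) ** 2) / (n - 1)

# The standard deviation is the square root of the variance
std_n1 = np.sqrt(var_n1)
```

The two forms differ noticeably for small samples and converge as n grows.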

Variance – Sample Size

[Figure: variance as a function of sample size for the Complex, Normal, Skewed, and Long tails distributions; x-axis: Sample Size]

Inter Quartile Range (IQR)

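Sample: $x_1, x_2, \ldots, x_n$

Quartiles

$Q_1$: $x_i < Q_1$ for 25% of the sample

$Q_2$ (median): $x_i < Q_2$ for 50% of the sample

$Q_3$: $x_i < Q_3$ for 75% of the sample

Inter Quartile Range

$\mathrm{IQR} = Q_3 - Q_1$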

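Inter Quartile Range and Standard Deviation

[Figure: standard deviation and IQR/1.349 (gray) as a function of sample size for the Complex, Normal, Skewed, and Long tails distributions; x-axis: Sample Size]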
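For a normal distribution the inter quartile range equals about 1.349 standard deviations, which is why IQR/1.349 is compared with the standard deviation. The following is a minimal sketch of that comparison, assuming a simulated normal sample with standard deviation 0.1; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.0, scale=0.1, size=10000)

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# For a normal distribution, IQR is approximately 1.349 * sigma
sigma_from_iqr = iqr / 1.349
sigma = np.std(sample, ddof=1)
```

For skewed or long-tailed data the two estimates can differ substantially, because the IQR is insensitive to extreme values.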

Central Limit Theorem

For many distributions, the sum of a large number of drawn values converges to a normal distribution if:

• The values are drawn independently;
• The values are all drawn from the same distribution; and
• The distribution has a finite mean and variance.
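The three conditions above can be illustrated with a simulation. The following is a minimal sketch, assuming an exponential distribution as the skewed source; the choice of distribution, the seed, and the counts `n_values` and `n_repeats` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_values = 50       # values summed per draw
n_repeats = 20000   # number of independent sums

# Exponential distribution: strongly skewed, but with finite mean (1) and variance (1)
draws = rng.exponential(scale=1.0, size=(n_repeats, n_values))
sums = draws.sum(axis=1)

# Standardize with the theoretical mean and variance of the sum
z = (sums - n_values * 1.0) / np.sqrt(n_values * 1.0)

# For a standard normal, about 68.3% of values lie within one standard deviation
frac_within_1sd = np.mean(np.abs(z) < 1.0)
```

Although each individual value is far from normal, the standardized sums have a mean near 0, a standard deviation near 1, and roughly 68% of their values within one standard deviation.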

Uncertainty in Determining the Mean

[Figure: distribution of the mean for the Complex, Normal, Skewed, and Long tails distributions; n=1000]

Standard Error of the Mean

Sample: $x_1, x_2, \ldots, x_n$

Variance

Standard Error of the Mean
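The standard error of the mean is commonly estimated as the standard deviation divided by the square root of the sample size. The following is a minimal sketch comparing that estimate with the spread of means from repeated sampling, assuming a normal distribution with standard deviation 0.1; the seed, sample size and number of repeats are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
sample_size = 100
sample = rng.normal(loc=0.0, scale=0.1, size=sample_size)

# Estimate from a single sample: standard deviation divided by sqrt(n)
sem_formula = np.std(sample, ddof=1) / np.sqrt(sample_size)

# Empirical check: spread of the means of many independent samples
means = [rng.normal(0.0, 0.1, size=sample_size).mean() for _ in range(2000)]
sem_empirical = np.std(means)
```

Both values come out close to 0.1 / sqrt(100) = 0.01.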

Exercise 3

(a) Generate skewed data sets.

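# skew() is not shown on this slide; it is assumed to be defined in the exercise script
sample_size = 10
# Generate 30 times more candidate points than needed, since many are rejected
x_test = np.random.uniform(-1.0, 1.0, size=30 * sample_size)
y_test = np.random.uniform(0.0, 1.0, size=30 * sample_size)
y_test2 = skew(x_test, -0.1, 0.2, 10)
y_test2 /= max(y_test2)  # scale the distribution so that its maximum is 1
x_test2 = x_test[y_test < y_test2]  # keep x where (x, y) lies under the curve
x_sample = x_test2[:sample_size]

1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x,y) lies under the curve of the distribution.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values x obtained in this way will be distributed according to the original distribution.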
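The five steps above describe rejection sampling and can be sketched as a self-contained function. This is a minimal sketch under stated assumptions: `scipy.stats.skewnorm` stands in for the lecture's `skew()` function, whose definition is not shown, and the parameter mapping (location -0.1, scale 0.2, shape 10) and the helper name `rejection_sample` are guesses made for illustration.

```python
import numpy as np
from scipy.stats import skewnorm

def rejection_sample(pdf, lo, hi, size, rng):
    """Draw `size` values on [lo, hi] distributed according to pdf, by rejection sampling."""
    # Approximate the maximum of the pdf on a fine grid so the curve can be scaled to 1
    pdf_max = pdf(np.linspace(lo, hi, 10001)).max()
    accepted = np.empty(0)
    while accepted.size < size:
        x = rng.uniform(lo, hi, size=4 * size)    # steps 1-2: candidate x values
        y = rng.uniform(0.0, 1.0, size=4 * size)  # steps 1-2: candidate y values
        keep = y < pdf(x) / pdf_max               # step 3: keep x if (x, y) is under the curve
        accepted = np.concatenate([accepted, x[keep]])  # step 4: repeat until enough
    return accepted[:size]

rng = np.random.default_rng(5)

def target(x):
    # Assumed stand-in for the lecture's skew(): a skew-normal density
    return skewnorm.pdf(x, a=10, loc=-0.1, scale=0.2)

x_sample = rejection_sample(target, -1.0, 1.0, size=1000, rng=rng)
```

Looping until enough points are accepted avoids having to guess in advance how many candidates are needed, which the fixed factor of 30 in the exercise code handles by over-generating.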

Exercise 3

(b) Calculate the mean of samples drawn from the skewed data set and the standard error of the mean, and plot the distribution of averages.

for repeat in range(1000):
    …
    average.append(np.mean(x_sample))

# The standard deviation of the averages is the standard error of the mean
sem = np.std(average)
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.set_title('Sample size = ' + str(sample_size) + ', SEM = ' + str(sem))
# density=True normalizes the histogram; normed=True was removed from newer matplotlib releases
ax1.hist(average, bins=100, histtype='step', color='red', range=[-0.5, 0.5], density=True, lw=2)
ax1.set_xlim([-0.5, 0.5])

Box Plot

M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119

Box Plots

[Figure: box plots for the Complex, Normal, Skewed, and Long tails distributions]

Box Plots with All the Data Points

[Figure: box plots overlaid with all the data points for the Complex, Normal, Skewed, and Long tails distributions]

Box Plots, Scatter Plots and Bar Graphs: Normal Distribution

[Figure: box plots, scatter plots and bar graphs for the normal distribution; error bars show either the standard deviation or the standard error]

Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution

[Figure: box plots, scatter plots and bar graphs for the skewed distribution; error bars show either the standard deviation or the standard error]

Exercise 4

Download ibb2015_7_exercise4.py and plot box plots for a skewed data set.

# thiscolor, sample_size, x_sample and x_samples are assumed to be defined earlier in the script
fig, ax1 = plt.subplots(1, figsize=(6, 6))
ax1.scatter(np.linspace(1 - 0.1, 1 + 0.1, sample_size), x_sample, facecolors='none', edgecolor=thiscolor, lw=1)

bp = ax1.boxplot(x_samples, notch=False, sym='')
plt.setp(bp['boxes'], color=thiscolor, lw=2)
plt.setp(bp['whiskers'], color=thiscolor, lw=2)
plt.setp(bp['medians'], color='black', lw=2)
plt.setp(bp['caps'], color=thiscolor, lw=2)
plt.setp(bp['fliers'], color=thiscolor, marker='o', lw=0)

fig.savefig(…)

Descriptive Statistics - Summary

• Example distributions:
  • Normal distribution
  • Skewed distribution
  • Distribution with long tails
  • Complex distribution with several peaks

• Mean, median, quartiles, percentiles

• Variance, Standard deviation, Inter Quartile Range (IQR), error bars

• Box plots, bar graphs, and scatter plots

Descriptive Statistics – Recommended Reading

http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html

Homework

Plot the ratio of the standard error of the mean and the standard deviation as a function of sample size (use sample sizes of 3, 10, 30, 100, 300, 1000) for the skewed distribution in Exercise 3. Modify ibb2015_7_exercise3.py to generate this plot and email both the script and the plot.

Next Lecture: Sequence Alignment Concepts
