Bio5488 Genomics Spring, 2016 Lectures: Mon, Wed 10:00-11:30 am Lab: Fri 10:00-11:30 am 4 th floor classroom 4515 MSRB


Bio5488

Genomics

Spring, 2016

Lectures: Mon, Wed 10:00-11:30 am

Lab: Fri 10:00-11:30 am

4th floor classroom

4515 MSRB


Outline of the day

• Outline of the course

• What is genomics?

• A little history

• The simple principles of genomics

• Being quantitative

• From a student to an investigator


A few TA administrivia...

• If you didn’t receive an email from [email protected] this week, please email [email protected] or talk to a TA after class

• If you’re taking the lab:

– Please fill out the anonymous survey: https://goo.gl/YfGnEp

– Read assignment 1 (link on the syllabus section in Blackboard)

– Attempt to install the required software (instructions in the resources section of Blackboard)

– Bring your laptop to class on Friday


Course Web Site

• http://www.genetics.wustl.edu/bio5488/home.html

• Blackboard: bb.wustl.edu

• Linux Primer

• Python Primer

• Lecture notes

• Schedule

• Weekly Assignments and Answers

• Weekly Readings


Grading

4 credit

• ¼ midterm

• ¼ final

• ½ weekly assignments

3 credit

• ½ midterm

• ½ final

What is the key to your success?


http://www.genome.gov/Director/


History of Bio5488

1998: Bio5495 (Eddy)

2003: Bio5488 (Cohen, Mitra)

2006: Bio5495 (Brent)

2012: Bio5488 (Wang, Conrad)

Topics along the timeline: HGP, Computational Biology, Functional Genomics, Microarray, Systems/Synthetic Biology, Next-gen Sequencing, ENCODE, GWAS, Roadmap, etc.


History of Genomics and Epigenomics

1865 Gregor Mendel: founding of genetics

1953 Watson and Crick: double helix model for DNA

1955 Sanger: first protein sequence, bovine insulin

1970 Needleman-Wunsch algorithm for sequence alignment

1977 Sanger: DNA sequencing

1978 The term “bioinformatics” appeared for the first time

1980 The first complete gene sequence (Bacteriophage ΦX174), 5386 bp

1981 Smith-Waterman algorithm for sequence alignment

1981 IBM: first Personal Computer

1983 Kary Mullis: PCR

1986 The term "Genomics" appeared for the first time: name of a journal

1986 The SWISS-PROT database is created

1987 Perl (Practical Extraction and Report Language) is released by Larry Wall.

1990 BLAST is published

1995 The Haemophilus influenzae genome (1.8 Mb) is sequenced

1996 Affymetrix produces the first commercial DNA chips

2001 A draft of the human genome (3,000 Mbp) is published


History of Genomics and Epigenomics

[Figure: timeline of the field across the 90’s, 00’s, and 10’s. Labels: HGP; Computational Biology; Machine Learning; Data mining; Structural informatics; Drug Design; Statistical Modeling; Database; Gene expression; Microarray; Omics; Systems/Synthetic Biology; Next-gen Sequencing; ENCODE; GWAS; Roadmap etc.; Sequence analysis; Hidden Markov Model; Gene finding; BLAT; Genome Browser; Motif finding; Assembly; Comparative Genomics; Evolution]


Genome, genetics, and genomics

• What is a genome?
– The genetic material of an organism.
– A genome contains genes, regulatory elements, and other mysterious stuff.

• What is genetics?
– The study of genes and their roles in inheritance.

• What is genomics?
– The study of all of a person's genes (the genome), including interactions of those genes with each other and with the person's environment.


What about genomes

• Characterize the genome
– How big
– How many genes
– How are they organized

• Annotate the genome
– What, where, and how

• Modern genomics: “ChIPer” vs “Mapper”
– Direct measurement
– Inference
– Comparison
– Evolution

• From genome to molecular mechanisms to diseases
– Genomes/epigenomes of diseased cells

• What do you want to learn from this class?
– Being quantitative
– Concept/philosophy
– Techniques/problem solving skills
– Do not forget genetics!!!


Motivation slides


The sequence explosion

Genomes sequenced (http://www.genomesonline.org/, as of November 2015)

49559 Bacteria, 1136 Archaea, 11122 Eukarya, 4473 viruses

18385 metagenome samples

2298 Synthetic genomes


The Human Genome Project

June 26, 2000: President Clinton, with Craig Venter and Francis Collins, announces completion of "the first survey of the entire human genome."

February 15, 2001: The human genome completed!

April 1, 2010: The human genome completed, again!

10 years after draft sequence: What have we learned?

• Genome sequencing

• Functional elements

• Evolution of genome

• Basis of diseases

• Human history

• Computational biology


Francis Collins, NIH Director

“If I was a senior in college or a first-year graduate student trying to figure out what area to work in, I would be a computational biologist.”

-- During AAAS Meeting, Jan 2010

COLLINS: .... Computational biologists are having a really good time and it’s going to get better.

ROSE: Their day is coming?

COLLINS: Their day is here, but it’s going to be even more here in a few years.

COLLINS: They’re going to be the breakthrough artists ....

-- Mar 15, 2010, Charlie Rose Show


The Human genome: the “blueprint” of our body

10^13 different cells in an adult human

The cell is the basic unit of life

DNA = linear molecule inside the cell that carries instructions needed throughout the cell’s life ~ long string(s) over a small alphabet

Alphabet of four (nucleotides/bases) {A,C,G,T}

GTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCT

CCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATG

AAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAA

ATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATT

AGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG

ATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTG

CATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTA

TTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATT

James Watson

Francis Crick

Fred Sanger


DNA, Chromosome, and Genome


Building an Organism

[Figure labels: cell, DNA]

Every cell has the same sequence of DNA

Subsets of the DNA sequence determine the identity and function of different cells


What makes us different?

Differences between individuals?

Differences between species?


One genome, thousands of epigenomes

Embryonic stem cells

Fetal tissues

Adult cells and tissues


Understanding the genome

• View from 2000
– Protein-coding genes: 35,000 - 120,000
– Regulatory sequence: less than protein-coding information
– Transposons: junk DNA

• We now know ... all these are WRONG!


How many genes do we have?

[Figure from Science 2005, labeled “What we used to think”; organisms shown: fly, worm, human, weed, fish, rice]

Gene numbers do not correlate with organism complexity.

Many gene families are surprisingly old.


Complexity, Genome Size and the C-value Paradox

Source: www.genomesize.com

Organism                  Genome Size (Mb)
Amoeba                    670,000
Fern                      160,000
Salamander                81,300
Onion                     18,000
Paramecium                8,600
Toad                      6,900
Barley                    5,000
Chimp                     3,600
Gorilla                   3,500
Human                     3,500
Mouse                     3,400
Dog                       3,300
Pig                       3,100
Rat                       3,000
Boa Constrictor           2,100
Zebrafish                 1,900
Chicken                   1,200
Fruit fly                 180
C. elegans                100
Plasmodium falciparum     25
Yeast, Fission            14
Yeast, Baker's            12
Escherichia coli          4.6
Bacillus subtilis         4.2
H. influenzae             1.8
Mycoplasma genitalium     0.60

C-value: the amount of DNA contained within a haploid nucleus (e.g. a gamete) or one half the amount in a diploid somatic cell of a eukaryotic organism, expressed in picograms (1 pg = 10^-12 g).


Complexity and Organism Specific Genes

• Only 14 out of 731 genes on mouse chromosome 16 have no human homolog (Celera)

• Many genes specific to the mouse are olfactory receptors, also some differences in immunity and reproduction


Most functional information is non-coding

~5% highly conserved, but only 1.5% encodes proteins

[Figure: UCSC Genome Browser view of chr2 (q31.1), positions 172660000 to 172675000, showing DLX1 and DLX2. Tracks: UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics; Vertebrate Multiz Alignment & PhastCons Conservation (28 Species). A zoomed panel shows the multi-species alignment of a DLX1 segment translated as K P R T I Y S S L Q L Q A L N.]

What do they do?



Ultra conserved elements

[Figure labels: ultra conserved; e.d. 12.5]


HARs: Human accelerated regions

• 118 bp segment with 18 changes between the human and chimp sequences

• Expect less than 1


Human HAR1F differs from the ancestral RNA structure


Main components in the Human genome

Only 1.5% of the human genome is protein-coding

Transposable elements make up almost half of the human genome

Barbara McClintock


Transposable Elements (TEs)

Evolutionary pattern of LTR10B1 genomic copies

Wang et al. PNAS 2007

The LTR10 and MER61 families are particularly enriched for copies with a p53 site. These ERV families are primate-specific and transposed actively near the time when the New World and Old World monkey lineages split.

TEs can shape transcriptional networks


Understanding Human Disease

NHGRI GWA Catalog: www.genome.gov/GWAStudies, www.ebi.ac.uk/fgpt/gwas/

Published Genome-Wide Associations through 12/2013

Published GWA at p ≤ 5×10^-8 for 17 trait categories


ENCODE Project

Proposed in Nov 2009

HapMap 1000 Genomes

Epigenome

TCGA


Thinking Quantitatively


Biology Is A Quantitative Science!!!

1) Mendel’s Laws

2) Chargaff’s Rules

Gregor Mendel

1822-1884

Erwin Chargaff

1905-2002


Thinking Quantitatively

• Space
– Be comprehensive

• Signal to Noise Ratio
– Sensitivity, specificity, dynamic range
– What is my background control?

• Distributions
– Normal (Gaussian), Poisson, negative binomial, extreme value, hypergeometric, etc.
– Discrete vs continuous

• The P value

• Statistics, Probability, Computation, and Informatics

• Don’t forget genetics!!!


Spaces: Be comprehensive

• Conditions: spatial, temporal, treatment – think about controlling for multiple variables

• Think globally – interaction between local features and global features (Placenta histone example)

• Be comprehensive about what assumptions are made – some we know, some we don’t (genome assembly example)


Signal to Noise

[Figure: a mass spectrum (relative abundance, 0 to 100, vs m/z, 400 to 1000) and a ChIP signal shown against its input control]

Different sources of noise


Sensitivity, Specificity, and Dynamic Range

• Sensitivity
– What is the smallest signal that can reliably be detected (signal to noise)?
– True positive rate

• Specificity
– How well can we discriminate between similar signals?
– True negative rate

• Dynamic Range
– What is the linear range of detection?
– What is the range of natural variation?
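The two rates named above can be sketched in a few lines of Python. This is a minimal illustration, not course material: the function names and the counts (true/false positives and negatives) are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of real signals that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of non-signals correctly rejected."""
    return tn / (tn + fp)


# Hypothetical counts from comparing calls against a known truth set.
tp, fn, tn, fp = 90, 10, 950, 50
print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.95
```

A method can score well on one rate and poorly on the other, which is why both are reported together.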


Distributions

Normal (Gaussian)

Poisson

Extreme value

Negative binomial


The P value* (*for discrete variables)

[Figure: histogram of # of occurrences (bar heights 1, 3, 9, 17, 23, 17, 9, 3, 1) over an outcome axis running from 0 to 10]

What is the chance of getting exactly seven?

$$P = \frac{\#\text{ of trials that were seven}}{\#\text{ of total trials}} = \frac{9}{1+3+9+17+23+17+9+3+1} \approx 0.11$$

What is the chance of getting seven or better?

$$P = \frac{\#\text{ of trials that were seven or better}}{\#\text{ of total trials}} = \frac{9+3+1}{1+3+9+17+23+17+9+3+1} \approx 0.16$$
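The two questions above amount to counting trials in a histogram. The sketch below assumes the nine bar heights from the slide (1, 3, 9, 17, 23, 17, 9, 3, 1) sit on outcomes 1 through 9; that placement is an assumption read off the 0 to 10 axis, and the function names are illustrative.

```python
# Assumed mapping of outcome -> number of trials with that outcome.
counts = {1: 1, 2: 3, 3: 9, 4: 17, 5: 23, 6: 17, 7: 9, 8: 3, 9: 1}
total = sum(counts.values())


def p_exactly(k: int) -> float:
    """Fraction of trials whose outcome was exactly k."""
    return counts.get(k, 0) / total


def p_at_least(k: int) -> float:
    """Fraction of trials whose outcome was k or better (the P value)."""
    return sum(c for outcome, c in counts.items() if outcome >= k) / total


print(round(p_exactly(7), 3))   # 9 / 83
print(round(p_at_least(7), 3))  # (9 + 3 + 1) / 83
```

The P value is the second quantity: it counts the observed outcome and everything more extreme.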


“There Are Three Kinds of Lies: Lies, Damned Lies, and Statistics”

-Benjamin Disraeli (1804-1881)

Statistics: drawing inferences from the results of an experiment

• Comparing averages

• Comparing variation

[Figure: distribution plot; vertical axis 0 to 1, horizontal axis 1 to 97]


Gaussian (Normal) Distributions I

• Mean ($\bar{x}$): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

• Standard Deviation ($s$): $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

[Figure: histogram of height (horizontal axis) vs # of people (vertical axis)]
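As a quick check of the mean and standard deviation definitions, the sketch below computes both by hand and compares them with Python's `statistics` module. It assumes the sample (n - 1) form of the standard deviation, and the height values are made up for illustration.

```python
import math
import statistics


def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Sample standard deviation, using the n - 1 denominator."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


heights = [160.0, 165.0, 170.0, 175.0, 180.0]  # hypothetical heights in cm
print(mean(heights), sample_sd(heights))
print(statistics.mean(heights), statistics.stdev(heights))
```

Both lines should print the same pair of numbers, since `statistics.stdev` also uses the n - 1 denominator.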


Gaussian (Normal) Distributions II

• Mean ($\bar{x}$ or $\mu$)

• Standard Deviation ($s$ or $\sigma$)

• Compare means (t-tests)

• Compare standard deviations (F-tests)

• Calculate P-values for particular results (z-tests)

[Figure: histogram of height (horizontal axis) vs # of people (vertical axis)]
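The z-test bullet can be illustrated with the standard library's `statistics.NormalDist` (Python 3.8 or later). This is a sketch under assumed numbers: a population of heights with mean 170 and standard deviation 8, and an observed value of 186; none of these come from the slides.

```python
from statistics import NormalDist


def upper_tail_p(x: float, mu: float, sigma: float) -> float:
    """One-sided P value: chance of a value >= x under Normal(mu, sigma)."""
    return 1.0 - NormalDist(mu, sigma).cdf(x)


# Hypothetical example: 186 is two standard deviations above the mean.
z = (186.0 - 170.0) / 8.0
p = upper_tail_p(186.0, mu=170.0, sigma=8.0)
print(z, p)
```

For a two-sided question ("this far from the mean in either direction"), the one-sided value would be doubled.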


“The most important questions of life are …really only problems of probability.”

-Pierre Simon, Marquis de Laplace (1749-1827)

Probability: Computing the chance of a particular outcome of an experiment

[Figure: Venn diagram of an event E and its complement E^c inside the sample space S]

P(E) = E/S

P(E^c) = 1 - P(E)


Independent Events/Mutually Exclusive Events

What is the probability of A and B both occurring?

P(AB) = P(A)*P(B) (Independent)

P(A|B) = P(A) (Independent)

P(A|B)*P(B) = P(B|A)*P(A) (Bayes rule)

P(AB) = P(A|B) = 0 (Mutually Exclusive)

What is the probability of A or B occurring (independent)?

P(A) * P(B) + P(A) * (1-P(B)) + (1-P(A)) * P(B)

both + A only + B only

What is the probability of A or B occurring (mutually exclusive)?

P(A) + P(B)
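The "A or B" rules above can be checked numerically. In this sketch the function names and the probabilities 0.3 and 0.5 are illustrative; the independent case is built from the three disjoint pieces (both, A only, B only) and compared with the equivalent shortcut P(A) + P(B) - P(A)P(B).

```python
def p_a_or_b_independent(pa: float, pb: float) -> float:
    """P(A or B) for independent events: both + A only + B only."""
    both = pa * pb
    a_only = pa * (1 - pb)
    b_only = (1 - pa) * pb
    return both + a_only + b_only


def p_a_or_b_exclusive(pa: float, pb: float) -> float:
    """P(A or B) for mutually exclusive events: they never co-occur."""
    return pa + pb


pa, pb = 0.3, 0.5  # hypothetical probabilities
print(p_a_or_b_independent(pa, pb))  # same as pa + pb - pa * pb
print(p_a_or_b_exclusive(0.3, 0.2))
```

The two answers differ by exactly P(A)P(B), the overlap that mutually exclusive events do not have.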


An example

Amino acid percentages in Swiss-Prot

Ala (A) 7.81    Gln (Q) 3.94    Leu (L) 9.62    Ser (S) 6.88
Arg (R) 5.32    Glu (E) 6.60    Lys (K) 5.93    Thr (T) 5.45
Asn (N) 4.20    Gly (G) 6.93    Met (M) 2.37    Trp (W) 1.15
Asp (D) 5.30    His (H) 2.28    Phe (F) 4.01    Tyr (Y) 3.07
Cys (C) 1.56    Ile (I) 5.91    Pro (P) 4.84    Val (V) 6.71

What is the probability that a peptide of length 25 contains at least one SP motif?

P(SP) = P(S)P(P) = .0688*.0484 = .00333

P(SP^c) = 1 - P(SP) = .9967

A 25-mer has 24 adjacent residue pairs, treated here as independent:

P(no SP anywhere in a 25 mer) = P(SP^c)^24 = 0.92

P(at least one SP in a 25 mer) = 1 - P(SP^c)^24 = 0.08

What is the probability that the last residue of a protein is either K or R?

P(K) + P(R) = .0593 + .0532 = .11
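The worked example above can be reproduced in a few lines. The frequencies are the Ser, Pro, Lys, and Arg percentages from the table; the variable names are illustrative, and the calculation keeps the slide's simplification that the 24 adjacent residue pairs of a 25-mer are independent.

```python
# Residue frequencies taken from the table (percent / 100).
freq = {"S": 0.0688, "P": 0.0484, "K": 0.0593, "R": 0.0532}

p_sp = freq["S"] * freq["P"]       # chance that a given pair is S then P
pairs = 25 - 1                     # adjacent pairs in a 25-mer
p_no_sp = (1 - p_sp) ** pairs      # no SP at any of the 24 positions
p_at_least_one = 1 - p_no_sp       # complement: one or more SP motifs

# K and R at the last position are mutually exclusive, so they add.
p_k_or_r = freq["K"] + freq["R"]

print(round(p_sp, 5), round(p_no_sp, 2), round(p_at_least_one, 2))
print(round(p_k_or_r, 4))
```

Going through the complement ("no SP anywhere") avoids summing over every way of having one, two, or more motifs.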


counting permutations

Permutations:

Question: How many different 5-mer

sequences can I make using each of the

amino acids S, T, A, G, P once and only

once?

Answer:

5! = 5*4*3*2*1 = 120
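A brute-force check of the permutation count, as a small sketch: it enumerates every ordering of the five residues with `itertools.permutations` and compares the count with 5!.

```python
from itertools import permutations
from math import factorial

residues = "STAGP"  # each amino acid used once and only once

# Enumerate all orderings and collect the distinct 5-mers.
orderings = {"".join(p) for p in permutations(residues)}

print(len(orderings), factorial(len(residues)))  # 120 120
```

Enumeration only works for short sequences; the factorial gives the count directly for any length.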


counting combinations

Combinations:

Question: How many different 5-mer sequences can I make using three Ser’s and two Pro’s?

Answer:

$$\frac{5!}{3!\,2!} = \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(2 \cdot 1)} = 10$$
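The same brute-force check works when residues repeat. This sketch enumerates the orderings of three S and two P, keeps only the distinct sequences, and compares the count with 5!/(3!·2!) and with `math.comb` (choosing which 2 of the 5 positions hold Pro).

```python
from itertools import permutations
from math import comb, factorial

# Swapping identical residues gives the same sequence, so use a set.
distinct = {"".join(p) for p in permutations("SSSPP")}

by_formula = factorial(5) // (factorial(3) * factorial(2))

print(len(distinct), by_formula, comb(5, 2))  # 10 10 10
```

Dividing by 3! and 2! removes the orderings that differ only by swapping identical residues.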


Example

A particular cluster has 25 coexpressed genes in it. 15 of these genes are annotated as being involved in rRNA transcription.

Is 15/25 significant?


Hypergeometric Probability Distribution

the “overlap problem” or sampling without replacement

• N = number of genes in the genome (6000 for yeast)

• n = number of genes in the cluster (25)

• m = number of rRNA transcription genes (109 from MIPS)

• s = number of rRNA transcription genes in the cluster (15)

[Figure: Venn diagram with labels N, n, s, m]


Hypergeometric Probability Distribution

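[Figure: Venn diagram with labels N, n, s, m]

$$P(X \ge s) = \sum_{i=s}^{n} \frac{\binom{m}{i}\binom{N-m}{n-i}}{\binom{N}{n}}, \quad \text{where } \binom{a}{b} = \frac{a!}{b!\,(a-b)!}$$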


Hypergeometric Probability Distribution

[Figure: Venn diagram with N = 6000, n = 25, s = 15, m = 109]

$$P(X \ge 15) = \sum_{i=15}^{25} \frac{\binom{109}{i}\binom{6000-109}{25-i}}{\binom{6000}{25}}$$
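The hypergeometric tail for this example can be evaluated exactly with integer arithmetic. The sketch below is one way to do it, using `math.comb` (Python 3.8 or later) and `fractions.Fraction`; the function name `hypergeom_tail` is illustrative, and the arguments follow the slide's symbols N, m, n, s.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(N: int, m: int, n: int, s: int) -> Fraction:
    """P(X >= s): chance that a draw of n genes from N, of which m are
    annotated, contains at least s annotated genes (no replacement)."""
    upper = min(n, m)  # the overlap cannot exceed either set
    favorable = sum(comb(m, i) * comb(N - m, n - i)
                    for i in range(s, upper + 1))
    return Fraction(favorable, comb(N, n))


# Yeast example: 6000 genes, 109 rRNA transcription genes,
# a cluster of 25 genes, 15 of them annotated.
p = hypergeom_tail(6000, 109, 25, 15)
print(float(p))
```

Exact fractions sidestep the overflow and rounding problems of computing large factorials in floating point; the result is converted to a float only for display.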


Therefore in our example…

15/25 rRNA transcription genes in the cluster is significant.

BUT…

1) 10 out of 25 genes in the cluster are not rRNA transcription genes.

2) 94 rRNA transcription genes are not in the cluster.

3) What about the other genes in the cluster?

[Figure: Venn diagram with N = 6000, n = 25, s = 15, m = 109]