The Human Genome 3000000000 bases


The Human Genome

3,000,000,000 bases


The raw dataNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGATCTGATAAGTCCCAGGACTTCAGAAGagctgtgagaccttggccaagtcacttcctccttcagGAACATTGCAGTGGGCCTAAGTGCCTCCTCTCGGGACTGGTATGGGGACGGTCATGCAATCTGGACAACATTCACCTTTAAAAGTTTATTGATCTTTTGTGACATGCACGTGGGTTCCCAGTAGCAAGAAACTAAAGGGTCGCAGGCCGGTTTCTGCTAATTTCTTTAATTCCAAGACAGTCTCAAATATTTTCTTATTAACTTCCTGGAGGGAGGCTTATCATTCTCTCTTTTGGATGATTCTAAGTACCAGCTAAAATACAGCTATCATTCATTTTCCTTGATTTGGGAGCCTAATTTCTTTAATTTAGTATGCAAGAAAACCAATTTGGAAATATCAACTGTTTTGGAAACCTTAGACCTAGGTCATCCTTAGTAAGATcttcccatttatataaatacttgcaagtagtagtgccataattaccaaacataaagccaactgagatgcccaaagggggccactctccttgcttttcctcctttttagaggatttatttcccatttttcttaaaaaggaagaacaaactgtgccctagggtttactgtgtcagaacagagtgtgccgattgtggtcaggactccatagcatttcaccattgagttatttccgcccccttacgtgtctctcttcagcggtctattatctccaagagggcataaaacactgagtaaacagctcttttatatgtgtttcctggatgagccttcttttaattaattttgttaagggatttcctctagggccactgcacgtcatggggagtcacccccagacactcccaattggccccttgtcacccaggggcacatttcagctAtttgtaaaacctgaaatcactagaaaggaatgtctagtgacttgtgggggccaaggcccttgttatggggatgaaggctcttaggtggtagccctccaagagaatagatggtgAatgtctcttttcagacattaaaggtgtcagactctcagttaatctctcctagatccaggaaaggcctagaaaaggaaggcctgactgcattaatggagattctctccatgtgcaaaatttcctccacaaaagaaatccttgcagggccattttaatgtgttggccctgtgacagccatttcaaaatatgtcaaaaaatatattttggagtaaaatactttcattttccttcagagtctgctgtcgtatgatgccataccagagtcaggttggaaagtaagccacattatacagcgttaacctaaaaaaacaaaaaactgtctaacaagattttatggtttatagagcatgattccccggacacattagatagaaatctgggcaagagaagaaaaaaaggtcagagtttaatcctcaTTCCTAAGTTAtgtaaaccaaaaataaaattctgaagatgtcctgatcatctgaatggacccttcctctggaccagggcattccaaagttaacctgaaaattggtttgggccatgatgggaagggaggtttggatatgcctcattatgccctcttccctttcagaattcaggaaaagccaacc

agcattaacatcaacacagattttcagatcttaggtttctttccgatctattctctctgaaccctgctacctggaggcttcatctgcataataaaactttagtctccacaaccccttatcttaccccagacattcctttctattgataataactctttcaaccaattgccaatcagggtatgtttaaatctacctatgacctggaagcccccactttgcaccctgagatcaaaccagtgcaaatcttatatgtattgatttgtcAATGAAAACAGTCAAAGCCagtcaggcacagtggctcatgcctgtaatcccagcactttgggaggctgaggcgggtagatcacctgaggtcaggagttcgacaccagcctggccaacatggtgaaaccccgtccctactaaaatacaaaaattagcccagcttggtggtgggcacctgtaatcttagctactgcagagactgaggcaggagaatcgcttgaacccaggaggtggaggttgcagtgacctgagattttgccattgcactccagcctgggcaacagagcaagactctatctcaaaaaacaaacaaacaaacaaacaaacaaacaaacTgtcaaaatctgtacagtatgtgaagagatttgttctgaaccaaatatgaatgaccatggtccatgacacagccctcagaagaccctgagaacatgtgcccaaggtggtcacagtgcatcttagttttgtacattttagggagatatgagacttcagtcaaatacatttttaaaaaatacattggttttgtccagaaagccagaaccactcaaagcaggggtttccaggttataagtagatttaaaatttttctgattgacaattggttgaaagagttgtcaatagaaaggaatgtctgcattgtgacaagaggttgtggagaccaagtttctgtcatgcagatgaagccttcaggtagcaggcttccaagataacaggttgtaaatagttcttatcagacttaaGTTCTGTGGAGACGTAAAATGAGGCATATCTGACCTCCACTTccaaaaacatctgagacaggtctcagttaattaagaaagtttgttctgcctagtttaaggacatgcccatgacactgcctcaggaggtcctgacagcatgtgcccaaggtggtcaggatacagcttgcttctatatattttagggagaaaatacatcaGCCtgtaaacaaaaaattaaattctaaggtccctgaaccatctgaatgggctttcttctaggccagggcactctaaaattgaagaacctgaacattcctttctattgataatactttcagccagttgagcccattcagaCCACAGCAAGGTGCCAGGCCAGGCAAGGGCTGACTTGAGATACCTGCCAGATGAGTCACTGGCAAAAGGTGCTGCTCCCTGGTGAGGGAGAAACACCAGGGGCTGGGAGAGGCCCAGAAGGCTCTGAAGGAGTTTTGGTTTGGCTGGCCATGTGTGCAATTAGCGTGATGAGCTCTGACATGGCCTTGCATGGACGGATTGGGCAGG

As, Ts, Cs and Gs, and Ns
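A minimal sketch of how the raw data above could be summarised. The function name `base_composition` is hypothetical, and the reading of lowercase letters as repeat-masked sequence and of `N` as unsequenced gap is the usual browser convention, assumed here rather than stated on the slide.

```python
from collections import Counter


def base_composition(seq):
    """Summarise a raw sequence string: base counts, GC fraction, lowercase fraction."""
    counts = Counter(seq.upper())
    # Only called bases (A, C, G, T) enter the GC denominator; N is excluded.
    called = sum(counts[b] for b in "ACGT")
    gc_fraction = (counts["G"] + counts["C"]) / called if called else 0.0
    # Lowercase letters are assumed to mark soft-masked (repeat) sequence.
    lowercase = sum(1 for ch in seq if ch.islower())
    return {
        "counts": {b: counts[b] for b in "ACGTN"},
        "gc_fraction": gc_fraction,
        "lowercase_fraction": lowercase / len(seq) if seq else 0.0,
    }


# A short fragment in the same style as the slide's raw data.
summary = base_composition("NNNNGATCTGATAAGTCCCAGGACTTCAGAAGagctgtgagaccttggccaagt")
print(summary)
```

Excluding Ns from the GC denominator keeps long assembly gaps from dragging the estimate towards zero.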


Composition of the human genome

• Nearly half the genome is repeats

• Only approximately 1.5% is known coding genes

• Unknown functional fraction?!


The repeat content: jumping genes

1. Transposition-derived repeats

2. Inactive retroposed cellular genes

3. Simple repeats (microsatellites)

4. Segmental duplications

5. Tandem repeats (telomere, centromere)


Fewer genes than expected

GeneSweep – Ewan Birney (Wellcome Trust Sanger Institute)

The happy winner: Lee Rowen of the Institute for Systems Biology, with 25,947 genes.


Genome complexity

Regulatory elements

Promoters, enhancers, repressors…

This is where it gets complicated.

Alternative splicing

56% for humans, 22% for worms


Variation among chromosomes

Initial sequencing and analysis of the human genome

International Human Genome Sequencing Consortium Nature 409, 860 - 921 (15 February 2001)

• Overall recombination rate dependent on chromosome length.

• Large variation in gene density between chromosomes.

• Difference in organisation


Variation within chromosomes

[Figure: recombination rate, GC content and gene density plotted along a chromosome]

The genome is non-random in its organisation

Recombination – High at telomeres

GC – Variation at many scales - Isochores

Gene Density – Organisation by function
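To illustrate what "variation at many scales" means for GC content, here is a small sketch that computes the GC fraction in consecutive windows of a chosen size. The function name `gc_by_window` and the use of non-overlapping windows are assumptions made for the example, not the method behind the slide's figure.

```python
def gc_by_window(seq, window):
    """GC fraction for consecutive non-overlapping windows.

    Windows that contain no called base (all N) yield None.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    fractions = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window].upper()
        called = sum(chunk.count(b) for b in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / called if called else None)
    return fractions


# The same toy sequence looks different at two window sizes.
toy = "GGCCAATTNNNN"
print(gc_by_window(toy, 4))
print(gc_by_window(toy, 6))
```

Running the same function with windows of kilobases and then megabases on a real chromosome is one way to see both local fluctuation and the broad isochore-like domains.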


New observations

• Variation at multiple scales within and between chromosomes

• Only twice as many genes as flies and worms – but more proteins

• Genes have arrived from bacteria and transposable elements

• DNA transposons are inactive, and LTR elements probably are too (Alus are found in GC-rich regions)

• Most mutations occur in males (higher mutation rate)

• GC-poor regions correspond to dark bands.

• Recombination rates are higher at telomeres

• Lots of variation between individuals

2001


Human Genome Project starts 1990

Draft Human Genome completed 2001

Fewer gaps: from 147,821 to 341

More continuity: from 81 kb to 38,500 kb

Gene rich regions completed 2003

Each chromosome compiled and annotated. 2006!

Go home?

• Error rate of ~1 per 100,000 bases

• 2.85 billion bases

• Covers ~99% of the euchromatic genome.

Completing the Human Genome


New builds: Build 36, May 2006

Build 35, May 2004

Build 34, July 2003

Build 33, April 2003

Not quite finished

December 2001 - NCBI 28

July 2003 - NCBI 34


Chromosome 1

Segmental duplications

- allow genes to diversify and acquire novel functions.

• Duplication of a gene from one to many positions on the chromosome.

• A pericentric inversion follows a duplication of two genes


Chromosomes 2 and 4

Gene deserts

Megabase-sized genomic segments containing no known coding genes.

(some show conservation)

Role of these regions?

Lowest recombination rates of all the autosomes


Chromosome 3

Lowest rate of segmental duplication

Large inversion relative to our common ancestor with chimps.


Chromosome 7

Complex repeat patterns and fragile locations

Williams-Beuren syndrome associated with a large deletion (1.6Mb).

Lots of repetitive and duplicated DNA.

What is the true sequence?

“It is characterized by a distinctive, "elfish" facial appearance, along with a low nasal bridge; an unusually cheerful demeanor and ease with strangers, coupled with unpredictably occurring negative outbursts; mental retardation coupled with an unusual facility with language; a love for music; and cardiovascular problems, such as supravalvular aortic stenosis and transient hypercalcemia.”


Chromosome 10

Multi-species alignment – gene involved in cancer

Conservation indicates the location of functional elements. Some are known genes. Others aren’t – higher levels of conservation!


Chromosome 19

Very high gene density

Increase in all classes of known genes.

26 genes per megabase.

What is special about this chromosome?

It also has a high recombination rate, a high repeat density and a high GC content.


Chromosomes 12 and 3

Recombination rate variation

Knowing the physical positions of variants allows recombination rates to be estimated

Male and female rates differ

Fine scale variation
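As a sketch of the idea on this slide: once each marker has both a physical position (in bases) and a genetic map position (in centimorgans), the local recombination rate is the genetic distance divided by the physical distance. The function name `recombination_rates`, the input layout and the marker values are illustrative assumptions, not data from the lecture.

```python
def recombination_rates(markers):
    """Recombination rate (cM/Mb) for each interval between adjacent markers.

    markers: list of (physical_position_bp, genetic_position_cM) tuples,
             sorted by physical position.
    Returns a list of (start_bp, end_bp, rate_cM_per_Mb).
    """
    rates = []
    for (bp1, cm1), (bp2, cm2) in zip(markers, markers[1:]):
        megabases = (bp2 - bp1) / 1e6
        if megabases <= 0:
            raise ValueError("markers must be sorted with distinct positions")
        rates.append((bp1, bp2, (cm2 - cm1) / megabases))
    return rates


# Hypothetical markers: (position in bp, position on the genetic map in cM).
example_markers = [(1_000_000, 0.0), (2_000_000, 1.5), (4_000_000, 2.5)]
for start, end, rate in recombination_rates(example_markers):
    print(f"{start}-{end}: {rate:.2f} cM/Mb")
```

Running the same calculation separately on male and female genetic maps would expose the sex difference, and denser markers would resolve finer-scale variation.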


Where is the data available?

NCBI: www.ncbi.nlm.nih.gov/genome/guide/human/

• Part of the National Institutes of Health.

• Has a number of important associated projects.

• Mr NCBI – David Lipman.

Ensembl: www.ensembl.org/Homo_sapiens/

• A joint project between EMBL and the Sanger Institute.

• Primarily funded by the Wellcome Trust.

• Mr Ensembl – Ewan Birney

UCSC: genome.ucsc.edu/cgi-bin/hgGateway

• Based at the University of California Santa Cruz.

• Largely funded by the NHGRI.

• Mr UCSC – David Haussler


• Compositional: base composition, insertions and deletions, segmental duplications, repeats, transposable elements

• Functional: genes, regulatory elements, gene expression

• Evolutionary: species comparison, variation data, population genetic analysis

What data is available

[Screenshot: UCSC Genome Browser track controls] Use drop down controls below and press refresh to alter tracks displayed. Tracks with lots of items will automatically be displayed in more compact modes.

Mapping and Sequencing Tracks: Base Position, Chromosome Band, STS Markers, FISH Clones, Recomb Rate, Map Contigs, Assembly, Gap, Coverage, BAC End Pairs, Fosmid End Pairs, GC Percent, WSSD Duplication, Short Match, Restr Enzymes

Phenotype and Disease Associations: RGD QTL, Human Mutation

Genes and Gene Prediction Tracks: Known Genes, CCDS, RefSeq Genes, Other RefSeq, MGC Genes, Vega Genes, Vega Pseudogenes, Ensembl Genes, AceView Genes, ECgene Genes, N-SCAN, SGP Genes, Geneid Genes, Genscan Genes, Exoniphy, Augustus Genes, Retroposed Genes, Superfamily, Yale Pseudo, EvoFold, sno/miRNA, ExonWalk

mRNA and EST Tracks: Human mRNAs, Spliced ESTs, Human ESTs, Other mRNAs, Other ESTs, H-Inv, TIGR Gene Index, UniGene, Gene Bounds, Alt-Splicing

Expression and Regulation: Allen Brain, GNF Atlas 2, GNF Ratio, Affy HuEx 1.0, Affy U133, Affy GNF1H, Affy U133Plus2, Affy U95, CpG Islands, FirstEF, 5x Reg Potential, TFBS Conserved, Affy Txn Phase2, SGMO/EIO CD34 HS, NHGRI DNaseI-HS, Reg Potential 7 species, T-ScanS miRNA, PicTar miRNA

Comparative Genomics: Conservation, Fugu Blat, Fugu Chain, Fugu Net, Tetraodon Ecores, Tetraodon Chain, Tetraodon Net, Zebrafish Chain, Zebrafish Net, X. tropicalis Chain, X. tropicalis Net, Chicken Chain, Chicken Net, Opossum Chain, Opossum Net, Cow Chain, Cow Net, Cow BAC Ends, Cow Synteny, Dog Chain, Dog Net, Rat Chain, Rat Net, Mouse Chain, Mouse Net, Rhesus Chain, Rhesus Net, Chimp Chain, Chimp Net

Variation and Repeats: SNPs, SNP Arrays, HapMap LD, Tajima's D SNPs, Tajima's D, SNP Recomb Rates, SNP Recomb Hots, Segmental Dups, Structural Var, RepeatMasker, Simple Repeats, Microsatellite, RIPs, Self Chain


• Human chromosomes are numbered

• Arms are labelled p and q

• Regions labelled ascending from centromere.

• Bases numbered from beginning of small arm to end of long arm.

Orientation


Microsatellites and repeats

Transposable elements

• Important in many common diseases

• Some of the most polymorphic loci

• Make up a large proportion of the genome

Annotation - Repeats


mRNA evidence

Protein evidence

Gene prediction

EST evidence

Predicted transcripts – known, novel

Manually annotated genes

• Different levels of evidence for genes

• Based on homology

• Based on expression

• Based on prediction

Annotation - genes


Expression levels & tissues

Regulatory elements

• Regulatory elements might be important in complex diseases

• Microarray technology is generating expression data on a large scale

Annotation – Expression and Regulation

Expression varies in space and time


Cross Species

Within Humans

Annotation – Evolutionary

Variation is the most important feature of the genome!?

(issues - alignment)

(issues - ascertainment)


Encyclopedia of DNA Elements – ENCODE

• Variation group – SNPs, indels

• Function group – Promoters, transcription and binding

• Chromatin group – Chromatin modification, replication origins

• Multiple sequence alignment – Conservation vs Constraint

Aim: Understand everything possible about these regions.

1% of genome

14 manually chosen regions

(Alpha & beta globin, HOXA, FOXP2 and CFTR)

Plus 30 random regions


Human Variation

SNPs – most common variation in the human genome

10 million common variants.

Synonymous and non-synonymous variation

Information in the density of SNPs.

Information in the frequency of SNPs.

Information in the correlation between SNPs.
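The "correlation between SNPs" is usually quantified as linkage disequilibrium. A minimal sketch, assuming phased haplotypes with alleles coded 0/1 and using the common r² statistic; the function name `ld_r_squared` and the example haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2), each allele coded 0 or 1.
    Returns None if either SNP is monomorphic in the sample.
    """
    n = len(haplotypes)
    if n == 0:
        return None
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


# Two perfectly correlated SNPs versus two independent ones.
print(ld_r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))
print(ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

The allele frequencies `p_a` and `p_b` computed along the way are the "frequency of SNPs" information mentioned on the slide. Real data would normally be unphased genotypes and need an extra phasing or estimation step.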


HapMap Project

2002 HapMap phase I begins

Three populations:

(YRI) Yoruba in Ibadan, Nigeria: 90

(CEU) Utah, USA: 90

(CHB) Han Chinese in Beijing: 45

(JPT) Japanese in Tokyo: 44

Approximately 1 million SNPs

2005 Phase I complete, phase II begins

Increase from 1 million to ~4.6 million SNPs

2006 Phase II complete, “phase III” begins

Additional 6 populations

Kenya, African Americans, Mexican Americans, Italy, India


• Linkage Disequilibrium information is an important tool

• Population genetic annotation is often sample specific

The International HapMap


Learning from studies of human variation

• Can learn about how genetic diversity is structured across the globe

• Identify regions which have been under recent positive selection

• Identify recombination hotspots


Hot Topics

• MicroRNAs

20-mers of RNA that perform a diversity of roles – e.g. regulating mRNA levels

• Structural variation

The genome is full of polymorphic insertions and deletions, from 1 kb to a megabase

• Genome-wide association studies

Millions of pounds are being spent on scanning the genome for loci showing association with disease status.


Chromosomes X and Y

Sex chromosomes