Transcript
Page 1: (Human) Genomics BIOM/PHAR206 – 05/19/2014

(Human) Genomics
BIOM/PHAR206 – 05/19/2014

Olivier Harismendy, PhD
Division of Genome Information Sciences
Department of Pediatrics
Moores UCSD Cancer Center

Page 2: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UCSC Genome Browser
• isPCR
• BLAT
• LiftOver
• Track types
  – BED minimum
  – BED extended
  – WIG
• Track Display and Shuffle
• Browser Navigation
• Custom Session – Export Figure
• Custom Tracks

Page 3: (Human) Genomics BIOM/PHAR206 – 05/19/2014

0-based coordinates

Sequence A|C|C|G|G|T|C|G|A

1 based 1 2 3 4 5 6 7 8 9

0 based 0 1 2 3 4 5 6 7 8
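
The relation between the two numbering schemes can be sketched in a few lines of Python. This sketch assumes that the 0-based form is the half-open interval used by BED (start included, end excluded) and that the 1-based form is a closed interval; the function names are illustrative and do not come from the slides.

```python
from __future__ import annotations


def one_based_to_zero_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based closed interval to a 0-based half-open interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval to a 1-based closed interval."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


sequence = "ACCGGTCGA"

# Bases 3 to 5 in 1-based closed coordinates.
zb_start, zb_end = one_based_to_zero_based(3, 5)

# Python slices are 0-based and half-open, so they can be used directly.
subseq = sequence[zb_start:zb_end]
print(zb_start, zb_end, subseq)
```

Only the start changes between the two conventions; the end value is the same number in both, and the interval length is simply end minus start in the 0-based form.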

Page 4: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Human Genome Assemblies

Page 5: (Human) Genomics BIOM/PHAR206 – 05/19/2014

BED Track Formats

track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"

chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0

chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0

chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0

chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0

chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255

chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255

chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255

chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0

chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255

Page 6: (Human) Genomics BIOM/PHAR206 – 05/19/2014

BED Track Formats
Header: space separated parameters
• name=<track_label>
• description=<center_label>
• type=<track_type> - Defines the track type. The track type attribute is required for BAM, BED detail, bedGraph, bigBed, bigWig, broadPeak, narrowPeak, Microarray, VCF and WIG tracks.
• visibility=<display_mode> - 0 - hide, 1 - dense, 2 - full, 3 - pack, and 4 - squish.
• color=<RRR,GGG,BBB> - Defines the main color for the annotation track.
• itemRgb=On
• colorByStrand=<RRR,GGG,BBB RRR,GGG,BBB> - Sets colors for + and - strands, in that order.
• useScore=<use_score>
• group=<group>
• priority=<priority> - When the group attribute is set, defines the display position of the track relative to other tracks.
• db=<UCSC_assembly_name> - When set, indicates the specific genome assembly for which the annotation data is intended.
• offset=<offset> - Defines a number to be added to all coordinates in the annotation track. The default is "0".
• maxItems=<#> - Defines the maximum number of items the track can contain.
• url=<external_url> - Defines a URL for an external link associated with this track.
• htmlUrl=<external_url> - Defines a URL for an HTML description page to be displayed with this track.
• bigDataUrl=<external_url> - Defines a URL to the data file for BAM, bigBed, bigWig or VCF tracks.
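
Because the header is a list of space separated key=value pairs whose values may be quoted, it can be read with a shell-style tokenizer. The following is a minimal sketch, not part of the Genome Browser itself; `parse_track_line` is a hypothetical helper name, and the input is the track line from the ItemRGBDemo example.

```python
import shlex


def parse_track_line(line: str) -> dict:
    """Split a custom-track header into a {parameter: value} dictionary."""
    # shlex.split keeps quoted values such as "Item RGB demonstration" together.
    tokens = shlex.split(line)
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    params = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % token)
        params[key] = value
    return params


header = ('track name="ItemRGBDemo" description="Item RGB demonstration" '
          'visibility=2 itemRgb="On"')
params = parse_track_line(header)
print(params)
```

All values are kept as strings; converting `visibility` or `priority` to numbers is left to the caller.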

Page 7: (Human) Genomics BIOM/PHAR206 – 05/19/2014

BED Track Formats
• For intervals
• Header: space separated configuration parameters
  – chrom - The name of the chromosome
  – chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  – chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
  – name - Defines the name of the BED line.
  – score - A score between 0 and 1000.
  – strand - Defines the strand - either '+' or '-'.
  – thickStart - The starting position at which the feature is drawn thickly
  – thickEnd - The ending position at which the feature is drawn thickly
  – itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
  – blockCount - The number of blocks (exons) in the BED line.
  – blockSizes - A comma-separated list of the block sizes.
  – blockStarts - A comma-separated list of block starts.
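
As an illustration of the first six fields, the sketch below reads one BED line into a small record. The class and function names are assumptions made for this example; only the three mandatory fields are required, and the remaining columns of the line are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BedRecord:
    chrom: str
    chrom_start: int               # 0-based, included
    chrom_end: int                 # not included in the feature
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    @property
    def length(self) -> int:
        # With a 0-based half-open interval the length is end - start.
        return self.chrom_end - self.chrom_start


def parse_bed_line(line: str) -> BedRecord:
    """Parse the first (up to) six whitespace-separated BED fields."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart, chromEnd")
    record = BedRecord(fields[0], int(fields[1]), int(fields[2]))
    if len(fields) > 3:
        record.name = fields[3]
    if len(fields) > 4:
        record.score = int(fields[4])
    if len(fields) > 5:
        record.strand = fields[5]
    return record


rec = parse_bed_line(
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0"
)
print(rec.chrom, rec.name, rec.strand, rec.length)
```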

Page 8: (Human) Genomics BIOM/PHAR206 – 05/19/2014

WIG track format
# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
# 200 base wide points graph at every 300 bases, 50 pixel high graph
# autoScale off and viewing range set to [0:1000]
# priority = 20 positions this as the second graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
900
800
700
600
500
400
300
200
100
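
Since WIG positions are one-relative while BED is zero-based, it can help to see the two declaration styles expanded into BED-like intervals. The sketch below is an assumption-laden illustration: `wig_to_intervals` is a hypothetical name, `span` defaults to 1 when absent, and comment and track lines are skipped.

```python
def wig_to_intervals(lines):
    """Yield (chrom, start0, end0, value) with 0-based half-open coordinates."""
    mode = chrom = None
    span = 1
    step = pos = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("track"):
            continue
        parts = line.split()
        if parts[0] in ("variableStep", "fixedStep"):
            # Declaration line: key=value options follow the keyword.
            mode = parts[0]
            opts = dict(p.split("=", 1) for p in parts[1:])
            chrom = opts["chrom"]
            span = int(opts.get("span", 1))
            if mode == "fixedStep":
                pos = int(opts["start"])
                step = int(opts["step"])
            continue
        if mode == "variableStep":
            start1, value = int(parts[0]), float(parts[1])
        elif mode == "fixedStep":
            start1, value = pos, float(parts[0])
            pos += step
        else:
            raise ValueError("data line before a declaration line")
        # Shift the 1-based start to 0-based; the span gives the width.
        yield chrom, start1 - 1, start1 - 1 + span, value


example = [
    "variableStep chrom=chr19 span=150",
    "49304701 10.0",
    "49304901 12.5",
    "fixedStep chrom=chr19 start=49307401 step=300 span=200",
    "1000",
    "900",
]
intervals = list(wig_to_intervals(example))
for interval in intervals:
    print(*interval)
```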

Page 9: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Specific Tracks of interest
• UCSC genes
• RefSeq Genes
• RepeatMasker
• Conservation
• TF motif predictions
• dbSNP
• ENCODE
• Roadmap

Page 10: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Custom Sessions
• Create an account
• Customize the tracks displayed
• Add your own track (limited in size and time)
• Save and Share

Page 11: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Table Browser
• Subset gene, region, genome
• Output BED or fasta
• Intersection
• Filters

Page 12: (Human) Genomics BIOM/PHAR206 – 05/19/2014

ENCODE / Roadmap Tracks
• Track search
• Cell Types / Tissue Types
• Raw
• Peaks
• HMM

Page 13: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UNIX commands
• head
• more (press q to exit)
• cat
  – Example: cat file
  – Example: cat file1 file2
• grep
  – grep -v 'expression'
  – grep -A 1 'expression'
  – grep -B 2 'expression'
  – Example: grep -v '#' file.txt to remove comments
• Expression metacharacters
  – ^ beginning of line
  – $ end of line
  – [AB] A or B
  – . any single character
  – * zero or more of the preceding character
  – Example: 'CDKN.*' or 'chr[1-7]'
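
Python's `re` module uses the same anchors and bracket expressions as grep, so the selection logic can be tried on a few lines without touching a file. This is only a sketch: the `grep` helper and the example lines are made up for illustration, with `^` marking the beginning of a line and `$` the end.

```python
import re

lines = [
    "# comment line",
    "chr1\tCDKN2A",
    "chr7\tCDKN1B",
    "chr12\tKRAS",
    "chrX\tCDK4",
]


def grep(pattern, lines, invert=False):
    """Keep lines matching the pattern; invert=True mimics grep -v."""
    regex = re.compile(pattern)
    return [line for line in lines if bool(regex.search(line)) != invert]


no_comments = grep(r"^#", lines, invert=True)   # drop lines starting with '#'
cdkn = grep(r"CDKN", lines)                     # gene names containing CDKN
loose = grep(r"chr[1-7]", lines)                # also keeps chr12 (starts with chr1)
strict = grep(r"^chr[1-7]\t", lines)            # chr1 to chr7 only
ends_with_4 = grep(r"4$", lines)                # lines ending in 4

print(no_comments, cdkn, loose, strict, ends_with_4, sep="\n")
```

The `loose` and `strict` results differ because an unanchored `chr[1-7]` also matches the first four characters of chr12; anchoring the pattern and requiring the field separator avoids that.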

Page 14: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UNIX commands
• cut
  – cut -f 1
  – cut -f 3 -d ':'
• sort
  – sort -n
  – sort -nr (or sort -n -r)
  – sort -k 2
• uniq
  – uniq
  – uniq -c
• wc
  – wc -l file.txt
  – Example: cut -f 1 file | sort | uniq -c
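
The closing pipeline (first column, sorted, counted) is a common way to tally records per chromosome. An equivalent can be sketched in Python with `collections.Counter`; the function name and the rows are invented for the example, and a tab delimiter is assumed because that is the default for `cut`.

```python
from collections import Counter


def count_first_column(lines, delimiter="\t"):
    """Mimic `cut -f 1 | sort | uniq -c`: count values of the first field."""
    counts = Counter(
        line.rstrip("\n").split(delimiter)[0]
        for line in lines
        if line.strip()
    )
    # Sort by value of the first field, as `sort` would before `uniq -c`.
    return sorted(counts.items())


rows = ["chr1\t100", "chr2\t200", "chr1\t300", "chr10\t50"]
result = count_first_column(rows)
for value, count in result:
    print(count, value)
```

Note that the ordering is plain text ordering, so chr10 sorts before chr2, as it would with `sort` in the C locale.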

Page 15: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UNIX commands
• sed
  – sed 's/foo/bar/g' file: find and replace
• awk
  – awk '$3>2000' file: select rows with 3rd field > 2000
  – awk '{if ($3>2000) print $1,$2}' file: for those rows, print only the first 2 columns
  – awk '{sum+=$3} END {print sum}' file: print sum of column 3
  – awk '{sum+=$3} END {print sum/NR}' file: print average of column 3
• join
  – join -j 1 sorted_file1 sorted_file2
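
For readers more comfortable with Python, the awk one-liners on this slide can be mirrored step by step. The rows below are invented, and fields are assumed to be whitespace separated as in awk's default mode; this is a sketch of the logic rather than a replacement for awk.

```python
rows = [
    "geneA 10 1500",
    "geneB 20 2500",
    "geneC 30 3500",
]


def field(row, index):
    """Return field number `index` (1-based, like awk's $index) as a float."""
    return float(row.split()[index - 1])


# awk '$3>2000' file
selected = [row for row in rows if field(row, 3) > 2000]

# awk '{if ($3>2000) print $1,$2}' file
first_two = [" ".join(row.split()[:2]) for row in selected]

# awk '{sum+=$3} END {print sum}' file
total = sum(field(row, 3) for row in rows)

# awk '{sum+=$3} END {print sum/NR}' file  (NR = number of records read)
mean = total / len(rows)

print(selected, first_two, total, mean, sep="\n")
```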

Page 16: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Demo #1 and #2

Page 17: (Human) Genomics BIOM/PHAR206 – 05/19/2014

DNA variants (Sequence differences)

Highly Similar Genomes

Phenotypic Differences (Physical traits)

Human Genetic Variation

Page 18: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Variant Types

Frazer et al. 2009

Rahim, Harismendy et al (2008)

Page 19: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Within any given individual there are ~ 4 million genetic variants encompassing ~ 12 Mb

Variants from an individual genome

Page 20: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Variants from multiple genomes

Within a given individual the majority of variants are common.

Page 21: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Next Generation DNA analysis
• Whole Genome Sequencing
  – Mutations (coding and non-coding)
  – Translocations
  – Copy Number Variants
• Whole Exome Sequencing
  – Mutations (coding)
  – ~Copy number variants (trisomy, gene amplifications)
• Gene Panel
  – Mutations (coding)

Page 22: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Variant Frequencies

• Common genetic variants
  – second allele present at greater than 3% frequency
• Rare genetic variants
  – present at less than 3% frequency, and commonly at very low frequencies
• Private variants
  – restricted to a few families or single individuals

Page 23: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Map of Genetic Variation

Relationships between common SNPs in the human genome

Frazer et al (2007)

HapMap Project

Genotyped ~ 3.1 million SNPs in 270 individuals
– 90 Yoruba in Ibadan, Nigeria (YRI)
– 90 of European descent in Utah, USA (CEU)
– 45 Han Chinese in Beijing, China (CHB)
– 45 Japanese in Tokyo, Japan (JPT)

Page 24: (Human) Genomics BIOM/PHAR206 – 05/19/2014

1000G Project

Page 25: (Human) Genomics BIOM/PHAR206 – 05/19/2014

VCF format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
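
To make the column layout concrete, here is a minimal sketch that splits one data line into its fixed fields, the INFO key/value pairs and the per-sample genotype fields. It assumes tab-separated columns, as in real VCF files, and `parse_vcf_record` is a hypothetical helper written for this example; production work would normally rely on a dedicated VCF library.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO and sample data."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    info_dict = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Flag entries such as DB or H2 carry no value.
        info_dict[key] = value if sep else True

    # The FORMAT column names the colon-separated fields of each sample.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),      # VCF positions are 1-based
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": flt,
        "info": info_dict,
        "samples": samples,
    }


record_line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
rec = parse_vcf_record(record_line, ["NA00001", "NA00002", "NA00003"])
print(rec["pos"], rec["info"]["AF"], rec["samples"]["NA00003"]["GT"])
```

In the genotype field, `|` separates phased alleles and `/` unphased ones; 0 is the reference allele and 1, 2, ... index the ALT alleles in order.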

Page 26: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Linkage Disequilibrium (LD)
Given two biallelic sites there are four combinations that can be observed with the following distributions.

SNP 1 = A/G
SNP 2 = A/C

SNP1-SNP2 | Case r2=1 | Case r2=0
AA | 70 | 25
AC | 0 | 25
GA | 0 | 25
GC | 30 | 25

LD measures the level of correlation between SNPs
LD is the consequence of recombination at preferential sites
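
The two cases in the table can be checked numerically. The sketch below uses the standard definition r2 = D^2 / (pA (1 - pA) pB (1 - pB)) with D = pAB - pA pB, where pA and pB are allele frequencies at the two sites and pAB is the frequency of the haplotype carrying both. The function name is illustrative, and the haplotype counts are those from the table; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def r_squared(hap_counts):
    """Compute r2 from counts of two-letter haplotypes (SNP1 allele + SNP2 allele)."""
    total = sum(hap_counts.values())
    # Pick one reference allele per site; r2 does not depend on the choice.
    a1 = sorted({hap[0] for hap in hap_counts})[0]
    a2 = sorted({hap[1] for hap in hap_counts})[0]

    p1 = Fraction(sum(n for hap, n in hap_counts.items() if hap[0] == a1), total)
    p2 = Fraction(sum(n for hap, n in hap_counts.items() if hap[1] == a2), total)
    p12 = Fraction(hap_counts.get(a1 + a2, 0), total)

    d = p12 - p1 * p2
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        raise ValueError("at least one site is monomorphic")
    return d * d / denom


case_linked = {"AA": 70, "AC": 0, "GA": 0, "GC": 30}
case_independent = {"AA": 25, "AC": 25, "GA": 25, "GC": 25}

print(r_squared(case_linked), r_squared(case_independent))
```

In the first case only two of the four haplotypes occur, so knowing the allele at SNP 1 determines the allele at SNP 2; in the second case the alleles combine at random.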

Page 27: (Human) Genomics BIOM/PHAR206 – 05/19/2014

LD Bin structure example
LD bin = groups of SNPs with r2 ≥ 0.8
• The majority of common SNPs are in LD bins in the human genome
• Genotypes of a set of ~500,000 “tag SNPs” provide information (r2 ≥ 0.8) regarding a large fraction (90%) of all 8 million common SNPs present in humans.

Page 28: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS principle

Tests if common SNPs tagging an interval in the human genome are “associated” with a disease

From phenotype to genotype

http://www.mpg.de

Page 29: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS results

WTCCC (2007)

PR interval

Large number of tests requires a low p-value threshold (5 × 10^-8)
Sample size determines the variant frequencies and effect sizes that can be detected (power)
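
One common way to arrive at a threshold of this order is a Bonferroni correction. The sketch below assumes a family-wise error rate of 0.05 and roughly one million independent tests; both numbers are conventional assumptions rather than values given on the slide, and the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def is_significant(p_value, threshold):
    return p_value < threshold


# Assumed values: alpha = 0.05, about one million independent tests.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(threshold)
print(is_significant(1e-9, threshold), is_significant(1e-6, threshold))
```

A p-value of 1e-6, which would look convincing for a single test, does not pass once the number of tests is taken into account.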

Q1 2011
221 traits
1319 studies
>4000 associated SNPs

Page 30: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS highlights

• Many genes/loci not previously known to be involved in the diseases studied

• Newly identified pathways suggest that molecular sub-phenotypes of common diseases may exist

• Many common diseases have the same associated genes suggesting similar etiologies

Page 31: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS limitations
– Genetic
  • Small effect sizes: explain only a small fraction (1-25%) of the heritability
  • Missing heritability can be hiding in
    – Rare variants with large effects
    – Epistasis (Gene x Gene interactions)
    – Gene x Environment interactions (overlooked in heritability studies)
– Clinical
  • Limited prognostic value: classic markers (family history, lifestyle) work better
  • Limited by ethnicity
– Functional
  • Proxy SNPs are not the functional ones
  • Genes associated by proximity: variants are mostly outside genes
  • Cell type and condition unknown

Page 32: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Demo #3

Page 33: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Cancer Types

Page 34: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Clinical Data Collected

Field | %
age_at_initial_pathologic_diagnosis | 100%
ajcc_cancer_staging_handbook_edition | 80%
anatomic_site_colorectal | 88%
bcr_patient_uuid | 100%
braf_gene_analysis_performed | 87%
braf_gene_analysis_result | 6%
circumferential_resection_margin | 10%
colon_polyps_present | 42%
date_of_form_completion | 100%
date_of_initial_pathologic_diagnosis | 100%
days_to_birth | 100%
days_to_death | 89%
days_to_initial_pathologic_diagnosis | 100%
days_to_last_followup | 96%
days_to_last_known_alive | 61%
distant_metastasis_pathologic_spread | 98%
ethnicity | 55%
gender | 100%
height | 47%
histological_type | 99%
history_of_colon_polyps | 82%
icd_10 | 89%
icd_o_3_histology | 99%
icd_o_3_site | 99%
informed_consent_verified | 100%
kras_gene_analysis_performed | 89%
kras_mutation_codon | 4%
kras_mutation_found | 9%
loss_expression_of_mismatch_repair_proteins_by_ihc | 74%
lymph_node_examined_count | 98%
lymphatic_invasion | 87%
lymphnode_pathologic_spread | 100%
microsatellite_instability | 16%
non_nodal_tumor_deposits | 43%
number_of_abnormal_loci | 12%
number_of_first_degree_relatives_with_cancer_diagnosis | 85%
number_of_loci_tested | 12%
number_of_lymphnodes_positive_by_he | 94%
number_of_lymphnodes_positive_by_ihc | 9%
patient_id | 100%
perineural_invasion_present | 33%
person_neoplasm_cancer_status | 86%
preoperative_pretreatment_cea_level | 60%
pretreatment_history | 100%
primary_lymph_node_presentation_assessment | 98%
primary_tumor_pathologic_spread | 100%
prior_diagnosis | 100%
race | 57%
residual_tumor | 82%
synchronous_colon_cancer_present | 87%
tissue_source_site | 100%
tumor_stage | 96%
tumor_tissue_site | 100%
venous_invasion | 83%
vital_status | 100%
weight | 51%
anatomic_organ_subdivision | 2%
loss_expression_of_mismatch_repair_proteins_by_ihc_result | 18%

Field categories (figure legend): Personal and history, Histology, Clinical, Molecular

Page 35: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Clinical Data Collected

[Figure. Axis labels: Days after Dx; Patients. Annotation: Decreasing Intrinsic sensitivity]

Page 36: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Molecular Data Collected

Molecule | Method | Measured entity | Data
RNA | microarrays | 15,000 transcripts | Expression levels
RNA | RNA-Seq | All known and novel transcripts | Expression levels, isoform quantification, editing, novel transcripts, fusion transcripts
DNA | microarrays | 100k to 1M SNPs | Copy Number Aberrations, LoH, Polymorphisms
DNA | Sanger Sequencing | 30 M base pairs | Coding Mutations
DNA | whole exome sequencing | 50 M base pairs | Coding Mutations, Copy Number Aberrations
DNA | whole genome | 3 billion base pairs | Coding and Regulatory Mutations, Copy Number Aberrations, Rearrangements
DNA | Methylation Array | 450,000 CpG | Methylation levels
DNA | Methylation Array | 27,000 CpG | Methylation levels

Page 37: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Demo #4

