(Human) Genomics BIOM/PHAR206 – 05/19/2014 Olivier Harismendy, PhD Division of Genome Information Sciences Department of Pediatrics Moores UCSD Cancer Center


Page 1: (Human) Genomics BIOM/PHAR206 – 05/19/2014

(Human) Genomics
BIOM/PHAR206 – 05/19/2014

Olivier Harismendy, PhD
Division of Genome Information Sciences

Department of Pediatrics
Moores UCSD Cancer Center

Page 2: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UCSC Genome Browser
• isPCR
• BLAT
• LiftOver
• Track types
– BED minimum
– BED extended
– WIG
• Track Display and Shuffle
• Browser Navigation
• Custom Session – Export Figure
• Custom Tracks

Page 3: (Human) Genomics BIOM/PHAR206 – 05/19/2014

0-based coordinates

Sequence A|C|C|G|G|T|C|G|A

1-based 1 2 3 4 5 6 7 8 9

0-based 0 1 2 3 4 5 6 7 8
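The two numbering schemes can be converted mechanically. The sketch below assumes the 0-based scheme is the half-open one used by BED (start counted from 0, end excluded) and the 1-based scheme is closed (both ends included). The function names are illustrative and do not come from any library.

```python
from typing import Tuple

SEQUENCE = "ACCGGTCGA"  # the 9-base example sequence from the slide


def one_based_to_zero_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based closed interval to a 0-based half-open interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval to a 1-based closed interval."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# Bases 2 to 4 in 1-based closed numbering are "CCG".
start0, end0 = one_based_to_zero_based(2, 4)
print(start0, end0)           # 1 4
print(SEQUENCE[start0:end0])  # CCG
```

Python slices are themselves 0-based and half-open, so a BED-style interval can be used directly to slice the sequence, and the interval length is simply end minus start.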

Page 4: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Human Genome Assemblies

Page 5: (Human) Genomics BIOM/PHAR206 – 05/19/2014

BED Track Formats

track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"

chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0

chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0

chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0

chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0

chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255

chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255

chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255

chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0

chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255

Page 6: (Human) Genomics BIOM/PHAR206 – 05/19/2014

BED Track Formats
Header: space separated parameters
• name=<track_label>
• description=<center_label>
• type=<track_type> - Defines the track type. The track type attribute is required for BAM, BED detail, bedGraph, bigBed, bigWig, broadPeak, narrowPeak, Microarray, VCF and WIG tracks.
• visibility=<display_mode> - 0 - hide, 1 - dense, 2 - full, 3 - pack, and 4 - squish.
• color=<RRR,GGG,BBB> - Defines the main color for the annotation track.
• itemRgb=On
• colorByStrand=<RRR,GGG,BBB RRR,GGG,BBB> - Sets colors for + and - strands, in that order.
• useScore=<use_score>
• group=<group>
• priority=<priority> - When the group attribute is set, defines the display position of the track relative to other tracks.
• db=<UCSC_assembly_name> - When set, indicates the specific genome assembly for which the annotation data is intended.
• offset=<offset> - Defines a number to be added to all coordinates in the annotation track. The default is "0".
• maxItems=<#> - Defines the maximum number of items the track can contain.
• url=<external_url> - Defines a URL for an external link associated with this track.
• htmlUrl=<external_url> - Defines a URL for an HTML description page to be displayed with this track.
• bigDataUrl=<external_url> - Defines a URL to the data file for BAM, bigBed, bigWig or VCF tracks.

Page 7: (Human) Genomics BIOM/PHAR206 – 05/19/2014

BED Track Formats
• For intervals
• Header: space separated configuration parameters
– chrom - The name of the chromosome
– chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
– chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
– name - Defines the name of the BED line.
– score - A score between 0 and 1000.
– strand - Defines the strand - either '+' or '-'.
– thickStart - The starting position at which the feature is drawn thickly
– thickEnd - The ending position at which the feature is drawn thickly
– itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
– blockCount - The number of blocks (exons) in the BED line.
– blockSizes - A comma-separated list of the block sizes.
– blockStarts - A comma-separated list of block starts.
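As a minimal sketch of how these fields map onto a data line, the Python below reads the first six BED columns into a named tuple and computes the feature length. The names `BedRecord`, `parse_bed_line` and `feature_length` are illustrative assumptions, not part of any UCSC tool, and the parser splits on any whitespace for simplicity.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    chrom_start: int  # 0-based, included
    chrom_end: int    # excluded
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None


def parse_bed_line(line: str) -> BedRecord:
    """Parse the first six columns of a BED data line (only three are required)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, chromStart, chromEnd")
    return BedRecord(
        chrom=fields[0],
        chrom_start=int(fields[1]),
        chrom_end=int(fields[2]),
        name=fields[3] if len(fields) > 3 else None,
        score=int(fields[4]) if len(fields) > 4 else None,
        strand=fields[5] if len(fields) > 5 else None,
    )


def feature_length(record: BedRecord) -> int:
    """Length in bases; chromEnd is excluded, so no +1 is needed."""
    return record.chrom_end - record.chrom_start


record = parse_bed_line(
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0"
)
print(record.name, record.strand, feature_length(record))  # Pos1 + 1167
```

Because chromStart is 0-based and chromEnd is excluded, adjacent features such as Pos1 and Pos2 in the ItemRGBDemo track share a boundary value without overlapping.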

Page 8: (Human) Genomics BIOM/PHAR206 – 05/19/2014

WIG track format

# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
# 200 base wide points graph at every 300 bases, 50 pixel high graph
# autoScale off and viewing range set to [0:1000]
# priority = 20 positions this as the second graph
# Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
900
800
700
600
500
400
300
200
100
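To make the fixedStep declaration concrete, the sketch below expands `start`, `step` and `span` into explicit intervals. It keeps the one-relative (1-based, both ends included) coordinates that the WIG comments mention. The function name `expand_fixed_step` is a hypothetical helper, not a UCSC utility.

```python
from typing import List, Sequence, Tuple


def expand_fixed_step(
    start: int, step: int, span: int, values: Sequence[float]
) -> List[Tuple[int, int, float]]:
    """Return (first_base, last_base, value) tuples in 1-based closed coordinates."""
    if step < 1 or span < 1:
        raise ValueError("step and span must be positive")
    intervals = []
    for index, value in enumerate(values):
        first_base = start + index * step
        last_base = first_base + span - 1  # span bases, both ends included
        intervals.append((first_base, last_base, value))
    return intervals


# Values from the fixedStep example: start=49307401 step=300 span=200
values = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]
intervals = expand_fixed_step(49307401, 300, 200, values)
print(intervals[0])  # (49307401, 49307600, 1000)
print(intervals[1])  # (49307701, 49307900, 900)
```

Since the step (300) is larger than the span (200), each value covers 200 bases and leaves a 100-base gap before the next one.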

Page 9: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Specific Tracks of interest
• UCSC genes
• RefSeq Genes
• RepeatMasker
• Conservation
• TF motif predictions
• dbSNP
• ENCODE
• Roadmap

Page 10: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Custom Sessions
• Create an account
• Customize the tracks displayed
• Add your own track (limited in size and time)
• Save and Share

Page 11: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Table Browser
• Subset gene, region, genome
• Output BED or FASTA
• Intersection
• Filters

Page 12: (Human) Genomics BIOM/PHAR206 – 05/19/2014

ENCODE / Roadmap Tracks
• Track search
• Cell Types / Tissue Types
• Raw
• Peaks
• HMM

Page 13: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UNIX commands
• head
• more (press Q to exit)
• cat
– Example: cat file
– Example: cat file1 file2
• grep
– grep -v 'expression' : print lines that do not match
– grep -A 1 'expression' : also print 1 line after each match
– grep -B 2 'expression' : also print 2 lines before each match
– Example: grep -v '#' file.txt to remove comments
• Expression metacharacters
– $ end of line
– ^ beginning of line
– [AB] A or B
– . any character; * zero or more of the preceding character
– Example: 'CDKN.*' or 'chr[1-7]'

Page 14: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UNIX commands
• cut
– cut -f 1
– cut -f 3 -d ':'
• sort
– sort -n
– sort -nr (or sort -n -r)
– sort -k 2
• uniq
– uniq
– uniq -c
• wc
– wc -l file.txt
– Example: cut -f 1 file | sort | uniq -c
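For comparison, the last pipeline (count how often each value occurs in column 1) can be sketched in Python. This assumes tab-delimited input, which is the default delimiter for cut. The function name `count_first_column` and the sample lines are illustrative.

```python
from collections import Counter
from typing import Iterable


def count_first_column(lines: Iterable[str], delimiter: str = "\t") -> Counter:
    """Count occurrences of each distinct value in the first column."""
    return Counter(
        line.rstrip("\n").split(delimiter)[0]
        for line in lines
        if line.strip()  # skip blank lines
    )


# Hypothetical three-line BED-like input
sample_lines = ["chr7\t100\t200\n", "chr1\t5\t50\n", "chr7\t300\t400\n"]
counts = count_first_column(sample_lines)
for value in sorted(counts):
    print(counts[value], value)
# 1 chr1
# 2 chr7
```

The shell version needs the sort step because uniq only collapses adjacent duplicates; the Counter does not depend on input order.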

Page 15: (Human) Genomics BIOM/PHAR206 – 05/19/2014

UNIX commands
• sed
– sed 's/foo/bar/g' file : find and replace
• awk
– awk '$3>2000' file : select rows with 3rd field > 2000
– awk '{if ($3>2000) print $1,$2}' file : only print first 2 columns
– awk '{sum+=$3} END {print sum}' file : print sum of column 3
– awk '{sum+=$3} END {print sum/NR}' file : print average of column 3
• join
– join -j 1 sorted_file1 sorted_file2

Page 16: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Demo #1 and #2

Page 17: (Human) Genomics BIOM/PHAR206 – 05/19/2014

DNA variants (Sequence differences)

Highly Similar Genomes

Phenotypic Differences (Physical traits)

Human Genetic Variation

Page 18: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Variant Types

Frazer et al. 2009

Rahim, Harismendy et al (2008)

Page 19: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Within any given individual there are ~ 4 million genetic variants encompassing ~ 12 Mb

Variants from an individual genome

Page 20: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Variants from multiple genomes

Within a given individual the majority of variants are common.

Page 21: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Next Generation DNA analysis
• Whole genome sequencing
– Mutations (coding and non-coding)
– Translocations
– Copy Number Variants
• Whole Exome Sequencing
– Mutations (coding)
– ~Copy number variants (trisomy, gene amplifications)
• Gene Panel
– Mutations (coding)

Page 22: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Variant Frequencies

• Common genetic variants
– second allele present at greater than 3% frequency
• Rare genetic variants
– present at less than 3% frequency, and commonly at very low frequencies
• Private variants
– in limited families or single individuals
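The frequency cut-off above can be applied to allele counts as in the sketch below. The 3% threshold is the one given on the slide. Two choices are assumptions made only for illustration: a frequency of exactly 3% is treated as rare, and "private" is not assigned, since it depends on family structure rather than frequency alone.

```python
COMMON_THRESHOLD = 0.03  # 3% cut-off from the slide


def minor_allele_frequency(ref_count: int, alt_count: int) -> float:
    """Frequency of the less frequent of two alleles at a biallelic site."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no alleles observed")
    return min(ref_count, alt_count) / total


def classify_variant(maf: float) -> str:
    """Label a variant 'common' or 'rare' from its minor allele frequency."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must be between 0 and 0.5")
    return "common" if maf > COMMON_THRESHOLD else "rare"


# 10 copies of the second allele among 100 chromosomes
maf = minor_allele_frequency(90, 10)
print(maf, classify_variant(maf))  # 0.1 common
print(classify_variant(0.01))      # rare
```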

Page 23: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Map of Genetic Variation

Relationships between common SNPs in the human genome

Frazer et al (2007)

HapMap Project

Genotyped ~3.1 million SNPs in 270 individuals
– 90 Yoruba in Ibadan, Nigeria (YRI)
– 90 European descent in Utah, USA (CEU)
– 45 Han Chinese in Beijing, China (CHB)
– 45 Japanese in Tokyo, Japan (JPT)

Page 24: (Human) Genomics BIOM/PHAR206 – 05/19/2014

1000G Project

Page 25: (Human) Genomics BIOM/PHAR206 – 05/19/2014

VCF format

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
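A data line of this example can be taken apart as in the sketch below: fixed columns first, then the semicolon-separated INFO keys, then one colon-separated column per sample, interpreted through the FORMAT column. The function `parse_vcf_record` is an illustrative helper, not a library API. It splits on any whitespace for convenience, whereas real VCF files are tab-delimited, and it keeps values as strings rather than applying the types declared in the header.

```python
from typing import Any, Dict, List

SAMPLE_NAMES = ["NA00001", "NA00002", "NA00003"]  # from the #CHROM header line


def parse_vcf_record(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Split one VCF data line into fixed fields, INFO and per-sample values."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 columns")
    info: Dict[str, Any] = {}
    for item in fields[7].split(";"):
        key, separator, value = item.partition("=")
        info[key] = value if separator else True  # flags such as DB have no value
    record: Dict[str, Any] = {
        "chrom": fields[0],
        "pos": int(fields[1]),  # VCF positions are 1-based
        "id": fields[2],
        "ref": fields[3],
        "alt": fields[4].split(","),
        "qual": fields[5],
        "filter": fields[6],
        "info": info,
        "samples": {},
    }
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(format_keys, column.split(":")))
    return record


line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
record = parse_vcf_record(line, SAMPLE_NAMES)
print(record["pos"], record["info"]["AF"])  # 14370 0.5
print(record["samples"]["NA00002"]["GT"])   # 1|0
```

In the GT field, `|` marks a phased genotype and `/` an unphased one; 0 is the REF allele and 1, 2, ... index the ALT alleles in order.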

Page 26: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Linkage Disequilibrium (LD)

Given two biallelic sites there are four combinations that can be observed with the following distributions.

SNP 1 = A/G
SNP 2 = A/C

SNP1-SNP2   Case r2=1   Case r2=0
AA          70          25
AC          0           25
GA          0           25
GC          30          25

LD measures the level of correlation between SNPs.
LD is the consequence of recombination at preferential sites.
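The two cases in the table can be checked numerically. The sketch below uses the standard definition r2 = D^2 / (p1(1-p1) p2(1-p2)), where D = p12 - p1*p2, p1 and p2 are the frequencies of one allele at each SNP, and p12 is the frequency of the haplotype carrying both. The function name and the dictionary layout (two-letter haplotype mapped to its count) are assumptions made for illustration.

```python
from typing import Dict


def ld_r_squared(haplotype_counts: Dict[str, int], allele1: str, allele2: str) -> float:
    """r2 between two biallelic SNPs from counts of two-letter haplotypes."""
    total = sum(haplotype_counts.values())
    if total == 0:
        raise ValueError("no haplotypes observed")
    # Frequency of allele1 at SNP 1 and of allele2 at SNP 2
    p1 = sum(n for hap, n in haplotype_counts.items() if hap[0] == allele1) / total
    p2 = sum(n for hap, n in haplotype_counts.items() if hap[1] == allele2) / total
    # Frequency of the haplotype carrying both alleles
    p12 = haplotype_counts.get(allele1 + allele2, 0) / total
    d = p12 - p1 * p2
    denominator = p1 * (1 - p1) * p2 * (1 - p2)
    if denominator == 0:
        raise ValueError("r2 is undefined when a site is monomorphic")
    return d * d / denominator


case_complete_ld = {"AA": 70, "AC": 0, "GA": 0, "GC": 30}
case_no_ld = {"AA": 25, "AC": 25, "GA": 25, "GC": 25}
print(round(ld_r_squared(case_complete_ld, "A", "A"), 3))  # 1.0
print(round(ld_r_squared(case_no_ld, "A", "A"), 3))        # 0.0
```

In the first case only two of the four haplotypes occur, so knowing the allele at SNP 1 determines the allele at SNP 2. In the second case every combination is equally frequent and the two sites are independent.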

Page 27: (Human) Genomics BIOM/PHAR206 – 05/19/2014

LD Bin structure example

LD bin = groups of SNPs with r2 ≥ 0.8

• The majority of common SNPs are in LD bins in the human genome

• Genotypes of a set of ~500,000 "tag SNPs" provide information (r2 ≥ 0.8) regarding a large fraction (90%) of all 8 million common SNPs present in humans.

Page 28: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS principle

Tests if common SNPs tagging an interval in the human genome are “associated” with a disease

From phenotype to genotype

http://www.mpg.de

Page 29: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS results

WTCCC (2007)

PR interval

The large number of tests requires a low p-value threshold (5 × 10^-8)
Sample size determines which variant frequencies and effect sizes can be detected (power)

Q1 2011
221 traits
1,319 studies
>4,000 associated SNPs
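The 5 × 10^-8 threshold matches what a Bonferroni correction gives when a family-wise error rate of 0.05 is spread over about one million independent tests. Both the 0.05 level and the one-million figure are commonly quoted assumptions used here for illustration. They are not stated on the slide.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_genome_wide_significant(p_value: float, threshold: float) -> bool:
    """True when a SNP's p-value is below the corrected threshold."""
    return p_value < threshold


# Assumed: alpha = 0.05 and one million independent common-SNP tests
threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"{threshold:.0e}")                            # 5e-08
print(is_genome_wide_significant(1e-9, threshold))   # True
print(is_genome_wide_significant(1e-5, threshold))   # False
```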

Page 30: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS highlights

• Many genes/loci not previously known to be involved in the diseases studied

• Newly identified pathways suggest that molecular sub-phenotypes of common diseases may exist

• Many common diseases have the same associated genes suggesting similar etiologies

Page 31: (Human) Genomics BIOM/PHAR206 – 05/19/2014

GWAS limitations
– Genetic
• Small effect sizes: only explain a small fraction (1-25%) of the heritability
• Missing heritability can be hiding in
– Rare variants with large effects
– Epistasis (Gene x Gene interactions)
– Gene x Environment interaction (overlooked in heritability studies)
– Clinical
• Limited prognostic value: classic markers (family history, lifestyle) work better
• Limited by ethnicity
– Functional
• Proxy SNPs are not the functional ones
• Genes associated by proximity: variants are mostly outside genes
• Cell type and condition unknown

Page 32: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Demo #3

Page 33: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Cancer Types

Page 34: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Clinical Data Collected

Field                                                        %
age_at_initial_pathologic_diagnosis                          100%
ajcc_cancer_staging_handbook_edition                         80%
anatomic_site_colorectal                                     88%
bcr_patient_uuid                                             100%
braf_gene_analysis_performed                                 87%
braf_gene_analysis_result                                    6%
circumferential_resection_margin                             10%
colon_polyps_present                                         42%
date_of_form_completion                                      100%
date_of_initial_pathologic_diagnosis                         100%
days_to_birth                                                100%
days_to_death                                                89%
days_to_initial_pathologic_diagnosis                         100%
days_to_last_followup                                        96%
days_to_last_known_alive                                     61%
distant_metastasis_pathologic_spread                         98%
ethnicity                                                    55%
gender                                                       100%
height                                                       47%
histological_type                                            99%
history_of_colon_polyps                                      82%
icd_10                                                       89%
icd_o_3_histology                                            99%
icd_o_3_site                                                 99%
informed_consent_verified                                    100%
kras_gene_analysis_performed                                 89%
kras_mutation_codon                                          4%
kras_mutation_found                                          9%
loss_expression_of_mismatch_repair_proteins_by_ihc           74%
lymph_node_examined_count                                    98%
lymphatic_invasion                                           87%
lymphnode_pathologic_spread                                  100%
microsatellite_instability                                   16%
non_nodal_tumor_deposits                                     43%
number_of_abnormal_loci                                      12%
number_of_first_degree_relatives_with_cancer_diagnosis       85%
number_of_loci_tested                                        12%
number_of_lymphnodes_positive_by_he                          94%
number_of_lymphnodes_positive_by_ihc                         9%
patient_id                                                   100%
perineural_invasion_present                                  33%
person_neoplasm_cancer_status                                86%
preoperative_pretreatment_cea_level                          60%
pretreatment_history                                         100%
primary_lymph_node_presentation_assessment                   98%
primary_tumor_pathologic_spread                              100%
prior_diagnosis                                              100%
race                                                         57%
residual_tumor                                               82%
synchronous_colon_cancer_present                             87%
tissue_source_site                                           100%
tumor_stage                                                  96%
tumor_tissue_site                                            100%
venous_invasion                                              83%
vital_status                                                 100%
weight                                                       51%
anatomic_organ_subdivision                                   2%
loss_expression_of_mismatch_repair_proteins_by_ihc_result    18%

Legend categories: Personal and history, Histology, Clinical, Molecular

Page 35: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Clinical Data Collected

[Figure: axes labeled "Days after Dx" and "Patients"; annotation "Decreasing intrinsic sensitivity"]

Page 36: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Molecular Data Collected

Molecule | Method | Measured entity | Data
RNA | Microarrays | 15,000 transcripts | Expression levels
RNA | RNA-Seq | All known and novel transcripts | Expression levels, isoform quantification, editing, novel transcripts, fusion transcripts
DNA | Microarrays | 100k to 1M SNPs | Copy number aberrations, LoH, polymorphisms
DNA | Sanger sequencing | 30 M base pairs | Coding mutations
DNA | Whole exome sequencing | 50 M base pairs | Coding mutations, copy number aberrations
DNA | Whole genome | 3 billion base pairs | Coding and regulatory mutations, copy number aberrations, rearrangements
DNA | Methylation array | 450,000 CpG | Methylation levels
DNA | Methylation array | 27,000 CpG | Methylation levels

Page 37: (Human) Genomics BIOM/PHAR206 – 05/19/2014

Demo #4