High Throughput Genomic Data Vítor Santos Costa

High Throughput Genomic Data - DCCrvr/resources/MAP-i/MAPi-2014-vsc.pdf · High Throughput Genomic Data Vítor Santos Costa. ... fragments – provides universal primer allowing complex

High Throughput Genomic Data

Vítor Santos Costa

A Brief History of Sequencing and Gene Expression

Limitations of Sanger Sequencing
• Low throughput
• Inconsistent base quality
• Expensive
• Not quantitative

Frederick “Fred” Sanger

Hybridization Based Gene Expression Quantification
• Reliance on existing knowledge about genome sequence
• High background due to cross-hybridization
• Requires lots of starting material
• Limited dynamic range of detection

Next Generation Sequencing (Massive Parallel Sequencing)

Principles

1) Fragmentation and tagging of genomic/cDNA fragments – provides universal primer allowing complex genomes to be amplified with common PCR primers

2) Template immobilization – DNA separated into single strands and captured onto beads (1 DNA molecule/bead)

3) Clonal Amplification – Solid Phase Amplification

4) Sequencing and Imaging – Cyclic reversible termination (CRT) reaction

Next Generation Sequencing (Massive Parallel Sequencing)

Clonal Amplification – Solid Phase Amplification

Priming and extension of the single-strand, single-molecule template, then bridge amplification of the immobilized template with immediately adjacent primers to form clusters (creates 100-200 million spatially separated template clusters). The clusters provide free ends to which a universal sequencing primer can be hybridized to initiate the NGS reaction. Each cluster represents a population of identical templates.

Next Generation Sequencing (Massive Parallel Sequencing)

Cyclic Reversible Termination – DNA Polymerase bound to primed template adds 1 (of 4) fluorescently modified nucleotide. 3’ terminator group prevents additional nucleotide incorporation.

Following incorporation, remaining unincorporated nucleotides are washed away. Imaging is performed to determine the identity of the incorporated nucleotide.

Cleavage step then removes terminating group and the fluorescent dye. Additional wash is performed before starting next incorporation step

This is repeated ~250 million times (25Gb) with HiSeq2500 (~4 days)

Unlike Sanger sequencing, termination is REVERSIBLE
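
The incorporate, image, cleave cycle described on this slide can be sketched as a toy model of a single cluster. This is only an illustration: the function name and the simplification of reading the template left to right are assumptions, and real chemistry (phasing, signal decay, strand orientation) is ignored.

```python
# Toy model of cyclic reversible termination (CRT) for one cluster.
# Hypothetical helper, not part of any sequencing software.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_crt(template, cycles):
    """Return the bases called after `cycles` incorporate/image/cleave cycles."""
    read = []
    for position in range(min(cycles, len(template))):
        # Incorporation: exactly one dye-labeled nucleotide is added; its
        # 3' terminator blocks any further extension in this cycle.
        incorporated = COMPLEMENT[template[position]]
        # Wash and imaging: the dye identifies the incorporated base.
        read.append(incorporated)
        # Cleavage: terminator and dye are removed, so the next cycle can
        # extend by one more base (the "reversible" part).
    return "".join(read)


print(sequence_by_crt("ACGT", 3))
```

Each loop iteration corresponds to one sequencing cycle, so the read length equals the number of cycles run.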

RNA Sequencing

Population of RNA (poly A+) converted to a library of cDNA fragments with adaptors attached to one or both ends

Solid Phase Amplification performed

Molecules sequenced from one end (Single End) or both ends (Paired End)

Reads are typically 30-400bp depending on the sequencing technology used

RNA Purification and Analysis

RNA Purification: Can use Qiagen Kit or Phenol/Chloroform Extraction; do not use Invitrogen RNA isolation kit

RNA Quality Assessment (Agilent 2100 BioAnalyzer)

RNA Quantification (Qubit) – NanoDrop considered too inaccurate

TRUSEQ Library Preparation

Library Construction

Effective elimination of ribosomal RNA (negative selection) followed by polyA selection (for mRNA)

High Quality Strand Information

Can be used with low quality/low abundance RNA (10-100ng)

48 barcodes allow for multiplexing

Small RNAs can be directly sequenced

Large RNAs must be fragmented

http://res.illumina.com/documents/products/datasheets/datasheet_truseq_stranded_rna.pdf

Sequencing Apparatus at UCLA

Experimental Design: Single End (SR) vs Paired End (PE)

Single Read: one read sequenced from one end of each sample cDNA insert (Rd1 SP: Read 1 Sequencing Primer)

Paired End: two reads (one from each end) sequenced from each sample cDNA insert (Rd1 and Rd2 sequencing primers)

SR: often used for expression studies or SNP detection; NOT good for splice isoforms

PE: used for discovery of novel transcripts, splice isoforms and for de novo transcriptome assembly

Experimental Design: How many reads do I need?

Study Type                                 Reads Needed
Expression Profiling                       5-10 Million
Alternative splicing, quantifying cSNPs    50-100 M
De Novo Transcriptome Assembly             100-1000 M

Sequencing Instrument    Reads per Lane (SR:PE)    Reads per Flow Cell
HiSEQ 2500               185:375M                  1.5:3 Billion

Greater Sequencing Depth correlates with better genomic coverage and more robust differential gene expression analysis
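
As a rough planning aid, the per-lane read counts quoted on this slide can be turned into a lane budget. The sketch below assumes reads are split perfectly evenly among multiplexed samples and ignores barcode limits; the function names are illustrative only.

```python
import math

# Reads per lane on a HiSeq 2500, in millions, as quoted on this slide.
READS_PER_LANE_M = {"SR": 185, "PE": 375}


def samples_per_lane(reads_needed_m, mode="SR"):
    """How many samples fit in one lane at the requested depth (in millions)."""
    return READS_PER_LANE_M[mode] // reads_needed_m


def lanes_needed(n_samples, reads_needed_m, mode="SR"):
    """Lanes required for n_samples, assuming perfectly even pooling."""
    total_reads_m = n_samples * reads_needed_m
    return math.ceil(total_reads_m / READS_PER_LANE_M[mode])


# Expression profiling at 10 M reads per sample, single-read run.
print(samples_per_lane(10, "SR"), lanes_needed(24, 10, "SR"))
```

In practice uneven pooling and unmapped reads reduce the usable depth, so some headroom is advisable.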

Sequence Analysis: Theory vs. Practice

Sequence Analysis

One flow cell can generate up to 600Gb of data. Where am I going to store this data?

Stem Cell Core will keep raw data for up to 6 months

Sequencing analysis takes a great deal of processing power. Currently the Cheng Lab lacks sufficient storage, processing capacity and expertise. While analysis programs have become more user friendly (e.g. Galaxy), storage and processing capability will always be required.

Hoffman2 Cluster: 11,000 processors, 1300 active users using up to 8 million computing hours per month. A typical user account allows 20GB of permanent storage. Users are also provided a scratch folder (~100GB) where files can be stored for up to 7 days, at which point they are deleted permanently.

Access to the Hoffman2 cluster requires a UCLA email account (email Shirley Goldstein [email protected])

However, access also requires a PI sponsor. Genhong is currently not one.

Hoffman2 also provides computing tutorials on a regular basis (see website)

http://ccn.ucla.edu/wiki/index.php/Hoffman2:Getting_an_Account

Converting RAW data to FASTQ

RAW data from a HISEQ 2500 run yields two files:
1) .bcl file: contains base identity information for each run
2) .stats file: contains base intensity and quality information

Most (and probably all) programs need a merged file (named FASTQ or QSEQ)

Download and install bclconverter (already installed on Optiplex 990)

SxaQSEQsXA050L3:xG3KF4Ue

COMMAND

~bin/setupBclToQseq.py -i FOLDER_CONTAINING_LANE_DIRS -p POSITION_DIR -o OUT_DIR --overwrite

followed by make in OUT_DIR

If multiplexed, files then need to be de-multiplexed (this is slightly complicated)

Converting RAW data to FASTQ

@SN971:3:2304:20.80:100.00#0/1
NAAATTTCACATTGCGTTGGGAACAGTTGGCCCAAACTCAGGTTGCAGTAACTGTCACAATACCATTCTCCATCAACTTCAAGAAATGTTCAACAAAACAC
+
@P\cceeegggggiihhiiiiiiihighiiiiiiiiiiiiiifghhhhgfghiifihihfhhiiiihiggggggeeeeeeddcdddccbcdddcccccccc

FASTQ File

Line 1: begins with ‘@’ followed by sequence identifier
Line 2: raw sequence
Line 3: +
Line 4: base quality values for sequence in Line 2

Sequence identifier fields, in order: INSTRUMENT NAME (SN971), Lane # (3), Tile # (2304), X (20.80), Y (100.00), ADAPTOR INDEX (#0), SINGLE END READ (/1)
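
A minimal sketch of how the identifier and quality lines of the record on this slide could be decoded in Python. The function names are illustrative, and the quality offset of 64 is an assumption based on the older Illumina header style seen here; newer files typically use an offset of 33.

```python
def parse_illumina_header(header):
    """Split '@INSTRUMENT:LANE:TILE:X:Y#INDEX/READ' into named fields."""
    name, _, read = header.lstrip("@").partition("/")
    coords, _, index = name.partition("#")
    instrument, lane, tile, x, y = coords.split(":")
    return {
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": float(x),
        "y": float(y),
        "index": index,
        "read": int(read),
    }


def phred_scores(quality, offset=64):
    """Convert a quality string to integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


fields = parse_illumina_header("@SN971:3:2304:20.80:100.00#0/1")
print(fields["instrument"], fields["lane"], fields["tile"])
# First eight quality characters of the record shown on the slide.
print(phred_scores("@P\\cceee"))
```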

GALAXY

User friendly web interface for processing and analyzing Sequencing Data

Galaxy has also been installed on the Optiplex 990

Allows for application of workflows – enable automated processing and mapping of data

Can add tools to the Galaxy toolbox

Obtain a Galaxy account linked to the Hoffman2 cluster for higher processing power – email Weihong Yan ([email protected])

Video tutorials

Published workflows

My RNA Seq Workflow

Work in progress

Quality Control

FASTQ Groomer: converts FASTQ data from different sources (e.g. Illumina, 454 sequencing, etc.) to a consensus FASTQ file

FASTQ QC: assesses base quality of sequence reads
• Per base sequence quality
• Per sequence quality scores
• GC content
• Sequence length
• Sequence duplication
• Overrepresented sequences
• Kmer content

[Figure: quality plots labeled Genhong, Shankar and Kislay; y-axis: Quality]

FASTQ Trimmer: eliminates sequences below a Phred score threshold (usually <20). Remember to check how many reads are lost from the original input after processing
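
To illustrate the idea of quality trimming and of counting lost reads, here is a small Python sketch. It is not the algorithm used by the Galaxy FASTQ Trimmer; it simply trims low-quality bases from the 3' end and drops reads that become too short, with hypothetical function names and a default threshold of 20.

```python
def trim_3prime(sequence, scores, min_q=20):
    """Cut bases off the 3' end until one with quality >= min_q is reached."""
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], scores[:end]


def filter_reads(reads, min_q=20, min_len=1):
    """Trim every (sequence, scores) pair; return kept reads and number lost."""
    kept = []
    for sequence, scores in reads:
        sequence, scores = trim_3prime(sequence, scores, min_q)
        if len(sequence) >= min_len:
            kept.append((sequence, scores))
    return kept, len(reads) - len(kept)


reads = [("ACGT", [30, 30, 10, 5]), ("GG", [2, 3])]
kept, lost = filter_reads(reads)
print(kept, lost)
```

Comparing `lost` with the size of the original input gives the check recommended on the slide.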

Reference Mapping - TOPHAT

INPUT FASTQ (processed)

Output (4 files)
• Insertions (.bed)
• Deletions (.bed)
• Junctions (.bed)
• Accepted Hits (.bam)

.bed files can be downloaded to Excel

.sam (Sequence Alignment/Map) or .bam (binary compressed version of .sam) files can be used to visualize reads using the UCSC Genome Browser or Integrative Genomics Viewer

Link to file type descriptions: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

TOPHAT provides both identifying and quantifying information

Reference Mapping - TOPHAT

Often 10-20% of reads do not map to any consensus region of the genome

Estimating Transcript Abundance - Cufflinks

INPUT .bam file (Accepted Hits)

Reference (.gtf) Refseq, Ensembl, etc

Output (tabular form, excel) FPKM quantifiable

Visualizing Reads Across the Genome

Upload Files to UCSC Genome Browser
• Convert .bam file to .bedgraph (using Galaxy)
• Requires some coding
• Size limitations

Upload Files to Integrative Genomics Viewer
• Convert .bam file to .bedgraph (using Galaxy)
• Upload directly

[Figure: two genome browser views, each with tracks labeled WT, IFNAR KO and IL-27R KO]

How do I quantify expression from RNA-seq?

RPKM: Reads Per Kilobase of transcript per Million mapped reads (Mortazavi et al. Nature Methods 2008)

Longer and more highly expressed transcripts are more likely to be represented among RNA-seq reads

RPKM normalizes by transcript length and the total number of reads captured and mapped in the experiment

Sequencing depth can alter RPKM values
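
The normalization can be written out directly. The sketch below uses the standard RPKM definition (read count scaled by transcript length in kilobases and by mapped reads in millions); the function and argument names are illustrative.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 500 reads on a 2 kb transcript in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Because the total number of mapped reads sits in the denominator, libraries of very different depth or composition can yield RPKM values that are not directly comparable.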

Differential Gene Expression Analysis

RPKM
- Can calculate fold change
- Input sequence reads must be similar
- Replicates not needed
- Provides NO statistical test for differential gene expression
- Useful for cluster-based classification of genes
http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/Help/4%20Quantitation/4.3%20Pipelines/4.3.1%20RNA-Seq%20quantitation%20pipeline.html

DESeq
- Input .bam file
- Can set statistical threshold
- Input sequence reads can be somewhat dissimilar
- Must have replicates
- Not currently on Galaxy (must use edgeR)

CuffDiff (available on GALAXY)
- Input .bam file
- Can set statistical threshold (p<0.05 or whatever)
- Replicates encouraged but not needed
- Input sequence reads can be somewhat dissimilar
- Can provide differential splicing and promoter usage

Differential Gene Expression Analysis: Sampling Variance

Consider a bag of balls in which the red balls are a small fraction of the total. You sample n balls. p represents the proportion of red balls in the bag, and K is the number of red balls in your sample.

Expected number of red balls in the sample: μ = pn

K (the actual number of red balls drawn) follows a Poisson distribution, and hence K varies around the expected value μ with a standard deviation of sqrt(μ), i.e. a relative standard deviation of 1/sqrt(μ)

Microarray data follows a Poisson distribution. However, RNA-seq does not. In RNA-seq, genes with high mean counts (either because they’re long or highly expressed) tend to show more variance between samples than genes with low mean counts, and more than a Poisson model predicts. Thus this data fits a Negative Binomial Distribution

[Figure: Poisson vs. Negative Binomial distributions]
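
The difference between the two models can be seen in their mean-variance relationships. The sketch below uses the common negative binomial parameterization with a dispersion term (variance = μ + dispersion · μ²); the dispersion value is an illustrative assumption, not a number from the slides.

```python
import math


def poisson_sd(mu):
    """Standard deviation of a Poisson count with mean mu."""
    return math.sqrt(mu)


def negative_binomial_variance(mu, dispersion):
    """Variance of a negative binomial count: Poisson part plus overdispersion."""
    return mu + dispersion * mu ** 2


for mu in (10, 100, 1000):
    print(mu, poisson_sd(mu) ** 2, negative_binomial_variance(mu, dispersion=0.1))
```

With any positive dispersion, the extra variance grows with the square of the mean, so highly expressed genes depart most from the Poisson expectation.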

Differential Gene Expression Analysis

CuffDiff: If you have two samples, CuffDiff tests, for each transcript, whether there is evidence that the concentration of this transcript is not the same in the two samples

DESeq/edgeR: If you have two different experimental conditions, with replicates for each condition, DESeq tests whether, for a given gene, the change in expression strength between the two conditions is large compared to the variation within each group.

You will get different answers with different tests

Resources

RNA-seq: technical variability and sampling. McIntyre et al. BMC Genomics 2011 12:293

Statistical Design and Analysis of RNA Sequencing Data. Auer and Doerge. Genetics 2010 185(2): 405-416

Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Aird et al. Genome Biology 2011 12:R18

ENCODE RNA-Seq guidelines: www.encodeproject.org/ENCODE/experiment_guidelines.html

Further Reading

Bioinformatics for High Throughput Sequencing. Rodriguez-Ezpeleta et al. SpringerLink New York, NY Springer c2012

RNA sequencing: advances, challenges and opportunities. Ozsolak and Milos. Nature Reviews Genetics 12 87-98

Computational methods for transcriptome annotation and quantification using RNA-seq. Garber et al. Nature Methods 8, (2011)

Next-generation transcriptome assembly. Martin and Wang. Nature Reviews Genetics 12 671-682.

Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell et al. Nature Protocols 2012

SEQanswers.com

DevTox Preliminary Learning Results

Vítor Santos Costa, C. David Page

Questions

• Can we detect exposure to lead-30?
• What genes are important?
• What functions are involved?
• How many days to get a stable model?

Methods

• Data in CSV
• Used edgeR to look for differential expression
• Converted to WEKA format
  – Used Information Gain to select N attributes
  – Several Classifiers

Weka Analysis

• Predict whether Ti was subject to lead-30 given that we have T1 .... Ti-1

• Assumes T1 .... Ti-1 independent

Attribute Selection: choose highest info gain

Classifier: Naïve Bayes, Random Forests
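
Information gain measures how much knowing an attribute reduces the entropy of the class label. The following Python sketch computes it for attributes that are already discretized; this is an assumption made for brevity, and the helper names are illustrative rather than WEKA's API.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on `values`."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


exposed = [1, 1, 0, 0]
print(information_gain(["hi", "hi", "lo", "lo"], exposed))  # informative gene
print(information_gain(["hi", "lo", "hi", "lo"], exposed))  # uninformative gene
```

Ranking genes by this score and keeping the top N corresponds to the attribute selection step named on the slide.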

WEKA: Learning

• Predictive accuracy is 95%
• Mis-predictions at
  – Days 2, 4, 5
  – Other days always ok
• Easier task than lead/no-lead

edgeR: most significant genes

• IGF2: 3.75e-130
• FOXD3: 2.43e-117
• ITIH2: 4.72e-88
• CYP1B1: 1.87e-86
• ASIC4: 1.13e-83
• COLEC12: 2.88e-82
• GREB1: 1.74e-78
• SLC6A13: 3.96e-73
• PCSK2: 9.07e-72
• DSG2: 3.39e-70

Most Information Gain

• 0.914 4781 FLRT2
• 0.914 1556 PRRX1
• 0.914 9490 ARMCX2
• 0.914 7896 ASPN
• 0.833 9585 THEM4
• 0.833 9639 TAC1
• 0.833 8353 DCN
• 0.833 13181 RDH10
• 0.833 1394 GPR124
• 0.833 5819 ISLR
• 0.833 11146 TLE4
• 0.833 2147 ONECUT1
• 0.833 13318 SLFN5

[Figure: FLRT2 activity over time for CTRL, L−3 and L−30; x-axis: Time (0 to 25), y-axis: Activity (0 to 800)]

Gene Ontology

• Similar to GOStats
• Most Differential Functions
  – Used hypergeometric test
• Visualisation through graphviz
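
A hypergeometric enrichment test asks how surprising it is to see a given number of genes with a particular GO annotation among the selected genes. The sketch below computes the upper-tail probability with the standard library; the argument names and the example numbers are illustrative and not taken from the study.

```python
from math import comb


def hypergeom_pvalue(population, annotated, selected, hits):
    """P(X >= hits) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(hits, upper + 1))
    return tail / total


# 3 of 3 selected genes carry a term that only 3 of 10 genes have.
print(hypergeom_pvalue(population=10, annotated=3, selected=3, hits=3))
```

When many GO terms are tested, the resulting p-values would normally need a multiple-testing correction.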

GO Annotations (edgeR < 10^-10)

• translational initiation
• nuclear-transcribed mRNA catabolic process, nonsense-mediated decay
• translational elongation
• SRP-dependent cotranslational protein targeting to membrane
• viral infectious cycle
• translational termination
• viral transcription
• plasma membrane
• cytosolic large ribosomal subunit
• cytosolic small ribosomal subunit
• structural constituent of ribosome
• ribosome

GO Annotations (IG > 0.5)

• synaptic transmission
• cell adhesion
• G-protein coupled receptor signaling pathway
• axon guidance
• nervous system development
• positive regulation of transcription from RNA polymerase II promoter
• negative regulation of cell proliferation
• in utero embryonic development
• positive regulation of transcription, DNA-dependent
• multicellular organismal development

GO Graph (IG < 0.5)

[Figure: Time of Entry for Genes in Final Set at p-value < 1e-10; x-axis: Time (1 to 25), y-axis: 0 to 150]

[Figure: Genes in Final Set at p-value < 1e-10, legend: out, in; x-axis: Time (2 to 26), y-axis: 0 to 1400]

[Figure: Time of Entry for Genes in Final Set at Info Gain > 0.5; x-axis: Time (0 to 24), y-axis: 0 to 150]

[Figure: Genes in Final Set at Info Gain > 0.5, legend: out, in; x-axis: Time (5 to 25), y-axis: 0 to 5000]

Discussion

• Good: Good prediction
• Good: Gene Functions look sensible
• Bad: No clear breakpoint
• Interesting: GO vs edgeR

Machine Learning from 3D Data

3D Case: Review

• Same ML methodology but data generation was different

• Exposed tissue model now 3D
  – All 7 neural cell types
  – Larger so spatio-temporal issues
• 35 Toxins and 26 Controls, 2 replicates each
• Days 2, 7, 28
• 7 Compounds yield empty data on Day 28

Using Average of Replicates to Predict

Day:    2         7         28
AUC:    0.9016    0.8767    0.8118

3 of 4 Remaining Issues

1) Any other ways to further improve performance? Combining data from multiple days? Generating more data… leads to next issue

2) Learning curves… how does AUC vary with amount of data (number of compounds)?

3) Trying to reduce cost by running all samples through sequencer at once, reducing cost but also reducing reads

4) Predicting on Blinded Set of 10 compounds – to be done next and reported next month

1) Combining Predictions from Days

• Because 7 compounds yield empty data for Day 28, difficult to combine data over days
• Easier to combine predictions over days
  – Average probabilities of toxic from all three days; use the average of the Day 28 probabilities for the 7 compounds missing Day 28 data
  – Or just use Days 2 and 7
• Each approach increases AUC to 0.93 (as always, all results from cross-validation)
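
One way to read the averaging scheme above is sketched below in Python: each compound's per-day probabilities of being toxic are averaged, and a missing Day 28 value is replaced by the mean Day 28 probability of the compounds that have one. That reading of the imputation step, the data layout and the function names are all assumptions, and the numbers in the example are made up.

```python
def combine_days(probabilities):
    """probabilities: {compound: {day: P(toxic)}}; Day 28 may be missing."""
    day28 = [p[28] for p in probabilities.values() if 28 in p]
    day28_mean = sum(day28) / len(day28)
    return {compound: (p[2] + p[7] + p.get(28, day28_mean)) / 3
            for compound, p in probabilities.items()}


def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up probabilities for three hypothetical compounds; "c" lacks Day 28.
probs = {"a": {2: 0.9, 7: 0.8, 28: 0.7},
         "b": {2: 0.2, 7: 0.1, 28: 0.3},
         "c": {2: 0.6, 7: 0.6}}
combined = combine_days(probs)
print(auc([combined[c] for c in "abc"], [1, 0, 1]))
```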

2) Learning Curves

• Plot #compounds on x axis and AUC on y axis
  – If curve is flat at right, little value from more data
  – If curve is still increasing, more data should help
• Each point is average of 30 random samples of selected number of compounds
• Near far right, little variation possible in samples because using almost all data
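
The resampling procedure behind such a curve can be sketched as follows. The `evaluate` callback is hypothetical: it stands for whatever trains a model on the sampled compounds and returns its cross-validated AUC, which is not shown here.

```python
import random
from statistics import mean


def learning_curve(compounds, sizes, evaluate, repeats=30, seed=0):
    """For each sample size, average `evaluate` over random subsets."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        scores = [evaluate(rng.sample(compounds, n)) for _ in range(repeats)]
        curve.append((n, mean(scores)))
    return curve


# Stand-in evaluate function so the sketch runs; a real one would return AUC.
print(learning_curve(list(range(10)), [2, 5, 10], evaluate=len))
```

As the sample size approaches the full data set, the random subsets overlap almost completely, which is why little variation is possible at the far right of the curve.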

Day 2 Learning Curve

Day 7 Learning Curve

Day 28 Learning Curve

Message from Learning Curves

• Appears to be possibility for further data to improve accuracy

• Curves are flattening, so now in the region of diminishing returns in AUC from investment in data

3) Minimizing Lanes Per Compound

• No replicates, so compare to earlier results with Day 2 AUC of 0.83 and Day 7 of 0.81

• One lane used per compound
• Day 2: 0.78
• Day 7: 0.67

Summary and Next Step

• Our best method as judged by cross-validation combines output probabilities from both replicates and all days (or days 2 and 7)

• Next step: apply model from this method to the ten blinded compounds and measure accuracy

• More training data could provide some added accuracy