NGS Assembly and RNA-seq
Manpreet S. Katari
Outline
• Fastq
– File format widely used to provide sequence with a quality score for each base.
• Sequence assembly
– What coverage is acceptable?
• RNA-seq
– How to align the sequences?
– What questions can we ask using RNA-seq?
Fastq format
• Read Identifier
• Read Sequence
• Read Sequence Quality
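The three parts of a Fastq record listed above can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes the common four-line record layout (identifier, sequence, "+" separator, quality) and Phred+33 quality encoding, neither of which is stated on the slide. The names `FastqRecord` and `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str      # read identifier, without the leading "@"
    sequence: str        # read sequence
    quality: List[int]   # one Phred quality score per base


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line Fastq text (assumed layout, Phred+33 by default)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError("truncated Fastq record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed Fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical four-base read, for illustration only.
example = ["@read1\n", "ACGT\n", "+\n", "II5+\n"]
for record in parse_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

Each quality character maps to one base, so the quality list always has the same length as the sequence.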
Output of Bowtie Alignment (SAM)
Bowtie output (SAM)
1. HYYD8:00007:00087
2. 16
3. gb|CM000455
4. 1385117
5. 3
6. 29M1D9M1D9M2D21M2D18M1D70M
7. *
8. 0
9. 0
10. CAATGAGCTAACAACTGCAATGGGGCCATAATGGCTGCTTGTCGTTTGGCACGTACATGGACTAGCTTCCCCCGTGGCACAAAAATGGCTCTACGTTCTGTTACGAGCGCACCTACTGAAGGTCTCTCATAGGAGTGTATGTATATGCATATACAT
11. ;:=>>:333*33,33<<:7:3*344,444-449>>::4-6666B<EB>ABA@?;::44,4444<<4,4*555545-??670??==?<?@?>>>><7<<45-??>>?>>>??;<44444-5,:;;<776767-55?667?=@@888@AA@?<>;<55
12. AS:i:-58 XN:i:0 XM:i:4 XO:i:5 XG:i:7 NM:i:11 MD:Z:29^A9^T9^TG10C0T1G0A6^CC18^A70 YT:Z:UU XR:Z:@HYYD8%3A00007%3A00087%0AATGTATATGCATATACATACACTCCTATGAGAGACCTTCAGTAGGTGCGCTCGTAACAGAACGTAGAGCCATTTTTGTGCCACGGGGGAAGCTAGTCCATGTACGTGCCAAACGACAAGCAGCCATTATGGCCCCATTGCAGTTGTTAGCTCATTG%0A+%0A55<;><?@AA@888@@=?766?55-767677<;;%3A,5-44444<;??>>>?>>??-54<<7<>>>>?@?<?==??076??-545555*4,4<<4444,44%3A%3A;?@ABA>BE<B6666-4%3A%3A>>944-444,443*3%3A7%3A<<33,33*333%3A>>=%3A;%0A
CIGAR string
29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M
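The CIGAR string above can be split into (length, operation) pairs to see how many read bases and reference bases the alignment spans. The sketch below assumes the standard SAM operation letters (M, I, D, N, S, H, P, =, X). The function names `parse_cigar` and `cigar_lengths` are hypothetical.

```python
import re
from typing import List, Tuple

# One CIGAR element: a run length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing characters the pattern did not consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(ops: List[Tuple[int, str]]) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    read_len = sum(n for n, op in ops if op in "MIS=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return read_len, ref_len


ops = parse_cigar("29M1D9M1D9M2D21M2D18M1D70M")
read_len, ref_len = cigar_lengths(ops)
print(ops)
print(read_len, ref_len)
```

Here M consumes both read and reference, while D consumes only the reference, so the deletions make the aligned reference span longer than the read.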
Genome Assembly & Annotation
Whole-genome shotgun sequencing summary
Schatz et al., Genome Research 2010, Assembly of large genomes using second-generation sequencing
Comparison of overlap graph and de Bruijn graph for assembly
Using “mate-pair” reads to connect contigs
Finishing I
• Process of assembling raw sequence reads into accurate contiguous sequence
– Required to achieve an error rate of at most 1 in 10,000 bases
• Manual process
– Look at sequence reads at positions where programs can’t tell which base is the correct one
– Fill gaps
– Ensure adequate coverage
[Figure: assembled reads with regions labeled "Gap" and "Single stranded"]
Finishing II
• To fill gaps in sequence, design primers and sequence from the primer
• To ensure adequate coverage, find regions where coverage is insufficient and use specific primers for those areas
[Figure: two primers flanking a region labeled "GAP"]
Each nucleotide is sequenced many times
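A rough way to quantify "many times" is the mean coverage: total sequenced bases divided by genome size. The sketch below also estimates the fraction of bases left uncovered under an idealized assumption that is not on the slide, namely that reads land uniformly at random so that per-base depth is approximately Poisson. The function names and the example numbers are hypothetical.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: (reads * length) / genome size."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Probability that a base has depth 0 under a Poisson model with the given mean."""
    return math.exp(-coverage)


# Hypothetical run: one million 100 bp reads on a 5 Mb genome.
c = mean_coverage(num_reads=1_000_000, read_length=100, genome_size=5_000_000)
print(c)
print(expected_uncovered_fraction(c))
```

Real libraries deviate from the uniform model (GC bias, repeats), so this is a lower bound on how much sequencing is needed rather than a guarantee.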
Assembly Progression (Macro View)
Transcriptomics using RNA-seq
Questions that can be addressed with genome-wide expression analysis:
• What genes have similar function?
• What regulatory pathways exist?
• Can we subdivide experiments or genes into meaningful classes?
• Can we correctly classify an unknown experiment or gene into a known class?
• Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?
RNA-seq provides even more
Candidate new and revised exons
Bowtie & TopHat
Normalizing the Data
• RPKM (Reads Per Kilobase of exon per Million mapped reads)
$\text{Score} = \dfrac{R}{N \times T}$
R = number of unique reads for the gene
N = size of the gene in kilobases (sum of exon lengths / 1000)
T = total number of reads in the library mapped to the genome / 1,000,000
Recent studies suggest that it is not necessary to control for the size of genes, so most analyses control only for T.
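The RPKM score defined above is a one-line calculation once R, N, and T are in the right units. This is a minimal sketch. The function name `rpkm` and the example counts are hypothetical.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Score = R / (N * T), with N in kilobases and T in millions of mapped reads."""
    n = exon_length_bp / 1000            # N: gene size in kilobases
    t = total_mapped_reads / 1_000_000   # T: library size in millions of reads
    return read_count / (n * t)


# Hypothetical gene: 500 reads, 2,000 bp of exons, 20 million mapped reads.
print(rpkm(read_count=500, exon_length_bp=2_000, total_mapped_reads=20_000_000))
```

Controlling only for T, as the slide describes, corresponds to dropping the division by `n`.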
Test using a negative binomial model [glm.nb()]
[Figure: two panels comparing counts in conditions x and y; left panel p-value = 0.258, right panel p-value = 1.03e-05]
Volcano plot: fold-change vs. significance
[Figure: volcano plot; x-axis: log ratio; y-axis: -log(p-value); marked significance levels p = 10^-2, p = 10^-3, p = 10^-18]
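Each point in a volcano plot has two coordinates: a log ratio of expression between conditions and the negative log of its p-value. The sketch below computes both for one gene. The slide does not give the log bases or any zero-count handling, so base 2 for the ratio, base 10 for the p-value, and a pseudocount of 1 are assumptions. The function name `volcano_coordinates` is hypothetical.

```python
import math
from typing import Tuple


def volcano_coordinates(mean_x: float, mean_y: float, p_value: float,
                        pseudocount: float = 1.0) -> Tuple[float, float]:
    """Return (log2 ratio of y over x, -log10 p-value) for one gene."""
    # The pseudocount (an assumption) avoids division by zero for unexpressed genes.
    log_ratio = math.log2((mean_y + pseudocount) / (mean_x + pseudocount))
    significance = -math.log10(p_value)
    return log_ratio, significance


# Hypothetical gene: mean counts of 99 and 399, p-value of 0.001.
print(volcano_coordinates(mean_x=99, mean_y=399, p_value=1e-3))
```

Genes of interest sit in the upper left and upper right corners: large absolute log ratio and small p-value.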
Clustering (genes)
Genes are p points in a T-dimensional space (p = # of genes, T = # of conditions). Clustering groups them based on proximity of the points:
• Extract a few typical expression patterns (cluster centroids)
• Partition genes based on their profile similarity (clusters, memberships)
Genes with similar expression profiles are likely to have common or related functions, and possibly to be co-regulated.
[Figure: genes plotted as points in expression space for T = 3]
Similarly, conditions can be classified into different groups based on similarities in their expression profiles (all or subsets of genes).
Hierarchical Clustering
This example illustrates single-linkage clustering in Euclidean space on 6 points.
• Find the pair(s) with the highest pairwise similarity
• Join these as a group and calculate an “average” profile (single, average, or complete linkage)
• Iteratively join groups until all are linked
The UPGMA method of phylogenetic reconstruction uses average linking …
[Figure: six points A to F in the plane and the resulting dendrogram with leaves A B C D E F]
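The three steps above can be sketched as a small agglomerative loop. This version uses single linkage (the distance between two groups is the distance between their closest members) with Euclidean distance, matching the example named on the slide. The coordinates of the six points are not given in the slides, so the values below are hypothetical, as is the function name `single_linkage`.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Cluster = Tuple[str, ...]


def single_linkage(points: Dict[str, Sequence[float]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Merge the two closest clusters repeatedly; return the merges in order."""
    clusters: List[Cluster] = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(math.dist(points[a], points[b]) for a in c1 for b in c2)
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
        merges.append((c1, c2, d))
    return merges


# Hypothetical coordinates for six points labeled A to F.
points = {"A": (0, 0), "B": (1, 0), "C": (5, 0),
          "D": (6.5, 0), "E": (12, 0), "F": (15, 0)}
for left, right, distance in single_linkage(points):
    print(left, right, distance)
```

Replacing `min` with `max` gives complete linkage, and replacing it with the mean of all pairwise distances gives average linkage, the rule UPGMA uses. The quadratic search per merge is fine for a handful of points but too slow for thousands of genes.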
End Result
Genes are grouped according to similarities in their expression levels across a variety of conditions.
[Figure: heat map with axes labeled "Conditions" and "Genes (clustered by similarity in expression profiles)"]
• Place genes with similar expression profiles into clusters.
• Similarity is defined by Pearson correlation.
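Pearson correlation compares the shape of two expression profiles rather than their absolute levels. A minimal sketch follows. The function name `pearson` and the example profiles are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


# Hypothetical expression profiles across four conditions.
gene_a = [1, 2, 3, 4]
gene_b = [10, 20, 30, 40]   # same shape as gene_a, ten times the level
gene_c = [4, 3, 2, 1]       # opposite trend
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

For clustering, a common choice is to turn the correlation r into a distance such as 1 - r, so that genes with the same profile shape end up close together regardless of expression level.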
Gene Set Enrichment
• Often, when we characterize a list of genes, we use statistics to show that a property or annotation is significantly over-represented compared with what would be expected if the list had been created randomly.
• Two of the common statistical methods are:
– Hypergeometric test
– Fisher’s exact test
Gene Ontology
• “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.”
• “The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.”
– “A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions.”
GO is a directed acyclic graph
Hypergeometric Test
• The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.
• Think of an urn with two types of marbles, blue and red, where blue is success and red is failure. Stand next to the urn with your eyes closed and select 10 marbles without replacement. What is the probability that 4 of the 10 will be blue?
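The urn question can be answered directly from the hypergeometric formula. The slide does not say how many marbles of each color are in the urn, so the sketch below assumes 20 blue and 30 red purely for illustration. The function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(exactly k successes in `draws` draws without replacement)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(k or more successes), the quantity used as an over-representation p-value."""
    upper = min(successes, draws)
    return sum(hypergeom_pmf(i, population, successes, draws)
               for i in range(k, upper + 1))


# Assumed urn: 20 blue (success) and 30 red (failure) marbles, 10 drawn.
print(hypergeom_pmf(k=4, population=50, successes=20, draws=10))
print(hypergeom_upper_tail(k=4, population=50, successes=20, draws=10))
```

For gene set enrichment the same functions apply with the population as all annotated genes, successes as the genes carrying a given annotation, draws as the size of the gene list, and k as the number of genes in the list with that annotation.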