Katari NGS assembly rna seq - CGIARhpc.ilri.cgiar.org/.../Katari_NGS_assembly_rna_seq.pdf · Katari_NGS_assembly_rna_seq Author: Manpreet Katari Created Date: 8/28/2015 8:38:40 AM

NGS Assembly and RNA-seq

Manpreet S. Katari

Outline

• Fastq – File format widely used to provide sequence with a quality score for each base.

• Sequence assembly
– What coverage is acceptable?

• RNA-seq
– How to align the sequences?
– What questions can we ask using RNA-seq?

Fastq format

• Read Identifier
• Read Sequence
• Read Sequence Quality
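The three parts of a Fastq record named above can be pulled apart with a few lines of Python. This is a minimal sketch, assuming the common four-line record layout (identifier, sequence, `+` separator, quality) and Phred+33 quality encoding; the function name `parse_fastq` and the sample read are hypothetical and not taken from the slides.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str      # text after the leading "@"
    sequence: str        # the base calls
    quality: List[int]   # one Phred score per base


def parse_fastq(text: str, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from Fastq text, assuming four lines per record."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33, so the score is the ASCII code minus 33.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# Hypothetical record, for illustration only.
example = "@read1\nACGT\n+\nII5!\n"
for record in parse_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

Older Illumina pipelines used a Phred+64 offset, which is why the offset is left as a parameter here.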

Output of Bowtie Alignment (SAM)

Bowtie output (SAM)

1. HYYD8:00007:00087
2. 16
3. gb|CM000455
4. 1385117
5. 3
6. 29M1D9M1D9M2D21M2D18M1D70M
7. *
8. 0
9. 0
10. CAATGAGCTAACAACTGCAATGGGGCCATAATGGCTGCTTGTCGTTTGGCACGTACATGGACTAGCTTCCCCCGTGGCACAAAAATGGCTCTACGTTCTGTTACGAGCGCACCTACTGAAGGTCTCTCATAGGAGTGTATGTATATGCATATACAT
11. ;:=>>:333*33,33<<:7:3*344,444-449>>::4-6666B<EB>ABA@?;::44,4444<<4,4*555545-??670??==?<?@?>>>><7<<45-??>>?>>>??;<44444-5,:;;<776767-55?667?=@@888@AA@?<>;<55
12. AS:i:-58 XN:i:0 XM:i:4 XO:i:5 XG:i:7 NM:i:11 MD:Z:29^A9^T9^TG10C0T1G0A6^CC18^A70 YT:Z:UU XR:Z:@HYYD8%3A00007%3A00087%0AATGTATATGCATATACATACACTCCTATGAGAGACCTTCAGTAGGTGCGCTCGTAACAGAACGTAGAGCCATTTTTGTGCCACGGGGGAAGCTAGTCCATGTACGTGCCAAACGACAAGCAGCCATTATGGCCCCATTGCAGTTGTTAGCTCATTG%0A+%0A55<;><?@AA@888@@=?766?55-767677<;;%3A,5-44444<;??>>>?>>??-54<<7<>>>>?@?<?==??076??-545555*4,4<<4444,44%3A%3A;?@ABA>BE<B6666-4%3A%3A>>944-444,443*3%3A7%3A<<33,33*333%3A>>=%3A;%0A

CIGAR string

29M 1D 9M 1D 9M 2D 21M 2D 18M 1D 70M
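The CIGAR string above can be read as a list of (length, operation) pairs. The sketch below is one way to parse it and to count how many read bases and reference bases the alignment spans; the helper names `parse_cigar` and `cigar_lengths` are hypothetical, and the set of operations follows the standard SAM letters (M, I, D, N, S, H, P, =, X).

```python
import re
from typing import List, Tuple

# One CIGAR element: a run length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything other than valid elements.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return read_len, ref_len


cigar = "29M1D9M1D9M2D21M2D18M1D70M"  # the alignment shown above
print(parse_cigar(cigar))
print(cigar_lengths(cigar))  # (156, 163)
```

For this read, the 156 aligned bases cover 163 reference positions because five deletions remove 7 reference bases in total, which agrees with the XO:i:5 and XG:i:7 tags in the SAM record.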




Genome Assembly & Annotation

Whole-genome shotgun sequencing summary

Schatz et al. Genome Research 2010, Analysis of large genomes

Comparison of overlap graph and de Bruijn graph for assembly
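For the de Bruijn side of this comparison, the sketch below builds a small graph in which every distinct k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is an illustration only: the function name, the toy reads, and the choice of k are assumptions, and real assemblers also handle sequencing errors, reverse complements, and repeats.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge: prefix of the k-mer -> suffix of the k-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical overlapping reads, for illustration only.
reads = ["ACGT", "CGTA"]
print(de_bruijn_graph(reads, k=3))
```

An overlap graph would instead use whole reads as nodes and connect reads whose ends overlap; the de Bruijn construction avoids the all-against-all read comparison by working with k-mers.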

Using “mate-pair” reads to connect contigs

Finishing I

• Process of assembling raw sequence reads into accurate contiguous sequence
– Required to achieve 1/10,000 accuracy
• Manual process
– Look at sequence reads at positions where programs can’t tell which base is the correct one
– Fill gaps
– Ensure adequate coverage

Figure labels: Gap; Single stranded

Finishing II

• To fill gaps in sequence, design primers and sequence from primer

• To ensure adequate coverage, find regions where there is not sufficient coverage and use specific primers for those areas

Figure labels: GAP; Primer; Primer

Each nucleotide is sequenced many times
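How many times each nucleotide is sequenced on average can be estimated from the number of reads, the read length, and the genome size. The sketch below computes that average coverage and, under the simplifying assumption that reads land uniformly at random (a Poisson, Lander-Waterman style model), the expected fraction of bases that no read covers. All numbers are hypothetical.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: N * L / G."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads, assuming Poisson sampling."""
    return math.exp(-coverage)


# Hypothetical run: 1 million reads of 100 bp over a 10 Mb genome.
c = expected_coverage(1_000_000, 100, 10_000_000)
print(c)                      # 10.0
print(fraction_uncovered(c))  # roughly 4.5e-05
```

Real libraries are not perfectly uniform, so the uncovered fraction in practice tends to be larger than this estimate.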

Assembly Progression (Macro View)




Transcriptomics using RNA-seq

Questions that can be addressed with genome-wide expression analysis:

• What genes have similar function?
• What regulatory pathways exist?
• Can we subdivide experiments or genes into meaningful classes?

• Can we correctly classify an unknown experiment or gene into a known class?

• Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?

RNA-seq provides even more

Candidate new and revised exons

Bowtie & TopHat

Normalizing the Data

• RPKM (Reads per Kilobase of exons per million reads)

$$\text{Score} = \frac{R}{N \times T}$$

R = # of unique reads for the gene
N = Size of the gene (sum of exons / 1000)
T = total number of reads in the library mapped to the genome / 1,000,000

Recent studies show that it is not necessary to control for the size of genes, so most analyses only control for T.
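Using the definitions of R, N, and T given above, the score is R divided by the product of N and T. The sketch below computes it, together with the variant that controls only for library size T; the function names and the example counts are hypothetical.

```python
def rpkm(unique_reads: int, exon_length_bp: int, mapped_reads: int) -> float:
    """Reads per kilobase of exons per million mapped reads: R / (N * T)."""
    n = exon_length_bp / 1000       # N: gene size in kilobases
    t = mapped_reads / 1_000_000    # T: mapped reads in millions
    return unique_reads / (n * t)


def reads_per_million(unique_reads: int, mapped_reads: int) -> float:
    """Normalize for library size only: R / T."""
    return unique_reads / (mapped_reads / 1_000_000)


# Hypothetical gene: 500 reads, 2,000 bp of exons, 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))           # 12.5
print(reads_per_million(500, 20_000_000))    # 25.0
```

When the same gene is compared between two libraries, N is identical in both and cancels out of the ratio, which is one reason controlling only for T can be enough for differential expression.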

Test using a negative binomial model [glm.nb()]

Figure: two example comparisons of x and y, one with p-value = 0.258 and one with p-value = 1.03e-05 (vertical axis from 0 to 1000).

Volcano plot: fold-change vs. significance

Figure axes: -log (p-value) and log ratio. Marked p-values: p = 10^-2, p = 10^-3, p = 10^-18.
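Each gene becomes one point on a volcano plot: the log ratio of expression between two conditions on one axis and the negative log of the p-value on the other. The sketch below computes those two coordinates. The slide does not state the log bases, so base 2 for the ratio and base 10 for the p-value are assumptions, as are the function name and example values.

```python
import math
from typing import Tuple


def volcano_point(mean_a: float, mean_b: float, p_value: float) -> Tuple[float, float]:
    """Return (log2 fold change of b over a, -log10 of the p-value)."""
    log_ratio = math.log2(mean_b / mean_a)
    significance = -math.log10(p_value)
    return log_ratio, significance


# Hypothetical gene: expression rises from 100 to 400 with p = 1e-3.
x, y = volcano_point(100.0, 400.0, 1e-3)
print(x, y)
```

Genes far from zero on the ratio axis and high on the significance axis are the ones that change both strongly and reliably.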

Clustering (genes)

p points in a T-dimensional space (p = # of genes, T = # of conditions), grouped based on proximity of the points:

• Extract a few typical expression patterns (cluster centroids)
• Partition genes based on their profile similarity (clusters, memberships)

Genes with similar expression profiles are likely to have common or related functions, and possibly to be co-regulated.

Figure label: T = 3

Similarly, conditions can be classified into different groups based on similarities in their expression profiles (all or subsets of genes).

Hierarchical Clustering

This example illustrates single-linkage clustering in Euclidean space on 6 points.

• Find the pair(s) with the highest pairwise similarity
• Join these as a group and calculate an “average” profile (single, average, or complete linkage)
• Iteratively join groups until all are linked

The UPGMA method of phylogenetic reconstruction uses average linkage …

Figure labels: A, B, C, D, E, F (points and tree leaves)
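The iterative joining described above can be sketched in plain Python. This version uses single linkage with Euclidean distance, so "highest similarity" is taken to mean "smallest distance between the closest members of two groups". The point names and coordinates are hypothetical, and the quadratic search is written for clarity rather than speed.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Cluster = Tuple[str, ...]


def single_linkage(points: Dict[str, Tuple[float, ...]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Return the merges (group a, group b, distance) in the order they occur."""
    clusters: List[Cluster] = [(name,) for name in points]
    merges: List[Tuple[Cluster, Cluster, float]] = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(math.dist(points[x], points[y]) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two joined groups with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical points on a line, for illustration only.
pts = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (4.0, 0.0), "D": (6.0, 0.0)}
for a, b, d in single_linkage(pts):
    print(a, b, d)
```

Swapping `min` for `max` in the distance line would give complete linkage, and averaging over all member pairs would give the average linkage used by UPGMA.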

End Result

Genes are grouped according to similarities in their expression levels across a variety of conditions.

Figure axes: Conditions; Genes (clustered by similarity in expression profiles)

• Place genes with similar expression profiles into clusters.

• Similarity is defined by Pearson correlation.
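As an illustration of this similarity measure, the sketch below computes the Pearson correlation between two expression profiles, each a list of expression values across the same conditions. The profiles are hypothetical, and the function assumes neither profile is constant (a constant profile would give a zero denominator).

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical genes measured across four conditions.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [10.0, 20.0, 30.0, 40.0]   # same shape, different scale
gene_3 = [4.0, 3.0, 2.0, 1.0]       # opposite trend
print(pearson(gene_1, gene_2), pearson(gene_1, gene_3))
```

Because the correlation ignores scale, two genes with the same pattern at very different expression levels still count as similar, which is often what is wanted when looking for co-regulated genes.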

Gene Set Enrichment

• Often when we characterize a list of genes, we use statistics to show that a property or annotation is significantly over-represented compared with a list created at random.

• Two of the common statistical methods are:
– Hypergeometric Test
– Fisher’s exact test

Gene Ontology

• “The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.”

• “The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.”
– “A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions.”

GO is a directed acyclic graph

Hypergeometric Test

• The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.

• Think of an urn with two types of marbles, blue and red, where blue is success and red is failure. Stand next to the urn with your eyes closed and select 10 marbles without replacement. What is the probability that 4 of the 10 will be blue?
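The answer to the urn question depends on how many blue and red marbles the urn holds, which the example leaves open. The sketch below assumes a hypothetical urn of 20 blue and 30 red marbles and computes the probability of exactly 4 blue in 10 draws, plus the "at least k" tail that is typically used as an enrichment p-value. The function names are illustrative.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(exactly k successes in `draws` draws without replacement)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


def enrichment_p_value(k: int, population: int, successes: int, draws: int) -> float:
    """P(at least k successes): the upper tail of the distribution."""
    upper = min(successes, draws)
    return sum(hypergeom_pmf(i, population, successes, draws)
               for i in range(k, upper + 1))


# Assumed urn: 20 blue (success) + 30 red (failure) = 50 marbles, 10 drawn.
print(hypergeom_pmf(4, population=50, successes=20, draws=10))       # about 0.28
print(enrichment_p_value(4, population=50, successes=20, draws=10))
```

In gene set enrichment, the population is all annotated genes, the successes are genes carrying a given annotation, and the draws are the genes in the list being characterized.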
