Practicum Presentation PDF

Predicting lncRNA Transcripts Out of Comprehensive Rat Renal Cell type-specific Transcriptome Libraries

Gui Chen 11/20/2015

WHY LONG NON CODING RNA?

➤ Many long non-coding transcripts (lncRNAs) function in a variety of responses, including differentiation, cell cycle, and maintenance of stem-cell-like phenotypes, and are cell-type specific in their expression. Yet very little is known about their regulation or their roles in disease states.

➤ A newly established rat renal gene expression database and the recently assembled rn6 genome sequence have paved the way for us to conduct such a study.

WHAT EXACTLY IS THE DATA SOURCE?

➤ 110 (renal tubule segments) + 5 (glomeruli) renal cell-type-specific gene expression profiles, produced by the work described in the paper shown at left.

➤ 7 polyadenylated mRNA-seq (PA-seq) & cortical collecting duct (4 control rats and 4 water-loaded rats)

➤ 125 libraries in total

WHAT IS THE FORMAT OF THE DATA?

➤ The original transcript data are stored in GTF format, a flat tab-delimited file format that can be loaded directly into Excel.

➤ Next is a real example of what GTF records look like.
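To make the tab-delimited layout concrete, here is a minimal Python sketch of reading one GTF line into its nine standard columns. The sample record, its IDs, and the function name `parse_gtf_line` are invented for illustration and are not taken from the rat libraries.

```python
# Minimal sketch of parsing one GTF line (nine tab-separated columns).
# The sample record below is hypothetical, not data from the libraries.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Split one GTF line into a dict; the attribute column becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    # Attributes look like: key "value"; key "value";
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Hypothetical record for illustration only.
example = ('chr1\texample_source\texon\t1000\t1500\t.\t+\t.\t'
           'gene_id "GENE.1"; transcript_id "GENE.1.1"; FPKM "2.5";')
rec = parse_gtf_line(example)
print(rec["seqname"], rec["start"], rec["end"], rec["attribute"]["transcript_id"])
```

Keeping the attribute column as a dictionary makes it easy to look up fields such as a transcript ID later in the pipeline.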

GTF FILE EXAMPLE

How can we pick out the transcripts that are potentially long non-coding RNAs from among thousands of transcripts?

1. What are the characteristics of lncRNAs, based on preliminary data and experience?

➤ Less conserved than protein-coding genes. (PhyloCSF)

➤ A much shorter ORF (open reading frame) than that of protein-coding genes. (They do not necessarily have an ORF; if they do, is it a short one that arose by chance, or were they originally genes?)

➤ When forcibly translated into protein, they have no counterpart in the nr database (non-redundant protein database). (BLASTX)

➤ They are consistently and significantly expressed in at least one cell type.

2. Extract records satisfying all the characteristics above.

A pipeline is established based on this idea.

Theoretically the pipeline works like this…

➤ The biggest circle represents the whole search space.

➤ The small rectangles inside the big circle represent subsets of records in the search space, each satisfying a certain lncRNA characteristic.

➤ The intersection of all the small rectangles represents the predicted set of lncRNA transcripts.

[Diagram labels: all the transcripts; less conserved ones; no counterpart in nr database; short ORF; true positive expression; predicted lncRNAs]
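The intersection idea can be sketched with Python sets. This is only an illustration of the logic, not the pipeline itself. The transcript IDs and the function name `predict_lncrnas` are hypothetical.

```python
# Sketch of the intersection step: a transcript is predicted as lncRNA
# only if it appears in every criterion's subset. All IDs are hypothetical.
def predict_lncrnas(all_transcripts, *criteria_sets):
    """Return the transcripts that belong to every criterion subset."""
    predicted = set(all_transcripts)
    for subset in criteria_sets:
        predicted &= set(subset)
    return predicted


all_transcripts = {"T1", "T2", "T3", "T4", "T5", "T6"}
less_conserved = {"T1", "T2", "T3", "T5"}
no_nr_counterpart = {"T1", "T3", "T4", "T5"}
short_orf = {"T1", "T3", "T5", "T6"}
truly_expressed = {"T1", "T2", "T5"}

predicted = predict_lncrnas(all_transcripts, less_conserved,
                            no_nr_counterpart, short_orf, truly_expressed)
print(sorted(predicted))
```

Because set intersection is commutative, the filters can be applied in any order. Running the cheapest filter first simply reduces the work left for the more expensive ones.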

What do we get at each step? (taking multi-exon transcripts as examples)

➤ Find transcripts with a short ORF (length < 150)

Because each record in the FASTA file occupies two lines, a file with n lines actually contains n/2 records.
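A minimal Python sketch of this step follows, under several assumptions the slides do not state: an ORF is taken to run from an ATG to the first in-frame stop codon on the forward strand, its length is counted in nucleotides including the stop codon, and the FASTA file has exactly one header line and one sequence line per record. The sequences and function names are hypothetical.

```python
# Sketch of the short-ORF filter. Assumptions (not from the slides):
# forward strand only, ORF = ATG .. first in-frame stop, length in nucleotides.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG-to-stop ORF in three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def read_two_line_fasta(lines):
    """Read a FASTA where each record is one header line plus one sequence line."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of lines (header + sequence)")
    return {lines[i][1:]: lines[i + 1] for i in range(0, len(lines), 2)}


# Hypothetical sequences for illustration only.
fasta_lines = [">tx_short", "ATGAAATAG",
               ">tx_long", "ATG" + "AAA" * 50 + "TAG"]
records = read_two_line_fasta(fasta_lines)
short_orf_ids = [tid for tid, seq in records.items()
                 if longest_orf_length(seq) < 150]
print(len(fasta_lines), "lines ->", len(records), "records")
print(short_orf_ids)
```

If the threshold on the slide is meant in amino acids rather than nucleotides, the comparison would need to divide the ORF length by three first.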

➤ Find transcripts with no counterpart in the nr database (E-value threshold > 10E-4)
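One way to apply this filter is sketched below in Python, assuming the BLASTX results were saved in BLAST's 12-column tabular format (`-outfmt 6`), which the slides do not state. A transcript is kept when none of its hits has an E-value at or below the threshold. The hit lines, IDs, and function name are hypothetical.

```python
# Sketch of the "no counterpart in nr" filter over BLAST tabular output
# (-outfmt 6: qseqid sseqid pident length mismatch gapopen qstart qend
#  sstart send evalue bitscore). The hit lines below are hypothetical.

# The slide writes the threshold as "10E-4"; it is passed through literally
# here. If 10^-4 was intended, use 1e-4 instead.
EVALUE_THRESHOLD = 10e-4


def transcripts_without_counterpart(transcript_ids, blast_tab_lines,
                                    evalue_threshold=EVALUE_THRESHOLD):
    """Return transcripts that have no hit with E-value <= the threshold."""
    with_hit = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, evalue = fields[0], float(fields[10])
        if evalue <= evalue_threshold:
            with_hit.add(qseqid)
    return [t for t in transcript_ids if t not in with_hit]


blast_lines = [
    "tx1\tprotA\t98.0\t120\t2\t0\t1\t360\t1\t120\t1e-30\t250",
    "tx2\tprotB\t30.0\t40\t28\t0\t10\t129\t5\t44\t0.5\t28.1",
]
no_counterpart = transcripts_without_counterpart(["tx1", "tx2", "tx3"], blast_lines)
print(no_counterpart)
```

Transcripts that never appear in the BLASTX output are kept as well, since having no hit at all also means having no counterpart.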

➤ Find transcripts that are consistently and significantly expressed in all replicates of at least one cell type (FPKM > 0.1)
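The expression criterion can be sketched in Python as follows, assuming the FPKM values have already been grouped by cell type and replicate. The cell-type names, values, and function name are hypothetical.

```python
# Sketch of the expression filter: keep a transcript if, in at least one
# cell type, every replicate has FPKM above the threshold.
# Cell-type names and FPKM values below are hypothetical.
def expressed_in_some_cell_type(fpkm_by_cell_type, threshold=0.1):
    """True if all replicates of at least one cell type exceed the threshold."""
    return any(
        len(replicates) > 0 and all(value > threshold for value in replicates)
        for replicates in fpkm_by_cell_type.values()
    )


fpkm_table = {
    "tx1": {"cell_type_A": [0.5, 0.8, 0.3], "cell_type_B": [0.0, 0.0, 0.2]},
    "tx2": {"cell_type_A": [0.5, 0.0, 0.3], "cell_type_B": [0.05, 0.2, 0.2]},
}
expressed = [tid for tid, table in fpkm_table.items()
             if expressed_in_some_cell_type(table)]
print(expressed)
```

Requiring every replicate to pass, rather than the mean, is what makes the expression "consistent": a single high replicate cannot rescue a transcript.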

Classification of lncRNAs

➤ sense and antisense lncRNAs

➤ sense lncRNAs can be classified into intergenic, cons, incs, ponds lncRNAs

RESULT

THANK YOU & Happy Thanksgiving!
