Frequency-based Splice Site Predictor and A Tool to compare Splice site Predictors

Nihar Sheth* nisheth@indiana.edu

*Work done while doing internship at Cold Spring Harbor Laboratory, NY.

Overview

The broad goal is to better understand the splice-site selection mechanism.

A software pipeline for data extraction

Frequency-based splice site predictor

Benchmarking gene test sets

A tool to compare splice site predictors

Introduction to Splice Site

Figure: Central dogma of molecular biology. Gene (Exon, Intron, Exon, Intron, Exon) → mRNA → Protein.

Introduction to Splice Site

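Figure: exons (E) and introns (I) drawn 5’ to 3’. At each splice junction the intron begins with dinucleotide XX at the donor (5’) SS and ends with dinucleotide YY at the acceptor (3’) SS.

Class        XX    YY
U2_GT_AG     GT    AG
U2_GC_AG     GC    AG
U12_GT_AG    GT    AG
U12_AT_AC    AT    AC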

Introduction to Splice Site

Alternative splicing is described as a regulated process of differential inclusion or exclusion of regions of the pre-mRNA.

Figure: three alternative splicing patterns of the pre-mRNA E-1 I-1 E-2 I-2 E-3.

Exon skipping: E-1 E-3

Alternative 5’ SS: E-1 E-2 E-3

Alternative 3’ SS: E-1 E-2 E-3

Data Extraction

Purpose: prepare a clean, annotated splice-site (SS) dataset that can be used for various SS problems.

Used the NCBI human RefSeq dataset.

Developed a software pipeline.

Data Extraction

• Used .gbs and .fa files for each human chromosome.

• Separated NM_ and XM_ accessions for data extraction.

• Annotated intron-less genes.

• Identified and annotated alternative splicing cases.

Data Extraction

LOCUS       NT_004350 534457 bp
DEFINITION  Homo sapiens chromosome 1 genomic contig.
ACCESSION   NT_004350 ← contig accession
VERSION     NT_004350.16 GI:37547298
mRNA        join(292536..292635,409464..409813,467426..467476,
            608491..608629,619830..619932,626130..626337,
            ……..
            655305..655479,657016..657632) ← exon coordinates
            /gene="PRDM16" ← gene symbol
            /product="PR domain containing 16" ← gene description
            /note="unclassified transcription discrepancy; Derived by automated computational analysis using gene prediction method: BestRefseq. Supporting evidence includes similarity to: 1 mRNA"
            /transcript_id="NM_022114.1" ← gene accession
            /db_xref="GI:11545830"
            /db_xref="GeneID:63976"
            /db_xref="LocusID:63976" ← locuslink id
            /db_xref="MIM:605557"
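The exon coordinates in the mRNA feature are what the splice sites are derived from. The following is a minimal sketch, not the pipeline's actual code: it assumes a plus-strand gene and a plain `join(...)` string, and the function names `parse_join` and `introns_from_exons` are hypothetical.

```python
import re

def parse_join(location: str) -> list[tuple[int, int]]:
    """Return (start, end) exon coordinates from a GenBank join(...) string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def introns_from_exons(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Assumes plus strand: an intron runs from one exon's end + 1
    to the next exon's start - 1."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]

# First three exons of the record shown on the slide.
location = "join(292536..292635,409464..409813,467426..467476)"
exons = parse_join(location)
introns = introns_from_exons(exons)
print(exons)
print(introns)
```

Under these assumptions the first intron is flanked by exon positions 292635 and 409464, which are the junction coordinates that appear in the Splice-1 records.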

Data Extraction

• Each SS record (contig accession, gene name, gene accession, chromosome no., splice-number, splice-junction, splice-start, splice-end, gene-orientation, splice-type) is stored together with its sequence in a FASTA file.

>NT_004321.15 PRDM16 NM_022114.1 1 Splice-1 409464 409455 409467 + 3ss
AGCCTCCTTTAGGTGA
>NT_004321.15 PRDM16 NM_022114.1 1 Splice-1 292635 292632 292643 + 5ss
AAAAGTAAGTC
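A record in this layout can be read back with a few lines of code. This is only a sketch: it assumes ten whitespace-separated header fields in the order listed on the slide and a single-line sequence per record, and `SpliceRecord` and `parse_records` are hypothetical names.

```python
from typing import NamedTuple

class SpliceRecord(NamedTuple):
    contig: str
    gene: str
    accession: str
    chromosome: str
    splice_number: str
    junction: int
    start: int
    end: int
    orientation: str
    splice_type: str
    sequence: str

def parse_records(text: str) -> list[SpliceRecord]:
    """Parse '>header' / sequence line pairs into SpliceRecord tuples."""
    records = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
        elif header is not None:
            c, g, a, ch, n, j, s, e, o, t = header  # ten fields assumed
            records.append(
                SpliceRecord(c, g, a, ch, n, int(j), int(s), int(e), o, t, line))
            header = None
    return records

example = """>NT_004321.15 PRDM16 NM_022114.1 1 Splice-1 409464 409455 409467 + 3ss
AGCCTCCTTTAGGTGA
>NT_004321.15 PRDM16 NM_022114.1 1 Splice-1 292635 292632 292643 + 5ss
AAAAGTAAGTC
"""
records = parse_records(example)
for r in records:
    print(r.gene, r.splice_type, r.junction, r.sequence)
```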

Data Extraction

The data is available at http://katahdin.cshl.org:9331/nihar/capstone/

Data Extraction

Type                   NM_ Accessions   XM_ Accessions   Total
Genes in set           15387            4876             20263
Intron-less genes      1213             964              2177
Splice site pairs      162046           30498            192544
*Non-redundant 5’ss    160784           30498            191282
*Non-redundant 3’ss    160533           30498            191031
+Alternative 5’ss      1262             0                1262
+Alternative 3’ss      1513             0                1513

* Some introns share a 5’ss or a 3’ss because of alternative splicing, so counting SS pairs counts those shared splice sites more than once.

+ RefSeq is a highly curated dataset and therefore contains relatively few alternative splicing cases. Previous work estimates that around 40-60% of genes are alternatively spliced.
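The rows of this table are arithmetically linked: in each column, the alternative count equals the number of splice site pairs minus the non-redundant count. That relationship is an observation about the numbers shown, not a definition stated on the slide. A short check using only the figures from the table:

```python
# Figures copied from the slide's table, as (NM_, XM_) columns.
pairs = (162046, 30498)
nonredundant_5ss = (160784, 30498)
nonredundant_3ss = (160533, 30498)

# Sites counted more than once among the pairs.
alt_5ss = tuple(p - n for p, n in zip(pairs, nonredundant_5ss))
alt_3ss = tuple(p - n for p, n in zip(pairs, nonredundant_3ss))

print(alt_5ss, sum(alt_5ss))
print(alt_3ss, sum(alt_3ss))
print(sum(pairs))
```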

Data Analysis

Figure: 3’ SS position weight matrix and logo, positions -8 to 2 (TTTTTCAGGT).

Figure: 5’ SS position weight matrix and logo, positions -3 to 6 (CAGGTAAGT).

• Shapiro and Senapathy previously calculated position weight matrices (PWMs) on a small dataset. We repeated the calculation on our data to check how well it agrees with their findings and whether anything new emerges.
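As an illustration of what a PWM calculation involves, the sketch below builds a simple per-position base-frequency matrix from aligned sites. It is a plain frequency matrix, not the log-odds form and not necessarily the exact procedure used for the slides; the three input sequences are made-up toy data, and the function names are hypothetical.

```python
from collections import Counter

def position_weight_matrix(sites: list[str]) -> list[dict[str, float]]:
    """Per-position base frequencies for equal-length aligned sequences."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        pwm.append({base: counts[base] / len(sites) for base in "ACGT"})
    return pwm

def consensus(pwm: list[dict[str, float]]) -> str:
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in pwm)

# Toy donor sites covering positions -3 to 6 (illustrative only).
toy_sites = ["CAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA"]
pwm = position_weight_matrix(toy_sites)
print(consensus(pwm))
```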

Data Analysis

We wanted to check whether the frequency of a splice-site sequence carries any significance.

Extracted each 9-nt subsequence and calculated its frequency in the dataset.

Donor (5’) splice sites and their frequency counts

9-nt         Frequency count
CAGGTGAGT    2061
CAGGTGAGG    2007
CAGGTAAGA    1773
CAGGTGAGC    1695
.....
AAAGTTCCC    1
TGTATATAG    1
GATGTTTGT    1
ATTGATATC    1
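The counting step can be sketched as follows. The window of 3 exon bases plus 6 intron bases matches the -3 to 6 range of the 5’ SS matrix, but the 0-based `intron_start` convention, the function names, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter

def donor_9mer(sequence: str, intron_start: int) -> str:
    """9-nt donor window: last 3 exon bases + first 6 intron bases.
    intron_start is the 0-based index of the first intron base (assumed)."""
    return sequence[intron_start - 3:intron_start + 6].upper()

def count_motifs(motifs) -> Counter:
    """Frequency count of each distinct 9-nt motif."""
    return Counter(motifs)

# Toy data: (sequence, index of first intron base).
toy = [
    ("AAACAGGTGAGTCCC", 6),
    ("TTTCAGGTGAGTAAA", 6),
    ("GGGCAGGTAAGACCC", 6),
]
counts = count_motifs(donor_9mer(seq, pos) for seq, pos in toy)
for motif, n in counts.most_common():
    print(motif, n)
```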

Data Analysis

Type                                 NM_ Accessions   XM_ Accessions   Total
Unique 9-nt 5’ss motifs              4684             3391             5920
Unique 9-nt 3’ss motifs              8622             5609             9788
Max. frequency count in 5’ss data    2061             364              2385
Max. frequency count in 3’ss data    895              137              1031

Some statistics from the frequency-based analysis of the splice-site dataset

Frequency-based SS Predictor

Thousands of genes are yet to be identified.

An estimated 40-60% of genes are alternatively spliced.

It is often difficult to identify alternative splice sites experimentally.

Gene prediction programs use SS predictors.

Frequency-based SS Predictor

Normalized the frequency count of each 9-nt subsequence to a score of 1-100.

Score = (frequency – min. count) / (max. count – min. count) * 100

Developed a predictor that takes a genomic sequence and returns a score for each 9-nt subsequence that exists in the frequency dataset.
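A minimal sketch of the scoring and scanning steps, assuming the frequency dataset is a dictionary from 9-nt motif to count. The function names and the toy counts are illustrative, not the original implementation. Note that the formula as written maps the minimum count to 0 and the maximum count to 100.

```python
def normalize_scores(counts: dict[str, int]) -> dict[str, float]:
    """Score = (frequency - min count) / (max count - min count) * 100."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # Assumption: with a single distinct count, give every motif 100.
        return {motif: 100.0 for motif in counts}
    return {motif: (c - lo) / (hi - lo) * 100 for motif, c in counts.items()}

def scan(sequence: str, scores: dict[str, float], k: int = 9):
    """Slide a k-nt window over the sequence; report windows found in scores."""
    hits = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in scores:
            hits.append((i, kmer, scores[kmer]))
    return hits

# A few counts taken from the donor frequency table.
counts = {"CAGGTGAGT": 2061, "CAGGTAAGA": 1773, "AAAGTTCCC": 1}
scores = normalize_scores(counts)
hits = scan("TTCAGGTGAGTAA", scores)
print(hits)
```

Windows that never occur in the frequency dataset produce no hit, which mirrors the "if it exists in frequency dataset" condition.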

Creating Test gene sets

Purpose: create a common test set to evaluate publicly available SS predictors.

Extracted 40-nt-long splice-site regions using the pipeline.

Dr. Sachidanandam developed software to classify splice sites into SS classes.

Creating Test gene sets

Once the individual splice-site pairs were classified, we went back to each gene and placed it into the bins of the four splice-site classes.

A gene can be in multiple bins.

U2_GT_AG is the major class; 99% of splice sites belong to it.

We found five groups into which all genes fell.
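The binning step can be pictured as grouping genes by the classes of their splice-site pairs. The sketch below shows only that step; the rule that turns bins into the five final groups is not spelled out on the slide and is not reproduced here. Gene names and the `bin_genes` function are hypothetical.

```python
from collections import defaultdict

def bin_genes(classified_pairs):
    """classified_pairs: iterable of (gene, ss_class) tuples, one per
    splice-site pair. Returns a dict mapping ss_class -> set of genes."""
    bins = defaultdict(set)
    for gene, ss_class in classified_pairs:
        bins[ss_class].add(gene)
    return dict(bins)

# Toy classification results (gene names are made up).
pairs = [
    ("geneA", "U2_GT_AG"),
    ("geneA", "U2_GT_AG"),
    ("geneB", "U2_GT_AG"),
    ("geneB", "U12_AT_AC"),
    ("geneC", "U2_GC_AG"),
]
bins = bin_genes(pairs)
for ss_class, genes in sorted(bins.items()):
    print(ss_class, sorted(genes))
```

Here geneB lands in two bins, which is the "a gene can be in multiple bins" case.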

Creating Test gene sets

Five groups

U12_AT_AC                      187 genes
U12_GT_AG                      688 genes
U2_GC_AG                       1085 genes
U2_GT_AG                       13289 genes
Multiple classes (3 or more)   131 genes

Creating Test gene sets (continued)

We decided to further divide the U2_GT_AG group.

Ranked each gene in this group by the minimum frequency count among that gene's donor splice sites.

Group                    Min. freq. count   No. of genes
High_U2_GT_AG            >= 200             2689
Intermediate_U2_GT_AG    >= 50 & < 200      3698
Low_U2_GT_AG             < 50               6902
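The thresholds in this table translate directly into a small classifier. This sketch assumes each gene is represented by the list of frequency counts of its donor splice sites; `frequency_group` is a hypothetical name and the example counts are made up.

```python
def frequency_group(donor_counts: list[int]) -> str:
    """Assign a U2_GT_AG gene to a group by its lowest donor-site count."""
    lowest = min(donor_counts)
    if lowest >= 200:
        return "High_U2_GT_AG"
    if lowest >= 50:
        return "Intermediate_U2_GT_AG"
    return "Low_U2_GT_AG"

print(frequency_group([2061, 450, 300]))
print(frequency_group([2061, 199]))
print(frequency_group([1773, 1]))
```

Because the minimum is used, a single rare donor site is enough to move a gene into a lower group.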

Creating Test gene sets (continued)

Randomly selected 20 genes from each group.
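A sketch of the selection step, assuming the groups are held as a dictionary from group name to a set of gene identifiers. The fixed seed, the group names, and the handling of groups with fewer than 20 genes are assumptions added for reproducibility and are not taken from the slides.

```python
import random

def sample_test_sets(groups: dict[str, set[str]], k: int = 20, seed: int = 0):
    """Pick up to k genes at random from each group."""
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    return {
        name: rng.sample(sorted(genes), min(k, len(genes)))
        for name, genes in groups.items()
    }

# Toy groups with made-up gene identifiers.
groups = {
    "big_group": {f"gene{i}" for i in range(100)},
    "small_group": {"geneX", "geneY"},
}
test_sets = sample_test_sets(groups)
print({name: len(genes) for name, genes in test_sets.items()})
```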

Comparing SS Predictors

We decided to compare the following splice-site predictors for the donor (5’) splice site:

Shapiro and Senapathy consensus-matrix-based scoring

∆E scoring

NN-based scoring

MAXENT, MDD, MM

DonorFreq (frequency-based SS predictor)

Comparing SS Predictors

Developed software modules to fetch predicted splice sites from web-based predictors and store them in a database.

Used various threshold values to calculate specificity and sensitivity for each predictor.

Specificity (Sp) = (Truly Predicted SS / Total Predicted SS) * 100

Sensitivity (Sn) = (Truly Predicted SS / Total SS) * 100
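The two measures can be computed per threshold as sketched below. The data layout (a dictionary from position to score, a set of true positions) and the name `sp_sn` are assumptions, and returning 0.0 when nothing is predicted is a convention chosen here. Sp as defined on this slide is the quantity often called precision or positive predictive value.

```python
def sp_sn(predictions: dict[int, float], true_sites: set[int], threshold: float):
    """Return (Sp, Sn) in percent for sites scoring at or above threshold.

    Sp = truly predicted SS / total predicted SS * 100
    Sn = truly predicted SS / total SS * 100
    """
    predicted = {pos for pos, score in predictions.items() if score >= threshold}
    true_pos = len(predicted & true_sites)
    # Assumed convention: 0.0 when a denominator would be zero.
    sp = true_pos / len(predicted) * 100 if predicted else 0.0
    sn = true_pos / len(true_sites) * 100 if true_sites else 0.0
    return sp, sn

# Toy predictions: position -> score, and the annotated true sites.
predictions = {10: 90.0, 20: 60.0, 30: 40.0, 40: 80.0}
true_sites = {10, 20, 50}
for threshold in (50, 85):
    print(threshold, sp_sn(predictions, true_sites, threshold))
```

Sweeping the threshold and plotting the resulting (Sn, Sp) pairs gives one curve per predictor.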

Comparing SS Predictors

A predictor that gives fewer false positives for a higher number of true positives is the better splice-site predictor.

We drew a specificity vs. sensitivity graph for each predictor to compare the performance of the various predictors.

We compared the predictors on seven test gene sets and drew an Sp vs. Sn graph for each set.

Comparing SS Predictors

No clear winner.

Overall, MIT MAXENT and DonorFreq perform well.

As expected, the SS predictors did not perform as well on the test sets of minor classes as on the U2_GT_AG sets.

Future Work

Improve the frequency-based predictor using the splice-site classification data that is now available.

Group genes by frequency count and splice-site class, and use Gene Ontology data to investigate whether they participate in any common pathways.

Acknowledgement

Dr. Ravi Sachidanandam

Dr. Francesc Roca, for explaining everything I know about splice sites.

Dr. Sun Kim
