Mining of cis-Regulatory Motifs Associated with Tissue-Specific Alternative Splicing Jihye Kim Bioinformatics Research Center

Outline
Background and Motivation
Association Rule Mining (ARM)
Use ARM techniques to discover cis-regulatory elements involved in alternative splicing
Conclusions and Future Directions

Central Dogma of Molecular Biology
[Figure: in the nucleus, a DNA gene (introns and exons) is transcribed into pre-mRNA and spliced into mature RNA, which is exported to the cytoplasm and translated into protein]

Splicing
Introns are removed and flanking exons are concatenated
[image from ...]

Alternative Splicing
Over 70% of human genes show AS
Some genes express thousands of different mRNAs
[Figure labels: pre-mRNA, mRNA, protein]

Biological Relevance of AS
Major mechanism to generate protein diversity
Important in gene regulation
Highly relevant to disease
15% of disease-causing mutations affect splicing [Krawczak 1992]
[Krawczak 1992] Krawczak, M., Reiss, J., and Cooper, D.N. Hum. Genet. 90: 41-54

Types of Alternative Splicing
[Source from Cartegni et al. 2002]
Cassette Exon

Regulation of AS
[Image from J.R. Sanford, et al., Cell Science at a Glance 117(26):6261]
Spliceosome detects splice sites
Often, splicing factors bind to introns/exons to assist or repress exon splicing

Cis-Regulatory Elements
Short sequences
ESE, ESS, ISE, ISS
Close to splice sites
[Source from ...]
GENEINFO:Specie:Homo sapiens, human
GENEINFO:Gene Name:fibronectin eda exon
GENEINFO:Entry type:Exon enhancer
GENEINFO:Methods:In vivo splicing assay
SEQINFO:Sequence:GAAGAAGA
SEQINFO:Sequence origin:Exonic
[Image from Z. Wang and C.
Burge, RNA 2008]

Investigating AS Regulation
Several computational methods:
Over-represented hexamers from brain-specific genes [Brudno 2001]
RESCUE-ESE found 10 motifs with enhancer activity [Fairbrother 2002]
Motif pairs by coCOA (compositionally orthogonalized Co-Occurrence Analysis) [Friedman 2008]
Most methods use only sequence data and focus on the effect of individual motifs
[Brudno 2001] Brudno M., Gelfand M.S., et al., 2001 NAR 20 (11)
[Fairbrother 2002] Fairbrother WG., et al., 2002 Science 9;297(5583):
[Friedman 2008] Friedman B.A., et al., 2008 Genome Res 18(10)

Motivation
Often, AS is regulated by a combination of several binding factors
Exonic UAGG and GGGG motifs are required for skipping of the cassette exon of the glutamate NMDA R1 receptor [Han 2005]
[Figure labels: GGGG, UAGG]
[Han 2005] K. Han, et al., PLoS Biol (5):e158

Goal
Find motifs and motif combinations involved in AS
Motif => Exon exclusion
Motif, Motif => Exon exclusion
Association rules: unexpected relationships between two objects

Association Rule Mining
By Agrawal et al.
in 1993
Initially used for Market Basket Analysis
An association rule is a pattern stating that when X occurs, Y occurs with a certain probability
X: antecedent (left-hand side, lhs); Y: consequent (right-hand side, rhs)
Goal: find all interesting rules X => Y

ARM Example
Cart 1: Milk, Bread, Diaper, Beer, Jam, Banana
Cart 2: Beer, Nuts, Tissue, Diaper
Cart 3: Apple, Beer
Cart 4: Jam, Beer, Diaper
Cart 5: Bread, Butter, Tissue, Jam
An unexpected rule: Beer => Diaper

Rule Strength Measures
Given a rule X => Y:
Support = Pr(X, Y)
Confidence = Pr(Y | X)
Lift = Pr(X, Y) / (Pr(X) Pr(Y)): dependency of lhs and rhs
Generally, lhs and rhs have positive dependency if lift > 1.0

ARM Example (same five carts)
Min supp = 0.5, Min conf = 0.7
Frequent itemset = itemset whose support > 0.5
Frequent itemsets (support): Beer (0.8), Jam (0.6), Diaper (0.6), {Beer, Diaper} (0.6)
Not frequent, e.g.: Bread (2/5 < 0.5)
Association rules (confidence):
Beer => Jam (2/4 < 0.7), rejected
Beer => Diaper (0.75)

Apriori Algorithm
Most popular ARM algorithm
Two steps:
1. Find all itemsets that satisfy min_supp (frequent itemsets)
   Any subset of a frequent itemset is also frequent
   Find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on
2. Generate rules
   A => B is an association rule if Confidence(A => B) >= min_conf

Association Rules of Motifs in AS
Beer => Diaper: shopping items purchased together in market basket data
Motif A => Motif B: a motif pair that jointly regulates alternative splicing

Part I: Finding association rules of cis-regulatory elements involved in alternative splicing
[Proceedings of the 45th annual southeast regional conference (ACM-SE) Winston-Salem, North Carolina pp.
232-237, 2007, BEST REGULAR PAPER]

AS Datasets in Mouse
Dataset: Splice Array [Pan 2004] with 6 probes
3126 exon skipping genes in mouse
%ASex: percentage of exon skipping in 10 tissues
[Pan 2004] Pan, Q., et al., 2004 Mol Cell 16(6):

K-mers Around Cassette Exon (items)
Pre-mRNA sequences
Transcripts from NCBI
BLAT to align transcripts to the mouse genome
200 bps from 7 regions around the cassette exon
2565 genes in total
Items (6-mers): AAAAAA to TTTTTT in regions 1-7

ARM for AS Motif Rules
Items: all possible hexamers (motifs)
Transactions: 2565 AS genes
Goal: finding motif association rules in AS genes (e.g., AGGATA => TTAGCT)
By Apriori algorithm [Agrawal 1993]: find all frequent hexamers, then generate hexamer rules
[Agrawal 1993] Agrawal R., Imielinski T., Swami AN., 1993 SIGMOD 22(2):

ARM Example
Seq 1: ACGATTAGG
Seq 2: GAATAGG
Seq 3: TGCAGG
Seq 4: GGATTAGG
Seq 5: CAGAT
Min support = 0.5, Min confidence = 0.7
Frequent 3-mer sets (support): AGG (0.8), GAT (0.6), TAG (0.6), {AGG, TAG} (0.6)
Rules (confidence):
AGG => GAT: conf = 2/4 = 0.5 < minconf, rejected
AGG => TAG (0.75)
TAG => AGG (1.0)

Motif Association Rules from AS Genes
_TGAAGA, 7_GAAGAA (ASF/SF2, SRp55)
- 6_TTTTCT, 6_AATAAA,
6,000 6-mers
- Candidates of regulatory
motifs
Association rules: Minconf = 0.4
Frequent 6-mers: Minsup = 0.05 (129 genes)
- 4_AAAAAT => 4_TGAAGA, 4_AAAGGA => 4_AGAAGA,
- 4_GAAAAA => 4_AAGAAG, 4_CTGCCT => 4_CTGGAG,
- 4_AGGAAA => 4_AAGAAG, 4_AATAAA => 4_AAGAAG
- Candidates of regulatory combinations for AS

Clustering by AS Pattern in 10 Tissues
Hypothesis: motif combinations cause the AS profile
Cluster genes based on AS profile
We use Euclidean distance / correlation
Average linkage clustering
Frequent 6-mers in a cluster are motif candidates

Association Rules from Clusters
Lift(X => Y) > 2.0
Comparison with outside the cluster (p-value < 2.13e-10)
Association rules are candidates of motif combinations for the corresponding AS pattern
Correlation-based clusters
112 frequent hexamers (0-39 for each cluster)

AS Profile of Genes with a Motif Rule
Example: 7_AGCAGC => 6_GCAGCC

Summary
Motifs and motif association rules from a group of genes with similar AS pattern
Candidates of motif combinations
BUT:
Problems in choosing the right threshold
Dependent on clustering technique

Part II: Mining of Cis-regulatory Motifs Associated with Tissue-specific Alternative Splicing by Discretization-Based Quantitative Association Rule Mining

Motivation
Avoid clustering of genes
Use AS patterns as attributes -> Association Rule Mining with quantitative attributes

Quantitative Association Rule Mining
Mine numeric or quantitative data
Two methods:
Discretization (binning methods, e.g., equi-width, equi-depth, distance-based)
Distribution-based

Example
Cart 1: Liquor $21, Vegetables $20, Meat $12
Cart 2: Liquor $7, Vegetables $70
Cart 3: Liquor $86, Meat $59
Cart 4: Liquor $29, Vegetables $3
Cart 5: Liquor $98
Cart 6: Liquor $33, Meat $16

Discretization-based
Discretization of numeric attributes
Intuitive and popular
Sensitive to bin size
Expense of Liquor ($):
Equi-width (width $20): [0, 20], [21, 40], [41, 60], [61, 80], [81, 100]
Equi-depth (depth 2): [7, 21], [29, 33], [86, 98]
Distance-based: [7, 7], [21, 33], [86, 98]

AS profile items
Use quartiles to
convert numeric %ASex values to categorical AS profile items
BrainLow: the first %ASex quartile in Brain
BrainHigh: the last %ASex quartile in Brain
[Figure labels: BrainLow, BrainHigh]

Finding Motifs Involved in Tissue-Specific AS
Items: hexamers in gene regions; exon skipping rate in tissues
Transactions: 2565 genes from Pan's data set
Goal: find associations between hexamers and exon skipping rate
AGGATA in cassette exon => High exon skipping in Brain

Tissue-Specific AS Motif Combinations
1464 association rules are found in total
204 complex rules are found
lhs: combinations of 113 frequent hexamers
rhs: AS profile items in tissues
All rules have lift > 1.9
117 rules show motif combinations in different regions

AS profile of Motif
1260 simple rules with 806 hexamers

Antecedent  Consequent  Support  Confidence  Lift
{X4_GCTGGA, X4_TGCTGG}  {IntestineLow}
{X4_GCTGGA, X4_TGCTGG}  {LungLow}
{X4_TGCTGG, X4_CTGGAG}  {IntestineLow}
{X4_TGCTGG, X4_CTGGAG}  {LungLow}
{X5_TTTTTA, X7_AGAGGA}  {HeartHigh}
{X1_AGCAGC, X5_TTTTTA}  {MuscleHigh}
{X1_GAGCAG, X3_TTTTAA}  {MuscleHigh}
{X1_GAGCAG, X3_TTCTTT}  {LiverHigh}
{X4_AGAAGA, X5_TTATTT}  {SalivaryLow}
{X4_AGAAGA, X5_TTATTT}  {HeartLow}
{X4_AGAAGA, X5_TTATTT}  {KidneyLow}
{X4_AGAAGA, X5_TTATTT}  {LiverLow}
{X3_ATTTTT, X6_TTCCTG}  {SalivaryHigh}
{X3_TTGTTT, X6_TGTCTC}  {LiverHigh}
{X2_GCCTGG, X3_CCTCTG}  {LiverLow}
{X2_GTGGGG, X5_TTGTTT}  {MuscleHigh}
{X5_ATTTTA, X6_TGCTGT}  {SalivaryHigh}
{X5_TCTTTT, X6_TTGTCT}  {SalivaryHigh}
{X3_TCTGTT, X6_TTGTCT}  {HeartHigh}
{X5_TTTTTA, X6_TTGTCT}  {HeartHigh}
{X3_CTCTTT, X5_TTAAAA}  {KidneyHigh}
{X2_GGGTGG, X5_TTATTT}  {SalivaryHigh}
{X5_TCTTTT, X6_TTTTCA}  {IntestineHigh}
{X3_TTTATT, X6_TTTCCT}  {IntestineHigh}
{X5_TCTTTT, X5_TTATTT, X5_TTTTTA}  {HeartHigh}
{X5_TTCTTT, X5_TATTTT, X5_TTTTCT}  {SalivaryHigh}
{X3_TATTTT, X3_ATTTTT, X5_TTGTTT}  {BrainHigh}

AS Profile of Motif Combinations
{5_TTTTTA, 7_AGAGGA} => {HeartHigh}

AEDB in EBI
AS regulatory sequences from literature
292 enhancers and silencers
42% of exonic hexamers, 63% of intronic hexamers are included in AEDB
motifs significantly (p-value ...

... Liquor: mean = $12/week (overall mean = $7/week)
Association between a subset of a database and its extraordinary behavior
To define extraordinary behavior, statistical tests are used

Our Data
Heptamers: categorical items
Exon skipping rates: quantitative items
G1: 1_ACTGGAG, ..., 7_TTTTCGA, 43 (Brain), ..., 78 (Testis)
G2: 1_AAGCTTG, ..., 7_TCTTAAA, 22 (Brain), ..., 54 (Testis)
G3: 1_AGGCCAA, ..., 7_TGAATTT, 4 (Brain), ..., 13 (Testis)
G4: 1_ATATTTT, ..., 7_TTTTCGA, 89 (Brain), ..., 100 (Testis)

Our goal
Mining of heptamer(s) => exon skipping rate rules
Mean of exon skipping rates
T-test for extraordinary exon skipping rates
E.g., 4_TTGCGAC => mean(Brain) = 80 (overall mean(Brain) = 30)

Our algorithm
For each frequent heptamer set S and tissue type:
Compute the mean exon skipping rates of genes including S and of genes excluding S
Identify and report interesting rules using a t-test
[Figure: G1: A, B, C, D; G2: A, C, F; G3: B, D, F; G4: B, H. Exon skipping rate (Brain) of genes including D (mean = 17.5) versus genes excluding D (mean = 73.5), compared by t-test]

Example
[Sequence]
Seq 1: ACGATTAGG
Seq 2: GAATAGG
Seq 3: TGCAGG
Seq 4: GGATTAGG
Seq 5: CAGAT
Min support = 0.5
[Exon Skipping Rate]
Genes with TAG: 89, 68, 59 (mean = 72)
Genes without TAG: 33, 21 (mean = 27)
T-test p-value = ...

Implementation
Input: sequence, exon skipping rate
Output: interesting rules
Data structure: lattice with hash table
Each node has a hash table with gene indices
Gene index links to its exon skipping rates
Compare each node with the root node
Script language: Perl

Results
In total, we mined 97 significant rules
59 different heptamer sets
71 individual frequent heptamers
3 complex rules with two heptamers

Validation via Shuffling Experiment
100 shuffled experiments (gene sequences and exon skipping rates)
Same distribution as original data
14.7 interesting rules on average
No complex rule

Motif Conservation
Motif
conservation score by PhastCons
Three heptamers from complex rules are highly conserved
A third of the heptamers from simple rules are highly conserved

Validation of our heptamer sets
43% of our predictions overlap with enhancer/silencer sequences from AEDB
Significantly (p-value = 0.017) higher than random heptamer sets

Some Interesting Rules
GCTGGAG in cassette exon
Reported for 7 tissues
Overlaps with the 5' end of a potential SC35 binding site
Known binding site of muscle-specific cardiac troponin T transcripts [Hodges, D., et al. Genetics 151, 263-276 (1999)]

Complex Rules
{6_TTTAAAA, 3_TTATTTT} => {meandiff(Brain) = ...}
Neither of these heptamers is part of a simple rule

Complex Rules
{2_TTTCTCT, 3_TTTCTCT} => {meandiff(Spleen) = ...}

Complex Rules
{3_AAAATAT, 3_TTTGTTT} => {meandiff(Spleen) = ...}
Exon Skipping Rate in Spleen

Motif repeats
More repeats of heptamers in complex rules
[Figure labels: Simple Rules, Complex Rules; 73.8]

Conclusions
Three types of ARM methods applied
Motifs associated with tissue-specific AS
Many motifs are conserved
We found interacting motifs
Many motifs overlap with known cis-elements
Some heptamers affect multiple tissues

Conclusions and Future Directions
Compare tissues and conditions
Explore more flexible motif representation
Include additional features (e.g., trans-factors, exon length, splice site strength, RNA folds, etc.)
Build a predictive model of AS
Compare with other approaches, e.g., tree-based methods such as decision trees, regression trees, multivariate adaptive regression splines, etc.

Acknowledgements
Dr. Steffen Heber, Dr. Eric A. Stone, Dr. Zhao-Bang Zeng, Dr. Barbara Sherry, Dr. Ben Redelings, Dr. Reed Cartwright, Dr. Jeffrey L. Thorne, Chris Smith, Sihui Zhao, Brian Howard, Sunil Suchindran, Li Zhang, Dr.
Katerina Kechris, Hyunmin Kim

Lattice Data Structure
For finding frequent itemsets efficiently
A superset cannot be frequent if any of its subsets is not frequent

Example
[Sequence]
Seq 1: ACGATTAGG
Seq 2: GAATAGG
Seq 3: TGCAGG
Seq 4: GGATTAGG
Seq 5: CAGAT
Min support = 0.5, Min confidence = 0.7
[AS profile]
Seq 1: BH, HH
Seq 2: BH, HL
Seq 3: BH, HH
Seq 4: BL, HH
Seq 5: BH, HL
BH: BrainHigh, BL: BrainLow, HH: HeartHigh, HL: HeartLow

Example: items per sequence (3-mers + AS profile items)
Seq 1: ACG, AGG, ATT, CGA, GAT, TAG, TTA, BH, HH
Seq 2: AAT, AGG, ATA, GAA, TAG, BH, HL
Seq 3: AGG, CAG, GCA, TGC, BH, HH
Seq 4: AGG, ATT, GAT, GGA, TAG, TTA, BL, HH
Seq 5: AGA, CAG, GAT, BH, HL
Minimum support = 0.5, Minimum confidence = 0.7
Frequent itemsets (support): AGG (0.8), GAT (0.6), TAG (0.6), {AGG, TAG} (0.6), BH (0.8), HH (0.6), {BH, HH} (0.6)
Rules (confidence): {AGG, TAG} => BH (1.0)
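
The 3-mer part of the toy example can be recomputed with a few lines of code. The following is a minimal Python sketch, not the implementation described in the slides (which is stated to be in Perl and to use a lattice with hash tables). The function names kmers, support, frequent_itemsets and rules are hypothetical, chosen only for illustration, and the search is a plain level-wise Apriori-style pass over sets of 3-mers.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_supp):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    items = sorted(set().union(*transactions))
    level = [frozenset([item]) for item in items]
    frequent = {}
    while level:
        # Keep only the candidates that reach the minimum support.
        kept = {}
        for cand in level:
            s = support(cand, transactions)
            if s >= min_supp:
                kept[cand] = s
        frequent.update(kept)
        if not kept:
            break
        size = len(next(iter(kept))) + 1
        # Join step: unions of two frequent itemsets that grow by one item.
        candidates = {a | b for a, b in combinations(kept, 2)
                      if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in kept
                        for sub in combinations(c, size - 1))]
    return frequent


def rules(frequent, min_conf):
    """Return (lhs, rhs, confidence) for every rule lhs => rhs that passes."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = supp / frequent[lhs]  # lhs is frequent by closure
                if conf >= min_conf:
                    out.append((lhs, itemset - lhs, conf))
    return out


# Toy sequences from the example; thresholds as on the slides.
seqs = ["ACGATTAGG", "GAATAGG", "TGCAGG", "GGATTAGG", "CAGAT"]
transactions = [kmers(s, 3) for s in seqs]
freq = frequent_itemsets(transactions, min_supp=0.5)
found = rules(freq, min_conf=0.7)
for lhs, rhs, conf in found:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 2))
```

The prune step is where the lattice property from the slides (a superset cannot be frequent if any subset is not) enters: candidates with an infrequent subset are discarded before their support is counted. Region prefixes such as 4_ and AS profile items such as BrainHigh could be added to each transaction as ordinary items, under the same assumptions.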