What does the genome say?
Sebastian Reyes
Genome Center, UC Davis
Outline
• What’s genome annotation? Why do we need it?
• What’s an annotation pipeline?
• Repeat prediction
• mRNA prediction
• ncRNA prediction
• Annotating my genome
• What are the maker .ctl files?
• What’s a gff file?
AGGTCCTACCGGGAGTCGAACCCAGGTCGCTGGATTCAAAGTCCAGAGTGCTAACCACTACACCATAGAACC
Tomato Glycine tRNA
>SL2.40ch02:49495500..49512399aggtagatcgatatgtaatctgcatatttattcaggaaaaagacatgaattgccccctcaacttgtaccaaaaagtcttttacacaattttattaatggggcaacctattacacacattaagtacccctttcgtcacccaacccaaaacattatgcatagagtgagagtcactcgcctctagcctctaactgaataaattgatgcctcgacccgattcatttgatccaaacccgtttattttcacctaggttcatttttttcaatccaaatttagtttattttaccctagtccatttatttcacgcacccatttcgtttcgctaagaacaccaagctctcgagggagagagaatactaatgaggcgttcatatgtgtgtaacaaaataacacttataacgagcgtaatttatgtgtctgatttaatcaacaaagttatttttcaatatttttgttctttatcttaatttatgtgaccctttttttaaaatttatattaaaaaatatacctattgaaaatatactttattttgaacttcttcttttttttcttagtccttttacaggtatacatagtaacgtgtcaattgtcattctctatatttgaatatcaatttcaatttctctattttataccttttttcattttgatttttttaattatttttctaaataatcatcatgtaatataaaaatttatgtatttcaaaatgtcgatgtaatatgaatttcataaagtaaaataagataaatattacatcatagttaaagaaaaactctattttcaagacggttatgttgctcctataaaactaaagtttgggattatatataaaacttaaaagagaaagaaaagtttaattaattgagtactgtataggtaccactacaaacatgaaagaacaacccaaaaaaaagaagaaaaagaaagttacaacacaaaaaagaaacgttattgatcatcaactgatctctctctctagataataataatataataaatatatggattacaatttcttcctcccccatggcgtagtatttccatcgaatagctcatcaatgtctgtttcaatttcttcaggttggtactttacatacttctccgtctgccttcccggagtaaatgtttgactaaaccttcccaagaatgacctgtacgctggtataaccatgtttgatatggaaactctcagctccgattgcaattgctcgtcgctaatcacccaactgctctgcgtcttgtgtatctcatcaaacaatgcattgaaactcttgaacctctcttttaatattggtttattcactttcccattcacattcaatccctcgtggtttaaacattgcaacaatttaccccacgtttctctttggtaattcttgtggtactgtcttagatctgaagatcgctttctgtaccattgatcccccatcaggctattcatctctggagatccttttattttttgcaagatgtatcgtccgttattcatcatgaatattgaactgagtgaagtgtctttgtaaagccttgattttccctccagatttgaatccaatagatccatcacttttatcatatgcgtctcaaacggtgatggtttattcgatgacgtattattattattatgattattattattattattattattattactactctgacccgcttgtggattttgacaatcaaaatccgatcccgtggctgagtcagctctttctatcatttgatgttctctgaatacctgctctaacgtgtctctgtactcccctacatatttcagataattcatgatgtaccgagttaaaggatgaacagcaccacctggaaccgcggtcttgtttgagtccccctggattgagttttccagctccgaaaagatagaaaccatggattcccctaagcggccccgactcaatgtagcttcagccttgagttcatcagcgtaagaaactggaaatatctta
tccaccaaagggataaaatcccgcaacgtctcataaatgtcgagaaacttgaagagtttttcagcagcccgttttgtcatggaaacggcttcagcgaaattaaggagttggatcatcatacctcgagacaaattgctgaagatggtctcagaaattgatggttgatcctcaaacaccgcatccgccagcttgcgttcactggagaagagcacattagtgcagtgcctgaatgttgcaatccaagcagtaacctctctctccaatggttcccaattcatcttctgcacatcctcgatgctatatttttcaaagccaagcttatgcaagctttcctccaaagctttcctccgtgcgatgaaatagacctgacagcattctgcttcatagcctcctgcaattaaggctttggaaaatttattcaatgtggctacgatttcctcagagtagcctggaaatttattttcttcagaaggtttagattcagaagcagtgtcttgttctgcatcttcattggtgtctgatgatgaagaattagggttaggcttggcagaagaagtgtctaaattggtgatatctgaatcagtattgatcttgtaatcgtagaggattgacttgtattcttcctctatgtaggacattgctcgttgaagaacaccatcgacgcggctgattgaataagcatatttgtactctgaagaaaatcgacagagggaggtaaatagcttggagattcgatctacaatatttaaaaatgatgtggcttcctcctgagctagctggctccatttgacgggtgcatcaccgccatcatattcatcaattttcgcttcaacaagaacagcaaattgttctacaaacacaggaacatcgggaggcttagattcatcatcatcccctttaaagtttgatgattcagagataaattgatctatttcttctgaaaccttatcgagatcaggaggcaggggagaagaaacttcaaccttgacctcatcgtcgtcttcaggtttgatatcgtctgtcttggcaccatcatcctctgttgtgttgatgactctatcgtctgtttgcagttgttcaacaatttcttcttctgtagactttgtttcatgggcaggaggagtttcatctggctgtttgtcattggattttgtttgatcagaagaagtttgtgttagatccatgattgtaattgtgtgagtgagtaggtaagtttagcaggaaaattgagcttacattatcaaaaagagtaaggaaagtgttgtgattggagatgcagcgagcaaatcgacgagaaaggaaatgtgtgggatcacattaattagttagttacctttaacgtccactcttgtccacgtctaatattctcaatgccataaacttgcacgaatcaaggacggcctctcccactcactactcaccatctccataaccatctcatccaaaatatattactacttccattttagtatgtcaacataaaattgacttaattttttaatttttgagattctcaatcttttagcttatataacatcaaacttgtggtcactgtgttatgaaacaaatgattatgggaaatattgaaaaagcaaaataaagttgtagttataacattttagtaatttgtccaatattttttaaaaaaaaattataatgattttatcaaaattatttatgagatgacataaattagggatagggaagaagaaaggaggaggaaggtagaggaaacaagtgatcctattctccatgtatgtatttaagtgtagacaggcaggcccgacccatccgatccaatgtgaattattaaccctaggtaggtattggtaaagcttgagatgcgaatgagaatgtgatcaatgggtaaaccaactacccctcatcttcctccaggttccaacctacatactttattaataaatgttttgggtttttactcccaagctcaggattatgtttgcatacatctgctgcctgacattctttttgctacatgc
tctcacgaaacttctattatttatctaggttatcaacttcaatcacattagtagtattaaaggaagcttgaatttgttatgcatccccaattacatgcaaatattgctaatcatttttcctatttttaattactctgcaccttccagataatgctatcaaacatagctcactgctggagcccacacagcaagtactgatatcaaaacatgattctcctccgagaagcaaccttccaactttttcttaccgggatattgcaaccgccacaaacaatttcaggcgacaatccattattggggaaggtggctttgggccagtattcaaagggaaacttaacacgaatcaggtgattttcactaaatgaagtgtgttttggtcatttcaaatttatcatccttgaatttctataaccattcaggttgtggctgttaagaagctaaatcattccggtcttcaaggggataaggagttctttgtagaggttcacatgctctcactgatgcggcaccctaacctggttaatctaattggttactgctctgaaggagagcagcgacttcttatctatgaattcatgcctctaggatctttggaatatcatctccatggtaatgctcattcactattctcaccttagaaataaattcttatttgtcaacctgcaccaatctgttatgcttcagagctctacattcatgcattatttgctataaacttttggatcgtctttgatagcttaactcatttagagcccgacctaataaaataaatagcttttaagtttttaagtcaaaaaaataaaaagtaggagagacctactcttctcttttttaaaagtttattttaagttatttttaaccttgtcaaacactttcagaagctaaaagtgacttcaaagtaagtttgatcaaacttttaagtcccttgaagatatgaagttactattagtatcattctttaatactcattcacttctattttcattatttcttttttctttatcgaggaagtcttcacttctttaaggctgcactgccacatttgatggatatcaactgccctctcttattattctattaaccatgtattaaatttgttgatcctttttgtggttgtctgccaaagttgttactatgtgtgggaacttttttttctgtagatttttttaaaaattgtacatgcatcagaggatgcaactaatatattcattttccctctaaagtatgaaaaatgtgttgattaatagtataatgtgcagatattacgcctgacatgaagccattggattgggataccaggatggtaatagcatctggtgccgccaaaggcctggagtaccttcacaaccacgctgatcgtcctgttatctacagggaccttaagtcggcaaacatattattgggtgagggtttccatgctaaactttctgactttggtcttgcaaagtttggcccaattgcagacaacacacatgtttcaactcgggttatgggcactcatggatattgtgcacctgagtatgctggaacaggaaaactgactatgaaatccgacatctatagttttggtgtgctcttgttggagctaattactggatgtagagcaatggatgactctcacgaacatggaaaagaaatgcttgttgactgggtaattacttcacatttaattgattgaagtcattatacttggcaagctggaaaatgaaccaactttgagacacaaatataagtttttgtttcactattgagtttaaaaatgtttgtgtgtccttttttgactcttgaagcataaagttttttctctgagtgatgcatagttctatcaacgtgtcaaaggtattttggttgaaggtctctctctgccttggaggaggcccaaagttaccgaaaaatgagaatgactgatctatcaagtgatatgcacagttctgaccattgaaatgtatcatagtttgatttttttttctttttctgaatgagaagtgtaggctcctgtt
tgtgcgcttccttttctccctgctaacagtgaagtttctcaggcacgtcctatgttaaaagaccgcatgaactatgtacagttagcggatccaatgttaagaggcaaatttccacaatctgtcttccgcagggtagtagaactggtcttaatgtgtgttcaggatgatccccatgctcgacctcacatgaaagacatcgtgcttgctttgagttactttgcatcccaaaagcatgattcacctgcagctcagattggatctcatgggggagaagggacaaatggaagctctgttgattttgatggagctcagatggatataacagaaataagagcttcaaacaaagatcaagagcgggagagagctgttgcagaggccaagaagtggggcgagacctggagagagaaagggaaacagaatgcggatgatgatttagattataaatcaaggtggtgattgattgtaagttagtttccttgtcacagaacagcatttttattcaaatttttgaaacagtgtcgtatgtatatactcataaaaaagaaattattggcattattgatgtatttgtatgcttgacaaaataaattcaatacaactactatatggcatttatcattaacaagtactactaattaggttttgttcaagtactaaccaaacaaacaagaatgttaaattaagattaaaatatataagcaaccacatggattaaggattcaagagtaataatgtaagaattgtaccgtacagagtttagtaatgtaggtatgattcttcttatttcactttttatccctcaaaatatcataaataaattcatgtattcggttaagatcggttattggtaaaaattaaaatcaaattaatctagttggttttctaaattactaaaaccaaaccaaacggataaaataaattggtttggttcaatttttctattttttttttcagtttgaaagtaatacatttttcaggacaaacatatcttgatcgacacaaacacctagtatatgaacaatagagataaaaactattgttggttcaattagcaataagcaatattaaagttatcattacacgaataaagagattcaattaagagaaagtgtggaaccttaactaaggtaagtgtggggtttaagaaaatgataaagctaaagacttaaagttacttaaaatttaaaaaactatttatattttatttataaataatatataaataattataaaatttatatatctaattataagttcaatttcgatatttttgcagttgtctttttttagtaaaatcaaaatcaaactaaatagtattgatattcaaaattcaaagccaaatcaaacgaaatttcaagtttttaattagtttggttcgaattgtggtttggtatgattttttgtagccataaccgtaaacaatcatggttcggtttggatcgatattaattaaaaccttaaatttctaaacccaaaccaaatggaaaaaaaaaccatcggtttggtttggtctgattcggggttttgttctttttggtttaaattaaatttattgagagtatatgcatctcgatagatccaaaaacctattacatgaacaatccacatgaaagttattatgtattcaattaaacaatattatagttatcattatgcgaataaagtggttcaattaaaaaaatgtgactatctaatgtgacgcggggtagataaagataaagattttgattgacttggtcttagtcgggctaaaaattaagtgttgaatttaaaaaaaatgtgaaagttaagacttaagttaattaaattttatttatgagtaatatataaataataatgaaatttatatacaaaataatattgattttcaaaagagaggtcctaccgggagtcgaacccaggtcgctggattcaaagtccagagtgctaaccactacaccatagaaccaattgatactattcttcagctgaattgtatttagattatgtaagcactataa
atatataatattttttttctgtgatgaactccttaaaagtgagaaagagggtcatttggttggtgggtgagtaaacaagtaatttcattacagaaacaaaaatagaaataggagggctagggcattagtgaagaaggaaaggaaaggtaacaaaagttgatggggtccagactccagtatgtcagccaacgataatatgccacgtcagtttgaccatcctgtaaagtgaaaaatgggtcccctaccaaaccaaaccaaaccctttttacacactctctctttactaccctcatgcctttgtcggctcatttaaccaaaaaaaaaagcttcaccggaatctgagtgtttgtccggcaaggcacagaagtctctctgtttgctgctggagatctcgaggaagccccagagatgtatgtcgttcctcctccaaaacgacccgatccattgtctggatccgaggacttgcggatttaccagacatggaaaggaagcaatgtaagagcttatccagtttattcacacttgctatctcatctaaaacttaacatcatgtttgtcacattaacttttaccaacgattattttatcaggacaaactgaatcaggtctataaaaacgatttactatcttaaaaatagacaaaagttatgactccttccttactagaacttttattcttaaattgctgacagatatttttcttccaaggaaggttcatatttgggccagacgcaagatccctagcactgaccatatttctcatagtggcccctgtctcagttttctgtgtctttgttgcaagaaagctcatggatgatttttcaaatcacctggggatattgattatggttgtagtcattgtgttcacattctatgtaagttctctctccctacttctcattgattctaacatcctcatcacctgcactttatgacaacacaatacctctataaacttgcctttatttaatctcacggttactggaaacaaaattccgagtattagaaagtcaatcatctcaaagttgatcacatatcaacttggcactgcatctcaaggatgtaaggttggcttggttgtgtcaatagtcttccattcgacggttaagtggctctccttacaatgaataggattggcaacctaattgccttcctattttgtatgtctaatagtttcacatttcaatattacaggttttagttctactccttctcacatctggaagggatccaggaataattcctcgtaatgcacaccctccagaaccagaaggttatgatggtactgtggaaggtggtggacaaacccctcaattacgtttgcctcgcattaaagaagttgaggtcaatggcattaccgtcaagatcaaatactgtgacacctgcatgctttatagacctccccgctgttctcactgttcaatctgcaataactgcgttgaaagatttgaccatcactgcccttgggtagggcaatgcattgggctggtaatagttcttttttttcttgttctttttcgtaacatatttttagttagtactgccacactagtaactgatattctctcttttcagttttacttttcaatgcttgatatgtctgatgaactattgcatataatctgatatgaagagtaaccagacattttatcatgatcacactccttctctctgcattgaaacctcataaaaagagacttaacaaaagactgattcttttacttgtaaaaacgaataaaataggtaagaccgattctttctttgttatctttttaaaaaatttacctattcactctaatgttaattcgcgaaaacaagcaagttacataaaatgtctaaagaaaaattgttgtcttaatgaagcaatgctgtggatctgaagtattagatcctcaaagcatcttgaaatctagctctcaaagtttataattatgtaaaaccgttaatactaatttatgaatcagactcttttaactattgatgctgttggcagcaggagg
aagtttaaaaggtttgtaacttgagcatttagttcaaatggccatcatgaagacaattatatgttttagaaatatttagcactctagatctttgcagtggaactaaagttagagggtttagtacacatatccacatgcttaagttctcacatttcagctgaaatctttttgtccttgcgtacagtacttgctgtcgtgattctcagatttttgaagttatgaagtcaggctggtcaatttacctgttaatgtcacgtccacctttgatttcagaaatatctatagtctttgcttgcgctgataatttctgtaacaaatgcatcttttaaaacggatcatctagtgttgatgtcttggactctttccttatgttgcacagatttctgaaacttgtgtctttttattgcagcgaaactaccgttttttctttatgtttgtcttctctacaacacttctttgcatatatgtttttgggttctgctgggtctatattaagagaatcatggttctagatgacaccaccatatggaaagcaatgatcaaaacaccggcttccattgttctaatagcatacacttttatatcagtatggtttgttggaggtcttactgcttttcatctatacctcatcagtactaatcaggtatgttcgtggattgtaattttgtttttccctttgtctatctgaggtggtttggattttgatgtgaactgggagtgacgtctattttttgggtactgcagactacttatgagaattttagataccgatatgattggcgtgccaatccctacaacagaggagtgatgcagaatttcaaggagatattttgtactagtattcctccatccaagaacaatttccgtgcaaaggtgcccagggaacccaaggtggcaactcgatctgcaggtgggggttttgtgagtccaaacatggggaaggctgtggaggacatagaaatgggcaggaaagcggtttggagtgaagtaggggataacgaaggacaacttagcgacaatgatggcctgaacattaaagatggcatgttagggcaaatgtctcctgagataagaagtacagtagatgagagtgatcgtgcaggaatacatcctagaggatcaagctggggaaggaaaagtggaagctgggagatgtcacctgaagttcttgctttggcatcaagagtgggagaagctaatcgaacaggtgggagtagcagaccaacagatcaaaaaaagttgtgattaacagatagtatgaaaactggattgaatagattagtggttcttggaggtgtatggtatgtggtcattcagtgtcgtgtattatcagatgttggctttaggaagtgtgtgatatgaggggtggttttaattcctaaaacttgtattgtatgtgtggattagttagtgtaatacattagttttgctctgttcatgctaggcggtgcatttattctttgtgcttaaacaatgtgggcaagagtcccatataatatatatagtacgttgtaacagttgttattttacaagtaagaatctgggtttttgctgaatcacagtaaggggtggatttgagtaaaaatagagaaaatacacaagtcccgtgtaaccacccacaatagtatatatggctcgaaggaattcttttataatatgataacaccaactaaaaaggtagtccaattatatttcttcaaccataggttcttttcctggaacaaacaagcattcataatatttttcctttcgatcacagccatgagccaaccaaacttgctccaccatagttgttaccaacatctccctcgtgtcagcacaagttaagcctaagggactcctcaactttccatcctttaggatccattaaattaaaaaaaggaaatagtatgacccttcttgagaagtctttctgttgatgggtagagaacgttggcatttgtaaatggtttcttatcaaatgtgtgtgtggtttcatttaatacttcagatgagatatc
atgccctcaagattaaccctagatggtctccttcttacaattgtagatttattattcaaattatgaatgtctgagaacaatacaattgattcagatattatattcattaaaacaatacattacgacatataaaaataatacatcacaaccatcaattaccactttttatggagaaaagaaaaattaaaaagtttcaaataatatagtcatgggtaaatataaaacatcgtcactaaaaaatgagggagaatagtttccaccaacttttgcaaataggttaggtaaattaaaattaaaatattaatgttttatatatattaatattttgtatatattaaaaaatttcaaataatattaatgttccgcgtgtgcaaaatattttgtatatattaagagtgagttggtacgaaggaaaatattttctcgaaaatgttttccaattttctcatatttggttgatataaaagttttgtaaaatgtttttcaaatcaactcattttcctcgaaattaaggaaaatgacttctcttcaaaaattaaggaaaacatttttcaaaattctactccaatttcaaattgcattttttttttcgaaagacatcaattttaaaagaatattttcaattttaaaattttatgtgtttacccaatccctcccccaaccccctaaaatattaatttcattcataaatcaaacacacgaaaatattttctactcacctaccaaacatgaaaaaataaatttaaaatctacttatttttcaaaacacatgtatctacgctaggttggggggtctaaaattaatttaattaagaaaaacataatctttccaactttaagcaaattatgtaaatcagattgatggagaaaaaaattaccagtcatgatttgtcttaaataatcattaataagtacataaaaataaaagtaatacgacatctaaaaatattccaagttgccatattttttcttattacactctgttgcacttttcattttacatggcgctcaaaaaattataaataaaagaataaaatttttaaatttttttacaaataaatataaatatatttcaagaaataattgcaaaagtaatatatgacaaattctattatttattttaataaaataatggagacaattattttttataagaatatggaacggaattaataagagtagtggatggatccagctaaggtttttttaataaataaagggagaatagtggtcagcaaaattttgcaataaatagatgtgtacaacttccaaaatactcttcgttcagaacgttatccataacatgtcagcatctcgtggctacacatcatcttcttttattaataatatatcaccttgcctattgcaatattaattacaaattatgtttctcatcatcaaatttatattcaagggaacacatgtttcatacaccaatttcaatttattaaaatcagttcatcaatcatgactaatagttaatctttttaccaagttgaaaacaaattaaatatcatctagataactgcttgtaatatgagtgaatcattgcacgttttatatacctactactcgcacaaatctcgaaggaaaccaattagctgtaagctgactattacatttttttttatttttttaaaaaaaaaggaaaatcaaaatttttatttttaaacttgtgttagtaattgaaaaacttaccacatgtaagaagaaaaaatctttttaaagttgacccaatagtaaattgtaaaagaaaaaaagtaaaaattatggtctatagatatccttttcagaatagcatcatgtggtatatatttagcatattcttccttacaaatgcaatatcgaaatcaacttttaaaaagtaacgacgattaaggggagcgaatatggttactgaagggaaatttcgtacttttgcaagtcgaattattgcagcatgaagttaaatttgaaaaaaaaaatatatataataagagagttgagattttagaaggaat
ctataatgtgatgtattaattattaattaaaataaaggagtaaagaggaggacgcgtctgatgtacgtggGgagggagagaggaatatccaaaagcataacaccgattacggattgagaatattcctgatcccctctaacctcccataaataccattcttttcttacttgttgttgcatccaatccaatccaatcatcccgttttctcttctcttctcttcttctcatatatataatatatatactagtagtatatatatatataacaatacaacaacaatggctcttacagctgttcatgtttccgatgttcccaatctagatcaagtccctgacaaagctcctctatatgccacccgattctctcaaggtttcctttgtctttaatttctatactcatttattctctttctttctagacaatacatacataatctcttattttgctttgcaggcattgaaattggaagagcatccgaatttttagttgttggacacagagggaacgggatgaatttgttgcaatcggctgaccggagaatgaatgccctcaaagaaaattccattctttctttcaatgcagctgccaattacccaatcgattttattgaatttgacgttcaggtaattttctattttccgcgtatttccttttttctttctattccccccgcccaaaaaaaaaatcgacccgacccgacccgatttaaaaatattaacagaaaaaccttaaaccctagatcttggcccagtatctctctgaagatggattctacctttagattaccaaagtaaatgtcattttttgtttgtaggtgacaaaggatgattgccctgttatttttcacgacgatttcatcctcactcaacataataatgtaagctcttttaatatccaataacccctttctcttctaatcaaataaaatctcattctcatggtgcgtgtgcagggtacagtttatgaaaggagaattactgaattgtcacttgctgaatttcttagctatggaccccaaaaagaagagggtctcactggaaaacctttaatgaggaaaacgaaagatggaaagattgttagctggacagttgaaaccgatgattccgcatgtaccttaaaagaagcttttgagaaagtgaatccatctattggtttcaacatcgagctcaaatttgatgatcacattgtttatcaacaggactacctcatccatgctcttaaagcagtgttacatgtcgtattagagtatgctaaaggcagaccaatcatattctcaagtttccagcctgatgctgctctgcttgtcaagaagctccagacatgttaccctgtacgtacgtttccattttggaatctatcttaaacacaagtgaattcgatctgacatacttaccttttggcactgtgtaggtgttttttctcacaaatggaggtacagagatttactatgatgttcgaagaaactcgttggaagaggccactaaactgtgcttagagggtggtttggaaggtattgtttcggaggtgaaaggcatcttcaggaatccaggagtagttaacaagatcaaagagtccaagctgtctctgctgacatacggcaaattgaagtaagtagagtattgattcagttggacactaacgaactaacgaactaaccatttatttatttattatacgttcagtaatgtgcctgaagctgtgtatatgcaacacctgatgggaattgatggagtgatagtggattttgttgaacaagtaacagatgctgtgtgtaagctggtgaagaagccagatgagatattgctggaaggggaggaaaaggttcaaaatagacctcaattttcacagagggaattgtcttttctgctcaaacttatccccgaactgatacaacaataacaattcataaaatgattgtagatagcaaagcgtgtagaaatgtagattttcattgtatttgcaccctctctgtaatcatatcaaaatattttatgaattg
tcttggaattttagcaacatcaactcacttcaattaggttggttattaaacagtccaacccacattagttaccaaatgtacctgtggctgtgaatcatagttaaaattgacatgacctttcgagagaaagcaaagcaagtaacgtattaatctcaagtagaggggattcatttatgacagtgaatattaattaaacaataaccattagtcttcagttcccaattctatatttaaaagttccaatcttcaattcctaattgcttaaaaatatattacaacgttttgaagccaaagcaattagatacaattatcatcaccttgtgttatcgactctccacacagagtgatagacatgttcctcaaatacaatattctttccatttcatttattccgtcttacttcccttttgattgcaacttttcgcccagtacatttaagaccacaagattaaagaatatcttgatgcactttacatatctttaatttaaaagttttccttactttcttgaaactcaatatcaagttaaaacaaattaaagtatttttatttcgacagataatgatagcagaatatctaataggtaaactccaatatacatgttgtatattaattcatttcaaaataccagagaagtcgttgtttcttgttggactttgatggattaggatgtatgtatgttttggtccagtacttgtcaaggctatgagactcataagaaatagggcaaaatttctatttatatgctataacaaagtttgcataatttcgctccatagcaaacatatatgtgtataattcgttatacatatacaattgaaacgaattgtataaaacgagaaagagaaaaattatatacaatttgaatttgtataaaacgaaaaagagagaaagacaaaagaaatggtttatataagtgtatattgagaat
• What is genome annotation?
– Attaching biological information to genome sequence
(i) Structural annotation: genes, regulatory elements, etc.
(ii) Functional annotation: biochemical, phenotypic, etc.
– Creating annotations or features
– Based on known information
Why should we worry about genome annotations?
http://gmod.org/wiki/MAKER_Tutorial
Genome annotation pipelines
• The NCBI Eukaryotic Genome Annotation Pipeline: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
• Ensembl Gene Set: http://uswest.ensembl.org/info/genome/genebuild/genome_annotation.html
• JGI Plant Genomics Group Annotation Process: http://genome.jgi.doe.gov/programs/plants/Plant%20Genome%20Annotation%20Pipeline%20SOP.pdf
• Maker: http://www.yandell-lab.org/software/maker.html
BGI genome annotation pipeline
[Figure: flowchart. Boxes: Genome sequence; Repeat annotation; Gene annotation; ncRNA annotation; Gene set; Function annotation. Evidence labels: homolog, de novo, cDNA/EST, RNA-seq data, GLEAN set. ncRNA classes: miRNA/snRNA, rRNA, tRNA. Function annotation databases: UniProt, KEGG, InterPro. Each branch ends in a "Statistics results" box.]
Annotation classifications
• By source
– De novo
– Sequence alignment
– Model prediction
• By type
– Repeat elements
– mRNAs
– Non-coding RNAs
FINDING REPEATS
The start
Repeats 101
• Sequences that are present in the genome in a high copy number
• Duplication rate and structure depend on the type
• Main types
– Tandem repeat elements (TRF): segments of small sequences repeated in tandem
– Retrotransposons: self-duplicating elements that transpose using an intermediary RNA
– DNA transposons: repeat elements that don’t use an RNA intermediary
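As a rough illustration of what "small sequences repeated in tandem" means, the sketch below scans a sequence for a short unit repeated back to back. It is a toy written for this slide, not how TRF works; the function name, the unit length range (2 to 6 bp) and the minimum copy number are assumptions chosen for the example.

```python
import re

def find_tandem_repeats(seq, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Toy detector: a unit of 2-6 bases followed by at least
    (min_copies - 1) exact copies of itself. Real tools such as TRF
    also tolerate mismatches and indels, which this sketch does not.
    """
    pattern = re.compile(r"([acgt]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.lower()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

if __name__ == "__main__":
    # Stretch similar to the "attattatt..." run in the tomato region shown earlier.
    print(find_tandem_repeats("ggattattattattattgg"))
```

Only perfect repeats are reported here, so degenerate tandem repeats would be missed.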
Repeats
Make up a large portion of eukaryotic genomes
No longer considered junk
Repeat prediction
• Using pre-constructed repeat libraries
– Representative sequences of elements found in eukaryotic species (also prokaryotic organisms)
– Repbase (http://www.girinst.org/repbase/update/index.html)
– Dfam (http://www.dfam.org)
– Homology-based searches
Programs: RepeatRunner and RepeatMasker
http://www.repeatmasker.org/
http://www.yandell-lab.org/software/repeatrunner.html
Repeat prediction
• De novo analysis of the sequence
• Building custom repeat libraries
– Predict repetitive elements using the genomic sequence
– Based on sequence structure or repetitiveness
Programs:
– TRF for tandem repeats: http://tandem.bu.edu/trf/trf.html
– RepeatModeler or RepeatScout for general searches: http://www.repeatmasker.org/RepeatModeler.html and http://bix.ucsd.edu/repeatscout/
– MITE-hunter and LTRharvest for type-specific searches: http://target.iplantcollaborative.org/mite_hunter.html and http://www.zbh.uni-hamburg.de/?id=206 (ltrharvest)
WHERE ARE THE GENES
The hard part
The expected structure of a gene
Gene prediction
• Combination of methods that predicts the existence of a transcribed mRNA
• Based on supporting “evidence”
– EST sequences
– Assembled transcriptomes
– RNA-seq reads
– Protein alignments
– Ab initio predictions
– Repeat element distribution
“EST” evidence
• Bundle of ESTs, unigenes, assembled transcriptomes and RNA-seq reads
• Flags of potential transcription
• Aligned using splice-aware aligners
Programs: GMAP, GSNAP, TopHat, exonerate
http://research-pub.gene.com/gmap/
http://tophat.cbcb.umd.edu/
http://www.ebi.ac.uk/~guy/exonerate/
Protein alignments
• Using proteins from closely related species
• Evidence of expression of a particular domain or entire protein
• Also uses splice-aware aligners
Programs: Exonerate
http://www.ebi.ac.uk/~guy/exonerate/
Ab initio predictions
• Predictions using Hidden Markov Models (http://en.wikipedia.org/wiki/Hidden_Markov_model)
• Tailored for particular genomes
– Unless you are working with a model species, this will require construction of custom HMMs
• Search for mRNA-like sequences in the genome
• Similar to protein domain searches
Programs: Snap, augustus, genemark, fgenesh
– snap: http://korflab.ucdavis.edu/software.html
– augustus: http://augustus.gobics.de/
– genemark: http://opal.biology.gatech.edu/
– fgenesh: http://www.softberry.com/berry.phtml?topic=fgenesh&group=help&subgroup=gfind
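To give a feel for what "prediction using a Hidden Markov Model" involves, here is a minimal two-state Viterbi sketch that labels each base as "coding" or "noncoding". Everything in it is an assumption made for illustration: the two states, the GC-rich versus AT-rich emission probabilities and the transition probabilities are invented, and the gene finders listed above use far richer models (exons, introns, splice sites, codon structure) trained on real gene sets.

```python
import math

# Hypothetical toy model: numbers are invented, not trained on any genome.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"a": 0.35, "t": 0.35, "c": 0.15, "g": 0.15},
    "coding": {"a": 0.15, "t": 0.15, "c": 0.35, "g": 0.35},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty a/c/g/t sequence."""
    seq = seq.lower()
    # score[s]: best log probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

if __name__ == "__main__":
    labels = viterbi("aattatatta" + "gcgcggccgc")
    print("".join("C" if s == "coding" else "n" for s in labels))
```

Training a custom HMM, as the slide describes, amounts to estimating tables like `TRANS` and `EMIT` from known or predicted gene models of the species of interest.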
What’s a gene model
Combined evidence generates a gene model
Gene functional annotation
• Assignment of putative biological functions to mRNAs (“genes”)
• Searches for protein domains and similarity to reference protein databases
• GO annotations, KEGG enzymes, Pfam domains
Programs: Interproscan, KAAS
http://www.ebi.ac.uk/Tools/pfa/iprscan/
http://www.genome.jp/kegg/kaas/
THE OTHER RNA’S
ncRNA’s
tRNA, miRNA, snoRNA
Non coding RNA’s: tRNA’s
• The simplest of the ncRNA predictions
• Uses a combination of sequence homology and models for the secondary structure
Programs: tRNAscan-SE
http://lowelab.ucsc.edu/tRNAscan-SE/
All the others
Programs: Infernal, Snoscan and miRPredict
• Scan against other ncRNA databases, for snoRNAs and for miRNAs, respectively
• Custom ncRNA libraries are commonly required for comprehensive searches
http://infernal.janelia.org/
http://lowelab.ucsc.edu/snoscan/
http://sourceforge.net/projects/mirpredict/
ANNOTATING A GENOME WITH MAKER
Outline of the maker annotation pipeline
MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes
Brandi L. Cantarel, Ian Korf, Sofia M.C. Robb, Genis Parra, Eric Ross, Barry Moore, Carson Holt, Alejandro Sánchez Alvarado, and Mark Yandell
Genome Res. January 2008 18: 188-196; Published in Advance November 19, 2007, doi:10.1101/gr.6743907
http://www.yandell-lab.org/software/maker-p.html
http://gmod.org/wiki/MAKER_Tutorial
How to run Maker in iPlant
http://gmod.org/wiki/MAKER_Tutorial
https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+Atmosphere+Tutorial
https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+at+TACC+Lonestar+Guide
Maker Tips
• For troubleshooting and testing the files, it is better to use a small fasta file with a single sequence (e.g. a 500bp fragment of a chromosome/scaffold)
– Verify that the fasta files don’t cause any errors in maker
– Verify that the system doesn’t have any issues (missing programs)
• Maker MPI (parallelization method) doesn’t scale up to systems with a high number of cores; manual parallelization is required above 24 cores
– Split the genomic fasta file into smaller pieces
• Some scaffolds will report back FAILED; in most cases these are scaffolds with a high content of repeat elements
– These require an independent RepeatMasker run
• AED score (Annotation Edit Distance) => maker gene model quality score (the lower the better)
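The manual parallelization tip (splitting the genomic fasta file into smaller pieces) could be sketched as follows. This is one possible approach, not a maker utility: the function names, the output file naming and the "largest sequence first into the currently smallest batch" balancing rule are all assumptions made for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_records(records, n_parts):
    """Spread records over n_parts batches of roughly equal total length."""
    batches = [[] for _ in range(n_parts)]
    sizes = [0] * n_parts
    # Place the longest sequences first, each into the smallest batch so far.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))
        batches[i].append((header, seq))
        sizes[i] += len(seq)
    return batches

def write_batches(batches, prefix):
    """Write each batch to '<prefix>.partNNN.fasta' (hypothetical naming)."""
    paths = []
    for i, batch in enumerate(batches):
        path = f"{prefix}.part{i:03d}.fasta"
        with open(path, "w") as out:
            for header, seq in batch:
                out.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths

if __name__ == "__main__":
    demo = [">scaf1", "ACGTACGT", ">scaf2", "GG", ">scaf3", "TTTTTT"]
    for batch in split_records(list(read_fasta(demo)), 2):
        print([header for header, _ in batch])
```

Each resulting file can then be given to its own maker run, and the outputs merged afterwards. Sequences are never cut, so a single very large scaffold still ends up in one piece.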
Maker exe file
#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/home/sreyesch/MichelmoreBin/bin/RepeatMasker #location of RepeatMasker executable
exonerate=/home/sreyesch/MichelmoreBin/bin/exonerate #location of exonerate executable

#-----Ab-initio Gene Prediction Algorithms
snap=/home/sreyesch/MichelmoreBin/bin/snap #location of snap executable
gmhmme3=/home/sreyesch/MichelmoreBin/bin/gmhmme3 #location of eukaryotic genemark executable
gmhmmp=/home/sreyesch/MichelmoreBin/bin/gmhmmp #location of prokaryotic genemark executable
augustus=/home/sreyesch/MichelmoreBin/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable

#-----Other Algorithms
fathom=/home/sreyesch/MichelmoreBin/bin/fathom #location of fathom executable (experimental)
probuild=/home/sreyesch/MichelmoreBin/bin/probuild #location of probuild executable (required for genemark)

Location of all required programs for maker to run
99.9999% of the time it doesn’t NEED modification
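One way to act on the earlier tip about missing programs is to read this file and check that every configured path really points to an executable. The sketch below assumes only the `key=value #comment` layout visible on the slide; `parse_ctl` and `missing_executables` are hypothetical helper names, not part of maker.

```python
import os

def parse_ctl(text):
    """Parse control-file text of the form 'key=value #comment' into a dict.

    Everything after the first '#' on a line is treated as a comment,
    so section headers such as '#-----Other Algorithms' are skipped.
    """
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options

def missing_executables(options):
    """Return keys whose path is set but is not an executable file."""
    return sorted(
        key for key, path in options.items()
        if path and not (os.path.isfile(path) and os.access(path, os.X_OK))
    )

if __name__ == "__main__":
    example = (
        "#-----Ab-initio Gene Prediction Algorithms\n"
        "snap=/no/such/dir/snap #location of snap executable\n"
        "fgenesh= #location of fgenesh executable\n"
    )
    opts = parse_ctl(example)
    print(opts)
    print("missing:", missing_executables(opts))
```

Entries left empty (such as `fgenesh=` above) are treated as "not used" rather than as errors.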
Maker bopts file
#-----BLAST and Exonerate Statistics Thresholds
blast_type=wublast #set to 'wublast' or 'ncbi'

pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff

pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff

pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff

eva_pcov_blastn=0.8 #EVALUATOR Blastn Percent Coverage Threshold EST-Genome Alignments
eva_pid_blastn=0.85 #EVALUATOR Blastn Percent Identity Threshold EST-Genome Alignments
eva_eval_blastn=1e-10 #EVALUATOR Blastn eval cutoff
eva_bit_blastn=40 #EVALUATOR Blastn bit cutoff

ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold

Maker blast options (blast opts)
Specific options that maker will use to perform the blast alignments
They don’t require modification
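To make the four blastn thresholds concrete, the sketch below applies them to a few made-up alignment hits. It assumes the most natural reading of the comments (coverage, identity and bit score are minimums, the E-value is a maximum); how maker combines them internally is not shown on the slide, and the `Hit` class and `passes` function are hypothetical.

```python
from dataclasses import dataclass

# Values copied from the blastn block of the bopts file shown above.
BLASTN = {"pcov": 0.8, "pid": 0.85, "eval": 1e-10, "bit": 40}

@dataclass
class Hit:
    coverage: float   # fraction of the query covered by the alignment
    identity: float   # fraction of identical aligned positions
    evalue: float
    bit_score: float

def passes(hit, thresholds):
    """Assumed rule: keep a hit only if it meets every threshold."""
    return (hit.coverage >= thresholds["pcov"]
            and hit.identity >= thresholds["pid"]
            and hit.evalue <= thresholds["eval"]
            and hit.bit_score >= thresholds["bit"])

if __name__ == "__main__":
    hits = [
        Hit(0.95, 0.98, 1e-50, 200.0),  # strong EST alignment
        Hit(0.50, 0.98, 1e-50, 200.0),  # covers only half of the EST
        Hit(0.95, 0.98, 1e-5, 200.0),   # E-value above the cutoff
    ]
    for hit in hits:
        print(hit, "->", "keep" if passes(hit, BLASTN) else "drop")
```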
Maker opts.file
• File that contains our chosen inputs for maker to run
• We need to modify it
– The safest way to edit the file is on the command line
– Mac users should use their preferred command line editor
– Windows users could use WordPad to edit the file
– On Ubuntu, the default text editor works well
• Didn’t fit on the screen: http://www.yandell-lab.org/maker/Berkeley_Qi/maker_opts.ctl
Generalized annotation pipeline with maker for non-model species
[Figure: flowchart. Genomic sequence → (1) tRNA prediction → (2) repeat library construction → (3) decision "Enough EST?": Yes → maker 1st iteration without ab initio predictions; No → maker 1st iteration with HMM for a closer species → (4) training of ab initio predictors → (5) maker 2nd iteration with custom HMMs → (6) mRNA functional annotation and (7) other ncRNA prediction → uncurated genome annotation.]
Generalized annotation pipeline with maker for non-model species
1. Perform tRNA prediction with tRNAscan
2. Construct custom repeat library
3. Initial maker prediction
   1. If enough “EST” data is available (representative of most genes and at high depth), use maker without any ab initio prediction
   2. If not enough “EST” data is available, search for a closely related HMM that can be used for prediction
4. Train ab initio predictors using the predicted gene models from the initial maker run
5. Perform the second maker iteration using all the information from the first run, but using the HMMs generated in step 4
6. Do a draft functional annotation of the mRNAs predicted in the second maker iteration
7. Perform other ncRNA prediction
8. Gather all the results from the second maker iteration, tRNA and other ncRNA predictions, and you have your uncurated genome annotation
HOW DO WE STORE THE ANNOTATIONS
Generic Feature Format (GFF)
• Standardized format for storing genomic annotation (features)
• Has conventions for a wide variety of features
• Current version is gff3
• http://www.sequenceontology.org/gff3.shtml
##gff-version 3
##sequence-region ctg123 1 1497228
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123	.	mRNA	1300	9000	.	+	.	ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123	.	exon	1300	1500	.	+	.	ID=exon00001;Parent=mRNA00003
ctg123	.	exon	1050	1500	.	+	.	ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123	.	exon	3000	3902	.	+	.	ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123	.	exon	5000	5500	.	+	.	ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	exon	7000	9000	.	+	.	ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123	.	CDS	3301	3902	.	+	0	ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123	.	CDS	3391	3902	.	+	0	ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
Column descriptions
Column 1: "seqid" The ID of the landmark used to establish the coordinate system for the current feature.
Column 2: "source" The source is a free text qualifier intended to describe the algorithm or operating procedure that generated this feature.
Column 3: "type" The type of the feature. Predefined terms used to characterize the feature.
Columns 4 & 5: "start" and "end" The start and end coordinates of the feature on the reference sequence.
Column 6: "score" The score of the feature. Varies greatly depending on the source.
Column 7: "strand" The strand of the feature. + for positive strand, - for minus strand, and . for features that are not stranded.
Column 8: "phase" For features of type "CDS", indicates the phase in which the CDS is read.
Column 9: "attributes" A list of feature attributes in the format tag=value separated by semicolons.
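The nine columns map naturally onto a small parser. The sketch below is a minimal reader for a single GFF3 feature line, assuming tab-separated columns, "." for missing values, comma-separated multiple values in an attribute and URL-style escaping, as in the GFF3 specification. The function name and the shape of the returned dictionary are choices made for this example.

```python
from urllib.parse import unquote

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # An attribute may hold several comma-separated values (e.g. Parent).
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),   # 1-based, inclusive
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }

if __name__ == "__main__":
    exon = parse_gff3_line(
        "ctg123\t.\texon\t1050\t1500\t.\t+\t.\t"
        "ID=exon00002;Parent=mRNA00001,mRNA00002"
    )
    print(exon["type"], exon["start"], exon["end"], exon["attributes"]["Parent"])
```

Lines starting with `#` (directives and comments) should be skipped before calling the parser. Following the `Parent` values upward is how the gene → mRNA → exon/CDS hierarchy of the example is rebuilt.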