What does the genome say? Sebastian Reyes, Genome Center, UC Davis


What does the genome say?

Sebastian Reyes
Genome Center, UC Davis


Outline

• What’s genome annotation? Why do we need it?
• What’s an annotation pipeline?
• Repeat prediction
• mRNA prediction
• ncRNA prediction
• Annotating my genome
• What are the maker .ctl files?
• What’s a GFF file?


AGGTCCTACCGGGAGTCGAACCCAGGTCGCTGGATTCAAAGTCCAGAGTGCTAACCACTACACCATAGAACC

Tomato Glycine tRNA


>SL2.40ch02:49495500..49512399aggtagatcgatatgtaatctgcatatttattcaggaaaaagacatgaattgccccctcaacttgtaccaaaaagtcttttacacaattttattaatggggcaacctattacacacattaagtacccctttcgtcacccaacccaaaacattatgcatagagtgagagtcactcgcctctagcctctaactgaataaattgatgcctcgacccgattcatttgatccaaacccgtttattttcacctaggttcatttttttcaatccaaatttagtttattttaccctagtccatttatttcacgcacccatttcgtttcgctaagaacaccaagctctcgagggagagagaatactaatgaggcgttcatatgtgtgtaacaaaataacacttataacgagcgtaatttatgtgtctgatttaatcaacaaagttatttttcaatatttttgttctttatcttaatttatgtgaccctttttttaaaatttatattaaaaaatatacctattgaaaatatactttattttgaacttcttcttttttttcttagtccttttacaggtatacatagtaacgtgtcaattgtcattctctatatttgaatatcaatttcaatttctctattttataccttttttcattttgatttttttaattatttttctaaataatcatcatgtaatataaaaatttatgtatttcaaaatgtcgatgtaatatgaatttcataaagtaaaataagataaatattacatcatagttaaagaaaaactctattttcaagacggttatgttgctcctataaaactaaagtttgggattatatataaaacttaaaagagaaagaaaagtttaattaattgagtactgtataggtaccactacaaacatgaaagaacaacccaaaaaaaagaagaaaaagaaagttacaacacaaaaaagaaacgttattgatcatcaactgatctctctctctagataataataatataataaatatatggattacaatttcttcctcccccatggcgtagtatttccatcgaatagctcatcaatgtctgtttcaatttcttcaggttggtactttacatacttctccgtctgccttcccggagtaaatgtttgactaaaccttcccaagaatgacctgtacgctggtataaccatgtttgatatggaaactctcagctccgattgcaattgctcgtcgctaatcacccaactgctctgcgtcttgtgtatctcatcaaacaatgcattgaaactcttgaacctctcttttaatattggtttattcactttcccattcacattcaatccctcgtggtttaaacattgcaacaatttaccccacgtttctctttggtaattcttgtggtactgtcttagatctgaagatcgctttctgtaccattgatcccccatcaggctattcatctctggagatccttttattttttgcaagatgtatcgtccgttattcatcatgaatattgaactgagtgaagtgtctttgtaaagccttgattttccctccagatttgaatccaatagatccatcacttttatcatatgcgtctcaaacggtgatggtttattcgatgacgtattattattattatgattattattattattattattattattactactctgacccgcttgtggattttgacaatcaaaatccgatcccgtggctgagtcagctctttctatcatttgatgttctctgaatacctgctctaacgtgtctctgtactcccctacatatttcagataattcatgatgtaccgagttaaaggatgaacagcaccacctggaaccgcggtcttgtttgagtccccctggattgagttttccagctccgaaaagatagaaaccatggattcccctaagcggccccgactcaatgtagcttcagccttgagttcatcagcgtaagaaactggaaatatctta
tccaccaaagggataaaatcccgcaacgtctcataaatgtcgagaaacttgaagagtttttcagcagcccgttttgtcatggaaacggcttcagcgaaattaaggagttggatcatcatacctcgagacaaattgctgaagatggtctcagaaattgatggttgatcctcaaacaccgcatccgccagcttgcgttcactggagaagagcacattagtgcagtgcctgaatgttgcaatccaagcagtaacctctctctccaatggttcccaattcatcttctgcacatcctcgatgctatatttttcaaagccaagcttatgcaagctttcctccaaagctttcctccgtgcgatgaaatagacctgacagcattctgcttcatagcctcctgcaattaaggctttggaaaatttattcaatgtggctacgatttcctcagagtagcctggaaatttattttcttcagaaggtttagattcagaagcagtgtcttgttctgcatcttcattggtgtctgatgatgaagaattagggttaggcttggcagaagaagtgtctaaattggtgatatctgaatcagtattgatcttgtaatcgtagaggattgacttgtattcttcctctatgtaggacattgctcgttgaagaacaccatcgacgcggctgattgaataagcatatttgtactctgaagaaaatcgacagagggaggtaaatagcttggagattcgatctacaatatttaaaaatgatgtggcttcctcctgagctagctggctccatttgacgggtgcatcaccgccatcatattcatcaattttcgcttcaacaagaacagcaaattgttctacaaacacaggaacatcgggaggcttagattcatcatcatcccctttaaagtttgatgattcagagataaattgatctatttcttctgaaaccttatcgagatcaggaggcaggggagaagaaacttcaaccttgacctcatcgtcgtcttcaggtttgatatcgtctgtcttggcaccatcatcctctgttgtgttgatgactctatcgtctgtttgcagttgttcaacaatttcttcttctgtagactttgtttcatgggcaggaggagtttcatctggctgtttgtcattggattttgtttgatcagaagaagtttgtgttagatccatgattgtaattgtgtgagtgagtaggtaagtttagcaggaaaattgagcttacattatcaaaaagagtaaggaaagtgttgtgattggagatgcagcgagcaaatcgacgagaaaggaaatgtgtgggatcacattaattagttagttacctttaacgtccactcttgtccacgtctaatattctcaatgccataaacttgcacgaatcaaggacggcctctcccactcactactcaccatctccataaccatctcatccaaaatatattactacttccattttagtatgtcaacataaaattgacttaattttttaatttttgagattctcaatcttttagcttatataacatcaaacttgtggtcactgtgttatgaaacaaatgattatgggaaatattgaaaaagcaaaataaagttgtagttataacattttagtaatttgtccaatattttttaaaaaaaaattataatgattttatcaaaattatttatgagatgacataaattagggatagggaagaagaaaggaggaggaaggtagaggaaacaagtgatcctattctccatgtatgtatttaagtgtagacaggcaggcccgacccatccgatccaatgtgaattattaaccctaggtaggtattggtaaagcttgagatgcgaatgagaatgtgatcaatgggtaaaccaactacccctcatcttcctccaggttccaacctacatactttattaataaatgttttgggtttttactcccaagctcaggattatgtttgcatacatctgctgcctgacattctttttgctacatgc
tctcacgaaacttctattatttatctaggttatcaacttcaatcacattagtagtattaaaggaagcttgaatttgttatgcatccccaattacatgcaaatattgctaatcatttttcctatttttaattactctgcaccttccagataatgctatcaaacatagctcactgctggagcccacacagcaagtactgatatcaaaacatgattctcctccgagaagcaaccttccaactttttcttaccgggatattgcaaccgccacaaacaatttcaggcgacaatccattattggggaaggtggctttgggccagtattcaaagggaaacttaacacgaatcaggtgattttcactaaatgaagtgtgttttggtcatttcaaatttatcatccttgaatttctataaccattcaggttgtggctgttaagaagctaaatcattccggtcttcaaggggataaggagttctttgtagaggttcacatgctctcactgatgcggcaccctaacctggttaatctaattggttactgctctgaaggagagcagcgacttcttatctatgaattcatgcctctaggatctttggaatatcatctccatggtaatgctcattcactattctcaccttagaaataaattcttatttgtcaacctgcaccaatctgttatgcttcagagctctacattcatgcattatttgctataaacttttggatcgtctttgatagcttaactcatttagagcccgacctaataaaataaatagcttttaagtttttaagtcaaaaaaataaaaagtaggagagacctactcttctcttttttaaaagtttattttaagttatttttaaccttgtcaaacactttcagaagctaaaagtgacttcaaagtaagtttgatcaaacttttaagtcccttgaagatatgaagttactattagtatcattctttaatactcattcacttctattttcattatttcttttttctttatcgaggaagtcttcacttctttaaggctgcactgccacatttgatggatatcaactgccctctcttattattctattaaccatgtattaaatttgttgatcctttttgtggttgtctgccaaagttgttactatgtgtgggaacttttttttctgtagatttttttaaaaattgtacatgcatcagaggatgcaactaatatattcattttccctctaaagtatgaaaaatgtgttgattaatagtataatgtgcagatattacgcctgacatgaagccattggattgggataccaggatggtaatagcatctggtgccgccaaaggcctggagtaccttcacaaccacgctgatcgtcctgttatctacagggaccttaagtcggcaaacatattattgggtgagggtttccatgctaaactttctgactttggtcttgcaaagtttggcccaattgcagacaacacacatgtttcaactcgggttatgggcactcatggatattgtgcacctgagtatgctggaacaggaaaactgactatgaaatccgacatctatagttttggtgtgctcttgttggagctaattactggatgtagagcaatggatgactctcacgaacatggaaaagaaatgcttgttgactgggtaattacttcacatttaattgattgaagtcattatacttggcaagctggaaaatgaaccaactttgagacacaaatataagtttttgtttcactattgagtttaaaaatgtttgtgtgtccttttttgactcttgaagcataaagttttttctctgagtgatgcatagttctatcaacgtgtcaaaggtattttggttgaaggtctctctctgccttggaggaggcccaaagttaccgaaaaatgagaatgactgatctatcaagtgatatgcacagttctgaccattgaaatgtatcatagtttgatttttttttctttttctgaatgagaagtgtaggctcctgtt
tgtgcgcttccttttctccctgctaacagtgaagtttctcaggcacgtcctatgttaaaagaccgcatgaactatgtacagttagcggatccaatgttaagaggcaaatttccacaatctgtcttccgcagggtagtagaactggtcttaatgtgtgttcaggatgatccccatgctcgacctcacatgaaagacatcgtgcttgctttgagttactttgcatcccaaaagcatgattcacctgcagctcagattggatctcatgggggagaagggacaaatggaagctctgttgattttgatggagctcagatggatataacagaaataagagcttcaaacaaagatcaagagcgggagagagctgttgcagaggccaagaagtggggcgagacctggagagagaaagggaaacagaatgcggatgatgatttagattataaatcaaggtggtgattgattgtaagttagtttccttgtcacagaacagcatttttattcaaatttttgaaacagtgtcgtatgtatatactcataaaaaagaaattattggcattattgatgtatttgtatgcttgacaaaataaattcaatacaactactatatggcatttatcattaacaagtactactaattaggttttgttcaagtactaaccaaacaaacaagaatgttaaattaagattaaaatatataagcaaccacatggattaaggattcaagagtaataatgtaagaattgtaccgtacagagtttagtaatgtaggtatgattcttcttatttcactttttatccctcaaaatatcataaataaattcatgtattcggttaagatcggttattggtaaaaattaaaatcaaattaatctagttggttttctaaattactaaaaccaaaccaaacggataaaataaattggtttggttcaatttttctattttttttttcagtttgaaagtaatacatttttcaggacaaacatatcttgatcgacacaaacacctagtatatgaacaatagagataaaaactattgttggttcaattagcaataagcaatattaaagttatcattacacgaataaagagattcaattaagagaaagtgtggaaccttaactaaggtaagtgtggggtttaagaaaatgataaagctaaagacttaaagttacttaaaatttaaaaaactatttatattttatttataaataatatataaataattataaaatttatatatctaattataagttcaatttcgatatttttgcagttgtctttttttagtaaaatcaaaatcaaactaaatagtattgatattcaaaattcaaagccaaatcaaacgaaatttcaagtttttaattagtttggttcgaattgtggtttggtatgattttttgtagccataaccgtaaacaatcatggttcggtttggatcgatattaattaaaaccttaaatttctaaacccaaaccaaatggaaaaaaaaaccatcggtttggtttggtctgattcggggttttgttctttttggtttaaattaaatttattgagagtatatgcatctcgatagatccaaaaacctattacatgaacaatccacatgaaagttattatgtattcaattaaacaatattatagttatcattatgcgaataaagtggttcaattaaaaaaatgtgactatctaatgtgacgcggggtagataaagataaagattttgattgacttggtcttagtcgggctaaaaattaagtgttgaatttaaaaaaaatgtgaaagttaagacttaagttaattaaattttatttatgagtaatatataaataataatgaaatttatatacaaaataatattgattttcaaaagagaggtcctaccgggagtcgaacccaggtcgctggattcaaagtccagagtgctaaccactacaccatagaaccaattgatactattcttcagctgaattgtatttagattatgtaagcactataa
atatataatattttttttctgtgatgaactccttaaaagtgagaaagagggtcatttggttggtgggtgagtaaacaagtaatttcattacagaaacaaaaatagaaataggagggctagggcattagtgaagaaggaaaggaaaggtaacaaaagttgatggggtccagactccagtatgtcagccaacgataatatgccacgtcagtttgaccatcctgtaaagtgaaaaatgggtcccctaccaaaccaaaccaaaccctttttacacactctctctttactaccctcatgcctttgtcggctcatttaaccaaaaaaaaaagcttcaccggaatctgagtgtttgtccggcaaggcacagaagtctctctgtttgctgctggagatctcgaggaagccccagagatgtatgtcgttcctcctccaaaacgacccgatccattgtctggatccgaggacttgcggatttaccagacatggaaaggaagcaatgtaagagcttatccagtttattcacacttgctatctcatctaaaacttaacatcatgtttgtcacattaacttttaccaacgattattttatcaggacaaactgaatcaggtctataaaaacgatttactatcttaaaaatagacaaaagttatgactccttccttactagaacttttattcttaaattgctgacagatatttttcttccaaggaaggttcatatttgggccagacgcaagatccctagcactgaccatatttctcatagtggcccctgtctcagttttctgtgtctttgttgcaagaaagctcatggatgatttttcaaatcacctggggatattgattatggttgtagtcattgtgttcacattctatgtaagttctctctccctacttctcattgattctaacatcctcatcacctgcactttatgacaacacaatacctctataaacttgcctttatttaatctcacggttactggaaacaaaattccgagtattagaaagtcaatcatctcaaagttgatcacatatcaacttggcactgcatctcaaggatgtaaggttggcttggttgtgtcaatagtcttccattcgacggttaagtggctctccttacaatgaataggattggcaacctaattgccttcctattttgtatgtctaatagtttcacatttcaatattacaggttttagttctactccttctcacatctggaagggatccaggaataattcctcgtaatgcacaccctccagaaccagaaggttatgatggtactgtggaaggtggtggacaaacccctcaattacgtttgcctcgcattaaagaagttgaggtcaatggcattaccgtcaagatcaaatactgtgacacctgcatgctttatagacctccccgctgttctcactgttcaatctgcaataactgcgttgaaagatttgaccatcactgcccttgggtagggcaatgcattgggctggtaatagttcttttttttcttgttctttttcgtaacatatttttagttagtactgccacactagtaactgatattctctcttttcagttttacttttcaatgcttgatatgtctgatgaactattgcatataatctgatatgaagagtaaccagacattttatcatgatcacactccttctctctgcattgaaacctcataaaaagagacttaacaaaagactgattcttttacttgtaaaaacgaataaaataggtaagaccgattctttctttgttatctttttaaaaaatttacctattcactctaatgttaattcgcgaaaacaagcaagttacataaaatgtctaaagaaaaattgttgtcttaatgaagcaatgctgtggatctgaagtattagatcctcaaagcatcttgaaatctagctctcaaagtttataattatgtaaaaccgttaatactaatttatgaatcagactcttttaactattgatgctgttggcagcaggagg
aagtttaaaaggtttgtaacttgagcatttagttcaaatggccatcatgaagacaattatatgttttagaaatatttagcactctagatctttgcagtggaactaaagttagagggtttagtacacatatccacatgcttaagttctcacatttcagctgaaatctttttgtccttgcgtacagtacttgctgtcgtgattctcagatttttgaagttatgaagtcaggctggtcaatttacctgttaatgtcacgtccacctttgatttcagaaatatctatagtctttgcttgcgctgataatttctgtaacaaatgcatcttttaaaacggatcatctagtgttgatgtcttggactctttccttatgttgcacagatttctgaaacttgtgtctttttattgcagcgaaactaccgttttttctttatgtttgtcttctctacaacacttctttgcatatatgtttttgggttctgctgggtctatattaagagaatcatggttctagatgacaccaccatatggaaagcaatgatcaaaacaccggcttccattgttctaatagcatacacttttatatcagtatggtttgttggaggtcttactgcttttcatctatacctcatcagtactaatcaggtatgttcgtggattgtaattttgtttttccctttgtctatctgaggtggtttggattttgatgtgaactgggagtgacgtctattttttgggtactgcagactacttatgagaattttagataccgatatgattggcgtgccaatccctacaacagaggagtgatgcagaatttcaaggagatattttgtactagtattcctccatccaagaacaatttccgtgcaaaggtgcccagggaacccaaggtggcaactcgatctgcaggtgggggttttgtgagtccaaacatggggaaggctgtggaggacatagaaatgggcaggaaagcggtttggagtgaagtaggggataacgaaggacaacttagcgacaatgatggcctgaacattaaagatggcatgttagggcaaatgtctcctgagataagaagtacagtagatgagagtgatcgtgcaggaatacatcctagaggatcaagctggggaaggaaaagtggaagctgggagatgtcacctgaagttcttgctttggcatcaagagtgggagaagctaatcgaacaggtgggagtagcagaccaacagatcaaaaaaagttgtgattaacagatagtatgaaaactggattgaatagattagtggttcttggaggtgtatggtatgtggtcattcagtgtcgtgtattatcagatgttggctttaggaagtgtgtgatatgaggggtggttttaattcctaaaacttgtattgtatgtgtggattagttagtgtaatacattagttttgctctgttcatgctaggcggtgcatttattctttgtgcttaaacaatgtgggcaagagtcccatataatatatatagtacgttgtaacagttgttattttacaagtaagaatctgggtttttgctgaatcacagtaaggggtggatttgagtaaaaatagagaaaatacacaagtcccgtgtaaccacccacaatagtatatatggctcgaaggaattcttttataatatgataacaccaactaaaaaggtagtccaattatatttcttcaaccataggttcttttcctggaacaaacaagcattcataatatttttcctttcgatcacagccatgagccaaccaaacttgctccaccatagttgttaccaacatctccctcgtgtcagcacaagttaagcctaagggactcctcaactttccatcctttaggatccattaaattaaaaaaaggaaatagtatgacccttcttgagaagtctttctgttgatgggtagagaacgttggcatttgtaaatggtttcttatcaaatgtgtgtgtggtttcatttaatacttcagatgagatatc
atgccctcaagattaaccctagatggtctccttcttacaattgtagatttattattcaaattatgaatgtctgagaacaatacaattgattcagatattatattcattaaaacaatacattacgacatataaaaataatacatcacaaccatcaattaccactttttatggagaaaagaaaaattaaaaagtttcaaataatatagtcatgggtaaatataaaacatcgtcactaaaaaatgagggagaatagtttccaccaacttttgcaaataggttaggtaaattaaaattaaaatattaatgttttatatatattaatattttgtatatattaaaaaatttcaaataatattaatgttccgcgtgtgcaaaatattttgtatatattaagagtgagttggtacgaaggaaaatattttctcgaaaatgttttccaattttctcatatttggttgatataaaagttttgtaaaatgtttttcaaatcaactcattttcctcgaaattaaggaaaatgacttctcttcaaaaattaaggaaaacatttttcaaaattctactccaatttcaaattgcattttttttttcgaaagacatcaattttaaaagaatattttcaattttaaaattttatgtgtttacccaatccctcccccaaccccctaaaatattaatttcattcataaatcaaacacacgaaaatattttctactcacctaccaaacatgaaaaaataaatttaaaatctacttatttttcaaaacacatgtatctacgctaggttggggggtctaaaattaatttaattaagaaaaacataatctttccaactttaagcaaattatgtaaatcagattgatggagaaaaaaattaccagtcatgatttgtcttaaataatcattaataagtacataaaaataaaagtaatacgacatctaaaaatattccaagttgccatattttttcttattacactctgttgcacttttcattttacatggcgctcaaaaaattataaataaaagaataaaatttttaaatttttttacaaataaatataaatatatttcaagaaataattgcaaaagtaatatatgacaaattctattatttattttaataaaataatggagacaattattttttataagaatatggaacggaattaataagagtagtggatggatccagctaaggtttttttaataaataaagggagaatagtggtcagcaaaattttgcaataaatagatgtgtacaacttccaaaatactcttcgttcagaacgttatccataacatgtcagcatctcgtggctacacatcatcttcttttattaataatatatcaccttgcctattgcaatattaattacaaattatgtttctcatcatcaaatttatattcaagggaacacatgtttcatacaccaatttcaatttattaaaatcagttcatcaatcatgactaatagttaatctttttaccaagttgaaaacaaattaaatatcatctagataactgcttgtaatatgagtgaatcattgcacgttttatatacctactactcgcacaaatctcgaaggaaaccaattagctgtaagctgactattacatttttttttatttttttaaaaaaaaaggaaaatcaaaatttttatttttaaacttgtgttagtaattgaaaaacttaccacatgtaagaagaaaaaatctttttaaagttgacccaatagtaaattgtaaaagaaaaaaagtaaaaattatggtctatagatatccttttcagaatagcatcatgtggtatatatttagcatattcttccttacaaatgcaatatcgaaatcaacttttaaaaagtaacgacgattaaggggagcgaatatggttactgaagggaaatttcgtacttttgcaagtcgaattattgcagcatgaagttaaatttgaaaaaaaaaatatatataataagagagttgagattttagaaggaat
ctataatgtgatgtattaattattaattaaaataaaggagtaaagaggaggacgcgtctgatgtacgtggGgagggagagaggaatatccaaaagcataacaccgattacggattgagaatattcctgatcccctctaacctcccataaataccattcttttcttacttgttgttgcatccaatccaatccaatcatcccgttttctcttctcttctcttcttctcatatatataatatatatactagtagtatatatatatataacaatacaacaacaatggctcttacagctgttcatgtttccgatgttcccaatctagatcaagtccctgacaaagctcctctatatgccacccgattctctcaaggtttcctttgtctttaatttctatactcatttattctctttctttctagacaatacatacataatctcttattttgctttgcaggcattgaaattggaagagcatccgaatttttagttgttggacacagagggaacgggatgaatttgttgcaatcggctgaccggagaatgaatgccctcaaagaaaattccattctttctttcaatgcagctgccaattacccaatcgattttattgaatttgacgttcaggtaattttctattttccgcgtatttccttttttctttctattccccccgcccaaaaaaaaaatcgacccgacccgacccgatttaaaaatattaacagaaaaaccttaaaccctagatcttggcccagtatctctctgaagatggattctacctttagattaccaaagtaaatgtcattttttgtttgtaggtgacaaaggatgattgccctgttatttttcacgacgatttcatcctcactcaacataataatgtaagctcttttaatatccaataacccctttctcttctaatcaaataaaatctcattctcatggtgcgtgtgcagggtacagtttatgaaaggagaattactgaattgtcacttgctgaatttcttagctatggaccccaaaaagaagagggtctcactggaaaacctttaatgaggaaaacgaaagatggaaagattgttagctggacagttgaaaccgatgattccgcatgtaccttaaaagaagcttttgagaaagtgaatccatctattggtttcaacatcgagctcaaatttgatgatcacattgtttatcaacaggactacctcatccatgctcttaaagcagtgttacatgtcgtattagagtatgctaaaggcagaccaatcatattctcaagtttccagcctgatgctgctctgcttgtcaagaagctccagacatgttaccctgtacgtacgtttccattttggaatctatcttaaacacaagtgaattcgatctgacatacttaccttttggcactgtgtaggtgttttttctcacaaatggaggtacagagatttactatgatgttcgaagaaactcgttggaagaggccactaaactgtgcttagagggtggtttggaaggtattgtttcggaggtgaaaggcatcttcaggaatccaggagtagttaacaagatcaaagagtccaagctgtctctgctgacatacggcaaattgaagtaagtagagtattgattcagttggacactaacgaactaacgaactaaccatttatttatttattatacgttcagtaatgtgcctgaagctgtgtatatgcaacacctgatgggaattgatggagtgatagtggattttgttgaacaagtaacagatgctgtgtgtaagctggtgaagaagccagatgagatattgctggaaggggaggaaaaggttcaaaatagacctcaattttcacagagggaattgtcttttctgctcaaacttatccccgaactgatacaacaataacaattcataaaatgattgtagatagcaaagcgtgtagaaatgtagattttcattgtatttgcaccctctctgtaatcatatcaaaatattttatgaattg
tcttggaattttagcaacatcaactcacttcaattaggttggttattaaacagtccaacccacattagttaccaaatgtacctgtggctgtgaatcatagttaaaattgacatgacctttcgagagaaagcaaagcaagtaacgtattaatctcaagtagaggggattcatttatgacagtgaatattaattaaacaataaccattagtcttcagttcccaattctatatttaaaagttccaatcttcaattcctaattgcttaaaaatatattacaacgttttgaagccaaagcaattagatacaattatcatcaccttgtgttatcgactctccacacagagtgatagacatgttcctcaaatacaatattctttccatttcatttattccgtcttacttcccttttgattgcaacttttcgcccagtacatttaagaccacaagattaaagaatatcttgatgcactttacatatctttaatttaaaagttttccttactttcttgaaactcaatatcaagttaaaacaaattaaagtatttttatttcgacagataatgatagcagaatatctaataggtaaactccaatatacatgttgtatattaattcatttcaaaataccagagaagtcgttgtttcttgttggactttgatggattaggatgtatgtatgttttggtccagtacttgtcaaggctatgagactcataagaaatagggcaaaatttctatttatatgctataacaaagtttgcataatttcgctccatagcaaacatatatgtgtataattcgttatacatatacaattgaaacgaattgtataaaacgagaaagagaaaaattatatacaatttgaatttgtataaaacgaaaaagagagaaagacaaaagaaatggtttatataagtgtatattgagaat


• What is genome annotation?
– Attaching biological information to genome sequence
(i) Structural annotation: genes, regulatory elements, etc.
(ii) Functional annotation: biochemical, phenotypic, etc.
– Creating annotations or features
– Based on known information


Why should we worry about genome annotations?

http://gmod.org/wiki/MAKER_Tutorial


BGI genome annotation pipeline

[Figure: flowchart of the BGI pipeline. The genome sequence feeds three branches: repeat annotation (de novo, homolog), gene annotation (homolog, de novo, cDNA/EST; a GLEAN set and RNA-seq data leading to the gene set, followed by function annotation with UniProt, KEGG and InterPro) and ncRNA annotation (miRNA/snRNA, rRNA, tRNA). Each branch ends in statistics results.]


Annotation classifications

• By source
– De novo
– Sequence alignment
– Model prediction

• By type
– Repeat elements
– mRNAs
– Non-coding RNAs


FINDING REPEATS
The start


Repeats 101

• Sequences that are present in the genome in a high copy number
• Duplication rate and structure depend on the type
• Main types
– Tandem repeat elements (TRF): segments of small sequences repeated in tandem
– Retrotransposons: self-duplicating elements that transpose using an RNA intermediary
– DNA transposons: repeat elements that don’t use an RNA intermediary
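To make "small sequences repeated in tandem" concrete, here is a minimal Python sketch that scans for exact tandem repeats with a regular expression. It is only an illustration: the function name and its default thresholds are assumptions made for this example, it is not the algorithm used by TRF, and it ignores mismatches and indels.

```python
import re


def find_tandem_repeats(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for each run of a short unit repeated in tandem.

    Naive exact-match scan for illustration only (hypothetical helper,
    not the TRF algorithm).
    """
    seq = seq.lower()
    # A unit of min_unit..max_unit bases (shortest first), followed by
    # at least min_copies - 1 further exact copies of that unit.
    pattern = re.compile(
        r"([acgt]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits
```

Start positions are 0-based. The tomato region shown earlier contains runs such as "attattattatt...", which a scan like this would report as a tandem repeat of the unit "att".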


Repeats

Make up a large portion of eukaryotic genomes

Not junk anymore


Repeat prediction

• Using pre-constructed repeat libraries
– Representative sequences of elements found in eukaryotic species (also prokaryotic organisms)
– Repbase (http://www.girinst.org/repbase/update/index.html)
– Dfam (http://www.dfam.org)
– Homology-based searches

Programs: RepeatRunner and RepeatMasker

http://www.repeatmasker.org/
http://www.yandell-lab.org/software/repeatrunner.html


Repeat prediction

• De novo analysis of the sequence
• Building custom repeat libraries
– Predict repetitive elements using the genomic sequence
– Based on sequence structure or repetitiveness

Programs:
TRF for tandem repeats
RepeatModeler or RepeatScout for general searches
MITE-Hunter and LTRharvest for type-specific searches

http://tandem.bu.edu/trf/trf.html
http://www.repeatmasker.org/RepeatModeler.html
http://bix.ucsd.edu/repeatscout/
http://target.iplantcollaborative.org/mite_hunter.html
http://www.zbh.uni-hamburg.de/?id=206 => ltrharvest


WHERE ARE THE GENES?
The hard part


What is expected of a gene


Gene prediction

• Combination of methods that predict the existence of a transcribed mRNA
• Based on supporting “evidence”
– EST sequences
– Assembled transcriptomes
– RNA-seq reads
– Protein alignments
– Ab initio predictions
– Repeat element distribution


“EST” evidence

• Bundle of ESTs, unigenes, assembled transcriptomes and RNA-seq reads
• Flags of potential transcription
• Aligned using splice-aware aligners

Programs: GMAP, GSNAP, TopHat, exonerate

http://research-pub.gene.com/gmap/

http://tophat.cbcb.umd.edu/

http://www.ebi.ac.uk/~guy/exonerate/


Protein alignments

• Using proteins from closely related species
• Evidence of expression of a particular domain or entire protein
• Also uses splice-aware aligners

Programs: Exonerate

http://www.ebi.ac.uk/~guy/exonerate/


Ab initio predictions

• Predictions using Hidden Markov Models (http://en.wikipedia.org/wiki/Hidden_Markov_model)
• Tailored for particular genomes
– Unless you are working with a model species, this will require construction of custom HMMs
• Search for mRNA-like sequences in the genome
• Similar to protein domain searches

Programs: SNAP, Augustus, GeneMark, FGENESH

http://korflab.ucdavis.edu/software.html => snap
http://augustus.gobics.de/
http://opal.biology.gatech.edu/ => genemark
http://www.softberry.com/berry.phtml?topic=fgenesh&group=help&subgroup=gfind
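To give a feel for how an HMM labels a sequence, here is a toy two-state model decoded with the Viterbi algorithm in Python. Everything in it is an assumption made for illustration: the two states, the probabilities (a GC-rich "coding" state and an AT-rich "noncoding" state) and the function name. Real gene finders such as the programs above use far richer models (exons, introns, splice sites) trained on the genome of interest.

```python
import math

# Toy parameters, invented for this example only.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"a": 0.4, "c": 0.1, "g": 0.1, "t": 0.4},
    "coding": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM above."""
    seq = seq.lower()
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointer[s] = best_prev
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path
```

Training a custom HMM amounts to estimating tables like `TRANS` and `EMIT` from known gene models of the species, which is why a model tuned for one genome can perform poorly on another.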


What’s a gene model

Combined evidence generates a gene model


Gene functional annotation

• Assignment of putative biological functions to mRNAs (“genes”)
• Searches for protein domains and similarity to reference protein databases
• GO annotations, KEGG enzymes, Pfam domains

Programs: InterProScan, KAAS

http://www.ebi.ac.uk/Tools/pfa/iprscan/

http://www.genome.jp/kegg/kaas/


THE OTHER RNAs


ncRNAs

tRNA, miRNA, snoRNA


Non-coding RNAs: tRNAs

• The simplest of the ncRNA predictions
• Uses a combination of sequence homology and models for the secondary structure

Programs: tRNAscan-SE

http://lowelab.ucsc.edu/tRNAscan-SE/


All the others

Programs: Infernal, Snoscan and miRPredict

• Scan for other ncRNAs (using ncRNA databases), snoRNAs and miRNAs, respectively
• Custom ncRNA libraries are commonly required for comprehensive searches

http://infernal.janelia.org/

http://lowelab.ucsc.edu/snoscan/

http://sourceforge.net/projects/mirpredict/


ANNOTATING A GENOME WITH MAKER


Outline of the maker annotation pipeline

MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Brandi L. Cantarel, Ian Korf, Sofia M.C. Robb, Genis Parra, Eric Ross, Barry Moore, Carson Holt, Alejandro Sánchez Alvarado, and Mark Yandell. Genome Res. January 2008 18: 188-196; Published in Advance November 19, 2007, doi:10.1101/gr.6743907

http://www.yandell-lab.org/software/maker-p.html
http://gmod.org/wiki/MAKER_Tutorial


Maker Tips

• For troubleshooting and testing the files, it is better to use a small FASTA file with a single sequence (e.g. a 500 bp fragment of a chromosome/scaffold)
– Verify that the FASTA files don’t cause any errors in maker
– Verify that the system doesn’t have any issues (missing programs)
• Maker MPI (parallelization method) doesn’t scale up to systems with a high number of cores; manual parallelization is required above 24 cores
– Split the genomic FASTA file into smaller pieces
• Some scaffolds will report back FAILED; in most cases these are scaffolds with a high content of repeat elements
– These require an independent RepeatMasker run
• AED (Annotation Edit Distance) score => maker gene model quality score (the lower the better)
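The tip about splitting the genomic FASTA file for manual parallelization could be sketched in Python as follows. The function names and the file names in the usage comment are hypothetical, and the balancing strategy (largest scaffold first, always into the currently smallest batch) is one simple choice among many, not something prescribed by maker.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts):
    """Distribute records over n_parts batches, balancing total sequence length."""
    batches = [[] for _ in range(n_parts)]
    totals = [0] * n_parts
    # Place the largest sequences first, each into the currently smallest batch.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = totals.index(min(totals))
        batches[i].append((header, seq))
        totals[i] += len(seq)
    return batches


def write_fasta(records, path, width=60):
    """Write (header, sequence) pairs to path, wrapping sequence lines at width."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")


# Hypothetical usage; "genome.fasta" and the output names are placeholders:
# with open("genome.fasta") as handle:
#     batches = split_records(list(read_fasta(handle)), n_parts=8)
# for n, batch in enumerate(batches):
#     write_fasta(batch, f"genome.part{n}.fasta")
```

Each output file can then be annotated in its own maker run and the resulting annotations combined afterwards.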


Maker exe file (maker_exe.ctl)

#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/home/sreyesch/MichelmoreBin/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/home/sreyesch/MichelmoreBin/bin/RepeatMasker #location of RepeatMasker executable
exonerate=/home/sreyesch/MichelmoreBin/bin/exonerate #location of exonerate executable

#-----Ab-initio Gene Prediction Algorithms
snap=/home/sreyesch/MichelmoreBin/bin/snap #location of snap executable
gmhmme3=/home/sreyesch/MichelmoreBin/bin/gmhmme3 #location of eukaryotic genemark executable
gmhmmp=/home/sreyesch/MichelmoreBin/bin/gmhmmp #location of prokaryotic genemark executable
augustus=/home/sreyesch/MichelmoreBin/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable

#-----Other Algorithms
fathom=/home/sreyesch/MichelmoreBin/bin/fathom #location of fathom executable (experimental)
probuild=/home/sreyesch/MichelmoreBin/bin/probuild #location of probuild executable (required for genemark)

Location of all the programs maker requires to run
99.9999% of the time it doesn’t NEED modification


Maker bopts file (maker_bopts.ctl)

#-----BLAST and Exonerate Statistics Thresholds
blast_type=wublast #set to 'wublast' or 'ncbi'

pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff

pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff

pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff

eva_pcov_blastn=0.8 #EVALUATOR Blastn Percent Coverage Threshold EST-Genome Alignments
eva_pid_blastn=0.85 #EVALUATOR Blastn Percent Identity Threshold EST-Genome Alignments
eva_eval_blastn=1e-10 #EVALUATOR Blastn eval cutoff
eva_bit_blastn=40 #EVALUATOR Blastn bit cutoff

ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold

Maker blast options (blast opts)
Specific options that maker will use to perform the blast alignments
They don’t require modification
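Both control files shown so far use the same layout, one `key=value #comment` entry per line. As a rough sketch, such a file could be read into a Python dictionary like this, for instance to check which executables are left empty. The function name is hypothetical, and the parser assumes that values never contain a `#` character.

```python
def parse_ctl(text):
    """Parse 'key=value #comment' lines into a dict of strings.

    Assumes values never contain '#'; section header lines such as
    '#-----Other Algorithms' are skipped.
    """
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and section headers
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# Small sample taken from the files above.
sample = """#-----BLAST and Exonerate Statistics Thresholds
blast_type=wublast #set to 'wublast' or 'ncbi'
pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
formatdb= #location of NCBI formatdb executable
"""
options = parse_ctl(sample)
unset = [key for key, value in options.items() if not value]
```

All values come back as strings, so numeric thresholds need an explicit conversion such as `float(options["pcov_blastn"])`.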


Maker opts file (maker_opts.ctl)

• File that contains the inputs we define for maker to run
• We need to modify it
– The safest way to edit the file is on the command line
– Mac users should use their preferred command-line editor
– Windows users could use WordPad to edit the file
– On Ubuntu, the default text editor works well
• Didn’t fit on the screen

http://www.yandell-lab.org/maker/Berkeley_Qi/maker_opts.ctl


Generalized annotation pipeline with maker for non-model species

[Figure: flowchart with numbered steps 1 to 7. Genomic sequence → tRNA prediction → repeat library construction → "Enough EST?" (Yes: maker 1st iteration without ab initio predictions; No: maker 1st iteration with HMM for a closer species) → training of ab initio predictors → maker 2nd iteration with custom HMMs → mRNA functional annotation and other ncRNA prediction → uncurated genome annotation.]


Generalized annotation pipeline with maker for non-model species

1. Perform tRNA prediction with tRNAscan-SE
2. Construct a custom repeat library
3. Initial maker prediction
   1. If enough “EST” data is available (representative of most genes and at high depth), use maker without any ab initio prediction
   2. If not enough “EST” data is available, search for an HMM from a closely related species that can be used for prediction
4. Train ab initio predictors using the predicted gene models from the initial maker run
5. Perform the second maker iteration using all the information from the first run, but using the HMMs generated in step 4
6. Do a draft functional annotation of the mRNAs predicted in the second maker iteration
7. Perform other ncRNA prediction
8. Gather all the results from the second maker iteration, tRNA and other ncRNA predictions, and you have your uncurated genome annotation
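The only branch in the steps above is the choice made for the first maker iteration. A small Python sketch of that decision, with a hypothetical function name and step labels paraphrased from the list, makes the ordering explicit:

```python
def annotation_plan(enough_est):
    """Return the ordered steps of the generalized pipeline as a list of strings.

    enough_est: True when the "EST" evidence covers most genes at high depth.
    """
    if enough_est:
        first_pass = "maker, 1st iteration: evidence only, no ab initio predictions"
    else:
        first_pass = "maker, 1st iteration: HMM from a closely related species"
    return [
        "tRNA prediction",
        "custom repeat library construction",
        first_pass,
        "train ab initio predictors on the 1st-iteration gene models",
        "maker, 2nd iteration: custom HMMs",
        "draft functional annotation of predicted mRNAs",
        "other ncRNA prediction",
        "merge maker, tRNA and other ncRNA results",
    ]
```

Whichever branch is taken, both paths converge again at the training step, so the second iteration always runs with HMMs trained on the genome being annotated.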


HOW DO WE STORE THE ANNOTATIONS


Generic Feature Format (GFF)

• Standardized format for storing genomic annotations (features)
• Has conventions for a wide variety of features
• Current version is GFF3
• http://www.sequenceontology.org/gff3.shtml


##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123 . exon 3000 3902 . + . ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123 . exon 5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon 7000 9000 . + . ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4


Column descriptions

Column 1: "seqid" The ID of the landmark used to establish the coordinate system for the current feature.
Column 2: "source" The source is a free-text qualifier intended to describe the algorithm or operating procedure that generated this feature.
Column 3: "type" The type of the feature. Predefined terms are used to characterize the feature.
Columns 4 & 5: "start" and "end" The start and end coordinates of the feature on the reference sequence.
Column 6: "score" The score of the feature. Varies widely depending on the source.
Column 7: "strand" The strand of the feature. + for positive strand, - for minus strand, and . for features that are not stranded.
Column 8: "phase" For features of type "CDS", indicates the reading phase: the number of bases (0, 1 or 2) to skip from the start of the feature to reach the first base of the next codon.
Column 9: "attributes" A list of feature attributes in the format tag=value, separated by semicolons.
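The nine columns map naturally onto a small parser. The Python sketch below is a minimal example, not a full GFF3 implementation: the function name is hypothetical, it expects tab-separated columns as in a real GFF3 file, and it does not validate types, strands or phases.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a dict.

    Returns None for blank lines, comments and '##' directives.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    feature = dict(zip(COLUMNS, fields[:8]))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    if fields[8] != ".":
        for pair in fields[8].split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            # Multiple values (e.g. several Parents) are comma-separated.
            attributes[tag] = [unquote(v) for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


example = "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00002;Parent=mRNA00001,mRNA00002"
exon = parse_gff3_line(example)
exon_length = exon["end"] - exon["start"] + 1
```

GFF3 coordinates are 1-based and include both ends, which is why the length is `end - start + 1`. The `Parent` attribute is what links an exon or CDS back to its mRNA, and the mRNA back to its gene.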