Genome Annotation
Karan Veer Singh,
Scientist.
NBAGR, Karnal,
India
• The genome contains all the biological information required to build and maintain a living organism.
• The genome records the organism's molecular history.
• Decoding the biological information encoded in these molecules will have an enormous impact on our understanding of biology.
The Genome
1. Structural genomics: genetic and physical mapping of genomes.
2. Functional genomics: analysis of gene function (and of non-genic regions).
3. Comparative genomics: comparison of genomes across species. Includes structural and functional genomics, and evolutionary genomics.
Genomics
The Human Genome Project promised to revolutionise medicine and explain every base of our DNA.
Large medical genetics focus:
• Identify variation in the genome that is disease-causing
• Determine how individual genes play a role in health and disease
Human Genome Project
Human Genome Project & Functional Genome
It cost about 3 billion dollars and took 13 years to complete (1990 to 2003, 2 years less than initially predicted).
• Approx. 200 Mb still in progress
– Heterochromatin
– Repetitive
Genomics & Genome annotation
The first genome annotation software system was designed in 1995 by Dr. Owen White at The Institute for Genomic Research, the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae.
Genome analysis involves assembling the reads into contigs, then assembling the contigs either against a reference genome (reference assembly) or de novo (de novo assembly) to obtain the complete genome.
Variations such as mutations, SNPs and InDels can then be identified.
The genome is then annotated by structural and functional annotation.
Finally, the whole genome is displayed as a map image that is easy to interpret.
Sequence to Annotation
Input 1 to Genome Viewer: Variant Annotation
Input 2 to Genome Viewer: Structural Annotation
Structural Annotation: AUGUSTUS (version 2.5.5)
Input 3 to Genome Viewer: Functional Annotation
Genome Annotation
The process of identifying the locations of genes and coding regions in a genome and determining what those genes do.
Finding the structural elements of the genome and attaching the related function to each genomic location.
Genome Annotation
Gene structure prediction
Identifying elements (introns/exons, CDS, start, stop) in the genome
Gene function prediction
Attaching biological information to these elements, e.g. which protein an exon codes for
Structural annotation
Structural annotation: identification of genomic elements
• Open reading frames and their localisation
• Gene structure
• Coding regions
• Location of regulatory motifs
Functional annotation
Functional annotation: attaching biological information to genomic elements
• Biochemical function
• Biological function
• Involved regulation
Genome annotation - workflow
Genome sequence
Repeats
Structural annotation-Gene finding
Protein-coding genes, ncRNAs (tRNA, rRNA), introns
Functional annotation
View in Genome viewer
Masked or un-masked genome sequence
Genome Repeats & features
Percentage of repetitive sequences in different organisms
Genome               Genome size (Mb)   % Repeat
Aedes aegypti        1,300              ~70
Anopheles gambiae    260                ~30
Culex pipiens        540                ~50
Microsatellite, minisatellite, tandem repeat, short tandem repeat (STR), simple sequence repeat (SSR)
Polymorphic between individuals/populations
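The idea of a short tandem repeat can be made concrete with a minimal Python sketch. It is an illustration only, not the method used by any repeat-finding tool: the function name `find_ssrs` and its thresholds (units of 1 to 6 bases, at least 5 copies) are assumptions chosen for the example, and only perfect repeats are reported.

```python
import re

def find_ssrs(seq, min_unit=1, max_unit=6, min_repeats=5):
    """Return perfect tandem repeats as (start, end, unit, copies).

    Coordinates are 0-based with an exclusive end. The thresholds are
    illustrative assumptions, not values taken from the slides.
    """
    seq = seq.upper()
    # Lazy quantifier: try the shortest repeat unit first, then require
    # at least (min_repeats - 1) further copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits

if __name__ == "__main__":
    example = "TTG" + "CA" * 6 + "GTT"  # hypothetical sequence
    for start, end, unit, copies in find_ssrs(example):
        print(start, end, unit, copies)
```

Because such loci are polymorphic, the copy number reported for the same locus would be expected to differ between individuals.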
Finding repeats as a preliminary to gene prediction
Repeat discovery
Homology based approaches
Use RepeatMasker to search the genome and mask the sequence
Masked sequence
A repeat-masked sequence is an artificial construct in which the regions thought to be repetitive are replaced with X's.
Masking is widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) on the final annotation set.
>my sequence
atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
Positions/locations are not affected by masking
Types of Masking- Hard or Soft?
Sometimes we want to mark up repetitive sequence without excluding it from downstream analyses. This is achieved with a format known as soft-masking, in which the repeats are written in lower case.
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
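The relationship between the two masking styles can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: unmasked bases are upper case and soft-masked repeats are lower case (as in the soft-masked example above), the mask character defaults to X as on the slides, and the function names are hypothetical.

```python
def hard_mask(soft_masked, mask_char="X"):
    """Convert a soft-masked sequence (repeats in lower case) into a
    hard-masked one by replacing every lower-case base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in soft_masked)

def masked_fraction(soft_masked):
    """Fraction of the sequence that is soft-masked (lower case)."""
    if not soft_masked:
        return 0.0
    return sum(base.islower() for base in soft_masked) / len(soft_masked)

if __name__ == "__main__":
    soft = "ACGTACGTggcttctctagACGTACGT"  # hypothetical soft-masked sequence
    hard = hard_mask(soft)
    print(hard)
    # Masking substitutes characters one for one, so coordinates are preserved.
    print(len(soft) == len(hard))
    print(round(100 * masked_fraction(soft), 1), "% masked")
```

Hard-masking discards the repeat sequence, whereas the soft-masked form can always be converted to the hard-masked form but not the other way round, which is one reason to keep the soft-masked file.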
Structural annotation
Identification of genomic elements
• Open reading frames and their localization
• Coding regions
• Location of regulatory motifs
• Start/stop
• Splice sites
• Non-coding regions/RNAs
• Introns
Methods
Similarity
• Similarity between sequences, which does not necessarily imply any evolutionary linkage
Ab initio prediction
• Prediction of gene structure from first principles using only the genome sequence
Genefinding
Two approaches: ab initio and similarity
ab initio prediction
[Figure: signals scanned along the genome: coding potential, ATG & stop codons, splice sites]
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
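The simplest of these signals, ATG and stop codons, can be illustrated with a naive open reading frame scan. The sketch below is a teaching example under stated assumptions: it ignores introns, splice sites and coding potential, so it is far simpler than what the gene finders listed above do, and the names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=30):
    """Return ORFs (ATG ... stop) in all six frames as (strand, start, end).

    Coordinates are 0-based, end-exclusive, and always given on the
    forward strand. The length includes the start and stop codons.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((strand, start, end))
                        else:
                            # map reverse-strand positions back to the forward strand
                            orfs.append((strand, n - end, n - start))
                    start = None
    return orfs

if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"  # hypothetical sequence
    print(find_orfs(toy, min_codons=3))
```

A scan like this works reasonably for intron-free prokaryotic genes; for eukaryotes the splice-site and coding-potential signals are what allow exons to be joined into a gene model.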
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Problem: alignments are fuzzy around splice sites, which makes exon boundaries hard to place
Examples: EST2Genome, exonerate, genewise, Augustus, Prodigal
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on conservation of exon sequences
Examples: Twinscan and SLAM
Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
tRNAscan: uses an HMM and a covariance model to predict tRNA genes
Rfam: a suite of HMMs trained against a large number of different RNA genes
Gene-finding omissions
Alternative isoforms
• Currently there is no good method for predicting alternative isoforms
• Only created where supporting transcript evidence is present
Pseudogenes
• Each genome project has a fuzzy definition of pseudogenes
• Badly curated/described across the board
Promoters
• Rarely a priority for a genome project
• Some algorithms exist but are usually not integrated into an annotation set
Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)
~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta >structural_annotation.gff
Prokaryotes – PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
Structural Annotation - output
Structural annotation conducted using AUGUSTUS (version 2.5.5), with Magnaporthe_grisea as the genome model
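A quick sanity check on a structural annotation file is to count how many features of each type it contains. The sketch below assumes the standard nine-column, tab-separated GFF3 layout, in which the third column is the feature type; the function name and the example records are hypothetical and are not real AUGUSTUS output.

```python
from collections import Counter

def count_gff3_features(lines):
    """Count GFF3 records per feature type (column 3).

    Comment/directive lines (starting with '#'), blank lines and lines
    without exactly nine tab-separated columns are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    example = [
        "##gff-version 3",
        "contig1\tAUGUSTUS\tgene\t100\t900\t0.9\t+\t.\tID=g1",
        "contig1\tAUGUSTUS\tCDS\t100\t400\t0.9\t+\t0\tParent=g1.t1",
        "contig1\tAUGUSTUS\tCDS\t600\t900\t0.9\t+\t0\tParent=g1.t1",
    ]
    for feature, n in sorted(count_gff3_features(example).items()):
        print(feature, n)
```

To use it on the file produced above, the lines could be read with `open("structural_annotation.gff")` and passed to the function directly.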
Functional annotation
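[Figure: from genome to function. Gene in the genome (ATG ... STOP) -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, poly(A) tail AAAn) -> translation -> polypeptide -> protein folding -> folded protein -> functional activity (e.g. enzyme activity). Functional annotation aims to find this function.]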
Functional annotation
Attaching biological information to genomic elements
• Biochemical function
• Biological function
• Involved regulation and interactions
• Expression
• Use the known structural annotation to predict the protein sequence
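Going from a structurally annotated coding sequence to a predicted protein is a codon-by-codon translation. The sketch below is a minimal illustration that assumes the standard genetic code (NCBI translation table 1) and an already spliced, in-frame CDS on the forward strand; the function name is hypothetical, and organisms that use another table would need a different amino-acid string.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds):
    """Translate an in-frame CDS; '*' marks a stop codon and 'X' marks a
    codon containing anything other than A, C, G or T."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)

if __name__ == "__main__":
    print(translate_cds("ATGGCCTGGTAA"))  # hypothetical CDS
```

The resulting protein sequence is what would then be searched against protein databases in the homology-based step.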
Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against non-redundant protein databases (NCBI, SwissProt) to find similar sequences
Visually assess the top 5-10 hits to identify whether these have been assigned a function
Functions are assigned
Functional annotation - Other features
Other features which can be determined
• Signal peptides
• Transmembrane domains
• Low complexity regions
• Various binding sites, glycosylation sites etc.
• Protein domains
• Secretome
See http://expasy.org/tools/ for a good list of possible prediction algorithms
Functional annotation - Other features (Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO)
• Cellular component
• Molecular function
• Biological process
Practical - FUNCTIONAL ANNOTATION
Homology Based Method
• Set up a BLAST database (nucleotide/protein)
• BLAST genome.fasta against it for annotations (nucleotide/protein)
• Filter the BLAST hits by E-value (<= 0.01) for nucleotide/protein
• Assign functions
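The filtering and assignment steps can be sketched in Python, assuming the BLAST results were saved in the 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the choice of "highest bit score wins" and the example hits are assumptions made for illustration; in practice the top hits would still be inspected before a function is transferred.

```python
def best_hits(lines, max_evalue=0.01):
    """Keep, for each query, the highest-scoring hit with E-value <= max_evalue.

    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit not significant enough
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    hits = [
        "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
        "g1\tP2\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-5\t60",
        "g2\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
    ]
    for query, (subject, evalue, bitscore) in best_hits(hits).items():
        print(query, subject, evalue, bitscore)
```

Queries with no hit below the threshold (g2 in this example) are left without an assignment and would typically be reported as hypothetical proteins.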
Functional annotation- output
August 2008 Bioinformatics tools for Comparative Genomics of Vectors
Conclusion
Annotation accuracy depends on the supporting data available at the time of annotation, so annotations need to be updated
Gene predictions will change over time as new data become available (e.g. at NCBI) that are more similar to the genome than the data used previously
Functional assignments will change over time as new data become available (e.g. characterization of hypothetical proteins)
Genome Viewer
Files that can be visualised:
Annotation files
Indel files
Consensus sequence
Genome View
Short Read track
Thank You