Gene & genome organisation Computational gene identification

Source: bio.lundberg.gu.se/courses/bio2/geneid.pdf


Gene & genome organisation

Computational gene identification


Eubacterial gene


Eukaryotic gene

Figure: organisation of a eukaryotic gene on the DNA. Labelled features: regulatory elements, promoter, transcription start, translation start, exons, introns, translation stop, polyA signal, transcription stop.


Promoters in eukaryotic DNA

1. TATA box

2. Initiators: 5’ Y Y A(+1) N [T/A] Y Y Y 3’ (Y = pyrimidine, N = any nucleotide)

3. CpG islands
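The initiator consensus listed above can be searched for with a simple pattern match. The following is a minimal sketch, not a promoter predictor: it assumes Y stands for a pyrimidine (C or T) and N for any base, and the function name `find_initiators` is made up for illustration. Because the motif is short and degenerate, matches occur frequently by chance in any long sequence.

```python
import re

# Initiator consensus from the slide: 5' Y Y A(+1) N [T/A] Y Y Y 3'
# Assumption: Y = C or T, N = any of A, C, G, T.
# The lookahead lets overlapping matches be reported as well.
INR_PATTERN = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT][CT]))")

def find_initiators(seq):
    """Return (0-based index of the +1 A, matched motif) for every hit."""
    seq = seq.upper()
    return [(m.start() + 2, m.group(1)) for m in INR_PATTERN.finditer(seq)]

# Hypothetical test sequence containing one initiator-like motif
print(find_initiators("GGGCTCAGTCTTGGG"))
```

The reported index points at the A that the consensus marks as +1, that is, the putative transcription start.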


EMBOSS CpGPlot
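CpG island detection is usually based on two quantities computed in a sliding window: the GC fraction and the ratio of observed to expected CpG dinucleotides. The sketch below illustrates that idea only; it is not the EMBOSS implementation. It assumes the commonly used definition expected = (number of C × number of G) / window length, and the window size and function names are illustrative choices.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")          # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio

def scan_cpg(seq, window=100, step=1):
    """Slide a window along seq and return (start, GC fraction, obs/exp) tuples."""
    return [(i, *cpg_stats(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

# Hypothetical toy input: a CpG-rich stretch followed by an AT-rich stretch
for start, gc, ratio in scan_cpg("CGCGCGCGCG" + "ATATATATAT", window=10, step=10):
    print(start, round(gc, 2), round(ratio, 2))
```

Windows in which both values stay high over a long stretch are candidate CpG islands; the exact thresholds are a parameter choice and are not given on the slide.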


Figure: from gene to protein. The primary RNA transcript receives a polyA tail and is spliced; the spliced RNA is then translated into protein.


Regulated splicing

Primary RNA transcript

Exons may be combined differently during splicing. One gene can in this way give rise to multiple forms of a protein.

Splice variants


Figure: alternative splicing of the α-tropomyosin pre-mRNA. Transcripts from nonmuscle, smooth muscle, striated muscle, striated muscle', hepatoma and brain cells each combine a different subset of exons 1 to 13.


Organism                 Genome     No. of genes   Genes / Mb

Homo sapiens             3000 Mb    ~40,000?       ~13

Mycoplasma genitalium    0.6 Mb     ~600           ~1000

Higher eukaryote genomes contain a substantial amount of non-coding sequences


Repetitive DNA: ~50 % of human genomic DNA

Mobile elements:
- viral retrotransposons, common in yeast & Drosophila
- non-viral retrotransposons, common in mammals: LINES and SINES

LINES (long interspersed elements): 6-7 kb

LINE1: 600,000 copies in the human genome = 15 % of genomic DNA


SINES (short interspersed elements): ~300 bp

Sequence conservation ~80 % within the same species

Alu sequences are the most abundant class of SINE: 1 million copies = 10 % of genomic DNA

Many Alu sequences have cleavage sites for the restriction enzyme AluI (AGCT), hence the name

Originally derived from SRP RNA by reverse transcription


            Repetitive sequences   Genome size

Mammals     35 - 45 %              ~3 Gb

Fugu        < 15 %                 ~365 Mb


Why so many repetitive elements in higher mammals?

Mobile elements probably had a significant influence on the evolution of higher organisms:

Novel genes and new controls on gene expression were created because mobile elements have served as sites for recombination, leading to gene duplications and other gene rearrangements (exon shuffling).


Detection of repetitive DNA

Dotplot analysis
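In a dotplot a sequence is compared with itself (or with a second sequence) and a dot is placed wherever two positions match; repeats then show up as diagonal lines away from the main diagonal. The sketch below is one simple way to compute this, assuming exact matching of short words. The word length, the toy sequence and the function names are illustrative assumptions and do not come from the slide.

```python
def dotplot(seq_a, seq_b, word=3):
    """Return the set of (i, j) with seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots

def render(dots, len_a, len_b, word):
    """Draw the dot matrix as text: '*' for a word match, '.' otherwise."""
    rows = []
    for i in range(len_a - word + 1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for j in range(len_b - word + 1)))
    return "\n".join(rows)

# Hypothetical sequence: the 8-mer ACGTTGCA repeated twice
seq = "ACGTTGCAACGTTGCA"
print(render(dotplot(seq, seq, word=4), len(seq), len(seq), 4))
```

The main diagonal is the trivial self-match; the shorter diagonal offset by 8 positions reveals the direct repeat.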


Detection of repetitive DNA

RepeatMasker
http://ftp.genome.washington.edu/RM/RepeatMasker.html

RepeatMasker is a program that screens DNA sequences for interspersed repeats known to exist in mammalian genomes as well as for low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (replaced by Ns). On average, over 40% of a human genomic DNA sequence is masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.

   SW  perc perc perc  query     position in query      matching  repeat          position in repeat
score  div. del. ins.  sequence  begin  end  (left)     repeat    class/family    begin  end (left)

 1159  13.2  3.2  0.0  HSU08988  6563  6781 (22462)  +  MER7A     DNA/MER2_type       1  226  (109)
 5901  11.3  2.5  0.8  HSU08988  6782  7720 (21523)  C  TIGGER1   DNA/MER2_type     (0) 2418   1465
 1617  12.7  6.3  1.8  HSU08988  7738  8021 (21222)  C  AluSx     SINE/Alu          (4)  298      2
 3811   8.5  1.5  1.5  HSU08988  8027  8699 (20544)  C  TIGGER1   DNA/MER2_type   (943) 1475    803
 2035  11.0  0.3  0.7  HSU08988  8700  9000 (20243)  C  AluSg     SINE/Alu          (0)  300      1
 2055   9.1  4.4  0.0  HSU08988  9003  9695 (19548)  C  TIGGER1   DNA/MER2_type  (1608)  810      2
  691  15.2  0.0  0.0  HSU08988  9705  9816 (19427)  +  MER7A     DNA/MER2_type     223  334    (1)
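The masking step described above can be reproduced from the annotation table alone. The following sketch assumes whitespace-separated annotation rows in the column order shown (score, three percentages, query name, query begin, query end, and so on) and 1-based inclusive coordinates. It is a stand-in for illustration, not RepeatMasker's own code, and `parse_hit` and `mask` are made-up names.

```python
def parse_hit(line):
    """Parse one whitespace-separated annotation row into a dict."""
    f = line.split()
    return {
        "score": int(f[0]),
        "query": f[4],
        "begin": int(f[5]),   # 1-based, inclusive
        "end": int(f[6]),     # 1-based, inclusive
        "strand": f[8],       # '+' or 'C' (complement)
        "repeat": f[9],
        "family": f[10],
    }

def mask(seq, hits):
    """Return seq with every annotated repeat interval replaced by Ns."""
    masked = list(seq)
    for h in hits:
        for pos in range(h["begin"] - 1, min(h["end"], len(masked))):
            masked[pos] = "N"
    return "".join(masked)

# Hypothetical toy row and sequence (real rows refer to positions in the thousands)
toy_hit = parse_hit("691 15.2 0.0 0.0 toy 5 8 (4) + MER7A DNA/MER2_type 223 226 (1)")
print(mask("ACGTACGTACGT", [toy_hit]))
```

Masking before a database search prevents the many copies of Alu, LINE and similar elements from producing uninformative hits.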


The human genome contains a substantial number of pseudogenes (non-functional gene variants)

Non-processed pseudogenes
Gene duplication has resulted in a new copy of the gene. The copy has mutated to become non-functional.

Processed pseudogenes
Non-functional genomic copies of mRNAs. Often contain multiple mutations.


Human genome: * variation of GC content

* longer introns in AT-rich regions


Gene prediction methods

- Ab initio, pattern recognition
- Database searching

Identification of ORFs
Finding long ORFs

In a random sequence a stop codon is expected about every 64/3 ≈ 21 codons, because 3 of the 64 codons are stop codons (UAA, UAG, UGA)

Average proteins are much longer

Disadvantages:
- short genes are not detected
- some ORFs are false positives
- not suitable for eukaryotes
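The long-ORF argument can be made concrete with a short sketch. It assumes a DNA sequence (so the stop codons are written TAA, TAG, TGA), scans only the three forward reading frames, and requires ATG as start codon; the minimum length and the function names are illustrative assumptions. The helper `prob_open` gives the chance that a random stretch of n codons contains no stop, assuming all 64 codons are equally likely.

```python
STOPS = {"TAA", "TAG", "TGA"}

def prob_open(n_codons):
    """Probability that n random codons contain no stop (61 of 64 are non-stop)."""
    return (61 / 64) ** n_codons

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon;
    min_codons counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Hypothetical toy sequence: ATG + 5 codons + stop, embedded in frame 2
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
print(prob_open(100))
```

A complete search would also scan the reverse complement. The length threshold is exactly what causes the first disadvantage listed above: any real gene shorter than `min_codons` is missed.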


LOCUS       AAB32243       47 aa            BCT       03-MAR-1995
DEFINITION  aepH=putative exoenzyme production regulatory peptide [Erwinia
            carotovora, carotovora, Peptide, 47 aa].
ACCESSION   AAB32243
PID         g691744
VERSION     AAB32243.1  GI:691744
DBSOURCE    locus S74077 accession S74077.1
KEYWORDS    .
SOURCE      Pectobacterium carotovorum carotovora.
  ORGANISM  Pectobacterium carotovorum
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Pectobacterium.
REFERENCE   1  (residues 1 to 47)
  AUTHORS   Murata,H., Chatterjee,A., Liu,Y. and Chatterjee,A.K.
  TITLE     Regulation of the production of extracellular pectinase,
            cellulase, and protease in the soft rot bacterium Erwinia
            carotovora subsp. carotovora: evidence that aepH of E.
            carotovora subsp. carotovora 71 activates gene expression in E.
            carotovora subsp. car
  JOURNAL   Appl. Environ. Microbiol. 60 (9), 3150-3159 (1994)
  MEDLINE   95031027
  REMARK    GenBank staff at the National Library of Medicine created this
            entry [NCBI gibbsq 157517] from the original journal article.
            This sequence comes from Fig. 2A.
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..47
                     /organism="Pectobacterium carotovorum"
                     /db_xref="taxon:554"
     Protein         1..47
                     /product="aepH"
                     /name="putative exoenzyme production regulatory peptide"
     CDS             1..47
                     /gene="aepH+"
                     /coded_by="S74077.1:576..719"
                     /note="Author translates GTG start as Val"
ORIGIN
        1 vgqepkgies rkiqdghvrk kvgrqqglwv rttkkekfsr msrdanv

Example of awful ORF prediction


Codon usage for enteric bacterial (highly expressed) genes 7/19/83

AmAcid  Codon   Number    /1000   Fraction  ..

Gly     GGG      13.00     1.89     0.02
Gly     GGA       3.00     0.44     0.00
Gly     GGU     365.00    52.99     0.59
Gly     GGC     238.00    34.55     0.38

Glu     GAG     108.00    15.68     0.22
Glu     GAA     394.00    57.20     0.78
Asp     GAU     149.00    21.63     0.33
Asp     GAC     298.00    43.26     0.67

Val     GUG      93.00    13.50     0.16
Val     GUA     146.00    21.20     0.26
Val     GUU     289.00    41.96     0.51
Val     GUC      38.00     5.52     0.07

Ala     GCG     161.00    23.37     0.26
Ala     GCA     173.00    25.12     0.28
Ala     GCU     212.00    30.78     0.35
Ala     GCC      62.00     9.00     0.10

Arg     AGG       1.00     0.15     0.00
Arg     AGA       0.00     0.00     0.00
Ser     AGU       9.00     1.31     0.03
Ser     AGC      71.00    10.31     0.20
...

Compositional bias in coding regions


CodonPreference

A codon preference plot is constructed by calculating a codon preference statistic for each position of each of the three reading frames. The statistic is calculated over a window of length w, and the window is moved along the sequence in increments of three bases, maintaining the reading frame. The magnitude of the codon preference statistic measures how closely the codons in a particular window resemble a predetermined preferred usage.

p = preference parameter = relative likelihood of a codon being found in a gene as opposed to a random sequence

$$p = \frac{f_{ABC} / F_{ABC}}{r_{ABC} / R_{ABC}}$$

f = frequency of codon ABC (found in the frequency table)
F = sum of frequencies for all codons that are members of ABC's synonymous family

r = frequency of codon ABC in a random sequence
R = sum of frequencies of ABC's synonymous family in a random sequence

Codon preference statistic P for a window of w codons with preference parameters p_1, ..., p_w:

$$P = \exp\left(\frac{1}{w} \sum_{i=1}^{w} \log p_i\right)$$

w is between 25 and 50
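A small numerical sketch may help with the two formulas. It uses the Fraction column (which is f/F) of the Gly and Glu rows from the codon usage table above, and it assumes a random model with equal base frequencies, so that r/R is simply 1 divided by the size of the synonymous family. The floor value for unseen codons, the tiny window and all function names are assumptions made for illustration; this is not the CodonPreference program itself.

```python
import math

# f/F for the Gly and Glu families, copied from the codon usage table
FRACTION = {"GGG": 0.02, "GGA": 0.00, "GGU": 0.59, "GGC": 0.38,
            "GAG": 0.22, "GAA": 0.78}
FAMILY_SIZE = {"GGG": 4, "GGA": 4, "GGU": 4, "GGC": 4, "GAG": 2, "GAA": 2}
FLOOR = 0.01  # assumption: avoids log(0) for codons with fraction 0.00

def preference(codon):
    """p = (f/F) / (r/R); with equal base frequencies r/R = 1 / family size."""
    return max(FRACTION[codon], FLOOR) * FAMILY_SIZE[codon]

def window_statistic(codons):
    """P = exp(sum(log p_i) / w), i.e. the geometric mean of p over the window."""
    w = len(codons)
    return math.exp(sum(math.log(preference(c)) for c in codons) / w)

def scan(seq, frame, w):
    """P for every window of w codons in one reading frame, stepping by one codon."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [window_statistic(codons[i:i + w]) for i in range(len(codons) - w + 1)]

print(window_statistic(["GGU", "GAA"]))   # preferred codons, P above 1
print(window_statistic(["GGG", "GAG"]))   # rare codons, P below 1
```

Values of P above 1 indicate a window whose codons are used more often in the reference genes than in random sequence; a real analysis would use the full 64-codon table and a window of 25 to 50 codons.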


CodonPreference is a frame-specific gene finder that tries to recognize protein coding sequences by virtue of the similarity of their codon usage to a codon frequency table or by the bias of their composition (usually GC) in the third position of each codon.


Compositional bias of exons
K-tuple method
from Bishop ed., Guide to Human Genome Computing

Consider a sequence S = {s_i} of length L. It can be transformed into a sequence of k-tuples (i.e. oligonucleotides of length k):

$$W = \{W_{k,i}\}, \quad i = 1, \ldots, L - k + 1; \quad W_{k,i} \in \Omega$$

Here Ω is the set of all possible oligonucleotides W_k of length k. In this way it is possible to construct a table F with the occurrence frequency F(W_k) for all possible k-tuples of the set of sequences {S} having the function of interest.

Consider two sets of sequences {S(1)} and {S(2)} with mutually exclusive functions, for instance intron and exon. It is possible to calculate the k-tuple frequency tables F1 and F2 for these two sets of sequences. The difference in frequencies between these tables can be used for discrimination.

To analyze the test sequence using the F1 and F2 tables, calculate the local discriminant index for the ith position, where S_{k,i} is the k-tuple starting at position i of the test sequence:

$$d(i) = \frac{F_1(S_{k,i})}{F_1(S_{k,i}) + F_2(S_{k,i})}$$

d(i) is smoothed using an averaging window of 2w+1 consecutive positions:

$$D(i) = \frac{1}{2w+1} \sum_{j=i-w}^{i+w} d(j)$$

k = 6 is often used
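The k-tuple method can be sketched in a few lines. The example below builds the two frequency tables from tiny made-up training sets, computes the local discriminant index d(i), and smooths it with a window of 2w+1 positions. Two choices are assumptions not stated in the text: a k-tuple absent from both tables gets the neutral value 0.5, and the averaging window is truncated at the sequence ends. All names and the toy sequences are illustrative.

```python
from collections import Counter

def ktuple_table(seqs, k):
    """Relative frequency of every k-tuple observed in a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def local_index(seq, f1, f2, k):
    """d(i) = F1(t) / (F1(t) + F2(t)) for the k-tuple t starting at position i."""
    d = []
    for i in range(len(seq) - k + 1):
        t = seq[i:i + k]
        a, b = f1.get(t, 0.0), f2.get(t, 0.0)
        d.append(a / (a + b) if a + b else 0.5)  # assumption: unseen tuple is neutral
    return d

def smooth(d, w):
    """Average d over positions i-w .. i+w (window truncated at the ends)."""
    out = []
    for i in range(len(d)):
        lo, hi = max(0, i - w), min(len(d), i + w + 1)
        out.append(sum(d[lo:hi]) / (hi - lo))
    return out

# Hypothetical training sets standing in for "exon" and "intron" sequences
f1 = ktuple_table(["GCGCGCGC"], k=2)
f2 = ktuple_table(["ATATATAT"], k=2)
d = local_index("GCGCATAT", f1, f2, k=2)
print(d)
print(smooth(d, w=1))
```

Values of the smoothed index near 1 mark regions resembling the first training set and values near 0 regions resembling the second. With k = 6 the tables hold 4^6 = 4096 entries each, so realistic training sets must be large.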


Local sites (=signals) used in the prediction of genes

- Promoters
- Terminators of transcription
- Start and stop codons
- Splice sites
- Branch points
- Polyadenylation sites

Signal sensors = methods for detecting signals

Content sensors
Hexamer counts to discriminate between exons and introns

Gene finding methods:
Combination of signal and content sensors


Gene prediction methods

Ab initio: HMM methods

Genscan        http://genes.mit.edu/GENSCAN.html
HMMGene        http://www.cbs.dtu.dk/services/HMMgene/
Genie          http://www.fruitfly.org/seq_tools/genie.html
GeneMark.hmm   http://genemark.biology.gatech.edu/GeneMark/eukhmm.cgi
FGENEH         http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID         http://www1.imim.es/geneid.html

Ab initio: Neural network methods

GRAIL          http://compbio.ornl.gov/Grail-1.3/
NetGene2       http://www.cbs.dtu.dk/services/NetGene2/

Homology based

Blast          http://www.ncbi.nlm.nih.gov/BLAST
Procrustes     http://www-hto.usc.edu/software/procrustes/index.html
Genewise       http://www.sanger.ac.uk/Software/Wise2


Limitations

Non-coding parts, 5’ and 3’ UTRs, and non-coding RNAs are not detected

Lack of suitable training sets of very long genomic sequences

Methods are conservative: they have been trained on “typical genes”