Source: bio.lundberg.gu.se/courses/bio2/geneid.pdf

Gene & genome organisation

Computational gene identification

Eubacterial gene and eukaryotic gene

[Figure: gene structure diagrams. Labels: DNA, regulatory elements, promoter, transcription start, translation start, exons, introns, translation stop, polyA signal, transcription stop.]

Promoters in eukaryotic DNA

1. TATA box

2. Initiators: 5’ Y Y A+1 N [T,A] Y Y Y 3’

3. CpG islands

EMBOSS CpGPlot
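A CpG island scan can be sketched as a sliding window that reports GC content and the observed/expected CpG ratio. The sketch below is not the EMBOSS implementation; the function names, the window length of 200 bp and the thresholds (GC content above 50 %, observed/expected above 0.6) are assumptions chosen for illustration, not values given on the slides.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0  # expected CpG count if C and G were independent
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp


def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Yield 0-based start positions of windows that pass both thresholds.

    The default thresholds are assumptions for this sketch.
    """
    for start in range(0, len(seq) - window + 1, step):
        gc, oe = cpg_stats(seq[start:start + window])
        if gc > min_gc and oe > min_obs_exp:
            yield start


if __name__ == "__main__":
    toy = "AT" * 100 + "CG" * 100 + "AT" * 100  # artificial sequence
    print(list(cpg_windows(toy, window=200, step=100)))
```

Adjacent windows that pass would normally be merged into one island; that step is left out here to keep the sketch short.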

[Figure: from primary transcript to protein. Labels: RNA (primary transcript), polyA tail, splicing, RNA (spliced), translation, protein.]

Regulated splicing

Primary RNA transcript

Exons may be combined differently during splicing. One gene can in this way give rise to multiple forms of a protein.

Splice variants

[Figure: alternative splicing of the α-tropomyosin pre-mRNA. Exons 1 to 13 (including 13a) are combined differently in the nonmuscle, smooth muscle, striated muscle, striated muscle', hepatoma and brain transcripts.]

Organism                Genome size   No. of genes   Genes / Mb
Homo sapiens            3000 Mb       ~40,000?       ~13
Mycoplasma genitalium   0.6 Mb        ~600           ~1000

Higher eukaryote genomes contain a substantial amount of non-coding sequences

Repetitive DNA: ~50 % of human genomic DNA

Mobile elements
- viral retrotransposons: common in yeast & Drosophila
- non-viral retrotransposons: common in mammals (LINES, SINES)

LINES: long interspersed elements, 6-7 kb
LINE1: 600,000 copies in human genome = 15 % of genomic DNA

SINES: short interspersed elements, ~300 bp

Sequence conservation ~80 % within the same species

Alu sequence: the most abundant class of SINE; 1 million copies = 10 % of genomic DNA

Many Alu sequences have cleavage sites for the restriction enzyme AluI (AGCT), hence the name

Originally derived from SRP RNA by reverse transcription

           Repetitive sequences   Genome size
Mammals    35 - 45 %              ~3 Gb
Fugu       < 15 %                 ~365 Mb

Why so many repetitive elements in higher mammals?

Mobile elements probably had a significant influence on the evolution of higher organisms:

Novel genes and new controls on gene expression were created because mobile elements have served as sites for recombination, leading to gene duplications and other gene rearrangements (exon shuffling).

Detection of repetitive DNA

Dotplot analysis
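A dotplot of a sequence against itself shows repeats as runs of dots off the main diagonal. The following is a minimal word-match sketch; the function names and the word length of 4 are assumptions for illustration, and real dotplot programs usually add scoring windows and thresholds.

```python
def dotplot(seq_a, seq_b, word=4):
    """Return the set of (i, j) where seq_a[i:i+word] equals seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            dots.add((i, j))
    return dots


def repeat_offsets(seq, word=4):
    """Count dots above the main diagonal of a self-comparison, per offset j - i."""
    counts = {}
    for i, j in dotplot(seq, seq, word):
        if j > i:
            counts[j - i] = counts.get(j - i, 0) + 1
    return counts


if __name__ == "__main__":
    unit = "GATTACAGCT"  # artificial repeat unit
    toy = unit + "TTGGAACC" + unit
    print(repeat_offsets(toy))  # many dots at one offset indicate a direct repeat
```

An offset that collects many dots corresponds to a diagonal line in the plot, i.e. two copies of the same element separated by that distance.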

Detection of repetitive DNA

RepeatMasker
http://ftp.genome.washington.edu/RM/RepeatMasker.html

RepeatMasker is a program that screens DNA sequences for interspersed repeats known to exist in mammalian genomes as well as for low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (replaced by Ns). On average, over 40% of a human genomic DNA sequence is masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.

score  %div  %del  %ins  query     begin  end   (left)   strand  repeat   class/family   position in repeat
 1159  13.2   3.2   0.0  HSU08988  6563   6781  (22462)  +       MER7A    DNA/MER2_type  1       226   (109)
 5901  11.3   2.5   0.8  HSU08988  6782   7720  (21523)  C       TIGGER1  DNA/MER2_type  (0)     2418  1465
 1617  12.7   6.3   1.8  HSU08988  7738   8021  (21222)  C       AluSx    SINE/Alu       (4)     298   2
 3811   8.5   1.5   1.5  HSU08988  8027   8699  (20544)  C       TIGGER1  DNA/MER2_type  (943)   1475  803
 2035  11.0   0.3   0.7  HSU08988  8700   9000  (20243)  C       AluSg    SINE/Alu       (0)     300   1
 2055   9.1   4.4   0.0  HSU08988  9003   9695  (19548)  C       TIGGER1  DNA/MER2_type  (1608)  810   2
  691  15.2   0.0   0.0  HSU08988  9705   9816  (19427)  +       MER7A    DNA/MER2_type  223     334   (1)
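The masking step described above (annotated repeats replaced by Ns) can be sketched from an annotation table of this kind. This is not RepeatMasker code; the function names are hypothetical, and the parser assumes whitespace-separated rows in which the sixth and seventh fields are the begin and end positions in the query (1-based, inclusive), as in the rows shown above.

```python
def parse_hits(text):
    """Extract (begin, end) query coordinates from annotation rows."""
    hits = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Data rows start with an integer score; skip headers and blank lines.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        hits.append((int(fields[5]), int(fields[6])))
    return hits


def mask(seq, hits):
    """Replace every annotated interval (1-based, inclusive) with Ns."""
    masked = list(seq)
    for begin, end in hits:
        for k in range(begin - 1, min(end, len(masked))):
            masked[k] = "N"
    return "".join(masked)


if __name__ == "__main__":
    rows = """
    1159 13.2 3.2 0.0 HSU08988 6563 6781 (22462) + MER7A DNA/MER2_type 1 226 (109)
    5901 11.3 2.5 0.8 HSU08988 6782 7720 (21523) C TIGGER1 DNA/MER2_type (0) 2418 1465
    """
    print(parse_hits(rows))
    print(mask("ACGTACGTAC", [(3, 5)]))  # toy sequence, toy interval
```

The masked sequence can then be passed to a gene finder or a database search without the repeats producing spurious hits.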

The human genome contains a substantial number of pseudogenes - non-functional gene variants

Non-processed pseudogenes
Gene duplication has resulted in a new copy of the gene
The copy has mutated to become non-functional

Processed pseudogenes
Non-functional genomic copies of mRNAs
Often contain multiple mutations

Human genome:
* variation of GC content
* longer introns in AT-rich regions

Gene prediction methods

- Ab initio, pattern recognition
- Database searching

Identification of ORFs

Finding long ORFs

Number of stop codons = 3 (UAA, UAG, UGA) out of 64 codons
In a random sequence a stop codon is expected every 64/3 ≈ 21 codons

Average proteins are much longer, so a long ORF is unlikely to arise by chance

Disadvantages:
- short genes are not detected
- some ORFs are false positives
- not suitable for eukaryotes
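The long-ORF approach can be sketched as a scan of all six reading frames for stretches from a start codon to the next in-frame stop. The code below is an illustrative sketch under stated assumptions: it works on DNA (TAA, TAG, TGA), accepts only ATG as a start codon, and uses a hypothetical minimum length of 100 codons; none of these choices comes from the slides.

```python
STOPS = {"TAA", "TAG", "TGA"}


def expected_stop_spacing():
    """Average number of codons between stops in a uniform random sequence."""
    return 64 / len(STOPS)


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end, n_codons) for each ATG...stop ORF.

    Coordinates are 0-based on the scanned strand, end is exclusive and
    includes the stop codon; n_codons excludes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n = (i - start) // 3
                    if n >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n))
                    start = None
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"  # artificial sequence
    print(round(expected_stop_spacing()))
    print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` reduces false positives at the cost of missing short genes, which is the trade-off listed under the disadvantages above. Genes with alternative start codons such as GTG are also missed by this ATG-only sketch.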

LOCUS       AAB32243       47 aa       BCT       03-MAR-1995
DEFINITION  aepH=putative exoenzyme production regulatory peptide [Erwinia
            carotovora, carotovora, Peptide, 47 aa].
ACCESSION   AAB32243
PID         g691744
VERSION     AAB32243.1  GI:691744
DBSOURCE    locus S74077 accession S74077.1
KEYWORDS    .
SOURCE      Pectobacterium carotovorum carotovora.
  ORGANISM  Pectobacterium carotovorum
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Pectobacterium.
REFERENCE   1  (residues 1 to 47)
  AUTHORS   Murata,H., Chatterjee,A., Liu,Y. and Chatterjee,A.K.
  TITLE     Regulation of the production of extracellular pectinase,
            cellulase, and protease in the soft rot bacterium Erwinia
            carotovora subsp. carotovora: evidence that aepH of E. carotovora
            subsp. carotovora 71 activates gene expression in E. carotovora
            subsp. car
  JOURNAL   Appl. Environ. Microbiol. 60 (9), 3150-3159 (1994)
  MEDLINE   95031027
  REMARK    GenBank staff at the National Library of Medicine created this
            entry [NCBI gibbsq 157517] from the original journal article.
            This sequence comes from Fig. 2A.
COMMENT     Method: conceptual translation supplied by author.
FEATURES             Location/Qualifiers
     source          1..47
                     /organism="Pectobacterium carotovorum"
                     /db_xref="taxon:554"
     Protein         1..47
                     /product="aepH"
                     /name="putative exoenzyme production regulatory peptide"
     CDS             1..47
                     /gene="aepH+"
                     /coded_by="S74077.1:576..719"
                     /note="Author translates GTG start as Val"
ORIGIN
        1 vgqepkgies rkiqdghvrk kvgrqqglwv rttkkekfsr msrdanv

Example of awful ORF prediction

Codon usage for enteric bacterial (highly expressed) genes 7/19/83

AmAcid  Codon  Number  /1000  Fraction ..

Gly     GGG     13.00   1.89   0.02
Gly     GGA      3.00   0.44   0.00
Gly     GGU    365.00  52.99   0.59
Gly     GGC    238.00  34.55   0.38

Glu     GAG    108.00  15.68   0.22
Glu     GAA    394.00  57.20   0.78
Asp     GAU    149.00  21.63   0.33
Asp     GAC    298.00  43.26   0.67

Val     GUG     93.00  13.50   0.16
Val     GUA    146.00  21.20   0.26
Val     GUU    289.00  41.96   0.51
Val     GUC     38.00   5.52   0.07

Ala     GCG    161.00  23.37   0.26
Ala     GCA    173.00  25.12   0.28
Ala     GCU    212.00  30.78   0.35
Ala     GCC     62.00   9.00   0.10

Arg     AGG      1.00   0.15   0.00
Arg     AGA      0.00   0.00   0.00
Ser     AGU      9.00   1.31   0.03
Ser     AGC     71.00  10.31   0.20
...
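The Fraction column of the table is the count of a codon divided by the total count of its synonymous family. A small sketch reproduces the glycine rows from the Number column; the function name is hypothetical.

```python
# Counts for the glycine family, copied from the Number column of the table.
GLY_COUNTS = {"GGG": 13, "GGA": 3, "GGU": 365, "GGC": 238}


def synonymous_fractions(counts):
    """Fraction of each codon within its synonymous family."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    for codon, fraction in synonymous_fractions(GLY_COUNTS).items():
        print(codon, round(fraction, 2))
```

In highly expressed enteric genes GGU and GGC together account for almost all glycine codons, which is the kind of bias the codon preference methods exploit.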

Compositional bias in coding regions

CodonPreference

A codon preference plot is constructed by calculating a codon preference statistic for each position of each of the three reading frames. The statistic is calculated over a window of length w, and the window is moved along the sequence in increments of three bases, maintaining the reading frame. The magnitude of the codon preference statistic is a measure of the likeness of a particular window of codons to a predetermined preferred usage.

p = preference parameter = relative likelihood of a codon being found in a gene as opposed to a random sequence

\[ p = \frac{f_{ABC} / F_{ABC}}{r_{ABC} / R_{ABC}} \]

f: frequency of codon ABC (found in frequency table)
F: sum of frequencies for all codons that are members of ABC's synonymous family

r: frequency of codon ABC in a random sequence
R: sum of frequencies of ABC's synonymous family in a random sequence

Codon preference statistic P

\[ P = e^{\left( \sum_{i=1}^{w} \log p_i \right) / w} \]

w is between 25 and 50
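The statistic P is the geometric mean of the preference parameters p of the w codons in a window. The sketch below is not the CodonPreference program. It assumes a random sequence with equal base frequencies, so that r/R equals one over the size of the synonymous family, and it uses a small floor on f/F to avoid log(0); both are assumptions for illustration. The fractions are taken from the codon usage table above; the function names are hypothetical.

```python
import math

# f/F values (Fraction column) for two synonymous families from the table.
FRACTION = {"GGU": 0.59, "GGC": 0.38, "GGG": 0.02, "GGA": 0.00,
            "GAA": 0.78, "GAG": 0.22}
# Number of synonymous codons per family (Gly: 4, Glu: 2).
FAMILY_SIZE = {"GGU": 4, "GGC": 4, "GGG": 4, "GGA": 4, "GAA": 2, "GAG": 2}


def preference(codon, floor=0.01):
    """p = (f/F) / (r/R), with r/R = 1/family size (uniform random model)."""
    fraction = max(FRACTION[codon], floor)  # floor is an assumption, avoids log(0)
    return fraction * FAMILY_SIZE[codon]


def window_statistic(codons):
    """P = exp(sum(log p_i) / w) for one window of w codons."""
    logs = [math.log(preference(c)) for c in codons]
    return math.exp(sum(logs) / len(logs))


def frame_profile(seq, frame, w):
    """P for every window of w codons in one reading frame (step = 3 bases)."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [window_statistic(codons[k:k + w]) for k in range(len(codons) - w + 1)]


if __name__ == "__main__":
    print(window_statistic(["GGU", "GAA", "GGC"]))  # preferred codons
    print(window_statistic(["GGG", "GAG", "GGA"]))  # rare codons
```

Windows made of preferred codons give P above 1 and windows made of rare codons give P below 1; a complete table covering all 64 codons would be needed to scan real sequences.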

CodonPreference is a frame-specific gene finder that tries to recognize protein coding sequences by virtue of the similarity of their codon usage to a codon frequency table or by the bias of their composition (usually GC) in the third position of each codon.

Compositional bias of exons
K-tuple method
from Bishop ed., Guide to Human Genome Computing

Consider a sequence S = {s_i} of length L. It can be transformed into a sequence of k-tuples (i.e. oligonucleotides of length k):

\[ W = \{ W_{k,i} \}, \quad i = 1, \dots, L - k + 1; \quad W_{k,i} \in \Omega \]

Here \Omega = {W_k} is the set of all possible oligonucleotides W_k of length k. In this way it is possible to construct a table F with the occurrence frequency F(W_k) for all possible k-tuples of the set of sequences {S} having the function of interest.

Consider two sets of sequences {S(1)} and {S(2)} with mutually exclusive functions, for instance intron and exon. It is possible to calculate the k-tuple frequency tables F1 and F2 for these two sets of sequences. The difference in frequencies between these tables can be used for discrimination. To analyze the test sequence using the F1 and F2 tables, calculate the local discriminant index for the ith position, where S_{k,i} is the k-tuple at position i of the test sequence:

\[ d(i) = \frac{F_1(S_{k,i})}{F_1(S_{k,i}) + F_2(S_{k,i})} \]

d(i) is smoothed using an averaging window of 2w+1 consecutive positions:

\[ D(i) = \frac{1}{2w+1} \sum_{j=i-w}^{i+w} d(j) \]

k = 6 is often used
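The k-tuple method can be sketched in three steps: build the frequency tables F1 and F2, compute d(i) along the test sequence, and smooth it into D(i). The code below is a sketch under assumptions that are not in the source: k-tuples absent from both tables get the neutral value 0.5, and the averaging window is truncated at the sequence ends. The function names are hypothetical, and the toy example uses k = 2 only to stay readable.

```python
from collections import Counter


def ktuple_frequencies(seqs, k):
    """Occurrence frequency of every k-tuple in a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def local_index(seq, f1, f2, k):
    """d(i) = F1 / (F1 + F2) for the k-tuple starting at each position i."""
    d = []
    for i in range(len(seq) - k + 1):
        w = seq[i:i + k]
        a, b = f1.get(w, 0.0), f2.get(w, 0.0)
        # Assumption: a k-tuple unseen in both tables is scored as neutral.
        d.append(a / (a + b) if a + b else 0.5)
    return d


def smooth(d, w):
    """D(i): mean of d over positions i-w .. i+w, truncated at the ends."""
    out = []
    for i in range(len(d)):
        lo, hi = max(0, i - w), min(len(d), i + w + 1)
        out.append(sum(d[lo:hi]) / (hi - lo))
    return out


if __name__ == "__main__":
    f1 = ktuple_frequencies(["GCGCGCGC"], 2)  # toy "exon" training set
    f2 = ktuple_frequencies(["ATATATAT"], 2)  # toy "intron" training set
    d = local_index("GCGCATAT", f1, f2, 2)
    print(d)
    print(smooth(d, 1))
```

Values of D(i) near 1 indicate composition like the first training set and values near 0 like the second; with k = 6 the tables hold 4096 entries, so much larger training sets are needed than in this toy example.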

Local sites (=signals) used in the prediction of genes

Promoters
Terminators of transcription
Start and stop codons
Splice sites
Branch points
Polyadenylation sites

Signal sensors = methods for detecting signals

Content sensors
Hexamer counts to discriminate between exons and introns

Gene finding methods:
Combination of signal and content sensors

Gene prediction methods

Ab initio: HMM methods

Genscan http://genes.mit.edu/GENSCAN.html
HMMGene http://www.cbs.dtu.dk/services/HMMgene/
Genie http://www.fruitfly.org/seq_tools/genie.html
GeneMark.hmm http://genemark.biology.gatech.edu/GeneMark/eukhmm.cgi
FGENEH http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID http://www1.imim.es/geneid.html

Ab initio: Neural network methods

GRAIL http://compbio.ornl.gov/Grail-1.3/
NetGene2 http://www.cbs.dtu.dk/services/NetGene2/

Homology based

Blast http://www.ncbi.nlm.nih.gov/BLAST
Procrustes http://www-hto.usc.edu/software/procrustes/index.html
Genewise http://www.sanger.ac.uk/Software/Wise2

Limitations

Non-coding parts, 5’ and 3’ UTRs, and non-coding RNAs are not detected

Lack of suitable training sets of very long genomic sequences

Methods are conservative - they have been trained on “typical genes”
