ruth-lee
Gene, Proteins, and Genetic Code
Protein Synthesis in a Cell
A protein sequence
>gi|7228451|dbj|BAA92411.1| EST AU055734(S20025) corresponds to a region …
MCSYIRYDTPKLFTHVTKTPPKNQVSNSINDVGSRRATDRSVASCSSEKSVGTMSVKNASSISFEDIEKSISNWKIPKVN
IKEIYHVDTDIHKVLTLNLQTSGYELELGSENISVTYRVYYKAMTTLAPCAKHYTPKGLTTLLQTNPNNRCTTPKTLKWD
EITLPEKWVLSQAVEPKSMDQSEVESLIETPDGDVEITFASKQKAFLQSRPSVSLDSRPRTKPQNVVYATYEDNSDEPSI
SDFDINVIELDVGFVIAIEEDEFEIDKDLLKKELRLQKNRPKMKRYFERVDEPFRLKIRELWHKEMREQRKNIFFFDWYE
SSQVRHFEEFFKGKNMMKKEQKSEAEDLTVIKKVSTEWETTSGNKSSSSQSVSPMFVPTIDPNIKLGKQKAFGPAISEEL
VSELALKLNNLKVNKNINEISDNEKYDMVNKIFKPSTLTSTTRNYYPRPTYADLQFEEMPQIQNMTYYNGKEIVEWNLDG
FTEYQIFTLCHQMIMYANACIANGNKEREAANMIVIGFSGQLKGWWNNYLNETQRQEILCAVKRDDQGRPLPDRDGNGNP
TELKEGFHMEEKDEPIQEDDQVVGTIQKYTKQKWYAEVMYRFIDGSYFQHITLIDSGADVNCIREDEILDQLVQTKREQV
VNSIYLHDNSFPKSMDLPDQKITEKRAKLQDIPHHEERLLDYREKKSRDGQDKLPMEVEQSMATNKNTKILLRAWLLST
A protein sequence may contain from a few hundred to several thousand amino acids.
Protein synthesis
Genetic code
..ATT CAC AGT GGA..
   I   H   S   G
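The codon-to-amino-acid mapping shown on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code and a DNA (not RNA) alphabet; the names `CODON_TABLE` and `translate` are introduced here for illustration and are not from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate DNA codon by codon, starting at offset `frame` (0, 1 or 2).

    Codons containing characters other than A, C, G, T become 'X'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATTCACAGTGGA"))  # IHSG, as on the slide
```

Changing `frame` to 1 or 2 reads the same bases in the other two reading frames of the strand.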
Notes on translation
• Three reading frames (per strand)
• The third base of a codon often does not matter
• 5’ -> 3’
• Start and end (stop) codons
• Open Reading Frame (ORF)
• Each gene is an ORF, but not all ORFs are genes.
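As a rough sketch of these notes, the following Python function scans the three reading frames of one strand for ORFs. It assumes ATG as the only start codon and TAA, TAG, TGA as stop codons; the function name `find_orfs` and the `min_codons` threshold are hypothetical choices for illustration, and the reverse strand is not scanned.

```python
START = "ATG"                  # assumption: ATG is the only start codon considered
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons


def find_orfs(dna, min_codons=1):
    """Return (frame, start, end) for each ORF on the given strand.

    `start` is the index of the A in ATG and `end` is exclusive and includes
    the stop codon. Only the first ATG before each stop is reported, and ORFs
    with fewer than `min_codons` codons before the stop are dropped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATGACC"))  # [(2, 2, 11)], i.e. ATG AAA TGA in frame 2
```

Every hit is only a candidate: raising `min_codons` removes many spurious ORFs, at the cost of missing short genes.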
The Central Dogma of Molecular Biology
DNA -> RNA -> Protein
(DNA -> RNA: transcription; RNA -> Protein: translation; DNA -> DNA: replication)
DNA: genotype; Protein: phenotype
Exception: retroviruses (their RNA is reverse-transcribed into DNA)
Protein (Phenotype)
DNA (Genotype)
Biology
Genes
• One gene encodes one protein (or sometimes an RNA).
• Like a program, a gene starts with a start codon (e.g. ATG); then each group of three bases codes for one amino acid, and a stop codon (e.g. TGA) marks the end of the gene.
• Genes are dense in prokaryotes and sparse in eukaryotes.
• In the middle of a eukaryotic gene, there are introns that are spliced out (as junk) after transcription. The parts that are kept are called exons. Telling exons from introns is part of the task of gene finding.
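To make the exon/intron idea concrete, here is a small sketch that removes introns from a transcript once the exon coordinates are known. The toy sequence and the helper name `splice` are made up for illustration; in real gene finding the coordinates are exactly what has to be predicted.

```python
def splice(transcript, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(transcript[start:end] for start, end in sorted(exons))


# Toy example (hypothetical sequence): two exons separated by one intron.
exon1 = "ATGGCC"
intron = "GTAAGTTTTTTTCAG"
exon2 = "GGATGA"
transcript = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(transcript))]

print(splice(transcript, exons))  # ATGGCCGGATGA
```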
Gene related diseases
• Hemophilia: gene on the X chromosome.
• Sickle-Cell Anemia: single nucleotide mutation in the first exon of the beta-globin gene (removes a cutting site). 1 in 12 African Americans are carriers (homozygotes have the disease).
• BRCA1 gene (chr. 17q): responsible for ½ of inherited breast cancer (10% of breast cancer)
• Fragile X syndrome (intellectual disability): 1 in 1250 males, 1 in 2500 females (dominant, but females have a partially expressed good copy of the gene). FMR-1 gene: more than 200 tri-nucleotide repeats cause the disease.
• P53 gene: chr. 17p, tumor suppressor protein.
Gene Prediction and Annotation
Prokaryotes
1. Start/stop codon (ORF)
2. Promoters
3. Content
4. Sequence similarity
Start Codon
• May miss short genes.
• Do not know which start codon to use.
• Overlapping ORFs in different reading frames.
Promoters
<-- upstream downstream -->
5'-XXXXPPPPPPXXXXXXXXXPPPPPPXXXXGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGXXXX-3'
       -35            -10       Gene to be transcribed

-10 (Pribnow box):  T    A    T    A    A    T
                    77%  76%  60%  61%  56%  82%
-35:                T    T    G    A    C    A
                    69%  79%  61%  56%  54%  54%
In prokaryotes, the promoter consists of two short sequences at the -10 and -35 positions upstream of the gene, that is, before the gene in the direction of transcription. The sequence at -10 is called the Pribnow box and usually consists of the six nucleotides TATAAT. The Pribnow box is essential for starting transcription in prokaryotes. The other sequence, at -35, usually consists of the six nucleotides TTGACA. Its presence allows a very high transcription rate.
These rules are only approximately correct.
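Because the rules are only approximate, a promoter search usually has to tolerate mismatches. The sketch below looks for a TTGACA-like hexamer followed, after a spacer, by a TATAAT-like hexamer. The function name `find_promoters`, the mismatch limit, and in particular the spacer range of 15 to 19 bases are assumptions made for this illustration; the slides do not give a spacing.

```python
MINUS_35 = "TTGACA"  # -35 consensus from the slide
MINUS_10 = "TATAAT"  # -10 consensus (Pribnow box) from the slide


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(dna, max_mismatch=1, spacer=(15, 19)):
    """Return (pos35, pos10, total_mismatches) for candidate promoters.

    Each box may differ from its consensus in at most `max_mismatch` positions.
    `spacer` is the assumed (min, max) number of bases between the two boxes.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        m35 = mismatches(dna[i:i + 6], MINUS_35)
        if m35 > max_mismatch:
            continue
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + 6 + gap
            if j + 6 > len(dna):
                break
            m10 = mismatches(dna[j:j + 6], MINUS_10)
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits


seq = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"
print(find_promoters(seq))  # [(2, 25, 0)]
```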
Scoring a 6-mer as Pribnow box
• We need a “score function” to measure the likelihood that a 6-mer is a Pribnow box
An exemplary function for Pribnow box fitness evaluation
log()
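One plausible form of such a score, assumed here rather than taken from the slide, is a log-odds sum over the six positions: for each position, the log of the probability of the observed base in a Pribnow box divided by a background probability. The sketch uses the consensus frequencies from the -10 table above. Since only the consensus base frequency is given per position, it assumes the remaining probability is split evenly among the other three bases, and it assumes a uniform background of 0.25. The name `pribnow_score` is hypothetical.

```python
import math

# (consensus base, frequency) per position, from the -10 table on the slide.
PRIBNOW = [("T", 0.77), ("A", 0.76), ("T", 0.60),
           ("A", 0.61), ("A", 0.56), ("T", 0.82)]
BACKGROUND = 0.25  # assumption: uniform background base frequency


def pribnow_score(hexamer):
    """Log-odds score of a 6-mer against the Pribnow box profile.

    Assumption: at each position the non-consensus bases share the remaining
    probability equally, because the slide lists only the consensus frequency.
    """
    if len(hexamer) != 6:
        raise ValueError("expected a 6-mer")
    score = 0.0
    for base, (consensus, freq) in zip(hexamer.upper(), PRIBNOW):
        p = freq if base == consensus else (1.0 - freq) / 3.0
        score += math.log(p / BACKGROUND)
    return score


for kmer in ("TATAAT", "TATAAA", "GCGCGC"):
    print(kmer, round(pribnow_score(kmer), 2))
```

A positive score means the 6-mer looks more like a Pribnow box than like background sequence; where to set the threshold is a separate choice.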
Content I – codon bias
• A codon XYZ occurs with different frequencies in coding regions and non-coding regions.
• Different amino acids have different frequencies.
• Different codons for the same amino acid have different frequencies.
• In non-coding regions, the frequency of XYZ is approximately p(X)*p(Y)*p(Z).
http://www.kazusa.or.jp/codon/
Codon bias
• First, use many known genes of the organism or similar organisms to train a codon frequency table. Each codon ci has frequency f(ci).
• Second, compute the background frequency bf(X) of each base, for X=A,C,G,T.
• The “significance” of a codon c=XYZ is then log( f(c) / (bf(X)*bf(Y)*bf(Z)) ), so that codons more frequent in genes than expected from the background score positive.
• High average significance in a region is an indication of a gene.
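The procedure above can be sketched as follows. The sign convention is taken here as log( f(c) / (bf(X)*bf(Y)*bf(Z)) ), so that a higher average means more gene-like. The function names, the toy training gene, and the small pseudocount for unseen codons are assumptions for this illustration. The region is assumed to contain only A, C, G, T and to be read in frame 0.

```python
import math
from collections import Counter


def codon_frequencies(genes):
    """Train f(c): relative frequency of each codon over known in-frame genes."""
    counts = Counter()
    for gene in genes:
        gene = gene.upper()
        counts.update(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def base_frequencies(dna):
    """Background frequency bf(X) of each base X in A, C, G, T."""
    counts = Counter(dna.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def average_significance(region, f, bf, pseudo=1e-6):
    """Mean of log(f(c) / (bf(X)*bf(Y)*bf(Z))) over the codons of `region`.

    `pseudo` is an assumed pseudocount that avoids log(0) for unseen codons.
    """
    region = region.upper()
    scores = []
    for i in range(0, len(region) - 2, 3):
        codon = region[i:i + 3]
        expected = bf[codon[0]] * bf[codon[1]] * bf[codon[2]]
        scores.append(math.log((f.get(codon, 0.0) + pseudo) / expected))
    if not scores:
        raise ValueError("region is shorter than one codon")
    return sum(scores) / len(scores)


# Toy data (hypothetical): one training gene and a uniform background.
f = codon_frequencies(["ATGGCTGCTGCTTAA"])
bf = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(average_significance("GCTGCT", f, bf) > average_significance("CCCCCC", f, bf))
```

In practice the score would be computed in a sliding window and in each reading frame, since the true frame of a gene is not known in advance.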
Content II - Hidden Markov Model (HMM)
Eukaryotes
• Basic idea similar to Prokaryotes
• Difference:
DNA-specific transcription factors
• These are the basis of gene-regulatory networks
• Another hot area in Bioinformatics
Splicing
• Consensus sequences have been identified as necessary but not sufficient for splicing. In vertebrates, these sequences are (the slash identifies the exon-intron or intron-exon junction):
• C(or A)AG/GTA(or G)AGT "donor" splice site
• T(or C)nNC(or T)AG/G "acceptor" splice site
• A third sequence, which in yeast is TACTAAC, is necessary within the intron sequence.
These rules are only approximately correct.
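The donor and acceptor consensus sequences can be written as regular expressions, as in the sketch below. It reports candidate junction positions only; since the consensus is necessary but not sufficient, matches are not confirmed splice sites. The function names are hypothetical, and because the slide leaves the repeat count n of the pyrimidine run unspecified, `min_tract` is an assumed parameter.

```python
import re

# Donor: C(or A)AG / GTA(or G)AGT. Lookahead allows overlapping candidates.
DONOR = re.compile(r"(?=[CA]AGGT[AG]AGT)")


def donor_sites(dna):
    """Indices of the first intron base (the G of GT) for each donor match."""
    return [m.start() + 3 for m in DONOR.finditer(dna.upper())]


def acceptor_sites(dna, min_tract=4):
    """Indices of the first exon base after each acceptor match.

    Pattern: T(or C) repeated at least `min_tract` times, any base N,
    C(or T), AG, then G. `min_tract` is an assumption; the slide only says n.
    """
    pattern = re.compile(r"(?=((?:[TC]){%d,}[ACGT][CT]AG)G)" % min_tract)
    # Several match starts inside one pyrimidine run give the same junction,
    # so duplicates are collapsed.
    return sorted({m.end(1) for m in pattern.finditer(dna.upper())})


# Toy transcript (hypothetical): exon, intron GT...AG, exon.
seq = "ATGCAG" + "GTAAGT" + "AAAA" + "TTTTTTACAG" + "GATTGA"
print(donor_sites(seq), acceptor_sites(seq))  # [6] [26]
```

Here `seq[6:26]` is the candidate intron, which starts with GT and ends with AG.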