DNA
• DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are linear chains of monomeric units called nucleotides
• A nucleotide has three parts: a sugar, a phosphate and a base
• Four bases in DNA: adenine (A), thymine (T), guanine (G) and cytosine (C)
• The two strands are complementary
• Base pairing: A-T; G-C
• A purine (A, G) pairs with a pyrimidine (T, C) through complementary hydrogen bonds
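The pairing rule (A-T, G-C) can be sketched in a few lines of Python. This is only an illustration: the names PAIR, complement and reverse_complement are made up for this example, and the input is assumed to contain only the uppercase letters A, C, G and T.

```python
# Watson-Crick pairing: A<->T, G<->C (a purine always pairs with a pyrimidine).
PAIR = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def complement(strand):
    """Return the base-by-base complement of strand, in the same order."""
    return ''.join(PAIR[base] for base in strand)

def reverse_complement(strand):
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]

print(complement('ACTTAGG'))          # TGAATCC
print(reverse_complement('ACTTAGG'))  # CCTAAGT
```

The reverse complement is usually the more useful of the two, because the two strands of a double helix run in opposite directions.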
Secondary Structure of DNA
• Genome
– The entire DNA of a cell is its genome
– Individual units that code for proteins or RNA are genes
– A gene starts with ATG and ends with one or two stop codons
– Such a stretch is called an ORF (Open Reading Frame)
• Biological information
– Contained in the genome
– Encoded in the nucleotide sequences of DNA or RNA
– Partitioned into discrete units, genes
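The idea of an ORF (start at ATG, read three bases at a time, stop at a stop codon) can be tried out with a rough sketch like the one below. It is not a real gene finder: find_orfs and min_codons are hypothetical names, only the given strand is scanned, the standard stop codons TAA, TAG and TGA are assumed, and the search ends at the first in-frame stop codon.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}   # the three standard stop codons

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the given strand.

    An ORF here starts at an ATG and runs, three bases at a time, up to
    and including the first in-frame stop codon. end is exclusive, so
    dna[start:end] is the ORF itself.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start+3] != 'ATG':
            continue
        # Walk codon by codon from the codon after ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos+3] in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs

example = 'CCATGAAATGGTAACC'
print(find_orfs(example))   # [(2, 14)]
print(example[2:14])        # ATGAAATGGTAA
```

The second ATG in the example (at index 7) is in a different reading frame and never reaches a stop codon, so it does not produce an ORF.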
Genome
Genome Databases
Completed genomes
– ftp site: ftp://ftp.ncbi.nlm.nih.gov/genomes/
– http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html
– http://www.ebi.ac.uk/genomes/mot/index.html
– http:/pir.goergetown.edu/pirwww/search/genome.html
Organism-specific databases
– http://www.unledu/stc-95/ResTools/biotools/biotools10.html
– http://www.fp.mcs.anl.gov/~gaasterland/genomes.html
– http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
– http://www.bioinformatik.de/cgi-bin/browse/Catalog/Databases/Genome_Proejcts
Human Genome
• Human Genome Project
– Conceived in 1984, begun in 1990; draft sequence published in 2001, finished in 2003, ahead of the original 2005 schedule
• What did the sequence reveal?
– About 3 Bbp (3 billion base pairs)
– 24 distinct chromosomes: 22 autosomes plus two sex chromosomes (X, Y)
– Longest about 250 Mbp, shortest about 55 Mbp
– Mitochondrial genome: a circular DNA molecule of 16.569 kbp (16,569 bp)
– ~10^13 cells in the human body
• How long is 3 Bbp?
– A typical 11-pt font prints about 60 nucleotides in 10 cm (~4 in)
– In this format, 3 Bbp writes out to about 5,000 km (~3,100 mi)
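The printed-length estimate is easy to check with a little arithmetic. The sketch below assumes 60 nucleotides per printed line and a line length of 10 cm; the variable names are made up for this example.

```python
genome_bp = 3_000_000_000   # 3 billion base pairs
bases_per_line = 60         # nucleotides in one printed line
line_length_cm = 10.0       # assumed printed length of that line

total_cm = genome_bp / bases_per_line * line_length_cm
total_km = total_cm / 100_000     # 100,000 cm in a km
total_mi = total_km / 1.609344    # km in a mile

print(round(total_km), 'km')   # 5000 km
print(round(total_mi), 'mi')   # 3107 mi
```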
Other Species
Organism                        Genome size (Mbp)    # of genes
Epstein-Barr virus                       0.17                80
E. coli                                  4.6              4,406
Yeast (S. cerevisiae)                   12.5              6,172
Nematode worm (C. elegans)             100.3             19,099
Thale cress (A. thaliana)              115.4             25,498
Fruit fly (D. melanogaster)            128.3             13,601
Human (H. sapiens)                    3223.0             20,500
Fugu (Takifugu rubripes)               390.0             30,000
Wheat                                16000.0             30,000
• In double strands
– # of A = # of T; # of G = # of C
– Erwin Chargaff's 1st Parity Rule, 1951
• In a single strand?
– # of A ≈ # of T; # of G ≈ # of C (approximately, for long sequences)
– Erwin Chargaff's 2nd Parity Rule
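Both parity rules can be explored by counting bases. The sketch below uses a made-up toy sequence and hypothetical helper names (base_counts, complement). It shows that the 1st rule holds exactly once the complementary strand is included; the 2nd rule is only approximate and needs a long real sequence, such as a whole chromosome, to show up.

```python
def base_counts(strand):
    """Count each of A, C, G, T in one string."""
    return {base: strand.count(base) for base in 'ACGT'}

def complement(strand):
    """Return the base-by-base complement (A<->T, G<->C)."""
    pair = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(pair[b] for b in strand)

strand = 'ATGCGCGATTAACG'   # toy example, far too short for the 2nd rule
single = base_counts(strand)
double = base_counts(strand + complement(strand))

print(single)   # {'A': 4, 'C': 3, 'G': 4, 'T': 3}
print(double)   # {'A': 7, 'C': 7, 'G': 7, 'T': 7}
```

In the double-strand counts, A = T and G = C by construction, because every A on one strand faces a T on the other.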
Monomer counts in DNA
• Download the Yeast Chromosome 1 sequence from www.cs.uml.edu/~kim/100/yeast01.txt to your C:\100 folder
• Open a Command Prompt from Applications (NOT JES)
• cd C:\100
• python
• In Python
– Name the DNA file
– Read all lines and put them into a single string, 'dna'
• What does lines[0] have?
• What is happening here?
Parsing DNA Data Files
>>> fp = open('yeast01.txt')
>>> lines = fp.readlines()
>>> lines[0]
• Line-by-line processing is difficult
• Each line ends with '\n'
• How to concatenate all the lines into one LONG string, removing '\n'?
• Why lines[1:], not lines[0:]?
Parsing DNA Data Files
>>> dna = ''.join(lines[1:])
>>> dna[0:100]
>>> dna = dna.replace('\n','')
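The interactive steps above can be collected into one small function. This is a sketch, not the course's own code: read_dna is a made-up name, and it assumes the first line of the file is a header (description) line and that every other line holds sequence.

```python
def read_dna(path):
    """Return the sequence in a DNA text file as one string without newlines."""
    with open(path) as fp:        # 'with' closes the file automatically
        lines = fp.readlines()
    # Skip lines[0] (assumed to be a header) and strip each line ending.
    return ''.join(line.strip() for line in lines[1:])
```

With the file from this exercise in the current folder, dna = read_dna('yeast01.txt') followed by dna[0:100] should show the first 100 bases. Using strip() instead of replace('\n','') also removes a trailing '\r' if the file has Windows line endings.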
Base-Pair Distribution in a DNA String
• Write a Python function, baseFreq(dna)
– To count the number of 'A', 'T', 'C', 'G' in the concatenated dna string
• How about the distribution of pairs of bases (bimers)?
– ACTTAGG
– AC, CT, TT, TA, AG, GG
• How about trimers, tetramers, pentamers, hexamers, …?
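One possible way to handle bimers, trimers and longer k-mers with the same code is to slide a window of length k along the string. The sketch below uses collections.Counter from the standard library; kmer_counts is a made-up name for this example.

```python
from collections import Counter

def kmer_counts(dna, k):
    """Count every overlapping substring of length k in dna."""
    return Counter(dna[i:i+k] for i in range(len(dna) - k + 1))

counts = kmer_counts('ACTTAGG', 2)
print(sorted(counts.items()))
# [('AC', 1), ('AG', 1), ('CT', 1), ('GG', 1), ('TA', 1), ('TT', 1)]
```

Calling kmer_counts(dna, 3) gives trimers, kmer_counts(dna, 6) gives hexamers, and kmer_counts(dna, 1) gives the single-base counts.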
DNA Base Counting
def baseFreq(dna):
    # Return the fractions of 'A', 'C', 'T', 'G' (in that order) in dna.
    # dna must not be empty, or the division below fails.
    count = [0.0, 0.0, 0.0, 0.0]
    num = 0                       # total number of characters examined
    for base in dna:
        if base == 'A': count[0] = count[0]+1
        elif base == 'C': count[1] = count[1]+1
        elif base == 'T': count[2] = count[2]+1
        elif base == 'G': count[3] = count[3]+1
        num = num+1               # other characters still count toward the total
    for i in range(0,4):
        count[i] = count[i]/num   # turn each count into a fraction
    return count
Base Counting (in Notepad)
def baseFreq(dna):
    # Return the fractions of 'A', 'C', 'T', 'G' (in that order) in dna.
    count = [0.0, 0.0, 0.0, 0.0]
    num = 0
    for base in dna:
        if base == 'A': count[0] = count[0]+1
        elif base == 'C': count[1] = count[1]+1
        elif base == 'T': count[2] = count[2]+1
        elif base == 'G': count[3] = count[3]+1
        num = num+1
    for i in range(0,4):
        count[i] = count[i]/num
    return count

##### main() function #####
dataFile = input('Enter a DNA file name\n')
fp = open(dataFile)
lines = fp.readlines()
fp.close()
dnaStr = ''.join(lines[1:])        # skip the header line, lines[0]
dnaStr = dnaStr.replace('\n', '')
freq = baseFreq(dnaStr)
print(freq)