Lecture 4 BNFO 235 Usman Roshan. IUPAC Nucleic Acid symbols

IUPAC Nucleic Acid symbols

IUPAC Amino Acid symbols

Genetic code

Splitting and joining strings

• split: splits a string by a regular expression and returns an array
  – @s = split(/,/);
  – @s = split(/\s+/);

• join: joins the elements of an array and returns a string (the opposite of split)
  – $seq = join("", @pieces);
  – $seq = join("X", @pieces);
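A minimal sketch of split and join working together. The input strings and variable names here are made up for illustration and are not from the lecture.

```perl
use strict;
use warnings;

# Hypothetical comma-separated record of codons.
my $line   = "ATG,CCA,TGA";
my @codons = split(/,/, $line);      # one array element per field

# Hypothetical line with mixed spaces and a tab between sequences.
my $text = "AAGACTT  TAGGCCTT\tGGCTT";
my @seqs = split(/\s+/, $text);      # split on runs of whitespace

# join is the inverse: glue the pieces back together.
my $seq    = join("",  @codons);     # no separator
my $marked = join("X", @codons);     # "X" between the pieces

print "$seq\n";
print "$marked\n";
print scalar(@seqs), " sequences\n";
```

Passing the string explicitly as the second argument to split avoids relying on the default variable $_, which is what the one-argument form on the slide operates on.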

Searching and substitution

• $x =~ /$y/ ---- true if expression $y found in $x

• $x =~ /ATG/ --- true if the start codon ATG is found in $x

• $x !~ /GC/ --- true if GC not found in $x

• $x =~ s/T/U/g --- replace all T’s with U’s

• $x =~ s/g/G/g --- convert all lower case g to upper case G
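The matching and substitution operators above can be tried on a short sequence. This is only a sketch: the sequence and variable names are hypothetical, and uc is used as one way to upper-case every letter rather than only g.

```perl
use strict;
use warnings;

my $dna = "atgGCCTTAtga";             # hypothetical mixed-case DNA string

# Substitution on a copy: only lower case g becomes G.
(my $upper_g = $dna) =~ s/g/G/g;

# uc upper-cases every letter, not just g.
my $all_upper = uc $dna;

# Pattern matching: =~ is true on a match, !~ is true when there is none.
my $has_atg = ($all_upper =~ /ATG/) ? 1 : 0;
my $no_gc   = ($all_upper !~ /GC/)  ? 1 : 0;

# Transcribe DNA to RNA by replacing every T with U.
(my $rna = $all_upper) =~ s/T/U/g;

print "$upper_g\n$all_upper\n$rna\n";
print "contains ATG: $has_atg, lacks GC: $no_gc\n";
```

The (my $copy = $original) =~ s/.../.../ idiom leaves the original string unchanged, which is convenient when both the DNA and the RNA form are needed later.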

DNA regular expressions

Taken from Jagota’s Perl for Bioinformatics

DNA Sequence Evolution

[Figure: a DNA sequence evolving along a tree; gaps are shown as _]

-3 mil yrs: AAGACTT

-2 mil yrs: T_GACTT, AAGGCTT

-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT

today: ACCTT (Cat), ACACTTC (Lion), TAGCCCTTA (Monkey), TAGGCCTT (Human), GGCTT (Mouse)

[Figure: the present-day sequences aligned, shown with the ancestral sequences AAGACTT, AAGGCTT and T_GACTT]

TAGGCCTT (Human)

TAGCCCTTA (Monkey)

A_C_CTT (Cat)

A_CACTTC (Lion)

_G_GCTT (Mouse)

Comparative Bioinformatics

• Fundamental notion of biology: all life is related by an unknown evolutionary Tree of Life.

• Therefore, if we know something about one species we can make inferences about other ones.

• Also, by comparing multiple species we can make inferences about sets of species.

• How do we compare DNA or protein sequences of two different species?

Comparative Bioinformatics

• We need to know how often mutations such as A to T or A to C occur.

• To determine this we manually create a set of “true” alignments and estimate the likelihood of A changing to C, for example, by counting the number of times A changes to C and computing related statistics.

• Now we have a realistic “scoring matrix” which can be used to evaluate how closely related two species are based on their DNA.
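The counting step can be sketched as follows. The aligned pairs are hard-coded and hypothetical, the first sequence of each pair is treated as the "from" sequence (an assumption), and columns containing the gap character _ are skipped. The plain relative frequencies printed here are only the raw counts normalized; turning them into an actual scoring matrix requires the further statistics the slide alludes to.

```perl
use strict;
use warnings;

# Hypothetical "true" alignments: pairs of equal-length aligned sequences.
my @pairs = (
    [ "AAGACTT", "AAGGCTT" ],
    [ "AAGACTT", "T_GACTT" ],
);

my %count;        # $count{$from}{$to} = number of aligned columns from -> to
my $total = 0;    # number of gap-free columns seen

for my $pair (@pairs) {
    my ($s, $t) = @$pair;
    die "sequences differ in length\n" unless length($s) == length($t);

    for my $i (0 .. length($s) - 1) {
        my $x = substr($s, $i, 1);
        my $y = substr($t, $i, 1);
        next if $x eq "_" or $y eq "_";   # ignore columns with a gap
        $count{$x}{$y}++;
        $total++;
    }
}

# Report each observed pair as a fraction of all gap-free columns.
for my $from (sort keys %count) {
    for my $to (sort keys %{ $count{$from} }) {
        printf "%s -> %s: %d (%.3f)\n",
            $from, $to, $count{$from}{$to}, $count{$from}{$to} / $total;
    }
}
```

A hash of hashes keyed by nucleotide is used here instead of a numeric two dimensional array so that entries can be looked up directly by letter, for example $count{A}{G}.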

Problems

• Write a Perl subroutine called readmatrix that reads a DNA substitution scoring matrix from a file called “dna.txt” and stores it in a two dimensional array. The format of the scoring matrix in the file is

A C G T
A 10 3 1 4
C 3 12 3 5
G 1 3 15 2
T 4 5 2 11

• Write a Perl subroutine called translate that takes an mRNA sequence, converts it into a protein sequence, and returns the protein sequence.

Problems

• Write a Perl program that reads in a substitution scoring matrix from a file called “matrix.txt”, reads in a pair of DNA sequences of equal length from a file called “dna.txt”, and returns the total substitution score between the two sequences.

• Write a Perl program that reads pairs of DNA sequences from a file called “DNApairs.txt” and estimates the frequency of nucleotide substitutions.
