CS177 Lecture 13 Review/Summary of the Madej lectures Tom Madej 12.06.04


Page 1: CS177 Lecture 13 Review/Summary of the Madej lectures

CS177 Lecture 13 Review/Summary of the Madej lectures

Tom Madej 12.06.04

Page 2: CS177 Lecture 13 Review/Summary of the Madej lectures

Overview

• Basic biology.

• Protein/DNA sequence comparison.

• Protein structure comparison/classification.

• NCBI databases overview.

• Miscellaneous topics.

Page 4: CS177 Lecture 13 Review/Summary of the Madej lectures

Lodish et al. Molecular Cell Biology, W.H. Freeman 2000

Page 5: CS177 Lecture 13 Review/Summary of the Madej lectures

Protein/DNA sequence comparison

• What is the meaning of a sequence alignment?

• Scoring methods; amino acid substitution matrices, PSSMs.

• Basic computational methods; e.g. BLAST.

• Know how to run PSI-BLAST, interpret the results.

Page 6: CS177 Lecture 13 Review/Summary of the Madej lectures

Homology

“… whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology.”

E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function, Kluwer 2003

Page 8: CS177 Lecture 13 Review/Summary of the Madej lectures

A simple phylogenetic tree…

Page 9: CS177 Lecture 13 Review/Summary of the Madej lectures

Human hemoglobin and more distantly related globins

• Human and horse

• Human and fish

• Human and insect

• Human and bacteria

Page 10: CS177 Lecture 13 Review/Summary of the Madej lectures

Alignment notation: different notations for the same alignment!

VISDWNMPN-------MDGLECILVV----AANDGPMPQTRE

VISDWnm---pnMDGLECILVVaandgpmPQTRE

Page 11: CS177 Lecture 13 Review/Summary of the Madej lectures

Computing sequence alignments

• You must be able to recognize the “answer” (correct alignment) when you see it (scoring system).

• You must be able to find the answer; i.e. compute it efficiently.

Page 12: CS177 Lecture 13 Review/Summary of the Madej lectures

Scoring and computing alignments

• “Position independent” amino acid substitution tables; e.g. BLOSUM62.

• Dynamic programming algorithms that guarantee an optimal alignment, such as Smith-Waterman (local alignment); or fast heuristics such as BLAST.
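
The dynamic programming idea can be sketched in a few lines. This is a minimal illustration of Smith-Waterman local alignment, assuming a simple match/mismatch score and a linear gap penalty rather than BLOSUM62 with affine gaps; the function name and default parameters are chosen here for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Assumption: match/mismatch scoring and a linear gap penalty stand in
    for a real substitution matrix and affine gap costs.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("AAAMDGLECCC", "TTMDGLEGG")
print(score, top, bottom)
```

The table takes time proportional to the product of the two sequence lengths, which is why heuristics such as BLAST are preferred for database searches.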

Page 14: CS177 Lecture 13 Review/Summary of the Madej lectures

Score this alignment:

VISDWnm---pnMDGLECILVVaandgpmPQTRE

Use: BLOSUM62 matrix; gap opening penalty 10; gap extension penalty 1

(-1 + 4 - 2 - 3 - 3) - 10 - 1*11 + (-2 + 0 - 2 - 2 + 5) = -27
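
The arithmetic above can be checked with a short script. This is a sketch that takes the substitution scores exactly as listed on the slide and treats the gap cost as one opening penalty plus one extension unit per gap position, as the slide does; the function name is hypothetical, and other tools may count the first gap position differently.

```python
def alignment_score(substitution_scores, gap_lengths, gap_open=10, gap_extend=1):
    """Sum the per-column substitution scores and subtract affine gap costs.

    Assumption: each gap costs gap_open + gap_extend * length, following
    the convention used in the worked example.
    """
    total = sum(substitution_scores)
    for length in gap_lengths:
        total -= gap_open + gap_extend * length
    return total


# Substitution scores copied from the worked example (BLOSUM62 values).
left_block = [-1, 4, -2, -3, -3]
right_block = [-2, 0, -2, -2, 5]

slide_score = alignment_score(left_block + right_block, gap_lengths=[11])
print(slide_score)  # -27, matching the slide
```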

Page 15: CS177 Lecture 13 Review/Summary of the Madej lectures

BLAST (Basic Local Alignment Search Tool)

• Extremely fast, can be on the order of 50-100 times faster than Smith-Waterman.

• Method of choice for database searches.

• Statistical theory for significance of results (extreme value distribution).

• Heuristic; does not guarantee optimal results.

• Many variants, e.g. PHI-, PSI-, RPS-BLAST.

Page 16: CS177 Lecture 13 Review/Summary of the Madej lectures

Why database searches?

• Gene finding.

• Assigning likely function to a gene.

• Identifying regulatory elements.

• Understanding genome evolution.

• Assisting in sequence assembly.

• Finding relations between genes.

Page 17: CS177 Lecture 13 Review/Summary of the Madej lectures

Issues in database searches

• Speed.

• Relevance of the search results (selectivity).

• Recovering all information of interest (sensitivity).

  – The results depend on the search parameters, e.g. gap penalty, scoring matrix.

  – Sometimes searches with more than one matrix should be performed.

Page 18: CS177 Lecture 13 Review/Summary of the Madej lectures

E-values, P-values

• E-value (expectation value): the number of hits with at least the given score that you would expect to find by random chance in the search database.

• P-value, Probability value; this is the probability that a hit would attain at least the given score, by random chance for the search database.

• E-values are easier to interpret than P-values.

• If the E-value is small enough, e.g. no more than 0.10, then it is essentially a P-value.
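
The last point follows from the usual assumption that the number of chance hits is Poisson distributed, which gives P = 1 - exp(-E). A small sketch (the function name is chosen here for illustration):

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, assuming a Poisson model."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (0.001, 0.01, 0.10, 1.0, 10.0):
    print(f"E = {e:<6} P = {p_value_from_e_value(e):.6f}")
```

For E = 0.10 this gives P of about 0.095, so the two values nearly coincide; for E = 10 the P-value is close to 1 and no longer informative.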

Page 19: CS177 Lecture 13 Review/Summary of the Madej lectures

PSI-BLAST

• Position Specific Iterated BLAST

• As a first step runs a (regular) BLAST.

• Hits that cross the threshold are used to construct a position specific score matrix (PSSM).

• A new search is done using the PSSM to find more remotely related sequences.

• The last two steps are iterated until convergence.

Page 20: CS177 Lecture 13 Review/Summary of the Madej lectures

PSSM (Position Specific Score Matrix)

• One column per residue in the query sequence.

• Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column.

• There are difficulties; e.g. pseudo-counts are needed when only a few sequences are available, and the sequences must be weighted to compensate for redundancy.
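
As a rough illustration of how one PSSM column could be derived, the sketch below turns the residues observed in a single alignment column into log-odds scores. It assumes a uniform background frequency and a fixed additive pseudo-count, and it omits sequence weighting entirely; PSI-BLAST itself uses more refined estimates, so treat this only as a toy model.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_log_odds(column, pseudocount=1.0, background=None):
    """Log2-odds score for each residue type in one alignment column.

    Assumptions: uniform background unless one is supplied, a fixed
    additive pseudo-count, and no sequence weighting.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(c for c in column if c in background)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    scores = {}
    for aa in AMINO_ACIDS:
        freq = (counts[aa] + pseudocount) / total  # smoothed frequency
        scores[aa] = math.log2(freq / background[aa])
    return scores


# A column dominated by serine, with one threonine.
scores = column_log_odds("SSSSSSST")
print(round(scores["S"], 2), round(scores["T"], 2), round(scores["W"], 2))
```

Without the pseudo-count, any residue type not seen in the column would get a frequency of zero and a score of minus infinity.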

Page 21: CS177 Lecture 13 Review/Summary of the Madej lectures

Two key advantages of PSSMs

• More sensitive scoring because of improved estimates of probabilities for amino acids at specific positions.

• Describes the important motifs that occur in the protein family and therefore enhances the selectivity.

Page 22: CS177 Lecture 13 Review/Summary of the Madej lectures

Position Specific Substitution Rates

Active site serine

Weakly conserved serine

Page 23: CS177 Lecture 13 Review/Summary of the Madej lectures

Position Specific Score Matrix (PSSM)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 D  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Active site nucleophile

Serine scored differently in these two positions
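
To make the position dependence concrete, the sketch below looks up scores in two rows copied from the matrix on this slide (positions 211 and 216, both serine in the query). The helper name is hypothetical; only the numbers come from the slide.

```python
# Column order used by the PSSM on the slide.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Rows 211 and 216 copied from the slide; both query residues are serine.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Score for placing `residue` at the given query position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


for pos in (211, 216):
    print(pos, {aa: pssm_score(pos, aa) for aa in "SAT"})
```

At position 211 alanine scores as well as serine (4) and threonine nearly as well (3), whereas at position 216 serine scores 7 and every substitution is penalized. That pattern is what one would expect when contrasting a weakly conserved serine with a strongly conserved one, and a single BLOSUM62 entry for serine cannot express it.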

Page 24: CS177 Lecture 13 Review/Summary of the Madej lectures

PSI-BLAST key points

• The first PSSM is constructed from all hits that cross the significance threshold using “standard” BLAST.

• The search is then carried out with the PSSM to draw in new significant hits.

• If new hits are found then a new PSSM is constructed; these last two steps are iterated.

• The computation terminates upon “convergence”, i.e. when no new sequences are found to cross the significance threshold.
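
The control flow of these key points can be sketched as a loop. The three callables below are hypothetical stand-ins for the real BLAST search and PSSM construction, and the toy "database" exists only to make the sketch runnable; it says nothing about how PSI-BLAST scores hits.

```python
def iterate_until_convergence(query, search_with_query, search_with_pssm,
                              build_pssm, max_iterations=20):
    """Iterate search -> PSSM -> search until no new hits cross the threshold.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    included = set(search_with_query(query))       # first, a regular BLAST
    for _ in range(max_iterations):
        pssm = build_pssm(query, included)          # rebuild from all hits so far
        new_hits = set(search_with_pssm(pssm)) - included
        if not new_hits:                            # convergence
            break
        included |= new_hits
    return included


# Toy stand-ins: each sequence "finds" its listed relatives.
NEIGHBORS = {"query": {"A"}, "A": {"B"}, "B": {"C"}, "C": set(), "D": set()}


def toy_search_with_query(query):
    return NEIGHBORS[query]


def toy_build_pssm(query, included):
    return frozenset(included) | {query}            # the "profile" is just its members


def toy_search_with_pssm(pssm):
    return set().union(*(NEIGHBORS[member] for member in pssm))


result = iterate_until_convergence(
    "query", toy_search_with_query, toy_search_with_pssm, toy_build_pssm)
print(sorted(result))
```

In the toy run, B and C are reached only through earlier hits, which mirrors how a PSSM can draw in more remotely related sequences; D is unrelated and is never included.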

Page 25: CS177 Lecture 13 Review/Summary of the Madej lectures

Protein structure comparison/classification

• Protein secondary structure elements.

• Supersecondary structures (simple structure motifs).

• Folds and domains.

• Comparing structures (VAST).

• Superfolds.

• Fold classification (SCOP).

• Conserved Domain Database (CDD).

Page 26: CS177 Lecture 13 Review/Summary of the Madej lectures

α-helix (3chy)

backbone atoms with sidechains

Page 27: CS177 Lecture 13 Review/Summary of the Madej lectures

Parallel β-strands (3chy)

Page 28: CS177 Lecture 13 Review/Summary of the Madej lectures

Anti-parallel β-strands (1hbq)

Page 29: CS177 Lecture 13 Review/Summary of the Madej lectures

Higher level organization

• A single protein may consist of multiple domains. Examples: 1liy A, 1bgc A. The domains may or may not perform different functions.

• Proteins may form higher-level assemblies. Useful for complicated biochemical processes that require several steps, e.g. processing/synthesis of a molecule. Example: 1l1o chains A, B, C.

Page 30: CS177 Lecture 13 Review/Summary of the Madej lectures

Supersecondary structures

• β-hairpin

• α-hairpin

• βαβ-unit

• β4 Greek key

• βα Greek key

Page 31: CS177 Lecture 13 Review/Summary of the Madej lectures

Supersecondary structure: simple units

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

Page 32: CS177 Lecture 13 Review/Summary of the Madej lectures

Supersecondary structure: Greek key motifs

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

Page 33: CS177 Lecture 13 Review/Summary of the Madej lectures

Protein folds

• There is a continuum of similarity!

• Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing.

• Fold classification: To get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).
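
Sequence identity thresholds such as 25% depend on how identity is computed. The sketch below uses one common convention, identical residues divided by the number of aligned (non-gap) column pairs; other conventions divide by the alignment length or the shorter sequence length, so the function is an illustration rather than a standard definition.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Assumption: the denominator is the number of aligned residue pairs.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


print(percent_identity("AC-GT", "ACTGA"))  # 3 of 4 aligned pairs match
```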

Page 34: CS177 Lecture 13 Review/Summary of the Madej lectures

Vector Alignment Search Tool (VAST)

• Fast structure comparison based on representing SSEs by vectors.

• A measure of statistical significance (VAST E-value) is computed (very differently from a BLAST E-value).

• VAST structure neighbor lists useful for recognizing structural similarity.
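
The idea of representing an SSE by a vector can be illustrated as follows. This sketch only shows the representation (a vector from the start to the end of a helix or strand) and one simple geometric comparison; it is not the VAST algorithm or its statistics, and the coordinates are made up for the example.

```python
import math


def sse_vector(start, end):
    """Vector from the first to the last point of a secondary structure element."""
    return tuple(e - s for s, e in zip(start, end))


def angle_degrees(u, v):
    """Angle between two SSE vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


# Hypothetical endpoint coordinates (in angstroms) for two strands and a helix.
strand_1 = sse_vector((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
strand_2 = sse_vector((10.0, 4.8, 0.0), (0.0, 4.8, 0.0))
helix = sse_vector((0.0, 0.0, 5.0), (0.0, 12.0, 5.0))

print(angle_degrees(strand_1, strand_2))  # anti-parallel pair
print(angle_degrees(strand_1, helix))     # perpendicular pair
```

Reducing each SSE to a single vector makes a structure small enough to compare against many others quickly, at the cost of ignoring residue-level detail.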

Page 35: CS177 Lecture 13 Review/Summary of the Madej lectures

Superfolds (Orengo, Jones, Thornton)

• Distribution of fold types is highly non-uniform.

• There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. There are many examples of “isolated” fold types.

• Superfolds are characterized by a wide range of sequence diversity and span a range of dissimilar functions.

• The evolutionary relationships among the superfolds are an open research question, i.e. do they arise by divergent or convergent evolution?

Page 36: CS177 Lecture 13 Review/Summary of the Madej lectures

Superfolds and examples

• Globin 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin

• α-up-down 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3

• Trefoil 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin inhibitor

• TIM barrel 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco

• OB fold 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit

• α/β doubly-wound 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY

• Immunoglobulin 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin

• UB αβ roll 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G

• Jelly roll 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin

• Plaitfold (Split αβ sandwich) 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier

Page 37: CS177 Lecture 13 Review/Summary of the Madej lectures

SCOP (Structural Classification of Proteins)

• http://scop.mrc-lmb.cam.ac.uk/scop/

• Levels of the SCOP hierarchy:

  – Family: clear evolutionary relationship

  – Superfamily: probable common evolutionary origin

  – Fold: major structural similarity

Page 39: CS177 Lecture 13 Review/Summary of the Madej lectures

Bioinformatics databases

• Entrez is by far the most useful, because of the links between the individual databases, e.g. literature, sequence, structure, taxonomy, etc.

• Other specialty databases available on the internet can also be very useful, of course!

Page 40: CS177 Lecture 13 Review/Summary of the Madej lectures

[Figure: Entrez diagram, “Links Between and Within Nodes”. Nodes: Genomes, Taxonomy, PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structures. Link labels: Word weight, VAST, BLAST, Phylogeny, Computational.]

Page 41: CS177 Lecture 13 Review/Summary of the Madej lectures

Entrez queries

• Be able to formulate queries using index terms (Preview/Index), and limits.

Page 43: CS177 Lecture 13 Review/Summary of the Madej lectures

Exercises!

• How many protein structures are there that include DNA and are from bacteria?

• In PubMed, how many articles are there from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000?

• Notice that the results are not 100% accurate!

• In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse?
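
The PubMed exercise can also be phrased as a fielded query string. The sketch below only builds a URL for the NCBI E-utilities ESearch service and does not send it; the field tags and the date-range syntax are written from memory of PubMed's search syntax and should be checked against the current Entrez help before relying on the counts.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(db, term):
    """Build (but do not send) an ESearch URL that asks only for the hit count."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "rettype": "count"})


# Assumed field tags: [Journal], [Title/Abstract], [All Fields], [dp].
base_term = 'Science[Journal] AND Alzheimer[Title/Abstract] AND "amyloid beta"[All Fields]'
since_2000 = base_term + " AND 2000:3000[dp]"

print(esearch_count_url("pubmed", base_term))
print(esearch_count_url("pubmed", since_2000))
```

Pasting the same term into the PubMed search box should give the same count as the URL, which is a quick way to check a query built with Preview/Index.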

Page 44: CS177 Lecture 13 Review/Summary of the Madej lectures

P53 tumor suppressor protein

• Li-Fraumeni syndrome; having only one functional copy of p53 predisposes to cancer.

• Mutations in p53 are found in most tumor types.

• p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein, cdk2. This prevents the cell from progressing through the cell cycle.

Page 45: CS177 Lecture 13 Review/Summary of the Madej lectures

G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21 217-228.

Page 46: CS177 Lecture 13 Review/Summary of the Madej lectures

Exercise!

• Use Cn3D to investigate the binding of p53 to DNA.

• Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).

Page 47: CS177 Lecture 13 Review/Summary of the Madej lectures

Miscellaneous topics

• BLAST a sequence against a genome; locate hits on chromosomes with map viewer.

• Obtain genomic sequence with map viewer.

• Spidey to predict intron/exon structure.

• How sequence variations can affect protein structure/function.

Page 48: CS177 Lecture 13 Review/Summary of the Madej lectures

“EST exercise” summary

• BLAST the EST (or other DNA seq) against the genome.

• From the BLAST output you can get the genomic coordinates of any nucleotide differences.

• Use map viewer to locate the hit on a chromosome; assume the hit is in the region of a gene.

• By following the gene link you can get an accession for mRNA.

• By using the “dl” link you can get an accession for the genomic sequence.

• Use “spidey” with the mRNA and genomic sequence to locate changed residues in the protein.

Page 49: CS177 Lecture 13 Review/Summary of the Madej lectures

“EST exercise” summary (cont.)

• From the gene report you can follow the protein link, and then “Blink”.

• From the BLAST link page you can get to CDD and related structures.

• Since you know where the changed residues are, you can use the structures to study what effect the changes might have on the function of the protein.

Page 50: CS177 Lecture 13 Review/Summary of the Madej lectures

Gene variants that can affect protein function

• Mutation to a stop codon; truncates the protein product!

• Insertion/deletion of multiple bases; changes the sequence of amino acid residues.

• Single point change could alter folding properties of the protein.

• Single point change could affect the active site of the protein.

• Single point change could affect an interaction site with another molecule.
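
The first three variant types can be demonstrated on a toy coding sequence. The sketch below uses the standard genetic code; the 12-base sequence and the helper names are invented for illustration, and whether a given change matters for folding or for an active site cannot be read from the sequence alone.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def substitute(dna, index, base):
    """Return the sequence with a single base replaced at `index` (0-based)."""
    return dna[:index] + base + dna[index + 1:]


reference = "ATGTGGAAAGAT"               # hypothetical sequence: M W K D
nonsense = substitute(reference, 5, "A")  # TGG -> TGA, a stop codon
missense = substitute(reference, 6, "G")  # AAA -> GAA, K -> E
frameshift = reference[:3] + reference[4:]  # delete one base after ATG

for name, seq in [("reference", reference), ("nonsense", nonsense),
                  ("missense", missense), ("frameshift", frameshift)]:
    print(f"{name:<10} {seq:<12} {translate(seq)}")
```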