Brian Kidd October 21, 2010 Computational Biology Tools Lecture 8: Protein Sequence Databases and Analysis Tools


Page 1

Brian Kidd

October 21, 2010

Computational Biology Tools

Lecture 8: Protein Sequence Databases and Analysis Tools

Page 2

see survey 04 on eCommons (tests and quizzes)

Page 3

Questions/Concerns from Last Time

Page 4

Overview

1. Protein Sequence Databases
• SwissProt, UniProt, NCBI

2. Protein Analysis Tools
• Linear sequence analysis
• 3D structure analysis

3. Finding Distant Homologs

Page 5

Sequence Databases

SwissProt (ExPASy)
• highly curated, updated less frequently

TrEMBL (ExPASy)
• translated nucleotide sequences
• automatic translation, fast but less info

UniProt (EBI)
• Universal Protein Resource
• combines SwissProt, TrEMBL, and PIR sequences

Page 6

Sequence Analysis Sites

For protein sequences and tools to analyze them, the two major centers are:

ExPASy: Expert Protein Analysis System
• many tools – http://ca.expasy.org/tools
• Databases: SwissProt, TrEMBL

NCBI: Entrez Protein and Domains

PIR: Protein Information Resource (folded into the UniProt consortium; no longer a major resource site)

Page 7

More Sequence Databases

Non-redundant
• NR (NCBI), UniRef (PIR/EBI)

Reference
• RefSeq (NCBI) – reannotated by NCBI

Domains/Families
• Pfam – protein families (Sanger Center + mirror sites)
• SMART – Simple Modular Architecture Research Tool
• CDD – Conserved protein Domain Database (NCBI); combines the Pfam, SMART, and COGs databases
• InterPro – based on UniProt, at EMBL-EBI

Many others...

Page 8

Linear Sequence Analysis

What can you learn from a (single) protein sequence?

Calculate its physical properties
• Molecular weight (MW), isoelectric point (pI), amino acid content, hydropathy (hydrophobic vs. hydrophilic regions)
• Does not take into account post-translational modifications, so calculations are usually not 100% accurate

Identify sequence motifs and families
• Signal sequences, transmembrane domains, coiled-coils, post-translational modification sites, secondary structure (non-homologous)
• Domains, functional motifs (homologous)
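As a rough illustration of the physical-property calculations listed on this slide, here is a minimal Python sketch. The function names are made up for this example. It assumes commonly tabulated average residue masses and the Kyte-Doolittle hydropathy scale, and it ignores post-translational modifications, so the results are estimates only. The isoelectric point is left out because it depends on the choice of pKa values.

```python
from collections import Counter

# Average residue masses in daltons (commonly tabulated values; treat as approximate).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one water is added for the free N- and C-termini

# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(seq: str) -> float:
    """Estimated average molecular weight of an unmodified peptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def amino_acid_content(seq: str) -> dict:
    """Fraction of the sequence made up by each amino acid that occurs in it."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def hydropathy_profile(seq: str, window: int = 9) -> list:
    """Sliding-window mean hydropathy; high values suggest hydrophobic stretches."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    peptide = "MKWVWALLLLAAWAAA"  # illustrative peptide only
    print(round(molecular_weight(peptide), 1))
    print(amino_acid_content(peptide))
    print([round(v, 2) for v in hydropathy_profile(peptide)])
```

The window size for the hydropathy profile is a free parameter; longer windows (around 19 residues) are often used when looking for transmembrane segments.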

Page 9

Protein Sequence Analysis Tools

ExPASy Proteomics Tools
• Calculate physical properties
• Predict sequence motifs
• Signal sequences, post-translational modifications
• What ExPASy calls “Topology”: localization, TM domains
• Search pattern and profile collections

PredictProtein and Meta-PP
• Meta-server providing access to many servers with one submission form

Page 10

Structure Databases

Experimental
• PDB: Protein Data Bank

Families
• SCOP, CATH, Dali Database, HOMSTRAD

Models/Predictions
• ModBase
• SwissModel

NOTE: all of these databases are described in the January Database issue of Nucleic Acids Research (NAR), which includes links to the databases

Page 11

3D Structure Analysis

Visualization
• Domain structure, global fold, active sites, point mutations, SNPs, splice sites

Evaluate structure “quality”

Calculate physical properties
• Surface areas, distances, side-chain conformations, contact maps
• Structural alignment (i.e., similarity to other structures)

Prediction
• Physical properties: binding affinity, pKa’s, stability, specificity
• 3D structure (homology modeling, fold recognition, de novo)
• Advanced: protein design, “docking” of two proteins, active site modeling

Page 12

Secondary Structure Prediction

Three good methods:
• PSIPRED
• SAM-T02/T04/T06
• PHD (PredictProtein)

Compare a couple of methods

Use the three-state predictions (helix, strand, coil)

Page 13

Information Flow

Sequence ⇔ Structure ⇔ Function

Evolutionary selection operates on function

Structure is more closely linked to function than is sequence, so structure tends to be more conserved than sequence

Need to search farther in sequence space to find proteins with related structures and functions

Page 14

Detecting Remote Similarities

Remote similarities can more easily be detected by comparing protein sequences

DNA sequences change faster than protein sequences (wobble position, redundant codons)

A 4-letter DNA code vs. a 20-letter amino acid code means that matches by chance are more likely in DNA ➜ the protein code has more information in it!
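The alphabet-size argument can be made concrete with a little arithmetic. The sketch below assumes every letter is equally likely, which is a simplification since real base and residue frequencies are uneven. The function names are invented for illustration.

```python
import math


def chance_match_probability(alphabet_size: int) -> float:
    """Probability that two random letters match, assuming uniform frequencies."""
    return 1.0 / alphabet_size


def bits_per_position(alphabet_size: int) -> float:
    """Maximum information carried by one position, in bits."""
    return math.log2(alphabet_size)


def chance_exact_match(alphabet_size: int, length: int) -> float:
    """Probability that two random strings of the given length are identical."""
    return chance_match_probability(alphabet_size) ** length


if __name__ == "__main__":
    for name, size in (("DNA", 4), ("protein", 20)):
        print(
            name,
            chance_match_probability(size),
            round(bits_per_position(size), 2),
            chance_exact_match(size, 10),
        )
```

Under this uniform assumption a single DNA position matches by chance one time in four, while a protein position matches one time in twenty, so a protein match of the same length is far less likely to be accidental.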

Page 15

Detecting Homology

[Figure: search methods and the kind of similarity they detect, arranged along an axis of evolutionary distance from NEAR to FAR. Methods: BLASTn, BLASTp, PSIBLAST, Fold Recognition. Similarity: DNA Sequence, Protein Sequence, Protein Structure.]

Page 16

Similar Sequences Share Similar Structures

Compare all pairs of proteins in the same “family” (pairs for which homology is very probable)

Homologs do not necessarily share much sequence similarity

Proteins with > 30% sequence identity almost always share the same fold

Sauder et al., Proteins 40:6-22 (2000).

[Figure: plot for protein pairs in the same Family; vertical axis: More structural similarity; legend: All others, Immunoglobulins.]
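The 30% figure on this slide refers to percent identity computed over an alignment. A minimal sketch of that calculation follows. It assumes the two sequences are already aligned with "-" as the gap character, and it counts only columns where both sequences have a residue. Other definitions use a different denominator, so this is one convention among several.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns that contain no gap.

    Assumption: both inputs come from the same pairwise alignment and
    therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Toy alignment: four identical columns out of five ungapped columns.
    print(percent_identity("ACDE-F", "ACDQGF"))
```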

Page 17

PSIBLAST

Position-Specific Iterated BLAST

Use BLASTp and identify related sequences (E-value threshold)

Creates a scoring matrix specialized for your sequence

Allows more distantly related sequences to be identified

Steps:

Create a profile from related sequences

Search for related sequences using this profile

Repeat
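The profile, search, repeat loop can be sketched with a toy model. This is not how PSI-BLAST itself is implemented. The sketch assumes ungapped sequences of equal length, a uniform background distribution, and a raw score cutoff standing in for the E-value threshold. All names and the example sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(seqs, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (toy model)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        n = len(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount * background) / (n + pseudocount)) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Sum of the position-specific scores for an ungapped sequence."""
    return sum(row[aa] for row, aa in zip(profile, seq))


def iterative_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    included = [query]
    for _ in range(max_rounds):
        profile = build_profile(included)
        hits = [s for s in database if profile_score(profile, s) >= threshold]
        new_included = [query] + [s for s in hits if s != query]
        if set(new_included) == set(included):
            break  # converged: no sequences gained or lost
        included = new_included
    return included


if __name__ == "__main__":
    database = ["MKWVWG", "MRWVYG", "MRFIYG", "PPPPPP"]
    print(iterative_search("MKWVWA", database, threshold=10.0, max_rounds=1))
    print(iterative_search("MKWVWA", database, threshold=10.0))
```

In this toy run the more divergent sequence MRWVYG scores below the cutoff against the query alone but is picked up once the profile includes the first hit, which is the effect the slide describes.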

Page 18

BLASTing the Protein Universe

Page 19

Evolution and the Protein Universe

Page 20

PSIBLASTing the Protein Universe

Page 21

Sequence Profiles

Align all sequences and count how often each amino acid occurs at every position

Combine with prior information about substitution frequencies using pseudo-counts from BLOSUM62

Convert to log odds score to give a Position-Specific Scoring Matrix (PSSM)
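The three steps above (count, add pseudo-counts, convert to log odds) can be sketched as follows. One simplification is made deliberately: the pseudo-counts here are spread according to a uniform background rather than derived from BLOSUM62 as the slide describes. The function names and the toy alignment are assumptions for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, background=None, pseudocount=1.0):
    """Build a log-odds PSSM from an ungapped alignment of equal-length sequences.

    Simplification: pseudo-counts follow the background distribution, not
    BLOSUM62 substitution frequencies.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Step 1: count how often each amino acid occurs in this column.
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values())
        row = {}
        for aa in AMINO_ACIDS:
            # Step 2: mix observed counts with pseudo-counts.
            freq = (counts[aa] + pseudocount * background[aa]) / (total + pseudocount)
            # Step 3: log-odds score relative to the background frequency.
            row[aa] = math.log2(freq / background[aa])
        pssm.append(row)
    return pssm


def score_sequence(pssm, seq):
    """Score an ungapped sequence of the same length as the PSSM."""
    return sum(row[aa] for row, aa in zip(pssm, seq))


if __name__ == "__main__":
    pssm = build_pssm(["MKW", "MKW", "MRW"])
    print({aa: round(pssm[1][aa], 2) for aa in "KRA"})
    print(round(score_sequence(pssm, "MKW"), 2), round(score_sequence(pssm, "AAA"), 2))
```

Without pseudo-counts, any amino acid never seen in a column would get a score of minus infinity; the pseudo-count keeps unseen residues possible but penalized.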

Page 22

A Sample PSSM

Pos Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1  M   -1  -2  -2  -3  -2  -1  -2  -3  -2   1   2  -2   6   0  -3  -2  -1  -2  -1   1
  2  K   -1   1   0   1  -4   2   4  -2   0  -3  -3   3  -2  -4  -1   0  -1  -3  -2  -3
  3  W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  4  V    0  -3  -3  -4  -1  -3  -3  -4  -4   3   1  -3   1  -1  -3  -2   0  -3  -1   4
  5  W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
  6  A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
  7  L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
  8  L   -1  -3  -3  -4  -1  -3  -3  -4  -3   2   2  -3   1   3  -3  -2  -1  -2   0   3
  9  L   -1  -3  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   2
 10  L   -2  -2  -4  -4  -1  -2  -3  -4  -3   2   4  -3   2   0  -3  -3  -1  -2  -1   1
 11  A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 12  A    5  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0
 13  W   -2  -3  -4  -4  -2  -2  -3  -4  -3   1   4  -3   2   1  -3  -3  -2   7   0   0
 14  A    3  -2  -1  -2  -1  -1  -2   4  -2  -2  -2  -1  -2  -3  -1   1  -1  -3  -3  -1
 15  A    2  -1   0  -1  -2   2   0   2  -1  -3  -3   0  -2  -3  -1   3   0  -3  -2  -2
 16  A    4  -2  -1  -2  -1  -1  -1   3  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2  -1
...
 37  S    2  -1   0  -1  -1   0   0   0  -1  -2  -3   0  -2  -3  -1   4   1  -3  -2  -2
 38  G    0  -3  -1  -2  -3  -2  -2   6  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
 39  T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -3  -2   0
 40  W   -3  -3  -4  -5  -3  -2  -3  -3  -3  -3  -2  -3  -2   1  -4  -3  -3  12   2  -3
 41  Y   -2  -2  -2  -3  -3  -2  -2  -3   2  -2  -1  -2  -1   3  -3  -2  -2   2   7  -1
 42  A    4  -2  -2  -2  -1  -1  -1   0  -2  -2  -2  -1  -1  -3  -1   1   0  -3  -2   0

From Pevsner, Bioinformatics and Functional Genomics (ISBN 0-471-21004-8). © 2003 John Wiley & Sons, Inc.

Page 23

PSSM Corruption

False positives can occur in a PSIBLAST search if the PSSM becomes corrupted
• Once a “bad” sequence is included in the PSSM, the search veers off course and cannot be corrected

How do PSSMs become corrupted?
• One sequence that is not homologous to the query gets included in the alignment used to make the PSSM
• The PSSM now looks a bit like this spurious sequence and will match well to other similar spurious sequences
• The additional spurious sequences that are detected are included in the new alignment, amplifying the corrupting signal

Page 24

Preventing PSSM Corruption

Filter biased-composition regions (low-complexity filter)

Use better methods to estimate the E-value (composition-based statistics)

Make the threshold for including a sequence in the profile more stringent: lower the E-value cutoff from 0.001 (default) to a value such as 0.0001

Manually inspect the output from each iteration and remove suspicious hits

Page 25

PHI-BLAST

Pattern-Hit-Initiated BLAST

What other proteins contain a particular sequence pattern and are similar in the vicinity of this pattern?

May filter out cases where pattern matches randomly and doesn’t indicate homology

Combines matching of regular expressions with local alignments surrounding the match

Pattern matching uses ScanProsite syntax

Sequence similarity search is like PSIBLAST

Page 26

Syntax Rules for Patterns

[] any one of the listed characters allowed

{} any character except the listed ones allowed

x(n) n positions in which any residue is allowed

x(n,m) n to m positions in which any residue is allowed

Examples:

E[LIV]X(0,3)PP[STG]
• matches: ELPPS, EVIPPG
• does not match: ELIVPPPPG

GXW[YF][EA][IVLM]
• matches: GTWFEL, GKWYAI
• does not match: GGWYFEI, GWYEI
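These rules map closely onto ordinary regular expressions, so the examples can be checked with a small converter. This is a minimal sketch covering only the four constructs listed on the slide. It does not handle an X inside brackets or any other PROSITE features, and the function names are made up for illustration.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate the slide's pattern syntax into a Python regular expression."""
    out = []
    for ch in pattern:
        if ch in "xX":
            out.append(".")      # any residue
        elif ch == "{":
            out.append("[^")     # any residue except the listed ones
        elif ch == "}":
            out.append("]")
        elif ch == "(":
            out.append("{")      # repeat count: (n) or (n,m)
        elif ch == ")":
            out.append("}")
        elif ch == "-":
            continue             # optional separators carry no meaning
        else:
            out.append(ch)       # literal residues and [...] sets pass through
    return "".join(out)


def pattern_matches(pattern: str, sequence: str) -> bool:
    """True if the pattern occurs anywhere in the sequence."""
    return re.search(pattern_to_regex(pattern), sequence) is not None


if __name__ == "__main__":
    for seq in ("ELPPS", "EVIPPG", "ELIVPPPPG"):
        print(seq, pattern_matches("E[LIV]X(0,3)PP[STG]", seq))
    for seq in ("GTWFEL", "GKWYAI", "GGWYFEI", "GWYEI"):
        print(seq, pattern_matches("GXW[YF][EA][IVLM]", seq))
```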

Page 27

Gene Discovery with BLAST

1. Start with the sequence of a known protein

2. tblastn: search a DNA database (e.g., HTGS, dbEST, or genomic sequence from a specific organism)

3. Inspect: find matches...
• to DNA encoding known proteins
• to DNA encoding related (novel!) proteins
• to false positives

4. blastx or blastp against nr: search your DNA or protein against a protein database (nr) to confirm you have identified a novel gene
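tblastn compares a protein query against a nucleotide database translated in all six reading frames. The sketch below shows only that translation step, using the standard genetic code. It is a conceptual illustration, not BLAST code, and the function names are invented for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Translations of the three forward and three reverse reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGAAATGGTAA").items():
        print(frame, protein)
```

A protein query can then be compared against each of the six translated frames, which is why tblastn can find coding regions in unannotated DNA such as ESTs or draft genomic sequence.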

Page 28

Essentials at this Point

Accessing literature and sequence information from various databases (NCBI and UCSC)

BLAST (all variants)

Pairwise sequence analysis tools and algorithms

Single-sequence analysis tools for DNA: EMBOSS, ORFs, restriction enzymes, and primers

Protein databases and analysis tools

PSI and PHI BLASTs

Page 29

For Next Time

Reading
• B4D Chapter 9 – Building a Multiple Sequence Alignment
• B4D Chapter 10 – Editing and Publishing Alignments

Problem set
• Continue working on PS #2 (due Friday, October 29)

http://www.soe.ucsc.edu/classes/bme110/Fall10/calendar.html