Computational Biology Tools - University of California ... · Computational Biology Tools ......

Brian Kidd

October 21, 2010

Computational Biology Tools

Lecture 8: Protein Sequence Databases and Analysis Tools

Questions/Concerns from Last Time

See survey 04 on eCommons (tests and quizzes)

Overview

1. Protein Sequence Databases
• SwissProt, UniProt, NCBI

2. Protein Analysis Tools
• Linear sequence analysis
• 3D structure analysis

3. Finding Distant Homologs

Sequence Databases

SwissProt (ExPASy)

highly curated, updated less frequently

TrEMBL (ExPASy)

translated nucleotide sequences

automatic translation, fast but less info

UniProt (EBI)

Universal Protein Resource

Combines SwissProt, TrEMBL, PIR sequences

Sequence Analysis Sites

For protein sequences and tools to analyze them, the two major centers are:

ExPASy: Expert Protein Analysis System

many tools – http://ca.expasy.org/tools

Databases: SwissProt, TrEMBL

PIR: Protein Information Resource (folded into UniProt consortium; no longer major resource site)

NCBI: Entrez Protein and Domains

More Sequence Databases

Non-redundant

NR (NCBI), UniRef (PIR/EBI)

Reference

RefSeq (NCBI) – reannotated by NCBI

Domains/Families

Pfam – protein families (Sanger center + mirror sites)

SMART – Simple Modular Architecture Research Tool

CDD – Conserved protein Domain Database (NCBI), combines Pfam, SMART, and COGs databases

InterPro – (based on UniProt, at EMBL-EBI)

Many others...

Linear Sequence Analysis

What can you learn from a (single) protein sequence?

Calculate its physical properties

Molecular weight (MW), isoelectric point (pI), amino acid content, hydropathy (hydrophobic vs. hydrophilic regions)

Does not take into account post-translational modifications, so the calculated values are approximate

Identify sequence motifs and families

Signal sequences, transmembrane domains, coiled-coils, post-translational modification sites, secondary structure (non-homologous)

Domains, functional motifs (homologous)
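The physical properties listed on this slide can be estimated directly from the sequence. The following is a minimal sketch, not the method of any particular ExPASy tool: it assumes standard average residue masses and the Kyte-Doolittle hydropathy scale, the function names are hypothetical, and pI is left out because it requires a pKa model. Like the web tools, it ignores post-translational modifications.

```python
from collections import Counter

# Average residue masses in daltons (free amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(seq):
    """Approximate average molecular weight of an unmodified peptide, in Da."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def composition(seq):
    """Fraction of the sequence made up by each amino acid that occurs in it."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def hydropathy_profile(seq, window=9):
    """Mean Kyte-Doolittle hydropathy in each sliding window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    peptide = "MKWVWALLLLAAWAAA"
    print(round(molecular_weight(peptide), 1))
    print(composition(peptide))
    print(hydropathy_profile(peptide))
```

Stretches of consistently positive values in the hydropathy profile are the kind of signal that transmembrane-domain predictors build on; the window length of 9 is only an illustrative default.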

Protein Sequence Analysis Tools

ExPASy Proteomics Tools

Calculate physical properties

Predict sequence motifs

what ExPASy calls “Topology”: localization, TM domains

Signal sequences, post-translational modifications

Search pattern and profile collections

PredictProtein and Meta-PP

Meta-server providing access to many servers with one submission form

Structure Databases

Experimental

PDB: Protein Data Bank

Families:

SCOP, CATH, Dali Database, Homstrad

Models/Predictions

ModBase

SwissModel

NOTE: all of these databases are described in the January Database issue of Nucleic Acids Research (NAR)

Includes links to the databases

3D Structure Analysis

Visualization

Domain structure, global fold, active sites, point mutations, SNPs, splice sites

Evaluate structure “quality”

Calculate physical properties

Surface areas, distances, side-chain conformations, contact maps

Structural alignment (i.e. similarity to other structures)

Prediction

Physical properties: binding affinity, pKa’s, stability, specificity

3D structure (homology modeling, fold recognition, de novo)

Advanced: protein design, “docking” of two proteins, active site modeling

Secondary Structure Prediction

Three good methods:

Psipred

SAM-T02/T04/T06

PhD (PredictProtein)

Compare a couple of methods

Use the three-state predictions

Information Flow

Sequence ⇔ Structure ⇔ Function

Evolutionary selection operates on function

Structure is more closely linked to function than is sequence, so structure tends to be more conserved than sequence

Need to search farther in sequence space to find proteins with related structures and functions

Detecting Remote Similarities

Remote similarities can more easily be detected by comparing protein sequences

DNA sequences change faster than protein sequences (wobble position, redundant codons)

4 letter DNA code vs. 20 letter amino acid code means that matches by chance are more likely in DNA ➜ the protein code has more information in it!
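The alphabet-size argument can be made concrete with a short calculation. This sketch assumes every letter is equally likely and positions are independent, which real sequences do not satisfy, so the numbers are upper bounds on the information per position rather than measured values. The function names are hypothetical.

```python
import math


def bits_per_symbol(alphabet_size):
    """Maximum information per position, assuming equally likely letters."""
    return math.log2(alphabet_size)


def chance_match_probability(alphabet_size):
    """Probability that two random positions match, assuming uniform letters."""
    return 1.0 / alphabet_size


if __name__ == "__main__":
    for name, size in (("DNA", 4), ("protein", 20)):
        print(name,
              "bits per position:", round(bits_per_symbol(size), 2),
              "chance match:", chance_match_probability(size))
```

Under these assumptions a random DNA position matches 25% of the time while a random protein position matches 5% of the time, which is why a given level of identity is more meaningful between proteins.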

Detecting Homology

Evolutionary distance: NEAR → FAR

METHODS: BLASTn → BLASTp → PSIBLAST → Fold Recognition

SIMILARITY: DNA Sequence → Protein Sequence → Protein Structure

Similar Sequences Share Similar Structures

Compare all pairs of proteins in the same “family” (pairs for which homology is very probable)

Homologs do not necessarily share much sequence similarity

Proteins with > 30% sequence identity almost always share the same fold

Sauder et al., Proteins 40:6-22 (2000).

[Figure legend: Family (All others, Immunoglobulins). Axis label: More structural similarity]
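The 30% rule of thumb above depends on how percent identity is computed. The sketch below is one common convention, assumed here for illustration: identical residues divided by the number of aligned columns in which neither sequence has a gap. Other denominators (alignment length, shorter sequence length) give different numbers, and the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Two rows taken from a hypothetical pairwise alignment.
    print(percent_identity("MKWVWALLLL", "MRWV-ALILL"))
```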

PSIBLAST

Position-Specific Iterated BLAST

Use BLASTp and identify related sequences (E-value threshold)

Creates a scoring matrix specialized for your sequence

Allows more distantly related sequences to be identified

Steps:

Create a profile from related sequences

Search for related sequences using this profile

Repeat


BLASTing the Protein Universe

Evolution and the Protein Universe

PSIBLASTing the Protein Universe


Sequence Profiles

Align all sequences and count how often each amino acid occurs at every position

Combine with prior information about substitution frequencies using pseudo-counts from BLOSUM62

Convert to log odds score to give a Position-Specific Scoring Matrix (PSSM)
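The three steps above can be sketched in a few lines. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a flat pseudocount added to every amino acid and a uniform background frequency, whereas PSI-BLAST derives pseudocounts from BLOSUM62 substitution frequencies and also weights the sequences. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background instead of observed frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    pssm = []
    for col in range(length):
        # Step 1: count how often each amino acid occurs in this column
        # (gaps and unknown characters are skipped).
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        # Step 2: mix in prior information through pseudocounts.
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Step 3: convert to a log-odds score (in bits).
            row[aa] = math.log2(freq / background[aa])
        pssm.append(row)
    return pssm


def score(pssm, seq):
    """Score an ungapped sequence of the same length against the PSSM."""
    return sum(row[aa] for row, aa in zip(pssm, seq))


if __name__ == "__main__":
    toy_alignment = ["MKW", "MKW", "MRW"]  # made-up example data
    pssm = build_pssm(toy_alignment)
    print(round(score(pssm, "MKW"), 2), round(score(pssm, "AAA"), 2))
```

Residues seen often in a column get positive scores and unseen residues get negative ones, so the same amino acid can score differently at different positions, which is what distinguishes a PSSM from a single substitution matrix.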

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 M  -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1
 2 K  -1  1  0  1 -4  2  4 -2  0 -3 -3  3 -2 -4 -1  0 -1 -3 -2 -3
 3 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
 4 V   0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  4
 5 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
 6 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
 7 L  -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
 8 L  -1 -3 -3 -4 -1 -3 -3 -4 -3  2  2 -3  1  3 -3 -2 -1 -2  0  3
 9 L  -1 -3 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  2
10 L  -2 -2 -4 -4 -1 -2 -3 -4 -3  2  4 -3  2  0 -3 -3 -1 -2 -1  1
11 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
12 A   5 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0
13 W  -2 -3 -4 -4 -2 -2 -3 -4 -3  1  4 -3  2  1 -3 -3 -2  7  0  0
14 A   3 -2 -1 -2 -1 -1 -2  4 -2 -2 -2 -1 -2 -3 -1  1 -1 -3 -3 -1
15 A   2 -1  0 -1 -2  2  0  2 -1 -3 -3  0 -2 -3 -1  3  0 -3 -2 -2
16 A   4 -2 -1 -2 -1 -1 -1  3 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2 -1
...
37 S   2 -1  0 -1 -1  0  0  0 -1 -2 -3  0 -2 -3 -1  4  1 -3 -2 -2
38 G   0 -3 -1 -2 -3 -2 -2  6 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
39 T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -3 -2  0
40 W  -3 -3 -4 -5 -3 -2 -3 -3 -3 -3 -2 -3 -2  1 -4 -3 -3 12  2 -3
41 Y  -2 -2 -2 -3 -3 -2 -2 -3  2 -2 -1 -2 -1  3 -3 -2 -2  2  7 -1
42 A   4 -2 -2 -2 -1 -1 -1  0 -2 -2 -2 -1 -1 -3 -1  1  0 -3 -2  0

From Pevsner, Bioinformatics and Functional Genomics (ISBN 0-471-21004-8). © 2003 John Wiley & Sons, Inc.

A Sample PSSM

PSSM Corruption

False positives can occur in a PSIBLAST search if the PSSM becomes corrupted

Once a “bad” sequence is included in the PSSM, the search veers off course and cannot be corrected

How do PSSMs become corrupted?

One sequence that is not homologous to the query gets included in the alignment used to make the PSSM

The PSSM now looks a bit like this spurious sequence and will match well to other similar spurious sequences

The additional spurious sequences that are detected are included in the new alignment, amplifying the corrupting signal

Preventing PSSM Corruption

Filter biased-composition regions (low-complexity filter)

Use better methods to estimate the E-value (composition-based statistics)

Make the threshold for including a sequence in the profile more stringent: lower the E-value cutoff from 0.001 (default) to a value such as 0.0001

Manually inspect the output from each iteration and remove suspicious hits
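The last two safeguards amount to a filter applied between iterations. The sketch below is illustrative only: the hit list, sequence identifiers, and function name are made up, and it operates on (identifier, E-value) pairs rather than on real PSIBLAST output.

```python
def select_for_profile(hits, inclusion_threshold=0.001, excluded_ids=()):
    """Keep hits at or below the E-value cutoff that were not manually rejected."""
    excluded = set(excluded_ids)
    return [seq_id for seq_id, evalue in hits
            if evalue <= inclusion_threshold and seq_id not in excluded]


if __name__ == "__main__":
    # Hypothetical hits from one iteration: (identifier, E-value).
    hits = [("seqA", 1e-30), ("seqB", 5e-4), ("seqC", 0.2)]
    print(select_for_profile(hits))                             # default cutoff
    print(select_for_profile(hits, inclusion_threshold=1e-4))   # stricter cutoff
    print(select_for_profile(hits, excluded_ids=["seqB"]))      # manual removal
```

A stricter cutoff trades sensitivity for safety: borderline true homologs may be left out of the profile, but a spurious sequence is less likely to get in.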

PHI-BLAST

Pattern-Hit-Initiated BLAST

What other proteins contain a particular sequence pattern and are similar in the vicinity of this pattern?

May filter out cases where pattern matches randomly and doesn’t indicate homology

Combines matching of regular expressions with local alignments surrounding the match

Pattern matching uses ScanProsite syntax

Sequence similarity search is like PSIBLAST

Syntax Rules for Patterns

[] any one of the listed characters allowed

{} any character except the listed ones allowed

x(n) n positions in which any residue is allowed

x(n,m) n to m positions in which any residue is allowed

Examples:

E[LIV]X(0,3)PP[STG]

matches:

ELPPS

EVIPPG

does not match:

ELIVPPPPG

GXW[YF][EA][IVLM]

matches:

GTWFEL

GKWYAI

does not match:

GGWYFEI

GWYEI
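The pattern rules above map closely onto regular expressions, so the examples on this slide can be checked mechanically. This is a minimal sketch with a hypothetical function name; it handles only the four constructs listed here (plus the optional "-" separators used in PROSITE-style patterns) and does not reproduce the local alignment step that PHI-BLAST performs around each pattern hit.

```python
import re

# One token of the pattern: [..], {..}, x/X, a single residue, or (n) / (n,m).
_TOKEN = re.compile(r"""
      \[(?P<any>[A-Z]+)\]                # any one of the listed residues
    | \{(?P<none>[A-Z]+)\}               # any residue except the listed ones
    | (?P<wild>[xX])                     # any residue
    | (?P<res>[A-Z])                     # one specific residue
    | \((?P<lo>\d+)(?:,(?P<hi>\d+))?\)   # repeat count for the previous element
""", re.VERBOSE)


def pattern_to_regex(pattern):
    """Translate a slide-style sequence pattern into a compiled Python regex."""
    pattern = pattern.replace("-", "")  # PROSITE-style separators are optional
    parts = []
    pos = 0
    while pos < len(pattern):
        m = _TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at position {pos}: {pattern!r}")
        if m.group("any"):
            parts.append("[" + m.group("any") + "]")
        elif m.group("none"):
            parts.append("[^" + m.group("none") + "]")
        elif m.group("wild"):
            parts.append(".")
        elif m.group("res"):
            parts.append(m.group("res"))
        else:  # repeat count: (n) or (n,m)
            lo, hi = m.group("lo"), m.group("hi")
            parts.append("{" + lo + "}" if hi is None else "{" + lo + "," + hi + "}")
        pos = m.end()
    return re.compile("".join(parts))


if __name__ == "__main__":
    first = pattern_to_regex("E[LIV]X(0,3)PP[STG]")
    for seq in ("ELPPS", "EVIPPG", "ELIVPPPPG"):
        print(seq, bool(first.search(seq)))
    second = pattern_to_regex("GXW[YF][EA][IVLM]")
    for seq in ("GTWFEL", "GKWYAI", "GGWYFEI", "GWYEI"):
        print(seq, bool(second.search(seq)))
```

Using `search` rather than `fullmatch` mirrors how a pattern is scanned along a longer protein sequence: the pattern may occur anywhere within it.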

Gene Discovery with BLAST

Start with the sequence of a known protein

Search a DNA database (e.g. HTGS, dbEST, or genomic sequence from a specific organism) with tblastn

Find matches (and inspect them)...
• to DNA encoding known proteins
• to DNA encoding related (novel!) proteins
• to false positives

Search your DNA or protein against a protein database (nr) with blastx or blastp to confirm you have identified a novel gene
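Comparing a protein to DNA, as tblastn and blastx do, relies on translating the DNA in all six reading frames. The sketch below shows only that translation step, assuming the standard genetic code; the function names are hypothetical and no searching or scoring is done.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translations of the three forward and three reverse reading frames."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGAAATGGTAA").items():
        print(name, protein)
```

Stop codons are written as "*", so an open reading frame shows up as a long stretch without "*" in one of the six translations.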

Essentials at this Point

Accessing literature and sequence information from various databases (NCBI and UCSC)

BLAST (all variants)

Pairwise sequence analysis tools and algorithms

Single sequence analysis tools for DNA: EMBOSS, ORFs, restriction enzymes, and primers

Protein databases and analysis tools

PSI and PHI BLASTs

For Next Time

Reading

Recommended

B4D Chapter 9 – Building a Multiple Sequence Alignment

B4D Chapter 10 – Editing and Publishing Alignments

Problem set

Continue working on PS #2 (due Friday, October 29)

http://www.soe.ucsc.edu/classes/bme110/Fall10/calendar.html