Bioinformatics. Analysis of proteomic data. Dr Richard J Edwards 28 August 2009; CALMARO workshop.


[Cartoon: ©Gary Larson]

(In not much detail)

Bioinformatic analysis of proteomic data

Improving sequence identifications
  Dealing with redundancy
  Annotating protein hits

Adding value to protein lists
  Accession number mapping & data integration
  Gene Ontology analysis
  Protein interaction networks

Example: identifying E. huxleyi proteins with multi-species and EST sequence databases

Open Discussion

Improving identifications: dealing with redundancy.

Identifying redundancy

Copyright ©2005 American Society for Biochemistry and Molecular Biology Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Choice of database affects redundancy identification
  SwissProt/IPI indicate splice variants
  EnsEMBL peptides map back onto non-redundant gene IDs
  Poor annotation: hard to differentiate variant/error/family

Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Example: alpha tubulin protein family

Identifying redundancy
  Sometimes, identification cannot be conclusive

Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

Basic peptide grouping scenarios

Identifying redundancy
  Sometimes, identification cannot be conclusive
  Different scenarios can present different problems
  How important is it to study?
  Might need to identify protein(s) through further experiments

Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

A simplified example of a protein summary list

Identifying redundancy

Final protein list:
  Conclusive IDs
  Protein groups
  Inconclusive IDs

Are inconclusive/group hits redundant?
  Same protein from different species
  Splice variants

Does it matter?
  Inflated numbers
  Biased analyses
  Comparisons between experiments

[Figure labels: Unique to protein; Unique to group; No unique]
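The split of a final protein list into conclusive IDs, protein groups and inconclusive IDs can be illustrated with a minimal Python sketch. It assumes each protein hit is represented by the set of peptides matched to it; the function name and the protein/peptide labels are hypothetical, and this is a simplification for illustration, not the full scheme described by Nesvizhskii (2005).

```python
from collections import defaultdict

def classify_hits(protein_peptides):
    """Label each protein hit by how distinctive its peptide evidence is.

    protein_peptides: dict mapping protein ID -> set of peptide sequences.
    Returns dict mapping protein ID -> 'conclusive', 'group' or 'inconclusive'.
    """
    # Index which proteins each peptide maps to.
    peptide_owners = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for pep in peptides:
            peptide_owners[pep].add(protein)

    # Proteins with identical peptide sets cannot be told apart: one group.
    by_peptide_set = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        by_peptide_set[frozenset(peptides)].add(protein)

    labels = {}
    for protein, peptides in protein_peptides.items():
        group = by_peptide_set[frozenset(peptides)]
        if any(peptide_owners[p] == {protein} for p in peptides):
            labels[protein] = "conclusive"    # has a peptide unique to the protein
        elif len(group) > 1 and any(peptide_owners[p] == group for p in peptides):
            labels[protein] = "group"         # has a peptide unique to its group
        else:
            labels[protein] = "inconclusive"  # every peptide is shared more widely
    return labels

# Hypothetical example: four related hits sharing peptide "pepB".
example_hits = {
    "protA": {"pepA", "pepB"},
    "protB": {"pepB", "pepC"},
    "protC": {"pepB", "pepC"},
    "protD": {"pepB"},
}
example_labels = classify_hits(example_hits)
```

Proteins labelled "group" or "inconclusive" are the ones worth checking for redundancy (same protein from different species, splice variants) before counting or comparing between experiments.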

Homology groupings

Can use BLAST to identify groups of related proteins
  Help identify possible redundancies
  Need to look at peptides

Particularly useful for “off-species” identifications
  Tendency for many hits to same protein in different species

Clustering proteins by %identity

http://www.southampton.ac.uk/~re1u06/software/gablam/
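One way to picture clustering proteins by %identity is single-linkage grouping of pairwise identity values. The sketch below assumes the pairwise values have already been computed (for example from BLAST alignments); the function name, the 90% threshold and the example values are illustrative assumptions, and this is not a description of how GABLAM itself works.

```python
def cluster_by_identity(pairwise_identity, threshold=90.0):
    """Single-linkage clustering of proteins from pairwise % identity.

    pairwise_identity: dict mapping (protein_a, protein_b) -> % identity.
    threshold: minimum % identity for two proteins to be linked (assumed value).
    Returns a sorted list of clusters, each a sorted list of protein IDs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        root_a, root_b = find(a), find(b)  # also registers both proteins
        if identity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for protein in parent:
        clusters.setdefault(find(protein), []).append(protein)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical pairwise identities between four hits.
example_clusters = cluster_by_identity(
    {("A", "B"): 95.0, ("B", "C"): 92.0, ("C", "D"): 40.0}
)
```

Single linkage is a deliberately permissive choice: A and C end up together through B even if their direct identity is lower, so clusters flag possible redundancy and the peptides still need to be looked at.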

Improving identifications: annotating protein hits.

Protein annotation

[Diagram labels: Database; Protein List; NOISE]

Poorly (un)annotated proteins
  Real proteins or database noise?
  Reliable annotation?

Where does the data come from?

Most of our protein data comes from DNA sequences
  PDB: 53,660 structures = 3D
  SwissProt: 392,667 = Curated
  TrEMBL: >6 million & UniParc: >16 million = Most inferred from DNA
  Most annotation inferred through sequence analysis

Protein data from translated DNA
  Lots of errors!
  Sequence errors
  Annotation errors

[Diagram labels: Translation; Annotation]

Protein annotation

Use standard sequence analysis tools
  Manual guidance/care = better than automated databases!

Homology searching
  BLAST vs. UniProtKB
  Protein domain searches, e.g. Pfam

Conservation analysis
  Multiple sequence alignment with homologues
  Are functionally important sites conserved?

Phylogenetic analysis
  Evolutionary relationships can help distinguish function
  Assignment to protein subfamily etc.
  Useful where BLAST hits have competing annotation

http://www.southampton.ac.uk/~re1u06/software/haqesac/
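The question "are functionally important sites conserved?" can be made concrete with a small sketch that scores each column of a multiple sequence alignment. It assumes the alignment has already been built with a standard tool and is supplied as equal-length strings; the function names, the gap character and the toy alignment are assumptions for illustration, and real analyses usually use more refined conservation scores.

```python
from collections import Counter

def column_conservation(aligned_seqs, gap="-"):
    """Fraction of sequences sharing the commonest residue in each column.

    aligned_seqs: list of equal-length aligned sequences (rows of an alignment).
    Gaps count towards the denominator but never as the conserved residue.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores

def sites_conserved(aligned_seqs, sites, min_fraction=1.0):
    """Check whether given 0-based alignment columns meet a conservation cut-off."""
    scores = column_conservation(aligned_seqs)
    return {site: scores[site] >= min_fraction for site in sites}

# Toy alignment of a query with two homologues (hypothetical sequences).
toy_alignment = ["MKHA", "MKHG", "MRH-"]
toy_scores = column_conservation(toy_alignment)
toy_sites = sites_conserved(toy_alignment, sites=[2, 3])
```

If a poorly annotated hit lacks the residues that are invariant across a characterised family, that is a reason to treat an annotation transferred by BLAST with caution.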

Beyond proteomics: adding value to protein lists.

What Bioinformatics cannot (usually) do

Magic

Replace hypothesis-driven research

Directed analysis is always better than “fishing” (e.g. GO)

Provide a definitive answer

Ranking/prioritising better

Follow-up analyses

Many possibilities
  What was the aim of the study?
  What resources are available for your organism?

Imitation is the sincerest form of flattery
  Find a good study and copy the best bits
  Easier to describe
  Easier to justify to reviewers

Hypothesis-driven analysis is best
  Many tools facilitate hypothesis generation (data exploration)
  Be aware of risk of testing a hypothesis on data used to generate it
  Be aware of multiple testing issues

Follow-up analyses

EBI and NCBI both provide many useful tools
  EBI run many good courses at Hinxton

http://www.ebi.ac.uk/Tools/

Seek collaborations

[Chart labels: Reward; Time / Energy; Bioinformatics]

Find a tame bioinformatician to help if needed
Good collaboration = Trade
  Papers / Grants / improving the bioinformatics
  E.g. adding your organism/database to an online resource

[Cartoon: ©Gary Larson]

Accession number mapping

Other databases may contain better/specific annotation
  UniProtKB, OMIM etc.

Results from searches against older databases may need updating

EBI tool: PICR [Protein Identifier Cross-Reference Service]
  http://www.ebi.ac.uk/Tools/picr/

BioMart: Query & Xref tool for many databases
  www.biomart.org

Gene Ontology analysis

Gene Ontology [GO] = gene annotation project
  Controlled vocabulary allows standardisation & comparisons

http://www.geneontology.org/

Gene Ontology analysis

Many Gene Ontology exploration tools
  AmiGO, GOA, FatiGO, DAVID etc.
  Depend on source databases
  May need to map IDs using PICR first

GO enrichment
  Assess frequency of GO terms in your list against expectation
  Often a big multiple testing issue
  Be aware of biases: how is expectation derived?
    E.g. Abundant, conserved proteins more likely to be annotated & more likely to be identified in a proteomics experiment
  Best if hypothesis-driven or used for data confirmation
    E.g. Enrichment of certain subcellular fraction
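GO enrichment and its multiple testing issue can be sketched with a one-sided hypergeometric test and a Bonferroni correction. This is a minimal illustration using only the Python standard library, not the method of any particular tool listed above; the function names, the term labels and the toy data are hypothetical, and the choice of background (the source of the "expectation") is left to the caller precisely because it is where the bias creeps in.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n proteins from N, of which K carry the term."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

def go_enrichment(hit_list, background, annotations, alpha=0.05):
    """Test each term for over-representation in hit_list against background.

    annotations: dict mapping protein ID -> set of term IDs.
    Returns a list of (term, k, K, p_value, bonferroni_p, significant) tuples.
    """
    background = set(background)
    hits = set(hit_list) & background
    terms = set().union(*(annotations.get(p, set()) for p in background))
    results = []
    for term in sorted(terms):
        K = sum(1 for p in background if term in annotations.get(p, set()))
        k = sum(1 for p in hits if term in annotations.get(p, set()))
        p_value = hypergeom_pvalue(k, len(hits), K, len(background))
        adjusted = min(1.0, p_value * len(terms))  # Bonferroni: one test per term
        results.append((term, k, K, p_value, adjusted, adjusted <= alpha))
    return results

# Toy data: four background proteins, two terms, two hits (all hypothetical).
toy_annotations = {
    "p1": {"TERM_X"}, "p2": {"TERM_X"},
    "p3": {"TERM_Y"}, "p4": {"TERM_Y"},
}
toy_results = go_enrichment(["p1", "p2"], ["p1", "p2", "p3", "p4"], toy_annotations)
```

A more defensible background is the set of proteins that could realistically have been identified in the experiment, rather than the whole proteome, since abundant and conserved proteins are over-represented in both annotation and identifications.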

Protein interaction networks

Can be useful for identifying protein complexes in data
  E.g. STRING [http://string-db.org/]
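As a rough sketch of how an interaction network can point to complexes in a protein list, the example below keeps only interactions where both partners were identified and reports the connected groups. It assumes the interactions are available locally as pairs of protein IDs (for instance exported from a resource such as STRING); no database API is called, and the function name and example data are hypothetical.

```python
from collections import defaultdict, deque

def candidate_complexes(hits, interactions, min_size=2):
    """Connected groups of hit proteins linked by known interactions.

    hits: iterable of protein IDs from the proteomics experiment.
    interactions: iterable of (protein_a, protein_b) pairs.
    """
    hits = set(hits)
    neighbours = defaultdict(set)
    for a, b in interactions:
        if a in hits and b in hits and a != b:  # keep edges within the hit list
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        queue, members = deque([start]), []
        seen.add(start)
        while queue:  # breadth-first walk over one connected group
            node = queue.popleft()
            members.append(node)
            for nxt in sorted(neighbours[node] - seen):
                seen.add(nxt)
                queue.append(nxt)
        if len(members) >= min_size:
            groups.append(sorted(members))
    return groups

# Hypothetical hits and interaction pairs; "E" was not identified.
toy_groups = candidate_complexes(
    {"A", "B", "C", "D"},
    [("A", "B"), ("B", "C"), ("D", "E")],
)
```

Connected groups are only candidates: well-studied proteins have more recorded interactions, so the same annotation bias noted for GO applies here.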

Example: identifying E. huxleyi proteins with multi-species and EST sequence databases

Combined search strategy

Genome unavailable (for download & searching)

[Diagram labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database (Rhodophyta, Stramenopiles, Haptophyceae, Alveolata, Cryptophyta); 90,000 E. hux ESTs; Protein List]

[Diagram: BUDAPEST pipeline. Labels: EST dataset; BLAST database; MS/MS data; MASCOT hits; MASCOT hits translated to 6 RFs; RFs and MASCOT peptides filtered; Poor quality RFs removed; FIESTA consensus & annotation; Final protein identifications; BUDAPEST CORE; steps 1 to 5; OPTIONAL (MANUAL or AUTOMATED)]

[Diagram: pipeline with counts. Labels: 90,000 E. hux ESTs; 173 ESTs, 728; 189 RFs; 113; 615; Taxa-limited Database; 117 Cons, 321; 34 Cons, 34; 83 Cons, 287]

173 EST hits (728 peptides)

83 Consensus sequences
  40 Clusters by homology (variants/isoforms)

287 Peptides
  239 Unique to one consensus
  48 Shared within one cluster

http://www.southampton.ac.uk/~re1u06/software/budapest/
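The step in which EST hits are translated to 6 RFs can be illustrated with a short six-frame translation sketch, assuming "RFs" means reading frames and the standard genetic code applies. This is a generic illustration, not BUDAPEST's own code; the function names and the example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order at each position.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate from the first base; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translation(est):
    """Return {frame: protein} for frames +1..+3 and -1..-3 of an EST sequence."""
    forward = est.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames

# Hypothetical nine-base sequence, far shorter than a real EST.
toy_frames = six_frame_translation("ATGGCCTAA")
```

Only the frame(s) containing the MS/MS-matched peptides need to be kept for each EST, which is presumably why reading frames and peptides are filtered together in the pipeline.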

Annotating EST Consensus Sequences

Homology searching & phylogenetics

[Diagram labels: Sequence Database; Consensus; UniProt; Taxa-limited Database; Alignment; Protein family identification; Redundancy/Variants]

Combined search strategy

Genome unavailable (for download & searching)

[Diagram labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database (Rhodophyta, Stramenopiles, Haptophyceae, Alveolata, Cryptophyta); 90,000 E. hux ESTs; 173 Hits, 83 Consensus, 40+ Proteins; 96 Hits, 26+ Proteins; 64+ Proteins (12 Common)]

Conclusions.

Summary

Extra analysis of raw protein lists adds value
  False positives vs. Real proteins
  Annotation of uncharacterised hits

Numerous tools for mining protein lists
  Data exploration and/or hypothesis testing
  Community/Organism dependent
  Worth contacting bioinformaticians for further development

Development of customised bioinformatics solutions can greatly increase power of study

Increased availability of high throughput technologies
  Poor annotation & high error rates
  Increased need for bioinformatics post-processing to improve quality

Open Discussion

R.Edwards@Southampton.ac.uk
