Update

Susan Bridges, Fiona McCarthy, Shane Burgess

NRI 2006-04846

1. Some of what we’ve been doing: confirmation of predicted/hypothetical proteins in chicken

2. Something of more interest to almost everyone in here, for analyzing your data.

Educate researchers who need to use GO.

University of Delaware, 12-13 November, 2007.

…… currently working with researchers from the Universities of Delaware and Maryland to provide GO annotations necessary to facilitate publication of array data.

First residential workshop at MSU, 20-22 May, 2008.

Avian Genome Conference 18-20 May, 2008
GO Annotation Jamboree 21-22 May, 2008

agbase@cse.msstate.edu

“Hypothetical” and “predicted” proteins

Naive and activated purified CD4+ T cells; transformed CD4+ T cells; spleen; brain tissues; bursal B and stromal cells; muscle; and serum.

Database of all predicted proteins, from chicken build 2.1, using DFF-2D LC MS2 and our computational pipeline.

Experimentally confirmed 7,809 predicted chicken proteins; 52% were expressed in more than one tissue.

6,027 (77%) of these proteins mapped to human and mouse orthologs and we assigned standardized nomenclature to 5,326 (64%).

Made 8,213 GO associations to 21% of the identified chicken proteins, using the ISS evidence code to transfer function between human-chicken and human-mouse orthologs.

This increased the current chicken GO annotations by 8% and doubled the number of manually curated chicken annotations.

The data are in the PRIDE and NCBI databases and are being used at NCBI to promote XP (computational model) accessions to NP (confirmed product) accessions, i.e. the words “hypothetical” and “predicted” are removed.

We also add experimentally-derived cell component GO annotations.

Tissue distribution of expressed ‘predicted’ proteins

[Pie chart: proteins grouped by the number of tissues in which they were identified (in one tissue through in all eight tissues). Slice labels as extracted: 48% (3,779); 1% (61); 4% (313); 7% (561); 26% (2,020); 14% (1,073); 0% (0); 0% (2).]

[Bar chart: Number of proteins (0 to 6,000) by tissue type (Spleen, UA01, Stroma, T cells, B-cells, Serum, Muscle, Brain); series: Tissue specific proteins, Proteins identified in other tissues.]
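The tissue-distribution tallies on this slide can be reproduced from a simple protein-to-tissue mapping. The sketch below is illustrative only: the accessions and tissue sets are invented, and the function names are hypothetical rather than part of the AgBase pipeline.

```python
from collections import Counter

# Hypothetical identifications: protein accession -> tissues where it was detected.
# These accessions and tissue sets are invented for illustration only.
identifications = {
    "XP_000001": {"spleen"},
    "XP_000002": {"spleen", "brain"},
    "XP_000003": {"muscle", "serum", "brain"},
    "XP_000004": {"brain"},
}


def tissue_distribution(ids):
    """Count proteins by the number of tissues in which each was identified."""
    return Counter(len(tissues) for tissues in ids.values())


def tissue_specific_counts(ids):
    """Per tissue, count proteins found only there versus also found elsewhere."""
    specific, shared = Counter(), Counter()
    for tissues in ids.values():
        target = specific if len(tissues) == 1 else shared
        for tissue in tissues:
            target[tissue] += 1
    return specific, shared


distribution = tissue_distribution(identifications)
specific, shared = tissue_specific_counts(identifications)
total = sum(distribution.values())
for n_tissues in sorted(distribution):
    count = distribution[n_tissues]
    print(f"In {n_tissues} tissue(s): {count} ({100 * count / total:.0f}%)")
```

The first function corresponds to the pie chart (proteins per number of tissues) and the second to the bar chart (tissue-specific versus shared proteins per tissue).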

chicken: human/mouse orthologs (1:1)

[Venn diagram: Mouse orthologs and Human orthologs. Region values as extracted: 236; 5,685; 106. No human or mouse orthologs: 1,784.]

Cumulative external visits to AgBase

[Line chart: cumulative visits (0 to 10,000) by month, July 2005 to December 2007.]

Summary of GO annotations for last 12 months

11,716 GO annotations for chicken & cow:
• 214 cow gene products GO annotated (1,521 GO annotations)
• 1,762 chicken gene products GO annotated (10,194 GO annotations)
• in addition, orthology with human and mouse genes used to GO annotate 7,809 computationally ‘predicted’ chicken proteins (8,213 GO annotations)

Annotation metrics

Database distribution of AgBase GO Annotations

[Pie charts: Chicken Dec '07 and Cow Dec '07; legend: AgBase Community file, GO Consortium file.]

GO Annotation of Arrays

[Diagram: Genomic Annotation. Labels:
• Structural Annotation, including Sequence Ontology
• Functional annotation using Gene Ontology
• Nomenclature (species’ genome nomenclature committees)
• Other annotations using other bio-ontologies, e.g. Anatomy Ontology]

Quality improvement of annotations: Pre-annotation, Re-annotation

GO annotation of arrays.

[Flowchart. Inputs: Array IDs; structural mapping; Gene product IDs; ‘known’ genes from public databases; ‘predicted’ genes from genome sequencing. Decisions (YES/NO): Is functional literature available? Are strict mammalian orthologs available? Annotation routes: GO annotation of literature; GO annotation from orthologs (ISO); Electronic GO annotation using InterPro data (IEA). Final steps: Collate GO annotations; link to array IDs (updateable); Submit to EBI-GOA, GOC.]
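The decision flow on this slide can be sketched in code. The flowchart's arrows did not survive extraction, so the branch order below (literature first, then strict mammalian orthologs, then InterPro-based electronic annotation) is an assumed reading, and the class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneProduct:
    """Hypothetical record for one gene product reached from an array ID."""
    product_id: str
    has_functional_literature: bool
    has_strict_mammalian_ortholog: bool


def choose_route(gp: GeneProduct) -> str:
    """Return an annotation route; this branch order is an assumed reading."""
    if gp.has_functional_literature:
        return "GO annotation of literature"
    if gp.has_strict_mammalian_ortholog:
        return "GO annotation from orthologs (ISO)"
    return "Electronic GO annotation using InterPro data (IEA)"


def annotate_array(array_to_product, products):
    """Map each array ID to a route via its gene product (structural mapping)."""
    routes = {}
    for array_id, product_id in array_to_product.items():
        gp = products.get(product_id)
        # Array IDs with no mapped gene product stay unannotated (None).
        routes[array_id] = choose_route(gp) if gp else None
    return routes


# Invented example data, for illustration only.
products = {
    "GP1": GeneProduct("GP1", True, True),
    "GP2": GeneProduct("GP2", False, True),
    "GP3": GeneProduct("GP3", False, False),
}
array_to_product = {"probe_1": "GP1", "probe_2": "GP2",
                    "probe_3": "GP3", "probe_4": "GP_missing"}
routes = annotate_array(array_to_product, products)
for array_id, route in routes.items():
    print(array_id, "->", route)
```

Keeping the array-ID-to-gene-product mapping separate from the annotations mirrors the "link to array IDs (updateable)" step: the link can be refreshed without redoing the annotation.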

AgBase: annotating arrays

1. Del-Mar 14K Chicken Integrated Systems microarray (GPL1731).
• 14,053 chicken genes represented
• 9,587 contigs GO annotated (CC: 3,514; MF: 6,640; BP: 4,623)
• 3,101 singletons GO annotated (CC: 487; MF: 881; BP: 646)
• many singletons map to chicken ESTs with no associated GO

Figure 1A: Biological Process associated with Del-Mar 14K array

[Pie chart; legend: metabolic process, transport, cell communication, development, immune response, cell death, cell differentiation, response to stress, sensory perception, cell motility, regulation of biological process, cellular organization and biogenesis, behavior, response to chemical stimulus, process unknown.]

Relative amount of GO BP associated with Del-Mar 14K array compared to total chicken GO.

[Bar chart: y-axis “Array GO / total chicken GO” (-6.0 to 6.0); x-axis “GO Biological Processes”, in extraction order: development, immune response, cell death, response to stress, process unknown, cell motility, cell differentiation, behavior, transport, regulation of biological process, sensory perception, response to chemical stimulus, secretion, cellular organization and biogenesis, response to stimulus, metabolic process, cell communication.]
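A comparison of this kind can be sketched as follows. The slides do not state how the relative amount was computed, so the definition used here (percentage share of a term among array annotations minus its share among all chicken annotations) is an assumption, and the counts are invented.

```python
def percent_share(counts):
    """Convert per-term annotation counts to percentage shares of the total."""
    total = sum(counts.values())
    return {term: 100.0 * n / total for term, n in counts.items()}


def relative_difference(array_counts, genome_counts):
    """Assumed definition: array share minus whole-genome share, per GO term."""
    array_pct = percent_share(array_counts)
    genome_pct = percent_share(genome_counts)
    terms = set(array_counts) | set(genome_counts)
    return {t: array_pct.get(t, 0.0) - genome_pct.get(t, 0.0) for t in terms}


# Invented counts, for illustration only.
array_counts = {"development": 30, "transport": 20, "metabolic process": 50}
genome_counts = {"development": 20, "transport": 20, "metabolic process": 60}

diff = relative_difference(array_counts, genome_counts)
# List terms from most over-represented to most under-represented on the array.
for term, value in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {value:+.2f}")
```

Under this definition a positive value means a process is over-represented on the array relative to the whole chicken GO annotation set, and a negative value means it is under-represented.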

AgBase: annotating arrays

2. TAMU Agilent 44K chicken array

• approx 44,000 chicken genes represented

• added GO annotation for 8,731 chicken gene products

• many of the array IDs with no associated GO annotation map to chicken EST sequences

AgBase: annotating arrays

3. FHCRC Chicken 13K v2.0 (GPL1836)
• 13,007 chicken genes represented
• 2,491 array IDs mapped to chicken gene products & GO annotated
• 628 mapped to chicken gene products with no GO
• approx 2,000 array IDs mapped to human or mouse gene products with GO annotation

GO Annotation Quality Score: “GAQ”

GAQ: no. annotations; DAG depth; GO evidence code

• calculate overall GAQ score for any dataset (e.g. array)
• calculate GAQ for subsets (e.g. biological processes studied using arrays)

[Diagram labels: “Gene Ontology”, “Biological Process”]

Evidence Code
IEA inferred from electronic annotation
ISS inferred from sequence similarity
IMP inferred from mutant phenotype
IGI inferred from genetic interaction
IPI inferred from physical interaction
IDA inferred from direct assay
IEP inferred from expression pattern
TAS traceable author statement
NAS non-traceable author statement
ND no biological data available
RCA inferred from reviewed computational analysis
IC inferred by curator

Your Favorite Gene

Low GAQ score

Your NEW Favorite gene

High GAQ score
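To make the idea concrete, here is a minimal sketch of a GAQ-style score built from the three ingredients named on the slide: number of annotations, DAG depth and evidence code. The slides give no formula or numeric weights, so the product-and-sum form and every weight in `EVIDENCE_WEIGHT` are assumptions for illustration only.

```python
# Hypothetical evidence-code weights; the slides do not give numeric values.
EVIDENCE_WEIGHT = {
    "IDA": 5, "IMP": 5, "IGI": 5, "IPI": 5,
    "IEP": 4, "TAS": 4, "IC": 4,
    "ISS": 3, "RCA": 3,
    "NAS": 2, "IEA": 2,
    "ND": 1,
}


def annotation_score(depth, evidence_code):
    """Score one annotation as DAG depth times evidence weight (assumed form)."""
    return depth * EVIDENCE_WEIGHT[evidence_code]


def gaq_score(annotations):
    """Sum annotation scores, so more annotations also raise the total."""
    return sum(annotation_score(depth, code) for depth, code in annotations)


def dataset_gaq(annotations_by_product):
    """Total score for a dataset, e.g. all gene products on an array."""
    return sum(gaq_score(a) for a in annotations_by_product.values())


# Invented annotations as (DAG depth, evidence code) pairs.
dataset = {
    "your_favorite_gene": [(2, "IEA")],
    "your_new_favorite_gene": [(6, "IDA"), (5, "ISS"), (2, "IEA")],
}
for product, annotations in dataset.items():
    print(product, gaq_score(annotations))
print("dataset total:", dataset_gaq(dataset))
```

Restricting `dataset` to the gene products annotated to one biological process gives a score for that subset in the same way.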

Quantification of re-annotation

Metrics

• Granularity: # previous annotations; # re-annotations
• Specificity: # chicken annotations; # human/mouse annotations
• Quality: Gene Annotation Quality (GAQ) score

[Bar chart: Number of annotations (0 to 4,500) by annotation type (Whole Array, Chicken, Human/Mouse); series: Pre-annotation, Re-annotation. Panels: GRANULARITY, SPECIFICITY. Bar callouts as extracted: 300% increase; 50% increase; 700% increase.]

• 13% of previous annotations to other species were corrected to chicken-specific annotations

Bart van den Berg, CVM MSU/ Sue Lamont and Huaijun Zhu

Quality improvement of annotations: Pre-annotation, Re-annotation

GAQ score summary

                              Pre-annotation   Re-annotation   Fold difference
Depth                         87,250           231,184         2.7
Confidence score total        39,355           108,537         2.8
Total # proteins (Breadth)    886              4,240           4.8
Total GAQ score               207,869          579,599         2.8

GO biological process annotations

[Bar chart: y-axis “Relative difference” (-6 to 6), microarray GO / total chicken GO; x-axis “GO Term”. Bar values, in extraction order: -4.88, -3.61, -1.80, -0.75, -0.04, 0.18, 0.33, 0.46, 1.04, 1.06, 1.26, 1.64, 5.12. GO terms, in extraction order: cell communication; metabolic process; catabolic process; transport; regulation of biological process; macromolecule metabolic process; biological_process; cell motility; response to stimulus; nucleobase, nucleoside, nucleotide and nucleic acid metabolic process; cell differentiation; cell death; multicellular organismal development.]

Modeling using the GO

[Diagram: Functional Understanding. Labels as extracted: Implied; Derived; Physiology (= Cellular Component + Biological Process + Molecular Function); Network Modeling; Gene Ontology; (interactions).]

Hypothesis-driven GO-based data interrogation

Buza, J. J. and S.C. Burgess. Modeling the proteome of a Marek's disease transformed cell line: a natural animal model for CD30 over-expressing lymphomas. Proteomics, 2007. 7:1316-26.
