


ABSTRACT

Motivation: Similarity-based methods are widely used to infer the properties of genes and gene products that have little or no experimental annotation. The most popular are sequence similarity search methods such as BLAST. New approaches that overcome the limitations of methods relying solely on sequence similarity are emerging. One of these is the comparison of the organization (architecture) of the structural domains in proteins. The idea is that shared structural units may indicate shared evolutionary and functional properties.

Results: Here we propose a new algorithm for the comparison of domain architectures in order to identify similarities and to propagate functional annotations between the proteins in the UniProt database. The method, “UniProt Domain Architecture Alignment”, differs from previous approaches in three major ways: (i) the use of the InterPro database for domain annotation, (ii) the incorporation of domain weights into the dynamic programming step, and (iii) the inclusion of information about non-annotated regions of the proteins in the domain architectures. The performance of the method was measured through the identification of orthology using the OMA database (F1 score: 0.62). The results indicate the effectiveness of the approach for similarity detection. We plan to integrate the algorithm into a learning-based system for the automatic annotation of uncharacterized proteins in the UniProtKB/TrEMBL database.

METHODOLOGY

Generation of the Domain Architectures (DAs):

1) Collect the hits for each protein from InterPro.
2) Remove all non-domain type hits.
3) Order the domain hits sequentially.
4) Merge the hits from the same InterPro hierarchy into single hits using the condensed view algorithm provided by InterPro.
5) Resolve the overlapping hits from unrelated InterPro entries.
6) Add the stretches of residues without domain hits (> 30 amino acids) as “GAP” domains in the DAs.
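Steps 3 and 6 above can be sketched roughly as follows. This is only an illustration under stated assumptions: the hits are assumed to be domain-type, already merged by hierarchy and free of overlaps (steps 1, 2, 4 and 5 are not implemented), and the function name, the `Hit` tuple layout and the accessions `IPR_A` and `IPR_B` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical hit layout: (accession, start, end), 1-based inclusive positions.
Hit = Tuple[str, int, int]

GAP = "GAP"
MIN_GAP = 30  # unannotated stretches longer than this become GAP domains


def build_domain_architecture(hits: List[Hit], protein_length: int) -> List[str]:
    """Order the hits along the sequence and insert GAP symbols for long
    unannotated stretches. Assumes the hits are already filtered, merged
    and non-overlapping."""
    architecture: List[str] = []
    previous_end = 0  # last residue covered by a hit so far
    for accession, start, end in sorted(hits, key=lambda h: h[1]):  # step 3
        if start - previous_end - 1 > MIN_GAP:  # step 6
            architecture.append(GAP)
        architecture.append(accession)
        previous_end = max(previous_end, end)
    if protein_length - previous_end > MIN_GAP:  # trailing unannotated region
        architecture.append(GAP)
    return architecture


example_hits = [("IPR_B", 100, 350), ("IPR_A", 10, 60)]
example_da = build_domain_architecture(example_hits, protein_length=400)
```

In this example the 39-residue stretch between the two hits and the 50-residue tail each exceed the 30-residue threshold, while the 9 residues before the first hit do not.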

Domain weighting:

• Inverse domain frequency
• Neighboring domain count
• Term frequency
• Domain hit sizes
• Domain similarity measure
• Weight matrix
• Final scoring matrix

Weighted Domain Architecture Alignment:

The Needleman-Wunsch global sequence alignment algorithm (Needleman and Wunsch, 1970) is the core of the proposed DA alignment method, with the following modifications:

• The algorithm works with 7137 distinct InterPro domains as its alphabet instead of 20 amino acids.
• The domain weights are integrated into a scoring matrix to direct the alignments toward maximum weighted scores.
• The non-annotated regions of the proteins are scored as gaps during the alignment.
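A minimal sketch of Needleman-Wunsch global alignment over a domain alphabet is given below. It is not the method's actual scoring scheme: the per-domain weights, the mismatch score of -1, the linear gap penalty and the treatment of the “GAP” symbol as an ordinary alphabet symbol are all assumptions made for illustration, as are the accessions `IPR_A` and `IPR_B`.

```python
from typing import Callable, Dict, List


def align_score(da1: List[str], da2: List[str],
                score: Callable[[str, str], float],
                gap_penalty: float = -1.0) -> float:
    """Return the optimal global alignment score of two domain architectures
    using the standard Needleman-Wunsch recurrence with a linear gap penalty."""
    rows, cols = len(da1) + 1, len(da2) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap_penalty
    for j in range(1, cols):
        table[0][j] = j * gap_penalty
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + score(da1[i - 1], da2[j - 1]),  # align both
                table[i - 1][j] + gap_penalty,                        # skip in da2
                table[i][j - 1] + gap_penalty,                        # skip in da1
            )
    return table[-1][-1]


# Hypothetical per-domain weights: a match scores the domain's weight,
# a mismatch scores -1.
weights: Dict[str, float] = {"IPR_A": 2.0, "IPR_B": 1.0, "GAP": 0.5}


def weighted_score(a: str, b: str) -> float:
    return weights.get(a, 1.0) if a == b else -1.0


score_with_gap = align_score(["IPR_A", "GAP", "IPR_B"], ["IPR_A", "IPR_B"],
                             weighted_score)
score_identical = align_score(["IPR_A", "IPR_B"], ["IPR_A", "IPR_B"],
                              weighted_score)
```

Because matches are scored by domain weight, agreeing on a highly weighted domain contributes more to the total than agreeing on a low-weight one, which is the sense in which weights can steer the alignment.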

RESULTS & DISCUSSION

InterPro Domains, DAs and DA Alignment

Domain annotation coverage difference between domain databases:

Statistics about the directionality in DAs:

Evaluation of the performance of the method

The performance of the proposed method was tested on the identification of orthologous proteins from the Orthologous Matrix (OMA) project, March 2014 release (Altenhoff et al., 2011). Randomly selected UniProtKB/SwissProt proteins from the OMA groups were subjected to the DA alignment procedure. Performance was evaluated by measuring the method's ability to identify orthologous proteins, since orthologs usually share the same function.
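For reference, the F1 score used as the summary metric is the harmonic mean of precision and recall. The sketch below shows the standard computation on predicted versus true ortholog pairs; how predictions were derived from alignment scores is not described here, so the pair sets and the protein identifiers `p1` to `p9` are purely hypothetical.

```python
from typing import FrozenSet, Set

Pair = FrozenSet[str]


def pair(a: str, b: str) -> Pair:
    """Unordered protein pair, so that (a, b) and (b, a) compare equal."""
    return frozenset((a, b))


def f1_score(predicted: Set[Pair], true: Set[Pair]) -> float:
    """Harmonic mean of precision and recall over ortholog pairs."""
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


predicted_pairs = {pair("p1", "p2"), pair("p1", "p3"), pair("p5", "p4")}
true_pairs = {pair("p1", "p2"), pair("p4", "p5"),
              pair("p6", "p7"), pair("p8", "p9")}
example_f1 = f1_score(predicted_pairs, true_pairs)
```

Here two of the three predictions are correct (precision 2/3) and two of the four true pairs are recovered (recall 1/2), giving an F1 of 4/7.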

CONCLUSIONS

Here we proposed a new approach in the field of protein function prediction. The method is distinguished from previous approaches in three main aspects:

1) Different types of domain weights are integrated into the scoring matrix to direct the alignment of DAs to an optimal solution.
2) The information pertaining to the non-annotated regions of the proteins is integrated into the DAs and thus scored during the alignment.
3) InterPro is used as the domain resource in order to increase the coverage of domain annotation on the protein sequences.

The results of the ortholog sequence analysis suggest that the proposed approach can identify functional relationships between proteins.

As future work, we plan to use the method in a pipeline for the automatic annotation of the proteins in the UniProtKB/TrEMBL database.

UniProt Domain Architecture Alignment: A New Approach for Protein Similarity Search using InterPro Domain Annotation

Tunca Doğan¹, Alex Bateman¹, Maria J. Martin¹

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Correspondence: [email protected]

INTRODUCTION

• Discovery of the functional properties of proteins is a key step in biomedical research.
• Experimental identification of these properties is still a laborious and expensive task.
• This has led to the development of many computational methods that infer the unknown properties of proteins from their sequence similarity to experimentally annotated proteins (e.g. BLAST, PSI-BLAST).
• Different approaches have been tried lately, especially in the field of protein function prediction, to augment the performance of sequence-based methods.
• One of these approaches is the study of protein domains: the structural building blocks of proteins that are able to function and fold independently from the rest of the protein.
• Domain architectures (DAs) are defined as the organizational properties of a protein with respect to the domains it contains.
• Here we present the UniProt Domain Architecture Alignment procedure for the detection of functional similarities between proteins containing domain annotation:
    • Four types of attributes are incorporated into the measurement of domain architectural similarity: domain content, order, position and recurrence.
    • The proposed method incorporates domain information from the InterPro database in order to increase the coverage of domain information on the proteins.

Overlap domain hits problem in the InterPro database:

Figure 1. Different types of overlapping domain hits on protein sequences.

Figure 2. Resolution process for the overlap hits.

Figure 3. Domain hit statistics of UniProtKB/SwissProt proteins from various databases.

Figure 4. The fraction of overlap hits by InterPro domains on the residues of all UniProtKB/SwissProt proteins.

Figure 5. Co-occurrence frequencies of a selection of domain pairs, hit together on UniProtKB/SwissProt proteins (InterPro accessions of the domains are shown at the top of the bars).

Table 1. Performance results of the proposed method in the identification of orthologous proteins in OMA groups.

ACKNOWLEDGEMENTS

T.D. thanks Andrew Nightingale for the editorial work on the manuscript.

Funding: This work was supported by the TUBITAK BIDEB-2219 post-doctoral research fellowship program.

N_t: total number of proteins in the test set
N_d: number of proteins containing domain d
E_d: total number of distinct neighboring domains to d
N_{d,p}: domain copy number of domain d in protein p
D_p: total number of domains in protein p
Z_min(d_1, d_2), Z_max(d_1, d_2): sizes of the shorter and longer hits, respectively, of domain d in protein 1 and in protein 2
Z_av: average size of all domain hits on all proteins in the set
O_{d,e}: similarity ratio between domain d and domain e
A_{p1,p2}, C_{p1,p2}, F_{p1,p2}, S_{p1,p2}, I_{p1,p2}: local weight matrices
R_{p1,p2}: raw scoring matrix
W_{p1,p2}: general weight matrix between proteins 1 and 2
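The names "inverse domain frequency" and "term frequency", together with the symbols N_t, N_d, N_{d,p} and D_p, suggest weights analogous to TF-IDF in text retrieval. As a rough illustration only, the sketch below uses the textbook forms log(N_t / N_d) and N_{d,p} / D_p; these formulas are an assumption and may differ from the ones actually used by the method. The function names and the toy domain labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def inverse_domain_frequency(proteins: Dict[str, List[str]]) -> Dict[str, float]:
    """Assumed IDF-style weight: log(N_t / N_d), where N_t is the number of
    proteins and N_d the number of proteins containing domain d."""
    n_t = len(proteins)
    n_d = Counter(d for da in proteins.values() for d in set(da))
    return {d: math.log(n_t / count) for d, count in n_d.items()}


def term_frequency(da: List[str]) -> Dict[str, float]:
    """Assumed TF-style weight: N_{d,p} / D_p, the copy number of domain d
    in the protein divided by the protein's total number of domains."""
    d_p = len(da)
    return {d: n / d_p for d, n in Counter(da).items()}


# Toy test set with hypothetical domain labels.
test_set = {
    "P1": ["A", "B", "A"],
    "P2": ["A"],
    "P3": ["A", "C"],
    "P4": ["B"],
}
idf = inverse_domain_frequency(test_set)
tf_p1 = term_frequency(test_set["P1"])
```

Under these assumed forms, a domain found in few proteins (here C) receives a larger inverse-frequency weight than a widespread one (here A), and a domain repeated within a protein receives a larger term-frequency weight for that protein.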
