Connecting TOPSAN to Computational Analysis

Christian M Zmasek, Kyle Ellrott, Dana Weekes, Constantina Bakolitsa, John Wooley, Adam Godzik

Joint Center for Structural Genomics
Sanford-Burnham Medical Research Institute, La Jolla, California, USA
University of California, San Diego, La Jolla, California, USA
Joint Center for Molecular Modeling

Overview

• What is TOPSAN?
– TOPSAN: The Open Protein Structure Annotation Network
– Community-based annotation of protein structures
• “Semantic” TOPSAN
• How to enter machine-readable, structured data
• Example: editor → entry → semantic web
• Different ways to download information
• SPARQL example
• Availability and licenses
• Acknowledgements

What is TOPSAN?

• TOPSAN: The Open Protein Structure Annotation Network
• Tens of thousands of protein structures have been determined by structural genomics (SG) centers, and many more are expected
• While these structures are available in the PDB (Protein Data Bank)…
• … annotations for most of them are limited to one-line PDB titles
• TOPSAN is the first database that specifically focuses on providing extensive annotations for the thousands of structures solved by the SG centers

What is TOPSAN?

• TOPSAN’s main content consists of collaboratively written (“open”) articles/annotations for each solved protein structure
• TOPSAN combines automated with human-edited elements
• TOPSAN spans the range of analysis of
– single proteins
– characterization of protein families
– reconstruction of entire genomes
• Articles are created by structural genomics (SG) center staff and over 400 external users, so far covering 7,250 proteins
• Collaborating with PFAM to use JCSG structures to refine and create new PFAM families

TOPSAN example entry

“Semantic” TOPSAN

• Use the principles of the semantic web to turn TOPSAN into a database that can be:
– edited
– searched
– linked

• TOPSAN content is being made accessible to computational query and analysis via semantic web technologies

Entering machine-readable, structured data with the TOPSAN Protein Syntax (TPS)

• Takes the form subject, predicate, object
• Subject: the protein in question
• Predicate, examples:
– homologous
– encoded_by
– citation
– member_of
• Object: “direct value” or link to other database
• Example:
– {{ note.link( 'pfam_family_member', 'PFAM:PF07980' ) }}

• More information: http://topsan.wordpress.com/2010/06/01/96/
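
The subject, predicate, object form can be illustrated with a small Python sketch that pulls note.link(...) calls out of page text and pairs them with the page's protein. This is an assumption-laden illustration, not TOPSAN's actual parser: the regular expression, the Triple type, the extract_triples helper, and the EXAMPLE_PROTEIN subject are all hypothetical.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Matches calls such as {{ note.link( 'pfam_family_member', 'PFAM:PF07980' ) }}.
# The pattern is an illustrative assumption, not TOPSAN's actual grammar.
NOTE_LINK = re.compile(
    r"\{\{\s*note\.link\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)\s*\}\}"
)


def extract_triples(subject: str, page_text: str) -> List[Triple]:
    """Return one Triple per note.link(...) call found in page_text."""
    return [
        Triple(subject, predicate, obj)
        for predicate, obj in NOTE_LINK.findall(page_text)
    ]


# EXAMPLE_PROTEIN is a placeholder subject; the slide does not say which
# protein page the example link belongs to.
page = "Family membership: {{ note.link( 'pfam_family_member', 'PFAM:PF07980' ) }}"
for triple in extract_triples("EXAMPLE_PROTEIN", page):
    print(triple)
```

The subject never appears in the markup itself; it is implied by the page being edited, which is why the sketch takes it as a separate argument.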

Example: in the Editor

Example: the resulting TOPSAN entry

Example: on the Semantic Web

<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2afb>

<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2var>

<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#functional_assignment> <http://purl.org/obo/owl/EC#EC_2.7.1.45>
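
As a rough illustration of how such statements can be consumed, the following Python sketch reads the three statements above and groups the object IRIs by predicate. The helper names (parse_statements, objects_by_predicate) are hypothetical, and the parser only handles lines made of three angle-bracketed IRIs; a real RDF library would be the safer choice for anything beyond this.

```python
import re
from collections import defaultdict

# The three statements shown on the slide, copied verbatim.
STATEMENTS = """\
<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2afb>
<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#simular_structure> <http://www.pdb.org/pdb/explore/explore.do?structureId=2var>
<http://purl.org/topsan/protein/2qcv> <http://purl.org/topsan/tps#functional_assignment> <http://purl.org/obo/owl/EC#EC_2.7.1.45>
"""

IRI = re.compile(r"<([^>]*)>")


def parse_statements(text):
    """Yield (subject, predicate, object) for each line holding exactly three IRIs."""
    for line in text.splitlines():
        iris = IRI.findall(line)
        if len(iris) == 3:
            yield tuple(iris)


def objects_by_predicate(triples):
    """Group object IRIs under the local name (text after '#') of their predicate."""
    grouped = defaultdict(list)
    for _subject, predicate, obj in triples:
        grouped[predicate.rsplit("#", 1)[-1]].append(obj)
    return dict(grouped)


grouped = objects_by_predicate(parse_statements(STATEMENTS))
for name, objects in grouped.items():
    print(name, len(objects))
```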

Different ways to download information

• Generic TOPSAN page
– Semantic information embedded into every TOPSAN page
• RDFa interface
– http://topsan.org/rdfa/2A2M
– XML
• Bulk Download
– http://files.topsan.org/topsan.n3.gz
– All unique semantic triples stored in a single N3-formatted file
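
A minimal Python sketch of two of these access routes follows. It builds the per-entry RDFa URL from a PDB ID and streams statements from a bulk file that has already been downloaded to disk. The function names are hypothetical, and the line-by-line reading assumes that the file keeps one statement per line, which the N3 format does not guarantee.

```python
import gzip

RDFA_BASE = "http://topsan.org/rdfa/"              # from the slide
BULK_URL = "http://files.topsan.org/topsan.n3.gz"  # from the slide


def rdfa_url(pdb_id: str) -> str:
    """Build the per-entry RDFa URL, e.g. for PDB ID 2A2M."""
    return RDFA_BASE + pdb_id


def iter_statements(path: str):
    """Yield statement lines from a local gzip-compressed N3 file.

    Assumption: one statement per line. Blank lines, '#' comments, and
    '@prefix' directives are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(("@prefix", "#")):
                yield line


print(rdfa_url("2A2M"))
```

Once the file at BULK_URL has been saved locally, iter_statements("topsan.n3.gz") can be looped over without decompressing the whole file into memory.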

Simple SPARQL

PREFIX tps: <http://purl.org/topsan/tps#>
SELECT ?id ?weight WHERE {
  ?id tps:molecular_weight ?weight
}
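
One way to submit such a query is over HTTP, using the "query" parameter defined by the SPARQL Protocol. The Python sketch below only builds the request URL and sends nothing. The endpoint address is a placeholder, since the slides do not name a SPARQL endpoint, and sparql_get_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not name a SPARQL endpoint.
ENDPOINT = "http://example.org/sparql"

QUERY = """\
PREFIX tps: <http://purl.org/topsan/tps#>
SELECT ?id ?weight WHERE {
  ?id tps:molecular_weight ?weight
}
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Return a GET URL carrying the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


print(sparql_get_url(ENDPOINT, QUERY))
```

The same query could also be run locally against the bulk N3 download with an RDF library, which avoids depending on a hosted endpoint.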

Availability and Licenses

• Project Site: http://www.topsan.org
• Software: http://www.topsan.org/Tools
• Open Source Licenses:
– Data: Creative Commons Attribution 3.0 License
– Software: GNU General Public License

Summary

• Structural genomics centers produce a large number of protein structures, most of which are never described in a publication

• TOPSAN provides a means for community annotation of such protein structures

• The TOPSAN Protein Syntax (TPS) allows annotators to easily enter machine-readable, structured data

• TOPSAN content is being made accessible to computational query and analysis via semantic web technologies

• Many aspects of TOPSAN are still under development and are planned to evolve with user needs

Acknowledgements

• Inspiration for TOPSAN/semantic web connection: DBCLS BioHackathon 2010

• Developers: Krishna Subramanian, Kyle Ellrott, Dana Weekes

• All contributors and users