Semantics powered Bioinformatics

Amit Sheth, William S. York, et al.
Large Scale Distributed Information Systems Lab & Complex Carbohydrate Research Center
University of Georgia

Project information: http://lsdis.cs.uga.edu



Background: SW for Life Sciences

• Bioinformatics of Glycan Expression: a component of the NCRR "Integrated Technology Resource for Biomedical Glycomics".

• W3C Interest Group on Semantic Web for Health care and Life Sciences

• Deployed Active Semantic Electronic Medical Patient Record application at the Athens Heart Center


Agenda

• Review of Accomplishments/Ongoing Work:
  o GLYDE standard
  o GlycO Ontology
  o ProPreO Ontology
  o Semantic Analytical Glycomics Workflow
  o Visualization
  o Semantic Web Services: WSDL-S/METEOR-S


GLYDE standard

• An XML based representation format for glycan structures

• Inter-convertible with existing data represented using IUPAC or LINUCS.

• In progress: Incorporation of Probability based representation

• In progress: Incorporation of aspects for visualization of structures using GLYDE (XML) files

GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.


Collaborative GlycoInformatics

• Enable querying and export of query results in GLYDE format

• Using GLYDE representation for disambiguation, mapping and matching

[Figure: a QUERY against MonosaccharideDB, SweetDB and KEGG; each RESULT is exchanged as GLYDE XML (<glyde><residue> … </residue></glyde>).]


Collaborative GlycoInformatics

• Development of GLYDE semantic web portal
• Integration with www.glycosciences.de
  o Visualization aspect integrated with LiGraph (Heidelberg) or OntoVista (UGA)
• Semantic Annotation of publications in GlycoProteomics domain

[Figure: the GLYDE Semantic Portal shown together with KEGG, MonosaccharideDB and www.glycosciences.de.]


Collaborative GlycoInformatics

Evolving collaboration between:

• LSDIS/CCRC: Will York, Amit Sheth, Michael Pierce
• EUROCarbDB (German Cancer Research Center): Willi von der Lieth
• Consortium for Functional Glycomics (CFG): Rahul Raman, Ram Sasisekharan, Thomas Lütteke
• N.D. Zelinsky Institute of Organic Chemistry (Moscow): Yuriy Knirel
• Mitsui Knowledge Industry (Japan): Hisashi Narimatsu, Norihiro Kikuchi
• Kyoto Encyclopedia of Genes and Genomes (KEGG): Minoru Kanehisa, Kiyoko F. Aoki-Kinoshita
• Palo Alto Research Center (PARC): David Goldberg


Semantic GlycoInformatics - Ontologies

• GlycO: A domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
  o Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
  o URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco

• ProPreO: A comprehensive process ontology modeling experimental proteomics
  o Contains 330 classes, 6 million+ instances
  o Models three phases of experimental proteomics
  o URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo


GlycO taxonomy

The first levels of the GlycO taxonomy

Most relationships and attributes in GlycO

GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.


Pathway representation in GlycO

Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.


Zooming in a little …

The N-Glycan with KEGG ID 00015 is the substrate of the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the Glycan with KEGG ID 00020.

[Figure labels: Reaction R05987, catalyzed by enzyme 2.4.1.145; adds_glycosyl_residue; N-glycan_b-D-GlcpNAc_13]
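The claim that pathways can be inferred from reaction descriptions can be illustrated with a small sketch that chains reactions whose product is the substrate of the next one. Only reaction R05987 (glycan 00015 to glycan 00020, EC 2.4.1.145) comes from the slides. The second reaction, its identifier, its enzyme placeholder and glycan "00099" are made up for illustration, and the `Reaction` type and `infer_pathways` function are assumptions, not the reasoning mechanism GlycO actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    rid: str        # reaction identifier
    enzyme: str     # EC class of the catalyzing enzyme
    substrate: str  # glycan ID consumed by the reaction
    product: str    # glycan ID produced by the reaction


def infer_pathways(reactions, start):
    """Return every maximal chain of reactions that begins at glycan `start`."""
    by_substrate = {}
    for r in reactions:
        by_substrate.setdefault(r.substrate, []).append(r)

    def extend(glycan, seen):
        # Skip products already on the current chain to avoid cycles.
        nexts = [r for r in by_substrate.get(glycan, []) if r.product not in seen]
        if not nexts:
            return [[]]
        chains = []
        for r in nexts:
            for tail in extend(r.product, seen | {r.product}):
                chains.append([r] + tail)
        return chains

    return extend(start, {start})


REACTIONS = [
    Reaction("R05987", "2.4.1.145", "00015", "00020"),  # from the slide
    # Hypothetical follow-up reaction, invented for this sketch only.
    Reaction("R_HYPOTHETICAL", "EC_HYPOTHETICAL", "00020", "00099"),
]

pathways = infer_pathways(REACTIONS, "00015")
for chain in pathways:
    print(" ; ".join(f"{r.substrate} -[{r.rid}]-> {r.product}" for r in chain))
```

No pathway object is stored anywhere; the chain is derived from the individual reaction records, which is the point the slide makes.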


Ontology Population

• The next slides show the different steps that were necessary to populate GlycO with glycan structures from multiple sources.

• GLYDE is used to disambiguate between representations from multiple sources
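The slides do not say how structures from different sources are compared. One possible approach, shown here purely as an assumption, is to reduce each residue tree to a canonical string in which branch order no longer matters, so that two source representations of the same glycan compare equal. The tuple layout `(link, name, children)` and the function name `canonical` are hypothetical.

```python
def canonical(residue):
    """Return a canonical string for a residue tree.

    `residue` is a (link, name, children) tuple. Children are sorted by their
    own canonical strings, so the order of branches in the input is irrelevant.
    """
    link, name, children = residue
    parts = sorted(canonical(child) for child in children)
    return "[" + link + "][" + name + "]{" + "".join(parts) + "}"


# The same branched fragment, with its branches listed in different orders.
from_source_a = ("3+1", "a-D-Manp", [
    ("2+1", "b-D-GlcpNAc", []),
    ("4+1", "b-D-GlcpNAc", []),
])
from_source_b = ("3+1", "a-D-Manp", [
    ("4+1", "b-D-GlcpNAc", []),
    ("2+1", "b-D-GlcpNAc", []),
])

knowledge_base = {canonical(from_source_a)}
already_in_kb = canonical(from_source_b) in knowledge_base
print(already_in_kb)
```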


Ontology population workflow

[Flowchart nodes: Semagix Freedom knowledge extractor; Instance Data; Has CarbBank ID? (YES / NO); IUPAC to LINUCS; LINUCS to GLYDE; Compare to Knowledge Base; Already in KB? (YES: next Instance; NO: Insert into KB).]


Example glycan in LINUCS notation:

[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}


The same glycan in GLYDE XML:

<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
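The "LINUCS to GLYDE" step can be sketched as a small recursive-descent parser that turns the LINUCS string from the slides into the nested `<residue>` structure of the GLYDE example. This is a minimal sketch under stated assumptions, not the project's converter: the function names are hypothetical, and the rule that splits a name such as `b-D-GlcpNAc` into anomer, chirality and monosaccharide (dropping the ring-form letter `p` or `f`) is inferred only from the two examples in the slides.

```python
import re
import xml.etree.ElementTree as ET

# One LINUCS node: "[(4+1)][b-D-GlcpNAc]{" or, for the root, "[][Asn]{".
NODE = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]+)\]\{")
# Assumed residue naming rule: anomer-chirality-stem, ring letter, suffix.
NAME = re.compile(r"^([ab])-([DL])-([A-Z][a-z]{2})[pf](.*)$")


def parse_linucs(text, pos=0):
    """Parse one node starting at `pos`; return (node, position after it)."""
    m = NODE.match(text, pos)
    if m is None:
        raise ValueError(f"expected a LINUCS node at offset {pos}")
    link, carbon, name = m.groups()
    pos = m.end()
    children = []
    while pos < len(text) and text[pos] != "}":
        child, pos = parse_linucs(text, pos)
        children.append(child)
    if pos >= len(text):
        raise ValueError("unbalanced braces in LINUCS string")
    return (link, carbon, name, children), pos + 1


def add_residue(node, parent):
    """Append `node` and its subtree to `parent` as <residue> elements."""
    link, carbon, name, children = node
    m = NAME.match(name)
    if m is None:
        raise ValueError(f"unsupported residue name: {name}")
    anomer, chirality, stem, suffix = m.groups()
    element = ET.SubElement(parent, "residue", {
        "link": link,
        "anomeric_carbon": carbon,
        "anomer": anomer,
        "chirality": chirality,
        "monosaccharide": stem + suffix,
    })
    for child in children:
        add_residue(child, element)


def linucs_to_glyde(text):
    """Convert a LINUCS string into a <Glycan> element tree."""
    (_, _, aglycon, children), end = parse_linucs(text)
    if end != len(text):
        raise ValueError("trailing characters after LINUCS structure")
    root = ET.Element("Glycan")
    ET.SubElement(root, "aglycon", {"name": aglycon})
    for child in children:
        add_residue(child, root)
    return root


LINUCS = (
    "[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]"
    "{[(4+1)][b-D-Manp]{[(3+1)][a-D-Manp]"
    "{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]"
    "{}}[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}"
)

glycan = linucs_to_glyde(LINUCS)
print(ET.tostring(glycan, encoding="unicode"))
```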



ProPreO

• ProPreO: A process ontology to capture the proteomics experimental lifecycle:
  o Separation
  o Mass spectrometry
  o Analysis
  o 330 classes
  o 110 properties
  o 6 million+ instances


Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.

Usage: Mass spectrometry analysis

Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875


Semantic Annotation of Experimental Data

• Enables Ontology-mediated Disambiguation
• Allows correlation between disparate entities using Semantic Relations

P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4

Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875


Semantic GlycoProteomics Workflow

[Workflow diagram. Materials and data: Cell Culture; Glycoprotein Fraction; Glycopeptides Fraction; Peptide Fraction; ms data and ms/ms data; ms peaklist and ms/ms peaklist; N-dimensional array; Peptide list; Glycopeptide identification and quantification. Steps: extract; proteolysis; Separation technique I; PNGase; Separation technique II; Mass spectrometry; Data reduction; binning; Peptide identification; Signal integration; Data correlation. Fraction counts shown on the diagram: 1, n, n, n*m.]


Web Services based Workflow = Web Process

[Figure: a WORKFLOW composed of Web Service 1 to Web Service 4 (WS1 to WS4), shown alongside the platforms Linux, Solaris, Mac and Windows XP.]


BOWSER

• Use semantics for describing Web Services
• WSDL-S (LSDIS/IBM)
• Use service-level annotation of Web Services
• Graphical traversal of taxonomy of biological concepts to search for Web Services
• http://128.192.9.11:8080/stargate/bowser.jsp


Semantic Annotation of Scientific Data

ms/ms peaklist data:

830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043

Annotated ms/ms peaklist data:

<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
  <parent_ion_mass>830.9570</parent_ion_mass>
  <total_abundance>194.9604</total_abundance>
  <z>2</z>
  <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
  <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
  <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
  <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
  <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
  <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
  <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
  <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
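The step from the raw peaklist to its annotated form can be sketched as follows. The sketch assumes the raw layout implied by the annotated version on this slide: a first line holding parent ion mass, total abundance and charge, followed by one m/z and abundance pair per line. Because `ms/ms_peak_list` and `m/z` are not legal XML names, the sketch substitutes `ms_ms_peak_list` and `mz`; those names, and the function `annotate_peaklist`, are assumptions made here and not part of ProPreO.

```python
import xml.etree.ElementTree as ET

RAW = """\
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
"""


def annotate_peaklist(raw, instrument, mode="ms/ms"):
    """Wrap a raw ms/ms peaklist in descriptive XML elements."""
    rows = [line.split() for line in raw.splitlines() if line.strip()]
    parent_mass, total_abundance, charge = rows[0]

    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    # Values are kept as strings so the original precision is preserved.
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "mass_spec_peak", {"mz": mz, "abundance": abundance})
    return root


annotated = annotate_peaklist(
    RAW, "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
)
print(ET.tostring(annotated, encoding="unicode"))
```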



Discovery of relationship between biological entities

[Figure. Entities: Identified and quantified peptides; Fragment of specific protein; Lectin; Collection of N-glycan ligands; Collection of biosynthetic enzymes; Specific cellular process. Knowledge sources: GlycO; ProPreO; Gene Ontology (GO); Genomic database (Mascot/Sequest). Label: process.]

The inference: instances of the class collection of Biosynthetic enzymes (GNT-V) are involved in the specific cellular process (metastasis).


Semantic Web Services using WSDL-S

• Formalize description and classification of Web Services using ProPreO concepts

Description of a Web Service using: Web Service Description Language

WSDL ModifyDB

<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      ….. </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>

WSDL-S ModifyDB

<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      …… </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>

[Figure: the ProPreO process ontology, with the concepts data, sequence and peptide_sequence defined in it.]
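To show what the `wssem:modelReference` annotation makes possible, the following sketch reads a WSDL-S document and lists which messages are linked to which ontology concepts. The sample document is a shortened, well-formed version of the ModifyDB fragment from this slide. The slide elides its namespace declarations, so the `wsdl` prefix is bound here to the standard WSDL 1.1 namespace as an assumption, and the function name `model_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # assumed: standard WSDL 1.1
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"  # from the slide

WSDL_S = f"""
<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="{WSDL_NS}"
    xmlns:wssem="{WSSEM_NS}"
    xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>
"""


def model_references(wsdl_text):
    """Map each annotated message name to its ontology concept reference."""
    root = ET.fromstring(wsdl_text)
    attribute = f"{{{WSSEM_NS}}}modelReference"
    references = {}
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        concept = message.get(attribute)
        if concept is not None:  # plain WSDL messages carry no annotation
            references[message.get("name")] = concept
    return references


refs = model_references(WSDL_S)
print(refs)
```

A service registry could index services by these concept references, which is the kind of ontology-based search the BOWSER slide describes.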


Semantic Visualization

• Ontologies are meant for machine consumption

• Often too convoluted for the human eye
• The scientist needs to know the concepts she uses for annotation
• Build a visualization environment that translates the formal concepts into a representation the domain expert understands well


Single Glycan


Customizable Layouts

• Using customizable layouts, knowledge can be formalized in a machine understandable way and then visually translated for the user's needs.
  o Cartoonist representation for the Glycobiologist
  o Chemical reactions shown as left side and right side, instead of the convoluted representation in the ontology.


Ongoing and Future Work

• SemURI: Semantic URI based provenance scheme using ProPreO

• RDF-based version of the GLYDE schema
• A framework for semantic annotation of experimental data
• Integration of large datasets (~500MB) into ProPreO for reasoning


Further details at:

• http://lsdis.cs.uga.edu/projects/glycomics/