Core 2: Bioinformatics
NCBO-Berkeley
Berkeley Drosophila Genome Project
Finish the sequence of the euchromatic genome of Drosophila melanogaster
Annotate biologically important features of this sequence
Produce gene disruptions using P element-mediated mutagenesis
Full length sequencing and expression characterization of a cDNA for every gene
Developing informatics tools
Who is here from NCBO-Berkeley
Sima, Chris, Mark, Shu
Chris
  GadFly database schema
  GO database schema
  Chado database schema
  Perl libraries for all
  OBD data architect
Shu
  AmiGO, ImaGO & database
  Compute Pipeline
  OBD dev & Data flow
Mark
  Apollo Genome Annotation Editor
  Phenote and other OBD interfaces
Sima
  Adh region annotation
  Annotation of entire Drosophila Genome
  Project manager and coordinator nonpareil
  Associate Director
OBD Outline
Core 2 aims, refresher
Data models for OBD
  phenotypes
  clinical trials
  others
Modeling frameworks
  exchange formats
  database system
    SQL based vs ‘SemWeb’ dbs
Progress
Demo
Core 2 Specific Aims
1. Apply ontologies
  Software toolkit for describing and classifying data
2. Capture, manage, and view data annotations
  Database (OBD) and interfaces to store and view annotations
3. Investigate and compare implications
  Linking human diseases to model systems
4. Maintain
  Ongoing reconciliation of ontologies with annotations
Core 3 Driving Biological Projects
DBPs
  phenotypes: Fly and Zebrafish to human
  clinical trials
Core 2 Aims
1. Apply ontologies to describe data
2. Capture, manage, and view data annotations
3. Link disease genes to model systems
4. Reconcile annotation and ontology changes
Apply ontologies to describe data
Requirements
Data capture tools
  phenote demo tomorrow
  no tool requirements from UCSF
Data model
Database (OBD)
  aim 2
[Diagram labels: dataflow; user's view]
Data models
Common/shared domain specific models
Aim 3: linking disease genes
  model must support this
  granularity
  comparability
Domain specific data models
FB, ZFIN: genotype to phenotype
  ‘EAV’: qualities inhere in entities
  orthologs: phenotype to disease
  core 2 will help define common model
UCSF: clinical trials
  existing ontology-friendly schema: trialbank
Phenotype data model
Qualities inhere in entities
  Entity term; PATO term
brain FBbt:00005095; fused PATO:0000642
gut MA:0000917; dysplastic PATO:0000640
tail fin ZDB:020702-16; ventralized PATO:0000636
kidney ZDB:020702-16; hypertrophied PATO:0000636
midface ZDB:020702-16; hypoplastic PATO:0000636
Pre-composed phenotype terms
  Mammalian Phenotype Ontology
“increased activated B-cell number” MPO:0000319
“pink fur hue” MPO:0000374
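The entity-plus-quality model above can be sketched in a few lines of Python. This is only an illustration: the class and field names (`Term`, `Phenotype`) are assumptions and are not part of pheno-xml or any OBD schema. The identifiers are the ones listed on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """Post-composed phenotype: a quality (PATO term) inhering in an entity."""
    entity: Term
    quality: Term

    def describe(self) -> str:
        return f"{self.quality.label} {self.entity.label}"


# Identifiers copied from the examples above.
fused_brain = Phenotype(
    entity=Term("FBbt:00005095", "brain"),
    quality=Term("PATO:0000642", "fused"),
)
dysplastic_gut = Phenotype(
    entity=Term("MA:0000917", "gut"),
    quality=Term("PATO:0000640", "dysplastic"),
)

print(fused_brain.describe())
print(dysplastic_gut.describe())
```

A pre-composed term such as “pink fur hue” would instead be a single `Term`; the point of the post-composed form is that entity and quality can be queried separately.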
Extensions to simple model
What about
  Relational attributes
  Quantitative vs qualitative
  Post-composing entity and attribute terms
  Relative states/values
  Variation in place, space and time
  A better treatment of absence
See CSHL Pheno meeting talk; also, a more detailed formal presentation (available)
Not to mention genotypes, environments, provenance, etc
Modeling clinical trials
Model already described using frame-based schema
Further modeling required?
  abstraction to integrate more with other OBD datatypes
  views to only show parts relevant to OBD/BioPortal
Future DBPs and use cases
OBD will contain a variety of general types of data
Modeling is expensive
  use existing models where appropriate
  but whole must be cohesive and integrated
Most of this talk focuses on the pheno DBPs for illustrative purposes
Modeling frameworks
language technology
Modeling data: underlying formalism
Model is expressed with modeling language
Options
  Relational/SQL
  Semi-structured, XML
  Object-centric (UML, frame-based?)
  Logic based
    description logic: e.g. OWL
    first-order logic: e.g. CL
  Natural language descriptions
Model should be independent of language it is expressed in
Data exchange language: XML
Simple XML is suited for data exchange
XML can drive software spec
  constrains programmatic data model
  XSD can generate UML
  closed world assumption is useful
    cf Ruttenberg et al
Mature technology
  well understood by developers, MODs
  standards
How OBD uses XML
obd-geno-pheno-xml (aka pheno-xml)
  actually multiple modular components
    genotype schema
    phenotype schema: ‘EAV’
    environment schema
    provenance schema
  used as exchange format
    cf: gene ontology association files
no need for ClinicalTrials-XML
Example pheno-xml
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
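As a rough illustration of how an exchange file like this might be consumed, the sketch below reads the slide's example with Python's standard `xml.etree.ElementTree`. The function name `extract_annotations` and the tuple layout are assumptions, not part of pheno-xml. The closing `</genotype>` tag is assumed, and the quality type is left elided exactly as on the slide.

```python
import xml.etree.ElementTree as ET

# The slide's example, with the closing </genotype> tag assumed.
# The quality type is elided on the slide ("PATO:……") and is kept as-is.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:……">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_annotations(xml_text):
    """Return one (genotype, entity, state, stage) tuple per state element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for state in entity.iter("state"):
            time_range = state.find("time_range")
            stage = time_range.get("type") if time_range is not None else None
            rows.append(
                (root.get("id"), entity.get("type"), state.get("type"), stage)
            )
    return rows


annotations = extract_annotations(PHENO_XML)
for row in annotations:
    print(row)
```

Flattening to tuples like this is roughly what loading into a relational phenotype table would need, which is one reason the XML and SQL layers fit together with little friction.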
SQL Databases
Data storage, management and querying
  all MODs use SQL dbs
Lots of advantages
  scalable, standard QL, mature, APIs, etc
  pure relational model is reasonably formal
XML/SQL more or less compatible
  low impedance mismatch
Schemas for geno-pheno data
We already have schema: Chado
  Used by many MODs (eg FB)
    others are ‘chado compliant’ (eg ZFIN)
  Modular
    ontologies, genomic, genotype, phenotype, phylogenies, …etc
Phenotype module needs updating
  will be driven by pheno-xml
Problem solved?
We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each
Is this enough to work with?
Issues
OBD will be much more than geno-pheno
  clinical trials
  future DBPs, other NCBCs
  any data expressed in an ontology language
Software and schema development expensive
  fragility in face of schema evolution
  development gets bogged down in data exchange issues
Major issue
SQL and XPath work great for ‘traditional’ data…
…but are too low level for ontology-centric data
  lack of inference
  no way to directly express ontology constraints
Use cases from previous experience: AmiGO
GO
  “find all TF genes” (is_a closure)
  “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a)
Our solution (AmiGO & go-sqldb)
  pre-compute transitive closure over all relations in db
  (sort of) works for GO (for now)
    refresh problem
    explosive for tangled DAGs
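The pre-computation described above can be sketched as follows. This is a toy illustration and not the go-sqldb implementation: the terms, gene names, and function names are assumptions, and the composition rule (a chain through part_of yields part_of) is the usual convention for these two relations.

```python
# Toy ontology edges as (child, relation, parent). The terms are illustrative
# and are not taken from a GO release.
EDGES = {
    ("rough ER", "is_a", "endoplasmic reticulum"),
    ("ER membrane", "part_of", "endoplasmic reticulum"),
    ("rough ER membrane", "is_a", "ER membrane"),
}

# Hypothetical gene product annotations: gene -> term.
ANNOTATIONS = {
    "geneA": "rough ER membrane",
    "geneB": "endoplasmic reticulum",
    "geneC": "nucleus",
}


def compose(first, second):
    """is_a followed by is_a stays is_a; any chain through part_of is part_of."""
    return "is_a" if first == second == "is_a" else "part_of"


def transitive_closure(edges):
    """Add implied edges until nothing new appears (naive fixpoint)."""
    closure = set(edges)
    while True:
        implied = {
            (a, compose(r1, r2), c)
            for a, r1, b in closure
            for b2, r2, c in closure
            if b == b2
        }
        if implied <= closure:
            return closure
        closure |= implied


CLOSURE = transitive_closure(EDGES)


def localised_to(term):
    """Genes annotated to the term or to anything is_a/part_of below it."""
    below = {child for child, _rel, parent in CLOSURE if parent == term}
    return sorted(g for g, t in ANNOTATIONS.items() if t == term or t in below)


print(localised_to("endoplasmic reticulum"))
```

The two caveats on the slide show up directly here: `CLOSURE` has to be recomputed whenever `EDGES` changes (the refresh problem), and the number of implied edges grows quickly when terms have many parents.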
OBD requires more ontological awareness
Other relations
  ontogenic (eg derives_from)
  transitive_over
Other types of data
  Pre- versus post-composed terms
    E.g. MPO versus AO+PATO
    E.g. Entity+Spatial qualifier
    queries over either should be interchangeable
Solution: more expressive formalisms
QLs and APIs should provide and abstract away common ontology operations
  ease of programming, optimisation
Choices
  ‘Semweb’ databases
    RDF + RDFS + OWL [ lite + DL ] + extra
    lots to choose from, emerging standards
    compatible with Obo v1.2 spec
  Deductive databases
    superset of relational databases
    from Prolog to full CL
Modeling phenotypes as RDF/OWL or Obo instances
[Diagram labels: classes/terms; instances; entity; quality]
Example query in SeRQL
SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "wing vein"
  AND label(TaxN) = "Arthropoda"
  AND label(QN) = "ShapeValue"
find mutations affecting the shape of the wing vein:
results of query on OBD-sesame:
one annotation to “wing vein L2”, “branched”
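To show why a query for “wing vein” can return an annotation to “wing vein L2”, here is a small in-memory sketch of the same pattern in Python. It does not use Sesame or SeRQL. The triples, the instance identifiers (`e1`, `org1`, `q1`), and the use of labels as class identifiers are all simplifying assumptions, and type inference over `rdfs:subClassOf` is hand-coded.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"
PART_OF = "OBO_REL_part_of"
HAS_QUALITY = "OBO_REL_has_quality"

# Toy data: classes are named by their labels; instance ids are made up.
TRIPLES = {
    ("wing vein L2", SUBCLASS, "wing vein"),
    ("branched", SUBCLASS, "ShapeValue"),
    ("Drosophila melanogaster", SUBCLASS, "Arthropoda"),
    ("e1", TYPE, "wing vein L2"),
    ("org1", TYPE, "Drosophila melanogaster"),
    ("q1", TYPE, "branched"),
    ("e1", PART_OF, "org1"),
    ("e1", HAS_QUALITY, "q1"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def superclasses(cls):
    """The class itself plus everything reachable via rdfs:subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(objects(current, SUBCLASS))
    return seen


def has_type(instance, cls):
    """True if the instance has cls as an asserted or inferred type."""
    return any(cls in superclasses(t) for t in objects(instance, TYPE))


def find_annotations(entity_class, taxon_class, quality_class):
    """Return (entity type, quality type) pairs matching the three patterns."""
    results = []
    for e, p, q in sorted(TRIPLES):
        if p != HAS_QUALITY:
            continue
        if not (has_type(e, entity_class) and has_type(q, quality_class)):
            continue
        if any(has_type(org, taxon_class) for org in objects(e, PART_OF)):
            for et in sorted(objects(e, TYPE)):
                for qt in sorted(objects(q, TYPE)):
                    results.append((et, qt))
    return results


print(find_annotations("wing vein", "Arthropoda", "ShapeValue"))
```

In a triple store with RDFS inference the `has_type` step is done by the store itself, which is the kind of ontology operation the query language is meant to abstract away.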
Advantages of ‘SemWeb’ dbs
Advantages over pure SQL
  The ontology is the model
    constraints encoded in ontology
      e.g. certain quality types only applicable to certain entity types
    agile development - fast database integration
  Rich modeling constructs
    transitivity, subsumption, intersection, etc
    powerful QLs and APIs
More (technical) interoperation ‘for free’
  URIs
  proven?
Open World Assumption (maybe a hindrance?)
Disadvantages of ‘SemWeb’ dbs
Disadvantages
  speed
    may be slower than SQL
    ..but in-memory execution is fast
  lack of maturity
    new technology
    ..but has a LOT of momentum
  foundations
    are RDF triples appropriate?
    inherent difficulties modeling time
    SQL allows n-ary relations/predicates
Hybrid model
SemWeb dbs are commonly layered over SQL DBs
We can have the best of both worlds
  Data
  View layers
    mapping between Obo/OWL model and domain-specific relational schema
    (optionally) materialized for speed
  different applications use appropriate layer
Current progress: OBD-Sesame
Sesame
  open source ‘triple store’
  based on Jena
    also used in Protégé-OWL
  storage layer options
    mysql/postgresql
      generic schema
    in-memory
    disk-based
OBD in Sesame: current datasets
Pheno
  ZFIN & FB: EAV trial 2003 data
    Test ortholog set
    FB ‘simple phenotype’ alleles
    ZFIN legacy phenotype data, automatically parsed to EAV
  Ontologies: AOs, PATO, Cell, GO
  Method
    excel & flatfiles -> pheno-xml -> owl
    OWL from http://www.fruitfly.org/~cjm/obo-download
Trialbank
  Method: ocelot -> obo-xml -> owl
Soon
  human orthologs and omim
Technology Evaluation: Sesame
Use case query set
Benchmarks
  preliminary conclusions
    SQL layering is terrible
    in-memory is fast
  optimisations? other triple stores?
  up to date results on wiki
    http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
Need to test OWL-DL entailment
Bigger dataset required for full evaluations
Community effort: public-semweb-lifesci list
Parallel development: an OBD Prototype
Initiated prior to OBD-Sesame
Simple deductive database
  prolog-based
  chado-like schema
    can be views on Obo/OWL predicates
  amigo-clone user interface
Rapid prototyping
Current dataset
  as obd-sesame, plus CT
  trivial to drop in more
Example logic query
inheres(QI,EI) &
inst(QI,QT) &
label(QT,shape) &
inst(EI,ETP) &
part_of*(ETP,ET) &
label(ET,'head capsule')
find mutations affecting the shape of some part of the head capsule
results of query on OBD-prolog:
one annotation to “arista lateral”, “irregular shape”
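The conjunctive query above can be imitated in plain Python to make the role of `part_of*` concrete. This is a sketch, not the OBD-prolog code: the facts are invented for illustration (including the part_of chain from “arista lateral” to “head capsule”), labels stand in for class identifiers, and `inst` is assumed to be closed over is_a so that an “irregular shape” instance counts as a “shape”.

```python
# Made-up facts for illustration; labels are used as class identifiers.
IS_A = {("irregular shape", "shape")}
PART_OF = {("arista lateral", "antenna"), ("antenna", "head capsule")}
INST = {("q1", "irregular shape"), ("e1", "arista lateral")}
INHERES = {("q1", "e1")}


def star(pairs, start):
    """Reflexive transitive closure: start plus everything reachable from it."""
    reached, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for a, b in pairs:
            if a == node and b not in reached:
                reached.add(b)
                frontier.append(b)
    return reached


def query(quality_label, whole_label):
    """Mirror of: inheres(QI,EI) & inst(QI,QT) & inst(EI,ETP) & part_of*(ETP,ET)."""
    answers = []
    for qi, ei in sorted(INHERES):
        for qi2, qt in sorted(INST):
            if qi2 != qi or quality_label not in star(IS_A, qt):
                continue
            for ei2, etp in sorted(INST):
                if ei2 == ei and whole_label in star(PART_OF, etp):
                    answers.append((etp, qt))
    return answers


print(query("shape", "head capsule"))
```

Because `star` is reflexive, an annotation made directly to “head capsule” would match as well as one made to any of its parts.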
OBD TODO
Pheno-xml
  finalise release version
  finalise Obo/OWL mapping
  logic specification
Data
  orthologies
OBD - BioPortal integration
  how will it work?
Versioning and reconciling changes
  decide on ontology versioning first
OBD dependencies
PATO development
UMLS into OBO-site
Ontologies
  FMA accessibility?
  species-centric AO alignments (XSPAN?)
  Sept meeting on AO development
  Nov meeting on disease ontologies
Data
  MOD pheno annotation
  OMIM annotation
Bioportal
Misc
NLP for phenote
  Obol trial on evolutionary phenotype characters
  cambridge NLP project
  can be used to ‘prime’ phenote
Decomposing MPO
  pink fur def= fur, has_quality: pink
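One way to picture the decomposition is a lookup from a pre-composed term to its entity and quality, so that a single query covers both annotation styles. The sketch below is an assumption-laden illustration: the genotype names are hypothetical, labels stand in for the entity and quality identifiers (which the slide does not give), and only the “pink fur” definition comes from the slide.

```python
# Logical definitions for pre-composed terms: term id -> (entity, quality).
# The single entry follows the slide's "pink fur" example.
DEFINITIONS = {
    "MPO:0000374": ("fur", "pink"),
}

# Hypothetical annotations: one pre-composed, two post-composed.
ANNOTATIONS = {
    "genotype-1": "MPO:0000374",
    "genotype-2": ("fur", "pink"),
    "genotype-3": ("brain", "fused"),
}


def normalise(annotation):
    """Return an (entity, quality) pair for either annotation style."""
    if isinstance(annotation, str):
        return DEFINITIONS[annotation]  # pre-composed term id
    return annotation  # already an (entity, quality) pair


def find(entity, quality):
    """Subjects whose phenotype matches, regardless of how it was recorded."""
    return sorted(
        subject
        for subject, annotation in ANNOTATIONS.items()
        if normalise(annotation) == (entity, quality)
    )


print(find("fur", "pink"))
```

With definitions like this in place, queries over pre-composed and post-composed annotations become interchangeable, which is the requirement stated earlier in the talk.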
Discussion
Will SemWeb dbs work?
  experiment
Ontology-based modeling
  the ontology is the model
  importance of relations ontology, upper ontology
Demos
http://yuri.lbl.gov/amigo/ct
http://yuri.lbl.gov/amigo/obd
http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db