Core 2: Bioinformatics NCBO-Berkeley. Berkeley Drosophila Genome Project Finish the sequence of the...

Core 2: Bioinformatics

NCBO-Berkeley

Berkeley Drosophila Genome Project

Finish the sequence of the euchromatic genome of Drosophila melanogaster

Annotate biologically important features of this sequence

Produce gene disruptions using P element-mediated mutagenesis

Full length sequencing and expression characterization of a cDNA for every gene

Developing informatics tools

Who is here from NCBO-Berkeley

Sima, Chris, Mark, Shu

Chris
  GadFly database schema
  GO database schema
  Chado database schema
  Perl libraries for all
  OBD data architect

Shu
  AmiGO, ImaGO & database
  Compute Pipeline
  OBD dev & data flow

Mark
  Apollo Genome Annotation Editor
  Phenote and other OBD interfaces

Sima
  Adh region annotation
  Annotation of entire Drosophila Genome
  Project manager and coordinator nonpareil
  Associate Director

OBD Outline

  Core 2 aims, refresher
  Data models for OBD
    phenotypes
    clinical trials
    others
  Modeling frameworks
    exchange formats
    database system
    SQL based vs ‘SemWeb’ dbs
  Progress
  Demo

Core 2 Specific Aims

  1. Apply ontologies
     Software toolkit for describing and classifying
  2. Capture, manage, and view data annotations
     Database (OBD) and interfaces to store and view annotations
  3. Investigate and compare implications
     Linking human diseases to model systems
  4. Maintain
     Ongoing reconciliation of ontologies with annotations

Core 3 Driving Biological Projects

DBPs phenotypes: Fly and Zebrafish to human clinical trials

Core 2 Aims

  1. Apply ontologies to describe data
  2. Capture, manage, and view data annotations
  3. Link disease genes to model systems
  4. Reconcile annotation and ontology changes

Apply ontologies to describe data

Requirements Data capture tools

phenote demo tomorrow

no tool requirements from UCSF

Data model Database (OBD)

--aim 2

dataflow

user’s view

Data models

  Common/shared
    Aim 3: linking disease genes
    model must support this: granularity, comparability
  Domain specific data models
    FB, ZFIN: genotype to phenotype
      ‘EAV’: qualities inhere in entities
    orthologs: phenotype to disease
      core 2 will help define common model
    UCSF: clinical trials
      existing ontology-friendly schema - trialbank

Phenotype data model

  Qualities inhere in entities
    Entity term; PATO term
      brain FBbt:00005095; fused PATO:0000642
      gut MA:0000917; dysplastic PATO:0000640
      tail fin ZDB:020702-16; ventralized PATO:0000636
      kidney ZDB:020702-16; hypertrophied PATO:0000636
      midface ZDB:020702-16; hypoplastic PATO:0000636
  Pre-composed phenotype terms
    Mammalian Phenotype Ontology
      “increased activated B-cell number” MPO:0000319
      “pink fur hue” MPO:0000374
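The "qualities inhere in entities" model can be sketched as a pair of ontology terms. This is a minimal illustration only: the class names `Term` and `Phenotype` are assumptions made for this sketch and are not part of OBD; the term identifiers are the ones listed on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """An ontology term: an identifier plus a human-readable label."""
    id: str
    label: str


@dataclass(frozen=True)
class Phenotype:
    """A post-composed phenotype: a quality that inheres in an entity.

    Hypothetical class for illustration; not an OBD data structure.
    """
    entity: Term
    quality: Term

    def describe(self) -> str:
        # Render as "<quality> <entity>", e.g. "fused brain".
        return f"{self.quality.label} {self.entity.label}"


# Entity/quality pairs copied from the slide.
fused_brain = Phenotype(Term("FBbt:00005095", "brain"),
                        Term("PATO:0000642", "fused"))
dysplastic_gut = Phenotype(Term("MA:0000917", "gut"),
                           Term("PATO:0000640", "dysplastic"))

print(fused_brain.describe())
print(dysplastic_gut.describe())
```

Keeping entity and quality as separate terms is what lets annotations from different anatomy ontologies be compared on the shared PATO quality.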

Extensions to simple model

  What about
    Relational attributes
    Quantitative vs qualitative
    Post-composing entity and attribute terms
    Relative states/values
    Variation in place, space and time
    A better treatment of absence
  See CSHL Pheno meeting talk
    also, more detailed formal presentation (available)
  Not to mention genotypes, environments, provenance, etc

Modeling clinical trials

  Model already described using frame-based schema
  Further modeling required?
    abstraction to integrate more with other OBD datatypes
    views to only show parts relevant to OBD/BioPortal

Future DBPs and use cases

  OBD will contain a variety of general types of data
  Modeling is expensive
    use existing models where appropriate
    but whole must be cohesive and integrated
  Most of this talk focuses on the pheno DBPs for illustrative purposes

Modeling frameworks

language technology

Modeling data: underlying formalism

  Model is expressed with modeling language
  Options
    Relational/SQL
    Semi-structured, XML
    Object-centric (UML, frame-based?)
    Logic based
      description logic: e.g. OWL
      first-order logic: e.g. CL
    Natural language descriptions
  Model should be independent of language it is expressed in

Data exchange language: XML

  Simple XML is suited for data exchange
  XML can drive software spec
    constrains programmatic data model
    XSD can generate UML
    closed world assumption is useful
      cf Ruttenberg et al
  Mature technology
    well understood by developers, MODs
    standards

How OBD uses XML

  obd-geno-pheno-xml (aka pheno-xml)
    actually multiple modular components
      genotype schema
      phenotype schema: ‘EAV’
      environment schema
      provenance schema
    used as exchange format
      cf: gene ontology association files
  no need for ClinicalTrials-XML

Example pheno-xml

  <genotype id="ZFIN:tm84">
    <name>ZFIN:tm84</name>
    <genotype_phenotype_association>
      <phenotype>
        <entity type="ZDB-ANAT-010921-528">
          <quality type="PATO:……">
            <state type="PATO:0000636">
              <time_range type="ZDB-STAGE-010723-12"/>
            </state>
          </quality>
        </entity>
      </phenotype>
    </genotype_phenotype_association>
  </genotype>
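A consumer of this exchange format might flatten each association into one row per entity/quality/state. The sketch below uses Python's standard `xml.etree.ElementTree`; the element and attribute names follow the slide, while the function name `extract_annotations` and the quality identifier `PATO:PLACEHOLDER` are assumptions (the slide elides the real quality ID).

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the slide. "PATO:PLACEHOLDER" is a stand-in because
# the slide elides the actual quality identifier.
PHENO_XML = """
<genotype id="ZFIN:tm84">
  <name>ZFIN:tm84</name>
  <genotype_phenotype_association>
    <phenotype>
      <entity type="ZDB-ANAT-010921-528">
        <quality type="PATO:PLACEHOLDER">
          <state type="PATO:0000636">
            <time_range type="ZDB-STAGE-010723-12"/>
          </state>
        </quality>
      </entity>
    </phenotype>
  </genotype_phenotype_association>
</genotype>
"""


def extract_annotations(xml_text):
    """Return one dict per entity/quality/state combination in the document."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for entity in root.iter("entity"):
        for quality in entity.iter("quality"):
            for state in quality.iter("state"):
                time_range = state.find("time_range")
                rows.append({
                    "genotype": root.get("id"),
                    "entity": entity.get("type"),
                    "quality": quality.get("type"),
                    "state": state.get("type"),
                    # The stage is optional in this sketch.
                    "stage": None if time_range is None else time_range.get("type"),
                })
    return rows


rows = extract_annotations(PHENO_XML)
print(rows)
```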

SQL Databases

  Data storage, management and querying
    all MODs use SQL dbs
  Lots of advantages
    scalable, standard QL, mature, APIs, etc
    pure relational model is reasonably formal
  XML/SQL more or less compatible
    low impedance mismatch

Schemas for geno-pheno data

  We already have schema: Chado
    Used by many MODs (eg FB)
      others are ‘chado compliant’ (eg ZFIN)
    Modular
      ontologies
      genomic
      genotype
      phenotype
      phylogenies
      …etc
  Phenotype module needs updating
    will be driven by pheno-xml

Problem solved?

We have two mature, complementary technologies, and can define schemas for our model in an appropriate formalism for each

Is this enough to work with?

Issues

  OBD will be much more than geno-pheno
    clinical trials
    future DBPs, other NCBCs
    any data expressed in an ontology language
  Software and schema development expensive
    fragility in face of schema evolution
    development gets bogged down in data exchange issues

Major issue

  SQL and XPath work great for ‘traditional’ data…
  …but are too low level for ontology-centric data
    lack of inference
    no way to directly express ontology constraints

Use cases from previous experience: AmiGO

  GO
    “find all TF genes” (is_a closure)
    “find all gene products localised to endoplasmic reticulum” (part_of closure, over is_a)
  Our solution (AmiGO & go-sqldb)
    pre-compute transitive closure over all relations in db
    (sort of) works for GO (for now)
      refresh problem
      explosive for tangled DAGs
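The pre-computed closure idea can be sketched in a few lines. This is not the go-sqldb implementation; it is a toy version assuming a small hand-written graph and hypothetical gene names, with the usual composition rule that a path is is_a only if every step is is_a and part_of otherwise.

```python
# Toy graph for illustration only (not taken from a GO release):
# (child, relation, parent)
EDGES = [
    ("rough endoplasmic reticulum", "is_a", "endoplasmic reticulum"),
    ("rough endoplasmic reticulum lumen", "part_of", "rough endoplasmic reticulum"),
    ("endoplasmic reticulum", "part_of", "cytoplasm"),
]

# Hypothetical gene-to-term annotations.
ANNOTATIONS = [
    ("gene_a", "rough endoplasmic reticulum lumen"),
    ("gene_b", "endoplasmic reticulum"),
    ("gene_c", "nucleus"),
]


def compose(r1, r2):
    """is_a followed by is_a stays is_a; any path containing part_of is part_of."""
    return "is_a" if r1 == r2 == "is_a" else "part_of"


def closure(edges):
    """Return every (descendant, relation, ancestor) path implied by the edges."""
    closed = set(edges)
    frontier = set(edges)
    while frontier:
        new = set()
        for (a, r1, b) in frontier:
            for (b2, r2, c) in edges:
                if b == b2:
                    path = (a, compose(r1, r2), c)
                    if path not in closed:
                        new.add(path)
        closed |= new
        frontier = new
    return closed


def genes_located_in(term, annotations, closed):
    """Genes annotated to the term or to anything below it in the closure."""
    below = {a for (a, _rel, c) in closed if c == term} | {term}
    return sorted(g for g, t in annotations if t in below)


CLOSED = closure(EDGES)
print(genes_located_in("endoplasmic reticulum", ANNOTATIONS, CLOSED))
```

The closure set has to be rebuilt whenever the ontology changes, which is the refresh problem noted on the slide, and its size grows quickly when terms have many parents.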

OBD requires more ontological awareness

  Other relations
    ontogenic (eg derives_from)
    transitive_over
  Other types of data
  Pre- versus post- composed terms
    E.g. MPO versus AO+PATO
    E.g. Entity+Spatial qualifier
    queries over either should be interchangeable

Solution: more expressive formalisms

  QLs and APIs should provide and abstract away common ontology operations
    ease of programming, optimisation
  Choices
    ‘Semweb’ databases
      RDF + RDFS + OWL [ lite + DL ] + extra
      lots to choose from, emerging standards
      compatible with Obo v1.2 spec
    Deductive databases
      superset of relational databases
      from Prolog to full CL

Modeling phenotypes as RDF/OWL or Obo instances

classes/terms

instances

entity quality

Example query in SeRQL

  find mutations affecting the shape of the wing vein:

  SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
  FROM {EI} rdf:type {ET} rdfs:label {EN},
       {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
       {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
  WHERE label(EN) = "wing vein"
    AND label(TaxN) = "Arthropoda"
    AND label(QN) = "ShapeValue"

  results of query on OBD-sesame:
    one annotation to “wing vein L2”, “branched”
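To see why an annotation to "wing vein L2" and "branched" answers a query for "wing vein" and "ShapeValue", it helps to evaluate the same pattern over a handful of triples with subclass inference. The sketch below is not Sesame; it is a plain-Python stand-in, and every `ex:` identifier and the triples themselves are assumptions invented for the example.

```python
# Hypothetical triples; all "ex:" names are invented for this sketch.
TRIPLES = {
    ("ex:WingVein", "rdfs:label", "wing vein"),
    ("ex:WingVeinL2", "rdfs:label", "wing vein L2"),
    ("ex:WingVeinL2", "rdfs:subClassOf", "ex:WingVein"),
    ("ex:Shape", "rdfs:label", "ShapeValue"),
    ("ex:Branched", "rdfs:label", "branched"),
    ("ex:Branched", "rdfs:subClassOf", "ex:Shape"),
    ("ex:Arthropoda", "rdfs:label", "Arthropoda"),
    ("ex:Dmel", "rdfs:subClassOf", "ex:Arthropoda"),
    ("ex:Eye", "rdfs:label", "eye"),
    # Instance data: one wing vein and one eye, both with a branched quality.
    ("ex:fly1", "rdf:type", "ex:Dmel"),
    ("ex:vein1", "rdf:type", "ex:WingVeinL2"),
    ("ex:vein1", "OBO_REL_part_of", "ex:fly1"),
    ("ex:vein1", "OBO_REL_has_quality", "ex:q1"),
    ("ex:q1", "rdf:type", "ex:Branched"),
    ("ex:eye1", "rdf:type", "ex:Eye"),
    ("ex:eye1", "OBO_REL_part_of", "ex:fly1"),
    ("ex:eye1", "OBO_REL_has_quality", "ex:q2"),
    ("ex:q2", "rdf:type", "ex:Branched"),
}


def objects(triples, subject, predicate):
    return {o for (s, p, o) in triples if s == subject and p == predicate}


def superclasses(triples, cls):
    """cls plus everything reachable via rdfs:subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(objects(triples, c, "rdfs:subClassOf"))
    return seen


def has_type_labelled(triples, instance, label):
    """True if any asserted or inferred type of the instance carries the label."""
    return any(
        label in objects(triples, sup, "rdfs:label")
        for t in objects(triples, instance, "rdf:type")
        for sup in superclasses(triples, t)
    )


def wing_vein_shape_query(triples):
    """Entity/quality instance pairs matching the pattern of the SeRQL query."""
    results = []
    for (ei, p, qi) in sorted(triples):
        if p != "OBO_REL_has_quality":
            continue
        if (has_type_labelled(triples, ei, "wing vein")
                and has_type_labelled(triples, qi, "ShapeValue")
                and any(has_type_labelled(triples, org, "Arthropoda")
                        for org in objects(triples, ei, "OBO_REL_part_of"))):
            results.append((ei, qi))
    return results


print(wing_vein_shape_query(TRIPLES))
```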

Advantages of ‘SemWeb’ dbs

  Advantages over pure SQL
    The ontology is the model
      constraints encoded in ontology
        e.g. certain quality types only applicable to certain entity types
      agile development - fast database integration
    Rich modeling constructs
      transitivity, subsumption, intersection, etc
      powerful QLs and APIs
    More (technical) interoperation ‘for free’
      URIs
      proven?
    Open World Assumption (maybe a hindrance?)

Disadvantages of ‘SemWeb’ dbs

  Disadvantages
    speed
      may be slower than SQL
      ..but in-memory execution is fast
    lack of maturity
      new technology
      ..but has a LOT of momentum
    foundations
      are RDF triples appropriate?
      inherent difficulties modeling time
      SQL allows n-ary relations/predicates

Hybrid model

  SemWeb dbs are commonly layered over SQL DBs
  We can have the best of both worlds
  Data View layers
    mapping between Obo/OWL model and domain-specific relational schema
    (optionally) materialized for speed
    different applications use appropriate
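One way to picture the view layer is a generic triple table with a SQL view that presents it as a domain-specific entity/quality table. The sketch below uses Python's built-in `sqlite3`; the table name `triple`, the view name `phenotype_eav`, the predicate name `inheres_in`, and the instance identifiers are all assumptions for illustration, not the OBD or Chado schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generic storage: one row per triple (hypothetical schema).
CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);

-- Domain-specific view: which quality type inheres in which entity type.
CREATE VIEW phenotype_eav AS
  SELECT inh.object AS entity_instance,
         et.object  AS entity_type,
         qt.object  AS quality_type
  FROM triple AS inh
  JOIN triple AS qt ON qt.subject = inh.subject AND qt.predicate = 'rdf:type'
  JOIN triple AS et ON et.subject = inh.object  AND et.predicate = 'rdf:type'
  WHERE inh.predicate = 'inheres_in';
""")

# One annotation, using the brain/fused term pair from the phenotype slide.
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("q1", "inheres_in", "e1"),
        ("q1", "rdf:type", "PATO:0000642"),
        ("e1", "rdf:type", "FBbt:00005095"),
    ],
)

view_rows = conn.execute(
    "SELECT entity_type, quality_type FROM phenotype_eav"
).fetchall()
print(view_rows)
```

An application that only needs entity/quality pairs can query the view with ordinary SQL, while ontology-aware tools work against the triples; materializing the view would trade storage for speed.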

Current progress: OBD-Sesame

  Sesame
    open source ‘triple store’
    based on Jena
      also used in Protégé-OWL
    storage layer options
      mysql/postgresql generic schema
      in-memory
      disk-based

OBD in Sesame: current datasets

  Pheno
    ZFIN & FB: EAV trial 2003 data
    Test ortholog set
    FB ‘simple phenotype’ alleles
    ZFIN legacy phenotype data, automatically parsed to EAV
    Ontologies: AOs, PATO, Cell, GO
    Method
      excel & flatfiles -> pheno-xml -> owl
      OWL from http://www.fruitfly.org/~cjm/obo-download
  Trialbank
    Method: ocelot -> obo-xml -> owl
  Soon
    human orthologs and omim

Technology Evaluation: Sesame

  Use case query set
  Benchmarks
    preliminary conclusions
      SQL layering is terrible
      in-memory is fast
    optimisations?
    other triple stores?
    up to date results on wiki
      http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
  Need to test OWL-DL entailment
  Bigger dataset required for full evaluations
  Community effort: pub-semweb-lifesci list

Parallel development: an OBD Prototype

  Initiated prior to OBD-Sesame
  Simple deductive database
    prolog-based
    chado-like schema
      can be views on Obo/OWL predicates
    amigo-clone user interface
  Rapid prototyping
  Current dataset
    as obd-sesame, plus CT
    trivial to drop in more

Example logic query

  find mutations affecting the shape of some part of the head capsule:

  inheres(QI,EI) &
  inst(QI,QT) &
  label(QT,shape) &
  inst(EI,ETP) &
  part_of*(ETP,ET) &
  label(ET,'head capsule')

  results of query on OBD-prolog:
    one annotation to “arista lateral”, “irregular shape”
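The conjunction above can be read as a join over fact tables, with part_of* meaning "zero or more part_of steps". The sketch below evaluates it in plain Python rather than Prolog; the facts are toy data invented for the example (the class names and the arista/antenna/head capsule chain are assumptions), and `inst` covers asserted types only.

```python
# Toy facts, invented for illustration.
INHERES = {("q1", "e1"), ("q2", "e2")}            # inheres(QualityInst, EntityInst)
INST = {("q1", "Shape"), ("e1", "Arista"),        # inst(Instance, Class)
        ("q2", "Shape"), ("e2", "WingVein")}
LABEL = {("Shape", "shape"), ("Arista", "arista"), ("Antenna", "antenna"),
         ("HeadCapsule", "head capsule"), ("WingVein", "wing vein"),
         ("Wing", "wing")}
PART_OF = {("Arista", "Antenna"), ("Antenna", "HeadCapsule"),
           ("WingVein", "Wing")}


def part_of_star(cls):
    """Reflexive transitive closure: cls itself plus every whole it is part of."""
    reached, stack = set(), [cls]
    while stack:
        node = stack.pop()
        if node not in reached:
            reached.add(node)
            stack.extend(whole for (part, whole) in PART_OF if part == node)
    return reached


def shape_of_head_capsule_parts():
    """(QI, EI) pairs satisfying every conjunct of the logic query."""
    return sorted(
        (qi, ei)
        for (qi, ei) in INHERES
        for (qi2, qt) in INST if qi2 == qi and (qt, "shape") in LABEL
        for (ei2, etp) in INST if ei2 == ei
        for et in part_of_star(etp) if (et, "head capsule") in LABEL
    )


print(shape_of_head_capsule_parts())
```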

OBD TODO

  Pheno-xml
    finalise release version
    finalise Obo/OWL mapping
    logic specification
  Data
    orthologies
  OBD - BioPortal integration
    how will it work?
  Versioning and reconciling changes
    decide on ontology versioning first

OBD dependencies

  PATO development
  UMLS into OBO-site
  Ontologies
    FMA accessibility?
    species-centric AO alignments (XSPAN?)
    Sept meeting on AO development
    Nov meeting on disease ontologies
  Data
    MOD pheno annotation
    OMIM annotation
  Bioportal
  NLP for phenote
    Obol trial on evolutionary phenotype characters
    cambridge NLP project
    can be used to ‘prime’ phenote
  Decomposing MPO
    pink fur def= fur, has_quality: pink
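Decomposition is what would make queries over pre-composed and post-composed annotations interchangeable. A minimal sketch, assuming a lookup table from pre-composed term IDs to entity/quality pairs: the single mapping entry follows the "pink fur" definition on the slide, while the allele names and function names are hypothetical.

```python
# Pre-composed term -> (entity, quality), following the slide's
# "pink fur def= fur, has_quality: pink".
DECOMPOSITION = {"MPO:0000374": ("fur", "pink")}

# Hypothetical annotations: one pre-composed, one post-composed, one unrelated.
SUBJECT_ANNOTATIONS = [
    ("allele_1", "MPO:0000374"),
    ("allele_2", ("fur", "pink")),
    ("allele_3", ("tail", "short")),
]


def normalise(annotation):
    """Return an (entity, quality) pair for either style of annotation."""
    if isinstance(annotation, str):
        return DECOMPOSITION[annotation]
    return annotation


def find_subjects(annotations, entity, quality):
    """Subjects whose annotation matches, whichever way it was composed."""
    return sorted(s for s, a in annotations if normalise(a) == (entity, quality))


print(find_subjects(SUBJECT_ANNOTATIONS, "fur", "pink"))
```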

Discussion

  Will SemWeb dbs work?
    experiment
  Ontology-based modeling
    the ontology is the model
    importance of relations ontology
    upper ontology

  http://yuri.lbl.gov/amigo/ct
  http://yuri.lbl.gov/amigo/obd
  http://spade.lbl.gov:8080/sesame/actionFrameset.jsp?repository=mem-rdfs-db
