NeXML: A future data exchange standard for phylogenetics
Rutger Vos, University of British Columbia
Introduction (1/7) The problem
Increased automation in evolutionary informatics is hampered by poorly defined “standards”
Introduction: The problem, EvoInfo interests, This subproject, Nexus issues, Parsing, Extensibility, XML goodies
Design: Principles, Re-use, Patterns, Inheritance, References
Implementation: Approach, ERD, Inheritance, Anatomy, Characters, Trees
Current status: Schema blocks, Parsers & writers, Experiments, To do
Resources
Introduction (2/7) EvoInfo interests
Addressing interoperability problems by coding our way out of it
• Syntax: NeXML
• Semantics: CDAO
• Transport: PhyloWS
Introduction (3/7) This subproject’s mission
To create a file format like nexus*, but:
• Fix (some) problems with nexus
• Give access to data at a higher level
• Be extensible
• Expose data to xml goodies
*Maddison, Swofford and Maddison, 1997. NEXUS: An Extensible File Format for Systematic Information. Syst. Biol. 46(4):590-621
Introduction (4/7) Nexus issues
• No explicit versions: nothing ever deprecated
• No public extensions: leads to hacks such as ‘mixed’ data and ‘hot comments’; phylogenetics post-’80s in private blocks
• Hard/impossible to validate
See https://www.nescent.org/wg_evoinfo/NEXUS_Problems
Introduction (5/7) Parsing plain text versus parsing XML
• Processing nexus data involves lexing + parsing + processing
• XML allows choosing an off-the-shelf parser library, so data can be processed as a structure that hides tokenization issues
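As a rough illustration of the second point, the sketch below reads a small XML fragment with Python's standard xml.etree.ElementTree parser. The fragment and the label attribute values are made up for this example and are not taken from the NeXML schema; the only point is that the parser returns a tree, so no lexer has to be written.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration; not a valid NeXML document.
DOC = """
<otus id="taxa1">
  <otu id="t1" label="Homo sapiens"/>
  <otu id="t2" label="Pan troglodytes"/>
</otus>
"""


def otu_labels(xml_text):
    """Return {id: label} for every otu element in the parsed tree."""
    root = ET.fromstring(xml_text.strip())
    # The parser has already handled quoting, whitespace and tokenization.
    return {otu.get("id"): otu.get("label") for otu in root.iter("otu")}


labels = otu_labels(DOC)
print(labels)
```

The same data in a plain-text format would need a hand-written tokenizer before any of this structure became available.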
Introduction (6/7) Extensibility
An extensible file format should provide the ability to:
• Define new data types that implement described ‘interfaces’
• Attach typed data structures to core types
• Attach custom XML
Introduction (7/7) XML goodies
Large stack of off-the-shelf tools:
• XML parser libraries
• Web service toolkits
• Native XML databases
• Editors / IDEs
• Serialization / data binding tools
Design (1/5) Design principles
• Re-use of prior art
• Follow design patterns
• Referencing
• Verbose and compact representations
Design (2/5) Re-use of prior art
• Generic key/value attachments following Apple’s plist semantics:
<dict>
  <key>prior</key>
  <float>0.78</float>
</dict>
• Trees and networks following GraphML
• General file structure following nexus concepts, i.e. blocks that reference each other
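A minimal sketch of how a consumer might turn such a plist-style dict attachment into a native mapping, using Python's standard library. The set of value element names handled here (float, integer, string) is an assumption for illustration; the slides only show float and string.

```python
import xml.etree.ElementTree as ET

ANNOTATION = "<dict><key>prior</key><float>0.78</float></dict>"

# Converters keyed by value element name (assumed set, for illustration).
CONVERTERS = {"float": float, "integer": int, "string": str}


def dict_to_mapping(xml_text):
    """Convert alternating key/value children of a dict element to a dict."""
    children = list(ET.fromstring(xml_text))
    if len(children) % 2 != 0:
        raise ValueError("dict element must hold key/value pairs")
    result = {}
    # Children alternate: key, value, key, value, ...
    for key_el, val_el in zip(children[0::2], children[1::2]):
        if key_el.tag != "key":
            raise ValueError(f"expected <key>, found <{key_el.tag}>")
        convert = CONVERTERS.get(val_el.tag, str)
        result[key_el.text] = convert(val_el.text)
    return result


annotations = dict_to_mapping(ANNOTATION)
print(annotations)
```

Because the value element names its own type, the reader does not have to guess whether 0.78 is a number or a string.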
Design (3/5) XML design patterns
• “Declare before use”
• “Metadata first”
• “Venetian blinds”
• Abstract inheritance through extension, concrete inheritance through restriction
Design (4/5) Inheritance
IDTagged (required id attribute)
Labelled (optional label attribute)
Annotated (optional dict elements)
Base (optional base/lang/href attributes)
AbstractElement (in root schema)
ConcreteElement (in instance document)
[Diagram: the types above are connected by four “extends” arrows and one “restricts” arrow]
Design (5/5) Referencing
• Elements sometimes refer to other elements, much like in nexus
• In nexml, an element refers to another element by its id, using an attribute named after the referenced element:
<otu id="t1"/>
<!-- referenced later: -->
<node id="n1" otu="t1"/>
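The referencing convention can be sketched as a simple id lookup. The wrapper element named example, the extra otu and node elements, and the choice to let a node omit the otu attribute are assumptions added for illustration; only the otu/node pair with ids t1 and n1 comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper and extra elements; not a valid NeXML document.
DOC = """
<example>
  <otu id="t1"/>
  <otu id="t2"/>
  <node id="n1" otu="t1"/>
  <node id="n2" otu="t2"/>
  <node id="n3"/>
</example>
"""


def resolve_otu_refs(xml_text):
    """Map each node id to the id of the otu element it references."""
    root = ET.fromstring(xml_text.strip())
    otus_by_id = {otu.get("id"): otu for otu in root.iter("otu")}
    resolved = {}
    for node in root.iter("node"):
        ref = node.get("otu")
        if ref is None:
            continue  # assumption: a node may carry no otu reference
        if ref not in otus_by_id:
            raise KeyError(f"node {node.get('id')} refers to unknown otu {ref}")
        resolved[node.get("id")] = otus_by_id[ref].get("id")
    return resolved


links = resolve_otu_refs(DOC)
print(links)
```

A dangling reference raises an error here, which is the kind of check that is hard to express for free-form nexus text.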
Implementation (1/6) Approach
• Schema design
• Community feedback through wiki, email, telecon, projects (evoinfo, ppod, MIAPA) etc.
• Processor development (Perl, Java, Python, C++, VB, JavaScript) in parallel
• Experiments with xml tools (ws, db, data binding tools)
Implementation (2/6) Entity relationships
Implementation (3/6) Inheritance tree for elements
Implementation (4/6) Anatomy of a “block”
<characters id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <dict><key>desc</key><string>description…</string></dict>
  Contents…
</characters>
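A hedged sketch of how a reader might inspect such a block: the concrete subtype is carried by the xsi:type attribute and the otus attribute points back to a taxa block. The xsi namespace URI is the standard XML Schema instance namespace; the URI bound to the nex prefix below is a placeholder chosen for this example, and the description text is made up.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The nex namespace URI is a placeholder, not the real NeXML namespace.
BLOCK = """
<characters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns:nex="http://example.org/nexml-placeholder"
            id="c1" xsi:type="nex:DnaSeqs" otus="t1">
  <dict><key>desc</key><string>a description</string></dict>
</characters>
"""


def describe_block(xml_text):
    """Return the id, concrete subtype and otus reference of a block."""
    block = ET.fromstring(xml_text.strip())
    # ElementTree exposes namespaced attributes as "{uri}localname".
    qualified_type = block.get(f"{{{XSI}}}type", "")
    # The attribute value keeps its prefix as plain text, e.g. "nex:DnaSeqs".
    local_type = qualified_type.split(":", 1)[-1]
    return {"id": block.get("id"), "type": local_type, "otus": block.get("otus")}


info = describe_block(BLOCK)
print(info)
```

A processor could dispatch on the type value to choose between, for example, a sequence reader and a cell-by-cell reader.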
Implementation (5/6) Character classes
Data type     Cells              Sequence
DNA           DnaCells           DnaSeqs
RNA           RnaCells           RnaSeqs
Protein       ProteinCells       ProteinSeqs
Standard      StandardCells      StandardSeqs
Continuous    ContinuousCells    ContinuousSeqs
Restriction   RestrictionCells   RestrictionSeqs
Implementation (6/6) Tree classes
Structure   Int          Float
Tree        IntTree      FloatTree
Network     IntNetwork   FloatNetwork
Current status (1/4) Schema blocks
• Done:
  o OTUs
  o characters: dna, rna, nucleotide, protein, categorical, continuous, restriction (compact and verbose)
  o trees: graphml trees and networks, various edge formats and rootings
Current status (2/4) Parsers and writers
Nexml parsers and writers:
• mesquite (java NeXML class libraries)
• Bio::Phylo (BioPerl compatible)
• pyNexml (python)
• DAMBE (Visual Basic)
• NCL (C++)
• JavaScript
Current status (3/4) Experiments
• Semantic annotation (CDAO) using SAWSDL
• Scalability:
  o Indexed files in dbxml
  o Created large files from tolweb, rbcl
  o XInclude with tinyseq xml
• REST Web services:
  o ToL service
  o validation service
  o nexml2json, nexus2xml
  o Schema inclusion in wsdl
Current status (4/4) To do
• Publish standard
• More restricted vocabulary attachments (e.g. Darwin core, CDAO-mediated terms)
• Substitution model descriptions
• Sets (in progress, using class identifiers)
• Distances
• Splits
Resources
NeXML base URL: http://www.nexml.org
• Wiki: /wiki
• Mailing list: /mail
• Issue tracker: /tracker
• SVN repository: /code
EvoInfo: http://evoinfo.nescent.org
CDAO: http://www.evolutionaryontology.org
Acknowledgements
• Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran, Xuhua Xia
• Feedback: wg-evoinfo, pPOD, Wayne Maddison, David Maddison
• Additional funding, support: NESCent, GSoC