Bio solr building a better search for bioinformatics

Tom Winch & Matt Pearce21st April 2015

[email protected]/blog+44 (0) 8700 118334Twitter: @FlaxSearch

BioSolr

building a better search for bioinformatics

The European Bioinformatics Institute

Part of the European Molecular Biology Laboratory

Based on the Wellcome Genome Campus in Hinxton, Cambridge

Maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, serving millions of researchers – indexing over 1 billion items

BioSolr project involves two teams from EMBL-EBI:Protein Data Bank in Europe (PDBe)Samples, Phenotypes and Ontologies (SPOt)

The genesis of BioSolr

Grant Ingersoll visits the Wellcome Campus in July '13

Around 90 people attend

Show of hands indicates 75% using Lucene/Solr

Sameer Velankar of EMBL-EBI identifies grant funding

Flax and EMBL-EBI apply successfully to the BBSRC

BioSolr One year BBSRC funded project from September 2014

“to significantly advance the state of the art withregard to indexing and querying biomedical data with freely available open source software”

Outputs:– Workshops– Papers & presentations– Software (Open source of course!)– Documentation

Inputs: from the PDBe & SPOt teams

BioSolr

Tom Winch – Working on site with Sameer Velankar & the PDBe team – Facet.contains & Xjoin

Matt Pearce– Working on site with Tony Burdett & the SPOt team– Indexing ontologies

BioSolr & PDBe - Introduction

Protein Data Bank (PDBe)

facet.contains – autosuggest

https://issues.apache.org/jira/browse/SOLR-1387

In Solr 5.1

DNA sequence similarity

BioSolr & PDBe – Xjoin concepts

The problem - sequences come from a live source

Joining with data from an external source

Custom SOLR code

BioSolr & PDBe – Solr classes

XJoinResultsFactory, XJoinResults

XJoinSearchComponent

XJoinQParserPlugin

XJoinValueSourceParser

BioSolr & PDBe – What next?

SOLR contrib – SOLR-7341

https://issues.apache.org/jira/browse/SOLR-7341

Joining from multiple external sources

Federated search

Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5

BioSolr & SPOt – Indexing Ontologies

Indexing Ontologies - the problem

You have a collection of documents annotated with ontology references.

You want to search both the documents and the associated ontology data.

This may include associated nodes – “has location”, “is part of”, etc.

Faceting by ontology reference would be nice!

Approach 1

– Keep the data separate

documents

Documents

Indexer

Documents

Indexer

ontology

Ontology

Indexer

Approach 1 - steps

Index the documents, with the node annotations, but no further detail.

Index the ontology in its own core.

Search the documents, then cross-match against the ontology.

BUT - Requires multiple calls, doesn't allow searching both cores at the same time.

Approach 2

• Add some ontology data to your documents.

Documents

Indexer Ontology

documents

Approach 2 – step 1

Index node references, plus their labels and synonyms.

Easier to include the ontology references in your search.

Can boost fields over others.

Approach 2 – step 2

Expand the ontology data being stored.

Include single (or multi)-level parent and child nodes, with labels.

Use dynamic fields to store additional relationships.

Dynamic fields allow searches across specific relation types.

BUT Requires some additional Solr look-ups to be fully dynamic.

Approach 2 – search screen

Approach 2 – search results

Approach 3

Search the ontology, and cross-match with documents.

Allow SPARQL queries over the ontology index.

SPARQL is a semantic query language

Approach 3

Adding Apache Jena

To allow SPARQL queries, we use Apache Jena to provide TDB-querying.

Jena uses Solr to search label fields.

Uses its own Triple Store for other fields.

Need to include reference URI in returned fields.

Integrating Jena results

Returned Jena data needs to be cross-matched against document index.

Use a filter query to choose the matching documents.

Integrating Jena results

Summary so far

We can search documents and ontology data with a single call to Solr.

We can dynamically search over additional related ontology nodes.

We can use SPARQL to search.

Can facet on individual ontology annotations...but we still can't present the facets in a tree.

https://github.com/flaxsearch/BioSolr/tree/master/spot

The ultimate goal

A generic ontology indexer using Solr.

Multiple ontologies stored in the same index.

Unique integer keys for each node, allowing cross-matching from document indexes.

Optional customisation, allowing for additional lookups or data manipulation.

BioSolr conclusions

Final workshop at EMBL-EBI in September

https://github.com/flaxsearch/BioSolr

Investigating funding to continue the project– We have some ideas around federated Solr search...

Thankyou!

Any questions?

[email protected]/blog+44 (0) 8700 118334Twitter: @FlaxSearch

Technology

Bio solr building a better search for bioinformatics