27
Tom Winch & Matt Pearce 21 st April 2015 [email protected] www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @FlaxSearch BioSolr building a better search for bioinformatics

Bio solr building a better search for bioinformatics

Embed Size (px)

Citation preview

Page 1: Bio solr   building a better search for bioinformatics

Tom Winch & Matt Pearce21st April 2015

[email protected]/blog+44 (0) 8700 118334Twitter: @FlaxSearch

BioSolr

building a better search for bioinformatics

Page 2: Bio solr   building a better search for bioinformatics

The European Bioinformatics Institute

Part of the European Molecular Biology Laboratory

Based on the Wellcome Genome Campus in Hinxton, Cambridge

Maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, serving millions of researchers – indexing over 1 billion items

BioSolr project involves two teams from EMBL-EBI:Protein Data Bank in Europe (PDBe)Samples, Phenotypes and Ontologies (SPOt)

Page 3: Bio solr   building a better search for bioinformatics

The genesis of BioSolr

Grant Ingersoll visits the Wellcome Campus in July '13

Around 90 people attend

Show of hands indicates 75% using Lucene/Solr

Sameer Velankar of EMBL-EBI identifies grant funding

Flax and EMBL-EBI apply successfully to the BBSRC

Page 4: Bio solr   building a better search for bioinformatics

BioSolr One year BBSRC funded project from September 2014

“to significantly advance the state of the art withregard to indexing and querying biomedical data with freely available open source software”

Outputs:– Workshops– Papers & presentations– Software (Open source of course!)– Documentation

Inputs: from the PDBe & SPOt teams

Page 5: Bio solr   building a better search for bioinformatics

BioSolr

Tom Winch – Working on site with Sameer Velankar & the PDBe team – Facet.contains & Xjoin

Matt Pearce– Working on site with Tony Burdett & the SPOt team– Indexing ontologies

Page 6: Bio solr   building a better search for bioinformatics

BioSolr & PDBe - Introduction

Protein Data Bank (PDBe)

facet.contains – autosuggest

https://issues.apache.org/jira/browse/SOLR-1387

In Solr 5.1

DNA sequence similarity

Page 7: Bio solr   building a better search for bioinformatics

BioSolr & PDBe – Xjoin concepts

The problem - sequences come from a live source

Joining with data from an external source

Custom SOLR code

Page 8: Bio solr   building a better search for bioinformatics

BioSolr & PDBe – Solr classes

XJoinResultsFactory, XJoinResults

XJoinSearchComponent

XJoinQParserPlugin

XJoinValueSourceParser

Page 9: Bio solr   building a better search for bioinformatics

BioSolr & PDBe – What next?

SOLR contrib – SOLR-7341

https://issues.apache.org/jira/browse/SOLR-7341

Joining from multiple external sources

Federated search

Page 10: Bio solr   building a better search for bioinformatics

Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5

BioSolr & SPOt – Indexing Ontologies

Page 11: Bio solr   building a better search for bioinformatics

Indexing Ontologies - the problem

You have a collection of documents annotated with ontology references.

You want to search both the documents and the associated ontology data.

This may include associated nodes – “has location”, “is part of”, etc.

Faceting by ontology reference would be nice!

Page 12: Bio solr   building a better search for bioinformatics

Approach 1

– Keep the data separate

documents

Documents

Indexer

Documents

Indexer

ontology

Ontology

Indexer

Page 13: Bio solr   building a better search for bioinformatics

Approach 1 - steps

Index the documents, with the node annotations, but no further detail.

Index the ontology in its own core.

Search the documents, then cross-match against the ontology.

BUT - Requires multiple calls, doesn't allow searching both cores at the same time.

Page 14: Bio solr   building a better search for bioinformatics

Approach 2

• Add some ontology data to your documents.

Documents

Indexer Ontology

documents

Page 15: Bio solr   building a better search for bioinformatics

Approach 2 – step 1

Index node references, plus their labels and synonyms.

Easier to include the ontology references in your search.

Can boost fields over others.

Page 16: Bio solr   building a better search for bioinformatics

Approach 2 – step 2

Expand the ontology data being stored.

Include single (or multi)-level parent and child nodes, with labels.

Use dynamic fields to store additional relationships.

Dynamic fields allow searches across specific relation types.

BUT Requires some additional Solr look-ups to be fully dynamic.

Page 17: Bio solr   building a better search for bioinformatics

Approach 2 – search screen

Page 18: Bio solr   building a better search for bioinformatics

Approach 2 – search results

Page 19: Bio solr   building a better search for bioinformatics

Approach 3

Search the ontology, and cross-match with documents.

Allow SPARQL queries over the ontology index.

SPARQL is a semantic query language

Page 20: Bio solr   building a better search for bioinformatics

Approach 3

Page 21: Bio solr   building a better search for bioinformatics

Adding Apache Jena

To allow SPARQL queries, we use Apache Jena to provide TDB-querying.

Jena uses Solr to search label fields.

Uses its own Triple Store for other fields.

Need to include reference URI in returned fields.

Page 22: Bio solr   building a better search for bioinformatics

Integrating Jena results

Returned Jena data needs to be cross-matched against document index.

Use a filter query to choose the matching documents.

Page 23: Bio solr   building a better search for bioinformatics

Integrating Jena results

Page 24: Bio solr   building a better search for bioinformatics

Summary so far

We can search documents and ontology data with a single call to Solr.

We can dynamically search over additional related ontology nodes.

We can use SPARQL to search.

Can facet on individual ontology annotations...but we still can't present the facets in a tree.

https://github.com/flaxsearch/BioSolr/tree/master/spot

Page 25: Bio solr   building a better search for bioinformatics

The ultimate goal

A generic ontology indexer using Solr.

Multiple ontologies stored in the same index.

Unique integer keys for each node, allowing cross-matching from document indexes.

Optional customisation, allowing for additional lookups or data manipulation.

Page 26: Bio solr   building a better search for bioinformatics

BioSolr conclusions

Final workshop at EMBL-EBI in September

https://github.com/flaxsearch/BioSolr

Investigating funding to continue the project– We have some ideas around federated Solr search...

Page 27: Bio solr   building a better search for bioinformatics

Thankyou!

Any questions?

[email protected]/blog+44 (0) 8700 118334Twitter: @FlaxSearch