DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps

Building an open-source based search solution – first steps

Roman Kern

Institute of Knowledge Management
Graz University of Technology

Know-Center

Data Science Meetup / 2012-04-12

Overview

Motivation

Background

Solr Ecosystem

Solr Features

Conclusions


Motivation

Search
- Change in users' expectations
- Missing or sub-optimal search causes frustration

Science
- Information retrieval
- Success story
- Mostly focused on web search

Industry
- Enterprise search
- Heterogeneous data sources


Background of the Speaker

http://a1.net

http://wissen.de


Apache Lucene Umbrella Project

Components
- Search engine ⇒ Lucene
- Search server ⇒ Solr
- Web search engine ⇒ Nutch
- Lightweight crawler ⇒ Droids
- File-format parsing ⇒ Tika
- Communicate with CMS ⇒ ManifoldCF
- Distributed coordination ⇒ ZooKeeper
- Natural language processing ⇒ OpenNLP
- Related projects: Hadoop, Mahout, Carrot2, ...

Common aspects

Apache license, implemented in Java, community


Lucene

Search Engine Library
- Java API
  - Only for expert users
- Search index
  - File-system
  - In-memory index
- Advanced features
  - Incremental indexing
  - Update while searching
- Base for many projects
  - Solr
  - ir-lib
  - elasticsearch
- LIA (Lucene in Action)

http://lucene.apache.org/core/


Nutch

Web search engine
- Builds upon Solr
- Web crawler
  - Link database, crawl database
- Distributed
  - Runs on Hadoop
- Mode of operation
  - Crawl a single domain
  - Crawl the web with seed sites

http://nutch.apache.org/


Droids

Crawler component
- Lightweight crawler
- Main features
  - Throttling
  - Multi-threaded
  - Well behaved (robots.txt)

http://incubator.apache.org/droids/


Tika

Text extraction
- Text & meta-data
- File-formats
  - Office
    - Microsoft formats (Apache POI)
    - OpenDocument
  - Common text formats
    - PDF (PDFBox)
    - HTML (tagsoup)
  - Non-text
    - Images
    - Sound

http://tika.apache.org/


ManifoldCF

Content Management System Connectors
- Communicate with CMS/DMS
- Connectors
  - FileNet P8 (IBM)
  - Documentum (EMC)
  - LiveLink (OpenText)
  - Meridio (Autonomy)
  - Windows shares (Microsoft)
  - SharePoint (Microsoft)
  - More: Alfresco, JDBC, ...
- Data is then stored and indexed
  - e.g. Solr

http://incubator.apache.org/connectors/


ZooKeeper

Distributed coordination
- Orchestrate servers
- Distributed
  - Configuration
  - Name lookup
  - Synchronization

http://zookeeper.apache.org/


OpenNLP

Natural language processing
- Process plain text
- Maximum entropy classification with beam search
- Models
  - Sentence splitting
  - Token splitting
  - Part-of-speech (POS) tagging
  - Named entity recognition
  - More: chunker, parser, co-reference resolution

http://opennlp.sourceforge.net/


Hadoop

Distributed computing
- Scale-out framework
- Distributed file-system
  - Data is partitioned
  - Stored on multiple nodes
- Map/Reduce paradigm
  - Map your algorithms to mappers & reducers

Related projects: HBase, Pig, Hive, ...

http://hadoop.apache.org/
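
To make the Map/Reduce paradigm concrete, here is a minimal single-process sketch of the classic word count. It does not use Hadoop at all; the names `mapper`, `reducer` and `run_job` are made up for this illustration, and the in-memory grouping only stands in for the shuffle phase that a real cluster would perform across nodes.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Combine all counts that were emitted for the same word."""
    return word, sum(counts)

def run_job(lines):
    """Toy driver: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

word_counts = run_job(["open source search", "open data"])
print(word_counts["open"])
```

Because each mapper call only sees one line and each reducer call only sees one key, both steps could in principle run on different machines, which is the point of the paradigm.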


Mahout

Distributed machine learning
- Scale-out framework
- Machine learning
  - Recommender systems
  - Clustering
  - Classification
- Integration
  - Standalone
  - Hadoop
  - Amazon EC2

http://mahout.apache.org/


Details


Search Server

What Solr is
- Web service
- Full-text indexing & search
- Support for storing arbitrary content

What Solr isn't
- Solr ≠ grep
- A database
  - But somewhat similar to NoSQL databases

Solr vs. IR-Lib
- Solr: easy to use, easy to integrate, XML configuration
- IR-Lib: expert knowledge to use, Java configuration, fast


Index Structure

Inverted Index
- Dictionary of words (terms)
- Map from term to documents

Document
- List of fields
- Input fields are then mapped according to the schema

Field-types
- Defined in the schema
- Type (string, boolean, date, number), internally mapped to string
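
As an illustration of the inverted index idea, the following sketch builds a dictionary of terms that maps each term to the ids of the documents containing it. It is a toy model and not how Lucene stores its index; the function name and the sample documents are invented for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Sample documents, keyed by a unique id.
docs = {
    1: "open source search",
    2: "search server",
    3: "open data",
}
index = build_inverted_index(docs)
print(index["search"])  # ids of all documents containing "search"
```

Looking up a term is a single dictionary access, which is why search over an inverted index does not need to scan the documents themselves.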


Index Management

API
- HTTP Server
- Various formats (XML, binary, JavaScript, ...)

Document life-cycle
- There is no update
- Delete (done automatically by Solr)
- Insert
- Implications
  - A unique id is necessary
  - Use batch updates
- Commit, rollback (and optimize)
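
The document life-cycle can be pictured with a small in-memory model. This is only a sketch of the semantics listed above (no in-place update, replacement keyed on a unique id, changes visible after commit); the class `TinyIndex` and its methods are hypothetical and do not mirror any Solr API.

```python
class TinyIndex:
    """Toy document store: adding a document with an existing id replaces it."""

    def __init__(self):
        self.committed = {}  # documents visible to searches
        self.pending = {}    # added documents waiting for a commit

    def add(self, doc):
        # An "update" is a delete of the old version followed by an insert;
        # keying on the unique id makes the delete implicit.
        self.pending[doc["id"]] = doc

    def add_batch(self, docs):
        # Batching many adds before one commit avoids a commit per document.
        for doc in docs:
            self.add(doc)

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()

idx = TinyIndex()
idx.add_batch([{"id": "1", "title": "old"}, {"id": "2", "title": "other"}])
idx.commit()
idx.add({"id": "1", "title": "new"})
idx.rollback()                      # the replacement is discarded
title_after_rollback = idx.committed["1"]["title"]
idx.add({"id": "1", "title": "new"})
idx.commit()                        # now the old version is replaced
title_after_commit = idx.committed["1"]["title"]
```

Without a unique id there would be nothing to key the replacement on, and re-adding a changed document would simply create a duplicate.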


Input Handling

Different input formats
- XML
- CSV
- JDBC (database)
  - DIH (data import handler)
  - Supports incremental updates (via timestamps)
- Solr Cell
  - Binary content
  - Apache Tika
  - Text content and metadata
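
For the XML input format, a sketch of how an add message could be assembled with the Python standard library. The `<add><doc><field name="...">` layout follows Solr's XML update format; the field names `id` and `title` are assumptions and would have to match the schema of the target index. Sending the payload to a server is left out on purpose.

```python
import xml.etree.ElementTree as ET

def to_add_xml(docs):
    """Serialize a list of dicts into a Solr-style <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Field names are hypothetical and must exist in the schema.
xml_payload = to_add_xml([{"id": "1", "title": "Lucene in Action"}])
print(xml_payload)
```

Putting several `<doc>` elements into one `<add>` message is one way to follow the batch-update advice.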


Text Processing

Scope
- During indexing & query

Tokenization
- Split text into tokens
- Lower-casing
- Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...)
- Synonyms (via thesaurus)
- Stop-word filtering
- Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
- n-grams, soundex, umlauts
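
A few of these steps can be chained in a short sketch: tokenization, multi-word splitting, lower-casing and stop-word filtering. Stemming, synonyms and the other filters are left out. The stop-word list and the function name `analyze` are made up for this example; in Solr the chain would be configured per field type in the schema rather than written by hand.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative list only

def analyze(text):
    """Turn raw text into index terms using a small, fixed filter chain."""
    terms = []
    for token in re.findall(r"[\w-]+", text):    # split text into tokens
        for part in token.split("-"):            # multi-word splitting
            term = part.lower()                  # lower-casing
            if term and term not in STOP_WORDS:  # stop-word filtering
                terms.append(term)
    return terms

terms = analyze("The Wi-Fi of Graz")
print(terms)
```

The same function has to be applied to documents at indexing time and to queries at search time; otherwise a query for "wi-fi" would not match the indexed terms.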


Query Processing

Query parsers
- Lucene query parser (rich syntax)
  - AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
  - Boosting of individual parts
  - Example: ((boltzmann OR schroedinger) NOT einstein)
- Dismax query parser
  - No query syntax
  - Searches over multiple fields (separate boost for each field)
  - Configure the number of terms that must match
  - Distance between terms is used for ranking (phrase boosting)

Dismax is a good starting point, but may become expensive
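
The Dismax options above correspond to request parameters. The sketch below only builds a query string and sends nothing; `defType`, `qf`, `pf` and `mm` are Solr's Dismax parameter names, while the field names `title` and `body`, the boost and the `mm` value are assumptions chosen for illustration.

```python
from urllib.parse import urlencode, parse_qs

def dismax_params(user_query):
    """Request parameters for a Dismax search over two hypothetical fields."""
    return {
        "q": user_query,          # plain user input, no query syntax needed
        "defType": "dismax",      # select the Dismax query parser
        "qf": "title^2.0 body",   # fields to search, with a boost per field
        "pf": "title",            # phrase boosting: reward terms close together
        "mm": "2",                # number of terms that must match
    }

query_string = urlencode(dismax_params("boltzmann schroedinger"))
print(query_string)
```

The string would be appended to the select URL of a Solr instance. Every additional field in `qf` and `pf` adds work per query, which is one reason Dismax can become expensive.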


Search Features

Query filter
- Additional query
- No impact on ranking
- Results are cached

Boosting query
- Only in Dismax

Query elevation
- Fix certain queries

Request handler
- Pre-define clauses
- Invariants

Function queries
- Score is computed on field values
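
The query filter behaviour from this slide (restricts the result set, does not change the score, is cached) can be imitated in a toy model. Nothing here is Solr code: the term-frequency score, the cache layout and all names are invented, and the cache assumes the document collection does not change between calls.

```python
filter_cache = {}

def filter_ids(docs, field, value):
    """Return ids matching field == value, caching the set per (field, value)."""
    key = (field, value)
    if key not in filter_cache:
        filter_cache[key] = {d["id"] for d in docs if d.get(field) == value}
    return filter_cache[key]

def search(docs, term, filter_field, filter_value):
    """Rank by the main query only; the filter just removes documents."""
    allowed = filter_ids(docs, filter_field, filter_value)
    hits = []
    for d in docs:
        score = d["text"].lower().split().count(term)  # toy score: term frequency
        if score > 0 and d["id"] in allowed:
            hits.append((d["id"], score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

docs = [
    {"id": 1, "lang": "en", "text": "search server search"},
    {"id": 2, "lang": "de", "text": "search"},
    {"id": 3, "lang": "en", "text": "open search"},
]
results = search(docs, "search", "lang", "en")
print(results)
```

A second search with the same filter but a different term reuses the cached id set, which is why frequently repeated restrictions are good candidates for filters.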


Search Result

Ranking
- Relevance
- Sort on field value (only single term per document)

Available data & features
- Sequence of IDs & score
- Stored fields
- Snippets (plus highlighting)
- Facets
  - Count the search hits
  - Types: field value, dates, queries
  - Sort, prefix, ...
  - Could be used for term suggestion (a.k.a. query suggestion)
- Field collapsing (grouping)
- Spell checking (did-you-mean)
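
Field-value facets boil down to counting how often each value occurs among the hits. A minimal sketch, with invented sample hits and a hypothetical field name `lang`; Solr computes these counts on the index, not on Python dictionaries.

```python
from collections import Counter

def facet_counts(hits, field):
    """Count the search hits per value of one field, most frequent first."""
    return Counter(doc[field] for doc in hits if field in doc).most_common()

hits = [
    {"id": 1, "lang": "de"},
    {"id": 2, "lang": "en"},
    {"id": 3, "lang": "de"},
    {"id": 4},                 # hit without the field: not counted
]
lang_facets = facet_counts(hits, "lang")
print(lang_facets)
```

The resulting (value, count) pairs are what a user interface would show as clickable refinements next to the result list.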

Additional Solr Features

Query by Example
- More like this

Stats
- Per field
- Min, max, sum, missing, ...

Admin-GUI
- Webapp to troubleshoot queries
- Browse schema

JMX
- Read properties & statistics
- Can be accessed remotely


Integration

Deployment
- Within a web application server
- Embedded

Monitor
- Log output

Access
- Various language bindings
  - Java, Ruby, JavaScript, PHP, ...


Multi-core

Multiple indices
- Each index has its own configuration

Operations
- Reload (when configuration has been changed)
- Rename
- Swap
- Merge
- Create, Status


Scale Solr

Replication
- Master and slave nodes
- Replication
  - Slaves poll master

Dispatch search request
- Load balancer
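
How search requests might be dispatched across replicated nodes, as a sketch. Round-robin is just one possible strategy and is an assumption here, since the slide only says "load balancer"; the class name and the host names are hypothetical.

```python
import itertools

class RoundRobinBalancer:
    """Hands out replica URLs in turn; index updates would go to the master."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def next_replica(self):
        return next(self._cycle)

# Hypothetical host names, for illustration only.
balancer = RoundRobinBalancer(["http://slave1:8983/solr", "http://slave2:8983/solr"])
targets = [balancer.next_replica() for _ in range(3)]
print(targets)
```

In practice this job is usually done by a dedicated HTTP load balancer in front of the slaves rather than by application code.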


Sharding Indexes

Single index
- Index spread over multiple machines
- Search is done in parallel

Mapping
- Application has to provide a deterministic mapping
- Document ⇒ index
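
One way an application could provide the deterministic document ⇒ index mapping is to hash the unique document id. This is a sketch under that assumption; the function name is made up, and MD5 is used only because it gives the same value in every process, unlike Python's built-in `hash()` for strings.

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in the range 0..num_shards-1."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

shard = shard_for("doc-42", 4)
print(shard)
```

The same id always lands on the same shard, so a re-indexed document replaces its old version instead of appearing twice. Changing `num_shards` changes the mapping for most ids, which means the index would have to be rebuilt.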


Conclusions

Ecosystem
- Vibrant community
- Corporate backing

Solr
- Easy to get started
- Hard to optimize for specific requirements


The End

Thank you!
