
DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps


DataScience Talk by Roman Kern, Know Center - Graz University of Technology Date: April 12th 2012 Graz, Austria


Building an open-source based search solution – first steps

Roman Kern

Institute of Knowledge Management, Graz University of Technology

Know-Center Graz
[email protected], [email protected]

Data Science Meetup / 2012-04-12

Overview

- Motivation
- Background
- Solr Ecosystem
- Solr Features
- Conclusions

Motivation

Search
- Change in users' expectations
- Missing or sub-optimal search causes frustration

Science
- Information retrieval
- Success story
- Mostly focused on web search

Industry
- Enterprise search
- Heterogeneous data sources

Background of the Speaker

http://a1.net

http://wissen.de

Apache Lucene Umbrella Project

Components
- Search engine ⇒ Lucene
- Search server ⇒ Solr
- Web search engine ⇒ Nutch
- Lightweight crawler ⇒ Droids
- File-format parsing ⇒ Tika
- Communicate with CMS ⇒ ManifoldCF
- Distributed coordination ⇒ ZooKeeper
- Natural language processing ⇒ OpenNLP
- Related projects: Hadoop, Mahout, Carrot2, ...

Common aspects
Apache license, implemented in Java, community

Lucene

Search Engine Library
- Java API
  - Only for expert users
- Search index
  - File-system
  - In-memory index
- Advanced features
  - Incremental indexing
  - Update while searching
- Base for many projects
  - Solr
  - ir-lib
  - elasticsearch
- LIA (Lucene in Action)

http://lucene.apache.org/core/

Nutch

Web search engine
- Builds upon Solr
- Web crawler
  - Link database, crawl database
- Distributed
  - Runs on Hadoop
- Mode of operation
  - Crawl a single domain
  - Crawl the web with seed sites

http://nutch.apache.org/

Droids

Crawler component
- Lightweight crawler
- Main features
  - Throttling
  - Multi-threaded
  - Well behaved (robots.txt)

http://incubator.apache.org/droids/

Tika

Text extraction
- Text & meta-data
- File-formats
  - Office
    - Microsoft Formats (Apache POI)
    - OpenDocument
  - Common text formats
    - PDF (PDFBox)
    - HTML (tagsoup)
  - Non-text
    - Images
    - Sound

http://tika.apache.org/

ManifoldCF

Content Management System Connectors
- Communicate with CMS/DMS
- Connectors
  - FileNet P8 (IBM)
  - Documentum (EMC)
  - LiveLink (OpenText)
  - Meridio (Autonomy)
  - Windows shares (Microsoft)
  - SharePoint (Microsoft)
  - More: Alfresco, JDBC, ...
- Data is then stored and indexed
  - e.g. Solr

http://incubator.apache.org/connectors/

ZooKeeper

Distributed coordination
- Orchestrate servers
- Distributed
  - Configuration
  - Name lookup
  - Synchronization

http://zookeeper.apache.org/

OpenNLP

Natural language processing
- Process plain text
- Maximum entropy classification with beam search
- Models
  - Sentence splitting
  - Token splitting
  - Part-of-speech (POS) tagging
  - Named entity recognition
  - More: chunker, parser, co-reference resolution

http://opennlp.sourceforge.net/

Hadoop

Distributed computing
- Scale out framework
- Distributed file-system
  - Data is partitioned
  - Stored on multiple nodes
- Map/Reduce paradigm
  - Map your algorithms to mappers & reducers

Related projects: HBase, Pig, Hive, ...

http://hadoop.apache.org/
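The Map/Reduce paradigm can be illustrated with a word count in plain Python. This is only a single-process sketch of the idea and does not use Hadoop's API; the names `mapper`, `reducer`, and `run_job` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def reducer(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Combine all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_job(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for document in documents:
        for word, count in mapper(document):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())
```

In a real cluster the mappers and reducers would run on different nodes, and the grouping step would be handled by the framework rather than by a local dictionary.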

Mahout

Distributed machine learning
- Scale out framework
- Machine learning
  - Recommender systems
  - Clustering
  - Classification
- Integration
  - Standalone
  - Hadoop
  - Amazon EC2

http://mahout.apache.org/

Details

Search Server

What Solr is
- Web service
- Full-text indexing & search
- Support to store arbitrary content

What Solr isn’t
- Solr ≠ grep
- Database
  - But somewhat similar to NoSQL databases

Solr vs. IR-Lib
- Solr: easy to use, easy to integrate, XML configuration
- IR-Lib: expert knowledge to use, Java configuration, fast

Index Structure

Inverted Index
- Dictionary of words (terms)
- Map from term to document

Document
- List of fields
- Input fields are then mapped according to the schema

Field-types
- Defined in the schema
- Type (string, boolean, date, number), internally mapped to string
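The inverted index can be sketched in a few lines of Python. This is a toy illustration of the data structure, not Lucene's on-disk format; the function names are hypothetical, and tokenization is reduced to lower-casing and whitespace splitting.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of ids of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search_all(index: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

A lookup only touches the posting sets of the query terms, which is why search time depends on the terms in the query rather than on scanning every document.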

Index Management

API
- HTTP server
- Various formats (XML, binary, JavaScript, ...)

Document life-cycle
- There is no update
  - Delete (done automatically by Solr)
  - Insert
- Implications
  - A unique id is necessary
  - Use batch updates
- Commit, rollback (and optimize)
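A batch update in Solr's XML format might be assembled as follows. This is a minimal sketch that only builds the message bodies and does not send them; it assumes the schema's unique key field is called `id`, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_add_message(documents: List[Dict[str, str]]) -> str:
    """Serialize a batch of documents into one Solr XML <add> message."""
    add = ET.Element("add")
    for document in documents:
        # Assumption: the schema's unique key field is named "id".
        if "id" not in document:
            raise ValueError("every document needs a unique id")
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


def build_commit_message() -> str:
    """Serialize the commit message that makes pending changes searchable."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")
```

Since there is no in-place update, sending a document whose id already exists replaces the old version. Sending many documents in one `<add>` message followed by a single commit is usually cheaper than committing after each document.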

Input Handling

Different input formats
- XML
- CSV
- JDBC (database)
  - DIH (data import handler)
  - Support incremental updates (via timestamps)
- Solr Cell
  - Binary content
  - Apache Tika
  - Text content and metadata
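The idea behind timestamp-based incremental updates can be sketched independently of the data import handler's configuration. The example below is an assumption-laden illustration: the column name `last_modified` and the function name are hypothetical, and rows are plain dictionaries instead of database records.

```python
from datetime import datetime
from typing import Any, Dict, List


def select_changed_rows(
    rows: List[Dict[str, Any]], last_index_time: datetime
) -> List[Dict[str, Any]]:
    """Return only the rows modified after the previous import run."""
    return [row for row in rows if row["last_modified"] > last_index_time]
```

Only the returned rows need to be re-indexed, and the time of the current run is then stored as the new `last_index_time` for the next import.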

Text Processing

Scope
- During indexing & query

Tokenization
- Split text into tokens
- Lower-case normalization
- Stemming (e.g. ponies, pony ⇒ poni, triplicate ⇒ triplic, ...)
- Synonyms (via thesaurus)
- Stop-word filtering
- Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
- n-grams, soundex, umlauts
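A few of these steps can be chained into a small analysis function. This is a rough sketch, not Solr's analyzer chain: the stop-word list is a tiny made-up sample, `naive_stem` is a crude suffix stripper rather than a real stemming algorithm (it does not map "pony" to "poni"), and the tokenizer only handles ASCII letters and digits.

```python
import re
from typing import List

# Tiny illustrative stop-word list; real deployments use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def tokenize(text: str) -> List[str]:
    """Split on anything that is not an ASCII letter or digit (Wi-Fi -> Wi, Fi)."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]


def naive_stem(token: str) -> str:
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + replacement
    return token


def analyze(text: str) -> List[str]:
    """Run the chain: tokenize, lower-case, drop stop words, stem."""
    tokens = [token.lower() for token in tokenize(text)]
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]
```

The same function has to be applied to documents at indexing time and to the query at search time; otherwise the terms in the index and the terms in the query would not match.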

Query Processing

Query parsers
- Lucene query parser (rich syntax)
  - AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query
  - Boosting of individual parts
  - Example: ((boltzmann OR schroedinger) NOT einstein)
- Dismax query parser
  - No query syntax
  - Searches over multiple fields (separate boost for each field)
  - Configure how many terms are mandatory
  - Distance between terms is used for ranking (phrase boosting)

Dismax is a good starting point, but may become expensive
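The difference between the two parsers shows up in the request parameters. The sketch below only builds request URLs and does not contact a server; the endpoint and the field names `title` and `body` are assumptions made for this example, while `defType`, `qf`, `mm`, and `pf` are Solr's Dismax parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port, and path are assumptions.
SELECT_URL = "http://localhost:8983/solr/select"


def lucene_query_url(query: str) -> str:
    """Build a request URL that uses the default Lucene query parser."""
    return SELECT_URL + "?" + urlencode({"q": query})


def dismax_query_url(user_input: str) -> str:
    """Build a request URL that hands raw user input to the Dismax parser."""
    params = {
        "defType": "dismax",
        "q": user_input,
        "qf": "title^2.0 body",  # fields to search, with per-field boosts
        "mm": "2",  # at least two of the optional terms must match
        "pf": "title",  # boost documents where the terms form a phrase
    }
    return SELECT_URL + "?" + urlencode(params)
```

With the Lucene parser the caller writes the query syntax, for example `lucene_query_url("(boltzmann OR schroedinger) NOT einstein")`. With Dismax the user input is passed through unchanged and the search behaviour is configured through the other parameters.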

Search Features

Query filter
- Additional query
- No impact on ranking
- Results are cached

Boosting query
- Only in Dismax

Query elevation
- Fix certain queries

Request handler
- Pre-define clauses
- Invariants

Function queries
- Score is computed on field values
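The behaviour of a query filter (in Solr, the `fq` request parameter) can be modelled with a toy in-memory searcher. This is only an illustration of the two properties listed above, unchanged ranking and cached filter results; the class, its methods, and the exact-match filter are hypothetical simplifications.

```python
from typing import Dict, List, Set, Tuple

Document = Dict[str, str]


class FilteredSearcher:
    """Toy searcher: filters restrict the hit list but never change scores."""

    def __init__(self, documents: Dict[int, Document]) -> None:
        self.documents = documents
        self.filter_cache: Dict[Tuple[str, str], Set[int]] = {}

    def _filter_ids(self, field: str, value: str) -> Set[int]:
        """Return ids matching field == value, computing the set only once."""
        key = (field, value)
        if key not in self.filter_cache:
            self.filter_cache[key] = {
                doc_id
                for doc_id, doc in self.documents.items()
                if doc.get(field) == value
            }
        return self.filter_cache[key]

    def search(
        self, scores: Dict[int, float], field: str, value: str
    ) -> List[Tuple[int, float]]:
        """Keep scored hits that pass the filter, ordered by unchanged score."""
        allowed = self._filter_ids(field, value)
        hits = [(doc_id, s) for doc_id, s in scores.items() if doc_id in allowed]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Because the filter result is independent of the main query, the cached id set can be reused for every later query that applies the same filter.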

Search Result

Ranking
- Relevance
- Sort on field value (only single term per document)

Available data & features
- Sequence of IDs & score
- Stored fields
- Snippets (plus highlighting)
- Facets
  - Count the search hits
  - Types: field value, dates, queries
  - Sort, prefix, ...
  - Could be used for term suggestion (aka query suggestion)
- Field collapsing (grouping)
- Spell checking (did-you-mean)
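Field-value faceting amounts to counting the values of one field over the hits of a search. The following sketch shows that idea on a list of dictionaries; it is not Solr's implementation, and the function name and the `prefix` argument are modelled loosely on the facet options named above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def facet_counts(
    hits: List[Dict[str, str]], field: str, prefix: str = ""
) -> List[Tuple[str, int]]:
    """Count how many hits carry each value of `field`, most frequent first."""
    counts = Counter(
        hit[field] for hit in hits if field in hit and hit[field].startswith(prefix)
    )
    # Sort by descending count, then alphabetically to make ties deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Calling the function with the characters typed so far as `prefix` gives a simple form of term suggestion: the most frequent matching values are offered first.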

Additional Solr Features

Query by Example
- More like this

Stats
- Per field
- Min, max, sum, missing, ...

Admin-GUI
- Webapp to troubleshoot queries
- Browse schema

JMX
- Read properties & statistics
- Can be accessed remotely

Integration

Deployment
- Within a web application server
- Embedded

Monitor
- Log output

Access
- Various language bindings
  - Java, Ruby, JavaScript, PHP, ...

Multi-core

Multiple indices
- Each index has its own configuration

Operations
- Reload (when configuration has been changed)
- Rename
- Swap
- Merge
- Create, Status
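Core operations are issued as HTTP requests to Solr's CoreAdmin handler. The sketch below only builds such request URLs; the endpoint and the core names `core0`, `live`, and `staging` are assumptions for illustration, and the helper function is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical admin endpoint; host, port, and path depend on the installation.
CORE_ADMIN_URL = "http://localhost:8983/solr/admin/cores"


def core_admin_url(action: str, **params: str) -> str:
    """Build a CoreAdmin request URL for the given action and parameters."""
    query = {"action": action.upper()}
    query.update(params)
    return CORE_ADMIN_URL + "?" + urlencode(query)


# Reload a core after its configuration has been changed.
reload_url = core_admin_url("reload", core="core0")
# Swap a freshly built index into the place of the live one.
swap_url = core_admin_url("swap", core="live", other="staging")
```

Swapping is a common way to rebuild an index offline in a second core and switch over without interrupting search.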

Scale Solr

Replication
- Master and slave nodes
- Replication
- Slaves poll master

Dispatch search request
- Load balancer
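Dispatching search requests across replicated slaves can be as simple as a round-robin rotation. This is a minimal sketch of one possible load-balancing policy, not a component shipped with Solr; the class name is hypothetical and the slave addresses are supplied by the caller.

```python
from itertools import cycle
from typing import Iterator, List


class RoundRobinBalancer:
    """Hand out slave URLs in turn so search load is spread evenly."""

    def __init__(self, slave_urls: List[str]) -> None:
        if not slave_urls:
            raise ValueError("at least one slave is required")
        self._slaves: Iterator[str] = cycle(slave_urls)

    def next_slave(self) -> str:
        """Return the URL of the slave that should serve the next query."""
        return next(self._slaves)
```

Index updates still go to the master only; the slaves pick up the changes when they poll, so a query may briefly see an older index on one slave than on another.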

Sharding Indexes

Single index
- Index spread over multiple machines
- Search is done in parallel

Mapping
- Application has to provide a deterministic mapping
- Document ⇒ index
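One way for the application to provide a deterministic mapping is to hash the document id. The sketch below is one possible scheme under that assumption, not something prescribed by Solr; the function name is hypothetical.

```python
import hashlib


def shard_for(document_id: str, shard_count: int) -> int:
    """Map a document id to a shard number that is stable across runs."""
    if shard_count < 1:
        raise ValueError("shard_count must be positive")
    # A cryptographic digest is used because Python's built-in hash() of a
    # string can differ between processes, which would break determinism.
    digest = hashlib.sha256(document_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count
```

The same function must be used for inserts and deletes so that a newer version of a document always lands on the shard that holds the older one. Changing `shard_count` changes the mapping, so the index would have to be rebuilt.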

Conclusions

Ecosystem
- Vibrant community
- Corporate backing

Solr
- Easy to get started
- Hard to optimize for specific requirements

The End

Thank you!