61
Improving the Solr Update Chain Jan Høydahl

improving solrs update chain - Jan hoydahl

Embed Size (px)

Citation preview

Improving the Solr Update Chain

Jan Høydahl

What will I cover?Who is Jan Høydahl?Intro to Solr’s (hidden) UpdateChainHow to write your own UpdateProcessorsExample: Web crawl @ Oslo UniversityA vision for future improvementsConclusion

2

Jan Høydahl

1995: Developer telecom1998: Java developer2000: Search - FAST2006: Lucene2007: Cominvent2011: Lucene committer

> 100 projects

5

Cominvent AS

6

Consulting & supportLucene/Solr

FAST

www.solrtraining.com

Why document processing?

7

Analysis is Field orientedFilters only see the “local” field

Why document processing?

8

But what if you want to:Add or remove fields?Make decisions based on other fields?

We need a way to modify the Document

Why document processing?

9

name

postcode

cv_pdf_url

Doc1

programmer near Barcelona

Why document processing?

10

name

postcode

cv_pdf_url

Doc1

cv_text

latlong programmer near Barcelona

Why document processing?

11

name

postcode

cv_pdf_url

Doc1

cv_text

latlong

Client

Why document processing?

12

name

postcode

cv_pdf_url

Doc1

cv_text

latlong

Client

3rd party pipeline

Solr’s Update Chain

13

The Update Chain

14

The Update Chain

15

name

postcode

cv_pdf_url

Doc

The Update Chain

15

name

postcode

cv_pdf_url

Docname

postcode

cv_pdf_url

Doc

latlong

PostcodeToLatLongProcessor

The Update Chain

15

name

postcode

cv_pdf_url

Docname

postcode

cv_pdf_url

Doc

latlong

PostcodeToLatLongProcessor

name

postcode

cv_pdf_url

Doc

cv_pdf_bin

latlong

UrlFetcherProcessor

The Update Chain

15

name

postcode

cv_pdf_url

Docname

postcode

cv_pdf_url

Doc

latlong

PostcodeToLatLongProcessor

name

postcode

cv_pdf_url

Doc

cv_pdf_bin

latlong

UrlFetcherProcessor

name

postcode

cv_pdf_url

Doc

cv_pdf_bin

latlong

TikaExtractingProcessor

cv_text

How it’s wired

17

Choose chain in your update request:.../solr/update/xml?..&update.chain=cv-chain

Chain definition in solrconfig.xml:

Other examples

18

Language Identification

Other examples

19

Entity extraction

The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.

Company

Location Date

Writing your own processor

21

Writing your own processor

21

Writing your own processor

22

Writing your own processor

23

Writing your own processor

24

•Make generic processors - parameterized•Use SchemaAware, SolrCoreAware and

ResourceLoaderAware interfaces•Prefix param names to avoid name clash•Testing and testable methods•Donate back to Apache & document on Wiki

Web crawl withLanguage Detection@ Oslo University

25

Solr @ Oslo University

26

<?xml version="1.0"?><updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

27

<?xml version="1.0"?><updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

27

<?xml version="1.0"?><updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

27

<?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

28

<?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

28

<?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

28

<?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[\s\u00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/>

Solr @ Oslo University

28

Room forimprovement?

32

Improvements

34

Processors re-created for every requestDuplication of config between chainsNo support for non-linear or sub chainsDoes not scale very wellLack native scripting language support

Improvements

35

Pain:Potentially expensive initializationStaticRankProcessor: read&parse 50.000 lines

Proposed cure:Keep persistent state object in factory: private final Map<Object,Object> sharedObjCachenew StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);Processor uses sharedObjCache for state

Improvements

36

Processors re-created for every requestDuplication of config between chainsNo support for non-linear or sub chainsDoes not scale very wellLack native scripting language support

Improvements

37

Pain:Multi chains often need identical ProcessorsUiO’s two chains share 80% -> copy/paste

Proposed cure:Allow sharing of named instancesDefine:<processor name="langid" class="..">

Refer:<processor ref="langid" />

See SOLR-2823

Improvements

38

Processors re-created for every requestDuplication of config between chainsNo support for non-linear or sub chainsDoes not scale very wellLack native scripting language support

Improvements

39

Pain:Chains are linear onlyHard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):New scriptable Update Chain - alternative to XMLScript chain logic in solr/conf/updateproc.groovyFull flexibility:chain myChain {

if(doc.getFieldValue("type").equals("pdf")) process(tikaproc) }

Improvements

40

Processors re-created for every requestDuplication of config between chainsNo support for non-linear chains or sub chainsDoes not scale very wellLack native scripting language support

Improvements

41

Pain:Single threadedHeavy processing not efficient

Proposed cure:Local: Use multi threaded update requestsSolrCloud: Dedicated nodes, role=“processor” ?Wrap an external pipeline in UpdateProcessor

Example: OpenPipelineUpdateProcessor ?

Improvements

42

Processors re-created for every requestDuplication of config between chainsNo support for non-linear chains or sub chainsDoes not scale very wellLack native scripting language support

Improvements

43

Pain:Not really a “problem” :-)Nice to write processors in Python, Groovy, JS...

Proposed cure:Now: Finish SOLR-1725: Script based ProcessorLater: Make scripts first-class processors

<processor script="myScript.py" />or<processor ref="myScript" />

One last thing...

44

New standalone framework?

45

•The UpdateChain is Solr specific•Interest for a pure pipeline framework•Search engine independent•Scalable•Rich pool of processors•Several existing candidates

•Some initial thoughts:http://wiki.apache.org/solr/DocumentProcessing

Summary

46

Summary•Document centric vs field centric processing•UpdateChain is there - use it!•Works well for most “light” cases•Scaling issues, but caching config may help•More processors welcome!

47

Extra

49

Calling out from UpdateChain

51

This is one way an external pipeline system can be integrated with Solr.

The main benefit of such a method is you can continue to feed content with SolrJ, DIH or other Update Request Handlers.

Scaling with external pipeline

52

Here is a more advanced, distributed case, where a Solr node is dedicated for processing, and the entry point Solr only dispatches the requests.