Improving the Solr Update Chain Jan Høydahl


A talk about the (hidden) document processing capability built right into Apache Solr. We show you what it is, how to use it and how to write your own plugins, and we suggest some future improvements.


What will I cover?
•Who is Jan Høydahl?
•Intro to Solr’s (hidden) UpdateChain
•How to write your own UpdateProcessors
•Example: Web crawl @ Oslo University
•A vision for future improvements
•Conclusion


Jan Høydahl

1995: Developer telecom
1998: Java developer
2000: Search - FAST
2006: Lucene
2007: Cominvent
2011: Lucene committer

> 100 projects


Cominvent AS

Consulting & support: Lucene/Solr, FAST

www.solrtraining.com


Why document processing?

Analysis is field oriented
Filters only see the “local” field

But what if you want to:
•Add or remove fields?
•Make decisions based on other fields?

We need a way to modify the Document

[Diagram: Doc1 has the fields name, postcode and cv_pdf_url. To match the query “programmer near Barcelona”, it also needs the fields cv_text and latlong. These could be added by the client or by a 3rd party pipeline.]

Solr’s Update Chain

The Update Chain

[Diagram: a Doc with the fields name, postcode and cv_pdf_url passes through three processors in turn. PostcodeToLatLongProcessor adds latlong, UrlFetcherProcessor adds cv_pdf_bin, and TikaExtractingProcessor adds cv_text.]
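The chain idea can be sketched in plain Java. This is only a model under stated assumptions: it uses no Solr classes, the UpdateChainSketch class and the DocProcessor interface are hypothetical stand-ins, and the three processors put placeholder values where real ones would geocode, download and run Tika.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a linear update chain; no Solr classes are used.
final class UpdateChainSketch {

    // Hypothetical stand-in for an update processor: it sees the whole document.
    interface DocProcessor {
        void process(Map<String, Object> doc);
    }

    // Placeholder logic only: a real processor would call a geocoder here.
    static final DocProcessor POSTCODE_TO_LATLONG = doc -> {
        if (doc.containsKey("postcode")) {
            doc.put("latlong", "41.39,2.17"); // made-up coordinates
        }
    };

    // Placeholder logic only: a real processor would download the URL.
    static final DocProcessor URL_FETCHER = doc -> {
        if (doc.containsKey("cv_pdf_url")) {
            doc.put("cv_pdf_bin", new byte[0]);
        }
    };

    // Placeholder logic only: a real processor would run Tika on the bytes.
    static final DocProcessor TEXT_EXTRACTOR = doc -> {
        if (doc.containsKey("cv_pdf_bin")) {
            doc.put("cv_text", "text extracted from the CV");
        }
    };

    // Hands the same document to each processor in order.
    static Map<String, Object> run(Map<String, Object> doc, List<DocProcessor> chain) {
        for (DocProcessor processor : chain) {
            processor.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        run(doc, List.of(POSTCODE_TO_LATLONG, URL_FETCHER, TEXT_EXTRACTOR));
        System.out.println(doc.keySet());
    }
}
```

Order matters in this model: the text extractor only finds cv_pdf_bin because the URL fetcher ran before it.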

How it’s wired

Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain

Chain definition in solrconfig.xml:


Other examples


Language Identification


Entity extraction

The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.

Entity types highlighted: Company, Location, Date


Writing your own processor


•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
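The first and third tips can be illustrated with a small plain-Java sketch. It is not Solr code: the RegexReplaceSketch class and the "regexreplace." parameter prefix are hypothetical, and a real processor would extend Solr's own base classes. The point is that nothing is hard-coded, so the same class serves any field list and any pattern, and the logic sits in a method that is easy to unit test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;

// Generic, parameterized processor model; all names here are hypothetical.
final class RegexReplaceSketch {
    private final String[] fields;
    private final Pattern pattern;
    private final String replacement;

    // Every setting is read from params under a common prefix,
    // so it cannot clash with parameters of other processors.
    RegexReplaceSketch(Map<String, String> params) {
        this.fields = params.getOrDefault("regexreplace.fl", "").trim().split("[,\\s]+");
        this.pattern = Pattern.compile(
                Objects.requireNonNull(params.get("regexreplace.pattern"), "regexreplace.pattern is required"));
        this.replacement = params.getOrDefault("regexreplace.replacement", "");
    }

    // Rewrites the configured fields in place; other fields are left alone.
    void process(Map<String, Object> doc) {
        for (String field : fields) {
            Object value = doc.get(field);
            if (value instanceof String) {
                doc.put(field, pattern.matcher((String) value).replaceAll(replacement));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("regexreplace.fl", "content_no content_en");
        params.put("regexreplace.pattern", "[\\s\\u00A0]+");
        params.put("regexreplace.replacement", " ");

        Map<String, Object> doc = new HashMap<>();
        doc.put("content_en", "too   many \n spaces");
        new RegexReplaceSketch(params).process(doc);
        System.out.println(doc.get("content_en"));
    }
}
```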

Web crawl with Language Detection @ Oslo University

Solr @ Oslo University


<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>


Room for improvement?

Improvements

•Processors re-created for every request
•Duplication of config between chains
•No support for non-linear chains or sub chains
•Does not scale very well
•Lacks native scripting language support

Processors re-created for every request

Pain:
•Potentially expensive initialization
•StaticRankProcessor: read & parse 50,000 lines

Proposed cure:
•Keep persistent state object in factory:
  private final Map<Object,Object> sharedObjCache
  new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
•Processor uses sharedObjCache for state
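A minimal plain-Java sketch of this cure, under stated assumptions: the factory and processor below are hypothetical stand-ins rather than Solr classes, the rank table is a tiny hard-coded map instead of a 50,000-line file, and the load counter exists only to make the effect visible. The factory lives across requests and owns the cache, so the processor created for each request can reuse the parsed data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Model of a long-lived factory sharing state with short-lived processors.
final class SharedStateSketch {

    // Counts expensive loads, for demonstration only.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Stand-in for reading and parsing a large rank file.
    static Map<String, Double> loadRanks() {
        LOADS.incrementAndGet();
        return Map.of("http://example.com/", 1.5);
    }

    // Created once per request in this model.
    static final class RankProcessor {
        private final Map<String, Double> ranks;

        @SuppressWarnings("unchecked")
        RankProcessor(Map<Object, Object> sharedObjCache) {
            // Only the first processor pays for the load; later ones hit the cache.
            this.ranks = (Map<String, Double>) sharedObjCache.computeIfAbsent("ranks", key -> loadRanks());
        }

        double rankFor(String url) {
            return ranks.getOrDefault(url, 0.0);
        }
    }

    // Created once and kept; owns the persistent state object.
    static final class RankProcessorFactory {
        private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();

        RankProcessor getInstance() {
            return new RankProcessor(sharedObjCache);
        }
    }

    public static void main(String[] args) {
        RankProcessorFactory factory = new RankProcessorFactory();
        for (int request = 0; request < 3; request++) {
            System.out.println(factory.getInstance().rankFor("http://example.com/"));
        }
        System.out.println("loads: " + LOADS.get());
    }
}
```

A ConcurrentHashMap is used here because several requests may create processors at the same time; computeIfAbsent then runs the loader at most once per key.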

Duplication of config between chains

Pain:
•Multiple chains often need identical processors
•UiO’s two chains share 80% -> copy/paste

Proposed cure:
•Allow sharing of named instances
•Define: <processor name="langid" class="..">
•Refer: <processor ref="langid" />
•See SOLR-2823

No support for non-linear chains or sub chains

Pain:
•Chains are linear only
•Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
•New scriptable Update Chain - alternative to XML
•Script chain logic in solr/conf/updateproc.groovy
•Full flexibility:
  chain myChain {
    if(doc.getFieldValue("type").equals("pdf"))
      process(tikaproc)
  }
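The branching in the Groovy proposal can be modelled in plain Java as well. This is only a sketch of the idea, not the SOLR-2841 design: the Step interface, the when() wrapper and the placeholder Tika step are all hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Model of a chain where a step runs only when a condition on the document holds.
final class ConditionalChainSketch {

    // Hypothetical stand-in for an update processor.
    interface Step {
        void process(Map<String, Object> doc);
    }

    // Wraps a step so that it is skipped unless the condition matches.
    static Step when(Predicate<Map<String, Object>> condition, Step step) {
        return doc -> {
            if (condition.test(doc)) {
                step.process(doc);
            }
        };
    }

    // Placeholder for text extraction; a real step would run Tika.
    static final Step TIKA_PLACEHOLDER = doc -> doc.put("cv_text", "extracted text");

    // Equivalent of: if type is "pdf", process(tikaproc).
    static final List<Step> MY_CHAIN =
            List.of(when(doc -> "pdf".equals(doc.get("type")), TIKA_PLACEHOLDER));

    static void run(Map<String, Object> doc, List<Step> chain) {
        for (Step step : chain) {
            step.process(doc);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        run(pdf, MY_CHAIN);

        Map<String, Object> html = new HashMap<>();
        html.put("type", "html");
        run(html, MY_CHAIN);

        System.out.println(pdf.containsKey("cv_text") + " " + html.containsKey("cv_text"));
    }
}
```

Writing the comparison as "pdf".equals(...) keeps the condition safe for documents that have no type field at all.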

Does not scale very well

Pain:
•Single threaded
•Heavy processing not efficient

Proposed cure:
•Local: Use multi threaded update requests
•SolrCloud: Dedicated nodes, role="processor" ?
•Wrap an external pipeline in UpdateProcessor
•Example: OpenPipelineUpdateProcessor ?

Lacks native scripting language support

Pain:
•Not really a “problem” :-)
•Nice to write processors in Python, Groovy, JS...

Proposed cure:
•Now: Finish SOLR-1725: Script based Processor
•Later: Make scripts first-class processors
  <processor script="myScript.py" />
  or
  <processor ref="myScript" />


One last thing...


New standalone framework?

•The UpdateChain is Solr specific
•Interest for a pure pipeline framework
•Search engine independent
•Scalable
•Rich pool of processors
•Several existing candidates
•Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing


Summary

•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!


Extra


Calling out from UpdateChain


This is one way an external pipeline system can be integrated with Solr.

The main benefit of this approach is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.


Scaling with external pipeline

Here is a more advanced, distributed case, where one Solr node is dedicated to processing and the entry-point Solr node only dispatches the requests.