Transcript
Page 1: Improving the Solr Update Chain

Improving the Solr Update Chain

Jan Høydahl

Page 2: Improving the Solr Update Chain

What will I cover?

• Who is Jan Høydahl?
• Intro to Solr’s (hidden) UpdateChain
• How to write your own UpdateProcessors
• Example: Web crawl @ Oslo University
• A vision for future improvements
• Conclusion

2

Page 3: Improving the Solr Update Chain
Page 4: Improving the Solr Update Chain
Page 5: Improving the Solr Update Chain

Jan Høydahl

• 1995: Developer telecom
• 1998: Java developer
• 2000: Search - FAST
• 2006: Lucene
• 2007: Cominvent
• 2011: Lucene committer

> 100 projects

5

Page 6: Improving the Solr Update Chain

Cominvent AS

6

Consulting & support

Lucene/Solr

FAST

www.solrtraining.com

Page 7: Improving the Solr Update Chain

Why document processing?

7

• Analysis is field oriented
• Filters only see the “local” field

Page 8: Improving the Solr Update Chain

Why document processing?

8

But what if you want to:
• Add or remove fields?
• Make decisions based on other fields?

We need a way to modify the Document

Page 9: Improving the Solr Update Chain

Why document processing?

9

name

postcode

cv_pdf_url

Doc1

programmer near Barcelona

Page 10: Improving the Solr Update Chain

Why document processing?

10

name

postcode

cv_pdf_url

Doc1

cv_text

latlong

programmer near Barcelona

Page 11: Improving the Solr Update Chain

Why document processing?

11

name

postcode

cv_pdf_url

Doc1

cv_text

latlong

Client

Page 12: Improving the Solr Update Chain

Why document processing?

12

name

postcode

cv_pdf_url

Doc1

cv_text

latlong

Client

3rd party pipeline

Page 13: Improving the Solr Update Chain

Solr’s Update Chain

13

Page 14: Improving the Solr Update Chain

The Update Chain

14

Page 18: Improving the Solr Update Chain

The Update Chain

15

Doc: name, postcode, cv_pdf_url
  ↓ PostcodeToLatLongProcessor
Doc: name, postcode, cv_pdf_url, latlong
  ↓ UrlFetcherProcessor
Doc: name, postcode, cv_pdf_url, latlong, cv_pdf_bin
  ↓ TikaExtractingProcessor
Doc: name, postcode, cv_pdf_url, latlong, cv_pdf_bin, cv_text
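
The flow on this slide can be modelled in a few lines of plain Java. The sketch below is a simplified stand-in, not Solr's actual update processor API: the `Processor` interface, the `Chain` class, the postcode table and all field values are assumptions made for illustration. It only shows the idea that each step sees the whole document and may add fields before the next step runs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of an update chain (hypothetical types, not the Solr API).
interface Processor {
    // Each processor may read and modify any field of the document.
    void process(Map<String, Object> doc);
}

class Chain {
    private final List<Processor> steps = new ArrayList<>();

    Chain add(Processor p) {
        steps.add(p);
        return this;
    }

    // Runs the processors in the order they were added.
    Map<String, Object> run(Map<String, Object> doc) {
        for (Processor p : steps) {
            p.process(doc);
        }
        return doc;
    }
}

class ChainDemo {
    // Hypothetical lookup table standing in for a real geocoding service.
    static final Map<String, String> POSTCODES = Map.of("08001", "41.38,2.17");

    static Chain cvChain() {
        return new Chain()
            // Adds latlong based on the postcode field.
            .add(doc -> {
                String latlong = POSTCODES.get(String.valueOf(doc.get("postcode")));
                if (latlong != null) {
                    doc.put("latlong", latlong);
                }
            })
            // Stand-in for fetching the PDF: stores placeholder bytes.
            .add(doc -> {
                if (doc.containsKey("cv_pdf_url")) {
                    doc.put("cv_pdf_bin", new byte[0]);
                }
            })
            // Stand-in for text extraction: writes a placeholder string.
            .add(doc -> {
                if (doc.containsKey("cv_pdf_bin")) {
                    doc.put("cv_text", "(extracted text)");
                }
            });
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Doc1");
        doc.put("postcode", "08001");
        doc.put("cv_pdf_url", "http://example.com/cv.pdf");
        System.out.println(cvChain().run(doc).keySet());
    }
}
```

Because every step receives the full document, a later step can depend on a field that an earlier step added, which is what field-level analysis filters cannot do.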

Page 19: Improving the Solr Update Chain
Page 20: Improving the Solr Update Chain
Page 21: Improving the Solr Update Chain

How it’s wired

17

Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain

Chain definition in solrconfig.xml:

Page 22: Improving the Solr Update Chain

Other examples

18

Language Identification

Page 23: Improving the Solr Update Chain

Other examples

19

Entity extraction

The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999.

Company

Location

Date

Page 24: Improving the Solr Update Chain
Page 25: Improving the Solr Update Chain

Writing your own processor

21

Page 27: Improving the Solr Update Chain

Writing your own processor

22

Page 28: Improving the Solr Update Chain

Writing your own processor

23

Page 29: Improving the Solr Update Chain

Writing your own processor

24

• Make generic processors - parameterized
• Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces
• Prefix param names to avoid name clash
• Testing and testable methods
• Donate back to Apache & document on Wiki
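
As a rough illustration of the first and third bullets, here is a minimal sketch in plain Java. It does not use Solr's classes: `FieldCopyProcessor`, the `copy.` prefix and the parameter names are hypothetical. The point is only that field names come from configuration rather than being hard-coded, and that a prefix keeps the parameters from clashing with those of other processors.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical generic processor: copies one field to another.
// Field names are read from prefixed parameters instead of being hard-coded.
class FieldCopyProcessor {
    static final String PREFIX = "copy.";

    private final String source;
    private final String dest;
    private final boolean overwrite;

    FieldCopyProcessor(Map<String, String> params) {
        this.source = required(params, PREFIX + "source");
        this.dest = required(params, PREFIX + "dest");
        // Optional parameter with a default value.
        this.overwrite = Boolean.parseBoolean(
            params.getOrDefault(PREFIX + "overwrite", "false"));
    }

    private static String required(Map<String, String> params, String name) {
        String value = params.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing parameter: " + name);
        }
        return value;
    }

    // Kept free of any server dependency so it can be unit tested directly.
    void process(Map<String, Object> doc) {
        if (!doc.containsKey(source)) {
            return;
        }
        if (overwrite || !doc.containsKey(dest)) {
            doc.put(dest, doc.get(source));
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("copy.source", "title");
        params.put("copy.dest", "title_sort");

        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "Improving the Solr Update Chain");
        new FieldCopyProcessor(params).process(doc);
        System.out.println(doc.get("title_sort"));
    }
}
```

Keeping `process` as a plain method on a map-like document is one way to address the "testable methods" bullet: the logic can be exercised without starting a server.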

Page 30: Improving the Solr Update Chain

Web crawl with Language Detection @ Oslo University

25

Page 31: Improving the Solr Update Chain

Solr @ Oslo University

26

Page 32: Improving the Solr Update Chain

Solr @ Oslo University

27

<?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[\s\u00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
    <str name="content.fl">content_no,content_en</str>
    <str name="contentType.field">tika_Content-Type</str>
    <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
  </processor>
  <processor class="no.uio.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="domainOutputField">host</str>
    <str name="canonicalUrlOutputField">canonicalurl</str>
  </processor>
  <processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">url</str>
    <str name="boostField">urlboost</str>
    <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
  </processor>
  <processor class="no.uio.update.processor.StaticRankProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">canonicalurl</str>
    <str name="anchorField">anchortext</str>
    <str name="rankField">docrank</str>
    <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
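
To make the second processor in the chain configuration above concrete: its `pattern` and `replacement` settings collapse runs of whitespace, including non-breaking spaces (U+00A0), into a single space. The snippet below reproduces just that substitution with `java.util.regex`. The class and method names are made up for illustration, and this is not the source of `RegexpReplaceProcessorFactory`, which the slides do not show.

```java
import java.util.regex.Pattern;

class WhitespaceNormalizer {
    // Same pattern as in the chain configuration: whitespace or U+00A0, one or more.
    private static final Pattern RUNS = Pattern.compile("[\\s\\u00A0]+");

    static String normalize(String value) {
        return RUNS.matcher(value).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Prints: Oslo University web crawl
        System.out.println(normalize("Oslo\u00A0\u00A0 University\n\tweb   crawl"));
    }
}
```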

Page 40: Improving the Solr Update Chain
Page 41: Improving the Solr Update Chain
Page 42: Improving the Solr Update Chain

Room for improvement?

32

Page 43: Improving the Solr Update Chain

Improvements

34

• Processors re-created for every request
• Duplication of config between chains
• No support for non-linear chains or sub chains
• Does not scale very well
• Lacks native scripting language support

Page 44: Improving the Solr Update Chain

Improvements

35

Pain:
• Potentially expensive initialization
• StaticRankProcessor: read & parse 50,000 lines

Proposed cure:
• Keep a persistent state object in the factory:
  private final Map<Object,Object> sharedObjCache
• Pass it to each new processor:
  new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache);
• Processor uses sharedObjCache for state
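
The proposed cure can be sketched roughly as follows. This is a plain-Java model rather than Solr code: the factory and processor classes are simplified stand-ins, the `rankTable` cache key is hypothetical, and the loader just builds a small map where the real processor would read and parse its rank file. A `ConcurrentHashMap` is used here as an assumption (the slide only says `Map<Object,Object>`) so that concurrent requests can share the cache safely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Simplified stand-in for a processor factory that outlives individual requests.
class StaticRankFactory {
    // Shared state lives in the factory, not in the per-request processor.
    private final Map<Object, Object> sharedObjCache = new ConcurrentHashMap<>();
    // Counts how often the expensive load actually runs (for demonstration).
    final AtomicInteger loads = new AtomicInteger();

    // Called once per request: cheap, because the heavy state is shared.
    StaticRankProcessor getInstance() {
        return new StaticRankProcessor(sharedObjCache, this::loadRanks);
    }

    // Stand-in for reading and parsing a large rank file.
    private Map<String, Integer> loadRanks() {
        loads.incrementAndGet();
        Map<String, Integer> ranks = new HashMap<>();
        ranks.put("http://www.uio.no/", 10);
        return ranks;
    }

    public static void main(String[] args) {
        StaticRankFactory factory = new StaticRankFactory();
        factory.getInstance();
        factory.getInstance();
        System.out.println("loads: " + factory.loads.get());
    }
}

class StaticRankProcessor {
    private final Map<String, Integer> ranks;

    @SuppressWarnings("unchecked")
    StaticRankProcessor(Map<Object, Object> sharedObjCache,
                        Supplier<Map<String, Integer>> loader) {
        // Load on first use only; later instances reuse the cached table.
        this.ranks = (Map<String, Integer>)
            sharedObjCache.computeIfAbsent("rankTable", key -> loader.get());
    }

    void process(Map<String, Object> doc) {
        Integer rank = ranks.get(doc.get("canonicalurl"));
        if (rank != null) {
            doc.put("docrank", rank);
        }
    }
}
```

In this model two processors are created but the load counter stays at 1, which is the effect the slide is after: per-request construction stays cheap while the parsed data is kept between requests.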

Page 46: Improving the Solr Update Chain

Improvements

37

Pain:
• Multiple chains often need identical processors
• UiO’s two chains share 80% -> copy/paste

Proposed cure:
• Allow sharing of named instances
• Define: <processor name="langid" class="..">
• Refer: <processor ref="langid" />

See SOLR-2823

Page 48: Improving the Solr Update Chain

Improvements

39

Pain:
• Chains are linear only
• Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
• New scriptable Update Chain - alternative to XML
• Script chain logic in solr/conf/updateproc.groovy
• Full flexibility:

chain myChain {
  if(doc.getFieldValue("type").equals("pdf"))
    process(tikaproc)
}
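
The Groovy snippet above is a proposal (SOLR-2841), not an existing feature. The same branching idea can be approximated in plain Java, as in this sketch. `Step`, `when` and the `TIKAPROC` stand-in are hypothetical names, and nothing here is Solr API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical building blocks for a non-linear chain.
interface Step {
    void process(Map<String, Object> doc);

    // Wraps a step so that it only runs when the condition holds.
    static Step when(Predicate<Map<String, Object>> condition, Step inner) {
        return doc -> {
            if (condition.test(doc)) {
                inner.process(doc);
            }
        };
    }
}

class BranchingDemo {
    // Stand-in for a Tika-based extraction step.
    static final Step TIKAPROC = doc -> doc.put("cv_text", "(extracted text)");

    // Rough equivalent of: if (type equals "pdf") process(tikaproc)
    static final Step MY_CHAIN =
        Step.when(doc -> "pdf".equals(doc.get("type")), TIKAPROC);

    public static void main(String[] args) {
        Map<String, Object> pdf = new HashMap<>();
        pdf.put("type", "pdf");
        MY_CHAIN.process(pdf);

        Map<String, Object> html = new HashMap<>();
        html.put("type", "html");
        MY_CHAIN.process(html);

        System.out.println(pdf.containsKey("cv_text") + " " + html.containsKey("cv_text"));
    }
}
```

Writing the comparison as `"pdf".equals(...)` avoids a null pointer error for documents that have no `type` field.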

Page 50: Improving the Solr Update Chain

Improvements

41

Pain:
• Single threaded
• Heavy processing not efficient

Proposed cure:
• Local: Use multi threaded update requests
• SolrCloud: Dedicated nodes, role=“processor” ?
• Wrap an external pipeline in UpdateProcessor
  Example: OpenPipelineUpdateProcessor ?
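
The first cure, feeding from several threads, might look like the sketch below. It uses only `java.util.concurrent`. `sendToSolr` is a hypothetical placeholder for whatever client call actually posts a document, and the pool size of 4 is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelFeeder {
    // Counts documents handed to the (placeholder) send call.
    static final AtomicInteger sent = new AtomicInteger();

    // Hypothetical placeholder: a real feeder would post the document to Solr here.
    static void sendToSolr(String docId) {
        sent.incrementAndGet();
    }

    // Submits one update task per document and waits for all of them to finish.
    static void feed(List<String> docIds, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String id : docIds) {
                pending.add(pool.submit(() -> sendToSolr(id)));
            }
            for (Future<?> f : pending) {
                f.get(); // surfaces any failure from the worker thread
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        feed(List.of("doc1", "doc2", "doc3", "doc4", "doc5"), 4);
        System.out.println("sent: " + sent.get());
    }
}
```

Each request still runs through the chain on a single thread on the server side. The gain comes from several requests being processed at the same time.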

Page 52: Improving the Solr Update Chain

Improvements

43

Pain:
• Not really a “problem” :-)
• Nice to write processors in Python, Groovy, JS...

Proposed cure:
• Now: Finish SOLR-1725: Script based Processor
• Later: Make scripts first-class processors

<processor script="myScript.py" />
or
<processor ref="myScript" />

Page 53: Improving the Solr Update Chain

One last thing...

44

Page 54: Improving the Solr Update Chain

New standalone framework?

45

• The UpdateChain is Solr specific
• Interest for a pure pipeline framework
• Search engine independent
• Scalable
• Rich pool of processors
• Several existing candidates

• Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing

Page 55: Improving the Solr Update Chain

Summary

46

Page 56: Improving the Solr Update Chain

Summary

• Document centric vs field centric processing
• UpdateChain is there - use it!
• Works well for most “light” cases
• Scaling issues, but caching config may help
• More processors welcome!

47

Page 58: Improving the Solr Update Chain

Extra

49

Page 60: Improving the Solr Update Chain

Calling out from UpdateChain

51

This is one way an external pipeline system can be integrated with Solr.

The main benefit of this approach is that you can continue to feed content with SolrJ, DIH or other Update Request Handlers.

Page 61: Improving the Solr Update Chain

Scaling with external pipeline

52

Here is a more advanced, distributed case, where one Solr node is dedicated to processing and the entry-point Solr node only dispatches the requests.

