Configuring Apache Solr for Thai Text Search Pairote Leelaphattarakij Inspica Co., Ltd. SolrStart Masterclass ‘from zero to hero’ June 29, 2014


Presentation for Solr Masterclass Bangkok, June 2014.


Page 1: Configuring Apache Solr for Thai Text Search

Configuring Apache Solr for Thai Text Search

Pairote Leelaphattarakij
Inspica Co., Ltd.

SolrStart Masterclass ‘from zero to hero’
June 29, 2014

Page 2: Configuring Apache Solr for Thai Text Search

Apache Solr

• Open source search platform written in Java.

• Supports the Thai language.

• Both indexing and querying use an analyzer chain to process text.
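
To illustrate the last point, here is a minimal sketch of a field type that declares separate analyzer chains for indexing and querying. The field type name text_th_split and the synonyms.txt file are assumptions made for illustration and do not come from the slides. When a single <analyzer> element is declared instead, the same chain is used for both phases.

```xml
<!-- Hypothetical field type: the name and the synonym file are illustrative only -->
<fieldType name="text_th_split" class="solr.TextField" positionIncrementGap="100">
  <!-- Chain applied to document text at index time -->
  <analyzer type="index">
    <tokenizer class="solr.ThaiTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Chain applied to the user's query text at search time -->
  <analyzer type="query">
    <tokenizer class="solr.ThaiTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```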

Page 3: Configuring Apache Solr for Thai Text Search

Analyzer Chain

Analyzer: Input Text → Tokenizer → Filter → Filter → Filter → Output Tokens

The tokenizer breaks text into tokens.

Filters examine the sequence of tokens and keep, transform, or discard them.
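
As a rough sketch of how such a chain might be written in schema.xml, the fragment below places one tokenizer ahead of two filters. The field type name text_chain_example and the stopwords.txt file name are assumptions for illustration; the factory classes are standard Solr components, and they run in the order listed.

```xml
<!-- Hypothetical field type showing one tokenizer followed by filters -->
<fieldType name="text_chain_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Step 1: the tokenizer splits the input text into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Step 2: a filter that transforms tokens by lowercasing them -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Step 3: a filter that discards tokens listed in the stop word file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```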

Page 4: Configuring Apache Solr for Thai Text Search

Built-in Components

org.apache.lucene.analysis.th

• ThaiAnalyzer

• ThaiWordFilter

• ThaiWordFilterFactory

• ThaiTokenizer

• ThaiTokenizerFactory

Page 5: Configuring Apache Solr for Thai Text Search

Analyzer

schema.xml

<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.th.ThaiAnalyzer"/>
</fieldType>

Use the built-in ThaiAnalyzer to break Thai text into tokens.

Page 6: Configuring Apache Solr for Thai Text Search

Filter

schema.xml

<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
  </analyzer>
</fieldType>

Use ThaiWordFilterFactory in combination with the standard tokenizer.

Page 7: Configuring Apache Solr for Thai Text Search

Tokenizer

schema.xml

<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ThaiTokenizerFactory"/>
  </analyzer>
</fieldType>

ThaiWordFilterFactory is deprecated; use ThaiTokenizerFactory instead.

Page 8: Configuring Apache Solr for Thai Text Search

Default Configuration (Solr 4.9)

schema.xml

<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ThaiTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
  </analyzer>
</fieldType>
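
A field type only takes effect once a field refers to it. The sketch below shows how fields might be declared against text_th in the same schema.xml. The field names title_th and body_th are hypothetical and do not come from the slides.

```xml
<!-- Hypothetical fields; only the type attribute refers to the text_th field type -->
<field name="title_th" type="text_th" indexed="true" stored="true"/>
<field name="body_th" type="text_th" indexed="true" stored="true"/>
```

Text indexed into these fields, and queries against them, would then pass through the text_th analyzer chain.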

Page 9: Configuring Apache Solr for Thai Text Search

Issues with built-in components

• Built-in components use java.text.BreakIterator to break Thai text into tokens. (Thai word breaking may not be supported by all JREs.)

• Poor accuracy of Thai word segmentation.

Page 10: Configuring Apache Solr for Thai Text Search

Solr Plugin from Inspica

• Provides highly accurate tokenization and other kinds of text analysis.

• Available as a Solr plugin (JAR files).

• Compatible with Solr versions 1.4 through 4.9.

98% accuracy, evaluated on the BEST09 corpus from the National Electronics and Computer Technology Center, Thailand.
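
Since the plugin ships as JAR files, Solr needs to be able to find them on its classpath. One common way to do this for any plugin is a lib directive in solrconfig.xml, sketched below. The directory path is an assumption for illustration; the slides do not say where the Inspica JARs should be placed.

```xml
<!-- solrconfig.xml: the directory is a hypothetical location for the plugin JARs -->
<lib dir="../../../contrib/inspica/lib" regex=".*\.jar"/>
```

Placing the JARs in the core's own lib directory is another option, in which case no lib directive should be needed.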

Page 11: Configuring Apache Solr for Thai Text Search

Minimal Schema Change

schema.xml

<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ThaiTokenizerFactory"/>
  </analyzer>
</fieldType>

Page 12: Configuring Apache Solr for Thai Text Search

Minimal Schema Change

schema.xml

<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.inspica.solr.analysis.th.ThaiTokenizerFactory" license="license.key"/>
  </analyzer>
</fieldType>
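
Because only the tokenizer class changes, the same swap could presumably be applied to the Solr 4.9 default chain as well. The sketch below assumes that the plugin's tokenizer works with the standard lowercase and stop word filters; the slides do not state this, so treat it as an illustration rather than a verified configuration.

```xml
<!-- Thai: sketch combining the plugin tokenizer with the default filters -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizer class and license attribute as given on the slide -->
    <tokenizer class="com.inspica.solr.analysis.th.ThaiTokenizerFactory" license="license.key"/>
    <!-- Filters carried over from the Solr 4.9 default text_th chain (assumed compatible) -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
  </analyzer>
</fieldType>
```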

Page 13: Configuring Apache Solr for Thai Text Search

Demo

Page 14: Configuring Apache Solr for Thai Text Search

Q&A

Page 15: Configuring Apache Solr for Thai Text Search

Thank you

Address: Inspica Team HQ, 71 Thanon Phaya Thai, Ratchathewi, Bangkok, 10400

Contact: [email protected]

Follow: facebook.com/inspica