Presentation for Solr Masterclass Bangkok, June 2014.
Configuring Apache Solr for Thai Text Search
Pairote Leelaphattarakij
Inspica Co., Ltd.
SolrStart Masterclass ‘from zero to hero’
June 29, 2014
Apache Solr
• Open source search platform written in Java.
• Supports Thai language.
• Both indexing and querying use an analyzer chain to process text.
Analyzer Chain
Analyzer: Input Text → Tokenizer → Filter → Filter → Filter → Output Tokens
A tokenizer breaks text into tokens.
Filters examine the sequence of tokens and keep, transform, or discard them.
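The chain described above can be sketched in plain Java. This is a conceptual illustration only, not the Lucene or Solr API: the class name AnalyzerChainSketch, the whitespace tokenizer, and the two filters are hypothetical stand-ins chosen to show how tokens flow from a tokenizer through a sequence of filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Conceptual sketch of an analyzer chain (hypothetical names, not the Lucene API).
public class AnalyzerChainSketch {

    // Stand-in tokenizer: breaks the input text into tokens on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter that transforms tokens: lowercases each one.
    static UnaryOperator<List<String>> lowerCase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    // Filter that discards tokens: drops any token found in the stop word set.
    static UnaryOperator<List<String>> stop(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopWords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    // The analyzer: one tokenizer followed by the filters, applied in order.
    static List<String> analyze(String text, List<UnaryOperator<List<String>>> filters) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "a"));
        List<String> tokens = analyze(
                "Apache Solr is a Search Platform",
                Arrays.asList(lowerCase(), stop(stopWords)));
        System.out.println(tokens); // prints [apache, solr, search, platform]
    }
}
```

Filter order matters in this sketch: the stop filter runs after lowercasing, so the stop word set only needs lowercase entries.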
Built-in Components
org.apache.lucene.analysis.th
• ThaiAnalyzer
• ThaiWordFilter
• ThaiWordFilterFactory
• ThaiTokenizer
• ThaiTokenizerFactory
Analyzer
Use the built-in ThaiAnalyzer to break Thai text into tokens.
schema.xml
<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.th.ThaiAnalyzer"/>
</fieldType>
Filter
Use ThaiWordFilterFactory in combination with the standard tokenizer.
schema.xml
<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ThaiWordFilterFactory"/>
  </analyzer>
</fieldType>
Tokenizer
ThaiWordFilterFactory is deprecated; use ThaiTokenizerFactory instead.
schema.xml
<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ThaiTokenizerFactory"/>
  </analyzer>
</fieldType>
Default Configuration (Solr 4.9)
schema.xml
<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ThaiTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_th.txt"/>
  </analyzer>
</fieldType>
Issues with built-in components
• Built-in components use java.text.BreakIterator to break Thai text into tokens (may not be supported by all JREs).
• Poor accuracy of Thai word segmentation.
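To see what the JRE's own word segmentation produces, java.text.BreakIterator can be called directly. The sketch below is a minimal illustration, assuming a word instance for the "th" locale; the class name ThaiBreakIteratorSketch and the sample phrase are hypothetical, and this is not the Solr component itself. No expected output is shown because the break positions depend on the Thai support of the JRE in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Minimal sketch: segment Thai text with the JRE's BreakIterator (hypothetical class name).
public class ThaiBreakIteratorSketch {

    // Returns the non-blank segments between consecutive word boundaries.
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String token = text.substring(start, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative sample phrase, roughly "Thai text search".
        String sample = "การค้นหาข้อความภาษาไทย";
        // On a JRE without Thai dictionary support, the phrase may come back as one long token.
        for (String token : segment(sample)) {
            System.out.println(token);
        }
    }
}
```

Running this on the target JRE before deployment is a quick way to check whether the built-in components can segment Thai text there at all.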
Solr Plugin from Inspica
• Provides highly accurate tokenization and other kinds of text analysis.
• Available as a Solr plugin (JAR files).
• Compatible with Solr from version 1.4 to 4.9.
Accuracy: 98%, evaluated on the BEST09 corpus from the National Electronics and Computer Technology Center, Thailand.
Minimal Schema Change
schema.xml
<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ThaiTokenizerFactory"/>
  </analyzer>
</fieldType>
Minimal Schema Change
Only the tokenizer element changes: it points to the Inspica tokenizer factory and carries a license attribute.
schema.xml
<!-- Thai -->
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.inspica.solr.analysis.th.ThaiTokenizerFactory" license="license.key"/>
  </analyzer>
</fieldType>
Demo
Q&A
Thank you
Address: Inspica Team HQ, 71 Thanon Phaya Thai, Ratchathewi, Bangkok, 10400
Follow: facebook.com/inspica