Schedule
• What is Full Text Search (FTS)?
• MySQL vs. Sphinx
• FTS Nuances
• Scaling Search
• Search Result Quality
• MySQL ❤ Sphinx
www.sphinxsearch.com
What is Full Text Search?
Technique for searching a collection of documents where each word in every document is matched against the search criteria.
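As a rough illustration of that definition, here is a minimal Python sketch of an inverted index with an implicit-AND search. The tokenizer, the function names, and the sample documents are all assumptions made for this example; real engines such as Sphinx or MySQL do far more.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a deliberately crude tokenizer)."""
    word, words = [], []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            words.append("".join(word))
            word = []
    if word:
        words.append("".join(word))
    return words

def build_index(docs):
    """Map every word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every query word (implicit AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "Sphinx is a full text search engine",
    2: "MySQL has full text indexes",
    3: "Search engines rank documents",
}
index = build_index(docs)
```

With these documents, `search(index, "full text")` returns `{1, 2}`, while `search(index, "search engine")` returns only `{1}`: document 3 says "engines", which is exactly the kind of miss that stemming and morphology address.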
FTS Nuances
• Ranking and Relevance
• Stemming and Morphology
• Keyword Expansion
• Punctuation and Delimiters
• Character Sets
Ranking
Computing a relevance value (or weight) for a given document.
Rank and relevance are ultimately subjective.
Ranking (Sphinx)
• SIMPLE: NONE, WORDCOUNT, FIELDMASK
• ADVANCED: BM25, PROXIMITY_BM25, SPH04
• LEGACY: PROXIMITY, MATCHANY
Ranking Speed (ms … less is more)
[Chart: ranking time in ms per ranker, axis 0 to 180, for NONE, BM25, MATCHANY, SPH04, PROXIMITY, PROXIMITY_BM25]
Advanced Ranking
Super Simple Formula:
PROXIMITY_BM25 = sum(lcs * user_weight) * 1000 + bm25
More Complicated:
OPTION RANKER = expr('3 * wlccs + 7 * atc + 1.2 * min_gaps + 0.4 * BM25f(1.2, 0.7, {title = 3}) + …')
State-of-the-art:
SELECT … myRank(packedFactors(), attr1, attr2, …)
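To make the "super simple" ranker concrete, here is a minimal Python sketch of the PROXIMITY_BM25 idea, following the form given in the Sphinx documentation: sum(lcs * user_weight) * 1000 + bm25. It is not Sphinx's implementation. The function names are hypothetical, `lcs` is approximated as the longest run of adjacent query words found verbatim in a field, and the BM25 part is taken as a precomputed integer assumed to lie in 0..999.

```python
def lcs_words(query_words, field_words):
    """Length, in words, of the longest run of adjacent query words
    that also appears verbatim (same order, adjacent) in the field."""
    best = 0
    for qi in range(len(query_words)):
        for fi in range(len(field_words)):
            k = 0
            while (qi + k < len(query_words)
                   and fi + k < len(field_words)
                   and query_words[qi + k] == field_words[fi + k]):
                k += 1
            best = max(best, k)
    return best

def proximity_bm25(query, fields, field_weights, bm25):
    """fields: {name: text}; field_weights: {name: int};
    bm25: precomputed integer score, assumed to be in 0..999."""
    q = query.lower().split()
    phrase_part = sum(
        lcs_words(q, text.lower().split()) * field_weights.get(name, 1)
        for name, text in fields.items()
    )
    # The phrase-proximity part dominates; BM25 only breaks ties.
    return phrase_part * 1000 + bm25

# Hypothetical document: the title holds the exact phrase, the body does not.
weight = proximity_bm25(
    "full text search",
    {"title": "full text search engine", "body": "search for text in full"},
    {"title": 3, "body": 1},
    bm25=500,
)
```

Here the title contributes lcs 3 × weight 3 and the body lcs 1 × weight 1, so `weight` is 10 × 1000 + 500 = 10500. The multiplier of 1000 is what lets any phrase-match difference outrank any BM25 difference.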
Advanced Ranking (In English)
• Lots of room for ranking tuning
• Some factors are fast, some are slow
• And we normally compute all of them
• (Much) slower, but (much) better search quality
• Overkill for 1M docs, vital for 1-10B docs
Stemming
“run” -> [“run”, “running”, “ran”, “runs”, “runner”, “runnable”]
“compute” -> [“compute”, “computer”, “computation”, “computers”, “computed”, “computing”]
• Morphology is hard
• (Sphinx can, MySQL cannot … but maybe in MySQL 7.1 … seriously)
• Can be emulated by wildcard queries and/or reverse query expansion
• big -> (big OR bigger OR … )
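The query-expansion workaround can be sketched in a few lines of Python. This is an illustration only: the `FORMS` table and the function names are made up for the example, and the output uses `|`, the OR operator of Sphinx's extended query syntax.

```python
# Hypothetical table of known word forms; a real system would derive this
# from a dictionary or from the indexed vocabulary.
FORMS = {
    "big": ["bigger", "biggest"],
    "run": ["running", "ran", "runs"],
}

def expand_keyword(keyword, forms):
    """Turn one keyword into an OR group of its known forms."""
    variants = [keyword] + [f for f in forms.get(keyword, []) if f != keyword]
    if len(variants) == 1:
        return keyword  # nothing to expand
    return "(" + " | ".join(variants) + ")"

def expand_query(query, forms=FORMS):
    """Expand every keyword of a whitespace-separated query."""
    return " ".join(expand_keyword(word, forms) for word in query.split())

expanded = expand_query("big run fast")
```

`expanded` is `(big | bigger | biggest) (run | running | ran | runs) fast`. The cost is a longer query with more keywords to evaluate, which is one reason index-time morphology is usually preferable when the engine supports it.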
Morphology
Stemming’s #1 Problem? Precision
• runner -> run (?)
• walker -> walk (?)
• busy -> busi (!)
• business -> busi (!)
• octopii -> ??? (?!?!?!)
Sphinx 2.2 has lemmatizers for that (lemmatize_en)
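The precision problem is easy to reproduce with a toy suffix stripper. The rules below are invented for this example and are not the Porter algorithm or anything Sphinx ships; they are just enough to show unrelated words collapsing onto one stem.

```python
# Toy suffix rules (suffix, replacement), tried in order. Purely illustrative.
SUFFIX_RULES = [("ness", ""), ("ner", ""), ("er", ""), ("y", "i")]

def toy_stem(word):
    """Apply the first matching rule, keeping at least three leading characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

stems = {w: toy_stem(w) for w in ["runner", "walker", "busy", "business"]}
```

`stems` maps "runner" to "run" and "walker" to "walk", but both "busy" and "business" to "busi", so a search for one now matches the other. A lemmatizer avoids this by looking words up in a dictionary instead of chopping suffixes.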
Morphology
• Sphinx 2.2 lemmatizers to the rescue (again)
• morphology = lemmatize_en_all
• Reasonable index size growth
• Russian: +/- 15-20%
• English: Less (but currently untested)
Punctuation & Delimiters
• How do we even handle a mere… punctuation sign?
• Say, is foo-bar a single word or two words?
• The cold war arch-rivals were the U.S.A. and U.S.S.R. And this is another sentence.
• However, as any J. Doe would confirm, a capital-dot-space-capital sequence is not the end.
• blend_chars, index_sp, and a few more tricks…
Character Sets
• How do we even handle a mere… character?
• In English, N == n == ñ == Ň (come on, we ain’t even got dem keyboards)
• Is that so in French? Spanish? Legalese?
• Are the “simple” ASCII7 Latin characters “safe”?
• What about non-Latin characters? Or, uh, CJK?
• Sphinx way, charset_table = all known chars, and their mappings
• (for case folding, or accent removal)
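The folding that a character mapping table expresses can be approximated in Python with the standard `unicodedata` module. This is a sketch of the idea (case folding plus accent removal), not of how charset_table itself works; the function name `fold` is an assumption for this example.

```python
import unicodedata

def fold(text):
    """Case-fold and strip combining accents, e.g. 'Ň' -> 'n'."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

folded = [fold(ch) for ch in ["N", "n", "ñ", "Ň"]]
```

All four entries of `folded` are `"n"`, which is the English-centric view from the slide. Whether that is correct depends on the language: in Spanish, ñ and n are different letters, so a blanket fold like this trades precision for recall.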
FTS Is Hard! (and profitable)
• Sphinx
• MySQL
• Solr
• Lucene
• ElasticSearch
• MarkLogic
• Brainware
• Inktomi
• BaseX
• KinoSearch
• Xapian
• Google <— LOL
Sphinx vs. MySQL
Feature                              Sphinx              MySQL
Ranking                              Many                1 (“natural”)
Character Sets                       N classes, config   1 class, collation
Stemming & Morphology                Yes                 No
Punctuation & Delimiters             Yes                 No
Wordforms, zones, exceptions, etc.   Yes                 Wat?
Some Quick Numbers
• QPS: More is better
• Fulltext Natural Mode
• 4M Rows
• InnoDB FTS took >1hr on all other tests, so it was killed
• MyISAM FTS took >1hr when the score was included, so it was killed
[Chart: QPS, axis 0 to 900, for Sphinx, InnoDB*, MyISAM]
* These results are more skewed than expected. More benchmarking is required.
Search Result Quality
• Not just ranking (sometimes, not even ranking!)
• Index Time, Data Cleanup & Preparation
• charset_table, blend_chars, regexp_filter
• wordforms, exceptions, morphology …
• Search Time, Query Rewrites & Retries
• Lots of keywords, but no matches
• Quorum, Typo Fixes, Keyboard Layout Fixes …
Search Result Quality
• blend_chars (for magic characters)
• AT&T -> AT&T, AT, T
• Procter&Gamble -> Procter&Gamble, Procter, Gamble
• Exceptions
• C++ -> cplusplus, c++ -> cplusplus
• wordforms (for pre- or post-morph fun)
• tuna => fish, feline => cat, Core 2 Duo => c2d, St John => stjohn
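The blend_chars examples above can be mimicked with a short Python sketch: a token containing a blended character is emitted whole and also split into its parts. This only imitates the outcome shown on the slide. The regular expressions and the function name are assumptions, and the sketch ignores case folding, positions, and blend modes.

```python
import re

def blended_tokens(text, blend_chars="&"):
    """Emit each token as a whole and, if it contains a blended
    character, also emit the parts on either side of it."""
    blend_class = "[" + re.escape(blend_chars) + "]"
    token_pattern = "[A-Za-z0-9" + re.escape(blend_chars) + "]+"
    tokens = []
    for chunk in re.findall(token_pattern, text):
        parts = [p for p in re.split(blend_class, chunk) if p]
        if not parts:
            continue  # chunk was made of blended characters only
        if parts != [chunk]:
            tokens.append(chunk)  # the blended form, e.g. "AT&T"
        tokens.extend(parts)      # the plain parts, e.g. "AT", "T"
    return tokens

att = blended_tokens("AT&T")
pg = blended_tokens("Procter&Gamble rocks")
```

`att` is `["AT&T", "AT", "T"]` and `pg` is `["Procter&Gamble", "Procter", "Gamble", "rocks"]`, so a query for either the blended or the plain form can find the document.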
Search Result Quality
• Quorum Operator
• "So many keywords they will never all match"/3
• "So many keywords they will never all match"/0.35
• Typo Corrections (Sphynx -> Sphinx)
• /misc/suggest
• Keyboard Layout
• Ghbdtn всем => привет всем (“hello everyone”, typed with the wrong keyboard layout)
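Two of these query-time fixes are simple enough to sketch in Python. Everything here is illustrative: the function names are made up, the quorum check assumes a fractional threshold is rounded up to at least one keyword (the engine's own rounding may differ), and the layout table assumes the standard QWERTY and ЙЦУКЕН key positions.

```python
import math

def quorum_match(query_words, doc_words, threshold):
    """True if enough distinct query words occur in the document.
    threshold >= 1 is an absolute count; 0 < threshold < 1 is a fraction."""
    distinct = set(query_words)
    matched = len(distinct & set(doc_words))
    if 0 < threshold < 1:
        needed = max(1, math.ceil(threshold * len(distinct)))  # assumed rounding
    else:
        needed = int(threshold)
    return matched >= needed

# Same physical keys, two layouts: Latin QWERTY and Russian ЙЦУКЕН.
LATIN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбю"
LAYOUT = dict(zip(LATIN, CYRILLIC))
LAYOUT.update({l.upper(): c.upper() for l, c in zip(LATIN, CYRILLIC) if l.isalpha()})

def fix_layout(text):
    """Re-map characters typed on the Latin layout to the Cyrillic one;
    characters outside the table pass through unchanged."""
    return "".join(LAYOUT.get(ch, ch) for ch in text)

query = "so many keywords they will never all match".split()
hit = quorum_match(query, "so many keywords".split(), 3)
miss = quorum_match(query, "many keywords".split(), 0.35)
fixed = fix_layout("Ghbdtn всем")
```

With eight query keywords, both `/3` and `/0.35` require three matches under the rounding assumed here, so `hit` is `True` and `miss` is `False`. `fixed` is `"Привет всем"`. A retry like this would typically run only after the original query returns zero results.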
Quick Wins
• Indexing time: MySQL
• DROP FULLTEXT Indexes
• Indexing time: Sphinx
• Increase mem_limit (256M - 2047M)
• Joined fields (vs. JOIN || GROUP_CONCAT)
• Dense IDs, store sparse DB IDs as attrs
• Re-order Sphinx fields (biggest first)
• wordforms, exceptions, lemmatizers, and other fun
Quick Wins
• Search Time, Quality
• Try Quorum, avoid zero results
• Typo/Layout fixes, avoid zero results
• Try a different ranker (SPH04, EXPR, or …)
• Search Time, Performance
• Faster Rankers (NONE or BM25)
• stopwords and/or hitless_words
• safety nets, max_query_time, max_predicted_time
• Virtual keywords
• UPGRADE
MySQL ❤ Sphinx
• Complement NOT compete
• Solutions to different problems
• Separation of concerns