MySQL & Sphinx Andrew Aksyonoff & Ryan Lowe

MySQL & Sphinx - Percona 11...


Schedule

• What is Full Text Search (FTS)?

• MySQL vs. Sphinx

• FTS Nuances

• Scaling Search

• Search Result Quality

• MySQL ❤ Sphinx

www.sphinxsearch.com

What is Full Text Search?

Technique for searching a collection of documents where each word in every document is matched against the search criteria.


You mean grep?

FTS Nuances

• Ranking and Relevance

• Stemming and Morphology

• Keyword Expansion

• Punctuation and Delimiters

• Character Sets


Ranking

Computing a relevance value (or weight) for a given document.

Rank and relevance are ultimately subjective.


Sinatra

Ranking (Sphinx)

• Simple: NONE, WORDCOUNT, FIELDMASK

• Advanced: BM25, PROXIMITY_BM25, SPH04

• Legacy: PROXIMITY, MATCHANY

Ranking Speed (ms … less is more)

[Chart: time in ms (axis 0 to 180) for rankers NONE, BM25, MATCHANY, SPH04, PROXIMITY, PROXIMITY_BM25]

Advanced Ranking

Super Simple Formula:

PROXIMITY_BM25 = sum(lcs * 1000 + BM25)

More Complicated:

OPTION RANKER = expr('3 * wlccs + 7 * atc + 1.2 * min_gaps + 0.4 * BM25f(1.2, 0.7, {title = 3}) + …')

State-of-the-art:

SELECT … myRank(packedFactors(), attr1, attr2, …)
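To make the "super simple" formula concrete, here is a minimal Python sketch. It assumes the per-field lcs values and the BM25 score are already computed. The function name, field names, and numbers are hypothetical, and BM25 is treated as a single per-document term added once, which is one reading of the formula rather than something the slide spells out.

```python
def proximity_bm25(lcs_by_field, bm25):
    """Combine per-field phrase proximity (lcs) with a BM25 score.

    Assumption: bm25 is one document-level value below 1000, so one extra
    word of phrase match outweighs any difference in BM25.
    """
    return sum(lcs * 1000 for lcs in lcs_by_field.values()) + bm25


# Hypothetical documents scored against the same two-keyword query.
exact_phrase = proximity_bm25({"title": 2, "body": 1}, bm25=410)
scattered = proximity_bm25({"title": 1, "body": 1}, bm25=780)
print(exact_phrase > scattered)
```

The document with the longer phrase match wins even though its BM25 score is lower, which is the whole point of putting lcs in the thousands.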


Advanced Ranking (In English)


• Lots of room for ranking tuning

• Some factors are fast, some are slow

• And we normally compute all of them

• (Much) slower, but (much) better search quality

• Overkill for 1M docs, vital for 1-10B docs


Stemming

“run” -> [“run”, “running”, “ran”, “runs”, “runner”, “runnable”]

“compute” -> [“compute”, “computer”, “computation”, “computers”, “computed”, “computing”]

• Morphology is hard

• (Sphinx can, MySQL cannot … but maybe in MySQL 7.1 … seriously)

• Can be emulated by wildcard queries and/or reverse query expansion

• big -> (big OR bigger OR … )
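The query-expansion workaround can be sketched in a few lines of Python. This is only an illustration: the word-form table is hand-made and hypothetical, and the function names are not part of MySQL or Sphinx.

```python
def expand_keyword(keyword, forms):
    """Rewrite one keyword as an OR group over its known word forms."""
    variants = [keyword] + [f for f in forms.get(keyword, []) if f != keyword]
    if len(variants) == 1:
        return keyword
    return "(" + " OR ".join(variants) + ")"


def expand_query(query, forms):
    """Expand every whitespace-separated keyword in the query."""
    return " ".join(expand_keyword(word, forms) for word in query.split())


# Hypothetical hand-made table of word forms.
FORMS = {"big": ["bigger", "biggest"], "run": ["running", "ran", "runs"]}
print(expand_query("big dog", FORMS))  # (big OR bigger OR biggest) dog
```

The cost is that the table has to be maintained by hand, which is exactly the part a real stemmer or lemmatizer takes over.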


Morphology

Stemming’s #1 Problem? Precision

• runner -> run ?

• walker -> walk ?

• busy -> busi !

• business -> busi !

• octopii -> ??? ?!?!?!

Sphinx 2.2 has lemmatizers for that (lemmatize_en)

Morphology

Stemming’s #2 Problem? Ambiguity

dove (the bird, or the past tense of “dive”?)

Morphology

• Sphinx 2.2 lemmatizers to the rescue (again)

• morphology = lemmatize_en_all

• Reasonable index size growth

• Russian: +/- 15-20%

• English: Less (but currently untested)


Keyword Expansion

running -> (running | *running* | =running)
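As a rough illustration of the rewrite above, the Python sketch below expands each keyword into the same three alternatives: the keyword itself, a substring wildcard, and the exact-form match. The function name is hypothetical, and the sketch only shows the shape of the rewritten query, not how Sphinx implements it.

```python
def expand_keywords(query):
    """Expand each bare keyword into (kw | *kw* | =kw)."""
    expanded = []
    for kw in query.split():
        expanded.append(f"({kw} | *{kw}* | ={kw})")
    return " ".join(expanded)


print(expand_keywords("running"))  # (running | *running* | =running)
```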


Punctuation & Delimiters

• How do we even handle a mere… punctuation sign?

• Say, is foo-bar a single word or two words?

• The cold war arch-rivals were the U.S.A. and U.S.S.R. And this is another sentence.

• However, as any J. Doe would confirm, a capital-dot-space-capital sequence is not the end.

• blend_chars, index_sp, and a few more tricks…

Character Sets

• How do we even handle a mere… character?

• In English, N == n == ñ == Ň (come on, we ain’t even got dem keyboards)

• Is that so in French? Spanish? Legalese?

• Are the “simple” ASCII7 Latin characters “safe”?

• What about non-Latin characters? Or, uh, CJK?

• Sphinx way, charset_table = all known chars, and their mappings

• (for case folding, or accent removal)
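The folding idea (N == n == ñ == Ň) can be illustrated with Python's standard unicodedata module. This is a sketch of the concept only: charset_table is an explicit mapping in the Sphinx config, whereas this example leans on Unicode decomposition, so it only handles letters whose accents decompose into combining marks.

```python
import unicodedata


def fold(text):
    """Lowercase text and strip combining accents (e.g. Ň -> n)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print({fold(ch) for ch in "NnñŇ"})  # {'n'}
```

Whether such folding is desirable depends on the language: it suits English search, but in Spanish ñ and n are distinct letters.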

You mean grep?

NO

FTS Is Hard! (and profitable)

• Sphinx

• MySQL

• Solr

• Lucene

• ElasticSearch

• MarkLogic

• Brainware

• Inktomi

• BaseX

• KinoSearch

• Xapian

• Google <— LOL


Sphinx vs. MySQL

Feature | Sphinx | MySQL

Ranking | Many | 1 (“natural”)

Character Sets | N classes, config | 1 class, collation

Stemming & Morphology | Yes | No

Punctuation & Delimiters | Yes | No

Wordforms, zones, exceptions, etc. | Yes | Wat?

Some Quick Numbers

• QPS: More is better

• Fulltext Natural Mode

• 4M Rows

• InnoDB FTS took >1hr on all other tests, so was killed

• MyISAM FTS took >1hr when score was included, so was killed

[Chart: QPS (axis 0 to 900) for Sphinx, InnoDB*, MyISAM]

* These results are more skewed than expected. More benchmarking is required.

Scaling Search (MySQL)


Scaling Search (Sphinx)


Search Result Quality

(Fiddling with knobs)


Search Result Quality

• Not just ranking (sometimes, not even ranking!)

• Index Time, Data Cleanup & Preparation

• charset_table, blend_chars, regexp_filter

• wordforms, exceptions, morphology …

• Search Time, Query Rewrites & Retries

• Lots of keywords, but no matches

• Quorum, Typo Fixes, Keyboard Layout Fixes …
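A search-time retry loop for the "lots of keywords, but no matches" case might look like the Python sketch below. It assumes a hypothetical `search` callable that takes a query string and returns a list of matches. The quorum thresholds (0.7, then 0.35) are illustrative choices, not recommendations from the talk.

```python
def search_with_retries(keywords, search):
    """Try a strict query first, then relax it with quorum thresholds.

    `search` is a hypothetical callable: query string -> list of matches.
    """
    phrase = " ".join(keywords)
    attempts = [phrase, f'"{phrase}"/0.7', f'"{phrase}"/0.35']
    for query in attempts:
        matches = search(query)
        if matches:
            return query, matches
    # Every attempt came back empty; report the most relaxed query tried.
    return attempts[-1], []
```

Returning the query that finally matched makes it easy to log how often the strict form fails, which is useful input for the index-time cleanup listed above.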


Search Result Quality

• blend_chars (for magic characters)

• AT&T -> AT&T, AT, T

• Procter&Gamble -> Procter&Gamble, Procter, Gamble

• Exceptions

• C++ -> cplusplus, c++ -> cplusplus

• wordforms (for pre- or post-morph fun)

• tuna => fish, feline => cat, Core 2 Duo => c2d, St John => stjohn
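The effect of blend_chars on tokenization can be emulated in Python as follows. This is a sketch of the behaviour shown in the AT&T example, not Sphinx's tokenizer. The function name is hypothetical, and the input is assumed to be split on whitespace only.

```python
import re


def blended_tokens(text, blend_chars="&"):
    """Emit each blended token whole and also as its separated parts."""
    splitter = re.compile("[" + re.escape(blend_chars) + "]")
    tokens = []
    for raw in text.split():
        tokens.append(raw)
        parts = [p for p in splitter.split(raw) if p]
        if parts != [raw]:
            tokens.extend(parts)
    return tokens


print(blended_tokens("AT&T"))            # ['AT&T', 'AT', 'T']
print(blended_tokens("Procter&Gamble"))  # ['Procter&Gamble', 'Procter', 'Gamble']
```

Indexing both the whole and the parts is what lets a query for either "AT&T" or plain "AT" find the document.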


Search Result Quality

• Quorum Operator

• "So many keywords they will never all match"/3

• "So many keywords they will never all match"/0.35

• Typo Corrections (Sphynx -> Sphinx)

• /misc/suggest

• Keyboard Layout

• Ghbdtn всем => привет всем
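The quorum operator's matching rule can be sketched in Python as below. The sketch assumes simple whitespace tokenization, and it rounds a fractional threshold up to the next whole keyword. That rounding is an assumption made for the example, not something confirmed against Sphinx.

```python
import math


def quorum_match(doc_words, keywords, threshold):
    """Return True if enough distinct query keywords occur in the document.

    threshold >= 1 is an absolute keyword count; 0 < threshold < 1 is a
    fraction of the distinct query keywords (rounded up, an assumption).
    """
    unique = set(keywords)
    if threshold >= 1:
        needed = int(threshold)
    else:
        needed = max(1, math.ceil(threshold * len(unique)))
    found = len(unique & set(doc_words))
    return found >= needed


query = "so many keywords they will never all match".split()
print(quorum_match("so many documents will match".split(), query, 3))  # True
print(quorum_match("never mind".split(), query, 0.35))                 # False
```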


Quick Wins

• Indexing time: MySQL

• DROP FULLTEXT Indexes

• Indexing time: Sphinx

• Increase mem_limit (256M - 2047M)

• Joined fields (vs. JOIN || GROUP_CONCAT)

• Dense IDs, store sparse DB IDs as attrs

• Re-order Sphinx fields (biggest first)

• wordforms, exceptions, lemmatizers, and other fun


Quick Wins

• Search Time, Quality

• Try Quorum, avoid zero results

• Typo/Layout fixes, avoid zero results

• Try a different ranker (SPH04, EXPR, or …)

• Search Time, Performance

• Faster Rankers (NONE or BM25)

• stopwords and/or hitless_word

• safety nets, max_query_time, max_predicted_time

• Virtual keywords

• UPGRADE
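One of the search-time fixes listed above, the keyboard layout fix, is easy to sketch in Python. The example assumes the standard QWERTY and ЙЦУКЕН layouts and a hypothetical function name. Since it also remaps "," and ".", it is best applied only as a retry when the original query returns nothing.

```python
# Same physical keys, row by row, on the QWERTY and ЙЦУКЕН layouts.
EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU = "йцукенгшщзхъфывапролджэячсмитьбю"

LAYOUT = dict(zip(EN, RU))
# Add uppercase variants for letter keys only; punctuation keys stay as-is.
LAYOUT.update({en.upper(): ru.upper() for en, ru in zip(EN, RU) if en.isalpha()})
TABLE = str.maketrans(LAYOUT)


def fix_layout(text):
    """Re-type text as if the Russian layout had been active."""
    return text.translate(TABLE)


print(fix_layout("Ghbdtn всем"))  # Привет всем
```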


MySQL ❤ Sphinx

• Complement NOT compete

• Solutions to different problems

• Separation of concerns
