Oktavia Search Engine - pyconjp2014


DeNA Co, Ltd. Yoshiki Shibukawa

9/14/2014 PyConJP

• Yoshiki Shibukawa
  • Works for DeNA Co., Ltd.
  • @shibu_jp (Twitter)
  • yoshiki.shibukawa (Facebook)
  • yoshiki@shibu.jp (mail)

• Languages
  • C/C++, Python, JavaScript

• Founder of sphinx-users.jp
• San Francisco -> Tokyo

• The Basics of Existing Search Engines
• The Structure of Oktavia
• Oktavia API Examples

• In some cases, an inverted index is not a good fit for Eastern Asian languages.

• FM-index is a completely different search algorithm.

• I published a new PyPI module yesterday.
  • It includes only the essential part of Oktavia.
  • I will add more features.

AM.txt (0)

• Good morning

• Hi

PM.txt (1)

• Good afternoon

• Good evening

• Hi

Word         Document ID
Good         0, 1
Morning      0
Afternoon    1
Evening      1
Hi           0, 1

• Word -> Document
• Split the query string into words, look up each word in the table, and show the result.

Good Morning → (0, 1) and (0,) → (0,)
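The lookup above can be sketched in a few lines of Python. This is only an illustration of a generic inverted index, not code from Oktavia; the function names `build_inverted_index` and `search` are made up for this example, and words are lowercased so that "Morning" matches "morning".

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, lines in documents.items():
        for line in lines:
            for word in line.lower().split():
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*hits))


documents = {
    0: ["Good morning", "Hi"],                     # AM.txt
    1: ["Good afternoon", "Good evening", "Hi"],   # PM.txt
}
index = build_inverted_index(documents)
print(search(index, "Good Morning"))  # [0]
```

Splitting on whitespace is the step that works for English and fails for text written without spaces.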

• It is nice weather to go out to PyConJP. English

• 这是不错的天气出去PyConJP Chinese

• 今日はPyConJPに出かけるにはいい天気ですね Japanese

• 그것은 PyConJP 에 외출 좋은 날씨 입니다 Korean※

※Korean puts spaces between groups of words, but not between every word.

今日はPyConJPに出かけるにはいい天気ですね

今日|は|PyConJP|に|出かける|に|は|いい|天気|です|ね

• Split the text into words using a natural language processor such as ChaSen, MeCab, or Kuromoji.

• This needs deep knowledge of each language and a big dictionary.

Word        Doc ID
今日        0
は          0, 0
PyConJP     0
に          0, 0
出かける    0
いい        0
天気        0
です        0
ね          0

• The document becomes a list of words, so it can use the same inverted index backend.

• The same word splitter is needed both when creating the index and when searching.

• Split a query word into fixed-length strings, then search for each chunk.

• Use each chunk as a word.

• 2-gram: こんにちは → こん|んに|にち|ちは

• 3-gram: こんにちは → こんに|んにち|にちは
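The chunking step is a sliding window over the text. A minimal sketch in Python (the helper name `ngrams` is chosen for this example only):

```python
def ngrams(text, n):
    """Return every overlapping chunk of length n, in order of position."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print("|".join(ngrams("こんにちは", 2)))  # こん|んに|にち|ちは
print("|".join(ngrams("こんにちは", 3)))  # こんに|んにち|にちは
```

A text shorter than `n` yields no chunks at all, which is why words shorter than the chunk size cannot be indexed this way.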

Word    (Doc ID, Position)
こん    (0, 0)
んに    (0, 1)
にち    (0, 2)
ちは    (0, 3)

• It can still use an inverted index algorithm.

• The index file becomes big.

• It can't handle words shorter than the chunk size.

こんにちは → こん / んに / にち / ちは → (0, 0) / (0, 1) / (0, 2) / (0, 3) → (0, 0)
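The search above only succeeds when the chunks appear at consecutive positions in the same document. The following sketch shows one way to check that with a positional 2-gram index. It is an assumption-laden illustration, not Oktavia code; `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return every overlapping chunk of length n, in order of position."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(documents, n=2):
    """Map each chunk to the set of (doc ID, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for pos, chunk in enumerate(ngrams(text, n)):
            index[chunk].add((doc_id, pos))
    return index


def search(index, query, n=2):
    """Return the (doc ID, position) pairs where the whole query starts."""
    chunks = ngrams(query, n)
    if not chunks:
        return []  # a query shorter than the chunk size cannot be looked up
    # The query starts at (doc, pos) when chunk k is found at (doc, pos + k).
    candidates = index.get(chunks[0], set())
    return sorted(
        (doc, pos)
        for doc, pos in candidates
        if all((doc, pos + k) in index.get(chunk, set())
               for k, chunk in enumerate(chunks))
    )


index = build_index(["こんにちは"])
print(search(index, "こんにちは"))  # [(0, 0)]
```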

Inverted Index

Language type            Method                     Pros                                    Cons
Has spaces               Split document by space    Simple                                  Spaces are needed
Eastern Asian language   N-gram                     Still simple                            Index becomes huge
Eastern Asian language   NLP                        Works perfectly with Asian languages    NLP processor and dictionary are needed

• It provides a search engine for the browser.
  • Inverted index

• It didn't support Japanese.
  • I sent some patches.
  • But they were not enough…

• Developed by…
  • Paolo Ferragina
  • Giovanni Manzini

• FM-index is not popular in western countries.
  • It is completely different from the existing algorithms.
  • The existing algorithms are enough for western languages.
• It is popular in genome analysis.

• I made a new search engine using this algorithm.

Estimated Time: 15min

• The search engine works in a web browser.
• Written in Python and JSX (an altJS made by DeNA; see http://jsx.github.io/).

• It uses FM-index as its backend search algorithm.

• JSX is similar to ActionScript 3
  • Class statement (no prototype!)
  • Strict type checking
  • No “this” hell
  • Performance optimization

• FM-index is the fastest algorithm that uses a compressed index file.

• FM-index doesn't need word splitting.
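To show why no word splitting is needed, here is a toy FM-index in Python. It is a teaching sketch under simplifying assumptions, not Oktavia's implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and the full suffix array is kept to report positions. Production FM-indexes use compressed structures instead. The names `build_fm_index` and `find` are invented for this example.

```python
def build_fm_index(text, sentinel="\0"):
    """Build a suffix array, the C table, and an occurrence table for text."""
    s = text + sentinel  # the sentinel must sort before every other character
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    bwt = "".join(s[i - 1] for i in sa)  # Burrows-Wheeler transform
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in s that sort before ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key, row in occ.items():
            row[i + 1] = row[i] + (1 if key == ch else 0)
    return sa, c_table, occ


def find(pattern, sa, c_table, occ):
    """Backward search: return the sorted start positions of pattern."""
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


fm = build_fm_index("今日はPyConJPに出かけるにはいい天気ですね")
print(find("には", *fm))  # [15]
print(find("に", *fm))    # [10, 15]
```

The query is matched character by character from its last character to its first, so any substring can be found regardless of where word boundaries fall.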

• Oktavia adds extra information
  • It adds region information to the source text.

• You can add as much metadata as you like.
  • Section (documents and sections)
  • Block (code block and so on)
  • Splitter (word splitter)
  • Table (rows and columns)

Ep4.txt: Use the Force, Luke.

Ep5.txt: No, I am your father.
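One way to picture region metadata: all source files are concatenated into a single text, and the start offset of each section is recorded so that a hit position can be mapped back to its file. The sketch below illustrates only that idea. The class `SectionMap` and its methods are hypothetical and are not Oktavia's metadata API, and a plain `str.find` stands in for the FM-index search.

```python
from bisect import bisect_right


class SectionMap:
    """Concatenated source text plus the start offset of each section."""

    def __init__(self):
        self.text = ""
        self.starts = []  # start offset of each section, in ascending order
        self.names = []   # section name for the offset at the same index

    def add(self, name, content):
        """Append a section and remember where it starts."""
        self.starts.append(len(self.text))
        self.names.append(name)
        self.text += content

    def section_at(self, position):
        """Return the name of the section that contains position."""
        return self.names[bisect_right(self.starts, position) - 1]


source = SectionMap()
source.add("Ep4.txt", "Use the Force, Luke.")
source.add("Ep5.txt", "No, I am your father.")

position = source.text.find("father")  # stand-in for an FM-index hit
print(source.section_at(position))     # Ep5.txt
```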

[Diagram: the CLI tool and the browser search program. Steps shown: Read Source, Generate Index, File API, Read Index, File API, Show Search Result.]

• I published it yesterday.
• It supports Python 2.6, 2.7, 3.3, and 3.4.

• Use the Oktavia API to implement a search feature in your application.

• Build the JSX version

• web/bin/oktavia-jquery-ui.js and web/bin/oktavia-web-runtime.js are the important files.

$ git clone git@github.com:shibukawa/oktavia.git
$ cd oktavia
$ npm install
$ ./node_modules/.bin/grunt build

• Creating the index
  • Dump an index file in base64 encoding and create a file in the following style:

    var searchIndex = 'aGVsbG8gd29ybGQ…..=';

  • Concatenate it with the JSX web search runtime (web/bin/oktavia-web-runtime.js).

• Add web/bin/oktavia-jquery-ui.js to your website.
  • It reads the index and the runtime on a WebWorker, sends requests, and shows the results.
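The dump step could look roughly like this in Python. This is a sketch using only the standard `base64` module; the function name `index_to_js` is an assumption for illustration, and the bytes `b"hello world"` merely stand in for a real binary index.

```python
import base64


def index_to_js(index_bytes, variable="searchIndex"):
    """Wrap binary index data in a JavaScript variable assignment."""
    encoded = base64.b64encode(index_bytes).decode("ascii")
    return "var {0} = '{1}';\n".format(variable, encoded)


# Placeholder payload; a real index would be the binary output of the indexer.
print(index_to_js(b"hello world"), end="")
# var searchIndex = 'aGVsbG8gd29ybGQ=';
```

Base64 output never contains quote characters, so the encoded string can be placed between single quotes without escaping.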

Estimated Time: 23min

• Oktavia provides APIs for building a better search engine of your own.

• The most important part of the user experience is the adjustment of scoring (sorting and filtering).

• In some cases, “not available” is important information for the user, but in other cases it is just noise.

• I want to buy a bottle of wine as a gift!
  • Cabernet Sauvignon [Sold Out], from France
  • Pinot noir [Sold Out], from Chile
  • Zinfandel [Sold Out], from USA

Photo by Josh Kenzer under CC-NC-SA

• I want to buy “My Little Pony DVD”!
  • Season One $32
  • Season Two $32
  • Season Three [Sold out]
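The two shopping examples above suggest that the treatment of sold-out items should be adjustable per application. A minimal sketch of such an adjustment follows. The `rank` function, the `sold_out` policy names, the result fields, and the scores are all invented for illustration and are not part of Oktavia.

```python
def rank(results, sold_out="demote"):
    """Order results by score; sold_out is 'keep', 'demote', or 'hide'."""
    if sold_out == "hide":
        # Gift shopping: unavailable items are just noise.
        results = [r for r in results if r["in_stock"]]
    if sold_out == "demote":
        # Available items first, then by descending score within each group.
        return sorted(results, key=lambda r: (not r["in_stock"], -r["score"]))
    return sorted(results, key=lambda r: -r["score"])


dvds = [
    {"title": "Season One", "score": 0.9, "in_stock": True},
    {"title": "Season Two", "score": 0.8, "in_stock": True},
    {"title": "Season Three", "score": 0.95, "in_stock": False},
]
print([r["title"] for r in rank(dvds, sold_out="demote")])
# ['Season One', 'Season Two', 'Season Three']
```

With `sold_out="keep"` the collector looking for a specific season still sees that Season Three exists, while `sold_out="hide"` suits the gift shopper who only wants something that can be bought today.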

• Oktavia class (oktavia.py)
  • Main entry point for creating and searching indexes.

• Metadata classes (metadata.py)
  • Section
  • Block
  • Splitter
  • Table

• Query, Result classes (TBD)

• Sorry, I am still working on this… In the future, the following code will work:

• In some cases, an inverted index is not a good fit for Eastern Asian languages.

• FM-index is a completely different search algorithm.

• I published a new PyPI module yesterday.
  • It includes only the essential part of Oktavia.
  • I will add more features.

• Office Hour
  • 13:40-14:10

• Message
  • Facebook (yoshiki.shibukawa)
  • Twitter (@shibu_jp, @shibukawa)
