Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler


Presented at the AWS Product Series: Understanding Amazon CloudSearch (AWSプロダクトシリーズ|よくわかるAmazon CloudSearch) http://kokucheese.com/event/index/168838/


Build a Scalable Search Engine With Amazon CloudSearch

Agenda

•  Introduction to Search
•  Amazon CloudSearch
•  Building with CloudSearch

Introduction to Search

Search Engines Connect Us To Data

Documents

Representation of a Document

Field         Value
id            tt0371746
title         Iron Man
description   When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
director      Jon Favreau
actors        Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
rating        7.9
release_date  2008-05-02T00:00:00Z

Data Types

Doubles

Dates

Signed Integers

Text

Literal

Geo

•  Latlon data type
•  Region search
•  Distance sort
•  Supports mobile
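Distance sort can be illustrated with a small, self-contained sketch: given the user's latitude and longitude, compute the great-circle (haversine) distance to each document's latlon value and sort ascending. This is an illustration of the idea only, not CloudSearch's implementation; the document ids and coordinates below are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs, lat, lon):
    """Return documents ordered from nearest to farthest."""
    return sorted(docs, key=lambda d: haversine_km(lat, lon, *d["location"]))


# Hypothetical documents, each with a (latitude, longitude) field.
THEATERS = [
    {"id": "osaka", "location": (34.6937, 135.5023)},
    {"id": "tokyo", "location": (35.6895, 139.6917)},
    {"id": "seattle", "location": (47.6062, -122.3321)},
]

USER = (35.4437, 139.6380)  # a user near Yokohama (assumed position)
nearest_first = [d["id"] for d in sort_by_distance(THEATERS, *USER)]
```

A mobile client would typically send its current position with each query, which is why distance sort and mobile support appear together on the slide.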

Text Processing (Normalization)

•  Tokenization (parsing)
•  Downcasing
•  Stemming
•  Stopword removal
•  Synonym Addition

Before: When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.

After: when wealth industrial tony stark force build armor suit after life threaten incident ultimate decide use technology fight against evil
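The normalization steps listed above can be sketched as a short pipeline. The stopword list and suffix rules here are toy assumptions chosen for illustration, so the output will not exactly reproduce the slide's example (production stemmers are considerably more elaborate).

```python
import re

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "is", "to", "he", "its"}
# Toy suffix-stripping rules, tried in order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into alphanumeric tokens (hyphens and punctuation separate tokens)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text, synonyms=None):
    """Tokenize, downcase, drop stopwords, stem, then add synonyms."""
    synonyms = synonyms or {}
    terms = []
    for token in tokenize(text):
        token = token.lower()
        if token in STOPWORDS:
            continue
        token = stem(token)
        terms.append(token)
        terms.extend(synonyms.get(token, []))
    return terms
```

For example, `normalize("The Armored Suits")` yields `["armor", "suit"]`. The same pipeline has to run on both documents and queries so that the terms line up at match time.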

Indexing

Term    Documents (Posting List)
Iron    The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady; ...
Man     Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man; ...
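A minimal sketch of how such an index can be built, assuming whitespace tokenization and downcasing only. The titles are the ones on the slide; the document numbers are simply list positions. Because this sketch indexes every word of every title, a title such as "The Man in the Iron Mask" appears under both "iron" and "man", whereas the slide's lists are abbreviated.

```python
from collections import defaultdict

TITLES = [
    "The Man in the Iron Mask",
    "Iron Man 2",
    "Iron Man",
    "The Iron Giant",
    "The Iron Lady",
    "Rain Man",
    "The Man in the Moon",
    "The Lawnmower Man",
    "The Third Man",
]


def build_index(titles):
    """Map each lowercased term to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for term in title.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


INDEX = build_index(TITLES)
iron_titles = [TITLES[i] for i in INDEX["iron"]]
```

Looking up a term is then a single dictionary access, independent of how many documents do not contain it.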

Matching

Iron: The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady

Man: Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man

Iron AND Man: Iron Man 2; Iron Man
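Matching a two-term query amounts to intersecting the two posting lists. The sketch below uses the lists as abbreviated on the slide, with hypothetical document numbers assigned for illustration, and a linear two-pointer merge over sorted lists.

```python
# Hypothetical document numbers for the titles on the slide.
DOCS = {
    1: "The Man in the Iron Mask",
    2: "Iron Man 2",
    3: "Iron Man",
    4: "The Iron Giant",
    5: "The Iron Lady",
    6: "Rain Man",
    7: "The Man in the Moon",
    8: "The Lawnmower Man",
    9: "The Third Man",
}

# Posting lists as abbreviated on the slide, kept sorted by document number.
IRON = [1, 2, 3, 4, 5]
MAN = [2, 3, 6, 7, 8, 9]


def intersect(a, b):
    """Intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


matches = [DOCS[doc_id] for doc_id in intersect(IRON, MAN)]
```

Keeping posting lists sorted is what makes the merge linear; an OR query would use the same walk but emit the union instead.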

Ranking and Relevance

•  The meat of the search engine
•  TF-IDF – uniqueness and presence
•  Additional Criteria
–  Measures of document value (e.g. rating)
–  Observed user behavior
–  Freshness
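To make "uniqueness and presence" concrete, here is a sketch of one common TF-IDF variant: term frequency normalized by document length (presence) multiplied by log(N / document frequency) (uniqueness). Real engines use more refined formulas, so treat this as an assumption-laden illustration rather than CloudSearch's scoring function.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum over query terms of tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if df[term]:  # skip terms that occur nowhere in the corpus
                idf = math.log(n / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


DOCS = ["Iron Man", "Iron Man 2", "The Iron Giant", "Rain Man", "The Third Man"]
SCORES = tf_idf_scores(["iron", "giant"], DOCS)
best = DOCS[SCORES.index(max(SCORES))]
```

In this corpus "giant" occurs in only one title, so it contributes more to the score than the common term "iron". The additional criteria on the slide (rating, user behavior, freshness) would typically be blended with a text score like this one.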

Summary

•  Search makes data accessible
•  Search documents gather information about one search target
•  Inverted (reverse) indexes provide the basis for matching query text against document text
•  Relevance ranking brings the best matches to the top

Amazon CloudSearch

Building a Search service

•  Build your own

–  Extend datastores and build custom relevance engine

•  Open Source

–  Apache Solr, ElasticSearch

•  Enterprise Search

–  FAST, Autonomy, Endeca

Challenges with building a Search service

•  COMPLEX: Requires extensive search expertise
•  COSTLY: High upfront expenditure
•  SLOW: Long time to market. Slows innovation

•  UNDIFFERENTIATED: Operational overhead that doesn’t add value to core product

Where CloudSearch fits in the picture

Amazon CloudSearch is a fully managed search service in the cloud that makes it easy to set up, operate, and scale a search solution for your website or application.

Similar benefits to other AWS managed services:

•  Easy to set up and operate (Console, SDK, CLT)
•  Pay as you go
•  No need to guess capacity
•  Experiment fast with low risk
•  Go global in minutes

Reference Architecture

Automatic Scaling

[Figure: a grid of search instances, each holding one copy of one index partition (Index Partition 1 to n, Copy 1 to n). One axis is labeled DATA (Document Quantity and Size), the other TRAFFIC (Search Request Volume and Complexity).]

Building With CloudSearch

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Create a Domain

Upload Data

March 2014 CloudSearch Launch

Arabic, Armenian, Basque, Bulgarian, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch, English,

Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese,

Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish

•  Support for 33 languages

Loading Data into CloudSearch (Console, CSV)

You can also download the generated SDF-format file.


Japanese Text Processing

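•  Morphological analysis (形態素解析)
–  The task of splitting a sentence written in natural language into a sequence of morphemes and identifying the part of speech of each one (http://ja.wikipedia.org/wiki/形態素解析)
•  Unlike languages such as English, where words are separated by spaces, Japanese needs its own language-specific parsing
–  Example: 彼はエンジニアだ ("He is an engineer")
•  彼 (noun, pronoun) / は (particle, binding particle) / エンジニア (noun, general) / だ (auxiliary verb)
•  Extracting "エンジニア" and indexing it makes fast responses possible when a user searches for "エンジニア"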

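Japanese Text Processing

•  Normalization (正規化)
–  A search for ｴﾝｼﾞﾆｱ (half-width kana) and a search for エンジニア (full-width kana) should both hit
–  This is a feature supported by CloudSearch
–  For more thorough normalization, depending on your requirements, it may be preferable to implement one of the following yourself
•  NFD (Canonical Decomposition): Normalization Form D
•  NFC (Canonical Composition): Normalization Form C
•  NFKD (Compatibility Decomposition): Normalization Form KD
•  NFKC (Compatibility Composition): Normalization Form KC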
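If you do implement such normalization yourself, Python's standard `unicodedata` module provides all four forms. The sketch below shows that NFKC folds half-width katakana into the full-width form, while NFC alone does not; the helper name `normalize_query` is hypothetical, and applying it on both the indexing and query side is an assumption about how you would wire it in.

```python
import unicodedata

# "engineer" in half-width katakana (ｴﾝｼﾞﾆｱ), written as escapes to avoid ambiguity.
HALF_WIDTH = "\uff74\uff9d\uff7c\uff9e\uff86\uff71"
FULL_WIDTH = "エンジニア"


def normalize_query(text, form="NFKC"):
    """Apply a Unicode normalization form before indexing or searching."""
    return unicodedata.normalize(form, text)


# NFKC applies compatibility mappings and then composes, so the half-width
# letter plus voiced sound mark becomes the single full-width character.
folded = normalize_query(HALF_WIDTH)

# NFC only applies canonical composition, so half-width kana stay half-width.
unfolded = normalize_query(HALF_WIDTH, form="NFC")

# NFD decomposes the voiced character into base letter + combining mark.
decomposed_length = len(normalize_query(FULL_WIDTH, form="NFD"))
```

Here `folded` equals the full-width string, so both spellings map to the same index term.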

Japanese Text Processing

•  Stemming
–  飲んだ → 飲ん (verb, independent; baseForm: 飲む) / だ (auxiliary verb) → 飲む
–  Entries can be added to the stemming dictionary (also possible via the API/SDK)

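Japanese Text Processing

•  Stopword Removal
–  Remove words that carry no meaning on their own, such as 「の」, 「は」, and 「か」
–  As with stemming, entries can be added to the stopword dictionary (also possible via the API/SDK)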

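Japanese Text Processing

•  Synonym Addition
–  Synonyms (同義語)
•  「ベニス」, 「ベネチア」, 「ヴェネチア」 (all mean Venice)
•  「昨年」, 「去年」 (both mean last year)
–  They have the same meaning, so a search for one should also hit the others
–  Can be added in the same way as stopwords and stemming entries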

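Japanese Text Processing

•  Synonym Addition
–  Entries can be added to the synonym dictionary (also possible via the API/SDK)
•  Alias
–  A search for pupil hits documents containing student
–  A search for student does not hit documents containing pupil
•  Group
–  A search for any of 1st, first, or one
–  hits all documents containing 1st, first, or one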
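The difference between an alias (one-way) and a group (every term matches every other) can be modeled with a small term-expansion sketch. The data structures and the `expand` helper are hypothetical and only mirror the behavior described above; they are not CloudSearch's configuration format or implementation.

```python
# One-way mapping: a search for the key also matches the listed terms.
ALIASES = {"pupil": ["student"]}

# Conflation groups: every term in a group matches every other term.
GROUPS = [["1st", "first", "one"]]


def expand(term):
    """Return the set of indexed terms that a query term should match."""
    matches = {term}
    matches.update(ALIASES.get(term, []))
    for group in GROUPS:
        if term in group:
            matches.update(group)
    return matches
```

With these definitions, `expand("pupil")` includes `"student"`, but `expand("student")` contains only `"student"`, while any member of the group expands to the whole group.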

Document Upload

http(s)://<document service endpoint>/2013-01-01/documents/batch

Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

[
  { "type": "add",
    "id": "tt0371746",
    "fields": {
      "directors": [ "Jon Favreau" ],
      "release_date": "2008-04-14T00:00:00Z",
      "rating": 7.9,
      "genres": [ "Action", "Adventure", "Sci-Fi" ],
      "image_url": "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
      "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
      "title": "Iron Man",
      "rank": 171,
      "running_time_secs": 7560,
      "actors": [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
      "year": 2008
    }
  },
  { "type": "delete",
    "id": "tt0434409" }
]
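A batch like the one above can be assembled programmatically. The sketch below assumes the 2013-01-01 batch format, in which each operation carries `type`, `id`, and (for adds) `fields`. It only builds the request object and does not send it; the endpoint name is a placeholder and the helper functions are hypothetical.

```python
import json
import urllib.request


def add_op(doc_id, fields):
    """An operation that adds or replaces a document."""
    return {"type": "add", "id": doc_id, "fields": fields}


def delete_op(doc_id):
    """An operation that removes a document."""
    return {"type": "delete", "id": doc_id}


def build_batch_request(endpoint, operations):
    """Build (but do not send) a POST request for a document batch."""
    body = json.dumps(operations).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{endpoint}/2013-01-01/documents/batch",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )


REQUEST = build_batch_request(
    "doc-endpoint.example.com",  # placeholder for your document service endpoint
    [
        add_op("tt0371746", {"title": "Iron Man", "year": 2008, "rating": 7.9}),
        delete_op("tt0434409"),
    ],
)
# Sending would be urllib.request.urlopen(REQUEST); that needs a real domain
# and suitable access permissions, so it is left out of this sketch.
```

Grouping many operations into one batch keeps the number of HTTP round trips low when loading a large collection.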

Simple Queries


http(s)://<search endpoint>/2013-01-01/search?q=iron+man

{
  "status": { "rid": "oei6zt8oAgq5QOc=", "time-ms": 4 },
  "hits": {
    "found": 9,
    "start": 0,
    "hit": [
      { "id": "tt1228705" },
      { "id": "tt0120744" },
      { "id": "tt0371746" },
      { "id": "tt1866249" },
      { "id": "tt0119558" },
      { "id": "tt0402894" },
      { "id": "tt1258972" },
      { "id": "tt1300854" },
      { "id": "tt0462465" }
    ]
  }
}
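A client for this request/response pair needs little more than URL encoding and JSON parsing. The sketch below builds the search URL and extracts the document ids from a response shaped like the one above; the endpoint is a placeholder, the helper names are hypothetical, and the sample response is abbreviated to two hits.

```python
import json
from urllib.parse import urlencode


def search_url(endpoint, query, **params):
    """Build a search URL; spaces in the query are encoded as '+'."""
    query_string = urlencode({"q": query, **params})
    return f"https://{endpoint}/2013-01-01/search?{query_string}"


def hit_ids(response_text):
    """Return the document ids from a search response body."""
    response = json.loads(response_text)
    return [hit["id"] for hit in response["hits"]["hit"]]


URL = search_url("search-endpoint.example.com", "iron man")

# Abbreviated sample response in the shape shown on the slide.
SAMPLE = (
    '{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4},'
    ' "hits": {"found": 9, "start": 0,'
    ' "hit": [{"id": "tt1228705"}, {"id": "tt0371746"}]}}'
)
IDS = hit_ids(SAMPLE)
```

Note that `found` is the total number of matches, which can be larger than the number of hits returned in a single response.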

Complex Queries

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
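The drill-down path above could be expressed as a single structured query. The sketch below assumes the 2013-01-01 structured parser (`q.parser=structured`) with its `and`, `term`, and `range` operators, and assumes that the `title`, `genres`, `year`, and `actors` fields from the sample document are configured as searchable in the domain. The endpoint is a placeholder, and the helper does no escaping of quotes inside values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def structured_query(text, genre, year_from, year_to, actor):
    """Combine a text match with genre, year-range, and actor constraints."""
    return (
        f"(and (term field=title '{text}') "
        f"(term field=genres '{genre}') "
        f"(range field=year [{year_from},{year_to}]) "
        f"(term field=actors '{actor}'))"
    )


def structured_search_url(endpoint, query):
    """Build a search URL that selects the structured query parser."""
    params = urlencode({"q": query, "q.parser": "structured"})
    return f"https://{endpoint}/2013-01-01/search?{params}"


QUERY = structured_query("iron", "Sci-Fi", 2008, 2010, "downey")
URL = structured_search_url("search-endpoint.example.com", QUERY)
```

Each step of the path adds one clause to the `and` expression, which is what makes the drill-down progressively narrower.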

Faceting


Drilldown


Adjustable Ranking


Highlighting


Availability Options

Scaling Options

IAM Integration

Configuration API Only

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cloudsearch:*"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"
    },
    {
      "Effect": "Deny",
      "Action": ["cloudsearch:DeleteDomain"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"
    }
  ]
}

Closing Thoughts

•  Content Discovery goes hand in hand with Content. Search is everywhere!

•  Amazon CloudSearch is a fully managed, easy-to-use, cost-effective search service – easy to build, easy to scale

•  Get the powerful search features found in open source engines (Apache Solr) combined with value-added AWS features (easy setup, on-demand pricing, auto scaling, Multi-AZ, global availability)

Questions?

Jon Handler (handler@amazon.com)

Pravin Muthukumar (pravinm@amazon.com)