99 problems but the search aint one, confoo 2011, andrei zmievski

Preview:

DESCRIPTION

 

Citation preview

99 Problems, ButThe Search Ain’t OneAndrei Zmievski • ConFoo •!Mar 9, 2011

who am i?

curl http://localhost:9200/speaker/info/andrei

{“name”: “Andrei Zmievski”, “works”: “Analog Co-op”, “projects”: [“PHP”, “PHP-GTK”, “Smarty”, “Unicode/i18n”], “likes”: [“coding”, “beer”, “brewing”, “photography”], “twitter”: “@a”, “email”: “andrei@zmievski.org”}

what is elasticsearch?

a search engine for the NoSQL generation

domain-driven

distributed

RESTful

Hitchhiker’s Guide to the Galaxy (no, really)

document model

document-oriented

JSON-based

schema-free

based on Lucene

multi-tenancy

distributed, out of the box

engine

3 easy steps

1. index!"#$%&'()*+%,--./00$1!2$,13-/45660!17803.92:9#0;%&<=

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7==-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N=

requ

est

>

%%%%?1:?/-#"9

%%%%?OB7<9P?/?!178?

%%%%?O-I.9?/?3.92:9#?

%%%%?OB<?/?;?

Nresp

onse

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

%%%%%%?O-I.9?%/%?3.92:9#?E

%%%%%%?OB<?%/%?;?E

%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

!!!!"#$#%&"!'!()%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

%%%%%%?O-I.9?%/%?3.92:9#?E

%%%%%%?OB<?%/%?;?E

%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse

total number of hits

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

!!!!!!"*+,-./"!'!"0$,1")%%%%%%?O-I.9?%/%?3.92:9#?E

%%%%%%?OB<?%/%?;?E

%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse the index of the doc

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

!!!!!!"*#23."!'!"43.%5.6")%%%%%%?OB<?%/%?;?E

%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse the type of the doc

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

%%%%%%?O-I.9?%/%?3.92:9#?E

!!!!!!"*+-"!'!"(")%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse

the id of the doc

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

%%%%%%?O-I.9?%/%?3.92:9#?E

!!!!!!"*+-"!'!"(")%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse

the hit score

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%?-11:?%/%TE

%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

%%%%%%?O-I.9?%/%?3.92:9#?E

!!!!!!"*+-"!'!"(")%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

7!!!!",%8."'!"9,-6.+!:8+.;45+")!!!!"#%&5"'!"<<!=6$>&.84)!>?#!#@.!A.%60@!9+,B#!C,.")!!!!"&+5.4"'!D"0$-+,E")!">..6")!"3@$#$E6%3@2"F)!!!!"#G+##.6"'!"%")!!!!"@.+E@#"'!(HIJ%N%J%N%N

resp

onse

the original doc contents

2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#

requ

est

>%"#$$5"!'!K)%%?O3,2#<3?%/%>

%%%%?-1-2$?%/%;E

%%%%?3"!!9338"$?%/%;E

%%%%?82B$9<?%/%6

%%NE

%%?,B-3?%/%>

%%%%?-1-2$?%/%;E

%%%%?@2PO3!1#9?%/%6UV46LM64E

%%%%?,B-3?%/%G%>

%%%%%%?OB7<9P?%/%?!178?E

%%%%%%?O-I.9?%/%?3.92:9#?E

%%%%%%?OB<?%/%?;?E

%%%%%%?O3!1#9?%/%6UV46LM64E

%%%%%%?O31"#!9?%/%

>

%%%%?72@9?/%?A7<#9B%C@B9D3:B?E

%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E

%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE

%%%%?-KB--9#?/%?2?E

%%%%?,9BH,-?/%;LM

N%N%J%N%N

resp

onse

the execution time

3. profit

that’s up to you

demo

distributed model

provides:

performance

resiliency (high-availability)

shards

a portion of the document space

each one is a separate Lucene index

thus, many per-index settings are available

document is sharded by its _id value

but can be assigned (routed) to a shard deterministically

zero-conf discovery

zen (multicast and unicast)

cloud (EC2 via API)

auto-routing

master node:

maintains cluster state

reassigns shards if nodes leave/join cluster

any node can process the search request

the query is handled via scatter-gather mechanism

replicas

each shard can have 1 or more replicas

# of replicas can be updated dynamically after index creation

replicas can be used for querying in parallel

shard allocationnode 1

start with a single node

shard allocation

PUT /person { “index”: { “number_of_shards”: 2, “number_of_replicas”: 1}}

node 1person1person2

shard allocationnode 1person1person2

node 2person1person2

start the second node

shard allocationnode 1 node 2 node 3 node 4person1person2

person1person2

start 2 more nodes

shard allocationnode 1 node 2 node 3 node 4person1

person2person1

person2

start 2 more nodes

document shardingnode 1 node 2 node 3 node 4person1

person2person1

person2

PUT /person/info/1{ … }

document shardingnode 1 node 2 node 3 node 4person1

person2person1

person2

hashed to shard 1PUT /person/info/1{ … }

document shardingnode 1 node 2 node 3 node 4person1

person2person1

person2

replicated

PUT /person/info/1{ … }

document shardingnode 1 node 2 node 3 node 4person1

person2person1

person2

PUT /person/info/2{ … }

document shardingnode 1 node 2 node 3 node 4person1

person2person1

person2

hashed to shard 2

PUT /person/info/2{ … }

document shardingnode 1 node 2 node 3 node 4person1

person2person1

person2

replicated

PUT /person/info/2{ … }

scatter-gathernode 1 node 2 node 3 node 4person1

person2person1

person2

GET /person/_search?q=name:thomas

shard allocationnode 1 node 2 node 3 node 4person1

person2person1

person2

GET /person/_search?q=name:thomas

shard allocationnode 1 node 2 node 3 node 4person1

person2person1

person2

GET /person/_search?q=name:thomas

shard allocationnode 1 node 2 node 3 node 4person1

person2person1

person2

GET /person/_search?q=name:thomas

demo

transactional model

per-document consistency

no need to commit/flush

uses write-ahead transaction log

write consistency (W) can be controlled

one, quorum, or all

(near) real-time search

1 second refresh rate by default

_refresh API also

index storage

node data considered transient

can be stored in local file system, JVM heap, native OS memory, or FS & memory combination

persistent storage requires a gateway

gateways

persistent store for cluster state and indices

asynchronous, translog-based write strategy

allows full recovery if a cluster restart is needed

supported gateways:local

shared FS

Hadoop via HDFS

S3

mapping

describes document structure to the search engine

automatically created with sensible defaults

explicit mapping can be provided (generally, a good idea)

can run into merge conflicts

mapping

important meta fields:

_source

_all

there are more

mapping types

simple:

string, integer/long, float/double, boolean, and null)

complex:

array, object

sample mapping

>?"39#?/%%%%%%?<9#B!:?E

%?-B-$9?/%%%%%?W17X-%(27B!?E

%?-2H3?/%%%%%%G?.#18B$B7H?E%?<9F"HHB7H?E%?.,.?JE

%?.13-W2-9?/%%?56;6&;5&55+;M/;Y/;5?E

%?.#B1#B-I?/%%5Ndocu

men

t

>?.13-?/%>

%%?.#1.9#-B93?%/%>

%%%%?"39#?/%%%%%%>?-I.9?/%?3-#B7H?E%?B7<9P?/%?71-O272$IZ9<?NE

%%%%?@9332H9?/%%%>?-I.9?/%?3-#B7H?E%[F113-\/%;UVNE

%%%%?-2H3?/%%%%%%>?-I.9?/%?3-#B7H?E%?B7!$"<9OB7O2$$?/%?71?NE

%%%%?.13-W2-9?%/%>?-I.9?%/%?<2-9?E%[3-1#9\/%[71\NE

%%%%?.#B1#B-I?%/%>?-I.9?%/%?B7-9H9#?N

NNN

map

ping

analyzers

break down (tokenize) and normalize fields during indexing and query strings at search time

analyzer = tokenizer + token filters (0 or more)

*-27<2#<%A72$IZ9#%S

%%%*-27<2#<%+1:97BZ9#%]

%%%%%%%*-27<2#<%+1:97%^B$-9#%]

%%%%%%%_1K9#!239%+1:97%^B$-9#%]

%%%%%%%*-1.%+1:97%^B$-9#

analyzers

analyzers, tokenizers, and filters can be customizedB7<9P/

%%272$I3B3/

%%%%272$IZ9#/

%%%%%%.?&%,E/%%%%%%%%-I.9/%!"3-1@

%%%%%%%%-1:97BZ9#/%3-27<2#<

%%%%%%%%8B$-9#/%G3-27<2#<E%$1K9#!239E%3-1.E

%%%%%%%%%%%%%%%%%23!BB81$<B7HE%.1#-9#*-9@Jelas

ticse

arch

.ym

l

`

?-B-$9?/%>?-I.9?/%?3-#B7H?E%?272$IZ9#?/%?9"$27H?NE

`

map

ping

API

API conventions

append ?pretty=true to get readable JSON

boolean values: false/0/off = false, rest is true

JSONP support via callback parameter

API structure

http://host:port/[index]/[type]/[_action/id]

GET http://es:9200/_status

GET http://es:9200/twitter/_status

POST http://es:9200/twitter/tweet/1

GET http://es:9200/twitter/tweet/1

API structure

http://host:port/[index]/[type]/[_action/id]

GET http://es:9200/twitter/tweet/_search

GET http://es:9200/twitter/user/_search

GET http://es:9200/twitter/tweet,user/_search

GET http://es:9200/twitter,facebook/_search

GET http://es:9200/_search

API query example>

%%%%?R"9#I?/%>

%%%%%%%%?8B$-9#9<?/%>

%%%%%%%%%%%%?R"9#I?/%>

%%%%%%%%%%%%%%%%?R"9#IO3-#B7H?/%>

%%%%%%%%%%%%%%%%%%%%?R"9#I?/%?811%F2#?E

%%%%%%%%%%%%%%%%%%%%?<982"$-O1.9#2-1#?/%?AaW?E

%%%%%%%%%%%%%%%%%%%%?8B9$<3?/%G?-B-$9?E%?<93!#B.-B17?JE

%%%%%%%%%%%%%%%%%%%%?F113-?/%5U6

%%%%%%%%%%%%%%%%N

%%%%%%%%%%%%NE

%%%%%%%%%%%%?8B$-9#?/%>

%%%%%%%%%%%%%%%%?#27H9?/%>?<2-9?/%>?H-?/%?56;;&6T&64?NN

%%%%%%%%%%%%N

%%%%%%%%N

%%%%NE

%%%%?8#1@/%;6E

%%%%?3BZ9?/%;6

N

API {core}

index

bulk

delete

delete by query

get

count

search

query

from/size paging

sort

highlighting

selective fields

API {indices}

create

delete

open/close

get/put/delete mapping

refresh

optimize

snapshot

update settings

analyze

status

flush

Query DSL

term / terms

range

prefix

bool

fuzzy

wildcard

query_string

default_operator

analyzer

phrase_slop

etc

filters

share some similar features with queries (term, range, etc)

why use a filter?

filters

faster than queries

cached (depends on the filter)

the cache is used for different queries against the same filter

no scoring

more useful ones: term, terms, range, prefix, and, or, not, exists, missing, query

facets

provide aggregated data based on the search request

terms, histogram, date histogram, range, statistical, and more

geo search

implemented as filters (and a facet)

geo_distance

geo_bounding_box

geo_polygon

interfaces

REST

Java /!Groovy

Language clients (REST/Thrift):

pyes, PHP (standalone and symfony), Ruby, Perl

Flume sink implementation

data import

ES is not the primary data store (usually)

to import/synchronize data:

write an agent (Gearman, message queues, etc)

use rivers (CouchDB, RabbitMQ, Twitter)

10 more features

versioning

index aliases

parent/child docs

scripting

dynamic mapping templates

load balancing nodes

plugins

more_like_this

multi_field mapping

percolation