• Aleksander M. Stensby
• CEO in Monokkel AS
• Previously COO in Integrasco AS
• Working with search and data analysis since 2004
www.monokkel.io
• Persistence, processing and presentation of data
Agenda
• Search fundamentals primer
• Intro to elasticsearch
• Search, filter and aggregate! … and some bonus visualisation!
What we will not cover today…
• All the different searches, filters and aggregations available in elasticsearch :)
• Details on tokenization, analyzers…
• Elasticsearch in production and performance tuning…
• Data integration
Term Frequency
we 3
know 2
what 2
are 1
but 1
not 1
may 1
be 1
“We know what we are, but know not what we may be.”
Term Vector
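The counts on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the regex tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption made for illustration and is far simpler than a real analyzer.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Lowercase the text, split it into word tokens and count each term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

quote = "We know what we are, but know not what we may be."
tf = term_frequencies(quote)

# Print every term with its count, most frequent first.
for term, count in tf.most_common():
    print(term, count)
```

The resulting mapping from term to count is the term vector for the quote.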
The Inverted Index
Term (dictionary)   Frequency   Documents (postings)
blues               1           3
born                2           1,3
no                  1           2
one                 1           2
run                 2           1,2
sing                1           3
some                1           3
the                 1           3
to                  3           1,2,3
told                1           2
we                  1           1
were                2           1,3
when                1           2
you                 1           2
1. “We were born to run ”
2. “No one told you when to run”
3. “Some were born to sing the blues”
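A rough sketch of how such an index could be built from the three example documents. The tokenizer and the function names are illustrative assumptions, not how Lucene does it internally. The frequency shown on the slide is the number of documents containing the term, which here is the length of each postings list.

```python
import re
from collections import defaultdict

docs = {
    1: "We were born to run",
    2: "No one told you when to run",
    3: "Some were born to sing the blues",
}

def tokenize(text):
    """Very naive analyzer: lowercase and keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each term (dictionary) to the sorted ids of documents containing it (postings)."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(docs)
for term, ids in index.items():
    # term, document frequency, postings list
    print(term, len(ids), ids)
```

Searching for a single term is then just a dictionary lookup, for example `index["born"]`.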
Searching
born
1. “We were born to run ”
2. “No one told you when to run”
3. “Some were born to sing the blues”
The Boolean Model
born
born blues
born OR blues
born AND blues
born NOT blues
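With postings stored as sets, the Boolean operators map directly onto set operations. The sketch below hard-codes the postings for the two terms used in the example queries. Treating the operator-less query "born blues" as OR is an assumption (it matches the default operator of Elasticsearch's match query), and the function names are made up for illustration.

```python
# Postings for the terms used in the example queries (term -> set of document ids).
POSTINGS = {
    "born": {1, 3},
    "blues": {3},
}

def lookup(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return POSTINGS.get(term, set())

def boolean_or(a, b):
    return lookup(a) | lookup(b)   # union: either term

def boolean_and(a, b):
    return lookup(a) & lookup(b)   # intersection: both terms

def boolean_not(a, b):
    return lookup(a) - lookup(b)   # difference: a but not b

print("born            ->", sorted(lookup("born")))
print("born OR blues   ->", sorted(boolean_or("born", "blues")))
print("born AND blues  ->", sorted(boolean_and("born", "blues")))
print("born NOT blues  ->", sorted(boolean_not("born", "blues")))
```

Note that the Boolean model only answers "does the document match or not"; it says nothing about which match is better.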
Similarity
1. “We were born to run”
2. “No one told you when to run”
3. “Some were born to sing the blues”
[Figure: the query and the three documents plotted as vectors on “born” and “blues” axes, each running from 0 to 5]
query: [2,5]
doc 3: [2,5]
doc 2: [0,0]
doc 1: [2,0]
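One common way to compare such vectors is cosine similarity. The slide does not name the measure, so treat the choice of cosine here as an assumption; the vectors are the [born, blues] weights listed above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [2, 5]
doc_vectors = {1: [2, 0], 2: [0, 0], 3: [2, 5]}

scores = {doc_id: cosine_similarity(query, vec) for doc_id, vec in doc_vectors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranked:
    print(doc_id, round(scores[doc_id], 3))
```

Document 3 points in the same direction as the query and ranks first, document 1 shares only the “born” component, and document 2 shares nothing.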
Brief history of elasticsearch
Shay Banon -> Abstraction layer on top of Lucene -> Compass -> Rewritten: high performance, real-time, distributed -> Elasticsearch -> February 2010
elasticsearch
• Open source search engine - written in Java
• Built on top of Lucene
• Simple, coherent, RESTful API
• Distributed, scalable search engine with real-time analytics
“more useable and concise API, scalability, and operational tools on top of Lucene’s search implementation”
Much more than just search!
• Real-time analytics
• Log analysis
• Prediction modelling
• Recommendations
Easy peasy…
• http://www.elasticsearch.org/download
• bin/elasticsearch or bin/elasticsearch.bat on Windows
• http://localhost:9200/ or curl -X GET http://localhost:9200/
Indexing data
curl -XPUT 'http://localhost:9200/monokkel/user/aleks' -d '{ "name" : "Aleksander Stensby" }'
Indexing data
• shakespeare.json – http://www.elasticsearch.org/guide/en/kibana/current/snippets/shakespeare.json
• curl -XPUT localhost:9200/_bulk --data-binary @shakespeare.json
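The file passed to `_bulk` is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. The sketch below builds such a body in Python. It reuses the monokkel/user/aleks document from the earlier slide; the second user and the helper name `to_bulk_body` are hypothetical, and nothing is sent to a server.

```python
import json

def to_bulk_body(index, doc_type, docs):
    """Build a bulk body: for each document, an action line then a source line."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"

users = [
    ("aleks", {"name": "Aleksander Stensby"}),
    ("jane", {"name": "Jane Doe"}),  # hypothetical second user
]

body = to_bulk_body("monokkel", "user", users)
print(body)
```

Saving `body` to a file and posting it with `--data-binary` as on the slide should index both documents in one request.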
Mapping
• Is it a number? String? Date?
• Combining multiple fields?
• Default values?
• Stored?
• Analyzed?
• How should we tokenize/analyse/normalize the field?
Mapping
curl -XPUT http://localhost:9200/shakespeare -d '
{
  "mappings" : {
    "_default_" : {
      "properties" : {
        "speaker" : { "type" : "string", "index" : "not_analyzed" },
        "play_name" : { "type" : "string", "index" : "not_analyzed" },
        "line_id" : { "type" : "integer" },
        "speech_number" : { "type" : "integer" }
      }
    }
  }
}
';
Multi Match Query
{
  "query": {
    "multi_match": {
      "query": "romeo",
      "fields": [ "text_entry", "speaker" ]
    }
  }
}
Bool Query
{
  "query": {
    "bool": {
      "must": { "match": { "text_entry": "romeo" } },
      "must_not": { "match": { "speaker": "ROMEO" } },
      "should": [
        { "match": { "speaker": "JULIET" } },
        { "match": { "speaker": "FRIAR LAURENCE" } }
      ]
    }
  }
}
And lots more…
filtered query, prefix query, simple query string query, range query, regexp query, term query, terms query, wildcard query, dis max query, geoshape query, nested query
more like this query, more like this field query, boosting query, common terms query, constant score query, fuzzy like this query, fuzzy like this field query, function score query, fuzzy query, has child query, has parent query
ids query, indices query, span first query, span multi term query, span near query, span not query, span or query, span term query, top children query, minimum should match, multi term query rewrite, template query
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html
Filtering
• Filters do not score, so they are faster to execute than queries
• Filters can be cached in memory, which makes them significantly faster than queries
If relevance is not important, use filters; otherwise, use queries!
The Filtered Query:
{
  "query": {
    "filtered": {
      "query": {YOUR_QUERY_HERE},
      "filter": {YOUR_FILTER_HERE}
    }
  }
}
The Filtered Query:
{
  "query": {
    "filtered": {
      "query": { "match": { "content": "monokkel" } },
      "filter": { "term": { "tag": "awesome" } }
    }
  }
}
Terms Filter
{
  "query": {
    "filtered": {
      "filter": {
        "terms": { "speaker": ["ROMEO", "JULIET"] }
      }
    }
  }
}
Bool Filter
{
  "query": {
    "filtered": {
      "filter": {
        "bool" : {
          "must" : [],
          "should" : [],
          "must_not" : []
        }
      }
    }
  }
}
Range Filter
{
  "query": {
    "filtered": {
      "filter": {
        "range" : {
          "price" : { "gt" : 20, "lt" : 40 }
        }
      }
    }
  }
}
And lots more…
match all filter, and filter, not filter, or filter, prefix filter, query filter, regexp filter, type filter
geo bounding box filter, geo distance filter, geo distance range filter, geo polygon filter, geoshape filter, geohash cell filter
has child filter, has parent filter, ids filter, indices filter, limit filter, nested filter, script filter
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html
Kibana
• http://www.elasticsearch.org/overview/kibana/installation/
• bin/kibana or bin/kibana.bat on Windows
• http://localhost:5601/
Aggregations
• Buckets and Metrics: partitioning documents based on a criterion
SELECT COUNT(color) FROM table GROUP BY color
  COUNT(color): metric
  GROUP BY color: bucket
An aggregation is a combination of buckets and metrics
Aggregations
{
  "aggs": {
    "speakers": {
      "terms": { "field": "speaker" }
    }
  }
}
"speakers": your aggregation name
"terms": bucket type
Aggregations
{
  "aggs": {
    "beertypes": {
      "terms": { "field": "beertype" },
      "aggs": {
        "avg_ibu": {
          "avg": { "field": "ibu" }
        }
      }
    }
  }
}
"avg_ibu": your aggregation name
"avg": metric type
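To make the bucket/metric split concrete, here is a rough pure-Python imitation of what a terms bucket with a nested avg metric computes. The beer documents are invented sample data and the function name is hypothetical; this is not how Elasticsearch executes aggregations, only what the result means.

```python
from collections import defaultdict

# Hypothetical sample documents with the two fields used in the aggregation.
beers = [
    {"beertype": "ipa", "ibu": 60},
    {"beertype": "ipa", "ibu": 70},
    {"beertype": "stout", "ibu": 40},
]

def terms_with_avg(docs, bucket_field, metric_field):
    """Group documents by bucket_field (bucket), then average metric_field per group (metric)."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }

result = terms_with_avg(beers, "beertype", "ibu")
for key, stats in result.items():
    print(key, stats)
```

Each distinct beertype becomes one bucket, and the average IBU is calculated separately inside each bucket.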
Aggregations
min, max, sum, avg, stats, extended stats, value count, percentiles, percentile ranks, cardinality
top hits, scripted metric, global, filter, filters, missing, nested, reverse nested, children
terms, significant terms, range, date range, ipv4 range, histogram, date histogram, geo bounds, geo distance, geohash grid
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html
And a whole lot more!
• Geosearch, distance and bounds
• “More Like This”
• Suggesters / Autocomplete
• Percolation
• Language drivers
• Scripting
Further reading and some great resources!
• http://www.elasticsearch.org/guide/
• http://blog.monokkel.io/
• https://found.no/foundation/