Anatoliy Sokolenko, Software Engineer at Grid Dynamics
Apache Lucene/Solr
Internals
About me
Java and all around
Principal Software Engineer at Grid Dynamics
Kharkiv
4 nodes ✕ 12GB disk space
June 2013 database: 14,630,209 records
Indexing took 5 hours with 100 threads
1000 documents per batch
Lucene.net
VM 16 CPU cores
16 GB memory
lightweight, performant search library
Data Model
• document oriented
• flat
• store
• index
Document
score:1
tag:java
type:answer
boost = 1.1
docID = 23
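The flat, document-oriented model with separate "store" and "index" choices per field can be sketched in a few lines. This is a plain Python illustration, not Lucene's API: the `Field` and `Document` classes and all of their attribute and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    value: str
    stored: bool = True    # value can be returned with search results
    indexed: bool = True   # value is searchable as the term "name:value"


@dataclass
class Document:
    # A document is a flat list of fields: no nesting, no relations.
    fields: List[Field] = field(default_factory=list)
    boost: float = 1.0

    def add(self, name, value, stored=True, indexed=True):
        self.fields.append(Field(name, value, stored, indexed))
        return self

    def indexed_terms(self):
        return [f"{f.name}:{f.value}" for f in self.fields if f.indexed]

    def stored_values(self):
        return {f.name: f.value for f in self.fields if f.stored}


doc = (Document(boost=1.1)
       .add("score", "1")
       .add("tag", "java")
       .add("type", "answer", stored=False))  # searchable, but not returned
```

Here `doc.indexed_terms()` yields the three terms from the slide, while `doc.stored_values()` omits `type` because that field was indexed without being stored.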
Showcase
Basic Flow
Document (boost = 1.1)
score:1
tag:java
type:answer
Indexing: Document → addDocument → Index Writer (Analyzer) → Lucene Index
Searching: query tag:java → Index Searcher (Index Reader) → Lucene Index → search returns the matching Document
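The add-then-search flow can be imitated with a toy in-memory inverted index. This is only a sketch of the idea in Python: `TinyIndex`, `add_document` and `search` are hypothetical names, and the real Index Writer, Analyzer and Index Searcher do far more (analysis, segments, scoring).

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> docIDs in insertion order
        self.stored = []                   # docID -> original fields

    def add_document(self, fields):
        """Writer side: assign the next docID and index every name:value term."""
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        for name, value in fields.items():
            self.postings[f"{name}:{value}"].append(doc_id)
        return doc_id

    def search(self, term):
        """Searcher side: look up the term and return the stored documents."""
        return [self.stored[d] for d in self.postings.get(term, [])]


index = TinyIndex()
index.add_document({"score": "1", "tag": "java", "type": "answer"})
index.add_document({"score": "5", "tag": "css", "type": "question"})
hits = index.search("tag:java")
```

The query `tag:java` touches only the posting list for that one term, which is why lookups stay cheap as the index grows.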
Lucene Index Structure
Index
• Segment A
• Segment B
• Segment C
• Segment D
Term Infos and Term Frequencies
term            doc freq   docIDs (first ID, then deltas)
score:0         3          3 +1 +2
score:1         4          10 +3 +1 +7
score:5...      2          4 +11
tag:java        2          5 +2
tag:mysql       3          6 +52 +1
tag:css...      4          1 +30 +27 +2
type:answer     3          3 +7 +1
type:question   2          5 +2
Term Frequencies also keep an in-document frequency for every docID.
Term Info Index (sparse sample of Term Infos)
score:0
...
tag:mysql
...
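The "+n" entries in the posting lists are gaps between consecutive docIDs rather than the IDs themselves. A minimal Python sketch of that delta encoding follows; the function names are hypothetical, and Lucene additionally packs the gaps into variable-length bytes, which is not shown here.

```python
from itertools import accumulate


def encode_deltas(doc_ids):
    """Turn sorted docIDs into the first ID followed by gaps."""
    deltas, previous = [], 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def decode_deltas(deltas):
    """Rebuild absolute docIDs with a running sum."""
    return list(accumulate(deltas))


mysql_deltas = encode_deltas([6, 58, 59])    # the tag:mysql postings
css_doc_ids = decode_deltas([1, 30, 27, 2])  # the tag:css postings
```

Small gaps need fewer bytes than large absolute IDs, which is the point of storing `6 +52 +1` instead of `6 58 59`.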
Showcase
scalable enterprise search server
RequestHandlers
Data Import Handler
SolrCloud
solrconfig.xml
schema.xml
SolrCloud
Join Cluster
Shard 1 Shard 2 Shard 3
Indexing
Shard 1 Shard 2 Shard 3
Query
Shard 1 Shard 2 Shard 3
query tag:java
Failure
Shard 1 Shard 2 Shard 3
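The indexing and query slides can be approximated with a small routing sketch: each document goes to exactly one shard, and a query is sent to every shard and the partial results are merged. This is a deliberately simplified Python model with hypothetical `Shard` and `Cluster` classes. It assumes hash-modulo routing on the document ID, whereas SolrCloud assigns hash ranges to shards and coordinates nodes through ZooKeeper, and it does not model replicas or the failure case.

```python
import zlib


class Shard:
    def __init__(self, name):
        self.name = name
        self.docs = {}  # doc_id -> fields

    def index(self, doc_id, fields):
        self.docs[doc_id] = fields

    def search(self, name, value):
        return [d for d, f in self.docs.items() if f.get(name) == value]


class Cluster:
    def __init__(self, shard_count):
        self.shards = [Shard(f"Shard {i + 1}") for i in range(shard_count)]

    def route(self, doc_id):
        # Assumption: a stable hash of the ID picks the owning shard.
        bucket = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        return self.shards[bucket]

    def index(self, doc_id, fields):
        self.route(doc_id).index(doc_id, fields)

    def search(self, name, value):
        # Scatter the query to all shards, then gather and merge the hits.
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(name, value))
        return sorted(hits)


cluster = Cluster(3)
cluster.index("a1", {"tag": "java"})
cluster.index("a2", {"tag": "css"})
cluster.index("a3", {"tag": "java"})
java_hits = cluster.search("tag", "java")
```

Because routing depends only on the ID, re-indexing a document always lands on the same shard and overwrites the old copy.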
CAP Model
C
A
P
SolrCloud
Solr
Showcase
Faceted Navigation
Showcase
Algorithm
Query Result: 7 31 58 59
Index term   docIDs (deltas)   docIDs (absolute)   Facet count
tag:java     5 +2              5 7                 1
tag:mysql    6 +52 +1          6 58 59             2
tag:css      1 +30 +27 +2      1 31 58 60          2
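The facet counts on the slide can be reproduced by decoding each term's posting list and counting how many of its docIDs also appear in the query result. A short Python sketch, with a hypothetical `facet_counts` helper and the numbers taken from the slide:

```python
from itertools import accumulate

# Posting lists as stored in the index: first docID, then gaps.
postings = {
    "tag:java": [5, 2],
    "tag:mysql": [6, 52, 1],
    "tag:css": [1, 30, 27, 2],
}
query_result = {7, 31, 58, 59}


def facet_counts(postings, result):
    """Count, per term, the docIDs that are also in the query result."""
    counts = {}
    for term, deltas in postings.items():
        doc_ids = accumulate(deltas)  # decode gaps into absolute docIDs
        counts[term] = sum(1 for doc_id in doc_ids if doc_id in result)
    return counts


counts = facet_counts(postings, query_result)
```

This naive version walks every posting list in full. It is meant only to show where the numbers 1, 2, 2 come from, not how Solr optimizes faceting.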
Showcase
Text Analysis
Analyzer
Char filter → Tokenizer → Filter
Index time
Input: <strong>There are no pointers in Java!</strong>
Char filter: There are no pointers in Java!
Tokenizer: There | are | no | pointers | in | Java
Filter: ? | ? | ? | pointer | ? | java
Query time
Input: pointers in Java
Char filter: pointers in Java
Tokenizer: pointers | in | Java
Filter: pointer | ? | java
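The three analyzer stages can be mimicked with a toy pipeline. The sketch below is plain Python, not Lucene's analysis API. It assumes the "?" marks on the slide stand for tokens dropped by a stop-word filter, and it uses a tiny hand-picked stop list and a naive "strip the trailing s" stemmer purely for illustration.

```python
import re

STOP_WORDS = {"there", "are", "no", "in"}  # illustrative list only


def char_filter(text):
    """Strip markup before tokenization."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Split into word tokens, discarding punctuation."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Lowercase, drop stop words, and crudely stem plurals."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        result.append(token)
    return result


def analyze(text):
    return token_filter(tokenize(char_filter(text)))


index_terms = analyze("<strong>There are no pointers in Java!</strong>")
query_terms = analyze("pointers in Java")
```

Running the same chain at index time and at query time is what makes the query match: both sides end up with the terms `pointer` and `java`.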
Showcase
Spell Suggestions
Levenshtein Distance
html → htmm: Levenshtein distance = 1
hlmz → html: Levenshtein distance = 2
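The two distances above can be checked with the classic dynamic-programming formulation of Levenshtein distance, which counts insertions, deletions and substitutions. A minimal Python sketch with a hypothetical function name:

```python
def levenshtein(a, b):
    """Edit distance between a and b, keeping only one DP row in memory."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("html", "htmm")` is 1 (one substitution) and `levenshtein("hlmz", "html")` is 2 (two substitutions).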
tag:php
tag:jquery
tag:json
tag:java
tag:c#
tag:apache
tag:osx
tag:html
Levenshtein Automaton
html, Levenshtein distance = 1
[Diagram: automaton for "html" with transitions on h, t, m, l that accepts every term within distance 1]
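The idea behind the automaton is to feed a candidate term in one character at a time and give up as soon as it can no longer end within the allowed distance. The Python sketch below simulates that behaviour with dynamic-programming rows as states. It is not the precompiled automaton construction that Lucene uses, and `LevenshteinMatcher` with its methods is a hypothetical name.

```python
class LevenshteinMatcher:
    def __init__(self, word, max_dist):
        self.word = word
        self.max_dist = max_dist

    def start(self):
        # State = edit distances from the consumed input to each prefix of word.
        return list(range(len(self.word) + 1))

    def step(self, state, ch):
        new = [state[0] + 1]
        for i, wc in enumerate(self.word):
            cost = 0 if wc == ch else 1
            new.append(min(new[i] + 1, state[i] + cost, state[i + 1] + 1))
        return new

    def can_match(self, state):
        # If every entry exceeds the limit, no continuation can succeed.
        return min(state) <= self.max_dist

    def accepts(self, term):
        state = self.start()
        for ch in term:
            state = self.step(state, ch)
            if not self.can_match(state):
                return False  # dead state: stop reading early
        return state[-1] <= self.max_dist


terms = ["php", "jquery", "json", "java", "c#", "apache", "osx", "html"]
matcher = LevenshteinMatcher("htmm", 1)  # a misspelled query, distance 1
suggestions = [t for t in terms if matcher.accepts(t)]
```

With the tag values from the slide as the dictionary, the misspelling `htmm` yields the single suggestion `html`, and most other terms are rejected after only a few characters.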
Showcase
Solr is...
• an enterprise-level search engine
• vertically scalable
• horizontally scalable, but...
• tunable
• poorly documented
• backed by an active community
The End