Apache Lucene/Solr Internals

Apache Solr/Lucene Internals by Anatoliy Sokolenko

Anatoliy Sokolenko, Software Engineer at Grid Dynamics

Page 1: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Apache Lucene/Solr

Internals

Page 2: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

About me

Java and all around

Principal Software Engineer at Grid Dynamics

Kharkiv

Page 4: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

4 nodes ✕ 12GB disk space

June 2013 database 14.630.209 records

Indexing took 5 hours with 100 threads

Batch size 1000

Lucene.net

VM 16 CPU cores

16 GB memory

Page 5: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

lightweight performant search library

Page 6: Apache Solr/Lucene Internals  by Anatoliy Sokolenko
Page 8: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Data Model

• document oriented

• flat

• store

• index

score:1

tag:java

type:answer

Document

boost = 1.1

docID = 23
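
The data model above can be sketched in a few lines of Python. This is only an illustration of a flat document whose fields are stored and/or indexed; it is not Lucene's API. The class names `Field` and `Document`, and the choice to leave `type` unstored, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    value: str
    stored: bool = True   # keep the original value so it can be returned with a hit
    indexed: bool = True  # make the value searchable as the term "name:value"


@dataclass
class Document:
    fields: list = field(default_factory=list)  # flat: no nested documents
    boost: float = 1.0

    def terms(self):
        """Terms contributed to the index: one "name:value" per indexed field."""
        return [f"{f.name}:{f.value}" for f in self.fields if f.indexed]

    def stored_fields(self):
        """Values that can be returned to the caller with a search hit."""
        return {f.name: f.value for f in self.fields if f.stored}


# The document from the slide; leaving "type" unstored is a hypothetical choice.
doc = Document(
    fields=[
        Field("score", "1"),
        Field("tag", "java"),
        Field("type", "answer", stored=False),
    ],
    boost=1.1,
)
print(doc.terms())
print(doc.stored_fields())
```

A field that is indexed but not stored can be searched but cannot be shown in results; a field that is stored but not indexed is the reverse.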

Page 9: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 10: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Basic Flow

LuceneIndex

Index Writer Index Searcher

Analyzer Index Reader

Page 13: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Basic Flow

Lucene Index

Index Writer Index Searcher

Analyzer Index Reader

addDocument: Document (boost = 1.1; score:1, tag:java, type:answer)

query: tag:java

search: Document (boost = 1.1; score:1, tag:java, type:answer)
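
A minimal sketch of the flow above, assuming a toy in-memory index. `ToyIndex`, `add_document` and `search` are hypothetical names and not Lucene classes; lowercasing stands in for the Analyzer.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for the writer/index/searcher triple."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> ascending list of docIDs
        self.stored = []                   # docID -> original field values

    def add_document(self, fields):
        """Writer side: assign the next docID and index every field as name:value."""
        doc_id = len(self.stored)
        self.stored.append(fields)
        for name, value in fields.items():
            # lower() plays the role of the analyzer in this sketch
            self.postings[f"{name}:{value.lower()}"].append(doc_id)
        return doc_id

    def search(self, term):
        """Searcher side: look the term up and return the stored documents."""
        return [self.stored[doc_id] for doc_id in self.postings.get(term, [])]


index = ToyIndex()
index.add_document({"score": "1", "tag": "java", "type": "answer"})
index.add_document({"score": "5", "tag": "css", "type": "question"})
hits = index.search("tag:java")
print(hits)
```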

Page 14: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Lucene Index Structure

Page 15: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Index

Page 16: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Index

Segment A

Page 17: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Index

Segment A

Segment B

Segment C

Segment D

Page 18: Apache Solr/Lucene Internals  by Anatoliy Sokolenko
Page 21: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Term Info Index: score:0, ..., tag:mysql, ...

Term Infos (term, document frequency) and Term Frequencies (delta-encoded doc IDs, term frequency per document):

Term            Doc freq   Doc IDs (delta-encoded)   Term freq
score:0         3          3 +1 +2                   1 1 1
score:1         4          10 +3 +1 +7               1 1 1 1
score:5         2          4 +11                     1 1
...
tag:java        2          5 +2                      1 1
tag:mysql       3          6 +52 +1                  1 1 1
tag:css         4          1 +30 +27 +2              1 1 1 1
...
type:answer     3          3 +7 +1                   1 3 5
type:question   2          5 +2                      2 1
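
The two ideas on this slide, delta-encoded doc ID lists and a sparse term index in front of the full term dictionary, can be sketched as follows. This is an illustration only, not Lucene's file format: the interval of 4, the alphabetical sort, and the function names are assumptions.

```python
from bisect import bisect_right


def delta_encode(doc_ids):
    """Store each ascending docID as the gap from the previous one."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild absolute docIDs by summing the gaps."""
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


# Sorted term dictionary ("term infos") and a sparse index holding every 4th term.
terms = sorted([
    "score:0", "score:1", "score:5",
    "tag:java", "tag:mysql", "tag:css",
    "type:answer", "type:question",
])
INTERVAL = 4  # hypothetical index interval
index_terms = terms[::INTERVAL]


def lookup(term):
    """Binary-search the small index, then scan at most INTERVAL dictionary entries."""
    block = bisect_right(index_terms, term) - 1
    if block < 0:
        return None
    start = block * INTERVAL
    for pos in range(start, min(start + INTERVAL, len(terms))):
        if terms[pos] == term:
            return pos
    return None


print(delta_decode([1, 30, 27, 2]))
print(lookup("tag:mysql"))
```

Small gaps need fewer bytes than absolute doc IDs, and the sparse index keeps only a fraction of the terms in memory while still bounding the scan per lookup.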

Page 22: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 23: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

scalable enterprise search server

Page 24: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

RequestHandlers

Data Import Handler

SolrCloud

solrconfig.xml

schema.xml

Page 25: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

SolrCloud

Page 26: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Join Cluster

Page 27: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Shard 3 Shard 2 Shard 1

Join Cluster

Page 28: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Indexing

Shard 1 Shard 2 Shard 3
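
One way to picture how a document ends up on exactly one shard is hash-based routing. The sketch below is an assumption-laden illustration, not Solr's routing code: it uses `zlib.crc32` with a modulo as a stand-in hash, and `route` is a hypothetical helper.

```python
import zlib

SHARDS = ["Shard 1", "Shard 2", "Shard 3"]


def route(doc_id, shards=SHARDS):
    """Pick a shard deterministically from the document's unique key."""
    digest = zlib.crc32(doc_id.encode("utf-8"))  # stable across runs and machines
    return shards[digest % len(shards)]


# Any node can compute the target shard, so a document may be sent to any node.
assignment = {}
for n in range(9):
    doc_id = f"doc-{n}"
    assignment.setdefault(route(doc_id), []).append(doc_id)
print(assignment)
```

Because the mapping depends only on the key, re-indexing the same document always reaches the same shard and overwrites the old version there.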

Page 34: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Query

Shard 1 Shard 2 Shard 3

Page 35: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Query

Shard 1 Shard 2 Shard 3

query tag:java
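
A distributed query is commonly handled as scatter-gather: the node that receives the query asks every shard for its best hits and merges them. The following is a rough sketch under that assumption; the data layout and the names `search_shard` and `distributed_search` are hypothetical.

```python
import heapq


def search_shard(shard_docs, term, k):
    """Each shard returns only its own top-k (score, id) pairs."""
    hits = [(doc["score"], doc["id"]) for doc in shard_docs if term in doc["terms"]]
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Coordinator: scatter the query to all shards, then merge the partial results."""
    partial = [search_shard(docs, term, k) for docs in shards]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))


shards = [
    [{"id": "a", "score": 1.0, "terms": {"tag:java"}},
     {"id": "b", "score": 0.2, "terms": {"tag:css"}}],
    [{"id": "c", "score": 2.5, "terms": {"tag:java", "type:answer"}}],
    [{"id": "d", "score": 0.7, "terms": {"tag:java"}},
     {"id": "e", "score": 3.0, "terms": {"tag:mysql"}}],
]
top = distributed_search(shards, "tag:java", k=2)
print(top)
```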

Page 39: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Failure

Shard 1 Shard 2 Shard 3

Page 42: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

CAP Model

C

A

P

SolrCloud

Solr

Page 43: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 44: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Faceted Navigation

Page 45: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 48: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Algorithm

Query Result: 7 31 58 59

Term        Index (delta-encoded)   Decoded doc IDs   Facet
tag:java    5 +2                    5 7               1
tag:mysql   6 +52 +1                6 58 59           2
tag:css     1 +30 +27 +2            1 31 58 60        2
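
The facet counts on this slide can be reproduced with a short sketch: decode each term's posting list and count how many of its doc IDs appear in the query result. The numbers come from the slide; the function names are assumptions, and a real implementation would not use plain Python sets.

```python
def delta_decode(gaps):
    """Turn stored gaps back into absolute docIDs."""
    doc_ids, current = [], 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


def facet_counts(index, query_result):
    """For each facet term, count its documents that are also in the query result."""
    result = set(query_result)
    return {
        term: sum(1 for doc_id in delta_decode(gaps) if doc_id in result)
        for term, gaps in index.items()
    }


index = {
    "tag:java": [5, 2],
    "tag:mysql": [6, 52, 1],
    "tag:css": [1, 30, 27, 2],
}
counts = facet_counts(index, [7, 31, 58, 59])
print(counts)
```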

Page 49: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 50: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Text Analysis

Page 51: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Analyzer

Tokenizer

Filter

Char filter

Page 61: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

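Analyzer

Index time

Input: <strong>There are no pointers in Java!</strong>
Char filter: There are no pointers in Java!
Tokenizer: There | are | no | pointers | in | Java
Filter: (dropped) | (dropped) | (dropped) | pointer | (dropped) | java

Query time

Input: pointers in Java
Char filter: pointers in Java
Tokenizer: pointers | in | Java
Filter: pointer | (dropped) | java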
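
The three analysis stages can be sketched as three small functions. This is a simplified illustration, not Lucene's analysis API: the stop list, the regular expressions and the plural-stripping rule are assumptions chosen so that the slide's example works.

```python
import re

STOP_WORDS = {"there", "are", "no", "in"}  # tiny illustrative stop list


def char_filter(text):
    """Stage 1: clean the raw character stream, here by stripping HTML tags."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Stage 2: split the cleaned text into tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    """Stage 3: lowercase, drop stop words, and strip a plural "s" (naive stemming)."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        result.append(token)
    return result


def analyze(text):
    return token_filter(tokenize(char_filter(text)))


indexed = analyze("<strong>There are no pointers in Java!</strong>")
queried = analyze("pointers in Java")
print(indexed, queried)
```

Running the same chain at index time and at query time is what makes the query terms line up with the terms stored in the index.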

Page 62: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 63: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Spell Suggestions

Page 64: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Levenshtein Distance

html, htmm

Levenshtein distance = 1

hlmz, html

Levenshtein distance = 2

tag:php

tag:jquery

tag:json

tag:java

tag:c#

tag:apache

tag:osx

tag:html
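
The distances above can be checked with the textbook dynamic-programming algorithm, and a naive spell suggester simply compares the misspelled word with every term in the dictionary. This is a sketch of that brute-force approach only; `suggest` and the `TAGS` list (the tag values from the slide) are illustrative names.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


TAGS = ["php", "jquery", "json", "java", "c#", "apache", "osx", "html"]


def suggest(word, terms=TAGS, max_distance=1):
    """Brute force: compute the distance to every term in the dictionary."""
    return [t for t in terms if levenshtein(word, t) <= max_distance]


print(levenshtein("html", "htmm"), levenshtein("hlmz", "html"))
print(suggest("htmm"))
```

The cost grows with the number of terms in the dictionary, since every term is compared with the input.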

Page 65: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Levenshtein Automaton

html

Levenshteindistance = 1

[Figure: state diagram of the Levenshtein automaton for "html"; transitions are labeled h, t, m, l]
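
A Levenshtein automaton accepts exactly the strings within a given edit distance of a pattern. The sketch below simulates the nondeterministic version directly, tracking states as (characters of the pattern consumed, edits used). It is an illustration of the idea only and makes no claim about how Lucene builds or runs its automata; `within_distance` is a hypothetical name.

```python
def within_distance(pattern, candidate, max_edits=1):
    """Return True if candidate is within max_edits edits of pattern."""

    def closure(states):
        # Deletion: skip a pattern character without reading input, at the cost of one edit.
        seen, stack = set(states), list(states)
        while stack:
            i, e = stack.pop()
            if i < len(pattern) and e < max_edits:
                nxt = (i + 1, e + 1)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    states = closure({(0, 0)})
    for ch in candidate:
        nxt = set()
        for i, e in states:
            if i < len(pattern) and pattern[i] == ch:
                nxt.add((i + 1, e))          # match: advance for free
            if e < max_edits:
                nxt.add((i, e + 1))          # insertion: extra character in candidate
                if i < len(pattern):
                    nxt.add((i + 1, e + 1))  # substitution
        states = closure(nxt)
        if not states:
            return False                     # dead end: reject without reading the rest
    return any(i == len(pattern) for i, _ in states)


print(within_distance("html", "htmm"), within_distance("html", "hlmz"))
```

Unlike the full distance computation, the automaton can reject a candidate as soon as no state survives, which is what makes it attractive for scanning a large term dictionary.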

Page 66: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Showcase

Page 67: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

Solr is...

• enterprise level search engine

• vertically scalable

• horizontally scalable, but...

• tunable

• poorly documented

• with active community

Page 70: Apache Solr/Lucene Internals  by Anatoliy Sokolenko

The End