Solr vs. Elasticsearch Case by Case Alexandre Rafalovitch @arafalov @SolrStart www.solr-start.com

Solr vs. Elasticsearch - Case by Case


DESCRIPTION

A presentation given at the Lucene/Solr Revolution 2014 conference to show Solr and Elasticsearch features side by side. The presentation time was only 30 minutes, so only the core usability features were compared. The full video is coming later.


Page 1: Solr vs. Elasticsearch - Case by Case

Solr vs. Elasticsearch

Case by Case

Alexandre Rafalovitch @arafalov

@SolrStart

www.solr-start.com

Page 2: Solr vs. Elasticsearch - Case by Case

Meet the FRENEMIES

Friends (common)
• Based on Lucene
• Full-text search
• Structured search
• Queries, filters, caches
• Facets/stats/enumerations
• Cloud-ready

Elasticsearch*

* Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.

Enemies (differences)
• Download size
• AdminUI vs. Marvel
• Configuration vs. Magic
• Nested documents
• Chains vs. Plugins
• Types and Rivers
• OpenSource vs. Commercial
• Etc.

Page 3: Solr vs. Elasticsearch - Case by Case

This used to be Solr (now in Lucene/ES)

• Field types
• Dismax/eDismax
• Many of the analysis filters (WordDelimiterFilter, Soundex, Regex, HTML, kstem, Trim…)
• Multi-valued field cache
• …. (source: http://heliosearch.org/lucene-solr-history/ )

• Disclaimer: Nowadays, Elasticsearch hires awesome Lucene hackers

Page 4: Solr vs. Elasticsearch - Case by Case

Basically - sisters

Source: https://www.flickr.com/photos/franzfume/11530902934/

[Bar chart: Download, Expanded, and First run sizes for Solr vs. Elasticsearch; axis from 0 to 300]

Page 5: Solr vs. Elasticsearch - Case by Case

Solr: Chubby or Rubenesque?

[Stacked bar chart: Solr vs. Elasticsearch+plugins; axis from 0.00 to 300.00; segments: Code, Examples, Documentation, ES-Admin, ES-ICU, Extract/Tika, UIMA, Map-Reduce, Test Framework]

Page 6: Solr vs. Elasticsearch - Case by Case

Elasticsearch setup

Source: https://www.flickr.com/photos/deborah-is-lola/6815624125/

• Admin UI: bin/plugin -i elasticsearch/marvel/latest

• Tika/Extraction: bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.4.1

• ICU (Unicode components): bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1

• JDBC River (like a subset of DataImportHandler): bin/plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip

• JavaScript scripting support: bin/plugin -install elasticsearch/elasticsearch-lang-javascript/2.4.1

• On each node….

• Without dependency management (jars = rabbits)

Page 7: Solr vs. Elasticsearch - Case by Case

Index a document - Elasticsearch

1. Set up an index/collection

2. Define fields and types

3. Index content (using Marvel Sense):

POST /test1/hello

{

"msg": "Happy birthday",

"names": ["Alex", "Mark"],

"when": "2014-11-01T10:09:08"

}

Alternative:

PUT /test1/hello/id1

{

"msg": "Happy birthday",

"names": ["Alex", "Mark"],

"when": "2014-11-01T10:09:08"

}

An index, type and definitions are created automatically

So, where is our document:

GET /test1/hello/_search

{

"took": 1,

"timed_out": false,

"_shards": {

"total": 5,

"successful": 5,

"failed": 0

},

"hits": {

"total": 1,

"max_score": 1,

"hits": [

{

"_index": "test1",

"_type": "hello",

"_id": "AUmIk4LDF4XvfpxnVJ2g",

"_score": 1,

"_source": {

"msg": "Happy birthday",

"names": [

"Alex",

"Mark"

],

"when": "2014-11-01T10:09:08"

}}

]

}}

Page 8: Solr vs. Elasticsearch - Case by Case

Behind the scenes

GET /test1/hello/_search

…..{

"_index": "test1",

"_type": "hello",

"_id": "AUmIk4LDF4XvfpxnVJ2g",

"_score": 1,

"_source": {

"msg": "Happy birthday",

"names": [

"Alex",

"Mark"

],

"when": "2014-11-01T10:09:08"

}

….

GET /test1/hello/_mapping

{

"test1": {

"mappings": {

"hello": {

"properties": {

"msg": {

"type": "string"

},

"names": {

"type": "string"

},

"when": {

"type": "date",

"format": "dateOptionalTime"

}}}}}}

Page 9: Solr vs. Elasticsearch - Case by Case

Basic search in Elasticsearch

GET /test1/hello/_search

…..{

"_index": "test1",

"_type": "hello",

"_id": "AUmIk4LDF4XvfpxnVJ2g",

"_score": 1,

"_source": {

"msg": "Happy birthday",

"names": [

"Alex",

"Mark"

],

"when": "2014-11-01T10:09:08"

}

….

• GET /test1/hello/_search?q=foobar – no results
• GET /test1/hello/_search?q=Alex – YES on names?
• GET /test1/hello/_search?q=alex – YES lower case
• GET /test1/hello/_search?q=happy – YES on msg?
• GET /test1/hello/_search?q=2014 – YES???
• GET /test1/hello/_search?q="birthday alex" – YES
• GET /test1/hello/_search?q="birthday mark" – NO

Issues:
1. Where are we actually searching?
2. Why do lower-case searches work?
3. What's so special about Alex?

Page 10: Solr vs. Elasticsearch - Case by Case

All about _all and why strings are tricky

• By default, we search in the field _all

• What's an _all field in Solr terms?

<field name="_all" type="es_string" multiValued="true" indexed="true" stored="false"/>

<copyField source="*" dest="_all"/>

• And the default mapping for Elasticsearch "string" type is like:

<fieldType name="es_string" class="solr.TextField" multiValued="true" positionIncrementGap="0" >

<analyzer>

<tokenizer class="solr.StandardTokenizerFactory"/>

<filter class="solr.LowerCaseFilterFactory"/>

</analyzer>

</fieldType>

• Elasticsearch equivalent to Solr's solr.StrField is:{"type" : "string", "index" : "not_analyzed"}
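The puzzling results from the previous slide follow from the two definitions above: everything is copied into one lower-cased field, and positionIncrementGap="0" leaves no gap between copied values. The Python sketch below imitates that behaviour in memory. It is an illustration only: the regex tokenizer is a crude stand-in for StandardTokenizer, the function names are hypothetical, and it assumes values reach _all in document order.

```python
import re

def analyze(text):
    # Simplified stand-in for StandardTokenizer + LowerCaseFilter (assumption).
    return [tok.lower() for tok in re.findall(r"\w+", str(text))]

def index_all(doc, gap=0):
    """Copy every value of every field into one positional token stream."""
    positions = {}  # term -> set of token positions
    pos = 0
    for value in doc.values():
        for item in (value if isinstance(value, list) else [value]):
            for term in analyze(item):
                positions.setdefault(term, set()).add(pos)
                pos += 1
            pos += gap  # positionIncrementGap between copied values
    return positions

def matches(positions, query):
    """True if the query terms occur at consecutive positions (a phrase match)."""
    terms = analyze(query)
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )

doc = {
    "msg": "Happy birthday",
    "names": ["Alex", "Mark"],
    "when": "2014-11-01T10:09:08",
}
all_field = index_all(doc, gap=0)

for q in ["foobar", "Alex", "alex", "happy", "2014", "birthday alex", "birthday mark"]:
    print(q, "->", matches(all_field, q))
```

With gap=0, "birthday" and "alex" sit at adjacent positions even though they come from different fields, so the phrase matches; "mark" is one position further away, so "birthday mark" does not. Passing a larger gap, such as the 100 Solr uses for text_fr, separates the values and the cross-field phrase stops matching.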

Page 11: Solr vs. Elasticsearch - Case by Case

Can Solr do the same kind of magic?

• curl 'http://localhost:8983/solr/collection1/update/json/docs' -H 'Content-type:application/json' -d @msg.json

curl 'http://localhost:8983/solr/collection1/select'

{

"responseHeader":{

"status":0,

"QTime":18,

"params":{}},

"response":{"numFound":1,"start":0,"docs":[

{

"msg":["Happy birthday"],

"names":["Alex", "Mark"],

"when":["2014-11-01T10:09:08Z"],

"_id":"e9af682d-e775-42f2-90a5-c932b5fbb691",

"_version_":1484096406012559360}]

}}

curl 'http://localhost:8983/solr/collection1/schema/fields'

{

"responseHeader":{

"status":0,

"QTime":1},

"fields":[

{"name":"_all", "type":"es_string",

"multiValued":true,

"indexed":true, "stored":false},

{"name":"_id", "type":"string",

"multiValued":false,

"indexed":true, "required":true,

"stored":true, "uniqueKey":true},

{"name":"_version_", "type":"long",

"indexed":true, "stored":true},

{"name":"msg", "type":"es_string"},

{"name":"names", "type":"es_string"},

{"name":"when", "type":"tdates"}]}

• Output slightly reformatted

Page 12: Solr vs. Elasticsearch - Case by Case

Nearly the same magic

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">

<!-- UUIDUpdateProcessorFactory will generate an id if none is

present in the incoming document -->

<processor class="solr.UUIDUpdateProcessorFactory" />

<processor class="solr.LogUpdateProcessorFactory"/>

<processor class="solr.DistributedUpdateProcessorFactory"/>

<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>

<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>

<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>

<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>

<processor class="solr.ParseDateFieldUpdateProcessorFactory">

<arr name="format">

<str>yyyy-MM-dd'T'HH:mm:ss</str>

<str>yyyyMMdd'T'HH:mm:ss</str>

</arr>

</processor>

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">

<str name="defaultFieldType">es_string</str>

<lst name="typeMapping">

<str name="valueClass">java.lang.Boolean</str>

<str name="fieldType">booleans</str>

</lst>

<lst name="typeMapping">

<str name="valueClass">java.util.Date</str>

<str name="fieldType">tdates</str>

</lst>

</processor>

<processor class="solr.RunUpdateProcessorFactory"/>

</updateRequestProcessorChain>

Not quite the same magic:
• URP chain happens before copyField
• Date/Ints are converted first
• copyField converts content back to string
• _all field also gets copy of _id and _version_
• All auto-mapped fields HAVE to be multivalued
• No (ES-style) types, just collections
• Unable to reproduce cross-field search
• Still rough around the edges
• Requires dynamic schema, so adding new types becomes a challenge

• Auto-mapping is NOT recommended for production

• Dynamic fields solution is still more mature
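The order of the Parse*FieldUpdateProcessorFactory entries in the chain matters: each incoming string is offered to the boolean, long, double, and date parsers in turn, and whatever survives unparsed falls back to defaultFieldType. The Python sketch below mimics that cascade. It is a rough model, not Solr's implementation: the booleans and tdates mappings come from the slide, while tlongs and tdoubles are assumed names for the mappings the slide leaves out.

```python
from datetime import datetime

# Date patterns from the slide, translated to strptime syntax.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y%m%dT%H:%M:%S"]

def parse_boolean(text):
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    raise ValueError(text)

def parse_date(text):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(text)

# Same order as the processors in the chain: Boolean, Long, Double, Date.
PARSERS = [parse_boolean, int, float, parse_date]

FIELD_TYPES = {
    bool: "booleans",    # from the slide
    datetime: "tdates",  # from the slide
    int: "tlongs",       # assumption: mapping not shown on the slide
    float: "tdoubles",   # assumption: mapping not shown on the slide
}

def guess_value(text):
    """Return the first successful parse, or the original string."""
    for parser in PARSERS:
        try:
            return parser(text)
        except ValueError:
            continue
    return text

def guess_field_type(text, default="es_string"):
    return FIELD_TYPES.get(type(guess_value(text)), default)

for sample in ["Happy birthday", "2014-11-01T10:09:08", "true", "42", "3.14"]:
    print(sample, "->", guess_field_type(sample))
```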

Page 13: Solr vs. Elasticsearch - Case by Case

Explicit mapping - Solr

• In schema.xml (or dynamic equivalent)

• Uses Java Factories

• Related content (e.g. stopwords) is usually in separate files (REST-managed resources were added recently)

• French example:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">

<analyzer>

<tokenizer class="solr.StandardTokenizerFactory"/>

<filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>

<filter class="solr.LowerCaseFilterFactory"/>

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" />

<filter class="solr.FrenchLightStemFilterFactory"/>

</analyzer>

</fieldType>

Page 14: Solr vs. Elasticsearch - Case by Case

Explicit mapping - Elasticsearch

• Created through a PUT command

• Can also be stored in config/default-mapping.json or config/mappings/[index_name]

• Mappings for all types in one index should be compatible to avoid problems

• Usually uses predefined mapping names. There are many of them, including ones for languages

• Explicit mapping is through named cross-references, rather than a stack duplicated in place (like Solr)

• Related content is usually also in the definition. Sometimes it is in a file (e.g. stopwords_path, which needs to be on all nodes)

• French example (next slide):

Page 15: Solr vs. Elasticsearch - Case by Case

Explicit mapping – Elasticsearch - French

{

"settings": {

"analysis": {

"filter": {

"french_elision": {

"type": "elision",

"articles": [ "l", "m", "t", "qu",

"n", "s", "j", "d", "c", "jusqu", "quoiqu",

"lorsqu", "puisqu"

]

},

"french_stop": {

"type": "stop",

"stopwords": "_french_"

},

"french_keywords": {

"type": "keyword_marker",

"keywords": []

},

"french_stemmer": {

"type": "stemmer",

"language": "light_french"

}

},

….

"analyzer": {

"french": {

"tokenizer": "standard",

"filter": [

"french_elision",

"lowercase",

"french_stop",

"french_keywords",

"french_stemmer"

]

}

}

}

}

}

Page 16: Solr vs. Elasticsearch - Case by Case

Default analyzer - Elasticsearch

Indexing
1. the analyzer defined in the field mapping, else
2. the analyzer defined in the _analyzer field of the document, else
3. the default analyzer for the type, which defaults to
4. the analyzer named default in the index settings, which defaults to
5. the analyzer named default at node level, which defaults to
6. the standard analyzer

Query
1. the analyzer defined in the query itself, else
2. the analyzer defined in the field mapping, else
3. the default analyzer for the type, which defaults to
4. the analyzer named default in the index settings, which defaults to
5. the analyzer named default at node level, which defaults to
6. the standard analyzer
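Both lists are the same "first one that is set wins" lookup with a different first step. The Python sketch below spells that out. The dictionaries and their keys are hypothetical stand-ins for the field mapping, document, type, index, and node settings; this is not how Elasticsearch stores them internally.

```python
def first_defined(candidates):
    """Return the first (source, analyzer) pair whose analyzer is set."""
    for source, analyzer in candidates:
        if analyzer:
            return source, analyzer
    raise LookupError("no analyzer configured")

def shared_fallbacks(type_mapping, index_settings, node_settings):
    # Steps 3 to 6 are identical for indexing and querying.
    return [
        ("type default", type_mapping.get("analyzer")),
        ("index settings default", index_settings.get("default")),
        ("node default", node_settings.get("default")),
        ("built-in", "standard"),
    ]

def index_analyzer(field_mapping, document, type_mapping, index_settings, node_settings):
    return first_defined(
        [
            ("field mapping", field_mapping.get("analyzer")),
            ("document _analyzer field", document.get("_analyzer")),
        ]
        + shared_fallbacks(type_mapping, index_settings, node_settings)
    )

def query_analyzer(query, field_mapping, type_mapping, index_settings, node_settings):
    return first_defined(
        [
            ("query", query.get("analyzer")),
            ("field mapping", field_mapping.get("analyzer")),
        ]
        + shared_fallbacks(type_mapping, index_settings, node_settings)
    )

print(index_analyzer({}, {}, {}, {}, {}))
print(index_analyzer({}, {"_analyzer": "french"}, {}, {"default": "simple"}, {}))
print(query_analyzer({"analyzer": "whitespace"}, {"analyzer": "french"}, {}, {}, {}))
```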

Page 17: Solr vs. Elasticsearch - Case by Case

Index many documents – Elasticsearch

POST /test3/entries/_bulk
{ "index": {"_id": "1" } }
{"msg": "Hello", "names": ["Jack", "Jill"]}
{ "index": {"_id": "2" } }
{"msg": "Goodbye", "names": "Jason"}
{ "delete" : {"_id" : "3" } }

NOTE: Rivers (similar to DIH) MAY be deprecated. Use Logstash instead (180 MB on disk, including 2 JRuby runtimes!)
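The bulk body is line-oriented: each action is one JSON object on its own line, an index action is followed by a line holding the document, and a delete action stands alone. A minimal Python sketch of building such a body is shown below. The bulk_body helper and its tuple layout are hypothetical, and sending the request is left out.

```python
import json

def bulk_body(actions):
    """Build a newline-delimited bulk body from (action, metadata, source) tuples."""
    lines = []
    for action, metadata, source in actions:
        lines.append(json.dumps({action: metadata}))
        if source is not None:  # delete actions carry no document line
            lines.append(json.dumps(source))
    # The body ends with a newline after the last line.
    return "\n".join(lines) + "\n"

actions = [
    ("index", {"_id": "1"}, {"msg": "Hello", "names": ["Jack", "Jill"]}),
    ("index", {"_id": "2"}, {"msg": "Goodbye", "names": "Jason"}),
    ("delete", {"_id": "3"}, None),
]

body = bulk_body(actions)
print(body, end="")
```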

Page 18: Solr vs. Elasticsearch - Case by Case

Index many documents - Solr

JSON - simple

[

{

"_id": "1",

"msg": "Hello",

"names": ["Jack", "Jill"]

},

{

"_id": "2",

"msg": "Goodbye",

"names": "Jason"

}

]

JSON – with commands

{

"add": { "doc": {

"_id": "1",

"msg": "Hello",

"names": ["Jack", "Jill"]

} },

"add": { "doc": {

"_id": "2",

"msg": "Goodbye",

"names": "Jason"

} },

"delete": { "_id":3 }

}

Also:
• CSV
• XML
• XML+XSLT
• JSON+transform (4.10)
• DataImportHandler
• Map-Reduce

External tools
• Logstash (owned by ES)

Page 19: Solr vs. Elasticsearch - Case by Case

Comparing search - Search

• Same but different
• Same: vast majority of the features come from Lucene
• Different: representation of search parameters
• Solr: URL query with many – cryptic – parameters
• Elasticsearch:
  • Search lite: URL query with a limited set of parameters (basic Lucene query)
  • Query DSL: JSON with multi-leveled structure

[Venn diagram: shared Lucene implementation, with ES-only and Solr-only regions]

Page 20: Solr vs. Elasticsearch - Case by Case

Search compared – Simple searches

{

"msg": "Happy birthday",

"names": ["Alex", "Mark"],

"when": "2014-11-01T10:09:08"

}

{

"msg": "Happy New Year",

"names": ["Jack", "Jill"],

"when": "2015-01-01T00:00:01"

}

{

"msg": "Goodbye",

"names": ["Jack", "Jason"],

"when": "2015-06-01T00:00:00"

}

Elasticsearch (Marvel Sense GET):
• /test1/hello/_search – all
• /test1/hello/_search?q=happy birthday Alex – 2
• /test1/hello/_search?q=names:Alex – 1

Solr (GET http://localhost:8983/solr/…):
• /collection1/select – all
• /collection1/select?q=happy birthday Alex – 2
• /collection1/select?q=names:Alex – 1

Page 21: Solr vs. Elasticsearch - Case by Case

Search Compared – Query DSL

Elasticsearch

GET /test1/hello/_search

{

"query": {

"query_string": {

"fields": ["msg^5", "names"],

"query": "happy birthday Alex",

"minimum_should_match": "100%"

}

}

}

Solr

…/collection1/select

?q=happy birthday Alex

&defType=dismax

&qf=msg^5 names

&mm=100%
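The two requests carry the same four pieces of information (query text, boosted fields, parser choice, minimum match), only packaged differently. The Python sketch below builds both from one set of inputs using the standard library. The helper names are hypothetical, and the Solr parameters are URL-encoded, which the slide omits for readability.

```python
import json
from urllib.parse import urlencode

def boosted(fields):
    """Render {"msg": 5, "names": 1} as ["msg^5", "names"]."""
    return [f"{name}^{boost}" if boost != 1 else name for name, boost in fields.items()]

def es_query(text, fields, mm="100%"):
    # JSON body for GET /test1/hello/_search
    return {
        "query": {
            "query_string": {
                "fields": boosted(fields),
                "query": text,
                "minimum_should_match": mm,
            }
        }
    }

def solr_params(text, fields, mm="100%"):
    # Query string for …/collection1/select
    return urlencode(
        {"q": text, "defType": "dismax", "qf": " ".join(boosted(fields)), "mm": mm}
    )

fields = {"msg": 5, "names": 1}
print(json.dumps(es_query("happy birthday Alex", fields), indent=2))
print(solr_params("happy birthday Alex", fields))
```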

Page 22: Solr vs. Elasticsearch - Case by Case

Search Compared – Query DSL - combo

Elasticsearch

GET /test1/hello/_search

{

"size" : 1,

"query": {

"filtered": {

"query": {

"query_string": {

"query": "jack"

}},

"filter": {

"range": {

"when": {

"gte": "now"

}}}}}}

Solr

…/collection1/select

?q=jack

&fq=when:[NOW TO *]

&rows=1

Search future entries about Jack. Return only the best one.

Page 23: Solr vs. Elasticsearch - Case by Case

Parent/Child structures

Inner objects
• Mapping: Object
• Dynamic mapping (default)
• NOT separate Lucene docs
• Map to flattened multivalued fields
• Search matches against value from ANY of inner objects

{

"followers.age": [19, 26],

"followers.name":

[alex, lisa]

}

Elasticsearch

Nested objects
• Mapping: nested
• Explicit mapping
• Lucene block storage
• Inner documents are hidden
• Cannot return inner docs only
• Can do nested & inner

Parent and Child
• Mapping: _parent
• Explicit references
• Separate documents
• In-memory join
• SLOW

Solr

Nested objects
• Lucene block storage
• All documents are visible
• Child JSON is less natural
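The practical difference between inner and nested objects is easiest to see on a query that combines two conditions. The Python sketch below flattens inner objects into multivalued fields, as in the followers example on this slide, and compares that with matching each child separately. The pairing of names and ages (alex is 19, lisa is 26) and the helper names are assumptions made for illustration.

```python
def flatten_inner(doc, prefix=""):
    """Flatten inner objects into multivalued 'parent.child' fields."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        for item in (value if isinstance(value, list) else [value]):
            if isinstance(item, dict):
                for sub_path, sub_values in flatten_inner(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

def match_flat(flat, wanted):
    # Inner objects: each condition may be satisfied by a different child.
    return all(value in flat.get(field, []) for field, value in wanted.items())

def match_nested(children, wanted):
    # Nested objects: one child has to satisfy every condition.
    return any(
        all(child.get(field) == value for field, value in wanted.items())
        for child in children
    )

doc = {"followers": [{"name": "alex", "age": 19}, {"name": "lisa", "age": 26}]}
flat = flatten_inner(doc)
print(flat)

# No single follower is a 19-year-old called lisa, yet the flattened form matches.
print(match_flat(flat, {"followers.name": "lisa", "followers.age": 19}))
print(match_nested(doc["followers"], {"name": "lisa", "age": 19}))
```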

Page 24: Solr vs. Elasticsearch - Case by Case

Cloud deployment – quick take

1. General concepts are similar:

• Node discovery

• Sharding

• Replication

• Routing

2. Implementations are very, very different (layer above Lucene)

3. Solr uses Apache Zookeeper

4. Elasticsearch has its own algorithms

5. No time to discuss

6. Let's focus on the critical path: Node discovery/cloud-state management

7. Use a 3rd party analysis: Kyle Kingsbury's Jepsen tests

Page 25: Solr vs. Elasticsearch - Case by Case

Jepsen test of Zookeeper

Use Zookeeper. It’s mature, well-designed, and battle-tested.

Page 26: Solr vs. Elasticsearch - Case by Case

Jepsen test of Elasticsearch

If you are an Elasticsearch user (as I am): good luck.

Page 27: Solr vs. Elasticsearch - Case by Case

Innovator’s dilemma

• Solr's usual attitude

• An amazingly useful product for many different uses

• And wants everybody to know it

• …Right in the collection1 example

• “You will need all this eventually, might as well learn it first”

• Elasticsearch is small and shiny (“trust us, the magic exists”)

• Elasticsearch + Logstash + Kibana => power-punch triple combo

• Especially when comparing to Solr (and not another commercial solution)

• Feature release process

• Elasticsearch: kimchy: “LGTM” (Looks good to me)

• Solr: full Apache process around it

• Solr – needs to buckle down and focus on onboarding experience

• Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Page 28: Solr vs. Elasticsearch - Case by Case

Solr vs. Elasticsearch

Case by Case

Alexandre Rafalovitch

www.solr-start.com

@arafalov

@SolrStart