Webinar: What's New in Solr 6

What’s New in Solr 6

Cassandra Targett

2016

OCTOBER 11-14, BOSTON, MA

Introduction

• Lucene/Solr committer since 2013

• Director of Engineering at Lucidworks

Solr 6 builds on the innovations of Solr 5

• Easy to use

• Scalable

• Secure

Solr 5 Main Themes

• Easy to Use

• bin/solr and bin/post improvements

• JSON-based facets

• More APIs

• Modern UI (Angular-based)

• Scalable

• SolrCloud hardening

• Replica placement strategy

• Streaming expressions

• Secure

• Authentication and Authorization frameworks

Highlights of Recent Solr Releases (5.4 and 5.5)

• Solr 5.4

• Basic authentication

• ConfigSets API

• FORCELEADER command

• Optimizations for faceting DocValue fields

• Solr 5.5

• Ability to edit ZooKeeper configs with bin/solr

• Rule-based authorization flexibility

• XML query parser

• More async collection APIs

Solr 6 introduces several new features

• Parallel SQL

• Cross Data Center Replication

• Graph Traversal

• Modern APIs

• Jetty 9.3 and HTTP/2

Parallel SQL

Parallelized SQL support in Solr for scalable relational algebra

Seamlessly combines SQL with Solr’s full-text capabilities

• Realtime MapReduce(ish) or Facet aggregation modes

• Parallel execution of queries across SolrCloud

• Advanced SQL syntax for powerful queries

Parallel SQL builds on Solr’s Streaming Capabilities

• Export request handler (/export)

• Streaming API

• Streams tuples in JSON

• new package: org.apache.solr.client.solrj.io

• Streaming Expressions (/stream)

• Allows non-Java programmers to access Streaming API

• Expressions are essentially functions which originate the stream or operate on the stream

Streaming Expression Request - search

curl -d 'expr=search(gettingstarted, q="*:*", fl="id, manu_exact", sort="manu_exact asc")' http://localhost:8983/solr/gettingstarted/stream

{"result-set": {

"docs": [ {"manu_exact": "A-DATA Technology Inc.", "id": "VDBDB1A16"}, {"manu_exact": "ASUS Computer Inc.", "id": "EN7800GTX/2DHTV/256M"}, {"manu_exact": "ATI Technologies", "id": "100-435805"}… {"EOF": true,"RESPONSE_TIME": 15}]

}}

Functions, aka Stream Sources and Stream Decorators

• Define how data is retrieved and any aggregations performed

• Designed to work with entire result sets

• Can be compounded or wrapped to perform several operations at the same time
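
The wrapping described above can be sketched as plain string composition. This is only an illustration: `fn` and `kv` are hypothetical helpers, not part of Solr or SolrJ, and the expression they build is the reduce/search/group example used in this deck.

```python
def kv(key, value):
    """Render a named parameter such as q="inStock:true"."""
    return f'{key}="{value}"'


def fn(name, *args):
    """Render a function call; args are collection names, named
    parameters, or other already-rendered expressions."""
    return f"{name}({', '.join(args)})"


# A stream source: it originates the stream of tuples.
inner = fn("search", "gettingstarted",
           kv("q", "inStock:true"), kv("qt", "/export"),
           kv("fl", "id,manu_exact"), kv("sort", "manu_exact asc"))

# A stream decorator wrapped around the source.
wrapped = fn("reduce", inner, kv("by", "manu_exact"),
             fn("group", kv("sort", "manu_exact asc"), kv("n", "2")))

print(wrapped)
```

The resulting string is what would be sent as the `expr` parameter to the /stream handler.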

Streaming Expression Request - reduce

curl http://localhost:8983/solr/gettingstarted/stream -d 'expr=reduce(search(gettingstarted, q="inStock:true", qt="/export", fl="id,manu_exact", sort="manu_exact asc"), by="manu_exact", group(sort="manu_exact asc", n="2"))'

Streaming Expression Response

{"result-set": {"docs":[
{"id":"0380014300","group":[{"id":"0380014300"},{"id":"0553573403"}]},
{"manu_exact":"A-DATA Technology Inc.","id":"VDBDB1A16","group":[{"manu_exact":"A-DATA Technology Inc.","id":"VDBDB1A16"}]},
{"manu_exact":"Apache Software Foundation","id":"UTF8TEST","group":[{"manu_exact":"Apache Software Foundation","id":"UTF8TEST"},{"manu_exact":"Apache Software Foundation","id":"SOLR1000"}]},
{"manu_exact":"Apple Computer Inc.","id":"MA147LL/A","group":[{"manu_exact":"Apple Computer Inc.","id":"MA147LL/A"}]},
{"manu_exact":"Bank of America","id":"USD","group":[{"manu_exact":"Bank of America","id":"USD"}]},
{"manu_exact":"Bank of Norway","id":"NOK","group":[{"manu_exact":"Bank of Norway","id":"NOK"}]},
{"manu_exact":"Canon Inc.","id":"9885A004","group":[{"manu_exact":"Canon Inc.","id":"9885A004"},{"manu_exact":"Canon Inc.","id":"0579B002"}]},
{"manu_exact":"Corsair Microsystems Inc.","id":"VS1GB400C3","group":[{"manu_exact":"Corsair Microsystems Inc.","id":"VS1GB400C3"},{"manu_exact":"Corsair Microsystems Inc.","id":"TWINX2048-3200PRO"}]},
{"manu_exact":"Dell, Inc.","id":"3007WFP","group":[{"manu_exact":"Dell, Inc.","id":"3007WFP"}]},
{"EOF":true,"RESPONSE_TIME":24}]}}
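
A client consuming this response has to separate the data tuples from the final EOF marker. The following is a minimal sketch using only the standard library; `split_tuples` is a hypothetical helper, and `raw` is a shortened copy of the response above.

```python
import json

# Shortened copy of the response: two tuples plus the EOF marker.
raw = '''{"result-set": {"docs": [
  {"manu_exact": "Canon Inc.", "id": "9885A004",
   "group": [{"manu_exact": "Canon Inc.", "id": "9885A004"},
             {"manu_exact": "Canon Inc.", "id": "0579B002"}]},
  {"manu_exact": "Dell, Inc.", "id": "3007WFP",
   "group": [{"manu_exact": "Dell, Inc.", "id": "3007WFP"}]},
  {"EOF": true, "RESPONSE_TIME": 24}]}}'''


def split_tuples(response_text):
    """Return (data tuples, EOF tuple) from a /stream style response."""
    docs = json.loads(response_text)["result-set"]["docs"]
    tuples = [d for d in docs if not d.get("EOF")]
    eof = next((d for d in docs if d.get("EOF")), None)
    return tuples, eof


tuples, eof = split_tuples(raw)
# Number of grouped documents per manufacturer (capped by n="2" in the request).
group_sizes = {t["manu_exact"]: len(t["group"]) for t in tuples}
print(group_sizes, eof["RESPONSE_TIME"])
```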

Available Functions

• Stream Sources

• Search

• JDBC

• Facet

• Stats

• Topic

• Stream Decorators

• Complement, Unique, Intersect

• leftOuterJoin, innerJoin, hashJoin, outerHashJoin

• Top, Rollup, Facet

• Parallel

• Decorators, cont’d

• Update

• Merge

• Group, Reduce

• Daemon

• Select

Streaming Expression Request - parallel

curl http://localhost:8983/solr/gettingstarted/stream -d 'expr=parallel(workcollection, search(gettingstarted, q="inStock:true", fl="id, manu_exact", sort="manu_exact asc", partitionKeys="manu_exact"), workers=2, zkHost="localhost:9983", sort="manu_exact asc")'
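
The roles of `partitionKeys` and the outer `sort` can be illustrated with a toy model: tuples with the same key always land on the same worker, and the sorted worker outputs are merged back into one ordered stream. This is a conceptual sketch only. The CRC32 hash and the `worker_for` helper are assumptions made for illustration, not Solr's actual partitioning code.

```python
import heapq
import zlib


def worker_for(tup, partition_keys, workers):
    """Pick a worker by hashing the partition key values.
    Equal keys always map to the same worker."""
    key = "|".join(str(tup[k]) for k in partition_keys)
    return zlib.crc32(key.encode("utf-8")) % workers


# Tuples as the search would return them, already sorted by manu_exact.
tuples = [
    {"id": "VDBDB1A16", "manu_exact": "A-DATA Technology Inc."},
    {"id": "UTF8TEST", "manu_exact": "Apache Software Foundation"},
    {"id": "SOLR1000", "manu_exact": "Apache Software Foundation"},
    {"id": "MA147LL/A", "manu_exact": "Apple Computer Inc."},
]

per_worker = {0: [], 1: []}
for tup in tuples:
    per_worker[worker_for(tup, ["manu_exact"], 2)].append(tup)

# Each worker's stream stays sorted, so a k-way merge restores global order.
merged = list(heapq.merge(*per_worker.values(),
                          key=lambda t: t["manu_exact"]))
print([t["id"] for t in merged])
```

Partitioning on the grouping field is what lets each worker aggregate its groups without seeing any other worker's tuples.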

Parallel SQL builds on export and streaming

• SQL statements translated into Streaming Expressions

• Automatic merge of results from worker nodes

• Advanced SQL syntax

SQL Syntax

• SELECT and SELECT DISTINCT

• select id, manu_exact from techproducts

• select distinct id, manu_exact from techproducts

• WHERE

• select id, manu_exact from techproducts where inStock=true

• select id, manu_exact from techproducts where price='[10 TO 50]'

• select id, manu_exact from techproducts where cat='(electronics OR music)'

SQL Syntax

• ORDER BY and LIMIT

• select id, manu_exact from techproducts order by manu_exact asc

• select id, manu_exact from techproducts limit 10

• GROUP BY

• select id, manu_exact from techproducts where inStock=true group by manu

SQL Syntax

• Stats

• select count(manu_exact) as count, avg(price) as avg from techproducts

• HAVING

• select id, manu_exact from techproducts where inStock=true having (avg(price)>5) order by manu_exact asc

SQL Statement and Results

curl --data-urlencode "stmt=select id, manu_exact from techproducts where inStock='true' order by manu_exact limit 5" http://localhost:8983/solr/techproducts/sql

{"result-set":{"docs":[ {"manu_exact":"A-DATA Technology Inc.","id":"VDBDB1A16"}, {"manu_exact":"Apache Software Foundation","id":"SOLR1000"}, {"manu_exact":"Apache Software Foundation","id":"UTF8TEST"}, {"manu_exact":"Apple Computer Inc.","id":"MA147LL/A"}, {"manu_exact":"Bank of America","id":"USD"}, {"EOF":"true","RESPONSE_TIME":8}]}}
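
The same statement can be sent from a script. The sketch below only builds the POST request with the standard library and does not send it. `sql_request` is a hypothetical helper; the `stmt` parameter comes from the curl example, and `aggregationMode` is the request parameter the Solr Reference Guide describes for choosing an aggregation mode.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def sql_request(base_url, collection, stmt, aggregation_mode=None):
    """Build (but do not send) a form-encoded POST to the /sql handler."""
    params = {"stmt": stmt}
    if aggregation_mode:
        params["aggregationMode"] = aggregation_mode
    body = urlencode(params).encode("utf-8")  # handles quotes and spaces
    return Request(f"{base_url}/{collection}/sql", data=body)


stmt = ("select id, manu_exact from techproducts "
        "where inStock='true' order by manu_exact limit 5")
req = sql_request("http://localhost:8983/solr", "techproducts", stmt)

# urllib.request.urlopen(req) would send it to a running Solr instance.
print(req.get_method(), req.full_url)
print(parse_qs(req.data.decode("utf-8"))["stmt"][0])
```

URL-encoding the statement avoids the shell quoting problems that come from nesting single quotes inside a single-quoted curl argument.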

Aggregation Modes

• map_reduce

• Tuples are shuffled to worker nodes, where aggregation occurs

• Tuples are sent to worker nodes sorted by GROUP BY fields

• Great for high cardinality

• facet

• Pushes computation to JSON Facet API - only aggregates are sent over the network

• Great for low-to-moderate cardinality
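
The difference between the two modes can be modeled in a few lines. This is a toy sketch, not Solr code: in the map_reduce style every tuple travels to a worker sorted by the GROUP BY field and is rolled up in one pass, while in the facet style each shard computes its own counts and only those aggregates are merged. The shard contents and prices are made-up sample data.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical tuples held by two shards.
shard_a = [{"manu_exact": "Canon Inc.", "price": 179.99},
           {"manu_exact": "Dell, Inc.", "price": 2199.0}]
shard_b = [{"manu_exact": "Canon Inc.", "price": 329.95}]


def map_reduce_counts(shards, field):
    """All tuples are shipped, sorted by the grouping field,
    then rolled up group by group in a single streaming pass."""
    tuples = sorted((t for s in shards for t in s), key=itemgetter(field))
    return {key: sum(1 for _ in grp)
            for key, grp in groupby(tuples, key=itemgetter(field))}


def facet_counts(shards, field):
    """Each shard aggregates locally; only the per-shard
    counts cross the network and are summed."""
    total = Counter()
    for shard in shards:
        total.update(Counter(t[field] for t in shard))
    return dict(total)


print(map_reduce_counts([shard_a, shard_b], "manu_exact"))
print(facet_counts([shard_a, shard_b], "manu_exact"))
```

Both produce the same answer. The trade-off is in what moves over the network: every tuple in the first case, one count per distinct value per shard in the second, which is why the facet mode suits lower cardinality.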

Parallel SQL with map_reduce Aggregation Mode

SQL Tier: Client, /sql handler

Worker Tier: worker 1, worker 2, worker 3, worker 4

Data Tier: shards s1 through s4, each with replicas r1 through r3 (s1_r1 to s4_r3)

Each worker queries 1 replica in each shard

JDBC Driver

• Solr now includes a JDBC driver which can be used to query Solr

• Can be used only with the SQL handler

• DB visualization tools can also be used, such as Apache Zeppelin, Squirrel, DBVisualizer, etc.
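
Tools that use the driver need a connection string. As far as the Solr Reference Guide describes it, the string points at the ZooKeeper ensemble and names the collection, with optional settings such as `aggregationMode` and `numWorkers`. The sketch below only assembles that string; `solr_jdbc_url` is a hypothetical helper and the host and collection are the ones used elsewhere in this deck.

```python
from urllib.parse import urlencode


def solr_jdbc_url(zk_host, collection, **options):
    """Assemble a jdbc:solr:// connection string for the given
    ZooKeeper address and collection."""
    query = urlencode({"collection": collection, **options})
    return f"jdbc:solr://{zk_host}?{query}"


url = solr_jdbc_url("localhost:9983", "techproducts",
                    aggregationMode="map_reduce", numWorkers=2)
print(url)
```

The printed value is what would go into the URL field of a JDBC client or visualization tool.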

Best Practices

• Create a separate collection for the /sql handler and worker nodes

• Designed for large clusters and large data sets

• Use the correct aggregation mode

• Usually best to partition on what you are grouping on

DocValue Fields ONLY!

Export and Stream request handlers can only be used on fields that use DocValues.

Because Parallel SQL uses these capabilities, in most cases it also requires DocValue fields.
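
One way to prepare a field is to define it with `docValues` enabled through the Schema API. The sketch below builds an `add-field` payload without sending it. The field name `brand_exact` is a made-up example, and the target URL in the comment assumes the techproducts example collection.

```python
import json


def add_docvalues_field(name, field_type="string"):
    """Build a Schema API add-field command with docValues enabled."""
    return {"add-field": {"name": name,
                          "type": field_type,
                          "stored": True,
                          "docValues": True}}


payload = json.dumps(add_docvalues_field("brand_exact"))

# The payload would be POSTed with Content-Type: application/json to
# http://localhost:8983/solr/techproducts/schema
print(payload)
```

Turning on docValues for a field that already holds data requires reindexing before /export, /stream, or /sql can use it.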

Cross Data Center Replication

Replication between two or more SolrCloud clusters in two or more data centers

CDCR Design Points

• Uses existing transaction logs

• Leader-to-Leader communication avoids duplicate updates across data centers

• Active-passive disaster recovery

• Synchronous or asynchronous indexing

• Configurable batch sizes

• No single point of failure or bottlenecks
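
The combination of transaction logs, checkpoints, and batch sizes can be pictured with a toy model. This is a conceptual sketch only and does not reflect CDCR's actual classes or configuration: `replicate` is a hypothetical function that forwards log entries after a checkpoint to a target in fixed-size batches.

```python
def replicate(update_log, target, checkpoint, batch_size):
    """Forward entries after `checkpoint` to `target` in batches.
    Returns the new checkpoint so a later call resumes where this one ended."""
    while checkpoint < len(update_log):
        batch = update_log[checkpoint:checkpoint + batch_size]
        target.extend(batch)      # one leader-to-leader request per batch
        checkpoint += len(batch)
    return checkpoint


source_log = ["add:1", "add:2", "delete:1", "add:3", "add:4"]
target_log = []

checkpoint = replicate(source_log, target_log, 0, batch_size=2)

# New updates arrive later; only the unsent entry is forwarded.
source_log.append("add:5")
checkpoint = replicate(source_log, target_log, checkpoint, batch_size=2)
print(checkpoint, target_log)
```

Because the checkpoint only moves forward, resuming never resends an entry, which mirrors the goal of avoiding duplicate updates across data centers.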

CDCR Limitations

• Must start with an empty index or one that is already fully synchronized

• May be unsatisfactory if rate of updates is high

• Active-passive

Graph Traversal

Perform graph queries for interconnected data

Solr supports graph queries

• Follow nodes to edges

• Apply optional filters during traversal

• Use cases:

• Find all tweets mentioning “Solr” by me or people I follow

• Find all draft blog posts about “parallel sql” written by a developer I know

• Find 3-star hotels in NYC my friends stayed in last year

q=Solr&fq={!graph from=following_id to=id maxDepth=1}id:"childerelda"
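
The idea behind the first use case can be sketched as a depth-limited breadth-first traversal followed by an ordinary filter. This is an illustration of the concept, not of the graph query parser's internals: it does not model how the `from` and `to` fields are matched, and the accounts other than "childerelda" and all tweets are made-up sample data.

```python
from collections import deque


def traverse(edges, roots, max_depth, keep=lambda node: True):
    """Breadth-first traversal that stops expanding at max_depth.
    `keep` is an optional filter applied to nodes during traversal."""
    seen = set(roots)
    frontier = deque((root, 0) for root in roots)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen and keep(nxt):
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


# Hypothetical "who follows whom" edges.
follows = {"childerelda": ["alice", "bob"], "alice": ["carol"]}
tweets = [
    {"author": "childerelda", "text": "Solr 6 webinar today"},
    {"author": "alice", "text": "Trying Parallel SQL in Solr"},
    {"author": "carol", "text": "Solr graph queries"},
    {"author": "bob", "text": "Lunch"},
]

# maxDepth=1: me plus the people I follow directly.
authors = traverse(follows, ["childerelda"], max_depth=1)
hits = [t for t in tweets if "Solr" in t["text"] and t["author"] in authors]
print([t["author"] for t in hits])
```

Raising the depth to 2 would also pull in tweets from people followed by the people I follow.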

Modern API

Redesign Solr’s user-facing APIs

Designed for Humans

• Consistent

• Versioned

• Friendlier endpoint names

• Introspectable

• JSON output by default (`wt` still supported)

Not in 6.0, but coming very soon

{"responseHeader": {"status": 0, "QTime": 2},
 "initFailures": {},
 "status": {
  "techproducts": {
   "name": "techproducts",
   "instanceDir": "/Users/cass/LuceneSolr/lucene-solr/solr/example/techproducts/solr/techproducts",
   "dataDir": "/Users/cass/LuceneSolr/lucene-solr/solr/example/techproducts/solr/techproducts/data/",
   "config": "solrconfig.xml",
   "schema": "managed-schema",
   "startTime": "2016-03-07T19:18:07.765Z",
   "uptime": 295560,
   "index": {
    "numDocs": 32, "maxDoc": 32, "deletedDocs": 0, "indexHeapUsageBytes": -1, "version": 6, "segmentCount": 1, "current": true, "hasDeletions": false,
    "directory": "org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/Users/cass/LuceneSolr/lucene-solr/solr/example/techproducts/solr/techproducts/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@1244fae; maxCacheMB=48.0 maxMergeSizeMB=4.0)",
    "segmentsFile": "segments_2", "segmentsFileSizeInBytes": 165,
    "userData": {"commitTimeMSec": "1457378288231"},
    "lastModified": "2016-03-07T19:18:08.231Z", "sizeInBytes": 27542, "size": "26.9 KB"
}}}}

http://localhost:8983/solr/v2/cores

{ "schema":{ "name":"example", "version":1.6, "uniqueKey":"id",
 "fieldTypes":[{ "name":"_bbox_coord", "class":"solr.TrieDoubleField", "stored":false, "docValues":true, "precisionStep":"8"}],
 "fields":[{ "name":"_root_", "type":"string", "indexed":true, "stored":false},
  { "name":"_src_", "type":"string", "indexed":false, "stored":true},
  { "name":"_version_", "type":"long", "indexed":true, "stored":true}] }}

http://localhost:8983/solr/v2/cores/techproducts/schema

truncated response

{"spec": [{
  "documentation": "https://cwiki.apache.org/confluence/display/solr/Schema+API",
  "methods": ["POST"],
  "url": {"paths": ["$handlerName"]},
  "commands": {
   "add-field": {"properties": {}, "additionalProperties": true},
   "delete-field": {"additionalProperties": true}
  }
 }, {
  "documentation": "https://cwiki.apache.org/confluence/display/solr$handlerName+API",
  "methods": ["GET"],
  "url": {"paths": ["$handlerName", "$handlerName/name", "$handlerName/uniquekey", "$handlerName/version", "$handlerName/similarity", "$handlerName/solrqueryparser", "$handlerName/zkversion", "$handlerName/zkversion", "$handlerName/solrqueryparser/defaultoperator", "$handlerName/name", "$handlerName/version", "$handlerName/uniquekey", "$handlerName/similarity", "$handlerName/similarity"]},
  "body": null
}]}

http://localhost:8983/solr/v2/cores/techproducts/schema/_introspect

truncated response
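
An introspectable API means a client can discover the available methods, paths, and commands at run time. The sketch below summarizes a shortened copy of the `_introspect` response shown above. `summarize` is a hypothetical helper, and since this API was not yet released at the time of the webinar, the response shape should be treated as provisional.

```python
import json

# Shortened copy of the _introspect response.
spec_text = '''{"spec": [
 {"methods": ["POST"], "url": {"paths": ["$handlerName"]},
  "commands": {"add-field": {"properties": {}, "additionalProperties": true},
               "delete-field": {"additionalProperties": true}}},
 {"methods": ["GET"],
  "url": {"paths": ["$handlerName", "$handlerName/name", "$handlerName/name"]},
  "body": null}]}'''


def summarize(text):
    """Map each HTTP method to its de-duplicated paths and command names."""
    summary = {}
    for entry in json.loads(text)["spec"]:
        for method in entry["methods"]:
            summary[method] = {
                "paths": sorted(set(entry["url"]["paths"])),
                "commands": sorted(entry.get("commands", {})),
            }
    return summary


summary = summarize(spec_text)
print(summary["POST"]["commands"])
print(summary["GET"]["paths"])
```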

…and More

• BM25 is the default Similarity

• SolrCloud Backup/Restore API

• AngularJS-based Admin UI

• Jetty 9.3 and HTTP/2 (in 6.x)

Collection Overview Screen

Getting Ready to Upgrade

Highlights of other major changes

Java 8 or higher only!

If you are still using Java 7, you will need to update Java before upgrading to Solr 6.

Changes to Defaults

• Default schemaFactory is now ManagedIndexSchemaFactory

• Similarity defaults:

• If no <similarity> defined, SchemaSimilarityFactory is used

• Defaults to BM25 when field type does not declare similarity
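
The practical effect of the switch to BM25 is that repeated terms saturate and long documents are normalized. The sketch below implements the textbook per-term BM25 formula with the commonly cited defaults k1=1.2 and b=0.75. It is an approximation for intuition only: Lucene stores document lengths in a compressed form, so real scores will differ slightly, and all input numbers here are made up.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Score one query term against one document.

    tf: term frequency in the document
    doc_len / avg_doc_len: field length and its average over the index
    doc_count / doc_freq: documents in the index / documents with the term
    """
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)


def score(tf, doc_len=100):
    return bm25_term_score(tf, doc_len, avg_doc_len=100,
                           doc_count=1000, doc_freq=10)


# The second occurrence of a term adds far more than the tenth.
print(score(2) - score(1), score(10) - score(9))
# The same term frequency counts for less in a longer document.
print(score(1, doc_len=100), score(1, doc_len=200))
```

Relevance-sensitive applications may want to compare rankings before and after upgrading, since scores are not comparable with those from the previous default.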

Deprecations introduced in Solr 5 have been removed

• SolrServer and subclasses (use SolrClient)

• DefaultSimilarityFactory has been removed (use ClassicSimilarityFactory)

• GET methods on the Schema API have been changed

• facet.date has been removed (finally); use facet.range

• SolrClient.shutdown() removed in favor of SolrClient.close()

All right, WHEN?

The first release candidate could be created this week.

Expect release in the next 2-4 weeks.

More Information

• Solr Reference Guide

• https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface

• https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions+(Solr+6)

• Joel Bernstein’s presentation at Lucene Revolution

• https://www.youtube.com/watch?v=baWQfHWozXc

• Yonik’s blog, Solr ’n Stuff

• http://yonik.com/solr-cross-data-center-replication/

• http://yonik.com/solr-6/

• Shalin’s presentation to Bangalore Apache Solr/Lucene Group: http://slides.com/shalinmangar/what-s-cooking

Thanks to everyone who’s blogged or presented on upcoming features

• Joel Bernstein and Dennis Gove

• Shalin Mangar

• Yonik Seeley

• Doug Turnbull

Questions?

@childerelda

www.lucidworks.com