lucidworks
Solr 5 Main Themes
• Easy to Use
• bin/solr and bin/post improvements
• JSON-based facets
• More APIs
• Modern UI (Angular-based)
• Scalable
• SolrCloud hardening
• Replica placement strategy
• Streaming expressions
• Secure
• Authentication and Authorization frameworks
Highlights of Recent Solr Releases (5.4 and 5.5)
• Solr 5.4
• Basic authentication
• ConfigSets API
• FORCELEADER command
• Optimizations for faceting DocValue fields
• Solr 5.5
• Ability to edit ZooKeeper configs with bin/solr
• Rule-based authorization flexibility
• XML query parser
• More async collection APIs
Solr 6 introduces several new features
• Parallel SQL
• Cross Data Center Replication
• Graph Traversal
• Modern APIs
• Jetty 9.3 and HTTP/2
Seamlessly combines SQL with Solr’s full-text capabilities
• Realtime MapReduce(ish) or Facet aggregation modes
• Parallel execution of queries across SolrCloud
• Advanced SQL syntax for powerful queries
Parallel SQL builds on Solr’s Streaming Capabilities
• Export request handler (/export)
• Streaming API
• Streams tuples in JSON
• New package: org.apache.solr.client.solrj.io
• Streaming Expressions (/stream)
• Allows non-Java programmers to access Streaming API
• Expressions are essentially functions which originate the stream or operate on the stream
Streaming Expression Request - search
curl -d 'expr=search(gettingstarted, q="*:*", fl="id, manu_exact", sort="manu_exact asc")' http://localhost:8983/solr/gettingstarted/stream
{"result-set": {
"docs": [ {"manu_exact": "A-DATA Technology Inc.", "id": "VDBDB1A16"}, {"manu_exact": "ASUS Computer Inc.", "id": "EN7800GTX/2DHTV/256M"}, {"manu_exact": "ATI Technologies", "id": "100-435805"}… {"EOF": true,"RESPONSE_TIME": 15}]
}}
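A minimal Python sketch of the same request, for readers who would rather script it than use curl. It assumes the local `gettingstarted` collection from the slide; the helper names `build_stream_request` and `read_tuples` are illustrative and not part of any Solr client. The request is built but not sent, and the parser is exercised on an abbreviated copy of the response above.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Solr is running locally with the gettingstarted collection.
STREAM_URL = "http://localhost:8983/solr/gettingstarted/stream"


def build_stream_request(expr, url=STREAM_URL):
    """Build (but do not send) a POST request carrying a streaming expression."""
    body = urllib.parse.urlencode({"expr": expr}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def read_tuples(response_text):
    """Yield tuples from a /stream response, stopping at the EOF marker tuple."""
    for doc in json.loads(response_text)["result-set"]["docs"]:
        if doc.get("EOF"):
            return
        yield doc


expr = 'search(gettingstarted, q="*:*", fl="id,manu_exact", sort="manu_exact asc")'
request = build_stream_request(expr)

# Abbreviated copy of the response shown on the slide.
sample = """{"result-set": {"docs": [
  {"manu_exact": "A-DATA Technology Inc.", "id": "VDBDB1A16"},
  {"manu_exact": "ASUS Computer Inc.", "id": "EN7800GTX/2DHTV/256M"},
  {"manu_exact": "ATI Technologies", "id": "100-435805"},
  {"EOF": true, "RESPONSE_TIME": 15}]}}"""
tuples = list(read_tuples(sample))
```

Passing `request` to `urllib.request.urlopen` would send it to a running Solr instance; the final tuple carries `EOF` and the response time rather than document fields, so the reader stops there.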
Functions, aka Stream Sources and Stream Decorators
• Define how data is retrieved and any aggregations performed
• Designed to work with entire result sets
• Can be compounded or wrapped to perform several operations at the same time
Streaming Expression Request - reduce
curl http://localhost:8983/solr/gettingstarted/stream -d 'expr=reduce(search(gettingstarted, q="inStock:true", qt="/export", fl="id,manu_exact", sort="manu_exact asc"), by="manu_exact", group(sort="manu_exact asc", n="2"))'
Streaming Expression Response
{"result-set": {"docs":[ {"id":"0380014300","group":[{"id":"0380014300"},{"id":"0553573403"}]}, {"manu_exact":"A-DATA Technology Inc.","id":"VDBDB1A16","group":[{"manu_exact":"A-DATA Technology Inc.","id":"VDBDB1A16"}]}, {"manu_exact":"Apache Software Foundation","id":"UTF8TEST","group":[{"manu_exact":"Apache Software Foundation","id":"UTF8TEST"},{"manu_exact":"Apache Software Foundation","id":"SOLR1000"}]}, {"manu_exact":"Apple Computer Inc.","id":"MA147LL/A","group":[{"manu_exact":"Apple Computer Inc.","id":"MA147LL/A"}]}, {"manu_exact":"Bank of America","id":"USD","group":[{"manu_exact":"Bank of America","id":"USD"}]}, {"manu_exact":"Bank of Norway","id":"NOK","group":[{"manu_exact":"Bank of Norway","id":"NOK"}]}, {"manu_exact":"Canon Inc.","id":"9885A004","group":[{"manu_exact":"Canon Inc.","id":"9885A004"},{"manu_exact":"Canon Inc.","id":"0579B002"}]}, {"manu_exact":"Corsair Microsystems Inc.","id":"VS1GB400C3","group":[{"manu_exact":"Corsair Microsystems Inc.","id":"VS1GB400C3"},{"manu_exact":"Corsair Microsystems Inc.","id":"TWINX2048-3200PRO"}]}, {"manu_exact":"Dell, Inc.","id":"3007WFP","group":[{"manu_exact":"Dell, Inc.","id":"3007WFP"}]}, {"EOF":true,"RESPONSE_TIME":24}]}}
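To make the shape of that response easier to read, here is a small Python sketch of what a reduce-with-group step appears to do to a sorted stream: one output tuple per distinct `by` value, with up to `n` member tuples nested under `group`. It illustrates the semantics only and is not Solr's implementation; `reduce_group` is a hypothetical name, and the third Canon tuple is invented to show the `n` cutoff.

```python
from itertools import groupby


def reduce_group(sorted_tuples, by, n):
    """Emit one tuple per distinct `by` value with up to `n` members under "group".

    The input must already be sorted on `by`, which is why the inner
    search() in the request sorts on manu_exact.
    """
    for _, members in groupby(sorted_tuples, key=lambda t: t.get(by)):
        members = list(members)
        head = dict(members[0])      # copy of the first tuple in the run
        head["group"] = members[:n]  # keep at most n tuples per group
        yield head


tuples = [
    {"manu_exact": "Apple Computer Inc.", "id": "MA147LL/A"},
    {"manu_exact": "Canon Inc.", "id": "9885A004"},
    {"manu_exact": "Canon Inc.", "id": "0579B002"},
    {"manu_exact": "Canon Inc.", "id": "hypothetical-third"},  # invented for the example
]
result = list(reduce_group(tuples, by="manu_exact", n=2))
```

Tuples that lack the `by` field, like the first one in the response above, fall into a single group keyed on a missing value.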
Available Functions
• Stream Sources
• Search
• JDBC
• Facet
• Stats
• Topic
• Stream Decorators
• Complement, Unique, Intersect
• leftOuterJoin, innerJoin, hashJoin, outerHashJoin
• Top, Rollup, Facet
• Parallel
• Decorators, cont’d
• Update
• Merge
• Group, Reduce
• Daemon
• Select
Streaming Expression Request - parallel
curl http://localhost:8983/solr/gettingstarted/stream -d 'expr=parallel(workcollection, search(gettingstarted, q="inStock:true", fl="id, manu_exact", sort="manu_exact asc", partitionKeys="manu_exact"), workers=2, zkHost="localhost:9983", sort="manu_exact asc")'
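The two ideas in that request, `partitionKeys` and the outer `sort`, can be sketched in plain Python: tuples are routed to workers by a hash of the partition key so that equal keys land on the same worker, and the per-worker streams, each still sorted, are merged back into one sorted stream. This is a conceptual sketch only. `zlib.crc32` stands in for whatever hashing Solr actually uses, and the function names are hypothetical.

```python
import heapq
import zlib


def partition(tuples, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    shards = [[] for _ in range(workers)]
    for t in tuples:
        # crc32 is a stand-in; any stable hash demonstrates the idea.
        slot = zlib.crc32(t[key].encode("utf-8")) % workers
        shards[slot].append(t)
    return shards


def merge_sorted(worker_outputs, key):
    """Merge per-worker streams that are each already sorted on `key`."""
    return list(heapq.merge(*worker_outputs, key=lambda t: t[key]))


# Input sorted on manu_exact, as the inner search() would return it.
tuples = [
    {"manu_exact": "A-DATA Technology Inc.", "id": "VDBDB1A16"},
    {"manu_exact": "Canon Inc.", "id": "9885A004"},
    {"manu_exact": "Canon Inc.", "id": "0579B002"},
    {"manu_exact": "Dell, Inc.", "id": "3007WFP"},
]
shards = partition(tuples, key="manu_exact", workers=2)
merged = merge_sorted(shards, key="manu_exact")
```

Because each worker sees every tuple for the keys it owns, per-key operations such as grouping can run independently on each worker before the merge.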
Parallel SQL builds on export and streaming
• SQL statements translated into Streaming Expressions
• Automatic merge of results from worker nodes
• Advanced SQL syntax
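As a rough illustration of the first bullet, a simple SELECT with a WHERE clause and an ORDER BY maps naturally onto a `search()` expression like the ones shown earlier. The helper below is a hypothetical toy, not Solr's actual translator, and it handles only this one statement shape.

```python
def select_to_search(collection, fields, where, order_by):
    """Render a simple SELECT as a search() streaming expression string.

    Hypothetical helper for illustration; Solr's real SQL handling covers
    far more than this single case.
    """
    return 'search({c}, q="{q}", fl="{fl}", sort="{s}", qt="/export")'.format(
        c=collection, q=where, fl=",".join(fields), s=order_by
    )


# Roughly: select id, manu_exact from techproducts
#          where inStock=true order by manu_exact asc
expr = select_to_search(
    "techproducts", ["id", "manu_exact"], "inStock:true", "manu_exact asc"
)
```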
SQL Syntax
• SELECT and SELECT DISTINCT
• select id, manu_exact from techproducts
• select distinct id, manu_exact from techproducts
• WHERE
• select id, manu_exact from techproducts where inStock=true
• select id, manu_exact from techproducts where price='[10 TO 50]'
• select id, manu_exact from techproducts where cat='(electronics or music)'
SQL Syntax
• ORDER BY and LIMIT
• select id, manu_exact from techproducts order by manu_exact asc
• select id, manu_exact from techproducts limit 10
• GROUP BY
• select id, manu_exact from techproducts where inStock=true group by manu
SQL Syntax
• Stats
• select count(manu_exact) as count, avg(price) as avg from techproducts
• HAVING
• select id, manu_exact from techproducts where inStock=true having (avg(price)>5) order by manu_exact asc
SQL Statement and Results
curl --data-urlencode "stmt=select id, manu_exact from techproducts where inStock='true' order by manu_exact limit 5" http://localhost:8983/solr/techproducts/sql
{"result-set":{"docs":[ {"manu_exact":"A-DATA Technology Inc.","id":"VDBDB1A16"}, {"manu_exact":"Apache Software Foundation","id":"SOLR1000"}, {"manu_exact":"Apache Software Foundation","id":"UTF8TEST"}, {"manu_exact":"Apple Computer Inc.","id":"MA147LL/A"}, {"manu_exact":"Bank of America","id":"USD"}, {"EOF":"true","RESPONSE_TIME":8}]}
}
Aggregation Modes
• map_reduce
• Tuples are shuffled to worker nodes, where aggregation occurs
• Tuples are sent to worker nodes sorted by GROUP BY fields
• Great for high cardinality
• facet
• Pushes computation to JSON Facet API - only aggregates are sent over the network
• Great for low-to-moderate cardinality
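One way to see why map_reduce mode sends tuples sorted by the GROUP BY fields: when the input is sorted, aggregates can be computed in a single pass while holding only one group's running totals at a time. The sketch below shows that idea in Python. It is an illustration under that assumption, not Solr's code; `rollup_sorted` is a hypothetical name and the prices are made up.

```python
def rollup_sorted(tuples, over, field):
    """Stream count and average of `field` per `over` value from sorted tuples.

    Only the current group's totals are kept, so memory use does not grow
    with the number of distinct groups.
    """
    current, count, total = None, 0, 0.0
    for t in tuples:
        if count and t[over] != current:
            # Group boundary reached: emit the finished group and reset.
            yield {over: current, "count": count, "avg": total / count}
            count, total = 0, 0.0
        current = t[over]
        count += 1
        total += t[field]
    if count:
        yield {over: current, "count": count, "avg": total / count}


# Made-up prices, sorted on the grouping field.
rows = [
    {"manu_exact": "Canon Inc.", "price": 100.0},
    {"manu_exact": "Canon Inc.", "price": 300.0},
    {"manu_exact": "Dell, Inc.", "price": 50.0},
]
aggregates = list(rollup_sorted(rows, over="manu_exact", field="price"))
```

In facet mode the equivalent work happens inside the search engine, and only the finished aggregates cross the network.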
Parallel SQL with map_reduce Aggregation Mode
[Diagram: three tiers. SQL Tier: client and /sql handler. Worker Tier: worker 1, worker 2, worker 3, worker 4. Data Tier: four shards with three replicas each (s1_r1 to s1_r3, s2_r1 to s2_r3, s3_r1 to s3_r3, s4_r1 to s4_r3).]
Each worker queries 1 replica in each shard
JDBC Driver
• Solr now includes a JDBC driver which can be used to query Solr
• Can be used only with the SQL handler
• DB visualization tools can also be used, such as Apache Zeppelin, Squirrel, DBVisualizer, etc.
Best Practices
• Create a separate collection for the /sql handler and worker nodes
• Designed for large clusters and large data sets
• Use the correct aggregation mode
• Usually best to partition on what you are grouping on
DocValue Fields ONLY!
Export and Stream request handlers can only be used on fields that use DocValues.
Because Parallel SQL uses these capabilities, in most cases it also requires DocValue fields.
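One way to satisfy this when adding a field is to enable DocValues explicitly through the Schema API's `add-field` command. The sketch below only builds the request and does not send it; the field name `manu_dv` and the local `techproducts` URL are assumptions for illustration.

```python
import json
import urllib.request

# Assumption: Solr is running locally with the techproducts collection.
SCHEMA_URL = "http://localhost:8983/solr/techproducts/schema"


def build_add_field_request(name, field_type, url=SCHEMA_URL):
    """Build (but do not send) a Schema API request adding a DocValues field."""
    command = {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
            "docValues": True,  # needed for /export and /stream on this field
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


field_request = build_add_field_request("manu_dv", "string")  # hypothetical field name
```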
Cross Data Center Replication
Replication between two or more SolrCloud clusters in two or more data centers
CDCR Design Points
• Uses existing transaction logs
• Leader-to-Leader communication avoids duplicate updates across data centers
• Active-passive disaster recovery
• Synchronous or asynchronous indexing
• Configurable batch sizes
• No single point of failure or bottlenecks
CDCR Limitations
• Must start with an empty index or one that is already fully synchronized
• May be unsatisfactory if rate of updates is high
• Active-passive
Solr supports graph queries
• Follow nodes to edges
• Apply optional filters during traversal
• Use cases:
• Find all tweets mentioning “Solr” by me or people I follow
• Find all draft blog posts about “parallel sql” written by a developer I know
• Find 3-star hotels in NYC my friends stayed in last year
q=Solr&fq={!graph from=following_id to=id maxDepth=1}id:"childerelda"
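The traversal behind that filter can be pictured as a breadth-first walk: start from the documents matching the root query, read the values in the `from` field, and hop to documents whose `to` field holds one of those values, stopping after `maxDepth` hops. The Python sketch below follows that reading of the parameters. The user documents are hypothetical, and this is a conceptual model, not the parser's implementation.

```python
def graph_traverse(docs, roots, from_field, to_field, max_depth=-1):
    """Return ids of the root docs plus every doc reachable from them.

    An edge leads from a doc to any doc whose `to_field` value appears in
    the first doc's `from_field` list. max_depth=-1 means no hop limit.
    """
    by_to = {}
    for doc in docs:
        by_to.setdefault(doc[to_field], []).append(doc)

    seen = {d["id"] for d in roots}
    frontier, depth = list(roots), 0
    while frontier and (max_depth < 0 or depth < max_depth):
        next_frontier = []
        for doc in frontier:
            for edge in doc.get(from_field, []):
                for target in by_to.get(edge, []):
                    if target["id"] not in seen:
                        seen.add(target["id"])
                        next_frontier.append(target)
        frontier, depth = next_frontier, depth + 1
    return seen


# Hypothetical user documents: each lists the ids it follows.
users = [
    {"id": "childerelda", "following_id": ["alice", "bob"]},
    {"id": "alice", "following_id": ["carol"]},
    {"id": "bob", "following_id": []},
    {"id": "carol", "following_id": []},
]
one_hop = graph_traverse(users, [users[0]], "following_id", "id", max_depth=1)
everyone = graph_traverse(users, [users[0]], "following_id", "id")
```

With `max_depth=1` the walk stops at the accounts followed directly; without a limit it continues until no new documents are reached.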
Designed for Humans
• Consistent
• Versioned
• Friendlier endpoint names
• Introspectable
• JSON output by default (`wt` still supported)
Not in 6.0, but coming very soon
{
  "responseHeader": {"status": 0, "QTime": 2},
  "initFailures": {},
  "status": {
    "techproducts": {
      "name": "techproducts",
      "instanceDir": "/Users/cass/LuceneSolr/lucene-solr/solr/example/techproducts/solr/techproducts",
      "dataDir": "/Users/cass/LuceneSolr/lucene-solr/solr/example/techproducts/solr/techproducts/data/",
      "config": "solrconfig.xml",
      "schema": "managed-schema",
      "startTime": "2016-03-07T19:18:07.765Z",
      "uptime": 295560,
      "index": {
        "numDocs": 32,
        "maxDoc": 32,
        "deletedDocs": 0,
        "indexHeapUsageBytes": -1,
        "version": 6,
        "segmentCount": 1,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/Users/cass/LuceneSolr/lucene-solr/solr/example/techproducts/solr/techproducts/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@1244fae; maxCacheMB=48.0 maxMergeSizeMB=4.0)",
        "segmentsFile": "segments_2",
        "segmentsFileSizeInBytes": 165,
        "userData": {"commitTimeMSec": "1457378288231"},
        "lastModified": "2016-03-07T19:18:08.231Z",
        "sizeInBytes": 27542,
        "size": "26.9 KB"
      }
    }
  }
}
http://localhost:8983/solr/v2/cores
{
  "schema": {
    "name": "example",
    "version": 1.6,
    "uniqueKey": "id",
    "fieldTypes": [
      {"name": "_bbox_coord", "class": "solr.TrieDoubleField", "stored": false, "docValues": true, "precisionStep": "8"}],
    "fields": [
      {"name": "_root_", "type": "string", "indexed": true, "stored": false},
      {"name": "_src_", "type": "string", "indexed": false, "stored": true},
      {"name": "_version_", "type": "long", "indexed": true, "stored": true}]
  }
}
http://localhost:8983/solr/v2/cores/techproducts/schema
truncated response
{
  "spec": [
    {
      "documentation": "https://cwiki.apache.org/confluence/display/solr/Schema+API",
      "methods": ["POST"],
      "url": {"paths": ["$handlerName"]},
      "commands": {
        "add-field": {"properties": {}, "additionalProperties": true},
        "delete-field": {"additionalProperties": true}
      }
    },
    {
      "documentation": "https://cwiki.apache.org/confluence/display/solr$handlerName+API",
      "methods": ["GET"],
      "url": {
        "paths": ["$handlerName", "$handlerName/name", "$handlerName/uniquekey", "$handlerName/version", "$handlerName/similarity", "$handlerName/solrqueryparser", "$handlerName/zkversion", "$handlerName/zkversion", "$handlerName/solrqueryparser/defaultoperator", "$handlerName/name", "$handlerName/version", "$handlerName/uniquekey", "$handlerName/similarity", "$handlerName/similarity"]
      },
      "body": null
    }
  ]
}
http://localhost:8983/solr/v2/cores/techproducts/schema/_introspect
truncated response
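An introspection response like the one above lends itself to tooling. As a small sketch, the Python below maps each HTTP method to the command names it accepts, using an abbreviated copy of that response. Since these endpoints were not yet released at the time of the talk, the final format may differ; `commands_by_method` is a hypothetical helper.

```python
import json


def commands_by_method(introspect_text):
    """Map each HTTP method in an _introspect response to its command names."""
    result = {}
    for entry in json.loads(introspect_text)["spec"]:
        names = sorted(entry.get("commands", {}))
        for method in entry.get("methods", []):
            result.setdefault(method, []).extend(names)
    return result


# Abbreviated copy of the truncated response shown on the slide.
sample = """{"spec": [
  {"methods": ["POST"],
   "url": {"paths": ["$handlerName"]},
   "commands": {
     "add-field": {"properties": {}, "additionalProperties": true},
     "delete-field": {"additionalProperties": true}}},
  {"methods": ["GET"],
   "url": {"paths": ["$handlerName", "$handlerName/name"]},
   "body": null}]}"""
commands = commands_by_method(sample)
```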
…and More
• BM25 is the default Similarity
• SolrCloud Backup/Restore API
• AngularJS-based Admin UI
• Jetty 9.3 and HTTP/2 (in 6.x)
Java 8 or higher only!
If you are still using Java 7, you will need to update Java before upgrading to Solr 6.
Changes to Defaults
• Default schemaFactory is now ManagedIndexSchemaFactory
• Similarity defaults:
• If no <similarity> defined, SchemaSimilarityFactory is used
• Defaults to BM25 when field type does not declare similarity
Deprecations introduced in Solr 5 have been removed
• SolrServer and subclasses (use SolrClient)
• DefaultSimilarityFactory has been removed
• GET methods on the Schema API have been changed
• facet.date has been removed (finally); use facet.range instead
• SolrClient.shutdown() removed in favor of SolrClient.close()
All right, WHEN?
The first release candidate could be created this week.
Expect release in the next 2-4 weeks.
More Information
• Solr Reference Guide
• https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
• https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions+(Solr+6)
• Joel Bernstein’s presentation at Lucene Revolution
• https://www.youtube.com/watch?v=baWQfHWozXc
• Yonik’s blog, Solr ’n Stuff
• http://yonik.com/solr-cross-data-center-replication/
• http://yonik.com/solr-6/
• Shalin’s presentation to Bangalore Apache Solr/Lucene Group: http://slides.com/shalinmangar/what-s-cooking
Thanks to everyone who’s blogged or presented on upcoming features
• Joel Bernstein and Dennis Gove
• Shalin Mangar
• Yonik Seeley
• Doug Turnbull