Solr as a Spark SQL Datasource

Solr as a Spark SQL Datasource

Kiran Chitturi,Lucidworks

Solr & Spark

• A few interesting things about Spark

• Overview of SparkSQL and DataFrames

• Solr as a SparkSQL DataSource in depth

• Use Lucene for text analysis in ML pipelines

• Example Use Case: Lucidworks Fusion and Spark

The standard for enterprise search. of Fortune 500

uses Solr.

90%

What’s interesting about Spark?• Wealth of overview / getting started resources on the Web

➢ Start here -> https://spark.apache.org/

➢ Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

• Faster, more modernized alternative to MapReduce

➢ Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using10x less computing power)

• Unified platform for Big Data

➢ Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining ➢ Runs on YARN, Mesos, and plays well with HDFS

• Nice API for Java, Scala, Python, R, and sometimes SQL … REPL interface too

• >14,200 Issues in JIRA, 1000+ code contributors, 2.0 coming soon!

https://spark.apache.org/

https://spark.apache.org/

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Spark Components

Spark Core

Spark SQL

Spark Streaming

MLlib (machine learning)

GraphX (BSP)

Hadoop YARN Mesos Standalone

HDFSExecution

ModelThe Shuffle Caching

components

engine

cluster mgmt

Alluxio (formerly Tachyon)

languages Scala Java Python R

shared memory

Physical Architecture (Standalone)

Spark Master (daemon)

Spark Slave (daemon)

my-spark-job.jar (w/ shaded deps)

My Spark AppSparkContext

(driver)

• Keeps track of live workers

• Web UI on port 8080

• Task Scheduler

• Restart failed tasks

Spark Executor (JVM process)

Tasks

Executor runs in separate process than slave daemon

Spark Worker Node (1...N of these)

Each task works on some partition of a data set to apply a transformation or action

Cache

Losing a master prevents new applications from being executed

Can achieve HA using ZooKeeper and multiple master nodes

Tasks are assigned based on data-locality

When selecting which node to execute a task on, the master takes into account data locality

• RDD Graph • DAG Scheduler • Block tracker • Shuffle tracker

Spark SQL• DataSource API for reading from and writing to external data sources

• DataFrame is an RDD[Row] + schema

• Secret sauce is logical plan optimizer

• SQL or relational operators on DF

• JDBC / ODBC

• UDFs!

• Machine Learning Pipelines

Solr as a Spark SQL Data Source• Read/write data from/to Solr as DataFrame

• Use Solr Schema API to access field-level metadata

• Push predicates down into Solr query constructs, e.g. fq clause

• Deep-paging, shard partitioning, intra-shard splitting, streaming results

// Connect to Solr val opts = Map("zkhost" -> "localhost:9983", "collection" -> "nyc_trips") val solrDF = sqlContext.read.format("solr").options(opts).load

// Register DF as temp table solrDF.registerTempTable("trips")

// Perform SQL queries sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()

Shard 1

Shard 2

socialdata collection

Partition 1 (spark task)

solr.DefaultSource


Spark Driver App

ZooKeeperRead collection metadata (num shards, replica URLs, etc)

TableScanQuery:q=*:*&rows=1000&

distrib=false&cursorMark=*

Results streamed back from Solr

Para

lleliz

e qu

ery

exec

utio

n

sqlContext.load(“solr”,opts)

DataFrame (schema + RDD<Row>)

Map("collection”->“nyc_trips”,“zkHost”->“…”)

SolrRDD: Reading data from Solr into Spark

Shard 1

Shard 2

socialdata collection


solr.DefaultSource


Spark Driver App

ZooKeeperRead collection metadata

(num shards, replica URLs, etc)

TableScanQuery:q=*:*&rows=1000&

distrib=false&cursorMark=*&split_field:[aaTOba}

Para

lleliz

e qu

ery

exec

utio

n

sqlContext.load(“solr”,opts)

DataFrame (schema + RDD<Row>)

Map(“collection”->“nyc_trips”,“zkHost”->“…”)

SolrRDD: Reading data from Solr into Spark

Partition 2 (spark task) &split_field:[baTObz]


&split_field:[aaTOac}

&split_field:[acTOae]

Data-locality Hint

• SolrRDD extends RDD[SolrDocument] (written in Scala)

• Give hint to Spark task scheduler about where data lives

overridedefgetPreferredLocations(split:Partition):Seq[String]={

//returnpreferredhostnameforaSolrpartition }

• Useful when Spark executor and Solr replicas live on same physical host, as we do in Fusion

• Query to a shard has a “preferred” replica; can fallback to other replicas if the preferred goes down (will be in 2.1)

Solr Streaming API for fast reads

• Contributed to spark-solr by Bloomberg team (we PRs)

• Extremely fast “table scans” over large result sets in Solr

• Relies on a column-oriented data structure in Lucene: docValues

• DocValues help speed up faceting and sorting too!

• Coming soon! Push SQL predicates down into Solr’s Parallel SQL engine available in Solr 6.x

Writing to Solr (aka indexing)• Cloud-aware client sends updates to shard leaders in parallel

• Solr Schema API used to create fields on-the-fly using the DataFrame

schema

• Better parallelism than traditional approaches like Solr DIH

val dbOpts = Map( "url" -> "jdbc:postgresql:mydb", "dbtable" -> “schema.table", "partitionColumn" -> "foo", "numPartitions" -> "10")

val jdbcDF = sqlContext.read.format("jdbc").options(dbOpts).load

val solrOpts = Map("zkhost" -> "localhost:9983", "collection" -> "mycoll")

jdbcDF.write.format("solr").options(solrOpts).mode(SaveMode.Overwrite).save

Solr / Lucene Analyzers for Spark ML Pipelines• Spark ML Pipeline provides nice API for defining stages to train / predict ML models

• Crazy idea ~ use battle-hardened Lucene for text analysis in Spark

• Pipelines support import/export (work in progress, more coming in Spark 2.0)

• Can try different text analysis techniques during cross-validation

DF

Spark ML Pipeline

Lucene Analyzer HashingTF

Standard Scaler

Trained Model (SVM)

save

https://lucidworks.com/blog/2016/04/13/spark-solr-lucenetextanalyzer/

Fusion & Spark• spark-solr 2.0.1 released, built into Fusion 2.4

• Users leave evidence of their needs & experience as they use your app

• Fusion closes the feedback loop to improve results based on user “signals” • Train and serve ML Pipeline and mllib based Machine Learning models

• Run custom Scala “script” jobs in the background in Fusion ➢Complex aggregation jobs (see next slide) ➢Unsupervised learning (LDA topic modeling) ➢Re-train supervised models as new training data flows in

Scheduled Fusion job to compute stats for user sessionsvalopts=Map("zkhost"->"localhost:9983”,"collection"->"apachelogs”)varlogEvents=sqlContext.read.format("solr").options(opts).loadlogEvents.registerTempTable("logs”)sqlContext.udf.register("ts2ms",(d:java.sql.Timestamp)=>d.getTime)sqlContext.udf.register("asInt",(b:String)=>b.toInt)valsessions=sqlContext.sql("""|SELECT*,sum(IF(diff_ms>30000,1,0))|OVER(PARTITIONBYclientipORDERBYts)session_id|FROM(SELECT*,ts2ms(ts)-lag(ts2ms(ts))|OVER(PARTITIONBYclientipORDERBYts)asdiff_msFROMlogs)tmp""".stripMargin)sessions.registerTempTable("sessions")

varsessionsAgg=sqlContext.sql("""|SELECTconcat_ws('||',clientip,session_id)asid,|first(clientip)asclientip,|min(ts)assession_start,|max(ts)assession_end,|(ts2ms(max(ts))-ts2ms(min(ts)))assession_len_ms_l,|count(*)astotal_requests_l|FROMsessions|GROUPBYclientip,session_id""".stripMargin)

sessionsAgg.write.format("solr").options(Map("zkhost"->"localhost:9983","collection"->"apachelogs_signals_aggr")).mode(org.apache.spark.sql.SaveMode.Overwrite).save

Getting started with spark-solr

• Import package via maven

./bin/spark-shell --packages "com.lucidworks.solr:spark-solr:2.0.1"

• Build from source git clone https://github.com/LucidWorks/spark-solr cd spark-solr mvn clean package -DskipTests ./bin/spark-shell --jars 2.1.0-SNAPSHOT.jar

https://github.com/LucidWorks/spark-solr/

Example : Deep paging via shards

// Connect to Solr val opts = Map( "zkhost" -> "localhost:9983", "collection" -> "nyc_trips")

val solrDF = sqlContext.read.format("solr").options(opts).load


sqlContext.sql("SELECT * FROM trips LIMIT 2").show()

Example : Deep paging with intra shard splitting

// Connect to Solr val opts = Map( "zkhost" -> "localhost:9983", "collection" -> "nyc_trips", "splits" -> "true") val solrDF = sqlContext.read.format("solr").options(opts).load


sqlContext.sql("SELECT * FROM trips").count()

Example : Streaming API (/export handler)

// Connect to Solr val opts = Map( "zkhost" -> "localhost:9983", "collection" -> "nyc_trips") val solrDF = sqlContext.read.format("solr").options(opts).load


sqlContext.sql("SELECT avg(tip_amount), avg(fare_amount) FROM trips").show()

Performance test• NYC taxi data (30 months - 91.7M rows) • Dataset loaded in to AWS RDS instance (Postgres)

• 3 EC2 nodes of r3.2x large instances • Solr and Spark instances co-located together

• Collection ‘nyc-taxi’ created with 6 shards, 1 replication • Deployed using solr-scale-tk (https://github.com/LucidWorks/solr-scale-tk)

• Dataset link: https://github.com/toddwschneider/nyc-taxi-data • More details: https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181

https://github.com/LucidWorks/solr-scale-tk

https://github.com/toddwschneider/nyc-taxi-data

https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181

• Query - simple aggregation query to calculate averages

• Streaming expressions took 2.3 mins across 6 tasks

• Deep paging took 20 minutes across 120 tasks

Query performance

Index performance• 91.4M rows imported to Solr in 49 minutes

• Docs per second: 31K

• JDBC batch size: 5000

• Indexing batch size: 50000

• Partitions: 200

Wrap-up and Q & A

Download Fusion: http://lucidworks.com/fusion/download/

Feel free to reach out to me with questions:

[email protected] / @chitturikiran

http://lucidworks.com/fusion/download/

http://lucidworks.com/fusion/download/

mailto:[email protected]?subject=

Spark Streaming: Nuts & Bolts• Transform a stream of records into small, deterministic batches

✓ Discretized stream: sequence of RDDs ✓ Once you have an RDD, you can use all the other Spark libs (MLlib, etc) ✓ Low-latency micro batches ✓ Time to process a batch must be less than the batch interval time

• Two types of operators: ✓ Transformations (group by, join, etc) ✓ Output (send to some external sink, e.g. Solr)

• Impressive performance! ✓ 4GB/s (40M records/s) on 100 node cluster with less than 1 second latency (note: not indexing rate)

✓ Haven’t found any unbiased, reproducible performance comparisons between Storm / Spark

Spark Streaming Example: Solr as Sink

Twitter

./spark-submit--masterMASTER--classcom.lucidworks.spark.SparkAppspark-solr-1.0.jar\twitter-to-solr-zkHostlocalhost:2181–collectionsocial

Solr

JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters);

Various transformations / enrichments on each tweet (e.g. sentiment analysis, language detection)

JavaDStream<SolrInputDocument> docs = tweets.map( new Function<Status,SolrInputDocument>() { // Convert a twitter4j Status object into a SolrInputDocument public SolrInputDocument call(Status status) { SolrInputDocument doc = new SolrInputDocument(); … return doc; }});

map()

class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor

SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);

Slide Legend

Provided by Spark

Custom Java / Scala code

Provided by Lucidworks

Document Matching using Stored Queries

• For each document, determine which of a large set of stored queries matches.

• Useful for alerts, alternative flow paths through a stream, etc

• Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match

• Matching framework; you have to decide where to load the stored queries from and what to do when matches are found

• Scale it using Spark … need to scale to many queries, checkout Luwak

Document Matching using Stored Queries

Stored Queries

DocFilterContext

Twitter map()

Slide Legend

Provided by Spark

Custom Java / Scala code

Provided by Lucidworks

JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters);

JavaDStream<SolrInputDocument> docs = tweets.map( new Function<Status,SolrInputDocument>() { // Convert a twitter4j Status object into a SolrInputDocument public SolrInputDocument call(Status status) { SolrInputDocument doc = new SolrInputDocument(); … return doc; }});

JavaDStream<SolrInputDocument> enriched = SolrSupport.filterDocuments(docFilterContext, …);

Get queries

Index docs into an EmbeddedSolrServer Initialized from configs stored in ZooKeeper

…

ZooKeeper

Key abstraction to allow you to plug-in how to store the queries and what action to take when docs match

RDD Illustrated: Word count

map(word=>(word,1))

Map words into pairs with count of 1(quick,1)

(brown,1)

(fox,1)

(quick,1)

(quick,1)

valfile=spark.textFile("hdfs://...")

HDFS

file RDD from HDFS

quickbrownfoxjumped…

quickbrownierecipe…

quickdryingglue…

……

…

file.flatMap(line=>line.split(""))

Split lines into words

quick

brown

fox

quick

quick

……

reduceByKey(_+_)

Send all keys to same reducer and sum

(quick,1)

(quick,1)

(quick,1)

(quick,3)

Shuffle across

machine boundaries

Executors assigned based on data-locality if possible, narrow transformations occur in same executor Spark keeps track of the transformations made to generate each RDD

Partition 1Partition 2Partition 3

xvalfile=spark.textFile("hdfs://...")valcounts=file.flatMap(line=>line.split("")).map(word=>(word,1)).reduceByKey(_+_)counts.saveAsTextFile("hdfs://...")

Understanding Resilient Distributed Datasets (RDD)• Read-only partitioned collection of records with fault-tolerance

• Created from external system OR using a transformation of another RDD

• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)

• If a partition is lost, RDDs can be re-computed by re-playing the transformations

• User can choose to persist an RDD (for reusing during interactive data-mining)

• User can control partitioning scheme

Physical Architecture

Spark Master (daemon)

Spark Slave (daemon)

my-spark-job.jar (w/ shaded deps)

My Spark AppSparkContext

(driver)

• Keeps track of live workers

• Web UI on port 8080

• Task Scheduler

• Restart failed tasks

Spark Executor (JVM process)

Tasks

Executor runs in separate process than slave daemon

Spark Worker Node (1...N of these)

Each task works on some partition of a data set to apply a transformation or action

Cache

Losing a master prevents new applications from being executed

Can achieve HA using ZooKeeper and multiple master nodes

Tasks are assigned based on data-locality

When selecting which node to execute a task on, the master takes into account data locality

• RDD Graph • DAG Scheduler • Block tracker • Shuffle tracker

Spark vs. Hadoop’s Map/Reduce

Operators File System Fault-Tolerance Ecosystem

Hadoop Map-Reduce HDFS S3

Replicated blocks

Java API Pig

Hive

Spark filter, map, flatMap, join, groupByKey,

reduce, sortByKey,

count, distinct, top, cogroup,

etc …

HDFS S3

Tachyon

Immutable RDD lineage

Python / Scala / Java API

SparkSQL GraphX MLlib

SparkR

Engineering

Solr as a Spark SQL Datasource