
Page 1:

APACHE SPARK & ELASTICSEARCH

Holden Karau
Reducing duplicated code and saving on network overhead

Page 2:

Who am I?

Holden Karau
● Software Engineer @ Databricks
● I’ve worked with Elasticsearch before
● I prefer she/her for pronouns
● Author of a book on Spark and co-writing another*
● github: https://github.com/holdenk
● e-mail: [email protected]
● @holdenkarau

*Which is why I might be sleepy today.

Page 3:

What are Spark & Elasticsearch?

Spark
● Apache Spark™ is a fast and general engine for large-scale data processing.
● http://spark.apache.org/

Elasticsearch
● Elasticsearch is a real-time distributed search and analytics engine.
● http://www.elasticsearch.org/

Page 4:

Talk overview

Goal: understand how to work with ES & Hadoop
● Spark & Spark Streaming let us re-use indexing code
● It’s a bit ugly right now…
● Demo* with tweets & top hash tags per region
● We can customize the ES connector to write to the shard based on partition
● This is an early version of the talk (feedback welcome!)

Assumptions:
● Familiar(ish) with Elasticsearch or at least Solr
● Can read Scala

*Demo gods willing.

Page 5:

Why should you care?

Small differences between the off-line and on-line code paths

Spot the difference picture from http://en.wikipedia.org/wiki/Spot_the_difference#mediaviewer/File:Spot_the_difference.png

Page 6:

Leads to

Fireworks photo by Hailey Toft

Page 7:

Cat picture from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

Page 8:

Let’s start with the on-line pipeline

val ssc = new StreamingContext(master, "IndexTweetsLive", Seconds(1))

// Set up the system properties for twitter
System.setProperty("twitter4j.oauth.consumerKey", cK)
System.setProperty("twitter4j.oauth.consumerSecret", cS)
System.setProperty("twitter4j.oauth.accessToken", aT)
System.setProperty("twitter4j.oauth.accessTokenSecret", ats)

val tweets = TwitterUtils.createStream(ssc, None)
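To actually run the pipeline we still need to start the streaming context. A minimal sketch of the standard Spark Streaming boilerplate that would follow (assumed, not shown on the slide):

ssc.start()            // start consuming tweets
ssc.awaitTermination() // block until the job is stopped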

Page 9:

Let’s get ready to write the data into Elasticsearch

Photo by Cloned Milkmen

Page 10:

Let’s get ready to write the data into Elasticsearch

def setupEsOnSparkContext(sc: SparkContext) = {
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.set("mapred.output.format.class",
    "org.elasticsearch.hadoop.mr.EsOutputFormat")
  jobConf.setOutputCommitter(classOf[FileOutputCommitter])
  jobConf.set(ConfigurationOptions.ES_RESOURCE_WRITE, "twitter/tweet")
  FileOutputFormat.setOutputPath(jobConf, new Path("-"))
  jobConf
}
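(As I understand the connector, EsOutputFormat ships documents straight to the cluster over the network; the FileOutputCommitter and the dummy "-" output path are only there to satisfy the Hadoop output API, not to write any files.)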

Page 11:

Add a schema

curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '{
  "tweet" : {
    "properties" : {
      "message" : {"type" : "string"},
      "hashTags" : {"type" : "string"},
      "location" : {"type" : "geo_point"}
    }
  }
}'
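(Setting the mapping up front matters: if we let Elasticsearch infer it from the first document, location would be indexed as a plain string rather than a geo_point, and the geo_distance filter used later in the talk wouldn’t work.)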

Page 12:

Let’s format our tweets

def prepareTweets(tweet: twitter4j.Status) = {
  val loc = tweet.getGeoLocation()
  val lat = loc.getLatitude()
  val lon = loc.getLongitude()
  val hashTags = tweet.getHashtagEntities().map(_.getText())
  val fields = HashMap(
    "docid" -> tweet.getId().toString,
    "message" -> tweet.getText(),
    "hashTags" -> hashTags.mkString(" "),
    "location" -> s"$lat,$lon"
  )
  // Convert to Hadoop Writable types
  mapToOutput(fields)
}

Page 13:

And save them…

tweets.foreachRDD{(tweetRDD, time) =>
  val sc = tweetRDD.context
  // The jobConf isn’t serializable so we create it here
  val jobConf = SharedESConfig.setupEsOnSparkContext(sc, esResource, Some(esNodes))
  // Convert our tweets to something that can be indexed
  val tweetsAsMap = tweetRDD.map(SharedIndex.prepareTweets)
  tweetsAsMap.saveAsHadoopDataset(jobConf)
}

Page 14:

Now let’s query them!

{"filtered" : { "query" : { "match_all" : {} } ,"filter" : {"geo_distance" : { "distance" : "${dist}km", "location" : { "lat" : "${lat}", "lon" : "${lon}"}}}}}}

Page 15:

Now let’s find the hash tags :)

jobConf.set("es.query", query)

val currentTweets = sc.hadoopRDD(jobConf, classOf[EsInputFormat[Object, MapWritable]], classOf[Object], classOf[MapWritable])

val tweets = currentTweets.map{ case (key, value) => SharedIndex.mapWritableToInput(value) }

val hashTags = tweets.flatMap{t => t.getOrElse("hashTags", "").split(" ")}

println(hashTags.countByValue())

Page 16:

oh wait :(

Sad panda by Jose Antonio Tovar

Page 17:

Now let’s find some common words…

// Extract the top words
val words = tweets.flatMap{tweet =>
  tweet.flatMap{elem =>
    elem._2 match {
      case null => Nil
      case _ => elem._2.split(" ")
    }
  }
}
val wordCounts = words.countByValue()
println("------")
wordCounts.foreach{ case(key, value) => println(key + ":" + value) }
println("------")

Page 18:

object WordCountOrdering extends Ordering[(String, Int)]{
  def compare(a: (String, Int), b: (String, Int)) = {
    b._2 compare a._2
  }
}

val wc = words.map(x => (x, 1)).reduceByKey((x,y) => x+y)

val topWords = wc.takeOrdered(40)(WordCountOrdering)

ok, fine, the “top” words
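(Design note: countByValue() on the previous slide ships every distinct word back to the driver; reduceByKey plus takeOrdered(40) does the aggregation on the cluster and only returns the top 40 pairs.)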

Page 19:

NYC words

I: 144, a: 83, the: 76, to: 75, you: 66, my: 56, and: 52, me: 47, that: 39, of: 38, in: 38, on: 37, so: 36, like: 35, is: 32, this: 30, was: 29, it: 28, with: 27, for: 25, I'm: 24, i: 23, but: 22, just: 22, at: 21, be: 21, are: 20, don't: 19, have: 18, lol: 17, out: 17, your: 16, love: 16, up: 16, all: 16, her: 15, when: 14, not: 13, it's: 13

Page 20:

SF words

I: 18, the: 16, to: 15, you: 9, a: 9, my: 8, me: 8, for: 8, at: 7, of: 7, want: 6, Just: 6, in: 6, is: 6, this: 5, got: 5, she: 5, when: 5, was: 4, so: 4, what: 4, &: 4, he: 4, your: 4, as: 4, they: 4, it: 4, @: 4, get: 4, and: 4, are: 4, say: 3, w/: 3, do: 3, dont: 3, going: 3, fuck: 3, i: 3, know: 3, or: 3

Page 21:

Indexing Part 2 (electric boogaloo)

Writing directly to a node with the correct shard saves us network overhead

Screen shot of elasticsearch-head http://mobz.github.io/elasticsearch-head/

Page 22:

Slight hack time

We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]

private int detectCurrentInstance(Configuration conf) {
  if (sparkInstance != null) {
    if (log.isDebugEnabled()) {
      log.debug(String.format("Using Spark partition info [%d] for [%s]", sparkInstance, uri));
    }
    return sparkInstance;
  }
  ….
}

Page 23:

Slight hack time

We clone the connector and update EsOutputFormat.java [see https://github.com/holdenk/elasticsearch-hadoop ]

public org.apache.hadoop.mapred.RecordWriter getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) {
  EsRecordWriter writer = new EsRecordWriter(job, progress);
  // This is a special hack for Spark which sets the name as "part-[partitionnumber]"
  // so if our jobconf asks for it we use this partition number as the shard number.
  if (HadoopCfgUtils.useSparkPartition(job) && name.startsWith("part-")) {
    writer.setSparkInstance(Integer.valueOf(name.substring(5)));
  }
  return writer;
}
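(So for, say, partition 3, Spark names the output "part-00003"; name.substring(5) gives "00003", and Integer.valueOf parses that to 3, which becomes the target shard.)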

Page 24:

So what does that give us?

● Spark sets the file name to part-[partition number]
● If we have the same partitioner we write directly (see the sketch below)
● Likely the best place to use this is in re-indexing data
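A rough sketch of how the Spark side could line partitions up with shards, assuming docs is an RDD of the field maps from prepareTweets; shardFor is a toy stand-in, and the shard count is an assumption:

import org.apache.spark.HashPartitioner

// Toy stand-in for ES routing: real Elasticsearch hashes the routing key with
// its own hash function, so a faithful version must mirror that exactly.
def shardFor(docid: String, numShards: Int): Int =
  math.abs(docid.hashCode) % numShards

val numShards = 5 // assumption: the index was created with 5 primary shards
val byShard = docs.map(doc => (shardFor(doc("docid"), numShards), doc))
  .partitionBy(new HashPartitioner(numShards))
// Partition i now holds exactly the documents bound for shard i, so the
// patched EsRecordWriter can write each partition to a single shard.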

Page 25:

Re-index all the things*

// Read in our data set
val currentTweets = sc.hadoopRDD(jobConf,
  classOf[EsInputFormat[Object, MapWritable]],
  classOf[Object], classOf[MapWritable])

// Fetch them from twitter
val t4jt = tweets.flatMap{ tweet =>
  val twitter = TwitterFactory.getSingleton()
  val tweetID = tweet.getOrElse("docid", "")
  Option(twitter.showStatus(tweetID.toLong))
}
t4jt.map(SharedIndex.prepareTweets)
  .saveAsHadoopDataset(jobConf)

*Until you hit your twitter rate limit…. oops

Page 26:

Cat photo from https://www.flickr.com/photos/deerwooduk/579761138/in/photolist-4GCc4z-4GCbAV-6Ls27-34evHS-5UBnJv-TeqMG-4iNNn5-4w7s61-6GMLYS-6H5QWY-6aJLUT-tqfrf-6mJ1Lr-84kGX-6mJ1GB-vVqN6-dY8aj5-y3jK-7C7P8Z-azEtd/

Page 27:

“Useful” links

● Feedback: [email protected]
● Customized ES connector*: https://github.com/holdenk/elasticsearch-hadoop
● Demo code: https://github.com/holdenk/elasticsearchspark
● Elasticsearch: http://www.elasticsearch.org/
● Spark: http://spark.apache.org/
● Spark streaming: http://spark.apache.org/streaming/