WITSML data processing example with Kafka and Spark Streaming

Houston Hadoop Meetup, 4/26/2016


About me - Dmitry Kniazev

Currently Solution Architect at EPAM Systems

- About 4 years in Oil & Gas here in Houston
- Started working with Hadoop about 2 years ago

Before that BI/DW Specialist at EPAM Systems for 6 years

- Reports, ETL with Oracle, Microsoft, Cognos and other tools
- Enjoyed not SO HOT life in Eastern Europe

Before that Performance Analyst at EPAM Systems for 4 years

- Web Applications and Databases optimization


What is the problem?

Source: http://www.croftsystems.net/blog/conventional-vs.-unconventional


What is WITSML?


[Diagram. Labels: Rig Aggregation (x2), WITSML DataStore (x2), Corp Store, Service Company #1, Service Company #2, Operator #1, WITSML based Applications, WITSML]


[Diagram. Labels: Operator Company Data Center, Service Company, WITSML DataStore, via SOAP, Consumer (Scala) (x2), Email / Browser]


What is Kafka?


What is Spark Streaming?


Discretized Stream


Producer - prep

// some important imports
import com.mycompany.witsml.client.WitsmlClient // based on jwitsml 1.0
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.joda.time.DateTime // assumption: DateTime below is Joda-Time
import scala.xml.{Elem, Node, XML}

// variables initialization
var producer: KafkaProducer[String, String] = null
var startTimeIndex = DateTime.now()
var topic = ""
var pollInterval = 5


Producer - Kafka Properties

bootstrap.servers = srv1:9092,srv2:9092

key.serializer = org.apache.kafka.common.serialization.StringSerializer

value.serializer = org.apache.kafka.common.serialization.StringSerializer
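
As a rough illustration of how these settings could reach the producer, the sketch below parses the same three lines into a `java.util.Properties` object, which is the type the `KafkaProducer` constructor accepts. The object name `ProducerProps` and the idea of keeping the settings as inline text are assumptions made for this example; the slides do not show how `props` was actually built.

```scala
import java.io.StringReader
import java.util.Properties

object ProducerProps {
  // The three settings from the slide, in java.util.Properties syntax.
  val text: String =
    """bootstrap.servers = srv1:9092,srv2:9092
      |key.serializer = org.apache.kafka.common.serialization.StringSerializer
      |value.serializer = org.apache.kafka.common.serialization.StringSerializer
      |""".stripMargin

  // Parse the text into a Properties object; whitespace around '=' is ignored.
  def load(source: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(source))
    props
  }

  def main(args: Array[String]): Unit = {
    val props = load(text)
    // The resulting object could be passed to new KafkaProducer[String, String](props).
    println(props.getProperty("bootstrap.servers"))
  }
}
```

In a deployed producer the same `load` call would more likely read from a file, so broker addresses can change without recompiling.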


Producer - main function

producer = new KafkaProducer[String, String](props)

// each wellBore is a separate Kafka topic which is going to be partitioned by log
topic = args(0)

while (true) {
  val logs = WitsmlClient.getWitsmlResponse(logsQuery)

  // parse logs and send messages to Kafka
  (logs \ "log").foreach { node: Node =>
    // send all data from one log to the same partition
    val key = (node \ "@uidLog").text
    (node \\ "data").foreach { data =>
      // null partition: Kafka derives the partition from the key
      val message = new ProducerRecord[String, String](topic, null, key, data.text)
      producer.send(message)
    }
  }

  // wait before polling the store again (assumes pollInterval is in seconds)
  Thread.sleep(pollInterval * 1000L)
}





Producer - results

“Well123” => Topic

“5207KFSJ18” => Key (Partition)

Content of <data> element => Message
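
The mapping above relies on Kafka sending every record with the same key to the same partition. The sketch below illustrates that property only. It uses `String.hashCode` as a stand-in hash, whereas Kafka's default partitioner applies murmur2 to the serialized key bytes, so the actual partition numbers will differ. The object name, the partition count of 4, and the second key `LOG-B` are hypothetical.

```scala
object KeyPartitioning {
  // Stand-in for Kafka's key hashing: mask the sign bit, then take the
  // remainder by the partition count. Equal keys always give equal results.
  def partitionFor(key: String, numPartitions: Int): Int =
    (key.hashCode & 0x7fffffff) % numPartitions

  def main(args: Array[String]): Unit = {
    val numPartitions = 4 // hypothetical partition count for topic "Well123"
    Seq("5207KFSJ18", "LOG-B", "5207KFSJ18").foreach { key =>
      println(s"$key -> partition ${partitionFor(key, numPartitions)}")
    }
  }
}
```

Because all rows of one log share a partition, a consumer reading that partition sees them in the order they were produced.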


Consumer - prep

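import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

var schema: StructType = null

val conf = new SparkConf().setAppName("WitsmlKafkaDemo")
val ssc = new StreamingContext(conf, Seconds(1))

// kafkaParams (Map[String, String]) and topics (Set[String]) are defined elsewhere
val dStream: InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new SQLContext(ssc.sparkContext)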


Consumer - Rules Definition

# fields for Spark SQL query

`Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`

# where clause for SQL query

`Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)
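
To make the role of these two lines concrete, here is a small sketch of how a rules definition like this one could be turned into the Spark SQL query string. The parsing convention is an assumption for illustration: lines starting with `#` are treated as comments, the first remaining line is the field list, and the second is the WHERE condition. The object and method names are hypothetical; the table name `timeLog` is the one used in the consumer's query.

```scala
object RuleQuery {
  // Assumed layout: '#' lines are comments; the first remaining line is the
  // field list and the second is the WHERE condition.
  def parse(rules: String): (String, String) = {
    val lines = rules.split("\n").map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#"))
    require(lines.length == 2, "expected one field list and one condition")
    (lines(0), lines(1))
  }

  // Build the query the same way the consumer does, by string concatenation.
  def toSql(fields: String, condition: String): String =
    "SELECT " + fields + " FROM timeLog WHERE " + condition

  val rules: String =
    """# fields for Spark SQL query
      |`Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`
      |# where clause for SQL query
      |`Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    val (fields, condition) = parse(rules)
    println(toSql(fields, condition))
  }
}
```

The backticks matter: mnemonics such as `Co. Man G/L` contain spaces, dots and slashes, so they have to be quoted to be valid column names in the query.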


Consumer - main function

dStream.foreachRDD( batchRDD => {
  // keep the message value and turn each comma-separated line into a Row of strings
  val messages = batchRDD.map(_._2).map(line => Row.fromSeq(line.split(",").toSeq))

  //create DataFrame with a custom schema
  val df = sqlContext.createDataFrame(messages, schema)

  //register temp table and test against rule
  df.registerTempTable("timeLog")
  val collected = sqlContext.sql("SELECT " + fields + " FROM timeLog WHERE " + condition).collect

  if (collected.length > 0) {
    //send email alert
  }
})

ssc.start()
ssc.awaitTermination()
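
The per-batch rule check can also be illustrated without Spark. The sketch below applies the same condition as the rule definition (`Co. Man G/L` > 100 OR `Gain Loss - Spare` < -42.1) to a few comma-separated rows in plain Scala. The column order, the sample values, and all names in the object are made up for illustration; in the actual consumer the column order comes from the schema, and the filtering is done by Spark SQL.

```scala
object AlertRule {
  // Hypothetical column order for one <data> row.
  val columns: Vector[String] =
    Vector("Co. Man G/L", "Gain Loss - Spare", "ACC_DRILL_STRKS")

  // Split one comma-separated message and pair each value with its mnemonic.
  def parse(message: String): Map[String, Double] =
    columns.zip(message.split(",").toVector.map(_.trim.toDouble)).toMap

  // Same condition as the rule definition on the slide.
  def violates(row: Map[String, Double]): Boolean =
    row("Co. Man G/L") > 100 || row("Gain Loss - Spare") < -42.1

  def main(args: Array[String]): Unit = {
    // Made-up sample batch, one string per Kafka message.
    val batch = Seq("12.5,-3.0,1500", "130.0,0.0,1510", "5.0,-50.2,1520")
    val hits = batch.map(parse).filter(violates)
    // Stand-in for the email alert.
    if (hits.nonEmpty) println(s"ALERT: ${hits.size} row(s) matched the rule")
  }
}
```

Expressing the rule in SQL, as the talk does, keeps it editable as configuration, whereas a hard-coded predicate like `violates` would require a code change for every new rule.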







Visualization with Highcharts


Why Highcharts?

- WebSockets support -> real-time data visualization
- Multiple Y-axes that automatically scale -> many mnemonics on the same chart
- Inverted X-axis -> great for Depth Logs
- 3D charts that can be rotated -> Trajectories
- Area range with custom colors -> Formations on the background
- 100% client-side JavaScript -> easy to deploy


Lessons Learned

- Throw away and re-design:
  - Logs should be Topics, Wells (Wellbores) should be Partitions for Scalability
  - Producers and Consumers should be Managed Services (Flume Agents?)

- Backend:
  - Land data to HBase (and probably OpenTSDB)

- Frontend:
  - WebApp to visualize both NRT and historical data?
  - Mobile App for Alerts?

- Improve Producers:
  - Speak many WITSML dialects?

- Get ready for Real-time:
  - Support for ETP standard