WITSML data processing example with Kafka and Spark Streaming

Houston Hadoop Meetup, 4/26/2016


Page 1

WITSML data processing example with Kafka and Spark Streaming

Houston Hadoop Meetup, 4/26/2016

Page 2

About me - Dmitry Kniazev

Currently Solution Architect at EPAM Systems

- About 4 years in Oil & Gas here in Houston
- Started working with Hadoop about 2 years ago

Before that BI/DW Specialist at EPAM Systems for 6 years

- Reports and ETL with Oracle, Microsoft, Cognos, and other tools
- Enjoyed the not-so-hot life in Eastern Europe

Before that Performance Analyst at EPAM Systems for 4 years

- Web Applications and Databases optimization

Page 3

What is the problem?

Source: http://www.croftsystems.net/blog/conventional-vs.-unconventional

Page 4

What is WITSML?

DATA EXCHANGE STANDARD FOR THE UPSTREAM OIL AND GAS INDUSTRY

[Diagram: WITSML data flow -- rig-site data from Service Company #1 and Service Company #2 passes through each company's rig aggregation solution into its WITSML DataStore; Operator #1 pulls it into a corp store that feeds WITSML-based applications.]

Page 5

Architecture

[Diagram: a Producer (Scala) polls the WITSML DataStore in the Service Company DC via SOAP over the Internet and publishes to Kafka in the Operator Company Data Center; Consumers (Scala) read from Kafka, land data in HBase, and deliver alerts to email / browser.]

Page 6

What is Kafka?

Page 7

What is Spark Streaming?

Page 8

Discretized Stream
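
A discretized stream (DStream) is a sequence of RDDs, one per batch interval, so ordinary RDD transformations apply to each micro-batch. A minimal word-count sketch of the idea (illustrative only, not from the slides; the app name, host, and port are made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
// every second, the lines received in that interval become one RDD
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()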

Page 9

Producer - prep

// some important imports
import com.mycompany.witsml.client.WitsmlClient // based on jwitsml 1.0
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.joda.time.DateTime // not on the slide, but needed for DateTime.now() below
import scala.xml.{Elem, Node, XML}

// variables initialization
var producer: KafkaProducer[String, String] = null
var startTimeIndex = DateTime.now()
var topic = ""
var pollInterval = 5

Page 10

Producer - Kafka Properties

bootstrap.servers = srv1:9092,srv2:9092

key.serializer = org.apache.kafka.common.serialization.StringSerializer

value.serializer = org.apache.kafka.common.serialization.StringSerializer
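
These settings are passed to the producer as a java.util.Properties object; a minimal sketch of the `props` used on the next slide:

import java.util.Properties

val props = new Properties()
props.put("bootstrap.servers", "srv1:9092,srv2:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")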

Page 11

Producer - main function

producer = new KafkaProducer[String, String](props)

// each wellbore is a separate Kafka topic, which is partitioned by log
topic = args(0)

while (true) {
  val logs = WitsmlClient.getWitsmlResponse(logsQuery)
  // parse logs and send messages to Kafka
  (logs \ "log").foreach { node: Node =>
    // send all data from one log to the same partition
    val key = (node \ "@uidLog").text
    (node \\ "data").foreach { data =>
      val message = new ProducerRecord[String, String](topic, null, key, data.text)
      producer.send(message)
    }
  }
  // missing from the slide: close the loop and wait before polling again
  // (pollInterval from the prep slide, assumed to be in seconds)
  Thread.sleep(pollInterval * 1000)
}
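
The slides never show `logsQuery`; with a jwitsml-style client it would be a WITSML query template. A hypothetical sketch (the uids are placeholders and the namespace depends on the server's WITSML version):

val logsQuery =
  <logs xmlns="http://www.witsml.org/schemas/1series" version="1.4.1.1">
    <log uidWell="W-123" uidWellbore="WB-123">
      <startDateTimeIndex>{startTimeIndex}</startDateTimeIndex>
      <logData/> <!-- ask the server to return the data rows -->
    </log>
  </logs>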

Page 12

Producer - results

"Well123" => Topic

"5207KFSJ18" => Key (Partition)

Content of <data> element => Message

Page 13

Consumer - prep

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

// the custom schema is built later, once the field list is known
var schema: StructType = null

val sc = new SparkConf().setAppName("WitsmlKafkaDemo")
val ssc = new StreamingContext(sc, Seconds(1))

// direct (receiver-less) stream of (key, value) message pairs from Kafka
val dStream: InputDStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new SQLContext(ssc.sparkContext)
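
`kafkaParams` and `topics` are not shown on the slide; for the direct API against Kafka 0.8 they would be roughly as follows (broker list taken from the producer config, topic name assumed):

val kafkaParams = Map("metadata.broker.list" -> "srv1:9092,srv2:9092")
val topics = Set("Well123") // one topic per wellbore, as produced above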

Page 14

Consumer - Rules Definition

# fields for Spark SQL query

`Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`

# where clause for SQL query

`Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)
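
One way `fields` and `condition` (used in the Spark SQL query on the next slide) could be read from this rules file -- a sketch assuming the '#' lines are comments and using a made-up file name:

import scala.io.Source

val ruleLines = Source.fromFile("rules.txt").getLines()
  .map(_.trim).filter(l => l.nonEmpty && !l.startsWith("#")).toList
val fields = ruleLines(0)    // `Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`
val condition = ruleLines(1) // `Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)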

Page 15

Consumer - main function

dStream.foreachRDD { batchRDD =>
  // each message value is a comma-separated row of channel values
  val messages = batchRDD.map(_._2).map(_.split(","))
  // create DataFrame with a custom schema (createDataFrame needs an RDD[Row])
  val rows = messages.map(arr => Row.fromSeq(arr))
  val df = sqlContext.createDataFrame(rows, schema)
  // register temp table and test against the rule
  df.registerTempTable("timeLog")
  val collected = sqlContext.sql("SELECT " + fields + " FROM timeLog WHERE " + condition).collect()
  if (collected.length > 0) {
    // send email alert
    WitsmlKafkaUtil.sendEmail(collected)
  }
}

ssc.start()

ssc.awaitTermination()
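
The `schema` declared in the prep step is also never shown; since each message is just a comma-separated row of channel values, one plausible sketch is all-string columns named after the log mnemonics (column list assumed from the rules slide):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val mnemonics = Seq("Co. Man G/L", "Gain Loss - Spare", "ACC_DRILL_STRKS")
schema = StructType(mnemonics.map(StructField(_, StringType, nullable = true)))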

Page 16

Visualization with Highcharts

Page 17

Why Highcharts?

- WebSockets support -> real-time data visualization
- Multiple Y-axes that automatically scale -> many mnemonics on the same chart
- Inverted X-axis -> great for depth logs
- 3D charts that can be rotated -> trajectories
- Area ranges with custom colors -> formations in the background
- 100% client-side JavaScript -> easy to deploy

Page 18

Lessons Learned

- Throw away and re-design:
  - Logs should be topics and wells (wellbores) should be partitions, for scalability
  - Producers and consumers should be managed services (Flume agents?)

- Backend:
  - Land data in HBase (and probably OpenTSDB)

- Frontend:
  - Web app to visualize both NRT and historical data?
  - Mobile app for alerts?

- Improve producers:
  - Speak many WITSML dialects?

- Get ready for real-time:
  - Support for the ETP standard