WITSML data processing example with Kafka and Spark Streaming
Houston Hadoop Meetup, 4/26/2016
About me - Dmitry Kniazev
Currently Solution Architect at EPAM Systems
- About 4 years in Oil & Gas here in Houston
- Started working with Hadoop about 2 years ago
Before that BI/DW Specialist at EPAM Systems for 6 years
- Reports, ETL with Oracle, Microsoft, Cognos and other tools
- Enjoyed not SO HOT life in Eastern Europe
Before that Performance Analyst at EPAM Systems for 4 years
- Web Applications and Databases optimization
What is the problem?
Source: http://www.croftsystems.net/blog/conventional-vs.-unconventional
What is WITSML?
DATA EXCHANGE STANDARD FOR THE UPSTREAM OIL AND GAS INDUSTRY
[Diagram. Labels: Service Company #1, Service Company #2, Operator #1, Rig Aggregation Solution (x2), WITSML DataStore (x2), Corp Store, WITSML-based Applications, WITSML, Operator Company Data Center]
Architecture
[Diagram. Labels: Service Company DC, WITSML DataStore, WITSML via SOAP, Internet, Producer (Scala), Kafka, Consumer (Scala) (x2), HBase, Email / Browser]
What is Kafka?
What is Spark Streaming?
Discretized Stream
Producer - prep
// some important imports
import com.mycompany.witsml.client.WitsmlClient // based on jwitsml 1.0
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.joda.time.DateTime // assumption: DateTime comes from Joda-Time
import scala.xml.{Elem, Node, XML}

// variables initialization
var producer: KafkaProducer[String, String] = null
var startTimeIndex = DateTime.now()
var topic = ""
var pollInterval = 5
Producer - Kafka Properties
bootstrap.servers = srv1:9092,srv2:9092
key.serializer = org.apache.kafka.common.serialization.StringSerializer
value.serializer = org.apache.kafka.common.serialization.StringSerializer
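The three properties above can be assembled in code before the producer is constructed. A minimal sketch, assuming the values are passed in rather than read from a file; `ProducerConfigSketch` and `kafkaProps` are hypothetical names, and `srv1`/`srv2` are the slide's placeholder hosts:

```scala
import java.util.Properties

object ProducerConfigSketch {
  // Build the producer configuration shown on the slide.
  // `brokers` is a hypothetical parameter holding the bootstrap server list.
  def kafkaProps(brokers: String): Properties = {
    val serializer = "org.apache.kafka.common.serialization.StringSerializer"
    val props = new Properties()
    props.setProperty("bootstrap.servers", brokers)
    // Keys (log uid) and values (<data> content) are both plain strings.
    props.setProperty("key.serializer", serializer)
    props.setProperty("value.serializer", serializer)
    props
  }
}
```

The resulting object can be passed as `props` to `new KafkaProducer[String, String](props)`.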
Producer - main function
producer = new KafkaProducer[String, String](props)
// each wellbore is a separate Kafka topic, which is partitioned by log
topic = args(0)
while (true) {
  val logs = WitsmlClient.getWitsmlResponse(logsQuery)
  // parse logs and send messages to Kafka
  (logs \ "log").foreach { node: Node =>
    // send all data from one log to the same partition
    val key = (node \ "@uidLog").text
    (node \\ "data").foreach { data =>
      // null partition: Kafka derives the partition from the key
      val message = new ProducerRecord[String, String](topic, null, key, data.text)
      producer.send(message)
    }
  }
  // assumption: pollInterval is in seconds; wait before polling again
  Thread.sleep(pollInterval * 1000L)
}
Producer - results
“Well123” => Topic
“5207KFSJ18” => Key (Partition)
Content of <data> element => Message
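Using the log uid as the message key is what keeps one log's data in one partition: the partition is derived from a hash of the key. The sketch below illustrates that idea only. Kafka's default partitioner uses a murmur2 hash, whereas this stand-in uses the `MurmurHash3` that ships with Scala, so the partition numbers will not match a real cluster. `PartitionSketch` and `partitionFor` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import scala.util.hashing.MurmurHash3

object PartitionSketch {
  // Stand-in for key-based partitioning: hash the key bytes and take the
  // result modulo the partition count. Not Kafka's actual hash function.
  def partitionFor(key: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    val hash = MurmurHash3.bytesHash(key.getBytes(StandardCharsets.UTF_8))
    // Clear the sign bit so the modulo result is never negative.
    (hash & 0x7fffffff) % numPartitions
  }
}
```

Calling `PartitionSketch.partitionFor("5207KFSJ18", 4)` repeatedly returns the same partition every time, which is the property the producer relies on.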
Consumer - prep
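import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.StructType
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

var schema: StructType = null

val sc = new SparkConf().setAppName("WitsmlKafkaDemo")
val ssc = new StreamingContext(sc, Seconds(1)) // 1-second micro-batches

val dStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new SQLContext(ssc.sparkContext)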
Consumer - Rules Definition
# fields for Spark SQL query
`Co. Man G/L`,`Gain Loss - Spare`,`ACC_DRILL_STRKS`
# where clause for SQL query
`Co. Man G/L`>100 OR `Gain Loss - Spare`<(-42.1)
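The two rule lines above end up as the field list and the where clause of a Spark SQL query. A minimal sketch of turning such a rules file into a query string, assuming the file holds exactly two non-comment lines (fields first, condition second); `RuleSketch`, `Rule`, `parse`, and `toSql` are hypothetical names, not part of the original code:

```scala
object RuleSketch {
  // One alert rule: the columns to select and the condition to test.
  final case class Rule(fields: String, condition: String) {
    def toSql(table: String): String =
      s"SELECT $fields FROM $table WHERE $condition"
  }

  // Drop blank lines and '#' comments; the first remaining line is the
  // field list and the second is the where clause (assumed layout).
  def parse(lines: Seq[String]): Rule = {
    val content = lines.map(_.trim).filter(l => l.nonEmpty && !l.startsWith("#"))
    require(content.length == 2, "expected a fields line and a condition line")
    Rule(content(0), content(1))
  }
}
```

With the slide's rules, `RuleSketch.parse(lines).toSql("timeLog")` yields the same query text that the consumer builds by concatenating `fields` and `condition`.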
Consumer - main function
dStream.foreachRDD( batchRDD => {
  // keep the message value, split it into columns and wrap them in a Row;
  // values stay strings, so the schema must declare string columns
  val messages = batchRDD.map(_._2).map(line => Row.fromSeq(line.split(",").toSeq))
  // create DataFrame with a custom schema
  val df = sqlContext.createDataFrame(messages, schema)
  // register temp table and test against rule
  df.registerTempTable("timeLog")
  val collected = sqlContext.sql("SELECT " + fields + " FROM timeLog WHERE " + condition).collect()
  if (collected.length > 0) {
    // send email alert
    WitsmlKafkaUtil.sendEmail(collected)
  }
})
ssc.start()
ssc.awaitTermination()
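To see what the per-batch check does without a Spark cluster, the same logic can be written over a plain Scala collection. This is a stand-in for illustration, not the Spark code: it assumes each message carries exactly the three columns from the rules slide, in that order, as numeric values. `BatchCheckSketch` and its members are hypothetical names.

```scala
object BatchCheckSketch {
  // Hypothetical column order; in the talk the schema is built elsewhere.
  val columns = Vector("Co. Man G/L", "Gain Loss - Spare", "ACC_DRILL_STRKS")

  // Split a comma-separated message and pair each value with its column.
  def parse(message: String): Map[String, Double] =
    columns.zip(message.split(",").toVector.map(_.trim.toDouble)).toMap

  // Same condition as the where clause on the rules slide.
  def violates(row: Map[String, Double]): Boolean =
    row("Co. Man G/L") > 100 || row("Gain Loss - Spare") < -42.1

  // Rows in the batch that would trigger an alert.
  def alerts(batch: Seq[String]): Seq[Map[String, Double]] =
    batch.map(parse).filter(violates)
}
```

For a batch such as `Seq("50,0,1", "150,0,1", "10,-50.0,3")`, the second and third messages match the rule, so an alert would be sent for that batch.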
Visualization with Highcharts
Why Highcharts?
- Websockets support -> real-time data visualization
- Multiple Y-axes that automatically scale -> many mnemonics on the same chart
- Inverted X-axis -> great for Depth Logs
- 3D charts that can be rotated -> Trajectories
- Area range with custom colors -> Formations on the background
- 100% client-side JavaScript -> easy to deploy
Lessons Learned
- Throw away and re-design:
  - Logs should be Topics, Wells (Wellbores) should be Partitions for Scalability
  - Producers and Consumers should be Managed Services (Flume Agents?)
- Backend:
  - Land data to HBase (and probably OpenTSDB)
- Frontend:
  - WebApp to visualize both NRT and historical data?
  - Mobile App for Alerts?
- Improve Producers:
  - Speak many WITSML dialects?
- Get ready for Real-time:
  - Support for ETP standard
Thank you!
Links:
http://www.energistics.org/
http://www.highcharts.com/
https://spark.apache.org/
http://kafka.apache.org/