Open Source Big Data in OPC Edelweiss Kammermann Frank Munz Java One 2017

Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Page 1: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Open Source Big Data in OPC

Edelweiss Kammermann, Frank Munz, Java One 2017

Page 2: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

munz & more #2

Page 3: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

© IT Convergence 2016. All rights reserved.

Page 4: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


About Me

• Computer Engineer, BI and Data Integration Specialist

• Over 20 years of Consulting and Project Management experience in Oracle technology

• Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG)

• Director of Community of LAOUC

• Head of BI Team CMS at ITConvergence

• Writer and frequent speaker at international conferences:

– Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum

• Oracle ACE Director

Page 5: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Uruguay

Page 6: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Dr. Frank Munz

• Founded munz & more in 2007

• 17 years Oracle Middleware, Cloud, and Distributed Computing

• Consulting and High-End Training

• Wrote two Oracle WLS books and one Cloud book

Page 7: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
Page 8: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

#1

Hadoop

Page 9: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


What is Big Data?

• Volume: the high amount of data

• Variety: the wide range of different data formats and schemas, including unstructured and semi-structured data

• Velocity: the speed at which data is created or consumed

• Oracle added another V to this definition:

– Value: data has intrinsic value, but it must be discovered

Page 10: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


What is Oracle Big Data Cloud Compute Edition?

• Big Data platform that integrates the Oracle Big Data solution with open source tools

• Fully elastic

• Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service

• Access, data, and network security

• REST access to all the functionality

Page 11: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Big Data Cloud Service – Compute Edition (BDCS-CE)

Page 12: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


BDCS-CE Notebook: Interactive Analysis

• Apache Zeppelin Notebook (version 0.7) to interactively work with data

Page 13: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


What is Hadoop?

• An open source software platform for distributed storage and processing

• Manages huge volumes of unstructured data

• Parallel processing of large data sets

• Highly scalable

• Fault-tolerant

• Two main components:

– HDFS: Hadoop Distributed File System for storing information

– MapReduce: programming framework that processes information

Page 14: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Hadoop Components: HDFS

• Stores the data on the cluster

• NameNode: block registry

• DataNode: the block containers themselves

• HDFS cartoon by Mvarshney

Page 15: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Hadoop Components: MapReduce

• Retrieves data from HDFS

• A MapReduce program is composed of:

– Map() method: performs filtering and sorting of the <key, value> inputs

– Reduce() method: summarizes the <key, value> pairs provided by the Mappers

• Code can be written in many languages (Perl, Python, Java, etc.)
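To make the Map() / Reduce() split concrete, here is a minimal sketch in plain Python. It only simulates the three phases (map, shuffle, reduce) in a single process. It does not use the Hadoop API, and the function names map_phase, shuffle, reduce_phase and word_count are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a <word, 1> pair for every word of one input line."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Hand the keys to the reducers in sorted order.
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce(): summarize all values of one key, here by summing them."""
    return (key, sum(values))


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data in the cloud", "open source big data"])
print(counts)
```

On a real cluster the mappers and reducers run in parallel on different nodes and the shuffle moves data over the network; the sketch only shows the data flow.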

Page 16: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


MapReduce Example

Page 17: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Code Example

Page 18: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Code Example

Page 19: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


#2

Hive

Page 20: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


What is Hive?

• An open source data warehouse software on top of Apache Hadoop

• Analyzes and queries data stored in HDFS

• Structures the data into tables

• Tools for simple ETL

• SQL-like queries (HiveQL)

• Procedural language with HPL-SQL

• Metadata storage in an RDBMS

Page 21: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka


Hadoop & Hive Demo

Page 22: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

#3

Spark

Page 23: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Revisited: Map Reduce I/O

Source: Hadoop Application Architecture book

Page 24: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Spark

• Orders of magnitude faster than M/R

• Higher level Scala, Java or Python API

• Standalone, in Hadoop, or Mesos

• Principle: Run an operation on all data -> "Spark is the new MapReduce"

• See also: Apache Storm, etc.

• Uses RDDs, or DataFrames, or Datasets

https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark

https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Page 25: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

RDDs

Resilient Distributed Datasets

Collection of data grouped into named columns. Supports text, JSON, Apache Parquet, sequence.

Where do they come from?

• Read in: HDFS, local FS, S3, HBase

• Parallelize an existing collection

• Transform another RDD -> RDDs are immutable

Page 26: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Lazy Evaluation


Transformations (nothing is executed): map(), flatMap(), reduceByKey(), groupByKey()

Actions (trigger execution): collect(), count(), first(), takeOrdered(), saveAsTextFile(), …

http://spark.apache.org/docs/2.1.1/programming-guide.html
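The difference between transformations and actions can be sketched without a cluster. The toy class below is not Spark: LazyDataset and its methods are hypothetical names, and it only shows the idea that a transformation records work while an action executes it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: return a NEW dataset with one more pending step.
        # The existing dataset is left untouched (immutability).
        return LazyDataset(self._source, self._steps + (func,))

    def _run(self):
        for item in self._source:
            for step in self._steps:
                item = step(item)
            yield item

    def collect(self):
        # Action: only now are the recorded steps executed.
        return list(self._run())

    def count(self):
        # Action: also triggers execution.
        return sum(1 for _ in self._run())


calls = []


def double(x):
    calls.append(x)  # side effect that makes execution visible
    return 2 * x


ds = LazyDataset([1, 2, 3]).map(double)
executed_before_action = len(calls)  # still 0: map() did not run double()
result = ds.collect()                # the action runs the pipeline
```

Real Spark additionally builds an execution plan from the recorded steps and distributes it; the sketch leaves all of that out.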

Page 27: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Transformations

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
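As a rough illustration of what these transformations compute, here are local equivalents on plain Python lists. This is a sketch of the semantics only: flat_map, group_by_key and reduce_by_key are illustrative helper names, nothing is distributed, and the sample data is invented.

```python
from functools import reduce
from itertools import chain


def flat_map(func, data):
    """Like flatMap: func returns a sequence per item, the results are flattened."""
    return list(chain.from_iterable(func(item) for item in data))


def group_by_key(pairs):
    """Like groupByKey: (K, V) pairs -> {K: [V, ...]}."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def reduce_by_key(func, pairs):
    """Like reduceByKey: combine the values of each key with func(V, V) -> V."""
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}


lengths = list(map(len, ["hive", "spark"]))                         # map: one output per input
tokens = flat_map(lambda line: line.split(" "), ["a b", "c"])       # flatMap: 0..n outputs per input
events = [("spark", 3), ("kafka", 1), ("spark", 4)]
grouped = group_by_key(events)
totals = reduce_by_key(lambda a, b: a + b, events)
```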

Page 28: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
Page 29: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
Page 30: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Spark Demo


Page 31: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Apache Zeppelin Notebook


Page 32: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Word Count and Histogram


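# t is assumed to be an RDD of text lines created earlier in the notebook
res = t.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# the 5 most frequent words: ordering by the negated count gives descending order
res.takeOrdered(5, key=lambda x: -x[1])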
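The negated key in takeOrdered() is easy to misread, so here is a small local sketch of the same idea using heapq from the standard library. The helper take_ordered and the sample counts are invented for illustration; on a real RDD the selection happens across partitions.

```python
import heapq

# Hypothetical (word, count) pairs, standing in for the result of reduceByKey.
word_counts = [("kafka", 2), ("spark", 5), ("hive", 1), ("hadoop", 4)]


def take_ordered(pairs, n, key):
    """Return the n smallest elements according to key, in ascending key order."""
    return heapq.nsmallest(n, pairs, key=key)


# Negating the count turns "smallest first" into "largest count first".
top2 = take_ordered(word_counts, 2, key=lambda x: -x[1])
bottom2 = take_ordered(word_counts, 2, key=lambda x: x[1])
```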

Page 33: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Zeppelin Notebooks


Page 34: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Big Data Compute Service CE


Page 35: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

#4

Kafka

Page 36: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Kafka

Partitioned, replicated commit log


0 1 2 3 4 … n

Immutable log: messages with offset

Producer

Consumer A

Consumer B

https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
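The picture of one log with a producer and two consumers can be sketched as follows. This is an in-memory toy for a single partition, without brokers, replication or the Kafka client API; CommitLog and Consumer are hypothetical classes written for this illustration.

```python
class CommitLog:
    """Append-only log: every message gets the next offset and is never changed."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own offset; reading does not remove messages."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = CommitLog()
for order in ["order-0", "order-1", "order-2"]:
    log.append(order)  # the producer appends at the end of the log

consumer_a = Consumer(log)
consumer_b = Consumer(log)
a_first = consumer_a.poll(max_messages=2)  # A reads offsets 0 and 1
b_all = consumer_b.poll()                  # B independently reads 0 to 2
a_rest = consumer_a.poll()                 # A continues at offset 2
```

Because the log is immutable and each consumer only moves its own offset, A and B can read at different speeds without affecting each other.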

Page 37: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

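[Diagram: Topic A is split into three partitions, TopicA(1) to TopicA(3), spread over Broker 1, Broker 2 and Broker 3 (Partition/Leader). The replicas Repl A(1) to Repl A(3) are placed on the brokers as followers (Replication/Follower). A producer writes to the topic. Three ZooKeeper nodes provide state and HA.]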

Page 38: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/

Example for Stream/Table Duality

- 1 topic

- 1 partition

- Contains every article published since 1851

- Multiple producers / consumers

Page 39: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Kafka Clients

SDKs: high/low level Kafka API

- OOTB: Java, Scala

- Confluent: Python, C, C++

Connect: configuration only, integrate external systems

- Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink

- Plugin .jar file

- JDBC: Change data capture (CDC)

Streams: data in motion, Stream/Table duality

- Real-time data ingestion

- Microservices

- KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring

- .jar file runs anywhere

REST: lightweight

- Language agnostic

- Easy for mobile apps

- Easy to tunnel through FW etc.

Page 40: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Oracle Event Hub Cloud Service

• PaaS: Managed Kafka 0.10.2

• Two deployment modes

– Basic (Broker and ZK on 1 node)

– Recommended (distributed)

• REST Proxy

– Separate server(s) running REST Proxy


Page 41: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Event Hub


Page 42: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Event Hub Service


Page 43: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Ports

You must open ports to allow access for external clients

• Kafka Broker (from OPC connect string)

• Zookeeper with port 2181


Page 44: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Scaling


horizontal

vertical (up)

Page 45: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Event Hub REST Interface


https://129.151.91.31:1080/restproxy/topics/a12345orderTopic

Service = Topic
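A hedged sketch of how a client might prepare a publish request against the topic URL above, using only the Python standard library. The request is built but not sent. The JSON body shape and the media type follow the Confluent-style REST Proxy convention and are assumptions here, as is the sample order payload; authentication is omitted, so check the Event Hub documentation before relying on any of this.

```python
import json
import urllib.request

# URL taken from the slide; it points at the speakers' demo instance.
TOPIC_URL = "https://129.151.91.31:1080/restproxy/topics/a12345orderTopic"


def build_produce_request(url, values):
    """Build (but do not send) an HTTP POST that would publish the given values."""
    # Assumption: records are wrapped as {"records": [{"value": ...}, ...]}.
    body = json.dumps({"records": [{"value": value} for value in values]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # Assumption: Confluent-style REST Proxy media type.
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    )


# Hypothetical order event used only as sample data.
request = build_produce_request(TOPIC_URL, [{"orderId": 1, "amount": 9.99}])
payload = json.loads(request.data)
```

Sending it would be a call to urllib.request.urlopen(request), plus whatever credentials and TLS settings the service requires.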

Page 46: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Interesting to Know

• Event Hub topics are prefixed with ID domain

• With Kafka CLI topics with ID Domain can be created

• Topics without ID domain are not shown in OPC console


Page 47: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

#5

Conclusion

Page 48: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

TL;DR #bigData #openSource #OPC

• Open Source: entry point to Oracle Big Data world

• Low(er) setup times

• Check for resource usage & limits in Big Data OPC

• BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka

• Attend a hands-on workshop!

• Next level: Oracle Big Data tools

@EdelweissK @FrankMunz

Page 49: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

www.linkedin.com/in/frankmunz/

www.munzandmore.com/blog

facebook.com/cloudcomputingbook

facebook.com/weblogicbook

@frankmunz

youtube.com/weblogicbook -> more than 50 webcasts

Don’t be shy :)

Page 50: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

email:[email protected]

Twitter:@EdelweissK

Page 51: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

3 Membership Tiers

• Oracle ACE Director

• Oracle ACE

• Oracle ACE Associate

bit.ly/OracleACEProgram

500+ Technical Experts Helping Peers Globally

Connect:

Nominate yourself or someone you know: acenomination.oracle.com

@oracleace

Facebook.com/oracleaces

[email protected]

Page 52: Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Sign up for Free Trial

http://cloud.oracle.com