Open Source Big Data in OPC
Edelweiss Kammermann and Frank Munz
JavaOne 2017
munz & more #2
© IT Convergence 2016. All rights reserved.
About Me
• Computer Engineer, BI and Data Integration Specialist
• Over 20 years of consulting and project management experience in Oracle technology
• Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
• Director of Community of LAOUC
• Head of BI Team CMS at IT Convergence
• Writer and frequent speaker at international conferences:
  • Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum
• Oracle ACE Director
Uruguay
Dr. Frank Munz
• Founded munz & more in 2007
• 17 years Oracle Middleware, Cloud, and Distributed Computing
• Consulting and High-End Training
• Wrote two Oracle WLS and one Cloud book
#1
Hadoop
What is Big Data?
• Volume: the high amount of data
• Variety: the wide range of different data formats and schemas, including unstructured and semi-structured data
• Velocity: the speed at which data is created or consumed
• Oracle added another V to this definition:
  • Value: data has intrinsic value, but it must be discovered
What is Oracle Big Data Cloud Compute Edition?
• Big Data platform that integrates the Oracle Big Data solution with open source tools
• Fully elastic
• Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service
• Access, data, and network security
• REST access to all the functionality
Big Data Cloud Service – Compute Edition (BDCS-CE)
BDCS-CE Notebook: Interactive Analysis
• Apache Zeppelin Notebook (version 0.7) to interactively work with data
What is Hadoop?
• An open source software platform for distributed storage and processing
• Manages huge volumes of unstructured data
• Parallel processing of large data sets
• Highly scalable
• Fault-tolerant
• Two main components:
  • HDFS: Hadoop Distributed File System for storing information
  • MapReduce: programming framework that processes information
Hadoop Components: HDFS
• Stores the data on the cluster
• NameNode: block registry
• DataNode: block containers themselves
(HDFS cartoon by Mvarshney)
Hadoop Components: MapReduce
• Retrieves data from HDFS
• A MapReduce program is composed of:
  • Map() method: performs filtering and sorting of the <key, value> inputs
  • Reduce() method: summarizes the <key, value> pairs provided by the Mappers
• Code can be written in many languages (Perl, Python, Java, etc.)
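The Map/shuffle/Reduce flow above can be sketched in plain Python, with no Hadoop cluster involved; this is an illustrative word-count example, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(line):
    # Map(): emit a <word, 1> pair for every word in the input line
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce(): summarize all values emitted for one key
    return (key, sum(values))

def word_count(lines):
    # Shuffle: group intermediate <key, value> pairs by key,
    # as the MapReduce framework does between the two phases
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(word_count(["big data", "big hadoop"]))
# {'big': 2, 'data': 1, 'hadoop': 1}
```

In real Hadoop the mappers and reducers run in parallel on many nodes, and the shuffle moves data across the network; the logic per record is the same.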
MapReduce Example
Code Example
Code Example
#2
Hive
What is Hive?
• An open source data warehouse software on top of Apache Hadoop
• Analyze and query data stored in HDFS
• Structures the data into tables
• Tools for simple ETL
• SQL-like queries (HiveQL)
• Procedural language with HPL/SQL
• Metadata storage in an RDBMS
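A minimal HiveQL sketch of the points above; the table and column names (logs, level) are hypothetical, and the file layout is assumed to be tab-delimited text already sitting in HDFS:

```sql
-- Structure existing HDFS files into a table (schema-on-read)
CREATE EXTERNAL TABLE logs (
  ts      STRING,
  level   STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hive/logs';

-- SQL-like query; Hive compiles it into distributed jobs
SELECT level, COUNT(*) AS cnt
FROM logs
GROUP BY level
ORDER BY cnt DESC;
```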
Hadoop & Hive Demo
#3
Spark
Revisited: MapReduce I/O
Source: Hadoop Application Architecture book
Spark
• Orders of magnitude faster than M/R
• Higher level Scala, Java or Python API
• Standalone, in Hadoop, or Mesos
• Principle: run an operation on all data -> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, or Dataframes, or Datasets
https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
RDDs
Resilient Distributed Datasets
• Collection of data grouped into named columns. Supports text, JSON, Apache Parquet, sequence files.
• Where do they come from?
  • Read in HDFS, local FS, S3, HBase
  • Parallelize an existing collection
  • Transform another RDD -> RDDs are immutable
Lazy Evaluation
• Transformations (nothing is executed yet): map(), flatMap(), reduceByKey(), groupByKey()
• Actions (trigger execution): collect(), count(), first(), takeOrdered(), saveAsTextFile(), …
http://spark.apache.org/docs/2.1.1/programming-guide.html
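Spark itself is not needed to see the principle; a Python generator shows the same deferred execution (this is an analogy to an RDD lineage, not Spark's implementation):

```python
log = []  # records which elements have actually been processed

def double_all(data):
    # Like a transformation: builds a recipe, nothing runs yet
    for x in data:
        log.append(x)
        yield x * 2

recipe = double_all([1, 2, 3])
print(log)             # [] -- lazy: no element processed yet

result = list(recipe)  # like an action (collect()): triggers execution
print(result)          # [2, 4, 6]
print(log)             # [1, 2, 3] -- work happened only at the action
```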
Transformations
• map(func): returns a new distributed dataset formed by passing each element of the source through a function func.
• flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
• groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
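The semantics of the two (K, V) transformations can be mimicked in plain Python (illustration only; real Spark distributes this work across partitions):

```python
from collections import defaultdict
from operator import add

def reduce_by_key(pairs, func):
    # reduceByKey: aggregate values per key with func of type (V, V) => V
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

def group_by_key(pairs):
    # groupByKey: collect all values per key into an iterable
    out = defaultdict(list)
    for k, v in pairs:
        out[k].append(v)
    return dict(out)

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key(pairs, add))  # {'a': 4, 'b': 2}
print(group_by_key(pairs))        # {'a': [1, 3], 'b': [2]}
```

Note the practical difference: reduceByKey combines values as it goes, while groupByKey ships every value per key, which is why reduceByKey is usually preferred for aggregations.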
Spark Demo
Apache Zeppelin Notebook
Word Count and Histogram
# word count: split each line into words, emit (word, 1) pairs,
# then sum the counts per word
res = t.flatMap(lambda line: line.split(" ")) \
       .map(lambda word: (word, 1)) \
       .reduceByKey(lambda a, b: a + b)
# top 5 words by descending count
res.takeOrdered(5, key=lambda x: -x[1])
Zeppelin Notebooks
Big Data Compute Service CE
#4
Kafka
Kafka
Partitioned, replicated commit log
Immutable log: messages with offsets 0, 1, 2, 3, 4, … n
• Producer appends at the tail of the log
• Consumers A and B each read from their own offset
https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
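The partitioned commit log above can be sketched as an append-only list where each consumer keeps only its own offset; this is a simplified single-partition model, not Kafka's actual API:

```python
class PartitionLog:
    """Immutable append-only log: a message keeps its offset forever."""
    def __init__(self):
        self._messages = []

    def append(self, msg):
        # producer side: messages are only ever appended at the tail
        self._messages.append(msg)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset):
        # consumer side: read everything from a given offset onward
        return self._messages[offset:]

log = PartitionLog()
for m in ["m0", "m1", "m2", "m3"]:
    log.append(m)

consumer_a_offset = 1  # consumer A has processed offset 0
consumer_b_offset = 3  # consumer B is almost caught up
print(log.read(consumer_a_offset))  # ['m1', 'm2', 'm3']
print(log.read(consumer_b_offset))  # ['m3']
```

Because the broker only tracks the log and each consumer tracks its own offset, many consumers can read the same partition independently and at different speeds.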
Replication (diagram): three brokers host Topic A with partitions A(1), A(2), A(3)
• Each partition has one leader; the producer writes to the partition leader
• Replicas Repl A(1), Repl A(2), Repl A(3) act as followers on the other brokers
• A ZooKeeper ensemble maintains cluster state and HA
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
• 1 topic, 1 partition
• Contains every article published since 1851
• Multiple producers / consumers
• Example for stream/table duality
Kafka Clients
• SDKs: high/low level Kafka API
  – OOTB: Java, Scala
  – Confluent: Python, C, C++
• Connect: configuration only, integrates external systems
  – Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink
  – Plugin .jar file
  – JDBC: change data capture (CDC)
• Streams: data in motion, stream/table duality
  – Real-time data ingestion, microservices
  – KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring
  – Lightweight: .jar file runs anywhere
• REST
  – Language agnostic
  – Easy for mobile apps
  – Easy to tunnel through firewalls etc.
Oracle Event Hub Cloud Service
• PaaS: Managed Kafka 0.10.2
• Two deployment modes
– Basic (Broker and ZK on 1 node)
– Recommended (distributed)
• REST Proxy
– Separate server(s) running REST Proxy
Event Hub
Event Hub Service
Ports
You must open ports to allow access for external clients:
• Kafka Broker (port from the OPC connect string)
• ZooKeeper on port 2181
Scaling
• horizontal
• vertical (up)
Event Hub REST Interface
https://129.151.91.31:1080/restproxy/topics/a12345orderTopic
Service = Topic
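A call against the REST endpoint above might be built as follows. The record payload is hypothetical, and the JSON shape and content type follow Confluent's REST Proxy v2 convention, which the Event Hub proxy is assumed to resemble; the actual POST is left commented out since endpoint and credentials are environment-specific:

```python
import json

# Endpoint from the slide above; "a12345orderTopic" is the ID-domain-prefixed topic
BASE = "https://129.151.91.31:1080/restproxy/topics/a12345orderTopic"

# One produce request can carry several records; each record wraps its value
payload = {"records": [{"value": {"orderId": 42, "status": "NEW"}}]}
body = json.dumps(payload).encode("utf-8")
headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

# import urllib.request
# req = urllib.request.Request(BASE, data=body, headers=headers, method="POST")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)

print(body.decode("utf-8"))
```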
Interesting to Know
• Event Hub topics are prefixed with the ID domain
• With the Kafka CLI, topics with the ID domain can be created
• Topics without the ID domain are not shown in the OPC console
#5
Conclusion
TL;DR #bigData #openSource #OPC
• Open source: entry point to the Oracle Big Data world
• Low(er) setup times
• Check for resource usage & limits in Big Data OPC
• BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka
• Attend a hands-on workshop!
• Next level: Oracle Big Data tools
@EdelweissK @FrankMunz
www.linkedin.com/in/frankmunz/
www.munzandmore.com/blog
facebook.com/cloudcomputingbook
facebook.com/weblogicbook
@frankmunz
youtube.com/weblogicbook -> more than 50 webcasts
Don't be shy :-)
email: [email protected]
Twitter: @EdelweissK
3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
bit.ly/OracleACEProgram
500+ Technical Experts Helping Peers Globally
Connect:
• Nominate yourself or someone you know: acenomination.oracle.com
• @oracleace
• Facebook.com/oracleaces
Sign up for Free Trial
http://cloud.oracle.com