Emerging Technologies/Frameworks in Big Data, Rahul Jain (@rahuldausa), Meetup Sep 2015


Page 1: Emerging technologies /frameworks in Big Data

Emerging Technologies/Frameworks in Big Data

Rahul Jain (@rahuldausa)

Meetup Sep 2015

Page 2: Emerging technologies /frameworks in Big Data

About Me

• Independent Big Data/Search Consultant

• 8+ years of learning experience.

• Worked (got a chance) on high-volume distributed applications.

• Still a learner (and beginner)

Page 3: Emerging technologies /frameworks in Big Data

Quick Questionnaire

How many people know or have heard of Apache Parquet?

How many people know or have heard of Apache Drill?

How many people know or have heard of Apache Flink?

Page 4: Emerging technologies /frameworks in Big Data

What are we going to learn/see today?

• Columnar Storage (overview)

• Apache Parquet (with Demo)

• Dremel (Basic overview)

• Apache Drill (with Demo)

• Apache Flink (with Demo)

Page 5: Emerging technologies /frameworks in Big Data

Let’s discuss Columnar Storage

Page 6: Emerging technologies /frameworks in Big Data

Let’s say we have an Employee table

RowId  EmpId  Lastname  Firstname  Salary
001    10     Smith     Joe        40000
002    12     Jones     Mary       50000
003    11     Johnson   Cathy      44000
004    22     Jones     Bob        55000

Page 7: Emerging technologies /frameworks in Big Data

Table storage in a row-oriented system

In row-oriented systems, it will be stored as

001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;

RowId  EmpId  Lastname  Firstname  Salary
001    10     Smith     Joe        40000
002    12     Jones     Mary       50000
003    11     Johnson   Cathy      44000
004    22     Jones     Bob        55000

Page 8: Emerging technologies /frameworks in Big Data

Table storage in a column-oriented system

In row-oriented systems, it will be stored as

001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;

RowId  EmpId  Lastname  Firstname  Salary
001    10     Smith     Joe        40000
002    12     Jones     Mary       50000
003    11     Johnson   Cathy      44000
004    22     Jones     Bob        55000

But in column-oriented systems, it will be stored as

10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;

Page 9: Emerging technologies /frameworks in Big Data

Row vs Column Storage

Row-oriented storage

001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;

Column-oriented storage

10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
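The two layouts can be reproduced with a few lines of Python. This is only a sketch of the idea on the Employee table from the slides; the names `EMPLOYEES`, `row_oriented` and `column_oriented` are made up for illustration, and real columnar formats use binary encodings rather than text.

```python
# Employee table from the slides: (RowId, EmpId, Lastname, Firstname, Salary).
EMPLOYEES = [
    ("001", 10, "Smith", "Joe", 40000),
    ("002", 12, "Jones", "Mary", 50000),
    ("003", 11, "Johnson", "Cathy", 44000),
    ("004", 22, "Jones", "Bob", 55000),
]


def row_oriented(rows):
    """Serialize record by record: all values of one row sit together."""
    return "".join(
        f"{row_id}:{','.join(str(v) for v in values)};"
        for row_id, *values in rows
    )


def column_oriented(rows):
    """Serialize column by column: each value is tagged with its row id."""
    column_count = len(rows[0]) - 1  # every field except the row id
    return "".join(
        ",".join(f"{row[c]}:{row[0]}" for row in rows) + ";"
        for c in range(1, column_count + 1)
    )


print(row_oriented(EMPLOYEES))
print(column_oriented(EMPLOYEES))
```

In the column layout all salaries end up next to each other, so a query that only needs one column does not have to read the other columns.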

Page 10: Emerging technologies /frameworks in Big Data

Apache Parquet (Columnar Storage for the Hadoop ecosystem)

Page 11: Emerging technologies /frameworks in Big Data

About Apache Parquet

• Columnar based Storage format

• Initially started by Twitter and Cloudera

• Stores nested data structures in a flat columnar format using a technique outlined in the Dremel paper from Google.

• Can store very large datasets with a very high compression ratio.

• Because of the compression, less I/O and faster processing.

• Provides high level APIs in Java

• Integration with Hadoop and its eco-system

• http://parquet.apache.org

Page 12: Emerging technologies /frameworks in Big Data

Parquet Design

• required: exactly one occurrence

• optional: 0 or 1 occurrence

• repeated: 0 or more occurrences

For example, an address book schema:

message AddressBook {
  required string owner;
  repeated string ownerPhoneNumbers;
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
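To make the three repetition rules concrete, here is a small Python sketch that checks a record against them. It is not a Parquet API: `ADDRESS_BOOK` is a hypothetical in-memory mirror of the schema above, and `check` is a made-up helper.

```python
# Hypothetical in-memory mirror of the AddressBook schema (not a Parquet API).
# Each field maps to (repetition, type); a dict as type means a nested group.
ADDRESS_BOOK = {
    "owner": ("required", str),
    "ownerPhoneNumbers": ("repeated", str),
    "contacts": ("repeated", {
        "name": ("required", str),
        "phoneNumber": ("optional", str),
    }),
}


def check(record, schema):
    """Return a list of violations of the required/optional/repeated rules."""
    errors = []
    for field, (repetition, field_type) in schema.items():
        value = record.get(field)
        if repetition == "repeated":
            values = [] if value is None else value  # 0 or more occurrences
            if not isinstance(values, list):
                errors.append(f"{field}: repeated field must be a list")
                continue
        elif value is None:
            if repetition == "required":  # exactly one occurrence expected
                errors.append(f"{field}: required field is missing")
            continue  # optional: zero occurrences is fine
        else:
            values = [value]  # a single occurrence is present
        for item in values:
            if isinstance(field_type, dict):
                if isinstance(item, dict):
                    errors.extend(f"{field}.{e}" for e in check(item, field_type))
                else:
                    errors.append(f"{field}: expected a group")
            elif not isinstance(item, field_type):
                errors.append(f"{field}: expected {field_type.__name__}")
    return errors


good = {
    "owner": "Alice",
    "ownerPhoneNumbers": ["555 0100"],
    "contacts": [{"name": "Bob", "phoneNumber": "555 0101"}, {"name": "Carol"}],
}
print(check(good, ADDRESS_BOOK))
print(check({"contacts": [{"phoneNumber": "555 0102"}]}, ADDRESS_BOOK))
```

The second record omits `owner` and a contact `name`, so both required fields are reported; the missing `ownerPhoneNumbers` is fine because a repeated field may occur zero times.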

Page 13: Emerging technologies /frameworks in Big Data

Size Comparison

$ du -sch test.*
407M test.csv (1 million records, 4 columns)
70M test.csv.gz (~83% reduction)
35M test.parquet (~92% reduction)
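The reduction figures follow from the sizes as 1 minus (compressed size / original size). A quick check in Python, using the rounded megabyte values that `du` printed; `reduction_percent` is a made-up helper name. Because the inputs are rounded, the result can differ slightly from the percentages on the slide, which presumably came from more exact sizes.

```python
def reduction_percent(original_mb, compressed_mb):
    """Percentage of space saved relative to the original file."""
    return 100.0 * (1.0 - compressed_mb / original_mb)


CSV_MB, GZIP_MB, PARQUET_MB = 407, 70, 35  # sizes reported by du above

print(f"gzip:    {reduction_percent(CSV_MB, GZIP_MB):.1f}%")     # 82.8%
print(f"parquet: {reduction_percent(CSV_MB, PARQUET_MB):.1f}%")  # 91.4%
```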

Page 14: Emerging technologies /frameworks in Big Data

Let’s first discuss Dremel: Interactive Analysis of Web-Scale Datasets

Page 15: Emerging technologies /frameworks in Big Data

What is Dremel

• A paper published by Google in 2010

• Interactive Analysis of Web-Scale Datasets

– Ad hoc queries on very large datasets (petabytes)

– Near real time

– MR (MapReduce) works, but it is meant for batch processing

• SQL-like query interface

• Nested data (with a columnar storage representation)

• Paper:

– http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

• Projects (implementations):

– Google BigQuery (cloud-based)

– Apache Drill (open source)

Page 16: Emerging technologies /frameworks in Big Data

Why Dremel: Speed Matters

Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets

Page 17: Emerging technologies /frameworks in Big Data

Widely used inside Google

Credit: http://www.slideshare.net/robertlz/dremel-interactive-analysis-of-webscale-datasets

Page 18: Emerging technologies /frameworks in Big Data

Tree based structure

Credit: http://www.alberton.info/images/articles/papers/dremel1.png

Page 19: Emerging technologies /frameworks in Big Data

Column striped representation

Credit: http://www.alberton.info/images/articles/papers/dremel2.png

Page 20: Emerging technologies /frameworks in Big Data

Query Processing

Credit: http://farm9.staticflickr.com/8426/7843420938_9cb23a4cb0_b.jpg

Page 21: Emerging technologies /frameworks in Big Data

Let’s move to Apache Drill

Page 22: Emerging technologies /frameworks in Big Data

About Apache Drill

• Based on Google’s Dremel paper

• Supports data-intensive distributed applications for interactive analysis of large-scale datasets

• Has a datastore-aware optimizer

– which constructs the query plan based on the datastore’s processing capabilities.

• Supports data locality.

• http://drill.apache.org/

Page 23: Emerging technologies /frameworks in Big Data

So Why Drill?

• Flexible data model

• Fixed schema (Avro)/dynamic schema (JSON)/schema-less SQL

• Schema can be discovered on the fly

• Built-in optimistic query execution engine.

• Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)

• Better performance and manageability

• Cluster of commodity servers

• Daemon (drillbit) on each data node

• Works with Hadoop, CSV, JSON, Avro/Parquet, MongoDB, HBase, Solr etc.

Page 24: Emerging technologies /frameworks in Big Data

Query any non-relational datastore

Page 25: Emerging technologies /frameworks in Big Data

Distributed SQL query engine

Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel

Page 26: Emerging technologies /frameworks in Big Data

Designed to support wide set of use-cases

Credit: http://www.slideshare.net/MapRTechnologies/drill-highperformancesqlenginewithjsondatamodel

Page 27: Emerging technologies /frameworks in Big Data

Querying

CSV:

0: jdbc:drill:> select count(*) from dfs.`/tmp/test.csv`;
+-----------+
| EXPR$0    |
+-----------+
| 10000001  |
+-----------+
1 row selected (5.771 seconds)

Parquet:

0: jdbc:drill:> select count(*) from dfs.`/tmp/test.parquet`;
+-----------+
| EXPR$0    |
+-----------+
| 10000001  |
+-----------+
1 row selected (0.257 seconds)

Page 28: Emerging technologies /frameworks in Big Data

Drill Shell

./bin/drill-embedded

It will start Drill in embedded mode. You will see output like this:

org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
apache drill 1.0.0
"say hello to my little drill"
0: jdbc:drill:zk=local>

For Windows: this will start the shell with Drill in embedded mode.

./bin/sqlline.bat -u "jdbc:drill:schema=dfs;zk=local"

Page 29: Emerging technologies /frameworks in Big Data

Terminology

• Drillbit

– A Drillbit runs on each data node in the cluster. Drill maximizes data locality during query execution: movement of data over the network or between nodes is minimized or eliminated when possible.

Page 30: Emerging technologies /frameworks in Big Data

Drill Configuration

drill.exec: {
  cluster-id: "<cluster_name>",
  zk.connect: "<zkhostname1>:<port>,<zkhostname2>:<port>,<zkhostname3>:<port>"
}

Configuration: $DRILL_HOME/conf/drill-override.conf

Default configuration:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "localhost:2181"
}
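If the ZooKeeper hosts come from a list, the `zk.connect` value is just the `host:port` pairs joined by commas. The helper below is a hypothetical sketch (`render_drill_override` is not part of Drill) that renders the block shown above; it assumes every ZooKeeper node listens on the same port.

```python
def render_drill_override(cluster_id, zk_hosts, zk_port=2181):
    """Render a drill.exec block like the one in drill-override.conf.

    Assumption: all ZooKeeper hosts use the same port (2181 by default).
    """
    zk_connect = ",".join(f"{host}:{zk_port}" for host in zk_hosts)
    return (
        "drill.exec: {\n"
        f'  cluster-id: "{cluster_id}",\n'
        f'  zk.connect: "{zk_connect}"\n'
        "}\n"
    )


# Reproduces the default configuration from the slide.
print(render_drill_override("drillbits1", ["localhost"]))

# Hypothetical three-node ZooKeeper quorum.
print(render_drill_override("mycluster", ["zk1", "zk2", "zk3"]))
```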

Page 31: Emerging technologies /frameworks in Big Data

Starting Drill in Distributed Mode

./bin/drillbit.sh restart

./bin/drillbit.sh [--config <conf-dir>] (start|stop|status|restart|autorestart)

It will restart the Drillbit service.

Tip: Check the hostname the Drillbit is listening on. For example:

2015-09-05 03:21:20,070 [main] INFO o.apache.drill.exec.server.Drillbit - Drillbit environment: host.name=192.168.0.101

This will start the Drill shell on the local machine based on the configuration provided in drill-override.conf.

Start the shell:

./bin/drill-localhost (if the Drillbit is listening on localhost)

otherwise

./bin/sqlline -u "jdbc:drill:drillbit=192.168.0.101"
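The tip above can be automated: pull `host.name` out of the Drillbit log line and pick the matching shell command. A rough sketch; `drillbit_host` and `shell_command` are made-up helper names, and the log format is assumed to match the sample line from the slide.

```python
import re

LOG_LINE = (
    "2015-09-05 03:21:20,070 [main] INFO o.apache.drill.exec.server.Drillbit "
    "- Drillbit environment: host.name=192.168.0.101"
)


def drillbit_host(line):
    """Extract the host.name value from a Drillbit environment log line."""
    match = re.search(r"host\.name=(\S+)", line)
    return match.group(1) if match else None


def shell_command(host):
    """Choose the shell start command as described on the slide."""
    if host == "localhost":
        return "./bin/drill-localhost"
    return f'./bin/sqlline -u "jdbc:drill:drillbit={host}"'


print(shell_command(drillbit_host(LOG_LINE)))
```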

Page 32: Emerging technologies /frameworks in Big Data

Verify it once, and try a sample

0: jdbc:drill:zk=local> select * from sys.drillbits;
+----------------+------------+---------------+------------+----------+
|    hostname    | user_port  | control_port  | data_port  | current  |
+----------------+------------+---------------+------------+----------+
| 192.168.0.101  | 31010      | 31011         | 31012      | true     |
+----------------+------------+---------------+------------+----------+

0: jdbc:drill:zk=local> select count(*) from `dfs`.`$DRILL_HOME/sample-data/nation.parquet`;
+---------+
| EXPR$0  |
+---------+
| 25      |
+---------+
1 row selected (1.752 seconds)

Page 33: Emerging technologies /frameworks in Big Data

Drill – Web Client

A Storage Plugin can be added/Enabled

Page 34: Emerging technologies /frameworks in Big Data

Let’s move to Apache Flink

Page 35: Emerging technologies /frameworks in Big Data

About Apache Flink

• Open source framework for Big Data Analytics

• Distributed Streaming dataflow engine

• Runs computations in memory.

• Executes programs in a data-parallel and pipelined manner.

• Best known for stream data processing.

• Provides high-level APIs in

– Java

– Scala

– Python

• Integrates with Hadoop and its ecosystem and can read existing data from HDFS or HBase.

• https://flink.apache.org

Page 36: Emerging technologies /frameworks in Big Data

So Why Flink?

Credit: Compiled from several articles, blogs and Stack Overflow posts listed on the References page.

• Shares a lot of similarities with relational DBMSs

• Data is serialized into byte buffers and processed largely in binary representation

• This allows fine-grained memory control

• Uses a pipeline-based processing model with a cost-based optimizer to choose the execution strategy.

• Optimized for cyclic or iterative processes by using iterative transformations on collections

• Achieved by optimizing join algorithms, chaining operators and reusing partitioning and sorting.

• Flink streaming processes data streams as true streams, i.e., data elements are immediately "pipelined" through a streaming program as soon as they arrive

• Also has its own memory management system, separate from Java’s garbage collector.

Page 37: Emerging technologies /frameworks in Big Data

Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview

Page 38: Emerging technologies /frameworks in Big Data

Flink vs Spark (they look pretty similar)

Apache Flink:

// text is assumed to be an existing DataSet[String]
import org.apache.flink.api.scala._

case class Word(word: String, frequency: Int)

val counts = text
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word")
  .sum("frequency")

Apache Spark:

// text is assumed to be an existing RDD[String]
val counts = text
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey { case (x, y) => x + y }
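Both snippets compute the same thing: split every line into words, pair each word with 1, and sum the ones per word. For readers without a cluster at hand, the same three steps can be sketched in plain Python; this is not the Flink or Spark API, and `word_count` is a made-up name.

```python
def word_count(lines):
    """Plain-Python equivalent of the flatMap / map / reduce-by-key pipeline."""
    # flatMap: one stream of words from all lines
    words = (word for line in lines for word in line.split(" "))
    # map: pair every word with a count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey / groupBy + sum: add up the counts per word
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts


print(word_count(["to be or", "not to be"]))
```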

Page 39: Emerging technologies /frameworks in Big Data

But…

Apache Spark is a batch processing framework that can approximate stream processing (called micro-batching).

Apache Flink is primarily a stream processing framework that can look like a batch processor.
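The difference can be pictured with two Python generators. This is a conceptual sketch only, not how either engine is implemented: `micro_batched` and `pipelined` are made-up names, and the batches here are cut by element count for simplicity, whereas a micro-batching engine typically cuts them by time interval.

```python
from itertools import islice


def micro_batched(stream, batch_size, process):
    """Collect elements into small batches; results appear once per batch."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield [process(element) for element in batch]


def pipelined(stream, process):
    """Handle each element as soon as it arrives; results appear one by one."""
    for element in stream:
        yield process(element)


events = range(5)
print(list(micro_batched(events, 2, lambda x: x * 10)))  # [[0, 10], [20, 30], [40]]
print(list(pipelined(events, lambda x: x * 10)))         # [0, 10, 20, 30, 40]
```

In the micro-batched version the first result is only available after a whole batch has been collected; in the pipelined version every element flows through on its own.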

Page 40: Emerging technologies /frameworks in Big Data

Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview

Page 41: Emerging technologies /frameworks in Big Data

Credit: http://www.slideshare.net/stephanewen1/apache-flink-overview

Page 42: Emerging technologies /frameworks in Big Data

Flink – Web Client

Arguments to program separated by spaces

Page 43: Emerging technologies /frameworks in Big Data

Flink – Web Client

Page 44: Emerging technologies /frameworks in Big Data

References

• https://flink.apache.org/

• https://www.quora.com/What-are-the-differences-between-Apache-Spark-and-Apache-Flink

• http://stackoverflow.com/questions/28082581/what-is-the-differences-between-apache-spark-and-apache-flink

• http://statrgy.com/2015/06/01/best-data-processing-engine-flink-vs-spark/

• http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning

• http://www.infoworld.com/article/2919602/hadoop/flink-hadoops-new-contender-for-mapreduce-spark.html

• http://www.kdnuggets.com/2015/05/interview-matei-zaharia-creator-apache-spark.html

Page 45: Emerging technologies /frameworks in Big Data

Thanks!

@rahuldausa on Twitter and SlideShare

http://www.linkedin.com/in/rahuldausa