The Hadoop Path A short presentation on where Hadoop is going
By Subash DSouza
Hadoop and Google
Hadoop came out of seminal papers released by Google in the early 2000s, viz. GFS, MapReduce and Bigtable.
To see where Hadoop is moving is to see where Google has gone.
Great keynote talk by M.C. Srivas of MapR next week that addresses this question.
Jonathan Hsieh – Keynote talk at Big Data Camp LA 2014
Where I think Hadoop is moving
Security
Real Time Analytics
Security
Hadoop vendors have become serious about security in the past year
Hortonworks’s acquisition of XA Secure
Cloudera’s acquisition of Gazzang
Kerberos has been the basis for authentication for quite some time, but things like audit control and MDM have been on the horizon.
With these acquisitions, Hadoop vendors have been positioning themselves for a better security play.
Cloudera has Apache Sentry, Hortonworks has Apache Knox.
MapR supports security through authentication and authorization
Real Time Analytics
Real Time Streaming: quickly ingest data as it comes in.
Real Time Reporting: quickly process the ingested data.
Real Time Streaming
Storm
Spark Streaming
Samza
Apache Storm
One of the first streaming tools built.
Very low latency, typically looking at 10-200 ms.
Started by Nathan Marz at BackType, which was acquired by Twitter.
Strong support from Hortonworks.
Lower-level APIs than Spark.
Trident is Storm's micro-batching layer, which closely resembles Spark Streaming.
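Storm's low latency comes from handling each tuple individually as it arrives, rather than buffering records into batches. A minimal sketch of that per-tuple model (plain Python, not the actual Storm spout/bolt API; the word-count bolt is a hypothetical example):

```python
# Hypothetical sketch (not the Storm API): Storm-style per-tuple
# processing, where each event flows through a "bolt" the moment it
# arrives, giving very low per-record latency.

def word_count_bolt(tuples):
    """Process one tuple at a time, emitting a running count per word."""
    counts = {}
    for word in tuples:                 # each tuple handled individually
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])      # emit immediately, no batching

stream = ["hadoop", "spark", "hadoop"]
emitted = list(word_count_bolt(stream))
# each input produced an output as soon as it was seen
```

The trade-off is that per-tuple emission minimizes latency but gives up the amortized throughput of batching, which is exactly the gap Trident's micro-batching fills.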
Spark Streaming
Based on the premise that not all data is required instantaneously.
Uses micro batch method.
Latency is approx. 1 sec.
Streaming has single points of failure (e.g. the driver).
Has scale issues.
Good for machine learning.
Strong support from Databricks, Cloudera, Hortonworks, MapR, Datastax & Pivotal.
Easier to integrate with Spark.
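The micro-batch idea behind Spark Streaming can be sketched as follows: incoming records are bucketed into fixed time windows (roughly the ~1 second latency noted above) and each window is then processed as one small batch. This is a hypothetical plain-Python illustration, not the Spark Streaming API:

```python
# Hypothetical sketch of micro-batching: buffer timestamped records
# into fixed-width windows and hand each window off as one batch.

def micro_batches(records, batch_interval=1.0):
    """Group (timestamp, value) records into fixed-width time batches."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)    # which window it falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[b] for b in sorted(batches)]

records = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
grouped = micro_batches(records)   # three 1-second batches
```

Each batch is then processed with the same engine that handles offline jobs, which is why Spark Streaming integrates so easily with the rest of Spark.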
Apache Samza (Incubator)
Stream processing API built atop Kafka and YARN.
Support from LinkedIn.
Very similar to Storm.
Currently offers only one delivery guarantee (at-least-once), vs. multiple configurable guarantees in Storm.
Real Time Reporting (or near real time)
Hive on Tez (Stinger)
Impala
Drill
Spark
Hawq
Apache Hive on Apache Tez
Tez is a new application framework built atop YARN.
Workflows are compiled to DAGs on Tez.
Runs jobs up to 5 times faster than standard MapReduce.
Supports in-memory jobs for small datasets.
Supported by Hortonworks & MapR.
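Compiling a workflow to a DAG means the stages of a query run in dependency order within one job, instead of as a chain of separate MapReduce jobs that each write intermediate results to HDFS. A toy sketch of that idea (hypothetical, not the Tez API; stage names are made up):

```python
# Hypothetical sketch: run a workflow as a single DAG of stages,
# passing results between stages in memory instead of materializing
# each intermediate result to HDFS as chained MapReduce would.

def run_dag(stages, deps):
    """stages: {name: fn(list_of_inputs) -> result}; deps: {name: [upstream names]}."""
    results = {}
    while len(results) < len(stages):
        for name, fn in stages.items():
            if name in results:
                continue
            upstream = deps.get(name, [])
            if all(d in results for d in upstream):   # ready to run?
                results[name] = fn([results[d] for d in upstream])
    return results

results = run_dag(
    {"scan":   lambda _:   [3, 1, 2],                     # read source data
     "filter": lambda ins: [x for x in ins[0] if x > 1],  # drop small values
     "agg":    lambda ins: sum(ins[0])},                  # final aggregate
    {"filter": ["scan"], "agg": ["filter"]},
)
```

Fusing the pipeline this way is where much of the claimed speedup over standard MapReduce comes from.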
Cloudera Impala
Massively parallel processing (MPP) architecture for performance, with Hadoop scalability.
Perform interactive analysis on any data stored in HDFS and HBase.
Built with native Hadoop security: integrated with Kerberos for authentication and Apache Sentry for fine-grained, role-based authorization.
ANSI-92 SQL support.
Supports common Hadoop file formats: text, SequenceFiles, Avro, RCFile, LZO and Parquet.
Supported by Cloudera & MapR.
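The MPP pattern behind engines like Impala: every node scans its own data partition and computes a partial aggregate in parallel, and a coordinator merges the partials into the final answer. A minimal illustration under those assumptions (plain Python threads standing in for cluster nodes; not Impala code):

```python
# Hypothetical sketch of MPP aggregation: partial aggregates computed
# in parallel per partition, then merged by a coordinator.

from concurrent.futures import ThreadPoolExecutor

def partial_count(partition):
    """Each 'node' counts matching rows in its local partition."""
    return sum(1 for row in partition if row["status"] == "ok")

partitions = [                         # data spread across 3 nodes
    [{"status": "ok"}, {"status": "err"}],
    [{"status": "ok"}],
    [{"status": "ok"}, {"status": "ok"}],
]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_count, partitions))

total = sum(partials)                  # coordinator merges the partials
```

Because each node only touches its own partition, the same plan scales by adding nodes, which is the "MPP performance with Hadoop scalability" pitch above.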
Apache Drill (Incubator)
Drill is a clustered MPP (massively parallel processing) query engine for Hadoop that can process petabytes of data quickly.
Useful for short, interactive ad-hoc queries on large-scale data sets.
Capable of querying nested data in formats like JSON and Parquet and performing dynamic schema discovery.
Does not require a centralized metadata repository.
Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables.
Supported by MapR.
Apache Spark
Consists of multiple projects: Spark Streaming, Spark SQL, MLlib and GraphX.
Runs atop YARN, Mesos & EC2.
Uses the concept of RDDs (Resilient Distributed Datasets), where data is immutable during transforms.
Enables in-memory processing when needed.
Supported by Databricks, Cloudera, MapR, Hortonworks, Datastax & Pivotal.
Strong support not just from the Hadoop community but also from data science: Mahout is moving to Spark, as is Cloudera's Oryx.
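The RDD immutability mentioned above can be sketched in a few lines: transformations like map and filter never mutate the input dataset, they return a new one, which is what lets Spark replay the lineage of transforms to rebuild a lost partition. A hypothetical `MiniRDD` toy, not the real Spark API:

```python
# Hypothetical sketch of the RDD idea: every transform returns a new
# immutable dataset, leaving the original untouched.

class MiniRDD:
    def __init__(self, data):
        self._data = tuple(data)        # immutable storage

    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)          # new dataset

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))   # new dataset

    def collect(self):
        return list(self._data)

nums = MiniRDD([1, 2, 3, 4])
evens_doubled = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
# nums itself is unchanged by the chain of transforms
```

Keeping `nums` intact after the transforms is the key property: a failed computation can simply be rerun from the original data.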
Pivotal HAWQ
Part of the Pivotal platform.
Full SQL syntax support.
Interoperability with Hive and HBase through the Pivotal Xtension Framework (PXF).
Interoperability with Pivotal’s GemFire XD, their in-memory real-time database backed by HDFS.
Proprietary to the Pivotal platform.
What to use where?
Depends on the use case.
Use the right tool for the job.
Sometimes several tools for the same job, especially in the Hadoop ecosystem.
Use whatever is easiest and most scalable for the enterprise in such scenarios.