The Hadoop Path A short presentation on where Hadoop is going
By Subash DSouza
Hadoop and Google
Hadoop came out of seminal papers released by Google in the early 2000s, viz. GFS, MapReduce and Bigtable.
To see where Hadoop is moving is to see where Google has gone.
Great keynote talk by M.C. Srivas of MapR next week that addresses this question.
Jonathan Hsieh – Keynote talk at Big Data Camp LA 2014
Where I think Hadoop is moving
Security
Real Time Analytics
Security
Hadoop vendors have become serious about security in the past year
Hortonworks’s acquisition of XA Secure
Cloudera’s acquisition of Gazzang
Kerberos has been the basis for authentication for quite some time, but things like audit control and MDM have been on the horizon.
With these acquisitions, Hadoop vendors have been positioning themselves for a better security play.
Cloudera has Apache Sentry, Hortonworks has Apache Knox.
MapR supports security through authentication and authorization
Real Time Analytics
Real Time Streaming: quickly ingest data as it comes in.
Real Time Reporting: quickly process the ingested data.
Real Time Streaming
Storm
Spark Streaming
Samza
Apache Storm
One of the first streaming tools built.
Very low latency, typically looking at 10-200 ms.
Started by Nathan Marz at BackType, which was acquired by Twitter.
Strong support from Hortonworks.
Lower-level APIs than Spark.
Trident is Storm's micro-batching layer, which closely resembles Spark Streaming.
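Storm's low latency comes from handling each tuple individually as it arrives, rather than buffering records into batches. A minimal sketch of that per-tuple model (plain Python, not the actual Storm spout/bolt API; the word-count bolt is a hypothetical example):

```python
# Hypothetical sketch (not the Storm API): Storm-style per-tuple
# processing, where each event flows through a "bolt" the moment it
# arrives, giving very low per-record latency.

def word_count_bolt(tuples):
    """Process one tuple at a time, emitting a running count per word."""
    counts = {}
    for word in tuples:                 # each tuple handled individually
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])      # emit immediately, no batching

stream = ["hadoop", "spark", "hadoop"]
emitted = list(word_count_bolt(stream))
# each input produced an output as soon as it was seen
```

The trade-off is that per-tuple emission minimizes latency but gives up the amortized throughput of batching, which is exactly the gap Trident's micro-batching fills.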
Spark Streaming
Based on the premise that not all data is required instantaneously.
Uses micro batch method.
Latency is approx. 1 sec.
Streaming has single points of failure (e.g. the driver).
Has scale issues.
Good for machine learning.
Strong support from Databricks, Cloudera, Hortonworks, MapR, Datastax & Pivotal.
Easier to integrate with Spark.
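The micro-batch idea behind Spark Streaming can be sketched as follows: incoming records are bucketed into fixed time windows (roughly the ~1 second latency noted above) and each window is then processed as one small batch. This is a hypothetical plain-Python illustration, not the Spark Streaming API:

```python
# Hypothetical sketch of micro-batching: buffer timestamped records
# into fixed-width windows and hand each window off as one batch.

def micro_batches(records, batch_interval=1.0):
    """Group (timestamp, value) records into fixed-width time batches."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)    # which window it falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[b] for b in sorted(batches)]

records = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
grouped = micro_batches(records)   # three 1-second batches
```

Each batch is then processed with the same engine that handles offline jobs, which is why Spark Streaming integrates so easily with the rest of Spark.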
Apache Samza (Incubator)
Stream processing API built atop Kafka and YARN.
Support from LinkedIn.
Very similar to Storm.
Currently offers only one delivery guarantee (at-least-once), vs. multiple configurable guarantees in Storm.
Real Time Reporting (or near real time)
Hive on Tez (Stinger)
Impala
Drill
Spark
Hawq
Apache Hive on Apache Tez
Tez is a new application framework built atop YARN.
Workflows are compiled to DAGs on Tez.
Runs jobs up to 5 times faster than standard MapReduce.
Supports in-memory jobs for small datasets.
Supported by Hortonworks & MapR.
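Compiling a workflow to a DAG means the stages of a query run in dependency order within one job, instead of as a chain of separate MapReduce jobs that each write intermediate results to HDFS. A toy sketch of that idea (hypothetical, not the Tez API; stage names are made up):

```python
# Hypothetical sketch: run a workflow as a single DAG of stages,
# passing results between stages in memory instead of materializing
# each intermediate result to HDFS as chained MapReduce would.

def run_dag(stages, deps):
    """stages: {name: fn(list_of_inputs) -> result}; deps: {name: [upstream names]}."""
    results = {}
    while len(results) < len(stages):
        for name, fn in stages.items():
            if name in results:
                continue
            upstream = deps.get(name, [])
            if all(d in results for d in upstream):   # ready to run?
                results[name] = fn([results[d] for d in upstream])
    return results

results = run_dag(
    {"scan":   lambda _:   [3, 1, 2],                     # read source data
     "filter": lambda ins: [x for x in ins[0] if x > 1],  # drop small values
     "agg":    lambda ins: sum(ins[0])},                  # final aggregate
    {"filter": ["scan"], "agg": ["filter"]},
)
```

Fusing the pipeline this way is where much of the claimed speedup over standard MapReduce comes from.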
Cloudera Impala
Massively parallel processing (MPP) architecture for performance, with Hadoop scalability.
Perform interactive analysis on any data stored in HDFS and HBase.
Built with native Hadoop security: integrated with Kerberos for authentication and Apache Sentry for fine-grained, role-based authorization.
ANSI-92 SQL support.
Supports common Hadoop file formats: text, SequenceFiles, Avro, RCFile, LZO and Parquet.
Supported by Cloudera & MapR.
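The MPP pattern behind engines like Impala: every node scans its own data partition and computes a partial aggregate in parallel, and a coordinator merges the partials into the final answer. A minimal illustration under those assumptions (plain Python threads standing in for cluster nodes; not Impala code):

```python
# Hypothetical sketch of MPP aggregation: partial aggregates computed
# in parallel per partition, then merged by a coordinator.

from concurrent.futures import ThreadPoolExecutor

def partial_count(partition):
    """Each 'node' counts matching rows in its local partition."""
    return sum(1 for row in partition if row["status"] == "ok")

partitions = [                         # data spread across 3 nodes
    [{"status": "ok"}, {"status": "err"}],
    [{"status": "ok"}],
    [{"status": "ok"}, {"status": "ok"}],
]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_count, partitions))

total = sum(partials)                  # coordinator merges the partials
```

Because each node only touches its own partition, the same plan scales by adding nodes, which is the "MPP performance with Hadoop scalability" pitch above.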
Apache Drill (Incubator)
Drill is a clustered MPP (massively parallel processing) query engine for Hadoop that can process petabytes of data quickly.
Useful for short, interactive ad-hoc queries on large-scale data sets.
Capable of querying nested data in formats like JSON and Parquet and performing dynamic schema discovery.
Does not require a centralized metadata repository.
Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables.
Supported by MapR.
Apache Spark
Consists of multiple projects: Spark Streaming, Spark SQL, MLlib and GraphX.
Runs atop YARN, Mesos & EC2.
Uses the concept of RDDs (Resilient Distributed Datasets), where data is immutable during transforms.
Enables in-memory processing when needed.
Supported by Databricks, Cloudera, MapR, Hortonworks, Datastax & Pivotal.
Strong support not just from the Hadoop community but also from data science: Mahout is moving to Spark, as is Cloudera's Oryx.
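The RDD immutability mentioned above can be sketched in a few lines: transformations like map and filter never mutate the input dataset, they return a new one, which is what lets Spark replay the lineage of transforms to rebuild a lost partition. A hypothetical `MiniRDD` toy, not the real Spark API:

```python
# Hypothetical sketch of the RDD idea: every transform returns a new
# immutable dataset, leaving the original untouched.

class MiniRDD:
    def __init__(self, data):
        self._data = tuple(data)        # immutable storage

    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)          # new dataset

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))   # new dataset

    def collect(self):
        return list(self._data)

nums = MiniRDD([1, 2, 3, 4])
evens_doubled = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
# nums itself is unchanged by the chain of transforms
```

Keeping `nums` intact after the transforms is the key property: a failed computation can simply be rerun from the original data.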
Pivotal HAWQ
Part of the Pivotal platform.
Full SQL syntax support.
Interoperability with Hive and HBase through the Pivotal Xtension Framework (PXF).
Interoperability with Pivotal’s GemFire XD, their in-memory real-time database backed by HDFS.
Proprietary to the Pivotal platform.
What to use where?
Depends on the use case.
Use the right tool for the job.
Sometimes several tools for the same job, especially in the Hadoop ecosystem.
Use whatever is easiest and most scalable for the enterprise in such scenarios.