Introduction To Hadoop Ecosystem
InSemble Inc. http://www.insemble.com
Agenda
1. What is Big Data?
2. Relevance to your Enterprise
3. Hadoop Ecosystem
4. Use Cases & Java Developer fit
5. Demo
Big Data Definitions
• Wikipedia defines it as “data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time”
• Gartner defines it as data with the following characteristics
  – High Velocity
  – High Variety
  – High Volume
• Another definition is “Big Data is large-volume, unstructured data which cannot be handled by traditional database management systems”
Why a game changer
• Schema on Read
  – Interpreting data at processing time
  – Keys and values are not intrinsic properties of the data but are chosen by the person analyzing the data
• Move code to data
  – With traditional systems, we bring data to code and I/O becomes a bottleneck
  – With distributed systems, we have to deal with our own checkpointing/recovery
• More data beats better algorithms
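The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration: the records, field positions and the helper names `read_rows` and `totals_by` are made up for the example and are not part of Hadoop. The raw lines are stored untouched, and the analyst decides at read time which field acts as the key.

```python
import csv
import io

# Hypothetical raw records, stored exactly as they arrived (no schema applied on write).
RAW = """2014-03-01,store-7,apples,3
2014-03-01,store-7,pears,5
2014-03-02,store-9,apples,2
"""

def read_rows(raw):
    """Yield each stored line as a plain list of strings."""
    yield from csv.reader(io.StringIO(raw))

def totals_by(raw, key_index, value_index=3):
    """Apply a schema at read time: the analyst picks which field is the key."""
    totals = {}
    for row in read_rows(raw):
        key = row[key_index]
        totals[key] = totals.get(key, 0) + int(row[value_index])
    return totals

by_product = totals_by(RAW, key_index=2)  # key = product
by_store = totals_by(RAW, key_index=1)    # key = store
print(by_product)
print(by_store)
```

The same stored bytes answer two different questions, which is why the access pattern does not have to be known when the data is written.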
Enterprise Relevance
• Missed Opportunities
  – Channels
  – Data that is analyzed
• Constraint was high cost
  – Storage
  – Processing
• Future-proof your business
  – Schema on Read
  – Access pattern not as relevant
  – Not just future-proofing your architecture
Motivation and History
• Disk access speeds have not caught up with storage capacities
• Need a high speed parallel processing platform to process large datasets on a distributed file-sharing framework
• Google published the MapReduce architecture in 2004
• MapReduce framework
  – Split the query, distribute it and process in parallel (Map step)
  – Gather the results and deliver them (Reduce step)
• Apache open source project called Hadoop implemented the MapReduce framework
  – “Software library that gives users ability to process large datasets across cluster of commodity hardware in a reliable, fault-tolerant manner using a simple programming model”
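The split, process-in-parallel and gather steps can be sketched on a single machine. This is a toy under stated assumptions: a local thread pool stands in for the cluster nodes, and `split`, `map_step` and `reduce_step` are hypothetical helper names, not Hadoop APIs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split(lines, parts):
    """Cut the input into roughly equal chunks, one per worker."""
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_step(chunk):
    """Count words within one chunk, independently of the other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def reduce_step(partials):
    """Gather the partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

LINES = ["big data big clusters", "data beats algorithms", "big data"]

# Threads are an assumption standing in for worker nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_step, split(LINES, 3)))

word_counts = reduce_step(partial_counts)
print(word_counts["big"], word_counts["data"])
```

Each chunk is processed without knowledge of the others, so adding workers only changes how the input is split, not the map or reduce logic.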
Hadoop Ecosystem
Source: Apache Hadoop Documentation
HDFS Architecture
Source: Hadoop: The Definitive Guide by Tom White
MapReduce framework
Hadoop 2 with YARN
Source: Hadoop In Practice by Alex Holmes
Map Reduce
• Restrictive programming model
  – Keys, values
  – Map and reduce functions, with the only coordination being the passing of keys and values
• But still considered a general data-processing tool
  – Google used it for production search indexes
  – Image analysis
  – Machine learning algorithms
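To show how far the key/value model stretches, here is a minimal sketch of a tiny search index built with nothing but a map function, a reduce function and a grouping step between them. The driver `run_job` and the sample documents are hypothetical; a real Hadoop job would be written against the Hadoop API and run across many nodes.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_fn(word, doc_ids):
    """Collapse all values for one key into a sorted posting list."""
    return word, sorted(doc_ids)

def run_job(records, mapper, reducer):
    """Toy driver: map, group values by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

DOCS = [
    ("d1", "hadoop stores data"),
    ("d2", "hive queries data"),
    ("d3", "hadoop runs hive"),
]
index = run_job(DOCS, map_fn, reduce_fn)
print(index["hadoop"], index["data"])
```

The map and reduce functions never talk to each other directly; the only thing that flows between them is the stream of keys and values.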
PIG
• High level scripting language
• Data Flow Language
  – Good for describing data analysis problems as data flows
  – Can plug in UDFs written in other languages such as Java, Scala, JRuby
  – Other languages can execute PIG scripts
  – Predominant use cases are
    • Production ETL jobs
    • Data exploration by analysts
• Higher Level Abstraction
  – Map Reduce
  – Tez
Hive
• Framework for data warehouse on top of Hadoop
  – SQL access on HDFS
  – Queries for analysis
• Batch Oriented
  – Impala
  – Tez
HBase
• NoSQL database on Hadoop
  – Based on Google’s BigTable
  – Column oriented database on HDFS
• Regular Interactive/Update use cases
  – Real time read/write random access
  – Row updates are atomic
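The row-keyed, column-oriented access pattern can be pictured with a small in-memory stand-in. This is not the HBase client API: `ToyRowStore`, its methods and the `family:qualifier` column names are assumptions made for illustration, and a lock is used here only to mimic the "row updates are atomic" guarantee.

```python
import threading

class ToyRowStore:
    """In-memory stand-in for a table addressed by row key and column name."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, row_key, cells):
        """Apply all cell changes for one row together, so readers never see half an update."""
        with self._lock:
            row = dict(self._rows.get(row_key, {}))
            row.update(cells)
            self._rows[row_key] = row

    def get(self, row_key, column=None):
        """Random-access read of a whole row, or of a single column in it."""
        with self._lock:
            row = self._rows.get(row_key, {})
            return dict(row) if column is None else row.get(column)

store = ToyRowStore()
store.put("user42", {"info:name": "Ada", "info:city": "Chicago"})
store.put("user42", {"info:city": "Boston"})  # update one column in place
print(store.get("user42"))
```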
SQOOP
• Import/Export data between an RDBMS and Hadoop
  – HDFS, Hive, HBase
  – CouchBase
  – Uses JDBC driver to get the data types of the columns
  – Serialization/Deserialization
• Actual load done internally by MapReduce jobs
Apache Flume
Source: Apache Flume Documentation
Real time streaming with Kafka & Storm
• Kafka
  – Pub/Sub messaging using topics
  – Kafka producers publish to topics
• Storm
  – Real time computational engine
  – Consumes data from spouts and passes data to bolts
  – Can run on top of YARN
  – Uses ZooKeeper
  – Implemented in Clojure
  – You define workflows as Directed Acyclic Graphs
  – True stream processing engine, so used for low latency ingestion
  – Can support at-most-once, at-least-once and exactly-once semantics
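The difference between at-most-once and at-least-once delivery comes down to whether a consumer records its position before or after it processes a message. The sketch below simulates one crash between those two steps. It uses neither the Kafka nor the Storm API; `run_consumer`, the list-based log and the crash model are assumptions made for illustration.

```python
def run_consumer(log, commit_first, crash_at):
    """Consume `log`, failing once at offset `crash_at`, then resume from the committed offset.

    commit_first=True  -> offset saved before processing (at most once)
    commit_first=False -> offset saved after processing  (at least once)
    """
    committed = 0    # durable offset that survives the crash
    delivered = []   # messages whose processing ran
    crashed = False
    while committed < len(log):
        offset = committed
        steps = ["commit", "process"] if commit_first else ["process", "commit"]
        for i, step in enumerate(steps):
            if step == "commit":
                committed = offset + 1
            else:
                delivered.append(log[offset])
            # Fail once, between the two steps, then restart from the committed offset.
            if i == 0 and offset == crash_at and not crashed:
                crashed = True
                break
    return delivered

LOG = ["a", "b", "c"]
print(run_consumer(LOG, commit_first=True, crash_at=1))   # message "b" is lost
print(run_consumer(LOG, commit_first=False, crash_at=1))  # message "b" is processed twice
```

If processing is idempotent, the duplicate in the at-least-once case has no visible effect, which is one common route to an exactly-once outcome.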
Apache Spark
• High speed general purpose engine for large-scale data processing
• Does not need Hadoop, just needs a shared file system such as S3, NFS or HDFS
• Spark can run on YARN
• Spark is implemented in Scala
• Has a Streaming API, but is a true batch processing engine that micro-batches
• Can only support exactly-once, but under some failure conditions degrades to at-least-once
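Micro-batching means chopping a continuous stream into small batches and running an ordinary batch computation on each one. The sketch below is plain Python, not the Spark API, and it batches by event count for simplicity, whereas a streaming engine would typically batch by time interval; `micro_batches` and `process_batch` are hypothetical names.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an iterator of events into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    """An ordinary batch computation, run once per micro-batch."""
    return sum(batch)

events = range(1, 8)  # stand-in for a live stream of readings
batch_sums = [process_batch(b) for b in micro_batches(events, 3)]
print(batch_sums)
```

Latency is bounded below by the batch size, which is the trade-off against an engine that handles each event as it arrives.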
Common Use Cases
• Queries from Detail Record Data
• Queries from longer duration data
• Diagnostic/Metrics/Web Logs Data Analysis
• 360 degree view incorporating clickstream data
• Unable to generate report within the needed timeframe
• Capture and analyze sensor data
• Analyze large volume of image data
• Build User profiles from large volumes of data
• Sentiment Analysis
• Recommendation Engines
• Risk Analysis
Securing Hadoop Data
Source: http://www.voltage.com
Closing
• Technology in hyper growth phase
• Complex
• Tools/Productivity/Monitoring products evolving
• Pilot Project
• Incremental Journey
Demo - Start HDP cluster in AWS
• Total 6 EC2 machines, type t2.medium
• RHEL 6.5, 3.75G Memory, 10G hard drive
• 1 Ambari server + 5-node cluster
• 1 NameNode + 1 Secondary NameNode + 3 DataNodes
• Public data set from https://data.cityofchicago.org
Managing Hadoop Cluster using Ambari
• The word “ambari” comes from Indian languages, where it means a seat mounted on top of an elephant
• Ambari is an Apache open source project that is used to
  – Provision Hadoop clusters
  – Manage Hadoop clusters
  – Monitor Hadoop clusters
• Agent-based deployment model
Demo — Hue
• Hue provides a web interface for analyzing data in Hadoop
• Use HCatalog to create table
• Demo Hive Script
• Demo Pig Script
Demo — Advanced Hive
• Use built-in UDF to extract latitude and longitude info
• Use custom UDF (Scala) to calculate distance between two locations
• Join tables between library and school data and find libraries within 1 mile for each school
• Use Tableau to connect to Hive through ODBC driver to plot socioeconomic data
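The distance calculation and the "within 1 mile" join can be sketched outside Hive. The deck does not say which formula the custom UDF uses, so the haversine great-circle formula is assumed here; the function names and the coordinates are made up for illustration and are not rows from the Chicago data set.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def libraries_near(schools, libraries, max_miles=1.0):
    """Cross join schools with libraries and keep the pairs within max_miles."""
    return {
        school: [
            lib
            for lib, (llat, llon) in libraries.items()
            if haversine_miles(slat, slon, llat, llon) <= max_miles
        ]
        for school, (slat, slon) in schools.items()
    }

# Made-up coordinates for illustration only.
SCHOOLS = {"School A": (41.8800, -87.6300)}
LIBRARIES = {"Library X": (41.8850, -87.6300), "Library Y": (41.9500, -87.6300)}
print(libraries_near(SCHOOLS, LIBRARIES))
```

In Hive the same shape would be a join between the two tables with the distance function in the filter condition.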