DESCRIPTION
COMPEGENCE: Nagaraj Kulkarni - Hadoop and No SQL. Presented in TDWI (The Data Warehousing Institute) India, Hyderabad (July 2011)
Hadoop and No SQL Slide 1 2011 Jul
Hadoop and No SQL
Co-Creating COMPetitive intelliGENCE through
Process, Data and Domain driven Information Excellence
Nagaraj Kulkarni
Presented at TDWI India, Hyderabad (July 2011)
Hadoop and No SQL Slide 2 2011 Jul
Market Actions
Systemic Changes
Business Landscape
Business Intent
Business Usage
Cost, Effort, Skills & Competency
Sustainable, Scalable, Flexible
Timely, Usable, Actionable
Process, Systems, Data, Domain, Information
Process, Data and Domain Integrated Approach
Decision Excellence
Competitive Advantage lies in the exploitation of:
– More detailed and specific information
– More comprehensive external data & dependencies
– Fuller integration
– More in-depth analysis
– More insightful plans and strategies
– More rapid response to business events
– More precise and apt response to customer events
Ram Charan's book: "What The CEO Wants You To Know: How Your Company Really Works"
Information Excellence Foundation
Hadoop and No SQL Slide 3 2011 Jul
Context For Big Data Challenges
Database Systems – Pre-Hadoop Strengths and Limitations
What is Scale, Why No SQL
Think Hadoop, Hadoop Ecosystem
Think Map Reduce
Nail Down Map Reduce
Think GRID (Distributed Architecture)
Deployment Options
What Map Reduce is Not, and Map Reduce Usages
Nail Down HDFS and GRID Architecture
Touching Upon
Hadoop and No SQL Slide 4 2011 Jul
Big Data Context
Hadoop and No SQL Slide 5 2011 Jul
Systemic Changes
Connected Globe
Customer Centric
Boundarylessness
Best Sourcing
Interlinked Culture
Demand Side Focus
Bottom Up Innovation
Empowered employees
Agility and Response Time
Leading Trends
Responsiveness
Speed, Agility, Flexibility
Hadoop and No SQL Slide 6 2011 Jul
Landscape To Address
Data Explosion
Information Overload
Manageability
Scalability
Performance
Agility
Decision Making
Time to Action
Interlinked Processes & Systems
Boundaryless
Systemic Understanding
Collaborate and Synergize
Simplify and Scale
Hadoop and No SQL Slide 7 2011 Jul
Information Overload
A wealth of information creates a poverty of attention.
Herbert Simon, Nobel Laureate Economist
Hadoop and No SQL Slide 8 2011 Jul
BACKUP
More Touch points, More Channels
Source: JupiterResearch (7/08). © 2008 JupiterResearch, LLC
Hadoop and No SQL Slide 9 2011 Jul
Scale – What is it?
Hadoop and No SQL Slide 10 2011 Jul
Traditional Systems – How They Achieve Scalability
Multi Threading
Multiple CPU – Parallel Processing
Distributed Programming – SMP & MPP
ETL Load Distribution – Assigning jobs to different nodes
Improved Throughput
How do we scale
Hadoop and No SQL Slide 11 2011 Jul
Facebook: 500 Million Active Users per Month, 500 Billion+ Page Views per month, 25 Billion+ Content items per month, 15 TB New Data/day, 1200 machines, 21 PB Cluster
Yahoo: 82 PB of Data, 25,000+ nodes
Twitter: 1 TB+/day, 80+ nodes
eBay: 90 Million Active Users, 10 Billion Requests per day, 220 million+ items on sale, 40 TB+/day, 40 PB of Data
Overall: 1.73 Billion Internet Users, 247 Billion emails per day, 126 Million Blogs, 5 Billion Facebook Content items per week, 50 Million Tweets per day
80% of this data is unstructured
Estimated 800 GB of data per user (a million Petabytes!)
Scale – What is it about?
Hadoop and No SQL Slide 12 2011 Jul
Thinking of Scale - Need for Grid
[Diagram: web-server logs (100 Mbps per pipe) from nodes n1–nn behind load balancers in data centers dc1, dc2, … flow into log storage & processing, then over a Data Highway into Datamarts.]
Think Numbers
1000 Nodes/DC, 10 DCs, 1 KB web-server log record, 1 record/second/node
In one day: 1000 * 10 * 1 KB * 60 * 60 * 24 = 864 GB
Storage for a year: 864 GB * 365 ≈ 315 TB
To store 1 PB – 40K * 1000 = Millions of $
To process 1 TB = 1000 minutes ~ 17 hrs
Think Agility and Flexibility
How do we scale – Think Numbers
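The slide's back-of-the-envelope arithmetic can be checked directly (a sketch using decimal units, 1 KB = 1000 bytes, with the slide's assumed node counts and record size):

```python
# Sanity check of the slide's numbers: 1000 nodes/DC, 10 DCs,
# one 1 KB web-server log record per node per second.
nodes = 1000 * 10                # total nodes across data centers
record_bytes = 1000              # 1 KB record (decimal units)
seconds_per_day = 60 * 60 * 24

bytes_per_day = nodes * record_bytes * seconds_per_day
gb_per_day = bytes_per_day / 1e9        # 864.0 GB/day, as on the slide
tb_per_year = gb_per_day * 365 / 1000   # ~315 TB/year

print(gb_per_day, tb_per_year)
```

At one second per row, a single machine scanning 1 TB of such records would indeed need on the order of a day, which is the motivation for spreading both storage and processing across a grid.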
Hadoop and No SQL Slide 13 2011 Jul
Does it scale linearly with data size and analysis complexity?
Volume, Speed, Integration level, more…
Scale – What is it about?
Hadoop and No SQL Slide 14 2011 Jul
We would have no issues…
If the following assumptions held good (the classic Eight Fallacies of Distributed Computing):
The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.
Hadoop and No SQL Slide 15 2011 Jul
Think Hadoop
Hadoop and No SQL Slide 16 2011 Jul
New Paradigm: Go Back to Basics
Joel Spolsky: http://www.joelonsoftware.com/items/2006/08/01.html
Divide and Conquer (Divide and Delegate and Get Done)
Move Work or Workers ?
Relax Constraints (Pre defined data models)
Expect and Plan for Failures (avoid and address failures)
Community backup
Assembly Line Processing
(Scale, Speed, Efficiency, Commodity Worker)
The “For loop”
Parallelization (trivially parallelizable)
Infrastructure and Supervision (Grid Architecture)
Manage Dependencies
Ignore the Trivia (Trivia is relative!)
Charlie Munger’s Mental Models
Hadoop and No SQL Slide 17 2011 Jul
New Paradigm: Go Back to Basics
Map Reduce Paradigm:
Divide and Conquer, The "for loop", Sort and Shuffle, Parallelization (trivially parallelizable), Relax Data Constraints, Assembly Line Processing (Scale, Speed, Efficiency, Commodity Worker)
Grid Architecture:
Split and Delegate, Move Work or Workers, Expect and Plan for Failures, Assembly Line Processing (Scale, Speed, Efficiency, Commodity Worker), Manage Dependencies and Failures, Ignore the Trivia (Trivia is relative!)
Map Reduce History: Lisp, Unix, Google FS
Replication, Redundancy, Heart Beat Check, Cluster Rebalancing, Fault Tolerance, Task Restart, Chaining of Jobs (Dependencies), Graceful Restart, Look-Ahead or Speculative Execution
Hadoop and No SQL Slide 18 2011 Jul
HBase / Cassandra: for huge data volumes – PBs. HBase fits in well where Hadoop is already being used; Cassandra is less cumbersome to install/manage.
MongoDB / CouchDB: document-oriented databases for easy use at GB–TB volumes. Might be problematic at PB scale.
Neo4j-like graph databases: for managing relationship-oriented applications – nodes and edges.
Riak, Redis, Membase: simple key-value databases – huge distributed in-memory hash maps.
No SQL Options
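The "huge distributed in-memory hash map" idea behind the key-value stores above can be pictured as spreading keys over node-local dictionaries by hash. The sketch below is a toy illustration of that partitioning idea only, not any product's actual algorithm (real systems add consistent hashing and replication):

```python
import hashlib

class PartitionedKV:
    """Toy key-value store: keys are spread over N node-local dicts by hash."""

    def __init__(self, n_nodes):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # Hash the key to pick a node deterministically.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

store = PartitionedKV(4)
for i in range(100):
    store.put(f"user:{i}", i)
```

Because every client hashes the same way, any client can locate a key without a central index, which is what lets these stores scale out horizontally.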
Hadoop and No SQL Slide 19 2011 Jul
Let us Think Hadoop
Hadoop and No SQL Slide 20 2011 Jul
RDBMS and Hadoop
            RDBMS                   MapReduce
Data size   Gigabytes               Petabytes
Access      Interactive and batch   Batch
Structure   Fixed schema            No fixed schema
Language    SQL                     Procedural (Java, C++, Ruby, etc.)
Integrity   High                    Low
Scaling     Nonlinear               Linear
Updates     Read and write          Write once, read many times
Latency     Low                     High
Hadoop and No SQL Slide 21 2011 Jul
Apache Hadoop Ecosystem
Hadoop Common: The common utilities that support the other Hadoop subprojects.
HDFS: A distributed file system that provides high-throughput access to application data.
MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Pig: A high-level data-flow language and execution framework for parallel computation.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
ZooKeeper: A high-performance coordination service for distributed applications.
Flume: A distributed service for collecting and moving large volumes of log data (Scribe is a similar system).
Mahout: Scalable machine learning algorithms using Hadoop.
Chukwa: A data collection system for managing large distributed systems.
Hadoop and No SQL Slide 22 2011 Jul
[Stack diagram:]
BI Reporting, ETL Tools
Pig (Data Flow), Hive (SQL)
MapReduce (Job Scheduling/Execution System)
HBase (Key-Value store)
HDFS (Hadoop Distributed File System)
Cross-cutting: Avro (Serialization), ZooKeeper (Coordination), Sqoop (to/from RDBMS)
Apache Hadoop Ecosystem
Hadoop and No SQL Slide 23 2011 Jul
HDFS – The BackBone
Hadoop Distributed File System
Hadoop and No SQL Slide 24 2011 Jul
Map Reduce – The New Paradigm
Transforming Large Data
Mappers
Reducers
MapReduce Basics
•Functional Programming
•List Processing
•Mapping Lists
Hadoop and No SQL Slide 25 2011 Jul
Pig Latin Programs
-> Query Parser -> Logical Plan
-> Semantic Checking -> Logical Plan
-> Logical Optimizer -> Optimized Logical Plan
-> Logical to Physical Translator -> Physical Plan
-> Physical to M/R Translator -> MapReduce Plan
-> Map Reduce Launcher: creates a job jar to be submitted to the Hadoop cluster
PIG – Help the Business User Query
Pig: Data-aggregation functions over semi-structured data (log files).
Hadoop and No SQL Slide 26 2011 Jul
PIG Latin Example
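The example on the original slide is an image. As a stand-in, the kind of Pig Latin data-aggregation described above (load a log file, group by a field, count per group) can be sketched in plain Python; the log lines here are hypothetical, not the slide's actual example:

```python
from collections import defaultdict

# Hypothetical web-server log lines: "user url"
log_lines = [
    "alice /home",
    "bob /home",
    "alice /cart",
    "carol /search",
]

# Pig-like pipeline: LOAD ... AS (user, url); GROUP BY url; FOREACH ... COUNT
records = [line.split() for line in log_lines]   # LOAD
groups = defaultdict(list)
for user, url in records:                        # GROUP records BY url
    groups[url].append(user)
hits = {url: len(users) for url, users in groups.items()}  # COUNT per group
```

In Pig Latin the same dataflow is a few declarative lines (LOAD, GROUP, FOREACH/COUNT), and the runtime compiles it into the MapReduce plan shown on the previous slide.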
Hadoop and No SQL Slide 27 2011 Jul
• Scalable, Reliable, Distributed DB
• Columnar Structure
• Built on top of HDFS
• Map Reduceable
• Not a SQL Database!
– No joins
– No sophisticated query engine
– No transactions
– No column typing
– No SQL, no ODBC/JDBC, etc.
• Not a replacement for your RDBMS...
HBASE – Scalable Columnar
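HBase's columnar data model can be pictured as nested sorted maps: row key -> column family -> column -> timestamped versions of a value. The toy sketch below illustrates that shape only; it is not the HBase API:

```python
class ToyColumnStore:
    """Nested-map picture of an HBase-style table:
    row key -> column family -> column -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row, family, column, value, ts):
        # Walk/create the nested maps, then store one timestamped version.
        cells = (self.rows.setdefault(row, {})
                          .setdefault(family, {})
                          .setdefault(column, {}))
        cells[ts] = value

    def get(self, row, family, column):
        # Return the most recent version of the cell, or None if absent.
        cells = self.rows.get(row, {}).get(family, {}).get(column)
        if not cells:
            return None
        return cells[max(cells)]

t = ToyColumnStore()
t.put("row1", "info", "name", "alice", ts=1)
t.put("row1", "info", "name", "alicia", ts=2)   # newer version wins on read
```

Keeping versions per cell and sorting by row key is what makes the model a good fit on top of HDFS's write-once, read-many files.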
Hadoop and No SQL Slide 28 2011 Jul
• A high-level interface on Hadoop for managing and querying structured data
• Queries are interpreted as Map-Reduce jobs for execution
• Uses HDFS for storage
• Uses a metadata representation over HDFS files
• Key Building Principles:
– Familiarity with SQL
– Performance with the help of built-in optimizers
– Extensibility – Types, Functions, Formats, Scripts
HIVE – SQL Like
Hadoop and No SQL Slide 29 2011 Jul
• Distributed Data / Log Collection Service
• Scalable, Configurable, Extensible
• Centrally Manageable
• Agents fetch data from apps, Collectors save it
• Abstractions: Source -> Decorator(s) -> Sink
FLUME – Distributed Data Collection
Hadoop and No SQL Slide 30 2011 Jul
An Oozie Workflow
[Diagram: an example workflow DAG – start, SSH and HOD Alloc actions, an M/R streaming job, a decision node, a fork/join around a Pig job and an M/R job, a Java Main and an FS job. OK transitions chain the actions toward end; a MORE/ENOUGH decision either loops back or finishes; any ERROR transition goes to kill.]
Oozie – Workflow Management
Hadoop and No SQL Slide 31 2011 Jul
Think Map n Reduce
Hadoop and No SQL Slide 32 2011 Jul
Logical Architecture
Understanding Map Reduce Paradigm
Hadoop and No SQL Slide 33 2011 Jul
Understanding Map Reduce Paradigm
Hadoop and No SQL Slide 34 2011 Jul
Job: Configure the Hadoop Job to run.
Mapper: map(LongWritable key, Text value, Context context)
Reducer: reduce(Text key, Iterable<IntWritable> values, Context context)
Map Reduce Paradigm
Hadoop and No SQL Slide 35 2011 Jul
MapReduce is a
functional programming model and an
associated implementation model
for processing and generating large data sets.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs,
and
a reduce function that merges all intermediate values associated with the same intermediate key.
Many real world tasks are expressible in this model.
Map Reduce Definition
CONCEPTS
Programming model
Hadoop and No SQL Slide 36 2011 Jul
Programming model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other languages
Hadoop and No SQL Slide 37 2011 Jul
Word Count Example
A simple MapReduce program can be written to determine how many times different words appear in a set of files.
What does Mapper and Reducer do?
Pseudo Code:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)
Map Reduce Paradigm
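The pseudocode above runs as-is once translated to plain Python, with the framework's sort-and-shuffle step simulated by grouping intermediate pairs by key (a single-process sketch, not the Hadoop API):

```python
from collections import defaultdict

def mapper(filename, contents):
    # Emit (word, 1) for every word in the file.
    return [(word, 1) for word in contents.split()]

def reducer(word, values):
    # Sum all counts emitted for one word.
    return (word, sum(values))

def map_reduce(files):
    # Map phase: run the mapper over every input file.
    intermediate = []
    for name, contents in files.items():
        intermediate.extend(mapper(name, contents))
    # Shuffle: group intermediate values by key (Hadoop does this for you).
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(w, vs) for w, vs in grouped.items())

counts = map_reduce({"a.txt": "the cat sat", "b.txt": "the cat ran"})
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

Because each mapper call and each reducer call is independent, the framework is free to run them on different nodes; only the shuffle moves data between them.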
Hadoop and No SQL Slide 38 2011 Jul
Programming model
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Pseudocode: see the appendix of the MapReduce paper for real code
Hadoop and No SQL Slide 39 2011 Jul
• Master-Slave architecture
• Master: JobTracker
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to TaskTrackers (slaves)
– Monitors task and TaskTracker status, re-executes tasks upon failure
• Workers: TaskTrackers
– Run Map and Reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
Map – Reduce Execution Recap
Understanding Map Reduce Paradigm
Hadoop and No SQL Slide 40 2011 Jul
Example map functions – Individual Count, Filter, Transformation, Sort, Pig Load
Example reduce functions – Group Count, Sum, Aggregator
A job can have many map and reduce functions.
Map – Reduce Paradigm Recap
Understanding Map Reduce Paradigm
Hadoop and No SQL Slide 41 2011 Jul
How are we doing on the Objective
Hadoop and No SQL Slide 42 2011 Jul
Process, Data and Domain driven Information Excellence
ABOUT COMPEGENCE
Hadoop and No SQL Slide 43 2011 Jul
(Repeats the "Process, Data and Domain Integrated Approach" diagram from Slide 2.)
We complement your “COMPETING WITH ANALYTICS JOURNEY”
Hadoop and No SQL Slide 44 2011 Jul
Value Proposition
[Diagram: end-to-end DW/BI value chain.
Source Data (Assets, Liabilities, Investment, Cards; Reference Data – Branch, Products; Partner Data; CRM / Marketing Programs; Customer Data)
-> Extract -> Staging -> Transform (Translate, Segment, Derive, Summarize, Profile, Integrate) -> Load -> Applications (Analysis, Reports, Dashboards, Excel Interface, Business Rules),
underpinned by Data Quality and Process Audit and a Metadata Layer for Consistent Business Understanding – a Trusted Data Foundation with DW Platform.
Considerations: Constraints, Alternatives, Assumptions, Dependencies; Concerns / Risks; Cost of Ownership; Technology Evolution; Repeatable, Reusable, Leverage, Trade-offs; Ease of Use – Drill Down, Up, Across; Tools, Technologies, Trends, Platforms; People, Processes, Partners; Cost, Time, TeraBytes.
From Reports and Dashboards to Decisions? Actions? Results? Returns?]
Jump Start the “Process and Information Excellence” journey
Focus on your business goals and “Competing with Analytics Journey”
Overcome multiple and diverse expertise / skill-set paucity
Preserve current investments in people and technology
Manage Data complexities and the resultant challenges
Manage Scalability to address data explosion with Terabytes of Data
Helps you focus on the business and business processes
Helps you harvest the benefits of your data investments faster
Consultative work-through workshops that help mature your team
[Diagram: COMPEGENCE helps move from the Current State – People, Processes, Data – to Decisions, Actions, Results, Returns.]
Hadoop and No SQL Slide 45 2011 Jul
Our Expertise and Focus Areas
Process + Data + Domain => Decision
Analytics; Data Mining; Big Data; DWH & BI
Architecture and Methodology
Partnered Product Development
Consulting, Competency Building, Advisory, Mentoring
Executive Briefing Sessions and Deep Dive Workshops
Hadoop and No SQL Slide 46 2011 Jul
Process, Data and Domain driven Information Excellence
Process, Data and Domain driven Business Decision Life Cycle
Partners in Co-Creating Success
www.compegence.com
info@compegence.com