The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast on April 8, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=cfa1bffdd62dc6677fa225bdffe4a0b9
The innovation curve often arcs slowly before picking up speed. Companies that harness a major transformation early in the game can make serious headway before challengers enter the picture. The world of Hadoop features several of these upstarts, each of which uses the open-source foundation as an engine to drive vastly greater performance to a wide range of services, and even create new ones.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop engine is being used to architect a new generation of enterprise applications. He’ll be briefed by George Corugedo, RedPoint Global CTO and Co-founder, who will showcase how enterprises can cost-effectively take advantage of the scalability, processing power and lower costs that Hadoop 2.0/YARN applications offer by eliminating the long-term expense of hiring MapReduce programmers. Visit InsideAnalysis.com for more information.
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Game Changed: How Hadoop is Reinventing Enterprise Thinking
Twitter Tag: #briefr
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Big Data
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
@robinbloor
RedPoint Global
• RedPoint Global is a data management and integrated marketing technology company
• Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration.
• RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.
Guest: George Corugedo
George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
RedPoint Overview for Bloor Group
11 RedPoint Global Inc. 8 April 2014 © Confidential
Overview - What is Hadoop/Hadoop 2.0
Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd-party applications constrained by the need to generate code
Lower cost scaling
No need for structure
Ease of data capture
Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency
Overview – Challenges to Adoption
Skills Gap
• Severe shortage of MR skilled resources
• Very expensive resources and hard to retain
• Inconsistent skills lead to inconsistent results
• Underutilizes existing resources
• Prevents broad leverage of investments across enterprise
Maturity & Governance
• A nascent technology ecosystem around Hadoop
• Emerging technologies only address narrow slivers of functionality
• New applications are not enterprise class
• Legacy applications have built short-term capabilities
Data Into Information
• Data is not useful in its raw state; it must be turned into information
• Benefit of Hadoop is that same data can be used from many perspectives
• Analysts must now do the structuring of the data based on intended use of the data
How RedPoint Achieves this
First YARN-compliant ETL/data quality toolset on the market – brings together both Big Data and traditional data to create Big Information!
RANKED #1 by [ranking source not captured] in:
• Customer or Party Data
• Processing Speed
• Match Quality
• Ease of Use
The power to make your data the biggest asset your organization has
Key features of RedPoint Data Management
ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs
Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis
Integration & Matching
• Grouping
• Fuzzy match
Master Key Management
• Create keys
• Track changes
• Maintain matches over time
Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats
Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
All functions can be used on both TRADITIONAL and BIG DATA
Creates clean, integrated, actionable data – quickly, reliably and at low cost
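The deck lists fuzzy matching as a feature but does not describe how it works. As a rough illustration of the general idea only, the Java sketch below treats two strings as a match when their Levenshtein edit distance stays within a threshold. The class name, method names, and threshold are hypothetical, and nothing here reflects RedPoint's actual matching logic.

```java
// Hypothetical sketch of threshold-based fuzzy matching; not RedPoint code.
class FuzzyMatchSketch {

    // Levenshtein distance: minimum number of single-character inserts,
    // deletes, and substitutions needed to turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Two values match if, after trimming and lower-casing, they are
    // at most maxEdits edits apart (maxEdits is an assumed tuning knob).
    static boolean isFuzzyMatch(String a, String b, int maxEdits) {
        String left = a.trim().toLowerCase();
        String right = b.trim().toLowerCase();
        return editDistance(left, right) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(isFuzzyMatch("Jon Smith", "John Smith", 1));
        System.out.println(isFuzzyMatch("Jon Smith", "Jane Doe", 1));
    }
}
```

Production matching tools typically combine several techniques (phonetic keys, token reordering, weighted fields); edit distance is used here only because it is short and easy to verify.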
Spotlight on RedPoint Data Management for Hadoop
PREVIOUS OPTIONS
Use MapReduce
x complex
x requires new skills
x inefficient execution
Move data out of Hadoop
x extra time and effort
x extra storage (expensive)
x defeats the purpose of Hadoop
WITH REDPOINT
the only pure YARN data management platform
For data management in Hadoop:
• Easy-to-use interface
• Leverages existing skills
• Executes in Hadoop 2.0 (using YARN architecture)
• Fast – no MapReduce
• Can combine Big Data with traditional data
• Data becomes actionable by RedPoint Interaction
Makes Hadoop data management easy, fast, low-cost. Makes Big Data clean, integrated, usable.
You get more out of your Big Data investment.
Data Management on Hadoop
[Diagram. Labels: Partitioning (AM / Tasks); Execution (AM / Tasks); Data I/O; Key / Split; Analysis; Parallel Section; Partition; Data server; YARN; HDFS/MapReduce]
RedPoint DM for Hadoop: Processing Flow
[Diagram, steps numbered 1 to 3. Labels: Data Management Designer; DM Execution Server; Parallel Section; Resource Manager; Launches DM App Master; DM App Master; Launches Tasks; Node Manager (three); DM Task (five); Running DM Task]
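In YARN, the Resource Manager starts an application master, which then has its tasks launched in containers on the node managers; the labels in the flow diagram above follow that pattern. The sketch below is a hypothetical, in-process Java model of such a parallel section: a thread pool stands in for the node managers, and one method plays the application master's part by partitioning the records and collecting per-task results. It uses no Hadoop or RedPoint API, and every name in it is invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical in-process model of a parallel section; not Hadoop or RedPoint code.
class ParallelSectionSketch {

    // Split the records into contiguous partitions, at most one per task.
    static <T> List<List<T>> partition(List<T> records, int tasks) {
        List<List<T>> parts = new ArrayList<>();
        int size = (records.size() + tasks - 1) / tasks; // ceiling division
        for (int start = 0; start < records.size(); start += size) {
            int end = Math.min(start + size, records.size());
            parts.add(new ArrayList<>(records.subList(start, end)));
        }
        return parts;
    }

    // Plays the application master's role: launch one task per partition
    // on the pool (the stand-in for node managers) and combine the results.
    // The per-task work here is a toy: summing the record lengths.
    static int runParallelSection(List<String> records, int tasks) {
        ExecutorService nodes = Executors.newFixedThreadPool(tasks);
        try {
            List<Future<Integer>> running = new ArrayList<>();
            for (List<String> part : partition(records, tasks)) {
                Callable<Integer> task = () -> part.stream().mapToInt(String::length).sum();
                running.add(nodes.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : running) {
                total += result.get(); // wait for each task to finish
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel section failed", e);
        } finally {
            nodes.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "bb", "ccc", "dddd");
        System.out.println(runParallelSection(records, 3));
    }
}
```

A real YARN application negotiates containers with the Resource Manager and survives node failures; the model above only shows the partition, fan-out, and collect shape of the work.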
The Data Management designer
DM Parallel Section on Hadoop
DM Hadoop Settings
Benchmarks – Project Gutenberg
MapReduce
• >150 lines of MR code
• 6 hours of development
• 3 hours runtime
• Extensive optimization needed
Pig
• ~50 lines of script code
• 3 hours of development
• 15 minutes runtime
• User Defined Functions required prior to running script
RedPoint
• 0 lines of code
• 15 min. of development
• 3 minutes runtime
• No tuning or optimization required
Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Nested inside the job's outer class. WordOffset is a key type from the
    // full code and is not shown on the slide.
    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
        private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (token, 1) for every token in the input line.
        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
Sample Pig script without the UDF:
    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';
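For readers who want to see what the benchmark job computes, the steps the Pig script expresses (tokenize, upper-case, group, count, order by count descending) can be sketched on a single machine in plain Java. This is an illustrative sketch only, not the benchmark code: the class and method names are hypothetical, the delimiter set is a shortened assumption, and it makes no claim about the line counts or runtimes quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine word count; not RedPoint, Pig, or Hadoop code.
class WordCountSketch {

    // Shortened delimiter set, assumed for illustration.
    static final String DELIMITERS = " \t\n\r',./<>?;:\"[]{}-=_+()&*%^#$!@`~\\|";

    // Tokenize each line, upper-case each token, and count occurrences per word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken().toUpperCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order by occurrences descending; ties are broken alphabetically
    // (the tie-break is an addition so the output order is deterministic).
    static List<Map.Entry<String, Integer>> byOccurrencesDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat, the hat", "The end.");
        for (Map.Entry<String, Integer> e : byOccurrencesDesc(countWords(lines))) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

The single-machine version fits in memory only for small inputs, which is the limitation that a cluster framework is meant to remove.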
Who Should Care
• Companies interested in exploring the promise of Big Data Analytics that need an easy way to get started.
• Companies already investing heavily in Big Data Analytics technologies that are stuck due to the shortage of skilled resources.
• Large organizations that are focused on “Operational Offloading” and need to achieve it cost-effectively.
• Companies that recognize that much of the data that lands in Hadoop is external to the organization and need to have Data Quality and proper data governance applied to their Hadoop data.
Why RedPoint
• Directly overcomes the Hadoop skills gap
• Reduced TCO because existing resources can be leveraged
• Increased productivity and consistency of solutions
• Only pure YARN Data Quality application on the market
• Delivers enterprise grade data quality and governance into the Hadoop cluster
Perceptions & Questions
Analyst: Robin Bloor
Where Is That Elephant Going?
Robin Bloor, Ph.D.
The Key-Value Store is Back!
• General purpose key-value stores used to be called ISAM files
• They were available on Mainframes (VSAM) and DEC VAX (RMS) and other minicomputers
• But not on Unix or Windows or Linux
• Well now they’re back, and they’re scalable
WHAT DID WE LIKE ABOUT THEM?
The Open Source Landscape
• Hadoop + components
  - The data reservoir
  - The archive store
  - The analytics sandbox
• Machine Learning Algorithms
  - Raw power
• The R Language
  - Over 1 million users
These are COMPONENTS of a solution
A Process Not an Activity
• Data Analytics is a multi-disciplinary end-to-end process
• Until recently it was a walled garden, but the walls were torn down by…
  - Data availability
  - Scalable technology
  - Open source tools
• Hadoop has a role here
The Evolution of Hadoop
• There were many components before YARN and Tez
• But YARN and Tez have changed the picture
• MapReduce is now an option
• Most likely Hadoop will become the default scale-out file system and the OS for data flow
The Hadoop Ecosystem
• Even though it may not seem so, Hadoop is in its infancy
• Hadoop’s popularity guarantees its future
• Its future is also guaranteed by its commercial ecosystem
• That’s the Open Source Way
• Do you see Hadoop as a replacement for the data warehouse?
• Which specific components of the Hadoop ecosystem do you always (or nearly always) employ?
• Which other technologies/products do you integrate with?
• How does a RedPoint engagement normally pan out?
• What do you see as the natural business applications for Hadoop (and its ecosystem)?
• Do you think there are any natural industry-specific (i.e., vertical) applications?
• Which companies/technologies do you see as competitive with RedPoint?
Upcoming Topics
www.insideanalysis.com
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
THANK YOU for your ATTENTION!