1
Bringing Hadoop to a production-ready state
Eddie Garcia - Security Architect, Office of the CTO
2
Agenda
©2014 Cloudera, Inc. All rights reserved.
• The Future of Data Management
• Identifying your first Hadoop project
• Common Hadoop project challenges
• Ensuring success from POC to Production
3
The Future of Data Management
The Enterprise Data Hub
4
Expanding Data Requires A New Approach
1980s: Bring Data to Compute
Process-centric businesses use:
• Structured data mainly
• Internal data only
• “Important” data only
Now: Bring Compute to Data
Information-centric businesses use all data: multi-structured, internal & external data of all types
[Diagram: relative size & complexity of Data versus Compute in each era]
5
The Old Way: Bringing Data to Compute
Complex Architecture
• Many special-purpose systems
• Moving data around
• No complete views
Missing Data
• Leaving data behind
• Risk and compliance
• High cost of storage
Time to Data
• Up-front modeling
• Transforms slow
• Transforms lose data
Cost of Analytics
• Existing systems strained
• No agility
• “BI backlog”
[Diagram: SERVERS, MARTS, EDWS, DOCUMENTS, STORAGE, SEARCH, ARCHIVE fed by ERP, CRM, RDBMS, MACHINES; FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS; EXTERNAL DATA SOURCES]
6
The New Way: Bringing Compute to Data
1. Active Compliance Archive
• Full-fidelity original data
• Indefinite time, any source
• Lowest-cost storage
2. Persistent Staging
• One source of data for all analytics
• Persist state of transformed data
• Significantly faster & cheaper
3. Self-Service Exploratory BI
• Simple search + BI tools
• “Schema on read” agility
• Reduce BI user backlog requests
4. Diverse Analytic Platform
• Bring applications to data
• Combine different workloads on common data (e.g. SQL + Search)
• True analytic agility
[Diagram: SERVERS, MARTS, EDWS, DOCUMENTS, STORAGE, SEARCH, ARCHIVE fed by ERP, CRM, RDBMS, MACHINES; FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS; EXTERNAL DATA SOURCES]
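The “schema on read” idea on this slide can be illustrated with a minimal Python sketch. It is not Cloudera's implementation: the sample events, the `read_with_schema` helper, and the field names are all hypothetical, chosen only to show raw data being stored untouched while each reader applies its own schema at query time.

```python
import json

# Raw events are kept exactly as they arrived (full fidelity);
# no schema is enforced when they are written. Sample data is hypothetical.
RAW_EVENTS = [
    '{"user": "alice", "action": "login", "ms": "120"}',
    '{"user": "bob", "action": "search", "query": "hadoop"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) at read time.

    Fields missing from a record come back as None. Fields not named in
    the schema are skipped by this reader but stay in the raw data.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }

# Two hypothetical consumers project different schemas onto the same raw data.
latency_view = list(read_with_schema(RAW_EVENTS, {"user": str, "ms": int}))
search_view = list(read_with_schema(RAW_EVENTS, {"user": str, "query": str}))
```

Because nothing is modeled up front, a new question only needs a new schema dictionary, not a reload or transformation of the stored data.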
7
Identifying your first Hadoop Use Case
8
Usage of Hadoop – current
What are you currently using your Hadoop infrastructure for?
• ETL processing: 71.4%
• Data warehouse off-loading: 57.1%
• Data archival: 42.9%
• Advanced Analytics: 64.3%
• Log management: 14.3%
• Search application: 42.9%
*Results are based on an internal Cloudera survey of a sample of CDH install-base users
9
Usage of Hadoop – next 6-12 months
How do you anticipate using your Hadoop infrastructure in the next 6-12 months?
• ETL processing: 64.3%
• Data warehouse off-loading: 85.7%
• Data archival: 57.1%
• Advanced Analytics: 85.7%
• Log management: 50.0%
• Search application: 50.0%
10
Common Hadoop project challenges
11
Percentage of time spent to Production-Ready
• Installation & configuration: 13.73
• Training / up-skilling people to be familiar with the technology: 13.42
• User onboarding: 9.82
• Application development: 18.27
• Data migration: 16.09
• Workload migration: 9.00
• Integrating: 9.60
• Setting up security and reviewing with infosec: 9.67
• Testing: 11.00
• Other: 3.33
12
Blockers to Production-Ready
• Unable to identify the right business use case: 6.38
• Lack of engagement / sponsorship by the business: 13.33
• Deployment challenges (i.e. installation & configuration): 14.38
• Security/Compliance challenges: 19.55
• Data Management (Discovery, Lineage, Lifecycle mgmt): 22.08
• Lack of integration with key ISV tools: 19.29
• Lack of SQL features: 9.44
• Lack of IT skill sets: 14.55
• Lack of ‘Data Science’ skill sets: 12.00
• Other: 14.00
13
Top challenges – Security/Compliance
Please rate on a scale of 1 (NA), 2 (least challenging) to 6 (most challenging) the top issues w.r.t. Hadoop security/compliance:
• Kerberos setup & mgmt
• Audit/Access mgmt
• Lack of adequate governance
• Lack of unified security model for Hadoop
• PCI compliance
• Data encryption, masking
[Bar chart: rating per issue, axis from 0.00 to 6.00]
14
Cloudera Enterprise Security
Perimeter: Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation
Data: Protecting data in the cluster from unauthorized visibility
Technical Concepts: Encryption, Tokenization, Data masking
Access: Defining what users and applications can do with data
Technical Concepts: Permissions, Authorization
Visibility: Reporting on where data came from and how it’s being used
Technical Concepts: Auditing, Lineage
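Two of the Data-layer concepts named above, data masking and tokenization, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Cloudera Enterprise implements them: `mask`, `tokenize`, and `SECRET_KEY` are hypothetical names, and the tokenization shown is a keyed hash (HMAC-SHA256), a simplification that is deterministic but not reversible, unlike vault-based tokenization.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real deployment would fetch this
# from a key management system rather than hard-coding it.
SECRET_KEY = b"example-key-not-for-production"

def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Data masking: hide all but the last `visible` characters.

    Values no longer than `visible` are returned unchanged, so callers
    should pick `visible` with the shortest expected value in mind.
    """
    if visible <= 0:
        return fill * len(value)
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]

def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed-hash tokenization: same input and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

masked_card = mask("4111111111111111")
card_token = tokenize("4111111111111111")
```

Masking keeps a value recognizable to a human (for example, the last four digits), while the token lets datasets be joined on a sensitive field without exposing it.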
15
16
Top challenges – Hadoop Operations
Please rate on a scale of 1 (NA), 2 (least challenging) to 6 (most challenging) the top issues:
• Hardware procurement
• Deployment on virtualized infrastructure, cloud environments
• Configuration management
• Diagnostics and Troubleshooting
• Platform stability
• SLA management
• Capacity Planning
• Chargeback Modeling
• Integration with other IT Mgmt tools
[Bar chart: rating per issue, axis from 0.00 to 5.00]
17
Top challenges – Organization issues
Please rate on a scale of 1 (NA), 2 (least challenging) to 6 (most challenging):
• Lack of understanding of Hadoop (and its benefits)
• Unable to demonstrate quick wins
• Lack of concrete use cases
• Lack of ROI for business sponsors
• Lack of skills - Operational
• Lack of skills - Analytics
• Too entrenched in existing technologies
• Platform stability
[Bar chart: rating per issue, axis from 0.00 to 6.00]
18
Top challenges – Data Warehouse off-load
Please rate on a scale of 1 (NA), 2 (least challenging) to 6 (most challenging) the top issues:
• Lack of SQL performance
• Lack of data types support (e.g. Decimal, varchar etc.) in Impala/Hive
• Lack of SQL compliance to ANSI standards
• Lack of robust ISV tools (for BI, ETL etc.)
• Unable to identify workloads that can be off-loaded from a DW
• Lack of Resource/Workload management capabilities
• Lack of nested/complex structures (struct, map, array)
• Lack of non-ANSI SQL language extensions from their EDW
• Platform stability
[Bar chart: rating per issue, axis from 0.00 to 4.50]
19
Top challenges – Hadoop for ETL
Please rate on a scale of 1 (NA), 2 (least challenging) to 6 (most challenging) the top issues:
• Lack of capabilities in existing tools (Pig, Hive, Crunch, Morphlines etc.) to easily build ETL pipelines
• Lack of skills to leverage Pig, MR etc. to build ETL pipelines
• Lack of robust ISV tools; current tools from vendors like Informatica, Pentaho still immature
• Unable to leverage/migrate existing ETL pipelines to Hadoop easily
• Platform stability
[Bar chart: rating per issue, axis from 0.00 to 4.50]
20
Top challenges – Hadoop for Advanced Analytics
Please rate on a scale of 1 (NA), 2 (least challenging) to 6 (most challenging) the top issues:
• Lack of robust ISV tools for advanced analytics (including Machine Learning) in Hadoop; existing tools are immature or relatively new (offerings from SAS, Revolution etc.)
• Lack of certified Partner tools from Cloudera
• Lack of Data analysis skills
• Lack of robust workload/resource management capabilities
• Platform stability
• Lack of certified Partners
[Bar chart: rating per issue, axis from 0.00 to 4.50]
21
Hadoop POC to Production Challenges
• Invested heavily in learning and ramp-up
• Rewriting logic from scratch to optimize rather than re-using existing routines (involves significant time/resources)
• Took more than 6 months to get Kerberos to work with my existing identity infrastructure
• I need to track the metadata for data flowing in/out of Hadoop
• Configuring multi-tenancy/workload management
22
Hadoop POC to Production Challenges
• Time and resources researching Analytics, BI, ETL tools available for Hadoop
• Challenges making data in Hadoop clusters available to end business users
• Existing skill sets more tuned towards Data Warehouse
• Some resistance to adoption of newer technologies like Hadoop (from existing IT folks)
23
Ensuring success from POC to Production
Big Data Center of Excellence
24
Manage Disruption Throughout the Enterprise
Optimize the Value of an Enterprise Data Hub as Hadoop Adoption Increases
SAVINGS: Increase resource utilization over time and exploit valuable expertise
STANDARDS: Syndicate best practices and design to assure efficient integration
SPEED: Drive adoption via a solutions engine, not autonomous projects
SCALE: Systematically learn, develop, deploy, operate, publish, and repeat
STRATEGY: Consolidate knowledge and resources to bolster infrastructure
25
Optimize Your Hadoop Knowledge Flow
Big Data Is All About Continuous Improvement and Operational Efficiency
[Cycle diagram: Learn, Develop, Deploy, Operate, Publish, REPEAT]
26
What Is a Center of Excellence?
Centrally Incubate, Integrate, and Scale Across Lines of Business
Knowledge Hub: Best Practices, Expertise
Solutions Engine: Infrastructure, Clusters
Dedicated Team: Architects, Administrators, Developers, Data Scientists
Big Data Solutions as a Service
27
Why Build a Center of Excellence?
Silos Obstruct Scale, Prevent Flexibility, and Drive Up Costs
• Novice Skill Level
• Redundant Infrastructure
• Incompatible Systems
• Disparate Data
[Diagram labels: ???, $$$]
28
Project-‐Level Benefits of COE
29
Big Data as a Service Drives Success
Coordinate Workloads to Distribute Advantages Widely and Seamlessly
• Insights & Best Practices
• Hadoop Experts
• Broadly Accessible Big Data
• Technology & Equipment
30
Building Your COE
31
Big Data COE Program Roles
Staff Centrally and Train to Scale
Management & Leadership
Business & Data
Technology & Ops
• Lead Data Scientist
• Lead Business Analyst
• Developers
• Big Data Visionary
• Executive Sponsor
• Program Manager
• Admin
• Data Warehouse Specialist
• Architects
• LOB Rep
• LOB Rep
• LOB Rep
• Data Wranglers
32
A Five-Phase Development Plan
Work with Hadoop Experts from Planning to Support
Staffing
• Personnel
• Training
• Mentoring
Incubation
• Scope
• Build
• Certify
Operations
• Deploy
• Integrate
• Optimize
Expansion
• Use Cases
• Real-Time
• Document
Continuity
• Cost Model
• Support
• Sustain
33
Thank You! Eddie Garcia [email protected]