Founding a Hadoop Lab · 2017-05-05

Page 1:

Founding a Hadoop Lab

EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION

© UTILIS TECHNOLOGY LIMITED 2017

Page 2:

Andre [email protected]

A Short Introduction to Your Speaker

My Adventures in Hadoop

◦ Led Hadoop adoption at three Canadian banks

◦ Established a successful Hadoop COE

◦ Advisory roles on Hadoop in finance

My Career in Finance

◦ Four banks, one stock exchange, one pension fund

◦ Capital markets, retail banking, enterprise risk roles

◦ Founder of two IT departments

◦ Technology leader in Risk Systems for 15 years:

◦ Architect, Enterprise Risk Systems

◦ Architect, Front Office Risk Systems

◦ Program Manager, Portfolio Management Systems

◦ Head of Risk Systems

◦ Head of Hadoop COE

Page 3:

Agenda

What role will your Hadoop Lab play?

◦ Defining objectives, building a team and forming partnerships

◦ Foundational work to set a path to success

What is a reasonable budget?

◦ Calculating your “room” based on industry benchmarks

◦ Capacity planning, charge-out, and the central capital account

Real-life Lessons Learned

◦ Setting up infrastructure to take advantage of Hadoop’s unique properties

◦ Creating a practice that fits your users’ work styles

Projects that Succeed

◦ Ideas for a quick win to keep everyone motivated

◦ Medium risk projects aligned to current business problems

Page 4:

What role will your Hadoop Lab play?

“YOU CAN’T SHRINK YOUR WAY TO GREATNESS”

– Tom Peters

Page 5:

What role will your Hadoop Lab play?

Will your organization’s Hadoop Lab be a control function, or a thought leader?

Control functions

◦ Operational controls, compliance and auditing

◦ Budgeting

◦ Architecture gating

◦ Data governance

Thought leadership

◦ Design patterns and solution architecture

◦ Demonstration projects and proofs-of-concept

◦ Filling the talent pool through training, workshops and user groups

◦ Educating on best practices and success stories to motivate adoption

Page 6:

Foundational Work

Invest in user-friendly operational management

◦ Design a simple multi-tenancy plan based on group membership

◦ Include shares of execution queues, directory structures and cascading permissions

◦ Set up self-serve user on-boarding through your organization’s Help Desk

◦ Implement single sign-on for Kerberos-secured clusters
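A multi-tenancy plan like the one above can be expressed as YARN Capacity Scheduler settings. The sketch below is a hypothetical illustration: the queue names and percentage shares are invented, and a real plan would also cover directory structures and permissions.

```python
# Hypothetical sketch: turn a group-based tenancy plan into YARN
# Capacity Scheduler properties. Queue names and shares are invented.

def capacity_scheduler_props(shares):
    """Map {queue: percent_capacity} to capacity-scheduler.xml
    property names. Shares must total 100."""
    if sum(shares.values()) != 100:
        raise ValueError("queue capacities must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(shares)}
    for queue, pct in shares.items():
        props[f"yarn.scheduler.capacity.root.{queue}.capacity"] = str(pct)
        # Cap elastic growth so one tenant cannot starve the others.
        props[f"yarn.scheduler.capacity.root.{queue}.maximum-capacity"] = str(min(pct * 2, 100))
    return props

# Example plan keyed by business group (invented figures):
props = capacity_scheduler_props({"risk": 40, "retail": 35, "datascience": 25})
```

Each entry corresponds to one property in capacity-scheduler.xml; the 2x maximum-capacity cap is a design choice for this sketch, not a Hadoop default.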

Manage expectations by monitoring performance

◦ Set service level objectives for both interactive and application uses

◦ Use “show back” reporting to monitor performance against objectives

Implement access control governance as a basic service

◦ Generate access control matrix audits centrally for all grid users

◦ Reporting from Ranger’s database works well and is easy to build

◦ Set policy and prepare reports for periodic attestation/user account reviews
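The access-control-matrix report described above amounts to flattening policies into a user-by-resource grid. The sketch below is illustrative: the policy shape is a simplified stand-in for what Ranger’s database or REST API actually returns (real policies also carry groups, deny conditions and resource hierarchies).

```python
# Illustrative sketch: flatten simplified access policies into a
# (user, resource) -> permissions matrix for attestation reviews.
# The input shape is an assumption, not Ranger's real schema.

def access_matrix(policies):
    """Return {(user, resource): set of permissions}."""
    matrix = {}
    for policy in policies:
        for item in policy["policyItems"]:
            for user in item["users"]:
                key = (user, policy["resource"])
                matrix.setdefault(key, set()).update(item["accesses"])
    return matrix

sample = [
    {"resource": "/data/retail", "policyItems": [
        {"users": ["asmith"], "accesses": ["read", "write"]},
        {"users": ["bjones"], "accesses": ["read"]},
    ]},
]
matrix = access_matrix(sample)
```

A periodic attestation report is then just this matrix grouped by resource owner.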

Page 7:

Maximizing Exposure to Change

Hadoop is an exceptionally fast-moving technology, and so needs a different approach

◦ Maximize your ability to deploy changes in the Hadoop platform

◦ Invest in continuous integration and automated regression testing for your development teams

◦ Establish a better-than-quarterly release cycle

◦ Publish a checklist of acceptable open source licenses (or blacklist of prohibited ones)

◦ Encourage use of Hadoop as an application container

◦ Set up lab environments

Discourage practices that prevent your organization from keeping pace

◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications

◦ Avoid proprietary add-ons – they don’t get as much collaboration in the open source community

◦ Prohibit equipment “carve outs” from your shared grid

◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly

Page 8:

Building a Team

Data Engineers are the key to the successful adoption of a data lake

◦ Data engineers are a hybrid of an intermediate developer and a junior data scientist

◦ Good data engineering accelerates both data science itself and the deployment of data science to production

Other roles to consider

◦ A few versatile senior developers to give you the ability to execute POCs

◦ Data Librarian to manage the metadata catalogue and documentation

◦ Data Steward to manage the data governance process

Keep a few consultants on speed dial

◦ Hadoop security experts – preferably from an audit-capable firm

◦ Compliance and fair usage experts – particularly for external data from the web and social media

Fund the Hadoop and Linux administrators, but leave them in the infrastructure team

◦ They need the administrative access that those teams are granted

Page 9:

Your New Best Friends

Give all of your stakeholders a chance to participate by forming a working group

◦ Exposure to business stakeholders is particularly valuable for technology teams

Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid

◦ It is worth solving the accounting problems to get their expertise

Co-opt your existing data hub’s team to operate your new Data Lake’s processes

◦ BCBS-239 projects have provided an excellent opportunity to do this

Adopting a secondary SQL-on-Hadoop solution helps to transfer skills as well as code

◦ IBM DB2 is available for Hadoop – a great way to move a bank’s data warehouse over to the Lab

◦ Other ANSI-compliant solutions include HAWQ, Vertica and PolyBase

Page 10:

What is a reasonable budget?

“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.”

– Warren Buffett

Page 11:

Understanding the Customers

Before setting a budget, decide who you’re going to charge for your Hadoop Lab

◦ Data producers will see Hadoop as a cost-reduction opportunity

◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain – offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds

◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs

◦ Data consumers will see Hadoop as improving their capabilities

◦ The traditional data supply chain is very long: the source system feeds an EDW, which feeds a data mart accessed by data scientists

◦ Asking for “one more field” requires the source to send it, the EDW to model and document it, and the data mart to provision it before a data scientist can finally consume it

◦ Giving data scientists access to the raw data makes them more efficient – even though less effort goes into providing the data!

Align the funding model to the benefits realized by the participants:

◦ One-time costs to on-board new data should come from the producer of the data

◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services

Page 12:

Setting a Budget for a Hadoop Lab

Annual cost of Hadoop is widely quoted as US$1,000/TB

◦ This compares favorably to US$5K/TB for a SAN, and US$12K/TB for a traditional database

◦ Costs are based on “balanced” reference configurations – compute-heavy configurations cost more, storage-heavy ones less

Use this well-known industry benchmark to set your budget

◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year

◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years

◦ Premises costs for data centre include bare racks, power and network backbone

◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement included

◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim

◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year

◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
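The budget arithmetic above can be checked in a few lines; all figures come from the slide, and the 2 PB example grid is illustrative:

```python
# Back-of-envelope check on the slide's figures.

BENCHMARK_PER_TB = 1000   # widely quoted annual Hadoop cost, US$/TB
LOADED_COST_PER_TB = 550  # fully loaded internal cost, US$/TB
COST_PER_FTE = 100_000    # typical annual budget per head, US$

def lab_budget(grid_tb):
    """Annual budget room and affordable headcount for a grid of grid_tb TB."""
    room = (BENCHMARK_PER_TB - LOADED_COST_PER_TB) * grid_tb
    return room, room // COST_PER_FTE

# A 2 PB (2,000 TB) grid:
room, ftes = lab_budget(2000)
```

At 2,000 TB this gives US$900K of room – roughly nine funded staff – consistent with the US$1MM-$2MM range quoted for 2-4 PB grids.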

Page 13:

Financing Shared Hadoop Grids

Establish a usage-driven charge-out model for consumers of the service

◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses

◦ Consider charging consumers by service quality if your service agreements permit

◦ Service quality can be designed into your multi-tenancy solution

Create a central capital account managed by the Hadoop Lab

◦ Pre-authorize incremental expansion of the data lake to stay within service objectives

◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters
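A usage-driven charge-out blending CPU and storage might look like the sketch below; the 50/50 blend weight and the tenant figures are assumptions for illustration.

```python
# Illustrative charge-out: blend each tenant's share of CPU and storage.

def charge_out(usage, monthly_cost, cpu_weight=0.5):
    """usage: {tenant: (cpu_hours, tb_stored)} -> {tenant: US$ charge}."""
    total_cpu = sum(cpu for cpu, _ in usage.values())
    total_tb = sum(tb for _, tb in usage.values())
    charges = {}
    for tenant, (cpu, tb) in usage.items():
        blended = cpu_weight * (cpu / total_cpu) + (1 - cpu_weight) * (tb / total_tb)
        charges[tenant] = round(blended * monthly_cost, 2)
    return charges

# Invented monthly figures: risk is compute-heavy, retail is storage-heavy.
charges = charge_out({"risk": (8000, 400), "retail": (2000, 600)}, monthly_cost=100_000)
```

The blend weight is the policy knob: shift it toward storage to favor compute-heavy tenants, and vice versa.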

Page 14:

Creative Project Financing

Management loves to approve “self-funding projects”

◦ Use the cost differential of storage on Hadoop to fund intra-year work

◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs

◦ Capture grid compute outputs to Hadoop instead of NAS devices

◦ Storing database back-ups on Hadoop can be cheaper than tapes

Establish an internal “venture capital” fund in your Hadoop Lab

◦ Budget “seed money” to spend with the application maintenance teams

◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption

◦ Set aside funding to pay for cross-team charges for participation in a POC

◦ Use the POCs to support project proposals based on cost reduction

◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability

Page 15:

Real-Life Lessons Learned

“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL”

– Peter Drucker

Page 16:

Save Money by Letting it Break

It’s OK if a node breaks – in fact, it is better to have a dead Hadoop node than a wounded one

Educate your infrastructure team to prevent them from over-engineering your Hadoop grids

◦ HDFS implements a RAID-like redundancy strategy in software (block replication) – use local disks instead of SAN for data nodes

◦ YARN is clever about parallelizing work – don’t use high-speed drives when cheap ones will do

◦ Don’t pay for “critical care” hardware support when next-day will be fine
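One reason local disk can be bought generously: HDFS’s software redundancy is block replication (three copies by default), which divides usable capacity. A quick sketch – the 20% allowance for temp space and OS overhead is an assumption:

```python
# Usable HDFS capacity from raw disk, after replication and overhead.

HDFS_REPLICATION = 3  # HDFS default replication factor

def usable_tb(raw_tb, overhead=0.2, replication=HDFS_REPLICATION):
    """Usable capacity after a hypothetical 20% temp/OS allowance
    and three-way block replication."""
    return raw_tb * (1 - overhead) / replication

def cost_per_usable_tb(cost_per_raw_tb, **kw):
    """Effective price once replication and overhead are counted."""
    return cost_per_raw_tb / usable_tb(1, **kw)

capacity = usable_tb(1000)  # 1,000 TB raw
```

Even at 3.75x the raw-disk price per usable TB, commodity disk stays well below the US$5K/TB SAN figure quoted earlier in the deck.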

Appliances and virtualization break the economics of Hadoop

◦ Equipment failure in an appliance is all-or-nothing

◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance

◦ Unit prices increase as a result – annual costs on appliances barely stay under the $1K/TB benchmark

◦ Your virtualization farm duplicates all of the fault tolerance in Hadoop – and slows Hadoop down

◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids

◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs

Page 17:

Networks Really Matter

The quality of the network is more important than the quality of the machines

◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic

◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid

◦ Legacy tools – most notably SAS – will try to pull large data sets out of Hadoop across the network

Invest in top-of-rack switching or converged infrastructure

◦ Most data centres have 1Gb backbones connecting higher-speed sub-networks

◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost

Spend the money and time to co-locate the consuming systems within the Hadoop sub-network

◦ This will mean a “re-racking” exercise for some appliances and existing servers

Page 18:

Differing Appetites for Change

Everyone’s first idea is to have one great, shared, co-operative data lake – and it doesn’t work!

◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution – the incentive to “stand pat” grows

◦ Even worse if you’re using third-party tools for ingestion – it creates an external stakeholder who can block change!

◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution – data scientists always want the next version of everything

Separate the interactive users from the applications with a federated deployment model

◦ Put all of the applications onto a Hadoop grid which is updated very infrequently

◦ Static workloads also allow tight management of performance against service agreements

◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution

◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view

◦ Make sure you have a great network so that moving data between the grids is painless

Page 19:

Hadoop is Not a Database

Projects that attempt to replace a database server with Hadoop usually fail

◦ Avoid transactional applications

◦ Do not replace the database tier in an N-tier application with Hadoop

◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop

◦ Do not use Hadoop to host highly normalized data warehouse models

◦ De-normalized data models are much more efficient on Hadoop

◦ Do not create abstraction layers using layered Hive views

The best design patterns for Hadoop are often misused

◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone

◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server

◦ Map/reduce-style work should be run with MapReduce – not by using Hive to call UDFs

Page 20:

Internal Data is More Difficult to Access

Think of your 360° view of a customer as being 180° of transactions and 180° of interactions

Data governance, compliance, and security will inhibit the use of the transactional data

◦ Internal data sources are also usually high-cost data sources to access

Interaction data – particularly web and social media – is surprisingly easy to access

◦ Social media data is actually considered “public,” and so is entirely ungoverned

◦ There is a wealth of open source social media ingestion and analysis tools available

◦ IVR systems are linked to customers and capture a significant amount of customer interaction

◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it

◦ Call Centre recordings are a wealth of internal sentiment data

◦ Open source speech-to-text and natural language processing tools are available in Python

◦ Website clicks and usage can be analyzed for price optimization and used for push marketing

◦ Most website usage is analyzed through vendors – but setting up an inbound feed is easy

Page 21:

Data Science is Unstructured Work

Data scientists don’t work the way IT expects them to

◦ Traditional data warehousing patterns are the data science anti-pattern

◦ Data scientists don’t know what their requirements are until they’ve done their work – their job is to experiment

◦ Data scientists hate prepared views because they don’t know what logic creates them

◦ Don’t waste (too much) time on central data quality – they’re just going to re-do it anyway

◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally

◦ Preparing a time series includes data quality suitable to data science – regardless of how good the starting data is

◦ Data scientists probably know the data better than the data modelers

Page 22:

Data Science Labs

Data scientists want to develop analytics using production data – which breaks lots of policies

Support the creation of a Data Science Lab environment

◦ Lead a “once and forever” platform security review that all Hadoop users can reference

◦ Implement data governance that facilitates “window shopping” for content – even when governance will initially prohibit using the content

Invest in advanced data masking

◦ Use advanced data masking to prepare production data for the data science lab

◦ Advanced data masking retains the statistical properties of the underlying data
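As a toy illustration of “retaining the statistical properties”, the sketch below replaces a sensitive numeric column with synthetic draws that match its mean and standard deviation. Commercial masking tools do far more (format preservation, referential integrity across tables, consistent tokenization); this only shows the idea.

```python
import random
import statistics

def mask_numeric(values, seed=0):
    """Replace values with synthetic draws from a normal distribution
    fitted to the originals; deterministic for a given seed."""
    rng = random.Random(seed)
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return [rng.gauss(mu, sigma) for _ in values]

# Invented account balances standing in for production data:
balances = [1200.0, 560.0, 89000.0, 4300.0, 750.0, 12000.0]
masked = mask_numeric(balances)
```

A fixed seed makes the masked extract reproducible across refreshes, which matters when data scientists compare model runs.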

Buy a self-serve data provisioning tool

◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools

◦ The good tools turn the “shopping trip” into deployable code that you can easily package for deployment or automation

Page 23:

Projects that Succeed

“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING”

– Warren Buffett

Page 24:

Quick Wins

Finding a quick win or two will keep your organization motivated to adopt Hadoop

Massively parallel back-testing of StreamBase algorithms

◦ StreamBase is a real-time workflow platform widely used in program trading

◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel

Targeting ads on social media

◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed

◦ Python-based tools can be paired with some basic data science to find “life events”

Trend Analysis on Risk Data

◦ Simulation outputs from CVA, VaR, CCR and LRM are often discarded after one day due to their size

◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning

Page 25:

Mid-Sized Projects

Many current focus areas in finance lend themselves to achievable Hadoop projects

Volcker Rule

◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store

◦ Retention is required for five years of calendar-day data

◦ Computations can be implemented in SQL and will run well in Hive

Customer 360

◦ Hadoop is a natural platform to consolidate interaction records with transactional data

Daily Liquidity Management

◦ Running the calculations before pooling facilitates drill-down and analysis

◦ Tableau on Hadoop works very well for daily dashboards

Page 26:

Thank You for Your Time