Founding a Hadoop Lab
EVERYTHING YOU ALWAYS WANTED TO KNOW,
BUT WERE AFRAID TO ASK,
ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION
© UTILIS TECHNOLOGY LIMITED 2017
Andre [email protected]
A Short Introduction to Your Speaker
My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance
My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
◦ Architect, Enterprise Risk Systems
◦ Architect, Front Office Risk Systems
◦ Program Manager, Portfolio Management Systems
◦ Head of Risk Systems
◦ Head of Hadoop COE
Agenda
What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success
What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account
Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles
Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium-risk projects aligned to current business problems
What role will your Hadoop Lab play?
“YOU CAN’T SHRINK YOUR WAY TO GREATNESS”
– TOM PETERS
What role will your Hadoop Lab play?
Will your organization’s Hadoop Lab be a control function, or a thought leader?
Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance
Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling the talent pool through training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption
Foundational Work
Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters
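A group-based queue-share plan like the one above can be turned into YARN Capacity Scheduler settings mechanically; a minimal sketch, assuming illustrative queue names and shares (the property names are standard Capacity Scheduler keys):

```python
# Sketch: derive YARN Capacity Scheduler properties from a simple
# group-based tenancy plan. Queue names and shares are illustrative.
def capacity_scheduler_props(tenants):
    """tenants: dict of queue name -> percent share (must sum to 100)."""
    if sum(tenants.values()) != 100:
        raise ValueError("queue shares must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(tenants)}
    for queue, share in tenants.items():
        props[f"yarn.scheduler.capacity.root.{queue}.capacity"] = str(share)
    return props

plan = {"risk": 40, "marketing": 30, "datascience": 30}
for key, value in capacity_scheduler_props(plan).items():
    print(f"{key}={value}")
```

Generating the configuration from one tenancy plan keeps queue shares, directory quotas and permissions consistent with a single source of truth.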
Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “show back” reporting to monitor performance against objectives
Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
Maximizing Exposure to Change
Hadoop is an exceptionally fast-moving technology, and so needs a different approach
◦ Maximize your ability to deploy changes in the Hadoop platform
◦ Invest in continuous integration and automated regression testing for your development teams
◦ Establish a better-than-quarterly release cycle
◦ Publish a checklist of acceptable open source licenses (or a blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments
Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons – they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve-outs” from your shared grid
◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly
Building a Team
Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production
Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ A Data Librarian to manage the metadata catalogue and documentation
◦ A Data Steward to manage the data governance process
Keep a few consultants on speed dial
◦ Hadoop security experts – preferably from an audit-capable firm
◦ Compliance and fair-usage experts – particularly for external data from the web and social media
Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed
Your New Best Friends
Give all of your stakeholders a chance to participate by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams
Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise
Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this
Adopting a secondary SQL-on-Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop – a great way to move a bank’s data warehouse over to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*
What is a reasonable budget?
“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.”
– WARREN BUFFETT
Understanding the Customers
Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
◦ Most front-end systems have dozens of outbound feeds to support and maintain – offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
◦ The traditional data supply chain is very long: a source system feeds an EDW, which feeds a data mart accessed by data scientists
◦ Asking for “one more field” requires the source to send it, the EDW to model and document it, and the data mart to provision it, before a data scientist finally gets to consume it
◦ Giving data scientists access to the raw data makes them more efficient – even though less effort goes into providing the data!
Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services
Setting a Budget for a Hadoop Lab
The annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K for a SAN, and US$12K for a traditional database
◦ The cost is based on “balanced” reference configurations – “compute” costs more, “storage” less
Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
◦ Premises costs for the data centre include bare racks, power and network backbone
◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement, are included
◦ This leaves around US$450/TB per year of budget room for your Hadoop Lab to claim
◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
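The budget arithmetic above can be checked with a few lines; the rates are the slide's own figures, and the FTE conversion uses the US$100K/FTE budgeting number:

```python
# Worked example of the budget "room": the gap between the $1,000/TB
# benchmark and a ~$550/TB fully loaded cost, scaled by grid size.
BENCHMARK_PER_TB = 1_000     # widely quoted annual cost, US$/TB
LOADED_COST_PER_TB = 550     # fully loaded bank data-centre cost, US$/TB
COST_PER_FTE = 100_000       # typical budgeting number, US$/FTE/year

def lab_budget(grid_tb):
    """Annual budget room, and the headcount it funds, for a grid of grid_tb TB."""
    room = (BENCHMARK_PER_TB - LOADED_COST_PER_TB) * grid_tb
    return room, room // COST_PER_FTE

for petabytes in (2, 4):
    room, ftes = lab_budget(petabytes * 1_000)
    print(f"{petabytes} PB grid: US${room:,}/year, ~{ftes} FTEs")
```

A 2 PB grid yields about US$900K of room and a 4 PB grid about US$1.8MM, which is where the US$1MM-$2MM and 10-20 staff figures come from.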
Financing Shared Hadoop Grids
Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
◦ Service quality can be designed into your multi-tenancy solution
Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortizing the capital account will smooth out charges and avoid penalizing early adopters
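A blended CPU-plus-storage charge with a service-quality multiplier can be sketched as follows; the rates, weights and tier names are illustrative assumptions, not figures from the deck:

```python
# Sketch of a blended CPU + storage charge-out with a service-quality
# multiplier. All rates and tiers here are illustrative assumptions.
RATE_PER_VCORE_HOUR = 0.02      # US$ per vcore-hour consumed
RATE_PER_TB_MONTH = 40.0        # US$ per TB stored per month
QUALITY_MULTIPLIER = {"best-effort": 0.8, "standard": 1.0, "guaranteed": 1.5}

def monthly_charge(vcore_hours, tb_stored, tier="standard"):
    """Blend compute and storage usage into one monthly charge."""
    base = vcore_hours * RATE_PER_VCORE_HOUR + tb_stored * RATE_PER_TB_MONTH
    return round(base * QUALITY_MULTIPLIER[tier], 2)

print(monthly_charge(10_000, 50))                 # standard tier
print(monthly_charge(10_000, 50, "guaranteed"))   # premium tier pays more
```

Blending the two rates keeps compute-heavy and storage-heavy tenants from cross-subsidizing each other, and the tier multiplier ties the charge back to the multi-tenancy design.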
Creative Project Financing
Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
◦ Capture grid compute outputs to Hadoop instead of NAS devices
◦ Storing database back-ups on Hadoop can be cheaper than tapes
Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
◦ Set aside funding to pay for cross-team charges for participation in a POC
◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability
Real-Life Lessons Learned
“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL”
– PETER DRUCKER
Save Money by Letting it Break
It’s OK if a node breaks – in fact, it is better to have a dead Hadoop node than a wounded one
Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID-like strategy in software – use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work – don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine
Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
◦ Unit prices increase as a result – annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault tolerance in Hadoop – and slows Hadoop down
◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
◦ Virtual servers are smaller, so you end up with more node-count-driven Hadoop costs
Networks Really Matter
The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL off-load patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools – most notably SAS – will try to pull large data sets out of Hadoop across the network
Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher-speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost
Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers
Differing Appetites for Change
Everyone’s first idea is to have one great, shared, co-operative data lake – and it doesn’t work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution – the incentive to “stand pat” grows
◦ It is even worse if you’re using third-party tools for ingestion – that creates an external stakeholder who can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution – data scientists always want the newest version of everything
Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid that is updated very infrequently
◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless
Hadoop is Not a Database
Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views
The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce jobs should be run with MapReduce – not by using Hive to call UDFs
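The de-normalization advice above amounts to paying the join cost once at write time so that reads become simple scans; a minimal sketch with illustrative tables and fields (on a real grid this would be a Hive or Spark job producing a wide table):

```python
# Sketch of de-normalization: pre-join reference data into a wide fact
# table once, so analytic reads become single scans instead of repeated
# joins. The tables and field names are illustrative.
customers = {1: {"name": "ACME", "region": "EMEA"},
             2: {"name": "Globex", "region": "APAC"}}
trades = [{"trade_id": 10, "cust_id": 1, "notional": 5_000_000},
          {"trade_id": 11, "cust_id": 2, "notional": 2_500_000}]

# One-time "write side" join produces the wide, de-normalized records.
wide = [{**trade, **customers[trade["cust_id"]]} for trade in trades]

# Read side: no join logic, just a scan and filter.
emea_notional = sum(r["notional"] for r in wide if r["region"] == "EMEA")
print(emea_notional)
```

This trades storage for scan speed, which is exactly the trade Hadoop's cheap storage makes attractive and a normalized warehouse model forfeits.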
Internal Data is More Difficult to Access
Think of your 360° view of a customer as being 180° of transactions and 180° of interactions
Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access
Interaction data – particularly web and social media – is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
◦ Most website usage is analyzed through vendors – but setting up an inbound feed is easy
Data Science is Unstructured Work
Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
◦ Data scientists don’t know what their requirements are until they’ve done their work – their job is to experiment
◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality – they’re just going to re-do it anyway
◦ “Correct” data is subjective by study, so there isn’t one answer to implement centrally
◦ Preparing a time series includes the data quality work the study needs – regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers
Data Science Labs
Data scientists want to develop analytics using production data – which breaks lots of policies
Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content – even when governance will initially prohibit using the content
Invest in advanced data masking
◦ Use advanced data masking to prepare production data for the data science lab
◦ Advanced masking retains the statistical properties of the underlying data
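The idea of statistics-preserving masking can be illustrated in a few lines: replace a sensitive numeric column with draws from a distribution fitted to it, so aggregate analytics still behave. The field name and the normal-distribution fit are assumptions for illustration; commercial masking tools preserve far more (correlations, formats, referential integrity):

```python
# Sketch of statistics-preserving masking for one numeric column.
# Assumes a normal fit is adequate, which real data often violates.
import random
import statistics

def mask_numeric(values, seed=42):
    """Replace values with draws from a distribution fitted to them."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    rng = random.Random(seed)  # seeded so the masking is reproducible
    return [rng.gauss(mu, sigma) for _ in values]

balances = [1200.0, 950.0, 30000.0, 4100.0, 875.0, 15250.0]
masked = mask_numeric(balances)
print(round(statistics.mean(balances)), round(statistics.mean(masked)))
```

The masked column carries no original values, yet means, dispersions and model fits computed on it remain close to those on the production data.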
Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily
Projects that Succeed
“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING”
– WARREN BUFFETT
Quick Wins
Finding a quick win or two will keep your organization motivated to adopt Hadoop
Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel
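The fan-out idea above can be sketched with a local thread pool standing in for MapReduce: each task runs one independent back-test over its own parameter set. `run_backtest` here is a hypothetical stand-in for invoking a packaged algorithm instance, not a StreamBase API:

```python
# Sketch of embarrassingly parallel back-testing: a pool maps one
# back-test per parameter combination. run_backtest is hypothetical.
from concurrent.futures import ThreadPoolExecutor

def run_backtest(params):
    """Hypothetical back-test: score one parameter combination."""
    threshold, window = params
    return {"params": params, "pnl": threshold * 100 - window}

param_grid = [(t, w) for t in (1, 2, 3) for w in (10, 20)]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_backtest, param_grid))

best = max(results, key=lambda r: r["pnl"])
print(best["params"], best["pnl"])
```

Because each parameter combination is independent, the same map step scales from a local pool to hundreds of MapReduce tasks with no change to the back-test itself.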
Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”
Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning
Mid-Sized Projects
Many current focus areas in finance lend themselves to achievable Hadoop projects
Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive
Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data
Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards
Thank You for Your Time