Founding a Hadoop Lab
EVERYTHING YOU ALWAYS WANTED TO KNOW,
BUT WERE AFRAID TO ASK,
ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION
© UTILIS TECHNOLOGY LIMITED 2017
Andre [email protected]
A Short Introduction to Your Speaker
My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance
My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
◦ Architect, Enterprise Risk Systems
◦ Architect, Front Office Risk Systems
◦ Program Manager, Portfolio Management Systems
◦ Head of Risk Systems
◦ Head of Hadoop COE
Agenda
What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success
What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account
Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles
Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium-risk projects aligned to current business problems
What role will your Hadoop Lab play?
“YOU CAN’T SHRINK YOUR WAY TO GREATNESS”
– TOM PETERS
What role will your Hadoop Lab play?
Will your organization’s Hadoop Lab be a control function, or a thought leader?
Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance
Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling the talent pool through training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption
Foundational Work
Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters
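A group-based queue-share plan like the one above can be turned into YARN Capacity Scheduler settings mechanically; a minimal sketch, assuming illustrative queue names and shares (the property names are standard Capacity Scheduler keys):

```python
# Sketch: derive YARN Capacity Scheduler properties from a simple
# group-based tenancy plan. Queue names and shares are illustrative.
def capacity_scheduler_props(tenants):
    """tenants: dict of queue name -> percent share (must sum to 100)."""
    if sum(tenants.values()) != 100:
        raise ValueError("queue shares must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(tenants)}
    for queue, share in tenants.items():
        props[f"yarn.scheduler.capacity.root.{queue}.capacity"] = str(share)
    return props

plan = {"risk": 40, "marketing": 30, "datascience": 30}
for key, value in capacity_scheduler_props(plan).items():
    print(f"{key}={value}")
```

Generating the configuration from one tenancy plan keeps queue shares, directory quotas and permissions consistent with a single source of truth.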
Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “show back” reporting to monitor performance against objectives
Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
Maximizing Exposure to Change
Hadoop is an exceptionally fast-moving technology, and so needs a different approach
◦ Maximize your ability to deploy changes in the Hadoop platform
◦ Invest in continuous integration and automated regression testing for your development teams
◦ Establish a better-than-quarterly release cycle
◦ Publish a checklist of acceptable open source licenses (or a blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments
Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons – they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve-outs” from your shared grid
◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly
Building a Team
Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production
Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ A Data Librarian to manage the metadata catalogue and documentation
◦ A Data Steward to manage the data governance process
Keep a few consultants on speed dial
◦ Hadoop security experts – preferably from an audit-capable firm
◦ Compliance and fair-usage experts – particularly for external data from the web and social media
Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed
Your New Best Friends
Give all of your stakeholders a chance to participate by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams
Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise
Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this
Adopting a secondary SQL-on-Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop – a great way to move a bank’s data warehouse over to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*
What is a reasonable budget?
“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.”
– WARREN BUFFETT
Understanding the Customers
Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
◦ Most front-end systems have dozens of outbound feeds to support and maintain – offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
◦ The traditional data supply chain is very long: a source system feeds an EDW, which feeds a data mart accessed by data scientists
◦ Asking for “one more field” requires the source to send it, the EDW to model and document it, and the data mart to provision it, before a data scientist finally gets to consume it
◦ Giving data scientists access to the raw data makes them more efficient – even though less effort goes into providing the data!
Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services
Setting a Budget for a Hadoop Lab
The annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K for a SAN, and US$12K for a traditional database
◦ The cost is based on “balanced” reference configurations – “compute” costs more, “storage” less
Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
◦ Premises costs for the data centre include bare racks, power and network backbone
◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement, are included
◦ This leaves around US$450/TB per year of budget room for your Hadoop Lab to claim
◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
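The budget arithmetic above can be checked with a few lines; the rates are the slide's own figures, and the FTE conversion uses the US$100K/FTE budgeting number:

```python
# Worked example of the budget "room": the gap between the $1,000/TB
# benchmark and a ~$550/TB fully loaded cost, scaled by grid size.
BENCHMARK_PER_TB = 1_000     # widely quoted annual cost, US$/TB
LOADED_COST_PER_TB = 550     # fully loaded bank data-centre cost, US$/TB
COST_PER_FTE = 100_000       # typical budgeting number, US$/FTE/year

def lab_budget(grid_tb):
    """Annual budget room, and the headcount it funds, for a grid of grid_tb TB."""
    room = (BENCHMARK_PER_TB - LOADED_COST_PER_TB) * grid_tb
    return room, room // COST_PER_FTE

for petabytes in (2, 4):
    room, ftes = lab_budget(petabytes * 1_000)
    print(f"{petabytes} PB grid: US${room:,}/year, ~{ftes} FTEs")
```

A 2 PB grid yields about US$900K of room and a 4 PB grid about US$1.8MM, which is where the US$1MM-$2MM and 10-20 staff figures come from.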
Financing Shared Hadoop Grids
Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
◦ Service quality can be designed into your multi-tenancy solution
Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortizing the capital account will smooth out charges and avoid penalizing early adopters
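A blended CPU-plus-storage charge with a service-quality multiplier can be sketched as follows; the rates, weights and tier names are illustrative assumptions, not figures from the deck:

```python
# Sketch of a blended CPU + storage charge-out with a service-quality
# multiplier. All rates and tiers here are illustrative assumptions.
RATE_PER_VCORE_HOUR = 0.02      # US$ per vcore-hour consumed
RATE_PER_TB_MONTH = 40.0        # US$ per TB stored per month
QUALITY_MULTIPLIER = {"best-effort": 0.8, "standard": 1.0, "guaranteed": 1.5}

def monthly_charge(vcore_hours, tb_stored, tier="standard"):
    """Blend compute and storage usage into one monthly charge."""
    base = vcore_hours * RATE_PER_VCORE_HOUR + tb_stored * RATE_PER_TB_MONTH
    return round(base * QUALITY_MULTIPLIER[tier], 2)

print(monthly_charge(10_000, 50))                 # standard tier
print(monthly_charge(10_000, 50, "guaranteed"))   # premium tier pays more
```

Blending the two rates keeps compute-heavy and storage-heavy tenants from cross-subsidizing each other, and the tier multiplier ties the charge back to the multi-tenancy design.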
Creative Project Financing
Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
◦ Capture grid compute outputs to Hadoop instead of NAS devices
◦ Storing database back-ups on Hadoop can be cheaper than tapes
Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
◦ Set aside funding to pay for cross-team charges for participation in a POC
◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability
Real-Life Lessons Learned
“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL”
– PETER DRUCKER
Save Money by Letting it Break
It’s OK if a node breaks – in fact, it is better to have a dead Hadoop node than a wounded one
Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID-like strategy in software – use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work – don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine
Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
◦ Unit prices increase as a result – annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault tolerance in Hadoop – and slows Hadoop down
◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
◦ Virtual servers are smaller, so you end up with more node-count-driven Hadoop costs
Networks Really Matter
The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL off-load patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools – most notably SAS – will try to pull large data sets out of Hadoop across the network
Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher-speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost
Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers
Differing Appetites for Change
Everyone’s first idea is to have one great, shared, co-operative data lake – and it doesn’t work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution – the incentive to “stand pat” grows
◦ It is even worse if you’re using third-party tools for ingestion – that creates an external stakeholder who can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution – data scientists always want the newest version of everything
Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid that is updated very infrequently
◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless
Hadoop is Not a Database
Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views
The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce jobs should be run with MapReduce – not by using Hive to call UDFs
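The de-normalization advice above amounts to paying the join cost once at write time so that reads become simple scans; a minimal sketch with illustrative tables and fields (on a real grid this would be a Hive or Spark job producing a wide table):

```python
# Sketch of de-normalization: pre-join reference data into a wide fact
# table once, so analytic reads become single scans instead of repeated
# joins. The tables and field names are illustrative.
customers = {1: {"name": "ACME", "region": "EMEA"},
             2: {"name": "Globex", "region": "APAC"}}
trades = [{"trade_id": 10, "cust_id": 1, "notional": 5_000_000},
          {"trade_id": 11, "cust_id": 2, "notional": 2_500_000}]

# One-time "write side" join produces the wide, de-normalized records.
wide = [{**trade, **customers[trade["cust_id"]]} for trade in trades]

# Read side: no join logic, just a scan and filter.
emea_notional = sum(r["notional"] for r in wide if r["region"] == "EMEA")
print(emea_notional)
```

This trades storage for scan speed, which is exactly the trade Hadoop's cheap storage makes attractive and a normalized warehouse model forfeits.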
Internal Data is More Difficult to Access
Think of your 360° view of a customer as being 180° of transactions and 180° of interactions
Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access
Interaction data – particularly web and social media – is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
◦ Most website usage is analyzed through vendors – but setting up an inbound feed is easy
Data Science is Unstructured Work
Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
◦ Data scientists don’t know what their requirements are until they’ve done their work – their job is to experiment
◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality – they’re just going to re-do it anyway
◦ “Correct” data is subjective by study, so there isn’t one answer to implement centrally
◦ Preparing a time series includes the data quality work the study needs – regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers
Data Science Labs
Data scientists want to develop analytics using production data – which breaks lots of policies
Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content – even when governance will initially prohibit using the content
Invest in advanced data masking
◦ Use advanced data masking to prepare production data for the data science lab
◦ Advanced masking retains the statistical properties of the underlying data
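The idea of statistics-preserving masking can be illustrated in a few lines: replace a sensitive numeric column with draws from a distribution fitted to it, so aggregate analytics still behave. The field name and the normal-distribution fit are assumptions for illustration; commercial masking tools preserve far more (correlations, formats, referential integrity):

```python
# Sketch of statistics-preserving masking for one numeric column.
# Assumes a normal fit is adequate, which real data often violates.
import random
import statistics

def mask_numeric(values, seed=42):
    """Replace values with draws from a distribution fitted to them."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    rng = random.Random(seed)  # seeded so the masking is reproducible
    return [rng.gauss(mu, sigma) for _ in values]

balances = [1200.0, 950.0, 30000.0, 4100.0, 875.0, 15250.0]
masked = mask_numeric(balances)
print(round(statistics.mean(balances)), round(statistics.mean(masked)))
```

The masked column carries no original values, yet means, dispersions and model fits computed on it remain close to those on the production data.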
Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily
Projects that Succeed
“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING”
– WARREN BUFFETT
Quick Wins
Finding a quick win or two will keep your organization motivated to adopt Hadoop
Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel
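The fan-out idea above can be sketched with a local thread pool standing in for MapReduce: each task runs one independent back-test over its own parameter set. `run_backtest` here is a hypothetical stand-in for invoking a packaged algorithm instance, not a StreamBase API:

```python
# Sketch of embarrassingly parallel back-testing: a pool maps one
# back-test per parameter combination. run_backtest is hypothetical.
from concurrent.futures import ThreadPoolExecutor

def run_backtest(params):
    """Hypothetical back-test: score one parameter combination."""
    threshold, window = params
    return {"params": params, "pnl": threshold * 100 - window}

param_grid = [(t, w) for t in (1, 2, 3) for w in (10, 20)]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_backtest, param_grid))

best = max(results, key=lambda r: r["pnl"])
print(best["params"], best["pnl"])
```

Because each parameter combination is independent, the same map step scales from a local pool to hundreds of MapReduce tasks with no change to the back-test itself.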
Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”
Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning
Mid-Sized Projects
Many current focus areas in finance lend themselves to achievable Hadoop projects
Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive
Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data
Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards
Thank You for Your Time