Founding a Hadoop Data Science Lab

Founding a Hadoop Lab

EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION


A Short Introduction to Your Speaker

My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance

My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
  ◦ Architect, Enterprise Risk Systems
  ◦ Architect, Front Office Risk Systems
  ◦ Program Manager, Portfolio Management Systems
  ◦ Head of Risk Systems
  ◦ Head of Hadoop COE

Agenda

What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success

What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account

Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles

Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium risk projects aligned to current business problems

What role will your Hadoop Lab play?

“YOU CAN’T SHRINK YOUR WAY TO GREATNESS” - TOM PETERS

What role will your Hadoop Lab play?

Will your organization’s Hadoop Lab be a control function, or a thought leader?

Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance

Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling up the talent pool using training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption

Foundational Work

Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
  ◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters
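As an illustration of a group-based multi-tenancy plan, here is a minimal Python sketch. It assumes each group gets one execution queue, one directory and one permission mode; the `root.<group>` queue names, the `/data` base directory, the `750` mode and the group weights are all hypothetical choices for this example, not settings taken from any particular Hadoop distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPlan:
    group: str              # directory-service group that owns the tenancy
    queue: str              # execution queue reserved for the group
    queue_share_pct: float  # share of grid compute given to the queue
    directory: str          # root directory for the group's data
    mode: str               # permission mode applied down the directory tree


def build_plan(weights, base_dir="/data", mode="750"):
    """Turn relative group weights into queue shares, directories and permissions."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("group weights must sum to a positive number")
    return [
        TenantPlan(
            group=group,
            queue=f"root.{group}",  # hypothetical naming convention
            queue_share_pct=round(100 * weight / total, 1),
            directory=f"{base_dir}/{group}",
            mode=mode,
        )
        for group, weight in sorted(weights.items())
    ]


# Hypothetical groups and weights, for illustration only.
plan = build_plan({"risk": 2, "marketing": 1, "finance": 1})
for tenant in plan:
    print(tenant)
```

Deriving queues, directories and permissions from the group name alone keeps on-boarding simple: adding a tenant means adding one group and one weight.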

Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “showback” reporting to monitor performance against objectives

Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
  ◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
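To make the access control matrix concrete, the sketch below pivots a flat list of grants into a user-by-resource matrix suitable for an attestation report. The `(user, resource, permission)` tuples are a hypothetical extract invented for this example; they do not reflect Ranger's actual database schema, which would need to be queried and mapped separately.

```python
from collections import defaultdict


def access_matrix(grants):
    """Pivot (user, resource, permission) grants into {user: {resource: [permissions]}}."""
    matrix = defaultdict(lambda: defaultdict(set))
    for user, resource, permission in grants:
        matrix[user][resource].add(permission)
    return {
        user: {res: sorted(perms) for res, perms in resources.items()}
        for user, resources in matrix.items()
    }


def render(matrix):
    """Render the matrix as CSV text: one row per user, one column per resource."""
    resources = sorted({res for row in matrix.values() for res in row})
    lines = ["user," + ",".join(resources)]
    for user in sorted(matrix):
        cells = ["/".join(matrix[user].get(res, [])) or "-" for res in resources]
        lines.append(user + "," + ",".join(cells))
    return "\n".join(lines)


# Hypothetical grants, for illustration only.
grants = [
    ("alice", "/data/risk", "read"),
    ("alice", "/data/risk", "write"),
    ("bob", "/data/risk", "read"),
    ("bob", "/data/marketing", "read"),
]
matrix = access_matrix(grants)
report = render(matrix)
print(report)
```

A report in this shape can be sent to each data owner for periodic review, with one column per resource they are accountable for.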

Maximizing Exposure to Change

Hadoop is an exceptionally fast-moving technology, and so needs a different approach
◦ Maximize your ability to deploy the changes in the Hadoop platform
◦ Invest in continuous integration and automated regression testing for your development teams
◦ Establish a better-than-quarterly release cycle
◦ Publish a checklist of acceptable open source licenses (or a blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments

Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons: they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve outs” from your shared grid
◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly

Building a Team

Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of an intermediate developer and a junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production

Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ Data Librarian to manage the metadata catalogue and documentation
◦ Data Steward to manage the data governance process

Keep a few consultants on speed dial
◦ Hadoop security experts, preferably from an audit-capable firm
◦ Compliance and fair usage experts, particularly for external data from the web and social media

Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed

Your New Best Friends

Give all of your stakeholders a chance to participate, by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams

Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise

Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this

Adopting a secondary SQL-on-Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop, which is a great way to move a bank’s data warehouse over to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*

What is a reasonable budget?

“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.” - WARREN BUFFETT

Understanding the Customers

Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
  ◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain. Offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
  ◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
  ◦ Traditional data supply chain is very long: source system feeds an EDW, which feeds a data mart accessed by data scientists
  ◦ Asking for “one more field” requires source to send it, EDW to model and document it, data mart to provision it, and then finally a data scientist gets to consume it
  ◦ Giving data scientists access to the raw data makes them more efficient, even though less effort goes into providing the data!

Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services

Setting a Budget for a Hadoop Lab

Annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K/TB for a SAN, and US$12K/TB for a traditional database
◦ Cost based on “balanced” reference configurations: “compute” is more, “storage” is less

Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
  ◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
  ◦ Premises costs for data centre include bare racks, power and network backbone
  ◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement included
◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim
  ◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
  ◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
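The budget arithmetic on this slide can be reproduced in a few lines of Python. The per-TB and per-FTE figures come from the slide; the conversion of 1 PB to 1,000 TB is an assumption. The exact results are US$900K and US$1.8MM per year (9 and 18 FTEs), which the slide rounds to US$1MM-$2MM and a staff of 10-20.

```python
BENCHMARK_PER_TB = 1_000    # widely quoted annual cost of Hadoop, US$/TB
LOADED_COST_PER_TB = 550    # fully loaded annual grid cost, US$/TB
COST_PER_FTE = 100_000      # typical budgeting number, US$/FTE per year
TB_PER_PB = 1_000           # assumption: decimal petabytes


def lab_budget(grid_pb):
    """Annual budget room for the Lab: benchmark minus loaded cost, times grid size."""
    room_per_tb = BENCHMARK_PER_TB - LOADED_COST_PER_TB  # US$450/TB per year
    return grid_pb * TB_PER_PB * room_per_tb


def headcount(budget):
    """Whole FTEs that the budget can fund."""
    return budget // COST_PER_FTE


for grid_pb in (2, 4):
    budget = lab_budget(grid_pb)
    print(f"{grid_pb} PB grid: US${budget:,} per year, about {headcount(budget)} FTEs")
```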

Financing Shared Hadoop Grids

Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
◦ Service quality can be designed into your multi-tenancy solution

Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters
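One way to blend CPU and storage consumption into a charge-out is sketched below. It is an illustration only: the 50/50 weighting, the tenant names and the usage figures are hypothetical, and a real model would also need to handle service-quality tiers and amortization from the capital account.

```python
def chargeout(total_cost, usage, cpu_weight=0.5):
    """Split total_cost across tenants by a weighted blend of CPU and storage share.

    usage maps tenant -> (cpu_hours, tb_stored). cpu_weight is the fraction of
    the cost allocated by CPU share; the remainder is allocated by storage share.
    """
    total_cpu = sum(cpu for cpu, _ in usage.values())
    total_tb = sum(tb for _, tb in usage.values())
    if total_cpu <= 0 or total_tb <= 0:
        raise ValueError("total CPU and storage usage must both be positive")
    charges = {}
    for tenant, (cpu, tb) in usage.items():
        share = cpu_weight * cpu / total_cpu + (1 - cpu_weight) * tb / total_tb
        charges[tenant] = round(total_cost * share, 2)
    return charges


# Hypothetical annual usage: (CPU hours, TB stored) per tenant.
usage = {
    "risk": (6_000, 100),       # compute-heavy
    "marketing": (3_000, 300),  # balanced
    "finance": (1_000, 600),    # storage-heavy
}
charges = chargeout(1_000_000, usage)
print(charges)
```

In this example the compute-heavy and storage-heavy tenants end up paying the same amount, which is the balancing effect the blended model is meant to achieve.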

Creative Project Financing

Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
  ◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
  ◦ Capture grid compute outputs to Hadoop instead of NAS devices
  ◦ Storing database back-ups on Hadoop can be cheaper than tapes

Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
  ◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
  ◦ Set aside funding to pay for cross-team charges for participation in a POC
  ◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability
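A rough sizing of a “self-funding” migration can use the benchmark costs quoted earlier in this deck (US$1K/TB per year for Hadoop, US$5K for a SAN, US$12K for a traditional database). The sketch below is an estimate under those benchmarks only; the 50 TB volume is a hypothetical example, and migration effort is not included.

```python
HADOOP_PER_TB = 1_000     # benchmark annual cost, US$/TB
SAN_PER_TB = 5_000        # benchmark annual cost, US$/TB
DATABASE_PER_TB = 12_000  # benchmark annual cost, US$/TB


def annual_saving(tb_moved, source_cost_per_tb, target_cost_per_tb=HADOOP_PER_TB):
    """Annual storage cost released by moving tb_moved from the source tier to Hadoop."""
    return tb_moved * (source_cost_per_tb - target_cost_per_tb)


# Hypothetical example: 50 TB of historical content moved off tier-one SAN.
saving = annual_saving(50, SAN_PER_TB)
print(f"Moving 50 TB from SAN to Hadoop frees about US${saving:,} per year")
```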

Real-Life Lessons Learned

“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL” - PETER DRUCKER

Save Money by Letting it Break

It’s OK if a node breaks; in fact, it is better to have a dead Hadoop node than a wounded one

Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID-like strategy in software (block replication across nodes), so use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work, so don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine

Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
  ◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
  ◦ Unit prices increase as a result: annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault tolerance in Hadoop, and slows Hadoop down
  ◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
  ◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs

Networks Really Matter

The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools, most notably SAS, will try to pull large data sets out of Hadoop across the network

Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher-speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost

Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers
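A back-of-the-envelope calculation shows why the link speed matters when a tool pulls a large data set out of the grid. The sketch below assumes decimal units (1 TB = 8 × 10^12 bits), a single link running at its nominal rate, and no protocol overhead unless `efficiency` is lowered; the 10 TB data set is a hypothetical example.

```python
def transfer_hours(terabytes, link_gbps, efficiency=1.0):
    """Hours needed to move `terabytes` over a link of `link_gbps` gigabits per second."""
    bits = terabytes * 8e12                       # decimal TB to bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


# Hypothetical 10 TB extract over a 1Gb backbone versus a 40Gb uplink.
for gbps in (1, 40):
    print(f"{gbps} Gb/s: {transfer_hours(10, gbps):.1f} hours")
```

Under these assumptions the same extract takes most of a day on the 1Gb backbone and well under an hour on a 40Gb link, before any contention from other traffic.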

Differing Appetites for Change

Everyone’s first idea is to have one great, shared, co-operative data lake, and it doesn't work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution: the incentive to “stand pat” grows
  ◦ Even worse if you’re using third-party tools for ingestion, since that creates an external stakeholder which can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution: data scientists always want the most current next version of everything

Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid which is updated very infrequently
  ◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
  ◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless

Hadoop is Not a Database

Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
  ◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
  ◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views

The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce should be run with MapReduce, not by using Hive to call UDFs

Internal Data is More Difficult to Access

Think of your 360° view of a customer as being 180° of transactions and 180° of interactions

Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access

Interaction data, particularly web and social media, is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
  ◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
  ◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
  ◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
  ◦ Most website usage is analyzed through vendors, but setting up an inbound feed is easy

Data Science is Unstructured Work

Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
  ◦ Data scientists don’t know what their requirements are until they’ve done their work; their job is to experiment
  ◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality: they’re just going to re-do it anyway
  ◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally
  ◦ Preparing a time series includes data quality suitable to data science, regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers

Data Science Labs

Data scientists want to develop analytics using production data, which breaks lots of policies

Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content, even when governance will initially prohibit using the content

Invest in advanced data masking
◦ Masking prepares production data for use in the data science lab
◦ Advanced data masking retains the statistical properties of the underlying data

Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily
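To illustrate what “retaining the statistical properties” can mean, the sketch below shuffles one sensitive column across rows, so that no record keeps its true value but the column's overall distribution is unchanged. This is only one simple technique chosen for illustration, not a description of what commercial masking tools do; it breaks correlations with other columns and is not a strong anonymization method on its own. The sample records are invented.

```python
import random
from statistics import mean


def shuffle_mask(rows, column, seed=None):
    """Return copies of rows with the values of `column` permuted across rows."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)                       # permute values across records
    return [{**row, column: value} for row, value in zip(rows, values)]


# Invented sample records, for illustration only.
customers = [
    {"name": "A", "balance": 1200},
    {"name": "B", "balance": 560},
    {"name": "C", "balance": 98000},
    {"name": "D", "balance": 4300},
]
masked = shuffle_mask(customers, "balance", seed=42)

# The column's distribution is identical before and after masking.
print(mean(r["balance"] for r in customers), mean(r["balance"] for r in masked))
```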

Projects that Succeed

“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING” - WARREN BUFFETT

Quick Wins

Finding a quick win or two will keep your organization motivated to adopt Hadoop

Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel

Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”

Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VaR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning

Mid-Sized Projects

Many current focus areas in finance lend themselves to achievable Hadoop projects

Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive

Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data

Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards

Thank You for Your Time
