Founding a Hadoop Lab
EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION
© UTILIS TECHNOLOGY LIMITED 2017

Founding a Hadoop Lab… · My Adventures in Hadoop Lead Hadoop adoption at three Canadian banks Established a successful Hadoop COE Advisory roles on Hadoop in finance My Career





A Short Introduction to Your Speaker

My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance

My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years:
  ◦ Architect, Enterprise Risk Systems
  ◦ Architect, Front Office Risk Systems
  ◦ Program Manager, Portfolio Management Systems
  ◦ Head of Risk Systems
  ◦ Head of Hadoop COE


Agenda

What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success

What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account

Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles

Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium risk projects aligned to current business problems


What role will your Hadoop Lab play?
“YOU CAN’T SHRINK YOUR WAY TO GREATNESS” - TOM PETERS


What role will your Hadoop Lab play?

Will your organization’s Hadoop Lab be a control function, or a thought leader?

Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance

Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling up the talent pool using training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption


Foundational Work

Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
  ◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters

Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “showback” reporting to monitor performance against objectives

Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
  ◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
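
The centrally generated access control matrix can be sketched in a few lines of Python. This is only an illustration: the record layout (`group`, `path`, `permissions`), the group names and the users are hypothetical and are not Ranger's actual database schema. A real report would read from whatever policy store the grid uses.

```python
from collections import defaultdict

# Hypothetical inputs: group membership and group-level policies.
# These structures are illustrative only; they are NOT Ranger's schema.
GROUP_MEMBERS = {
    "risk_analysts": ["alice", "bob"],
    "data_eng": ["bob", "carol"],
}
POLICIES = [
    {"group": "risk_analysts", "path": "/data/risk", "permissions": {"read"}},
    {"group": "data_eng", "path": "/data/risk", "permissions": {"read", "write"}},
    {"group": "data_eng", "path": "/data/raw", "permissions": {"read", "write"}},
]


def build_access_matrix(group_members, policies):
    """Expand group-level policies into a user -> path -> permissions matrix."""
    matrix = defaultdict(lambda: defaultdict(set))
    for policy in policies:
        for user in group_members.get(policy["group"], []):
            # A user in several groups gets the union of their permissions.
            matrix[user][policy["path"]] |= policy["permissions"]
    return matrix


def format_report(matrix):
    """Render the matrix as sorted, tab-separated lines for a review."""
    lines = []
    for user in sorted(matrix):
        for path in sorted(matrix[user]):
            perms = ",".join(sorted(matrix[user][path]))
            lines.append(f"{user}\t{path}\t{perms}")
    return lines


access_matrix = build_access_matrix(GROUP_MEMBERS, POLICIES)
report_lines = format_report(access_matrix)
for line in report_lines:
    print(line)
```

Because permissions hang off groups, the same membership table could plausibly drive queue shares and directory layout as well, which is the appeal of a group-based multi-tenancy plan.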


Maximizing Exposure to Change

Hadoop is an exceptionally fast moving technology, and so needs a different approach
◦ Maximize your ability to deploy the changes in the Hadoop platform
  ◦ Invest in continuous integration and automated regression testing for your development teams
  ◦ Establish a better-than-quarterly release cycle
  ◦ Publish a checklist of acceptable open source licenses (or blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments

Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons – they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve outs” from your shared grid
  ◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly
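
A published license checklist becomes more useful when the release pipeline can apply it automatically. The sketch below shows one way that might look. The allow-list, the deny-list and the dependency manifest are all hypothetical; which licenses belong on which list is a legal and policy decision for each organization.

```python
# Hypothetical policy lists using SPDX-style identifiers. These are
# placeholders for illustration, not a recommendation.
ALLOWED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}
PROHIBITED_LICENSES = {"AGPL-3.0-only"}


def classify_dependencies(dependencies):
    """Sort (name, license) pairs into allowed, prohibited, or needs-review."""
    result = {"allowed": [], "prohibited": [], "review": []}
    for name, license_id in dependencies:
        if license_id in PROHIBITED_LICENSES:
            result["prohibited"].append(name)
        elif license_id in ALLOWED_LICENSES:
            result["allowed"].append(name)
        else:
            # Anything not on either list goes to a human reviewer.
            result["review"].append(name)
    return result


# Hypothetical dependency manifest for a project about to be released.
sample = [
    ("hadoop-client", "Apache-2.0"),
    ("some-parser", "MIT"),
    ("graph-lib", "AGPL-3.0-only"),
    ("vendor-sdk", "Proprietary"),
]
report = classify_dependencies(sample)
print(report)
```

A continuous integration job could fail the build when `report["prohibited"]` is non-empty and open a review ticket for anything under `report["review"]`.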


Building a Team

Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production

Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ Data Librarian to manage the metadata catalogue and documentation
◦ Data Steward to manage the data governance process

Keep a few consultants on speed dial
◦ Hadoop security experts – preferably from an audit-capable firm
◦ Compliance and fair usage experts – particularly for external data from the web and social media

Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed


Your New Best Friends

Give all of your stakeholders a chance to participate, by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams

Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise

Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this

Adopting a secondary SQL on Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop – great way to move over a bank’s data warehouse to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*


What is a reasonable budget?
“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.” - WARREN BUFFETT


Understanding the Customers

Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
  ◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain – offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
  ◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
  ◦ Traditional data supply chain is very long: source system feeds an EDW, which feeds a data mart accessed by data scientists
  ◦ Asking for “one more field” requires source to send it, EDW to model and document it, data mart to provision it, and then finally a data scientist gets to consume it
  ◦ Giving data scientists access to the raw data makes them more efficient – even though less effort goes into providing the data!

Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services


Setting a Budget for a Hadoop Lab

Annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K for a SAN, and US$12K for a traditional database
◦ Cost based on “balanced” reference configurations – “compute” is more, “storage” is less

Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
  ◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
  ◦ Premises costs for data centre include bare racks, power and network backbone
  ◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement included
◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim
◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
  ◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
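
The budget arithmetic on this slide can be reproduced with a short calculation. The dollar figures come from the slide. The conversion of 1 PB to 1,000 TB is an assumption made here for simplicity.

```python
# Figures quoted on the slide (US$ per TB per year, US$ per FTE per year).
BENCHMARK_PER_TB = 1_000    # widely quoted annual cost of Hadoop
LOADED_COST_PER_TB = 550    # fully loaded cost in a bank data centre
COST_PER_FTE = 100_000      # typical budgeting number per staff member
TB_PER_PB = 1_000           # assumption: decimal petabytes


def lab_budget(grid_pb):
    """Return (room per TB, annual Lab budget, staff funded) for a grid size in PB."""
    room_per_tb = BENCHMARK_PER_TB - LOADED_COST_PER_TB
    budget = room_per_tb * grid_pb * TB_PER_PB
    staff = budget / COST_PER_FTE
    return room_per_tb, budget, staff


for size_pb in (2, 4):
    room, budget, staff = lab_budget(size_pb)
    print(f"{size_pb} PB: room US${room}/TB, budget US${budget:,}, staff {staff:.0f}")
```

With these inputs a 2 PB grid gives US$900,000 and 9 staff, and a 4 PB grid gives US$1,800,000 and 18 staff, which the slide rounds to US$1MM-$2MM and 10-20 people.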


Financing Shared Hadoop Grids

Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
  ◦ Service quality can be designed into your multi-tenancy solution

Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters
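
One possible shape for a blended charge-out is sketched below: each tenant pays a share of the total grid cost, weighted between its share of CPU hours and its share of storage. The 50/50 weighting, the tenant names and the usage numbers are hypothetical. The slide only says that the blend should balance compute and data uses.

```python
def blended_chargeout(total_cost, usage, cpu_weight=0.5):
    """Split total_cost across tenants by a weighted blend of CPU and storage share.

    usage maps tenant -> {"cpu_hours": ..., "storage_tb": ...}.
    cpu_weight is the fraction of the cost allocated by CPU share (assumed 0.5).
    """
    total_cpu = sum(u["cpu_hours"] for u in usage.values())
    total_tb = sum(u["storage_tb"] for u in usage.values())
    if total_cpu <= 0 or total_tb <= 0:
        raise ValueError("usage totals must be positive")

    charges = {}
    for tenant, u in usage.items():
        cpu_share = u["cpu_hours"] / total_cpu
        storage_share = u["storage_tb"] / total_tb
        blended = cpu_weight * cpu_share + (1 - cpu_weight) * storage_share
        charges[tenant] = round(total_cost * blended, 2)
    return charges


# Hypothetical monthly usage for three tenants on a shared grid.
usage = {
    "risk": {"cpu_hours": 6000, "storage_tb": 100},       # compute-heavy
    "marketing": {"cpu_hours": 2000, "storage_tb": 300},  # storage-heavy
    "finance": {"cpu_hours": 2000, "storage_tb": 100},
}
charges = blended_chargeout(1_000_000, usage)
print(charges)
```

In this example the compute-heavy and storage-heavy tenants end up paying the same amount, which is the balancing effect the blend is meant to produce. Charging by service quality could be layered on as a per-tenant multiplier.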


Creative Project Financing

Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
  ◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
  ◦ Capture grid compute outputs to Hadoop instead of NAS devices
  ◦ Storing database back-ups on Hadoop can be cheaper than tapes

Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
  ◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
  ◦ Set aside funding to pay for cross-team charges for participation in a POC
  ◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability
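
The "self-funding" argument rests on a simple cost differential. The sketch below uses the per-TB figures quoted elsewhere in this deck (US$5K for SAN, US$1K for Hadoop). The 50 TB migration volume is a hypothetical example, not a figure from the talk.

```python
# Annual per-TB costs as quoted in this deck (US$).
SAN_COST_PER_TB = 5_000
HADOOP_COST_PER_TB = 1_000


def annual_storage_saving(migrated_tb,
                          from_cost=SAN_COST_PER_TB,
                          to_cost=HADOOP_COST_PER_TB):
    """Annual saving from moving migrated_tb of data to cheaper storage."""
    return migrated_tb * (from_cost - to_cost)


# Hypothetical: migrate 50 TB of historical content off "tier one" SAN.
saving = annual_storage_saving(50)
print(f"Annual saving: US${saving:,}")
```

A business case would also need to net off one-time migration effort, which this sketch leaves out.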


Real-Life Lessons Learned
“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL” - PETER DRUCKER


Save Money by Letting it Break

It’s OK if a node breaks – in fact, it is better to have a dead Hadoop node than a wounded one

Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID strategy in software – use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work – don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine

Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
  ◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
  ◦ Unit prices increase as a result – annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault-tolerance in Hadoop – and slows Hadoop down
  ◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
  ◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs


Networks Really Matter

The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools – most notably SAS – will try to pull large data sets out of Hadoop across the network

Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost

Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers
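
A rough calculation shows why the backbone speed matters when a tool pulls a large data set out of the grid. The sketch assumes the link runs at its full nominal rate with no protocol overhead or contention, so real transfers would be slower. The 10 TB data set size is a hypothetical example.

```python
def transfer_hours(dataset_tb, link_gbps):
    """Idealized time in hours to move dataset_tb over a link of link_gbps.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits/s) and a
    fully utilized link, which is an optimistic simplification.
    """
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


# Hypothetical 10 TB extract over the two link speeds named on the slide.
slow = transfer_hours(10, 1)    # 1Gb backbone
fast = transfer_hours(10, 40)   # 40Gb uplink
print(f"1Gb: {slow:.1f} h, 40Gb: {fast:.2f} h")
```

Under these assumptions the extract takes roughly 22 hours over a 1Gb backbone and a little over half an hour over a 40Gb uplink.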


Differing Appetites for Change

Everyone’s first idea is to have one great, shared, co-operative data lake – and it doesn't work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution – the incentive to “stand pat” grows
  ◦ Even worse if you’re using third-party tools for ingestion – it creates an external stakeholder which can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution – data scientists always want the most current next version of everything

Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid which is updated very infrequently
  ◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
  ◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless


Hadoop is Not a Database

Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
  ◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
  ◦ De-normalized data models are much more efficient on Hadoop
  ◦ Do not create abstraction layers using layered Hive views

The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce should be run with MapReduce – not using Hive to call UDFs


Internal Data is More Difficult to Access

Think of your 360° view of a customer as being 180° of transactions and 180° of interactions

Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access

Interaction data – particularly web and social media – is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
  ◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
  ◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
  ◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
  ◦ Most website usage is analyzed through vendors – but setting up an inbound feed is easy


Data Science is Unstructured Work

Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
  ◦ Data scientists don’t know what their requirements are until they’ve done their work – their job is to experiment
  ◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality – they’re just going to re-do it anyway
  ◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally
  ◦ Preparing a time series includes data quality suitable to data science – regardless of how good the starting data is
◦ Data scientists probably know the data better than the data modelers


Data Science Labs

Data scientists want to develop analytics using production data – which breaks lots of policies

Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content – even when governance will initially prohibit using the content

Invest in advanced data masking
◦ Invest in advanced data masking to prepare production data for the data science lab
◦ Advanced data masking retains the statistical properties of the underlying data

Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
  ◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily
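
To illustrate what "retains the statistical properties" can mean for data masking, the sketch below shuffles one sensitive column across records. This is a deliberately simple stand-in, not the advanced masking the slide recommends. It preserves the column's own distribution (mean, spread, value counts) but breaks the link to individual customers, and it also destroys correlations with other columns, which commercial masking tools aim to keep. The field names and values are hypothetical.

```python
import random
import statistics


def shuffle_mask(records, column, seed=None):
    """Return copies of records with the values of `column` permuted across rows."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    # Build new dicts so the original production records are left untouched.
    return [{**r, column: v} for r, v in zip(records, values)]


# Hypothetical production extract.
production = [
    {"customer_id": 1, "balance": 1200},
    {"customer_id": 2, "balance": 50},
    {"customer_id": 3, "balance": 98000},
    {"customer_id": 4, "balance": 430},
]
masked = shuffle_mask(production, "balance", seed=7)

original_balances = [r["balance"] for r in production]
masked_balances = [r["balance"] for r in masked]
print(statistics.mean(original_balances), statistics.mean(masked_balances))
```

The two printed means are identical because the masked column contains exactly the same values in a different order.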


Projects that Succeed
“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING” - WARREN BUFFETT


Quick Wins

Finding a quick win or two will keep your organization motivated to adopt Hadoop

Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel

Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”

Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning


Mid-Sized Projects

Many current focus areas in finance lend themselves to achievable Hadoop projects

Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive

Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data

Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards


Thank You for Your Time