Founding a Hadoop Lab
EVERYTHING YOU ALWAYS WANTED TO KNOW, BUT WERE AFRAID TO ASK, ABOUT FINDING SUCCESS WITH HADOOP IN YOUR ORGANIZATION
© UTILIS TECHNOLOGY LIMITED 2017
A Short Introduction to Your Speaker
My Adventures in Hadoop
◦ Led Hadoop adoption at three Canadian banks
◦ Established a successful Hadoop COE
◦ Advisory roles on Hadoop in finance
My Career in Finance
◦ Four banks, one stock exchange, one pension fund
◦ Capital markets, retail banking, enterprise risk roles
◦ Founder of two IT departments
◦ Technology leader in Risk Systems for 15 years
    ◦ Architect, Enterprise Risk Systems
    ◦ Architect, Front Office Risk Systems
    ◦ Program Manager, Portfolio Management Systems
    ◦ Head of Risk Systems
    ◦ Head of Hadoop COE
Agenda
What role will your Hadoop Lab play?
◦ Defining objectives, building a team and forming partnerships
◦ Foundational work to set a path to success
What is a reasonable budget?
◦ Calculating your “room” based on industry benchmarks
◦ Capacity planning, charge-out, and the central capital account
Real-life Lessons Learned
◦ Setting up infrastructure to take advantage of Hadoop’s unique properties
◦ Creating a practice that fits your users’ work styles
Projects that Succeed
◦ Ideas for a quick win to keep everyone motivated
◦ Medium-risk projects aligned to current business problems
What role will your Hadoop Lab play?
“YOU CAN’T SHRINK YOUR WAY TO GREATNESS” - TOM PETERS
What role will your Hadoop Lab play?
Will your organization’s Hadoop Lab be a control function, or a thought leader?
Control functions
◦ Operational controls, compliance and auditing
◦ Budgeting
◦ Architecture gating
◦ Data governance
Thought leadership
◦ Design patterns and solution architecture
◦ Demonstration projects and proofs-of-concept
◦ Filling up the talent pool using training, workshops and user groups
◦ Educating on best practices and success stories to motivate adoption
Foundational Work
Invest in user-friendly operational management
◦ Design a simple multi-tenancy plan based on group membership
    ◦ Include share of execution queues, directory structures and cascading permissions
◦ Set up self-serve user on-boarding through your organization’s Help Desk
◦ Implement single sign-on for Kerberos-secured clusters
Manage expectations by monitoring performance
◦ Set service level objectives for both interactive and application uses
◦ Use “showback” reporting to monitor performance against objectives
Implement access control governance as a basic service
◦ Generate access control matrix audits centrally for all grid users
    ◦ Reporting from Ranger’s database works well and is easy to build
◦ Set policy and prepare reports for periodic attestation/user account reviews
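The access control matrix audit described above can be sketched in a few lines. This is a minimal illustration only: the group memberships, grant records and function names below are hypothetical and do not reflect Ranger's actual database schema, which would need to be queried and mapped into this shape first.

```python
from collections import defaultdict

# Hypothetical inputs: group membership and group-level grants.
# Illustrative records only, not Ranger's actual schema.
GROUP_MEMBERS = {
    "risk_analysts": ["alice", "bob"],
    "data_engineers": ["bob", "carol"],
}
GROUP_GRANTS = [
    ("risk_analysts", "/data/risk", "read"),
    ("data_engineers", "/data/risk", "write"),
    ("data_engineers", "/data/raw", "read"),
]


def build_access_matrix(group_members, group_grants):
    """Expand group-level grants into a user -> resource -> permissions map."""
    matrix = defaultdict(lambda: defaultdict(set))
    for group, resource, permission in group_grants:
        for user in group_members.get(group, []):
            matrix[user][resource].add(permission)
    return matrix


def format_report(matrix):
    """Return one tab-separated line per (user, resource) pair, sorted for review."""
    lines = []
    for user in sorted(matrix):
        for resource in sorted(matrix[user]):
            perms = ",".join(sorted(matrix[user][resource]))
            lines.append(f"{user}\t{resource}\t{perms}")
    return lines


if __name__ == "__main__":
    for line in format_report(build_access_matrix(GROUP_MEMBERS, GROUP_GRANTS)):
        print(line)
```

A flat, sorted report like this is the kind of artifact that can be handed to reviewers for periodic attestation, since each line states what one user can do to one resource.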
Maximizing Exposure to Change
Hadoop is an exceptionally fast-moving technology, and so needs a different approach
◦ Maximize your ability to deploy the changes in the Hadoop platform
    ◦ Invest in continuous integration and automated regression testing for your development teams
    ◦ Establish a better-than-quarterly release cycle
    ◦ Publish a checklist of acceptable open source licenses (or blacklist of prohibited ones)
◦ Encourage use of Hadoop as an application container
◦ Set up lab environments
Discourage practices that prevent your organization from keeping pace
◦ Avoid encapsulating Hadoop with frameworks or wrapping Hadoop inside applications
◦ Avoid proprietary add-ons – they don’t get as much collaboration in the open source community
◦ Prohibit equipment “carve outs” from your shared grid
    ◦ Include the cost of additional equipment in the business case, co-locate, and charge out accordingly
Building a Team
Data Engineers are the key to the successful adoption of a data lake
◦ Data engineers are a hybrid of intermediate developer and junior data scientist
◦ Good data engineering accelerates data science, and the ability to deploy data science to production
Other roles to consider
◦ A few versatile senior developers to give you the ability to execute POCs
◦ Data Librarian to manage the metadata catalogue and documentation
◦ Data Steward to manage the data governance process
Keep a few consultants on speed dial
◦ Hadoop security experts – preferably from an audit-capable firm
◦ Compliance and fair usage experts – particularly for external data from the web and social media
Fund the Hadoop and Linux administrators, but leave them in the infrastructure team
◦ They need the administrative access that these teams are allowed
Your New Best Friends
Give all of your stakeholders a chance to participate, by forming a working group
◦ Exposure to business stakeholders is particularly valuable for technology teams
Enlist the Capital Markets infrastructure team to build and manage the Hadoop grid
◦ It is worth solving the accounting problems to get their expertise
Co-opt your existing data hub’s team to operate your new Data Lake’s processes
◦ BCBS-239 projects have provided an excellent opportunity to do this
Adopting a secondary SQL on Hadoop solution helps to transfer skills as well as code
◦ IBM DB2 is available for Hadoop – a great way to move a bank’s data warehouse over to the Lab
◦ Other ANSI-compliant solutions include HAWQ, Vertica, Polybase*
What is a reasonable budget?
“PRICE IS WHAT YOU PAY. VALUE IS WHAT YOU GET.” - WARREN BUFFETT
Understanding the Customers
Before setting a budget, decide who you’re going to charge for your Hadoop Lab
◦ Data producers will see Hadoop as a cost-reduction opportunity
    ◦ Most front-end systems have dozens of outbound feeds that they have to support and maintain – offer them the chance to drop off a single comprehensive feed to Hadoop so that consumers can build and manage their own outbound feeds
    ◦ Consuming systems also have support teams managing inbound feeds, so they won’t see a significant change in support costs
◦ Data consumers will see Hadoop as improving their capabilities
    ◦ Traditional data supply chain is very long: source system feeds an EDW, which feeds a data mart accessed by data scientists
    ◦ Asking for “one more field” requires source to send it, EDW to model and document it, data mart to provision it, and then finally a data scientist gets to consume it
    ◦ Giving data scientists access to the raw data makes them more efficient – even though less effort goes into providing the data!
Align the funding model to the benefits realized by the participants:
◦ One-time costs to on-board new data should come from the producer of the data
◦ On-going operating costs for the Hadoop grid should be shared by the consumers of grid services
Setting a Budget for a Hadoop Lab
Annual cost of Hadoop is widely quoted as US$1,000/TB
◦ This compares favorably to US$5K for a SAN, and US$12K for a traditional database
◦ Cost based on “balanced” reference configurations – “compute” is more, “storage” is less
Use this well-known industry benchmark to set your budget
◦ Fully loaded costs for a bank-sized Hadoop grid in a bank data centre are around US$550/TB per year
    ◦ Capital charges for infrastructure costs, including servers and dedicated network switching, are amortized over three years
    ◦ Premises costs for data centre include bare racks, power and network backbone
    ◦ On-going support subscriptions for operating systems and Hadoop, and next-day hardware replacement included
◦ This creates around US$450/TB per year of budget room for your Hadoop Lab to claim
    ◦ A typical bank-sized Hadoop grid is 2-4 PB, which yields a Lab budget of US$1MM-$2MM per year
    ◦ This budget funds a staff of 10-20 based on typical budgeting numbers of US$100K/FTE per year
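The budget arithmetic on this slide can be reproduced with a short calculation. This is a sketch using only the figures quoted above; the function names are illustrative, and the conversion of 1 PB to 1,000 TB is an assumption. The exact results (US$0.9MM to US$1.8MM, 9 to 18 staff) are what the slide rounds to US$1MM-$2MM and 10-20 staff.

```python
# Figures quoted on the slide (US$ per TB per year).
BENCHMARK_USD_PER_TB = 1_000      # widely quoted annual cost of Hadoop
FULLY_LOADED_USD_PER_TB = 550     # fully loaded cost in a bank data centre
COST_PER_FTE_USD = 100_000        # typical budgeting number per FTE per year

# Assumption: decimal petabytes (1 PB = 1,000 TB).
TB_PER_PB = 1_000


def lab_budget_room(grid_size_pb):
    """Annual budget 'room': benchmark cost minus fully loaded cost, times grid size."""
    room_per_tb = BENCHMARK_USD_PER_TB - FULLY_LOADED_USD_PER_TB
    return grid_size_pb * TB_PER_PB * room_per_tb


def fundable_staff(budget_usd):
    """Whole number of FTEs the budget can fund."""
    return budget_usd // COST_PER_FTE_USD


if __name__ == "__main__":
    for size_pb in (2, 4):
        budget = lab_budget_room(size_pb)
        print(f"{size_pb} PB grid: US${budget:,} per year, "
              f"about {fundable_staff(budget)} FTE")
    # 2 PB grid: US$900,000 per year, about 9 FTE
    # 4 PB grid: US$1,800,000 per year, about 18 FTE
```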
Financing Shared Hadoop Grids
Establish a usage-driven charge-out model for consumers of the service
◦ Charging based on a blend of CPU and storage consumption will balance compute and data uses
◦ Consider charging consumers by service quality if your service agreements permit
    ◦ Service quality can be designed into your multi-tenancy solution
Create a central capital account managed by the Hadoop Lab
◦ Pre-authorize incremental expansion of the data lake to stay within service objectives
◦ Amortization of capital account will smooth out charges to avoid penalizing early adopters
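One way to charge on a blend of CPU and storage consumption is to split the period's grid cost by a weighted average of each tenant's share of CPU hours and share of storage. The sketch below assumes this approach; the 50/50 weighting, the tenant names and the usage figures are hypothetical, and a real model would take its inputs from the grid's own usage reporting.

```python
def blended_chargeout(total_cost, usage, cpu_weight=0.5):
    """Split total_cost across tenants by a weighted blend of CPU and storage share.

    usage maps tenant -> {"cpu_hours": float, "storage_tb": float}.
    cpu_weight is the fraction of cost driven by CPU; the rest follows storage.
    """
    total_cpu = sum(u["cpu_hours"] for u in usage.values())
    total_tb = sum(u["storage_tb"] for u in usage.values())
    charges = {}
    for tenant, u in usage.items():
        cpu_share = u["cpu_hours"] / total_cpu if total_cpu else 0.0
        storage_share = u["storage_tb"] / total_tb if total_tb else 0.0
        blended = cpu_weight * cpu_share + (1 - cpu_weight) * storage_share
        charges[tenant] = round(total_cost * blended, 2)
    return charges


if __name__ == "__main__":
    # Hypothetical month: a compute-heavy tenant and a storage-heavy tenant.
    monthly_usage = {
        "quant_research": {"cpu_hours": 800, "storage_tb": 100},
        "risk_archive": {"cpu_hours": 200, "storage_tb": 900},
    }
    print(blended_chargeout(100_000, monthly_usage))
```

With equal weights, the compute-heavy tenant in this example pays 45% of the cost and the storage-heavy tenant 55%, so neither kind of use is subsidized by the other. Shifting `cpu_weight` is a simple lever for tuning that balance.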
Creative Project Financing
Management loves to approve “self-funding projects”
◦ Use the cost differential of storage on Hadoop to fund intra-year work
    ◦ Migrate historical content from operating databases to Hadoop to save on database “tier one” SAN costs
    ◦ Capture grid compute outputs to Hadoop instead of NAS devices
    ◦ Storing database back-ups on Hadoop can be cheaper than tapes
Establish an internal “venture capital” fund in your Hadoop Lab
◦ Budget “seed money” to spend with the application maintenance teams
    ◦ Most applications have “lights on” funding insufficient to support the POCs needed to explore Hadoop adoption
    ◦ Set aside funding to pay for cross-team charges for participation in a POC
    ◦ Use the POCs to support project proposals based on cost reduction
◦ Staffing the Hadoop Lab with a small team of versatile developers completes this capability
Real-Life Lessons Learned
“NOTHING IS LESS PRODUCTIVE THAN TO MAKE MORE EFFICIENT WHAT SHOULD NOT BE DONE AT ALL” - PETER DRUCKER
Save Money by Letting it Break
It’s OK if a node breaks – in fact, it is better to have a dead Hadoop node than a wounded one
Educate your infrastructure team to prevent them from over-engineering your Hadoop grids
◦ HDFS implements a RAID strategy in software – use local disks instead of SAN for data nodes
◦ YARN is clever about parallelizing work – don’t use high-speed drives when cheap ones will do
◦ Don’t pay for “critical care” hardware support when next-day will be fine
Appliances and virtualization break the economics of Hadoop
◦ Equipment failure in an appliance is all-or-nothing
    ◦ Centralizing the Hadoop grid into one appliance increases the need for expensive fault tolerance
    ◦ Unit prices increase as a result – annual costs on appliances barely stay under the $1K/TB benchmark
◦ Your virtualization farm duplicates all of the fault-tolerance in Hadoop – and slows Hadoop down
    ◦ Vendor benchmarks show that virtualization is now almost as performant as bare-metal Hadoop grids
    ◦ Virtual servers are smaller and so you end up with more node-count-driven Hadoop costs
Networks Really Matter
The quality of the network is more important than the quality of the machines
◦ MapReduce “brings compute to the data,” but Hadoop still generates lots of internal network traffic
◦ Data hub and ETL offload patterns will generate a lot of traffic into and out of the grid
◦ Legacy tools – most notably SAS – will try to pull large data sets out of Hadoop across the network
Invest in top-of-rack switching or converged infrastructure
◦ Most data centres have 1Gb backbones connecting higher speed sub-networks
◦ Bonded 40Gb uplinks within the Hadoop grid and across racks are well worth the added cost
Spend the money and time to co-locate the consuming systems within the Hadoop sub-network
◦ This will mean a “re-racking” exercise for some appliances and existing servers
Differing Appetites for Change
Everyone’s first idea is to have one great, shared, co-operative data lake – and it doesn’t work!
◦ The more successful you are in on-boarding data producers, the greater the difficulty of updating the Data Lake’s Hadoop distribution – the incentive to “stand pat” grows
    ◦ Even worse if you’re using third-party tools for ingestion – it creates an external stakeholder which can block change!
◦ The more successful you are in on-boarding data consumers, the greater the demand to update the Data Lake’s Hadoop distribution – data scientists always want the most current (or next) version of everything
Separate the interactive users from the applications with a federated deployment model
◦ Put all of the applications onto a Hadoop grid which is updated very infrequently
    ◦ Static workloads also allow tight management of performance against service agreements
◦ Put all of the data scientists onto their own grid that updates with the Hadoop distribution
    ◦ Self-serve data provisioning to small grids in a cloud also works really well from the consumer’s view
◦ Make sure you have a great network so that moving data between the grids is painless
Hadoop is Not a Database
Projects that attempt to replace a database server with Hadoop usually fail
◦ Avoid transactional applications
◦ Do not replace the database tier in an N-tier application with Hadoop
    ◦ Think of Hadoop as a container instead, and re-architect the application to run inside Hadoop
◦ Do not use Hadoop to host highly normalized data warehouse models
    ◦ De-normalized data models are much more efficient on Hadoop
◦ Do not create abstraction layers using layered Hive views
The best design patterns for Hadoop are often misused
◦ “ETL Off-Load” often turns into Hadoop as an FTP drop zone
◦ “Bring Compute to Data” doesn’t mean using a data node to host an application server
◦ Map/Reduce should be run with MapReduce – not using Hive to call UDFs
Internal Data is More Difficult to Access
Think of your 360° view of a customer as being 180° of transactions and 180° of interactions
Data governance, compliance, and security will inhibit the use of the transactional data
◦ Internal data sources are also usually high-cost data sources to access
Interaction data – particularly web and social media – is surprisingly easy to access
◦ Social media data is actually considered “public,” and so is entirely ungoverned
    ◦ There is a wealth of open source social media ingestion and analysis tools available
◦ IVR systems are linked to customers and capture a significant amount of customer interaction
    ◦ Major IVR systems discard their operating data after 3-4 months rather than warehousing it
◦ Call Centre recordings are a wealth of internal sentiment data
    ◦ Open source speech-to-text and natural language processing tools are available in Python
◦ Website clicks and usage can be analyzed for price optimization and used for push marketing
    ◦ Most website usage is analyzed through vendors – but setting up an inbound feed is easy
Data Science is Unstructured Work
Data scientists don’t work the way IT expects them to
◦ Traditional data warehousing patterns are the data science anti-pattern
    ◦ Data scientists don’t know what their requirements are until they’ve done their work – their job is to experiment
    ◦ Data scientists hate prepared views because they don’t know what logic creates them
◦ Don’t waste (too much) time on central data quality – they’re just going to re-do it anyway
    ◦ “Correct” data is subjective by study, so there isn’t an answer to implement centrally
    ◦ Preparing a time series includes data quality suitable to data science – regardless of how good the starting data is
    ◦ Data scientists probably know the data better than the data modelers
Data Science Labs
Data scientists want to develop analytics using production data – which breaks lots of policies
Support the creation of a Data Science Lab environment
◦ Lead a “once and forever” platform security review that all Hadoop users can reference
◦ Implement data governance that facilitates “window shopping” for content – even when governance will initially prohibit using the content
Invest in advanced data masking
◦ Invest in advanced data masking to prepare production data for the data science lab
◦ Advanced data masking retains the statistical properties of the underlying data
Buy a self-serve data provisioning tool
◦ Data scientists love to “shop” for data and love to “engineer” data using query-by-example tools
◦ The good tools turn the “shopping trip” into deployable code that you can package for deployment or automation easily
Projects that Succeed
“RISK COMES FROM NOT KNOWING WHAT YOU’RE DOING” - WARREN BUFFETT
Quick Wins
Finding a quick win or two will keep your organization motivated to adopt Hadoop
Massively parallel back-testing of StreamBase algorithms
◦ StreamBase is a real-time workflow platform widely used in program trading
◦ MapReduce can encapsulate StreamBase in order to run hundreds of copies in parallel
Targeting ads on social media
◦ Both Twitter and Facebook have very good APIs that you can quickly use to build a feed
◦ Python-based tools can be paired with some basic data science to find “life events”
Trend Analysis on Risk Data
◦ Simulation outputs from CVA, VAR, CCR, LRM are often discarded after one day due to their size
◦ Archiving on HDFS permits trend analysis at the trade level for diagnostics and capital planning
Mid-Sized Projects
Many current focus areas in finance lend themselves to achievable Hadoop projects
Volcker Rule
◦ Volcker Rule metrics require an enormous amount of data, which is expensive to store
◦ Retention is required for five years of calendar days
◦ Computations can be implemented in SQL and will run well in Hive
Customer 360
◦ Hadoop is a natural platform to consolidate interaction records with transactional data
Daily Liquidity Management
◦ Running the calculations before pooling facilitates drill-down and analysis
◦ Tableau on Hadoop works very well for daily dashboards
Thank You for Your Time