
Big Data Essentials

Copyright © 2016 by Anil K. Maheshwari, Ph.D.

By purchasing this book, you agree not to copy or distribute the book by any means, mechanical or electronic.

No part of this book may be copied or transmitted without written permission.

Other books by the same author:

Data Analytics Made Accessible, the #1 Bestseller in Data Mining

Moksha: Liberation Through Transcendence


Preface

Big Data is a new, and inclusive, natural phenomenon. It is as messy as nature itself. It requires a new kind of Consciousness to fathom its scale and scope, and its many opportunities and challenges. Understanding the essentials of Big Data requires suspending many conventional expectations and assumptions about data… such as completeness, clarity, consistency, and conciseness. Fathoming and taming the multi-layered Big Data is a dream that is slowly becoming a reality. It is a rapidly evolving field that is growing exponentially in value and capabilities.

There is a growing number of books being written on Big Data. They fall mostly into two categories. The first kind focuses on business aspects, and discusses the strategic internal shifts required for reaping the business benefits from the many opportunities offered by Big Data. The second kind focuses on particular technology platforms, such as Hadoop or Spark. This book aims to bring together the business context and the technologies in a seamless way.

This book was written to meet the needs of an introductory Big Data course. It is meant for students, as well as executives, who wish to take advantage of emerging opportunities in Big Data. It provides an intuition of the wholeness of the field in simple language, free from jargon and code. All the essential Big Data technology tools and platforms such as Hadoop, MapReduce, Spark, and NoSQL are discussed. Most of the relevant programming details have been moved to Appendices to ensure readability. The short chapters make it easy to quickly understand the key concepts. A complete case study of developing a Big Data application is included.

Thanks to Maharishi Mahesh Yogi for creating a wonderful university whose consciousness-based environment made writing this evolutionary book possible. Thanks to many current and former students for contributing to this book. Dheeraj Pandey assisted with the Web log analyzer application and its details. Suraj Thapalia assisted with the Hadoop installation guide. Enkhbileg Tseeleesuren helped write the Spark tutorial. Thanks to my family for supporting me in this process. My daughters Ankita and Nupur reviewed the book and made helpful comments. My father Mr. R L Maheshwari and brother Dr. Sunil Maheshwari also read the book and enthusiastically approved it. My colleague Dr. Edi Shivaji too reviewed the book.

May the Big Data Force be with you!

Dr. Anil Maheshwari


August 2016, Fairfield, IA


Contents

Preface

Chapter 1 – Wholeness of Big Data
Introduction
Understanding Big Data
CASELET: IBM Watson: A Big Data system
Capturing Big Data
Volume of Data
Velocity of Data
Variety of Data
Veracity of Data
Benefitting from Big Data
Management of Big Data
Organizing Big Data
Analyzing Big Data
Technology Challenges for Big Data
Storing Huge Volumes
Ingesting streams at an extremely fast pace
Handling a variety of forms and functions of data
Processing data at huge speeds
Conclusion and Summary
Organization of the rest of the book
Review Questions
Liberty Stores Case Exercise: Step B1

Section 1

Chapter 2 – Big Data Applications
Introduction
CASELET: Big Data Gets the Flu
Big Data Sources
People to People Communications
Social Media
People to Machine Communications
Web access
Machine to Machine (M2M) Communications
RFID tags
Sensors
Big Data Applications
Monitoring and Tracking Applications
Analysis and Insight Applications
New Product Development
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B2

Chapter 3 – Big Data Architecture
Introduction
CASELET: Google Query Architecture
Standard Big Data architecture
Big Data Architecture examples
IBM Watson
Netflix
eBay
VMWare
The Weather Company
TicketMaster
LinkedIn
PayPal
CERN
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B3

Section 2

Chapter 4: Distributed Computing using Hadoop
Introduction
Hadoop Framework
HDFS Design Goals
Master-Slave Architecture
Block system
Ensuring Data Integrity
Installing HDFS
Reading and Writing Local Files into HDFS
Reading and Writing Data Streams into HDFS
Sequence Files
YARN
Conclusion
Review Questions

Chapter 5 – Parallel Processing with MapReduce
Introduction
MapReduce Overview
MapReduce programming
MapReduce Data Types and Formats
Writing MapReduce Programs
Testing MapReduce Programs
MapReduce Jobs Execution
How MapReduce Works
Managing Failures
Shuffle and Sort
Progress and Status Updates
Hadoop Streaming
Conclusion
Review Questions

Chapter 6 – NoSQL databases
Introduction
RDBMS vs NoSQL
Types of NoSQL Databases
Architecture of NoSQL
CAP theorem
Popular NoSQL Databases
HBase
Architecture Overview
Reading and Writing Data
Cassandra
Architecture Overview
Reading and Writing Data
Hive Language
Hive Language Capabilities
Pig Language
Conclusion
Review Questions

Chapter 7 – Stream Processing with Spark
Introduction
Spark Architecture
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
Spark Ecosystem
Spark for big data processing
MLlib
Spark GraphX
SparkR
Spark SQL
Spark Streaming
Spark applications
Spark vs Hadoop
Conclusion
Review Questions

Chapter 8 – Ingesting Data
Wholeness
Messaging Systems
Point to Point Messaging System
Publish-Subscribe Messaging System
Apache Kafka
Use Cases
Kafka Architecture
Producers
Consumers
Broker
Topic
Summary of Key Attributes
Distribution
Guarantees
Client Libraries
Apache ZooKeeper
Kafka Producer example in Java
Conclusion
Review Questions
References

Chapter 9 – Cloud Computing Primer
Introduction
Cloud Computing Characteristics
In-house storage
Cloud storage
Cloud Computing: Evolution of Virtualized Architecture
Cloud Service Models
Cloud Computing Myths
Cloud Computing: Getting Started
Conclusion
Review Questions

Section 3

Chapter 10 – Web Log Analyzer application case study
Introduction
Client-Server Architecture
Web Log analyzer
Requirements
Solution Architecture
Benefits of this solution
Technology stack
Apache Spark
Spark Deployment
Components of Spark
HDFS
MongoDB
Apache Flume
Overall Application logic
Technical Plan for the Application
Scala Spark code for log analysis
Sample Log data
Sample Input Data
Sample Output of Web Log Analysis
Conclusion and Findings
Review Questions

Chapter 11 – Data Mining Primer
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques
Mining Big Data
From Causation to Correlation
From Sampling to the Whole
From Dataset to Data stream
Data Mining Best Practices
Conclusion
Review Questions

Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating Cluster Server on AWS, Installing Hadoop from Cloudera
Step 1: Creating Amazon EC2 Servers
Step 2: Connecting to the server and installing the required Cloudera distribution of Hadoop
Step 3: Word Count using MapReduce

Appendix 2: Spark Installation and Tutorial
Step 1: Verifying Java Installation
Step 2: Verifying Scala installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation
Step 8: Application: Word Count in Scala

Additional Resources

About the Author


Chapter 1 – Wholeness of Big Data

Introduction

Big Data is an all-inclusive term that refers to extremely large, very fast, diverse, and complex data that cannot be managed with traditional data management tools. Ideally, Big Data would harness all kinds of data, and deliver the right information, to the right person, in the right quantity, at the right time, to help make the right decision. Big Data can be managed by developing infinitely scalable, totally flexible, and evolutionary data architectures, coupled with the use of extremely cost-effective computing components. The infinite potential knowledge embedded within this cosmic computer would help connect everything to the Unified Field of all the laws of nature.

This book will provide a complete overview of Big Data for the executive and the data specialist. This chapter will cover the key challenges and benefits of Big Data, and the essential tools and technologies now available for organizing and manipulating Big Data.


Understanding Big Data

Big Data can be examined on two levels. On a fundamental level, it is data that can be analyzed and utilized for the benefit of the business. On another level, it is a special kind of data that poses unique challenges. This is the level that this book will focus on.

Figure 1.1: Big Data Context

At the level of business, data generated by business operations can be analyzed to generate insights that can help the business make better decisions. This makes the business grow bigger, and generate even more data, and the cycle continues. This is represented by the blue cycle on the top-right of Figure 1.1. This aspect is discussed in Chapter 11, a primer on Data Analytics.

On another level, Big Data is different from traditional data in every way: space, time, and function. The quantity of Big Data is 1,000 times more than that of traditional data. The speed of data generation and transmission is 1,000 times faster. The forms and functions of Big Data are much more diverse: from numbers to text, pictures, audio, videos, activity logs, machine data, and more. There are also many more sources of data, from individuals to organizations to governments, using a range of devices from mobile phones to computers to industrial machines. Not all data will be of equal quality and value. This is represented by the red cycle on the bottom left of Figure 1.1. This aspect of Big Data, and its new technologies, is the main focus of this book.

Big Data is mostly unstructured data. Every type of data is structured differently, and will have to be dealt with differently. There are huge opportunities for technology providers to innovate and manage the entire life cycle of Big Data… to generate, gather, store, organize, analyze, and visualize this data.


CASELET: IBM Watson: A Big Data system

IBM created the Watson system as a way of pushing the boundaries of Artificial Intelligence and natural language understanding technologies. Watson beat the world champion human players of Jeopardy (a quiz-style TV show) in February 2011. Watson reads up on data about everything on the web, including the entire Wikipedia. It digests and absorbs the data based on simple generic rules such as: books have authors; stories have heroes; and drugs treat ailments. A Jeopardy clue, received in the form of a cryptic phrase, is broken down into many possible potential sub-clues of the correct answer. Each sub-clue is examined to see the likelihood of its answer being the correct answer for the main problem. Watson calculates the confidence level of each possible answer. If the confidence level reaches more than a threshold level, it decides to offer the answer to the clue. It manages to do all this in a mere 3 seconds.

Watson is now being applied to diagnosing diseases, especially cancer. Watson can read all the new research published in the medical journals to update its knowledge base. It is being used to diagnose the probability of various diseases, by applying factors such as the patient's current symptoms, health history, genetic history, medication records, and other factors to recommend a particular diagnosis. (Source: Smartest machines on Earth: youtube.com/watch?v=TCOhyaw5bwg)

Figure 1.2: IBM Watson playing Jeopardy

Q1: What kinds of Big Data knowledge, technologies and skills are required to build a system like Watson? What kind of resources are needed?

Q2: Will doctors be able to compete with Watson in diagnosing diseases and prescribing medications? Who else could benefit from a system like Watson?


Capturing Big Data

If data were simply growing too large, OR only moving too fast, OR only becoming too diverse, it would be relatively easy. However, when the four Vs (Volume, Velocity, Variety, and Veracity) arrive together in an interactive manner, it creates a perfect storm. While the Volume and Velocity of data drive the major technological concerns and the costs of managing Big Data, these two Vs are themselves being driven by the third V, the Variety of forms and functions and sources of data.

Volume of Data

The quantity of data has been relentlessly doubling every 12-18 months. Traditional data is measured in Gigabytes (GB) and Terabytes (TB), but Big Data is measured in Petabytes (PB) and Exabytes (1 Exabyte = 1 million TB).

This data is so huge that it is almost a miracle that one can find any specific thing in it, in a reasonable period of time. Searching the world-wide web was the first true Big Data application. Google perfected the art of this application, and developed many of the path-breaking technologies we see today to manage Big Data.

The primary reason for the growth of data is the dramatic reduction in the cost of storing data. The costs of storing data have decreased by 30-40% every year. Therefore, there is an incentive to record everything that can be observed. It is called the 'datafication' of the world. The costs of computation and communication have also been coming down, similarly. Another reason for the growth of data is the increase in the number of forms and functions of data. More about this in the Variety section.

Velocity of Data

If traditional data is like a lake, Big Data is like a fast-flowing river. Big Data is being generated by billions of devices, and communicated at the speed of the internet. Ingesting all this data is like drinking from a fire hose. One does not have control over how fast the data will come. A huge unpredictable data stream is the new metaphor for thinking about Big Data.

The primary reason for the increased velocity of data is the increase in internet speed. Internet speeds available to homes and offices are now increasing from 10 MB/sec to 1 GB/sec (100 times faster). More people are getting access to high-speed internet around the world. Another important reason is the increased variety of sources that can generate and communicate data from anywhere, at any time. More on that in the Variety section.

Variety of Data


Big Data is inclusive of all forms of data, for all kinds of functions, from all sources and devices. If traditional data, such as invoices and ledgers, were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety. There are three major kinds of variety.

1. The first aspect of variety is the form of data. Data types range in order of simplicity and size from numbers to text, graph, map, audio, video, and others. There could be a composite of data that includes many elements in a single file. For example, text documents have text, graphs, and pictures embedded in them. Videos can have charts and songs embedded in them. Audio and video have different and more complex storage formats than numbers and text. Numbers and text can be more easily analyzed than an audio or video file. How should composite entities be stored and analyzed?

2. The second aspect is the variety of function of data. There are human chats and conversation data, songs and movies for entertainment, business transaction records, machine operations performance data, new product design data, old data for backup, etc. Human communication data would be processed very differently from operational performance data, with totally different objectives. A variety of applications are needed to compare pictures in order to recognize people's faces; compare voices to identify the speaker; and compare handwriting to identify the writer.

3. The third aspect of variety is the source of data. Mobile phones and tablet devices enable a wide series of applications or apps to access data and generate data from anywhere, at any time. Web access logs are another new and huge source of diagnostic data. ERP systems generate massive amounts of structured business transactional information. Sensors on machines, and RFID tags on assets, generate incessant and repetitive data. Broadly speaking, there are three broad types of sources of data: human-human communications; human-machine communications; and machine-to-machine communications. The sources of data, and their respective applications arising from that data, will be discussed in the next chapter.


Figure 1.3: Sources of Big Data (Source: Hortonworks.com)

Veracity of Data

Veracity relates to the believability and quality of data. Big Data is messy. There is a lot of misinformation and disinformation. The reasons for poor quality of data can range from human and technical error, to malicious intent.

1. The source of information may not be authoritative. For example, all websites are not equally trustworthy. Any information from whitehouse.gov or from nytimes.com is more likely to be authentic and complete. Wikipedia is useful, but not all pages are equally reliable. The communicator may have an agenda or a point of view.

2. The data may not be received correctly because of human or technical failure. Sensors and machines for gathering and communicating data may malfunction and may record and transmit incorrect data. Urgency may require the transmission of the best data available at a point in time. Such data makes reconciliation with later, accurate, records more problematic.

3. The data provided and received may, however, also be intentionally wrong, for competitive or security reasons.

Data needs to be sifted and organized by quality factors for it to be put to any great use.


Benefitting from Big Data

Data usually belongs to the organization that generates it. There is other data, such as social media data, that is freely accessible under an open general license. Organizations can use this data to learn about their consumers, improve their service delivery, and design new products to delight their customers and to gain a competitive advantage. Data is also like a new natural resource. It is being used to design new digital products, such as on-demand entertainment and learning.

Organizations may choose to gather and store this data for later analysis, or to sell it to other organizations who might benefit from it. They may also legitimately choose to discard parts of their data for privacy or legal reasons. However, organizations cannot afford to ignore Big Data. Organizations that do not learn to engage with Big Data could find themselves left far behind their competition, landing in the dustbin of history. Innovative small and new organizations can use Big Data to quickly scale up and beat larger and more mature organizations.

Big Data applications exist in all industries and aspects of life. There are three major types of Big Data applications: Monitoring and Tracking, Analysis and Insight, and new digital product development.

Monitoring and Tracking Applications: Consumer goods producers use monitoring and tracking applications to understand the sentiments and needs of their customers. Industrial organizations use Big Data to track inventory in massive interlinked global supply chains. Factory owners use it to monitor machine performance and do preventive maintenance. Utility companies use it to predict energy consumption, and manage demand and supply. Information Technology companies use it to track website performance and improve its usefulness. Financial organizations use it to project trends better and make more effective and profitable bets, etc.

Analysis and Insight: Political organizations use Big Data to micro-target voters and win elections. Police use Big Data to predict and prevent crime. Hospitals use it to better diagnose diseases and make medicine prescriptions. Ad agencies use it to design more targeted marketing campaigns quickly. Fashion designers use it to track trends and create more innovative products.


Figure 1.4: The first Big Data President

New Product Development: Incoming data could be used to design new products such as reality TV entertainment. Stock market feeds could be a digital product. This area needs much more development.


Management of Big Data

Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of Big Data.

1. Across all industries, the business case for Big Data is strongly focused on addressing customer-centric objectives. The first focus of deploying Big Data initiatives is to protect and enhance customer relationships and customer experience.

2. Solve a real pain-point. Big Data should be deployed for specific business objectives in order to keep management from being overwhelmed by the sheer size of it all.

3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one's control and where one has a superior understanding of the data.

4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.

5. Advanced analytical capabilities are required, but lacking, for organizations to get the most value from Big Data. There is a growing awareness of building or hiring those skills and capabilities.

6. Use more diverse data, not just more data. This would provide a broader perspective into reality and better quality insights.

7. The faster you analyze the data, the greater its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.

8. Don't throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later on in a multiplicative manner.

9. Maintain one copy of your data, not multiple. This would help avoid confusion and increase efficiency.

10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, and data-based applications continue to grow in capability and functionality.

11. A scalable and extensible information management foundation is a prerequisite for Big Data advancement. Big Data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.

12. Big Data is transforming business, just like IT did. Big Data is a new phase representing a digital world. Business and society are not immune to its strong impacts.


Organizing Big Data

Good organization depends upon the purpose of the organization.

Given huge quantities, it would be desirable to organize the data to speed up the search for any specific, desired item in the entire data set. The cost of storing and processing the data, too, would be a major driver for the choice of an organizing pattern.

Given the fast speed of data, it would be desirable to create a scalable number of ingest points. It will also be desirable to create at least a thin veneer of control over the data by maintaining counts and averages over time, unique values received, etc.
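
That thin veneer of control can be as simple as a running tally. The sketch below is a minimal illustration in plain Scala, assuming a hypothetical record shape of (device id, value); it keeps a count, a running average, and the set of unique sources seen as records arrive.

```scala
// Minimal running statistics over a stream of incoming readings.
// The record shape (deviceId, value) is a hypothetical example.
final case class StreamStats(count: Long = 0L,
                             sum: Double = 0.0,
                             uniqueDevices: Set[String] = Set.empty) {
  def update(deviceId: String, value: Double): StreamStats =
    StreamStats(count + 1, sum + value, uniqueDevices + deviceId)
  def average: Double = if (count == 0) 0.0 else sum / count
}

// Usage: fold the incoming records into the accumulator, one at a time.
val incoming = Seq(("sensor-1", 12.5), ("sensor-2", 10.0), ("sensor-1", 11.0))
val stats = incoming.foldLeft(StreamStats()) {
  case (acc, (id, v)) => acc.update(id, v)
}
println(s"count=${stats.count} avg=${stats.average} unique=${stats.uniqueDevices.size}")
```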

Given the variety in form factors, data needs to be stored and analyzed differently. Videos need to be stored separately and used for serving in a streaming mode. Text data may be combined, cleaned, and visualized for themes and sentiments.

Given different quality levels of data, various data sources may need to be ranked and prioritized before serving them to the audience. For example, the quality of a web page may be computed through a PageRank mechanism.


Analyzing Big Data

Big Data can be analyzed in two ways, called analyzing Big Data in motion and Big Data at rest. The first way is to process the incoming stream of data in real time for quick and effective statistics about the data. The other way is to store and structure the data and apply standard analytical techniques on batches of data for generating insights. This could then be visualized using real-time dashboards. Big Data can be utilized to visualize a flowing or a static situation. The nature of processing this huge, diverse, and largely unstructured data can be limited only by one's imagination.

Figure 1.5: Big Data Architecture

A million points of data can be plotted in a graph and offer a view of the density of data. However, plotting a million points on the graph may produce a blurred image which may hide, rather than highlight, the distinctions. In such a case, binning the data would help, or selecting the top few frequent categories may deliver greater insights. Streaming data can also be visualized by simple counts and averages over time. For example, below is a dynamically updated chart that shows up-to-date statistics of visitor traffic to my blog site, anilmah.com. The bar shows the number of page views, and the inner darker bar shows the number of unique visitors. The dashboard could show the view by days, weeks, or years also.


Figure 1.6: Real-time dashboard for website performance for the author's blog
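
As a concrete illustration of the binning mentioned above, the hedged Scala sketch below groups a large set of numeric readings into fixed-width bins and reports a count per bin, which can then be drawn as bars instead of a million individual points. The simulated values and the bin width are made-up examples.

```scala
// Bin a large collection of numeric values into fixed-width buckets.
// The simulated readings and binWidth are illustrative assumptions.
val values: Seq[Double] =
  Vector.fill(1000000)(scala.util.Random.nextGaussian() * 10 + 50)
val binWidth = 5.0

val histogram: Map[Double, Int] =
  values
    .groupBy(v => math.floor(v / binWidth) * binWidth) // lower edge of each bin
    .map { case (binStart, vs) => binStart -> vs.size }

// Print bins in order; each line corresponds to one bar of the chart.
histogram.toSeq.sortBy(_._1).foreach { case (binStart, count) =>
  println(f"[$binStart%6.1f, ${binStart + binWidth}%6.1f): $count")
}
```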

Text data could be combined, filtered, cleaned, thematically analyzed, and visualized in a word cloud. Here is a word cloud from a recent stream of tweets (i.e., Twitter messages) from US Presidential candidates Hillary Clinton and Donald Trump. A larger word implies a greater frequency of occurrence in the tweets. This can help understand the major topics of discussion between the two.

Figure 1.7: A word cloud of Hillary Clinton's and Donald Trump's tweets
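
A word cloud is, underneath, just a word-frequency table rendered graphically. The minimal Scala sketch below computes that first step; the input file of tweet text is hypothetical, and drawing the cloud itself would be done with a separate visualization tool.

```scala
import scala.io.Source

// Count word frequencies in a file of tweet text (the path is hypothetical).
val words = Source.fromFile("tweets.txt").getLines()
  .flatMap(_.toLowerCase.split("\\W+"))   // split on non-word characters
  .filter(_.length > 3)                   // drop very short, stop-word-like tokens
  .toSeq

val topWords = words.groupBy(identity)
  .map { case (w, ws) => w -> ws.size }
  .toSeq.sortBy(-_._2)
  .take(50)                               // the 50 largest words in the cloud

topWords.foreach { case (w, n) => println(s"$w\t$n") }
```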


Technology Challenges for Big Data

There are four major technological challenges, and matching layers of technologies, to manage Big Data.

Storing Huge Volumes

The first challenge relates to storing huge quantities of data. No machine can be big enough to store the relentlessly growing quantity of data. Therefore, data needs to be stored on a large number of smaller inexpensive machines. However, with a large number of machines, there is the inevitable challenge of machine failure. Each of these commodity machines will fail at some point or another. Failure of a machine could entail a loss of the data stored on it.

The first layer of Big Data technology helps store huge volumes of data, while avoiding the risk of data loss. It distributes data across a large cluster of inexpensive commodity machines, and ensures that every piece of data is stored on multiple machines to guarantee that at least one copy is always available. Hadoop is the most well-known clustering technology for Big Data. Its data storage pattern is called the Hadoop Distributed File System (HDFS). This system is built on the patterns of Google's File System, designed to store billions of pages and sort them to answer user search queries.
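
The core idea, splitting a file into blocks and writing each block to several machines, can be sketched in a few lines. This is only a conceptual illustration of the kind of replicated placement described above, not the HDFS API; the block size, replication factor, and node names are assumptions, and in real HDFS the NameNode and DataNodes perform this work.

```scala
// Conceptual sketch of block replication (not the HDFS API).
// Assumptions: 64 MB blocks, replication factor 3, six hypothetical nodes.
val blockSize   = 64L * 1024 * 1024
val replication = 3
val nodes       = Vector("node1", "node2", "node3", "node4", "node5", "node6")

def placeBlocks(fileSizeBytes: Long): Seq[(Int, Seq[String])] = {
  val numBlocks = ((fileSizeBytes + blockSize - 1) / blockSize).toInt
  (0 until numBlocks).map { b =>
    // Spread each block's replicas over `replication` distinct nodes.
    val replicas = (0 until replication).map(r => nodes((b + r) % nodes.size))
    b -> replicas
  }
}

// A 200 MB file becomes 4 blocks, each stored on 3 machines.
placeBlocks(200L * 1024 * 1024).foreach { case (block, where) =>
  println(s"block $block -> ${where.mkString(", ")}")
}
```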

Ingesting streams at an extremely fast pace

The second challenge relates to the Velocity of data, i.e. handling torrential streams of data. Some of them may be too large to store, but must still be ingested and monitored. The solution lies in creating special ingesting systems that can open an unlimited number of channels for receiving data. These queuing systems can hold the data, from which consumer applications can request and process data at their own pace.

Big Data technology manages this velocity problem using a special stream-processing engine, where all incoming data is fed into a central queueing system. From there, a fork-shaped system sends data to batch processing as well as to stream processing directions. The stream processing engine can do its work while the batch processing does its work. Apache Spark is the most popular system for streaming applications.
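
To make the stream-processing side of that fork concrete, here is a hedged Spark Streaming sketch in Scala that counts words arriving on a network socket in small micro-batches. The host, port, and batch interval are illustrative assumptions; the batch-processing side of the fork would work on the same data later, from storage. Spark itself is covered in Chapter 7.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal Spark Streaming (DStream) job: word counts over a socket feed.
// The host, port, and 10-second batch interval are illustrative assumptions.
val conf = new SparkConf().setAppName("StreamWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()          // emit the counts of each 10-second micro-batch
ssc.start()
ssc.awaitTermination()
```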

Handling a variety of forms and functions of data

The third challenge relates to the structuring and access of all varieties of data that comprise Big Data. Storing them in traditional flat or relational file structures would be too wasteful and slow. The third layer of Big Data technology solves this problem by storing the data in non-relational systems that relax many of the stringent conditions of the relational model. These are called NoSQL (Not Only SQL) databases.

HBase and Cassandra are two of the better known NoSQL database systems. HBase, for example, stores each data element separately along with its key identifying information. This is called a key-value pair format. Cassandra stores data in a document format. There are many other variants of NoSQL databases. NoSQL languages, such as Pig and Hive, are used to access this data.

Processing data at huge speeds

The fourth challenge relates to moving large amounts of data from storage to the processor, as this would consume enormous network capacity and choke the network. The alternative and innovative mode would be to move the processor to the data.

The second layer of Big Data technology avoids the choking of the network. It distributes the task logic throughout the cluster of machines where the data is stored. Those machines work, in parallel, on the data assigned to them, respectively. A follow-up process consolidates the outputs of all the small tasks and delivers the final results. MapReduce, also invented by Google, is the best-known technology for parallel processing of distributed Big Data.
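
The canonical example of this move-the-code-to-the-data pattern is a distributed word count. The Hadoop MapReduce API is covered in Chapter 5; the sketch below expresses the same map-then-reduce logic using Spark's Scala API, with hypothetical HDFS paths, and would normally be packaged and launched with spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Map phase: each node turns its local lines into (word, 1) pairs.
// Reduce phase: pairs with the same word are combined into totals.
val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

val counts = sc.textFile("hdfs:///data/pages/*.txt")   // hypothetical input path
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                   // consolidate partial results

counts.saveAsTextFile("hdfs:///output/wordcounts")      // hypothetical output path
```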

Table 1.1: Technological challenges and solutions for Big Data

Challenge | Description | Solution | Technology
Volume | Avoid risk of data loss from machine failure in clusters of commodity machines | Replicate segments of data on multiple machines; a master node keeps track of segment locations | HDFS
Volume & Velocity | Avoid choking of network bandwidth by moving large volumes of data | Move processing logic to where the data is stored; manage using parallel processing algorithms | MapReduce
Variety | Efficient storage of large and small data objects | Columnar databases using a key-value pair format | HBase, Cassandra
Velocity | Monitoring streams too large to store | Fork-shaped architecture to process data as stream and as batch | Spark


Once these major technological challenges are met, all traditional analytical and presentation tools can be applied to Big Data. There are many additional supportive technologies to make the task of managing Big Data easier. For example, a resource manager (such as YARN) can help monitor the resource usage and load balancing of the machines in the cluster.


Conclusion and Summary

Big Data is a major phenomenon that impacts everyone, and is an opportunity to create new ways of working. Big Data is extremely large, complex, fast, and not always clean; it is data that comes from many sources such as people, web, and machine communications. It needs to be gathered, organized, and processed in a cost-effective way that manages the volume, velocity, variety, and veracity of Big Data. Hadoop and Spark systems are popular technological platforms for this purpose. Here is a list of the many differences between traditional and Big Data.

Table 1.2: Comparing Big Data with Traditional Data

Feature | Traditional Data | Big Data
Representative structure | Lake / pool | Flowing stream / river
Primary purpose | Manage business activities | Communicate, monitor
Source of data | Business transactions, documents | Social media, web access logs, machine-generated
Volume of data | Gigabytes, Terabytes | Petabytes, Exabytes
Velocity of data | Ingest level is controlled | Real-time, unpredictable ingest
Variety of data | Alphanumeric | Audio, video, graphs, text
Veracity of data | Clean, more trustworthy | Varies depending on source
Structure of data | Well-structured | Semi- or unstructured
Physical storage of data | In a Storage Area Network | Distributed clusters of commodity computers
Database organization | Relational databases | NoSQL databases
Data access | SQL | NoSQL such as Pig
Data manipulation | Conventional data processing | Parallel processing
Data visualization | Variety of tools | Dynamic dashboards with simple measures
Database tools | Commercial systems | Open source: Hadoop, Spark
Total cost of system | Medium to high | High


Organization of the rest of the book

This book will cover applications, architectures, and the essential Big Data technologies. The rest of the book is organized as follows.

Section 1 will discuss sources, applications, and architectural topics. Chapter 2 will discuss a few compelling business applications of Big Data, based on an understanding of the different sources and formats of data. Chapter 3 will cover some examples of architectures used by many Big Data applications.

Section 2 will discuss the six major technology elements identified in the Big Data Ecosystem (Figure 1.5). Chapter 4 will discuss Hadoop and how its Distributed File System (HDFS) works. Chapter 5 will discuss MapReduce and how this parallel processing algorithm works. Chapter 6 will discuss NoSQL databases, to learn how to structure the data into databases for fast access. Pig and Hive languages, for data access, will be included. Chapter 7 will cover streaming data, and the systems for ingesting and processing this data. This chapter will cover Spark, an integrated, in-memory processing toolset to manage Big Data. Chapter 8 will cover data ingest systems, with Apache Kafka. Chapter 9 will be a primer on Cloud Computing technologies used for renting storage and computers at third-party locations.

Section 3 will include primers and tutorials. Chapter 10 will present a case study on the web log analyzer, an application that ingests a log of a large number of web request entries every day and can create summary and exception reports. Chapter 11 will be a primer on data analytics technologies for analyzing data. A full treatment can be found in my book, Data Analytics Made Accessible. Appendix 1 will be a tutorial on installing a Hadoop cluster on the Amazon EC2 cloud. Appendix 2 will be a tutorial on installing and using Spark.


Review Questions

Q1. What is Big Data? Why should anyone care?

Q2. Describe the 4V model of Big Data.

Q3. What are the major technological challenges in managing Big Data?

Q4. What are the technologies available to manage Big Data?

Q5. What kind of analyses can be done on Big Data?

Q6. Watch the Cloudera CEO present the evolution of Hadoop at https://www.youtube.com/watch?v=S9xnYBVqLws. Why did people not pay attention to Hadoop and MapReduce when they were introduced? What implications does this have for emerging technologies?


Liberty Stores Case Exercise: Step B1

Liberty Stores Inc. is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of the Healthy and Sustainable) citizens worldwide. The company is 20 years old, and is growing rapidly. It now operates in 5 continents, 50 countries, and 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and has a profit of about 5% of its revenue. The company pays special attention to the conditions under which the products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to local charitable causes globally.

Q1: Create a comprehensive Big Data strategy for the CEO of the company.

Q2: How can Big Data systems such as IBM Watson help this company?


Section 1

This section covers three important high-level topics.

Chapter 2 will cover big data sources, and many applications in many industries.

Chapter 3 will cover architectures for managing big data.


Chapter 2 – Big Data Applications

Introduction

If a traditional software application is a lovely cat, then a Big Data application is a powerful tiger. An ideal Big Data application will take advantage of all the richness of data and produce relevant information to make the organization responsive and successful. Big Data applications can align the organization with the totality of natural laws, the source of all success.

Companies like the consumer goods giant Procter & Gamble have inserted Big Data into all aspects of their planning and operations. The industrial giant Volkswagen asks all its business units to identify some realistic initiative using Big Data to grow their unit's sales. The entertainment giant Netflix processes 400 billion user actions every day. These are some of the biggest users of Big Data.

Figure 2.1: A Big Data application is a powerful tiger (Source: Flickr.com)


CASELET: Big Data Gets the Flu

Google Flu Trends was an enormously successful influenza forecasting service, pioneered by Google. It employed Big Data, such as the stream of search terms used in its ubiquitous Internet search service. The program aimed to better predict flu outbreaks using data and information from the U.S. Centers for Disease Control and Prevention (CDC). What was most amazing was that this application was able to predict the onset of flu almost two weeks before the CDC saw it coming. From 2004 till about 2012 it was able to successfully predict the timing and geographical location of the arrival of the flu season around the world.

Figure 2.2: Google Flu Trends

However, it failed spectacularly to predict the 2013 flu outbreak. Data used to predict Ebola's spread in 2014-15 yielded wildly inaccurate results, and created a major panic. Newspapers across the globe spread this application's worst-case scenarios for the Ebola outbreak of 2014.

Google Flu Trends failed for two reasons: Big Data hubris, and algorithm dynamics. (a) The quantity of data does not mean that one can ignore foundational issues of measurement, construct validity, reliability, and dependencies among data; and (b) Google Flu Trends predictions were based on a commercial search algorithm that frequently changes, based on Google's business goals. This uncertainty skewed the data in ways even Google engineers did not understand, even skewing the accuracy of predictions. Perhaps the biggest lesson is that there is far less information in the data typically available in the early stages of an outbreak than is needed to parameterize the test models.

Q1: What lessons would you learn from the death of a prominent and highly successful Big Data application?

Q2: What other Big Data applications can be inspired by the success of this application?


Big Data Sources

Big Data is inclusive of all data about all activities everywhere. It can, thus, potentially transform our perspective on life and the universe. It brings new insights in real time and can make life happier and make the world more productive. Big Data can, however, also bring perils, in terms of violation of privacy, and social and economic disruption.

There are three major categories of data sources: human communications, human-machine communications, and machine-machine communications.


People to People Communications

People and corporations increasingly communicate over electronic networks. Distance and time have been annihilated. Everyone communicates through phone and email. News travels instantly. Influential networks have expanded. The content of communication has become richer and multimedia. High-resolution cameras in mobile phones enable people to take pictures and videos, and instantly share them with friends and family. All these communications are stored in the facilities of many intermediaries, such as telecom and internet service providers. Social media is a new, but particularly transformative, type of human-human communication.

Social Media

Social media platforms such as Facebook, Twitter, LinkedIn, YouTube, Flickr, Tumblr, Skype, Snapchat, and others have become an increasingly intimate part of modern life. These are among the hundreds of social media that people use, and they generate huge streams of text, pictures, videos, logs, and other multimedia data.

People share messages and pictures through social media such as Facebook and YouTube. They share photo albums through Flickr. They communicate in short asynchronous messages with each other on Twitter. They make friends on Facebook, and follow others on Twitter. They do video conferencing using Skype, and leaders deliver messages that sometimes go viral through social media. All these data streams are part of Big Data, and can be monitored and analyzed to understand many phenomena, such as patterns of communication, as well as the gist of the conversations. These media have been used for a wide variety of purposes with stunning effects.

Figure 2.3: Sampling of major social media


People to Machine Communications

Sensors and the web are two of the kinds of machines that people communicate with. Personal assistants such as Siri and Cortana are the latest in man-machine communications, as they try to understand human requests in natural language and fulfill them. Wearable devices such as Fitbit and smartwatches are smart devices that read, store, and analyze people's personal data such as blood pressure and weight, food and exercise data, and sleep patterns. The world-wide web is like a knowledge machine that people interact with to get answers to their queries.

Web access

The world-wide web has integrated itself into all parts of human and machine activity. The usage of the tens of billions of pages by billions of web users generates huge amounts of enormously valuable clickstream data. Every time a web page is requested, a log entry is generated at the provider end. The web page provider tracks the identity of the requesting device and user, and the time and spatial location of each request. On the requester side, there are certain small pieces of computer code and data called cookies which track the web pages received, the date/time of access, and some identifying information about the user. All the web access logs, and cookie records, can provide web usage records that can be analyzed for discovering opportunities for marketing purposes.

A web log analyzer is an application required to monitor streaming web access logs in real time to check on website health and to flag errors. A detailed case study of the practical development of this application is shown in Chapter 10.
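
Each log entry is just a structured line of text. The hedged Scala sketch below parses one entry in the common web log format into its fields; the sample line and regular expression are illustrative, and the full analyzer is developed in the Chapter 10 case study.

```scala
// Parse one web access log line (Common Log Format) into named fields.
// The sample line and the regular expression are illustrative assumptions.
val logPattern =
  """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)$""".r

val line =
  """203.0.113.7 - - [10/Aug/2016:13:55:36 -0500] "GET /index.html HTTP/1.1" 200 2326"""

line match {
  case logPattern(ip, time, method, url, status, bytes) =>
    println(s"ip=$ip time=$time method=$method url=$url status=$status bytes=$bytes")
  case _ =>
    println("malformed entry")   // flag errors, as a monitoring application would
}
```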


Machine to Machine (M2M) Communications

M2M communications is also sometimes called the Internet of Things (IoT). A trillion devices are connected to the internet, and they communicate with each other or some master machines. All this data can be accessed and harnessed by makers and owners of those machines.

Machines and equipment have many kinds of sensors to measure certain environmental parameters, which can be broadcast to communicate their status. RFID tags and sensors embedded in machines help generate the data. Containers on ships are tagged with RFID tags that convey their location to all those who can listen. Similarly, when pallets of goods are moved in warehouses or large retail stores, those pallets contain electromagnetic (RFID) tags that convey their location. Cars carry an RFID transponder to identify themselves to automated toll booths and pay the tolls. Robots in a factory, and internet-connected refrigerators in a house, continually broadcast a 'heartbeat' that they are functioning normally. Surveillance videos using commodity cameras are another major source of machine-generated data.

Automobiles contain sensors that record and communicate operational data. A modern car can generate many megabytes of data every day, and there are more than 1 billion motor vehicles on the road. Thus the automotive industry itself generates huge amounts of data. Self-driving cars would only add to the quantity of data generated.

RFID tags

An RFID tag is a radio transmitter with a little antenna that can respond to and communicate essential information to special readers through a Radio Frequency (RF) channel. A few years ago, major retailers such as Walmart decided to invest in RFID technology to take the retail industry to a new level. They forced their suppliers to invest in RFID tags on the supplied products. Today, almost all retailers and manufacturers have implemented RFID-tag-based solutions.

Figure 2.4: A small passive RFID tag

Here is how an RFID tag works. When a passive RFID tag comes into the vicinity of an RF reader and is 'tickled', the tag responds by broadcasting a fixed identifying code. An active RFID tag has its own battery and storage, and can store and communicate a lot more information. Every reading of a message from an RFID tag by an RF reader creates a log entry. Thus there is a steady stream of data from every reader as it records information about all the RFID tags in its area of influence. The records may be logged regularly, and thus there will be many more records than are necessary to track the location and movement of an item. All the duplicate and redundant records are removed, to produce clean, consolidated data about the location and status of items.
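
That consolidation step can be sketched simply. The plain Scala example below, with a hypothetical record shape, keeps only the latest read per tag, which is the kind of de-duplication described above.

```scala
// Consolidate a stream of repeated RFID reads: keep the latest read per tag.
// The record shape (tagId, readerId, timestamp) is a hypothetical example.
final case class RfidRead(tagId: String, readerId: String, timestamp: Long)

val reads = Seq(
  RfidRead("TAG-001", "dock-door-1", 1000L),
  RfidRead("TAG-001", "dock-door-1", 1005L),   // duplicate read moments later
  RfidRead("TAG-002", "aisle-3",     1002L),
  RfidRead("TAG-001", "aisle-3",     2000L)    // the item has moved
)

val latestPerTag: Map[String, RfidRead] =
  reads.groupBy(_.tagId).map { case (tag, rs) => tag -> rs.maxBy(_.timestamp) }

latestPerTag.values.foreach(r => println(s"${r.tagId} last seen at ${r.readerId}"))
```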

Sensors

A sensor is a small device that can observe and record physical or chemical parameters. Sensors are everywhere. A photo sensor in an elevator or train door can sense if someone is moving, and thus keep the door from closing. A CCTV camera can record a video for surveillance purposes. A GPS device can record its geographical location every moment.

Figure 2.5: An embedded sensor

Temperature sensors in a car can measure the temperature of the engine, the tires, and more. The thermostats in a building or a refrigerator also have temperature sensors. A pressure sensor can measure the pressure inside an industrial boiler.


Big Data Applications

Monitoring and Tracking Applications

Public Health Monitoring

The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data sharing standards. This would enable secondary use of health data, which would advance Big Data analytics and personalized holistic precision medicine. This would be a broad-based platform like the Google Flu Trends case.

Consumer Sentiment Monitoring

Social media has become more powerful than advertising. Many consumer goods companies have moved a bulk of their marketing budgets from traditional advertising media into social media. They have set up Big Data listening platforms, where social media data streams (including tweets and Facebook posts and blog posts) are filtered and analyzed for certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis is delivered to marketing professionals for appropriate action, especially when the product is new to the market.

Figure 2.6: Architecture for a Listening Platform (Source: Intelligenthq.com)

Asset tracking

The US Department of Defense is encouraging the industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are one of the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard, as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data.

Theft by visitors, shoppers, and even employees is a major source of loss of revenue for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store are equipped with RF readers. This helps secure the products, and reduce leakage (theft) from the store.

Supply chain monitoring

All containers on ships communicate their status and location using RFID tags. Thus, retailers and their suppliers can gain real-time visibility into the inventory throughout the global supply chain. Retailers can know exactly where the items are in the warehouse, and so can bring them into the store at the right time. This is particularly relevant for seasonal items that need to be sold on time, or else they will be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.

Electricity Consumption Tracking

Electric utilities can track the status of generating and transmission systems, and also measure and predict the consumption of electricity. Sophisticated sensors can help monitor voltage, current, frequency, temperature, and other vital operating characteristics of huge and expensive electric distribution infrastructure. Smart meters can measure the consumption of electricity at regular intervals of one hour or less. This data is analyzed to make real-time decisions to maximize power capacity utilization and total revenue generation.
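
As a small illustration of what such interval data looks like once collected, the hedged Scala sketch below totals hypothetical meter readings by meter and hour; a real utility would run this kind of aggregation continuously over millions of meters.

```scala
// Aggregate hypothetical smart-meter readings (meterId, epochSeconds, kWh) by hour.
final case class MeterReading(meterId: String, epochSeconds: Long, kWh: Double)

val readings = Seq(
  MeterReading("M-100", 1470000000L, 0.8),
  MeterReading("M-100", 1470001800L, 0.7),   // same hour, 30 minutes later
  MeterReading("M-200", 1470000600L, 1.2)
)

val hourlyUse: Map[(String, Long), Double] =
  readings
    .groupBy(r => (r.meterId, r.epochSeconds / 3600))   // bucket by meter and hour
    .map { case (key, rs) => key -> rs.map(_.kWh).sum }

hourlyUse.foreach { case ((meter, hour), total) =>
  println(s"$meter hour=$hour total=$total kWh")
}
```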

Preventive Machine Maintenance

All machines, including cars and computers, will fail sometime, because one or more of their components will fail. Any precious equipment could be equipped with sensors. The continuous stream of data from these sensors could be monitored and analyzed to forecast the status of key components, and thus monitor the overall machine's health. Preventive maintenance can be scheduled to reduce the cost of downtime.

Analysis and Insight Applications

Big Data can be structured and analyzed using data mining techniques to produce insights and patterns that can be used to make business better.

Predictive Policing


The Los Angeles Police Department (LAPD) invented the concept of Predictive Policing. The LAPD worked with UC Berkeley researchers to analyze its large database of 13 million crimes recorded over 80 years, and predicted the likelihood of crimes of certain types, at certain times, and in certain locations. They identified hotspots where crimes had occurred, and where crime was likely to happen in the future. Crime patterns were mathematically modeled after a simple insight borrowed from the metaphor of earthquakes and their aftershocks. In essence, it said that once a crime occurred in a location, it represented a certain disturbance in harmony, and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity in the near future. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur.

Figure 2.7: LAPD officer on predictive policing (Source: nbclosangeles.com)

By aligning the police cars' patrol schedules with the model's predictions, the LAPD was able to reduce crime by 12% to 26% for different categories of crime. Recently, the San Francisco Police Department released its own crime data for over 2 years, so data analysts could model that data and help prevent future crimes.

Winning Political Elections

The US President, Barack Obama, was the first major political candidate to use Big Data in a significant way, in the 2008 elections. He is the first Big Data president. His campaign gathered data about millions of people, including supporters. They invented the "Donate Now" button for use in emails to obtain campaign contributions from millions of supporters. They created personal profiles of millions of supporters and what they had done and could do for the campaign. Data was used to determine undecided voters who could be converted to their side. They provided phone numbers of these undecided voters to the supporters to call, and then recorded the outcomes of those calls all over the web, using interactive applications. Obama himself used his Twitter account to communicate his messages directly with his millions of followers.

After the elections, Obama converted the list of supporters into an advocacy machine that would provide the grassroots support for the President's initiatives. Since then, almost all campaigns use Big Data. Senator Bernie Sanders used the same Big Data playbook to build an effective national political machine powered entirely by small donors. Analyst Nate Silver created sophisticated predictive models, using inputs from many political polls and surveys, to successfully predict the winners of US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise, and that shows the limits of Big Data.

Personal Health

Correct diagnosis is the sine qua non of effective treatment. Medical knowledge and technology are growing by leaps and bounds. IBM Watson is a Big Data analytics engine that ingests and metabolizes all the medical information in the world, and then applies it intelligently to an individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms, patient history, medication history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in healthcare.

New Product Development

These applications are totally new concepts that did not exist earlier.

Flexible auto insurance

An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel patterns. The automobile companies can use the car sensor data to track the performance of a car. Safer drivers can be rewarded and the errant drivers can be penalized.


Figure 2.8: GPS-based tracking of vehicles

Location-based retail promotion

A retailer, or a third-party advertiser, can target customers with specific promotions and coupons based on location data obtained through GPS, the time of day, the presence of stores nearby, and mapping it to the consumer preference data available from social media databases. Ads and offers can be delivered through mobile apps, SMS, and email. These are examples of mobile apps.

Recommendation service

E-commerce has been a fast-growing industry over the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase history on e-commerce sites is utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine system to suggest new additional products to consumers based on affinities of various products. Netflix also uses a recommendation engine to suggest entertainment options to its users.


Conclusion

Big Data has applicability across all industries. There are three major types of data sources of Big Data. They are people-people communications, people-machine communications, and machine-machine communications. Each type has many sources of data. There are three types of applications. They are the monitoring type, the analysis type, and new product development. This chapter presents a few business applications of each of those three types.


Review Questions

Q1: What are the major sources of Big Data? Describe a source of each type.

Q2: What are the three major types of Big Data applications? Describe two applications of each type.

Q3: Would it be ethical to arrest someone based on a Big Data model's prediction that that person is likely to commit a crime?

Q4: An auto insurance company learned about the movements of a person based on the GPS installed in the vehicle. Would it be ethical to use that as a surveillance tool?

Q5: Research and describe a Big Data application that has a proven return on investment for an organization.


Liberty Stores Case Exercise: Step B2

The Board of Directors asked the company to take concrete and effective steps to become a data-driven company. The company wants to understand its customers better. It wants to improve the happiness levels of its customers and employees. It wants to innovate on new products that its customers would like. It wants to relate its charitable activities to the interests of its customers.

Q1: What kind of data sources should the company capture for this?

Q2: What kind of Big Data applications would you suggest for this company?


Chapter 3 - Big Data Architecture

Introduction

Big Data Application Architecture is the configuration of tools and modules to accomplish the whole task. An ideal architecture would be resilient, secure, cost-effective, and adaptive to new needs and environments. This is achieved by beginning with proven architectures, and creatively and progressively restructuring them with new elements as additional needs and problems arise. Big Data architectures ultimately align with the architecture of the Universe, the source of all invincibility.


CASELET: Google Query Architecture

Google invented the first Big Data architecture. Their goal was to gather all the information on the web, organize it, and search it for specific queries from millions of users. An additional goal was to find a way to monetize this service by serving relevant and prioritized online advertisements on behalf of clients.

Google developed web crawling agents which would follow all the links on the web and make a copy of all the content on all the web pages visited.

Google invented cost-effective, resilient, and fast ways to store and process all that exponentially growing data. It developed a scale-out architecture in which it could linearly increase its storage capacity by inserting additional computers into its computing network. The data files were distributed over the large number of machines in the cluster. This distributed file system was called the Google File System, and was the precursor to HDFS.

Google would sort or index the data thus gathered so it could be searched efficiently. They invented the key-value pair NoSQL database architecture to store a variety of data objects. They developed the storage system to avoid updates in the same place. Thus the data was written once, and read multiple times.

Figure 3-0-1: Google Query Architecture

Google developed the MapReduce parallel processing architecture whereby large datasets could be processed by thousands of computers in parallel, with each computer processing a chunk of data, to produce quick results for the overall job.


The Hadoop ecosystem of data management tools, such as the Hadoop Distributed File System (HDFS), the columnar database system HBase, the querying tool Hive, and more, emerged from Google's inventions. Storm is a streaming data technology that produces instant results. Lambda Architecture is a Y-shaped architecture that branches out the incoming data stream for batch as well as stream processing.

Q1: Why would Google publish its File System and the MapReduce parallel programming system and release them into the open-source community?

Q2: What else can be done with Google's repository of all the web's data?


Standard Big Data architecture

Here is the generic Big Data Architecture introduced in Chapter 1. There are many sources of data. All data is funneled in through an ingest system. The data is forked into two sides: a stream processing system and a batch processing system. The outcomes of this processing can be sent into NoSQL databases for later retrieval, or sent directly for consumption by many applications and devices.

Figure 3-0-2: Big Data Application Architecture

A Big Data solution typically comprises these logical layers. Each layer can be represented by one or more available technologies.

Big data sources: The sources of data for an application depend upon what data is required to perform the kind of analyses you need. The various sources of Big Data were described in Chapter 2. The data will vary in origin, size, speed, form, and function, as described by the 4 Vs in Chapter 1. Data sources can be internal or external to the organization. The scope of access to available data could be limited. The level of structure could be high or low. The speed of data and its quantity will also be high or low depending upon the data source.

Data ingest layer: This layer is responsible for acquiring data from the data sources. The data comes in through a scalable set of input points that can acquire it at various speeds and in various quantities. The data is sent to a batch processing system, a stream processing system, or directly to a storage file system (such as HDFS). Compliance regulations and governance policies impact what data can be stored and for how long.

Batch Processing layer: This analysis layer receives data from the ingest point, from the file system, or from the NoSQL databases. Data is processed using parallel programming techniques (such as MapReduce) to produce the desired results. This batch processing layer thus needs to understand the data sources and data types, the algorithms that would work on that data, and the format of the desired outcomes. The output of this layer could be sent for instant reporting, or stored in a NoSQL database for an on-demand report for the client.

Stream Processing layer: This layer receives data directly from the ingest point. Data is processed using parallel programming techniques (such as MapReduce) to process it in real time, and produce the desired results. This layer thus needs to understand the data sources and data types extremely well, along with the super-light algorithms that would work on that data to produce the desired results. The outcome of this layer too could be stored in the NoSQL databases.

Data Organizing layer: This layer receives data from both the batch and stream processing layers. Its objective is to organize the data for easy access. It is represented by NoSQL databases. SQL-like languages such as Hive and Pig can be used to easily access data and generate reports.

Data Consumption layer: This layer consumes the output provided by the analysis layers, directly or through the organizing layer. The outcome could be standard reports, data analytics, dashboards and other visualization applications, or recommendation engines, on mobile and other devices.

Infrastructure layer: At the bottom there is a layer that manages the raw resources of storage, compute, and communication. This is increasingly provided through a cloud computing paradigm.

Distributed File System layer: This layer includes the Hadoop Distributed File System (HDFS). It also includes supporting applications, such as YARN (Yet Another Resource Negotiator), that enable efficient access to data storage and its transfer.


Big Data Architecture examples

Every major organization and application has a unique, optimized infrastructure to suit its specific needs. Below are some architecture examples from some very prominent users and designers of Big Data applications.

IBM Watson

IBM Watson uses Spark to manage incoming data streams. It also uses Spark's Machine Learning library (MLlib) to analyze data and predict diseases.

Netflix

This is one of the largest providers of online video entertainment. They handle 400 billion online events per day. As a cutting-edge user of Big Data technologies, they are constantly innovating their mix of technologies to deliver the best performance. Kafka is the common messaging system for all incoming requests. They host the entire infrastructure on Amazon Web Services (AWS). The database is AWS's S3, as well as Cassandra and HBase, to store data. Spark is used for stream processing.


(Source: Netflix)

Ebay

Ebay is the second-largest Ecommerce company in the world. It delivers 800 million listings from 25 million sellers to 160 million buyers. To manage this huge stream of activity, Ebay uses a stack of Hadoop, Spark, Kafka, and other elements. They consider Kafka the best new technology for processing data streams.

VMware

Here is VMware's view of a Big Data architecture. It is similar to, but more detailed than, our main Big Data architecture diagram.


The Weather Company

The Weather Company serves weather data globally through websites and mobile apps. It uses a streaming architecture based on Apache Spark.

TicketMaster

This is the world's largest company selling event tickets. Their goal is to make tickets available for purchase by real fans, and to prevent bad actors from manipulating the system to increase the price of the tickets in the secondary markets.


LinkedIn

The goal of this professional networking company is to maintain an efficient system for processing the streaming data and to make the link options available in real time.

Paypal

This payments-facilitation company needs to understand and acquire customers, and process a large number of payment transactions.


CERN

This premier high-energy physics research lab computes petabytes of data, using in-memory stream processing to handle data from millions of sensors and devices.


Conclusion

Big Data applications are architected to do stream as well as batch processing. Data is ingested and fed into streaming and batch processing. Most tools used for Big Data processing are open-source tools served through the Apache community, and through some key distributors of those technologies.


Review Questions

Q1: Describe the Big Data processing architecture.

Q2: What are Google's contributions to Big Data processing?

Q3: What are some of the hottest technologies visible in Big Data processing?


Liberty Stores Case Exercise: Step B3

The company wants to build a scalable and futuristic platform for its Big Data.

Q1: What kind of Big Data processing architecture would you suggest for this company?


Section 2

This section covers the important Big Data technologies defined in the Big Data architecture specified in Chapter 3.

Chapter 4 will cover Hadoop and its Distributed File System (HDFS).

Chapter 5 will cover the parallel processing algorithm, MapReduce.

Chapter 6 will cover NoSQL databases such as HBase and Cassandra. It will also cover the Pig and Hive languages used for accessing those databases.

Chapter 7 will cover Spark, a fast and integrated streaming data management platform.

Chapter 8 will cover data ingest systems, using Apache Kafka.

Chapter 9 will cover the Cloud Computing model.


Chapter 4: Distributed Computing using Hadoop

Introduction

A distributed system is a clever way of storing huge quantities of data, securely and cost-effectively, for speed and ease of retrieval and processing, using a networked collection of commodity machines. The ideal distributed file system would store infinite amounts of data while making the complexity completely transparent to the user, and enable easy access to the right data instantly. This would be achieved by storing fragments of data at different locations, and internally managing the lower-level tasks of storing and replicating data across the network. The distributed system ultimately leads to the creation of the unbounded cosmic computer that is aligned with the Unified Field of all the laws of nature.


Hadoop Framework

The Apache Hadoop distributed computing framework is composed of the following modules:

1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

2. Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

3. YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications.

4. MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

This chapter will cover Hadoop Common, HDFS, and YARN. The next chapter will cover MapReduce.


HDFS Design Goals

The Hadoop Distributed File System (HDFS) is a distributed and scalable file system. It is designed for applications that deal with large data sizes. It is also designed to deal with mostly immutable files, i.e. write data once, but read it many times.

HDFS has the following major design goals:

1. Hardware failure management – it will happen, and one must plan for it.

2. Huge volume – create capacity for a large number of huge file sizes, with fast read/write throughput.

3. High speed – create a mechanism to provide low-latency access to streaming applications.

4. High variety – maintain simple data coherence, by writing data once but reading it many times.

5. Open source – maintain easy accessibility of data using any hardware, software, and database platform.

6. Network efficiency – minimize network bandwidth requirements, by minimizing data movement.


Master-Slave Architecture

Hadoop is an architecture for organizing computers in a master-slave relationship that helps achieve great scalability in processing. An HDFS cluster has two types of nodes operating in a master-worker pattern: a single master node (called the NameNode), and a large number of slave worker nodes (called DataNodes). A small Hadoop cluster includes a single master and multiple worker nodes. A large Hadoop cluster would consist of a master and thousands of small ordinary machines as worker nodes.

Figure 4-0-1: Master-Slave Architecture

The master node manages the overall file system, its namespace, and controls the access to files by clients. The master node is aware of the DataNodes: i.e. what blocks of which file are stored on which DataNode. It also controls the processing plan for all applications running on the data on the cluster. There is only one master node. Unfortunately, that makes it a single point of failure. Therefore, whenever possible, the master node has a hot backup just in case the master node dies unexpectedly. The master node uses a transaction log to persistently record every change that occurs to file system metadata.

The worker nodes store the data blocks in their storage space, as directed by the master node. Each worker node typically contains many disks to maximize storage capacity and access speed. Each worker node has its own local file system. A worker node has no awareness of the distributed file structure. It simply stores each block of data as directed, as if each block were a separate file. The DataNodes store and serve up blocks of data over the network using a block protocol, under the direction of the NameNode.


Figure 4-0-2: Hadoop Architecture (Source: hadoop.apache.org)

The NameNode stores all relevant information about all the DataNodes, and the files stored in those DataNodes. The NameNode will contain:

- For every DataNode, its name, rack, capacity, and health.

- For every file, its name, replicas, type, size, timestamp, location, health, etc.

If a DataNode fails, there is no serious problem. The data on the failed DataNode will be accessed from its replicas on other DataNodes. The failed DataNode can be automatically recreated on another machine, by writing all those file blocks off from the other healthy replicas. Each DataNode sends a heartbeat message to the NameNode periodically. Without this message, the DataNode is assumed to be dead. The DataNode replication effort would automatically kick in to replace the dead DataNode.

The file system has a set of features and capabilities to completely hide the splintering and scattering of data, and enable the user to deal with the data at a high, logical level.

The NameNode tries to ensure that files are evenly spread across the DataNodes in the cluster. That balances the storage and computing load, and also limits the extent of loss from the failure of a node. The NameNode also tries to optimize the networking load. When retrieving data or ordering the processing, the NameNode tries to pick fragments from multiple nodes to balance the processing load and speed up the total processing effort. The NameNode also tries to store fragments of a file on the same node for speed of reading and writing. Processing is done on the node where the file fragment is stored.

Any piece of data is typically stored on three nodes: two on the same rack, and one on a different rack. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.


Block system

HDFS stores large files (typically gigabytes to terabytes) by storing segments (called blocks) of the file across multiple machines. A block of data is the fundamental storage unit in HDFS. Data files are described, read and written in block-sized granularity. All storage capacity and file sizes are measured in blocks. A block ranges from 16 to 128 MB in size, with a default block size of 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.

Every data file takes up a number of blocks depending upon its size. Thus a 100 MB file will occupy two blocks (100 MB divided by 64 MB), with some room to spare. Every storage disk can accommodate a number of blocks depending upon the size of the disk. Thus 1 terabyte of storage will hold about 16,000 blocks (1 TB divided by 64 MB).

Every file is organized as a consecutively numbered sequence of blocks. A file's blocks are stored physically close to each other for ease of access, as far as possible. The file's block size and replication factor are configurable by the application that writes the file on HDFS.
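The block arithmetic above can be sketched in a few lines of Java. This is only an illustrative back-of-the-envelope calculation, not Hadoop code; it assumes the default 64 MB block size quoted in the text.

    public class BlockMath {
        public static void main(String[] args) {
            long blockSizeMB = 64;                        // default HDFS block size from the text

            long fileMB = 100;                            // a 100 MB data file
            long fileBlocks = (fileMB + blockSizeMB - 1) / blockSizeMB;   // ceiling division
            System.out.println("A 100 MB file occupies " + fileBlocks + " blocks");   // prints 2

            long diskMB = 1024L * 1024L;                  // 1 TB of storage, expressed in MB
            long diskBlocks = diskMB / blockSizeMB;
            System.out.println("A 1 TB disk holds about " + diskBlocks + " blocks");  // prints 16384
        }
    }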


Ensuring Data Integrity

Hadoop ensures that no data will be lost or corrupted, during storage or processing. The files are written only once, and never updated in place. They can be read many times. Only one client can write or append to a file at a time. No concurrent updates are allowed.

If data is indeed lost or corrupted, or if a part of the disk gets corrupted, a new healthy replica for that lost block will be automatically recreated by copying from the replicas on other DataNodes. At least one of the replicas is stored on a DataNode on a different rack. This guards against the failure of an entire rack of nodes, or of the networking router on it.

A checksum algorithm is applied on all data written to HDFS. A process of serialization is used to turn files into a byte stream for transmission over a network or for writing to persistent storage. Hadoop has additional security built in, using Kerberos verification.


Installing HDFS

It is possible to run Hadoop on an in-house cluster of machines, or inexpensively on the cloud. As an example, The New York Times used 100 Amazon Elastic Compute Cloud (EC2) instances (DataNodes) and a Hadoop application to process 4 TB of raw image TIFF data stored in Amazon Simple Storage Service (S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240 (not including bandwidth). See Chapter 9 for a primer on Cloud Computing. See Appendix 1 for a step-by-step tutorial on installing Hadoop on Amazon EC2.

Hadoop is written in Java, and requires a working Java installation. Installing Hadoop takes a lot of resources. For example, all information about fragments of files needs to be in NameNode memory. A thumb rule is that Hadoop needs approximately 1 GB of memory to manage 1 million file fragments. Many easy mechanisms exist to install the entire Hadoop stack. Using a GUI such as the Cloudera Resources Manager to install a Cloudera Hadoop stack is easy. This stack includes HDFS and many other related components, such as HBase, Pig, YARN, and more. Installing it on a cluster at a cloud services provider like AWS is easier than installing it on in-house machines. If installing from the command line, download Hadoop from one of the Apache mirror sites.

Most access to files is provided through the Java abstract class org.apache.hadoop.fs.FileSystem. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API). Another API, called Thrift, helps to generate a client in the language of the users' choosing (such as C++, Java, or Python). When the Hadoop command is invoked with a class name as the first argument, it launches a Java virtual machine (JVM) to run the class, along with the relevant Hadoop libraries (and their dependencies) on the classpath.

HDFS has a UNIX-like command line interface (CLI). Use the ssh shell to communicate with Hadoop. HDFS has a UNIX-like permissions model for files and directories. There are three progressively increasing levels of permissions: read (r), write (w), and execute (x). Create an hduser account, and communicate using the ssh shell on the local machine.

% hadoop fs -help    ## get detailed help on every command

Reading and Writing Local Files into HDFS

There are two different ways to transfer data: from the local file system, or from an input/output stream. Copying a file from the local file system to HDFS can be done by:

% hadoop fs -copyFromLocal path/filename

Reading and Writing Data Streams into HDFS

Reading a file from HDFS by using a java.net.URL object to open a stream to read the data requires a short script, as below.

    InputStream in = null;
    try {
        in = new URL("hdfs://host/path").openStream();
        // process the input stream here
    } finally {
        IOUtils.closeStream(in);
    }

A simple method to create a new file is as follows:

    public FSDataOutputStream create(Path p) throws IOException

Data can be appended to an existing file using the append() method:

    public FSDataOutputStream append(Path p) throws IOException

A directory can be created by a simple method:

    public boolean mkdirs(Path p) throws IOException

List the contents of a directory using:

    public FileStatus[] listStatus(Path p) throws IOException

    public FileStatus[] listStatus(Path p, PathFilter filter) throws IOException
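The fragments above can be put together into one small program. The sketch below is a minimal, hypothetical example of using the org.apache.hadoop.fs.FileSystem class to create a directory, copy a local file into HDFS, and list the directory contents; the paths and file names are illustrative only.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFileOps {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

            Path dir = new Path("/user/hduser/demo");   // hypothetical target directory
            fs.mkdirs(dir);                             // uses the mkdirs() method shown above

            // copy a file from the local file system into the HDFS directory
            fs.copyFromLocalFile(new Path("/tmp/myfile.txt"), new Path(dir, "myfile.txt"));

            // list the contents of the directory using listStatus()
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }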


SequenceFiles

The incoming data files can range from very small to extremely large, and with different structures. Big Data files are therefore organized quite differently to handle the diversity of file sizes and types. Large files are stored as HDFS files, with file fragments distributed across the cluster. However, smaller files should be bunched together into a single segment for efficient storage.

SequenceFiles are a specialized data structure within Hadoop to handle smaller files with smaller record sizes. A SequenceFile uses a persistent data structure for data available in key-value pair format. These help efficiently store smaller objects. HDFS and MapReduce are designed to work with large files, so packing small files into a SequenceFile container makes storing and processing the smaller files more efficient for HDFS and MapReduce.

Sequence files are row-oriented file formats, which means that the values for each row are stored contiguously in the file. This format is appropriate when a large number of columns of a single row are needed for processing at the same time. There are easy commands to create, read and write SequenceFile structures. Sorting and merging SequenceFiles is native to the MapReduce system. A MapFile is essentially a sorted SequenceFile with an index to permit lookups by key.
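As a sketch of what writing a SequenceFile looks like, the snippet below packs many small key-value records into one container file using the Hadoop SequenceFile API; the output file name and record values are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("numbers.seq");        // hypothetical output file

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                // bundle many small records into a single container file
                for (int i = 1; i <= 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }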


YARN

YARN (Yet Another Resource Negotiator) is the architectural center of Hadoop. It is often characterized as a large-scale, distributed operating system for Big Data applications. YARN manages resources and monitors workloads, in a secure multi-tenant environment, while ensuring high availability across multiple Hadoop clusters. YARN also brings great flexibility as a common platform to run multiple tools and applications, such as interactive SQL (e.g. Hive), real-time streaming (e.g. Spark), and batch processing (MapReduce), on data stored in a single HDFS storage platform. It brings clusters more scalability to expand beyond 1,000 nodes, and it also improves cluster utilization through dynamic allocation of cluster resources to various applications.

Figure 4-0-3: Hadoop Distributed Architecture including YARN

The ResourceManager in YARN has two main components: the Scheduler and the ApplicationsManager.

The YARN Scheduler allocates resources to the various requesting applications. It does so based on an abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk storage, network, etc. Each machine also has a NodeManager that manages all the Containers on that machine, and reports status on resources and Containers to the YARN Scheduler.

The YARN ApplicationsManager accepts new job submissions from the client. It then requests a first resource Container for the application-specific ApplicationMaster program, and monitors the health and execution of the application. Once running, the ApplicationMaster directly negotiates additional resource Containers from the Scheduler as needed.


Conclusion

Hadoop is the major technology for managing Big Data. HDFS securely stores data on large clusters of commodity machines. A master machine controls the storage and processing activities of the worker machines. A NameNode controls the namespace and storage information for the file system on the DataNodes. A master JobTracker controls the processing of tasks at the DataNodes. YARN is the resource manager that manages all resources dynamically and efficiently across all applications on the cluster. The Hadoop file system and other parts of the Hadoop stack are distributed by many vendors, and can be easily installed on cloud computing infrastructure. A Hadoop installation tutorial is in Appendix A.


Review Questions

Q1: How does Hadoop differ from a traditional file system?

Q2: What are the design goals for HDFS?

Q3: How does HDFS ensure security and integrity of data?

Q4: How does a master node differ from the worker nodes?


Chapter 5 – Parallel Processing with MapReduce

Introduction

A parallel processing system is a clever way to process huge amounts of data in a short period of time by enlisting the services of many computing devices to work on parts of the job simultaneously. The ideal parallel processing system will work across any computational problem, using any number of computing devices, across any size of data sets, with ease and high programmer productivity. This is achieved by framing the problem in a way that it can be broken down into many parts, such that each part can be partially processed independently of the other parts; and then the intermediate results from processing the parts can be combined to produce a final solution. Infinite parallel processing is the essence of the infinite dynamism of the laws of nature.


MapReduce Overview

MapReduce is a parallel programming framework for speeding up large-scale data processing for certain types of tasks. It achieves this with minimal movement of data on distributed file systems such as HDFS clusters, to achieve near-real-time results. There are two major prerequisites for MapReduce programming: (a) the application must lend itself to parallel programming; (b) the data can be expressed in key-value pairs.

MapReduce processing is similar to the UNIX sequence (also called pipe) structure,

e.g. the UNIX command:

grep | sort | count myfile.txt

will produce a word count of the text document called myfile.txt.

There are three commands in this sequence, and they work as follows: (a) grep is a command to read the text file and create an intermediate file with one word on a line; (b) the sort command will sort that intermediate file, and produce an alphabetically sorted list of words in that set; (c) the count command will work on that sorted list, to produce the number of occurrences of each word, and display the results to the user in a "word, frequency" pair format.

For example, suppose myfile.txt contains the following text:

My file: We are going to a picnic near our house. Many of our friends are coming. You are welcome to join us. We will have fun.

The outputs of Grep, Sort and WordCount will be as shown below.

Grep      Sort      WordCount
We        a         a 1
are       are       are 3
going     are       coming 1
to        are       friends 1
a         coming    fun 1
picnic    friends   going 1
near      fun       have 1
our       going     house 1
house     have      join 1
Many      house     many 1
of        join      near 1
our       many      of 1
friends   near      our 2
are       of        picnic 1
coming    our       to 2
You       our       us 1
are       picnic    we 2
welcome   to        welcome 1
to        to        will 1
join      us        you 1
us        We
we        we
will      welcome
have      will
fun       you

If the file is very large, then it will take the computer a long time to process it. Parallel processing can help here.

MapReduce speeds up the computation by reading and processing small chunks of the file on different computers in parallel. Thus, if a file can be broken down into 100 small chunks, each chunk can be processed at a separate computer in parallel. The total time taken to process the file could be 1/100 of the time taken otherwise. However, now the results of the computation on the small chunks are residing in 100 different places. This large number of partial results needs to be combined to produce a composite result. The results of the outputs from the various chunks will be combined by another program called the Reduce program.

The Map step will distribute the full job into smaller tasks that can be done on separate computers, each using only a part of the data set. The result of the Map step will be considered as intermediate results. The Reduce step will read the intermediate results, and will combine all of them and produce the final result. The programmer needs to specify the functional logic for both the map and reduce steps. The sorting, between the Map and Reduce steps, does not need to be specified and is automatically taken care of by the MapReduce system as a standard service provided to every job. The sorting of the data requires a field to sort on. Thus the intermediate results need to have some kind of a key field, and a set of associated non-key attribute(s) for that key.

Figure 5-0-1: MapReduce Architecture

In practice, to manage the variety of data structures stored in the file system, data is stored as one key and one non-key attribute. Thus the data is represented as a key-value pair. The intermediate results, and the final results, will also all be in key-value pair format. Thus a key requirement for the use of the MapReduce parallel processing system is that the input data and output data must both be represented in key-value formats.

The Map step reads data in key-value pair format. The programmer decides what should be the characteristics of the key and value fields. The Map step produces results in key-value pair format. However, the keys produced by the Map step, i.e. the intermediate results, need not be the same keys as in the input data. So, those can be called key2-value2 pairs.

The Reduce step reads the key2-value2 pairs, the intermediate results produced by the Map step. The Reduce step will produce an output using the same keys that it read. Only the values associated with those keys will change as a result of the processing. Thus it can be labeled as key2-value3 format.

Suppose the text in myfile.txt can be split into 4 approximately equal segments. It could be done with each sentence as a separate piece of text. The four segments will look as follows:

Segment 1: We are going to a picnic near our house.

Segment 2: Many of our friends are coming.

Segment 3: You are welcome to join us.

Segment 4: We will have fun.

Thus the input to the 4 processors in the Map step will be in key-value pair format. The first column is the key, which is the entire sentence in this case. The second column is the value, which in this application is the frequency of the sentence.

We are going to a picnic near our house.     1

Many of our friends are coming.              1

You are welcome to join us.                  1

We will have fun.                            1

This task can be done in parallel by four processors. Each of these segments will be the task for a different processor. Thus each task will produce a file of words, each with a count of 1. There will be four intermediate files, in <key, value> pair format, shown below.

Key2     Value2   Key2     Value2   Key2     Value2   Key2   Value2
we       1        many     1        you      1        we     1
are      1        of       1        are      1        will   1
going    1        our      1        welcome  1        have   1
to       1        friends  1        to       1        fun    1
a        1        are      1        join     1
picnic   1        coming   1        us       1
near     1
our      1
house    1

The sort process inherent within MapReduce will sort each of the intermediate files, and produce the following sorted key-value pairs:

Key2     Value2   Key2     Value2   Key2     Value2   Key2   Value2
a        1        are      1        are      1        fun    1
are      1        coming   1        join     1        have   1
going    1        friends  1        to       1        we     1
house    1        many     1        us       1        will   1
near     1        of       1        welcome  1
our      1        our      1        you      1
picnic   1
to       1
we       1

The Reduce function will read the sorted intermediate files, and combine the counts for all the unique words, to produce the following output. The keys remain the same as in the intermediate results. However, the values change as the counts from each of the intermediate files are added up for each key. For example, the count for the word 'are' goes up to 3.

Key2      Value3
a         1
are       3
coming    1
friends   1
fun       1
going     1
have      1
house     1
join      1
many      1
near      1
of        1
our       2
picnic    1
to        2
us        1
we        2
welcome   1
will      1
you       1

This output will be identical to that produced by the UNIX sequence earlier.


MapReduce programming

A data processing problem needs to be transformed into the MapReduce model. The first step is to visualize the processing plan as a map and a reduce step. When the processing gets more complex, this complexity is generally manifested in having more MapReduce jobs, or more complex map and reduce jobs. Having more but simpler MapReduce jobs leads to more easily maintainable mapper and reducer programs.

MapReduce Data Types and Formats

MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). Since Mapper and Reducer are separate classes, the type parameters have different scopes.

Hadoop can process many different types of data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record, a key-value pair, in turn. Splits and records are logical: they may map to a full file, a part of a file, or a collection of files. In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.

Writing MapReduce Programs

Start by writing pseudocode for the map and reduce functions. The program code for both the map and the reduce functions can then be written in Java or other languages. In Java, the map function is represented by the generic Mapper class. It uses four parameters: input key, input value, output key, and output value. This class uses an abstract map() method. This method receives the input key and input value. It would normally produce an output key and output value. For more complex problems, it is better to use a higher-level language than MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark.

A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (selecting the records of interest). The reducer typically combines (adds or averages) those values.

Figure 5-0-2: MapReduce program flow

Here below is the step-by-step logic. Imagine that we want to do a word count of all unique words in a text.

1. The big document is split into many segments. The map step is run on each segment of data. The output will be a set of key-value pairs. In this case, the key will be a word in the document.

2. The system will gather the key-value pair outputs from all the mappers, and will sort them by key. The sorted list itself may then be split into a few segments.

3. A reducer task will read the sorted list and produce a combined list of word counts.

Here is the pseudocode for word count:

    map(String key, String value):
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));
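For readers who want to see the same logic in actual Hadoop Java code, below is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API. It mirrors the pseudocode above; the class names and the use of the reducer as a combiner are conventional choices, not something prescribed by the text.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combiner reduces map output locally
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The job is typically packaged into a jar and launched with the hadoop jar command, with the input and output HDFS paths passed as arguments.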

Testing MapReduce Programs

Mapper programs running on a cluster can be complicated to debug. The time-honored way of debugging programs is via print statements. However, with the programs eventually running on tens or thousands of nodes, it is best to debug the programs in stages. Therefore, run the program using small sample data sets to ensure that the program is working correctly. Expand the unit tests to cover larger data sets and run it on a cluster. Ensure that the mapper or reducer can handle the inputs correctly. Running against the full data set is likely to expose some more issues, which should be fixed by altering your mapper or reducer to handle the new cases. After the program is working, the program may be tuned to make the entire MapReduce job run faster.

It may be desirable to split the logic into many simple mappers and chain them into a single mapper using a facility (the ChainMapper library class) built into Hadoop. It can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.


MapReduce Job Execution

A MapReduce job is specified by the Map program and the Reduce program, along with the data sets associated with that job. There is another master program that resides and runs endlessly on the NameNode. It is called the JobTracker, and it tracks the progress of the MapReduce jobs from beginning to completion. Hadoop divides the job into two kinds of tasks: map tasks and reduce tasks. Hadoop moves the Map and Reduce computation logic to each DataNode that is hosting a part of the data. The communication between the nodes is accomplished using YARN, Hadoop's native resource manager.

The master machine (NameNode) is completely aware of the data stored on each of the worker machines (DataNodes). It schedules the map or reduce jobs to task trackers with full awareness of the data location. For example: if node A contains data (x, y, z) and node B contains data (a, b, c), the JobTracker schedules node B to perform map or reduce tasks on (a, b, c) and node A would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the data traffic and prevents choking of the network.

Each DataNode has a worker program called the TaskTracker. This program monitors the execution of every task assigned to it by the NameNode. When the task is completed, the TaskTracker sends a completion message to the JobTracker program on the NameNode.

The jobs and tasks work in a master-slave mode.

Figure 5-0-3: Hierarchical Monitoring Architecture

When there is more than one job in a MapReduce workflow, it is necessary that they be executed in the right order. For a linear chain of jobs it might be easy. For a more complex directed acyclic graph (DAG) of jobs, there are libraries that can help orchestrate your workflow. Or one can use Apache Oozie, a system for running workflows of dependent jobs.

Oozie consists of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster.

The data set for the MapReduce job is divided into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The tasks are scheduled using YARN and run on nodes in the cluster. YARN ensures that if a task fails or is inordinately delayed, it will be automatically scheduled to run on a different node. The outputs of the map jobs are fed as input to the reduce job. That logic is also propagated to the node(s) that will do the reduce jobs. To save on bandwidth, Hadoop allows the use of a combiner function on the map output. The combiner function's output then forms the input to the reduce function.

How MapReduce Works

A MapReduce job can be executed with a single method call: submit() on a Job object. When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. It retrieves the input splits computed in the client from the shared file system. It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point. The application master must decide how to run the tasks that make up the MapReduce job. The application master requests containers for all the map and reduce tasks in the job from the resource manager. Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager. The task is executed by a Java application whose main class is YarnChild.

Managing Failures

There can be failures at the level of the entire job or of particular tasks. The entire application master itself could fail.

Task failure usually happens when the user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error to its parent application master, where it is logged into error logs. The application master will then reschedule execution of the task on another DataNode.

The entire job, i.e. the MapReduce application master running on YARN, can also fail. In that case, it is started again, subject to a maximum number of attempts, which is a user-set configuration parameter.

If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will then remove it from its pool of nodes to schedule containers on. Any task or application master running on the failed node manager will be recovered using error logs, and started on other nodes.

The ResourceManager of YARN can also fail, and that has more severe consequences for the entire cluster. Therefore, typically, there will be a hot standby for YARN. If the active resource manager fails, then the standby can take over without a significant interruption to the client. The new resource manager can read the application information from the state store, and then restart the applications that were running on the cluster.

Shuffle and Sort

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.

When the map function starts producing output, it is not directly written to disk. The system takes advantage of buffering writes in memory and doing some presorting for efficiency reasons. Each map task has a circular memory buffer that it writes the output to. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key. If there is a combiner function, it is run on the output of the sort so that there is less data to transfer to the reducer.

The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts reading their outputs as soon as each completes. When all the map outputs have been read, the reduce task merges the map outputs, maintaining their sort ordering. The reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output file system, such as HDFS.

Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking a long time to run. It is important for the user to get feedback on the job's progress. A job and each of its tasks have a status value (e.g., running, successfully completed, failed), the progress of maps and reduces, and the values of the job's counters. These values are constantly communicated back to the client. When the application master receives a notification that the last task for a job is complete, it changes the status for the job to "successful." Job statistics and counters are communicated to the user.

Hadoop comes with a native web-based GUI for tracking the MapReduce jobs. It displays useful information about a job's progress, such as how many tasks have been completed, and which ones are still being executed. Once the job is completed, one can view the job statistics and logs.


Hadoop Streaming

Hadoop Streaming uses standard Unix streams as the interface between Hadoop and the user program. Streaming is an ideal application for text processing. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format, a tab-separated key-value pair, passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.


Conclusion

MapReduce is the first popular parallel programming framework for Big Data. It works well for applications where the data can be large, divisible into separate sets, and represented in <key, value> pair format. The application logic is divided into two parts: a Map program and a Reduce program. Each of these programs can be run in parallel by several machines.


Review Questions

Q1: What is MapReduce? What are its benefits?

Q2: What is the key-value pair format? How is it different from other data structures? What are its benefits and limitations?


Chapter 6 – NoSQL databases

A NoSQL database is a clever way to cost-effectively organize large amounts of heterogeneous data for efficient access and updates. The ideal NoSQL database is completely aligned with the nature of the problems being solved, and is super fast at that task. This is achieved by releasing and relaxing many of the integrity and redundancy constraints of storing data in relational databases, and storing data in many innovative formats aligned with business needs. The diverse NoSQL databases will ultimately collectively evolve into a holistic set of efficient and elegant data structures at the heart of a cosmic computer of infinite organization capacity.


Introduction

Relational database management systems (RDBMS) are a powerful and universally used database technology in almost all enterprises. Relational databases are structured and optimized to ensure accuracy and consistency of data, while also eliminating any redundancy of data. These databases are stored on the largest and most reliable of computers to ensure that the data is always available at a granular level and at a high speed.

Big Data is however a much larger and unpredictable stream of data. Relational databases are inadequate for this task, and would also be very expensive for such large data volumes. Managing the costs and speed of handling such large and heterogeneous data streams requires relaxing many of the strict rules and requirements of relational data. Depending upon which constraint(s) are relaxed, a different kind of database structure will emerge. These are called NoSQL databases, to differentiate them from relational databases that use Structured Query Language (SQL) as the primary means to manipulate data.

NoSQL databases are next-generation databases that are non-relational in their design. The name NoSQL is meant to differentiate them from antiquated, 'pre-relational' databases. Today, almost every organization that needs to gather customer feedback and sentiments to improve their business will use a NoSQL database. NoSQL is useful when an enterprise needs to access, analyze and utilize massive amounts of either structured or unstructured data, or data that is stored remotely in any virtual server across the globe.

The constraints of a relational database are relaxed in many ways. For example, relational databases require that any data element could be randomly accessed and its value could be updated in that same physical location. However, the simple physics of storage says that it is simpler and faster to read or write sequential blocks of data on a disk. Therefore, NoSQL database files are written once and almost never updated in place. If a new version of a part of the data becomes available, it would be stored elsewhere by the system. The system would have the intelligence to link the updated data to the old data.

Pig and Hive are two key and popular languages in the Hadoop ecosystem that work well on NoSQL databases. Pig originated at Yahoo while Hive originated at Facebook. Both Pig and Hive can use the same data as an input, and can achieve similar results with queries. Both Pig Latin and Hive commands eventually compile to Map and Reduce jobs. They have a similar goal: to ease the complexity of writing complex Java MapReduce programs. Most MapReduce jobs can be implemented easily in Hive or Pig.


For analytical needs, Hive is preferable over Pig. For controlled processing, Pig's scripting design is preferable. Hive leads to ease and productivity using its SQL-like design and user interface. Pig offers greater control over data flows. Java MapReduce can be used with more advanced APIs to accomplish things when there is something special needed, such as interacting with a third-party tool, or handling some special data characteristics.


RDBMS vs NoSQL

They are different in many ways. First, NoSQL databases do not support a relational schema or the SQL language. The term NoSQL stands mostly for "Not only SQL". Second, their transaction processing capabilities are fast but weak, and they do not support the ACID (Atomicity, Consistency, Isolation, Durability) properties associated with transaction processing using relational databases. Instead, they are approximately accurate at any point in time, and will be eventually consistent. Third, these databases are also distributed and horizontally scalable, to manage web-scale databases using Hadoop clusters of storage. Thus they work well with the write-once, read-many storage mechanism of Hadoop clusters.

Feature      | RDBMS                                                      | NoSQL
Applications | Mostly centralized applications (e.g. ERP)                 | Mostly designed for decentralized applications (e.g. Web, mobile, sensors)
Availability | Moderate to high                                           | Continuous availability to receive and serve data
Velocity     | Moderate velocity of data                                  | High velocity of data (devices, sensors, social media, etc.); low latency of access
Data Volume  | Moderate size; archived after a certain period             | Huge volume of data, stored mostly for a long time or forever; linearly scalable DB
Data Sources | Data arrives from one or a few, mostly predictable sources | Data arrives from multiple locations and is of unpredictable nature
Data Type    | Data are mostly structured                                 | Structured or unstructured data
Data Access  | Primary concern is reading the data                        | Concern is both read and write
Technology   | Standardized relational schemas; SQL language              | Many designs with many implementations of data structures and access languages
Cost         | Expensive; commercial                                      | Low; open-source software


Types of NoSQL Databases

The variety of big data means that file sizes and types will vary enormously. There are specialized databases to suit different purposes.

1. Document Databases: Storing a 10 GB video movie file as a single object could be speeded up by sequentially storing the data in contiguous blocks of physical storage. An index could store the identifying information about the movie, and the address of the starting block. The rest of the storage details could be handled by the system. This storage format would be called a document store format. The index would contain the name of the movie, and the value is the entire video file, characterized by the first block of storage. Document databases are generally useful for content management systems, blogging platforms, web analytics, real-time analytics, and e-commerce applications. We would avoid using document databases for systems that need complex transactions spanning multiple operations or queries against varying aggregate structures.

2. Key-Value Pair Databases: There could be a collection of many data elements, such as a collection of text messages, which could also fit into a single physical block of storage. Each text message is a unique object. This data would need to be queried often. That collection of messages could also be stored in a key-value pair format, by combining the identifier of the message and the content of the message. Key-value databases are useful for storing session information, user profiles, preferences, and shopping cart data. Key-value databases don't work so well when we need to query by non-key fields or on multiple key fields at the same time.

3. Graph Databases: Geographic map data is stored as a set of relationships, or links, between points. Graph databases are very well suited to problem spaces where we have connected data, such as social networks, spatial data, routing information, and recommendation engines.

4. Columnar Databases: Some kinds of databases are needed to speed up oft-sought queries from very large datasets. Suppose there is an extremely large data warehouse of web log access data, which is rolled up by the number of web accesses by the hour. This needs to be queried, or summarized, often, involving only some of the data fields from the database. Thus the query could be speeded up by creating a database structure that includes only the relevant columns of the dataset, along with the key identifying information. This is called a columnar database format, and is useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volume such as log aggregation. Column family databases work well for systems where the query patterns have stabilized.

The choice of NoSQL database depends on the system requirements. There are at least 200 implementations of NoSQL databases of these four types. Visit nosql-database.org for more.

Despite the name, a NoSQL database does not necessarily prohibit structured query language (like MySQL). While some of the NoSQL systems are entirely non-relational, others just avoid some selected functionality of RDBMSs, such as fixed table schemas and join operations. For NoSQL systems, instead of using tables, the data can be organized in key/value pair format, and then SQL can be used.

The first popular NoSQL database was HBase, which is a part of the Hadoop family. The most popular NoSQL database used today is Apache Cassandra, which was developed and owned by Facebook till it was released as open source in 2008. Other NoSQL database systems are SimpleDB, Google's BigTable, MemcacheDB, Oracle NoSQL, Voldemort, etc.


Architecture of NoSQL

Figure 6-0-1: NoSQL Databases Architecture

One of the key concepts underlying the NoSQL databases is that database management has moved to a two-layer architecture, separating the concerns of data modeling and data storage. The data storage layer focuses on high-performance, scalable data storage for the task at hand. The data management layer supports a variety of database formats, and allows for low-level access to that data through specialized languages that are more appropriate for the job, rather than being constrained by using the standard SQL format.

NoSQL databases map the data into key/value pairs and save the data in the storage unit. There is no storage of data in a centralized tabular form, so the database is highly scalable. The data could be of different forms, and coming from different sources, and they can all be stored in similar key/value pair formats.

There are a variety of NoSQL architectures. Some popular NoSQL databases like MongoDB are designed in a master/slave model like many RDBMSs. But other popular NoSQL databases like Cassandra are designed in a master-less fashion where all the nodes in the cluster are the same. So, it is the architecture of the NoSQL database system that determines whether the benefits of a distributed and scalable system, such as continuous availability, distributed access, high speed, and so on, emerge.

NoSQL databases provide developers a lot of options to choose from and fine-tune the system to their specific requirements. It is important to understand how the data is going to be consumed by the system: is it read-heavy vs. write-heavy, is there a need to query data with random query parameters, and will the system be able to handle inconsistent data?


CAP Theorem

Data is expected to be accurate and available. In a distributed environment, accuracy depends upon the consistency of data. A system is considered Consistent if all replicas of a copy contain the same value. The system is considered Available if the data is available at all points in time. It is also desirable for the data to be consistent and available even when a network failure renders the database partitioned into two or more islands. A system is considered Partition Tolerant if processing can continue in both partitions in the case of a network failure. In practice it is hard to achieve all three.

The choice between Consistency and Availability remains the unavoidable reality for distributed data stores. The CAP theorem states that in any distributed system one can choose only two out of the three (Consistency, Availability, and Partition Tolerance). The third will be determined by those choices.

NoSQL databases can be tuned to suit one's choice of high consistency or availability. For example, for a NoSQL database, there are essentially three parameters:

- N = replication factor, i.e. the number of replicas created for each piece of data

- R = minimum number of nodes that should respond to a read request for it to be considered successful

- W = minimum number of nodes that should respond to a write request before it's considered successful

Setting the values of R and W very high (R = N, and W = N) will make the system more consistent. However, it will be slow to report success, and thus Availability will be low. On the other end, setting R and W to be very low (such as R = 1 and W = 1) would make the cluster highly available, as even a single successful read (or write) would let the cluster report success. However, consistency of data on the cluster will be low, since many of the nodes may not yet have received the latest copy of the data.
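As an illustration of how N, R, and W interact, here is a minimal sketch in Scala (not from the book; the class and values are hypothetical). It uses the commonly cited quorum rule that reads are guaranteed to see the latest acknowledged write whenever R + W > N:

// Hypothetical helper illustrating the interplay of N, R, and W.
// When R + W > N, every read quorum overlaps every write quorum,
// so a read is guaranteed to see the most recently acknowledged write.
case class ReplicationSettings(n: Int, r: Int, w: Int) {
  require(r >= 1 && w >= 1 && r <= n && w <= n, "R and W must be between 1 and N")
  def readsSeeLatestWrite: Boolean = r + w > n
}

object QuorumDemo extends App {
  println(ReplicationSettings(n = 3, r = 1, w = 1).readsSeeLatestWrite) // false: favors availability
  println(ReplicationSettings(n = 3, r = 2, w = 2).readsSeeLatestWrite) // true: favors consistency
  println(ReplicationSettings(n = 3, r = 3, w = 3).readsSeeLatestWrite) // true, but least available
}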

If a network gets partitioned because of a network failure, then one has to trade off availability versus consistency. NoSQL database users often choose availability and partition tolerance over strong consistency. They argue that short periods of application misbehavior are less problematic than short periods of unavailability.

Consistency is more expensive, in terms of throughput or latency, than Availability. However, HDFS chooses consistency, as three failed data nodes can potentially render a file's blocks completely unavailable.


Popular NoSQL Databases

We cover two of the more popular offerings.


HBase

Apache HBase is a column-oriented, non-relational, distributed database system that runs on top of HDFS. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key; all access to HBase tables is done using the Primary Key. An HBase column represents an attribute of an object. For example, if the table is storing diagnostic logs from web servers, each row will be a log record. Each column in that table will represent an attribute, such as the date/time of the record, or the server name. HBase permits many attributes to be grouped together into a column family, so that the elements of a column family are all stored as essentially a composite attribute.

Columnar databases are different from a relational database in terms of how the data is stored. In a relational database, all the columns/attributes of a given row are stored together. With HBase you must predefine the table schema and specify the column families. All rows of a column family will be stored sequentially. However, it is very flexible in that new columns can be added to families at any time, making the schema flexible and therefore able to adapt to changing application requirements.

Architecture Overview

HBase is built on a master-slave concept. In HBase a master node manages the cluster, while the worker nodes (called region servers) store portions of the tables and perform the work on the data. HBase is designed after Google Bigtable, and offers similar capabilities on top of Hadoop and HDFS. It does consistent reads and writes. It does automatic and configurable sharding of tables. A shard is a segment of the database.

Figure 6-0-2: HBase Architecture

Physically, HBase is composed of three types of servers in a master-slave type of architecture.


(a) The NameNode maintains metadata information for all the physical data blocks that comprise the files.

(b) Region servers serve data for reads and writes.

(c) The Hadoop DataNode stores the data that the RegionServer is managing.

HBase tables are divided horizontally by row key range into "Regions." A region contains all rows in the table between the region's start key and end key. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, which is part of the Hadoop ecosystem, maintains a live cluster state. There is automatic failover support between RegionServers. All HBase data is stored in HDFS files. RegionServers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.

Each RegionServer creates an ephemeral node in ZooKeeper. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures.

A master is responsible for coordinating the region servers, including assigning regions on startup, re-assigning regions for recovery or load balancing, and monitoring their health. It is also the interface for creating, deleting, and updating tables.

Reading and Writing Data

There is a special HBase catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.

This is what happens the first time a client reads or writes to HBase:

The client gets the Region server that hosts the META table from ZooKeeper.

The client will query the .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.

It will get the Row from the corresponding RegionServer.

For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
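To make the read/write path concrete, here is a minimal sketch using the standard HBase client API from Scala (not from the book; the table name "weblogs", the column family "log", and the row key are hypothetical, and the table is assumed to already exist):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClientSketch extends App {
  // Configuration is picked up from hbase-site.xml on the classpath.
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  // Hypothetical table holding web server log records.
  val table = connection.getTable(TableName.valueOf("weblogs"))

  try {
    // Write: row key + column family:qualifier -> value
    val put = new Put(Bytes.toBytes("row-2008-06-08-0001"))
    put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("server"), Bytes.toBytes("web01"))
    table.put(put)

    // Read the same cell back using the row key (the Primary Key).
    val result = table.get(new Get(Bytes.toBytes("row-2008-06-08-0001")))
    val server = Bytes.toString(result.getValue(Bytes.toBytes("log"), Bytes.toBytes("server")))
    println(s"server = $server")
  } finally {
    table.close()
    connection.close()
  }
}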


Cassandra

Apache Cassandra is a massively scalable open source non-relational database that offers continuous uptime, simplicity, and easy data distribution across multiple data centers and the cloud. Cassandra was originally developed at Facebook and was open sourced in 2008. It provides many benefits over traditional relational databases for modern online applications, such as a scalable architecture, continuous availability, high data protection, multiple data replications over data centers, data compression, a SQL-like language, and so on.

Architecture Overview

Cassandra's architecture provides its ability to scale and to provide continuous availability. Rather than using a master-slave architecture, it has a master-less "ring" design that is easy to set up and maintain. In Cassandra, all nodes play an equal role; all nodes communicate with one another using a distributed and highly scalable protocol called gossip.

So, the Cassandra scalable architecture provides the capacity for handling large volumes of data, and large numbers of concurrent users or operations occurring at the same time, across multiple data centers, just as easily as a normal operation for relational databases. To enhance its capacity, one simply needs to add new nodes to an existing cluster, without taking down the system or redesigning from scratch.

Also, the Cassandra architecture means that, unlike other master-slave systems, it has no single point of failure and thus is capable of offering continuous availability and uptime.

Reading and Writing Data


Data to be written to a Cassandra node is first recorded in an on-disk commit log and then written to a memory-based structure called a "memtable". When a memtable's size exceeds a certain set threshold, the data is written to a file on disk called an "SSTable". In this way the write operation is fully sequential in nature, with many input/output operations occurring at the same time rather than one at a time over a long period.

For a read operation, Cassandra looks in an in-memory data structure called a "Bloom filter" that estimates the probability of an SSTable having the required data. The Bloom filter can tell very quickly whether a file has the needed data or not. If it returns true, Cassandra checks another layer of in-memory caches, and then fetches the compressed data from disk. If the answer is false, Cassandra doesn't bother reading that SSTable and looks in another file to fetch the required data.

Write Syntax (using the legacy Thrift client; userIDKey, cp (ColumnParent), colPathName, clock (timestamp), UTF8, and CL (ConsistencyLevel) are assumed to be defined elsewhere):

TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr);
TProtocol protocol = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(protocol);
tf.open();
client.insert(userIDKey, cp, new Column("column-name".getBytes(UTF8), "column-data".getBytes(), clock), CL);

Read Syntax:

Column col = client.get(userIDKey, colPathName, CL).getColumn();
LOG.debug("Column name: " + new String(col.name, UTF8));
LOG.debug("Column value: " + new String(col.value, UTF8));


Hive Language

Hive is a declarative SQL-like language for queries. Hive was designed to appeal to a community comfortable with SQL. It is used mainly by data analysts on the server side, for designing reports. It has its own metadata section which can be defined ahead of time, before data is loaded. Hive supports map and reduce transform scripts in the language of the user's choice, which can be embedded within SQL clauses. It is widely used at Facebook by analysts comfortable with SQL, as well as by data miners programming in Python. Hive is best used for traditional data warehousing tasks; it is not designed for online transaction processing.

Hive is best suited for structured data. Hive can be used to query data stored in HBase, which is a key-value store. Hive's SQL-like structure makes transformation of data to and from an RDBMS easier. Supporting SQL syntax also makes it easy to integrate with existing BI tools. Hive needs the data to be first imported (or loaded), and after that it can be worked upon. In the case of streaming data, one would have to keep filling buckets (or files), and then Hive can be used to process each filled bucket, while using other buckets to keep storing the newly arriving data.

Hive data columns are mapped to tables in HDFS. This mapping is stored in metadata. All HQL queries are converted to MapReduce jobs. A table can have one or more partition keys. There are the usual SQL data types, plus Arrays, Maps, and Structs to represent more complex types of data. There are user-defined functions for mapping and aggregating data.

Figure 6-3: Hive Architecture

Hive Language Capabilities

Hive's SQL provides almost all basic SQL operations. These operations work on tables and/or partitions. These operations are: SELECT, FROM, WHERE, JOIN, GROUP BY, and ORDER BY. It also allows the results to be stored in another table, or in an HDFS file.

The statement to create a page_view table would look like:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Here is a script for creating an external staging table into which the data will first be loaded:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

The table created above can be stored in HDFS as a TextFile or as a SequenceFile.

The data file is first copied into HDFS, and then an INSERT query moves the data from the staging table into the main table:

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';


Pig Language

Pig is a high-level procedural language. It is used mainly for programming. It helps to create a step-by-step flow of data to do processing. It operates mostly on the client side of the cluster. Pig Latin follows a procedural programming model and is more natural to use for building a data pipeline, such as an ETL job. It gives full control over how the data flows through the pipeline and when to checkpoint the data in the pipeline; it supports DAGs in the pipeline (such as splits), and gives more control over optimization. Pig works well with unstructured data. For complex operations, such as analyzing matrices or searching for patterns in unstructured data, Pig gives greater control and options.

Pig allows one to load data and user code at any point in the pipeline. This can be important for ingesting streaming data from satellites or instruments. Pig also uses lazy evaluation. Pig is faster in the data import but slower in actual execution than an RDBMS-friendly language like Hive. Pig is well suited to parallelization, and so it is better suited for very large datasets where throughput (amount of data processed) is more important than latency (speed of response).

Pig is SQL-like, but differs to a great extent. It does not have a dedicated metadata section; the schema has to be defined in the program itself. Pig can be easier for someone who has no earlier experience with SQL.


Conclusion

NoSQL databases emerged in response to the limitations of relational databases in handling the sheer volume, nature, and growth of data. NoSQL databases have functionality like MapReduce. NoSQL databases are proving to be a viable solution to enterprise data needs and will continue to do so. There are four types of NoSQL databases: columnar, key-value, document, and graph databases. Cassandra and HBase are among the most popular NoSQL databases. Hive is an SQL-type language to access data from NoSQL databases. Pig is a procedural high-level language that gives greater control over data flows.


Review Questions

Q1: What is a NoSQL database? What are its different types?

Q2: How does a NoSQL database leverage the power of MapReduce?

Q3: What are the kinds of NoSQL databases? What are the advantages of each?

Q4: What are the similarities and differences between Hive and Pig?


Chapter 7 – Stream Processing with Spark

A stream processing system is a clever way to process large quantities of data from a vast set of extremely fast incoming data streams. The ideal stream processing engine will capture and report in real time the essence of all data streams, no matter their speed or number. This is achieved by using innovative algorithms and filters that relax many computational accuracy requirements, to compute simple approximate metrics in real time. A stream processing engine aligns with the infinite dynamism of the flow of nature.


Introduction

Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing. Spark is ideal for iterative and interactive processing tasks on large datasets and streams. Spark achieves 10-100x performance over Hadoop by operating with an in-memory construct called 'Resilient Distributed Datasets', which helps avoid the latencies involved in disk reads and writes. While Spark is compatible with Hadoop file systems and tools, a large-scale adoption of Spark and its built-in libraries (for machine learning, graph processing, stream processing, and SQL) will deliver seamless fast data processing along with high programmer productivity. Spark has become a more efficient and productive alternative to the Hadoop ecosystem, and is increasingly being used in industry.

Apache Spark was originally developed in 2009 in UC Berkeley's AMPLab, and open sourced in 2010 as an Apache project. It can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), and NoSQL databases such as HBase and Cassandra. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when datasets are too large to fit into the available system memory. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of datasets that are diverse in nature (text data, graph data, etc.) as well as in the source of data (batch vs. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It provides a comprehensive and unified solution to manage different big data use cases and requirements.


Spark Architecture

The core Spark engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data, including a SQL query engine, a library of machine learning algorithms, a graph processing system, and streaming data processing software. Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It provides support for deploying Spark applications in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), a Hadoop v2 YARN cluster, or even Apache Mesos.

Next we will introduce two important features in Spark: RDDs and DAGs.

Resilient Distributed Datasets (RDD)

An RDD, or Resilient Distributed Dataset, is a distributed memory abstraction. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.

RDDs are immutable, partitioned collections of records, which can only be created by coarse-grained operations such as map, filter, and groupBy. By coarse-grained operations, it is meant that the operations are applied to all elements in a dataset. RDDs can only be created by reading data from stable storage such as HDFS or by transformations on existing RDDs.

Once data is read into an RDD object in Spark, a variety of operations can be performed by calling abstract Spark APIs. The two major types of operations available are transformations and actions. Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union(). Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
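A minimal sketch of these operations, assuming the Spark shell (where a SparkContext named sc is already available) and a hypothetical input path:

val lines  = sc.textFile("hdfs:///data/weblogs/access.log")   // source RDD (hypothetical path)
val errors = lines.filter(line => line.contains("ERROR"))     // transformation: lazily defines a new RDD
val words  = errors.flatMap(line => line.split(" "))          // transformation

println(errors.count())                 // action: triggers computation, returns a number
words.take(5).foreach(println)          // actions: fetch the first five words and print them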

Directed Acyclic Graph (DAG)

DAG refers to a directed acyclic graph. This approach is an important feature for real-time Big Data platforms. These tools, including Storm, Spark, and Tez, offer powerful new capabilities for building highly interactive, real-time computing systems to power real-time BI, predictive analytics, real-time marketing, and other critical systems.

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling; i.e., after an RDD action has been called, it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution. In general, DAGScheduler does three things in Spark: it computes an execution DAG (i.e., a DAG of stages) for a job; it determines the preferred locations to run each task on; and it handles failures due to shuffle output files being lost.
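One simple way to inspect the lineage (the chain of RDDs) that the DAGScheduler will turn into stages is RDD.toDebugString. A minimal sketch, again assuming the Spark shell and a hypothetical input path:

val counts = sc.textFile("hdfs:///data/words.txt")   // hypothetical input
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of RDDs (the lineage) that will run when an action is called;
// the reduceByKey introduces a shuffle, and therefore a stage boundary.
println(counts.toDebugString)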


Spark Ecosystem

Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Spark is written primarily in Scala, but includes code from Python, Java, R, and other languages. Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity. The Spark ecosystem includes the Mesos resource manager, and other tools.

Spark has already overtaken Hadoop in popularity because of the benefits it provides in terms of faster execution of iterative processing algorithms.


Spark for big data processing

Spark supports big data mining through relevant libraries including MLlib, GraphX, and SparkR, and through the Spark SQL language and the Spark Streaming library.

MLlib

MLlib is Spark's machine learning library. It consists of basic machine learning algorithms such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. At the same time, algorithmic performance matters. Spark excels at iterative computation, enabling MLlib to run fast. MLlib also contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce. In addition, Spark MLlib is easy to use and supports Scala, Java, Python, and SparkR.

For example, decision trees are a popular data classification technique. Spark MLlib supports decision trees for binary and multiclass classification, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.

Functions in Decision Trees

class: public static DecisionTreeModel trainClassifier(…)

Method to train a decision tree model for binary or multiclass classification.

Parameters:

• input - Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.

• numClassesForClassification - number of classes for classification.

• categoricalFeaturesInfo - Map storing arity of categorical features.

• impurity - Criterion used for information gain calculation. Supported values: "gini" or "entropy".

• maxDepth - Maximum depth of the tree (suggested value: 4).

• maxBins - Maximum number of bins used for splitting features (suggested value: 100).

Returns: DecisionTreeModel that can be used for prediction.
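Here is a minimal Scala sketch calling this API (not from the book; the input path is hypothetical, and in recent Spark versions the Scala parameter is named numClasses rather than numClassesForClassification):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load labeled data in LIBSVM format from a hypothetical path (an RDD of LabeledPoint).
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Train a binary classifier; the empty map means all features are treated as continuous.
val model = DecisionTree.trainClassifier(
  training,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 4,
  maxBins = 100)

// Evaluate: fraction of test points predicted incorrectly.
val testErr = test.map(p => if (model.predict(p.features) != p.label) 1.0 else 0.0).mean()
println(s"Test error = $testErr")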

Spark GraphX

Efficient processing of large graphs is another important and challenging issue. Many practical computing problems concern large graphs. For example, Google has to run its PageRank on billions of web pages and perhaps trillions of web links. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multi-graph with properties attached to each vertex and edge.

To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, on the basis of an optimized variant of the Pregel API (Pregel is the system at Google that powers PageRank). In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

We compute the PageRank of each page as follows:

// load the edges as a graph object
import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "outlink.txt")

// run PageRank
val ranks = graph.pageRank(0.00000001).vertices

// join the ranks with the web pages
val pages = sc.textFile("pages.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}

val ranksByPagename = pages.join(ranks).map { case (id, (pagename, rank)) => (pagename, rank) }

// print the output
println(ranksByPagename.collect().mkString("\n"))

SparkR

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process datasets that fit in a single machine's memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; using Spark's distributed computation engine allows us to run large-scale data analysis from the R shell. SparkR exposes the RDD API of Spark as distributed lists in R. For example, one can read an input file from HDFS and process every line using lapply on an RDD. A small example follows:

sc <- sparkR.init("local")

lines <- textFile(sc, "hdfs://data.txt")

wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })

In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey, and collect.

Spark SQL

Spark SQL is a language provided to deal with structured data. Using it, one can run queries on the data and get meaningful results. It supports queries through SQL as well as HQL (Hive Query Language), which is Apache Hive's version of SQL.
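A minimal sketch, assuming the Spark shell (sc available) and a hypothetical JSON file of page views, using the SQLContext API of the Spark 1.x era:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load a hypothetical JSON file of page views into a DataFrame and query it with SQL.
val pageViews = sqlContext.read.json("hdfs:///data/page_views.json")
pageViews.registerTempTable("page_view")

val topPages = sqlContext.sql(
  "SELECT page_url, COUNT(*) AS views FROM page_view GROUP BY page_url ORDER BY views DESC LIMIT 10")
topPages.show()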

Spark Streaming

Spark Streaming receives data streams from input sources, processes them in a cluster, and pushes the results out to databases or dashboards. Spark chops up the data streams into batches of a few seconds, treats each batch of data as RDDs, and processes them using RDD operations. The processed results are pushed out in batches.
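A minimal sketch of this micro-batch model (not from the book): it counts words arriving on a hypothetical local socket, in five-second batches.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount extends App {
  val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

  // Hypothetical text source: a socket on localhost:9999 (e.g., started with `nc -lk 9999`).
  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  counts.print()   // push each batch's result to the console

  ssc.start()
  ssc.awaitTermination()
}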


Spark applications

Some hot data problems that are solved well by a tool like Apache Spark include: (1) real-time log data monitoring, (2) massive natural language processing, and (3) large-scale online recommendation systems.

A simple word count application can be run in the Spark shell as below.

val textFile = sc.textFile("C:/Users/MyName/Documents/obamaSpeech.txt")

*** Comment: loads the text file as textFile ***

val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

*** Comment: calculates the word counts by splitting on spaces ***

counts.count();

*** Results in the output below ***

Long = 52

counts.saveAsTextFile("C:/Users/MyName/Desktop/counts1")

*** Comment: saves the file on my Desktop ***


Spark vs Hadoop

Spark and Hadoop are both popular Apache projects dedicated to big data processing. Hadoop, for many years, was the leading open source big data platform, and many companies already use a distributed computing framework like Hadoop based on MapReduce. The following table provides a summary of the differences between Hadoop and Spark.

Feature | Hadoop | Spark
Purpose | Resilient, cost-effective storage and processing of large datasets | Fast general-purpose engine for large-scale data processing
Core component | Hadoop Distributed File System (HDFS) | Spark Core, the in-memory processing engine
Storage | HDFS manages massive data collections across multiple nodes within a cluster of commodity servers | Spark doesn't do distributed storage; it operates on distributed data collections
Fault tolerance | Hadoop uses replication to achieve fault tolerance | Spark uses RDDs for fault tolerance, which minimizes network I/O
Nature of processing | Accompanied by MapReduce, it includes batch processing of data in parallel mode | Batch as well as stream processing
Sweet spot | Batch processing | Iterative and interactive processing jobs that can fit in memory
Processing speed | MapReduce is slow | Spark can be up to 10x faster than MapReduce for batch processing and up to 100x faster for stream processing
Security | More secure | Less secure
Failure recovery | Hadoop can recover from system faults or failures since data are written to disk after every operation | With Spark, data objects are stored in RDDs; these can be reconstructed after faults or failures
Analytics tools | Separate engine | Built-in MLlib (Machine Learning) and GraphX (Graph Processing) libraries
Compatibility | Primary storage model is HDFS | Compatible with HDFS and other storage formats
Language support | Java | Scala is the native language; APIs for Python, Java, R, and others
Driving organization | Yahoo | AMPLab from UC Berkeley
Technology owners | Apache, open-source, free | Open-source, free
Key distributors | Cloudera, Hortonworks, MapR | Databricks, AMPLab
Cost of system | Medium to high | Medium to high


Conclusion

Spark is a new integrated system for big data processing. Its most important core abstraction is the RDD, along with relevant libraries like MLlib and GraphX. Spark is a really powerful open source processing engine built around speed, ease of use, and sophisticated analytics.


Review Questions

Q1: Describe the Spark ecosystem.

Q2: Compare Spark and Hadoop in terms of their ability to do stream computing.

Q3: What is an RDD? How does it make Spark faster?

Q4: Describe three major capabilities in Spark for data analytics.


Chapter 8 – Ingesting Data

Wholeness

A data ingesting system is a reliable and efficient point of reception for all data coming into a system. This system is designed to be flexible and scalable, to receive data from various sources, at various times, speeds, and quantities. The ingest system makes the data available for use by the target applications in real time. Ideally, all data would be smoothly received and made available for downstream applications to securely and reliably access at their own convenience. A dedicated data ingest mechanism is achieved by creating a fast and flexible buffer for receiving and storing all incoming streams of data. The data in the buffer is stored in a sequential manner, and is made available to all consuming applications in a fast and orderly manner.

Big Data arrives into a system at unpredictable speeds and quantities. Business applications thereafter receive and process this data at some planned throughput capacity. An ingest buffer is needed to communicate the data without loss of data or speed. This buffer idea has historically been called a messaging system, not too dissimilar from a mailbox system at the post office. Incoming messages are put into a set of organized locations, from where the target applications would receive them when they are ready.

With huge amounts of data coming in from different sources, and many more consuming applications, a point-to-point system of delivering messages becomes inadequate and slow. Alternatively, incoming data can be categorized into certain topics, and stored in the respective location or locations for those topics. Instead of data being received and held in storage for a specific target application, now the data may be consumed by any application that is interested in data related to a topic. Each consuming application can choose to read data about one or more topics of its interest. This is called the publish-and-subscribe system.


Messaging Systems

A messaging system is an asynchronous mode of communicating data between applications. There are two generic kinds of messaging systems: a point-to-point system, and a publish-subscribe (pub-sub) system. Most messaging patterns now follow the pub-sub model.

Point-to-Point Messaging System

In a point-to-point system, every message is directed at a particular receiver. A common queue can receive messages from many producers. Any particular message can be received and consumed by only one receiver. Once that target consumer reads a message in the queue, that message disappears from the queue. The typical example of this system is an Order Processing System, where each order will be processed by one Order Processor.

Publish-Subscribe Messaging System

In a pub-sub messaging system, the applications publish their output to a standard messaging queue. The target recipient will only need to know where to get the message, whenever it is ready to pick up the message. Applications thus can ignore the mechanics of interaction with other applications, and simply care about the message itself. This is especially valuable when there may be many target recipients for a message. In a pub-sub system, messages are entered into the messaging queue asynchronously from client applications.

A message queuing system needs to be fast and secure to serve many applications, both producers and subscribers. Messages are also replicated across multiple locations for reliability of data.

There are two popular data ingesting systems used in Big Data. An older system, called Flume, is closely tied to the Hadoop distributed file system. The newer and more popular system is a general-purpose system called Apache Kafka. In this chapter we will discuss the newer system, Kafka.


Apache Kafka

Apache Kafka is an open source publish-and-subscribe message broker system. Kafka aims to provide an integrated high-throughput, low-latency messaging platform for handling real-time data feeds. In the abstract, it is a single point of contact between all producers and consumers of data. All producers of data send data to Kafka. All consumers of data read data from Kafka. (Figure 8.1)

Figure 8-1: Kafka core idea

Kafka is a distributed, partitioned, scalable, replicated messaging system, with a simple but unique design. It was initially developed by LinkedIn and was open sourced in early 2011. The Apache Software Foundation is now responsible for its development and improvement. Kafka is valuable for enterprise-level infrastructure because of its simplicity and scalability. The Kafka system is written in the high-level Scala programming language.


Use Cases

Following are some popular use cases of Apache Kafka.

Messaging

Kafka is a very good alternative to a traditional message broker because the Kafka messaging system has better throughput, built-in partitioning, replication, and better fault tolerance. Kafka is a very good solution for large-scale message processing applications.

Website Activity Tracking

Website activity tracking was one of the initial use cases for Kafka at LinkedIn. The users' online activity tracking pipeline was rebuilt as a set of real-time data feeds. General web activity tracking includes a very large volume of data, and Kafka is very good at handling this huge volume of data. User activity types such as page views, searches, clicks, etc. can be designated as central topics, and the activity data can be published to those topics. Those events are available for real-time or offline processing and reporting.

Stream Processing

Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and send it to other users and consumer applications. They may even write it back to Kafka, to a new topic. Kafka's strong durability is also very useful for stream processing.

Log Aggregation

Activity log aggregation typically gathers physical log files from servers and puts them all in a central place for processing. Kafka can abstract away the details of the files and provide a cleaner abstraction of log data as a stream of messages. Use of Kafka then allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. Unlike dedicated log-centric systems, Kafka offers higher performance and stronger durability guarantees due to replication.

Commit Log

Kafka can be used as an external commit log for a distributed database system. This log can help re-sync data to failed nodes to restore their data. Log compaction in Kafka helps to achieve this feature more efficiently.


Kafka Architecture

In the abstract, Kafka brokers deal with producers and consumers of data. A producer pushes data into the ingest system at its own speed, scale, and convenience. A consumer pulls data out of the system at its own speed, scale, and convenience. All the received data is organized by categories, called topics. Incoming data is sorted and stored into topic servers. The consumers of data can subscribe to one or more topics (Figure 8.2).

Figure 8-2: Kafka Ecosystem

There is more than one broker (also called servers, or partitions) for each topic, for reliability of the messaging system. Thus two or more brokers will store data on each topic. Only one broker can be the leader at any given time. If the lead broker fails, then a second one can automatically take over and prevent the loss of access to data.

Kafka is designed for distributed high-throughput systems. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault-tolerance, which makes it a good fit for large-scale message processing applications. It has the ability to handle a large number of diverse consumers. It integrates very well with Apache Storm, Spark, and other real-time streaming data applications. Kafka is very fast and can perform 2 million writes/sec. It also guarantees zero downtime and zero data loss.

There are a lot of contributing organizations helping to improve the Kafka open-source system. It has very well documented online resources. It has been used by many big organizations such as LinkedIn, Cisco Systems, Spotify, PayPal, HubSpot, Shopify, Uber, and more. HubSpot uses Kafka to deliver real-time notifications of when a recipient opens their email. PayPal uses Kafka to process millions of updates in a minute.

Producers

A producer is responsible for selecting the topic, and the partition within it, for the message that it wants to convey. It can use a round-robin algorithm to balance the load among partitions. There can be both synchronous and asynchronous producers for producing messages and publishing them to the partition.

Consumers

A consumer is responsible for reading the data about the topics to which it has subscribed. The consumer is responsible for reading the data within a reasonable period of time, before the queues are emptied for efficient management of storage. Different consuming applications can read the data at different times. Kafka has stronger ordering guarantees than a traditional messaging system. A consumer needs to know how far it has read in the queue, so as to avoid reading duplicates or losing some data.

Broker

A broker is a server in a Kafka cluster. The cluster may have many such servers or brokers.

Topic

A topic is a category into which messages are published. For each topic there is a separate partition log for storage of messages. Each partition has an ordered sequence of messages for that topic. Each message in the partition is assigned a unique sequential number, also called the offset. This offset helps to identify each message within the partition.

The consumer reads the data sequentially according to offset numbers. The consumer maintains the offset to remember how far it has read. Generally, the offset increases linearly as messages are consumed. However, a consumer can reset the offset to access the data again and reprocess it as needed.

The Kafka cluster keeps all published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to seven days, then for the seven days after publishing, the message is available for consumption. After seven days, Kafka discards the messages to free up space.

Kafka's performance is not affected by the size of the data. Each partition must fit on the servers that host it, but a topic may have multiple partitions. This enables Kafka to manage an arbitrary amount of data. The partition also acts as the unit of parallelism.
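To illustrate how a consumer subscribes to topics and tracks offsets, here is a minimal sketch using Kafka's Java consumer API from Scala (not from the book; the broker address, group id, and topic name are hypothetical):

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker address
  props.put("group.id", "weblog-analyzers")           // hypothetical consumer group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(java.util.Arrays.asList("weblogs"))   // hypothetical topic

  while (true) {
    // Each record carries the partition it came from and its offset within that partition.
    val records = consumer.poll(Duration.ofMillis(1000))
    records.forEach { record =>
      println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
    }
    consumer.commitSync()   // record how far this group has read
  }
}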

Summary of Key Attributes

1. Disk based: Kafka works on a cluster of disks. It does not keep everything in memory, and keeps writing to the disk to make the storage permanent.

2. Fault tolerant: Data in Kafka is replicated across multiple brokers. When any leader broker fails, a follower broker takes over as leader and everything continues to work normally.

3. Scalable: Kafka can scale up easily by adding more partitions or more brokers. More brokers help to spread the load, and this provides greater throughput.

4. Low latency: Kafka does very little processing on the data. Thus it has a very low latency rate; messages produced by the producer are published and available to the consumer within a few milliseconds.

5. Finite retention: Kafka by default keeps the messages in the cluster for a week. After that the storage is refreshed. Thus the data consumers have up to a week to catch up on data, in case they fall behind for any reason.

Distribution

The Kafka cluster maintains multiple servers over the distributed network. The partitions of the log are maintained over this network. Each server handles data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. One of the servers for each partition acts as the main server, also called the "leader", while it may have one or more secondary servers, also known as "followers". The leader server is responsible for handling all the read and write operations for the partition, while the followers silently replicate the leader. The follower server becomes very helpful when the leader server fails: a follower automatically becomes the leader and handles the failure. One server can be a leader for some of the partitions on it, while being a follower for other partitions. Thus one server can act as both leader and follower. This helps to balance the workload on the servers within the cluster.

Guarantees

Messages always maintain the order in which they were sent. For example, if messages M1 and M2 were sent by the same producer and M1 was sent first, then message M1 will have a lower offset than message M2. Therefore, M1 will always appear before M2 for the consumer.

Each topic has a replication factor N, and the system can tolerate up to N-1 server failures without losing any messages committed to the log.

Client Libraries

Kafka supports the following client libraries:

1. Python: Pure Python implementation with full protocol support; Consumer and Producer are also included.

2. C: High performance C library with full protocol support.
3. C++, Ruby, Javascript, and more.


Apache ZooKeeper

Kafka is built on top of ZooKeeper. Apache ZooKeeper is a distributed configuration and synchronization service in Hadoop clusters. Here it serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers store basic metadata in ZooKeeper and share information about topics, brokers, consumer offsets (queue readers), and so on.

Since ZooKeeper does its own layers of replication, the failure of a Kafka broker does not affect the state of the Kafka cluster. Even if ZooKeeper fails, Kafka will restore the state once ZooKeeper restarts. This gives zero downtime for Kafka. ZooKeeper also manages the alternative leader broker selection, in case of a Kafka leader failure.

Kafka Producer example in Java

// Configure
Properties config = new Properties();
config.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
// Key and value serializers are required by the producer configuration.
config.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
config.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(config);

ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("topic", "key".getBytes(), "value".getBytes());

Future<RecordMetadata> response = producer.send(record);


Conclusion

Big data is ingested using a dedicated system. These often take the form of messaging systems. Publish-and-subscribe systems are efficient ways of delivering data from many sources to many targets in a reliable, secure, and efficient way. Kafka is an open-source, reliable, secure, and scalable data publish-subscribe messaging system. It deals with producers as well as consumers of data. Messages are published to a set of central topics. Each consumer can subscribe to any number of topics. Kafka uses a leader-follower system of managing replicated partitions for the same set of data, to ensure full reliability and zero downtime.


Review Questions

Q1: What is a data ingest system? Why is it an important topic?

Q2: What are the two ways of delivering data from many sources to many targets?

Q3: What is Kafka? What are its advantages? Describe 3 use cases of Kafka.

Q4: What is a topic? How does it help with data ingest management?




Chapter 9 – Cloud Computing Primer

Cloud computing is a cost-effective and flexible mode of delivering IT infrastructure as a service to clients, over the internet, on a metered basis. The cloud computing model offers clients enormous flexibility to use as much IT capacity (compute, storage, network) as needed, without having to invest in dedicated IT capacity of one's own. The IT usage can be scaled up or down in minutes. The complex IT infrastructure management skills are all owned by the cloud computing provider, and problems can be resolved much faster. The client can simply access a smoothly running IT infrastructure over a fast internet connection. IT capacity in the cloud can be purchased as a custom package depending upon one's needs in terms of average and peak IT requirements. The computing cloud is the ultimate cosmic computer aligned with all laws of nature.


Introduction

Managing very large and fast data streams is a huge challenge. It requires making critical decisions about storage, structure, and access. This data would be stored in large clusters of hundreds or thousands of inexpensive computers. Such clusters are often called server farms. The location and size of such clusters impact costs. The server farms may be located in an organization's own data centers, or they may be rented from specialized third-party organizations called cloud computing service providers.

Cloud computing provides IT leadership a cost-effective and predictable solution for reliably meeting their large data management needs. There are many vendors offering this service. Prices keep dropping regularly, because IT components keep getting cheaper, there is a growing volume of business, and there is effective competition. With cloud computing, the IT expense becomes an operating expense rather than a capital expense. The costs of IT become aligned with revenue streams, which makes cash flow management easier.

One of the main reasons for enterprises moving to cloud computing is to experiment with new and risky projects. This flexible model makes it much easier to launch new products and services, without being exposed to the risk of a heavy loss in IT infrastructure. For example, a new Hollywood movie's site will have millions of visitors for a month before and a month after the movie's release date. After that, the visits to the website will drop dramatically. The website owner would benefit enormously from using a cloud computing model where they pay for the peak web usage capacity for those few months, and much less as the usage drops down. More importantly, the flexibility ensures that their website will not crash in case the movie becomes a super-hit and attracts an unusually large number of visitors to the website.


Cloud Computing Characteristics

Here are the major characteristics of a cloud computing model.

1. Flexible Capacity: The capacity can scale up rapidly. One can expand and reduce resources according to one's specific service requirements, as and when needed. The cloud internally does regular workload balancing among the needs of millions of clients, and this helps bring down costs for everyone.

2. Attractive payment model: Cloud computing works on a pay-per-use model, i.e., one pays only for what one uses, and for how long one uses it. IT costs become an operating expense rather than a capital expense for the client. The resource prices may be negotiated at long-term contract rates, and can also be purchased at spot market rates.

3. Resiliency and Security: The failure of any individual server or storage resource does not impact the user. The servers and storage for all clients are isolated to maximize security of data.


In-house storage

Most organizations have data centers for running their regular IT operations. An organization may decide to expand its own data center to store large streams of data. The organization can ensure complete security and privacy of its data if it keeps all the data in-house. However, the costs and complexity of managing this data are increasing, and it is not cost-effective for every organization to manage huge data centers. Hiring and retaining scarce advanced skills to manage such data centers would also be a challenge.


Cloud storage

It is now becoming a trend for organizations to choose to store their data in massive data centers owned by other specialized companies. Their data and processing capacity reside in some sort of a huge cloud out there, which is accessible from anywhere, anytime, through a simple internet connection.

Companies like Amazon, Google, Microsoft, Apple, and IBM are among the major providers of cloud storage and computing services around the world. They own and operate data centers with millions of computers in them.

Figure 0-1: A cloud computing data center

Commercially, cloud service providers are able to consolidate the requirements of thousands or millions of customers, and supply flexible amounts of data storage and computing facilities to clients on a per-usage basis. This pay model is similar to how electric utility companies charge consumers for their usage of electricity in homes and offices. Cloud computing offers much lower costs per use, just like using the electric utility costs much less than owning and operating one's own electricity generators.


A major disadvantage of cloud storage is that the data is stored away from one's physical control. Thus security of precious data is left in the hands of the cloud computing provider. While security protocols are rapidly improving, there are no failsafe methods for securing data in the cloud. There is also a risk of being locked into one provider's infrastructure. The cost-benefit tradeoffs have definitely tilted towards using cloud computing providers. At some future point in time, the cloud services providers might be heavily regulated like the electric utilities.


Cloud Computing: Evolution of Virtualized Architecture

Cloud computing is essentially a commercial model for virtualized server infrastructure. IBM began to offer time-sharing services on its mainframe computers in the 1960s. Now that same capability is offered on networks of small machines through the process of virtualization.

Virtualization separates logical machines from physical machines. A physical server can run multiple Virtual Machines (VMs), and one virtual machine may span multiple physical servers. The virtualization software is called a hypervisor. It abstracts all machines into Virtual Machines, typically through an easy GUI interface. Virtualization software can run on a heterogeneous physical infrastructure and convert all IT capacity into a single unified pool. This capacity can then be provisioned in slices and packages. User applications are not aware that they are running in a virtualized environment; they run as if on a dedicated machine. The applications can also run on top of their own native operating systems.


Cloud Service Models

There are two major dimensions for conceptualizing cloud computing models: the scope of services received, and the control over and cost of those services.

1. The range of cloud computing services from a cloud computing provider falls into three broad buckets:

1. Infrastructure as a Service (IaaS): This is the lowest level of service, and includes only raw capacity of compute, storage, and networking. The price for this service is the lowest.

2. Platform as a Service (PaaS): This includes IaaS, along with other technologies and services. These are still fairly general tools, such as an open source Hadoop, Spark, or Cassandra implementation, along with certain monitoring tools. The costs are a little higher because of the additional management and monitoring services provided by the provider.

3. Software as a Service (SaaS): This includes the computing platform as well as business applications that get work done. For example, salesforce.com was one of the first CRM applications sold only on a SaaS model. Google sells an email service to organizations on a per-user-per-month basis. This is also the most expensive type of cloud service.

2. The other way cloud services differ is in terms of ownership and control.

1. Public cloud: This is a large shared infrastructure made available to one and all, in a low-cost, multi-tenancy model. The client can access it using any device. The downside is that the data also resides on the cloud, and thus could be vulnerable to theft or hacking. The costs to the client are low, and variable depending upon use.

2. Private cloud: This is a cloud version of an in-house IT infrastructure. The organization has exclusive control over the entire infrastructure. The costs are fixed and higher.

3. Hybrid cloud: This is a mix of the flexibility of capacity and greater control over some key aspects of it. One could retain complete control over critical applications, while using shared infrastructure for non-critical applications.

All levels of infrastructure and payment models are useful, as they serve different levels of needs for client organizations. However, most of the growth in cloud computing is happening because of the attractiveness of the low cost of the public cloud model.


Cloud Computing Myths

There are a couple of misconceptions about the costs and benefits of cloud computing.

1. Myth: Public cloud computing satisfies all the requirements: scalability, flexibility, pay-per-use, resilience, multi-tenancy, and security. In reality, depending upon the type of service selected (IaaS, PaaS, or SaaS), the service satisfies only specific subsets of these requirements.

2. Myth: Cloud computing is useful only if you are outsourcing your IT functions to an external service provider. In reality, one could use a private cloud computing model for a section of IT applications to offer on-demand, scalable, and pay-per-use deployments within your enterprise's own data center.


Cloud Computing: Getting Started

Here is a framework for cloud adoption. Learn about the context for getting benefits from cloud computing. Select the right model and level of cloud capacity. Set up the applications, and a monitoring system for those applications and the total cloud footprint. Choose a service provider, say Amazon Web Services (AWS), the leading provider of cloud computing. Use Appendix 1 to install Hadoop on AWS EC2 public cloud infrastructure.


Conclusion

Cloud computing is a business model that provides shared, flexible, cost-effective IT infrastructure to get started quickly on building an application. For Big Data applications, it can be even more attractive to test the system using rented facilities before deciding whether to invest in dedicated IT infrastructure.


Review Questions

Q1: Describe the cloud computing model.

Q2: What are the advantages of cloud computing over in-house computing?

Q3: Describe the technical architecture for cloud computing.

Q4: Name a few major providers of cloud computing services.


Section 3

This section covers other relevant concepts and tutorials for effectively managing and utilizing Big Data.

Chapter 10 will bring all the tools together in a case study of developing a web log analyzer, as an example of a useful Big Data application.

Chapter 11 will cover an overall view of Data Mining tools and techniques to extract benefit from Big Data.

Appendix 1 shows, step by step, the way to install a Hadoop cluster on a cloud computing platform.

Appendix 2 is a tutorial on installing and running Spark.


Chapter 10 - Web Log Analyzer Application Case Study

Introduction

A web log analyzer is an automated software tool that helps analyze web application server logs and make decisions on a number of issues. An ideal web log analyzer would analyze unlimited streams of log data and help keep the entire system running smoothly and without fault. It does this by eliminating the need for manually accessing the logs, automating the flow of information, and alerting the system administrator as needed.


Client-Server Architecture

Every web-based application runs on a client-server architecture. Clients are entities that access servers, and servers are entities that respond to the clients with a solution. Many clients simultaneously try to access the servers. The servers may be database servers, network servers, application servers, or any server in the n-tier architecture. For each request, a log entry is generated. The rate of access requests determines the rate of the stream of log entries, which leads to a potentially huge log over time. The log can be processed as a stream of data, and can also be stored on the servers for later analysis.

Logs can be used for monitoring, audit, and analysis purposes. They can help with error diagnostics in case a website becomes slow or goes down. Logs can be analyzed to detect hacking activity. They can also be analyzed to summarize the popularity of web pages, and the distribution of the page requesters. They can help track access volumes, and guide scaling the infrastructure up or down.


Web Log Analyzer

The log analyzer receives streaming logs from a server location, and analyzes them using many algorithms to generate the desired results. The system is completely automated. The log is produced, and it is consumed to make real-time reports. It is easy to imagine the massive data flow produced by the log in the server environment while it is also being analyzed simultaneously on the administrator's side.

Requirements

This is a log analyzer for a web application hosted on a server. It is a busy application owned by a big company, and receives more than 15,000 web access requests per hour. All the access requests need to be logged, and dumped to the Hadoop file system periodically. The analyzer is required to ingest real-time log data, and filter out a part of the data for analysis before dumping it to HDFS. It has to do streaming data flow management as well as batch processing. The analyzer needs to process the data before it is dumped into HDFS, and also after it is put into HDFS. The system administrators should be alerted in real time about possible threats, overloads, delays, potential errors, and any other damage. The results of all the analyses need to be stored in a database for later presentation in a graphical format. The results have to be made available for any period of time, without any missing time values. The log data has to be preserved for the future without losing any of it.

Solution Architecture

Get streaming data using Apache Flume, and send it to HDFS. Use Apache Spark as the data flow management platform and processing engine. Store the results of the analysis in MongoDB. This is a safe solution, because the data gets stored in the Hadoop cluster and remains available for future requirements, even while it is being analyzed in real time. The results of real-time processing also go into MongoDB.


Fig 10.1: Web Log Analyzer Architecture


Benefits of this solution

The advantages of this solution are:

1. Real-time logging and analysis: Data generated on the server is streamed directly to HDFS by the Flume agent without delay. Every log entry generated over every single point of time is analyzed and used for monitoring and decision making.

2. Automatic log handling and storage: Loading data into HDFS normally requires manually running certain Hadoop commands. This log analyzer uses a Flume agent or Spark Streaming to handle all data on its own, without any externally managed effort.

3. Easy and convenient implementation, using built-in and easy-to-customize machine learning algorithms in Spark.

4. Easy error handling, server request handling, and overall server performance optimization. It makes the servers smarter by keeping track of almost every aspect of the server.


Technology stack

The technology stack used for this application is shown below. A brief description of each component follows.

1. Apache Spark v2
2. Hadoop 2.6.0 (CDH 5)
3. Apache Flume
4. Scala, Java
5. MongoDB
6. RESTful web services
7. Front-end UI tools
8. Linux shell scripts


Apache Spark

Spark is a fast, in-memory cluster computing technology, designed for fast and streaming computation. It is built on top of the Hadoop and MapReduce system, and it extends the MapReduce model to support more types of computation, including interactive queries and stream processing. It has many libraries and packages, such as machine learning (MLlib) and graph computation (GraphX). It claims to execute 10 to 100 times faster than Hadoop because of its in-memory computation model. It also supports multiple languages such as Scala, Python, Java, and R.

Spark Deployment

1. Standalone
2. Hadoop YARN
3. SIMR: Spark in MapReduce
4. Mesos

Components of Spark

Spark SQL: Provides a data abstraction called SchemaRDD, which supports structured and semi-structured data.

Spark Streaming: Ingests data in mini-batches and performs RDD transformations on those mini-batches, enabling streaming data analytics using RDDs.

MLlib (machine learning): A distributed machine learning framework that operates in-memory at high speed, and offers many ML algorithms.

GraphX: This distributed graph processing framework provides an API for many graph computation algorithms.

Spark Core: This is the general execution engine for the Spark platform, upon which all other functionality is built. It takes care of task dispatching and scheduling, and basic I/O functionality.

Spark shell: A powerful tool to analyze data interactively, available in Scala and Python. Spark's primary data abstraction is an in-memory collection of items called an RDD. An RDD can be created from Hadoop input formats such as HDFS files, or by transforming existing RDDs into new RDDs using operations such as filters and maps, as in the sketch below.
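
As an illustration, here is a minimal interactive-session sketch of the filter/map style of RDD work described above. The file path, the 404 check, and the field position are illustrative assumptions, not part of the case study code.

// Started with: spark-shell (sc, the SparkContext, is created automatically).
// Create an RDD from a text file in HDFS (hypothetical path).
val lines = sc.textFile("hdfs:///logs/access.log")

// Transformation: keep only lines containing a 404 response code.
val notFound = lines.filter(line => line.contains(" 404 "))

// Transformation: map each line to its first token (the client IP in a common log format).
val ips = notFound.map(line => line.split(" ")(0))

// Action: count the distinct IP addresses that received a 404.
println(ips.distinct().count())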

Scripting and programming model using SparkContext: One can use an IDE to develop and test the analytics code. One can then create a jar to run the analytics on the Hadoop architecture. The jar can also be submitted to the Spark engine using the spark-submit utility. For example:


spark-submit --class apache.accesslogs.ServerLogAnalyzer --master local[*] ScalaSpark/Scala1/target/scala-2.10/Scala1-assembly-1.0.jar > output.txt


HDFS

HDFS is a distributed file system that is at the core of the Hadoop system. Its key characteristics are:

- Deployed on low-cost commodity hardware
- Fault tolerant
- Supports batch processing
- Designed for large data sets and large files
- Maintains coherence through a write-once, read-many-times model
- Moves computation to the location of the data


MongoDB

MongoDB is a document-oriented database. It came into existence as a NoSQL database. In this application it stores the results of the analyses for later presentation, as in the sketch below.
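
As a minimal sketch of how an analysis result could be written to MongoDB from Scala, assuming the MongoDB Java sync driver is on the classpath; the database, collection, and field names are illustrative, not the project's actual schema:

// Store one (responseCode, count) pair produced by the Spark analysis.
import com.mongodb.client.MongoClients
import org.bson.Document

val client = MongoClients.create("mongodb://localhost:27017")
val collection = client.getDatabase("weblogs").getCollection("responseCodes")

collection.insertOne(new Document("code", 200).append("count", 591))
client.close()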


Apache Flume

Flume is an open source tool for handling streaming logs or data. It is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store. It is a popular tool to assist with data flow and storage into HDFS. Flume is not restricted to log data. The data sources are customizable, so the source might be event data, traffic data, social media data, or any other data source. The major components of Flume are:

- Event
- Agent
- Data Generators
- Centralized Stores


Overall Application Logic

The system reads access logs and presents the results in tabular and graphical form to end users. The system provides the following major functions:

1. Calculate content size
2. Count response codes
3. Analyze requesting IP addresses
4. Manage endpoints


Technical Plan for the Application

Technically, the project follows this structure:

1. Flume takes the streaming log from the running application server and stores it in HDFS. Flume uses compression to store the huge log files, which speeds up data transfer and improves storage efficiency.

2. Apache Spark uses HDFS as the input source and analyzes the data using MLlib. Apache Spark stores the analyzed data in MongoDB.

3. A RESTful Java service fetches the results from MongoDB and presents them as JSON objects to the front end. Graphical tools are used to present the data.


Scala Spark code for log analysis

Note: This application is written in the Scala language. Below is the operative part of the code. Visit the GitHub link below for the complete Scala code for this application.

// Calculates the size of the log entries, and provides the min, max, and average size.
// Caching is done for repeatedly used RDDs.
def calcContentSize(log: RDD[AccessLogs]) = {
  val size = log.map(log => log.contentSize).cache()
  val average = size.reduce(_ + _) / size.count()
  println("Content Size :: Average :: " + average +
    " || Maximum :: " + size.max() + " || Minimum :: " + size.min())
}

// Sends all the response codes with their frequency of occurrence as output.
def responseCodeCount(log: RDD[AccessLogs]) = {
  val responseCount = log.map(log => (log.responseCode, 1))
    .reduceByKey(_ + _)
    .take(1000)
  println(s"""Response Codes Count: ${responseCount.mkString("[", ",", "]")}""")
}

// Filters IP addresses that have more than one request in the server log.
def ipAddressFilter(log: RDD[AccessLogs]) = {
  val result = log.map(log => (log.ipAddr, 1))
    .reduceByKey(_ + _)
    .filter(count => count._2 > 1)
    // .map(_._1).take(10)
    .collect()
  println(s"IP Addresses Count :: ${result.mkString("[", ",", "]")}")
}
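
The fourth function listed under the application logic, endpoint management, follows the same pattern as the functions above. The excerpt does not include it, so here is a hedged sketch in the same style; the function name is illustrative, while the endPoint field comes from the input field list.

// Counts requests per endpoint and keeps the ten most requested ones.
def endpointCount(log: RDD[AccessLogs]) = {
  val topEndpoints = log.map(log => (log.endPoint, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
    .take(10)
  println(s"EndPoints :: ${topEndpoints.mkString("[", ",", "]")}")
}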


Sample Log Data

Sample Input Data:

Input fields (selected fields):

Certain fields have been omitted to keep the code clear. The response code is the basis of the major reports.

1. ipAddress: String
2. dateTime: String
3. method: String
4. endPoint: String
5. protocol: String
6. responseCode: Long
7. contentSize: Long

Sample input rows of data:

64.242.88.10 [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

64.242.88.10 [07/Mar/2014:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

64.242.88.10 [07/Mar/2014:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

64.242.88.10 [07/Mar/2014:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352

64.242.88.10 [07/Mar/2014:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253

64.242.88.10 [07/Mar/2014:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382
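
The excerpt above does not show how each raw line is turned into the AccessLogs structure used by the analysis functions. A minimal parsing sketch is given here; the case class layout matches the field list, but the regular expression and the function name are assumptions for illustration, not the project's actual parsing code.

// Hypothetical case class matching the selected input fields.
case class AccessLogs(ipAddr: String, dateTime: String, method: String,
                      endPoint: String, protocol: String,
                      responseCode: Long, contentSize: Long)

// One capture group per field in the sample rows shown above.
val logPattern = """^(\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)""".r

def parseLine(line: String): Option[AccessLogs] = line match {
  case logPattern(ip, date, method, endpoint, protocol, code, size) =>
    Some(AccessLogs(ip, date, method, endpoint, protocol, code.toLong, size.toLong))
  case _ => None // skip malformed lines
}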


Sample Output of Web Log Analysis

Content Size :: Average :: 10101 || Maximum :: 138789 || Minimum :: 0

Response Codes Count: [(401,113), (200,591), (302,1)]

IP Addresses Count :: [(127.0.0.1,31), (207.195.59.160,15), (67.131.107.5,3), (203.147.138.233,13), (64.242.88.10,452), (10.0.0.153,188)]

EndPoints :: [(/wap/Project/login.php,15), (/cgi-bin/mailgraph.cgi/mailgraph_2.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0.png,12), (/wap/Project/loginsubmit.php,12), (/cgi-bin/mailgraph.cgi/mailgraph_2_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_0_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_1_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3_err.png,12), (/cgi-bin/mailgraph.cgi/mailgraph_3.png,12)]

Intermediate data is stored in the Hadoop File System in CSV format.

To see detailed code, visit: https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzer.scala

This web log analyzer can be enhanced in many ways. For example, it can analyze the history of logs from previous years and discover web access trends. The application can also be made to move data older than 5 years into permanent backup storage.


Conclusion and Findings

There are more than 100 technologies in and around the Apache ecosystem. The most basic is the MapReduce technique used by the Hadoop engine. Many stacks are available on top of MapReduce. It is important to incorporate the right set of elements to develop the right stack for a particular large-scale data analytics need. A few powerful technologies like HDFS, Spark, Hive, MongoDB, and Flume/Kafka are likely to make a big data application powerful and worthy.

It is also useful to experiment with many other technologies during the development of this log analyzer. Flume and Kafka are the most powerful tools to handle streaming data. Spark has its own streaming API, but it is not easy to integrate with HDFS storage. Developing this application also helps one learn Linux-based tasks and shell scripts, along with some data handling tools like AWK and the stream editor (sed).

This application reduces the burden of manually handling logs on database, application, or history servers. Moreover, it helps present the analyzed data in an impressive way that leads to easy decision making. This application came into development after much research on big data tools such as Apache Spark, which saved a lot of time and cost later. It was developed using agile development practices.


Review Questions

Q1: Describe the advantages of a web log analyzer.

Q2: Describe the major challenges in developing this application.

Q3: Check out the references in this chapter. Identify 3-4 major lessons learned from the code and video.


Chapter 11: Data Mining Primer

Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.

Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes knowledge of data quality and data organization from the databases area. It draws modeling and analytical techniques from statistics and computer science (artificial intelligence). It also draws knowledge of decision-making from the field of business management.

The field of data mining emerged in the context of pattern recognition in defense, such as identifying friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain a competitive advantage in business.

For example, "customers who buy cheese and milk also buy bread 90 percent of the time" would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, "people with blood pressure greater than 160 and an age greater than 65 are at a high risk of dying from a heart stroke" is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.

Past data can be of predictive value in many complex situations, especially where the pattern may not be easily visible without a modeling technique. Here is a dramatic case of a data-driven decision-making system that beat the best human experts. Using past data, a decision tree model was developed to predict the votes of Justice Sandra Day O'Connor, who held the swing vote in a 5-4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time. (Source: Martin et al. 2004)


Gathering and selecting data

To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies to consolidate and integrate data elements from many sources.

Gathering and curating data takes time and effort, particularly when it is unstructured or semi-structured. Unstructured data can come in many forms, like databases, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put in rectangular data shapes with clear columns and rows, before submitting it to data mining.

Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Selected data could also be gathered from the data warehouse. Every industry and business function has its own requirements and constraints. The healthcare industry will provide a different type of data with different data names, and the HR function will provide different kinds of data. There will be different issues of quality and privacy for these data.


Data cleansing and preparation

The quality of data is critical to the success and value of a data mining project. Otherwise the situation will be one of garbage in, garbage out (GIGO). The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the control of the business, and is less likely to be reliable.

Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed before it is ready for analysis: filling in missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up 60-80% of the time needed for a data mining project.


Outputs of Data Mining

Data mining techniques can serve different types of objectives. The outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.

One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate modes of representing the output.

The output can be in the form of a regression equation or mathematical function that represents the best-fitting curve for the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises. They are also a good representation of forecasting formulae.

A population "centroid" is a statistical measure for describing the central tendencies of a collection of data points. Centroids might be defined in a multidimensional space. For example, a centroid could be "middle-aged, highly educated, high-net-worth professionals, married with two children, living in the coastal areas", or a population of "20-something, ivy-league-educated tech entrepreneurs based in Silicon Valley", or a collection of "vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection". These are typical representations of the output of a cluster analysis exercise.

Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example: those who buy milk and bread will also buy butter (with 80 percent probability).


Evaluating Data Mining Results

There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy.

Predictive Accuracy = (Correct Predictions) / (Total Predictions)

Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. Using a relevant set of variables and data instances, a decision tree model has been created. The model is then used to predict other data instances. When a truly positive data point is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a truly negative data point is classified as negative, that is a true negative (TN). On the other hand, when a truly positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a truly negative data point is classified as positive, that is called a false positive (FP). This is represented using the confusion matrix (Figure 11.1).

Confusion Matrix                          True Class
                                Positive               Negative
Predicted Class   Positive      True Positive (TP)     False Positive (FP)
                  Negative      False Negative (FN)    True Negative (TN)

Figure 11.1: Confusion Matrix

Thus the predictive accuracy can be specified by the following formula:

Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)
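
As a small illustration, the formula can be computed directly from the four confusion-matrix counts. This is a minimal sketch with made-up numbers, not results from the book's examples.

// Predictive accuracy from the four confusion-matrix counts.
def predictiveAccuracy(tp: Long, tn: Long, fp: Long, fn: Long): Double =
  (tp + tn).toDouble / (tp + tn + fp + fn)

// Example: 85 true positives, 90 true negatives, 10 false positives, 15 false negatives.
println(predictiveAccuracy(85, 90, 10, 15)) // prints 0.875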


All classification techniques have a predictive accuracy associated with a predictive model. The highest possible value is 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.

There are no good objective measures to judge the accuracy of unsupervised learning techniques such as cluster analysis. There is no single right answer for the results of these techniques. For example, the value of a segmentation model depends upon the value the decision-maker sees in those results.


Data Mining Techniques

Data may be mined to help make more efficient decisions in the future. Or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 11.2).

Data Mining Techniques
- Supervised Learning (predictive ability based on past data)
  - Classification (Machine Learning): Decision Trees, Neural Networks
  - Classification (Statistics): Regression
- Unsupervised Learning (exploratory analysis to discover patterns)
  - Clustering Analysis
  - Association Rules

Figure 11.2: Important Data Mining Techniques

The most important class of problems solved using data mining is classification problems. Classification techniques are called supervised learning, as there is a way to supervise whether the model is providing the right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data from past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.

Decision trees are the most popular data mining technique, for many reasons:

1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.

2. Decision trees automatically select the most relevant variables out of all the available variables for decision making.

3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.

4. Even non-linear relationships can be handled well by decision trees.

There are many algorithms to implement decision trees. Some of the popular ones are C5, CART, and CHAID.
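
Since the case study's technology stack already includes Spark MLlib, a decision tree classifier can also be trained that way. The following is a rough sketch only, assuming the training data is already prepared in LibSVM format; the path and parameter values are illustrative.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load labeled data (hypothetical path), then split into training and test sets.
val data = MLUtils.loadLibSVMFile(sc, "data/labeled_points.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Train a binary decision tree classifier.
val model = DecisionTree.trainClassifier(training, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
  maxDepth = 5, maxBins = 32)

// Predictive accuracy on the held-out test set.
val accuracy = test.map(p => (model.predict(p.features), p.label))
  .filter { case (pred, label) => pred == label }
  .count().toDouble / test.count()
println(s"Test accuracy: $accuracy")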

Regression is the most popular statistical data mining technique. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation will fit the data well, with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using this equation. The accuracy of the regression model depends entirely upon the data set used, and not at all on the algorithm or tools used. A small sketch of fitting a straight line follows.
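
For illustration only, here is a minimal ordinary least-squares fit of a straight line in plain Scala; the temperature and consumption numbers are made up.

// Fit y = a + b*x by ordinary least squares.
def fitLine(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
  val n = xs.size
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  val b = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
          xs.map(x => (x - meanX) * (x - meanX)).sum
  val a = meanY - b * meanX
  (a, b)
}

// Hypothetical daily temperature (F) versus energy consumption (kWh).
val temps = Seq(60.0, 70.0, 80.0, 90.0)
val usage = Seq(30.0, 35.0, 45.0, 50.0)
val (a, b) = fitLine(temps, usage)
println(s"Predicted usage at 75 F: ${a + b * 75}") // 40.0 for this made-up data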

Artificial Neural Networks (ANN) are a sophisticated data mining technique from the artificial intelligence stream of computer science. They mimic the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron, and the result may be communicated quickly. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.

Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for the automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. Any number of clusters could be produced from the data. The K-means technique is a popular clustering technique, and allows the user guidance in selecting the right number (K) of clusters from the data. Clustering is also known as the segmentation technique, and it helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. The centroid definition is used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
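
As a rough sketch of K-means with Spark MLlib (again leaning on the stack already used in the case study), assuming each line of an input file holds space-separated numeric features; the path, K, and the sample point are illustrative.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Build feature vectors from a text file (hypothetical path).
val points = sc.textFile("data/customer_features.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

// Ask for 3 clusters, iterating at most 20 times.
val model = KMeans.train(points, 3, 20)

// The output is the centroid of each cluster, plus a way to assign new points.
model.clusterCenters.foreach(println)
println(model.predict(Vectors.dense(35.0, 120000.0))) // assumes two features, e.g. age and income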

Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps answer questions about cross-selling opportunities. This is the heart of the personalization engine used by ecommerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X -> Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beer.
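
A hedged sketch of frequent-itemset and rule mining with Spark MLlib's FPGrowth follows; the baskets, support, and confidence thresholds are made up for illustration, and FPGrowth is one possible algorithm, not the only way to produce association rules.

import org.apache.spark.mllib.fpm.FPGrowth

// Each record is one market basket (made-up data).
val baskets = sc.parallelize(Seq(
  Array("milk", "bread", "butter"),
  Array("milk", "bread"),
  Array("bread", "butter"),
  Array("milk", "butter", "cheese")
))

// Find itemsets appearing in at least half of the baskets.
val model = new FPGrowth().setMinSupport(0.5).run(baskets)
model.freqItemsets.collect().foreach(fi =>
  println(fi.items.mkString("[", ",", "]") + " appears " + fi.freq + " times"))

// Rules of the form X -> Y with at least 80 percent confidence.
model.generateAssociationRules(0.8).collect().foreach(rule =>
  println(rule.antecedent.mkString(",") + " -> " + rule.consequent.mkString(",") +
    " (confidence " + rule.confidence + ")"))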

Page 209: Big Data Essentialseniac2017.files.wordpress.com/2017/04/big-data-essential.pdf · No part of this book may be copied or transmitted without written permission. Other books by the

Mining Big Data

As data grows larger and larger, there are a few ways in which analyzing Big Data is different.

From Causation to Correlation

There is more data available than there are theories and tools available to explain it. Historically, theories of human behavior, and theories of the universe in general, have been intuited and tested using limited and sampled data, with some statistical confidence level. Now that data is available in extremely large quantities about many people and many factors, there may be too much noise in the data to articulate and test clean theories. In that case, it may suffice to value co-occurrences or correlations of events as significant, without necessarily establishing strong causation.

From Sampling to the Whole

Pooling all the data together into a single big data system can help discover events that bring about a fuller picture of the situation, and highlight threats or opportunities that an organization faces. Working from the full data set can enable discovering remote but extremely valuable insights. For example, an analysis of the purchasing habits of millions of customers and their billions of transactions at thousands of stores can give an organization a vast, detailed, and dynamic view of the sales patterns in their company, which may not be available from the analysis of small samples of data by each store or region.

From Dataset to Data Stream

A flowing stream has a perishable and unlimited connotation to it, while a data set has a finitude and permanence about it. With any given infrastructure, one can only consume so much data at a time. Data streams are many, large, and fast. Thus one has to choose which of the many streams of data one wants to engage with; it is equivalent to deciding which stream to fish in. The metrics used for the analysis of streams tend to be relatively simple and relate to the time dimension. Most of the metrics are statistical measures such as counts and means. For example, a company might want to monitor customer sentiment about its products. So it could create a social media listening platform that would read all tweets and blog posts about it in real time. This platform would (a) keep a count of positive and negative sentiment messages every minute, and (b) flag any messages that merit attention, such as sending an online advertisement or purchase offer to that customer.
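
A rough sketch of such per-minute counting with Spark Streaming (part of the stack described in Chapter 10) is shown here; the socket source, the keyword lists, and the one-minute window are illustrative assumptions, not a real sentiment classifier.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// One-minute batches over a text stream of social media messages (socket source is illustrative).
val ssc = new StreamingContext(sc, Seconds(60))
val messages = ssc.socketTextStream("localhost", 9999)

// Very crude keyword-based sentiment split, purely for illustration.
val positive = messages.filter(m => m.contains("love") || m.contains("great"))
val negative = messages.filter(m => m.contains("hate") || m.contains("awful"))

// (a) counts per minute; (b) negative messages printed so they can be flagged for follow-up.
positive.count().print()
negative.count().print()
negative.print()

ssc.start()
ssc.awaitTermination()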

Page 210: Big Data Essentialseniac2017.files.wordpress.com/2017/04/big-data-essential.pdf · No part of this book may be copied or transmitted without written permission. Other books by the

Data Mining Best Practices

Effective and successful use of data mining requires both business and technology skills. The business aspects help in understanding the domain and the key questions. They also help one imagine possible relationships in the data, and create hypotheses to test. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.

An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 11.3):

1. Business Understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of the proposed model as well as the data sets available and required.


Figure 11.3: CRISP-DM Data Mining cycle

2. Data Understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring for many elements of data from many sources that can help address the hypotheses for solving the problem. Without relevant data, the hypotheses cannot be tested.

3. Data Preparation: The data should be relevant, clean, and of high quality. It is important to assemble a team that has a mix of technical and business skills, who understand the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment with, and add, new data elements from external sources that could help improve predictive accuracy.

4. Modeling: This is the actual task of running many algorithms using the available data to discover whether the hypotheses are supported. Patience is required to continuously engage with the data until it yields good insights. A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms.

5. Model Evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model's predictive accuracy with more test data. When the accuracy has reached a satisfactory level, the model can be deployed.


6. Dissemination and rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time and a setback for establishing and supporting a data-based decision-making culture in the organization. The model should eventually be embedded in the organization's business processes.


Conclusion

Data mining is like diving into rough material to discover a valuable finished nugget. While technique is important, domain knowledge is also important, to provide imaginative solutions that can then be tested with data mining. The business objective should be well understood and always kept in mind, to ensure that the results are beneficial to the sponsor of the exercise.


Review Questions

1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. How is mining Big Data different from traditional data mining?


Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)


Creating Cluster Servers on AWS, Installing Hadoop from Cloudera

The objective of this tutorial is to set up a big data processing infrastructure using cloud computing, and the Hadoop and Spark software.


Step 1: Creating Amazon EC2 Servers

1. Open https://aws.amazon.com/
2. Click on Services.
3. Click on EC2.

You can see the result below once you click on EC2. If you already have a server, you can see the number of running servers, their volumes, and other information.

4. Click on the Launch Instance button.


5. Click on AWS Marketplace.
6. Type Ubuntu in the search text box.
7. Click on the Select button.

8. Ubuntu is free, so you don't have to worry about the software price. Click on the Continue button.


9. Choose the General purpose m1.large instance type and click on Next: Configure Instance Details. (Do not choose the Micro Instance t1.micro; it is free, but it will not be able to handle the installation.)

10. Click on Next: Add Storage.

11. Specify the volume size as 20 GB (the default is 8 GB, but that will not be sufficient) and click on Next: Tag Instance.

12. Type the name cs488-master (this label identifies which server is the master and which are the slaves) and click on Next: Security Group.


13. We need to open our server to the world, including most of the ports, because Cloudera needs to use many ports. Specify the group name. Type: choose Custom TCP Rule; Port Range: 0-65500; Source: Anywhere. Then click on Review Instance.

14. A warning message appears; it only says that we have opened our server to the world, so ignore it for now. Click on the Launch button.

15. Type the key pair name and click on the Download Key Pair button (remember the location of the downloaded file; we need this file to log in to the server). Then click on Launch Instances.


16. Now the master server is created.

Now we need four more servers to make the cluster. For that, we don't need to repeat this process four times; we just increase the number of instances, and we get the four servers.

Now we are going to launch four more servers, which will be the slaves.

Please repeat steps 4-9:

Go to the AWS Marketplace, choose Ubuntu, and select the instance type (General purpose).

17. Type 4 in Number of Instances, which will create the four additional servers for us.

18. Name the servers cs488-slave.

19. Select the previously created security group.


20. It is important to choose the existing key pair for these servers too.

If everything goes well, you should see 5 instances, 5 volumes, 1 key pair, and 1 or 2 security groups.

We have now successfully created 5 servers.


Step 2: Connecting to the servers and installing the Cloudera distribution of Hadoop

First of all, take a note of all your server details: the IP addresses and DNS addresses of the master and slaves.

Master Public DNS Address: ec2-54-200-210-141.us-west-2.compute.amazonaws.com
Master Private IP Address: 172.31.20.82

Slave 1 Private IP: 172.31.26.245
Slave 2 Private IP: 172.31.26.242
Slave 3 Private IP: 172.31.26.243
Slave 4 Private IP: 172.31.26.244

Once you have these recorded, you can connect to the server. If you are using Linux as your operating system, you can use the ssh command from a terminal to connect to it.

Connecting to the server (Windows)

1. Download the SSH software (PuTTY) (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). Also download PuTTYgen to convert our authentication file from .pem to .ppk.


2. Open PuTTYgen and load the authentication file.

Click on Save Private Key.

3. Open PuTTY, type the master public DNS address in Host Name, then click on SSH in the left panel > Auth >> select the recently converted authentication file (.ppk), and finally click on the Open button.


4. Now you will be able to connect to the server. Please type "ubuntu", the default username, to log in to the system.

5. Once you connect, type the following commands into the terminal:
6. sudo aptitude update
7. cd /usr/local/src/
8. sudo wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
9. sudo chmod u+x cloudera-manager-installer.bin
10. sudo ./cloudera-manager-installer.bin


11. There are 4 more steps where you click on Next and Yes for the license agreement. Once you finish the installation, you need to restart the service:

12. sudo service cloudera-scm-server restart

You are now able to connect to Cloudera Manager from your browser. The address will be http://<YOUR PUBLIC DNS SERVER>:7180, e.g., http://ec2-54-200-210-141.us-west-2.compute.amazonaws.com:7180. The default username and password to log in to the system is admin/admin.


Once the server restarts, it will open the login screen again. The same username and password (admin/admin) is used to log in to the system.

13. Click on Launch the Classic Wizard.

14. Click on Continue.


15. Provide all the private IP addresses of the master and slave computers, and click on the Search button.

16. Click on the Continue button.

17. Choose None for SOLR1... and None for IMPAL..., and click on the Continue button.


18. Click on Another User >> type "ubuntu" and select All hosts accept same private key >> upload the authentication file (.pem) and click on the Continue button.

19. Now Cloudera will install the software on each of our servers.

20. Once the installation is complete, click on the Continue button.

21. Once it reaches 100%, click on the Continue button. Do not disconnect the internet or shut down the machine; if the process does not complete, we will need to repeat the whole process. Click on the Continue button.


22. Click on Continue.

23. Choose Core Hadoop and click on the Inspect Role Assignments button.


24. Now, for your master IP, only NameNode should be selected, with DataNode unchecked. This is important for designating the master and slave servers.

25. Now Cloudera will install all the services. For your future use, you can record the username and password of each service. Click on Test Connection.


26. Click on Continue.


27. Now all the installation is complete. You now have 1 master node and 4 data nodes.


28. You should see the dashboard.


Step 3: Word Count using MapReduce

29. Now log in to the master server from PuTTY.
30. Run the following commands:
31. cd ~/
32. mkdir code-and-data
33. cd code-and-data
34. sudo wget https://s3.amazonaws.com/learn-hadoop/hadoop-infiniteskills-richmorrow-class.tgz
35. sudo tar -xvzf hadoop-infiniteskills-richmorrow-class.tgz
36. cd data
37. sudo -u hdfs hadoop fs -mkdir /user/ubuntu
38. sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
39. hadoop fs -put shakespeare shakespeare-hdfs
40. hadoop version
41. hadoop fs -ls shakespeare-hdfs
42. sudo hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar wordcount shakespeare-hdfs wordcount-output
43. hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar sleep -m 10 -r 10 -mt 20000 -rt 20000


Appendix 2: Spark Installation and Tutorial

This tutorial will help install Spark and get it running on a standalone machine. It will then help develop a simple analytical application using the Scala language.


Step 1: Verifying the Java Installation

Java is one of the mandatory prerequisites for installing Spark. Try the following command to verify the Java version.

$ java -version

If Java is already installed on your system, you get to see the following response:

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, then install Java before proceeding to the next step.


Step 2: Verifying the Scala Installation

Verify the Scala installation using the following command.

$ scala -version

If Scala is already installed on your system, you get to see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don't have Scala installed on your system, then proceed to the next step for Scala installation.


Step 3: Downloading Scala

Download the latest version of Scala from the Scala download page. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.


Step 4: Installing Scala

Follow the steps given below for installing Scala.

Extract the Scala tar file

Type the following command to extract the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move the Scala software files

Use the following commands to move the Scala software files to their respective directory (/usr/local/scala).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala

Use the following command to set the PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying the Scala installation

After installation, it is better to verify it. Use the following command to verify the Scala installation.

$ scala -version

If Scala is installed correctly on your system, you get to see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL


Step 5: Downloading Spark

Download the latest version of Spark from the Spark download page. This tutorial uses spark-1.3.1-bin-hadoop2.6. After downloading it, you will find the Spark tar file in your download folder.
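As with Scala, the release can be fetched from the command line. The Apache archive URL below is an illustrative assumption and may change:

$ wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz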


Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extracting the Spark tar file

Use the following command to extract the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving the Spark software files

Use the following commands to move the Spark software files to their target directory (/usr/local/spark).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file.

$ source ~/.bashrc
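For reference, a minimal ~/.bashrc addition might look like the sketch below. Setting SPARK_HOME is an optional extra (not required by this tutorial), but several tools expect it:

# Spark environment (assumes Spark was moved to /usr/local/spark as above)
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin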


Step 7: Verifying the Spark Installation

Write the following command to open the Spark shell.

$ spark-shell

If Spark is installed successfully, you will see output similar to the following (the exact version details will vary with your installation):

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to Spark version 1.4.0
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc

scala>
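Once you are at the scala> prompt, a tiny job is a quick way to confirm that the Spark context works. This sanity check is an extra step, not part of the installation itself:

scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000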

Here you can see the video: How to install Spark.

You might encounter a "file specified not found" error when you first install Spark standalone on Windows:


To fix this, you have to set up your JAVA_HOME environment variable.

Step 1: Go to Start -> Run -> Command Prompt (cmd).

Step 2: Determine where your JDK is located; by default it is under C:\Program Files.

Step 3: Select the JDK to use. In this case, we will use JDK 8.

Copy the JDK directory path to your clipboard, go to your command prompt, set JAVA_HOME to that path, and press Enter.


Step 4: Add it to the general PATH as well, and press Enter.

Now go to your Spark folder and run bin\spark-shell.
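Putting the Windows fix together, a command-prompt session might look like the sketch below. The JDK and Spark paths are examples only and must be replaced with your actual locations; set affects only the current window, while setx would make the change persistent:

REM Set JAVA_HOME and extend PATH (example paths)
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_71
set PATH=%PATH%;%JAVA_HOME%\bin
REM Launch the Spark shell from the Spark folder
cd C:\spark-1.5.1-bin-hadoop2.6
bin\spark-shell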

You have installed Spark. Let's try to use it.


Step 8: Application: Word Count in Scala

Now we will do an example of word count in Scala:

// Word count in Scala, typed at the spark-shell prompt
val textFile = sc.textFile("hdfs://…")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
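To peek at a few of the resulting (word, count) pairs at the shell without writing anything out, you can take a small sample; this line is a convenience added here, not part of the original example:

counts.take(10).foreach(println)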

NOTE: If you are working on a stand-alone Spark installation, the counts.saveAsTextFile("hdfs://…") command will give you a NullPointerException error.

Solution: use counts.coalesce(1).saveAsTextFile() instead.
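Put together, a stand-alone-friendly version of the last line might look like this; the local output path is hypothetical, and the target directory must not already exist:

counts.coalesce(1).saveAsTextFile("file:///tmp/wordcount-output")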

To implement a word cloud, we can use R in our Spark console. However, if you click on SparkR straight away, you will get an error.

To fix this:

Step 1: Set up the environment variables. In the PATH variable, add your Spark paths, for example: ;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\sbin;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin

Step 2: Install the R software and RStudio. Then add the path of the R software to the PATH variable as well, for example: ;C:\Program Files\R\R-3.2.2\bin\x64\ (remember, each path that you add must be separated by a semicolon, with no spaces).

Step 3: Run the command prompt as an administrator.

Step 4: Now execute the command "SparkR" from the command prompt. If successful, you should see the message "Spark context is available…".


If your path is not set correctly, you can alternatively navigate to the location where you downloaded SparkR, in this case C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin, and execute the "SparkR" command there.

Step 5: Configuration inside RStudio to connect to Spark.

Execute the following three commands in RStudio every time:

# Here we are setting up the SPARK_HOME environment variable
Sys.setenv(SPARK_HOME = "C:/spark-1.5.1-bin-hadoop2.6/spark-1.5.1-bin-hadoop2.6")

# Set the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# Loading the SparkR library
library(SparkR)

If you see the SparkR startup message, then you are all set to start working with SparkR.
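When SparkR is loaded inside RStudio this way (rather than through the SparkR launcher script), you may also need to create the Spark context yourself before running any Spark commands. A minimal sketch using the Spark 1.5.x SparkR API assumed by this tutorial (the master and appName values are examples):

# Initialize a local Spark context and an SQL context (Spark 1.x SparkR API)
sc <- sparkR.init(master = "local[*]", appName = "SparkR-wordcloud")
sqlContext <- sparkRSQL.init(sc)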

Now let's start coding in R:
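The word cloud example that follows uses the tm, wordcloud, and RColorBrewer packages rather than SparkR functions. Assuming those packages are installed (install.packages() can fetch them), load them first; the temp/ directory below is simply a folder containing the plain-text files to analyze:

# Text-mining and plotting packages used by the word cloud example
library(tm)
library(wordcloud)
library(RColorBrewer)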


lords <- Corpus(DirSource("temp/"))

To see what is in that corpus, type the command

inspect(lords)

This should print out the contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:

lords <- tm_map(lords, stripWhitespace)
lords <- tm_map(lords, content_transformer(tolower))
lords <- tm_map(lords, removeWords, stopwords("english"))
lords <- tm_map(lords, stemDocument)

The tm_map function comes with the tm package. The various commands are self-explanatory: strip unnecessary whitespace, convert everything to lowercase (otherwise the word cloud might highlight capitalised words separately), remove common English words like 'the' (the so-called 'stop words'), and carry out text stemming for the final tidy-up. Depending on what you want to achieve, you could also explicitly remove numbers and punctuation with the removeNumbers and removePunctuation transformations, as shown below.
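For example, those two optional clean-up steps would look like this:

lords <- tm_map(lords, removeNumbers)
lords <- tm_map(lords, removePunctuation)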

It is possible that you may get error messages while executing some of these commands, for example about missing packages. If so, install the missing packages (for instance with install.packages()) and repeat the command.

If all is well, then you should now be ready to create your first word cloud! Try this:

wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
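If you want to keep the result, the plot can be written to an image file instead of the screen. This is standard R graphics usage rather than anything Spark-specific, and the file name is an example:

png("wordcloud.png", width = 800, height = 600)
wordcloud(lords, scale = c(5, 0.5), max.words = 100, random.order = FALSE,
          rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
dev.off()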


Additional Resources

Here are some other books, papers, videos, and other resources for a deeper dive into the topics covered in this book.

1. Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.

2. McKinsey Global Institute Report (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey.com

3. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail but Some Don't. Penguin Press.

4. Matei Zaharia, et al. (2010). "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." University of California, Berkeley.

5. Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.

Websites:

6. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
7. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
8. Hadoop API site: http://hadoop.apache.org/docs/current/api/
9. Apache Spark: http://spark.apache.org/docs/latest/

10. https://www.biostat.wisc.edu/~kbroman/Rintro/Rwinpack.html
11. http://robjhyndman.com/hyndsight/building-r-packages-for-windows/
12. https://stevemosher.wordpress.com/ten-steps-to-building-an-r-package-under-windows/
13. http://www.inside-r.org/packages/cran/wordcloud/docs/wordcloud
14. https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
15. https://intellipaat.com/tutorial/spark-tutorial/
16. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50
17. https://en.wikipedia.org/wiki/NoSQL


18. http://www.planetcassandra.org/what-is-apache-cassandra/
19. http://www.datastax.com/nosql
20. https://www.sitepen.com/blog/2010/05/11/nosql-architecture/
21. http://nosql-database.org/
22. http://webpages.uncc.edu/xwu/5160/nosqldbs.pdf

Video Resources

23. Doug Cutting on 'Hadoop at 10': https://www.youtube.com/watch?v=yDZRDDu3CJo
24. Status of the Apache community: https://www.youtube.com/watch?v=sOZnf8Nn3Fo
25. Spark 2.0 updates, showing a nice demo across R, Scala, and SQL, using tweets and clustering: https://www.youtube.com/watch?v=9xSz0ppBtFg
26. https://www.youtube.com/watch?v=VwiGHUKAHWM
27. https://www.youtube.com/watch?v=L5QWO8QBG5c
28. https://www.youtube.com/watch?v=KvQto_b3sqw
29. https://www.youtube.com/watch?v=YW28qItH_tA


About the Author

Dr. Anil Maheshwari is a Professor of Computer Science and Information Systems, and the Director of the Center for Data Analytics, at Maharishi University of Management. He teaches courses in data analytics, and helps organizations extract deep insights from their data. He worked in a variety of leadership roles at IBM in Austin, TX, and has also worked at many other companies, including startups.

He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from the Indian Institute of Technology in Delhi, an MBA from the Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of the Transcendental Meditation technique.

He is the author of the #1 bestseller Data Analytics Made Accessible.

He blogs about interesting things on IT and Enlightenment at anilmah.com.

Instructors can reach him for course materials at akm2030@gmail.com. Speaking engagements are welcome.