Big Data with Java
Marton Elek, March 2017

Big Data with Java - konferenciak.advalorem.hukonferenciak.advalorem.hu/uploads/files/Oracle Java konferencia... · Hadoop at Scale • Yahoo –34000 nodes, 478 PB • eBay –10000


© Hortonworks Inc. 2011–2017. All Rights Reserved

Hortonworks Data Platform

→ Collection of fully open source Apache projects


Hadoop at Scale

• Yahoo – 34000 nodes, 478 PB
• eBay – 10000 nodes, 150 PB
• LinkedIn – 5000 nodes
• Twitter – 3500 nodes, 30 to 50 PB
• Spotify – 700 nodes, 15 PB of data
• Facebook – Thousands


Apache Hadoop

Collection of multiple subprojects:
→ HDFS
  – Distributed file system
→ YARN
  – Distributed processing framework and cluster management
→ MAPREDUCE
  – MapReduce framework to write calculations in a distributed environment


Apache HDFS – Hadoop Distributed File System

• Very large scale distributed file system
  • 10K nodes, tens of millions of files and petabytes of data
  • Supports large files
• Designed to run on commodity hardware, assumes hardware failures
  • Files are replicated to handle hardware failure
  • Detects failures and recovers from them automatically
• Optimized for batch processing
  • Data locations are exposed so that computations can move to where the data resides
• Data coherency
  • Write-once, read-many-times access pattern
  • Appending is supported for existing files
• Files are broken up into chunks called ‘blocks’
  • Blocks are distributed over nodes
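The block and replication ideas above can be sketched in a few lines of plain Java. This is a toy model, not HDFS code: the class name, the node names, the 128 MB block size and the replication factor of 3 are assumptions chosen for illustration, and the round-robin placement stands in for the real, rack-aware policy that the NameNode applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: real HDFS placement is rack-aware and decided by the NameNode.
class BlockPlacementSketch {

    // Number of blocks needed to store a file of the given size (ceiling division).
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Picks `replication` distinct nodes for every block, round-robin over the node list.
    static List<List<String>> place(long blocks, List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("not enough nodes for the requested replication");
        }
        List<List<String>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(nodes.get((int) ((b + r) % nodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size
        long fileSize = 300L * 1024 * 1024;  // hypothetical 300 MB file
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");

        List<List<String>> placement = place(blockCount(fileSize, blockSize), nodes, 3);
        for (int b = 0; b < placement.size(); b++) {
            System.out.println("block " + b + " -> " + placement.get(b));
        }
    }
}
```

Because every block lives on several nodes, losing one node still leaves the other copies readable, which is the property the replication bullets describe.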


HDFS Architecture (Master-Slave)


HDFS: Key Services

• NameNode
  • Master service
  • Manages the file system namespace
  • Single service across the cluster (HA can be enabled)
  • Regulates access to files by clients
  • Maps a file name to a set of blocks
  • Maps a block to the DataNodes where it resides
  • Replication engine for blocks
• DataNode
  • Slave service. Runs on slave nodes
  • Block server
  • Manages block read/write for HDFS, stores data in the local file system
  • Periodically sends a report of all existing blocks to the NameNode
  • Pings the NameNode for instructions
  • If the heartbeat fails, the DataNode is removed from the cluster and the replicated blocks take over
• Standby NameNode
  • Merges the NameNode’s file system image and edit logs
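The two NameNode mappings and the heartbeat rule can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the NameNode implementation: the class and method names are hypothetical, block reports double as heartbeats here, and the timeout value is made up.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of the metadata a NameNode keeps in memory.
class NameNodeSketch {
    private final Map<String, List<String>> fileToBlocks = new HashMap<>();
    private final Map<String, Set<String>> blockToDataNodes = new HashMap<>();
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Namespace entry: file name -> ordered list of block ids.
    void addFile(String file, List<String> blocks) {
        fileToBlocks.put(file, new ArrayList<>(blocks));
    }

    // A DataNode reports the blocks it stores; the report also counts as a heartbeat.
    void blockReport(String dataNode, Collection<String> blocks, long now) {
        lastHeartbeat.put(dataNode, now);
        for (String block : blocks) {
            blockToDataNodes.computeIfAbsent(block, k -> new TreeSet<>()).add(dataNode);
        }
    }

    // Removes DataNodes whose last heartbeat is older than the timeout; returns the removed nodes.
    Set<String> expireDeadNodes(long now, long timeoutMillis) {
        Set<String> dead = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        for (String node : dead) {
            lastHeartbeat.remove(node);
            for (Set<String> holders : blockToDataNodes.values()) {
                holders.remove(node);
            }
        }
        return dead;
    }

    // For each block of the file, in order, the DataNodes that still hold a replica.
    List<Set<String>> locate(String file) {
        List<Set<String>> result = new ArrayList<>();
        for (String block : fileToBlocks.getOrDefault(file, Collections.emptyList())) {
            result.add(blockToDataNodes.getOrDefault(block, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch();
        nn.addFile("/logs/a.txt", java.util.Arrays.asList("blk_1", "blk_2"));
        nn.blockReport("dn1", java.util.Arrays.asList("blk_1", "blk_2"), 0L);
        nn.blockReport("dn2", java.util.Arrays.asList("blk_1", "blk_2"), 9_000L);
        System.out.println("removed: " + nn.expireDeadNodes(10_000L, 5_000L));
        System.out.println("locations: " + nn.locate("/logs/a.txt"));
    }
}
```

A real cluster would also schedule new replicas for blocks that fell below the target replication; that step is left out to keep the sketch short.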


Cluster Topology

[Figure: cluster topology. Labels: HDFS Client; Master Services (NameNode, ResourceManager, HBase Master, etc.); Slave Services (DataNode, NodeManager, RegionServer); three racks holding the NameNode, Secondary NameNode, other master services and DataNodes.]


YARN

→ How to execute any job on multiple machines?
  – Cluster management
  – Distributed processing framework
  – Goal: execute an application on multiple machines
    • Manage available resources (CPU, memory)
    • Use different scheduling algorithms (CapacityScheduler, FairScheduler)
→ Components
  – Resource manager (1, 2… instances):
    • Manages the application requests, schedules applications, …
  – Node manager (∞ instances):
    • Executes the scheduled application

http://ercoppa.github.io/HadoopInternals/HadoopArchitectureOverview.html


Transition from Hadoop 1 to Hadoop 2

HADOOP 1.0: Single Use System (Batch Apps)
  – MapReduce (cluster resource management & data processing)
  – HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
  – MapReduce (data processing), Others (data processing)
  – YARN (cluster resource management)
  – HDFS2 (redundant, reliable storage)


YARN Architecture

• Cluster Operating System
  • Enables Generic Data Processing Tasks with ‘Containers’
  • Big Compute (Metal Detectors) for Big Data (Hay Stack)
• Resource Manager
  • Global resource scheduler
• Node Manager
  • Per-machine agent
  • Manages the life-cycle of containers & resource monitoring
• Application Master
  • Per-application master that manages application scheduling and task execution
  • E.g. MapReduce Application Master
• Container
  • Basic unit of allocation
  • Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
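To make "container as the basic unit of allocation" concrete, here is a minimal sketch of a scheduler handing out containers against per-node capacity. It is not YARN code: the class names, node names and capacities are hypothetical, only memory and vcores are modelled, and the first-fit strategy is a deliberately simple stand-in for the queue-based CapacityScheduler and FairScheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Toy first-fit allocator; YARN's real schedulers work with queues, priorities and locality.
class ContainerAllocationSketch {

    // Free capacity of one machine, as a per-machine agent might report it.
    static final class Node {
        final String name;
        int freeMemoryMb;
        int freeVcores;

        Node(String name, int freeMemoryMb, int freeVcores) {
            this.name = name;
            this.freeMemoryMb = freeMemoryMb;
            this.freeVcores = freeVcores;
        }
    }

    // Reserves a container on the first node with enough free memory and vcores.
    // Returns the node name, or null when no node can host the container.
    static String allocate(List<Node> nodes, int memoryMb, int vcores) {
        for (Node n : nodes) {
            if (n.freeMemoryMb >= memoryMb && n.freeVcores >= vcores) {
                n.freeMemoryMb -= memoryMb;
                n.freeVcores -= vcores;
                return n.name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("nm1", 4096, 2)); // hypothetical capacities
        nodes.add(new Node("nm2", 8192, 4));

        System.out.println(allocate(nodes, 3072, 1));
        System.out.println(allocate(nodes, 2048, 1));
        System.out.println(allocate(nodes, 16384, 1));
    }
}
```

In the terms of the slide, an Application Master would issue requests like these to the Resource Manager, and the Node Manager on the chosen machine would then launch and monitor the container.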


YARN: what is it good for?

• Compute for Data Processing
• Compute for Embarrassingly Parallel Problems
  • Problems with tiny datasets and/or that don’t depend on one another
  • E.g. Exhaustive Search, Trade Simulations, Climate Models, Genetic Algorithms
• Beyond MapReduce
  • Enables Multi Workload Compute Applications on a Single Shared Infrastructure
  • Stream Processing, NoSQL, Search, InMemory, Graphs, etc.
  • ANYTHING YOU CAN START FROM CLI!
• Slider & Code Reuse
  • Run existing applications on YARN: HBase on YARN, Storm on YARN
  • Reuse existing Java code in Containers, making serial applications parallel


Hadoop MapReduce

→ “Hadoop MapReduce is a software framework for easily writing applications which
  – process vast amounts of data (multi-terabyte data-sets)
  – in-parallel
  – on large clusters (thousands of nodes) of commodity hardware
  – in a reliable, fault-tolerant manner.”
→ Connection: MapReduce jobs are
  – Scheduled on YARN
  – Using data from HDFS
→ A MapReduce job usually
  – splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
  – The framework sorts the outputs of the maps,
  – which are then input to the reduce tasks.
  – Typically both the input and the output of the job are stored in a file-system. (HDFS input/output format)
  – The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.


MapReduce example – word count

→ Raw data
  – Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley….
→ Map
  – Lorem: 1
  – Ipsum: 1
  – is: 1
  – simply:
→ Shuffle
  – Lorem: [1]
  – is: [1,1,1]
→ Reduce
  – Lorem: 1
  – is: 3
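The three phases can be tried out without a cluster. The sketch below runs map, shuffle and reduce in memory on a single machine using only the Java standard library. The class and method names are hypothetical and nothing here uses the Hadoop API; in a real job the framework performs the shuffle and distributes the map and reduce calls across nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-machine illustration of the map, shuffle and reduce phases of word count.
class WordCountPhasesSketch {

    // Map: emit (word, 1) for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle: group all emitted values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String line = "to be or not to be"; // hypothetical input
        System.out.println(reduce(shuffle(map(line))));
    }
}
```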


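// Map phase: emits (word, 1) for every whitespace-separated token of the input line.
// Needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text
// and org.apache.hadoop.mapreduce.Mapper.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reused across calls to avoid allocating new objects per token
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}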


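// Reduce phase: sums all counts that the shuffle grouped under one word.
// Needs java.io.IOException, org.apache.hadoop.io.IntWritable,
// org.apache.hadoop.io.Text and org.apache.hadoop.mapreduce.Reducer.
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}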


Apache Spark

→ “Apache Spark is a fast and general engine for large-scale data processing.”
→ Same abstraction for streaming/batch processing (+ Machine Learning, graph processing)
→ Multi-language support:
  – Scala
  – Python
  – R
→ Functional and SQL interfaces
→ Supports multiple execution engines
  – YARN
  – Standalone Spark cluster
→ In-memory cache between the stages


Spark examples


Spark UI


Apache Kafka

→ “Distributed streaming platform”
→ Publish/subscribe to streams of records
→ Store streams in a fault-tolerant way
→ Kafka Connect: API to easily create applications that stream to/from Kafka
→ Kafka Streams: API to do stream processing
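The publish/subscribe model over a stored stream can be illustrated with an append-only list and one read position per subscriber. This is a toy in-memory sketch, not the Kafka client API: the class and method names are hypothetical, and it has no partitions, replication or persistence, so it shows only the idea that records stay in the log and each subscriber tracks its own offset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-topic log: records are appended once and read independently by each subscriber.
class TopicLogSketch {
    private final List<String> records = new ArrayList<>();
    private final Map<String, Integer> offsets = new HashMap<>();

    // Publish: append the record to the log and return its offset.
    int publish(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Poll: return the records this subscriber has not seen yet and advance its offset.
    List<String> poll(String subscriber) {
        int from = offsets.getOrDefault(subscriber, 0);
        List<String> batch = new ArrayList<>(records.subList(from, records.size()));
        offsets.put(subscriber, records.size());
        return batch;
    }

    public static void main(String[] args) {
        TopicLogSketch topic = new TopicLogSketch();
        topic.publish("click:home");
        topic.publish("click:cart");
        System.out.println("analytics reads " + topic.poll("analytics"));
        topic.publish("click:checkout");
        System.out.println("analytics reads " + topic.poll("analytics"));
        System.out.println("audit reads " + topic.poll("audit"));
    }
}
```

Since reading does not remove records, a subscriber that joins late still sees the whole stream from the beginning.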


Hortonworks Data Platform: What’s more

→ What’s more?
→ Key-value store for fast key-based access
  – HBase, Phoenix (SQL interface)
→ Security and governance
  – Knox, Atlas, Ranger
→ Management
  – Ambari, Cloudbreak
→ Streaming
  – Storm, Flume, (Spark, Kafka)


Thank You