A Glimpse of the Hadoop Ecosystem
Hadoop Ecosystem
• A cluster is shared among several users in an organization
• Different services
• HDFS and MapReduce provide the lower layers of the infrastructure
• Other systems "plug" on top of these
  • Easier way to program applications
  • MapReduce and HDFS are "low level"
HBase
• Hadoop database for random read/write access
• HBase is an open source, non-relational, distributed "database"
  • Modeled after Google's BigTable.
  • It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
• In terms of Eric Brewer's CAP theorem, HBase is a CP-type system.
  • CAP: consistency, availability, partition tolerance.
When to use HBase
• Real big data: billions of rows × millions of columns
  • The data cannot be stored on a single node.
• Random read/write access
  • Thousands of operations on big data
• No need for the extra features of an RDBMS such as typed columns, secondary indexes, transactions, advanced query languages, etc.
HDFS | HBase
Good for storing large files | Built on top of HDFS. Good for hosting very large tables (billions of rows × millions of columns)
Write once. Appending to files is possible in some recent versions but not commonly used | Read/write many
No random read/write | Random read/write
No individual record lookup; all data is read instead | Fast record lookup (update)
HBase
• Type of NoSQL database
• HBase is really more a "Data Store" than a "Data Base". It lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.
• Strongly consistent reads and writes
• Automatic sharding (i.e., "horizontal partitioning")
  • HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as data grows
• Automatic RegionServer failover
• Hadoop/HDFS integration
• Massively parallelized processing via MapReduce, using HBase as both source and sink.
• Java API for programmatic access, REST for non-Java front-ends.
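The automatic sharding described above can be pictured with a small Python sketch. It is a toy model and not HBase code: the `Region` and `Table` classes, the `max_rows` threshold, and the "split at the median key" rule are all assumptions made for illustration. HBase itself decides when to split based on region size and a configurable split policy.

```python
import bisect


class Region:
    """A contiguous range of row keys starting at start_key (hypothetical class)."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key      # inclusive lower bound; "" means "from the first key"
        self.rows = dict(rows or {})    # row key -> value


class Table:
    """Toy table that splits a region once it holds more than max_rows rows."""

    def __init__(self, max_rows=4):
        self.max_rows = max_rows        # assumed split threshold (rows, not bytes)
        self.regions = [Region("")]     # always kept sorted by start_key

    def _find_region(self, row_key):
        # The owning region is the last one whose start_key <= row_key.
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key, value):
        idx = self._find_region(row_key)
        region = self.regions[idx]
        region.rows[row_key] = value
        if len(region.rows) > self.max_rows:
            self._split(idx)

    def get(self, row_key):
        return self.regions[self._find_region(row_key)].rows.get(row_key)

    def _split(self, idx):
        # Split at the median key: the lower half stays, the upper half becomes a new region.
        region = self.regions[idx]
        keys = sorted(region.rows)
        mid = keys[len(keys) // 2]
        upper = Region(mid, {k: v for k, v in region.rows.items() if k >= mid})
        region.rows = {k: v for k, v in region.rows.items() if k < mid}
        self.regions.insert(idx + 1, upper)


table = Table(max_rows=4)
for i in range(10):
    table.put(f"row{i:02d}", i)
print([r.start_key for r in table.regions])
```

Because regions cover disjoint, sorted key ranges, a lookup only has to find the one region whose range contains the row key. In a real cluster each region would be served by a RegionServer; that part is left out here.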
# Gets all the data for the row
hbase> get '/user/user01/customer', 'jsmith'
# Limit this to only one column family
hbase> get '/user/user01/customer', 'jsmith', {COLUMNS => ['addr']}
# Limit this to a specific column
hbase> get '/user/user01/customer', 'jsmith', {COLUMNS => ['order:numb']}
# Scan all rows of table 't1'
hbase> scan 't1'
# Specify a time range
hbase> scan 't1', {TIMERANGE => [1303668804, 1303668904]}
# Specify a start row, limit the result to 10 rows, and only return selected columns
hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
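The shell commands above are easier to read with the HBase data model in mind: a table maps a row key to columns named `family:qualifier`, and each cell holds timestamped versions. The following Python sketch imitates `get` and `scan` on an in-memory dictionary. It is not the HBase client API; `ToyTable`, its method names, and the sample data are hypothetical, only the newest version of each cell is returned, and the time range is assumed to be half-open (start inclusive, end exclusive).

```python
class ToyTable:
    """In-memory stand-in for an HBase table (illustration only)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = {}

    def put(self, row, column, value, ts):
        self.rows.setdefault(row, {}).setdefault(column, []).append((ts, value))

    @staticmethod
    def _matches(column, columns):
        # A filter entry is either a whole family ("addr") or a full column ("order:numb").
        if columns is None:
            return True
        family = column.split(":", 1)[0]
        return any(c == column or c == family for c in columns)

    def _read_row(self, row, columns, timerange):
        result = {}
        for column, versions in self.rows.get(row, {}).items():
            if not self._matches(column, columns):
                continue
            if timerange is not None:
                lo, hi = timerange
                versions = [(ts, v) for ts, v in versions if lo <= ts < hi]
            if versions:
                # Keep only the newest version of the cell.
                result[column] = max(versions, key=lambda tv: tv[0])[1]
        return result

    def get(self, row, columns=None):
        return self._read_row(row, columns, None)

    def scan(self, columns=None, startrow=None, limit=None, timerange=None):
        out = []
        for row in sorted(self.rows):            # rows are visited in key order
            if startrow is not None and row < startrow:
                continue
            cells = self._read_row(row, columns, timerange)
            if cells:
                out.append((row, cells))
            if limit is not None and len(out) >= limit:
                break
        return out


t = ToyTable()
t.put("jsmith", "addr:city", "Denver", ts=100)
t.put("jsmith", "order:numb", "A-17", ts=110)
t.put("jsmith", "order:numb", "A-42", ts=120)
t.put("mjones", "addr:city", "Boston", ts=130)

print(t.get("jsmith"))                        # whole row
print(t.get("jsmith", columns=["addr"]))      # one column family
print(t.scan(startrow="k", limit=10))         # rows from "k" onwards
```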
Hive
“The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.”
Hive
• An SQL-like interface to Hadoop.
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query and analysis
• Query execution via MapReduce
  • The Hive interpreter transparently converts queries to MapReduce.
  • But other backends are also supported, e.g., Spark
• Open source, developed by Facebook
• Also used by Netflix, CNET, Digg, eHarmony etc.
SELECT customerId, max(total_cost)
FROM hive_purchases
GROUP BY customerId
HAVING count(*) > 3;
• Word count in Hive
  • Just a curiosity, probably not the typical kind of query
https://en.wikipedia.org/wiki/Apache_Hive
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
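As a rough picture of the work a MapReduce backend has to do for this query, here is the same word count written as explicit map, shuffle and reduce steps in Python. This is a sketch and not the plan Hive actually generates; the function names are hypothetical, and splitting on runs of whitespace with `str.split()` is a simplification of the `split(line, '\s')` call in the query.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, like explode(split(...))."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, which is what GROUP BY word relies on."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts per word and return them sorted, like ORDER BY word."""
    return [(word, sum(counts)) for word, counts in sorted(groups.items())]


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


print(word_count(["a b a", "b c"]))
```

The point of Hive is that the eight lines of HiveQL replace this kind of hand-written pipeline, and on a cluster the map and reduce steps run in parallel over many files.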
YARN
• Yet Another Resource Negotiator
  • YARN Application Resource Negotiator (recursive acronym)
• Remedies the scalability shortcomings of "classic" MapReduce
• A general purpose framework. MapReduce is one application.
MapReduce Limitations
• Scalability
  • Maximum cluster size: 4000 nodes
  • Maximum concurrent tasks: 40000
  • Coarse synchronization in JobTracker
• Single point of failure
  • Failure kills all queued and running jobs
  • Jobs need to be resubmitted by users
  • Restart is tricky due to complex state
For a (short) introduction: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
• YARN splits up the major functions of the JobTracker:
• The ResourceManager has two components: Scheduler and ApplicationsManager.
  • Scheduler: performs no monitoring or tracking of status for the application.
    • Offers no guarantees about restarting failed tasks, whether they fail due to application failure or hardware failure.
    • Performs its scheduling function based on the resource requirements of the applications.
    • Uses the abstract notion of a resource Container (memory, CPU, disk, network etc.)
  • The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application-specific ApplicationMaster.
    • Provides the service for restarting the ApplicationMaster container on failure.
• ApplicationMaster (one per application)
  • Negotiates appropriate resource containers from the Scheduler
  • Tracks their status and monitors progress.
  • Runs as a normal container.
  • Framework-specific library
  • Works with the NodeManager(s) to execute and monitor the tasks.
• NodeManager (NM)
  • A new per-node slave, responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting to the ResourceManager.
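The division of labour above can be sketched as a toy simulation in Python. Every class and method name here is hypothetical and none of it is the YARN API. Resources are reduced to memory and vcores, the Scheduler uses a simple first-fit policy chosen only for illustration, and launching a container is collapsed into the allocation step.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-node agent: launches containers and tracks what is still free."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, memory_mb, vcores):
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        container = Container(self.name, memory_mb, vcores)
        self.containers.append(container)
        return container


class Scheduler:
    """Pure allocator: hands out containers, does no monitoring and no restarts."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        # First fit (an assumption): take the first node with enough free resources.
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                return node.launch(memory_mb, vcores)
        return None  # request cannot be satisfied right now


class ApplicationMaster:
    """One per application; negotiates task containers and tracks them."""

    def __init__(self, scheduler, container):
        self.scheduler = scheduler
        self.container = container  # the AM itself runs in a normal container
        self.tasks = []

    def request_tasks(self, count, memory_mb, vcores):
        for _ in range(count):
            container = self.scheduler.allocate(memory_mb, vcores)
            if container is not None:
                self.tasks.append(container)
        return len(self.tasks)  # total task containers held so far


class ApplicationsManager:
    """Accepts submissions and obtains the first container, for the AM."""

    def __init__(self, scheduler):
        self.scheduler = scheduler

    def submit(self, am_memory_mb=512, am_vcores=1):
        container = self.scheduler.allocate(am_memory_mb, am_vcores)
        if container is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        return ApplicationMaster(self.scheduler, container)


nodes = [NodeManager("n1", 2048, 2), NodeManager("n2", 4096, 4)]
scheduler = Scheduler(nodes)
am = ApplicationsManager(scheduler).submit()
am.request_tasks(3, memory_mb=1024, vcores=1)
print([c.node for c in am.tasks])
```

Note that the Scheduler never looks at what runs inside a container. Tracking progress and reacting to failed tasks is left to the ApplicationMaster, which is what lets frameworks other than MapReduce bring their own logic.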
YARN
• Fault Tolerance and Availability
  • ResourceManager
    • No single point of failure: state saved in ZooKeeper
    • ApplicationMasters are restarted automatically
  • Optional failover via application-specific checkpoint
    • MapReduce applications pick up where they left off via state saved in HDFS
• Scalability
  • 6000-10000 nodes
  • 100000+ concurrent tasks
  • 10000+ jobs
YARN
• Support for paradigms other than MapReduce (multitenancy)
  • HBase on YARN (HOYA), machine learning: Spark, graph processing: Giraph, real-time processing: Storm
• Enabled by allowing the use of a paradigm-specific application master
• Run all on the same Hadoop cluster!
Sources
• Hadoop 2.0 and YARN - Subash D'Souza
• https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
• https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
• http://hbase.apache.org/book.html#arch.overview