23
The Hadoop Stack, Part 2 Introduction to HBase CSE 40822 – Cloud Computing – Fall 2018 Prof. Douglas Thain University of Notre Dame

The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

The Hadoop Stack, Part 2 Introduction to HBase

CSE40822–CloudComputing–Fall2018Prof.DouglasThain

UniversityofNotreDame

Page 2: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Three Case Studies

• Workflow:PigLatin• AdataflowlanguageandexecutionsystemthatprovidesanSQL-likewayofcomposingworkflowsofmultipleMap-Reducejobs.

• Storage:HBase• ANoSQlstoragesystemthatbringsahigherdegreeofstructuretotheflat-filenatureofHDFS.

• Execution:Spark• Anin-memorydataanalysissystemthatcanuseHadoopasapersistencelayer,enablingalgorithmsthatarenoteasilyexpressedinMap-Reduce.

Page 3: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

References

•  FayChangetal,Bigtable:ADistributedStorageSystemforStructuredData,OSDI2006•  http://research.google.com/archive/bigtable.html

•  IntroductiontoHbaseSchemaDesign,AmandeepKhurana,Login;Magazine,2012.•  https://www.usenix.org/publications/login/october-2012-volume-37-number-5/introduction-hbase-schema-design

• ApacheHBaseDocumentation•  http://hbase.apache.org

Page 4: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

From HDFS to HBase

• HDFSprovidesuswithafilesystemconsistingofarbitrarilylargefilesthatcanonlybewrittenonceandareshouldbereadsequentially,endtoend.ThisworksfineforsequentialOLAPquerieslikePig.• OLTPworkloadswanttoreadandwriteindividualcellsinalargetable.(e.g.updateinventoryandpriceasorderscomein.)• HBaseimplementsOLTPinteractionsontopofHDFSbyusingadditionalstorageandmemorytoorganizethetables,andwritingthembacktoHDFSasneeded.• HBaseisanopensourceimplementationoftheBigTabledesignpublishedbyGoogle.

Page 5: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

HBase Data Model

• Adatabaseconsistsofmultipletables.•  Eachtableconsistsofmultiplerows,sortedbyrowkey.•  Eachrowcontainsarowkeyandoneormorecolumnfamilies.•  Eachcolumnfamilyisdefinedwhenthetableiscreated.• Columnfamiliescancontainmultiplecolumns.(family:column)• Acellisuniquelyidentifiedby(table,row,family:column).• Acellcontainsanuninterpretedarrayofbytesandatimestamp.

Page 6: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Data in Tabular Form

Name Home Office

Key First Last Phone Email Phone Email

101 Florian Krepsbach 555-1212 [email protected]

666-1212

[email protected]

102 Marilyn Tollerud 555-1213 666-1213

103 Pastor Inqvist 555-1214 [email protected]

Page 7: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Data in Tabular Form

Name Home Office Social

Key First Middle Last Phone Email Phone Email FacebookID

101 Florian Garfield Krepsbach 555-1212

[email protected]

666-1212

[email protected]

102 Marilyn Tollerud 555-1213

666-1213

103 Pastor Inqvist 555-1214

[email protected]

Newcolumnscanbeaddedatruntime.

Columnfamiliescannotbeaddedatruntime.

Page 8: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Don’t be fooled by the picture: Hbase is really a sparse table.

Page 9: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Nested Data Representation TablePeople(Name,Home,Office){

101:{ Timestamp:T403; Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”}, Home:{Phone=“555-1212”,Email=“[email protected]”}, Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{ Timestamp:T593; Name:{First=“Marilyn”,Last=“Tollerud”}, Home:{Phone=“555-1213”}, Office:{Phone=“666-1213”}},…

}

Page 10: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Nested Data Representation GETPeople:101

101:{ Timestamp:T403; Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”}, Home:{Phone=“555-1212”,Email=“[email protected]”}, Office:{Phone=“666-1212”,Email=“[email protected]”}}

GETPeople:101:Name

People:101:Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”}GETPeople:101:Name:First

People:101:Name:First=“Florian”

Page 11: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Fundamental Operations

• CREATEtable,families• PUTtable,rowid,family:column,value• PUTtable,rowid,whole-row• GETtable,rowid•  SCANtable(WITHfilters)• DROPtable

Page 12: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Consistency Model

• Atomicity:Entirerowsareupdatedatomicallyornotatall.• Consistency:

•  AGETisguaranteedtoreturnacompleterowthatexistedatsomepointinthetable’shistory.(Checkthetimestamptobesure!)•  ASCANmustincludealldatawrittenpriortothescan,andmayincludeupdatessinceitstarted.

•  Isolation:Notguaranteedoutsideasinglerow.• Durability:Allsuccessfulwriteshavebeenmadedurableondisk.

Page 13: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

The Implementation

•  I’llgiveyoutheideainseveralsteps:•  Idea1:Putanentiretableinonefile.(Itdoesn’twork.)•  Idea2:Log+onefile.(Better,butdoesn’tscaletolargedata.)•  Idea3:Log+onefilepercolumnfamily.(Gettingbetter!)•  Idea4:Partitiontableintoregionsbykey.

• Keepinmindthat“onefile”meansonegiganticfilestoredinHDFS.Wedon’thavetoworryaboutthedetailsofhowthefileissplitintoblocks.

Page 14: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Idea 1: Put the Table in a Single File

• Howdowedothefollowingoperations?•  CREATE•  DELETE•  SCAN•  GET•  PUT

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

File“People”

Page 15: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Variable-Length Data is Fundamentally Hard!

Funreadingonthetopic:http://www.joelonsoftware.com/articles/fog0000000319.html

ID FirstName LastName Phone101 Florian Krepsbach 555-3434102 Marilyn Tollerud 555-1213103 Pastor Ingvist 555-1214

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

SQLTable:People(ID:Integer,FirstName:CHAR[20],LastName:Char[20],Phone:CHAR[8])UPDATEPeopleSETPhone=“555-3434”WHEREID=403;

HBaseTablePeople(ID,Name,Home,Office)PUTPeople,403,Home:Phone,555-3434

Eachrowisexactly52byteslong.Tomovetothenextrow,justfseek(file,+52);TogettoRow401,fseek(file,401*52);Overwritethedatainplace.

?

Page 16: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Idea 2: One Tablet + Transaction Log

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

TableforPeople

PUT101:Office:Phone=“555-3434”PUT102:Home:[email protected]….

TransactionLogforTablePeople:

Changesareappliedonlytothelogfile,Thentheresultingrecordiscachedinmemory.Readsmustconsultbothmemoryanddisk.

MemoryCacheforTablePeople:

101 102

GETPeople:101 GETPeople:103PUTPeople:101:Office:Phone=“555-3434”

Page 17: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Idea 2 Requires Periodic Log Compression

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“666-1212”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”},Office:{Phone=“666-1213”}},...

TableforPeopleonDisk:(Old)

PUT101:Office:Phone=“555-3434”PUT102:Home:[email protected]….

TransactionLogforTablePeople:

101:{Timestamp:T403;Name:{First=“Florian”,Middle=“Garfield”,Last=“Krepsbach”},Home:{Phone=“555-1212”,Email=“[email protected]”},Office:{Phone=“555-3434”,Email=“[email protected]”}},102:{Timestamp:T593;Name:{First=“Marilyn”,Last=“Tollerud”},Home:{Phone=“555-1213”,Email=“[email protected]”},...

TableforPeopleonDisk:(New)

Writeoutanewcopyofthetable,withallofthechangesapplied.Deletethelogandmemorycache,andstartover.

Page 18: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Idea 3: Partition by Column Family

DataforColumnFamilyName

TabletsforPeopleonDisk:(Old)

PUT101:Office:Phone=“555-3434”PUT102:Home:[email protected]….

TransactionLogforTablePeople:

TabletsforPeopleonDisk:(New)

Writeoutanewcopyofthetablet,withallofthechangesapplied.Deletethelogandmemorycache,andstartover.

DataforColumnFamilyHome

DataforColumnFamilyOffice

DataforColumnFamilyHome(Changed)

DataforColumnFamilyOffice(Changed)

DataforColumnFamilyName

Page 19: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Idea 4: Split Into Regions

Region1:Keys100-200

Region2:Keys200-300

Region3:Keys300-400

Region4:Keys400-500

RegionServer

RegionMaster

RegionServer

RegionServer

RegionServer

TransactionLog

MemoryCache

Tablet

Tablet

Tablet

HbaseClient

(DetailofOneRegionServer)

ColumnFamilyName

ColumnFamilyHome

ColumnFamilyOffice

WherearetheserversfortablePeople?

Accessdataintables

Page 20: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

ColumnFamilyName ColumnFamilyHome ColumnFamilyOffice

Region1Keys101-200

Region2Keys201-300

Region3Keys301-400

TablePeople

Page 21: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

The consistency model is tightly coupled to the scalability of the implementation!

Page 22: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Consistency Model

• Atomicity:Entirerowsareupdatedatomicallyornotatall.• Consistency:

•  AGETisguaranteedtoreturnacompleterowthatexistedatsomepointinthetable’shistory.(Checkthetimestamptobesure!)•  ASCANmustincludealldatawrittenpriortothescan,andmayincludeupdatessinceitstarted.

•  Isolation:Notguaranteedoutsideasinglerow.• Durability:Allsuccessfulwriteshavebeenmadedurableondisk.

Page 23: The Hadoop Stack, Part 2 Introduction to HBasedthain/courses/cse40822/fall2018/notes/HBase.pdf• HBase implements OLTP interactions on top of HDFS by using additional storage and

Questions for Discussion

• Whatarethetradeoffsbetweenhavingaverytallverticaltableversushavingaverywidehorizontaltable?•  Supposeatablestartsoffsmall,andthenalargenumberofrowsareaddedtothetable.Whathappensandwhatshouldthesystemdo?• UsingthePeopleexample,comeupwithsomeexamplesofsimplequeriesthatareveryfast,andsomethatareveryslow.Isthereanythingthatcanbedoneabouttheslowqueries?• WoulditbepossibletohaveSCANreturnaconsistentsnapshotofthesystem?Howwouldthatwork?