HBaseCon East 2016: HBase and Spark, State of the Art




Page 1: HBaseConEast2016: HBase and Spark, State of the Art

HBaseCon East 2016 – HBase and Spark, state of the art

Page 2: HBaseConEast2016: HBase and Spark, State of the Art

About Jean-Marc Spaggiari

• Java Message Service => JMS
• Solutions Architect at Cloudera
• A bit of everything…
  • Development
  • Team/Project manager
  • Architect
• O'Reilly author of Architecting HBase Applications
• International
  • Worked from Paris to Los Angeles
  • More than 100 flights per year
• HBase (and others) contributor

HBaseCon East 2016

Page 3: HBaseConEast2016: HBase and Spark, State of the Art

Overview
• Where we came from
• Examples of code
• Improvements that are coming up
• Spark Streaming use case

HBaseCon East 2016

Page 4: HBaseConEast2016: HBase and Spark, State of the Art

Source of Demand

• Demand started in the field
  • Use Cases
    • API access: Gets, Puts, Scans
    • MapReduce Mass Scans
    • MapReduce Bulk Load
    • MapReduce Smart gets and puts
• Spark has all but killed MapReduce
• Spark Streaming has grown in popularity
  • Populating Aggregates
  • Entity-Centric Time Series data store, i.e. OpenTSDB
  • Lookups for joins or mutations

HBaseCon East 2016

Page 5: HBaseConEast2016: HBase and Spark, State of the Art

How it Started

• Started on GitHub
• Andrew Purtell started the effort to put it into HBase
• Big call out to Sean B, Jon H, Ted Y, Ted M and Matteo B

• Components
  • Normal Spark
  • Spark Streaming
  • Bulk Load
  • SparkSQL

HBaseCon East 2016

Page 6: HBaseConEast2016: HBase and Spark, State of the Art

Under the covers

HBaseCon East 2016

[Architecture diagram: the Driver holds the Configs; each Worker Node runs an Executor whose static space keeps the Configs and an HConnection shared by all of its Tasks.]

Page 7: HBaseConEast2016: HBase and Spark, State of the Art

Key Addition: HBaseContext

Create an HBaseContext:

// A Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))

// sc is the SparkContext; the HBaseContext corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)

HBaseCon East 2016

Page 8: HBaseConEast2016: HBase and Spark, State of the Art

Operations on the HBaseContext

• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget; see the sketch below)
• BulkDelete
• Most of them in both Java and Scala
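A minimal BulkGet sketch, assuming the hbaseBulkGet RDD helper from the hbase-spark module; rowKeyRdd, the table name "t1" and the result-conversion lambda are illustrative:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.util.Bytes

// rowKeyRdd: an RDD[Array[Byte]] of row keys to fetch (illustrative name)
val getRdd = rowKeyRdd.hbaseBulkGet[String](hbaseContext, TableName.valueOf("t1"),
  2,                                  // batch size for each multiget
  rowKey => new Get(rowKey),          // build a Get for every row key
  (result: Result) => Bytes.toString(result.getRow)) // convert each Result to something serializable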

HBaseCon East 2016

Page 9: HBaseConEast2016: HBase and Spark, State of the Art

Foreach

Read data in parallel for each partition and compute:

rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // do something
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})

HBaseCon East 2016

Page 10: HBaseConEast2016: HBase and Spark, State of the Art

Foreach

Read data in parallel for each partition and compute:

hbaseContext.foreachPartition(keyValuesPuts,
  new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
    @Override
    public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
      BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
      while (t._1().hasNext()) {
        ... // HBase API put/incr/append/cas calls
      }
      mutator.flush();
      mutator.close();
    }
  });

HBaseCon East 2016

Page 11: HBaseConEast2016: HBase and Spark, State of the Art

Map

Take a dataset and map it in parallel for each partition to produce a new RDD or process it

val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  var res = mutable.MutableList[String]()
  it.map(r => {
    ... // HBase API Scan Results
  })
})

HBaseCon East 2016

Page 12: HBaseConEast2016: HBase and Spark, State of the Art

BulkLoad

Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only)

rdd.hbaseBulkLoad(hbaseContext, tableName,
  t => {
    val rowKey = t._1
    val fam: Array[Byte] = t._2._1
    val qual = t._2._2
    val value = t._2._3
    val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
    Seq((keyFamilyQualifier, value)).iterator
  }, stagingFolder)

val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))

HBaseCon East 2016

Page 15: HBaseConEast2016: HBase and Spark, State of the Art

BulkLoadThinRows

Bulk load a data set into HBase (for skinny tables, < 10k cols)

hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])]
  (rdd, TableName.valueOf(tableName), t => {
    val rowKey = Bytes.toBytes(t._1)
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family: Array[Byte] = f._1
      val qualifier = f._2
      val value: Array[Byte] = f._3
      familyQualifiersValues += (family, qualifier, value)
    })
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, stagingFolder.getPath)

HBaseCon East 2016

Page 17: HBaseConEast2016: HBase and Spark, State of the Art

BulkPut

Parallelized HBase Multiput:

hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
  tableName,
  (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.add(putValue._1, putValue._2, putValue._3))
    put
  })

HBaseCon East 2016

Page 18: HBaseConEast2016: HBase and Spark, State of the Art

BulkPut

Parallelized HBase Multiput:

hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
  @Override
  public Put call(String v1) throws Exception {
    String[] tokens = v1.split("\\|");
    Put put = new Put(Bytes.toBytes(tokens[0]));
    put.addColumn(Bytes.toBytes("segment"),
      Bytes.toBytes(tokens[1]),
      Bytes.toBytes(tokens[2]));
    return put;
  }
});

HBaseCon East 2016

Page 19: HBaseConEast2016: HBase and Spark, State of the Art

BulkDelete

Parallelized HBase Multi-deletes

hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size

rdd.hbaseBulkDelete(hbaseContext, tableName,
  putRecord => new Delete(putRecord),
  4) // batch size

HBaseCon East 2016

Page 20: HBaseConEast2016: HBase and Spark, State of the Art

Table RDD

How to materialize a table as a Spark RDD.
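A minimal sketch of one way to do it, assuming the hbaseRDD helper on HBaseContext; the table name "t1" and the scan settings are illustrative:

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Scan

// Expose a full-table scan as an RDD of (rowkey, Result) pairs
val scan = new Scan()
scan.setCaching(100)
val tableRdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)

// From here it is a normal RDD: count it, map the Results, join it, etc.
println(tableRdd.count())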

HBaseCon East 2016

Page 24: HBaseConEast2016: HBase and Spark, State of the Art

What Improvements Have We Made?

▪ Combine Spark and HBase
  • Spark Catalyst Engine for Query Plan and Optimization
  • HBase for Fast Access KV Store
  • Implement Standard External Data Source with Built-in Filter
▪ High Performance
  • Data Locality: Move Computation to Data
  • Partition Pruning: Task only Performed in the RS Holding the Requested Data
  • Column Pruning / Predicate Pushdown: Reduce Network Overhead
▪ Full-Fledged DataFrame Support
  • Spark-SQL
  • Integrated Language Query
▪ Run on Top of an Existing HBase Table
  • Native Support for Java Primitive Types
▪ Still some work and improvements to be done
  • HBASE-16638 Reduce the number of Connections
  • HBASE-14217 Add Java access to Spark bulk load functionality

HBaseCon East 2016

Page 25: HBaseConEast2016: HBase and Spark, State of the Art

DataFrame + HBase

WIP... 2.0?

HBaseCon East 2016

Page 26: HBaseConEast2016: HBase and Spark, State of the Art

Usage - Define the Catalog

def catalog = s"""{
  |"table":{"namespace":"default", "name":"table1"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
    |"col2":{"cf":"cf1", "col":"col2", "type":"string"}
  |}
|}""".stripMargin

HBase table to DataFrame catalog mapping:

HBaseCon East 2016

Page 27: HBaseConEast2016: HBase and Spark, State of the Art

Usage – Write to HBase

sc.parallelize(data).toDF.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

Export RDD into a new HBase table with DataFrame

HBaseCon East 2016

Page 28: HBaseConEast2016: HBase and Spark, State of the Art

Usage – Construct DataFrame

def withCatalog(cat: String): DataFrame = {
  sqlContext.read
    .options(Map(HBaseTableCatalog.tableCatalog -> cat))
    .format("org.apache.spark.sql.execution.datasources.hbase")
    .load()
}

Read an existing HBase table into a DataFrame

HBaseCon East 2016

Page 29: HBaseConEast2016: HBase and Spark, State of the Art

Usage - Language Integrated Query / SQL

val df = withCatalog(catalog)
val s = df.filter(
  (($"col0" <= "row050" && $"col0" > "row040") ||
    $"col0" === "row005" ||
    $"col0" === "row020" ||
    $"col0" === "r20" ||
    $"col0" <= "row005") &&
  ($"col2" === 1 ||
    $"col2" === 42))
  .select("col0", "col1", "col2")

s.show

Query the HBase-backed DataFrame with language-integrated query

HBaseCon East 2016

Page 30: HBaseConEast2016: HBase and Spark, State of the Art

Usage - Language Integrated Query / SQL

// Load the DataFrame
val df = withCatalog(catalog)

// SQL example
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show

Query the HBase-backed DataFrame with SQL

HBaseCon East 2016

Page 31: HBaseConEast2016: HBase and Spark, State of the Art

Spark Streaming Example

[Pipeline diagram: Kafka producer → Spark Streaming → HBase and SOLR]
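A minimal sketch of the Kafka-to-HBase leg of that pipeline, assuming the hbaseBulkPut DStream helper from the hbase-spark module and the Spark 1.x Kafka direct-stream API; sc and hbaseContext are the ones created earlier, and the broker, topic, table and column names are illustrative:

import kafka.serializer.StringDecoder
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseDStreamFunctions._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))

// Pull messages from Kafka (illustrative broker and topic)
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

// Turn each "key|value" message into a Put and write it to HBase in bulk
stream.map(_._2).hbaseBulkPut(hbaseContext, TableName.valueOf("t1"), (line: String) => {
  val tokens = line.split("\\|")
  val put = new Put(Bytes.toBytes(tokens(0)))
  put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes(tokens(1)))
  put
})

ssc.start()
ssc.awaitTermination()

The SOLR leg of the diagram would be a separate writer (for example a foreachRDD that posts the same records to a SOLR collection) and is not sketched here.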

HBaseCon East 2016