
Page 1: Berkeley Data Analysis Stack

Berkeley Data Analysis Stack
Shark, Bagel

Page 2: Berkeley Data Analysis Stack

Previous Presentation Summary
• Mesos, Spark, Spark Streaming

[Stack diagram, bottom to top:]
• Infrastructure / Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
• Storage / Data Management: efficient data sharing across frameworks
• Data Processing: in-memory processing; trade between time, quality, and cost
• Application: new apps: AMP-Genomics, Carat, …

Page 3: Berkeley Data Analysis Stack

Previous Presentation Summary
• Mesos, Spark, Spark Streaming

Page 4: Berkeley Data Analysis Stack

Spark Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

[Cluster diagram: the Driver ships tasks to Workers; each Worker reads one block of the file from HDFS (Block 1-3), keeps its slice of the cached RDD in memory (Cache 1-3), and returns results to the Driver. Legend: Base RDD, Transformed RDD, Cached RDD, Parallel operation.]

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
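For reference, a minimal self-contained version of the fragments above; this is a sketch, assuming a classic SparkContext entry point (the slide's `spark` variable plays the same role) and keeping the slide's placeholder HDFS path:

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

    // Load the log, keep ERROR lines, extract the message field (3rd tab-separated column).
    val lines = sc.textFile("hdfs://...") // placeholder path, as on the slide
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))
    val cachedMsgs = messages.cache() // mark the RDD to be kept in memory

    // The first action materializes and caches the RDD; later scans hit memory.
    println(cachedMsgs.filter(_.contains("foo")).count())
    println(cachedMsgs.filter(_.contains("bar")).count())

    sc.stop()
  }
}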

Page 5: Berkeley Data Analysis Stack

Logistic Regression Performance
• Hadoop: 127 s / iteration
• Spark: first iteration 174 s, further iterations 6 s

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  …
}
println("Final w: " + w)

Page 6: Berkeley Data Analysis Stack

HIVE: Components

[Architecture diagram: users reach Hive through the Hive CLI (DDL, queries, browsing) and a Mgmt. Web UI; a Thrift API fronts the MetaStore; Hive QL passes through the Parser and Planner to the Execution engine, which runs as Map Reduce over HDFS; SerDe libraries (Thrift, Jute, JSON, …) handle row (de)serialization.]

Page 7: Berkeley Data Analysis Stack

Data Model

Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/T
Partition          date=d1                   /wh/T/date=d1
Bucketing column   userid                    /wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000 (hashed on userid)
External Table     extT                      /wh2/existing/dir (arbitrary location)
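To make the bucketing row concrete, here is a sketch of how a bucketed table maps a row to a part file; the hash-and-modulo scheme and the bucketFile helper are illustrative assumptions, not Hive's exact hash function:

// Illustrative only: map a row's bucketing column to a part file,
// assuming Hive-style hash(userid) mod numBuckets bucketing.
def bucketFile(userid: Long, numBuckets: Int): String = {
  val bucket = (userid.hashCode & Integer.MAX_VALUE) % numBuckets
  f"/wh/T/date=d1/part-$bucket%04d"
}

println(bucketFile(12345L, 1024)) // /wh/T/date=d1/part-0057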

Page 8: Berkeley Data Analysis Stack

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.
2. Load “buckets” directly: the user is responsible for creating buckets.

CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Creates the table directory.

Page 9: Berkeley Data Analysis Stack

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.

Step 1: create the staging external table.

CREATE EXTERNAL TABLE page_view_stg(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Step 2: copy the raw file into the staging location.

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Page 10: Berkeley Data Analysis Stack

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.

Step 3: populate the target partition from the staging table.

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
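The same load can also be driven programmatically; a sketch against a modern Spark SQL session with Hive support (SparkSession postdates these slides and is an assumption here, as is the Hive multi-insert FROM syntax passing through unchanged):

import org.apache.spark.sql.SparkSession

// Sketch: run the Step 3 insert through Spark SQL with Hive support enabled.
val spark = SparkSession.builder()
  .appName("PageViewLoad")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  FROM page_view_stg pvs
  INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
  WHERE pvs.country = 'US'
""")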

Page 11: Berkeley Data Analysis Stack

Hive

[Data-flow diagram: a file on HDFS is read as a stream and deserialized by a SerDe into a hierarchical object, which Hive Operators process inside the Mapper; Writables flow through the map output file to the Reducer, where Hive Operators (or a User Script) again work on hierarchical objects before a SerDe writes the result back to a file on HDFS. FileFormat / Hadoop serialization handles the Writable layer; an ObjectInspector describes each hierarchical object. User-defined SerDes operate per row.]

Example row encodings and object forms:

1.0 3 54
0.2 1 33
2.2 8 212
0.7 2 22

• Text(‘1.0 3 54’) // UTF8 encoded
• BytesWritable(\x3F\x64\x72\x00)
• thrift_record<…>
• Java Object: object of a Java class
• Standard Object: ArrayList for struct and array, HashMap for map
• LazyObject: lazily deserialized
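The LazyObject form is the interesting one; here is a conceptual sketch of the idea (not Hive's actual classes): keep the raw row and split out fields only when they are first accessed, so untouched columns are never parsed.

// Conceptual sketch of lazy deserialization (not Hive's real LazyObject API):
// the raw row is kept as-is; fields are split out only on first access.
class LazyRow(raw: String, sep: Char = ' ') {
  private lazy val fields: Array[String] = raw.split(sep) // runs once, on demand
  def getField(i: Int): String = fields(i)
}

val row = new LazyRow("1.0 3 54")
println(row.getField(2)) // "54"; the split happens here, not at construction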

Page 12: Berkeley Data Analysis Stack

SerDe, ObjectInspector and TypeInfo

[Diagram: SerDe.deserialize turns a Writable (e.g. BytesWritable(\x3F\x64\x72\x00) or Text(‘a=av:b=bv 23 1:2=4:5 abcd’)) into a hierarchical object, and SerDe.getOI returns its ObjectInspector. ObjectInspector1.getType / getFieldOI / getStructField navigate the top-level struct; ObjectInspector2.getMapValueOI / getMapValue navigate the map field; ObjectInspector3 inspects the resulting string object. The matching TypeInfo tree is struct<map<string,string>, int, list<struct<int,int>>, string>.]

The hierarchical object in the example:

List(
  HashMap(“a” “av”, “b” “bv”),
  23,
  List(List(1,null), List(2,4), List(5,null)),
  “abcd”)

backed by the Java classes:

class HO {
  HashMap<String, String> a;
  Integer b;
  List<ClassC> c;
  String d;
}
class ClassC {
  Integer a;
  Integer b;
}

For instance, getStructField on field a returns HashMap(“a” “av”, “b” “bv”) (the HashMap<String, String> a member), and getMapValue with key “a” returns “av”.
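To make the inspector indirection concrete, a toy sketch of the pattern (illustrative only; Hive's real interfaces differ): the data object stays opaque, and the inspector knows both its TypeInfo and how to pull fields out of it.

// Toy ObjectInspector pattern (illustrative; not Hive's real interfaces).
sealed trait TypeInfo
case object StringType extends TypeInfo
case object IntType extends TypeInfo
case class StructType(fields: Map[String, TypeInfo]) extends TypeInfo

trait StructInspector {
  def getType: TypeInfo
  def getStructField(o: Any, name: String): Any
}

// Inspector for "standard objects" where a struct is stored as a Map.
object MapBackedInspector extends StructInspector {
  val getType = StructType(Map("b" -> IntType, "d" -> StringType))
  def getStructField(o: Any, name: String): Any =
    o.asInstanceOf[Map[String, Any]](name)
}

val row: Any = Map("b" -> 23, "d" -> "abcd")
println(MapBackedInspector.getStructField(row, "d")) // "abcd"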