
Page 1: Berkeley Data Analysis Stack

Berkeley Data Analysis Stack
Shark, Bagel

Page 2: Berkeley Data Analysis Stack

Previous Presentation Summary
• Mesos, Spark, Spark Streaming

[Stack diagram, bottom to top:]
• Infrastructure / Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
• Storage / Data Management: efficient data sharing across frameworks
• Data Processing: in-memory processing; trade between time, quality, and cost
• Application: new apps: AMP-Genomics, Carat, …

Page 3: Berkeley Data Analysis Stack

Previous Presentation Summary
• Mesos, Spark, Spark Streaming

Page 4: Berkeley Data Analysis Stack

Spark Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

[Cluster diagram: the Driver ships tasks to Workers; each Worker reads one block of the file from HDFS (Block 1-3), keeps its slice of the cached RDD in memory (Cache 1-3), and returns results to the Driver. Legend: Base RDD, Transformed RDD, Cached RDD, Parallel operation.]

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
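For reference, a minimal self-contained version of the fragments above; this is a sketch, assuming a classic SparkContext entry point (the slide's `spark` variable plays the same role) and keeping the slide's placeholder HDFS path:

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogMining"))

    // Load the log, keep ERROR lines, extract the message field (3rd tab-separated column).
    val lines = sc.textFile("hdfs://...") // placeholder path, as on the slide
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))
    val cachedMsgs = messages.cache() // mark the RDD to be kept in memory

    // The first action materializes and caches the RDD; later scans hit memory.
    println(cachedMsgs.filter(_.contains("foo")).count())
    println(cachedMsgs.filter(_.contains("bar")).count())

    sc.stop()
  }
}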

Page 5: Berkeley Data Analysis Stack

Logistic Regression Performance
• Hadoop: 127 s / iteration
• Spark: first iteration 174 s, further iterations 6 s

val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  …
}
println("Final w: " + w)

Page 6: Berkeley Data Analysis Stack

HIVE: Components

[Architecture diagram: users reach Hive through the Hive CLI (DDL, queries, browsing) and a Mgmt. Web UI; a Thrift API fronts the MetaStore; Hive QL passes through the Parser and Planner to the Execution engine, which runs as Map Reduce over HDFS; SerDe libraries (Thrift, Jute, JSON, …) handle row (de)serialization.]

Page 7: Berkeley Data Analysis Stack

Data Model

Hive Entity        Sample Metastore Entity   Sample HDFS Location
Table              T                         /wh/T
Partition          date=d1                   /wh/T/date=d1
Bucketing column   userid                    /wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000 (hashed on userid)
External Table     extT                      /wh2/existing/dir (arbitrary location)
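To make the bucketing row concrete, here is a sketch of how a bucketed table maps a row to a part file; the hash-and-modulo scheme and the bucketFile helper are illustrative assumptions, not Hive's exact hash function:

// Illustrative only: map a row's bucketing column to a part file,
// assuming Hive-style hash(userid) mod numBuckets bucketing.
def bucketFile(userid: Long, numBuckets: Int): String = {
  val bucket = (userid.hashCode & Integer.MAX_VALUE) % numBuckets
  f"/wh/T/date=d1/part-$bucket%04d"
}

println(bucketFile(12345L, 1024)) // /wh/T/date=d1/part-0057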

Page 8: Berkeley Data Analysis Stack

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.
2. Load “buckets” directly: the user is responsible for creating buckets.

CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

Creates the table directory.

Page 9: Berkeley Data Analysis Stack

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.

Step 1: create the staging external table.

CREATE EXTERNAL TABLE page_view_stg(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Step 2: copy the raw file into the staging location.

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Page 10: Berkeley Data Analysis Stack

Hive/Shark flowchart (Insert into table)
Two ways to do this:
1. Load from an “external table”: query the external table for each “bucket” and write that bucket to HDFS.

Step 3: populate the target partition from the staging table.

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
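The same load can also be driven programmatically; a sketch against a modern Spark SQL session with Hive support (SparkSession postdates these slides and is an assumption here, as is the Hive multi-insert FROM syntax passing through unchanged):

import org.apache.spark.sql.SparkSession

// Sketch: run the Step 3 insert through Spark SQL with Hive support enabled.
val spark = SparkSession.builder()
  .appName("PageViewLoad")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  FROM page_view_stg pvs
  INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
  WHERE pvs.country = 'US'
""")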

Page 11: Berkeley Data Analysis Stack

Hive

[Data-flow diagram: a file on HDFS is read as a stream and deserialized by a SerDe into a hierarchical object, which Hive Operators process inside the Mapper; Writables flow through the map output file to the Reducer, where Hive Operators (or a User Script) again work on hierarchical objects before a SerDe writes the result back to a file on HDFS. FileFormat / Hadoop serialization handles the Writable layer; an ObjectInspector describes each hierarchical object. User-defined SerDes operate per row.]

Example row encodings and object forms:

1.0 3 54
0.2 1 33
2.2 8 212
0.7 2 22

• Text(‘1.0 3 54’) // UTF8 encoded
• BytesWritable(\x3F\x64\x72\x00)
• thrift_record<…>
• Java Object: object of a Java class
• Standard Object: ArrayList for struct and array, HashMap for map
• LazyObject: lazily deserialized
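The LazyObject form is the interesting one; here is a conceptual sketch of the idea (not Hive's actual classes): keep the raw row and split out fields only when they are first accessed, so untouched columns are never parsed.

// Conceptual sketch of lazy deserialization (not Hive's real LazyObject API):
// the raw row is kept as-is; fields are split out only on first access.
class LazyRow(raw: String, sep: Char = ' ') {
  private lazy val fields: Array[String] = raw.split(sep) // runs once, on demand
  def getField(i: Int): String = fields(i)
}

val row = new LazyRow("1.0 3 54")
println(row.getField(2)) // "54"; the split happens here, not at construction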

Page 12: Berkeley Data Analysis Stack

SerDe, ObjectInspector and TypeInfo

[Diagram: SerDe.deserialize turns a Writable (e.g. BytesWritable(\x3F\x64\x72\x00) or Text(‘a=av:b=bv 23 1:2=4:5 abcd’)) into a hierarchical object, and SerDe.getOI returns its ObjectInspector. ObjectInspector1.getType / getFieldOI / getStructField navigate the top-level struct; ObjectInspector2.getMapValueOI / getMapValue navigate the map field; ObjectInspector3 inspects the resulting string object. The matching TypeInfo tree is struct<map<string,string>, int, list<struct<int,int>>, string>.]

The hierarchical object in the example:

List(
  HashMap(“a” “av”, “b” “bv”),
  23,
  List(List(1,null), List(2,4), List(5,null)),
  “abcd”)

backed by the Java classes:

class HO {
  HashMap<String, String> a;
  Integer b;
  List<ClassC> c;
  String d;
}
class ClassC {
  Integer a;
  Integer b;
}

For instance, getStructField on field a returns HashMap(“a” “av”, “b” “bv”) (the HashMap<String, String> a member), and getMapValue with key “a” returns “av”.
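To make the inspector indirection concrete, a toy sketch of the pattern (illustrative only; Hive's real interfaces differ): the data object stays opaque, and the inspector knows both its TypeInfo and how to pull fields out of it.

// Toy ObjectInspector pattern (illustrative; not Hive's real interfaces).
sealed trait TypeInfo
case object StringType extends TypeInfo
case object IntType extends TypeInfo
case class StructType(fields: Map[String, TypeInfo]) extends TypeInfo

trait StructInspector {
  def getType: TypeInfo
  def getStructField(o: Any, name: String): Any
}

// Inspector for "standard objects" where a struct is stored as a Map.
object MapBackedInspector extends StructInspector {
  val getType = StructType(Map("b" -> IntType, "d" -> StringType))
  def getStructField(o: Any, name: String): Any =
    o.asInstanceOf[Map[String, Any]](name)
}

val row: Any = Map("b" -> 23, "d" -> "abcd")
println(MapBackedInspector.getStructField(row, "d")) // "abcd"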