@ Coursera Daniel Chia @DanielJHChia Software Engineer, Infrastructure

Scalding @ Coursera

DESCRIPTION

A lightning talk I gave about how Coursera decided to use Scalding.

Page 1: Scalding @ Coursera

@ Coursera

Daniel Chia @DanielJHChia

Software Engineer, Infrastructure

Page 2: Scalding @ Coursera

Overview

• Context

• Growing Needs

• Hive / Pig / Scalding

Page 3: Scalding @ Coursera

Technical (Online Stack)

• 100% hosted on AWS

• Service-oriented architecture

• Mix of MySQL and Cassandra for persistence

• Scala

Page 4: Scalding @ Coursera

Existing Warehouse

Streaming

Page 5: Scalding @ Coursera

Future Warehouse Flow

S3

Event Data

Page 6: Scalding @ Coursera

Need 1: Expressive

• Joins

• Aggregations

• Secondary sort

• Multiple map-reduce
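
To make these needs concrete, here is a minimal sketch of an aggregation, a secondary sort, and a join written against plain in-memory Scala collections rather than a cluster. The `Event` and `Course` shapes and all field names are hypothetical and only illustrate the kind of logic the warehouse has to express; they are not Coursera's actual schema.

```scala
object ExpressiveNeeds {
  // Hypothetical record shapes, for illustration only.
  final case class Event(userId: Long, courseId: String, timestamp: Long)
  final case class Course(courseId: String, name: String)

  // Aggregation: number of events per course.
  def eventsPerCourse(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.courseId).map { case (id, es) => id -> es.size }

  // Secondary sort: group by user, then order each user's events by time.
  def timelineByUser(events: Seq[Event]): Map[Long, Seq[Event]] =
    events.groupBy(_.userId).map { case (user, es) => user -> es.sortBy(_.timestamp) }

  // Inner join: attach the course name to each event, dropping events
  // whose course is unknown. Assumes course ids are unique.
  def withCourseName(events: Seq[Event], courses: Seq[Course]): Seq[(Event, String)] = {
    val nameById = courses.map(c => c.courseId -> c.name).toMap
    events.flatMap(e => nameById.get(e.courseId).map(name => (e, name)))
  }
}
```

On a cluster each of these steps typically becomes one or more map-reduce stages, which is why chaining several of them calls for a framework that plans multiple stages for you.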

Page 7: Scalding @ Coursera

Need 2: Semi-structured Data

• Increased usage of Cassandra

• Events data

Page 8: Scalding @ Coursera

{
  "timestamp": 1411359695744,
  "membershipState": "LearnerEnrolled"
}

Page 9: Scalding @ Coursera

{
  "typeName": "multipart",
  "definition": {
    "assignmentParts": {
      "id1": {
        "typeName": "plainText",
        "order": 0,
        "definition": {
          "prompt": "Write a sentence describing what you think about cereal."
        }
      },
      "id2": {
        "typeName": "richText",
        "order": 1,
        "definition": {
          "prompt": "Write a long essay with lots of fancy formatting describing what you think about cereal."
        }
      },
      "id3": {
        "typeName": "url",
        "order": 2,
        "definition": {
          "prompt": "Post a link to your favorite cereal."
        }
      },
      "id4": {
        "typeName": "plainText",
        "order": 3,
        "definition": {

Page 10: Scalding @ Coursera

Choices

• Hive

• Pig

• Scalding

Page 11: Scalding @ Coursera

Hive

• SQL-like language

• Great for simple rollups and aggregations

• Procedural transforms difficult to express

Page 12: Scalding @ Coursera

Pig

• Mature

• Procedural

• Pig Latin + Lots of UDFs

Page 13: Scalding @ Coursera

Scalding – Pros

• Succinct

• Expressive

• All code in one language

• Re-use online data models

Page 14: Scalding @ Coursera

Scalding – Pros

• Easy to test
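
One way the testing and model re-use points can play out is sketched below: the transformation lives in a pure function over a case class, so it can be checked with ordinary assertions and no cluster. The field names `timestamp` and `membershipState` and the value `LearnerEnrolled` come from the event shown earlier; the `MembershipEvent` class, the `countByState` function, and the `LearnerUnenrolled` value are assumptions made for illustration. Scalding also ships its own test helpers (such as `JobTest`), which are not shown here.

```scala
object EnrollmentLogic {
  // Hypothetical model mirroring the event JSON shown earlier.
  final case class MembershipEvent(timestamp: Long, membershipState: String)

  // Pure business logic: how many events were seen per membership state.
  def countByState(events: Seq[MembershipEvent]): Map[String, Int] =
    events.groupBy(_.membershipState).map { case (state, es) => state -> es.size }
}

object EnrollmentLogicCheck {
  import EnrollmentLogic._

  // A plain assertion-based check; a real project would likely use a test framework.
  def run(): Unit = {
    val sample = Seq(
      MembershipEvent(1411359695744L, "LearnerEnrolled"),
      MembershipEvent(1411359695999L, "LearnerEnrolled"),
      MembershipEvent(1411359696500L, "LearnerUnenrolled") // hypothetical state value
    )
    assert(countByState(sample) == Map("LearnerEnrolled" -> 2, "LearnerUnenrolled" -> 1))
  }
}
```

Because the function takes and returns ordinary Scala values, the same case class could in principle be shared between an online service and the batch job that calls this function inside its pipeline.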

Page 15: Scalding @ Coursera

Scalding – Cons

• Have to learn Scala

• Heavier-weight for simple, experimental work

• Many layers abstracted from MapReduce

Page 16: Scalding @ Coursera

Scalding – Example

• User event data

• Want to join with course and topic data

Page 17: Scalding @ Coursera

Scalding – Example

val events = TypedTsv … /* load data */ .toTypedPipe
val courses = TypedTsv … .toTypedPipe
val topics = TypedTsv … .toTypedPipe

Page 18: Scalding @ Coursera

Scalding – Example

events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .leftJoin(topics.groupBy(_.topicId))
  /* more analysis */
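
The chain of left joins on this slide can be illustrated on in-memory collections. The sketch below is not Scalding code: it uses plain Scala maps, hypothetical `Event`, `Course`, and `Topic` case classes, and assumes course and topic ids are unique. It is meant to show the semantics only, in particular that the right-hand side of a left join is optional, so an event with an unknown course survives with `None` values.

```scala
object LeftJoinSketch {
  // Hypothetical record shapes, for illustration only.
  final case class Event(userId: Long, courseId: String)
  final case class Course(courseId: String, topicId: String)
  final case class Topic(topicId: String, name: String)

  // Left-join events to courses, then the result to topics.
  // Every event is kept; missing matches show up as None.
  def enrich(
      events: Seq[Event],
      courses: Seq[Course],
      topics: Seq[Topic]
  ): Seq[(Event, Option[Course], Option[Topic])] = {
    val courseById = courses.map(c => c.courseId -> c).toMap
    val topicById = topics.map(t => t.topicId -> t).toMap
    events.map { e =>
      val course = courseById.get(e.courseId)
      val topic = course.flatMap(c => topicById.get(c.topicId))
      (e, course, topic)
    }
  }
}
```

In a distributed setting the lookups become shuffles keyed on `courseId` and then `topicId`, which is what the two `groupBy` / `leftJoin` pairs on the slide express.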

Page 19: Scalding @ Coursera

Scalding – Example

events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .leftJoin(topics.groupBy(_.topicId))
  /* more analysis */

Page 20: Scalding @ Coursera

Scalding – Example

events.groupBy(_.courseId)
  .leftJoin(courses.groupBy(_.courseId))
  .groupBy(_._2.topicId)
  .sketch(reducer = 100)
  .leftJoin(topics.groupBy(_.topicId))
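
The `sketch` step on this slide targets skewed keys, where a few topics account for most of the rows and would otherwise overload a single reducer. The general idea behind skew-tolerant joins can be sketched with key salting on in-memory collections. This is an illustration of the concept, not Scalding's implementation: the set of hot keys is passed in by hand (a real system would estimate it from the data), and the function name and the `fanOut` parameter are hypothetical.

```scala
object SkewJoinSketch {
  // Left join that spreads rows for hot keys over `fanOut` salted sub-keys.
  // Left rows with a hot key get a salt in [0, fanOut); right rows with a
  // hot key are replicated once per salt so every left row still finds its match.
  def saltedLeftJoin[K, L, R](
      left: Seq[(K, L)],
      right: Seq[(K, R)],
      hotKeys: Set[K],
      fanOut: Int
  ): Seq[(K, (L, Option[R]))] = {
    val saltedLeft: Seq[((K, Int), L)] = left.zipWithIndex.map { case ((k, l), i) =>
      val salt = if (hotKeys(k)) i % fanOut else 0
      ((k, salt), l)
    }
    val saltedRight: Seq[((K, Int), R)] = right.flatMap { case (k, r) =>
      val salts: Seq[Int] = if (hotKeys(k)) (0 until fanOut) else Seq(0)
      salts.map(s => ((k, s), r))
    }
    val rightIndex: Map[(K, Int), Seq[R]] =
      saltedRight.groupBy(_._1).map { case (key, rows) => key -> rows.map(_._2) }

    saltedLeft.flatMap { case ((k, salt), l) =>
      rightIndex.get((k, salt)) match {
        case Some(rs) => rs.map(r => (k, (l, Option(r))))
        case None     => Seq((k, (l, Option.empty[R])))
      }
    }
  }
}
```

The result is the same as an ordinary left join; what changes is that the work for a hot key is split across up to `fanOut` sub-keys, at the cost of replicating the matching right-hand rows.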

Page 21: Scalding @ Coursera

Scalding – Wish-list

• More documentation

• Scala 2.11 soon, please?

Page 22: Scalding @ Coursera

Questions?

We’re hiring! coursera.org/jobs