Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch
Why Big Data?
• because bigger is smarter
  – answer tough questions
• because we can
  – push the limits and good things will happen
bigger = smarter?
• Yes!
  – tolerate errors
  – discover the long tail and corner cases
  – machine learning works much better
• But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions
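The "tolerate errors" point is essentially the law of large numbers: averaging over more noisy observations drives the estimation error down, so individual bad data points matter less. A minimal sketch with made-up numbers (the mean of 5.0 and the noise level are illustrative assumptions, not from the talk):

```python
import random

def estimate_mean(n, true_mean=5.0, noise=1.0, seed=7):
    """Average n noisy samples of a quantity; the noise averages out as n grows."""
    rng = random.Random(seed)
    return sum(true_mean + rng.gauss(0, noise) for _ in range(n)) / n

# With more (equally noisy) data, the estimate gets closer to the truth:
err_small = abs(estimate_mean(10) - 5.0)        # ~0.1-0.5 typically
err_large = abs(estimate_mean(100_000) - 5.0)   # ~0.003 typically
```

The error of the large-sample estimate shrinks roughly with the square root of the sample size, which is exactly why "bigger" can tolerate individual errors.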
Fundamental Problem of Big Data
• There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    • e.g., stock market predictions change behavior of people
    • e.g., Web search engines determine behavior of people
• Hard to debug: takes the human out of the loop
  – Example: How to play the lottery in Napoli
    • Step 1: You visit "oracles" who predict numbers to play
    • Step 2: You visit "interpreters" who explain the predictions
    • Step 3: After you lose, "analysts" tell you that the "oracles" and "interpreters" were right and that it was your fault
  – [Luciano De Crescenzo: Thus Spake Bellavista]
Because we can… Really?
• Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel
• But
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)
• I believe that all of these "buts" can be addressed
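"Counting is embarrassingly parallel" because each data partition can be counted independently, with no coordination, and the partial counts merged at the end. A minimal in-process sketch with hypothetical event logs (in a cluster, each partition would live on a different machine):

```python
from collections import Counter
from functools import reduce
from operator import add

# Hypothetical event-log partitions; each could sit on a different node.
partitions = [
    ["login", "click", "click"],
    ["click", "logout"],
    ["login", "click"],
]

# Count each partition independently (no coordination needed) ...
partial = [Counter(p) for p in partitions]

# ... then merge: Counter addition sums the counts per key.
total = reduce(add, partial)
```

The merge step is associative and commutative, which is what lets a system scale it out over thousands of partitions in any order.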
Utility & Cost Functions of Data
[Charts: Utility vs. Noise/Error and Cost vs. Noise/Error]
[Charts: Utility and Cost vs. Noise/Error, with separate curves for curated, random, and malicious data]
Best Utility/Cost Tradeoff
[Charts: Utility and Cost vs. Noise/Error, highlighting the malicious-data curves]
What is good enough?
[Charts: Utility and Cost vs. Noise/Error, highlighting the curated-data curves]
What about platforms?
• Relational Databases
  – great for 20% of the data
  – not great for 80% of the data
• Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)
Why is Hadoop so popular?
• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
  – MR abstracts "counting" (= machine learning)
• it is an eco-system – it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on app / problem
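The sense in which "MR abstracts counting": a MapReduce job is just a mapper emitting (key, 1) pairs and a reducer summing per key, with the framework handling the distributed sort in between. A minimal in-process sketch of that contract for word count (Hadoop Streaming would run the same two functions over stdin/stdout on a cluster):

```python
import itertools

def mapper(lines):
    # Map side of word count: emit (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop delivers pairs grouped and sorted by key; sum the values per key.
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# The shuffle/sort phase is simulated here by sorting the mapper output.
counts = dict(reducer(sorted(mapper(["big data", "big analytics"]))))
```

Most counting-style analytics (histograms, co-occurrence statistics, the sufficient statistics of many ML models) fit this same two-function shape, which is why the abstraction travels so well.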
Example: Amadeus Log Service
• HDFS for compressed logs
• HBase to index by timestamp and session id
• SOLR for full text search
• Hadoop (MR) for usage stats & disasters
• Oracle to store meta-data (e.g., user information)
• Disclaimer: under construction & evaluation!
  – the current production system is proprietary
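The component list above suggests a simple composition pattern: a facade that routes each query type to the component best suited for it. A hypothetical sketch with an in-memory stand-in for the real clients (class and method names are invented for illustration; a real deployment would hold separate HBase, SOLR, and HDFS clients):

```python
class InMemoryStore:
    """Stand-in for HBase/SOLR/HDFS clients; real systems use their own APIs."""
    def __init__(self, rows):
        self.rows = rows                      # key -> log record (a string here)

    def get(self, key):                       # point lookup (HBase-style index)
        return self.rows.get(key)

    def search(self, term):                   # full-text search (SOLR-style)
        return [r for r in self.rows.values() if term in r]

    def scan(self, predicate):                # bulk scan (HDFS/MR-style)
        return [r for r in self.rows.values() if predicate(r)]


class LogService:
    """Facade routing each query type to the best-suited component."""
    def __init__(self, store):
        self.store = store

    def by_session(self, session_id):
        return self.store.get(session_id)     # indexed lookup by session id

    def full_text(self, term):
        return self.store.search(term)        # keyword search over logs

    def usage_stats(self, predicate):
        return self.store.scan(predicate)     # bulk scan for stats/disasters


svc = LogService(InMemoryStore({
    "s1": "s1 2014-01-01 search flight ZRH-NCE",
    "s2": "s2 2014-01-02 book hotel Napoli",
}))
```

The point of the pattern is that no single component has to be good at everything; each query class hits the index or scan path built for it.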
Some things Hadoop got wrong
• performance: huge start-up time & overheads
• productivity: e.g., joins, configuration knobs
• SLAs: no response-time guarantees, no real time
• essentially ignored 40 years of DB research
Some things Hadoop got right
• scales without (much) thinking
• moves the computation to the data
• fault tolerance, load balance, …
How to improve on Hadoop
• Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, …
• Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system
• Option 3: Build new Hadoop components
  – real-time, etc.
• Option 4: Patterns to compose components
  – log service, machine learning, …
  – but do not build a "super-Hadoop"
Conclusion
• Focus on the "because we can…" part
  – help data scientists make everything work
• Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop "glue"; get the plumbing right
• Package our results correctly
  – find the right abstractions (=> APIs of building blocks)