Big Data: Analytics Platforms
Donald Kossmann, Systems Group, ETH Zurich
http://systems.ethz.ch

Page 1: Big Data: Analytics Platforms

Big Data: Analytics Platforms

Donald Kossmann
Systems Group, ETH Zurich

http://systems.ethz.ch

Page 2: Big Data: Analytics Platforms

Why Big Data?

• because bigger is smarter
  – answer tough questions

• because we can
  – push the limits and good things will happen

Page 3: Big Data: Analytics Platforms

bigger = smarter?

• Yes!
  – tolerate errors
  – discover the long tail and corner cases
  – machine learning works much better

Page 4: Big Data: Analytics Platforms

bigger = smarter?

• Yes!
  – tolerate errors
  – discover the long tail and corner cases
  – machine learning works much better

• But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions

Page 5: Big Data: Analytics Platforms

Fundamental Problem of Big Data

• There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    • e.g., stock market predictions change behavior of people
    • e.g., Web search engines determine behavior of people

Page 6: Big Data: Analytics Platforms

Fundamental Problem of Big Data

• There is no ground truth
  – gets more complicated with self-fulfilling prophecies

• Hard to debug: takes the human out of the loop
  – Example: How to play the lottery in Napoli
    • Step 1: You visit “oracles” who predict numbers to play
    • Step 2: You visit “interpreters” who explain the predictions
    • Step 3: After you have lost, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault.
  – [Luciano de Crescenzo: Thus Spake Bellavista]

Page 7: Big Data: Analytics Platforms

Why Big Data?

• because bigger is smarter
  – answer tough questions

• because we can
  – push the limits and good things will happen

Page 8: Big Data: Analytics Platforms

Because we can… Really?

• Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel
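The claim that counting is embarrassingly parallel can be illustrated with a minimal sketch: each partition is counted without looking at any other partition, and the partial counts are merged by per-key addition. The function names and the sample events are hypothetical, and threads are used only to keep the sketch short; a real system would spread the partitions over processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(records):
    """Count the items of one partition; no other partition is consulted."""
    return Counter(records)


def parallel_count(records, num_partitions=4):
    """Split records into partitions, count each independently, then merge."""
    # Round-robin partitioning; any split works because counts are additive.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    # Threads stand in for workers here (assumption made for brevity only).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # merging partial counts is a per-key sum
    return total


if __name__ == "__main__":
    events = ["search", "book", "search", "cancel", "search", "book"]
    print(parallel_count(events, num_partitions=3))
```

Because the merge step is a plain sum, the result does not depend on how the records are partitioned.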

Page 9: Big Data: Analytics Platforms

Because we can… Really?

• Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel

• But,
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)

• I believe that all these “buts” can be addressed

Page 10: Big Data: Analytics Platforms

Utility & Cost Functions of Data

[Figure: two plots, one labeled Utility and one labeled Cost, each with a Noise / Error axis]

Page 11: Big Data: Analytics Platforms

Utility & Cost Functions of Data

[Figure: two plots, one labeled Utility and one labeled Cost, each with a Noise / Error axis; both plots carry the labels curated, random, and malicious]

Page 12: Big Data: Analytics Platforms

Best Utility/Cost Tradeoff

[Figure: two plots, one labeled Utility and one labeled Cost, each with a Noise / Error axis; both plots carry the label malicious]

Page 13: Big Data: Analytics Platforms

What is good enough?

[Figure: two plots, one labeled Utility and one labeled Cost, each with a Noise / Error axis; both plots carry the label curated]

Page 14: Big Data: Analytics Platforms

What about platforms?

• Relational Databases
  – great for 20% of the data
  – not great for 80% of the data

• Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)

Page 15: Big Data: Analytics Platforms

Why is Hadoop so popular?

• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
  – MR abstracts “counting” (= machine learning)

• it is an eco-system - it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on app / problem
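To make the point about MR (MapReduce) abstracting “counting” concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. It is not Hadoop's API: run_job, word_mapper, and sum_reducer are hypothetical names chosen for illustration, and a real framework would run the phases on many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user-supplied mapper to every record and yield (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence (single process, for illustration)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Counting is a sum over the values collected for one key."""
    return sum(values)


if __name__ == "__main__":
    print(run_job(["big data", "Big platforms"], word_mapper, sum_reducer))
```

The user writes only the mapper and the reducer; partitioning, grouping, and scheduling stay inside the framework, which is what makes the abstraction attractive.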

Page 16: Big Data: Analytics Platforms

Example: Amadeus Log Service

• HDFS for compressed logs

• HBase to index by timestamp and session id

• SOLR for full text search

• Hadoop (MR) for usage stats & disasters

• Oracle to store meta-data (e.g., user information)

• Disclaimer: under construction & evaluation!!!
  – current production system is proprietary
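The idea of indexing log entries by timestamp and session id can be sketched with a sorted composite key and range scans. This is an in-memory illustration only: it does not use the HBase API and makes no claim about the actual Amadeus design. The class LogIndex, the key order (session id first, then timestamp), and the sample file locations are all assumptions.

```python
import bisect


class LogIndex:
    """Sorted list of (session_id, timestamp, location) entries."""

    def __init__(self):
        self._entries = []

    def add(self, session_id, timestamp, location):
        """Insert an entry while keeping the list sorted by the composite key."""
        bisect.insort(self._entries, (session_id, timestamp, location))

    def scan(self, session_id, start_ts, end_ts):
        """Return locations for one session with start_ts <= timestamp < end_ts."""
        # A 2-tuple sorts before every 3-tuple that shares its prefix, so these
        # two probes bracket exactly the requested half-open time range.
        lo = bisect.bisect_left(self._entries, (session_id, start_ts))
        hi = bisect.bisect_left(self._entries, (session_id, end_ts))
        return [location for _, _, location in self._entries[lo:hi]]


if __name__ == "__main__":
    index = LogIndex()
    index.add("s1", 10, "logs/part-0001:120")  # hypothetical log locations
    index.add("s2", 15, "logs/part-0001:480")
    index.add("s1", 20, "logs/part-0002:64")
    print(index.scan("s1", 0, 100))
```

Putting the session id first keeps all entries of a session adjacent, so a time-range query for one session becomes a single contiguous scan; putting the timestamp first would instead favor queries across all sessions in a time window.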

Page 17: Big Data: Analytics Platforms

Some things Hadoop got wrong?

• performance: huge start-up time & overheads

• productivity: e.g., joins, configuration knobs

• SLAs: no response time guarantees, no real time

• Essentially ignored 40 years of DB research

Page 18: Big Data: Analytics Platforms

Some things Hadoop got right

• scales without (much) thinking

• moves the computation to the data

• fault tolerance, load balance, …

Page 19: Big Data: Analytics Platforms

How to improve on Hadoop

• Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, …

• Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system

• Option 3: Build new Hadoop components
  – real-time, etc.

• Option 4: Patterns to compose components
  – log service, machine learning, …
  – but, do not build a “super-Hadoop”
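Option 1 mentions implementing joins on top of Hadoop. One common way to express an equi-join in the map/shuffle/reduce model is a reduce-side (repartition) join, sketched below in a single process. The talk does not say which join strategy is meant, so this choice is an assumption, and the function name and the sample tables are hypothetical.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dicts, phrased as map, shuffle, reduce."""
    # Map: tag each record with its source and emit it under its join key.
    tagged = [(row[left_key], "L", row) for row in left]
    tagged += [(row[right_key], "R", row) for row in right]

    # Shuffle: bring all records with the same join key together.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, tag, row in tagged:
        groups[key][tag].append(row)

    # Reduce: per key, pair every left record with every right record.
    joined = []
    for sides in groups.values():
        for left_row, right_row in product(sides["L"], sides["R"]):
            joined.append({**left_row, **right_row})
    return joined


if __name__ == "__main__":
    users = [{"session_id": 1, "user": "ann"}, {"session_id": 2, "user": "bob"}]
    logs = [{"session_id": 1, "event": "search"}, {"session_id": 1, "event": "book"}]
    for row in reduce_side_join(users, logs, "session_id", "session_id"):
        print(row)
```

Every record of both inputs passes through the shuffle, which is one reason joins are listed among the productivity and performance pain points of plain MapReduce.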

Page 20: Big Data: Analytics Platforms

Conclusion

• Focus on the “because we can…” part
  – help data scientists to make everything work

• Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop “glue”; get the plumbing right

• Package our results correctly
  – find the right abstractions (=> APIs of building blocks)