Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch
Why Big Data?
• because bigger is smarter
  – answer tough questions
• because we can
  – push the limits and good things will happen
bigger = smarter?
• Yes!
  – tolerate errors
  – discover the long tail and corner cases
  – machine learning works much better
• But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions
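The "tolerate errors" point is essentially the law of large numbers: averaging over more noisy observations drives the estimation error down, so individual bad data points matter less. A minimal sketch with made-up numbers (the mean of 5.0 and the noise level are illustrative assumptions, not from the talk):

```python
import random

def estimate_mean(n, true_mean=5.0, noise=1.0, seed=7):
    """Average n noisy samples of a quantity; the noise averages out as n grows."""
    rng = random.Random(seed)
    return sum(true_mean + rng.gauss(0, noise) for _ in range(n)) / n

# With more (equally noisy) data, the estimate gets closer to the truth:
err_small = abs(estimate_mean(10) - 5.0)        # ~0.1-0.5 typically
err_large = abs(estimate_mean(100_000) - 5.0)   # ~0.003 typically
```

The error of the large-sample estimate shrinks roughly with the square root of the sample size, which is exactly why "bigger" can tolerate individual errors.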
Fundamental Problem of Big Data
• There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    • e.g., stock market predictions change behavior of people
    • e.g., Web search engines determine behavior of people
• Hard to debug: takes the human out of the loop
  – Example: How to play the lottery in Napoli
    • Step 1: You visit "oracles" who predict numbers to play
    • Step 2: You visit "interpreters" who explain the predictions
    • Step 3: After you lose, "analysts" tell you that the "oracles" and "interpreters" were right and that it was your fault
  – [Luciano De Crescenzo: Thus Spake Bellavista]
Because we can… Really?
• Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel
• But
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)
• I believe that all of these "buts" can be addressed
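"Counting is embarrassingly parallel" because each data partition can be counted independently, with no coordination, and the partial counts merged at the end. A minimal in-process sketch with hypothetical event logs (in a cluster, each partition would live on a different machine):

```python
from collections import Counter
from functools import reduce
from operator import add

# Hypothetical event-log partitions; each could sit on a different node.
partitions = [
    ["login", "click", "click"],
    ["click", "logout"],
    ["login", "click"],
]

# Count each partition independently (no coordination needed) ...
partial = [Counter(p) for p in partitions]

# ... then merge: Counter addition sums the counts per key.
total = reduce(add, partial)
```

The merge step is associative and commutative, which is what lets a system scale it out over thousands of partitions in any order.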
Utility & Cost Functions of Data
[Charts: Utility vs. Noise/Error and Cost vs. Noise/Error]
[Charts: Utility and Cost vs. Noise/Error, with separate curves for curated, random, and malicious data]
Best Utility/Cost Tradeoff
[Charts: Utility and Cost vs. Noise/Error, highlighting the malicious-data curves]
What is good enough?
[Charts: Utility and Cost vs. Noise/Error, highlighting the curated-data curves]
What about platforms?
• Relational Databases
  – great for 20% of the data
  – not great for 80% of the data
• Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)
Why is Hadoop so popular?
• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
  – MR abstracts "counting" (= machine learning)
• it is an eco-system – it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on app / problem
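The sense in which "MR abstracts counting": a MapReduce job is just a mapper emitting (key, 1) pairs and a reducer summing per key, with the framework handling the distributed sort in between. A minimal in-process sketch of that contract for word count (Hadoop Streaming would run the same two functions over stdin/stdout on a cluster):

```python
import itertools

def mapper(lines):
    # Map side of word count: emit (word, 1) for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop delivers pairs grouped and sorted by key; sum the values per key.
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# The shuffle/sort phase is simulated here by sorting the mapper output.
counts = dict(reducer(sorted(mapper(["big data", "big analytics"]))))
```

Most counting-style analytics (histograms, co-occurrence statistics, the sufficient statistics of many ML models) fit this same two-function shape, which is why the abstraction travels so well.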
Example: Amadeus Log Service
• HDFS for compressed logs
• HBase to index by timestamp and session id
• SOLR for full text search
• Hadoop (MR) for usage stats & disasters
• Oracle to store meta-data (e.g., user information)
• Disclaimer: under construction & evaluation!
  – the current production system is proprietary
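The component list above suggests a simple composition pattern: a facade that routes each query type to the component best suited for it. A hypothetical sketch with an in-memory stand-in for the real clients (class and method names are invented for illustration; a real deployment would hold separate HBase, SOLR, and HDFS clients):

```python
class InMemoryStore:
    """Stand-in for HBase/SOLR/HDFS clients; real systems use their own APIs."""
    def __init__(self, rows):
        self.rows = rows                      # key -> log record (a string here)

    def get(self, key):                       # point lookup (HBase-style index)
        return self.rows.get(key)

    def search(self, term):                   # full-text search (SOLR-style)
        return [r for r in self.rows.values() if term in r]

    def scan(self, predicate):                # bulk scan (HDFS/MR-style)
        return [r for r in self.rows.values() if predicate(r)]


class LogService:
    """Facade routing each query type to the best-suited component."""
    def __init__(self, store):
        self.store = store

    def by_session(self, session_id):
        return self.store.get(session_id)     # indexed lookup by session id

    def full_text(self, term):
        return self.store.search(term)        # keyword search over logs

    def usage_stats(self, predicate):
        return self.store.scan(predicate)     # bulk scan for stats/disasters


svc = LogService(InMemoryStore({
    "s1": "s1 2014-01-01 search flight ZRH-NCE",
    "s2": "s2 2014-01-02 book hotel Napoli",
}))
```

The point of the pattern is that no single component has to be good at everything; each query class hits the index or scan path built for it.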
Some things Hadoop got wrong
• performance: huge start-up time & overheads
• productivity: e.g., joins, configuration knobs
• SLAs: no response-time guarantees, no real time
• essentially ignored 40 years of DB research
Some things Hadoop got right
• scales without (much) thinking
• moves the computation to the data
• fault tolerance, load balance, …
How to improve on Hadoop
• Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, …
• Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system
• Option 3: Build new Hadoop components
  – real-time, etc.
• Option 4: Patterns to compose components
  – log service, machine learning, …
  – but do not build a "super-Hadoop"
Conclusion
• Focus on the "because we can…" part
  – help data scientists make everything work
• Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop "glue"; get the plumbing right
• Package our results correctly
  – find the right abstractions (=> APIs of building blocks)