Tuesday, June 8, 2010

Big Data @ Bodensee Barcamp 2010


BIG DATA

The rise of the data scientist

http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/


Holidaycheck

Travel platform: review + book

12+ countries (.de ... .cn)

30% growth / year, profitable

Almost 1.5 million hotel reviews

1.6 million+ pictures


Data @ HC

internet-driven company

traditional: MVC/3-Tier/RDBMS/caching

50+ Apache instances

15 GB operational data

12 GB logs / day

5 searches / second

My scientist friend: “That’s neat, but it’s not data science.”


The I/O Bottleneck

“The problem is simple: Memory, Disk size and CPU and even network performance continue to grow much faster than disk I/O performance.”

2004 to 2009

CPU: still following Moore's Law (transistor count doubles every 18 months)

Memory Bandwidth (Intel): 9.3x

Disk Density (SATA): 8x

Disk I/O: 0.8x

Network speed: routers can easily saturate the fastest hard drives

http://blogs.cisco.com/datacenter/comments/networking_delivering_more_by_exceeding_the_law_of_moore/
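To put the figures above on one scale, here is a minimal sketch of the arithmetic. It assumes the slide's own statement of Moore's Law (a doubling every 18 months) and a five-year window; the variable names are illustrative only.

```python
# Growth factors for 2004-2009 as quoted on the slide.
MONTHS = 5 * 12          # 2004 to 2009
DOUBLING_PERIOD = 18     # months, the slide's statement of Moore's Law

growth = {
    "cpu_transistors": 2 ** (MONTHS / DOUBLING_PERIOD),  # derived, not measured
    "memory_bandwidth": 9.3,
    "disk_density": 8.0,
    "disk_io": 0.8,
}

# How far each resource has pulled ahead of disk I/O over the same period.
gap_vs_disk_io = {name: factor / growth["disk_io"] for name, factor in growth.items()}

for name, factor in growth.items():
    print(f"{name:17s} x{factor:5.1f}   ({gap_vs_disk_io[name]:.1f}x relative to disk I/O)")
```

Under these assumptions the CPU figure comes out at roughly a tenfold increase, so every other resource is about an order of magnitude ahead of disk I/O.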


I/O Repercussions

Turn to memcache

Try out SSD

Try out asynchronous writes (e.g. message queues)

Try to solve/hack the I/O problem: Sharding, in-memory DB

Our problems seem big, but are they really?
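The asynchronous-write idea from the list above can be sketched with an in-process queue. This is only an illustration: `save_review`, `stored`, and the review fields are hypothetical names, and a real setup would more likely use a message broker and a database than a Python list.

```python
import queue
import threading

write_queue = queue.Queue()
stored = []  # stand-in for a slow, disk-backed store (hypothetical)

def writer():
    """Drain the queue in the background so callers never wait on disk."""
    while True:
        item = write_queue.get()
        if item is None:          # sentinel: shut down
            write_queue.task_done()
            break
        stored.append(item)       # the slow write would happen here
        write_queue.task_done()

def save_review(review):
    """Return immediately; the worker completes the write later."""
    write_queue.put(review)

worker = threading.Thread(target=writer, daemon=True)
worker.start()

for i in range(3):
    save_review({"hotel_id": i, "text": "nice view"})

write_queue.put(None)   # ask the worker to stop
write_queue.join()      # wait until everything queued has been written
worker.join()
```

The caller gets its answer before the data is durable, so anything still sitting in the queue is lost if the process dies. That is the price paid for taking the write off the request path.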


So what is Big Data anyway?

“The term Big data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools”

kilo to mega to giga to tera to peta to exa to zetta to yotta


NoSQL = Not Only SQL

Trade-offs, e.g. transactions, data loss

e.g. Document Stores (MongoDB)

e.g. Key-Value Stores (MemcacheDB)

e.g. Graph Databases (Neo4j)

Map/Reduce algorithm
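For readers new to Map/Reduce, a minimal single-machine sketch of the three steps (map, shuffle, reduce) follows. The log lines and function names are made up for illustration; a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (key, 1) for each request; the key here is the requested path."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)

# Hypothetical, heavily simplified request log.
log_lines = [
    "GET /hotel/1",
    "GET /hotel/2",
    "GET /hotel/1",
]

pairs = [pair for line in log_lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over a cluster without the steps needing to share state.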


Medium Data

“With yesterday's scientific technology most businesses should be able to handle their data analysis needs.”

HC: 12 GB log files / day = medium data problem

(2006) Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

(2004) MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat

Solved (?) with: RDBMS + NoSQL


3 sexy skills of data geeks

“The sexy job in the next ten years will be statisticians… The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it.” Hal Varian (Google)

http://dataspora.com/blog/sexy-data-geeks/


3 skills: statistics

sentiment analysis

natural language processing

machine learning

good old-fashioned regression

recommendation engines
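As a reminder of how little code "good old-fashioned regression" needs, here is a minimal ordinary least squares fit of a straight line in plain Python. The data points are invented for the example and `fit_line` is a hypothetical helper, not something from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```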


3 skills: visualization

The Oatmeal vs. Edward Tufte, Ben Fry

Q: Are you hiring statisticians, visualization experts & data plumbers?


3 skills: data plumbing

Glue languages: Python, Perl, regex, XSLT

Admin: setting up, maintaining clusters

Affinity with OSS & *nix

NoSQL = NoSchema = Transform Data

/^([\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+\.)*[\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(\:\d{1,5})?)$/i
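A typical piece of data plumbing is turning raw log lines into structured records with a regular expression. The sketch below assumes the Apache Common Log Format; the sample line is invented, and `parse_line` is a hypothetical helper rather than anything shown in the talk.

```python
import re

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict for a well-formed line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = '127.0.0.1 - - [08/Jun/2010:10:15:32 +0200] "GET /hotel/42 HTTP/1.1" 200 5120'
record = parse_line(sample)
print(record)
```

Returning `None` for lines that do not match keeps the transform step tolerant of the malformed input that real logs tend to contain.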


More Data beats smart algorithms

spelling correction

machine translation

face recognition

http://videos.syntience.com/ai-meetups/peternorvig.html

http://dataspora.com/blog/tipping-points-and-big-data/
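Spelling correction shows why the amount of data matters more than the cleverness of the algorithm. The sketch below is a simplified corrector in the spirit of Peter Norvig's well-known approach: generate every word one edit away and pick the one seen most often. The tiny corpus is invented; with a real corpus the same code simply gets better as the word counts grow.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)

# Invented toy corpus; "hotel" occurs more often than "motel".
corpus = "the hotel was great the hotel pool was great the motel was noisy"
counts = Counter(corpus.split())

print(correct("hotl", counts))
print(correct("zotel", counts))
```

The algorithm has no linguistic knowledge at all. The word counts do all the work, which is the point of the slide.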


Ethics of data

Black Hat vs. White Hat <=> Black Data vs. White Data

White: Amazon free public datasets (e.g. human genome)

Black: Scientific climate data (or the lack of PUBLIC data)

Just like money, information flows to the least taxed location in a global world.


Take-Away & Discuss

“Don't throw away data if you don’t have to, because unlike material goods, data becomes more valuable the more of it is created. As a society, I don't think we understand this completely yet.”

q: Who is using a NoSQL db? Share Stories?

q: Do you hire statisticians?

q: Do you hire visualization experts?

q: Do you know how much data you are throwing away?

q: Share: how big is your data?

q: Do you own your customer data or does Facebook?

q: Do you own your content or does Google?

q: How are you exploiting asynchronicity?

q: Any tips on introducing NoSQL in companies?

q: Do you own your analytics data?

q: Should information be regulated (privacy)? Can it?
