Da
Michael Hausenblas, MapR Technologies
Berlin Buzzwords 2013, Open Stage Talk
Friday, 7 June 13
Nope. Not this one.
things you can influence
things that affect you
try and focus on this stuff
The awkward moment when I open the data I got from a customer
http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/
aka crap in, crap out
Some examples …
• Encöding hell
• Schema? Sure, I fax you a screenshot
• Dupes and other fakes
• Sampling
Encöding hell
application-specific encodings
• URL encoding
• HTML encoding
• Database escaping
non-ASCII?
a%20percent-encoded%20string%20as%20of%20RFC%203986
a <strong>HTML</strong> encoded string
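A minimal sketch of undoing the two application-specific encodings shown above, using only the Python standard library. The variable names are illustrative, and the entity-encoded form of the HTML sample is an assumption about how that string would look on the wire.

```python
from html import unescape
from urllib.parse import unquote

# Sample from the slide: percent-encoded as of RFC 3986.
url_encoded = "a%20percent-encoded%20string%20as%20of%20RFC%203986"

# Assumption: the HTML sample arrives with its markup entity-encoded.
html_encoded = "a &lt;strong&gt;HTML&lt;/strong&gt; encoded string"

# Each encoding needs its own decoder; applying the wrong one is a no-op at best.
url_decoded = unquote(url_encoded)
html_decoded = unescape(html_encoded)

print(url_decoded)   # a percent-encoded string as of RFC 3986
print(html_decoded)  # a <strong>HTML</strong> encoded string
```

Which decoder to apply, and in which order when encodings are nested, depends on how the data was produced, so it is worth recording that per source.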
• Use Unicode
• Use Unicode
• Use Unicode
http://www.swedishfika.com/2010/01/19/escaping-from-encoding-hell/
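One way to read "Use Unicode" in practice: decode bytes to text once, at the boundary, with an explicitly stated encoding, and normalize the result. The sketch below assumes UTF-8 as the default and NFC as the normal form; `to_text` is a hypothetical helper, not something from the talk.

```python
import unicodedata


def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes with an explicit encoding and normalize to NFC."""
    text = raw.decode(encoding)  # raises UnicodeDecodeError on invalid input
    return unicodedata.normalize("NFC", text)


utf8_bytes = "Encöding".encode("utf-8")
latin1_bytes = "Encöding".encode("latin-1")

print(to_text(utf8_bytes))               # Encöding
print(to_text(latin1_bytes, "latin-1"))  # Encöding

# Decoding with the wrong codec does not always fail; it can silently
# produce mojibake instead:
print(utf8_bytes.decode("latin-1"))      # EncÃ¶ding
```

Failing loudly on undecodable bytes is a deliberate choice here: a crash at load time is easier to trace than garbled text discovered after analysis.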
Schema? Sure, I fax you a screenshot
• There is a need for proper, formal documentation
• For humans and machines
• Basis for validation—automate!
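A rough sketch of what a machine-readable schema used as the basis for automated validation could look like. The field names, the `SCHEMA` layout, and the `validate` helper are all hypothetical; a real project would more likely use an established schema language.

```python
from typing import Any

# Hypothetical schema: field name -> (expected type, required?)
SCHEMA: dict[str, tuple[type, bool]] = {
    "customer_id": (int, True),
    "email": (str, True),
    "age": (int, False),
}


def validate(record: dict[str, Any], schema=SCHEMA) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, (expected, required) in schema.items():
        if field not in record:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields the schema does not know about are reported too.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


print(validate({"customer_id": "1", "email": "a@example.com"}))
```

Because the schema is plain data, the same definition can be rendered as documentation for humans and run against every incoming file.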
Dupes and other fakes
• Use plots to get an overview
• Watch out for outliers
• Try to establish source for errors and fix
• Document (in any case)
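As a numeric companion to plotting, the sketch below counts exact duplicate rows and flags outliers with Tukey's interquartile-range rule. Both helpers and the sample values are illustrative assumptions; the talk itself does not prescribe a specific method.

```python
from collections import Counter
from statistics import quantiles


def find_duplicates(rows):
    """Return rows that occur more than once, with their counts.

    Rows must be hashable, for example tuples.
    """
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


rows = [("alice", 1), ("bob", 2), ("alice", 1)]
print(find_duplicates(rows))

readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(readings))
```

Flagged values are candidates for a closer look, not automatic deletions: the point is to trace them back to their source and document what was found.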
Sampling
• My data is too big. I can’t check it all.
• Why don’t you sample, then?
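When the data is too big to load, one common option is reservoir sampling, which draws a uniform random sample in a single pass without knowing the total size up front. This is a sketch of the classic Algorithm R; the function name and the `seed` parameter are illustrative choices.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> list[T]:
    """Uniform sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: list[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Works on any iterable, for example the lines of a large file.
print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

Fixing the seed makes the sample reproducible, which helps when a problem found in the sample has to be revisited later.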
Go and buy this book. Now.