Big Data is Here! Embrace It!

  • Upload
    j-boye

  • View
    243

  • Download
    0

Embed Size (px)

Citation preview

  • 1. Big Data is Here!Embrace It!Bebo WhiteSLAC National Accelerator Laboratory/Stanford [email protected], Aarhus

2. SLAC lives forJBoye2014, Aarhusdata! 3. Data from the Linear Coherent Light Source (LCLS)JBoye2014, Aarhus 4. Data from the Linear Coherent Light Source (LCLS)JBoye2014, Aarhus 5. Data from the Linear Coherent Light Source (LCLS)JBoye2014, Aarhus~1.3 million DVDs/month 6. Data from the Linear Coherent Light Source (LCLS)JBoye2014, Aarhus~1.3 million DVDs/monthIs this Big Data? 7. Like SLAC Data is probably one of the most important assets of yourcompany (if not the most) You collect it You analyze it You plan with it You protect it You (maybe) sell it It sets you apart from your competitors And there is (probably) lots of itJBoye2014, Aarhus 8. I Hate Buzzwords They are like shiny objects to get ourattention to new concepts and capabilitiesJBoye2014, Aarhus 9. I Hate Buzzwords They are like shiny objects to get ourattention to new concepts and capabilitiesJBoye2014, Aarhus 10. I Hate Buzzwords They are like shiny objects to get ourattention to new concepts and capabilitiesJBoye2014, Aarhus 11. I Hate Buzzwords They are like shiny objects to get ourattention to new concepts and capabilitiesJBoye2014, Aarhus 12. JBoye2014, Aarhus 13. JBoye2014, Aarhus 14. Big Data IS abuzzword!JBoye2014, Aarhus 15. Big Data IS abuzzword!JBoye2014, Aarhus 16. Big Data IS abuzzword!JBoye2014, Aarhus 17. Big Data IS abuzzword!JBoye2014, Aarhus 18. Instead of Hype- Think Evolution instead of RevolutionJBoye2014, Aarhus 19. JBoye2014, Aarhus 20. JBoye2014, Aarhusdekabytes? 21. dekabytes? hectobytes?JBoye2014, Aarhus 22. JBoye2014, Aarhus 23. The Data Deluge From the beginning of recorded time until2003, mankind generated 5 exabytes of data In 2011, the same amount of data wasgenerated every two days In 2013, the same amount of data wasgenerated every 10 minutes In 2014? Such numbers become almost meaninglessJBoye2014, Aarhus 24. Where is This DataComing From?JBoye2014, Aarhus 25. Where is This DataComing From? EVERYWHERE!JBoye2014, Aarhus 26. Where is This DataComing From? EVERYWHERE! Any communication over a networkinvolves transfer of data that is meaningfulto someoneJBoye2014, Aarhus 27. Where is This DataComing From? EVERYWHERE! Any communication over a networkinvolves transfer of data that is meaningfulto someone Every e-mail, every tweet, every transaction,every social media interaction, etc.,etc.JBoye2014, Aarhus 28. Where is This DataComing From? EVERYWHERE! Any communication over a networkinvolves transfer of data that is meaningfulto someone Every e-mail, every tweet, every transaction,every social media interaction, etc.,etc. Sensors - The Internet of ThingsJBoye2014, Aarhus 29. Somethings HappeninHereJBoye2014, Aarhus First 24 hours 50-100 Mb data 30. And before he wasbornJBoye2014, Aarhus 31. Big Data - a PossibleDefinition Refers to datasets whose size is beyond the ability of Single storage devices Typical database software tools to capture, store,manage, and analyze (McKinsey Global Institute) This definition is not defined in terms of data size(which will increase) It can vary by sector/usage This is not a new issueJBoye2014, Aarhus 32. Beyond CapabilityJBoye2014, Aarhus 1956 technology 5Mb storage LCLS would requireover 1 trillion units permonth 33. JBoye2014, Aarhus 34. JBoye2014, Aarhus 35. The Cloud/IOT/WOT is/will be a very noisy place An unbelievable of objects (theoreticallymore than 10e38) will be able to talk to usand to each other (orders of magnitudemore than now) We will be interested in hearing what someof them have to say How can we manage these conversations? Traditional interfaces break downJBoye2014, Aarhus 36. JBoye2014, Aarhus 37. JBoye2014, Aarhus 38. JBoye2014, Aarhus 39. JBoye2014, Aarhus 40. JBoye2014, Aarhus 41. The SquareKilometer ArrayJBoye2014, Aarhus 42. The SquareKilometer ArrayJBoye2014, Aarhus 43. JBoye2014, AarhusThe Large HadronColliderThe SquareKilometer Array 44. JBoye2014, AarhusThe Large HadronColliderThe SquareKilometer Array 45. JBoye2014, Aarhus 46. JBoye2014, Aarhus 47. JBoye2014, Aarhus 48. JBoye2014, AarhusEBay 49. JBoye2014, AarhusEBay 50. JBoye2014, AarhusEBayFacebook 51. JBoye2014, AarhusEBayFacebook 52. JBoye2014, AarhusEBayFacebookTwitter 53. JBoye2014, Aarhus 54. High-volume, -velocity, and -variety information assetsthat demand cost-effective innovative forms of informationprocessing for enhanced insight and decision makingJBoye2014, Aarhus 55. Volume? ~ data volume worldwide in 2013 = 3.5 ZB(including 400 billion feature length HD movies) Velocity? Every 60 sec. on Facebook - 510K postedcomments; 293K status updates; 136K uploadedphotos 30 billion shares 20 million apps installedJBoye2014, Aarhus 56. Variety? Any type of data both meaningful andmeaningless Veracity? How is trust established? What does like really mean?JBoye2014, Aarhus 57. UsageJBoye2014, AarhusPlus:ISPsUtilitiesAcademic institutionsEveryone 58. JBoye2014, Aarhus 59. Challenges of HarnessingBig Data Datamining huge datasets Shortages of Big Data experts Privacy, legal, and social issues Strategies for acquiring Big Data - a newform of currencyJBoye2014, Aarhus 60. Big Data Analytics andData ScienceJBoye2014, Aarhus 61. An Interesting Example -Sentiment Analysis Goal - gauging mood on social networkdata Not a traditional survey or focus group Social sites operate 24/7 Timeliness - not subject to time lags Useful to marketers, IT, customers, etc.JBoye2014, Aarhus 62. Difficult CommentAnalysis (1/2) False negatives - crying & crap (negative) vs.crying with joy & holy crap! (positive) Relative sentiment - I bought a Honda Accord -great for Honda, bad for Toyota Compound sentiment - I love the phone but hatethe network Conditional sentiment - If someone doesnt callme back, Im never doing business with themagain!JBoye2014, Aarhus 63. Difficult CommentAnalysis (2/2) Scoring sentiment - I like it vs. I really like it vs.I love it Sentiment modifiers - I bought an iPhonetoday :-) Gotta love the telephone company ;-< International/cultural sentiments Japanese - unique emoticons for crying - (;_;) Italians - effusive, grandiose British - drier, less effusiveJBoye2014, Aarhus 64. Interesting stuff, but Who is it benefiting? Is it making us smarter or safer? At what price? Is Big Data a new currency?JBoye2014, Aarhus 65. What Im really interested in isthe intersection between BigData and Linked Open DataJBoye2014, Aarhus 66. JBoye2014, Aarhus 67. Big Data (and even not so Big Data) tendsto be unstructured data (e.g., lists, e-mails,tweets, etc.) - like the sentiment analysis stuff Therefore it tends to be thin rather thanthick Thin means very little (if any) context -just data, little knowledge What can be added to change data fromthin to thick?JBoye2014, Aarhus 68. Linked Data Provides access to the semantics of dataitems Based upon Semantic Web technologies andontologies Designed for machines first and humans later Degree of structure in descriptions of thingsis highJBoye2014, Aarhus 69. Linked Data Pros Far more parseable and machineprocessable than raw unstructured data Enhances data descriptions for complexanalyses Can contribute to the VERACITY of our data Wide variety of discipline/data ontologiesavailableJBoye2014, Aarhus 70. Linked Data Cons Much harder to do than adding keywordmetadata Building efficient processing applicationsand parsers Implementing effective linked data storesJBoye2014, Aarhus 71. Linked Open Data LOD refers to data stores of Linked Data that arepublished (made available online and accessed via URLs)and free to use Open data means it must be available to all withoutcopyright or ownership There is an increasing trend towards openinggovernment data (US and UK, San Francisco and more)and scientific results Provides unprecedented ability to build mashupapplicationsJBoye2014, Aarhus 72. JBoye2014, Aarhus 73. JBoye2014, Aarhus 74. JBoye2014, AarhusA deliberate, conscious, structuredattempt to turn data intoknowledge 75. Todays Takeaways (1/2) New forms of data allow us to answer old questions in new ways and toanswer entirely new questions There are multiple shifts happening: Data volume Creative & realtime analytics Computational infrastructure Dataflows vs. datasets Correlation vs. causation Increasing automation Machine2Machine data in the Internet of Things Change in use - not just commercial dataminingJBoye2014, Aarhus 76. Todays Takeaways (2/2) We own much of this data - it is about, from, orpaid for by us, so we should have a say in how it isused Data is a human resource as real as naturalresources I want my grandkids to reap the full benefits of theearths/mankinds data Call it what you wish, but embrace it or be leftbehindJBoye2014, Aarhus 77. Thank You!Questions? [email protected], AarhusWant copies ofthese slides?