Analytics Drives Big Data Drives Infrastructure

DESCRIPTION

A personal perspective on how analytics has evolved from the 1980s to the present, and how it has driven demands on computing and storage infrastructure. Examples range from machine learning ("AI") techniques using neural networks and genetic algorithms in the 1980s and 1990s, to AumniData's social media analytics in 2008-10, to real-time intent detection by Cruxly from 2011 onwards.

Citation preview

  • 1. Analytics Drives Big Data Drives Infrastructure: Confessions of Storage turned Analytics Geeks. Dr. Aloke Guha, 29th IEEE Conference on Massive Data Storage, May 8th, 2013

  • 2. What's Common Between a Sensor that Could Distinguish a Fine Cognac, and Predicting Movies You'd Like on Netflix? Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

  • 3. The Sommelier Robot

  • 4. Predicting What Movies You'd Watch

  • 5. (Analytics, BigData, DataStore)+

  • 6. Many Analytics Techniques . . . Statistics; Regression; Linear; Time-Series; Decision Trees; R; AI (McCarthy), 1956; Expert Systems; Machine Learning; Neural Networks; SVM; LDA; Naive Bayes; K-nearest neighbor; Random Forests; . . .; Genetic Algorithms; SNARC (Minsky), 1951; Dendral (Feigenbaum), 1965; Fraser and Burnell (1970); . . .; Vapnik (1992); Ihaka and Gentleman (1993)

  • 7. Common Analytics Processing pre-2000. Sources: Local. Data: Numeric, Homogeneous. Processing: Local. Consumer: Local. Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . .

  • 8. Flavor Predictor: Neural Networks, USPTO #5,373,452 (1994), 1988

  • 9. Pattern Recognition: Genetic Algorithms, USPTO #5,140,530, 1992

  • 10. Small to Big http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news

  • 11. Typical Analytics: 2000-2006. Sources: Global, Social Networks. Data: Heterogeneous, Numeric, Text. Processing: Hosted/Scale. Consumer: Global. Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.

  • 12. 2007- : Internet Data Analytics

  • 13. Financial Risk Scoring: Detect. Risk Scoring: detect incremental change in the number of occurrences where corporate officers mention risk (or equivalent terms) during the earnings call

  • 14. Financial Risk Scoring: Listen. Risk Scoring: detect incremental change in occurrences where corporate officers mention risk (or semantically equivalent terms) during the corporate earnings call

  • 15. Banking: Credit Worthiness (remember 2008?). Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks

  • 16. Share of Voice: Online Buzz

  • 17. Sentiment Analysis

  • 18. Analytics Processing: 2007- . Sources: Global, Mobile, New Social (Instagram, . . ). Data: Multi-Dimensional, Heterogeneous, Audio/Video. Processing: Hosted/Scale. Consumer: Global. Analytics: Batch, Streaming, . . .

  • 19. 2008 - : Real-Time/Streaming Analytics

  • 20. Brand Marketing

  • 21. Brand Management

  • 22. Customer Support

  • 23. Customer Support

  • 24. Lead Generation

  • 25. . . . More Data, Faster http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA

  • 26. Internet of Things http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for-m2m-technology-to-drive-connected-smarter-cities/ Message Queuing Telemetry Transport; Machine-to-Machine

  • 27. AumniData: Batch Processing. Diagram labels: Twitter; Blog/Web Site; YouTube; RSS/ATOM Feed Requestor/URL Scanner; Data Collector (Batch Scheduled); NLP+ Cruxly Intent Detection (AWS); NLP Stack + AumniData Classifier + Analytics* (RackSpace VM); Content Store; Content / Metadata Index (MySQL); Dashboard Store (SQL Server); Dashboard Configuration (TomCat); Dashboard Application (.3rd party App); Custom Analytics Display Ad-Hoc Query Summary

  • 28. Cruxly: Stream Processing. Diagram labels: Twitter; Streaming API Client (Heroku Worker) (24x7); Request (Keywords); Tweets (Keywords); NLP+ Cruxly Intent Detection (AWS); NLP (NER, etc.) + Cruxly Intent Detection (AWS); Tweet ID + Intent Signal (Heroku PostgreSQL); Tweets Content Store (DynamoDB); Reports / Dashboard; Tracker Editor (web app - Heroku)

  • 29. Data Analytics Demands . . . Stages: Store; Process; Analyze; View. Diagram labels: Storm; Data Collector; Text / Sensor Data / Stream . . .; NLP; Classify; Index; Query / RT Query; Ad Hoc / Search / SQL; Custom Analytics; Dashboards; Chart; Report; Machine Learning Library; Stats Library; R; Yarn

  • 30. Storage Implications: Back to the Future. MB/s Batch; IOPS Stream; Both?

  • 31. Storage Implications: Back to the Future II, III. Diagram labels: Job Tracker; Task tracker; Zookeeper; Hive; Pig; Oozie; HUE; HDFS client; Data Node; Name Node; MapReduce; HDFS; Master; Slave #1; Slave #N; Mgmt Node. Storage Capacity Scaling? Storage Tiering? Import/Export Data?

  • 32. A More General Data Analytics Framework? Diagram labels: Data Ingesters (Basic); Data Ingesters (Smart); Data Ingesters; Content Store; Metadata / In-Mem Store; Processing Stream and Batch; Analytics Processing; Sensor Processing: Data Integration; Visualization Library / Interactive Query; Local Storage / Flash / DAS; MapReduce / Distributed Data Store

  • 33. Conclusion. Data Analytics; Big Data; Scale-Out; Variety; Infrastructure; Volume; Bandwidth Support; Velocity; Streaming Support. We Solved the Processing Problem. We Need to Solve the Larger Storage Problem.

  • 34. Grateful Acknowledgements: Kapil Tundwal, Dr. Kirill Kireyev, Dr. Andrew Lampert, Venky Madireddy, Dr. Shumin Wu, Joan Wrabetz
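
The risk-scoring idea on slides 13 and 14 (detecting the incremental change in how often corporate officers mention risk or equivalent terms across earnings calls) can be illustrated with a minimal sketch. It is not the AumniData or Cruxly implementation: the term list `RISK_TERMS` and the function names `risk_mentions` and `incremental_change` are hypothetical, and plain keyword matching stands in for the semantic-equivalence detection the slides allude to.

```python
import re

# Hypothetical term list. The slides only say "risk (or semantically
# equivalent terms)"; a real system would likely use NLP, not a fixed set.
RISK_TERMS = {"risk", "risks", "risky", "uncertainty", "exposure"}


def risk_mentions(transcript: str) -> int:
    """Count the words in one earnings-call transcript that are risk terms."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    return sum(1 for token in tokens if token in RISK_TERMS)


def incremental_change(transcripts: list[str]) -> list[int]:
    """Return the call-over-call change in risk mentions.

    `transcripts` is assumed to be ordered oldest to newest; the result has
    one entry per consecutive pair of calls.
    """
    counts = [risk_mentions(text) for text in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


if __name__ == "__main__":
    calls = [
        "We see little risk this quarter.",
        "Risk remains, and uncertainty in demand adds exposure.",
    ]
    print(incremental_change(calls))  # [2]
```

A positive value means the later call mentions risk more often than the one before it. Whether such a change is significant would depend on call length and on a baseline, which this sketch does not model.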