Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
1
Tackling The Challenges of Big Data Visualizing Twitter
Samuel Madden
Professor and Director of Big Data at CSAIL
Massachusetts Institute of Technology
Tackling The Challenges of Big Data Visualizing Twitter
Introduction to Twitter Data
Samuel Madden Professor and Director of Big Data at CSAIL
Massachusetts Institute of Technology
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
This Module
1. Understanding Twitter data!!2. Demonstration of MapD, a Twitter visualization system!
3. Technology behind MapD!!4. Other Approaches to Interactive Analytics!
2
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Part 1: Twitter
• 500 million tweets a day!– > 8 million/day “geocoded”!
• More than just 140 characters:!– Geo Coordinates!– Timestamp!– User and Follower information!– Reply Information!– Hashtags!– Device/Platform Used to Post!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Geocoded Data
Geocoded Tweets containing “snow” on Oct 4, 2013!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
“Big Data” and Twitter
• Volumes and rates are massive; need new tools to interactively visualize data!
3
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
“Big Data” and Twitter
Not the best way to visualize 500 M Tweets!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
“Big Data” and Twitter
• Volumes and rates are massive; need new tools to interactively visualize data!
• Want to correlate with external and internal data sets!– E.g., brand preference vs census district income!
• Want to do deep analysis of content!– What product, show, or person is being discussed!– What opinion is being expressed (“sentiment analysis”)!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Spatial Correlations
Source: http://www.richblockspoorblocks.com!
0!0.005!
0.01!0.015!
0.02!0.025!
0.03!0.035!
0.04!0.045!
0.05!
0! 100,000! 200,000! 300,000! 400,000! 500,000!
% T
wee
ts
Men
tioni
ng
Star
buck
s!
Zip Code Income!
Requirement: Interactive learning and statistics!
4
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
“Big Data” and Twitter
• Volumes and rates are massive; need new tools to interactively visualize data!
• Want to correlate with external and internal data sets!– E.g., brand preference vs census district income!
• Want to do deep analysis of content!– What product, show, or person is being discussed!– What opinion is being expressed (“sentiment analysis”)!
Tackling The Challenges of Big Data Use Case: Visualizing Twitter
Introduction to Twitter
THANK YOU
Tackling The Challenges of Big Data Visualizing Twitter
MapD Demo
Samuel Madden Professor and Director of Big Data at CSAIL
Massachusetts Institute of Technology
5
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Part 2: MapD Demo
• What is MapD?!– GPU Accelerated Database!– With a Twitter Visualization in Front of it!
• What Will You See!– ~ 50 M tweets displayed!– Nothing is pre-computed!
• Designed to tackle the “volume” and “velocity” challenges!
Tackling The Challenges of Big Data Visualizing Twitter
MapD Demo
THANK YOU
Tackling The Challenges of Big Data Visualizing Twitter
How MapD Works
Samuel Madden Professor and Director of Big Data at CSAIL
Massachusetts Institute of Technology
6
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
How Does MapD Work?
• Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data!
• Not an accelerator, but a full blown SQL Database!!
WHAT IS MAPD? MapD is: � A GPU (Graphics Processing Unit)-
accelerated SQL column store database � Scales to any number of Nvidia
GPUs � A real-time map generator � Uses GPUs to render point and
heatmaps of query results in milliseconds
� A WMS web-server � Can serve out of the box as the
backend for a web mapping client, allowing for querying and visualization of billions of features
� Fast and cost-effective � 4 Nvidia commodity GPUs provide
provide over 12 Teraflops of compute power and nearly 1 TB/sec of memory bandwidth
147,201,658 tweets from Oct 1, 2012 to Nov 6, 2012
Relative intensity of “tornado” on Twitter (with point overlay) from Febuary 29, 2012 to March 1, 2012
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
GPUs
• Massive parallelism enables interactive browsing interfaces!– High End GPUs can provide!
250 GB/sec of bandwidth!5X conventional microprocessors!
4 Teraflops compute !10X conventional multi-core microprocessor!
– 7 – 70x speedup in database ops!• Challenges!
– Limited Memory on GPUs – But growing!!– Limited bandwidth between CPU and GPU – Will Change!!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
“Shared Nothing” Processing Multiple GPUs, with data partitioned between them!
Node 1! Node 2! Node 3!
Query: Heatmap tweets containing “rain”!
Filter!text ILIKE ‘rain’!
Filter!text ILIKE ‘rain’!
Filter!text ILIKE ‘rain’!
Tweets 0 – 100M! Tweets 100M – 200M! Tweets 200M – 300M!Images:http://graffletopia.com!
7
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Tweet Indexing on GPU • Encode tweets using a “dictionary”!
Word! Encoding!…! …!
Rain! 57663!
Rainbow! 57664!
Rainman! 57665!
Rainy! 57666!
…! …!
Filter!text ILIKE ‘rain’!
!
Filter!SELECT tweetid FROM words
WHERE id = 57663!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• Column-oriented execution!
– Avoids wasting memory bandwidth!• Plan:!
– Produce bitmap of tweets to read!– Read tweets, increment output bins
in bitmap!
Filter!SELECT tweet id FROM words WHEREid = 57663!
TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!
TweetId! Lat! Lon!…! …!1! -41.5! 23.1!2! -41.7! 77.4!3! -37.4! 48.2!4! 28.4! -44.0!…! …!Data Tables Reside in GPU Memory!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Balanced data per thread improves performance!
Bitmap!0!0!0!0!0!0!…!
Tweet 1!
Tweet n!
Warp 2!
Warp 3!
Warp 1!
TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!
8
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Need same amount of data to be efficient!
Bitmap!1!0!0!0!1!0!…!
Tweet 1!
Tweet n!
Warp 2!
Warp 3!
Warp 1!
TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Need same amount of data to be efficient!
Bitmap!1!0!0!0!1!0!…!
Tweet 1!
Tweet n!
Warp 2!
Warp 3!
Warp 1!
TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Need same amount of data to be efficient!
Bitmap!1!0!1!1!1!0!…!
Tweet 1!
Tweet n!
TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!
9
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Need same amount of data to be efficient!
Bitmap!1!0!1!1!1!0!…!
Tweet 1!
Tweet n!
Lat! Lon!…!-41.5! 23.1!-41.7! 77.4!-37.4! 48.2!28.4! -44.0!…!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Need same amount of data to be efficient!
Bitmap!1!0!1!1!1!0!…!
Lat! Lon!…!-41.5! 23.1!-41.7! 77.4!-37.4! 48.2!28.4! -44.0!…!
Warp 2!
Warp 3!
Warp 1!
Output buffer!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Example: Filtering in Parallel !• 1000+ GPU Threads!• Running in “Warps”!• Threads in same warp run exactly the same instructions!
– Need same amount of data to be efficient!
Bitmap!1!0!1!1!1!0!…!
Lat! Lon!…!-41.5! 23.1!-41.7! 77.4!-37.4! 48.2!28.4! -44.0!…!
Warp 2!
Warp 3!
Warp 1!
Output buffer!
10
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Parallel Plumbing
Once parallel versions of all operators are built, programmers can just think in terms of SQL!
E.g.,!!SELECT heatmap(lat,lon) !!WHERE tweettext ILIKE “rain”!
!Scaling to 100’s of millions of points in milliseconds!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Summary
• Interactivity can create a qualitative difference!
• GPUs can dramatically accelerate some tasks!– Compute intensive tasks via parallelism!– Data intensive tasks via increased memory bandwidth!
• Clever use of hardware enables dramatic speedups!• Arrays of multiple GPUs help solve large problems!
Try it yourself:!http://mapd.csail.mit.edu!
Tackling The Challenges of Big Data Visualizing Twitter
How MapD Works
THANK YOU
11
Tackling The Challenges of Big Data Visualizing Twitter
Other Approaches to Interactive Analytics
Samuel Madden Professor and Director of Big Data at CSAIL
Massachusetts Institute of Technology
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Other Approaches to Interactive Analytics • Interactive Analytics Important in Many Situations!
• Examples!– Fault Diagnostics!– Intrusion Detection!– Financial Analysis!– Online Advertising!
• GPUs & Massive Parallelism are Just One Way Approach!
!
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Other Techniques We’ll Cover
• Brute Force / Massive Parallelism!– MapD (Hadoop, and many others)!
• Partitioning!– Split the data, operate on the part you need!
• Sampling!– Operate on a Subset of the Data!
• Summarization!– Operate on a Summary of the Data!
12
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Partitioning
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Sampling
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Summarization
13
© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!
Other Techniques We’ll Cover
• Brute Force / Massive Parallelism!– MapD (Hadoop, and many others)!
• Partitioning!– Split the data, operate on the part you need!
• Sampling!– Operate on a Subset of the Data!
• Summarization!– Operate on a Summary of the Data!
Tackling The Challenges of Big Data Visualizing Twitter
THANK YOU