Topic cluster of Streaming Tweets based on GPU-Accelerated Self Organizing Map

Group 15: Chen Zhutian, Huang Hengguang


Outline

Background

Pipeline and Technique

Conclusion


Background

What happens in the tweet stream?

• Unsupervised clustering algorithm.
• Organize large document collections according to textual similarities.
• Create a visual result for searching and exploring large document collections.

WEBSOM system

• Based on the Self-Organizing Map (SOM).
• Generates a topic map for documents.
• Lets you explore large document collections much like exploring Google Maps.

What does WEBSOM look like?

Gap

• WEBSOM: long documents, static collection, long training time.
• Twitter: short text, dynamic, streaming data.
• How to adapt SOM to streaming Twitter data?


What our system looks like


Pipeline and Technique

Pipeline

1. Detect Event
2. Build Dictionary
3. Vectorize Tweets
4. Reduce Dimension
5. SOM Cluster
6. Show the SOM map

Current step: Detect Event

Detect Event

• Focus only on unusual events.
• How to identify abnormal events on Twitter?

[Figure: Tweets Stream → Events]

Detect Event

1. Similar to TCP's congestion control mechanism.
2. Count the number of tweets in a moving window.
3. Weighted moving average and variance.
4. Threshold to determine whether it's an event.
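The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the TCP analogy refers to the exponentially weighted mean and deviation that TCP keeps for its retransmission timer, and the names `detect_events`, `alpha`, `beta`, `k`, and `warmup` are hypothetical. The default weights (1/8, 1/4, 4) are borrowed from that TCP estimator, not from the slides.

```python
def detect_events(counts, alpha=0.125, beta=0.25, k=4.0, warmup=5):
    """Flag windows whose tweet count exceeds mean + k * deviation.

    counts: tweets per moving window, in time order.
    alpha, beta, k, warmup: assumed parameters, not taken from the slides.
    """
    events = []
    mean = float(counts[0])  # weighted moving average of the counts
    dev = 0.0                # weighted moving deviation of the counts
    for i, count in enumerate(counts[1:], start=1):
        # Compare against the threshold built from *past* windows only.
        threshold = mean + k * dev
        if i >= warmup and count > threshold:
            events.append(i)
        # Update deviation first, using the old mean, then the mean itself.
        dev = (1 - beta) * dev + beta * abs(count - mean)
        mean = (1 - alpha) * mean + alpha * count
    return events
```

A window is reported only after `warmup` windows have been seen, so that the estimates have time to settle before the threshold is trusted.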

Test Data

FIFA 2014, Brazil 1:7 Germany

• Track 823 keywords, such as "FIFA", "Ger", "Brazil", "#WordCup"…
• In 110 minutes.
• 100 million tweets.
• Sample 1%.

Detect Event

[Figure: detected peaks annotated "Goal!", "Goal! x 3", "Goal!"]

Time of peak | What happened?
4:11 | First goal
4:25 | 3 goals in 3 minutes
4:30 | Goal
5:07 | Second half begins
5:25 | Goal
5:35 | Goal
5:46 | Goal
5:50 | End of match

Pipeline: Detect Event → Build Dictionary → Vectorize Tweets → Reduce Dimension → SOM Cluster → Show the SOM map

Current step: Build Dictionary

Build Dictionary

1. Remove stop words.
2. Stemming with Snowball.
3. Remove words whose occurrence is less than 10%.
4. Remove words whose occurrence is greater than 50%.
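A rough sketch of these four steps is given below. Several things are assumptions rather than facts from the slides: "occurrence" is read as document frequency (the fraction of tweets containing the word), `STOP_WORDS` is a tiny illustrative list, and `toy_stem` is only a placeholder for a real Snowball stemmer, used here so the example runs without third-party packages.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); a real system would use a full one.
STOP_WORDS = {"the", "a", "an", "is", "in", "to", "and", "of"}

def toy_stem(word):
    """Placeholder for a Snowball stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(tweet):
    """Lowercase, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z#@']+", tweet.lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

def build_dictionary(tweets, min_df=0.10, max_df=0.50):
    """Map each kept word to a column index.

    A word is kept when the fraction of tweets containing it lies in
    [min_df, max_df]; rarer and more common words are removed.
    """
    doc_freq = Counter()
    for tweet in tweets:
        doc_freq.update(set(tokenize(tweet)))  # count each word once per tweet
    n = len(tweets)
    kept = sorted(w for w, c in doc_freq.items() if min_df <= c / n <= max_df)
    return {word: index for index, word in enumerate(kept)}
```

The returned mapping gives each surviving word a fixed position, which is what the vectorization step needs.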

Vectorize Tweets

1. Vector space model
2. TF-IDF
3. Normalization

$V_i = (0.4123, 0.12312, 0.344, \ldots)$
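The three steps can be illustrated with the sketch below. The slides do not say which TF-IDF variant or which norm is used, so this assumes raw term counts, idf = log(N / df), and L2 normalization; `tfidf_vectors` and its parameters are hypothetical names.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists, dictionary):
    """Turn tokenized tweets into L2-normalized TF-IDF vectors.

    token_lists: one list of tokens per tweet.
    dictionary: word -> column index, as produced by the dictionary step.
    """
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update({t for t in tokens if t in dictionary})

    vectors = []
    for tokens in token_lists:
        counts = Counter(t for t in tokens if t in dictionary)
        vec = [0.0] * len(dictionary)
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])  # assumed idf variant
            vec[dictionary[term]] = tf * idf
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]  # unit length, so tweets are comparable
        vectors.append(vec)
    return vectors
```

Each tweet becomes one row with as many columns as the dictionary has words, which is why the dimension grows quickly and motivates the reduction step.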

10,000 tweets × 10,000 dimensions

More than 1 hour to converge

Pipeline: Detect Event → Build Dictionary → Vectorize Tweets → Reduce Dimension → SOM Cluster → Show the SOM map

Current step: Reduce Dimension

Pipeline: Detect Event → Build Dictionary → Vectorize Tweets → Reduce Dimension → SOM Cluster → Show the SOM map

Current step: SOM Cluster

SOM Cluster

What is SOM? Self-Organizing Map.

• Artificial neural network
• Unsupervised learning
• Iteration-based
• Visual result

SOM Cluster

• Sequential SOM
• Batch-type SOM: faster, effective
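To make the batch variant concrete, here is a small CPU-only sketch of a standard batch SOM update: every epoch first assigns all samples to their best matching unit, then replaces each unit's weights with a neighborhood-weighted mean of the samples. It is not the authors' GPU code; the Gaussian neighborhood, the shrinking `sigma` schedule, and the evenly spaced initialization are assumptions chosen for simplicity.

```python
import math

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to sample x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda j: sq_dist(weights[j]))

def train_batch_som(data, rows, cols, epochs=10, sigma_end=0.5):
    """Train a rows x cols map with batch updates; returns unit weights."""
    dim = len(data[0])
    units = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumed initialization: evenly spaced samples from the data.
    weights = [list(data[(j * len(data)) // len(units)])
               for j in range(len(units))]
    sigma_start = max(rows, cols) / 2
    for epoch in range(epochs):
        # Neighborhood radius shrinks geometrically over the epochs.
        frac = epoch / max(epochs - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        # Step 1: assign every sample using the *current* weights.
        bmus = [best_matching_unit(weights, x) for x in data]
        # Step 2: each unit becomes a neighborhood-weighted mean.
        new_weights = []
        for j, (rj, cj) in enumerate(units):
            num, den = [0.0] * dim, 0.0
            for x, b in zip(data, bmus):
                rb, cb = units[b]
                grid_sq = (rj - rb) ** 2 + (cj - cb) ** 2
                h = math.exp(-grid_sq / (2 * sigma ** 2))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            new_weights.append([v / den for v in num] if den > 0 else weights[j])
        weights = new_weights
    return weights
```

Because step 1 and the sums in step 2 are independent across samples and units, the batch form parallelizes naturally, which is what makes it a good fit for a GPU.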

SOM Cluster

Random Projection + Batch SOM + CUDA

Hour → 1 Second

Test Data

20 newsgroups

• 20,000 documents.
• 20 different newsgroups.
• Each document is in only 1 group.
• 60% train vs 40% test.

Source: http://web.ist.utl.pt/acardoso/datasets/

20 Newsgroup Test

Method | Random Projection | Macro Accuracy (%) | Micro Accuracy (%)
Renato's SOM | NO | 68 | 67
Our Method | YES | 60 | 61

Conclusion: Random projection loses precision, so performance decreases after dimension reduction.
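The loss of precision can be seen in a small sketch of random projection. The slides do not say which projection matrix is used, so a dense Gaussian matrix scaled by 1/sqrt(k) is assumed here; `make_projection`, `project`, and `distance` are hypothetical helper names.

```python
import math
import random

def make_projection(in_dim, out_dim, seed=0):
    """Random in_dim x out_dim matrix with N(0, 1/out_dim) entries (assumed)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
            for _ in range(in_dim)]

def project(vec, matrix):
    """Multiply a row vector by the projection matrix, skipping zeros."""
    out = [0.0] * len(matrix[0])
    for value, row in zip(vec, matrix):
        if value != 0.0:  # tweet vectors are sparse, so most rows are skipped
            for j, r in enumerate(row):
                out[j] += value * r
    return out

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Distances between projected vectors match the original distances only approximately, and the approximation gets coarser as the target dimension shrinks. That distortion is one plausible source of the accuracy drop reported in the table.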

20 Newsgroup Test

Method | Random Projection | Macro Accuracy (%) | Micro Accuracy (%)
Renato's SOM | NO | 68 | 67
Our Method | YES | 60 | 61
MATLAB repeat of Renato's SOM | NO | 63 | 62
MATLAB repeat of Renato's SOM | YES | 61 | 60

We used the SOM Toolbox to fully repeat Renato's experiment.
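The slides do not define the two accuracy columns. Under the usual convention, which is an assumption here, micro accuracy is the fraction of all documents labeled correctly, while macro accuracy averages the per-class accuracy so that every newsgroup counts equally. A sketch with hypothetical function names:

```python
from collections import defaultdict

def micro_accuracy(y_true, y_pred):
    """Fraction of all documents whose predicted label is correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_accuracy(y_true, y_pred):
    """Mean of per-class accuracy, so each class has equal weight."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

The two numbers differ when classes are unbalanced or when errors concentrate in a few classes, which is why the table reports both.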


FIFA Data


Conclusion

• 2 algorithms
• 3 sets of experiments
• 1 prototype system
• 1 case study


Thanks for Watching

Q & A