
Finding Missing Tweets using Topic Structure and Browsing Time


Yu Suzuki†, Hiromitsu Ohara‡, Akiyo Nadamoto‡

† Nara Institute of Science and Technology, Japan
‡ Konan University, Japan

5. December, 2017


Introduction


Social networking services (SNSs) carry massive volumes of messages.
Users are not always online.

Users therefore miss important information on SNSs.
Cf. Twitter's “While you were away” feature, whose summary has a flat structure.

Users need to understand, in a short time, the topics that came up while they were offline.

A mechanism for summarizing tweets is useful.

We believe that users can understand the topics easily when the tweets are summarized as a tree structure.

Summarize Tweets Using Topic Structure and Browsing Time


Why do we consider topic structure?

Missing tweet                          Topic     Subtopic
Today’s baseball game is exciting!     baseball  game
Yesterday I went to baseball stadium   baseball  place
I’m at Salzburg!                       travel    Austria
I’m at baseball stadium!               baseball  place
...                                    ...       ...

Tweets on minority topics are ignored if we simply summarize the missing tweets.
  The missing tweets are mainly related to “baseball.”
  Only one tweet is related to “travel.”
  If these tweets are summarized without using topics, the tweet about travel may not appear in the summary.

We visualize the topics of the tweets as a tree structure.
  First, users see the top-level topics, such as “baseball” and “travel.”
  If users are interested in “baseball,” they browse “game” and “place.”
  Users do not miss the tweet about travel.

How do we construct the topic structure?


Our contribution

1 Generate topic structures of tweets using the Wikipedia category tree and browsing time
  We use the Wikipedia categories as the knowledge base for constructing the tree structure.
  We use the browsing time to determine which tweets users have missed.

2 Visualize the topic structure of tweets using a network graph
  We implement our method as a Web application.

3 Confirm, using a real dataset, that our proposed method is effective for commonly known topics
  Our method is effective if there is a lot of information about the theme.
  Wikipedia only has articles about commonly known topics.


Our Proposed Method

Overview

[Figure: Overview of the proposed method.
 1. Clustering of Tweets: tweets are grouped into clusters such as C0 = Ichiro and C1 = Masahiro. A cluster such as C3 = Human is deleted because it is too wide to cover topics.
 2. Generate a Topic Graph: the clusters are attached to the Wikipedia category tree (e.g. MLB player, Sports, Japanese). A topic node is a parent node of tweet clusters. An abstract node is a parent node of topic nodes.
 3. Visualization: the topic graph is displayed together with a tweet list of the tweets that correspond to a category, e.g. tweets about Ichiro and tweets about Masahiro.]


Steps

Overview

1 Extracting missing tweets: Identify which tweets were submitted during the user’s browsing time and which were submitted before or after it.

2 Clustering tweets into categories and extracting topics: Using repeated bisection as the clustering tool, we divide a set of tweets into clusters and extract the topics of each cluster.

3 Generating a topic graph: Using the topics of the tweets and the Wikipedia category tree, we generate a topic graph of the tweets.

4 Classifying topics: We classify the topics, which are the nodes of the topic graph, as known topics or unknown topics.

5 Visualizing topic graphs: We visualize the topic graph and the corresponding tweets using our Web user interface.


0. Extraction of missing tweets


We extract the tweets that users have not browsed.
  We assume that the browsing time is given.
  Browsing time may be available if we build a Twitter client application.
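A minimal sketch of this step, assuming the browsing time is given as a list of (start, end) sessions and every tweet has a timestamp. The names `Tweet`, `extract_missing_tweets`, and `sessions` are hypothetical, and treating every tweet posted outside all sessions as missed is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Tweet:
    text: str
    posted_at: datetime


def extract_missing_tweets(tweets, browsing_sessions):
    """Return the tweets posted outside every (start, end) browsing session.

    Assumption: a tweet posted while the user was browsing counts as seen.
    """
    def was_browsing(moment):
        return any(start <= moment <= end for start, end in browsing_sessions)

    return [t for t in tweets if not was_browsing(t.posted_at)]


# Hypothetical data: one 30-minute browsing session and two tweets.
sessions = [(datetime(2017, 12, 5, 9, 0), datetime(2017, 12, 5, 9, 30))]
timeline = [
    Tweet("Today's baseball game is exciting!", datetime(2017, 12, 5, 9, 10)),
    Tweet("I'm at Salzburg!", datetime(2017, 12, 5, 13, 0)),
]
missing = extract_missing_tweets(timeline, sessions)
```

Here only the Salzburg tweet falls outside the session, so it is the only one passed on to clustering.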


1. Clustering tweets

[Figure: Step 1, clustering of tweets. Clusters C0 = Ichiro and C1 = Masahiro are kept. Cluster C3 = Human is deleted because it is too wide to cover topics.]

We use repeated bisection to cluster tweets.
  In our experiment, repeated bisection was the most effective method for clustering short texts.
  It is similar to k-means.
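A simplified sketch of the idea behind repeated bisection, not the clustering tool used in the experiment: start with all tweets in one cluster and repeatedly split one cluster in two with a small 2-means loop under cosine similarity. Always splitting the largest cluster, the seeding rule, and the names `bisect_cluster` and `repeated_bisection` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def bisect_cluster(members, vectors, rounds=10):
    """Split one cluster (a list of tweet indices) into two halves."""
    seed_a = members[0]
    # Seed the second half with the vector least similar to the first seed.
    seed_b = min(members, key=lambda i: cosine(vectors[seed_a], vectors[i]))
    centers = [vectors[seed_a], vectors[seed_b]]
    halves = [members, []]
    for _ in range(rounds):
        halves = [[], []]
        for i in members:
            to_a = cosine(vectors[i], centers[0])
            to_b = cosine(vectors[i], centers[1])
            halves[0 if to_a >= to_b else 1].append(i)
        if not halves[0] or not halves[1]:
            break  # the cluster cannot be split any further
        centers = [centroid([vectors[i] for i in half]) for half in halves]
    return halves


def repeated_bisection(vectors, k):
    """Bisect the largest cluster until k clusters exist or no split is possible."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break
        left, right = bisect_cluster(largest, vectors)
        if not left or not right:
            break
        clusters.remove(largest)
        clusters.extend([left, right])
    return clusters


# Hypothetical two-dimensional feature vectors for four tweets.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = repeated_bisection(vectors, k=2)
```

Unlike k-means, which assigns all points to k centers at once, this scheme only ever solves two-way splits, which is what makes it "repeated bisection."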

We remove noise clusters.
  We calculate the cosine similarity between every pair of texts in a cluster.
  We remove the nodes whose similarity is beyond the threshold.
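A minimal sketch of the noise filter, under stated assumptions: the slide does not say how the pairwise similarities are aggregated or on which side of the threshold a cluster is removed. This sketch assumes that a cluster is noise when its mean pairwise cosine similarity falls below the threshold. The threshold value 0.2 and the names `mean_pairwise_similarity` and `remove_noise_clusters` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pairwise_similarity(cluster):
    """Average cosine similarity over every pair of texts in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # a single text is trivially coherent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def remove_noise_clusters(clusters, threshold=0.2):
    """Keep the clusters whose mean pairwise similarity reaches the threshold.

    Assumption: low internal similarity marks a cluster as noise.
    """
    return {name: texts for name, texts in clusters.items()
            if mean_pairwise_similarity(texts) >= threshold}


# Hypothetical clusters: C0 is coherent, C3 has no shared terms at all.
clusters = {
    "C0": [{"ichiro": 1.0, "hit": 1.0}, {"ichiro": 1.0, "marlins": 1.0}],
    "C3": [{"lunch": 1.0}, {"train": 1.0}, {"rain": 1.0}],
}
kept = remove_noise_clusters(clusters)
```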


Repeated bisection
Given a set of tweets $T$, we extract a feature vector for each tweet. First, we divide each tweet into terms using a morphological analyzer or a POS tagger. Then, we select nouns and unknown terms as feature terms. We use unknown terms because they include slang and newly coined words that the morphological analyzer does not recognize. To clean the feature terms, we keep only the terms that appear in more than two tweets. The feature vector $f(t_i)$ of tweet $t_i$ ($t_i \in T$) is defined as follows.

\[
f(t_i) = \left[\, \mathrm{tf}(t_i, w_1) \cdot \mathrm{idf}(w_1),\; \mathrm{tf}(t_i, w_2) \cdot \mathrm{idf}(w_2),\; \cdots,\; \mathrm{tf}(t_i, w_m) \cdot \mathrm{idf}(w_m) \,\right] \tag{1}
\]

\[
\mathrm{tf}(t_i, w_j) =
\begin{cases}
1 & \text{if } w_j \text{ appears in } t_i \text{ at least once} \\
0 & \text{otherwise}
\end{cases} \tag{2}
\]

\[
\mathrm{idf}(w_j) = -\log \frac{\mathrm{df}(w_j)}{|T|} \tag{3}
\]

where $w_j$ is a term in $T$, $m$ is the number of feature terms, $|T|$ is the number of tweets in $T$, $\mathrm{tf}(t_i, w_j)$ indicates whether $w_j$ appears in $t_i$ or not, $\mathrm{df}(w_j)$ is the number of tweets that contain $w_j$, and $\mathrm{idf}(w_j)$ is the IDF (inverse document frequency) value of $w_j$, where a document is a tweet.
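A minimal sketch of how the feature vectors could be computed from these definitions. It assumes that tokenization and the selection of nouns and unknown terms have already been done, so each tweet arrives as a list of feature terms. The function name `build_feature_vectors` and the sample terms are hypothetical.

```python
import math
from collections import Counter


def build_feature_vectors(tweets_terms):
    """Build binary-tf times idf vectors from one list of feature terms per tweet."""
    n_tweets = len(tweets_terms)
    # df(w): the number of tweets that contain w (each tweet counted once).
    df = Counter(term for terms in tweets_terms for term in set(terms))
    # Cleaning step: keep only the terms that appear in more than two tweets.
    vocab = sorted(w for w, count in df.items() if count > 2)
    idf = {w: -math.log(df[w] / n_tweets) for w in vocab}
    vectors = []
    for terms in tweets_terms:
        present = set(terms)
        # tf is 1 if the term occurs in the tweet and 0 otherwise.
        vectors.append([idf[w] if w in present else 0.0 for w in vocab])
    return vocab, vectors


# Hypothetical feature terms for four tweets.
tweets_terms = [
    ["baseball", "game"],
    ["baseball", "stadium"],
    ["baseball", "stadium", "game"],
    ["salzburg", "game"],
]
vocab, vectors = build_feature_vectors(tweets_terms)
```

In this toy input only "baseball" and "game" appear in more than two tweets, so "stadium" and "salzburg" are dropped by the cleaning step and each vector has two components.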


2. Topic graph

[Figure: Step 2, generating a topic graph. The tweet clusters (Ichiro, Masahiro) are attached to the categories that their tweets correspond to (MLB player, Sports, Japanese). A topic node is a parent node of tweet clusters. An abstract node is a parent node of topic nodes.]

1 Generate a topic node that corresponds to a tweet cluster.

2 Generate a semantic node that corresponds to a topic node.

3 Merge multiple nodes into a simple structure of nodes.


2.1 Generate a topic node


Topic node: a Wikipedia article that corresponds to a cluster.
Method

1 The repeated bisection method outputs keywords for each category, together with the degree of relatedness between each keyword and the category.

2 We retrieve articles from Wikipedia and find the most relevant category.
  Many categories have their own articles, so these categories are also candidates for topic nodes.

Example
A category has the keywords {(“baseball”, 1), (“player”, 0.5)}.
There are two Wikipedia articles, wp and wq:
  The title of wp is “baseball team” and the title of wq is “baseball player.”
Calculate the score of each article:
  wp = 1 + 0 = 1 and wq = 1 + 0.5 = 1.5.

We select article wq, “baseball player,” as the topic node.
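The scoring in this example can be sketched as follows. Matching a keyword against whole words of the article title is an assumption, since the slide only shows the resulting sums. The names `score_article` and `select_topic_node` are hypothetical.

```python
def score_article(title, keywords):
    """Sum the weights of the cluster keywords that occur in the article title."""
    words = title.lower().split()
    return sum(weight for keyword, weight in keywords.items() if keyword in words)


def select_topic_node(titles, keywords):
    """Pick the article title with the highest keyword score."""
    return max(titles, key=lambda title: score_article(title, keywords))


# Keywords and candidate titles taken from the example on the slide.
keywords = {"baseball": 1.0, "player": 0.5}
titles = ["baseball team", "baseball player"]
scores = {title: score_article(title, keywords) for title in titles}
topic_node = select_topic_node(titles, keywords)
```

With these inputs "baseball team" scores 1.0 and "baseball player" scores 1.5, so "baseball player" is selected, matching the example.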


2.2 Generate a semantic node


Semantic node: a Wikipedia category that corresponds to the topic node.
Method

1 Get the category names from Wikipedia.
2 Prune unsuitable categories from the semantic nodes using a blacklist.
  Person born in 19xx, Stub, A list of xx, ...

Example
Category c0 is tagged with “Ichiro Suzuki.”
The article “Ichiro Suzuki” has two categories, “Yankees Players” and “Baseball Players.”
“Yankees Players” and “Baseball Players” are considered semantic nodes.

[Figure: Ichiro Suzuki is linked to “Yankees Player” and “Baseball Player.” Kenta Maeda is linked to “Baseball Player” and “Dodgers Player.”]
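A minimal sketch of the pruning step. The regular expressions are hypothetical patterns modeled on the blacklist examples above, and `ARTICLE_CATEGORIES` is a hypothetical in-memory stand-in for a Wikipedia category lookup; the extra blacklisted entries in it are invented only to show the filter at work.

```python
import re

# Hypothetical blacklist patterns modeled on the examples on the slide.
BLACKLIST = [
    re.compile(r"^Person born in \d{4}$"),
    re.compile(r"\bStub\b", re.IGNORECASE),
    re.compile(r"^A list of "),
]

# Hypothetical stand-in for looking up the categories of a Wikipedia article.
ARTICLE_CATEGORIES = {
    "Ichiro Suzuki": ["Yankees Players", "Baseball Players", "Person born in 1973"],
    "Kenta Maeda": ["Baseball Players", "Dodgers Players", "Stub"],
}


def semantic_nodes(topic_node):
    """Return the categories of a topic node that survive the blacklist."""
    categories = ARTICLE_CATEGORIES.get(topic_node, [])
    return [c for c in categories if not any(p.search(c) for p in BLACKLIST)]


ichiro_nodes = semantic_nodes("Ichiro Suzuki")
maeda_nodes = semantic_nodes("Kenta Maeda")
```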


2.3 Merge multiple nodes

[Figure: Example of two network graphs. Ichiro Suzuki is linked to “Yankees Player” and “Baseball Player.” Kenta Maeda is linked to “Baseball Player” and “Dodgers Player.”]

[Figure: Two nodes are merged if two graphs share a common node. After merging, Ichiro Suzuki and Kenta Maeda both link to a single “Baseball Player” node, next to “Yankees Player” and “Dodgers Player.”]

[Figure: If a leaf node and a non-leaf node correspond to the same article, these nodes are merged. The merged graph additionally contains the node “Sportspeople.”]
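A minimal sketch of the merging rules, assuming each graph is stored as a mapping from a node to the set of its parent nodes and that nodes are identified by their article or category title. Under that assumption, nodes with the same title collapse into one when the graphs are unioned. The function name `merge_graphs` is hypothetical, and the edge from “Baseball Player” to “Sportspeople” is an assumption, since the figure residue does not show which nodes “Sportspeople” is connected to.

```python
def merge_graphs(*graphs):
    """Union several {node: set_of_parents} graphs keyed by node title."""
    merged = {}
    for graph in graphs:
        for node, parents in graph.items():
            merged.setdefault(node, set()).update(parents)
            for parent in parents:
                # Make sure nodes that only appear as parents exist as keys.
                merged.setdefault(parent, set())
    return merged


ichiro = {"Ichiro Suzuki": {"Yankees Player", "Baseball Player"}}
maeda = {"Kenta Maeda": {"Baseball Player", "Dodgers Player"}}
# Assumed edge: here "Baseball Player" appears as a child, elsewhere as a parent.
upper = {"Baseball Player": {"Sportspeople"}}

merged = merge_graphs(ichiro, maeda, upper)
children_of_baseball = sorted(
    node for node, parents in merged.items() if "Baseball Player" in parents
)
```

Keying nodes by title means no separate matching pass is needed: the shared “Baseball Player” node and the node that appears at two different levels are both unified by the dictionary itself.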


Visualization

Visualize topic nodes and semantic nodes

[Figure: Step 3, visualization. The topic graph shows “Japanese MLB Player” above “Ichiro” and “Masahiro,” next to a tweet list for the selected node.]

Tweet list examples:
  About Ichiro: “Now three of the greatest hitters in Major League history in one dugout with the Marlins. Barry Bonds, Ichiro and Don Kelly. amazing.”
  About Masahiro: “Joe Girardi discusses Masahiro Tanaka pitching on extended rest after Tuesday night's 9-0 victory.”


Experiments

Experimental Setup


Aim of our experiment:
  To confirm whether our method is effective.
  To identify which themes of tweets are appropriate for our proposed method.

Evaluation measure: precision
  We (the second author of our paper) manually selected an appropriate category for each tweet.
  We calculated the precision for each category.

\[
\text{precision} = \frac{\text{number of accurately categorized tweets}}{\text{number of tweets in the category}}
\]
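The precision above can be computed per category as in this minimal sketch. It assumes the method's output and the manual labels are both available as mappings from a tweet ID to a category; the function name `precision_per_category` and the sample data are hypothetical.

```python
def precision_per_category(assigned, gold):
    """Compute precision per category from {tweet_id: category} mappings.

    assigned: categories produced by the method.
    gold: categories selected manually.
    """
    totals, correct = {}, {}
    for tweet_id, category in assigned.items():
        totals[category] = totals.get(category, 0) + 1
        if gold.get(tweet_id) == category:
            correct[category] = correct.get(category, 0) + 1
    return {c: correct.get(c, 0) / totals[c] for c in totals}


# Hypothetical labels for four tweets.
assigned = {1: "Sports", 2: "Sports", 3: "Music", 4: "Sports"}
gold = {1: "Sports", 2: "Music", 3: "Music", 4: "Sports"}
precision = precision_per_category(assigned, gold)
```

In this toy data two of the three tweets placed in Sports are correct, and the single tweet placed in Music is correct.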

Dataset
  Categories: Politics, Music, Computer, Sports, and Animation/Games (five categories)
  Tweets: We prepared 2,000 tweets for each category using the Twitter Search API.


Procedure of the experiment

1 Cluster the 2,000 tweets for each theme and extract the topics of each cluster.

2 Generate the topic graph using our proposed method.

3 Give the clusters and their corresponding Wikipedia article titles to the observers.
  Observers are hired through crowdsourcing (Crowdworks).

4 Observers evaluate whether the article titles are appropriate for representing the clusters on the following five-point scale (5: appropriate, 4: almost appropriate, 3: cannot say, 2: almost inappropriate, 1: inappropriate).

5 Summarize the observers’ evaluations and analyze whether our proposed method has good accuracy.
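A minimal sketch of the summarizing step, assuming that the scores are averaged per cluster over the observers and that the averages are counted in the bins 1.0-2.0, 2.0-3.0, 3.0-4.0, and 4.0-5.0. The slides do not state how boundary values are binned, so the half-open bins with 5.0 placed in the last bin are an assumption, and the names `bin_label` and `summarize` and the sample ratings are hypothetical.

```python
from statistics import mean

BINS = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0)]


def bin_label(average):
    """Map an average score to its bin label; 5.0 goes into the last bin."""
    for low, high in BINS:
        if low <= average < high or (high == 5.0 and average == 5.0):
            return f"{low:.1f}-{high:.1f}"
    raise ValueError(f"average out of range: {average}")


def summarize(ratings):
    """Count per-cluster average scores in each bin.

    ratings: {cluster_id: [score from each observer]}.
    """
    counts = {f"{low:.1f}-{high:.1f}": 0 for low, high in BINS}
    for scores in ratings.values():
        counts[bin_label(mean(scores))] += 1
    return counts


# Hypothetical ratings from three observers for three clusters.
ratings = {"c0": [5, 4, 4], "c1": [2, 3, 1], "c2": [5, 5, 5]}
histogram = summarize(ratings)
```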


Experimental Results


[Figure: Histogram. x-axis: average of evaluation scores (bins 1.0-2.0, 2.0-3.0, 3.0-4.0, 4.0-5.0). y-axis: number of evaluation scores (0 to 25). Series: Politics, Music, Computer, Sports, Video Games.]

Table: Numbers of evaluation scores for respective bins.

Theme               # obsv.   Prec.
Politics            8         0.72
Music               11        0.56
Computer            5         0.44
Sports              5         0.42
Animation & Games   4         0.52

Our method is useful for tweets about politics.
  There are many technical terms about politics.
  Many articles are on Wikipedia.

Our method is not effective for computer and sports.
  There is a wide variety of topics.
  Fewer articles are on Wikipedia.


Merging Multiple Topics

[Figure: One example of a topic/semantic graph. Black nodes are topic nodes and gray nodes are semantic nodes. The node labels are not recoverable.]

There are topic nodes for “Yakult” and “Softbank.”
  Yakult: a manufacturer of drinks
  Softbank: a cell phone carrier
  Both companies are based in Tokyo.

We can connect the two nodes using our proposed method for merging the nodes of multiple topics.


Conclusion


We proposed a method for automatically extracting the tweets a user has missed, based on topic granularity and the time during which the user was not browsing.
  We extract missing tweets based on the missing time.
  We propose a method for mapping a set of extracted missing tweets to the Wikipedia category tree by considering the granularity of the topic structure.

We confirmed that our proposed method is effective for “politics” but not effective for “computer” and “sports.”

Future Work
We should consider resources other than Wikipedia as a knowledge base.
  Wikipedia is not always suitable for personal topics.

We should consider synonyms.
We should compare our method with other methods.
We should run a usability test of the Web user interface.