Exploring the similarity between Social Knowledge Sources and Twitter for Cross-Domain Topic Classification of Tweets #KECSM 2012 #ISWC2012



Andrea Varga, Amparo E. Cano and Fabio Ciravegna

1 Organisations, Information and Knowledge (OAK) Research Group, University of Sheffield

2 Knowledge Management Institute (KMI), Open University

KECSM 2012/ISWC 2012

Nov 12, 2012



Outline

1 Motivation

2 State-of-the-art

3 Methodology

4 Results

5 Conclusions and Future Work



Why classify tweets into topics?

Topic classification (TC) of tweets can be important for multiple applications:

Information Retrieval

Recommendation

Emergency responses, etc.

Topic name | Example tweet
Disaster&Accident (DisAcc) | happening accident people dying could phone ambulance wakakkaka xd
Entertainment&Culture (EntCult) | google adwords commercial greeeat enjoyed watching greeeeeat day
Politics (Pol) | quoting military source sk media reports deployed rocket launchers decoys real
Sports (Sports) | ravens good position games left browns bengals playoffs


What are the challenges in Topic Classification (TC) of Tweets?

Special characteristics of tweets

the restricted size of a post (limited to 140 characters)

the frequent use of misspellings and jargon

the frequent use of abbreviations

the use of non-standard English: reflected in vocabulary and writing style



=> These characteristics pose additional challenges for traditional supervised machine learning approaches to building accurate topic classifiers for tweets


Why are Social Knowledge Sources (KS) relevant to Twitter?

Data bottleneck problem: investigate an alternative approach, inspired by domain adaptation/transfer learning, for exploiting the information from Social Knowledge Sources (DBpedia and Freebase) for TC of tweets

Commonalities between KSs and Twitter

they are constantly edited by web users

they are social and built in a collaborative manner

they cover a large number of topics

More importantly: KSs contain a large amount of annotated data on a large number of topics


Research questions

1 Are KSs relevant for topic classification of Tweets?

2 Which features make the KSs look more similar to Twitter?

3 How similar or dissimilar are KSs to Twitter? Which similarity measure better quantifies the lexical changes between KSs and Twitter?


State-of-the-art approaches for TC of Tweets

Using DBpedia for Topic Classification of Tweets:

Wikify (Mihalcea, R. and Csomai, A., 2007)

Enriching unstructured text with Wikipedia links (D. Milne and I. H. Witten, 2008)

Tagme (P. Ferragina and U. Scaiella, 2010)

Topical Social Sensor (P. K. P. N. Mendes et al., 2010)

Vector space model (Oscar Munoz-Garcia et al., 2011)

Using Freebase for Topic Classification of Tweets:

Clustering-based approach (S. P. Kasiviswanathan et al., 2011)

Our main contribution:

Understanding the similarity between KSs and Twitter

Exploring multiple KSs (DBpedia + Freebase)

Investigating various statistical metrics for quantifying the similarity between KSs and Twitter


Methodology followed

1 Collecting Data from KSs

2 Building Cross-Domain (CD) Topic Classifier of Tweets

3 Measuring Distributional Changes Between KSs and Twitter

[Figure: pipeline diagram with the stages Retrieve articles, Concept enrichment, Build Cross-domain Classifier (Sc. DB, Sc. FB, Sc. DB-FB), Retrieve tweets, Concept enrichment, Annotate Tweets.]


Step 1: Collecting Data from KSs

Twitter corpus collected in Abel et al. (2011): tweets posted between October 2010 and January 2011, annotated with 17 topics

Random selection of 1,000 articles/tweets from DBpedia/Freebase/Twitter for each topic => 9,465 articles from DBpedia; 16,915 articles from Freebase; and 12,412 tweets

Preprocessing: removal of hashtags, mentions and URLs from tweets; taking the top-1000 features for each topic
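A minimal sketch of the preprocessing step. The slides do not say how hashtags, mentions and URLs were detected or how the top features were ranked, so the regular expressions, the frequency-based ranking and the helper names `clean_tweet` and `top_features` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed patterns; the slides do not say how these tokens were detected.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_tweet(text):
    """Remove URLs, mentions and hashtags, then lowercase and tokenize."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return re.findall(r"[a-z0-9']+", text.lower())

def top_features(token_lists, k=1000):
    """Keep the k most frequent tokens (raw frequency is an assumed criterion)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [tok for tok, _ in counts.most_common(k)]

tweets = [
    "@bob ravens in good position http://t.co/xyz #nfl",
    "ravens and bengals in the playoffs #nfl",
]
cleaned = [clean_tweet(t) for t in tweets]
print(cleaned[0])
print(top_features(cleaned, k=2))
```

URLs are stripped first so that a `#` fragment inside a link is not mistaken for a hashtag.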

[Figure: pie charts of multilabel frequency. DBpedia: 88.6%, 8.6%, 1.8%, 0.9% (legend: 1, 8, 2, 3+4+5+6+7+9). Freebase: 99.9%, 0.1% (legend: 1, 2). Twitter: 71%, 22.3%, 5.6%, 1%, 0.1% (legend: 1, 2, 3, 4, 6+5).]


Step 1: Collecting Data from KSs

[Figure: topic labels. Disaster_Accident, Business_Finance, Education, Entertainment, Health, Environment, Human Interest, Labor, Law_Crime, Politics, Religion, Social Issues, Sports, Technology_IT, Weather, War_Conflict.]

Retrieval of articles for a given topic (e.g. Politics):

from DBpedia: executing SPARQL queries for retrieving category names containing the topic name, e.g. Category:Politics_of_the_United_States, Category:National_Democratic_Party_Egypt_politicians, etc.

from Freebase: accessing the Text Service API for articles belonging to the topic; for underspecified topics/domains, consider articles containing the topic in their titles
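The slides do not show the SPARQL query itself. As an illustration only, a query of roughly the following shape could list categories whose label contains a topic name. The use of `skos:Concept` and `rdfs:label` to identify categories, and the helper name `category_query`, are assumptions rather than the authors' actual query.

```python
def category_query(topic, limit=100):
    """Build a SPARQL query string for categories whose label mentions `topic`.

    Assumes categories are typed skos:Concept and carry an rdfs:label.
    """
    # Escape backslashes and double quotes so the topic is a safe string literal.
    literal = topic.lower().replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?category WHERE {{
  ?category a skos:Concept ;
            rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), "{literal}"))
}}
LIMIT {int(limit)}"""

query = category_query("Politics")
print(query)
```

The resulting string would then be sent to a SPARQL endpoint; the endpoint and client are left out because the slides do not name them.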


Step 2: Building Cross-Domain (CD) Topic Classifier of Tweets

Considering two different feature sets:

BoW: TF-IDF value of the words present in the examples (articles or tweets)

BoE: TF-IDF value of the words and entity+concept pairs present in the examples (articles or tweets)
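A small sketch of how the two feature sets could be built. It assumes the common `tf * log(N / df)` weighting and represents each entity+concept pair as a single `entity|Concept` token; neither detail is given on the slides, and the toy annotations below are hand-made rather than output of the concept enrichment step.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {feature: tf * log(N / df)} dict per document."""
    n_docs = len(docs)
    df = Counter(feat for doc in docs for feat in set(doc))
    return [
        {feat: tf * math.log(n_docs / df[feat]) for feat, tf in Counter(doc).items()}
        for doc in docs
    ]

# Toy examples: (words, entity+concept pairs). The pairs are hand-made here.
examples = [
    (["ravens", "good", "position", "playoffs"], ["ravens|Organization"]),
    (["military", "source", "rocket", "launchers"], ["sk|Country"]),
    (["browns", "bengals", "playoffs"], ["browns|Organization"]),
]

bow_docs = [words for words, _ in examples]
boe_docs = [words + pairs for words, pairs in examples]  # words plus pairs

bow = tfidf(bow_docs)
boe = tfidf(boe_docs)
print(bow[0]["playoffs"])             # shared by two documents, so a lower weight
print(boe[0]["ravens|Organization"])  # feature that exists only in the BoE set
```

BoE is a superset of BoW in this sketch: every word feature is kept and the entity+concept pairs are appended as extra features.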


Step 3: Measuring Distributional Changes Between KSs and Twitter

Building a vector $\vec{d_s}$ for each source dataset (Sc.DB, Sc.FB, Sc.DB-FB) and a vector $\vec{d_t}$ for the target dataset (Twitter), consisting of the TF-IDF weights for either the BoW or BoE feature sets

Statistical measures applied:

$\chi^2$ test: $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is the observed value for a feature, while $E$ is the expected value calculated on the basis of the joint corpus

Kullback-Leibler symmetric distance:

$$KL(\vec{d_s} \,\|\, \vec{d_t}) = \sum_{f \in F_S \cup F_T} \left(\vec{d_s}(f) - \vec{d_t}(f)\right) \log \frac{\vec{d_s}(f)}{\vec{d_t}(f)}$$

cosine similarity:

$$cosine(\vec{d_s}, \vec{d_t}) = \frac{\sum_{k=1}^{|F_S \cup F_T|} \vec{d_s}(f_{S_k}) \times \vec{d_t}(f_{T_k})}{\sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left(\vec{d_s}(f_{S_k})\right)^2} \times \sqrt{\sum_{k=1}^{|F_S \cup F_T|} \left(\vec{d_t}(f_{T_k})\right)^2}}$$
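The three measures above can be sketched over sparse `{feature: weight}` vectors. Several details are assumptions, since the slides do not spell them out: the expected value in the chi-square statistic is taken from a 2 x |F| contingency table over the joint corpus, the KL distance is computed on weights normalised to sum to one with a small `eps` added to avoid division by zero, and the cosine uses the usual square-root norms.

```python
import math

def chi_square(ds, dt):
    """Chi-square over a 2 x |F| table; E = row total * column total / grand total."""
    feats = set(ds) | set(dt)
    tot_s, tot_t = sum(ds.values()), sum(dt.values())
    grand = tot_s + tot_t
    if grand == 0:
        return 0.0
    stat = 0.0
    for f in feats:
        col = ds.get(f, 0.0) + dt.get(f, 0.0)
        for observed, row in ((ds.get(f, 0.0), tot_s), (dt.get(f, 0.0), tot_t)):
            expected = row * col / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def symmetric_kl(ds, dt, eps=1e-9):
    """Sum over the feature union of (p - q) * log(p / q), with eps smoothing."""
    feats = set(ds) | set(dt)
    zs = sum(ds.get(f, 0.0) + eps for f in feats)
    zt = sum(dt.get(f, 0.0) + eps for f in feats)
    total = 0.0
    for f in feats:
        p = (ds.get(f, 0.0) + eps) / zs
        q = (dt.get(f, 0.0) + eps) / zt
        total += (p - q) * math.log(p / q)
    return total

def cosine(ds, dt):
    """Dot product over the feature union divided by the product of the norms."""
    feats = set(ds) | set(dt)
    dot = sum(ds.get(f, 0.0) * dt.get(f, 0.0) for f in feats)
    norm_s = math.sqrt(sum(v * v for v in ds.values()))
    norm_t = math.sqrt(sum(v * v for v in dt.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0

# Toy TF-IDF vectors (made-up weights).
source = {"election": 2.0, "party": 1.0, "senate": 1.0}
target = {"election": 1.0, "party": 0.5, "lol": 1.5}
print(chi_square(source, target), symmetric_kl(source, target), cosine(source, target))
```

Chi-square and KL grow as the two distributions diverge, while cosine shrinks, so their signs have to be read accordingly when they are compared with classifier performance.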


Experimental setting

1-vs-all approach: an individual CD classifier is built for each topic, using SVM classifiers and 5-fold cross-validation

Sc.DB, Sc.FB and Sc.DB-FB classifiers: trained on the full KS data, evaluated on 20% of the Twitter data (2,482 tweets)

TGT classifier: trained on 80% of the Twitter data, evaluated on 20% of the Twitter data (2,482 tweets)
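A sketch of the bookkeeping around the 1-vs-all setting: turning multi-label annotations into one binary target per topic, splitting the Twitter data 80/20, and scoring with F1. The SVM itself is left out, the predictions are made up, and the helper names and the random seed are hypothetical.

```python
import random

def binarize(label_sets, topic):
    """1-vs-all targets: 1 if the example carries `topic`, else 0."""
    return [1 if topic in labels else 0 for labels in label_sets]

def f1_score(gold, pred):
    """F1 for the positive class of a binary problem."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def split_80_20(items, seed=0):
    """Shuffle and split into 80% train / 20% test (the seed is arbitrary)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(0.8 * len(items))
    return items[:cut], items[cut:]

label_sets = [{"Sports"}, {"Pol", "War"}, {"Sports", "EntCult"}, {"Pol"}, {"Weather"}]
gold = binarize(label_sets, "Sports")
pred = [1, 0, 0, 1, 0]  # made-up predictions from a hypothetical Sports classifier
score = f1_score(gold, pred)
train, test = split_80_20(range(10))
print(gold, score, len(train), len(test))
```

One such binary problem is set up per topic, so a tweet with several labels counts as a positive example for each of its topics.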


Findings - Classification performance in F1 measure

Q1 : Which KS reflects better the lexical variation in Twitter?

[Table: F1 (%) per topic and classifier, rebuilt from the heatmap; topic labels and value rows are paired in the order they were extracted.]

Topic | Sc.DB(BoE) | Sc.DB(BoW) | Sc.FB(BoE) | Sc.FB(BoW) | Sc.DB-FB(BoE) | Sc.DB-FB(BoW) | TGT(BoW) | TGT(BoE)
Edu | 37.40 | 37.80 | 42.50 | 47.20 | 42.50 | 45.70 | 71.90 | 71.30
Sports | 10.30 | 11.70 | 26.50 | 26.20 | 26.20 | 26.00 | 60.10 | 59.20
War | 9.30 | 10.90 | 23.90 | 23.60 | 24.60 | 25.70 | 67.60 | 72.70
Labor | 1.40 | 1.30 | 31.90 | 31.90 | 30.10 | 29.90 | 79.90 | 79.40
Weather | 3.10 | 7.00 | 39.80 | 39.90 | 36.70 | 36.00 | 81.20 | 81.50
HumInt | 1.10 | 1.40 | 2.00 | 1.60 | 1.50 | 2.20 | 33.60 | 34.20
Env | 15.20 | 14.20 | 2.20 | 2.20 | 8.90 | 8.40 | 46.60 | 48.30
TechIT | 1.60 | 2.40 | 18.40 | 18.60 | 12.40 | 9.90 | 57.40 | 58.00
DisAcc | 8.30 | 9.70 | 14.50 | 14.50 | 21.00 | 21.10 | 57.50 | 59.20
SocIssue | 1.30 | 2.00 | 8.80 | 9.00 | 9.70 | 9.10 | 44.20 | 44.00
HospRecr | 1.40 | 2.30 | 17.20 | 19.50 | 11.30 | 13.90 | 41.60 | 42.60
Law | 0.90 | 2.70 | 16.80 | 17.80 | 14.80 | 13.30 | 46.80 | 46.40
Pol | 22.20 | 21.00 | 27.80 | 26.80 | 24.20 | 26.80 | 45.10 | 44.90
Health | 28.60 | 30.30 | 25.60 | 24.70 | 26.90 | 25.40 | 51.70 | 51.80
Religion | 23.40 | 25.10 | 14.40 | 14.70 | 21.00 | 20.40 | 58.40 | 58.50
EntCult | 15.30 | 15.10 | 15.50 | 16.20 | 19.50 | 20.20 | 42.20 | 43.10
BusFi | 21.40 | 21.20 | 11.50 | 15.30 | 18.60 | 19.10 | 47.50 | 46.50





Sc.Db-FB showed best performance, followed by Sc.Fb and Sc.Db
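The ranking can be checked against the reported F1 values by averaging each cross-domain column over the 17 topics. The numbers below are copied from the slide, with rows in the order they appear there; using an unweighted mean over topics is an assumption about how the comparison was summarised.

```python
COLUMNS = ["Sc.DB(BoE)", "Sc.DB(BoW)", "Sc.FB(BoE)", "Sc.FB(BoW)",
           "Sc.DB-FB(BoE)", "Sc.DB-FB(BoW)"]

# F1 (%) per topic, one row per topic, for the six cross-domain configurations.
F1 = [
    [37.4, 37.8, 42.5, 47.2, 42.5, 45.7],
    [10.3, 11.7, 26.5, 26.2, 26.2, 26.0],
    [9.3, 10.9, 23.9, 23.6, 24.6, 25.7],
    [1.4, 1.3, 31.9, 31.9, 30.1, 29.9],
    [3.1, 7.0, 39.8, 39.9, 36.7, 36.0],
    [1.1, 1.4, 2.0, 1.6, 1.5, 2.2],
    [15.2, 14.2, 2.2, 2.2, 8.9, 8.4],
    [1.6, 2.4, 18.4, 18.6, 12.4, 9.9],
    [8.3, 9.7, 14.5, 14.5, 21.0, 21.1],
    [1.3, 2.0, 8.8, 9.0, 9.7, 9.1],
    [1.4, 2.3, 17.2, 19.5, 11.3, 13.9],
    [0.9, 2.7, 16.8, 17.8, 14.8, 13.3],
    [22.2, 21.0, 27.8, 26.8, 24.2, 26.8],
    [28.6, 30.3, 25.6, 24.7, 26.9, 25.4],
    [23.4, 25.1, 14.4, 14.7, 21.0, 20.4],
    [15.3, 15.1, 15.5, 16.2, 19.5, 20.2],
    [21.4, 21.2, 11.5, 15.3, 18.6, 19.1],
]

# Unweighted mean over topics for each configuration.
means = {name: sum(row[i] for row in F1) / len(F1) for i, name in enumerate(COLUMNS)}
for name, value in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {value:6.2f}")
```

On these figures the combined source comes first for both feature sets, with Freebase alone close behind and DBpedia alone clearly lower.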




Q2 : What feature makes the KSs look more similar to Twitter?

BoW features were found better than BoE for CD classifiers

BoE features were found better than BoW for TGT



Findings - Examining the number of annotations needed for the Twitter classifier to outperform Sc.DB-FB

Investigated the impact of employing the Sc.DB-FB classifier over the Twitter classifier in terms of the number of annotations

The performance of the Twitter classifier against the three CD classifiers over the full learning curve

=> In the absence of any annotated tweets, applying these CD classifiers is beneficial
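One way to read a learning curve for this question is to find the smallest number of annotated tweets at which the Twitter classifier exceeds a CD classifier's F1. The sketch below uses a made-up curve and a hypothetical helper name, `annotations_to_outperform`; none of the numbers come from the study.

```python
def annotations_to_outperform(learning_curve, cd_f1):
    """Smallest number of annotated tweets at which the Twitter classifier beats cd_f1.

    learning_curve: list of (num_annotated_tweets, f1) pairs.
    Returns None if the curve never exceeds cd_f1.
    """
    for n, f1 in sorted(learning_curve):
        if f1 > cd_f1:
            return n
    return None

# Made-up learning curve, for illustration only (not values from the study).
curve = [(0, 0.0), (100, 12.0), (500, 24.5), (1000, 33.0), (5000, 48.0)]
needed = annotations_to_outperform(curve, cd_f1=20.8)
print(needed)
```

Below that point the CD classifier is the better choice, which is the sense in which it helps when few or no annotated tweets are available.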


Findings - Examining the number of annotations needed for the Twitter classifier to outperform the CD classifiers

Q3 : How similar or dissimilar are KSs to Twitter posts, and which similarity measure better reflects the lexical changes between KSs and Twitter posts?

Compared χ2, KL-divergence and cosine for each topic

χ2 obtained the best correlation with the performance of the CD classifiers, achieving scores >70% for 32 cases

cosine obtained correlation scores >70% for 25 cases

KL obtained correlation scores >70% for 24 cases

=> The χ2 test is the best measure for quantifying the distributional differences between KSs and Twitter.
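The slides do not state which correlation coefficient was used. As a sketch under the assumption that it is Pearson's r, the correlation between a similarity measure and classifier performance could be computed as follows; the values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up values for one topic: a distance score and the F1 of the CD classifier
# for each of three source configurations.
distance = [0.9, 0.6, 0.5]
f1 = [10.0, 24.0, 26.0]
r = pearson(distance, f1)
print(round(r, 3))
```

For a distance measure a strong negative r indicates agreement with performance, whereas for a similarity such as cosine the expected sign is positive.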


Conclusions and Future Work

We presented a first study towards understanding the usefulness of KSs in TC of tweets at various granularities: lexical features (BoW) and entity features (BoE)

Our main findings are:

In the absence of any annotated tweets, applying these CD classifiers is beneficial

Of the two KSs, Freebase topics seem to be much closer to the Twitter topics than the DBpedia topics

The two KSs contain complementary information

For the CD classifiers, on average BoW features were more useful than BoE features

For the Twitter classifiers, on average BoE features were more useful than BoW features

We found the χ2 test to be the best measure for quantifying the distributional differences between KSs and Twitter

Our future work will focus on building more accurate topic classifiers and investigating better measures


Corpus Analysis - Size of vocabulary

[Figure: bar chart of the vocabulary size per topic for each dataset (TGT, Sc.DB, Sc.FB, Sc.DB-FB).]


Understanding the results - Number of unique entities

Examining the number of entities in the source (Sc. DB, Sc. FB, Sc. DB-FB) and target (TGT) datasets after pre-processing.

the TGT dataset consists of 1.73 ± 0.35 entities/tweet

the Sc.DB dataset consists of 22.24 ± 1.44 entities/article

the Sc.FB dataset consists of 8.14 ± 5.78 entities/article

[Figure: bar chart of the number of unique entities per topic for each dataset (TGT, Sc.DB, Sc.FB, Sc.DB-FB).]


Corpus Analysis - Distribution of the top 15 entities

[Figure: heatmap of the distribution of the top 15 entity types across topics for the Sc.DB, Sc.FB and TGT datasets. Entity types: NaturalFeature, Country, Person, Position, Organization, Technology, MedicalCondition, SportsEvent, Region, Continent, Company, IndustryTerm, City, Facility, ProvinceOrState.]
