A Comparison of Different Strategies for Automated Semantic Document Annotation



Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp



Motivation [1/2]
• Document annotation
  – Helps users and search engines find documents
  – Requires a huge amount of human effort
  – e.g., subject indexers at ZBW have labeled 1.6 million scientific documents in economics
• Semantic document annotation
  – Documents annotated with semantic entities
  – e.g., PubMed with MeSH, ACM DL with ACM CCS

Focus on semantic document annotation
Necessity of automated document annotation


Motivation [2/2]
• Experiments so far have been small scale
  – Comparing only a small number of strategies
  – Datasets containing a few hundred documents
• We compare 43 strategies for document annotation within the developed experiment framework
  – The largest number of strategies to date
• Experiments with three datasets from different domains
  – Contain the full texts of 100,000 documents annotated by subject indexers
  – The largest dataset of scientific publications

We conducted the largest-scale experiment


Experiment Framework

Strategies are composed of methods from concept extraction, concept activation, and annotation selection.

1. Concept Extraction: detect concepts (candidate annotations) in each document
2. Concept Activation: compute a score for each concept of a document
3. Annotation Selection: select annotations from the concepts of each document
4. Evaluation: measure the performance of strategies against the ground truth
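Below is a minimal sketch of how the four stages compose into a strategy. The function names and types are illustrative assumptions, not the authors' code; only the stage order comes from the slides.

```python
# Hypothetical glue code for the four-stage framework: any concept
# extraction, activation, and selection method can be plugged in.
from typing import Callable, Dict, List, Set

def run_strategy(doc: str,
                 extract: Callable[[str], List[str]],
                 activate: Callable[[List[str]], Dict[str, float]],
                 select: Callable[[Dict[str, float]], Set[str]]) -> Set[str]:
    concepts = extract(doc)      # 1. Concept Extraction: candidate annotations
    scores = activate(concepts)  # 2. Concept Activation: score per concept
    return select(scores)        # 3. Annotation Selection: chosen annotations

def f_measure(selected: Set[str], truth: Set[str]) -> float:
    # 4. Evaluation: compare selected annotations with the ground truth
    if not selected or not truth:
        return 0.0
    p = len(selected & truth) / len(selected)
    r = len(selected & truth) / len(truth)
    return 2 * p * r / (p + r) if p + r else 0.0
```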


Research Questions
• Research questions addressed with the experiment framework:

(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs best?


Concept Extraction [1/2]
• Entity
  – Extract entities from documents using a domain-specific knowledge base
  – Domain-specific knowledge base:
    • Entities (subjects) in a specific domain (e.g., medicine)
    • One or more labels for each entity
    • Relationships between entities
  – Detect entities by string matching against entity labels
• Tri-gram
  – Extract contiguous sequences of one, two, and three words in a document
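A minimal sketch of the Tri-gram extractor, assuming simple whitespace tokenization (the slides do not specify the tokenizer):

```python
def ngrams_up_to_three(text: str) -> list[str]:
    """Contiguous sequences of one, two, and three words."""
    tokens = text.lower().split()
    grams = []
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# ngrams_up_to_three("the financial crisis") ->
# ['the', 'financial', 'crisis', 'the financial', 'financial crisis',
#  'the financial crisis']
```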


Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
  – Unsupervised method for extracting keywords
  – Incorporates the cooccurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
  – Unsupervised topic modeling method for inferring the latent topics in a document corpus
  – Topic model:
    • Topic: a probability distribution over words
    • Document: a probability distribution over topics
  – Treats each topic as a concept
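As a hedged illustration of the LDA variant, here is how latent topics can be inferred with gensim (our library choice; the slides do not name an implementation) and each topic treated as a concept whose activation is its probability in the document:

```python
from gensim import corpora, models

# Toy corpus: each document is a list of (preprocessed) tokens.
texts = [["bank", "interest", "rate"], ["tax", "bank", "crisis"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Concept scores for the first document: (topic_id, probability) pairs.
doc_topics = lda.get_document_topics(corpus[0])
```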


Concept Activation [1/6]
• Three types of concept activation methods
  – Statistical methods
    • Baseline
    • Use only directly mentioned concepts
  – Hierarchy-based methods
    • Reveal concepts that are not mentioned explicitly, using a hierarchical knowledge base
  – Graph-based methods
    • Use only directly mentioned concepts
    • Represent concept cooccurrences as a graph

[Figure: the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" represented as a cooccurrence graph over the nodes Tax, Bank, Interest Rate, Financial Crisis, Central Bank]


Concept Activation [2/6]
• Statistical methods
  – Frequency

    $score_{freq}(c, d) = freq(c, d)$

    • $freq(c, d)$ depends on the Concept Extraction method:
      – The number of appearances (Entity and Tri-gram)
      – The score output by RAKE (RAKE)
      – The probability of a topic for a document (LDA)
  – CF-IDF [Goossen et al. 11]
    • An extension of TF-IDF replacing words with concepts
    • Lower scores for concepts that appear in many documents

    $score_{cfidf}(c, d) = cf(c, d) \cdot \log \frac{|D|}{|\{d \in D : c \in d\}|}$
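A minimal CF-IDF sketch that mirrors the formula above (helper names and the concept-list representation are ours):

```python
import math
from collections import Counter

def cf_idf(concept: str, doc_concepts: list[str],
           all_docs: list[list[str]]) -> float:
    """cf(c, d) weighted by the inverse document frequency of c in D."""
    cf = Counter(doc_concepts)[concept]
    df = sum(1 for d in all_docs if concept in d)
    if cf == 0 or df == 0:
        return 0.0
    return cf * math.log(len(all_docs) / df)
```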


Concept Activation [3/6]
• Hierarchy-based methods [1/2]
  – Base Activation

    $score_{base}(c, d) = freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} score_{base}(c_i, d)$

    • $C_l(c)$: the set of child concepts of a concept $c$
    • $\lambda$: decay parameter

[Figure: example concept hierarchy, with levels labeled $c_1$, $c_2$, $c_3$: World Wide Web at the top; Web Searching and Web Mining below; Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis at the leaves]
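A direct sketch of Base Activation, assuming the concept hierarchy is acyclic (the freq and children inputs and the decay value are illustrative assumptions):

```python
def score_base(c: str, freq: dict[str, int],
               children: dict[str, list[str]], lam: float) -> float:
    """freq(c, d) plus the decayed, recursively activated child scores."""
    return freq.get(c, 0) + lam * sum(
        score_base(ci, freq, children, lam) for ci in children.get(c, []))
```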


Concept Activation [4/6]
• Hierarchy-based methods [2/2]
  – Branch Activation

    $score_{branch}(c, d) = freq(c, d) + \lambda \cdot B_N \cdot \sum_{c_i \in C_l(c)} score_{branch}(c_i, d)$

    • $B_N$: reciprocal of the number of concepts located one level above a concept
  – OneHop Activation

    $score_{onehop}(c, d) = \begin{cases} freq(c, d) & \text{if } |C_l(c) \cap C_d| \geq 2 \\ freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} freq(c_i, d) & \text{otherwise} \end{cases}$

    • $C_d$: the set of concepts in a document $d$
    • Activates concepts within a maximum distance of one hop


Concept Activation [5/6]
• Graph-based methods [1/2]
  – Degree [Zouaq et al. 12]

    $score_{degree}(c, d) = degree(c, d)$

    • $degree(c, d)$: the number of edges linked with a concept $c$ in the cooccurrence graph of document $d$
  – HITS [Kleinberg 99; Zouaq et al. 12]
    • Link analysis algorithm for search engines [Kleinberg 99]

    $score_{hits}(c, d) = hub(c, d) + auth(c, d)$

[Figure: the example cooccurrence graph over Tax, Bank, Interest Rate, Financial Crisis, Central Bank, illustrating node degrees]


Concept Activation [6/6]
• Graph-based methods [2/2]
  – PageRank [Page et al. 99; Mihalcea & Tarau 04]
    • Link analysis algorithm for search engines
    • Based on the intuition that a node linked from many important nodes is itself more important

    $score_{page}(c, d) = (1 - \mu) + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{score_{page}(c_i, d)}{|C_{out}(c_i)|}$

    • $C_{in}(c)$: the set of concepts with incoming edges to $c$
    • $C_{out}(c)$: the set of concepts with outgoing edges from $c$
    • $\mu$: damping factor
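A hedged sketch of the three graph-based activations using networkx (our library choice; the edge list below is illustrative, built from cooccurrences of the example concepts):

```python
import networkx as nx

# Cooccurrence graph over the example concepts.
G = nx.Graph()
G.add_edges_from([("Bank", "Interest Rate"), ("Bank", "Financial Crisis"),
                  ("Bank", "Central Bank"), ("Tax", "Interest Rate"),
                  ("Interest Rate", "Financial Crisis")])

degree_scores = dict(G.degree())              # Degree activation
hubs, authorities = nx.hits(G)                # HITS: hub and authority scores
hits_scores = {c: hubs[c] + authorities[c] for c in G}
pagerank_scores = nx.pagerank(G, alpha=0.85)  # PageRank; alpha plays mu's role
```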


Annotation Selection
• Top-5 and Top-10
  – Select the concepts whose scores are ranked in the top k
• k Nearest Neighbors (kNN) [Huang et al. 11]
  – Based on the assumption that documents with similar concepts share similar annotations
  1. Compute similarity scores between a target document and all documents that already have annotations
  2. Select the union of the annotations of the k nearest documents

[Figure: kNN example. The target document is compared with four annotated documents (similarities 0.60, 0.49, 0.45, 0.42); the two most similar carry the annotations Marketing; Competition law and Finance; China. Selected annotations: Finance; China; Marketing; Competition law]
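A minimal kNN selection sketch following the two steps above (the vector and corpus representations are our assumptions):

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_select(target: dict[str, float],
               corpus: list[tuple[dict[str, float], set[str]]],
               k: int) -> set[str]:
    """Union of the annotations of the k most similar documents."""
    ranked = sorted(corpus, key=lambda dv: cosine(target, dv[0]), reverse=True)
    selected: set[str] = set()
    for _, annotations in ranked[:k]:
        selected |= annotations
    return selected
```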


Configurations [1/5]

[Diagram: the strategy space. Concept Extraction: Entity, Tri-gram, RAKE, LDA. Concept Activation: statistical methods (2), hierarchy-based methods (3), graph-based methods (3). Annotation Selection: Top-k (2 methods), kNN (1 method)]


Configurations [2/5]

24 strategies: Entity × all 8 concept activation methods × all 3 annotation selection methods


Configurations [3/5]

15 strategies: Tri-gram × 5 concept activation methods (statistical and graph-based) × all 3 annotation selection methods


Configurations [4/5]

3 strategies: RAKE × Frequency × all 3 annotation selection methods


Configurations [5/5]

1 strategy: LDA × Frequency × kNN

43 strategies in total


Datasets and Metrics of Experiments

                     Economics       Political Science    Computer Science
publication          ZBW             FIV                  SemEval 2010
# of publications    62,924          28,324               244
# of annotations     5.26 (± 1.84)   12.00 (± 4.02)       5.05 (± 2.41)
knowledge base       STW             European Thesaurus   ACM CCS
# of entities        6,335           7,912                2,299
# of labels          11,679          8,421                9,086

• Computer Science: SemEval 2010 dataset [Kim et al. 10]
  – Publications are originally annotated with keywords
  – We converted the keywords to entities by string matching
• All publications and entity labels are in English
• We use the full texts of the publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure


(I) Best Performing Strategies
• Economics and Political Science datasets
  – Best strategy: Entity × HITS × kNN
  – F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
  – Best strategy: Entity × Degree × kNN
  – F-measure: 0.33 (computer science)
• The graph-based methods do not differ much from each other

In general, the document annotation strategy Entity × graph-based method × kNN performs best


(II) Influence of Concept Extraction
• Concept Extraction method: Entity
  – Uses domain-specific knowledge bases
  – Knowledge bases: freely available and of high quality
  – 32 thesauri listed in the W3C SKOS Datasets

Among the Concept Extraction methods, Entity consistently outperforms Tri-gram, RAKE, and LDA


(III) Influence of Concept Activation
• Poor performance of the hierarchy-based methods
  – We use full texts in the experiments
    • Full texts contain so many different concepts (203.80 unique entities, SD: 24.50) that additional concepts do not need to be activated
  – However, OneHop can work as well as the graph-based methods
    • It activates only concepts within a one-hop distance

Among the Concept Activation methods, graph-based methods are better than statistical or hierarchy-based methods


(IV) Influence of Annotation Selection
• kNN
  – No learning process
  – Confirms the assumption that documents with similar concepts share similar annotations

Among the Annotation Selection methods, kNN enhances the performance


Conclusion
• Large-scale experiment on automated semantic document annotation for scientific publications
• Best strategy: Entity × graph-based method × kNN
  – A novel combination of methods
• Best concept extraction method: Entity
• Best concept activation methods: graph-based methods
  – OneHop achieves similar performance and requires less computation


Thank you! Questions?


Appendix




LDA (Latent Dirichlet Allocation)

[Figure: LDA overview. Source: D. M. Blei. Probabilistic topic models, CACM, 2012.]


Entity Extraction and Conversion
• Entity extraction
  – String matching against entity labels
  – Starting with the longer entity labels
    • e.g., from the text "financial crisis is …", only the entity "financial crisis" is detected (not "crisis")
• Converting to entities
  – Tri-gram and RAKE extract words and keywords
  – These are converted to entities by string matching against entity labels before annotation selection
  – If no matching entity label is found, the word or keyword is discarded
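A sketch of the longest-match-first detection, assuming a labels-to-entity mapping and whitespace tokenization (both are our simplifications):

```python
def extract_entities(text: str, labels: dict[str, str]) -> list[str]:
    """Match longer label spans first, so 'financial crisis' beats 'crisis'."""
    tokens = text.lower().split()
    max_len = max((len(l.split()) for l in labels), default=1)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in labels:
                found.append(labels[span])
                i += n
                break
        else:
            i += 1
    return found
```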


kNN [1/2]
• Similarity measure
  – Each document is represented as a vector where each element is the score of a concept
  – Cosine similarity is used as the similarity measure

Example vectors over the concepts GDP, Immigration, Population, Bank, Interest rate, Canada:
  $d_1$ = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5)
  $d_2$ = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2)
Compute the cosine similarity between $d_1$ and $d_2$.
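Worked out for the two example vectors (our arithmetic; the slide only poses the computation):

$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{0.68}{\sqrt{1.24} \cdot \sqrt{1.36}} \approx 0.52$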


kNN [2/2]
• k = 1: selected annotations: Marketing; Competition law (from the most similar document, similarity 0.60)
• k = 2: selected annotations: Marketing; Competition law; Finance; China (union over the two most similar documents, similarities 0.60 and 0.49)

[Figure: the same four annotated documents as before (Central bank/Law/Financial crisis; Finance/China; Human resource/Leadership; Marketing/Competition law) with similarities 0.60, 0.49, 0.45, 0.42 to the target document]


Evaluation Metrics
• Precision

  $precision = \frac{|\{relevant\ annotations\} \cap \{retrieved\ annotations\}|}{|\{retrieved\ annotations\}|}$

• Recall

  $recall = \frac{|\{relevant\ annotations\} \cap \{retrieved\ annotations\}|}{|\{relevant\ annotations\}|}$

• F-measure

  $F = \frac{2 \cdot precision \cdot recall}{precision + recall}$


Datasets
• Economics dataset: 11 GB
• Political science dataset: 3.8 GB


Experiments
• Preprocessing documents
  – Lemmatization
  – Stop-word removal
• 10-fold cross validation
  – Split a dataset into 10 equally sized subsets
  – 8 subsets for training
  – 1 subset for testing
  – 1 subset for parameter optimization
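A minimal sketch of this 8/1/1 split protocol (the shuffling and seeding details are our assumptions):

```python
import random

def ten_fold_splits(docs: list, seed: int = 0):
    """Yield (train, dev, test): 8 subsets train, 1 tunes, 1 tests."""
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    folds = [docs[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        dev = folds[(i + 1) % 10]
        train = [d for j, f in enumerate(folds)
                 if j not in (i, (i + 1) % 10) for d in f]
        yield train, dev, test
```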


Result Table: Entity [1/2]

Economics (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .14 (.17) .14 (.15) .13 (.15)  .22 (.20) .11 (.10) .14 (.12)  .08 (.21) .08 (.21) .08 (.21)
CF-IDF       .19 (.19) .18 (.17) .18 (.16)  .24 (.21) .12 (.10) .15 (.12)  .29 (.32) .30 (.32) .29 (.31)
Base Act.    .10 (.14) .09 (.13) .09 (.13)  .18 (.19) .09 (.09) .12 (.11)  .20 (.30) .20 (.30) .20 (.29)
Branch Act.  .08 (.14) .08 (.12) .08 (.12)  .17 (.19) .08 (.09) .11 (.11)  .17 (.28) .17 (.28) .17 (.27)
OneHop       .12 (.16) .12 (.14) .12 (.14)  .19 (.19) .09 (.09) .12 (.11)  .35 (.34) .36 (.34) .35 (.33)
Degree       .15 (.17) .14 (.15) .14 (.15)  .23 (.20) .11 (.09) .14 (.12)  .39 (.33) .40 (.33) .38 (.32)
HITS         .14 (.17) .14 (.15) .14 (.15)  .23 (.20) .11 (.10) .14 (.12)  .40 (.32) .40 (.32) .39 (.31)
PageRank     .14 (.17) .14 (.15) .14 (.15)  .22 (.20) .11 (.09) .14 (.12)  .39 (.33) .40 (.33) .38 (.32)

Political Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .12 (.11) .18 (.16) .14 (.12)  .15 (.13) .12 (.10) .13 (.10)  .14 (.17) .05 (.07) .07 (.09)
CF-IDF       .05 (.07) .12 (.16) .07 (.10)  .07 (.09) .08 (.10) .07 (.09)  .24 (.22) .14 (.14) .17 (.16)
Base Act.    .05 (.08) .10 (.13) .07 (.09)  .10 (.10) .10 (.09) .09 (.09)  .14 (.19) .07 (.10) .09 (.12)
Branch Act.  .04 (.07) .08 (.12) .05 (.08)  .09 (.09) .09 (.09) .08 (.09)  .12 (.17) .06 (.10) .08 (.11)
OneHop       .10 (.09) .21 (.17) .13 (.11)  .13 (.11) .14 (.11) .13 (.10)  .27 (.21) .26 (.21) .25 (.19)
Degree       .10 (.09) .21 (.17) .13 (.11)  .13 (.11) .14 (.11) .13 (.10)  .29 (.21) .28 (.21) .27 (.19)
HITS         .10 (.09) .21 (.17) .13 (.11)  .13 (.11) .14 (.11) .13 (.10)  .30 (.22) .29 (.21) .28 (.20)
PageRank     .10 (.09) .20 (.17) .13 (.11)  .13 (.10) .14 (.11) .13 (.10)  .29 (.22) .29 (.21) .27 (.20)


Result Table: Entity [2/2]

Computer Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .18 (.21) .14 (.15) .15 (.16)  .22 (.22) .08 (.08) .12 (.11)  .49 (.28) .24 (.16) .30 (.17)
CF-IDF       .02 (.08) .02 (.06) .02 (.06)  .03 (.11) .01 (.04) .02 (.05)  .47 (.29) .23 (.17) .29 (.18)
Base Act.    .17 (.20) .13 (.14) .14 (.15)  .22 (.22) .08 (.08) .12 (.11)  .49 (.28) .22 (.15) .29 (.17)
Branch Act.  .17 (.20) .12 (.14) .14 (.15)  .21 (.22) .08 (.08) .11 (.11)  .50 (.28) .22 (.15) .29 (.17)
OneHop       .17 (.20) .13 (.14) .14 (.15)  .21 (.22) .08 (.08) .11 (.11)  .42 (.30) .25 (.21) .29 (.20)
Degree       .17 (.21) .13 (.15) .14 (.16)  .22 (.22) .08 (.08) .12 (.11)  .49 (.28) .27 (.17) .33 (.18)
HITS         .18 (.21) .14 (.15) .15 (.16)  .21 (.22) .08 (.08) .11 (.11)  .48 (.31) .27 (.18) .32 (.20)
PageRank     .17 (.21) .13 (.15) .14 (.16)  .22 (.22) .08 (.08) .12 (.11)  .50 (.29) .25 (.15) .31 (.18)


Result Table: Tri-gram

Economics (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .12 (.15) .12 (.14) .11 (.14)  .19 (.19) .10 (.10) .13 (.12)  .08 (.22) .08 (.22) .08 (.21)
CF-IDF       .10 (.12) .10 (.12) .09 (.11)  .17 (.17) .08 (.10) .12 (.12)  .07 (.20) .06 (.22) .06 (.20)
Degree       .03 (.09) .03 (.08) .03 (.08)  .03 (.09) .03 (.08) .03 (.08)  .07 (.21) .07 (.21) .07 (.20)
HITS         .02 (.06) .02 (.06) .02 (.06)  .02 (.06) .02 (.06) .02 (.06)  .08 (.22) .08 (.22) .07 (.21)
PageRank     .03 (.09) .03 (.08) .03 (.08)  .03 (.09) .03 (.08) .03 (.08)  .10 (.20) .04 (.08) .05 (.11)

Political Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .06 (.08) .14 (.16) .08 (.10)  .10 (.10) .11 (.11) .10 (.09)  .08 (.14) .05 (.08) .06 (.09)
CF-IDF       .05 (.05) .06 (.07) .05 (.06)  .09 (.10) .09 (.10) .08 (.09)  .09 (.15) .04 (.08) .06 (.10)
Degree       .01 (.03) .03 (.07) .01 (.04)  .01 (.03) .03 (.07) .01 (.04)  .11 (.14) .03 (.05) .05 (.07)
HITS         .01 (.03) .02 (.06) .01 (.03)  .01 (.03) .00 (.06) .01 (.03)  .12 (.14) .04 (.06) .06 (.08)
PageRank     .01 (.04) .03 (.08) .02 (.05)  .01 (.04) .03 (.08) .02 (.05)  .08 (.12) .03 (.05) .04 (.06)

Computer Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .26 (.24) .20 (.18) .22 (.19)  .54 (.30) .20 (.13) .29 (.17)  .44 (.28) .25 (.18) .30 (.19)
CF-IDF       .23 (.24) .18 (.18) .19 (.19)  .54 (.29) .22 (.14) .30 (.17)  .48 (.28) .20 (.14) .26 (.15)
Degree       .09 (.15) .07 (.11) .07 (.12)  .13 (.19) .05 (.07) .07 (.09)  .48 (.29) .23 (.16) .29 (.18)
HITS         .05 (.14) .04 (.09) .04 (.10)  .11 (.18) .04 (.06) .06 (.09)  .39 (.29) .26 (.21) .28 (.19)
PageRank     .02 (.06) .02 (.05) .02 (.06)  .03 (.08) .01 (.03) .02 (.05)  .46 (.29) .25 (.18) .30 (.18)


Result Table: RAKE

(mean (SD); each block: Recall, Precision, F)
                   top-5                          top-10                         kNN
Economics
Frequency          .08 (.14) .08 (.12) .08 (.12)  .15 (.18) .07 (.08) .10 (.11)  .34 (.33) .34 (.33) .33 (.32)
Political Science
Frequency          .04 (.07) .08 (.13) .05 (.08)  .07 (.09) .08 (.09) .07 (.08)  .31 (.23) .18 (.15) .22 (.17)
Computer Science
Frequency          .24 (.24) .17 (.16) .19 (.17)  .42 (.28) .15 (.10) .22 (.14)  .42 (.27) .20 (.13) .25 (.15)


Result Table: LDA

(mean (SD); kNN only: Recall, Precision, F)
Economics
Frequency          .19 (.30) .19 (.30) .19 (.30)
Political Science
Frequency          .15 (.19) .15 (.18) .14 (.17)
Computer Science
Frequency          .28 (.27) .24 (.23) .24 (.22)


Materials
• Code
  – https://github.com/ggb/ShortStories
• Datasets
  – Economics and political science
    • Not publicly available yet
    • Contact us directly if you are interested
  – Computer science
    • Publicly available


Presentation
• K-CAP 2015
  – International Conference on Knowledge Capture
  – Scope:
    • Knowledge Acquisition / Capture
    • Knowledge Extraction from Text
    • Semantic Web
    • Knowledge Engineering and Modelling
    • …
• Time slot
  – Presentation: 25 minutes
  – Q & A: 5 minutes


References
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models. CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender. WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Große-Bölting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases. ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles. JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base. ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. International Workshop on Semantic Evaluation, 2010.
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
• [Mihalcea & Tarau 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept detection. ESWC, 2012.
