A Comparison of Different Strategies for Automated Semantic Document Annotation



Gregor Große-Bölting, Chifumi Nishioka, Ansgar Scherp



Motivation [1/2]
• Document annotation
  – Helps users and search engines find documents
  – Requires a huge amount of human effort
  – e.g., subject indexers at ZBW have labeled 1.6 million scientific documents in economics
• Semantic document annotation
  – Documents annotated with semantic entities
  – e.g., PubMed with MeSH, ACM DL with ACM CCS

Focus on semantic document annotation
Necessity of automated document annotation


Motivation [2/2]
• Experiments so far have been small scale
  – Comparing only a small number of strategies
  – Datasets containing a few hundred documents
• We compare 43 strategies for document annotation within the developed experiment framework
  – The largest number of strategies to date
• Experiments with three datasets from different domains
  – Contain the full texts of 100,000 documents annotated by subject indexers
  – The largest dataset of scientific publications

We conducted the largest-scale experiment


Experiment Framework

Strategies are composed of methods from concept extraction, concept activation, and annotation selection.

1. Concept Extraction: detect concepts (candidate annotations) in each document
2. Concept Activation: compute a score for each concept of a document
3. Annotation Selection: select annotations from the concepts of each document
4. Evaluation: measure the performance of strategies against the ground truth
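Below is a minimal sketch of how the four stages compose into a strategy. The function names and types are illustrative assumptions, not the authors' code; only the stage order comes from the slides.

```python
# Hypothetical glue code for the four-stage framework: any concept
# extraction, activation, and selection method can be plugged in.
from typing import Callable, Dict, List, Set

def run_strategy(doc: str,
                 extract: Callable[[str], List[str]],
                 activate: Callable[[List[str]], Dict[str, float]],
                 select: Callable[[Dict[str, float]], Set[str]]) -> Set[str]:
    concepts = extract(doc)      # 1. Concept Extraction: candidate annotations
    scores = activate(concepts)  # 2. Concept Activation: score per concept
    return select(scores)        # 3. Annotation Selection: chosen annotations

def f_measure(selected: Set[str], truth: Set[str]) -> float:
    # 4. Evaluation: compare selected annotations with the ground truth
    if not selected or not truth:
        return 0.0
    p = len(selected & truth) / len(selected)
    r = len(selected & truth) / len(truth)
    return 2 * p * r / (p + r) if p + r else 0.0
```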


Research Questions
• Research questions addressed with the experiment framework:

(I) Which strategy performs best?
(II) Which concept extraction method performs best?
(III) Which concept activation method performs best?
(IV) Which annotation selection method performs best?


Concept Extraction [1/2]
• Entity
  – Extract entities from documents using a domain-specific knowledge base
  – Domain-specific knowledge base:
    • Entities (subjects) in a specific domain (e.g., medicine)
    • One or more labels for each entity
    • Relationships between entities
  – Detect entities by string matching against entity labels
• Tri-gram
  – Extract contiguous sequences of one, two, and three words in a document
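A minimal sketch of the Tri-gram extractor, assuming simple whitespace tokenization (the slides do not specify the tokenizer):

```python
def ngrams_up_to_three(text: str) -> list[str]:
    """Contiguous sequences of one, two, and three words."""
    tokens = text.lower().split()
    grams = []
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# ngrams_up_to_three("the financial crisis") ->
# ['the', 'financial', 'crisis', 'the financial', 'financial crisis',
#  'the financial crisis']
```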


Concept Extraction [2/2]
• RAKE (Rapid Automatic Keyword Extraction) [Rose et al. 10]
  – Unsupervised method for extracting keywords
  – Incorporates the cooccurrence and frequency of words
• LDA (Latent Dirichlet Allocation) [Blei et al. 03]
  – Unsupervised topic modeling method for inferring the latent topics in a document corpus
  – Topic model:
    • Topic: a probability distribution over words
    • Document: a probability distribution over topics
  – Treats each topic as a concept
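As a hedged illustration of the LDA variant, here is how latent topics can be inferred with gensim (our library choice; the slides do not name an implementation) and each topic treated as a concept whose activation is its probability in the document:

```python
from gensim import corpora, models

# Toy corpus: each document is a list of (preprocessed) tokens.
texts = [["bank", "interest", "rate"], ["tax", "bank", "crisis"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Concept scores for the first document: (topic_id, probability) pairs.
doc_topics = lda.get_document_topics(corpus[0])
```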


Concept Activation [1/6]
• Three types of concept activation methods
  – Statistical methods
    • Baseline
    • Use only directly mentioned concepts
  – Hierarchy-based methods
    • Reveal concepts that are not mentioned explicitly, using a hierarchical knowledge base
  – Graph-based methods
    • Use only directly mentioned concepts
    • Represent concept cooccurrences as a graph

[Figure: the concept sequence "Bank, Interest Rate, Financial Crisis, Bank, Central Bank, Tax, Interest Rate" represented as a cooccurrence graph over the nodes Tax, Bank, Interest Rate, Financial Crisis, Central Bank]


Concept Activation [2/6]
• Statistical methods
  – Frequency

    $score_{freq}(c, d) = freq(c, d)$

    • $freq(c, d)$ depends on the Concept Extraction method:
      – The number of appearances (Entity and Tri-gram)
      – The score output by RAKE (RAKE)
      – The probability of a topic for a document (LDA)
  – CF-IDF [Goossen et al. 11]
    • An extension of TF-IDF replacing words with concepts
    • Lower scores for concepts that appear in many documents

    $score_{cfidf}(c, d) = cf(c, d) \cdot \log \frac{|D|}{|\{d \in D : c \in d\}|}$
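A minimal CF-IDF sketch that mirrors the formula above (helper names and the concept-list representation are ours):

```python
import math
from collections import Counter

def cf_idf(concept: str, doc_concepts: list[str],
           all_docs: list[list[str]]) -> float:
    """cf(c, d) weighted by the inverse document frequency of c in D."""
    cf = Counter(doc_concepts)[concept]
    df = sum(1 for d in all_docs if concept in d)
    if cf == 0 or df == 0:
        return 0.0
    return cf * math.log(len(all_docs) / df)
```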


Concept Activation [3/6]
• Hierarchy-based methods [1/2]
  – Base Activation

    $score_{base}(c, d) = freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} score_{base}(c_i, d)$

    • $C_l(c)$: the set of child concepts of a concept $c$
    • $\lambda$: decay parameter

[Figure: example concept hierarchy, with levels labeled $c_1$, $c_2$, $c_3$: World Wide Web at the top; Web Searching and Web Mining below; Social Recommendation, Social Tagging, Site Wrapping, and Web Log Analysis at the leaves]
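A direct sketch of Base Activation, assuming the concept hierarchy is acyclic (the freq and children inputs and the decay value are illustrative assumptions):

```python
def score_base(c: str, freq: dict[str, int],
               children: dict[str, list[str]], lam: float) -> float:
    """freq(c, d) plus the decayed, recursively activated child scores."""
    return freq.get(c, 0) + lam * sum(
        score_base(ci, freq, children, lam) for ci in children.get(c, []))
```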


Concept Activation [4/6]
• Hierarchy-based methods [2/2]
  – Branch Activation

    $score_{branch}(c, d) = freq(c, d) + \lambda \cdot B_N \cdot \sum_{c_i \in C_l(c)} score_{branch}(c_i, d)$

    • $B_N$: reciprocal of the number of concepts located one level above a concept
  – OneHop Activation

    $score_{onehop}(c, d) = \begin{cases} freq(c, d) & \text{if } |C_l(c) \cap C_d| \geq 2 \\ freq(c, d) + \lambda \cdot \sum_{c_i \in C_l(c)} freq(c_i, d) & \text{otherwise} \end{cases}$

    • $C_d$: the set of concepts in a document $d$
    • Activates concepts within a maximum distance of one hop


Concept Activation [5/6]
• Graph-based methods [1/2]
  – Degree [Zouaq et al. 12]

    $score_{degree}(c, d) = degree(c, d)$

    • $degree(c, d)$: the number of edges linked with a concept $c$ in the cooccurrence graph of document $d$
  – HITS [Kleinberg 99; Zouaq et al. 12]
    • Link analysis algorithm for search engines [Kleinberg 99]

    $score_{hits}(c, d) = hub(c, d) + auth(c, d)$

[Figure: the example cooccurrence graph over Tax, Bank, Interest Rate, Financial Crisis, Central Bank, illustrating node degrees]


Concept Activation [6/6]
• Graph-based methods [2/2]
  – PageRank [Page et al. 99; Mihalcea & Tarau 04]
    • Link analysis algorithm for search engines
    • Based on the intuition that a node linked from many important nodes is itself more important

    $score_{page}(c, d) = (1 - \mu) + \mu \cdot \sum_{c_i \in C_{in}(c)} \frac{score_{page}(c_i, d)}{|C_{out}(c_i)|}$

    • $C_{in}(c)$: the set of concepts with incoming edges to $c$
    • $C_{out}(c)$: the set of concepts with outgoing edges from $c$
    • $\mu$: damping factor
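A hedged sketch of the three graph-based activations using networkx (our library choice; the edge list below is illustrative, built from cooccurrences of the example concepts):

```python
import networkx as nx

# Cooccurrence graph over the example concepts.
G = nx.Graph()
G.add_edges_from([("Bank", "Interest Rate"), ("Bank", "Financial Crisis"),
                  ("Bank", "Central Bank"), ("Tax", "Interest Rate"),
                  ("Interest Rate", "Financial Crisis")])

degree_scores = dict(G.degree())              # Degree activation
hubs, authorities = nx.hits(G)                # HITS: hub and authority scores
hits_scores = {c: hubs[c] + authorities[c] for c in G}
pagerank_scores = nx.pagerank(G, alpha=0.85)  # PageRank; alpha plays mu's role
```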


Annotation Selection
• Top-5 and Top-10
  – Select the concepts whose scores are ranked in the top k
• k Nearest Neighbors (kNN) [Huang et al. 11]
  – Based on the assumption that documents with similar concepts share similar annotations
  1. Compute similarity scores between a target document and all documents that already have annotations
  2. Select the union of the annotations of the k nearest documents

[Figure: kNN example. The target document is compared with four annotated documents (similarities 0.60, 0.49, 0.45, 0.42); the two most similar carry the annotations Marketing; Competition law and Finance; China. Selected annotations: Finance; China; Marketing; Competition law]
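A minimal kNN selection sketch following the two steps above (the vector and corpus representations are our assumptions):

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_select(target: dict[str, float],
               corpus: list[tuple[dict[str, float], set[str]]],
               k: int) -> set[str]:
    """Union of the annotations of the k most similar documents."""
    ranked = sorted(corpus, key=lambda dv: cosine(target, dv[0]), reverse=True)
    selected: set[str] = set()
    for _, annotations in ranked[:k]:
        selected |= annotations
    return selected
```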


Configurations [1/5]

[Diagram: the strategy space. Concept Extraction: Entity, Tri-gram, RAKE, LDA. Concept Activation: statistical methods (2), hierarchy-based methods (3), graph-based methods (3). Annotation Selection: Top-k (2 methods), kNN (1 method)]


Configurations [2/5]

24 strategies: Entity × all 8 concept activation methods × all 3 annotation selection methods


Configurations [3/5]

15 strategies: Tri-gram × 5 concept activation methods (statistical and graph-based) × all 3 annotation selection methods


Configurations [4/5]

3 strategies: RAKE × Frequency × all 3 annotation selection methods


Configurations [5/5]

1 strategy: LDA × Frequency × kNN

43 strategies in total


Datasets and Metrics of Experiments

                     Economics       Political Science    Computer Science
publication          ZBW             FIV                  SemEval 2010
# of publications    62,924          28,324               244
# of annotations     5.26 (± 1.84)   12.00 (± 4.02)       5.05 (± 2.41)
knowledge base       STW             European Thesaurus   ACM CCS
# of entities        6,335           7,912                2,299
# of labels          11,679          8,421                9,086

• Computer Science: SemEval 2010 dataset [Kim et al. 10]
  – Publications are originally annotated with keywords
  – We converted the keywords to entities by string matching
• All publications and entity labels are in English
• We use the full texts of the publications
• All annotations are used as ground truth
• Evaluation metrics: Precision, Recall, F-measure


(I) Best Performing Strategies
• Economics and Political Science datasets
  – Best strategy: Entity × HITS × kNN
  – F-measure: 0.39 (economics), 0.28 (political science)
• Computer Science dataset
  – Best strategy: Entity × Degree × kNN
  – F-measure: 0.33 (computer science)
• The graph-based methods do not differ much from each other

In general, the document annotation strategy Entity × graph-based method × kNN performs best


(II) Influence of Concept Extraction
• Concept Extraction method: Entity
  – Uses domain-specific knowledge bases
  – Knowledge bases: freely available and of high quality
  – 32 thesauri listed in the W3C SKOS Datasets

Among the Concept Extraction methods, Entity consistently outperforms Tri-gram, RAKE, and LDA


(III) Influence of Concept Activation
• Poor performance of the hierarchy-based methods
  – We use full texts in the experiments
    • Full texts contain so many different concepts (203.80 unique entities, SD: 24.50) that additional concepts do not need to be activated
  – However, OneHop can work as well as the graph-based methods
    • It activates only concepts within a one-hop distance

Among the Concept Activation methods, graph-based methods are better than statistical or hierarchy-based methods


(IV) Influence of Annotation Selection
• kNN
  – No learning process
  – Confirms the assumption that documents with similar concepts share similar annotations

Among the Annotation Selection methods, kNN enhances the performance


Conclusion
• Large-scale experiment on automated semantic document annotation for scientific publications
• Best strategy: Entity × graph-based method × kNN
  – A novel combination of methods
• Best concept extraction method: Entity
• Best concept activation methods: graph-based methods
  – OneHop achieves similar performance and requires less computation


Thank you! Questions?


Appendix




LDA (Latent Dirichlet Allocation)

[Figure: LDA overview. Source: D. M. Blei. Probabilistic topic models, CACM, 2012.]


Entity Extraction and Conversion
• Entity extraction
  – String matching against entity labels
  – Starting with the longer entity labels
    • e.g., from the text "financial crisis is …", only the entity "financial crisis" is detected (not "crisis")
• Converting to entities
  – Tri-gram and RAKE extract words and keywords
  – These are converted to entities by string matching against entity labels before annotation selection
  – If no matching entity label is found, the word or keyword is discarded
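A sketch of the longest-match-first detection, assuming a labels-to-entity mapping and whitespace tokenization (both are our simplifications):

```python
def extract_entities(text: str, labels: dict[str, str]) -> list[str]:
    """Match longer label spans first, so 'financial crisis' beats 'crisis'."""
    tokens = text.lower().split()
    max_len = max((len(l.split()) for l in labels), default=1)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in labels:
                found.append(labels[span])
                i += n
                break
        else:
            i += 1
    return found
```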


kNN [1/2]
• Similarity measure
  – Each document is represented as a vector where each element is the score of a concept
  – Cosine similarity is used as the similarity measure

Example vectors over the concepts GDP, Immigration, Population, Bank, Interest rate, Canada:
  $d_1$ = (0.3, 0.5, 0.8, 0.1, 0.0, 0.5)
  $d_2$ = (0.6, 0.0, 0.4, 0.8, 0.4, 0.2)
Compute the cosine similarity between $d_1$ and $d_2$.
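Worked out for the two example vectors (our arithmetic; the slide only poses the computation):

$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = \frac{0.68}{\sqrt{1.24} \cdot \sqrt{1.36}} \approx 0.52$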


kNN [2/2]
• k = 1: selected annotations: Marketing; Competition law (from the most similar document, similarity 0.60)
• k = 2: selected annotations: Marketing; Competition law; Finance; China (union over the two most similar documents, similarities 0.60 and 0.49)

[Figure: the same four annotated documents as before (Central bank/Law/Financial crisis; Finance/China; Human resource/Leadership; Marketing/Competition law) with similarities 0.60, 0.49, 0.45, 0.42 to the target document]


Evaluation Metrics
• Precision

  $precision = \frac{|\{relevant\ annotations\} \cap \{retrieved\ annotations\}|}{|\{retrieved\ annotations\}|}$

• Recall

  $recall = \frac{|\{relevant\ annotations\} \cap \{retrieved\ annotations\}|}{|\{relevant\ annotations\}|}$

• F-measure

  $F = \frac{2 \cdot precision \cdot recall}{precision + recall}$


Datasets
• Economics dataset: 11 GB
• Political science dataset: 3.8 GB


Experiments
• Preprocessing documents
  – Lemmatization
  – Stop-word removal
• 10-fold cross validation
  – Split a dataset into 10 equally sized subsets
  – 8 subsets for training
  – 1 subset for testing
  – 1 subset for parameter optimization
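A minimal sketch of this 8/1/1 split protocol (the shuffling and seeding details are our assumptions):

```python
import random

def ten_fold_splits(docs: list, seed: int = 0):
    """Yield (train, dev, test): 8 subsets train, 1 tunes, 1 tests."""
    docs = docs[:]
    random.Random(seed).shuffle(docs)
    folds = [docs[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        dev = folds[(i + 1) % 10]
        train = [d for j, f in enumerate(folds)
                 if j not in (i, (i + 1) % 10) for d in f]
        yield train, dev, test
```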


Result Table: Entity [1/2]

Economics (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .14 (.17) .14 (.15) .13 (.15)  .22 (.20) .11 (.10) .14 (.12)  .08 (.21) .08 (.21) .08 (.21)
CF-IDF       .19 (.19) .18 (.17) .18 (.16)  .24 (.21) .12 (.10) .15 (.12)  .29 (.32) .30 (.32) .29 (.31)
Base Act.    .10 (.14) .09 (.13) .09 (.13)  .18 (.19) .09 (.09) .12 (.11)  .20 (.30) .20 (.30) .20 (.29)
Branch Act.  .08 (.14) .08 (.12) .08 (.12)  .17 (.19) .08 (.09) .11 (.11)  .17 (.28) .17 (.28) .17 (.27)
OneHop       .12 (.16) .12 (.14) .12 (.14)  .19 (.19) .09 (.09) .12 (.11)  .35 (.34) .36 (.34) .35 (.33)
Degree       .15 (.17) .14 (.15) .14 (.15)  .23 (.20) .11 (.09) .14 (.12)  .39 (.33) .40 (.33) .38 (.32)
HITS         .14 (.17) .14 (.15) .14 (.15)  .23 (.20) .11 (.10) .14 (.12)  .40 (.32) .40 (.32) .39 (.31)
PageRank     .14 (.17) .14 (.15) .14 (.15)  .22 (.20) .11 (.09) .14 (.12)  .39 (.33) .40 (.33) .38 (.32)

Political Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .12 (.11) .18 (.16) .14 (.12)  .15 (.13) .12 (.10) .13 (.10)  .14 (.17) .05 (.07) .07 (.09)
CF-IDF       .05 (.07) .12 (.16) .07 (.10)  .07 (.09) .08 (.10) .07 (.09)  .24 (.22) .14 (.14) .17 (.16)
Base Act.    .05 (.08) .10 (.13) .07 (.09)  .10 (.10) .10 (.09) .09 (.09)  .14 (.19) .07 (.10) .09 (.12)
Branch Act.  .04 (.07) .08 (.12) .05 (.08)  .09 (.09) .09 (.09) .08 (.09)  .12 (.17) .06 (.10) .08 (.11)
OneHop       .10 (.09) .21 (.17) .13 (.11)  .13 (.11) .14 (.11) .13 (.10)  .27 (.21) .26 (.21) .25 (.19)
Degree       .10 (.09) .21 (.17) .13 (.11)  .13 (.11) .14 (.11) .13 (.10)  .29 (.21) .28 (.21) .27 (.19)
HITS         .10 (.09) .21 (.17) .13 (.11)  .13 (.11) .14 (.11) .13 (.10)  .30 (.22) .29 (.21) .28 (.20)
PageRank     .10 (.09) .20 (.17) .13 (.11)  .13 (.10) .14 (.11) .13 (.10)  .29 (.22) .29 (.21) .27 (.20)


Result Table: Entity [2/2]

Computer Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .18 (.21) .14 (.15) .15 (.16)  .22 (.22) .08 (.08) .12 (.11)  .49 (.28) .24 (.16) .30 (.17)
CF-IDF       .02 (.08) .02 (.06) .02 (.06)  .03 (.11) .01 (.04) .02 (.05)  .47 (.29) .23 (.17) .29 (.18)
Base Act.    .17 (.20) .13 (.14) .14 (.15)  .22 (.22) .08 (.08) .12 (.11)  .49 (.28) .22 (.15) .29 (.17)
Branch Act.  .17 (.20) .12 (.14) .14 (.15)  .21 (.22) .08 (.08) .11 (.11)  .50 (.28) .22 (.15) .29 (.17)
OneHop       .17 (.20) .13 (.14) .14 (.15)  .21 (.22) .08 (.08) .11 (.11)  .42 (.30) .25 (.21) .29 (.20)
Degree       .17 (.21) .13 (.15) .14 (.16)  .22 (.22) .08 (.08) .12 (.11)  .49 (.28) .27 (.17) .33 (.18)
HITS         .18 (.21) .14 (.15) .15 (.16)  .21 (.22) .08 (.08) .11 (.11)  .48 (.31) .27 (.18) .32 (.20)
PageRank     .17 (.21) .13 (.15) .14 (.16)  .22 (.22) .08 (.08) .12 (.11)  .50 (.29) .25 (.15) .31 (.18)


Result Table: Tri-gram

Economics (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .12 (.15) .12 (.14) .11 (.14)  .19 (.19) .10 (.10) .13 (.12)  .08 (.22) .08 (.22) .08 (.21)
CF-IDF       .10 (.12) .10 (.12) .09 (.11)  .17 (.17) .08 (.10) .12 (.12)  .07 (.20) .06 (.22) .06 (.20)
Degree       .03 (.09) .03 (.08) .03 (.08)  .03 (.09) .03 (.08) .03 (.08)  .07 (.21) .07 (.21) .07 (.20)
HITS         .02 (.06) .02 (.06) .02 (.06)  .02 (.06) .02 (.06) .02 (.06)  .08 (.22) .08 (.22) .07 (.21)
PageRank     .03 (.09) .03 (.08) .03 (.08)  .03 (.09) .03 (.08) .03 (.08)  .10 (.20) .04 (.08) .05 (.11)

Political Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .06 (.08) .14 (.16) .08 (.10)  .10 (.10) .11 (.11) .10 (.09)  .08 (.14) .05 (.08) .06 (.09)
CF-IDF       .05 (.05) .06 (.07) .05 (.06)  .09 (.10) .09 (.10) .08 (.09)  .09 (.15) .04 (.08) .06 (.10)
Degree       .01 (.03) .03 (.07) .01 (.04)  .01 (.03) .03 (.07) .01 (.04)  .11 (.14) .03 (.05) .05 (.07)
HITS         .01 (.03) .02 (.06) .01 (.03)  .01 (.03) .00 (.06) .01 (.03)  .12 (.14) .04 (.06) .06 (.08)
PageRank     .01 (.04) .03 (.08) .02 (.05)  .01 (.04) .03 (.08) .02 (.05)  .08 (.12) .03 (.05) .04 (.06)

Computer Science (mean (SD); each block: Recall, Precision, F)
             top-5                          top-10                         kNN
Frequency    .26 (.24) .20 (.18) .22 (.19)  .54 (.30) .20 (.13) .29 (.17)  .44 (.28) .25 (.18) .30 (.19)
CF-IDF       .23 (.24) .18 (.18) .19 (.19)  .54 (.29) .22 (.14) .30 (.17)  .48 (.28) .20 (.14) .26 (.15)
Degree       .09 (.15) .07 (.11) .07 (.12)  .13 (.19) .05 (.07) .07 (.09)  .48 (.29) .23 (.16) .29 (.18)
HITS         .05 (.14) .04 (.09) .04 (.10)  .11 (.18) .04 (.06) .06 (.09)  .39 (.29) .26 (.21) .28 (.19)
PageRank     .02 (.06) .02 (.05) .02 (.06)  .03 (.08) .01 (.03) .02 (.05)  .46 (.29) .25 (.18) .30 (.18)


Result Table: RAKE

(mean (SD); each block: Recall, Precision, F)
                   top-5                          top-10                         kNN
Economics
Frequency          .08 (.14) .08 (.12) .08 (.12)  .15 (.18) .07 (.08) .10 (.11)  .34 (.33) .34 (.33) .33 (.32)
Political Science
Frequency          .04 (.07) .08 (.13) .05 (.08)  .07 (.09) .08 (.09) .07 (.08)  .31 (.23) .18 (.15) .22 (.17)
Computer Science
Frequency          .24 (.24) .17 (.16) .19 (.17)  .42 (.28) .15 (.10) .22 (.14)  .42 (.27) .20 (.13) .25 (.15)


Result Table: LDA

(mean (SD); kNN only: Recall, Precision, F)
Economics
Frequency          .19 (.30) .19 (.30) .19 (.30)
Political Science
Frequency          .15 (.19) .15 (.18) .14 (.17)
Computer Science
Frequency          .28 (.27) .24 (.23) .24 (.22)


Materials
• Code
  – https://github.com/ggb/ShortStories
• Datasets
  – Economics and political science
    • Not publicly available yet
    • Contact us directly if you are interested
  – Computer science
    • Publicly available


Presentation
• K-CAP 2015
  – International Conference on Knowledge Capture
  – Scope:
    • Knowledge Acquisition / Capture
    • Knowledge Extraction from Text
    • Semantic Web
    • Knowledge Engineering and Modelling
    • …
• Time slot
  – Presentation: 25 minutes
  – Q & A: 5 minutes


References
• [Blei et al. 03] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
• [Blei 12] D. M. Blei. Probabilistic topic models. CACM, 2012.
• [Goossen et al. 11] F. Goossen, W. IJntema, F. Frasincar, F. Hogenboom, and U. Kaymak. News personalization using the CF-IDF semantic recommender. WIMS, 2011.
• [Grosse-Bolting et al. 15] G. Große-Bölting, C. Nishioka, and A. Scherp. Generic process for extracting user profiles from social media using hierarchical knowledge bases. ICSC, 2015.
• [Huang et al. 11] M. Huang, A. Névéol, and Z. Lu. Recommending MeSH terms for annotating biomedical articles. JAMIA, 2011.
• [Kapanipathi et al. 14] P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User interests identification on Twitter using a hierarchical knowledge base. ESWC, 2014.
• [Kim et al. 10] S. N. Kim, O. Medelyan, M. Y. Kan, and T. Baldwin. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. International Workshop on Semantic Evaluation, 2010.
• [Kleinberg 99] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 1999.
• [Mihalcea & Tarau 04] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. EMNLP, 2004.
• [Page et al. 99] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
• [Rose et al. 10] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. Text Mining, 2010.
• [Zouaq et al. 12] A. Zouaq, D. Gašević, and M. Hatala. Voting theory for concept detection. ESWC, 2012.
