Charting the Digital Library Evaluation Domain with a Semantically Enhanced Mining Methodology


Eleni Afiontzi,1 Giannis Kazadeis,1 Leonidas Papachristopoulos,2 Michalis Sfakakis,2 Giannis Tsakonas,2 Christos Papatheodorou2

13th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2013), July 22-26, 2013, Indianapolis, IN, USA

1. Department of Informatics, Athens University of Economics & Business

2. Database & Information Systems Group, Department of Archives & Library Science, Ionian University


aim & scope of research

• To propose a methodology for discovering patterns in the scientific literature.
• Our case study is performed in the digital library evaluation domain and its conference literature.
• We question:
  - how we select relevant studies,
  - how we annotate them,
  - how we discover these patterns,
  in an effective, machine-operated way, in order to have reusable and interpretable data?


why

• Abundance of scientific information
• Limitations of existing tools, such as reusability
• Lack of contextualized analytic tools
• Supervised automated processes


panorama

1. Document classification to identify relevant papers
   - We use a corpus of 1,824 papers from the JCDL and ECDL (now TPDL) conferences, 2001-2011.
2. Semantic annotation processes to mark up important concepts
   - We use a schema for semantic annotation, the Digital Library Evaluation Ontology (DiLEO), and a semantic annotation tool, GoNTogle.
3. Clustering to form coherent groups (K=11)
4. Interpretation with the assistance of the ontology schema

• During this process we perform benchmarking tests to qualify specific components that effectively automate the exploration of the literature and the discovery of research patterns.

part 1

how we identify relevant studies


training phase

• The aim was to train a classifier to identify relevant papers.
• Categorization
  - two researchers categorized, a third one supervised
  - descriptors: title, abstract & author keywords
  - raters' agreement: 82.96% for JCDL, 78% for ECDL
  - inter-rater agreement: moderate levels of Cohen's Kappa
  - 12% positive vs. 88% negative
• Skewness of the data addressed via resampling (see the sketch below):
  - under-sampling (Tomek Links)
  - over-sampling (random over-sampling)
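A minimal sketch of this resampling step, assuming the imbalanced-learn library; the stand-in data, variable names, and the combination order (Tomek Links first, then random over-sampling) are illustrative assumptions, not the exact pipeline used in the paper.

```python
# Sketch: rebalancing the skewed training data (12% positive / 88% negative)
# with Tomek Links under-sampling followed by random over-sampling.
# Assumes the imbalanced-learn package; the data below is a random stand-in.
import numpy as np
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.random((1000, 50))                  # stand-in feature vectors
y = (rng.random(1000) < 0.12).astype(int)   # ~12% positive class

# Drop majority-class points that form Tomek links (borderline/noisy pairs).
X_tl, y_tl = TomekLinks().fit_resample(X, y)

# Then replicate minority examples until the classes are balanced.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tl, y_tl)
print(np.bincount(y_bal))                   # roughly equal class counts
```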


corpus definition

• Classification algorithm: Naïve Bayes
• Two sub-sets: a development set (75%) and a test set (25%)
• Ten-fold cross-validation: the development set was randomly divided into 10 equal parts; 9/10 used as the training set and 1/10 as the test set (see the sketch below).

[Figure: ROC curves (tp rate vs. fp rate) for the Development and Test sets]
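A minimal sketch of this split-and-validate protocol, assuming scikit-learn and a multinomial Naïve Bayes over term counts; the 75/25 split and ten folds follow the slide, while the stand-in matrix and all names are illustrative assumptions.

```python
# Sketch: 75/25 development/test split and ten-fold cross-validation
# of a Naïve Bayes classifier, mirroring the protocol on the slide.
# The term-count matrix is a random stand-in for the 1,824-paper corpus.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1824, 300))    # stand-in term counts
y = (rng.random(1824) < 0.12).astype(int)   # ~12% relevant papers

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)   # development (75%) / test (25%)

clf = MultinomialNB()
scores = cross_val_score(clf, X_dev, y_dev, cv=10)  # 9/10 train, 1/10 validate
print(scores.mean())

clf.fit(X_dev, y_dev)                # final model on the whole development set
print(clf.score(X_test, y_test))     # evaluation on the held-out 25%
```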

part 2

how we annotate


the schema - DiLEO

• DiLEO aims to conceptualize the DL evaluation domain by exploring its key entities, their attributes and their relationships.
• A two-layered ontology:
  - Strategic level: a set of classes related to the scope and aim of an evaluation.
  - Procedural level: classes dealing with practical issues.


the instrument - GoNTogle

• We used GoNTogle to generate an RDFS knowledge base.
• GoNTogle uses the weighted k-NN algorithm to support either manual or automated ontology-based annotation.
• http://bit.ly/12nlryh


the process - 1/3

• GoNTogle estimates a score for each class/subclass by calculating its presence in the k nearest neighbors (a scoring sketch follows below).
• We set a score threshold above which a class is assigned to a new instance (optimal score: 0.18).
• The user is presented with a ranked list of the suggested classes/subclasses and their scores, ranging from 0 to 1.
• 2,672 annotations were manually generated.
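A minimal sketch of this scoring scheme as the slide describes it, not GoNTogle's actual implementation: neighbors vote for the classes they carry, votes are weighted by similarity, and classes scoring above the 0.18 threshold are suggested. The cosine-similarity weighting and all names are assumptions.

```python
# Sketch: weighted k-NN class scoring with a 0.18 assignment threshold.
# A hypothetical stand-in for the GoNTogle behaviour described on the slide.
import numpy as np

def suggest_classes(doc_vec, neighbor_vecs, neighbor_labels, k=5, threshold=0.18):
    # Cosine similarity between the new document and every annotated document.
    sims = neighbor_vecs @ doc_vec / (
        np.linalg.norm(neighbor_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
    top = np.argsort(sims)[-k:]              # the k nearest neighbors

    scores = {}
    for i in top:                            # each neighbor votes for its classes,
        for label in neighbor_labels[i]:     # weighted by its similarity
            scores[label] = scores.get(label, 0.0) + sims[i]
    total = sum(sims[top]) or 1.0
    ranked = sorted(((label, s / total) for label, s in scores.items()),
                    key=lambda t: -t[1])     # normalize to [0, 1] and rank
    return [(label, s) for label, s in ranked if s >= threshold]

# Hypothetical usage with random stand-in data:
rng = np.random.default_rng(0)
vecs = rng.random((100, 40))
labels = [["Dimensions", "Means"] if i % 2 else ["Goals"] for i in range(100)]
print(suggest_classes(rng.random(40), vecs, labels))
```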


the process - 2/3

• RDFS statements were processed to construct a new data set (removal of stopwords and symbols, lowercasing, etc.; sketched below).
• Experiments with both un-stemmed (4,880 features) and stemmed (3,257 features) words.
• Multi-label classification via the ML framework Meka.
• Four methods:
  - binary representation
  - label powersets
  - RAkEL
  - ML-kNN
• Four algorithms:
  - Naïve Bayes
  - Multinomial Naïve Bayes
  - k-Nearest Neighbors
  - Support Vector Machines
• Four metrics:
  - Hamming Loss
  - Accuracy
  - One-error
  - F1 macro
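A minimal sketch of the text clean-up step, assuming NLTK for the stopword list and the Porter stemmer; the token pattern and function name are illustrative assumptions.

```python
# Sketch: stopword removal, symbol stripping, lowercasing, optional stemming.
# Assumes NLTK (pip install nltk; nltk.download('stopwords')).
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text, stem=False):
    tokens = re.findall(r"[a-z]+", text.lower())     # lowercase, drop symbols/digits
    tokens = [t for t in tokens if t not in STOP]    # remove stopwords
    if stem:                                         # stemmed run: 3,257 features
        tokens = [stemmer.stem(t) for t in tokens]   # un-stemmed run: 4,880 features
    return tokens

print(preprocess("Evaluating the usability of digital libraries", stem=True))
```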


the process - 3/3

• Performance tests were repeated using GoNTogle.
• GoNTogle's algorithm achieves good results in relation to the tested multi-label classification algorithms.

[Figure: bar chart comparing GoNTogle and Meka on Hamming Loss, Accuracy, One-Error and F1 macro]

part 3

how we discover


clustering - 1/3

• The final data set consists of 224 vectors of 53 features
  - it represents the annotations assigned from the DiLEO vocabulary to the document corpus.
• We represent the annotated documents by 2 vector models (sketched below):
  - binary: fi has the value 1 if the subclass corresponding to fi is assigned to document m, otherwise 0.
  - tf-idf: the feature frequency ffi of fi equals 1 in every vector whose respective subclass is annotated to the respective document m; idfi is the inverse document frequency of feature i over the M documents.
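A minimal sketch of the two vector models under the definitions above: since each feature frequency is 0 or 1, the tf-idf weight reduces to the idf of the assigned subclass. The shapes follow the slide (224 documents, 53 features); the random stand-in data and the smoothing-free idf formula are assumptions.

```python
# Sketch: binary and tf-idf vectors over DiLEO subclass annotations.
# 224 documents x 53 features, as on the slide (random stand-in data).
import numpy as np

rng = np.random.default_rng(0)
M, F = 224, 53
binary = (rng.random((M, F)) < 0.2).astype(float)  # 1 iff subclass f_i annotates doc m

df = binary.sum(axis=0)                 # document frequency of each feature
idf = np.log(M / np.maximum(df, 1))     # inverse document frequency (no smoothing)
tfidf = binary * idf                    # ff_i is 0/1, so the weight is just idf

print(binary[0][:5], tfidf[0][:5])
```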


clustering - 2/3

• We cluster the vector representations of the annotations by applying 2 clustering algorithms (sketched below):
  - K-Means: partitions the M data points into K clusters. When the objective function (cost, or error) was plotted for various values of K, its rate of decrease peaked for K near 11.
  - Agglomerative Hierarchical Clustering: a hierarchy of clusters built 'bottom up'.
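A minimal sketch of both algorithms and the elbow-style choice of K, assuming scikit-learn; the matrix is a random stand-in like the one built in the previous sketch.

```python
# Sketch: K-Means with an elbow scan over K, plus bottom-up agglomerative
# clustering, assuming scikit-learn. `tfidf` stands in for the annotation
# vectors (224 documents x 53 features).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
tfidf = rng.random((224, 53))   # stand-in for the annotation vectors

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf).inertia_
            for k in range(2, 21)}   # objective function (cost) for various K;
print(inertias)                      # the slide's elbow appears near K = 11

kmeans_labels = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(tfidf)
hier_labels = AgglomerativeClustering(n_clusters=11).fit_predict(tfidf)  # 'bottom up'
```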


clustering - 3/3

• We assess each feature of each cluster using the frequency increase metric.
  - it calculates the increase of the frequency of a feature fi in cluster k (cfi,k) compared to its document frequency dfi in the entire data set.
• We select the threshold a that maximizes the F1-measure, the harmonic mean of Coverage and the Dissimilarity mean (see the sketch below).
  - Coverage: the proportion of the features participating in the clusters to the total number of features.
  - Dissimilarity mean: the average distinctiveness of the clusters, defined in terms of the dissimilarity di,j between all possible pairs of clusters.
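A minimal sketch of the threshold scan under one plausible reading of the definitions above: frequency increase as the difference between a feature's in-cluster frequency and its overall document frequency, and dissimilarity as the Jaccard distance between the feature sets retained by two clusters. The exact formulas in the paper may differ.

```python
# Sketch: frequency-increase feature selection per cluster, and scanning for
# the threshold a that maximizes F1 = harmonic mean of Coverage and the
# Dissimilarity mean. Concrete formulas are illustrative assumptions.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
binary = (rng.random((224, 53)) < 0.2).astype(float)  # stand-in annotations
labels = rng.integers(0, 11, size=224)                # stand-in cluster labels

def cluster_features(a):
    df = binary.mean(axis=0)                      # document frequency df_i
    return {k: set(np.where(binary[labels == k].mean(axis=0) - df > a)[0])
            for k in np.unique(labels)}           # frequency increase above a

def f1(a):
    kept = cluster_features(a)
    coverage = len(set().union(*kept.values())) / binary.shape[1]
    ds = [1 - len(kept[i] & kept[j]) / (len(kept[i] | kept[j]) or 1)
          for i, j in combinations(kept, 2)]      # Jaccard dissimilarity d_{i,j}
    dis_mean = float(np.mean(ds)) if ds else 0.0
    return 2 * coverage * dis_mean / ((coverage + dis_mean) or 1)

best_a = max(np.linspace(0, 1, 101), key=f1)      # threshold maximizing F1
print(best_a, f1(best_a))
```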


metrics - F1-measure

[Figure: F1-measure vs. threshold a for K-Means tf-idf, K-Means binary and Hierarchical tf-idf]

part 4

how (and what) we interpret

patterns

[Figure: the K-Means tf-idf pattern over the DiLEO schema, spanning the strategic and procedural layers (Goals, Dimensions, Activities, Means, Instruments, Criteria, Metrics, Subjects, Objects and their relationships, e.g. isAimingAt, hasPerformed/isPerformedIn, hasConstituent/isConstituting)]

[Figure: the Hierarchical clustering pattern over the same schema]

part 5

conclusions


conclusions

• The patterns reflect and - up to a point - confirm the anecdotally evident research practices of DL researchers.
• Patterns have similar properties to a map.
  - They can provide the main and the alternative routes one can follow to reach a destination, taking into account several practical parameters one might not know.
• By exploring previous profiles, one can weigh all the available options.
• This approach can extend other coding methodologies in terms of transparency, standardization and reusability.

Thank you for your attention.

questions?
