Page 1: Arash Joorabchi  &  Abdulhussain E. Mahdi  Department of Electronic and Computer Engineering

Arash Joorabchi & Abdulhussain E. Mahdi

Department of Electronic and Computer Engineering

University of Limerick, Ireland

A New Unsupervised Approach to Automatic Topical

Indexing of Scientific Documents According to

Library Controlled Vocabularies

ALISE 2013

Work Supported by:

Page 2

Subject (Topical) Metadata in Libraries

• Un-controlled

Unrestricted author and/or reader-assigned keywords and keyphrases,

such as:

– Index Term-Uncontrolled (MARC-653)

• Controlled

Restricted cataloguer-assigned classes and subject headings, such as:

– DDC (MARC-082)

– LCC (MARC-050)

– LCSH/FAST (MARC-650)

Page 3

The Case of Scientific Digital Libraries & Repositories

Archived materials include: journal articles, conference papers, technical reports, theses & dissertations, book chapters, etc.

• Un-controlled Subject Metadata:

– Commonly available when enforced by editors, e.g., in case of published

journal articles & conf. proceedings, but rare in unedited publications.

– Inconsistent

• Controlled Subject Metadata:

– Rare due to the sheer volume of new materials published and high cost of

cataloguing.

– High level of incompleteness and inaccuracy due to oversimplified classification

rules, e.g., IF published by the Dept. of Computer Science THEN DDC: 004,

LCSH: Computer science

Page 4

Automatic Subject Metadata Generation in Scientific Digital Libraries

& Repositories

Aims to provide a fully/semi automated alternative to manual

classification.

1. Supervised (ML-based) Approach:

– utilizing generic machine learning algorithms for text classification (e.g., NB, SVM, DT).

– challenged by the large scale and complexity of library classification schemes, e.g., deep hierarchy, skewed data distribution, data sparseness, and concept drift [Jun Wang ’09].

2. Unsupervised (String Matching-based) Approach:

– String-to-string matching between words in a term list extracted from library thesauri &

classification schemes, and words in the text to be classified.

– Inferior performance compared to supervised methods [Golub et al. ‘06].
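The string-matching baseline described above can be illustrated with a minimal sketch. It assumes a term list has already been extracted from a thesaurus or classification scheme; the function names and the "all words of the term occur in the text" criterion are illustrative assumptions, not the matching rule of any particular published system.

```python
import re


def _words(s: str) -> list[str]:
    """Lower-case a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def match_terms(text: str, term_list: list[str]) -> list[str]:
    """Return the terms whose words all occur somewhere in the text.

    This is plain word-to-word matching: no stemming, no word order,
    no disambiguation (hypothetical simplification).
    """
    text_words = set(_words(text))
    matched = []
    for term in term_list:
        term_words = _words(term)
        if term_words and all(w in text_words for w in term_words):
            matched.append(term)
    return matched
```

Because the match is purely lexical, an ambiguous word such as "string" would match any vocabulary term containing it, which is one reason such approaches tend to trail supervised methods.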

Page 5

A New Unsupervised Concept-to-Concept Matching Approach - An

Overview

[Diagram: Paper/Article (Full Text) → Wikipedia Concepts → Ranking → Key Concepts → WorldCat Database → MARC records sharing a key concept(s) with the paper/article → Inference → Paper/Article (MARC Rec.) with 653: {…}, 082: {…} (DDC), 650: {…} (FAST)]

Page 6

Wikipedia as a Crowd-Sourced Controlled Vocabulary

Extensive topic/concept coverage (more than 4 million English articles)

Up-to-date (lags Twitter by ~3h on major events [Osborne et al. ’12])

Rich knowledge source for NLP (semantic relatedness, word sense disambiguation)

Detailed description of concepts

[Figure labels: Alternative Label, Related Term]

Paper/Article (MARC Rec.)

653: {Wikipedia: HP 9000}

650: {FAST: HP 9000 (Computer)}

Page 7

Wikification using WikipediaMiner – an open source toolkit for mining

Wikipedia [Milne, Witten ‘09]

Block Edit Models for Approximate String Matching

Abstract

In this paper we examine the concept of string block edit distance, where two strings A and B are compared by

extracting collections of substrings and placing them into correspondence. This model accounts for certain phenomena

encountered in important real-world applications, including pen computing and molecular biology. The basic problem

admits a family of variations depending on whether the strings must be matched in their entireties, and whether overlap

is permitted. We show that several variants are NP-complete, and give polynomial-time algorithms for solving….

Wikipedia Concepts – Detection In Text

Example for the word "string" in the abstract above:

Descriptor: String (computer science)

Non-descriptors:
– character string
– text string
– binary string

Other candidate senses: String (theory), String (rope), String (music), …

Page 8

Wikipedia Concepts – Ranking Features

1. Occurrence Frequency
2. First Occurrence
3. Last Occurrence
4. Occurrence Spread
5. Length
6. Lexical Diversity
7. Lexical Unity
8. Avg Link Probability
9. Max Link Probability
10. Generality
11. Speciality
12. Distinct Links Count
13. Links Out Ratio
14. Links In Ratio
15. Avg Disambiguation Confidence
16. Max Disambiguation Confidence
17. Link-Based Relatedness to Other Topics
18. Link-Based Relatedness to Context
19. Cat-Based Relatedness to Other Topics
20. Translations Count

Page 9

Key Wikipedia Concepts – Rank & Filtering

Un-supervised

$$\mathrm{Score}(topic_j) = \sum_{i=1}^{|F|} f_{ij}$$

where F is the set of ranking features and f_{ij} is the value of feature i for candidate topic j.

Pros: easy to implement & fast; plug & play, i.e., no training needed

Cons (naïve assumptions):
– Assumes all features carry the same weight
– Assumes all features contribute to the importance probability of candidates linearly

Supervised

$$\mathrm{Score}(topic_j) = \sum_{i=1}^{|F|} w_i \, f_{ij}^{\,d_i}$$

where w_i and d_i are the weight and degree parameters of feature i.

1. Initial population: a set of ranking functions with random weight and degree parameter values within a preset range
2. Evaluate fitness of each ranking function.
3. (selection, crossover, mutation) -> new generation
4. Repeat steps 2 & 3 until threshold is passed

Genetic algorithm (ECJ) settings:
Species: Float
Population Size: 40
Genome Size: 40
Chunk Size: 2
Min Gene: 0.0
Max Gene: 2.0
Elites: 1
Crossover Type: two points
Selection Method: Tournament
Mutation Type: Reset
Mutation Probability: 0.05
Threads: 2
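The two ranking functions on this slide can be sketched as follows. This is a minimal illustration, assuming each candidate topic comes with a list of feature values already normalized to [0, 1]; the function names are hypothetical, and the weights and degrees would in practice come from the genetic algorithm rather than being set by hand.

```python
from typing import Sequence


def unsupervised_score(features: Sequence[float]) -> float:
    """Plain sum of feature values: every feature has weight 1 and degree 1."""
    return sum(features)


def weighted_score(
    features: Sequence[float],
    weights: Sequence[float],
    degrees: Sequence[float],
) -> float:
    """Sum of w_i * f_i ** d_i over all features of one candidate topic.

    Assumes non-negative feature values, so fractional degrees are safe.
    """
    if not (len(features) == len(weights) == len(degrees)):
        raise ValueError("features, weights and degrees must have equal length")
    return sum(w * f ** d for f, w, d in zip(features, weights, degrees))
```

With all weights and degrees set to 1, `weighted_score` reduces to `unsupervised_score`. A genome of 40 genes with chunk size 2 would be consistent with one (weight, degree) pair per feature for the 20 features listed earlier.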

Page 10

Key Wikipedia Concepts – Evaluation Dataset & Measure

Wiki-20 dataset [Medelyan, Witten ‘08]:

20 Computer Science related papers/articles.

Each annotated by 15 Human Annotator (HA) teams independently.

HAs assigned an average of 5.7 topics per Doc.

an Avg. of 35.5 unique topics assigned per Doc.

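Rolling’s inter-indexer consistency (= F1):

$$\text{Inter-indexer consistency}(A, B) = \frac{2c}{a + b}$$

where a and b are the numbers of topics assigned by indexers A and B, and c is the number of topics they have in common.

[Figure labels: HA1, HA2, HA3, MA, VK]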
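Rolling's measure is straightforward to compute from two sets of assigned topics. The sketch below is illustrative only: the function names are assumptions, and averaging one annotator's consistency over all human annotator teams is one plausible reading of how the per-document figures are aggregated.

```python
from typing import Iterable


def rolling_consistency(topics_a: Iterable[str], topics_b: Iterable[str]) -> float:
    """Rolling's inter-indexer consistency 2c / (a + b) for two topic sets."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def average_consistency(topics: Iterable[str], others: list[set[str]]) -> float:
    """Mean consistency of one annotator against a list of other annotators."""
    topics = set(topics)
    return sum(rolling_consistency(topics, o) for o in others) / len(others)
```

Because the formula is the harmonic mean of precision and recall when one set is treated as the reference, it coincides with F1, as the slide notes.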

Page 11

Key Wikipedia Concepts – Evaluation Results

Performance comparison with human annotators and rival machine annotators

Avg. inter-consistency with human annotators (%) is reported as Min. / Avg. / Max.

Method | Learning approach | Number of keyphrases assigned per document, nk | Min. | Avg. | Max.
TFIDF (baseline) | n/a - unsupervised | 5 | 5.7 | 8.3 | 14.7
KEA++ (KEA-5.0) | Naïve Bayes | 5 | 15.5 | 22.6 | 27.3
Grineva et al. | n/a - unsupervised | 5 | 18.2 | 27.3 | 33.0
Maui | Naïve Bayes (all 14 features) | 5 | 22.6 | 29.1 | 33.8
Maui | Bagging decision trees (all 14 features) | 5 | 25.4 | 30.1 | 38.0
Human annotators (gold standard) | n/a - senior CS students | Varied, with an average of 5.7 per document | 21.4 | 30.5 | 37.1
CKE | n/a - unsupervised | 5 | 22.7 | 30.6 | 38.3
Current work | n/a - unsupervised | 5 | 19.1 | 30.7 | 37.9
Maui | Bagging decision trees (13 best features) | 5 | 23.6 | 31.6 | 37.9
Current work (LOOCV) | GA, threshold=800, unique bests method | 5 | 12.3 | 32.8 | 58.1
Current work (LOOCV) | GA, threshold=200, unique bests method | 5 | 13.9 | 32.9 | 56.7
Current work (LOOCV) | GA, threshold=400, unique bests method | 5 | 14.0 | 33.5 | 58.1

– Joorabchi, A. and Mahdi, A. Automatic Subject Metadata Generation for Scientific Documents Using Wikipedia and Genetic Algorithms. In Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012)

– Joorabchi, A. and Mahdi, A. Automatic Keyphrase Annotation of Scientific Documents Using Wikipedia and Genetic Algorithms. To appear in the Journal of Information Science

Page 12

Querying WorldCat Database

Input: top 30 key concepts in the document. One query is sent to the WorldCat database per key concept:

http://worldcat.org/webservices/catalog/search/sru?query=
srw.kw = Doc_Key_Concept_Descriptor
AND srw.ln exact eng //Language
AND srw.la all eng //Language Code (Primary)
AND srw.mt all bks //Material Type
AND srw.dt exact bks //Document Type (Primary)
&servicelevel = full
&maximumRecords = 100
&sortKeys = relevance,,0 //Descending order
&wskey = [wskey]

Output: ≤100 potentially related MARC records
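A query URL of the shape shown on this slide could be assembled as in the sketch below. The endpoint, index names and parameters are copied from the slide; the helper name, the double-quoting of the descriptor, and the decision not to send the request are assumptions made for illustration. A valid `wskey` is required for any real call.

```python
from urllib.parse import urlencode

# Endpoint as printed on the slide.
BASE_URL = "http://worldcat.org/webservices/catalog/search/sru"


def build_worldcat_query(descriptor: str, wskey: str, max_records: int = 100) -> str:
    """Build the SRU search URL for one key-concept descriptor."""
    # Quoting the descriptor as a phrase is an assumption, not shown on the slide.
    cql = (
        f'srw.kw = "{descriptor}" '
        "AND srw.ln exact eng "   # language
        "AND srw.la all eng "     # language code (primary)
        "AND srw.mt all bks "     # material type
        "AND srw.dt exact bks"    # document type (primary)
    )
    params = {
        "query": cql,
        "servicelevel": "full",
        "maximumRecords": max_records,
        "sortKeys": "relevance,,0",  # descending order
        "wskey": wskey,
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

`urlencode` percent-encodes spaces, quotes and commas, so the descriptor can contain punctuation such as parentheses without breaking the URL.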

Page 13

Refining Key Concepts Based on WorldCat Search Results

Doc_Key_Concepts = {doc_key_concepts_i}, i ≤ 30

Marc_Recs_i = {marc_recs_i,j}, j ≤ 100 (out of total_matches_i)

[Refinement rule; the equation layout did not survive extraction. Recoverable structure: IF a condition on doc_key_concept_i holds THEN discard it, ELSE add it to Refined_Doc_Key_Concepts. Recoverable terms of the condition, joined by OR / AND: total_matches_i = 0; log(total_matches_i) compared with 1 / InDoc_Score(doc_key_concept_i); a comparison of InDoc_Score values with factor 0.8; the thresholds 10 and 20.]

e.g., “Logical conjunction”

e.g., “Logic” (72,353): 13.7 > 10.3

vs. “Linear logic” (17): 2.83 < 8.6

Page 14

MARC Records Parsing, Classification, Concept Detection

001 Control Number

245($a) Title Statement (Title)

505($a, $t) Formatted Contents Note

520($a, $b) Summary, Etc.

650($a) Subject Added Entry-Topical Term

653($a) Index Term-Uncontrolled
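The fields listed above could be pulled out of a MARCXML record roughly as follows. This sketch assumes the records arrive as MARCXML in the standard `http://www.loc.gov/MARC21/slim` namespace; the function name and the returned dictionary layout are illustrative choices, not part of the presented system.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Data fields and subfield codes listed on the slide.
TEXT_FIELDS = {
    "245": "a",   # Title Statement (Title)
    "505": "at",  # Formatted Contents Note
    "520": "ab",  # Summary, Etc.
    "650": "a",   # Subject Added Entry-Topical Term
    "653": "a",   # Index Term-Uncontrolled
}


def extract_fields(record_xml: str) -> dict:
    """Return the control number (001) and the selected subfield texts."""
    root = ET.fromstring(record_xml)
    control = root.find(".//marc:controlfield[@tag='001']", MARC_NS)
    result = {"001": control.text if control is not None else None}
    for tag, codes in TEXT_FIELDS.items():
        values = []
        for field in root.findall(f".//marc:datafield[@tag='{tag}']", MARC_NS):
            for sub in field.findall("marc:subfield", MARC_NS):
                if sub.get("code") in codes and sub.text:
                    values.append(sub.text.strip())
        result[tag] = values
    return result
```

The text collected from 245, 505, 520, 650 and 653 is what would then be handed to a concept detector, while 001 identifies the record.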

[Diagram: for each doc_key_concepts_i (i ≤ 20) in Doc_Key_Concepts, and each marc_recs_i,j (j ≤ 100, out of total_matches_i) in Marc_Recs_i:
– OCLC Classify* → DDC_i,j and FAST_i,j
– Wikipedia-Miner → Marc_Concepts_i,j]

*OCLC Classify finds the most popular DDC & FASTs for the work using the OCLC FRBR Work-Set algorithm.

Page 15

Measuring Relatedness Between MARC Records and the Article/Paper

Doc_Key_Concepts = {doc_key_concepts_i}, i ≤ 20

Marc_Recs_i = {marc_recs_i,j}, j ≤ 100 (out of total_matches_i)

Each marc_recs_i,j carries DDC_i,j, FAST_i,j and Marc_Concepts_i,j. Relatedness_i,j = ?

[Equations; the layout did not survive extraction. Recoverable structure: Shared_Concepts is the set of concepts x that belong both to Marc_Concepts_i,j and to Doc_Key_Concepts. Relatedness_i,j is a sum over k = 1 … |Shared_Concepts| that combines, for each shared_concept_k: its InDoc_Score; a log-scaled Normalized_Freq, defined from Freq_In_Marc; and a log-scaled Inverse_Marc_Freq, defined over All_Unique_Marc_Recs.]

Page 16

Weighting DDC Candidates

[Equations; the layout did not survive extraction. Recoverable structure: Unique_DDCs is the set of DDCs found in the records of All_Unique_Marc_Recs. Each unique_ddcs_k receives a Weight built from log-scaled terms: Freq, Normalized_Freq (using Highest_Valid_DDCs_Count_Per_Concept), Inverse_Concept_Freq (over Doc_Key_Concepts), Average_Relatedness (from Relatedness_i,j), and Inverse_Average_Total_Matches (from total_matches_i).]

Page 17

Weighting FAST Candidates

[Equations; the layout did not survive extraction. Recoverable structure: Unique_FASTs is the set of FAST headings found in the records of All_Unique_Marc_Recs. Each unique_fasts_k receives a Weight built from log-scaled terms: Freq, Normalized_Freq (using Highest_Valid_FASTs_Count_Per_Concept), Inverse_Concept_Freq (over Doc_Key_Concepts), Average_Relatedness (from Relatedness_i,j), and Inverse_Average_Total_Matches (from total_matches_i).]

Page 18

DDCs Weight Aggregation & Outlier Detection

Sort Unique_DDCs set based on DDCs depth in descending order
For each DDCi ∈ Unique_DDCs Do:
    For each DDCj ∈ Unique_DDCs Do:
        IF subclass(DDCi, DDCj) THEN
            IF weight(DDCi) > highest_DDC_weight/10 THEN
                weight(DDCi) = weight(DDCi) + weight(DDCj)
                Discard DDCj
            ELSE Discard DDCi

Example:
006.312 : 10.991574176537037
+ 006.31 : 19.614959248944054 = 30.60653342548109
+ 006.3 : 12.77908859025236 = 43.385622015733446

Outlier*: DDCi s.t. weight(DDCi) > (upper inner fence = Q3 + 1.5*IQ)

[Figure labels: DDCi, DDCi+1, Upper + 1, Outlier]

*BoxPlot outliers: DDCs whose weights lie an abnormal distance from the others’, i.e., mild and extreme outliers
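The aggregation loop and the box-plot fence can be sketched as below. Several details are assumptions made for illustration: `subclass` is approximated by a string-prefix test on the DDC notation, depth by the number of characters after removing the decimal point, `highest_DDC_weight` is fixed before aggregation starts, and quartiles come from `statistics.quantiles` with its default method.

```python
import statistics


def is_subclass(child: str, parent: str) -> bool:
    """Assumption: a DDC is a subclass of any shorter notation it starts with."""
    return child != parent and child.startswith(parent)


def depth(ddc: str) -> int:
    """Rough depth of a DDC notation: its length without the decimal point."""
    return len(ddc.replace(".", ""))


def aggregate_ddc_weights(weights: dict[str, float]) -> dict[str, float]:
    """Fold the weights of parent classes into their deepest strong subclass."""
    if not weights:
        return {}
    threshold = max(weights.values()) / 10  # highest_DDC_weight / 10
    remaining = dict(weights)
    by_depth = sorted(weights, key=depth, reverse=True)  # deepest first
    for ddc_i in by_depth:
        for ddc_j in by_depth:
            if ddc_i not in remaining:
                break  # DDCi was discarded or already absorbed
            if ddc_j not in remaining or not is_subclass(ddc_i, ddc_j):
                continue
            if remaining[ddc_i] > threshold:
                remaining[ddc_i] += remaining.pop(ddc_j)  # absorb the parent
            else:
                del remaining[ddc_i]  # too weak to represent the branch
    return remaining


def upper_fence_outliers(weights: dict[str, float]) -> dict[str, float]:
    """Keep entries whose weight exceeds Q3 + 1.5 * IQR (needs >= 2 entries)."""
    q1, _, q3 = statistics.quantiles(weights.values(), n=4)
    fence = q3 + 1.5 * (q3 - q1)
    return {k: w for k, w in weights.items() if w > fence}
```

Run on the three weights from the slide's example, the aggregation leaves a single class, 006.312, carrying the combined weight of its two ancestors.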

Page 19

FASTs Weight Aggregation & Outlier Detection

Unique_FASTs := {x ∈ Unique_FASTs : weight(x) > highest_FAST_weight/10}

For each FASTi ∈ Unique_FASTs Do :

For each FASTj ∈ Unique_FASTs Do :

IF related(FASTi , FASTj) AND WC_SubjectUsage(FASTi) < WC_SubjectUsage(FASTj)

THEN weight(FASTi) = weight(FASTi) + weight(FASTj)

[Figure labels: FASTi, FASTi+1, FASTi+2; Outlier1 + Outlier2 + 1]

Example:
Expert systems (Computer science), weight: 4.224295291384108
-> seeAlsoHeading: Artificial intelligence
-> seeAlsoHeading: Computer systems
-> seeAlsoHeading: Soft computing
-> subjectUsage: 14685.0
+ Artificial intelligence (subjectUsage: 36145.0), weight: 5.214271611745798
= 9.438566903129907
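The FAST aggregation rule can be sketched in the same spirit. The data structures are assumptions: `see_also` maps a heading to its seeAlsoHeading entries and stands in for `related`, `usage` holds the WorldCat subjectUsage counts, and each heading is credited with the original (pre-aggregation) weights of its related headings so that the result does not depend on iteration order.

```python
def aggregate_fast_weights(
    weights: dict[str, float],
    see_also: dict[str, list[str]],
    usage: dict[str, float],
) -> dict[str, float]:
    """Add to each heading the weights of related, more widely used headings."""
    if not weights:
        return {}
    top = max(weights.values())
    # Unique_FASTs := headings with weight > highest_FAST_weight / 10
    kept = {h: w for h, w in weights.items() if w > top / 10}
    aggregated = dict(kept)
    for fast_i in kept:
        for fast_j in kept:
            related = fast_j in see_also.get(fast_i, ())
            if fast_i != fast_j and related and usage[fast_i] < usage[fast_j]:
                aggregated[fast_i] += kept[fast_j]
    return aggregated
```

In the slide's example, "Expert systems (Computer science)" is less widely used than its related heading "Artificial intelligence", so it gains that heading's weight, which favours the more specific heading.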

Page 20

DDCs Binary Evaluation

Wiki-20 dataset [Medelyan, Witten ‘08] containing 20 Computer Science related papers/articles.

$$\text{Precision} = \frac{\text{Number of correctly assigned}}{\text{Total assigned}} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{\text{Number of correctly assigned}}{\text{Total possible correct}} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$$

✓ in the True DDC column means the predicted DDC matches the true DDC.

Doc ID | Predicted DDC (by current method) | True DDC | Predicted DDC (by ACT-DL*)
287 | 519.542 Decision theory | ✓ | 004
287 | 006.35 Natural language processing | ✓ |
7183 | 006.333 Deduction, problem solving, reasoning | ✓ | 004
7502 | 005.131 Symbolic logic | 006.333 Deduction, problem solving, reasoning | 004
9307 | 005.757--0218 Object-oriented databases--Standards | 005.757 Object-oriented databases | 004
10894 | 621.3815--0287 Components and circuits--Testing and measurement | 005.14 Verification, testing, measurement, debugging | 004
12049 | 005.43 Systems programs | 005.453 Compilers | 004
13259 | 001.6443 (invalid in DDC22 & DDC23) | 001.4226 Presentation of statistical data | 000
16393 | 004.53 Internal storage (Main memory) | 005.435 Memory management programs | 004
18209 | 005.115 Logic programming | ✓ | 004
19970, 20287 (split between the two documents not recoverable) | 511.322 Set theory | ✓ | 004 (each)
19970, 20287 | 005.275 Programming for multiprocessor computers | ✓ |
19970, 20287 | 004.35 Multiprocessing | ✓ |
19970, 20287 | 004.33 Real-time processing | ✓ |
23267 | 005.117 Object-oriented programming | ✓ | 004
23507 | 495.6--5 Japanese--Grammar | 006.35 Natural language processing | 400
23596 | 658.4036--028546 Group decision making--Computer communications | ✓ | 150
25473 | 515.2433 Fourier and harmonic analysis | ✓ | 004
25473 | below threshold | 006.37 Computer vision |
37632 | 005.14 Verification, testing, measurement, debugging | ✓ | 004
39172 | 006.4--015116 Computer pattern recognition--Combinatorics | ✓ | 510
39955 | 005.117 Object-oriented programming | ✓ | 150
40879 | 004 Computer science | 006.31 Machine learning | 004
43032 | 005.262 Programming in specific programming languages | 005.26 Programming for personal computers | 004

Overall: TP= 14, FP= 9, FN= 10, Pr= 0.61, Re= 0.58, F1= 0.60

Overall F1=[0.05, 0.75]

Imbalanced training set: 004: 78k, 005: 100, 006: 403

*Automatic Classification Toolbox for Digital Libraries (ACT-DL) by Bielefeld University Library and deployed at Bielefeld Academic Search Engine (BASE)
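The overall figures on this slide follow directly from the TP, FP and FN counts. A small sketch (the function name is illustrative) reproduces the arithmetic:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts reported for the DDC binary evaluation: TP = 14, FP = 9, FN = 10.
pr, re_, f1 = precision_recall_f1(14, 9, 10)
print(f"Pr={pr:.2f} Re={re_:.2f} F1={f1:.2f}")
```

With 14 correct out of 23 assigned and 24 possible, this gives Pr = 14/23, Re = 14/24 and F1 = 28/47, which round to the 0.61, 0.58 and 0.60 reported above.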

Page 21

DDCs Hierarchical Evaluation

Current Work (Wiki-20 dataset)

   | L1 | L2 | L3 | L4 | L5 | L6 | L7 | Facet | Avg.
TP | 21 | 21 | 18 | 17 | 15 | 10 | 2 | 2 |
FP | 2 | 2 | 5 | 5 | 5 | 4 | 2 | 3 |
FN | 3 | 3 | 6 | 7 | 8 | 4 | 1 | 0 |
Pr | 0.91 | 0.91 | 0.78 | 0.77 | 0.75 | 0.71 | 0.50 | 0.40 | 0.72
Re | 0.88 | 0.88 | 0.75 | 0.71 | 0.65 | 0.71 | 0.67 | 1.00 | 0.78
F1 | 0.89 | 0.89 | 0.77 | 0.74 | 0.70 | 0.71 | 0.57 | 0.57 | 0.73

ACT-DL (Wiki-20 dataset)

   | L1 | L2 | L3 | Avg.
TP | 16 | 16 | 1 |
FP | 4 | 4 | 19 |
FN | 4 | 4 | 19 |
Pr | 0.80 | 0.80 | 0.05 | 0.55
Re | 0.80 | 0.80 | 0.05 | 0.55
F1 | 0.80 | 0.80 | 0.05 | 0.55

ACT-DL (BASE dataset)

   | L1 | L2 | L3 | Avg.
Pr | 0.90 | 0.78 | 0.77 | 0.82
Re | 0.75 | 0.56 | 0.55 | 0.62
F1 | 0.81 | 0.63 | 0.62 | 0.69

Page 22

FASTs Binary Evaluation

Doc IDs: 287, 7183, 7502, 9307, 10894, 12049, 13259, 16393, 18209, 19970, 20287, 23267, 23507, 23596, 25473, 37632, 39172, 39955, 40879, 43032. The assignment of rows to documents is not recoverable from the extracted slide, except that the row "Teams in the workplace--Data processing" belongs to document 23596. ✓ in the True FAST column means the predicted heading matches the true heading.

Predicted FAST | True FAST
Bayesian statistical decision theory | ✓
Bayesian statistical decision theory--Industrial applications | Natural language processing (Computer science)
Maximum entropy method | Information retrieval
Econometric models | Machine learning
Model-based reasoning | ✓
Knowledge acquisition (Expert systems) | ✓
Expert systems (Computer science) | ✓
Semantics | Conceptual structures (Information theory)
Case-based reasoning | ✓
Object-oriented databases | ✓
UML (Computer science) | Computer software--Development
Booch method | Computer-aided software engineering
Software patterns | ✓
Object-oriented methods (Computer science) | ✓
Object-oriented databases--Standards | Object-oriented programming (Computer science)
Regression analysis | ✓
Struts framework | Computer software--Quality control
Application software--Testing | ✓
Yacc (Computer file) | ✓
Assembling (Electronic computers) | Compiling (Electronic computers)
Three-dimensional display systems | ✓
Interactive computer systems | ✓
Interactive multimedia | Information visualization
Distributed shared memory | ✓
Intel i860 (Microprocessor) | Memory management (Computer science)
Cache memory | ✓
Virtual storage (Computer science) | ✓
Predicate (Logic) | ✓
Modality (Logic) | ✓
Set theory | ✓
Sorting (Electronic computers) | ✓
Parallel algorithms | ✓
Data transmission systems | Real-time data processing
Virtual computer systems | ✓
Parallel computers | ✓
Modula-3 (Computer program language) | Object-oriented methods (Computer science)
ML (Computer program language) | Object-oriented programming (Computer science)
Object-oriented databases | Computer software--Reusability
Abstract data types (Computer science) | ✓
English language--Noun phrase | ✓
Grammar, Comparative and general--Noun phrase | ✓
Automatic speech recognition | Computational linguistics
Teams in the workplace--Data processing | ✓
Data compression (Telecommunication) | ✓
Image compression | ✓
Signal processing--Mathematics | ✓
Wavelets (Mathematics) | ✓
Video compression | ✓
Digital video | ✓
Data compression (Computer science) | ✓
Software visualization | ✓
Debugging in computer science | ✓
Matching theory | ✓
Text processing (Computer science) | ✓
Graphical user interfaces (Computer systems) | Combinatorial analysis
Smalltalk (Computer program language) | Object-oriented programming languages
Objective-C (Computer program language) | Object-oriented programming (Computer science)
Automatic speech recognition | Machine learning
Speech processing systems | Classification
Supervised learning (Machine learning) | ✓
HP-UX | Software localization
Hewlett-Packard computers--Programming | User interfaces (Computer systems)
HP 9000 (Computer) | Computer interfaces
C (Computer program language) | ✓

Overall: TP= 40, FP= 24, FN= 24, Pre= Re= F1= 0.625

Page 23

Semi-Supervised Classification

287: Clustering Full Text Documents
1. Bayesian statistical decision theory > 252.41740965808467
2. Bayesian statistical decision theory--Industrial applications > 223.09281028013865
3. Maximum entropy method > 223.09281028013865
4. Econometric models > 189.47706031373122
5. Economics, Mathematical > 188.4336672427764
6. Natural language processing (Computer science) > 176.13905753628868
7. Econometrics > 156.6469274464959
8. Distribution (Probability theory) > 120.64195152106359
9. Parsing (Computer grammar) > 102.72834662505807
10. Lexicology--Data processing > 101.39771816337012
11. Machine translating > 99.39171867148306
12. Text processing (Computer science) > 96.65689215290195
13. Information retrieval > 79.01359045012737
14. Semantic Web > 73.12618493349078
15. Probabilities > 70.99695859769267
16. Computational linguistics > 65.00474591701948
17. Machine learning > 60.14168210721469
18. Decision making > 50.302190572189424
19. Inference > 49.142891911243986
20. Interactive computer systems > 49.04810095707191
...
41. Mathematical physics > 25.256694185393123

12049: Occam's Razor: The Cutting Edge for Parser Technology
1. 005.43 > 449.17978755450434 (Systems programs)
2. 005.453 > 429.04491205387495 (Compilers)
3. 005.12 > 144.3981891584036
4. 510.7808 > 138.0169127750601
5. 005.26 > 105.58801291194308
6. 415 > 79.72358747591275
7. 001.6425 > 39.024619737391866
8. 004 > 36.433436906359425

Page 24

Future Work

• Detecting Wikipedia topics in documents is computationally expensive. Eliminate the need for sending queries to WorldCat and repeating the process of topic detection on matching MARC records by performing topic detection on a locally held FRBRized version of the WorldCat DB.

• Complement topics extracted from the MARC records of a work catalogued in WorldCat with common terms and phrases from its content (as extracted by Google Books).

• Probabilistic mapping of Wikipedia concepts/articles to their corresponding DDCs and FASTs (already initiated by OCLC Research via developing VIAFbot for mapping Wikipedia biography articles to VIAF.org).

Page 25

This work is supported by:

OCLC/ALISE Library & Information Science Research Grant Program

Irish Research Council 'New Foundations' Scheme

Thank You!

Questions…

For more information, please contact:

[email protected] [email protected]