12
Abstracting concepts from text documents by using an ontology E. Chernyak 1 , O. Chugunova 1 , J. Askarova 1 , S. Nascimento 2 , B. Mirkin 1,3 ivision of Applied Mathematics and Informatics, NRU-HSE, Moscow, Russia epartment of Informatics, New University of Lisbon, Caparica, Portugal epartment of Computer Science, Birkbeck University of London, London, UK

Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Embed Size (px)

Citation preview

Page 1: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Abstracting concepts from text documents by using an ontology

E. Chernyak1, O. Chugunova1, J. Askarova1, S. Nascimento2, B. Mirkin1,3

1 Division of Applied Mathematics and Informatics, NRU-HSE, Moscow, Russia2 Department of Informatics, New University of Lisbon, Caparica, Portugal3 Department of Computer Science, Birkbeck University of London, London, UK

Page 2: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Contents

1. Statement of the problem

2. Method

3. Examples of application

4. Future work

Page 3: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Statement of the problem

•Interpretation of a text corpus over a taxonomy (the main part of an ontology)

Page 4: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Input

...

Collection of the ACM Journal abstracts

The ACM Computing Classification System

(1998)

...

...

Page 5: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

OutputHead subjects and related events (gap, offshoot)

Code Membership value

ACM-CCS Topic

F.1.3 0.597 Complexity Measures and ClassesH.2.3 0.475 LanguagesF.2.3 0.4009 Tradeoffs between Complexity Measures H.2.1 0.3705 Logical DesignF.1.1 0.322 Models of ComputationH.2.4 0.2973 SystemsD.2.8 0.24 MetricsH.2.8 0.2193 Database ApplicationsJ.4 0.211 SOCIAL AND BEHAVIORAL SCIENCESK.8.0 0.203 GeneralH.2.6 0.1840 Database MachinesF.2.2 0.1739 Nonnumerical Algorithms and ProblemsI.1.2 0.0178 Algorithms...

Head Subjects (Interpretation):H.2 DATABASE MANAGEMENTF. Theory of Computation

Page 6: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Method

1.Building a profile of the corpus

A. Annotated suffix tree for abstracts and keywords (Pampapathi, Mirkin, Levene, 2006)

B. Scoring ACM-CCS leaves including references between them

C. Clustering the profiles (if needed)

2.Lifting the profile in the taxonomy tree

A. Specifying head subject, gap and offshoot penalty weights

B. Parsimonious lifting (Mirkin, Nascimento, Fenner, Pereira, 2010)

Page 7: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Annotated Suffix Tree (AST)

• is used to compute and store the frequencies of all substrings of the string

Page 8: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Lifting

•Represent the thematic clusters in ACM-CCS by higher, more general, nodes depending on

the inconsistencies (Lift)

Page 9: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Applications I

•The Journal of ACM abstracts and the ACM-CCS

•Course syllabuses of Mathematics and Informatics disciplines and an in-house taxonomy of Mathematics and Informatics built using Supreme Attestation Committee of Russia documentation (in Russian)

Page 10: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Applications II

Two AST-profiles: A – goodProfile A. Two variable logic on data trees and XML reasoning

AST-profile ACM-CCS index terms

ID TE ACM–CCS topic ID # ACM–CCS topic

H.2.3 0.4541 Languages H.2.3 0 Languages

I.1.3 0.4489 Languages and Systems

F.4.3 2 Formal Languages

F.4.3 0.3918 Formal Languages H.2.1 12 Logical Design

D.4.5 0.3049 Reliability F.4.1 27 Mathematical Logic

I.6.2 0.2578 Simulation Languages I.7.2 52 Document Preparation

Page 11: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Applications II

Two AST-profiles: B – poorProfile B. Lower bounds for processing data with few random accesses to external memory

AST-profile ACM-CCS index terms

ID TE ACM–CCS topic ID # ACM–CCS topic

H.2.8 0.4330 Database Applications F.1.3 160 Complexity Measures and Classes

H.2.5 0.2904 Heterogeneous Databases

H.2.4 165 Systems

C.5.1 0.2630 Large and Medium (``Mainframe'') Computers

F.1.1 219 Models of Computation

J.1 0.2115 ADMINISTRATIVE DATA PROCESSING

I.2.7 0.1870 Natural Language Processing

Page 12: Abstracting concepts from text documents by using an ontology E. Chernyak 1, O. Chugunova 1, J. Askarova 1, S. Nascimento 2, B. Mirkin 1,3 1 Division of

Conclusion

•Interpretation by producing profiles and lifting them in the taxonomy

•Issues

A. AST scoring – slow and noised

B. The taxonomies are not quite relevant

C. Penalty weights? (Future work: change the parsinomy criterion for that of the maximum likelihood)