
[poster] Extracting Information From Classics Scholarly Texts


Poster presented at the Research Fair 2009 at King's College (London).


Centre for Computing in the Humanities (CCH), King's College London

Information can be accessed using multiple access points that are meaningful for scholars in a specific field.

Access points to information in Classics

Print resources

● Table of Contents (TOC)

● Indexes* (index of citations, index of Greek words, index of geographic places, index of names, etc.)

Electronic resources

● TOCs

● Access through search engines

● ?

* Usually provided only for monographs because they are expensive to produce

Goal

Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts.

Why automatic? Because automatic also means scalable when dealing with a huge quantity of data.

Information retrieval: the task of retrieving information (most often accomplished by using search engines).

Corpus of unstructured texts: collection of plain texts, without any kind of mark-up (such as XML).

Method

1. Reuse existing data resources containing structured information (such as gazetteers, authority lists, etc.) stored in different data formats (relational databases, XML files, etc.)

2. Apply Computational Linguistics and Natural Language Processing algorithms for information extraction

3. Use the structured data as training data for the algorithms that “mine” the unstructured text corpus
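One way to picture point 3 is to project gazetteer entries onto plain text so that every token receives a label a sequence tagger could learn from. The sketch below is only an illustration under stated assumptions: the gazetteer entries, the BIO labelling scheme, and the function names are hypothetical and are not taken from the project itself.

```python
import re

# Hypothetical gazetteer: surface form -> entity type (illustrative entries only).
GAZETTEER = {
    "Athens": "PLACE",
    "Homer": "PERSON",
    "Asia Minor": "PLACE",
}

def tokenize(text):
    """Split plain text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def label_tokens(tokens, gazetteer):
    """Assign BIO labels by longest-match lookup against the gazetteer."""
    entries = {tuple(name.split()): etype for name, etype in gazetteer.items()}
    max_len = max(len(key) for key in entries)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            etype = entries.get(tuple(tokens[i:i + n]))
            if etype:
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
                i += n
                break
        else:
            # No gazetteer entry starts at this token.
            i += 1
    return labels

tokens = tokenize("Homer was read in Athens and in Asia Minor.")
labeled = list(zip(tokens, label_tokens(tokens, GAZETTEER)))
```

Token and label pairs of this kind could serve as training data for a statistical tagger. Exact string matching is a simplification: it ignores inflected forms, spelling variants, and ambiguous names.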

CORPUS: Open Access collection of Classics journal papers

Expected results

● Automatically provide multiple meaningful entry points to information

● Enrich the corpus with links to navigate through resources

● Exploit the extracted information to improve user access to the corpus

● Demonstrate the scalability of the approach

EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS

Matteo Romanello

Gone digital. What changed?

We are moving from books to e-books and from journals to e-journals, which we now use almost daily.

Has our way of accessing information actually changed with the use of digital tools?

Did only the format change, or do digital technologies give us innovative ways of accessing information?

The project at a glance

● PhD research project in Digital Humanities (DH)

● Discipline: DH, Classics (Greek and Latin literature)

● Topic: extracting structured information from a corpus of unstructured texts

HIDDEN WORD PUZZLE

To solve the puzzle, find the words in the grid by using a word list as a clue.

At the end you'll have added information to the initially chaotic picture.

Steps

1. Building the corpus (OCR, preprocessing)

2. Making the data sources interoperable (when the same entity E appears in DB1 and DB2, the information about E in DB1 has to be added to the information about E in DB2)

3. Finding the mentions of REALIA (places, names, passages of works, etc.) in the corpus

4. Disambiguating the mentions of REALIA

5. Automatic creation of new indices to the texts
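Steps 2, 3, and 5 can be sketched in a few lines of code. This is a minimal illustration, assuming each data source is a dictionary keyed by a shared entity identifier and that mentions are found by exact name matching. The sample records, identifiers, and function names are hypothetical. Step 4 (disambiguation) is left out on purpose, since it needs context-sensitive methods that a short example cannot represent fairly.

```python
import re
from collections import defaultdict

def merge_records(db1, db2):
    """Step 2: combine the information held about each entity in two sources."""
    merged = {}
    for entity_id in db1.keys() | db2.keys():
        record = dict(db2.get(entity_id, {}))
        # Values from db1 fill only the fields that db2 lacks.
        for field, value in db1.get(entity_id, {}).items():
            record.setdefault(field, value)
        merged[entity_id] = record
    return merged

def find_mentions(text, records):
    """Step 3: locate every exact occurrence of a known name in the text."""
    mentions = []
    for entity_id, record in records.items():
        pattern = r"\b" + re.escape(record["name"]) + r"\b"
        for match in re.finditer(pattern, text):
            mentions.append((entity_id, match.start()))
    return sorted(mentions, key=lambda mention: mention[1])

def build_index(documents, records):
    """Step 5: map each entity to the documents and offsets where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for entity_id, offset in find_mentions(text, records):
            index[entity_id].append((doc_id, offset))
    return dict(index)

# Hypothetical sample data standing in for DB1, DB2, and the corpus.
db1 = {"e1": {"name": "Athens", "type": "place"}}
db2 = {"e1": {"name": "Athens", "region": "Attica"}, "e2": {"name": "Sparta"}}
documents = {
    "paper1": "Athens and Sparta were rivals.",
    "paper2": "Trade reached Athens.",
}

records = merge_records(db1, db2)
index = build_index(documents, records)
```

Here `index` plays the role of an automatically generated index of places: each entity points to the papers, and the character offsets within them, where it is mentioned.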
