Poster presented at the Research Fair 2009 at King's College (London).
Centre of Computing in the Humanities (CCH), King's College London
Information can be accessed using multiple access points that are meaningful for scholars in a specific field.
Access points to information in Classics
Print resources
● Table of Contents (TOC)
● Indexes (index of citations, index of Greek words, index of geographic places, index of names, etc.)
Electronic resources
● TOCs
● Access through search engines
● ?
* Usually provided only for monographs, because they are expensive to produce.
Goal
Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts.
Why automatic? Because automatic also means scalable when you are dealing with a huge quantity of data.
Information retrieval: the task of retrieving information (most of the time accomplished by using search engines).
Corpus of unstructured texts: collection of plain texts, without any kind of mark-up (such as XML).
Method
1. Reuse existing data resources containing structured information (such as gazetteers, authority lists, etc.) stored in different data formats (relational databases, XML files, etc.)
2. Apply Computational Linguistics and Natural Language Processing algorithms for information extraction
3. Use structured data as training data for the algorithms that “mine” the unstructured text corpus
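One way to picture step 3 is to project a gazetteer onto plain text so that every token receives a label, giving annotated examples that a learning algorithm could later be trained on. The sketch below is only an illustration under stated assumptions: the gazetteer entries, the BIO label scheme and the function names are hypothetical, and simple longest-match lookup stands in for whatever algorithms the project actually uses.

```python
import re

# Hypothetical gazetteer: in practice the entries would be loaded from an
# existing database or XML file rather than written by hand.
GAZETTEER = {
    ("athens",): "PLACE",
    ("asia", "minor"): "PLACE",
    ("homer",): "NAME",
}


def tokenize(text):
    """Split plain text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def label_tokens(tokens, gazetteer):
    """Assign BIO labels to tokens by longest match against the gazetteer."""
    max_len = max(len(key) for key in gazetteer)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                kind = gazetteer[key]
                labels[i] = "B-" + kind
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + kind
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


tokens = tokenize("Homer never mentions Athens or Asia Minor by chance.")
labelled = list(zip(tokens, label_tokens(tokens, GAZETTEER)))
```

Labels produced this way are noisy (an ambiguous name is always tagged the same way), which is why a later disambiguation step is still needed.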
CORPUS: Open Access collection of Classics journal papers
Expected results
● Automatically provide multiple meaningful entry points to information
● Enrich the corpus with links to navigate through resources
● Exploit extracted information to improve user access to the corpus
● Demonstrate the scalability of the approach
EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS
Matteo Romanello, [email protected]
Gone digital. What changed?
We are moving from books to e-books and from journals to e-journals, which we now use almost daily.
Has our way of accessing information actually changed with the use of digital tools?
Did only the format change, or are we being offered innovative ways of accessing information based on digital technologies?
The project at a glance
● PhD research project in Digital Humanities (DH)
● discipline: DH, Classics (Greek and Latin literature)
● topic: extracting structured information from a corpus of unstructured texts
HIDDEN WORD PUZZLE
To solve the puzzle, find the words in the grid, using the word list as a clue.
At the end you'll have added information to the initially chaotic picture.
Steps
1. Building the corpus (OCR, preprocessing)
2. Making the data sources interoperable (when the same entity E appears in DB1 and DB2, the information about E in DB1 has to be added to the information about E in DB2)
3. Finding in the corpus the mentions of REALIA (places, names, passages of works, etc.)
4. Disambiguating the mentions of REALIA
5. Automatic creation of new indices to the texts
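Steps 2 and 5 can be illustrated with a minimal sketch. It assumes, for the sake of the example, that both data sources already share an entity identifier (in practice, establishing that two records describe the same entity is the difficult part) and that mentions have already been found and disambiguated. The record fields, identifiers and function names are hypothetical.

```python
from collections import defaultdict


def merge_records(db1, db2):
    """Combine records from two sources keyed by a shared entity identifier."""
    merged = {}
    for key in db1.keys() | db2.keys():
        record = dict(db2.get(key, {}))
        # Information from db1 is added; values already present in db2 are kept.
        for field, value in db1.get(key, {}).items():
            record.setdefault(field, value)
        merged[key] = record
    return merged


def build_index(mentions):
    """Map each entity identifier to the sorted list of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entity_id in mentions:
        index[entity_id].add(doc_id)
    return {entity: sorted(docs) for entity, docs in index.items()}


# Hypothetical records for the same place held in two different sources.
db1 = {"athens": {"label": "Athens", "lat": 37.98}}
db2 = {"athens": {"label": "Athenai", "type": "city"}, "sparta": {"type": "city"}}
merged = merge_records(db1, db2)

# Hypothetical (document, entity) pairs produced by steps 3 and 4.
mentions = [
    ("paper1", "athens"),
    ("paper2", "athens"),
    ("paper1", "sparta"),
    ("paper2", "athens"),
]
index = build_index(mentions)
```

Here the resulting index plays the role of a printed index of places: each entity points to the papers in which it is mentioned, and repeated mentions in the same paper are collapsed into one entry.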