Giovanni Colavizza, Frédéric Kaplan
On Mining Citations to Primary and Secondary Sources in Historiography
1
Motivation - the Scholar
Sciences: • Google Scholar • English • Low-cost information gathering
Humanities: • no Google Scholar-like system • multiple languages • High-cost information gathering
Issue: lack of data [Sula and Miller, 2014] leads to an absence of services. Estimated coverage of the Humanities in Web of Science: circa 13% [Mingers and Leydesdorff, 2015].
2
Motivation - the Footnote
How do humanists cite? In footnotes [see e.g. Hellqvist, 2009]
3
Motivation - the Archive
Approximately half of citations are to primary sources [Wiberley Jr., 2009]
4
Project: Linked Books
In the context of the Venice Time Machine
Partners: • Ca’ Foscari Library System • Biblioteca Marciana • Istituto Veneto di Scienze, Lettere ed Arti • Archivio di Stato di Venezia • EPFL
5
Data acquisition
Corpus and annotation:
• 4 journals (for 150 years) • 2’000 monographs • digitisation almost over • ongoing annotation (samples from 1 journal and 1’000 monographs done, approx. 10’000 annotated citations)
6
Goals
1- Data: new citation corpus (History of Venice) and a pipeline for citation extraction from footnotes.
2- Analytical framework: development or adaptation of methods from bibliometrics to the humanities.
3- Services: a Google Scholar for the History of Venice accounting for citations to both primary and secondary sources.
7
Pipeline
Text block detection
Citation extraction
Citation parsing
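The data flow between the three stages can be sketched as below. This is only an illustration of how the stages chain together: the function names are hypothetical, and the regex rules inside them are simple stand-ins for the trained classifiers described on the following slides. The sample footnote is invented.

```python
import re
from typing import Dict, List, Optional


def detect_text_blocks(lines: List[str]) -> List[str]:
    """Stage 1 stand-in: keep lines that start with a note number."""
    return [ln for ln in lines if re.match(r"^\d+\s", ln)]


def extract_citations(footnote_lines: List[str]) -> List[str]:
    """Stage 2 stand-in: drop the note number, treat the rest as one citation."""
    return [re.sub(r"^\d+\s+", "", ln).strip() for ln in footnote_lines]


def parse_citation(citation: str) -> Dict[str, Optional[str]]:
    """Stage 3 stand-in: pull out a year and a page reference if present."""
    year = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", citation)
    pages = re.search(r"\bpp?\.\s*(\d+(?:-\d+)?)", citation)
    return {
        "raw": citation,
        "year": year.group(1) if year else None,
        "pages": pages.group(1) if pages else None,
    }


def run_pipeline(lines: List[str]) -> List[Dict[str, Optional[str]]]:
    """Chain the three stages: detection, extraction, parsing."""
    footnotes = detect_text_blocks(lines)
    citations = extract_citations(footnotes)
    return [parse_citation(c) for c in citations]


# Invented page content: one body line and one footnote line.
page = [
    "The Republic expanded its trade routes.",
    "12 A. Author, Some Title, Venezia 1982, p. 45.",
]
records = run_pipeline(page)
```

Each stage only sees the output of the previous one, which is why errors in text block detection (missed footnotes) propagate to the later stages.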
8
Text block detection
Method: SVM classifier. On what: text lines. Why? Citations are in footnotes, so keeping only footnote lines narrows the input space for the later stages.
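A minimal sketch of what line-level features for such a classifier might look like. The slides do not list the features actually used, so every feature below is an assumption chosen for illustration. The resulting dictionaries could, for example, be vectorised with scikit-learn's `DictVectorizer` and passed to `LinearSVC`; that step is left out here to keep the sketch dependency-free.

```python
import re
from typing import Dict

# Hypothetical cues that a line belongs to a footnote containing a citation.
PAGE_REF = re.compile(r"\bpp?\.\s*\d+")
YEAR = re.compile(r"\b1[0-9]{3}\b|\b20[0-9]{2}\b")


def line_features(line: str) -> Dict[str, float]:
    """Map one text line to a small numeric feature dictionary."""
    n_chars = max(len(line), 1)  # avoid division by zero on empty lines
    return {
        "starts_with_digit": float(line.lstrip()[:1].isdigit()),
        "has_page_ref": float(bool(PAGE_REF.search(line))),
        "has_year": float(bool(YEAR.search(line))),
        "punct_ratio": sum(ch in ",.;:()" for ch in line) / n_chars,
        "n_tokens": float(len(line.split())),
    }


footnote = line_features("3 Ivi, p. 37.")
body = line_features("The city grew quickly")
```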
9
Text block detection
Main causes of errors:
- partial citations (e.g. “Ivi., p. 37. some text”): FALSE NEGATIVE (critical)
- in-text shortened citations: FALSE POSITIVE (not critical)
- footnotes without citations: FALSE NEGATIVE (not critical)
Next steps: finalise with extra features, layout detection on images.
10
Citation extraction
Method: CRF classifier. On what: words. Why? Citations need to be identified and separated from the surrounding text. Citations are classified as primary or secondary.
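Word-level labelling is commonly turned into citation spans with a begin/inside/outside (BIO) scheme. The sketch below assumes such a scheme with hypothetical labels `B-PRIMARY`, `I-PRIMARY`, `B-SECONDARY`, `I-SECONDARY` and `O`; the slides do not state the actual label set, and the example footnote is invented. It shows only the decoding step that follows a sequence labeller, not the CRF itself.

```python
from typing import List, Tuple


def decode_citations(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled words into (class, citation text) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_class = None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A new citation starts: close the one in progress, if any.
            if current:
                spans.append((current_class, " ".join(current)))
            current, current_class = [word], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_class:
            current.append(word)
        else:
            # "O", or an "I-" label with no matching open span, ends the citation.
            if current:
                spans.append((current_class, " ".join(current)))
            current, current_class = [], None
    if current:
        spans.append((current_class, " ".join(current)))
    return spans


words = ["See", "ASV,", "Senato,", "reg.", "12;", "Rossi,", "Storia,", "p.", "4."]
labels = ["O", "B-PRIMARY", "I-PRIMARY", "I-PRIMARY", "I-PRIMARY",
          "B-SECONDARY", "I-SECONDARY", "I-SECONDARY", "I-SECONDARY"]
citations = decode_citations(words, labels)
```

With this scheme a single misplaced `B-` or `O` label shifts a citation boundary, which is consistent with boundaries being the critical error type for this stage.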
11
Citation extraction
Main causes of error:
- Wrong boundaries (instance accuracy 0.78, critical)
- Wrong class (Primary-Secondary, not critical)
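The slides do not define "instance accuracy". One plausible reading, assumed here, is the fraction of annotated citations whose predicted boundaries match the gold boundaries exactly. The sketch below computes that quantity on invented token offsets.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end exclusive


def instance_accuracy(gold: List[Span], predicted: List[Span]) -> float:
    """Share of gold citation spans reproduced exactly by the prediction.

    Assumed definition: a citation counts as correct only if both
    boundaries match; partial overlaps count as errors.
    """
    if not gold:
        return 0.0
    predicted_set = set(predicted)
    hits = sum(1 for span in gold if span in predicted_set)
    return hits / len(gold)


gold_spans = [(0, 5), (6, 12), (13, 20), (21, 30)]
pred_spans = [(0, 5), (6, 11), (13, 20), (21, 30)]  # second span ends one token early
score = instance_accuracy(gold_spans, pred_spans)
```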
Next steps: finalise with extra features, add rules.
12
Citation parsing
Method: CRF classifier. On what: words. Why? The elements of a citation need to be identified in order to identify the cited source.
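Once each word of a citation carries a field label, the labelled words can be grouped into a structured record. The sketch below assumes hypothetical field labels (`AUTHOR`, `TITLE`, `YEAR`, `PAGES`, and `O` for words outside any field) and an invented citation; the actual class set used in the project is not given on the slides.

```python
from collections import defaultdict
from typing import Dict, List


def group_fields(words: List[str], labels: List[str]) -> Dict[str, str]:
    """Collect words by field label into one string per field."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for word, label in zip(words, labels):
        if label != "O":
            # Drop separator punctuation attached to the word.
            fields[label].append(word.strip(",.;"))
    return {name: " ".join(tokens) for name, tokens in fields.items()}


words = ["Author,", "Some", "Title,", "1982,", "p.", "45."]
labels = ["AUTHOR", "TITLE", "TITLE", "YEAR", "O", "PAGES"]
record = group_fields(words, labels)
```

A record of this shape is what later normalisation and linkage steps would work from.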
13
Citation parsing
Main causes of error:
- Wrong classes (e.g. Editor mistaken for Author; not critical: reduce class space)
- Sensitive to under-represented classes (e.g. archival terminology, not critical)
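Reducing the class space can be illustrated by merging easily confused labels before scoring. The mapping below (Editor folded into Author) follows the example on the slide; the label names and the per-word accuracy function are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical merge table: confusable classes are folded into one.
LABEL_MERGE: Dict[str, str] = {"EDITOR": "AUTHOR"}


def simplify_labels(labels: List[str]) -> List[str]:
    """Replace each label by its merged class, leaving others unchanged."""
    return [LABEL_MERGE.get(label, label) for label in labels]


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of words whose predicted label equals the gold label."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


gold = ["EDITOR", "TITLE"]
pred = ["AUTHOR", "TITLE"]
before = token_accuracy(gold, pred)
after = token_accuracy(simplify_labels(gold), simplify_labels(pred))
```

The Editor/Author confusion counts as an error with the full class set and disappears once the two classes are merged, at the cost of losing that distinction in the output.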
Next steps: finalise with extra features, simplify class space, add rules, lookup features.
14
What’s next
Towards a pipeline for citation extraction:
1. Normalisation 2. Linkage 3. Database as linked data 4. Publication
15
Giovanni Colavizza, Frédéric Kaplan
Thank you
16