On Mining Citations to Primary and Secondary Sources in Historiography

Giovanni Colavizza, Frédéric Kaplan


Motivation - the Scholar

Sciences:
• Google Scholar
• English
• Low-cost information gathering

Humanities:
• no Google Scholar-like system
• multiple languages
• High-cost information gathering

Issue: the lack of data [Sula and Miller, 2014] leads to an absence of services. Estimated coverage of the Humanities in Web of Science: circa 13% [Mingers and Leydesdorff, 2015].


Motivation - the Footnote

How do humanists cite? In footnotes [see e.g. Hellqvist, 2009]


Motivation - the Archive

Approximately half of the citations are to primary sources [Wiberley Jr., 2009]


Project: Linked Books

In the context of the Venice Time Machine

Partners:
• Ca’ Foscari Library System
• Biblioteca Marciana
• Istituto Veneto di Scienze, Lettere ed Arti
• Archivio di Stato di Venezia
• EPFL


Data acquisition

Corpus and annotation:
• 4 journals (spanning 150 years)
• 2’000 monographs
• digitisation almost complete
• annotation ongoing (samples from 1 journal and 1’000 monographs done, approx. 10’000 annotated citations)


Goals

1. Data: new citation corpus (History of Venice) and a pipeline for citation extraction from footnotes.

2. Analytical framework: development or adaptation of methods from bibliometrics to the humanities.

3. Services: a Google Scholar for the History of Venice accounting for citations to both primary and secondary sources.


Pipeline

1. Text block detection
2. Citation extraction
3. Citation parsing


Text block detection

Method: SVM classifier. On what: text lines. Why? Citations are in footnotes, so detecting footnote lines filters the input space.
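A line classifier needs each text line as a feature vector first. The sketch below shows one possible way to do that. The feature names and regular expressions are assumptions for illustration, not the feature set used in the project. The resulting dictionaries could be vectorised and passed to any SVM implementation (for example scikit-learn's `DictVectorizer` and `LinearSVC`).

```python
import re

def line_features(line: str) -> dict:
    """Hypothetical features for one text line (not the authors' feature set)."""
    text = line.strip()
    n = max(len(text), 1)  # avoid division by zero on empty lines
    return {
        # Footnote lines tend to carry more digits (years, pages, volumes).
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        # Citations are dense in punctuation.
        "punct_ratio": sum(c in ".,;:()" for c in text) / n,
        # A leading note number such as "12 " or "12." or "12)".
        "starts_with_note_number": bool(re.match(r"\d+[\s.)]", text)),
        # A page reference such as "p. 37" or "pp. 3".
        "has_page_ref": bool(re.search(r"\bpp?\.\s*\d", text)),
    }

# Invented example lines, for illustration only.
footnote_line = "12 Ivi, p. 37."
body_line = "The city expanded its trade routes."

print(line_features(footnote_line))
print(line_features(body_line))
```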


Text block detection

Main causes of errors:
- partial citations (e.g. “Ivi., p. 37. some text”): FALSE NEGATIVE (critical)
- in-text shortened citations: FALSE POSITIVE (not critical)
- footnotes without citations: FALSE NEGATIVE (not critical)

Next steps: finalise with extra features, layout detection on images.


Citation extraction

Method: CRF classifier. On what: words. Why? Citations need to be identified and separated from the surrounding text. Citations are classified as primary or secondary.
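Word-level tagging only becomes useful once the tags are grouped back into citation spans. The sketch below assumes a BIO tagging scheme with the hypothetical labels `B-PRIMARY`, `I-PRIMARY`, `B-SECONDARY`, `I-SECONDARY` and `O`. The scheme and the example footnote are assumptions for illustration; the slides do not specify the label set.

```python
def extract_citations(tokens, tags):
    """Group word-level BIO tags into (class, text) citation spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new citation starts: close the open one, if any.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the open citation.
            current.append(token)
        else:
            # "O" or an inconsistent "I-" tag: close the open citation
            # and treat this token as plain text.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

# Invented footnote, for illustration only.
tokens = ["See", "ASVe,", "Senato,", "reg.", "12;", "also",
          "Rossi,", "Storia,", "p.", "37."]
tags = ["O", "B-PRIMARY", "I-PRIMARY", "I-PRIMARY", "I-PRIMARY", "O",
        "B-SECONDARY", "I-SECONDARY", "I-SECONDARY", "I-SECONDARY"]

print(extract_citations(tokens, tags))
```

Keeping the class in the tag itself lets a single tagger handle both boundary detection and the primary/secondary decision, which matches the two error types reported for this step.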


Citation extraction

Main causes of error:
- Wrong boundaries (instance accuracy 0.78, critical)
- Wrong class (Primary-Secondary, not critical)

Next steps: finalise with extra features, add rules.
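The slides do not define instance accuracy. A minimal sketch follows, assuming it means the share of annotated citations whose start and end boundaries are both recovered exactly. Spans are hypothetical `(start, end)` word offsets.

```python
def instance_accuracy(gold_spans, predicted_spans):
    """Share of gold citations whose (start, end) boundaries match exactly."""
    if not gold_spans:
        return 0.0
    predicted = set(predicted_spans)
    hits = sum(1 for span in gold_spans if span in predicted)
    return hits / len(gold_spans)

# Invented spans: the second citation is cut one word short.
gold = [(1, 5), (6, 10), (12, 15), (20, 24)]
pred = [(1, 5), (6, 9), (12, 15), (20, 24)]

print(instance_accuracy(gold, pred))  # 0.75
```

Under this definition a single misplaced word makes the whole citation count as wrong, which is why boundary errors are the critical ones here.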


Citation parsing

Method: CRF classifier. On what: words. Why? The elements of a citation need to be identified in order to identify the cited source.
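A CRF over words is usually fed one feature dictionary per word, including some context from its neighbours. The sketch below illustrates the idea. The feature names are assumptions, not the project's feature set, and the example citation is invented.

```python
def word_features(tokens, i):
    """Hypothetical CRF features for the word at position i."""
    token = tokens[i]
    core = token.strip(".,;:()")  # the word without surrounding punctuation
    return {
        "lower": core.lower(),
        "is_capitalised": core[:1].isupper(),
        "is_number": core.isdigit(),
        "is_year": core.isdigit() and len(core) == 4,
        "ends_with_punct": token.endswith((".", ",", ";", ":")),
        # Neighbouring words give the model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

# Invented citation, for illustration only.
tokens = ["Rossi,", "Storia,", "Venezia,", "1950,", "p.", "37."]
features = [word_features(tokens, i) for i in range(len(tokens))]

print(features[3])
```

A list of such dictionaries per citation is the input shape expected by common CRF toolkits; which toolkit the project used is not stated in the slides.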


Citation parsing

Main causes of error:
- Wrong classes (e.g. Editor mistaken for Author, not critical: reduce class space)
- Sensitive to under-represented classes (e.g. archival terminology, not critical)

Next steps: finalise with extra features, simplify class space, add rules, lookup features.
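One way to simplify the class space is to merge labels that the model confuses and that matter little for identifying the source. The sketch below merges Author and Editor into a single class. The label names and the merged `CONTRIBUTOR` class are assumptions for illustration; the slides do not say which classes would be merged.

```python
# Hypothetical mapping from fine-grained to coarse labels.
COARSE_LABELS = {
    "AUTHOR": "CONTRIBUTOR",
    "EDITOR": "CONTRIBUTOR",
}

def simplify_labels(tags):
    """Replace fine-grained labels by their coarse class, keep the rest."""
    return [COARSE_LABELS.get(tag, tag) for tag in tags]

print(simplify_labels(["AUTHOR", "TITLE", "EDITOR", "YEAR"]))
```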


What’s next

Towards a pipeline for citation extraction:

1. Normalisation
2. Linkage
3. Database as linked data
4. Publication


Giovanni Colavizza, Frédéric Kaplan

Thank you