On Mining Citations to Primary and Secondary Sources in Historiography

Giovanni Colavizza, Frédéric Kaplan

Motivation - the Scholar

Sciences: Google Scholar

English. Low-cost information gathering.

Humanities: no Google Scholar-like system

Multiple languages. High-cost information gathering.

Issue: the lack of data [Sula and Miller, 2014] leads to an absence of services. The estimated coverage of Web of Science for the Humanities is circa 13% [Mingers and Leydesdorff, 2015].

Motivation - the Footnote

How do humanists cite? In footnotes [see e.g. Hellqvist, 2009]

Motivation - the Archive

Approximately half of all citations are to primary sources [Wiberley Jr., 2009]

Project: Linked Books

In the context of the Venice Time Machine

Partners:
• Ca’ Foscari Library System
• Biblioteca Marciana
• Istituto Veneto di Scienze, Lettere ed Arti
• Archivio di Stato di Venezia
• EPFL

Data acquisition

Corpus and annotation:

• 4 journals (for 150 years)
• 2000 monographs
• digitisation almost over
• ongoing annotation (samples from 1 journal and 1’000 monographs done, approx. 10’000 annotated citations)

1- Data: new citation corpus (History of Venice) and a pipeline for citation extraction from footnotes.

2- Analytical framework: development or adaptation of methods from bibliometrics to the humanities.

3- Services: a Google Scholar for the History of Venice accounting for citations to both primary and secondary sources.

Pipeline

Text block detection

Citation extraction

Citation parsing

Method: SVM classifier. On what: text lines. Why? Citations are in footnotes, so classifying text lines first filters the input space for the later steps.
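As an illustration of what classifying text lines can involve, here is a minimal sketch of line-level feature extraction. The slides do not list the features actually used, so every feature, the regular expressions and the two sample lines are assumptions made for this example; the resulting vectors are the kind of input a line classifier such as an SVM could be trained on.

```python
import re

# Hypothetical cue patterns (assumptions, not the project's feature set).
YEAR = re.compile(r"\b1[0-9]{3}\b|\b20[0-9]{2}\b")
PAGE = re.compile(r"\bpp?\.\s*\d+", re.IGNORECASE)
NOTE_MARK = re.compile(r"^\s*\(?\d{1,3}[).]?\s")


def line_features(line):
    """Map one text line to numeric features a line classifier could use."""
    n_chars = max(len(line), 1)
    return {
        "n_tokens": len(line.split()),
        "digit_ratio": sum(c.isdigit() for c in line) / n_chars,
        "punct_ratio": sum(c in ",.;:()" for c in line) / n_chars,
        "has_year": int(bool(YEAR.search(line))),
        "has_page_ref": int(bool(PAGE.search(line))),
        "starts_with_note_mark": int(bool(NOTE_MARK.match(line))),
    }


# Invented sample lines: one body-text line, one footnote-like line.
body = "The Senate debated the matter for several weeks before voting."
note = "12 Ivi, p. 37; cfr. anche ASV, Senato, reg. 12, c. 4 (1543)."

print(line_features(body))
print(line_features(note))
```

Footnote lines tend to be dense in digits, punctuation and page references, which is why such surface cues are plausible candidates; layout information from the page images would be a separate source of features.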

Main causes of errors:
- partial citations (e.g. “Ivi., p. 37. some text”): FALSE NEGATIVE (critical)
- in-text shortened citations: FALSE POSITIVE (not critical)
- footnotes without citations: FALSE NEGATIVE (not critical)

Next steps: finalise with extra features, layout detection on images.

Citation extraction

Method: CRF classifier. On what: words. Why? Citations need to be identified and separated from the surrounding text. Citations are classified as primary or secondary.
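A word-level sequence labeller outputs one label per token, and those labels still have to be grouped into citations. The sketch below assumes a BIO encoding (B- starts a citation, I- continues it, O is outside) with PRIMARY and SECONDARY classes; the slides do not state the label scheme, and the tokens and labels are invented for illustration.

```python
def decode_citations(tokens, labels):
    """Group BIO-labelled tokens into (class, citation text) pairs."""
    citations = []
    current = None  # [class, [tokens]] for the citation being built
    for token, label in zip(tokens, labels):
        prefix, _, cls = label.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != cls)
        )
        if starts_new:
            # A stray or class-changing I- tag is treated as a new start.
            if current is not None:
                citations.append((current[0], " ".join(current[1])))
            current = [cls, [token]]
        elif prefix == "I":
            current[1].append(token)
        else:
            # "O": the token is outside any citation.
            if current is not None:
                citations.append((current[0], " ".join(current[1])))
            current = None
    if current is not None:
        citations.append((current[0], " ".join(current[1])))
    return citations


# Invented footnote tokens with hypothetical predicted labels.
tokens = ["See", "ASV,", "Senato,", "reg.", "12", "and",
          "Rossi,", "Storia,", "p.", "37."]
labels = ["O", "B-PRIMARY", "I-PRIMARY", "I-PRIMARY", "I-PRIMARY", "O",
          "B-SECONDARY", "I-SECONDARY", "I-SECONDARY", "I-SECONDARY"]

for cls, text in decode_citations(tokens, labels):
    print(cls, "|", text)
```

With this encoding a single mislabelled boundary token changes where a citation starts or ends, which is one way boundary errors can arise.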

Citation extraction

Main causes of error:
- Wrong boundaries (instance accuracy 0.78, critical)
- Wrong class (Primary-Secondary, not critical)
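The slides do not define instance accuracy. One plausible reading, assumed here, is the share of annotated citations whose start and end boundaries are both predicted exactly. The sketch below computes that quantity on invented token offsets.

```python
def instance_accuracy(gold_spans, predicted_spans):
    """Share of gold (start, end) spans that are predicted exactly."""
    if not gold_spans:
        return 0.0
    predicted = set(predicted_spans)
    hits = sum(1 for span in gold_spans if span in predicted)
    return hits / len(gold_spans)


# Invented spans as (start_token, end_token) pairs, end exclusive.
gold = [(1, 5), (6, 10), (14, 20), (25, 31)]
pred = [(1, 5), (6, 9), (14, 20), (25, 31)]  # second span ends one token early

print(instance_accuracy(gold, pred))  # 0.75
```

Under this definition a citation that is off by a single token counts as a full miss, so the score is stricter than token-level accuracy.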

Next steps: finalise with extra features, add rules.

Citation parsing

Method: CRF classifier. On what: words. Why? The elements of a citation need to be identified in order to identify the cited source.

Citation parsing

Main causes of error:
- Wrong classes (e.g. Editor mistaken for Author; not critical: reduce the class space)
- Sensitive to under-represented classes (e.g. archival terminology; not critical)

Next steps: finalise with extra features, simplify class space, add rules, lookup features.
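As a hedged illustration of what lookup features could look like, the sketch below checks each token against small word lists and exposes the result as boolean features. Both lists, the feature names and the sample citations are hypothetical; real lists would have to come from catalogues and archival finding aids.

```python
# Hypothetical gazetteers (assumptions made for this example).
ARCHIVE_TERMS = {"asv", "b.", "busta", "reg.", "registro", "c.", "filza"}
EDITOR_MARKERS = {"ed.", "eds.", "cur.", "a cura di"}


def _norm(token):
    """Lowercase a token and strip surrounding brackets and punctuation."""
    return token.lower().strip("()[],;:")


def lookup_features(tokens, i, window=3):
    """Lookup-based features for tokens[i], using a +/- window of context."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    # Pad with spaces so markers only match whole tokens.
    context = " " + " ".join(_norm(t) for t in tokens[lo:hi]) + " "
    return {
        "in_archive_gazetteer": _norm(tokens[i]) in ARCHIVE_TERMS,
        "near_editor_marker": any(f" {m} " in context for m in EDITOR_MARKERS),
    }


# Invented citations: a secondary source with an editor, an archival reference.
secondary = ["Rossi,", "G.", "(ed.),", "Storia", "di", "Venezia"]
primary = ["ASV,", "Senato,", "reg.", "12"]

print(lookup_features(secondary, 0))
print(lookup_features(primary, 2))
```

A feature like `near_editor_marker` targets the Editor versus Author confusion, while the archival list gives the model some evidence for terms that are rare in the training data.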

What’s next

Towards a pipeline for citation extraction:

1. Normalisation
2. Linkage
3. Database as linked data
4. Publication
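The first two steps can be sketched under strong simplifying assumptions: normalisation reduces a citation string to a canonical key, and linkage groups citations that share a key. The slides do not describe the planned methods, so the rules below (accent stripping, lowercasing, punctuation removal, exact key match) and the sample strings are only illustrative.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(citation):
    """Reduce a citation string to a lowercase, accent-free, unpunctuated key."""
    text = unicodedata.normalize("NFKD", citation)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def link(citations):
    """Group citation strings that share the same normalised key."""
    groups = defaultdict(list)
    for citation in citations:
        groups[normalise(citation)].append(citation)
    return dict(groups)


# Invented citation strings; the first two differ only in case and punctuation.
refs = [
    "Archivio di Stato di Venezia, Senato, reg. 12",
    "ARCHIVIO DI STATO DI VENEZIA,  Senato, reg. 12.",
    "Biblioteca Marciana, Cod. It. VII",
]

for key, members in link(refs).items():
    print(key, "->", len(members))
```

Exact key matching only catches surface variants. Abbreviated references such as "Ivi" would need the context of earlier footnotes to be resolved, which this sketch does not attempt.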


Thank you
