
The Nature of Digitally-Produced Data: Towards Social-Scientific Tool Criticism




Jacco van Ossenbruggen, Laura Hollink, Myriam C. Traub

“Really? Do you know the limitations of your tools and their impact on your research?”

“Of course, I only use unbiased tools on unbiased data for my research!”

Today: Biases in the source data are unknown because the data was not created for scientific purposes. Biases in the tool chain are unknown because we use black-box computational workflows and tools.

We need a systematic fit-for-purpose assessment for digital tools and data, starting with the ability to measure and quantify technology-induced bias.

Do you know THAT…

… even simple word count tools return different counts for the same text? (A sketch demonstrating this follows the list.)
… most search engines are biased on document length?
… the version of the Google Ngram Viewer corpus may impact your trend analysis?
… OCR performs better on expensive newspapers targeting the social elite than on titles printed on cheaper paper?
… the Twitter Streaming API does not serve a random sample of the complete “firehose”?
… the performance of predictive models on new data cannot be predicted, not even by the developers of the models?
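As a quick illustration of the first item, here is a minimal Python sketch (the sample sentence is invented) showing how two common tokenization strategies disagree on the count for the same text: whitespace splitting, as in Unix `wc -w`, versus regex word matching, which treats hyphens and apostrophes as separators.

```python
import re

text = "State-of-the-art OCR isn't perfect -- e.g., l'étoile."

# Strategy 1: split on whitespace (what `wc -w` does).
whitespace_count = len(text.split())

# Strategy 2: count alphanumeric word tokens; hyphens,
# apostrophes, and other punctuation act as separators.
regex_count = len(re.findall(r"\w+", text))

print(whitespace_count)  # 7
print(regex_count)       # 12
```

Neither count is wrong; each encodes a different definition of a word, which is exactly the kind of hidden assumption a fit-for-purpose assessment should surface.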

Traub, Myriam C., and Jacco van Ossenbruggen. Workshop on Tool Criticism in the Digital Humanities, Amsterdam, 22 May 2015.

If you know OF OTHER examples of technology-induced bias: please tweet! #toolcrit

The Nature of Digitally-Produced Data: Towards Social-Scientific Tool Criticism

1980: Bick and Müller observe that years of experience with scientific data collection methods have informed us about each method’s limitations with respect to validity and representativeness. They claim, however, that such knowledge is still largely missing for “new” research methods based on data that has been produced for purposes other than scientific inquiry.

Tool criticism

Scholars need to assume a critical attitude towards the use of tools and perform a systematic fit-for-purpose assessment.

Tool makers need to publish the source code of the tools and document their requirements and functionalities.

Data providers need to make provenance information available.

Data scientists need to develop quality metrics that measure technology-induced bias and take the context and requirements of research tasks into account (a minimal sketch of one such metric follows below).
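To make the last recommendation concrete, here is a minimal sketch of what such a quality metric could look like: character error rate (CER) computed against a ground-truth transcription, applied separately to two invented OCR samples to quantify the OCR quality gap mentioned on the poster. Both the sample strings and the choice of CER are illustrative assumptions, not a method proposed by the authors.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-based dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(ocr: str, truth: str) -> float:
    """CER: edit distance normalized by the length of the ground truth."""
    return levenshtein(ocr, truth) / max(len(truth), 1)

truth = "The shipping news of Tuesday"

# Hypothetical OCR output for a well-printed title vs. a cheap-paper title.
samples = {
    "elite_paper": "The shipping news of Tuesday",
    "cheap_paper": "Tlie shlpplng ncws of Tuesclay",
}
for name, ocr in samples.items():
    print(f"{name}: CER = {char_error_rate(ocr, truth):.2f}")
```

A metric like this only supports a fit-for-purpose assessment once it is reported per subcorpus (here, per newspaper type) rather than as a single corpus-wide average, because an average hides exactly the bias the poster warns about.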

Bick, Wolfgang, and Paul J. Müller. "The Nature of Process-Produced Data: Towards a Social-Scientific Source Criticism." Historical Social Research: The Use of Historical and Process-Produced Data (1980).

Amsterdam Data Science