Original Research Article

Big Data, data integrity, and the fracturing of the control zone

Carl Lagoze

Abstract

Despite all the attention to Big Data and the claims that it represents a “paradigm shift” in science, we lack understanding of which qualities of Big Data may contribute to this revolutionary impact. In this paper, we look beyond the quantitative aspects of Big Data (i.e. lots of data) and examine it from a sociotechnical perspective. We argue that a key factor that distinguishes “Big Data” from “lots of data” lies in changes to the traditional, well-established “control zones” that facilitated clear provenance of scientific data, thereby ensuring data integrity and providing the foundation for credible science. The breakdown of these control zones is a consequence of the manner in which our network technology and culture enable and encourage open, anonymous sharing of information, participation regardless of expertise, and collaboration across geographic, disciplinary, and institutional barriers. We are left with the conundrum: how to reap the benefits of Big Data while re-creating a trust fabric and an accountable chain of responsibility that make credible science possible.

Keywords

Big Data, control zone, paradigm shift, sociotechnical

Big Data is not only about being big

The popular and scholarly literature is filled with excitement about Big Data. A good deal of the enthusiasm comes from the business sector, where Big Data offers new possibilities for direct and micro marketing, supply-chain optimization, and other means of increasing efficiency and profits. This enthusiasm has also spread to the public sector, particularly in the areas of security and terrorism prevention. In this paper, we examine the impact of Big Data in the context of science,¹ encompassing the research that takes place in the academic, corporate, and government milieu. Admittedly, the line between commercial research (distinguished from corporate research such as that which takes place at IBM Watson) and scientific research can be fuzzy, but we distinguish the former as motivated by financial concerns (e.g. product improvement for profit improvement), whereas the latter is motivated by the search for some “truth”. Some argue that Big Data represents a new paradigm of science, a “fourth paradigm” (Hey et al., 2009), adopting the terminology used by Kuhn (1970) to characterize the revolutionary transformation of a scientific field.² While many view this new paradigm as complementary rather than substitutive to pre-existing paradigms (observation, experimentation, and simulation), others like Chris Anderson have taken a more extreme view, claiming that Big Data represents the “end of theory” (Anderson, 2008).

Our goal in this paper is to pull back from the hype and take a more measured, analytical approach to Big Data, focusing on the question “what are the characteristics of (some) Big Data that manifest a paradigm shift in the fundamental assumptions of science”? We distinguish between Big Data characteristics that have methodological consequences and those that impact epistemological foundations. We characterize the former as important but not paradigm-shifting. In contrast, we argue that a paradigm shift is indeed evident when Big Data impacts epistemological foundations.

University of Michigan, School of Information, Ann Arbor, MI, USA

Corresponding author:

Carl Lagoze, University of Michigan, 105 S. State Street, Ann Arbor, MI 48103, USA.

Email: [email protected]

Big Data & Society

July–December 2014: 1–11

© The Author(s) 2014

DOI: 10.1177/2053951714558281

bds.sagepub.com

Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (http://www.uk.sagepub.com/aboutus/openaccess.htm).

Embedded in this argument is the assumption that the characteristics we are looking for are not native to all uses of big (in size) data. And, in fact, it may be true that data that is not necessarily quantitatively large may have characteristics that are paradigm-shifting when used in certain contexts and by certain communities of use.

Before proceeding any further with an analysis of “big” as a qualifying characteristic of (some) “data”, it is important to establish a definition of “data”, whether big or small. A National Academies report (A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases, 1999) provides a simple and inclusive foundation definition: “data are artifacts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.” Although this definition is useful, it fails to capture the “relative” nature of data (in contrast to it having an “essential” nature). As Borgman (2011) states: “[d]ata may exist only in the eyes of the holder: the recognition that an observation, artifact, or record constitute data is itself a scholarly act.” This perspective of data is reflexive, in that something (e.g. images, text, an Excel worksheet, etc.) is data because someone uses it as data in a specific context, and transcendent, in that it carries across the many disciplines, practices, and epistemologies of science.

This relational/contextual perspective gives us the basis for examining the Big Data phenomenon in a manner that both crosses epistemological boundaries and is contextualized by them. With due recognition of the dangers of making generalizations about “science”, we hope to establish some fundamental aspects of Big Data that are indeed boundary crossing, while remaining shaped by (and shaping) specific disciplinary practices.

Having established this relativistic definition of data, we return to the notion that “Big Data is not only about being big”; that there is some combination of features or dimensions (perhaps among them size) that may have revolutionary effects on science and knowledge production. This multidimensional perspective is evident in many of the popularized, mass-market descriptions of Big Data.

One popular multidimensional definition of Big Data is based on the so-called 3Vs: Volume, Velocity, and Variety (Laney, 2001). Volume is the size factor. Velocity refers to the speed of accumulation, the resulting dynamic nature of the data, and the high-scale processing capacity needed to make it useful and keep it current. Finally, Variety refers to the mixing together, or mashing-up, of heterogeneous data types, models, and schema and the need to resolve these differences in order to make the data useful. Others have enhanced this list with additional “Vs”: Validity, the amount of bias or noise in the data; Veracity, the correctness and accuracy of the data; and Volatility, the persistence and longevity of data (Normandeau, 2013), the first two of which, Validity and Veracity, are of particular interest to the argument of this paper.

Mayer-Schönberger and Cukier in their best-selling book Big Data offer an alternative but complementary set of characteristics of Big Data, which they claim “challenges the way we live and interact with the world” (Mayer-Schönberger, 2013, p. 6). They characterize Big Data as revolutionary because it enables/embodies “three shifts [characteristics] in the way we analyze information and transform how we understand and organize society.” The first is the “more” characteristic, which they posit as the foundation for the two other characteristics. A notable aspect of “bigness” according to the authors is its equivalence to “allness” (n = all). Throughout the book they assert that Big Data obviates the need for traditional (in their view flawed) sampling techniques and increasingly can be considered a complete view of the object of investigation. We later question this argument and initially note that even if the n = all principle were true, the notion of data providing a “complete view” of reality, in the objective sense, is met with skepticism by a number of modern scholars (Bowker, 2014; Edwards, 2010; Gitelman, 2013). The second characteristic is “messy,” the effect of which is diminished by the n = all characteristic. In their words, “looking at vastly more data also permits us to lessen our desire for exactitude.” The third and final characteristic is the shift in analytical technique from causality to correlation. “Most strikingly, society will need to shed some of its obsession for causality in exchange for simple correlations; not knowing why but only what.” We will return to Mayer-Schönberger and Cukier later in this paper to further critique their n = all claim and its implications for new paradigm science.

These two attempts to define Big Data, and many others like them, fail to adequately capture the nuances and contexts of use of Big Data that may make it revolutionary and the driver of a new scientific paradigm. Employing Kuhn’s words, when are Big Data “tradition-shattering complements to the tradition-bound activity of normal science” (Kuhn, 1970)? To answer this question, we need to examine Big Data from a sociotechnical perspective (Bijker, 1995; Lamb and Sawyer, 2005). We need to investigate their social, cultural, historical, and technical facets and the interplay and tensions among these facets that collectively establish the impact of Big Data on science and the possible transformation thereof. An analysis of this sort will allow us to distinguish the aspects of Big Data that, no matter how contributory to innovation, may be more evolutionary than revolutionary, from those that are indeed paradigm-shifting. Furthermore, it will help us distinguish between locality (discipline- and/or field-specific characteristics of Big Data) and globality (aspects of Big Data that may be paradigm-shifting across the scholarly enterprise).

Lots of data or Big Data?

Because technology is such a basic enabler and component of Big Data practices (i.e. computation including hardware, software, and algorithmic components; high-speed networks; massive storage arrays), it is useful to build our argument on the notions of new technological paradigms (Dosi, 1982) and of disruption (Christensen and Rosenbloom, 1995; Rosenbloom and Christensen, 1994). Originating in the business and organizational behavior sector, these two concepts nicely complement Kuhn’s theories, which focus on the scholarly domain.

Dosi distinguishes between evolutionary paths of technological change and new technological paradigms that represent discontinuities from pre-existing technological paths and address new classes of problems. Christensen and Rosenbloom expand on this with the notion of disruptive innovation, which is a discontinuity not only in the technological aspect of a product or service, but also in its sociotechnical context: a change in the set of valuations and values that frame and are impacted by the technical innovation. Christensen initially applied this theory of disruption to product lines (Christensen, 1997), with a prime example being the successive introduction of smaller hard disk platters that initially seemed noncompetitive with the disk products of mainstream manufacturers, but eventually and repeatedly obliterated the mainstream markets due to their framing within the revolutionary personalization of computing. Christensen has also applied this theoretical framework to health care (Christensen et al., 2008a) and education (Christensen et al., 2008b).

By leveraging the theoretical frameworks of Kuhn, Rosenbloom, and Christensen, we argue that a disruption in science (a.k.a. the creation of a new paradigm) is not just methodological, a way of doing (a.k.a. technical), but also must be sociotechnical. It must challenge existing epistemological norms (ways of knowing and framing the fundamental scientific questions of the field); institutional ecologies (agreements on scope, assumed knowledge, and boundaries of research work; Star and Griesemer, 1989); reward structures (paths to tenure and promotion); and communication regimes (mechanisms and norms for disseminating knowledge). We will use this scaffolding for the remainder of this essay to distinguish between what we will call lots of data, the effects of which are by and large methodological and technical, and true Big Data, that which entails epistemological and, as a result, paradigmatic change.

Our distinguishing between these two terms, lots of data (which entails methodological change and technical innovation) and Big Data (which implies the re-evaluation of epistemological foundations), should not be interpreted as an attempt to segregate data into two disjoint silos, i.e. data set 1 is “lots of data”, in contrast to data set 2 that is genuine “Big Data”. Our intention, rather, is to establish these concepts as continuous dimensions with which instances of data use can be evaluated in order to understand the degree and origins of their methodological and/or paradigm-shifting effects, e.g. a use of data set 1 has high “lots of data” impact but low “Big Data” impact while a use of data set 2 has low “lots of data” impact but high “Big Data” impact. The term “instances of data use”, in contrast to simply “data”, is intentional and refers to the fact that, similar to the definition of data, the methodological and epistemological impact of data must be evaluated within the context of use. An important facet of this context is the distinct epistemic culture (Knorr-Cetina, 1999) of the community of use and its particular perspectives on data and its meaning. In other words, the same data set may “measure” differently according to the “Big Data” and “lots of data” dimensions when employed by different disciplinary communities and/or for different purposes.

Although the primary focus of the remainder of this paper is the Big Data dimension (when, how, and why does data use challenge the epistemological foundations of science), it is useful, for the purpose of contrast, to briefly examine the companion lots of data dimension. This brevity should not be construed as dismissive towards the significance of these technical challenges and the methodological impacts they have. Indeed, there are great challenges here and the scholarly and practical effects of meeting these challenges can be profound, albeit not paradigm-shifting.

Two often-cited instances of data use demonstrate the lots of data dimension. The petabytes of data streaming in from high-energy physics experiments (studied thoroughly by Knorr-Cetina, 1999) or those that are components of the Sloan Digital Sky Survey (Szalay and Gray, 2001) are certainly Big Data in terms of size. But, considered alone, their bigness and the issues associated with them are by and large technical. These communities have historic cultures of data sharing (Ginsparg, 1994; Knorr-Cetina, 1999) and, in fact, their data has always been “big” relative to the quantitative definitions of the day. This is similar to the situation with many domains of science that have a legacy of exploring and manipulating large data sets, where “large” is historically contextualized relative to the technical affordances of the time (Gitelman, 2013).

The massive quantity of data in these two examples clearly introduces issues about new high-capacity storage systems, high-speed networks to move the data back and forth, and map-reduce algorithms that permit parallel computation over these massive data sets. A recent white paper co-authored by leading data science researchers (Agrawal et al., n.d.) provides a useful list of the cross-cutting challenges that need to be met to respond to these issues: heterogeneity and incompleteness, scale, timeliness or speed, privacy, and human collaboration. All of these are formidable challenges. However, the need for these new methodologies and tools to manipulate, store, and curate these massive data sets does not correspond to a paradigm-shifting disruption of the historically data-focused epistemic culture of the communities of practice that engage with these data.
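For readers unfamiliar with the map-reduce pattern mentioned above, the following minimal sketch may help. It is an illustration only: the event labels and partitions are invented, and the map step here runs sequentially in one process, whereas a real map-reduce framework would distribute the partitions across many machines.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_partition(readings: Iterable[str]) -> Counter:
    """Map step: count event labels within a single partition."""
    return Counter(readings)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical data, split into partitions that could be processed independently.
partitions: List[List[str]] = [
    ["muon", "electron", "muon"],
    ["photon", "muon"],
    ["electron", "photon", "photon"],
]

# Each partition is mapped on its own; the partial results are then reduced.
partial_counts = map(map_partition, partitions)
total_counts = reduce(merge_counts, partial_counts, Counter())

print(total_counts)
```

Because the map step touches only its own partition and the reduce step is associative, the work can be spread over as many workers as there are partitions. This is a technical convenience and says nothing about the provenance or integrity of the data being counted.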

A recent paper by Leonelli (2014) in the inaugural issue of this journal explores the same issue in the discipline of biology.³ Similar to this paper (albeit limited to a single discipline), Leonelli aims “to inform a critique of the supposedly revolutionary power of Big Data science,” likewise defining revolutionary as synonymous with creating a new epistemology and a new set of norms. Similar to our earlier examples in physics and astronomy, she notes that “data-gathering practices in subfields [of the life sciences] have been at the heart of inquiry since the early modern era, and have generated problems ever since.” She then aims the bulk of her critique at Mayer-Schönberger and Cukier’s claims that data completeness mitigates data messiness and their championing of correlation over causality, which we will return to later in this paper. She finishes by rejecting the notion that Big Data is exerting a revolutionary effect on the epistemology of biology itself, claiming that “there is a strong continuity with practices of large data collection and assemblage since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of the inquiry in the area of science.” In contrast to epistemic effects on the discipline itself, she acknowledges significant methodological challenges “encountered in developing and applying curatorial standards for data ...” and in the dissemination of that data.

On the other hand, the sensitivity of the evolutionary versus revolutionary impact of big (or indeed any) data to epistemic culture becomes evident in the context of digital humanities (or, as some call it, computational humanities, and its specializations such as computational history). The level of controversy over the “datafication” (Mayer-Schönberger, 2013) of historical and/or literary artifacts (whether in massive scale such as the Google Books Project or the scale of a single literary corpus) can be viewed as evidence of resistance to the introduction of a new epistemology, based on data, that is viewed by some as threatening, and perhaps inferior, to existing and historically based epistemologies (Bruns, 2013; Rosenberg, 2013).

These examples in physics, astronomy, biology, and the humanities (and many similar ones) lead us to conclude that mere bigness, lots of data (which appears to have different meanings in different scholarly fields), is not the basis for declaring a new paradigm in science. Furthermore, we can be fairly confident that such a blanket declaration without attention to the confounding factor of epistemic cultures warrants skepticism.

Data integrity and credible science

With these caveats in mind, however, we do claim that there might be some cross-cutting framing of data and their application across the entire scholarly endeavor, recognizing that this framing needs to be parameterized to a particular use of data within a particular epistemic culture. Then, we need to understand how Big Data might challenge this common framing, thereby becoming “tradition shattering” (Kuhn, 1970).

At the forefront is the notion of data integrity, which we assert is a consistent and discipline-crossing foundation of credible science (Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age, 2009; Nowotny, 2001). We intentionally use the term integrity rather than correctness or quality; the latter terms ascribe a level of positivism to data that many modern scholars refute (Edwards, 2010; Gitelman, 2013). Integrity, on the other hand, has a more constructivist tone, implying notions of “trust”, “fitness for use”, and “consensual understanding”, all of which are contextual and relative to epistemic culture, in contrast to the implicitly binary notion of correctness. Looking at this from the perspective of infrastructure to support data sharing (using “infrastructure” in its broadest, most sociotechnical sense; Edwards et al., 2007), we can then draw the links from integrity to trust, and ultimately to provenance (evidence upon which trust is established), and propose that determining the degree of data integrity is based on the ability to answer a number of questions. What is the origin of these data? Who has been responsible for them since their origination? Can we apply our standard notions for trust and integrity to them? Do our standard methodologies for interpreting them and drawing conclusions from them make sense? Big Data is then those data that disrupt fundamental notions of integrity and force new ways of thinking and doing to reestablish it. Said differently, Big Data is data that makes us rethink our notions of credible science.

Our attention here to the issues of data and scientific integrity is coincident with a growing concern with the reliability of scientific knowledge. The notion of a crisis in reliability has been discussed in the media (Naik, 2011), and in scientific journal articles (Brembs and Munafò, 2013) and editorials (“Announcement,” 2013; Jasny et al., 2011). Some of the concern about reliability has been fueled by well-publicized cases of scientific fraud and data falsification in a number of scientific fields (Harrison et al., 2010; “Researcher Faked Evidence of Human Cloning, Koreans Report,” 2006; Verfaellie and McGwin, 2011). In addition, a number of academics are warning about the prevalence of false results in the scientific literature (Ioannidis, 2005; Pöschl, 2004).

But, as pointed out by Stodden (2014), some of this concern arises from the increasing prevalence of data-intensive (Big Data) science across the disciplines, and the application of computational, analytical methods to those data without complete understanding of their characteristics (e.g. the nature of the sample represented by the data). Absent full understanding of the data (and in some cases a failure to account for this lack of intimacy with the data), researchers have at times unwittingly or sloppily applied methodological tools or epistemological understanding to those data that failed to account for the fundamental differences between them and traditional highly curated and reliable data. As pointed out by Lazer et al. (2014), “... most Big Data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”

Of particular concern in this area have been scientific results based on data sources of questionable provenance and integrity such as distributed sensors (Wallis et al., 2007) and “black box social media,” where the origin and basis of the data are difficult to determine (Driscoll and Walker, 2014) and the algorithmic bias on the conclusions is difficult to unravel (Gillespie, 2014). A well-known example of the foibles of the reliance on informally collected data and algorithmic projection is Google Flu Trends (GFT), which raised huge scientific optimism about the predictive utility of informally collected data when first published in Nature in 2009 (Ginsberg et al., 2009). This optimism suffered a serious setback in 2013 when the GFT predictions for that year were shown to be seriously exaggerated (Butler, 2013; Lazer et al., 2014). A complete accounting for this setback is beyond the scope of this paper. However, one acknowledged factor is an overconfidence in the veracity of the data as a true sample of reality, rather than a random snapshot in time and the result of algorithmic dynamics.

We acknowledge that this emphasis on data integrity (a.k.a. quality) stands somewhat in opposition to the popularized claims by Mayer-Schönberger and Cukier that “looking at vastly more data ... permits us to loosen up a desire for exactitude” and effectively allows us to ignore “messiness” in data (Mayer-Schönberger, 2013). As mentioned earlier, this claim and subsequent claims by the authors seem to rely heavily on n = all, that is, Big Data is not a sample but a complete set. We find this claim highly suspect and agree with fellow scholars (Boyd and Crawford, 2011; Bowker, 2014) who take the position that any data, no matter what its size, is de facto a sample, with bias implicit due to choice of instrumentation, span of observation, units of measurement, and numerous other factors. In essence, n never equals all; all is a limit in mathematical terms that can be approached but never attained. This point is also emphasized by Leonelli, who states that “having a lot of data is not the same as having all of them; and cultivating such a vision of completeness is a very risky and potentially misleading strategy” (Leonelli, 2014). Thus, if one denies sampling and its effects on messiness or on our ability to derive meaning from correlations, as Mayer-Schönberger and Cukier seem to do, one treads on questionable territory in terms of high integrity science, and may indeed have an argument that is more appropriate to business and commerce. Again quoting Leonelli, “it is no coincidence that most of the examples given by Mayer-Schönberger and Cukier come from the industrial world, and particularly globalized retail strategies as is the case of Amazon.com” (Leonelli, 2014).

As a point of reference, it is useful to look at the notions of integrity, trust, and provenance in the context of archives and archival science, for which they are essential concepts. Hirtle (2000) describes the meanings of these terms and the manner in which they are core to the definition of the archive in the context of the ship Constellation, a tourist destination in Baltimore harbor that was mistakenly identified as a Revolutionary War ship when its vintage was really the US Civil War. According to Hirtle (2000), “at the heart of an archive ... are records that are created by an agency or organization in the course of its business and that serve as evidence of the actions of that agency or organization [italics added].” Furthermore, “one way in which archivists working with ... records have sought to ensure the enduring value of archives as evidence is through the maintenance of an unbroken provenance for the records [italics added].” Implicit in the notion of “unbroken provenance” is control over storage and transfer; in order to serve as evidence an archival record must demonstrate a complete, unbroken, historical knowledge of the item of interest, who has been in control of it, and by what means it has been transferred or moved to other authorities. Fans of crime shows on TV or of detective novels should find this notion quite familiar; the evidence presented in a court of law is useless if law enforcement has lost control of it and it may have been tampered with.
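One way to make the idea of unbroken provenance concrete is to model it as a chain of custody records and check that the chain has no gaps. The sketch below is a hypothetical illustration, not a description of archival practice or of any existing system: the `CustodyRecord` fields, the holder names, and the use of a SHA-256 digest as a stand-in for "the item has not been altered" are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CustodyRecord:
    holder: str                   # who controls the item after this step
    received_from: Optional[str]  # previous holder; None for the originator
    transfer_method: str          # by what means the item was transferred
    content_digest: str           # digest of the item as received


def digest(data: bytes) -> str:
    """Fingerprint of the item, used to detect alteration between holders."""
    return hashlib.sha256(data).hexdigest()


def provenance_is_unbroken(chain: List[CustodyRecord]) -> bool:
    """True if every handoff names the previous holder and the item is unchanged."""
    if not chain or chain[0].received_from is not None:
        return False
    for previous, current in zip(chain, chain[1:]):
        if current.received_from != previous.holder:
            return False  # a gap: nobody documented this handoff
        if current.content_digest != previous.content_digest:
            return False  # the item changed while out of documented control
    return True


# Hypothetical example data.
item = b"field observations, season 1"
documented = [
    CustodyRecord("originating lab", None, "created", digest(item)),
    CustodyRecord("department archive", "originating lab", "physical handoff", digest(item)),
]
undocumented = documented + [
    CustodyRecord("unknown site", "anonymous uploader", "web download", digest(item)),
]

print(provenance_is_unbroken(documented))
print(provenance_is_unbroken(undocumented))
```

The second chain fails the check even though the item itself is unchanged, which mirrors the courtroom analogy: what is lost is not necessarily the content but the documented account of who controlled it.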

Defining the control zone

Taking a cue from archival science then, we should look at the role of control (and unbroken provenance) as a necessary (but not necessarily sufficient) factor in data integrity. Traditional data origination, sharing, and reuse were based on the reality of containable and concrete physical data (e.g. written by hand or stored on magnetic devices that are kept in drawers or file cabinets) and data sharing practices based on physical handoff to known colleagues. The physicality of both the data and the transfer of data amounted to a well-defined control zone resulting in a provenance chain that was documented and witnessed. Before examining the breakdown of this control zone in the context of Big Data, in the next section we examine the same notion and its role in the disruption of another knowledge infrastructure (Edwards et al., 2013) that has over the past two decades undergone considerable change, the library. In a seminal 1996 article, “Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone” (Atkinson, 1996), the late Ross Atkinson, then Associate University Librarian at Cornell University, describes how the notion of a control zone lay at the foundation of the library. According to Atkinson, the functioning of the library depends on the definition of a clear boundary, a demarcation of what lies in the library and what is outside. Internal to this boundary, within the control zone, the library can lay claim to those resources that have been selected as part of the collection, and assert curation, or stewardship, of those selected resources to ensure their integrity, availability, and stability over the long term.

The boundary of the traditional library was easy to define. It was the “bricks and mortar” structure with a clear and controlled entry point that contained and protected the selected physical resources over which the library asserted control and curatorial responsibility. Correspondingly, from the patron’s point of view, the boundary marked what could be called a “trust zone”, an area to which entry and exit were clearly marked and in which they could presume the existence of the integrity guarantees of the library. Integrity, in this case, does not imply veracity of the resources of the library, but adherence to principles of proper information stewardship, including accurate description, longevity of the resources, and adherence to some selection criteria.

In Lagoze (2010), we describe how the move from physical to digital information resources and the attendant access to them by the web architecture profoundly disrupted the foundation of the control zone. This disruption was not anticipated by participants, practitioners, and researchers in the early digital library initiatives, who foresaw technical but not institutional change. In fact, some predicted that in the end “[digital] library services would follow a familiar model” (Gladney et al., 1994). Others saw the Internet as another familiar evolutionary technical change, similar to past challenges to libraries, stating that “The anarchy of the Internet may be daunting for the neophyte, but it differs little from the bibliographic chaos that is the result of five and a half centuries of the printing press” (Lerner, 1999).

Two decades later, it is clear that the implications of

moving from physical to digital information and net-

work access to the information is more than a technical

phenomenon; the implications are more than that

‘‘digital information crosses boundaries easily’’ (Van

House et al., 2003) and in fact are deeply disruptive

to the library. By viewing the library as a meme,4

rather than just as an institution or a physical artifact,

we can see the roots of the disruption. At its core is the

disintegration of the foundation of the library itself,

the control zone. The notions of a clear

boundary, and the attendant concepts of being inside or

outside, disappear in the web architecture, where users (i.e. patrons) no longer enter through a well-defined

door, but ride hyperlinks and land wherever they may

choose in the digital library. Attempts to reassert a

boundary by defining a new digital door or portal

and establishing branding signposts defining inside vs.

outside have proven incompatible with the dominant

web context and have largely failed. With the collapse

of the control zone, other fundamental components of

the library meme become difficult to implement or ana-

chronistic relative to the increasingly normative

broader web context. These include selection, deciding

what information sources are available to patrons; intermediation, acting as a buffer between information

creators and information users; bibliographic descrip-

tion, providing ‘‘order making’’ via the catalog; and

fixity, guaranteeing the immutability of information

resources.

In conclusion, the wholesale transition of our intel-

lectual, popular, and cultural heritage to the digital

realm has been accompanied by a disruptive change

in our expectations about our knowledge infrastruc-

tures. The notions of selection, intermediation, biblio-

graphic description, and fixity that are core principles






of the library meme stand at odds with the web information

meme. These contradictions become sharper as the

web has moved over the past decade into the web 2.0

era and beyond. Expectations of open access to infor-

mation, active participation in knowledge production

and annotation, and the integration of social activity

and knowledge activities are now the expected norm. Libraries are certainly part of this modern knowledge

infrastructure. But they exist as participants in a world

of competing ‘‘knowledge institutions’’ (e.g. Wikipedia,

Facebook, Twitter). Meanwhile, notions of information

integrity, which were formerly grounded in institutional

frameworks such as the library, remain

problematic and in search of new ways to certify the

provenance of information resources.

Rethinking credible science in the age

of Big Data

With knowledge of this precedent, we can now return

to Big Data and recognize parallels in the historical

transitions of the library and the transformations in

the ways that scholarly data are created, shared, and

used. The relatively well-controlled mechanisms (both

cultural and technical) for data creation, data sharing,

and data reuse are under pressure for a number of rea-

sons. Funders, the public, and fellow scientists are

demanding, for good reason, better access to data

and in general ‘‘open data’’ (Huijboom and Broek,

2011; Molloy, 2011; Murray-Rust, 2008), motivating

the creation of numerous data repositories (Greenberg

et al., 2009; Hahnel, 2012; Michener et al., 2011) that allow easy and generally anonymous access to scientific

data on a global scale. Science in general is becoming

more collaborative and interdisciplinary (Barry and

Born, 2013; Haythornthwaite et al., 2006; Wagner

et al., 2011) (at least partly due to the multidisciplinary

scope of grand challenge problems like climate change),

breaking down traditional closely-knit teams of col-

leagues and bringing together scholars with different

epistemic and methodological cultures. An increasing

number of data sources originate from nontraditional

means, such as social networks for which concerns

about integrity and provenance are not priorities. Mashups of data are becoming increasingly common,

blurring the lines between formal and informal data.

Scientists seem to have a love/hate relationship with

this new reality. While they support the abstract idea

of open data (Cragin et al., 2010; Tenopir et al., 2011),

their sharing practices, and sharing preferences, remain

relatively closed and motivated by control (Borgman,

2011; Edwards et al., 2011).

Quantitative social science research provides an

interesting example of this data transition and impact

on the control zone. For the past 50 years, quantitative

social science has been built on a shared foundation of

data sources originating from survey research, aggre-

gate government statistics, and in-depth studies of indi-

vidual places, people, or events. Underlying these data

is a well-established and well-controlled infrastructure

composed of an international network of highly curated

and metadata-rich archives of social science such as the Inter-University Consortium for Political and Social

Research5 (ICPSR) and the UK Data Archive.6

These archives continue to play an important role in

quantitative social science research. However, the emer-

gence and maturation of ubiquitous networked com-

puting and the ever-growing data cloud have

introduced a spectacular quantity and variety of new

data sources into this mix. These include social media

data sources such as Facebook, Twitter, and other

online communities in which individuals reveal massive

amounts of information about themselves that are

invaluable for social science research. When combined

with more traditional data sources, these provide the

opportunity for studies at scales and complexities here-

tofore unimaginable. This transformation has been

described by Gary King, a Harvard political scientist,

as the social science data revolution, which is character-

ized by a ‘‘changing evidence base of social science

research’’ (King, 2011a, 2011b). These new opportu-

nities present formidable new challenges to the fabric

of social science research. Those mentioned by

King (2011b) include privacy challenges, problems of

sampling bias in uncontrolled data sets, a change in the

basic ‘‘job descriptions’’ of social scientists with

demand for new skills in statistical methods, computational methods, and the like, and the need for new

cross-disciplinary collaborations (i.e. breaking down

the silos that social science scholars formerly existed

in). Clearly this is an example of Big Data rather

than just lots of data.

Another example of this fracturing of the control

zone exists in observational science, for example, iden-

tification and reporting of phenomena (e.g. species) in

ecological niches, astronomy, and meteorology. In each

of these areas there is a growing interest in what has

been termed crowd sourced citizen science, which

engages numerous volunteers as participants in large-scale scientific endeavors (Wiggins and Crowston,

2010). The opportunities for large-scale citizen science

arise from the ubiquitous networking and computing

context and especially the recent spectacular growth

in the use of mobile devices. The motivations for lever-

aging this large-scale volunteer workforce as observa-

tional ‘‘sensors’’ are substantial. The geographic scope

of the observational spaces and the varieties of habitats

make reliance on trained observers (e.g. scientists)

infeasible. Our particular experience in this area is

with the eBird project,7 originated at the Cornell


Laboratory of Ornithology, a highly successful citizen

science project that for over a decade has collected

observations from volunteer participants worldwide

(Sullivan et al., 2014). Those data have subsequently

been used for a large body of highly-regarded and influ-

ential scientific research.

It comes as no surprise that crowd sourced citizen science makes a substantial portion of the formal

scientific community uneasy (Sauer et al., 1994), especially

in fields where people’s lives are at stake, such as

medicine (Raven, 2012). These data, by nature, break down a

well-established control zone whereby data are collected

by experts or individuals managed by experts who care-

fully abide by scientific methods. In contrast, citizen

science of this type must contend with the problems

of highly variable observer expertise and experience.

How can we trust data or the science that results

from those data when their provenance is rooted in

sources whose own provenance does not conform to

‘‘standard’’ criteria such as degree, publication record,

or institutional affiliation?

The examples described above are only two of the

many instances in which new varieties of Big Data are

undermining traditional control zones of science. If we

look longitudinally, we can see that examples such as

these are only the beginning of the problem. The frac-

tured control zones, and the resulting uncertain prov-

enance and trust, only intensify through the lifecycle of

sharing, reuse, and circulation of data in an open net-

work in which not all participants are deemed trust-

worthy according to established norms. Looking

across this lifecycle, this dilemma very quickly becomes combinatorially more complex. If the control zone

around data set A and that around data set B are

poorly defined, that which results from the reuse and

combination of the two is only fuzzier. Of course, this is

only the first step in the progressive mashup and

‘‘cooking’’ of these data with other data, a progression

that is inevitable when data reuse is easy and strongly

encouraged.

Despite the challenges and uncertainties, the inclu-

sion of these ‘‘uncontrolled’’ Big Data into the scientific

process is a reality that will continue and perhaps

become more common. Our ‘‘always there, everywhere’’ network culture will continue to make more

and larger amounts of automatically, accidentally,

and informally created data available for science. The

value of these data across the scholarly spectrum has

been demonstrated numerous times. Social scientists

can conduct studies on large-scale social networks

that may not replace, but do significantly complement,

traditional research based on small-scale social groups

(Milgram, 1967; Zachary, 1977). Observational scien-

tists can now accumulate heretofore unavailable evi-

dence of global phenomena, such as bird migrations

and climatological events, by leveraging the active par-

ticipation and contribution of enthusiastic human

volunteers.8

Our goal in this paper has not been to propose a

normative framework for this reality, but to stimulate

and add to discussions and investigations of its

entangled social, cultural, historical, and technical implications. Rather than fall back on hyperbolic

‘‘Big Data will change the world,’’ the scholarly com-

munity needs to understand it and investigate its impli-

cations for science policy and public trust of science.

We propose two threads for moving forward: one epistemological,

evaluating our understanding of quality in

both data and science and our means for determining it,

the other methodological, developing means of recover-

ing traditional quality metrics.

The first approach begins by raising the awareness of

researchers who use Big Data about its opportunities,

complexities, and dangers. This area is reasonably well

covered in Boyd and Crawford’s (2011) paper ‘‘Six

Provocations for Big Data’’, which covers many of

the caveats in dealing with this kind of data including

‘‘Claims to Objectivity and Accuracy are Misleading’’

and ‘‘Bigger Data Are Not Always Better Data.’’ As

the authors point out, a critical component of using Big

Data for research is understanding the integrity of

those data, where they originated, what biases are

built into them, how data cleaning may lead to

overfitting, and what sampling biases may be embedded in

them. In this context, we need to evaluate what quality

and integrity mean in a networked culture and its

numerous possible contexts, in the manner that other scholars are investigating parallel issues such as privacy

(Nissenbaum, 2009).

As for methodology, we suggest two technical paths

that may offer amelioration of the integrity problem,

both based on retrospectively recovering provenance,

rather than prospectively, as in the traditional

manner. In our research with eBird, we have been

investigating ways to reconstruct observer/contributor

expertise from the aggregated data. Our realization has

been that expertise is too nuanced a factor to recon-

struct, but that experience, interpreted as deliberate

practice, is an effective path to expert performance (Ericsson and Charness, 1994). Evidence of experience

can be extracted from the aggregated data; for example,

frequency of contributions, the diversity of contribu-

tions measured by species distribution, etc. By devising

ways to recognize these traces we hope to develop

mechanisms that aid scientists in determining the

expertise (and perhaps integrity) of anonymous data

contributors (reference removed for author anonym-

ity). Another approach might be to employ digital

forensics (Reith et al., 2002), a technique increasingly

popular in the intelligence and legal communities,






which, like our work with expertise, recovers traces of

origin and provenance metadata from a digital artifact

itself.
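As a rough illustration of the experience traces mentioned above (frequency of contributions and diversity of contributions measured by species distribution), the following sketch computes such measures per contributor. It is not the method used in the eBird research; the record layout (observer, checklist, species), the function name, and the choice of Shannon entropy as the diversity measure are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def experience_traces(observations):
    """Summarize per-observer traces of experience.

    `observations` is an iterable of (observer_id, checklist_id, species)
    tuples. This schema is hypothetical and chosen only for illustration.
    """
    checklists = defaultdict(set)        # observer -> distinct checklists submitted
    species_counts = defaultdict(Counter)  # observer -> reports per species

    for observer, checklist, species in observations:
        checklists[observer].add(checklist)
        species_counts[observer][species] += 1

    traces = {}
    for observer, counts in species_counts.items():
        total = sum(counts.values())
        # Shannon entropy of the observer's species distribution
        # (an assumed stand-in for "diversity of contributions").
        shannon = -sum((n / total) * math.log(n / total) for n in counts.values())
        traces[observer] = {
            "checklists": len(checklists[observer]),  # frequency of contributions
            "species_richness": len(counts),          # number of distinct species
            "shannon_diversity": shannon,
        }
    return traces
```

Measures of this kind say nothing about whether individual observations are correct; at most they offer indirect evidence of deliberate practice that a scientist might weigh when judging anonymous contributions.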

In conclusion, we have argued for an understanding

of the difference between lots of data and Big Data. The

former, a quantitative feature with mainly technical

and methodological implications, has, without a doubt, had important effects on the way science is

done and what it makes possible. However, the latter,

a qualitative feature with profound epistemological and

sociotechnical implications, shakes some of the core

assumptions of credible science: trust and integrity.

Similar to so many aspects of our modern digital cul-

ture such as journalism (e.g. the New York Times versus

the flood of grassroots news blogs) and reference infor-

mation (e.g. Encyclopedia Britannica versus

Wikipedia), it is futile and even undesirable to seek a

return to traditional, rigid control zones. Nevertheless,

we are left with the challenge of reaping the benefits of Big

Data while simultaneously holding science to the

same standards that it has been held to for centuries.

Declaration of conflicting interests

The author declares that there is no conflict of interest.

Funding

This research received no specific grant from any funding

agency in the public, commercial, or not-for-profit sectors.

Notes

1. Throughout this essay we use the term ‘‘science’’ as a generalization for all academic fields, not just the physical,

life, etc. sciences.

2. Some well-known examples of Kuhn’s notion of a para-

digm shift are the introduction of plate tectonics in geology

and Einstein’s special relativity theory in physics, both of

which challenged primary assumptions of their respective

fields.

3. Although Leonelli does undertake a disciplinary-level ana-

lysis, she acknowledges the flaws of using ‘‘discipline’’ as

the unit of study, recognizing the fact that within such

coarse granularity lies a wide variety of epistemological

and methodological practices.

4. We employ the term ‘‘meme’’ here to expand what we

mean by ‘‘library’’ beyond its operational, technical, and

institutional characteristics, and consider it in a manner

similar to a semiotic sign (Morris, 1938).

5. http://www.icpsr.umich.edu

6. http://www.data-archive.ac.uk

7. http://ebird.org

8. One might conjecture about the possibility of machine sen-

sing to replace the human volunteers. However, bird spe-

cies observation and identification rely on a highly

nuanced combination of visual, auditory, habitat, and

other knowledge that will make automated sensing extre-

mely difficult to implement.

References

Agrawal D, Bernstein P, Bertino E, et al. (n.d.) Challenges

and Opportunities with Big Data. Available at: https://

www.purdue.edu/discoverypark/cyber/assets/pdfs/

BigDataWhitePaper.pdf (accessed 28 October 2014).

Anderson C (2008) The end of theory: will the data deluge

make the scientific method obsolete? Wired 1–5.

Announcement: Reducing our irreproducibility (2013) Nature

496(7446): 398–398.

A Question of Balance: Private Rights and the Public Interest

in Scientific and Technical Databases (1999) Washington,

DC: The National Academies Press. Available at: http://

www.nap.edu/openbook.php?record_id=9692 (accessed

28 October 2014).

Atkinson R (1996) Library functions, scholarly communica-

tion, and the foundation of the digital library: laying claim

to the control zone. The Library Quarterly 66(3).

Barry A and Born G (2013) Interdisciplinarity:

Reconfigurations of the Social and Natural Sciences, 1st

ed. New York, NY: Routledge.

Bijker WE (1995) Of Bicycles, Bakelites, and Bulbs: Toward a Theory of Sociotechnical Change. Cambridge, MA: MIT Press.

Borgman CL (2011) The conundrum of sharing research data.

Journal of the American Society for Information Science

63(6): 1–40.

Bowker G (2014) The theory/data thing. International Journal

of Communication 8(5).

Boyd D and Crawford K (2011) Six provocations for Big

Data. SSRN Electronic Journal. DOI: 10.2139/

ssrn.1926431.

Brembs B and Munafò M (2013) Deep impact: unintended

consequences of journal rank. ArXiv. Available at: http://

arxiv.org/abs/1301.3748 (accessed 28 October 2014).

Bruns A (2013) Faster than the speed of print: reconciling ‘‘big data’’ social media analysis and academic

scholarship. First Monday 18(10). Available at: http://first

monday.org/ojs/index.php/fm/article/view/4879/3756

(accessed 7 October 2013).

Butler D (2013) When Google got flu wrong. Nature

494(7436): 155–156.

Christensen CM (1997) The Innovator’s Dilemma: When New

Technologies Cause Great Firms to Fail. Boston, MA:

Harvard Business School Press.

Christensen CM, Grossman JH and Hwang J (2008a) The

Innovator’s Prescription: A Disruptive Solution for Health

Care. New York, NY: McGraw-Hill.

Christensen CM, Horn MB and Johnson CW (2008b)

Disrupting Class: How Disruptive Innovation Will Change

the Way the World Learns. New York, NY: McGraw-Hill.

Christensen CM and Rosenbloom RS (1995) Explaining the

attacker’s advantage: technological paradigms, organiza-

tional dynamics, and the value network. Research Policy

24(2): 233–257.

Committee on Ensuring the Utility and Integrity of Research

Data in a Digital Age (2009) Ensuring the Integrity,

Accessibility, and Stewardship of Research Data in the

Digital Age. Washington, DC: National Academies Press.

Cragin MH, Palmer CL, Carlson JR, et al. (2010) Data shar-

ing, small science and institutional repositories.


Philosophical Transactions. Series A, Mathematical,

Physical, and Engineering Sciences 368(1926): 4023–4038.

Dosi G (1982) Technological paradigms and technological

trajectories: a suggested interpretation of the determinants

and directions of technical change. Research Policy 11(3):

147–162.

Driscoll K and Walker S (2014) Big data, big questions work-

ing within a black box: transparency in the collection and

production of big twitter data. International Journal of

Communication 8(0): 20.

Edwards P, Mayernik MS, Batcheller A, et al. (2011) Science

friction: data, metadata, and collaboration. Social Studies

of Science 41(5): 667–690.

Edwards PN (2010) A Vast Machine: Computer Models,

Climate Data, and the Politics of Global Warming.

Cambridge, MA: MIT Press.

Edwards PN, Jackson SJ, Bowker GC, et al. (2007)

Understanding Infrastructure: Dynamics, Tensions, and

Design. Washington, DC: National Science Foundation.

Edwards PN, Jackson SJ, Chalmers MK, et al. (2013)

Knowledge Infrastructures: Intellectual Frameworks and Research Challenges. Ann Arbor, MI.

Ericsson KA and Charness N (1994) Expert performance: its

structure and acquisition. American Psychologist 49(8):

725–747.

Gillespie T (2014) The relevance of algorithms. In: Gillespie

T, Boczkowski P and Foot (eds) Media Technologies:

Essays on Communication, Materiality, and Society.

Cambridge, MA: MIT Press, p.167.

Ginsberg J, Mohebbi MH, Patel RS, et al. (2009) Detecting

influenza epidemics using search engine query data.

Nature 457(7232): 1012–1014.

Ginsparg P (1994) First steps towards electronic research

communication. Los Alamos Science 8(4): 390–396.

Gitelman L (2013) ‘‘Raw Data’’ Is an Oxymoron

(Infrastructures). Cambridge, MA: The MIT Press, p.192.

Gladney HM, Fox EA, Ahmed Z, et al. (1994) Digital

Library: Gross Structure and Requirements: Report from

a March 1994 Workshop. College Station: IEEE.

Greenberg J, White HC, Carrier S, et al. (2009) A metadata

best practice for a scientific data repository. Journal of

Library Metadata 9(3–4): 194–212.

Hahnel M (2012) Exclusive: figshare a new open data project

that wants to change the future of scholarly publishing. In:

Impact of Social Sciences Blog.

Harrison WTA, Simpson J and Weil M (2010) Editorial. Acta

Crystallographica Section E Structure Reports Online

66(1): e1–e2.

Haythornthwaite C, Lunsford KJ, Bowker GC, et al. (2006)

Challenges for research and practice in distributed, inter-

disciplinary collaboration. In: Hine C (ed) New

Infrastructures for Knowledge Production: Understanding

E-science. Information Science Publishing, pp.143–166.

Hey T, Tansley S and Tolle K (eds) (2009) The Fourth

Paradigm. Redmond, WA: Microsoft Research.

Hirtle PB (2000) Archival authenticity in a digital age.

In: Cullen C, Levy DM, Lynch CA, et al. (eds)

Authenticity in a Digital Environment. Washington, DC:

Council on Library and Information Resources.

Huijboom N and Broek TD (2011) Open data: an inter-

national comparison of strategies. European Journal of

ePractice 12: 1–13.

Ioannidis JPA (2005) Why most published research findings

are false. PLoS Med 2(8): e124.

Jasny BR, Chin G, Chong L, et al. (2011) Data replication &

reproducibility. Again, and again, and again. . .

Introduction. Science (New York, N.Y.) 334(6060): 1225.

King G (2011a) Ensuring the data-rich future of the social

sciences. Science (New York, N.Y.) 331(6018): 719–721.

King G (2011b) The social science data revolution. Available

at: http://gking.harvard.edu/files/gking/files/evbase-hori-

zonsp.pdf (accessed 28 October 2014).

Knorr-Cetina K (1999) Epistemic Cultures: How the Sciences

Make Knowledge. Cambridge, MA: Harvard University

Press.

Kuhn TS (1970) The Structure of Scientific Revolutions, 2nd

ed. Chicago: University of Chicago Press.

Lagoze C (2010) Lost Identity: The Assimilation of Digital

Libraries into the Web (PhD dissertation). Cornell

University, Ithaca. Available at: http://carllagoze.files.wordpress.com/2012/06/carllagoze.pdf.

Lamb R and Sawyer S (2005) On extending social informatics

from a rich legacy of networks and conceptual resources.

Information Technology & People 18(1): 9–20.

Laney D (2001) 3D Data Management: Controlling Data

Volume, Velocity, and Variety.

Lazer D, Kennedy R, King G, et al. (2014) The parable of

Google flu: traps in big data analysis. Science 343(6176):

1203–1205.

Leonelli S (2014) What difference does quantity make? On the

epistemology of Big Data in biology. Big Data & Society

1(1). DOI: 10.1177/2053951714534395.

Lerner FA (1999) Libraries Through the Ages. New York,

NY: Continuum.

Mayer-Schönberger V (2013) Big Data: A Revolution that

Will Transform How We Live, Work, and Think. Boston:

Houghton Mifflin Harcourt.

Michener W, Vieglais D, Vision T, et al. (2011) DataONE:

data observation network for earth — preserving data and

enabling innovation in the biological and environmental

sciences. D-Lib Magazine 17(1/2).

Milgram S (1967) The small world problem. Psychology

Today 2: 60–67.

Molloy JC (2011) The open knowledge foundation: open data

means better science. PLoS Biology 9. DOI: 10.1371/

journal.pbio.1001195.

Morris CW (1938) Foundations of the Theory of Signs.

Chicago: University of Chicago Press.

Murray-Rust P (2008) Open data in science. Serials Review

34: 52–64.

Naik G (2011) Mistakes in scientific studies surge. Wall Street

Journal. Available at: http://online.wsj.com/news/articles/

SB10001424052702303627104576411850666582080.

Nissenbaum H (2009) Privacy in Context: Technology, Policy,

and the Integrity of Social Life. Stanford, CA: Stanford

Law Books.

Normandeau N (2013) Beyond volume, variety and vel-

ocity is the issue of big data veracity. Available at:

http://inside-bigdata.com/2013/09/12/beyond-volume-






variety-velocity-issue-big-data-veracity/ (accessed 15 April

2014).

Nowotny H (2001) Re-Thinking Science: Knowledge and the

Public in an Age of Uncertainty, 1st ed. Cambridge, UK:

Polity.

Pöschl U (2004) Interactive journal concept for improved

scientific publishing and quality assurance. Learned

Publishing 17(2): 105–113.

Raven K (2012) 23andMe’s face in the crowdsourced health

research industry gets bigger. Available at: http://blogs.

nature.com/spoonful/2012/07/23andmes-face-in-the-

crowdsourced-health-research-industry-gets-bigger.html

(accessed 28 October 2014).

Reith M, Carr C and Gunsch G (2002) An examination of

digital forensic models. International Journal of Digital

Evidence 1: 1–12.

Researcher faked evidence of human cloning, Koreans report

(2006) The New York Times, 10 January.

Rosenberg D (2013) Data before the fact. In: ‘‘Raw Data’’ is

an Oxymoron. Cambridge, MA: MIT Press, pp.15–30.

Rosenbloom RS and Christensen CM (1994) Technological discontinuities, organizational capabilities, and strategic

commitments. Industrial and Corporate Change 3(3):

655–685.

Sauer JR, Peterjohn BG and Link WA (1994) Observer

differences in the North American Breeding Bird Survey.

The Auk 111(1): 50–62.

Star SL and Griesemer JR (1989) Institutional ecology, trans-

lations and boundary objects: amateurs and professionals

in Berkeley’s Museum of Vertebrate Zoology, 1907-39.

Social Studies of Science 19(3): 387.

Stodden V (2014) Enabling reproducibility in big data

research: balancing confidentiality and scientific

transparency. In: Privacy, Big Data and the Public Good.

Cambridge, UK: Cambridge University Press. Available at: http://www.cambridge.org/us/academic/subjects/

statistics-probability/statistical-theory-and-methods/

privacy-big-data-and-public-good-frameworks-

engagement (accessed 28 October 2014).

Sullivan BL, Aycrigg JL, Barry JH, et al. (2014) The eBird

enterprise: an integrated approach to development and

application of citizen science. Biological Conservation 169

(January).

Szalay A and Gray J (2001) The world-wide telescope. Science (New York, N.Y.) 293(5537): 2037–2040.

Tenopir C, Allard S, Douglass K, et al. (2011) Data sharing

by scientists: practices and perceptions. PLoS ONE 6(6):

21.

Van House NA, Bishop AP and Buttenfield BP (2003)

Introduction: Digital Libraries as Sociotechnical Systems.

Cambridge, MA: MIT Press.

Verfaellie M and McGwin J (2011) The case of Diederik

Stapel: Allegations of scientific fraud by prominent

Dutch social psychologist are investigated by multiple uni-

versities. Psychological Science Agenda 25(12).

Wagner CS, Roessner JD, Bobb K, et al. (2011) Approaches

to understanding and measuring interdisciplinary scientific research (IDR): a review of the literature. Journal of

Informetrics 5(1): 14–26.

Wallis J, Borgman C, Mayernik M, et al. (2007) Know thy

sensor: trust, data quality, and data integrity in scientific

digital libraries. In: Kovács L, Fuhr N and Meghini C

(eds) Research and Advanced Technology for Digital

Libraries SE- 32. Vol. 4675, Berlin, Heidelberg: Springer,

pp. 380–391.

Wiggins A and Crowston K (2010) Distributed scientific

collaboration: research opportunities in citizen science.

In: Proceedings of ACM CSCW 2010 workshop on the

changing dynamics of scientific collaborations.

Zachary WW (1977) An information flow model for conflict

and fission in small groups. Journal of Anthropological Research 33: 452–473.
