Original Research Article
Big Data, data integrity, and the fracturing of the control zone
Carl Lagoze
Abstract
Despite all the attention to Big Data and the claims that it represents a ‘‘paradigm shift’’ in science, we lack understanding about what are the qualities of Big Data that may contribute to this revolutionary impact. In this paper, we look beyond the quantitative aspects of Big Data (i.e. lots of data) and examine it from a sociotechnical perspective. We argue that a key factor that distinguishes ‘‘Big Data’’ from ‘‘lots of data’’ lies in changes to the traditional, well-established ‘‘control zones’’ that facilitated clear provenance of scientific data, thereby ensuring data integrity and providing the foundation for credible science. The breakdown of these control zones is a consequence of the manner in which our network technology and culture enable and encourage open, anonymous sharing of information, participation regardless of expertise, and collaboration across geographic, disciplinary, and institutional barriers. We are left with the conundrum: how to reap the benefits of Big Data while re-creating a trust fabric and an accountable chain of responsibility that make credible science possible.
Keywords
Big Data, control zone, paradigm shift, sociotechnical
Big Data is not only about being big
The popular and scholarly literature is filled with excite-
ment about Big Data. A good deal of the enthusiasm
comes from the business sector, where Big Data offers
new possibilities for direct and micro marketing,
supply-chain optimization, and other means of increas-
ing efficiency and profits. This enthusiasm has also
spread to the public sector, particularly in the areas
of security and terrorism prevention. In this paper, we
examine the impact of Big Data in the context of sci-
ence,1 encompassing the research that takes place in the
academic, corporate, and government milieu. Admittedly, the line between commercial research (distinguished
from corporate research such as that which
takes place at IBM Watson) and scientific research can
be fuzzy, but we distinguish the former as motivated by
financial concerns (e.g. product improvement for profit
improvement), whereas the latter is motivated by the
search for some ‘‘truth’’. Some argue that Big Data
represents a new paradigm of science, a ‘‘fourth para-
digm’’ (Hey et al., 2009), adopting the terminology used
by Kuhn (1970) to characterize the revolutionary
transformation of a scientific field.2
While many view this new paradigm as complementary rather than substitutive
to pre-existing paradigms (observation, experimentation,
and simulation), others like Chris Anderson
have taken a more extreme view, claiming that Big
Data represents the ‘‘end of theory’’ (Anderson, 2008).
Our goal in this paper is to pull back from the hype
and take a more measured, analytical approach to
Big Data, focusing on the question ‘‘what are the
characteristics of (some) Big Data that manifest a para-
digm shift in the fundamental assumptions of science’’?
We distinguish between Big Data characteristics that
have methodological consequences and those that impact epistemological foundations. We characterize
the former as important but not paradigm-shifting.
In contrast, we argue that a paradigm shift is
University of Michigan, School of Information, Ann Arbor, MI, USA
Corresponding author:
Carl Lagoze, University of Michigan, 105 S. State Street, Ann Arbor,
MI 48103, USA.
Email: [email protected]
Big Data & Society
July–December 2014: 1–11
© The Author(s) 2014
DOI: 10.1177/2053951714558281
bds.sagepub.com
Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further
permission provided the original work is attributed as specified on the SAGE and Open Access pages (http://www.uk.sagepub.com/aboutus/
openaccess.htm).
Downloaded from bds.sagepub.com by guest on December 21, 2014
indeed evident when Big Data impacts epistemological
foundations.
Embedded in this argument is the assumption that
the characteristics we are looking for are not native to
all uses of big (in size) data. And, in fact, it may be true
that data that is not necessarily quantitatively large
may have characteristics that are paradigm-shifting when used in certain contexts and by certain communities
of use.
Before proceeding any further with an analysis of
‘‘big’’ as a qualifying characteristic of (some) ‘‘data’’,
it is important to establish a definition of ‘‘data’’,
whether big or small. A National Academies report
(A Question of Balance: Private Rights and the Public
Interest in Scientific and Technical Databases, 1999)
provides a simple and inclusive foundation definition:
‘‘data are artifacts, numbers, letters, and symbols that
describe an object, idea, condition, situation, or other
factors.’’ Although this definition is useful, it fails to
capture the ‘‘relative’’ nature of data (in contrast to it
having an ‘‘essential’’ nature). As Borgman (2011)
states: ‘‘[d]ata may exist only in the eyes of the
holder: the recognition that an observation, artifact,
or record constitute data is itself a scholarly act.’’
This perspective on data is both reflexive, in that something (e.g.
images, text, or an Excel worksheet) is data because
someone uses it as data in a specific context, and
transcendent, in that it carries across the many disciplines,
practices, and epistemologies of science.
This relational/contextual perspective gives us the
basis for examining the Big Data phenomenon in a
manner that both crosses epistemological boundaries and is contextualized by them. With due recognition
of the dangers of making generalizations about ‘‘sci-
ence’’, we hope to establish some fundamental aspects
of Big Data that are indeed boundary crossing, while
remaining shaped by (and shaping) specific disciplinary
practices.
Having established this relativistic definition of data,
we return to the notion that ‘‘Big Data is not only
about being big’’; that there is some combination of
features or dimensions (perhaps among them size)
that may have revolutionary effects on science and
knowledge production. This multidimensional perspective is evident in many of the popularized, mass-market
descriptions of Big Data.
One popular multidimensional definition of Big
Data is based on the so-called 3Vs: Volume, Velocity,
and Variety (Laney, 2001). Volume is the size factor.
Velocity refers to the speed of accumulation, the result-
ing dynamic nature of the data, and the high-scale pro-
cessing capacity needed to make it useful and keep it
current. Finally, Variety refers to the mixing together,
or mashing-up, of heterogeneous data types, models,
and schema and the need to resolve these differences
in order to make the data useful. Others have enhanced
this list with additional ‘‘Vs’’: Validity, the amount of
bias or noise in the data; Veracity, the correctness and
accuracy of the data; and Volatility, the persistence and
longevity of data (Normandeau, 2013), the first two of
which, Validity and Veracity, are of particular interest
to the argument of this paper.
Mayer-Schönberger and Cukier in their best-selling
book Big Data offer an alternative but complementary
set of characteristics of Big Data, which they claim
‘‘challenges the way we live and interact with the
world’’ (Mayer-Schönberger, 2013, p. 6). They characterize
Big Data as revolutionary because it enables/
embodies ‘‘three shifts [characteristics] in the way we
analyze information and transform how we understand
and organize society.’’ The first is the ‘‘more’’ charac-
teristic, which they posit as the foundation for the two
other characteristics. A notable aspect of ‘‘bigness’’
according to the authors is its equivalence to ‘‘allness’’
(n = all). Throughout the book they assert that Big
Data obviates the need for traditional (in their view
flawed) sampling techniques and increasingly can be
considered a complete view of the object of investiga-
tion. We later question this argument and initially note
that even if the n = all principle were true, the notion of
data providing a ‘‘complete view’’ of reality, in the
objective sense, is met with skepticism by a number of
modern scholars (Bowker, 2014; Edwards, 2010;
Gitelman, 2013). The second characteristic is
‘‘messy,’’ the effect of which is diminished by the n = all
characteristic. In their words, ‘‘looking at vastly more
data also permits us to lessen our desire for exactitude.’’ The third and final characteristic is the shift in analytical
technique from causality to correlation. ‘‘Most
strikingly, society will need to shed some of its obses-
sion for causality in exchange for simple correlations;
not knowing why but only what.’’ We will return to
Mayer-Schönberger and Cukier later in this paper to
further critique their n = all claim and its implications
for new paradigm science.
These two attempts to define Big Data, and many
others like them, fail to adequately capture the nuances
and contexts of use of Big Data that may make it revolutionary
and the driver of a new scientific paradigm.
Employing Kuhn’s words, when are Big Data ‘‘tradition-shattering
complements to the tradition-bound
activity of normal science’’ (Kuhn, 1970)? To answer
this question, we need to examine Big Data from a
sociotechnical perspective (Bijker, 1995; Lamb and
Sawyer, 2005). We need to investigate their social, cul-
tural, historical, and technical facets and the interplay
and tensions among these facets that collectively estab-
lish the impact of Big Data on science and the possible
transformation thereof. An analysis of this sort will
allow us to distinguish the aspects of Big Data that,
no matter how contributory to innovation, may be
more evolutionary than revolutionary, from those
that are indeed paradigm-shifting. Furthermore, it
will help us distinguish between locality—discipline
and/or field-specific characteristics of Big Data—and
globality—aspects of Big Data that may be paradigm-
shifting across the scholarly enterprise.
Lots of data or Big Data?
Because technology is such a basic enabler and compo-
nent of Big Data practices (i.e. computation including
hardware, software, and algorithmic components; high-
speed networks; massive storage arrays), it is useful to
build our argument on the notions of new technological
paradigms (Dosi, 1982) and of disruption (Christensen
and Rosenbloom, 1995; Rosenbloom and Christensen,
1994). Originating in the business and organizational
behavior sector, these two concepts nicely complement
Kuhn’s theories, which focus on the scholarly domain.
Dosi distinguishes between evolutionary paths of
technological change and new technological paradigms
that represent discontinuities from pre-existing techno-
logical paths and address new classes of problems.
Christensen and Rosenbloom expand on this with the
notion of disruptive innovation, which is a discontinu-
ity in not only the technological aspect of a product or
service, but also a sociotechnical disruption; a context-
ual change in the set of valuations and values that
frame and are impacted by the technical innovation.
Christensen initially applied this theory of disruption
to product lines (Christensen, 1997), with a prime example being the successive introduction of smaller
hard disk platters that initially seemed noncompetitive
with the disk products of mainstream manufacturers,
but eventually and repeatedly obliterated the mainstream
markets due to their framing within the revolutionary
personalization of computing. Christensen has
also applied this theoretical framework to health care
(Christensen et al., 2008a) and education (Christensen
et al., 2008b).
By leveraging the theoretical frameworks of Kuhn,
Rosenbloom, and Christensen, we argue that a disruption
in science (i.e. the creation of a new paradigm) is not just methodological, a way of doing (i.e. technical),
but also must be sociotechnical. It must challenge
existing epistemological norms, ways of
knowing and framing the fundamental scientific ques-
tions of the field; institutional ecologies (Star and
Griesemer, 1989), agreements on scope, assumed know-
ledge, and boundaries of research work; reward struc-
tures, paths to tenure and promotion; and
communication regimes, mechanisms, and norms for
disseminating knowledge. We will use this scaffolding
for the remainder of this essay to distinguish between
what we will call lots of data, the effects of which are by
and large methodological and technical, and true Big
Data, that which entails epistemological and, as a
result, paradigmatic change.
Our distinguishing between these two terms—lots of
data (which entails methodological change and tech-
nical innovation) and Big Data (which implies the re-evaluation of epistemological foundations)—should
not be interpreted as an attempt to segregate data
into two disjoint silos, i.e. data set 1 is ‘‘lots of data’’,
in contrast to data set 2 that is genuine ‘‘Big Data’’.
Our intention, rather, is to establish these concepts as
continuous dimensions with which instances of data use
can be evaluated in order to understand the degree and
origins of their methodological and/or paradigm-shift-
ing effects, i.e. a use of data set 1 has high ‘‘lots of data’’
impact but low ‘‘Big Data’’ impact while a use of data
set 2 has low ‘‘lots of data’’ impact but high ‘‘Big Data’’
impact. The term ‘‘instances of data use’’, in contrast to
simply ‘‘data’’, is intentional and refers to the fact that,
similar to the definition of data, the methodological
and epistemological impact of data must be evaluated
within the context of use. An important facet of this
context is the distinct epistemic culture (Knorr-Cetina,
1999) of the community of use and its particular per-
spectives on data and its meaning. In other words, the
same data set may ‘‘measure’’ differently according to
the ‘‘Big Data’’ and ‘‘lots of data’’ dimensions when
employed by different disciplinary communities and/
or for different purposes.
Although the primary focus of the remainder of this
paper is the Big Data dimension (when, how, and why does data use challenge the epistemological foundations
of science), it is useful, for the purpose of contrast, to
briefly examine the companion lots of data dimension.
This brevity should not be construed as dismissive
towards the significance of these technical challenges
and the methodological impacts they have. Indeed,
there are great challenges here and the scholarly and
practical effects of meeting these challenges can be pro-
found, albeit not paradigm-shifting.
Two often-cited instances of data use demonstrate
the lots of data dimension. The petabytes of data
streaming in from high-energy physics experiments (studied thoroughly by Knorr-Cetina, 1999) or those
that are components of the Sloan Digital Sky Survey
(Szalay and Gray, 2001) are certainly Big Data in terms
of size. But, considered alone, their bigness and the
issues associated with them are by and large technical.
These communities have historic cultures of data shar-
ing (Ginsparg, 1994; Knorr-Cetina, 1999) and, in fact,
their data has always been ‘‘big’’ relative to the quan-
titative definitions of the day. This is similar to the situ-
ation with many domains of science that have a legacy
of exploring and manipulating large data sets, where
‘‘large’’ is historically contextualized relative to the
technical affordances of the time (Gitelman, 2013).
The massive quantity of data in these two examples
clearly introduces issues about new high-capacity stor-
age systems, high-speed networks to easily move them
back and forth, and map-reduce algorithms that permit
parallel computation over these massive data sets. A recent white paper co-authored by leading data science
researchers (Agrawal et al., n.d.) provides a useful list
of the cross-cutting challenges that need to be met to
respond to these issues: heterogeneity and incompleteness,
scale, timeliness or speed, privacy, and human
collaboration. All of these are formidable challenges.
However, the need for these new methodologies and
tools to manipulate, store, and curate these massive
data sets does not correspond to a paradigm-shifting
disruption of the historically data-focused epistemic
culture of the communities of practice that engage
with these data.
A recent paper by Leonelli (2014) in the inaugural
issue of this journal explores the same issue in the dis-
cipline of biology.3 Similar to this paper (albeit limited
to a single discipline), Leonelli aims ‘‘to inform a cri-
tique of the supposedly revolutionary power of Big
Data science,’’ likewise defining revolutionary as syn-
onymous with creating a new epistemology and a new
set of norms. Similar to our earlier examples in physics
and astronomy, she notes that ‘‘data-gathering prac-
tices in subfields [of the life sciences] have been at the
heart of inquiry since the early modern era, and have
generated problems ever since.’’ She then aims the bulk
of her critique at Mayer-Schönberger and Cukier’s claims that data completeness mitigates data messiness
and their championing of correlation over causality,
which we will return to later in this paper. She finishes
by rejecting the notion that Big Data is exerting a revo-
lutionary effect on the epistemology of biology itself,
claiming that ‘‘there is a strong continuity with prac-
tices of large data collection and assemblage since the
early modern period; and the core methods and epi-
stemic problems of biological research, including
exploratory experimentation, sampling and the search
for causal mechanisms, remain crucial parts of the
inquiry in the area of science.’’ In contrast to epistemic effects on the discipline itself, she acknowledges significant
methodological challenges ‘‘encountered in
developing and applying curatorial standards for
data . . . ’’ and in the dissemination of that data.
On the other hand, the sensitivity of the evolutionary
versus revolutionary impact of big (or of even any) data
to epistemic culture becomes evident in the context of
digital humanities (or as some call it computational
humanities, and its specializations such as computational
history). The level of controversy over the ‘‘datafication’’
(Mayer-Schönberger, 2013) of historical and/or
literary artifacts (whether in massive scale such as
the Google Books Project or the scale of a single liter-
ary corpus) can be viewed as evidence of resistance to
the introduction of a new epistemology, based on data,
that is viewed by some as threatening, and perhaps
inferior, to existing and historically based epistemologies
(Bruns, 2013; Rosenberg, 2013).
These examples in physics, astronomy, biology, and
the humanities (and many similar ones) lead us to con-
clude that mere bigness, lots of data (which appears
to have different meanings in different scholarly
fields), is not the basis for declaring a new paradigm
in science. Furthermore, we can be fairly confident that
such a blanket declaration without attention to the con-
founding factor of epistemic cultures warrants
skepticism.
Data integrity and credible science
With these caveats in mind, however, we do claim that
there might be some cross-cutting framing of data and
their application across the entire scholarly endeavor,
recognizing that this framing needs to be parameterized
to a particular use of data within a particular epistemic
culture. Then, we need to understand how Big Data
might challenge this common framing, thereby becom-
ing ‘‘tradition shattering’’ (Kuhn, 1970).
At the forefront is the notion of data integrity, which
we assert is a consistent and discipline-crossing founda-
tion of credible science (Committee on Ensuring the
Utility and Integrity of Research Data in a Digital
Age, 2009; Nowotny, 2001). We intentionally use the term integrity rather than correctness or quality; the
latter terms ascribe a level of positivism to data that
many modern scholars refute (Edwards, 2010;
Gitelman, 2013). Integrity, on the other hand, has a
more constructivist tone, implying notions of ‘‘trust’’,
‘‘fitness for use’’, and ‘‘consensual understanding’’, all
of which are contextual and relative to epistemic cul-
ture, in contrast to the implicitly binary notion of cor-
rectness. Looking at this from the perspective of
infrastructure to support data sharing (using ‘‘infra-
structure’’ in its broadest most sociotechnical sense;
Edwards et al., 2007), we can then draw the links from integrity to trust, and ultimately to provenance
(evidence upon which trust is established), and propose
that determining the degree of data integrity is based on
the ability to answer a number of questions. What is the
origin of these data? Who has been responsible for
them since their origination? Can we apply our stand-
ard notions for trust and integrity to them? Do our
standard methodologies for interpreting them and
drawing conclusions from them make sense? Big Data
is then those data that disrupt fundamental notions of
integrity and force new ways of thinking and doing to
reestablish it. Said differently, Big Data is data that
makes us rethink our notions of credible science.
Our attention here to the issues of data and scientific
integrity is coincident with a growing concern with the
reliability of scientific knowledge. The notion of a crisis
in reliability has been discussed in the media (Naik,
2011), and in scientific journal articles (Brembs and Munafò, 2013) and editorials (‘‘Announcement,’’
2013; Jasny et al., 2011). Some of the concern about
reliability has been fueled by well-publicized cases of
scientific fraud and data falsification in a number of
scientific fields (Harrison et al., 2010; ‘‘Researcher
Faked Evidence of Human Cloning, Koreans
Report,’’ 2006; Verfaellie and McGwin, 2011). In add-
ition, a number of academics are warning about the
prevalence of false results in the scientific literature
(Ioannidis, 2005; Pöschl, 2004).
But, as pointed out by Stodden (2014), some of this
concern arises from the increasing prevalence of data-
intensive (Big Data) science across the disciplines, and
the application of computational, analytical methods to
those data without complete understanding of their
characteristics (e.g. the nature of the sample represented
by the data). Absent full understanding of the data (and
in some cases a failure to account for this lack of intim-
acy with the data), researchers have at times unwittingly
or sloppily applied methodological tools or epistemo-
logical understanding to those data that failed to
account for the fundamental differences between them
and traditional highly-curated and reliable data. As
pointed out by Lazer et al. (2014), ‘‘ . . . most Big Data
that have received popular attention are not the output of instruments designed to produce valid and reliable
data amenable for scientific analysis.’’
Of particular concern in this area has been scientific
results based on data sources of questionable proven-
ance and integrity such as distributed sensors (Wallis
et al., 2007) and ‘‘black box social media,’’ where the
origin and basis of the data are difficult to determine
(Driscoll and Walker, 2014) and the algorithmic bias on
the conclusions is difficult to unravel (Gillespie, 2014).
A well-known example of the foibles of the reliance on
informally collected data and algorithmic projection is
the Google Flu Trends (GFT), which raised huge scientific optimism about the predictive utility of informally
collected data when first published in Nature in
2009 (Ginsberg et al., 2009). This optimism suffered a
serious setback in 2013 when the GFT predictions for
that year were shown to be seriously exaggerated
(Butler, 2013; Lazer et al., 2014). A complete account-
ing for this setback is beyond the scope of this paper.
However, one acknowledged factor is an overconfi-
dence in the veracity of the data as a true sample of
reality, rather than a random snapshot in time and the
result of algorithmic dynamics.
We acknowledge that this emphasis on data integrity
(a.k.a. quality) stands somewhat in opposition to the
popularized claims by Mayer-Schönberger and Cukier
that ‘‘looking at vastly more data . . . permits us to
loosen up a desire for exactitude’’ and effectively
allows us to ignore ‘‘messiness’’ in data (Mayer-Schönberger,
2013). As mentioned earlier, this claim and subsequent claims by the authors seem to rely
heavily on n = all, that is, Big Data is not a sample
but a complete set. We find this claim highly suspicious
and agree with fellow scholars (Boyd and Crawford,
2011; Bowker, 2014) who take the position that any
data, no matter what its size, is de facto a sample,
with bias implicit due to choice of instrumentation,
span of observation, units of measurement, and
numerous other factors. In essence, n never equals all;
all is a limit in mathematical terms that can be
approached but never attained. This point is also
emphasized by Leonelli, who states that ‘‘having a lot
of data is not the same as having all of them; and
cultivating such a vision of completeness is a very
risky and potentially misleading strategy’’ (Leonelli,
2014). Thus, if one denies sampling and its effects on
messiness or on our ability to derive meaning from correlations,
as Mayer-Schönberger and Cukier seem to
do, they tread on questionable territory in terms of
high integrity science, and may indeed have an argu-
ment that is more appropriate to business and com-
merce. Again quoting Leonelli, ‘‘it is no coincidence
that most of the examples given by Mayer-Schönberger
and Cukier come from the industrial
world, and particularly globalized retail strategies as is the case of Amazon.com’’ (Leonelli, 2014).
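The point that size does not cancel sampling bias can be illustrated with a small simulation. The sketch below is purely hypothetical: the synthetic population, the inclusion probabilities (0.8 and 0.2), and the function name `simulate` are assumptions made for illustration and are not drawn from any of the studies cited here. It compares a very large sample whose inclusion depends on the measured value with a small uniform random sample.

```python
import random
import statistics


def simulate(seed=0, population_size=200_000, small_n=1_000):
    """Compare a large biased sample with a small uniform random sample.

    All numbers are illustrative assumptions, not empirical values.
    """
    rng = random.Random(seed)

    # Synthetic "reality": a population with mean near 50 and std. dev. 10.
    population = [rng.gauss(50.0, 10.0) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # "Big" sample: inclusion depends on the value itself, so above-average
    # members are far more likely to be recorded than below-average ones.
    big_biased = [
        x for x in population
        if rng.random() < (0.8 if x > 50.0 else 0.2)
    ]

    # "Small" sample: every member has the same chance of being drawn.
    small_random = rng.sample(population, small_n)

    return {
        "true_mean": true_mean,
        "big_biased_n": len(big_biased),
        "big_biased_error": abs(statistics.fmean(big_biased) - true_mean),
        "small_random_error": abs(statistics.fmean(small_random) - true_mean),
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key}: {value:.3f}")
```

Under these assumptions the biased sample is many times larger than the random one, yet its estimate of the mean stays systematically off, because adding more records drawn through the same skewed process does not remove the skew.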
As a point of reference, it is useful to look at the
notions of integrity, trust, and provenance in the con-
text of archives and archival science, for which they are
essential concepts. Hirtle (2000) describes the meanings
of these terms and the manner in which they are core to
the definition of the archive in the context of the ship
Constellation, a tourist destination in Baltimore harbor
that was mistakenly identified as a revolutionary war
ship when its vintage was really the US Civil War.
According to Hirtle (2000), ‘‘at the heart of an archive . . .
are records that are created by an agency or organization in the course of its business and that
serve as evidence of the actions of that agency or organ-
ization [italics added].’’ Furthermore, ‘‘one way in
which archivists working with . . . records have sought
to ensure the enduring value of archives as evidence is
through the maintenance of an unbroken provenance for
the records [italics added].’’ Implicit in the notion of
‘‘unbroken provenance’’ is control over storage and
transfer; in order to serve as evidence an archival
record must demonstrate a complete, unbroken, histor-
ical knowledge of the item of interest, who has been in
control of it, and by what means it has been transferred
or moved to other authorities. Fans of crime shows on
TV or of detective novels should find this notion quite
familiar; the evidence presented in a court of law is
useless if law enforcement has lost control of it and it
may have been tampered with.
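One way to make the idea of an unbroken provenance concrete in a digital setting is a hash-linked custody log, in which each transfer record commits to the record before it. The following is a minimal sketch under stated assumptions, not a description of any archival standard or of a method proposed in this paper; the names `CustodyEvent`, `append_event`, and `is_unbroken` are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass

# Placeholder "previous hash" for the first event in a chain (an assumption).
GENESIS = "0" * 64


def digest(data: bytes) -> str:
    """SHA-256 digest of the data as held at one point in its history."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class CustodyEvent:
    custodian: str    # who was in control of the data
    action: str       # e.g. "recorded", "received"
    data_digest: str  # digest of the data at the time of the event
    prev_hash: str    # hash of the preceding event, linking the chain

    def event_hash(self) -> str:
        payload = json.dumps(
            [self.custodian, self.action, self.data_digest, self.prev_hash]
        ).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_event(chain, custodian, action, data):
    """Return a new chain with one more custody event appended."""
    prev = chain[-1].event_hash() if chain else GENESIS
    return chain + [CustodyEvent(custodian, action, digest(data), prev)]


def is_unbroken(chain, current_data):
    """Check that every link matches its predecessor and that the data in
    hand matches the digest in the last recorded event."""
    prev = GENESIS
    for event in chain:
        if event.prev_hash != prev:
            return False
        prev = event.event_hash()
    return bool(chain) and chain[-1].data_digest == digest(current_data)


if __name__ == "__main__":
    data = b"station-7,2014-07-01,21.5\n"
    chain = append_event([], "field station", "recorded", data)
    chain = append_event(chain, "lab archive", "received", data)
    print(is_unbroken(chain, data))          # chain and data agree
    print(is_unbroken(chain, data + b"x"))   # data altered after last event
```

A log of this kind detects altered data and rewritten earlier events only if the final event hash is itself held by a trusted party; it records who claims control, which is a technical aid to the social chain of responsibility discussed here rather than a substitute for it.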
Defining the control zone
Taking a cue from archival science then, we should
look at the role of control (and unbroken provenance)
as a necessary (but not necessarily sufficient) factor in
data integrity. Traditional data origination, sharing,
and reuse were based on the reality of containable
and concrete physical data (e.g. written by hand or
stored on magnetic devices that are kept in drawers
or file cabinets) and data sharing practices based on
physical handoff to known colleagues. The physicality
of both the data and the transfer of data amounted to a
well-defined control zone resulting in a provenance
chain that was documented and witnessed. Before
examining the breakdown of this control zone in the
context of Big Data, in the next section we examine the
same notion and its role in the disruption of another
knowledge infrastructure (Edwards et al., 2013) that
has over the past two decades undergone considerable
change, the library. In a seminal 1996 article, ‘‘Library
Functions, Scholarly Communication, and the
Foundation of the Digital Library: Laying Claim to
the Control Zone’’ (Atkinson, 1996), the late Ross
Atkinson, then Associate University Librarian at
Cornell University, describes how the notion of a control zone lay at the foundation of the library.
According to Atkinson, the functioning of the library
depends on the definition of a clear boundary, a
demarcation of what lies in the library and what is out-
side. Internal to this boundary, within the control zone,
the library can lay claim to those resources that have
been selected as part of the collection, and assert cur-
ation, or stewardship, of those selected resources to
ensure their integrity, availability, and stability over
the long-term.
The boundary of the traditional library was easy to
define. It was the ‘‘bricks and mortar’’ structure with a clear and controlled entry point that contained and
protected the selected physical resources over which
the library asserted control and curatorial responsibil-
ity. Correspondingly, from the patron’s point of view,
the boundary marked what could be called a ‘‘trust
zone’’, an area to which entry and exit were clearly
marked and in which they could presume the existence
of the integrity guarantees of the library. Integrity, in
this case, does not imply veracity of the resources of the
library, but adherence to principles of proper informa-
tion stewardship, including accurate description,
longevity of the resources, and adherence to some selec-
tion criteria.
In Lagoze (2010), we describe how the move from
physical to digital information resources and the
attendant access to them by the web architecture
profoundly disrupted the foundation of the control
zone. This disruption was not anticipated by participants, practitioners, and researchers in the early
digital library initiatives, who foresaw technical
but not institutional change. In fact, some predicted
that in the end ‘‘[digital] library services would fol-
low a familiar model’’ (Gladney et al., 1994).
Others saw the Internet as another familiar evolution-
ary technical change, similar to past challenges to
libraries, stating that ‘‘The anarchy of the Internet
may be daunting for the neophyte, but it differs little
from the bibliographic chaos that is the result of
five and a half centuries of the printing press’’
(Lerner, 1999).
Two decades later, it is clear that the implications of
moving from physical to digital information, and to network
access to that information, are more than a technical
phenomenon; they go beyond the observation that
‘‘digital information crosses boundaries easily’’ (Van
House et al., 2003) and in fact are deeply disruptive
to the library. By viewing the library as a meme,4
rather than just as an institution or a physical artifact,
we can see the roots of the disruption. At the founda-
tion of it is the foundation of the library itself, the dis-
integration of the control zone. The notions of a clear
boundary, and the attendant concepts of being inside or
outside, disappear in the web architecture, where users (i.e. patrons) no longer enter through a well-defined
door, but ride hyperlinks and land wherever they may
choose in the digital library. Attempts to reassert a
boundary by defining a new digital door or portal
and establishing branding signposts defining inside vs.
outside have proven incompatible with the dominant
web context and have largely failed. With the collapse
of the control zone, other fundamental components of
the library meme become difficult to implement or ana-
chronistic relative to the increasingly normative
broader web context. These include selection, deciding
what information sources are available to patrons; intermediation, acting as a buffer between information
creators and information users; bibliographic descrip-
tion, providing ‘‘order making’’ via the catalog; and
fixity, guaranteeing the immutability of information
resources.
In conclusion, the wholesale transition of our intel-
lectual, popular, and cultural heritage to the digital
realm has been accompanied by a disruptive change
in our expectations about our knowledge infrastruc-
tures. The notions of selection, intermediation, biblio-
graphic description, and fixity that are core principles
of the library meme stand at odds with the web information
meme. These contradictions have sharpened as the
web has moved over the past decade into the Web 2.0
era and beyond. Open access to information,
active participation in knowledge production
and annotation, and the integration of social activity
and knowledge activities are now the expected norm.
Libraries are certainly part of this modern knowledge
infrastructure. But they exist as participants in a world
of competing ‘‘knowledge institutions’’ (e.g. Wikipedia,
Facebook, Twitter). Meanwhile, notions of information
integrity, which were formerly grounded in institutional
frameworks such as the library, remain
problematic and in search of new ways to certify the
provenance of information resources.
Rethinking credible science in the age
of Big Data
With knowledge of this precedent, we can now return
to Big Data and recognize parallels in the historical
transitions of the library and the transformations in
the ways that scholarly data are created, shared, and
used. The relatively well-controlled mechanisms (both
cultural and technical) for data creation, data sharing,
and data reuse are under pressure for a number of rea-
sons. Funders, the public, and fellow scientists are
demanding, for good reason, better access to data
and in general ‘‘open data’’ (Huijboom and Broek,
2011; Molloy, 2011; Murray-Rust, 2008), motivating
the creation of numerous data repositories (Greenberg
et al., 2009; Hahnel, 2012; Michener et al., 2011) that allow easy and generally anonymous access to scientific
data on a global scale. Science in general is becoming
more collaborative and interdisciplinary (Barry and
Born, 2013; Haythornthwaite et al., 2006; Wagner
et al., 2011) (at least partly due to the multidisciplinary
scope of grand challenge problems like climate change),
breaking down traditional closely-knit teams of col-
leagues and bringing together scholars with different
epistemic and methodological cultures. An increasing
number of data sources originate from nontraditional
means, such as social networks for which concerns
about integrity and provenance are not priorities. Mashups of data are becoming increasingly common,
blurring the lines between formal and informal data.
Scientists seem to have a love/hate relationship with
this new reality. While they support the abstract idea
of open data (Cragin et al., 2010; Tenopir et al., 2011),
their sharing practices, and sharing preferences, remain
relatively closed and motivated by control (Borgman,
2011; Edwards et al., 2011).
Quantitative social science research provides an
interesting example of this data transition and impact
on the control zone. For the past 50 years, quantitative
social science has been built on a shared foundation of
data sources originating from survey research, aggre-
gate government statistics, and in-depth studies of indi-
vidual places, people, or events. Underlying these data
is a well-established and well-controlled infrastructure
composed of an international network of highly curated
and metadata-rich archives of social science such as the Inter-University Consortium for Political and Social
Research5 (ICPSR) and the UK Data Archive.6
These archives continue to play an important role in
quantitative social science research. However, the emer-
gence and maturation of ubiquitous networked com-
puting and the ever-growing data cloud have
introduced a spectacular quantity and variety of new
data sources into this mix. These include social media
data sources such as Facebook, Twitter, and other
online communities in which individuals reveal massive
amounts of information about themselves that are
invaluable for social science research. When combined
with more traditional data sources, these provide the
opportunity for studies at scales and complexities here-
tofore unimaginable. This transformation has been
described by Gary King, a Harvard political scientist,
as the social science data revolution, which is character-
ized by a ‘‘changing evidence base of social science
research’’ (King, 2011a, 2011b). These new opportunities
present formidable new challenges to the fabric
of social science research. Those mentioned by
King (2011b) include privacy challenges, problems of
sampling bias in uncontrolled data sets, a change in the
basic ‘‘job descriptions’’ of social scientists with
demand for new skills in statistical methods, computational
methods, and the like, and the need for new
cross-disciplinary collaborations (i.e. breaking down
the silos in which social science scholars formerly
existed). Clearly this is an example of Big Data rather
than just lots of data.
Another example of this fracturing of the control
zone exists in observational science, for example, iden-
tification and reporting of phenomena (e.g. species) in
ecological niches, astronomy, and meteorology. In each
of these areas there is a growing interest in what has
been termed crowd sourced citizen science, which
engages numerous volunteers as participants in large-scale scientific endeavors (Wiggins and Crowston,
2010). The opportunities for large-scale citizen science
arise from the ubiquitous networking and computing
context and especially the recent spectacular growth
in the use of mobile devices. The motivations for lever-
aging this large-scale volunteer workforce as observa-
tional ‘‘sensors’’ are substantial. The geographic scope
of the observational spaces and the varieties of habitats
make reliance on trained observers (e.g. scientists)
infeasible. Our particular experience in this area is
with the eBird project,7 originated at the Cornell
Laboratory of Ornithology, a highly successful citizen
science project that for over a decade has collected
observations from volunteer participants worldwide
(Sullivan et al., 2014). Those data have subsequently
been used for a large body of highly-regarded and influ-
ential scientific research.
It comes as no surprise that crowd sourced citizen
science makes a substantial portion of the formal scientific
community uneasy (Sauer et al., 1994), especially
in fields where people’s lives are at stake, such as medicine
(Raven, 2012). These data, by nature, break down a
well-established control zone in which data are collected
by experts or individuals managed by experts who carefully
abide by scientific methods. In contrast, citizen
science of this type must contend with the problems
of highly variable observer expertise and experience.
How can we trust data or the science that results
from those data when their provenance is rooted in
sources whose own provenance does not conform to
‘‘standard’’ criteria such as degree, publication record,
or institutional affiliation?
The examples described above are only two of the
many instances in which new varieties of Big Data are
undermining traditional control zones of science. If we
look longitudinally, we can see that examples such as
these are only the beginning of the problem. The frac-
tured control zones, and the resulting uncertain prov-
enance and trust, only intensify through the lifecycle of
sharing, reuse, and circulation of data in an open net-
work in which not all participants are deemed trust-
worthy according to established norms. Looking
across this lifecycle, this dilemma very quickly becomes combinatorially more complex. If the control zone
around data set A and that around data set B are
poorly defined, that which results from the reuse and
combination of the two is only fuzzier. Of course, this is
only the first step in the progressive mashup and
‘‘cooking’’ of these data with other data, a progression
that is inevitable when data reuse is easy and strongly
encouraged.
Despite the challenges and uncertainties, the inclu-
sion of these ‘‘uncontrolled’’ Big Data into the scientific
process is a reality that will continue and perhaps
become more common. Our ‘‘always there, everywhere’’ network culture will continue to make more
and larger amounts of automatically, accidentally,
and informally created data available for science. The
value of these data across the scholarly spectrum has
been demonstrated numerous times. Social scientists
can conduct studies on large-scale social networks
that may not replace, but do significantly complement,
traditional research based on small-scale social groups
(Milgram, 1967; Zachary, 1977). Observational scien-
tists can now accumulate heretofore unavailable evi-
dence of global phenomena, such as bird migrations
and climatological events, by leveraging the active par-
ticipation and contribution of enthusiastic human
volunteers.8
Our goal in this paper has not been to propose a
normative framework for this reality, but to stimulate
and add to discussions and investigations of its
entangled social, cultural, historical, and technical
implications. Rather than fall back on hyperbolic claims that
‘‘Big Data will change the world,’’ the scholarly community
needs to understand it and investigate its implications
for science policy and public trust of science.
We propose two threads for moving forward: one
epistemological, evaluating our understanding of quality in
both data and science and our means for determining it;
the other methodological, developing means of recovering
traditional quality metrics.
The first approach begins by raising the awareness of
researchers who use Big Data about its opportunities,
complexities, and dangers. This area is reasonably well
covered in Boyd and Crawford’s (2011) paper ‘‘Six
Provocations for Big Data’’, which covers many of
the caveats in dealing with this kind of data including
‘‘Claims to Objectivity and Accuracy are Misleading’’
and ‘‘Bigger Data Are Not Always Better Data.’’ As
the authors point out, a critical component of using Big
Data for research is understanding the integrity of
those data, where they originated, what biases are
built into them, how data cleaning may lead to
overfitting, and what sampling biases may be embedded in
them. In this context, we need to evaluate what quality
and integrity mean in a networked culture and its
numerous possible contexts, in the manner that other
scholars are investigating parallel issues such as privacy
(Nissenbaum, 2009).
As for methodology, we suggest two technical paths
that may ameliorate the integrity problem,
both based on recovering provenance retrospectively
rather than establishing it prospectively, as in the traditional
manner. In our research with eBird, we have been
investigating ways to reconstruct observer/contributor
expertise from the aggregated data. Our realization has
been that expertise is too nuanced a factor to recon-
struct, but that experience, interpreted as deliberate
practice, is an effective path to expert performance (Ericsson and Charness, 1994). Evidence of experience
can be extracted from the aggregated data; for example,
frequency of contributions, the diversity of contribu-
tions measured by species distribution, etc. By devising
ways to recognize these traces we hope to develop
mechanisms that aid scientists in determining the
expertise (and perhaps integrity) of anonymous data
contributors (reference removed for author anonym-
ity). Another approach might be to employ digital
forensics (Reith et al., 2002), a technique increasingly
popular in the intelligence and legal communities,
which, like our work with expertise, recovers traces of
origin and provenance metadata from a digital artifact
itself.
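As a rough illustration of the kind of trace extraction described above, the following sketch derives simple experience indicators for each contributor from a flat list of observation records. It is not the eBird project's actual method: the record layout (observer, species, date), the function name, and the use of the Shannon index as the measure of species diversity are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log


def experience_indicators(records):
    """Summarize per-observer experience traces.

    `records` is an iterable of (observer_id, species, date_string)
    tuples. This schema is hypothetical, not a real eBird export format.
    """
    by_observer = defaultdict(list)
    for observer_id, species, day in records:
        by_observer[observer_id].append((species, day))

    indicators = {}
    for observer_id, observations in by_observer.items():
        total = len(observations)
        species_counts = Counter(species for species, _ in observations)
        # Shannon index over the observer's reported species: one possible
        # (assumed) way to quantify "diversity of contributions".
        shannon = sum(
            (n / total) * log(total / n) for n in species_counts.values()
        )
        indicators[observer_id] = {
            "n_observations": total,                                 # frequency
            "n_active_days": len({day for _, day in observations}),  # regularity
            "n_species": len(species_counts),                        # breadth
            "shannon_diversity": shannon,
        }
    return indicators


if __name__ == "__main__":
    sample = [
        ("a", "robin", "2014-05-01"),
        ("a", "robin", "2014-05-02"),
        ("a", "wren", "2014-05-02"),
        ("a", "wren", "2014-05-03"),
        ("b", "robin", "2014-05-01"),
    ]
    for observer, summary in experience_indicators(sample).items():
        print(observer, summary)
```

Indicators of this kind describe experience, not expertise; how (or whether) they should be weighted when judging the reliability of a contribution remains a research question.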
In conclusion, we have argued for an understanding
of the difference between lots of data and Big Data. The
former, a quantitative feature with mainly technical
and methodological implications, has, without a doubt, had important effects on the way science is
done and what it makes possible. However, the latter,
a qualitative feature with profound epistemological and
sociotechnical implications, shakes some of the core
assumptions of credible science: trust and integrity.
As with so many aspects of our modern digital culture,
such as journalism (e.g. the New York Times versus
the flood of grassroots news blogs) and reference information
(e.g. Encyclopedia Britannica versus
Wikipedia), it is futile and even undesirable to seek a
return to traditional, rigid control zones. Nevertheless,
we are left with the challenge of reaping the benefits of
Big Data while simultaneously holding science to the
same standards that it has been held to for centuries.
Declaration of conflicting interests
The author declares that there is no conflict of interest.
Funding
This research received no specific grant from any funding
agency in the public, commercial, or not-for-profit sectors.
Notes
1. Throughout this essay we use the term ‘‘science’’ as a generalization for all academic fields, not just the physical,
life, and other sciences.
2. Some well-known examples of Kuhn’s notion of a para-
digm shift are the introduction of plate tectonics in geology
and Einstein’s special relativity theory in physics, both of
which challenged primary assumptions of their respective
fields.
3. Although Leonelli does undertake a disciplinary-level ana-
lysis, she acknowledges the flaws of using ‘‘discipline’’ as
the unit of study, recognizing the fact that within such
coarse granularity lies a wide variety of epistemological
and methodological practices.
4. We employ the term ‘‘meme’’ here to expand what we
mean by ‘‘library’’ beyond its operational, technical, and
institutional characteristics, and consider it in a manner
similar to a semiotic sign (Morris, 1938).
5. http://www.icpsr.umich.edu
6. http://www.data-archive.ac.uk
7. http://ebird.org
8. One might conjecture about the possibility of machine sen-
sing to replace the human volunteers. However, bird spe-
cies observation and identification rely on a highly
nuanced combination of visual, auditory, habitat, and
other knowledge that will make automated sensing extre-
mely difficult to implement.
References
Agrawal D, Bernstein P, Bertino E, et al. (n.d.) Challenges
and Opportunities with Big Data. Available at: https://
www.purdue.edu/discoverypark/cyber/assets/pdfs/
BigDataWhitePaper.pdf (accessed 28 October 2014).
Anderson C (2008) The end of theory: will the data deluge
make the scientific method obsolete? Wired 1–5.
Announcement: Reducing our irreproducibility (2013) Nature
496(7446): 398–398.
A Question of Balance: Private Rights and the Public Interest
in Scientific and Technical Databases (1999) Washington,
DC: The National Academies Press. Available at: http://
www.nap.edu/openbook.php?record_id=9692 (accessed
28 October 2014).
Atkinson R (1996) Library functions, scholarly communica-
tion, and the foundation of the digital library: laying claim
to the control zone. The Library Quarterly 66(3).
Barry A and Born G (2013) Interdisciplinarity:
Reconfigurations of the Social and Natural Sciences, 1st
ed. New York, NY: Routledge.
Bijker WE (1995) Of Bicycles, Bakelites, and Bulbs: Toward a Theory of Sociotechnical Change. Cambridge, MA: MIT Press.
Borgman CL (2011) The conundrum of sharing research data.
Journal of the American Society for Information Science
63(6): 1–40.
Bowker G (2014) The theory/data thing. International Journal
of Communication 8(5).
Boyd D and Crawford K (2011) Six provocations for Big
Data. SSRN Electronic Journal. DOI: 10.2139/ssrn.1926431.
Brembs B and Munafò M (2013) Deep impact: unintended
consequences of journal rank. ArXiv. Available at: http://
arxiv.org/abs/1301.3748 (accessed 28 October 2014).
Bruns A (2013) Faster than the speed of print: reconciling
‘‘big data’’ social media analysis and academic scholarship.
First Monday 18(10). Available at: http://firstmonday.org/ojs/index.php/fm/article/view/4879/3756
(accessed 7 October 2013).
Butler D (2013) When Google got flu wrong. Nature
494(7436): 155–156.
Christensen CM (1997) The Innovator’s Dilemma: When New
Technologies Cause Great Firms to Fail. Boston, MA:
Harvard Business School Press.
Christensen CM, Grossman JH and Hwang J (2008a) The
Innovator’s Prescription: A Disruptive Solution for Health
Care. New York, NY: McGraw-Hill.
Christensen CM, Horn MB and Johnson CW (2008b)
Disrupting Class: How Disruptive Innovation Will Change
the Way the World Learns. New York, NY: McGraw-Hill.
Christensen CM and Rosenbloom RS (1995) Explaining the
attacker’s advantage: technological paradigms, organiza-
tional dynamics, and the value network. Research Policy
24(2): 233–257.
Committee on Ensuring the Utility and Integrity of Research
Data in a Digital Age (2009) Ensuring the Integrity,
Accessibility, and Stewardship of Research Data in the
Digital Age. Washington, DC: National Academies Press.
Cragin MH, Palmer CL, Carlson JR, et al. (2010) Data shar-
ing, small science and institutional repositories.
Philosophical Transactions. Series A, Mathematical,
Physical, and Engineering Sciences 368(1926): 4023–4038.
Dosi G (1982) Technological paradigms and technological
trajectories: a suggested interpretation of the determinants
and directions of technical change. Research Policy 11(3):
147–162.
Driscoll K and Walker S (2014) Big data, big questions work-
ing within a black box: transparency in the collection and
production of big twitter data. International Journal of
Communication 8(0): 20.
Edwards P, Mayernik MS, Batcheller A, et al. (2011) Science
friction: data, metadata, and collaboration. Social Studies
of Science 41(5): 667–690.
Edwards PN (2010) A Vast Machine: Computer Models,
Climate Data, and the Politics of Global Warming.
Cambridge, MA: MIT Press.
Edwards PN, Jackson SJ, Bowker GC, et al. (2007)
Understanding Infrastructure: Dynamics, Tensions, and
Design. Washington, DC: National Science Foundation.
Edwards PN, Jackson SJ, Chalmers MK, et al. (2013)
Knowledge Infrastructures: Intellectual Frameworks and Research Challenges. Ann Arbor, MI.
Ericsson KA and Charness N (1994) Expert performance: its
structure and acquisition. American Psychologist 49(8):
725–747.
Gillespie T (2014) The relevance of algorithms. In: Gillespie
T, Boczkowski P and Foot (eds) Media Technologies:
Essays on Communication, Materiality, and Society.
Cambridge, MA: MIT Press, p.167.
Ginsberg J, Mohebbi MH, Patel RS, et al. (2009) Detecting
influenza epidemics using search engine query data.
Nature 457(7232): 1012–1014.
Ginsparg P (1994) First steps towards electronic research
communication. Los Alamos Science 8(4): 390–396.
Gitelman L (2013) ‘‘Raw Data’’ Is an Oxymoron
(Infrastructures). Cambridge, MA: The MIT Press, p.192.
Gladney HM, Fox EA, Ahmed Z, et al. (1994) Digital
Library: Gross Structure and Requirements: Report from
a March 1994 Workshop. College Station: IEEE.
Greenberg J, White HC, Carrier S, et al. (2009) A metadata
best practice for a scientific data repository. Journal of
Library Metadata 9(3–4): 194–212.
Hahnel M (2012) Exclusive: figshare a new open data project
that wants to change the future of scholarly publishing. In:
Impact of Social Sciences Blog.
Harrison WTA, Simpson J and Weil M (2010) Editorial. Acta
Crystallographica Section E Structure Reports Online
66(1): e1–e2.
Haythornthwaite C, Lunsford KJ, Bowker GC, et al. (2006)
Challenges for research and practice in distributed, inter-
disciplinary collaboration. In: Hine C (ed) New
Infrastructures for Knowledge Production: Understanding
E-science. Information Science Publishing, pp.143–166.
Hey T, Tansley S and Tolle K (eds) (2009) The Fourth
Paradigm. Redmond, WA: Microsoft Research.
Hirtle PB (2000) Archival authenticity in a digital age.
In: Cullen C, Levy DM, Lynch CA, et al. (eds)
Authenticity in a Digital Environment. Washington, DC:
Council on Library and Information Resources.
Huijboom N and Broek TD (2011) Open data: an inter-
national comparison of strategies. European Journal of
ePractice 12: 1–13.
Ioannidis JPA (2005) Why most published research findings
are false. PLoS Med 2(8): e124.
Jasny BR, Chin G, Chong L, et al. (2011) Data replication &
reproducibility. Again, and again, and again. . .
Introduction. Science (New York, N.Y.) 334(6060): 1225.
King G (2011a) Ensuring the data-rich future of the social
sciences. Science (New York, N.Y.) 331(6018): 719–721.
King G (2011b) The social science data revolution. Available
at: http://gking.harvard.edu/files/gking/files/evbase-hori-
zonsp.pdf (accessed 28 October 2014).
Knorr-Cetina K (1999) Epistemic Cultures: How the Sciences
Make Knowledge. Cambridge, MA: Harvard University
Press.
Kuhn TS (1970) The Structure of Scientific Revolutions, 2nd
ed. Chicago: University of Chicago Press.
Lagoze C (2010) Lost Identity: The Assimilation of Digital
Libraries into the Web (PhD dissertation). Cornell
University, Ithaca. Available at: http://carllagoze.files.wordpress.com/2012/06/carllagoze.pdf.
Lamb R and Sawyer S (2005) On extending social informatics
from a rich legacy of networks and conceptual resources.
Information Technology & People 18(1): 9–20.
Laney D (2001) 3D Data Management: Controlling Data
Volume, Velocity, and Variety.
Lazer D, Kennedy R, King G, et al. (2014) The parable of
Google flu: traps in big data analysis. Science 343(6176):
1203–1205.
Leonelli S (2014) What difference does quantity make? On the
epistemology of Big Data in biology. Big Data & Society
1(1). DOI: 10.1177/2053951714534395.
Lerner FA (1999) Libraries Through the Ages. New York,
NY: Continuum.
Mayer-Schönberger V (2013) Big Data: A Revolution that
Will Transform How We Live, Work, and Think. Boston:
Houghton Mifflin Harcourt.
Michener W, Vieglais D, Vision T, et al. (2011) DataONE:
data observation network for earth — preserving data and
enabling innovation in the biological and environmental
sciences. D-Lib Magazine 17(1/2).
Milgram S (1967) The small world problem. Psychology
Today 2: 60–67.
Molloy JC (2011) The open knowledge foundation: open data
means better science. PLoS Biology 9. DOI: 10.1371/
journal.pbio.1001195.
Morris CW (1938) Foundations of the Theory of Signs.
Chicago: University of Chicago Press.
Murray-Rust P (2008) Open data in science. Serials Review
34: 52–64.
Naik G (2011) Mistakes in scientific studies surge. Wall Street
Journal. Available at: http://online.wsj.com/news/articles/
SB10001424052702303627104576411850666582080.
Nissenbaum H (2009) Privacy in Context: Technology, Policy,
and the Integrity of Social Life. Stanford, CA: Stanford
Law Books.
Normandeau N (2013) Beyond volume, variety and vel-
ocity is the issue of big data veracity. Available at:
http://inside-bigdata.com/2013/09/12/beyond-volume-
variety-velocity-issue-big-data-veracity/ (accessed 15 April
2014).
Nowotny H (2001) Re-Thinking Science: Knowledge and the
Public in an Age of Uncertainty, 1st ed. Cambridge, UK:
Polity.
Pöschl U (2004) Interactive journal concept for improved scientific
publishing and quality assurance. Learned
Publishing 17(2): 105–113.
Raven K (2012) 23andMe’s face in the crowdsourced health
research industry gets bigger. Available at: http://blogs.
nature.com/spoonful/2012/07/23andmes-face-in-the-
crowdsourced-health-research-industry-gets-bigger.html
(accessed 28 October 2014).
Reith M, Carr C and Gunsch G (2002) An examination of
digital forensic models. International Journal of Digital
Evidence 1: 1–12.
Researcher faked evidence of human cloning, Koreans report
(2006) The New York Times, 10 January.
Rosenberg D (2013) Data before the fact. In: ‘‘Raw Data’’ is
an Oxymoron. Cambridge, MA: MIT Press, pp.15–30.
Rosenbloom RS and Christensen CM (1994) Technological discontinuities, organizational capabilities, and strategic
commitments. Industrial and Corporate Change 3(3):
655–685.
Sauer JR, Peterjohn BG and Link WA (1994) Observer
differences in the North American Breeding Bird Survey.
The Auk 111(1): 50–62.
Star SL and Griesemer JR (1989) Institutional ecology, trans-
lations and boundary objects: amateurs and professionals
in Berkeley’s Museum of Vertebrate Zoology, 1907-39.
Social Studies of Science 19(3): 387.
Stodden V (2014) Enabling reproducibility in big data
research: balancing confidentiality and scientific transparency.
In: Privacy, Big Data and the Public Good.
Cambridge, UK: Cambridge University Press. Available
at: http://www.cambridge.org/us/academic/subjects/
statistics-probability/statistical-theory-and-methods/
privacy-big-data-and-public-good-frameworks-
engagement (accessed 28 October 2014).
Sullivan BL, Aycrigg JL, Barry JH, et al. (2014) The eBird
enterprise: an integrated approach to development and
application of citizen science. Biological Conservation 169
(January).
Szalay A and Gray J (2001) The world-wide telescope. Science (New York, N.Y.) 293(5537): 2037–2040.
Tenopir C, Allard S, Douglass K, et al. (2011) Data sharing
by scientists: practices and perceptions. PLoS ONE 6(6):
21.
Van House NA, Bishop AP and Buttenfield BP (2003)
Introduction: Digital Libraries as Sociotechnical Systems.
Cambridge, MA: MIT Press.
Verfaellie M and McGwin J (2011) The case of Diederik
Stapel: Allegations of scientific fraud by prominent
Dutch social psychologist are investigated by multiple uni-
versities. Psychological Science Agenda 25(12).
Wagner CS, Roessner JD, Bobb K, et al. (2011) Approaches
to understanding and measuring interdisciplinary scientific research (IDR): a review of the literature. Journal of
Informetrics 5(1): 14–26.
Wallis J, Borgman C, Mayernik M, et al. (2007) Know thy
sensor: trust, data quality, and data integrity in scientific
digital libraries. In: Kovács L, Fuhr N and Meghini C
(eds) Research and Advanced Technology for Digital
Libraries SE- 32. Vol. 4675, Berlin, Heidelberg: Springer,
pp. 380–391.
Wiggins A and Crowston K (2010) Distributed scientific
collaboration: research opportunities in citizen science.
In: Proceedings of ACM CSCW 2010 workshop on the
changing dynamics of scientific collaborations.
Zachary WW (1977) An information flow model for conflict
and fission in small groups. Journal of Anthropological Research 33: 452–473.