
    Original Research Article

Big Data, data integrity, and the fracturing of the control zone

    Carl Lagoze

    Abstract

Despite all the attention to Big Data and the claims that it represents a "paradigm shift" in science, we lack understanding about what are the qualities of Big Data that may contribute to this revolutionary impact. In this paper, we look beyond the quantitative aspects of Big Data (i.e. lots of data) and examine it from a sociotechnical perspective. We argue that a key factor that distinguishes "Big Data" from "lots of data" lies in changes to the traditional, well-established "control zones" that facilitated clear provenance of scientific data, thereby ensuring data integrity and providing the foundation for credible science. The breakdown of these control zones is a consequence of the manner in which our network technology and culture enable and encourage open, anonymous sharing of information, participation regardless of expertise, and collaboration across geographic, disciplinary, and institutional barriers. We are left with the conundrum: how to reap the benefits of Big Data while re-creating a trust fabric and an accountable chain of responsibility that make credible science possible.

    Keywords

    Big Data, control zone, paradigm shift, sociotechnical

    Big Data is not only about being big

The popular and scholarly literature is filled with excitement about Big Data. A good deal of the enthusiasm comes from the business sector, where Big Data offers new possibilities for direct and micro marketing, supply-chain optimization, and other means of increasing efficiency and profits. This enthusiasm has also spread to the public sector, particularly in the areas of security and terrorism prevention. In this paper, we examine the impact of Big Data in the context of science,1 encompassing the research that takes place in the academic, corporate, and government milieu. Admittedly, the line between commercial research (distinguished from corporate research such as that which takes place at IBM Watson) and scientific research can be fuzzy, but we distinguish the former as motivated by financial concerns (e.g. product improvement for profit improvement), whereas the latter is motivated by the search for some "truth". Some argue that Big Data represents a new paradigm of science, a "fourth paradigm" (Hey et al., 2009), adopting the terminology used by Kuhn (1970) to characterize the revolutionary transformation of a scientific field.2

While many view this new paradigm as complementary rather than substitutive to pre-existing paradigms (observation, experimentation, and simulation), others like Chris Anderson have taken a more extreme view, claiming that Big Data represents the "end of theory" (Anderson, 2008). Our goal in this paper is to pull back from the hype and take a more measured, analytical approach to Big Data, focusing on the question "what are the characteristics of (some) Big Data that manifest a paradigm shift in the fundamental assumptions of science?"

We distinguish between Big Data characteristics that have methodological consequences and those that impact epistemological foundations. We characterize the former as important but not paradigm-shifting. In contrast, we argue that a paradigm shift is indeed evident when Big Data impacts epistemological foundations.

University of Michigan, School of Information, Ann Arbor, MI, USA

Corresponding author:
Carl Lagoze, University of Michigan, 105 S. State Street, Ann Arbor, MI 48103, USA.
Email: [email protected]

Big Data & Society, July–December 2014: 1–11
© The Author(s) 2014
DOI: 10.1177/2053951714558281
bds.sagepub.com

Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/), which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (http://www.uk.sagepub.com/aboutus/openaccess.htm).

Downloaded from bds.sagepub.com by guest on December 21, 2014

Embedded in this argument is the assumption that the characteristics we are looking for are not native to all uses of big (in size) data. And, in fact, it may be true that data that is not necessarily quantitatively large may have characteristics that are paradigm-shifting when used in certain contexts and by certain communities of use.

Before proceeding any further with an analysis of "big" as a qualifying characteristic of (some) "data", it is important to establish a definition of "data", whether big or small. A National Academies report (A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases, 1999) provides a simple and inclusive foundation definition: "data are artifacts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors." Although this definition is useful, it fails to capture the "relative" nature of data (in contrast to it having an "essential" nature). As Borgman (2011) states: "[d]ata may exist only in the eyes of the holder: the recognition that an observation, artifact, or record constitute data is itself a scholarly act." This perspective of data is reflexive (something, e.g. images, text, or an Excel worksheet, is data because someone uses it as data in a specific context) and transcendent (it carries across the many disciplines, practices, and epistemologies of science).

This relational/contextual perspective gives us the basis for examining the Big Data phenomenon in a manner that both crosses epistemological boundaries and is contextualized by them. With due recognition of the dangers of making generalizations about "science", we hope to establish some fundamental aspects of Big Data that are indeed boundary-crossing, while remaining shaped by (and shaping) specific disciplinary practices.

Having established this relativistic definition of data, we return to the notion that "Big Data is not only about being big"; that there is some combination of features or dimensions (perhaps among them size) that may have revolutionary effects on science and knowledge production. This multidimensional perspective is evident in many of the popularized, mass-market descriptions of Big Data.

One popular multidimensional definition of Big Data is based on the so-called 3Vs: Volume, Velocity, and Variety (Laney, 2001). Volume is the size factor. Velocity refers to the speed of accumulation, the resulting dynamic nature of the data, and the high-scale processing capacity needed to make it useful and keep it current. Finally, Variety refers to the mixing together, or mashing-up, of heterogeneous data types, models, and schema and the need to resolve these differences in order to make the data useful. Others have enhanced this list with additional "Vs": Validity, the amount of bias or noise in the data; Veracity, the correctness and accuracy of the data; and Volatility, the persistence and longevity of data (Normandeau, 2013). The first two of these, Validity and Veracity, are of particular interest to the argument of this paper.

Mayer-Schönberger and Cukier in their best-selling book Big Data offer an alternative but complementary set of characteristics of Big Data, which they claim "challenges the way we live and interact with the world" (Mayer-Schönberger, 2013, p. 6). They characterize Big Data as revolutionary because it enables/embodies "three shifts [characteristics] in the way we analyze information and transform how we understand and organize society." The first is the "more" characteristic, which they posit as the foundation for the two other characteristics. A notable aspect of "bigness" according to the authors is its equivalence to "allness" (n = all). Throughout the book they assert that Big Data obviates the need for traditional (in their view flawed) sampling techniques and increasingly can be considered a complete view of the object of investigation. We later question this argument and initially note that even if the n = all principle were true, the notion of data providing a "complete view" of reality, in the objective sense, is met with skepticism by a number of modern scholars (Bowker, 2014; Edwards, 2010; Gitelman, 2013). The second characteristic is "messy," the effect of which is diminished by the n = all characteristic. In their words, "looking at vastly more data also permits us to lessen our desire for exactitude." The third and final characteristic is the shift in analytical technique from causality to correlation. "Most strikingly, society will need to shed some of its obsession for causality in exchange for simple correlations; not knowing why but only what." We will return to Mayer-Schönberger and Cukier later in this paper to further critique their n = all claim and its implications for new paradigm science.

These two attempts to define Big Data, and many others like them, fail to adequately capture the nuances and contexts of use of Big Data that may make it revolutionary and the driver of a new scientific paradigm. Employing Kuhn's words, when are Big Data "tradition-shattering complements to the tradition-bound activity of normal science" (Kuhn, 1970)? To answer this question, we need to examine Big Data from a sociotechnical perspective (Bijker, 1995; Lamb and Sawyer, 2005). We need to investigate their social, cultural, historical, and technical facets and the interplay and tensions among these facets that collectively establish the impact of Big Data on science and the possible transformation thereof. An analysis of this sort will allow us to distinguish the aspects of Big Data that, no matter how contributory to innovation, may be more evolutionary than revolutionary, from those that are indeed paradigm-shifting. Furthermore, it will help us distinguish between locality (discipline- and/or field-specific characteristics of Big Data) and globality (aspects of Big Data that may be paradigm-shifting across the scholarly enterprise).

    Lots of data or Big Data?

Because technology is such a basic enabler and component of Big Data practices (i.e. computation including hardware, software, and algorithmic components; high-speed networks; massive storage arrays), it is useful to build our argument on the notions of new technological paradigms (Dosi, 1982) and of disruption (Christensen and Rosenbloom, 1995; Rosenbloom and Christensen, 1994). Originating in the business and organizational behavior sector, these two concepts nicely complement Kuhn's theories, which focus on the scholarly domain.

Dosi distinguishes between evolutionary paths of technological change and new technological paradigms that represent discontinuities from pre-existing technological paths and address new classes of problems. Christensen and Rosenbloom expand on this with the notion of disruptive innovation, which is a discontinuity not only in the technological aspect of a product or service, but also a sociotechnical disruption; a contextual change in the set of valuations and values that frame and are impacted by the technical innovation. Christensen initially applied this theory of disruption to product lines (Christensen, 1997), with a prime example being the successive introduction of smaller hard disk platters that initially seemed noncompetitive with the disk products of mainstream manufacturers, but eventually and repeatedly obliterated the mainstream markets due to their framing within the revolutionary personalization of computing. Christensen has also applied this theoretical framework to health care (Christensen et al., 2008a) and education (Christensen et al., 2008b).

By leveraging the theoretical frameworks of Kuhn, Rosenbloom, and Christensen, we argue that a disruption in science (a.k.a. the creation of a new paradigm) is not just methodological, a way of doing (a.k.a. technical), but also must be sociotechnical. It must challenge existing epistemological norms, ways of knowing and framing the fundamental scientific questions of the field; institutional ecologies (Star and Griesemer, 1989), agreements on scope, assumed knowledge, and boundaries of research work; reward structures, paths to tenure and promotion; and communication regimes, mechanisms, and norms for disseminating knowledge. We will use this scaffolding for the remainder of this essay to distinguish between what we will call lots of data, the effects of which are by and large methodological and technical, and true Big Data, that which entails epistemological and, as a result, paradigmatic change.

Our distinction between these two terms (lots of data, which entails methodological change and technical innovation, and Big Data, which implies the re-evaluation of epistemological foundations) should not be interpreted as an attempt to segregate data into two disjoint silos, i.e. data set 1 is "lots of data", in contrast to data set 2 that is genuine "Big Data". Our intention, rather, is to establish these concepts as continuous dimensions with which instances of data use can be evaluated in order to understand the degree and origins of their methodological and/or paradigm-shifting effects; i.e. a use of data set 1 has high "lots of data" impact but low "Big Data" impact, while a use of data set 2 has low "lots of data" impact but high "Big Data" impact. The term "instances of data use", in contrast to simply "data", is intentional and refers to the fact that, similar to the definition of data, the methodological and epistemological impact of data must be evaluated within the context of use. An important facet of this context is the distinct epistemic culture (Knorr-Cetina, 1999) of the community of use and its particular perspectives on data and its meaning. In other words, the same data set may "measure" differently according to the "Big Data" and "lots of data" dimensions when employed by different disciplinary communities and/or for different purposes.

Although the primary focus of the remainder of this paper is the Big Data dimension (when, how, and why does data use challenge the epistemological foundations of science), it is useful, for the purpose of contrast, to briefly examine the companion lots of data dimension. This brevity should not be construed as dismissive towards the significance of these technical challenges and the methodological impacts they have. Indeed, there are great challenges here, and the scholarly and practical effects of meeting these challenges can be profound, albeit not paradigm-shifting.

Two often-cited instances of data use demonstrate the lots of data dimension. The petabytes of data streaming in from high-energy physics experiments (studied thoroughly by Knorr-Cetina, 1999) or those that are components of the Sloan Digital Sky Survey (Szalay and Gray, 2001) are certainly Big Data in terms of size. But, considered alone, their bigness and the issues associated with them are by and large technical. These communities have historic cultures of data sharing (Ginsparg, 1994; Knorr-Cetina, 1999) and, in fact, their data has always been "big" relative to the quantitative definitions of the day. This is similar to the situation with many domains of science that have a legacy of exploring and manipulating large data sets, where "large" is historically contextualized relative to the technical affordances of the time (Gitelman, 2013).

The massive quantity of data in these two examples clearly introduces issues about new high-capacity storage systems, high-speed networks to easily move them back and forth, and map-reduce algorithms that permit parallel computation over these massive data sets. A recent white paper co-authored by leading data science researchers (Agrawal et al., n.d.) provides a useful list of the cross-cutting challenges that need to be met to respond to these issues: heterogeneity and incompleteness, scale, timeliness or speed, privacy, and human collaboration. All of these are formidable challenges. However, the need for these new methodologies and tools to manipulate, store, and curate these massive data sets does not correspond to a paradigm-shifting disruption of the historically data-focused epistemic culture of the communities of practice that engage with these data.
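The map-reduce pattern mentioned above can be illustrated with the canonical word-count example. The sketch below is our own minimal, single-machine illustration, not the actual pipelines used in high-energy physics or sky surveys: the map phase runs independently over data partitions (here a thread pool stands in for a cluster of workers), the shuffle groups intermediate pairs by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(partition):
    # Map: emit a (word, 1) pair for every word in this partition.
    return [(word, 1) for record in partition for word in record.split()]

def shuffle(mapped_outputs):
    # Shuffle: group all emitted values by key across partitions.
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key independently.
    return {key: sum(values) for key, values in groups.items()}

def word_count(partitions):
    # Each partition is mapped by a separate worker; in a real system
    # the reducers would likewise be distributed, one per key range.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, partitions))
    return reduce_phase(shuffle(mapped))

counts = word_count([["big data", "lots of data"], ["data integrity"]])
# counts["data"] == 3
```

The point of the pattern is that both the map and reduce phases are embarrassingly parallel; only the shuffle requires coordination, which is why the approach scales to the petabyte collections discussed here.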

A recent paper by Leonelli (2014) in the inaugural issue of this journal explores the same issue in the discipline of biology.3 Similar to this paper (albeit limited to a single discipline), Leonelli aims "to inform a critique of the supposedly revolutionary power of Big Data science," likewise defining revolutionary as synonymous with creating a new epistemology and a new set of norms. Similar to our earlier examples in physics and astronomy, she notes that "data-gathering practices in subfields [of the life sciences] have been at the heart of inquiry since the early modern era, and have generated problems ever since." She then aims the bulk of her critique at Mayer-Schönberger and Cukier's claims that data completeness mitigates data messiness and their championing of correlation over causality, which we will return to later in this paper. She finishes by rejecting the notion that Big Data is exerting a revolutionary effect on the epistemology of biology itself, claiming that "there is a strong continuity with practices of large data collection and assemblage since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of the inquiry in the area of science." In contrast to epistemic effects on the discipline itself, she acknowledges significant methodological challenges "encountered in developing and applying curatorial standards for data..." and in the dissemination of that data.

On the other hand, the sensitivity of the evolutionary versus revolutionary impact of big (or even of any) data to epistemic culture becomes evident in the context of digital humanities (or, as some call it, computational humanities, and its specializations such as computational history). The level of controversy over the "datafication" (Mayer-Schönberger, 2013) of historical and/or literary artifacts (whether at massive scale such as the Google Books Project or at the scale of a single literary corpus) can be viewed as evidence of resistance to the introduction of a new epistemology, based on data, that is viewed by some as threatening, and perhaps inferior, to existing and historically based epistemologies (Bruns, 2013; Rosenberg, 2013).

These examples in physics, astronomy, biology, and the humanities (and many similar ones) lead us to conclude that mere bigness, lots of data (which appears to have different meanings in different scholarly fields), is not the basis for declaring a new paradigm in science. Furthermore, we can be fairly confident that such a blanket declaration without attention to the confounding factor of epistemic cultures warrants skepticism.

    Data integrity and credible science

With these caveats in mind, however, we do claim that there might be some cross-cutting framing of data and their application across the entire scholarly endeavor, recognizing that this framing needs to be parameterized to a particular use of data within a particular epistemic culture. Then, we need to understand how Big Data might challenge this common framing, thereby becoming "tradition shattering" (Kuhn, 1970).

At the forefront is the notion of data integrity, which we assert is a consistent and discipline-crossing foundation of credible science (Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age, 2009; Nowotny, 2001). We intentionally use the term integrity rather than correctness or quality; the latter terms ascribe a level of positivism to data that many modern scholars refute (Edwards, 2010; Gitelman, 2013). Integrity, on the other hand, has a more constructivist tone, implying notions of "trust", "fitness for use", and "consensual understanding", all of which are contextual and relative to epistemic culture, in contrast to the implicitly binary notion of correctness. Looking at this from the perspective of infrastructure to support data sharing (using "infrastructure" in its broadest, most sociotechnical sense; Edwards et al., 2007), we can then draw the links from integrity to trust, and ultimately to provenance (the evidence upon which trust is established), and propose that determining the degree of data integrity is based on the ability to answer a number of questions. What is the origin of these data? Who has been responsible for them since their origination? Can we apply our standard notions for trust and integrity to them? Do our standard methodologies for interpreting them and drawing conclusions from them make sense? Big Data is then those data that disrupt fundamental notions of integrity and force new ways of thinking and doing to reestablish it. Said differently, Big Data is data that makes us rethink our notions of credible science.

Our attention here to the issues of data and scientific integrity is coincident with a growing concern with the reliability of scientific knowledge. The notion of a crisis in reliability has been discussed in the media (Naik, 2011), and in scientific journal articles (Brembs and Munafò, 2013) and editorials ("Announcement," 2013; Jasny et al., 2011). Some of the concern about reliability has been fueled by well-publicized cases of scientific fraud and data falsification in a number of scientific fields (Harrison et al., 2010; "Researcher Faked Evidence of Human Cloning, Koreans Report," 2006; Verfaellie and McGwin, 2011). In addition, a number of academics are warning about the prevalence of false results in the scientific literature (Ioannidis, 2005; Pöschl, 2004).

But, as pointed out by Stodden (2014), some of this concern arises from the increasing prevalence of data-intensive (Big Data) science across the disciplines, and the application of computational, analytical methods to those data without complete understanding of their characteristics (e.g. the nature of the sample represented by the data). Absent full understanding of the data (and in some cases a failure to account for this lack of intimacy with the data), researchers have at times unwittingly or sloppily applied methodological tools or epistemological understanding to those data that failed to account for the fundamental differences between them and traditional highly curated and reliable data. As pointed out by Lazer et al. (2014), "...most Big Data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis."

Of particular concern in this area have been scientific results based on data sources of questionable provenance and integrity such as distributed sensors (Wallis et al., 2007) and "black box social media," where the origin and basis of the data are difficult to determine (Driscoll and Walker, 2014) and the algorithmic bias on the conclusions is difficult to unravel (Gillespie, 2014). A well-known example of the foibles of reliance on informally collected data and algorithmic projection is Google Flu Trends (GFT), which raised huge scientific optimism about the predictive utility of informally collected data when first published in Nature in 2009 (Ginsberg et al., 2009). This optimism suffered a serious setback in 2013 when the GFT predictions for that year were shown to be seriously exaggerated (Butler, 2013; Lazer et al., 2014). A complete accounting for this setback is beyond the scope of this paper. However, one acknowledged factor is an overconfidence in the veracity of the data as a true sample of reality, rather than a random snapshot in time and the result of algorithmic dynamics.

We acknowledge that this emphasis on data integrity (a.k.a. quality) stands somewhat in opposition to the popularized claims by Mayer-Schönberger and Cukier that "looking at vastly more data... permits us to loosen up a desire for exactitude" and effectively allows us to ignore "messiness" in data (Mayer-Schönberger, 2013). As mentioned earlier, this claim and subsequent claims by the authors seem to rely heavily on n = all, that is, Big Data is not a sample but a complete set. We find this claim highly suspicious and agree with fellow scholars (Boyd and Crawford, 2011; Bowker, 2014) who take the position that any data, no matter what its size, is de facto a sample, with bias implicit due to choice of instrumentation, span of observation, units of measurement, and numerous other factors. In essence, n never equals all; all is a limit in mathematical terms that can be approached but never attained. This point is also emphasized by Leonelli, who states that "having a lot of data is not the same as having all of them; and cultivating such a vision of completeness is a very risky and potentially misleading strategy" (Leonelli, 2014). Thus, if one denies sampling and its effects on messiness or on our ability to derive meaning from correlations, as Mayer-Schönberger and Cukier seem to do, they tread on questionable territory in terms of high-integrity science, and may indeed have an argument that is more appropriate to business and commerce. Again quoting Leonelli, "it is no coincidence that most of the examples given by Mayer-Schönberger and Cukier come from the industrial world, and particularly globalized retail strategies as is the case of Amazon.com" (Leonelli, 2014).
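The claim that any data set, however large, is de facto a sample with implicit bias can be made concrete with a small simulation. The sketch below is purely illustrative, using a synthetic population we invented rather than data from any study cited here: a sample of hundreds of thousands of records collected through a biased instrument misses the true rate badly, while a simple random sample of only 1,000 lands close to it. Bigness does not cure bias.

```python
import random

random.seed(42)

# Synthetic population: 1,000,000 people; 30% have some trait.
population = [1] * 300_000 + [0] * 700_000
random.shuffle(population)
true_rate = sum(population) / len(population)  # 0.3

# "Big" sample with coverage bias: the instrument over-observes
# trait-holders (e.g. a platform whose users skew toward one group),
# capturing every trait-holder but only half of the rest.
biased = [x for x in population if x == 1 or random.random() < 0.5]
big_estimate = sum(biased) / len(biased)

# Small simple random sample of 1,000.
srs = random.sample(population, 1_000)
small_estimate = sum(srs) / len(srs)

print(f"true rate:          {true_rate:.3f}")
print(f"biased (n={len(biased)}): {big_estimate:.3f}")  # ~0.46, far off
print(f"random (n=1000):    {small_estimate:.3f}")      # close to 0.30
```

This is the pattern behind the GFT episode described above: no amount of additional biased data moves the estimate toward the truth, because the error comes from the instrument, not the sample size.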

As a point of reference, it is useful to look at the notions of integrity, trust, and provenance in the context of archives and archival science, for which they are essential concepts. Hirtle (2000) describes the meanings of these terms and the manner in which they are core to the definition of the archive in the context of the ship Constellation, a tourist destination in Baltimore harbor that was mistakenly identified as a revolutionary war ship when its vintage was really the US Civil War. According to Hirtle (2000), "at the heart of an archive... are records that are created by an agency or organization in the course of its business and that serve as evidence of the actions of that agency or organization [italics added]." Furthermore, "one way in which archivists working with... records have sought to ensure the enduring value of archives as evidence is through the maintenance of an unbroken provenance for the records [italics added]." Implicit in the notion of "unbroken provenance" is control over storage and transfer; in order to serve as evidence, an archival record must demonstrate a complete, unbroken, historical knowledge of the item of interest, who has been in control of it, and by what means it has been transferred or moved to other authorities. Fans of crime shows on TV or of detective novels should find this notion quite familiar; the evidence presented in a court of law is useless if law enforcement has lost control of it and it may have been tampered with.
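For readers who think in code, an unbroken provenance chain can be sketched as a custody log in which each record commits to its predecessor. This hash-chain design is our own illustration of the concept, not a mechanism proposed by Hirtle or by this paper: any retroactive edit to a record invalidates every hash that follows it, which is what makes the chain serve as evidence of unbroken control.

```python
import hashlib
import json

def add_transfer(chain, custodian, action):
    # Each record commits to the previous one, so a later edit to an
    # earlier record invalidates every hash downstream of it.
    prev_hash = chain[-1]["hash"] if chain else "genesis"
    record = {"custodian": custodian, "action": action, "prev": prev_hash}
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(record)
    return chain

def verify(chain):
    # Recompute every hash; an unbroken chain means no record was
    # altered after the fact and no handoff is missing.
    prev_hash = "genesis"
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev"] != prev_hash:
            return False
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True

chain = []
add_transfer(chain, "field lab", "data collected")
add_transfer(chain, "university archive", "data deposited")
assert verify(chain)

chain[0]["custodian"] = "unknown"  # tamper with history
assert not verify(chain)
```

The breakdown of the control zone discussed next is precisely the loss of such witnessed handoffs: when data circulates through open networks, no single authority maintains the equivalent of this log.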

    Defining the control zone

Taking a cue from archival science then, we should look at the role of control (and unbroken provenance) as a necessary (but not necessarily sufficient) factor in data integrity. Traditional data origination, sharing, and reuse were based on the reality of containable and concrete physical data (e.g. written by hand or stored on magnetic devices that are kept in drawers or file cabinets) and data sharing practices based on physical handoff to known colleagues. The physicality of both the data and the transfer of data amounted to a well-defined control zone resulting in a provenance chain that was documented and witnessed. Before examining the breakdown of this control zone in the context of Big Data, in the next section we examine the same notion and its role in the disruption of another knowledge infrastructure (Edwards et al., 2013) that has over the past two decades undergone considerable change: the library. In a seminal 1996 article, "Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone" (Atkinson, 1996), the late Ross Atkinson, then Associate University Librarian at Cornell University, describes how the notion of a control zone lay at the foundation of the library. According to Atkinson, the functioning of the library depends on the definition of a clear boundary, a demarcation of what lies in the library and what is outside. Internal to this boundary, within the control zone, the library can lay claim to those resources that have been selected as part of the collection, and assert curation, or stewardship, of those selected resources to ensure their integrity, availability, and stability over the long term.

The boundary of the traditional library was easy to define. It was the "bricks and mortar" structure with a clear and controlled entry point that contained and protected the selected physical resources over which the library asserted control and curatorial responsibility. Correspondingly, from the patron's point of view, the boundary marked what could be called a "trust zone", an area to which entry and exit were clearly marked and in which patrons could presume the existence of the library's integrity guarantees. Integrity, in this case, does not imply veracity of the resources of the library, but adherence to principles of proper information stewardship, including accurate description, longevity of the resources, and adherence to some selection criteria.

In Lagoze (2010), we describe how the move from physical to digital information resources, and the attendant access to them through the web architecture, profoundly disrupted the foundation of the control zone. This disruption was not anticipated by participants, practitioners, and researchers in the early digital library initiatives, who foresaw technical but not institutional change. In fact, some predicted that in the end "[digital] library services would follow a familiar model" (Gladney et al., 1994). Others saw the Internet as another familiar evolutionary technical change, similar to past challenges to libraries, stating that "The anarchy of the Internet may be daunting for the neophyte, but it differs little from the bibliographic chaos that is the result of five and a half centuries of the printing press" (Lerner, 1999).

Two decades later, it is clear that the implications of moving from physical to digital information, and of network access to that information, are more than a technical phenomenon; they amount to more than the observation that "digital information crosses boundaries easily" (Van House et al., 2003) and are in fact deeply disruptive to the library. By viewing the library as a meme,4 rather than just as an institution or a physical artifact, we can see the roots of the disruption. At its base lies the disintegration of the library's very foundation, the control zone. The notions of a clear boundary, and the attendant concepts of being inside or outside, disappear in the web architecture, where users (i.e. patrons) no longer enter through a well-defined door, but ride hyperlinks and land wherever they may choose in the digital library. Attempts to reassert a boundary by defining a new digital door or portal and establishing branding signposts marking inside vs. outside have proven incompatible with the dominant web context and have largely failed. With the collapse of the control zone, other fundamental components of the library meme become difficult to implement or anachronistic relative to the increasingly normative broader web context. These include selection, deciding what information sources are available to patrons; intermediation, acting as a buffer between information creators and information users; bibliographic description, providing "order making" via the catalog; and fixity, guaranteeing the immutability of information resources.

In conclusion, the wholesale transition of our intellectual, popular, and cultural heritage to the digital realm has been accompanied by a disruptive change in our expectations about our knowledge infrastructures. The notions of selection, intermediation, bibliographic description, and fixity that are core principles of the library meme stand at odds with the web information meme. These contradictions have become sharper as the web has moved over the past decade into the web 2.0 era and beyond. Open access to information, active participation in knowledge production and annotation, and the integration of social and knowledge activities are now the expected norm. Libraries are certainly part of this modern knowledge infrastructure. But they exist as participants in a world of competing "knowledge institutions" (e.g. Wikipedia, Facebook, Twitter). Meanwhile, notions of information integrity, which were formerly grounded in institutional frameworks such as the library, remain problematic and in search of new ways to certify the provenance of information resources.

Rethinking credible science in the age of Big Data

With knowledge of this precedent, we can now return to Big Data and recognize parallels between the historical transitions of the library and the transformations in the ways that scholarly data are created, shared, and used. The relatively well-controlled mechanisms (both cultural and technical) for data creation, data sharing, and data reuse are under pressure for a number of reasons. Funders, the public, and fellow scientists are demanding, for good reason, better access to data and in general "open data" (Huijboom and Broek, 2011; Molloy, 2011; Murray-Rust, 2008), motivating the creation of numerous data repositories (Greenberg et al., 2009; Hahnel, 2012; Michener et al., 2011) that allow easy and generally anonymous access to scientific data on a global scale. Science in general is becoming more collaborative and interdisciplinary (Barry and Born, 2013; Haythornthwaite et al., 2006; Wagner et al., 2011), at least partly due to the multidisciplinary scope of grand challenge problems like climate change, breaking down traditional closely-knit teams of colleagues and bringing together scholars with different epistemic and methodological cultures. An increasing number of data sources originate from nontraditional means, such as social networks, for which concerns about integrity and provenance are not priorities. Mashups of data are becoming increasingly common, blurring the lines between formal and informal data. Scientists seem to have a love/hate relationship with this new reality. While they support the abstract idea of open data (Cragin et al., 2010; Tenopir et al., 2011), their sharing practices, and sharing preferences, remain relatively closed and motivated by control (Borgman, 2011; Edwards et al., 2011).

Quantitative social science research provides an interesting example of this data transition and its impact on the control zone. For the past 50 years, quantitative social science has been built on a shared foundation of data sources originating from survey research, aggregate government statistics, and in-depth studies of individual places, people, or events. Underlying these data is a well-established and well-controlled infrastructure composed of an international network of highly curated and metadata-rich social science archives such as the Inter-University Consortium for Political and Social Research5 (ICPSR) and the UK Data Archive.6 These archives continue to play an important role in quantitative social science research. However, the emergence and maturation of ubiquitous networked computing and the ever-growing data cloud have introduced a spectacular quantity and variety of new data sources into this mix. These include social media data sources such as Facebook, Twitter, and other online communities in which individuals reveal massive amounts of information about themselves that are invaluable for social science research. When combined with more traditional data sources, these provide the opportunity for studies at scales and complexities heretofore unimaginable. This transformation has been described by Gary King, a Harvard political scientist, as the social science data revolution, which is characterized by a "changing evidence base of social science research" (King, 2011a, 2011b). These new opportunities present formidable new challenges to the fabric of social science research. Among those mentioned by King (2011b) are privacy challenges; problems of sampling bias in uncontrolled data sets; a change in the basic "job descriptions" of social scientists, with demand for new skills in statistical methods, computational methods, and the like; and the need for new cross-disciplinary collaborations (i.e. breaking down the silos in which social science scholars formerly existed). Clearly this is an example of Big Data rather than just lots of data.

Another example of this fracturing of the control zone exists in observational science, for example, identification and reporting of phenomena (e.g. species) in ecological niches, astronomy, and meteorology. In each of these areas there is a growing interest in what has been termed crowd sourced citizen science, which engages numerous volunteers as participants in large-scale scientific endeavors (Wiggins and Crowston, 2010). The opportunities for large-scale citizen science arise from the ubiquitous networking and computing context and especially the recent spectacular growth in the use of mobile devices. The motivations for leveraging this large-scale volunteer workforce as observational "sensors" are substantial. The geographic scope of the observational spaces and the varieties of habitats make reliance on trained observers (e.g. scientists) infeasible. Our particular experience in this area is with the eBird project,7 originated at the Cornell Laboratory of Ornithology, a highly successful citizen science project that for over a decade has collected observations from volunteer participants worldwide (Sullivan et al., 2014). Those data have subsequently been used for a large body of highly regarded and influential scientific research.

It comes as no surprise that crowd sourced citizen science makes a substantial portion of the formal scientific community uneasy (Sauer et al., 1994), especially in fields where people's lives are at stake, such as medicine (Raven, 2012). These data, by nature, break down a well-established control zone in which data are collected by experts, or by individuals managed by experts, who carefully abide by scientific methods. In contrast, citizen science of this type must contend with the problems of highly variable observer expertise and experience. How can we trust data, or the science that results from those data, when their provenance is rooted in sources whose own provenance does not conform to "standard" criteria such as degree, publication record, or institutional affiliation?

The examples described above are only two of the many instances in which new varieties of Big Data are undermining traditional control zones of science. If we look longitudinally, we can see that examples such as these are only the beginning of the problem. The fractured control zones, and the resulting uncertain provenance and trust, only intensify through the lifecycle of sharing, reuse, and circulation of data in an open network in which not all participants are deemed trustworthy according to established norms. Looking across this lifecycle, the dilemma very quickly becomes combinatorially more complex. If the control zone around data set A and that around data set B are poorly defined, the zone that results from the reuse and combination of the two is only fuzzier. Of course, this is only the first step in the progressive mashup and "cooking" of these data with other data, a progression that is inevitable when data reuse is easy and strongly encouraged.
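To make this combinatorial degradation concrete, consider a deliberately simple toy model (our own illustration, not a framework from this paper) in which each dataset carries a probability that its chain of custody is intact. If those uncertainties are independent, every combination can only lower the joint confidence.

```python
# A toy model of provenance confidence under data mashup. The numbers and
# the independence assumption are illustrative only.
def combined_confidence(*confidences):
    """Multiply independent per-dataset confidences that each chain of
    custody is intact; each additional uncertain source lowers the result."""
    result = 1.0
    for c in confidences:
        result *= c
    return result

a, b = 0.9, 0.8                      # two moderately trusted datasets
ab = combined_confidence(a, b)       # the mashup is fuzzier than either input
abc = combined_confidence(ab, 0.7)   # further "cooking" degrades it again
assert ab < min(a, b)
assert abc < ab
```

However crude, the model captures the asymmetry in the text: mashups inherit the weakest provenance of their inputs and then get strictly worse.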

Despite the challenges and uncertainties, the inclusion of these "uncontrolled" Big Data in the scientific process is a reality that will continue and perhaps become more common. Our "always there, everywhere" network culture will continue to make more and larger amounts of automatically, accidentally, and informally created data available for science. The value of these data across the scholarly spectrum has been demonstrated numerous times. Social scientists can conduct studies on large-scale social networks that may not replace, but do significantly complement, traditional research based on small-scale social groups (Milgram, 1967; Zachary, 1977). Observational scientists can now accumulate heretofore unavailable evidence of global phenomena, such as bird migrations and climatological events, by leveraging the active participation and contribution of enthusiastic human volunteers.8

Our goal in this paper has not been to propose a normative framework for this reality, but to stimulate and add to discussions and investigations of its entangled social, cultural, historical, and technical implications. Rather than falling back on the hyperbolic claim that "Big Data will change the world," the scholarly community needs to understand it and investigate its implications for science policy and public trust of science. We propose two threads for moving forward: one epistemological, evaluating our understanding of quality in both data and science and our means for determining it; the other methodological, developing means of recovering traditional quality metrics.

The first approach begins by raising the awareness of researchers who use Big Data about its opportunities, complexities, and dangers. This area is reasonably well covered in Boyd and Crawford's (2011) paper "Six Provocations for Big Data", which covers many of the caveats in dealing with this kind of data, including "Claims to Objectivity and Accuracy are Misleading" and "Bigger Data Are Not Always Better Data." As the authors point out, a critical component of using Big Data for research is understanding the integrity of those data: where they originated, what biases are built into them, how data cleaning may lead to overfitting, and what sampling biases may be embedded in them. In this context, we need to evaluate what quality and integrity mean in a networked culture and its numerous possible contexts, in the manner that other scholars are investigating parallel issues such as privacy (Nissenbaum, 2009).

As for methodology, we suggest two technical paths that may offer amelioration of the integrity problem, both based on recovering provenance retrospectively rather than prospectively, as in the traditional manner. In our research with eBird, we have been investigating ways to reconstruct observer/contributor expertise from the aggregated data. Our realization has been that expertise is too nuanced a factor to reconstruct, but that experience, interpreted as deliberate practice, is an effective path to expert performance (Ericsson and Charness, 1994). Evidence of experience can be extracted from the aggregated data; for example, frequency of contributions, the diversity of contributions measured by species distribution, etc. By devising ways to recognize these traces we hope to develop mechanisms that aid scientists in determining the expertise (and perhaps integrity) of anonymous data contributors (reference removed for author anonymity). Another approach might be to employ digital forensics (Reith et al., 2002), a technique increasingly popular in the intelligence and legal communities, which, like our work with expertise, recovers traces of origin and provenance metadata from a digital artifact itself.
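As an illustration of what recovering such experience traces might look like, the sketch below computes two hypothetical signals from an aggregated observation log: each contributor's number of reports and the Shannon diversity of the species they report. The record format, field names, and scoring are our own illustrative assumptions, not the actual eBird methodology.

```python
import math
from collections import Counter, defaultdict

# Hypothetical aggregated log: (contributor, species) observation records.
observations = [
    ("ann", "robin"), ("ann", "wren"), ("ann", "jay"), ("ann", "robin"),
    ("bob", "robin"), ("bob", "robin"), ("bob", "robin"),
]

def experience_traces(records):
    """Derive per-contributor experience signals from aggregated data alone."""
    per_contributor = defaultdict(Counter)
    for contributor, species in records:
        per_contributor[contributor][species] += 1
    traces = {}
    for contributor, counts in per_contributor.items():
        total = sum(counts.values())
        # Shannon diversity: higher when reports span many species evenly.
        diversity = -sum(
            (n / total) * math.log(n / total) for n in counts.values()
        )
        traces[contributor] = {"n_reports": total, "diversity": diversity}
    return traces

traces = experience_traces(observations)
# "ann" reports more often and across more species than "bob",
# so her traces suggest more deliberate practice.
assert traces["ann"]["n_reports"] > traces["bob"]["n_reports"]
assert traces["ann"]["diversity"] > traces["bob"]["diversity"]
```

The point is that such signals are computed after the fact, from the contributed data themselves, rather than from credentials supplied in advance.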

In conclusion, we have argued for an understanding of the difference between lots of data and Big Data. The former, a quantitative feature with mainly technical and methodological implications, has without a doubt had important effects on the way science is done and on what it makes possible. However, the latter, a qualitative feature with profound epistemological and sociotechnical implications, shakes some of the core assumptions of credible science: trust and integrity. As with so many aspects of our modern digital culture, such as journalism (e.g. the New York Times versus the flood of grassroots news blogs) and reference information (e.g. Encyclopedia Britannica versus Wikipedia), it is futile and even undesirable to seek a return to traditional, rigid control zones. Nevertheless, we are left with the challenge of reaping Big Data's benefits while simultaneously holding science to the same standards to which it has been held for centuries.

    Declaration of conflicting interests

    The author declares that there is no conflict of interest.

    Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Notes

1. Throughout this essay we use the term "science" as a generalization for all academic fields, not just the physical, life, etc. sciences.
2. Some well-known examples of Kuhn's notion of a paradigm shift are the introduction of plate tectonics in geology and Einstein's special relativity theory in physics, both of which challenged primary assumptions of their respective fields.
3. Although Leonelli does undertake a disciplinary-level analysis, she acknowledges the flaws of using "discipline" as the unit of study, recognizing the fact that within such coarse granularity lies a wide variety of epistemological and methodological practices.
4. We employ the term "meme" here to expand what we mean by "library" beyond its operational, technical, and institutional characteristics, and consider it in a manner similar to a semiotic sign (Morris, 1938).
5. http://www.icpsr.umich.edu
6. http://www.data-archive.ac.uk
7. http://ebird.org
8. One might conjecture about the possibility of machine sensing replacing the human volunteers. However, bird species observation and identification rely on a highly nuanced combination of visual, auditory, habitat, and other knowledge that will make automated sensing extremely difficult to implement.

References

Agrawal D, Bernstein P, Bertino E, et al. (n.d.) Challenges and Opportunities with Big Data. Available at: https://www.purdue.edu/discoverypark/cyber/assets/pdfs/BigDataWhitePaper.pdf (accessed 28 October 2014).
Anderson C (2008) The end of theory: will the data deluge make the scientific method obsolete? Wired 1–5.
Announcement: Reducing our irreproducibility (2013) Nature 496(7446): 398.
A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases (1999) Washington, DC: The National Academies Press. Available at: http://www.nap.edu/openbook.php?record_id=9692 (accessed 28 October 2014).
Atkinson R (1996) Library functions, scholarly communication, and the foundation of the digital library: laying claim to the control zone. The Library Quarterly 66(3).
Barry A and Born G (2013) Interdisciplinarity: Reconfigurations of the Social and Natural Sciences, 1st ed. New York, NY: Routledge.
Bijker WE (1995) Of Bicycles, Bakelites, and Bulbs: Toward a Theory of Sociotechnical Change. Cambridge, MA: MIT Press.
Borgman CL (2011) The conundrum of sharing research data. Journal of the American Society for Information Science 63(6): 1–40.
Bowker G (2014) The theory/data thing. International Journal of Communication 8(5).
Boyd D and Crawford K (2011) Six provocations for Big Data. SSRN Electronic Journal. DOI: 10.2139/ssrn.1926431.
Brembs B and Munafò M (2013) Deep impact: unintended consequences of journal rank. ArXiv. Available at: http://arxiv.org/abs/1301.3748 (accessed 28 October 2014).
Bruns A (2013) Faster than the speed of print: reconciling "big data" social media analysis and academic scholarship. First Monday 18(10). Available at: http://firstmonday.org/ojs/index.php/fm/article/view/4879/3756 (accessed 7 October 2013).
Butler D (2013) When Google got flu wrong. Nature 494(7436): 155–156.
Christensen CM (1997) The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail. Boston, MA: Harvard Business School Press.
Christensen CM, Grossman JH and Hwang J (2008a) The Innovator's Prescription: A Disruptive Solution for Health Care. New York, NY: McGraw-Hill.
Christensen CM, Horn MB and Johnson CW (2008b) Disrupting Class: How Disruptive Innovation Will Change the Way the World Learns. New York, NY: McGraw-Hill.
Christensen CM and Rosenbloom RS (1995) Explaining the attacker's advantage: technological paradigms, organizational dynamics, and the value network. Research Policy 24(2): 233–257.
Committee on Ensuring the Utility and Integrity of Research Data in a Digital Age (2009) Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Washington, DC: National Academies Press.
Cragin MH, Palmer CL, Carlson JR, et al. (2010) Data sharing, small science and institutional repositories. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 368(1926): 4023–4038.

Dosi G (1982) Technological paradigms and technological trajectories: a suggested interpretation of the determinants and directions of technical change. Research Policy 11(3): 147–162.
Driscoll K and Walker S (2014) Big data, big questions: working within a black box: transparency in the collection and production of big Twitter data. International Journal of Communication 8(0): 20.
Edwards P, Mayernik MS, Batcheller A, et al. (2011) Science friction: data, metadata, and collaboration. Social Studies of Science 41(5): 667–690.
Edwards PN (2010) A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Cambridge, MA: MIT Press.
Edwards PN, Jackson SJ, Bowker GC, et al. (2007) Understanding Infrastructure: Dynamics, Tensions, and Design. Washington, DC: National Science Foundation.
Edwards PN, Jackson SJ, Chalmers MK, et al. (2013) Knowledge Infrastructures: Intellectual Frameworks and Research Challenges. Ann Arbor, MI.
Ericsson KA and Charness N (1994) Expert performance: its structure and acquisition. American Psychologist 49(8): 725–747.
Gillespie T (2014) The relevance of algorithms. In: Gillespie T, Boczkowski P and Foot (eds) Media Technologies: Essays on Communication, Materiality, and Society. Cambridge, MA: MIT Press, p.167.
Ginsberg J, Mohebbi MH, Patel RS, et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232): 1012–1014.
Ginsparg P (1994) First steps towards electronic research communication. Los Alamos Science 8(4): 390–396.
Gitelman L (2013) "Raw Data" Is an Oxymoron (Infrastructures). Cambridge, MA: The MIT Press, p.192.
Gladney HM, Fox EA, Ahmed Z, et al. (1994) Digital Library: Gross Structure and Requirements: Report from a March 1994 Workshop. College Station: IEEE.
Greenberg J, White HC, Carrier S, et al. (2009) A metadata best practice for a scientific data repository. Journal of Library Metadata 9(3–4): 194–212.
Hahnel M (2012) Exclusive: figshare a new open data project that wants to change the future of scholarly publishing. In: Impact of Social Sciences Blog.
Harrison WTA, Simpson J and Weil M (2010) Editorial. Acta Crystallographica Section E Structure Reports Online 66(1): e1–e2.
Haythornthwaite C, Lunsford KJ, Bowker GC, et al. (2006) Challenges for research and practice in distributed, interdisciplinary collaboration. In: Hine C (ed) New Infrastructures for Knowledge Production: Understanding E-science. Information Science Publishing, pp.143–166.
Hey T, Tansley S and Tolle K (eds) (2009) The Fourth Paradigm. Redmond, WA: Microsoft Research.
Hirtle PB (2000) Archival authenticity in a digital age. In: Cullen C, Levy DM, Lynch CA, et al. (eds) Authenticity in a Digital Environment. Washington, DC: Council on Library and Information Resources.
Huijboom N and Broek TD (2011) Open data: an international comparison of strategies. European Journal of ePractice 12: 1–13.
Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
Jasny BR, Chin G, Chong L, et al. (2011) Data replication & reproducibility. Again, and again, and again. . . Introduction. Science (New York, N.Y.) 334(6060): 1225.
King G (2011a) Ensuring the data-rich future of the social sciences. Science (New York, N.Y.) 331(6018): 719–721.
King G (2011b) The social science data revolution. Available at: http://gking.harvard.edu/files/gking/files/evbase-horizonsp.pdf (accessed 28 October 2014).
Knorr-Cetina K (1999) Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, MA: Harvard University Press.
Kuhn TS (1970) The Structure of Scientific Revolutions, 2nd ed. Chicago: University of Chicago Press.
Lagoze C (2010) Lost Identity: The Assimilation of Digital Libraries into the Web (PhD dissertation). Cornell University, Ithaca. Available at: http://carllagoze.files.wordpress.com/2012/06/carllagoze.pdf.
Lamb R and Sawyer S (2005) On extending social informatics from a rich legacy of networks and conceptual resources. Information Technology & People 18(1): 9–20.
Laney D (2001) 3D Data Management: Controlling Data Volume, Velocity, and Variety.
Lazer D, Kennedy R, King G, et al. (2014) The parable of Google flu: traps in big data analysis. Science 343(6176): 1203–1205.
Leonelli S (2014) What difference does quantity make? On the epistemology of Big Data in biology. Big Data & Society 1(1). DOI: 10.1177/2053951714534395.
Lerner FA (1999) Libraries Through the Ages. New York, NY: Continuum.
Mayer-Schönberger V (2013) Big Data: A Revolution that Will Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt.
Michener W, Vieglais D, Vision T, et al. (2011) DataONE: data observation network for earth: preserving data and enabling innovation in the biological and environmental sciences. D-Lib Magazine 17(1/2).
Milgram S (1967) The small world problem. Psychology Today 2: 60–67.
Molloy JC (2011) The open knowledge foundation: open data means better science. PLoS Biology 9. DOI: 10.1371/journal.pbio.1001195.
Morris CW (1938) Foundations of the Theory of Signs. Chicago: University of Chicago Press.
Murray-Rust P (2008) Open data in science. Serials Review 34: 52–64.
Naik G (2011) Mistakes in scientific studies surge. Wall Street Journal. Available at: http://online.wsj.com/news/articles/SB10001424052702303627104576411850666582080.
Nissenbaum H (2009) Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford, CA: Stanford Law Books.
Normandeau N (2013) Beyond volume, variety and velocity is the issue of big data veracity. Available at: http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/ (accessed 15 April 2014).

Nowotny H (2001) Re-Thinking Science: Knowledge and the Public in an Age of Uncertainty, 1st ed. Cambridge, UK: Polity.

Pöschl U (2004) Interactive journal concept for improved scientific publishing and quality assurance. Learned Publishing 17(2): 105–113.

Raven K (2012) 23andMe's face in the crowdsourced health research industry gets bigger. Available at: http://blogs.nature.com/spoonful/2012/07/23andmes-face-in-the-crowdsourced-health-research-industry-gets-bigger.html (accessed 28 October 2014).

Reith M, Carr C and Gunsch G (2002) An examination of digital forensic models. International Journal of Digital Evidence 1: 1–12.

Researcher faked evidence of human cloning, Koreans report (2006) The New York Times, 10 January.

Rosenberg D (2013) Data before the fact. In: ‘‘Raw Data’’ is an Oxymoron. Cambridge, MA: MIT Press, pp. 15–30.

Rosenbloom RS and Christensen CM (1994) Technological discontinuities, organizational capabilities, and strategic commitments. Industrial and Corporate Change 3(3): 655–685.

Sauer JR, Peterjohn BG and Link WA (1994) Observer differences in the North American Breeding Bird Survey. The Auk 111(1): 50–62.

Star SL and Griesemer JR (1989) Institutional ecology, translations and boundary objects: amateurs and professionals in Berkeley's Museum of Vertebrate Zoology, 1907–39. Social Studies of Science 19(3): 387.

Stodden V (2014) Enabling reproducibility in big data research: balancing confidentiality and scientific transparency. In: Privacy, Big Data and the Public Good. Cambridge, UK: Cambridge University Press. Available at: http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/privacy-big-data-and-public-good-frameworks-engagement (accessed 28 October 2014).

Sullivan BL, Aycrigg JL, Barry JH, et al. (2014) The eBird enterprise: an integrated approach to development and application of citizen science. Biological Conservation 169 (January).

Szalay A and Gray J (2001) The world-wide telescope. Science 293(5537): 2037–2040.

Tenopir C, Allard S, Douglass K, et al. (2011) Data sharing by scientists: practices and perceptions. PLoS ONE 6(6): 21.

Van House NA, Bishop AP and Buttenfield BP (2003) Introduction: Digital Libraries as Sociotechnical Systems. Cambridge, MA: MIT Press.

Verfaellie M and McGwin J (2011) The case of Diederik Stapel: Allegations of scientific fraud by prominent Dutch social psychologist are investigated by multiple universities. Psychological Science Agenda 25(12).

Wagner CS, Roessner JD, Bobb K, et al. (2011) Approaches to understanding and measuring interdisciplinary scientific research (IDR): a review of the literature. Journal of Informetrics 5(1): 14–26.

Wallis J, Borgman C, Mayernik M, et al. (2007) Know thy sensor: trust, data quality, and data integrity in scientific digital libraries. In: Kovács L, Fuhr N and Meghini C (eds) Research and Advanced Technology for Digital Libraries SE-32. Vol. 4675, Berlin, Heidelberg: Springer, pp. 380–391.

Wiggins A and Crowston K (2010) Distributed scientific collaboration: research opportunities in citizen science. In: Proceedings of ACM CSCW 2010 workshop on the changing dynamics of scientific collaborations.

Zachary WW (1977) An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33: 452–473.

    Lagoze   11