
Big Data is not About Size: When Data Transform Scholarship

Jean-Christophe Plantin (University of Michigan, Université de Technologie de Compiègne), Carl Lagoze (University of Michigan), Paul N. Edwards (University of Michigan), Christian Sandvig (University of Michigan)

Reference: Plantin, J.C., Lagoze, C., Edwards, P., & Sandvig, C. (forthcoming). "Big Data is not About Size: When Data Transform Scholarship," in Mabi, C., Plantin, J.C., & Monnoyer-Smith, L. (eds.), Les données à l'heure du numérique. Ouvrir, partager, expérimenter. Collection PraTICs, Éditions de la Maison des Sciences de l'Homme, Paris.

Abstract

While "big data" is usually presented by focusing on the size of the datasets, we present an alternative view in this article: the consequences of big data for scholarship do not come primarily from an increase in the size of datasets, but rather from the loss of control over the source. The breakdown of the "control zone" due to the uncertain provenance of data has implications for data integrity, and can be disruptive to scholarship in multiple ways. A retrospective look at the introduction of larger datasets in weather forecasting and epidemiology shows that more data can at times be counter-productive, or destabilize already existing methods. Based on these examples, we will look at two implications of big data for scholarship: when the presence of big data disrupts the traditional structure of scholarly communities and the scholarly infrastructure for dissemination and publication.

Introduction

The advent of large amounts of data in the sciences is far from being a new phenomenon. Historians of science have shown that previous eras each saw their own "data deluge." For example, the expansion of travel due to the discovery of the New World brought naturalists an unprecedented volume of specimens, observations, and measurements, forcing them to create new classification systems (Strasser, 2012; Burke, 2011).


In the contemporary period, the term "big data" first appeared in the scientific literature in 1970, and its use slowly increased before reaching a peak after 2008 (Halevi & Moed, 2012). While the term "big data" is still mostly used in computer science, other disciplines now use it as well, such as electrical engineering, telecommunications, mathematics, and business (ibid.). More broadly, the use of this term coincides with the advent of large infrastructures to support data-driven science. For example, the petabytes of data streaming from high-energy physics experiments (studied thoroughly by Knorr-Cetina (1999)) or from the Sloan Digital Sky Survey are certainly big in terms of sheer size (a petabyte is approximately one million billion bytes). They require high-capacity storage systems, high-speed networks to move large datasets back and forth, and MapReduce algorithms that enable parallel computation over these massive datasets.
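(For readers unfamiliar with the pattern, the MapReduce idea can be sketched in a few lines of Python. This is a toy, single-machine word count, not a distributed system: real frameworks such as Hadoop run the map calls in parallel across a cluster, but the two-phase logic is the same.)

```python
from collections import defaultdict
from itertools import chain

def map_chunk(chunk):
    # Map phase: each input chunk is processed independently,
    # which is what makes the computation parallelizable.
    return [(word, 1) for word in chunk.split()]

def reduce_pairs(pairs):
    # Shuffle + reduce phase: group intermediate pairs by key, then aggregate.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

chunks = ["big data is not about size", "data about data"]
mapped = chain.from_iterable(map_chunk(c) for c in chunks)
print(reduce_pairs(mapped))  # {'big': 1, 'data': 3, 'about': 2, ...}
```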

More recently, larger datasets have sparked debate in the humanities and social sciences. Several authors highlight new epistemological considerations that are necessary when conducting research based on big data. Big data research, they argue, comes with its own mythology (boyd & Crawford, 2012), hidden biases, and cult of personality (or "big data fundamentalism"; Crawford, 2013). Scholars also highlight how large datasets frame research questions and methods and exclude other framings (Markham, 2013), the difficulty of choosing between small and large datasets (Mahrt & Scharkow, 2013), and the focus on immediate snapshot analysis (Arbesman, 2013). Authors have shown the difficulties in sampling large datasets, e.g. from the Twitter API (Gerlitz & Rieder, 2013), the shortcomings in dealing with API requests (Vis, 2013), the ethical considerations around accessing personal datasets (boyd & Crawford, 2012), and the need to document methods and the various limitations when publishing research based on big data (Bruns, 2013).

Alternatively, other authors emphasize the opportunities of larger datasets in the sciences. Big data proponents have challenged the necessity of theory (Bollier, 2010), promoting science conceived only as pattern recognition (Anderson, 2008). Others specifically emphasize the size of datasets and their disruptive effects on science. Writing for a Microsoft Research collection, prominent scientists infer the advent of a "fourth paradigm" of data-intensive scientific discovery due to the availability of massive data from ubiquitous sensors and social media. This new paradigm complements or even displaces observation, theory, and simulation as a principal source of scientific knowledge (Hey et al., 2009). More specifically in the social sciences, Lazer et al. (2009) called for a "computational social science" that combines large arrays of digital data (e.g. from email or mobile phones) with social network analysis to study how people interact, move, and communicate.


In the humanities, Lev Manovich and the Software Studies Initiative advocate the use of very large and possibly exhaustive datasets of cultural objects (Manovich, 2012), where all of the covers of Time magazine or the entire body of work of a single painter are used in computational humanities research.

Whether they emphasize the pros or cons of using big data in research, these authors are concerned with the increase in the size of the datasets available. In this article, we propose another view, arguing that the consequences of big data for scholarship do not come primarily from an increase in the size of datasets, but rather from the loss of control over the sources of data. This loss of what Atkinson (1996) calls the "control zone" over the data can be disruptive to scholarship in multiple ways. A retrospective look at the introduction of larger datasets in weather forecasting and epidemiology will show that more data can be counter-productive and destabilize already existing research methods. Based on these examples, we will look at two implications of big data for scholarship: when the presence of big data disrupts the traditional structure of scholarly communities and the scholarly infrastructure for dissemination and publication.

1. The fracturing of the control zone: an alternative view on big data

We propose that a significant disruption brought by big data lies not in its size, but in the fracturing of the traditional provenance chain that forms the basis for determining data integrity and, transitively, research integrity. The methodologies for assessing integrity, trust, and provenance developed in the information and archival sciences provide a framework for making this point. Traditionally, data provenance has been based on physically containable data (e.g., written by hand or stored on magnetic devices that are kept in drawers or file cabinets), and sharing has been based on physical handoff to colleagues. The physicality of both the artifact and the transfer of responsibility provides an unassailable evidence chain.

These highly controlled, regimented procedures were described by Atkinson with the notion of the "control zone." In a seminal 1996 article, "Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone" (Atkinson, 1996), the late Ross Atkinson, then Associate University Librarian at Cornell University, described how the notion of a control zone lies at the foundation of the library. According to Atkinson, the functioning of the library depends on the definition of a clear boundary, a demarcation of what lies in the library and what is outside.


Internal to this boundary, within the "control zone," the library can lay claim to those resources that have been selected as part of the collection, and assert curation, or stewardship, of those selected resources to ensure their consistent availability over the long term. The boundary of the traditional library was easy to define. It was the building that contained and protected the selected physical resources over which the library asserted control and curation responsibility. Correspondingly, from the patron's point of view, the boundary marked what could be called a "trust zone," an area in which they could presume that the integrity guarantees of the library existed. The transition from the physically contained (bricks and mortar) library to the network-based digital library has fractured this formerly well-defined control zone (Lagoze, 2010).

This fracturing of the control zone in libraries offers a useful metaphor for thinking about big data in the sciences. A particularly interesting example is social science research. For the past 70 years, some of the most arresting new facts to come from the social sciences came from shared datasets. That is, a great deal of influential quantitative social science has been built on a foundation of expensive data sources originating from survey research and aggregate government statistics: these include, for example, the Roper Poll, the General Social Survey, the American National Election Studies, and the Current Population Survey. In the US, after the military needs of World War II opened social science to large-scale federal research funding for the first time (Converse, 1987: 242), the government paid for a well-established data infrastructure composed of an international network of highly curated and metadata-rich archives of social science, such as the US ICPSR (Inter-university Consortium for Political and Social Research). These archives continue to play an important role in quantitative social science research. However, the emergence and maturation of ubiquitous networked computing and the ever-growing data cloud have introduced a spectacular quantity and variety of new data sources into this mix. These include massive social media data sources such as Facebook, Twitter, and other online communities, which, when combined with more traditional data sources, provide the opportunity for studies at scales and complexities heretofore unimaginable. This transformation has been described by Gary King, a Harvard political scientist, as the "social science data revolution," characterized by a changing evidence base of social science research (King, 2011a, 2011b).

Another example of this fracturing of the control zone exists in observational science, for example in the identification and counting of species in ecological niches, in astronomy, and in meteorology.


In each of these areas there is a growing interest in what has been termed "crowd-sourced citizen science," which engages numerous volunteers as participants in large-scale scientific endeavors. Our particular experience is with the eBird project, originated at the Cornell Laboratory of Ornithology, a highly successful citizen science project that for over a decade has collected observations from volunteer participants worldwide. By nature, citizen science of this type must contend with the problem of highly variable observer expertise and experience. How can we trust data when they are derived, collected, or aggregated from sources whose credentials do not conform to standard criteria such as academic degree, publication record, or institutional affiliation? It comes as no surprise that crowd-sourced citizen science makes a substantial portion of the formal scientific community uneasy (Sauer, Peterjohn, & Link, 1994), especially in fields where people's lives are at stake, such as medicine (Raven, 2012).

As both of these examples show, the traditional and well-defined control zone for scientific data has broken down for a number of reasons. For the researcher, an enticing array of data is now available from non-traditional sources, such as social networks. Mashups of data, many of which mix traditional and non-traditional sources, are becoming increasingly common. Funders, the public, and fellow scientists are demanding better access to data and, more generally, "open data," which grates against the organization of some existing (closed) data archives. As a result, the traditional criteria for assessing data integrity are being challenged.

While proponents of big data present its introduction as fundamentally additive (more fuel for the fire of research), the addition of new data to existing research disciplines has in the past often had a destabilizing effect on how research is conducted. In the following section, we present two examples to think through the implications of new data.

2. How big data destabilizes modes of knowledge production

The inclusion of uncontrolled data in the scientific process is both inevitable and necessary. In the same manner that it would be impossible for the library to return to the era of control over physical resources within well-defined bricks and mortar institutions (Lagoze, 2010), it is unrealistic for any of the sciences (physical, social, natural, biological, etc.) to deny the reality and potential benefits of a sociotechnical knowledge infrastructure that by nature mixes the formal with the informal. However, adding data coming from uncontrolled and potentially unreliable sources can jeopardize traditional and historically successful modes of knowledge production. Some examples from weather forecasting and epidemiology will illustrate this shift.

2.1. The case of weather forecasting

Weather forecasting is one important area where more data are not necessarily helpful. Meteorologists called for a réseau mondial (worldwide network) of weather data exchange around the turn of the 20th century, based on the telegraph networks that were already routinely used for regional data exchange, especially in Europe. And indeed, worldwide meteorological data exchange did become possible in the first half of the 20th century, based on a combination of telegraph, teletype, shortwave radio, and several other media. Yet forecasters never tried to acquire most of these data, and they threw away much of what they did receive, because pre-computer forecasting techniques simply could not use them within the short time (hours) available for creating a useful forecast. Even climatologists, who did not face the time pressure of forecasting, could not (before computers) make use of much of the available data; instead, they developed a system of distributed calculation, in which stations would pre-compute such figures as monthly average temperatures and report only those, rather than provide all the raw data to central collectors.

When computer forecast models first became available in the mid-1950s, they faced a different problem. Weather models divide the atmosphere into three-dimensional grids and compute transformations of mass, energy, and momentum across grid boxes on a short time step (every few minutes). Every grid point must be supplied with a value at each time step; values cannot simply be zeroed out. Yet most instrument observations of weather are taken every few hours (not minutes), and few if any weather stations or other instruments are located exactly at the grid points used by the models. So forecasters developed techniques for interpolating gridpoint values, in time and in space, from observations. This later led to a far more complicated technique known as four-dimensional (4-D) data assimilation. 4-D data assimilation systems ingest observations as they arrive, using them to correct or steer an essentially model-based forecast process in which the previous forecast is used as a "first guess" for the current time period.

Forecasters like to say that weather models carry information from data-rich areas, where there are dense observing networks, to data-poor areas, which lack them.


When a weather system moves from a data-rich to a data-poor area, the forecast made while that system was still in the data-rich area becomes the first guess for its development in the data-poor area, thus transferring the information acquired in the data-rich area forward in time to its later location.

A surprising conclusion from over five decades of experience with this process is that more real data (i.e. observations) are not necessarily helpful. First, uneven global coverage is more of a problem than insufficient data volume. Second, observations inevitably contain errors due to instrument bias, weather station siting, local topography, and dozens of other factors. Third, since the error characteristics of observations are not perfectly known, the best forecast centers now generate dozens of different datasets from the observations, in order to simulate the likely range of possibility of the true state of the atmosphere. Then they run forecasts on each of these data streams. The idea is that since we can't know exactly what the errors in the observational data actually are, the statistical properties of a few dozen forecasts run on a few dozen variations of those data are most likely to approximate what will really happen. One of the most striking ways forecasters have pictured this process is to describe 4-D data assimilation as "a unique and independent observing system" (Bengtsson & Shukla, 1988, p. 1134) that can generate better, more detailed images of actual planetary weather than can instruments alone. In other words, simulation models (albeit guided by real observations) give you better data than do your instruments. Or, to say it even more provocatively, simulated data, appropriately constrained, are better than real data.

The example of weather forecasting shows that the quest to obtain more data leads in strange directions. Weather forecasters sought and eventually integrated many new data sources, especially satellites, leading to vast data volumes anyone would describe as big data. Yet the actual practice of forecasting incorporates only some of these observational data. It also shows how the very meaning of "data" has shifted, over time, toward a definition that accepts processed, simulated, and analyzed data as ultimately more useful and more accurate than anything instruments alone can provide.
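The ensemble logic described above can be illustrated with a deliberately toy sketch in Python. This is not a real forecast system: the grid, the stations, the "model step," and the assumed observation error are all invented for illustration. What it shows is the two key moves: interpolating gridpoint values from off-grid observations, and running the forecast on many perturbed variations of the same observations, so that the ensemble's statistics, rather than any single run, become the forecast.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy 1-D "atmosphere": temperature on a regular model grid.
grid_x = np.linspace(0.0, 100.0, 51)

# Sparse, irregular station observations (not located at grid points),
# each carrying an unknown instrument/siting error.
station_x = np.array([3.0, 18.5, 42.0, 77.3, 95.1])
station_t = np.array([15.2, 14.8, 16.1, 13.5, 12.9])
assumed_obs_error = 0.5  # assumed standard deviation, degrees C

def first_guess(grid, obs_x, obs_t):
    """Interpolate gridpoint values from off-grid observations."""
    return np.interp(grid, obs_x, obs_t)

def model_step(field):
    """Stand-in for a forecast model: advect and smooth the field."""
    return 0.5 * field + 0.5 * np.roll(field, 1)

# Ensemble: dozens of perturbed variations of the same observations,
# each perturbation drawn from the assumed error distribution.
n_members = 30
forecasts = []
for _ in range(n_members):
    perturbed = station_t + rng.normal(0.0, assumed_obs_error, station_t.shape)
    field = first_guess(grid_x, station_x, perturbed)
    for _ in range(24):  # 24 model time steps
        field = model_step(field)
    forecasts.append(field)

forecasts = np.array(forecasts)
# The forecast is the ensemble's statistics, not any single model run.
print("ensemble mean at x=50:  ", forecasts[:, 25].mean())
print("ensemble spread at x=50:", forecasts[:, 25].std())
```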

2.2. The case of epidemiology

As our earlier introduction of big data has made clear, many conceptions of big data do not clearly define what they mean by "big."


Researchers in the social sciences, for instance, seem to habitually use the phrase "big data" when they mean web data or social media data, even though the datasets they ultimately build may not be large. Bigness, when it does appear, is then imagined only in terms of procedure, as a problem of performance requirements, for example (Magoulas & Lorica, 2009, p. 2). In contrast, when other researchers have grappled with transformations similar to big data, the increasing scale of the datasets that researchers build has proven to be epistemologically significant. Another useful precursor to today's big data debates that makes this point clear is epidemiology over the last few decades.

In an initially controversial movement termed "evidence-based medicine" (Sackett et al., 1996), doctors of the 1980s sought to join forces with epidemiologists and better integrate systematic findings from associative population studies into the clinical practice of medicine. Observational datasets (meaning data from studies that did not involve random assignment to treatment and control groups) that were large by the standards of the day held out the promise of solving large numbers of clinical puzzles at one swoop. The biggest concern at the time wasn't that these methods wouldn't work, but that they were "a dangerous innovation, perpetrated by the arrogant" (71) that would make medicine impersonal by replacing the subjective judgments of doctors with cold statistics.

But the cold statistics did not work as expected. Recall that pundits like Anderson (2008) imagine large datasets as semi-automatically generating new truths. In the medicine of the 1990s, however, more data led to more falsehoods. Epidemiologists produced a plethora of new medical associations, such as "Vitamin E lowers the risk of heart attack" and "a low-fat diet prevents cancer in women" (examples from Young, 2009). These new findings received widespread press coverage, only to be refuted when tested via expensive, randomized, controlled clinical trials. A 1997 editorial cartoon proclaimed the New England Journal of Medicine to be the "New England Journal of Panic-Inducing Gobbledygook" and depicted the research process as a series of spinning roulette wheels (Borgman, 1997).

One cause of this state of affairs was that the statistical techniques then in wide use were calibrated, sometimes implicitly, to smaller sample sizes. Researchers who were used to straining to detect anything at all were suddenly able to detect small statistical associations of questionable clinical value with ease; they did so, and these associations were published. Trained to habitually worry about false negatives (that is, missing important but true effects), researchers had little experience in worrying about false positives, which proliferated (Ziliak & McCloskey, 2007).
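The mechanism is easy to demonstrate. The following toy simulation (invented data, not an epidemiological method) screens a battery of risk factors that have no true effect and still "finds" spurious associations at the conventional 0.05 threshold; it also shows how a very large sample makes a clinically negligible association statistically significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Screen many risk factors, none of which truly affects the outcome.
n_subjects = 50_000
n_factors = 200

false_positives = 0
for _ in range(n_factors):
    exposure = rng.normal(size=n_subjects)
    outcome = rng.normal(size=n_subjects)  # independent of exposure: a true null
    r, p = stats.pearsonr(exposure, outcome)
    if p < 0.05:
        false_positives += 1

# At the conventional 0.05 threshold, roughly 5% of pure nulls look like findings.
print(f"{false_positives} of {n_factors} null factors appear significant")

# Large samples also make trivially small real effects "significant":
exposure = rng.normal(size=n_subjects)
outcome = 0.02 * exposure + rng.normal(size=n_subjects)  # tiny true effect
r, p = stats.pearsonr(exposure, outcome)
print(f"r = {r:.3f} (clinically negligible), yet p = {p:.2g}")
```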


The situation in which a sensational new finding appears and is then quickly proven false was termed "the Proteus effect" (Ioannidis, 2005: 698), as researchers learned that the truth, like Proteus and the sea, is mutable.

The use of more and larger epidemiological research studies also revealed that all associations are not created equal. After detecting all of the conspicuous relationships between risk factors and disease, such as the fact that smoking can increase the risk of lung cancer by 3000% (Taubes & Mann, 1995: 164), researchers were left to sift the data for the remaining, inconspicuous associations, which proved subtle and weak by comparison. As these remaining factors might influence disease by 300%, 30%, or perhaps 3%, they proved very difficult to isolate. In the words of one researcher in the mid-1990s, "We're pushing the edge of what can be done with epidemiology" (167). The first victories with big data promised a future of successes that proved impossible to deliver, and this imperiled the discipline. More foundationally, one new truth that emerged is that many of the older truths in epidemiology were also wrong. New data destabilized old knowledge, leading to recent assessments that the bulk of all published findings in the field are incorrect (Ioannidis, 2005).

Just as computer power allowed larger datasets to be processed with existing methods, it also enabled new methods, invigorating a 200-year-old philosophical debate about the use of Bayesian inference in place of frequentist inference (currently the dominant technique in applied statistics). Advances in computers and networks gave researchers more data, but analyzing these data highlighted some of the flaws in frequentist methods. The same advances also gave researchers new computational power that made the alternative Bayesian models tractable. The story of Proteus in epidemiology did not lead to a backlash against evidence, and sample size is still seen as a positive, yet the big data hopes of early evidence-based medicine have been sharply curtailed. Some medical statisticians now advocate that the discovery of an apparently important new finding in a research study should be taken first and foremost as evidence of an error, as any associational study should properly be expected to find nothing of value (Ioannidis, 2008a).


The earlier crises in epidemiology eventually led from cries for more evidence to a wide-ranging conversation about the institutions of science and scientific publishing, which were implicated as drivers of false results. The medical statistician John Ioannidis recently characterized epidemiology as potentially supporting two kinds of researchers, the "aggressive discoverer" and the "thoughtful replicator" (2008b: 646). The aggressive discoverer, he argued, is currently rewarded by the system of scientific publication, yet produces findings that are extremely unlikely to be true. In Ioannidis's call to action, in addition to more data, successful epidemiology demands Bayesian statistical methods, a new system of scientific publication that rewards replication and tracks null findings, and ultimately a revaluation of the role of data itself. In Ioannidis's terms, the aggressive discoverer in medicine conceptualizes datasets as private goldmines not to be shared, while the successful practice of research demands that databases be "public" (646) so that suspect findings can be checked.

As noted previously, in meteorology new data led to an increased reliance on elaborate modes of conditioning and assimilating data, and the field moved away from the analysis of needlessly raw and needlessly large datasets. In epidemiology, new data led to a proliferation of contradictory findings. Although the datasets (presumably) had a known provenance, the discipline is now arguing that the researchers themselves are not trustworthy and that data need to be widely shared in order to allow all new claims to be extensively checked. We will further develop the consequences of big data for scholarly publishing below, after we describe the implications of big data for collaborations between scientific disciplines.

3. Big data as transformative for scholarship

Our argument thus far has been that there is more to big data than just its bigness. In particular, we are looking for characteristics of the data that are disruptive on the sociotechnical level, rather than existing as only a scaling problem that requires a technical solution (e.g., more bandwidth, larger disk drives, parallel computing techniques). In the first section we defined big data through the breakdown of the control zone due to the uncertain provenance of data and its implications for trust and integrity. In the second section we saw that the addition of new data destabilizes disciplines in unpredictable ways. In this section we look at two implications of big data for scholarship:


when the presence of big data disrupts the traditional structure of scholarly communities, and when it disrupts the scholarly infrastructure for dissemination and publication.

3.1. Unexpected scientific collaborations

There is a plethora of work, both quantitative and qualitative, that attempts to define the proper unit of analysis for investigating scholarly work. Across this work, there is a common assumption that the institutionally defined notion of a discipline is not the proper granularity for understanding this community structure. Despite the fact that the academy, funding agencies, publication venues, learned societies, and other institutions are constructed on the basis of the standard disciplinary taxonomy, a close look at any single discipline reveals a tremendous variety of methodologies, epistemologies, publication practices, and other cultural norms. In response, research in this area has produced a number of other atomic units of study, including "invisible colleges," "communities of practice," "collaboratories," and the like. Despite the problematic nature of characterizing the proper unit (and, indeed, it is questionable whether there is a single granularity to be uncovered), it is clear from even the most superficial empirical analysis that scholars fit into subcommunities that differ from each other along a variety of dimensions. These include size, varying from the classical paradigm of the solitary bench chemist to the 100+ person teams of high-energy physics; methodology, for example based in experimentation, simulation, observation, interpretation, or intervention; primary publishing venue, including journals, conferences, and monographs; and numerous other dimensions. Many of these distinctions are based on historical situations that are no longer relevant, and thus they can be characterized as path dependencies. Yet this is irrelevant to the fact that they are culturally imprinted and drive not only the nature of the research that is done, but also the reward systems (i.e., promotion and tenure) that drive scholarly careers.

Our premise here is that the realities of big data align well with this post-disciplinary scientific age, as they directly conflict with the cultural norms that disciplines represent. We see this disruption firstly at the scale of a single unit of scholarly practice, and secondly in contexts that bring together multiple scholarly cultures (e.g., multidisciplinary, interdisciplinary, and transdisciplinary scholarship). To see an example of the first type of disruption, within a single field of scholarship, we can look at a flagship big data science project, the Human Genome Project.


Conceived in the mid-1980s, the Human Genome Project successfully identified and mapped the entire human genome and made the results available through a fully open, shared, network-accessible database. The resulting data are arguably one of history's greatest achievements of cooperative scientific enterprise. As described by Glasner (1996) and Hilgartner (1998), the creation and openness of this primary data source had profound effects on the traditional small-group structure of fields such as microbiology. Big data, in this case, produced a context in which traditional group barriers became dysfunctional, and progress from that point forward depended on a broader collaborative framework among scientists in the field and a more substantial level of cooperation and sharing. Thus, in this well-known example, the data of the Human Genome Project disrupted a long-standing structure of work.

Turning now to the second disruption, one of the ways that big data can perturb traditional modes of scholarship is when manipulating these data requires collaboration between scholars who were heretofore separated by established disciplinary, field, or institutional barriers. These data-driven collaborations arise from a number of factors, a few of which are described in the remainder of this section. These frequently multidisciplinary collaborations are intriguing because they create affinities between formerly separate epistemologies and methodologies, providing the context for radically new ways of thinking and doing. As stated by Kertcher (2010), "[c]ombining knowledge from different domains is the essence of innovation, as it offers individuals and organizations a potent recipe to break away from cemented, path dependent cognitive molds." But, for the same reason, these collaborations are profoundly challenging, because they mix together historically based, and often surprisingly different, research cultures with their attendant evaluation systems, publication regimes, privacy and sharing norms, and other field-specific characteristics. We find the concept of "frictions" (Edwards, Mayernik, Batcheller, Bowker, & Borgman, 2011; Edwards, 2010) particularly useful for thinking about the challenges encountered in data-centered interdisciplinary collaborations. Frictions occur at field-specific boundaries within these projects and require time, energy, and attention to be resolved if a project is to be successful. Frequently, these resolutions require the disruption of familiar routines, values, and practices for both the data scientists and the domain scientists.

However, participating in a new scientific collaboration demanded by the sheer amount of data to process and analyze is far from an intuitive process, as many cultural frictions among disciplines arise. "Forced marriage" big data collaborations are often disruptive due to profound cultural differences. Finholt and Birnholtz (2006) discuss these differences within a theoretical framework developed by Hofstede


(Hofstede, 1980; Hofstede, Hofstede, & Minkov, 2010) to describe cultural differences in the context of national attitudes. They point to two aspects of Hofstede's framework that are particularly problematic in collaborations between domain scientists and data scientists. The first is uncertainty avoidance: in their study of a big data project involving ecologists and computer scientists, the ecologists were notably risk-averse, owing to the difficulty of observation, experimentation, and data gathering, in contrast to the computer scientists, who were highly tolerant of uncertainty. The second is power distance, referring to the propensity for hierarchical organizational structures: the ecologists were more comfortable with hierarchies, while the computer scientists were comfortable with loose organizational structures.

Another cultural friction, described by Kervin, Finholt, and Hedstrom (2012), is that attitudes towards openness and the sharing of results differ substantially across scholarly fields. As Vertesi and Dourish (2011) point out, these differing attitudes can even exist within a specific scholarly field, based on the organizational structure of the research group rather than the subject of study. Finally, there is the issue of collaborative rhythm described by Jackson, Ribes, Buyuktur, and Bowker (2011). According to these authors, a substantial source of friction in multidisciplinary collaborations is the often opposing and frequently systemic temporal patterns among the individuals in the collaboration. A classic example is the contrast between the rhythm of a field ecologist, whose research pattern is determined by the appearance of some natural event (e.g., the annual migratory pattern of a particular species of bird), and that of the computer scientist, whose rhythm of work is guided by the speed of the CPU. The combination of these two rhythms in a closely collaborative, interdependent project is likely to create cultural conflict.

In this section we have described a number of situations where big data is indeed disruptive due to the manner in which it breaks down the traditionally established walls around scholarship. These walls may be constructed within a field (e.g., between labs practicing a similar type of research) or between fields (e.g., the cultural barriers that separate domain scientists and data scientists). Finally, we turn our attention to publication and dissemination as a context in which big data is potentially disruptive.

3.2. The alternative dissemination of scholarship

As described by Roosendaal and Geurts (1997), the scholarly value chain has been at the core of the infrastructures for the dissemination and publication of results since the inception of the modern scholarly communication system in the 18th century.


Roosendaal and Geurts distinguish the following functions, which must be fulfilled by every system of scholarly communication regardless of its actual implementation:

- Registration, which allows claims of precedence for a scholarly finding.
- Certification, which establishes the validity of a registered scholarly claim.
- Awareness, which allows actors in the scholarly system to remain aware of new claims and findings.
- Archiving, which preserves the scholarly record over time.
- Rewarding, which rewards actors for their performance in the communication system, based on metrics derived from that system.

Two key components of the scholarly communication infrastructure are the article, as the unit of publication and citation, and peer review, as the guarantor of certification and validity. As described by Van de Sompel, Payette, Erickson, Lagoze, and Warner (2004), the emergence of the Internet and the web created a technological infrastructure for publication and dissemination that seriously challenges the traditional model. Furthermore, the increasing importance of data, and in particular big data, in scientific results has added substantial stress to the traditional system (Borgman, 2011; Pepe, Mayernik, Borgman, & Van de Sompel, 2009; Wallis, Borgman, & Mayernik, 2010; Wynholds, Wallis, Borgman, Sands, & Traweek, 2012). Any disruption to the publication model of scholarship is a direct disruption of scholarship itself, because, as listed above, the final step in the value chain is rewarding. Any replacement or refactoring of the publication system in response to the new demands must account for rewarding as a necessary function. We describe below two ways in which big data places new demands on the system.

The first, citation, involves the way in which the practice of citation weaves the framework for evaluating scholarly impact at the journal level (the impact factor), at the individual paper level, and at the author level (e.g., via metrics like the h-index). The infrastructure for citation at the article level is well established. However, the infrastructure for the citation of data is not standardized and remains experimental (Lawrence, Jones, Matthews, Pepler, & Callaghan, 2011; Parsons, Duerr, & Minster, 2010). Absent a well-established and standardized scheme for citing data, it is impossible to fulfill key aspects of the scholarly value chain.
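To see the asymmetry, consider how simple and standardized the article-level metrics are. A minimal sketch of the h-index computation, with invented citation counts, fits in a few lines of Python; nothing comparably standardized yet exists for citing or crediting datasets:

```python
def h_index(citation_counts):
    """h-index: the largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the rank-th most-cited paper still has >= rank citations
        else:
            break
    return h

# An author whose papers are cited 10, 8, 5, 4, and 3 times has h = 4:
print(h_index([10, 8, 5, 4, 3]))  # -> 4
```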


The second domain of conflict is peer review. As noted above, the certification and validity of the existing scholarly communication system depend directly on peer review. Although there have been numerous calls and suggestions for a new regime of review, collectively known as altmetrics (Priem, Piwowar, & Hemminger, 2012), human-centered traditional peer review still drives the certification system. However, mechanisms for the peer review of data are barely established and highly problematic (Parsons et al., 2010). For example, the technological requirements for article review are simple: display the text on screen or print it on paper. In contrast, evaluating data, which can exist in a wide variety of forms, from spreadsheets to petabytes of binary information, often requires elaborate technical scaffolding for access to software and computational resources, scaffolding that is frequently impractical or unavailable. In these ways, the increasing prevalence of data-intensive science and big data seriously disrupts the fundamental publication and dissemination system of scholarship.

4. Conclusion

In this article, we aimed to define big data neither by size nor by the software and technical infrastructure adapted to processing large datasets. Instead, we described it through the associated loss of control over data sources. We focused on the impact of using large datasets in the context of stresses on the traditional scholarship infrastructure, especially on the disciplinary structure of the sciences, the role of methods, scientific collaborations, and the publication infrastructure. While large and potentially exhaustive datasets are often presented as disruptive, we showed that the exit from the "control zone" can also destabilize modes of knowledge production: the case studies of weather forecasting and epidemiology showed how the availability of larger datasets can be counter-productive or destabilizing. This fracturing of the control zone can cause other disruptions as well, such as blurring the traditional separations between scientific disciplines and unsettling the peer-reviewed journal as the benchmark for scholarly dissemination.

If big data means big uncertainty about the provenance of data, a major challenge for scientists wishing to use new data sources remains the question of validity. For instance, how is it possible to assess the validity of datasets that arrive via a social media API? We can suggest two paths to resolving this problem, both based on retrospectively recovering provenance. In our research with eBird, we have been investigating ways to reconstruct observer/contributor expertise from the aggregated data. Our realization has been that expertise is too nuanced a factor to reconstruct, but that experience, interpreted as "deliberate practice," is an effective path to expert performance (Ericsson & Charness, 1994). Evidence of experience can be extracted from the aggregated data: for example, the frequency of contributions, or the diversity of contributions as measured by species distribution. By devising ways to recognize these traces, we hope to develop mechanisms that aid scientists in determining the expertise (and perhaps integrity) of anonymous data contributors.
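As a deliberately simplified sketch of what recognizing such traces might look like (this is not eBird's actual pipeline; the records and the choice of Shannon diversity as the measure are our illustrative assumptions), the following Python fragment computes the two signals just mentioned from invented observation data:

```python
import math
from collections import Counter

# Hypothetical observation records: (observer_id, species) pairs.
observations = [
    ("obs1", "robin"), ("obs1", "robin"), ("obs1", "blue jay"),
    ("obs2", "robin"), ("obs2", "warbler"), ("obs2", "heron"),
    ("obs2", "blue jay"), ("obs2", "osprey"),
]

def experience_signals(records, observer):
    """Proxy signals for observer experience, recovered from aggregated data:
    contribution frequency and Shannon diversity (in nats) of reported species."""
    species = [s for (o, s) in records if o == observer]
    n = len(species)
    counts = Counter(species)
    diversity = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {"n_contributions": n, "species_diversity": round(diversity, 3)}

for obs in ("obs1", "obs2"):
    print(obs, experience_signals(observations, obs))
# obs2 contributes more often and across more species: a higher proxy for experience.
```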


Another approach is to employ digital forensics (Reith, Carr, & Gunsch, 2002), a set of techniques increasingly popular in the intelligence and legal communities, which, like our work with expertise, recovers traces of origin and provenance metadata from the digital artifact itself. None of these techniques bring us back to the good old days when well-defined control zones provided a physical trust environment for data. Nostalgia will not move us forward, and, hopefully, the more nuanced approach outlined here will give us the advantages of the new networked world without throwing out the fundamental supports on which science depends.

How do these transformations apply specifically to the social sciences and humanities? The advent of large datasets is far from new in these fields. We previously mentioned the ICPSR as an example of big data social science, and the collaboration between Roberto Busa and IBM is a good example of big data in the humanities: Busa and IBM collaborated on creating the Index Thomisticus to perform large-scale analysis of the complete works of Saint Thomas Aquinas. Focusing on the humanities, using big data requires researchers to take into account the specificities of their scholarship. Borgman (2007) highlighted the central role of the library as a data source in humanistic research, the importance of text mining in what has traditionally been an individualistic form of scholarship, and the risks implied by technology-based scholarship (which may not be considered for tenure). To envision the potential evolution of these aspects, the digital humanities can act as a laboratory, as this field of practice offers many experiments in research methods (Burdick et al., 2012), collaboration between disciplines (Scheinfeldt, 2010), and data sharing and publication infrastructure (Pepe et al., 2009).

As we have shown, the addition of more data to an existing discipline can challenge established forms of scholarship in unexpected ways. It is thus helpful to consider some of the many "moments of big data" that have already occurred across the history of the sciences.


This does not allow us to predict the future of research, but rather to realistically assess the range of possible implications of big data for scholarship.

Sources cited

Anderson, C. (2008, June 23). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. WIRED. Retrieved from http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

Atkinson, R. (1996). Library Functions, Scholarly Communication, and the Foundation of the Digital Library: Laying Claim to the Control Zone. The Library Quarterly, 66(3). Retrieved from http://www.jstor.org/stable/4309129

Bengtsson, L., & Shukla, J. (1988). Integration of Space and in Situ Observations to Study Global Climate Change. Bulletin of the American Meteorological Society, 69(10), 1130-1143.

Bollier, D. (2010). The Promise and Peril of Big Data. The Aspen Institute. Retrieved from http://www.aspeninstitute.org/publications/promise-peril-big-data

Borgman, J. (1997, April 27). Today's Random Medical News. The New York Times: E4.

Borgman, C. L. (2007). Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, MA: MIT Press.

Borgman, C. L. (2009). The Digital Future is Now: A Call to Action for the Humanities. Digital Humanities Quarterly, 3(4). Retrieved from http://www.digitalhumanities.org/dhq/vol/3/4/000077/000077.html

Borgman, C. L. (2011). The Conundrum of Sharing Research Data. Journal of the American Society for Information Science, 63(6). Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1869155

boyd, danah, & Crawford, K. (2012). Critical Questions for Big Data. Information, Communication & Society, 15(5), 662-679.

Burdick, A., Drucker, J., & Lunenfeld, P. (2012). Digital_humanities. Cambridge, MA: MIT Press.


Burke, P. (2011). A Social History of Knowledge II. Cambridge: Polity Press.

Bruns, A. (2013). Faster than the speed of print: Reconciling "big data" social media analysis and academic scholarship. First Monday, 18(10). Retrieved from http://firstmonday.org/ojs/index.php/fm/article/view/4879

Crawford, K. (2013). The Raw and the Cooked: The Mythologies of Big Data. DataEDGE 2013. Retrieved from http://www.ischool.berkeley.edu/node/24376

Converse, J. M. (1987). Survey Research in the United States: Roots and Emergence, 1890-1960. Berkeley: University of California Press.

Cragin, M. H., Palmer, C. L., Carlson, J. R., & Witt, M. (2010). Data sharing, small science and institutional repositories. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1926), 4023-4038.

Edwards, P., Mayernik, M. S., Batcheller, A., Bowker, G., & Borgman, C. (2011). Science Friction: Data, Metadata, and Collaboration. Social Studies of Science, 41(5). Retrieved from http://sss.sagepub.com/cgi/doi/10.1177/0306312711413314

Ericsson, K. A., & Charness, N. (1994). Expert performance: Its structure and acquisition. American Psychologist, 49(8), 725-747.

Finholt, T. A., & Birnholtz, J. P. (2006). If We Build It, Will They Come? The Cultural Challenges of Cyberinfrastructure Development. In W. S. Bainbridge & M. Roco (Eds.), Managing Nano-Bio-Info-Cogno Innovations: Converging Technologies in Society (pp. 89-102). Berlin: Springer.

Gerlitz, C., & Rieder, B. (2013). Mining One Percent of Twitter: Collections, Baselines, Sampling. M/C Journal, 16(2). Retrieved from http://journal.media-culture.org.au/index.php/mcjournal/article/view/620

Glasner, P. (1996). From community to "collaboratory"? The Human Genome Mapping Project and the changing culture of science. Science and Public Policy, 23(2), 109-116.


Halevi, G., & Moed, H. F. (2012). The Evolution of Big Data as a Research and Scientific Topic: Overview of the Literature. Research Trends, 30, September 2012.

Hey, T., Tansley, S., & Tolle, K. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.

Hilgartner, S. (1998). Access to Data and Intellectual Property: Scientific Exchange in Genome Research. Washington, DC: National Academy of Sciences.

Hofstede, G. (1980). Culture's Consequences: International Differences in Work-Related Values. SAGE Publications.

Hofstede, G., Hofstede, G. J., & Minkov, M. (2010). Cultures and Organizations: Software of the Mind (3rd ed.). McGraw-Hill.

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), 696-701.

Ioannidis, J. P. A. (2008a). Finding Large Effect Sizes: Good News or Bad News? The Psychologist, 21(8), 690-691.

Ioannidis, J. P. A. (2008b). Why Most Discovered True Associations Are Inflated. Epidemiology, 19(5), 640-648.

Jackson, S. J., Ribes, D., Buyuktur, A., & Bowker, G. C. (2011). Collaborative rhythm. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (CSCW '11) (p. 245). New York: ACM Press.

Kertcher, Z. (2010). Gaps and Bridges in Interdisciplinary Knowledge Integration. In M. Anandarajan (Ed.), e-Research Collaboration (pp. 49-64). Berlin: Springer.

Kervin, K., Finholt, T., & Hedstrom, M. (2012). Macro and micro pressures in data sharing. In Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration (IRI 2012) (pp. 525-532).


Knorr-Cetina, K. (1999). Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, MA: Harvard University Press.

King, G. (2011a). The Social Science Data Revolution. Horizons in Political Science. Cambridge, MA: Harvard University. Retrieved from http://gking.harvard.edu/files/gking/files/evbase-horizonsp.pdf

King, G. (2011b). Ensuring the data-rich future of the social sciences. Science, 331(6018), 719-721.

Lagoze, C. (2010). Lost Identity: The Assimilation of Digital Libraries into the Web. Cornell University, Ithaca, NY. Retrieved from http://carllagoze.files.wordpress.com/2012/06/carllagoze.pdf

Lawrence, B., Jones, C., Matthews, B., Pepler, S., & Callaghan, S. (2011). Citation and Peer Review of Data: Moving Towards Formal Data Publication. International Journal of Digital Curation, 6(2), 4-37.

Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., ... Van Alstyne, M. (2009). Computational Social Science. Science, 323(5915), 721-723. doi:10.1126/science.1167742

Magoulas, R., & Lorica, B. (2009). Big Data: Technologies and Techniques for Large-Scale Data. O'Reilly. Retrieved from http://strata.oreilly.com/2009/03/big-data-technologies-report.html

Manovich, L. (2012, March 16). Computational humanities vs. digital humanities. Retrieved from http://lab.softwarestudies.com/2012/03/computational-humanities-vs-digital.html

Mahrt, M., & Scharkow, M. (2013). The Value of Big Data in Digital Media Research. Journal of Broadcasting & Electronic Media, 57(1), 20-33.

Markham, A. N. (2013). Undermining "data": A critical examination of a core term in scientific inquiry. First Monday, 18(10). Retrieved from http://firstmonday.org/ojs/index.php/fm/article/view/4868


Parsons, M. A., Godoy, O., LeDrew, E., de Bruin, T. F., Danis, B., Tomlinson, S., & Carlson, D. (2011). A conceptual framework for managing very diverse data for complex, interdisciplinary science. Journal of Information Science, 37(6), 555-569.

Parsons, M. A., Duerr, R., & Minster, J.-B. (2010). Data Citation and Peer Review. Eos, Transactions American Geophysical Union, 91(34), 297.

Pepe, A., Mayernik, M., Borgman, C. L., & Van de Sompel, H. (2009). From artifacts to aggregations: Modeling scientific life cycles on the semantic Web. Journal of the American Society for Information Science and Technology, 61(3), 567-582.

Priem, J., Piwowar, H. A., & Hemminger, B. M. (2012). Altmetrics in the wild: Using social media to explore scholarly impact. arXiv preprint arXiv:1203.4745. Retrieved from http://dblp.uni-trier.de/db/journals/corr/corr1203.html#abs-1203-4745

Raven, K. (2012). 23andMe's face in the crowdsourced health research industry gets bigger. Spoonful of Medicine: a blog from Nature Medicine. Retrieved from http://blogs.nature.com/spoonful/2012/07/23andmes-face-in-the-crowdsourced-health-research-industry-gets-bigger.html

Reith, M., Carr, C., & Gunsch, G. (2002). An examination of digital forensic models. International Journal of Digital Evidence, 1, 1-12.

Roosendaal, H., & Geurts, P. (1997). Forces and functions in scientific communication: an analysis of their interplay. In Cooperative Research Information Systems in Physics. Oldenburg, Germany.

Sackett, D. L., Rosenberg, W. M., Gray, J. A. M., Haynes, R. B., & Richardson, W. S. (1996). Evidence based medicine: what it is and what it isn't. British Medical Journal, 312, 71-72. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2349778/

Sauer, J. R., Peterjohn, B. G., & Link, W. A. (1994). Observer Differences in the North American Breeding Bird Survey. The Auk, 111(1), 50-62.

Scheinfeldt, T. (2010, May 26). Why Digital Humanities is "Nice." Found History. Retrieved from http://www.foundhistory.org/2010/05/26/why-digital-humanities-is-%e2%80%9cnice%e2%80%9d/


Strasser, B. J. (2012). Data-driven sciences: From wonder cabinets to electronic databases. Studies in History and Philosophy of Biological and Biomedical Sciences, 43(1), 85-87.

Tenopir, C., Allard, S., Douglass, K., Aydinoglu, A. U., Wu, L., Read, E., & Frame, M. (2011). Data Sharing by Scientists: Practices and Perceptions. PLoS ONE, 6(6), e21101. Retrieved from http://dx.plos.org/10.1371/journal.pone.0021101

Vis, F. (2013). A critical reflection on Big Data: Considering APIs, researchers and tools as data makers. First Monday, 18(10).

Van de Sompel, H., Payette, S., Erickson, J., Lagoze, C., & Warner, S. (2004). Rethinking Scholarly Communication: Building the System that Scholars Deserve. D-Lib Magazine, 10(9). Retrieved from http://www.dlib.org/dlib/september04/vandesompel/09vandesompel.html

Vertesi, J., & Dourish, P. (2011). The Value of Data: Considering the Context of Production in Data Economies. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (CSCW '11) (pp. 533-542). New York: ACM Press.

Wallis, J. C., Borgman, C. L., & Mayernik, M. S. (2010). Who is Responsible for Data? Exploring data authorship, ownership, and responsibility and their implications for data curation. Retrieved from http://works.bepress.com/borgman/251/

Wynholds, L. A., Wallis, J. C., Borgman, C. L., Sands, A., & Traweek, S. (2012). Data, data use, and scientific inquiry. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '12) (p. 19). New York: ACM Press.

Ziliak, S. T., & McCloskey, D. N. (2007). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.

Young, S. S. (2009). Everything is Dangerous. American Scientist (video). Retrieved from http://www.americanscientist.org/science/pub/everything-is-dangerous-a-controversy