


Learning by Doing versus Learning by Viewing: An Empirical Study of Data Analyst Productivity on a Collaborative Platform at eBay

YUE YIN, Northwestern University, USA
ITAI GURVICH, Cornell Tech, USA
STEPHANIE MCREYNOLDS, Alation Inc., USA
DEBORA SEYS, eBay Inc., USA
JAN A. VAN MIEGHEM, Northwestern University, USA

We investigate how data-analyst productivity benefits from collaborative platforms that facilitate learning-by-doing (i.e. analysts learning by writing queries on their own) and learning-by-viewing (i.e. analysts learning by viewing queries written by peers). Learning is measured using a behavioral (productivity-improvement) approach. Productivity is measured using the time from creating an empty query to first executing it.

Using a sample of 2,001 data analysts at eBay Inc. who have written 79,797 queries from 2014 to 2018, we find that: 1) learning-by-doing is associated with significant productivity improvement when the analyst's prior experience focuses on the focally queried database; 2) only learning-by-viewing queries that are authored by analysts with a high output rate (average number of queries written per month) is associated with significant improvement in the viewer's productivity; 3) learning-by-viewing also depends on the "social influence" of the author of the viewed query, which we measure 'locally' based on the number of the author's direct viewers per month, or 'globally' based on how the author's queries propagate to peers in the overall collaboration network. Combining results 2 and 3: when segmenting analysts based on output rate and 'local' social influence, the viewing of queries authored by analysts with high output but low local influence is associated with the largest improvement in the viewer's productivity; whereas when segmenting based on output rate and 'global' social influence, the viewing of queries authored by analysts with high output and high global influence is associated with the largest improvement in the viewer's productivity.

CCS Concepts: • Human-centered computing → Empirical studies in collaborative and social computing;

Additional Key Words and Phrases: Learning-by-doing; Learning-by-viewing; Productivity; Data analysts; SQL query; Expert roles; Segmentation; Collaborative data platform; Alation; eBay

ACM Reference Format:
Yue Yin, Itai Gurvich, Stephanie McReynolds, Debora Seys, and Jan A. Van Mieghem. 2018. Learning by Doing versus Learning by Viewing: An Empirical Study of Data Analyst Productivity on a Collaborative Platform at eBay. In Proceedings of the ACM on Human-Computer Interaction, Vol. 2, CSCW, Article 193 (November 2018). ACM, New York, NY. 27 pages. https://doi.org/10.1145/3274462

Authors' addresses: Yue Yin, Northwestern University, 2211 Campus Dr, Evanston, Illinois, 60201, USA, [email protected]; Itai Gurvich, Cornell Tech, New York City, New York, USA, [email protected]; Stephanie McReynolds, Alation Inc., Redwood City, California, USA; Debora Seys, eBay Inc., 2025 Hamilton Avenue, San Jose, California, USA; Jan A. Van Mieghem, Northwestern University, 2211 Campus Dr, Evanston, Illinois, USA, [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2573-0142/2018/11-ART193 $15.00
https://doi.org/10.1145/3274462


1 INTRODUCTION

Effective data analytics drives business success by enhancing managerial decision-making. Companies often struggle to maintain growth in the productivity of their data analysts. In this paper, we investigate how data-analyst productivity benefits from collaborative platforms that, in addition to providing a query-writing environment for the analysts, also facilitate access to queries authored by their peers. Productivity is measured using the time from creating an empty query to first executing it.

Companies in various industries employ data analysts. From financial services to retailing to

social media, data analysts aim to transform data into valuable information for decision-makers. By retrieving, organizing and narrating raw data, data analysts can spot trends in the market, enable managers to know more about their consumers, and develop recommendations that are aligned with the company's strategy. The current demand for skilled data analysts outpaces the supply. A 2016 McKinsey Global Institute report concludes that by 2024 the U.S. economy will face a shortage of 250,000 analytics professionals [40]. Furthermore, surveyed business leaders find it challenging to recruit and retain proficient data analysts [39].

A good data analyst must be technologically up-to-date to bring insightful results swiftly [18]. The writing and execution of queries is central to the work of data analysts handling large-scale data. Queries are written using SQL (Structured Query Language) [71] to answer business questions like: What goods are flying off a retailer's shelf? Who prefers to shop in a boutique store versus online? What are the ten most frequently searched items on eBay Motors in London in July 2017? Data analysts answer these questions by extracting data from the proper databases, performing data manipulations (e.g., sorting, grouping, filtering and joining), and finally reporting the results.

Proficiency in programming SQL queries is a learned skill. Organizational learning theory defines learning as a change in knowledge that occurs as a function of experience [31]. Such knowledge transformation occurs at different levels in organizations: individual, group, organizational, and inter-organizational [21]. Studies have demonstrated that much of programming knowledge is tacit knowledge, i.e., knowledge that usually is not openly expressed or taught [83, 94]. It has also been established that expert programmers know more than the mere syntax and semantics of a particular language; compared to novices, their knowledge is better organized. For example, an expert programmer is able to see the underlying commonalities and differences among various problems and programs. Numerous researchers who aim to characterize expert programming have also suggested the existence of reusable "chunks" of knowledge representing solution patterns that achieve different kinds of goals [56].

Most current approaches measure learning by assessing changes in cognition. Qualitative methods like questionnaires, interviews and verbal-protocol analyses are typically used to that end [30, 44]. Such cognitive approaches are nonetheless unable to capture tacit knowledge [41]. Behavioral approaches measure learning by assessing changes in practices or performance, and have been shown to capture tacit knowledge well [6, 7, 27]. Recently, more researchers have started exploring behavioral approaches to quantitatively measure such 'informal learning' in large online communities. For example, Yang et al. measured the learning of a Scratch user as growth in the cumulative repertoire of weighted vocabulary block use [25, 98]. We thereby deploy a behavioral approach—measuring change in performance—in our study: faster SQL programming indicates a positive learning outcome for data analysts.

In organizational learning theory, experience—typically defined as the total or cumulative number of task completions—underpins learning. The most fundamental characterization of experience is whether it is acquired directly by the focal organizational unit or indirectly from other units [6]. Two modes of organizational learning are derived from this characterization: learning from
direct experience (i.e. from one's own practices) and learning from indirect experience (i.e. from other organizational units' experience) [55]. Learning from direct experience is the embodiment of "practice makes perfect," while learning from indirect experience emphasizes the circulation of knowledge among peers. Studies of the well-known learning curve provide considerable evidence of learning from direct experience [27, 99]. There is also extensive work on collaboration and knowledge sharing that investigates learning from indirect experience [8, 36, 42, 59, 65].

We study these two modes of organizational learning as eBay data analysts work on Alation. Alation is an enterprise collaborative data platform that makes data accessible to individuals across the organization. The platform empowers analysts to write SQL queries using well-curated data, and allows them to publish their own queries or view any public query authored by their peers. We focus on two potential learning processes on Alation: learning-by-doing ("by oneself") vs. learning-by-viewing (peers' queries). Learning-by-doing captures how data analysts become faster the more queries they write; it is how they learn from direct experience. In contrast, learning-by-viewing captures how data analysts improve by viewing queries authored by their peers; it is an instance of learning from indirect experience. We pose the following research questions:

1. Learning-by-doing: How is data analyst productivity associated with self-practice?
2. Learning-by-viewing: How is data analyst productivity associated with viewing of queries authored by peers?

Organizational learning theory emphasizes the role of experts in facilitating the diffusion and validation of credible knowledge. Previous studies have demonstrated that group members are likely to accept and put more weight on information from a recognized expert [86]. The retention of, and the interaction with, exceptional performers appears to affect organizational outcomes [6, 16]. Given that most programming knowledge is tacit knowledge, it is not clear how to characterize expert (or "star") analysts. The conventional approaches to the characterization of stars are based exclusively on individual output [9, 38, 100]. Classic economic growth theories nonetheless claim that human capital externalities (e.g. the influence that an individual has on the performance of others) are also a key input in the generation of knowledge [1, 61, 78]. Recent empirical work adopts this view and expands the traditional characterization of 'star' by adding measurements of social influence [69]. Recent studies on software development gauge expertise-identification approaches by considering both the code a developer authors and the code that the developer consults during her work [35, 81].

Inspired by the literature on the role of experts in organizational learning, we ask:

3. Learning from experts: How is learning-by-viewing a query associated with the expertise of the query's author?

Our study is also related to two well-known perspectives in cognitive psychology and the learning sciences [5, 19, 37]: the "within-the-human" perspective that is generally attributed to Jean Piaget [96] vs. the situated cognition perspective proposed by Jean Lave and Etienne Wenger [58]. The former focuses on the internalized development within the individual's mental representation of the world, and presumes that knowledge can be constructed through one's own practices [19]. The latter, in contrast, suggests that knowledge is constructed in a social context, and that individual participation in valued social practices is critical for successful learning [19, 37]. Read in our context, the within-the-human perspective would suggest that the analyst mostly acquires programming knowledge from writing queries by herself (learning-by-doing). The situated-cognition perspective suggests, instead, that learning is mostly accomplished through the analyst's interaction with other analysts (learning-by-viewing). The situated-cognition perspective requires, of course, that newcomers be able to observe experts. In our context, Alation is the platform that facilitates this so-called "legitimate peripheral participation" [58].


The rest of the paper is organized as follows. In the next section, we discuss the theory and develop our hypotheses. We then describe the empirical setting, data and measures. Later, we present our analysis strategy and report our results. We conclude with a comprehensive discussion of the results and their potential implications.

2 THEORY AND HYPOTHESES

Organizational learning theory has been applied in a broad spectrum of industries, from manufacturing to services [10, 24]. Other recent studies focus on knowledge industries like IT consulting and software development [32, 48, 54]. We follow this path, relying on organizational learning theory to develop and test hypotheses related to the learning of data analysts working, as individuals, on a collaborative platform that facilitates knowledge sharing. Our hypotheses pertain to two modes of organizational learning: learning from direct experience and learning from indirect experience [32, 50, 75]. We use productivity as the main variable of interest and measure how it relates to the accumulated experience that data analysts have gained by writing queries on their own (learning-by-doing) and by viewing queries written by their peers (learning-by-viewing). In formulating the hypotheses, we also rely on two cognitive theories of learning: learning-by-doing as it relates to the within-the-human perspective, and learning-by-viewing as it relates to the situated-learning perspective.

2.1 Learning by Doing

2.1.1 Individual Learning from Direct Experience. Recent studies of individual learning from direct experience cover various industries. Kim et al. estimate the learning curve of IT consultants [54]. KC et al. examine the direct impact of a cardiologist's own prior experience on individual learning [50]. Staats and Gino compare the benefits of an individual worker's experience within a single day with experience accumulated over several days [85]. They find that specialization is related to productivity improvement over a single day. We expect data analysts in our study to benefit from their past experience of writing queries. A positive answer to the following hypothesis further validates the earlier findings and extends them to the context of data analysts.

Hypothesis 1. Past experience of writing queries is associated with an improvement in the data analyst's productivity.

2.1.2 Specificity of Direct Experience. Repetition of a given task is likely to improve the performance of an individual more than experience with related (but different) tasks. Boh et al. show that specialized experience with the same system has the greatest impact on productivity for modification requests completed by individual developers [32, 48]. Such findings, identifying the benefit of focal experience, appear in other service industries [28, 51, 85].

On Alation, data analysts write queries using different databases. In an interview study by Kandel et al. focusing on the challenges of data analysts, most of the respondents mentioned the difficulty in interpreting certain database fields [46]. The evidence in the literature suggests that, as the analyst practices more with the focal database, she will become more familiar with this database's nuances, including the field definitions, data quality and assumptions. We therefore measure the association between productivity and specificity of experience. A positive answer to the hypothesis below corroborates the value of focal direct experience in writing queries.

Hypothesis 2. Past experience in querying the focal database is associated with greater improvement in data analyst productivity than past experience querying different databases.

2.2 Learning by Viewing

2.2.1 Individual Learning from Indirect Experience. "[S]ocial interaction among individuals, groups and organizations are fundamental to organizational knowledge creation" [68]. Organizational
learning is frequently an interactive, social phenomenon [91]. Such communal processes are important because no one person embodies sufficient knowledge for solving all complex organizational problems. For instance, in the context of machine-repair technicians, most of the knowledge is not acquired in the classroom but comes, rather, from informal story-sharing among technicians and users about their experiences in particular work environments [15, 70]. This finding, that individuals also benefit from their peers' experience (learning from indirect experience), is confirmed in [36, 42, 43, 45, 59, 65, 67, 76, 84].

In the context of computer programming, Brandt et al. [11] propound that, by relying on information and source code fragments provided by other people on the Web, developers engage in just-in-time learning of new skills and approaches, clarify and extend their existing knowledge, and remind themselves of details deemed not worth remembering. Vasilescu et al. [92] argue that participation in online programming communities (e.g. StackOverflow) speeds up code development since quick solutions to technical challenges can be provided by peers. Dasgupta et al. confirmed that remixing—defined as the reworking and combination of existing creative artifacts—acts as a pathway to learning [25]. They found that a learner's repertoire of programming concepts increases when she engages in remixing.

Yet there is a trade-off. Viewing peers' code may delay programming activities, as viewing and programming compete for the developer's time and attention. Current empirical evidence for the benefit of learning from indirect experience is inconclusive. Waldinger finds no evidence for peer effects on the productivity of researchers in physics, chemistry and mathematics; in his study, even very high-quality scientists do not affect the productivity of their local peers [95]. KC et al. investigate the relationship between cardiologists' current performance and the performance of their colleagues in the same hospital [50]. But their data do not include detailed "views" information, i.e., what information individual cardiologists actually observe or share with each other. The authors therefore call for future research to identify the precise micro-mechanisms at work, exploring how knowledge is shared among individuals and affects their performance. Our fine-grained data include a complete history of each analyst's record of viewing specific peers' queries, offering an opportunity to respond to the authors' call. We test the following hypothesis to measure learning from indirect experience as analysts write queries. A positive answer to this hypothesis would indicate that analysts who view many queries written by peers tend to exhibit high productivity. A negative answer still leaves the possibility that only viewing queries written by certain peers predicts high productivity; this is investigated in Section 2.2.2.

Hypothesis 3. Past experience of viewing queries written by peers is positively associated with data analyst productivity.

2.2.2 Characterizing Star Data Analysts. Previous studies on knowledge spillovers and peer effects among scientists suggest a differential impact of collaborating with different types of individuals. Exceptional performers, or stars, may greatly advance the production of ideas and the innovation process [9, 69]. The situated-cognition theory in the learning sciences also places emphasis on the role of experts [26]. In the context of programming and software development, developers appear to have greater interest in following certain prolific developers, who are considered 'coding rockstars' by the overall community [23]. In online programming communities, a developer's status can affect decision-making: Tsay et al. find that contributions from higher-status submitters are more readily accepted by project managers [89].

In the context of data analysts, we must first ask how to identify (or characterize) the "rockstars" and, given such a characterization, how these stars are associated with changes in the productivity of their peers. The characterization we propose in this paper differs from most existing taxonomies by considering not just the individual's output [9, 38, 100], but also the individual's social influence on her peers.


[Figure 1 shows a 2×2 grid: Individual Output-rate (Low/High) on the horizontal axis and Social Influence (Low/High) on the vertical axis, with quadrants Non-star (low/low), Lone-wolf (high output, low influence), Maven (low output, high influence) and All-star (high/high).]

Fig. 1. We segment analysts using two dimensions—output rate and social influence—into four types: All-star, Lone-wolf, Maven and Non-star.

Such an expanded definition is unavoidable here, as we wish to measure learning-by-viewing, which is an interaction-based construct. This need is also identified by [69, 89], who suggest that the characterization of "stars" should consider the individual's social influence. We introduce a two-dimensional segmentation of analysts that incorporates both a measure of the individual analyst's output and a measure of her social influence on Alation. Relying on Oettl's characterization of star scientists [69], we segment the analysts in our study into four types: All-star, Lone-wolf, Maven and Non-star; see Figure 1.

We specify two segmentations in this paper that both use output rate, yet each uses a different measure of social influence. We adopt the conventional measure of individual output: output-rate = the average number of queries created by analyst i per unit of time [2, 38]. Taking months as the unit of time, we will use:

\text{Monthly Output of Queries}_i = \frac{\text{Total number of queries } i \text{ has written}}{\text{Number of months since } i \text{ joined Alation}} \quad (1)

We adopt two different measures of social influence, viewership and PageRank, to describe how influential an analyst's queries have been since they were written on Alation. Viewership captures the average number of distinct viewers per month of all queries authored by analyst i. We define:

\text{Monthly Viewers per Query}_i = \frac{\text{Total number of distinct data analysts who have viewed } i\text{'s queries}}{\sum_{\text{query } k \text{ written by } i} \text{Months that query } k \text{ is viewable}} \quad (2)

which represents the attention that focal analyst i receives from her peers on Alation. Viewership is a measure of the "local" influence of an author on her direct viewers. A qualitative interview study by Dabbish et al. [23] demonstrates that, in large-scale distributed collaborations and communities
of practice (e.g. GitHub), the attention that a developer has received signals her status in the community. Quoting a representative participant in their study: "[One visible cue is] the number of people watching a project or people interested in the project; obviously it's a better project than versus something that has no one else interested in it."

The second measure of social influence is PageRank, which was introduced by Google for weighting the importance of a web page based on the number and quality of links to that page [13]. To compute the PageRank of each data analyst in our study, we first build a directed, analyst-to-analyst network that represents the social interactions on Alation, which we explain later in Section 3.3.2. Running the PageRank algorithm on this network then returns the PageRank of every data analyst; an analyst with higher PageRank is considered more influential in the overall network. While viewership captures a 'local' influence, PageRank represents a 'global', network-wide influence of an analyst by capturing not only her direct viewers but also the viewers of her viewers, and so on.
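To make the three measures concrete, the following R sketch (R is also the language used for the regression analysis later in the paper) computes output-rate, viewership and PageRank from hypothetical query and view logs. The data frames and column names (queries, views, users, etc.) are illustrative stand-ins, not Alation's actual schema.

    library(dplyr)
    library(igraph)

    # Hypothetical inputs:
    #   queries: one row per query, with author, query_id, months_viewable
    #   views:   one row per distinct (viewer, query) pair, with viewer, author, query_id
    #   users:   one row per analyst, with analyst, months_on_alation

    # Equation (1): average number of queries written per month on the platform.
    output_rate <- queries %>%
      count(author, name = "n_queries") %>%
      left_join(users, by = c("author" = "analyst")) %>%
      mutate(output_rate = n_queries / months_on_alation)

    # Equation (2): distinct viewers of an author's queries, divided by the
    # total query-months that her queries have been viewable.
    viewership <- views %>%
      filter(viewer != author) %>%                    # ignore self-views
      distinct(author, viewer) %>%
      count(author, name = "n_viewers") %>%
      left_join(queries %>% group_by(author) %>%
                  summarise(query_months = sum(months_viewable)),
                by = "author") %>%
      mutate(viewership = n_viewers / query_months)

    # 'Global' influence: PageRank on the directed analyst-to-analyst network,
    # with an edge A -> B weighted by the number of distinct queries of B that
    # A has viewed (self-loops excluded), as described in Section 3.3.2.
    edges <- views %>%
      filter(viewer != author) %>%
      distinct(viewer, author, query_id) %>%
      count(viewer, author, name = "weight")
    g  <- graph_from_data_frame(edges, directed = TRUE)
    pr <- page_rank(g, weights = E(g)$weight)$vector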

We hypothesize that learning-by-viewing queries authored by analysts who outperform in both output-rate and social influence is associated with the largest improvement in productivity. To confirm the role of experts in learning-by-viewing, we start by testing the following hypothesis.

Hypothesis 4. Past experience of viewing queries written by different types of data analysts (All-star, Maven, Lone-wolf or Non-star) is associated with different changes in data analyst productivity.

A rejection of Hypothesis 4 would imply that the predicted productivity improvement through viewing queries is independent of the type of the author. In contrast, support for Hypothesis 4 confirms the importance of expert roles in our context and invites a further investigation: which type of analyst writes the most informative queries, i.e., those associated with the largest productivity improvement in their viewers?

3 METHODS

3.1 Study Platform

We study eBay data analysts writing and viewing queries on Alation. Alation is an enterprise collaborative data platform developed by Alation Inc. and used by eBay Inc. as one of its clients. Alation serves as an all-inclusive 'resort' for data analysts. First, it provides a repository for all technical meta-data in the analytics data warehouse. Main data services that can connect to Alation include Oracle, Teradata, MySQL, SQL Server and Tableau. A data analyst can conveniently access data if she has the proper permissions. Second, Alation integrates various analytics tools for data analysts to compose and execute queries, as well as to produce comprehensible results. Third, Alation advances collaboration and social computing among data analysts inside eBay. A data analyst can share her knowledge with the community by publishing her queries, writing articles about her good practices or participating in conversations on technical issues. A data analyst can also seek knowledge from the community by viewing queries authored and published by other peers, searching for relevant articles or asking for help on the conversation board. Serving as an online enterprise community, Alation supports collaboration, knowledge sharing, reuse of resources, expertise location, innovation, organizational change and social networking [63, 64, 66, 80]. Everyone in the organization, from data novices to experts, can easily search, collaborate and leverage knowledge on Alation.

3.2 Empirical Setting

Our data consist of (1) the usage data of analysts on Alation, which are automatically collected at the back end; and (2) the employee information data that we collected by crawling the eBay personal pages of all analysts in our study.


[Figure 2 depicts the stages of the software development life cycle: Pre-alpha (preliminary source code has been released), Alpha (the released code begins to take shape with more modifications), Beta (the code is feature-complete but retains faults), Stable (the software is reliable enough for daily use) and Mature (no new development is occurring).]

Fig. 2. Similar to query writing, a software development process typically starts with a "Pre-alpha" stage.

The usage data include the following information, spanning the four years from January 2014 to March 2018: 1) records of all queries on Alation, each item including query id, title, author, time of creation, a brief description of the query and whether the query has been published or not; 2) complete records of query executions on Alation, each item including query id, user id, time of execution and number of statements that were executed; 3) complete records of all users viewing query pages on Alation; and 4) simple personal information for all users on Alation, such as username, email address, date of joining the Alation platform and date of last login. The employee information data track the public information of all data analysts in our study, including employee title, subsidiary area, manager path inside eBay, and location (city campus, country, and building and floor).

To construct our sample of data analysts, we first included all active users who had written at least one query on Alation during our study period. We then excluded users who are labeled as Alation employees and users who are authorized as Alation administrators inside eBay Inc. We also excluded users whose eBay employee information is missing on the eBay intranet (this happens when an eBay employee left the company during the study period). This resulted in an initial set of 2,059 users. We then summarized users' hierarchies inside eBay Inc. by parsing their manager paths. Among these 2,059 users, there are 17 level-1 employees, 221 level-2, 777 level-3, 748 level-4, 281 level-5 and 15 level-6 (the CEO is at level 10). We excluded level-6 and level-1 employees because they are either too senior or too inexperienced to be considered representative data analysts for our study. The senior product manager in charge of Alation inside eBay Inc. also confirmed that level 2-5 employees are the major users. This resulted in a data set of 2,027 users who wrote 101,327 queries during the study period. We further excluded queries that were never executed during the study period, as well as queries with missing field data; this finally left 79,797 queries written by 2,001 data analysts for the study. It is important to point out that only about 1 out of 8 queries is ever viewed by an analyst other than the author: out of the 79,797 queries, only 10,049 queries, authored by 1,097 data analysts, have been viewed by analysts other than their authors.
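Schematically, the sample construction amounts to the following filter pipeline in R; the indicator columns (is_alation_employee, ever_executed, etc.) are hypothetical names for flags that would be derived from the usage and employee data, not fields the platform exposes under these names.

    library(dplyr)

    analysts <- users %>%
      filter(n_queries_written >= 1) %>%        # active users only
      filter(!is_alation_employee,
             !is_alation_admin,
             !is.na(ebay_profile)) %>%          # employee info found on intranet
      filter(ebay_level %in% 2:5)               # drop level-1 and level-6 users

    study_queries <- queries %>%
      semi_join(analysts, by = c("author" = "user_id")) %>%
      filter(ever_executed, !has_missing_fields)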

3.3 Data and Measures

3.3.1 Dependent Variable. To develop a productivity measure for data analysts in our study, we borrow the concept of the Pre-alpha phase from the software development life cycle [17, 72, 88]. Figure 2 shows a complete development process as summarized by previous researchers [79, 88]. During the Pre-alpha phase, developers write preliminary source code; this phase occurs before Alpha testing.

Figure 3 illustrates the process that a data analyst follows when she writes a query from scratch on Alation. We define FirstCompletionTime_{i,k} as the time interval between the point when data analyst i clicks the button to create an empty query k and the point when she executes this query for the first time:

\text{FirstCompletionTime}_{i,k} = \text{Timestamp}_{i \text{ first executes } k} - \text{Timestamp}_{i \text{ creates the empty } k} \quad (3)

This time interval, comparable to the Pre-alpha phase in software development, characterizes the time that a data analyst spends shaping her idea into a testable prototype query.


[Figure 3 traces the timeline of a single query: the analyst creates an empty query on Alation (e.g., at 09:30:08 a.m.), starts writing the code, and, once a prototype or subpart is finished, executes the query for the first time; she then keeps editing and re-executing until the query does what is wanted. The interval from creating the empty query to its first execution is the First Completion Time.]

Fig. 3. The First Completion Time is the first stage of a typical query process.

The longer this time interval is, the later the analyst can proceed to the subsequent steps. We therefore use FirstCompletionTime_{i,k} as a proxy for data analyst productivity. The query-creation time-stamps and the first-execution time-stamps are automatically logged at the back end of Alation. We believe that the data analysts in our study cannot manipulate these data directly and have no incentive to act strategically, as only the administrators of Alation are informed of and have access to these data.
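As a minimal illustration, equation (3) could be computed from such logs as follows; executions and queries are hypothetical data frames holding the back-end execution and creation time-stamps.

    library(dplyr)

    # executions: query_id, exec_time (POSIXct)  -- one row per execution
    # queries:    query_id, author, created_time (POSIXct)
    first_completion <- executions %>%
      group_by(query_id) %>%
      summarise(first_exec = min(exec_time)) %>%
      inner_join(queries, by = "query_id") %>%
      mutate(FirstCompletionTime =            # in seconds, per equation (3)
               as.numeric(difftime(first_exec, created_time, units = "secs")))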

3.3.2 Explanatory Variables.

Learning-by-doing. We define Aggregate Direct Experience_{i,k} as the number of queries that data analyst i had written by herself on Alation before she clicked the button to create an empty query k during our study period. Because each query uses a particular database, we divide Aggregate Direct Experience_{i,k} into two parts: 1) Direct Experience with the Focal Database_{i,k}, the number of queries, using the same database as query k, that i has written by herself on Alation before she creates query k; and 2) Direct Experience with Different Databases_{i,k}, the number of queries, using a database different from the one query k uses, that i has written by herself on Alation before she creates k. Clearly,

\text{Aggregate Direct Experience}_{i,k} = \text{Direct Experience with the Focal Database}_{i,k} + \text{Direct Experience with Different Databases}_{i,k} \quad (4)


Learning-by-viewing. We define Aggregate Indirect Experience_{i,k} as the number of queries that data analyst i had viewed from her peers on Alation before she clicked the button to create an empty query k during our study period. To differentiate learning-by-viewing from different types of data analysts, we further divide Aggregate Indirect Experience_{i,k} based on the type of the author analyst. For example, Indirect Experience from All-star_{i,k} is the number of queries written by other all-star data analysts that i had viewed before she clicked the button to create k. According to our segmentation of data analysts in Figure 1, we divide Aggregate Indirect Experience_{i,k} as in equation (5):

\text{Aggregate Indirect Experience}_{i,k} = \text{Indirect Experience from All-star}_{i,k} + \text{Indirect Experience from Maven}_{i,k} + \text{Indirect Experience from Lone-wolf}_{i,k} + \text{Indirect Experience from Non-star}_{i,k} \quad (5)

To implement the analyst segmentation in Figure 1, we use Monthly Output of Queries on Alation as the measure of an individual data analyst's output-rate, and Monthly Viewers per Query as the measure of her viewership, as defined earlier in equations (1)-(2). To calculate PageRank, we build a directed, analyst-to-analyst network using the query-viewing data on Alation. Each node represents a data analyst in our study. The corresponding graph contains an edge from node A to node B if analyst A has viewed a query written by analyst B; we weight the edge from node A to node B by the number of distinct queries authored by B that analyst A has viewed. All self-loops are excluded, since we do not consider an analyst viewing her own queries to be a social interaction. Running the PageRank algorithm implemented in the R package igraph returns the PageRank of all analysts [22, 73].

Figures 4 and 5 are scatter plots illustrating the relationship between output-rate and social influence. We apply the segmentation model of Figure 1 to these two plots as follows: in the upper right quadrant we mark data analysts whose output-rate and social influence are both above the medians as all-star; in the lower left quadrant we mark data analysts whose output-rate and social influence are both below the medians as non-star; we mark data analysts who reside in the upper left quadrant as maven, and data analysts who reside in the lower right quadrant as lone-wolf.

[Figure 4 is a scatter plot of Viewership (viewers per month, log-scaled) against Output-rate (queries written per month, log-scaled), split at the medians into All-star: 757, Lone-wolf: 256, Maven: 255 and Non-star: 759 analysts.]

Fig. 4. Segmenting data analysts by Output-rate × Viewership. Note: The median is 0.67 for Output-rate and 0.06 for Viewership. The correlation between Output-rate and Viewership is 0.24. Points at the bottom of the plot with zero Viewership represent data analysts whose queries have not been viewed by any peer; on a log scale these points fall at −∞, but we moved them up for display convenience. These points heavily overlap because only about 1 out of 8 queries is ever viewed by a peer other than the author.

[Figure 5 is a scatter plot of PageRank (log-scaled) against Output-rate (queries written per month, log-scaled), split at the medians into All-star: 732, Lone-wolf: 266, Maven: 266 and Non-star: 732 analysts.]

Fig. 5. Segmenting data analysts by Output-rate × PageRank. Note: The median is 0.67 for Output-rate and 0.0001 for PageRank. The correlation between Output-rate and PageRank is 0.16.
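For concreteness, the median-split segmentation could be implemented as in the following R sketch; the analysts data frame, with its output_rate, viewership and pagerank columns, is a hypothetical container for the measures defined above.

    # Assign the four analyst types of Figure 1 by splitting at the medians.
    segment <- function(output_rate, influence) {
      hi_out <- output_rate > median(output_rate)
      hi_inf <- influence   > median(influence)
      ifelse( hi_out &  hi_inf, "All-star",
      ifelse( hi_out & !hi_inf, "Lone-wolf",
      ifelse(!hi_out &  hi_inf, "Maven", "Non-star")))
    }

    analysts$type_viewership <- segment(analysts$output_rate, analysts$viewership)
    analysts$type_pagerank   <- segment(analysts$output_rate, analysts$pagerank)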

3.3.3 Control Variables. Prior work suggests that individual adeptness, multi-tasking and task complexity are likely to affect productivity. A more adept data analyst may write a query faster; a data analyst who has piles of work may become less productive because of stress and pressure; and a data analyst may spend more time writing a complex query that either consists of many statements or uses a complicated database. We incorporate the following control variables to see if learning-by-doing and learning-by-viewing are associated with additional value, over and above these factors, in accelerating data-analyst query-writing. We explain the definitions of these control variables for the scenario in which analyst i creates and first executes query k.

• Workload_{i,k}: The average number of queries that analyst i composes simultaneously with the focal query k during FirstCompletionTime_{i,k}. This definition is inspired by the workload measure developed in [87] for restaurant workers. For example, suppose FirstCompletionTime_{i,k} lasts 40 minutes, and during this period the author analyst creates only one other query, which overlaps with the focal query k for 20 minutes. The Workload_{i,k}, therefore, is (40 min + 20 min)/(40 min) = 1.5 queries (a computational sketch of this definition follows this list).

• Query Size_k: The number of statements that were executed in the first execution of query k.

• Database_k: A categorical variable that indicates which database query k uses.

• Saved Query_k: A binary variable that indicates whether query k has been saved.



• Migrated Query_k: A binary variable that indicates whether part of query k was migrated from a different platform. We acquire this information by parsing the title or description of query k.

• Tenure on Alation_{i,k}: The number of months between the date when the author analyst i joined Alation and the time she creates query k.

• eBay Level_i: A categorical variable that indicates author analyst i's hierarchy level in eBay Inc.

• eBay Sub-area_i: A categorical variable that indicates author analyst i's subsidiary area in eBay Inc.

We also add month indicators for the number of months from January 2014 until the creation of query k, to control for any environmental differences across time.
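To make the Workload control concrete, the following R sketch reproduces the worked example given in the Workload item above; times are expressed in minutes for readability, whereas the actual computation would use the logged time-stamps.

    # Workload for one focal query: the focal window length plus the time that
    # the analyst's other query windows overlap it, divided by the focal length.
    workload <- function(focal_start, focal_end, other_starts, other_ends) {
      len     <- focal_end - focal_start
      overlap <- pmax(0, pmin(other_ends, focal_end) -
                         pmax(other_starts, focal_start))
      (len + sum(overlap)) / len
    }

    # Worked example from the text: a 40-minute focal window and one other
    # query overlapping it for 20 minutes gives (40 + 20) / 40 = 1.5 queries.
    workload(0, 40, other_starts = 10, other_ends = 30)   # 1.5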

4 ANALYSIS AND RESULTS

4.1 Analysis Strategy

We present the descriptive statistics of all variables in the raw data in Table 1. Traditional learning curves are typically modeled in exponential form and are often estimated using a log-linear regression model [3, 6, 57, 60].


Table 1. Descriptive statistics of the raw data capturing N = 79,797 queries written and executed by 2,001 data analysts at eBay during 2014-2018.

Statistic                                  | Mean   | St. Dev.  | Min | Pctl(25) | Pctl(75) | Max
FirstCompletionTime                        | 90,711 | 1,200,935 | 1   | 20       | 262      | 83,798,568
Aggregate Direct Experience                | 116.2  | 200.2     | 0   | 16       | 126      | 1,775
Aggregate Indirect Experience              | 35.9   | 70.8      | 0   | 1        | 35       | 1,510
Direct Experience with Different Databases | 52.4   | 124.8     | 0   | 2        | 49       | 1,389
Direct Experience with the Focal Database  | 63.9   | 113       | 0   | 6        | 68       | 1,092
Workload                                   | 1.1    | 0.7       | 1   | 1        | 1        | 28
Tenure on Alation                          | 14.9   | 13.1      | 0   | 4        | 24       | 50
Saved Query                                | 0.3    | 0.5       | 0   | 0        | 1        | 1
Migrated Query                             | 0.01   | 0.1       | 0   | 0        | 0        | 1
Query Size                                 | 4.6    | 117.6     | 1   | 1        | 1        | 18,095

Output-rate × Viewership Segmentation
Indirect Experience from All-star          | 32.6   | 64        | 0   | 0        | 32       | 1,343
Indirect Experience from Lone-wolf         | 0.2    | 0.8       | 0   | 0        | 0        | 12
Indirect Experience from Maven             | 2.9    | 10.5      | 0   | 0        | 1        | 169
Indirect Experience from Non-star          | 0.1    | 0.8       | 0   | 0        | 0        | 20

Output-rate × PageRank Segmentation
Indirect Experience from All-star          | 32.8   | 64.4      | 0   | 0        | 32       | 1,352
Indirect Experience from Lone-wolf         | 0.01   | 0.2       | 0   | 0        | 0        | 11
Indirect Experience from Maven             | 3      | 10.7      | 0   | 0        | 1        | 169
Indirect Experience from Non-star          | 0.07   | 0.5       | 0   | 0        | 0        | 22



Our data are challenging in three ways: (1) our dependent variable (FirstCompletionTime, in seconds) is a count variable and is over-dispersed, i.e., the variance is significantly larger than the mean (see Table 1); (2) our dependent variable only takes positive values; and (3) our data suffer from potential correlations across observations that are nested within multiple levels (a data analyst writes multiple queries using different databases over time). To deal with these challenges, we fit a mixed-effects zero-truncated negative binomial regression model. Zero-truncated negative binomial regression models are a class of generalized linear models that are appropriate for non-negative, over-dispersed count data [4]. To account for the three-level nesting of the dataset (from queries to users to databases), we create a mixed-effects model where the experience variables and usage of databases are fixed effects and the unique analyst intercepts are represented as random effects [49]. In R [73], mixed-effects zero-truncated negative binomial models are implemented in the glmmTMB package [14].

For ease of comparing the relative importance of the explanatory variables, in the models we run we standardize (i.e., normalize to mean zero and unit standard deviation) all variables except for the dependent variable and the categorical variables.
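One plausible glmmTMB specification for, say, Model 4 under the viewership segmentation is sketched below. The variable names are illustrative, and the random-intercept structure for analysts and databases is our reading of the nesting described above rather than the authors' exact call; truncated_nbinom2 is glmmTMB's zero-truncated negative binomial family with a log link.

    library(glmmTMB)

    # d: one row per query, with the standardized experience variables,
    # the controls of Section 3.3.3, and analyst/database identifiers.
    m4 <- glmmTMB(
      FirstCompletionTime ~
        direct_focal_db + direct_other_db +
        indirect_allstar + indirect_maven +
        indirect_lonewolf + indirect_nonstar +
        workload + tenure + saved_query + migrated_query + query_size +
        ebay_level + ebay_subarea + month_indicator +   # categorical controls
        (1 | analyst) + (1 | database),                 # random intercepts
      family = truncated_nbinom2(),
      data   = d)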

4.2 Results

We build six separate models to test our hypotheses. Table 2 presents the model specifications by indicating the explanatory variables included in each model. In particular, Model 1 contains both Aggregate Direct Experience and Aggregate Indirect Experience; Model 2 contains Direct Experience with the Focal Database, Direct Experience with Different Databases and Aggregate Indirect Experience. Models 3 and 4 are built under the Output-rate × Viewership segmentation of data analysts: Model 3 contains Aggregate Direct Experience and indirect experience from each of the four types of author analysts (all-star, non-star, maven and lone-wolf); Model 4 contains Direct Experience with the Focal Database, Direct Experience with Different Databases, and indirect experience from each of the four types of author analysts.


Table 2. Results of zero-truncated negative binomial regressions for Models 1-6. Dependent variable: FirstCompletionTime. Models 3-4 use the Output-rate × Viewership segmentation; Models 5-6 use the Output-rate × PageRank segmentation. Standard errors in parentheses; control variables are omitted from the table.

Variable                                   | Model 1        | Model 2         | Model 3           | Model 4           | Model 5           | Model 6
Aggregate direct experience                | 0.015 (0.032)  |                 | −0.071* (0.034)   |                   | −0.045 (0.033)    |
Direct experience with focal database      |                | −0.057* (0.025) |                   | −0.095*** (0.025) |                   | −0.079*** (0.025)
Direct experience with different databases |                | 0.046* (0.022)  |                   | −0.012 (0.023)    |                   | −0.003 (0.023)
Aggregate indirect experience              | −0.029 (0.032) | −0.025 (0.031)  |                   |                   |                   |
Indirect experience from all-star          |                |                 | −0.144*** (0.040) | −0.143*** (0.040) | −0.199*** (0.041) | −0.196*** (0.040)
Indirect experience from non-star          |                |                 | 0.143*** (0.031)  | 0.151*** (0.050)  | 0.077* (0.031)    | 0.083** (0.030)
Indirect experience from maven             |                |                 | 0.277*** (0.033)  | 0.269*** (0.034)  | 0.275*** (0.026)  | 0.268*** (0.033)
Indirect experience from lone-wolf         |                |                 | −0.182*** (0.039) | −0.178*** (0.039) | −0.169*** (0.033) | −0.171*** (0.033)
Constant                                   | −1.36 (24.39)  | −0.89 (19.93)   | −1.20 (19.54)     | −0.74 (16.09)     | −0.38 (13.81)     | −0.89 (17.96)
Observations                               | 79,797         | 79,797          | 79,797            | 79,797            | 79,797            | 79,797
Log Likelihood                             | −581,467       | −581,454        | −581,401          | −581,396          | −581,398          | −581,393
Akaike Inf. Crit.                          | 1,163,078      | 1,163,066       | 1,162,964         | 1,162,955         | 1,162,958         | 1,162,950
Bayesian Inf. Crit.                        | 1,163,802      | 1,163,800       | 1,163,716         | 1,163,717         | 1,163,711         | 1,163,712

Note: *p<0.05; **p<0.01; ***p<0.001.


Table 3. ANOVA results reject both β1 = β2 and β3 = β4 = β5 = β6.

Comparison          | ∆Df | χ²     | P-Value
Model 1 vs. Model 2 | 1   | 14     | < 0.001***
Model 1 vs. Model 3 | 3   | 119.64 | < 0.001***
Model 1 vs. Model 5 | 3   | 125.31 | < 0.001***
Model 1 vs. Model 4 | 4   | 130.62 | < 0.001***
Model 1 vs. Model 6 | 4   | 135.48 | < 0.001***
Model 2 vs. Model 4 | 3   | 116.26 | < 0.001***
Model 2 vs. Model 6 | 3   | 121.48 | < 0.001***
Model 3 vs. Model 4 | 1   | 10.625 | 0.001**
Model 5 vs. Model 6 | 1   | 10.171 | 0.001**

Except for being built under the Output-rate × PageRank segmentation, Models 5 and 6 have the same specifications as Models 3 and 4, respectively. All six models contain the control variables listed in Section 3.3.3.

Note that models {1, 2, 3, 4} and models {1, 2, 5, 6} are nested. To see this, write Model 4 or Model 6 as

g(E[\text{FirstCompletionTime}_{i,k}]) = \beta_0 + \beta_1 \text{Direct Experience with Different Databases}_{i,k} + \beta_2 \text{Direct Experience with the Focal Database}_{i,k} + \beta_3 \text{Indirect Experience from All-star}_{i,k} + \beta_4 \text{Indirect Experience from Maven}_{i,k} + \beta_5 \text{Indirect Experience from Lone-wolf}_{i,k} + \beta_6 \text{Indirect Experience from Non-star}_{i,k} + \gamma \, \text{Control Variables}_{i,k} \quad (6)

where g(·) is the link function, which is the natural logarithm in our model. Clearly, Model 1 is a special case of Model 4 or 6 (under the respective segmentations) where β1 = β2 and β3 = β4 = β5 = β6; Model 2 is a special case of Model 4 or 6 where β3 = β4 = β5 = β6; Model 3 is a special case of Model 4 where β1 = β2; and Model 5 is a special case of Model 6 where β1 = β2. Similarly, Model 1 is a special case of Model 2 where β1 = β2, and a special case of Model 3 or 5 where β3 = β4 = β5 = β6.

We use analysis of variance (ANOVA) to compare the nested models. Table 3 summarizes the results of the comparisons between nested pairs. We find that the difference in log-likelihoods between Model 1 and Model 2 is statistically significant, as is the difference in log-likelihoods between Model 1 and Model 3. Thus, we can reject both β1 = β2 and β3 = β4 = β5 = β6 (under the output-rate × viewership segmentation) with sufficient confidence. The other comparison results in Table 3 all support this finding. Similarly, the comparison results in Table 3 suggest that under the output-rate × PageRank segmentation we can also reject both β1 = β2 and β3 = β4 = β5 = β6.
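Assuming the six models were fitted as glmmTMB objects m1 through m6 (as sketched in Section 4.1), the nested comparisons of Table 3 amount to likelihood-ratio tests via R's anova():

    anova(m1, m2)   # tests beta1 = beta2
    anova(m1, m3)   # tests beta3 = beta4 = beta5 = beta6 (viewership segmentation)
    anova(m3, m4)   # tests beta1 = beta2 within the segmented models
    anova(m5, m6)   # same test under the PageRank segmentation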

4.2.1 Choosing the best model. The main results of the mixed-effects zero-truncated negative binomial regression on FirstCompletionTime are reported in Table 2 (see Appendix A for the full results).


Because we performed mean centering, our baseline is the mean value of each explanatory variable (except for the categorical ones). The coefficients can be interpreted as follows: when an explanatory variable changes by one unit (in our case, by one standard deviation), the log of the expected count of the dependent variable (FirstCompletionTime) is expected to change by the corresponding coefficient, holding all other variables in the model constant.

We then use the Akaike Information Criterion (AIC) to evaluate the goodness of fit of each model. Generally, the smaller the AIC, the better the corresponding model relative to competing models. Under this criterion, Model 4 is the best among the four models {1, 2, 3, 4} that apply the output-rate × viewership segmentation of data analysts, and Model 6 is the best among the four models {1, 2, 5, 6} that apply the output-rate × PageRank segmentation. We thereby adopt Models 4 and 6 to understand learning-by-doing and learning-by-viewing for the data analysts in our study, under the respective segmentations. To rule out multicollinearity between the explanatory variables in these two final models, we examine the VIF (variance inflation factor) of each explanatory variable, comparing against the recommended maximum of 5 [20]. Table 4 shows that, in our case, the VIFs of all explanatory variables in the final models remain well below 2, indicating the absence of multicollinearity.
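One possible way to obtain such VIFs for glmmTMB fits (the paper does not state which tool was used) is check_collinearity() from the performance package:

    library(performance)
    check_collinearity(m4)   # final model, viewership segmentation
    check_collinearity(m6)   # final model, PageRank segmentation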

Table 4. Variance inflation factors (VIF) of the explanatory variables in the two final models.

Variable                                   | Output-rate × Viewership Segmentation | Output-rate × PageRank Segmentation
Direct experience with the focal database  | 1.28 | 1.29
Direct experience with different databases | 1.29 | 1.29
Indirect experience from All-star          | 1.52 | 1.50
Indirect experience from Lone-wolf         | 1.30 | 1.17
Indirect experience from Maven             | 1.58 | 1.62
Indirect experience from Non-star          | 1.32 | 1.36
Workload                                   | 1.02 | 1.04
Tenure on Alation                          | 1.04 | 1.05
Saved Query                                | 1.05 | 1.05
Migrated Query                             | 1.04 | 1.05
Query Size                                 | 1.01 | 1.01
Mean VIF                                   | 1.26 | 1.23

4.2.2 Output-rate × viewership segmentation. From Model 4 in Table 2, we find that analysts who have more prior direct experience with the focal database are more likely to have a shorter expected FirstCompletionTime when writing new queries.


Holding all other variables constant, a unit (in our case, one-standard-deviation) increase in a data analyst's direct experience with the focal database is significantly associated with an average 9.5% decrease in the expected FirstCompletionTime of a new query. Evaluated at the mean of FirstCompletionTime, a 9.5% decrease is equivalent to 2.4 fewer hours. A unit increase in direct experience with different databases predicts a 1.2% shorter FirstCompletionTime, but this association is not statistically significant.
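As a reading aid (our gloss, not a derivation from the paper): with the natural-log link, a one-standard-deviation increase in a standardized covariate multiplies the expected FirstCompletionTime (FCT) by e^β, so in LaTeX notation

    \frac{\mathbb{E}[\mathrm{FCT} \mid x + \sigma]}{\mathbb{E}[\mathrm{FCT} \mid x]} = e^{\beta}
    \quad\Rightarrow\quad
    \%\,\Delta = \left(e^{\beta} - 1\right) \times 100\% \approx \beta \times 100\%
    \quad \text{for small } |\beta|.

For example, the focal-database coefficient β = −0.095 in Model 4 reads as the 9.5% decrease quoted above under the linear approximation; the exact multiplicative change e^{−0.095} ≈ 0.91 is close for coefficients of this size.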

We also find that viewing queries authored by different types of analysts predicts different changes in the expected FirstCompletionTime of new queries. Holding all other variables constant, a unit increase in a data analyst's indirect experience (the number of queries she has viewed) from all-star analysts is significantly associated with an average 14.3% decrease (equivalent to 3.6 fewer hours when evaluated at the mean) in the expected time she would spend between creating and first executing a new query; a unit increase in the number of queries the focal data analyst has viewed from lone-wolf analysts is significantly associated with an average 17.8% decrease (equivalent to 4.4 fewer hours when evaluated at the mean). Indirect experience from both non-star and maven analysts is associated with a significant increase in the expected FirstCompletionTime of a new query.

4.2.3 Output-rate × PageRank segmentation. From Model 6 in Table 2 we find that more prior direct experience with the focal database is associated with a shorter expected FirstCompletionTime for new queries. Holding all other variables constant, a unit increase in a data analyst's direct experience with the focal database is significantly associated with an average 7.9% decrease (equivalent to 2.0 fewer hours when evaluated at the mean) in the expected FirstCompletionTime of a new query; a unit increase in her direct experience with different databases is linked with a decrease that is neither statistically nor practically significant (0.3% on average).

We also find that viewing queries authored by different types of analysts is associated with different changes in the expected FirstCompletionTime of a new query. Holding all other variables constant, a unit increase in a data analyst's indirect experience from all-star analysts is significantly associated with an average 19.6% decrease (equivalent to 4.9 fewer hours when evaluated at the mean) in the expected FirstCompletionTime of a new query; a unit increase in the number of queries the focal data analyst has viewed from lone-wolf authors is significantly associated with an average 17.1% decrease (equivalent to 4.3 fewer hours when evaluated at the mean). Indirect experience from both non-star and maven analysts is associated with a significant increase in the expected FirstCompletionTime of a new query.

To summarize, our results provide robust support for hypotheses 2 and 4 under both segmentations of data analysts. We have partial support for hypothesis 1 because our results suggest that only direct experience with the focal database is associated with significant improvement in data-analyst productivity. We also have partial support for hypothesis 3 because our results suggest that only viewing queries authored by all-star and lone-wolf analysts is associated with significant improvement in data-analyst productivity. Analysts who have viewed more queries authored by maven and non-star analysts are more likely to spend a longer FirstCompletionTime on a new query in our study.

Furthermore, we find that under the output-rate × PageRank segmentation, viewing queries authored by all-star analysts is associated with the largest improvement in data-analyst productivity. In contrast, under the output-rate × viewership segmentation, viewing queries authored by lone-wolf analysts is associated with the largest improvement.

5 SUMMARY, DISCUSSION, AND LIMITATIONS

5.1 Summary and Discussion

Our results provide evidence of a statistically significant association between data-analyst productivity and both learning-by-doing and learning-by-viewing.


Greater direct experience with the focal database and greater indirect experience from viewing queries authored by 'all-star' and 'lone-wolf' analysts both predict significant improvement in a data analyst's productivity. The magnitudes of the coefficients in Table 2 underscore the practical significance of these associations. For instance, a one-standard-deviation increase in a data analyst's direct experience with the focal database predicts a significant 9.5% decrease in the expected FirstCompletionTime, equivalent to an average reduction of 2.4 hours. A one-standard-deviation increase in the number of queries a data analyst has viewed from 'all-star' authors is associated with a significant decrease in the expected FirstCompletionTime of 14.3% under the output-rate × viewership segmentation, or 19.6% under the output-rate × PageRank segmentation, equivalent to reductions of 3.6 and 4.9 hours respectively. With roughly 1,600 queries created on Alation every month in our study period, these savings accumulate to substantial totals.
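To make the scale concrete with a purely illustrative back-of-the-envelope figure (ours, not the paper's), if the 2.4-hour association applied uniformly to every query created in a month:

    1{,}600 \ \text{queries/month} \times 2.4 \ \text{hours/query} \approx 3{,}840 \ \text{analyst-hours/month},

an upper bound that assumes a one-standard-deviation shift in experience for every query.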

Our study builds on and contributes to the literatures of organizational learning and computational social science, as well as cognitive psychology and the learning sciences. Although the relationship between experience and productivity has been extensively studied in manufacturing and service industries, there are only a few empirical studies of learning processes among data analysts, and none with this granularity of field data. Most existing studies are qualitative, such as interviews and surveys, and tend to measure learning using cognitive approaches [12, 46, 47, 52, 53]. To the best of our knowledge, our study may be the first to empirically examine how data analysts learn to speed up writing queries using behavioral (performance-measure) approaches.

First, our results support an association between learning from direct experience and productivity. More experience in writing queries is associated with a faster FirstCompletionTime. This finding is consistent with the study by Kim et al. that surveyed 793 professional data scientists at Microsoft [53]. Respondents to that survey spoke of 'getting their hands dirty' as one of the best practices for improving data-science analysis, and frequently mentioned the desire for hands-on training and practical case studies. We also find that prior experience in using the focal database is associated with greater improvement in productivity than experience in using a different database. This suggests that data analysts obtain more domain knowledge of a specific database through self-practice [48]. This finding also echoes previous studies that demonstrate the greater impact of related experience on productivity [28, 51, 85, 97].

Second, our results provide evidence for the role of "experts" in learning-by-viewing.

We find that only viewing queries authored by analysts with a high output rate is associated with significant improvement in data-analyst productivity. This finding not only confirms that circulating query code can be an effective social practice, but also suggests that interacting with the 'community of practice' is critical for advancing an analyst's knowledge. This is consistent with the finding of Dabbish et al. that being able to view code authored by others supports better programming [23]. Kim et al. also reported that "respondents expressed the goal of fostering a 'community of practice' across the company" [53]. Despite this, Kandel et al. found in their interview study that "the least commonly shared resource among data analysts was the analysis code" [46]. Our results give support to the value of collaborative platforms in providing an unfenced channel for collaboration.

Third, our findings suggest that different types of social influence, namely 'local' influence (as captured by viewership) vs. 'global' influence (as captured by PageRank), are associated with different changes in viewer analysts' productivity. Under both segmentations of analysts, only viewing queries authored by 'all-star' and 'lone-wolf' analysts is associated with a significant decrease in the FirstCompletionTime of new queries. Analysts who mostly view queries authored by 'non-star' and 'maven' analysts, in contrast, are more likely to spend more of that time. Under the output-rate × viewership segmentation, we find that viewing queries authored by 'lone-wolf' analysts is associated with the largest improvement in FirstCompletionTime; under the output-rate × PageRank segmentation, we find that viewing queries authored by 'all-star' analysts is associated with the largest improvement.


These two findings together indicate that the most influential analysts might be those who have few incoming links (fewer direct viewers) but a relatively large contingent of viewers with high PageRank. Such analysts could be the 'ultimate stars'.

Overall, regardless of the segmentation of analysts, learning-by-viewing is associated with greater productivity improvement than learning-by-doing in our study. This result agrees with Lave and co-authors' conclusion that 'engaging in practice may well be a condition for the effectiveness of learning' and is consistent with the anecdotal evidence in their paper that apprentices learn mostly in relations with other apprentices [58].

5.2 Implications

While one must be careful with extrapolations, our results suggest directions for further exploration by managers and designers.

5.2.1 Managerial Implications. First, our findings about learning-by-doing provide guidance for developing training plans for data analysts. Based on the result that learning-by-doing with the same databases is associated with significant productivity improvement, managers can encourage data analysts to practice with focus. Similarly, individual contributors in peer-production communities (e.g., Wikipedia and GitHub) can choose to work on related projects in order to be more productive.

Second, our findings about learning-by-viewing provide suggestive evidence of the value of collaborative platforms. Furthermore, the differential associations of learning-by-viewing queries authored by different types of analysts can inform performance evaluation and team building. Organizations that use collaborative platforms can take their employees' star status on such platforms into account when evaluating their performance and contribution. Moreover, by identifying star vs. non-star analysts, project managers can team suitable analysts to work together.

Third, our findings help to identify star workers on collaborative platforms. We show it is useful to leverage articulated social measures and observed code-related activity simultaneously. As Marlow et al. put it [62], "Succinctly summarizing expertise based on behavioral data and incorporating evidence of social interactions may support more nuanced impressions and reduce bias. In any type of peer production site where a person shares their work for others to build on, dealing with contributions from others is necessary and important." Collaborative platforms or peer-production communities may consider adopting measures such as the viewership and PageRank discussed in our study, or other measures that fit their context better; a sketch of how these two measures could be computed follows.
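The sketch below illustrates how the two measures could be derived from platform view logs. It is an illustration under assumptions, not the paper's pipeline: the table `views` and its columns are hypothetical, the per-month windowing of viewership used in the paper is simplified away, and igraph [22] is the package the paper cites for network analysis.

    # Illustrative sketch: 'local' influence as the number of distinct
    # direct viewers per author, and 'global' influence as PageRank in
    # the directed viewing network. `views` (one row per view event,
    # columns viewer_id and author_id) is hypothetical.
    library(igraph)
    library(dplyr)

    # Local influence: distinct direct viewers per author
    # (computed per month in the paper; aggregated here for brevity).
    viewership <- views %>%
      distinct(author_id, viewer_id) %>%
      count(author_id, name = "n_direct_viewers")

    # Global influence: PageRank of each analyst in the viewer -> author
    # graph, so attention from well-viewed peers counts for more.
    g  <- graph_from_data_frame(views[, c("viewer_id", "author_id")],
                                directed = TRUE)
    pagerank <- page_rank(g)$vector  # named vector of scores per analyst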

5.2.2 Design Implications. Our results particularly inform the future design of collaborative platforms to support learning. We emphasize providing cues about expertise and star status, by showing that a data analyst learns better when she has more interactions with certain groups of star analysts. Accessible cues about expertise and star status can be critical for the legitimate peripheral participation of beginners or novices [58]. Research in the corporate domain also suggests that expertise finding is an important task (e.g., [74]) and shows that many internal tools have been developed to help people tag their own and others' expertise [82]. Our results suggest that collaborative platforms should invest in designing tools that deliberately highlight such information, thereby providing social signals. For example, dashboard pages and activity feeds could signal to data analysts the social significance and technical merits of their peers. The theory of social translucence also supports the potential for this transparency to radically improve collaboration and learning in complex knowledge-based activities [23]. For learners, analysts or developers can incorporate these cues into effective strategies for coordinating projects and advancing their technical skills [29]. For knowledge contributors, signaling their social influence within the community may motivate desired behavior, yet one should always anticipate unintended consequences.


5.3 Limitations and Future Work

First, our study focuses on a behavioral approach to measuring learning. Qualitative studies that describe the cognitive phenomena of analysts are lacking. Central questions like "what are the analysts learning" or "are they really learning anything" remain to be answered directly. Future work that interviews, observes, or surveys data analysts would bring insightful answers to these questions. Such follow-up qualitative studies, asking data analysts to explain their behaviors in creating and viewing queries, could complement our findings toward a better understanding of the mechanisms underlying these two learning processes. For example, researchers who find the negative association between viewing queries authored by 'maven' analysts and the viewer's productivity intriguing could survey the subgroup of analysts in our study who have viewed the most queries from 'mavens'. Perhaps these analysts found it very difficult to match their existing knowledge with the queries authored by 'mavens' [93]. Furthermore, we only record the analyst's activity of writing or viewing a query, without capturing which part of the query the focal analyst actually focused on. Such additional information may help identify which domain-specific aspects of viewing peers' queries, or which types of self-practice, advance learning efficiency more [33, 34]. Still, this requires more qualitative evidence in future work.

Second, our results show associations without supporting causal inferences. Although we apply mixed-effects models to address potential correlations and incorporate control variables to deal with some confounding factors, our analysis may still suffer from endogeneity issues such as self-selection bias. Nonetheless, the associations we find motivate randomized experiments in future research to carefully measure causal effects. One possible way to do this is to introduce exogenous variation in analysts' exposure to queries authored by peers. Randomizing query-search results and selectively masking peers' queries from analysts are both practical options.

Third, we use FirstCompletionTime to measure data-analyst productivity: the period of time from a data analyst creating an empty query to executing it for the first time. Other (aggregate) metrics, like the number of queries a data analyst creates in a week or month, may depend heavily on the assignments that the analyst has received. Nevertheless, it is possible that FirstCompletionTime depends on individual work style. Some empirical evidence in the learning sciences shows two distinct programming styles: tinkering vs. planning [90]. Tinkerers keep exploring new ideas by making adjustments step by step; planners first develop a clear plan, then do it once and right [77]. Tinkering data analysts may make small, frequent code changes and thus have a shorter FirstCompletionTime than planning analysts. This caveat could be addressed by using a different measure of coding time: the period during which the data analyst remains active on the composing page. Obtaining such a measure requires detailed click-stream data that is not accessible to us.
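For concreteness, the measure could be derived from event logs along the following lines; this is a sketch under assumptions, since the paper does not describe its extraction pipeline at this level, and the table `events` and its columns are hypothetical.

    # Sketch: FirstCompletionTime as the elapsed time from creating an
    # empty query to its first execution. `events` is a hypothetical log
    # with columns query_id, event ("create" / "execute"), and a POSIXct
    # timestamp.
    library(dplyr)

    first_completion <- events %>%
      group_by(query_id) %>%
      summarise(
        created    = min(timestamp[event == "create"]),
        first_exec = min(timestamp[event == "execute"]),
        # hours between creation and first execution
        FirstCompletionTime = as.numeric(difftime(first_exec, created,
                                                  units = "hours"))
      )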

Fourth, our study focuses exclusively on quantity (productivity) without discussing the quality of queries. Including the quality of a query is not straightforward, as different people may have different opinions on what constitutes a good query: teammates or project managers may want the query to be well documented so that it can be reused; IT staff may want the query to run efficiently on large-scale data; stakeholders or decision-makers may want the query to deliver accurate and insightful answers to their questions; the author may want the query to run smoothly without errors. Operationalizing these multiple perspectives on quality requires different levels of data, both qualitative and quantitative, which are not accessible in our data. Future research could extend our study by constructing measures of quality (e.g., the number of errors during execution, or the rate of positive responses from peers who have viewed the query) to examine the results with quality-adjusted output.


ACKNOWLEDGMENTS

The authors would like to thank Qian Chang, Rena Fired-Chung, Matt Gardner, Aaron Kalb, Jake Magner, and Dennis Zhang for valuable discussions and helpful feedback. Itai Gurvich, Yue Yin, and Jan Van Mieghem are grateful for the open collaboration with Alation and eBay; yet they have no financial interest in, nor received any financial remuneration or funding from, Alation or eBay.


A FULL REGRESSION RESULTS FOR MODELS 1-6 IN TABLE 2

Table 5. Zero-truncated Negative Binomial Regressions for Models 1-6: Full results. Dependent variable: FirstCompletionTime; standard errors in parentheses. Models 3 and 4 split indirect experience under the output-rate × viewership segmentation; Models 5 and 6 split it under the output-rate × PageRank segmentation.

Variable | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6
Aggregate direct experience | 0.015 (0.032) | – | −0.071∗ (0.034) | – | −0.045 (0.033) | –
Direct experience with the focal database | – | −0.057∗ (0.025) | – | −0.095∗∗∗ (0.025) | – | −0.079∗∗∗ (0.025)
Direct experience with different databases | – | 0.046∗ (0.022) | – | −0.012 (0.023) | – | −0.003 (0.023)
Aggregate indirect experience | −0.029 (0.032) | −0.025 (0.031) | – | – | – | –
Indirect experience from all-star | – | – | −0.144∗∗∗ (0.040) | −0.143∗∗∗ (0.040) | −0.199∗∗∗ (0.041) | −0.196∗∗∗ (0.040)
Indirect experience from non-star | – | – | 0.143∗∗∗ (0.031) | 0.151∗∗∗ (0.050) | 0.077∗ (0.031) | 0.083∗∗ (0.03)
Indirect experience from maven | – | – | 0.277∗∗∗ (0.033) | 0.269∗∗∗ (0.034) | 0.275∗∗∗ (0.026) | 0.268∗∗∗ (0.033)
Indirect experience from lone-wolf | – | – | −0.182∗∗∗ (0.039) | −0.178∗∗∗ (0.039) | −0.169∗∗∗ (0.033) | −0.171∗∗∗ (0.033)
Workload | 1.32∗∗∗ (0.08) | 1.28∗∗∗ (0.08) | 1.33∗∗∗ (0.084) | 1.31∗∗∗ (0.084) | 1.32∗∗∗ (0.084) | 1.31∗∗∗ (0.084)
Query size | −0.031∗∗∗ (0.01) | −0.036∗∗∗ (0.01) | −0.037∗∗∗ (0.01) | −0.037∗∗∗ (0.01) | −0.037∗∗∗ (0.01) | −0.037∗∗∗ (0.01)
Saved query | 2.89∗∗∗ (0.05) | 2.91∗∗∗ (0.052) | 2.89∗∗∗ (0.05) | 2.89∗∗∗ (0.052) | 2.90∗∗∗ (0.052) | 2.90∗∗∗ (0.052)
Migrated query | 1.88∗∗∗ (0.17) | 1.87∗∗∗ (0.174) | 1.88∗∗∗ (0.175) | 1.88∗∗∗ (0.174) | 1.90∗∗∗ (0.176) | 1.89 (0.176)
Tenure on Alation | −0.061 (0.069) | −0.061 (0.069) | −0.063 (0.069) | −0.061 (0.069) | −0.072 (0.069) | −0.071 (0.069)
eBay Level: 7 | 0.011 (0.246) | 0.015 (0.246) | 0.011 (0.246) | 0.011 (0.246) | 0.019 (0.246) | 0.019 (0.246)
eBay Level: 8 | −0.153 (0.25) | −0.156 (0.25) | −0.147 (0.25) | −0.153 (0.25) | −0.141 (0.25) | −0.146 (0.25)
eBay Level: 9 | 0.183 (0.322) | 0.173 (0.321) | 0.196 (0.322) | 0.183 (0.322) | 0.197 (0.322) | 0.184 (0.322)
eBay Sub-area: Global Function - HR | −0.629 (2.34) | −0.729 (2.34) | −0.599 (2.34) | −0.629 (2.34) | −0.664 (2.35) | −0.69 (2.34)
eBay Sub-area: Global Function - Legal | 0.013 (0.975) | 0.107 (0.974) | 0.034 (0.974) | 0.013 (0.973) | 0.102 (0.976) | 0.085 (0.976)
eBay Sub-area: Marketplaces | −0.274 (0.202) | −0.269 (0.202) | −0.275 (0.202) | −0.275 (0.202) | −0.276 (0.202) | −0.275 (0.202)
Database | Yes | Yes | Yes | Yes | Yes | Yes
Month | Yes | Yes | Yes | Yes | Yes | Yes
Constant | −1.36 (24.39) | −0.89 (19.93) | −1.2 (19.54) | −0.74 (16.09) | −0.38 (13.81) | −0.89 (17.96)
Observations | 79,797 | 79,797 | 79,797 | 79,797 | 79,797 | 79,797
Log Likelihood | −581,467 | −581,454 | −581,401 | −581,396 | −581,398 | −581,393
Akaike Inf. Crit. | 1,163,078 | 1,163,066 | 1,162,964 | 1,162,955 | 1,162,958 | 1,162,950
Bayesian Inf. Crit. | 1,163,802 | 1,163,800 | 1,163,716 | 1,163,717 | 1,163,711 | 1,163,712

Note: ∗p<0.05; ∗∗p<0.01; ∗∗∗p<0.001


REFERENCES
[1] Daron Acemoglu. 1996. A microfoundation for social increasing returns in human capital accumulation. The Quarterly Journal of Economics 111, 3 (1996), 779–804.
[2] Paul J Adams, Andrea Capiluppi, and Cornelia Boldyreff. 2009. Coordination and productivity issues in free software: The role of Brooks' law. In Software Maintenance, 2009. ICSM 2009. IEEE International Conference on. IEEE, 319–328.
[3] Armen Alchian. 1963. Reliability of progress curves in airframe production. Econometrica: Journal of the Econometric Society (1963), 679–693.
[4] Paul D Allison and Richard P Waterman. 2002. Fixed-effects negative binomial regression models. Sociological Methodology 32, 1 (2002), 247–265.
[5] John R Anderson, Lynne M Reder, and Herbert A Simon. 1997. Situative versus cognitive perspectives: Form versus substance. Educational Researcher 26, 1 (1997), 18–21.
[6] Linda Argote. 2012. Organizational Learning: Creating, Retaining and Transferring Knowledge. Springer Science & Business Media.
[7] Linda Argote and Dennis Epple. 1990. Learning curves in manufacturing. Science 247, 4945 (1990), 920–924.
[8] Linda Argote, Paul Ingram, John M Levine, and Richard L Moreland. 2000. Knowledge transfer in organizations: Learning from the experience of others. Organizational Behavior and Human Decision Processes 82, 1 (2000), 1–8.
[9] Pierre Azoulay, Joshua S Graff Zivin, and Jialan Wang. 2010. Superstar extinction. The Quarterly Journal of Economics 125, 2 (2010), 549–589.
[10] C Lanier Benkard. 1999. Learning and Forgetting: The Dynamics of Aircraft Production. Technical Report. National Bureau of Economic Research.
[11] Joel Brandt, Philip J Guo, Joel Lewenstein, Mira Dontcheva, and Scott R Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1589–1598.
[12] Matthew Brehmer, Michael Sedlmair, Stephen Ingram, and Tamara Munzner. 2014. Visualizing dimensionally-reduced data: Interviews with analysts and a characterization of task sequences. In Proceedings of the Fifth Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization. ACM, 1–8.
[13] Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30, 1-7 (1998), 107–117.
[14] Mollie E. Brooks, Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Maechler, and Benjamin M. Bolker. 2017. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R Journal 9, 2 (2017), 378–400. https://journal.r-project.org/archive/2017/RJ-2017-066/index.html
[15] John Seely Brown and Paul Duguid. 2000. Organizational learning and communities of practice: Toward a unified view of working, learning, and innovation. In Knowledge and Communities. Elsevier, 99–121.
[16] Ronald S Burt. 2009. Structural Holes: The Social Structure of Competition. Harvard University Press.
[17] Raymond PL Buse and Westley R Weimer. 2008. A metric for software readability. In Proceedings of the 2008 International Symposium on Software Testing and Analysis. ACM, 121–130.
[18] Hsinchun Chen, Roger HL Chiang, and Veda C Storey. 2012. Business intelligence and analytics: from big data to big impact. MIS Quarterly (2012), 1165–1188.
[19] Paul Cobb and Janet Bowers. 1999. Cognitive and situated learning perspectives in theory and practice. Educational Researcher 28, 2 (1999), 4–15.
[20] Patricia Cohen, Stephen G West, and Leona S Aiken. 2014. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Psychology Press.
[21] Mary M Crossan, Henry W Lane, and Roderick E White. 1999. An organizational learning framework: From intuition to institution. Academy of Management Review 24, 3 (1999), 522–537.
[22] Gabor Csardi and Tamas Nepusz. 2006. The igraph software package for complex network research. InterJournal Complex Systems (2006), 1695. http://igraph.org
[23] Laura Dabbish, Colleen Stuart, Jason Tsay, and Jim Herbsleb. 2012. Social coding in GitHub: transparency and collaboration in an open software repository. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work. ACM, 1277–1286.
[24] Eric D Darr, Linda Argote, and Dennis Epple. 1995. The acquisition, transfer, and depreciation of knowledge in service organizations: Productivity in franchises. Management Science 41, 11 (1995), 1750–1762.
[25] Sayamindu Dasgupta, William Hale, Andrés Monroy-Hernández, and Benjamin Mako Hill. 2016. Remixing as a pathway to computational thinking. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. ACM, 1438–1449.
[26] Chris Dede, Brian Nelson, Diane Jass Ketelhut, Jody Clarke, and Cassie Bowman. 2004. Design-based research strategies for studying situated learning in a multi-user virtual environment. In Proceedings of the 6th International Conference on Learning Sciences. International Society of the Learning Sciences, 158–165.
[27] John M Dutton and Annie Thomas. 1984. Treating progress functions as a managerial opportunity. Academy of Management Review 9, 2 (1984), 235–247.
[28] Carolyn D Egelman, Dennis Epple, Linda Argote, and Erica RH Fuchs. 2016. Learning by doing in multiproduct manufacturing: Variety, customizations, and overlapping product generations. Management Science 63, 2 (2016), 405–423.
[29] Thomas Erickson and Wendy A Kellogg. 2000. Social translucence: an approach to designing systems that support social processes. ACM Transactions on Computer-Human Interaction (TOCHI) 7, 1 (2000), 59–83.
[30] K Anders Ericsson and Herbert A Simon. 1980. Verbal reports as data. Psychological Review 87, 3 (1980), 215.
[31] C Marlene Fiol and Marjorie A Lyles. 1985. Organizational learning. Academy of Management Review 10, 4 (1985), 803–813.
[32] Wai Fong Boh, Sandra A Slaughter, and J Alberto Espinosa. 2007. Learning from experience in software development: A multilevel analysis. Management Science 53, 8 (2007), 1315–1331.
[33] Thomas Fritz, Gail C Murphy, and Emily Hill. 2007. Does a programmer's activity indicate knowledge of code? In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering. ACM, 341–350.
[34] Thomas Fritz, Gail C Murphy, Emerson Murphy-Hill, Jingwen Ou, and Emily Hill. 2014. Degree-of-knowledge: Modeling a developer's knowledge of code. ACM Transactions on Software Engineering and Methodology (TOSEM) 23, 2 (2014), 14.
[35] Thomas Fritz, Jingwen Ou, Gail C Murphy, and Emerson Murphy-Hill. 2010. A degree-of-knowledge model to capture source code familiarity. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. ACM, 385–394.
[36] Francesca Gino, Linda Argote, Ella Miron-Spektor, and Gergana Todorova. 2010. First, get your feet wet: The effects of learning from direct and indirect experience on team creativity. Organizational Behavior and Human Decision Processes 111, 2 (2010), 102–115.
[37] James G Greeno. 1997. On claims that answer the wrong questions. Educational Researcher 26, 1 (1997), 5–17.
[38] Boris Groysberg, Linda-Eling Lee, and Ashish Nanda. 2008. Can they take it with them? The portability of star knowledge workers' performance. Management Science 54, 7 (2008), 1213–1230.
[39] Cynthia Harrington. 2017. Speaking data science with an investment accent. CFA Institute Magazine 28, 4 (2017), 4–7.
[40] N Henke, J Bughin, M Chui, et al. 2017. The Age of Analytics: Competing in a Data-Driven World. McKinsey Global Institute.
[41] Gerard P Hodgkinson and Paul R Sparrow. 2002. The Competent Organization: A Psychological Analysis of the Strategic Management Process. Open University Press.
[42] George P Huber. 1991. Organizational learning: The contributing processes and the literatures. Organization Science 2, 1 (1991), 88–115.
[43] Robert S Huckman and Bradley R Staats. 2011. Fluid tasks and fluid teams: The impact of diversity in experience and team familiarity on team performance. Manufacturing & Service Operations Management 13, 3 (2011), 310–328.
[44] Anne Sigismund Huff and Mark Jenkins. 2002. Mapping Strategic Knowledge. Sage.
[45] Elina H Hwang, Param Vir Singh, and Linda Argote. 2015. Knowledge sharing in online communities: learning to cross geographic and hierarchical boundaries. Organization Science 26, 6 (2015), 1593–1611.
[46] Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. 2012. Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization & Computer Graphics 12 (2012), 2917–2926.
[47] Eser Kandogan, Aruna Balakrishnan, Eben M Haber, and Jeffrey S Pierce. 2014. From data to insight: work practices of analysts in the enterprise. IEEE Computer Graphics and Applications 34, 5 (2014), 42–50.
[48] Keumseok Kang and Jungpil Hahn. 2009. Learning and forgetting curves in software development: Does type of knowledge matter? ICIS 2009 Proceedings (2009), 194.
[49] Raghav Pavan Karumur, Bowen Yu, Haiyi Zhu, and Joseph A Konstan. 2018. Content is king, leadership lags: Effects of prior experience on newcomer retention and productivity in online production groups. (2018).
[50] Diwas Kc, Bradley R Staats, and Francesca Gino. 2013. Learning from my success and from others' failure: Evidence from minimally invasive cardiac surgery. Management Science 59, 11 (2013), 2435–2449.
[51] Diwas Singh Kc and Bradley R Staats. 2012. Accumulating a portfolio of experience: The effect of focal and related experience on surgeon performance. Manufacturing & Service Operations Management 14, 4 (2012), 618–633.
[52] Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2016. The emerging role of data scientists on software development teams. In Proceedings of the 38th International Conference on Software Engineering. ACM, 96–107.
[53] Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2017. Data scientists in software teams: State of the art and challenges. IEEE Transactions on Software Engineering 1 (2017), 1–1.
[54] Youngsoo Kim, Ramayya Krishnan, and Linda Argote. 2012. The learning curve of IT knowledge workers in a computing call center. Information Systems Research 23, 3-part-2 (2012), 887–902.
[55] Steve WJ Kozlowski. 2012. The Oxford Handbook of Organizational Psychology. Organizational Learning and Knowledge Management, Vol. 1. Oxford University Press.
[56] H Chad Lane and Kurt VanLehn. 2005. Teaching the tacit knowledge of programming to novices with natural language tutoring. Computer Science Education 15, 3 (2005), 183–201.
[57] Michael A Lapré, Amit Shankar Mukherjee, and Luk N Van Wassenhove. 2000. Behind the learning curve: Linking learning activities to waste reduction. Management Science 46, 5 (2000), 597–611.
[58] Jean Lave and Etienne Wenger. 1991. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press, Cambridge.
[59] Barbara Levitt and James G March. 1988. Organizational learning. Annual Review of Sociology 14, 1 (1988), 319–338.
[60] Ferdinand K Levy. 1965. Adaptation in the production process. Management Science 11, 6 (1965), B-136.
[61] Robert E Lucas Jr. 1988. On the mechanics of economic development. Journal of Monetary Economics 22, 1 (1988), 3–42.
[62] Jennifer Marlow, Laura Dabbish, and Jim Herbsleb. 2013. Impression formation in online peer production: activity traces and personal profiles in GitHub. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work. ACM, 117–128.
[63] Tara Matthews, Jilin Chen, Steve Whittaker, Aditya Pal, Haiyi Zhu, Hernan Badenes, and Barton Smith. 2014. Goals and perceived success of online enterprise communities: what is important to leaders & members? In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 291–300.
[64] Tara Matthews, Steve Whittaker, Hernan Badenes, Barton A Smith, Michael Muller, Kate Ehrlich, Michelle X Zhou, and Tessa Lau. 2013. Community insights: helping community leaders enhance the value of enterprise online communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 513–522.
[65] Anne S Miner and Pamela R Haunschild. 1995. Population-level learning. Research in Organizational Behavior: An Annual Series of Analytical Essays and Critical Reviews 17 (1995), 115–166.
[66] Michael Muller, Kate Ehrlich, Tara Matthews, Adam Perer, Inbal Ronen, and Ido Guy. 2012. Diversity among enterprise online communities: collaborating, teaming, and innovating through social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2815–2824.
[67] Sriram Narayanan, Jayashankar M Swaminathan, and Srinivas Talluri. 2014. Knowledge diversity, turnover, and organizational-unit productivity: An empirical analysis in a knowledge-intensive context. Production and Operations Management 23, 8 (2014), 1332–1351.
[68] Ikujiro Nonaka. 1994. A dynamic theory of organizational knowledge creation. Organization Science 5, 1 (1994), 14–37.
[69] Alexander Oettl. 2012. Reconceptualizing stars: Scientist helpfulness and peer performance. Management Science 58, 6 (2012), 1122–1140.
[70] Julian E Orr. 2016. Talking about Machines: An Ethnography of a Modern Job. Cornell University Press.
[71] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J Abadi, David J DeWitt, Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 165–178.
[72] James Piggot and Chintan Amrit. 2013. How healthy is my project? Open source project attributes as indicators of success. In IFIP International Conference on Open Source Systems. Springer, 30–44.
[73] R Core Team. 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
[74] Daphne R Raban, Avinoam Danan, Inbal Ronen, and Ido Guy. 2012. Impression formation in corporate people tagging. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 569–578.
[75] Ray Reagans, Linda Argote, and Daria Brooks. 2005. Individual experience and experience working together: Predicting learning rates from knowing who knows what and knowing how to work together. Management Science 51, 6 (2005), 869–881.
[76] Ray Reagans, Ella Miron-Spektor, and Linda Argote. 2016. Knowledge utilization, coordination, and team performance. Organization Science 27, 5 (2016), 1108–1124.
[77] Mitchel Resnick and Eric Rosenbaum. 2013. Designing for tinkerability. Design, Make, Play: Growing the Next Generation of STEM Innovators (2013), 163–181.
[78] Paul M Romer. 1990. Endogenous technological change. Journal of Political Economy 98, 5, Part 2 (1990), S71–S102.
[79] Gregor J Rothfuss and K Bauknecht. 2002. A framework for open source projects. Department of Information Technology, University of Zurich, Zurich, Switzerland 157 (2002).
[80] Matthew Rowe, Miriam Fernandez, Harith Alani, Inbal Ronen, Conor Hayes, and Marcel Karnstedt. 2012. Behaviour analysis across different types of enterprise online communities. In Proceedings of the 4th Annual ACM Web Science Conference. ACM, 255–264.
[81] David Schuler and Thomas Zimmermann. 2008. Mining usage expertise from version archives. In Proceedings of the 2008 International Working Conference on Mining Software Repositories. ACM, 121–124.
[82] N Sadat Shami, Kate Ehrlich, Geri Gay, and Jeffrey T Hancock. 2009. Making sense of strangers' expertise from signals in digital artifacts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 69–78.
[83] Elliot Soloway, Kate Ehrlich, and Jeffrey Bonar. 1982. Tapping into tacit programming knowledge. In Proceedings of the 1982 Conference on Human Factors in Computing Systems. ACM, 52–57.
[84] Bradley R Staats. 2012. Unpacking team familiarity: The effects of geographic location and hierarchical role. Production and Operations Management 21, 3 (2012), 619–635.
[85] Bradley R Staats and Francesca Gino. 2012. Specialization and variety in repetitive tasks: Evidence from a Japanese bank. Management Science 58, 6 (2012), 1141–1159.
[86] Garold Stasser, Dennis D Stewart, and Gwen M Wittenbaum. 1995. Expert roles and information exchange during discussion: The importance of knowing who knows what. Journal of Experimental Social Psychology (1995).
[87] Tom Fangyun Tan and Serguei Netessine. 2014. When does the devil make work? An empirical study of the impact of workload on worker productivity. Management Science 60, 6 (2014), 1574–1593.
[88] Vinay Tiwari. 2010. Some observations on open source software development on software engineering perspectives. International Journal of Computer Science & Information Technology (IJCSIT) 2, 6 (2010), 113–125.
[89] Jason Tsay, Laura Dabbish, and James Herbsleb. 2014. Influence of social and technical factors for evaluating contribution in GitHub. In Proceedings of the 36th International Conference on Software Engineering. ACM, 356–366.
[90] Sherry Turkle and Seymour Papert. 1992. Epistemological pluralism and the revaluation of the concrete. Journal of Mathematical Behavior 11, 1 (1992), 3–33.
[91] Marcie J Tyre and Eric Von Hippel. 1997. The situated nature of adaptive learning in organizations. Organization Science 8, 1 (1997), 71–83.
[92] Bogdan Vasilescu, Vladimir Filkov, and Alexander Serebrenik. 2013. StackOverflow and GitHub: Associations between software development and crowdsourced knowledge. In Social Computing (SocialCom), 2013 International Conference on. IEEE, 188–195.
[93] Anneliese Von Mayrhauser and A Marie Vans. 1995. Program comprehension during software maintenance and evolution. Computer 8 (1995), 44–55.
[94] Richard K Wagner and Robert J Sternberg. 1985. Practical intelligence in real-world pursuits: The role of tacit knowledge. Journal of Personality and Social Psychology 49, 2 (1985), 436.
[95] Fabian Waldinger. 2011. Peer effects in science: Evidence from the dismissal of scientists in Nazi Germany. The Review of Economic Studies 79, 2 (2011), 838–861.
[96] JP White. 1975. Psychology and Epistemology: Towards a Theory of Knowledge (Piaget, J.).
[97] Sze-Sze Wong. 2004. Distal and local group learning: Performance trade-offs and tensions. Organization Science 15, 6 (2004), 645–656.
[98] Seungwon Yang, Carlotta Domeniconi, Matt Revelle, Mack Sweeney, Ben U Gelman, Chris Beckley, and Aditya Johri. 2015. Uncovering trajectories of informal learning in large online communities of creators. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale. ACM, 131–140.
[99] Louis E Yelle. 1979. The learning curve: Historical review and comprehensive survey. Decision Sciences 10, 2 (1979), 302–328.
[100] Lynne G. Zucker, Michael R. Darby, and Marilynn B. Brewer. 1998. Intellectual human capital and the birth of US biotechnology enterprises. The American Economic Review 88, 1 (1998), 290–306. http://www.jstor.org/stable/116831

Received April 2018; revised July 2018; accepted September 2018
