Institute for Software Researchisr.uci.edu/tech_reports/UCI-ISR-03-4.pdfproject supports 2.5 million lines of code (Linux). Each project has independently chosen the tools and processes

Institute for Software ResearchUniversity of California, Irvine

www.isr.uci.edu/tech-reports.html

Justin R. ErenkrantzUniv. of California, [email protected]

Richard N. TaylorUniv. of California, [email protected]

Supporting Distributed and Decentralized Projects:Drawing Lessons from theOpen Source Community

June 2003

ISR Technical Report # UCI-ISR-03-4

Institute for Software ResearchICS2 210

University of California, IrvineIrvine, CA 92697-3425

www.isr.uci.edu

Supporting Distributed and Decentralized Projects:Drawing Lessons from the Open Source Community

Justin R. Erenkrantz, Richard N. TaylorInstitute for Software ResearchUniversity of California, Irvine

Irvine, CA 92697-3425{jerenkra,taylor}@ics.uci.edu

ISR Technical Report # UCI-ISR-03-4

June 2003

Abstract:

Open source projects are typically organized in a distributed and decentralized manner. These factors strongly determine the processes followed and constrain the types of tools that can be utilized. This paper explores how distribution and decentralization have affected processes and tools in existing open source projects with the goals of summarizing the lessons learned and iden-tifying opportunities for improving both. Issues considered include decision-making, accountabil-ity, communication, awareness, rationale, managing source code, testing, and release management.

Supporting Distributed and Decentralized Projects:Drawing Lessons from the Open Source Community

Institute for Software ResearchUniversity of California, Irvine

Irvine, CA 92697-3425

{jerenkra,taylor}@ics.uci.edu

ISR Technical Report # UCI-ISR-03-4, June 2003

Justin R. Erenkrantz, Richard N. Taylor

Abstract

Open source projects are typically organized in a distributedand decentralized manner. These factors strongly determine theprocesses followed and constrain the types of tools that can beutilized. This paper explores how distribution and decentrali-zation have affected processes and tools in existing opensource projects with the goals of summarizing the lessonslearned and identifying opportunities for improving both.Issues considered include decision-making, accountability,communication, awareness, rationale, managing source code,testing, and release management.

1. Introduction

Organizational distribution and decentralization alter criti-cal factors in the software development process. Historically,centralized organizational structures have prevailed. A singleorganization, or part of an organization, has been fully respon-sible for a project, bearing the ultimate responsibility for shap-ing the deliverables and selecting the tools and processes usedin development. Communication and consensus buildingwithin the organization has been facilitated by physical prox-imity.

For numerous business reasons, over the past decade andmore, many organizations have moved to

distributed

develop-ment � a single administrative authority operating over physi-cally distributed subgroups. This change has been supported byimprovements in communication and networking technologies.Nonetheless, with participants no longer physically collocated,the processes and tools of development have had to change toattempt to cope with the difÞculties so incurred.

Development of applications by

decentralized

organizationsadds an additional wrinkle into the problem. By decentralizeddevelopment we mean that no single organization controls theproject; rather that all decisions related to the goals and objec-tives for the project must be made multilaterally. The motiva-tion for decentralized development is akin to the motivation forparticipation in standards bodies: the common weal can beadvanced while independence is retained. Each organizationholds ultimate authority for its internal processes and tools. Tothe extent that interaction between participating organizationsoccurs, selection of the involved tools and processes must alsobe done multilaterally.

The premise of this paper is that we can gain some insightinto how to effectively meet the challenges that face decentral-

ized and distributed development organizations by examiningthe practices of the open source community, as these projectsare most often both distributed and decentralized. The intendedbeneÞciaries of this work is, largely, new open source projects,through several of the observations have applicability outsidethe open source domain.

Some work along this line has already taken place. Progres-sive open source[6], for instance, has been introduced in somecommercial entities. This model is primarily geared towardsapplying open source practices within the community of inter-nal employees or speciÞc strategic partners rather than the pub-lic. This model is not fully decentralized, however. Implicit inthe notion of progressive open source is a controlling authoritythat can dictate development.

There has also been a large body of work related to distrib-uted software development. One particular area that has beencarefully studied by Herbsleb,

et.al.

is the communicationbetween participants in a distributed software project[15]. Inthis case study, participants were spread across several coun-tries and developed a project collaboratively. One of the signif-icant results was that there was a clear bias towardscommunicating with people in proximity, rather than commu-nicating with remote peers, even when supported by good com-munication technology. This study does not fully explore theeffects of decentralization on development, however, as all ofthe participants essentially worked for the same organizationand were clustered in relatively large groups at a small numberof sites. In a highly decentralized and distributed softwareproject, few developers may be in proximity and belong to thesame organization.

Another body of work has focused on enhancing technolo-gies speciÞcally for supporting distributed development.CSCW technologies Þt into this category, as well as enhance-ments to the web. In [7], for instance, enhancements were dis-cussed that could make the current web tools better facilitatecollaboration. However, it limited itself to web-based artifactsand does not lay out a guideline for the processes best suited tothese tools.

It almost goes without saying that not all projects requireheavyweight processes and tools due to their limited scope orparticipation. Introducing unnecessary processes and tools maystiße a small project. It is also possible that participants do notdesire expansion beyond a speciÞc threshold. These classes ofprojects do not truly Þt the distributed and decentralized crite-ria. In the following discussion, therefore, we will only con-cern ourselves with projects that are sizeable or complex

enough to warrant tool and process support and which aredeveloped in a collaborative, distributed, and decentralizedfashion.

We begin the remainder of the paper with discussion ofa survey of open source projects, showing similarities thathave arisen in tool usage. Discussion then turns to charac-terizing the ways distribution and decentralization can con-strain processes and tools. We then begin to summarizelessons from the open source experience, starting withidentifying the management and coordination needs. Wecontinue with an examination of techniques for satisfyingthese various needs. We conclude with a discussion ofpotential future work.

2. Basis Projects

Since most open source projects display signiÞcantdegrees of distribution and decentralization in their organi-zation, they provide a good foundation for study. Someprior examinations have been conducted into the tool usageof open source projects[12]. While most open sourceprojects are not directly related to each other in terms of thesubject of their production, a commonality of supportingtools has emerged in many cases.

In [12], eleven open source projects were surveyed todetermine what tools are used to support the developmentmodel of the project. The survey was conducted to deter-mine the quality aspects of open source projects and deter-mine how to improve the project deliverables. Thesurveyed projects are spread across several differentdomains - including compilers, web servers, programminglanguages, and desktop environments.

The surveyed projects are among the most successfulopen source projects available. The Apache HTTP Server iscurrently in use by about sixty percent of all websites[20].Servers shipping with the Linux kernel amounted for four-teen percent of all servers shipped in the Þrst quarter of2003[11]. Tomcat is the ofÞcial reference implementationfor the Java Servlet and JavaServer Pages technologies[2].Therefore, these projects provide a reasonable basis forexamining how successful distributed and decentralizedopen source projects should be conducted.

As Figure 1 depicts, these projects also represent a widerange of source code size. One project had as little as 55thousand lines of code (Tomcat), while another surveyedproject supports 2.5 million lines of code (Linux).

Each project has independently chosen the tools andprocesses that best Þt it. Most of the surveyed projects donot have any common developers, so there is no direct rela-tionship. However, in some areas, a consensus appears tohave been reached concerning the proper tools to use.

In the survey, all of the projects shared the same sourcecontrol system, CVS. However, since the publication of thesurvey, Linux has adopted BitKeeper as its source controlsystem[4]. While there does currently appear to be a con-sensus regarding CVS, a number of other replacements toCVS are actively being developed. These include such

tools as Arch[1] and Subversion[5]. Therefore, this consen-sus may not be stable over the long-term as newer productsattempt to replace CVS.

In other areas, there is no single tool that predominates;rather, a small number of tools are commonly used. Onesuch area is in mailing list software; two tools currentlydominate - ezmlm[3] and Mailman[10]. While two of thesurveyed projects used other tools, the rest used one ofthese two tools.

In yet other areas, such as web portals, there is extremevariations in the tools used. No two projects shared thesame tool for updating their website. At this point, mostprojects are creating customized tools for their website thatÞt their individual needs rather than relying upon a pre-built solution for content management.

The variations of tool similarity across problemdomains presents an interesting statement. In some areas,open source projects have found a particular tool that seem-ingly Þts their development model well. This is evidencedby consensus concerning a particular tool. This consensusmay be due to an inherent property of the way the project isorganized whereby this tool is the only obvious choice. Or,perhaps the adoption of a particular tool is a matter of his-torical accident. If the adoption is related to historical acci-dent rather than a solid Þt, the introduction of tools that arebetter suited to a distributed and decentralized developmentmodel may be able to replace the current consensus. How-ever, in order to encourage better tools and processes, wemust understand the constraints placed upon an opensource project by decentralization and distribution.

3. Constraints

The presence of decentralization and distribution in asoftware project places a number of new constraints onwhat processes and tools can be effectively utilized. Inorder to obtain a clearer picture of what may work in these

Figure 1: Lines of Code in Basis Open Source Projects

0 500 1000 1500 2000 2500 3000

XFree86

Python

Perl

NetBeans

Mozilla

Linux Kernel

KDE

Tomcat

GCC

GNOME

Apache HTTP Server

Li f C d (Th d )

environments, we need to be able to identify these con-straints.

3.1. Decentralization

The decentralization aspect of development requiresprocesses to consider multiple interested parties. Theinvolved developers may act towards their own goals,rather than the goals of the entire project. Therefore, not alldevelopers will necessarily be aligned on all items andtasks. Yet, the processes and tools used should try to pro-mote working towards a common beneÞcial goal whilemeeting the individual goals.

Due to decentralization, developers may not all work forthe same physical organization. However, one organizationmay fund a portion of the developers on a project. If thisorganization removes its funding, their associated develop-ers may leave the project. Therefore, the project needs to beable to withstand such losses or risk having the projectabandoned. This risk promotes processes and tools whichmaintain continuity and shared communal knowledge.

When the individual goals of organizations collide, careshould be taken in resolving these concerns. If these con-cerns are not met to everyone�s satisfaction, dissatisÞedorganizations may leave the project. Depending upon theinßuence of the departing subset, it may place the project injeopardy. Therefore, processes promoting compromisesshould be strongly emphasized to minimize such depar-tures.

3.2. Distribution

Distributed software development places a strain on theproject�s communication mechanisms. When developersare not collocated, it is no longer possible to have frequentface-to-face meetings. Therefore, other communicationmechanisms and tools must be deployed to Þll this void.

As noted earlier, prior studies into the nature of distrib-uted software development have indicated that it is hard tofacilitate communication to the right person at the righttime across site boundaries[15]. In order to address thisproblem, processes and tools need to be in place to allowtimely identiÞcation of key contacts. Herbsleb,

et.al.

, forinstance, identiÞes needs in the areas of awareness, richinterpersonal interaction, and support for Þnding experts.

Since developers are not physically collocated, it maycause problems with synchronous communication as devel-opers may be scattered across timezones. If synchronousmethods are used, some participants may not be able tocontribute to a discussion. Therefore, asynchronous com-munication mechanisms are usually preferred.

4. Management and Coordination Needs

This section identiÞes management and coordinationneeds that decentralized and distributed project organiza-tion imposed. If these needs are not properly addressed atthe outset, then repercussions may arise as the project

matures.

4.1. Goals

Before embarking on a project, there is usually a needfor a clear statement of goals that the project needs toaccomplish in order to be successful. Upfront identiÞcationof goals allows for examination by prospective partici-pants. It may be that the initial goals may not suit all inter-ested parties. Therefore, they may need to be altered tosupport other interests. A consensus about the end resultcan then be built by the new participants. If the interests arethus made to correspond, the groups can begin to proceedto coordinate development tasks. If their interests are irrec-oncilable, the parties may proceed separately. A confronta-tion may occur later if an implicit differences in goals islater revealed.

Furthermore, if only a few parties wish to participate,the potential cost of decentralization may not add sufÞcientvalue to the project. It may be that the project does not haveenough outside attraction to reach a critical mass to supporta viable community. Unnecessarily adding the overhead ofdecentralization may end up harming the viability of theproject.

4.2. Coordination of Initial Development

Once a goal has been determined, the interested partiesneed to identify how to reach these objectives. This road-map can be valuable in planning development activities. Aproject may have an initial donation of code to build upon,or the new project needs to start the development process.

Inherited code

A project may inherit code based upon a prior effort thathas decided not to further development, or, one of the inter-ested parties may be willing to donate code to begin thedevelopment effort. In either case, interested parties shouldbe aware of the implications of the decision.

When using inherited code, the primary task becomesenhancement and evolution. At Þrst, the project may beable to bypass the design stage of the software life-cycle.The majority of tools and processes will be geared towardsmaintenance. Depending upon completeness of the dona-tion, design artifacts may need to be reproduced to promoteunderstanding of the inherited code.

As the project matures, limitations may be found in theinitial design that require substantial refactoring. The initialdevelopers may desire a reasonable expectation that theinherited code allows for ample extensibility. Otherwise,efforts to evolve the code may encounter an early road-block that forces reconsidering the usage of this code.

Initial code

When a project begins afresh, the initial processes andtools will primarily be design-oriented rather than mainte-nance-oriented. In order to work in a distributed environ-ment, the processes and tools must be able to supportcollaborative design. As the project evolves, the processes

may alter to primarily supporting implementation andmaintenance tasks.

A common occurrence in open source projects is that apublicly documented standard is implemented. These doc-uments are typically written by a separate standards organi-zation. These documents serve as the initial requirementsand often specify interoperability characteristics of thedeliverable. Once agreeing upon a standard to implement,developers can then devise a plan to carry out the architec-tural design and implementation.

By minimizing the requirements stage of the softwarelife cycle by leveraging pre-existing standards, more effortcan be directed towards the design and implementation.However, many projects implementing public standardsalso provide feedback to the standards committee based onreal-world implementation experience.

Effect on design and requirements

Since some of the most prominent open source projectsinherited code which implements a public or well-knownstandard, it may stand to reason that the processes involvedwith design and requirements gathering are not as welldeveloped as maintenance and extensibility of code in opensource projects.

However, in these particular cases, the requirements andinitial design have already been determined in a very rigor-ous manner. In the case of projects which implement Inter-net RFCs, these requirements have been created in a verydecentralized fashion. However, once these requirementshave been established, various parties will form groups toimplement the standard.

Therefore, while it may seem that some open sourceprojects lack an emphasis on requirements and design, wemay be able to rationalize that on the strict division ofrequirements and implementation in the traditional stan-dard-making bodies of the Internet.

4.3. Common procedures

In decentralized communities, the interested parties mayestablish a common set of rules for running the project.These rules will take the place of a controlling authoritywhich dictates such rules. Furthermore, this will allow allparties to operate within stated organizational parameters.

The parties should have already agreed on the goals andmay have agreed on the initial design, but they must nowalso agree how to reach the desired result in a formal man-ner. If a conßict between members of the project arises,there needs to be a predetermined mechanism for resolvingthese conßicts.

If these steps are ignored and such a process does notexist, it may introduce tension between parties. By creatingand following these guidelines, the belief is that most con-ßicts will be resolved peacefully. These procedures shouldalso attempt to not introduce unnecessary overhead in thedevelopment process.

If these mechanisms fail and an impasse develops, then

the community may be

forked

. One of the most prominentexamples of the forks of open source projects are amongthe BSD-derivatives[17]. Despite having a common ances-try, the vision of each BSD-derivative is slightly different.In essence, each BSD-based platform has the goal of creat-ing a Unix-like operating system. However, each of thesederivatives has a different technical or procedural vision ofhow this goal should be accomplished.

Therefore, we suggest that decentralized projects areself-correcting though at a cost of wasted resources. Whena difference of vision occurs between developers and orga-nizations, projects can be forked to maintain the integrityof the private goals. In the end, each constituency retainstheir private goals, and may be willing to separate fromother participants if an impasse is reached. Only when theirprivate goals are being met will participation continue.

4.4. Tool requirements

When multiple parties participate in decentralizeddevelopment, special attention should be made to therequirements of the tools that support the processes. Theselection of tools should recognize that not all developershave equal resources to acquire specialized tools. Opensource projects may not be directly funded, but when allparticipants are funded, these requirements may not be asstringent.

Since open source projects are traditionally open to alldevelopers regardless of organizational afÞliation, the toolsused are commonly open source as well. By relying on freetools, this alleviates Þnancial barriers to participation as notall developers may receive direct compensation for theirwork on a project. It may be unreasonable to expect devel-opers to purchase tools in order to work on a project.

Due to the variety of developer preferences, mostrequired tools need cross-platform support. In the opensource community, a good tool will not require developersto switch their operating system to use a special tool. Thisallows developers to work on platforms with which theyare most comfortable.

Since a project may attract developers of differentskillsets, it may be unreasonable to expect developers tohave special training in a tool or a technique. To offset this,projects may need to provide clear documentation on tech-niques that will help unfamiliar developers. Furthermore,since the participants are self-selecting, not all participantsmay have formal computer-science backgrounds, so somemore advanced techniques may not be accessible to all par-ticipants.

5. Process and Management Tools and Tech-niques

This section describes process and management tech-niques that may be used in distributed and decentralizedprojects. These techniques attempt to satisfy the needs dis-cussed previously. We will examine how open source

projects are solving these constraints and identify potentialareas of improvements for each concern. Table 1, at the endof the paper, will summarize these techniques.

5.1. Delegation and Decision-Making

A concern for distributed and decentralized projects isdelegation of assignments and leadership. Since the partici-pants do not necessarily share the same reporting structure,traditional management techniques may not apply.

Similarly to other management models, there may eitherbe a ßat or hierarchical structure within this decentralizedorganization. Ultimate authority may rest with a speciÞcindividual, or decision-making responsibility may beshared by the interested participants. In the case of a singleauthority, this individual may set policies unilaterally.However, these policies must still promote participation byothers. This requires the creation of a benevolent dictator-ship where participants are willing to yield authority to acentral authority � explicitly backing away from decen-tralization.

A prominent example of this central authority organiza-tional model is seen in the Linux kernel. Linus Torvaldswas the initial designer and developer of the Linux kernel.The rest of the participants have allowed him to maintaincontrol over the project. Linus has the ultimate say on whatchanges make it into the kernel.

Designating a single individual with ultimate authoritymay create an organizational bottleneck. Therefore, a hier-archical organizational structure usually accompanies thesestructures. In Linux, most substantial components of thekernel have an associated maintainer. Rather than submit-ting a change directly to Linus, changes should be submit-ted to the responsible maintainer. If the maintainer agreeswith the patches, the patches can then be submitted toLinus.

Linus places a certain degree of trust in his maintainersthat they will follow his process for submitting patches tohim and deal with most of the overhead for that componentYet, due to the supreme nature of Linus�s role, he can over-rule the maintainer of a component. It is possible to cir-cumvent a maintainer and send a patch directly to Linus. Ifhe decides to apply the changes without receiving priorinput from the maintainer, he retains that right.

Another model commonly used by open source projects,one more in tune with decentralization, is the meritocracymodel. This is exempliÞed by the Apache HTTP ServerProject[9]. All members share equal power, so there is nodirect leader of the project. Under this ßat organizationalmodel, people gain power by sustained contributions overtime. The power of the developer is enabled by grants ofwrite access to the shared repository and the ability to vetochanges.

Until a developer gains commit access, they are consid-ered a contributor. While they may participate freely on themailing lists, an intermediary with appropriate access mustreview and commit their suggested changes. They may also

cast non-binding votes on issues before the community.Since these votes are non-binding, developers with bindingvotes may choose to disregard such votes.

As the voting developers are exposed to a new partici-pant, they are examining the quality of the contributionsand how the participant works within the community. Then,one voting developer will nominate the contributor for vot-ing privileges to the rest of the voting developers on a pri-vate discussion list. If the group considers the contributionsbeneÞcial and the participant trustworthy, voting privilegeswill be offered.

While no single person can control the project, each vot-ing developer has

veto

authority to stop undesired changesfrom being merged into the shared repository. While thesevetoes can be cast on any patch, there must be a valid tech-nical reason for stopping this change. There is also no wayto override a veto - this organizational model enforces con-sensus-building.

5.2. Accountability

Accountability may become an issue in a decentralizedorganization. If there is a problem with the software, usersmay desire a contact to resolve this problem. Open sourceprojects have typically addressed this concern in two fash-ions: creating for-proÞt corporations that provide commer-cial support or creating non-proÞt foundations that providea perpetual point of contact. These solve the issue by adirect step away from decentralization.

Some organizations that participate in open sourceprojects provide for-fee support as a source of revenue. Forexample, this is common in the relational database domain.Two open source databases, PostgreSQL and MySQL, bothhave strongly related corporations that sell support to end-users.

These commercial entities will often provide supportplans that assist users in setting up the product. These com-panies may also respond to direct support questions con-cerning the product. By having a revenue stream, thesecompanies are able to fund development of the associatedopen source project by directly Þnancing developers. Thesedevelopers may add enhancements that the organization�sclient base has requested, or Þx problems that have beenidentiÞed by support personnel.

As an alternative to providing a commercial support,some open source projects have established non-proÞtfoundations. These foundations are the owner of the codeand do not have any explicit commercial interests. There-fore, it is expected that these foundations will be able tooversee and maintain accountability for the code. Twoprominent examples of this are the Free Software Founda-tion and FreeBSD Foundation.

In these cases, a non-proÞt foundation is usually respon-sible for providing the infrastructure of the project. Theywill typically provide the services that allow developmentto occur. These foundations do not usually provide end-user support or directly fund developers. However, the

foundation is expected to be eternal, while a for-proÞt cor-poration may be forced to dissolve due to Þnancial consid-erations.

5.3. Communication

Due to the introduction of distribution, there may bevarying degrees of developer collocation. Since the projectsare also typically decentralized, developers may not workfor the same physical organization. Therefore, the develop-ment process must allow for communication between peo-ple not at the same location and not belonging to the samephysical organization. Therefore, the majority of communi-cation should be at the virtual organization level, ratherthan the physical organization. By relying upon asynchro-nous forms of communication rather than synchronouscommunications, a higher proportion of global developerscan be supported. Yet, relying upon asynchronous commu-nication introduces a delay factor[8,15].

In order to facilitate communication to the right personat the right time, mailing lists are commonly used. Thisreduces the number of contacts that are required. Almostevery open source projects uses public mailing lists to pro-mote subscription by non-developers and to encouragecontributions by new developers.

Multiple mailing lists may also be used to further seg-ment the mail trafÞc. These mailing lists may be dedicatedto a particular sub-topic. By reducing the scope of a mail-ing list, it allows for separate communities to form withinthe same project. This may be beneÞcial for encouraginggrowth in large projects. It also moves discussion awayfrom a more generic mailing list where there may not be asmany interested people in the discussion.

It has been stated that email is predominately usedbecause it is the least common denominator[8]. One prob-lem with email is that it requires a common language to beused. Mechanical translation services have not yet provedto be sufÞcient to address technical translations. This maypromote developers who are only ßuent in the main lan-guage of the developers.

A possible avenue to research would be to investigateprojects where developers do not share a common lan-guage. In these cases, it would be useful to analyze howdevelopers communicate when they do not share a lan-guage. This may promote islands of developers that do notoften communicate.

In addition to relying upon asynchronous communica-tion, some projects use a variety of synchronous communi-cation (such as real-time chats or instant messaging). Yet,this is only effective when developers are located in similartime zones. If not all developers can participate in synchro-nous communications, it is essential that some archival ofthe communications be made. Otherwise, key participantsmay be left out of a critical discussion.

5.4. Awareness

Awareness is an understanding and coordination of what

participants are doing. Since the personnel of a decentral-ized and distributed community may be rapidly changing,it may be hard to even identify who is currently active. Thisaspect of development processes has been remarkablyunderdeveloped. Most coordination efforts remain ad-hocand short-term.

However, mailing lists provide a rudimentary tool forcoordination. A developer can post on the mailing list indi-cating that they are planning to perform some activity. But,there is no enforcement of this plan. This leads to a prob-lem when a participant says they are going to accomplishsome task, but does not complete it.

Some projects may also use shared information reposi-tories for awareness information. For example, the ApacheHTTP Server Project relies on a STATUS text Þle that listsoutstanding issues; this Þle is emailed weekly to the maindevelopers mailing list. Participants may make a notationas to which issues they are addressing. However, it mayrequire frequent refreshing of this Þle to ensure that theinformation is not stale.

Many open source projects also require that largechanges be discussed before implementation starts. Thisallows other developers to provide feedback on proposedimplementation strategy. Leveraging the feedback of devel-opers may allow potential design problems to be detectedearlier than if review occurs after implementation.

5.5. Historical Rationale

Since turnover may be high in decentralized projects, acollective history should be maintained and documented.By examining past communications and activities, newdevelopers can begin to understand decisions made at acertain point in the past. It also allows developers to learnfrom prior decisions.

It is essential to use communication mechanisms thatallow for long-term archival. Most asynchronous forms ofcommunication lend themselves well to archiving - such aspublic mailing list archives. Therefore, the delay factorintroduced by asynchronous communication has an advan-tage of allowing capture of historical rationale.

However, spontaneous synchronous communicationsare often not archived. This may often be seen in projectswhere a number of developers are physically co-located. Inthese environments, face-to-face communications mayhave an unusually strong bias[15]. In addition to not allow-ing full participation of the group, these sorts of communi-cation may be detrimental to distributed projects becauseartifacts of these conversations are rarely recorded.

A common problem in open source projects is that newdevelopers often repeat or bring up old discussions. Thisdemonstrates a lack of tools that encourage review of pastdiscussions. If tools for reviewing prior discussions werereadily available, developer time spent rehashing prior top-ics could be minimized.

To help with this, Perl has created Perl Design Docu-ments (PDDs) which lay out the rationale for certain deci-

sions made in the development of Perl 6 and Parrot[28],This allows new developers to annotate and reexamineprior decisions in a central location. It may happen that anew developer has added insight that was not noted in theprior conversation. If post-mortem annotation of discus-sions is allowed, it may achieve a balance between stißingand encouraging reexaminations.

5.6. Design Rationale

In addition to allowing for discovery of important his-torical conversations, it may be critical to understand thedesign rationale of certain components. In a distributed anddecentralized environment, it may not be possible to con-tact the original author of a section of code. Therefore,mechanisms need to be in place to communicate rationaleto future participants.

One way to communicate rationale is by creating devel-oper documentation. Some open source projects, such asAbiWord[25], keep interface deÞnitions and notes in-linewith the source code. Documentation can then be publishedwith tools such as doxygen[14]. By synchronizing the loca-tion, when major changes are made to the source code, thebelief is that the documentation will be updated. Thismakes it easier to produce developer documentation whichreßects the current code.

Depending upon whether the project did the initialdesign, other artifacts such as design documents and dia-grams may be available. In projects that provide an extensi-ble interface, it is also common to produce well-explainedand concise examples as a way to illustrate the interface inaction. This helps new developers of an interface under-stand the code by looking at examples.

There has been research into encouraging softwarereuse, but these tools have not yet been integrated into themainstream. There are tools available that provide relevantinterface information in a personalized manner[30]. Therehas also been work towards harvesting the structural andsemantic information of code[19]. Encouraging adoptionof already existing tools may make capturing design ratio-nale easier.

5.7. Participation

In projects where the personnel on a project may changefrequently, it is important to have a published set of devel-oper guidelines. These guidelines allow familiarizationwith the processes and tools used in a project. New partici-pants can review them and contribute to the project in anintelligent manner.

Sites such as the KDE Developer�s Corner provide awealth of information that allow new participants inter-ested in KDE to learn how to contribute[18]. The site con-tains introductory tutorials for developers new to theinternals of KDE. Information about the development toolsrequired to compile KDE and how to obtain the latest KDEsnapshots are also available.

It has been mentioned that having an established set of

guidelines shared by projects can reduce the redeploymentcosts of developers[6]. If all projects shared the essentialguidelines, it would make it easier to contribute to newprojects. If each project had its own set of unique guide-lines, it would be difÞcult to transition to new projects.Therefore, it would be beneÞcial to encourage standardiza-tion of participation guidelines across projects.

5.8. Controlling Participation

A corollary to encouraging participation in decentral-ized and distributed software is that participation by newpeople must be managed by the current participants.Depending upon the access policies of the project, new par-ticipants may only have limited access to making changesto the project. Therefore, processes and tools need to be inplace to support facilitating such contributions.

Tools such as the SourceForge�s patch manager used bythe Python project can be extremely useful[21]. These toolsallows participants to submit patches to be applied, thendevelopers with the appropriate commit access can inte-grate the changes. This particular tool also allows for anno-tations to be stored.

However, these current tools suffer from a lack of inte-gration with the rest of the development process. Someprojects enforce a policy where a certain number of posi-tive reviews must be received before a change can be inte-grated[26]. Contributions may also grow stale as theproject code base evolves. Furthermore, if none of theactive developers deem an issue important, it may be achallenge to motivate integration. The tools used to controlparticipation should ease the burden of merging thechanges.

5.9. Managing Source

Since the developers are distributed, it is often a require-ment to have a uniÞed view of the source code. If a uniÞedview is not available, it may be possible for developers tonot be aware of the current state of affairs. Therefore, mostprojects will adopt some sort of collaborative software con-Þguration management system (SCM). The processes andtools need to balance that each developer should be able towork independently while allowing them to remain consis-tent with the rest of the team.

As discussed previously in [12], the predominate SCMin use by open source projects is currently CVS. There hasbeen a recent trend in seeking tools that can replace CVS[1,4, 5]. CVS is based on a centralized repository model withone repository holding all of the content. Some of thenewer SCM tools that have been introduced are keeping thecentralized model of CVS[5], while others are attemptingto decentralize the repository structure[1, 4].

However, depending upon the accountability structureof the project, it may make sense to keep a centralizedrepository even in a decentralized project. If a project has anon-proÞt organization which holds the copyright, then thisorganization should administer the master repository. How-

ever, if the project has a loose accountability structure, adecentralized repository structure may be more efÞcient.

Most SCM tools currently in use also promote an opti-mistic conßict resolution model rather than a pessimisticconßict resolution mode[16]. An optimistic locking strat-egy allows source conßicts to be resolved at commit-time,while a pessimistic locking strategy uses locking to preventothers from making changes while a change is being devel-oped. A pessimistic locking strategy may interfere withparallel development as it prevents two developers fromworking on the same Þle at the same time. Only using pes-simistic locking may have an impact upon the effectivenessof distribution.

5.10. Issue Tracking

One of the stated advantages of open source projects isthat it is easier to Þx problems since the source code isfreely available[22]. However, it may still be difÞcult fornon-developers to Þx problems as they may not have theappropriate background required to resolve a defect. There-fore, processes and tools are required to report problems tothe people who can help resolve defects.

Due to the presence of decentralization, it may be difÞ-cult to solicit participants who can resolve reported defectsin a timely manner. Some participants may be wary ofworking with end-users, or are too busy to deal withacquiring the relevant information from the reporter. There-fore, the tools need to be able to support novice users andexpert developers efÞciently.

Standardizing on issue tracking tools, such as Mozilla�sBugzilla[27], may increase the familiarity of both users anddevelopers with these tools. However, as these tools areadopted by more projects and enhancements requested,feature creep must be resisted. If the issue tracking toolbecomes too complicated to use effectively, its usefulnessis diminished.

5.11. Documentation

Since not all users of a project are developers that canunderstand code, a project must also be able to deliverquality user documentation. Otherwise, users may Þnd theproduct too complicated to use properly. A signiÞcant chal-lenge for distributed and decentralized projects is to havedocumentation at an equivalent quality to the code.

At best, documentation can be viewed as a form ofsource code. Therefore, many of the processes that apply tosource code can also apply to documentation. Documenta-tion may be written in a collaborative environment usingsimilar tools and processes as the ones used to write code.

A problem in any software project is how to keep theend-user documentation synchronized with the current ver-sion of the source code. Oftentimes, developers are hesitantor reluctant to write user documentation. Therefore, whenthey make a change that is visible to a user, the developermay not update the relevant documentation. This leads tothe documentation becoming out of sync with the code.

Some open source projects have addressed this by hav-ing separate documentation teams. One example of thisseparation is in PHP�s documentation[29]. By isolating thedocumentation process from the development process, itenforces another perspective on the usability aspects of thecode. This may result in an increase of quality of both thecode and end-user documentation.

Another characteristic of the PHP documentation pro-cess is that it allows users to annotate the documentation onthe website. As users spot errors in the documentation, theymay append a correction comment to the website. Then,PHP documentation participants can harvest the changesinto the main documentation.

5.12. Testing

There is often a strong desire to ensure that a projectmeets both the functional and reliability goals previouslyestablished. Therefore, testing processes and tools shouldbe developed and encouraged throughout the life cycle ofthe project. There are two classes of methods that are typi-cally used in open source projects: code review and testing.

Since it is difÞcult to conduct regular meetings in a dis-tributed workplace, it is not possible to conduct periodiccode review sessions. Therefore, code reviews must occuras the changes are conducted. Developers are usually askedto make small veriÞable changes rather than large changes.By asking all developers on a project to review the changesas they happen and asking for the most concise changespossible, it may make it easier to identify problems sooner.

Besides relying upon manual inspection, some projectshave a suite of automated tests for the project. These auto-mated tools allow all participants to run the same set oftests at their discretion on their speciÞc platform. One suchproject that utilizes automated tests is Subversion[5]. Thetest suite in Subversion is extensive and tests almost allfunctionality of the system. Furthermore, no releases canbe made without Þrst passing the automated tests. It may bepossible to integrate some recent research into optimizingwhich regression tests are executed to improve the perfor-mance of the test suites[13].

5.13. Release Management

Since users ostensibly wish to use the deliverables of aproject, quality releases must be produced. Therefore, aviable release strategy must be determined. If the projectdoes not have a coherent process in place, it may haveproblems attracting users or achieving a reputation for sta-bility.

In order to achieve widespread distribution, an infra-structure must be in place to allow public consumption.Some projects rely upon mirrored servers to handle theload of delivering releases to end-users. A critical concernis how to select these mirrors - should they be self-selectedor should they be limited only to trusted individuals.

One such project that relies upon mirrors to deliverreleases is Debian[23]. Debian balances the load across

many geographically dispersed self-selected servers. How-ever, Debian has several push-primary mirrors that are cho-sen because of their reliability. Self-selected mirrors canthen pull releases from one of the pushed mirrors ratherthan accessing the master site directly.

Projects may also place meanings on the versions thatdeliverables are labeled with. This allows a shared under-standing of the expected reliability. At times, it is helpfulfor a project to have a development branch that is notintended for widespread usage. These releases can also beused to perform dry-runs of the release process. This can beespecially helpful when a project is trying a new releaseprocess. By explicitly labeling a version as unstable ordevelopment, it can help match the expectations of userswith the expectation of the developers.

For example, Debian always has at least three versionsthat are actively maintained: stable, testing, and unsta-ble[24]. The stable distribution is the one that is recom-mended for widespread usage. Then, the testingdistribution consists of packages that are waiting to beincluded in the next stable release. Then, the unstable dis-tribution is meant for developers and not meant for produc-tion quality.

6. Summary and Future Work

Adopting a decentralized and distributed organizationfor developing software requires rethinking fundamentalprocess and tools. We have attempted to examine the con-sequences of supporting decentralization and distributionby seeing how open source projects have addressed theseconcerns. Table 1 provides a summary of the issues, tech-niques, and projects discussed in this paper. It also lists

avenues for enhancement that have been identiÞed wherethe current processes and tools could be improved to bettersupport distributed and decentralized software projects.

If projects can create a clear line of accountability that isseparate from all of the participants, it may foster a sense inthe users that responsibility will be maintained. A decen-tralized project should be able to withstand the departure oforganizations gracefully. If this is not present, then usersmay become wary of the project falling out of active main-tainership.

By limiting the scope of discussion lists, it makes it eas-ier for participants to understand what is currently going onin areas of the project. This level of granularity must bebalanced with having too many mailing lists that makes itdifÞcult to Þnd the appropriate forum for discussion. How-ever, when the right balance is achieved, this allows partici-pants to easily partition discussion based upon agreedtopical lines.

One concern for distributed software development isthat a set of standards is required in order to ease partici-pants shifting from one project to another. This may mani-fest itself as a common vocabulary shared betweenprojects. If participants do not share a common language, itbecomes hard to communicate effectively. The creation ofstandards and accepted best practices can help ease migra-tion between projects.

A common problem in a distributed software project isunderstanding what other participants are currently work-ing on. The creation of tools to promote awareness betweendevelopers can address this need. Furthermore, tools thatpromote capturing of historical rationale may make it eas-ier for new participants to enter a project.

Table 1: Summary and Avenues for Enhancements

Issue Techniques Project Exemplar Avenues for Enhancements

Decision-Making Project leader, meritocracy Linux, Apache Understanding consequences

AccountabilityFor-proÞt support,non-proÞt ownership

PostgreSQL, FreeBSD Introducing clarity

Communication Discussion lists, asynchronous All Balancing granularity

AwarenessFrequent status updates,Discussion before implementation

Apache Creating better tools

Historical RationaleArchival of communications,design documents

Perl Creating better tools

Design Rationale Developer-centric docs, examples AbiWord Enforcing synchronization

Participation Clear tutorials, guidelines KDE Creating standards

Controlling Part. Feedback, annotating contributions Python Integrating into other processes

Source CodePublic source repository,optimistic conßict resolution

All Investigating decentralization

Issue Tracking Soliciting developer assistance Mozilla Creating easy-to-use tools

Documentation Distinct personnel, annotations PHP Separation of code and docs

Testing Code reviews, automated tests Subversion Optimizing test executions

Release Management Mirroring, versioning Debian Managing distributions

Another area for tool improvement is introducing a wayto capture the rationale for a decision in the documentation.Currently, it is hard to identify why a particular change ismade. The artifacts for determining this may not be central-ized. Creating a tool that indicates relationships betweenartifacts to encourage rationale understanding may be criti-cal.

The current tools for controlling participation are ad hocand not well integrated. This makes it difÞcult to lower theburden upon the participants in a project in dealing with thecontributions by casual participants. If the tools for han-dling contributions were better integrated into the standardprocesses, it would make this task signiÞcantly easier.

7. Acknowledgments

This material is based upon work supported by theNational Science Foundation under Grant No. 0205724.

8. References

[1]

Arch - Revision Control System

. <http://arch.Þfth-vision.net/>, HTML, 2003.

[2] Apache Software Foundation.

The Jakarta Site -Apache Tomcat

. <http://jakarta.apache.org/tomcat/>,HTML, 2003.

[3] Bernstein, D.J.

ezmlm

. <http://cr.yp.to/ezmlm.html>, HTML, 2000.

[4] Bitmover.

BitKeeper

. <http://www.bitkeeper.com/>, HTML, 2003.

[5] CollabNet.

Subversion

. <http://subver-sion.tigris.org/>, HTML, 2003.

[6] Dinkelacker, J., Garg, P.K., Miller, R., and Nel-son, D. Progressive Open Source. In

Proceedings of theInternational Conference on Software Engineering (ICSE).

p. 177-184, 2002. [7] Fielding, R., Whitehead, E.J., Anderson, K.,

Oreizy, P., Bolcer, G.A., and Taylor, R.N. Web-basedDevelopment of Complex Information Products

. Commu-nications of the ACM.

41(8), p. 84-92, 1998. [8] Fielding, R.T. and Kaiser, G. The Apache HTTP

Server Project

. IEEE Internet Computing.

1(4), p. 88-90,1997.

[9] Fielding, R.T. Shared Leadership in the ApacheProject

. Communications of the ACM.

42(4), p. 42-43,1999.

[10] Free Software Foundation.

Mailman

. <http://www.list.org/>, HTML, 2003.

[11] Fried, I. Sales Increase for U.S. Linux Servers.

CNet News.com

. February 10, 2003. <http://news.com.com/2100-1001-984010.html>.

[12] Halloran, T.J. and Scherlis, W.L. High Quality andOpen Source Software Practices. In

Proceedings of theMeeting Challenges and Surviving Success: 2nd Workshopon Open Source Software Engineering.

May, 2002. [13] Harrold, M.J., Jones, J.A., Li, T., Liang, D., Orso,

A., Pennings, M., Sinha, S., Spoon, S.A., and Gujarathi, A.Regression Test Selection for Java Software. In

Proceed-

ings of the

ACM Conference on OO Programming, Sys-tems, Languages, and Applications (OOPSLA 2001).

p.312-326, Tampa, Florida, October, 2001.

[14] Heesch, D.v.

Doxygen

. <http://www.doxygen.org/>, HTML, 2003.

[15] Herbsleb, J.D., Mockus, A., Finholt, T.A., andGrinter, R.E. An Empirical Study of Global SoftwareDevelopment: Distance and Speed. In

Proceedings of theInternational Conference on Software Engineering (ICSE).

p. 81-90, 2001. [16] Hoek, A.v.d. ConÞguration Management and

Open Source Projects. In

Proceedings of the

3rd Interna-tional Workshop on Software Engineering over the Inter-net.

Limerick, Ireland, June 6, 2000. [17] Howard, J. The BSD Family Tree

. Daemon News.

April, 2001. <http://www.daemonnews.org/200104/bsd_family.html>.

[18] KDE e.V.

Developer's Corner

. <http://devel-oper.kde.org/>, HTML, 2003.

[19] Maletic, J.I. and Marcus, A. Supporting ProgramComprehension Using Semantic and Structural Informa-tion. In

Proceedings of the

23rd International Conferenceon Software Engineering.

p. 103-112, Toronto, Ontario,Canada, May, 2001.

[20] Netcraft.

Netcraft Web Server Survey

. <http://www.netcraft.com/survey/>, HTML, 2003.

[21] Python Software Foundation.

Patch Manager

.<http://sourceforge.net/tracker/?group_id=5470>, HTML,2003.

[22] Raymond, E.S.

The Cathedral & the Bazaar:Musings on Linux and Open Source by an Accidental Revo-lutionary

. O'Reilly, 2001.[23] Software in the Public Interest.

Debian Mirrors

.<http://www.debian.org/mirror/>, HTML, 2003.

[24] Software in the Public Interest.

Debian Releases

.<http://www.debian.org/releases/>, HTML, 2003.

[25] SourceGear Corporation.

AbiWord Documenta-tion

. <http://www.abisource.com/doxygen/>, HTML,2003.

[26] The Apache HTTP Server Project. Apache HTTPServer Project Guidelines and Voting Rules. <http://httpd.apache.org/dev/guidelines.html>, HTML, 2003.

[27] The Mozilla Organization.

Bugzilla ProjectHomepage

. <http://www.bugzilla.org/>, HTML, 2003.[28] The Perl Foundation.

Parrot and Perl6 PDDs

.<http://dev.perl.org/perl6/pdd/>, HTML, 2003.

[29] The PHP Group.

PHP: Documentation

. <http://www.php.net/docs.php>, HTML, 2003.

[30] Ye, Y. and Fischer, G. Supporting Reuse by Deliv-ering Task-Relevant and Personalized Information. In

Pro-ceedings of the

24th International Conference on SoftwareEngineering.

p. 513-523, Orlando, Florida, May, 2002.

Documents

Institute for Software Researchisr.uci.edu/tech_reports/UCI-ISR-03-4.pdfproject supports 2.5 million lines of code (Linux). Each project has independently chosen the tools and processes