Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science Foundation – Digital Science & Technology Yongqin Gao

Topology and Evolution of the Open Source Software Communityoss/Papers/gao_thesis_defense.pdf · Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent


Topology and Evolution of the Open Source Software Community

Advisors:

Dr. Vincent W. Freeh

Dr. Kevin Bowyer

Supported in part by the National Science Foundation – Digital Science & Technology

Yongqin Gao

Outline

→ Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusions

Overview (about OSS)

• What is OSS

– Free to use, free to distribute

– Unlimited user and usage

– Source code available and modifiable

• Potential advantages over commercial software

– Higher quality

– Faster development

– Lower cost

– Transparent

Overview (about our research)

• Our goal

– Understanding the OSS phenomenon

• Approach

– SourceForge is the source of our empirical data

– Modeling as a social network

– Analysis of topological statistics

– Use simulation to verify and validate the model

Outline

• Overview

→ Data collection

• Network modeling

• Topological statistical analysis

• Simulations

• Publications

• Conclusions

Data Collection — Monthly

• Web crawler (scripts)

– Python

– Shell

– AWK

– Sed

• Monthly

• Since Jan 2001

• ProjectID

• DeveloperID

• Almost 2 million records

• Relational database

PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
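As a rough illustration of how such project/developer records could be loaded for analysis, the sketch below parses the pipe-delimited sample from this slide into two lookup tables. The function name `load_memberships` and the in-memory dictionaries are assumptions made for the example; the slides only state that the records were stored in a relational database.

```python
from collections import defaultdict

# Sample rows as shown on the slide; the first line is the header.
SAMPLE = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""

def load_memberships(text):
    """Hypothetical helper: return (project -> developers, developer -> projects)."""
    projects = defaultdict(set)
    developers = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the PROJ|DEVELOPER header row
        proj, dev = line.split("|")
        projects[proj].add(dev)
        developers[dev].add(proj)
    return projects, developers

projects, developers = load_memberships(SAMPLE)
```

In this sample, `dev8972` appears under both project 8001 and project 8008, which is the kind of shared membership that links projects together in the network model.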

Outline

• Overview

• Data collection

→ Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusions

Modeling as Collaboration Network

• What is a collaboration network?

– A social network representing collaborative relationships

– Examples: the movie actor network and the scientist collaboration network

• Differences of the SourceForge collaboration network

– Link detachment

– Virtual collaboration

– Voluntary

– Global

• Bipartite property of collaboration networks

Collaboration network - bipartite

Adapted from Newman, Strogatz and Watts, 2001

SourceForge Developer Network

15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]

[Figure: developer network diagram with developer nodes grouped under Project 6882, Project 7028, Project 7597, Project 9859 and Project 15850]

OSS Developer Network (Part)

Developers are nodes / Projects are links

24 Developers

5 Projects

2 hub Developers

1 Cluster
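The "developers are nodes, projects are links" view can be obtained by projecting the bipartite membership data onto developers. The following is a minimal sketch under that reading; `developer_network` and the toy membership data are hypothetical and not taken from the thesis code.

```python
from collections import defaultdict
from itertools import combinations

def developer_network(projects):
    """Project a {project: set of developers} mapping onto a developer graph.

    Two developers are linked if they share at least one project.
    """
    graph = defaultdict(set)
    for members in projects.values():
        for dev in members:
            graph[dev]  # touch the key so solo developers still appear as nodes
        for a, b in combinations(sorted(members), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)

# Toy data (illustrative only): three projects, five developers.
toy_projects = {"p1": {"a", "b", "c"}, "p2": {"c", "d"}, "p3": {"e"}}
toy_net = developer_network(toy_projects)
```

Swapping the roles of the two node types (projects as nodes, shared developers as links) gives the dual project network in the same way.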

Outline

• Overview

• Data collection

• Network modeling

→ Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusion

Topological Analysis

• Statistics inspected

– Diameter

– Average degree

– Clustering coefficient

– Degree distribution

– Cluster size distribution

– Relative size of major cluster

– Fitness and life cycle

• Evolution of these statistics

• Dual networks

– Developer network and project network

Terminology

• Diameter

– Average length of shortest paths between all pairs of vertices

• Degree

– The count of edges connected to a given vertex

• Average degree

– Average of the degrees of all vertices in the network

• Cluster

– A connected component of the network

• Clustering coefficient (CC)

– CCi: the number of links actually present among the vertices in the neighborhood of vertex i, as a fraction of the total possible number of links among them

– CC: average of all CCi in a network

• Degree distribution

– The distribution of degrees throughout a network

• Major cluster

– The largest cluster in the network
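To make the definitions above concrete, here is a small exact computation on an adjacency-set graph. It is only a sketch for small graphs: the function names are hypothetical, pairs in different clusters are skipped when averaging path lengths, and vertices with fewer than two neighbors are given CCi = 0, both of which are assumptions of this example rather than choices documented in the slides.

```python
from collections import deque

def average_degree(graph):
    """Mean number of neighbors per vertex."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def average_path_length(graph):
    """Mean shortest-path length over connected pairs, via BFS from each vertex."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable vertices other than the source
    return total / pairs if pairs else 0.0

def clustering_coefficient(graph):
    """Average of CCi over all vertices (CCi = 0 when degree < 2)."""
    values = []
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            values.append(0.0)
            continue
        # Each link among the neighbors is seen twice, once from each end.
        links = sum(1 for a in nbrs for b in graph[a] if b in nbrs) / 2
        values.append(links / (k * (k - 1) / 2))
    return sum(values) / len(values)

# Toy graph: a triangle a-b-c with one extra vertex d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

Running BFS from every vertex is quadratic in the worst case, so for networks with tens of thousands of vertices a sampled or approximate variant would likely be needed.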

Diameter of Developer Network vs. Time

• Network size increased from 30,000 to 70,000

Diameter of Project Network vs. Time

• Network size increased from 20,000 to 50,000.

• Diameter decreases with time for both the developer network and the project network

Clustering Coefficient of Developer Network vs. Time

Clustering Coefficient of Project Network vs. Time

Degree Distribution (developers)

Degree Distribution (projects)

Cluster Size Distribution

• R² with major cluster is 0.7426

• R² without major cluster is 0.9799
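One common way to obtain an R² like the ones quoted here is a least-squares line fitted to the log-log plot of cluster size against count. The slides do not say how the fit was done, so the sketch below is an assumption, and `power_law_fit` and the synthetic data are hypothetical.

```python
import math

def power_law_fit(sizes, counts):
    """Fit log10(count) = intercept + slope * log10(size); return (slope, R^2)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot

# Synthetic data following count = 1000 / size^2 exactly.
sizes = [1, 2, 4, 8]
counts = [1000 / s ** 2 for s in sizes]
slope_clean, r2_clean = power_law_fit(sizes, counts)

# Adding one very large cluster (size 1000, count 1) pulls the fit off the line.
slope_major, r2_major = power_law_fit(sizes + [1000], counts + [1])
```

The second fit shows the same qualitative effect as on the slide: a single oversized cluster that does not follow the trend lowers R² noticeably.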

Relative Size of Major Cluster vs. Time

• The relative size of the major cluster increases

• The rate of increase is decreasing

• May be an indication of the network evolution
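The relative size of the major cluster can be computed by finding all connected components and dividing the largest by the total number of vertices. This is a minimal sketch with hypothetical function names and toy data; applying it to each monthly snapshot would give a curve like the one on this slide.

```python
def cluster_sizes(graph):
    """Sizes of all connected components, largest first (iterative DFS)."""
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)

def relative_major_cluster_size(graph):
    """Fraction of all vertices that belong to the largest cluster."""
    sizes = cluster_sizes(graph)
    return sizes[0] / sum(sizes)

# Toy graph with clusters of size 3, 2 and 1.
toy_clusters = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
```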

Existence of Fitness

• Investigating the development of a single project can verify the existence of the “newcomer” phenomenon

• We tracked the development of every project that was new in July 2001 until now (1,660 projects in total)

• Maximal monthly growth per project is 13, while average monthly growth per project is just 0.3639
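The maximal and average monthly growth figures could be derived from consecutive monthly snapshots along the following lines. This sketch assumes that "growth" means the change in a project's developer count from one month to the next, which the slide does not state explicitly; `monthly_growth` and the snapshot data are hypothetical.

```python
def monthly_growth(snapshots):
    """snapshots: list of {project: developer_count}, one per month, oldest first.

    Returns (maximal monthly growth, average monthly growth) over all
    project-months where the project exists in both consecutive snapshots.
    """
    deltas = []
    for before, after in zip(snapshots, snapshots[1:]):
        for project, count in after.items():
            if project in before:
                deltas.append(count - before[project])
    return max(deltas), sum(deltas) / len(deltas)

# Toy data: two projects tracked over three months.
toy_snapshots = [
    {"p1": 1, "p2": 2},
    {"p1": 4, "p2": 2},
    {"p1": 5, "p2": 3},
]
max_growth, avg_growth = monthly_growth(toy_snapshots)
```

A large gap between the maximum and the average, as reported on the slide, suggests that a few projects attract developers far faster than the rest.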

Life Cycle of Project

Summary

Summary of Results

• Power law rules

– Degree distributions, cluster distribution

• Average degree increasing with time

• Diameter decreasing with time

• Clustering coefficient decreasing with time

• Fitness existed in SourceForge

• Projects have life cycle behaviors

Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

→ Simulations

• Publications

• Conclusion

Conceptual Framework

[Figure: conceptual framework diagram. Labels: Empirical data, Description, Model, Simulation, Characterization, Generation, Verification, Validation, Adjustment]

Agent-based Modeling

• EBM vs. ABM

– Heterogeneous individuals

– Complex network

• Experiment environment

– Hardware: computer cluster

– Software:

  • Simulation toolkit: Swarm

  • Database: Oracle

  • Language: Java, PL/SQL

Model for SourceForge

• ABM based on bipartite graph

• Model description

– Agent: developer

– Behaviors: create, join, abandon and idle

– Preference: developer’s and project’s

– Fitness

• Four models in iterations

– ER, BA, BA with constant fitness and BA with dynamic fitness

• Comparison of empirical and simulated data
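The slides do not spell out the attachment rules, so the following is only a sketch of how these model variants are commonly distinguished when a developer agent picks a project to join: uniform choice (ER-style), choice proportional to current project size (BA-style preferential attachment), and size weighted by a per-project fitness. `join_weights`, `choose_project` and the dictionaries are hypothetical names introduced for illustration.

```python
import random

def join_weights(sizes, fitness=None, preferential=True):
    """Unnormalized probability of joining each project.

    sizes:        {project: current number of developers}
    fitness:      optional {project: fitness factor}
    preferential: False gives uniform (ER-style) weights,
                  True gives size-proportional (BA-style) weights.
    """
    weights = {}
    for project, size in sizes.items():
        w = float(size) if preferential else 1.0
        if fitness is not None:
            w *= fitness[project]
        weights[project] = w
    return weights

def choose_project(sizes, fitness=None, preferential=True, rng=random):
    """Draw one project according to the weights above."""
    weights = join_weights(sizes, fitness, preferential)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

toy_sizes = {"a": 1, "b": 3}
er_weights = join_weights(toy_sizes, preferential=False)
ba_weights = join_weights(toy_sizes)
fit_weights = join_weights(toy_sizes, fitness={"a": 2.0, "b": 0.5})
```

With fitness in play, a small project with high fitness can outweigh its size disadvantage, which is one way a newcomer project could grow quickly.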

ER Model - Diameter

• Average degree is decreasing, while it is increasing in empirical data

• Diameter is increasing, while it is decreasing in empirical data

ER Model – Clustering Coefficient

• Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in empirical data.

ER Model – Degree Distribution

• Degree distribution is a normal distribution, while it is a power law in empirical data

ER Model – Cluster Size Distribution

• Power law distribution with R² of 0.6667 (0.9653 without the major cluster), while R² in empirical data is 0.7426 (0.9799 without the major cluster)

• The actual distribution is different from empirical data

BA Model – Diameter and Clustering Coefficient

• Small diameter and high clustering coefficient, like empirical data

• Diameter and clustering coefficient are both decreasing, like empirical data

BA Model – Degree Distribution

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9714.

• For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9838.

BA Model with Constant Fitness

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9714.

• For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9838.

BA Model with Dynamic Fitness

• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9714.

• For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9838.

Advantage of Dynamic Fitness

• Intuition: fitness should decrease with time.

• Statistics: projects show life cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
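As an illustration of the intuition that fitness decreases with time, the sketch below uses an exponential decay with a half-life. The decay form, the 12-month half-life, and the name `dynamic_fitness` are all assumptions of this example; the slides do not specify how dynamic fitness was actually defined in the model.

```python
def dynamic_fitness(initial, age_months, half_life=12.0):
    """Hypothetical fitness that halves every `half_life` months of project age."""
    return initial * 0.5 ** (age_months / half_life)

# Fitness of a project that starts at 1.0, sampled at a few ages.
ages = [0, 6, 12, 24]
trajectory = [dynamic_fitness(1.0, age) for age in ages]
```

With a constant fitness, an attractive project stays equally attractive forever; a decaying fitness lets a project grow quickly at first and then level off, which is the life cycle shape the constant-fitness model cannot reproduce.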

Summary

Summary of Results

• We use ABM to model and simulate the SourceForge collaboration network.

• A conceptual framework is proposed for agent-based modeling and simulation.

• Case study of this framework: the SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.

Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

→ Publications

• Conclusion

Publications To-date

• Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003.

• Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh.

• Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh.

• Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.

Possible Journals

• Chapter 3

– Physica A: Statistical Mechanics and its Applications

– Journal of Social Structure (JSS)

• Chapter 4

– Journal of Artificial Societies and Social Simulation (JASSS)

– Journal of Statistical Computation and Simulation (JSCS)

Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

→ Conclusion

Conclusion

• Study of the SourceForge collaboration network can help us understand the OSS community

• We investigate not only the topological statistics but also the evolution of these statistics.

• Simulation is used to investigate the SourceForge collaboration network.

Contribution

• Statistical study of the SourceForge community (snapshot and evolution)

• Verification of the approximate method to calculate the diameter and CC

• Proposal of a model for the SourceForge community

• Improvement of the BA model with dynamic fitness

Future Work

• Data collection

– Database dump from SourceForge (PostgreSQL 8GB)

– All the possible attributes

– Database schema in UML

• More topology analysis (with more attributes)

– Discussion forum

– Task assignment

– Project management

– Active testing

• Behavior-based analysis

– Interaction between agents

– H. Peyton Young’s model

• Information entropy analysis

Acknowledgements

• Committee

• Advisors

• Colleagues

• SourceForge

• NSF

• Others

Thank you