Topology and Evolution of the Open Source Software Community
Advisors:
Dr. Vincent W. Freeh
Dr. Kevin Bowyer
Supported in part by the National Science Foundation – Digital Science & Technology
Yongqin Gao
Outline
➢ Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusions
Overview (about OSS)
• What is OSS
– Free to use, free to distribute
– Unlimited user and usage
– Source code available and modifiable
• Potential advantages over commercial software
– Higher quality
– Faster development
– Lower cost
– Transparent
Overview (about our research)
• Our goal
– Understanding the OSS phenomenon
• Approach
– SourceForge is the source of our empirical data
– Modeling as a social network
– Analysis of topological statistics
– Use simulation to verify and validate the model
Outline
• Overview
➢ Data collection
• Network modeling
• Topological statistical analysis
• Simulations
• Publications
• Conclusions
Data Collection — Monthly
• Web crawler (scripts)
– Python
– Shell
– AWK
– Sed
• Monthly
• Since Jan 2001
• ProjectID
• DeveloperID
• Almost 2 million records
• Relational database
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
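As a rough illustration of how records in this `PROJ|DEVELOPER` layout could be grouped for later network analysis, here is a minimal Python sketch. The function name `load_memberships` and the in-memory sample string are assumptions made for illustration; the crawler scripts and the database schema used in the study are not shown in these slides.

```python
from collections import defaultdict

# Sample rows in the PROJ|DEVELOPER layout shown on the slide.
SAMPLE = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""

def load_memberships(text):
    """Parse pipe-separated project/developer rows into two lookup tables."""
    devs_by_project = defaultdict(set)
    projects_by_dev = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        project_id, developer_id = line.split("|")
        devs_by_project[project_id].add(developer_id)
        projects_by_dev[developer_id].add(project_id)
    return devs_by_project, projects_by_dev

devs_by_project, projects_by_dev = load_memberships(SAMPLE)
print(sorted(devs_by_project["8001"]))    # developers on project 8001
print(sorted(projects_by_dev["dev8972"]))  # projects of developer dev8972
```

Keeping both lookup directions makes it cheap to answer "who works on this project" and "which projects does this developer belong to", which is the bipartite view used in the network modeling that follows.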
Outline
• Overview
• Data collection
➢ Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusions
Modeling as Collaboration Network
• What is a collaboration network?
– A social network representing the collaborating relationships.
– Examples: movie actor network and scientist collaboration network
• Differences of the SourceForge collaboration network
– Link detachment
– Virtual collaboration
– Voluntary
– Global
• Bipartite property of collaboration networks
Collaboration network - bipartite
Adapted from Newman, Strogatz and Watts, 2001
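The bipartite property means every membership links a developer to a project, and the two one-mode networks (developer network and project network) are projections of that bipartite graph. The following Python sketch shows one way to compute both projections; the names `project_bipartite` and `invert` and the toy data are illustrative assumptions, not the implementation used in the study.

```python
from collections import defaultdict
from itertools import combinations

def project_bipartite(members_by_group):
    """One-mode projection: link every pair of members that share a group."""
    neighbors = defaultdict(set)
    for members in members_by_group.values():
        for member in members:
            neighbors[member]  # make sure isolated members still appear
        for a, b in combinations(sorted(members), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)

def invert(members_by_group):
    """Swap the roles of the two node types (project->devs becomes dev->projects)."""
    groups_by_member = defaultdict(set)
    for group, members in members_by_group.items():
        for member in members:
            groups_by_member[member].add(group)
    return dict(groups_by_member)

# Hypothetical memberships: project id -> set of developers.
devs_by_project = {
    "p1": {"alice", "bob"},
    "p2": {"bob", "carol", "dave"},
    "p3": {"erin"},
}

developer_net = project_bipartite(devs_by_project)        # developers are nodes
project_net = project_bipartite(invert(devs_by_project))  # projects are nodes
print(developer_net)
print(project_net)
```

In the developer network two developers are linked when they share a project; in the project network two projects are linked when they share a developer.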
SourceForge Developer Network
PROJ DEVELOPER DEVELOPER
15850 dev[46] dev[83]
15850 dev[46] dev[48]
15850 dev[46] dev[56]
15850 dev[46] dev[58]
6882 dev[58] dev[47]
6882 dev[47] dev[79]
6882 dev[47] dev[52]
6882 dev[47] dev[55]
7028 dev[46] dev[99]
7028 dev[46] dev[51]
7028 dev[46] dev[57]
7597 dev[46] dev[45]
7597 dev[46] dev[72]
7597 dev[46] dev[55]
7597 dev[46] dev[58]
7597 dev[46] dev[61]
7597 dev[46] dev[64]
7597 dev[46] dev[67]
7597 dev[46] dev[70]
9859 dev[46] dev[49]
9859 dev[46] dev[53]
9859 dev[46] dev[54]
9859 dev[46] dev[59]
[Figure: graph of the developer nodes, with edges grouped under the labels Project 6882, Project 9859, Project 7597, Project 7028 and Project 15850]
OSS Developer Network (Part)
Developers are nodes / Projects are links
24 Developers, 5 Projects
2 hub Developers, 1 Cluster
Outline
• Overview
• Data collection
• Network modeling
➢ Topological statistical analysis (real data)
• Simulations
• Publications
• Conclusion
Topological Analysis
• Statistics inspected
– Diameter
– Average degree
– Clustering coefficient
– Degree distribution
– Cluster size distribution
– Relative size of major cluster
– Fitness and life cycle
• Evolution of these statistics
• Dual networks
– Developer network and project network
Terminology
• Diameter
– Average length of the shortest paths between all pairs of vertices
• Degree
– The number of edges connected to a given vertex
• Average degree
– Average of the degrees of all vertices in the network
• Cluster
– A connected component of the network
• Clustering coefficient (CC)
– CCi: for vertex i, the number of links actually present among the vertices in its neighborhood, as a fraction of the total possible number of such links
– CC: average of all CCi in a network
• Degree distribution
– The distribution of degrees throughout a network
• Major cluster
– The largest cluster in the network
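The definitions above translate fairly directly into code. This is a minimal Python sketch on a toy graph stored as an adjacency dictionary; the function names are assumptions, and treating vertices with fewer than two neighbors as CCi = 0 is one common convention that the slides do not specify.

```python
from itertools import combinations

def average_degree(graph):
    """Mean number of edges per vertex."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def local_cc(graph, v):
    """CCi: links present among v's neighbors / links possible among them."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with < 2 neighbors
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return links / (k * (k - 1) / 2)

def clustering_coefficient(graph):
    """CC: average of CCi over all vertices."""
    return sum(local_cc(graph, v) for v in graph) / len(graph)

def clusters(graph):
    """Connected components, largest first (the first one is the major cluster)."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            component.add(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(component)
    return sorted(result, key=len, reverse=True)

# Toy undirected graph: a triangle a-b-c, a pendant vertex d, an isolated vertex e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
major = clusters(graph)[0]
print(average_degree(graph), clustering_coefficient(graph), len(major) / len(graph))
```

The last printed value is the relative size of the major cluster, i.e. the share of all vertices that lie in the largest connected component.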
Diameter of Developer Network vs. Time
• Network size increased from 30,000 to 70,000
Diameter of Project Network vs. Time
• Network size increased from 20,000 to 50,000.
• Diameter decreasing with time for both the developer network and the project network
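Since this talk defines diameter as the average shortest-path length, it can be computed with one breadth-first search per source vertex. The Python sketch below is an illustration under that definition; the optional `sources` parameter, which restricts the search to a subset of start vertices, is only one possible way to approximate the value on a large network, and the slides do not say which approximation the study used.

```python
from collections import deque

def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def average_path_length(graph, sources=None):
    """Mean shortest-path length over reachable (source, target) pairs.

    If `sources` is given, only those start vertices are used, which
    gives an estimate instead of the exact value.
    """
    total, pairs = 0, 0
    for s in (graph if sources is None else sources):
        for t, d in bfs_distances(graph, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy path graph 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(average_path_length(path))               # exact, all sources
print(average_path_length(path, sources=[1]))  # estimate from one source
```

Pairs in different clusters are unreachable and are simply skipped here; whether the study handled disconnected pairs this way is not stated in the slides.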
Clustering Coefficient of Developer Network vs. Time
Clustering Coefficient of Project Network vs. Time
Degree Distribution (developers)
Degree Distribution (projects)
Cluster Size Distribution
• R² with major cluster is 0.7426
• R² without major cluster is 0.9799
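The R² values quoted for the cluster size distribution are presumably goodness-of-fit figures for a straight line on a log-log plot, which is a common way to check for a power law. The Python sketch below shows that calculation on synthetic data; the function name `loglog_fit` and the numbers are assumptions for illustration, not the study's data or fitting procedure.

```python
import math

def loglog_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept, R^2."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic cluster sizes and counts that follow count = 1000 * size^-2 exactly.
sizes = [1, 2, 4, 8, 16]
counts = [1000 * s ** -2 for s in sizes]
slope, _, r2 = loglog_fit(sizes, counts)

# Adding one very large cluster (a stand-in for the major cluster) as an outlier.
_, _, r2_with_major = loglog_fit(sizes + [10000], counts + [1])
print(slope, r2, r2_with_major)
```

A single huge cluster sits far off the fitted line, so including it lowers R², which is the same pattern as the gap between the two R² values reported above.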
Relative Size of Major Cluster vs. Time
• Increase of the relative size of the major cluster
• The rate of increase is decreasing
• May be an indication of the network evolution
Existence of Fitness
• Investigating the development of single projects can verify the existence of the “newcomer” phenomenon
• We tracked the development of every project that was new in July 2001 until now (1,660 projects in total)
• Maximal monthly growth per project is 13, while average monthly growth per project is just 0.3639
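To make the growth statistics concrete, here is a small Python sketch that derives maximal and average monthly growth from monthly snapshots. It assumes growth is measured as the change in team size between consecutive months, which the slides do not state explicitly, and the project histories are made-up values rather than SourceForge data.

```python
def monthly_growth(sizes_by_month):
    """Differences between consecutive monthly team sizes for one project."""
    return [b - a for a, b in zip(sizes_by_month, sizes_by_month[1:])]

# Hypothetical team sizes per monthly snapshot (not real SourceForge data).
history = {
    "proj_a": [1, 1, 2, 2],
    "proj_b": [1, 6, 9, 9],
    "proj_c": [2, 2, 2, 2],
}

growths = [g for sizes in history.values() for g in monthly_growth(sizes)]
max_growth = max(growths)
avg_growth = sum(growths) / len(growths)
print(max_growth, avg_growth)
```

A large gap between the maximum and the average, as in the figures above, suggests that a few projects attract developers much faster than most others.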
Life Cycle of Project
Summary
Summary of Results
• Power law rules
– Degree distributions, cluster distribution
• Average degree increasing with time
• Diameter decreasing with time
• Clustering coefficient decreasing with time
• Fitness exists in SourceForge
• Projects have life cycle behaviors
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
➢ Simulations
• Publications
• Conclusion
Conceptual Framework
[Diagram labels: Empirical data, Adjustment, Generation, Verification, Validation, Characterization, Description, Model, Simulation]
Agent-based Modeling
• EBM (equation-based modeling) vs. ABM (agent-based modeling)
– Heterogeneous individuals
– Complex network
• Experiment environment
– Hardware: computer cluster
– Software:
• Simulation toolkit: Swarm
• Database: Oracle
• Language: Java, PL/SQL
Model for SourceForge
• ABM based on bipartite graph
• Model description
– Agent: developer
– Behaviors: Create, join, abandon and idle
– Preference: developer’s and project’s
– Fitness
• Four models in iterations
– ER, BA, BA with constant fitness and BA with dynamic fitness
• Comparison of empirical and simulated data
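The model description above can be sketched as a small agent-based simulation. The Python version below is only an illustration: the study itself used Swarm with Java, and the behavior probabilities, the one-new-developer-per-step rule and the join weight `fitness * (team size + 1)` are all assumptions chosen to show BA-style preferential attachment with a constant per-project fitness.

```python
import random

def choose_project(projects, fitness, rng):
    """Pick a project with probability proportional to fitness * (team size + 1)."""
    ids = list(projects)
    weights = [fitness[p] * (len(projects[p]) + 1) for p in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

def simulate(steps, seed=0, p_create=0.05, p_join=0.15, p_abandon=0.05):
    """Run the toy model; returns project id -> set of developer ids."""
    rng = random.Random(seed)
    projects = {0: {0}}                  # start with one project, one developer
    fitness = {0: rng.random() + 0.01}   # constant fitness, drawn once per project
    developers = [0]
    for step in range(1, steps + 1):
        developers.append(step)          # a new developer agent enters each step
        for dev in developers:
            r = rng.random()
            if r < p_create:             # behavior: create a new project
                pid = len(projects)
                projects[pid] = {dev}
                fitness[pid] = rng.random() + 0.01
            elif r < p_create + p_join:  # behavior: join an existing project
                projects[choose_project(projects, fitness, rng)].add(dev)
            elif r < p_create + p_join + p_abandon:  # behavior: abandon a project
                mine = [p for p, team in projects.items() if dev in team]
                if mine:
                    projects[rng.choice(mine)].discard(dev)
            # otherwise the agent idles this step
    return projects

projects = simulate(50, seed=1)
team_sizes = sorted((len(team) for team in projects.values()), reverse=True)
print(len(projects), team_sizes[:5])
```

Replacing the weights in `choose_project` with equal values would give uniformly random joining, which loosely corresponds to the ER variant, while dropping the fitness factor would leave plain BA-style preferential attachment.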
ER Model - Diameter
• Average degree is decreasing, while it is increasing in empirical data
• Diameter is increasing, while it is decreasing in empirical data
ER Model – Clustering Coefficient
• Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in empirical data.
ER Model – Degree Distribution
• Degree distribution is a normal distribution, while it is a power law in empirical data
ER Model – Cluster Size Distribution
• Power law distribution with R² of 0.6667 (0.9653 without the major cluster), while R² in empirical data is 0.7426 (0.9799 without the major cluster)
• The actual distribution is different from empirical data
BA Model – Diameter and Clustering Coefficient
• Small diameter and high clustering coefficient, like empirical data
• Diameter and clustering coefficient are both decreasing, like empirical data
BA Model – Degree Distribution
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9714.
• For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9838.
BA Model with Constant Fitness
• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9714.
• For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9838.
BA Model with Dynamic Fitness
• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).
• For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9714.
• For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9838.
Advantage of Dynamic Fitness
• Intuition: Fitness should decrease with time.
• Statistics: projects have life cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
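The difference between constant and dynamic fitness can be illustrated with a small Python sketch. The exponential decay with a half-life in months and the join weight `fitness * (team size + 1)` are assumptions made for this example; the slides only state that fitness should decrease with time, not which functional form the study used.

```python
def constant_fitness(base, age_months):
    """Fitness that never changes with project age."""
    return base

def dynamic_fitness(base, age_months, half_life=6.0):
    """Fitness that halves every `half_life` months (assumed decay form)."""
    return base * 0.5 ** (age_months / half_life)

def join_weight(team_size, fitness):
    """Relative attractiveness of a project to a developer looking to join."""
    return fitness * (team_size + 1)

# An old, large project versus a brand-new, small one (hypothetical values).
old_constant = join_weight(20, constant_fitness(1.0, 36))
old_dynamic = join_weight(20, dynamic_fitness(1.0, 36))
new_project = join_weight(1, dynamic_fitness(1.0, 0))
print(old_constant, old_dynamic, new_project)
```

With constant fitness the old, large project keeps out-competing the newcomer indefinitely, while with decaying fitness its attractiveness eventually falls below that of the new project, which is the kind of life cycle behavior described above.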
Summary
Summary of Results
• We use ABM to model and simulate the SourceForge collaboration network.
• A conceptual framework is proposed for agent-based modeling and simulation.
• Case study of this framework: SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
➢ Publications
• Conclusion
Publications To-date
• Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003.
• Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh.
• Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh.
• Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.
Possible Journals
• Chapter 3
– Physica A: Statistical Mechanics and its Applications
– Journal of Social Structure (JSS)
• Chapter 4
– Journal of Artificial Societies and Social Simulation (JASSS)
– Journal of Statistical Computation and Simulation (JSCS)
Outline
• Overview
• Data collection
• Network modeling
• Topological statistical analysis (real data)
• Simulations
• Publications
➢ Conclusion
Conclusion
• Study of the SourceForge collaboration network can help us understand the OSS community
• We investigate not only the topological statistics but also the evolution of these statistics.
• Simulation is used to investigate the SourceForge collaboration network.
Contribution
• Statistical study of the SourceForge community (snapshot and evolution)
• Verification of the approximate method to calculate the diameter and CC
• Proposal of a model for the SourceForge community
• Extension of the BA model with dynamic fitness
Future Work
• Data collection
– Database dump from SourceForge (PostgreSQL, 8 GB)
– All the possible attributes
– Database schema in UML
• More topology analysis (with more attributes)
– Discussion forum
– Task assignment
– Project management
– Active testing
• Behavior-based analysis
– Interaction between agents
– H. Peyton Young’s model
• Information entropy analysis
Acknowledgements
• Committee
• Advisors
• Colleagues
• SourceForge
• NSF
• Others
Thank you