Supported in part by the National Science
Foundation – ISS/Digital Science & Technology
Analysis of the Open Source Software development community using ST mining: A Research Plan
Yongqin Gao, Greg Madey
Computer Science & Engineering
University of Notre Dame
NAACSOS Conference
Notre Dame, IN
June 26-28, 2005
Outline
• Background
• Motivation
• Problem definition
• Research data
• Methodology
• Conclusion
Background (OSS)
• What is OSS?
  • Free to use, modify and distribute
  • Source code available and modifiable
• Potential advantages over commercial software
  • Transparent and easy adoption
  • Fast development
  • Low cost
  • Potential high quality
• Why study OSS?
  • Software engineering: new development and coordination methods
  • Open content: model for other forms of open, shared collaboration
  • Complexity: successful example of self-organization/emergence
  • Growing popularity
  • Non-traditional governance and project management practices
  • Virtual --> Data!
Open Source Software (OSS)
• Free …
  • to view source
  • to modify
  • to share
  • of cost
• Examples
  • Apache
  • Perl
  • GNU
  • Linux
  • Sendmail
  • Python
  • KDE
  • GNOME
  • Mozilla
  • Thousands more
[Logos: Linux, GNU, Savannah]
Leaders
• Linus Torvalds: Linux
• Larry Wall: Perl
• Richard Stallman: GNU Manifesto
• Eric Raymond: The Cathedral and the Bazaar
Success of Apache
• Almost 70% market share (Netcraft.com)
Research Approach
[Diagram: research components and the information exchanged between them]
• Combined Data Mining: Understanding the Social and Task Dynamics that Predict Developer Behaviors
• Social Network Analysis: Longitudinal Study of Preferential Attachment and Dynamic Attachment
• Conceptual Explanatory Model of OSS: Agent-Based Modeling and Simulation
• Connector labels: Parameter Values, Structural Features, Cross Validation
Opportunity: Huge amounts of relatively good data
SourceForge.net
• VA Software
• Part of OSDN
• Started 12/1999
• Collaboration tools
• 100 K Projects
• 100 K Developers
• 1 M Registered Users
150 GBytes of Data & Growing
Project  Developer  Developer
15850    dev[46]    dev[83]
15850    dev[46]    dev[48]
15850    dev[46]    dev[56]
15850    dev[46]    dev[58]
6882     dev[58]    dev[47]
6882     dev[47]    dev[79]
6882     dev[47]    dev[52]
6882     dev[47]    dev[55]
7028     dev[46]    dev[99]
7028     dev[46]    dev[51]
7028     dev[46]    dev[57]
7597     dev[46]    dev[45]
7597     dev[46]    dev[72]
7597     dev[46]    dev[55]
7597     dev[46]    dev[58]
7597     dev[46]    dev[61]
7597     dev[46]    dev[64]
7597     dev[46]    dev[67]
7597     dev[46]    dev[70]
9859     dev[46]    dev[49]
9859     dev[46]    dev[53]
9859     dev[46]    dev[54]
9859     dev[46]    dev[59]
[Figure: network graph of these developers (node labels also include dev[65]), linked through Project 6882, Project 9859, Project 7597, Project 7028 and Project 15850]
OSS Developer - Social Network
Developers are nodes / Projects are links
• 24 Developers
• 5 Projects
• 2 Linchpin Developers
• 1 Cluster
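
The network slide can be explored with a short sketch. It builds an undirected graph from the project/developer rows that survive in this transcript (which may not be the complete figure) and looks for developers whose removal splits the cluster. Treating a "linchpin" as an articulation point is an assumption made here for illustration; the slides do not define the term. The function names are hypothetical.

```python
from collections import defaultdict, deque

# (project id, developer, developer) rows as they survive in this transcript.
EDGES = [
    (15850, 46, 83), (15850, 46, 48), (15850, 46, 56), (15850, 46, 58),
    (6882, 58, 47), (6882, 47, 79), (6882, 47, 52), (6882, 47, 55),
    (7028, 46, 99), (7028, 46, 51), (7028, 46, 57),
    (7597, 46, 45), (7597, 46, 72), (7597, 46, 55), (7597, 46, 58),
    (7597, 46, 61), (7597, 46, 64), (7597, 46, 67), (7597, 46, 70),
    (9859, 46, 49), (9859, 46, 53), (9859, 46, 54), (9859, 46, 59),
]


def build_graph(edges):
    """Developers are nodes; a shared project creates a link."""
    graph = defaultdict(set)
    for _project, a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def components(graph, skip=None):
    """Connected components via breadth-first search, optionally ignoring one node."""
    seen, parts = set(), []
    for start in graph:
        if start == skip or start in seen:
            continue
        seen.add(start)
        part, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt != skip and nxt not in seen:
                    seen.add(nxt)
                    part.add(nxt)
                    queue.append(nxt)
        parts.append(part)
    return parts


def linchpins(graph):
    """Assumed definition: nodes whose removal increases the component count."""
    base = len(components(graph))
    return {n for n in graph if len(components(graph, skip=n)) > base}


graph = build_graph(EDGES)
print(len(components(graph)), sorted(linchpins(graph)))
```

The brute-force removal test is quadratic in the number of developers, which is fine for a slide-sized example; a full SourceForge.net graph would call for a linear-time articulation-point algorithm.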
Scale free distribution: developer participation
# of projects   # of developers on that many projects
1 21488
2 3688
3 1086
4 413
5 177
6 76
7 35
8 21
9 9
10 6
11 5
12 6
15 1
16 1
17 1
y = 10.6905 - 3.70892 x
R^2 = 0.979906
[Plot: Log(# of Developers) versus Log(# of Projects)]
Scale Free – Power Law (developers)
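
A straight line on log-log axes is the usual signature of a power law, and the fit can be sketched from the participation table. This is a minimal sketch assuming natural logarithms and ordinary least squares over all rows of the table. The slides do not state the log base or which rows were fitted, so the coefficients it produces need not match the slide's values exactly. The names `PARTICIPATION` and `fit_log_log` are hypothetical.

```python
import math

# (# of projects, # of developers on that many projects), from the slide's table.
PARTICIPATION = [
    (1, 21488), (2, 3688), (3, 1086), (4, 413), (5, 177),
    (6, 76), (7, 35), (8, 21), (9, 9), (10, 6),
    (11, 5), (12, 6), (15, 1), (16, 1), (17, 1),
]


def fit_log_log(rows):
    """Least-squares fit of log(developers) = intercept + slope * log(projects)."""
    xs = [math.log(k) for k, _ in rows]
    ys = [math.log(n) for _, n in rows]
    count = len(rows)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = fit_log_log(PARTICIPATION)
print(f"y = {intercept:.4f} + ({slope:.4f}) x, R^2 = {r_squared:.4f}")
```

The magnitude of the negative slope is the estimated power-law exponent: the number of developers falls off roughly as (number of projects) raised to that slope.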
Scale free distribution: project sizes
Scale Free – Power Law (projects)
Background (DM)
• Characteristics of data set
  • Incomplete, noisy, redundant
  • Complex structures, unstructured
  • Heterogeneous
  • Database not designed for research, but to support project management services of SourceForge.net
  • Temporal data is available, but not everything a researcher would want
  • Inferencing/discovery of temporal data is a potentially valuable opportunity
• What is DM (Data mining)
  • Nontrivial extraction of implicit, previously unknown and potentially useful information from data.
Data Mining Procedure
[Diagram: Database, Data Integration, Raw data, Data Pre-processing, Relevant data, Feature selection, Algorithm application, Result Evaluation]
Spatial-temporal DM (1)
• Temporal data mining
  • Discover the behavior-based knowledge instead of state-based knowledge.
  • Example: many wolves -> fewer rabbits
  • Relationship between timely feedback and quality of software/success of the OSS project
Spatio-temporal DM
• New research domain: Spatio-temporal data mining
  • Growing interest in spatio-temporal data mining
    • Recommender systems
    • Location based services
    • Time based services
    • GIS applications
  • Extension of classic data mining techniques to data sets with spatial and temporal properties.
  • Challenges: complexity of spatial information and difficulty in reasoning about temporal information, e.g.,
    • Intervals
    • Points
    • Hybrids
Motivations
• Limitations of OSS research to date
  • Mostly feature based data mining to date
  • Neglecting of the inherent spatial and temporal information in the OSS community
• SourceForge.net properties
  • Spatial information
    • Collaboration network
  • Temporal information
    • History data and log tables
Spatial information in OSS?
• The collaboration network in SF
  • Study of the topology of the collaboration network.
  • The network can be mapped as a graph
  • This graph is a non-metric space
  • Spread of ideas (software engineering tools and practices, new project opportunities)
Temporal information in OSS
• The network is evolving and the histories of the site and individual entities comprise the temporal information in the network.
• Discrete time points
  • All the statistics are collected periodically.
• Partially ordered events
  • Multiple timelines exist in the system
[Diagram with nodes labeled a, b, c, d]
ST Mining
• Different from classic data mining
  • Spatial and temporal relationships are complicated
    • Metric and non-metric spatial relations
    • Temporal relations
  • Intrinsic dependency and heterogeneity
  • Scale effect in space and time
• Significant modification of many data mining techniques is needed.
Problem definition I
• Dependency analysis
  • Extension of associations to ST mining
• Complicated associations
  • Vertical (temporal) and horizontal (spatial) associations
  • Combination of vertical and horizontal associations
  • Examples: lag effects between projects
• Flexible associations
  • Huge volume and scale effect of spatial-temporal data sets introduce noise and error
  • Strict association is difficult to define
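
One simple way to look for a lag effect between two projects is to correlate one project's monthly series with a shifted copy of the other's. The sketch below uses Pearson correlation as a stand-in; the slides do not specify how associations are measured, so this is an illustrative assumption rather than the authors' method. The function names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a non-negative lag."""
    n = min(len(leader), len(follower) - lag)
    if lag < 0 or n < 2:
        raise ValueError("lag leaves fewer than two overlapping points")
    return pearson(leader[:n], follower[lag:lag + n])


def best_lag(leader, follower, max_lag):
    """Lag in 0..max_lag with the highest correlation."""
    return max(range(max_lag + 1),
               key=lambda k: lagged_correlation(leader, follower, k))


# Hypothetical monthly activity: project B repeats project A two months later.
project_a = [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
project_b = [0, 0] + project_a[:-2]
print(best_lag(project_a, project_b, max_lag=3))
```

Because real data is noisy, a strict threshold on the correlation is hard to justify, which is the point of the "flexible associations" bullet; in practice the best lag would be reported with its correlation rather than as a yes/no association.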
Problem definition II
• Topic of this study: prediction support
  • Clustering: group the projects with similar evolution.
  • Summarization: summarize the representative characteristics of different project evolution patterns
  • Prediction: predict the project evolution (based on the pattern discovered)
Research Data
• SourceForge.net database dump June 2005
  • 117 tables
  • Records up to 30 million per table
  • 23 Gigabytes
  • PostgreSQL
• Three types of tables
  • Data tables
  • History tables
  • Statistics tables
Methodology
• Project development statistics
  • Numerical statistics.
  • Expertise and survey statistics.
• Time series analysis
  • Generate the time series for these statistics
• Classification generation
  • ABN algorithm used
• Classifier evaluation
  • Evaluation by comparing the predicted class with the actual class
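
The evaluation step, comparing predicted classes with actual classes, can be sketched independently of the classifier. The example below computes overall accuracy and a confusion count. It does not implement ABN, and the class labels are hypothetical since the slides do not name the project classes.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion), where confusion counts (actual, predicted) pairs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Hypothetical class labels for four projects.
actual = ["growing", "stable", "declining", "growing"]
predicted = ["growing", "stable", "growing", "growing"]

accuracy, confusion = evaluate(actual, predicted)
print(accuracy)
for (a, p), count in sorted(confusion.items()):
    print(f"actual={a} predicted={p}: {count}")
```

The confusion counts show which classes are mistaken for which, something the single accuracy figure hides.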
Numerical statistics
• Statistics tables have the information about project history
• Stats_project_months
  • Every record stands for a monthly history of a single project
  • Records from November 1999 to June 2005
• There are 24 attributes in every record
  • Descriptive attributes (3)
  • Statistics (numeric) attributes (21)
• We use the statistics attributes
Statistics Attributes
Attributes         Attributes
Developers         Patches_opened
Downloads          Patches_closed
Subdomain_Views    Artifacts_opened
Page_views         Artifacts_closed
File_releases      Tasks_opened
Msg_posted         Tasks_closed
Bug_opened         Help_requests
Bug_closed         CVS_checkouts
Support_opened     CVS_commits
Support_closed     CVS_adds
Site_views
Expertise statistics
• Rating scores
  • Expertise rating
  • User rating
• Importance parameter
  • Domain importance
• Contribution parameter
Time Series
• Time series used to describe the history of each attribute.
  • Time series: an ordered sequence of values of a variable at equally spaced time intervals.
  • The available monthly values of each statistic are used to generate the time series.
• Goal is to study the project history patterns.
  • Description
  • Prediction
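
Turning monthly records into equally spaced series can be sketched as follows. The record layout (dictionaries with `project`, `year`, `month` keys plus one statistic) is a hypothetical simplification, not the actual Stats_project_months schema, and filling missing months with 0 is an assumption made so that every series has the same spacing and length.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_series(records, attribute, start, end, fill=0):
    """Map each project to a list of monthly values of one statistic."""
    by_project = {}
    for rec in records:
        key = (rec["year"], rec["month"])
        by_project.setdefault(rec["project"], {})[key] = rec[attribute]
    months = list(month_range(start, end))
    return {project: [values.get(m, fill) for m in months]
            for project, values in by_project.items()}


# Hypothetical records for one project; December 1999 and February 2000 are missing.
records = [
    {"project": 6882, "year": 1999, "month": 11, "downloads": 5},
    {"project": 6882, "year": 2000, "month": 1, "downloads": 9},
]
print(build_series(records, "downloads", (1999, 11), (2000, 2)))
```

Over the full range on the slides, November 1999 to June 2005, each series would have 68 monthly points.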
Conclusion
• Project prediction using ST mining
• We used statistics to predict the project development
• Calibration using new data is important to keep the prediction valid.
Questions