Victor ODUMUYIWA
Department of Computer Sciences,
University of Lagos
Nigeria
ISBN: 978-978-976-000-8
Copyright © 2019
This work is subject to copyright. All rights are reserved, whether the
whole or part of the material is concerned, specifically the rights of
translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known
or hereafter developed.
The responsibility for opinions expressed in articles, studies and other
contributions in this proceeding rests solely with their authors.
ISKO-West Africa
Transition from Observation to
Knowledge to Intelligence
3rd Biennial International Conference on Transition from Observation
to Knowledge to Intelligence (TOKI)
15-16 August 2019
University of Lagos, Nigeria
Editors
Dr. Victor ODUMUYIWA
Dr. Olufade ONIFADE
Prof. Amos DAVID
Prof. Charles UWADIA
Semantic Enabled Profile Recommender for Social
Coding Platform
ODUMUYIWA Victor T.
Department of Computer Sciences
University of Lagos, Nigeria.
OYEYEMI Olusoji
Department of Computer Sciences,
University of Lagos, Nigeria.
Abstract. Sourcing competent developers with the required experience and skill
level for a project is often a challenging and time-consuming exercise for project
managers. A major source is GitHub, a social coding platform that enhances
collaboration among developers and enables them to work efficiently on projects
and showcase their skills and experience. This work proposes an approach that
automates the sourcing process and recommends relevant developers with the right
skill set for new projects, based on features extracted from project readme files
and from developer activities on the social coding platform.
Keywords: Social Coding, Recommender System, GitHub
1. Introduction
The success of social media has introduced new ways of sharing
knowledge in different contexts via the Internet, and leveraging this
knowledge is an important and necessary skill for a professional
software developer (Zagalsky, 2013). Software developers face a
constantly changing set of programming languages, platforms and
technologies (Zagalsky, 2013). Software projects may involve
numerous technologies and/or platforms, not all of which a software
developer can be expected to master. The need to connect developers
across boundaries, so that they can share knowledge and collaborate on
projects, has led to the use of crowd-based development platforms for
software development.
In recent years, social coding platforms have become an important
tool for developers to become visible in the developer community,
share knowledge and collaborate on development work.
Developers use sites such as GitHub and BitBucket to showcase their
work. Many project managers and software development companies
now search social coding platforms to source developers with the skills
and experience required for their projects. To find potential
developers, project managers search the profiles of developers across
various crowd-based software development platforms and compare
their experience and activity history on past projects they have been
involved with. However, this is a very cumbersome process, as many
profiles are lengthy and mention a plethora of libraries and
techniques. Another complicating factor is that the writing style of a
profile is influenced by the person who creates it, so a
“semantic gap” may exist between the search terms and the profile text.
As a result, a project manager spends many hours searching through
developers' profiles to see whether they match the requirements
described for a new project. In such cases, insight into how well a
potential candidate's skills and experience fit the project
requirements may help the project manager judge whether or not to
recommend the project to the developer.
There is therefore a need for a way to automatically extract key
concepts from project readme files, since these contain a summary of
project fundamentals, and to combine them with selected key behavioural
variables stored in the developer's profile, which can then be used to
recommend the most qualified developers for new projects.
2. Experimental and Computational Details
2.1. Our Approach
An overview of our approach is shown in Figure 1.
Three main components are involved:
I. Semantic concept extraction from project readme files for
matching.
II. Developer profile modelling using behavioural data.
III. Developer recommendation based on profile ranking.
Figure 1: Semantic enabled profile recommender overview
2.2. Semantic Concepts Extraction from Project Readme Files for
Matching
This work uses DBpedia Spotlight (DBPedia, 2008), one of the most
commonly used open-source annotation toolkits for natural language
text, for concept extraction. It is based on two techniques from
natural language processing: Named Entity Recognition (NER) and
Named Entity Disambiguation (NED). As shown by Aggarwal and Zhai
(2012), combining the two techniques can be a powerful mechanism for
transforming natural language text into a structured representation
that machines can reason about. The large-scale DBpedia ontology
(Mendes, Jakob, Garcia-Silva, & Bizer, 2011) behind the DBpedia
Spotlight service is automatically derived from Wikipedia (based on
the English Wikipedia edition). At the time this research was
conducted, it contained more than 4.5 million entities (“things”)
and nearly 600 million links between them. Given that Wikipedia (and
by extension DBpedia) contains entries covering most of the available
programming languages, important programming frameworks and
libraries, as well as many computer science concepts, DBpedia is
considered a suitable ontology for this research work.
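The concept extraction step can be sketched as below. This is a minimal illustration rather than the authors' pipeline: the endpoint URL, the `confidence` parameter value and the `Resources` / `@URI` response fields are assumptions about the public DBpedia Spotlight web service, and the function names are hypothetical. The sample response is hand-written so that the parsing step can be tried offline.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint; a self-hosted Spotlight instance would use its own URL.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request asking the service to annotate `text` as JSON."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"},
    )


def extract_concepts(response):
    """Return the DBpedia URIs listed in a decoded JSON response."""
    return [resource["@URI"] for resource in response.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Call the service (needs network access) and return concept URIs."""
    with urllib.request.urlopen(build_request(text, confidence)) as reply:
        return extract_concepts(json.load(reply))


# Offline illustration with a hand-written response in the assumed shape.
sample = {"Resources": [
    {"@URI": "http://dbpedia.org/resource/Redis", "@surfaceForm": "Redis"},
    {"@URI": "http://dbpedia.org/resource/Python_(programming_language)",
     "@surfaceForm": "Python"},
]}
concepts = extract_concepts(sample)
```

Each returned URI identifies one DBpedia concept, so a readme file becomes a list of concept identifiers that can be counted and weighted.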
The concepts extracted from the past project readme files and from the
new project readme files are converted to vectors
$p_i = (w_{1,i}, w_{2,i}, \ldots, w_{n,i})$ and
$np_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$. Each dimension corresponds
to one DBpedia concept.
Not all concepts extracted from a project profile are equally
important. We use the TF.IDF weighting scheme (Baeza-Yates & Ribeiro-
Neto, 1999) to weight the concepts. TF.IDF gives a low weight to
concepts that appear in many documents, on the assumption that
such concepts are not very informative. In essence, this favours
concepts that occur rarely across the entire corpus of documents but
often within particular documents.
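A minimal sketch of this weighting is given below. It assumes raw term frequency and idf = log(N / df); the paper does not state which TF.IDF variant was used, and the concept lists are hypothetical stand-ins for DBpedia annotations of three readme files.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of concept lists.

    docs holds one list of concept identifiers per readme file.
    Weight = term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many readme files does each concept occur?
    df = Counter(concept for doc in docs for concept in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[c] * math.log(n_docs / df[c]) for c in vocab])
    return vocab, vectors


# Hypothetical concepts for three readme files.
readmes = [
    ["Python", "Database", "Redis", "Redis"],
    ["Python", "Database", "Bitcoin"],
    ["Python", "Scala", "Akka"],
]
vocab, vectors = tfidf_vectors(readmes)
```

In this toy corpus "Python" occurs in every readme file and so receives weight 0, while "Redis", which occurs twice in a single file, receives the largest weight in the first vector.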
Project similarity is computed using the vector space model, which is
well suited to natural language processing. This model provides a
natural mechanism for determining the similarity between the existing
project profile vectors and the new project profile vector. The cosine
of the angle between two vectors (called cosine similarity), which is
bounded in [0, 1] for non-negative weights such as TF.IDF, indicates
how similar the two vectors are: a larger score indicates higher
similarity, and a score tending to zero indicates dissimilarity. This
output is overlaid on the other behavioural data of the developer to
pick the most appropriate developers for the project.
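For reference, the standard definition of cosine similarity, written with the concept weights of a past project $i$ and a new project $j$, is:

```latex
\[
\cos(p_i, np_j) =
  \frac{\sum_{k=1}^{n} w_{k,i}\, w_{k,j}}
       {\sqrt{\sum_{k=1}^{n} w_{k,i}^{2}}\;\sqrt{\sum_{k=1}^{n} w_{k,j}^{2}}}
\]
```

As an illustrative calculation with made-up weights, the vectors (1, 2, 0) and (2, 1, 0) have dot product 4 and both have length $\sqrt{5}$, giving a cosine of 4/5 = 0.8.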
Because the cosine value indicates how similar two projects are, a
value close to 1 indicates closely similar projects, while a value
tending towards 0 indicates loosely similar ones. On this basis, a
threshold of 0.4 is set as the cutoff for the computed cosine value,
so projects with a cosine value below 0.4 are discarded.
Projects with a cosine value of 0.4 and above are selected, and all
developers who have participated in at least one of these projects are
extracted.
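The similarity cutoff and the extraction of candidate developers can be sketched as follows. The project names, vectors and developer names are hypothetical, and the data layout (a dictionary from project name to a vector and a set of developers) is an assumption made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty concept vector matches nothing
    return dot / (norm_a * norm_b)


def candidate_developers(new_vec, past_projects, threshold=0.4):
    """Collect developers from past projects similar to the new one.

    past_projects maps a project name to (concept_vector, developers).
    """
    developers = set()
    for name, (vec, devs) in past_projects.items():
        if cosine_similarity(new_vec, vec) >= threshold:
            developers.update(devs)
    return developers


# Hypothetical three-concept vectors and developer names.
past = {
    "proj_a": ([1.0, 2.0, 0.0], {"alice", "bob"}),
    "proj_b": ([0.0, 0.0, 3.0], {"carol"}),
    "proj_c": ([2.0, 1.0, 0.0], {"bob", "dave"}),
}
new_project = [1.0, 2.0, 0.0]
selected = candidate_developers(new_project, past)
```

Here `proj_b` shares no concept with the new project, so its cosine value is 0 and its developer is not carried forward to the ranking stage.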
2.3. Developer Profile Modeling using Behavioral Data
GitHub provides several user-based summary statistics including:
contributions in the last year, number of forked projects and number of
followers. However, according to Hauff and Gousios (2015), “the
usefulness of this information is very limited, as neither does it offer
immediate insights into the developer’s programming abilities nor does
it highlight the particular languages or tool chains the developer
knows”. In this work, we mine GitHub user profiles and project
requirements for relevant information that can enrich the user-based
summary statistics provided by GitHub.
The behavioural data we consider are:
1. Fork: a developer takes a copy of the source code of one
software package and starts independent development on it,
creating a distinct and separate piece of software.
2. Watcher: a developer has some interest in the project but does
not contribute to it. This introduces a new type of passive
project membership.
3. Pull request: a contribution to a project that is accepted and
pulled into the original project.
4. Issue commit: when developers collaborating on a project come
across problems that need to be fixed, issue commits allow them
to communicate about these problems.
5. Follow: a developer who follows someone on GitHub gets
notifications about that person's activity.
The behavioural data above help to create a profile model of the
developer. However, each variable has its own level of significance
within our project. We used a manual weighting scheme to assign a
weight to each variable according to its importance, in order to
reduce bias. The variable weights are shown in Table 1.
Table 1: Behavioural data weight

Variable        Weight (w)
Fork            0.20
Watcher         0.10
Pull request    0.40
Issue commit    0.15
Follow          0.15
Total           1.00
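Using the developer-by-project matrix, the degree of experience is
computed as:

$$\text{Degree of experience} = \sum \text{projects participated in} \qquad (1)$$

Using the developer-by-behavioural-data matrix (Fork, Watch, pull
request, issue comment, membership), the degree of skill is computed as:

$$\text{Degree of skill} = \text{fork} \cdot w_{\text{fork}} + \text{watcher} \cdot w_{\text{watcher}} + \text{pull\_request} \cdot w_{\text{pull\_request}} + \text{issue\_commit} \cdot w_{\text{issue\_commit}} + \text{follow} \cdot w_{\text{follow}} \qquad (2)$$

$$\text{Total perceived skill and experience} = (1) + (2) \qquad (3)$$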
2.4. Developer Recommendation Based on Profile Ranking
To compute the perceived experience level of each developer, a
developer-by-project matrix is constructed: for every project a
developer participated in, a 1 is recorded, and a 0 otherwise. It is
assumed that the more projects a developer has participated in, the
more experience the developer has. The computed developer experience
is therefore the number of projects the developer has participated in.
Combining this with the computed degree of skill gives a model value
for each developer. Developers are then sorted by this value to produce
a ranking in order of relevance.
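The scoring and ranking described in Sections 2.3 and 2.4 can be sketched as follows. The weights are those of Table 1; treating each behavioural variable as a raw event count is an assumption (the paper does not say how the values are scaled), and the developers and counts are hypothetical.

```python
# Weights from Table 1.
WEIGHTS = {
    "fork": 0.20,
    "watcher": 0.10,
    "pull_request": 0.40,
    "issue_commit": 0.15,
    "follow": 0.15,
}


def degree_of_experience(participation_row):
    """Equation 1: number of projects the developer took part in (row of 0/1)."""
    return sum(participation_row)


def degree_of_skill(activity):
    """Equation 2: weighted sum of the developer's behavioural counts."""
    return sum(WEIGHTS[name] * activity.get(name, 0) for name in WEIGHTS)


def rank_developers(participation, activities):
    """Equation 3 per developer, sorted from highest to lowest total."""
    scores = {
        dev: degree_of_experience(participation[dev])
        + degree_of_skill(activities[dev])
        for dev in participation
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical developers: 0/1 participation over three projects, plus counts.
participation = {"alice": [1, 1, 0], "bob": [1, 0, 0]}
activities = {
    "alice": {"fork": 2, "pull_request": 5},
    "bob": {"watcher": 10, "follow": 4},
}
ranking = rank_developers(participation, activities)
```

With these made-up numbers the first developer scores 2 + 2.4 = 4.4 and the second 1 + 1.6 = 2.6, so the first is ranked higher. Because the experience term is an unbounded count while the weights sum to 1, the relative scale of the two terms depends on how the behavioural values are normalised.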
3. Experiment Design
3.1. Dataset
To evaluate the effectiveness of the proposed technique, we use the
dataset contributed by Georgios and Diomidis (2012). It includes data
on the 89 most starred GitHub projects in the 9 most commonly used
programming languages. The number of GitHub users involved in this
dataset is 499,485. The recorded user behaviour data ranges from 2008,
when GitHub was launched, to September 2013.
From the various features implemented on GitHub, we select the 5 most
commonly used ones as our user behaviour data sources (Fork, Watch,
Comment on issues, Pull Request, Membership).
Table 2: Number of Records in Data Set

Data Set             Number of Records
Fork                 108,628
Watch                295,798
Comment on issues    534,104
Pull-Request         78,955
Membership           1,941

3.2. Experiment: Using the bag-of-words approach and DBpedia concept
extraction
For this experiment, 80 projects were selected from the project file
as the base data, and their features were extracted. 5 other projects
were selected as test cases. Each test case was matched against the
base data using both the existing bag-of-words approach and our
approach based on DBpedia concept extraction. The results show that
DBpedia concept extraction yields more concise concepts and a better
ranking.
4. Results and Discussion
There does not seem to be a benchmark dataset against which the
accuracy of our recommendations can be measured. We therefore adopted
the approach used in machine learning research: the dataset is split
into training and test data, where the training data is used to train
the model and the test data is used to check accuracy.
Using the dataset discussed in Section 3.1, the users in the 6
selected test projects are removed. The test projects are then
introduced one at a time to obtain a list of recommended developers.
This allows us to see whether the recommender ranks these users and
recommends them for the project. To evaluate our approach, we use the
precision and recall measures defined in Table 3. Table 4 shows a
summary report of the 6 experimental runs based on the six test
projects from the dataset. The results in Table 4 are quite promising.
For the Devtools and Redis projects, the system recommended 4 and 5 of
the expected developers respectively. Although these numbers are small,
the system also extracted a much larger number of developers who have
skills and experience relevant to the project. The expected developers
are ranked alongside additional developers who are relevant to the
project. For the 4 other projects, however, the system recommended
completely new developers. The exciting part is that every experimental
run recommended new developers who had never been on the project.
Table 3: Precision and recall measure

                             Actually correct       Actually incorrect     Total
Recommendations correct      True positive (tp)     False positive (fp)    All recommendations
Recommendations incorrect    False negative (fn)    True negative (tn)
Total                        All correct

$$\text{Precision} = \frac{tp}{tp + fp}$$ (right developers recommended / all recommendations)

$$\text{Recall} = \frac{tp}{tp + fn}$$ (relevant developers recommended / all relevant developers)
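The two measures can be computed directly from the counts of Table 3, as in the minimal sketch below. The counts used here are illustrative only and are not taken from the experiments; returning 0 when a denominator is empty is a convention chosen for the sketch.

```python
def precision(tp, fp):
    """Share of recommended developers that were expected ones."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of expected developers that were actually recommended."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative counts: 10 recommendations, 3 of them expected,
# and 9 expected developers that were not recommended.
tp, fp, fn = 3, 7, 9
p = precision(tp, fp)  # 3 / 10
r = recall(tp, fn)     # 3 / 12
```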
Table 4: Report of six experimental runs

Project                                          Akka     Bitcoin  Devtools  PhantomJS  Plugload  Redis
Total developers in the dataset                  83604    83604    83604     83604      83604     83604
Number of developers in the project              436      865      128       1078       502       1458
Recommended developers (minimum of 10 points)    219      78       86        81         83        81
Correct prediction (tp)                          0        0        4         0          0         5
False prediction (fp)                            219      78       82        81         83        76
Developers not found (fn)                        436      865      124       1078       502       1453
Not relevant (tn)                                82949    83526    83518     83523      83521     83523
Precision                                        0        0        0.046512  0          0         0.061728
Recall                                           0        0        0.00      0          0         0.00

5. Conclusion
In this experimental study, we conclude that our proposed approach
of semantic enabled recommendation is quite effective in
recommending developers for new projects on a social coding platform,
as is evident from the results shown above. For two of the projects,
the system recommended 4 and 5 of the expected developers. Although
these numbers are small, the system also extracted a much larger number
of developers who have skills and experience relevant to the project.
The expected developers are ranked alongside additional developers who
are relevant to the project. For the 4 other projects, however, the
system recommended completely new developers. The exciting part is that
every experimental run recommended new developers who had never been
on the project.
Future research could take several directions, such as augmenting
developer profiles with fine-grained information extracted from the
source code, determining optimal weights for the behavioural data
automatically, and comparing our proposed approach with
state-of-the-art content-based recommender algorithms.
List of References
Aggarwal, C. & Zhai, C., 2012. Mining text data. s.l., Springer Science
& Business Media.
Anil, P., Tanvi, B., Neev, P. & Rekha, S., 2014. Non-Personalized
Recommender Systems and User-based Collaborative
Recommender Systems. International Journal of Applied
Information Systems, Volume 6.
Baeza-Yates, R. & Ribeiro-Neto, B., 1999. Modern Information
Retrieval. New York, ACM Press.
Dabbish, L., Stuart, C., Tsay, J. & Herbsleb, J., 2012. Social coding in
github: transparency and collaboration in an open software
repository. s.l., ACM, pp. 1277-1286.
DBPedia, 2008. http://dbpedia-spotlight.github.io/. s.l., DBPedia.
Georgios, G. & Diomidis, S., 2012. Ghtorrent: Github’s data from a
firehose In Mining Software Repositories (MSR). s.l., IEEE, pp. 12-
21.
Gousios, G., Pinzger, M. & Van Deursen, A., 2014. An exploratory
study of the pull-based software development model. s.l., Software
Engineering (ICSE), International Conference, pp. 345-355.
Hauff, C. & Gousios, G., 2015. Matching GitHub developer profiles to
job advertisements. s.l., ACM, pp. 363-366.
Jing, J., Li, Z. & Lei, L., 2015. Understanding project dissemination on
a social coding site. s.l., IEEE, p. 2013.
Lingxiao, Z., Yanzhen, Z. & Bing, X., 2014. User Behaviour: An
Exploratory Study on Github. s.l., ACM.
Lingxiao, Z., Yanzhen, Z., Bing, X. & Zixiao, Z., 2014. Recommending
Relevant Projects via User Behaviour. CrowdSoft 2014, pp. 25-30.
Li, Q. & Byeong, M. K., 2007. Constructing User Profiles for
Collaborative Recommender System. s.l., researchgate.
Mendes, P. N., Jakob, M., Garcia-Silva, A. & Bizer, C., 2011. Dbpedia
spotlight: shedding light on the web of documents. s.l., ACM,
pp. 1-8.
Zagalsky, A., 2013. Investigating Opportunistic Software Development
Using Social Media Recommendation System. M.Sc. Thesis, Tel-Aviv
University.