Victor ODUMUYIWA

Department of Computer Sciences,

University of Lagos

Nigeria

ISBN: 978-978-976-000-8

Copyright © 2019

This work is subject to copyright. All rights are reserved, whether the

whole or part of the material is concerned, specifically the rights of

translation, reprinting, reuse of illustrations, recitation, broadcasting,

reproduction on microfilms or in any other physical way, and

transmission or information storage and retrieval, electronic adaptation,

computer software, or by similar or dissimilar methodology now known

or hereafter developed.

The responsibility for opinions expressed in articles, studies and other

contributions in this proceeding rests solely with their authors.

ISKO-West Africa

Transition from Observation to

Knowledge to Intelligence

3rd Biennial International Conference on Transition from Observation

to Knowledge to Intelligence (TOKI)

15-16 August 2019

University of Lagos, Nigeria

Editors

Dr. Victor ODUMUYIWA

Dr. Olufade ONIFADE

Prof. Amos DAVID

Prof. Charles UWADIA

Semantic Enabled Profile Recommender for Social

Coding Platform

ODUMUYIWA Victor T.

Department of Computer Sciences

University of Lagos, Nigeria.

[email protected]

OYEYEMI Olusoji

Department of Computer Sciences,

University of Lagos, Nigeria.

[email protected]

Abstract. Sourcing competent developers with the required experience and skill level for a project is often a challenging and time-consuming exercise for project managers. A major source is GitHub, a social coding platform that enhances collaboration among developers and enables them to work efficiently on projects and showcase their skills and experience. This work proposes an approach that automates the sourcing process and recommends relevant developers with the right skill set for new projects, based on features extracted from project readme files and from developer activities on the social coding platform.

Keywords: Social Coding, Recommender System, GitHub

1. Introduction

The success of social media has introduced new ways of sharing knowledge in different contexts via the Internet, and leveraging this knowledge is an important and necessary skill for a professional software developer (Zagalsky, 2013). Software developers face a constantly changing set of programming languages, platforms and technologies (Zagalsky, 2013). Software projects may involve numerous technologies and platforms, not all of which the developers can master. The need to connect developers across boundaries, so that they can share knowledge and collaborate on projects, has led to the use of crowd-based development platforms for software development.

In recent years, social coding platforms have become an important tool for developers to become visible in the developer community, share knowledge and collaborate on development work. Developers use sites such as GitHub and BitBucket to showcase their work. Many project managers and software development companies now search social coding platforms to source developers with the skills and experience required for their projects. To find potential developers, project managers search for developer profiles across various crowd-based software development platforms and compare the developers' experience and activity history on the past projects they have been involved in. However, this is a very cumbersome process, as many profiles are lengthy and mention a plethora of libraries and techniques. Another complicating factor is that the writing style of a profile depends on the person creating it, so a “semantic gap” may exist between the search terms and the terms used in the profile.

As a result, a project manager spends many hours searching through developers' profiles to see whether they match the described requirements of a new project. In such cases, insight into how well potential candidates' skills and experience fit the project requirements may help the project manager to judge whether or not to recommend the project to a developer.

There is therefore a need for a way to automatically extract key concepts from project readme files, since they contain a summary of the project fundamentals, and to combine them with selected key behavioural variables stored in the developer's profile. The combination can then be used to recommend the most qualified developers for new project(s).

2. Experimental and Computational Details

2.1. Our Approach

An overview of our approach is shown in Figure 1. Three main components are involved:

I. Semantic concept extraction from project readme files for matching.

II. Developer profile modelling using behavioural data.

III. Developer recommendation based on profile ranking.

Figure 1: Semantic enabled profile recommender overview

2.2. Semantic Concepts Extraction from Project Readme Files for Matching

This work uses DBpedia Spotlight (DBPedia, 2008), one of the most commonly used open-source annotation toolkits for natural language text, for concept extraction. It is based on two techniques from natural language processing: Named Entity Recognition (NER) and Named Entity Disambiguation (NED). The combination of both techniques, as shown by Aggarwal and Zhai (2012), can be a powerful mechanism for transforming natural language text into a structured representation that machines can reason about. The large-scale DBpedia ontology (Mendes, Jakob, Garcia-Silva, & Bizer, 2011) behind the DBpedia Spotlight service is automatically derived from Wikipedia (based on the English Wikipedia edition). At the time this research was conducted, it contained more than 4.5 million entities (“things”) and nearly 600 million links between them. Given that Wikipedia (and by extension DBpedia) contains entries covering most of the available programming languages, important programming frameworks and libraries, as well as many computer science concepts, DBpedia is considered a suitable ontology for this research work.
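The annotation step can be sketched roughly as follows. This is only an illustration, not the code used in this work: the endpoint URL, the `confidence` parameter value and the `Resources` / `@URI` field names are assumptions based on the public DBpedia Spotlight web service and should be checked against its current documentation. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint of the DBpedia Spotlight web service.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request asking Spotlight to annotate `text` as JSON."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"}
    )


def extract_concepts(payload):
    """Return the DBpedia URIs listed in a (parsed) Spotlight JSON response."""
    return [resource["@URI"] for resource in payload.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Call the service (needs network access) and return the concept URIs."""
    with urllib.request.urlopen(build_request(text, confidence)) as response:
        return extract_concepts(json.load(response))
```

Calling `annotate(readme_text)` on the text of a readme file would then return the list of DBpedia concepts used in the matching step, provided the service is reachable.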

The concepts extracted from the past project readme files and from the new project readme files are converted to vectors $p_i = (w_{1,i}, w_{2,i}, \ldots, w_{n,i})$ and $np_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$. Each dimension corresponds to one DBpedia concept.

Not all concepts extracted from a project profile are equally important. We use the TF.IDF weighting scheme (Baeza-Yates & Ribeiro-Neto, 1999) to weigh the concepts. TF.IDF gives a low weight to concepts that appear in many documents, on the assumption that such concepts are not very informative. In essence, this favours concepts that occur rarely across the entire corpus of documents but often within particular documents.
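A minimal sketch of such a weighting is given below. The paper does not state which TF.IDF variant was used, so the raw term count and the logarithmic inverse document frequency here are assumptions, and the concept lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document (a list of concept labels) to a {concept: weight} dict."""
    n_docs = len(docs)
    # Document frequency: number of documents that mention each concept.
    df = Counter(concept for doc in docs for concept in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each concept in this document
        vectors.append({c: tf[c] * math.log(n_docs / df[c]) for c in tf})
    return vectors


# Hypothetical concept lists, as if extracted from three readme files.
readmes = [
    ["Python", "Redis", "Redis"],
    ["Python", "Scala"],
    ["Python", "Bitcoin"],
]
weights = tfidf_vectors(readmes)
```

In this toy corpus, "Python" appears in every readme and therefore receives a weight of zero, while "Redis", which is frequent in only one readme, receives the largest weight.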

Project similarity is computed using the vector space model, which is well suited to natural language processing. This model provides a natural mechanism for determining the similarity between the existing project profile vectors and the new project profile vectors. The cosine of the angle between two vectors (called cosine similarity) indicates how similar they are; because TF.IDF weights are non-negative, it is bounded to a value in [0,1]. A larger score indicates higher similarity, and a score tending to zero indicates dissimilarity. The resulting score is overlaid on the other behavioural data of the developer to pick the most appropriate developers for the project.
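As an illustration, the cosine similarity of two sparse concept vectors could be computed as follows. The vectors are stored as dictionaries from concept to weight, and the example weights are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(concept, 0.0) for concept, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares no concepts with anything
    return dot / (norm_u * norm_v)


# Hypothetical weighted concept vectors for a past and a new project.
past_project = {"Redis": 2.0, "Scala": 1.0}
new_project = {"Redis": 1.0, "Bitcoin": 1.0}
score = cosine_similarity(past_project, new_project)
```

Only the shared concept ("Redis") contributes to the dot product, so projects with no concept in common obtain a score of zero.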


Because the cosine value indicates how similar two projects are, a value closer to 1 indicates closely similar documents while a value tending towards 0 indicates loosely similar documents. On this basis, a threshold of 0.4 is set as the cutoff point for the computed cosine value, which means that projects with a cosine value below 0.4 are discarded.

Projects with a cosine value of 0.4 and above are selected, and all developers who have participated in at least one of these projects are extracted.
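The selection step can be sketched as below. The 0.4 cutoff is taken from the text; the project names, scores, contributor lists and the function name are hypothetical.

```python
THRESHOLD = 0.4  # cutoff stated in the text


def candidate_developers(similarities, contributors, threshold=THRESHOLD):
    """Return the selected projects and everyone who took part in at least one."""
    selected = [p for p, score in similarities.items() if score >= threshold]
    developers = set()
    for project in selected:
        developers.update(contributors.get(project, ()))
    return selected, developers


# Hypothetical similarity scores and contributor lists.
similarities = {"proj_a": 0.72, "proj_b": 0.39, "proj_c": 0.40}
contributors = {
    "proj_a": ["ada", "bola"],
    "proj_b": ["chidi"],
    "proj_c": ["bola", "dayo"],
}
selected, developers = candidate_developers(similarities, contributors)
```

Here `proj_b` falls just below the cutoff, so its only contributor is not carried forward to the ranking stage.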

2.3. Developer Profile Modeling using Behavioral Data

GitHub provides several user-based summary statistics including:

contributions in the last year, number of forked projects and number of

followers. However, according to Hauff and Gousios (2015), “the

usefulness of this information is very limited, as neither does it offer

immediate insights into the developer’s programming abilities nor does

it highlight the particular languages or tool chains the developer

knows”. In this work, we mine GitHub user profiles and project

requirements for relevant information that can enrich the user-based

summary statistics provided by GitHub.

The behavioural data we consider are:

1. Fork: a developer takes a copy of the source code of a software package and starts independent development on it, creating a distinct and separate piece of software.

2. Watcher: a developer has some interest in the project but does not contribute to it. This introduces a new type of passive project membership.

3. Pull request: contributions to a project that are accepted and pulled into the original project files.

4. Issue commit: when developers collaborate on a project, they sometimes come across problems that need to be fixed. Issue commit allows them to communicate about such problems.

5. Follow: when you follow someone on GitHub, you get notifications about their activity.


The above behavioural data help to create a profile model of the developer; however, each variable has its own level of significance within our project. We used a manual weighting scheme to assign a weight to each variable according to its importance, in order to reduce bias. The variable weights are shown in Table 1.

Table 1: Behavioural data weight

| Variable | Weight (w) |
|---|---|
| Fork | 0.20 |
| Watcher | 0.10 |
| Pull request | 0.40 |
| Issue commit | 0.15 |
| Follow | 0.15 |
| Total | 1.00 |

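Using the developer-by-project matrix, the degree of experience is computed as:

$$\text{Degree of experience} = \sum \text{projects participated in} \qquad \text{(eq1)}$$

Using the developer-by-behavioural-data matrix (fork, watch, pull request, issue comment, membership), the degree of skill is computed as:

$$\text{Degree of skill} = \text{fork} \cdot w + \text{watcher} \cdot w + \text{pull\_request} \cdot w + \text{issue\_commit} \cdot w + \text{follow} \cdot w \qquad \text{(eq2)}$$

where $w$ is the weight of the corresponding variable in Table 1.

$$\text{Total perceived skill and experience} = \text{eq1} + \text{eq2} \qquad \text{(eq3)}$$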
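The three equations can be illustrated with the short sketch below. It reads eq2 as a weighted sum of activity counts, which is our interpretation of the notation rather than something the equations state explicitly; the weights come from Table 1, while the developer data and function names are hypothetical.

```python
# Weights from Table 1.
WEIGHTS = {
    "fork": 0.20,
    "watcher": 0.10,
    "pull_request": 0.40,
    "issue_commit": 0.15,
    "follow": 0.15,
}


def degree_of_experience(participation_row):
    """eq1: number of projects the developer took part in (row of 0/1 flags)."""
    return sum(participation_row)


def degree_of_skill(activity_counts):
    """eq2, read here as a weighted sum of activity counts (an assumption)."""
    return sum(WEIGHTS[name] * activity_counts.get(name, 0) for name in WEIGHTS)


def total_score(participation_row, activity_counts):
    """eq3: total perceived skill and experience."""
    return degree_of_experience(participation_row) + degree_of_skill(activity_counts)


# Hypothetical developer who took part in 2 of 3 projects.
row = [1, 0, 1]
counts = {"fork": 5, "watcher": 10, "pull_request": 10}
score = total_score(row, counts)
```

Under these assumptions the ten pull requests contribute more to the skill score than the ten watch events, which reflects the higher weight given to pull requests in Table 1.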

2.4. Developer Recommendation Based on Profile Ranking

To compute the perceived experience level of a developer, a matrix of developers against projects is constructed. For every project a developer participated in, a 1 is recorded, and a 0 otherwise. It is assumed that the more projects a developer has participated in, the more experience the developer is expected to have. The computed developer experience is therefore the number of projects the developer has participated in.

Combining this with the computed degree of skill, a model value is generated for each developer, and the values are sorted to rank the developers in order of relevance.
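A small sketch of the matrix construction and ranking is given below. The developer names, memberships and skill values are hypothetical, and the function names are ours.

```python
def participation_matrix(developers, projects, memberships):
    """Developer-by-project matrix: 1 if the developer took part, 0 otherwise."""
    return {
        dev: [1 if (dev, proj) in memberships else 0 for proj in projects]
        for dev in developers
    }


def rank_developers(matrix, skill):
    """Sort developers by experience (row sum) plus degree of skill, highest first."""
    totals = {dev: sum(row) + skill.get(dev, 0.0) for dev, row in matrix.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs.
projects = ["proj_a", "proj_b", "proj_c"]
memberships = {("ada", "proj_a"), ("ada", "proj_c"), ("bola", "proj_b")}
skill = {"ada": 1.5, "bola": 4.0}
matrix = participation_matrix(["ada", "bola"], projects, memberships)
ranking = rank_developers(matrix, skill)
```

In this example `bola` is ranked first despite having taken part in fewer projects, because the degree of skill outweighs the difference in experience.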

3. Experiment Design

3.1 Dataset

To evaluate the effectiveness of the proposed technique, we use the data set contributed by Georgios and Diomidis (2012). It includes data on the 89 most starred GitHub projects in the 9 most commonly used programming languages. The number of GitHub users involved in this data set is 499,485. The recorded user behaviour data range from 2008, when GitHub was launched, to September 2013.

From the various features implemented on GitHub, we select the 5 most commonly used ones as our source of user behaviour data (Fork, Watch, Comment on issues, Pull Request, Membership).

Table 2: Number of Records in Data Set

| Data Set | Number of Records |
|---|---|
| Fork | 108,628 |
| Watch | 295,798 |
| Comment on issues | 534,104 |
| Pull-Request | 78,955 |
| Membership | 1,941 |

3.2. Experiment: Using bag of word approach and DBpedia concept extraction

For this experiment, 80 projects were selected from the project file as the base data and their features were extracted. 5 other projects were selected as test cases. Each test case was matched against the base data using both the existing bag-of-words approach and our approach relying on DBpedia concept extraction. The results show that our approach extracts better and more concise concepts and also produces a better ranking.


4. Results and Discussion

There seems to be no benchmark data set against which the accuracy of our recommendations can be measured. We therefore adopted the evaluation approach used in machine learning research, in which the dataset is split into training and test data: the training data is used to train the model and the test data is used to check its accuracy.

Using the dataset discussed in Section 3.1, the users in the 6 selected test projects are removed. The test projects are then introduced one at a time to obtain a list of recommended developers. This allows us to see whether the recommender ranks the removed users and recommends them for the project. To evaluate our approach, we use the precision and recall measures shown in Table 3. Table 4 shows a summary report of the 6 experimental runs based on the six test projects from the dataset. The results are quite promising. For the Devtools and Redis projects, the system recommended 4 and 5 of the expected developers respectively. Although these numbers are small, the system extracted a much larger number of developers who have skills and experience relevant to the project. The expected developers are ranked alongside additional developers who are relevant to the project. For the 4 other projects, however, the system recommended completely new developers. The exciting part is that every experimental run recommended new developers who were never on the project.

Table 3: Precision and recall measure

| | Actually correct | Actually incorrect | |
|---|---|---|---|
| Recommendations correct | True positive (tp) | False positive (fp) | All recommendations |
| Recommendations incorrect | False negative (fn) | True negative (tn) | |
| | All correct | | |

$$\text{Precision} = \frac{tp}{tp + fp} \quad \text{(right developers recommended / all recommendations)}$$

$$\text{Recall} = \frac{tp}{tp + fn} \quad \text{(relevant developers recommended / all relevant developers)}$$
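The two measures translate directly into code. The sketch below uses hypothetical counts for a single test project, not the counts from the experiments.

```python
def precision(tp, fp):
    """Right developers recommended / all recommendations."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Relevant developers recommended / all relevant developers."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for one test project.
tp, fp, fn = 3, 7, 9
p = precision(tp, fp)  # 3 of 10 recommendations were right
r = recall(tp, fn)     # 3 of 12 relevant developers were found
```

Both functions return 0.0 when the denominator is zero, so a run that produces no recommendations does not raise a division error.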

Table 4: Report of six experimental runs

| | Akka Project | Bitcoin Project | Devtools Project | PhantomSJ Project | Plugload Project | Redis Project |
|---|---|---|---|---|---|---|
| Total developers in the dataset | 83604 | 83604 | 83604 | 83604 | 83604 | 83604 |
| Number of developers in the project | 436 | 865 | 128 | 1078 | 502 | 1458 |
| Recommended developers with a minimum of 10 points | 219 | 78 | 86 | 81 | 83 | 81 |
| Correct prediction (tp) | 0 | 0 | 4 | 0 | 0 | 5 |
| False prediction (fp) | 219 | 78 | 82 | 81 | 83 | 76 |
| Developers not found (fn) | 436 | 865 | 124 | 1078 | 502 | 1453 |
| Not relevant (tn) | 82949 | 83526 | 83518 | 83523 | 83521 | 83523 |
| Precision | 0 | 0 | 0.046512 | 0 | 0 | 0.061728 |
| Recall | 0 | 0 | 0.00 | 0 | 0 | 0.00 |

5. Conclusion

In this experimental study, we conclude that our proposed approach of semantic enabled recommendation is quite effective in recommending developers for new projects on a social coding platform, as evident from the results in Table 4. For two of the projects, the system recommended 4 and 5 of the expected developers. Although these numbers are small, the system extracted a much larger number of developers who have skills and experience relevant to the project. The expected developers are ranked alongside additional developers who are relevant to the project. For the 4 other projects, however, the system recommended completely new developers. The exciting part is that every experimental run recommended new developers who were never on the project.

Future research could proceed in several directions, such as augmenting developer profiles with fine-grained information extracted from source code, automatically determining optimal weights for the behavioural data, and comparing our proposed approach with state-of-the-art content-based recommender algorithms.

List of References

Aggarwal, C. & Zhai, C., 2012. Mining text data. s.l., Springer Science & Business Media.

Anil, P., Tanvi, B., Neev, P. & Rekha, S., 2014. Non-Personalized Recommender Systems and User-based Collaborative Recommender Systems. International Journal of Applied Information Systems, Volume 6.

Baeza-Yates, R. & Ribeiro-Neto, B., 1999. Modern Information Retrieval. New York, ACM Press.

Dabbish, L., Stuart, C., Tsay, J. & Herbsleb, J., 2012. Social coding in GitHub: transparency and collaboration in an open software repository. s.l., ACM, pp. 1277-1286.

DBPedia, 2008. http://dbpedia-spotlight.github.io/. s.l., DBPedia.

Georgios, G. & Diomidis, S., 2012. GHTorrent: GitHub's data from a firehose. In: Mining Software Repositories (MSR). s.l., IEEE, pp. 12-21.

Gousios, G., Pinzger, M. & Van Deursen, A., 2014. An exploratory study of the pull-based software development model. s.l., Software Engineering (ICSE), International Conference, pp. 345-355.

Hauff, C. & Gousios, G., 2015. Matching GitHub developer profiles to job advertisements. s.l., ACM, pp. 363-366.

Jing, J., Li, Z. & Lei, L., 2015. Understanding project dissemination on a social coding site. s.l., IEEE, p. 2013.

Lingxiao, Z., Yanzhen, Z. & Bing, X., 2014. User Behaviour: An Exploratory Study on GitHub. s.l., ACM.

Lingxiao, Z., Yanzhen, Z., Bing, X. & Zixiao, Z., 2014. Recommending Relevant Projects via User Behaviour. CrowdSoft 2014, pp. 25-30.

Li, Q. & Byeong, M. K., 2007. Constructing User Profiles for Collaborative Recommender System. s.l., researchgate.

Mendes, P. N., Jakob, M., Garcia-Silva, A. & Bizer, C., 2011. DBpedia Spotlight: shedding light on the web of documents. s.l., ACM, pp. 1-8.

Zagalsky, A., 2013. Investigating Opportunistic Software Development Using Social Media Recommendation System. M.Sc. Thesis, Tel-Aviv University.
