On the development and distribution of R packages


An Empirical Analysis of the R Ecosystem

Alexandre Decan, Tom Mens, Maëlick Claes & Philippe Grosjean
COMPLEXYS Research Institute

8th September 2015, IWSECO-WEA 2015

Statistical environment

Packages with code, doc, examples, tests, datasets:

http://www.r-project.org

install.packages("MyPackage")

R package repositories (in March 2015)

Repository name   Number of packages   Since   Role
CRAN              6411                 1997    Distribution
Bioconductor      997                  2001    Distribution
R-Forge           1883                 2006    SVN development, distribution
GitHub            5150                 2008    Git development, distribution using devtools

But there are more: RForge, Omegahat, Bitbucket, Sourceforge, Google code, ...

How to install packages

install.packages function:

- automatically installs a package and its dependencies if needed
- only uses CRAN by default
- can be configured to use other repositories like Bioconductor and R-Forge

Package devtools provides various functions to install packages from other sources:

- SVN
- Git
- GitHub
- Bitbucket
- Gitorious

devtools retrieves the package content and installs it using install.packages
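As a minimal sketch of the configuration described above, the snippet below passes more than one repository to install.packages. The repository labels, the wrapper function install_from and its dry_run argument are assumptions made for illustration; "MyPackage" is the placeholder name used earlier in these slides, and the URLs are the public CRAN cloud mirror and R-Forge addresses. By default the wrapper only reports what it would do, so it can be run without network access.

```r
# Repositories to search; the names on the left are arbitrary labels.
repos <- c(
  CRAN   = "https://cloud.r-project.org",
  RForge = "https://R-Forge.R-project.org"
)

# Hypothetical wrapper: installs only when dry_run = FALSE.
install_from <- function(pkg, repos, dry_run = TRUE) {
  if (dry_run) {
    message("Would install ", pkg, " using: ",
            paste(names(repos), collapse = ", "))
    return(invisible(repos))
  }
  # install.packages() accepts a character vector of repository URLs.
  install.packages(pkg, repos = repos)
}

install_from("MyPackage", repos)

# For a package hosted on GitHub, devtools offers install_github();
# "user/MyPackage" is a made-up location, so the call is left commented out.
# devtools::install_github("user/MyPackage")
```

Keeping the repository list in a named vector makes it easy to add or remove a source without touching the install call itself.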

Previous work

- Preliminary empirical study using CRAN meta-data: On the maintainability of CRAN packages (CSMR-WCRE 2014)
- Inter-project (Type 1) clone study of CRAN packages: An Empirical Study of Identical Function Clones in CRAN (IWSC 2015)
- Web dashboard for CRAN maintainers: maintaineR, a web-based dashboard for maintainers of CRAN packages (ICSME 2014)

Research Questions

- Where are R packages developed and/or distributed?
- How to resolve package dependencies?

Where are R packages developed and/or distributed?

Packages contained in the different repositories in March 2015

Number of newly created packages on GitHub

More and more packages are developed on GitHub that are not distributed somewhere else.

Evolution of the number of packages in CRAN and GitHub

The number of packages only on GitHub grows faster than the number of packages on CRAN!
But it does not seem to impact the growth of CRAN.

How to resolve package dependencies?

Dependencies

Defined in the DESCRIPTION file, using the fields Depends and Imports.

These fields do not specify from which repository the dependency must come!

Package: SciViews
Type: Package
Title: SciViews GUI API - Main package
Imports: ellipse
Depends: R (>= 2.6.0), stats, grDevices, graphics, MASS
Enhances: base
Description: Functions to install SciViews additions to R, and more (various) tools
Version: 0.9-5
Date: 2013-03-01
Author: Philippe Grosjean
Maintainer: Philippe Grosjean <phgrosjean@sciviews.org>
License: GPL-2
LazyLoad: yes
URL: http://www.sciviews.org/SciViews-R
BugReports: https://r-forge.r-project.org/tracker/?group_id=194
Packaged: 2014-03-01 20:34:11 UTC; phgrosjean
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2014-03-02 12:40:42

Imports: ellipse
Depends: R (>= 2.6.0), stats, grDevices, graphics, MASS
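To make the role of the two fields concrete, here is a small sketch that extracts dependency names from the Depends and Imports fields shown above, using base R's read.dcf. The helper parse_deps is a hypothetical name introduced for this example, and this is not necessarily how the study extracted its data. Note that nothing in the parsed result says which repository a dependency should come from.

```r
# The relevant lines of the SciViews DESCRIPTION file shown above.
description <- c(
  "Package: SciViews",
  "Imports: ellipse",
  "Depends: R (>= 2.6.0), stats, grDevices, graphics, MASS"
)

# read.dcf() reads "Field: value" records from a file or connection.
con <- textConnection(description)
fields <- read.dcf(con, fields = c("Depends", "Imports"))
close(con)

# Hypothetical helper: split a field on commas, drop version
# constraints such as "(>= 2.6.0)", and ignore the R interpreter itself.
parse_deps <- function(x) {
  if (is.na(x)) return(character(0))
  parts <- trimws(strsplit(x, ",")[[1]])
  pkg_names <- trimws(sub("\\(.*\\)", "", parts))
  setdiff(pkg_names, "R")
}

deps <- unique(c(parse_deps(fields[1, "Depends"]),
                 parse_deps(fields[1, "Imports"])))
print(deps)
```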

Package repository priority

For each defined dependency relationship, we consider the first package matching the dependency, privileging repositories in this order:

CRAN, then Bioconductor, then GitHub, then R-Forge
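The priority rule can be sketched as a simple lookup: walk the repositories in order and keep the first one that contains a package with the required name. The repository contents below (pkgA, pkgB, pkgC) are invented for illustration, and resolve_dependency is a hypothetical function name; only the ordering comes from the slide.

```r
# Repository order taken from the slide above.
priority <- c("CRAN", "Bioconductor", "GitHub", "R-Forge")

# Made-up repository contents, for illustration only.
available <- list(
  "CRAN"         = c("pkgA"),
  "Bioconductor" = c("pkgB"),
  "GitHub"       = c("pkgA", "pkgC"),
  "R-Forge"      = c("pkgB", "pkgC")
)

# Return the first repository, in priority order, that holds `dep`,
# or NA when no repository provides it.
resolve_dependency <- function(dep, available, priority) {
  for (repo in priority) {
    if (dep %in% available[[repo]]) return(repo)
  }
  NA_character_
}

resolved <- vapply(c("pkgA", "pkgB", "pkgC", "pkgD"),
                   resolve_dependency, character(1),
                   available = available, priority = priority)
print(resolved)
```

In this toy data pkgA exists on both CRAN and GitHub, and the rule picks CRAN because it comes first in the order.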

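Dependencies between repositories

[Figure: dependency graph between CRAN, Bioconductor, GitHub and R-Forge, with edges labelled by percentage. Labels appearing in the original: 58.8%, 48.9%, 37.2%, 5.2%, 2.3%, 77.1%, 61%, 5.8%, 5.7%. The edge each label belongs to is not recoverable from this transcript.]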

CRAN is the core of the ecosystem

Conclusion

- We looked at where R packages are developed and distributed, taking into account CRAN, Bioconductor, GitHub and R-Forge.
- GitHub is growing at a faster pace than the other repositories.
- More and more packages are developed on GitHub but not distributed anywhere else.
- However, this does not impact the other repositories:
  - CRAN is (still) at the center of the ecosystem.
  - Most packages on Bioconductor, R-Forge and GitHub require CRAN in order to work.

Current and future work

- Take into account more R package repositories (e.g. Bitbucket)
- Investigate why there are so many packages only on GitHub
- Ask developers (survey) about their usage of CRAN and GitHub
- Eventually provide support to R package users and developers by improving package dependency management
- Socio-technical analysis of R package developer communities
- Similar study of an ecosystem based on another programming language

Thanks for your attention

Questions?

Slides: http://maelick.net/presentations/iwseco-wea2015/