
Exploratory research on API usage

Ralf Lämmel Software Languages Team, University of Koblenz-Landau

Slides prepared for ESE 2014 course

at University of Koblenz-Landau

© 2009-2014 Ralf Lämmel, Software Languages Team and collaborators

“Exploratory case studies are used as initial investigations of some phenomena to derive new hypotheses and build theories.”

Easterbrook, S. M., Singer, J., Storey, M-A., and Damian, D., Selecting Empirical Methods for Software Engineering Research. In F. Shull, J. Singer and D. Sjøberg (eds), Guide to Advanced Empirical Software Engineering, Springer, 2007.


API = Application Programming Interface



These are core and 3rd party APIs for the Java platform.

than 100 commits to the repository. This selection resulted in 60 reference projects out of all the 1,476 built projects.

3.5 Size metrics for the corpus

Numbers of projects and their NCLOC sizes, and other metrics are summarized in Table 1 and Table 2. We use the metric MC for the number of method calls.

Metric               Value
Projects             6,286
Source files         2,121,688
LOC                  377,640,164
NCLOC                264,536,500
Import statements    14,335,066

Table 1. Summary of token-based analysis (with all automatically identified Java/SVN projects on SourceForge).

Metric                            Value
Projects with attempted builds    1,639
Built projects                    1,476
Packages                          46,145
Classes                           198,948
Methods                           1,397,099
Method calls                      8,163,083

Table 2. Summary of AST-based analysis.

Figure 1. Size metrics (NCLOC, MC) for built and unbuilt projects (thinned out). The projects are ordered by the values for the NCLOC metric.

Figure 1 presents the distribution of size metrics (NCLOC, MC) for the corpus (y-axis is normalized w.r.t. the maximum of the metric in each group). As one can see, both metrics correlate reasonably. The maximum of NCLOC in the whole corpus (incl. unbuilt projects) is 25,515,312, the maximum of NCLOC among built projects is 1,985,977, which implies a factor 12.85 difference. Hence, we are missing the biggest projects currently. The maximum of MC among built projects is 228,242.
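The factor quoted for the NCLOC maxima can be recomputed from the reported numbers. The short sketch below does that and also derives two ratios from Tables 1 and 2 that the text does not state explicitly (the build success rate and the share of non-comment lines); the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the text and in Tables 1 and 2.
max_ncloc_corpus = 25_515_312            # maximum NCLOC, incl. unbuilt projects
max_ncloc_built = 1_985_977              # maximum NCLOC among built projects
attempted, built = 1_639, 1_476          # Table 2
loc, ncloc = 377_640_164, 264_536_500    # Table 1

# Ratio of the two maxima (the "factor" mentioned in the text).
factor = max_ncloc_corpus / max_ncloc_built

# Derived ratios (not stated in the text).
build_rate = built / attempted
ncloc_share = ncloc / loc

print(f"size factor: {factor:.2f}")
print(f"build success rate: {build_rate:.3f}")
print(f"NCLOC share of LOC: {ncloc_share:.3f}")
```

The first printed value reproduces the factor of 12.85 given in the text.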

API                Domain        Core   #Projects   #Calls    #Distinct methods called
Java Collections   Collections   yes    1374        392639    406
AWT                GUI           yes    754         360903    1607
Swing              GUI           yes    716         581363    3369
Reflection         Other         yes    560         15611     154
Core XML           XML           yes    413         90415     537
DOM                XML           yes    324         52593     180
SAX                XML           no     310         13725     156
log4j              Logging       no     254         43533     187
JUnit              Testing       no     233         71481     1011
Comm.Logging       Logging       no     151         21996     88

Table 3. Top 10 of the known APIs (sorted by the number of projects using an API).

3.6 Provision of API metadata

Both Core Java APIs and manually downloaded JARs were processed by us to assign metadata: name of the API, the name of a programming domain, one or more package prefixes, and potentially a white-list of API types. Table 3 lists the top 10 of all the 77 manually tagged APIs together with some metadata and metrics. We used a special reflection-based fact extractor for the visible types and members of the API JARs. (Alternatively, one could also attempt to leverage API documentation, but such documentation may not be available for some of the APIs, and it would take extra effort to establish consistency between JARs and documentation.) These facts are also stored in the database, and they are leveraged by some forms of API-usage analysis.
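The metadata described above (API name, domain, package prefixes, optional white-list of types) can be pictured as a small record plus a lookup that maps a qualified type name to its API. The following is a minimal sketch under stated assumptions: the class `ApiMetadata`, the function `resolve_api`, and the listed prefixes are hypothetical illustrations, not the paper's actual schema or data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ApiMetadata:
    name: str                                 # e.g. "Swing"
    domain: str                               # e.g. "GUI"
    package_prefixes: tuple                   # one or more package prefixes
    type_whitelist: frozenset = frozenset()   # empty means: no white-list

    def owns(self, qualified_type: str) -> bool:
        """True if the fully qualified type name belongs to this API."""
        package = qualified_type.rpartition(".")[0]
        in_prefix = any(
            package == p or package.startswith(p + ".")
            for p in self.package_prefixes
        )
        if not in_prefix:
            return False
        # With a white-list, only listed types count as part of the API.
        return not self.type_whitelist or qualified_type in self.type_whitelist


def resolve_api(qualified_type: str,
                apis: Iterable[ApiMetadata]) -> Optional[ApiMetadata]:
    """Return the first API whose prefixes (and white-list) match, else None."""
    return next((api for api in apis if api.owns(qualified_type)), None)


# Illustrative entries only; the prefixes are assumptions, not the paper's data.
APIS = [
    ApiMetadata("Swing", "GUI", ("javax.swing",)),
    ApiMetadata("AWT", "GUI", ("java.awt",)),
    ApiMetadata("JUnit", "Testing", ("junit.framework", "org.junit")),
]

print(resolve_api("javax.swing.JButton", APIS).name)
print(resolve_api("com.example.Foo", APIS))
```

Matching on a prefix followed by a dot avoids attributing a package such as `javax.swingx` to the `javax.swing` prefix.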

4. Examples of API-usage analysis

We will introduce a few examples of API-usage analysis. In each case, we will provide a motivation related to API migration and language conversion, before we apply the analysis to our SourceForge corpus.

Admittedly, the statistical analysis of a corpus does not directly help with any specific migration project. However, the reported analyses as such are meaningful for single projects, too (perhaps subject to refinements). For instance, we will discuss API coverage below, and such information, when obtained for a specific project, directly helps to prioritize migration efforts. In this paper, which presents early research, we take a statistical view on the corpus to indicate the de-facto range for some of the measures of interest.

4.1 API footprint per project

We begin with a very simple, nevertheless informative API-usage analysis for the footprint of API usage. There are different dimensions of footprint. Below, we consider the numbers of used APIs and used (distinct) API methods. In the online appendix, we also consider the ratio of API calls to all calls. In extension of these numbers, we could also be interested in the ‘reference projects × API pool’ matrix (showing for each project the combination of APIs that it uses).

What’s an API?



[Infographic table: one row per .NET namespace, one column per metric. Most cells are graphical markers in the original, so per-namespace values are not reproduced here.]

Columns: Namespace, #Types, #Methods, MAX size class tree, MAX size interface tree, #Referenced namespaces, #Referring namespaces, %Classes, %Interfaces, %Value types, %Delegate types, %Generic types, %Class arguments, %Interface arguments, %Value type arguments, %Delegate arguments, %Sealed classes, %Specializable classes, %Specializable types, %Orphan classes, %Orphan interfaces, %Orphan types.

Rows (in the original order): System.Web.* (#Types 2327, #Methods 29315), System.Windows.*, System.ServiceModel.*, System.Windows.Forms.*, System.Data.*, System.Activities.*, System.ComponentModel.*, System.Workflow.*, System.Xml.*, System.Net.*, System.DirectoryServices.*, System, System.Security.Cryptography.*, Microsoft.VisualBasic.*, System.Runtime.InteropServices.*, Microsoft.JScript.*, System.Drawing.*, System.Runtime.Remoting.*, System.Configuration.*, System.Diagnostics.*, System.IO.*, System.Reflection.*, System.EnterpriseServices.*, System.CodeDom.*, System.IdentityModel.*, Microsoft.Build.*, System.Management.*, System.Threading.*, System.Runtime.Serialization.*, System.Security.AccessControl, System.Security.Permissions, System.Runtime.CompilerServices, System.Linq.*, System.AddIn.*, System.Xaml.*, System.Messaging.*, Microsoft.Win32.*, System.Security.Policy, System.Globalization, Microsoft.VisualC.*, System.Transactions.*, System.Security, System.Collections.Generic, System.Runtime.DurableInstancing, System.Collections, System.Text, System.Deployment.*, System.Runtime.Caching.*, System.ServiceProcess.*, System.Resources.*, System.Dynamic, System.Security.Principal, Microsoft.SqlServer.Server, System.Security.Authentication.*, System.Collections.Specialized, System.Device.Location, System.Text.RegularExpressions, Accessibility, System.Collections.Concurrent, System.Runtime.Versioning, Microsoft.CSharp.*, System.Collections.ObjectModel, System.Runtime.ConstrainedExecution, System.Runtime, System.Timers, System.Media, System.Runtime.Hosting, System.Runtime.ExceptionServices, System.Numerics.

Summary rows of the table (75%, Median, 25%):

Metric                       75%    Median   25%
#Types                       190    60       18
#Methods                     1579   543      175
MAX size class tree          8      3        0
MAX size interface tree      7      2        0
#Referenced namespaces       22     17       11
#Referring namespaces        26     9        3
%Classes                     80     72       60
%Interfaces                  16     6        0
%Value types                 23     13       6
%Delegate types              5      0        0
%Generic types               1      0        0
%Class arguments             66     54       37
%Interface arguments         8      4        1
%Value type arguments        46     34       19
%Delegate arguments          5      2        0
%Sealed classes              51     33       10
%Specializable classes       89     67       46
%Specializable types         90     69       50
%Orphan classes              4      1        0
%Orphan interfaces           40     0        0
%Orphan types                10     4        0

Table II. Infographics for reuse-related metrics for .NET (see the online version for additional data).

Each .NET namespace may be regarded as an API

What’s an API?



API domains

• Programming domains

‣ XML

‣ GUI

‣ BCE (bytecode engineering)

‣ Testing

‣ ...

• Application domains

‣ Financial exchange

‣ Human resources

‣ ...



Research challenges related to APIs


• Help programmers to understand APIs.

• Help API developers to evolve APIs.

• Help API users to do API migration.


API migration = to eliminate an application's dependencies on a given API (the “original” API) and to make it depend instead on another API (the “replacement” API).

Why to do it?

How to do it?



Incentives for API migration

• The replacement API is

‣ more modern, or

‣ more typed, or

‣ more compliant, or

‣ more efficient, or

‣ more usable, etc.

• The original API is

‣ no longer supported, or

‣ too costly in terms of license costs.

• The application must use fewer APIs per domain.

• Help with language migration.


Figure 1: SwingSet2’s Tooltip demo in Swing, SwingWT and evolved SwingWT versions.

The aforementioned method has been implemented in a toolkit called Koloo. The paper’s website¹ provides access to accompanying material such as the wrappers and applications of the study as well as additional data.

Road-map of the paper. §2 presents a motivating example for wrapper development based on an open-source wrapper for Java’s Swing API for GUI programming. §3 reports on interviews conducted with developers experienced in API migration, providing requirements for an API migration method. §4 develops the overall notion of compliance testing for wrapper-based API migration. §5 describes a method for API migration based on compliance testing. §6 describes validation. §7 analyzes related work. §8 concludes the paper.

2. MOTIVATING EXAMPLE

We will illustrate here compliance issues between original API and its wrapper-based re-implementation. The aim of wrapper development is to reduce such issues up to the point that the wrapper is ‘good enough’ for use in applications. Original API and replacement API (i.e., the API used internally by the wrapper) may enjoy ‘arbitrary’ differences [3, 2]; the wrapper must neutralize these differences.

Our example concerns two GUI APIs: i) the Swing API with the help of the AWT API, which are both part of JDK (‘the Java platform’), and ii) the SWT API², which is part of the Eclipse project. There exists the open-source project SwingWT³, which is a wrapper-based re-implementation of Swing in terms of SWT.

Using the method of this paper (see §5–§6) and designated tool support (i.e., the Koloo toolkit), we have developed a revision of SwingWT⁴, to which we also refer as ‘evolved SwingWT’ in the sequel.

Fig. 1 illustrates compliance issues of SwingWT versus Swing; it also illustrates how our systematically evolved wrapper reduces the compliance issues. In the figure, we exercise one scenario of the Tooltip demo of SwingSet2⁵, which is a set of Swing demos originally distributed by Sun. The scenario is concerned with displaying a specific tooltip, as it is triggered when the user positions the mouse over the cow’s mouth. The scenario was chosen because the Tooltip demo is clearly concerned with triggering tooltips.

¹ http://gsd.uwaterloo.ca/issta2012
² http://eclipse.org/swt
³ http://swingwt.sf.net
⁴ http://swingwt.svn.sf.net/viewvc/swingwt/swingwt/trunk/CHANGELOG?revision=86
⁵ Source code: see Sun JDK folder demo/jfc/SwingSet2/

On the left, the reference behavior is that the cow moos. In the middle, the SwingWT wrapper, prior to our efforts, is at work. Two problems are noticeable in the image: the background color is gray instead of white, and the tooltip shows a different message: ‘cow’ as opposed to ‘Mooooooo’. On the right, the evolved SwingWT wrapper is at work. The background color complies with the original API; the tooltip’s message complies with the original API as well. We have chosen the example to be visually striking, but we must emphasize that the method proposed in this paper specifically reveals issues that are not easily spotted by looking at particular screenshots; it also applies to domains other than GUI programming.

Our method reveals behavioral differences in an automated manner on the grounds of checked assertions for API contracts. Revealed differences are also referred to as violations (of compliance). Our method can be exercised with varying levels of scrutiny in terms of API contracts and corresponding assertions to be imposed on scenario execution. Koloo supports recording and replaying scenarios as well as comparing results of execution across original API and wrapper.

If we enable API contracts for return values of calls to API methods exercised by the application, then the violation is detected that SwingWT does not reproduce the return value for the method contains(Point) of class java.awt.Polygon. An investigation reveals that the Tooltip demo uses the method to map areas of the image to the corresponding messages. The violation is directly linked to the observable fact that a wrong message is displayed in the middle of Fig. 1.

If we also enable API contracts for asynchronous calls from the API to the application, then the violation is detected that SwingWT does not execute an asynchronous callback to the method contains(int, int) of class java.awt.Component. An investigation reveals that the Tooltip demo overrides the method to select the appropriate tooltip to display (specific to the component). The violation is also directly linked to the observable fact that a wrong message is displayed in the middle of Fig. 1.

If we also enable contracts for state integrity for method-call receivers of API types, then a violation is also flagged for the diverging background color. This violation cannot be revealed by looking at return values of method calls, because the scenario’s design is such that the application does not exercise the (getter of the) background color in any way. It must be noted that additional scrutiny in terms of API contracts may also imply noise in reporting violations (see §6), in so far as elimination of these violations need not reasonably be expected of a ‘good enough’ wrapper.
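The idea of checking return-value contracts by recording a scenario against the original API and replaying it against the wrapper can be sketched in a few lines. This is only an illustration of the principle under stated assumptions: the names `Call`, `record`, and `replay`, and the two toy `contains` implementations, are hypothetical and do not reflect Koloo's actual interface or the real Swing and SwingWT code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Call:
    method: str
    args: tuple
    result: Any   # return value observed with the original API


def record(api: Dict[str, Callable], scenario: List[Tuple[str, tuple]]) -> List[Call]:
    """Run the scenario against the original API and keep the return values."""
    return [Call(m, a, api[m](*a)) for m, a in scenario]


def replay(api: Dict[str, Callable], trace: List[Call]) -> List[str]:
    """Re-run the recorded calls against another implementation and report
    every call whose return value differs (a compliance violation)."""
    violations = []
    for call in trace:
        actual = api[call.method](*call.args)
        if actual != call.result:
            violations.append(
                f"{call.method}{call.args}: expected {call.result!r}, got {actual!r}"
            )
    return violations


# Toy stand-ins (assumptions): a 10x10 region test and a defective wrapper.
original = {"contains": lambda x, y: 0 <= x < 10 and 0 <= y < 10}
wrapper = {"contains": lambda x, y: False}

scenario = [("contains", (5, 5)), ("contains", (20, 5))]
trace = record(original, scenario)
for violation in replay(wrapper, trace):
    print(violation)
```

Contracts for callbacks or receiver state, as discussed above, would need richer traces than the return values recorded here.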

API migration is a hard problem!

A Swing demo

The same demo with the SwingWT¹ wrapper

Use of an improved wrapper²

¹ http://swingwt.sourceforge.net/

² T. T. Bartolomei, K. Czarnecki, and R. Lämmel: Compliance testing for wrapper-based API migration. Submitted.


Large-scale, AST-based API-usage analysis of open-source Java projects

Ralf Lämmel¹,² and Ekaterina Pek² and Jürgen Starek¹

¹ Software Languages Team, Universität Koblenz-Landau, Germany
² ADAPT Lab, Universität Koblenz-Landau, Germany

Abstract

Research on API migration and language conversion can be informed by empirical data about API usage. For instance, such data may help with designing and defending mapping rules for API migration in terms of relevance and applicability. We describe an approach to large-scale API-usage analysis of open-source Java projects, which we also instantiate for the SourceForge open-source repository in a certain way. Our approach covers checkout, building, tagging with metadata, fact extraction, analysis, and synthesis with a large degree of automation. Fact extraction relies on resolved (type-checked) ASTs. We describe a few examples of API-usage analysis; they are motivated by API migration. These examples are concerned with analysing API footprint (such as the numbers of distinct APIs used in a project), API coverage (such as the percentage of methods of an API used in a corpus), and framework-like vs. class-library-like usage.

1. Introduction

The broader context of the reported research is API¹ migration [3, 24, 36, 2] (but also language conversion [17, 29, 22] to the extent that it involves API migration). Given a programming domain, and given a couple of different APIs for that domain, it can be challenging to devise transformations or wrappers for migration from one API to the other. The APIs may differ with regard to types, methods, contracts, and protocols so that actual API migration efforts must compromise with regard to automation and correctness [3, 2].

Several researchers, including ourselves, are working towards general techniques for reliable and scalable API migration. Because of the complexity of transformations and wrappers for migration as well as the difficulty of proving them correct, it is also advisable to leverage diverse knowledge about actual API usage.

In the present paper, we describe an approach to large-scale API-usage analysis for the analysis of open-source Java projects. Our approach covers checkout, building, tagging with metadata, fact extraction, analysis, and synthesis with a large degree of automation. We describe a few examples of API-usage analysis; they are motivated by API migration. Overall, API-usage analysis helps with designing and defending mapping rules for API migration in terms of relevance and applicability.

While API migration remains the primary motivation for our efforts on API-usage analysis, we must say that the work reported in this paper has meanwhile grown into an effort of its own right: this kind of supported API-usage analysis also caters for understanding simple structural properties of Java software (such as the number of APIs used in open-source projects) and fundamental API characteristics (such as class library-like vs. framework-like API usage). In this sense, it supports potential empirical work on language usage in the line of Baxter et al.’s “Understanding the shape of Java software” [4].

¹ In this paper, we use the term API to refer both to a public programming interface and its actual implementation as a software library for reuse.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’11 March 21-25, 2011, TaiChung, Taiwan. Copyright 2011 ACM 978-1-4503-0113-8/11/03 ...$10.00.

Contributions of the paper

• We work out a few examples of API-usage analysis: (i) API-footprint analysis for projects; (ii) API-coverage analysis for the corpus; (iii) analysis of framework-like API usage. We discuss how such analyses inform API-migration efforts.

• We describe a process for obtaining a large corpus of built Java projects and involved Java APIs from an open-source repository in a scalable manner. Fact extraction uses precise, resolved (type-checked) ASTs.

Practical results and online access

In the Software Languages Lab at Koblenz, we are working on different aspects of API-usage analysis and several implementations. In this paper, we report on an infrastructure that we applied to SourceForge. All empirical data in this paper is based on this implementation. All figures and tables from the present paper as well as additional data material have been made available online.²

Road-map of the paper

§2 provides an overview of the approach to API-usage analysis. §3 instantiates the approach for the SourceForge open-source repository in a certain way. §4 describes some examples of API-usage analysis, and exercises them for the SourceForge-based corpus. §5 discusses major threats to validity. §6 discusses related work. §7 concludes the paper.

² Paper’s web site incl. all support material: http://softlang.uni-koblenz.de/sourceforge/

Published in SAC’11





















• 6K+ projects from SourceForge

• 1.5K projects built (200k classes)

• 60 reference projects

• Pool of 77 known APIs aggregated

• Additional API packages detected automatically




















Many APIs per project

The APIs or API methods used in a project provide insight into the API-related complexity of the project. In fact, such footprint-like data serves as a proxy for the API dependence or platform dependence of a project. In [16], we mention such API dependence as a form of software asbestos. In the following, we simply count the number of APIs used in a project as a proxy for the difficulty of API migration. Ultimately, a more refined analysis is needed such that specific (known to be difficult) API combinations are counted, and attention is paid to whether these API combinations are really exercised in one program scope or only separately.

In this context, we need to define what constitutes usage of an API. One option would be to count each method call with an API’s type as static receiver type (in the case of an instance call), or as the hosting scope (in the case of a static call), or as the constructed type (in the case of a constructor call). Another option is to count any sort of reference to an API’s types (including the aforementioned positions of API types in method calls, but counting additionally local variable declarations or argument positions of method declarations and method calls). Yet another option is to consider simply imports in a project. The latter option has the advantage that we can measure such imports very easily, even for unbuilt projects. Indeed, the following numbers were obtained by counting imports that were obtained with the token-based fact extractor.
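The import-based option can be illustrated with a few lines of code. The sketch below is not the token-based fact extractor of the paper: it is a simplified regular-expression scan over raw source text (which, unlike a tokenizer, would also pick up imports inside comments), and the API-to-prefix mapping is an assumption made for the example.

```python
import re

# Matches "import a.b.C;", "import a.b.*;" and "import static a.b.C.m;".
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)

# Illustrative prefixes only (assumptions, not the paper's metadata).
API_PREFIXES = {
    "Swing": ("javax.swing",),
    "AWT": ("java.awt",),
    "JUnit": ("junit.framework", "org.junit"),
    "log4j": ("org.apache.log4j",),
}


def used_apis(source: str) -> set:
    """Return the names of known APIs referenced by import statements."""
    imported = IMPORT_RE.findall(source)
    return {
        api
        for api, prefixes in API_PREFIXES.items()
        for name in imported
        for p in prefixes
        if name == p or name.startswith(p + ".")
    }


SAMPLE = """
package demo;
import java.awt.Polygon;
import javax.swing.*;
import static org.junit.Assert.assertEquals;
import java.util.List;
public class Demo {}
"""

print(sorted(used_apis(SAMPLE)))
```

Because this works on source text alone, it applies to projects that could not be built, which is the advantage named above.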

[Figure 2: scatter plot; x-axis: Project Size (in NCLOC), log scale from 1 to 1e+07; y-axis: Number of used known APIs, 0 to 30; series: Unbuilt projects, Built projects, Reference projects.]

Projects    Min   1st Q   Median   Mean    3rd Q   Max
Unbuilt     1     2       4        4.409   6       27
Built       1     3       4        4.692   6       23
Reference   1     4       6        6.937   8       20

Figure 2. Numbers of known APIs used in the projects; reference projects are plotted on top of built projects, which in turn are plotted on top of unbuilt projects.

Figure 2 shows the number of known APIs (y-axis) that are used in the projects ordered by NCLOC-based project size (x-axis). Unbuilt, built, and reference projects are distinguished. The listed maxima and quartiles give a sense of the API footprint in projects in the wild. The set of unbuilt projects exercises a higher maximum of used APIs than the set of built projects, potentially because of a correlation between the complexity of projects in terms of the number of involved APIs and the difficulty of building those projects.
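A summary row of the kind listed with Figure 2 (minimum, quartiles, mean, maximum) can be computed as follows. This is a sketch under assumptions: the paper does not say which quantile definition was used, so the "inclusive" method of Python's `statistics.quantiles` is chosen here for illustration, and the input list is made-up data, not the corpus.

```python
from statistics import mean, quantiles


def summary(values):
    """Min, quartiles, mean and max of a list of per-project counts."""
    data = sorted(values)
    # n=4 yields the three cut points between quarters; 'inclusive' treats
    # the data as the whole population (an assumption for this sketch).
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "Min": data[0],
        "1st Q": q1,
        "Median": med,
        "Mean": mean(data),
        "3rd Q": q3,
        "Max": data[-1],
    }


# Hypothetical numbers of used APIs for nine projects.
apis_per_project = [1, 2, 4, 4, 6, 27, 3, 5, 2]
for label, value in summary(apis_per_project).items():
    print(f"{label:7} {value}")
```

In this toy data set the single project with 27 APIs pulls the mean above the median, the same pattern the rows for Figure 2 show.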



Many distinct API methods


We also need to clarify how to measure usage of API methods. That is, how to precisely distinguish distinct methods so that counting uses is well defined. Particularly, in the case of instance method calls, the situation is complicated due to inheritance, overriding, and polymorphism. As a starting point, we may distinguish methods by possible receiver type, no matter whether the method is overridden or inherited at a given subtype. Then, a method call is counted towards the static receiver type in a call. Additionally, we may also count the call towards subtypes (subject to a polymorphism-based argument: the runtime receiver type may be a subtype) and supertypes (subject to an inheritance-based argument: the inherited implementation may be used, if not overridden). Such inclusion could also be made more precise by a global program analysis.

[Figure 3: scatter plot; x-axis: Project size (in MC), log scale from 1 to 1e+05; y-axis: Number of distinct API methods, log scale from 1 to 100000.]

Projects    Min   1st Q   Median   Mean    3rd Q   Max
All         1     94      199.5    370.7   423     10850
Reference   20    305.8   611      866.2   948.8   5351

Figure 3. Numbers of distinct API methods used in the projects (without distinguishing APIs).

Figure 3 shows the numbers of distinct API methods used in the built projects of the corpus; reference projects are highlighted. Methods on sub- and supertypes of static receiver types were not included. For simplification, we also considered overloaded methods as basically one method.
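The counting rule used for Figure 3 (attribute each call to its static receiver type only, and treat overloads as one method) can be sketched as follows. The record type `MethodCall` and the function `distinct_api_methods` are hypothetical names for this illustration; in the paper the call facts come from resolved ASTs, whereas here they are written out by hand.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class MethodCall(NamedTuple):
    static_receiver: str   # fully qualified static receiver type
    method: str            # method name
    param_types: tuple     # parameter types (distinguish overloads)


def distinct_api_methods(calls: Iterable[MethodCall], api_prefixes) -> Counter:
    """Count calls per (static receiver type, method name).

    Parameter types are ignored, so overloads collapse into one method;
    sub- and supertypes of the receiver are not credited.
    """
    counts = Counter()
    for call in calls:
        if any(call.static_receiver.startswith(p + ".") for p in api_prefixes):
            counts[(call.static_receiver, call.method)] += 1
    return counts


calls = [
    MethodCall("java.awt.Polygon", "contains", ("java.awt.Point",)),
    MethodCall("java.awt.Polygon", "contains", ("int", "int")),
    MethodCall("java.awt.Component", "contains", ("int", "int")),
    MethodCall("com.example.Foo", "bar", ()),   # not an API call
]

counts = distinct_api_methods(calls, api_prefixes=("java.awt",))
print(len(counts), "distinct API methods,", sum(counts.values()), "API calls")
```

The two `contains` overloads on `java.awt.Polygon` count as one distinct method, while `contains` on `java.awt.Component` is a separate one because the static receiver type differs.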

There is a trend of increasing API footprint with project size. Both axes are logarithmic, but project size grows more quickly than the count of distinct API methods. Most projects, even most of the largest ones, use fewer than 1,000 distinct API methods. As the table with maxima and quartiles shows, there are a few projects with exceptionally high counts. We have verified for these projects that they essentially implement or test large frameworks (such as ofbiz.apache.org). That is, these outliers embody large numbers of ‘self-calls’ for a large number of API methods.

4.2 API coverage by the corpus

An important form of API-usage analysis concerns API coverage; see, for example, the discussion of coverage in the API migration project of [3]. That is, coverage information is helpful in API migration as a means to prioritize efforts, and to leave out mapping rules for obscure parts of the API. Coverage information is also helpful in improving API usability [13, 12].

As is the case with other forms of API-usage analysis, API coverage may be considered for either a specific project, or, cumulatively, for all projects in a corpus. For instance, for any given



Low usage of API methods

API, we may be interested in the API types (classes and interfaces) that are exercised in projects by any means: extension, implementation, variable declaration, all kinds of method calls, and other, less obvious idioms (e.g., instance-of tests). At a more fine-grained level, we may be interested in the exercised members for each given API type. Hence, qualitative measurements focus on types and members that are exercised at all, while quantitative measurements rank usage by the number of occurrences of a type or a member or other weights.

Assuming a representative corpus, further assuming appropriate criteria for detecting coverage, we may start with the naive expectation that a good API should be covered more or less by the corpus. Otherwise, the API would contain de-facto unnecessary types and methods, which is clearly not in the interest of the API designer. However, it should not come as a surprise that, in practice, APIs are not covered very well: certainly not by single meaningful projects [3], but, as our results show, not even by a substantial corpus; see below.

We have actually tried to determine two simple coverage metrics for all 77 known APIs: i) a percentage-based metric for the types; ii) another percentage-based metric for all methods. However, we do not feel comfortable presenting a chart of those metrics for all known APIs here. Unless considerable effort is spent on each API, such a chart may be disputable. The challenge lies in the large number of projects and APIs, the different characteristics of the APIs (e.g., in terms of their use of subtyping), aspects of cloning, and yet other problems. Some of the issues will be demonstrated for selected APIs below.

Projects    Min     1st Q   Median  Mean    3rd Q   Max
All built   0.1629  1.954   2.769   4.861   3.746   59.93
Reference   1.954   2.891   3.664   3.441   3.95    4.56

Figure 4. Usage of JDOM’s distinct methods.

Let us investigate coverage for specific APIs. As our first target, we pick JDOM, a DOM-like (i.e., tree-based, in-memory) API for XML processing. We know that JDOM is a ‘true library’ as opposed to a framework. Regular client code should simply construct objects of the JDOM classes and invoke methods directly. We mention these characteristics because library-like APIs may be expected to show higher API coverage than framework-like APIs, if we measure coverage in terms of called methods, as we do here. In this paper, in the case of a call to an instance method, we only count the method on the immediate static receiver type as covered. We have checked that the inclusion of super- and subtypes, as discussed earlier, does not change the charts described below much.

Initially, we measured cumulative coverage for the methods of the JDOM API to be 68.89 %. We decided to study the contribution of the different projects. There are 86 projects with JDOM usage among the built projects of the corpus. Figure 4 shows the percentage-based coverage metrics for the methods of the JDOM API for those JDOM-using projects. The table with maxima and quartiles gives a good indication of the relatively low usage of the JDOM API.

Obviously, 3 projects stand out with their coverage. We found that these projects should not be counted towards cumulative coverage because these projects contain JDOM clones in source form. That is, the API calls within the API’s implementation imply artificial coverage for more than half of all JDOM methods. Without those outliers, the cumulative coverage is considerably lower, only 24.10 %.
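The effect of excluding clone-carrying projects on cumulative coverage can be illustrated with a small sketch. The function name `cumulative_coverage`, the project names, and the toy numbers are hypothetical; they only mirror the idea described above and are unrelated to the actual JDOM measurements.

```python
from typing import Dict, Iterable, Set

def cumulative_coverage(api_methods: Set[str],
                        calls_by_project: Dict[str, Set[str]],
                        exclude: Iterable[str] = ()) -> float:
    """Percentage of API methods called by at least one non-excluded project."""
    excluded = set(exclude)
    covered: Set[str] = set()
    for project, called in calls_by_project.items():
        if project not in excluded:
            # Only calls to methods of the API under study count as coverage.
            covered |= called & api_methods
    return 100.0 * len(covered) / len(api_methods)

# Toy data (hypothetical): a 10-method API and three client projects.
api = {f"m{i}" for i in range(10)}
calls = {
    "client_a": {"m0", "m1"},
    "client_b": {"m1", "m2"},
    # A project that ships the API sources, so its 'self-calls' inflate coverage.
    "with_clone": {"m0", "m3", "m4", "m5", "m6", "m7"},
}
all_cov = cumulative_coverage(api, calls)
clean_cov = cumulative_coverage(api, calls, exclude=["with_clone"])
print(all_cov, clean_cov)
```

In the toy data, the single clone-carrying project accounts for most of the covered methods, which is the same kind of distortion reported for the three outlier projects.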

Projects    Min     1st Q   Median  Mean    3rd Q   Max
All built   0.3268  0.3268  0.9804  2.22    2.614   27.12
Reference   0.3268  0.3268  1.144   2.369   3.023   11.44

Figure 5. Usage of SAX’ distinct methods.

Let us consider another specific API. We pick SAX, a push-based XML parsing API. The push-based characteristics imply that client code typically extends ‘handler’ classes or implements handler interfaces with handler methods such as startElement and endElement, to which XML-parsing events are pushed. As a result, one should be prepared to find relatively low API coverage, if we measure coverage in terms of called methods.

We measured cumulative coverage for the methods of the SAX API to be 50.98 %. This relatively high coverage was surprising. There are 310 projects with SAX usage among the built projects of the corpus. Figure 5 shows the percentage-based coverage metrics for the methods of the SAX API for those SAX-using projects. We found that three of the projects with the highest coverage were in fact the previously discussed projects with JDOM clones in source

Cumulative coverage of API methods is 24.10% (if we ignore JDOM sources).



Few subclass derivations for APIs

Fig. 3 shows the relative frequency of API-class extensions. The picture is entirely dominated by Swing’s types for GUIs. Many other API types are implemented or extended, but only with a marginal frequency.

Figure 3. Tag cloud of overridden API classes.

[Plot: number of calls (vertical axis, logarithmic, 1 to 1e+06) per method (horizontal axis, logarithmic, 1 to 1e+05); series: API, Non-API]

In [1], library reuse is studied at a level of shared objects in the operating systems Sun OS and Mac OS X. One of the observations is that reuse seems to be low in the sense of Zipf’s law. Thus the most frequent function will be referenced approximately twice as often as the second most frequent function, which occurs twice as often as the fourth most frequent function, etc. Fig. 4 shows the distribution of frequency of method calls in the corpus of this paper. Only 0.03% of methods are called more than 10 000 times, while 98.2% of methods are called less than 100 times. The plot suggests a Zipf-style distribution.
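The 'twice as often' reading of Zipf's law corresponds to a frequency that is inversely proportional to rank. A minimal sketch, assuming the classic exponent of 1; the function name `zipf_frequency` and the top frequency of 12,000 calls are made up for illustration and are not taken from the corpus.

```python
def zipf_frequency(top_frequency: float, rank: int, exponent: float = 1.0) -> float:
    """Expected frequency of the item at the given rank under Zipf's law.

    With exponent 1, frequency is inversely proportional to rank:
    rank 2 occurs half as often as rank 1, rank 4 half as often as rank 2.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return top_frequency / rank ** exponent

# Hypothetical top frequency; ranks 1, 2 and 4 as in the text.
expected = [zipf_frequency(12000, r) for r in (1, 2, 4)]
print(expected)
```

On doubly logarithmic axes such a distribution appears as a straight line, which is why a log-log plot is a common first check for a Zipf-style distribution.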

Figure 4. Frequency of calling API vs. non-API methods.

Only 7 APIs are used in more than 10 projects with “generalization”.



Few interface implementations for APIs

[Plot: ratio (vertical axis, 0 to 1) against project size in MC (horizontal axis, logarithmic, 1 to 1e+06)]

Projects    Min     1st Q   Median  Mean    3rd Q   Max
All         0.0205  0.4461  0.5551  0.5567  0.6724  1
Reference   0.1138  0.4332  0.5026  0.5015  0.5704  0.9969

Fig. 1 shows the usage of known API methods relative to all methods in a project, both in terms of calls. The smaller the ratio (the closer to zero), the lower the contribution of API calls. The quartiles show that in most projects, about every second method call is an API call. As far as instance-method calls are concerned, the figure distinguishes API vs. project-based method calls solely on the grounds of the static receiver type of methods.
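The ratio can be sketched as a simple classification of call sites by static receiver type. The function name `api_call_ratio`, the representation of a call as the name of its static receiver type, and the toy type names are assumptions for illustration.

```python
from typing import Iterable, Set

def api_call_ratio(receiver_types: Iterable[str], api_types: Set[str]) -> float:
    """Ratio of API method calls to all method calls in a project.

    Each call is represented only by its static receiver type; a call counts
    as an API call when that type belongs to a known API.
    """
    total = 0
    api_calls = 0
    for receiver in receiver_types:
        total += 1
        if receiver in api_types:
            api_calls += 1
    if total == 0:
        raise ValueError("project has no method calls")
    return api_calls / total

# Toy data (hypothetical): two API calls and two project-internal calls.
known_api_types = {"java.util.List", "org.jdom.Element"}
project_calls = ["java.util.List", "org.jdom.Element",
                 "com.example.Order", "com.example.Order"]
ratio = api_call_ratio(project_calls, known_api_types)
print(ratio)
```

A ratio of 0.5 corresponds to the 'about every second call is an API call' situation that the quartiles indicate for most projects.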

Figure 1. Ratio of API method calls to all method calls.

Fig. 2 shows the relative frequency of API-interface implementations for all the known APIs (including Core APIs). The picture is dominated by AWT handler types, the interface for iterators, and a few XML-related types.

Figure 2. Tag cloud of implemented API interfaces.

Only 7 APIs are used in more than 10 projects with “generalization”.



Published in WCRE’12

A Framework Profile of .NET

Ralf Lämmel, Rufus Linke, Ekaterina Pek, and Andrei Varanovich
Software Languages Team & ADAPT Lab
Universität Koblenz-Landau, Germany

Abstract: We develop a basic form of framework comprehension which is based on simple, reuse-related metrics for the as-implemented design and usage of frameworks. To this end, we provide a framework profile which incorporates potential reuse characteristics (e.g., specializability of types in a framework) as well as actual reuse characteristics (e.g., evidence of specialization of framework types in projects). We apply framework comprehension in an empirical study of the Microsoft .NET Framework. The approach is helpful in several contexts of software reverse and re-engineering.

Keywords: framework, .NET, framework design, framework usage, framework profile, reuse, type specialization, late binding, polymorphism, inheritance, program comprehension, software metrics, dynamic program analysis.

I. INTRODUCTION

Suppose you need to (better) understand the architecture of a platform such as the Java Standard Edition¹, the Microsoft .NET Framework², or another composite framework. There is no silver bullet for such framework comprehension, but a range of models may be useful in this context. The present paper describes the notion of framework profile which incorporates characteristics of potential and actual reuse of frameworks. The approach is applied to the Microsoft .NET Framework and a corpus of .NET projects in an empirical study.

Framework comprehension supports reverse and re-engineering activities. In quality assessment of designs [1], framework profiles help understanding frameworks in a manner complementary to architectural smells, patterns, or anti-patterns. Also, one can compare given projects with the framework profile. More specifically, in framework re-modularization, framework profiles help summarizing the status of modularization and motivating refactorings [2]. In API migration [3], framework profiles help assessing API replacement options with regard to, for example, different extensibility characteristics. Finally, framework profiles help in teaching OO architecture, design, and implementation [4].

A framework profile is illustrated in Figure 1. Reuse-related properties are depicted for a number of .NET namespaces and open-source projects; the names are elided here. The infographic is derived from the results of static and dynamic program analysis. The leftmost column displays a metric for the percentage of specializable types (i.e., non-sealed, non-static classes and all interfaces) by using bullets of increasing size. (This approach will be defined more precisely later on.) The bulk of the columns classify actual reuse of namespaces by projects as follows: ‘–’ – the namespace was not available in the framework version used by the project; ‘�’ – the namespace is referenced; ‘⇥’ – the namespace is even specialized by the project (i.e., there is a project type that extends or implements a type of the namespace); ‘�’ – late binding involves the namespace (i.e., there is a project type that acts as runtime receiver type for a call with a static receiver type from the namespace). The rightmost column summarizes actual reuse by means of the dominating classifier, if any, for each row.

Overall, the figure contrasts potential with actual reuse at a high level of abstraction.

[Figure: matrix of reuse markers, namespaces (rows) by projects (columns)]
Rows: top-10 .NET namespaces, in terms of number of types.
Middle block of columns: actual reuse by project of the corpus.
Leftmost column: potential reuse in terms of specializability.
Rightmost column: summary of actual reuse.

Figure 1. Infographics for an excerpt of a framework profile for .NET

1 http://www.oracle.com/us/javase
2 http://www.microsoft.com/net/

Summary of contributions³

• We use metrics to analyze potential and actual framework reuse for some limited forms of reuse. To this end, we leverage static and dynamic program analysis.
• The metrics also help classifying composite frameworks with regard to reuse so that namespaces can be associated with categories that describe reuse characteristics concisely.
• We describe an empirical study for the Microsoft .NET framework and a corpus of .NET projects such that reuse characteristics add up to a framework profile for .NET.

Road-map

§ II describes our methodology. § III explores reuse-related metrics for frameworks. § IV proposes a classification of frameworks. § V studies actual framework reuse. § VI identifies threats to validity. § VII discusses related work. § VIII concludes the paper.

3 The paper’s web site, http://softlang.uni-koblenz.de/dotnet, provides support material for the empirical study.







[Figure: framework-profile infographic; columns: Projects; rows: Namespaces]

Patterns of API usage


Type specialization, Dynamic dispatch, Usage


• 17 projects from different repositories

• well known, widely used, mature

• amenable to dynamic analysis

• API = .NET namespace

• 401 in total

• 69 after grouping
















.NET  Project                           Repository   LOC      Description
3.5   Castle ActiveRecord               GitHub       30,303   Object-relational mapper
4.0   Castle Core Library               GitHub       36,659   Core library for the Castle framework
3.5   Castle MonoRail                   GitHub       58,121   MVC Web framework
4.0   Castle Windsor                    GitHub       50,032   Inversion of control container
4.0   Json.NET                          Codeplex     43,127   JSON framework
2.0   log4net                           Sourceforge  27,799   Logging framework
2.0   Lucene.Net                        Apache.org   158,519  Search engine
4.0   Managed Extensibility Framework   Codeplex     149,303  Framework for extensible applications and components
4.0   Moq                               GoogleCode   17,430   Mocking library
2.0   NAnt                              Sourceforge  56,529   Build tool
3.5   NHibernate                        Sourceforge  330,374  Object-relational mapper
3.5   NUnit                             Launchpad    85,439   Unit testing framework
4.0   Patterns & Practices - Prism      Codeplex     146,778  Library to build flexible WPF and Silverlight applications
3.5   RhinoMocks                        GitHub       23,459   Mocking framework
2.0   SharpZipLib                       Sourceforge  25,691   Compression library
2.0   Spring.NET                        GitHub       183,772  Framework for enterprise applications
2.0   xUnit.net                         Codeplex     23,366   Unit testing framework

Table I. .NET projects in study’s corpus (versions as of 19 June 2011)

II. METHODOLOGY

A. Research hypothesis

Platforms such as JSE or .NET leverage programming language concepts in a systematic manner to make those frameworks reusable (say, extensible, instantiatable, or configurable). It is challenging to understand the reuse characteristics of frameworks and actual reuse in projects at a high level of abstraction. Software metrics on top of simple static and dynamic program analysis are useful to infer essential high-level reuse characteristics.

B. Research questions

1) What are the interesting and helpful high-level characteristics of frameworks with regard to their potential and actual reuse?

2) To what extent can those characteristics be computed with simple metrics subject to simple static and dynamic program analysis?

C. Research method

We applied an explorative approach such that a larger set of metrics of mainly structural properties was incrementally screened until a smaller set of key metrics and derived classifiers emerged. We use infographics (such as Figure 1) to visualize metrics, classifiers, and other characteristics of frameworks and projects that use them. The resulting claims are subject to validation by domain experts for the framework under study.

D. Study subject

The subject of study consists of the Microsoft .NET Framework and a corpus of open-source .NET projects targeting different versions of .NET (2.0, 3.5, 4.0).

.NET (4.0) has 401 namespaces in total, but we group these namespaces reasonably, based on the tree-like organization of their compound names. For instance, all namespaces in the System.Web branch provide web-related functionality and can be viewed as a single namespace. In this manner, we obtained the manageable number of 69 namespaces; see Table II.⁴ In the rest of the paper, we signify grouping by “*” as in System.Web.*. Grouping is often used in discussions of .NET, also by Microsoft.⁵
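Grouping by the tree-like structure of compound names can be illustrated with a prefix-based sketch. The paper does not spell out its exact grouping rule, so the fixed prefix depth of two name segments, the function name `group_namespaces`, and the sample namespace list are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def group_namespaces(namespaces: Iterable[str], depth: int = 2) -> Dict[str, List[str]]:
    """Group namespaces by the first `depth` segments of their compound name.

    A group that contains sub-namespaces is labelled with a trailing ".*",
    following the notation used for grouped namespaces such as System.Web.*.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for namespace in namespaces:
        prefix = ".".join(namespace.split(".")[:depth])
        groups[prefix].append(namespace)
    labelled: Dict[str, List[str]] = {}
    for prefix, members in groups.items():
        has_subnamespaces = any(member != prefix for member in members)
        label = prefix + ".*" if has_subnamespaces else prefix
        labelled[label] = sorted(members)
    return labelled

# Toy input (a handful of real .NET namespace names, chosen for illustration).
sample = ["System.Web", "System.Web.UI", "System.Web.Mvc",
          "System.Collections", "System.Xml", "System.Xml.Linq"]
grouped = group_namespaces(sample)
print(sorted(grouped))
```

A fixed depth is the simplest possible rule; a manual, per-branch decision, as the word 'reasonably' suggests, would need a lookup table instead.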

Table I collects metadata about the corpus of the study. The following text summarizes the requirements for the corpus and the process of its accumulation; more information is available from the paper’s website.

One requirement is that the corpus is made up from well-known, widely-used and mature projects. We assume that such projects make good use of .NET.

Another requirement is that dynamic analysis must be feasible for the projects of the corpus. This requirement implies practically that we need projects with good available testsuites. The need for testsuites, in turn, implies practically that the corpus is made up from frameworks or libraries as opposed to, e.g., interactive tools. Admittedly, advanced test-data generation approaches could be used instead [5].

Yet another requirement is that the corpus is made up from open-source projects so that our results are more easily reproducible. Also, the instrumentation for static and dynamic analysis would be problematic for proprietary projects which usually commit to signed assemblies.

We searched CodePlex, SourceForge, GitHub, and Google Code applying the repository-provided ranking for popularity (preferably based on downloads). For the topmost approx. 30 projects of each repository we checked all the requirements, and in this manner we identified a diverse set of projects as shown in Table I. These projects all use C# as implementation language. (In principle, our approach is prepared to deal with other .NET languages as well, since the analysis uses bytecode engineering.)

4 We also excluded some namespaces that are fully marked as obsolete and an auxiliary namespace, XamlGeneratedNamespace, used only by the workflow designer tool.

5 http://msdn.microsoft.com/en-us/library/gg145045.aspx

Research questions

















Research method







Corpus of the .NET study








Columns: Namespace, # Types, # Methods, MAX size class tree, MAX size interface tree, # Referenced namespaces, # Referring namespaces, % Classes, % Interfaces, % Value types, % Delegate types, % Generic types, % Class arguments, % Interface arguments, % Value type arguments, % Delegate arguments, % Sealed classes, % Specializable classes, % Specializable types, % Orphan classes, % Orphan interfaces, % Orphan types

System.Web.* 2327 29315 • • 43 • • • • • • • • • • • • • • •
System.Windows.* • • • 82 • • • • • • • • • • • • • • • • •
System.Runtime • • • • • • 100 • • •






“Potential reuse”




Namespace categories with regard to ‘inter-namespace reuse’:
• application if # Referring namespaces = 0.
• core if # Referring namespaces is ‘exceptional’.
Namespace categories with regard to ‘specializability’:
• open if % Specializable types is ‘exceptional’.
• closed if % Sealed classes is ‘exceptional’.
• incomplete if % Orphan types is ‘exceptional’.
Namespace categories with regard to ‘class-inheritance trees’:
• branched if MAX size class tree is ‘exceptional’.
• flat if MAX size class tree = 0.
Namespace categories with regard to ‘intensiveness’:
• interface-intensive if % Interface arguments is ‘exceptional’.
• delegate-intensive if % Delegate arguments is ‘exceptional’.
A sub-category for delegate-intensive namespaces:
• event-based if % Delegate types is ‘exceptional’.

Occurrences of ‘exceptional’ are essentially configurable. In this paper, we assume though that “x is ‘exceptional’ for a namespace” proxies for the statement that the metric x for the given namespace is in the [75, 100) percentage interval with regard to the distribution for metric x over all namespaces.

Figure 2. Definition of (non-mutually exclusive) categories
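The category definitions can be sketched as a small classifier. This is an illustration under stated assumptions: ‘exceptional’ is read here as "at or above the third quartile of the metric over all namespaces", only a subset of the categories is covered, and the metric keys, namespace names, and numbers are hypothetical rather than taken from Table II.

```python
from statistics import quantiles
from typing import Dict, Sequence, Set

def is_exceptional(value: float, distribution: Sequence[float]) -> bool:
    """Assumed reading of 'exceptional': value is at or above the 3rd quartile."""
    third_quartile = quantiles(distribution, n=4, method="inclusive")[2]
    return value >= third_quartile

def categorize(name: str, metrics: Dict[str, Dict[str, float]]) -> Set[str]:
    """Assign (non-mutually exclusive) categories to one namespace."""
    own = metrics[name]

    def exceptional(metric: str) -> bool:
        return is_exceptional(own[metric], [row[metric] for row in metrics.values()])

    categories: Set[str] = set()
    if own["referring_namespaces"] == 0:
        categories.add("application")
    if exceptional("referring_namespaces"):
        categories.add("core")
    if exceptional("pct_specializable_types"):
        categories.add("open")
    if exceptional("pct_sealed_classes"):
        categories.add("closed")
    if own["max_class_tree"] == 0:
        categories.add("flat")
    if exceptional("max_class_tree"):
        categories.add("branched")
    return categories

# Hypothetical metric values for four made-up namespaces.
toy = {
    "A": {"referring_namespaces": 0, "pct_specializable_types": 80,
          "pct_sealed_classes": 5, "max_class_tree": 0},
    "B": {"referring_namespaces": 10, "pct_specializable_types": 40,
          "pct_sealed_classes": 20, "max_class_tree": 3},
    "C": {"referring_namespaces": 30, "pct_specializable_types": 50,
          "pct_sealed_classes": 60, "max_class_tree": 12},
    "D": {"referring_namespaces": 5, "pct_specializable_types": 30,
          "pct_sealed_classes": 10, "max_class_tree": 2},
}
result = {name: categorize(name, toy) for name in toy}
print(result)
```

Because the threshold is computed per metric over all namespaces, changing the quartile (the configurable part of ‘exceptional’) only requires changing `is_exceptional`.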

further views (available in the online version of the paper) to develop intuitions about reuse-related characteristics of namespaces. The following classification only uses some of the metrics directly, but the other metrics are useful for understanding and validation.

IV. CLASSIFICATION OF FRAMEWORKS

In the following, we use the reuse-related metrics to define categories for reuse characteristics of frameworks (in fact, namespaces). See Figure 2 for the concise definition of the categories. See Table III for the application of the classification to a few .NET namespaces that serve as representatives in this section. The section is finished with considerations of validation.

A. Derivation of the categories

Let us start with ‘inter-namespace reuse’. An application namespace is characterized by the lack of other namespaces referring to it. That is, no reuse potential is realized for the given namespace within the composite framework. Instead of namespaces with zero referring namespaces, we may also consider namespaces with the most referring namespaces. These are called core namespaces for obvious reasons.

As the medians and other percentiles at the bottom of Table II indicate, inter-namespace usage is very common for .NET. (The appendix of the online version even shows substantial mutual dependencies.) The following are examples of application namespaces. The System.AddIn.* namespace provides a generic framework for framework plug-ins in the sense of client frameworks on top of .NET. The Microsoft.VisualC.* namespace supports compilation and code generation for C++. The System.Device.Location namespace allows application developers to access the computer’s location. The System.Runtime.ExceptionServices namespace supports advanced exception handling for applications.

Columns: Namespace, Application, Core, Open, Closed, Incomplete, Branched, Flat, Interface-intensive, Delegate-intensive, Event-based

Rows (one mark per assigned category):
System.Web.* � � �
System.Data.* � �
System.Activities.* � � �
System.ComponentModel.* � � � � � �
System.Xml.* �
System.DirectoryServices.* � �
System.EnterpriseServices.* � �
System.CodeDom.* � �
System.Linq.* � � �
System.AddIn.* � � � � �
Microsoft.VisualC.* � � � � �
System.Transactions.* � � � �
System.Collections � � �
System.Runtime.Caching.* � � �
System.Device.Location � � � �
System.Runtime.ExceptionServices � �

Table III. Classification of selected .NET namespaces (See the online version for additional data.)

Perhaps the most obvious representative of a core namespace is System.Collections as it provides collection types very much like a library for basic datatypes. Starting at the top of Table II, the largest core namespace is System.ComponentModel.* with its fundamental support for implementing the run-time and design-time behavior of components and controls. The next core namespace is System.Xml.* with various APIs for XML processing.

Let us consider ‘specializability’. We speak of an open namespace when the percentage of specializable types is ‘exceptional’. We speak of a closed namespace when the percentage of sealed classes is ‘exceptional’. It will be interesting to see whether open namespaces are subject to ‘more’ specialization in projects than non-open (or even closed) namespaces. In any case, it is helpful to understand which namespaces come wide open and which namespaces limit specialization explicitly. In this context, another category emerges. We speak of an incomplete namespace when the percentage of orphan types is ‘exceptional’.

Starting at the top of Table II, the largest open namespace is System.DirectoryServices.*; it models entities in a network (such as users and printers) and it supports common tasks (such as adding users and setting permissions). The next open namespace is System.CodeDom.*; it models an abstract syntax of .NET languages. These namespaces provide rich inheritance hierarchies that are left open for specialization by other frameworks or client code. We mention that System.DirectoryServices.* is not specialized within the .NET Framework itself while System.CodeDom.* is specialized by several namespaces. Basic knowledge of .NET suggests that CodeDom is specialized by namespaces that host ‘CodeDom providers’ and regular projects are actually not very likely to contain additional providers.

The largest closed namespace is System.Data.*; it supports data access and management for diverse sources

Namespace categories


Namespace categories with regard to ‘inter-namespace reuse’:• application if # Referring namespaces = 0.• core if # Referring namespaces is ‘exceptional’.Namespace categories with regard to ‘specializability’:• open if % Specializable types is ‘exceptional’.• closed if % Sealed classes is ‘exceptional’.• incomplete if % Orphan types is ‘exceptional’.Namespace categories with regard to ‘class-inheritance trees’:• branched if MAX size class tree is ‘exceptional’.• flat if MAX size class tree = 0.Namespace categories with regard to ‘intensiveness’:• interface-intensive if % Interface arguments is ‘exceptional’.• delegate-intensive if % Delegate arguments is ‘exceptional’.A sub-category for delegate-intensive namespaces:• event-based if % Delegate types is ‘exceptional’.

Occurrences of ‘exceptional’ are essentially configurable. In thispaper, we assume though that “x is ‘exceptional’ for a namespace”proxies for the statement that the metric x for the given namespaceis in the [75, 100) percentage interval with regard to the distributionfor metric x over all namespaces.

Figure 2. Definition of (non-mutually exclusive) categories

further views (available in the online version of the paper)to develop intuitions about reuse-related characteristics ofnamespaces. The following classification only uses someof the metrics directly, but the other metrics are useful forunderstanding and validation.

IV. CLASSIFICATION OF FRAMEWORKS

In the following, we use the reuse-related metrics to definecategories for reuse characteristics of frameworks—in fact,namespaces. See Figure 2 for the concise definition of thecategories. See Table III for the application of the classifica-tion to a few .NET namespaces that serve as representativesin this section. The section is finished with considerationsof validation.

A. Derivation of the categoriesLet us start with ‘inter-namespace reuse’. An application

namespace is characterized by the lack of other namespacesreferring to it. That is, no reuse potential is realized for thegiven namespace within the composite framework. Insteadof namespaces with zero referring namespaces, we may alsoconsider namespaces with the most referring namespaces.These are called core namespaces for obvious reasons.

As the medians and other percentiles at the bottom ofTable II indicate, inter-namespace usage is very commonfor .NET. (The appendix of the online version even showssubstantial mutual dependencies.) There are these applica-tion namespaces. The System.AddIn.* namespace provides ageneric framework for framework plug-ins in the sense ofclient frameworks on top of .NET. The Microsoft.VisualC.*namespace supports compilation and code generation forC++. The System.Device.Location namespace allows appli-cation developers to access the computer’s location. TheSystem.Runtime.ExceptionServices namespace supports ad-vanced exception handling for applications.

Namespace App

licat

ion

Cor

e

Ope

n

Clo

sed

Inco

mpl

ete

Bran

ched

Flat

Inte

rfac

e-in

tens

ive

Del

egat

e-in

tens

ive

Even

t-bas

ed

System.Web.* � � �System.Data.* � �System.Activities.* � � �System.ComponentModel.* � � � � � �System.Xml.* �System.DirectoryServices.* � �System.EnterpriseServices.* � �System.CodeDom.* � �System.Linq.* � � �System.AddIn.* � � � � �Microsoft.VisualC.* � � � � �System.Transactions.* � � � �System.Collections � � �System.Runtime.Caching.* � � �System.Device.Location � � � �System.Runtime.ExceptionServices � �

Table IIIClassification of selected .NET namespaces(See the online version for additional data.)

Perhaps the most obvious representative of a core name-space is System.Collections as it provides collection typesvery much like a library for basic datatypes. Starting atthe top of Table II, the largest core namespace is Sys-tem.ComponentModel.* with its fundamental support forimplementing the run-time and design-time behavior ofcomponents and controls. The next core namespace is Sys-tem.Xml.* with various APIs for XML processing.

Let us consider ‘specializability’. We speak of an open namespace when the percentage of specializable types is ‘exceptional’. We speak of a closed namespace when the percentage of sealed classes is ‘exceptional’. It will be interesting to see whether open namespaces are subject to ‘more’ specialization in projects than non-open (or even closed) namespaces. In any case, it is helpful to understand which namespaces come wide open and which namespaces limit specialization explicitly. In this context, another category emerges. We speak of an incomplete namespace when the percentage of orphan types is ‘exceptional’.
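The three categories above can be sketched as a simple threshold check. This is only an illustration under stated assumptions: the excerpt does not say how ‘exceptional’ is determined, so the cut-off EXCEPTIONAL_PERCENT, the common denominator for all three percentages, and all names below are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

final class NamespaceClassifier {
    enum Category { OPEN, CLOSED, INCOMPLETE }

    // Hypothetical cut-off: the text only says 'exceptional', not how it is computed.
    static final double EXCEPTIONAL_PERCENT = 75.0;

    static double percent(int part, int total) {
        return total == 0 ? 0.0 : 100.0 * part / total;
    }

    // Assigns every category whose underlying percentage reaches the assumed cut-off.
    static Set<Category> classify(int types, int specializable, int sealedCount, int orphan) {
        Set<Category> result = EnumSet.noneOf(Category.class);
        if (percent(specializable, types) >= EXCEPTIONAL_PERCENT) result.add(Category.OPEN);
        if (percent(sealedCount, types) >= EXCEPTIONAL_PERCENT) result.add(Category.CLOSED);
        if (percent(orphan, types) >= EXCEPTIONAL_PERCENT) result.add(Category.INCOMPLETE);
        return result;
    }

    public static void main(String[] args) {
        // Toy counts, not taken from the study.
        System.out.println(classify(100, 90, 5, 0));
        System.out.println(classify(40, 4, 35, 2));
    }
}
```

A namespace that reaches none of the cut-offs simply receives no category, which matches the idea that most namespaces are ‘non-exceptional’.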

Starting at the top of Table II, the largest open namespace is System.DirectoryServices.*; it models entities in a network (such as users and printers) and it supports common tasks (such as adding users and setting permissions). The next open namespace is System.CodeDom.*; it models an abstract syntax of .NET languages. These namespaces provide rich inheritance hierarchies that are left open for specialization by other frameworks or client code. We mention that System.DirectoryServices.* is not specialized within the .NET Framework itself while System.CodeDom.* is specialized by several namespaces. Basic knowledge of .NET suggests that CodeDom is specialized by namespaces that host ‘CodeDom providers’ and regular projects are actually not very likely to contain additional providers.

The largest closed namespace is System.Data.*; it supports data access and management for diverse sources ...

Classification of selected namespaces


Columns: Namespace; # Types; % Specializable types; one column per project (ActiveRecord, CastleCore, MonoRail, Windsor, Json.NET, log4net, MEF, Moq, NAnt, NHibernate, NUnit, Prism, Rhino.Mocks, Spring.NET, xUnit, SharpZipLib, Lucene.Net); Dominator; % Referenced OO types; % Specializable types (rel.); % Specialized types (rel.); % Late-bound types (rel.).

Namespace rows, in table order (the per-cell marks are not recoverable): System.Web.* (2327 types), System.Windows.*, System.ServiceModel.*, System.Windows.Forms.*, System.Data.*, System.ComponentModel.*, System.Xml.*, System.Net.*, System, System.Security.Cryptography.*, System.Runtime.InteropServices.*, Microsoft.VisualBasic.*, System.Drawing.*, System.Runtime.Remoting.*, System.Configuration.*, System.Diagnostics.*, System.IO.*, System.Reflection.*, System.EnterpriseServices.*, System.CodeDom.*, Microsoft.Build.*, System.Threading.*, System.Runtime.Serialization.*, System.Security.Permissions, System.Runtime.CompilerServices, System.Linq.*, System.Messaging.*, Microsoft.Win32.*, System.Security.Policy, System.Globalization, System.Transactions.*, System.Security, System.Collections.Generic, System.Text, System.Collections, System.ServiceProcess.*, System.Resources.*, System.Security.Principal, System.Collections.Specialized, System.Text.RegularExpressions, System.Runtime.Versioning, System.Collections.ObjectModel, Microsoft.CSharp.*, System.Timers.

Per-project summary rows:

Project        Framework   # Referenced types   # Specialized types   # Late bound types
ActiveRecord   3.5         137                  16                    6
CastleCore     4.0         301                  39                    7
MonoRail       3.5         245                  28                    9
Windsor        4.0         229                  26                    10
Json.NET       4.0         277                  27                    11
log4net        2.0         229                  20                    3
MEF            4.0         201                  18                    6
Moq            4.0         174                  13                    4
NAnt           2.0         375                  26                    10
NHibernate     3.5         374                  39                    12
NUnit          3.5         437                  31                    20
Prism          4.0         213                  29                    11
Rhino.Mocks    3.5         135                  10                    8
Spring.NET     2.0         604                  73                    22
xUnit          2.0         308                  19                    5
SharpZipLib    2.0         113                  11                    2
Lucene.Net     2.0         193                  26                    8

Percentiles over namespaces:

          # Types   % Specializable types   % Referenced OO types   % Specializable types (rel.)   % Specialized types (rel.)   % Late-bound types (rel.)
75 %      235       89                      33                      92                             33                           8
Median    80        73                      20                      75                             6                            0
25 %      36        54                      12                      50                             0                            0

Table IV. Infographics for comparing potential and actual reuse for .NET (See the online version for additional data.)

low (high resp.) specialization is not predicted by low (high resp.) specializability in any obvious sense.

Most namespaces are actually referenced by enough projects to get assigned an actual reuse summary in the form of a dominator. This suggests that the projects of the corpus indeed share a ‘profile’ in an informal sense.


“Actual reuse”



Data interpretation

Let us compare potential reuse in terms of specializability with actual reuse in terms of the dominator. There are eight namespaces with dominator ‘⇥’ or ‘�’. Half of these namespaces contribute to the System.Collections.* hierarchy and the associated specializability is ‘exceptional’. However, specializability is ‘non-exceptional’ for the remaining cases; specializability is, in fact, in the percentage interval (0,25) for two cases; see namespaces System.Configuration.* and System.Runtime.Serialization.*. This observation further confirms that high specialization is not predicted by high specializability in any obvious sense.

VI. THREATS TO VALIDITY

There are the following threats to internal validity. We use homegrown tools in the study, especially for bytecode instrumentation, for the analysis of .NET design and usage. More subtly, there are threats due to the model underlying our research. First, while investigating potential and actual .NET reuse, we focus on type specialization, even though frameworks might also be configured via attributes (i.e., annotations) or XML files. This applies to a number of .NET namespaces. Second, we observe late binding based solely on the calls from client code to the framework, while it might also be the case that the framework calls into the client code through callbacks. Further, the analysis of late binding relies on the runtime data gathered from the test-suite execution. Coverage of method-call sites is incomplete; the tests do not cover 38.96 % of the method-call sites in the projects of the study.

The major threat to external validity is that, although we systematically collected our corpus, the generalization of the results might be biased because of the corpus’ size and content as well as the selection criteria.

VII. RELATED WORK

Software metrics are leveraged in our work for exploring reuse characteristics and the alignment between potential and actual reuse. Elsewhere, metrics are typically used to understand maintainability [7] or quality of the code and design [8], [1], [9]. There is also a trend to analyze the distribution characteristics for metrics and the correlation between different metrics [10], [11]. In the context of OO programming, work on metrics typically focuses on Java; the work of [12] targets .NET with a few basic metrics without focus on reuse.

Type specialization (including class and interface inheritance, interface implementation, overriding) is at the center of attention in our work; there is related work that studies related metrics, though without the objective of summarizing reuse characteristics at a high level of abstraction. The work of [13] studies structural metrics of Java bytecode; some reuse-related measurements are covered, too, e.g., the number of types that inherit from external framework types, or the most implemented external interfaces. The work of [14], [15] focuses on metrics for inheritance and overriding for Java, and it shows, for example, that programmers extend user-defined types more often than external library or framework types. In those works, depth of inheritance trees is considered relevant whereas our metrics-based approach favored size of inheritance trees since we are interested in the number of types participating in specialization. The work of [16] analyzes instantiations of frameworks (Eclipse UI, JHotDraw, Struts), though for the purpose of detecting usage changes in the course of framework evolution. None of the aforementioned efforts involve dynamic analysis.

Static analysis of API or framework usage often addresses reuse-related concerns, which are, however, complementary to our notion of framework profile. The work of [17] leverages metrics to determine the popularity of the Eclipse API. Research on API popularity in Java is also presented in [18]; the authors analyze import statements in open-source software to detect and predict changes in usage of APIs over time. The works [19] (co-authored by two of the present authors) and [20] analyze popularity of the Java standard API in several dimensions. The work of [21] analyzes API usage in Java applications and corresponding, ported C# applications to help with automated migration. There is substantial interest in analyzing API usage with regard to usage patterns; see, for example, [22]. Usage patterns and our framework profiles provide very different abstraction levels for reuse-related models.

Dynamic usage analysis is leveraged in our work to discover late-bound framework types. The resulting combination of static and dynamic analysis is also encountered elsewhere [23], [24]. These efforts are relevant in so far as they inspired our approach (specifically, our implementation) for aligning static and dynamic receiver types. In particular, the work of [23] deals with the dynamic measurement of polymorphism in Java and interprets it from a reuse-oriented point of view. Bytecode is instrumented and runtime receiver types are determined by accessing the virtual machine’s stack, similar to our approach. This work, though, is not focused on reuse of a composite framework.
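The idea of aligning static and dynamic receiver types can be illustrated with a small Java sketch. It does not reproduce the instrumentation described above: the observed receivers are passed in explicitly instead of being read from the virtual machine’s stack, and the class and method names are hypothetical. A runtime type is treated as late-bound here when it is a proper subtype of the static receiver type at the call site.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

final class LateBindingSketch {
    // Given the static receiver type of a call site and the receivers observed at
    // run time, return the runtime types that are proper subtypes of the static type.
    static Set<Class<?>> lateBoundTypes(Class<?> staticType, List<?> receivers) {
        Set<Class<?>> result = new LinkedHashSet<>();
        for (Object receiver : receivers) {
            Class<?> runtimeType = receiver.getClass();
            // Late binding: the call dispatches to a proper subtype of the static type.
            if (runtimeType != staticType && staticType.isAssignableFrom(runtimeType)) {
                result.add(runtimeType);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated observations for a call site with static receiver type List.
        List<Object> receivers = new ArrayList<>();
        receivers.add(new ArrayList<String>());
        receivers.add(new LinkedList<String>());
        System.out.println(lateBoundTypes(List.class, receivers));
    }
}
```

If the static and runtime types coincide, the result is empty, i.e., no late binding is recorded for that call site.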

VIII. CONCLUSION

We presented a new approach to understanding reuse characteristics of composite frameworks such as JSE or .NET. We applied the approach in an empirical study to .NET and a suitable corpus of .NET projects. The reuse characteristics include metrics of potential reuse (such as the percentage of specializable types), categories related to reuse (such as open or closed namespaces), and metrics of actual reuse (such as the percentage of specialized types). These metrics and the classification add up to what we call a framework profile. Infographics can be used to provide different views on framework profiles.

Future work needs to address issues of generality mentioned in §VI. That is, other forms of framework reuse (in particular, configuration) should be investigated, i.e., forms that do not use basic OO facets. Another important direction concerns partitioning of frameworks into relevant sub-frameworks. Such partitioning will make classification more useful. Also, partitioning will identify different roles of sub-frameworks more clearly for developers.


For each class, the number of subclasses in the corpus is shown.

Figure 9. Top 30 .NET classes inherited in the corpus

For each interface, the number of implementing classes in the corpus is shown. Note: the full bar counts all implementations whereas the black part excludes classes that can be reliably classified as being compiler-generated.

Figure 10. Top 30 .NET interfaces implemented in the corpus

Top-30 inherited .NET classes


Top-30 implemented .NET interfaces



Published in ICPC’13

Multi-dimensional exploration of API usage

Coen De Roover, Software Languages Lab, Vrije Universiteit Brussel
Ralf Lämmel, Software Languages Team, University of Koblenz-Landau
Ekaterina Pek, ADAPT Lab


Abstract: This paper is concerned with understanding API usage in a systematic, explorative manner for the benefit of both API developers and API users. There exist complementary, less explorative methods based on code search or API documentation. In contrast, our approach is highly interactive and can be seen as an extension of what IDEs readily provide today. Exploration is based on multiple dimensions: i) the hierarchically organized scopes of projects and APIs; ii) metrics of API usage (e.g., number of project classes extending API classes); iii) metadata for APIs; iv) project- versus API-centric views. We also provide the QUAATLAS corpus of Java projects which enhances the existing QUALITAS corpus to enable API-usage analysis. We implemented the exploration approach in an open-source, IDE-like, Web-enabled tool EXAPUS.

Index Terms: API usage. Code exploration. Metadata. Program comprehension. Reverse engineering. QUAATLAS. QUALITAS. EXAPUS.

I. INTRODUCTION

The use (and the design) of APIs is an integral part of OO software development. Projects are littered with usage of easily a dozen APIs; perhaps every third line of code references some API [1]. Accordingly, understanding APIs or their usage must be an important objective. Much of the existing work, as discussed in detail in §III, focuses on some form of documentation or discovery of API-usage scenarios, perhaps by code completion or code search [2], [3], [4], [5].

In our work on API migration [6], [7], we have always missed a suitable exploration-based approach to understanding API usage in a systematic manner. In this paper, we describe such a form of exploration, which is also informed by query-based program understanding [8], [9]. We specifically serve API-usage exploration with certain expected insights in mind. We serve both API developers and API users (i.e., project developers) who need to understand API usage in different ways.

Contributions: We identify abstract exploration insights as they are expected by API developers and project developers with regard to their overall intention to understand API usage. These expected insights rely on multiple dimensions of exploration, e.g., hierarchical organization of scopes and project- versus API-centric perspectives. Existing methods such as code completion and searching API documentation do not serve these insights.

We set up QUAATLAS (for QUALITAS API Atlas), a Java-based corpus for API-usage analysis that builds on top of the existing QUALITAS corpus while revising it substantially such that fact extraction can be applied with the level of precision required for API-usage analysis, while also adding metadata that supports exploration and records knowledge about APIs.

We provide conceptual support for said exploration insights by means of an abstract model of API-usage views, which we implemented in EXAPUS (for Explore API usage), an open-source, IDE-like, Web-enabled tool. We thereby also provide tool support that can be used by others for exploration experiments.

The paper’s website (see footnote 1) provides access to QUAATLAS and EXAPUS.

Road-map: §II motivates multi-dimensional exploration of API usage by means of an ‘exploration story’. §III discusses related work and further motivates our research. §IV describes basic concepts regarding APIs and API usage. §V describes the development of the QUAATLAS corpus that can be used for experimenting with API-usage exploration in the Java context. §VI presents an inventory of abstract insights expected from exploration. §VII describes an abstract model of views for exploration. §VIII describes the EXAPUS tool which supports the described exploration approach. §IX concludes the paper.

II. AN EXPLORATION STORY

Joanna Programmer is a new hire in software development at the fictional Acme Corporation. The company’s main product is JHotDraw and Joanna was hired to respond to pending renovation plans.

JHotDraw has been heavily reverse-engineered in the past for the sake of incorporating crosscutting concerns such as logging, enabling refactoring (e.g., for design patterns), or generally understanding its architecture at various levels. Such existing research does not directly apply to Joanna’s assignment. She is asked to renovate JHotDraw to use JSON instead of XML and to replace native GUI programming by HTML5 compliance. Further, an Android SDK-based version is needed as well. Joanna is not particularly familiar yet with JHotDraw, but she quickly realizes that much of the challenge lies in the API usage of JHotDraw. This is when Joanna encounters EXAPUS.

Fig. 1 summarizes API usage in JHotDraw as analyzed with EXAPUS. The tree view shows all APIs as they are known to EXAPUS and exercised by JHotDraw. The heavier the border, the more usage. Rectangles proxy for APIs that are packages. Triangles proxy for APIs with a package subtree.

1 http://softlang.uni-koblenz.de/explore-API-usage

























Exploration insights

• The API Dispersion insight

• The API Distribution insight

• The API Footprint insight

• The Sub-API Footprint insight

• The API Cocktail insight

• The API Coupling insight

• The API Profile insight


Fig. 4. JDOM’s API Dispersion in QUAATLAS (project-centric table).

A. Format of insight descriptions

We use the following format. The Intent paragraph summarizes the insight. The Stakeholder paragraph identifies whether the insight benefits the API developer, the project developer, or both. The API usage paragraph quantifies API usage of interest, e.g., whether one API is considered or all APIs. The View paragraph describes, in abstract terms, how API-usage data is to be rendered. The Illustration paragraph applies the abstract insight concretely to APIs and projects of QUAATLAS. We use different forms of illustrations: tables, trees, and tag clouds. The Intelligence paragraph hints at the ‘operational’ intelligence supported by the insight.

B. The API Dispersion insight

Intent – Understand an API’s dispersion in a corpus by comparing API usage across the projects in the corpus.
Stakeholder – API developer.
API usage – One API.
View – The listing of projects with associated API-usage metrics for quantitative comparison and API facets for qualitative comparison.
Illustration – Fig. 4 summarizes JDOM’s dispersion quantitatively in QUAATLAS. 6 projects in the corpus exercise JDOM. The projects are ordered by the #ref metric with the other metrics not aligning. Only 2 projects (jspwiki and velocity) exercise type derivation at the boundary of API and project.
Intelligence – The insight is about the significance of API usage across the corpus. In the figure, arguably, project jspwiki shows the most significant API usage because it references the most API elements. Project jmeter shows the least significant API usage. Observation of significance helps an API developer in picking hard and easy projects for compliance testing along API evolution: an easy one to get started; a hard one for a solid proof of concept. For instance, development of a wrapper-based API re-implementation for API migration relies on suitable ‘test projects’ just like that [6], [7].
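The project-centric ordering described above can be sketched in a few lines of Java. The sketch assumes that a #ref count per project is already available; the project names are those mentioned in the text, but the counts are invented for illustration and are not the values of Fig. 4. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DispersionSketch {
    // Orders projects by descending #ref; ties are broken alphabetically.
    static List<String> orderByRefs(Map<String, Integer> refsPerProject) {
        List<String> projects = new ArrayList<>(refsPerProject.keySet());
        projects.sort(Comparator.comparing((String p) -> refsPerProject.get(p))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return projects;
    }

    public static void main(String[] args) {
        // Invented #ref values, for illustration only.
        Map<String, Integer> refs = new LinkedHashMap<>();
        refs.put("jmeter", 3);
        refs.put("jspwiki", 120);
        refs.put("velocity", 40);

        List<String> ordered = orderByRefs(refs);
        System.out.println("ordered by #ref: " + ordered);
        // A 'hard' and an 'easy' candidate for compliance testing.
        System.out.println("hard: " + ordered.get(0));
        System.out.println("easy: " + ordered.get(ordered.size() - 1));
    }
}
```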

C. The API Distribution insight

Intent – Understand API distribution across project scopes.
Stakeholder – Project developer.
API usage – One API.
View – The hierarchical breakdown of the project scopes with associated API-usage metrics for quantitative comparison and API facets for qualitative comparison.

Fig. 5. JDOM’s API Footprint in QUAATLAS (API-centric table).

Illustration – Remember JHotDraw’s slice of DOM usage in Fig. 2 in §II. This view was suitable for efficient exploration of project scopes that directly depend on DOM.
Intelligence – The insight may help a developer to decide on the feasibility of an API migration, as we discussed in §II.

D. The API Footprint insight

Intent – Understand what API elements are used in a corpus or varying project scopes.
Stakeholder – Project developer and API developer.
API usage – One API.
View – The listing of used API packages, types, and methods.
Illustration – Remember the tree-based representation of the API footprint for JHotDraw as shown in Fig. 3 in §II. In a similar manner, while using a table-based representation, Fig. 5 summarizes JDOM usage across QUAATLAS. All JDOM packages are listed. The core package is heavily used and thus the listing is further refined to show details per API type. Ordering relies on the #ref metric. Clearly, there is little usage of API elements outside the core package.
Intelligence – Overall, the footprint describes the (smaller) ‘actual’ API that needs to be understood as opposed to the full (‘official’) API. For instance, many APIs enable nontrivial, framework-like usage [1], [25], but in the absence of actual framework-like usage, the project developer may entertain a much simpler view on the API. In the context of API evolution, an API developer consults an API’s footprint to minimize changes that break actual usage or to make an impact analysis for changes. In the context of wrapper-based API re-implementation for API migration, an API developer or a project developer (who develops a project-specific wrapper) uses the footprint to limit the effort [6], [7].
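A footprint in the above sense can be approximated as the set of distinct API elements that are referenced at all, each with its #ref count. The following sketch assumes that references have already been extracted as a flat list of qualified element names; the input list and all names in the code are hypothetical and do not reflect the data of Fig. 5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FootprintSketch {
    // Maps each referenced API element to its number of references,
    // ordered by descending #ref (alphabetical order on ties).
    static LinkedHashMap<String, Integer> footprint(List<String> referencedElements) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String element : referencedElements) {
            counts.merge(element, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so the alphabetical TreeMap order survives on ties.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        LinkedHashMap<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented reference list, for illustration only.
        List<String> refs = Arrays.asList(
                "org.jdom.Element.getChild",
                "org.jdom.Document.getRootElement",
                "org.jdom.Element.getChild");
        footprint(refs).forEach((element, n) -> System.out.println(n + "  " + element));
    }
}
```

The size of the resulting map corresponds to the number of used API elements, which is the quantity a developer would compare against the size of the full API.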













The “API dispersion” insight


Fig. 1. API usage in JHotDraw with scaling applied to numbers of API references

Slice of JHotDraw with DOM usage

The view only shows packages and types with API references to DOM. Out of the 13 top-level packages of JHotDraw, only one, the xml package (together with its subpackage css), references DOM. There is a total of 4 class types that contain references. The combined reference count is 94 where 19 unique API elements are referenced, which is a relatively small number of used API elements in view of the hundreds of API elements declared by the DOM API.

    public void applyStylesTo(Element elem) {
        for (CSSRule rule : rules) {
            if (rule.matches(elem)) {
                rule.apply(elem);
            }
        }
    }

Fig. 2. The slice of JHotDraw with DOM usage

Fig. 3. Minuscule view for DOM usage in JHotDraw: with leaves formethods, eggs for types, and the remaining nodes for packages.

Let us focus on the requirement for replacing XML by JSON. In Fig. 1, two XML APIs show up: DOM and SAX. Joanna begins with an exploration of DOM usage. Fig. 3 summarizes DOM usage in JHotDraw as analyzed with EXAPUS. Encouragingly, DOM’s footprint in JHotDraw only covers a few types and methods.

A logical next step in the exploration is to examine the distribution of API usage across JHotDraw. In this manner, Joanna gets a sense of the locality of API usage. The corresponding view is shown in Fig. 2, and it reveals good news in so far as DOM usage is limited to the JHotDraw package org.jhotdraw.xml, which she shall explore further to prepare a possible XML-to-JSON migration.

III. RELATED WORK

We identify the following categories of related work. (In discussion of the cited papers, we bring forward the aspects directly comparable to our effort.)

A. Exploration of projects

There are several conceptual styles of project comprehension. An example of an interactive, human-involving effort can be found in the work of Brühlmann et al. [10], where experts annotate project parts to capture human knowledge. They further use the emerged meta-model to analyze features, architecture, and design flaws of the project.

Query-driven comprehension can proceed through user-defined queries that identify code of interest, as in the work of Mens and Kellens [11] or De Roover et al. [9], where a comprehensive tool suite facilitates defining and exploring query results. Alwis and Murphy in their work [8] identify and investigate pre-defined queries for exploration of a software system, e.g., “What calls this method.”

Visual summary of projects usually involves some sort of scaling, color coding, and hierarchical grouping, as discussed by Lanza and Ducasse [12]. Visualizations can be more involved, as in the work of Wettel et al. [13], where a city metaphor is used to represent a 3D structure of projects based on the value of metrics.

Our approach combines these conceptual styles. We allow the user to accumulate and refine knowledge about APIs, their facets, and domains. The exploration activities explained in the paper are intuitive; flexibility in their combination enables answering typical questions such as those identified by Alwis and Murphy [8]. Tag clouds, tables, and trees accompanied by metrics provide basic and familiar visual aid in exploration.

B. Exploration of APIs

1) Measuring usage: Research on API usage often leverages usage frequency, or popularity, of APIs and their parts. For instance, Mileva et al. use popularity to identify most commonly used library versions [14] or to identify and predict API usage trends over time [15]. Holmes et al. appeal to popularity as the main indicator: for the API developer, to be able to prioritize efforts and be informed about consumption of libraries; for the API user, to be able to identify libraries of interest and be informed of ways of their usage [16]. Eisenberg et al. use font scaling w.r.t. popularity of API elements to help navigate through its structure [5], [17]. Thummalapenta and Xie use Google search to find relevant code examples for further frequency analysis of API parts usage [18]. Ma et al. investigate coverage of Java Standard API to identify which parts are ignored by the API users [19]. In our work, we suggest more advanced metrics indicating API usage; their distribution is integrated in the table and graph views of our tool, providing sorting and scaling.

2) Understanding usage: Robillard and DeLine discovered in their field study on API learning obstacles that API users prefer to learn from patterns of related calls rather than illustrations of individual methods [20]. And, indeed, many existing efforts are exercising information about API usage to help developers use APIs. E.g., Nasehi and Maurer show that API unit tests can be used as usage examples [21]. Zhong et al. cluster API calls and mine patterns to recommend useful code snippets to API users [3]. Bruch et al. develop intelligent code completion that narrows down the possible suggestions to those API elements that are actually relevant [22]. Mandelin et al. present an approach for synthesizing a snippet to fill in a gap in the code using an API, given certain contextual information [23]. Our effort differs in that it enables navigating both projects and APIs in the familiar IDE-like manner with API usage in focus. We also identify a catalogue of possible exploration activities to perform.

IV. BASIC CONCEPTS

We set up the basic concepts underlying this paper: APIs, API usage, and API-usage metrics. We also augment the basic notion of API with extra dimensions of abstraction (API domains and API facets), which are helpful in raising the level of abstraction in exploration.

APIs: We use the term API to refer to the actual interface but also to the underlying implementation. We do not pay attention to any distinction between libraries and frameworks. We simply view an API as a set of types (classes, interfaces, etc.) referable by name and distributed together for use in software projects. Without loss of generality, this paper invokes Java for most illustrations and intuitions.

Indeed, we assume that package names, package prefixes, and types within packages can be used to describe APIs. For instance, the package prefix javax.swing (and possibly others) could be associated with the Swing API for GUI programming. It is important that we view javax.swing as a package prefix because Swing is indeed organized in a package tree. In contrast, the java.util API corresponds to all the types in the package of the same name. There are various subpackages of java.util, but they are preferably considered separate APIs. In fact, the java.util API deserves further breakdown, giving rise to the notion of sub-API, because the package serves de facto unrelated purposes, notably Java’s collections and Java’s event system, which can be quantified as subsets of the types in java.util. (This is not an uncommon situation.)
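The distinction between a package and a package prefix can be made concrete with a small sketch. The class ApiScope and its fields are hypothetical and are not taken from QUAATLAS or EXAPUS; the sketch also assumes that type names are given as qualified names of top-level types.

```java
final class ApiScope {
    final String apiName;
    final String packageName;
    final boolean prefix; // true: whole package tree; false: exactly this package

    ApiScope(String apiName, String packageName, boolean prefix) {
        this.apiName = apiName;
        this.packageName = packageName;
        this.prefix = prefix;
    }

    // Decides whether a qualified top-level type name belongs to this API.
    boolean covers(String qualifiedTypeName) {
        int lastDot = qualifiedTypeName.lastIndexOf('.');
        if (lastDot < 0) {
            return false; // type in the default package
        }
        String pkg = qualifiedTypeName.substring(0, lastDot);
        if (pkg.equals(packageName)) {
            return true;
        }
        return prefix && pkg.startsWith(packageName + ".");
    }

    public static void main(String[] args) {
        ApiScope swing = new ApiScope("Swing", "javax.swing", true);
        ApiScope util = new ApiScope("java.util", "java.util", false);

        System.out.println(swing.covers("javax.swing.table.TableModel"));
        System.out.println(util.covers("java.util.List"));
        System.out.println(util.covers("java.util.concurrent.Future"));
    }
}
```

A sub-API such as Java’s collections could then be modeled as an explicit set of type names within a scope, since it is not delimited by a package boundary.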

Clearly, APIs may exist in different versions. If these are major versions (e.g., JUnit 3 and 4), then they may be treated effectively as different APIs. In the case of minor versions (assuming qualified names of API elements have remained stable), they may be treated as the same API.

API usage: We are concerned with API usage in given software projects. API usage is evidenced from any sort of reference from projects to APIs. References are directly associated with syntactical patterns in the code of the projects, e.g., a method call in a class of a project that invokes a method of an API type, or a class declaration in a project that explicitly extends a class of an API. The resulting patterns can hence be used to classify API references and to control exploration with regard to the kinds of references to present to users.
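The two kinds of references mentioned above can be seen side by side in a short Java example that uses the DOM and SAX APIs shipped with the JDK. The project-side names (ApiReferenceKinds, CountingHandler, rootTag) are made up for illustration; the example is not taken from JHotDraw or the corpus.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class ApiReferenceKinds {
    // 'extends' reference: a project class derives from a SAX API class.
    static final class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            elements++;
        }
    }

    // Method-call references: each call below invokes a method of an API type.
    static String rootTag() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("drawing");
            doc.appendChild(root);
            return doc.getDocumentElement().getTagName();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootTag());
    }
}
```

An analysis in the sense of the text would record one derivation reference for CountingHandler and several call references for rootTag, each resolvable to a declaration in the respective API.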

A reasonably precise analysis of API usage requires that the underlying projects are ‘resolved’ in that each API reference in a project can be followed to the corresponding declaration in the API. Further, since exploration of API usage relies on the developer’s view on source code of projects, we effectively need compilable source code of all projects.

API-usage metrics: For quantifying API usage, metrics are needed that can be used in exploration views in different ways, e.g., for ordering (elements or scopes of APIs or projects) or for scaling in the visualization of API usage. For the purpose of this paper, the following metrics suffice:
#proj: Number of projects referencing APIs.
#api: Number of APIs being referenced.
#ref: Number of references from projects to APIs.
#elem: Number of API elements being referenced.
#derive: Number of project types derived from API types.
#super: Number of API types serving as supertype for derivations.
#sub: Number of project types serving as subtype for derivations.

These metrics can be applied, of course, to different selec-tions of projects or APIs as well as specific packages, types,or methods thereof. For instance, we may be interested in #apifor a specific project. Also, we may be interested in #ref forsome part of an API.

Further, these metrics can be configured to count onlyspecific patterns. It is easy to see now that the given metricsare not even orthogonal because, for example, #derive can beobtained from #ref by only counting patterns for ‘extends’ and‘implements’ relationships.

API domains: We assume that each API addresses someprogramming domain such as XML processing or GUI pro-gramming. We are not aware of any general, widely adoptedattempt to associate APIs with domains, but the idea appearsto merit further research. We have begun collecting program-ming domains (or in fact, API domains) and tagging APIsappropriately. Let us list a few API domains and associatethem with well-known Java APIs:GUI: GUI programming, e.g., Swing and AWT.XML: XML processing, e.g., DOM, JDOM, and SAX.Data: Data structures incl. containers, e.g., java.util.IO: File- and stream-based I/O, e.g., java.io and java.nio.Component: Component-oriented programming, e.g., JavaBeans.Meta: Meta-programming incl. reflection, e.g., java.lang.reflect.Basics: Basic language support, e.g., java.lang.String.


interest and be informed of ways of their usage [16]. Eisenberg et al. use font scaling w.r.t. popularity of API elements to help navigate through its structure [5], [17]. Thummalapenta and Xie use Google search to find relevant code examples for further frequency analysis of API parts usage [18]. Ma et al. investigate coverage of the Java Standard API to identify which parts are ignored by the API users [19]. In our work, we suggest more advanced metrics indicating API usage; their distribution is integrated in the table and graph views of our tool, providing sorting and scaling.

2) Understanding usage: Robillard and DeLine discovered in their field study on API learning obstacles that API users prefer to learn from patterns of related calls rather than illustrations of individual methods [20]. Indeed, many existing efforts exploit information about API usage to help developers use APIs. For example, Nasehi and Maurer show that API unit tests can be used as usage examples [21]. Zhong et al. cluster API calls and mine patterns to recommend useful code snippets to API users [3]. Bruch et al. develop intelligent code completion that narrows down the possible suggestions to those API elements that are actually relevant [22]. Mandelin et al. present an approach for synthesizing a snippet to fill in a gap in the code using an API, given certain contextual information [23]. Our effort differs in that it enables navigating both projects and APIs in the familiar IDE-like manner with API usage in focus. We also identify a catalogue of possible exploration activities to perform.

IV. BASIC CONCEPTS

We set up the basic concepts underlying this paper: APIs, API usage, and API-usage metrics. We also augment the basic notion of API with extra dimensions of abstraction (API domains and API facets), which are helpful in raising the level of abstraction in exploration.

APIs: We use the term API to refer to the actual interface but also to the underlying implementation. We do not pay attention to any distinction between libraries and frameworks. We simply view an API as a set of types (classes, interfaces, etc.) referable by name and distributed together for use in software projects. Without loss of generality, this paper invokes Java for most illustrations and intuitions.

Indeed, we assume that package names, package prefixes, and types within packages can be used to describe APIs. For instance, the package prefix javax.swing (and possibly others) could be associated with the Swing API for GUI programming. It is important that we view javax.swing as a package prefix because Swing is indeed organized in a package tree. In contrast, the java.util API corresponds to all the types in the package of the same name. There are various subpackages of java.util, but they are preferably considered separate APIs. In fact, the java.util API deserves further breakdown, giving rise to the notion of sub-API because the package serves de facto unrelated purposes, notably Java’s collections and Java’s event system, which can be described as subsets of the types in java.util. (This is not an uncommon situation.)

Clearly, APIs may exist in different versions. If these are major versions (e.g., JUnit 3 and 4), then they may be treated effectively as different APIs. In the case of minor versions (assuming qualified names of API elements have remained stable), they may be treated as the same API.
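The three ways of describing an API mentioned above (package prefix, exact package, subset of types in a package) can be illustrated with a minimal Haskell sketch. The names ApiDescriptor, matches, and classify are our own and not part of the paper or its tool, and the type lists chosen for the java.util sub-APIs are only examples.

```haskell
import Data.List (isPrefixOf)
import Data.Maybe (listToMaybe)

-- Qualified names as segment lists, e.g. ["javax","swing","JButton"].
type QName = [String]

-- Hypothetical descriptor forms (our naming, not the paper's).
data ApiDescriptor
  = PackagePrefix QName        -- a package and all of its subpackages
  | ExactPackage QName         -- only the types directly in one package
  | TypeSubset QName [String]  -- selected types of one package (a sub-API)

data Api = Api { apiName :: String, descriptors :: [ApiDescriptor] }

-- Does a qualified type name (package segments plus type name) match?
matches :: QName -> ApiDescriptor -> Bool
matches qn (PackagePrefix p) = p `isPrefixOf` init qn
matches qn (ExactPackage p)  = p == init qn
matches qn (TypeSubset p ts) = p == init qn && last qn `elem` ts

-- First API that covers the type; the order of the list breaks ties,
-- so sub-APIs should be listed before the package they refine.
classify :: [Api] -> QName -> Maybe String
classify apis qn
  | null qn   = Nothing
  | otherwise =
      listToMaybe [ apiName a | a <- apis, any (matches qn) (descriptors a) ]

-- Illustrative catalogue; the type lists are examples, not exhaustive.
exampleApis :: [Api]
exampleApis =
  [ Api "Swing" [PackagePrefix ["javax", "swing"]]
  , Api "java.util (collections)"
        [TypeSubset ["java", "util"] ["List", "ArrayList", "Map", "HashMap"]]
  , Api "java.util (events)"
        [TypeSubset ["java", "util"] ["EventObject", "EventListener"]]
  , Api "java.util" [ExactPackage ["java", "util"]]
  , Api "java.util.regex" [ExactPackage ["java", "util", "regex"]]
  ]
```

With this catalogue, a type in a Swing subpackage such as javax.swing.table is attributed to Swing, whereas java.util.regex.Pattern is not attributed to java.util, which mirrors the prefix versus exact-package distinction in the text.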

API usage: We are concerned with API usage in given software projects. API usage is evidenced from any sort of reference from projects to APIs. References are directly associated with syntactical patterns in the code of the projects, e.g., a method call in a class of a project that invokes a method of an API type, or a class declaration in a project that explicitly extends a class of an API. The resulting patterns can hence be used to classify API references and to control exploration with regard to the kinds of references to present to users.

A reasonably precise analysis of API usage requires that the underlying projects are ‘resolved’ in that each API reference in a project can be followed to the corresponding declaration in the API. Further, since exploration of API usage relies on the developer’s view on source code of projects, we effectively need compilable source code of all projects.

API-usage metrics: For quantifying API usage, metrics are needed that can be used in exploration views in different ways, e.g., for ordering (elements or scopes of APIs or projects) or for scaling in the visualization of API usage. For the purpose of this paper, the following metrics suffice:
#proj: Number of projects referencing APIs.
#api: Number of APIs being referenced.
#ref: Number of references from projects to APIs.
#elem: Number of API elements being referenced.
#derive: Number of project types derived from API types.
#super: Number of API types serving as supertype for derivations.
#sub: Number of project types serving as subtype for derivations.

These metrics can be applied, of course, to different selections of projects or APIs as well as specific packages, types, or methods thereof. For instance, we may be interested in #api for a specific project. Also, we may be interested in #ref for some part of an API.

Further, these metrics can be configured to count only specific patterns. It is easy to see now that the given metrics are not even orthogonal because, for example, #derive can be obtained from #ref by only counting patterns for ‘extends’ and ‘implements’ relationships.
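As a rough illustration of how these metrics relate, the following Haskell sketch computes them over a flat list of references. The Reference record, its field names, and the sample rows are assumptions made for this example; the paper's tool works on a richer model. Following the remark above, #derive is taken to be #ref restricted to the ‘extends’ and ‘implements’ patterns.

```haskell
import Data.List (nub)

-- Syntactic patterns; only a few are listed here for illustration.
data Pattern
  = InstanceMethodCall | StaticMethodCall | ExtendsClass | ImplementsInterface
  deriving (Eq, Show)

-- Hypothetical flat record of one reference from a project to an API.
data Reference = Reference
  { refProject :: String   -- referencing project
  , refType    :: String   -- qualified name of the referencing project type
  , refApi     :: String   -- referenced API
  , refElement :: String   -- qualified name of the referenced API element
  , refPattern :: Pattern  -- pattern expressing the reference
  } deriving (Eq, Show)

countDistinct :: Eq b => (Reference -> b) -> [Reference] -> Int
countDistinct f = length . nub . map f

numProj, numApi, numRef, numElem :: [Reference] -> Int
numProj = countDistinct refProject
numApi  = countDistinct refApi
numRef  = length
numElem = countDistinct refElement

isDerivation :: Reference -> Bool
isDerivation r = refPattern r `elem` [ExtendsClass, ImplementsInterface]

-- #derive is #ref configured to count only derivation patterns.
numDerive, numSuper, numSub :: [Reference] -> Int
numDerive = numRef . filter isDerivation
numSuper  = countDistinct refElement . filter isDerivation
numSub    = countDistinct refType . filter isDerivation

-- Illustrative rows only; these are not measurements from the corpus.
sampleRefs :: [Reference]
sampleRefs =
  [ Reference "jhotdraw" "example.ViewImpl" "Swing"
      "javax.swing.JComponent" InstanceMethodCall
  , Reference "jhotdraw" "example.Panel" "Swing"
      "javax.swing.JComponent" ExtendsClass
  , Reference "velocity" "org.apache.velocity.anakia.AnakiaJDOMFactory" "JDOM"
      "org.jdom.DefaultJDOMFactory" ExtendsClass
  , Reference "informa" "example.Parser" "JDOM"
      "org.jdom.Element" InstanceMethodCall
  ]
```

Restricting the input list (for instance, to one project or one API package) before applying a metric corresponds to applying the metric to a selection, as described above.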

API domains: We assume that each API addresses some programming domain such as XML processing or GUI programming. We are not aware of any general, widely adopted attempt to associate APIs with domains, but the idea appears to merit further research. We have begun collecting programming domains (or in fact, API domains) and tagging APIs appropriately. Let us list a few API domains and associate them with well-known Java APIs:
GUI: GUI programming, e.g., Swing and AWT.
XML: XML processing, e.g., DOM, JDOM, and SAX.
Data: Data structures incl. containers, e.g., java.util.
IO: File- and stream-based I/O, e.g., java.io and java.nio.
Component: Component-oriented programming, e.g., JavaBeans.
Meta: Meta-programming incl. reflection, e.g., java.lang.reflect.
Basics: Basic language support, e.g., java.lang.String.



API domains are helpful in reporting API usage and quantifying API usage of interest in more abstract terms than the names of individual APIs, as will be illustrated in §VI.

API facets: An API may contain dozens or hundreds of types, each of which has many method members in turn. Some APIs use subpackages to organize such API complexity, but those subpackages are typically concerned with advanced API usage whereas the core facets of API usage are not distinguished in any operational manner. This makes it hard to understand API usage at a somewhat abstract level.

Accordingly, we propose leveraging a notion of API facets. Each API enjoys a manageable number of facets. In general, we may use arbitrary program analyses to attest use of a facet. In this paper, we limit ourselves to a simple form of facets, which can be attested on the grounds of specific types or methods being used. Except for the notion of API usage patterns (see §III), we are not aware of any general, widely adopted attempt to break down APIs into facets, but the idea appears to merit further research. We have begun identifying API facets and tagging APIs appropriately. As an illustration, we briefly characterize a few API facets of the typical DOM-like API such as DOM itself, JDOM, or dom4j:
Input / Output: De-/serialization for DOM trees.
Observation: Getter-like access and other ‘read only’ forms.
Addition: Addition of nodes et al. as part also of construction.
Removal: Removal of nodes et al. as a form of mutation.
Namespaces: XML namespace manipulation.
Nontrivial XML: Use of CDATA, PI, and other XML idiosyncrasies.

We may also designate a facet to ‘Nontrivial API’ usage when it involves advanced types and methods that are beyond normal API usage. For instance, XML APIs may provide some framework for node factories or adapters for API integration. API facets are helpful in communicating API usage to the user at a more abstract level than the level of individual types and methods, as will be illustrated in §VI.
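The simple form of facets described above, attested by specific types or methods being used, amounts to a lookup table from API elements to facet tags. The sketch below assumes such a table for JDOM; the assignment of particular JDOM elements to facets is our own guess for illustration and not the tagging used in the paper.

```haskell
import qualified Data.Map.Strict as Map

type Facet       = String
type ElementName = String  -- qualified API type or method name

-- Hypothetical facet table for JDOM: which elements attest which facet.
jdomFacets :: Map.Map ElementName Facet
jdomFacets = Map.fromList
  [ ("org.jdom.input.SAXBuilder.build",    "Input")
  , ("org.jdom.output.XMLOutputter.output", "Output")
  , ("org.jdom.Element.getChild",           "Observation")
  , ("org.jdom.Element.getText",            "Observation")
  , ("org.jdom.Element.addContent",         "Addition")
  , ("org.jdom.Element.removeChild",        "Removal")
  , ("org.jdom.Element.getNamespace",       "Namespaces")
  , ("org.jdom.CDATA",                      "Nontrivial XML")
  ]

-- Count references per facet for the API elements used in some scope.
-- Elements without a facet tag are skipped.
profile :: Map.Map ElementName Facet -> [ElementName] -> Map.Map Facet Int
profile table used =
  Map.fromListWith (+) [ (f, 1) | e <- used, Just f <- [Map.lookup e table] ]
```

For example, a scope that only calls SAXBuilder.build, Element.getChild, and Element.getText yields a profile with just the Input and Observation facets, which is the kind of focused usage scenario the facet notion is meant to expose.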

V. THE QUAATLAS CORPUS FOR API-USAGE ANALYSIS

Our study requires a suitable corpus of mature, well-developed projects coming from different application domains. Arguably, such projects show sufficient and advanced API usage. We decided to restrict ourselves to open-source Java projects. In order to increase quality and reproducibility of our research, we decided to use an existing, established, and curated collection of Java projects: the QUALITAS corpus [24], release 20101126r. As we discussed in §IV, API usage entails the ability to resolve types. However, QUALITAS does not guarantee the availability of a project’s library types. The collection consists of source and binary forms as they are provided by the project developers.

In the interest of similar research tasks that require a dependency-resolved corpus, we detail our method for corpus (re-)engineering. The resulting dependency-resolved QUALITAS variant is available on the paper’s website.

A. Method

Consider the following, partially automated pseudocode:

1.  input : corpus, candidateList
2.  output : corpus
3.  for each name in candidateList :
4.      (p_src, p_bin) = obtainProject(name);
5.      patches = exploratoryBuild(p_src, p_bin);
6.      timestamp = build(p_src, patches);
7.      (java, classes, jars) = collectStats(p_src);
8.      java' = filter(java);
9.      (jars_built, jars_lib) = detectJars(timestamp, java', jars);
10.     java'_compiled = detectJava(timestamp, java', classes, jars_built);
11.     p'_src = (java'_compiled, jars_lib);
12.     p'_bin = jars_built;
13.     p' = (p'_src, p'_bin);
14.     if validate(p') :
15.         corpus = corpus + p';

The input is a (possibly empty) corpus to be extended and a list of candidate projects, candidateList, to be added to it. The output is the corpus populated with refined projects.

Line 4 assumes that a project can be obtained both in its source and binary forms (e.g., downloading them from the project website). During an exploratory build (line 5), the nature of the project is manually investigated by an expert. The expert investigates how the project is built, what errors occur during the build (if any), and how to patch them. At this stage, we also compare the set of built JARs with the JARs in the binary distribution form of the project. If the former set is smaller than the latter (e.g., because default targets in build scripts may be insufficient and a series of target calls or invocation of several build scripts is needed), we attempt to push the build of the project for completeness. Once the exploratory build is successful, we are able to automatically build the project (line 6), if necessary after applying patches.

After the build, we collect the full path, creation and modification times of each file in the project (line 7). For Java files we extract qualified names of contained top-level types; for class files we detect their qualified names. For JARs we explore their contents and collect information about the contained class files.

On line 8, we apply a filter, keeping only the source code that we consider to be both system and core (see Section V-C). On line 9, we use the known start time of the build together with information about Java types computed on lines 7 and 8 to classify the JARs found after the build either as library JARs or as built JARs. On line 10, we use the identified built JARs and the compiled class files to identify Java types that were compiled during the build. On line 11, we refine the project’s source code form p'_src to include only the compiled Java types together with the necessary library JARs. On line 12, the binary form p'_bin is refined to consist of the built JARs.

The refined project p' (line 13) is validated (line 14) by rebuilding the project in a sandbox, outside its specific setup, making sure to use only those files that have been identified by the method (see footnote 2). A successful sandbox build indicates that source

2 In practice, we use an Eclipse workspace with automatically generated configuration files (i.e., .project and .classpath).
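To make step 9 of the method more concrete, here is a minimal Haskell sketch of classifying JARs as built or library JARs. The exact criterion is not spelled out in the text, so the heuristic below is an assumption: a JAR counts as built if it was modified at or after the build start and contains at least one type declared in the filtered sources. The Jar record and its fields are hypothetical.

```haskell
import Data.List (partition)
import qualified Data.Set as Set

type Timestamp = Integer  -- e.g., seconds since the epoch
type TypeName  = String   -- qualified Java type name

-- Hypothetical summary of one JAR as collected on line 7.
data Jar = Jar
  { jarPath     :: FilePath
  , jarModified :: Timestamp
  , jarTypes    :: [TypeName]  -- types of the contained class files
  } deriving (Eq, Show)

-- Returns (built JARs, library JARs).
-- Assumed heuristic: built = written during the build AND containing
-- at least one type that is declared in the project's own sources.
detectJars :: Timestamp -> Set.Set TypeName -> [Jar] -> ([Jar], [Jar])
detectJars buildStart sourceTypes = partition isBuilt
  where
    isBuilt jar =
      jarModified jar >= buildStart
        && any (`Set.member` sourceTypes) (jarTypes jar)
```

The second condition guards against library JARs that a build script merely copies during the build, since those carry a fresh timestamp but no project types.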


























Another visualization of API footprint: “project centric” (for 1 project)

Top-level circles = packages; inner circles = classes

Size of inner circle = LOC; color = # API-type references

(Fields are ignored.) Thanks are due to Victor Winter / SHIFT Lab @ UNO.


Another visualization of API footprint: “API centric” (for 1 project)

Top-level circles = packages; inner circles = classes

Size of inner circle = LOC; color = # API-type references

Thanks are due to Victor Winter / SHIFT Lab @ UNO.


[Figure: table with columns Scope, Tags incl. facets, and #proj, showing nontrivial JDOM API usage in velocity; the selected entry is org.apache.velocity.anakia.AnakiaJDOMFactory.]

Fig. 6. ‘Non-trivial API’ usage for package org.jdom in QUAATLAS.

[Figure: JHotDraw’s API Cocktail as tag clouds. Project jhotdraw: AWT, Swing, java.io, java.lang, java.util, JavaBeans, java.text, java.lang.reflect, DOM, java.net, java.util.regex, Java Print Service, java.util.zip, java.lang.annotation, java.math, java.lang.ref, java.util.concurrent, Java security, javax.imageio, SAX. Package org.jhotdraw.undo: Swing, java.lang, JavaBeans, java.io, AWT, java.util.]

Fig. 7. The API Cocktail of JHotDraw (cloud of API tags).

E. The Sub-API Footprint insight

Intent – Understand usage of a sub-API in a corpus or project.
Stakeholder – API developer and, possibly, project developer.
API usage – One API.
View – A list as in the case of the API Footprint insight, except that it is narrowed down to a sub-API of interest.
Illustration – Fig. 6 illustrates ‘Non-trivial API’ usage for JDOM’s core package. The selection is concerned with a project type which extends the API type DefaultJDOMFactory to introduce a project-specific factory for XML elements. Basic IDE functionality could be used from here on to check where the API-derived type is used.
Intelligence – In the example, we explored non-trivial API usage, such as type derivation at the boundary of project and API, knowing that it challenges API evolution and migration [7]. More generally, developers are interested in specific sub-APIs when they require detailed analysis for understanding. API developers (more likely than project developers) may be more aware of sub-APIs; they may, in fact, capture them as part of the exploration. (This is what we did during this research.) Such sub-API tagging, which is supported by the Sub-API Footprint insight, may ultimately improve API documentation in ways that are complementary to existing approaches [4], [5].

F. The API Cocktail insight

Intent – Understand what APIs are used together in larger project scopes.
Stakeholder – Project developer.
API usage – All APIs.
View – The listing of all APIs exercised in the project or a project package with API-usage metrics applied to the APIs.
Illustration – Recall the tree-based representation of the API cocktail for JHotDraw as shown in Fig. 1 in §II. The same cocktail of 20 APIs is shown as a tag cloud in Fig. 7. Scaling is based on the #ref metric.
Intelligence – The cocktail lists and ranks APIs that are used in the corresponding project scope. Thus, the cocktail serves as a proxy for system complexity, required developer skills, and foreseeable design and implementation challenges. API usage is part of the software architecture, in the sense of “what makes it hard to change the software”. Chances are that API usage may cause some “software or API asbestos” [26]. While a large cocktail may be acceptable and unavoidable for a complex project, the cocktail should be smaller for individual packages in the interest of a modularized, evolvable system.

[Figure: JHotDraw’s API Domain Cocktail as tag clouds. Project jhotdraw: GUI, Data, Basics, IO, Format, Component, Meta, XML, Distribution, Parsing, Control, Math, Output, Security, Concurrency. Package org.jhotdraw.undo: GUI, Basics, Component, IO.]

Fig. 8. Cocktail of domains for JHotDraw.

[Figure: Coupling in JHotDraw for the interface org.jhotdraw.app.View. APIs: java.lang, java.net, Swing, JavaBeans, java.io. API domains: Basics, Distribution, GUI, IO, Component.]

Fig. 9. API Coupling for JHotDraw’s interface org.jhotdraw.app.View.
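An API cocktail for a project scope can be sketched as counting references per API within that scope and ranking by #ref. The Haskell fragment below assumes a simplified reference representation (qualified name of the referencing project type paired with the referenced API); the sample data is illustrative and not taken from the corpus.

```haskell
import Data.List (isPrefixOf, sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as Map

type Api   = String
type QName = [String]

-- Simplified reference: (referencing project type, referenced API).
type Ref = (QName, Api)

-- APIs used within a scope (a package prefix; [] means the whole
-- project), ranked by descending number of references (#ref).
cocktail :: QName -> [Ref] -> [(Api, Int)]
cocktail scope refs =
  sortBy (comparing (Down . snd)) (Map.toList counts)
  where
    counts =
      Map.fromListWith (+) [ (a, 1) | (qn, a) <- refs, scope `isPrefixOf` qn ]

-- Illustrative data only.
sampleRefs :: [Ref]
sampleRefs =
  [ (["org", "jhotdraw", "undo", "UndoRedoManager"], "Swing")
  , (["org", "jhotdraw", "undo", "UndoRedoManager"], "Swing")
  , (["org", "jhotdraw", "undo", "UndoRedoManager"], "JavaBeans")
  , (["org", "jhotdraw", "app", "View"], "java.net")
  ]
```

Comparing `cocktail [] sampleRefs` with `cocktail ["org","jhotdraw","undo"] sampleRefs` corresponds to comparing the project-level and package-level tag clouds; the counts could also drive font scaling.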

G. APIs versus domains

We can always use API domains in place of APIs to raise the level of abstraction. Thus, any insight that compares APIs may as well be applied to API domains. APIs are concrete technologies while API domains are more abstract software concepts. Consider Fig. 8 for illustration. It shows API domains for all of JHotDraw and also for its undo package. Thus, it presents the API cocktails of Fig. 7 in a more abstract manner.
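Raising a cocktail from APIs to domains is an aggregation over a tagging table. The following Haskell sketch assumes such a table; the entries follow the domain examples in §IV, except that the java.net to Distribution entry is inferred from Fig. 9 and the "Untagged" bucket is our own addition.

```haskell
import qualified Data.Map.Strict as Map

type Api    = String
type Domain = String

-- Tagging of APIs with domains (partly inferred, see lead-in).
domainOf :: Map.Map Api Domain
domainOf = Map.fromList
  [ ("Swing", "GUI"), ("AWT", "GUI")
  , ("DOM", "XML"), ("JDOM", "XML"), ("SAX", "XML")
  , ("java.util", "Data")
  , ("java.io", "IO"), ("java.nio", "IO")
  , ("JavaBeans", "Component")
  , ("java.lang.reflect", "Meta")
  , ("java.lang", "Basics")
  , ("java.net", "Distribution")
  ]

-- Sum per-API reference counts into per-domain counts.
-- APIs without a tag end up in a hypothetical "Untagged" bucket.
domainCocktail :: Map.Map Api Domain -> [(Api, Int)] -> Map.Map Domain Int
domainCocktail table apiCounts =
  Map.fromListWith (+)
    [ (Map.findWithDefault "Untagged" api table, n) | (api, n) <- apiCounts ]
```

Because several APIs may share one domain, the domain cocktail is never larger than the API cocktail, which is why it reads as the more abstract view.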

H. The API Coupling insight

Intent – Understand what APIs or API domains are used together in smaller project scopes.
Stakeholder – Project developer.
API usage – All APIs.
View – See §VI-F except APIs or domains are listed for smaller project scopes.
Illustration – Fig. 9 shows API Coupling for the interface org.jhotdraw.app.View from JHotDraw’s app package (see footnote 4). According to the documentation, the package “defines a framework for document oriented applications and provides default implementations”. The View type “paints a document on a JComponent within an Application”. (Application is the main type from the package which “handles the lifecycle of views and provides windows to present them on screen”.) The coupled use of APIs can be dissected in terms of the involved types as follows:
java.lang: trivial usage of strings.
java.net: types for the location to save the view.
JavaBeans: de-/registration of PropertyChangeListeners.
java.io: exception handling for reading/writing views.
Swing: usage of JComponent on which to paint a document; usage of ActionMap for actions on the GUI component.

4 The lifecycle of the interface as explained by its documentation: http://www.randelshofer.ch/oop/jhotdraw/JavaDoc/org/jhotdraw/app/View.html


























[Figure: JDOM’s API Profile for informa as tag clouds. Project informa: Observation, Input, Nontrivial XML, Manipulation, Exception, Renaming, Addition, Namespaces, Nontrivial API, Output. Package de.nava.informa.parsers: Observation, Input, Exception.]

Fig. 10. JDOM’s API Profile in the informa project (cloud of facet tags).

Intelligence – Simultaneous presence of several domains or APIs in a relatively small project scope may indicate accidental complexity and poor separation of concerns. Thus, such exploration may reveal a code smell [27], [28] that is worth addressing. Alternatively, a dissection, as performed for the illustrative example, may help in understanding the design and reasonable API dependencies.

I. The API Profile insight

Intent – Understand what API facets are used in varying project scopes.
Stakeholder – Project developer and, possibly, API developer.
API usage – One API with available facets.
View – The listing of all API facets exercised in the selected project scope with API-usage metrics applied to the facets.
Illustration – Fig. 10 shows JDOM profiles for a project and one of its packages. The project, as a whole, exercises most facets of the API. In contrast, the selected package is more focused; it is concerned only with loading XML into memory, reading access by getters and friends, and some inevitable exception handling. There is no involvement of namespaces, non-trivial XML, or data access other than observation.
Intelligence – At the level of a complete project, the profile reveals the API facets that the project depends on. As some of the facets are more idiosyncratic than others, such exploration may, in fact, reveal “software or API asbestos” [26], as discussed in §VI-F. For instance, the JDOM facets ‘Non-trivial API’ and ‘Non-trivial XML’ and, to a lesser extent, also ‘Namespaces’ proxy for development challenges or idiosyncrasies. At the level of smaller project scopes, an API’s profile may characterize an actual usage scenario, as in the case of the profile at the bottom of Fig. 10. Such a facet-based approach to understanding API-usage scenarios complements existing, more code pattern-based approaches [2], [3]. API profiles also provide feedback to API developers with regard to ‘usage in the wild’, thereby guiding API evolution or documentation.

VII. EXPLORATION VIEWS

Let us systematically conceptualize attainable views in abstract terms. In this manner, a more abstract model of exploration arises and a foundation for tool support is provided.

We approach this task essentially as a data modeling problem in that we describe the structure behind views and the underlying facts. We use Haskell for data modeling (see footnote 5).

A. Forests

We begin by modeling the (essential) facts about projects and APIs as well as API usage. To this end, we think of two forests: one for all the projects in the corpus, another for all the APIs used in the corpus.

    -- Forests as collections of named trees
    data Forest = Forest [(UqName,PackageTree)]

Each project or API gives rise to one tree (root) in the respective forest. Such a tree breaks down recursively into package layers. If a package layer corresponds to an actual package, then it may also contain types. Types further break down into members. Thus:

    -- Trees breaking down into packages, types, etc.
    data PackageTree = PackageTree [PackageLayer]
    data PackageLayer = PackageLayer UqName [PackageLayer] [Type]
    data Type = Type UqName [Member] [Ref]
    data Member = Member Element UqName [Type] [Ref]
    data Element = Interface | Class | InstanceMethod | StaticMethod | ...

    -- Different kinds of names
    type RName = QName     -- qualified names within forests
    type QName = [UqName]  -- qualified names within trees
    type UqName = String   -- unqualified names

In both forests, we associate types and members with API-usage references; see the occurrences of Ref. Depending on the forest, the references may be ‘inbound’ (from project to API) or ‘outbound’ and each reference may be classified by the (syntactic) pattern expressing it. Thus:

    data Ref = Ref Direction Pattern Element RName
    data Direction = Outbound | Inbound
    data Pattern = InstanceMethodCall | ExtendsClass | ...

The components of a reference carry different meanings depending on the chosen direction:

    Component   Outbound               Inbound
    Pattern     Project pattern        Project pattern
    Element     API element            Project element
    RName       Name of API element    Name of project element
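As a small usage sketch of this data model, the following Haskell fragment walks a package tree and pairs every reference with the qualified name of the type that holds it. The elided constructors ("...") of Element and Pattern are simply left out here, so this is a closed-off approximation of the model; the function refsByType and the sample tree are our own illustration.

```haskell
type UqName = String
type QName  = [UqName]
type RName  = QName

data PackageTree  = PackageTree [PackageLayer]
data PackageLayer = PackageLayer UqName [PackageLayer] [Type]
data Type         = Type UqName [Member] [Ref]
data Member       = Member Element UqName [Type] [Ref]
-- Only the constructors listed in the text; further ones are elided there.
data Element      = Interface | Class | InstanceMethod | StaticMethod
  deriving (Eq, Show)
data Ref          = Ref Direction Pattern Element RName deriving (Eq, Show)
data Direction    = Outbound | Inbound deriving (Eq, Show)
data Pattern      = InstanceMethodCall | ExtendsClass deriving (Eq, Show)

-- Every reference in the tree, paired with the qualified name of the
-- (possibly nested) type it belongs to; member references are
-- attributed to the enclosing type.
refsByType :: PackageTree -> [(QName, Ref)]
refsByType (PackageTree layers) = concatMap (layer []) layers
  where
    layer prefix (PackageLayer name subs types) =
      let here = prefix ++ [name]
      in concatMap (ty here) types ++ concatMap (layer here) subs
    ty prefix (Type name members refs) =
      let here = prefix ++ [name]
      in [ (here, r) | r <- refs ] ++ concatMap (member here) members
    member owner (Member _ _ nested refs) =
      [ (owner, r) | r <- refs ] ++ concatMap (ty owner) nested

-- Illustrative references and tree; not extracted from JHotDraw.
typeRef, memberRef :: Ref
typeRef   = Ref Outbound ExtendsClass Class ["example", "api", "Base"]
memberRef = Ref Outbound InstanceMethodCall InstanceMethod
                ["example", "api", "Base", "run"]

sampleTree :: PackageTree
sampleTree = PackageTree
  [ PackageLayer "org"
      [ PackageLayer "demo" []
          [ Type "Widget" [Member InstanceMethod "start" [] [memberRef]]
                 [typeRef] ]
      ]
      []
  ]
```

The length of the resulting list for a project tree gives #ref for that project, and filtering by Pattern gives the configured variants of the metric.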

The project forest is obtained by walking the primary representation of projects and deriving the forest as a projection/abstraction at all levels. The API forest is obtained by a (non-trivial) transposition of the project forest to account for the project-specific jars and memory constraints on simultaneously open projects.
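The core idea of the transposition can be sketched on flat maps: outbound references, keyed by project element, are regrouped by the API element they target, yielding inbound references. This Haskell sketch deliberately ignores the complications named above (project-specific jars, memory constraints) and uses simplified types of our own; it is not the tool's implementation.

```haskell
import qualified Data.Map.Strict as Map

type RName = [String]

data Pattern = InstanceMethodCall | ExtendsClass deriving (Eq, Show)

-- Project side: project element -> [(project pattern, API element)].
type Outbound = Map.Map RName [(Pattern, RName)]
-- API side: API element -> [(project pattern, project element)].
type Inbound  = Map.Map RName [(Pattern, RName)]

-- Regroup references by their target. The pattern is kept as is,
-- because it is a project pattern in both directions.
transposeRefs :: Outbound -> Inbound
transposeRefs out =
  Map.fromListWith (flip (++))
    [ (apiEl, [(pat, projEl)])
    | (projEl, refs) <- Map.toList out
    , (pat, apiEl) <- refs
    ]

-- Illustrative data; only the first entry is mentioned in the text.
sampleOutbound :: Outbound
sampleOutbound = Map.fromList
  [ ( ["velocity", "AnakiaJDOMFactory"]
    , [(ExtendsClass, ["org", "jdom", "DefaultJDOMFactory"])] )
  , ( ["informa", "ExampleParser"]
    , [(InstanceMethodCall, ["org", "jdom", "Element", "getChild"])] )
  , ( ["velocity", "ExampleTool"]
    , [(InstanceMethodCall, ["org", "jdom", "Element", "getChild"])] )
  ]
```

Looking up an API element in the transposed map lists the project elements that reference it, which is the information an API-centric view needs.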

B. View descriptions

We continue with the descriptions of views. These are the executable models that are interpreted on top of the forests of APIs and projects. Here is the overall structure of these descriptions:

5 Products are formed with “(...)”. Lists are formed with “[...]”. We use Haskell’s data types to group alternatives (as in a sum); they are separated by ‘|’. Each alternative groups components (as in a product) and is labeled by a constructor name. Enums are degenerate sums where the constructor name stands alone without any components. Other types may suffice with type aliases on top of existing types.


End of Lecture


In this course, we could carry out research designs to study API usage in an exploratory or more definitive manner (e.g., in experiments). Further, we could discuss research on API migration, which is informed by API-usage analysis. We could also look into language usage analysis as a similar research domain.

Rather we could do a separate course on “program comprehension” or “mining software repositories”.
