
Department of Computer Science

Retrieving Software Component using Clone Detection and Program Slicing

Memorandum CS-07-03

Maslita Abdul Aziz, Dr. Siobhán North

February 2007


Abstract: Identifying appropriate software components in a library, or software component retrieval, is an important task in software reuse. This report is compiled from my current understanding of source code retrieval. A retrieval system should pay attention to cost, recall and precision, and simple query formation. Furthermore, the system must help the reuse developer to understand the suitable code component. The proposed research will concentrate on the properties of the source code and try to comprehend the overall features of the code rather than looking at the semantics of natural language. Clone detection and program slicing are promising techniques for analysing the properties of source code. The research will therefore address these problems in four phases: (i) tagging the user query to class and method names, (ii) constructing patterns of code features using a clone detection technique, (iii) searching for components in the library based on the clone patterns and (iv) aiding the comprehension of the retrieved code components with a program slicing technique.


Contents

1 Introduction to Software Reuse
2 Software Reuse Retrieval
2.1 Retrieval in General
2.2 Code Retrieval
2.2.1 Descriptive Method
Free-text Keyword
Faceted Classification
Other Classification Scheme
Knowledge Based with Semantic nets
Knowledge Based with Neural Network
Special Purpose Search Engine
2.2.2 Property of Software Component
Behavioural Sampling
Signature Matching
Specification Matching
2.3 Recommendation Method
3 Outstanding Issues in Code Retrieval
4 Conclusion and Future Work
4.1 Clone Detection Technique
4.2 Program Slicing
4.3 References


1 Introduction to Software Reuse

Software reuse is defined as the "process of using existing software components rather than building them from scratch" (Krueger, 1992). Reusable software components include not only generic source code but also other aspects of the software lifecycle, including design structures, specifications and documentation. Software reuse has been recognised as one of the most realistic and promising ways to improve software productivity, quality and reliability (Vitharana and Jain, 2000), shorten the time required to release software for client use (Due, 2000; Kim and Stohr, 1998; Ugurel et al., 2002) and reduce maintenance costs (Hooper and Chester, 1991). Gibb et al. (2000) highlight two advantages of software reuse: ‘(1) those components that have already been tested provide higher guarantees of robustness and reliability in any future implementation, and (2) component reuse should lead to faster development times and lower costs’.

In addition, software reuse needs effective retrieval techniques to make development with reusable components more convenient than development from scratch (Krueger, 1992). Software retrieval is concerned with locating and identifying appropriate components to satisfy users’ requirements. It is considered one of the key technical issues in software reuse because “You must find it before you can reuse it” (Barringer et al., 1984). Although software reuse has been promoted as an effective means of developing software products (Krueger, 1992; Mili et al., 1995), in practice it has proved hard to achieve and of limited use (Biggerstaff and Richter, 1987; Fischer, 1987; Biggerstaff, 1992; Henninger, 1995).

Several problems account for this disappointment, most notably that existing source code rarely fits, that programmers do not know what code is available (Ye et al., 2000; Ye and Fisher, 2002; Ohsugi et al., 2002; McCarey et al., 2004; McCarey et al., 2005), and that once source code is found it is difficult to understand how it works or how it can be used (Fischer, 1987). Most reuse efforts centre on libraries of reusable components, but to be useful these libraries should hold many examples of source code and offer ways to find and choose the most suitable one (Henninger, 1995). Although many solutions have been proposed, reuse repositories are growing rapidly, and efficient means of retrieving software components from large repositories remain a major research topic in software reuse (Sugumaran and Storey, 2003).

Retrieval techniques are receiving a good deal of attention (Frakes and Gandel, 1990) and play a key role in software reuse (Henninger, 2002). Many solutions to software retrieval have been proposed and implemented, and they have been classified in different ways that vary only slightly (Yao and Etzkorn, 2004). Only a few of these classifications are discussed here. Ostertag et al. (1992) classified the retrieval approaches into three types: free-text keywords, faceted index and semantic-net based. In their survey of retrieval approaches, Mili et al. (1998) group the retrieval methods into four types: simple keyword and string match, faceted classification and retrieval, signature matching and behaviour matching. Sugumaran and Storey (2003) combine the observations of Ostertag et al. and Mili et al. and arrive at five types of retrieval method. They consider free-text keyword to be the same as simple keyword and string match, so their retrieval schemes are grouped into keyword search, faceted classification, signature matching, behavioural matching and semantic-based methods (Sugumaran and Storey, 2003).

Ali and Du (2004) agree on the free-text keyword and faceted indexing categories but group semantic nets under knowledge-based methods, a category in which they also include frame-based classification schemes. They argue that knowledge-based systems use lexical, syntactic and semantic analysis of natural language specifications, while frame-based systems try to capture the meaning of software components by using frames to represent conceptual objects. A distinct classification is suggested by Bouchachia et al. (2000), who separate formal from informal methods. The formal methods consist of specification, signature and sampling, the last of which is known as behavioural matching in other literature. The informal methods rely on natural language and comprise hypertext, knowledge-base and information retrieval techniques. Other literature focuses on specific methods without discussing or categorising current approaches.

In my observation, these approaches can be classified into two main categories: descriptive classification methods supported by natural language, and property-based methods. Descriptive classification methods employ natural language, especially as information for the indexing process. They comprise free-text keyword, faceted classification, other classification schemes, knowledge-based methods with semantic nets, knowledge-based methods with neural networks and special purpose search engines. Methods that focus on the properties of software components, on the other hand, are behavioural sampling, signature matching, specification matching and the recommendation approach. Nevertheless, a few recommendation approaches also use natural language to extract keywords from code documentation such as comments.


2 Software Reuse Retrieval

2.1 Retrieval in General

There are many techniques that can be used to represent and access software repositories (Pighin and Brajnik, 2000). Among these, descriptive methods are widely used (Mili et al., 1998). They match keyword-based queries against assets that are represented by lists of descriptive keywords (Mili et al., 1998). The simplest way to describe a software asset is to assign it a keyword (Mili et al., 1998). Damiani and Fugini (1996a; 1996b) developed a classification mechanism for the semi-automatic extraction of keywords from C++ source code classes and from software design documentation such as Entity-Relationship diagrams. Keywords are classified into term pairs that name features describing the component functionalities (Damiani and Fugini, 1996a; 1996b). Each feature is then assigned a relevance weight between 0 and 1, and the weighting function is later interpreted as a fuzzy set (Damiani and Fugini, 1996b).
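To make the idea of weighted feature descriptors concrete, the following is a minimal Python sketch. It is not the mechanism of Damiani and Fugini: the component names, the feature pairs, the weights and the choice of the minimum as the fuzzy conjunction are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str]  # a term pair naming one feature, e.g. ("sort", "list")

# Hypothetical descriptors: each feature carries a relevance weight in [0, 1].
DESCRIPTORS: Dict[str, Dict[Feature, float]] = {
    "ListSorter": {("sort", "list"): 0.9, ("compare", "item"): 0.4},
    "FileCopier": {("copy", "file"): 0.8, ("sort", "list"): 0.1},
}


def membership(descriptor: Dict[Feature, float], query: List[Feature]) -> float:
    """Degree to which a component satisfies every queried feature.

    The minimum is used as the fuzzy AND (an assumption of this sketch):
    a component matches only as well as its weakest queried feature.
    """
    if not query:
        return 0.0
    return min(descriptor.get(feature, 0.0) for feature in query)


def rank(query: List[Feature]) -> List[Tuple[str, float]]:
    """Return the components with non-zero membership, best match first."""
    scored = [(name, membership(d, query)) for name, d in DESCRIPTORS.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(rank([("sort", "list")]))
    print(rank([("sort", "list"), ("compare", "item")]))
```

Treating the weights as fuzzy memberships means that a query returns a ranked list rather than a strict yes or no answer, which is the main practical difference from plain keyword matching.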

Descriptive methods usually group the keywords that describe the software component. Faceted classification is one such descriptive method, used by Ali and Du (2004) to classify and retrieve object-oriented design models. Faceted classification makes use of the is-a relationship between features, known as facets, or classes of objects to be retrieved (Behle et al., 2002). The approach assumes that a classification system needs several terms to give the full picture (Behle et al., 2002). Ali and Du (2004) set up six facets to classify the attributes of object-oriented design models. Aware of the criticism of the faceted method by Mili et al. (1997), Ali and Du (2004) argue that object-oriented design is organised around objects, which makes it suitable for faceted classification.

Retrieval techniques that classify software artifacts are considered a key success factor for software reuse projects, especially for artifacts other than program code (Basili and Rombach, 1991; Prieto-Diaz, 1993; Pfleeger, 1996; Damiani and Fugini, 1996; Damiani et al., 1997). Classification schemes use both information in the source code and natural language available in comments and README files to automate the categorization (Ugurel et al., 2002; Krovetz et al., 2003; Bouchachia and Mittermeir, 2001; Bouchachia et al., 2000).

Among these are Vitharana et al. (2003a; 2003b), who borrowed the idea of faceted classification, normally used to describe features of binary code components, but whose classification and coding scheme considers higher-level, business-oriented features. The business components are classified and described within a knowledge-based repository coded in eXtensible Markup Language (XML) (Vitharana et al., 2003a; 2003b). They propose that the knowledge base could catalogue the components systematically. This systematic approach is important since there can be a large number of components in a particular application domain (Vitharana et al., 2003a; 2003b; Yao and Etzkorn, 2004).

The retrieval system by Behle et al. (2002) combined faceted classification with other classification schemes. Their system does not store the components themselves but holds information about them, which makes the system capable of managing different kinds of software components (Behle et al., 2002). They suggested that the classification scheme has to be general, adaptable and extensible, and therefore built a Universal Classification Approach. This approach groups any feature that is useful for distinguishing components (Behle et al., 2002).

2.2 Code Retrieval

Source code is a special case of a document which contains additional information embedded in its structure (Mishne and de Rijke, 2004). Other structured texts whose effective retrieval has been extensively researched are HTML and XML (Anslow et al., 2004). Research shows that using structural knowledge of HTML and XML documents may improve results significantly (Wilkinson, 1994; Myaeng, 1998). There are also various source code archives and open source sites on the World Wide Web that could be used to locate useful existing code (Ugurel et al., 2002; Krovetz et al., 2003).

2.2.1 Descriptive Method

Early forms of code retrieval were based on the descriptive method, a classification scheme for cataloguing code components with a set of keywords (Prieto-Diaz and Freeman, 1987; Prieto-Diaz, 1990; Prieto-Diaz, 1991). Classification is the process of grouping items or objects with a shared characteristic into categories (Frakes, 2004). In other words, the approach depends on a textual description of software components (Mili et al., 1998). The schemes specify attributes to be used as keywords or descriptors of a software component, mostly focusing on the action that the component performs and on the objects it manipulates (Mili et al., 1998). Classification and retrieval are performed by specifying a list of descriptive keywords for each attribute in the scheme (Kawaguchi et al., 2006). These schemes include free-text keyword and faceted classification (Berglund, 2002).

Free-text Keyword

The descriptive method based on free-text indexing is the simplest and most widely used technique for indexing or searching (Frakes and Pole, 1994; Girardi and Ibrahim, 1995; Mark and Bernstein, 2002). Free-text keywords allow words anywhere in the natural-language documentation to be indexed or searched (Frakes and Pole, 1994; Ali and Du, 2004). The method implements an uncontrolled vocabulary because one or more keywords are assigned to the software component without any constraint on the number of keywords (Frakes and Pole, 1994; Bouchachia and Mittermeir, 2001). In essence, this approach uses information retrieval and indexing technology to extract keywords automatically from software documentation without semantic knowledge (Girardi and Ibrahim, 1995; Yao and Etzkorn, 2004; Ali and Du, 2004). It uses the frequency of words in natural language descriptions as an indicator of the relevance of the content to a given topic (Bouchachia et al., 2000). The document is not interpreted, so the approach attempts to characterize the document rather than to understand it (Maarek et al., 1991).

Free-text indexing has been implemented to index automatically the terms extracted from the descriptive headers of C modules and functions (Frakes and Nejmeh, 1988). The implementation uses the CATALOG information-retrieval system, which stores the indexes in an inverted file (Frakes and Nejmeh, 1986). Retrieval is done using Boolean search techniques, so that for the query ‘convolution algorithms’ the system returns items that contain the term convolution or the term algorithms (Frakes and Gandel, 1989). Furthermore, CATALOG tries to find other words in the database which might be related and lists all the possibilities for the user to select from (Frakes and Nejmeh, 1986). For the query “sorting”, the system matches terms like “sort”, “sorting” and “sorts”. Users can place the wild cards “\”, “*” or “?” to control the ‘related word’ feature: “\” yields an exact match, the query ‘airlin*’ finds ‘airline’, ‘airlines’ or ‘airliner’, and the query ‘airlin?’ finds ‘airline’.
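The inverted file and the wildcard expansion described above can be sketched in a few lines of Python. This is an illustration only, not CATALOG itself: the header texts and file names are invented, and Python's `fnmatch` patterns stand in for CATALOG's wild cards (`*` for any run of characters, `?` for exactly one character; the “\” exact-match marker is not modelled, since a term without wild cards already matches exactly here).

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_index(headers):
    """Build an inverted file: each word maps to the set of items containing it."""
    index = defaultdict(set)
    for name, text in headers.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:()")
            if word:
                index[word].add(name)
    return index


def expand(term, index):
    """List the indexed words that a (possibly wildcarded) term matches."""
    return sorted(word for word in index if fnmatchcase(word, term.lower()))


def search_any(query, index):
    """Boolean OR search: items containing at least one of the query terms."""
    hits = set()
    for term in query.split():
        for word in expand(term, index):
            hits |= index[word]
    return hits


if __name__ == "__main__":
    # Hypothetical descriptive headers of three C modules.
    headers = {
        "conv.c": "Fast convolution of two signals.",
        "sort.c": "Sorting algorithms for airline records",
        "fleet.c": "Tracks airlines and each airliner",
    }
    index = build_index(headers)
    print(expand("airlin*", index))
    print(expand("airlin?", index))
    print(sorted(search_any("convolution algorithms", index)))
```

Because the index maps words to items, a wildcard only has to be expanded against the vocabulary, not against every document, which is why inverted files suit this kind of related-word lookup.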

Keywords are extracted from the documentation by determining the frequency of words (Ali and Du, 2004); that is, the words found most often are included in the index. Word stemming is also applied, so that the words ‘searching’ and ‘search’ are treated as the same concept. However, keywords cannot fully describe an object. For example, it is difficult to express inheritance using keywords (Ali and Du, 2004). Furthermore, the approach is limited to the lexical level and does not use syntactic and semantic information, which makes it imprecise (Girardi and Ibrahim, 1995; Yao and Etzkorn, 2004).
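Frequency-based keyword extraction with stemming can be sketched as follows. The stop-word list and the suffix-stripping rule are deliberately crude assumptions made for this example; a real system would use an established stemming algorithm.

```python
from collections import Counter

# A tiny, hypothetical stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is", "with"}


def crude_stem(word):
    """Strip a common suffix, keeping at least four letters of the stem.

    This is a toy rule for illustration, not a real stemming algorithm.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def index_terms(text, top_n=3):
    """Return the top_n most frequent stems in the text, most frequent first."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    stems = [crude_stem(w) for w in words if w and w not in STOP_WORDS]
    return [term for term, _ in Counter(stems).most_common(top_n)]


if __name__ == "__main__":
    doc = "Searching a sorted file. The search sorts files, and searching stops."
    print(index_terms(doc))
```

In this sketch ‘searching’ and ‘search’ collapse to one index term, so a query using either form finds the same component. Nothing in the index records how the words relate to each other, which is the lexical-level limitation noted above.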

The GURU system (Maarek et al., 1991) also classifies software components according to attributes automatically extracted from documentation pages or comments, based on lexical affinities and their statistical distribution. Lexical affinities are used to represent the functional description in the documentation (Maarek et al., 1991). The authors tried to overcome the disadvantage of the usual form of free-text indexing, the single-term index, in which each word is indexed without contextual information (Girardi and Ibrahim, 1995). They proposed the use of term phrases, or pairs of words. For example, the pair ‘file overwrite’ is more meaningful than the single word ‘new’ (Maarek et al., 1991). They also use frequency of appearance as the indicator of relevance (Maarek et al., 1991). The approach exhibits better precision than single keyword-based indexing even without attempting to understand the text using semantic knowledge (Girardi and Ibrahim, 1995).
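The following Python sketch illustrates indexing by word pairs rather than single words. It is not GURU's implementation: the window of five words, the stop-word list and the sample sentences are assumptions of this example, and raw pair counts stand in for the statistical measure used by the authors.

```python
from collections import Counter

# Hypothetical stop-word list and co-occurrence window (assumptions of this sketch).
STOP_WORDS = {"a", "an", "the", "if", "it", "is", "of", "to", "and"}
WINDOW = 5


def lexical_affinities(sentence, window=WINDOW):
    """Count unordered pairs of content words that co-occur within the window."""
    words = [w.strip(".,;:()").lower() for w in sentence.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    pairs = Counter()
    for i, first in enumerate(words):
        for second in words[i + 1 : i + 1 + window]:
            if first != second:
                # Sort the pair so that word order within the sentence is ignored.
                pairs[tuple(sorted((first, second)))] += 1
    return pairs


def index_document(sentences, top_n=3):
    """Return the most frequent word pairs over all sentences of a document."""
    total = Counter()
    for sentence in sentences:
        total += lexical_affinities(sentence)
    return total.most_common(top_n)


if __name__ == "__main__":
    manual_page = [
        "Overwrite the file if it exists.",
        "The file overwrite is silent.",
        "Create a new file.",
    ]
    print(index_document(manual_page, top_n=1))
```

Here the pair (file, overwrite) is counted twice even though the two sentences order the words differently, so it becomes the leading index entry, whereas a single-term index would simply rank ‘file’ first.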

This less rigorous retrieval technique also caught the attention of Pighin and Brajnik (2000). They built ALICE (Application of Information Retrieval to Catalogues of Existing Software) on a fully automatic code indexing technique with free-text queries (Pighin and Brajnik, 2000; Gonzalez and Van de Meer, 2004). The experiment on ALICE suggested that free-text keywords help to solve the reuse problem: users were satisfied with the information retrieved and with the process of searching for and identifying relevant components, and could complete their projects in less time regardless of their background experience (Pighin and Brajnik, 2000; Pighin, 2001).

Faceted Classification

Although free-text indexing is considered a straightforward method, the faceted approach, which also belongs to the family of descriptive methods, is discussed extensively in the reuse literature (Mili et al., 1998). The faceted approach was introduced by Prieto-Diaz (Prieto-Diaz, 1990; Prieto-Diaz, 1991). It is also known as a controlled vocabulary approach because vocabulary control is imposed to ensure a uniform meaning for the component descriptions (Bouchachia and Mittermeir, 2001). Prieto-Diaz extended faceted classification as known in library science with the concept of similarity (Prieto-Diaz, 1991). In faceted classification, software descriptions are understood before index terms are assigned (Prieto-Diaz, 1991; Berglund, 2002; Behle et al., 2002; Kawaguchi et al., 2006). A facet is a set of concepts from which the components are viewed, and each facet includes a list of keywords defined by experts (Bouchachia and Mittermeir, 2001). In short, a facet is simply a category that expresses some aspect of the knowledge being described.

The faceted scheme by Prieto-Diaz is divided into two major areas, the functionality of the component and its environment (Prieto-Diaz and Freeman, 1987). Each area is further subdivided into three facets. Functionality is described by function (what the software does), object (the objects manipulated by the function) and medium (the data structures or devices that the function uses). Environment is described by system type (application-independent modules), functional area (application-dependent activities or sets of procedures) and setting (where the application is used) (Prieto-Diaz and Freeman, 1987; Swanson and Samadzadeh, 1992; Girardi and Ibrahim, 1995; Ugurel et al., 2002).

A thesaurus is used to convert all of the definitions to descriptive words of similar meaning which leads to consistency of the comparisons (Ugurel et al., 2002). In order to retrieve the components, queries in faceted classification are created by selecting keywords from the facets and matching them against the known classification (Swanson and Samadzadeh, 1992). Then a weighted conceptual graph is used to measure closeness by the conceptual distance among terms in a facet to determine similarity between query and software components (Swanson and Samadzadeh, 1992).
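One way to picture closeness in a weighted conceptual graph is as the cheapest path between two terms. The sketch below uses Dijkstra's algorithm over a small invented graph; the terms are taken from the function facet, but the edges, the weights and the use of shortest-path distance are assumptions of this example rather than the published scheme.

```python
import heapq

# Hypothetical weighted conceptual graph over a few function-facet terms.
# A smaller weight means the two terms are conceptually closer.
GRAPH = {
    "substitute": {"modify": 1.0},
    "modify": {"substitute": 1.0, "insert": 2.0, "delete": 2.0},
    "insert": {"modify": 2.0, "add": 1.0},
    "add": {"insert": 1.0},
    "delete": {"modify": 2.0},
}


def conceptual_distance(source, target):
    """Cheapest path weight between two terms (Dijkstra); inf if unconnected."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, term = heapq.heappop(queue)
        if term == target:
            return dist
        if dist > best.get(term, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in GRAPH.get(term, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")


def closest_terms(term, candidates):
    """Order candidate terms from conceptually closest to farthest."""
    return sorted(candidates, key=lambda c: conceptual_distance(term, c))


if __name__ == "__main__":
    print(conceptual_distance("substitute", "add"))
    print(closest_terms("substitute", ["add", "modify", "delete"]))
```

With such a distance available, a component classified under a term near the queried one can still be returned and ranked, instead of being missed because the terms are not identical.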

In the system described by Prieto-Diaz and Freeman (1987), a user starts by selecting a representative term for each of the six facets shown in Table 1. For example, the complete query selected by the user when searching for a component to change backspaces to multiple lines is <substitute, backspaces, file, text_formatter, program_development, software_shop> (Prieto-Diaz and Freeman, 1987). When a query does not yield a satisfactory answer, the user may modify or expand the query by generalization or specialization. In the above example, <substitute, backspaces> could be replaced by <substitute, quotes>, <substitute, blank> or <substitute, digit>, or simply by <substitute, *>.
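A faceted query and its generalization with `*` can be sketched as tuple matching. The catalogue entries and component names below are hypothetical; only the six-facet tuple layout and the example query come from the text above.

```python
FACETS = ("function", "objects", "medium", "system_type", "functional_area", "setting")

# Hypothetical catalogue: component name -> one term per facet.
CATALOGUE = {
    "strip_bs": ("substitute", "backspaces", "file", "text_formatter",
                 "program_development", "software_shop"),
    "requote": ("substitute", "quotes", "file", "text_formatter",
                "program_development", "software_shop"),
    "pad_buf": ("add", "blanks", "buffer", "line_editor",
                "program_development", "software_shop"),
}


def matches(query, descriptor):
    """A descriptor matches when every facet is equal or the query uses '*'."""
    return all(q == "*" or q == d for q, d in zip(query, descriptor))


def retrieve(query):
    """Return the names of all catalogued components matching the query tuple."""
    if len(query) != len(FACETS):
        raise ValueError("a query needs one term (or '*') per facet")
    return sorted(name for name, desc in CATALOGUE.items() if matches(query, desc))


if __name__ == "__main__":
    exact = ("substitute", "backspaces", "file", "text_formatter",
             "program_development", "software_shop")
    print(retrieve(exact))
    # Generalize the objects facet, as in <substitute, *>.
    general = ("substitute", "*") + exact[2:]
    print(retrieve(general))
```

Replacing a term by `*` widens the result set, which corresponds to the generalization step; replacing `*` by a specific term narrows it again.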

| Function | Objects | Medium | System type | Functional area | Setting |
|---|---|---|---|---|---|
| add | arguments | array | assembler | account payable | advertising |
| append | arrays | buffer | code generation | accounts receivable | appliance repair |
| close | backspaces | cards | code optimization | analysis structural | appliance store |
| compare | blanks | disk | compiler | auditing | association |
| complement | buffers | file | db management | batch job control | auto repair |
| compress | characters | keyboard | expression evaluator | billing | barbershop |
| create | descriptors | line | file handler | bookkeeping | broadcast station |
| decode | digits | list | hierarchical db | budgeting | cable station |
| delete | directories | mouse | hybrid db | capacity planning | car dealer |
| divide | expressions | printer | interpreter | cad | catalog sales |
| evaluate | files | screen | lexical analyzer | cost accounting | cemetery |
| exchange | functions | sensor | line editor | cost control | circulation |
| expand | instructions | stack | network db | customer information | classified ads |
| format | integers | table | pattern matcher | db analysis | cleaning |
| input | lines | tape | predictive parsing | db design | clothing store |
| insert | lists | tree | relational db | db management | composition |
| join | macros | ... | retriever | ... | computer store |
| measure | pages | | scheduler | | ... |
| modify | ... | | ... | | |
| move | | | | | |
| ... | | | | | |

Table 1: A partial listing of the faceted classification (Prieto-Diaz and Freeman, 1987)

An approach similar to faceted classification is REBOOT (Reuse Based on Object-Oriented Techniques) by Morel and Faget (1993). REBOOT was developed to organize large libraries of reusable object-oriented components (Morel and Faget, 1993). Its authors divided the components into four facets: Abstraction, Operations, Operates On and Dependencies (Morel and Faget, 1993; Ali and Du, 2004). The faceted concept is also used by Damiani et al. (1999a) to develop a hierarchy-aware classification scheme for object-oriented code. Software artifacts are classified according to their characteristics, for example the services provided, the algorithms employed and the data needed (Damiani et al., 1999b). As Prieto-Diaz uses a thesaurus, REBOOT constructs a descriptor for a component from the set of characteristics associated with it (Damiani et al., 1999a; 1999b).


Prieto-Diaz has performed studies on faceted classification in several company repositories and found that it works best for domain-specific repositories. If a repository stores software components from a broad or heterogeneous domain, the facets become too general and lose their descriptive precision (Prieto-Diaz, 1991; Marshall et al., 2004). Another problem that came to light during these studies is that the role of the experts is critical in finding proper abstractions and avoiding uncontrolled growth (Prieto-Diaz, 1991; Mishne and de Rijke, 2004).

Faceted classification yields good results and enhances software reuse in terms of locating existing code quickly and easily, provided the components are correctly classified (Ugurel et al., 2002; Krovetz et al., 2003). However, Mili et al. (1997) argue that the cost of manually classifying the components in faceted classification outweighs the benefit that facet-based queries offer over keyword-based queries. The argument is strengthened by experiments with document retrieval systems, which have shown that uncontrolled vocabularies produce retrieval results comparable to those produced with controlled vocabularies (Frakes and Gandel, 1989).

Other Classification Scheme

Lucena (2001) agreed that the task of classifying and storing components must be done by experts who are capable of identifying the characteristics of a component well before it is stored in a common repository. The people who best know the exact usage of a software component are its developers; therefore the responsibility for providing a meaningful classification of components belongs to the developers (Lucena, 2001). From that perspective, Lucena (2001) introduced a component publication procedure dedicated to the industrial automation domain, which uses pre-defined terms based on the industrial vocabulary. During publication, component developers are guided by a tutorial system to clarify the component’s usage, potential and constraints.

Apart from categorizing software components by facet, Ugurel et al. (2002) classified source code into appropriate application domains and programming languages using a learning technique for text categorization, the Support Vector Machine (SVM) (Yusof and Rana, 2004). They claim that programming languages have specific keywords and features that can be identified by an SVM (Ugurel et al., 2002; Krovetz et al., 2003). The features were extracted from words, bigrams and lexical phrases in comments and README files, while header file names were extracted from the source code (Ugurel et al., 2002). The features were then selected using a statistical measure, the expected entropy loss. To train the topic classifier they downloaded examples from the Ibiblio (Ibiblio) and the Sourceforge (Sourceforge) archives, and concentrated on C/C++ (Ugurel et al., 2002; Krovetz et al., 2003).
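The feature selection step can be illustrated with a small Python sketch. It assumes that expected entropy loss is computed in its usual information-gain form, that is, the entropy of the class before a feature is observed minus the expected entropy afterwards; whether this matches the authors' exact formula is an assumption. The toy documents and feature names are invented, and the SVM training step itself is not shown.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def class_entropy(docs, label):
    """Entropy of 'belongs to label' versus 'does not' over a set of documents."""
    if not docs:
        return 0.0
    positive = sum(1 for _, cls in docs if cls == label) / len(docs)
    return entropy([positive, 1 - positive])


def expected_entropy_loss(docs, feature, label):
    """Prior class entropy minus the expected entropy once the feature is known."""
    with_f = [d for d in docs if feature in d[0]]
    without_f = [d for d in docs if feature not in d[0]]
    p_f = len(with_f) / len(docs)
    posterior = p_f * class_entropy(with_f, label) + (1 - p_f) * class_entropy(without_f, label)
    return class_entropy(docs, label) - posterior


def select_features(docs, label, k):
    """Keep the k features with the highest loss (ties broken alphabetically)."""
    vocabulary = set().union(*(features for features, _ in docs))
    return sorted(vocabulary, key=lambda f: (-expected_entropy_loss(docs, f, label), f))[:k]


if __name__ == "__main__":
    # Hypothetical training documents: (set of extracted features, class label).
    docs = [
        ({"#include", "malloc", "main"}, "C"),
        ({"#include", "printf"}, "C"),
        ({"import", "class", "main"}, "Java"),
        ({"class", "public"}, "Java"),
    ]
    print(select_features(docs, "C", 2))
```

A feature that appears equally often in both classes, such as `main` in this toy data, has a loss of zero and is discarded, whereas a feature that separates the classes perfectly has the maximum loss and is kept for the classifier.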

The idea of categorizing source code is simple to understand and widely used. Merkl (1995) suggested organizing a library of reusable software components by using a self-organizing neural network. The software components are clustered into groups of semantically similar components. Keywords are automatically extracted from the manual documentation of software components. This seems a reasonable approach, assuming that comments and identifiers clearly reflect human concepts (Merkl, 1995). Archives on the Internet such as Ibiblio and Sourceforge are other examples of categorized source code (Seacord et al., 1998; Wallnau et al., 2002; ComponentSource, 2004; Erdur and Dikenelli, 2002; Yao and Etzkorn, 2004). The Ibiblio archive is organized hierarchically into the categories applications, commercial, development tools, games, hardware and drivers, science and system tools (Ibiblio). The Sourceforge archive is categorized under 43 different programming languages and 19 topics (Sourceforge).

In order to do a comparative empirical evaluation, Frakes and Pole (1994) designed PROTEUS, a software reuse system that supports different classification methods for software components which are simple keyword-based, faceted, enumerated graph and attribute value. They found that there was no statistical difference in the effectiveness of search using the different methods (Frakes and Pole, 1994; Frakes and Gandel, 1989; Alonso and Frakes, 2000). As there was no clear advantage to using methods requiring domain analysis, they recommended that querying using automatically extracted keywords is the best approach (Drummond et al., 2000).

The free-text keyword and faceted approaches are simple and effective for experienced users who are familiar with the proper terminology (Girardi and Ibrahim, 1995). Consequently, many researchers were inspired by Artificial Intelligence to use linguistic knowledge, which includes structural and semantic information about the software (Chen et al., 1993; Bouchachia and Mittermeir, 2001). The knowledge-based approach aims to understand the queries and the functionality of the components (Maarek et al., 1991). Novice users tend to use a related term or a more general or more specialized term (Girardi and Ibrahim, 1995). Girardi and Ibrahim (1995) therefore believed that the user should be able to submit a friendlier query in natural language, which frees the user from thinking of proper keywords, choosing Boolean combinations of keywords or selecting classification schemes.

Knowledge Based with Semantic nets

The ROSA (Reuse of Software Artifacts) system is based on a knowledge base that keeps semantic information about the natural language and the application domain (Ali and Du, 2004). The user can write a query as simple as ‘I want a component to search a file for a string’ (Girardi and Ibrahim, 1995). The approach falls under free-text indexing but adds a semantic net (Yao and Etzkorn, 2004). The knowledge base stores semantic information about the application domain and about the natural language itself (Girardi and Ibrahim, 1995).

Similar to the free-text approach, the retrieval system starts by extracting keywords from the query, and the keywords are then used to search for and locate components in the repository (Girardi and Ibrahim, 1995). It concentrates on words that occur in comments, documentation or meaningful variable names (Girardi and Ibrahim, 1995; Michail, 2002). The key principle of semantic nets is to detect equivalent or similar expressions in different sentences, such as the following:

a. search a string in a file

b. search a file for a string

c. look for a string in a file

d. examine a text file for lines matching a pattern

The similarity analysis infers that sentences (a), (b) and (c) are equivalent, because they have the same semantic structure and the contents of their semantic structures are related by synonymy. These three sentences are also similar to (d), because they share its semantic structure and their terms are related to its terms by hyponymy (a more specialized term) and hypernymy (a more generalized term).
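The comparison of sentences (a) to (d) can be mimicked with a small Python sketch. The frames are written by hand here, because parsing natural language is outside the scope of the example, and the tiny synonym and hypernym tables are hypothetical stand-ins for the knowledge base; the sketch is not the analysis performed by ROSA.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    """Hand-built semantic structure of one sentence (an assumption of this sketch)."""
    action: str
    target: str    # what is searched for
    location: str  # where it is searched


# Hypothetical lexicon: synonyms map to a canonical term, hypernyms to a broader term.
SYNONYMS = {"look for": "search", "examine": "search"}
HYPERNYMS = {"text file": "file", "string": "pattern"}


def related(a, b):
    """Classify the link between two terms as same, hierarchical or unrelated."""
    a, b = SYNONYMS.get(a, a), SYNONYMS.get(b, b)
    if a == b:
        return "same"
    if HYPERNYMS.get(a) == b or HYPERNYMS.get(b) == a:
        return "hierarchical"
    return "unrelated"


def compare(x, y):
    """Equivalent if all slots agree, similar if some differ only hierarchically."""
    links = [related(p, q) for p, q in zip(x, y)]
    if "unrelated" in links:
        return "different"
    return "equivalent" if all(link == "same" for link in links) else "similar"


if __name__ == "__main__":
    a = Frame("search", "string", "file")       # search a string in a file
    c = Frame("look for", "string", "file")     # look for a string in a file
    d = Frame("examine", "pattern", "text file")  # examine a text file for ... a pattern
    print(compare(a, c))
    print(compare(a, d))
```

Sentences (a) and (b) produce the same frame because word order is discarded when the slots are filled, which is why only one of them is shown.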

Another natural language processing system is that of Etzkorn et al. (1997). They designed a system called PATRicia that automatically identifies object-oriented software components by understanding comments and identifiers (Etzkorn et al., 1997). They found that object-oriented code is more reusable than functionally-oriented code. PATRicia uses a heuristic method that derives information from linguistic aspects of comments and identifiers and from non-linguistic aspects of object-oriented code such as the class hierarchy (Ugurel et al., 2002).

Nevertheless, the semantic net is also rigid, being confined to a narrow application domain, and the semantic information is usually pre-encoded manually (Maarek et al., 1991; Yao and Etzkorn, 2004). Furthermore, the knowledge base must also store semantic information about the language itself; for instance, only the semantics of English are processed (Maarek et al., 1991). Whereas a semantic net is constructed as a directed graph in which the nodes represent software components and the arcs represent the relationships between them, the frame-based approach aims at capturing the meaning of software components (Ali and Du, 2004).

A frame-based software component catalogue, forming part of a natural language processing system, was proposed by Wood and Sommerville (1988). The catalogue consists of a frame for the basic function that a software component performs, with slots for the objects manipulated by the component (Girardi and Ibrahim, 1995). The frames are not used to parse input sentences. Instead, the retrieval system presents them as templates to be filled in by the user via menus. The time-consuming search through the possible fillers is left to the user (Mili et al., 1998).

Knowledge Based with Neural networks

It is well accepted that documentation written as natural language text is preferable, especially because of its freedom of expression (Bouchachia and Mittermeir, 2001). However, when interacting with a software repository, users tend to expect the machine to understand their requests, which leads to ambiguity and fuzziness (Bouchachia and Mittermeir, 2001). Neural networks, another area of interest in Artificial Intelligence, are used to identify relations among documents and to keep the context of a particular word. Neural networks are associated with learning capabilities, and one related technique is fuzzy sets, which support fuzzy matching and fuzzy classification (Bouchachia and Mittermeir, 2001).

An unsupervised learning algorithm is useful for training on the data to be indexed. One such algorithm is the self-organizing map (SOM), which clusters and visualizes large, complex data sets (Kohonen, 2001). A SOM maps a high-dimensional input space onto a low-dimensional map in which similar data are found close together, with the help of an automatic free-text indexing method (Kohonen, 2001). SOM is considered unsupervised because no labelled training data is required; instead, the input data itself drives the learning process (Ye and Lo, 2001). The nested software self-organizing map (NSSOM) is an extension of SOM used to organize a software repository according to semantic or functional similarity (Ye and Lo, 2000, 2001). NSSOM was tested with 440 Unix manual pages (Ye and Lo, 2001). In this method, the most important features are selected using the feature competitive algorithm, in which a SOM is initially trained until a feature reduction ratio is reached (Ye and Lo, 2001).

A more recent technique based on SOM is the growing hierarchical self-organizing map (GHSOM), which extends the hierarchical SOM and the growing grid SOM (Tangsipairoj and Samadzadeh, 2005). GHSOM can build a hierarchy of multiple layers of independent growing SOMs, and the size of the SOMs as well as the depth of the hierarchy are determined during the learning process (Tangsipairoj and Samadzadeh, 2005). This is an advantage because in NSSOM the size of the map and the depth of the hierarchy are generally fixed (Freeman and Yin, 2004).

These systems are usually more powerful than traditional keyword retrieval systems because they draw semantic information about software components from a human expert. However, they usually require enormous human resources, because a knowledge base must be created for each application domain and populated from scratch (Girardi and Ibrahim, 1995).

Special Purpose Search Engine

With a large quantity of software freely available on the web, the Internet can be considered a giant software reuse library. Unfortunately, search engines such as Google, the most widely used search engine (Kobayashi et al., 2000; McFedries, 2003), are general search engines and are not customized for software component searches. Google search is keyword based and lacks domain knowledge, so it returns unrelated information, especially when searching for software components. Furthermore, general search engines do not consider software as a search target (Yao and Etzkorn, 2004).


An early approach to retrieving components on the web is Agora, a special purpose search engine that attempts to integrate component technologies and Web search engines for JavaBeans and CORBA (Seacord et al., 1998). It aims to create an automatically generated and indexed worldwide database of software products classified by component model. An agent is used to search the web, and introspection techniques are applied to discover the interface information of each component dynamically. Agora can automatically build indexes of the large number of software components available on the web.

Agora supports the basic operators ‘+’ and ‘-’ as well as the Boolean operators AND, OR, NOT and NEAR. The ‘+’ and ‘-’ operators indicate words or phrases that are required or prohibited in the search results. The user can narrow the search criteria by specifying characteristics of the component. Figure 1 shows the results of a JavaBeans search for components that must contain the methods color and draw and the property color but must not contain the term funscroll. The search returns the Uniform Resource Locator (URL) of each matching component.

Figure 1: Agora Query Interface (Seacord et al., 1998)
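The effect of the ‘+’ and ‘-’ operators can be sketched in a few lines of Python. This is only an illustration of required and prohibited terms, not Agora's implementation: the index `LIBRARY`, its URLs and the functions `matches` and `search` are hypothetical, and the Boolean operators AND, OR, NOT and NEAR are not handled.

```python
def matches(query, document_terms):
    """Return True if the terms satisfy every '+term' and contain no '-term'.

    Tokens without a '+' or '-' prefix are treated as optional and ignored.
    """
    terms = {t.lower() for t in document_terms}
    for token in query.split():
        if token.startswith("+") and token[1:].lower() not in terms:
            return False
        if token.startswith("-") and token[1:].lower() in terms:
            return False
    return True


def search(query, library):
    """Return the sorted URLs of all components whose terms satisfy the query."""
    return sorted(url for url, terms in library.items() if matches(query, terms))


# Hypothetical index: component URL -> terms discovered by introspection.
LIBRARY = {
    "http://example.org/beans/Painter": {"color", "draw"},
    "http://example.org/beans/Scroller": {"color", "draw", "funscroll"},
    "http://example.org/beans/Timer": {"start", "stop"},
}

print(search("+color +draw -funscroll", LIBRARY))
```

With this toy index only the Painter component is returned, because Scroller contains the prohibited term and Timer lacks the required ones.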

Yao and Etzkorn (2004) also developed a special purpose search engine for large software component libraries and the web. Their tool uses program understanding techniques to mark components with semantic descriptions, and a natural language search engine lets the user describe the intended software component (Yao and Etzkorn, 2004).


2.2.2 Property of Software Component

The basic problem with a retrieval system is the accuracy with which it identifies the information that serves to index the library (Michail, 2002). The popular approach is to use keywords or other natural language information, since people in general prefer to communicate in their own language. Nevertheless, other properties of software components should also be taken into account. The methods discussed previously do not exploit the properties that distinguish software components from other texts (Podgurski and Pierce, 1993).

The fact that software components can be executed offers an alternative mode of retrieval. Podgurski and Pierce (1993) proposed the behaviour sampling technique, which exploits the functional behaviour of a software component: it can be executed on given inputs to produce outputs.

Another property of source code that has been studied is its type information. Currently there are two known approaches, signature matching and specification matching (Khayati and Giraudin, 2002).

Behavioural Sampling

Retrieval based on the behavioural method identifies code by executing it on sample inputs and checking the results against a list of expected outputs. Podgurski and Pierce (1993) and Hall (1993) proposed retrieving source code by executing it with sample inputs and comparing the output with the output specified by the user (Atkinson and Duke, 1995). The input samples are either generated randomly from the argument domains of the function component (Podgurski and Pierce, 1993) or specified by the user (Hall, 1993).

The retrieval process is likely to be long because of the actual execution, and it faces the problem of non-termination, which makes it impractical (Park, 2000). Park (1996) refined the execution-based approach by executing each component at least once and storing the sample inputs and outputs for future use. The samples are reused during code retrieval so that the source code does not need to be executed again (Park, 1996). To avoid using generated or user-supplied sample inputs, Park (2000) lists samples chosen by the developers of the components, with the argument that the developers know their components well and can provide good samples to describe them.
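The basic execution-based idea can be sketched as follows. The candidate components and the sample pairs are hypothetical, and the sketch deliberately ignores the non-termination problem noted above: a real system would have to run each candidate under a time limit or, as Park suggests, compare against stored samples instead of executing code at query time.

```python
def retrieve_by_behaviour(candidates, samples):
    """Return the names of candidates that produce the expected output on every sample.

    candidates: dict mapping a component name to a callable.
    samples:    list of (argument_tuple, expected_output) pairs.
    """
    hits = []
    for name, func in candidates.items():
        try:
            if all(func(*args) == expected for args, expected in samples):
                hits.append(name)
        except Exception:
            # A component that fails on a sample input is simply not a match.
            continue
    return sorted(hits)


# Hypothetical library of string components.
CANDIDATES = {
    "concat": lambda a, b: a + b,
    "first": lambda a, b: a,
    "longer": lambda a, b: a if len(a) >= len(b) else b,
}

# User-specified behaviour: expected outputs for two sample inputs.
SAMPLES = [(("abc", ""), "abc"), (("abc", "xyz"), "abcxyz")]

print(retrieve_by_behaviour(CANDIDATES, SAMPLES))
```

A single sample such as (("abc", ""), "abc") is satisfied by all three candidates, which illustrates why the choice of samples matters for precision.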

As explained by Park (2000), Table 2 describes five functions with their respective samples. Each sample consists of the actual arguments before execution, the arguments after execution and the return value after execution. A concept lattice of the concepts and their subconcept-superconcept relations is then computed as follows:


Concept Bottom (B) = ({}, {s1, s2, s3, s4, s5, s6, s7, s8, s9, s10})
Concept X1 = ({scopy}, {s1, s4, s8})
Concept X2 = ({sswap}, {s1, s5, s9})
Concept X3 = ({srmv}, {s2, s6, s10})
Concept X4 = ({strim}, {s2, s7, s10})
Concept X5 = ({sconcat}, {s3, s4, s10})
Concept X6 = X1 ∨ X2 = ({scopy, sswap}, {s1})
Concept X7 = X1 ∨ X5 = ({scopy, sconcat}, {s4})
Concept X8 = X3 ∨ X4 ∨ X5 = ({srmv, strim, sconcat}, {s10})
Concept X9 = X3 ∨ X4 = ({srmv, strim}, {s2, s10})
Concept Top (T) = ({scopy, sswap, srmv, strim, sconcat}, {})

If a user submits the query Q1 = {s10}, the system retrieves the components srmv, strim and sconcat under concept X8. From Table 2 we know that s10 is a valid sample for these three components. For the query Q2 = {s10, s4}, the system looks at concept X8, which has s10, and concept X7, which has s4. It then finds concept X5, their common subconcept, and retrieves the component sconcat.

Components and their samples:

scopy (str1, str2): copy the string str2 to the string str1
    s1 = [(“abc”, “abc”), (“abc”, “abc”), void]
    s4 = [(“”, “abc”), (“abc”, “abc”), void]
    s8 = [(“abc”, “xyz”), (“xyz”, “xyz”), void]

sswap (str1, str2): swap the string str1 and the string str2
    s1 = [(“abc”, “abc”), (“abc”, “abc”), void]
    s5 = [(“xyz”, “abc”), (“abc”, “xyz”), void]
    s9 = [(“abc”, “xyz”), (“xyz”, “abc”), void]

srmv (str1, str2): remove the string str2 from the string str1
    s2 = [(“xyz”, “xyz”), (“”, “xyz”), void]
    s6 = [(“xyzabcxyz”, “abc”), (“xyzxyz”, “abc”), void]
    s10 = [(“abc”, “”), (“abc”, “”), void]

strim (str1, str2): trim the string str2 from the string str1 both at the beginning and at the end
    s2 = [(“xyz”, “xyz”), (“”, “xyz”), void]
    s7 = [(“abcxyzabc”, “abc”), (“xyz”, “abc”), void]
    s10 = [(“abc”, “”), (“abc”, “”), void]

sconcat (str1, str2): concatenate the string str2 to the string str1 at the end
    s3 = [(“abc”, “xyz”), (“abcxyz”, “xyz”), void]
    s4 = [(“”, “abc”), (“abc”, “abc”), void]
    s10 = [(“abc”, “”), (“abc”, “”), void]

Table 2 : Components and samples (Park, 2000)
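The two queries discussed above can be reproduced with a short Python sketch. It uses the component-to-sample assignment of Table 2 but, as a simplifying assumption, scans the components directly instead of building and navigating the concept lattice; the names `COMPONENT_SAMPLES` and `retrieve` are illustrative only.

```python
# Sample identifiers that are valid for each component, taken from Table 2.
COMPONENT_SAMPLES = {
    "scopy": {"s1", "s4", "s8"},
    "sswap": {"s1", "s5", "s9"},
    "srmv": {"s2", "s6", "s10"},
    "strim": {"s2", "s7", "s10"},
    "sconcat": {"s3", "s4", "s10"},
}


def retrieve(query):
    """Return the components for which every sample in the query is valid."""
    wanted = set(query)
    return sorted(
        name for name, samples in COMPONENT_SAMPLES.items() if wanted <= samples
    )


print(retrieve({"s10"}))        # components of concept X8
print(retrieve({"s10", "s4"}))  # component of concept X5
```

The result for a query is the set of components of the most general concept whose samples include the whole query, so adding samples to a query can only narrow the result.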


Signature Matching

Signature matching focuses on the types and number of arguments defined for methods and takes an indirect approach to identifying whether a piece of code is relevant (Zaremski and Wing, 1993; Zaremski and Wing, 1997). Furthermore, signature matching uses the structure of the interface of the source code as built-in information for indexing software libraries (Rollins and Wing, 1991). In another approach based on signature matching, Luqi and Guo (1999) compute a profile from the signature of each reusable component in order to identify it uniquely. The profile is a sequence of integers generated by counting the number of signature types and the cardinalities of related and unrelated type groups.

The concepts of signature matching are best known from functional languages (Zaremski and Wing, 1996). As an example, the following ML signature, for a function that inserts an integer into a sorted set of integers, is used as a query (Hemer and Lindsay, 2001):

insert_int : int x int set → int set

The query would match the ML library function insert, that is used for inserting an element in a set where a is a type parameter. The signature for the function insert is:

insert : a x a set → a set

The match is obtained by instantiating the type parameter a ~> int. In general, signature matching itself has the signature QuerySignature, MatchPredicate, ComponentLibrary → SetOfComponents. In other words, given a query expressed as the target signature of the component sought, a match predicate defining the type of match and a component library, the signature matching process returns a set of matching candidates (Zaremski and Wing, 1995).

The technique is excellent for finding the desired library function, except that many functions in a library may share the same signature. For example, 31 of the 47 functions in the standard ANSI C math library have the signature double → double (Zaremski and Wing, 1997).
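The instantiation a ~> int in the insert example can be sketched as a one-way match of a library signature against a query signature. The encoding is an assumption made for illustration: a signature is a pair (argument types, result type), a base type or type variable is a string, a constructed type such as "a set" is the tuple ("set", "a"), and only the library side may contain type variables. This is a sketch of one exact-match predicate, not the full family of relaxed matches defined by Zaremski and Wing.

```python
def match_type(pattern, concrete, type_vars, binding):
    """Match a library type against a query type, extending `binding` in place."""
    if isinstance(pattern, str):
        if pattern in type_vars:
            if pattern in binding:
                # A variable must be instantiated consistently everywhere.
                return binding[pattern] == concrete
            binding[pattern] = concrete
            return True
        return pattern == concrete
    # Constructed type, e.g. ("set", "a"): same shape, then match element-wise.
    if not isinstance(concrete, tuple) or len(pattern) != len(concrete):
        return False
    return all(match_type(p, c, type_vars, binding) for p, c in zip(pattern, concrete))


def signature_match(library_sig, query_sig, type_vars=frozenset({"a"})):
    """Return the instantiation that turns library_sig into query_sig, or None."""
    binding = {}
    return binding if match_type(library_sig, query_sig, type_vars, binding) else None


# insert     : a x a set -> a set
INSERT = (("a", ("set", "a")), ("set", "a"))
# insert_int : int x int set -> int set
INSERT_INT = (("int", ("set", "int")), ("set", "int"))

print(signature_match(INSERT, INSERT_INT))
```

A query such as int x bool set → int set is rejected, because the type parameter a cannot be instantiated to both int and bool.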

Specification Matching

A specification matching approach was introduced to overcome the problem of signature matching (Zaremski and Wing, 1995). Specification matching compares two software components based on descriptions of their behaviour, using the pre- and postconditions that capture the functionality of each method (Zaremski and Wing, 1997). It uses mathematical notation to describe the behaviour of a software component (Hemer and Lindsay, 2001; Jilani et al., 1997; Zaremski and Wing, 1993; Zaremski and Wing, 1995; Zaremski and Wing, 1997). Specification languages such as Z, B or OCL are used to write the component specifications (Fischer et al., 1998; Fischer, 2000). Specification-based retrieval allows formal specifications as search keys and retrieves components whose indices satisfy a given match relation with respect to the key (Fischer et al., 1998).

Zaremski and Wing (1996) apply the same definition of the retrieval problem used for signature matching to specification matching: QuerySpecification, MatchPredicate, ComponentLibrary → SetOfComponents. They illustrate both exact and relaxed matches between components and arrive at eight ways to find a match: (i) exact pre/post match is satisfied if the preconditions and postconditions are equivalent; (ii) plug-in match is satisfied when the query is matched by a library specification whose precondition is weaker and whose postcondition is stronger than those of the query specification; (iii) plug-in post match is satisfied when only the postconditions are equivalent; (iv) guarded plug-in match is satisfied where the conjunctions of the specifications of the library component and the query are equivalent; (v) guarded post match is satisfied when the precondition of the library component is weaker than that of the query; (vi) exact predicate match is satisfied when the conjunctions of the preconditions and postconditions of the query and the library component are equivalent; (vii) generalized match is satisfied when the specification of the library component is stronger than the query; and (viii) specialized match is satisfied when the specification of the library component is weaker than the query.

Hemer and Lindsay (2001) demonstrate the exact pre/post match, plug-in match and exact predicate match with the four KIDS-like functions shown in Figure 2. Each function returns the first integer in a list of numbers. Function hd1 has a precondition that the list contains at least one element and returns the element with index 1 in the list. Function hd2 has a precondition that the list is not empty and returns the head of the list. Function hd3 returns the head of a non-empty list or returns 0. Function hd4 has a precondition that the list contains at least two elements and returns the element with index 1. Functions hd1 and hd2 satisfy exact pre/post match, plug-in match and exact predicate match because the two functions are logically equivalent. If function hd3 is considered to be the query specification, it will satisfy plug-in match with functions hd1, hd2 and hd4 because function hd3 has a weaker precondition and a stronger postcondition. Lastly, exact predicate match is satisfied for hd1, hd2 and hd3 because the conjunctions of the preconditions and postconditions of the functions are logically equivalent.

function hd1(x: seq(integer)): integer
    where #x > 0
    return {z | x(1) = z}.

function hd2(x: seq(integer)): integer
    where x ≠ <>
    return {z | z = head(x)}.

function hd3(x: seq(integer)): integer
    where true
    return {z | z = if x ≠ <> then head(x) else 0}.

function hd4(x: seq(integer)): integer
    where #x > 2
    return {z | x(1) = z}.

Figure 2: Functions for returning the head of a list, in KIDS notation
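The plug-in match condition (the library precondition is weaker and the library postcondition is stronger than the query's) can be illustrated with a bounded check in Python. This is only a sketch under strong assumptions: real specification matching relies on theorem proving, whereas here the implications are tested exhaustively over a small hypothetical domain of sequences of length 0 to 3 over the values 0 and 1. The specifications model hd1 and hd4 exactly as written in Figure 2, with the 1-indexed x(1) written as x[0].

```python
from itertools import product


def plug_in_match(query, library, inputs, outputs):
    """Bounded check of plug-in match between two (pre, post) specifications.

    Holds when the query precondition implies the library precondition and
    the library postcondition implies the query postcondition on every
    input/output pair of the finite domain.
    """
    q_pre, q_post = query
    s_pre, s_post = library
    pre_ok = all(s_pre(x) for x in inputs if q_pre(x))
    post_ok = all(q_post(x, z) for x, z in product(inputs, outputs) if s_post(x, z))
    return pre_ok and post_ok


def first_is(x, z):
    """Postcondition x(1) = z; treated as false when x is empty."""
    return len(x) > 0 and x[0] == z


HD1 = (lambda x: len(x) > 0, first_is)  # where #x > 0
HD4 = (lambda x: len(x) > 2, first_is)  # where #x > 2

VALUES = (0, 1)
INPUTS = [seq for n in range(4) for seq in product(VALUES, repeat=n)]

print(plug_in_match(query=HD4, library=HD1, inputs=INPUTS, outputs=VALUES))
print(plug_in_match(query=HD1, library=HD4, inputs=INPUTS, outputs=VALUES))
```

On this domain hd1 can stand in for a query written as hd4, because it accepts every list that hd4 accepts, while the reverse fails for lists with one or two elements. A bounded check of this kind can refute a match but cannot prove one in general.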

These approaches are good because no additional information is needed but it is difficult to map user requirements to code signature (Braga et al., 2001). Furthermore the signature does not guarantee expected component characteristic (Braga et al., 2001). With the nature of object-oriented programming that permits method overloading and overriding, it is difficult to distinguish multiple components with similar signatures. The specification matching approach appears to provide more accurate hits but it is too time-consuming to be practical as its implementation based on theorem proving techniques is expensive (Goguen, 1996). Nevertheless this category of methods is valued for its precision. To retrieve software components the user is obliged to provide the system with precise and well structured information (Bouchachia et al., 2000).

2.2.3 Recommendation method

Most of the research discussed above focuses on the retrieval method, with the intention of improving the relevance of retrieved components to the query submitted by the user. Such techniques are useful especially for developers who understand the task at hand and know which components to look for. Unfortunately, developers are unlikely to query for components that they believe do not exist (Ye et al., 2000; Ye and Fischer, 2002; Ye, 2003; Ohsugi et al., 2002; McCarey et al., 2004; McCarey et al., 2005). Such information islands cause developers not to reuse certain components (Ye, 2001). One solution is to advise developers through recommendations generated by the retrieval system.

The basic idea of the recommendation method is for the repository to be more active by suggesting a set of candidate components that are likely to be useful to a particular developer (Holmes and Murphy, 2005). Moreover, integrating the method in a repository system can encourage users to make full use of the repository by recommending components that the developer is interested in as well as useful components that the developer is not aware of (McCarey et al., 2005; Ye et al., 2002).

A recommending system such as CodeBroker is integrated into the development environment so that the system can find components relevant to the task (Ye and Fischer, 2002). CodeBroker runs continuously as a background process in Emacs and delivers suitable components whenever a standard Java comment is entered in the environment. The comment is thus treated as a query, and the system responds with a list of matching components (Ye and Fischer, 2000; 2001; 2002). The system also accepts a signature definition as a query; the developer can refine the query by continuing to type the signature definition of the method. The system then analyses the preceding Java comment together with the signature to find matching components (Ye et al., 2000). The comments are processed with free-text information retrieval techniques, while the signature definition is handled with the signature matching technique.

The system filters out components that have been used previously, because it is assumed that the developer is already familiar with them. CodeBroker uses similarity analysis between the comments and signatures as identifiers to show the relevance of a component to the task at hand. Furthermore, latent semantic analysis is used to index text documents and queries (Holmes and Murphy, 2005). However, the effectiveness of this approach may be limited by the need to write suitable comments, a skill that most developers appear to lack (Holmes and Murphy, 2005). The Strathcona tool (Holmes and Murphy, 2005) shares most of the capabilities of the CodeBroker system. The tool extracts the structural context of the code on which a developer is working when the developer requests examples. In contrast to CodeBroker, Strathcona can be applied to any code and any framework irrespective of coding conventions, since all source code incorporates structure.

Another recommendation approach that has been studied is based on collaborative filtering (Ohsugi et al., 2002; McCarey et al., 2004, 2005). Collaborative filtering algorithms are used to suggest new items, or to predict the items a user will be interested in, based on the user’s previous preferences and the opinions of like-minded users (Sarwar et al., 2001). Collaborative filtering assumes that users can be clustered: users in a cluster share preferences and dislikes and are likely to agree on future items. The collaborative filtering method has been used to recommend useful functions in application software such as MS-Word or MS-Excel (Ohsugi et al., 2002). Function execution histories are automatically collected to build the user clusters. A set of candidate functions, collected from the opinions of like-minded users, is then recommended.

McCarey et al. (2004) stressed that a collaborative filtering approach will allow developers to discover reusable components, support learning on demand, improve developers’ productivity and promote software reuse. Based on collaborative filtering, a repository of open-source Java code from SourceForge (OSTG, 2004) is mined and usage histories of components are automatically collected. The system judges users to be similar if they use the same or a similar set of components. The similarity between two users is computed as the cosine of their history records.
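The cosine measure and a recommendation step built on it can be sketched as follows. The usage histories and the scoring rule in `recommend` are hypothetical choices made for illustration and are not taken from McCarey et al.; only the use of the cosine between history records follows the description above.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two usage histories (component -> count)."""
    dot = sum(count * v.get(name, 0) for name, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(target, histories, top_n=3):
    """Suggest components unused by `target`, weighted by user similarity."""
    own = histories[target]
    scores = {}
    for other, history in histories.items():
        if other == target:
            continue
        weight = cosine(own, history)
        for name, count in history.items():
            if name not in own:
                scores[name] = scores.get(name, 0.0) + weight * count
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, score in ranked[:top_n] if score > 0]


# Hypothetical usage histories: developer -> {component: number of uses}.
HISTORIES = {
    "alice": {"ArrayList": 3, "HashMap": 1},
    "bob": {"ArrayList": 2, "HashMap": 1, "Iterator": 4},
    "carol": {"Socket": 5, "Thread": 2},
}

print(recommend("alice", HISTORIES))
```

Here alice and bob share most of their components, so bob's Iterator is recommended to alice, whereas carol's components receive a weight of zero because the two histories do not overlap.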

Later the system was extended and implemented in an Agile environment, where it is known as RASCAL (McCarey et al., 2004, 2005). RASCAL integrates collaborative filtering and content-based filtering. Collaborative filtering recommends components that match the preferences of similar users, and content-based filtering recommends which components are suitable to use next, as soon as the system has analysed the methods currently being implemented. RASCAL runs continuously in the background, updates the developer’s usage history and produces recommendations based on Java classes similar to the class currently under development. RASCAL does not need additional knowledge from the developers because the system actively analyses the methods that are currently being developed. Furthermore, the system can predict the next likely components to use instead of recommending components that the user has not yet tried.

Ichii et al. (2004) also incorporate collaborative filtering into their component search system, Spars-J. Spars-J is a search system specifically for Java components whose ranking method, component rank, is similar to Google's PageRank for HTML documents (Inoue et al., 2003; Inoue et al., 2005; Yokomori et al., 2003). Component rank aims to analyse actual use relations of components; in other words, "a collection of software components is represented as a weighted directed graph whose nodes correspond to the components and edges correspond to the usage relations" (Inoue et al., 2003). However, whereas Google directly weights the incoming references of each web page, Spars-J first explores similarities between components before performing the weight computation (Inoue et al., 2005). They ranked 7171 Java files found in SourceForge before extending the system's functionality with collaborative filtering.
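The idea of ranking components over a weighted usage graph can be sketched with a PageRank-style power iteration. This is not the component rank algorithm of Inoue et al.: the clustering of similar components that Spars-J performs first is omitted, and the damping factor, iteration count and example graph are assumptions made for illustration.

```python
def component_rank(uses, damping=0.85, iterations=50):
    """Rank components of a weighted usage graph by power iteration.

    uses: dict mapping a component to {used_component: weight}.
    Returns a dict component -> rank; the ranks sum to 1.
    """
    nodes = sorted(set(uses) | {t for targets in uses.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = uses.get(node, {})
            total = sum(targets.values())
            if total == 0:
                # A component that uses nothing spreads its rank evenly.
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
            else:
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


# Hypothetical usage relations: each application uses some library classes.
USES = {
    "App1": {"Logger": 1, "Parser": 1},
    "App2": {"Logger": 1},
    "Parser": {"Logger": 1},
}

ranks = component_rank(USES)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph Logger receives the highest rank because every other component uses it, directly or through Parser.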

Similar to component rank, CodeWeb (Michail, 2000; 2001) uses software structure to determine which parts of a framework are frequently used. To provide information about which classes and methods are frequently used, a developer must populate CodeWeb with applications that are similar to the one being developed. CodeWeb is more effective if the developer uses it from the beginning of the development process, because it is based on browsing rather than querying (Holmes and Murphy, 2005).

Project history is also recognized as a kind of information that can be used to make suggestions to the developer. This information is utilized in the Hipikat tool (Cubranic and Murphy, 2003), which recommends relevant development artifacts. One kind of artifact that the system uses is the past change task.


3 Outstanding Issues in Code Retrieval

The first aspect to take into consideration in a retrieval system is the cost involved in retrieving code components. Most current retrieval systems try to avoid manual labour, unlike the faceted approach, which uses human expertise to classify components for knowledge acquisition and representation (Mili et al., 2003). Consequently the faceted approach involves a major cost in building and maintaining the classification vocabulary and in classifying the components. A retrieval system should also search for and retrieve the components that are most relevant, instead of listing components that have similar keywords but are out of context, as happens in free-text retrieval systems (Mili et al., 2003).

From the user's point of view, it is preferable to retrieve software through natural language rather than through keyword-based or semantic-based interfaces (Faloutsos and Oard, 1995). It seems more intuitive for users to specify their requirements through a sentence in natural language than to select appropriate keywords, terms for facets in classification schemes or Boolean combinations of keywords (Faloutsos and Oard, 1995; Girardi and Ibrahim, 1995). Along the same lines, the query facilities must be easy to use and permit the posing of any conceivable question (Cox and Clarke, 2001).

The ease of getting hold of the necessary information clearly affects the scalability of retrieval tools as component libraries grow (Lucena, 2001). Nevertheless, a study of four classification methods for software components which are simple keyword-based, faceted, enumerated graph and attribute value methods for reusable software components showed that none of the methods worked very well for helping users understand the components (Frakes and Pole, 1994; Alonso and Frakes, 2000). The retrieval system should aid the developers reusing software components to understand how the components work and how they can be reused.

Ideally, an improved system should avoid the common deficiency of allowing searches only for individual code (Boonsiri et al., 2002). For a retrieval system to be successful, it must search for highly compatible code, because code that can be reused exactly as it stands is rarely found (Girardi and Ibrahim, 1995). It cannot be assumed that component retrieval will always result in a component that satisfies a problem specification (Sugumaran and Storey, 2003). Therefore a retrieval system must be able to find similar code.


4 Conclusion and Future Work

Retrieving software components includes retrieving all aspects of the software lifecycle, including design structure, specification and documentation as well as generic source code. My work is focused on retrieving source code. This is because reuse is not limited to directly implementing the code but also covers understanding how a particular feature has been coded in a particular language (Yusof and Rana, 2004). Furthermore, the source code is often the only true documentation of the system structure and behavior (Paul, 1994; Robson et al., 1991). This opinion gives me a strong foundation for concentrating on the source code alone. Code retrieval is the task of retrieving, classifying and extracting information from source code files. It is considered a challenging task because of the structure and content of source code, as well as the differences between the syntax and semantics of programming languages and those of natural language (Mishne and de Rijke, 2004; Frakes, 2004).

Most retrieval systems associate the retrieval with their own software reuse repository and do not communicate with other repositories. It is believed that the storage representation is crucial for the retrieval to be efficient (Mili et al., 1998) and that it needs to be integrated seamlessly with the retrieval system (Ye, 2001). Such efforts limit the scalability and accessibility of the repository (Seacord, 1998). One new source of reusable components is the World Wide Web. The Internet has become a rich base of reusable software with extensive quantities of software freely available (Yao and Etzkorn, 2004).

Without a doubt the Internet has attracted the attention of researchers in the reuse field as a decentralized architecture for sharing components (Sun et al., 2002). This is a good opportunity to use source code archives and open source sites on the World Wide Web to investigate the feasibility of my system. Potential archives are Ibiblio (Ibiblio) and Sourceforge (Sourceforge). These archives contain a large number of code examples and are free. For software reuse to be effective, a large collection of reusable code is necessary (Henninger, 1995). Ugurel et al. (2002) reported that Ibiblio contained over 55 gigabytes of Linux programs and documentation while Sourceforge hosted over seventy thousand Open Source software systems. Currently Ibiblio reports that it contains 171 gigabytes of freely available Linux programs and documentation (Ibiblio), while Sourceforge contains more than one hundred thousand registered projects (Sourceforge). It is not surprising that SourceForge.net, owned by Open Source Development Network, Inc., claims to be the world's largest open source development website (Sourceforge; Kawaguchi et al., 2006). This should provide useful experimental material.

My retrieval method will attempt to address the outstanding issues that have been outlined. The ideal retrieval system must take into consideration recall and precision, cost and simple query formation. The system should also help the developer to understand the components. With these features, I am confident that the system could assist reuse developers to overcome some of the problems highlighted by Reeves (1994). That empirical study shows that: (i) users do not have well-formed goals and plans, (ii) users do not know about the existence of components, (iii) users do not know how to access components, (iv) users do not understand the results that components produce for them and (v) users cannot combine, adapt, and modify components according to their specific needs (Henninger, 1995).

The research will address these problems in four phases, (i) tagging the user query to class and method names, (ii) constructing pattern/s of code features using clone detection techniques, (iii) searching components in the library based on the clone pattern/s and (iv) aiding the comprehension of the retrieved code components with program slicing technique.

To start, users can use keywords to describe their query. This does not mean that the proposed system is similar to keyword-based approaches; rather, keywords have been adopted to free users from learning a new query language. A keyword-based approach is only effective for experienced users who are familiar with the proper terminology. In the proposed system, therefore, even users who know only one of the terms will be able to rely on the system to find software components that are relevant but described with different terms.

This approach will be illustrated by an example based on the word ‘crawler’ and the site Sourceforge. A ‘crawler’ is a program or automated script that browses the World Wide Web and is used by search engines. The query ‘crawler’ produces 22 results from Sourceforge.net. Since a crawler is also known as a ‘spider’, users could instead query Sourceforge.net for ‘spider’ and find 19 results. Users who are not familiar with either term would simply type ‘search engine’, which produces 1194 results.

Results for ‘crawler’:
    Heritrix: Internet Archive Web Crawler
    WebNews Crawler
    Course Crawler
    ItSucks
    P�dznsnatch
    YACY distributed WWW search engine
    XMLCrawler
    DeDuplicator (Heritrix add-on)
    Crawl-By-Example (Heritrix plugin)
    Isobel
    GronoSpyCrawler/Load Tester in Java
    Diving Bell
    WebSPHINX
    SmartCrawler
    webloupe
    CETools
    Arn0lD
    J-Obey (Robots.txt Crawler Module)
    dpanic
    Tarantula
    Ontology Based Focused Crawler

Results for ‘spider’:
    Heaton Research Spider for Java
    Java Spider
    META Spider
    Arachnid Web Spider Framework
    WebNews Crawler
    Sit Start
    J2EE Spider
    NightCrawler
    WebLech URL Spider
    ItSucks
    Sperowider
    JSpider
    JLinkCheck
    Spidered Data Retrieval
    WebSPHINX
    ASpider
    webloupe
    RSS spider

Table 3 : Search results for term in ‘crawler’ and ‘spider’


Table 3 shows these results. Four software components are returned for both ‘crawler’ and ‘spider’. This confirms that the terms ‘crawler’ and ‘spider’ are related, yet each query misses most of the components found by the other. The proposed system should overcome this limitation. It will then go on to categorize the listed software components in order to generate clone patterns using clone detection techniques.

4.1 Clone Detection Technique

Clone detection techniques arise because duplication exists for several reasons, such as time pressure, shortcomings of the programming language, inexperienced developers (Davey et al., 1995) or even the use of similar mathematical formulae. Code duplication occurs when developers reuse fragments of code by cut-and-paste, and this is generally known as software cloning (Khusidman and Bridgeland, 2006). This type of code cloning is the most basic and widely used approach to software reuse (Basit and Jarzabek, 2005). It involves copying logic from an existing solution and modifying it to suit the needs of the new application. However, this is not how software reuse is practised as the term is generally understood (Chilton, 2003). Nevertheless, several studies suggest that as much as 20 – 30% of large software systems consists of cloned code (Baker, 1995; Mayrand et al., 1996).

Various research on finding software similarities has been performed; most studies have focused on detecting program plagiarism (Yamamoto, 2002). Clone detection is attractive because it can find similar or nearly similar code, whereas most current approaches search for candidates that are near-exact matches. Furthermore, when looking for candidates for reuse and choosing between them, I am interested in similarities that reflect the overall characteristics or features of the code. The technique should be able to produce a clone pattern that the system can use to find relevant components. My main interest is to retrieve as many components with similar features as possible, in order to increase the recall of correct components from the library.

The tools for locating similar code that have been used for duplication or plagiarism detection can be grouped as follows: (i) Pattern-based analyzers, which check for shallow similarity between lines of code using pattern matching techniques and tiling algorithms. This approach is most effective at detecting simply duplicated, copy-pasted chunks of code or very similar pieces of code. (ii) Code signature analyzers, which associate a “code signature” with every piece of code, calculated by examining certain features of the code; programs with similar signatures are considered to be similar. (iii) Structural analyzers, which compare structural properties of programs by representing the programs as strings and measuring the string distance between them. Any of these analyzers could be used to search for similar or nearly similar code in my retrieval process.
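As an illustration of the string-based comparison described in (iii), the following is a minimal Python sketch and not a tool from the cited literature. It assumes the code fragments are Python source, uses the standard `tokenize` module to turn each fragment into a sequence of tokens, and uses `difflib.SequenceMatcher` as a stand-in for a string distance measure. The function names `normalised_tokens` and `similarity` are hypothetical.

```python
import difflib
import io
import keyword
import tokenize


def normalised_tokens(source: str) -> list[str]:
    """Tokenise Python source, replacing identifiers and literals by placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            # Keywords carry structure; user-chosen names do not.
            tokens.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")
        elif tok.type == tokenize.OP:
            tokens.append(tok.string)
        # Comments, newlines and indentation tokens are ignored.
    return tokens


def similarity(source_a: str, source_b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical normalised token sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalised_tokens(source_a), normalised_tokens(source_b), autojunk=False
    )
    return matcher.ratio()


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def sum_all(items):\n    acc = 0\n    for item in items:\n        acc += item\n    return acc\n"
different = "def greet(name):\n    print('hello', name)\n"

print(similarity(original, renamed))    # renamed copy: 1.0 after normalisation
print(similarity(original, different))  # unrelated fragment: lower score
```

Normalising identifiers means that a copied fragment with renamed variables still scores as a clone, which is the kind of near-similar match the retrieval process is interested in. A threshold on the score would have to be chosen empirically.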

The processes of retrieving and understanding source code are closely intertwined. Once the system has retrieved a list of relevant code, users must efficiently evaluate it to determine whether or not it meets their needs. One approach is to use a program slicing technique. Program slicing is regarded as a technique that assists developers in understanding, debugging, program integration, software maintenance, testing and software assurance (Weiser, 1984). It is a program analysis and reverse engineering technique that reduces a program to those statements that are relevant for a particular computation (Weiser, 1984).

4.2 Program Slicing

Program slicing was originally introduced by Mark Weiser as a "method for automatically decomposing programs by analyzing their data flow and control flow. Starting from a subset of a program's behavior, slicing reduces that program to a minimal form which still produces that behavior. The reduced program, called a slice, is an independent program guaranteed to represent faithfully the original program within the domain of the specified subset of behavior" (Weiser, 1984). Informally, a slice provides the answer to the question “What program statements potentially affect the value of variable v at statement s?” (Binkley and Gallagher, 1996). In the process of program slicing, a program is converted to a program dependence graph whose vertices represent statements and whose edges represent control and data dependences (Horwitz et al., 1990). A program slice is extracted by backward or forward traversal of the graph from a slicing criterion (Ishio et al., 2006). A backward slice contains the statements that affect the variables in the criterion, and a forward slice contains the statements that depend on those variables. The slice allows the programmer to focus attention on the statements that belong to it. Other variants of program slicing include static slicing, dynamic slicing, chopping and interface slicing, to name a few (Mohapatra et al., 2006).
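The graph traversal behind backward and forward slices can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the algorithm of any cited tool: the dependence graph is written by hand for a small hypothetical nine-statement program, rather than being computed from source code, and the names `DEPENDENTS`, `backward_slice` and `forward_slice` are invented for the example.

```python
from collections import deque

# Hypothetical program, one statement per numbered line:
#   1: n = read()
#   2: total = 0
#   3: count = 0
#   4: while n > 0:
#   5:     total = total + n
#   6:     count = count + 1
#   7:     n = n - 1
#   8: print(total)
#   9: print(count)
#
# DEPENDENTS[u] holds the statements that depend on u through a data or
# control dependence (hand-written here; a real slicer would compute it).
DEPENDENTS = {
    1: {4, 5, 7},
    2: {5, 8},
    3: {6, 9},
    4: {5, 6, 7},
    5: {5, 8},
    6: {6, 9},
    7: {4, 5, 7},
}


def _reachable(graph, start):
    """Return every vertex reachable from start, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def _invert(graph):
    """Reverse every edge so that traversal follows dependences backwards."""
    inverted = {}
    for src, dsts in graph.items():
        for dst in dsts:
            inverted.setdefault(dst, set()).add(src)
    return inverted


def forward_slice(statement):
    """Statements that depend, directly or transitively, on the given statement."""
    return _reachable(DEPENDENTS, statement)


def backward_slice(statement):
    """Statements that may affect the given statement."""
    return _reachable(_invert(DEPENDENTS), statement)


print(sorted(backward_slice(8)))  # [1, 2, 4, 5, 7, 8]
print(sorted(forward_slice(3)))   # [3, 6, 9]
```

The backward slice on statement 8 drops every statement that only concerns `count`, which is the reduction that lets a reader concentrate on how `total` is computed.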

As the retrieved code components have similar code fragments or features, I intend to use program slicing to compare and contrast at least two source codes at a time. This will greatly enhance users’ understanding of how a particular feature is constructed in the respective code. The user can therefore either reuse the code or simply gain knowledge of how a particular feature has been coded.

In order to validate my system’s performance, I will compare its precision and recall with those of the source code archives that were used for code examples. Mining the same repository gives the advantage of concentrating only on the performance of the retrieval algorithm. I will compare the repositories’ retrieval algorithms with my system. Precision is the ratio of relevant retrieved codes to the total number of retrieved codes, while recall is the ratio of relevant retrieved codes to the total number of relevant codes in the library.
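The two measures can be computed as in the minimal Python sketch below. It uses the standard information retrieval definitions; the function name `precision_recall` and the component identifiers in the example are hypothetical and are not results from this study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the components returned by the system
    relevant:  identifiers of the components judged relevant in the library
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four components returned, three relevant in the library.
precision, recall = precision_recall({"A", "B", "C", "D"}, {"A", "B", "E"})
print(precision)  # 0.5: two of the four retrieved components are relevant
print(recall)     # two of the three relevant components were retrieved
```

Recall requires a relevance judgement for every component in the library, so in practice the set of relevant components would have to be established by hand for each test query.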

To conclude, the proposed system would not require users to know in advance what type of code examples would help them to finish their task, would eliminate the need for users to learn a new query language, and would impose no obligation to follow the requirements of the software repository.

5 References

Ali, F. M. and Du, W. (2004). Toward reuse of object-oriented software design models. Information & Software Technology, 46(8), pp. 499-517.

Alonso, O., and Frakes, W. B. (2000). Visualization of reusable software assets. Lecture Notes in Computer Science. Springer-Verlag, Berlin. 1844, pp. 251 - 265.

Anslow, C., Marshall, S., Biddle, R., Noble, J., and Jackson, K. (2004). XML database support for program trace visualisation. In: Proceedings of the 2004 Australasian symposium on Information Visualisation, Christchurch, New Zealand. Australian Computer Society. pp. 25 – 34.

Atkinson, S. and Duke, R. (1995). Behavioural retrieval from class libraries, Australian Computer Science Communications, 17(1), pp. 13 – 20.

Baker, B.S. (1995). On finding duplication and near-duplication in large software systems. In L. Wills, P. Newcomb, and E. Chikofsky, editors, Second Working Conference on Reverse Engineering, pages 86–95, Los Alamitos, California. IEEE Computer Society Press.

Barringer, H., Cheng, J. H. and Jones, C. B. (1984). A logic covering undefinedness in program proofs. Acta Informatica, 21(3). pp. 251 – 269.

Basili, V. R. and Rombach, H. D. (1991) Support for comprehensive reuse. Software Engineering Journal, 6(5), pp. 303 – 316.

Basit, H. A. and Jarzabek, S. (2005). Detecting higher-level similarity patterns in programs. ACM SIGSOFT Software Engineering Notes. 30(5). pp. 156 – 165.

Behle, A., Kirchhof, M, Nagl, M. and Welter, R. (2002). Retrieval of software components using a distributed web system. Journal of Network and Computer Applications. 25(3). pp. 197 – 222.

Berglund, E. (2002). Designing electronic reference documentation for software component libraries. Journal of Systems and Software. 68(1). pp. 65 – 75.

Biggerstaff, T. J. (1992). An assessment and analysis of software reuse. Advances in Computers. 34, pp. 1 – 57.

Biggerstaff, T. J. and Richter, C. (1987). Reusability framework, assessment, and directions, IEEE Software. 4(2). pp. 41 – 49.

Binkley, D. and Gallagher, K. B. (1996). Program Slicing. Advances in Computers, 43.

Boonsiri, S., Seacord, R. C. and Bunting, R. (2002). Automated component ensemble evaluation. International Journal of Information Technology. 8(1). pp. 40 – 53.

Bouchachia, A. and Mittermeir, R. T. (2001). Coping with uncertainty in software retrieval systems. In: Proceedings of the 2nd International Workshop on Soft Computing Applied to Software Engineering, Enschede, The Netherlands. pp. 31 – 40.

Bouchachia, A., Mittermeir, R. T. and Pozewaunig, H. (2000). Document identification by shallow semantic analysis. Lecture Notes in Computer Science. 1959, pp. 190 – 202.

Braga, R. M. M., Mattoso, M., and Werner, C. M. L., (2001). The Use of Mediation and Ontology Technologies for Software Component Information Retrieval. ACM SIGSOFT Software Engineering Notes. 26(3), pp. 19 – 28.

Chen, P. S., Hennicker, R., and Jarke, M. (1993). On the retrieval of reusable software components. In: Advances in Software Reuse, Proceedings of the 2nd International Workshop on Software Reusability. IEEE Computer Society Press, Los Alamitos, California, pp. 99–108.

Chilton, J. (2003). The case against human cloning (humans cloning software). Domino Power Magazine.

ComponentSource. http://www.componentsource.com/ 2006.

Cox, A. and Clarke, C. (2001). Representing and accessing extracted information. In: Proceedings of IEEE International Conference in Software Maintenance, Florence, Italy. IEEE Computer Society. pp. 12 – 21.

Cubranic, D. and Murphy, G. C. (2003). Hipikat: recommending pertinent software development artifacts. In: Proceeding of the 25th International Conference on Software Engineering, pp. 408 – 418.

Damiani, E. and Fugini, M. G. (1996a). Design and code reuse based on fuzzy classification of components. ACM SIGAPP Applied Computing Review, 4(2), pp. 26 – 32.

Damiani, E. and Fugini, M. G. (1996b). Fuzzy techniques for software reuse. In: Proceedings of the 1996 ACM Symposium on Applied Computing, Pennsylvania, United States, ACM Press. pp. 552 – 557.

Damiani, E., Fugini, M. G. and Fusaschi, E. (1997). A descriptor-based approach to OO code reuse. IEEE Computer, 30(10). pp. 73 – 80.

Damiani, E., Fugini, M. G., and Bellettini, C. (1999a). A hierarchy-aware approach to faceted classification of object-oriented components. ACM Transactions on Software Engineering and Methodology (TOSEM), 8(3), pp. 215 – 262.

Damiani, E., Fugini, M. G., and Bellettini C. (1999b). Corrigenda: a hierarchy-aware approach to faceted classification of object-oriented components. ACM Transactions on Software Engineering and Methodology (TOSEM), 8(4), pp. 425 – 472.

Davey, N., Barson, P., Field, S., Frank, R., and Tansley, D. (1995). The development of a software clone detector. International Journal of Applied Software Technology. 1(3-4), pp. 219 – 236.

Drummond, T. and Cipolla, R. (2000). Real time tracking of multiple articulated structures in multiple views. In: 6th European Conference on Computer Vision, ECCV 2000, Dublin, Ireland. pp. 20 – 36.

Due, R. T. (2000). The economics of component-based development. Information Systems Management, 17(1), pp. 92 – 95.

Erdur, R. C. and Dikenelli, O. (2002). A multi-agent system infrastructure for software component market-place: an ontological perspective. ACM SIGMOD. 31(1), pp. 55 – 60.

Etzkorn, L. and Davis, C. G. (1997). Automatically identifying reusable OO legacy code. IEEE Computer, 30(10), pp. 66 – 71.

Faloutsos, C. and Oard, D. W. (1995). A survey of information retrieval and filtering methods. Univ. of Maryland Institute for Advanced Computer Studies Report. University of Maryland at College Park, College Park, MD.

Fischer, B. (2000). Specification-based browsing of software component libraries, Automated Software Engineering, 7(2), pp. 179 – 200.

Fischer, B., Schumann, J. M. P., Snelting, G. (1998). Deduction-based software component retrieval. In: Bibel, W. and Schmitt, P. H. (eds.): Automated deduction – a basis for applications. Dordrecht, Kluwer, pp. 265 – 292.

Fischer, G. (1987). Cognitive view of reuse and redesign. IEEE Software, Special issue on reusability, 4(4), pp. 60 – 72.

Frakes, W. B. (2004). A case study of a reusable component collection in the information retrieval domain. Journal of Systems and Software, 72(2), pp. 265 – 270.

Frakes, W. B. and Gandel, P. B. (1989). Representation methods for software reuse. In: Proceedings of the Conference on Tri-Ada ’89: Ada Technology in Context: Application, Development and Deployment, Pennsylvania, United States. ACM Press, pp. 302 – 314.

Frakes, W. B. and Gandel, P. B. (1990) Representing Reusable Software, Information and Software Technology. 32(10), pp. 653-664.

Frakes, W. B. and Nejmeh, B. A. (1988). An information system for software reuse, In: Tracz, W. (Ed.), IEEE Tutorial: Software Reuse: Emerging Technology, IEEE Computer Society.

Frakes, W. B., and Pole, T. P. (1994). An Empirical Study of Representation Methods for Reusable Software Components. IEEE Transactions on Software Engineering, 20(8) pp. 617 – 630.

Freeman, R. T. and Yin, H. (2004). Adaptive topological tree structure for document organisation and visualisation. Neural Network. 17(8–9), pp. 1255 – 1271.

Gibb, F., McCartan, C., O’Donnel, R., Sweeney, N. and Leon, R. (2000). The integration of information retrieval techniques within a software reuse environment. Journal of Information Science, 26(4), pp. 211 – 226.

Girardi, M. R. and Ibrahim, B. (1995) Using english to retrieve software. Journal of Systems and Software. 30(3), pp. 249 – 270.

Goguen, J. (1996). Formality and informality in requirements engineering. In: Proceedings of International Conference on Requirements Engineering, Colorado, United States. IEEE Computer Society, pp 102-108.

Gonzalez, R., and van der Meer, K. (2004). Standard metadata applied to software retrieval. Journal of Information Science. 30(4). pp. 300 - 309.

Hall, R. J. (1993). Generalized Behaviour-based Retrieval. In: International Conference on Software Engineering ICSE93, Baltimore. IEEE Computer Society.

Hemer, D. and Lindsay, P. (2001) Specification-based retrieval strategies for module reuse. In: Proceedings 2001 Australian Software Engineering Conference, Canberra, Australia. IEEE Computer Society. pp. 235 – 243.

Henninger, S. (1995). Information access tools for software reuse. Journal Systems Software, 30(3), pp. 231 – 247.

Henninger, S. (2002). A methodology and tools for applying context-specific usability guidelines to interface design. Interacting With Computers, 12(2000). pp. 225-243.

Holmes, R. and Murphy, G. C. (2005). Using structural context to recommend source code examples. In: Proceedings of 27th International Software Engineering, ICSE 2005. IEEE Computer Society. pp. 117 -125.

Hooper, J. W. and Chester, R. O. (1991). Software reuse and methods. Plenum Press, New York.

Horwitz, S., Reps, T. and Binkley, D. (1990). Interprocedural slicing using dependence graphs. ACM Transactions on Programming Languages and Systems. 12(1), pp. 26 – 60.

Ibiblio: www.ibiblio.org/pub/linux

Ichii, M., Yokomori, R. and Inoue, K. (2004). Application of Collaborative Filtering for Software Component Retrieval System, In: International Workshop on Computer Supported Knowledge Collaboration, Fudan University, Shanghai, China.

Inoue, K., Yokomori, R., Fujiwara, H., Yamamoto, T., Matsushita, M., and Kusumoto, S. (2005). Ranking significance of software components based on use relations. IEEE Transactions on Software Engineering. 31(3). pp. 213 -225.

Inoue, K., Yokomori, R., Fujiwara, H., Yamamoto, T., Matsushita, M., and Kusumoto, S. (2003). Component rank: relative significance rank for software component search. In: Proceedings of the 25th International Conference on Software Engineering, Portland, Oregon. Washington, IEEE Computer Society Press. pp. 14 - 24.

Ishio, T., Niitani, R. and Inoue, K. (2006). Towards locating a functional concern based on a program slicing technique. In: 21st IEEE/ACM International Conference on Automated Software Engineering.

Jilani, L. L., Desharnais, J., Frappier, M., Mili, R., and Mili, A. (1997). Retrieving software components that minimize adaptation effort. In: Proceedings of the 12th Automated Software Engineering Conference, Nevada, United States. Washington, IEEE Computer Society. pp. 255 – 262.

Kawaguchi, S., Garg, P. K., Matsushita, M. and Inoue, K. (2006). MUDABlue: An automatic categorization system for Open Source repositories. Journal of Systems and Software. 79(7). pp. 939 – 953.

Khayati, O. and Giraudin, J. (2002) Components retrieval systems. In: Workshop of 8th International Conference on Object Oriented Information Systems, Montpellier, France. Springer-Verlag.

Khusidman, V. and Bridgeland, D. M. (2006). A classification framework for software reuse, Journal of Object Technology. 5(6), pp. 43 – 61. Online address: http://www.jot.fm/issues/issue_2006_07/article1.

Kim, Y. and Stohr, E. A. (1998). Software reuse: survey and research directions, Journal of Management Information Systems, 14(4), pp. 113 – 147.

Kobayashi, M., and Takeda, K. (2000). Information Retrieval on the Web. ACM Computing Surveys, 32(2).

Kohonen, T. (2001). Self-organizing maps. Berlin, Heidelberg: Springer.

Krovetz, R., Ugurel, S. and Giles, C. L. (2003) Classification of source code archives. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, Toronto, Canada. ACM Press. pp. 425 - 426.

Krueger, W. (1992). Software Reuse, ACM Computing Surveys, 24(2). pp. 131 – 184.

Lucena, V. F. jr. de. (2001) Facet-based classification scheme for industrial automation software components. In: Sixth International Workshop on Component-Oriented Programming at ECOOP 2001, Budapest, Hungary. Berlin, Heidelberg. Springer-Verlag.

Luqi, and Guo, J. (1999). Toward automated retrieval for a software component repository, In: IEEE Conference and Workshop on Engineering of Computer-Based Systems. IEEE Computer Society. p. 99

Maarek, Y. S., Berry, D. M. and Kaiser, G. E. (1991). An information retrieval approach for automatically constructing software libraries, IEEE Transactions on Software Engineering, 17(8), pp. 800 – 813.

Mark, K. and Bernstein, A. (2002). Searching for services on the semantic web using process ontologies in The Emerging Semantic Web - Selected papers from the first Semantic Web Working Symposium, Isabel Cruz, S. Decker, J. Euzenat, and D. McGuinness, Eds. Amsterdam: IOS press, pp. 159-172.

Marshall, S., Biddle, R., and Noble, J. (2004) A web user interface for an interactive software repository. In: Proceedings of the fifth conference on Australasian user interface. Dunedin, New Zealand. 28. pp. 57 – 64.

Mayrand, J., Leblanc, C. and Merlo, E. (1996). Experiment on the automatic detection of function clones in a software system using metrics. In: Proceedings of the International Conference on Software Maintenance 1996, IEEE Computer Society Press, pp. 244–253.

McCarey, F., Ó Cinnéide, M. and Kushmerick, N. (2005). Rascal: A recommender agent for software components in an agile environment. Artificial Intelligence Review. 24(3-4). pp. 253 – 276.

McCarey, F., Ó Cinnéide, M. and Kushmerick, N. (2004). A case study on recommending reusable software components using collaborative filtering. In: Proceedings of the international workshop on mining software repositories, Edinburgh, Scotland. New Jersey: IEEE Computer Society.

McFedries, P. (2003). Google this. IEEE Spectrum. 40(2). pp. 68.

Merkl, D. (1995). Content-based software classification by selforganization. In: Proceeding of the IEEE International Conference on Neural Networks, IEEE Computer Society. pp. 1086-1091.

Michail, A. (2000). Data mining library reuse patterns using generalized association rules. In: Proceeding of the 22nd International Conference on Software Engineering, IEEE Computer Society. pp. 167 – 176.

Michail, A. (2001). Code web: Data mining library reuse patterns. In: Proceeding of the 23rd International Conference on Software Engineering, IEEE Computer Society. pp. 827 – 828.

Michail, A. (2002). Browsing and searching source code of applications written using a GUI framework. In: Proceedings of the 24th International Conference on Software Engineering, Orlando, Florida. ACM Press. pp. 327 – 337.

Mili, A., Mili, R., and Mittermeir, R. T. (1998). A survey of software reuse libraries. Annals of Software Engineering, (5), pp. 349 – 414.

Mili, H., Ah-Ki, E., Godin, R. and Mcheick, H. (1997). Another nail to the coffin of faceted controlled-vocabulary component classification and retrieval. In: Proceedings of the 1997 symposium on software reusability. ACM Press. pp. 89 -98.

Mili, H., Jaoude, G. B., Tremblay, G., Lefebvre, E. and Petrenko, A. (2003). Business Process Modeling Languages: Sorting Through the Alphabet Soup. TR LATECE.

Mili, H., Mili, F. and Mili, A. (1995). Reusing software: issues and research directions, IEEE Transactions on Software Engineering, 21(6), pp. 528 – 562.

Mishne, G. and de Rijke, M. (2004). Source Code Retrieval Using Conceptual Similarity, In: Proceeding 2004 Conference Computer Assisted Information Retrieval (RIAO ’04), France. Centre De Hautes Etudes Internationales D'Informatique Documentaire. pp. 539-554.

Mohapatra, D. P., Mall, R. and Kumar, R. (2006). An overview of slicing techniques for object-oriented programs. Informatica. 30, pp. 253 – 277.

Morel, J. M. and Faget, J. (1993). The REBOOT environment. In: Proceedings of the Second International Workshop on Software Reusability, Lucca, Italy. IEEE Computer Society Press. pp. 80 – 88.

Myaeng, S. H., Jang, D. H., Kim, M. S. and Zhoo, Z. C. (1998). A flexible model for retrieval of SGML documents. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, pp. 138 – 145.

Ohsugi, N., Monden, A., Morisaki, S. (2002) Collaborative filtering approach for software function discovery. In: Proceeding of the 2002 International Symposium on Empirical Software Engineering (ISESE 2002), Nara, Japan. IEEE Computer Society. 2. pp. 45-46.

Ostertag, E., Hendler, J., Prieto Diaz, R. and Braun, C. (1992). Computing similarity in a reuse library system: an AI-based approach. ACM Transactions on Software Engineering and Methodology (TOSEM). 1(3). pp. 205 – 228.

OSTG (2004). SourceForge.net is owned by the Open Source Technology Group Inc (OSTG), a subsidiary of VA Software Corporation. http://sourceforge.net.

Park, Y. (1996). Organizing reusable components for execution-based retrieval. In: Proceeding of the international symposium on applied corporate computing. pp. 147 – 155.

Park, Y. (2000). Software retrieval by samples using concept analysis. The Journal of Systems and Software. 54. pp. 179 – 183.

Paul, S. (1994). Modeling and querying software repositories. In: Proceedings of the 1994 conference of the Centre for Advanced Studies on Collaborative research, Ontario, Canada. IBM Press. p. 53.

Pfleeger, S. L. (1996). Measuring reuse: A cautionary tale. IEEE Software, 13(4), pp. 118 – 125.

Pighin, M. (2001). Tracing object-oriented code into functional requirements. In: Proceedings Fifth Conf. Software Maintenance and Reengineering, pp. 196–199.

Pighin, M. and Brajnik, G. (2000). A formative evaluation of information retrieval techniques applied to software catalogues. The Journal of Systems and Software. 52(2-3), pp. 131 – 138.

Podgurski, A. and Pierce, L. (1993). Retrieving reusable software by sampling behavior. ACM Transactions on Software Engineering and Methodology, 2(3). pp. 286 – 303.

Prieto-Díaz, R. (1990). Implementing Faceted Classification for Software Reuse (Experience Report). In: Proceeding 12th International Conference on Software Engineering, Nice, France, IEEE Computer Society Press, pp. 300 – 304.

Prieto-Díaz, R. (1991). Implementing faceted classification for software reuse. Communications of the ACM, 34(5), pp. 88 – 97.

Prieto-Díaz, R. (1993). Status report: Software reusability. IEEE Software, 10(3), pp. 61 – 66.

Prieto-Díaz, R. and Freeman, P. (1987). Classifying software for reusability. IEEE Software, 4(1), pp. 6 – 16.

Reeves, C. (1994). Genetic algorithms and neighbourhood search. In: Evolutionary Computing AISB Workshop, Leeds, UK. Berlin: Springer – Verlag. pp. 115 – 130.

Robson, D. J., Bennett, K. H., Cornelius, B. J. and Munro, M. (1991). Approaches to program comprehension. Journal of Systems and Software. 14(2). pp. 79 – 84.

Rollins, E. and Wing, J. (1991). Specifications as search keys for software libraries. In: Furukawa, K. (ed.), Eighth International Conference on Logic Programming, MIT Press, pp. 173 – 187.

Sarwar, B. M., Karypis, G., Konstan, J. A. and Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms, In: Proceedings of the 10th International World Wide Web Conference, Hong Kong, ACM Press. pp. 285 – 295.

Seacord, R. C., Hissam, S. A. and Wallnau, K. C. (1998). Agora: a search engine for software components, IEEE Internet Computing, 2(6), pp. 62 – 70.

SourceForge. http://www.sourceforge.net/ 2006.

Sugumaran, V. and Storey, V. C. (2003). A semantic-based approach to component retrieval. SIGMIS Database. 34(3), pp. 8 – 24.

Sun, C., Agrawal, D. and Abbadi, A. E. (2002). Selectivity Estimation for Spatial Joins with Geometric Selections. Lecture Notes in Computer Science. 2287, pp. 609 – 626.

Swanson, J. E. and Samadzadeh, M. H. (1992). Reusable software catalog interface. In: Proceeding 92 ACM SIGAPP Symposium Application Computing SAC, New York, USA. ACM Press. pp. 1076-1082.

Tangsripairoj, S. and Samadzadeh, M. H. (2005). Organizing and Visualizing Software Repositories Using the Growing Hierarchical Self-Organizing Map. In: Proceedings of the 2005 ACM Symposium on Applied Computing (SAC’05), Special Track on Software Engineering, Santa Fe, New Mexico. New York, ACM Press. pp. 1539 – 1545.

Ugurel, S., Krovetz, R., Giles, C. L., Pennock, D. and Glover, E. (2002). What’s the Code? Automatic Classification of Source Code Archives. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada. ACM Press. pp. 632 – 638.

Vitharana, P., Zahedi, F. M., Jain, H. (2003a). Knowledge-based repository scheme for storing and retrieving business components: a theoretical design and an empirical analysis, IEEE Transactions on Software Engineering, 29(7), pp. 649-664.

Vitharana, P., Zahedi, F. M., Jain, H. (2003b). Design, retrieval and assembly in component-based software development, Communications of the ACM, 46(11), pp. 97 – 102.

Wallnau, K. C., Hissam, S. and Seacord, R. (2002). Building systems from commercial components. Addison-Wesley.

Weiser, M. (1984). Program Slicing. IEEE Transactions on Software Engineering. 10(4), pp. 352 – 357.

Wilkinson, R. (1994). Effective retrieval of structured documents. In: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, Springer-Verlag New York, Inc, pp. 311–317.

Wood, M., and Sommerville, I., (1988). An information retrieval system for software components, SIGIR Forum. 22.

Yamamoto, K. (2002). Acquisition of lexical paraphrases from texts. In: Proceedings of the 2nd International Workshop on Computational Terminology (Computerm 2002, in conjunction with Coling 2002), pp. 22–28.

Yao, H. and Etzkorn, L. (2004). Towards a semantic-based approach for software reusable component classification and retrieval. In: Proceedings of the 42nd annual Southeast regional conference at Huntsville, Alabama. New York, ACM Press, pp. 110 – 115.

Ye, H. and Lo, B. W. N. (2000). A visualised software library: nested self-organising maps for retrieving and browsing reusable software assets. Neural Computing and Applications. 9(4), pp. 266 – 279.

Ye, H. and Lo, B. W. N. (2001). Towards a self-structuring software library. IEE Proceedings Software. 148(2). pp. 45 – 55.

Ye, Y. and Fischer, G. (2000). Promoting Reuse with Active Reuse Repository Systems. In: Proceedings of the 6th International Conference on Software Reuse (ICSR-6), Vienna, Austria, Springer-Verlag, Berlin Heidelberg, pp. 302 – 317.

Ye, Y. and Fischer, G. (2001). Context-Aware Browsing of Large Component Repositories. In: Proceedings of 16th International Conference on Automated Software Engineering (ASE'01), Coronado Island, pp. 99 – 106.

Ye, Y. and Fischer, G. (2002). Supporting reuse by delivering task-relevant and personalized information. In: Proceedings of the 24th International Conference on Software Engineering, Orlando, Florida, ACM Press. pp. 513 – 523.

Ye, Y., Fischer, G. and Reeves, B. (2000). Integrating Active Information Delivery and Reuse Repository Systems. In: International Symposium on Foundations of Software Engineering, San Diego, California, United States. New York, ACM Press, pp. 60 – 68.

Ye, Y. (2001). An active and adaptive reuse repository system. In: Proceedings of the 34th Annual Hawaii International Conference on System Sciences. IEEE Computer Society Press. p. 10.

Ye, Y. (2003). Programming with an intelligent agent. Intelligent Systems, IEEE. 18(3). pp. 43 – 47.

Ye, Y., and Reeves, B. (2000). An Active and Intelligent Agent for Component Location. In: Proceedings of Software Symposium 2000 (SS2000), Kanazawa, Japan, Software Engineer Association, pp.67-74.

Yusof, Y. and Rana, O. F. (2004). Template mining in source-code digital libraries. In: Proc. of Int. Workshop on Mining Software Repositories, Edinburgh, Scotland, UK. IEEE Computer Society Press. pp. 122 – 126.

Zaremski, A. M. and Wing, J. M. (1993). Signature matching: a key to reuse. Software Engineering Notes. 18(5). pp. 182 – 190.

Zaremski, A. M. and Wing, J. M. (1995). Signature matching: a tool for using software libraries, ACM Transactions on Software Engineering and Methodology, 4(2), pp. 146 – 170.

Zaremski, A. M. and Wing, J. M. (1996). Specification matching of software components. In: Third ACM SIGSOFT Symposium on the Foundations of Software Engineering.

Zaremski, A. M. and Wing, J. M. (1997). Specification matching of software components. ACM Transactions on Software Engineering and Methodology (TOSEM). 6(4). pp. 333 – 369.
