Wed. 7 April 2004 Profes 2004 Graduate School of Information Science, Nara Institute of Science and Technology - Effort Estimation Based on Collaborative Filtering Naoki Ohsugi, Masateru Tsunoda, Akito Monden, and Ken-ichi Matsumoto

Effort Estimation Based on Collaborative Filtering


Page 1: Effort Estimation Based on Collaborative Filtering

Wed. 7 April 2004, Profes 2004

Graduate School of Information Science, Nara Institute of Science and Technology

Effort Estimation Based on Collaborative Filtering

Naoki Ohsugi, Masateru Tsunoda, Akito Monden, and Ken-ichi Matsumoto

Page 2: Effort Estimation Based on Collaborative Filtering


Software Development Effort Estimation

• There are methods for estimating the effort required to complete ongoing software development projects.

• The estimation can be based on data from past projects.

Cow

Page 3: Effort Estimation Based on Collaborative Filtering


??

Problems in Estimating Effort

• Past projects’ data usually contain many Missing Values (MVs).

– Briand, L., Basili, V., and Thomas, W.: A Pattern Recognition Approach for Software Engineering Data Analysis. IEEE Trans. on Software Eng., vol.18, no.11, pp.931-942 (1992)

• MVs degrade the accuracy of estimation.

– Kromrey, J., and Hines, C.: Nonrandomly Missing Data in Multiple Regression: An Empirical Comparison of Common Missing-Data Treatments. Educational and Psychological Measurement, vol.54, no.3, pp.573-593 (1994)

Cow? Horse?

Page 4: Effort Estimation Based on Collaborative Filtering


Goal and Approach

• Goal: to achieve accurate estimation using data with many MVs.

• Approach: to employ Collaborative Filtering (CF).

– A technique for estimating user preferences using data with many MVs (e.g. Amazon.com)

Page 5: Effort Estimation Based on Collaborative Filtering


CF based User Preference Estimation

• Evaluating similarities between the target user and the other users.

• Estimating the target preference using the other users’ preferences.

[Figure: rating table of Users A to D over Books 1 to 5. Ratings are 5 (prefer), 3 (so so), 1 (not prefer), or ?(MV). One cell is marked ?(target); users are labeled Similar User or Dissimilar User relative to the target user, and an arrow labeled Estimate points to the target cell.]

Page 6: Effort Estimation Based on Collaborative Filtering


CF based Effort Estimation

• Evaluating similarities between the target project and the past projects.

• Estimating the target effort using the other projects’ efforts.

[Figure: table of Projects A to D over the metrics Project type, # of faults, Design cost, Coding cost, and Testing cost. Project type is 1 (new develop) or 0 (maintain); several cells are ?(MV). One cell is marked ?(target); projects are labeled Similar Project or Dissimilar Project relative to the target project. The Testing cost column shows 30, 20, and 80, with an arrow labeled Estimate at the value 25.]

Page 7: Effort Estimation Based on Collaborative Filtering


Step1. Evaluating Similarities

• Each project is represented as a vector of normalized metrics.

• A smaller angle between two vectors denotes a higher similarity between the two projects.

[Figure: Projects A and B drawn as vectors on the axes Project type, # of faults, and Coding cost, annotated "Similarity: 0.71". The accompanying table covers Project type, # of faults, Design cost, Coding cost, and Testing cost, and lists raw values with normalized values in parentheses: 1 (1.0), 50 (0.0625), 20 (0.0), 60 (1.0), 40 (0.0), 100 (1.0), plus ?(target) and ?(MV) entries.]
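The angle-based similarity of Step 1 can be sketched as a cosine similarity. This is a minimal illustration, not the authors' implementation: the function name `cosine_similarity`, the use of `None` for an MV, and the choice to compare only metrics observed in both projects are assumptions. The two example vectors are one reading of the normalized values on the slide (Project type, # of faults, Coding cost); which values belong to which project is also an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two normalized metric vectors.

    None marks a missing value (MV). Only metrics observed in both
    projects are compared; that choice is an assumption of this sketch.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no angle can be measured against a zero vector
    return dot / (norm_a * norm_b)

# Normalized (project type, # of faults, coding cost), as read off the slide.
project_a = [1.0, 0.0625, 1.0]
project_b = [1.0, 0.0, 0.0]
similarity = cosine_similarity(project_a, project_b)
print(round(similarity, 2))  # prints 0.71
```

With these inputs the result rounds to the 0.71 shown on the slide, which suggests the reading is plausible.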

Page 8: Effort Estimation Based on Collaborative Filtering


Step2. Calculating Estimated Value

• Choosing the k most similar projects.

– k is called the Neighborhood Size.

• Calculating the estimated value as a weighted sum of the observed values of the k similar projects.

[Figure: the table of Projects A to D over Project type, # of faults, Design cost, Coding cost, and Testing cost, with the past projects annotated "Similarity: 0.71", "Similarity: 0.71", and "Similarity: 0.062". The Testing cost column shows 30, 20, and 80, with the value 25 labeled "Estimate (k=2)" for the ?(target) cell.]
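Step 2 can be sketched as a similarity-weighted average over the k nearest projects. This is a hedged illustration only: the slide says "weighted sum" without giving the exact weighting, so normalizing by the sum of similarities is an assumption, as are the function name `estimate` and the decision to skip projects with an MV before choosing the top k. The example pairs the three similarities on the slide with the three Testing cost values, which is one reading of the figure.

```python
def estimate(neighbors, k):
    """Similarity-weighted average over the k most similar projects.

    neighbors: list of (similarity, observed_value) pairs. Projects whose
    value is missing (None) are dropped before the top k are chosen,
    which is an assumption of this sketch.
    """
    observed = [(s, v) for s, v in neighbors if v is not None]
    top_k = sorted(observed, key=lambda p: p[0], reverse=True)[:k]
    total_weight = sum(s for s, _ in top_k)
    if total_weight == 0:
        return None  # nothing usable to estimate from
    return sum(s * v for s, v in top_k) / total_weight

# (similarity to the target, observed Testing cost), as read off the slide.
neighbors = [(0.71, 30), (0.71, 20), (0.062, 80)]
testing_cost = estimate(neighbors, k=2)
print(round(testing_cost, 2))  # prints 25.0
```

With k = 2 the dissimilar project (similarity 0.062) is excluded, and the two equally similar projects average to 25, matching the estimate shown on the slide.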

Page 9: Effort Estimation Based on Collaborative Filtering


Case Study

• We evaluated the proposed method using data collected from a large software development company (over 7,000 employees).

– The data were collected from 1,081 projects over a decade.

• 13% were projects for developing new products.

• 36% were projects for customizing ready-made products.

• For 51% of the projects, the type was unknown.

– The data contained 14 kinds of metrics.

• Design cost, coding cost, testing cost, # of faults, etc.

Page 10: Effort Estimation Based on Collaborative Filtering


Unevenly Distributed Missing Values

Metrics: Rate of MVs

Mainframe or not 75.76%

New development or not 7.49%

Total design cost (DC) 0.00%

Total coding cost (CC) 0.00%

DC for regular staff of a company 86.68%

DC for dispatched staff from other companies 86.68%

DC for subcontract companies 86.59%

CC for regular staff 86.68%

CC for dispatched staff 86.68%

CC for subcontract companies 86.59%

# of faults found in the review of conceptual design 83.53%

# of faults found in the review of functional design 70.77%

# of faults found in the review of program design 80.20%

Testing cost 0.00%

Total 59.83%

Page 11: Effort Estimation Based on Collaborative Filtering


Evaluation Procedure

1. We randomly divided the data into two halves: the Fit Dataset and the Test Dataset.

2. We estimated Testing Costs in the Test Dataset using the Fit Dataset.

3. We compared the estimated Costs and the actual Costs.

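[Figure: the Original Data (1081 projects) is divided into the Fit Dataset (541 projects) and the Test Dataset (540 projects). Arrows labeled "used", "extracted", and "compared" connect the datasets to the Estimated Testing Costs and the Actual Testing Costs.]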
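The random 50-50 division in step 1 can be sketched as follows. The function name `split_fit_test` and the fixed `seed` are hypothetical, introduced only so the sketch is repeatable; the slides do not say how the random split was drawn.

```python
import random

def split_fit_test(projects, seed=0):
    """Shuffle the projects and cut them into two halves.

    With an odd count, the extra project goes to the fit half.
    """
    shuffled = list(projects)
    random.Random(seed).shuffle(shuffled)
    half = (len(shuffled) + 1) // 2
    return shuffled[:half], shuffled[half:]

# 1081 project identifiers stand in for the real project records.
fit, test = split_fit_test(range(1081))
print(len(fit), len(test))  # prints 541 540
```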

Page 12: Effort Estimation Based on Collaborative Filtering


Regression Model We Used

• We employed stepwise metrics selection.

• We employed the following Missing Data Treatments.

– Listwise Deletion

– Pairwise Deletion

– Mean Imputation
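Two of the treatments listed above can be sketched in a few lines. This is an illustration under assumptions, not the setup used in the study: rows are plain lists, `None` marks an MV, and the function names are hypothetical. Pairwise deletion is omitted because it acts inside the regression's covariance computation (each pair of metrics uses all projects where both are observed) rather than on the table itself.

```python
def listwise_deletion(rows):
    """Keep only the projects that have no missing value at all."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_imputation(rows):
    """Replace each missing value with the mean of its metric's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A metric with no observed value at all stays missing.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

# Three hypothetical projects with two metrics each.
rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
complete = listwise_deletion(rows)  # [[1.0, 2.0]]
imputed = mean_imputation(rows)     # [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
```

The example shows why listwise deletion is costly when MVs are frequent: two of the three projects are discarded.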

Page 13: Effort Estimation Based on Collaborative Filtering


Relationships Between the Estimated Costs and the Actual Costs

[Figure: two log-log scatter plots of Actual Costs (vertical axis) against Estimated Costs (horizontal axis), both axes ranging from 0.0001 to 100. Panels: CF (k = 22) and Regression (Listwise Deletion).]

Page 14: Effort Estimation Based on Collaborative Filtering


Evaluation Criteria of Accuracy

• MAE: Mean Absolute Error

• VAE: Variance of AE

• MRE: Mean Relative Error

• VRE: Variance of RE

• Pred25

– Ratio of the projects whose Relative Errors are under 0.25.

Absolute Error = |Estimated Cost – Actual Cost|

Relative Error = |Estimated Cost – Actual Cost| / Actual Cost
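The five criteria can be computed together as in the sketch below. The function name `accuracy` and the sample numbers are hypothetical. The slides do not say whether the variances are population or sample variances; population variance (`statistics.pvariance`) is assumed here, and actual costs are assumed to be nonzero.

```python
from statistics import mean, pvariance

def accuracy(estimated, actual):
    """Return MAE, VAE, MRE, VRE, and Pred25 for paired cost lists."""
    abs_errors = [abs(e - a) for e, a in zip(estimated, actual)]
    rel_errors = [ae / a for ae, a in zip(abs_errors, actual)]
    return {
        "MAE": mean(abs_errors),
        "VAE": pvariance(abs_errors),  # population variance: an assumption
        "MRE": mean(rel_errors),
        "VRE": pvariance(rel_errors),
        # Share of projects whose relative error is strictly under 0.25.
        "Pred25": sum(r < 0.25 for r in rel_errors) / len(rel_errors),
    }

# Hypothetical costs for three projects.
scores = accuracy(estimated=[1.0, 2.0, 4.0], actual=[1.0, 4.0, 5.0])
```

Here the absolute errors are 0, 2, and 1 and the relative errors are 0, 0.5, and 0.2, so MAE is 1.0 and two of the three projects count toward Pred25.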

Page 15: Effort Estimation Based on Collaborative Filtering


Accuracy of Each Neighborhood Size

• The most accurate estimation was observed at k = 22.

[Figure: Mean Relative Error (vertical axis, 0.81 to 0.89) plotted against Neighborhood Size (horizontal axis, 1 to 50), with the annotation "MRE = 0.82 (k = 22)".]

Page 16: Effort Estimation Based on Collaborative Filtering


Accuracies of CF and Regression Models

• All evaluation criteria indicated CF (k = 22) was the most effective for our data.

                                MAE     VAE        MRE       VRE           Pred25
CF (k = 22)                     0.21    2.2        0.82      3.45          36%
Regression (Listwise Deletion)  0.7     16.45      30.22     287581.18     10%
Regression (Pairwise Deletion)  52.75   97171.84   6344.27   18937623098   12%
Regression (Mean Imputation)    1.33    5.07       331.69    24208218.49   4%

Page 17: Effort Estimation Based on Collaborative Filtering


Related Work

• Analogy-based Estimation

– It estimates effort using the values of similar projects.

• Shepperd, M., and Schofield, C.: Estimating Software Project Effort Using Analogies. IEEE Trans. on Software Eng., vol.23, no.12, pp.736-743 (1997)

– They used a different approach to evaluating similarities between projects.

– They did not address missing values.

Page 18: Effort Estimation Based on Collaborative Filtering


Summary

• We proposed a method for estimating software development effort using Collaborative Filtering.

• We evaluated the proposed method.

– The results suggest that the proposed method can produce good estimates from data containing many MVs.

Page 19: Effort Estimation Based on Collaborative Filtering


Future Work

• Designing a method to find an appropriate neighborhood size automatically.

• Improving the accuracy of estimation with other similarity evaluation algorithms.

• Comparing accuracy with other methods (e.g. analogy-based estimation).