Transcript
Page 1: Finding File Clones in FreeBSD Ports Collection

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Finding File Clones in FreeBSD Ports Collection

Yusuke Sasaki

Tetsuo Yamamoto

Yasuhiro Hayase

Katsuro Inoue

Page 2: Finding File Clones in FreeBSD Ports Collection

File Clones

Two or more files with the same content

Comments and code indentation are ignored

File clones occur inside a project or between different projects

Research about file-clones is scarce

Get new knowledge about file-clones

int main() { printf("Hello msr!"); return 0; }

Project A    Project B

Page 3: Finding File Clones in FreeBSD Ports Collection

FCFinder

Input: .c and .h files

Output: File-clone sets

Faster than other tools

Detection: Tokenization -> MD5 Hash Calculation -> Exact Matching

Tool         Speed                       Speedup   Machines
CCFinder     1.4M files / 960 hours      x1        1 PC
D-CCFinder   1.4M files / 51 hours       x19       80 PCs
FCFinder     1.4M files / 17.16 hours    x55       1 PC
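
The detection steps named on this slide (tokenization, MD5 hash calculation, exact matching) can be illustrated with a minimal sketch. This is not FCFinder's implementation: the regular-expression lexer, the function names `tokenize`, `fingerprint` and `find_file_clone_sets`, and the choice to pass file contents in a dictionary are all assumptions made for illustration. The lexer is a simplification that drops comments and whitespace, so files differing only in comments or indentation get the same hash.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical, simplified C lexer (an assumption, not FCFinder's tokenizer).
# Comments are matched so they can be skipped; string and character literals
# are matched as single tokens so that "//" inside a string is not mistaken
# for a comment. Whitespace matches nothing and is skipped by finditer.
_LEXER = re.compile(r'''
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<token>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'|\w+|[^\w\s])
''', re.DOTALL | re.VERBOSE)


def tokenize(source: str) -> list[str]:
    """Return the token sequence of a C source text, without comments."""
    return [m.group("token") for m in _LEXER.finditer(source)
            if m.lastgroup == "token"]


def fingerprint(source: str) -> str:
    """MD5 digest of the token sequence (tokens joined by a NUL separator)."""
    joined = "\x00".join(tokenize(source))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_file_clone_sets(files: dict[str, str]) -> list[list[str]]:
    """Group file paths whose token sequences hash to the same digest."""
    by_hash = defaultdict(list)
    for path, source in files.items():
        by_hash[fingerprint(source)].append(path)
    # Only groups with two or more files are file-clone sets.
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


# Toy input (made-up paths and contents, for illustration only).
example_files = {
    "projA/hello.c": 'int main() {printf("Hello msr!");return 0;}',
    "projB/hello.c": '/* greeting */\nint main()\n{\n'
                     '    printf("Hello msr!"); // say hi\n'
                     '    return 0;\n}\n',
    "projB/other.c": "int main() { return 1; }",
}
clone_sets = find_file_clone_sets(example_files)
```

Because each file is reduced to one hash and files are grouped by exact hash equality, the work grows roughly linearly with the number of files, which is consistent with the single-PC timing in the table above.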

Page 4: Finding File Clones in FreeBSD Ports Collection

Experiment

Target: only .c and .h files in the FreeBSD Ports Collection

* ~1.4M files
* ~12 GB
* Processing time: 17.16 hours

We measured:

* File size
* Number of files in each project
* Size of each file-clone set
* Number of file-clones in a project

These values follow the power law
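
One common way to check a power-law claim is to fit a straight line to the data on log-log axes and report the coefficient of determination (R^2). The slides do not say how the fit was computed, so the sketch below is an assumption about the general technique, not the authors' procedure; the function name `fit_power_law` and the sample data are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log10(y) = slope * log10(x) + intercept.

    Returns (slope, intercept, r2). A relation y = c * x**k appears as a
    straight line with slope k and intercept log10(c) on log-log axes.
    All xs and ys must be positive.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    syy = sum((v - mean_y) ** 2 for v in ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 is the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Synthetic data that follows y = 1000 * x**-2 exactly (not measured data).
sizes = [1, 2, 4, 8, 16]
counts = [1000 / (s ** 2) for s in sizes]
slope, intercept, r2 = fit_power_law(sizes, counts)
```

On real measurements the points scatter around the line, so R^2 falls below 1; a value close to 1 indicates that a straight line on log-log axes describes the data well.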

Page 5: Finding File Clones in FreeBSD Ports Collection

Population of file-clone sets

[Figure: number of file-clone sets (y axis) plotted against file-clone set size (x axis). Fit reported on the plot: R^2 = 0.8508 (series labeled "file clone set size").]

Annotations on the plot (labels D and E):

* Left: used in PHP5; Right: used in PHP4
* Used in both PHP4 and PHP5
* 419 sets; L: 650 sets; R: 500 sets
* L: 61 file clones; R: 59 file clones
* 120 file clones

Page 6: Finding File Clones in FreeBSD Ports Collection

Number of file clones in projects (clones inside a project are excluded)

[Figure: number of projects with file clones (y axis) plotted against file-clones per project (x axis). Fit reported on the plot: R^2 = 0.8263 (series labeled "number of file clone sets").]

Annotations on the plot (label G):

* Left: PHP5 modules
* Center: projects related to bin-utils
* Right: PHP4 modules

Page 7: Finding File Clones in FreeBSD Ports Collection

File-clones Between Projects (1/3)

* Nodes show the projects
* Edges between projects show the number of file clones between two projects

Ex) gcc41 and gfortran share 7691 file clones
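
A graph like the one described on this slide can be derived from file-clone sets by counting, for each pair of projects, how many clone sets contain files from both. The slides do not define the exact counting rule, so this sketch assumes one count per clone set per project pair. The names `project_of` and `clone_edges`, the convention that the first path component is the project name, and the sample paths are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def project_of(path: str) -> str:
    """Assumption: the first path component names the project."""
    return path.split("/", 1)[0]


def clone_edges(clone_sets):
    """Map each unordered project pair to the number of shared clone sets."""
    edges = Counter()
    for paths in clone_sets:
        # Distinct projects in this clone set, sorted so that each pair
        # always appears in the same order as a dictionary key.
        projects = sorted({project_of(p) for p in paths})
        for a, b in combinations(projects, 2):
            edges[(a, b)] += 1
    return edges


# Toy clone sets (made-up paths, not data from the experiment).
toy_clone_sets = [
    ["projA/a.c", "projB/a.c"],
    ["projA/b.h", "projB/b.h", "projC/b.h"],
    ["projA/c.c", "projA/d.c"],  # clones inside one project add no edge
]
edges = clone_edges(toy_clone_sets)
```

Clone sets confined to a single project produce no edge, which matches the exclusion of clones inside a project noted on the earlier plot slide.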

Page 8: Finding File Clones in FreeBSD Ports Collection

File-clones Between Projects (2/3)

* Nodes show the projects
* Edges between projects show the number of file clones between two projects

Page 9: Finding File Clones in FreeBSD Ports Collection

File-clones Between Projects (3/3)

* Nodes show the projects
* Edges between projects show the number of file clones between two projects

Page 10: Finding File Clones in FreeBSD Ports Collection

Conclusions & Future Work

Conclusions

* Measured several features of the FreeBSD Ports Collection
* Found that the measured features follow the power law

Future Work

* Investigation of logical coupling between projects

