Department of Computer Science, Graduate School of Information Science & Technology, Osaka University
Finding File Clones in FreeBSD Ports Collection
Yusuke Sasaki
Tetsuo Yamamoto
Yasuhiro Hayase
Katsuro Inoue
File Clones
* Two or more files with the same content
* Comments and code indentation are ignored
* Inside a project or between different projects
* Research about file-clones is scarce
* Goal: gain new knowledge about file-clones
int main() {printf("Hello msr!");return 0;}
[Figure: Project A and Project B]
FCFinder
* Input: .c and .h files
* Output: file-clone sets
* Faster than other tools
* Detection steps: tokenization, MD5 hash calculation, exact matching
Tool         Files   Time          Speedup   Machines
CCFinder     1.4M    960 hours     x1        1 PC
D-CCFinder   1.4M    51 hours      x19       80 PCs
FCFinder     1.4M    17.16 hours   x55       1 PC
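The detection steps named on this slide (tokenization, MD5 hash calculation, exact matching) can be illustrated with a minimal sketch. The slides do not describe FCFinder's actual tokenizer, so the regular-expression tokenizer below is a simplified stand-in, and the names `tokenize`, `fingerprint`, and `find_file_clones` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Simplified C tokenizer (an assumption, not FCFinder's real lexer):
# comments and whitespace are skipped, everything else becomes a token.
TOKEN_RE = re.compile(
    r'(?P<skip>/\*.*?\*/|//[^\n]*|\s+)'   # comments and whitespace
    r'|(?P<tok>"(?:\\.|[^"\\])*"'         # string literal
    r"|'(?:\\.|[^'\\])*'"                 # character literal
    r'|[A-Za-z_]\w*|\d[\w.]*'             # identifier or number
    r'|.)',                               # any other single character
    re.DOTALL,
)


def tokenize(source):
    """Return the token sequence of a C source text, ignoring comments and layout."""
    return [m.group("tok") for m in TOKEN_RE.finditer(source) if m.lastgroup == "tok"]


def fingerprint(source):
    """MD5 hex digest of the token sequence."""
    joined = "\n".join(tokenize(source))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def find_file_clones(files):
    """Group files by fingerprint; `files` maps a path to its source text.

    Returns a sorted list of file-clone sets (each a sorted list of paths)
    that contain at least two files.
    """
    groups = defaultdict(list)
    for path, source in files.items():
        groups[fingerprint(source)].append(path)
    return sorted(sorted(paths) for paths in groups.values() if len(paths) >= 2)
```

Because each file is reduced to a single digest, matching is a dictionary lookup per file rather than a pairwise comparison of files.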
Experiment
Target: only .c and .h files in the FreeBSD Ports Collection
* ~1.4M files
* ~12 GB
* 17.16 hours
We measured:
* File size
* Number of files in each project
* Size of each file-clone set
* Number of file-clones in a project
These values follow a power law
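One common way to check whether a measured distribution follows a power law is a least-squares line fit on log-log axes, reporting the coefficient of determination R^2. The slides do not state which fitting procedure was used, so the sketch below is an assumption; the function name `fit_power_law` is hypothetical and the usage data is synthetic, not taken from the experiment.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**k by linear regression on log10(x), log10(y).

    Returns (k, c, r2). All xs and ys must be positive, and the
    values in xs (and in ys) must not all be identical.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    k = sxy / sxx                    # slope = power-law exponent
    c = 10 ** (mean_y - k * mean_x)  # intercept back-transformed to a scale factor
    r2 = (sxy * sxy) / (sxx * syy)   # R^2 of the straight-line fit
    return k, c, r2


# Synthetic example: y = 1000 * x**-2 sampled at a few sizes.
sizes = [1, 2, 5, 10, 50, 100]
counts = [1000 * s ** -2 for s in sizes]
exponent, scale, r_squared = fit_power_law(sizes, counts)
```

An R^2 close to 1 means the points lie close to a straight line on log-log axes.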
[Figure: Population of file-clone sets. X axis: file-clone set size; Y axis: number of file-clone sets. R^2 = 0.8508]
Annotation D: Left: used in PHP5 (650 sets, 61 file clones); Right: used in PHP4 (500 sets, 59 file clones)
Annotation E: used in both PHP4 and PHP5 (419 sets, 120 file clones)
[Figure: Number of file clones in projects (clones inside a project are excluded). X axis: file-clones per project; Y axis: number of projects with file clones. R^2 = 0.8263]
Annotation G: Left: PHP5 modules; Center: projects related to bin-utils; Right: PHP4 modules
File-clones Between Projects (1/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
Example: gcc41 and gfortran share 7691 file clones
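The project graph described above can be derived from the file-clone sets. The slides do not define exactly how "file clones between two projects" are counted, so this sketch assumes one edge unit per clone set that contains files from both projects. The function name `project_edges`, the file paths, and the project name `binutils` in the usage example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def project_edges(clone_sets):
    """Count, for each pair of projects, the clone sets they share.

    `clone_sets` is an iterable of clone sets; each clone set is an
    iterable of (project, path) pairs. Clone sets confined to a single
    project contribute no edge.
    """
    edges = Counter()
    for clone_set in clone_sets:
        projects = sorted({project for project, _path in clone_set})
        for pair in combinations(projects, 2):
            edges[pair] += 1
    return edges


# Made-up clone sets for illustration only.
example_sets = [
    [("gcc41", "a.c"), ("gfortran", "a.c")],
    [("gcc41", "b.h"), ("gfortran", "b.h"), ("binutils", "b.h")],
    [("gcc41", "c.c"), ("gcc41", "d.c")],  # inside one project: ignored
]
edges = project_edges(example_sets)
```

Each key of the returned counter is an alphabetically ordered project pair, so the graph is undirected.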
File-clones Between Projects (2/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
File-clones Between Projects (3/3)
* Nodes show the projects
* Edges between projects show the number of file clones between two projects
Conclusions & Future Work
Conclusions
* Measured several features of the FreeBSD Ports Collection
* Found that the measured features follow a power law
Future Work
* Investigate logical coupling between projects