Title: How far have we progressed in identifying self-admitted technical debts? A comprehensive empirical study
This repository stores the source codes of the four state-of-the-art SATD comments detection approaches, and 20 Java projects whose comments were manually labeled by Maldonado et al. (10) and ourselves (10).
-
MAT/dataset/
This folder stores the comment data of 20 Java projects, consisting of 40 files: 20comments
files (e.g., data--Ant.txt), 20labels
files (i.e., label--Ant). -
MAT/src/
This folder stores the source code ofPattern
,NLP
,TM
, andMAT
written in Java. -
MAT/CNN_Code/
This folder stores the source code forCNN
written in Python. This code was provided by Ren et al. and we modified some code so that it can be used for cross-project predictions. -
MAT/exp_data/
This folder stores the experimental data, including some configuration filesexp_data/dic
and the comment data folderexp_data/origin
. -
MAT/result/
This folder stores all classification results of the each approaches. In particular,MAT/result/predictions/
stores the detailed classification result for each comment of each project. -
Tips 1: The implementation of
Jitterbug
can be found inits origin repository
-
Tips 2: The results of
CNN
approach are all copied from their paper Neural network based detection of self-admitted technical debt: From performance to explainability [1].
In order to make it easier to obtain the classification results, all the Java source codes have been packaged into a runnable jar archive file MAT.jar. One can run it according to the following command regulation.
java -jar MAT.jar -p exp_data_folder_path -o result_path -m model -s scenario
In above command,
-p
indicates the experimental data folder path, in which two sub-folder dic and origin that can be found in MAT/exp_data/ should be pre placed. Specifically, dic stores some configuration files, and origin stores the comment data of each project. The simplest way is to copy the MAT/exp_data/ folder to your machine and then add option -p local position/MAT/exp_data in the command;-o
indicates the result folder path, which stores the classification result of each approach. You can make an empty folder to store the results.-m
indicates a SATD identification model, i.e., Pattern, NLP, TM, and MAT;-s
indicates a prediction scenario, i.e., MTO and OTO.
Here is some usage examples:
java -jar MAT.jar -p D:/exp_data/ -o D:/Result/ -m Pattern -s MTO
java -jar MAT.jar -p D:/exp_data/ -o D:/Result/ -m NLP -s MTO
java -jar MAT.jar -p D:/exp_data/ -o D:/Result/ -m TM -s MTO
java -jar MAT.jar -p D:/exp_data/ -o D:/Result/ -m MAT -s MTO
java -jar MAT.jar -p D:/exp_data/ -o D:/Result/ -m NLP -s OTO
java -jar MAT.jar -p D:/exp_data/ -o D:/Result/ -m TM -s OTO
Year | Authors | Approach | isSupervised | Description |
---|---|---|---|---|
2015 | Potdar et al. | Pattern | No | Pattern (key words) matching |
2017 | Maldonado et al. | NLP | Yes | Natural language processing |
2018 | Huang et al. | TM | Yes | Text mining |
2019 | Ren et al. | CNN | Yes | Convolutional Neural Network |
2020 | Yu et al. | Jitterbug | Yes | Pattern matching & Hunman effort |
Project | Release | Contributors | #Classes | #Comments | #After flitering | SATD | % of SATD |
---|---|---|---|---|---|---|---|
Ant | 1.7.0 | 74 | 1,475 | 21,587 | 3,052 | 102 | 0.47% |
ArgoUML | 0.34 | 87 | 2,609 | 67,716 | 5,426 | 969 | 1.43% |
Columba | 1.4 | 9 | 1,711 | 33,895 | 4,090 | 128 | 0.38% |
EMF | 2.4.1 | 30 | 1,458 | 25,229 | 2,585 | 74 | 0.29% |
Hibernate | 3.3.2 | 226 | 1,356 | 11,630 | 2,492 | 377 | 3.24% |
JEdit | 4.2 | 57 | 800 | 16,991 | 4,644 | 195 | 1.15% |
JFreeChart | 1.0.19 | 19 | 1,065 | 23,474 | 2,494 | 101 | 0.43% |
JMeter | 2.10 | 33 | 1,181 | 20,084 | 4,148 | 282 | 1.40% |
JRuby | 1.4.0 | 328 | 1,486 | 11,149 | 3,652 | 383 | 3.44% |
SQuirrel | 3.0.3 | 46 | 3,108 | 27,474 | 4,473 | 201 | 0.73% |
Total | ----- | - | 16,249 | 259,229 | 37,056 | 2,812 | 1.08% |
Project | Release | Contributors | #Files | #Comments | #After flitering | SATD | % of SATD |
---|---|---|---|---|---|---|---|
Dubbo | 2.7.4 | 255 | 1,493 | 5,875 | 1,649 | 85 | 1.45% |
Gradle | 5.6.3 | 409 | 7,965 | 15,901 | 3,324 | 321 | 2.02% |
Groovy | 2.5.8 | 284 | 1,526 | 14,199 | 4,435 | 249 | 1.75% |
Hive | 3.1.2 | 192 | 5,817 | 81,127 | 29,340 | 1,046 | 1.29% |
Maven | 3.6.2 | 87 | 886 | 5,448 | 1,219 | 136 | 2.50% |
Poi | 4.1.1 | 12 | 3,477 | 45,666 | 15,033 | 618 | 1.35% |
SpringFramework | 5.2.0 | 401 | 6,355 | 42,574 | 7,712 | 98 | 0.23% |
Storm | 2.1.0 | 304 | 2,267 | 12,258 | 3,639 | 92 | 0.75% |
Tomcat | 9.0.27 | 31 | 2,343 | 37,038 | 12,218 | 287 | 0.77% |
Zookeeper | 3.5.6 | 93 | 677 | 6,894 | 2,691 | 63 | 0.91% |
Total | ------ | - | 32,806 | 266,980 | 81,260 | 2,995 | 1.12% |
Mail: gzq@smail.nju.edu.cn
[1] X. Ren, Z. Xing, X. Xia, D. Lo, X. Wang, J. Grundy. Neural network based detection of self-admitted technical debt: From performance to explainability. ACM Transactions on Software Engineering and Methodology, 28(3), 2019: 1-45.