This repository contains the reproducibility package for the paper "Automatic detection of Feature Envy and Data Class code smells using machine learning". In our experiments, we used the MLCQ dataset for Data Class and Feature Envy code smell detection:
Madeyski, L. and Lewowski, T., 2020. MLCQ: Industry-relevant code smell data set. In Proceedings of the Evaluation and Assessment in Software Engineering (pp. 342-347).
publicly available at https://zenodo.org/record/3666840#.YnOJ1ehBwuU.
A dataset containing code snippets annotated for the presence of Feature Envy and Data Class code smells from the MLCQ dataset that were available for download:
The dataset was divided into training (80%) and test (20%) sets using stratified random sampling. To obtain more reliable results, each experiment was repeated 51 times on different train-test splits (Feature Envy and Data Class Jupyter notebooks). These train-test dataset splits can be found:
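As an illustration of the splitting strategy described above, the following is a minimal sketch of how 51 stratified 80/20 splits could be generated with scikit-learn. It is not the code from the notebooks: the helper name `make_splits`, the seed, and the toy label vector are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def make_splits(labels, n_splits=51, test_size=0.2, seed=0):
    """Return a list of (train_idx, test_idx) pairs, stratified by label.

    Hypothetical helper for illustration only; the seed is an assumption.
    """
    labels = np.asarray(labels)
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    # The splitter only needs the number of samples, so a placeholder
    # feature matrix is enough here.
    placeholder = np.zeros((len(labels), 1))
    return list(splitter.split(placeholder, labels))


# Toy labels (assumed): 20 smelly snippets and 80 clean ones.
labels = np.array([1] * 20 + [0] * 80)
splits = make_splits(labels)

for train_idx, test_idx in splits[:3]:
    # Each split preserves the class ratio in both parts.
    print(len(train_idx), len(test_idx), labels[test_idx].mean())
```

Stratification keeps the proportion of smelly snippets the same in the training and test sets, which matters when the positive class is a minority.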
We extracted the following features:
- Source code metrics – we extracted the metric values using the following metric extraction tools:
  We provide two CSV files with the original metric values:
- CuBERT neural source code embeddings – we used the pre-trained Java model available here.
  We extracted 1024-dimensional vectors for the Data Class and Feature Envy code snippets from the MLCQ dataset. First, we computed a code embedding for each line of a snippet separately. We then aggregated the line embeddings using simple mathematical operations: the sum and the average of all line embeddings in the snippet. The embeddings are available as pickled DataFrames:
  - Feature Envy:
  - Data Class:
- CodeT5 neural source code embeddings – we used the base and small pre-trained models available here.
  We extracted 768-dimensional (base model) and 512-dimensional (small model) vectors for the Data Class and Feature Envy code snippets from the MLCQ dataset. In addition to the line-by-line embeddings, we embedded the whole class/method at once (Feature Envy and Data Class Jupyter notebooks). The embeddings are available as pickled DataFrames:
  - Feature Envy:
    - CodeT5 base model
    - CodeT5 small model
  - Data Class:
    - CodeT5 base model
    - CodeT5 small model
- Feature Envy:
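The line-by-line aggregation described for the CuBERT and CodeT5 embeddings can be sketched as follows. This is a minimal illustration rather than the code from the notebooks: the function name `aggregate_line_embeddings` is hypothetical, and random vectors stand in for real model outputs (1024 dimensions, as in the CuBERT case).

```python
import numpy as np


def aggregate_line_embeddings(line_embeddings):
    """Combine per-line embeddings into snippet-level vectors.

    Hypothetical helper: returns the sum and the average of all
    line embeddings belonging to a single code snippet.
    """
    stacked = np.vstack(line_embeddings)  # shape: (n_lines, dim)
    return stacked.sum(axis=0), stacked.mean(axis=0)


# Stand-in data (assumed): a three-line snippet with 1024-dim line embeddings.
rng = np.random.default_rng(0)
lines = [rng.normal(size=1024) for _ in range(3)]

summed, averaged = aggregate_line_embeddings(lines)
print(summed.shape, averaged.shape)
```

Both aggregates have the same dimensionality as a single line embedding, so snippets of different lengths map to fixed-size feature vectors.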
Jupyter notebooks evaluating the performance of all approaches:
Jupyter notebooks presenting the most important features of models trained over 51 trials using source code metrics: