ML-code-smell-detection

This repository contains the reproducibility package for the paper "Automatic detection of Feature Envy and Data Class code smells using machine learning". Our Data Class and Feature Envy detection experiments use the MLCQ dataset:

Madeyski, L. and Lewowski, T., 2020. MLCQ: Industry-relevant code smell data set. In Proceedings of the Evaluation and Assessment in Software Engineering (pp. 342-347).

The MLCQ dataset is publicly available at https://zenodo.org/record/3666840#.YnOJ1ehBwuU.

Dataset

The dataset consists of the code snippets from the MLCQ dataset that are annotated for the presence of the Feature Envy and Data Class code smells and that were available for download.

The dataset was divided into a training set (80%) and a test set (20%) using stratified random sampling. To obtain more reliable results, each experiment was repeated 51 times on different train-test splits (see the feature envy and data class Jupyter notebooks). The resulting train-test splits are provided with this package.
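The splitting procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from the notebooks: the helper `stratified_split`, the toy labels, and the choice of using the trial index as the random seed are all assumptions made for the example. scikit-learn's `train_test_split` with the `stratify` argument offers the same kind of split.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts.

    Hypothetical helper written for this illustration.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (assumption): 1 = smell present, 0 = smell absent.
labels = [1] * 20 + [0] * 80

# One split per trial; seeding with the trial index is an assumption.
N_TRIALS = 51
splits = [stratified_split(labels, test_fraction=0.2, seed=trial)
          for trial in range(N_TRIALS)]

train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))         # 80 20
print(Counter(labels[i] for i in test_idx))  # Counter({0: 16, 1: 4})
```

Fixing the seed per trial keeps every split reproducible, so all models can be compared on identical train-test partitions.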

Feature extraction

We extracted the following features:

Source code metrics: we extracted the metric values using the following metric extraction tools:
- CK Tool
- RepositoryMiner

We provide two CSV files with the original metric values:
- Feature Envy
- Data Class

CuBERT neural source code embeddings: we used the publicly released pre-trained Java model.

We extracted 1024-dimensional vectors for the Data Class and Feature Envy code snippets from the MLCQ dataset. First, we computed a code embedding for every line of a code snippet separately. We then combined the line embeddings of each snippet into a single vector in two ways: by summing them and by averaging them. The embeddings are available as pickled DataFrames:
- Feature Envy:
  - CuBERT_sum
  - CuBERT_avg
- Data Class:
  - CuBERT_sum
  - CuBERT_avg
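The sum and average aggregation of line embeddings described above can be illustrated with a small sketch. The helper `aggregate_line_embeddings` and the 4-dimensional toy vectors are assumptions made for readability (the real CuBERT vectors have 1024 dimensions), and the line embeddings themselves are taken as given rather than computed by a model. With NumPy arrays, the same result comes from `np.sum(x, axis=0)` and `np.mean(x, axis=0)`.

```python
def aggregate_line_embeddings(line_embeddings):
    """Combine per-line vectors into one summed and one averaged snippet vector.

    Hypothetical helper written for this illustration.
    """
    if not line_embeddings:
        raise ValueError("need at least one line embedding")
    dim = len(line_embeddings[0])
    if any(len(vec) != dim for vec in line_embeddings):
        raise ValueError("all line embeddings must have the same dimension")

    # Element-wise sum across lines, then divide by the number of lines.
    summed = [sum(column) for column in zip(*line_embeddings)]
    averaged = [value / len(line_embeddings) for value in summed]
    return summed, averaged


# Toy embeddings for a three-line snippet (assumed values).
line_embeddings = [
    [1.0, 2.0, 0.0, -1.0],
    [3.0, 0.0, 2.0, 1.0],
    [2.0, 4.0, 4.0, 0.0],
]

snippet_sum, snippet_avg = aggregate_line_embeddings(line_embeddings)
print(snippet_sum)  # [6.0, 6.0, 6.0, 0.0]
print(snippet_avg)  # [2.0, 2.0, 2.0, 0.0]
```

Both aggregates have the same dimension as a single line embedding, so snippets of any length map to fixed-size feature vectors. The sum grows with the number of lines, while the average does not.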
CodeT5 neural source code embeddings: we used the publicly released pre-trained base and small models.

We extracted 768-dimensional (base model) and 512-dimensional (small model) vectors for the Data Class and Feature Envy code snippets from the MLCQ dataset. In addition to line-by-line embedding, we also embedded the whole class or method at once (see the feature envy and data class Jupyter notebooks). The embeddings are available as pickled DataFrames:
- Feature Envy:
  - CodeT5 base model
  - CodeT5 small model
- Data Class:
  - CodeT5 base model
  - CodeT5 small model

Results

Jupyter notebooks evaluating the performance of all approaches:

Feature importance analysis

Jupyter notebooks presenting the most important features of models trained over 51 trials using source code metrics:
