This repository has been archived by the owner on May 16, 2021. It is now read-only.

DuDiiC/engineering-thesis


Comparison of machine learning methods based on the prediction of the popularity of open source projects on the GitHub platform

Abstract

The theoretical part covers the basics of machine learning and three supervised learning algorithms: decision trees, random forests and gradient boosting. It also describes the training set, which comes from the GHTorrent website.

The practical part walks through the full workflow of a machine learning project, using the prediction of project popularity on the GitHub platform as the example: data mining, preparation of the training data, training of the selected models and analysis of the results. The models and their performance on this task are then compared.

The implementation uses Python and its popular libraries, mainly scikit-learn and pandas.

Keywords: machine learning, supervised learning, predictions, Python, pandas, scikit-learn, GitHub, GHTorrent

Technologies

Python 3.7.6          Conda 4.9.2
IPython 7.12.0        Pandas 1.1.1
Scikit-learn 0.21.3   Matplotlib 3.1.1
SQLAlchemy 1.3.13     Imbalanced-learn 0.7.0

Predicted values:

The number of new stars in the given month

  • for regression, predicting a specific value
  • for classification, predicting one of the predefined classes:
class   number of new stars in the given month
0       0
1       [1; 20)
2       [20; 50)
3       [50; 100)
4       100+
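
The class boundaries above can be turned into a small labelling function. This is a minimal sketch, not the code used in the thesis: the names `CLASS_LOWER_BOUNDS` and `star_class` and the sample counts are made up for illustration, and only the standard library is used so the snippet runs on its own.

```python
from bisect import bisect_right

# Lower bounds of classes 1..4, taken from the table above.
# Class 0 covers exactly 0 new stars; class 4 is open-ended (100+).
CLASS_LOWER_BOUNDS = [1, 20, 50, 100]


def star_class(new_stars):
    """Map a monthly count of new stars to its class label (0..4)."""
    if new_stars < 0:
        raise ValueError("the number of new stars cannot be negative")
    # bisect_right counts how many lower bounds are <= new_stars,
    # which is exactly the class index for half-open intervals [a; b).
    return bisect_right(CLASS_LOWER_BOUNDS, new_stars)


# Hypothetical monthly counts, chosen to hit every boundary.
monthly_new_stars = [0, 1, 19, 20, 49, 50, 99, 100, 5000]
classes = [star_class(n) for n in monthly_new_stars]
print(classes)  # [0, 1, 1, 2, 2, 3, 3, 4, 4]
```

With pandas, the same binning could likely be expressed with `pandas.cut` and `right=False`; the version above avoids the dependency.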

Machine learning models:

REGRESSION                  CLASSIFICATION
DecisionTreeRegressor       DecisionTreeClassifier
RandomForestRegressor       RandomForestClassifier
GradientBoostingRegressor   GradientBoostingClassifier
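
The six scikit-learn estimators listed above can be compared with one shared helper. The sketch below is an illustration under stated assumptions, not the thesis code: the data is synthetic (generated with `make_regression` and `make_classification`, not taken from GHTorrent), the hyperparameters are arbitrary, and `fit_and_score` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import (
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hyperparameters here are arbitrary small values, not the thesis settings.
REGRESSORS = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=20, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(n_estimators=20, random_state=0),
}
CLASSIFIERS = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=20, random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(n_estimators=20, random_state=0),
}


def fit_and_score(models, X, y):
    """Train each model on a common split and return its test-set score.

    The score is R^2 for regressors and accuracy for classifiers,
    which is what the estimators' own `score` method reports.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic stand-ins for the real training data (assumption).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_clf, y_clf = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

reg_scores = fit_and_score(REGRESSORS, X_reg, y_reg)
clf_scores = fit_and_score(CLASSIFIERS, X_clf, y_clf)
for name, score in {**reg_scores, **clf_scores}.items():
    print(f"{name}: {score:.3f}")
```

Using the same split for every model keeps the comparison fair; the five synthetic classes mirror the five popularity classes only in number.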

Results

Regression

[Figures: MSE and R^2]
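
For reference, the two regression metrics named above follow directly from their definitions. The sketch below spells out the formulas in plain Python; the sample values are invented for illustration and are not results from the thesis. In scikit-learn the equivalent functions are `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score`.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented example values, not thesis data.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")  # MSE = 0.375, R^2 = 0.949
```

A lower MSE is better, while R^2 is at most 1 and can be negative when a model does worse than always predicting the mean.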

Classification

[Figure: matrices]
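
Assuming the matrices reported for classification are confusion matrices (the README does not say so explicitly), one can be built for the five popularity classes as follows. This is a sketch with invented labels; `confusion_matrix` here is a hand-written helper, although scikit-learn offers a function of the same name in `sklearn.metrics`.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Return an n_classes x n_classes table of counts.

    Rows are true classes, columns are predicted classes, so the
    diagonal holds the correctly classified samples.
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Invented labels for classes 0..4, not thesis results.
y_true = [0, 0, 1, 2, 4, 4, 3]
y_pred = [0, 1, 1, 2, 4, 3, 3]

matrix = confusion_matrix(y_true, y_pred)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / len(y_true)

for row in matrix:
    print(row)
print(f"accuracy = {accuracy:.3f}")
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy figure hides.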


For more details, please contact me by e-mail.