Identifying the compiler family and optimization level using machine learning approaches on BinComp dataset.

MohamedElahl/Compiler-provenance-using-machine-learning

Compiler Provenance using Machine and Deep Learning

Problem: Identifying the compiler family and optimization level used to build a binary is an important step in malware analysis and reverse engineering. Recovering this provenance information from executables helps detect malware faster.

Methodologies: This project applies Logistic Regression, Support Vector Machines (SVM), Multi-Layer Perceptron (MLP), Decision Tree, AdaBoost, Random Forest, ensemble learning, and model stacking to the compiler provenance of executables. Feature engineering was carried out with the Strings utility and the Ndisasm disassembler. In addition, the optimization level classification problem was tested with a deep learning model.
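The feature engineering step can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names `printable_strings`, `opcode_sequence`, and `ngram_counts` are hypothetical, the string extraction only approximates what the Strings utility does (runs of printable ASCII, minimum length 4), and the parser assumes Ndisasm's usual three-column listing (offset, hex bytes, instruction). The sample inputs are toy data.

```python
import re
from collections import Counter


def printable_strings(data, min_len=4):
    """Approximate the Strings utility: runs of printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def opcode_sequence(ndisasm_text):
    """Return the first token (mnemonic or prefix) of each instruction line."""
    mnemonics = []
    for line in ndisasm_text.splitlines():
        parts = line.split(None, 2)  # offset, hex bytes, instruction text
        if len(parts) < 3:
            continue  # skip blank lines and byte-continuation lines
        mnemonics.append(parts[2].split()[0].lower())
    return mnemonics


def ngram_counts(tokens, n=2):
    """Count consecutive n-grams of tokens (usable as classifier features)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Toy inputs (assumed for illustration, not taken from the BinComp dataset).
raw_bytes = b"\x00GCC: (GNU) 4.8\x00ab\x01"
listing = """\
00000000  55                push ebp
00000001  89E5              mov ebp,esp
00000003  83EC10            sub esp,byte +0x10
00000006  C745FC00000000    mov dword [ebp-0x4],0x0
0000000D  C9                leave
0000000E  C3                ret
"""

strings_found = printable_strings(raw_bytes)
ops = opcode_sequence(listing)
bigrams = ngram_counts(ops, 2)
```

Counts like `bigrams` could then be written to a CSV row per binary and fed to any of the classifiers listed above; which features the project actually used is not stated here.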

Results: For compiler family classification, the stacking model achieved the best test accuracy, 100%. For optimization level classification, the deep learning model achieved the best test accuracy, 85.9%.

Figures: compiler family confusion matrix; optimization level results.

Dataset

BinComp compiler fingerprinting dataset: https://github.com/BinSigma/BinComp/tree/master/Dataset

The CSV files of disassembly and strings features are available upon request.

Request Compiler Provenance CSV Dataset!

Contributors

Mohamed Elahl - Hassan Mohamed - Karim Youssef - Doha ElHady
