Detecting and Classifying Android Malware using Deep Learning Techniques
- Creating a large CSV file with all the features and categories
- Creating multiple data files for Benign vs. a malware category
- Selecting features for a dataset (in progress)
- Visualizing some features in the data
Since our models haven't been performing well, I decided to complete a Sanity Check notebook, demonstrating all of the techniques we're employing here and trying to find any failures in our methods.
- One issue I found was with how the data was being stratified: we were using `train_test_split` from sklearn, which, as it turns out, does not stratify by default. I've fixed this in the Adware vs. Benign notebook and the rest (a minimal sketch of the fix is below). Despite this, performance is still low.
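For reference, a minimal sketch of the fix, using synthetic data for illustration (the variable names, sizes, and split proportions are assumptions, not the exact ones in the notebooks):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for our feature matrix and labels
X = np.random.rand(1000, 9)                              # 9 flow features per sample
y = np.random.choice([0, 1], size=1000, p=[0.55, 0.45])  # imbalanced benign/malware labels

# Passing the labels to `stratify=` preserves the class proportions in both
# splits; without it, train_test_split does NOT stratify.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```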
X vs Benign
Below are the experiments we want to run for the paper. Each experiment should be run on each of the frameworks listed under it (fastai, Keras-TensorFlow, Keras-Theano).
The metrics we want to collect for all of these experiments are the Accuracy, Loss, True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR). Depending on the platform the experiments are run on (fastai, Keras), there are different ways of acquiring these metrics; notes on how to do so are detailed below.
```python
from tensorflow.keras.metrics import (
    BinaryAccuracy, TruePositives, TrueNegatives, FalsePositives, FalseNegatives
)

# Initializing the metrics objects
accuracy = BinaryAccuracy()
tp = TruePositives()
tn = TrueNegatives()
fp = FalsePositives()
fn = FalseNegatives()
metrics = [accuracy, tp, tn, fp, fn]

# Adding the metrics to the model's compile method
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=metrics)
```
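Keras reports raw counts rather than rates, so the TPR/TNR/FPR/FNR have to be derived afterwards. A minimal sketch of that post-processing (the counts in the example call are placeholders, not real results):

```python
def rates_from_counts(tp, tn, fp, fn):
    """Derive the rate metrics from the raw counts Keras collects."""
    tpr = tp / (tp + fn)  # True Positive Rate (sensitivity)
    tnr = tn / (tn + fp)  # True Negative Rate (specificity)
    fpr = fp / (fp + tn)  # False Positive Rate
    fnr = fn / (fn + tp)  # False Negative Rate
    return tpr, tnr, fpr, fnr

# Example with placeholder counts
print(rates_from_counts(tp=80, tn=90, fp=10, fn=20))
```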
```python
from fastai.metrics import accuracy, Recall, Precision

# Set up the metrics we want to collect. I wanted TP, TN, FP, FN, but those
# weren't available; Recall and Precision are still extremely helpful for
# evaluating the model.
metrics = [accuracy, Recall(), Precision()]

# Create the learner
```
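As a point of reference, a minimal sketch of what creating the learner with these metrics could look like, assuming fastai v2's tabular API (the CSV name, column names, layer sizes, and training schedule are hypothetical, not the ones in our notebooks):

```python
from fastai.tabular.all import TabularDataLoaders, tabular_learner, Normalize
from fastai.metrics import accuracy, Recall, Precision

# Hypothetical flow-feature CSV with a binary 'Label' column
dls = TabularDataLoaders.from_csv(
    'flows.csv',
    y_names='Label',
    cont_names=['Flow IAT Max', 'Flow IAT Min', 'FIN Flag Count'],
    procs=[Normalize],
)

# Tabular neural net with the metrics defined above
learn = tabular_learner(dls, layers=[200, 100],
                        metrics=[accuracy, Recall(), Precision()])
learn.fit_one_cycle(5, lr_max=1e-3)
```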
### To-Do List
- [x] Create a line graph demonstrating how **loss** changes as we change the learning rate and a scatter plot demonstrating the differences in performance (**accuracy** or **loss**) for each different optimizer.
- Screenshot is in /ScreeShots/sgd.png
- Loss was improved by using the mean_squared_error loss function. With other optimizers such as Adam and Adadelta, the accuracy stayed the same, i.e. ~53%.
- Various learning rates (e.g., 0.1 and 0.01) were tried; after some research, 0.001 was found to be the optimal one.
- Using the SGD optimizer with the mean_squared_error loss function on Keras-TensorFlow and Keras-Theano produced similar results (accuracy ~53%, loss ~24). A minimal sketch of this setup follows the To-Do list.
- Binary Classification Problem: Is a sample malicious or benign traffic? (Completed before the previous To-Do, so I may have to rerun these with a new optimizer or learning rate)
- fastai
- Keras-TensorFlow
- Keras-Theano
- Multi-Classification Problem #1: Can we differentiate between benign, adware, scareware, etc... traffic?
- fastai
- Keras-TensorFlow
- Keras-Theano
- Multi-Classification Problem #2: Can we differentiate between the different species of each type of malicious traffic? (Ex. Gooligan vs ... vs Shuanet for Adware)
- Adware
- fastai
- Keras-TensorFlow
- Keras-Theano
- Ransomware
- fastai
- Keras-TensorFlow
- Keras-Theano
- Scareware
- fastai
- Keras-TensorFlow
- Keras-Theano
- SMSmalware
- fastai
- Keras-TensorFlow
- Keras-Theano
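A minimal sketch of the optimizer/learning-rate setup referenced in the notes above, assuming the Keras Sequential API (the input dimension of 9 matches our reduced feature set, but the layer widths are otherwise illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Small fully connected net over the nine selected flow features
model = Sequential([
    Dense(32, activation='relu', input_shape=(9,)),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid'),   # binary output: malicious vs. benign
])

# The settings from the notes: SGD with lr=0.001 and mean_squared_error loss.
# Swapping in other optimizers (Adam, Adadelta) or learning rates (0.1, 0.01)
# is a one-line change here, which is how such a comparison can be run.
model.compile(optimizer=SGD(learning_rate=0.001),
              loss='mean_squared_error',
              metrics=['accuracy'])
```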
- Adware
Place acquired graphs and data here (or point us to the file with the data)
While detecting Android malware with deep learning and other machine learning techniques, employing both static and dynamic analysis of the malware, seems to be a solved academic problem (Z. Yuan et al., X. Su et al., Abdelmonim Naway and Yuancheng LI, Karbab et al.), there is little published work that applies machine learning techniques to network traffic specifically to detect Android malware. The CICMalAnal2017 dataset is one of the only datasets containing real, up-to-date network traffic from malicious and benign Android applications. The goal of this project is to employ deep learning techniques, in conjunction with the CICMalAnal2017 dataset, to accurately identify the intent of a given application from its collected network traffic data.
The dataset used for this project is described by the Canadian Institute for Cybersecurity at the University of New Brunswick on their dataset page; the link at the bottom of that description can be used to download the dataset. Alternatively, the provided dl-data.sh script may be used (however, the download link it uses needs occasional updating; the script works as of May 2020).
Since this is a significant dataset (roughly 300 MB zipped), the download takes a while. Go enjoy a coffee while you wait.
As described in Lashkari et al., only nine of the 80+ provided attributes are used to achieve high accuracy with simpler machine learning algorithms. For computational and temporal simplicity, only these nine attributes are kept for the analysis conducted here. Below, the nine attributes from the paper are matched to the corresponding attribute names in the dataset:
- Maximum flow packet length (Flow IAT Max)
- Minimum flow packet length (Flow IAT Min)
- Backward variance data bytes (Bwd Packet Length Std)*
- Flow FIN F 17 (FIN Flag Count)
- Flow forward bytes (Fwd IAT Total)
- Flow backward bytes (Bwd IAT Total)
- Maximum Idle (Idle Max)
- Initial window forward (Init_Win_bytes_forward)
- Minimum segment size forward (min_seg_size_forward)
* (The variance attribute could not be found in the dataset, so the standard deviation is used instead since it is closely related.)
Since the analysis is focused on determining the type of traffic (malicious/benign) given a sample, attributes such as IP addresses and port numbers are dropped from the dataset. These have obvious uses in approaches such as black/whitelists; however, that is not the contribution of this project. NaN values are also dropped if present. A minimal sketch of this preprocessing is shown below.
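A minimal sketch of this feature selection and cleanup with pandas (the CSV path and the 'Label' column name are assumptions):

```python
import pandas as pd

# The nine attributes kept for the analysis, as named in the dataset
FEATURES = [
    'Flow IAT Max', 'Flow IAT Min', 'Bwd Packet Length Std',
    'FIN Flag Count', 'Fwd IAT Total', 'Bwd IAT Total',
    'Idle Max', 'Init_Win_bytes_forward', 'min_seg_size_forward',
]

df = pd.read_csv('all_flows.csv')   # hypothetical combined CSV
df = df[FEATURES + ['Label']]       # drop IPs, ports, and everything else
df = df.dropna()                    # drop NaN values if present
```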
The composition of the dataset is shown in the table below:
Type | Number of Instances |
---|---|
Benign | 1,210,210 |
Malware | 982,212 |
Adware | 424,147 |
Broken down further, we have a clearer idea of the makeup.
Type | Number of Instances |
---|---|
Benign | 1,210,210 |
Adware | 424,147 |
Scareware | 401,165 |
Ransomware | 348,943 |
SMSmalware | 229,275 |
Additionally, the table below lists each malware type together with the species within it and their instance counts.
Malware Type | Species | Number of Instances |
---|---|---|
ADWARE | DOWGIN | 39,682 |
ADWARE | EWIND | 43,374 |
ADWARE | FEIWO | 56,632 |
ADWARE | GOOLIGAN | 93,772 |
ADWARE | KEMOGE | 38,771 |
ADWARE | KOODOUS | 32,547 |
ADWARE | MOBIDASH | 31,034 |
ADWARE | SELFMITE | 13,029 |
ADWARE | SHUANET | 39,271 |
ADWARE | YOUMI | 36,035 |
RANSOMWARE | CHARGER | 39,551 |
RANSOMWARE | JISUT | 25,672 |
RANSOMWARE | KOLER | 44,555 |
RANSOMWARE | LOCKERPIN | 25,307 |
RANSOMWARE | PLETOR | 4,715 |
RANSOMWARE | PORNDROID | 46,082 |
RANSOMWARE | RANSOMBO | 39,859 |
RANSOMWARE | SIMPLOCKER | 36,340 |
RANSOMWARE | SVPENG | 54,161 |
RANSOMWARE | WANNALOCKER | 32,701 |
SCAREWARE | ANDROIDDEFENDER | 56,440 |
SCAREWARE | ANDROIDSPY | 25,414 |
SCAREWARE | AVFORANDROID | 42,448 |
SCAREWARE | AVPASS | 40,776 |
SCAREWARE | FAKEAPP | 34,676 |
SCAREWARE | FAKEAPPAL | 44,563 |
SCAREWARE | FAKEAV | 40,089 |
SCAREWARE | FAKEJOBOFFER | 30,683 |
SCAREWARE | FAKETAOBAO | 33,299 |
SCAREWARE | PENETHO | 21,631 |
SCAREWARE | VIRUSSHIELD | 23,716 |
SCAREWARE | (Unlabeled) | 7,430 |
SMSMALWARE | BEANBOT | 12,371 |
SMSMALWARE | BIIGE | 33,678 |
SMSMALWARE | FAKEINST | 15,026 |
SMSMALWARE | FAKENOTIFY | 22,197 |
SMSMALWARE | FAKEMART | 6,401 |
SMSMALWARE | JIFAKE | 5,993 |
SMSMALWARE | MAZARBOT | 6,065 |
SMSMALWARE | NANDROBOX | 44,517 |
SMSMALWARE | PLANKTON | 39,765 |
SMSMALWARE | SMSSNIFFER | 33,618 |
SMSMALWARE | ZSONE | 9,644 |
MALWARE | Unlabeled | 2,828 |
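These counts can be reproduced from the combined CSV with a simple group-by; the 'Label' and 'Family' column names below are assumptions about how the type and species are stored.

```python
import pandas as pd

df = pd.read_csv('all_flows.csv')          # hypothetical combined CSV

# Instances per traffic type (Benign, Adware, Scareware, ...)
print(df['Label'].value_counts())

# Instances per species within each malware type
print(df.groupby(['Label', 'Family']).size())
```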
- Compare optimizer and learning rate performance
- Binary Classification Problem: Is a sample malicious or benign traffic?
- fastai
- Keras-TensorFlow
- Keras-Theano
- Multi-Classification Problem #1: Can we differentiate between benign, adware, scareware, etc... traffic? (a minimal multi-class model sketch follows this list)
- fastai
- Keras-TensorFlow
- Keras-Theano
- Multi-Classification Problem #2: Can we differentiate between the different species of each type of malicious traffic? (Ex. Gooligan vs ... vs Shuanet for Adware)
- Adware
- fastai
- Keras-TensorFlow
- Keras-Theano
- Ransomware
- fastai
- Keras-TensorFlow
- Keras-Theano
- Scareware
- fastai
- Keras-TensorFlow
- Keras-Theano
- SMSmalware
- fastai
- Keras-TensorFlow
- Keras-Theano
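The classification problems above differ mainly in the output layer and loss function. A minimal Keras sketch of the multi-class variant (the five classes correspond to benign plus the four malware types; the layer widths are otherwise illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Multi-class head: one softmax unit per class (benign, adware, scareware,
# ransomware, SMSmalware), trained with categorical cross-entropy. The binary
# problem instead uses a single sigmoid unit with binary_crossentropy.
model = Sequential([
    Dense(32, activation='relu', input_shape=(9,)),
    Dense(16, activation='relu'),
    Dense(5, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer-encoded labels assumed
              metrics=['accuracy'])
```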
- Adware
- Performance results using various deep learning frameworks are compared:
  - fastai (https://www.fast.ai/), which uses PyTorch (https://pytorch.org/) as the backend
  - Keras (https://keras.io/), which uses TensorFlow (https://www.tensorflow.org/) and Theano (https://github.com/Theano/Theano) as backends
- Classification of adware types:
Framework | Accuracy (%) |
---|---|
Fastai-Pytorch | 42.72 |
Keras-Tensorflow | * |
Keras-Theano | * |
Framework | Accuracy (%) |
---|---|
Fastai-Pytorch | * |
Keras-Tensorflow | * |
Keras-Theano | * |
Framework | Accuracy (%) |
---|---|
Fastai-Pytorch | * |
Keras-Tensorflow | * |
Keras-Theano | * |
Framework | Accuracy (%) |
---|---|
Fastai-Pytorch | * |
Keras-Tensorflow | * |
Keras-Theano | * |
- Arash Habibi Lashkari, Andi Fitriah A. Kadir, Laya Taheri, and Ali A. Ghorbani, "Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification", in Proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.
- Z. Yuan, Y. Lu and Y. Xue, "Droiddetector: android malware characterization and detection using deep learning," in Tsinghua Science and Technology, vol. 21, no. 1, pp. 114-123, Feb. 2016, doi: 10.1109/TST.2016.7399288.
- X. Su, D. Zhang, W. Li and K. Zhao, "A Deep Learning Approach to Android Malware Feature Learning and Detection," 2016 IEEE Trustcom/BigDataSE/ISPA, Tianjin, 2016, pp. 244-251, doi: 10.1109/TrustCom.2016.0070.
- Abdelmonim Naway and Yuancheng LI, "A Review on The Use of Deep Learning in Android Malware Detection", 2018
- Karbab, Elmouatez & Debbabi, Mourad & Derhab, Abdelouahid & Mouheb, Djedjiga. (2017). Android Malware Detection using Deep Learning on API Method Sequences.