Description: This repository contains code and datasets for the ACM CCS 2022 paper:
Title: Exposing the Rat in the Tunnel: Using Traffic Analysis for Tor-based Malware Detection
Authors: Priyanka Dodia, Mashael AlSabah, Omar Alrawi, Tao Wang
Our proposed solution is a machine-learning-based prototype designed to identify stealthy Tor-based malware C&C connections using traffic analysis on encrypted Tor traffic. The models also infer the type of malware from the Tor traffic by fingerprinting malicious behavior at the connection and host levels.
Note: Conference presentation slides PDF
Main file: classify_topk.py
USAGE (Python 3.7): python classify_topk.py --options [options_file] --topk [topk] --train|--zeroday
[options_file]: Options file defining parameter inputs for classification
[topk]: Number of most active Tor connections to use (k=1 or k=3)
[train]: Set option to train models for binary/multi-label classification
[zeroday]: Set option to test trained models on provided zeroday data
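To illustrate what the [topk] option selects, here is a minimal sketch of picking the k most active connections. It is not taken from classify_topk.py: the function name `select_topk_connections`, the in-memory cell layout, and the use of cell count as the measure of "activity" are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed layout: a cell is (timestamp_seconds, direction), where direction
# is +1 for outgoing and -1 for incoming. The real cell file format may differ.
Cell = Tuple[float, int]


def select_topk_connections(
    connections: Dict[str, List[Cell]], k: int
) -> List[str]:
    """Return the ids of the k connections with the most cells (hypothetical helper)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank by cell count, highest first; break ties by id so the result is stable.
    ranked = sorted(connections, key=lambda cid: (-len(connections[cid]), cid))
    return ranked[:k]


# Toy data: four connections with 2, 4, 1 and 3 cells respectively.
example = {
    "conn_a": [(0.0, 1), (0.1, -1)],
    "conn_b": [(0.0, 1), (0.2, -1), (0.3, -1), (0.5, 1)],
    "conn_c": [(0.0, 1)],
    "conn_d": [(0.0, 1), (0.1, -1), (0.4, -1)],
}
top1 = select_topk_connections(example, k=1)
top3 = select_topk_connections(example, k=3)
```

With the toy data above, `top1` holds only the busiest connection and `top3` holds the three busiest in descending order of activity.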
Datasets provided:
- train_D5: Data used for training/validation/testing ML models
- zerodaytest.zip: Zero day data for testing the trained models on unseen malware Tor traffic
Note: The data consists of cell files representing connections from a PCAP (i.e. Tor traffic obtained from malware/benign binary executions in the Falcon Sandbox). Connection-level features use Tor cell direction, time, and order information. Host-level features use information from all Tor connections in a PCAP and are appended to the end of each cell file.
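As a rough illustration of connection-level features built from cell direction, time, and order, the sketch below computes a handful of simple statistics for one connection. It is not the feature set used in the paper or in classify_topk.py: the function name `connection_features`, the feature names, and the (timestamp, direction) cell layout are assumptions.

```python
from typing import Dict, List, Tuple

# Assumed layout: a cell is (timestamp_seconds, direction), where direction
# is +1 for outgoing and -1 for incoming. The real cell file format may differ.
Cell = Tuple[float, int]


def connection_features(cells: List[Cell]) -> Dict[str, float]:
    """Compute a few direction, time and order statistics (illustrative only)."""
    total = len(cells)
    if total == 0:
        return {
            "total": 0.0, "outgoing": 0.0, "incoming": 0.0,
            "outgoing_fraction": 0.0, "duration": 0.0,
            "mean_outgoing_position": 0.0,
        }
    # Order cells by time so that list positions reflect arrival order.
    ordered = sorted(cells, key=lambda c: c[0])
    out_positions = [i for i, (_, d) in enumerate(ordered) if d > 0]
    outgoing = len(out_positions)
    return {
        "total": float(total),                       # direction: overall volume
        "outgoing": float(outgoing),                 # direction: outgoing cells
        "incoming": float(total - outgoing),         # direction: incoming cells
        "outgoing_fraction": outgoing / total,       # direction: traffic balance
        "duration": ordered[-1][0] - ordered[0][0],  # time: first to last cell
        # order: average position of outgoing cells within the connection
        "mean_outgoing_position": (
            sum(out_positions) / outgoing if outgoing else 0.0
        ),
    }


features = connection_features([(0.0, 1), (0.2, -1), (0.3, -1), (0.5, 1)])
```

A dictionary like `features` could be flattened into one row of a feature matrix per connection before training a classifier.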
Option files provided:
- options-D5
- options-D5_host
- options-zeroday_binary
- options-zeroday_multilabel
Binary classification:
Note(!): The 'MULTICLASS' option must be set to 0 in the options file.
- Train models with CONNECTION-LEVEL features only [Hayes et al. 2016], derived from the 3 most active Tor connections:
  cmd: python classify_topk.py --options options-D5 --topk 3 --train
- Train models with CONNECTION+HOST-LEVEL features [Dodia et al. 2022], using the 3 most active Tor connections for the connection-level features:
  cmd: python classify_topk.py --options options-D5_host --topk 3 --train
Multi-class classification:
Note(!): The 'MULTICLASS' option must be set to 1 in the options file.
Use the same commands as for binary classification.
Zero-day testing:
- Identify zeroday malware connections using the pre-trained binary classifier model:
  cmd: python classify_topk.py --options options-zeroday_binary --topk 3 --zeroday
- Identify the type of malware (class labels) using the pre-trained multi-label classifier models:
  cmd: python classify_topk.py --options options-zeroday_multilabel --topk 3 --zeroday
- All experiments can be run with topk=1 or topk=3; the best results were achieved when the 3 most active Tor connections were used for training and testing.
- Host features can be enabled or disabled by setting HOSTFTS to True/False, or by uncommenting/commenting the option, in the options file.
- Models trained with HOSTFTS must also be tested with HOSTFTS enabled (i.e. in the zeroday options files).
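The HOSTFTS rule above can be expressed as a small pre-flight check. This is only a sketch under stated assumptions: the options are represented as plain Python dictionaries with real boolean values (the actual options file format is not described here), and both helper names are hypothetical rather than part of classify_topk.py.

```python
from typing import Any, Mapping


def host_features_enabled(options: Mapping[str, Any]) -> bool:
    """Treat a missing (e.g. commented-out) HOSTFTS entry as disabled."""
    # Assumes HOSTFTS holds a real boolean, not the string "True"/"False".
    return bool(options.get("HOSTFTS", False))


def check_hostfts_consistency(
    train_options: Mapping[str, Any], test_options: Mapping[str, Any]
) -> None:
    """Raise if a model trained with host features is tested without them."""
    if host_features_enabled(train_options) and not host_features_enabled(test_options):
        raise ValueError(
            "Model was trained with HOSTFTS enabled; "
            "enable HOSTFTS in the test options as well."
        )


# Hypothetical option sets mirroring the training and zeroday option files.
train_opts = {"MULTICLASS": 0, "HOSTFTS": True}
good_test_opts = {"MULTICLASS": 0, "HOSTFTS": True}
bad_test_opts = {"MULTICLASS": 0}  # HOSTFTS commented out

check_hostfts_consistency(train_opts, good_test_opts)  # passes silently

try:
    check_hostfts_consistency(train_opts, bad_test_opts)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running a check like this before testing surfaces a train/test feature mismatch early, rather than as a shape error inside the classifier.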