This repository contains a solution for the ML-PT test task with the detailed task description available via the link.
The core of the task is devising unsupervised anomaly/attack detection method for network events - HTTP requests, and implementing it as a Restful Service. The main content of the provided solution can be reached by sequentially browsing through the following notebooks:
01_data_exploration.ipynb
- Initial data exploration
- Data wrangling
- Preliminary analysis of the data columns
02_data_analysis.ipynb
(!)- In depth investigation of the data column values
- PCA analysis and feature space visualization
- Extracting and dumping the features
- Storing the events feature extractor for use in online events classification
03_tune_dbscan.ipynb
(!)- Using DBSCAN for optimal feature clustering
- DBSCAN hyperparameters tuning, finding the optimal number of clusters
- Feature space visualization with clusters
- Identifying GOOD, ATTACK and Unclassified (Noize) clusters
04_select_classifier.ipynb
- Choosing a classifier for online event classification
- Training and comparing default performance of 13 classifiers
- Using k-fold cross-validation
- Motivating selection of Random Forest Classifier
05_tune_classifier.ipynb
(!)- Hyperparameters tuning of the Random Forest Classifier
- Using Random and Grid search with k-fold cross-validation
- Training the Random Forest Classifier with best hyperparameters
- Storing the events classifier to be used in online events classification
- Hyperparameters tuning of the Random Forest Classifier
06_run_test_client.ipynb
(!)- Runs the client side to evaluate the online classifier ran as a Restful (Flask) Service
- Loads the initially provided raw events data and queries the online service for classification
- Compares the resulting event classes with the ones generated with Random Forest Classifier
- This is done for the sanity check to make sure data wrangling and featurization work well
NOTE 01: The notebooks 01
to 05
above can be run sequentially, if needed to re-produce the results or re-generate the classifier and feature extractor models.
NOTE 02: To test the service one does not have to run notebooks 01
to 05
but shall:
- Run the event classification restful service (see sections below)
- Execute the
06_run_test_client.ipynb
notebook or use his/her own client to send requests to:http://127.0.0.1:8080/predict
- ML-PT/ - the root folder
- docs/ - the papers used for inspiration and the test task description
- data/ - the provided raw data csv along with files generated from notebooks
- src/ - the main source code folder
- wrangler/ - data wrangling related files
- features/ - feature extraction related files
- model/ - packages related to clustering and classification models training/tuning
- dbscan/ - the DBSCAN tuning, cluster visulisation related sources
- classifier/ - the files related to selecting and training/tuning the classifier
- service/ - Flask-based event classification Restful Service
- utils/ - various utility source files
- README.md - this read-me file
- LICENSE - licensing information file
- Dockerfile - docker image building file
- requirements.txt - the python package list needed for running
- *.ipynb - multiple notebooks containing the main research work and testing
Requires Python 3.8
- Install packages
ML-PT % pip install -r requirements.txt
- Run server
ML-PT % python ./src/service/flask_app.py
- Run test notebook
ML-PT % 06_run_test_client.ipynb
- Install Docker
https://docs.docker.com/get-docker/
- Pull docker image:
docker image pull zapreevis/python-ml-pt:latest
- Run docker container:
docker run -p 127.0.0.1:8080:8080 -i -t zapreevis/python-ml-pt:latest
- Run test notebook
ML-PT % 06_run_test_client.ipynb
Hereby an example request sent via curl
:
% curl -X 'POST' \
'http://127.0.0.1:8080/predict' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '[{"data": "{\"CLIENT_IP\": \"188.138.92.55\", \"CLIENT_USERAGENT\": NaN, \"REQUEST_SIZE\": 166, \"RESPONSE_CODE\": 404, \"MATCHED_VARIABLE_SRC\": \"REQUEST_URI\", \"MATCHED_VARIABLE_NAME\": NaN, \"MATCHED_VARIABLE_VALUE\": \"//tmp/20160925122692indo.php.vob\", \"EVENT_ID\": \"AVdhXFgVq1Ppo9zF5Fxu\"}"}, \
{"data": "{\"CLIENT_IP\": \"93.158.215.131\", \"CLIENT_USERAGENT\": \"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0\", \"REQUEST_SIZE\": 431, \"RESPONSE_CODE\": 302, \"MATCHED_VARIABLE_SRC\": \"REQUEST_GET_ARGS\", \"MATCHED_VARIABLE_NAME\": \"url\", \"MATCHED_VARIABLE_VALUE\": \"http://www.galitsios.gr/?option=com_k2%5C%5C\", \"EVENT_ID\": \"AVdcJmIIq1Ppo9zF2YIp\"}"}]'
The example service response structure is:
[
{
"EVENT_ID": "AVdhXFgVq1Ppo9zF5Fxu",
"LABEL_PRED": 42
},
{
"EVENT_ID": "AVdcJmIIq1Ppo9zF2YIp",
"LABEL_PRED": 3
}
]
ML-PT % pipreqs . --force
ML-PT % docker build -t zapreevis/python-ml-pt:latest .
ML-PT % docker push zapreevis/python-ml-pt:latest
- Event though we get at most 2.8% of noize when running DBSCAN, one can try to reduce the noize levels by reducing min_samples
- It would make sence to change the encoding of
MATCHED_VARIABLE_NAME
andMATCHED_VARIABLE_VALUE
column values- It is currently done using TFIDF vectorizer but it is not very suitable for encoding URLs, URL parameters and alike