MSc thesis project code
- Clone the repo and `cd` into the folder.
- Create a YAML configuration file for the experiment setup. Each entry in the file has:
  - `name`: the name of the index column in the DataFrame; a shorter value makes it easier to manipulate
  - `key`: a dimension along which the experiment varies
  - `values`: the values that dimension takes for the experiment
For example, you can have one experiment using the adaline classifier and another using the logistic regression classifier. You would express that as:

```yaml
- name: classifier
  key: [classifier, type]
  values:
    - adaline
    - logistic regression
```

where `[adaline, logistic regression]` is the list of values that `classifier` can take.
A more general example:

```yaml
- name: classifier
  key: [classifier, type]
  values:
    - adaline
    - logistic regression
    - naive bayes
- name: attack
  key: [attack, type]
  values:
    - dictionary
    - empty
    - ham
    - focussed
- name: '% poisoned'
  key: [attack, parameters, percentage_samples_poisoned]
  values:
    - .0
    - .1
    - .2
    - .5
```
The order of the (name, key, values) groups matters: it determines the order of the columns in the results DataFrame (though the column order can be changed afterwards).
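A minimal sketch of how such a spec expands into a grid of experiments, assuming the YAML above has been parsed into a list of dicts (this uses a two-dimension subset of the example; the actual pipeline code may differ):

```python
import itertools

# Hypothetical parsed spec: a subset of the YAML example above
spec = [
    {'name': 'classifier', 'key': ['classifier', 'type'],
     'values': ['adaline', 'logistic regression', 'naive bayes']},
    {'name': '% poisoned',
     'key': ['attack', 'parameters', 'percentage_samples_poisoned'],
     'values': [.0, .1, .2, .5]},
]

# Column order follows the order of the (name, key, values) groups
names = [dim['name'] for dim in spec]

# One tuple per experiment: the cross-product of all value lists
experiments = list(itertools.product(*(dim['values'] for dim in spec)))

print(names)             # ['classifier', '% poisoned']
print(len(experiments))  # 3 classifiers x 4 poisoning levels = 12
```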
- Check the values in default_spec.yaml, especially the `dataset_filename` key (although this can also be a key that is varied). Here is a full example:

```yaml
dataset_filename: trec2007-1607252257
label_type:
  ham_label: -1
  spam_label: 1
classifier:
  type: none
  training_parameters: {}
  testing_parameters: {}
attack:
  type: none
  parameters: {}
```
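Each varied `key` in the experiment config is a path into this nested structure. A hedged sketch of how one value could be written into a copy of the default spec (the `set_in` helper is hypothetical, not existing repo code):

```python
import copy

# The default spec above, as a nested dict
default_spec = {
    'dataset_filename': 'trec2007-1607252257',
    'label_type': {'ham_label': -1, 'spam_label': 1},
    'classifier': {'type': 'none', 'training_parameters': {},
                   'testing_parameters': {}},
    'attack': {'type': 'none', 'parameters': {}},
}

def set_in(spec, key_path, value):
    """Set `value` at the nested `key_path`, e.g. [attack, type]."""
    node = spec
    for key in key_path[:-1]:
        node = node.setdefault(key, {})
    node[key_path[-1]] = value

# Build one experiment by overriding the varied keys
experiment = copy.deepcopy(default_spec)
set_in(experiment, ['classifier', 'type'], 'adaline')
set_in(experiment, ['attack', 'parameters', 'percentage_samples_poisoned'], .1)
```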
- Decide how many threads to run your code on. Given the size of the dataset in memory, allocate at least double that amount of RAM per worker: for example, if you run on 8 cores, make sure you have 16 GB of RAM, otherwise you will get a `MemoryError`.
- Run the pipeline, here with 4 threads for example:
```shell
python3 main.py ~/path/to/experiment/config.yaml ~/folder/where/dataset/is/ ~/folder/to/save/results/to/ 4
```
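A thread count passed like this is typically handed to a worker pool that maps over the experiment grid. A minimal sketch of that pattern, using a thread-based pool (the `run_experiment` function and experiment tuples are illustrative, not the actual `main.py` API):

```python
from multiprocessing.dummy import Pool  # thread-based pool

def run_experiment(params):
    # Placeholder for one (classifier, % poisoned) experiment run
    classifier, poison_fraction = params
    return (classifier, poison_fraction, 'done')

experiments = [('adaline', .0), ('adaline', .1), ('logistic regression', .0)]

n_threads = 4  # the last command-line argument above
with Pool(n_threads) as pool:
    results = pool.map(run_experiment, experiments)
```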
This repo includes code for:

- feature extraction from spam datasets:
  - from the TREC 2007 dataset: `extract.py`. The features for an email are the (binary) presence or absence of a token (a word).
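The binary token features described above can be sketched as follows (the vocabulary and whitespace tokenisation are illustrative, not the actual `extract.py` implementation):

```python
import numpy as np

def binary_features(email_text, vocabulary):
    """1 if the token appears anywhere in the email, 0 otherwise."""
    tokens = set(email_text.lower().split())
    return np.array([1 if word in tokens else 0 for word in vocabulary])

vocabulary = ['cheap', 'meeting', 'viagra', 'tomorrow']  # toy vocabulary
x = binary_features('Cheap viagra cheap pills', vocabulary)
print(x)  # [1 0 1 0]: repeated tokens still map to a single 1
```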
- poisoning attacks on the training data:
  - dictionary attack: `dictionary.py`. All the features of the poisoned emails are set to one.
  - empty attack: `empty.py`. All the features of the poisoned emails are set to zero.
  - ham attack: `ham.py`. Contaminating emails contain features indicative of the ham class.
  - focussed attack: `focussed.py`
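The dictionary and empty attacks amount to overwriting the feature rows of the poisoned samples. A minimal NumPy sketch on a toy feature matrix (not the actual attack modules):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10, 5))  # 10 emails, 5 binary features

# Pick 20% of the samples to poison
poisoned = rng.choice(10, size=2, replace=False)

X_dict = X.copy()
X_dict[poisoned] = 1   # dictionary attack: every feature set to one

X_empty = X.copy()
X_empty[poisoned] = 0  # empty attack: every feature set to zero
```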
- training and testing of binary classification models:
  - ADALINE model: `adaline.py`. Like the better-known perceptron, a single-layer neural network that calculates a weighted sum of its inputs; the difference is that it trains on this weighted sum, and outputs the thresholded weighted sum.
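The distinction drawn above, training on the raw weighted sum while thresholding only at prediction time, can be sketched as a toy delta-rule implementation (not the repo's `adaline.py`):

```python
import numpy as np

def train_adaline(X, y, lr=0.01, epochs=50):
    """Delta rule: the update uses the raw weighted sum, not its sign."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            output = np.dot(w, x_i)         # raw weighted sum, no threshold
            w += lr * (y_i - output) * x_i  # train on the continuous error
    return w

def predict(w, X):
    return np.where(X @ w >= 0.0, 1, -1)    # threshold only at prediction

# Toy linearly separable data; labels in {-1, 1} as in default_spec.yaml
X = np.array([[1., 1.], [1., 2.], [-1., -1.], [-2., -1.]])
y = np.array([1, 1, -1, -1])
w = train_adaline(X, y)
print(predict(w, X))  # should recover the training labels
```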
A few ipython notebooks showcase the results and brief initial observations/interpretations:

- analysis of offline experiment batch 1608030254
- analysis of experiment batches 1608310218 and 1608302248, where I looked at the effect of varying the adaptive rate: online notebook
- analysis of varying attacker knowledge
TODO:

- implement attacker knowledge
- prepare batch test specs to find good learning rates depending on classifier and dataset
- implement different attacks in the adaptive experiment pipeline
- write extract functions for:
  - enron
  - MNIST
- test experiments on MNIST
- brainstorm attacks for the adaptive convex combination experiment
- implement regret measure
- implement how to store experiment files, probably grouped in batches
- assert all matrix shapes and types
- implement data loading from different filetypes (automatically detect npy, dat, csv, etc.)
- add tests
- ? optimise pipeline for experiments where the same dataset, same attacks, etc. are used (or is this not worth the time?) -> look into Makefile to manage dependencies between files
- profile code
- re-implement logging of intermediate results, but maybe only the first few characters, or statistics/info on the array (contains NaN, etc.); would need to see what is actually useful
- ? bit arrays
- ? explicitly free memory
- ? make ipython notebook on MI and feature selection for ham
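The filetype-detection loader from the list above could be sketched as extension-based dispatch (a hypothetical helper, not existing repo code):

```python
import os
import numpy as np

def load_dataset(path):
    """Pick a loader from the file extension; fail loudly otherwise."""
    ext = os.path.splitext(path)[1].lower()
    if ext == '.npy':
        return np.load(path)
    if ext == '.csv':
        return np.loadtxt(path, delimiter=',')
    if ext == '.dat':
        return np.fromfile(path)  # assumes raw float64, shape unknown
    raise ValueError('unsupported dataset filetype: %s' % ext)
```

Sniffing the file's magic bytes instead of trusting the extension would be more robust, at the cost of a little extra code.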