Bayesian Spam Filter for Separating Spam and Ham
This script builds two dictionaries from the two provided learning datasets, “learning_ham” and “learning_spam”. It then writes two files, “outputHam.txt” and “outputSpam.txt”, which contain the total number of words matched for all datasets, a list of all the words, their frequencies, P(word|spam or ham), and P(spam or ham|word).
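As a rough illustration of how the per-word values listed above could be computed, here is a minimal sketch. It is not the script's actual code: the function name word_stats, the dictionary keys, and the assumption of equal priors (P(spam) = P(ham) = 0.5) are all hypothetical choices made for this example.

```python
from collections import Counter


def word_stats(spam_words, ham_words):
    """Return per-word counts and probabilities for two lists of words.

    Assumes equal priors P(spam) = P(ham) = 0.5 (an assumption of this
    sketch), so Bayes' rule reduces to
    P(spam|word) = P(word|spam) / (P(word|spam) + P(word|ham)).
    """
    spam_counts = Counter(spam_words)
    ham_counts = Counter(ham_words)
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())

    stats = {}
    for word in set(spam_counts) | set(ham_counts):
        # Relative frequency of the word within each dataset.
        p_w_spam = spam_counts[word] / spam_total if spam_total else 0.0
        p_w_ham = ham_counts[word] / ham_total if ham_total else 0.0
        denom = p_w_spam + p_w_ham
        stats[word] = {
            "count_spam": spam_counts[word],
            "count_ham": ham_counts[word],
            "p_word_given_spam": p_w_spam,
            "p_word_given_ham": p_w_ham,
            "p_spam_given_word": p_w_spam / denom if denom else 0.0,
        }
    return stats


if __name__ == "__main__":
    # Tiny hand-made word lists standing in for the real datasets.
    demo = word_stats(["buy", "cheap", "buy", "now"],
                      ["meeting", "now", "now", "notes"])
    for word in sorted(demo):
        print(word, demo[word])
```

P(ham|word) follows as 1 - P(spam|word) under the same assumption. Reading and tokenizing the dataset files is left out of the sketch.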
To use this script, ensure that the learning datasets are in the current directory, then on the command line type: python
This will produce the two files stated above.
In addition to learning from the two provided datasets, the program includes a function that determines whether the email messages in a folder are spam. Set the variable fileTest in the program to the path of the test folder, and the program will list each file in that folder as either SPAM or HAM, depending on the set confidence level.
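The sketch below shows one common way a message could be labeled against a confidence level: combine the per-word P(spam|word) values in log space under the naive independence assumption, then compare the result to a threshold. The function names, the 0.01/0.99 clamping, the default confidence of 0.9, and the choice to skip unknown words are assumptions of this example, not details taken from the script.

```python
import math


def spam_probability(words, p_spam_given_word):
    """Combine per-word P(spam|word) values into one message-level score.

    Words missing from the table are skipped. Probabilities are clamped to
    [0.01, 0.99] so that a single word cannot force the result to 0 or 1.
    """
    log_spam = 0.0
    log_ham = 0.0
    for word in words:
        p = p_spam_given_word.get(word)
        if p is None:
            continue
        p = min(max(p, 0.01), 0.99)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)

    diff = log_ham - log_spam
    if diff > 700:
        # exp() would overflow here; the probability is effectively zero.
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


def classify(words, p_spam_given_word, confidence=0.9):
    """Return "SPAM" if the combined probability reaches the confidence level."""
    score = spam_probability(words, p_spam_given_word)
    return "SPAM" if score >= confidence else "HAM"


if __name__ == "__main__":
    # Hypothetical lookup table of P(spam|word) values.
    table = {"buy": 0.95, "cheap": 0.9, "meeting": 0.05}
    print(classify(["buy", "cheap", "today"], table))
    print(classify(["meeting", "today"], table))
```

In this sketch, a message with no known words scores 0.5 and is therefore labeled HAM at any confidence level above 0.5. Raising the confidence level reduces false SPAM labels at the cost of letting more spam through.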