In this paper and the attached notebooks we inspect several methods and models in order to build a classifier that determines whether a chat conversation is malicious in nature. Our goal was to create a highly accurate model that can "read" a full chat conversation and detect whether one of the participants is a sexual predator. We based the solution on PAN-12, a well-known predator conversation dataset, and inspected different models such as BERT transformers, SVM, deep neural networks and ensemble models. In the end we settled on a two-stage classification: the first stage runs a voting classifier on-premise, and the second stage sends all conversations classified as benign to the cloud for a second inspection with RoBERTa.
Furthermore, we inspected strategies for watermarking the model so that it can be deployed on mobile devices and its ownership can be proven if it is stolen. We then implemented a recent paper that proposes robust watermarks based on inverse document frequency.
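The two-stage classification described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the names `majority_vote`, `two_stage_predict` and `contains` are hypothetical, hard (majority) voting is an assumption about how the voting classifier combines its members, and the toy token checks only stand in for the real on-premise models and the cloud RoBERTa model.

```python
from typing import Callable, List

# A classifier maps a full conversation (a list of messages) to True when it
# flags the conversation. Every classifier here is a hypothetical stand-in.
Classifier = Callable[[List[str]], bool]


def majority_vote(voters: List[Classifier]) -> Classifier:
    """Combine several classifiers by hard voting (assumed voting scheme)."""
    def vote(conversation: List[str]) -> bool:
        flags = sum(1 for clf in voters if clf(conversation))
        return flags * 2 > len(voters)  # strict majority of the voters
    return vote


def two_stage_predict(conversation: List[str],
                      local_stage: Classifier,
                      cloud_stage: Classifier) -> bool:
    """Stage 1 runs on-premise; only benign results go to stage 2."""
    if local_stage(conversation):
        return True  # flagged locally, so no cloud call is made
    # Classified as benign locally: send to the cloud for a second inspection.
    return cloud_stage(conversation)


def contains(token: str) -> Classifier:
    """Toy classifier that flags a conversation containing `token`."""
    return lambda conversation: any(token in msg for msg in conversation)


# Stand-ins: three toy voters on-premise, one toy "cloud" model.
local = majority_vote([contains("a"), contains("b"), contains("c")])
cloud = contains("z")

flagged_locally = two_stage_predict(["a b"], local, cloud)
flagged_by_cloud = two_stage_predict(["a z"], local, cloud)
benign_everywhere = two_stage_predict(["q"], local, cloud)
```

In this arrangement the cloud model can only add detections to those of the first stage, so the second inspection targets conversations the on-premise stage may have missed.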
Data:
- Perverted Justice Dataset - a "labeled" dataset of 56 chat conversations with predators from the Perverted Justice website, added to the repo as a zip file.
- PAN12 Dataset - to obtain this dataset, visit https://zenodo.org/record/3713280#.YrwsLahRWUk and submit a request for access.