Authors: Caterina Bruchi, Davide Bruni
Academic Year: 2022-2023
Institution: University of Pisa
Course: Master's Degree in Artificial Intelligence and Data Engineering
Exam: Data Mining and Machine Learning
Welcome to the Deep Fake Tweet Detection project repository. This project is highly inspired by the work "TweepFake: About detecting deepfake tweets" and explores the detection of tweets written by humans versus deepfake tweets generated by various AI models. The project involves data pre-processing, feature selection, model training, and result comparison.
Note that, since the project was developed for the Data Mining and Machine Learning exam, neural networks could not be used.
- Abstract
- Data Pre-Processing
- Feature Selection
- Model Training and Evaluation
- Data Stream and Incremental Learning
- Results
- Application
- License
Social media plays a crucial role in shaping public opinion. However, it can be manipulated through deepfake content, including tweets. This project aims to detect deepfake tweets by distinguishing between tweets written by real users and those generated by bots. We extend the previous work by incorporating more recent tweets and comparing our results with the original findings.
We obtained the dataset from the authors of "TweepFake: About detecting deepfake tweets". The dataset includes features such as user id, status id, tweet text, and account type (human, gpt-2, rnn, others). Our pre-processing steps involved:
- Removing duplicates
- Handling deprecated Unicode characters
- Transforming the text by removing tags, hashtags, and URLs
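The text-cleaning step above can be sketched with a few regular expressions. This is an illustrative implementation, not the project's actual pre-processing code; the exact patterns used in the repository may differ.

```python
import re

def clean_tweet(text: str) -> str:
    """Remove mentions (tags), hashtags, and URLs from a tweet, then tidy whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # strip URLs
    text = re.sub(r"@\w+", "", text)          # strip user tags (mentions)
    text = re.sub(r"#\w+", "", text)          # strip hashtags
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(clean_tweet("Check this out https://t.co/abc #AI @user hello"))
# → "Check this out hello"
```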
We used the Bag-of-Words (BoW) technique for text representation. The feature selection process involved setting thresholds for minimum and maximum document frequency to filter out less informative words. After applying these thresholds, we reduced the number of features to 1412.
We experimented with several models, including Logistic Regression, Support Vector Classifier (SVC), Random Forest, Multinomial Naive Bayes, and Adaboost. We found that simpler models like Logistic Regression provided competitive results.
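A model comparison along these lines can be set up with cross-validation, as sketched below. The synthetic count matrix stands in for the real Bag-of-Words features (`MultinomialNB` needs non-negative counts), and the default hyperparameters are assumptions, not the project's tuned settings.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 20))  # toy BoW count features
y = rng.integers(0, 2, size=100)        # toy labels: 0 = human, 1 = bot

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Multinomial NB": MultinomialNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

On the random toy data all models hover near chance; on the real features, the project found Logistic Regression to be competitive with the more complex models.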
To handle new tweets and detect potential concept drifts, we used the Twint tool to scrape recent tweets. We then applied incremental learning techniques, such as retraining the classifier with new data chunks each month. This approach ensured our model remained up-to-date with the latest tweet patterns.
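The chunk-by-chunk update loop can be sketched with scikit-learn's `partial_fit` on a linear model, shown here with synthetic monthly batches. This is one possible realization of the incremental scheme described above, not the repository's exact retraining code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # 0 = human, 1 = bot; must be declared up front

# Toy stand-ins for three months of newly scraped, vectorized tweets.
monthly_chunks = [
    (rng.random((50, 10)), rng.integers(0, 2, size=50)) for _ in range(3)
]

clf = SGDClassifier(random_state=0)
for X_month, y_month in monthly_chunks:
    # Update the model in place with the latest chunk instead of
    # retraining from scratch, so it tracks drifting tweet patterns.
    clf.partial_fit(X_month, y_month, classes=classes)

preds = clf.predict(monthly_chunks[-1][0])
```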
We compared our results with the original paper's findings and achieved similar accuracy levels. Detailed results and confusion matrices for each classifier are available in the documentation.
We developed an application to classify new tweets based on our trained models.
This project is licensed under the MIT License. See the LICENSE file for details.
Note:
To comply with X's (Twitter) privacy policy:
- The original dataset containing the tweets is not included in the repository.
- All outputs showing tweets' text were removed.
For detailed documentation, refer to the provided PDF in the repository.