Preprocessing the Dataset
We used the Part00000 file of January 2020 for our analysis. Preprocessing involved several steps to ensure the tweets were properly formatted and grouped for analysis. First, the raw tweet data, stored in JSON format, was parsed, and essential fields such as the tweet ID, user ID, and text were extracted. A custom function then cleaned each text by removing URLs, retweet indicators, mentions, hashtags, colons, and extra whitespace, producing a cleaned version of each tweet. Tweets were categorized by type (original tweets, retweets, quotes, or replies) using attributes such as retweeted_status, quoted_status, and in_reply_to_status_id_str. To maintain logical grouping, tweets were organized by their parent tweet ID into a dictionary, so that retweets, quotes, and replies remained connected to their original parent tweet. Finally, the grouped tweets were flattened into a sorted list, with parent tweets placed first, and written to a CSV file with columns for tweet ID, user ID, tweet type, and cleaned text. The two sketches below illustrate these steps.
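A minimal sketch of the parsing, cleaning, and categorization steps is given below. The function names (clean_text, tweet_type, parse_file) and the exact regular expressions are illustrative assumptions for exposition; the actual pipeline may differ in detail.

    import json
    import re

    def clean_text(text):
        # Illustrative cleaning rules matching the description above.
        text = re.sub(r"http\S+", "", text)       # remove URLs
        text = re.sub(r"\bRT\b", "", text)        # remove retweet indicators
        text = re.sub(r"[@#]\w+", "", text)       # remove mentions and hashtags
        text = text.replace(":", "")              # remove colons
        return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

    def tweet_type(tweet):
        # Categorize a parsed tweet using the attributes named above.
        if "retweeted_status" in tweet:
            return "retweet"
        if "quoted_status" in tweet:
            return "quote"
        if tweet.get("in_reply_to_status_id_str"):
            return "reply"
        return "original"

    def parse_file(path):
        # Assumes one JSON object per line, as in raw Twitter dumps.
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)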
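The grouping and output steps could then be sketched as follows, reusing clean_text and tweet_type from the sketch above. The parent_id helper, the sort criterion, and the output layout are again illustrative assumptions rather than the exact production code.

    import csv
    from collections import defaultdict

    def parent_id(tweet):
        # The parent is the original tweet a retweet/quote/reply points to;
        # an original tweet is its own parent. (Illustrative helper.)
        if "retweeted_status" in tweet:
            return tweet["retweeted_status"]["id_str"]
        if "quoted_status" in tweet:
            return tweet["quoted_status"]["id_str"]
        if tweet.get("in_reply_to_status_id_str"):
            return tweet["in_reply_to_status_id_str"]
        return tweet["id_str"]

    def write_grouped(tweets, out_path):
        groups = defaultdict(list)
        for t in tweets:
            groups[parent_id(t)].append(t)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["tweet_id", "user_id", "tweet_type", "cleaned_text"])
            for pid, group in groups.items():
                # Parent tweet first, then its retweets/quotes/replies.
                group.sort(key=lambda t: t["id_str"] != pid)
                for t in group:
                    writer.writerow([t["id_str"], t["user"]["id_str"],
                                     tweet_type(t), clean_text(t["text"])])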
Sentiment Analysis
We use a BERT-based multilingual model for sentiment analysis, specifically nlptown/bert-base-multilingual-uncased-sentiment, a pre-trained transformer model from Hugging Face. The model supports multiple languages, including French, which makes it suitable for the French tweets analyzed here. It predicts the sentiment of a given text on a five-point scale (1 to 5 stars), where the ratings range from "Very Negative" (1 star) to "Very Positive" (5 stars) and are mapped to human-readable sentiment labels for better interpretability.
The process begins by loading the pre-trained BERT model and its associated tokenizer. The tokenizer converts raw text into tokenized inputs compatible with the model, truncating text longer than 512 tokens to meet the model's input size limit. The analyze_sentiment function handles sentiment prediction. For each input text it performs four steps: (1) tokenization, converting the text into input tensors for the model; (2) prediction, in which the model outputs raw scores (logits) for each sentiment label; (3) softmax normalization, turning the logits into probabilities that sum to 1; and (4) label mapping, assigning the probabilities to the five sentiment labels ("1 star" to "5 stars"), with the highest probability determining the predicted sentiment. A sketch of this function is given below.
The input dataset of pre-cleaned tweets is loaded as a Pandas DataFrame, and each tweet is processed individually to predict its sentiment. The results are stored in a new column, sentiment, which contains the human-readable sentiment labels ("Very Negative" to "Very Positive"). Finally, the updated dataset, including the sentiment analysis results, is saved as a CSV file for further use (see the second sketch below). This workflow quantifies the sentiment of multilingual text, leveraging BERT's language understanding to assign context-aware sentiment labels to the tweets.
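The sentiment prediction can be sketched as follows with the Hugging Face transformers library. The label list and the structure of analyze_sentiment follow the description above; the exact variable names are illustrative.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "nlptown/bert-base-multilingual-uncased-sentiment"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    # Map the model's five star ratings to human-readable labels.
    LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]

    def analyze_sentiment(text):
        # Tokenize, truncating to BERT's 512-token input limit.
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits       # raw scores per star rating
        probs = torch.softmax(logits, dim=-1)     # normalize to probabilities
        star = int(torch.argmax(probs))           # index 0..4 = 1..5 stars
        return LABELS[star]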
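Applying the function over the dataset is then a straightforward Pandas pass, as in the sketch below. The file names and the column names cleaned_text and sentiment are illustrative assumptions, not necessarily those of the actual pipeline.

    import pandas as pd

    # analyze_sentiment as defined in the sketch above.
    df = pd.read_csv("cleaned_tweets.csv")
    df["sentiment"] = df["cleaned_text"].astype(str).apply(analyze_sentiment)
    df.to_csv("tweets_with_sentiment.csv", index=False)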