
Topic modeling on large data (1.85M tweets written in Spanish, ~1M "Spain geolocated", about 'coronavirus', from 2019 to 2020-04-20). Forked from ShuaiW/twitter-analysis (adapted for Python 3 and to use a discriminative score), mainly for Twitter LDA (Latent Dirichlet Allocation using Gibbs sampling, https://lda.readthedocs.io/).


mmaguero/twitter-analysis

 
 


Twitter analysis

virtualenv

First create a virtual environment in the root dir by running:

python3 -m venv venv

then activate the virtual env with

source venv/bin/activate

(to get out of the virtualenv, run deactivate)

Dependencies

Install all the dependencies with

pip install -r requirements.txt

Also make sure to download the NLTK corpora by running these lines in a Python interpreter:

import nltk
nltk.download()

Then download the spaCy model:

python -m spacy download es_core_news_sm

and the custom spaCy lemmatizer files:

python -m spacy_spanish_lemmatizer download wiki

(for language detection go to this repo)

Credentials

Rename sample_credentials.json to credentials.json, and fill in the four credentials from your twitter app.
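As a minimal sketch of how a script could load this file and fail early when a value is missing: the helper name load_credentials and the four key names below are assumptions for illustration, so check sample_credentials.json for the names the scripts actually expect.

```python
import json

# Hypothetical key names: check sample_credentials.json for the real ones.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(path="credentials.json"):
    """Load the credentials file and raise if any value is missing or empty."""
    with open(path, encoding="utf-8") as fh:
        creds = json.load(fh)
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError(f"Missing credentials: {', '.join(missing)}")
    return creds
```

Calling load_credentials() at startup surfaces an incomplete file before any API request is made.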

Real-time twitter trend discovery

(Not tested in this fork) Run

bokeh serve --show real-time-twitter-trend-discovery.py --args <tw> <top_n_words> <*save_history>

Here <tw> is the time window within which tweets are treated as one batch, <top_n_words> is the number of words with the highest idf scores to show, and <*save_history> is an optional boolean indicating whether to dump the history. Make sure the API credentials are properly stored in the credentials.json file.
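To illustrate what "words with the highest idf scores" means for one batch, here is a small sketch using a common idf formulation, log(N / df). The function name top_idf_words, the whitespace tokenization, and the alphabetical tie-breaking are assumptions; the script itself may tokenize and score differently.

```python
import math
from collections import Counter


def top_idf_words(tweets, top_n):
    """Return the top_n (word, idf) pairs for one batch of tweets.

    Uses idf(w) = log(N / df(w)), where N is the batch size and df(w) is the
    number of tweets containing w. Ties are broken alphabetically.
    """
    doc_freq = Counter()
    for tweet in tweets:
        # Naive whitespace tokenization; count each word once per tweet.
        doc_freq.update(set(tweet.lower().split()))
    n_docs = len(tweets)
    scores = {word: math.log(n_docs / df) for word, df in doc_freq.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

Words that appear in every tweet of the batch get a score of 0, while words that appear in a single tweet get the highest score.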

Topic modeling and t-SNE visualization: 20 Newsgroups

(Not tested in this fork) To train a topic model and visualize the news in 2-D space, run

python topic_20news.py --n_topics <n_topics> --n_iter <n_iter> --top_n <top_n> --threshold <threshold>

where <n_topics> is the number of topics (default 20), <n_iter> is the number of iterations for training the LDA model (default 500), <top_n> is the number of top keywords to display (default 5), and <threshold> is the threshold probability for topic assignment (default 0.0).
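The sketch below shows one plausible reading of how <threshold> and <top_n> act on the output of a trained LDA model. It works on plain lists; with the lda package, the inputs would correspond to the model's doc_topic_ and topic_word_ arrays. The function names, and the choice of ">=" for the threshold comparison, are assumptions rather than the script's confirmed behavior.

```python
def assign_topics(doc_topic, threshold=0.0):
    """Return the most probable topic per document, or None when that
    probability falls below the threshold."""
    assignments = []
    for probs in doc_topic:
        best = max(range(len(probs)), key=probs.__getitem__)
        assignments.append(best if probs[best] >= threshold else None)
    return assignments


def top_keywords(topic_word, vocab, top_n=5):
    """Return the top_n highest-weight vocabulary words for each topic."""
    keywords = []
    for weights in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        keywords.append([vocab[i] for i in ranked[:top_n]])
    return keywords
```

With the default threshold of 0.0 every document keeps its most probable topic; raising it leaves ambiguous documents unassigned.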

Scrape tweets and save them to disk

(Not tested in this fork) To scrape tweets and save them to disk for later use, run

python scrape_tweets.py

If the script is interrupted, re-run the same command to resume collecting new tweets. The script collects ~1,000 English tweets per minute, or roughly 1.5 million per day.

Make sure API credentials are properly stored in the credentials.json file.

Topic modeling and t-SNE visualization: tweets

First make sure you have accumulated some tweets (in this fork, we prefer https://github.com/Jefferson-Henrique/GetOldTweets-python, saving the results in CSV format), then run

python topic_tweets.py --raw_tweet_dir <raw_tweet_dir> --num_train_tweet <num_train_tweet> --n_topics <n_topics> --n_iter <n_iter> --top_n <top_n> --threshold <threshold> --num_example <num_example> --start_date <start_date> --end_date <end_date> --scope <scope> --lang <lang> --eval_n_topics <eval_n_topics>

where <raw_tweet_dir> is a folder containing raw tweet files, <num_train_tweet> is the number of tweets used for training the LDA model, <n_topics> is the number of topics (default 20), <n_iter> is the number of iterations for training the LDA model (default 1500), <top_n> is the number of top keywords to display (default 8), <threshold> is the threshold probability for topic assignment (default 0.0), and <num_example> is the number of tweets to show on the plot (default 5000). The same parameters apply to topic_profiles.py.

Extra parameters for topic_tweets.py: <start_date> and <end_date> filter the data by date; <scope> merges the data with a CSV file of Spain users (default SPA); <lang> filters by language (es [stable], es_gn and gn [pre-alpha]); and <eval_n_topics> is used if you want to evaluate the optimal number of topics.
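For reference, the documented options and defaults can be mirrored in an argparse parser like the sketch below. This is an illustration, not the script's actual parser: the function name build_parser, the argument types, and the date format are assumptions, and <eval_n_topics> is left out because its type is not documented here.

```python
import argparse


def build_parser():
    """Sketch of a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="LDA topic modeling on tweets")
    parser.add_argument("--raw_tweet_dir", help="folder containing raw tweet files")
    parser.add_argument("--num_train_tweet", type=int, help="tweets used for training")
    parser.add_argument("--n_topics", type=int, default=20)
    parser.add_argument("--n_iter", type=int, default=1500)
    parser.add_argument("--top_n", type=int, default=8)
    parser.add_argument("--threshold", type=float, default=0.0)
    parser.add_argument("--num_example", type=int, default=5000)
    # The date format is an assumption (ISO, e.g. 2020-04-20).
    parser.add_argument("--start_date")
    parser.add_argument("--end_date")
    parser.add_argument("--scope", default="SPA")
    parser.add_argument("--lang", choices=["es", "es_gn", "gn"])
    return parser
```

For example, build_parser().parse_args(["--raw_tweet_dir", "data", "--n_topics", "10"]) overrides the topic count and keeps every other documented default.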

Data input

Four .csv files:

  1. tweets file, with columns: 'tweet_id','tweet','date','user_id'
  2. detected-language file, with columns: 'tweet_id','lang'
  3. user file for a particular location (Spain in our case), with column: 'id_str' (merged with 'user_id')
  4. an extra file to check locations.
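The first three files can be combined roughly as in the sketch below, which uses only the standard library and the column names listed above. The function names read_rows and filter_tweets are hypothetical, and the code assumes that 'date' starts with an ISO date (YYYY-MM-DD); the actual scripts may merge and filter differently.

```python
import csv
from datetime import date


def read_rows(handle):
    """Read an open CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def filter_tweets(tweets, langs, users, lang="es", start=None, end=None):
    """Keep tweets in the given language, written by listed users, within
    the inclusive [start, end] date range."""
    lang_by_id = {row["tweet_id"]: row["lang"] for row in langs}
    user_ids = {row["id_str"] for row in users}
    kept = []
    for row in tweets:
        # Assumes 'date' starts with an ISO date (YYYY-MM-DD).
        day = date.fromisoformat(row["date"][:10])
        if lang_by_id.get(row["tweet_id"]) != lang:
            continue
        if row["user_id"] not in user_ids:
            continue
        if (start and day < start) or (end and day > end):
            continue
        kept.append(row)
    return kept
```

Each file would be opened with open(path, newline="", encoding="utf-8") and passed to read_rows before calling filter_tweets.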

Corpus

For reproducibility, tweet_ids and dates are available here.

How do I cite this work?

Please cite the paper "Discovering topics in Twitter about the COVID-19 outbreak in Spain":

@article{PLN6333,
	author = {Marvin M. Agüero-Torales and David Vilares and Antonio G. López-Herrera},
	title = {Discovering topics in Twitter about the COVID-19 outbreak in Spain},
	journal = {Procesamiento del Lenguaje Natural},
	volume = {66},
	number = {0},
	year = {2021},
	keywords = {COVID-19, Twitter, social networks, topic modeling},
	abstract = {In this work, we apply topic modeling to study what users have been discussing in Twitter during the beginning of the COVID-19 pandemic. More particularly, we explore the period of time that includes three differentiated phases of the COVID-19 crisis in Spain: the pre-crisis time, the outbreak, and the beginning of the lockdown. To do so, we first collect a large corpus of Spanish tweets and clean them. Then, we cluster the tweets into topics using a Latent Dirichlet Allocation model, and define generative and discriminative routes to later extract the most relevant keywords and sentences for each topic. Finally, we provide an exhaustive qualitative analysis about how such topics correspond to the situation in Spain at different stages of the crisis.},
	issn = {1989-7553},
	url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6333},
	pages = {177--190}
}

References

  1. https://github.com/lda-project/lda
  2. https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
  3. https://datascience.blog.wzb.eu/2017/11/09/topic-modeling-evaluation-in-python-with-tmtoolkit/
  4. https://github.com/WZBSocialScienceCenter/tmtoolkit
  5. https://github.com/starry9t/TopicLabel
  6. https://towardsdatascience.com/%EF%B8%8F-topic-modelling-going-beyond-token-outputs-5b48df212e06
