---
title: 'Worksheet 7: Topic Modeling'
author: 'Candy Boatwright'
date: 'April 10, 2024'
---
_This is the seventh in a series of worksheets for History 8510 at Clemson University. The goal of these worksheets is simple: practice, practice, practice. The worksheet introduces concepts and techniques and includes prompts for you to practice in this interactive document. When you are finished, you should change the author name (above), knit your document, and upload it to canvas. Don't forget to commit your changes as you go and push to github when you finish the worksheet._
Text analysis is an umbrella for a number of different methodologies. Generally speaking, it involves taking a set (or corpus) of textual sources, turning them into data that a computer can understand, and then running calculations and algorithms using that data. Typically, at its most basic level, that involves the counting of words.
Topic modeling (TM) is one type of text analysis that is particularly useful for historians.
TM takes collections (or corpora) of documents and returns groups of "topics" from those documents. It is a form of unsupervised classification that finds groups of items that are probabilistically likely to co-occur.
Latent Dirichlet allocation (LDA) is the most popular algorithm or method for topic modeling, although there are others. It assumes that each document has a mixture of topics and that each topic is a mixture of words. That means that topics overlap each other in terms of content rather than being confined to distinct and singular groups.
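As a toy illustration of those two assumptions (the numbers below are made up, not from any real corpus): a document's mixture of topics and a topic's mixture of words are each probability distributions, so each sums to 1.

```{r}
# Hypothetical numbers: one document's mixture of topics, and one topic's
# mixture of words. Each is a probability distribution summing to 1.
doc_mixture  <- c(topic1 = 0.50, topic2 = 0.25, topic3 = 0.25)
topic1_words <- c(city = 0.500, street = 0.250, water = 0.125, other = 0.125)
sum(doc_mixture)   # 1
sum(topic1_words)  # 1
```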
To prepare a corpus for topic modeling, we'll do many of the same types of operations that we used last week to prepare a corpus for analysis. First we'll pre-process the data and then we'll create a document term matrix from our corpus using the `tm` (text mining) package.
```{r}
library(tidytext)
library(tidyverse)
library(readtext)
library(tm)
library(topicmodels)
```
```{r}
download.file("https://github.com/regan008/8510-TextAnalysisData/blob/main/TheAmericanCity.zip?raw=true", "AmCity.zip")
unzip("AmCity.zip")
```
```{r}
# Metadata that includes info about each issue.
metadata <- read.csv("https://raw.githubusercontent.com/regan008/8510-TextAnalysisData/main/AmCityMetadata.csv")
meta <- as.data.frame(metadata)
ac_texts <- readtext(paste("TheAmericanCity/", "*.txt", sep=""))
ac_whole <- full_join(meta, ac_texts, by = c("filename" = "doc_id")) %>% as_tibble()
tidy_ac <- ac_whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words)
tidy_ac <- tidy_ac %>% filter(!grepl('[0-9]', word))
```
The above code borrows from what we did last week. It pulls in the texts from the _The American City_ corpus, joins them together into a single data frame, uses `unnest_tokens()` to tokenize the text, and, finally, removes stop words.
For topic modeling, we need a Document Term Matrix, or DTM. A DTM has the documents running down one side and the terms across the top. `Tidytext` provides functions for converting to and from DTMs. First, we need to create a data frame that has the doc_id, the word, and the number of times that word occurs. We can do that using `count()`.
```{r}
tidy_ac_words <- tidy_ac %>% count(filename, word)
head(tidy_ac_words)
```
Now we can use `cast_dtm()` to turn `tidy_ac_words` into a dtm.
```{r}
ac.dtm <- tidy_ac_words %>%
  cast_dtm(filename, word, n)
```
If you run `class(ac.dtm)` in your console you will notice that it now has a class of "DocumentTermMatrix".
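You can also inspect the DTM right in the document (this assumes the chunk above has been run):

```{r}
class(ac.dtm)  # "DocumentTermMatrix" "simple_triplet_matrix"
dim(ac.dtm)    # number of documents, number of terms
```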
Now that we have a dtm, we can create a topic model. For this, we'll use the topic models package and the `LDA()` function. Take a minute and read the documentation for `LDA()`.
There are two important options when running `LDA()`. The first is `k`, the number of topics you want the model to generate. How many topics to generate is a decision that often takes some experimentation and depends on the size of your corpus. The American City corpus isn't that big but still has over 209k words. In this instance, because the corpus is so small, we're going to start with a small number of topics. Going above 5 causes errors with this particular corpus. Later, when you work with a different corpus, you should experiment with changing the number of topics from 10 to 20 to 30 to 50 to see how it changes your model.
The second important option when running `LDA()` is the seed. You don't need to worry too much about what setting the seed does; put simply, it ensures the output of the model is predictable and reproducible. Using a seed ensures that if you come back to your code later, or someone else tries to run it, the model will return exactly the same results.
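The effect of a seed is easy to see in miniature with base R's `sample()` (this is just an illustration, not part of the topic-modeling pipeline):

```{r}
# With the same seed, "random" draws come out identical every time.
set.seed(12345)
a <- sample(1:100, 5)
set.seed(12345)
b <- sample(1:100, 5)
identical(a, b)  # TRUE
```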
Let's now train our model. This will take a few minutes:
```{r}
ac.lda <- LDA(ac.dtm, k = 5, control = list(seed = 12345))
ac.lda
```
Now we have an LDA topic model with 5 topics. There are two ways to look at this model: word-topic probabilities and document-topic probabilities.
Let's start with **word-topic probabilities.**
Every topic is made up of words that are most associated with that topic. Together these words typically form some sort of theme. To understand what this looks like the easiest thing to do is create a bar chart of the top terms in a topic.
```{r}
ac.topics <- tidy(ac.lda, matrix = "beta")
head(ac.topics)
```
What we have here is a list of topics and the weight of each term in that topic. Essentially, we have turned this into a one-topic-per-term-per-row format. So, for example, the term 10th has a weight of 5.135047e-05 in topic 1 but 7.269700e-05 in topic 2. That doesn't mean a lot to us at the moment, and this format is impossible to grasp at its current size, but we can use tidyverse functions to pare it down and determine the terms that are most common within each topic.
```{r}
ac.top.terms <- ac.topics %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
ac.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
(@) Can you adjust the code above to show the top 10 words from just one topic?
```{r}
# Top 10 words from a single topic (topic 1). With only one topic we can
# drop the facet and use a plain reorder().
ac.top.terms.one <- ac.topics %>%
  filter(topic == 1) %>%
  arrange(desc(beta)) %>%
  slice(1:10)

ac.top.terms.one %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE)
```
Another useful way to look at the words in each topic is by visualizing them as a wordcloud.
```{r warning=FALSE}
library(wordcloud)
topic2 <- ac.topics %>% filter(topic == 2)
wordcloud(topic2$term, topic2$beta, max.words = 100, random.order = FALSE,
  rot.per = 0.3, colors = brewer.pal(6, "Dark2"))
```
Now we can see what words are most common in each topic. But the document-topic probabilities are also useful for understanding what topics are prevalent in what documents. Just as each topic is made up of a mixture of words, the LDA algorithm also assumes that each document is made up of a mixture of topics.
```{r}
ac.documents <- tidy(ac.lda, matrix = "gamma")
head(ac.documents)
```
For each document, the model gives us the estimated proportion of the document's words that come from each topic. So for the April 1915 issue it estimates that about 23% of the words are from topic 1. This gamma value represents the posterior topic distribution for each document.
This is easier to see if we filter to see the breakdown for just one document.
```{r}
ac.documents %>% filter(document == "1916_May.txt") %>% arrange(desc(gamma))
```
This gamma value is really useful, and we can use it to see which topics appear most in which documents. Because our documents are dated issues, this is frequently referred to as looking at topics over time.
We can do that using the ac.documents dataframe that we just created but it needs to be joined with the metadata. Again, this is why it is important to have a filename within the metadata spreadsheet. To join these two together we can do a full_join because we want to keep all of the columns.
```{r}
topics.by.year <- full_join(ac.documents, metadata, by = join_by(document == filename))
```
Now what we have is a data frame that includes the proportion of each topic in each document. Because this is a dataset about a periodical, we have values in our metadata that will make it easy to plot the distribution of a topic over time -- in this case for each edition of the journal.
```{r}
topics.by.year$issue_date <- paste(topics.by.year$month, " ", topics.by.year$year, sep = "")
ggplot(data=topics.by.year, aes(x=issue_date, y=gamma)) + geom_bar(stat="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
Using this we can tell that topic 5, which from earlier had the words improve, grow, ties, contracts, and gasoline as the top five words, is most prominent in January 1915.
(@) Use the rest of this worksheet to experiment with topic modeling. I've added the code to download a much larger dataset - the issues of Mind and Body. This corpus has 413 documents ranging from the 1890s to 1936. You'll want to start with at least 25 topics.
```{r}
#| eval: false
download.file("https://github.com/regan008/8510-TextAnalysisData/blob/main/MindAndBody.zip?raw=true", "MB.zip")
unzip("MB.zip")
file.rename("txt", "mbtxt")
```
```{r}
# Metadata that includes info about each issue.
mb.metadata <- read.csv("https://raw.githubusercontent.com/regan008/8510-TextAnalysisData/main/mb-metadata.csv")
mb.meta <- as.data.frame(mb.metadata)
mb.meta$Filename <- paste("MB_", mb.meta$Filename, sep = "")
head(mb.meta)
```
```{r}
mb.texts <- readtext(paste("mbtxt/", "*.txt", sep = ""))
mb.whole <- full_join(mb.meta, mb.texts, by = join_by(Filename == doc_id)) %>% as_tibble()
head(mb.whole)
```
```{r}
tidy.mb <- mb.whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words)
head(tidy.mb)
```
```{r}
tidy.mb.words <- tidy.mb %>% count(Filename, word)
head(tidy.mb.words)
```
```{r}
mb.dtm <- tidy.mb.words %>%
  cast_dtm(Filename, word, n)
mb.dtm
```
```{r}
mb.lda <- LDA(mb.dtm, k = 25, control = list(seed = 12345))
mb.lda
```
```{r}
mb.topics <- tidy(mb.lda, matrix = "beta")
head(mb.topics)
```
```{r}
mb.top.terms <- mb.topics %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
mb.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
(@) What happens if you create a custom stopword list? How does this change the model?
```{r}
mb.custom.tidy.list <- c("real", "mentioned", "vol", "ulm", "rath", "iii", "wis", "st", "mich", "ave", "cts")
mb.custom.tidy <- tidy.mb.words %>%
  filter(!word %in% mb.custom.tidy.list) %>%
  arrange(desc(n))
mb.dtm.2 <- mb.custom.tidy %>%
  cast_dtm(Filename, word, n)
mb.lda.2 <- LDA(mb.dtm.2, k = 25, control = list(seed = 12345))
mb.topics.2 <- tidy(mb.lda.2, matrix = "beta")
mb.top.terms.2 <- mb.topics.2 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
mb.top.terms.2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
>Just removing a few words that I felt wouldn't be terribly important or were nonsense words totally changed the topic model. Honestly, I feel that the original results were better before I removed words.
(@) Can you create a topic model for just the documents in the 1920s? How does that change the model?
```{r}
mb.1920 <- mb.whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words) %>%
filter(Year >= 1920 & Year <= 1929)
mb.dtm.3 <- mb.1920 %>%
count(Filename, word) %>%
cast_dtm(Filename, word, n)
mb.lda.3 <- LDA(mb.dtm.3, k = 5, control = list(seed = 12345))
mb.topics.3 <- tidy(mb.lda.3, matrix = "beta")
mb.top.terms.3 <- mb.topics.3 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
mb.top.terms.3 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
>I initially left the number of topics at 25, which was far too many for the new model. 5 seems to be better, although there is tremendous overlap in words across all 5 of the topics generated.
(@) Now, let's return to the Buffalo Bill data from last week. You should be able to use topic modeling to address two of the research questions provided:
* Can we detect some change over time in promotion language and reception language (marketing and reviewing)? Were there types of characters, scenarios, action promised in promotional material and/or noted in reviews earlier vs later?
* What can be gleaned from the items tagged as extraneous as far as topics? These are news items that are somehow related to BBWW. Crime, finances, celebrity, etc.
To analyze this you should first generate a topic model for the buffalo bill data. Play with the number of topics until you find a number that feels about right for the dataset. I am guessing it'll be in the 8-15 range but you'll need to play with it to see what number gives you the best fit.
```{r}
download.file("https://github.com/dseefeldt/IndianaNewspapers/raw/main/bb-txt.zip", "bb-txt.zip")
unzip("bb-txt.zip")
```
```{r}
# Metadata that includes info about each article.
bb.metadata <- read.csv("https://raw.githubusercontent.com/dseefeldt/IndianaNewspapers/main/metadata.csv")
#bb.tidy.meta <- bb.metadata %>% separate(Date, c("Month", "Day", "Year"))
data_dir <- paste(getwd(), "/txt", sep = "")
bb <- readtext(paste0(data_dir, "/*.txt"))
bb.whole <-
bb.metadata %>%
arrange(Publication) %>% # sort bb.metadata by publication
bind_cols(bb) %>% # combines with full texts from bb
as_tibble()
head(bb.whole)
```
```{r}
bb.tidy <- bb.whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words)
#filter(word != "bill" & bb.whole$text != "buffalo" & bb.whole$text != "bill's" & bb.whole$text != "cody" & bb.whole$text != "pg" & bb.whole$text != "wild" & bb.whole$text != "west") %>%
head(bb.tidy)
```
```{r}
tidy.bb.words <- bb.tidy %>% count(doc_id, word)
head(tidy.bb.words)
```
```{r}
bb.dtm <- tidy.bb.words %>%
  cast_dtm(doc_id, word, n)
bb.dtm
```
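Before settling on `k = 15`, one optional way to compare candidate numbers of topics is `perplexity()` from the `topicmodels` package (lower is generally better, though the interpretability of the resulting topics matters more than the raw score). This is only a sketch and can be slow, so it is not evaluated here:

```{r}
#| eval: false
# Sketch: refit the model for several candidate values of k and print the
# perplexity of each fit on the training DTM.
for (k in c(8, 10, 15)) {
  m <- LDA(bb.dtm, k = k, control = list(seed = 12345))
  print(c(k = k, perplexity = perplexity(m)))
}
```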
```{r}
bb.lda <- LDA(bb.dtm, k = 15, control = list(seed = 12345))
bb.lda
```
```{r}
bb.topics <- tidy(bb.lda, matrix = "beta")
head(bb.topics)
```
```{r}
bb.top.terms <- bb.topics %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
bb.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
```{r}
#add custom stop words to model
bb.custom.tidy.list <- c("buffalo", "bill", "bill's", "wild", "west", "cody", "pg")
bb.custom.tidy <- tidy.bb.words %>%
  filter(!word %in% bb.custom.tidy.list) %>%
  arrange(desc(n))
bb.dtm.2 <- bb.custom.tidy %>%
  cast_dtm(doc_id, word, n)
bb.lda.2 <- LDA(bb.dtm.2, k = 15, control = list(seed = 12345))
bb.topics.2 <- tidy(bb.lda.2, matrix = "beta")
bb.top.terms.2 <- bb.topics.2 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
bb.top.terms.2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
To address the first research question, you'll need to plot topics over time. I would create three models, one for all of the data, one for promotion, and one for reception. What do we learn by doing this?
```{r}
bb.documents <- tidy(bb.lda.2, matrix = "gamma")
#head(bb.documents)
bb.all.topics.by.year <- full_join(bb.documents, bb.metadata, by = join_by(document == Filename)) #%>%
#top_n(5)
head(bb.all.topics.by.year)
```
```{r}
#bb.all.topics.by.year$FullDate <- paste(bb.all.topics.by.year$month, " ", bb.all.topics.by.year$year, sep = "")
ggplot(data=bb.all.topics.by.year, aes(x=Date, y=gamma)) + geom_bar(stat ="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
```{r}
bb.p.topics.by.year <- bb.all.topics.by.year %>%
filter(R_P_E == "P")
ggplot(data=bb.p.topics.by.year, aes(x=Date, y=gamma)) + geom_bar(stat="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
```{r}
bb.r.topics.by.year <- bb.all.topics.by.year %>%
filter(R_P_E == "R")
ggplot(data=bb.r.topics.by.year, aes(x=Date, y=gamma)) + geom_bar(stat="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
For the second, a general topic model of the extraneous articles will be needed.
Add code blocks below as necessary.
```{r}
#topic model for entire corpus
bb.top.terms <- bb.topics.2 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
bb.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
```{r}
bb.tidy.join <- full_join(bb.custom.tidy, bb.metadata, by = c("doc_id" = "Filename")) %>% as_tibble()
head(bb.tidy.join)
```
```{r}
bb.beta.join <- full_join(bb.tidy.join, bb.topics.2, by = c("word" = "term"))
head(bb.beta.join)
```
```{r}
# Build a DTM from only the "Extraneous" articles, then fit a small model.
bb.e.dtm <- bb.tidy.join %>%
  filter(R_P_E == "E") %>%
  cast_dtm(doc_id, word, n)
bb.e.lda <- LDA(bb.e.dtm, k = 5, control = list(seed = 12345))
bb.e.lda
bb.e.topics <- tidy(bb.e.lda, matrix = "beta")
bb.top.e.terms <- bb.e.topics %>%
  arrange(desc(beta)) %>%
  group_by(topic) %>% slice(1:5)
bb.top.e.terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
```{r}
library(wordcloud)
topic2 <- bb.e.topics %>% filter(topic == 2)
wordcloud(topic2$term, topic2$beta, max.words = 100, random.order = FALSE,
  rot.per = 0.3, colors = brewer.pal(6, "Dark2"))
```
```{r}
bb.e.top.topics <- bb.e.topics %>% top_n(25) %>% arrange(desc(beta))
print(bb.e.top.topics)
```
(@) Finally, when you are all done, write up your findings. What research question did you focus on and what did you learn?
>I attempted to build a topic model for the entire Buffalo Bill corpus and then filter out only the "Extraneous" articles, using both an LDA plot and a word cloud to see whether different terms emerged from those high-level models. It seems to me that the entire corpus produces themes of location and characters (i.e., cowboys, riders) and that the 'E' articles produce themes related more to time and perhaps relationships.
>I don't feel that I was able to accomplish very much with the Buffalo Bill dataset using topic modeling.
**This was the final worksheet for this class. You did it - you learned to code in R! Congrats!**