---
title: 'Worksheet 7: Topic Modeling'
author: 'Candy Boatwright'
date: 'April 10, 2024'
---
_This is the seventh in a series of worksheets for History 8510 at Clemson University. The goal of these worksheets is simple: practice, practice, practice. The worksheet introduces concepts and techniques and includes prompts for you to practice in this interactive document. When you are finished, you should change the author name (above), knit your document, and upload it to canvas. Don't forget to commit your changes as you go and push to github when you finish the worksheet._
Text analysis is an umbrella for a number of different methodologies. Generally speaking, it involves taking a set (or corpus) of textual sources, turning them into data that a computer can understand, and then running calculations and algorithms using that data. Typically, at its most basic level, that involves the counting of words.
Topic modeling (TM) is one type of text analysis that is particularly useful for historians.
TM takes collections (or corpora) of documents and returns groups of "topics" from those documents. It is a form of unsupervised classification that finds groups of items that are probabilistically likely to co-occur.
Latent Dirichlet allocation (LDA) is the most popular algorithm or method for topic modeling, although there are others. It assumes that each document has a mixture of topics and that each topic is a mixture of words. That means that topics overlap each other in terms of content rather than being confined to distinct and singular groups.
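As a toy illustration of those two assumptions (the numbers below are made up, not from any real corpus): a document's mixture of topics and a topic's mixture of words are each probability distributions, so each sums to 1.

```{r}
# Hypothetical numbers: one document's mixture of topics, and one topic's
# mixture of words. Each is a probability distribution summing to 1.
doc_mixture  <- c(topic1 = 0.50, topic2 = 0.25, topic3 = 0.25)
topic1_words <- c(city = 0.500, street = 0.250, water = 0.125, other = 0.125)
sum(doc_mixture)   # 1
sum(topic1_words)  # 1
```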
To prepare a corpus for topic modeling, we'll do many of the same types of operations that we used last week to prepare a corpus for analysis. First we'll pre-process the data and then we'll create a document term matrix from our corpus using the `tm` (text mining) package.
```{r}
library(tidytext)
library(tidyverse)
library(readtext)
library(tm)
library(topicmodels)
```
```{r}
download.file("https://github.com/regan008/8510-TextAnalysisData/blob/main/TheAmericanCity.zip?raw=true", "AmCity.zip")
unzip("AmCity.zip")
```
```{r}
# Metadata that includes info about each issue.
metadata <- read.csv("https://raw.githubusercontent.com/regan008/8510-TextAnalysisData/main/AmCityMetadata.csv")
meta <- as.data.frame(metadata)
ac_texts <- readtext(paste("TheAmericanCity/", "*.txt", sep=""))
ac_whole <- full_join(meta, ac_texts, by = c("filename" = "doc_id")) %>% as_tibble()
tidy_ac <- ac_whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words)
tidy_ac <- tidy_ac %>% filter(!grepl('[0-9]', word))
```
The above code borrows from what we did last week. It pulls in the texts from the _The American City_ corpus, joins them together into a single data frame, uses `unnest_tokens()` to tokenize the text, and, finally, removes stop words.
For topic modeling, we need a Document Term Matrix, or DTM. A DTM has the documents running down one side and the terms across the top. `Tidytext` provides functions for converting to and from DTMs. First, we need to create a data frame that has the doc_id, the word, and the number of times that word occurs. We can do that using `count()`.
```{r}
tidy_ac_words <- tidy_ac %>% count(filename, word)
head(tidy_ac_words)
```
Now we can use `cast_dtm()` to turn `tidy_ac_words` into a dtm.
```{r}
ac.dtm <- tidy_ac_words %>%
  cast_dtm(filename, word, n)
```
If you run `class(ac.dtm)` in your console you will notice that it now has a class of "DocumentTermMatrix".
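You can also inspect the DTM right in the document (this assumes the chunk above has been run):

```{r}
class(ac.dtm)  # "DocumentTermMatrix" "simple_triplet_matrix"
dim(ac.dtm)    # number of documents, number of terms
```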
Now that we have a dtm, we can create a topic model. For this, we'll use the topic models package and the `LDA()` function. Take a minute and read the documentation for `LDA()`.
There are two important options when running `LDA()`. The first is `k`, the number of topics you want the model to generate. How many topics to generate is a decision that often takes some experimentation and depends on the size of your corpus. The American City corpus isn't that big but still has over 209k words. In this instance, because the corpus is so small, we're going to start with a small number of topics. Going above 5 causes errors with this particular corpus. Later, when you work with a different corpus, you should experiment with changing the number of topics from 10 to 20 to 30 to 50 to see how it changes your model.
The second important option when running `LDA()` is the seed. You don't need to worry too much about what setting the seed does; put simply, it ensures the output of the model is predictable and reproducible. Using a seed ensures that if you come back to your code later, or someone else tries to run it, the model will return exactly the same results.
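The effect of a seed is easy to see in miniature with base R's `sample()` (this is just an illustration, not part of the topic-modeling pipeline):

```{r}
# With the same seed, "random" draws come out identical every time.
set.seed(12345)
a <- sample(1:100, 5)
set.seed(12345)
b <- sample(1:100, 5)
identical(a, b)  # TRUE
```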
Let's now train our model. This will take a few minutes:
```{r}
ac.lda <- LDA(ac.dtm, k = 5, control = list(seed = 12345))
ac.lda
```
Now we have an LDA topic model with 5 topics. There are two ways to look at this model: word-topic probabilities and document-topic probabilities.
Let's start with **word-topic probabilities.**
Every topic is made up of words that are most associated with that topic. Together these words typically form some sort of theme. To understand what this looks like the easiest thing to do is create a bar chart of the top terms in a topic.
```{r}
ac.topics <- tidy(ac.lda, matrix = "beta")
head(ac.topics)
```
What we have here is a list of topics and the weight of each term in that topic. Essentially, we have turned this into a one-topic-per-term-per-row format. So, for example, the term 10th has a weight of 5.135047e-05 in topic 1 but 7.269700e-05 in topic 2. That doesn't mean a lot to us at the moment, and this format is impossible to grasp at its current size, but we can use tidyverse functions to pare it down and determine the terms that are most common within each topic.
```{r}
ac.top.terms <- ac.topics %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
ac.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
(@) Can you adjust the code above to show the top 10 words from just one topic?
```{r}
# Top 10 words from a single topic (topic 1). With only one topic we can
# drop the facet and use a plain reorder().
ac.top.terms.one <- ac.topics %>%
  filter(topic == 1) %>%
  arrange(desc(beta)) %>%
  slice(1:10)

ac.top.terms.one %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE)
```
Another useful way to look at the words in each topic is by visualizing them as a wordcloud.
```{r warning=FALSE}
library(wordcloud)
topic2 <- ac.topics %>% filter(topic == 2)
wordcloud(topic2$term, topic2$beta, max.words = 100, random.order = FALSE,
  rot.per = 0.3, colors = brewer.pal(6, "Dark2"))
```
Now we can see what words are most common in each topic. But the document-topic probabilities are also useful for understanding what topics are prevalent in what documents. Just as each topic is made up of a mixture of words, the LDA algorithm also assumes that each document is made up of a mixture of topics.
```{r}
ac.documents <- tidy(ac.lda, matrix = "gamma")
head(ac.documents)
```
For each document, the model gives us the estimated proportion of the document's words that come from each topic. So for the April 1915 issue it estimates that about 23% of the words are from topic 1. This gamma value represents the posterior topic distribution for each document.
This is easier to see if we filter to see the breakdown for just one document.
```{r}
ac.documents %>% filter(document == "1916_May.txt") %>% arrange(desc(gamma))
```
This gamma value is really useful, and we can use it to see which topics appear most in which documents. Because our documents are dated issues, this is frequently referred to as looking at topics over time.
We can do that using the ac.documents dataframe that we just created but it needs to be joined with the metadata. Again, this is why it is important to have a filename within the metadata spreadsheet. To join these two together we can do a full_join because we want to keep all of the columns.
```{r}
topics.by.year <- full_join(ac.documents, metadata, by = join_by(document == filename))
```
Now what we have is a data frame that includes the proportion of each topic in each document. Because this is a dataset about a periodical, we have values in our metadata that will make it easy to plot the distribution of a topic over time -- in this case for each edition of the journal.
```{r}
topics.by.year$issue_date <- paste(topics.by.year$month, " ", topics.by.year$year, sep = "")
ggplot(data=topics.by.year, aes(x=issue_date, y=gamma)) + geom_bar(stat="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
Using this we can tell that topic 5, which from earlier had the words improve, grow, ties, contracts, and gasoline as the top five words, is most prominent in January 1915.
(@) Use the rest of this worksheet to experiment with topic modeling. I've added the code to download a much larger dataset - the issues of Mind and Body. This corpus has 413 documents ranging from the 1890s to 1936. You'll want to start with at least 25 topics.
```{r}
#| eval: false
download.file("https://github.com/regan008/8510-TextAnalysisData/blob/main/MindAndBody.zip?raw=true", "MB.zip")
unzip("MB.zip")
file.rename("txt", "mbtxt")
```
```{r}
# Metadata that includes info about each issue.
mb.metadata <- read.csv("https://raw.githubusercontent.com/regan008/8510-TextAnalysisData/main/mb-metadata.csv")
mb.meta <- as.data.frame(mb.metadata)
mb.meta$Filename <- paste("MB_", mb.meta$Filename, sep = "")
head(mb.meta)
```
```{r}
mb.texts <- readtext(paste("mbtxt/", "*.txt", sep = ""))
mb.whole <- full_join(mb.meta, mb.texts, by = join_by(Filename == doc_id)) %>% as_tibble()
head(mb.whole)
```
```{r}
tidy.mb <- mb.whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words)
head(tidy.mb)
```
```{r}
tidy.mb.words <- tidy.mb %>% count(Filename, word)
head(tidy.mb.words)
```
```{r}
mb.dtm <- tidy.mb.words %>%
  cast_dtm(Filename, word, n)
mb.dtm
```
```{r}
mb.lda <- LDA(mb.dtm, k = 25, control = list(seed = 12345))
mb.lda
```
```{r}
mb.topics <- tidy(mb.lda, matrix = "beta")
head(mb.topics)
```
```{r}
mb.top.terms <- mb.topics %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
mb.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
(@) What happens if you create a custom stopword list? How does this change the model?
```{r}
mb.custom.tidy.list <- c("real", "mentioned", "vol", "ulm", "rath", "iii", "wis", "st", "mich", "ave", "cts")
mb.custom.tidy <- tidy.mb.words %>%
  filter(!word %in% mb.custom.tidy.list) %>%
  arrange(desc(n))
mb.dtm.2 <- mb.custom.tidy %>%
  cast_dtm(Filename, word, n)
mb.lda.2 <- LDA(mb.dtm.2, k = 25, control = list(seed = 12345))
mb.topics.2 <- tidy(mb.lda.2, matrix = "beta")
mb.top.terms.2 <- mb.topics.2 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
mb.top.terms.2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
>Just removing a few words that I felt wouldn't be terribly important or were nonsense words totally changed the topic model. Honestly, I feel that the original results were better before I removed words.
(@) Can you create a topic model for just the documents in the 1920s? How does that change the model?
```{r}
mb.1920 <- mb.whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words) %>%
filter(Year >= 1920 & Year <= 1929)
mb.dtm.3 <- mb.1920 %>%
count(Filename, word) %>%
cast_dtm(Filename, word, n)
mb.lda.3 <- LDA(mb.dtm.3, k = 5, control = list(seed = 12345))
mb.topics.3 <- tidy(mb.lda.3, matrix = "beta")
mb.top.terms.3 <- mb.topics.3 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
mb.top.terms.3 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
>I initially left the number of topics at 25, which was far too many for the new model. 5 seems to be better, although there is tremendous overlap in words across all 5 of the topics generated.
(@) Now, let's return to the Buffalo Bill data from last week. You should be able to use topic modeling to address two of the research questions provided:
* Can we detect some change over time in promotion language and reception language (marketing and reviewing)? Were there types of characters, scenarios, action promised in promotional material and/or noted in reviews earlier vs later?
* What can be gleaned from the items tagged as extraneous as far as topics? These are news items that are somehow related to BBWW. Crime, finances, celebrity, etc.
To analyze this you should first generate a topic model for the buffalo bill data. Play with the number of topics until you find a number that feels about right for the dataset. I am guessing it'll be in the 8-15 range but you'll need to play with it to see what number gives you the best fit.
```{r}
download.file("https://github.com/dseefeldt/IndianaNewspapers/raw/main/bb-txt.zip", "bb-txt.zip")
unzip("bb-txt.zip")
```
```{r}
# Metadata that includes info about each article.
bb.metadata <- read.csv("https://raw.githubusercontent.com/dseefeldt/IndianaNewspapers/main/metadata.csv")
#bb.tidy.meta <- bb.metadata %>% separate(Date, c("Month", "Day", "Year"))
data_dir <- paste(getwd(), "/txt", sep = "")
bb <- readtext(paste0(data_dir, "/*.txt"))
bb.whole <-
bb.metadata %>%
arrange(Publication) %>% # sort bb.metadata by publication
bind_cols(bb) %>% # combines with full texts from bb
as_tibble()
head(bb.whole)
```
```{r}
bb.tidy <- bb.whole %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$")) %>%
anti_join(stop_words)
#filter(word != "bill" & bb.whole$text != "buffalo" & bb.whole$text != "bill's" & bb.whole$text != "cody" & bb.whole$text != "pg" & bb.whole$text != "wild" & bb.whole$text != "west") %>%
head(bb.tidy)
```
```{r}
tidy.bb.words <- bb.tidy %>% count(doc_id, word)
head(tidy.bb.words)
```
```{r}
bb.dtm <- tidy.bb.words %>%
  cast_dtm(doc_id, word, n)
bb.dtm
```
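Before settling on `k = 15`, one optional way to compare candidate numbers of topics is `perplexity()` from the `topicmodels` package (lower is generally better, though the interpretability of the resulting topics matters more than the raw score). This is only a sketch and can be slow, so it is not evaluated here:

```{r}
#| eval: false
# Sketch: refit the model for several candidate values of k and print the
# perplexity of each fit on the training DTM.
for (k in c(8, 10, 15)) {
  m <- LDA(bb.dtm, k = k, control = list(seed = 12345))
  print(c(k = k, perplexity = perplexity(m)))
}
```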
```{r}
bb.lda <- LDA(bb.dtm, k = 15, control = list(seed = 12345))
bb.lda
```
```{r}
bb.topics <- tidy(bb.lda, matrix = "beta")
head(bb.topics)
```
```{r}
bb.top.terms <- bb.topics %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
bb.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
```{r}
#add custom stop words to model
bb.custom.tidy.list <- c("buffalo", "bill", "bill's", "wild", "west", "cody", "pg")
bb.custom.tidy <- tidy.bb.words %>%
  filter(!word %in% bb.custom.tidy.list) %>%
  arrange(desc(n))
bb.dtm.2 <- bb.custom.tidy %>%
  cast_dtm(doc_id, word, n)
bb.lda.2 <- LDA(bb.dtm.2, k = 15, control = list(seed = 12345))
bb.topics.2 <- tidy(bb.lda.2, matrix = "beta")
bb.top.terms.2 <- bb.topics.2 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
bb.top.terms.2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
To address the first research question, you'll need to plot topics over time. I would create three models, one for all of the data, one for promotion, and one for reception. What do we learn by doing this?
```{r}
bb.documents <- tidy(bb.lda.2, matrix = "gamma")
#head(bb.documents)
bb.all.topics.by.year <- full_join(bb.documents, bb.metadata, by = join_by(document == Filename)) #%>%
#top_n(5)
head(bb.all.topics.by.year)
```
```{r}
#bb.all.topics.by.year$FullDate <- paste(bb.all.topics.by.year$month, " ", bb.all.topics.by.year$year, sep = "")
ggplot(data=bb.all.topics.by.year, aes(x=Date, y=gamma)) + geom_bar(stat ="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
```{r}
bb.p.topics.by.year <- bb.all.topics.by.year %>%
filter(R_P_E == "P")
ggplot(data=bb.p.topics.by.year, aes(x=Date, y=gamma)) + geom_bar(stat="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
```{r}
bb.r.topics.by.year <- bb.all.topics.by.year %>%
filter(R_P_E == "R")
ggplot(data=bb.r.topics.by.year, aes(x=Date, y=gamma)) + geom_bar(stat="identity") + facet_wrap(~ topic, scales = "free") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
For the second, a general topic model of the extraneous articles will be needed.
Add code blocks below as necessary.
```{r}
#topic model for entire corpus
bb.top.terms <- bb.topics.2 %>%
arrange(desc(beta)) %>%
group_by(topic) %>% slice(1:5)
bb.top.terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
```
```{r}
bb.tidy.join <- full_join(bb.custom.tidy, bb.metadata, by = c("doc_id" = "Filename")) %>% as_tibble()
head(bb.tidy.join)
```
```{r}
bb.beta.join <- full_join(bb.tidy.join, bb.topics.2, by = c("word" = "term"))
head(bb.beta.join)
```
```{r}
# Build a DTM from only the "Extraneous" articles, then fit a small model.
bb.e.dtm <- bb.tidy.join %>%
  filter(R_P_E == "E") %>%
  cast_dtm(doc_id, word, n)
bb.e.lda <- LDA(bb.e.dtm, k = 5, control = list(seed = 12345))
bb.e.lda
bb.e.topics <- tidy(bb.e.lda, matrix = "beta")
bb.top.e.terms <- bb.e.topics %>%
  arrange(desc(beta)) %>%
  group_by(topic) %>% slice(1:5)
bb.top.e.terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
```
```{r}
library(wordcloud)
topic2 <- bb.e.topics %>% filter(topic == 2)
wordcloud(topic2$term, topic2$beta, max.words = 100, random.order = FALSE,
  rot.per = 0.3, colors = brewer.pal(6, "Dark2"))
```
```{r}
bb.e.top.topics <- bb.e.topics %>% top_n(25) %>% arrange(desc(beta))
print(bb.e.top.topics)
```
(@) Finally, when you are all done, write up your findings. What research question did you focus on and what did you learn?
>I attempted to build a topic model for the entire Buffalo Bill corpus and then filter out only the "Extraneous" articles, using both an LDA plot and a word cloud to see whether different terms emerged from those high-level models. It seems to me that the entire corpus produces themes of location and characters (i.e., cowboys, riders) and that the 'E' articles produce themes related more to time and perhaps relationships.
>I don't feel that I was able to accomplish very much with the Buffalo Bill dataset using topic modeling.
**This was the final worksheet for this class. You did it - you learned to code in R! Congrats!**