# Shakespeare Start to Finish
The following attempts to demonstrate the usual difficulties one encounters dealing with text by procuring and processing the works of Shakespeare. The source is [MIT](http://shakespeare.mit.edu/), which has made the 'complete' works available on the web since 1993; one additional work comes from Project Gutenberg. The initial issue is simply getting the works from the web. Subsequently there is metadata, character names, stopwords, etc. to be removed. At that point, we can stem and count the words in each work, which, once complete, leaves us ready for analysis.
The primary packages used are <span class="pack">tidytext</span>, <span class="pack">stringr</span>, and, when things are ready for analysis, <span class="pack">quanteda</span>.
## ACT I. Scrape MIT and Gutenberg Shakespeare
### Scene I. Scrape main works
Initially we must scrape the web to get the documents we need. The <span class="pack">rvest</span> package will be used as follows.
- Start with the url of the site
- Get the links off that page to serve as base urls for the works
- Scrape the document for each url
- Deal with the collection of Sonnets separately
- Write out results
```{r shakes_urls, eval=FALSE}
library(rvest); library(tidyverse); library(stringr)
page0 = read_html('http://shakespeare.mit.edu/')
works_urls0 = page0 %>%
html_nodes('a') %>%
html_attr('href')
main = works_urls0 %>%
grep(pattern='index', value=T) %>%
str_replace_all(pattern='index', replacement='full')
other = works_urls0[!grepl(works_urls0, pattern='index|edu|org|news')]
works_urls = c(main, other)
works_urls[1:3]
```
Now we just paste the main site url to the work urls and download them. Here is where we come across our first snag. The <span class="func">html_text</span> function has what I would call a bug but what the author feels is a feature. [Basically, it ignores line breaks of the form `<br>` in certain situations](https://github.com/hadley/rvest/issues/175). This means it will smash text together that shouldn't be, thereby making *any* analysis of it fairly useless[^bug]. Luckily, [\@rentrop](https://github.com/rentrop) provided a solution, which is in `r/fix_read_html.R`.
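For reference, here is a minimal sketch of the idea behind that fix (the actual function in `r/fix_read_html.R` is more careful about classes and options): replace each `<br>` node with a newline before extracting the text.
```{r html_collapse_sketch, eval=FALSE}
# sketch only; the sourced fix_read_html.R handles more cases
html_text_collapse_sketch = function(doc) {
  doc    = xml2::read_html(as.character(doc))  # re-parse so the original isn't modified in place
  breaks = xml2::xml_find_all(doc, './/br')
  xml2::xml_add_sibling(breaks, 'p', '\n')     # insert a newline marker next to each <br>
  xml2::xml_remove(breaks)                     # then drop the <br> itself
  rvest::html_text(doc)
}
```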
```{r scrape_shakes, eval=FALSE}
works0 = lapply(works_urls, function(x) read_html(paste0('http://shakespeare.mit.edu/', x)))
source('r/fix_read_html.R')
html_text_collapse(works0[[1]]) #works
works = lapply(works0, html_text_collapse)
names(works) = c("All's Well That Ends Well", "As You Like It", "Comedy of Errors"
"Cymbeline", "Love's Labour's Lost", "Measure for Measure"
"The Merry Wives of Windsor", "The Merchant of Venice", "A Midsummer Night's Dream"
"Much Ado about Nothing", "Pericles Prince of Tyre", "The Taming of the Shrew"
"The Tempest", "Troilus and Cressida", "Twelfth Night"
"The Two Gentlemen of Verona", "The Winter's Tale", "King Henry IV Part 1"
"King Henry IV Part 2", "Henry V", "Henry VI Part 1"
"Henry VI Part 2", "Henry VI Part 3", "Henry VIII"
"King John", "Richard II", "Richard III"
"Antony and Cleopatra", "Coriolanus", "Hamlet"
"Julius Caesar", "King Lear", "Macbeth"
"Othello", "Romeo and Juliet", "Timon of Athens"
"Titus Andronicus", "Sonnets", "A Lover's Complaint"
"The Rape of Lucrece", "Venus and Adonis", "Elegy")
```
### Scene II. Sonnets
We now hit a slight nuisance with the Sonnets. The Sonnets have a bit of a different structure than the plays. All links are in a single page, with a different form for the url, and each sonnet has its own page.
```{r scrape_sonnets, eval=F}
sonnet_urls = paste0('http://shakespeare.mit.edu/', grep(works_urls0, pattern='sonnet', value=T)) %>%
read_html() %>%
html_nodes('a') %>%
html_attr('href')
sonnet_urls = grep(sonnet_urls, pattern = 'sonnet', value=T) # remove amazon link
# read the texts
sonnet0 = purrr::map(sonnet_urls, function(x) read_html(paste0('http://shakespeare.mit.edu/Poetry/', x)))
# collapse to one 'Sonnets' work
sonnet = sapply(sonnet0, html_text_collapse)
works$Sonnets = sonnet
```
### Scene III. Save and write out
Now we can save our results so we won't have to repeat any of the previous scraping. We want to save the main text object as an RData file, and write out the texts to their own file. When dealing with text, you'll regularly want to save stages so you can avoid repeating what you don't have to, as often you will need to go back after discovering new issues further down the line.
```{r initial_save, eval=F, echo=1}
save(works, file='data/texts_raw/shakes/moby_from_web.RData')
# This will spit the text to the console unfortunately
purrr::map2(works,
paste0('data/texts_raw/shakes/moby/', str_replace_all(names(works), " |'", '_'), '.txt'),
function(x, nam) write_lines(x, path=nam))
```
### Scene IV. Read text from files
Once the above is done, it doesn't need to be repeated, and we can always read back in what we need. I'll start with the raw text as files, as that is one of the more common ways one deals with documents. When text is nice and clean, this can be fairly straightforward.
The function at the end comes from the <span class="pack">tidyr</span> package. Up to that line, each element in the <span class="objclass">text</span> column is the entire text, so the column itself is a 'list-column'. In other words, we have a 42 x 2 tibble. But to do what we need, we'll want access to each line, and the <span class="func">unnest</span> function unpacks the lines within each title. The first few lines of the result are shown after.
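First, though, a tiny made-up example of a list-column being unpacked by <span class="func">unnest</span>:
```{r unnest_listcol_toy, eval=FALSE}
library(tidyverse)
d = tibble(id   = c('a', 'b'),
           text = list(c('line 1', 'line 2'), 'only line'))
d                # 2 rows; text is a list-column
unnest(d, text)  # 3 rows; one row per line
```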
```{r read_shakes_works, echo=-5, eval=FALSE}
library(tidyverse); library(stringr)
shakes0 =
data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>%
mutate(text = map(file, read_lines)) %>%
transmute(id = basename(file), text) %>%
unnest(text)
save(shakes0, file='data/initial_shakes_dt.RData')
# Alternate that provides for more options
# library(readtext)
# shakes0 =
# data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>%
# mutate(text = map(file, readtext, encoding='UTF8')) %>%
# unnest(text)
```
```{r show_shakes0, echo=F}
load('data/initial_shakes_dt.RData')
DT::datatable(shakes0[1:20,],
options=list(dom='tp')) %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
### Scene V. Add additional works
It is typical to be gathering texts from multiple sources. In this case, we'll get *The Phoenix and the Turtle* from the Project Gutenberg website. There is an R package that allows us to work directly with the site, making the process straightforward[^r4everything]. I also considered two other works, but passed on *The Two Noble Kinsmen* because, like many of the other Shakespeare versions on Gutenberg, it's basically written in a different language, and on *The Passionate Pilgrim* because it's mostly not Shakespeare.
When first doing this project, I actually started with Gutenberg, but it became a notable PITA. The texts were inconsistent in source, and sometimes purposely reproduced printing errors, which would have compounded the typical problems. I thought this could be solved by using the *Complete Works of Shakespeare*, but that download only came with the single title, meaning one would have to hunt for and delineate each separate work. This might not have been too big of an issue, except that there is no table of contents, nor consistent naming of titles across different printings. The MIT approach, on the other hand, was a few lines of code. This represents a common issue in text analysis when dealing with sources: a different option may save a lot of time in the end.
The following code could be more succinct to deal with one text, but I initially was dealing with multiple works, so I've left it in that mode. In the end, we'll have a <span class="objclass">tibble</span> with an id column for the file/work name, and another column that contains the lines of text.
```{r phoenix_turtle, eval=FALSE}
library(gutenbergr)
works_not_included = c("The Phoenix and the Turtle") # add others if desired
gute0 = gutenberg_works(title %in% works_not_included)
gute = lapply(gute0$gutenberg_id, gutenberg_download)
gute = mapply(function(x, y) mutate(x, id=y) %>% select(-gutenberg_id),
x=gute,
y=works_not_included,
SIMPLIFY=F)
shakes = shakes0 %>%
bind_rows(gute) %>%
mutate(id = str_replace_all(id, " |'", '_')) %>%
mutate(id = str_replace(id, '.txt', '')) %>%
arrange(id)
# shakes %>% split(.$id) # inspect
save(shakes, file='data/texts_raw/shakes/shakes_df.RData')
```
## ACT II. Preliminary Cleaning
If you think we're even remotely getting close to being ready for analysis, I say Ha! to you. Our journey has only just begun (cue the Carpenters).
Now we can start thinking about prepping the data for eventual analysis. One of the nice things about having the data in a tidy format is that we can use string functionality over the column of text in a simple fashion.
### Scene I. Remove initial text/metadata
First on our to-do list is to get rid of all the preliminary text of titles, authorship, and similar. This is fairly straightforward once you realize the text we want will be associated with something like `ACT I`, or in the case of the Sonnets, the word `Sonnet`. So the idea is to drop all text up to those points. I've created a [function](https://github.com/m-clark/text-analysis-with-R/blob/master/r/detect_first_act.R) that will do that, and then just apply it to each work's tibble[^tibblefail]. For the poems and *A Funeral Elegy for Master William Peter*, we look instead for the line where his name or initials start the line.
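As a rough sketch of the idea (the linked `detect_first_act.R` handles the poems and other special cases more carefully):
```{r detect_first_act_sketch, eval=FALSE}
# sketch only: find the first line matching ACT I or Sonnet, drop everything before it
detect_first_act_sketch = function(work) {
  start = which(stringr::str_detect(work$text, '^ACT I\\b|^Sonnet'))[1]
  if (is.na(start)) start = 1   # keep everything if no match is found
  work[start:nrow(work), ]
}
```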
```{r remove_preliminary_text, echo=-(1:2)}
load('data/texts_raw/shakes/shakes_df.RData')
source('r/detect_first_act.R')
shakes_trim = shakes %>%
split(.$id) %>%
lapply(detect_first_act) %>%
bind_rows
shakes %>% filter(id=='Romeo_and_Juliet') %>% head
shakes_trim %>% filter(id=='Romeo_and_Juliet') %>% head
```
### Scene II. Miscellaneous removal
Next, we'll want to remove empty rows, any remaining titles, lines that denote the act or scene, and other stuff. I'm going to remove the words *prologue* and *epilogue* as stopwords later. While some texts have a line that says only that (`PROLOGUE`), others have text that describes the scene (`Prologue. Blah blah`), which I've decided to keep. As such, we just need the word itself gone.
```{r remove_misc}
titles = c("A Lover's Complaint", "All's Well That Ends Well", "As You Like It", "The Comedy of Errors",
"Cymbeline", "Love's Labour's Lost", "Measure for Measure",
"The Merry Wives of Windsor", "The Merchant of Venice", "A Midsummer Night's Dream",
"Much Ado about Nothing", "Pericles Prince of Tyre", "The Taming of the Shrew",
"The Tempest", "Troilus and Cressida", "Twelfth Night",
"The Two Gentlemen of Verona", "The Winter's Tale", "King Henry IV, Part 1",
"King Henry IV, Part 2", "Henry V", "Henry VI, Part 1",
"Henry VI, Part 2", "Henry VI, Part 3", "Henry VIII",
"King John", "Richard II", "Richard III",
"Antony and Cleopatra", "Coriolanus", "Hamlet",
"Julius Caesar", "King Lear", "Macbeth",
"Othello", "Romeo and Juliet", "Timon of Athens",
"Titus Andronicus", "Sonnets",
"The Rape of Lucrece", "Venus and Adonis", "A Funeral Elegy", "The Phoenix and the Turtle")
shakes_trim = shakes_trim %>%
filter(text != '', # remove empties
!text %in% titles, # remove titles
!str_detect(text, '^ACT|^SCENE|^Enter|^Exit|^Exeunt|^Sonnet') # remove acts etc.
)
shakes_trim %>% filter(id=='Romeo_and_Juliet') # we'll get prologue later
```
### Scene III. Classification of works
While we're at it, we can save the classical (sometimes arbitrary) classifications of Shakespeare's works for later comparison to what we'll get in our analyses. We'll save them to call as needed.
```{r shakespeare_classes, eval=FALSE}
shakes_types = data_frame(title=unique(shakes_trim$id)) %>%
mutate(class = 'Comedy',
class = if_else(str_detect(title, pattern='Adonis|Lucrece|Complaint|Turtle|Pilgrim|Sonnet|Elegy'), 'Poem', class),
class = if_else(str_detect(title, pattern='Henry|Richard|John'), 'History', class),
class = if_else(str_detect(title, pattern='Troilus|Coriolanus|Titus|Romeo|Timon|Julius|Macbeth|Hamlet|Othello|Antony|Cymbeline|Lear'), 'Tragedy', class),
problem = if_else(str_detect(title, pattern='Measure|Merchant|^All|Troilus|Timon|Passion'), 'Problem', 'Not'),
late_romance = if_else(str_detect(title, pattern='Cymbeline|Kinsmen|Pericles|Winter|Tempest'), 'Late', 'Other'))
save(shakes_types, file='data/shakespeare_classification.RData') # save for later
```
## ACT III. Stop words
As we've noted before, we'll want to get rid of stop words, things like articles, possessive pronouns, and other very common words. In this case, we also want to include character names. However, the big wrinkle here is that this is not English as currently spoken, so we need to remove 'ye', 'thee', 'thine' etc. In addition, there are things that need to be replaced, like o'er to over, which may then also be removed. In short, this is not so straightforward.
### Scene I. Character names
We'll get the list of character names from [opensourceshakespeare.org](http://opensourceshakespeare.org/) via <span class="pack">rvest</span>, but I added some from the poems and others that still came through the processing one way or another, e.g. abbreviated names.
```{r character_names, eval=FALSE, echo=-1}
library(rvest)
shakes_char_url = 'https://www.opensourceshakespeare.org/views/plays/characters/chardisplay.php'
page0 = read_html(shakes_char_url)
tabs = page0 %>% html_table()
shakes_char = tabs[[2]][-(1:2), c(1,3,5)] # remove header and phantom columns
colnames(shakes_char) = c('Nspeeches', 'Character', 'Play')
shakes_char = shakes_char %>%
distinct(Character,.keep_all=T)
save(shakes_char, file='data/shakespeare_characters.RData')
```
A new snag is that characters with multiple names may be represented (typically) by the first or last name, or in the case of three names, the middle one, e.g. Sir Toby Belch. Others have more unwieldy names, e.g. RICHARD PLANTAGENET (DUKE OF GLOUCESTER). The following should capture everything by splitting the names on spaces, removing parentheses, and keeping unique terms.
```{r character_names_clean, echo=-1}
load('data/shakespeare_characters.RData')
# remove paren and split
chars = shakes_char$Character
chars = str_replace_all(chars, '\\(|\\)', '')
chars = str_split(chars, ' ') %>%
unlist
# these were found after initial processing
chars_other = c('enobarbus', 'marcius', 'katharina', 'clarence','pyramus',
'andrew', 'arcite', 'perithous', 'hippolita', 'schoolmaster',
'cressid', 'diomed', 'kate', 'titinius', 'Palamon', 'Tarquin',
'lucrece', 'isidore', 'tom', 'thisbe', 'paul',
'aemelia', 'sycorax', 'montague', 'capulet', 'collatinus')
chars = unique(c(chars, chars_other))
chars = chars[chars != '']
sample(chars)[1:3]
```
### Scene II. Old, Middle, & Modern English
While Shakespeare is considered [Early Modern English](https://en.wikipedia.org/wiki/Early_Modern_English), some text may be more historical, so I include Middle and Old English stopwords, as they were readily available from the <span class="pack">cltk</span> Python module ([link](https://github.com/cltk/cltk)). I also added some things to the modern English list, like "thou'ldst", that I found lingering after initial passes. I first started using the works from Gutenberg, and there the Old English might have had some utility; as the texts there were inconsistently translated and otherwise problematic, I abandoned using them. Here, the Old English vocabulary applied to these texts only removes 'wit', so I refrain from using it.
```{r old_middle_modern_english}
# old and me from python cltk module;
# em from http://earlymodernconversions.com/wp-content/uploads/2013/12/stopwords.txt;
# I also added some to me
old_stops0 = read_lines('data/old_english_stop_words.txt')
# sort(old_stops0)
old_stops = data_frame(word=str_conv(old_stops0, 'UTF8'),
lexicon = 'cltk')
me_stops0 = read_lines('data/middle_english_stop_words')
# sort(me_stops0)
me_stops = data_frame(word=str_conv(me_stops0, 'UTF8'),
lexicon = 'cltk')
em_stops0 = read_lines('data/early_modern_english_stop_words.txt')
# sort(em_stops0)
em_stops = data_frame(word=str_conv(em_stops0, 'UTF8'),
lexicon = 'emc')
```
### Scene III. Remove stopwords
We're now ready to start removing words. However, right now we have lines, not words. We can use the <span class="pack">tidytext</span> function <span class="func">unnest_tokens</span>, which is like <span class="func">unnest</span> from <span class="pack">tidyr</span>, but works on different tokens, e.g. words, sentences, or paragraphs. Note that by default, the function will make all words lower case to make matching more efficient.
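For a quick sense of what <span class="func">unnest_tokens</span> does, here it is on a couple of made-up lines:
```{r unnest_tokens_toy, eval=FALSE}
library(tidyverse); library(tidytext)
tibble(id   = 'demo',
       text = c('But soft, what light', 'through yonder window breaks')) %>%
  unnest_tokens(word, text)   # one row per lower-cased word, punctuation dropped
```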
```{r unnest_words}
library(tidytext)
shakes_words = shakes_trim %>%
unnest_tokens(word, text, token='words')
save(shakes_words, file='data/shakes_words_df_4text2vec.RData')
```
We also will be doing a little stemming here. I'm getting rid of the suffixes that follow an apostrophe (e.g. 'd, 'st). Many of the remaining words will either be stopwords or need to be further stemmed later. I also created a middle/modern English stemmer for words that are not caught otherwise (<span class="func">me_st_stem</span>). Again, this is the sort of thing you discover after initial passes (e.g. 'criedst'). After that, we can use <span class="func">anti_join</span> to remove the stopwords.
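To see what that apostrophe-suffix pattern does, here it is applied to a few sample tokens:
```{r suffix_regex_demo, eval=FALSE}
library(stringr)
str_replace(c("look'd", "know'st", "think'st"), "'er$|'d$|'t$|'ld$|'rt$|'st$|'dst$", '')
# "look"  "know"  "think"
```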
```{r anti_join}
source('r/st_stem.R')
shakes_words = shakes_words %>%
mutate(word = str_trim(word), # remove possible whitespace
word = str_replace(word, "'er$|'d$|'t$|'ld$|'rt$|'st$|'dst$", ''), # remove me style endings
word = str_replace_all(word, "[0-9]", ''), # remove sonnet numbers
word = vapply(word, me_st_stem, 'a')) %>%
anti_join(em_stops) %>%
anti_join(me_stops) %>%
anti_join(data_frame(word=str_to_lower(c(chars, 'prologue', 'epilogue')))) %>%
anti_join(data_frame(word=str_to_lower(paste0(chars, "'s")))) %>% # remove possessive names
anti_join(stop_words)
```
As before, you should do a couple spot checks.
```{r stopword_check, results='hold'}
any(shakes_words$word == 'romeo')
any(shakes_words$word == 'prologue')
any(shakes_words$word == 'mayst')
```
## ACT IV. Other fixes
Now we're ready to finally do the word counts. Just kidding! There is *still* work to do for the remainder, and you'll continue to spot things after runs. One remaining issue is the words that end in 'st' and 'est', and others that are not consistently spelled or otherwise need to be dealt with. For example, 'crost' will not be stemmed to 'cross', whereas 'crossed' would be. Finally, I limit the result to words with more than two characters, as my inspection suggested the shorter ones are left-over suffixes or would be considered stopwords anyway.
```{r other_fixes}
# porter should catch remaining 'est'
add_a = c('mongst', 'gainst') # words to add a to
shakes_words = shakes_words %>%
mutate(word = if_else(word=='honour', 'honor', word),
word = if_else(word=='durst', 'dare', word),
word = if_else(word=='wast', 'was', word),
word = if_else(word=='dost', 'does', word),
word = if_else(word=='curst', 'cursed', word),
word = if_else(word=='blest', 'blessed', word),
word = if_else(word=='crost', 'crossed', word),
word = if_else(word=='accurst', 'accursed', word),
word = if_else(word %in% add_a,
paste0('a', word),
word),
word = str_replace(word, "'s$", ''), # strip remaining possessives
word = if_else(str_detect(word, pattern="o'er"), # change o'er over
str_replace(word, "'", 'v'),
word)) %>%
filter(!(id=='Antony_and_Cleopatra' & word == 'mark')) %>% # mark here is almost exclusively the character name
filter(str_count(word)>2)
```
At this point we could still maybe add things to this list of additional fixes, but I think it's time to actually start playing with the data.
## ACT V. Fun stuff
We are finally ready to get to the fun stuff. Finally! And now things get easy.
### Scene I. Count the terms
We can get term counts with standard <span class="pack">dplyr</span> approaches, and packages like <span class="pack">tidytext</span> will take that and also do some other things we might want. Specifically, we can use the latter to create the document-term matrix (DTM) that will be used in other analyses. The function <span class="func">cast_dfm</span> will create a <span class="objclass">dfm</span>, or 'document-feature matrix', class object (from <span class="pack">quanteda</span>), which is the same thing but acknowledges that this sort of approach is not specific to words. With word counts in hand, it would be good to save at this point, since they'll serve as the basis for other processing.
```{r term_counts, results='hold', eval=-8}
term_counts = shakes_words %>%
group_by(id, word) %>%
count
term_counts %>%
arrange(desc(n))
library(quanteda)
shakes_dtm = term_counts %>%
cast_dfm(document=id, term=word, value=n)
save(shakes_words, term_counts, shakes_dtm, file='data/shakes_words_df.RData')
```
Now things are looking like Shakespeare, with love for everyone[^love]. You'll notice I've kept place names such as Rome, but this might be something you'd prefer to remove. Other candidates would be madam, woman, man, majesty (as in 'his/her') etc. This sort of thing is up to the researcher.
### Scene II. Stemming
Now we'll <span class="emph">stem</span> the words. This is actually more of a pre-processing step, one that we'd do along with (and typically after) stopword removal. I do it here to mostly demonstrate how to use <span class="pack">quanteda</span> to do it, as it can also be used to remove stopwords and do many of the other things we did with <span class="pack">tidytext</span>.
Stemming will make words like eye and eyes just *ey*, or convert war, wars, and warring to *war*. In other words, it reduces variations of a word to a common root form, or 'word stem'. We could have done this in a step prior to counting the terms, but then you would only have the stemmed result to work with for the document-term matrix from then on. Depending on your situation, you may or may not want to stem, or maybe you'd want to compare results. The <span class="pack">quanteda</span> package will actually stem the DTM itself (i.e. work on the column names) and collapse the word counts accordingly. I note the difference in the number of words before and after stemming.
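As a quick illustration of Porter stemming on a few raw tokens (the chunk below does the same thing to the whole DTM at once):
```{r stem_demo, eval=FALSE}
quanteda::char_wordstem(c('war', 'wars', 'warring'))  # all reduce to "war"
```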
```{r stem, echo=-(1:4), results='hold'}
load('data/shakes_words_df.RData')
library(quanteda)
shakes_dtm
ncol(shakes_dtm)
shakes_dtm = shakes_dtm %>%
dfm_wordstem()
shakes_dtm
ncol(shakes_dtm)
```
The result is notably fewer columns, which will speed up any analysis, as well as produce a slightly more dense matrix.
### Scene III. Exploration
```{r top_20_init, echo=F}
top20 = topfeatures(shakes_dtm, 20)
```
#### Top features
Let's start looking at the data more intently. The following shows the 10 most common words and their respective counts. This is also an easy way to find candidates to add to the stopword list. Note that dai and prai are stems for day and pray. Love occurs `r round(top20[1]/top20[2], 2)` times as often as the next most frequent word!
```{r top_10}
top10 = topfeatures(shakes_dtm, 10)
top10
```
The following is a word cloud. They are among the most useless visual displays imaginable. Just because you can, doesn't mean you should.
```{r wordcloud, echo=FALSE}
# useless!
textplot_wordcloud(shakes_dtm, min.freq = 400, random.order = T,
rot.per = .25,
colors = viridis::viridis(30))
```
If you want to display relative frequency, there are clearer ways to do so, as in the following.
```{r better_cloud, echo=FALSE}
library(forcats)
# data_frame(term = names(top20), freq = top20) %>%
# mutate(
# term = fct_reorder(term, freq, .desc = TRUE),
# percentage = freq/ncol(shakes_dtm)
# ) %>%
# plot_ly(x=~term, y=~percentage) %>% # , size=~percentage
# add_text(text=~term,
# color=I('#ff5500'),
# size =~ percentage,
# sizes = c(15,15), # doesn't work for text, only markers
# showlegend = F) %>%
# theme_plotly() %>%
# layout(xaxis = list(
# autotick = FALSE,
# ticks = F,
# showticklabels = F),
# yaxis = list(range=c(0,.25)))
g = data_frame(term = names(top20), freq = top20) %>%
mutate(
term = fct_reorder(term, freq, .desc = TRUE),
percentage = freq/ncol(shakes_dtm)
) %>%
ggplot() +
geom_text(aes(x=term,
y=percentage,
size=percentage,
label=term,
angle=45, # ignored by plotly
color=term),
# color='#ff5500',
hjust=-.1,
show.legend = F) +
scale_size_continuous(range=c(2,8)) +
ylim(c(0,.25)) +
lazerhawk::theme_trueMinimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x= element_blank(),
legend.position='none') # for plotly's sake
g
# ggplotly(width='auto', hoverinfo='none') %>%
# lazerhawk::theme_plotly()
```
#### Similarity
The <span class="pack">quanteda</span> package has some built-in similarity measures such as [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), which you can think of much like the standard correlation (itself also available as an option). I display it visually to better get a sense of things.
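For intuition, cosine similarity for two toy count vectors can be computed by hand as follows; <span class="func">textstat_simil</span> does this for every pair of documents.
```{r cosine_by_hand, eval=FALSE}
a = c(2, 0, 1, 3)   # word counts for document A
b = c(1, 1, 0, 2)   # word counts for document B
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # ~ 0.87
```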
```{r similarity, echo=1, eval=3}
textstat_simil(shakes_dtm, margin = "documents", method = "cosine")
textstat_simil(shakes_dtm, margin = "documents", method = "cosine") %>%
as.matrix() %>%
heatR::corrheat(show_grid = F, height = 500)
textstat_simil(dfm_weight(shakes_dtm, 'relFreq'), margin = "documents", method = "correlation") %>%
as.matrix() %>%
heatR::corrheat()
```
<br>
We can already begin to see the clusters of documents. For example, the more historical works are the clump in the upper left. The oddball is [*The Phoenix and the Turtle*](https://en.wikipedia.org/wiki/The_Phoenix_and_the_Turtle), though *A Lover's Complaint* and the *Elegy* are also less similar to the standard Shakespeare fare. *The Phoenix and the Turtle* is about the death of ideal love, represented by the Phoenix and Turtledove, for which there is a funeral, and it is considered by scholars to be in stark contrast to his other output. The [Elegy](https://en.wikipedia.org/wiki/Shakespeare_apocrypha#A_Funeral_Elegy) itself is actually written for a funeral, but probably not by Shakespeare. [*A Lover's Complaint*](https://en.wikipedia.org/wiki/A_Lover%27s_Complaint) is thought by some critics to be an inferior work by the Bard, and maybe not even authored by him, so perhaps what we're seeing is a reflection of that lack of quality. In general, we're seeing things that we might expect.
#### Readability
We can examine readability scores for the texts, but for this we'll need them in raw form. We already had them from before; I just added *Phoenix* from the Gutenberg download.
```{r readability1, echo=-1}
raw_texts =
data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>%
mutate(text = map(file, read_lines)) %>%
bind_rows(data_frame(file = dir('data/texts_raw/shakes/gutenberg/', full.names = TRUE)) %>%
mutate(text = map(file, read_lines))) %>%
mutate(file = str_extract(file, '[^/]*$')) %>%
rename(id=file)
raw_texts
```
With raw texts, we need to convert them to a <span class="objclass">corpus</span> object to proceed more easily. The <span class="func">corpus</span> function from <span class="pack">quanteda</span> won't read directly from a list column, or a list at all, so we'll convert it via the <span class="pack">tm</span> package. This more or less defeats the purpose of using <span class="pack">quanteda</span>, except that its <span class="func">textstat_readability</span> function gives us what we want. But I digress.
Unfortunately, the concept of <span class="emph">readability</span> is ill-defined, and as such, there are dozens of measures available dating back nearly 75 years. The following is based on the Coleman-Liau grade score (higher grade = more difficult). The conclusion here is, first, that Shakespeare isn't exactly a difficult read, and second, that the poems may be relatively more difficult than the other works.
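For reference, the Coleman-Liau grade depends only on letter and sentence counts per 100 words. A by-hand version with made-up counts (the chunk after uses <span class="pack">quanteda</span>'s implementation):
```{r coleman_liau_by_hand, eval=FALSE}
# standard Coleman-Liau formula; L = letters per 100 words, S = sentences per 100 words
coleman_liau = function(letters, words, sentences) {
  L = letters / words * 100
  S = sentences / words * 100
  0.0588*L - 0.296*S - 15.8
}
coleman_liau(letters = 4400, words = 1000, sentences = 50)  # ~ 8.6 grade level
```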
```{r readability2, echo=1:3}
library(tm)
raw_text_corpus = corpus(VCorpus(VectorSource(raw_texts$text)))
shakes_read = textstat_readability(raw_text_corpus)
shakes_read_ordered = shakes_read %>%
mutate(text = str_sub(raw_texts$id, end=-5)) %>%
arrange(Coleman.Liau.grade)
shakes_read_ordered %>%
select(Coleman.Liau.grade) %>%
round(1) %>%
heatmaply::heatmaply(Rowv=NA,
Colv=NA,
labRow=shakes_read_ordered$text,
colors='Blues',
plot_method='plotly',
colorbar_len=.5,
colorbar_ypos= .25,
colorbar_xpos= .65,
subplot_widths=.25
) %>%
plotly::layout(yaxis=list(ticklen=0)) %>%
theme_plotly()
# shakes_read %>%
# mutate(id=raw_texts$id) %>%
# gather(key=index, value=score, -id) %>%
# group_by(index) %>%
# mutate(s_score=scales::rescale(score)) %>%
# ungroup %>%
# select(-score) %>%
# spread(key = index, value=s_score) %>%
# heatmaply::heatmaply(Rowv=NA, Colv=NA)
```
#### Lexical diversity
There are also metrics of <span class="emph">lexical diversity</span>. As with readability, there is no one way to measure 'diversity'. Here we'll go back to using the standard DTM, as the focus is on the terms, whereas readability is more at the sentence level. Most standard measures of lexical diversity are variants on what is called the type-token ratio, which in our setting is the number of unique terms (types) relative to the total terms (tokens). We can use <span class="func">textstat_lexdiv</span> for our purposes here, which will provide several measures of diversity by default.
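The basic type-token ratio itself is simple enough to compute directly, e.g. for a toy set of tokens:
```{r ttr_by_hand, eval=FALSE}
tokens = c('love', 'love', 'heart', 'eye', 'love')
length(unique(tokens)) / length(tokens)  # 3 types / 5 tokens = 0.6
```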
```{r lexical_diversity}
ld = textstat_lexdiv(shakes_dtm)
```
This visual is based on the (absolute) scaled values of those several metrics, and might suggest that the poems are relatively more diverse. This certainly might be the case for *Phoenix*, but it could also reflect a limitation of several of the measures: longer works are seen as less diverse, since tokens accumulate faster than new types the longer a text goes.
```{r lexical_diversity_vis, echo=FALSE}
scaled_ld = scale(ld) %>%
abs() %>%
data.frame() %>%
mutate(text = rownames(ld)) %>%
select(text, everything()) %>%
arrange(rowMeans(.[,-1]))
scaled_ld %>%
select(-text) %>%
heatmaply::heatmaply(Rowv=F,
Colv=F,
colors='Blues', # note that RCB scales will not work unless method='plotly'
plot_method = 'plotly',
labRow=scaled_ld$text, # note: heatmaply irregularly trims/justifies, so str_pad won't work
draw_cellnote=F,
hide_colorbar=T,
subplot_widths=.5) %>%
plotly::layout(yaxis=list(ticklen=0)) %>%
theme_plotly()
```
<br>
As a comparison, the following shows the results of the 'Measure of Textual Diversity' calculated using the <span class="pack">koRpus</span> package[^noshowmtld]. It is notably less affected by text length, though the conclusions are largely the same. There is notable correlation between the MTLD and readability as well[^correadmtld]. In general, Shakespeare tends to be more expressive in poems, and less so with comedies.
```{r MTLD, echo=FALSE, eval=FALSE}
fnames = c(dir('data/texts_raw/shakes/moby/', full.names=T),
dir('data/texts_raw/shakes/gutenberg/', full.names=T))
library(koRpus)
library(parallel)
cl = makeCluster(7)
clusterEvalQ(cl, library(koRpus))
clusterExport(cl, 'fnames')
# this actually just wraps 'lex.div'
MTLD_results = parLapply(cl, fnames,
function(x) MTLD(txt=kRp.text.analysis(x, force.lang='en', tagger='tokenize')))
stopCluster(cl)
names(MTLD_results) = c(dir('data/texts_raw/shakes/moby/'),
dir('data/texts_raw/shakes/gutenberg/'))
save(MTLD_results, file='data/shakes_mtld.RData')
# sapply(MTLD_results, function(x) x@TTR)
```
```{r MTLD_vis, echo=FALSE}
load('data/shakes_mtld.RData')
MTLD_df = lapply(MTLD_results, function(x) x@MTLD$MTLD) %>%
data.frame() %>%
gather(key=text, value=MTLD) %>%
mutate(text = str_sub(text, end=-5)) %>%
arrange(MTLD)
MTLD_df %>%
select(MTLD) %>%
heatmaply::heatmaply(Rowv=F,
Colv=F,
colors='Blues', # note that RCB scales will not work unless method='plotly'
plot_method = 'plotly',
labRow=MTLD_df$text, # note: heatmaply irregularly trims/justifies, so str_pad won't work
draw_cellnote=F,
hide_colorbar=T,
subplot_widths=.5) %>%
plotly::layout(yaxis=list(ticklen=0)) %>%
theme_plotly()
```
<br>
### Scene IV. Topic model
I'd say we're now ready for a topic model. That didn't take too much, did it?
#### Running the model and exploring the topics
We'll run one with 10 topics. As in the previous example in this document, we'll use <span class="pack">topicmodels</span> and the <span class="func">LDA</span> function. Later, we'll also compare our results with the traditional classifications of the texts. Note that this will take a while to run depending on your machine (maybe a minute or two). A faster implementation can be found with <span class="pack">text2vec</span>.
```{r topic_model, echo=1:2, eval=F}
library(topicmodels)
shakes_10 = LDA(convert(shakes_dtm, to = "topicmodels"), k = 10, control=list(seed=1234))
library(stm)
shakes_10_stm = stm(shakes_dtm, K = 10, seed=1234) # can take quanteda
# shakes_10_cor = CTM(convert(shakes_dtm, to = "topicmodels"), k = 10, control=list(cg=list(iter.max=2000)))
save(shakes_10, shakes_10_stm, file='data/shakespeare_topic_model.RData')
```
One of the first things to do is to interpret the topics, and we can start by seeing which terms are most probable for each topic.
```{r tm10_results, eval=FALSE}
get_terms(shakes_10, 20)
```
```{r tm10_results_pretty_terms, echo=FALSE}
load('data/shakespeare_topic_model.RData')
library(topicmodels)
get_terms(shakes_10, 20) %>%
DT::datatable(options=list(dom='tp')) %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
<br>
We can see there is a lot of overlap in these topics for top terms. Just looking at the top 10, *love* occurs in all of them, and *god* and *heart* are common as well, but we could have guessed this simply from how often they occur in general. Other measures can be used to assess term importance, such as those that seek to balance the term's probability of occurrence within a document against the term's *exclusivity*, or how likely a term is to occur in only one particular topic. See the <span class="pack">stm</span> package and the corresponding <span class="func">labelTopics</span> function as a way to get several alternatives. As an example, I show the results of their versions of the following[^scoredefs]:
- <span class="emph">FREX</span>: **FR**equency and **EX**clusivity; a weighted harmonic mean of a term's rank within a topic in terms of frequency and exclusivity.
- <span class="emph">lift</span>: Ratio of the term's probability within a topic to its probability of occurrence across all documents. Overly sensitive to rare words (a toy calculation follows this list).
- <span class="emph">score</span>: Another approach that gives more weight to more exclusive terms.
- <span class="emph">prob</span>: This is just the raw probability of the term within a given topic.
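As mentioned, here is a toy calculation of lift with made-up probabilities:
```{r lift_toy, eval=FALSE}
p_term_in_topic = 0.015  # P(term | topic), hypothetical
p_term_overall  = 0.002  # P(term) across all documents, hypothetical
p_term_in_topic / p_term_overall  # lift = 7.5
```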
```{r coherence, eval=FALSE, echo=FALSE}
library(stm)
docs = apply(shakes_dtm, 1, function(x) rbind((1:ncol(shakes_dtm))[x>0], x[x>0])) # bow for stemmed terms
tq = topicQuality(model=shakes_10_stm, documents=docs)
sc = semanticCoherence(model=shakes_10_stm, documents=docs, M=20)
ex = exclusivity(model=shakes_10_stm, M=20)
plot_ly(x=~sc, y=~ex, data=data_frame(sc, ex, topic=paste('Topic', 1:10))) %>%
add_text(text=~topic, color=I('#ff5500')) %>%
theme_plotly() %>%
layout(xaxis = list(title='Semantic Coherence'),
yaxis = list(title='Exclusivity'))
```
```{r label_topics_stm, eval=F, echo=FALSE}
# as stm topics won't be exactly the same, it might throw them off
# topic_labels = stm::labelTopics(shakes_10_stm, n=10)
# sapply(topic_labels[1:4], function(e) apply(e, 1, paste, collapse=', ')) %>%
# as_tibble() %>%
# mutate(Topic = 1:10) %>%
# gather(key=Score, value=Terms, -Topic) %>%
# arrange(Topic, Score) %>%
# DT::datatable(options=list(dom='t', scrollY=T, pageLength=40), rownames=F) %>%
# DT::formatStyle(
# 0, # ignores columns, but otherwise put here
# target='row',
# backgroundColor = 'transparent'
# )
```
```{r lable_topics, echo=FALSE}
library(stm)
frex_res = calcfrex(shakes_10@beta, wordcounts=colSums(as.matrix(shakes_10@wordassignments)))[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
# lift is definitely bugged, will just flag singletons
lift_res = calclift(shakes_10@beta, wordcounts=colSums(as.matrix(shakes_10@wordassignments)))[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
score_res = calcscore(shakes_10@beta)[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
prob_res = apply(shakes_10@beta, 1, order, decreasing = TRUE)[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
score_list = list(FREX = frex_res, LIFT=lift_res, SCORE=score_res, PROB=prob_res)
sapply(score_list, function(e) apply(e, 2, paste, collapse=', ')) %>%
as_tibble() %>%
mutate(Topic = 1:10) %>%
gather(key=Score, value=Terms, -Topic) %>%
arrange(Topic, Score) %>%
DT::datatable(options=list(dom='t', scrollY='400px', scrollCollapse=T, pageLength=40), rownames=F) %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
<br>
As another approach, consider the <span class="emph">saliency</span> and <span class="emph">relevance</span> of terms via the <span class="pack">LDAvis</span> package. While you can play with it here, it's probably easier to [open it separately](vis/index.html). Note that this has to be done separately from the model, and may have topic numbers in a different order.
<br>
```{r ldavis, eval=FALSE, echo=FALSE}
# may need to redo LDA just for this visual depending on what's been updated/changed
library(LDAvis)
json = createJSON(phi = exp(shakes_10@beta),
theta = shakes_10@gamma,
# doc.length = rowSums(shakes_dtm),
doc.length = length(shakes_10@documents),
vocab = shakes_10@terms,
plot.opts = list(mdswidth = 250,
mdsheight = 250,
barwidth = 250,
barheight = 250),
# term.frequency = shakes_10_stm$settings$dim$wcounts$x,
term.frequency = shakes_10@wordassignments %>% as.matrix() %>% colSums(),
R=10)
serVis(json, out.dir = 'docs/vis', open.browser = F) # have to redo if book folder is cleaned; can only adjust plot height/width directly in .js file
```
<iframe src="https://m-clark.github.io/text-analysis-with-R/vis/index.html" width="100%" height="862px" frameborder="0"
scrolling="yes"
marginheight="0"
marginwidth="0" title="LDAvis results">
<p>Your browser does not support iframes.</p>
</iframe>
<br>
Given all these measures, one can assess how well they match what topics the documents would be most associated with.
```{r tm10_results_topic_classification, eval=FALSE}
t(topics(shakes_10, 3))
```
```{r tm10_results_pretty_topic_classification, echo=FALSE}
t(topics(shakes_10, 2)) %>%
data.frame %>%
rename_all(str_replace, 'X', 'Top Topic ') %>%
DT::datatable(options=list(dom='t',
scrollY='400px',
scrollCollapse=T,
pageLength=40,
autoWidth=T,
align='center',
columnDefs=list(list(width='150px', targets=0),
list(width='100px', targets=1:2),
list(className = 'dt-center', targets = 1:2))),
width='500') %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
<br>
For example, based just on term frequency, Hamlet is most likely to be associated with Topic `r t(topicmodels::topics(shakes_10, 3))['Hamlet',1]`. That topic is affiliated with the (stemmed words) `r topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet',1]]`. The other measures pick up on words like Dane and Denmark. Sounds about right for Hamlet.
<!-- Hamlet is also one that is actually a decent mix, with its second topic expressed being Topic `r t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]`, with common terms `r topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]]`. They both have `r intersect(topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet',1]], topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]])` among their top 20 terms. -->
The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I've also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram[^clustering]. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem.
<br>
```{r viz_topics, echo=FALSE, out.height='700px', out.width='700px', fig.width=11, fig.height=8.5, fig.align='center'}
library(quanteda)
load('data/shakespeare_classification.RData')
suppressPackageStartupMessages(library(dendextend))
# see proxy::pr_DB %>% data.frame() for actual info for the metrics that
# quanteda uses, whose functions don't bother to even tell you that's where they
# are coming from
# proxy::pr_DB %>% data.frame() %>% select( distance, formula, description, reference) %>% DT::datatable()
# cosine distance, is not a proper distance
row_dend =
(1-textstat_simil(dfm_weight(shakes_dtm, 'relMaxFreq'),
margin = "documents",
method = "cosine")) %>%
as.dist() %>%
hclust(method="complete") %>%
as.dendrogram %>%
set("branches_k_color", k = 4) %>%
set("branches_lwd", c(.5,.5)) %>%
ladderize
shakes_10@gamma %>%
round(3) %>%
heatmaply::heatmaply(Rowv=row_dend,
Colv=F,
# colors='Oranges',
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
row_side_colors = data.frame(shakes_types$class),
row_side_palette = plasma,
k_row= 4,
# RowSideColors = 'Set2',
labRow=rev(labels(row_dend)),
labCol=paste0('Topic ', 1:10),
hide_colorbar=T,
grid_gap=2,
plot_method='plotly'
) %>%
layout(showlegend=F) %>% # showing the legend will screw up the colorbar and any associated options
config(displayModeBar = F) %>%
theme_plotly()
```
<br>
A couple things stand out. To begin with, most works are associated with one topic[^howmanytopics]. In terms of the discovered topics, the traditional classification probably only works well for the <span class="" style="color:#9C179E">historical</span> works, as they cluster together as expected (except for Henry VIII, possibly due to it being a collaborative work). Furthermore, <span class="" style="color:#F0F921">tragedies</span> and <span class="" style="color:#0D0887">comedies</span> might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the <span class="" style="color:#ED7953">poems</span> (love, beauty). If we take four clusters from the cluster analysis, the result boils down to *Phoenix*, *Complaint*, standard poems, a mixed bag of more romance-oriented works and the remaining poems, then everything else.
Alternatively, one could merely classify the works based on their probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization attempts to order them based on their most probable topic. The order is based on the most likely topics across all documents.
<br>
```{r cluster_topics, echo=FALSE, fig.align='center', fig.width=11, fig.height=8, out.height='800px', out.width='650px'}
topic_class = shakes_10@gamma %>%
round(2) %>%
data.frame() %>%
rename_all(function(x) str_replace(x, 'X', 'Topic ')) %>%
mutate(text = shakes_10@documents,
class = shakes_types$class)
order_topics = order(colSums(shakes_10@gamma), decreasing=T)
# order_topics = order(apply(shakes_10@gamma, 2, mean), decreasing=T)
topic_class = topic_class %>%
select(order_topics, text, class) %>%
arrange_at(vars(contains('Topic')), desc)
topic_class %>%
select(-text, -class) %>%
heatmaply::heatmaply(Rowv=NA,
Colv=NA,
labRow=rev(topic_class$text),
labCol=apply(get_terms(shakes_10, 10), 2, paste0, collapse='\n')[order_topics],
column_text_angle=0,
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
# subplot_widths=c(1),
plot_method = 'plotly',
fontsize_row=8,
fontsize_col=8,
hide_colorbar = T) %>%
layout(showlegend=F) %>%
config(displayModeBar = F) %>%
theme_plotly()
```
The following shows the average topic probability for each of the traditional classes. Topics are represented by their first five most probable terms.
```{r avg_topic_probs_per_class, echo=FALSE, fig.width=6, fig.height=3, out.height='600px', out.width='650px'}
class_probs = shakes_10@gamma %>%
data.frame() %>%
round(3) %>%
rename_all(function(x) stringr::str_replace_all(x, 'X', 'Topic ')) %>%
bind_cols(doc=shakes_10@documents, arrange(shakes_types, title), .) %>%
select(-title) %>%
gather(key=topic, value=prob, -doc, -class, -problem, -late_romance) %>%
mutate(topic = forcats::fct_reorder(topic, rep(1:10, e=n_distinct(doc)))) %>%
group_by(class, topic) %>%
summarise(mean_prob = mean(prob)) %>%
arrange(class, topic) %>%
spread(topic, mean_prob) %>%
ungroup
# class_probs %>%
# select(-class) %>%
# d3heatmap::d3heatmap(Rowv=F,
# Colv=F,
# labRow=class_probs$class,
# labCol=apply(get_terms(shakes_10, 5), 2, paste0, collapse=' '),
# # colors=viridis::inferno(500),
# colors='Oranges',
# k_row = 10,
# # row_side_colors = select(arrange(shakes_types, title), class),
# xaxis_font_size = 8,
# yaxis_font_size=12,
# show_grid = F,
# width=800,
# xaxis_height = 150
# )
class_probs %>%
select(-class) %>%
heatmaply::heatmaply(Rowv=F,
Colv=F,
labRow=rev(class_probs$class),
labCol=apply(get_terms(shakes_10, 10), 2, paste0, collapse='\n'),
# colors=viridis::inferno(500),
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
plot_method = 'plotly',
xaxis_font_size = 8,
yaxis_font_size=12,
# xaxis_height = 250,
# subplot_widths = .75,
hide_colorbar = T,
# colorbar_len=.5,
column_text_angle=0
) %>%
config(displayModeBar = F) %>%
theme_plotly()
```
Aside from the poems, the classes are a good mix of topics, and appear to have some overlap. Tragedies are perhaps most diverse.
#### Summary of Topic Models
This is where the summary would go, but I grow weary...
<br>
<br>
<br>
<div style="text-align:center; font-size:500%">**FIN**</div>
```{r ggplay, echo=F, eval=F}
shakes_10@gamma %>%
data.frame() %>%
round(3) %>%
rename_all(function(x) stringr::str_replace_all(x, 'X', 'Topic_')) %>%