# Shakespeare Start to Finish
The following attempts to demonstrate the usual difficulties one encounters dealing with text by procuring and processing the works of Shakespeare. The source is [MIT](http://shakespeare.mit.edu/), which has made the 'complete' works available on the web since 1993; one additional work comes from Project Gutenberg. The initial issue is simply getting the works from the web. Subsequently there is metadata, character names, stopwords, etc. to be removed. At that point, we can stem and count the words in each work, which, once complete, leaves us ready for analysis.
The primary packages used are <span class="pack">tidytext</span>, <span class="pack">stringr</span>, and, when things are ready for analysis, <span class="pack">quanteda</span>.
## ACT I. Scrape MIT and Gutenberg Shakespeare
### Scene I. Scrape main works
Initially we must scrape the web to get the documents we need. The <span class="pack">rvest</span> package will be used as follows.
- Start with the url of the site
- Get the links off that page to serve as base urls for the works
- Scrape the document for each url
- Deal with the collection of Sonnets separately
- Write out results
```{r shakes_urls, eval=FALSE}
library(rvest); library(tidyverse); library(stringr)
page0 = read_html('http://shakespeare.mit.edu/')
works_urls0 = page0 %>%
html_nodes('a') %>%
html_attr('href')
main = works_urls0 %>%
grep(pattern='index', value=T) %>%
str_replace_all(pattern='index', replacement='full')
other = works_urls0[!grepl(works_urls0, pattern='index|edu|org|news')]
works_urls = c(main, other)
works_urls[1:3]
```
Now we just paste the main site url to the work urls and download them. Here is where we come across our first snag. The <span class="func">html_text</span> function has what I would call a bug but what the author feels is a feature. [Basically, it ignores line breaks of the form `<br>` in certain situations](https://github.com/hadley/rvest/issues/175). This means it will smash text together that shouldn't be, thereby making *any* analysis of it fairly useless[^bug]. Luckily, [\@rentrop](https://github.com/rentrop) provided a solution, which is in `r/fix_read_html.R`.
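For reference, here is a minimal sketch of the idea behind that fix (the actual function in `r/fix_read_html.R` is more careful about classes and options): replace each `<br>` node with a newline before extracting the text.
```{r html_collapse_sketch, eval=FALSE}
# sketch only; the sourced fix_read_html.R handles more cases
html_text_collapse_sketch = function(doc) {
  doc    = xml2::read_html(as.character(doc))  # re-parse so the original isn't modified in place
  breaks = xml2::xml_find_all(doc, './/br')
  xml2::xml_add_sibling(breaks, 'p', '\n')     # insert a newline marker next to each <br>
  xml2::xml_remove(breaks)                     # then drop the <br> itself
  rvest::html_text(doc)
}
```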
```{r scrape_shakes, eval=FALSE}
works0 = lapply(works_urls, function(x) read_html(paste0('http://shakespeare.mit.edu/', x)))
source('r/fix_read_html.R')
html_text_collapse(works0[[1]]) #works
works = lapply(works0, html_text_collapse)
names(works) = c("All's Well That Ends Well", "As You Like It", "Comedy of Errors"
"Cymbeline", "Love's Labour's Lost", "Measure for Measure"
"The Merry Wives of Windsor", "The Merchant of Venice", "A Midsummer Night's Dream"
"Much Ado about Nothing", "Pericles Prince of Tyre", "The Taming of the Shrew"
"The Tempest", "Troilus and Cressida", "Twelfth Night"
"The Two Gentlemen of Verona", "The Winter's Tale", "King Henry IV Part 1"
"King Henry IV Part 2", "Henry V", "Henry VI Part 1"
"Henry VI Part 2", "Henry VI Part 3", "Henry VIII"
"King John", "Richard II", "Richard III"
"Antony and Cleopatra", "Coriolanus", "Hamlet"
"Julius Caesar", "King Lear", "Macbeth"
"Othello", "Romeo and Juliet", "Timon of Athens"
"Titus Andronicus", "Sonnets", "A Lover's Complaint"
"The Rape of Lucrece", "Venus and Adonis", "Elegy")
```
### Scene II. Sonnets
We now hit a slight nuisance with the Sonnets. The Sonnets have a bit of a different structure than the plays. All links are in a single page, with a different form for the url, and each sonnet has its own page.
```{r scrape_sonnets, eval=F}
sonnet_urls = paste0('http://shakespeare.mit.edu/', grep(works_urls0, pattern='sonnet', value=T)) %>%
read_html() %>%
html_nodes('a') %>%
html_attr('href')
sonnet_urls = grep(sonnet_urls, pattern = 'sonnet', value=T) # remove amazon link
# read the texts
sonnet0 = purrr::map(sonnet_urls, function(x) read_html(paste0('http://shakespeare.mit.edu/Poetry/', x)))
# collapse to one 'Sonnets' work
sonnet = sapply(sonnet0, html_text_collapse)
works$Sonnets = sonnet
```
### Scene III. Save and write out
Now we can save our results so we won't have to repeat any of the previous scraping. We want to save the main text object as an RData file, and write out the texts to their own file. When dealing with text, you'll regularly want to save stages so you can avoid repeating what you don't have to, as often you will need to go back after discovering new issues further down the line.
```{r initial_save, eval=F, echo=1}
save(works, file='data/texts_raw/shakes/moby_from_web.RData')
# This will spit the text to the console unfortunately
purrr::map2(works,
paste0('data/texts_raw/shakes/moby/', str_replace_all(names(works), " |'", '_'), '.txt'),
function(x, nam) write_lines(x, path=nam))
```
### Scene IV. Read text from files
Once the above is done, it doesn't need to be repeated, and we can always read back in what we need. I'll start with the raw text as files, as that is one of the more common ways one deals with documents. When text is nice and clean, this can be fairly straightforward.
The function at the end comes from the <span class="pack">tidyr</span> package. Up to that line, each element in the <span class="objclass">text</span> column is the entire text, so the column itself is a 'list-column'. In other words, we have a 42 x 2 tibble. But to do what we need, we'll want access to each line, and the <span class="func">unnest</span> function unpacks the lines within each title. The first few lines of the result are shown after.
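First, though, a tiny made-up example of a list-column being unpacked by <span class="func">unnest</span>:
```{r unnest_listcol_toy, eval=FALSE}
library(tidyverse)
d = tibble(id   = c('a', 'b'),
           text = list(c('line 1', 'line 2'), 'only line'))
d                # 2 rows; text is a list-column
unnest(d, text)  # 3 rows; one row per line
```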
```{r read_shakes_works, echo=-5, eval=FALSE}
library(tidyverse); library(stringr)
shakes0 =
data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>%
mutate(text = map(file, read_lines)) %>%
transmute(id = basename(file), text) %>%
unnest(text)
save(shakes0, file='data/initial_shakes_dt.RData')
# Alternate that provides for more options
# library(readtext)
# shakes0 =
# data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>%
# mutate(text = map(file, readtext, encoding='UTF8')) %>%
# unnest(text)
```
```{r show_shakes0, echo=F}
load('data/initial_shakes_dt.RData')
DT::datatable(shakes0[1:20,],
options=list(dom='tp')) %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
### Scene V. Add additional works
It is typical to be gathering texts from multiple sources. In this case, we'll get *The Phoenix and the Turtle* from the Project Gutenberg website. There is an R package that allows us to work directly with the site, making the process straightforward[^r4everything]. I also considered two other works, but passed on *The Two Noble Kinsmen* because, like many of the other Shakespeare versions on Gutenberg, it's basically written in a different language, and on *The Passionate Pilgrim* because it's mostly not Shakespeare.
When first doing this project, I actually started with Gutenberg, but it became a notable PITA. The texts were inconsistent in source, and sometimes purposely reproduced printing errors, which would have compounded the typical problems. I thought this could be solved by using the *Complete Works of Shakespeare*, but that download only came with the single title, meaning one would have to hunt for and delineate each separate work. This might not have been too big of an issue, except that there is no table of contents, nor consistent naming of titles across different printings. The MIT approach, on the other hand, was a few lines of code. This represents a common issue in text analysis when dealing with sources: a different option may save a lot of time in the end.
The following code could be more succinct to deal with one text, but I initially was dealing with multiple works, so I've left it in that mode. In the end, we'll have a <span class="objclass">tibble</span> with an id column for the file/work name, and another column that contains the lines of text.
```{r phoenix_turtle, eval=FALSE}
library(gutenbergr)
works_not_included = c("The Phoenix and the Turtle") # add others if desired
gute0 = gutenberg_works(title %in% works_not_included)
gute = lapply(gute0$gutenberg_id, gutenberg_download)
gute = mapply(function(x, y) mutate(x, id=y) %>% select(-gutenberg_id),
x=gute,
y=works_not_included,
SIMPLIFY=F)
shakes = shakes0 %>%
bind_rows(gute) %>%
mutate(id = str_replace_all(id, " |'", '_')) %>%
mutate(id = str_replace(id, '.txt', '')) %>%
arrange(id)
# shakes %>% split(.$id) # inspect
save(shakes, file='data/texts_raw/shakes/shakes_df.RData')
```
## ACT II. Preliminary Cleaning
If you think we're even remotely getting close to being ready for analysis, I say Ha! to you. Our journey has only just begun (cue the Carpenters).
Now we can start thinking about prepping the data for eventual analysis. One of the nice things about having the data in a tidy format is that we can use string functionality over the column of text in a simple fashion.
### Scene I. Remove initial text/metadata
First on our to-do list is to get rid of all the preliminary text of titles, authorship, and similar. This is fairly straightforward once you realize the text we want will be associated with something like `ACT I`, or in the case of the Sonnets, the word `Sonnet`. So the idea is to drop all text up to those points. I've created a [function](https://github.com/m-clark/text-analysis-with-R/blob/master/r/detect_first_act.R) that will do that, and then just apply it to each work's tibble[^tibblefail]. For the poems and *A Funeral Elegy for Master William Peter*, we look instead for the line where his name or initials start the line.
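As a rough sketch of the idea (the linked `detect_first_act.R` handles the poems and other special cases more carefully):
```{r detect_first_act_sketch, eval=FALSE}
# sketch only: find the first line matching ACT I or Sonnet, drop everything before it
detect_first_act_sketch = function(work) {
  start = which(stringr::str_detect(work$text, '^ACT I\\b|^Sonnet'))[1]
  if (is.na(start)) start = 1   # keep everything if no match is found
  work[start:nrow(work), ]
}
```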
```{r remove_preliminary_text, echo=-(1:2)}
load('data/texts_raw/shakes/shakes_df.RData')
source('r/detect_first_act.R')
shakes_trim = shakes %>%
split(.$id) %>%
lapply(detect_first_act) %>%
bind_rows
shakes %>% filter(id=='Romeo_and_Juliet') %>% head
shakes_trim %>% filter(id=='Romeo_and_Juliet') %>% head
```
### Scene II. Miscellaneous removal
Next, we'll want to remove empty rows, any remaining titles, lines that denote the act or scene, and other stuff. I'm going to remove the words *prologue* and *epilogue* as stopwords later. While some texts have a line that says only that (`PROLOGUE`), others have text that describes the scene (`Prologue. Blah blah`), which I've decided to keep. As such, we just need the word itself gone.
```{r remove_misc}
titles = c("A Lover's Complaint", "All's Well That Ends Well", "As You Like It", "The Comedy of Errors",
"Cymbeline", "Love's Labour's Lost", "Measure for Measure",
"The Merry Wives of Windsor", "The Merchant of Venice", "A Midsummer Night's Dream",
"Much Ado about Nothing", "Pericles Prince of Tyre", "The Taming of the Shrew",
"The Tempest", "Troilus and Cressida", "Twelfth Night",
"The Two Gentlemen of Verona", "The Winter's Tale", "King Henry IV, Part 1",
"King Henry IV, Part 2", "Henry V", "Henry VI, Part 1",
"Henry VI, Part 2", "Henry VI, Part 3", "Henry VIII",
"King John", "Richard II", "Richard III",
"Antony and Cleopatra", "Coriolanus", "Hamlet",
"Julius Caesar", "King Lear", "Macbeth",
"Othello", "Romeo and Juliet", "Timon of Athens",
"Titus Andronicus", "Sonnets",
"The Rape of Lucrece", "Venus and Adonis", "A Funeral Elegy", "The Phoenix and the Turtle")
shakes_trim = shakes_trim %>%
filter(text != '', # remove empties
!text %in% titles, # remove titles
!str_detect(text, '^ACT|^SCENE|^Enter|^Exit|^Exeunt|^Sonnet') # remove acts etc.
)
shakes_trim %>% filter(id=='Romeo_and_Juliet') # we'll get prologue later
```
### Scene III. Classification of works
While we're at it, we can save the classical (sometimes arbitrary) classifications of Shakespeare's works for later comparison to what we'll get in our analyses. We'll save them to call as needed.
```{r shakespeare_classes, eval=FALSE}
shakes_types = data_frame(title=unique(shakes_trim$id)) %>%
mutate(class = 'Comedy',
class = if_else(str_detect(title, pattern='Adonis|Lucrece|Complaint|Turtle|Pilgrim|Sonnet|Elegy'), 'Poem', class),
class = if_else(str_detect(title, pattern='Henry|Richard|John'), 'History', class),
class = if_else(str_detect(title, pattern='Troilus|Coriolanus|Titus|Romeo|Timon|Julius|Macbeth|Hamlet|Othello|Antony|Cymbeline|Lear'), 'Tragedy', class),
problem = if_else(str_detect(title, pattern='Measure|Merchant|^All|Troilus|Timon|Passion'), 'Problem', 'Not'),
late_romance = if_else(str_detect(title, pattern='Cymbeline|Kinsmen|Pericles|Winter|Tempest'), 'Late', 'Other'))
save(shakes_types, file='data/shakespeare_classification.RData') # save for later
```
## ACT III. Stop words
As we've noted before, we'll want to get rid of stop words, things like articles, possessive pronouns, and other very common words. In this case, we also want to include character names. However, the big wrinkle here is that this is not English as currently spoken, so we need to remove 'ye', 'thee', 'thine' etc. In addition, there are things that need to be replaced, like o'er to over, which may then also be removed. In short, this is not so straightforward.
### Scene I. Character names
We'll get the list of character names from [opensourceshakespeare.org](http://opensourceshakespeare.org/) via <span class="pack">rvest</span>, but I added some from the poems and others that still came through the processing one way or another, e.g. abbreviated names.
```{r character_names, eval=FALSE, echo=-1}
library(rvest)
shakes_char_url = 'https://www.opensourceshakespeare.org/views/plays/characters/chardisplay.php'
page0 = read_html(shakes_char_url)
tabs = page0 %>% html_table()
shakes_char = tabs[[2]][-(1:2), c(1,3,5)] # remove header and phantom columns
colnames(shakes_char) = c('Nspeeches', 'Character', 'Play')
shakes_char = shakes_char %>%
distinct(Character,.keep_all=T)
save(shakes_char, file='data/shakespeare_characters.RData')
```
A new snag is that characters with multiple names may be represented (typically) by the first or last name, or in the case of three names, the middle one, e.g. Sir Toby Belch. Others have more unwieldy names, e.g. RICHARD PLANTAGENET (DUKE OF GLOUCESTER). The following should capture everything by splitting the names on spaces, removing parentheses, and keeping unique terms.
```{r character_names_clean, echo=-1}
load('data/shakespeare_characters.RData')
# remove paren and split
chars = shakes_char$Character
chars = str_replace_all(chars, '\\(|\\)', '')
chars = str_split(chars, ' ') %>%
unlist
# these were found after initial processing
chars_other = c('enobarbus', 'marcius', 'katharina', 'clarence','pyramus',
'andrew', 'arcite', 'perithous', 'hippolita', 'schoolmaster',
'cressid', 'diomed', 'kate', 'titinius', 'Palamon', 'Tarquin',
'lucrece', 'isidore', 'tom', 'thisbe', 'paul',
'aemelia', 'sycorax', 'montague', 'capulet', 'collatinus')
chars = unique(c(chars, chars_other))
chars = chars[chars != '']
sample(chars)[1:3]
```
### Scene II. Old, Middle, & Modern English
While Shakespeare is considered [Early Modern English](https://en.wikipedia.org/wiki/Early_Modern_English), some text may be more historical, so I include Middle and Old English stopwords, as they were readily available from the <span class="pack">cltk</span> Python module ([link](https://github.com/cltk/cltk)). I also added some things to the modern English list, like "thou'ldst", that I found lingering after initial passes. I first started using the works from Gutenberg, and there the Old English might have had some utility; as the texts there were inconsistently translated and otherwise problematic, I abandoned using them. Here, the Old English vocabulary applied to these texts only removes 'wit', so I refrain from using it.
```{r old_middle_modern_english}
# old and me from python cltk module;
# em from http://earlymodernconversions.com/wp-content/uploads/2013/12/stopwords.txt;
# I also added some to me
old_stops0 = read_lines('data/old_english_stop_words.txt')
# sort(old_stops0)
old_stops = data_frame(word=str_conv(old_stops0, 'UTF8'),
lexicon = 'cltk')
me_stops0 = read_lines('data/middle_english_stop_words')
# sort(me_stops0)
me_stops = data_frame(word=str_conv(me_stops0, 'UTF8'),
lexicon = 'cltk')
em_stops0 = read_lines('data/early_modern_english_stop_words.txt')
# sort(em_stops0)
em_stops = data_frame(word=str_conv(em_stops0, 'UTF8'),
lexicon = 'emc')
```
### Scene III. Remove stopwords
We're now ready to start removing words. However, right now we have lines, not words. We can use the <span class="pack">tidytext</span> function <span class="func">unnest_tokens</span>, which is like <span class="func">unnest</span> from <span class="pack">tidyr</span>, but works on different tokens, e.g. words, sentences, or paragraphs. Note that by default, the function will make all words lower case to make matching more efficient.
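For a quick sense of what <span class="func">unnest_tokens</span> does, here it is on a couple of made-up lines:
```{r unnest_tokens_toy, eval=FALSE}
library(tidyverse); library(tidytext)
tibble(id   = 'demo',
       text = c('But soft, what light', 'through yonder window breaks')) %>%
  unnest_tokens(word, text)   # one row per lower-cased word, punctuation dropped
```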
```{r unnest_words}
library(tidytext)
shakes_words = shakes_trim %>%
unnest_tokens(word, text, token='words')
save(shakes_words, file='data/shakes_words_df_4text2vec.RData')
```
We also will be doing a little stemming here. I'm getting rid of the suffixes that follow an apostrophe (e.g. 'd, 'st). Many of the remaining words will either be stopwords or need to be further stemmed later. I also created a middle/modern English stemmer for words that are not caught otherwise (<span class="func">me_st_stem</span>). Again, this is the sort of thing you discover after initial passes (e.g. 'criedst'). After that, we can use <span class="func">anti_join</span> to remove the stopwords.
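To see what that apostrophe-suffix pattern does, here it is applied to a few sample tokens:
```{r suffix_regex_demo, eval=FALSE}
library(stringr)
str_replace(c("look'd", "know'st", "think'st"), "'er$|'d$|'t$|'ld$|'rt$|'st$|'dst$", '')
# "look"  "know"  "think"
```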
```{r anti_join}
source('r/st_stem.R')
shakes_words = shakes_words %>%
mutate(word = str_trim(word), # remove possible whitespace
word = str_replace(word, "'er$|'d$|'t$|'ld$|'rt$|'st$|'dst$", ''), # remove me style endings
word = str_replace_all(word, "[0-9]", ''), # remove sonnet numbers
word = vapply(word, me_st_stem, 'a')) %>%
anti_join(em_stops) %>%
anti_join(me_stops) %>%
anti_join(data_frame(word=str_to_lower(c(chars, 'prologue', 'epilogue')))) %>%
anti_join(data_frame(word=str_to_lower(paste0(chars, "'s")))) %>% # remove possessive names
anti_join(stop_words)
```
As before, you should do a couple spot checks.
```{r stopword_check, results='hold'}
any(shakes_words$word == 'romeo')
any(shakes_words$word == 'prologue')
any(shakes_words$word == 'mayst')
```
## ACT IV. Other fixes
Now we're ready to finally do the word counts. Just kidding! There is *still* work to do for the remainder, and you'll continue to spot things after runs. One remaining issue is the words that end in 'st' and 'est', and others that are not consistently spelled or otherwise need to be dealt with. For example, 'crost' will not be stemmed to 'cross', whereas 'crossed' would be. Finally, I limit the result to words with more than two characters, as my inspection suggested the shorter ones are left-over suffixes or would be considered stopwords anyway.
```{r other_fixes}
# porter should catch remaining 'est'
add_a = c('mongst', 'gainst') # words to add a to
shakes_words = shakes_words %>%
mutate(word = if_else(word=='honour', 'honor', word),
word = if_else(word=='durst', 'dare', word),
word = if_else(word=='wast', 'was', word),
word = if_else(word=='dost', 'does', word),
word = if_else(word=='curst', 'cursed', word),
word = if_else(word=='blest', 'blessed', word),
word = if_else(word=='crost', 'crossed', word),
word = if_else(word=='accurst', 'accursed', word),
word = if_else(word %in% add_a,
paste0('a', word),
word),
word = str_replace(word, "'s$", ''), # strip remaining possessives
word = if_else(str_detect(word, pattern="o'er"), # change o'er over
str_replace(word, "'", 'v'),
word)) %>%
filter(!(id=='Antony_and_Cleopatra' & word == 'mark')) %>% # mark here is almost exclusively the character name
filter(str_count(word)>2)
```
At this point we could still maybe add things to this list of additional fixes, but I think it's time to actually start playing with the data.
## ACT V. Fun stuff
We are finally ready to get to the fun stuff. Finally! And now things get easy.
### Scene I. Count the terms
We can get term counts with standard <span class="pack">dplyr</span> approaches, and packages like <span class="pack">tidytext</span> will take that and also do some other things we might want. Specifically, we can use the latter to create the document-term matrix (DTM) that will be used in other analyses. The function <span class="func">cast_dfm</span> will create a <span class="objclass">dfm</span>, or 'document-feature matrix', class object (from <span class="pack">quanteda</span>), which is the same thing but acknowledges that this sort of approach is not specific to words. With word counts in hand, it would be good to save at this point, since they'll serve as the basis for other processing.
```{r term_counts, results='hold', eval=-8}
term_counts = shakes_words %>%
group_by(id, word) %>%
count
term_counts %>%
arrange(desc(n))
library(quanteda)
shakes_dtm = term_counts %>%
cast_dfm(document=id, term=word, value=n)
save(shakes_words, term_counts, shakes_dtm, file='data/shakes_words_df.RData')
```
Now things are looking like Shakespeare, with love for everyone[^love]. You'll notice I've kept place names such as Rome, but this might be something you'd prefer to remove. Other candidates would be madam, woman, man, majesty (as in 'his/her') etc. This sort of thing is up to the researcher.
### Scene II. Stemming
Now we'll <span class="emph">stem</span> the words. This is actually more of a pre-processing step, one that we'd do along with (and typically after) stopword removal. I do it here to mostly demonstrate how to use <span class="pack">quanteda</span> to do it, as it can also be used to remove stopwords and do many of the other things we did with <span class="pack">tidytext</span>.
Stemming will make words like eye and eyes just *ey*, or convert war, wars, and warring to *war*. In other words, it reduces variations of a word to a common root form, or 'word stem'. We could have done this in a step prior to counting the terms, but then you would only have the stemmed result to work with for the document-term matrix from then on. Depending on your situation, you may or may not want to stem, or maybe you'd want to compare results. The <span class="pack">quanteda</span> package will actually stem the DTM itself (i.e. work on the column names) and collapse the word counts accordingly. I note the difference in the number of words before and after stemming.
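As a quick illustration of Porter stemming on a few raw tokens (the chunk below does the same thing to the whole DTM at once):
```{r stem_demo, eval=FALSE}
quanteda::char_wordstem(c('war', 'wars', 'warring'))  # all reduce to "war"
```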
```{r stem, echo=-(1:4), results='hold'}
load('data/shakes_words_df.RData')
library(quanteda)
shakes_dtm
ncol(shakes_dtm)
shakes_dtm = shakes_dtm %>%
dfm_wordstem()
shakes_dtm
ncol(shakes_dtm)
```
The result is notably fewer columns, which will speed up any analysis, as well as produce a slightly more dense matrix.
### Scene III. Exploration
```{r top_20_init, echo=F}
top20 = topfeatures(shakes_dtm, 20)
```
#### Top features
Let's start looking at the data more intently. The following shows the 10 most common words and their respective counts. This is also an easy way to find candidates to add to the stopword list. Note that dai and prai are stems for day and pray. Love occurs `r round(top20[1]/top20[2], 2)` times as often as the next most frequent word!
```{r top_10}
top10 = topfeatures(shakes_dtm, 10)
top10
```
The following is a word cloud. They are among the most useless visual displays imaginable. Just because you can, doesn't mean you should.
```{r wordcloud, echo=FALSE}
# useless!
textplot_wordcloud(shakes_dtm, min.freq = 400, random.order = T,
rot.per = .25,
colors = viridis::viridis(30))
```
If you want to display relative frequency, there are clearer ways to do so, as in the following.
```{r better_cloud, echo=FALSE}
library(forcats)
# data_frame(term = names(top20), freq = top20) %>%
# mutate(
# term = fct_reorder(term, freq, .desc = TRUE),
# percentage = freq/ncol(shakes_dtm)
# ) %>%
# plot_ly(x=~term, y=~percentage) %>% # , size=~percentage
# add_text(text=~term,
# color=I('#ff5500'),
# size =~ percentage,
# sizes = c(15,15), # doesn't work for text, only markers
# showlegend = F) %>%
# theme_plotly() %>%
# layout(xaxis = list(
# autotick = FALSE,
# ticks = F,
# showticklabels = F),
# yaxis = list(range=c(0,.25)))
g = data_frame(term = names(top20), freq = top20) %>%
mutate(
term = fct_reorder(term, freq, .desc = TRUE),
percentage = freq/ncol(shakes_dtm)
) %>%
ggplot() +
geom_text(aes(x=term,
y=percentage,
size=percentage,
label=term,
angle=45, # ignored by plotly
color=term),
# color='#ff5500',
hjust=-.1,
show.legend = F) +
scale_size_continuous(range=c(2,8)) +
ylim(c(0,.25)) +
lazerhawk::theme_trueMinimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x= element_blank(),
legend.position='none') # for plotly's sake
g
# ggplotly(width='auto', hoverinfo='none') %>%
# lazerhawk::theme_plotly()
```
#### Similarity
The <span class="pack">quanteda</span> package has some built-in similarity measures such as [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), which you can think of much like the standard correlation (itself also available as an option). I display it visually to better get a sense of things.
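For intuition, cosine similarity for two toy count vectors can be computed by hand as follows; <span class="func">textstat_simil</span> does this for every pair of documents.
```{r cosine_by_hand, eval=FALSE}
a = c(2, 0, 1, 3)   # word counts for document A
b = c(1, 1, 0, 2)   # word counts for document B
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # ~ 0.87
```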
```{r similarity, echo=1, eval=3}
textstat_simil(shakes_dtm, margin = "documents", method = "cosine")
textstat_simil(shakes_dtm, margin = "documents", method = "cosine") %>%
as.matrix() %>%
heatR::corrheat(show_grid = F, height = 500)
textstat_simil(dfm_weight(shakes_dtm, 'relFreq'), margin = "documents", method = "correlation") %>%
as.matrix() %>%
heatR::corrheat()
```
<br>
We can already begin to see the clusters of documents. For example, the more historical works are the clump in the upper left. The oddball is [*The Phoenix and the Turtle*](https://en.wikipedia.org/wiki/The_Phoenix_and_the_Turtle), though *A Lover's Complaint* and the *Elegy* are also less similar to the standard Shakespeare fare. *The Phoenix and the Turtle* is about the death of ideal love, represented by the Phoenix and Turtledove, for which there is a funeral, and it is considered by scholars to be in stark contrast to his other output. The [Elegy](https://en.wikipedia.org/wiki/Shakespeare_apocrypha#A_Funeral_Elegy) itself is actually written for a funeral, but probably not by Shakespeare. [*A Lover's Complaint*](https://en.wikipedia.org/wiki/A_Lover%27s_Complaint) is thought by some critics to be an inferior work by the Bard, and maybe not even authored by him, so perhaps what we're seeing is a reflection of that lack of quality. In general, we're seeing things that we might expect.
#### Readability
We can examine readability scores for the texts, but for this we'll need them in raw form. We already had them from before; I just added *Phoenix* from the Gutenberg download.
```{r readability1, echo=-1}
raw_texts =
data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>%
mutate(text = map(file, read_lines)) %>%
bind_rows(data_frame(file = dir('data/texts_raw/shakes/gutenberg/', full.names = TRUE)) %>%
mutate(text = map(file, read_lines))) %>%
mutate(file = str_extract(file, '[^/]*$')) %>%
rename(id=file)
raw_texts
```
With raw texts, we need to convert them to a <span class="objclass">corpus</span> object to proceed more easily. The <span class="func">corpus</span> function from <span class="pack">quanteda</span> won't read directly from a list column, or a list at all, so we'll convert it via the <span class="pack">tm</span> package. This more or less defeats the purpose of using <span class="pack">quanteda</span>, except that its <span class="func">textstat_readability</span> function gives us what we want. But I digress.
Unfortunately, the concept of <span class="emph">readability</span> is ill-defined, and as such, there are dozens of measures available dating back nearly 75 years. The following is based on the Coleman-Liau grade score (higher grade = more difficult). The conclusion here is, first, that Shakespeare isn't exactly a difficult read, and second, that the poems may be relatively more difficult than the other works.
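For reference, the Coleman-Liau grade depends only on letter and sentence counts per 100 words. A by-hand version with made-up counts (the chunk after uses <span class="pack">quanteda</span>'s implementation):
```{r coleman_liau_by_hand, eval=FALSE}
# standard Coleman-Liau formula; L = letters per 100 words, S = sentences per 100 words
coleman_liau = function(letters, words, sentences) {
  L = letters / words * 100
  S = sentences / words * 100
  0.0588*L - 0.296*S - 15.8
}
coleman_liau(letters = 4400, words = 1000, sentences = 50)  # ~ 8.6 grade level
```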
```{r readability2, echo=1:3}
library(tm)
raw_text_corpus = corpus(VCorpus(VectorSource(raw_texts$text)))
shakes_read = textstat_readability(raw_text_corpus)
shakes_read_ordered = shakes_read %>%
mutate(text = str_sub(raw_texts$id, end=-5)) %>%
arrange(Coleman.Liau.grade)
shakes_read_ordered %>%
select(Coleman.Liau.grade) %>%
round(1) %>%
heatmaply::heatmaply(Rowv=NA,
Colv=NA,
labRow=shakes_read_ordered$text,
colors='Blues',
plot_method='plotly',
colorbar_len=.5,
colorbar_ypos= .25,
colorbar_xpos= .65,
subplot_widths=.25
) %>%
plotly::layout(yaxis=list(ticklen=0)) %>%
theme_plotly()
# shakes_read %>%
# mutate(id=raw_texts$id) %>%
# gather(key=index, value=score, -id) %>%
# group_by(index) %>%
# mutate(s_score=scales::rescale(score)) %>%
# ungroup %>%
# select(-score) %>%
# spread(key = index, value=s_score) %>%
# heatmaply::heatmaply(Rowv=NA, Colv=NA)
```
#### Lexical diversity
There are also metrics of <span class="emph">lexical diversity</span>. As with readability, there is no one way to measure 'diversity'. Here we'll go back to using the standard DTM, as the focus is on the terms, whereas readability is more at the sentence level. Most standard measures of lexical diversity are variants on what is called the type-token ratio, which in our setting is the number of unique terms (types) relative to the total terms (tokens). We can use <span class="func">textstat_lexdiv</span> for our purposes here, which will provide several measures of diversity by default.
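The basic type-token ratio itself is simple enough to compute directly, e.g. for a toy set of tokens:
```{r ttr_by_hand, eval=FALSE}
tokens = c('love', 'love', 'heart', 'eye', 'love')
length(unique(tokens)) / length(tokens)  # 3 types / 5 tokens = 0.6
```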
```{r lexical_diversity}
ld = textstat_lexdiv(shakes_dtm)
```
This visual is based on the (absolute) scaled values of those several metrics, and might suggest that the poems are relatively more diverse. This certainly might be the case for *Phoenix*, but it could also reflect a limitation of several of the measures: longer works are seen as less diverse, since tokens accumulate faster than new types the longer a text goes.
```{r lexical_diversity_vis, echo=FALSE}
scaled_ld = scale(ld) %>%
abs() %>%
data.frame() %>%
mutate(text = rownames(ld)) %>%
select(text, everything()) %>%
arrange(rowMeans(.[,-1]))
scaled_ld %>%
select(-text) %>%
heatmaply::heatmaply(Rowv=F,
Colv=F,
colors='Blues', # note that RCB scales will not work unless method='plotly'
plot_method = 'plotly',
labRow=scaled_ld$text, # note: heatmaply irregularly trims/justifies, so str_pad won't work
draw_cellnote=F,
hide_colorbar=T,
subplot_widths=.5) %>%
plotly::layout(yaxis=list(ticklen=0)) %>%
theme_plotly()
```
<br>
As a comparison, the following shows the results of the 'Measure of Textual Diversity' calculated using the <span class="pack">koRpus</span> package[^noshowmtld]. It is notably less affected by text length, though the conclusions are largely the same. There is notable correlation between the MTLD and readability as well[^correadmtld]. In general, Shakespeare tends to be more expressive in poems, and less so with comedies.
```{r MTLD, echo=FALSE, eval=FALSE}
fnames = c(dir('data/texts_raw/shakes/moby/', full.names=T),
dir('data/texts_raw/shakes/gutenberg/', full.names=T))
library(koRpus)
library(parallel)
cl = makeCluster(7)
clusterEvalQ(cl, library(koRpus))
clusterExport(cl, 'fnames')
# this actually just wraps 'lex.div'
MTLD_results = parLapply(cl, fnames,
function(x) MTLD(txt=kRp.text.analysis(x, force.lang='en', tagger='tokenize')))
stopCluster(cl)
names(MTLD_results) = c(dir('data/texts_raw/shakes/moby/'),
dir('data/texts_raw/shakes/gutenberg/'))
save(MTLD_results, file='data/shakes_mtld.RData')
# sapply(MTLD_results, function(x) x@TTR)
```
```{r MTLD_vis, echo=FALSE}
load('data/shakes_mtld.RData')
MTLD_df = lapply(MTLD_results, function(x) x@MTLD$MTLD) %>%
data.frame() %>%
gather(key=text, value=MTLD) %>%
mutate(text = str_sub(text, end=-5)) %>%
arrange(MTLD)
MTLD_df %>%
select(MTLD) %>%
heatmaply::heatmaply(Rowv=F,
Colv=F,
colors='Blues', # note that RCB scales will not work unless method='plotly'
plot_method = 'plotly',
labRow=MTLD_df$text, # note: heatmaply irregularly trims/justifies, so str_pad won't work
draw_cellnote=F,
hide_colorbar=T,
subplot_widths=.5) %>%
plotly::layout(yaxis=list(ticklen=0)) %>%
theme_plotly()
```
<br>
### Scene IV. Topic model
I'd say we're now ready for a topic model. That didn't take too much, did it?
#### Running the model and exploring the topics
We'll run one with 10 topics. As in the previous example in this document, we'll use <span class="pack">topicmodels</span> and the <span class="func">LDA</span> function. Later, we'll also compare our results with the traditional classifications of the texts. Note that this will take a while to run depending on your machine (maybe a minute or two). A faster implementation can be found with <span class="pack">text2vec</span>.
```{r topic_model, echo=1:2, eval=F}
library(topicmodels)
shakes_10 = LDA(convert(shakes_dtm, to = "topicmodels"), k = 10, control=list(seed=1234))
library(stm)
shakes_10_stm = stm(shakes_dtm, K = 10, seed=1234) # can take quanteda
# shakes_10_cor = CTM(convert(shakes_dtm, to = "topicmodels"), k = 10, control=list(cg=list(iter.max=2000)))
save(shakes_10, shakes_10_stm, file='data/shakespeare_topic_model.RData')
```
One of the first things to do is to interpret the topics, and we can start by seeing which terms are most probable for each topic.
```{r tm10_results, eval=FALSE}
get_terms(shakes_10, 20)
```
```{r tm10_results_pretty_terms, echo=FALSE}
load('data/shakespeare_topic_model.RData')
library(topicmodels)
get_terms(shakes_10, 20) %>%
DT::datatable(options=list(dom='tp')) %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
<br>
We can see there is a lot of overlap in these topics for top terms. Just looking at the top 10, *love* occurs in all of them, and *god* and *heart* are common as well, but we could have guessed this simply from how often they occur in general. Other measures can be used to assess term importance, such as those that seek to balance the term's probability of occurrence within a document against the term's *exclusivity*, or how likely a term is to occur in only one particular topic. See the <span class="pack">stm</span> package and the corresponding <span class="func">labelTopics</span> function as a way to get several alternatives. As an example, I show the results of their versions of the following[^scoredefs]:
- <span class="emph">FREX</span>: **FR**equency and **EX**clusivity; a weighted harmonic mean of a term's rank within a topic in terms of frequency and exclusivity.
- <span class="emph">lift</span>: Ratio of the term's probability within a topic to its probability of occurrence across all documents. Overly sensitive to rare words (a toy calculation follows this list).
- <span class="emph">score</span>: Another approach that gives more weight to more exclusive terms.
- <span class="emph">prob</span>: This is just the raw probability of the term within a given topic.
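As mentioned, here is a toy calculation of lift with made-up probabilities:
```{r lift_toy, eval=FALSE}
p_term_in_topic = 0.015  # P(term | topic), hypothetical
p_term_overall  = 0.002  # P(term) across all documents, hypothetical
p_term_in_topic / p_term_overall  # lift = 7.5
```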
```{r coherence, eval=FALSE, echo=FALSE}
library(stm)
docs = apply(shakes_dtm, 1, function(x) rbind((1:ncol(shakes_dtm))[x>0], x[x>0])) # bow for stemmed terms
tq = topicQuality(model=shakes_10_stm, documents=docs)
sc = semanticCoherence(model=shakes_10_stm, documents=docs, M=20)
ex = exclusivity(model=shakes_10_stm, M=20)
plot_ly(x=~sc, y=~ex, data=data_frame(sc, ex, topic=paste('Topic', 1:10))) %>%
add_text(text=~topic, color=I('#ff5500')) %>%
theme_plotly() %>%
layout(xaxis = list(title='Semantic Coherence'),
yaxis = list(title='Exclusivity'))
```
```{r label_topics_stm, eval=F, echo=FALSE}
# as stm topics won't be exactly the same, it might throw them off
# topic_labels = stm::labelTopics(shakes_10_stm, n=10)
# sapply(topic_labels[1:4], function(e) apply(e, 1, paste, collapse=', ')) %>%
# as_tibble() %>%
# mutate(Topic = 1:10) %>%
# gather(key=Score, value=Terms, -Topic) %>%
# arrange(Topic, Score) %>%
# DT::datatable(options=list(dom='t', scrollY=T, pageLength=40), rownames=F) %>%
# DT::formatStyle(
# 0, # ignores columns, but otherwise put here
# target='row',
# backgroundColor = 'transparent'
# )
```
```{r lable_topics, echo=FALSE}
library(stm)
frex_res = calcfrex(shakes_10@beta, wordcounts=colSums(as.matrix(shakes_10@wordassignments)))[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
# lift is definitely bugged, will just flag singletons
lift_res = calclift(shakes_10@beta, wordcounts=colSums(as.matrix(shakes_10@wordassignments)))[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
score_res = calcscore(shakes_10@beta)[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
prob_res = apply(shakes_10@beta, 1, order, decreasing = TRUE)[1:10,] %>%
shakes_10@terms[.] %>%
matrix(., 10, 10)
score_list = list(FREX = frex_res, LIFT=lift_res, SCORE=score_res, PROB=prob_res)
sapply(score_list, function(e) apply(e, 2, paste, collapse=', ')) %>%
as_tibble() %>%
mutate(Topic = 1:10) %>%
gather(key=Score, value=Terms, -Topic) %>%
arrange(Topic, Score) %>%
DT::datatable(options=list(dom='t', scrollY='400px', scrollCollapse=T, pageLength=40), rownames=F) %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
<br>
As another approach, consider the <span class="emph">saliency</span> and <span class="emph">relevance</span> of terms via the <span class="pack">LDAvis</span> package. While you can play with it here, it's probably easier to [open it separately](vis/index.html). Note that this has to be done separately from the model, and may have topic numbers in a different order.
<br>
```{r ldavis, eval=FALSE, echo=FALSE}
# may need to redo LDA just for this visual depending on what's been updated/changed
library(LDAvis)
json = createJSON(phi = exp(shakes_10@beta),
theta = shakes_10@gamma,
# doc.length = rowSums(shakes_dtm),
doc.length = length(shakes_10@documents),
vocab = shakes_10@terms,
plot.opts = list(mdswidth = 250,
mdsheight = 250,
barwidth = 250,
barheight = 250),
# term.frequency = shakes_10_stm$settings$dim$wcounts$x,
term.frequency = shakes_10@wordassignments %>% as.matrix() %>% colSums(),
R=10)
serVis(json, out.dir = 'docs/vis', open.browser = F) # have to redo if book folder is cleaned; can only adjust plot height/width directly in .js file
```
<iframe src="https://m-clark.github.io/text-analysis-with-R/vis/index.html" width="100%" height="862px" frameborder="0"
scrolling="yes"
marginheight="0"
marginwidth="0" title="LDAvis results">
<p>Your browser does not support iframes.</p>
</iframe>
<br>
Given all these measures, one can assess how well they match what topics the documents would be most associated with.
```{r tm10_results_topic_classification, eval=FALSE}
t(topics(shakes_10, 3))
```
```{r tm10_results_pretty_topic_classification, echo=FALSE}
t(topics(shakes_10, 2)) %>%
data.frame %>%
rename_all(str_replace, 'X', 'Top Topic ') %>%
DT::datatable(options=list(dom='t',
scrollY='400px',
scrollCollapse=T,
pageLength=40,
autoWidth=T,
align='center',
columnDefs=list(list(width='150px', targets=0),
list(width='100px', targets=1:2),
list(className = 'dt-center', targets = 1:2))),
width='500') %>%
DT::formatStyle(
0, # ignores columns, but otherwise put here
target='row',
backgroundColor = 'transparent'
)
```
<br>
For example, based just on term frequency, Hamlet is most likely to be associated with Topic `r t(topicmodels::topics(shakes_10, 3))['Hamlet',1]`. That topic is affiliated with the (stemmed words) `r topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet',1]]`. The other measures pick up on words like Dane and Denmark. Sounds about right for Hamlet.
<!-- Hamlet is also one that is actually a decent mix, with its second topic expressed being Topic `r t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]`, with common terms `r topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]]`. They both have `r intersect(topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet',1]], topicmodels::get_terms(shakes_10, 20)[, t(topicmodels::topics(shakes_10, 3))['Hamlet', 2]])` among their top 20 terms. -->
The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I've also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram[^clustering]. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem.
<br>
```{r viz_topics, echo=FALSE, out.height='700px', out.width='700px', fig.width=11, fig.height=8.5, fig.align='center'}
library(quanteda)
load('data/shakespeare_classification.RData')
suppressPackageStartupMessages(library(dendextend))
# see proxy::pr_DB %>% data.frame() for actual info for the metrics that
# quanteda uses, whose functions don't bother to even tell you that's where they
# are coming from
# proxy::pr_DB %>% data.frame() %>% select( distance, formula, description, reference) %>% DT::datatable()
# cosine distance, is not a proper distance
row_dend =
(1-textstat_simil(dfm_weight(shakes_dtm, 'relMaxFreq'),
margin = "documents",
method = "cosine")) %>%
as.dist() %>%
hclust(method="complete") %>%
as.dendrogram %>%
set("branches_k_color", k = 4) %>%
set("branches_lwd", c(.5,.5)) %>%
ladderize
shakes_10@gamma %>%
round(3) %>%
heatmaply::heatmaply(Rowv=row_dend,
Colv=F,
# colors='Oranges',
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
row_side_colors = data.frame(shakes_types$class),
row_side_palette = plasma,
k_row= 4,
# RowSideColors = 'Set2',
labRow=rev(labels(row_dend)),
labCol=paste0('Topic ', 1:10),
hide_colorbar=T,
grid_gap=2,
plot_method='plotly'
) %>%
layout(showlegend=F) %>% # showing the legend will screw up the colorbar and any associated options
config(displayModeBar = F) %>%
theme_plotly()
```
<br>
A couple things stand out. To begin with, most works are associated with one topic[^howmanytopics]. In terms of the discovered topics, the traditional classification probably only works well for the <span class="" style="color:#9C179E">historical</span> works, as they cluster together as expected (except for Henry VIII, possibly due to it being a collaborative work). Furthermore, <span class="" style="color:#F0F921">tragedies</span> and <span class="" style="color:#0D0887">comedies</span> might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the <span class="" style="color:#ED7953">poems</span> (love, beauty). If we take four clusters from the cluster analysis, the result boils down to *Phoenix*, *Complaint*, standard poems, a mixed bag of more romance-oriented works and the remaining poems, then everything else.
Alternatively, one could merely classify the works based on their probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization attempts to order them based on their most probable topic. The order is based on the most likely topics across all documents.
<br>
```{r cluster_topics, echo=FALSE, fig.align='center', fig.width=11, fig.height=8, out.height='800px', out.width='650px'}
topic_class = shakes_10@gamma %>%
round(2) %>%
data.frame() %>%
rename_all(function(x) str_replace(x, 'X', 'Topic ')) %>%
mutate(text = shakes_10@documents,
class = shakes_types$class)
order_topics = order(colSums(shakes_10@gamma), decreasing=T)
# order_topics = order(apply(shakes_10@gamma, 2, mean), decreasing=T)
topic_class = topic_class %>%
select(order_topics, text, class) %>%
arrange_at(vars(contains('Topic')), desc)
topic_class %>%
select(-text, -class) %>%
heatmaply::heatmaply(Rowv=NA,
Colv=NA,
labRow=rev(topic_class$text),
labCol=apply(get_terms(shakes_10, 10), 2, paste0, collapse='\n')[order_topics],
column_text_angle=0,
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
# subplot_widths=c(1),
plot_method = 'plotly',
fontsize_row=8,
fontsize_col=8,
hide_colorbar = T) %>%
layout(showlegend=F) %>%
config(displayModeBar = F) %>%
theme_plotly()
```
The following shows the average topic probability for each of the traditional classes. Topics are represented by their first five most probable terms.
```{r avg_topic_probs_per_class, echo=FALSE, fig.width=6, fig.height=3, out.height='600px', out.width='650px'}
class_probs = shakes_10@gamma %>%
data.frame() %>%
round(3) %>%
rename_all(function(x) stringr::str_replace_all(x, 'X', 'Topic ')) %>%
bind_cols(doc=shakes_10@documents, arrange(shakes_types, title), .) %>%
select(-title) %>%
gather(key=topic, value=prob, -doc, -class, -problem, -late_romance) %>%
mutate(topic = forcats::fct_reorder(topic, rep(1:10, e=n_distinct(doc)))) %>%
group_by(class, topic) %>%
summarise(mean_prob = mean(prob)) %>%
arrange(class, topic) %>%
spread(topic, mean_prob) %>%
ungroup
# class_probs %>%
# select(-class) %>%
# d3heatmap::d3heatmap(Rowv=F,
# Colv=F,
# labRow=class_probs$class,
# labCol=apply(get_terms(shakes_10, 5), 2, paste0, collapse=' '),
# # colors=viridis::inferno(500),
# colors='Oranges',
# k_row = 10,
# # row_side_colors = select(arrange(shakes_types, title), class),
# xaxis_font_size = 8,
# yaxis_font_size=12,
# show_grid = F,
# width=800,
# xaxis_height = 150
# )
class_probs %>%
select(-class) %>%
heatmaply::heatmaply(Rowv=F,
Colv=F,
labRow=rev(class_probs$class),
labCol=apply(get_terms(shakes_10, 10), 2, paste0, collapse='\n'),
# colors=viridis::inferno(500),
colors=colorRampPalette(c('#fffff8', palettes$tyrian_purple2$tyrian_purple)),
plot_method = 'plotly',
xaxis_font_size = 8,
yaxis_font_size=12,
# xaxis_height = 250,
# subplot_widths = .75,
hide_colorbar = T,
# colorbar_len=.5,
column_text_angle=0
) %>%
config(displayModeBar = F) %>%
theme_plotly()
```
Aside from the poems, the classes are a good mix of topics, and appear to have some overlap. Tragedies are perhaps most diverse.
#### Summary of Topic Models
This is where the summary would go, but I grow weary...
<br>
<br>
<br>
<div style="text-align:center; font-size:500%">**FIN**</div>
```{r ggplay, echo=F, eval=F}
shakes_10@gamma %>%
data.frame() %>%
round(3) %>%
rename_all(function(x) stringr::str_replace_all(x, 'X', 'Topic_')) %>%