rtweet Exploration.Rmd

---
title: "rtweet Exploration"
author: "Matthew Hendrickson"
date: "2/9/2020"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```

## rtweet

rtweet (<https://rtweet.info/>) is a R package that makes interacting with Twitter easy.
There are a few key interactions that you can perform with rtweet (per the [rtweet site](https://rtweet.info/))

1. Search Tweets
2. Stream Tweets
3. Get Friends
4. Get Timelines
5. Get Favorites
6. Search Users
7. Get Trends
8. Post Actions

This document explores Tweets from specific users and compares activity across those users.

## Setup

Run this if you haven't installed the required packages.


```{r pacakges, message=FALSE, warning=FALSE}
# Only needed if you haven't arealdy installed the packages
#install.packages(c("rtweet", "devtools", "tidyverse", "gridExtra", "lubridate", "kableExtra", "scales"))
# Needed to load the packages
library("rtweet")
library("devtools")
library("tidyverse")
library("gridExtra")
library("lubridate")
library("kableExtra")
library("scales")
```

## Connecting to Data

You'll need to set up and authorize the Twitter API. This is explained [here](https://rtweet.info/articles/auth.html).

Once you've completed this task, connect to the Twitter API.
The Twitter API limits the number of search results to 18.000 every 15 minutes.
Keep this in mind if you're pulling a large amount of data.
The rtweet package can help manage this with `retryonratelimit = TRUE`.
This command will help manage the search results in accordance with the Twitter API.

Note that I've used `rstudioapi::askForSecret`.
This prompt the user to manually type or copy their credentials into R Studio.

```{r connect, eval=FALSE}
appname <- rstudioapi::askForSecret("Twitter App Name") # name of twitter app
key     <- rstudioapi::askForSecret("API Key") # api key
secret  <- rstudioapi::askForSecret("API Secret") # api secret
token   <- create_token(app = appname,
                        consumer_key = key,
                        consumer_secret = secret)
```

## Getting Specific Users

Let's pull tweet data for a few users to create a comparison.
Let's begin with the R for Data Science community's Tweets [Slack community sign up](bit.ly/R4DSslack), based on this [text](https://r4ds.had.co.nz/).
This community is great for anyone wanting to learn more R.
Let's begin with the R4DS handle's Tweets.

Let's see how many records we need to pull to get the R4DScommunity timeline to ensure we don't approach the 18,000 results limit.
First, we'll pull the most recent record and get the value in the `statuses_count` field.
The number of statuses is the number of Tweets the account has posted.

```{r r4ds test}
# See how many times R4DScommunity has Tweeted
r4ds_count <- get_timeline(c("R4DScommunity"), n = 1) # Get most recent record
r4ds_count$statuses_count # Get number of Tweets (statuses) = 6162 as of 2020/02/09
# rm(r4ds_count) # Removes r4ds_count dataframe
```

We should have no issue with the 18,000 result limit.
I've included `retryonratelimit = TRUE` here to illustrate how that looks, though it is unnecessary.
Let's now pull all Tweets for the R4DScommunity handle.
Adjust the `n = 7500` if necessary depending on the results of the previous call.

```{r r4ds data}
# Get R4DScommunity's Tweets
r4ds <- get_timeline(c("R4DScommunity"), n = 7500, retryonratelimit = TRUE)
# Adjust datestamp to EST from UTC
r4ds$created_at_est <- r4ds$created_at - 18000 # 5 hours in seconds
# Check adjustment
head(select(r4ds, created_at, created_at_est))
# Save results for future use to avoid having to call Twitter API
saveRDS(r4ds, "r4ds.rds")
```

That's it! This gets us the R for Data Science Tweet timeline.
Keep in mind the Twitter API 18,000 search results per 15 minutes if you adjust the number of records in the code above (or below).

Now, let's get another users' data. Feel free to swap out another handle (or even yours)!

Let's check to make sure we won't approach the 18,000 result limit.
If we do, simply include `retryonratelimit = TRUE` to have the call pause and restart every 15 minutes.

```{r mh test}
mh_count <- get_timeline(c("mjhendrickson"), n = 1) # Get most recent record
mh_count$statuses_count # Get number of Tweets (statuses) = 1736 as of 2020/02/09
# rm(mh_count) # Removes mh_count dataframe
```

Again, we will have no issue with the rate limit.
Let's pull all Tweets from the handle.
Adjust the `n = 7500` if necessary depending on the results of the previous call.

```{r mh data}
# Get mjhendrickson's Tweets
mh <- get_timeline(c("mjhendrickson"), n = 7500, retryonratelimit = TRUE)
# Adjust datestamp to EST from UTC
mh$created_at_est <- mh$created_at - 18000 # 5 hours in seconds
# Check adjustment
head(select(mh, created_at, created_at_est))
# Save results for future use to avoid having to call Twitter API
saveRDS(mh, "mh.rds")
```

## Create Comparison Plots

Now that we've pulled together the most recent 9,000 tweets for the two accounts above, let's examine the frequency of Tweets for both.

### R For Data Science's Tweets

I'm assigning the plot to the object `r4ds_plot`, then plotting that object.
I'm doing this so I can assemble multiple plots together at the end.

```{r r4ds frequency}
r4ds_plot <- 
r4ds %>% 
  filter(created_at_est >= "2018-04-14") %>% # R4DS handle started 2018-04-14
ts_plot("days", trim = 1L) +
  geom_point() +
  theme_minimal() +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")) +
  labs(
    x = NULL,
    y = NULL,
    title = "Frequency of @R4DScommunity Twitter Statuses",
    subtitle = "Tweet counts by day since April 14, 2018",
    caption = "\nSource: Data collected via rtweet - graphic by @mjhendrickson"
  )
r4ds_plot
```

### My Tweets

Now lets create the plot for the second handle.

```{r mjhendrickson frequency}
mh_plot <- 
mh %>% 
  filter(created_at_est >= "2018-04-14") %>% # R4DS handle started 2018-04-14
ts_plot("days", trim = 1L) +
  geom_point() +
  theme_minimal() +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")) +
  labs(
    x = NULL,
    y = NULL,
    title = "Frequency of @mjhendrickson Twitter Statuses",
    subtitle = "Tweet counts by day since April 14, 2018",
    caption = "\nSource: Data collected via rtweet - graphic by @mjhendrickson"
  )
mh_plot
```

### Let's put it all together

Here, we blend the two plots together with `gridExtra`.
Since the plots were filtered to the same time-frames, we can easily stack the plots to compare Tweet activity across the two handles.

```{r combine plots}
grid.arrange(r4ds_plot, mh_plot)
```

Clearly, the R for Data Science handle is much more active than mine!

But it looks like something happened mid-January! Let's find out what I was up to.

## What Happened Mid-January?

Let's start off by getting the Tweet counts by day for January.

```{r day counts}
mh %>% 
  filter(created_at_est >= "2019-01-10" & 
         created_at_est <= "2019-01-20") %>% 
  group_by(floor_date(created_at_est, unit = "day")) %>% # uses lubridate, rounds to day
  summarize(n = n())
```

We can see that January 14th had 31 Tweets! But what were they?

```{r jan 14, paged.print=TRUE}
# Search the created_at_est field for records starting with "2019-01-14"
# to get contents of all tweets fro 2019-01-14
mh_jan_14 <- filter(mh, grepl("2019-01-14", created_at_est, fixed = TRUE))
# kable used here to display emoji, links, etc.
kable(select(mh_jan_14, created_at_est, text))
```

On January 14th, [WeAreRLadies](https://twitter.com/WeAreRLadies) had a thread about sharing resources.
I guess I went a little crazy.

## What Else Can We Get?

Reviewing a few records also lets us explore the data we get back from Twitter.

```{r dictionary, paged.print=TRUE}
colnames(mh)
```

For a full list of what's included in a data pull like this, go to the Tweet Data Dictionary [here](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html).
But of interest to me are:

* `created_at` = UTC time of Tweet
* `text` = content of the Tweet
* `source` = utility used to post the Tweet
* `reply_to_screen_name` = username of original Tweet's author if the Tweet was a reply
* `is_retweet` = TRUE/FALSE if this was a re-tweet
* `favorite_count` = number of times the Tweet was favorited
* `retweet_count` = number of times the Tweet was re-tweeted
* `hashtags` = string of all hashtags used
* `urls_url` = string of all urls
* `mentions_screen_name` = string of screen_names mentioned
* A whole host of fields on re-tweets, such as the user, their follower count, friend count, reach, location (if enabled), text
* Geolocation fields, such as place name, coordinates, location
* User info, including `followers_count`, `friends_count`, `favorites_count`

### Let's do a quick analysis of a few interesting fields

#### Tweet Sources

What's my mix of sources? Do I Tweet from my phone? What about desktop?

I'm not going to lie, I'm busy and want a consistent feed to Twitter.
So how much do I Tweet through Buffer?

Let's find out.

```{r source}
mh %>% 
  group_by(source) %>% 
  summarize(n = n())
```

Interesting! I clearly prefer Tweeting from my phone, then by the web client & app.

I Tweeted just once from my old iPad. That was when I mostly consumed Tweets to stay on top of news during my doctoral program.

You can also see that I use [Buffer](https://buffer.com/) as a means to schedule Tweets.

I had to search the [Twitter developer pages](https://developer.twitter.com/en/docs/twitter-for-websites/overview.html) to determine the source of `Twitter for Websites`.
This will show if you click the Tweet button on a website. Not sure what spurred me to Tweet from any specific website.

```{r source website}
mh %>% 
  filter(str_detect(source, "Twitter for Websites")) %>% 
  select(source, created_at_est, text, urls_expanded_url, mentions_screen_name)
```

A [Revolution Analytics blog post](https://blog.revolutionanalytics.com/2013/09/statistician-survey-results.html) from September 2013 inspired me to Tweet directly from the page.

#### Preferred Hashtags

Now that I know my preferred method of Tweeting, what do I Tweet about?
One way to check is to look at hashtag usage.

Let's start by exploring the hashtag field.

```{r hashtags}
mh_hashtags <- na.omit(mh$hashtags)
head(mh_hashtags)
mh_hashtags <- mh_hashtags[!is.na(mh_hashtags)]
head(mh_hashtags)
```

#### How do Others Interact with my Tweets?

```{r }
```


#### How long are my Tweets?

```{r length 140}
mh %>% 
  #filter(created_at_est >= "2019-01-10") # add filter for pre 280 character limit
  group_by(display_text_width) %>% 
  #tally(sort = TTRUE, wt = NULL)
  summarize(n())
```

I liked to push the limits of my Tweets. 140 used to be the character limit.
Let's check again to see if that holds _after_ the character limit increase (2017-11-07).

```{r lenght 280}
mh %>% 
  filter(created_at_est >= "2017-11-07") %>% 
  group_by(display_text_width) %>% 
  tally(sort = TRUE, wt = NULL)
```

#### Are my Tweets original, or re-Tweets??

```{r retweet}
mh %>% 
  group_by(is_retweet) %>% 
  tally(sort = TRUE, wt = NULL)
```

Whew! I'm about a 2:1 Tweet:Re-Tweet ratio. Does this vary by source?

```{r retweet source}
mh %>% 
  group_by(source, is_retweet) %>% 
  tally(sort = FALSE, wt = NULL)
```

I don't re-Tweet through Buffer. Of Tweets from my phone, the majority are re-Tweets.
The majority of my Tweets from the browser are original.

#### How engaging are my Tweets? Let's start with some simple descriptive stats.

```{r favorited}
mh %>% 
  filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
  summarize(
    min  =    min(favorite_count), 
    med  = median(favorite_count), 
    mean =   mean(favorite_count), 
    max  =    max(favorite_count), 
    iqr  =    IQR(favorite_count), 
    n    =      n()
  )
```

Not unexpected. The median is 0, mean is ~3. I had one good tweet at 164!

Let's look at the distribution.

```{r favorited counts}
mh %>% 
  filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
  group_by(favorite_count) %>% 
  summarize(n = n()) %>% 
  arrange(desc(favorite_count))
```

It is clear that very few of my Tweets recieved a large number of likes.
If that isn't clear enough, let's visualize the distribution.

```{r favorited dist}
mh %>% 
  filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
ggplot() +
  geom_bar(mapping = aes(x = favorite_count)) +
  #scale_x_continuous(name = " ", breaks = c(5, 165)) + # Enable if filtered
  scale_y_continuous(name = " ", labels = comma) +
  labs(title = "Distribution of Tweet Favorites",
       caption = "\n @mjhendrickson")
```


```{r retweeted}
mh %>% 
  filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
  summarize(
    min  =    min(retweet_count), 
    med  = median(retweet_count), 
    mean =   mean(retweet_count), 
    max  =    max(retweet_count), 
    iqr  =    IQR(retweet_count), 
    n    =      n()
  )
```

As expected, the numbers are even lower than `favorite_counts`.
Though, I did have one with 47 re-Tweets!
Let's see if it was the same Tweet.

```{r fav rt}
mh %>% 
  filter(is_retweet == "FALSE" & retweet_count == '47') # Keeps original Tweets retweeted 47x
```


Now let's look at a combined engagement metric of favorites and re-Tweets.

```{r engagement calculation}
mh$engagement <- mh$favorite_count + mh$retweet_count # Add favorites + retweets
head(mh[,c("favorite_count",'retweet_count',"engagement")]) # Check calculation

# filter(favorite_count != 0 & retweet_count != 0) # Error filter_
```