-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathrtweet Exploration.Rmd
416 lines (321 loc) · 13.4 KB
/
rtweet Exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
---
title: "rtweet Exploration"
author: "Matthew Hendrickson"
date: "2/9/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
## rtweet
rtweet (<https://rtweet.info/>) is a R package that makes interacting with Twitter easy.
There are a few key interactions that you can perform with rtweet (per the [rtweet site](https://rtweet.info/))
1. Search Tweets
2. Stream Tweets
3. Get Friends
4. Get Timelines
5. Get Favorites
6. Search Users
7. Get Trends
8. Post Actions
This document explores Tweets from specific users and compares activity across those users.
## Setup
Run this if you haven't installed the required packages.
```{r pacakges, message=FALSE, warning=FALSE}
# Only needed if you haven't arealdy installed the packages
#install.packages(c("rtweet", "devtools", "tidyverse", "gridExtra", "lubridate", "kableExtra", "scales"))
# Needed to load the packages
library("rtweet")
library("devtools")
library("tidyverse")
library("gridExtra")
library("lubridate")
library("kableExtra")
library("scales")
```
## Connecting to Data
You'll need to set up and authorize the Twitter API. This is explained [here](https://rtweet.info/articles/auth.html).
Once you've completed this task, connect to the Twitter API.
The Twitter API limits the number of search results to 18.000 every 15 minutes.
Keep this in mind if you're pulling a large amount of data.
The rtweet package can help manage this with `retryonratelimit = TRUE`.
This command will help manage the search results in accordance with the Twitter API.
Note that I've used `rstudioapi::askForSecret`.
This prompt the user to manually type or copy their credentials into R Studio.
```{r connect, eval=FALSE}
appname <- rstudioapi::askForSecret("Twitter App Name") # name of twitter app
key <- rstudioapi::askForSecret("API Key") # api key
secret <- rstudioapi::askForSecret("API Secret") # api secret
token <- create_token(app = appname,
consumer_key = key,
consumer_secret = secret)
```
## Getting Specific Users
Let's pull tweet data for a few users to create a comparison.
Let's begin with the R for Data Science community's Tweets [Slack community sign up](bit.ly/R4DSslack), based on this [text](https://r4ds.had.co.nz/).
This community is great for anyone wanting to learn more R.
Let's begin with the R4DS handle's Tweets.
Let's see how many records we need to pull to get the R4DScommunity timeline to ensure we don't approach the 18,000 results limit.
First, we'll pull the most recent record and get the value in the `statuses_count` field.
The number of statuses is the number of Tweets the account has posted.
```{r r4ds test}
# See how many times R4DScommunity has Tweeted
r4ds_count <- get_timeline(c("R4DScommunity"), n = 1) # Get most recent record
r4ds_count$statuses_count # Get number of Tweets (statuses) = 6162 as of 2020/02/09
# rm(r4ds_count) # Removes r4ds_count dataframe
```
We should have no issue with the 18,000 result limit.
I've included `retryonratelimit = TRUE` here to illustrate how that looks, though it is unnecessary.
Let's now pull all Tweets for the R4DScommunity handle.
Adjust the `n = 7500` if necessary depending on the results of the previous call.
```{r r4ds data}
# Get R4DScommunity's Tweets
r4ds <- get_timeline(c("R4DScommunity"), n = 7500, retryonratelimit = TRUE)
# Adjust datestamp to EST from UTC
r4ds$created_at_est <- r4ds$created_at - 18000 # 5 hours in seconds
# Check adjustment
head(select(r4ds, created_at, created_at_est))
# Save results for future use to avoid having to call Twitter API
saveRDS(r4ds, "r4ds.rds")
```
That's it! This gets us the R for Data Science Tweet timeline.
Keep in mind the Twitter API 18,000 search results per 15 minutes if you adjust the number of records in the code above (or below).
Now, let's get another users' data. Feel free to swap out another handle (or even yours)!
Let's check to make sure we won't approach the 18,000 result limit.
If we do, simply include `retryonratelimit = TRUE` to have the call pause and restart every 15 minutes.
```{r mh test}
mh_count <- get_timeline(c("mjhendrickson"), n = 1) # Get most recent record
mh_count$statuses_count # Get number of Tweets (statuses) = 1736 as of 2020/02/09
# rm(mh_count) # Removes mh_count dataframe
```
Again, we will have no issue with the rate limit.
Let's pull all Tweets from the handle.
Adjust the `n = 7500` if necessary depending on the results of the previous call.
```{r mh data}
# Get mjhendrickson's Tweets
mh <- get_timeline(c("mjhendrickson"), n = 7500, retryonratelimit = TRUE)
# Adjust datestamp to EST from UTC
mh$created_at_est <- mh$created_at - 18000 # 5 hours in seconds
# Check adjustment
head(select(mh, created_at, created_at_est))
# Save results for future use to avoid having to call Twitter API
saveRDS(mh, "mh.rds")
```
## Create Comparison Plots
Now that we've pulled together the most recent 9,000 tweets for the two accounts above, let's examine the frequency of Tweets for both.
### R For Data Science's Tweets
I'm assigning the plot to the object `r4ds_plot`, then plotting that object.
I'm doing this so I can assemble multiple plots together at the end.
```{r r4ds frequency}
r4ds_plot <-
r4ds %>%
filter(created_at_est >= "2018-04-14") %>% # R4DS handle started 2018-04-14
ts_plot("days", trim = 1L) +
geom_point() +
theme_minimal() +
theme(
legend.title = element_blank(),
legend.position = "bottom",
plot.title = element_text(face = "bold")) +
labs(
x = NULL,
y = NULL,
title = "Frequency of @R4DScommunity Twitter Statuses",
subtitle = "Tweet counts by day since April 14, 2018",
caption = "\nSource: Data collected via rtweet - graphic by @mjhendrickson"
)
r4ds_plot
```
### My Tweets
Now lets create the plot for the second handle.
```{r mjhendrickson frequency}
mh_plot <-
mh %>%
filter(created_at_est >= "2018-04-14") %>% # R4DS handle started 2018-04-14
ts_plot("days", trim = 1L) +
geom_point() +
theme_minimal() +
theme(
legend.title = element_blank(),
legend.position = "bottom",
plot.title = element_text(face = "bold")) +
labs(
x = NULL,
y = NULL,
title = "Frequency of @mjhendrickson Twitter Statuses",
subtitle = "Tweet counts by day since April 14, 2018",
caption = "\nSource: Data collected via rtweet - graphic by @mjhendrickson"
)
mh_plot
```
### Let's put it all together
Here, we blend the two plots together with `gridExtra`.
Since the plots were filtered to the same time-frames, we can easily stack the plots to compare Tweet activity across the two handles.
```{r combine plots}
grid.arrange(r4ds_plot, mh_plot)
```
Clearly, the R for Data Science handle is much more active than mine!
But it looks like something happened mid-January! Let's find out what I was up to.
## What Happened Mid-January?
Let's start off by getting the Tweet counts by day for January.
```{r day counts}
mh %>%
filter(created_at_est >= "2019-01-10" &
created_at_est <= "2019-01-20") %>%
group_by(floor_date(created_at_est, unit = "day")) %>% # uses lubridate, rounds to day
summarize(n = n())
```
We can see that January 14th had 31 Tweets! But what were they?
```{r jan 14, paged.print=TRUE}
# Search the created_at_est field for records starting with "2019-01-14"
# to get contents of all tweets fro 2019-01-14
mh_jan_14 <- filter(mh, grepl("2019-01-14", created_at_est, fixed = TRUE))
# kable used here to display emoji, links, etc.
kable(select(mh_jan_14, created_at_est, text))
```
On January 14th, [WeAreRLadies](https://twitter.com/WeAreRLadies) had a thread about sharing resources.
I guess I went a little crazy.
## What Else Can We Get?
Reviewing a few records also lets us explore the data we get back from Twitter.
```{r dictionary, paged.print=TRUE}
colnames(mh)
```
For a full list of what's included in a data pull like this, go to the Tweet Data Dictionary [here](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html).
But of interest to me are:
* `created_at` = UTC time of Tweet
* `text` = content of the Tweet
* `source` = utility used to post the Tweet
* `reply_to_screen_name` = username of original Tweet's author if the Tweet was a reply
* `is_retweet` = TRUE/FALSE if this was a re-tweet
* `favorite_count` = number of times the Tweet was favorited
* `retweet_count` = number of times the Tweet was re-tweeted
* `hashtags` = string of all hashtags used
* `urls_url` = string of all urls
* `mentions_screen_name` = string of screen_names mentioned
* A whole host of fields on re-tweets, such as the user, their follower count, friend count, reach, location (if enabled), text
* Geolocation fields, such as place name, coordinates, location
* User info, including `followers_count`, `friends_count`, `favorites_count`
### Let's do a quick analysis of a few interesting fields
#### Tweet Sources
What's my mix of sources? Do I Tweet from my phone? What about desktop?
I'm not going to lie, I'm busy and want a consistent feed to Twitter.
So how much do I Tweet through Buffer?
Let's find out.
```{r source}
mh %>%
group_by(source) %>%
summarize(n = n())
```
Interesting! I clearly prefer Tweeting from my phone, then by the web client & app.
I Tweeted just once from my old iPad. That was when I mostly consumed Tweets to stay on top of news during my doctoral program.
You can also see that I use [Buffer](https://buffer.com/) as a means to schedule Tweets.
I had to search the [Twitter developer pages](https://developer.twitter.com/en/docs/twitter-for-websites/overview.html) to determine the source of `Twitter for Websites`.
This will show if you click the Tweet button on a website. Not sure what spurred me to Tweet from any specific website.
```{r source website}
mh %>%
filter(str_detect(source, "Twitter for Websites")) %>%
select(source, created_at_est, text, urls_expanded_url, mentions_screen_name)
```
A [Revolution Analytics blog post](https://blog.revolutionanalytics.com/2013/09/statistician-survey-results.html) from September 2013 inspired me to Tweet directly from the page.
#### Preferred Hashtags
Now that I know my preferred method of Tweeting, what do I Tweet about?
One way to check is to look at hashtag usage.
Let's start by exploring the hashtag field.
```{r hashtags}
mh_hashtags <- na.omit(mh$hashtags)
head(mh_hashtags)
mh_hashtags <- mh_hashtags[!is.na(mh_hashtags)]
head(mh_hashtags)
```
#### How do Others Interact with my Tweets?
```{r }
```
#### How long are my Tweets?
```{r length 140}
mh %>%
#filter(created_at_est >= "2019-01-10") # add filter for pre 280 character limit
group_by(display_text_width) %>%
#tally(sort = TTRUE, wt = NULL)
summarize(n())
```
I liked to push the limits of my Tweets. 140 used to be the character limit.
Let's check again to see if that holds _after_ the character limit increase (2017-11-07).
```{r lenght 280}
mh %>%
filter(created_at_est >= "2017-11-07") %>%
group_by(display_text_width) %>%
tally(sort = TRUE, wt = NULL)
```
#### Are my Tweets original, or re-Tweets??
```{r retweet}
mh %>%
group_by(is_retweet) %>%
tally(sort = TRUE, wt = NULL)
```
Whew! I'm about a 2:1 Tweet:Re-Tweet ratio. Does this vary by source?
```{r retweet source}
mh %>%
group_by(source, is_retweet) %>%
tally(sort = FALSE, wt = NULL)
```
I don't re-Tweet through Buffer. Of Tweets from my phone, the majority are re-Tweets.
The majority of my Tweets from the browser are original.
#### How engaging are my Tweets? Let's start with some simple descriptive stats.
```{r favorited}
mh %>%
filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
summarize(
min = min(favorite_count),
med = median(favorite_count),
mean = mean(favorite_count),
max = max(favorite_count),
iqr = IQR(favorite_count),
n = n()
)
```
Not unexpected. The median is 0, mean is ~3. I had one good tweet at 164!
Let's look at the distribution.
```{r favorited counts}
mh %>%
filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
group_by(favorite_count) %>%
summarize(n = n()) %>%
arrange(desc(favorite_count))
```
It is clear that very few of my Tweets recieved a large number of likes.
If that isn't clear enough, let's visualize the distribution.
```{r favorited dist}
mh %>%
filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
ggplot() +
geom_bar(mapping = aes(x = favorite_count)) +
#scale_x_continuous(name = " ", breaks = c(5, 165)) + # Enable if filtered
scale_y_continuous(name = " ", labels = comma) +
labs(title = "Distribution of Tweet Favorites",
caption = "\n @mjhendrickson")
```
```{r retweeted}
mh %>%
filter(is_retweet == "FALSE") %>% # Keeps only original Tweets
summarize(
min = min(retweet_count),
med = median(retweet_count),
mean = mean(retweet_count),
max = max(retweet_count),
iqr = IQR(retweet_count),
n = n()
)
```
As expected, the numbers are even lower than `favorite_counts`.
Though, I did have one with 47 re-Tweets!
Let's see if it was the same Tweet.
```{r fav rt}
mh %>%
filter(is_retweet == "FALSE" & retweet_count == '47') # Keeps original Tweets retweeted 47x
```
Now let's look at a combined engagement metric of favorites and re-Tweets.
```{r engagement calculation}
mh$engagement <- mh$favorite_count + mh$retweet_count # Add favorites + retweets
head(mh[,c("favorite_count",'retweet_count',"engagement")]) # Check calculation
# filter(favorite_count != 0 & retweet_count != 0) # Error filter_
```