---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  message = FALSE,
  warning = FALSE
)
ggplot2::theme_set(ggplot2::theme_minimal())
library(dplyr)
```
# peRspective <img src="man/figures/perspective.png" width="160px" align="right"/>
```{r, echo = FALSE, results='asis', eval = T}
library(badger)
git_repo <- "favstats/peRspective"
cat(
  "[![](https://cranlogs.r-pkg.org/badges/grand-total/peRspective)](https://cran.rstudio.com/web/packages/peRspective/index.html)",
  " [![R-CMD-check](https://github.com/favstats/peRspective/workflows/R-CMD-check/badge.svg)](https://github.com/favstats/peRspective/actions)",
  # badge_custom("My 1st Package", "Ever", "blue"),
  badge_cran_release("peRspective", "blue"),
  "[![Codecov test coverage](https://codecov.io/gh/favstats/peRspective/branch/master/graph/badge.svg)](https://codecov.io/gh/favstats/peRspective?branch=master)",
  badge_code_size(git_repo),
  badge_last_commit(git_repo)
)
```
Perspective is an API that uses machine learning models to score the perceived impact a comment might have on a conversation. [Website](http://www.perspectiveapi.com/).
`peRspective` provides access to the API using the R programming language.
For excellent documentation of the Perspective API, see [here](https://github.com/conversationai/perspectiveapi/tree/master/2-api).
> This is a work-in-progress project and I welcome feedback and pull requests!
## Overview
+ [Setup](https://github.com/favstats/peRspective#setup)
+ [Models](https://github.com/favstats/peRspective#models)
+ [Usage](https://github.com/favstats/peRspective#usage)
  + [`prsp_score`](https://github.com/favstats/peRspective#prsp_score)
  + [`prsp_stream`](https://github.com/favstats/peRspective#prsp_stream)
## Setup
### Get an API key
[Follow these steps](https://developers.perspectiveapi.com/s/docs-get-started) as outlined by the Perspective API to get an API key.
**NOTE:** Perspective API made some changes recently and you now have to apply for an API key via a Google Form.
### Quota and character length limits
Be sure to check your quota limit! You can learn more about the Perspective API quota limits by visiting [your Google Cloud project's Perspective API page](https://console.cloud.google.com/apis/api/commentanalyzer.googleapis.com/quotas).
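One way to stay under a per-second quota is to space out requests on the client side. The sketch below is a minimal illustration, not part of `peRspective`: `throttle_calls` is a hypothetical helper, and it assumes the quota is expressed as requests per second. Check your own quota before choosing a value.

```r
# Hypothetical helper, not part of peRspective: call `fun` on each element of
# `x`, pausing so that at most `per_second` calls are started per second.
throttle_calls <- function(x, fun, per_second = 1) {
  stopifnot(per_second > 0)
  min_gap <- 1 / per_second          # minimum seconds between call starts
  out <- vector("list", length(x))
  last_start <- -Inf
  for (i in seq_along(x)) {
    wait <- min_gap - (as.numeric(Sys.time()) - last_start)
    if (wait > 0) Sys.sleep(wait)    # only pause when calls come too fast
    last_start <- as.numeric(Sys.time())
    out[[i]] <- fun(x[[i]])
  }
  out
}

# `toupper` stands in for an API call so the sketch runs without a key.
res <- throttle_calls(c("a", "b", "c"), toupper, per_second = 20)
```

In real use, `fun` would be a function that sends one text to the API.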
## Models
For a detailed overview of the available models, [see here](https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages).
### Languages in production
Currently, Perspective API has production `TOXICITY` and `SEVERE_TOXICITY` attributes in the following languages:
+ English (en)
+ Spanish (es)
+ French (fr)
+ German (de)
+ Portuguese (pt)
+ Italian (it)
+ Russian (ru)
### Toxicity attributes
| Attribute name | Type | Description | Available Languages |
|------------------------------|-------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| `TOXICITY` | prod. | A rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion. | English (en), Spanish (es), French (fr), German (de), Portuguese (pt), Italian (it), Russian (ru) |
| `TOXICITY_EXPERIMENTAL` | exp. | A rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion. | Arabic (ar) |
| `SEVERE_TOXICITY` | prod. | A very hateful, aggressive, disrespectful comment or otherwise very likely to make a user leave a discussion or give up on sharing their perspective. This attribute is much less sensitive to more mild forms of toxicity, such as comments that include positive uses of curse words. | en, fr, es, de, it, pt, ru |
| `SEVERE_TOXICITY_EXPERIMENTAL` | exp. | A very hateful, aggressive, disrespectful comment or otherwise very likely to make a user leave a discussion or give up on sharing their perspective. This attribute is much less sensitive to more mild forms of toxicity, such as comments that include positive uses of curse words. | ar |
| `IDENTITY_ATTACK` | prod. | Negative or hateful comments targeting someone because of their identity. | de, it, pt, ru, en |
| `IDENTITY_ATTACK_EXPERIMENTAL` | exp. | Negative or hateful comments targeting someone because of their identity. | fr, es, ar |
| `INSULT` | prod. | Insulting, inflammatory, or negative comment towards a person or a group of people. | de, it, pt, ru, en |
| `INSULT_EXPERIMENTAL` | exp. | Insulting, inflammatory, or negative comment towards a person or a group of people. | fr, es, ar |
| `PROFANITY` | prod. | Swear words, curse words, or other obscene or profane language. | de, it, pt, ru, en |
| `PROFANITY_EXPERIMENTAL` | exp. | Swear words, curse words, or other obscene or profane language. | fr, es, ar |
| `THREAT` | prod. | Describes an intention to inflict pain, injury, or violence against an individual or group. | de, it, pt, ru, en |
| `THREAT_EXPERIMENTAL` | exp. | Describes an intention to inflict pain, injury, or violence against an individual or group. | fr, es, ar |
| `SEXUALLY_EXPLICIT` | exp. | Contains references to sexual acts, body parts, or other lewd content. | en |
| `FLIRTATION` | exp. | Pickup lines, complimenting appearance, subtle sexual innuendos, etc. | en |
### New York Times attributes
These attributes are experimental because they are trained on a single source of comments—New York Times (NYT) data tagged by their moderation team—and therefore may not work well for every use case.
| Attribute name | Type | Description | Language |
| -------------------- | ---- | ----------- | -------- |
| `ATTACK_ON_AUTHOR` | exp. | Attack on the author of an article or post. | en |
| `ATTACK_ON_COMMENTER` | exp. | Attack on fellow commenter. | en |
| `INCOHERENT` | exp. | Difficult to understand, nonsensical. | en |
| `INFLAMMATORY` | exp. | Intending to provoke or inflame. | en |
| `LIKELY_TO_REJECT` | exp. | Overall measure of the likelihood for the comment to be rejected according to the NYT's moderation. | en |
| `OBSCENE` | exp. | Obscene or vulgar language such as cursing. | en |
| `SPAM` | exp. | Irrelevant and unsolicited commercial content. | en |
| `UNSUBSTANTIAL` | exp. | Trivial or short comments. | en |
A character vector that includes all models supported by `peRspective` can be obtained like this:
```{r}
c(
  peRspective::prsp_models,
  peRspective::prsp_exp_models
)
```
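Before sending a request, it can help to check the requested attribute names against that vector. This is a minimal sketch: `check_models` is a hypothetical helper, and `supported` is a short hard-coded stand-in for the real vectors so that the example runs without the package installed.

```r
# Stand-in for c(peRspective::prsp_models, peRspective::prsp_exp_models);
# only a few names from the tables above are listed here.
supported <- c("TOXICITY", "SEVERE_TOXICITY", "INSULT", "THREAT")

# Hypothetical helper: stop early on misspelled attribute names.
check_models <- function(requested, supported) {
  unknown <- setdiff(requested, supported)
  if (length(unknown) > 0) {
    stop("Unknown model(s): ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  invisible(requested)
}

check_models(c("TOXICITY", "INSULT"), supported)  # passes silently
```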
## Usage
First, install the package from GitHub:
```{r, eval = F}
devtools::install_github("favstats/peRspective")
```
Load package:
```{r}
library(peRspective)
```
Also load the `tidyverse` for the examples:
```{r, eval = F}
library(tidyverse)
```
```{r, eval = F, echo=F}
key <- readr::read_lines("../keys/prsp_simon.txt")
Sys.setenv(perspective_api_key = key)
```
**Define your key variable.**
`peRspective` functions read the API key from the environment variable `perspective_api_key`. To add your key to your environment file, you can use the function `edit_r_environ()` from the [`usethis` package](https://usethis.r-lib.org/).
```{r, eval = F}
usethis::edit_r_environ()
```
This will open your `.Renviron` file in your text editor. Now, you can add the following line to it:
```{r, eval = F}
perspective_api_key="YOUR_API_KEY"
```
Save the file and restart R for the changes to take effect.
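To confirm that R can see the key after the restart, you can query the environment variable directly. This is a minimal sketch using only base R; `has_perspective_key` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: TRUE if the environment variable that peRspective
# reads (`perspective_api_key`) is set to a non-empty value.
has_perspective_key <- function() {
  nzchar(Sys.getenv("perspective_api_key"))
}

if (!has_perspective_key()) {
  message("perspective_api_key is not set; edit .Renviron and restart R.")
}
```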
Alternatively, you can provide an explicit definition of your API key with each function call using the `key` argument.
### `prsp_score`
Now you can use `prsp_score` to score your comments with various models provided by the Perspective API.
```{r, eval = T}
my_text <- "You wrote this? Wow. This is dumb and childish, please go f**** yourself."

text_scores <- prsp_score(
  text = my_text,
  languages = "en",
  score_model = peRspective::prsp_models
)

text_scores %>%
  tidyr::gather() %>%
  dplyr::mutate(key = forcats::fct_reorder(key, value)) %>%
  ggplot2::ggplot(ggplot2::aes(key, value)) +
  ggplot2::geom_col() +
  ggplot2::coord_flip() +
  ggplot2::ylim(0, 1) +
  ggplot2::geom_hline(yintercept = 0.5, linetype = "dashed") +
  ggplot2::labs(x = "Model", y = "Probability", title = "Perspective API Results")
```
```{r, echo=F}
Sys.sleep(1)
```
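The dashed line in the plot marks a score of 0.5. If you want the attributes above such a cutoff as data rather than as a figure, something like the following could work. This is a sketch under assumptions: `mock_scores` is a made-up one-row data frame standing in for the output of `prsp_score` (one column per model, as the reshaping code above implies), the values are invented, and `flag_models` is a hypothetical helper. The 0.5 cutoff is only the line drawn in the plot, not a recommended threshold.

```r
# Invented values standing in for a one-row result from prsp_score().
mock_scores <- data.frame(
  TOXICITY = 0.92,
  SEVERE_TOXICITY = 0.61,
  THREAT = 0.12
)

# Hypothetical helper: names of the models whose score exceeds `cutoff`.
flag_models <- function(scores, cutoff = 0.5) {
  values <- unlist(scores[1, , drop = FALSE])
  names(values)[values > cutoff]
}

flag_models(mock_scores)
```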
A Trump Tweet:
```{r}
trump_tweet <- "The Fake News Media has NEVER been more Dishonest or Corrupt than it is right now. There has never been a time like this in American History. Very exciting but also, very sad! Fake News is the absolute Enemy of the People and our Country itself!"

text_scores <- prsp_score(
  trump_tweet,
  score_sentences = FALSE,
  score_model = peRspective::prsp_models
)

text_scores %>%
  tidyr::gather() %>%
  dplyr::mutate(key = forcats::fct_reorder(key, value)) %>%
  ggplot2::ggplot(ggplot2::aes(key, value)) +
  ggplot2::geom_col() +
  ggplot2::coord_flip() +
  ggplot2::ylim(0, 1) +
  ggplot2::geom_hline(yintercept = 0.5, linetype = "dashed") +
  ggplot2::labs(x = "Model", y = "Probability", title = "Perspective API Results")
```
```{r, echo=F}
Sys.sleep(1)
```
Instead of scoring only entire comments, you can also score individual sentences with `score_sentences = TRUE`. In this case, the Perspective API automatically splits your text into sentences and scores each of them in addition to providing an overall score.
```{r, eval = T, fig.width=12, fig.height=8}
trump_tweet <- "The Fake News Media has NEVER been more Dishonest or Corrupt than it is right now. There has never been a time like this in American History. Very exciting but also, very sad! Fake News is the absolute Enemy of the People and our Country itself!"

text_scores <- prsp_score(
  trump_tweet,
  score_sentences = TRUE,
  score_model = peRspective::prsp_models
)

text_scores %>%
  tidyr::unnest(sentence_scores) %>%
  dplyr::select(type, score, sentences) %>%
  tidyr::gather(value, key, -sentences, -score) %>%
  dplyr::mutate(key = forcats::fct_reorder(key, score)) %>%
  ggplot2::ggplot(ggplot2::aes(key, score)) +
  ggplot2::geom_col() +
  ggplot2::coord_flip() +
  ggplot2::facet_wrap(~sentences, ncol = 2) +
  ggplot2::geom_hline(yintercept = 0.5, linetype = "dashed") +
  ggplot2::labs(x = "Model", y = "Probability", title = "Perspective API Results")
```
```{r, echo=F}
Sys.sleep(1)
```
You can also use Spanish (`es`) for `TOXICITY`, `SEVERE_TOXICITY` and `_EXPERIMENTAL` models.
```{r}
spanish_text <- "gastan en cosas que de nada sirven-nunca tratan de saber la verdad del funcionalismo de nuestro sistema solar y origen del cosmos-falso por Kepler. LAS UNIVERSIDADES DEL MUNDO NO SABEN ANALIZAR VERDAD O MENTIRA-LO QUE DICE KEPLER"

text_scores <- prsp_score(
  text = spanish_text,
  languages = "es",
  score_model = c("TOXICITY", "SEVERE_TOXICITY", "INSULT_EXPERIMENTAL")
)

text_scores %>%
  tidyr::gather() %>%
  dplyr::mutate(key = forcats::fct_reorder(key, value)) %>%
  ggplot2::ggplot(ggplot2::aes(key, value)) +
  ggplot2::geom_col() +
  ggplot2::coord_flip() +
  ggplot2::geom_hline(yintercept = 0.5, linetype = "dashed") +
  ggplot2::labs(x = "Model", y = "Probability", title = "Perspective API Results")
```
**NOTE:** By default, the text you provide will be stored by the Perspective API for future research. If the supplied texts are private or any of their authors are below 13 years old, `doNotStore` should be set to `TRUE`.
### `prsp_stream`
```{r, echo=F}
Sys.sleep(1)
```
So far, we have only seen how to get individual comments or sentences scored. But what if you would like to run the function on an entire dataset with a text column? This is where `prsp_stream` comes in. At its core, `prsp_stream` is a loop implemented with `purrr::map` that iterates over your text column. To use it, let's first generate a mock tibble.
```{r}
text_sample <- tibble(
  ctext = c("You wrote this? Wow. This is dumb and childish, please go f**** yourself.",
            "I don't know what to say about this but it's not good. The commenter is just an idiot",
            "This goes even further!",
            "What the hell is going on?",
            "Please. I don't get it. Explain it again",
            "Annoying and irrelevant! I'd rather watch the paint drying on the wall!"),
  textid = c("#efdcxct", "#ehfcsct",
             "#ekacxwt", "#ewatxad",
             "#ekacswt", "#ewftxwd")
)
```
```{r, echo=F}
Sys.sleep(1)
```
`prsp_stream` requires a `text` and `text_id` column. It wraps `prsp_score` and takes all its arguments. Let's run the most basic version:
```{r, message=T}
text_sample %>%
  prsp_stream(text = ctext,
              text_id = textid,
              score_model = c("TOXICITY", "SEVERE_TOXICITY"))
```
You receive a `tibble` with the requested scores, including the `text_id`, so that you can match each score with your original data frame.
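Matching scores back to the original data is then an ordinary join on the id column. The following sketch uses invented numbers: `mock_stream` stands in for the `prsp_stream` output and is assumed to carry a `text_id` column as described above.

```r
# Original data (shortened) and an invented stand-in for prsp_stream() output.
comments <- data.frame(
  textid = c("#efdcxct", "#ehfcsct"),
  ctext = c("first comment", "second comment")
)
mock_stream <- data.frame(
  text_id = c("#ehfcsct", "#efdcxct"),
  TOXICITY = c(0.8, 0.9)  # made-up values
)

# Left join: keep every original row, even if it has no score.
joined <- merge(comments, mock_stream,
                by.x = "textid", by.y = "text_id", all.x = TRUE)
joined
```

With `all.x = TRUE`, texts that could not be scored stay in the result with `NA` scores instead of being dropped.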
Now, the problem is that sometimes the call might fail at some point. It is therefore suggested to set `safe_output = TRUE`. This will put the function into a `purrr::safely` environment to ensure that your function will keep running even if you encounter errors.
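The idea behind `safe_output` can be illustrated without the API. The sketch below is not the package's implementation (which, per the description above, relies on `purrr::safely`). It is a base R analogue using `tryCatch`, with a hypothetical `fake_score` function that fails on empty strings.

```r
# Hypothetical stand-in for a scoring call: errors on empty input.
fake_score <- function(txt) {
  if (!nzchar(txt)) stop("Comment must be non-empty.")
  nchar(txt) / 100  # placeholder "score", not a real model output
}

# Capture either the result or the error message for each text,
# so a single failure does not abort the whole loop.
score_safely <- function(txt) {
  tryCatch(
    list(result = fake_score(txt), error = NA_character_),
    error = function(e) list(result = NA_real_, error = conditionMessage(e))
  )
}

texts <- c("What the hell is going on?", "", "This goes even further!")
out <- lapply(texts, score_safely)
errors <- vapply(out, function(x) x$error, character(1))
```

Here `errors` holds `NA` for texts that were scored and the error message for the one that failed, so the loop completes and the failure can be inspected afterwards.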
Let's try it out with a new dataset that contains text that the Perspective API can't score:
```{r, echo=F}
Sys.sleep(1)
```
```{r}
text_sample <- tibble(
  ctext = c("You wrote this? Wow. This is dumb and childish, please go f**** yourself.",
            "I don't know what to say about this but it's not good. The commenter is just an idiot",
            ## empty string
            "",
            "This goes even further!",
            "What the hell is going on?",
            "Please. I don't get it. Explain it again",
            ## Gibberish
            "kdlfkmgkdfmgkfmg",
            "Annoying and irrelevant! I'd rather watch the paint drying on the wall!",
            ## Gibberish
            "Hippi Hoppo"),
  textid = c("#efdcxct", "#ehfcsct",
             "#ekacxwt", "#ewatxad",
             "#ekacswt", "#ewftxwd",
             "#eeadswt", "#enfhxed",
             "#efdmjd")
)
```
And run the function with `safe_output = TRUE`.
```{r, echo=F}
Sys.sleep(1)
```
```{r, message=T}
text_sample %>%
  prsp_stream(text = ctext,
              text_id = textid,
              score_model = c("TOXICITY", "SEVERE_TOXICITY", "INSULT"),
              safe_output = TRUE)
```
`safe_output = TRUE` will also provide us with the error messages that occurred so that we can check what went wrong!
Finally, there is one last argument: `verbose = TRUE`. Enable this argument and thanks to [`crayon`](https://github.com/r-lib/crayon) you will receive beautiful console output that guides you along the way, showing you errors and text scores as you go.
```{r, echo=F}
Sys.sleep(1)
```
```{r, eval = F}
text_sample %>%
  prsp_stream(text = ctext,
              text_id = textid,
              score_model = c("TOXICITY", "SEVERE_TOXICITY"),
              verbose = TRUE,
              safe_output = TRUE)
```
![](man/figures/prsp_stream_output.png)
Or the (not as pretty) output in Markdown:
```{r, echo = F}
text_sample %>%
  prsp_stream(text = ctext,
              text_id = textid,
              score_model = c("TOXICITY", "SEVERE_TOXICITY"),
              verbose = TRUE,
              safe_output = TRUE)
```
## Citation
Thank you for using peRspective! Please consider citing:
Votta, Fabio. (2019). peRspective: A wrapper for the Perspective API. Source: https://github.com/favstats.
<div>Icons made by <a href="https://www.freepik.com/?__hstc=57440181.1504a979705d81fb44d4169a0ccdf2ae.1558036002089.1558036002089.1558036002089.1&__hssc=57440181.4.1558036002090&__hsfp=2902986854" title="Freepik">Freepik</a> from <a href="https://www.flaticon.com/" title="Flaticon">www.flaticon.com</a> is licensed by <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0" target="_blank">CC 3.0 BY</a></div>