---
title: "Cat Couch Sentiment"
author: "Johanna Schmidle"
output:
  html_document:
    df_print: paged
---
This is part 4 (the bonus part!) of my Amazon Cat Couch review project. I am just practicing my sentiment analysis skills in R, so this will be a pretty simple notebook.
[**Part 1:** Data Prep](#clean-and-prep) <br>
I will be doing two main things: <br>
1. [General setup](#load) <br>
2. [Create Dataset for text analysis](#text_db)
[**Part 2:** Analysis](#analysis) <br>
Here are my main plots and points of EDA: <br>
1. [WordCloud](#cloud) <br>
2. [Distribution of Sentiments](#pie) <br>
3. [Words in Sentiments](#bar) <br>
4. [Summary Table Sentiment vs Couch Colour](#summary) <br>
5. [Review Sentiments vs Couch Colour](#bars) <br>
<a id="clean-and-prep"></a>
# Data Prep
<a id="load"></a>
### Set up
1. Load the libraries <br>
2. Read in the dataset <br>
3. Clean the column names <br>
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)
```
```{r}
library(readr)
library(dplyr)
library(kableExtra)
library(tidyverse)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
library(textdata)
library(sentimentr)
```
```{r}
# Read the review table; col_types = cols() lets readr guess the column types
couch <- read_csv("/Users/.../CatCouchReviewsTable.csv",
                  col_names = TRUE,
                  col_types = cols())

# Shorten the column names
couch <- couch %>%
  rename(
    Location = `Review Location`,
    Reviewer = `Reviewer Name`,
    Date     = `Review Date`,
    Text     = `Review Text`,
    Colour   = `Colour Name`
  )

glimpse(couch)
```
<a id="text_db"></a>
## Text Dataset
This analysis focuses on the Review Text column, so I will create a special version of the dataset for this purpose.
1. Create couch_txt dataset <br>
2. Unnest tokens <br>
3. Remove numbers from the review text <br>
4. Remove stop words <br>
5. Add sentiment column <br>
```{r}
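# Stop words to drop (default Snowball English list)
v_stopwords <- get_stopwords() %>% .$word

# Regex matching any token that contains a digit
numb <- "[[:digit:]]+"

# One row per word
couch_txt <- couch %>%
  unnest_tokens(word, Text)

couch_txt <- couch_txt %>%
  filter(!str_detect(word, numb)) %>%            # remove numbers
  filter(!(word %in% v_stopwords)) %>%           # remove stop words
  left_join(get_sentiments("nrc"), by = "word")  # add sentiment column

dim_desc(couch_txt)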
```
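To make the join step concrete, here is a minimal sketch on two made-up reviews. The `toy_reviews` and `toy_lexicon` tibbles are hypothetical: the lexicon is a tiny stand-in for NRC with invented word-to-sentiment pairs, not the real data. It illustrates that a word tagged with several sentiments produces one row per sentiment after the join, while a word missing from the lexicon gets `NA`.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two made-up reviews (not from the real dataset)
toy_reviews <- tibble(
  Colour = c("Green", "Blue"),
  Text   = c("My 2 cats love it", "Very disappointed with the quality")
)

# Hypothetical mini-lexicon standing in for NRC
toy_lexicon <- tibble(
  word      = c("love", "love", "disappointed", "disappointed"),
  sentiment = c("positive", "joy", "negative", "anger")
)

toy_txt <- toy_reviews %>%
  unnest_tokens(word, Text) %>%                      # one row per lowercased word
  filter(!str_detect(word, "[[:digit:]]+")) %>%      # drop tokens containing digits
  filter(!(word %in% get_stopwords()$word)) %>%      # drop stop words
  left_join(toy_lexicon, by = "word")                # one row per matching sentiment

toy_txt
```

Because of this, the row count of `couch_txt` reflects word-sentiment pairs rather than words or reviews, which is worth keeping in mind when reading the counts later on.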
<a id="analysis"></a>
# Text Analysis
<a id="cloud"></a>
### WordCloud for Review Text
```{r wordcloud, fig.align='center', fig.height = 3.5, fig.width = 3.5}
# Set2 provides at most 8 colours
pallet <- brewer.pal(8, "Set2")
couch_txt %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, colors = pallet, random.order = FALSE, min.freq = 500))
```
Some of the top words are **cat**, **love**, **perfect**, **happy**, and **good**. Most of these are positive words, which suggests that the sentiment of the Review Text should be mostly positive. Note, however, that the word **disappointed** is also large, which suggests that the people who had negative feelings towards the couch were mostly feeling disappointed. I will analyze the sentiments further and see what I can find.
<a id="pie"></a>
### Distribution of Sentiments
I want to see how most people feel when reviewing the Cat Couch. I will do this by plotting and comparing the number of times each sentiment appears in the *sentiment* column, using a pie chart. Although pie charts are not always the most effective way to plot data, I think this plot is the easiest to analyze visually, as you can clearly see which sentiment is dominant.
```{r, fig.align='center', fig.height = 3.5, fig.width = 3.5}
# Choose the sentiments I want to see
senti <- c("positive", "fear", "surprise", "negative", "anger", "disgust")

couch_txt %>%
  count(sentiment) %>%
  filter(sentiment %in% senti) %>%
  mutate(sentiment = fct_reorder(sentiment, n)) %>%
  ggplot() +
  geom_bar(aes(x = "", y = n, fill = sentiment), stat = "identity") +
  coord_polar("y", start = 0) +
  theme(axis.text.x = element_blank()) +
  labs(x = "", y = "", title = "Sentiments", fill = "")
```
As we can see, the sentiments are overwhelmingly positive. This is consistent with the words we saw in the WordCloud. The second most common sentiment was negative, which could be where the word *disappointed* came from.
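To put a number on how dominant a sentiment is, the counts behind the pie chart can be converted into shares. The sketch below uses a made-up `toy_tags` tibble (illustrative values only, not the real review data); the same `count()` and `mutate()` steps could be applied to `couch_txt` after filtering to the sentiments of interest.

```r
library(dplyr)

# Made-up sentiment tags, one row per tagged word (illustrative only)
toy_tags <- tibble(
  sentiment = c(rep("positive", 6), rep("negative", 3), "anger")
)

toy_share <- toy_tags %>%
  count(sentiment, sort = TRUE) %>%   # rows per sentiment, largest first
  mutate(share = n / sum(n))          # fraction of all tagged words

toy_share
```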
<a id="bar"></a>
### Bar Charts for Top Words in Sentiments
I'm curious to explore the most frequently used words associated with each emotion. I plan to visualize this information using bar graphs for better readability.
```{r fig1, fig.align='center', fig.height = 5, fig.width = 7}
# Sentiment bar charts: top 10 words for each chosen sentiment
couch_txt %>%
  count(sentiment, word) %>%
  filter(sentiment %in% senti) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  guides(x = guide_axis(angle = 45)) + coord_flip() +
  facet_wrap(~sentiment, scales = "free") +
  labs(y = "", x = "",
       title = "Sentiment analysis of Cat Couch Reviews",
       subtitle = "Using the NRC Lexicon")
```
Notice that the word **disappointed** is the top word for the emotions *anger*, *disgust*, and *negative*. This could explain why the word was so large in our WordCloud.
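One caveat with faceted bar charts: `reorder(word, n)` sets a single ordering across all facets, so a word that appears under several sentiments (like **disappointed**) may leave bars unsorted within a facet. A possible alternative, sketched here on made-up counts, uses `slice_max()` to pick the top words and tidytext's `reorder_within()` with `scale_x_reordered()` to sort each facet separately. The `toy_counts` values are invented for illustration.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Made-up word counts per sentiment (illustrative only)
toy_counts <- tibble(
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative",
                "anger", "anger", "anger"),
  word      = c("love", "happy", "good",
                "disappointed", "bad", "cheap",
                "disappointed", "bad", "terrible"),
  n         = c(9, 5, 3, 7, 4, 2, 7, 4, 1)
)

# Keep the 2 most frequent words in each sentiment
top_words <- toy_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

# Order words within each facet rather than globally
p <- top_words %>%
  mutate(word = reorder_within(word, n, sentiment)) %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +                      # strips the suffix reorder_within adds
  facet_wrap(~sentiment, scales = "free")

p
```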
<a id="summary"></a>
### Summary table
I am creating a summary table to look at the relationship between sentiment (positive or negative) and the colour of the purchased couch.
```{r}
couch_summary <- couch_txt %>%
  filter(!is.na(Colour) & !is.na(sentiment)) %>%
  group_by(Colour) %>%
  summarize(
    # Score each row: positive = 1, negative = -1, any other sentiment = 0
    mean_sentiment = mean(ifelse(sentiment == "positive", 1, ifelse(sentiment == "negative", -1, 0))),
    num_positive = sum(sentiment == "positive"),
    num_negative = sum(sentiment == "negative")
  )
couch_summary
```
**Overall Positivity:** Colours like *Black*, *Pink*, *White*, and *Yellow* have mean sentiment scores above 0.1, indicating generally positive sentiment among reviews for these couch colours. *Black* and *Pink* also have relatively low negative counts (2 and 4, respectively), suggesting they are perceived positively by most reviewers.
**Varied Sentiment Levels:** *Blue* and *Green* have mean sentiment scores closer to zero, suggesting mixed sentiments or a more neutral sentiment distribution compared to other colours.
**Positive vs. Negative:** While most colours have clearly more positive counts than negative ones, *Blue* stands out because its negative count (14) is nearly as high as its positive count (17).
**Popular Colours:** *Green* appears to have a higher count overall (46 in total). Note that these counts are sentiment-tagged words rather than whole reviews, because `couch_txt` has one row per word and sentiment.
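It may help to see how `mean_sentiment` is computed for a single colour. In the sketch below, the 17 positive and 14 negative rows match the *Blue* counts quoted above, but the 9 `joy` rows are an invented number standing in for "all other sentiments", so the resulting score is illustrative and not the real value. Rows with any other sentiment score 0, which pulls the mean towards zero.

```r
library(dplyr)

# 17 positive and 14 negative rows (the Blue counts above),
# plus a hypothetical 9 rows tagged with another sentiment
toy_colour <- tibble(
  sentiment = c(rep("positive", 17), rep("negative", 14), rep("joy", 9))
)

toy_score <- toy_colour %>%
  summarize(
    # Same scoring as the summary table: +1, -1, otherwise 0
    mean_sentiment = mean(case_when(
      sentiment == "positive" ~ 1,
      sentiment == "negative" ~ -1,
      TRUE ~ 0
    )),
    # Equivalent closed form: (positives - negatives) / all rows
    net_share = (sum(sentiment == "positive") - sum(sentiment == "negative")) / n()
  )

toy_score
```

With these numbers the score is (17 - 14) / 40 = 0.075. A colour can therefore have a low score either because positives and negatives are balanced or because many of its rows carry other sentiments.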
<a id="bars"></a>
### Review Sentiment vs Couch Colour
I am going to plot the information from above.
I want to compare the positive and negative sentiments per couch colour, and I also want to compare all sentiments per couch colour. I will use grouped bar charts for this.
```{r, fig.align='center', fig.height = 3.5, fig.width = 5}
# Positive vs negative only
couch_txt %>%
  filter(!is.na(sentiment)) %>%
  filter(!is.na(Colour)) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(Colour, sentiment) %>%
  ggplot(aes(x = Colour, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Positive vs Negative Sentiment by Couch Colour",
       x = "Couch Colour",
       y = "Number of Words") +
  theme_minimal()

# All sentiments
couch_txt %>%
  filter(!is.na(sentiment)) %>%
  filter(!is.na(Colour)) %>%
  count(Colour, sentiment) %>%
  ggplot(aes(x = Colour, y = n, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "All Sentiments by Couch Colour",
       x = "Couch Colour",
       y = "Number of Words") +
  theme_minimal()
```