-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathUS hate_crimes Exploratory Data analysis.R
129 lines (82 loc) · 3.85 KB
/
US hate_crimes Exploratory Data analysis.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
# This project aims at performing an exploratory data analysis of a data set from Bloomington Police Department (Indiana, USA)
#cases where a hate or bias crime has been reported from 02/06/2016 to 02/25/2022.
# The dataset has been retrevied directly from the Data Gov website (url: https://catalog.data.gov/dataset/hate-crimes).
#--------------------------------------------------------------------------------------------------------------------------------------
# Here we go !
# Setting the working directory to the location of the data set file.
setwd("C:/Users/OS/Desktop")
# Loading the data set
x <- read.csv("Hate_crimes.csv")
# View the data
View(x)
colnames(x)
#_______________________
# Summary statistics
#_______________________
str(x)
dim(x)
head(x)
tail(x)
# Check to see if there are missing data?
sum(is.na(x))
#Get a larger set of statistics (a broader overview)
install.packages("skimr")
library(skimr)
skim(x)
#____________________
# Data transformation
#____________________
x$motivation
# Here, we notice that there are 4 modalities ("Anti-Gay"/ "Anti-Homosexual Male"/ "Anti-Male Homosexual"/ "Anti-LGBT+" )
#that can be grouped in one single category: "Anti-LGBT+".
# So to make things simple, we choose to integrate the 3 modalities ("Anti-Gay"/ "Anti-Homosexual Male"/ "Anti-Male Homosexual") within
#the "Anti-LGBT+" modality.
x$motivation[7] <- "Anti-Homosexual Male"
which(x$motivation == "Anti-Gay")
x$motivation[39:41] <- "Anti-LGBT+"
which(x$motivation == "Anti-Homosexual Male")
x$motivation[c(1,7,15,20)] <- "Anti-LGBT+"
x$offender_race
factor(x$offender_race)
table(factor(x$offender_race))
# Another anomaly detected here : The white race is represented by 2 different forms ("W" uppercase and "w" lowercase) due to the case sensitivity of R.
# We are going to correct that by transforming "w" lowercase into "W" uppercase.
which(x$offender_race == "w")
x$offender_race[c(18,25)] <- "W"
x$offense
# A third anomaly noticed regarding to the nomenclature of the types of offense.
# Some types of offense are referred to by two slighlty different names.
which(x$offense == "Simple Assault/DC")
x$offense[31] <- "Simple Assault"
which(x$offense == "Aggravated Battery (att)")
x$offense[23] <- "Aggravated Battery"
#____________________________
# Quick data visualization
#____________________________
# Eventually, the visualization will allow answering some interesting questions that flood the mind right off the bat.
#Create a function that facilitates the visualization of different variables:
vis <- function(var) {
f <- factor(var)
barplot(table(f))
}
# What is the frequency of the number of victims ?
vis(x$victims)
# What are the most frequent hate crime's motivations ?
vis(x$motivation)
# The week's days where the hate crimes occur the most ?
vis(x$weekday)
# What is the gender the most affected by hate crimes ?
vis(x$victim_gender)
# What is the race of the offender in most cases ?
vis(x$offender_race)
# What is the kind of offense the most reported ?
vis(x$offense)
# However, data visualization will not be always useful when it comes down to examning the distribution within each variable.
# In the case of a categorical variable which contains too many modalities, it would be difficult to look at the distribution using a barplot.
# For example, to find out in which places the hate crimes occur the most, we should look at the "location_type" variable that contains 18 modalities.
# A barplot of the data of this variable will be so messy.
vis(x$location_type)
# So to answer the question, we are going to build a table comprised of all the modalities with their respective frequencies.
factor(x$location_type)
levels(factor(x$location_type))
sort(table(factor(x$location_type))