---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading the data
***
The data was downloaded from [this link](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip)
provided in the assignment. The data in the file is described in the README.md
document. Briefly, there are three columns in the CSV named 'steps', 'date', and
'interval'. The read.csv function, along with specified colClasses, can be used
to read the data into R. The package data.table is also used to turn the data
frame into a data.table for easier manipulation.
```{r, echo=TRUE}
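# Read the CSV straight from the zip archive, fixing the column types up front
activity <- read.csv(unz('activity.zip', 'activity.csv'),
                     colClasses = c('numeric', 'Date', 'numeric'))
# Convert to a data.table for easier grouping and subsetting
library(data.table)
activity <- data.table(activity)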
```
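As a small illustration of how ```colClasses``` fixes the column types at read time, the sketch below reads a made-up three-row CSV string instead of the real file. The values and the names ```csv_text``` and ```toy``` are assumptions for demonstration only.

```r
# Made-up CSV text standing in for activity.csv (hypothetical values)
csv_text <- 'steps,date,interval
NA,2012-10-01,0
5,2012-10-01,5
12,2012-10-02,0'

# colClasses is matched to the columns by position: steps, date, interval
toy <- read.csv(text = csv_text,
                colClasses = c('numeric', 'Date', 'numeric'))

# Inspect the resulting column types
str(toy)
```

With this call the ```date``` column arrives as a ```Date``` rather than as character, so it can be passed to date functions such as ```weekdays``` without further conversion.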
## What is the mean total number of steps taken per day?
***
For this part of the assignment, the missing values in the dataset are ignored
(i.e. ```na.rm = TRUE``` is used as necessary and missing values are not imputed).
The first step is to calculate the total number of steps taken per day. The ```sum```
function is applied to the data table with the option ```by = date``` to get
the sum of steps per day, and the result is saved to a new variable called ```total```.
The first few rows of the ```total``` data table are shown below.
```{r, echo=TRUE}
total <- activity[,list(steps = sum(steps, na.rm = TRUE)), by = date]
head(total)
```
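One side effect of combining ```sum``` with ```na.rm = TRUE``` can be seen on a toy example: a day whose values are all missing gets a total of 0 rather than NA. The data below are made up, and ```toy``` and ```toy_total``` are hypothetical names used only for this sketch.

```r
library(data.table)

# Made-up data: the first day has only missing values, the second has two readings
toy <- data.table(date  = as.Date(c('2012-10-01', '2012-10-01',
                                    '2012-10-02', '2012-10-02')),
                  steps = c(NA, NA, 10, 20))

# Same grouping pattern as used for the real data
toy_total <- toy[, list(steps = sum(steps, na.rm = TRUE)), by = date]
toy_total
```

Here the first day is reported with a total of 0 and the second with 30, so days without any readings still enter the histogram, the mean, and the median as zero-step days.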
A histogram of the total number of steps taken each day is then made. No break
points were specified, so ```hist``` chooses the bins by default.
```{r, echo=TRUE}
hist(total$steps, xlab = 'Total steps per day', main = 'Total Steps Per Day')
```
The histogram indicates that the most common daily total is between 10,000 and
15,000 steps.
The mean and median of the total number of steps taken per day are then
calculated and reported.
```{r, echo=TRUE}
mean_steps <- mean(total$steps)
median_steps <- median(total$steps)
```
The computed values are printed below.
```{r, echo=TRUE}
mean_steps
median_steps
```
## What is the average daily activity pattern?
***
To create a time series plot (i.e. ```type = "l"```) of the 5-minute interval (x-axis)
and the number of steps taken, averaged across all days (y-axis), a data
table that summarizes the mean steps per 5-minute interval is created. The data
table (here saved as ```by_interval```) is then plotted.
```{r, echo=TRUE}
by_interval <- activity[, list(steps = mean(steps, na.rm = TRUE)), by = interval]
with(by_interval, plot(interval,
                       steps,
                       type = 'l',
                       ylab = 'Average Number of Steps',
                       xlab = '5-minute Interval'))
```
To find out which 5-minute interval, on average across all the days in the
dataset, contains the maximum number of steps, the function ```max``` can be
used. As shown below, interval 835 has the maximum average number of steps
across all the days in the dataset.
```{r, echo=TRUE}
by_interval[steps == max(by_interval$steps)]
```
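An alternative way to pick out the row with the largest value is ```which.max```. The sketch below uses made-up numbers (```toy_by_interval``` is a hypothetical name) to show how the two approaches differ when the maximum is tied.

```r
library(data.table)

# Made-up interval averages with a tie for the maximum
toy_by_interval <- data.table(interval = c(0, 5, 10, 15),
                              steps    = c(1.5, 7.0, 3.2, 7.0))

# which.max returns the position of the first maximum only
first_max <- toy_by_interval[which.max(steps)]

# Comparing against max() keeps every row that attains the maximum
all_max <- toy_by_interval[steps == max(steps)]

first_max
all_max
```

When the maximum is unique, as one would expect for averaged step counts, both forms return the same single row.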
## Imputing Missing Values
***
To calculate and report the total number of missing values in the dataset (i.e.
the total number of rows with NAs), the ```summary``` or the ```is.na```
functions can be used.
```{r, echo=TRUE}
summary(activity)
nrow(activity[is.na(activity$steps)])
```
A total of 2304 rows have a missing ```steps``` value. A new dataset (here called
```imputed```) that is equal to the original dataset but with the missing data
filled in with the mean for that 5-minute interval is created. To do this, a data
table containing the mean for each interval was made (here called
```interval_means```). The ```imputed``` dataset is then created by looking for the
rows with NAs in the ```activity``` dataset and filling them in with the
corresponding mean for that interval from the ```interval_means``` data table.
```{r, echo=TRUE}
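interval_means <- activity[, list(steps = mean(steps, na.rm = TRUE)), by = interval]
# Work on a copy so that := does not modify activity by reference
imputed <- copy(activity)
# Look up each missing row's interval mean explicitly with match(),
# rather than relying on vector recycling to line the values up
imputed[is.na(steps),
        steps := interval_means$steps[match(interval, interval_means$interval)]]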
summary(imputed)
```
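Compare the summary of the imputed data with the summary of the original data.
There are no longer any NAs in the ```steps``` column and the third quartile has
changed. The mean number of steps for the entire dataset remains the same, as it
would if every interval had the same number of missing values (2304 = 8 × 288,
i.e. the equivalent of eight full days of 288 five-minute intervals).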
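The effect of interval-mean imputation on the overall mean can be checked on a toy example. The sketch below uses made-up data in which one whole day is missing, and fills the gaps with a grouped ```:=``` assignment; this is one possible formulation, not necessarily the one used above, and all names in it are hypothetical.

```r
library(data.table)

# Made-up data: three days, two intervals per day, first day entirely missing
toy <- data.table(date     = as.Date(rep(c('2012-10-01', '2012-10-02',
                                           '2012-10-03'), each = 2)),
                  interval = rep(c(0, 5), times = 3),
                  steps    = c(NA, NA, 2, 10, 4, 20))

# Copy first, because := modifies a data.table by reference
toy_imputed <- copy(toy)

# Within each interval, replace NAs with that interval's mean
toy_imputed[, steps := ifelse(is.na(steps), mean(steps, na.rm = TRUE), steps),
            by = interval]

# Overall means before and after imputation
mean_before <- mean(toy$steps, na.rm = TRUE)
mean_after  <- mean(toy_imputed$steps)
```

In this toy case the missing day is filled with 3 and 15, and both overall means equal 9, because every interval is missing the same number of values.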
The total steps taken for each day can be computed in the same way as for the
dataset with no imputation. A histogram can also be created.
```{r, echo=TRUE}
imputed_total <- imputed[,list(steps = sum(steps)), by = date]
hist(imputed_total$steps,
     main = 'Total Steps per Day (Missing Values Imputed)',
     xlab = 'Total steps per day')
```
The histogram of the imputed dataset shows that the most common daily total is
still between 10,000 and 15,000 steps. There are now fewer days with a total
between 0 and 5,000 steps compared to the dataset where no imputation was done.
The mean and median of the imputed dataset are computed as follows.
```{r}
imputed_mean <- mean(imputed_total$steps)
imputed_median <- median(imputed_total$steps)
```
Printing these values gives the new mean and median of total steps per day for
the imputed dataset. These differ from the estimates obtained in the first part
of the assignment, where no imputation was done: the mean and median of the
imputed dataset are both higher. This is expected, because
```sum(steps, na.rm = TRUE)``` returns 0 for any day whose values are all missing,
which pulled the earlier estimates down. Also, the mean and median of the imputed
dataset are the same.
```{r}
imputed_mean
imputed_median
```
## Are there differences in activity patterns between weekdays and weekends?
***
To determine whether there are differences in activity patterns between weekdays
and weekends, a new data table (here called ```weekends```) was created. In this
data table, a new column (called ```days```) was added which states whether the
date for a row falls on a weekend or a weekday. The ```days``` column was then
used as a grouping factor to summarize the data and create plots with the interval
on the x-axis and the average number of steps on the y-axis.
```{r}
weekends <- transform(activity,
                      days = ifelse(weekdays(date) == 'Saturday' |
                                      weekdays(date) == 'Sunday',
                                    'weekend',
                                    'weekday'))
weekends$days <- factor(weekends$days)
by_interval_w <- weekends[, list(steps = mean(steps, na.rm = TRUE)),
                          by = c('interval', 'days')]
par(mfrow = c(2, 1))
with(by_interval_w[days == 'weekend'],
     plot(interval,
          steps,
          type = 'l',
          ylim = c(0, 200),
          main = 'Weekend'))
with(by_interval_w[days == 'weekday'],
     plot(interval,
          steps,
          type = 'l',
          ylim = c(0, 200),
          main = 'Weekday'))
```
The plots above show that on the weekend, more of the intervals above 1000 seem
to have a higher average number of steps than on weekdays. Does this mean that
people walk, run, or jog more on weekends? Perhaps.
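The weekend/weekday classification above compares the output of ```weekdays``` with English day names, and that output depends on the session locale. A locale-independent variant, sketched here on a few hand-picked dates (the names ```dates```, ```day_index```, and ```day_type``` are hypothetical), uses the numeric day of the week from ```as.POSIXlt``` instead.

```r
# Four consecutive dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c('2012-10-05', '2012-10-06', '2012-10-07', '2012-10-08'))

# wday is numeric and locale-independent: 0 = Sunday, ..., 6 = Saturday
day_index <- as.POSIXlt(dates)$wday

# Label Saturdays and Sundays as weekend, everything else as weekday
day_type <- factor(ifelse(day_index %in% c(0, 6), 'weekend', 'weekday'))

data.frame(date = dates, day_type = day_type)
```

The same expression could be used for the ```days``` column if the analysis needs to run unchanged in a non-English locale.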