---
title: 'Reproducible Research: Peer Assessment 1'
output:
  html_document:
    fig_caption: yes
    keep_md: yes
---
## Loading and preprocessing the data
After unzipping the given file, load the `csv` file:
```{r load-data, echo=TRUE}
activity <- read.csv('data/activity.csv')
```
Next, preprocess the data for the questions below. Here that means converting the `date` variable to the `Date` class, so that each date can be classified as a weekday or weekend day in the last section (this could also be done later on).
```{r}
activity$date <- as.Date(activity$date)
```
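
As a minimal sketch of the two steps above, the snippet below reads a tiny inline CSV and converts `date` explicitly. The values are made up, and it assumes the same three columns (`steps`, `date`, `interval`) with ISO-formatted dates; `toy` and `csvText` are illustrative names only.

```r
# Hypothetical two-row sample with the same columns as activity.csv
csvText <- "steps,date,interval\nNA,2012-10-01,0\n5,2012-10-02,5\n"

# Read the text as CSV; the date column arrives as character
toy <- read.csv(text = csvText, stringsAsFactors = FALSE)

# Convert to Date with an explicit format (assumed here to be year-month-day)
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

str(toy)
```

Spelling out `format` makes the conversion fail visibly (with `NA` dates) if the file uses a different date layout.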
## What is mean total number of steps taken per day?
1. Histogram of the total number of steps taken each day
```{r histogram-per-day, echo=TRUE}
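library(ggplot2)
# Total steps per day; with na.rm = TRUE a day containing only NAs gets a total of 0
stepsPerDay <- aggregate(activity$steps, by = list(Day = activity$date), FUN = sum, na.rm = TRUE)
ggplot(stepsPerDay, aes(x = x)) +
  geom_histogram(binwidth = 1000) +
  xlab("Total number of steps taken each day")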
```
2. Calculate and report the **mean** and **median** total number of steps taken per day
```{r mean-median, echo=TRUE}
stepsMean <- mean(stepsPerDay$x, na.rm = TRUE)
stepsMedian <- median(stepsPerDay$x, na.rm = TRUE)
```
The mean is **`r format(stepsMean)`** and the median is **`r format(stepsMedian)`**.
## What is the average daily activity pattern?
1. Make a time series plot (i.e. `type = "l"`) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
```{r time-series, echo=TRUE}
# Mean steps per 5-minute interval across all days, ignoring NAs
avgStepsPerInterval <- aggregate(x = list(meanSteps = activity$steps),
                                 by = list(interval = activity$interval),
                                 FUN = mean, na.rm = TRUE)
ggplot(data = avgStepsPerInterval, aes(x = interval, y = meanSteps)) +
  geom_line(colour = "#333333", size = 1.1) +
  xlab("5-minute interval") +
  ylab("Average number of steps")
```
2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
```{r max-interval, echo=TRUE}
# Row with the highest mean (which.max returns the first one if there are ties)
maxSteps <- avgStepsPerInterval[which.max(avgStepsPerInterval$meanSteps), ]
```
The interval with the highest average number of steps (`r format(maxSteps$meanSteps)` on average) is **`r format(maxSteps$interval)`**.
## Imputing missing values
1. Calculate and report the total number of missing values in the dataset
```{r na-values, echo=TRUE}
naValues <- sum(is.na(activity$steps))
```
The total number of missing values (`NA` values) in the dataset is **`r format(naValues)`**.
2. Devise a strategy for filling in all of the missing values in the dataset
Each missing value is replaced with the mean number of steps for its 5-minute interval, averaged across all days. To do that, a temporary data frame is built with an additional column holding the mean number of steps for the corresponding interval.
```{r temp, echo=TRUE}
temp <- merge(activity, avgStepsPerInterval, by ="interval")
```
3. Create a new dataset that is equal to the original dataset but with the missing data filled in
Then, still in the temporary data frame, each missing `steps` value is replaced with the corresponding interval mean. Dropping the extra `meanSteps` column yields a dataset with the same variables as the original but with the missing values filled in.
```{r fill-missing, echo=TRUE}
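temp$steps <- ifelse(is.na(temp$steps), temp$meanSteps, temp$steps)
# merge() puts the key column first and sorts rows by it, so select the original
# columns by name (dropping meanSteps) and sort the rows by date and interval
activityFilled <- temp[order(temp$date, temp$interval), c("steps", "date", "interval")]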
```
4. Make a histogram of the total number of steps taken each day
The same histogram as for question 1 is drawn, but with filled data:
```{r filled-histogram, echo=TRUE}
stepsPerDayFilled <- aggregate(activityFilled$steps, by = list(Day = activityFilled$date), FUN = sum, na.rm = TRUE)
ggplot(stepsPerDayFilled, aes(x = x)) +
  geom_histogram(binwidth = 1000) +
  xlab("Total number of steps taken per day (with missing values filled)")
```
5. Calculate and report the **mean** and **median** total number of steps taken per day.
```{r filled-mean-median, echo=TRUE}
stepsMeanF <- mean(stepsPerDayFilled$x, na.rm = TRUE)
stepsMedianF <- median(stepsPerDayFilled$x, na.rm = TRUE)
```
After filling missing values, the mean is **`r format(stepsMeanF)`** and the median is **`r format(stepsMedianF)`**.
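To sum up:
| | With NAs | Without NAs |
|--------|-------------------------|--------------------------|
| mean | `r format(stepsMean)` | `r format(stepsMeanF)` |
| median | `r format(stepsMedian)` | `r format(stepsMedianF)` |
The values differ slightly, especially the mean. Any day whose values are all missing gets a total of 0 when summed with `na.rm = TRUE`, which pulls the original mean down; once such days are filled with interval means, they no longer do. Also, after filling missing values, the mean and median are the same.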
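
The effect summarized above can be reproduced on a small scale. The following is a minimal sketch on made-up data (three days, two intervals, one day entirely missing), not the assignment's dataset. It uses `ave()` rather than `merge()` to attach interval means, which keeps the original row order; `toy`, `totals` and `totalsFilled` are illustrative names.

```r
# Hypothetical data: 3 days x 2 intervals; the second day is entirely missing
toy <- data.frame(
  steps    = c(10, 30, NA, NA, 20, 40),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Daily totals with na.rm = TRUE: the all-NA day is summed as 0
totals <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Mean per interval (ignoring NAs), repeated for every row of that interval
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(s) mean(s, na.rm = TRUE))

# Replace each NA with the mean of its interval
filled <- toy
filled$steps <- ifelse(is.na(toy$steps), intervalMean, toy$steps)
totalsFilled <- tapply(filled$steps, filled$date, sum)

# Compare mean and median of the daily totals before and after filling
c(mean = mean(totals), median = median(totals))
c(mean = mean(totalsFilled), median = median(totalsFilled))
```

In this toy case the missing day moves from a total of 0 to the sum of the interval means, so the mean of the daily totals rises and lands on the median.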
## Are there differences in activity patterns between weekdays and weekends?
1. Create a new factor variable in the dataset with two levels -- "weekday" and "weekend" indicating whether a given date is a weekday or weekend day
```{r weekends, echo=TRUE}
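# Classify dates without relying on locale-specific day names:
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday
dayType <- function(dates) {
  wday <- as.POSIXlt(dates)$wday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}
# Use the dates of activityFilled itself, whose rows may be ordered differently from activity
activityFilled$weekend <- dayType(activityFilled$date)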
```
2. Make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
```{r panel-grid, echo=TRUE}
# Mean steps per interval, computed separately for weekdays and weekends
activityFilledDailyMean <- aggregate(activityFilled$steps,
  by = list(interval = activityFilled$interval, weekend = activityFilled$weekend), FUN = mean)
ggplot(activityFilledDailyMean, aes(interval, x)) +
  geom_line(colour = "#333333", size = 1.1) +
  facet_grid(weekend ~ .) +
  ylab("Average number of steps")
```
There are slight differences between weekdays and weekends: weekdays show more steps in the earliest hours of the day (possibly due to commutes to work), while weekends have more _rest_ periods.
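
The weekday/weekend comparison can also be checked numerically rather than by eye. The sketch below does this on made-up data (one Friday, one Saturday, one Sunday), classifying dates by weekday number so the result does not depend on the system locale; `toy`, `dayType` (a column here) and `byType` are illustrative names.

```r
# Hypothetical data: Friday 2012-10-05, Saturday 2012-10-06, Sunday 2012-10-07
toy <- data.frame(
  steps    = c(0, 50, 10, 20, 30, 40),
  date     = as.Date(rep(c("2012-10-05", "2012-10-06", "2012-10-07"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Weekday number: 0 = Sunday, ..., 6 = Saturday (independent of locale)
wday <- as.POSIXlt(toy$date)$wday
toy$dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                      levels = c("weekday", "weekend"))

# Mean steps for every interval / day-type combination
byType <- aggregate(steps ~ interval + dayType, data = toy, FUN = mean)
byType
```

Each row of `byType` corresponds to one point in the panel plot, so large gaps between the two day types at the same interval are where the panels visibly differ.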