---
title: "Week 1 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
***
## Exercise 1 (Subsetting and Statistics)
For this exercise, we will use the `msleep` dataset from the `ggplot2` package.
**(a)** Install and load the `ggplot2` package. **Do not** include the installation command in your `.Rmd` file. (If you do it will install the package every time you knit your file.) **Do** include the command to load the package into your environment.
```{r include=FALSE}
library(ggplot2)
```
**(b)** Note that this dataset is technically a `tibble`, not a data frame. How many observations are in this dataset? How many variables? What are the observations in this dataset?
```{r}
?msleep
```
**No. of observations** = 83
**No. of variables** = 11
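**Observations**: Each observation is a mammal (one row per species), described by its sleep habits and body measurements. `msleep` is an updated and expanded version of the mammals sleep dataset.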
**(c)** What is the mean hours of REM sleep of individuals in this dataset?
```{r}
mean(msleep$sleep_rem,na.rm = TRUE)
```
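
The `na.rm = TRUE` argument matters because `mean()` returns `NA` as soon as any element is missing. A minimal sketch with a made-up vector (`rem_hours` is hypothetical and not taken from `msleep`):

```{r}
# Hypothetical REM-sleep values with one missing entry
rem_hours = c(1.5, NA, 2.5, 2.0)

mean(rem_hours)                # NA: the missing value propagates
mean(rem_hours, na.rm = TRUE)  # 2: average of the three observed values
```
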
**(d)** What is the standard deviation of brain weight of individuals in this dataset?
```{r}
sd(msleep$brainwt,na.rm = TRUE)
```
**(e)** Which observation (provide the `name`) in this dataset gets the most REM sleep?
```{r}
msleep$name[which.max(msleep$sleep_rem)]
```
**(f)** What is the average bodyweight of carnivores in this dataset?
```{r}
mean(subset(msleep,vore == "carni")$bodywt)
```
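
`subset()` drops rows where the condition evaluates to `NA`, whereas plain logical indexing with `[` would keep them as `NA` entries and turn the mean into `NA`. A small sketch on a made-up data frame (`toy` and its values are hypothetical):

```{r}
# Hypothetical data with a missing diet category
toy = data.frame(
  vore   = c("carni", "herbi", NA, "carni"),
  bodywt = c(50, 10, 3, 70)
)

# subset() treats NA in the condition as FALSE
mean(subset(toy, vore == "carni")$bodywt)     # 60

# With `[`, wrapping the condition in which() selects the same rows
mean(toy$bodywt[which(toy$vore == "carni")])  # 60
```
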
***
## Exercise 2 (Plotting)
For this exercise, we will use the `birthwt` dataset from the `MASS` package.
**(a)** Note that this dataset is a data frame and all of the variables are numeric. How many observations are in this dataset? How many variables? What are the observations in this dataset?
```{r include=FALSE}
library(MASS)  # load the package first so the help page for birthwt can be found
?birthwt
```
**observations: each row describes one birth (a mother and her infant), recorded together with risk factors associated with low infant birth weight**
**no. of observations: 189**
**no. of variables: 10**
**(b)** Create a scatter plot of birth weight (y-axis) vs mother's weight before pregnancy (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
```{r libraries, echo=FALSE}
library(MASS)
plot(birthwt$lwt, birthwt$bwt, col = "blue",
     xlab = "mother's weight in pounds", ylab = "birth weight in grams",
     main = "comparison of mother's weight vs birth weight of child")
```
**Mothers with a higher weight seem more likely to deliver heavier babies.**
**(c)** Create a scatter plot of birth weight (y-axis) vs mother's age (x-axis). Use a non-default color for the points. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the scatter plot, does there seem to be a relationship between the two variables? Briefly explain.
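
The scatter plot requested in part (c) can be sketched as follows, assuming the `age` (mother's age in years) and `bwt` (birth weight in grams) columns of `birthwt`. The title, labels, and color are illustrative choices:

```{r}
library(MASS)  # provides the birthwt data frame

# Birth weight (grams) against mother's age (years)
plot(bwt ~ age, data = birthwt,
     xlab = "Mother's age (years)",
     ylab = "Birth weight (grams)",
     main = "Birth Weight vs. Mother's Age",
     col  = "darkorange", pch = 20)
```
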
**(d)** Create side-by-side boxplots for birth weight grouped by smoking status. Use non-default colors for the plot. (Also, be sure to give the plot a title and label the axes appropriately.) Based on the boxplot, does there seem to be a difference in birth weight for mothers who smoked? Briefly explain.
```{r echo=FALSE}
boxplot(bwt~smoke,data=birthwt, main="smoking status vs. birth weight",
xlab="smoking status during pregnancy", ylab="birth weight in grams",col = "blue")
```
**The difference is not large, but mothers who do not smoke tend to deliver heavier babies than mothers who smoke.**
***
## Exercise 3 (Importing Data, More Plotting)
For this exercise we will use the data stored in [`nutrition-2018.csv`](nutrition-2018.csv). It contains the nutritional values per serving size for a large variety of foods as calculated by the USDA in 2018. It is a cleaned version totaling 5956 observations and is current as of April 2018.
The variables in the dataset are:
- `ID`
- `Desc` - short description of food
- `Water` - in grams
- `Calories` - in kcal
- `Protein` - in grams
- `Fat` - in grams
- `Carbs` - carbohydrates, in grams
- `Fiber` - in grams
- `Sugar` - in grams
- `Calcium` - in milligrams
- `Potassium` - in milligrams
- `Sodium` - in milligrams
- `VitaminC` - vitamin C, in milligrams
- `Chol` - cholesterol, in milligrams
- `Portion` - description of standard serving size used in analysis
**(a)** Create a histogram of `Calories`. Do not modify `R`'s default bin selection. Make the plot presentable. Describe the shape of the histogram. Do you notice anything unusual?
```{r}
library(readr)
nutrition_2018 = read_csv("nutrition-2018.csv")
hist(nutrition_2018$Calories,xlab = "Calories",main="Calories for different foods",col="blue")
```
**The distribution is right-skewed: most foods sit at the low end of the calorie range, with a long tail toward higher values. There are two spikes, near 400 and 800 calories, which are unusual given the otherwise steady decline.**
**(b)** Create a scatter plot of calories (y-axis) vs protein (x-axis). Make the plot presentable. Do you notice any trends? Do you think that knowing only the protein content of a food, you could make a good prediction of the calories in the food?
```{r}
plot(nutrition_2018$Protein,nutrition_2018$Calories,xlab="Protein",ylab="Calories",main="Calories vs. Protein",col = "blue")
```
**As protein increases, calories also tend to increase.**
**(c)** Create a scatter plot of `Calories` (y-axis) vs `4 * Protein + 4 * Carbs + 9 * Fat` (x-axis). Make the plot presentable. You will either need to add a new variable to the data frame, or use the `I()` function in your formula in the call to `plot()`. If you are at all familiar with nutrition, you may realize that this formula calculates the calorie count based on the protein, carbohydrate, and fat values. You'd expect then that the result here is a straight line. Is it? If not, can you think of any reasons why it is not?
```{r}
plot(4 * nutrition_2018$Protein + 4 * nutrition_2018$Carbs + 9 * nutrition_2018$Fat, nutrition_2018$Calories,
     xlab = "protein, carb and fat", ylab = "Calories", main = "Protein, Carb & Fat vs. Calories", col = "blue")
```
**The result is approximately a straight line.**
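
One way to judge how closely the points follow the 4-4-9 rule is to store the calculated value as a new variable and draw the line $y = x$ for reference. The sketch below uses a made-up data frame (`foods`, `CalcCalories`, and all values are hypothetical) rather than `nutrition-2018.csv`:

```{r}
# Hypothetical foods; values are made up for illustration only
foods = data.frame(
  Protein  = c(10, 2, 25, 0),
  Carbs    = c(30, 15, 0, 5),
  Fat      = c(5, 0, 10, 20),
  Calories = c(205, 68, 190, 210)
)

# Calories implied by the 4-4-9 rule
foods$CalcCalories = 4 * foods$Protein + 4 * foods$Carbs + 9 * foods$Fat

plot(Calories ~ CalcCalories, data = foods,
     xlab = "4 * Protein + 4 * Carbs + 9 * Fat",
     ylab = "Calories (kcal)",
     main = "Reported vs. Calculated Calories",
     col  = "dodgerblue")
abline(0, 1, lty = 2)  # points on this line match the rule exactly

# Reported minus calculated; nonzero entries lie off the reference line
foods$Calories - foods$CalcCalories
```
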
***
## Exercise 4 (Writing and Using Functions)
For each of the following parts, use the following vectors:
```{r}
a = 1:10
b = 10:1
c = rep(1, times = 10)
d = 2 ^ (1:10)
```
**(a)** Write a function called `sum_of_squares`.
- Arguments:
    - A vector of numeric data `x`
- Output:
    - The sum of the squares of the elements of the vector $\sum_{i = 1}^n x_i^2$
Provide your function, as well as the result of running the following code:
```{r echo=TRUE}
sum_of_squares = function(x) {
  # return the sum of the squared elements of x
  sum(x ^ 2)
}
sum_of_squares(x = a)
sum_of_squares(x = c(c, d))
```
**(b)** Using only your function `sum_of_squares()`, `mean()`, `sqrt()`, and basic math operations such as `+` and `-`, calculate
\[
\sqrt{\frac{1}{n}\sum_{i = 1}^n (x_i - 0)^{2}}
\]
where the $x$ vector is `d`.
```{r}
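# sum_of_squares() returns a single number, so mean() would leave it unchanged; length() supplies n
sqrt(sum_of_squares(d - 0) / length(d))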
```
**(c)** Using only your function `sum_of_squares()`, `mean()`, `sqrt()`, and basic math operations such as `+` and `-`, calculate
\[
\sqrt{\frac{1}{n}\sum_{i = 1}^n (x_i - y_i)^{2}}
\]
where the $x$ vector is `a` and the $y$ vector is `b`.
```{r}
# divide the sum of squared differences by n before taking the square root
sqrt(sum_of_squares(a - b) / length(a))
```
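
The $\frac{1}{n}$ factor in these formulas needs care: `sum_of_squares()` returns a single number, and `mean()` of a single number is that number, so the division by $n$ has to be done explicitly. A minimal check using the vectors `a` and `b` from this exercise. Note that `length()` is not among the functions the exercise lists, so its use here is an assumption about what is acceptable:

```{r}
sum_of_squares = function(x) {
  sum(x ^ 2)
}

a = 1:10
b = 10:1

# mean() of a single number returns that number unchanged
mean(sum_of_squares(a - b))              # 330

# Dividing by n gives the 1/n factor in the formula
sqrt(sum_of_squares(a - b) / length(a))  # sqrt(33), about 5.74

# Cross-check against a direct computation
sqrt(mean((a - b) ^ 2))
```
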
***
## Exercise 5 (More Writing and Using Functions)
For each of the following parts, use the following vectors:
```{r}
set.seed(42)
x = 1:100
y = rnorm(1000)
z = runif(150, min = 0, max = 1)
```
**(a)** Write a function called `list_extreme_values`.
- Arguments:
    - A vector of numeric data `x`
    - A positive constant, `k`, with a default value of `2`
- Output:
    - A list with two elements:
        - `small`, a vector of elements of `x` that are $k$ sample standard deviations less than the sample mean. That is, the observations that are smaller than $\bar{x} - k \cdot s$.
        - `large`, a vector of elements of `x` that are $k$ sample standard deviations greater than the sample mean. That is, the observations that are larger than $\bar{x} + k \cdot s$.
Provide your function, as well as the result of running the following code:
```{r echo=TRUE}
list_extreme_values = function(x, k = 2) {
  x_sample_mean = mean(x)
  x_sample_sd = sd(x)
  # elements more than k sample standard deviations below / above the sample mean
  list(small = x[x < x_sample_mean - k * x_sample_sd],
       large = x[x > x_sample_mean + k * x_sample_sd])
}
list_extreme_values(x = x, k = 1)
list_extreme_values(x = y, k = 3)
list_extreme_values(x = y, k = 2)
list_extreme_values(x = z, k = 1.5)
```
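
To see the cutoffs $\bar{x} - k \cdot s$ and $\bar{x} + k \cdot s$ at work on something small enough to check by hand, here is a sketch on a made-up vector (`v` is hypothetical). Its mean is 22 and its sample standard deviation is about 43.6, so with `k = 1` only the value 100 is extreme:

```{r}
list_extreme_values = function(x, k = 2) {
  x_bar = mean(x)
  s     = sd(x)
  list(small = x[x < x_bar - k * s],
       large = x[x > x_bar + k * s])
}

# Hypothetical data: four small values and one outlier
v = c(1, 2, 3, 4, 100)

res = list_extreme_values(v, k = 1)
res$small  # numeric(0): nothing lies below mean - 1 sd
res$large  # 100
```
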
**(b)** Using only your function `list_extreme_values()`, `mean()`, and basic list operations, calculate the mean of observations that are greater than 1.5 standard deviations above the mean in the vector `y`.
```{r}
mean(list_extreme_values(x = y, k = 1.5)$large)
```