---
title: 'Project 1 : Predicting the weight of bananas'
author: "Iza"
date: "2/6/2022"
output:
  pdf_document: default
  word_document: default
  html_document:
    df_print: paged
---
```{r, setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width = 8)
# fig.show expects a character value and fig.cap a caption string,
# so they are set per chunk where needed rather than globally
knitr::opts_chunk$set(fig.show = "asis")
```
```{r wrap-hook, include=FALSE}
library(knitr)
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
  # this hook is used only when the linewidth option is not NULL
  if (!is.null(n <- options$linewidth)) {
    x = knitr:::split_lines(x)
    # any lines wider than n should be wrapped
    if (any(nchar(x) > n)) x = strwrap(x, width = n)
    x = paste(x, collapse = '\n')
  }
  hook_output(x, options)
})
```
```{r include=FALSE}
library(knitr)
library(ggplot2)
library(GGally) # an extension to ggplot2
library(Metrics)
```
```{r, echo=FALSE}
# import the banana data
bdata = readxl::read_xlsx("Copy of Banana Data.xlsx")
colnames(bdata) = c("weight","length","radius")
```
```{r, echo=FALSE}
# split data into train and test subsets
# test data are rows:1,8,12,17,18,23,24,28,44,45,47,53,55,69,76,77
testrows = c(1,8,12,17,18,23,24,28,44,45,47,53,55,69,76,77)
banana_train = bdata[-testrows,]
banana_test = bdata[testrows,]
```
## Summary
This report predicts the weight of a banana from its radius and length using linear regression. A sample of size $n = 78$ is collected and then split into a training dataset and a test dataset for cross-validation of the model. The Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE) are used to further evaluate the candidate models. A weighted linear model is selected as the final model.
## Introduction
### Background
A known formula for computing the weight of an object is
$Weight = Density \times Volume.$ If a cylinder's volume is defined as $Volume = \pi \times radius^2 \times length$, then we have an approximate formula for the weight:
$Weight = Density \times \pi \times radius^2 \times length.$ \
### Purpose
If we assume that a banana is approximately a cylinder, then we can estimate its weight from its length and radius. Since the log function $log()$ has the property $log(XY^m) = log(X) + mlog(Y)$, we can use it to transform the weight equation above into a linear form, which lets us fit a linear regression model to predict the weight of a banana. Applying the log function, the equation used for linear regression is
$log(Weight) = log(Density) + log(\pi) + 2log(radius) + log(length).$ \
Then further modified into
$log(Weight) = \beta_0 + \beta_1 log(radius) + \beta_2 log(length).$\
\
where the constant terms $log(Density)$ and $log(\pi)$ are absorbed into $\beta_0.$
\
The linear regression will then estimate the values of $\beta_0$, $\beta_1$, and $\beta_2$.
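As an illustrative check of this log-linear form (not part of the collected data; the density, size ranges, and noise level below are made-up values), a quick simulation sketch shows that regressing log-weight on log-radius and log-length for synthetic cylinders recovers slopes close to 2 and 1:
```{r}
set.seed(1)
# synthetic cylinder "bananas": weight = density * pi * radius^2 * length, with noise
r_sim <- runif(100, 10, 20)    # assumed radius range (mm)
l_sim <- runif(100, 150, 250)  # assumed length range (mm)
dens  <- 0.001                 # assumed density (g per mm^3)
w_sim <- dens * pi * r_sim^2 * l_sim * exp(rnorm(100, sd = 0.05))
coef(lm(log(w_sim) ~ log(r_sim) + log(l_sim)))  # slopes should be near 2 and 1
```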
## Methods
### Data collection
Each student is tasked with measuring the radius, length, and weight of six bananas; the measurements are compiled into a dataset with 78 rows and 3 columns.
### Exploratory data analysis
To get a glimpse of the data, a correlation plot is generated. From the pairwise plot below, it is found that the variables have weak to moderate linear relationships with each other.
```{r, echo=FALSE, fig.cap= "Pairwise plot with the correlation of each variable in the data"}
ggpairs(data = banana_train,
axisLabels = "show")
```
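The same pairwise correlations can also be read off numerically; a minimal sketch (not part of the original figure):
```{r}
# numeric correlation matrix of the three measured variables in the training set
round(cor(banana_train), 2)
```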
### Model Building
There are three linear models to choose from: \
- $log(Weight) = \beta_0 + \beta_1 log(radius) + \beta_2 log(length) + \varepsilon$\
- $log(Weight) = \beta_0 + \beta_1 log(length) + \varepsilon$\
- $log(Weight) = \beta_0 + \beta_1 log(radius) + \varepsilon$ \
\
For each model, we use *lm()* to estimate the coefficients, $\beta_0, \beta_1$ (and $\beta_2$ for the full model), and the residuals, $\varepsilon$. Before applying *lm()*, the original dataset is split into a training set and a test set, and the training set is used to build the models. Once the models are built, they are assessed by comparing their Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE), defined below. The model that yields the lowest MAE and MAPE is chosen as the prediction model. The assumptions for linear regression are checked on each model; if any assumption is violated, the model is modified. \
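For $n$ observed values $y_i$ (here the log-weights) and predictions $\hat{y}_i$, the two error measures are
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, \qquad MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|.$$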
## Results
Once the models are fitted, the estimated coefficients are shown in the table below. \
```{r include=FALSE}
linfit_full = lm(log(weight) ~ log(length) + log(radius),data=banana_train) #all variables
linfit_l = lm(log(weight)~log(length), data=banana_train) #just length as x
linfit_r = lm(log(weight)~log(radius), data=banana_train) #just radius as x
linfit_none = lm(weight~1, data=banana_train) #empty; used for forward step, model selection
summary(linfit_full)
summary(linfit_l)
summary(linfit_r)
summary(linfit_none)
```
```{r echo=F}
linfit_b0 = c(linfit_full$coefficients[1], linfit_l$coefficients[1], linfit_r$coefficients[1])
linfit_b1 = c(linfit_full$coefficients[2], linfit_l$coefficients[2], linfit_r$coefficients[2])
linfit_b2 = c(linfit_full$coefficients[3], NA, NA)
linfit_coefs =
cbind.data.frame(
c("linfit_full","linfit_l","linfit_r"),
linfit_b0,linfit_b1,linfit_b2)
names(linfit_coefs)[1] ="model"
kable(linfit_coefs, format="pipe")
```
\
\
\
\
\
Thus, the linear models are: \
- **linfit_full**: $log(Weight) = 0.918 + (0.121) log(radius) + (1.182) log(length)$ \
- **linfit_l**: $log(Weight) = 3.492 + (0.281) log(length)$ \
- **linfit_r**: $log(Weight) = 1.294 + (1.278) log(radius)$ \
Comparing the adjusted R-squared values of the models, **linfit_r** has the highest value, $0.2092$. This means that the predictor in **linfit_r** (radius) explains about 20.92% of the variability in the response variable (weight). When stepwise selection is performed using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), the model produced is the same as **linfit_r**. So, by comparing the adjusted R-squared values and performing variable selection with AIC and BIC, **linfit_r** appears to be the best of the three models. \
```{r include=FALSE}
# backwards, using both length and radius variables
step(linfit_full, direction="backward", trace=F)
# forward...
step(linfit_none, direction="forward", scope = list(lower= linfit_none, upper= linfit_full), trace=F)
# Result is the model with radius as the predictor
```
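Besides the stepwise search above, the information criteria of the three candidate models can be compared directly; a minimal sketch (lower AIC/BIC values indicate a better trade-off between fit and complexity):
```{r}
# compare AIC and BIC across the three candidate models (all fitted on log(weight))
AIC(linfit_full, linfit_l, linfit_r)
BIC(linfit_full, linfit_l, linfit_r)
```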
### Results of cross validation
```{r include=FALSE}
# using the model with all the variables (l and r)
pred.full = predict(object = linfit_full, newdata = banana_test)
# with just length (l)
pred.l = predict(object = linfit_l, newdata = banana_test)
# with just radius (r)
pred.r = predict(object = linfit_r, newdata = banana_test)
```
```{r include=FALSE}
# MAE  - Mean Absolute Error
# MAPE - Mean Absolute Percent Error
# function to compute MAE
computeMae = function(measured, predicted, n){
  summed_term = 0
  for (i in 1:n){
    abs_term = abs(measured[i] - predicted[i])
    summed_term = summed_term + abs_term
  }
  MAE = summed_term/n
  return(MAE)
}
```
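Since R is vectorized, an equivalent one-line version is shown below as a sketch; it gives the same result for numeric vectors of matching length:
```{r}
# vectorized MAE, equivalent to computeMae(measured, predicted, length(measured))
computeMaeVec = function(measured, predicted) mean(abs(measured - predicted))
```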
```{r include=FALSE}
#----MAE
MAE_train_full = computeMae(log(banana_train$weight), linfit_full$fitted.values, n=nrow(banana_train))
MAE_test_full = computeMae(log(banana_test$weight), pred.full, n = nrow(banana_test)) # pred.full is already on the log scale
mae_full = c(MAE_train_full, MAE_test_full)
MAE_train_l = computeMae(log(banana_train$weight), linfit_l$fitted.values, n= nrow(banana_train))
MAE_test_l = computeMae(log(banana_test$weight), pred.l, n = nrow(banana_test))
mae_l = c(MAE_train_l, MAE_test_l)
MAE_train_r = computeMae(log(banana_train$weight), linfit_r$fitted.values, n= nrow(banana_train))
MAE_test_r = computeMae(log(banana_test$weight), pred.r, n = nrow(banana_test))
mae_r = c(MAE_train_r, MAE_test_r)
mae_df = cbind.data.frame(mae_full, mae_l, mae_r)
rownames(mae_df) = c("weight(train)","weight(test)")
colnames(mae_df) = c("predicted weight(full)", "predicted weight(length)", "predicted weight(radius)")
mae_table = kable(mae_df,format= "pipe", row.names = T)
#----MAPE
# use mape() in Metrics package
MAPE_train_full = mape(log(banana_train$weight), linfit_full$fitted.values)
MAPE_test_full = mape(log(banana_test$weight), pred.full)
mape_full = c(MAPE_train_full,MAPE_test_full)
MAPE_train_l = mape(log(banana_train$weight), linfit_l$fitted.values)
MAPE_test_l = mape(log(banana_test$weight), pred.l)
mape_l = c(MAPE_train_l, MAPE_test_l)
MAPE_train_r = mape(log(banana_train$weight), linfit_r$fitted.values)
MAPE_test_r = mape(log(banana_test$weight), pred.r)
mape_r = c(MAPE_train_r, MAPE_test_r)
#put in df
mape_df = cbind(mape_full, mape_l, mape_r)
rownames(mape_df) = c("obs.weight(train)","obs.weight(test)")
colnames(mape_df) = c("pred.weight(full)", "pred.weight(length)", "pred.weight(radius)")
#then table
mape_table = kable(mape_df, row.names = T)
```
To perform cross-validation, the models are used to predict the weight of the bananas in the test dataset. The resulting MAE and MAPE values are: \
```{r echo=TRUE, fig.cap="MAE table", paged.print=FALSE}
mae_table
# The smaller the MAE/MAPE value, the more accurate the model
```
*MAE values; the column names correspond to the model that generated the predicted values, and the row names indicate where the observed values come from.*
\
```{r echo=TRUE, fig.cap="MAPE table", paged.print=TRUE}
mape_table
```
*MAPE values; the column names correspond to the model that generated the predicted values, and the row names indicate where the observed values come from.* \
From the MAE and MAPE tables, we can see that the linear model with radius as the sole predictor (**linfit_r**) and the model with length as the predictor (**linfit_l**) have very similar MAE and MAPE values when the test dataset is used to predict the weight. If we consider the MAE and MAPE values on the training dataset, **linfit_r** has the lower value on both measures. Thus, MAE and MAPE confirm that **linfit_r** is the best of the three linear models.
### Validity of the model
In order for the predicted values to be valid and to make other inferences from the **linfit_r** model, the assumptions for linear regression are checked. \
```{r echo=FALSE, fig.cap= "Diagnostic plots for the model: linfit_r"}
#radius
par(mfrow=c(2,2))
plot(linfit_r)
```
```{r include=FALSE}
# Shapiro-Wilk Normality Test
#H_0: The residuals are normally distributed
#H_1: The residuals are not normally distributed
shapiro.test(linfit_r$residuals)
```
- ***Linearity***: The *Residuals vs. Fitted* graph has an approximately horizontal line which means that there is no fitted pattern. Thus, we can assume that the radius has a linear relationship with the weight. \
- ***Residuals' normality test***: After running a Shapiro-Wilk normality test at a significance level of 0.05, we fail to reject the null hypothesis since the $p$-value $= 0.6425 > 0.05$. At the 5% level, there is no evidence against the assumption that the residuals are normally distributed. \
- **Homoscedasticity (equal variances)**: From the Scale-Location plot we can see that the line is not horizontal, which implies that the variance of the residuals increases with the fitted values. This implies that **linfit_r** produces non-constant variance of the response variable (banana weights). Adding weights to the model, as done in the next section, addresses this violation. \
\
```{r, echo=FALSE, fig.cap= "Boxplots of the variables"}
#boxplot
par(mfrow=c(1,3))
boxplot(banana_train$weight, ylab = "Weight(g)", main = "Boxplot of Banana weights", col="lightyellow")
boxplot(banana_train$length, ylab = "length(mm)", main = "Boxplot of Banana length", col="lightgreen")
boxplot(banana_train$radius, ylab = "radius(mm)", main = "Boxplot of Banana radius", col="lightpink")
```
- **Outliers and influential points**: The *Residuals vs. Leverage* graph shows that no point crosses the Cook's distance threshold. Therefore, there are no influential points that we should omit. So even though the boxplots show a few apparent radius outliers, removing them would not have a significant effect on the model.
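The visual check of homoscedasticity can be supplemented with a formal test; below is a minimal sketch using the Breusch-Pagan test from the **lmtest** package (this package and test are not part of the original analysis):
```{r}
# Breusch-Pagan test; H0: the residual variance is constant
# run only if the lmtest package is available
if (requireNamespace("lmtest", quietly = TRUE)) lmtest::bptest(linfit_r)
```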
### Modifying the model
By adding the weights $w_i = \frac{1}{(st.dev(Y_i))^2}$ to the **linfit_r** model, we obtain a new model (**fitr_wi**): \
$$log(Weight)_w = \hat{\beta_{0}} + \hat{\beta_{1}} log(radius)_w + \hat{\varepsilon},$$
where the subscript $w$ indicates that the weights are applied to the observations [1]. \
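In weighted least squares, the coefficients are chosen to minimize the weighted residual sum of squares
$$\sum_{i=1}^{n} w_i \left( log(Weight_i) - \beta_0 - \beta_1\, log(radius_i) \right)^2,$$
so observations with a larger estimated variance (smaller $w_i$) contribute less to the fit [1].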
Applying the weights to the linear model results in a small improvement of the adjusted R-squared, from $0.2092$ to $0.2928$. The diagnostic plots below confirm the small improvement, as the Scale-Location line becomes somewhat more horizontal.
\
```{r}
# add a weight to stabilize the variance
# weight w_i = 1/(estimated st.dev.)^2; the absolute residuals are regressed on the fitted values to estimate the st.dev.
resfit = lm(abs(linfit_r$residuals)~ linfit_r$fitted.values)
wi = 1/resfit$fitted.values^2
fitr_wi = lm(log(weight)~log(radius), data=banana_train, weights = wi)
summary(fitr_wi)
```
\
```{r}
par(mfrow=c(2,2))
plot(fitr_wi)
```
\
When **fitr_wi** is used to predict the banana weights with the test dataset, the resulting MAE and MAPE are: \
\
```{r echo=FALSE, table.cap="MAE and MAPE with the fitr_wi model"}
# predict weight using fitr_wi
pred.rwi = predict(object = fitr_wi, newdata = banana_test)
#MAE
mae_rwi = computeMae(log(banana_test$weight), pred.rwi, n= nrow(banana_test))
#MAPE
mape_rwi = mape(log(banana_test$weight), pred.rwi)
# to make a line plot of your regression model:
#ggplot(fitr_w8, aes(names(fitr_w8)[2], names(fitr_w8)[1])) + geom_abline()
maepe_table = data.frame(mae_rwi,mape_rwi)
names(maepe_table)[1] = "MAE"
names(maepe_table)[2] = "MAPE"
kable(maepe_table, format="pipe")
```
\
*The MAE and MAPE values above are computed from the test dataset.* \
### Final model
The final model is **fitr_wi**:
$$log(Weight)_w = \hat{\beta_0} + \hat{\beta_1} log(radius)_w + \hat{\varepsilon},$$
where $\hat{\beta_0}= 1.395139, \hat{\beta_1}= 1.243427.$\
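A log-scale prediction is converted back to grams with the exponential function; a quick sketch using the stated coefficients (the 15 mm radius is only an illustrative input, not a value from the data):
```{r}
# predicted weight (g) for a hypothetical banana with radius 15 mm,
# using the fitr_wi coefficients reported above
r_new = 15
exp(1.395139 + 1.243427 * log(r_new))
```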
The following graph shows the models **linfit_r** (blue) and **fitr_wi** (purple) overlaid on the plot of banana weights from the test dataset.
\
```{r echo=FALSE, fig.cap="Observed test weights (red) with predictions from linfit_r (blue) and fitr_wi (purple)"}
# plot radius vs. observed weight in the test data
ord = order(banana_test$radius)  # sort by radius so the prediction lines are drawn left to right
plot(x = banana_test$radius, y = banana_test$weight, type = 'p', col = 'red',
     xlab = "radius (mm)", ylab = "weight (g)")
# overlay line plot of linfit_r predictions (back-transformed from the log scale)
lines(banana_test$radius[ord], exp(pred.r)[ord], col = 'blue')
# overlay line plot of fitr_wi predictions
lines(banana_test$radius[ord], exp(pred.rwi)[ord], col = 'purple')
# add legend
legend("topleft", legend = c('actual', 'linfit_r', 'fitr_wi'),
       col = c('red', 'blue', 'purple'), lty = 1)
```
\
The predicted banana weights are: \
```{r echo=FALSE}
bweight_df = data.frame(radius = banana_test$radius,
                        observed_weight = banana_test$weight,
                        pred_linfit_r = exp(pred.r),
                        pred_fitr_wi = exp(pred.rwi))
print(bweight_df)
```
An approximate 95% confidence interval for the intercept $\beta_0$ (which contains the density term, on the log scale) is:
```{r}
# confidence interval for the density
# the density term is contained in the intercept (beta_0) of linfit_r, on the log scale
b0_full = linfit_r$coefficients[1]
# st error of intercept in linfit_r: 49.059
q.se = qnorm(1-(0.05/2))*49.059
confidence_i = b0_full + c(-1,1)*q.se
q.se2 = 25.78
confidence_i + c(-1,1)*q.se2
```
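Since the Introduction absorbs $log(Density) + log(\pi)$ into the intercept, a very rough implied density can be sketched by back-transforming the intercept of **linfit_r**. This is only an approximation, because the fitted exponent on radius is not constrained to 2 and length is omitted from **linfit_r**; units are assumed to be grams and millimetres:
```{r}
# rough implied density (g per mm^3), assuming beta_0 is approximately log(density * pi)
exp(linfit_r$coefficients[1]) / pi
```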
## Conclusion
Even though the MAE and MAPE values are small, the linear regression **fitr_wi** may not be the best model for predicting the weight of bananas. That is because the adjusted R-squared values are not high, which means that the predictor variable (radius) does not explain the response (banana weight) very well. Additional predictor variables could improve the linear model, for instance the ripeness of the banana (ripe/unripe). Adding a size category (e.g., small/medium/large) might also allow for separate prediction models for each size. Moreover, measurement variance could be reduced during data collection by using the same techniques and tools to measure the radius, length, and weight of every banana. Non-linear models may also yield better predictions.
## Appendix
The following are the R codes used for this report.
\
```{r, linewidth=60}
#library(knitr)
##library(ggplot2)
#library(GGally) # an extension to ggplot2
#library(Metrics)
```
```{r, linewidth=60}
# import the banana data
#bdata = readxl::read_xlsx(
#"Copy of Banana Data.xlsx")
#colnames(bdata) = c("weight","length","radius")
```
```{r, linewidth=60}
# split data into train and test subsets
# test data are rows:
#1,8,12,17,18,23,24,28,44,45,47,53,55,69,76,77
#testrows = c(1,8,12,17,18,
#23,24,28,44,45,47,53,55,69,76,77)
#banana_train = bdata[-testrows,]
#banana_test = bdata[testrows,]
```
```{r, linewidth=60}
#ggpairs(data = banana_train,
# title =
#"Pairwise plot of the three variables in the data",
# axisLabels = "show")
# https://www.
#rdocumentation.org/packages/GGally/versions/
#1.5.0/topics/ggpairs
# stars signify significance at 10%, 5% and 1% levels. https://r-coder.com/correlation-plot-r/
```
```{r, fig.cap= "Histogram of weight distribution of the bananas in the training set.", linewidth=60}
# histogram of the weight's distribution
#hist(banana_train$weight, col = "lightyellow",
# border="black", prob= T,
# xlab=" banana weight (g)",
#main = " Histogram of bananas' weight distribution")
#lines(density(banana_train$weight), lwd = 2.5, col = "orange")
```
```{r, linewidth=60}
#linfit_full = lm(log(weight) ~ log(length) + log(radius),data=banana_train) #all variables
#linfit_l = lm(log(weight)~log(length), data=banana_train) #just length as x
#linfit_r = lm(log(weight)~log(radius), data=banana_train) #just radius as x
#linfit_none = lm(weight~1, data=banana_train) #empty; used for forward step, model selection
#summary(linfit_full)
#summary(linfit_l)
#summary(linfit_r)
#summary(linfit_none)
```
```{r}
#linfit_b0 = c(linfit_full$coefficients[1], linfit_l$coefficients[1], linfit_r$coefficients[1])
#linfit_b1 = c(linfit_full$coefficients[2], linfit_l$coefficients[2], linfit_r$coefficients[2])
#linfit_b2 = c(linfit_full$coefficients[3], NA, NA)
#linfit_coefs = cbind.data.frame(c("linfit_full","linfit_l",
#"linfit_r"),linfit_b0,linfit_b1,linfit_b2)
#names(linfit_coefs)[1] ="model"
#kable(linfit_coefs, format="simple")
```
*Stepwise selection is not strictly needed here, since there are only two possible predictors.*
\
```{r echo=TRUE}
# backwards, using both length and radius variables
#step(linfit_full, direction="backward", trace=F)
# forward...
#step(linfit_none, direction="forward", scope = list(lower= linfit_none, upper= linfit_full), trace=F)
# Result is the model with radius as the predictor
```
```{r}
# using the model with all the variables (l and r)
#pred.full = predict(object = linfit_full, newdata = banana_test)
# with just length (l)
#pred.l = predict(object = linfit_l, newdata = banana_test)
# with just radius (r)
#pred.r = predict(object = linfit_r, newdata = banana_test)
```
Comparing the models' accuracy:
```{r}
# MAE - Mean absolute error
# MPAE - Mean Percent absolute error
# function to compute MAE
#computeMae = function(measured,predicted,n){
# summed_term = 0
# for (i in 1:n){
# abs_term = abs(measured[i]-predicted[i])
# summed_term = summed_term + abs_term
# }
# MAE = summed_term/n
# return(MAE)
#}
```
```{r, linewidth=60}
# now compare all the maes of train and test on each model
#----MAE
#MAE_train_full = computeMae(log(banana_train$weight), linfit_full$fitted.values, n=nrow(banana_train))
#MAE_test_full = computeMae(log(banana_test$weight),
#pred.full, n= nrow(banana_test))
#mae_full = c(MAE_train_full, MAE_test_full)
#MAE_train_l = computeMae(log(banana_train$weight), linfit_l$fitted.values, n= nrow(banana_train))
#MAE_test_l = computeMae(log(banana_test$weight),
#pred.l, n= nrow(banana_test))
#mae_l = c(MAE_train_l, MAE_test_l)
#MAE_train_r = computeMae(log(banana_train$weight), linfit_r$fitted.values, n= nrow(banana_train))
#MAE_test_r = computeMae(log(banana_test$weight), pred.r,
#n= nrow(banana_test))
#mae_r = c(MAE_train_r, MAE_test_r)
#mae_df = cbind.data.frame(mae_full, mae_l, mae_r)
#rownames(mae_df) = c("weight(train)","weight(test)")
#colnames(mae_df) = c("predicted weight(full)", "predicted weight(length)", "predicted weight(radius)")
#mae_table = kable(mae_df,format= "pipe", row.names = T)
#----MAPE
# use mape() in Metrics package
#MAPE_train_full = mape(log(banana_train$weight), linfit_full$fitted.values)
#MAPE_test_full = mape(log(banana_test$weight), pred.full)
#mape_full = c(MAPE_train_full,MAPE_test_full)
#MAPE_train_l = mape(log(banana_train$weight), linfit_l$fitted.values)
#MAPE_test_l = mape(log(banana_test$weight), pred.l)
#mape_l = c(MAPE_train_l, MAPE_test_l)
#MAPE_train_r = mape(log(banana_train$weight), linfit_r$fitted.values)
#MAPE_test_r = mape(log(banana_test$weight), pred.r)
#mape_r = c(MAPE_train_r, MAPE_test_r)
#put in df
#mape_df = cbind(mape_full, mape_l, mape_r)
#rownames(mape_df) = c("weight(train)","weight(test)")
#colnames(mape_df) = c("predicted weight(full)", "predicted weight(length)", "predicted weight(radius)")
#then table
#mape_table = kable(mape_df,format= "pipe", row.names = T)
```
```{r, echo=FALSE, fig.cap= "Mean Absolute Error (banana weight observations vs. the three models' predicted weights)"}
#mae_table
# The smaller the MAE/MAPE value, the more accurate the model
```
```{r, echo=FALSE, fig.cap= "Mean Absolute Percent Error (banana weight observations vs. the three models' predicted weights)"}
#mape_table
```
```{r, echo=FALSE,fig.cap= "Diagnostic plots for the model: linfit.full"}
#plot to check assumptions for linear regression
#full
#par(mfrow=c(2,2))
#plot(linfit_full)
```
```{r, echo=FALSE, fig.cap= "Diagnostic plots for the model:linfit.l"}
#length
#par(mfrow=c(2,2))
#plot(linfit_l)
```
```{r, echo=FALSE, fig.cap= "Diagnostic plots for the model:linfit.r"}
#radius
#par(mfrow=c(2,2))
#plot(linfit_r)
```
```{r}
# Shapiro-Wilk Normality Test
#H_0: The residuals are normally distributed
#H_1: The residuals are not normally distributed
#shapiro.test(linfit_r$residuals)
```
```{r, echo=FALSE, fig.cap= "Boxplots of the variables"}
#boxplot
#par(mfrow=c(1,3))
#boxplot(banana_train$weight, ylab = "Weight(g)", main = "Boxplot of Banana weights", col="lightyellow")
#boxplot(banana_train$length, ylab = "length(mm)", main = "Boxplot of Banana length", col="lightgreen")
#boxplot(banana_train$radius, ylab = "radius(mm)", main = "Boxplot of Banana radius", col="lightpink")
```
```{r include=FALSE}
# add a weight to stabilize the variance
# weight = 1/(residuals)^2; we estimate the residuals via slr for it
#resfit = lm(abs(linfit_r$residuals)~ linfit_r$fitted.values)
#wi = 1/resfit$fitted.values^2
#fitr_wi = lm(log(weight)~log(radius), data=banana_train, weights = wi)
#summary(fitr_wi)
```
```{r echo=FALSE}
#par(mfrow=c(2,2))
#plot(fitr_wi)
```
```{r echo=FALSE, table.cap="MAE and MAPE with the fitr_wi model"}
# predict weight using fitr_wi
#pred.rwi = predict(object = fitr_wi, newdata = banana_test)
#MAE
#mae_rwi = computeMae(log(banana_test$weight), pred.rwi, n= nrow(banana_test))
#MAPE
#mape_rwi = mape(log(banana_test$weight), pred.rwi)
# to make a line plot of your regression model:
#ggplot(fitr_w8, aes(names(fitr_w8)[2], names(fitr_w8)[1])) + geom_abline()
#maepe_table = data.frame(mae_rwi,mape_rwi)
#names(maepe_table)[1] = "MAE"
#names(maepe_table)[2] = "MAPE"
#kable(maepe_table, format="pipe")
```
```{r}
#create plot of radius vs weight in test data
#plot(x=banana_test$radius, y = banana_test$weight, type='p', col='red')
#overlay line plot of radius vs predicted weight by linfit_r
#lines(banana_test$radius, exp(pred.r), col='blue')
#overlay line plot of radius vs predicted by fitr_wi
#lines(banana_test$radius, exp(pred.rwi), col='purple')
#add legend
#legend(1, 25, legend=c('actual', 'linfit_r', 'fitr_wi'),
# col=c('red', 'blue', 'purple'), lty=1)
```
```{r, linewidth=60}
#bweight_df = data.frame(banana_test$radius,
#banana_test$weight,
#exp(pred.r),exp(pred.rwi))
#print(bweight_df)
```
```{r}
# confidence interval for the density
# density is in the intercept (beta_0) term of linfit_full
b0_full = linfit_full$coefficients[1]
# st error of intercept in linfit_full: 49.39822
q.se = qnorm(1-(0.05/2))*49.39822
confidence_i = (pred.rwi / pred.r)+c(-1,1)*0.008949
min(confidence_i)
max(confidence_i)
confidence_i2 = (pred.rwi / pred.r)+c(-1,1)*0.5986
abs(min(confidence_i2))
abs(max(confidence_i2))
```
## References
[1] “13.1 - weighted least squares: Stat 501,” PennState: Statistics Online Courses. [Online]. Available: https://online.stat.psu.edu/stat501/lesson/13/13.1. [Accessed: 06-Feb-2022].