---
title: "Homework - Bootstrapping to find best linear regression"
author: "Tesfahun Tegene Boshe"
date: "12/17/2020"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

A sample of data with two variables, x and y, was given. A true relationship between the two variables is known to exist, but it is hidden from us by some randomness. The task was to recover this true relationship using only the sample data.

Assuming the relationship is linear,

$$y = \alpha + \beta x + \varepsilon,$$

I used bootstrapping to find the best approximations of $\alpha$ and $\beta$. Since I do not have the whole data set, I cannot compute the true error (I believe the professor will evaluate my solution by comparing my result to the true line).

* `boot()` from the **boot** package was used to create the bootstrap samples.
* `lm()` from the **stats** package was used to fit the linear models.
* `pam()` (PAM clustering with one cluster center) from the **cluster** package was used to find the medoids of the bootstrapped intercept and slope values separately.

```{r}
# install (if needed) and load the required packages
requiredPackages <- c("broom", "triangle", "readr", "RColorBrewer", "boot", "cluster")
for (p in requiredPackages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)
  }
}
```

## Load the sample data

A sample of 350 observations (drawn from a population of 1,000,000) was given.
```{r}
# load the sample data (semicolon-delimited)
sample_all <- read_delim("C:\\Users\\tesfa\\OneDrive\\Documents\\CoursesDataScince\\Applied Microeconomics\\Materials of Topic 3.2-20201111\\Topic_3_2_MC_2020\\Topic_3_2_MC_2020\\sample_all.csv",
                         delim = ";", escape_double = FALSE, trim_ws = TRUE)
```
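
As a quick sanity check, the loaded sample can be inspected before resampling (a minimal sketch, assuming the file contains the columns `x` and `y` used in the regression formula below).

```{r}
# quick look at the loaded sample (assumes columns named x and y)
str(sample_all)
summary(sample_all)
plot(y ~ x, data = sample_all, pch = 20, col = "grey60",
     main = "Raw sample of x and y")
```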

## Function performing the regression

The function below fits a linear model on the bootstrap sample selected by `indices` and returns the regression coefficients.
```{r}
# function to obtain the regression coefficients on a bootstrap sample
bs <- function(formula, data, indices) {
  d <- data[indices, ]         # allows boot() to select a resample of rows
  fit <- lm(formula, data = d) # fit the linear model on the resample
  return(coef(fit))
}
```
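
Before running the full bootstrap, the statistic function can be tested on a single manual resample (a minimal sketch; `one_draw` is just an illustrative name).

```{r}
# one manual bootstrap draw as a sanity check of bs()
set.seed(1)
one_draw <- sample(nrow(sample_all), replace = TRUE) # resampled row indices
bs(y ~ x, data = sample_all, indices = one_draw)     # coefficients on this resample
```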

## Bootstrapping with 10,000 replications

```{r}
# bootstrapping with 10,000 replications
results <- boot(data = sample_all, statistic = bs,
                R = 10000, formula = y ~ x)
```
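
The boot package can also report confidence intervals for the bootstrapped coefficients; for example, percentile intervals for the intercept (index 1) and the slope (index 2):

```{r}
# percentile confidence intervals from the bootstrap replicates
boot.ci(results, type = "perc", index = 1) # y-intercept
boot.ci(results, type = "perc", index = 2) # slope
```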

## Results of bootstrapping

```{r}
# distribution of the bootstrapped y-intercepts
plot(results, index = 1, main = "Histogram of y-intercepts", col = "aquamarine3", breaks = 50, probability = TRUE) # y-intercept
```

```{r}
# distribution of the bootstrapped slopes
plot(results, index = 2, main = "Histogram of slope values", col = "aquamarine3", breaks = 50, probability = TRUE) # slope
```

### Results for the alpha (intercept) coefficient

```{r}
par(mfrow = c(2, 1)) # two graphs in one window, one above the other

# histogram of the bootstrapped intercepts (alpha)
hist(results$t[, 1], main = "Histogram of a", col = "aquamarine3", breaks = 50, probability = TRUE)
lines(density(results$t[, 1]), lwd = 3)

# summary of the bootstrapped alpha values
summary(results$t[, 1])
```

### Results for the beta (slope) coefficient

```{r}
# histogram of the bootstrapped slopes (beta)
hist(results$t[, 2], main = "Histogram of b", col = "aquamarine3", breaks = 50, probability = TRUE)
lines(density(results$t[, 2]), lwd = 3)

# summary of the bootstrapped beta values
summary(results$t[, 2])

par(mfrow = c(1, 1)) # return to a single-graph layout
```

## Clustering

* I clustered the bootstrapped y-intercept and slope values to find the best measure of center (compared with the simple bootstrap means below).
* PAM clustering was used (with k = 1 as the number of clusters).
* The medoid is the most central element of the data.
```{r}
# use clustering to find the medoid-based estimates alfa_median and beta_median
alfa_a <- pam(results$t[, 1], 1) # cluster::pam(), works for n < 65536
beta_b <- pam(results$t[, 2], 1) # cluster::pam(), works for n < 65536
alfa_median <- alfa_a$medoids[, 1]
beta_median <- beta_b$medoids[, 1]
```
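
Since the bootstrap distributions look close to Gaussian, the medoid-based estimates should lie very near the simple bootstrap means; a quick comparison:

```{r}
# compare medoid-based and mean-based estimates of alpha and beta
c(alfa_medoid = unname(alfa_median), alfa_mean = mean(results$t[, 1]))
c(beta_medoid = unname(beta_median), beta_mean = mean(results$t[, 2]))
```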

### Color plot of the point distribution

* Define a function for plotting points colored by local density.

```{r}
# distribution of the alpha and beta estimates as an XY plot
# helper for the scatterplot of the alpha and beta estimates from the 10,000 repetitions
# densCols() produces a vector of colors encoding the local density at each point of a scatterplot
plot_colorByDensity <- function(x1, x2,
                                ylim = c(min(x2), max(x2)),
                                xlim = c(min(x1), max(x1)),
                                xlab = "", ylab = "", main = "") {
  df <- data.frame(x1, x2)
  x <- densCols(x1, x2, colramp = colorRampPalette(c("black", "white")))
  df$dens <- col2rgb(x)[1, ] + 1L
  cols <- colorRampPalette(c("#000099", "#00FEFF", "#45FE4F", "#FCFF00", "#FF9400", "#FF3100"))(256)
  df$col <- cols[df$dens]
  plot(x2 ~ x1, data = df[order(df$dens), ],
       ylim = ylim, xlim = xlim, pch = 20, col = col,
       cex = 2, xlab = xlab, ylab = ylab,
       main = main)
}
############################### function end ###################################
```

## Plot the bootstrapped estimates colored by density
```{r}
# use the plotting function above on the bootstrapped estimates
plot_colorByDensity(results$t[, 1], results$t[, 2], xlab = "alpha", ylab = "beta",
                    main = paste0("Bootstrapped estimators: [mean_a = ", round(mean(results$t[, 1]), 1),
                                  ", mean_b = ", round(mean(results$t[, 2]), 2), "]"))
# abline(h = beta_median, v = alfa_median, lwd = 2) # alternative: medoid-based center
abline(h = mean(results$t[, 2]), v = mean(results$t[, 1]), lwd = 2)

cat("alpha =", mean(results$t[, 1]), "\n")
cat("beta =", mean(results$t[, 2]), "\n")
```
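
As a final visual check, the estimated line can be overlaid on the raw sample (again assuming the sample columns are named `x` and `y`).

```{r}
# overlay the bootstrapped regression line on the raw sample
plot(y ~ x, data = sample_all, pch = 20, col = "grey60",
     main = "Sample data with the bootstrapped regression line")
abline(a = mean(results$t[, 1]), b = mean(results$t[, 2]), col = "red", lwd = 2)
```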

## Comments

* The sampling method used in 'Bootstrap 2020 Student.R' was replaced by the boot() function, as recommended.
* PAM clustering with one center returns the medoid of the data. Since the data are almost perfectly Gaussian, I believe the mean can be a better measure of center.
* The boot package performs the resampling much faster.

## References

[1] 'AM 2020 Topic 3.2 Bootstrapping' at the Moodle page for the course 'Advanced Microeconomics' at UW.
[2] https://www.statmethods.net/advstats/bootstrapping.html
[3] Sample code 'Bootstrap 2020 Student' at the Moodle page for the course 'Advanced Microeconomics' at UW.