# Hypothesis_Testing_By_Example.py
#### Question 1 ########
########################
'''
The population proportion of people in Ireland with heart disease is 42%.
Are more people suffering from heart disease in the US?
============================================================
'''
import pandas as pd
# read_csv uses ',' as its default separator
heart = pd.read_csv('heart.csv')
'''
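H0: p = 0.42   # null hypothesis
H1: p > 0.42   # alternative hypothesis
'''
# NOTE: this question counts target == 0 as "has heart disease", while
# Question 2 counts target == 1; check the dataset's encoding and use one convention.
n_disease = (heart['target'] == 0).sum()
n_total = len(heart)
p_us = n_disease / n_total
'''
The sample proportion with heart disease in the US data is 46%.
'''
from statsmodels.stats.proportion import proportions_ztest
# one-sided z-test: is the US proportion larger than 0.42?
stat, p_value = proportions_ztest(count=n_disease, nobs=n_total,
                                  value=0.42, alternative='larger')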
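'''
To make the one-sample proportion z-test above less of a black box, here is a
minimal sketch of the arithmetic using only the standard library.
Assumptions: the helper name one_sample_proportion_ztest and the counts
(46 out of 100) are hypothetical and are not taken from heart.csv. The standard
error is computed from the sample proportion, which is my understanding of the
statsmodels default; many textbooks use the null value p0 there instead.
'''

```python
import math

def one_sample_proportion_ztest(count, nobs, p0):
    """Upper-tailed z-test of H0: p = p0 against H1: p > p0 (illustrative helper)."""
    p_hat = count / nobs
    # standard error based on the sample proportion (assumed convention, see lead-in)
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p0) / se
    # upper-tail probability of the standard normal: 1 - Phi(z)
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_upper

# hypothetical counts, for illustration only
z_demo, p_demo = one_sample_proportion_ztest(count=46, nobs=100, p0=0.42)
print(f"z = {z_demo:.3f}, one-sided p-value = {p_demo:.3f}")
```

'''
With these made-up counts the p-value is well above 0.05, so a sample proportion
of 46% in a sample of 100 would not be enough to reject H0: p = 0.42.
'''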
#### Question 2 ########
########################
'''
Test whether the population proportion of females with heart disease differs
from the population proportion of males with heart disease.
============================================================
We test the difference between the proportions of two populations.
Our hypotheses are:
h0: p_male = p_female
ha: p_male != p_female
We use a two-sample z-test to decide whether the sample lets us reject or fail to reject the null hypothesis.
'''
# splitting the data into 2 dataframes
# NOTE: assumes sex == 1 encodes females in heart.csv; verify against the dataset's documentation
females=heart.loc[heart['sex']==1, ['target']]
males=heart.loc[heart['sex']==0,['target']]
## calculating the proportions of each group:
p_female= len(females[females['target']==1])/len(females)
p_male= len(males[males['target']==1])/len(males)
gender_test=pd.DataFrame( {
'count':[males['target'].sum(), females['target'].sum()],
'nobs':[males['target'].count(), females['target'].count()]
}
,index=['males','females']
)
# two-sided test (the default); keep the result instead of discarding it
z_gender, p_gender = proportions_ztest(gender_test['count'], gender_test['nobs'])
'''
The p-value of our test is 1.0071642033238824e-06, far below any conventional
significance level (e.g. 0.05), so we reject the null hypothesis.
The population proportion of males with heart disease is significantly different from
the population proportion of females with heart disease.
'''
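'''
The two-sample z-test used above can be sketched by hand as follows.
Assumptions: the helper name two_sample_proportion_ztest and the counts
(60 of 100 versus 40 of 100) are hypothetical. The sketch uses the pooled
proportion for the standard error, which is the usual choice when h0 states
that the two proportions are equal.
'''

```python
import math

def two_sample_proportion_ztest(count1, nobs1, count2, nobs2):
    """Two-sided z-test of h0: p1 = p2 with a pooled standard error (illustrative helper)."""
    p1 = count1 / nobs1
    p2 = count2 / nobs2
    # under h0 both groups share one proportion, estimated from the pooled data
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    # two-sided p-value: 2 * (1 - Phi(|z|))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# hypothetical counts, for illustration only
z2_demo, p2_demo = two_sample_proportion_ztest(60, 100, 40, 100)
reject_h0 = p2_demo < 0.05  # a small p-value means we reject h0
print(f"z = {z2_demo:.3f}, p-value = {p2_demo:.5f}, reject h0: {reject_h0}")
```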
#### Question 3 ########
########################
'''
Check if the mean RestBP is greater than 135.
============================================================
Here we test the mean against a known value.
Our hypotheses are:
h0: m = 135
ha: m > 135
This is a one-sided t-test. We need to check its assumptions first:
1- The sample should be a simple random sample.
2- The data need to be normally distributed.
'''
import seaborn as sns
from scipy import stats
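# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(heart['trestbps'], kde=True)
## let's test normality with the Shapiro-Wilk test
stats.shapiro(heart['trestbps'])
# the p-value of the test is not significant, so we do not reject normality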
m=heart['trestbps'].mean()
'''
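Older SciPy releases had no direct way to request a one-tailed variant of the test
(newer ones accept an `alternative` argument in ttest_1samp).
To stay compatible, we adjust the two-sided output ourselves:
when the t statistic points in the direction of ha (here, positive),
the one-sided p-value is the two-sided p-value divided by 2;
otherwise it is 1 minus that half. The test statistic stays the same.
'''
from scipy import stats
t_test = stats.ttest_1samp(heart['trestbps'], 135)
## one-sided p-value for ha: m > 135
half = t_test.pvalue / 2
p_val = half if t_test.statistic > 0 else 1 - half
'''
The halved two-sided p-value is about 0.05%: if the null hypothesis were true, a result
at least this extreme would be seen with only about 0.05% probability.
This leads us to reject the null hypothesis in favor of ha: m > 135 only if the sample
mean m is above 135 (positive t statistic). If m is below 135, p_val is close to 1
and we fail to reject the null hypothesis based on this sample data.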
'''
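'''
The conversion from a two-sided to a one-sided p-value depends on the sign of
the test statistic. Below is a small sketch of that rule for a symmetric null
distribution such as Student's t.
Assumptions: the helper name one_sided_p_value and the example numbers are
hypothetical and are not results from heart.csv.
'''

```python
def one_sided_p_value(two_sided_p, statistic, alternative="greater"):
    """Convert a two-sided p-value to a one-sided one (symmetric null distribution)."""
    half = two_sided_p / 2
    if alternative == "greater":
        # evidence for "greater" only when the statistic is positive
        return half if statistic > 0 else 1 - half
    if alternative == "less":
        # evidence for "less" only when the statistic is negative
        return half if statistic < 0 else 1 - half
    raise ValueError("alternative must be 'greater' or 'less'")

# same two-sided p-value, opposite signs of the statistic (made-up numbers)
p_supports_ha = one_sided_p_value(0.001, statistic=3.3)
p_against_ha = one_sided_p_value(0.001, statistic=-3.3)
print(p_supports_ha, p_against_ha)
```

'''
Dividing by 2 without checking the sign would report the same small p-value in
both cases, even though a negative statistic gives no support for ha: m > 135.
'''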
#### Question 4 ########
########################
'''
Hypothesis Testing for the Difference in Means
Test whether the mean RestBP of females differs
from the mean RestBP of males.
============================================================
Our hypotheses are:
h0: m_males = m_females
ha: m_males != m_females
This is a two-sample t-test:
'''
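# select Series (not one-column DataFrames) so that t and p come back as scalars
females_=heart.loc[heart['sex']==1,'trestbps']
males_=heart.loc[heart['sex']==0,'trestbps']
# equal_var=False runs Welch's t-test, which does not assume equal variances
t, p = stats.ttest_ind(females_, males_, equal_var=False)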
'''
If the null hypothesis were true, a result at least as extreme as the observed one
would occur with approximately 35% probability.
In other words, the p-value is much larger than the significance level,
so we fail to reject the null hypothesis.
'''
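'''
Passing equal_var=False asks for Welch's t-test. As a rough illustration of
what that computes, the sketch below derives the t statistic and the
Welch-Satterthwaite degrees of freedom with the standard library only.
Assumptions: the helper name welch_t and both samples are made up for
illustration; they are not RestBP values from heart.csv. The p-value is left
out because the standard library has no t-distribution CDF.
'''

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom (illustrative helper)."""
    # squared standard error of each group's mean (sample variance / group size)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df

# made-up samples of unequal size and variance
group_a = [120.0, 130.0, 140.0, 150.0]
group_b = [115.0, 125.0, 135.0, 145.0, 155.0, 165.0]
t_demo, df_demo = welch_t(group_a, group_b)
print(f"t = {t_demo:.3f}, df = {df_demo:.2f}")
```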