No. of entries: 614, Columns: 13
Columns/ Features :-
Loan_ID
Gender - Male/ Female
Married - (Yes/No)
Dependents - 0, 1, 2, 3+
Education - Graduate/ Not Graduate
Self_Employed - (Yes/No)
ApplicantIncome
CoapplicantIncome
LoanAmount - Loan amount in thousands
Loan_Amount_Term - Term of loan in months
Credit_History
Property_Area - Urban/ Semiurban/ Rural
Loan_Status - (Y/N)
Target: Loan_Status (Categorical)
Continuous (4) :-
ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term
Categorical (8) :-
Nominal :- Gender, Married, Self_Employed, Credit_History, Loan_Status(Target)
Ordinal :- Dependents, Property_Area, Education
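A minimal loading/typing sketch for the above (the file name train.csv is an assumption; point it at the actual CSV):

    import pandas as pd

    df = pd.read_csv("train.csv")        # hypothetical path to the training data
    print(df.shape)                      # expected: (614, 13)
    print(df.isnull().sum())             # null counts per column

    continuous = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]
    nominal    = ["Gender", "Married", "Self_Employed", "Credit_History", "Loan_Status"]
    ordinal    = ["Dependents", "Property_Area", "Education"]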
Univariate Analysis :-
This is intended to understand each variable/feature independently. It is usually done with plots:
a) Continuous :- histograms, boxplots
b) Categorical :- count plots
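A minimal plotting sketch for these univariate checks, assuming matplotlib/seaborn are installed and df, continuous, nominal and ordinal come from the loading sketch above:

    import matplotlib.pyplot as plt
    import seaborn as sns

    for col in continuous:                        # histograms + boxplots for continuous features
        fig, axes = plt.subplots(1, 2, figsize=(10, 3))
        df[col].plot(kind="hist", bins=30, ax=axes[0], title=col)
        sns.boxplot(x=df[col], ax=axes[1])
        plt.show()

    for col in nominal + ordinal:                 # count plots for categorical features
        sns.countplot(x=col, data=df)
        plt.title(col)
        plt.show()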
Continuous :-
1) ApplicantIncome :- 0 nulls, right skewed, many outliers
2) CoapplicantIncome :- 0 nulls, right skewed, some outliers
3) LoanAmount :- 22 nulls, fairly well distributed
4) Loan_Amount_Term :- 14 nulls, concentrated around a single high value (around 380)
Categorical :-
1) Gender :- 81% of applicants are male (13 nulls)
2) Married :- 65% of applicants are married (3 nulls)
3) Dependents :- 57% of applicants have no dependents,
   10% have 1 and another 10% have 2 dependents (15 nulls)
4) Education :- 80% of applicants are graduates
5) Self_Employed :- only 15% are self-employed (32 nulls)
6) Property_Area :- almost evenly split across Urban/Semiurban/Rural
7) Loan_Status :- 69% of applicants got their loan sanctioned
Bivariate Analysis :-
Categorical vs Target:
1) Gender :- nothing significant; the plot shows the outcome is not biased by gender
2) Self_Employed :- nothing significant
3) Married :- married applicants MAY have a ~10% higher chance of getting the loan
4) Education :- same pattern as Married
5) Credit_History :- applicants with credit history = 1 are far more likely to get their loan approved (about 80% of them do)
Continuous vs Target:
1) Applicant & Coapplicant Income :- even looking at each quartile separately, income does not seem to impact Loan_Status individually,
   which is quite counter-intuitive.
   But when the two are combined, people below a combined income of 2800 are less likely to get the loan.
2) Loan_Amount_Term :- applicants with a higher Loan_Amount_Term seem less likely to get the loan
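A rough sketch of these bivariate checks (the 2800 cut-off comes from the observation above; the other bin edges are illustrative assumptions):

    import numpy as np
    import pandas as pd

    # approval rate within each level of a categorical feature
    for col in ["Gender", "Married", "Education", "Self_Employed", "Credit_History"]:
        print(pd.crosstab(df[col], df["Loan_Status"], normalize="index"), "\n")

    # combined income vs approval: bin TotalIncome and look at approval per bin
    df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
    income_bins = pd.cut(df["TotalIncome"], [0, 2800, 4000, 6000, np.inf],
                         labels=["Low", "Average", "High", "Very high"])
    print(df.groupby(income_bins)["Loan_Status"].value_counts(normalize=True))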
Iteration 1:
Used all categorical variables except [Gender, Self_Employed, Education]
ROC-AUC: 0.84
Precision-Recall curve AUC: 0.71
Applied basic munging.
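A rough sketch of an Iteration 1-style baseline scored with both metrics; the exact encoding, munging and train/test split used originally are assumptions here:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, average_precision_score

    feats = ["Married", "Dependents", "Credit_History", "Property_Area"]   # Gender, Self_Employed, Education dropped
    X = pd.get_dummies(df[feats].astype(str))       # basic munging: one-hot encode (NaN becomes its own "nan" level)
    y = (df["Loan_Status"] == "Y").astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]

    print("ROC-AUC:", roc_auc_score(y_te, proba))
    print("PR-AUC :", average_precision_score(y_te, proba))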
Iteration 2:
Not much different from Iteration 1,
but added an imputation of Credit_History based on
-- TotalIncome (ApplicantIncome + CoapplicantIncome)
-- Dependents
-- Property_Area
Also found that ROC-AUC was a misleading statistic here, since it doesn't account for the imbalance between the classes;
instead,
the Precision-Recall curve turned out to be a better measure,
as it shows how well the minority class is being classified.
Finally --- the result:
it did not have a significant impact, but it lifted the accuracy from 0.777 to 0.78.
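One possible sketch of that Credit_History imputation: fill missing values with the most frequent value inside (income band, Dependents, Property_Area) groups; the quartile-based income banding is an assumption:

    import numpy as np
    import pandas as pd

    df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
    df["IncomeBand"] = pd.qcut(df["TotalIncome"], 4, labels=False)        # 4 income bands (assumption)

    # per-group mode of Credit_History, broadcast back to row level
    group_mode = (df.groupby(["IncomeBand", "Dependents", "Property_Area"], dropna=False)["Credit_History"]
                    .transform(lambda s: s.mode().iloc[0] if s.notna().any() else np.nan))
    df["Credit_History"] = df["Credit_History"].fillna(group_mode)
    # groups that were entirely missing fall back to the overall mode
    df["Credit_History"] = df["Credit_History"].fillna(df["Credit_History"].mode().iloc[0])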
Iteration 3:
Applied a Decision Tree classifier.
Seemed promising but pulled the accuracy down to 0.69 ( -_- )
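For reference, a minimal version of that experiment (reusing X_tr/X_te from the Iteration 1 sketch); an unconstrained tree easily overfits, which is consistent with the accuracy drop:

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)   # no depth limit -> prone to overfitting
    print("Accuracy:", tree.score(X_te, y_te))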
Iteration 4:
Yet another try.
This time I implemented Logistic Regression from scratch with tweakable parameters such as
epochs, batch size, learning rate, optimizer etc., but the accuracy is still 0.78.
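A compact from-scratch sketch with those knobs (epochs, batch size, learning rate); only plain mini-batch SGD is shown here, the optimizer switch is left out:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logreg(X, y, epochs=200, batch_size=32, lr_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            idx = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                p = sigmoid(X[batch] @ w + b)
                grad_w = X[batch].T @ (p - y[batch]) / len(batch)   # gradient of the log-loss
                grad_b = np.mean(p - y[batch])
                w -= lr_rate * grad_w
                b -= lr_rate * grad_b
        return w, b

    def predict(X, w, b, threshold=0.5):
        return (sigmoid(np.asarray(X, dtype=float) @ w + b) >= threshold).astype(int)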
Iteration n:
Tried many different, more complex approaches that seemed promising but failed on the test data.
Some of the models tried were
RandomForest, Bagging, AdaBoost, GradientBoosting and Voting with different parameters, but they only increased the variance
by lowering the bias; put simply, they were overfitting.
Also tried IterativeImputation, the SMOTE over-sampling strategy and RandomUnderSampling, but nothing seemed to work.
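For completeness, a sketch of the kinds of models and resampling listed above (parameters are illustrative, not the grids actually tried; SMOTE needs the imbalanced-learn package):

    from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                                  AdaBoostClassifier, GradientBoostingClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    models = {
        "rf":  RandomForestClassifier(n_estimators=200, random_state=42),
        "bag": BaggingClassifier(random_state=42),
        "ada": AdaBoostClassifier(random_state=42),
        "gb":  GradientBoostingClassifier(random_state=42),
    }
    models["vote"] = VotingClassifier(list(models.items()), voting="soft")

    # resampling inside a pipeline so SMOTE only touches the training folds
    smote_lr = Pipeline([("smote", SMOTE(random_state=42)),
                         ("clf", LogisticRegression(max_iter=1000))])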