No. of entries: 614, Columns: 13
Columns/ Features :-
Loan_ID
Gender - Male/ Female
Married - (Yes/No)
Dependents - 0, 1, 2, 3+
Education - Graduate/ Not Graduate
Self_Employed - (Yes/No)
ApplicantIncome
CoapplicantIncome
LoanAmount - Loan amount in thousands
Loan_Amount_Term - Term of loan in months
Credit_History
Property_Area - Urban/ Semiurban/ Rural
Loan_Status - (Y/N)
Target: Loan_Status (Categorical)
Continuous (4) :-
ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term
Categorical (8) :-
Nominal :- Gender, Married, Self_Employed, Credit_History, Loan_Status(Target)
Ordinal :- Dependents, Property_Area, Education
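A minimal loading/typing sketch for the above (the file name train.csv is an assumption; point it at the actual CSV):

    import pandas as pd

    df = pd.read_csv("train.csv")        # hypothetical path to the training data
    print(df.shape)                      # expected: (614, 13)
    print(df.isnull().sum())             # null counts per column

    continuous = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]
    nominal    = ["Gender", "Married", "Self_Employed", "Credit_History", "Loan_Status"]
    ordinal    = ["Dependents", "Property_Area", "Education"]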
Univariate Analysis :-
This is intended to understand each variable/feature independently. It is usually done with plots:
a) Continuous :- histograms, boxplots
b) Categorical :- count plots
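A minimal plotting sketch for these univariate checks, assuming matplotlib/seaborn are installed and df, continuous, nominal and ordinal come from the loading sketch above:

    import matplotlib.pyplot as plt
    import seaborn as sns

    for col in continuous:                        # histograms + boxplots for continuous features
        fig, axes = plt.subplots(1, 2, figsize=(10, 3))
        df[col].plot(kind="hist", bins=30, ax=axes[0], title=col)
        sns.boxplot(x=df[col], ax=axes[1])
        plt.show()

    for col in nominal + ordinal:                 # count plots for categorical features
        sns.countplot(x=col, data=df)
        plt.title(col)
        plt.show()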
Continuous :-
1) ApplicantIncome :- 0 nulls, right skewed, many outliers
2) CoapplicantIncome :- 0 nulls, right skewed, some outliers
3) LoanAmount :- 22 nulls, fairly well distributed
4) Loan_Amount_Term :- 14 nulls, concentrated around a single high value (around 380)
Categorical :-
1) Gender :- 81% of applicants are male (13 nulls)
2) Married :- 65% of applicants are married (3 nulls)
3) Dependents :- 57% of applicants have no dependents,
   10% have 1 and another 10% have 2 dependents (15 nulls)
4) Education :- 80% of applicants are graduates
5) Self_Employed :- only 15% are self-employed (32 nulls)
6) Property_Area :- almost evenly split across Urban/Semiurban/Rural
7) Loan_Status :- 69% of applicants got their loan sanctioned
Bivariate Analysis :-
Categorical vs Target:
1) Gender :- nothing significant; the plot shows the outcome is not biased by gender
2) Self_Employed :- nothing significant
3) Married :- married applicants MAY have a ~10% higher chance of getting the loan
4) Education :- same pattern as Married
5) Credit_History :- applicants with credit history = 1 are far more likely to get their loan approved (about 80% of them do)
Continuous vs Target:
1) Applicant & Coapplicant Income :- even looking at each quartile separately, income does not seem to impact Loan_Status individually,
   which is quite counter-intuitive.
   But when the two are combined, people below a combined income of 2800 are less likely to get the loan.
2) Loan_Amount_Term :- applicants with a higher Loan_Amount_Term seem less likely to get the loan
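A rough sketch of these bivariate checks (the 2800 cut-off comes from the observation above; the other bin edges are illustrative assumptions):

    import numpy as np
    import pandas as pd

    # approval rate within each level of a categorical feature
    for col in ["Gender", "Married", "Education", "Self_Employed", "Credit_History"]:
        print(pd.crosstab(df[col], df["Loan_Status"], normalize="index"), "\n")

    # combined income vs approval: bin TotalIncome and look at approval per bin
    df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
    income_bins = pd.cut(df["TotalIncome"], [0, 2800, 4000, 6000, np.inf],
                         labels=["Low", "Average", "High", "Very high"])
    print(df.groupby(income_bins)["Loan_Status"].value_counts(normalize=True))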
Iteration 1:
Used all categorical variables except [Gender, Self_Employed, Education]
ROC-AUC: 0.84
Precision-Recall curve AUC: 0.71
Applied basic munging.
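A rough sketch of an Iteration 1-style baseline scored with both metrics; the exact encoding, munging and train/test split used originally are assumptions here:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, average_precision_score

    feats = ["Married", "Dependents", "Credit_History", "Property_Area"]   # Gender, Self_Employed, Education dropped
    X = pd.get_dummies(df[feats].astype(str))       # basic munging: one-hot encode (NaN becomes its own "nan" level)
    y = (df["Loan_Status"] == "Y").astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]

    print("ROC-AUC:", roc_auc_score(y_te, proba))
    print("PR-AUC :", average_precision_score(y_te, proba))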
Iteration 2:
Not much different from Iteration 1,
but added an imputation of Credit_History based on
-- TotalIncome (ApplicantIncome + CoapplicantIncome)
-- Dependents
-- Property_Area
Also found that ROC-AUC was a misleading statistic here, since it doesn't account for the imbalance between the classes;
instead,
the Precision-Recall curve turned out to be a better measure,
as it shows how well the minority class is being classified.
Finally --- the result:
it did not have a significant impact, but it lifted the accuracy from 0.777 to 0.78.
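One possible sketch of that Credit_History imputation: fill missing values with the most frequent value inside (income band, Dependents, Property_Area) groups; the quartile-based income banding is an assumption:

    import numpy as np
    import pandas as pd

    df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
    df["IncomeBand"] = pd.qcut(df["TotalIncome"], 4, labels=False)        # 4 income bands (assumption)

    # per-group mode of Credit_History, broadcast back to row level
    group_mode = (df.groupby(["IncomeBand", "Dependents", "Property_Area"], dropna=False)["Credit_History"]
                    .transform(lambda s: s.mode().iloc[0] if s.notna().any() else np.nan))
    df["Credit_History"] = df["Credit_History"].fillna(group_mode)
    # groups that were entirely missing fall back to the overall mode
    df["Credit_History"] = df["Credit_History"].fillna(df["Credit_History"].mode().iloc[0])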
Iteration 3:
Applied a Decision Tree classifier.
Seemed promising but pulled the accuracy down to 0.69 ( -_- )
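For reference, a minimal version of that experiment (reusing X_tr/X_te from the Iteration 1 sketch); an unconstrained tree easily overfits, which is consistent with the accuracy drop:

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)   # no depth limit -> prone to overfitting
    print("Accuracy:", tree.score(X_te, y_te))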
Iteration 4:
Yet another try.
This time I implemented Logistic Regression from scratch with tweakable parameters such as
epochs, batch size, learning rate, optimizer etc., but the accuracy is still 0.78.
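A compact from-scratch sketch with those knobs (epochs, batch size, learning rate); only plain mini-batch SGD is shown here, the optimizer switch is left out:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logreg(X, y, epochs=200, batch_size=32, lr_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            idx = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                p = sigmoid(X[batch] @ w + b)
                grad_w = X[batch].T @ (p - y[batch]) / len(batch)   # gradient of the log-loss
                grad_b = np.mean(p - y[batch])
                w -= lr_rate * grad_w
                b -= lr_rate * grad_b
        return w, b

    def predict(X, w, b, threshold=0.5):
        return (sigmoid(np.asarray(X, dtype=float) @ w + b) >= threshold).astype(int)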
Iteration n:
Tried many different, more complex approaches that seemed promising but failed on the test data.
Some of the models tried were
RandomForest, Bagging, AdaBoost, GradientBoosting and Voting with different parameters, but they only increased the variance
by lowering the bias; put simply, they were overfitting.
Also tried IterativeImputation, the SMOTE over-sampling strategy and RandomUnderSampling, but nothing seemed to work.
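For completeness, a sketch of the kinds of models and resampling listed above (parameters are illustrative, not the grids actually tried; SMOTE needs the imbalanced-learn package):

    from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                                  AdaBoostClassifier, GradientBoostingClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    models = {
        "rf":  RandomForestClassifier(n_estimators=200, random_state=42),
        "bag": BaggingClassifier(random_state=42),
        "ada": AdaBoostClassifier(random_state=42),
        "gb":  GradientBoostingClassifier(random_state=42),
    }
    models["vote"] = VotingClassifier(list(models.items()), voting="soft")

    # resampling inside a pipeline so SMOTE only touches the training folds
    smote_lr = Pipeline([("smote", SMOTE(random_state=42)),
                         ("clf", LogisticRegression(max_iter=1000))])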