EDA_Loan.Rmd

---
title: "Prosper Loan Data Analysis"
author: "Mayukh Sarkar"
output:
  pdf_document:
    highlight: tango
  html_notebook: default
  prettydoc::html_pretty:
    highlight: vignette
    keep_md: yes
    theme: cayman
  word_document: default
header-includes: \usepackage{booktabs}
---

```{r echo=FALSE, message=FALSE, warning=FALSE}
suppressMessages(library(devtools))
suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
suppressMessages(library(dplyr))
suppressMessages(library(memisc))
suppressMessages(library(gridExtra))
suppressMessages(library(RColorBrewer))
suppressMessages(library(magrittr))
suppressMessages(library(Kmisc))
suppressMessages(library(xtable))
suppressMessages(library(knitr))
suppressMessages(library(DT))
suppressMessages(library(scales))
suppressMessages(library(plotrix))
suppressMessages(library(corrplot))
```


```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData <- read.csv('prosperLoanData.csv')
```
## Introduction
This dataset is financial dataset and this is related to the loan, borrowers, lenders, interest rates and stuffs like that. Prosper or [Prosper Marketplace Inc.](https://www.prosper.com/) is a San Francisco, California based company specializing in loans at low interest rates to the borrowers. In this dataset, we are using the data from the Posper to analyse it and trying to find the pattern in the Prosper data. This may be tedious because of the sheer size of the dataset and the complicated nature of all the financial datasets. We are using R, an advanced high level programming language of the analysis with some of its most popular graphic package ```ggplot```. The codes are not visible in the HTML/PDF export for the simlicity but the codes can be reviewed from the RMD file. 

The dataset is comprised of 81 variables and contains 113937 entries. The variable that are explored in the dataset are the following 
Term : Amount of month customers opted for loan

**LoanStatus** : Current status of the loan like chargedoff, completed, defauted etc...

**EstimatedEffectiveYield** : Yield of lenders from borrowers minus the processing fee and late fines

**ProsperScore** : Risk Factor score from 1 to 10. 10 being least risky

**BorrowerAPR** : The Borrower's Annual Percentage Rate (APR) for the loan.

**BorrowerRate** : The Borrower's interest rate for this loan.

**ListingCategory..numeric.** : Prosper rating for borrowers in numbers

**EmploymentStatus** : Current type of employment

**Occupation** : Occupation of borrower at the time of listing

**EmploymentStatusDuration** : How long the employee has been employed

**IsBorrowerHomeowner** : Does the borrower owns house at the time of listing (True & False)

**ProsperRating..Alpha.** : Prosper rating for borrowers in alphabets

**IncomeVerifiable** : If the income of the borrower is verifiable at the time of listing (True & False)

**StatedMonthlyIncome** : Monthly income of the borrower

**MonthlyLoanPayment** : Monthly loan payment amount

**Recommendations** : Recommendations the borrowers has at the time of listing

**DebtToIncomeRatio** : The debt to income ratio of the borrower at the time the credit profile was pulled.

**LoanOriginalAmount** : Original amount of the loan

**LoanOriginationQuarter** : Quarter of the month when loan was originated

A basic exploration of the datset would give the following information
```{r echo=FALSE, message=FALSE, warning=FALSE}
str(loanData)
```

## The first Qustion
The first question that needs to be asked is HOW LONG PEOPLE USUALLY OPT FOR LOAN? Let's answer this question with a histogram

```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>% 
    ggplot(aes(x = Term / 12)) +
        geom_histogram(binwidth = 1) +
        theme_hc() + 
        xlab('Term in Years') +
        ylab('How many?') + 
        scale_x_continuous(breaks = seq(1, 5, 2)) +
        ggtitle("Distribution Loan Terms") +
        theme(plot.title = element_text(face = 'bold.italic', 
                                        colour = "black", size=18))
```

We can see that people don't really loan any amount for less than one year and the most popular loan amount is of 3 years although some people do choose for 5 years. Now lets assume something and check if it is correct or not. I assume that people who opted for 1 year would fail to repay their loan more as compared to the people who opted for 3 or 5 years. But is it true? So the next question is
DOES PEOPLE REPAY THE LOANS BETTER WHEN THEY ARE GIVEN MORE TIME?

```{r echo=FALSE, message=FALSE, warning=FALSE}
totalLoanTermes <- loanData %>%
    group_by(Term) %>%
    summarise(n = n()) %>%
    use_series(n)
# totalLoanTermes <- rep(totalLoanTermes, each = 2)
loanData.two_status <- loanData %>%
    group_by(Term, LoanStatus) %>%
    summarise(n = n())
levels(loanData.two_status$LoanStatus) <-
    c(levels(loanData.two_status$LoanStatus), "Other")
loanData.two_status$LoanStatus[!(loanData.two_status$LoanStatus %in%
                                     c("Completed", "Defaulted"))] <- "Other"

loanData.two_status <- loanData.two_status %>%
    group_by(Term, LoanStatus) %>%
    summarise(p = n(), total = sum(n)) %>%
    mutate(freq = round(total / sum(total) * 100, 2))
```

## Plotting the trend of different customer types


```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = Term / 12, y = freq, fill = LoanStatus), 
       data = loanData.two_status) +
    geom_bar(stat = 'identity', position="stack", color = 'black') + 
    scale_x_continuous(breaks = c(1, 3, 5)) +
    xlab('Loan term in years') + 
    theme_pander() +
    ggtitle("LoanStatus: Completed vs Defaulted", 
            subtitle = "for each loan Term") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = '#F35E3A', size=18),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = '#17b4ba', size=11)) +
    ylab("% of Borrowers")
```

That's unusual because for **LoanStatus** of _Completed_, we see a trend but not in **LoanStatus** of _Defaulted_ though. Now just because some customer's loan status is not Completed, doesn't mean his/her loan status is Defaulted.

We can see that there is a trend in the of **LoanStatus** of _Completed_ and _Other_ but we see any trend in _Defaulted_ though. Now just because some customer's loan status is not Completed, doesn't mean his/her loan status is Defaulted. So let's explore the Defaulted section more.

The reason we are exploring this is because we want to find if BANKS SHOULD FOCUS ON CUSTOMERS OPTING LOANS FOR SMALLER TERMS OR NOT? AND HOW SOME OF THE MOST COMMON BORROWER LOANSTATUS VARIES FOR DIFFERENT LOAN TERMS? For this getting data of only two loan status won't be enough. So let's divide the customers into two groups

    1. Good Customer
    2. Bad Customer


```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData.gb <- loanData %>%
    group_by(Term, LoanStatus) %>%
    summarise(n = n()) %>%
    mutate(customer_type = ifelse((LoanStatus == 'Current' | 
                                   LoanStatus == 'Completed'| 
                                   LoanStatus == 'FinalPaymentInProgress'),
                                  'good', 'bad')) %>%
    filter(LoanStatus != 'Cancelled') %>%
    mutate(freq = n / sum(n) * 100) %>%
    ungroup() %>%
    group_by(Term, customer_type) %>%
    summarise(n = sum(freq))
```

But before that lets see the distribution of LonaStatus variable

```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
    group_by(LoanStatus) %>%
    summarise(n = n()) %>%
    ggplot(aes(x = LoanStatus, y = n)) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_pander() +
    ylab("Number of borrowers") +
    xlab("Different Loan Status") +
    coord_flip()
```

## The bigger Picture

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = Term / 12, y = n, fill = customer_type), data = loanData.gb) +
    geom_bar(stat = 'identity', position="dodge") + 
    theme_pander() +
    xlab('Loan term in years') +
    ylab('Percentage') +
    scale_x_continuous(breaks = c(1, 3, 5)) +
    geom_text(aes(label = sprintf("%2.1f%%", round(n, 2)), vjust = -.3), 
              color="black") +
    facet_wrap(~customer_type) +
    scale_fill_hc() +
    ggtitle("Good vs Bad", 
            subtitle = "Number of good borrowes vs bad borrowers 
in each loan Term") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = '#333338', 
                                    size=18),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = '#6aa5e7', 
                                       size=11))
```

As we can that the short and long term prospects of the banks who issues the loan seems good because both in 1 and 5 year category, we see more number of good customers as compared to the 3 years. The number of good customers is however decreased a little from 1 year to 5 years but as compared to 3 years it is nothing. This may mean that banks should focus in long and short term customers as compared to medium term customers. But we should take this fact with a grain of slat because there might be other factors that may affect this trend. Only further exploration of the data would reveal that.

But before we move ahead, lets analyse the distribution of some of the features that might be useful for us in this data set. We would also remove some the outliers to better understand the features and may dive into some descriptive statistics too.

## Distribution for some continuous/categorical features

```{r echo=FALSE, message=FALSE, warning=FALSE}
prosperScoreCountPlot <- loanData %>%
    filter(!is.na(ProsperScore)) %>%
    group_by(ProsperScore) %>%
    summarise(n = n()) %>%
    ggplot(aes(x = ProsperScore, y = n)) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_pander() +
    ylab("No. of borrowers") +
    scale_y_continuous(breaks = seq(0, 15000, 2500)) +
    scale_x_continuous(breaks = seq(1, 11, 1))

EmploymentStatusDurationHist <- ggplot(aes(x = EmploymentStatusDuration), 
                                       data = loanData) +
    geom_histogram(binwidth = 10, color = 'white') +
    scale_x_continuous(breaks = seq(0, 400, 100), limits = c(0, 400)) +
    theme_pander() +
    xlab("Borrower Experience")

prosperRatingCount <- loanData %>%
    group_by(ProsperRating..Alpha.) %>%
    summarise(n = n()) %>%
    ggplot(aes(x = ProsperRating..Alpha., y = n)) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_pander() +
    ylab("\nNo. of borrowers") +
    xlab("Borrower Rating")

StatedMonthlyIncomeHist <- ggplot(aes(x = StatedMonthlyIncome), 
                                  data = loanData) +
    geom_histogram(binwidth = 1000, color = 'white') +
    scale_x_continuous(limits = c(0, 20000)) +
    theme_pander()

MonthlyLoanPaymentHist <- ggplot(aes(x = MonthlyLoanPayment), 
                                 data = loanData) +
    geom_histogram(binwidth = 50, color = 'white') +
    scale_x_continuous(breaks = seq(0, 1000, 300), limits = c(0, 1000)) +
    theme_pander()

LoanOriginalAmountHist <- ggplot(aes(x = LoanOriginalAmount), 
                                 data = loanData) +
    geom_histogram(binwidth = 1000, color = 'white') +
    scale_x_continuous(breaks = seq(0, 25000, 7000), limits = c(0, 25000)) +
    theme_pander()


BorrowerAPRHist <- ggplot(aes(x = BorrowerAPR), 
                          data = loanData) +
    geom_histogram(binwidth = 0.01, color = 'white') +
    scale_x_continuous(breaks = seq(0.07, 0.3, 0.05), limits = c(0.07, 0.3)) +
    theme_pander()

BorrowerRateHist <- ggplot(aes(x = BorrowerRate), 
                           data = loanData) +
    geom_histogram(binwidth = 0.01, color = 'white') +
    scale_x_continuous(breaks = seq(0.07, 0.3, 0.05), 
                       limits = c(0.07, 0.3)) +
    theme_pander()

DebtToIncomeRatioHist <- ggplot(aes(x = DebtToIncomeRatio), 
                                data = loanData) +
    geom_histogram(binwidth = 0.05, color = 'white') +
    scale_x_continuous(limits = c(0.0, 0.7), 
                       breaks = seq(0.0, 0.6, 0.1)) +
    theme_pander()

grid.arrange(prosperScoreCountPlot,
             EmploymentStatusDurationHist,
             prosperRatingCount,
             StatedMonthlyIncomeHist,
             MonthlyLoanPaymentHist,
             LoanOriginalAmountHist,
             BorrowerAPRHist,
             BorrowerRateHist,
             DebtToIncomeRatioHist,
             ncol = 3, nrow = 3,
             top = "Distribution of feature variables\n")
```

## EstimatedEffectiveYield - A better measure for a successful Lender

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = EstimatedEffectiveYield), data = loanData) +
    geom_histogram(aes(y = ..density..), binwidth = 0.01, na.rm = T, 
                   color = 'darkblue', fill = 'lightblue') +
    theme_hc() +
    scale_x_continuous(limits = c(-0.1, 0.5), 
                       breaks = seq(-0.1, 0.5, 0.05)) +
    geom_density(alpha=.2, fill="#FF6666", na.rm = T) + 
    geom_vline(aes(xintercept=mean(EstimatedEffectiveYield, na.rm=T)), 
               color="blue", linetype="dashed", size=1) +
    ggtitle("Distribution of EstimatedEffectiveYield", 
            subtitle = "with the mean axis") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = '#FF6666', 
                                    size=18),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'darkblue', 
                                       size=13))
```

**EstimatedEffectiveYield** is said to be better estimate for the lenders than the interest rate because the interest includes _processing fees_, _uncollected interest due to borrower being chargedoff_. Plus it also doesn't include _late fines_. Hence EstimatedEffectiveYield takes account for all these things and it is thus a better measure. Above we are trying to see the distribution of the EstimatedEffectiveYield and we can see that it is multimodal. We see the most popular EstimatedEffectiveYield is around 0.3 while the mean is around 0.17 represented by the blue dotted line. The multimodal pattern shows that there are multiple EstimatedEffectiveYield that is popular. Strangely we can also see that the some customers have negative EstimatedEffectiveYield. This may mean a lot of things. This may mean that their BorrowerRate is a lot lower than their _service fee rate_ or these customer's _uncollected interest on chargeoff_ is lot more or they just never payed the late fee and payed back the loans along with the interest always on time.

## Does Lenders prefer borrowers with better Prosper Score ? 

Now lets see what is the distribution of EstimatedEffectiveYield depending on the different **ProsperScore**. This is important because we want to answer a question, i.e., IF LENDERS GET MORE EstimatedEffectiveYield IF THEY HAVE BETTER ProsperScore ?

We are using violin plot instead of box plot for this.

```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData$ProsperScore <- factor(loanData$ProsperScore)
ggplot(aes(x = ProsperScore, y = EstimatedEffectiveYield, fill=ProsperScore), 
       data = subset(loanData, !is.na(loanData$ProsperScore) & 
                         !is.na(loanData$EstimatedEffectiveYield))) +
    geom_violin(trim = F, scale = "width") +
    stat_summary(fun.y=median, geom="point", size=2, color="black") +
    scale_fill_manual(values=colorRampPalette(c("pink", "lightgreen"))(11)) + 
    theme_minimal() +
    ylab('Effective yeild of Lenders') +
    ggtitle("Effective Yield for each Risk Factors",
            subtitle = "story of lenders preference") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = '#f3b0c0', 
                                    size=22),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'darkgreen', 
                                       size=14)) +
    guides(fill = F)
```

Well this is interesting. We see an wonderful trend here. Here more score for the risk factor means better the borrower and lesser score for risk factor means poor prospects from the borrowers. We can see that for lower ProsperScore distribution of effective yield in a lot more than the higher ProsperScore. This may mean that lenders charges a variety of interest rate from the borrower with poor prospects as compared to borrowers with better prospect. We can also notice how median (represented by the black dot) is decreasing as ProsperScore is increasing. This may mean that lenders give more relaxations to borrowers with better ratings as compared to borrowers with poor rating. Does that mean lenders trust and like borrowers with better ProsperScore! Let's do a little more analysis to reveal it more. The reason we need more exploration on this is because EstimatedEffectiveYield includes more things such as late fine and doesn't include processing fee and others. So more EstimatedEffectiveYield for lesser ProsperScore borrowers may be due to high late fines because lesser ProsperScore borrowers are more prone to fail to repay their loan on time each month. So, Let's see if borrower's interest rate shows the same trend for each ProsperScore categories or not because interest rates doesn't include late fines. 


```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=14}

borrower_apr <- ggplot(aes(x = ProsperScore, 
                           y = BorrowerAPR), 
       data = subset(loanData, !is.na(loanData$ProsperScore) & 
                         !is.na(loanData$BorrowerAPR))) +
    geom_boxplot() +
    theme_pander()

borrower_rate <- ggplot(aes(x = ProsperScore, 
                            y = BorrowerRate), 
       data = subset(loanData, !is.na(loanData$ProsperScore) & 
                         !is.na(loanData$BorrowerRate))) +
    geom_boxplot() +
    theme_pander()
grid.arrange(borrower_apr, borrower_rate, 
             nrow = 1, ncol = 2,
             top = "Interest Rate Distributions")
```

Finally things are revealed much better now! We can clearly observe that for both _BorrowerAPR_ and _BorrowerRate_ which are metric for interest rates, we see a declining trend as the _ProsperScore_ is increasing. This justifies the fact even more that lenders somehow prefers to charge less for all the borrowers with better ProsperScore as compared to borrowers with inferior ProsperScore. Here food for thought is that is it moral too? Well I guess no amount of EDA can reveal it.

But are we missing the real question here? We are performing all these analysis on the Prosper loan data to answer several questions but what is the most important thing for a loan ? It's **BORROWERS** and what is the most important question related to borrowers? 

## Distribution of Listing Category

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=7, fig.width=10}
loanData %>%
    group_by(ListingCategory..numeric.) %>%
    summarise(n = n()) %>%
    ggplot(aes(x = ListingCategory..numeric., y = n)) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_pander() +
    scale_x_continuous(breaks = seq(0, 20, 1)) +
    scale_y_continuous(breaks = seq(0, 60000, 5000)) +
    xlab("Listing Category") +
    ylab("Number of borrowers") +
    ggtitle("Distribution of ListingCategory")
    
```

The other section indicated by number 7 with more than 10000 of the borrowers would remain unknown. Why the loan reason would be unknown? Does it indicate any illegal reason to opt for loan! Let's explore this more. Let's see the Occupation and Employment status of Borrowers for all the loans falling into the _others_ section.

## Who are others??

```{r echo=FALSE, message=FALSE, warning=FALSE}
other.borrowers <- loanData %>%
    filter(ListingCategory..numeric. == 7) %>%
    group_by(EmploymentStatus, Occupation) %>%
    summarise(n = n()) %>%
    filter(!is.na(Occupation)) %>%
    ungroup() %>%
    arrange(EmploymentStatus, desc(n)) %>%
    ungroup() %>%
    group_by(EmploymentStatus) %>%
    top_n(n = 5, wt = n) %>%
    ungroup()
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
######### This is for knitr in HTML ############
# datatable(other.borrowers,
#           colnames = c('Employment Status' = 'EmploymentStatus',
#                        'Occupation' = 'Occupation',
#                        'Total' = 'n'),
#           class = 'stripe',
#           rownames = FALSE,
#           filter = 'top',
#           options = list(
#               pageLength = 5,
#               autoWidth = TRUE,
#               columnDefs = list(list(targets = c(1, 2), searchable = FALSE))
#               ))
```
```{r results='asis', echo=FALSE}
######## This is for for knitr in PDF  #########
employed <- other.borrowers %>%
    filter(EmploymentStatus == 'Employed') %>%
    dplyr::select(-EmploymentStatus)
full.time <- other.borrowers %>%
    filter(EmploymentStatus == 'Full-time') %>%
    dplyr::select(-EmploymentStatus)
part.time <- other.borrowers %>%
    filter(EmploymentStatus == 'Part-time') %>%
    dplyr::select(-EmploymentStatus)
self.employed <- other.borrowers %>%
    filter(EmploymentStatus == 'Self-employed') %>%
    dplyr::select(-EmploymentStatus)
others <- other.borrowers %>%
    filter(EmploymentStatus == 'Retired' | EmploymentStatus == 'Other')

t1 <- kable(employed, format = "latex", booktabs = TRUE)
t2 <- kable(full.time, format = "latex", booktabs = TRUE)
t3 <- kable(part.time, format = "latex", booktabs = TRUE)
t4 <- kable(self.employed, format = "latex", booktabs = TRUE)
t5 <- kable(others, format = "latex", booktabs = TRUE)
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{ Employed }
      \\centering",
        t1,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{ Full-time }",
        t2,
    "\\end{minipage}
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{ Part-time }",
        t3,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{ Self-employed }",
        t4,
    "\\end{minipage}
    \\centering
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{ Others }",
        t5,
    "\\end{minipage}
\\end{table}"
))
```

We can see some really strange things happening here. In fact there are so many strange things that I have to list them down
    
**Firstly** what is the difference between _Employed_ and _Full-time_?
    
**Secondly** there are too many others. The complete sample of the data is taken for those borrowers whose purpose for borrowing is _Others_. But there are _Others_ for _Occupation_ and _EmploymentStatus_ too. 

**Thirdly** how can someone be _Retired_ but still has _Occupation Others_ ?

**Lastly** there are some dubious borrowers in the _Others_ table. We can see that there are 352 cases where the borrowing reason, EmploymentStatus, Occupation all are _Others_ and 95 cases where _Occupation_ is left blank.
    
These above points, specially the last one says not all the burrower's information is revealed or in some cases they seemed to be fake too. I guess it is a partial dead end for us to see why people mention _Other_ as a reason while opting for loan.

Let's now move into something more computational. Let's first see after working for how many years, people like to opt for a loan? We can then move into some deeper level of analysis.

## Does experienced people opt for loan lesser ?

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=4, fig.width=11}
borrower.experience <- loanData %>%
    filter(!is.na(EmploymentStatusDuration))

hist1 <- ggplot(aes(x = EmploymentStatusDuration / 12), 
                data = borrower.experience) +
         geom_histogram(binwidth = 1, 
                        color = 'red', 
                        fill = 'deeppink', 
                        alpha = 1/2) +
         theme_pander() +
         scale_x_continuous(breaks = 
                        seq(min(borrower.experience$EmploymentStatusDuration),
                        max(borrower.experience$EmploymentStatusDuration),
                                        2)) +
         coord_cartesian(xlim = c(0, 41)) +
         xlab('\nTotal experience in years') +
         ggtitle("Loans for Experienced people",
                 subtitle = "actual distribution") +
        theme(plot.title = element_text(face = 'bold.italic', 
                                        colour = 'deeppink', 
                                        size=20),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'black', 
                                       size=14))

duration <- loanData %>%
    filter(!is.na(EmploymentStatusDuration)) %>%
    mutate(EmploymentStatusDuration = round(EmploymentStatusDuration / 12, 0)) %>%
    group_by(EmploymentStatusDuration) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n) * 100) %>%
    top_n(n = 5, wt = freq)


hist2<- 
    ggplot(aes(x = reorder(EmploymentStatusDuration, -freq), y = freq, fill = freq), 
           data = duration) +
        geom_bar(stat = 'identity', position="dodge",
             color = 'white') +
         theme_pander() +
    scale_fill_continuous_tableau(palette = "Blue") +
    guides(fill = F) +
    ggtitle("Loans for Experienced people",
                 subtitle = "top 5 experience levels") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                        colour = 'darkslategrey', 
                                        size=20),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'black', 
                                       size=14)) +
    xlab("Top 5 experience levels in years") +
    ylab("% of borrowers")
grid.arrange(hist1, hist2, ncol = 2, nrow = 1) 

```

## I was right ..:)

Well this was really interesting. We can see that our assumption was overall correct because the as people gain experience in their jobs, lesser they opt for the loans. This may be due the fact that as people gain experience their salary also increases and hence the lesser reason they find to opt for loans or the reason can be something different. We can also see in the right histogram of 95% quarterly, most people opt for loans when they have almost 2 years of experience. 

Lets explore it even more let's see the correlation between number of month a person has experience and the EstimatedEffectiveYield variable that we have explored earlier. The question that we want to explore - 
DOES LENDERS ASK FOR LESS INTEREST FORM THE BORROWERS WHO ARE MORE EXPERIENCED? This can be true because people with more job experience should have more potential to replay their loan better because they have higher paying jobs and hence their ProsperScore would be higher. And as we have seen that borrowers with better prosper score pay lesser to the lenders and lenders somehow prefer them.

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(y = EmploymentStatusDuration / 12, 
           x = EstimatedEffectiveYield, 
           color = ProsperScore), 
       data = subset(loanData, !is.na(loanData$ProsperScore))) +
    geom_point(na.rm = T, position = 'jitter', 
               alpha = 1 / 6, shape = 1) +
    scale_color_brewer(type = 'div', 
                       palette = 'Spectral', 
                       direction = 1,
                       guide = guide_legend(title = 'ProsperScore', 
                                    reverse = T,
                                    override.aes = list(alpha = 1, size = 2))) +
    coord_cartesian(xlim = c(0.0, 0.3)) +
    xlab('Actual yield interest of lenders') +
    ylab('Boroower\'s Experience in Years') +
    theme_pander() +
    geom_smooth(se = F) +
    ggtitle("Correlation of Borrower's Experience", 
            subtitle = " with Actual Yield of Lenders")
```

As it seems from the scatter plot that the pattern seems to have no correlation. It means our assumption was not correct. Borrowers with better **EmploymentStatusDuration** don't seem to get any special relaxation from lenders in terms of interest each month. This can be further confirmed by checking the _Perarson's correlation Coefficient_. We see that the correlation coefficient is almost 0 (-0.0233).

```{r echo=FALSE, message=FALSE, warning=FALSE}
with(subset(loanData, !is.null(EmploymentStatusDuration) & 
                !is.null(EstimatedEffectiveYield)), 
     cor.test(EmploymentStatusDuration / 12, EstimatedEffectiveYield))
```

This also says that even though the true correlation is not true and alternative hypothesis is accepted, their is some **serious statistical evidence of significance**. But if we look into the CI, it is within the range of -0.03 to -0.016 which is very small. Good R value is said a value < -0.3 or value > 0.3. This value is definitely not that large. Judging from the context latest it is not. So we can say that there is **no practical significance.** Hence we can not tell with any confirmation that More Experienced Lenders end up paying Less/More interest to the Lenders.

## Other Correlations
```{r echo=FALSE, message=FALSE, warning=FALSE}
d <- data.frame(Term=loanData$Term,
                BorrowerAPR=loanData$BorrowerAPR,
                BorrowerRate=loanData$BorrowerRate,
                LenderYield=loanData$LenderYield,
                EffectiveYield = loanData$EstimatedEffectiveYield,
                Loss = loanData$EstimatedLoss,
                Duration = loanData$EmploymentStatusDuration,
                income = loanData$StatedMonthlyIncome,
                payment = loanData$MonthlyLoanPayment)
d[is.na(d)] <- 0
d <- round(cor(d[sapply(d, is.numeric)]), 2)
corrplot(d, method = "circle")
```


## How people loan for their Homes ?

Here we are going to explore the people for two category. 

    1. First those who are opting for loan to rennovation of home when they have a house.
    2. Second those who opt for home loans even though they don't have house.


```{r echo=FALSE, message=FALSE, warning=FALSE}
home.loans <- loanData %>%
    filter(ListingCategory..numeric. == 2, 
           ProsperRating..Alpha. != "", 
           IsBorrowerHomeowner != "") %>%
    group_by(IsBorrowerHomeowner, ProsperRating..Alpha.) %>%
    summarise(n = n()) %>%
    mutate(freq = round(n / sum(n) * 100, 2))
levels(home.loans$IsBorrowerHomeowner) <- c('No Home', 'Has Home')
home.loans$ProsperRating..Alpha. <- 
    factor(home.loans$ProsperRating..Alpha., ordered = T,
           levels = c('AA', 'A', 'B', 'C', 'D', 'E', 'HR'))
home.loans <- arrange(home.loans, ProsperRating..Alpha.)
```

**Now let's plot the data**

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = ProsperRating..Alpha., y = freq, 
           fill = IsBorrowerHomeowner), data = home.loans) +
    geom_bar(stat = 'identity', position="dodge",
             color = 'white') +
    theme_hc() +
    scale_fill_fivethirtyeight() +
    xlab('Prosper Ratings for borrowers') +
    ylab('% of borrowers') +
    ggtitle("Home Improvement for All", 
            subtitle = "even if you dont have home") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = 'red', 
                                    size=20),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'dodgerblue3', 
                                       size=14))
```

## Are Investors partial to Borrowers with better rating ??

Here we visualizing an interesting trend. We are seeing for people with ProsperRating of _AA, A & B_, number of people opted for loan mentioning the reason for loan as **Home Improvement** but still has no home is more than the people who has home and opted for loan for Home Improvement and truly a home owner. But for people with poor rating, the trend in opposite and expected also. People who don't have any house should not get any loan mentioning the loan reason of House Improvement. These whole thing shows not only lenders give much preference to Rating over Verification and KYC (Know Your Customer) but this also shows irrespective of rating of borrowers, lenders font care much about loan reason as a whole. The following code proves it even more. We can see that there are not a single borrowers who mentioned their loan purpose to be Home Improvement when they didn't have their own house and their loan was not approved. That's a strong evidence to show that investors don't care much about how much borrowers fudge or fake their loan purpose.

```{r message=FALSE, warning=FALSE}
a <- loanData %>%
    filter(ListingCategory..numeric. == 2,
           LoanStatus == 'Cancelled')
print.numeric_version(dim(a)[0])
```
## What's up with the people having dubious income sources ?

```{r echo=FALSE, message=FALSE, warning=FALSE}
dubious.borrowers <- loanData %>%
    filter(ListingCategory..numeric. == 7, 
           Occupation == 'Other', 
           EmploymentStatus == 'Other',
           IncomeVerifiable == 'False') %>%
    group_by(ProsperRating..Alpha.) %>%
    summarise(n = n()) %>%
    mutate(freq = round(n / sum(n) * 100, 2))
dubious.borrowers$ProsperRating..Alpha. <- 
    ordered(dubious.borrowers$ProsperRating..Alpha.,
            levels = c('AA', 'A', 'B', 'C', 'D', 'E', 'HR'))
dubious.borrowers <- arrange(dubious.borrowers, ProsperRating..Alpha.)
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = ProsperRating..Alpha., y = freq, fill = freq), 
       data = dubious.borrowers) +
    geom_bar(stat = 'identity', position="dodge") + 
    theme_pander() +
    scale_fill_continuous(low = 'deepskyblue1', 
                          high = 'dodgerblue4') +
    guides(fill=FALSE) +
    xlab('\nProsper rating for borrowers') +
    ylab('% of borrowers') +
    geom_text(aes(label = sprintf("%2.1f%%", round(freq, 2))), 
              color="dodgerblue4", size = 5.5, nudge_y = 1, 
              fontface = 'bold', nudge_x = 0.06) +
    theme(axis.text.x = element_text(face = 'bold.italic', 
                                     colour = 'dodgerblue4', 
                                     size = 15),
          axis.text.y=element_blank(),
          axis.title.x = element_text(face = 'bold', 
                                      colour = 'dodgerblue4', 
                                      size = 16),
          axis.title.y = element_text(face = 'bold', 
                                      colour = 'dodgerblue4', 
                                      size = 16),
          plot.title = element_text(face = 'bold.italic', 
                                    colour = 'deepskyblue1', 
                                    size=18),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'dodgerblue4', 
                                       size=12)) +
    ggtitle("Story of Dubious borrowers", 
            subtitle = "by different borrower's category")
    
```

## Prosper rating system seems legit !!

Although we can't see any specific pattern among the dubious borrowers but we can see one thing very well. Overall borrowers with better ratings tends to be less dubious than borrowers with poor ratings. One more interesting thing is that the above bar chart does not include borrowers with _AA_ rating. It means that the best borrowers are really best and the prosper rating really works.

## Are Lenders greedy ?

I always wanted to do this. But from this dataset we can answer this question a lot of way. One of the way is to check if the lenders asked for money if the borrowers income was high. Let's see if the correlation is substantial.

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=4, fig.width=12}
base <- loanData %>%
    filter(StatedMonthlyIncome != 0.0, 
           MonthlyLoanPayment != 0.0, 
           ProsperRating..Alpha. != "") %>%
    ggplot(aes(y = StatedMonthlyIncome, 
               x = MonthlyLoanPayment)) +
    geom_jitter(color = 'black', 
                fill = '#F79420', 
                shape = 21,
                alpha = 1/8) +
    coord_cartesian(ylim = c(100, 30000), xlim = c(10, 1000)) +
    scale_y_continuous(trans = log10_trans(), 
                       breaks = c(100, 1000, 10000, 20000, 30000)) +
    scale_x_continuous(trans = log10_trans(), 
                       breaks = c(10, 50, 100, 200, 500, 700, 1000)) +
    theme_pander()
scatter1 <- base + geom_smooth(span = 0.3, method = 'lm') + 
    theme(plot.margin=unit(c(0, 1, 0, 0), "cm")) + 
    xlab('\nAmount paid/month for loan') +
    ggtitle("Monthly Income vs Monthly Loan (Linear fit)")
scatter2 <- base + geom_smooth(span = 0.3, 
                               method = 'lm', 
                               formula = y ~ splines::bs(x, 3)) + 
    theme(axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank()) +
    xlab('\nAmount paid/month for loan') +
    ggtitle("Monthly Income vs Monthly Loan (Cubic fit)")
grid.arrange(scatter1, scatter2, nrow = 1, ncol = 2)
```


Here I am trying to see the relationship of _MonthlyLoanPayment_ with _StatedMonthlyIncome._ We can see from the above statterplot that most of the _MonthlyLoanPayment_ is distributed from 10 to 1000 and the _StatedMonthlyIncome_ is distributed from 100 to 30000. Both the scales are transformed in log scale and we can clearly observe two things, namely,
  
    
    1. There is definite a strong positive correlation 
    between monthly income and monthly loan amount.
    
    2. Cubic relationship seems a better fit that simple linear model.

## Are we sure that they are greedy ?

Now we can see that there was definitely a strong correlation between the two variables but are we sure? Let's find the **Correlation Coefficient** to analyse it more.


```{r echo=FALSE, message=FALSE, warning=FALSE}
with(loanData, cor.test(MonthlyLoanPayment, StatedMonthlyIncome, 
                        method = "pearson"))
```


Well we can't really say that there is a strong correlation looking at the value of R which is almost 0.2. Usually it is said to be of high statistical importance if it is more than 0.3 or less than -0.3. But we can see that the value is still acceptable with somewhat positive correlation with the population Conficence Interval being more than 0. The strong t-statistics of 67.76 and small p-value shows that the statistical significance of alternative hypothesis is very strong. But this can also be somewhat practical significance too, at least to be included in a linear model.

## Does Recommendations matter ??

Before we move ahead let's review the _Recommendations_ variable and understand it's impact and distribution. 

### Distribution of Recommendation
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
    group_by(Recommendations) %>%
    summarise(n = n()) %>%
    ggplot(aes(x = Recommendations, y = n)) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_pander() +
    ylab("Number of borrowers") +
    xlab("Number of Recommendations") +
    scale_x_continuous(breaks = unique(loanData$Recommendations))
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
rec <- loanData %>%
    group_by(Recommendations) %>%
    summarise(n = n(), 
              AvgLoan = median(MonthlyLoanPayment)) %>%
    mutate(freq = AvgLoan / sum(AvgLoan) * 100)


A <- cut(rec$Recommendations, breaks = c(-1, 5, 10, 15, 20, 40), 
    labels= c("0 to 5", "6 to 10", "11 to 15", "16 to 20", "Above 20"))
rec$recrange <- A

recommend <- rec %>%
    group_by(recrange) %>%
    summarise(total = round(sum(freq), 2))
k <- recommend[order(recommend$total), ]

ggplot(aes(x = reorder(recrange, -total), 
           y = total, fill = -total), data = recommend) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_minimal() +
    guides(fill = F) +
    scale_fill_gradient(low = 'darkblue', high = 'blue') +
    geom_text(aes(label = sprintf("%2.1f%%", round(total, 2))), 
              color="dodgerblue4", size = 5.5, nudge_y = 1.5, 
              fontface = 'bold', nudge_x = 0.06) +
    ylab("% of borrowers") +
    xlab("Number of Recommendations") +
    theme(
          plot.title = element_text(face = 'bold.italic', 
                                    colour = 'darkblue', 
                                    size=18),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'blue', 
                                       size=12)) +
    ggtitle("Borrowers with different recommendations", 
            subtitle = "in terms of monthly loan payment")
```

As we can see that people with _Above 20_ recommendations has the highest number of _MonthlyLoanPayment_ percentage as compared to any other group. But this still can't assure us that people with higher recommendations pay more loan each month. This question was relevant because we want to find out IF PEOPLE WITH MORE RECOMMENDATION ARE POWERFUL AND RICH ENOUGH TO PAY MORE LOAN MONTHLY? The answer is we can't be really sure.


## Let's predict the Monthly Loan Payment

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginalAmount, y = MonthlyLoanPayment), 
       data = loanData) +
    geom_point(position = "jitter", alpha = 1/2) +
    theme_hc() +
    ggtitle("Correlation of MonthlyLoanPayment vs LoanOriginalAmount") +
    geom_smooth(method = "lm")
```


As we can the a very good linear realtionship here. Let's explore this even further. There are multiple number of models that we can create with different number of input or independent variables. Let's try to create a linear model and see how we can improve it.

```{r echo=FALSE, message=FALSE, warning=FALSE}
# Selecting columns that we require
model <- loanData %>%
    dplyr::select(MonthlyLoanPayment, 
                  LoanOriginalAmount,
                  DebtToIncomeRatio,
                  ProsperRating..numeric.,
                  Term, 
                  LoanOriginationQuarter) %>%
    filter(!is.na(MonthlyLoanPayment), !is.na(LoanOriginalAmount),
           MonthlyLoanPayment != 0.0, LoanOriginalAmount != 0.0, 
           MonthlyLoanPayment != "", LoanOriginalAmount != "", 
           !is.na(DebtToIncomeRatio), DebtToIncomeRatio != "",
           !is.na(ProsperRating..numeric.))

# Creating the first Linear Model

fit1 <- lm(MonthlyLoanPayment ~ LoanOriginalAmount, data = model)

# Predict from the model
model$predicted <- predict(fit1)

# Calculate the residuals for accuracy
model$residuals <- residuals(fit1)

# Visualize how accurate the model (fit1)

ggplot(data = model, aes(x = LoanOriginalAmount, 
                         y = MonthlyLoanPayment)) +
    geom_segment(aes(xend = LoanOriginalAmount , 
                     yend = predicted), alpha = 0.3) +
    theme_pander() +
    geom_point(shape = 21, fill = 'violet', size = 1, alpha = 0.4) +
    geom_point(aes(y = predicted), shape = 4) +
    ggtitle("Residuals plot for initial model (fit1)")
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(fit1)
```

_LoanOriginalAmount_ should be the most important factor to decide what would be the _MonthlyLoanPayment_. As we can see that the correlation has very good _Adjusted R-squared_ value of 0.845 or almost 85%. The _Residual Standard Error_ is however quite high of almost75 which indicates the average performance of the linear model. We can visualize how the residuals are spread from the above plot. Their distribution is quite high. So the bigger question is that can we improve this model. Improving the model would ensure the following things
    
    1. Higher Adjusted R-squared value
    2. Lower Residual Standard Error
    3. Less variation/ Less dispersed residual graph
    
The key to improve any model liner/non-linear is to increase the number of independent variables. The choice of independent variable should be very logical. We should not include the variable which are dependent on the predicting variable. The independent variables should be truly independent. Here are the following variables that are chosen to improve the model. 
    
    1. LoanOriginalAmount: This should be the most important one 
    because how much the borrower has to pay each month 
    should be dependent on the total amount that is loaned
    
    2. DebtToIncomeRatio: This is important because lenders may 
    ask more money each month who has a record of high DebtToIncomeRatio 
    because they are considered borrowers with poor prospect and 
    lenders might charge them more if giving loans
    
    3. ProsperRating..numeric.: This is obvious because the we have shown 
    already that lenders trust the prosper rating a lot over anything else 
    and hence borrowers with poor ratings should pay more each 
    month as compared to better ratings.
    
    4. Term: Term decides how long the borrower is opting for the loan 
    and hence longer the Term shorter should be monthly payment.
    
    5. LoanOriginationQuarter: Lenders may ask for different monthly loan payment 
    on different time of the year because of dynamic interest rate 
    and changing macro economic factors on different time of the year.

```{r echo=FALSE, message=FALSE, warning=FALSE}
# Selecting columns that we require
model <- loanData %>%
    dplyr::select(MonthlyLoanPayment, 
                  LoanOriginalAmount,
                  DebtToIncomeRatio,
                  ProsperRating..numeric.,
                  Term, 
                  LoanOriginationQuarter) %>%
    filter(!is.na(MonthlyLoanPayment), !is.na(LoanOriginalAmount),
           MonthlyLoanPayment != 0.0, LoanOriginalAmount != 0.0, 
           MonthlyLoanPayment != "", LoanOriginalAmount != "", 
           !is.na(DebtToIncomeRatio), DebtToIncomeRatio != "",
           !is.na(ProsperRating..numeric.))

# Creating the improved Linear Model

fit2 <- lm(MonthlyLoanPayment ~ LoanOriginalAmount + 
               DebtToIncomeRatio + 
               ProsperRating..numeric. +
               Term + LoanOriginationQuarter,data = model)

# Predict from the model
model$predicted <- predict(fit2)

# Calculate the residuals for accuracy
model$residuals <- residuals(fit2)

# Visualize how accurate the model (fit2)
# summary(fit2)
ggplot(data = model, aes(x = LoanOriginalAmount, 
                         y = MonthlyLoanPayment)) +
    geom_segment(aes(xend = LoanOriginalAmount , 
                     yend = predicted), alpha = 0.3) +
    theme_pander() +
    geom_point(shape = 21, fill = 'violet', size = 1, alpha = 0.4) +
    geom_point(aes(y = predicted), shape = 4) +
    ggtitle("Residuals plot of improved model (fit2)")
```

We can see a lot less variation in the residual. Moreover when we summarize the fit2 for our improved model, we see that the value of _Residual standard error_ decreased a lot to almost 49 and _Adjusted R-squared_ increased to almost 94%. This was a huge jump from our previous model. Hence we can say that fit2 predicts the monthly loan amount better than the fit1 model. This also proves that _LoanOriginalAmount_ is not only the deciding factor of _MonthlyLoanPayment_. There are other factors too that contribute to this. From the ```summary(fit2)```, we observe that the slope values like below

    1. LoanOriginalAmount: 3.124e-02
    2. DebtToIncomeRatio:  3.211e+00
    3. ProsperRating..numeric.: -1.050e+01
    4. Term: -4.993e+00 

We can see that as we might have assumed, _LoanOriginalAmount_ & _DebtToIncomeRatio_ has positive slope which is obvious because if these two increase, lenders ask for more money from the borrowers. On the other hand, _Term_ has a negative slope which is logical because if term increases, monthly loan payment should decrease.

We can also see that the both maximum and minimum residual for our improved model is lesser than our previous model.

## Some Final Thoughts

Let's select 3 plots from what we have discussed and elaborate them bit further.

### What people loans for ?

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=7, fig.width=10}
categories <- c("Debt\nConsolidation", "Home\nImprovement", "Business",
               "Personal\nLoan", "Student Use", "Auto",
               "Other", "Baby", "Boat",
               "Cosmetic\nProcedure", "Engagement\nRing", "Green\nLoans",
               "Household\nExpenses", "Large\nPurchases", "Medical",
               "Motorcycle", "RV", "Taxes", "Vaccation", "Wedding\nLoans")
mapBorrowerCategory <- function(categoryNumber) {
    ifelse(categoryNumber == 0, 'na', categories[categoryNumber])
}

loanData.reasons <- loanData %>%
    group_by(ListingCategory..numeric.) %>%
    summarise(n = n()) %>%
    filter(ListingCategory..numeric. != 0) %>%
    mutate(category = mapBorrowerCategory(ListingCategory..numeric.), 
           freq = n / sum(n) * 100) %>%
    arrange(n) %>%
    dplyr::select(-ListingCategory..numeric.)

loanData.reasons$category <- factor(loanData.reasons$category)
ggplot(aes(x = reorder(category, freq), y = freq, fill = freq), 
       data = loanData.reasons) +
    geom_bar(stat = 'identity', position="dodge") +
    theme_pander() +
    scale_fill_gradient(low = 'lightblue', high = 'darkblue') +
    coord_flip() +
    guides(fill=FALSE) +
    geom_text(aes(label = sprintf("%2.1f%%", round(freq, 2))), 
              color="black", size = 5, nudge_y = 3) +
    ylab('Percentage of borrowers') +
    xlab('Loan Categories') +
    theme(axis.text.y = element_text(face = 'bold.italic', 
                                     colour = 'darkblue')) +
    ggtitle("People Loan to repay Loans", 
            subtitle = "this is why people loan!!") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = 'darkblue', 
                                    size=25),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = '#87a0cc', 
                                       size=15)) +
    scale_y_continuous(breaks = seq(0, 100, 5))

```

It is really strange that people took loans to reimburse loans/debts and lenders also issued loans to these people. Isn't is a bad idea to take loans to pay for loans? Aren't we falling into a vicious cycle? We can also see that Home Improvement is above Medical which may mean well insured population or negligence toward health.

### Does lenders prefer borrowers with better ProsperScore ?

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=12}
borrower_apr <- ggplot(aes(x = ProsperScore, 
                           y = BorrowerAPR, 
                           color=ProsperScore), 
       data = subset(loanData, !is.na(loanData$ProsperScore) & 
                         !is.na(loanData$BorrowerAPR))) +
    geom_jitter(shape=2, position=position_jitter(0.2), col = "dodgerblue2",
                alpha = 1/12) +
    geom_boxplot(outlier.shape=NA, col = "dodgerblue2") +
    stat_summary(fun.y=median, geom="point", size=2, color="black") +
    theme_pander() +
    xlab('\nProsper Score for Risk Factor') +
    ylab('Annual Percentage Rate') +
    ggtitle("Borrower Annual Rate", 
            subtitle = "for each borrower group") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = 'dodgerblue2', 
                                    size=25),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'black', 
                                       size=15))

borrower_rate <- ggplot(aes(x = ProsperScore, 
                            y = BorrowerRate, 
                            color=ProsperScore), 
       data = subset(loanData, !is.na(loanData$ProsperScore) & 
                         !is.na(loanData$BorrowerRate))) +
    geom_jitter(shape=2, position=position_jitter(0.2), col = "chocolate3",
                alpha = 1/12) +
    geom_boxplot(outlier.shape=NA, col = "chocolate3") +
    stat_summary(fun.y=median, geom="point", size=2, color="black") +
    theme_pander() +
    xlab('\nProsper Score for Risk Factor') +
    ylab('Borrower Interest Rate') +
    ggtitle("Borrower Interest Rate", 
            subtitle = "for each borrower group") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = 'chocolate3', 
                                    size=25),
          plot.subtitle = element_text(face = 'bold', 
                                       colour = 'black', 
                                       size=15))
grid.arrange(borrower_apr, borrower_rate, 
             nrow = 1, ncol = 2)
```

This depicts how we interest rates are affected by the Prosper Score for risk factor. As the score improves, the interest rate shows a declining treand visible from the black points. This indeed proves that the lenders like to charge less from borrowers with better prosper score.

### Correlation of MonthlyIncome vs MonthlyLoanPayment

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=7, fig.width=14}
loanData$Term <- factor(loanData$Term)
ggplot(aes(x = LoanOriginalAmount, y = MonthlyLoanPayment, fill = Term), 
       data = loanData) +
    geom_point(shape = 21, col = "black", size = 3, 
               position = "jitter", alpha = 1/2) +
    scale_fill_manual(values = c("#900C3F", "#FF5733", "#FFC300")) +
    scale_x_continuous(limits = c(300, 25000), breaks = seq(300, 25000, 1500)) +
    scale_y_continuous(limits = c(100, 1500), breaks = seq(100, 1500, 100)) +
    theme_hc() +
    geom_smooth(se = F, aes(color = Term)) +
    ggtitle("How MonthlyLoanPayment & LoanOriginalAmount Related?",
            subtitle = "Precursor to our regression model") +
    theme(plot.title = element_text(face = 'bold.italic', 
                                    colour = '#900C3F', 
                                    size=25),
          plot.subtitle = element_text(face = 'bold', 
                                       size = 14, 
                                       colour = "#FFC300")) +
    xlab("Actual Loan amount in USD") +
    ylab("Monthly Loan payment in USD")
```

This created the base for the Linear regression model that was already discussed. This shows that _LoanOriginalAmount_ and _MonthlyLoanPayment_ has very good linear relationship for each _Term_. We have improved the model and by adding mutiple input variables and achieved a very good _Adjusted R squared_ value of almost 94% and a very low _Standard Residual Error_.

## Reflection

**Struggle** with this data set were many. I was at first intimidated by it seer size of number of rows. I honestly have never seen any CSV file which is this big and contains this many variables. However the real relief was the CSV file that contained the meaning of all the variables in the data set. But the biggest struggle that I faced in this data set was actually common in most of the data sets that contains this many variables. Some of the variables seems to have the same meaning for example _BorrowerAPR_, _BorrowerRate_, _LenderYield_, _EstimatedEffectiveYield_. These terms seemed to have similar meaning but not exactly the same. Now the was which one to include in our analysis and which is more important than other? Although with little bit more understanding I solved it but it took some time. Another struggle was to establish right combination of variables to create accurate linear regression model for minimum residual. This problem could have been solved using _Gradient Decend_ but because it being out of scope I had to try different combination manually. This also took some time.

**Successes** were achieved even after the above mentioned struggle because of some detailed analysis of the data set and the CSV file supplied that explains the variables in the data set. Linear Regression model's accuracy was checked by carefully taken the variables and then checking the _Adjusted R-squared_ values and the _Residual standard error_ and plotting the Residual standard error. By doing this I could actually improve the regression model substantially already discussed earlier.

Although some **ideas for future exploration** can improve this analysis substantially. I really think regression model can be improved further using _Gradient Decend_ to better approximate the slope and the intercept of the line. I also believe that is some information hidden inside the delinquency variables and late payment variables in _CurrentDelinquencies, AmountDelinquent, DelinquenciesLast7Years_ and _OnTimeProsperPayments, ProsperPaymentsLessThanOneMonthLate, ProsperPaymentsOneMonthPlusLate_ with respect the Monthly loan amount which can be explored further. 

However one thing is confirmed, Financial Data set is really intensive stuff.