-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathEDA_Loan.Rmd
1104 lines (913 loc) · 55.7 KB
/
EDA_Loan.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Prosper Loan Data Analysis"
author: "Mayukh Sarkar"
output:
pdf_document:
highlight: tango
html_notebook: default
prettydoc::html_pretty:
highlight: vignette
keep_md: yes
theme: cayman
word_document: default
header-includes: \usepackage{booktabs}
---
```{r echo=FALSE, message=FALSE, warning=FALSE}
suppressMessages(library(devtools))
suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
suppressMessages(library(dplyr))
suppressMessages(library(memisc))
suppressMessages(library(gridExtra))
suppressMessages(library(RColorBrewer))
suppressMessages(library(magrittr))
suppressMessages(library(Kmisc))
suppressMessages(library(xtable))
suppressMessages(library(knitr))
suppressMessages(library(DT))
suppressMessages(library(scales))
suppressMessages(library(plotrix))
suppressMessages(library(corrplot))
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData <- read.csv('prosperLoanData.csv')
```
## Introduction
This dataset is financial dataset and this is related to the loan, borrowers, lenders, interest rates and stuffs like that. Prosper or [Prosper Marketplace Inc.](https://www.prosper.com/) is a San Francisco, California based company specializing in loans at low interest rates to the borrowers. In this dataset, we are using the data from the Posper to analyse it and trying to find the pattern in the Prosper data. This may be tedious because of the sheer size of the dataset and the complicated nature of all the financial datasets. We are using R, an advanced high level programming language of the analysis with some of its most popular graphic package ```ggplot```. The codes are not visible in the HTML/PDF export for the simlicity but the codes can be reviewed from the RMD file.
The dataset is comprised of 81 variables and contains 113937 entries. The variable that are explored in the dataset are the following
Term : Amount of month customers opted for loan
**LoanStatus** : Current status of the loan like chargedoff, completed, defauted etc...
**EstimatedEffectiveYield** : Yield of lenders from borrowers minus the processing fee and late fines
**ProsperScore** : Risk Factor score from 1 to 10. 10 being least risky
**BorrowerAPR** : The Borrower's Annual Percentage Rate (APR) for the loan.
**BorrowerRate** : The Borrower's interest rate for this loan.
**ListingCategory..numeric.** : Prosper rating for borrowers in numbers
**EmploymentStatus** : Current type of employment
**Occupation** : Occupation of borrower at the time of listing
**EmploymentStatusDuration** : How long the employee has been employed
**IsBorrowerHomeowner** : Does the borrower owns house at the time of listing (True & False)
**ProsperRating..Alpha.** : Prosper rating for borrowers in alphabets
**IncomeVerifiable** : If the income of the borrower is verifiable at the time of listing (True & False)
**StatedMonthlyIncome** : Monthly income of the borrower
**MonthlyLoanPayment** : Monthly loan payment amount
**Recommendations** : Recommendations the borrowers has at the time of listing
**DebtToIncomeRatio** : The debt to income ratio of the borrower at the time the credit profile was pulled.
**LoanOriginalAmount** : Original amount of the loan
**LoanOriginationQuarter** : Quarter of the month when loan was originated
A basic exploration of the datset would give the following information
```{r echo=FALSE, message=FALSE, warning=FALSE}
str(loanData)
```
## The first Qustion
The first question that needs to be asked is HOW LONG PEOPLE USUALLY OPT FOR LOAN? Let's answer this question with a histogram
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
ggplot(aes(x = Term / 12)) +
geom_histogram(binwidth = 1) +
theme_hc() +
xlab('Term in Years') +
ylab('How many?') +
scale_x_continuous(breaks = seq(1, 5, 2)) +
ggtitle("Distribution Loan Terms") +
theme(plot.title = element_text(face = 'bold.italic',
colour = "black", size=18))
```
We can see that people don't really loan any amount for less than one year and the most popular loan amount is of 3 years although some people do choose for 5 years. Now lets assume something and check if it is correct or not. I assume that people who opted for 1 year would fail to repay their loan more as compared to the people who opted for 3 or 5 years. But is it true? So the next question is
DOES PEOPLE REPAY THE LOANS BETTER WHEN THEY ARE GIVEN MORE TIME?
```{r echo=FALSE, message=FALSE, warning=FALSE}
totalLoanTermes <- loanData %>%
group_by(Term) %>%
summarise(n = n()) %>%
use_series(n)
# totalLoanTermes <- rep(totalLoanTermes, each = 2)
loanData.two_status <- loanData %>%
group_by(Term, LoanStatus) %>%
summarise(n = n())
levels(loanData.two_status$LoanStatus) <-
c(levels(loanData.two_status$LoanStatus), "Other")
loanData.two_status$LoanStatus[!(loanData.two_status$LoanStatus %in%
c("Completed", "Defaulted"))] <- "Other"
loanData.two_status <- loanData.two_status %>%
group_by(Term, LoanStatus) %>%
summarise(p = n(), total = sum(n)) %>%
mutate(freq = round(total / sum(total) * 100, 2))
```
## Plotting the trend of different customer types
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = Term / 12, y = freq, fill = LoanStatus),
data = loanData.two_status) +
geom_bar(stat = 'identity', position="stack", color = 'black') +
scale_x_continuous(breaks = c(1, 3, 5)) +
xlab('Loan term in years') +
theme_pander() +
ggtitle("LoanStatus: Completed vs Defaulted",
subtitle = "for each loan Term") +
theme(plot.title = element_text(face = 'bold.italic',
colour = '#F35E3A', size=18),
plot.subtitle = element_text(face = 'bold',
colour = '#17b4ba', size=11)) +
ylab("% of Borrowers")
```
That's unusual because for **LoanStatus** of _Completed_, we see a trend but not in **LoanStatus** of _Defaulted_ though. Now just because some customer's loan status is not Completed, doesn't mean his/her loan status is Defaulted.
We can see that there is a trend in the of **LoanStatus** of _Completed_ and _Other_ but we see any trend in _Defaulted_ though. Now just because some customer's loan status is not Completed, doesn't mean his/her loan status is Defaulted. So let's explore the Defaulted section more.
The reason we are exploring this is because we want to find if BANKS SHOULD FOCUS ON CUSTOMERS OPTING LOANS FOR SMALLER TERMS OR NOT? AND HOW SOME OF THE MOST COMMON BORROWER LOANSTATUS VARIES FOR DIFFERENT LOAN TERMS? For this getting data of only two loan status won't be enough. So let's divide the customers into two groups
1. Good Customer
2. Bad Customer
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData.gb <- loanData %>%
group_by(Term, LoanStatus) %>%
summarise(n = n()) %>%
mutate(customer_type = ifelse((LoanStatus == 'Current' |
LoanStatus == 'Completed'|
LoanStatus == 'FinalPaymentInProgress'),
'good', 'bad')) %>%
filter(LoanStatus != 'Cancelled') %>%
mutate(freq = n / sum(n) * 100) %>%
ungroup() %>%
group_by(Term, customer_type) %>%
summarise(n = sum(freq))
```
But before that lets see the distribution of LonaStatus variable
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
group_by(LoanStatus) %>%
summarise(n = n()) %>%
ggplot(aes(x = LoanStatus, y = n)) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
ylab("Number of borrowers") +
xlab("Different Loan Status") +
coord_flip()
```
## The bigger Picture
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = Term / 12, y = n, fill = customer_type), data = loanData.gb) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
xlab('Loan term in years') +
ylab('Percentage') +
scale_x_continuous(breaks = c(1, 3, 5)) +
geom_text(aes(label = sprintf("%2.1f%%", round(n, 2)), vjust = -.3),
color="black") +
facet_wrap(~customer_type) +
scale_fill_hc() +
ggtitle("Good vs Bad",
subtitle = "Number of good borrowes vs bad borrowers
in each loan Term") +
theme(plot.title = element_text(face = 'bold.italic',
colour = '#333338',
size=18),
plot.subtitle = element_text(face = 'bold',
colour = '#6aa5e7',
size=11))
```
As we can that the short and long term prospects of the banks who issues the loan seems good because both in 1 and 5 year category, we see more number of good customers as compared to the 3 years. The number of good customers is however decreased a little from 1 year to 5 years but as compared to 3 years it is nothing. This may mean that banks should focus in long and short term customers as compared to medium term customers. But we should take this fact with a grain of slat because there might be other factors that may affect this trend. Only further exploration of the data would reveal that.
But before we move ahead, lets analyse the distribution of some of the features that might be useful for us in this data set. We would also remove some the outliers to better understand the features and may dive into some descriptive statistics too.
## Distribution for some continuous/categorical features
```{r echo=FALSE, message=FALSE, warning=FALSE}
prosperScoreCountPlot <- loanData %>%
filter(!is.na(ProsperScore)) %>%
group_by(ProsperScore) %>%
summarise(n = n()) %>%
ggplot(aes(x = ProsperScore, y = n)) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
ylab("No. of borrowers") +
scale_y_continuous(breaks = seq(0, 15000, 2500)) +
scale_x_continuous(breaks = seq(1, 11, 1))
EmploymentStatusDurationHist <- ggplot(aes(x = EmploymentStatusDuration),
data = loanData) +
geom_histogram(binwidth = 10, color = 'white') +
scale_x_continuous(breaks = seq(0, 400, 100), limits = c(0, 400)) +
theme_pander() +
xlab("Borrower Experience")
prosperRatingCount <- loanData %>%
group_by(ProsperRating..Alpha.) %>%
summarise(n = n()) %>%
ggplot(aes(x = ProsperRating..Alpha., y = n)) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
ylab("\nNo. of borrowers") +
xlab("Borrower Rating")
StatedMonthlyIncomeHist <- ggplot(aes(x = StatedMonthlyIncome),
data = loanData) +
geom_histogram(binwidth = 1000, color = 'white') +
scale_x_continuous(limits = c(0, 20000)) +
theme_pander()
MonthlyLoanPaymentHist <- ggplot(aes(x = MonthlyLoanPayment),
data = loanData) +
geom_histogram(binwidth = 50, color = 'white') +
scale_x_continuous(breaks = seq(0, 1000, 300), limits = c(0, 1000)) +
theme_pander()
LoanOriginalAmountHist <- ggplot(aes(x = LoanOriginalAmount),
data = loanData) +
geom_histogram(binwidth = 1000, color = 'white') +
scale_x_continuous(breaks = seq(0, 25000, 7000), limits = c(0, 25000)) +
theme_pander()
BorrowerAPRHist <- ggplot(aes(x = BorrowerAPR),
data = loanData) +
geom_histogram(binwidth = 0.01, color = 'white') +
scale_x_continuous(breaks = seq(0.07, 0.3, 0.05), limits = c(0.07, 0.3)) +
theme_pander()
BorrowerRateHist <- ggplot(aes(x = BorrowerRate),
data = loanData) +
geom_histogram(binwidth = 0.01, color = 'white') +
scale_x_continuous(breaks = seq(0.07, 0.3, 0.05),
limits = c(0.07, 0.3)) +
theme_pander()
DebtToIncomeRatioHist <- ggplot(aes(x = DebtToIncomeRatio),
data = loanData) +
geom_histogram(binwidth = 0.05, color = 'white') +
scale_x_continuous(limits = c(0.0, 0.7),
breaks = seq(0.0, 0.6, 0.1)) +
theme_pander()
grid.arrange(prosperScoreCountPlot,
EmploymentStatusDurationHist,
prosperRatingCount,
StatedMonthlyIncomeHist,
MonthlyLoanPaymentHist,
LoanOriginalAmountHist,
BorrowerAPRHist,
BorrowerRateHist,
DebtToIncomeRatioHist,
ncol = 3, nrow = 3,
top = "Distribution of feature variables\n")
```
## EstimatedEffectiveYield - A better measure for a successful Lender
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = EstimatedEffectiveYield), data = loanData) +
geom_histogram(aes(y = ..density..), binwidth = 0.01, na.rm = T,
color = 'darkblue', fill = 'lightblue') +
theme_hc() +
scale_x_continuous(limits = c(-0.1, 0.5),
breaks = seq(-0.1, 0.5, 0.05)) +
geom_density(alpha=.2, fill="#FF6666", na.rm = T) +
geom_vline(aes(xintercept=mean(EstimatedEffectiveYield, na.rm=T)),
color="blue", linetype="dashed", size=1) +
ggtitle("Distribution of EstimatedEffectiveYield",
subtitle = "with the mean axis") +
theme(plot.title = element_text(face = 'bold.italic',
colour = '#FF6666',
size=18),
plot.subtitle = element_text(face = 'bold',
colour = 'darkblue',
size=13))
```
**EstimatedEffectiveYield** is said to be better estimate for the lenders than the interest rate because the interest includes _processing fees_, _uncollected interest due to borrower being chargedoff_. Plus it also doesn't include _late fines_. Hence EstimatedEffectiveYield takes account for all these things and it is thus a better measure. Above we are trying to see the distribution of the EstimatedEffectiveYield and we can see that it is multimodal. We see the most popular EstimatedEffectiveYield is around 0.3 while the mean is around 0.17 represented by the blue dotted line. The multimodal pattern shows that there are multiple EstimatedEffectiveYield that is popular. Strangely we can also see that the some customers have negative EstimatedEffectiveYield. This may mean a lot of things. This may mean that their BorrowerRate is a lot lower than their _service fee rate_ or these customer's _uncollected interest on chargeoff_ is lot more or they just never payed the late fee and payed back the loans along with the interest always on time.
## Does Lenders prefer borrowers with better Prosper Score ?
Now lets see what is the distribution of EstimatedEffectiveYield depending on the different **ProsperScore**. This is important because we want to answer a question, i.e., IF LENDERS GET MORE EstimatedEffectiveYield IF THEY HAVE BETTER ProsperScore ?
We are using violin plot instead of box plot for this.
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData$ProsperScore <- factor(loanData$ProsperScore)
ggplot(aes(x = ProsperScore, y = EstimatedEffectiveYield, fill=ProsperScore),
data = subset(loanData, !is.na(loanData$ProsperScore) &
!is.na(loanData$EstimatedEffectiveYield))) +
geom_violin(trim = F, scale = "width") +
stat_summary(fun.y=median, geom="point", size=2, color="black") +
scale_fill_manual(values=colorRampPalette(c("pink", "lightgreen"))(11)) +
theme_minimal() +
ylab('Effective yeild of Lenders') +
ggtitle("Effective Yield for each Risk Factors",
subtitle = "story of lenders preference") +
theme(plot.title = element_text(face = 'bold.italic',
colour = '#f3b0c0',
size=22),
plot.subtitle = element_text(face = 'bold',
colour = 'darkgreen',
size=14)) +
guides(fill = F)
```
Well this is interesting. We see an wonderful trend here. Here more score for the risk factor means better the borrower and lesser score for risk factor means poor prospects from the borrowers. We can see that for lower ProsperScore distribution of effective yield in a lot more than the higher ProsperScore. This may mean that lenders charges a variety of interest rate from the borrower with poor prospects as compared to borrowers with better prospect. We can also notice how median (represented by the black dot) is decreasing as ProsperScore is increasing. This may mean that lenders give more relaxations to borrowers with better ratings as compared to borrowers with poor rating. Does that mean lenders trust and like borrowers with better ProsperScore! Let's do a little more analysis to reveal it more. The reason we need more exploration on this is because EstimatedEffectiveYield includes more things such as late fine and doesn't include processing fee and others. So more EstimatedEffectiveYield for lesser ProsperScore borrowers may be due to high late fines because lesser ProsperScore borrowers are more prone to fail to repay their loan on time each month. So, Let's see if borrower's interest rate shows the same trend for each ProsperScore categories or not because interest rates doesn't include late fines.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=5, fig.width=14}
borrower_apr <- ggplot(aes(x = ProsperScore,
y = BorrowerAPR),
data = subset(loanData, !is.na(loanData$ProsperScore) &
!is.na(loanData$BorrowerAPR))) +
geom_boxplot() +
theme_pander()
borrower_rate <- ggplot(aes(x = ProsperScore,
y = BorrowerRate),
data = subset(loanData, !is.na(loanData$ProsperScore) &
!is.na(loanData$BorrowerRate))) +
geom_boxplot() +
theme_pander()
grid.arrange(borrower_apr, borrower_rate,
nrow = 1, ncol = 2,
top = "Interest Rate Distributions")
```
Finally things are revealed much better now! We can clearly observe that for both _BorrowerAPR_ and _BorrowerRate_ which are metric for interest rates, we see a declining trend as the _ProsperScore_ is increasing. This justifies the fact even more that lenders somehow prefers to charge less for all the borrowers with better ProsperScore as compared to borrowers with inferior ProsperScore. Here food for thought is that is it moral too? Well I guess no amount of EDA can reveal it.
But are we missing the real question here? We are performing all these analysis on the Prosper loan data to answer several questions but what is the most important thing for a loan ? It's **BORROWERS** and what is the most important question related to borrowers?
## Distribution of Listing Category
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=7, fig.width=10}
loanData %>%
group_by(ListingCategory..numeric.) %>%
summarise(n = n()) %>%
ggplot(aes(x = ListingCategory..numeric., y = n)) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
scale_x_continuous(breaks = seq(0, 20, 1)) +
scale_y_continuous(breaks = seq(0, 60000, 5000)) +
xlab("Listing Category") +
ylab("Number of borrowers") +
ggtitle("Distribution of ListingCategory")
```
The other section indicated by number 7 with more than 10000 of the borrowers would remain unknown. Why the loan reason would be unknown? Does it indicate any illegal reason to opt for loan! Let's explore this more. Let's see the Occupation and Employment status of Borrowers for all the loans falling into the _others_ section.
## Who are others??
```{r echo=FALSE, message=FALSE, warning=FALSE}
other.borrowers <- loanData %>%
filter(ListingCategory..numeric. == 7) %>%
group_by(EmploymentStatus, Occupation) %>%
summarise(n = n()) %>%
filter(!is.na(Occupation)) %>%
ungroup() %>%
arrange(EmploymentStatus, desc(n)) %>%
ungroup() %>%
group_by(EmploymentStatus) %>%
top_n(n = 5, wt = n) %>%
ungroup()
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
######### This is for knitr in HTML ############
# datatable(other.borrowers,
# colnames = c('Employment Status' = 'EmploymentStatus',
# 'Occupation' = 'Occupation',
# 'Total' = 'n'),
# class = 'stripe',
# rownames = FALSE,
# filter = 'top',
# options = list(
# pageLength = 5,
# autoWidth = TRUE,
# columnDefs = list(list(targets = c(1, 2), searchable = FALSE))
# ))
```
```{r results='asis', echo=FALSE}
######## This is for for knitr in PDF #########
employed <- other.borrowers %>%
filter(EmploymentStatus == 'Employed') %>%
dplyr::select(-EmploymentStatus)
full.time <- other.borrowers %>%
filter(EmploymentStatus == 'Full-time') %>%
dplyr::select(-EmploymentStatus)
part.time <- other.borrowers %>%
filter(EmploymentStatus == 'Part-time') %>%
dplyr::select(-EmploymentStatus)
self.employed <- other.borrowers %>%
filter(EmploymentStatus == 'Self-employed') %>%
dplyr::select(-EmploymentStatus)
others <- other.borrowers %>%
filter(EmploymentStatus == 'Retired' | EmploymentStatus == 'Other')
t1 <- kable(employed, format = "latex", booktabs = TRUE)
t2 <- kable(full.time, format = "latex", booktabs = TRUE)
t3 <- kable(part.time, format = "latex", booktabs = TRUE)
t4 <- kable(self.employed, format = "latex", booktabs = TRUE)
t5 <- kable(others, format = "latex", booktabs = TRUE)
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{ Employed }
\\centering",
t1,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{ Full-time }",
t2,
"\\end{minipage}
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{ Part-time }",
t3,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{ Self-employed }",
t4,
"\\end{minipage}
\\centering
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{ Others }",
t5,
"\\end{minipage}
\\end{table}"
))
```
We can see some really strange things happening here. In fact there are so many strange things that I have to list them down
**Firstly** what is the difference between _Employed_ and _Full-time_?
**Secondly** there are too many others. The complete sample of the data is taken for those borrowers whose purpose for borrowing is _Others_. But there are _Others_ for _Occupation_ and _EmploymentStatus_ too.
**Thirdly** how can someone be _Retired_ but still has _Occupation Others_ ?
**Lastly** there are some dubious borrowers in the _Others_ table. We can see that there are 352 cases where the borrowing reason, EmploymentStatus, Occupation all are _Others_ and 95 cases where _Occupation_ is left blank.
These above points, specially the last one says not all the burrower's information is revealed or in some cases they seemed to be fake too. I guess it is a partial dead end for us to see why people mention _Other_ as a reason while opting for loan.
Let's now move into something more computational. Let's first see after working for how many years, people like to opt for a loan? We can then move into some deeper level of analysis.
## Does experienced people opt for loan lesser ?
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=4, fig.width=11}
borrower.experience <- loanData %>%
filter(!is.na(EmploymentStatusDuration))
hist1 <- ggplot(aes(x = EmploymentStatusDuration / 12),
data = borrower.experience) +
geom_histogram(binwidth = 1,
color = 'red',
fill = 'deeppink',
alpha = 1/2) +
theme_pander() +
scale_x_continuous(breaks =
seq(min(borrower.experience$EmploymentStatusDuration),
max(borrower.experience$EmploymentStatusDuration),
2)) +
coord_cartesian(xlim = c(0, 41)) +
xlab('\nTotal experience in years') +
ggtitle("Loans for Experienced people",
subtitle = "actual distribution") +
theme(plot.title = element_text(face = 'bold.italic',
colour = 'deeppink',
size=20),
plot.subtitle = element_text(face = 'bold',
colour = 'black',
size=14))
duration <- loanData %>%
filter(!is.na(EmploymentStatusDuration)) %>%
mutate(EmploymentStatusDuration = round(EmploymentStatusDuration / 12, 0)) %>%
group_by(EmploymentStatusDuration) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n) * 100) %>%
top_n(n = 5, wt = freq)
hist2<-
ggplot(aes(x = reorder(EmploymentStatusDuration, -freq), y = freq, fill = freq),
data = duration) +
geom_bar(stat = 'identity', position="dodge",
color = 'white') +
theme_pander() +
scale_fill_continuous_tableau(palette = "Blue") +
guides(fill = F) +
ggtitle("Loans for Experienced people",
subtitle = "top 5 experience levels") +
theme(plot.title = element_text(face = 'bold.italic',
colour = 'darkslategrey',
size=20),
plot.subtitle = element_text(face = 'bold',
colour = 'black',
size=14)) +
xlab("Top 5 experience levels in years") +
ylab("% of borrowers")
grid.arrange(hist1, hist2, ncol = 2, nrow = 1)
```
## I was right ..:)
Well this was really interesting. We can see that our assumption was overall correct because the as people gain experience in their jobs, lesser they opt for the loans. This may be due the fact that as people gain experience their salary also increases and hence the lesser reason they find to opt for loans or the reason can be something different. We can also see in the right histogram of 95% quarterly, most people opt for loans when they have almost 2 years of experience.
Lets explore it even more let's see the correlation between number of month a person has experience and the EstimatedEffectiveYield variable that we have explored earlier. The question that we want to explore -
DOES LENDERS ASK FOR LESS INTEREST FORM THE BORROWERS WHO ARE MORE EXPERIENCED? This can be true because people with more job experience should have more potential to replay their loan better because they have higher paying jobs and hence their ProsperScore would be higher. And as we have seen that borrowers with better prosper score pay lesser to the lenders and lenders somehow prefer them.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(y = EmploymentStatusDuration / 12,
x = EstimatedEffectiveYield,
color = ProsperScore),
data = subset(loanData, !is.na(loanData$ProsperScore))) +
geom_point(na.rm = T, position = 'jitter',
alpha = 1 / 6, shape = 1) +
scale_color_brewer(type = 'div',
palette = 'Spectral',
direction = 1,
guide = guide_legend(title = 'ProsperScore',
reverse = T,
override.aes = list(alpha = 1, size = 2))) +
coord_cartesian(xlim = c(0.0, 0.3)) +
xlab('Actual yield interest of lenders') +
ylab('Boroower\'s Experience in Years') +
theme_pander() +
geom_smooth(se = F) +
ggtitle("Correlation of Borrower's Experience",
subtitle = " with Actual Yield of Lenders")
```
As it seems from the scatter plot that the pattern seems to have no correlation. It means our assumption was not correct. Borrowers with better **EmploymentStatusDuration** don't seem to get any special relaxation from lenders in terms of interest each month. This can be further confirmed by checking the _Perarson's correlation Coefficient_. We see that the correlation coefficient is almost 0 (-0.0233).
```{r echo=FALSE, message=FALSE, warning=FALSE}
with(subset(loanData, !is.null(EmploymentStatusDuration) &
!is.null(EstimatedEffectiveYield)),
cor.test(EmploymentStatusDuration / 12, EstimatedEffectiveYield))
```
This also says that even though the true correlation is not true and alternative hypothesis is accepted, their is some **serious statistical evidence of significance**. But if we look into the CI, it is within the range of -0.03 to -0.016 which is very small. Good R value is said a value < -0.3 or value > 0.3. This value is definitely not that large. Judging from the context latest it is not. So we can say that there is **no practical significance.** Hence we can not tell with any confirmation that More Experienced Lenders end up paying Less/More interest to the Lenders.
## Other Correlations
```{r echo=FALSE, message=FALSE, warning=FALSE}
d <- data.frame(Term=loanData$Term,
BorrowerAPR=loanData$BorrowerAPR,
BorrowerRate=loanData$BorrowerRate,
LenderYield=loanData$LenderYield,
EffectiveYield = loanData$EstimatedEffectiveYield,
Loss = loanData$EstimatedLoss,
Duration = loanData$EmploymentStatusDuration,
income = loanData$StatedMonthlyIncome,
payment = loanData$MonthlyLoanPayment)
d[is.na(d)] <- 0
d <- round(cor(d[sapply(d, is.numeric)]), 2)
corrplot(d, method = "circle")
```
## How people loan for their Homes ?
Here we are going to explore the people for two category.
1. First those who are opting for loan to rennovation of home when they have a house.
2. Second those who opt for home loans even though they don't have house.
```{r echo=FALSE, message=FALSE, warning=FALSE}
home.loans <- loanData %>%
filter(ListingCategory..numeric. == 2,
ProsperRating..Alpha. != "",
IsBorrowerHomeowner != "") %>%
group_by(IsBorrowerHomeowner, ProsperRating..Alpha.) %>%
summarise(n = n()) %>%
mutate(freq = round(n / sum(n) * 100, 2))
levels(home.loans$IsBorrowerHomeowner) <- c('No Home', 'Has Home')
home.loans$ProsperRating..Alpha. <-
factor(home.loans$ProsperRating..Alpha., ordered = T,
levels = c('AA', 'A', 'B', 'C', 'D', 'E', 'HR'))
home.loans <- arrange(home.loans, ProsperRating..Alpha.)
```
**Now let's plot the data**
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = ProsperRating..Alpha., y = freq,
fill = IsBorrowerHomeowner), data = home.loans) +
geom_bar(stat = 'identity', position="dodge",
color = 'white') +
theme_hc() +
scale_fill_fivethirtyeight() +
xlab('Prosper Ratings for borrowers') +
ylab('% of borrowers') +
ggtitle("Home Improvement for All",
subtitle = "even if you dont have home") +
theme(plot.title = element_text(face = 'bold.italic',
colour = 'red',
size=20),
plot.subtitle = element_text(face = 'bold',
colour = 'dodgerblue3',
size=14))
```
## Are Investors partial to Borrowers with better rating ??
Here we visualizing an interesting trend. We are seeing for people with ProsperRating of _AA, A & B_, number of people opted for loan mentioning the reason for loan as **Home Improvement** but still has no home is more than the people who has home and opted for loan for Home Improvement and truly a home owner. But for people with poor rating, the trend in opposite and expected also. People who don't have any house should not get any loan mentioning the loan reason of House Improvement. These whole thing shows not only lenders give much preference to Rating over Verification and KYC (Know Your Customer) but this also shows irrespective of rating of borrowers, lenders font care much about loan reason as a whole. The following code proves it even more. We can see that there are not a single borrowers who mentioned their loan purpose to be Home Improvement when they didn't have their own house and their loan was not approved. That's a strong evidence to show that investors don't care much about how much borrowers fudge or fake their loan purpose.
```{r message=FALSE, warning=FALSE}
a <- loanData %>%
filter(ListingCategory..numeric. == 2,
LoanStatus == 'Cancelled')
print.numeric_version(dim(a)[0])
```
## What's up with the people having dubious income sources ?
```{r echo=FALSE, message=FALSE, warning=FALSE}
dubious.borrowers <- loanData %>%
filter(ListingCategory..numeric. == 7,
Occupation == 'Other',
EmploymentStatus == 'Other',
IncomeVerifiable == 'False') %>%
group_by(ProsperRating..Alpha.) %>%
summarise(n = n()) %>%
mutate(freq = round(n / sum(n) * 100, 2))
dubious.borrowers$ProsperRating..Alpha. <-
ordered(dubious.borrowers$ProsperRating..Alpha.,
levels = c('AA', 'A', 'B', 'C', 'D', 'E', 'HR'))
dubious.borrowers <- arrange(dubious.borrowers, ProsperRating..Alpha.)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = ProsperRating..Alpha., y = freq, fill = freq),
data = dubious.borrowers) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
scale_fill_continuous(low = 'deepskyblue1',
high = 'dodgerblue4') +
guides(fill=FALSE) +
xlab('\nProsper rating for borrowers') +
ylab('% of borrowers') +
geom_text(aes(label = sprintf("%2.1f%%", round(freq, 2))),
color="dodgerblue4", size = 5.5, nudge_y = 1,
fontface = 'bold', nudge_x = 0.06) +
theme(axis.text.x = element_text(face = 'bold.italic',
colour = 'dodgerblue4',
size = 15),
axis.text.y=element_blank(),
axis.title.x = element_text(face = 'bold',
colour = 'dodgerblue4',
size = 16),
axis.title.y = element_text(face = 'bold',
colour = 'dodgerblue4',
size = 16),
plot.title = element_text(face = 'bold.italic',
colour = 'deepskyblue1',
size=18),
plot.subtitle = element_text(face = 'bold',
colour = 'dodgerblue4',
size=12)) +
ggtitle("Story of Dubious borrowers",
subtitle = "by different borrower's category")
```
## Prosper rating system seems legit !!
Although we can't see any specific pattern among the dubious borrowers but we can see one thing very well. Overall borrowers with better ratings tends to be less dubious than borrowers with poor ratings. One more interesting thing is that the above bar chart does not include borrowers with _AA_ rating. It means that the best borrowers are really best and the prosper rating really works.
## Are Lenders greedy ?
I always wanted to do this. But from this dataset we can answer this question a lot of way. One of the way is to check if the lenders asked for money if the borrowers income was high. Let's see if the correlation is substantial.
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=4, fig.width=12}
base <- loanData %>%
filter(StatedMonthlyIncome != 0.0,
MonthlyLoanPayment != 0.0,
ProsperRating..Alpha. != "") %>%
ggplot(aes(y = StatedMonthlyIncome,
x = MonthlyLoanPayment)) +
geom_jitter(color = 'black',
fill = '#F79420',
shape = 21,
alpha = 1/8) +
coord_cartesian(ylim = c(100, 30000), xlim = c(10, 1000)) +
scale_y_continuous(trans = log10_trans(),
breaks = c(100, 1000, 10000, 20000, 30000)) +
scale_x_continuous(trans = log10_trans(),
breaks = c(10, 50, 100, 200, 500, 700, 1000)) +
theme_pander()
scatter1 <- base + geom_smooth(span = 0.3, method = 'lm') +
theme(plot.margin=unit(c(0, 1, 0, 0), "cm")) +
xlab('\nAmount paid/month for loan') +
ggtitle("Monthly Income vs Monthly Loan (Linear fit)")
scatter2 <- base + geom_smooth(span = 0.3,
method = 'lm',
formula = y ~ splines::bs(x, 3)) +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
xlab('\nAmount paid/month for loan') +
ggtitle("Monthly Income vs Monthly Loan (Cubic fit)")
grid.arrange(scatter1, scatter2, nrow = 1, ncol = 2)
```
Here I am trying to see the relationship of _MonthlyLoanPayment_ with _StatedMonthlyIncome._ We can see from the above statterplot that most of the _MonthlyLoanPayment_ is distributed from 10 to 1000 and the _StatedMonthlyIncome_ is distributed from 100 to 30000. Both the scales are transformed in log scale and we can clearly observe two things, namely,
1. There is definite a strong positive correlation
between monthly income and monthly loan amount.
2. Cubic relationship seems a better fit that simple linear model.
## Are we sure that they are greedy ?
Now we can see that there was definitely a strong correlation between the two variables but are we sure? Let's find the **Correlation Coefficient** to analyse it more.
```{r echo=FALSE, message=FALSE, warning=FALSE}
with(loanData, cor.test(MonthlyLoanPayment, StatedMonthlyIncome,
method = "pearson"))
```
Well we can't really say that there is a strong correlation looking at the value of R which is almost 0.2. Usually it is said to be of high statistical importance if it is more than 0.3 or less than -0.3. But we can see that the value is still acceptable with somewhat positive correlation with the population Conficence Interval being more than 0. The strong t-statistics of 67.76 and small p-value shows that the statistical significance of alternative hypothesis is very strong. But this can also be somewhat practical significance too, at least to be included in a linear model.
## Does Recommendations matter ??
Before we move ahead let's review the _Recommendations_ variable and understand it's impact and distribution.
### Distribution of Recommendation
```{r echo=FALSE, message=FALSE, warning=FALSE}
loanData %>%
group_by(Recommendations) %>%
summarise(n = n()) %>%
ggplot(aes(x = Recommendations, y = n)) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
ylab("Number of borrowers") +
xlab("Number of Recommendations") +
scale_x_continuous(breaks = unique(loanData$Recommendations))
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
rec <- loanData %>%
group_by(Recommendations) %>%
summarise(n = n(),
AvgLoan = median(MonthlyLoanPayment)) %>%
mutate(freq = AvgLoan / sum(AvgLoan) * 100)
A <- cut(rec$Recommendations, breaks = c(-1, 5, 10, 15, 20, 40),
labels= c("0 to 5", "6 to 10", "11 to 15", "16 to 20", "Above 20"))
rec$recrange <- A
recommend <- rec %>%
group_by(recrange) %>%
summarise(total = round(sum(freq), 2))
k <- recommend[order(recommend$total), ]
ggplot(aes(x = reorder(recrange, -total),
y = total, fill = -total), data = recommend) +
geom_bar(stat = 'identity', position="dodge") +
theme_minimal() +
guides(fill = F) +
scale_fill_gradient(low = 'darkblue', high = 'blue') +
geom_text(aes(label = sprintf("%2.1f%%", round(total, 2))),
color="dodgerblue4", size = 5.5, nudge_y = 1.5,
fontface = 'bold', nudge_x = 0.06) +
ylab("% of borrowers") +
xlab("Number of Recommendations") +
theme(
plot.title = element_text(face = 'bold.italic',
colour = 'darkblue',
size=18),
plot.subtitle = element_text(face = 'bold',
colour = 'blue',
size=12)) +
ggtitle("Borrowers with different recommendations",
subtitle = "in terms of monthly loan payment")
```
As we can see that people with _Above 20_ recommendations has the highest number of _MonthlyLoanPayment_ percentage as compared to any other group. But this still can't assure us that people with higher recommendations pay more loan each month. This question was relevant because we want to find out IF PEOPLE WITH MORE RECOMMENDATION ARE POWERFUL AND RICH ENOUGH TO PAY MORE LOAN MONTHLY? The answer is we can't be really sure.
## Let's predict the Monthly Loan Payment
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginalAmount, y = MonthlyLoanPayment),
data = loanData) +
geom_point(position = "jitter", alpha = 1/2) +
theme_hc() +
ggtitle("Correlation of MonthlyLoanPayment vs LoanOriginalAmount") +
geom_smooth(method = "lm")
```
As we can the a very good linear realtionship here. Let's explore this even further. There are multiple number of models that we can create with different number of input or independent variables. Let's try to create a linear model and see how we can improve it.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Selecting columns that we require
model <- loanData %>%
dplyr::select(MonthlyLoanPayment,
LoanOriginalAmount,
DebtToIncomeRatio,
ProsperRating..numeric.,
Term,
LoanOriginationQuarter) %>%
filter(!is.na(MonthlyLoanPayment), !is.na(LoanOriginalAmount),
MonthlyLoanPayment != 0.0, LoanOriginalAmount != 0.0,
MonthlyLoanPayment != "", LoanOriginalAmount != "",
!is.na(DebtToIncomeRatio), DebtToIncomeRatio != "",
!is.na(ProsperRating..numeric.))
# Creating the first Linear Model
fit1 <- lm(MonthlyLoanPayment ~ LoanOriginalAmount, data = model)
# Predict from the model
model$predicted <- predict(fit1)
# Calculate the residuals for accuracy
model$residuals <- residuals(fit1)
# Visualize how accurate the model (fit1)
ggplot(data = model, aes(x = LoanOriginalAmount,
y = MonthlyLoanPayment)) +
geom_segment(aes(xend = LoanOriginalAmount ,
yend = predicted), alpha = 0.3) +
theme_pander() +
geom_point(shape = 21, fill = 'violet', size = 1, alpha = 0.4) +
geom_point(aes(y = predicted), shape = 4) +
ggtitle("Residuals plot for initial model (fit1)")
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(fit1)
```
_LoanOriginalAmount_ should be the most important factor to decide what would be the _MonthlyLoanPayment_. As we can see that the correlation has very good _Adjusted R-squared_ value of 0.845 or almost 85%. The _Residual Standard Error_ is however quite high of almost75 which indicates the average performance of the linear model. We can visualize how the residuals are spread from the above plot. Their distribution is quite high. So the bigger question is that can we improve this model. Improving the model would ensure the following things
1. Higher Adjusted R-squared value
2. Lower Residual Standard Error
3. Less variation/ Less dispersed residual graph
The key to improve any model liner/non-linear is to increase the number of independent variables. The choice of independent variable should be very logical. We should not include the variable which are dependent on the predicting variable. The independent variables should be truly independent. Here are the following variables that are chosen to improve the model.
1. LoanOriginalAmount: This should be the most important one
because how much the borrower has to pay each month
should be dependent on the total amount that is loaned
2. DebtToIncomeRatio: This is important because lenders may
ask more money each month who has a record of high DebtToIncomeRatio
because they are considered borrowers with poor prospect and
lenders might charge them more if giving loans
3. ProsperRating..numeric.: This is obvious because the we have shown
already that lenders trust the prosper rating a lot over anything else
and hence borrowers with poor ratings should pay more each
month as compared to better ratings.
4. Term: Term decides how long the borrower is opting for the loan
and hence longer the Term shorter should be monthly payment.
5. LoanOriginationQuarter: Lenders may ask for different monthly loan payment
on different time of the year because of dynamic interest rate
and changing macro economic factors on different time of the year.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Selecting columns that we require
model <- loanData %>%
dplyr::select(MonthlyLoanPayment,
LoanOriginalAmount,
DebtToIncomeRatio,
ProsperRating..numeric.,
Term,
LoanOriginationQuarter) %>%
filter(!is.na(MonthlyLoanPayment), !is.na(LoanOriginalAmount),
MonthlyLoanPayment != 0.0, LoanOriginalAmount != 0.0,
MonthlyLoanPayment != "", LoanOriginalAmount != "",
!is.na(DebtToIncomeRatio), DebtToIncomeRatio != "",
!is.na(ProsperRating..numeric.))
# Creating the improved Linear Model
fit2 <- lm(MonthlyLoanPayment ~ LoanOriginalAmount +
DebtToIncomeRatio +
ProsperRating..numeric. +
Term + LoanOriginationQuarter,data = model)
# Predict from the model
model$predicted <- predict(fit2)
# Calculate the residuals for accuracy
model$residuals <- residuals(fit2)
# Visualize how accurate the model (fit2)
# summary(fit2)
ggplot(data = model, aes(x = LoanOriginalAmount,
y = MonthlyLoanPayment)) +
geom_segment(aes(xend = LoanOriginalAmount ,
yend = predicted), alpha = 0.3) +
theme_pander() +
geom_point(shape = 21, fill = 'violet', size = 1, alpha = 0.4) +
geom_point(aes(y = predicted), shape = 4) +
ggtitle("Residuals plot of improved model (fit2)")
```
We can see a lot less variation in the residual. Moreover when we summarize the fit2 for our improved model, we see that the value of _Residual standard error_ decreased a lot to almost 49 and _Adjusted R-squared_ increased to almost 94%. This was a huge jump from our previous model. Hence we can say that fit2 predicts the monthly loan amount better than the fit1 model. This also proves that _LoanOriginalAmount_ is not only the deciding factor of _MonthlyLoanPayment_. There are other factors too that contribute to this. From the ```summary(fit2)```, we observe that the slope values like below
1. LoanOriginalAmount: 3.124e-02
2. DebtToIncomeRatio: 3.211e+00
3. ProsperRating..numeric.: -1.050e+01
4. Term: -4.993e+00
We can see that as we might have assumed, _LoanOriginalAmount_ & _DebtToIncomeRatio_ has positive slope which is obvious because if these two increase, lenders ask for more money from the borrowers. On the other hand, _Term_ has a negative slope which is logical because if term increases, monthly loan payment should decrease.
We can also see that the both maximum and minimum residual for our improved model is lesser than our previous model.
## Some Final Thoughts
Let's select 3 plots from what we have discussed and elaborate them bit further.
### What people loans for ?
```{r echo=FALSE, message=FALSE, warning=FALSE, fig.height=7, fig.width=10}
categories <- c("Debt\nConsolidation", "Home\nImprovement", "Business",
"Personal\nLoan", "Student Use", "Auto",
"Other", "Baby", "Boat",
"Cosmetic\nProcedure", "Engagement\nRing", "Green\nLoans",
"Household\nExpenses", "Large\nPurchases", "Medical",
"Motorcycle", "RV", "Taxes", "Vaccation", "Wedding\nLoans")
mapBorrowerCategory <- function(categoryNumber) {
ifelse(categoryNumber == 0, 'na', categories[categoryNumber])
}
loanData.reasons <- loanData %>%
group_by(ListingCategory..numeric.) %>%
summarise(n = n()) %>%
filter(ListingCategory..numeric. != 0) %>%
mutate(category = mapBorrowerCategory(ListingCategory..numeric.),
freq = n / sum(n) * 100) %>%
arrange(n) %>%
dplyr::select(-ListingCategory..numeric.)
loanData.reasons$category <- factor(loanData.reasons$category)
ggplot(aes(x = reorder(category, freq), y = freq, fill = freq),
data = loanData.reasons) +
geom_bar(stat = 'identity', position="dodge") +
theme_pander() +
scale_fill_gradient(low = 'lightblue', high = 'darkblue') +
coord_flip() +
guides(fill=FALSE) +
geom_text(aes(label = sprintf("%2.1f%%", round(freq, 2))),
color="black", size = 5, nudge_y = 3) +