DRS Bank is facing challenging times. Its NPAs (Non-Performing Assets) have been on the rise recently, and a large part of these is due to loans given to individual customers (borrowers). The Chief Risk Officer of the bank decides to put in place a scientifically robust framework for approving loans to individual customers, to minimize the risk of loans converting into NPAs, and initiates a project for the data science team at the bank. You, as a senior member of the team, are assigned this project.
Problem:
The dataset is used to answer the following key questions:
What criteria should be used to approve loans for an individual customer so that the likelihood of loan delinquency is minimized?
What factors drive loan delinquency?
Data set :
Attribute Information:
The data contains the following characteristics of the borrowers:
ID: Customer ID
isDelinquent : indicates whether the customer is delinquent or not (1 => Yes, 0 => No)
term: Loan term in months
gender: Gender of the borrower
age: Age of the borrower
purpose: Purpose of Loan
home_ownership: Status of borrower's home
FICO: FICO (i.e. the bureau score) of the borrower
Learning Outcomes:
Exploratory Data Analysis
Preparing the data to train a model
Training and understanding of data using a decision tree model
Model evaluation
Domain Information:
Transactor – A person who pays the full balance due, on time.
Revolver – A person who pays the minimum amount due but keeps revolving the balance and does not pay the full amount.
Delinquent – A person who is behind on payments, failing to pay even the minimum amount due.
Defaulter – A person who has been delinquent for a certain period, after which the lender declares the loan to be in default.
Risk Analytics – A broad domain in the financial and banking industry concerned with analyzing the risk posed by a customer.
STEP 1 : IMPORT LIBRARIES
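The import cell itself is not shown here. A minimal sketch of the imports the later cells appear to need, assuming pandas, NumPy, matplotlib, and scikit-learn are installed (the original notebook may have imported more, or different, libraries):

```python
# Data handling
import numpy as np
import pandas as pd

# Plotting (used by the bar-plot helpers in the EDA step)
import matplotlib.pyplot as plt

# Modeling: train/test split and the decision tree classifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
```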
STEP 2: READ THE DATA
data = pd.read_csv("Loan_Delinquent_Dataset.csv")
# copying data to another variable to avoid any changes to the original data
loan = data.copy()
loan.head()
loan.tail()
loan.shape
loan.info()
OBSERVATION:
isDelinquent is the dependent variable and is of integer type.
All the independent variables except ID are of object type.
Summary of the data set :
loan.describe(include="all")
OBSERVATION:
Most of the loans have a 36-month term.
More males have applied for loans than females.
Most loan applications are for house loans.
Most customers have either mortgaged their houses or live on rent.
Most applicants are in the 20-25 age group.
Most customers have a FICO score between 300 and 500.
STEP 3 : EXPLORATORY DATA ANALYSIS
Univariate Analysis
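The calls below rely on a `labeled_barplot` helper whose definition is not included in this section. The following is a minimal sketch of what such a helper might look like, assuming it draws one bar per category and labels each bar with its count or, when `perc=True`, its percentage. The signature is inferred from the calls; returning the axes is an addition for convenience, and the original implementation may differ.

```python
import matplotlib.pyplot as plt
import pandas as pd


def labeled_barplot(data, feature, perc=False):
    """Bar plot of category counts, each bar labeled with its count or percentage."""
    counts = data[feature].value_counts()  # sorted from most to least frequent
    total = len(data[feature])

    fig, ax = plt.subplots(figsize=(len(counts) + 2, 5))
    bars = ax.bar(counts.index.astype(str), counts.values)

    # Write the label just above the top-center of each bar.
    for bar, count in zip(bars, counts.values):
        label = f"{100 * count / total:.1f}%" if perc else str(count)
        ax.annotate(
            label,
            (bar.get_x() + bar.get_width() / 2, bar.get_height()),
            ha="center",
            va="bottom",
        )

    ax.set_xlabel(feature)
    ax.set_ylabel("count")
    plt.xticks(rotation=90)
    plt.show()
    return ax
```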
labeled_barplot(loan, "isDelinquent", perc=True)
labeled_barplot(loan, "term", perc=True)
labeled_barplot(loan, "gender", perc=True)
labeled_barplot(loan, "purpose", perc=True)
labeled_barplot(loan, "home_ownership", perc=True)
OBSERVATION:
66.9% of the customers are delinquent
91.7% of the loans are for a 36 month term.
There are more male applicants (56.8%) than female applicants (43.2%)
Most loan applications are for house loans (59.7%) followed by car loans (18%)
There are 2 levels named 'other' and 'Other' under the purpose variable. Since we do not have any other information about these, we can merge these levels.
Very few applicants (<10%) own their house. Most customers have either mortgaged their houses or live on rent.
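The merge of the 'other' and 'Other' levels noted above can be sketched as follows. The toy frame and its category values other than 'other' and 'Other' are illustrative stand-ins, not the actual dataset.

```python
import pandas as pd

# Toy stand-in for the `purpose` column; the real values come from the dataset.
toy = pd.DataFrame({"purpose": ["House", "Car", "other", "Other", "Personal"]})

# Map the lowercase level onto the capitalized one so both count as one category.
toy["purpose"] = toy["purpose"].replace("other", "Other")

print(toy["purpose"].value_counts())
```

In the notebook, the same `replace` call would be applied to `loan["purpose"]`.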
Bivariate Analysis
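The `stacked_barplot` helper used below is likewise not defined in this section. A minimal sketch, assuming it prints a count crosstab of the predictor against the target and plots the within-category shares as stacked bars. The signature is inferred from the calls, and the returned table is an addition for convenience.

```python
import matplotlib.pyplot as plt
import pandas as pd


def stacked_barplot(data, predictor, target):
    """Print a count crosstab and draw a row-normalized stacked bar chart."""
    counts = pd.crosstab(data[predictor], data[target], margins=True)
    print(counts)

    # Share of each target class within every predictor category (rows sum to 1).
    shares = pd.crosstab(data[predictor], data[target], normalize="index")
    ax = shares.plot(kind="bar", stacked=True, figsize=(len(shares) + 3, 5))
    ax.set_ylabel("share of customers")
    ax.legend(title=target, loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
    return shares
```

Because the bars are normalized per category, the chart compares delinquency rates across categories rather than raw counts.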
stacked_barplot(loan, "term", "isDelinquent")
stacked_barplot(loan, "gender", "isDelinquent")
stacked_barplot(loan, "purpose", "isDelinquent")
stacked_barplot(loan, "home_ownership", "isDelinquent")
OBSERVATION:
Most delinquent customers have taken a 36-month loan.
There is not much difference in delinquency between male and female customers.
Most delinquent customers applied for house loans, followed by car and personal loans.
Customers who own their house are less often delinquent than those who live in a rented place or have mortgaged their home.
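The modeling step relies on `X_train` and `y_train`, but the data preparation that produces them is not shown in this section. A minimal sketch of one common approach, assuming the ID column is dropped, the categorical columns are one-hot encoded, and a stratified 70/30 split is used. The split ratio, the `random_state`, and every value in the toy frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows using the column names from the attribute list; values are illustrative only.
toy = pd.DataFrame({
    "ID": range(1, 11),
    "isDelinquent": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "term": ["36 months"] * 7 + ["60 months"] * 3,
    "gender": ["Male", "Female"] * 5,
    "purpose": ["House", "Car", "House", "Personal", "House"] * 2,
    "home_ownership": ["Mortgage", "Rent", "Own", "Rent", "Mortgage"] * 2,
    "age": ["20-25", ">25"] * 5,
    "FICO": ["300-500", ">500"] * 5,
})

# ID carries no predictive information; isDelinquent is the target.
X = pd.get_dummies(toy.drop(columns=["ID", "isDelinquent"]), drop_first=True)
y = toy["isDelinquent"]

# Assumed 70/30 split, stratified so both sets keep a similar delinquency ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```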
STEP 5 : BUILD DECISION TREE MODEL
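The classifier in this step is built with `criterion="gini"`, so each split is chosen to reduce Gini impurity, defined for a node as 1 minus the sum of squared class proportions. The short sketch below computes that quantity by hand. The function name is illustrative, and the example counts simply restate the 66.9% delinquency rate from the univariate analysis as 669 out of 1000.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a node with the given class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# A node holding 669 delinquent and 331 non-delinquent customers.
mixed_node = gini_impurity([669, 331])
# A pure node (a single class) has zero impurity.
pure_node = gini_impurity([100, 0])

print(round(mixed_node, 4), pure_node)
```

A split is attractive when the weighted impurity of the child nodes is well below that of the parent.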
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
# Checking model performance on the training set
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
confusion_matrix_sklearn(model, X_train, y_train)
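The two helpers called above, `model_performance_classification_sklearn` and `confusion_matrix_sklearn`, are not defined in this section. A minimal sketch of what they might compute, assuming the first returns accuracy, recall, precision, and F1 in a one-row DataFrame. For the second, this sketch returns the confusion matrix as a labeled table instead of plotting it, which may differ from the original helper.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def model_performance_classification_sklearn(model, predictors, target):
    """Return accuracy, recall, precision and F1 as a one-row DataFrame."""
    pred = model.predict(predictors)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(target, pred)],
        "Recall": [recall_score(target, pred)],
        "Precision": [precision_score(target, pred)],
        "F1": [f1_score(target, pred)],
    })


def confusion_matrix_sklearn(model, predictors, target):
    """Return the confusion matrix as a table (rows: actual, columns: predicted)."""
    cm = confusion_matrix(target, model.predict(predictors), labels=[0, 1])
    return pd.DataFrame(
        cm,
        index=["actual 0", "actual 1"],
        columns=["predicted 0", "predicted 1"],
    )
```

Calling the same helpers with the held-out test data gives the comparison needed to spot overfitting, since an unpruned tree usually scores near-perfectly on its own training set.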
VISUALIZING THE DECISION TREE :
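The visualization cell is not included here. As a hedged sketch, scikit-learn's `export_text` prints the split rules of a fitted tree, and `sklearn.tree.plot_tree` draws the same tree graphically. The synthetic data, the feature names, and the `max_depth=2` limit below are assumptions chosen to keep the output short. They are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the encoded loan features.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

# Shallow tree so the printed rules stay readable.
model = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
model.fit(X, y)

# Text rendering of the split rules, one line per node.
tree_rules = export_text(model, feature_names=feature_names)
print(tree_rules)
```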
AFTER PRUNING
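The pruning code is not shown in this section. One standard way to post-prune a scikit-learn tree is cost-complexity pruning: compute the candidate `ccp_alpha` values from the full tree, refit one tree per value, and keep the one that scores best on held-out data. The sketch below assumes recall is the selection metric (the observation that follows reports recall) and uses synthetic data and a validation split as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use its own training data.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Candidate pruning strengths from the cost-complexity path of the full tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (it prunes down to a single node) and clip any tiny
# negative values caused by floating-point error.
ccp_alphas = [max(alpha, 0.0) for alpha in path.ccp_alphas[:-1]]

# Fit one tree per alpha and keep the one with the best validation recall.
best_alpha, best_recall, best_model = None, -1.0, None
for alpha in ccp_alphas:
    candidate = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    recall = recall_score(y_val, candidate.predict(X_val))
    if recall > best_recall:
        best_alpha, best_recall, best_model = alpha, recall, candidate

print(
    f"ccp_alpha={best_alpha:.5f}, validation recall={best_recall:.3f}, "
    f"leaves={best_model.get_n_leaves()}"
)
```

Larger `ccp_alpha` values remove more branches, so the selected tree is typically much smaller than the unpruned one. Selecting alpha on a validation split and reporting on a separate test set avoids tuning against the data used for the final score.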
OBSERVATION:
The decision tree with post-pruning gives the highest recall on the test set.
The post-pruned tree is not complex and is easy to interpret.
CONCLUSION :
FICO, term, and gender (in that order) are the most important variables in determining whether a borrower will become delinquent.
A borrower applying for a 36-month loan with a FICO score in the 300-500 range should not be given a loan.
Female borrowers with a FICO score greater than 500 should be our target customers.
According to the decision tree model, the criteria to approve a loan should depend on three main factors: FICO score, loan duration, and gender. If the FICO score is less than 500 and the loan duration is less than 60 months, the customer is unlikely to repay the loan. If the customer has a FICO score greater than 500 and is female, there is a higher chance that the loan will be repaid.
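The variable ranking cited in the conclusion is the kind of result that can be read from a fitted tree's `feature_importances_` attribute. A minimal sketch on synthetic data follows. The feature names are hypothetical, and the printed values have no relation to the loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded loan features.
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])  # hypothetical names

model = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X, y)

# Gini importances are normalized to sum to 1; sort to rank the features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```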