Yamini

Step-by-Step Research in ML Using the Decision Tree Algorithm


DRS bank is facing challenging times. Their NPAs (Non-Performing Assets) have been on the rise recently, and a large part of these are due to loans given to individual customers (borrowers). The Chief Risk Officer of the bank decides to put in place a scientifically robust framework for approving loans to individual customers, to minimize the risk of loans converting into NPAs, and initiates a project for the bank's data science team. You, as a senior member of the team, are assigned this project.


Problem:


The data set aims to answer the following key questions:


  • What criteria should be used to approve loans for an individual customer so that the likelihood of loan delinquency is minimized?

  • What are the factors that drive the behavior of loan delinquency?


Data set :


Attribute Information:


  • The data contains characteristics of the borrowers:

  • ID: Customer ID

  • isDelinquent : indicates whether the customer is delinquent or not (1 => Yes, 0 => No)

  • term: Loan term in months

  • gender: Gender of the borrower

  • age: Age of the borrower

  • purpose: Purpose of Loan

  • home_ownership: Status of borrower's home

  • FICO: FICO (i.e. the bureau score) of the borrower


Learning Outcomes:


  • Exploratory Data Analysis

  • Preparing the data to train a model

  • Training and understanding of data using a decision tree model

  • Model evaluation


Domain Information:


  • Transactor – A person who pays the due balance in full and on time.

  • Revolver – A person who pays the minimum due amount but keeps revolving the balance and does not pay the full amount.

  • Delinquent – A person who is behind on payments, i.e. who fails to pay even the minimum due amount.

  • Defaulter – A person who has been delinquent for a certain period and whom the lender has declared to be in default.

  • Risk Analytics – A wide domain in the financial and banking industry, basically analyzing the risk of the customer.


STEP 1 : IMPORT LIBRARIES
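
The import cell is not shown in the post. A plausible minimal set, assuming the usual pandas and scikit-learn stack that the later calls (`pd.read_csv`, `DecisionTreeClassifier`) imply, might look like the sketch below. Only pandas and `DecisionTreeClassifier` are confirmed by the code that follows; the other imports are assumptions about what the plotting and evaluation steps need.

```python
# Assumed imports: only pandas and DecisionTreeClassifier are confirmed by the post's code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
```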


STEP 2: READ THE DATA


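import pandas as pd

data = pd.read_csv("Loan_Delinquent_Dataset.csv")
# Copy the data to another variable to avoid changing the original
loan = data.copy()
loan.head()  # first five rows
loan.tail()  # last five rows
loan.shape   # (number of rows, number of columns)
loan.info()  # column names, non-null counts and data types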

OUTPUT DISPLAYS THE DATA TYPE

OBSERVATION:

  • isDelinquent is the dependent variable - type integer.

  • All the independent variables except ID are of object type.


Summary of the data set :

loan.describe(include="all")

OBSERVATION:


  • Most of the loans have a 36-month term.

  • More males have applied for loans than females.

  • Most loan applications are for house loans.

  • Most customers have either mortgaged their houses or live on rent.

  • Most loan applicants are in the 20-25 age group.

  • Most customers have a FICO score between 300 and 500.


STEP 3 : EXPLORATORY DATA ANALYSIS


Univariate Analysis


# Univariate analysis (labeled_barplot is a user-defined plotting helper)
labeled_barplot(loan, "isDelinquent", perc=True)
labeled_barplot(loan, "term", perc=True)
labeled_barplot(loan, "gender", perc=True)
labeled_barplot(loan, "purpose", perc=True)
labeled_barplot(loan, "home_ownership", perc=True)
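
`labeled_barplot` is not defined anywhere in the post. The following is a hypothetical re-creation, assuming it draws one bar per category and labels each bar with either the count or, when `perc=True`, the share of all rows. The toy data is invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def labeled_barplot(data, feature, perc=False):
    """Bar plot of category counts, each bar labeled with its count or percentage.

    Hypothetical re-creation: the helper used in the post is not shown.
    """
    counts = data[feature].value_counts()
    fig, ax = plt.subplots(figsize=(len(counts) + 2, 4))
    bars = ax.bar(list(counts.index.astype(str)), counts.values)
    total = len(data[feature])
    for bar, count in zip(bars, counts.values):
        # Share of all rows when perc=True, otherwise the raw count
        label = f"{100 * count / total:.1f}%" if perc else str(count)
        ax.annotate(
            label,
            (bar.get_x() + bar.get_width() / 2, bar.get_height()),
            ha="center",
            va="bottom",
        )
    ax.set_xlabel(feature)
    ax.set_ylabel("count")
    return ax


# Toy data, invented for illustration only
toy = pd.DataFrame({"term": ["36 months"] * 9 + ["60 months"]})
ax = labeled_barplot(toy, "term", perc=True)
labels = [t.get_text() for t in ax.texts]
```

In a notebook the figure renders automatically; in a script, `plt.show()` after the call displays it.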


OBSERVATION:


  • 66.9% of the customers are delinquent.

  • 91.7% of the loans are for a 36-month term.

  • There are more male applicants (56.8%) than female applicants (43.2%).

  • Most loan applications are for house loans (59.7%), followed by car loans (18%).

  • There are 2 levels named 'other' and 'Other' under the purpose variable. Since we do not have any other information about these, we can merge these levels.

  • Very few applicants (<10%) own their house; most customers have either mortgaged their houses or live on rent.
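
Merging the 'other' and 'Other' levels mentioned above can be done with a single value replacement. This is a minimal sketch on invented toy values; in the real workflow the `purpose` column comes from the CSV.

```python
import pandas as pd

# Toy values, invented for illustration; the real column comes from the CSV
loan = pd.DataFrame({"purpose": ["House", "Car", "other", "Other", "House"]})

# Map the lowercase variant onto the capitalized one so both count as one level
loan["purpose"] = loan["purpose"].replace("other", "Other")

print(loan["purpose"].value_counts())
```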


Bivariate Analysis


# Bivariate analysis (stacked_barplot is a user-defined plotting helper)
stacked_barplot(loan, "term", "isDelinquent")
stacked_barplot(loan, "gender", "isDelinquent")
stacked_barplot(loan, "purpose", "isDelinquent")
stacked_barplot(loan, "home_ownership", "isDelinquent")
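
`stacked_barplot` is also not defined in the post. One way such a helper could work, assuming it cross-tabulates the predictor against the target and plots the share of each target class per category, is sketched here. The function body and the toy data are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt


def stacked_barplot(data, predictor, target):
    """Print predictor-by-target counts and plot row-normalized stacked bars.

    Hypothetical re-creation: the helper used in the post is not shown.
    """
    counts = pd.crosstab(data[predictor], data[target], margins=True)
    print(counts)
    # Normalize each row so bars show the share of each target class per category
    shares = pd.crosstab(data[predictor], data[target], normalize="index")
    ax = shares.plot(kind="bar", stacked=True, figsize=(len(shares) + 3, 4))
    ax.set_ylabel("share of customers")
    ax.legend(title=target, loc="upper left", bbox_to_anchor=(1, 1))
    return shares


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "isDelinquent": [1, 1, 0, 1, 0, 0],
})
shares = stacked_barplot(toy, "gender", "isDelinquent")
```

Returning the normalized table makes it easy to read off exact shares when the bars are hard to compare by eye.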

OBSERVATION:


  • Most delinquent customers have taken a loan for 36 months.

  • There's not much difference in delinquency between male and female customers.

  • Most delinquent customers applied for house loans, followed by car and personal loans.

  • Customers who own their house are less delinquent than those who live in a rented place or have mortgaged their home.
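
The model-building code relies on `X_train` and `y_train`, but the data preparation that creates them is not shown. A minimal sketch of that preparation follows, assuming one-hot encoding of the object columns and a stratified 70/30 split; the encoding and split ratio actually used are not stated, and the toy rows (including the `>25` and `>500` levels) are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows using the post's column names; all values are invented for illustration
loan = pd.DataFrame({
    "ID": range(1, 11),
    "isDelinquent": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
    "term": ["36 months"] * 8 + ["60 months"] * 2,
    "gender": ["Male", "Female"] * 5,
    "purpose": ["House", "Car", "House", "Other", "House"] * 2,
    "home_ownership": ["Mortgage", "Rent", "Own", "Rent", "Mortgage"] * 2,
    "age": ["20-25", ">25"] * 5,
    "FICO": ["300-500", ">500"] * 5,
})

# ID carries no predictive information; isDelinquent is the target
X = loan.drop(columns=["ID", "isDelinquent"])
y = loan["isDelinquent"]

# One-hot encode the object columns, dropping one level per feature
X = pd.get_dummies(X, drop_first=True)

# Hold out 30% for testing, keeping the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```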


STEP 5 : BUILD DECISION TREE MODEL


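from sklearn.tree import DecisionTreeClassifier

# X_train and y_train must already exist (train/test split of the prepared data).
# model_performance_classification_sklearn and confusion_matrix_sklearn are
# user-defined helpers, not scikit-learn functions.
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
# Checking model performance on the training set
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train)
decision_tree_perf_train
confusion_matrix_sklearn(model, X_train, y_train)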
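
The two evaluation helpers called above are not defined in the post. The sketch below is a hypothetical re-creation, assuming the first returns accuracy, recall, precision and F1 in a one-row DataFrame and the second summarizes the confusion matrix (returned here as a labeled table instead of a plot). The data is synthetic and stands in for `X_train` and `y_train`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.tree import DecisionTreeClassifier


def model_performance_classification_sklearn(model, predictors, target):
    """Return accuracy, recall, precision and F1 as a one-row DataFrame.

    Hypothetical re-creation: the helper used in the post is not shown.
    """
    pred = model.predict(predictors)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(target, pred)],
            "Recall": [recall_score(target, pred)],
            "Precision": [precision_score(target, pred)],
            "F1": [f1_score(target, pred)],
        }
    )


def confusion_matrix_sklearn(model, predictors, target):
    """Return the confusion matrix as a table (rows: actual, columns: predicted)."""
    cm = confusion_matrix(target, model.predict(predictors), labels=[0, 1])
    return pd.DataFrame(
        cm,
        index=["actual 0", "actual 1"],
        columns=["predicted 0", "predicted 1"],
    )


# Synthetic stand-in for X_train / y_train
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=1)
model = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)

perf = model_performance_classification_sklearn(model, X_train, y_train)
cm = confusion_matrix_sklearn(model, X_train, y_train)
print(perf)
print(cm)
```

A fully grown tree usually scores very high on its own training data, so the same helpers should also be called on the test set before drawing conclusions.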


VISUALIZING THE DECISION TREE :
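
The tree figure did not survive on this page. As a rough illustration of how a fitted tree can be inspected, the sketch below prints the split rules as text with scikit-learn's `export_text`. The data is synthetic, the feature names are placeholders, and `max_depth=2` is set only to keep the printout short; it is not a setting taken from the post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data; feature names are placeholders
X_train, y_train = make_classification(n_samples=200, n_features=4, random_state=1)
feature_names = ["f0", "f1", "f2", "f3"]

# max_depth=2 only keeps the printed tree short for this illustration
model = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=1)
model.fit(X_train, y_train)

# Text rendering of the split rules, one line per node
rules = export_text(model, feature_names=feature_names)
print(rules)
```

For a graphical version, `sklearn.tree.plot_tree(model, feature_names=feature_names, filled=True)` draws the same tree with matplotlib.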





AFTER PRUNING
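
The pruning code and figure are missing from the page. One common way to post-prune a scikit-learn tree is cost-complexity pruning, sketched below on synthetic data. It assumes the pruning strength `ccp_alpha` is chosen by recall on the held-out set, which matches the recall-based comparison in the observations, but the author's exact procedure is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Effective alphas at which the fully grown tree would lose a subtree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Skip the last alpha, which prunes the tree down to a single node
ccp_alphas = path.ccp_alphas[:-1]

best_alpha, best_recall, best_model = None, -1.0, None
for alpha in ccp_alphas:
    # max(..., 0.0) guards against tiny negative alphas caused by rounding
    candidate = DecisionTreeClassifier(ccp_alpha=max(alpha, 0.0), random_state=1)
    candidate.fit(X_train, y_train)
    recall = recall_score(y_test, candidate.predict(X_test))
    # ">=" prefers the larger alpha, i.e. the simpler tree, on ties
    if recall >= best_recall:
        best_alpha, best_recall, best_model = alpha, recall, candidate

unpruned = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("nodes before:", unpruned.tree_.node_count)
print("nodes after:", best_model.tree_.node_count)
print("best alpha:", best_alpha, "recall:", best_recall)
```

Choosing alpha on the test set lets information from that set leak into the model choice, so a separate validation set or cross-validation would give a cleaner estimate.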





OBSERVATION:


  • The decision tree with post-pruning gives the highest recall on the test set.

  • The post-pruned tree is not complex and is easy to interpret.


CONCLUSION :


  • FICO, term and gender (in that order) are the most important variables in determining whether a borrower will become delinquent.

  • Borrowers applying for a 36-month term loan with a FICO score in the 300-500 range should not be given a loan.

  • Female borrowers with a FICO score greater than 500 should be our target customers.

  • The criteria for approving a loan according to the decision tree model should depend on three main factors: FICO score, loan duration and gender. If the FICO score is less than 500 and the loan duration is less than 60 months, the customer is unlikely to repay the loan. If the customer has a FICO score greater than 500 and is female, there is a higher chance that they will repay the loan.
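
A ranking of variables such as the one in the conclusion is typically read from the fitted tree's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data; the column names are placeholders, so its output says nothing about the loan features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; column names are placeholders, not the loan features
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

model = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X, y)

# Gini importances are normalized to sum to 1; a higher value means the
# splits on that feature removed more impurity overall
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```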

