How Decision Trees Simplify Complex Data
The previous blog introduced two important techniques for splitting decision trees, a cornerstone of machine learning thanks to the simplicity and effectiveness of its algorithms. Decision trees recursively partition data by attribute values, which keeps the resulting model understandable. In this blog, I will give a general overview of two more splitting techniques, Gini Impurity and Chi-Square, and then discuss each in detail. These should help you prune the tree further as well as increase its accuracy. Now, let's delve deeper into these new strategies!
Technique 3: Gini Impurity in Decision Trees
Gini Impurity is a method for deciding how to split a node and is widely used when the target variable is categorical. It keeps the decision tree simple and easy to use, and it is one of the most straightforward ways to construct one. The Gini Impurity value can be calculated using this formula:
Gini Impurity Formula:

Gini = 1 - Σ (p_i)², where p_i is the proportion of samples in the node that belong to class i.
Now, what does Gini mean?
Gini measures the probability that an element chosen at random from a node would be misclassified if it were labeled at random according to the distribution of labels within that node. This is exactly what the formula above captures.
The Gini Impurity is a measure of impurity/disorder of a node. A zero value of Gini Impurity shows that the node is perfectly homogeneous, which means all elements in the node are of the same class.
Now, you may ask yourself: if we already use Information Gain, why bother with Gini Impurity?
One argument is that practitioners of decision tree algorithms often prefer Gini Impurity to Information Gain because it is a simpler expression for discrete data and faster to compute: it involves no logarithms, only squares and sums, which makes it computationally cheaper and quicker.
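To make the comparison concrete, here is a minimal Python sketch (the class proportions are purely illustrative) that computes both impurity measures for a single node. Note that Gini needs only squares and sums, while entropy, the quantity behind Information Gain, needs a logarithm for every class.

```python
from math import log2

def gini_impurity(proportions):
    # Gini = 1 - sum(p_i^2): only squares and sums, no logarithms.
    return 1 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    # Entropy = -sum(p_i * log2(p_i)): one logarithm per class.
    return -sum(p * log2(p) for p in proportions if p > 0)

node = [2 / 3, 1 / 3]          # e.g. a node with 2 "Yes" and 1 "No"
print(gini_impurity(node))     # ~0.444
print(entropy(node))           # ~0.918
```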
Real-World Example:
Consider a bank that wants to decide whether an application for a loan should be approved or not. Let the decision be based on two features: Credit Score and Income. Historically, the bank has the following information:
Step 1: Data Collection
Here is a simplified dataset of six previous loan applications (Income is the other candidate feature, but this walkthrough focuses on the Credit Score split):

Applicant   Credit Score   Loan Approved
1           High           Yes
2           High           Yes
3           High           No
4           Low            No
5           Low            No
6           Low            Yes
Step 2: Split by Credit Score
We want to predict whether a loan will be approved based on the Credit Score feature. Let’s split the data into two groups based on whether the Credit Score is High or Low.
For the "High Credit Score" Group:
Data: [Yes, Yes, No]
The classes are "Yes" and "No."
Count of each class: 2 "Yes" and 1 "No."
Proportions of "Yes" and "No":
Proportion of "Yes" = 2/3
Proportion of "No" = 1/3
Gini Impurity for the High Credit Score group:
Gini(High) = 1 - ((2/3)² + (1/3)²) = 1 - (4/9 + 1/9) = 4/9 ≈ 0.444
For the "Low Credit Score" Group:
Data: [No, No, Yes]
The classes are "Yes" and "No."
Count of each class: 1 "Yes" and 2 "No."
Proportions of "Yes" and "No":
Proportion of "Yes" = 1/3
Proportion of "No" = 2/3
Gini Impurity for the Low Credit Score group:
Gini(Low) = 1 - ((1/3)² + (2/3)²) = 1 - (1/9 + 4/9) = 4/9 ≈ 0.444
Step 3: Calculate the Weighted Gini Impurity for the Split
Now, we calculate the weighted Gini impurity of the entire split. There are 6 total samples, 3 in each group:

Weighted Gini = (3/6) × 0.444 + (3/6) × 0.444 ≈ 0.44
Step 4: Interpret the Gini Impurity
The Gini impurity for the split on Credit Score is about 0.44, which means that even after splitting there is still some uncertainty about whether a loan will be approved: both the high and the low credit score groups contain a mix of "Yes" and "No" values. In practice, decision tree algorithms evaluate every candidate split, whether on Income or another feature, and select the one that minimizes the weighted Gini impurity. This helps build models that predict loan approvals more accurately.
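For readers who prefer code, here is a small Python sketch that reproduces the calculation above; the two groups are taken directly from the example, and the helper function is just an illustration, not any particular library's implementation.

```python
def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2), where p_i is the proportion of class i in the node.
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

high_credit = ["Yes", "Yes", "No"]   # High Credit Score group
low_credit = ["No", "No", "Yes"]     # Low Credit Score group

gini_high = gini_impurity(high_credit)   # ~0.444
gini_low = gini_impurity(low_credit)     # ~0.444

# Weight each group's impurity by its share of the 6 samples.
n = len(high_credit) + len(low_credit)
weighted_gini = (len(high_credit) / n) * gini_high + (len(low_credit) / n) * gini_low
print(round(weighted_gini, 2))  # 0.44
```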
Technique 4: Chi-Square in Decision Trees
Chi-Square is a statistical criterion for selecting the feature that best separates the data at each node of a decision tree. The test evaluates the independence between a feature and the target variable, both of which are usually categorical, and so indicates whether a split produces child nodes whose target distribution differs significantly from the parent's. The feature with the highest Chi-Square statistic shows the strongest association with the target and is chosen as the splitting criterion.
How the Chi-Square Test Works in Decision Trees:
1. Split the Data Based on a Feature:
For each candidate feature, divide the dataset into subsets based on the possible values of that feature.
2. Create a Contingency Table:
For each candidate feature, create a contingency table showing the observed frequencies of the target variable's classes against the values of that feature.
In a nutshell, for a given feature and target variable, one contingency table is constructed whose rows correspond to the possible values of the feature and whose columns correspond to the classes of the target variable.
3. Chi-Square Statistic Calculation:
For each contingency table, calculate the Chi-Square statistic, which measures the strength of association between the feature and the target variable.
The formula for the Chi-Square statistic is:

Chi-Square = Σ (Oi - Ei)² / Ei
Where:
Oi = observed frequency
Ei = expected frequency
4. Feature with the Highest Chi-Square:
The feature with the highest Chi-Square statistic is selected as the best feature to split on, since it shows the strongest association with the target variable.
5. Repeat for Each Subset:
The procedure is then repeated recursively on each subset, using the remaining features, until a stopping condition is met, for example when every subset is pure or falls below a minimum number of samples.
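Putting the five steps together, here is an illustrative Python sketch that builds a contingency table for each candidate feature, computes the Chi-Square statistic with the formula above, and selects the feature with the highest value. The function and variable names are my own, not from any specific library.

```python
from collections import Counter

def chi_square(records, feature, target):
    # records: list of dicts, e.g. {"Region": "North", "Response": "Yes"}
    values = sorted({r[feature] for r in records})    # rows of the contingency table
    classes = sorted({r[target] for r in records})    # columns of the contingency table
    observed = Counter((r[feature], r[target]) for r in records)
    row_totals = {v: sum(observed[(v, c)] for c in classes) for v in values}
    col_totals = {c: sum(observed[(v, c)] for v in values) for c in classes}
    n = len(records)
    stat = 0.0
    for v in values:
        for c in classes:
            expected = row_totals[v] * col_totals[c] / n
            if expected > 0:
                stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

def best_split_feature(records, features, target):
    # Step 4: pick the feature with the highest Chi-Square statistic.
    return max(features, key=lambda f: chi_square(records, f, target))
```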
Example of Chi-Square in a Decision Tree:
The Chi-Square test is widely used to select features at each level of a decision tree. Consider the yes/no problem of predicting customer responses to a marketing campaign, with two categorical features: Region (North, South, East, West) and Income Level (Low, Medium, High). A Chi-Square test is run between each feature and the target variable to decide which is more relevant to the outcome. The feature with the higher Chi-Square statistic shows the stronger relationship with the target and is selected for the split. In this way, the decision tree makes meaningful splits that improve prediction accuracy.
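If you would rather lean on a library, scipy's chi2_contingency performs the same test from a contingency table; the counts below are entirely hypothetical and only illustrate testing Region against the campaign response.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of campaign responses per Region (columns: Yes, No).
region_table = [
    [30, 20],   # North
    [10, 40],   # South
    [25, 25],   # East
    [15, 35],   # West
]

chi2, p_value, dof, expected = chi2_contingency(region_table)
print(f"Chi-Square = {chi2:.2f}, p-value = {p_value:.4f}")

# Repeating the test for Income Level (Low, Medium, High) and comparing the
# statistics tells us which feature to use for the split at this node.
```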
Conclusion:
Gini Impurity and the Chi-Square test are two key techniques in building good decision tree models. By measuring impurity and checking statistical significance, they help identify the most relevant features for splitting and improve both the accuracy and the interpretability of models. Gini Impurity is straightforward and computationally inexpensive, while the Chi-Square test statistically examines the relationship between features and the target variable. Together, they enable the creation of more powerful decision trees that make better predictions. In our next blog, we will explore a practical case study to see how these techniques are applied in real-world scenarios. See you there!