By Yamini

Breaking Down the Tree - Part 2

How Decision Trees Simplify Complex Data


Decision Trees are among the most powerful machine learning algorithms and find broad application in data science. Their simplicity, both in implementation and interpretation, is probably one of the prime reasons behind their popularity. Additionally, decision trees act as the building block of more complex and advanced algorithms such as Random Forest, XGBoost, and LightGBM. For any new project or hackathon, my go-to first choice in machine learning would be decision trees. In this blog, I am going to demonstrate two easy ways of splitting a decision tree.


Learning Objectives:


  • In this blog, let us see how to split a Decision Tree using different splitting criteria.


  • Familiarize yourself with concepts such as Reduction in Variance, Gini Impurity, Information Gain, and Chi-square in decision trees.


  • Understand the differences between these splitting techniques and how each of them works.


3 W's of Node Splitting: What, Why, and Where


  • What: In decision trees, node splitting is basically a method of splitting a parent node into two or more child nodes based on a feature that best segregates the data.


  • Why: Splitting enhances the predictive power of the tree, reduces impurity, and picks out the most relevant features.


  • Where: Node splitting takes place at every internal node while the tree is being built. Starting from the root, the process proceeds recursively down to the leaf nodes.


Split Techniques:


There are usually two ways to split a node, depending on the nature of the target variable:


  • Continuous Target Variable: Variance Reduction

  • Categorical Target Variable: Gini Impurity, Information Gain, and Chi-Square.
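As a quick aside, these criteria map directly onto the criterion parameter of scikit-learn's tree classes. The sketch below assumes scikit-learn is installed, and the hyperparameter values are arbitrary:

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

# Continuous target: "squared_error" minimises the variance within each child node,
# which is the same idea as choosing the split with the largest variance reduction.
reg_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3)

# Categorical target: "gini" uses Gini Impurity, "entropy" uses Information Gain.
clf_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)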




In the next sections, we will see each of these splitting methods in detail. So let's begin with the first one.


Technique 1: Reduction in Variance in Decision Trees


Variance reduction is a strategy for splitting nodes when the target variable is continuous, i.e., in regression problems. It is called "variance reduction" because it relies on variance to determine which feature to use when splitting a node into child nodes.

Variance measures how spread out the values in a node are. If all the values in a node are the same, the variance is zero, which means the node is perfectly homogeneous.
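As a minimal sketch of the idea (the helper names variance and variance_reduction are mine, not part of any library), the criterion can be written in a few lines of Python:

import numpy as np

def variance(values):
    # Average squared deviation from the mean (zero for a perfectly homogeneous node).
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

def variance_reduction(parent, left, right):
    # Parent variance minus the size-weighted average of the child variances.
    n = len(parent)
    weighted_children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted_children

The split that gives the largest variance reduction is the one the tree keeps.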


Real-World Example:


Let's look at a simple real-world example involving fruit sizes and how many kids like them. We'll use variance reduction to predict how many kids will like a fruit based on its size.


Assume that you have some information regarding fruits, their size, and the number of kids who liked a particular fruit:

Step 1: Calculate Variance for the Whole Dataset.


First, we compute the average number of kids that liked the fruit:


Now we compute the overall variance of the dataset. The variance gives a measure of the deviation of "Kids Liked" from the mean.


After computation, the variance of the entire dataset turns out to be 52.17.


Step 2: Split the Data Based on Fruit Size.


Let's split the fruits into two groups based on their size:


  • Group 1: The size of the fruits is less than or equal to 10 cm.

  • Group 2: The size of the fruits is more than 10 cm.


Group 1 (Size ≤ 10 cm):

Average Kids Liked for Group 1:

Variance for Group 1:


Group 2 (Size > 10 cm):

Average Kids Liked for Group 2:


Variance for Group 2:



Step 3: Calculate Variance Before and After the Split:


  • Variance before the split: 52.17 (the variance of the entire dataset).


  • Variance after the split:

    • Group 1 variance = 14.46

    • Group 2 variance = 6.25


By splitting the fruits based on their size, we have reduced the variance within each group: the number of kids who like the fruits is now much closer to each group's mean. In practice, the tree compares the parent variance with the size-weighted average of the child variances and chooses the split with the largest reduction.


Result Overview: We reduced the variance by segmenting data according to fruit size. Variance reduction resulted in more homogeneous groups and better predictions of the number of kids who will like a fruit.
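To make the three steps concrete, here is the same calculation run in code, reusing the variance and variance_reduction helpers from the sketch above. Since the original fruit table isn't reproduced here, the sizes and counts below are made up for illustration, so the printed numbers will not match 52.17, 14.46, and 6.25:

# Made-up data: fruit sizes in cm and the number of kids who liked each fruit.
sizes      = [4, 6, 8, 9, 12, 15, 18]
kids_liked = [20, 18, 17, 16, 9, 8, 6]

# Step 2: split on size <= 10 cm.
group1 = [k for s, k in zip(sizes, kids_liked) if s <= 10]
group2 = [k for s, k in zip(sizes, kids_liked) if s > 10]

# Step 3: compare the variance before and after the split.
print("Variance before split:", round(variance(kids_liked), 2))
print("Group 1 variance:     ", round(variance(group1), 2))
print("Group 2 variance:     ", round(variance(group2), 2))
print("Variance reduction:   ", round(variance_reduction(kids_liked, group1, group2), 2))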


Technique 2: Information Gain in Decision Trees


When the target variable is categorical, reduction in variance isn't effective. We instead use Information Gain to split the nodes. This technique is based on the concept of entropy and helps decide the best feature for splitting the data when we are predicting categorical outcomes. Information Gain measures how much uncertainty is reduced after the data is split, and is calculated as

Information Gain = Entropy(parent) − weighted average Entropy(children)

Entropy is a measure of the impurity of a node. The lower the entropy, the purer the node, meaning the data points in that node are more similar. If a node is completely homogeneous, i.e., all the data points belong to the same class, its entropy is zero. Because Information Gain is obtained by subtracting the weighted entropy of the child nodes from the entropy of the parent, the purer the resulting nodes, the higher the Information Gain. The entropy of a node is calculated as

Entropy = −Σ pᵢ log₂(pᵢ)

where pᵢ is the proportion of data points in the node that belong to class i.
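As a minimal sketch (again, the helper names entropy and information_gain are mine), both quantities can be computed like this:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions in a node.
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, children):
    # Parent entropy minus the size-weighted average entropy of the child nodes.
    n = len(parent)
    weighted = sum((len(child) / n) * entropy(child) for child in children)
    return entropy(parent) - weighted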

Real-World Example:


Suppose you want to predict if a person will go for a run depending on the weather conditions on a particular day. The target variable will be whether the person decided to run ("Yes") or not ("No").


Here's the data for the five days: two sunny days (the person ran on both), two rainy days (no run on either), and one cloudy day (the person ran).


Step 1: Determine the Uncertainty (Entropy) of the Whole Dataset



First, we calculate the overall uncertainty, or entropy, of the data. Out of the 5 days:


  • On 3 days, the person went for a run ("Yes")

  • On 2 days, the person did not go for a run ("No")


Since the target variable contains a mix of "Yes" and "No", there is some uncertainty in predicting whether the person will run. The entropy for the whole dataset is approximately 0.971, meaning it's not entirely clear from the weather alone whether the person will decide to run.


Step 2: Split the Data by Weather Conditions (Sunny, Rainy, Cloudy)


Now, we split the data into groups based on weather conditions: sunny, rainy, and cloudy.


Group 1: Sunny Weather


This group contains the two sunny days, and the person went for a run on both of them. There is no uncertainty here: whenever the weather is sunny, the person goes for a run. The entropy for this group is 0, meaning the outcome is certain.


Group 2: Rainy Weather

In this group, both days had rain, and the person did not go for a run on either of them. Once again, there is no uncertainty within this group. The entropy is 0, meaning we can be confident that on rainy days the person won't run.


Group 3: Cloudy Weather

In this group, there was only one day with cloudy weather, and the person went for a run that day. Since there's only one data point, there is no uncertainty. The entropy for this group is also 0.


Step 3: Calculate Information Gain


Now, let's calculate the Information Gain. Information Gain tells us how much uncertainty is reduced by splitting the data into these weather condition groups.


Information Gain can be computed by subtracting the entropy after the split from the entropy before the split:


  • Entropy before the split: The entropy for the entire dataset is approximately 0.971.

  • Entropy after the split: 0, since there is no uncertainty within any of the weather groups (sunny, rainy, cloudy).


    Therefore, Information Gain = 0.971 − 0 = 0.971.

Results Overview: Once the data is split by weather, we can predict quite well whether the person will run: sunny means they'll run, rainy means they won't, and on the single cloudy day they ran.
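Reusing the entropy and information_gain helpers from the sketch above, the whole weather example fits in a few lines; the five rows below are taken straight from the description in Steps 1 and 2:

# The five days: (weather, did the person run?)
days = [
    ("Sunny",  "Yes"), ("Sunny",  "Yes"),
    ("Rainy",  "No"),  ("Rainy",  "No"),
    ("Cloudy", "Yes"),
]

labels = [ran for _, ran in days]

# Group the run/no-run labels by weather condition.
groups = {}
for weather, ran in days:
    groups.setdefault(weather, []).append(ran)

print("Entropy before split:", round(entropy(labels), 3))                                  # ~0.971
print("Information Gain:    ", round(information_gain(labels, list(groups.values())), 3))  # ~0.971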


Conclusion:

In this blog, we explored two essential techniques for splitting nodes in decision trees: Reduction in Variance and Information Gain. These methods help create more accurate predictions by dividing data into more homogeneous groups. While these two methods are incredibly effective, there are other splitting criteria like Gini Impurity and Chi-Square that further refine decision tree models.


Stay tuned for my next blog, where we will go deeper into those two methods and see how they can enhance your decision tree models. Keep following, and join me on this journey into the mighty world of decision trees and machine learning!



