XGBoost (eXtreme Gradient Boosting) is a powerful supervised machine learning algorithm that can achieve high accuracy when tuned through its wide range of parameters. It is based on parallel tree boosting, which predicts the target by combining the results of many weak models. The XGBoost library implements the gradient boosting decision tree algorithm. Let us explore it with an example.
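To build intuition for what boosting does, here is a minimal sketch using plain scikit-learn decision trees: each new tree is fit to the residual errors of the ensemble built so far, and the final prediction is the sum of all the small corrections. This illustrates the principle only; XGBoost adds regularization, smarter split finding and parallelism on top.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: learn y = x^2 from noisy samples
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

learning_rate = 0.1
pred = np.zeros_like(y_toy)              # the ensemble starts by predicting zero
for _ in range(50):
    residual = y_toy - pred              # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    pred += learning_rate * tree.predict(X_toy)   # add the new weak learner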
Here we are using the Heart Disease UCI dataset from Kaggle. We will try to predict the likelihood of heart disease and find out which features matter most for that prediction.
The dataset can be found on Kaggle under "Heart Disease UCI".
First, let us import pandas to read the data using the following line of code.
import pandas as pd
Now let us read the data into a DataFrame.
data = pd.read_csv('../input/heart-disease-uci/heart.csv')
data
This will read the data into a DataFrame called data and display the following output.
Now let us look at the info of the data to explore it further, using the following code.
data.info()
This will give the following information about the data: the data types, column names, non-null counts and memory usage.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
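As a quick sanity check, the absence of missing values can also be confirmed programmatically:
# Count missing values per column; every entry should be 0 for this dataset
data.isnull().sum()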
Let us see another form of summary information through the following code.
data.describe()
This gives a statistical description of the DataFrame. The description includes the count, mean, std, min, 25%, 50%, 75% and max of every column, which further helps us understand the data. Now let us import the numpy, xgboost and sklearn.metrics libraries; we need mean_squared_error to evaluate the regression.
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import numpy as np
Next we need to assign the X and Y values. Here we separate the target variable from the rest of the variables, using .iloc to subset the data.
X, Y = data.iloc[:,:-1],data.iloc[:,-1]
Now we need to convert the dataset into DMatrix, an optimized data structure that XGBoost supports internally and that gives the library its acclaimed performance and efficiency gains.
data_dmatrix = xgb.DMatrix(data=X,label=Y)
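The DMatrix is not strictly needed for the scikit-learn-style training below, but it is what XGBoost's native API consumes. As an aside, here is a minimal sketch of how it could drive XGBoost's built-in cross-validation via xgb.cv (the parameter values are illustrative, not tuned):
params = {'objective': 'reg:linear', 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
# 3-fold cross-validation with XGBoost's native API, consuming the DMatrix
cv_results = xgb.cv(params=params, dtrain=data_dmatrix, nfold=3,
                    num_boost_round=50, metrics='rmse', as_pandas=True,
                    seed=123)
print(cv_results['test-rmse-mean'].tail())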
Now, we will create the train and test sets for evaluating the results, using the train_test_split function from sklearn's model_selection module with test_size equal to 20% of the data. To keep the results reproducible, a random_state is also assigned.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2, random_state=123)
The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library, with the hyperparameters passed as arguments: colsample_bytree (the fraction of columns sampled per tree), learning_rate (shrinkage applied to each tree's contribution), max_depth (the maximum depth of each tree), alpha (L1 regularization on leaf weights) and n_estimators (the number of boosting rounds). Note that the 'reg:linear' objective used here has since been deprecated in favour of 'reg:squarederror'.
xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3,
                          learning_rate=0.1, max_depth=5, alpha=10,
                          n_estimators=10)
Now let us fit the regressor to the training set using the .fit() method.
xg_reg.fit(X_train,Y_train)
The output after fitting shows the model's hyperparameters, as follows.
XGBRegressor(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=5,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=10, n_jobs=0, num_parallel_tree=1,
             objective='reg:linear', random_state=0, reg_alpha=10,
             reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
Let us make predictions on the test set using the .predict() method. Since the target is binary (0/1), the regressor's outputs fall roughly between 0 and 1 and can be read as scores.
preds = xg_reg.predict(X_test)
preds
The output is as follows.
array([0.62361383, 0.436409 , 0.4638252 , 0.4234442 , 0.5576368 ,
0.36139676, 0.6453048 , 0.48385164, 0.61654013, 0.53615165,
0.5271714 , 0.5342451 , 0.36847046, 0.36139676, 0.5126163 ,
0.42248273, 0.47246233, 0.52830946, 0.4810629 , 0.5204763 ,
0.590331 , 0.47579026, 0.6621046 , 0.39579174, 0.49821952,
0.53383183, 0.4865605 , 0.5409056 , 0.43799824, 0.53313845,
0.4234442 , 0.5955741 , 0.5409056 , 0.45628968, 0.5528887 ,
0.60049963, 0.61878484, 0.6078624 , 0.62996906, 0.59981364,
0.64041364, 0.55104494, 0.5254018 , 0.62684685, 0.55517054,
0.48204178, 0.36139676, 0.44480985, 0.60275817, 0.6277096 ,
0.44678292, 0.42796794, 0.6621046 , 0.5477844 , 0.49884164,
0.62160766, 0.55104494, 0.59651303, 0.5327918 , 0.55396104,
0.5179344 ], dtype=float32)
Let us calculate the RMSE using the mean_squared_error function from sklearn's metrics module. The output will be as follows.
rmse = np.sqrt(mean_squared_error(Y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 0.449886
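RMSE is simply the square root of the mean of the squared differences between the true labels and the predictions, so the same number can be reproduced by hand with numpy:
# Equivalent manual computation: sqrt(mean((y_true - y_pred)^2))
manual_rmse = np.sqrt(np.mean((Y_test.values - preds) ** 2))
print(manual_rmse)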
We can see that the RMSE for the heart disease predictions came out to be about 0.45, so our model predicts with this error. To analyse which feature is the most important factor we need to classify, and XGBClassifier() is used for that. Let us import the required libraries for this task.
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
Let us instantiate the classifier using the following line.
model = XGBClassifier()
Here we fit the classifier on the whole dataset.
model.fit(X, Y)
This gives the following output with the default hyperparameters.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
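Note that the classifier above was fit on the full dataset, which is fine for inspecting feature importance but would give an over-optimistic accuracy estimate. As a sketch, its accuracy could be estimated honestly by reusing the train/test split created earlier, together with accuracy_score from sklearn.metrics:
from sklearn.metrics import accuracy_score

# Fit a fresh classifier on the training split only, then score on held-out data
clf = XGBClassifier()
clf.fit(X_train, Y_train)
print("Accuracy: %.3f" % accuracy_score(Y_test, clf.predict(X_test)))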
Here we plot the feature importance of the fitted model using this code.
plot_importance(model)
pyplot.show()
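The chart ranks features by the default 'weight' importance, i.e. how many times each feature is used to split the data across all trees. If a numeric view is preferred, the same scores can be pulled straight out of the underlying booster (the values shown in the comment are illustrative):
# Importance scores per feature, e.g. {'chol': 90, 'age': 80, ...} (illustrative)
scores = model.get_booster().get_score(importance_type='weight')
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))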
The output shows the features in order of importance.
Conclusion
We have built an XGBoost model for predicting the likelihood of heart disease, and the model has an RMSE of about 0.45. We have also found that, according to the feature importance plot, the most important factor in detecting heart disease is cholesterol.