Gestational diabetes is a condition many women develop during pregnancy. This model is built to understand its preconditions and to help take precautions at an early stage if symptoms are present from the start.
To do this, GDM data is taken and exploratory data analysis is performed.
In Python:
import pandas as pd
import numpy as np
df = pd.read_excel('../input/gdgdgdm/GDM.xlsx')
EDA:
Under EDA, the data is analyzed to decide which features to use for detecting gestational diabetes.
Looking at the data, we first have to figure out which columns were recorded before delivery and which after delivery.
The after-delivery data is dropped, as it does not contribute anything to detecting gestational diabetes.
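A minimal sketch of this drop, assuming hypothetical names for the after-delivery columns (the real names come from the Excel sheet):
post_delivery_cols = ['Delivery weight', 'Delivery mode']  # hypothetical examples; replace with the actual after-delivery columns
df = df.drop(columns=post_delivery_cols, errors='ignore')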
Then we see there are two types of tests, GCT and OGTT, performed on the patients. Glucose levels at 0 h, 1 h and 2 h are recorded for both tests, and gestational diabetes is indicated wherever a value is 7.8 or higher.
For 87 out of 600 patients, glucose test values are not available because those patients were either transferred to another hospital or miscarried; these rows are deleted from the main data.
This is implemented by:
conditions = [(df['1h glucose']>=7.8) | (df['OGTT 0h value']>=7.8) | (df['OGTT 1h value']>=7.8) | (df['OGTT 2h value']>=7.8),
              (df['1h glucose']<7.8) | (df['OGTT 0h value']<7.8) | (df['OGTT 1h value']<7.8) | (df['OGTT 2h value']<7.8)]
choices = ['1', '0']  # conditions are checked in order, so any value >= 7.8 labels the row as GDM
# Rows with no glucose results at all match neither condition and fall through to the default label
df['GDM'] = np.select(conditions, choices, default='nan')
df.drop(df[df['GDM']=='nan'].index, inplace=True)
Handling categorical features
Next, categorical features such as Smoking, Ethnicity, Previous GDM, Age>30, BMI>30, Screening method and Vit D are handled. These are converted to numeric values using ordinal encoding.
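One way to do this conversion is sketched below (the column names are assumptions; adjust them to the actual spreadsheet headers):
# Ordinal encoding with pandas: each category label is mapped to an integer code
cat_cols = ['Ethnicity', 'Screening method', 'Previous GDM']  # hypothetical names, extend with the remaining categorical columns
for col in cat_cols:
    df[col] = df[col].astype('category').cat.codes  # missing values become -1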
Handling NULL Values
After this, we check how many features have null values. We use the seaborn library to visualize this:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
This heatmap gives an idea of where the NULL values are. The NULL values that need to be handled are in 'V1 U creatinine ', 'V1 U protein ', 'V1 CRP', 'V1 ALT ', 'V1 Creatinine', 'V1 Platelet ', 'V1 Hb', 'WCC' and 'V1 HbA1c (mmol/mol)'.
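To complement the heatmap, the exact counts of missing values per column can be printed (the column names are copied from the list above, including their trailing spaces):
null_cols = ['V1 U creatinine ', 'V1 U protein ', 'V1 CRP', 'V1 ALT ', 'V1 Creatinine',
             'V1 Platelet ', 'V1 Hb', 'WCC', 'V1 HbA1c (mmol/mol)']
print(df[null_cols].isnull().sum())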
These NULL values are handled with the KNN and MICE algorithms, since common imputation methods such as mean or median fill are not very helpful for medical data.
The glucose levels at 0 h, 1 h and 2 h for GCT and OGTT will always contain NULL values, because each patient goes through one test or the other; these columns are only used to derive GDM, our target feature.
Rows where GDM itself is NULL are dropped, as they do not lead to any conclusion.
Next we try to find the relation of each feature with the probability of having gestational diabetes or not.
We do this by plotting with seaborn.
First we plot the counts of yes and no for gestational diabetes, our target:
sns.set_style('whitegrid')
sns.countplot(x='GDM',data=df)
The graph shows that out of the 516 remaining rows, 433 do not have gestational diabetes while 83 do.
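The same counts can be read directly from the target column, as a quick check:
print(df['GDM'].value_counts())  # should show roughly 433 for '0' and 83 for '1'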
If we draw this plot split by Age>30, the graph looks as below:
sns.set_style('whitegrid')
sns.countplot(x='GDM',hue='Age >30 10',data=df,palette='RdBu_r')
Because the dataset contains mostly women older than 30, it does not show a clear relation between gestational diabetes and age: the count for age above 30 is higher both for women with gestational diabetes and for women without it.
If we draw this plot split by Smoking, the graph looks as below:
sns.set_style('whitegrid')
sns.countplot(x='GDM',hue='Smoking 123',data=df,palette='RdBu_r')
Here category 3 means never smoked, while categories 1 and 2 cover women who currently smoke and those who quit after becoming pregnant.
Since never-smokers make up a larger share of the women who did not develop gestational diabetes, we can say non-smoking women appear less prone to gestational diabetes than smokers.
If we draw this plot split by Overweight, the graph looks as below:
sns.set_style('whitegrid')
sns.countplot(x='GDM',hue='Overweight 123',data=df,palette='RdBu_r')
The graph suggests that women who are not overweight are less likely to develop gestational diabetes.
If we draw this plot split by LowBP, the graph looks as below:
df['LowBP']=np.where(df['diastolic BP (mmHg) V1']<=60, 1,0)
sns.set_style('whitegrid')
sns.countplot(x='GDM',hue='LowBP',data=df,palette='RdBu_r')
The graph shows that women with low diastolic BP are less likely to develop gestational diabetes.
If we draw this plot split by HbA1c, the graph looks as below:
df['HbA1c']=np.where(df['V1 HbA1c (mmol/mol)']>=30, 1,0)
sns.set_style('whitegrid')
sns.countplot(x='GDM',hue='HbA1c',data=df,palette='RdBu_r')
The graph shows that women with low HbA1c are less likely to develop gestational diabetes.
If we draw this plot split by BMI, the graph looks as below:
df['BMI']=np.where(df['BMI (kg/m2) V1']>=25, 1,0)
sns.set_style('whitegrid')
sns.countplot(x='GDM',hue='BMI',data=df,palette='RdBu_r')
The graph shows that women with lower BMI are less likely to develop gestational diabetes.
Modelling
We will try several classification algorithms on this data: Random Forest, XGBoost, SVM and Decision Tree.
XGBoost can handle missing values natively, without any imputation, so we try it first.
import xgboost as xgb
# X holds the feature columns and y the GDM target prepared during the EDA steps above
xgb_classifier = xgb.XGBClassifier()
print(xgb_classifier)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# Fitting the model with train data.
xgb_classifier.fit(X_train, y_train)
# Predicting the test data
y_pred = xgb_classifier.predict(X_test)
print(y_pred)
# Evaluate predictions
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 80%
We try to plot important features by using:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 10]  # set the figure size before plotting
xgb.plot_importance(xgb_classifier)
plt.show()
We then try Random Forest for modelling, using two types of imputers.
Random Forest with MICE Imputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
imp = IterativeImputer(estimator=lr,missing_values=np.nan, max_iter=20)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
clf = RandomForestClassifier(n_estimators = 170)
#Mice Imputer
X=imp.fit_transform(X)
#Splitting test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Train the Random Forest classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Accuracy
round(accuracy_score(y_true=y_test, y_pred=y_pred), 3)
Accuracy: 81.7%
Random Forest with KNN Imputer
from sklearn.impute import KNNImputer
knn = KNNImputer(n_neighbors=5, add_indicator=True)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
clf = RandomForestClassifier(n_estimators = 170)
# KNN Imputer (assumes X is the original feature matrix with missing values, not the already MICE-imputed one)
X = knn.fit_transform(X)
#Splitting test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Train the Random Forest classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Accuracy
round(accuracy_score(y_true=y_test, y_pred=y_pred), 3)
Accuracy: 81.7%
Decision Tree with MICE
# Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
# Create a Decision Tree classifier object
clf = DecisionTreeClassifier()
# MICE Imputer (assumes X is the original feature matrix with missing values)
X = imp.fit_transform(X)
#Splitting test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Train the Decision Tree classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6923076923076923
Decision Tree with KNN Imputer
# Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
# Create a Decision Tree classifier object
clf = DecisionTreeClassifier()
#knn Imputer
X=knn.fit_transform(X)
#Splitting test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Train the Decision Tree classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.6730769230769231
SVM with KNN Imputer
#Import svm model
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#KNN Imputer
X=knn.fit_transform(X)
#Splitting test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#Train the model using the training sets
clf.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 81.7461539
SVM with MICE Imputer
#Import svm model
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#MICE Imputer
X=imp.fit_transform(X)
#Splitting test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#Train the model using the training sets
clf.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 81.71538461539
Conclusion
The best accuracy is given by Random Forest with either the KNN or the MICE imputer, so we will take this classifier and try to tune it for a better accuracy percentage; a possible tuning sketch is shown below. We tried F1 scores as well, but since predicting true positives is most important here, the accuracy score is taken into consideration.
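As a sketch of this next step (the parameter grid, cross-validation setting and scoring choice here are assumptions, not part of the original notebook):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'n_estimators': [100, 170, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=123),
                    param_grid, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)  # X_train/y_train from the imputed split above
print(grid.best_params_)
print('CV accuracy:', grid.best_score_)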
XGBoost could also be optimized with more hyperparameter tuning to achieve a better percentage.
Thanks for reading!