from that_ml_library.data_preprocess import *
from that_ml_library.chart_plotting import *
from that_ml_library.ml_helpers import *
End-to-end tutorial - Classification
In this tutorial, we will use the Pima Indian dataset https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Load data
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/anhquan0412/dataset/main/diabetes.csv')
print(df.shape)
display(df.sample(7))
(768, 9)
|     | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|-----|---|---|---|---|---|---|---|---|---|
| 92  | 7 | 81 | 78 | 40 | 48 | 46.7 | 0.261 | 42 | 0 |
| 54  | 7 | 150 | 66 | 42 | 342 | 34.7 | 0.718 | 42 | 0 |
| 288 | 4 | 96 | 56 | 17 | 49 | 20.8 | 0.340 | 26 | 0 |
| 90  | 1 | 80 | 55 | 0 | 0 | 19.1 | 0.258 | 21 | 0 |
| 98  | 6 | 93 | 50 | 30 | 64 | 28.7 | 0.356 | 23 | 0 |
| 397 | 0 | 131 | 66 | 40 | 0 | 34.3 | 0.196 | 22 | 1 |
| 574 | 1 | 143 | 86 | 30 | 330 | 30.1 | 0.892 | 23 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Train test split and preprocessing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:-1], df.Outcome,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df.Outcome)
X_train.describe()
|       | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|-------|---|---|---|---|---|---|---|---|
| count | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 |
| mean  | 3.819218 | 120.908795 | 69.442997 | 20.776873 | 78.666124 | 31.973290 | 0.477428 | 33.366450 |
| std   | 3.314148 | 31.561093 | 18.402581 | 15.856433 | 107.736572 | 7.861364 | 0.330300 | 11.833438 |
| min   | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.084000 | 21.000000 |
| 25%   | 1.000000 | 99.000000 | 62.500000 | 0.000000 | 0.000000 | 27.500000 | 0.245000 | 24.000000 |
| 50%   | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 40.500000 | 32.300000 | 0.382500 | 29.000000 |
| 75%   | 6.000000 | 140.000000 | 80.000000 | 32.000000 | 130.000000 | 36.500000 | 0.639250 | 41.000000 |
| max   | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 744.000000 | 67.100000 | 2.329000 | 81.000000 |
For this dataset, we only need to scale the numerical features, as there are no categorical features to process. Note that scaling numerical data matters for linear models such as linear or logistic regression (it helps the solver converge and keeps the coefficients on a comparable scale), while tree-based models are scale-invariant, so scaling neither helps nor hurts their performance. For simplicity, we will scale the numerical features for all the models we run below.
num_cols = df.columns[:-1].tolist()
print(num_cols)
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
For BloodPressure and SkinThickness, a value of 0 does not make physiological sense, so we will treat these zeros as missing values and fill them in using the 'median' strategy.
miss_cols = ['BloodPressure','SkinThickness']
X_train_processed, X_test_processed = preprocessing_general(X_train, X_test,
                                                            miss_cols=miss_cols,
                                                            missing_vals=0, strategies='median',
                                                            num_cols=num_cols,
                                                            scale_methods='minmax')
X_train_processed.describe()
|       | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|-------|---|---|---|---|---|---|---|---|
| count | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 | 614.000000 |
| mean  | 0.224660 | 0.607582 | 0.569205 | 0.209867 | 0.105734 | 0.476502 | 0.175246 | 0.206107 |
| std   | 0.194950 | 0.158598 | 0.150841 | 0.160166 | 0.144807 | 0.117159 | 0.147127 | 0.197224 |
| min   | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25%   | 0.058824 | 0.497487 | 0.512295 | 0.000000 | 0.000000 | 0.409836 | 0.071715 | 0.050000 |
| 50%   | 0.176471 | 0.587940 | 0.590164 | 0.232323 | 0.054435 | 0.481371 | 0.132962 | 0.133333 |
| 75%   | 0.352941 | 0.703518 | 0.655738 | 0.323232 | 0.174731 | 0.543964 | 0.247327 | 0.333333 |
| max   | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
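For readers who want to reproduce this step without the library, here is a minimal sketch of what the preprocessing call amounts to, using only scikit-learn. `preprocessing_general` is this library's own helper, so treat the sketch as an assumed equivalent: the exact order of operations inside the helper (and hence the exact numbers in the table above) may differ.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

X_tr, X_te = X_train.copy(), X_test.copy()

# Treat the 0 placeholders in BloodPressure/SkinThickness as missing and impute with the median
imputer = SimpleImputer(missing_values=0, strategy='median')
X_tr[miss_cols] = imputer.fit_transform(X_tr[miss_cols])
X_te[miss_cols] = imputer.transform(X_te[miss_cols])

# Fit the min-max scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler()
X_tr[num_cols] = scaler.fit_transform(X_tr[num_cols])
X_te[num_cols] = scaler.transform(X_te[num_cols])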
Machine Learning Model Experiments
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(5)
Sklearn Logistic Regression
df_train = X_train_processed.copy()
df_train['has_diabetes'] = y_train
get_vif(df_train, plot_corr=True)
const 40.937975
Pregnancies 1.441405
Glucose 1.573457
BloodPressure 1.152426
SkinThickness 1.529585
Insulin 1.538192
BMI 1.343058
DiabetesPedigreeFunction 1.097619
Age 1.577323
has_diabetes 1.451467
dtype: float64
Looking at the VIF values, there is not much collinearity in this dataset. As a rule of thumb, a VIF above 5 (or, by a looser convention, above 10) indicates high collinearity.
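If you want to compute these numbers without `get_vif`, a minimal sketch with statsmodels looks like this (assuming the helper computes standard VIFs on a design matrix with an intercept, which is an assumption about its internals):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(df_train)  # VIF is computed on a design matrix that includes an intercept
vifs = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                 index=X_vif.columns)
print(vifs)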
run_logistic_regression(X_train_processed, y_train,
                        multi_class='multinomial',
                        solver='newton-cg', max_iter=10000)
|   | Features | Coefficients | Coefficients Exp |
|---|----------|--------------|------------------|
| 0 | Intercept | -4.233888 | 0.014496 |
| 1 | Pregnancies | 0.976679 | 2.655623 |
| 2 | Glucose | 3.679170 | 39.613487 |
| 3 | BloodPressure | -0.679500 | 0.506870 |
| 4 | SkinThickness | 0.217158 | 1.242540 |
| 5 | Insulin | -0.473694 | 0.622698 |
| 6 | BMI | 3.106698 | 22.347132 |
| 7 | DiabetesPedigreeFunction | 0.881984 | 2.415688 |
| 8 | Age | 0.461467 | 1.586400 |
----------------------------------------------------------------------------------------------------
Log loss: 0.46620685663024525
----------------------------------------------------------------------------------------------------
precision recall f1-score support
0 0.81 0.90 0.85 400
1 0.76 0.60 0.67 214
accuracy 0.79 614
macro avg 0.78 0.75 0.76 614
weighted avg 0.79 0.79 0.79 614
LogisticRegression(max_iter=10000, multi_class='multinomial', penalty=None, random_state=0, solver='newton-cg')
Interpretation of coefficients in logistic regression: for a coefficient β1 on a variable X1, a one-unit increase in X1 multiplies the odds of the positive class, p(X)/(1 − p(X)), by e^β1.
In this case, the Glucose coefficient tells us that the odds of having diabetes are multiplied by roughly 40 when Glucose increases by 1. Keep in mind that Glucose has been min-max scaled, so a one-unit increase here means moving across the entire observed range of the feature.
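As a quick numeric check, exponentiating the reported Glucose coefficient recovers the ~40x odds multiplier shown in the 'Coefficients Exp' column:

print(np.exp(3.679170))  # ~39.61, matching the Coefficients Exp value for Glucose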
Statsmodels
run_multinomial_statmodel(X_train_processed, y_train, add_constant=True)
Optimization terminated successfully.
Current function value: 0.466207
Iterations 6
MNLogit Regression Results
==============================================================================
Dep. Variable: Outcome No. Observations: 614
Model: MNLogit Df Residuals: 605
Method: MLE Df Model: 8
Date: Mon, 19 Feb 2024 Pseudo R-squ.: 0.2789
Time: 11:43:30 Log-Likelihood: -286.25
converged: True LL-Null: -396.97
Covariance Type: nonrobust LLR p-value: 1.909e-43
============================================================================================
Outcome=1 coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------
const -8.4678 0.805 -10.523 0.000 -10.045 -6.891
Pregnancies 1.9534 0.624 3.129 0.002 0.730 3.177
Glucose 7.3584 0.855 8.609 0.000 5.683 9.034
BloodPressure -1.3590 0.733 -1.855 0.064 -2.795 0.077
SkinThickness 0.4343 0.788 0.551 0.581 -1.110 1.978
Insulin -0.9474 0.856 -1.107 0.268 -2.625 0.730
BMI 6.2135 1.150 5.405 0.000 3.960 8.467
DiabetesPedigreeFunction 1.7640 0.755 2.335 0.020 0.283 3.245
Age 0.9229 0.616 1.499 0.134 -0.284 2.130
============================================================================================
----------------------------------------------------------------------------------------------------
Log loss: 0.46620685662395556
----------------------------------------------------------------------------------------------------
precision recall f1-score support
0 0.81 0.90 0.85 400
1 0.76 0.60 0.67 214
accuracy 0.79 614
macro avg 0.78 0.75 0.76 614
weighted avg 0.79 0.79 0.79 614
<statsmodels.discrete.discrete_model.MNLogit>
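For reference, the same fit can be obtained from statsmodels directly; `run_multinomial_statmodel` is this library's convenience wrapper, so the call below is an assumed equivalent rather than its actual implementation:

import statsmodels.api as sm

X_sm = sm.add_constant(X_train_processed)   # add_constant=True above suggests an intercept column
mnlogit_res = sm.MNLogit(y_train, X_sm).fit()
print(mnlogit_res.summary())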
Decision Tree
Validation Curve
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy', random_state=42, class_weight=None)
cv = StratifiedKFold(5)
plot_validation_curve(dt, 'Val Curve - Decision Tree - PID', X_train_processed, y_train,
                      cv=cv, param_range=np.arange(1,20,1), param_name='max_depth', scoring='f1_macro',
                      n_jobs=-1, figsize=(6,4))
This is the validation curve for the hyperparameter max_depth, which can be used to prune a decision tree.
For max_depth, we can observe overfitting when the value is too high: the tree is allowed to keep splitting until it classifies nearly every training point correctly, and it then fails to generalize. On the other hand, a small depth is already enough to reach a good CV score (from depth 1 to 5). This is a telling sign that some features carry a much stronger signal for predicting diabetes than others.
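For readers without the plotting helper, the same max_depth sweep can be computed with scikit-learn's `validation_curve`; this is only a sketch of the underlying computation, not necessarily what `plot_validation_curve` does internally:

from sklearn.model_selection import validation_curve

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(criterion='entropy', random_state=42),
    X_train_processed, y_train,
    param_name='max_depth', param_range=np.arange(1, 20),
    cv=cv, scoring='f1_macro', n_jobs=-1)
print(train_scores.mean(axis=1))  # mean train F1 for each depth
print(val_scores.mean(axis=1))    # mean CV F1 for each depth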
Learning curve
np.linspace(0.1, 1, 20)
array([0.1 , 0.14736842, 0.19473684, 0.24210526, 0.28947368,
0.33684211, 0.38421053, 0.43157895, 0.47894737, 0.52631579,
0.57368421, 0.62105263, 0.66842105, 0.71578947, 0.76315789,
0.81052632, 0.85789474, 0.90526316, 0.95263158, 1. ])
dt = DecisionTreeClassifier(criterion='entropy', random_state=42, class_weight=None, min_samples_leaf=27)
plot_learning_curve(dt, 'DT Learning Curve', X_train_processed, y_train, cv=cv, scoring='f1_macro',
                    train_sizes=np.linspace(0.1,1,20))
Based on the learning curve for the best-tuned decision tree, more data helps the model learn: the training score appears to converge while the CV score still increases slowly. However, since the gap between the CV score and the train score is closing, adding a lot more data might not bring much additional benefit. As for the scalability of the model, the relationship between training-set size and fit time is roughly linear, which makes sense: more data allows the tree to grow more, which takes more time to fit and predict.
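The same learning-curve data, including the fit times used for the scalability discussion, can be pulled from scikit-learn directly; again, this is a sketch of the computation rather than the helper's actual internals:

from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores, fit_times, _ = learning_curve(
    dt, X_train_processed, y_train,
    cv=cv, scoring='f1_macro',
    train_sizes=np.linspace(0.1, 1, 20),
    return_times=True, n_jobs=-1)
print(val_scores.mean(axis=1))  # CV F1 vs. training-set size
print(fit_times.mean(axis=1))   # fit time vs. training-set size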
Tree plotting
Using Dtreeviz
dt = DecisionTreeClassifier(criterion='entropy', random_state=42, class_weight=None, max_depth=3)
dt.fit(X_train_processed, y_train)
plot_tree_dtreeviz(dt, X_train_processed, y_train, target_name='has_diabetes', class_names=['no','yes'],
                   fancy=True, scale=1.0)
/home/quan/anaconda3/envs/dev/lib/python3.10/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
Glucose alone can push the CV F1 score to 0.704, which is not far from the CV score of the best-tuned model. With domain knowledge, we can confirm that high glucose is the number-one indicator of diabetes, because diabetes patients are unable to use insulin effectively to break down glucose.
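One quick way to sanity-check the "Glucose alone" claim is to cross-validate a depth-1 tree on the Glucose column only (the 0.704 figure is the tutorial's own number; your exact score may differ slightly):

from sklearn.model_selection import cross_val_score

stump = DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42)
scores = cross_val_score(stump, X_train_processed[['Glucose']], y_train,
                         cv=cv, scoring='f1_macro')
print(scores.mean())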
Using Sklearn tree plot
dt = DecisionTreeClassifier(criterion='entropy', random_state=42, class_weight=None, max_depth=3)
dt.fit(X_train_processed, y_train)
plot_classification_tree_sklearn(dt, feature_names=X_train_processed.columns.values,
                                 class_names=['no','yes'],
                                 rotate=True, fname='tree_depth_3_PID')
![](images/tree_depth_3_PID.png)
Model finetuning process
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'n_estimators': np.arange(2,80),
    'min_samples_leaf': np.arange(1,100),
    'max_features': [0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
}
search_cv = tune_sklearn_model('RandomForest', param_grid, X_train_processed, y_train,
                               is_regression=False,
                               custom_cv=5,
                               scoring='f1_macro',
                               random_cv_iter=100,
                               rank_show=7,
                               show_split_scores=True)
Fitting 5 folds for each of 100 candidates, totalling 500 fits
----------
Rank 1
Params: {'n_estimators': 4, 'min_samples_leaf': 23, 'max_features': 0.7}
Train scores: [0.77, 0.76, 0.78, 0.76, 0.78]
Mean train score: 0.768 +- 0.008
Test scores: [0.73, 0.77, 0.71, 0.7, 0.77]
Mean test score: 0.738 +- 0.029
----------
Rank 2
Params: {'n_estimators': 41, 'min_samples_leaf': 5, 'max_features': 0.5}
Train scores: [0.88, 0.87, 0.86, 0.88, 0.87]
Mean train score: 0.872 +- 0.009
Test scores: [0.71, 0.78, 0.71, 0.71, 0.77]
Mean test score: 0.737 +- 0.031
----------
Rank 3
Params: {'n_estimators': 31, 'min_samples_leaf': 10, 'max_features': 0.6}
Train scores: [0.83, 0.81, 0.82, 0.82, 0.81]
Mean train score: 0.819 +- 0.011
Test scores: [0.72, 0.77, 0.7, 0.69, 0.79]
Mean test score: 0.737 +- 0.040
----------
Rank 4
Params: {'n_estimators': 20, 'min_samples_leaf': 28, 'max_features': 0.5}
Train scores: [0.77, 0.76, 0.78, 0.76, 0.78]
Mean train score: 0.770 +- 0.008
Test scores: [0.75, 0.76, 0.71, 0.68, 0.78]
Mean test score: 0.735 +- 0.037
----------
Rank 5
Params: {'n_estimators': 18, 'min_samples_leaf': 94, 'max_features': 0.7}
Train scores: [0.73, 0.73, 0.74, 0.75, 0.72]
Mean train score: 0.735 +- 0.011
Test scores: [0.74, 0.76, 0.7, 0.71, 0.77]
Mean test score: 0.735 +- 0.026
----------
Rank 6
Params: {'n_estimators': 35, 'min_samples_leaf': 61, 'max_features': 0.8}
Train scores: [0.74, 0.74, 0.77, 0.75, 0.73]
Mean train score: 0.746 +- 0.011
Test scores: [0.8, 0.76, 0.69, 0.67, 0.76]
Mean test score: 0.734 +- 0.045
----------
Rank 7
Params: {'n_estimators': 35, 'min_samples_leaf': 8, 'max_features': 0.5}
Train scores: [0.84, 0.82, 0.83, 0.84, 0.82]
Mean train score: 0.832 +- 0.008
Test scores: [0.75, 0.76, 0.7, 0.71, 0.76]
Mean test score: 0.734 +- 0.027
----------
Default Params
Mean train score: 1.0 +- 0.0
Mean test score: 0.731 +- 0.04
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.2s finished
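For reference, this random-search step corresponds roughly to the plain scikit-learn code below; `tune_sklearn_model` is the library's wrapper, so its exact internals, random seeds, and therefore the exact ranking are not guaranteed to match:

from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier(random_state=42)        # seed chosen for illustration only
search = RandomizedSearchCV(rf, param_grid, n_iter=100, cv=5,
                            scoring='f1_macro', random_state=42,
                            n_jobs=-1, return_train_score=True)
search.fit(X_train_processed, y_train)
print(search.best_params_, search.best_score_)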
params_3D_heatmap(search_cv, 'n_estimators', 'min_samples_leaf', 'max_features', 'f1_macro')
Looking at the 3D heatmap of these three hyperparameters, we can exclude max_features of 1 and min_samples_leaf above 80. In fact, max_features of 0.8 and 0.9 show the most promising results, so we can now narrow our search space to this:
param_grid = {
    'n_estimators': np.arange(2,80),
    'min_samples_leaf': np.arange(1,85),
    'max_features': [0.8,0.9],
}
search_cv = tune_sklearn_model('RandomForest', param_grid, X_train_processed, y_train, custom_cv=5,
                               scoring='f1_macro',
                               random_cv_iter=200,
                               rank_show=2,
                               show_split_scores=True)
Fitting 5 folds for each of 200 candidates, totalling 1000 fits
----------
Rank 1
Params: {'n_estimators': 12, 'min_samples_leaf': 12, 'max_features': 0.8}
Train scores: [0.8, 0.8, 0.81, 0.82, 0.8]
Mean train score: 0.806 +- 0.010
Test scores: [0.74, 0.79, 0.7, 0.7, 0.79]
Mean test score: 0.743 +- 0.041
----------
Rank 2
Params: {'n_estimators': 33, 'min_samples_leaf': 23, 'max_features': 0.8}
Train scores: [0.78, 0.76, 0.77, 0.77, 0.76]
Mean train score: 0.769 +- 0.007
Test scores: [0.75, 0.77, 0.71, 0.69, 0.79]
Mean test score: 0.742 +- 0.038
----------
Default Params
Mean train score: 1.0 +- 0.0
Mean test score: 0.731 +- 0.04
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.2s finished
params_2D_heatmap(search_cv, 'n_estimators', 'min_samples_leaf', 'f1_macro',
                  figsize=(20,10), min_hm=0.7, max_hm=None)
We will now focus on the area near the bottom right of the graph for a grid search.
param_grid = {
    'n_estimators': np.arange(62,73),
    'min_samples_leaf': np.arange(20,26),
    'max_features': [0.9],
}
search_cv = tune_sklearn_model('RandomForest',
                               param_grid,
                               X_train_processed,
                               y_train,
                               scoring='f1_macro',
                               custom_cv=5,
                               rank_show=5)
Fitting 5 folds for each of 66 candidates, totalling 330 fits
----------
Rank 1
Params: {'max_features': 0.9, 'min_samples_leaf': 25, 'n_estimators': 72}
Train scores: [0.77, 0.75, 0.78, 0.76, 0.77]
Mean train score: 0.767 +- 0.009
Test scores: [0.74, 0.78, 0.7, 0.72, 0.78]
Mean test score: 0.745 +- 0.033
----------
Rank 1
Params: {'max_features': 0.9, 'min_samples_leaf': 25, 'n_estimators': 63}
Train scores: [0.78, 0.75, 0.77, 0.77, 0.77]
Mean train score: 0.765 +- 0.009
Test scores: [0.74, 0.78, 0.7, 0.72, 0.78]
Mean test score: 0.745 +- 0.033
----------
Rank 1
Params: {'max_features': 0.9, 'min_samples_leaf': 25, 'n_estimators': 65}
Train scores: [0.78, 0.75, 0.78, 0.76, 0.77]
Mean train score: 0.766 +- 0.009
Test scores: [0.74, 0.78, 0.7, 0.72, 0.78]
Mean test score: 0.745 +- 0.033
----------
Rank 1
Params: {'max_features': 0.9, 'min_samples_leaf': 25, 'n_estimators': 62}
Train scores: [0.77, 0.75, 0.77, 0.77, 0.77]
Mean train score: 0.765 +- 0.008
Test scores: [0.74, 0.78, 0.7, 0.72, 0.78]
Mean test score: 0.745 +- 0.033
----------
Rank 5
Params: {'max_features': 0.9, 'min_samples_leaf': 25, 'n_estimators': 67}
Train scores: [0.78, 0.75, 0.77, 0.76, 0.77]
Mean train score: 0.766 +- 0.011
Test scores: [0.74, 0.78, 0.7, 0.71, 0.78]
Mean test score: 0.743 +- 0.034
----------
Default Params
Mean train score: 1.0 +- 0.0
Mean test score: 0.731 +- 0.04
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done 5 out of 5 | elapsed: 0.2s finished
Feature importance (permutation technique)
params = {'max_features': 0.9, 'min_samples_leaf': 25, 'n_estimators': 62}
# Mean train score: 0.765 +- 0.008
# Mean test score: 0.745 +- 0.033

# # turn off warnings
# def warn(*args, **kwargs):
#     pass
# import warnings
# warnings.warn = warn

final_model = run_sklearn_model('RandomForest',
                                params,
                                X_train_processed, y_train,
                                is_regression=False,
                                class_names=['no','yes'],
                                test_split=None,
                                seed=42,
                                plot_fea_imp=True)
------------------------------ Train set ------------------------------
Log loss: 0.4134397147185217
precision recall f1-score support
no 0.82 0.90 0.86 400
yes 0.78 0.63 0.70 214
accuracy 0.81 614
macro avg 0.80 0.77 0.78 614
weighted avg 0.80 0.81 0.80 614
Insulin is not an impactful feature for this dataset. Digging deeper into the source of the data, there is a description that says: “This population has minimal European admixture, and their diabetes appears to be exclusively type 2 diabetes, with no evidence of the autoimmunity characteristic of type 1 diabetes”. Type 2 diabetes patients can still produce insulin, but their bodies are unable to use it. This information about insulin can be useful later if we want to perform feature selection to boost the model's performance.
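The permutation-importance idea behind the `plot_fea_imp=True` plot can also be reproduced with scikit-learn's `permutation_importance`; the library's own computation may differ in its details, so treat this as an assumed equivalent:

from sklearn.inspection import permutation_importance

result = permutation_importance(final_model, X_train_processed, y_train,
                                scoring='f1_macro', n_repeats=10,
                                random_state=42, n_jobs=-1)
imp = pd.Series(result.importances_mean, index=X_train_processed.columns).sort_values()
print(imp)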
Partial Dependency Plot
num_features = X_train_processed.columns.tolist()
num_features
['Pregnancies',
'Glucose',
'BloodPressure',
'SkinThickness',
'Insulin',
'BMI',
'DiabetesPedigreeFunction',
'Age']
pdp_numerical_only(final_model, X_train_processed, num_features, class_names=['no diabetes','diabetes'],
                   nrows=4, ncols=2)
# y_class is ignored in binary classification or classical regression settings.
In the partial dependence plots, Glucose, BMI, and Age show the largest changes compared to the other features. This matches the feature importance plot from earlier.
plot_ice_pair(final_model, X_train_processed, pair_features=['Glucose','Age'], class_idx=1, figsize=(10,4))
# class_idx is ignored in binary classification or classical regression settings.
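A comparable view can be produced straight from scikit-learn with `PartialDependenceDisplay`, which draws both the average partial dependence and the individual ICE lines; `pdp_numerical_only` and `plot_ice_pair` are this library's own helpers, so the call below is only an assumed stand-in for them:

from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(final_model, X_train_processed,
                                        features=['Glucose', 'BMI', 'Age'],
                                        kind='both')  # 'average' = PDP, 'individual' = ICE lines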