data_preprocess

This module contains several Python functions for simple data preprocessing for ML, such as handling missing values, min-max scaling, and one-hot encoding.


process_missing_values

 process_missing_values (X_train:pandas.core.frame.DataFrame,
                         X_test:pandas.core.frame.DataFrame=None,
                         missing_cols:list|str=[],
                         missing_vals:list|int|float|str=nan,
                         strategies:list|str='median', **kwargs)

Process columns with missing values using sklearn's SimpleImputer

Parameter     Type                      Default  Details
X_train       pd.DataFrame                       Training dataframe
X_test        pd.DataFrame              None     Testing dataframe
missing_cols  list | str                []       A column name having missing values, or a list of such columns
missing_vals  list | int | float | str  nan      A placeholder for missing values, or a list of placeholders for all columns in missing_cols
strategies    list | str                median   The imputation strategy from sklearn, or a list of such values. Currently supports ‘median’, ‘mean’, ‘most_frequent’
kwargs
df = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],columns=['col1','col2','col3'])
display(df)
col1 col2 col3
0 7 2.0 3
1 4 NaN 6
2 10 5.0 -1
df_processed = process_missing_values(df,missing_cols=['col2','col3'],missing_vals=[np.nan,-1],strategies='mean')
display(df_processed)
col1 col2 col3
0 7 2.0 3.0
1 4 3.5 6.0
2 10 5.0 4.5
df_trn = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],columns=['col1','col2','col3'])
df_test = pd.DataFrame([[2, np.nan, 3], [3, 1, -1]],columns=['col1','col2','col3'])
display(df_trn,df_test)
col1 col2 col3
0 7 2.0 3
1 4 NaN 6
2 10 5.0 -1
col1 col2 col3
0 2 NaN 3
1 3 1.0 -1
df_processed_trn,df_processed_test = process_missing_values(df_trn,
                                                            df_test,
                                                            missing_cols=['col2','col3'],
                                                            missing_vals=[np.nan,-1],
                                                            strategies='mean')
display(df_processed_trn,df_processed_test)
col1 col2 col3
0 7 2.0 3.0
1 4 3.5 6.0
2 10 5.0 4.5
col1 col2 col3
0 2 3.5 3.0
1 3 1.0 4.5
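
The output above suggests that the fill values are learned from the training frame and reused on the test frame: the gap in the test `col2` is filled with 3.5, which is the training mean. A minimal sketch of that pattern with sklearn's `SimpleImputer`, fitted one column at a time, could look like this. The helper name `impute_sketch` is hypothetical, and this is not necessarily how `process_missing_values` is implemented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_sketch(X_train, X_test, cols, missing_vals, strategies):
    """Hypothetical helper: fit one SimpleImputer per column on the training
    frame, then apply the same fitted imputer to the test frame."""
    X_train, X_test = X_train.copy(), X_test.copy()
    for col, missing, strategy in zip(cols, missing_vals, strategies):
        imputer = SimpleImputer(missing_values=missing, strategy=strategy)
        # The fill statistic is computed from the training data only
        X_train[col] = imputer.fit_transform(X_train[[col]]).ravel()
        X_test[col] = imputer.transform(X_test[[col]]).ravel()
    return X_train, X_test


df_trn = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],
                      columns=['col1', 'col2', 'col3'])
df_test = pd.DataFrame([[2, np.nan, 3], [3, 1, -1]],
                       columns=['col1', 'col2', 'col3'])
trn_filled, test_filled = impute_sketch(df_trn, df_test,
                                        cols=['col2', 'col3'],
                                        missing_vals=[np.nan, -1],
                                        strategies=['mean', 'mean'])
print(trn_filled)
print(test_filled)
```

Each column gets its own placeholder and strategy here, which mirrors how `missing_vals` and `strategies` accept lists in the signature above.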


scale_num_cols

 scale_num_cols (X_train:pandas.core.frame.DataFrame,
                 X_test:pandas.core.frame.DataFrame=None,
                 num_cols:list|str=[], scale_methods:list|str='minmax',
                 **kwargs)

Scale numerical columns using Sklearn

Parameter      Type          Default  Details
X_train        pd.DataFrame           Training dataframe
X_test         pd.DataFrame  None     Testing dataframe
num_cols       list | str    []       Name of the numerical column, or a list of such columns
scale_methods  list | str    minmax   Sklearn scaling method (‘minmax’ or ‘standard’), or a list of such methods
kwargs
df = pd.DataFrame([[7, 2, 3], [4, 2, 6], [10, 5, 1]],columns=['col1','col2','col3'])
display(df)
col1 col2 col3
0 7 2 3
1 4 2 6
2 10 5 1
df_processed = scale_num_cols(df,num_cols=['col1','col3'],scale_methods='standard')
display(df_processed)
col1 col2 col3
0 0.000000 2 -0.162221
1 -1.224745 2 1.297771
2 1.224745 5 -1.135550
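
The example above only scales a single frame. When a test frame is also involved, the usual sklearn pattern is to fit the scaler on the training data and reuse it on the test data. The sketch below illustrates that pattern with one scaler per column. The helper `scale_sketch`, the `SCALERS` mapping, and the test frame are all made up for illustration and are not taken from the module.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical mapping from a method name to an sklearn scaler class
SCALERS = {'minmax': MinMaxScaler, 'standard': StandardScaler}


def scale_sketch(X_train, X_test, num_cols, scale_methods):
    """Hypothetical helper: fit each scaler on the training column only,
    then transform the matching test column with the fitted scaler."""
    X_train, X_test = X_train.copy(), X_test.copy()
    for col, method in zip(num_cols, scale_methods):
        scaler = SCALERS[method]()
        X_train[col] = scaler.fit_transform(X_train[[col]]).ravel()
        X_test[col] = scaler.transform(X_test[[col]]).ravel()
    return X_train, X_test


df_trn = pd.DataFrame([[7, 2, 3], [4, 2, 6], [10, 5, 1]],
                      columns=['col1', 'col2', 'col3'])
# Illustrative test frame (not from the original examples)
df_test = pd.DataFrame([[13, 2, 1], [7, 3, 11]],
                       columns=['col1', 'col2', 'col3'])
trn_scaled, test_scaled = scale_sketch(df_trn, df_test,
                                       num_cols=['col1', 'col3'],
                                       scale_methods=['standard', 'minmax'])
print(trn_scaled)
print(test_scaled)
```

Because the min and max come from the training column, min-max scaled test values can fall outside the range 0 to 1 when the test data contains values the training data never saw.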


one_hot_cat

 one_hot_cat (X_train:pandas.core.frame.DataFrame,
              X_test:pandas.core.frame.DataFrame=None,
              cat_cols:list|str=[], bi_cols:list|str=[], **kwargs)

Perform ‘get_dummies’ on categorical columns

Parameter  Type          Default  Details
X_train    pd.DataFrame           Training dataframe
X_test     pd.DataFrame  None     Testing dataframe
cat_cols   list | str    []       Name of a non-binary categorical column, or a list of such columns
bi_cols    list | str    []       Name of a binary column, or a list of such columns
kwargs
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
display(df)
A B C
0 a b 1
1 b a 2
2 a c 3
df_processed = one_hot_cat(df,cat_cols='B',bi_cols='A')
display(df_processed)
C B_a B_b B_c A_b
0 1 0.0 1.0 0.0 0.0
1 2 1.0 0.0 0.0 1.0
2 3 0.0 0.0 1.0 0.0
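
In the output above, the non-binary column `B` keeps one indicator per category, while the binary column `A` is reduced to the single indicator `A_b`. The sketch below reproduces that layout with `pd.get_dummies` and keeps the test frame aligned by fixing the category levels from the training frame. The helper `one_hot_sketch` and the test frame are hypothetical, and the module may implement this differently.

```python
import pandas as pd


def one_hot_sketch(X_train, X_test, cat_cols, bi_cols):
    """Hypothetical helper: one-hot encode with category levels taken from
    the training frame, so both frames end up with the same columns."""
    levels = {c: sorted(X_train[c].dropna().unique()) for c in cat_cols + bi_cols}

    def encode(df):
        df = df.copy()
        for col, cats in levels.items():
            # Fixed categories: levels missing from this frame still get a column
            df[col] = pd.Categorical(df[col], categories=cats)
        out = df.drop(columns=cat_cols + bi_cols)
        # Non-binary columns: one indicator per category
        out = out.join(pd.get_dummies(df[cat_cols], columns=cat_cols, dtype=float))
        # Binary columns: a single indicator is enough, so drop the first level
        out = out.join(pd.get_dummies(df[bi_cols], columns=bi_cols,
                                      drop_first=True, dtype=float))
        return out

    return encode(X_train), encode(X_test)


df_trn = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
# Illustrative test frame in which some training categories never appear
df_test = pd.DataFrame({'A': ['b', 'b'], 'B': ['a', 'a'], 'C': [4, 5]})
trn_enc, test_enc = one_hot_sketch(df_trn, df_test, cat_cols=['B'], bi_cols=['A'])
print(trn_enc)
print(test_enc)
```

In this sketch, a test value that never occurred in the training frame becomes a row of zeros across that column's indicators.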


preprocessing_general

 preprocessing_general (X_train:pandas.core.frame.DataFrame,
                        X_test:pandas.core.frame.DataFrame=None, **kwargs)

The main preprocessing function. It performs the following steps:

  • Fill missing values

  • Scale numerical columns

  • One-hot encode categorical columns

Remember to pass the appropriate keyword arguments for each of the preprocessing steps listed above.

Parameter  Type          Default  Details
X_train    pd.DataFrame           Training dataframe
X_test     pd.DataFrame  None     Testing dataframe
kwargs
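
For comparison, the same three steps (impute, scale, one-hot encode) can be expressed directly in sklearn with a `ColumnTransformer`. This is only a sketch of an equivalent formulation on a toy frame invented for illustration. It is not the implementation of `preprocessing_general`, and it returns a NumPy array rather than a dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, invented for illustration
X_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0],
                        'Embarked': ['S', 'C', np.nan],
                        'Pclass': [3, 1, 2]})
X_test = pd.DataFrame({'Age': [np.nan], 'Embarked': ['C'], 'Pclass': [1]})

preprocess = ColumnTransformer(
    [
        # Numerical column: fill with the median, then standardize
        ('num', make_pipeline(SimpleImputer(strategy='median'),
                              StandardScaler()), ['Age']),
        # Categorical column: fill with the most frequent value, then one-hot encode
        ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                              OneHotEncoder(handle_unknown='ignore')), ['Embarked']),
    ],
    remainder='passthrough',  # columns not listed (Pclass) are left untouched
    sparse_threshold=0,       # always return a dense array
)

# All statistics are learned from the training frame and reused on the test frame
train_arr = preprocess.fit_transform(X_train)
test_arr = preprocess.transform(X_test)
print(train_arr.shape)
print(test_arr)
```

The output columns are ordered as scaled `Age`, the `Embarked` indicators, then the passed-through `Pclass`.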
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/titanic.csv')
# Select some useful features, for now
df = df[['Survived','Pclass','Sex','Age','SibSp','Parch','Embarked']].copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  156 non-null    int64  
 1   Pclass    156 non-null    int64  
 2   Sex       156 non-null    object 
 3   Age       126 non-null    float64
 4   SibSp     156 non-null    int64  
 5   Parch     156 non-null    int64  
 6   Embarked  155 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 8.7+ KB
df.sample(5)
Survived Pclass Sex Age SibSp Parch Embarked
84 1 2 female 17.0 0 0 S
4 0 3 male 35.0 0 0 S
101 0 3 male NaN 0 0 S
48 0 3 male NaN 2 0 C
112 0 3 male 22.0 0 0 S

Let’s perform a simple stratified train/test split:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived',axis=1), df['Survived'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['Survived'])
X_train.head()
Pclass Sex Age SibSp Parch Embarked
142 3 female 24.0 1 0 S
134 2 male 25.0 0 0 S
120 2 male 21.0 2 0 S
50 3 male 7.0 4 1 S
133 2 female 29.0 1 0 S
X_test.head()
Pclass Sex Age SibSp Parch Embarked
91 3 male 20.0 0 0 S
145 2 male 19.0 1 1 S
115 3 male 21.0 0 0 S
106 3 female 21.0 0 0 S
9 2 female 14.0 1 0 C
X_train_processed,X_test_processed = preprocessing_general(X_train,X_test,
                                                           missing_cols=['Age','Embarked'],
                                                           missing_vals=np.nan,
                                                           strategies=['median','most_frequent'],
                                                           num_cols=['Age','SibSp','Parch'],
                                                           scale_methods=['standard','minmax','minmax'],
                                                           cat_cols='Embarked',
                                                           bi_cols='Sex'
                                                          )

Note that Pclass is not passed to any of the preprocessing arguments, so this column is left untouched.

X_train_processed.head()
Pclass Age SibSp Parch Embarked_C Embarked_Q Embarked_S Sex_male
142 3 -0.325526 0.2 0.0 0.0 0.0 1.0 0.0
134 2 -0.252796 0.0 0.0 0.0 0.0 1.0 1.0
120 2 -0.543716 0.4 0.0 0.0 0.0 1.0 1.0
50 3 -1.561938 0.8 0.2 0.0 0.0 1.0 1.0
133 2 0.038125 0.2 0.0 0.0 0.0 1.0 0.0
X_test_processed.head()
Pclass Age SibSp Parch Embarked_C Embarked_Q Embarked_S Sex_male
91 3 -0.616446 0.0 0.0 0.0 0.0 1.0 1.0
145 2 -0.689176 0.2 0.2 0.0 0.0 1.0 1.0
115 3 -0.543716 0.0 0.0 0.0 0.0 1.0 1.0
106 3 -0.543716 0.0 0.0 0.0 0.0 1.0 0.0
9 2 -1.052827 0.2 0.0 1.0 0.0 0.0 0.0