data_preprocess

This module contains several Python functions for simple data preprocessing for ML, such as handling missing values, min-max scaling, and one-hot encoding.

process_missing_values

 process_missing_values (X_train:pandas.core.frame.DataFrame,
                         X_test:pandas.core.frame.DataFrame=None,
                         missing_cols:list|str=[],
                         missing_vals:list|int|float|str=nan,
                         strategies:list|str='median', **kwargs)

Process columns with missing values using Sklearn SimpleImputer

	Type	Default	Details
X_train	pd.DataFrame		Training dataframe
X_test	pd.DataFrame	None	Testing dataframe
missing_cols	list \| str	[]	A column name having missing values, or a list of such columns
missing_vals	list \| int \| float \| str	nan	A placeholder for missing values, or a list of placeholders, one for each column in missing_cols
strategies	list \| str	median	The imputation strategy from sklearn, or a list of such strategies. Currently supports ‘median’, ‘mean’, and ‘most_frequent’
kwargs

df = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],columns=['col1','col2','col3'])
display(df)

	col1	col2	col3
0	7	2.0	3
1	4	NaN	6
2	10	5.0	-1

df_processed = process_missing_values(df,missing_cols=['col2','col3'],missing_vals=[np.nan,-1],strategies='mean')
display(df_processed)

	col1	col2	col3
0	7	2.0	3.0
1	4	3.5	6.0
2	10	5.0	4.5

df_trn = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],columns=['col1','col2','col3'])
df_test = pd.DataFrame([[2, np.nan, 3], [3, 1, -1]],columns=['col1','col2','col3'])
display(df_trn,df_test)

	col1	col2	col3
0	7	2.0	3
1	4	NaN	6
2	10	5.0	-1

	col1	col2	col3
0	2	NaN	3
1	3	1.0	-1

df_processed_trn, df_processed_test = process_missing_values(
    df_trn,
    df_test,
    missing_cols=['col2', 'col3'],
    missing_vals=[np.nan, -1],
    strategies='mean',
)
display(df_processed_trn, df_processed_test)

	col1	col2	col3
0	7	2.0	3.0
1	4	3.5	6.0
2	10	5.0	4.5

	col1	col2	col3
0	2	3.5	3.0
1	3	1.0	4.5
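The outputs above are consistent with fitting an imputer on the training frame only and then reusing the learned fill values on the test frame. The sketch below reproduces that behaviour with scikit-learn's `SimpleImputer` directly. It is an illustration under that assumption, not the module's actual source, and `impute_column` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_trn = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],
                      columns=['col1', 'col2', 'col3'])
df_test = pd.DataFrame([[2, np.nan, 3], [3, 1, -1]],
                       columns=['col1', 'col2', 'col3'])


def impute_column(train, test, col, missing_val, strategy):
    """Hypothetical helper: fit on train only, then fill both frames."""
    imputer = SimpleImputer(missing_values=missing_val, strategy=strategy)
    train, test = train.copy(), test.copy()
    # The fill value (here the mean) is learned from the training column only.
    train[col] = imputer.fit_transform(train[[col]]).ravel()
    # The test column reuses the training statistic, avoiding leakage.
    test[col] = imputer.transform(test[[col]]).ravel()
    return train, test


trn_out, test_out = df_trn, df_test
# Each column gets its own placeholder: NaN for col2, -1 for col3.
for col, missing_val in [('col2', np.nan), ('col3', -1)]:
    trn_out, test_out = impute_column(trn_out, test_out, col, missing_val, 'mean')

print(trn_out)
print(test_out)
```

Note that the missing entry of col2 in the test frame is filled with 3.5, the mean of the training column, rather than a statistic computed from the test data.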

scale_num_cols

 scale_num_cols (X_train:pandas.core.frame.DataFrame,
                 X_test:pandas.core.frame.DataFrame=None,
                 num_cols:list|str=[], scale_methods:list|str='minmax',
                 **kwargs)

Scale numerical columns using Sklearn

	Type	Default	Details
X_train	pd.DataFrame		Training dataframe
X_test	pd.DataFrame	None	Testing dataframe
num_cols	list \| str	[]	Name of the numerical column, or a list of such columns
scale_methods	list \| str	minmax	Sklearn scaling method (‘minmax’ or ‘standard’), or a list of such methods
kwargs

df = pd.DataFrame([[7, 2, 3], [4, 2, 6], [10, 5, 1]],columns=['col1','col2','col3'])
display(df)

	col1	col2	col3
0	7	2	3
1	4	2	6
2	10	5	1

df_processed = scale_num_cols(df,num_cols=['col1','col3'],scale_methods='standard')
display(df_processed)

	col1	col2	col3
0	0.000000	2	-0.162221
1	-1.224745	2	1.297771
2	1.224745	5	-1.135550
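The values above match what scikit-learn's `StandardScaler` produces, which divides by the population standard deviation (`ddof=0`) rather than the sample standard deviation that pandas uses by default. The sketch below shows this with scikit-learn directly. The `SCALERS` mapping and the `scale_columns` helper are hypothetical names for illustration, not the module's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame([[7, 2, 3], [4, 2, 6], [10, 5, 1]],
                  columns=['col1', 'col2', 'col3'])

# Hypothetical mapping from the documented method names to sklearn classes.
SCALERS = {'minmax': MinMaxScaler, 'standard': StandardScaler}


def scale_columns(train, cols, method, test=None):
    """Hypothetical helper: fit the scaler on train only, reuse it on test."""
    scaler = SCALERS[method]()
    scaled = scaler.fit_transform(train[cols])
    train_out = train.assign(**{c: scaled[:, i] for i, c in enumerate(cols)})
    if test is None:
        return train_out
    scaled_test = scaler.transform(test[cols])
    test_out = test.assign(**{c: scaled_test[:, i] for i, c in enumerate(cols)})
    return train_out, test_out


standard = scale_columns(df, ['col1', 'col3'], 'standard')
minmax = scale_columns(df, ['col1', 'col3'], 'minmax')

# Manual check: subtract the mean, divide by the population std (ddof=0).
manual = (df['col1'] - df['col1'].mean()) / df['col1'].std(ddof=0)

print(standard)
print(minmax)
```

Columns that are not listed (col2 here) pass through unchanged, as in the output above.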

one_hot_cat

 one_hot_cat (X_train:pandas.core.frame.DataFrame,
              X_test:pandas.core.frame.DataFrame=None,
              cat_cols:list|str=[], bi_cols:list|str=[], **kwargs)

Perform ‘get_dummies’ on categorical columns

	Type	Default	Details
X_train	pd.DataFrame		Training dataframe
X_test	pd.DataFrame	None	Testing dataframe
cat_cols	list \| str	[]	Name of the categorical columns (non-binary), or a list of such columns
bi_cols	list \| str	[]	Name of the binary column, or a list of such columns
kwargs

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
display(df)

	A	B	C
0	a	b	1
1	b	a	2
2	a	c	3

df_processed = one_hot_cat(df,cat_cols='B',bi_cols='A')
display(df_processed)

	C	B_a	B_b	B_c	A_b
0	1	0.0	1.0	0.0	0.0
1	2	1.0	0.0	0.0	1.0
2	3	0.0	0.0	1.0	0.0
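In the output above, the non-binary column B keeps one indicator per category, while the binary column A keeps a single indicator (A_b), since the second one would be redundant. A minimal sketch of the same result with plain `pandas.get_dummies` follows. It assumes the binary case is handled by dropping the first level, which may differ from what `one_hot_cat` does internally.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})

# Non-binary column: keep every level (B_a, B_b, B_c).
cat_part = pd.get_dummies(df[['B']], dtype=float)

# Binary column: drop the first level so only A_b remains.
bi_part = pd.get_dummies(df[['A']], drop_first=True, dtype=float)

# Untouched columns first, then the encoded ones.
encoded = pd.concat([df.drop(columns=['A', 'B']), cat_part, bi_part], axis=1)
print(encoded)
```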

preprocessing_general

 preprocessing_general (X_train:pandas.core.frame.DataFrame,
                        X_test:pandas.core.frame.DataFrame=None, **kwargs)

The main preprocessing function. It will:

Fill missing values
Scale numerical columns
One-hot encode categorical columns

Remember to pass the appropriate keyword arguments for each of the preprocessing steps listed above.

	Type	Default	Details
X_train	pd.DataFrame		Training dataframe
X_test	pd.DataFrame	None	Testing dataframe
kwargs
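For comparison, the same three steps (impute, scale, one-hot encode) can be chained with scikit-learn's own `ColumnTransformer` and `Pipeline`. The sketch below is not how `preprocessing_general` is implemented, since its source is not shown here. The toy rows are made up purely for illustration, and the column grouping is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy rows with Titanic-like column names, for illustration only.
train = pd.DataFrame({'Pclass': [3, 1, 2, 3],
                      'Sex': ['male', 'female', 'female', 'male'],
                      'Age': [22.0, np.nan, 30.0, 40.0],
                      'Embarked': ['S', 'C', np.nan, 'S']})
test = pd.DataFrame({'Pclass': [2], 'Sex': ['male'],
                     'Age': [np.nan], 'Embarked': ['C']})

# Numerical columns: fill with the median, then standardize.
numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])

# Categorical columns: fill with the most frequent value, then one-hot encode.
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

pre = ColumnTransformer([('num', numeric, ['Age']),
                         ('cat', categorical, ['Sex', 'Embarked'])],
                        remainder='passthrough',  # unlisted columns are kept as-is
                        sparse_threshold=0)       # always return a dense array

# Statistics and categories are learned from train, then applied to test.
train_arr = pre.fit_transform(train)
test_arr = pre.transform(test)

print(train_arr.shape, test_arr.shape)
```

Because the transformer is fitted once on the training frame, the train and test outputs always share the same columns in the same order. Columns left out of every group (Pclass here) are appended unchanged at the end.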

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/titanic.csv')

# Select some useful features, for now
df = df[['Survived','Pclass','Sex','Age','SibSp','Parch','Embarked']].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  156 non-null    int64  
 1   Pclass    156 non-null    int64  
 2   Sex       156 non-null    object 
 3   Age       126 non-null    float64
 4   SibSp     156 non-null    int64  
 5   Parch     156 non-null    int64  
 6   Embarked  155 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 8.7+ KB

df.sample(5)

	Survived	Pclass	Sex	Age	SibSp	Embarked
84	1	2	female	17.0	0	S
4	0	3	male	35.0	0	S
101	0	3	male	NaN	0	S
48	0	3	male	NaN	2	C
112	0	3	male	22.0	0	S

Let’s perform a simple train/test split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived',axis=1), df['Survived'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['Survived'])

X_train.head()

	Pclass	Sex	Age	SibSp	Parch	Embarked
142	3	female	24.0	1	0	S
134	2	male	25.0	0	0	S
120	2	male	21.0	2	0	S
50	3	male	7.0	4	1	S
133	2	female	29.0	1	0	S

X_test.head()

	Pclass	Sex	Age	SibSp	Parch	Embarked
91	3	male	20.0	0	0	S
145	2	male	19.0	1	1	S
115	3	male	21.0	0	0	S
106	3	female	21.0	0	0	S
9	2	female	14.0	1	0	C

X_train_processed,X_test_processed = preprocessing_general(X_train,X_test,
                                                           missing_cols=['Age','Embarked'],
                                                           missing_vals=np.nan,
                                                           strategies=['median','most_frequent'],
                                                           num_cols=['Age','SibSp','Parch'],
                                                           scale_methods=['standard','minmax','minmax'],
                                                           cat_cols='Embarked',
                                                           bi_cols='Sex'
                                                          )

Notice that I don’t pass Pclass to any argument of the preprocessing function, so this column is left untouched.

X_train_processed.head()

	Pclass	Age	SibSp	Parch	Embarked_S	Sex_male
142	3	-0.325526	0.2	0.0	1.0	0.0
134	2	-0.252796	0.0	0.0	1.0	1.0
120	2	-0.543716	0.4	0.0	1.0	1.0
50	3	-1.561938	0.8	0.2	1.0	1.0
133	2	0.038125	0.2	0.0	1.0	0.0

X_test_processed.head()

	Pclass	Age	SibSp	Parch	Embarked_C	Embarked_S	Sex_male
91	3	-0.616446	0.0	0.0	0.0	1.0	1.0
145	2	-0.689176	0.2	0.2	0.0	1.0	1.0
115	3	-0.543716	0.0	0.0	0.0	1.0	1.0
106	3	-0.543716	0.0	0.0	0.0	1.0	0.0
9	2	-1.052827	0.2	0.0	1.0	0.0	0.0
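In the two outputs above, Embarked_C appears in the test frame but not in the training frame. When encoded frames differ like this, a model fitted on one cannot score the other directly. One possible remedy, sketched with plain pandas rather than this module's API, is to reindex the test frame to the training columns. The small frames below are hypothetical stand-ins, not the Titanic outputs.

```python
import pandas as pd

# Hypothetical encoded frames whose one-hot columns do not match.
train_enc = pd.DataFrame({'Pclass': [3, 2],
                          'Embarked_S': [1.0, 1.0],
                          'Sex_male': [0.0, 1.0]})
test_enc = pd.DataFrame({'Pclass': [3, 2],
                         'Embarked_C': [0.0, 1.0],
                         'Embarked_S': [1.0, 0.0],
                         'Sex_male': [1.0, 0.0]})

# Keep exactly the training columns, in training order.
# Columns missing from test would be created and filled with 0.0;
# columns unknown to train (Embarked_C here) are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(test_aligned)
```

Dropping a category the model never saw loses information, so fitting the encoder on training data that covers all expected categories is usually the better option when it is available.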