### process_missing_values

```python
process_missing_values(X_train: pd.DataFrame, X_test: pd.DataFrame = None,
                       missing_cols: list | str = [],
                       missing_vals: list | int | float | str = nan,
                       strategies: list | str = 'median', **kwargs)
```

Process columns with missing values using sklearn's `SimpleImputer`.
|  | Type | Default | Details |
|---|---|---|---|
| X_train | pd.DataFrame |  | Training dataframe |
| X_test | pd.DataFrame | None | Testing dataframe |
| missing_cols | list \| str | [] | A column name having missing values, or a list of such columns |
| missing_vals | list \| int \| float \| str | nan | A placeholder for missing values, or a list of placeholders for all columns in `missing_cols` |
| strategies | list \| str | median | The imputation strategy from sklearn, or a list of such strategies. Currently supports 'median', 'mean', and 'most_frequent' |
| kwargs |  |  |  |
```python
df = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],
                  columns=['col1', 'col2', 'col3'])
display(df)
```
|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 7 | 2.0 | 3 |
| 1 | 4 | NaN | 6 |
| 2 | 10 | 5.0 | -1 |
```python
df_processed = process_missing_values(df, missing_cols=['col2', 'col3'],
                                      missing_vals=[np.nan, -1], strategies='mean')
display(df_processed)
```
|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 7 | 2.0 | 3.0 |
| 1 | 4 | 3.5 | 6.0 |
| 2 | 10 | 5.0 | 4.5 |
```python
df_trn = pd.DataFrame([[7, 2, 3], [4, np.nan, 6], [10, 5, -1]],
                      columns=['col1', 'col2', 'col3'])
df_test = pd.DataFrame([[2, np.nan, 3], [3, 1, -1]],
                       columns=['col1', 'col2', 'col3'])
display(df_trn, df_test)
|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 7 | 2.0 | 3 |
| 1 | 4 | NaN | 6 |
| 2 | 10 | 5.0 | -1 |

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 2 | NaN | 3 |
| 1 | 3 | 1.0 | -1 |
```python
df_processed_trn, df_processed_val = process_missing_values(df_trn, df_test,
                                                            missing_cols=['col2', 'col3'],
                                                            missing_vals=[np.nan, -1],
                                                            strategies='mean')
display(df_processed_trn, df_processed_val)
```
|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 7 | 2.0 | 3.0 |
| 1 | 4 | 3.5 | 6.0 |
| 2 | 10 | 5.0 | 4.5 |

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 2 | 3.5 | 3.0 |
| 1 | 3 | 1.0 | 4.5 |
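Under the hood, a function like this can be built around `SimpleImputer` directly: fit one imputer per column on the training frame, then apply it to both frames. The following is a minimal sketch under that assumption; `impute_columns` and its internals are illustrative, not the library's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_columns(X_train, X_test=None, missing_cols=[], missing_vals=np.nan,
                   strategies='median'):
    """Hypothetical sketch: impute the listed columns, fitting on train only."""
    X_train = X_train.copy()
    X_test = X_test.copy() if X_test is not None else None
    # Broadcast scalar arguments to one entry per column
    if not isinstance(missing_cols, list):
        missing_cols = [missing_cols]
    if not isinstance(missing_vals, list):
        missing_vals = [missing_vals] * len(missing_cols)
    if not isinstance(strategies, list):
        strategies = [strategies] * len(missing_cols)
    for col, val, strategy in zip(missing_cols, missing_vals, strategies):
        imp = SimpleImputer(missing_values=val, strategy=strategy)
        X_train[col] = imp.fit_transform(X_train[[col]]).ravel()
        if X_test is not None:
            # Reuse the train-fitted statistics to avoid leakage
            X_test[col] = imp.transform(X_test[[col]]).ravel()
    return X_train if X_test is None else (X_train, X_test)
```

Fitting on the training set only (and merely transforming the test set) is what makes the two-dataframe examples above leak-free.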
### scale_num_cols

```python
scale_num_cols(X_train: pd.DataFrame, X_test: pd.DataFrame = None,
               num_cols: list | str = [], scale_methods: list | str = 'minmax',
               **kwargs)
```

Scale numerical columns using sklearn scalers.
|  | Type | Default | Details |
|---|---|---|---|
| X_train | pd.DataFrame |  | Training dataframe |
| X_test | pd.DataFrame | None | Testing dataframe |
| num_cols | list \| str | [] | Name of the numerical column, or a list of such columns |
| scale_methods | list \| str | minmax | Sklearn scaling method ('minmax' or 'standard'), or a list of such methods |
| kwargs |  |  |  |
```python
df = pd.DataFrame([[7, 2, 3], [4, 2, 6], [10, 5, 1]],
                  columns=['col1', 'col2', 'col3'])
display(df)
df_processed = scale_num_cols(df, num_cols=['col1', 'col3'], scale_methods='standard')
display(df_processed)
```
|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 0.000000 | 2 | -0.162221 |
| 1 | -1.224745 | 2 | 1.297771 |
| 2 | 1.224745 | 5 | -1.135550 |
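A function with this signature can be sketched the same way as the imputation helper: pick `MinMaxScaler` or `StandardScaler` per column, fit on train, transform both frames. `scale_columns` below is an illustrative assumption, not the library's implementation.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale_columns(X_train, X_test=None, num_cols=[], scale_methods='minmax'):
    """Hypothetical sketch: fit one scaler per column on train, apply to both."""
    scaler_map = {'minmax': MinMaxScaler, 'standard': StandardScaler}
    X_train = X_train.copy()
    X_test = X_test.copy() if X_test is not None else None
    if not isinstance(num_cols, list):
        num_cols = [num_cols]
    if not isinstance(scale_methods, list):
        scale_methods = [scale_methods] * len(num_cols)
    for col, method in zip(num_cols, scale_methods):
        scaler = scaler_map[method]()
        X_train[col] = scaler.fit_transform(X_train[[col]]).ravel()
        if X_test is not None:
            # Test data is scaled with the train-fitted parameters
            X_test[col] = scaler.transform(X_test[[col]]).ravel()
    return X_train if X_test is None else (X_train, X_test)
```

Note that sklearn's `StandardScaler` uses the population standard deviation (ddof=0), which is why the three scaled values of `col1` above are 0 and ±1.224745 rather than 0 and ±1.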
### one_hot_cat

```python
one_hot_cat(X_train: pd.DataFrame, X_test: pd.DataFrame = None,
            cat_cols: list | str = [], bi_cols: list | str = [], **kwargs)
```

Perform `get_dummies` on categorical columns.
|  | Type | Default | Details |
|---|---|---|---|
| X_train | pd.DataFrame |  | Training dataframe |
| X_test | pd.DataFrame | None | Testing dataframe |
| cat_cols | list \| str | [] | Name of the categorical columns (non-binary), or a list of such columns |
| bi_cols | list \| str | [] | Name of the binary column, or a list of such columns |
| kwargs |  |  |  |
```python
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
display(df)
df_processed = one_hot_cat(df, cat_cols='B', bi_cols='A')
display(df_processed)
```
|  | C | B_a | B_b | B_c | A_b |
|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 3 | 0.0 | 0.0 | 1.0 | 0.0 |
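The output pattern above (three dummy columns for the non-binary `B`, a single 0/1 column for the binary `A`) suggests plain `pd.get_dummies` with `drop_first=True` for binary columns. The sketch below assumes that and omits the `X_test` handling for brevity; `one_hot_columns` is an illustrative name.

```python
import pandas as pd

def one_hot_columns(X_train, cat_cols=[], bi_cols=[]):
    """Hypothetical sketch: full dummies for multi-category columns,
    drop-first dummies for binary columns (one 0/1 indicator each)."""
    # get_dummies requires list-like `columns`, so wrap bare strings
    if not isinstance(cat_cols, list):
        cat_cols = [cat_cols]
    if not isinstance(bi_cols, list):
        bi_cols = [bi_cols]
    out = pd.get_dummies(X_train, columns=cat_cols, dtype=float)
    out = pd.get_dummies(out, columns=bi_cols, drop_first=True, dtype=float)
    return out
```

Dropping the first level of a binary column avoids carrying two perfectly collinear indicators (e.g. `A_a` and `A_b`) into a downstream model.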
### preprocessing_general

```python
preprocessing_general(X_train: pd.DataFrame, X_test: pd.DataFrame = None, **kwargs)
```

*The main preprocessing function; it will perform `process_missing_values`, `scale_num_cols`, and `one_hot_cat`. Remember to put in the appropriate keyword arguments for each of the preprocessing steps mentioned above.*
|  | Type | Default | Details |
|---|---|---|---|
| X_train | pd.DataFrame |  | Training dataframe |
| X_test | pd.DataFrame | None | Testing dataframe |
| kwargs |  |  |  |
```python
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/titanic.csv')
# Select some useful features, for now
df = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']].copy()
df.info()
```
```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  156 non-null    int64
 1   Pclass    156 non-null    int64
 2   Sex       156 non-null    object
 3   Age       126 non-null    float64
 4   SibSp     156 non-null    int64
 5   Parch     156 non-null    int64
 6   Embarked  155 non-null    object
dtypes: float64(1), int64(4), object(2)
memory usage: 8.7+ KB
```
|  | Survived | Pclass | Sex | Age | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|---|
| 84 | 1 | 2 | female | 17.0 | 0 | 0 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | S |
| 101 | 0 | 3 | male | NaN | 0 | 0 | S |
| 48 | 0 | 3 | male | NaN | 2 | 0 | C |
| 112 | 0 | 3 | male | 22.0 | 0 | 0 | S |
Let’s perform a simple train/test split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('Survived', axis=1),
                                                    df['Survived'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['Survived'])
```
|  | Pclass | Sex | Age | SibSp | Parch | Embarked |
|---|---|---|---|---|---|---|
| 142 | 3 | female | 24.0 | 1 | 0 | S |
| 134 | 2 | male | 25.0 | 0 | 0 | S |
| 120 | 2 | male | 21.0 | 2 | 0 | S |
| 50 | 3 | male | 7.0 | 4 | 1 | S |
| 133 | 2 | female | 29.0 | 1 | 0 | S |
| 91 | 3 | male | 20.0 | 0 | 0 | S |
| 145 | 2 | male | 19.0 | 1 | 1 | S |
| 115 | 3 | male | 21.0 | 0 | 0 | S |
| 106 | 3 | female | 21.0 | 0 | 0 | S |
| 9 | 2 | female | 14.0 | 1 | 0 | C |
```python
X_train_processed, X_test_processed = preprocessing_general(X_train, X_test,
                                                            missing_cols=['Age', 'Embarked'],
                                                            missing_vals=np.nan,
                                                            strategies=['median', 'most_frequent'],
                                                            num_cols=['Age', 'SibSp', 'Parch'],
                                                            scale_methods=['standard', 'minmax', 'minmax'],
                                                            cat_cols='Embarked',
                                                            bi_cols='Sex')
```
Notice that I don’t add `Pclass` to the preprocessing function, which means this column will be left untouched.
|  | Pclass | Age | SibSp | Parch | Embarked_C | Embarked_Q | Embarked_S | Sex_male |
|---|---|---|---|---|---|---|---|---|
| 142 | 3 | -0.325526 | 0.2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 134 | 2 | -0.252796 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 120 | 2 | -0.543716 | 0.4 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 50 | 3 | -1.561938 | 0.8 | 0.2 | 0.0 | 0.0 | 1.0 | 1.0 |
| 133 | 2 | 0.038125 | 0.2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 91 | 3 | -0.616446 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 145 | 2 | -0.689176 | 0.2 | 0.2 | 0.0 | 0.0 | 1.0 | 1.0 |
| 115 | 3 | -0.543716 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 106 | 3 | -0.543716 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | 2 | -1.052827 | 0.2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
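The same three-step pipeline can be reproduced by hand with sklearn and pandas, which makes it clear what `preprocessing_general` has to do internally. The sketch below runs on a tiny made-up stand-in for the Titanic features; the data, and the assumption that the steps run in this exact order, are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny hypothetical stand-in for the Titanic training features
X_train = pd.DataFrame({'Pclass': [3, 2, 2],
                        'Sex': ['female', 'male', 'male'],
                        'Age': [24.0, np.nan, 21.0],
                        'SibSp': [1, 0, 2],
                        'Embarked': ['S', 'S', np.nan]})

# 1) Impute: median for the numeric Age, most frequent for the categorical Embarked
age_imp = SimpleImputer(strategy='median')
emb_imp = SimpleImputer(strategy='most_frequent')
X_train['Age'] = age_imp.fit_transform(X_train[['Age']]).ravel()
X_train['Embarked'] = emb_imp.fit_transform(X_train[['Embarked']]).ravel()

# 2) Scale: standard for Age, minmax for SibSp
X_train['Age'] = StandardScaler().fit_transform(X_train[['Age']]).ravel()
X_train['SibSp'] = MinMaxScaler().fit_transform(X_train[['SibSp']]).ravel()

# 3) One-hot: full dummies for Embarked, drop-first for the binary Sex
X_train = pd.get_dummies(X_train, columns=['Embarked'], dtype=float)
X_train = pd.get_dummies(X_train, columns=['Sex'], drop_first=True, dtype=float)
```

With real train/test frames each fitted object (imputer, scaler) would be fitted on the training set and reused to transform the test set, exactly as in the two-dataframe examples above; `Pclass` is never mentioned, so it passes through unchanged.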