ml_helpers

This module contains several Python functions for running a quick ML prototype on your processed dataset.

run_logistic_regression

 run_logistic_regression (X_trn:pandas.core.frame.DataFrame,
                          y_trn:pandas.core.series.Series|numpy.ndarray,
                          multi_class='multinomial', solver='newton-cg',
                          penalty=None, max_iter=10000, return_coef=False)

Perform sklearn logistic regression, then print the coefficients and a classification report

| | Type | Default | Details |
|---|---|---|---|
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training labels |
| multi_class | str | multinomial | sklearn's logistic regression multi_class option |
| solver | str | newton-cg | sklearn's logistic regression solver option |
| penalty | NoneType | None | sklearn's logistic regression penalty option |
| max_iter | int | 10000 | sklearn's logistic regression max_iter option |
| return_coef | bool | False | Whether to return the fitted coefficients |
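
A minimal usage sketch; the import path ml_helpers and the toy data below are assumptions, so adapt them to your project layout:

```python
import numpy as np
import pandas as pd
from ml_helpers import run_logistic_regression  # assumed import path

# Toy stand-in for your processed dataset: two features, three classes
rng = np.random.default_rng(42)
X_trn = pd.DataFrame({"feat_a": rng.random(120), "feat_b": rng.random(120)})
y_trn = pd.Series(rng.integers(0, 3, size=120))

# Prints the coefficients and a classification report on the training data;
# return_coef=True also returns the fitted coefficients for inspection
coefs = run_logistic_regression(X_trn, y_trn, return_coef=True)
```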

run_multinomial_statmodel

 run_multinomial_statmodel (X_trn:pandas.core.frame.DataFrame,
                            y_trn:pandas.core.series.Series|numpy.ndarray,
                            add_constant=True)

Perform a multinomial logit from statsmodels, then print the results and a classification report

| | Type | Default | Details |
|---|---|---|---|
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training labels |
| add_constant | bool | True | Whether to add a constant (intercept) column to X_trn |
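
Reusing the toy X_trn and y_trn from the sketch above (the import path is again an assumption):

```python
from ml_helpers import run_multinomial_statmodel  # assumed import path

# add_constant=True (the default) adds an intercept column to X_trn before
# fitting the statsmodels multinomial logit; the fit summary and a
# classification report are printed
run_multinomial_statmodel(X_trn, y_trn, add_constant=True)
```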

run_sklearn_model

 run_sklearn_model (model_name:str, model_params:dict,
                    X_trn:pandas.core.frame.DataFrame,
                    y_trn:pandas.core.series.Series|numpy.ndarray,
                    is_regression=False, class_names:list=None,
                    test_split=None, metric_funcs={}, seed=42,
                    plot_fea_imp=True)

| | Type | Default | Details |
|---|---|---|---|
| model_name | str | | sklearn model to try. Currently supports DecisionTree, AdaBoost, RandomForest |
| model_params | dict | | Dictionary containing the model's hyperparameters |
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training labels |
| is_regression | bool | False | Whether to use a regression model instead of a classification model |
| class_names | list | None | List of names associated with the labels (same order), e.g. ['no', 'yes']. Classification only |
| test_split | NoneType | None | Test set split. If float: random split. If list of lists: indices of the train and test sets. If None: skip splitting |
| metric_funcs | dict | {} | Dictionary of metric functions: {metric_name: metric_func} |
| seed | int | 42 | Random seed |
| plot_fea_imp | bool | True | Whether to plot sklearn's feature importances. Set to False to skip the plot |
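
A sketch of a classification run; the hyperparameter values, class names, and metric choice are illustrative, and it is assumed here that each metric function takes (y_true, y_pred) like sklearn's accuracy_score:

```python
from sklearn.metrics import accuracy_score
from ml_helpers import run_sklearn_model  # assumed import path

model_params = {"n_estimators": 200, "max_depth": 5}  # illustrative values

run_sklearn_model(
    "RandomForest", model_params,       # one of DecisionTree, AdaBoost, RandomForest
    X_trn, y_trn,
    is_regression=False,
    class_names=["low", "mid", "high"], # label names, same order as the labels
    test_split=0.2,                     # float -> random train/test split
    metric_funcs={"accuracy": accuracy_score},
    seed=42,
    plot_fea_imp=True,                  # plot sklearn's feature importances
)
```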

tune_sklearn_model

 tune_sklearn_model (model_name:str, param_grid:dict,
                     X_trn:pandas.core.frame.DataFrame,
                     y_trn:pandas.core.series.Series|numpy.ndarray,
                     is_regression=False, custom_cv=5,
                     random_cv_iter=None, scoring=None, seed=42,
                     rank_show=10, show_split_scores=True)

Perform either sklearn's grid search or randomized search (depending on random_cv_iter) over the model's hyperparameters using param_grid

| | Type | Default | Details |
|---|---|---|---|
| model_name | str | | sklearn model to tune. Currently supports DecisionTree, AdaBoost, RandomForest |
| param_grid | dict | | Dictionary with parameter names (str) as keys and lists of parameter settings to try as values |
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training labels |
| is_regression | bool | False | Whether this is a regression problem rather than classification |
| custom_cv | int | 5 | sklearn's cross-validation splitting strategy |
| random_cv_iter | NoneType | None | Number of parameter settings to sample. Set this to use RandomizedSearchCV |
| scoring | NoneType | None | Scoring metric for the search |
| seed | int | 42 | Random seed |
| rank_show | int | 10 | Number of ranks to show (descending order) |
| show_split_scores | bool | True | Whether to show both train and test split scores |
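
A sketch showing both search modes; the grid values and scoring metric are illustrative:

```python
from ml_helpers import tune_sklearn_model  # assumed import path

param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}

# random_cv_iter=None (the default) runs a grid search over param_grid
tune_sklearn_model("DecisionTree", param_grid, X_trn, y_trn,
                   is_regression=False, custom_cv=5,
                   scoring="accuracy", rank_show=10)

# Setting random_cv_iter switches to a randomized search that samples
# that many parameter settings, which helps with larger grids
tune_sklearn_model("DecisionTree", param_grid, X_trn, y_trn,
                   scoring="accuracy", random_cv_iter=5)
```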

get_adaboost_info

 get_adaboost_info (dt_params, ada_params, X, y, seed=42)

show_both_cv

 show_both_cv (search_cv, default_cv, scoring, top_n=10,
               show_split_scores=False)

summarize_default_cv

 summarize_default_cv (default_cv, s)

summarize_cv_results

 summarize_cv_results (search_cv, scoring, top_n=10,
                       show_split_scores=False)
