ml_helpers
This module contains several Python functions for running a quick ML prototype on your processed dataset.
run_logistic_regression
run_logistic_regression (X_trn:pandas.core.frame.DataFrame, y_trn:pandas.core.series.Series|numpy.ndarray, multi_class='multinomial', solver='newton-cg', penalty=None, max_iter=10000, return_coef=False)
Perform sklearn logistic regression, then print the coefficients and a classification report
| | Type | Default | Details |
|---|---|---|---|
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training label |
| multi_class | str | multinomial | sklearn’s log reg multiclass option |
| solver | str | newton-cg | sklearn’s log reg solver option |
| penalty | NoneType | None | sklearn’s log reg penalty option |
| max_iter | int | 10000 | sklearn’s log reg max iteration option |
| return_coef | bool | False | Whether to return coefficients |
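A minimal usage sketch (the import path and the synthetic data are assumptions for illustration; the printed coefficients and classification report come from the helper itself):

```python
import pandas as pd
from sklearn.datasets import make_classification

from ml_helpers import run_logistic_regression  # import path assumed

# Small synthetic multiclass dataset for illustration
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, random_state=42)
X_trn = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
y_trn = pd.Series(y)

# Prints the coefficients and a classification report;
# the coefficients are returned only when return_coef=True
coefs = run_logistic_regression(X_trn, y_trn, return_coef=True)
```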
run_multinomial_statmodel
run_multinomial_statmodel (X_trn:pandas.core.frame.DataFrame, y_trn:pandas.core.series.Series|numpy.ndarray, add_constant=True)
Perform a multinomial logit from statsmodels, then print the results and a classification report
| | Type | Default | Details |
|---|---|---|---|
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training label |
| add_constant | bool | True | Whether to add a constant column to X_trn |
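A hedged usage sketch (import path assumed; per the table above, add_constant=True adds an intercept column to X_trn before fitting):

```python
import pandas as pd
from sklearn.datasets import make_classification

from ml_helpers import run_multinomial_statmodel  # import path assumed

# Synthetic 3-class data for illustration
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=0)
X_trn = pd.DataFrame(X, columns=["x1", "x2", "x3", "x4"])

# Prints the statsmodels fit results and a classification report
run_multinomial_statmodel(X_trn, pd.Series(y), add_constant=True)
```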
run_sklearn_model
run_sklearn_model (model_name:str, model_params:dict, X_trn:pandas.core.frame.DataFrame, y_trn:pandas.core.series.Series|numpy.ndarray, is_regression=False, class_names:list=None, test_split=None, metric_funcs={}, seed=42, plot_fea_imp=True)
| | Type | Default | Details |
|---|---|---|---|
| model_name | str | | sklearn’s Machine Learning model to try. Currently supports DecisionTree, AdaBoost, RandomForest |
| model_params | dict | | A dictionary containing the model’s hyperparameters |
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training label |
| is_regression | bool | False | Whether to use a regression model or a classification model |
| class_names | list | None | List of names associated with the labels (same order), e.g. ['no', 'yes']. For classification only |
| test_split | NoneType | None | Test set split. If float: random split. If list of lists: indices of the train and test sets. If None: skip splitting |
| metric_funcs | dict | {} | Dictionary of metric functions: {metric_name: metric_func} |
| seed | int | 42 | Random seed |
| plot_fea_imp | bool | True | Whether to plot sklearn’s feature importances. Set to False to skip the plot |
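A hedged example of running a random forest with a 20% random test split and an extra metric. The model_params keys are standard sklearn RandomForestClassifier arguments; exactly how the helper forwards them and calls each metric_func is assumed to follow the conventions in the table:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score

from ml_helpers import run_sklearn_model  # import path assumed

# Synthetic binary-classification data for illustration
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_trn = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

run_sklearn_model(
    model_name="RandomForest",                        # one of the supported model names
    model_params={"n_estimators": 200, "max_depth": 5},
    X_trn=X_trn,
    y_trn=pd.Series(y),
    is_regression=False,
    class_names=["no", "yes"],                        # label names, same order as the labels
    test_split=0.2,                                   # float -> random train/test split
    metric_funcs={"f1": f1_score},                    # {metric_name: metric_func}
    seed=42,
    plot_fea_imp=True,                                # plot feature importances
)
```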
tune_sklearn_model
tune_sklearn_model (model_name:str, param_grid:dict, X_trn:pandas.core.frame.DataFrame, y_trn:pandas.core.series.Series|numpy.ndarray, is_regression=False, custom_cv=5, random_cv_iter=None, scoring=None, seed=42, rank_show=10, show_split_scores=True)
Perform either sklearn’s grid search or randomized search (chosen based on random_cv_iter) over the model’s hyperparameters using param_grid
| | Type | Default | Details |
|---|---|---|---|
| model_name | str | | sklearn’s Machine Learning model to try. Currently supports DecisionTree, AdaBoost, RandomForest |
| param_grid | dict | | Dictionary with parameter names (str) as keys and lists of parameter settings to try as values |
| X_trn | pd.DataFrame | | Training dataframe |
| y_trn | pd.Series \| np.ndarray | | Training label |
| is_regression | bool | False | Whether this is a regression problem or a classification problem |
| custom_cv | int | 5 | sklearn’s cross-validation splitting strategy |
| random_cv_iter | NoneType | None | Number of parameter settings that are sampled. Set this if you want to run RandomizedSearchCV |
| scoring | NoneType | None | Scoring metric used to evaluate parameter settings (e.g. an sklearn scoring string) |
| seed | int | 42 | Random seed |
| rank_show | int | 10 | Number of ranks to show (descending order) |
| show_split_scores | bool | True | Whether to show both train and test split scores |
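A hedged example of a grid search over a small DecisionTree grid. Leaving random_cv_iter=None selects grid search per the description above; setting it to an integer switches to randomized search:

```python
import pandas as pd
from sklearn.datasets import make_classification

from ml_helpers import tune_sklearn_model  # import path assumed

X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X_trn = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

tune_sklearn_model(
    model_name="DecisionTree",
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 20]},
    X_trn=X_trn,
    y_trn=pd.Series(y),
    is_regression=False,
    custom_cv=5,              # 5-fold cross-validation
    random_cv_iter=None,      # None -> GridSearchCV; an int -> RandomizedSearchCV
    scoring="accuracy",
    rank_show=5,              # show the top 5 ranked parameter settings
    show_split_scores=True,
)
```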
get_adaboost_info
get_adaboost_info (dt_params, ada_params, X, y, seed=42)
show_both_cv
show_both_cv (search_cv, default_cv, scoring, top_n=10, show_split_scores=False)
summarize_default_cv
summarize_default_cv (default_cv, s)
summarize_cv_results
summarize_cv_results (search_cv, scoring, top_n=10, show_split_scores=False)
do_param_search
do_param_search (X_train, y_train, estimator, param_grid, random_cv_iter=None, include_default=True, cv=None, scoring=None, seed=42)
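The remaining helpers are lower-level building blocks used by tune_sklearn_model. Their behaviour isn’t documented here, so the following call to do_param_search is only a sketch based on its signature; the choice of estimator, the grid, and the returned object are assumptions:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

from ml_helpers import do_param_search  # import path assumed

X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

# Grid search (random_cv_iter=None) over max_depth, also evaluating the
# estimator's default parameters since include_default=True
search = do_param_search(
    X_train, pd.Series(y),
    estimator=DecisionTreeClassifier(random_state=42),  # estimator choice is an assumption
    param_grid={"max_depth": [3, 5, None]},
    random_cv_iter=None,
    include_default=True,
    cv=5,
    scoring="accuracy",
    seed=42,
)
```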