Model Main Functions and Controller

For an in-depth walkthrough, see the classification tutorial or the regression tutorial.

model_init_classification

 model_init_classification (model_class, cpoint_path,
                            output_hidden_states:bool, device=None,
                            config=None, seed=None, body_model=None,
                            model_kwargs={})

*Initialize a classification (or regression) model, either from an existing HuggingFace model or from a custom architecture.

Can be used for binary, multi-class single-head, multi-class multi-head, and multi-label classification, as well as regression.*

| Parameter | Type | Default | Details |
|---|---|---|---|
| model_class | | | Model’s class object, e.g. RobertaHiddenStateConcatForSequenceClassification |
| cpoint_path | | | Either the model’s string name on HuggingFace or the path to a model checkpoint |
| output_hidden_states | bool | | Whether to output the model’s hidden states. Useful when building a custom classification head |
| device | NoneType | None | Device to train on |
| config | NoneType | None | Model config. If not provided, AutoConfig is used to load the config from cpoint_path |
| seed | NoneType | None | Random seed |
| body_model | NoneType | None | If not None, this is used to initialize the model’s body. To load only the model checkpoint in cpoint_path, leave this as None |
| model_kwargs | dict | {} | Keyword arguments for the model (both head and body) |

compute_metrics

 compute_metrics (pred, metric_funcs=[], metric_types=[], head_sizes=[],
                  label_names=[], is_multilabel=False,
                  multilabel_threshold=0.5)

*Return a dictionary mapping each metric name to its value.

Reference: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_utils.py#L107C16-L107C16*

| Parameter | Type | Default | Details |
|---|---|---|---|
| pred | | | An EvalPrediction object from HuggingFace (a named tuple with predictions and label_ids attributes) |
| metric_funcs | list | [] | A list of metric functions to evaluate |
| metric_types | list | [] | Type of metric (‘classification’ or ‘regression’) for each metric function above |
| head_sizes | list | [] | Class size for each head. A regression head has head size 1 |
| label_names | list | [] | Names of the label (dependent variable) columns |
| is_multilabel | bool | False | Whether this is a multilabel classification |
| multilabel_threshold | float | 0.5 | Threshold for multilabel (>= threshold is positive) |
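
As a rough illustration of how these arguments fit together, the sketch below splits concatenated multi-head logits by `head_sizes`, takes the argmax for each head, and applies every metric function to each label column. It works on plain Python lists rather than an `EvalPrediction`. The helper name `sketch_compute_metrics`, the local `accuracy` function, the label names, and the `{metric}_{label}` key format are all assumptions made for this example; this is not the library's implementation or its actual output format.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; a stand-in for a metric such as accuracy_score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sketch_compute_metrics(predictions, label_ids, metric_funcs, head_sizes, label_names):
    """Hypothetical helper: per-head metrics from concatenated logits.

    predictions: one row per sample, all heads' logits concatenated
    label_ids:   one row per sample, one integer label per head
    """
    results = {}
    start = 0
    for head_idx, (size, name) in enumerate(zip(head_sizes, label_names)):
        # Slice out this head's logits and pick the highest-scoring class.
        y_pred = [max(range(size), key=lambda c: row[start + c]) for row in predictions]
        y_true = [row[head_idx] for row in label_ids]
        for func in metric_funcs:
            results[f"{func.__name__}_{name}"] = func(y_true, y_pred)
        start += size
    return results


# Two heads: 3 classes for "level_1" and 2 classes for "level_2" (made-up names and numbers).
logits = [
    [2.0, 0.1, 0.3, 0.2, 1.5],
    [0.1, 0.2, 3.0, 2.2, 0.4],
    [0.5, 1.9, 0.1, 0.3, 0.9],
    [1.2, 0.3, 0.4, 0.1, 0.8],
]
labels = [[0, 1], [2, 0], [1, 0], [1, 1]]

metrics = sketch_compute_metrics(logits, labels, [accuracy], [3, 2], ["level_1", "level_2"])
print(metrics)
```

Each head gets three of its four predictions right here, so both entries in the printed dictionary are 0.75.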

compute_metrics_separate_heads

 compute_metrics_separate_heads (pred, metric_funcs=[], label_names=[],
                                 **kwargs)

*Return a dictionary mapping each metric name to its value. This is used in Deep Hierarchical Classification (a special case of multi-head classification).

This metric function is mainly used when the model produces a separate logit output for each head, instead of the typical multi-head output in which all heads’ logits are concatenated.*

| Parameter | Type | Default | Details |
|---|---|---|---|
| pred | | | An EvalPrediction object from HuggingFace (which is a named tuple with predictions and label_ids attributes) |
| metric_funcs | list | [] | A list of metric functions to evaluate |
| label_names | list | [] | Names of the label (dependent variable) columns |
| kwargs | | | |
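
When every head has its own logit output, nothing needs to be split: each head's predictions are read from its own array. The following is a minimal sketch of that layout using plain Python lists. The helper `sketch_metrics_separate_heads`, the local `accuracy` function, the label names, and the result key format are assumptions for illustration only, not the library's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; a stand-in for any (y_true, y_pred) metric."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sketch_metrics_separate_heads(head_logits, label_ids, metric_funcs, label_names):
    """Hypothetical helper: head_logits holds one logit matrix per head."""
    results = {}
    for head_idx, (logits, name) in enumerate(zip(head_logits, label_names)):
        y_pred = [row.index(max(row)) for row in logits]  # argmax per sample
        y_true = [row[head_idx] for row in label_ids]
        for func in metric_funcs:
            results[f"{func.__name__}_{name}"] = func(y_true, y_pred)
    return results


# Separate outputs: one list of rows per head (made-up numbers).
head_1 = [[2.0, 0.1, 0.3], [0.1, 0.2, 3.0]]
head_2 = [[0.2, 1.5], [2.2, 0.4]]
labels = [[0, 1], [2, 1]]

separate_metrics = sketch_metrics_separate_heads(
    [head_1, head_2], labels, [accuracy], ["level_1", "level_2"]
)
print(separate_metrics)
```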

loss_for_classification

 loss_for_classification (logits, labels, is_multilabel=False,
                          is_multihead=False, head_sizes=[],
                          head_weights=[])

*The general loss function for classification

  • If is_multilabel is False and is_multihead is False: Single-Head Classification, e.g. you predict 1 out of n classes

  • If is_multilabel is False and is_multihead is True: Multi-Head Classification, e.g. you predict 1 out of n classes at Level 1, and 1 out of m classes at Level 2

  • If is_multilabel is True and is_multihead is False: Single-Head Multi-Label Classification, e.g. you predict x out of n classes (x>=0)

  • If is_multilabel is True and is_multihead is True: Not supported*

| Parameter | Type | Default | Details |
|---|---|---|---|
| logits | | | Output of the last linear layer, before any softmax/sigmoid. Size: (bs, class_size) |
| labels | | | Determined by your DatasetDict. Size: (bs, number_of_head) |
| is_multilabel | bool | False | Whether this is a multilabel classification |
| is_multihead | bool | False | Whether this is a multihead classification |
| head_sizes | list | [] | Class size for each head. A regression head has head size 1 |
| head_weights | list | [] | Loss weight for each head. Defaults to 1 for each head |
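
The three supported cases can be sketched in plain Python as follows. This is an illustration of the idea, not the library's implementation. It assumes softmax cross-entropy per head, sigmoid binary cross-entropy for the multi-label case, mean reduction over the batch, a weighted sum across heads, and labels given as one integer per head (or one 0/1 indicator per class for multi-label). The name `sketch_loss` and its helpers are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (log-sum-exp for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def bce_with_logits(logits, targets):
    """Mean sigmoid binary cross-entropy over the classes of one sample."""
    losses = [
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def sketch_loss(logits, labels, is_multilabel=False, is_multihead=False,
                head_sizes=(), head_weights=()):
    """Hypothetical batch loss mirroring the three supported cases."""
    if is_multilabel and is_multihead:
        raise ValueError("multi-label combined with multi-head is not supported")
    if is_multilabel:
        # labels: one 0/1 indicator per class for every sample
        return sum(bce_with_logits(x, y) for x, y in zip(logits, labels)) / len(logits)
    if not is_multihead:
        head_sizes = [len(logits[0])]  # a single head spanning all classes
    weights = list(head_weights) or [1.0] * len(head_sizes)
    total, start = 0.0, 0
    for head_idx, (size, weight) in enumerate(zip(head_sizes, weights)):
        # Slice this head's logits out of the concatenated row.
        head_loss = sum(
            cross_entropy(row[start:start + size], y[head_idx])
            for row, y in zip(logits, labels)
        ) / len(logits)
        total += weight * head_loss
        start += size
    return total


# Uniform logits make the expected values easy to check by hand.
single = sketch_loss([[0.0, 0.0]], [[0]])
multi = sketch_loss([[0.0] * 6], [[1, 3]], is_multihead=True, head_sizes=[2, 4])
print(single, multi)
```

With all-zero logits, a head with n classes contributes log(n), so the single-head call gives log(2) and the two-head call gives log(2) + log(4).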

finetune

 finetune (lr, bs, wd, epochs, ddict, tokenizer, o_dir='./tmp_weights',
           save_checkpoint=False, model=None, model_init=None,
           data_collator=None, compute_metrics=None, grad_accum_steps=2,
           lr_scheduler_type='cosine', warmup_ratio=0.1, no_valid=False,
           val_bs=None, seed=None, report_to='none', trainer_class=None,
           len_train=None)

The main model training/finetuning function

| Parameter | Type | Default | Details |
|---|---|---|---|
| lr | | | Learning rate |
| bs | | | Batch size |
| wd | | | Weight decay |
| epochs | | | Number of epochs |
| ddict | | | The HuggingFace DatasetDict |
| tokenizer | | | HuggingFace tokenizer |
| o_dir | str | ./tmp_weights | Directory to save weights |
| save_checkpoint | bool | False | Whether to save weights (checkpoints) to o_dir |
| model | NoneType | None | NLP model |
| model_init | NoneType | None | A function to initialize the model |
| data_collator | NoneType | None | HuggingFace data collator |
| compute_metrics | NoneType | None | A function to compute metrics, e.g. compute_metrics |
| grad_accum_steps | int | 2 | The batch at each step is divided by this integer, and gradients are accumulated over grad_accum_steps steps |
| lr_scheduler_type | str | cosine | The scheduler type to use. Options include: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup |
| warmup_ratio | float | 0.1 | The warmup ratio, used by some lr schedulers |
| no_valid | bool | False | Whether there is a validation set or not |
| val_bs | NoneType | None | Validation batch size |
| seed | NoneType | None | Random seed |
| report_to | str | none | The list of integrations to report the results and logs to. Supported platforms are “azure_ml”, “comet_ml”, “mlflow”, “neptune”, “tensorboard”, “clearml” and “wandb”. Use “all” to report to all integrations installed, “none” for no integrations |
| trainer_class | NoneType | None | The class of your custom trainer, if you use one |
| len_train | NoneType | None | Estimated number of samples in the whole training set (for streaming datasets only) |
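
To make the default `lr_scheduler_type='cosine'` and `warmup_ratio=0.1` more concrete, here is a small sketch of a linear-warmup, cosine-decay learning rate curve. It follows the general shape of the cosine-with-warmup schedule used by HuggingFace, but it is a standalone illustration: the function name `cosine_lr_with_warmup` is hypothetical, and the total number of optimizer steps is passed in directly rather than derived from the dataset and batch size.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then half-cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# With 100 steps and warmup_ratio=0.1, the first 10 steps are warmup.
schedule = [cosine_lr_with_warmup(s, total_steps=100, base_lr=1e-3) for s in range(101)]
for s in (0, 5, 10, 55, 100):
    print(s, schedule[s])
```

The rate rises to `base_lr` at the end of warmup, passes through half of `base_lr` midway through the decay phase, and reaches zero at the final step.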

ModelController

 ModelController (model, data_store=None, seed=None)

| Parameter | Type | Default | Details |
|---|---|---|---|
| model | | | NLP model |
| data_store | NoneType | None | A TextDataController/TextDataControllerStreaming object |
| seed | NoneType | None | Random seed |

ModelController.fit

 ModelController.fit (epochs, learning_rate, ddict=None,
                      metric_funcs=[accuracy_score], metric_types=[],
                      batch_size=16, val_batch_size=None,
                      weight_decay=0.01, lr_scheduler_type='cosine',
                      warmup_ratio=0.1, o_dir='./tmp_weights',
                      save_checkpoint=False, hf_report_to='none',
                      compute_metrics=compute_metrics, grad_accum_steps=2,
                      tokenizer=None, label_names=None, head_sizes=None,
                      trainer_class=None, len_train=None)

| Parameter | Type | Default | Details |
|---|---|---|---|
| epochs | | | Number of epochs |
| learning_rate | | | Learning rate |
| ddict | NoneType | None | DatasetDict to fit (will override data_store) |
| metric_funcs | list | [accuracy_score] | A list of metric functions (can be from Sklearn) |
| metric_types | list | [] | A list of metric types (classification or regression), one for each function in metric_funcs |
| batch_size | int | 16 | Batch size |
| val_batch_size | NoneType | None | Validation batch size. Set to batch_size if None |
| weight_decay | float | 0.01 | Weight decay |
| lr_scheduler_type | str | cosine | The scheduler type to use. Options include: linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup |
| warmup_ratio | float | 0.1 | The warmup ratio, used by some lr schedulers |
| o_dir | str | ./tmp_weights | Directory to save weights |
| save_checkpoint | bool | False | Whether to save weights (checkpoints) to o_dir |
| hf_report_to | str | none | The list of HuggingFace-allowed integrations to report the results and logs to |
| compute_metrics | function | compute_metrics | A function to compute metrics, e.g. compute_metrics, which uses the given metric_funcs |
| grad_accum_steps | int | 2 | Gradients are accumulated over this many steps |
| tokenizer | NoneType | None | Tokenizer (to override the one in data_store) |
| label_names | NoneType | None | Names of the label (dependent variable) columns (to override the ones in data_store) |
| head_sizes | NoneType | None | Class size for each head (to override the ones in model) |
| trainer_class | NoneType | None | The class of your custom trainer, if you use one |
| len_train | NoneType | None | Number of samples in the whole training set (for streaming datasets only) |

ModelController.predict_raw_text

 ModelController.predict_raw_text (content:Union[dict,list,str],
                                   is_multilabel=None,
                                   multilabel_threshold=0.5, topk=1,
                                   are_heads_separated=False)

ModelController.predict_raw_dset

 ModelController.predict_raw_dset (dset, batch_size=16,
                                   do_filtering=False, is_multilabel=None,
                                   multilabel_threshold=0.5, topk=1,
                                   are_heads_separated=False)

ModelController.predict_ddict

 ModelController.predict_ddict (ddict:Union[datasets.dataset_dict.DatasetDict,
                                datasets.arrow_dataset.Dataset]=None,
                                ds_type='test', batch_size=16,
                                is_multilabel=None,
                                multilabel_threshold=0.5, topk=1,
                                tokenizer=None, label_names=None,
                                class_names_predefined=None,
                                are_heads_separated=False)

| Parameter | Type | Default | Details |
|---|---|---|---|
| ddict | DatasetDict or Dataset | None | A processed and tokenized DatasetDict/Dataset (will override one in data_store) |
| ds_type | str | test | The split of DatasetDict to predict |
| batch_size | int | 16 | Batch size for making prediction on GPU |
| is_multilabel | NoneType | None | Is this a multilabel classification? |
| multilabel_threshold | float | 0.5 | Threshold for multilabel classification |
| topk | int | 1 | Number of labels to return for each head |
| tokenizer | NoneType | None | Tokenizer (to override one in data_store) |
| label_names | NoneType | None | Names of the label (dependent variable) columns (to override one in data_store) |
| class_names_predefined | NoneType | None | List of names associated with the labels (same index order) (to override one in data_store) |
| are_heads_separated | bool | False | Are outputs (of model) separate heads? |
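
The effect of `topk` and `multilabel_threshold` can be pictured with the sketch below. It assumes that probabilities come from a softmax per head (or a sigmoid per class in the multi-label case) and that a label is kept when its probability is greater than or equal to the threshold. The helpers `topk_per_head` and `multilabel_predictions`, the class names, and the returned structure are hypothetical; the library's actual prediction output may be shaped differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one head's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def topk_per_head(logits, head_sizes, class_names, topk=1):
    """Hypothetical helper: top-k (name, probability) pairs for each head."""
    out, start = [], 0
    for size, names in zip(head_sizes, class_names):
        probs = softmax(logits[start:start + size])
        ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
        out.append(ranked[:topk])
        start += size
    return out


def multilabel_predictions(logits, names, threshold=0.5):
    """Keep every label whose sigmoid probability is >= threshold."""
    return [n for n, x in zip(names, logits) if sigmoid(x) >= threshold]


# One sample, two heads with 3 and 2 classes (made-up names and logits).
top2 = topk_per_head([2.0, 0.1, 0.3, 0.2, 1.5], [3, 2], [["a", "b", "c"], ["x", "y"]], topk=2)
kept = multilabel_predictions([1.2, -0.7, 0.0], ["p", "q", "r"], threshold=0.5)
print([[name for name, _ in head] for head in top2], kept)
```

A logit of exactly 0 maps to a probability of 0.5, so label "r" is kept under the "greater than or equal" rule.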