Model Main Functions and Controller
model_init_classification
model_init_classification (model_class, cpoint_path, output_hidden_states:bool, device=None, config=None, seed=None, body_model=None, model_kwargs={})
*Initialize a classification (or regression) model, either from an existing HuggingFace model or a custom architecture.
Can be used for binary, multi-class single-head, multi-class multi-head, multi-label classification, and regression*
| | Type | Default | Details |
|---|---|---|---|
| model_class | | | Model's class object, e.g. RobertaHiddenStateConcatForSequenceClassification |
| cpoint_path | | | Either the model name on HuggingFace, or the path to a model checkpoint |
| output_hidden_states | bool | | Whether to output the model's hidden states. Useful when you build a custom classification head |
| device | NoneType | None | Device to train on |
| config | NoneType | None | Model config. If not provided, AutoConfig is used to load the config from cpoint_path |
| seed | NoneType | None | Random seed |
| body_model | NoneType | None | If not None, this is used to initialize the model's body. If you only want to load the model checkpoint in cpoint_path, leave this as None |
| model_kwargs | dict | {} | Keyword arguments for the model (both head and body) |
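A minimal usage sketch follows. Using `AutoModelForSequenceClassification` as the model class and passing `num_labels` through `model_kwargs` are illustrative assumptions, not confirmed behavior of this library:

```python
from transformers import AutoModelForSequenceClassification

# Sketch: initialize a single-head classifier from a HuggingFace checkpoint.
model = model_init_classification(
    model_class=AutoModelForSequenceClassification,  # or a custom class, e.g. RobertaHiddenStateConcatForSequenceClassification
    cpoint_path='roberta-base',      # HuggingFace model name or local checkpoint path
    output_hidden_states=False,      # set True when building a custom classification head
    seed=42,
    model_kwargs={'num_labels': 3},  # assumption: config kwargs are forwarded this way
)
```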
compute_metrics
compute_metrics (pred, metric_funcs=[], metric_types=[], head_sizes=[], label_names=[], is_multilabel=False, multilabel_threshold=0.5)
*Return a dictionary of metric names and their values.
Reference: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_utils.py#L107C16-L107C16*
| | Type | Default | Details |
|---|---|---|---|
| pred | | | An EvalPrediction object from HuggingFace (a named tuple with predictions and label_ids attributes) |
| metric_funcs | list | [] | A list of metric functions to evaluate |
| metric_types | list | [] | Type of metric ('classification' or 'regression') for each metric function above |
| head_sizes | list | [] | Class size for each head. A regression head has head size 1 |
| label_names | list | [] | Names of the label (dependent variable) columns |
| is_multilabel | bool | False | Whether this is a multilabel classification |
| multilabel_threshold | float | 0.5 | Threshold for multilabel classification (>= threshold is positive) |
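Since HuggingFace's `Trainer` expects a `compute_metrics` callable that takes only the `EvalPrediction`, the remaining arguments can be bound with `functools.partial`. A minimal sketch, assuming sklearn's `accuracy_score` is a valid metric function here and using a hypothetical 3-class head with a `label` column:

```python
from functools import partial
from sklearn.metrics import accuracy_score

# Bind the metric configuration so the Trainer can call metric_fn(pred).
metric_fn = partial(
    compute_metrics,
    metric_funcs=[accuracy_score],
    metric_types=['classification'],
    head_sizes=[3],          # a single head with 3 classes (illustrative)
    label_names=['label'],   # hypothetical label column
    is_multilabel=False,
)
```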
compute_metrics_separate_heads
compute_metrics_separate_heads (pred, metric_funcs=[], label_names=[], **kwargs)
*Return a dictionary of metric names and their values. This is used in Deep Hierarchical Classification (a special case of multi-head classification).
This metric function is mainly used when you have a separate logit output for each head (instead of the typical multi-head logit output, where all heads' logits are concatenated)*
| | Type | Default | Details |
|---|---|---|---|
| pred | | | An EvalPrediction object from HuggingFace (a named tuple with predictions and label_ids attributes) |
| metric_funcs | list | [] | A list of metric functions to evaluate |
| label_names | list | [] | Names of the label (dependent variable) columns |
| kwargs | | | |
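As with `compute_metrics`, the extra arguments can be bound up front. A sketch with two hypothetical hierarchy levels:

```python
from functools import partial
from sklearn.metrics import accuracy_score

# One metric evaluated per head; 'level_1' and 'level_2' are hypothetical
# label columns for a two-level hierarchy.
metric_fn = partial(
    compute_metrics_separate_heads,
    metric_funcs=[accuracy_score],
    label_names=['level_1', 'level_2'],
)
```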
loss_for_classification
loss_for_classification (logits, labels, is_multilabel=False, is_multihead=False, head_sizes=[], head_weights=[])
*The general loss function for classification

- If is_multilabel is False and is_multihead is False: Single-Head Classification, e.g. you predict 1 out of n classes
- If is_multilabel is False and is_multihead is True: Multi-Head Classification, e.g. you predict 1 out of n classes at Level 1, and 1 out of m classes at Level 2
- If is_multilabel is True and is_multihead is False: Single-Head Multi-Label Classification, e.g. you predict x out of n classes (x >= 0)
- If is_multilabel is True and is_multihead is True: Not supported*
| | Type | Default | Details |
|---|---|---|---|
| logits | | | Output of the last linear layer, before any softmax/sigmoid. Size: (bs, class_size) |
| labels | | | Determined by your DatasetDict. Size: (bs, number_of_head) |
| is_multilabel | bool | False | Whether this is a multilabel classification |
| is_multihead | bool | False | Whether this is a multihead classification |
| head_sizes | list | [] | Class size for each head. A regression head has head size 1 |
| head_weights | list | [] | Loss weight for each head. Defaults to 1 for each head |
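A sketch of the multi-head case, with tensor shapes matching the table above (a 4-class Level-1 head and a 10-class Level-2 head; all values are illustrative):

```python
import torch

bs = 8
# Concatenated logits for both heads: (bs, 4 + 10), pre-softmax.
logits = torch.randn(bs, 14)
# One label column per head: column 0 in [0, 4), column 1 in [0, 10).
labels = torch.stack(
    [torch.randint(0, 4, (bs,)), torch.randint(0, 10, (bs,))], dim=1
)

loss = loss_for_classification(
    logits, labels,
    is_multilabel=False,
    is_multihead=True,
    head_sizes=[4, 10],
    head_weights=[1.0, 1.0],  # the documented default weighting
)
```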
finetune
finetune (lr, bs, wd, epochs, ddict, tokenizer, o_dir='./tmp_weights', save_checkpoint=False, model=None, model_init=None, data_collator=None, compute_metrics=None, grad_accum_steps=2, lr_scheduler_type='cosine', warmup_ratio=0.1, no_valid=False, val_bs=None, seed=None, report_to='none', trainer_class=None, len_train=None)
The main model training/finetuning function
| | Type | Default | Details |
|---|---|---|---|
| lr | | | Learning rate |
| bs | | | Batch size |
| wd | | | Weight decay |
| epochs | | | Number of epochs |
| ddict | | | The HuggingFace DatasetDict |
| tokenizer | | | HuggingFace tokenizer |
| o_dir | str | ./tmp_weights | Directory to save weights |
| save_checkpoint | bool | False | Whether to save weights (checkpoints) to o_dir |
| model | NoneType | None | NLP model |
| model_init | NoneType | None | A function to initialize the model |
| data_collator | NoneType | None | HuggingFace data collator |
| compute_metrics | NoneType | None | A function to compute metrics, e.g. compute_metrics |
| grad_accum_steps | int | 2 | The batch at each step is divided by this integer, and gradients are accumulated over that many steps |
| lr_scheduler_type | str | cosine | The scheduler type to use: linear, cosine, cosine_with_restarts, polynomial, constant, or constant_with_warmup |
| warmup_ratio | float | 0.1 | The warmup ratio for some lr schedulers |
| no_valid | bool | False | Whether there is a validation set |
| val_bs | NoneType | None | Validation batch size |
| seed | NoneType | None | Random seed |
| report_to | str | none | The list of integrations to report results and logs to. Supported platforms are "azure_ml", "comet_ml", "mlflow", "neptune", "tensorboard", "clearml" and "wandb". Use "all" to report to all installed integrations, "none" for no integrations |
| trainer_class | NoneType | None | The class of your custom trainer, if any |
| len_train | NoneType | None | Estimated number of samples in the whole training set (for streaming datasets only) |
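A sketch of a finetuning call, assuming `tokenized_ddict`, `tokenizer`, `model`, and `metric_fn` were produced by the steps above (the return value, if any, is not documented here):

```python
finetune(
    lr=2e-5,
    bs=16,
    wd=0.01,
    epochs=3,
    ddict=tokenized_ddict,      # hypothetical tokenized DatasetDict with train/validation splits
    tokenizer=tokenizer,
    model=model,
    compute_metrics=metric_fn,  # e.g. the partial built from compute_metrics
    o_dir='./tmp_weights',
    save_checkpoint=False,
    seed=42,
)
```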
ModelController
ModelController (model, data_store=None, seed=None)
Initialize a ModelController, which wraps an NLP model and, optionally, a data controller for training and prediction.
| | Type | Default | Details |
|---|---|---|---|
| model | | | NLP model |
| data_store | NoneType | None | A TextDataController/TextDataControllerStreaming object |
| seed | NoneType | None | Random seed |
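A sketch of wiring a model and a data controller together (`tdc` is a hypothetical `TextDataController` built elsewhere):

```python
# Sketch: the controller reuses the tokenizer and label metadata in tdc.
controller = ModelController(model, data_store=tdc, seed=42)
```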
ModelController.fit
ModelController.fit (epochs, learning_rate, ddict=None, metric_funcs=[accuracy_score], metric_types=[], batch_size=16, val_batch_size=None, weight_decay=0.01, lr_scheduler_type='cosine', warmup_ratio=0.1, o_dir='./tmp_weights', save_checkpoint=False, hf_report_to='none', compute_metrics=compute_metrics, grad_accum_steps=2, tokenizer=None, label_names=None, head_sizes=None, trainer_class=None, len_train=None)
| | Type | Default | Details |
|---|---|---|---|
| epochs | | | Number of epochs |
| learning_rate | | | Learning rate |
| ddict | NoneType | None | DatasetDict to fit (will override data_store) |
| metric_funcs | list | [accuracy_score] | A list of metric functions (can be from sklearn) |
| metric_types | list | [] | A list of metric types ('classification' or 'regression') that matches the metric function list |
| batch_size | int | 16 | Batch size |
| val_batch_size | NoneType | None | Validation batch size. Set to batch_size if None |
| weight_decay | float | 0.01 | Weight decay |
| lr_scheduler_type | str | cosine | The scheduler type to use: linear, cosine, cosine_with_restarts, polynomial, constant, or constant_with_warmup |
| warmup_ratio | float | 0.1 | The warmup ratio for some lr schedulers |
| o_dir | str | ./tmp_weights | Directory to save weights |
| save_checkpoint | bool | False | Whether to save weights (checkpoints) to o_dir |
| hf_report_to | str | none | The list of HuggingFace-allowed integrations to report results and logs to |
| compute_metrics | function | compute_metrics | A function to compute metrics, e.g. compute_metrics, which utilizes the given metric_funcs |
| grad_accum_steps | int | 2 | Gradients are accumulated over this many steps |
| tokenizer | NoneType | None | Tokenizer (overrides the one in data_store) |
| label_names | NoneType | None | Names of the label (dependent variable) columns (override the ones in data_store) |
| head_sizes | NoneType | None | Class size for each head (overrides the one in model) |
| trainer_class | NoneType | None | The class of your custom trainer, if any |
| len_train | NoneType | None | Number of samples in the whole training set (for streaming datasets only) |
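A sketch of training through the controller, relying on the tokenizer and label names stored in `data_store`; `accuracy_score` mirrors the documented default:

```python
from sklearn.metrics import accuracy_score

controller.fit(
    epochs=3,
    learning_rate=2e-5,
    metric_funcs=[accuracy_score],
    metric_types=['classification'],
    batch_size=16,
)
```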
ModelController.predict_raw_text
ModelController.predict_raw_text (content:Union[dict,list,str], is_multilabel=None, multilabel_threshold=0.5, topk=1, are_heads_separated=False)
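A sketch of predicting on raw, unprocessed text; per the signature, `content` may be a single string, a list of strings, or a dict (the structure of the return value is not documented here):

```python
# Single string and batch-of-strings inputs; topk=3 returns the top 3
# labels per head.
single = controller.predict_raw_text('This is a sample sentence.')
batch = controller.predict_raw_text(['First text.', 'Second text.'], topk=3)
```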
ModelController.predict_raw_dset
ModelController.predict_raw_dset (dset, batch_size=16, do_filtering=False, is_multilabel=None, multilabel_threshold=0.5, topk=1, are_heads_separated=False)
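A sketch for a raw (untokenized) Dataset; interpreting `do_filtering` as applying the data controller's content filtering before prediction is an assumption based on the parameter name:

```python
# raw_dset is a hypothetical HuggingFace Dataset of unprocessed text.
preds = controller.predict_raw_dset(raw_dset, batch_size=32, do_filtering=False)
```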
ModelController.predict_ddict
ModelController.predict_ddict (ddict:Union[datasets.dataset_dict.DatasetDict,datasets.arrow_dataset.Dataset]=None, ds_type='test', batch_size=16, is_multilabel=None, multilabel_threshold=0.5, topk=1, tokenizer=None, label_names=None, class_names_predefined=None, are_heads_separated=False)
| | Type | Default | Details |
|---|---|---|---|
| ddict | DatasetDict \| Dataset | None | A processed and tokenized DatasetDict/Dataset (will override the one in data_store) |
| ds_type | str | test | The split of the DatasetDict to predict on |
| batch_size | int | 16 | Batch size for making predictions on GPU |
| is_multilabel | NoneType | None | Whether this is a multilabel classification |
| multilabel_threshold | float | 0.5 | Threshold for multilabel classification |
| topk | int | 1 | Number of labels to return for each head |
| tokenizer | NoneType | None | Tokenizer (overrides the one in data_store) |
| label_names | NoneType | None | Names of the label (dependent variable) columns (override the ones in data_store) |
| class_names_predefined | NoneType | None | List of names associated with the labels (same index order) (overrides the ones in data_store) |
| are_heads_separated | bool | False | Whether the model outputs separate logits for each head |
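A sketch of predicting on an already processed and tokenized DatasetDict, e.g. one produced by the data controller during training (`tokenized_ddict` is hypothetical):

```python
# Predict on the 'test' split; pass ddict=None to fall back to the
# DatasetDict stored in data_store.
result = controller.predict_ddict(tokenized_ddict, ds_type='test', batch_size=32)
```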