Language Model Main Functions and Controller

For in-depth tutorials, click here for the RoBERTa language model, and here for the GPT language model.
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM

source

language_model_init

 language_model_init (model_class, cpoint_path=None, config=None,
                      device=None, seed=None)

Initialize a language model, either masked or causal.

|  | Type | Default | Details |
|---|---|---|---|
| model_class |  |  | Model's class object, e.g. AutoModelForMaskedLM |
| cpoint_path | NoneType | None | Either a model name on the HuggingFace Hub, or a path to a model checkpoint. Use None to train from scratch |
| config | NoneType | None | Model config. If not provided, AutoConfig is used to load the config from cpoint_path |
| device | NoneType | None | Device to train on |
| seed | NoneType | None | Random seed |
_model1 = language_model_init(AutoModelForMaskedLM,
                              'roberta-base')
_model1
Total parameters: 124697433
Total trainable parameters: 124697433
RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (lm_head): RobertaLMHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (decoder): Linear(in_features=768, out_features=50265, bias=True)
  )
)
_model1 = language_model_init(AutoModelForMaskedLM,
                              'nguyenvulebinh/envibert')
_model1
Total parameters: 70764377
Total trainable parameters: 70764377
RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(59993, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=1024, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=1024, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (lm_head): RobertaLMHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (decoder): Linear(in_features=768, out_features=59993, bias=True)
  )
)
_model2 = language_model_init(AutoModelForCausalLM,
                              'gpt2')
_model2
Total parameters: 124439808
Total trainable parameters: 124439808
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
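
The examples above load pretrained weights. To train from scratch instead, pass cpoint_path=None and supply a config explicitly, since there is no checkpoint for AutoConfig to read it from. A minimal sketch (the config choice below is an illustrative assumption, not a requirement):

from transformers import AutoConfig, AutoModelForMaskedLM

# Reuse an existing architecture's config (illustrative choice)
_scratch_config = AutoConfig.from_pretrained('roberta-base')
# cpoint_path=None -> randomly initialized weights, i.e. train from scratch
_model_scratch = language_model_init(AutoModelForMaskedLM,
                                     cpoint_path=None,
                                     config=_scratch_config,
                                     seed=42)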

source

finetune_lm

 finetune_lm (lr, bs, wd, epochs, ddict, tokenizer, o_dir='./tmp_weights',
              save_checkpoint=False, model=None, model_init=None,
              data_collator=None, compute_metrics=None,
              grad_accum_steps=2, lr_scheduler_type='cosine',
              warmup_ratio=0.1, no_valid=False, val_bs=None, seed=None,
              report_to='none', trainer_class=None, len_train=None)

The main model training/finetuning function

|  | Type | Default | Details |
|---|---|---|---|
| lr |  |  | Learning rate |
| bs |  |  | Batch size |
| wd |  |  | Weight decay |
| epochs |  |  | Number of epochs |
| ddict |  |  | The HuggingFace DatasetDict |
| tokenizer |  |  | HuggingFace tokenizer |
| o_dir | str | ./tmp_weights | Directory to save weights |
| save_checkpoint | bool | False | Whether to save weights (checkpoints) to o_dir |
| model | NoneType | None | NLP model |
| model_init | NoneType | None | A function to initialize the model |
| data_collator | NoneType | None | HuggingFace data collator |
| compute_metrics | NoneType | None | A function to compute metrics; defaults to compute_lm_accuracy |
| grad_accum_steps | int | 2 | The batch at each step is divided by this integer, and gradients are accumulated over that many steps |
| lr_scheduler_type | str | cosine | The scheduler type to use: linear, cosine, cosine_with_restarts, polynomial, constant, or constant_with_warmup |
| warmup_ratio | float | 0.1 | The warmup ratio for some learning-rate schedulers |
| no_valid | bool | False | Whether there is a validation set or not |
| val_bs | NoneType | None | Validation batch size |
| seed | NoneType | None | Random seed |
| report_to | str | none | The list of integrations to report results and logs to. Supported platforms are "azure_ml", "comet_ml", "mlflow", "neptune", "tensorboard", "clearml" and "wandb". Use "all" to report to all installed integrations, or "none" for no integrations |
| trainer_class | NoneType | None | You can include the class name of your custom trainer here |
| len_train | NoneType | None | Estimated number of samples in the whole training set (for streaming datasets only) |
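A hedged sketch of calling finetune_lm directly for masked-LM finetuning. Here _tokenized_ddict is assumed to be a DatasetDict that has already been tokenized (e.g. by a text data controller), _model1 is a model returned by language_model_init, and the collator is the standard HuggingFace masked-LM collator; adjust all three to your own pipeline.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
_collator = DataCollatorForLanguageModeling(tokenizer=_tokenizer,
                                            mlm=True,
                                            mlm_probability=0.15)

finetune_lm(lr=1e-4,
            bs=16,
            wd=0.01,
            epochs=3,
            ddict=_tokenized_ddict,   # assumed: pre-tokenized DatasetDict
            tokenizer=_tokenizer,
            model=_model1,            # model from language_model_init above
            data_collator=_collator,
            save_checkpoint=False,
            seed=42)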

source

ModelLMController

 ModelLMController (model, data_store=None, seed=None)

Controller that wraps an NLP language model (and optionally a text data controller) for training and prediction.

|  | Type | Default | Details |
|---|---|---|---|
| model |  |  | NLP language model |
| data_store | NoneType | None | A TextDataLMController/TextDataLMControllerStreaming object |
| seed | NoneType | None | Random seed |
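A minimal construction sketch; tdc_lm stands in for a TextDataLMController (or its streaming variant) that you have already set up, and _model1 is a model from language_model_init:

controller = ModelLMController(_model1,            # model from language_model_init
                               data_store=tdc_lm,  # assumed: a TextDataLMController object
                               seed=42)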

source

ModelLMController.fit

 ModelLMController.fit (epochs, learning_rate, ddict=None,
                        compute_metrics=None, batch_size=16,
                        val_batch_size=None, weight_decay=0.01,
                        lr_scheduler_type='cosine', warmup_ratio=0.1,
                        o_dir='./tmp_weights', save_checkpoint=False,
                        hf_report_to='none', grad_accum_steps=2,
                        tokenizer=None, data_collator=None, is_mlm=None,
                        trainer_class=None, len_train=None)
|  | Type | Default | Details |
|---|---|---|---|
| epochs |  |  | Number of epochs |
| learning_rate |  |  | Learning rate |
| ddict | NoneType | None | DatasetDict to fit (will override data_store) |
| compute_metrics | NoneType | None | A function to compute metrics; defaults to compute_lm_accuracy |
| batch_size | int | 16 | Batch size |
| val_batch_size | NoneType | None | Validation batch size. Set to batch_size if None |
| weight_decay | float | 0.01 | Weight decay |
| lr_scheduler_type | str | cosine | The scheduler type to use: linear, cosine, cosine_with_restarts, polynomial, constant, or constant_with_warmup |
| warmup_ratio | float | 0.1 | The warmup ratio for some learning-rate schedulers |
| o_dir | str | ./tmp_weights | Directory to save weights |
| save_checkpoint | bool | False | Whether to save weights (checkpoints) to o_dir |
| hf_report_to | str | none | The list of HuggingFace-allowed integrations to report results and logs to |
| grad_accum_steps | int | 2 | Gradients are accumulated over this many steps |
| tokenizer | NoneType | None | Tokenizer (overrides the one in data_store) |
| data_collator | NoneType | None | Data collator (overrides the one in data_store) |
| is_mlm | NoneType | None | Whether this is a masked LM or a causal LM |
| trainer_class | NoneType | None | You can include the class name of your custom trainer here |
| len_train | NoneType | None | Estimated number of samples in the whole training set (for streaming datasets only) |
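A hedged usage sketch, assuming the controller was built with a data_store that already carries the tokenizer and data collator (otherwise pass them here explicitly):

controller.fit(epochs=3,
               learning_rate=1e-4,
               batch_size=32,
               weight_decay=0.01,
               lr_scheduler_type='cosine',
               warmup_ratio=0.1,
               save_checkpoint=False)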

source

ModelLMController.predict_raw_text

 ModelLMController.predict_raw_text (content:dict|list|str,
                                     print_result=True, **kwargs)
|  | Type | Default | Details |
|---|---|---|---|
| content | dict \| list \| str |  | Either a single sentence, a list of sentences, or a dictionary whose keys are metadata and whose values are lists |
| print_result | bool | True | Whether to print the result in a readable format, or to return the result |
| kwargs |  |  |  |
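A hedged example for a controller wrapping a masked language model (RoBERTa-style tokenizers use <mask> as the mask token); the exact output depends on print_result and on any **kwargs forwarded to the prediction routine:

# Single sentence, printed in readable form
controller.predict_raw_text('The quick brown fox <mask> over the lazy dog.')

# A list of sentences, returning the predictions instead of printing them
_preds = controller.predict_raw_text(['Paris is the capital of <mask>.',
                                      'I feel <mask> today.'],
                                     print_result=False)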