Roberta model (Custom Single Head)

This notebook contains examples of how to use the Roberta-based models in this NLP library.

In this series, we walk through some of the capabilities of this library: single-head classification, multi-head classification, multi-label classification, and regression. If you want a more detailed tutorial, check this out.

import os
# This specifies a GPU (or a comma-separated list of GPUs) to use for training
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *
from that_nlp_library.utils import seed_everything
from underthesea import text_normalize
from functools import partial
from pathlib import Path
import pandas as pd
import numpy as np
import nlpaug.augmenter.char as nac
from datasets import load_dataset
import random
from transformers import RobertaTokenizer
from datasets import Dataset

Define the custom augmentation function

def nlp_aug_stochastic(x, aug=None, p=0.5):
    # single string: augment it with probability p
    if not isinstance(x, list):
        if random.random() < p: return aug.augment(x)[0]
        return x
    # list of strings: pick each item for augmentation with probability p
    news = []
    originals = []
    for _x in x:
        if random.random() < p: news.append(_x)
        else: originals.append(_x)
    # only call the augmenter when there is something to augment
    if len(news): news = aug.augment(news)
    # note: augmented items come first, so the original order is not preserved
    return news + originals
aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.3)
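
As a quick sanity check, you can call the augmentation function directly. The inputs below are made-up examples, and since the augmentation is stochastic the outputs will vary between runs:

# each input is augmented with probability p=0.3, so outputs differ across runs
print(nearby_aug_func('this dress fits perfectly and the fabric is soft'))
print(nearby_aug_func(['great quality','runs a bit small','love the color']))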

Create a TextDataController object

We will reuse the data and the preprocessing steps from this tutorial.

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         # add "str.lower" here because nearby_aug might return uppercase characters
                         val_ratio=0.2,
                         batch_size=1000,
                         seed=42,
                         num_proc=20,
                         verbose=False
                        )

Define our tokenizer for Roberta

_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
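
If you are curious about the tokenizer's output format, you can call it on a sample string (an illustrative call; the resulting input_ids and attention_mask are the same fields that appear in the processed dataset below):

# returns a dict with 'input_ids' and 'attention_mask' for the sample string
_tokenizer('simple and elegant')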

Process and tokenize our dataset

tdc.process_and_tokenize(_tokenizer,max_length=100,shuffle_trn=True)
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18102
    })
    validation: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})

Model Experiment: Roberta Single-Head Classification (with hidden layer concatenation)

Define and train a custom Roberta model

from transformers.models.roberta.modeling_roberta import RobertaModel
from that_nlp_library.models.roberta.classifiers import *
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score
num_classes = len(tdc.label_lists[0])
roberta_body = RobertaModel.from_pretrained('roberta-base')
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Then we can define a classification head. One trick to boost the performance of the entire model is to concatenate the [CLS] outputs from the last four layers of the pre-trained Roberta model (source: https://ieeexplore.ieee.org/document/9335912). The library already defines such a custom head (ConcatHeadSimple), along with the architecture needed to make it work (RobertaHiddenStateConcatForSequenceClassification).
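
To make the trick concrete, here is a minimal, illustrative sketch of such a head. This is a toy version under assumed names and defaults, not the library's actual ConcatHeadSimple implementation:

import torch
import torch.nn as nn

class ToyConcatHead(nn.Module):
    # toy sketch: concatenate the [CLS] vectors of the last `layer2concat` hidden layers
    def __init__(self, hidden_size=768, layer2concat=4, num_classes=6, dropout=0.1):
        super().__init__()
        self.layer2concat = layer2concat
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size*layer2concat, num_classes)

    def forward(self, hidden_states):
        # hidden_states: tuple of (num_layers+1) tensors of shape (batch, seq_len, hidden_size),
        # available when the body model is run with output_hidden_states=True
        cls_vectors = [h[:,0,:] for h in hidden_states[-self.layer2concat:]] # token 0 plays the [CLS] role
        concat = torch.cat(cls_vectors, dim=-1) # (batch, hidden_size*layer2concat)
        return self.classifier(self.dropout(concat))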

# our model is more complex, so it's best to define some of its arguments
_model_kwargs={
    # overall model hyperparams
    'head_class_sizes':num_classes,
    'head_class': ConcatHeadSimple,
    # classification head hyperparams
    'layer2concat':2, # you can change the number of layers to concat (default is 4, based on the paper)
    'classifier_dropout':0.1 
}
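
With layer2concat=2 and roberta-base's hidden size of 768, the classification head will receive a concatenated vector of 2 * 768 = 1536 features.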

model = model_init_classification(model_class = RobertaHiddenStateConcatForSequenceClassification,
                                  cpoint_path = 'roberta-base', 
                                  output_hidden_states=True, # since we are using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=roberta_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(model,tdc,seed=42)
Loading body weights. This assumes the body is the very first block of your custom architecture
Total parameters: 124654854
Total trainable parameters: 124654854

And we can start training our model

seed_everything(42)
lr = 1e-4
bs = 32
wd = 0.01
epochs = 3

controller.fit(epochs,lr,
               metric_funcs=metric_funcs,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics,
              )
[849/849 06:17, Epoch 3/3]
Epoch | Training Loss | Validation Loss | F1 Score (Department Name) | Accuracy Score (Department Name)
1 | No log | 0.305263 | 0.744996 | 0.916041
2 | 0.431600 | 0.266121 | 0.752087 | 0.922448
3 | 0.431600 | 0.270825 | 0.752354 | 0.923111

controller.trainer.model.save_pretrained('./sample_weights/my_model')

Make predictions

Load trained model

_model_kwargs
{'head_class_sizes': 6,
 'head_class': that_nlp_library.models.roberta.classifiers.ConcatHeadSimple,
 'layer2concat': 2,
 'classifier_dropout': 0.1}
trained_model = model_init_classification(model_class = RobertaHiddenStateConcatForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/my_model'), 
                                          output_hidden_states=True,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

controller = ModelController(trained_model,tdc,seed=42)
Some weights of the model checkpoint at sample_weights/my_model were not used when initializing RobertaHiddenStateConcatForSequenceClassification: ['body_model.pooler.dense.bias', 'body_model.pooler.dense.weight']
- This IS expected if you are initializing RobertaHiddenStateConcatForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHiddenStateConcatForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Total parameters: 124064262
Total trainable parameters: 124064262
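
Note that the parameter count is 590,592 lower than during training (124,064,262 vs 124,654,854): that is exactly the size of the unused pooler layer flagged in the warning above (768 x 768 weights plus 768 biases).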

Predict Train/Validation set

df_val = controller.predict_ddict(ds_type='validation')
-------------------- Start making predictions --------------------
df_val = df_val.to_pandas()
df_val.head()
  | Title | Review Text | Division Name | Department Name | label | input_ids | attention_mask | pred_Department Name | pred_prob_Department Name
0 |  | general petite . . such a fun jacket ! great t... | general petite | Intimate | 2 | [0, 15841, 4716, 1459, 479, 479, 215, 10, 1531... | [1, 1, 1, ...] | Jackets | 0.776409
1 | simple and elegant | general petite . simple and elegant . i though... | general petite | Tops | 4 | [0, 15841, 4716, 1459, 479, 2007, 8, 14878, 47... | [1, 1, 1, ...] | Tops | 0.997269
2 | retro and pretty | general . retro and pretty . this top has a bi... | general | Tops | 4 | [0, 15841, 479, 11299, 8, 1256, 479, 42, 299, ... | [1, 1, 1, ...] | Tops | 0.997735
3 | summer/fall wear | general petite . summer / fall wear . i first ... | general petite | Dresses | 1 | [0, 15841, 4716, 1459, 479, 1035, 1589, 1136, ... | [1, 1, 1, ...] | Dresses | 0.980090
4 | perfect except slip | general petite . perfect except slip . this is... | general petite | Dresses | 1 | [0, 15841, 4716, 1459, 479, 1969, 4682, 9215, ... | [1, 1, 1, ...] | Dresses | 0.990936

You can compute the metric yourself to check that it matches the last training epoch's value above.

f1_score(df_val['Department Name'],df_val['pred_Department Name'],average='macro')
0.7523539516254371
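
Similarly, the accuracy should match the last epoch's value (~0.9231):

# sanity check: compare against the accuracy reported for epoch 3
accuracy_score(df_val['Department Name'],df_val['pred_Department Name'])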

Predict Test set

We will go through the details of making predictions on a completely new, raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set.

df_test = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig').sample(frac=0.2,random_state=1)
# drop NaN values in the label column
df_test = df_test[~df_test['Department Name'].isna()].reset_index(drop=True)

# save the labels, as we will calculate some metrics later. We also filter out labels whose Review Text is NaN,
# since the same filtering will be applied to the test set during processing
true_labels = df_test.loc[~df_test['Review Text'].isna(),'Department Name'].values

# drop the label column (not strictly necessary, but this simulates an actual test set)
df_test.drop('Department Name',axis=1,inplace=True)
_test_dset = Dataset.from_pandas(df_test)
_test_dset_predicted = controller.predict_raw_dset(_test_dset,
                                                   do_filtering=True, # since we have some text filtering in the processing
                                                  )
-------------------- Start making predictions --------------------
df_test_predicted = _test_dset_predicted.to_pandas()
df_test_predicted.head()
  | Title | Review Text | Division Name | input_ids | attention_mask | pred_Department Name | pred_prob_Department Name
0 | perfect for work and play | general . perfect for work and play . this shi... | general | [0, 15841, 479, 1969, 13, 173, 8, 310, 479, 42... | [1, 1, 1, ...] | Tops | 0.997632
1 |  | general petite . . i don't know why i had the ... | general petite | [0, 15841, 4716, 1459, 479, 479, 939, 218, 75,... | [1, 1, 1, ...] | Bottoms | 0.993063
2 | great pants | general petite . great pants . thes e cords ar... | general petite | [0, 15841, 4716, 1459, 479, 372, 9304, 479, 5,... | [1, 1, 1, ...] | Bottoms | 0.980067
3 | surprisingly comfy for a button down | general petite . surprisingly comfy for a butt... | general petite | [0, 15841, 4716, 1459, 479, 10262, 3137, 24382... | [1, 1, 1, ...] | Tops | 0.995013
4 | short and small | general petite . short and small . the shirt i... | general petite | [0, 15841, 4716, 1459, 479, 765, 8, 650, 479, ... | [1, 1, 1, ...] | Tops | 0.997465

Let’s quickly check the f1 score to make sure everything works correctly

f1_score(true_labels,df_test_predicted['pred_Department Name'],average='macro')
# 0.7551497694535967
0.7554695549351907

Predict top k results

_test_dset = Dataset.from_pandas(df_test)
_test_dset_predicted = controller.predict_raw_dset(_test_dset,
                                                   do_filtering=True,
                                                   topk=3
                                                  )
-------------------- Start making predictions --------------------
df_test_predicted = _test_dset_predicted.to_pandas()

df_test_predicted.head()
  | Title | Review Text | Division Name | input_ids | attention_mask | pred_Department Name | pred_prob_Department Name
0 | perfect for work and play | general . perfect for work and play . this shi... | general | [0, 15841, 479, 1969, 13, 173, 8, 310, 479, 42... | [1, 1, 1, ...] | [Tops, Trend, Intimate] | [0.99763227, 0.0011167374, 0.000746253]
1 |  | general petite . . i don't know why i had the ... | general petite | [0, 15841, 4716, 1459, 479, 479, 939, 218, 75,... | [1, 1, 1, ...] | [Bottoms, Intimate, Trend] | [0.9930628, 0.0033172437, 0.0027576974]
2 | great pants | general petite . great pants . thes e cords ar... | general petite | [0, 15841, 4716, 1459, 479, 372, 9304, 479, 5,... | [1, 1, 1, ...] | [Bottoms, Intimate, Trend] | [0.980067, 0.01673956, 0.0024985557]
3 | surprisingly comfy for a button down | general petite . surprisingly comfy for a butt... | general petite | [0, 15841, 4716, 1459, 479, 10262, 3137, 24382... | [1, 1, 1, ...] | [Tops, Intimate, Trend] | [0.9950134, 0.001822388, 0.00145723]
4 | short and small | general petite . short and small . the shirt i... | general petite | [0, 15841, 4716, 1459, 479, 765, 8, 650, 479, ... | [1, 1, 1, ...] | [Tops, Trend, Intimate] | [0.997465, 0.001083337, 0.00081098813]
Predict one single raw input

# Since we have metadata columns (Title and Division Name), we need to define a dictionary containing those values as well
raw_content={'Review Text': 'This shirt is so comfortable I love it!',
             'Title': 'Great shirt',
             'Division Name': 'general'}
df_result = controller.predict_raw_text(raw_content,topk=3)
-------------------- Start making predictions --------------------
df_result
{'Review Text': ['general . great shirt . this shirt is so comfortable i love it !'],
 'Title': ['great shirt'],
 'Division Name': ['general'],
 'input_ids': [[0,
   15841,
   479,
   372,
   6399,
   479,
   42,
   6399,
   16,
   98,
   3473,
   939,
   657,
   24,
   27785,
   2]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
 'pred_Department Name': [['Tops', 'Trend', 'Intimate']],
 'pred_prob_Department Name': [[0.9976713061332703,
   0.0011040765093639493,
   0.0007168611045926809]]}
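
Since the top-k results come back as parallel lists, you can pair each predicted label with its probability, for example:

# pair each of the top-3 predicted labels with its probability
for lbl,prob in zip(df_result['pred_Department Name'][0],df_result['pred_prob_Department Name'][0]):
    print(f'{lbl}: {prob:.4f}')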