Roberta Language Model for a streamed dataset

This notebook contains an end-to-end process for preprocessing and tokenizing a streamed dataset, then building a language model based on the Roberta architecture
import os
# Specify the GPU (or a comma-separated list of GPUs) to use for training
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main_lm_streaming import *
from that_nlp_library.utils import seed_everything
from that_nlp_library.model_lm_main import *
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from transformers import AutoTokenizer, AutoConfig, AutoModelForMaskedLM
from datasets import load_dataset
import pandas as pd
import numpy as np
from transformers import DataCollatorForLanguageModeling

Finetune a Roberta Language Model (with line-by-line tokenization)

Create a TextDataLMController object

We will reuse the data and the preprocessing steps from this tutorial

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
# convert the train split to an IterableDataset to simulate streaming
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']
tdc = TextDataLMControllerStreaming(ddict_with_val,
                                    main_text='Review Text',
                                    filter_dict={'Review Text': lambda x: x is not None},
                                    metadatas=['Title','Division Name'],
                                    content_transformations=[text_normalize,str.lower],
                                    cols_to_keep=['Clothing ID','Review Text'],
                                    seed=42,
                                    batch_size=1024,
                                    verbose=False
                                    )
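
As a quick sanity check, here is what the two content transformations do to a single review; a minimal illustration (the sample string is made up), applied in the same order as the controller applies them:

sample = "Love this dress . It 's very Flattering !"
# apply each transformation in turn: normalization first, then lowercasing
for fn in [text_normalize, str.lower]:
    sample = fn(sample)
print(sample)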

Define our tokenizer for Roberta

_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

Process and tokenize our dataset (using line-by-line tokenization)

tdc.process_and_tokenize(_tokenizer,line_by_line=True,max_length=-1)
tdc.main_ddict
DatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 1
    })
    validation: Dataset({
        features: ['Clothing ID', 'Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 2253
    })
})
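
With line_by_line=True, each review becomes its own tokenized sequence. A minimal illustration with the raw tokenizer (the controller additionally prepends the metadata columns before tokenizing):

enc = _tokenizer(["love this dress.", "runs small, size up."],
                 return_special_tokens_mask=True)
# one variable-length list of token ids per input text
print([len(ids) for ids in enc['input_ids']])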

And set the data collator

tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)
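
The collator performs the random masking at batch time. A sketch of the same behavior using Hugging Face's DataCollatorForLanguageModeling directly (an assumption that set_data_collator wraps something equivalent):

collator = DataCollatorForLanguageModeling(tokenizer=_tokenizer, mlm=True, mlm_probability=0.15)
features = [{'input_ids': _tokenizer("love this dress")['input_ids']}]
batch = collator(features)
print(batch['input_ids'])  # roughly 15% of tokens swapped out (mostly for <mask>)
print(batch['labels'])     # original ids at masked positions, -100 everywhere else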

Initialize and train Roberta Language Model

_config = AutoConfig.from_pretrained('roberta-base',
                                    vocab_size=len(_tokenizer))
_config
RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.40.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
_model = language_model_init(AutoModelForMaskedLM,
                             config=_config,
                             cpoint_path='roberta-base',
                             seed=42
                            )
Total parameters: 124697433
Total trainable parameters: 124697433
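
For reference, a rough plain-transformers equivalent of this initialization (an assumption: language_model_init additionally handles the seeding and the parameter report):

seed_everything(42)
_model_plain = AutoModelForMaskedLM.from_pretrained('roberta-base', config=_config)
print(sum(p.numel() for p in _model_plain.parameters()))  # 124697433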

Create a model controller

controller = ModelLMController(_model,data_store=tdc,seed=42)

And we can start training our model

lr = 1e-4
bs=32
wd=0.01
epochs= 4
warmup_ratio=0.25
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
               len_train=20000 # estimated number of samples in the streamed train set
              )
max_steps is given, it will override any value given in num_train_epochs
[1248/1248 07:23, Epoch 3/9223372036854775807]
Epoch  Training Loss  Validation Loss  Accuracy
0      No log         1.511546         0.663742
1      1.651100       1.410441         0.676708
2      1.651100       1.279535         0.697920
3      1.651100       1.264103         0.696687

[71/71 00:04]
Perplexity on validation set: 3.589
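
Perplexity here is just the exponential of the validation cross-entropy loss; since masking is random, the final evaluation pass can land on a slightly different loss than the last epoch in the table:

import math
print(math.exp(1.264103))  # ~3.54, consistent with the reported 3.589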

Finetuning from a pretrained model yields a substantial improvement in these metrics

controller.trainer.model.save_pretrained('./sample_weights/lm_model')

Fill mask using the model

trained_model = language_model_init(AutoModelForMaskedLM,
                                    cpoint_path='./sample_weights/lm_model',
                                   )
Total parameters: 124697433
Total trainable parameters: 124697433
controller2 = ModelLMController(trained_model,data_store=tdc,seed=42)
controller2.data_store.tokenizer.mask_token
'<mask>'
inp1 = {'Clothing ID':1,
        'Title':'Flattering',
        'Division Name':'General',
        'Review Text': "Love this <mask>. The detail is amazing. Runs small I ordered a 12 I'm usually a 10, but still a little snug"
       }
controller2.predict_raw_text(inp1,print_result=True)
Score: 0.371 >>> general. flattering. love this top. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.281 >>> general. flattering. love this dress. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.119 >>> general. flattering. love this shirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.066 >>> general. flattering. love this skirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.052 >>> general. flattering. love this sweater. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
--------------------
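
Note how the metadata columns (Division Name and Title) are prepended to the review text in the predicted sequences ('general. flattering. ...'), mirroring the preprocessing applied during training.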

You can also input several raw texts at once

inp2 = {'Clothing ID':[1,2],
        'Title':['Flattering','Lovely, but small'],
        'Division Name':['General','General'],
        'Review Text': ["Love this <mask>. The detail is amazing. Runs small I ordered a 12 I'm usually a 10, but still a little snug",
                        "Love this skirt. The detail is amazing. Runs <mask>, I ordered a 12 I'm usually a 10, but still a little snug"]
       }
controller2.predict_raw_text(inp2,print_result=True)
Score: 0.371 >>> general. flattering. love this top. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.281 >>> general. flattering. love this dress. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.119 >>> general. flattering. love this shirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.066 >>> general. flattering. love this skirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.052 >>> general. flattering. love this sweater. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
--------------------
Score: 0.933 >>> general. lovely, but small. love this skirt. the detail is amazing. runs small, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.033 >>> general. lovely, but small. love this skirt. the detail is amazing. runs large, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.009 >>> general. lovely, but small. love this skirt. the detail is amazing. runs big, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.006 >>> general. lovely, but small. love this skirt. the detail is amazing. runs short, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.004 >>> general. lovely, but small. love this skirt. the detail is amazing. runs smaller, i ordered a 12 i'm usually a 10, but still a little snug
--------------------
controller2.predict_raw_text(inp2,print_result=False)
[[{'score': 0.3714928925037384,
   'token': 299,
   'token_str': ' top',
   'sequence': "general. flattering. love this top. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.28066369891166687,
   'token': 3588,
   'token_str': ' dress',
   'sequence': "general. flattering. love this dress. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.11859548091888428,
   'token': 6399,
   'token_str': ' shirt',
   'sequence': "general. flattering. love this shirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.06550988554954529,
   'token': 16576,
   'token_str': ' skirt',
   'sequence': "general. flattering. love this skirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.05240405723452568,
   'token': 23204,
   'token_str': ' sweater',
   'sequence': "general. flattering. love this sweater. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug"}],
 [{'score': 0.9333353638648987,
   'token': 650,
   'token_str': ' small',
   'sequence': "general. lovely, but small. love this skirt. the detail is amazing. runs small, i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.0329098179936409,
   'token': 739,
   'token_str': ' large',
   'sequence': "general. lovely, but small. love this skirt. the detail is amazing. runs large, i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.00943901389837265,
   'token': 380,
   'token_str': ' big',
   'sequence': "general. lovely, but small. love this skirt. the detail is amazing. runs big, i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.006075778976082802,
   'token': 765,
   'token_str': ' short',
   'sequence': "general. lovely, but small. love this skirt. the detail is amazing. runs short, i ordered a 12 i'm usually a 10, but still a little snug"},
  {'score': 0.003932583145797253,
   'token': 2735,
   'token_str': ' smaller',
   'sequence': "general. lovely, but small. love this skirt. the detail is amazing. runs smaller, i ordered a 12 i'm usually a 10, but still a little snug"}]]

Extract hidden states from the model

From raw texts

inp1 = {'Clothing ID':1,
        'Title':'Flattering',
        'Division Name': 'General',
        'Review Text': "Love this skirt. The detail is amazing. Runs small I ordered a 12 I'm usually a 10, but still a little snug"
       }
_config = AutoConfig.from_pretrained('./sample_weights/lm_model',output_hidden_states=True)
trained_model = language_model_init(AutoModelForMaskedLM,
                                    cpoint_path='./sample_weights/lm_model',
                                    config=_config
                                   )

controller2 = ModelLMController(trained_model,data_store=tdc,seed=42)
Total parameters: 124697433
Total trainable parameters: 124697433
hidden_from_ip1 = controller2.get_hidden_states_from_raw_text(inp1,
                                                              state_name='hidden_states',
                                                              state_idx=[-1,0]
                                                             )
hidden_from_ip1
Dataset({
    features: ['Clothing ID', 'Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask', 'hidden_states'],
    num_rows: 1
})
hidden_from_ip1['hidden_states'].shape
(1, 768)
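
For comparison, the same kind of sentence embedding can be extracted by hand; a sketch assuming state_idx=[-1,0] selects the last hidden layer and the first (<s>) token:

import torch
tok = controller2.data_store.tokenizer
enc = tok("love this skirt", return_tensors='pt').to(trained_model.device)
with torch.no_grad():
    out = trained_model(**enc)
print(out.hidden_states[-1][:, 0, :].shape)  # torch.Size([1, 768])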

From the validation (or even the train) set

hidden_from_vals = controller2.get_hidden_states(ds_type='validation',
                                                 state_name='hidden_states',
                                                 state_idx=[-1,0]
                                                )
hidden_from_vals
Dataset({
    features: ['Clothing ID', 'Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask', 'hidden_states'],
    num_rows: 2253
})
hidden_from_vals['hidden_states'].shape
(2253, 768)

Finetune a Roberta Language Model (with token concatenation)

Since our data contains only short texts (the longest review is around 120 words), the token concatenation technique might not be ideal, as it is better suited to long documents. One perk is that it reduces the number of training samples. With that said, we will still run some experiments with this technique.
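
The idea behind token concatenation: tokenize everything, join the token ids into one long stream, then slice the stream into fixed-size blocks so no padding is wasted. A sketch in the spirit of Hugging Face's group_texts recipe (an assumption that the controller does something similar internally):

def group_texts(examples, block_size=140):
    # join each tokenized field into one long stream
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concatenated['input_ids']) // block_size) * block_size
    # slice into equal-size blocks; the leftover tail is dropped
    return {k: [t[i:i + block_size] for i in range(0, total_len, block_size)]
            for k, t in concatenated.items()}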

Create a TextDataLMController object

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
# convert the train split to an IterableDataset to simulate streaming
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataLMControllerStreaming(ddict_with_val,
                                    main_text='Review Text',
                                    filter_dict={'Review Text': lambda x: x is not None},
                                    metadatas=['Title','Division Name'],
                                    content_transformations=[text_normalize,str.lower],
                                    cols_to_keep=['Clothing ID','Review Text'],
                                    seed=42,
                                    batch_size=1024,
                                    verbose=False
                                    )

Define our tokenizer for Roberta

_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

Process and tokenize our dataset (using token concatenation technique)

block_size=140 # number of tokens in each concatenated chunk
tdc.process_and_tokenize(_tokenizer,line_by_line=False,max_length=block_size)
tdc.main_ddict
DatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 1
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 1342
    })
})
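
Compared to line-by-line tokenization, the validation split shrinks from 2,253 variable-length reviews to 1,342 fixed blocks of 140 tokens each, since several reviews are now packed into every block.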

And set the data collator

tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)

Initialize and train Roberta Language Model

_config = AutoConfig.from_pretrained('roberta-base',
                                    vocab_size=len(_tokenizer))
_config
RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.40.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
_model = language_model_init(AutoModelForMaskedLM,
                             config=_config,
                             cpoint_path='roberta-base',
                             seed=42
                            )
Total parameters: 124697433
Total trainable parameters: 124697433

Create a model controller

controller = ModelLMController(_model,data_store=tdc,seed=42)

And we can start training our model

lr = 1e-4
bs=32
wd=0.01
epochs= 4
warmup_ratio=0.25
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
               len_train=20000 # estimated number of samples in the streamed train set
              )

max_steps is given, it will override any value given in num_train_epochs
[1248/1248 09:17, Epoch 6/9223372036854775807]
Epoch  Training Loss  Validation Loss  Accuracy
0      No log         1.613501         0.643859
1      No log         1.542169         0.650042
2      No log         1.441636         0.667640
3      1.721100       1.397509         0.679836
4      1.721100       1.363855         0.685481
5      1.721100       1.326490         0.688224
6      1.721100       1.334893         0.687034

[42/42 00:03]
Perplexity on validation set: 3.775
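
As anticipated above, on these short reviews token concatenation ends with a slightly worse perplexity (3.775) than line-by-line tokenization (3.589).
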
controller.trainer.model.save_pretrained('./sample_weights/lm_model')

Fill mask using the model

trained_model = language_model_init(AutoModelForMaskedLM,
                                    cpoint_path='./sample_weights/lm_model',
                                   )
Total parameters: 124697433
Total trainable parameters: 124697433
controller2 = ModelLMController(trained_model,data_store=tdc,seed=42)
inp1 = {'Clothing ID':1,
        'Title':'Flattering',
        'Division Name':'General',
        'Review Text': "Love this <mask>. The detail is amazing. Runs small I ordered a 12 I'm usually a 10, but still a little snug"
       }
controller2.predict_raw_text(inp1,print_result=True)
Score: 0.323 >>> general. flattering. love this dress. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.317 >>> general. flattering. love this top. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.116 >>> general. flattering. love this shirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.063 >>> general. flattering. love this skirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.047 >>> general. flattering. love this sweater. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
--------------------

You can also input several raw texts at once

inp2 = {'Clothing ID':[1,2],
        'Title':['Flattering','Lovely, but small'],
        'Division Name':['General','General'],
        'Review Text': ["Love this <mask>. The detail is amazing. Runs small I ordered a 12 I'm usually a 10, but still a little snug",
                        "Love this skirt. The detail is amazing. Runs <mask>, I ordered a 12 I'm usually a 10, but still a little snug"]
       }
controller2.predict_raw_text(inp2,print_result=True)
Score: 0.323 >>> general. flattering. love this dress. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.317 >>> general. flattering. love this top. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.116 >>> general. flattering. love this shirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.063 >>> general. flattering. love this skirt. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.047 >>> general. flattering. love this sweater. the detail is amazing. runs small i ordered a 12 i'm usually a 10, but still a little snug
--------------------
Score: 0.935 >>> general. lovely, but small. love this skirt. the detail is amazing. runs small, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.038 >>> general. lovely, but small. love this skirt. the detail is amazing. runs large, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.009 >>> general. lovely, but small. love this skirt. the detail is amazing. runs big, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.004 >>> general. lovely, but small. love this skirt. the detail is amazing. runs short, i ordered a 12 i'm usually a 10, but still a little snug
Score: 0.003 >>> general. lovely, but small. love this skirt. the detail is amazing. runs smaller, i ordered a 12 i'm usually a 10, but still a little snug
--------------------