Text Main For Language Model

This module contains the main Python class for Language Model data control: TextDataLMController
import pandas as pd
import numpy as np
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from importlib.machinery import SourceFileLoader
from datasets import load_dataset
import os

Class TextDataLMController


source

TextDataLMController

 TextDataLMController (inp, main_text:str, filter_dict={}, metadatas=[],
                       process_metas=True, metas_sep='.',
                       content_transformations=[],
                       val_ratio:int|float|None=0.2, stratify_cols=[],
                       seed=None, batch_size=1024, num_proc=4,
                       cols_to_keep=None, verbose=True)

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
inp HuggingFace Dataset or DatasetDict
main_text str Name of the main text column
filter_dict dict {} A dictionary: {feature: filtering_function_for_that_feature}
metadatas list [] Names of the metadata columns
process_metas bool True Whether to do simple text processing on the chosen metadatas
metas_sep str . Separator for multiple metadata concatenation
content_transformations list [] A list of text transformations
val_ratio int | float | None 0.2 Ratio of data for validation set
stratify_cols list [] Column(s) needed to do stratified shuffle split
seed NoneType None Random seed
batch_size int 1024 CPU batch size
num_proc int 4 Number of processes for multiprocessing
cols_to_keep NoneType None Columns to keep after all processing
verbose bool True Whether to print processing information
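
To see how these arguments fit together, here is a minimal, hypothetical instantiation (my_dset is a placeholder for a HuggingFace Dataset or DatasetDict you have already loaded; fully runnable examples follow in the sections below):

tdc = TextDataLMController(inp=my_dset,                                           # HuggingFace Dataset or DatasetDict
                           main_text='Review Text',                               # name of the main text column
                           filter_dict={'Review Text': lambda x: x is not None},  # drop rows with missing text
                           metadatas=['Title'],                                   # metadata column(s) to prepend
                           content_transformations=[str.lower],                   # string -> string transformations
                           val_ratio=0.2,                                         # 20% of the data goes to validation
                           seed=42,
                          )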

1. Load data + Basic use case


source

TextDataController.from_csv

 TextDataController.from_csv (file_path, **kwargs)

source

TextDataController.from_df

 TextDataController.from_df (df, validate=True, **kwargs)

You can create a TextDataLMController from a csv, pandas DataFrame, or directly from a HuggingFace dataset object. Currently, TextDataLMController is designed for processing text in order to train a language model

Dataset source: https://www.kaggle.com/datasets/kavita5/review_ecommerce

import pandas as pd
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df.shape
(23486, 10)
df.sample(5)
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name
18374 1077 43 NaN I love the color, which is eye popping without... 4 1 1 General Dresses Dresses
9201 862 47 NaN I love this top. so much so that i bought it i... 5 1 9 General Tops Knits
10964 1083 36 Gor-geous This dress is absolutely fantastic. beautiful,... 5 1 0 General Dresses Dresses
4108 829 44 Great quality, unique design Very unique shirt-- you will get a compliment!... 5 1 1 General Tops Blouses
9892 860 70 Not a wow I bought the bronze color which was nice but t... 1 0 0 General Petite Tops Knits

You can create a TextDataLMController from a dataframe. This also provides a quick input validation check (NaN check and Duplication check)

tdc = TextDataLMController.from_df(df,main_text='Review Text')
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows

You can also create a TextDataLMController directly from the csv file. The good thing about using HuggingFace Dataset as the main backend is that you can utilize lots of its useful functionality, such as caching

tdc = TextDataLMController.from_csv('sample_data/Womens_Clothing_Reviews.csv',main_text='Review Text')

You can also create a TextDataLMController from a HuggingFace Dataset

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
dset
Dataset({
    features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
    num_rows: 23486
})
tdc = TextDataLMController(dset,main_text='Review Text')

In the “Input Validation Precheck” above, we notice that our dataset has missing values in the text field and the label field. For now, let’s load the data as a Pandas’ DataFrame, perform some cleaning, and create our TextDataLMController

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df = df[(~df['Review Text'].isna()) & (~df['Department Name'].isna())].reset_index(drop=True)
tdc = TextDataLMController.from_df(df,main_text='Review Text')
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title    2966
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 1 rows

At this point you can start performing two important steps on your data:

  1. Text preprocessings + Train/Validation Split
  2. Tokenization
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text'],
        num_rows: 18100
    })
    validation: Dataset({
        features: ['Review Text'],
        num_rows: 4526
    })
})

Our DatasetDict now has two splits: train and validation. Note that the train split has been shuffled and flattened, for processing efficiency

ddict['train'][:3]
{'Review Text': ['A lovely skirt and i\'m so glad i found it before the medium sold out! having said that, i expected the medium to run small and that i\'d have to squeeze into it but having tried it on this evening it\'s not the case at all. i nice fit. i might even have fitted into a small, which i think is the only size remaining. the skirt is very spain inspired. very flamenco! i love it! i will say that you\'d need a bit of height to wear this skirt due to the length at the back. i\'m 5\'6" which is tall enough fo',
  "The velvet isn't as soft or plush as i thought it would be but these are comfy pants. i won't wear them until next winter, which is fine.",
  "So i almost returned this top without trying it on because i've been binging on tops with thin blue lines but so glad i didn't!! i'm busty like ddd36 and i weigh 170, but i got the 8 and it fits like a glove! perfection!! plus i got it on sale!! so fab!"]}
ddict['validation'][:3]
{'Review Text': ["I love these jeans! i really like the way they fit and haven't had problems with them stretching out like other reviewers have.",
  'This shirt is so cute alone with jeans or dressed up with nice jewelry, a scarf or cardi. its just the right weight, true to size, drapes nicely and its very flattering. i"m sorry i didn\'t order more when i had the chance. its already sold out in the colors and sizes i wanted. excellent quality as usual -- thanks again retailer!',
  'The colors on these leggings are very nice and the fit was fabulous. the waist is high enough to hold in a slight "muffin" top and the control in the fabric is just right. i received several compliments on them and hubby really liked them.']}

2. Filtering

This preprocessing step allows you to filter out certain values of a column in your dataset. Let’s say I want to filter out any None value in the column ‘Review Text’

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df[(~df['Review Text'].isna())].isna().sum()
Clothing ID                   0
Age                           0
Title                      2966
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                13
Department Name              13
Class Name                   13
dtype: int64

We will provide a dictionary containing the name of the column and the filtering function to apply on that column. Note that the filtering function will receive an item from the column, and the function should return a boolean
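
If you prefer a named function over a lambda, any callable that maps a column value to a boolean works; a hypothetical helper is sketched below (presumably the processing log would then print its name instead of <lambda>)

def not_none(x):
    # keep the row only when the value is present
    return x is not None

# e.g. filter_dict={'Review Text': not_none}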

tdc = TextDataLMController.from_df(df,
                                 main_text='Review Text',
                                 filter_dict={'Review Text': lambda x: x is not None},
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text'],
        num_rows: 18111
    })
    validation: Dataset({
        features: ['Review Text'],
        num_rows: 4529
    })
})

Let’s check if we have filtered out all NaN/None values

for i in ddict['train']['Review Text']:
    assert i is not None
for i in ddict['validation']['Review Text']:
    assert i is not None

We can even add multiple filtering functions. Remember from our precheck that there are also None values in ‘Department Name’. While we are at it, let’s also filter out any rating that is less than 3 (just to showcase what our filtering can do)

df.Rating.value_counts()
Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

Note that TextDataLMController will only keep the text and the metadata columns; any other column will be dropped. To double-check our result, we need to define the cols_to_keep argument

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None,
                                                'Department Name': lambda x: x is not None,
                                                'Rating': lambda x: x>=3
                                               },
                                   cols_to_keep=['Review Text','Rating','Department Name'],
                                   seed=42
                                  )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Rating -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
for i in ddict['train']['Department Name']:
    assert i is not None
for i in ddict['validation']['Department Name']:
    assert i is not None

for i in ddict['train']['Rating']:
    assert i >= 3
for i in ddict['validation']['Rating']:
    assert i >= 3

3. Metadatas concatenation

If we think metadata can be helpful, we can concatenate it to the front of the text, so that our language model is aware of it.

In this example, let’s add ‘Title’ as our metadata

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None},
                                   metadatas='Title',
                                   process_metas=True, # to preprocess the metadata (currently just whitespace stripping and lowercasing)
                                   seed=42
                                  )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Shuffling and flattening train set --------------------
Done
ddict['train'][:3]
{'Title': ['not flattering on me', '', ''],
 'Review Text': ['not flattering on me . I ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5\'9" about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.',
  " . So unflattering! really disappointed. made me look 6 month pregnant and i'm a petite size 2.",
  ' . This t-shirt does a great job of elevating the basic t-shirt in to one with a touch of flair. i typically wear a medium but luckily read earlier reviews and went with the small.']}
ddict['validation'][:3]
{'Title': ['', '', ''],
 'Review Text': [" . This picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.",
  ' . Easy to wear! cute, comfy...will be a go to for summer.',
  ' . Nice sweater, just did not look good on me. sorry, going back.']}

4. Content Transformation

This processing allows you to alter the text content in your dataset. You need to define a function that accepts a single string and returns a new, processed string. Note that this transformation will be applied to ALL of your dataset (both train and validation)
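
Any plain string-to-string Python function qualifies. As a small illustrative sketch (not part of the library), a hypothetical whitespace collapser could be written as follows and passed via content_transformations just like the functions used below:

import re

def collapse_spaces(text):
    # replace any run of whitespace with a single space and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

collapse_spaces('too   many    spaces ')
'too many spaces'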

Let’s say we want to normalize our text, because the text might contain some extra spaces between words, or not follow the “single space after a period” rule

_tmp = "This is a      sentence,which doesn't follow any rule!No single space is provided after period or punctuation marks.    Maybe there are too many spaces!?!   "
from underthesea import text_normalize
text_normalize(_tmp)
"This is a sentence , which doesn't follow any rule ! No single space is provided after period or punctuation marks . Maybe there are too many spaces ! ? !"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=text_normalize,
                         seed=42
                        )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict['train']['Review Text'][0]
'I ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 \' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .'
ddict['validation']['Review Text'][0]
"This picture doesn't do the skirt justice . i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt . it is really pretty and flattering on ."

You can chain multiple functions. Let’s say after text normalizing, I want to lowercase the text

str.lower('tHis IS NoT lowerCASE')
'this is not lowercase'
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         seed=42
                        )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict['train']['Review Text'][0]
'i ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 \' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .'
ddict['validation']['Review Text'][0]
"this picture doesn't do the skirt justice . i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt . it is really pretty and flattering on ."

5. Train/Validation Split

There are several ways to perform a train/validation split with TextDataLMController

The first way is when you already have a validation split in your HuggingFace DatasetDict. Let’s use the Dataset’s built-in train_test_split function to simulate this

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1)
# This will create a 'test' split instead of 'validation', so we will process a bit to have a validation split
ddict_with_val['validation']=ddict_with_val['test']
del ddict_with_val['test']
ddict_with_val
DatasetDict({
    train: Dataset({
        features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
        num_rows: 21137
    })
    validation: Dataset({
        features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
        num_rows: 2349
    })
})
tdc = TextDataLMController(ddict_with_val,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         seed=42
                        )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
Validation split already exists
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text'],
        num_rows: 20368
    })
    validation: Dataset({
        features: ['Review Text'],
        num_rows: 2273
    })
})

A second way is to split randomly, based either on a ratio (a float between 0 and 1) or on the number of rows you want in your validation set

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         val_ratio=0.15,
                         seed=42,
                         verbose=False
                        )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text'],
        num_rows: 19243
    })
    validation: Dataset({
        features: ['Review Text'],
        num_rows: 3397
    })
})
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         val_ratio=5000,
                         seed=42,
                         verbose=False
                        )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text'],
        num_rows: 17640
    })
    validation: Dataset({
        features: ['Review Text'],
        num_rows: 5000
    })
})

A third way is to do a random stratified split (inspired by sklearn’s). Let’s do a stratified split based on our label ‘Department Name’

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].value_counts(normalize=True)
Department Name
Tops        0.445978
Dresses     0.269214
Bottoms     0.161852
Intimate    0.073918
Jackets     0.043967
Trend       0.005070
Name: proportion, dtype: float64
tdc = TextDataLMController.from_df(df,
                                 main_text='Review Text',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 cols_to_keep=['Review Text','Department Name'],
                                 seed=42
                                )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name'],
        num_rows: 18100
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name'],
        num_rows: 4526
    })
})
pd.Series(ddict['train']['Department Name']).value_counts(normalize=True)
Tops        0.444033
Dresses     0.271602
Bottoms     0.161878
Intimate    0.072983
Jackets     0.044309
Trend       0.005193
Name: proportion, dtype: float64
pd.Series(ddict['validation']['Department Name']).value_counts(normalize=True)
Tops        0.444101
Dresses     0.271542
Bottoms     0.161732
Intimate    0.073133
Jackets     0.044189
Trend       0.005303
Name: proportion, dtype: float64

You can also use multiple columns for your stratification

tdc = TextDataLMController.from_df(df,
                                 main_text='Review Text',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols=['Department Name','Rating'],
                                 cols_to_keep=['Review Text','Department Name','Rating'],
                                 seed=42,
                                 verbose=False
                                )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Rating', 'Department Name'],
        num_rows: 18100
    })
    validation: Dataset({
        features: ['Review Text', 'Rating', 'Department Name'],
        num_rows: 4526
    })
})

And finally, you can omit any validation split if you specify val_ratio as None

tdc = TextDataLMController.from_df(df,
                                 main_text='Review Text',
                                 filter_dict={'Review Text': lambda x: x is not None},
                                 val_ratio=None,
                                 seed=42
                                )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
No validation split defined
Done
-------------------- Dropping unused features --------------------
Done
-------------------- Shuffling and flattening train set --------------------
Done
DatasetDict({
    train: Dataset({
        features: ['Review Text'],
        num_rows: 22641
    })
})

6. Tokenization

Define our tokenizer

from transformers import RobertaTokenizer
from underthesea import text_normalize
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
/home/quan/anaconda3/envs/nlp_dev/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

source

TextDataLMController.process_and_tokenize

 TextDataLMController.process_and_tokenize (tokenizer, max_length=None,
                                            line_by_line=True,
                                            stride=None, trn_size=None,
                                            tok_num_proc=None,
                                            shuffle_trn=True,
                                            check_val_leak=True)

This will perform do_all_preprocessing, then do_tokenization

Type Default Details
tokenizer Tokenizer (preferably from HuggingFace)
max_length NoneType None pad to model’s allowed max length (default is max_sequence_length)
line_by_line bool True Whether to tokenize each sentence separately, or to concatenate them and then tokenize
stride NoneType None Option to do striding when line_by_line is False
trn_size NoneType None The number of training rows to be tokenized
tok_num_proc NoneType None Number of processes for tokenization
shuffle_trn bool True Whether to shuffle the train set before tokenization
check_val_leak bool True Whether to check for (and remove) training data leaked into the validation set
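
For parameters that are not showcased below, such as trn_size, a hedged sketch (assuming it simply takes an integer number of training rows, as documented above) would be:

# tokenize line-by-line, but only keep trn_size=1000 rows of the training set
tdc.process_and_tokenize(tokenizer, line_by_line=True, trn_size=1000)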

a) Option 1: Tokenize our corpus line-by-line

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         cols_to_keep=['Clothing ID','Review Text'],
                         seed=42,
                         verbose=False
                        )

With no padding

tdc.process_and_tokenize(tokenizer,line_by_line=True,max_length=-1)
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Clothing ID', 'Review Text', 'input_ids', 'special_tokens_mask', 'attention_mask'],
        num_rows: 18111
    })
    validation: Dataset({
        features: ['Clothing ID', 'Review Text', 'input_ids', 'special_tokens_mask', 'attention_mask'],
        num_rows: 4529
    })
})
print(tdc.main_ddict['train']['Review Text'][0])
print(tdc.main_ddict['validation']['Review Text'][0])
i ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 ' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .
this picture doesn't do the skirt justice . i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt . it is really pretty and flattering on .
print(tokenizer.decode(tdc.main_ddict['train']['input_ids'][0]))
print(tokenizer.decode(tdc.main_ddict['validation']['input_ids'][0]))
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s>
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s>

With padding (set max_length to None if you want to pad to model’s maximum sequence length)

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         cols_to_keep=['Clothing ID','Review Text'],
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,line_by_line=True,max_length=100)
print(tokenizer.decode(tdc.main_ddict['train']['input_ids'][0]))
print(tokenizer.decode(tdc.main_ddict['validation']['input_ids'][0]))
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad>
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

b) Option 2: Tokenize every text, then concatenate them together before splitting them into smaller parts.

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         cols_to_keep=['Clothing ID','Review Text'],
                         seed=42,
                         verbose=False,
                        )
tdc.process_and_tokenize(tokenizer,line_by_line=False,max_length=100)
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'special_tokens_mask', 'attention_mask'],
        num_rows: 13573
    })
    validation: Dataset({
        features: ['input_ids', 'special_tokens_mask', 'attention_mask'],
        num_rows: 3446
    })
})

Notice that even though I passed the cols_to_keep parameter, the returned DatasetDict does not keep those columns, because it wouldn’t make sense to retain them once the tokens are concatenated. Normally, when line_by_line is False, you will leave cols_to_keep as None (the default)

for i in tdc.main_ddict['train']['input_ids'][:3]:
    print(tokenizer.decode(i))
    print('-'*100)
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made
----------------------------------------------------------------------------------------------------
 me look 6 month pregnant and i'm a petite size 2.</s><s>i love rompers and this one is really cute. i usually wear size 12 but should have got a 10, it runs big. it seems too long, and i'm 5'9 ". the prints cute but a little blah. i paid $ 158 which is too much, since i haven't worn it yet, i should have waited for it to go on sale.</s><s>... the print is so
----------------------------------------------------------------------------------------------------
 sharking, and i love the way it looks on the model -- but i'm a more curvy figure, and the boxy-ish cut plus rather stuff fabric in front is incredibly unflattering. ordinarily i love everything made by maeve, but this one sadly must be returned... on a thinner / straighter-shaped person i expect it would be great.</s><s>i've had my eye on this poncho for weeks and finally scored the olive green one over thanksgiving /
----------------------------------------------------------------------------------------------------
for i in tdc.main_ddict['validation']['input_ids'][:3]:
    print(tokenizer.decode(i))
    print('-'*100)
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s><s>easy to wear! cute, comfy... will be a go to for summer.</s><s>nice sweater, just did not look good on me. sorry, going back.</s><s>this jacket was a little shorter than i had expected, but i still really enjoy the cut and fit of it
----------------------------------------------------------------------------------------------------
.</s><s>i wasn't planning on loving this dress when i tried it on. i loved the the color which is what prompted me to buy it. this dress fit perfectly. it hugs my body without feeling tight. the ruching is perfect. i didn't want to take it off! it's also very comfortable. i'm 5'1 ", 107 lbs and the xs petite fit perfectly. the dress hits me at the same length that is pictured. i think it would
----------------------------------------------------------------------------------------------------
 be easy to hem if you wanted it to be shorter. i have a short torso and saw no issues with that as some reviewer</s><s>i like flowy tops because i have a bit of a belly and i like to camouflage it but this top was really flowy. the fabric is great and the embroidery is beautiful, i was hoping for this to be a holiday staple this year. it has to go back though, just too large. i don't love it quite enough to order
----------------------------------------------------------------------------------------------------

c) Striding (For Concatenation of tokens)

If your sentences (or paragraphs) are longer than max_length, they will be broken apart after concatenation, so a long paragraph will be incomplete in terms of meaning. Striding is a way to somewhat preserve the sentence’s meaning, by getting part of the previous chunk back. We will demonstrate it with an example, and you can compare it with the previous one (without striding) to see the differences

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False,
                        )
tdc.process_and_tokenize(tokenizer,line_by_line=False,max_length=100,stride=20)
# Stride is 20, meaning for the next entry, we go back 20 tokens
for i in tdc.main_ddict['train']['input_ids'][:3]:
    print(tokenizer.decode(i))
    print('-'*100)
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made
----------------------------------------------------------------------------------------------------
 comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made me look 6 month pregnant and i'm a petite size 2.</s><s>i love rompers and this one is really cute. i usually wear size 12 but should have got a 10, it runs big. it seems too long, and i'm 5'9 ". the prints cute but a little blah. i paid $ 158 which is too much, since i haven't worn it
----------------------------------------------------------------------------------------------------
 but a little blah. i paid $ 158 which is too much, since i haven't worn it yet, i should have waited for it to go on sale.</s><s>... the print is so sharking, and i love the way it looks on the model -- but i'm a more curvy figure, and the boxy-ish cut plus rather stuff fabric in front is incredibly unflattering. ordinarily i love everything made by maeve, but this one sadly must be returned... on
----------------------------------------------------------------------------------------------------

For the second entry, we can see it starts with the last 20 tokens of the previous entry: “comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made”

for i in tdc.main_ddict['validation']['input_ids'][:3]:
    print(tokenizer.decode(i))
    print('-'*100)
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s><s>easy to wear! cute, comfy... will be a go to for summer.</s><s>nice sweater, just did not look good on me. sorry, going back.</s><s>this jacket was a little shorter than i had expected, but i still really enjoy the cut and fit of it
----------------------------------------------------------------------------------------------------
 was a little shorter than i had expected, but i still really enjoy the cut and fit of it.</s><s>i wasn't planning on loving this dress when i tried it on. i loved the the color which is what prompted me to buy it. this dress fit perfectly. it hugs my body without feeling tight. the ruching is perfect. i didn't want to take it off! it's also very comfortable. i'm 5'1 ", 107 lbs and the xs pet
----------------------------------------------------------------------------------------------------
 it's also very comfortable. i'm 5'1 ", 107 lbs and the xs petite fit perfectly. the dress hits me at the same length that is pictured. i think it would be easy to hem if you wanted it to be shorter. i have a short torso and saw no issues with that as some reviewer</s><s>i like flowy tops because i have a bit of a belly and i like to camouflage it but this top was really flowy. the fabric is great and
----------------------------------------------------------------------------------------------------

7. Data Collator

from underthesea import text_normalize
from transformers import AutoTokenizer

a) For masked language model

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

Let’s define our text controller first

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         cols_to_keep=['Clothing ID','Review Text'],
                         seed=42,
                         verbose=False
                        )

We will tokenize our corpus line-by-line

tdc.process_and_tokenize(tokenizer,line_by_line=True,max_length=-1)
tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)
tdc.data_collator
DataCollatorForLanguageModeling(tokenizer=RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}, mlm=True, mlm_probability=0.15, pad_to_multiple_of=8, tf_experimental_compile=False, return_tensors='pt')

Before applying the collator…

print([tdc.main_ddict['train'][i] for i in range(2)])
[{'Clothing ID': 937, 'Review Text': 'i ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 \' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .', 'input_ids': [0, 118, 2740, 42, 804, 8, 21, 5779, 19, 5, 2564, 77, 24, 2035, 479, 939, 2740, 5, 3023, 29, 8, 24, 21, 202, 81, 10799, 7, 5, 477, 9, 145, 29747, 24203, 479, 939, 524, 6764, 195, 128, 361, 22, 59, 8325, 2697, 8, 33, 10, 5342, 7174, 28762, 8, 356, 275, 11, 21543, 29, 14, 33, 103, 3989, 479, 114, 47, 101, 10, 7082, 2564, 42, 429, 28, 13, 47, 479, 5, 1468, 16, 33997, 8, 3279, 8, 3473, 479, 939, 74, 3608, 12926, 159, 10, 1836, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}, {'Clothing ID': 870, 'Review Text': "so unflattering ! really disappointed . made me look 6 month pregnant and i'm a petite size 2 .", 'input_ids': [0, 2527, 29747, 24203, 27785, 269, 5779, 479, 156, 162, 356, 231, 353, 5283, 8, 939, 437, 10, 4716, 1459, 1836, 132, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}]

We can see that the token lists have different lengths

list(map(len,tdc.main_ddict['train']['input_ids'][:5]))
[91, 24, 79, 82, 121]

Let’s apply the collator

# extract only the required keys
inp_keys = tokenizer.model_input_names
_inp = [{k:tdc.main_ddict['train'][i][k] for k in inp_keys} for i in range(5)]
print(_inp[:2])
[{'input_ids': [0, 118, 2740, 42, 804, 8, 21, 5779, 19, 5, 2564, 77, 24, 2035, 479, 939, 2740, 5, 3023, 29, 8, 24, 21, 202, 81, 10799, 7, 5, 477, 9, 145, 29747, 24203, 479, 939, 524, 6764, 195, 128, 361, 22, 59, 8325, 2697, 8, 33, 10, 5342, 7174, 28762, 8, 356, 275, 11, 21543, 29, 14, 33, 103, 3989, 479, 114, 47, 101, 10, 7082, 2564, 42, 429, 28, 13, 47, 479, 5, 1468, 16, 33997, 8, 3279, 8, 3473, 479, 939, 74, 3608, 12926, 159, 10, 1836, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}, {'input_ids': [0, 2527, 29747, 24203, 27785, 269, 5779, 479, 156, 162, 356, 231, 353, 5283, 8, 939, 437, 10, 4716, 1459, 1836, 132, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}]
out = tdc.data_collator(_inp) # simulation with batch size 5
out.keys()
dict_keys(['input_ids', 'attention_mask', 'labels'])

Now all token lists have the same length, which is 128: a multiple of 8 and larger than the longest list in the batch (which is 121)

out['input_ids'].shape
torch.Size([5, 128])
out['input_ids'][:2,:]
tensor([[    0,   118,  2740,    42,   804,     8,    21,  5779,    19, 50264,
          2564,    77,    24,  2035,   479,   939,  2740,     5,  3023,    29,
             8,    24,    21,   202, 50264, 10799,     7,     5,   477, 50264,
           145, 50264, 24203,   479,   939,   524,  6764,   195,   128,   361,
            22,    59,  8325,  2697,     8,    33,    10,  5342,  7174, 28762,
         50264,   356, 50264,    11, 21543,    29,    14,    33,   103, 38941,
           479,   114,    47,   101,    10,  7082,  2564,    42,   429,    28,
            13,    47,   479, 50264,  1468, 44089, 33997,     8,  3279,     8,
          3473,   479,   939,    74,  3608, 12926,   159, 50264,  1836,   479,
             2,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1],
        [    0,  2527, 29747, 24203, 50264,   269,  5779,   479, 50264, 50264,
           356,   231,   353, 50264,     8, 50264,   437,    10,  4716,  1459,
          1836, 49943,   479,     2,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1]])

The labels have also been constructed: positions whose value is not -100 are the masked tokens that the model has to predict. To increase the number of masked tokens, increase mlm_prob

out['labels'][:2,:]
tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,     5,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,    81,  -100,  -100,  -100,  -100,     9,
          -100, 29747,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
             8,  -100,   275,  -100,  -100,  -100,  -100,  -100,  -100,  3989,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,     5,  1468,    16,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,    10,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [ -100,  -100,  -100,  -100, 27785,  -100,  -100,  -100,   156,   162,
          -100,  -100,  -100,  5283,  -100,   939,  -100,  -100,  -100,  -100,
          -100,   132,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100]])
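
As a quick sanity check (a sketch, not part of the library), we can estimate the observed masking rate on this small batch; with mlm_prob=0.15 it should hover around 15% of the real (non-padding) tokens:

masked_frac = ((out['labels'] != -100).sum() / out['attention_mask'].sum()).item()
print(f'observed masking rate: {masked_frac:.2%}')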

If you already apply padding in the tokenization step (by adjusting the max_length argument), whether it’s line-by-line tokenization or not, the data collator will skip its own padding step

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         cols_to_keep=['Clothing ID','Review Text'],
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,line_by_line=False,max_length=100)
tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)
list(map(len,tdc.main_ddict['train']['input_ids'][:5]))
[100, 100, 100, 100, 100]

Let’s apply the collator

inp_keys = tokenizer.model_input_names
_inp = [{k:tdc.main_ddict['train'][i][k] for k in inp_keys} for i in range(5)]
out = tdc.data_collator(_inp) # simulation with batch size 5
out['input_ids'].shape
torch.Size([5, 100])
out['input_ids'][:2,:]
tensor([[    0,   118,  2740,    42,   804,     8,    21,  5779,    19, 50264,
          2564,    77,    24,  2035,   479,   939,  2740,     5,  3023,    29,
             8,    24,    21,   202,    81, 10799,     7,     5,   477, 50264,
           145, 50264, 24203,   479,   939,   524,  6764,   195,   128,   361,
            22,    59,  8325,  2697,     8,    33,    10,  5342,  7174, 28762,
         50264,   356, 50264,    11, 21543,    29,    14,    33,   103, 41316,
           479,   114,    47,   101,    10,  7082,  2564,    42,   429,    28,
            13,    47,   479, 50264, 17204, 50264, 33997,     8,  3279,     8,
          3473,   479,   939,    74,  3608, 12926,   159, 50264,  1836,   479,
             2,     0,  2527, 29747, 50264, 27785,   269,  5779,   479, 50264],
        [  162,   356,   231, 50264,  5283,     8, 50264,   437, 23781,  4716,
          1459,  1836,   132,   479,     2,     0,   118, 50264,   910,  7474,
           268,     8,    42,    65,    16,   269, 11962, 50264,   939,  2333,
          3568,  1836, 50264,    53,   197,    33, 50264, 50264,   158,  2156,
            24, 50264,   380, 44224,    24,  1302,   350,   251,  2156, 50264,
           939,   437,   195, 50264,   361,    22,   479,     5, 19553, 11962,
            53,    10,   410, 50264,   479,   939,  1199,    68, 26498,    61,
            16,   350,   203,  2156,   187,   939, 50264,    75, 10610, 50264,
           648,  2156,   939,   197,    33,  9010,    13,    24,     7,   213,
            15,  1392,   479,     2,     0,   734,     5,  5780,    16,    98]])
out['labels'][:2,:]
tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,     5,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,    81,  -100,  -100,  -100,  -100,     9,
          -100, 29747,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
             8,  -100,   275,  -100,  -100,  -100,  -100,  -100,  -100,  3989,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,     5,  1468,    16,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,    10,  -100,  -100,
          -100,  -100,  -100,  -100, 24203,  -100,  -100,  -100,  -100,   156],
        [ -100,  -100,  -100,   353,  -100,  -100,   939,  -100,    10,  -100,
          -100,  -100,  -100,   479,  -100,  -100,  -100,   657,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,   479,  -100,  -100,
          -100,  -100,   316,  -100,  -100,  -100,   300,    10,  -100,  -100,
          -100,  1237,  -100,   479,  -100,  -100,  -100,  -100,  -100,     8,
          -100,  -100,  -100,   128,  -100,    22,  -100,  -100,  -100,  -100,
          -100,  -100,  -100, 38596,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  2220,  -100,  -100,    24,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100]])

Since we are using the concatenation-of-tokenization technique, one nice property of HuggingFace’s DataCollatorForLanguageModeling (the data collator we use) is that masking can occur at every position. In the previous, line-by-line cases, there was no masking near the end of each sequence, because those end tokens were padding tokens
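
As a quick sanity check (a sketch that assumes the out batch from the collation above is still in scope), you can verify that roughly mlm_prob of the positions were selected for masking, and that masked positions can fall anywhere in the sequence, including the final tokens:

# positions selected for masking are exactly those whose label is not -100
masked = out['labels'] != -100
# the fraction of masked positions should be roughly mlm_prob (0.15)
masked.float().mean()
# with concatenation-of-tokenization, the end of each chunk holds real tokens,
# so masking can happen there as well
masked[:, -10:].any()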

b) For causal language model

from transformers import AutoTokenizer
from tokenizers import processors

Let’s define our GPT2 tokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer
GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

GPT2 does not add start/end-of-sentence tokens by default:

print(tokenizer.convert_ids_to_tokens(tokenizer("this is a text. That is a second text.But there's a third one")['input_ids']))
['this', 'Ġis', 'Ġa', 'Ġtext', '.', 'ĠThat', 'Ġis', 'Ġa', 'Ġsecond', 'Ġtext', '.', 'But', 'Ġthere', "'s", 'Ġa', 'Ġthird', 'Ġone']

If you want to perform concatenation-of-tokenization and you want your causal LM to differentiate between sentences, you can add a special token to separate them, as follows:

tokenizer._tokenizer.post_processor = processors.TemplateProcessing(
    single="$A " + tokenizer.eos_token,
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id)],
)
tokenizer.pad_token = tokenizer.eos_token
print(tokenizer.convert_ids_to_tokens(tokenizer("this is a text. That is a second text.But there's a third one")['input_ids']))
['this', 'Ġis', 'Ġa', 'Ġtext', '.', 'ĠThat', 'Ġis', 'Ġa', 'Ġsecond', 'Ġtext', '.', 'But', 'Ġthere', "'s", 'Ġa', 'Ġthird', 'Ġone', '<|endoftext|>']

With this modified tokenizer, let’s perform concatenation-of-tokenization using GPT2

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,line_by_line=False,max_length=100)

Since it’s causal language modeling, let’s turn off is_mlm

tdc.set_data_collator(is_mlm=False)
list(map(len,tdc.main_ddict['train']['input_ids'][:5]))
[100, 100, 100, 100, 100]
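
As a quick check (a sketch), decoding one of the fixed-length chunks should show several reviews concatenated together, separated by the <|endoftext|> token that the post-processor now appends:

# decode the first 100-token chunk; reviews are joined by <|endoftext|>
print(tokenizer.decode(tdc.main_ddict['train']['input_ids'][0]))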

Let’s apply the collator

out = tdc.data_collator([tdc.main_ddict['train'][i] for i in range(5)]) # simulation with batch size 5
out['input_ids'].shape
torch.Size([5, 100])
out['input_ids'][:2,:]
tensor([[   72,  6149,   428,  2691,   290,   373, 11679,   351,   262,  4197,
           618,   340,  5284,   764,  1312,  6149,   262,  2124,    82,   290,
           340,   373,   991,   625,  7857,   284,   262,   966,   286,   852,
         42880, 16475,   764,  1312,   716,  7331,   642,   705,   860,   366,
           546, 11323,  8059,   290,   423,   257,  6547,  7888, 28668,   290,
           804,  1266,   287, 16270,    82,   326,   423,   617,  5485,   764,
           611,   345,   588,   257,  9155,  4197,   428,  1244,   307,   329,
           345,   764,   262,  2587,   318, 29175,   290,  5814,   290,  6792,
           764,  1312,   561,  1950, 16216,   866,   257,  2546,   764, 50256,
           568, 42880, 16475,  5145,  1107, 11679,   764,   925,   502,   804],
        [  718,  1227, 10423,   290,  1312,  1101,   257,  4273,   578,  2546,
           362,   764, 50256,    72,  1842,   374,  3361,   364,   290,   428,
           530,   318,  1107, 13779,   764,  1312,  3221,  5806,  2546,  1105,
           475,   815,   423,  1392,   257,   838,   837,   340,  4539,  1263,
           764,   340,  2331,  1165,   890,   837,   290,  1312,  1101,   642,
           705,   860,   366,   764,   262, 20842, 13779,   475,   257,  1310,
         33367,   764,  1312,  3432,   720, 24063,   543,   318,  1165,   881,
           837,  1201,  1312,  4398,   470, 12666,   340,  1865,   837,  1312,
           815,   423, 13488,   329,   340,   284,   467,   319,  5466,   764,
         50256,   986,   262,  3601,   318,   523, 21027,   278,   837,   290]])
out['labels'][:2,:]
tensor([[   72,  6149,   428,  2691,   290,   373, 11679,   351,   262,  4197,
           618,   340,  5284,   764,  1312,  6149,   262,  2124,    82,   290,
           340,   373,   991,   625,  7857,   284,   262,   966,   286,   852,
         42880, 16475,   764,  1312,   716,  7331,   642,   705,   860,   366,
           546, 11323,  8059,   290,   423,   257,  6547,  7888, 28668,   290,
           804,  1266,   287, 16270,    82,   326,   423,   617,  5485,   764,
           611,   345,   588,   257,  9155,  4197,   428,  1244,   307,   329,
           345,   764,   262,  2587,   318, 29175,   290,  5814,   290,  6792,
           764,  1312,   561,  1950, 16216,   866,   257,  2546,   764,  -100,
           568, 42880, 16475,  5145,  1107, 11679,   764,   925,   502,   804],
        [  718,  1227, 10423,   290,  1312,  1101,   257,  4273,   578,  2546,
           362,   764,  -100,    72,  1842,   374,  3361,   364,   290,   428,
           530,   318,  1107, 13779,   764,  1312,  3221,  5806,  2546,  1105,
           475,   815,   423,  1392,   257,   838,   837,   340,  4539,  1263,
           764,   340,  2331,  1165,   890,   837,   290,  1312,  1101,   642,
           705,   860,   366,   764,   262, 20842, 13779,   475,   257,  1310,
         33367,   764,  1312,  3432,   720, 24063,   543,   318,  1165,   881,
           837,  1201,  1312,  4398,   470, 12666,   340,  1865,   837,  1312,
           815,   423, 13488,   329,   340,   284,   467,   319,  5466,   764,
          -100,   986,   262,  3601,   318,   523, 21027,   278,   837,   290]])

For causal LM, the labels are essentially the same as input_ids. From the HuggingFace documentation:

`DataCollatorForLanguageModeling` will take care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training.
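
You can verify this on the batch above (a sketch, assuming out is still in scope): the labels match input_ids at every position that is not ignored, and the -100 positions are exactly the positions holding the pad token id (which we set to <|endoftext|>); the one-token shift itself is done inside the model when computing the loss.

import torch
# labels equal input_ids wherever they are not ignored (-100)
not_ignored = out['labels'] != -100
torch.equal(out['input_ids'][not_ignored], out['labels'][not_ignored])
# the ignored positions are exactly the pad-token positions
# (pad_token was set to eos_token, <|endoftext|>, id 50256)
torch.equal(out['labels'] == -100, out['input_ids'] == tokenizer.pad_token_id)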

8. Save and Load TextDataLMController


source

TextDataLMController.save_as_pickles

 TextDataLMController.save_as_pickles (fname, parent='pickle_files')
Type Default Details
fname Name of the pickle file
parent str pickle_files Parent folder

source

TextDataController.from_pickle

 TextDataController.from_pickle (fname, parent='pickle_files')
Type Default Details
fname Name of the pickle file
parent str pickle_files Parent folder

A TextDataLMController object can be saved and loaded with ease. This is especially useful after text processing and/or tokenization have been done

from datasets import disable_caching
disable_caching()
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,line_by_line=True,max_length=-1)

tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)
tdc.save_as_pickles('my_lm_tdc')

Load back our object

tdc2 = TextDataLMController.from_pickle('my_lm_tdc')

You can still access all of its attributes: data, preprocessings, transformations, …

tdc2.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 18111
    })
    validation: Dataset({
        features: ['Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 4529
    })
})
tdc2.filter_dict,tdc2.content_tfms
({'Review Text': <function __main__.<lambda>(x)>},
 [<function underthesea.pipeline.text_normalize.text_normalize(text, tokenizer='underthesea')>,
  <method 'lower' of 'str' objects>])