import pandas as pd
import numpy as np
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from importlib.machinery import SourceFileLoader
from datasets import load_dataset
import os
Text Main For Language Model
TextDataLMController
Class TextDataLMController
TextDataLMController
TextDataLMController (inp, main_text:str, filter_dict={}, metadatas=[], process_metas=True, metas_sep='.', content_transformations=[], val_ratio:int|float|None=0.2, stratify_cols=[], seed=None, batch_size=1024, num_proc=4, cols_to_keep=None, verbose=True)
Initialize self. See help(type(self)) for accurate signature.
Type | Default | Details | |
---|---|---|---|
inp | HuggingFace Dataset or DatasetDict | ||
main_text | str | Name of the main text column | |
filter_dict | dict | {} | A dictionary: {feature: filtering_function_for_that_feature} |
metadatas | list | [] | Names of the metadata columns |
process_metas | bool | True | Whether to do simple text processing on the chosen metadatas |
metas_sep | str | . | Separator for multiple metadata concatenation |
content_transformations | list | [] | A list of text transformations |
val_ratio | int | float | None | 0.2 | Ratio of data for validation set |
stratify_cols | list | [] | Column(s) needed to do stratified shuffle split |
seed | NoneType | None | Random seed |
batch_size | int | 1024 | CPU batch size |
num_proc | int | 4 | Number of processes for multiprocessing |
cols_to_keep | NoneType | None | Columns to keep after all processings |
verbose | bool | True | Whether to print processing information |
1. Load data + Basic use case
TextDataController.from_csv
TextDataController.from_csv (file_path, **kwargs)
TextDataController.from_df
TextDataController.from_df (df, validate=True, **kwargs)
You can create a TextDataLMController from a csv file, a pandas DataFrame, or directly from a HuggingFace Dataset object. Currently, TextDataLMController is designed for processing text in order to train a language model.
Dataset source: https://www.kaggle.com/datasets/kavita5/review_ecommerce
import pandas as pd
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df.shape
(23486, 10)
df.sample(5)
Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name | |
---|---|---|---|---|---|---|---|---|---|---|
18374 | 1077 | 43 | NaN | I love the color, which is eye popping without... | 4 | 1 | 1 | General | Dresses | Dresses |
9201 | 862 | 47 | NaN | I love this top. so much so that i bought it i... | 5 | 1 | 9 | General | Tops | Knits |
10964 | 1083 | 36 | Gor-geous | This dress is absolutely fantastic. beautiful,... | 5 | 1 | 0 | General | Dresses | Dresses |
4108 | 829 | 44 | Great quality, unique design | Very unique shirt-- you will get a compliment!... | 5 | 1 | 1 | General | Tops | Blouses |
9892 | 860 | 70 | Not a wow | I bought the bronze color which was nice but t... | 1 | 0 | 0 | General Petite | Tops | Knits |
You can create a TextDataLMController from a DataFrame. This also provides a quick input validation check (NaN check and duplication check).
tdc = TextDataLMController.from_df(df, main_text='Review Text')
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
You can also create a TextDataLMController directly from the csv file. The good thing about using HuggingFace Dataset as the main backend is that you can utilize lots of its useful functionality, such as caching.
tdc = TextDataLMController.from_csv('sample_data/Womens_Clothing_Reviews.csv', main_text='Review Text')
You can also create a TextDataLMController from a HuggingFace Dataset.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
dset
Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 23486
})
tdc = TextDataLMController(dset, main_text='Review Text')
In the “Input Validation Precheck” above, we notice that our dataset has missing values in the text field and in the label field (‘Department Name’). For now, let’s load the data as a pandas DataFrame, perform some cleaning, and create our TextDataLMController.
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df = df[(~df['Review Text'].isna()) & (~df['Department Name'].isna())].reset_index(drop=True)
tdc = TextDataLMController.from_df(df, main_text='Review Text')
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 2966
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 1 rows
At this point you can start performing two important steps on your data:
- Text preprocessing + train/validation split
- Tokenization
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
train: Dataset({
features: ['Review Text'],
num_rows: 18100
})
validation: Dataset({
features: ['Review Text'],
num_rows: 4526
})
})
Our DatasetDict now has two splits: train and validation. Note that the train set has also been shuffled and flattened, as reported in the processing log above.
ddict['train'][:3]
{'Review Text': ['A lovely skirt and i\'m so glad i found it before the medium sold out! having said that, i expected the medium to run small and that i\'d have to squeeze into it but having tried it on this evening it\'s not the case at all. i nice fit. i might even have fitted into a small, which i think is the only size remaining. the skirt is very spain inspired. very flamenco! i love it! i will say that you\'d need a bit of height to wear this skirt due to the length at the back. i\'m 5\'6" which is tall enough fo',
"The velvet isn't as soft or plush as i thought it would be but these are comfy pants. i won't wear them until next winter, which is fine.",
"So i almost returned this top without trying it on because i've been binging on tops with thin blue lines but so glad i didn't!! i'm busty like ddd36 and i weigh 170, but i got the 8 and it fits like a glove! perfection!! plus i got it on sale!! so fab!"]}
ddict['validation'][:3]
{'Review Text': ["I love these jeans! i really like the way they fit and haven't had problems with them stretching out like other reviewers have.",
'This shirt is so cute alone with jeans or dressed up with nice jewelry, a scarf or cardi. its just the right weight, true to size, drapes nicely and its very flattering. i"m sorry i didn\'t order more when i had the chance. its already sold out in the colors and sizes i wanted. excellent quality as usual -- thanks again retailer!',
'The colors on these leggings are very nice and the fit was fabulous. the waist is high enough to hold in a slight "muffin" top and the control in the fabric is just right. i received several compliments on them and hubby really liked them.']}
2. Filtering
This preprocessing step allows you to filter out certain values of a certain column in your dataset. Let’s say I want to filter out any None value in the column ‘Review Text’.
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df[(~df['Review Text'].isna())].isna().sum()
Clothing ID 0
Age 0
Title 2966
Review Text 0
Rating 0
Recommended IND 0
Positive Feedback Count 0
Division Name 13
Department Name 13
Class Name 13
dtype: int64
We will provide a dictionary containing the name of the column and the filtering function to apply on that column. Note that the filtering function will receive an item from the column, and the function should return a boolean
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None},
                                   seed=42
                                  )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
train: Dataset({
features: ['Review Text'],
num_rows: 18111
})
validation: Dataset({
features: ['Review Text'],
num_rows: 4529
})
})
Let’s check whether we have filtered out all NaN/None values
for i in ddict['train']['Review Text']:
assert i is not None
for i in ddict['validation']['Review Text']:
assert i is not None
We can even add multiple filtering functions. Remember from our precheck, there are also None values in ‘Department Name’. While we are at it, let’s filter out any rating that is less than 3 (just to showcase what our filtering can do)
df.Rating.value_counts()
Rating
5 13131
4 5077
3 2871
2 1565
1 842
Name: count, dtype: int64
Note that TextDataLMController will only keep the text and the metadata columns; any other column will be dropped. To double-check our result, we need to define the cols_to_keep argument.
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None,
                                                'Department Name': lambda x: x is not None,
                                                'Rating': lambda x: x>=3
                                               },
                                   cols_to_keep=['Review Text','Rating','Department Name'],
                                   seed=42
                                  )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Rating -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
for i in ddict['train']['Department Name']:
assert i is not None
for i in ddict['validation']['Department Name']:
assert i is not None
for i in ddict['train']['Rating']:
assert i is not None
for i in ddict['validation']['Rating']:
assert i >= 3
3. Metadatas concatenation
If we think the metadata can be helpful, we can concatenate it to the front of the main text, so that our language model is aware of it.
In this example, let’s add ‘Title’ as our metadata.
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None},
                                   metadatas='Title',
                                   process_metas=True, # to preprocess the metadata (currently it's just empty space stripping and lowercasing)
                                   seed=42
                                  )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Shuffling and flattening train set --------------------
Done
ddict['train'][:3]
{'Title': ['not flattering on me', '', ''],
'Review Text': ['not flattering on me . I ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5\'9" about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.',
" . So unflattering! really disappointed. made me look 6 month pregnant and i'm a petite size 2.",
' . This t-shirt does a great job of elevating the basic t-shirt in to one with a touch of flair. i typically wear a medium but luckily read earlier reviews and went with the small.']}
ddict['validation'][:3]
{'Title': ['', '', ''],
'Review Text': [" . This picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.",
' . Easy to wear! cute, comfy...will be a go to for summer.',
' . Nice sweater, just did not look good on me. sorry, going back.']}
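You can also pass several metadata columns at once; they will be joined with metas_sep before being prepended to the main text. A minimal sketch (using ‘Division Name’ as a second metadata column purely for illustration):

tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None},
                                   metadatas=['Title','Division Name'], # multiple metadata columns
                                   metas_sep='.',                       # separator placed between the concatenated metadatas
                                   process_metas=True,
                                   seed=42
                                  )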
4. Content Transformation
This processing allows you to alter the text content in your dataset. You need to define a function that accepts a single string and returns a new, processed string. Note that this transformation will be applied to ALL of your dataset (both train and validation)
Let’s say we want to normalize our text, because the text might contain some extra spaces between words, or not follow the “single space after a period” rule
= "This is a sentence,which doesn't follow any rule!No single space is provided after period or punctuation marks. Maybe there are too many spaces!?! " _tmp
from underthesea import text_normalize
text_normalize(_tmp)
"This is a sentence , which doesn't follow any rule ! No single space is provided after period or punctuation marks . Maybe there are too many spaces ! ? !"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=text_normalize,
                           seed=42
                          )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict['train']['Review Text'][0]
'I ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 \' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .'
ddict['validation']['Review Text'][0]
"This picture doesn't do the skirt justice . i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt . it is really pretty and flattering on ."
You can chain multiple functions. Let’s say after text normalizing, I want to lowercase the text
str.lower('tHis IS NoT lowerCASE')
'this is not lowercase'
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           seed=42
                          )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict['train']['Review Text'][0]
'i ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 \' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .'
ddict['validation']['Review Text'][0]
"this picture doesn't do the skirt justice . i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt . it is really pretty and flattering on ."
5. Train/Validation Split
There are several ways to perform a train/validation split with TextDataLMController.
The first way is when you already have a validation split in your HuggingFace Dataset. Let’s use the Dataset built-in function train_test_split to simulate this.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1)
# This will create a 'test' split instead of 'validation', so we will process a bit to have a validation split
ddict_with_val['validation'] = ddict_with_val['test']
del ddict_with_val['test']
ddict_with_val
DatasetDict({
train: Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 21137
})
validation: Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 2349
})
})
tdc = TextDataLMController(ddict_with_val,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           seed=42
                          )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
Validation split already exists
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
train: Dataset({
features: ['Review Text'],
num_rows: 20368
})
validation: Dataset({
features: ['Review Text'],
num_rows: 2273
})
})
A second way is to split randomly, based either on a ratio (a float between 0 and 1) or on the number of samples you want in your validation set.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           val_ratio=0.15,
                           seed=42,
                           verbose=False
                          )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
DatasetDict({
train: Dataset({
features: ['Review Text'],
num_rows: 19243
})
validation: Dataset({
features: ['Review Text'],
num_rows: 3397
})
})
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           val_ratio=5000,
                           seed=42,
                           verbose=False
                          )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
DatasetDict({
train: Dataset({
features: ['Review Text'],
num_rows: 17640
})
validation: Dataset({
features: ['Review Text'],
num_rows: 5000
})
})
A third way is to do a random stratified split (inspired by sklearn’s). Let’s do a stratified split based on our label ‘Department Name’
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].value_counts(normalize=True)
Department Name
Tops 0.445978
Dresses 0.269214
Bottoms 0.161852
Intimate 0.073918
Jackets 0.043967
Trend 0.005070
Name: proportion, dtype: float64
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None,
                                                'Department Name': lambda x: x is not None,
                                               },
                                   val_ratio=0.2,
                                   stratify_cols='Department Name',
                                   cols_to_keep=['Review Text','Department Name'],
                                   seed=42
                                  )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name'],
num_rows: 18100
})
validation: Dataset({
features: ['Review Text', 'Department Name'],
num_rows: 4526
})
})
pd.Series(ddict['train']['Department Name']).value_counts(normalize=True)
Tops 0.444033
Dresses 0.271602
Bottoms 0.161878
Intimate 0.072983
Jackets 0.044309
Trend 0.005193
Name: proportion, dtype: float64
pd.Series(ddict['validation']['Department Name']).value_counts(normalize=True)
Tops 0.444101
Dresses 0.271542
Bottoms 0.161732
Intimate 0.073133
Jackets 0.044189
Trend 0.005303
Name: proportion, dtype: float64
You can also use multiple columns for your stratification
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None,
                                                'Department Name': lambda x: x is not None,
                                               },
                                   val_ratio=0.2,
                                   stratify_cols=['Department Name','Rating'],
                                   cols_to_keep=['Review Text','Department Name','Rating'],
                                   seed=42,
                                   verbose=False
                                  )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
DatasetDict({
train: Dataset({
features: ['Review Text', 'Rating', 'Department Name'],
num_rows: 18100
})
validation: Dataset({
features: ['Review Text', 'Rating', 'Department Name'],
num_rows: 4526
})
})
And finally, you can omit the validation split entirely by specifying val_ratio as None.
tdc = TextDataLMController.from_df(df,
                                   main_text='Review Text',
                                   filter_dict={'Review Text': lambda x: x is not None},
                                   val_ratio=None,
                                   seed=42
                                  )
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
No validation split defined
Done
-------------------- Dropping unused features --------------------
Done
-------------------- Shuffling and flattening train set --------------------
Done
DatasetDict({
train: Dataset({
features: ['Review Text'],
num_rows: 22641
})
})
6. Tokenization
Define our tokenizer
from transformers import RobertaTokenizer
from underthesea import text_normalize
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
/home/quan/anaconda3/envs/nlp_dev/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
TextDataLMController.process_and_tokenize
TextDataLMController.process_and_tokenize (tokenizer, max_length=None, line_by_line=True, stride=None, trn_size=None, tok_num_proc=None, shuffle_trn=True, check_val_leak=True)
This will perform do_all_processing then do_tokenization.
Type | Default | Details | |
---|---|---|---|
tokenizer | Tokenizer (preferably from HuggingFace) | ||
max_length | NoneType | None | pad to model’s allowed max length (default is max_sequence_length) |
line_by_line | bool | True | Whether to tokenize each sentence separately, or concatenate them all and then tokenize |
stride | NoneType | None | option to do striding when line_by_line is False |
trn_size | NoneType | None | The number of training data to be tokenized |
tok_num_proc | NoneType | None | Number of processes for tokenization |
shuffle_trn | bool | True | To shuffle the train set before tokenization |
check_val_leak | bool | True | To check (and remove) training data which is leaked to validation set |
a) Option 1: Tokenize our corpus line-by-line
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           cols_to_keep=['Clothing ID','Review Text'],
                           seed=42,
                           verbose=False
                          )
With no padding
tdc.process_and_tokenize(tokenizer, line_by_line=True, max_length=-1)
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Clothing ID', 'Review Text', 'input_ids', 'special_tokens_mask', 'attention_mask'],
num_rows: 18111
})
validation: Dataset({
features: ['Clothing ID', 'Review Text', 'input_ids', 'special_tokens_mask', 'attention_mask'],
num_rows: 4529
})
})
print(tdc.main_ddict['train']['Review Text'][0])
print(tdc.main_ddict['validation']['Review Text'][0])
i ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 ' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .
this picture doesn't do the skirt justice . i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt . it is really pretty and flattering on .
print(tokenizer.decode(tdc.main_ddict['train']['input_ids'][0]))
print(tokenizer.decode(tdc.main_ddict['validation']['input_ids'][0]))
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s>
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s>
With padding (set max_length to None if you want to pad to the model’s maximum sequence length)
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           cols_to_keep=['Clothing ID','Review Text'],
                           seed=42,
                           verbose=False
                          )
tdc.process_and_tokenize(tokenizer, line_by_line=True, max_length=100)
print(tokenizer.decode(tdc.main_ddict['train']['input_ids'][0]))
print(tokenizer.decode(tdc.main_ddict['validation']['input_ids'][0]))
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad>
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
b) Option 2: Tokenize every text, then concatenate them together before splitting them into smaller parts.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           cols_to_keep=['Clothing ID','Review Text'],
                           seed=42,
                           verbose=False,
                          )
tdc.process_and_tokenize(tokenizer, line_by_line=False, max_length=100)
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['input_ids', 'special_tokens_mask', 'attention_mask'],
num_rows: 13573
})
validation: Dataset({
features: ['input_ids', 'special_tokens_mask', 'attention_mask'],
num_rows: 3446
})
})
Notice that even though I passed the cols_to_keep parameter, the returned DatasetDict still does not keep those columns, because it wouldn’t make sense to retain them once the tokens are concatenated. Normally, you will leave cols_to_keep as None (the default) when line_by_line is False.
for i in tdc.main_ddict['train']['input_ids'][:3]:
print(tokenizer.decode(i))
print('-'*100)
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made
----------------------------------------------------------------------------------------------------
me look 6 month pregnant and i'm a petite size 2.</s><s>i love rompers and this one is really cute. i usually wear size 12 but should have got a 10, it runs big. it seems too long, and i'm 5'9 ". the prints cute but a little blah. i paid $ 158 which is too much, since i haven't worn it yet, i should have waited for it to go on sale.</s><s>... the print is so
----------------------------------------------------------------------------------------------------
sharking, and i love the way it looks on the model -- but i'm a more curvy figure, and the boxy-ish cut plus rather stuff fabric in front is incredibly unflattering. ordinarily i love everything made by maeve, but this one sadly must be returned... on a thinner / straighter-shaped person i expect it would be great.</s><s>i've had my eye on this poncho for weeks and finally scored the olive green one over thanksgiving /
----------------------------------------------------------------------------------------------------
for i in tdc.main_ddict['validation']['input_ids'][:3]:
print(tokenizer.decode(i))
print('-'*100)
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s><s>easy to wear! cute, comfy... will be a go to for summer.</s><s>nice sweater, just did not look good on me. sorry, going back.</s><s>this jacket was a little shorter than i had expected, but i still really enjoy the cut and fit of it
----------------------------------------------------------------------------------------------------
.</s><s>i wasn't planning on loving this dress when i tried it on. i loved the the color which is what prompted me to buy it. this dress fit perfectly. it hugs my body without feeling tight. the ruching is perfect. i didn't want to take it off! it's also very comfortable. i'm 5'1 ", 107 lbs and the xs petite fit perfectly. the dress hits me at the same length that is pictured. i think it would
----------------------------------------------------------------------------------------------------
be easy to hem if you wanted it to be shorter. i have a short torso and saw no issues with that as some reviewer</s><s>i like flowy tops because i have a bit of a belly and i like to camouflage it but this top was really flowy. the fabric is great and the embroidery is beautiful, i was hoping for this to be a holiday staple this year. it has to go back though, just too large. i don't love it quite enough to order
----------------------------------------------------------------------------------------------------
c) Striding (For Concatenation of tokens)
If your sentences (or paragraphs) are longer than max_length, they will be broken apart after concatenation, and your long paragraph will be incomplete in terms of meaning. Striding is a way to somewhat preserve the sentence’s meaning, by repeating part of the previous chunk at the beginning of the next one. We will demonstrate it with an example, and you can compare it with the previous one (without striding) to see the differences.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           seed=42,
                           verbose=False,
                          )
tdc.process_and_tokenize(tokenizer, line_by_line=False, max_length=100, stride=20)
# Stride is 20, meaning for the next entry, we go back 20 tokens
for i in tdc.main_ddict['train']['input_ids'][:3]:
print(tokenizer.decode(i))
print('-'*100)
<s>i ordered this online and was disappointed with the fit when it arrived. i ordered the xs and it was still oversize to the point of being unflattering. i am tall 5'9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape. if you like a loose fit this might be for you. the material is thicker and warm and comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made
----------------------------------------------------------------------------------------------------
comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made me look 6 month pregnant and i'm a petite size 2.</s><s>i love rompers and this one is really cute. i usually wear size 12 but should have got a 10, it runs big. it seems too long, and i'm 5'9 ". the prints cute but a little blah. i paid $ 158 which is too much, since i haven't worn it
----------------------------------------------------------------------------------------------------
but a little blah. i paid $ 158 which is too much, since i haven't worn it yet, i should have waited for it to go on sale.</s><s>... the print is so sharking, and i love the way it looks on the model -- but i'm a more curvy figure, and the boxy-ish cut plus rather stuff fabric in front is incredibly unflattering. ordinarily i love everything made by maeve, but this one sadly must be returned... on
----------------------------------------------------------------------------------------------------
For the second entry, we can see it starts with the last 20 tokens of the previous entry: comfortable. i would suggest ordering down a size.</s><s>so unflattering! really disappointed. made
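The overlap logic itself is simple. Here is a minimal, hypothetical sketch of the idea (the library relies on the tokenizer’s own striding, which also handles details such as the final short chunk; chunk_with_stride below is purely illustrative):

def chunk_with_stride(token_ids, max_length=10, stride=3):
    # each new chunk starts (max_length - stride) tokens after the previous one,
    # so it repeats the last `stride` tokens of the previous chunk
    step = max_length - stride
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), step)]

chunk_with_stride(list(range(25)), max_length=10, stride=3)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [7, 8, 9, 10, 11, 12, 13, 14, 15, 16], [14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [21, 22, 23, 24]]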
for i in tdc.main_ddict['validation']['input_ids'][:3]:
print(tokenizer.decode(i))
print('-'*100)
<s>this picture doesn't do the skirt justice. i paired it with a creme colored cashmere cowlneck sweater and a silver jeweled belt. it is really pretty and flattering on.</s><s>easy to wear! cute, comfy... will be a go to for summer.</s><s>nice sweater, just did not look good on me. sorry, going back.</s><s>this jacket was a little shorter than i had expected, but i still really enjoy the cut and fit of it
----------------------------------------------------------------------------------------------------
was a little shorter than i had expected, but i still really enjoy the cut and fit of it.</s><s>i wasn't planning on loving this dress when i tried it on. i loved the the color which is what prompted me to buy it. this dress fit perfectly. it hugs my body without feeling tight. the ruching is perfect. i didn't want to take it off! it's also very comfortable. i'm 5'1 ", 107 lbs and the xs pet
----------------------------------------------------------------------------------------------------
it's also very comfortable. i'm 5'1 ", 107 lbs and the xs petite fit perfectly. the dress hits me at the same length that is pictured. i think it would be easy to hem if you wanted it to be shorter. i have a short torso and saw no issues with that as some reviewer</s><s>i like flowy tops because i have a bit of a belly and i like to camouflage it but this top was really flowy. the fabric is great and
----------------------------------------------------------------------------------------------------
7. Data Collator
from underthesea import text_normalize
from transformers import AutoTokenizer
a) For masked language model
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
Let’s define our text controller first
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           cols_to_keep=['Clothing ID','Review Text'],
                           seed=42,
                           verbose=False
                          )
We will tokenize our corpus line-by-line
tdc.process_and_tokenize(tokenizer, line_by_line=True, max_length=-1)
tdc.set_data_collator(is_mlm=True, mlm_prob=0.15)
tdc.data_collator
DataCollatorForLanguageModeling(tokenizer=RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}, mlm=True, mlm_probability=0.15, pad_to_multiple_of=8, tf_experimental_compile=False, return_tensors='pt')
Before applying the collator…
print([tdc.main_ddict['train'][i] for i in range(2)])
[{'Clothing ID': 937, 'Review Text': 'i ordered this online and was disappointed with the fit when it arrived . i ordered the xs and it was still oversize to the point of being unflattering . i am tall 5 \' 9 " about 130 pounds and have a fairly thin torso and look best in cloths that have some shape . if you like a loose fit this might be for you . the material is thicker and warm and comfortable . i would suggest ordering down a size .', 'input_ids': [0, 118, 2740, 42, 804, 8, 21, 5779, 19, 5, 2564, 77, 24, 2035, 479, 939, 2740, 5, 3023, 29, 8, 24, 21, 202, 81, 10799, 7, 5, 477, 9, 145, 29747, 24203, 479, 939, 524, 6764, 195, 128, 361, 22, 59, 8325, 2697, 8, 33, 10, 5342, 7174, 28762, 8, 356, 275, 11, 21543, 29, 14, 33, 103, 3989, 479, 114, 47, 101, 10, 7082, 2564, 42, 429, 28, 13, 47, 479, 5, 1468, 16, 33997, 8, 3279, 8, 3473, 479, 939, 74, 3608, 12926, 159, 10, 1836, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}, {'Clothing ID': 870, 'Review Text': "so unflattering ! really disappointed . made me look 6 month pregnant and i'm a petite size 2 .", 'input_ids': [0, 2527, 29747, 24203, 27785, 269, 5779, 479, 156, 162, 356, 231, 353, 5283, 8, 939, 437, 10, 4716, 1459, 1836, 132, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}]
We can see that the length of each token list is different from each other
list(map(len,tdc.main_ddict['train']['input_ids'][:5]))
[91, 24, 79, 82, 121]
Let’s apply the collator
# extract only the required keys
inp_keys = tokenizer.model_input_names
_inp = [{k:tdc.main_ddict['train'][i][k] for k in inp_keys} for i in range(5)]
print(_inp[:2])
[{'input_ids': [0, 118, 2740, 42, 804, 8, 21, 5779, 19, 5, 2564, 77, 24, 2035, 479, 939, 2740, 5, 3023, 29, 8, 24, 21, 202, 81, 10799, 7, 5, 477, 9, 145, 29747, 24203, 479, 939, 524, 6764, 195, 128, 361, 22, 59, 8325, 2697, 8, 33, 10, 5342, 7174, 28762, 8, 356, 275, 11, 21543, 29, 14, 33, 103, 3989, 479, 114, 47, 101, 10, 7082, 2564, 42, 429, 28, 13, 47, 479, 5, 1468, 16, 33997, 8, 3279, 8, 3473, 479, 939, 74, 3608, 12926, 159, 10, 1836, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}, {'input_ids': [0, 2527, 29747, 24203, 27785, 269, 5779, 479, 156, 162, 356, 231, 353, 5283, 8, 939, 437, 10, 4716, 1459, 1836, 132, 479, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}]
out = tdc.data_collator(_inp) # simulation with batch size 5
out.keys()
dict_keys(['input_ids', 'attention_mask', 'labels'])
Now all token lists have the same length, which is 128: a multiple of 8 and larger than the longest list in the batch (which is 121)
out['input_ids'].shape
torch.Size([5, 128])
out['input_ids'][:2,:]
tensor([[ 0, 118, 2740, 42, 804, 8, 21, 5779, 19, 50264,
2564, 77, 24, 2035, 479, 939, 2740, 5, 3023, 29,
8, 24, 21, 202, 50264, 10799, 7, 5, 477, 50264,
145, 50264, 24203, 479, 939, 524, 6764, 195, 128, 361,
22, 59, 8325, 2697, 8, 33, 10, 5342, 7174, 28762,
50264, 356, 50264, 11, 21543, 29, 14, 33, 103, 38941,
479, 114, 47, 101, 10, 7082, 2564, 42, 429, 28,
13, 47, 479, 50264, 1468, 44089, 33997, 8, 3279, 8,
3473, 479, 939, 74, 3608, 12926, 159, 50264, 1836, 479,
2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1],
[ 0, 2527, 29747, 24203, 50264, 269, 5779, 479, 50264, 50264,
356, 231, 353, 50264, 8, 50264, 437, 10, 4716, 1459,
1836, 49943, 479, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]])
The labels have also been constructed; the positions that are not -100 are the masked tokens the model has to predict. To increase the number of masked tokens, increase mlm_prob (see the one-liner after the output below).
out['labels'][:2,:]
tensor([[ -100, -100, -100, -100, -100, -100, -100, -100, -100, 5,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, 81, -100, -100, -100, -100, 9,
-100, 29747, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
8, -100, 275, -100, -100, -100, -100, -100, -100, 3989,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, 5, 1468, 16, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, 10, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100],
[ -100, -100, -100, -100, 27785, -100, -100, -100, 156, 162,
-100, -100, -100, 5283, -100, 939, -100, -100, -100, -100,
-100, 132, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100]])
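If you want more positions to be masked, you can simply re-create the collator with a higher masking probability, for example:

tdc.set_data_collator(is_mlm=True, mlm_prob=0.3)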
If you apply padding in the tokenization step (by adjusting the max_length argument), no matter whether it’s line-by-line tokenization or not, the data collator will skip the padding step.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           cols_to_keep=['Clothing ID','Review Text'],
                           seed=42,
                           verbose=False
                          )
tdc.process_and_tokenize(tokenizer, line_by_line=False, max_length=100)
tdc.set_data_collator(is_mlm=True, mlm_prob=0.15)
list(map(len,tdc.main_ddict['train']['input_ids'][:5]))
[100, 100, 100, 100, 100]
Let’s apply the collator
inp_keys = tokenizer.model_input_names
_inp = [{k:tdc.main_ddict['train'][i][k] for k in inp_keys} for i in range(5)]
out = tdc.data_collator(_inp) # simulation with batch size 5
out['input_ids'].shape
torch.Size([5, 100])
out['input_ids'][:2,:]
tensor([[ 0, 118, 2740, 42, 804, 8, 21, 5779, 19, 50264,
2564, 77, 24, 2035, 479, 939, 2740, 5, 3023, 29,
8, 24, 21, 202, 81, 10799, 7, 5, 477, 50264,
145, 50264, 24203, 479, 939, 524, 6764, 195, 128, 361,
22, 59, 8325, 2697, 8, 33, 10, 5342, 7174, 28762,
50264, 356, 50264, 11, 21543, 29, 14, 33, 103, 41316,
479, 114, 47, 101, 10, 7082, 2564, 42, 429, 28,
13, 47, 479, 50264, 17204, 50264, 33997, 8, 3279, 8,
3473, 479, 939, 74, 3608, 12926, 159, 50264, 1836, 479,
2, 0, 2527, 29747, 50264, 27785, 269, 5779, 479, 50264],
[ 162, 356, 231, 50264, 5283, 8, 50264, 437, 23781, 4716,
1459, 1836, 132, 479, 2, 0, 118, 50264, 910, 7474,
268, 8, 42, 65, 16, 269, 11962, 50264, 939, 2333,
3568, 1836, 50264, 53, 197, 33, 50264, 50264, 158, 2156,
24, 50264, 380, 44224, 24, 1302, 350, 251, 2156, 50264,
939, 437, 195, 50264, 361, 22, 479, 5, 19553, 11962,
53, 10, 410, 50264, 479, 939, 1199, 68, 26498, 61,
16, 350, 203, 2156, 187, 939, 50264, 75, 10610, 50264,
648, 2156, 939, 197, 33, 9010, 13, 24, 7, 213,
15, 1392, 479, 2, 0, 734, 5, 5780, 16, 98]])
out['labels'][:2,:]
tensor([[ -100, -100, -100, -100, -100, -100, -100, -100, -100, 5,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, 81, -100, -100, -100, -100, 9,
-100, 29747, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
8, -100, 275, -100, -100, -100, -100, -100, -100, 3989,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, 5, 1468, 16, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, 10, -100, -100,
-100, -100, -100, -100, 24203, -100, -100, -100, -100, 156],
[ -100, -100, -100, 353, -100, -100, 939, -100, 10, -100,
-100, -100, -100, 479, -100, -100, -100, 657, -100, -100,
-100, -100, -100, -100, -100, -100, -100, 479, -100, -100,
-100, -100, 316, -100, -100, -100, 300, 10, -100, -100,
-100, 1237, -100, 479, -100, -100, -100, -100, -100, 8,
-100, -100, -100, 128, -100, 22, -100, -100, -100, -100,
-100, -100, -100, 38596, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, 2220, -100, -100, 24,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100]])
Since we are using the concatenation-of-tokens technique, one smart thing that HuggingFace’s DataCollatorForLanguageModeling (the data collator we use) does is allow masking at every position. This is in contrast to the previous cases (with line-by-line tokenization), where there is no masking near the end of each list, because those end tokens are padding tokens.
b) For causal language model
from transformers import AutoTokenizer
from tokenizers import processors
Let’s define our GPT2 tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer
GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
GPT2’s tokenizer does not add a start/end-of-sentence token by default:
print(tokenizer.convert_ids_to_tokens(tokenizer("this is a text. That is a second text.But there's a third one")['input_ids']))
['this', 'Ġis', 'Ġa', 'Ġtext', '.', 'ĠThat', 'Ġis', 'Ġa', 'Ġsecond', 'Ġtext', '.', 'But', 'Ġthere', "'s", 'Ġa', 'Ġthird', 'Ġone']
If you want to perform concatenation of tokens, and you want your causal LM to differentiate between sentences, you can add a special token to separate sentences, as follows:
tokenizer._tokenizer.post_processor = processors.TemplateProcessing(
    single="$A " + tokenizer.eos_token,
    special_tokens=[(tokenizer.eos_token, tokenizer.eos_token_id)],
)
tokenizer.pad_token = tokenizer.eos_token
print(tokenizer.convert_ids_to_tokens(tokenizer("this is a text. That is a second text.But there's a third one")['input_ids']))
['this', 'Ġis', 'Ġa', 'Ġtext', '.', 'ĠThat', 'Ġis', 'Ġa', 'Ġsecond', 'Ġtext', '.', 'But', 'Ġthere', "'s", 'Ġa', 'Ġthird', 'Ġone', '<|endoftext|>']
With this modified tokenizer, let’s perform concatenation of tokens using GPT2.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           seed=42,
                           verbose=False
                          )
tdc.process_and_tokenize(tokenizer, line_by_line=False, max_length=100)
Since it’s causal language modeling, let’s turn off is_mlm
tdc.set_data_collator(is_mlm=False)
list(map(len,tdc.main_ddict['train']['input_ids'][:5]))
[100, 100, 100, 100, 100]
Let’s apply the collator
out = tdc.data_collator([tdc.main_ddict['train'][i] for i in range(5)]) # simulation with batch size 5
out['input_ids'].shape
torch.Size([5, 100])
out['input_ids'][:2,:]
tensor([[ 72, 6149, 428, 2691, 290, 373, 11679, 351, 262, 4197,
618, 340, 5284, 764, 1312, 6149, 262, 2124, 82, 290,
340, 373, 991, 625, 7857, 284, 262, 966, 286, 852,
42880, 16475, 764, 1312, 716, 7331, 642, 705, 860, 366,
546, 11323, 8059, 290, 423, 257, 6547, 7888, 28668, 290,
804, 1266, 287, 16270, 82, 326, 423, 617, 5485, 764,
611, 345, 588, 257, 9155, 4197, 428, 1244, 307, 329,
345, 764, 262, 2587, 318, 29175, 290, 5814, 290, 6792,
764, 1312, 561, 1950, 16216, 866, 257, 2546, 764, 50256,
568, 42880, 16475, 5145, 1107, 11679, 764, 925, 502, 804],
[ 718, 1227, 10423, 290, 1312, 1101, 257, 4273, 578, 2546,
362, 764, 50256, 72, 1842, 374, 3361, 364, 290, 428,
530, 318, 1107, 13779, 764, 1312, 3221, 5806, 2546, 1105,
475, 815, 423, 1392, 257, 838, 837, 340, 4539, 1263,
764, 340, 2331, 1165, 890, 837, 290, 1312, 1101, 642,
705, 860, 366, 764, 262, 20842, 13779, 475, 257, 1310,
33367, 764, 1312, 3432, 720, 24063, 543, 318, 1165, 881,
837, 1201, 1312, 4398, 470, 12666, 340, 1865, 837, 1312,
815, 423, 13488, 329, 340, 284, 467, 319, 5466, 764,
50256, 986, 262, 3601, 318, 523, 21027, 278, 837, 290]])
out['labels'][:2,:]
tensor([[ 72, 6149, 428, 2691, 290, 373, 11679, 351, 262, 4197,
618, 340, 5284, 764, 1312, 6149, 262, 2124, 82, 290,
340, 373, 991, 625, 7857, 284, 262, 966, 286, 852,
42880, 16475, 764, 1312, 716, 7331, 642, 705, 860, 366,
546, 11323, 8059, 290, 423, 257, 6547, 7888, 28668, 290,
804, 1266, 287, 16270, 82, 326, 423, 617, 5485, 764,
611, 345, 588, 257, 9155, 4197, 428, 1244, 307, 329,
345, 764, 262, 2587, 318, 29175, 290, 5814, 290, 6792,
764, 1312, 561, 1950, 16216, 866, 257, 2546, 764, -100,
568, 42880, 16475, 5145, 1107, 11679, 764, 925, 502, 804],
[ 718, 1227, 10423, 290, 1312, 1101, 257, 4273, 578, 2546,
362, 764, -100, 72, 1842, 374, 3361, 364, 290, 428,
530, 318, 1107, 13779, 764, 1312, 3221, 5806, 2546, 1105,
475, 815, 423, 1392, 257, 838, 837, 340, 4539, 1263,
764, 340, 2331, 1165, 890, 837, 290, 1312, 1101, 642,
705, 860, 366, 764, 262, 20842, 13779, 475, 257, 1310,
33367, 764, 1312, 3432, 720, 24063, 543, 318, 1165, 881,
837, 1201, 1312, 4398, 470, 12666, 340, 1865, 837, 1312,
815, 423, 13488, 329, 340, 284, 467, 319, 5466, 764,
-100, 986, 262, 3601, 318, 523, 21027, 278, 837, 290]])
For CLM, the labels are essentially the same as the input_ids. From the HuggingFace documentation:
`DataCollatorForLanguageModeling` will take care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training.
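In other words, the collator copies input_ids into labels (replacing pad-token positions with -100, as seen above), and the shift by one happens inside the model’s loss computation. A minimal sketch of that shift, assuming a standard PyTorch cross-entropy loss (illustrative only, not this library’s code):

import torch
import torch.nn.functional as F

def causal_lm_loss(logits, labels, ignore_index=-100):
    # each position predicts the *next* token; positions labeled -100 are ignored
    shift_logits = logits[:, :-1, :]   # predictions for tokens 0 .. n-2
    shift_labels = labels[:, 1:]       # targets are tokens 1 .. n-1
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1),
                           ignore_index=ignore_index)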
8. Save and Load TextDataLMController
TextDataLMController.save_as_pickles
TextDataLMController.save_as_pickles (fname, parent='pickle_files')
Type | Default | Details | |
---|---|---|---|
fname | Name of the pickle file | ||
parent | str | pickle_files | Parent folder |
TextDataController.from_pickle
TextDataController.from_pickle (fname, parent='pickle_files')
Type | Default | Details | |
---|---|---|---|
fname | Name of the pickle file | ||
parent | str | pickle_files | Parent folder |
A TextDataLMController object can be saved and loaded with ease. This is especially useful after text processing and/or tokenization have been done.
from datasets import disable_caching
disable_caching()
tokenizer = AutoTokenizer.from_pretrained('roberta-base')

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                           main_text='Review Text',
                           filter_dict={'Review Text': lambda x: x is not None},
                           content_transformations=[text_normalize,str.lower],
                           seed=42,
                           verbose=False
                          )
tdc.process_and_tokenize(tokenizer, line_by_line=True, max_length=-1)

tdc.set_data_collator(is_mlm=True, mlm_prob=0.15)
tdc.save_as_pickles('my_lm_tdc')
Load back our object
tdc2 = TextDataLMController.from_pickle('my_lm_tdc')
You can still access all its attributes, data, preprocessings, transformations …
tdc2.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 18111
})
validation: Dataset({
features: ['Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 4529
})
})
tdc2.filter_dict,tdc2.content_tfms
({'Review Text': <function __main__.<lambda>(x)>},
[<function underthesea.pipeline.text_normalize.text_normalize(text, tokenizer='underthesea')>,
<method 'lower' of 'str' objects>])