Text Main

This module contains the main Python class for data control: TextDataController
import pandas as pd
import numpy as np
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from importlib.machinery import SourceFileLoader
import os

Helper functions

1. Content Transformation, Augmentations, and Tokenization


source

tokenizer_explain

 tokenizer_explain (inp, tokenizer, split_word=False)

Display results from tokenizer

Type Default Details
inp Input sentence
tokenizer Tokenizer (preferably from HuggingFace)
split_word bool False Whether the input inp is already split into a list of words

We can use this function to show how HuggingFace’s tokenizer works and what its outputs look like

Let’s try the PhoBert tokenizer (for Vietnamese text). The PhoBert tokenizer requires its input to be word-segmented, so we will use our built-in function apply_vnmese_word_tokenize to do this

from transformers import AutoTokenizer
_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
inp = apply_vnmese_word_tokenize('hội cư dân chung cư sen hồng - chung cư lotus sóng thần thủ đức')
print(inp)
hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức

Now we can use tokenizer_explain to see how our PhoBert-base tokenizer processes our input inp

tokenizer_explain(inp,_tokenizer)
        ------- Tokenizer Explained -------
----- Input -----
hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức

----- Tokenized results ----- 
{'input_ids': [0, 1093, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'hội', 'cư_dân', 'chung_cư', 'sen', 'hồng', '-', 'chung_cư', 'lo@@', 'tus', 'sóng_thần', 'thủ_@@', 'đức', '</s>']

----- Results from tokenizer.decode ----- 
<s> hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức </s>

The Tokenized results section is the raw output of the tokenizer, while Results from tokenizer.convert_ids_to_tokens shows what each token id really is: the newly added start and end tokens, and even the byte-pair encoding in action
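If your input is already split into a list of words, you can set split_word=True (see the parameter table above). A minimal sketch, assuming the list is passed straight through to the tokenizer:

tokenizer_explain(inp.split(),_tokenizer,split_word=True)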


source

two_steps_tokenization_explain

 two_steps_tokenization_explain (inp, tokenizer, content_tfms=[],
                                 aug_tfms=[])

Display results from each content transformation, then display results from tokenizer

Type Default Details
inp Input sentence
tokenizer Tokenizer (preferably from HuggingFace)
content_tfms list [] A list of text transformations
aug_tfms list [] A list of text augmentations

This function further showcases how each text transformation and/or text augmentation affects our text input, step by step

Let’s load the PhoBert tokenizer one more time to test out this function

_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
from underthesea import text_normalize

apply_vnmese_word_tokenize also has an option to normalize text (i.e. standardizing the text input, getting rid of extra spaces, normalizing accents for Vietnamese text …)
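As a quick standalone check (our own sketch, not from the original notebook), you can call it directly on a messy string:

print(apply_vnmese_word_tokenize('Hội cư dân   chung cư sen hồng- chung cư    lotus', normalize_text=True))
# extra spaces should be collapsed and 'hồng-' separated, as in the transformed sentence shown below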

from functools import partial
inp = 'Hội cư dân   chung cư sen hồng- chung cư    lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh'
two_steps_tokenization_explain(inp,_tokenizer,content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)])
        ------- Text Transformation Explained -------
----- Raw sentence -----
Hội cư dân   chung cư sen hồng- chung cư    lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh

----- Content Transformations (on both train and test) -----
--- apply_vnmese_word_tokenize ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh


----- Augmentations (on train only) -----

        ------- Tokenizer Explained -------
----- Input -----
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh

----- Tokenized results ----- 
{'input_ids': [0, 792, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 5, 5043, 8, 16, 149, 2850, 214, 784, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'Hội', 'cư_dân', 'chung_cư', 'sen', 'hồng', '-', 'chung_cư', 'lo@@', 'tus', 'sóng_thần', 'thủ_@@', 'đức', '.', 'Thủ_Đức', 'là', 'một', 'huyện', 'trực_thuộc', 'thành_phố', 'Hồ_Chí_Minh', '</s>']

----- Results from tokenizer.decode ----- 
<s> Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức. Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh </s>

Let’s add some text augmentations

import unidecode
# to remove vietnamese accent
remove_accent = lambda x: unidecode.unidecode(x)

If you want your function to be printed with a different name:

remove_accent.__name__ = 'Remove Vietnamese Accent'
two_steps_tokenization_explain(inp,_tokenizer,
                               content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)],
                               aug_tfms=[remove_accent]
                              )
        ------- Text Transformation Explained -------
----- Raw sentence -----
Hội cư dân   chung cư sen hồng- chung cư    lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh

----- Content Transformations (on both train and test) -----
--- apply_vnmese_word_tokenize ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh


----- Augmentations (on train only) -----
--- Remove Vietnamese Accent ---
Hoi cu_dan chung_cu sen hong - chung_cu lotus song_than thu_duc . Thu_Duc la mot huyen truc_thuoc thanh_pho Ho_Chi_Minh


        ------- Tokenizer Explained -------
----- Input -----
Hoi cu_dan chung_cu sen hong - chung_cu lotus song_than thu_duc . Thu_Duc la mot huyen truc_thuoc thanh_pho Ho_Chi_Minh

----- Tokenized results ----- 
{'input_ids': [0, 3021, 1111, 56549, 17386, 22975, 13689, 3330, 27037, 31, 22975, 13689, 2029, 4885, 3227, 9380, 1510, 21605, 6190, 1894, 5, 5770, 4098, 1894, 2644, 3773, 1204, 18951, 2052, 10242, 9835, 1881, 22899, 17366, 10384, 30234, 8470, 1612, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'Ho@@', 'i', 'cu_@@', 'dan', 'chung_@@', 'cu', 'sen', 'hong', '-', 'chung_@@', 'cu', 'lo@@', 'tus', 'so@@', 'ng_th@@', 'an', 'thu_@@', 'du@@', 'c', '.', 'Thu_@@', 'Du@@', 'c', 'la', 'mo@@', 't', 'huy@@', 'en', 'tru@@', 'c_th@@', 'u@@', 'oc', 'thanh_@@', 'pho', 'Ho_@@', 'Chi_@@', 'Minh', '</s>']

----- Results from tokenizer.decode ----- 
<s> Hoi cu_dan chung_cu sen hong - chung_cu lotus song_than thu_duc. Thu_Duc la mot huyen truc_thuoc thanh_pho Ho_Chi_Minh </s>

You can even be creative with your augmentation functions; let’s say you only want your augmentation to be applied 50% of the time:

import random
random.seed(2) # for reproducibility
remove_accent = lambda x: unidecode.unidecode(x) if random.random()<0.5 else x
remove_accent.__name__ = 'Remove Vietnamese Accent with 0.5 prob'
two_steps_tokenization_explain(inp,_tokenizer,
                               content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)],
                               aug_tfms=[remove_accent]
                              )
        ------- Text Transformation Explained -------
----- Raw sentence -----
Hội cư dân   chung cư sen hồng- chung cư    lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh

----- Content Transformations (on both train and test) -----
--- apply_vnmese_word_tokenize ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh


----- Augmentations (on train only) -----
--- Remove Vietnamese Accent with 0.5 prob ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh


        ------- Tokenizer Explained -------
----- Input -----
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh

----- Tokenized results ----- 
{'input_ids': [0, 792, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 5, 5043, 8, 16, 149, 2850, 214, 784, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'Hội', 'cư_dân', 'chung_cư', 'sen', 'hồng', '-', 'chung_cư', 'lo@@', 'tus', 'sóng_thần', 'thủ_@@', 'đức', '.', 'Thủ_Đức', 'là', 'một', 'huyện', 'trực_thuộc', 'thành_phố', 'Hồ_Chí_Minh', '</s>']

----- Results from tokenizer.decode ----- 
<s> Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức. Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh </s>

There are more examples of interesting augmentations here
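As one more illustration (our own sketch, not taken from those examples), here is a hypothetical augmentation that randomly drops roughly 10% of the words in a sentence:

import random
from functools import partial

def random_word_drop(x, p=0.1):
    # keep each word with probability 1-p; fall back to the original text if everything gets dropped
    words = x.split()
    kept = [w for w in words if random.random() > p]
    return ' '.join(kept) if kept else x

random_word_drop.__name__ = 'Random Word Drop (10%)'
two_steps_tokenization_explain(inp,_tokenizer,
                               content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)],
                               aug_tfms=[random_word_drop]
                              )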

2. Tokenize Function


source

tokenize_function

 tokenize_function (text, tok, max_length=None, is_split_into_words=False,
                    return_tensors=None, return_special_tokens_mask=False)

This is a wrapper around HuggingFace’s tokenizer: it tokenizes and pads your input text, getting it ready for your NLP model

I will reuse PhoBert’s tokenizer to demonstrate the functionality of this function. For more information about this tokenizer: https://huggingface.co/vinai/phobert-base

_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert_preprocess = partial(apply_vnmese_word_tokenize,normalize_text=True)
tokenize_function(phobert_preprocess('hội cư dân chung cư sen hồng - chung cư lotus sóng thần thủ đức'),
                  _tokenizer,max_length=512)
{'input_ids': [0, 1093, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
_inp = ['hội cần mở thẻ tín dụng tại hà nội, đà nẵng, tp. hồ chí minh',"biti's cao lãnh - đồng tháp"]
_inp = [phobert_preprocess(i) for i in _inp]
_inp
['hội cần mở thẻ_tín_dụng tại hà_nội , đà_nẵng , tp . hồ chí_minh',
 "biti's cao_lãnh - đồng tháp"]
tokenize_function(_inp,_tokenizer,max_length=512)
{'input_ids': [[0, 1093, 115, 548, 10603, 35, 44068, 2151, 4, 62295, 1301, 24931, 4, 1187, 2380, 5, 1005, 43647, 9534, 2], [0, 3907, 2081, 51899, 1118, 10109, 8271, 31, 80, 3186, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

You can change the type of the tokenizer’s outputs, e.g. PyTorch tensors, TensorFlow objects, or NumPy arrays

tokenize_function(_inp,_tokenizer,max_length=512,return_tensors='pt')
{'input_ids': tensor([[    0,  1093,   115,   548, 10603,    35, 44068,  2151,     4, 62295,
          1301, 24931,     4,  1187,  2380,     5,  1005, 43647,  9534,     2],
        [    0,  3907,  2081, 51899,  1118, 10109,  8271,    31,    80,  3186,
             2,     1,     1,     1,     1,     1,     1,     1,     1,     1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
results = tokenize_function(_inp,_tokenizer,max_length=512)
print(_tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
['<s>', 'hội', 'cần', 'mở', 'thẻ_tín_dụng', 'tại', 'hà_@@', 'nội', ',', 'đà_@@', 'n@@', 'ẵng', ',', 't@@', 'p', '.', 'hồ', 'chí_@@', 'minh', '</s>']

You can change max_length (which allows truncation when the sentence length exceeds max_length)

results = tokenize_function(_inp,_tokenizer,
                            max_length=5)
results
{'input_ids': [[0, 1093, 115, 548, 2], [0, 3907, 2081, 51899, 2]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
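The wrapper also exposes is_split_into_words (see the signature above). A small sketch, assuming the flag is forwarded to the underlying HuggingFace tokenizer, for input that is already split into words:

pre_split = phobert_preprocess('hội cư dân chung cư sen hồng').split()
tokenize_function(pre_split,_tokenizer,max_length=512,is_split_into_words=True)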

3. Metadatas Processing


source

concat_metadatas

 concat_metadatas (dset:dict, main_text, metadatas, process_metas=True,
                   sep='.', is_batched=True)

Extract, process (optional) and concatenate metadatas to the front of text

Type Default Details
dset dict HuggingFace Dataset
main_text Text feature name
metadatas Metadata (or a list of metadatas)
process_metas bool True Whether to apply simple metadata processing, i.e. stripping spaces and lowercasing
sep str . Separator, for multiple metadatas concatenation
is_batched bool True Whether batching is applied

This function allows you to concatenate any text metadatas to the front of your main texts. Adding metadatas to your text might help your model utilize extra information for its downstream task
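A minimal, hypothetical sketch of how it can be called on a batched dictionary; the column names and the expected output below are assumptions, modeled on the concatenated texts shown later in this page:

batch = {'Review Text': ['Great fit and fabric', 'Runs small'],
         'Title': ['  Love it ', 'Too Tight']}
out = concat_metadatas(batch, main_text='Review Text', metadatas='Title',
                       process_metas=True, sep='.', is_batched=True)
print(out['Review Text'])
# expected to look roughly like: ['love it . Great fit and fabric', 'too tight . Runs small']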

Class TextDataController


source

TextDataController

 TextDataController (inp, main_text:str, label_names=[], sup_types=[],
                     class_names_predefined=[], filter_dict={},
                     label_tfm_dict={}, metadatas=[], process_metas=True,
                     metas_sep='.', content_transformations=[],
                     val_ratio:int|float|None=0.2, stratify_cols=[],
                     upsampling_list={}, content_augmentations=[],
                     seed=None, batch_size=1024, num_proc=4,
                     cols_to_keep=None, verbose=True)

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
inp HuggingFace Dataset or DatasetDict
main_text str Name of the main text column
label_names list [] Names of the label (dependent variable) columns
sup_types list [] Type of supervised learning for each label name (‘classification’ or ‘regression’)
class_names_predefined list [] List of names associated with the labels (same index order). Use empty list for regression
filter_dict dict {} A dictionary: {feature: filtering_function_for_that_feature}
label_tfm_dict dict {} A dictionary: {label_name: transform_function_for_that_label}
metadatas list [] Names of the metadata columns
process_metas bool True Whether to do simple text processing on the chosen metadatas
metas_sep str . Separator, for multiple metadatas concatenation
content_transformations list [] A list of text transformations
val_ratio int | float | None 0.2 Ratio of data for validation set
stratify_cols list [] Column(s) needed to do stratified shuffle split
upsampling_list dict {} A list of tuples. Each tuple: (feature, upsampling_function_based_on_the_feature)
content_augmentations list [] A list of text augmentations
seed NoneType None Random seed
batch_size int 1024 CPU batch size
num_proc int 4 Number of processes for multiprocessing
cols_to_keep NoneType None Columns to keep after all processing
verbose bool True Whether to print processing information
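Several of these arguments (filter_dict, metadatas, content_transformations, …) are demonstrated later in this page. For two that are not, here is a hypothetical sketch using the Womens_Clothing_Reviews DataFrame loaded in the next section; the assumption that each label_tfm_dict callable receives a single label value mirrors how filter_dict works:

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 # assumed per-value transform: merge the rare 'Trend' class into 'Tops'
                                 label_tfm_dict={'Department Name': lambda x: 'Tops' if x=='Trend' else x},
                                 # stratify the train/validation split on the label column
                                 stratify_cols=['Department Name'],
                                 seed=42
                                )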

source

TextDataController.do_all_preprocessing

 TextDataController.do_all_preprocessing (shuffle_trn=True,
                                          check_val_leak=True)
Type Default Details
shuffle_trn bool True To shuffle the train set before tokenization
check_val_leak bool True To check for (and remove) training data leaked into the validation set

source

TextDataController.do_tokenization

 TextDataController.do_tokenization (tokenizer, max_length=None,
                                     trn_size=None, tok_num_proc=None)
Type Default Details
tokenizer Tokenizer (preferably from HuggingFace)
max_length NoneType None pad to model’s allowed max length (default is max_sequence_length). Use -1 for no padding at all
trn_size NoneType None The number of training data to be tokenized
tok_num_proc NoneType None Number of processes for tokenization

source

TextDataController.process_and_tokenize

 TextDataController.process_and_tokenize (tokenizer, max_length=None,
                                          trn_size=None,
                                          tok_num_proc=None,
                                          shuffle_trn=True,
                                          check_val_leak=True)

This will perform do_all_preprocessing, then do_tokenization

Type Default Details
tokenizer Tokenizer (preferably from HuggingFace)
max_length NoneType None pad to model’s allowed max length (default is max_sequence_length)
trn_size NoneType None The number of training data to be tokenized
tok_num_proc NoneType None Number of processes for tokenization
shuffle_trn bool True To shuffle the train set before tokenization
check_val_leak bool True To check for (and remove) training data leaked into the validation set

1. Load data + Basic use case


source

TextDataController.from_csv

 TextDataController.from_csv (file_path, **kwargs)

source

TextDataController.from_df

 TextDataController.from_df (df, validate=True, **kwargs)

You can create a TextDataController from a CSV file, a pandas DataFrame, or directly from a HuggingFace dataset object. Currently, TextDataController is designed for text classification and text regression, as we will explore in this documentation

We will load some sample data to prepare for a classification task: predicting which Department Name a comment (Review Text) belongs to

Dataset source: https://www.kaggle.com/datasets/kavita5/review_ecommerce

import pandas as pd
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df.shape
(23486, 10)
df.sample(5)
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name
15253 831 31 Great work top I snagged this top with the 25% off sale and i... 5 1 0 General Petite Tops Blouses
1254 850 49 Flattering, comfy top Everyone has said it, so i'll just add my two ... 4 1 0 General Petite Tops Blouses
5105 824 38 Adore this top! Saw this one online and when it came it did no... 5 1 2 General Petite Tops Blouses
8611 920 29 Great spring sweater This sweater is classy and comfortable. it has... 4 1 0 General Tops Fine gauge
17574 1110 37 Super cute! I'm not sure why the other reviewers think tha... 5 1 6 General Dresses Dresses

You can create a TextDataController from a dataframe. This also provides a quick input validation check (NaN check and Duplication check)

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows

You can also create a TextDataController directly from a CSV file. The good thing about using HuggingFace Dataset as the main backend of the TextDataController is that you can utilize much of its useful functionality, such as caching

tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  sup_types='classification',
                                  label_names='Department Name',
                                 )

You can also create a TextDataController from a HuggingFace Dataset

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
dset
Dataset({
    features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
    num_rows: 23486
})
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         seed=42
                        )

In the “Input Validation Precheck” above, we notice that our dataset has missing values in both the text field and the label field. For now, let’s load the data as a pandas DataFrame, perform some cleaning, and create our TextDataController

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df = df[(~df['Review Text'].isna()) & (~df['Department Name'].isna())].reset_index(drop=True)
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title    2966
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 1 rows

At this point you can start performing two important steps on your data

  1. Text preprocessings, Label Encoding, Train/Validation Split
  2. Tokenization

We haven’t provided any preprocessing to the TextDataController; we will see how to use each preprocessing option (step by step) as we progress

ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label'],
        num_rows: 18099
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label'],
        num_rows: 4526
    })
})

Our DatasetDict now has two splits: train and validation. Note that the train split is now an IterableDataset, for processing efficiency

ddict['train'][:3]
{'Review Text': ['I wanted to love this. i have been looking for a poncho-type sweater for our cold midwestern winters. the cream colored one that i really wanted sold out instantly and i missed the window for the xxsp. i ordered this in the xs/sp (smallest size available). i am 5\'1" and 108 lbs with small shoulders. the neck opening is huge. my collar bones and a seciton of my upper back were exposed. this would not keep me warm due to so much exposed skin on my neck, back, and shoulders. i suppose i could get a',
  'Love the movement of the blouse and how it falls. great quality material.',
  "Loved these beach pants! i purchased the size medium in the coral. i loved the accents on the ties and the little pom pom details. i did get many compliments on them. the only thing i don't love about them is the material is very thin. i know they are beach pants but i personally would have liked slightly more weight to them. i wore them once with a pair of cropped leggings underneath and i thought it was a very cute way to wear them with some additional substance underneath."],
 'Department Name': ['Tops', 'Tops', 'Intimate'],
 'label': [4, 4, 2]}
ddict['validation'][:3]
{'Review Text': ["The raspberry color is really stunning! i have been looking for colored tights for a while and had difficulty finding really rich colors. i was thrilled when i saw these! i've worn them once so far. very comfortable and seem like they will last.",
  'I just received this dress and i feel like a goddess in it! it is perfect for graduations, weddings, romantic dinners, tropical va cations....heck, i\'ll wear it to the grocery store! i love it that much!\r\n\r\ni am 5\'7" with a 34c bust.....i have this dress in a size 4 and it fits very well. this dress is slim cut from the shoulder down to the waist. the dress length hits me at the lower calf, just like the model online. i think the armholes are cut a little high...this being said; this dress would',
  "When i saw this top online, i thought i'd love it and immediately ordered it in both colors.  they arrived today and i am soooo disappointed.  i have never seen such drab colors.  the blue is a muddy grayish hue (like an overcast day) and the pink is a dusty shade of peach.  yuck.  and i was hoping the ruffle at the bottom would have a chiffon-like flowy effect.  instead, the ruffle is made of a cheap looking knit.  back these go..."],
 'Department Name': ['Intimate', 'Dresses', 'Tops'],
 'label': [2, 1, 4]}

Now we can start with the tokenization

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
ddict = tdc.do_tokenization(tokenizer,max_length=512)
-------------------- Tokenization --------------------
Done
ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18099
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})
print(ddict['train'][0]['input_ids'][:150])
[0, 100, 770, 7, 657, 42, 4, 939, 33, 57, 546, 13, 10, 181, 261, 11156, 12, 12528, 23204, 13, 84, 2569, 1084, 16507, 31000, 4, 5, 6353, 20585, 65, 14, 939, 269, 770, 1088, 66, 11764, 8, 939, 2039, 5, 2931, 13, 5, 37863, 4182, 4, 939, 2740, 42, 11, 5, 3023, 29, 73, 4182, 36, 23115, 990, 1836, 577, 322, 939, 524, 195, 108, 134, 113, 8, 13955, 23246, 19, 650, 10762, 4, 5, 5397, 1273, 16, 1307, 4, 127, 19008, 12396, 8, 10, 15636, 24899, 9, 127, 2853, 124, 58, 4924, 4, 42, 74, 45, 489, 162, 3279, 528, 7, 98, 203, 4924, 3024, 15, 127, 5397, 6, 124, 6, 8, 10762, 4, 939, 19792, 939, 115, 120, 10, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(ddict['validation'][0]['input_ids'][:150])
[0, 133, 41345, 3195, 16, 269, 5835, 328, 939, 33, 57, 546, 13, 20585, 326, 6183, 13, 10, 150, 8, 56, 9600, 2609, 269, 4066, 8089, 4, 939, 21, 8689, 77, 939, 794, 209, 328, 939, 348, 10610, 106, 683, 98, 444, 4, 182, 3473, 8, 2045, 101, 51, 40, 94, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

You can combine Text Processing and Tokenization with a single method call

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name'
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title    2966
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 1 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 4, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done

You can access the DatasetDict from the instance variable main_ddict

tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18098
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})

This DatasetDict is ready to be put into any HuggingFace text model.
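For example, here is a minimal sketch (not part of this library) of feeding it to a standard HuggingFace Trainer; the training arguments are placeholders, and num_labels is taken from tdc.label_lists, which is described in the Label Encodings section below:

from transformers import AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained('roberta-base',
                                                           num_labels=len(tdc.label_lists[0]))
training_args = TrainingArguments(output_dir='tmp_trainer',
                                  num_train_epochs=1,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=tdc.main_ddict['train'],
                  eval_dataset=tdc.main_ddict['validation'],
                  data_collator=DataCollatorWithPadding(tokenizer))
# trainer.train()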

2. Filtering

This preprocessing step allows you to filter out certain values in a certain column of your dataset. Let’s say I want to filter out any None value in the column ‘Review Text’

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df[(~df['Review Text'].isna())].isna().sum()
Clothing ID                   0
Age                           0
Title                      2966
Review Text                   0
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                13
Department Name              13
Class Name                   13
dtype: int64

We will provide a dictionary mapping the name of a column to the filtering function to apply to that column. Note that the filtering function receives a single item from the column and should return a boolean

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None},
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18111
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4529
    })
})

Let’s check if we have filtered out all NaN/None values

for i in tdc.main_ddict['train']['Review Text']:
    assert i is not None
for i in tdc.main_ddict['validation']['Review Text']:
    assert i is not None

We can even add multiple filtering functions. Remember from our precheck that there are also None values in our label column ‘Department Name’. While we are at it, let’s also filter out any rating that is less than 3 (just to showcase what our filtering can do)

df.Rating.value_counts()
Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

Note that TextDataController will only keep the text, label, and metadata columns; any other column will be dropped. To keep ‘Rating’, we need to define the cols_to_keep argument

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              'Rating': lambda x: x>=3
                                             },
                                 cols_to_keep=['Review Text','Rating','Department Name'],
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Rating -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
for i in tdc.main_ddict['train']['Department Name']:
    assert i is not None

for i in tdc.main_ddict['validation']['Department Name']:
    assert i is not None
    
for i in tdc.main_ddict['validation']['Rating']:
    assert i >= 3

3. Taking a sample from training data

If you only want to extract a sample of your training data, you can use the trn_size argument of the method process_and_tokenize (or do_tokenization). Since we use sharding to extract the sample from a DatasetDict, if trn_size is an integer, an approximate size will be returned
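For example, a rough sketch of why the size is approximate (the shard arithmetic below is our own illustration, not necessarily the library’s exact internals):

from datasets import Dataset

train = Dataset.from_dict({'text': [f'row {i}' for i in range(18099)]})
num_shards = len(train) // 1000                      # 18 shards for trn_size=1000
sample = train.shard(num_shards=num_shards, index=0)
print(len(sample))                                   # ~1006 rows, not exactly 1000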

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True,trn_size=1000)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1006
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True,trn_size=0.1) # return 10% of the data
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1810
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})

4. Metadatas concatenation

If we think metadatas can be helpful, we can concatenate them to the front of the main text, so that our text classification model is aware of them.

In this example, let’s add ‘Title’ as our metadata

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 metadatas='Title',
                                 process_metas=True, # to preprocess the metadata (currently it's just empty space stripping and lowercasing),
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:5]
['beautiful! . I love this top. it was everything i hoped it would be. it is lined so it is not see through in the chest/back; sleeves are sheer. soft. gorgeous color. love the layers. runs large so definitely size down. i am usually a m and ordered the s. i\'m 5\'8" curvy 32dd',
 'very flattering . This dress fits to a t! true to size. very flattering. fabric is soft and comfortable.',
 "the worst . I don't typically write bad reviews, but this dress is so bad and i want to save someone else from buying it. i read the mostly bad reviews and still purchased anyway (my fault i know). the dress is super stiff ( i know denim can be that way and it is possible it would soften up after a few washes). i'm typically a 6/8 and the size small swallowed me, and the xs was big everywhere except through the bust (i ordered both sizes to try). i wouldn't recommend buying this if you are a size 8 or small",
 "love this jacket! . I was on the lookout for a denim jacket when i found this beauty on line. i fell in love immediately and didn't think twice about paying full price. i wear it with moss green chinos and it looks really good. the little dots in the jacket are actually a pale green, which gives it extra character. very well made. i was a bit skeptical about the hook and eye fastenings, but they are very secure. \r\n\r\ni ordered my usual xl and found it roomy enough in the bust and arms. i would definitely call it tru",
 'great spring/summer dress. . I am excited for spring so i can wear this. i purchased the orange. it is actually more of a red, but i like it. colorful and flattering fit.']
tdc.main_ddict['validation']['Review Text'][:5]
[' . Such a fun jacket! great to wear in the spring or to the office as an alternative to a formal blazer. very comfortable!',
 'simple and elegant . I thought this shirt was really feminine and elegant. only downsides is some of the punched out holes had fabric still attached which you have cut off with scissors- otherwise the shirt will snag. and the second issue of bigger importance are the low armholes. lots of bra showing- not really sure how to get around that so i always wear it with a cardigan. but it would be nice not to have to. \r\nother than that it looks nice and pairs nicely with almost anything.',
 'retro and pretty . This top has a bit of a retro flare but so adorable on. looks really cute with a pair of faded boot cut jeans.',
 'summer/fall wear . I first spotted this on an retailer employee, she paired it with a peasant top & wore it open w/jeans & boots- so darn cute. love how this peice transitions from summer to fall. i\'m 5\'4" so i had to order the small petite which is perfect. note that this dress is very long! it\'s just a must have garment. the colors/ print are just beautiful.',
 "perfect except slip . This is my new favorite dress! my only complaint is the slip is too small and the dress cannot be worn without it. i can't order a size up as the dress would then be huge. not sure what the solution is but the dress itself is stunning."]

You can add multiple metadatas. Let’s say ‘Division Name’ is the second metadata.

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 metadatas=['Title','Division Name'],
                                 process_metas=True,
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:5]
["general petite . meh . This tunic is way over priced for the style and quality. it fit comfortably (runs a size larger) but it's not really flattering, it jut kind of hangs there looking ok. it is a little too deep of a v cut for a work top as well. this top does not support the price at all. it felt like something i could find at department store for way less. i will be returning it.",
 'general . awesome buy! . I am so happy i took a chance on this jumpsuit! i am post-baby (six weeks) and although i intend on slimming down more i would say that it is flattering even at my current size. and it will only get better! \r\nthe quality and color are great!',
 'general . warm grey . These are a lovely neutral to slightly warm grey pair of jeans from a great line. i wore my usual size without issues.',
 "general . loved it, but it didn't work for me. . I wanted this top to work so bad. unfortunately the way the bust of the top is designed it isn't flattering if you aren't flat chested. it squishes on side of your chest and leaves the other side alone. i'm a b cup and had this problem so if you are a b or larger, i don't recommend. however, if you are smaller busted, this piece would be worth the purchase.",
 "general . varying feelings and opinions . As you can see, there is an array of differing opinions on here, and i share sentiments on both:\r\n_______\r\npros:\r\n- the texture and feel of this is great; it is very comfortable and is different.\r\n- tts for the most part; i normally can wear sizes 10 and 12 (m and l) with most retailer and got the medium and the fit was overall fine but more snug at the hips. if you're more slim/straight, it'll probably fit you like on the model. \r\n- good length, not too short or too long.\r\n- the mock collar is ni"]
tdc.main_ddict['validation']['Review Text'][:5] # The metadata for this text is None
['general petite .  . Such a fun jacket! great to wear in the spring or to the office as an alternative to a formal blazer. very comfortable!',
 'general petite . simple and elegant . I thought this shirt was really feminine and elegant. only downsides is some of the punched out holes had fabric still attached which you have cut off with scissors- otherwise the shirt will snag. and the second issue of bigger importance are the low armholes. lots of bra showing- not really sure how to get around that so i always wear it with a cardigan. but it would be nice not to have to. \r\nother than that it looks nice and pairs nicely with almost anything.',
 'general . retro and pretty . This top has a bit of a retro flare but so adorable on. looks really cute with a pair of faded boot cut jeans.',
 'general petite . summer/fall wear . I first spotted this on an retailer employee, she paired it with a peasant top & wore it open w/jeans & boots- so darn cute. love how this peice transitions from summer to fall. i\'m 5\'4" so i had to order the small petite which is perfect. note that this dress is very long! it\'s just a must have garment. the colors/ print are just beautiful.',
 "general petite . perfect except slip . This is my new favorite dress! my only complaint is the slip is too small and the dress cannot be worn without it. i can't order a size up as the dress would then be huge. not sure what the solution is but the dress itself is stunning."]

5. Label Encodings

Single-head prediction

We have briefly gone through the simplest case of label encoding, where we only need to predict a single label. We call this single-head classification

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',                         
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done

All label names will be saved in the instance variable label_lists

tdc.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]

… and all labels will be encoded

tdc.main_ddict['validation']['label'][:5]
[2, 4, 4, 1, 1]

We also keep the original labels, for reference

tdc.main_ddict['validation']['Department Name'][:5]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']

You can also do single-head regression. Let’s say we want to predict Rating

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
dset
Dataset({
    features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
    num_rows: 23486
})
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='regression',
                         label_names='Rating',
                         filter_dict={'Review Text': lambda x: x is not None},
                         seed=42,
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
print(tdc.main_ddict['train']['Rating'][:5])
print(tdc.main_ddict['train']['label'][:5])
[3.0, 1.0, 4.0, 3.0, 5.0]
[3.0, 1.0, 4.0, 3.0, 5.0]
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['label'][:5])
[5.0, 4.0, 3.0, 5.0, 5.0]
[5.0, 4.0, 3.0, 5.0, 5.0]

Multi-head prediction

What if we need to predict two different labels at once? We call this multi-head classification/regression. For example, let’s define our dataset so that we need to predict both Department Name and Division Name (both as classification)

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Division Name','Department Name'],
                         sup_types=['classification','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.label_lists
[['General', 'General Petite', 'Initmates'],
 ['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]

We can see that we have two lists: one with the label names of Division Name, and one with the label names of Department Name

tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18099
    })
    validation: Dataset({
        features: ['Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})
print(tdc.main_ddict['validation']['Division Name'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
['General Petite', 'General Petite', 'General', 'General Petite', 'General Petite']
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[1, 2], [1, 4], [0, 4], [1, 1], [1, 1]]

What if one label is classification, and another label is regression? We will predict Department Name (classification) and Rating (regression)

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Rating','Department Name'],
                         sup_types=['regression','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
[5.0, 5.0, 5.0, 5.0, 4.0]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[5.0, 2.0], [5.0, 4.0], [5.0, 4.0], [5.0, 1.0], [4.0, 1.0]]

Since it’s multi-head, you can define multiple classification/regression labels, as many as you want

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Division Name','Rating','Department Name'],
                         sup_types=['classification','regression','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                      'Division Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Division Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
print(tdc.main_ddict['train']['Division Name'][:5])
print(tdc.main_ddict['train']['Rating'][:5])
print(tdc.main_ddict['train']['Department Name'][:5])
print(tdc.main_ddict['train']['label'][:5])
['General Petite', 'General Petite', 'General', 'General', 'General Petite']
[4.0, 4.0, 5.0, 3.0, 5.0]
['Tops', 'Tops', 'Tops', 'Tops', 'Dresses']
[[1.0, 4.0, 4.0], [1.0, 4.0, 4.0], [0.0, 5.0, 4.0], [0.0, 3.0, 4.0], [1.0, 5.0, 1.0]]
print(tdc.main_ddict['validation']['Division Name'][:5])
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
['General Petite', 'General Petite', 'General', 'General Petite', 'General Petite']
[5.0, 5.0, 5.0, 5.0, 4.0]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[1.0, 5.0, 2.0], [1.0, 5.0, 4.0], [0.0, 5.0, 4.0], [1.0, 5.0, 1.0], [1.0, 4.0, 1.0]]

Multi-label classification

Lastly, let's define a multi-label classification, where a text can have one or more labels. Our data doesn't come with such labeling, so we will create a new label column just for demonstration.

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].unique()
array(['Intimate', 'Dresses', 'Bottoms', 'Tops', 'Jackets', 'Trend', nan],
      dtype=object)
df['Fake Label'] = [np.random.choice(df['Department Name'].unique()[:-1],size=np.random.randint(2,6),replace=False) for _ in range(len(df))]
df.head()
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name Fake Label
0 767 33 NaN Absolutely wonderful - silky and sexy and comf... 4 1 0 Initmates Intimate Intimates [Intimate, Dresses, Trend, Bottoms]
1 1080 34 NaN Love this dress! it's sooo pretty. i happene... 5 1 4 General Dresses Dresses [Trend, Intimate]
2 1077 60 Some major design flaws I had such high hopes for this dress and reall... 3 0 0 General Dresses Dresses [Intimate, Dresses, Bottoms, Trend]
3 1049 50 My favorite buy! I love, love, love this jumpsuit. it's fun, fl... 5 1 0 General Petite Bottoms Pants [Intimate, Bottoms]
4 847 47 Flattering shirt This shirt is very flattering to all due to th... 5 1 6 General Tops Blouses [Trend, Bottoms, Dresses, Intimate, Jackets]

You don't have to add any extra argument; the controller will determine whether this is multi-label classification based on the format of the label values

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 filter_dict={'Review Text': lambda x: x is not None},
                                 label_names='Fake Label',
                                 sup_types='classification',
                                 seed=42,
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Fake Label', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18111
    })
    validation: Dataset({
        features: ['Review Text', 'Fake Label', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4529
    })
})
tdc.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
tdc.main_ddict['validation']['Fake Label'][2]
['Trend', 'Intimate', 'Bottoms', 'Dresses']

Since this is multi-label classification, the label is one-hot encoded into a binary vector, with one position per class in the label list

tdc.main_ddict['validation']['label'][2]
[1, 1, 1, 0, 0, 1]
tdc.main_ddict['validation']['label'][:5]
[[0, 1, 1, 0, 1, 0],
 [0, 1, 1, 1, 1, 0],
 [1, 1, 1, 0, 0, 1],
 [1, 1, 0, 1, 0, 0],
 [0, 1, 1, 1, 0, 1]]
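
To recover the label names from such a vector, you can zip it with tdc.label_lists[0]. A minimal sketch (not a built-in method):

# Hypothetical sketch: map a one-hot label vector back to its label names
onehot = tdc.main_ddict['validation']['label'][2]                 # [1, 1, 1, 0, 0, 1]
names = [n for n, flag in zip(tdc.label_lists[0], onehot) if flag]
print(names)                                                      # expected: ['Bottoms', 'Dresses', 'Intimate', 'Trend']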

No label

If you don’t have a label to define, leave all label-related arguments blank

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'input_ids', 'attention_mask'],
        num_rows: 18111
    })
    validation: Dataset({
        features: ['Review Text', 'input_ids', 'attention_mask'],
        num_rows: 4529
    })
})

6. Label transformation

Sometimes you want to apply a light transformation to your label(s) before label encoding, e.g. when there are typos in your string labels (classification), or when you want to scale your regression label. TextDataController provides a way to do this via the label_tfm_dict argument. For the following example, I will fix the typo 'Initmates' in the Division Name label, and log-scale the Rating

import math
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Division Name','Rating','Department Name'],
                         sup_types=['classification','regression','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                      'Division Name': lambda x: x is not None,
                                     },
                         label_tfm_dict={'Division Name': lambda x: x if x!='Initmates' else 'Intimates',
                                         'Rating': lambda x: math.log(x)+1},
                         seed=42,
                         num_proc=1
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Division Name -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done

Notice that in the label_lists, the label ‘Initmates’ has been replaced by ‘Intimates’

Also, the second list corresponds to Rating; since it is a regression label, there is nothing to encode, hence the empty list

tdc.label_lists
[['General', 'General Petite', 'Intimates'],
 [],
 ['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
print(tdc.main_ddict['train']['Division Name'][:5])
print(tdc.main_ddict['train']['Rating'][:5])
print(tdc.main_ddict['train']['Department Name'][:5])
print(tdc.main_ddict['train']['label'][:5])
['General Petite', 'General Petite', 'General', 'General', 'General Petite']
[2.386294361119891, 2.386294361119891, 2.6094379124341005, 2.09861228866811, 2.6094379124341005]
['Tops', 'Tops', 'Tops', 'Tops', 'Dresses']
[[1.0, 2.386294361119891, 4.0], [1.0, 2.386294361119891, 4.0], [0.0, 2.6094379124341005, 4.0], [0.0, 2.09861228866811, 4.0], [1.0, 2.6094379124341005, 1.0]]
print(tdc.main_ddict['validation']['Division Name'][:5])
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
['General Petite', 'General Petite', 'General', 'General Petite', 'General Petite']
[2.6094379124341005, 2.6094379124341005, 2.6094379124341005, 2.6094379124341005, 2.386294361119891]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[1.0, 2.6094379124341005, 2.0], [1.0, 2.6094379124341005, 4.0], [0.0, 2.6094379124341005, 4.0], [1.0, 2.6094379124341005, 1.0], [1.0, 2.386294361119891, 1.0]]
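
Keep in mind that a model trained on the transformed Rating will also predict on that scale. Since the transformation above is log(x)+1, you can invert it with exp(y-1); a minimal sketch, assuming the tdc object from this example:

import math
# Hypothetical sketch: invert the Rating transformation (log(x)+1 -> exp(y-1))
transformed = tdc.main_ddict['validation']['Rating'][0]           # e.g. 2.6094...
print(math.exp(transformed - 1))                                  # expected: 5.0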

7. Content Transformation

This processing allows you to alter the text content in your dataset. You need to define a function that accepts a single string and returns a new, processed string. Note that this transformation is applied to ALL of your data (both train and validation)

Let’s say we want to normalize our text, because the text might contain some extra spaces between words, or not follow the “single space after a period” rule

_tmp = "This is a      sentence,which doesn't follow any rule!No single space is provided after period or punctuation marks.    Maybe there are too many spaces!?!   "
from underthesea import text_normalize
text_normalize(_tmp)
"This is a sentence , which doesn't follow any rule ! No single space is provided after period or punctuation marks . Maybe there are too many spaces ! ? !"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_transformations=text_normalize,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][0]
"This sweater is beautiful , but is definitely more for looks than warmth . it's very soft , but very thin . i prefer the way it looks open rather than buttoned . i got the moss green color on sale , and i am glad i didn't pay full price for it--it ' s lovely , but certainly not worth $ 88 ."
tdc.main_ddict['validation']['Review Text'][0]
'Such a fun jacket ! great to wear in the spring or to the office as an alternative to a formal blazer . very comfortable !'

You can chain multiple functions. Let's say that after text normalization, I want to lowercase the text

str.lower('tHis IS NoT lowerCASE')
'this is not lowercase'
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_transformations=[text_normalize,str.lower],
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][0]
"this sweater is beautiful , but is definitely more for looks than warmth . it's very soft , but very thin . i prefer the way it looks open rather than buttoned . i got the moss green color on sale , and i am glad i didn't pay full price for it--it ' s lovely , but certainly not worth $ 88 ."
tdc.main_ddict['validation']['Review Text'][0]
'such a fun jacket ! great to wear in the spring or to the office as an alternative to a formal blazer . very comfortable !'

You can even perform more complex transformations, such as removing text inside parentheses, or replacing certain text via a pattern (which is doable using regular expressions). Let's build an example of such a transformation, where we remove text inside parentheses and convert any hashtag into the string 'hashtag'

import re
def process_text(s):
    # Remove texts inside parentheses
    s = re.sub(r'\(.*?\)', '', s)
    
    # Convert any hashtag into the string 'hashtag'
    s = re.sub(r'#\w+', 'hashtag', s)
    
    return s.strip()
process_text("#Promotions There's no way it works (I checked!), however it surprises me #howonearth #mindblowing")
"hashtag There's no way it works , however it surprises me hashtag hashtag"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_transformations=process_text,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- process_text -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done

8. Train/Validation Split

There are several ways to perform a train/validation split with TextDataController

The first way applies when you already have a validation split in your HuggingFace Dataset. Let's use the built-in Dataset.train_test_split method to simulate this

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1)
# This will create a 'test' split instead of 'validation', so we will process a bit to have a validation split
ddict_with_val['validation']=ddict_with_val['test']
del ddict_with_val['test']
ddict_with_val
DatasetDict({
    train: Dataset({
        features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
        num_rows: 21137
    })
    validation: Dataset({
        features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
        num_rows: 2349
    })
})
tdc = TextDataController(ddict_with_val,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split already exists
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.00% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 20374
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2253
    })
})

A second way is to split randomly, based on either a ratio (a float between 0 and 1) or the number of samples you want in your validation set (an integer)

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         val_ratio=0.15,
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
print(tdc.main_ddict)
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 19231
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 3395
    })
})
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         val_ratio=5000,
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
print(tdc.main_ddict)
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 17624
    })
    validation: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})

A third way is to do a random stratified split (inspired by sklearn’s). Let’s do a stratified split based on our label ‘Department Name’

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].value_counts(normalize=True)
Department Name
Tops        0.445978
Dresses     0.269214
Bottoms     0.161852
Intimate    0.073918
Jackets     0.043967
Trend       0.005070
Name: proportion, dtype: float64
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts(normalize=True)
Tops        0.444033
Dresses     0.271602
Bottoms     0.161878
Intimate    0.072983
Jackets     0.044309
Trend       0.005193
Name: proportion, dtype: float64
pd.Series(tdc.main_ddict['validation']['Department Name']).value_counts(normalize=True)
Tops        0.444101
Dresses     0.271542
Bottoms     0.161732
Intimate    0.073133
Jackets     0.044189
Trend       0.005303
Name: proportion, dtype: float64

You can also use multiple columns for your stratification

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols=['Department Name','Rating'],
                                 seed=42,
                                 verbose=False
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows

And finally, you can omit any validation split if you specify val_ratio as None

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=None,
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
tdc.main_ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
No validation split defined
Done
-------------------- Dropping unused features --------------------
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
DatasetDict({
    train: Dataset({
        features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 22628
    })
})

9. Upsampling

This is useful when you have an imbalanced dataset and you want to perform some upsampling (oversampling) on the minority class. In TextDataController, you can perform upsampling on any column of the original dataset, and you can even do upsampling on multiple columns at once

Behind the scenes, upsampling consists of two steps: first, a subset of the data is collected based on the filtering condition; then this subset is concatenated back onto the original data. A rough sketch of the idea is shown below
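
Here is what those two steps look like with plain HuggingFace datasets calls (an illustration of the idea only, not the controller's internal implementation):

from datasets import load_dataset, concatenate_datasets
trn = load_dataset('sample_data', data_files=['Womens_Clothing_Reviews.csv'], split='train')
trend_only = trn.filter(lambda x: x['Department Name'] == 'Trend')   # step 1: collect the subset
upsampled = concatenate_datasets([trn, trend_only])                  # step 2: concatenate it back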

df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].sample(frac=0.8).value_counts() 
# fraction 0.8 because we only do upsampling on train data, which is 80% of the total data
Department Name
Tops        8379
Dresses     5044
Bottoms     3037
Intimate    1396
Jackets      831
Trend         92
Name: count, dtype: int64
df['Department Name'].sample(frac=0.8).value_counts(normalize=True)
Department Name
Tops        0.446876
Dresses     0.269372
Bottoms     0.159823
Intimate    0.073601
Jackets     0.044736
Trend       0.005592
Name: proportion, dtype: float64

Let's say I want to upsample 'Trend' by a factor of 2 (i.e. double the amount of 'Trend' data)

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 upsampling_list=[('Department Name',lambda x: x=='Trend')],
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Upsampling data --------------------
----- Do <lambda> on Department Name -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts()
Tops        8037
Dresses     4916
Bottoms     2930
Intimate    1321
Jackets      802
Trend        188
Name: count, dtype: int64
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts(normalize=True)
Tops        0.441739
Dresses     0.270199
Bottoms     0.161042
Intimate    0.072606
Jackets     0.044080
Trend       0.010333
Name: proportion, dtype: float64

The percentage of 'Trend' data in the train set has approximately doubled (note that we filter out some rows with NaN text, so the result is not exactly doubled)

pd.Series(tdc.main_ddict['validation']['Department Name']).value_counts(normalize=True)
Tops        0.444101
Dresses     0.271542
Bottoms     0.161732
Intimate    0.073133
Jackets     0.044189
Trend       0.005303
Name: proportion, dtype: float64

Since upsampling is applied only to the train set, the label distribution in the validation set remains the same

Similarly, you can triple the amount of 'Trend' by repeating the procedure twice. In the following example, I will triple 'Trend' and double 'Jackets'

tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 upsampling_list=[('Department Name',lambda x: x=='Trend'),
                                                  ('Department Name',lambda x: x=='Trend'),
                                                  ('Department Name',lambda x: x=='Jackets')
                                                 ],
                                 # This can be simplified as
#                                  upsampling_list=[('Department Name',lambda x: x=='Trend' or x=='Jackets'),
#                                                   ('Department Name',lambda x: x=='Trend')],
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Upsampling data --------------------
----- Do <lambda> on Department Name -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Department Name -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts()
Tops        8037
Dresses     4916
Bottoms     2930
Jackets     1604
Intimate    1321
Trend        282
Name: count, dtype: int64

A word of warning: upsampling is a slow procedure, as it requires multiple dataset concatenations.

10. Content Augmentation

Similarly to Content Transformation, Content Augmentation allows you to alter the text content in your dataset. You also need to provide a function that accepts a single string and returns a new, processed string. Unlike Content Transformation, which is applied to ALL data, Content Augmentation only applies to your TRAINING data

One of the popular libraries for data augmentation is nlpaug. We will demonstrate how to integrate its augmentation functions into our TextDataController

import nlpaug.augmenter.char as nac
_tmp = "I like my clothes loose fitting but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but was too big overall"
def nlp_aug(x,aug=None):
    results = aug.augment(x)
    if not isinstance(x,list): return results[0]
    return results

Augmentation by replacing characters with nearby ones on the keyboard

aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug,aug=aug)
nearby_aug_func(_tmp)
"I liMe my c;othes loose fitting but even for me this ran large, i am 5 ' 7 134b and medium fit in the shoulders but was too big overa:l"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_augmentations=nearby_aug_func,
                         seed=42,
                         verbose=True
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Text Augmentation --------------------
----- nlp_aug -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:5]
["This sweater is beautiful, but is definitely more for looks hhan warmth. it ' s very soft, but vdry thin. i prefer the way it looks open rather than buttoned. i got the moss green color on sale, and i am glad i diFn ' t pay full price for it - - it ' s lovdly, but certainly not wo%th $ 88.",
 "I ' m a curCy person, so my review might not be suited to everyone. my standard size in retailer tops is xl, and it is the same for this blouse. - overall: overaol gorgeo8s, wWll made blouse but i wish there was less fabric involved and the burnt out design didn ' t make a horizontal stripe across the back and biceps. this blokse just might not work out as well if you are a full figured person. - pros: g(rgeous blousf high quality unique - cons: i wish the burnt out design didj ' t make a hor",
 'This blouse is wonderful. i juwt got and wore the wine Solored blouse today. i received so many compliments. i love it and with the sale price it is so w(rth it.',
 'When i saw this, i ordered i<medistely thinking it was similar to the popular colorblocked stripe sweater from last yea5. the kgit is sfretchy and textured and fee:s like great quality (would wash w$ll ), but it \' s pretty lightweight. the fit is huge. .. could easily size Eown. i \' m 5 \' 7 " 128 # and found the small to be loose everywhere, including the arms. the length was at my knees, and the stripe fell awkwardly across my chest. no idea what i \' d wear this with ev@n if it fit better. sadly, it \' s goinR',
 "This dress is a zillion times cuter in real life. it ' s ver% detro - swingy and girlish - it reminds me of something mia farrow would ' ve worn in her rosemary ' s baby era. i havF the black version and i ' ve paired mine with tall black gladiator Eandals for a more sIltry nighttime lo9k and also flip flops for beachy summer days. i think it ' s a total steal at the sale price."]

Again, since this is Content Augmentation, the validation set is unmodified.

tdc.main_ddict['validation']['Review Text'][:5]
['Such a fun jacket! great to wear in the spring or to the office as an alternative to a formal blazer. very comfortable!',
 'I thought this shirt was really feminine and elegant. only downsides is some of the punched out holes had fabric still attached which you have cut off with scissors- otherwise the shirt will snag. and the second issue of bigger importance are the low armholes. lots of bra showing- not really sure how to get around that so i always wear it with a cardigan. but it would be nice not to have to. \r\nother than that it looks nice and pairs nicely with almost anything.',
 'This top has a bit of a retro flare but so adorable on. looks really cute with a pair of faded boot cut jeans.',
 'I first spotted this on an retailer employee, she paired it with a peasant top & wore it open w/jeans & boots- so darn cute. love how this peice transitions from summer to fall. i\'m 5\'4" so i had to order the small petite which is perfect. note that this dress is very long! it\'s just a must have garment. the colors/ print are just beautiful.',
 "This is my new favorite dress! my only complaint is the slip is too small and the dress cannot be worn without it. i can't order a size up as the dress would then be huge. not sure what the solution is but the dress itself is stunning."]

You can even apply Content Augmentation stochastically, by adding a random condition in your augmentation function

# def nlp_aug_stochastic(x,aug=None,p=0.5):
#     results = aug.augment(x)
#     if not isinstance(x,list): return results[0] if random.random()<p else x
#     return [a if random.random()<p else b for a,b in zip(results,x)]
import random
def nlp_aug_stochastic(x,aug=None,p=0.5):
    if not isinstance(x,list): 
        if random.random()<p: return aug.augment(x)[0]
        return x
    news=[]
    originals=[]
    for _x in x:
        if random.random()<p: news.append(_x)
        else: originals.append(_x)
    # only perform augmentation when needed
    if len(news): news = aug.augment(news)
    return news+originals
aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.3) # nearby_augmentation only applies 30% of the time, with p=0.3
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_augmentations=nearby_aug_func,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:10]
["This sweater is beautiful, but is definitely more for pooks than warmth. it ' s very soBt, but very thin. i prefer the way it looks opeb rzther than buttoned. i got the moss green color on sale, and i am glad i didn ' t pay full price for it - - it ' s lovely, but certain/y not worth $ 88.",
 "I'm a curvy person, so my review might not be suited to everyone. my standard size in retailer tops is xl, and it is the same for this blouse.\r\n-\r\noverall:\r\noverall gorgeous, well made blouse but i wish there was less fabric involved and the burnt out design didn't make a horizontal stripe across the back and biceps. this blouse just might not work out as well if you are a full figured person.\r\n-\r\npros:\r\ngorgeous blouse\r\nhigh quality\r\nunique\r\n-\r\ncons:\r\ni wish the burnt out design didn't make a hor",
 'This blouse is wonderful. i just got and wor$ the wiJe colored blouse today. i received so many compliments. i love it and with the sale priSe it is so worth it.',
 'When i saw this, i ordered immediately th(nking it was similar to the popular volorGlocked stripe sweater from last year. the knit is stretchy and textured and fe#ls like greaR quality (would wash well ), but it \' s pretty lightweight. the fit is huge. .. couPd easily size d8wn. i \' m 5 \' 7 " 128 # and found the small to be loose eveGywhere, including the arms. the length was at my knees, and the stripe fell awkwardly across my chest. no idea wtat i \' d wear tyis with even if it fit better. sadly, it \' s going',
 "This dress is a zillion times cuter in real life. it's very retro-swingy and girlish- it reminds me of something mia farrow would've worn in her rosemary's baby era. i have the black version and i've paired mine with tall black gladiator sandals for a more sultry nighttime look and also flip flops for beachy summer days. i think it's a total steal at the sale price.",
 "This top is so soft and with a henley neck opening and longer ribbed shirttail hems, it not only feels heavenly against the skin but it gives off a casual chic vibe. it is also great for layering under shorter sweaters and sweatshirts to give my staples a little oomph. it is a bit sheer so cami is a must. i am also not sure how well it will hold up after washings, especially since it's priced quite high. i love it so much that i will most probably end up keeping it it is true to size. i ordered",
 "This is my first lair of ag and i loGe them so far. they are not cutfed as shown in the picture. they are long so i had to get them altered (i ' m 5 ' ' 5 ). the color is a rich blue and Ghey have a nice stretch. i haven ' t worn tNem all day yet to see if they keep their shape. usuZlly a 28 or 29 and went with the 28 on these. got them on 20 perc off salf so very happy!",
 'I liked this coat but my family said it looked too much like something hilary clinton would wear. i am 54 and i think it made me look a bit dowdy since it runs a bit big.',
 'I saw a photographer wearing this at a wedding i went to in october. i absolutely fell in love. it is beautiful. i can\'t wait to wear it for the holidays! i got the small petite and i am 5\'2", 125 lbs. fit great. enjoy!',
 "This dress was adorable & fit great! regrettably, i had to return it since it wasn't lined."]

One of the more advanced augmentations is the “Contextual Word Embeddings Augmenter” (code example: https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb), where you can insert/substitute words using language models such as BERT, RoBERTa …

import nlpaug.augmenter.word as naw
aug = naw.ContextualWordEmbsAug(model_path='roberta-base', 
                                device='cuda:0', # if you don't have gpu, change to 'cpu'
                                action="substitute",
                                top_k=10,
                               aug_p=0.07)
/home/quan/anaconda3/envs/nlp_dev/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
contextual_aug_func = partial(nlp_aug,aug=aug)
_tmp = "I like my clothes loose fitting but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but was too big overall"
contextual_aug_func(_tmp)
"I kept my clothes slim fitting but even for me this ran large, i am 5'7 134b and medium fit in upper shoulders but was too big overall"
contextual_aug_func([_tmp for i in range(7)])
["I like my clothes big enough but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but felt too big overall",
 "I like my clothes loose fitting but even for me this ran large, i am 5'7 134b and I fit in the back but still too big overall",
 "I like my big loose fitting but even for me this ran large, i stand 5'7 134b and medium light in the shoulders but was too big overall",
 "I liked its own loose fitting but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but was too big overall",
 "I like my clothes loose fitting but had given me this ran large, i am 5'7 134b is medium fit in the shoulders but was too big overall",
 "I made my clothes loose fitting but even for me this ran large, i am 5'7 134b and barely fit over the shoulders but was too big overall",
 "I like my clothes loose fitting but honestly for me this ran large, i am 5'7 134b and medium fit in all shoulders it was too big overall"]

For this type of augmentation, it's wise to use a GPU to minimize processing time. You also don't want all of your text to be augmented, so let's reuse the stochastic augmentation.

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.3)
# add these 2 instance variables to your gpu augmentation
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_augmentations=contextual_aug_func,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:10]
['This dress makes me so sad...the textured stretchy fabric, color, length, and overall swingy fit are spot on. as another reviewer noted though, the armholes and neck totally ruin the dress. the neck is tiny, which i could have gotten over once it was on, but the arm holes were just awful - too big all around (too tall of a cut plus too wide of a cut). basically, you could see my bra plus from the front there was unflattering exposure near the armholes. it could have been so good, but alas, but t',
 "This top is very flattering, the fabric flows and moves. it fits perfectly (slim cut), but hides tummy bulges and other imperfections. and it's slimming too. can be dressed up or down, goes with everything. i ended up buying all three colors, and if there were more, i would buy more!",
 'This blouse is wonderful. i just got and wore the wine colored blouse today. i received so many compliments. i love it and with the sale price it is so worth it.',
 'This top is very versatile. i wore it out to dinner with skinny jeans on a friday night, but it can easily transition to a saturday afternoon stroll around town top.',
 'This top is so soft and luxuriously comfy! i love wearing it around the house, haven\'t really "dressed" it up yet with jeans or jewelry. it runs slightly big, but if you like the oversized look, this is definitely perfect.',
 "I was in love with this shirt from the moment i put it on. it is of high fit, with layers to ensure the top isn't sheer. the embroidery is incredibly pretty and the top looks way less grandma in it. i ordered the top xxs and it fits perfectly. i really appreciate that the underarm holes are just the right size and cant show off any of my bra, which sometimes happens with small tops. i can wear it with jeans and boots or with a pencil skirt and heels but is looks great with both outfits. o",
 "I read the other review and from the picture it looked as though it may be a little tight, so i ordered up to a large. the medium would have fit, but since i'm in ur mid-40's i felt more comfortable with large. but if people are trim and young or young at heart your usual medium will be fine. live the material and the navy makes it classy and rich looking. could be dressed up or down. have worn it to a cocktail party fundraiser with white crop sleeves and received many reviews. i'm always challenge",
 "This skirt is so ladylike and light as air! the cherry red color is beautiful - just as pictured. i can imagine so many opportunities to wear this skirt. with a sweater and tights now, and maybe a striped tee and sandals in the spring.\ni'm sure i'll have this gorgeous classic in my wardrobe for a very long time to come!",
 'Very pretty dress, perfect style for my build, bigger busted, muffin top. the material/pattern is really pretty.',
 "I purchased this top in the navy. the picture gives the top looking like an interesting blue with some purple in it, but in person the top is just... navy. the lace and fabric are soft. it fits true to me; i almost always wear a small and the small fit me. the wasn't quite my style, but it's a pretty top it will be great for spring and summer."]

And finally, similarly to Content Transformation, you can chain multiple augmentation functions together by providing a list of them in content_augmentations, as sketched below
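
A hypothetical sketch of such a chain, reusing the objects defined in the cells above (the keyboard augmenter and the contextual augmenter):

aug_kb = nac.KeyboardAug(aug_char_max=3, aug_char_p=0.1, aug_word_p=0.07)
keyboard_aug_func = partial(nlp_aug_stochastic, aug=aug_kb, p=0.3)
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         # presumably applied in list order (mirroring content_transformations):
                         # keyboard noise first, then contextual substitution
                         content_augmentations=[keyboard_aug_func, contextual_aug_func],
                         seed=42
                        )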

11. Save and Load TextDataController


source

TextDataController.save_as_pickles

 TextDataController.save_as_pickles (fname, parent='pickle_files',
                                     drop_attributes=False)
Type Default Details
fname Name of the pickle file
parent str pickle_files Parent folder
drop_attributes bool False Whether to drop large-size attributes

source

TextDataController.from_pickle

 TextDataController.from_pickle (fname, parent='pickle_files')
Type Default Details
fname Name of the pickle file
parent str pickle_files Parent folder

A TextDataController object can be saved and loaded with ease. This is especially useful after text processing and/or tokenization have been done

from datasets import disable_caching
disable_caching() # disable huggingface caching to see data size
from underthesea import text_normalize
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import random
def nlp_aug_stochastic(x,aug=None,p=0.5):
    if not isinstance(x,list): 
        if random.random()<p: return aug.augment(x)[0]
        return x
    news=[]
    originals=[]
    for _x in x:
        if random.random()<p: news.append(_x)
        else: originals.append(_x)
    # only perform augmentation when needed
    if len(news): news = aug.augment(news)
    return news+originals
aug2 = naw.ContextualWordEmbsAug(model_path='roberta-base', 
                                device='cuda:0', # if you don't have gpu, change to 'cpu'
                                action="substitute",
                                top_k=10,
                               aug_p=0.07)

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug2,p=0.1)
# add these 2 instance variables to your gpu augmentation
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations = [text_normalize,str.lower],
                         content_augmentations = contextual_aug_func, 
                         process_metas=True,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18102
    })
    validation: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})
tdc.save_as_pickles('my_tdc')

Let’s check the file size

file_stats = os.stat(Path('pickle_files/my_tdc.pkl'))
print(f'File Size in MegaBytes is {round(file_stats.st_size / (1024 * 1024), 3)}')
File Size in MegaBytes is 479.025

Load back our object

tdc2 = TextDataController.from_pickle('my_tdc')

You can still access all of its attributes: the data, the preprocessing steps, the transformations/augmentations, and so on

tdc2.main_ddict
DatasetDict({
    train: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18102
    })
    validation: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})
for i,v in enumerate(tdc2.main_ddict['train']):
    if i==3:break
    print(f"Text: {v['Review Text']}\nLabel: {v['Department Name']} => {v['label']}")
    print('-'*10)
Text: general petite . meh . this tunic is way over priced for the style and quality . it fit comfortably ( runs a size larger ) but it's not really flattering , it jut kind of hangs there looking ok . it is a little too deep of a v cut for a work top as well . this top does not support the price at all . it felt like something i could find at department store for way less . i will be returning it .
Label: Tops => 4
----------
Text: general . awesome buy ! . i am so happy i took a chance on this jumpsuit ! i am post-baby ( six weeks ) and although i intend on slimming down more i would say that it is flattering even at my current size . and it will only get better ! the quality and color are great !
Label: Bottoms => 0
----------
Text: general petite . snap neck pullover . i love this top . i ordered it in a large thinking it would be a tight rib but it is not so i reordered it in a small . i am 5 ' 7 " 145 lbs 34 g chest . the small fits perfectly and probably could have taken an xs . it is stretchy but fits wonderfully . i bought the black . i love how the neck snaps and adds a little pizzazz to a simple black turtle neck . i'm wearing it today with straight leg jeans and my leopard print ballet flats . i feel like audrey hepburn ! ! i will not be dry cleaning it . i will wash
Label: Bottoms => 0
----------
tdc2.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
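
As a quick sanity check, the encoded labels map back to these names by index; for instance, the first training example above was encoded as Tops => 4

print(tdc2.label_lists[0][4]) # => 'Tops', matching the 'Tops => 4' example above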
tdc2.filter_dict,tdc2.content_tfms,tdc2.aug_tfms
({'Review Text': <function __main__.<lambda>(x)>,
  'Department Name': <function __main__.<lambda>(x)>},
 [<function underthesea.pipeline.text_normalize.text_normalize(text, tokenizer='underthesea')>,
  <method 'lower' of 'str' objects>],
 [functools.partial(<function nlp_aug_stochastic>, aug=<nlpaug.augmenter.word.context_word_embs.ContextualWordEmbsAug object>, p=0.1)])

If you don’t want to store the HuggingFace DatasetDict or the augmentation functions in your TextDataController (typically when you already have a trained model and only use the TextDataController to preprocess the test set), you can drop them in the save_as_pickles step

tdc.save_as_pickles('my_lightweight_tdc',drop_attributes=True)

Let’s check the file size

file_stats = os.stat(Path('pickle_files/my_lightweight_tdc.pkl'))
print(f'File Size in MegaBytes is {round(file_stats.st_size / (1024 * 1024), 3)}')
File Size in MegaBytes is 1.911

Load it back

tdc3 = TextDataController.from_pickle('my_lightweight_tdc')

We will use this object to demonstrate the Test Set Construction in the next section

Construct a Test Dataset


source

TextDataController.prepare_test_dataset

 TextDataController.prepare_test_dataset (test_dset, do_filtering=False)
Type Default Details
test_dset The HuggingFace Dataset as Test set
do_filtering bool False whether to perform data filtering on this test set

source

TextDataController.prepare_test_dataset_from_csv

 TextDataController.prepare_test_dataset_from_csv (file_path,
                                                   do_filtering=False)
Type Default Details
file_path path to csv file
do_filtering bool False whether to perform data filtering on this test set

source

TextDataController.prepare_test_dataset_from_df

 TextDataController.prepare_test_dataset_from_df (df, validate=True,
                                                  do_filtering=False)
Type Default Details
df Pandas Dataframe
validate bool True whether to perform input data validation
do_filtering bool False whether to perform data filtering on this test set

source

TextDataController.prepare_test_dataset_from_raws

 TextDataController.prepare_test_dataset_from_raws (content)
Details
content Either a single sentence, a list of sentences, or a dictionary whose keys are the metadata columns and whose values are lists

Let’s say you have done your preprocessing and tokenization on your training set and have a nicely trained model, ready to do inference on new data. Here is how you can use TextDataController to apply all the necessary preprocessing steps to your new data

We will reuse the lightweight tdc object we created in the previous section (since we don’t really need all the training data just to construct new data). Also, we will take a small sample of our training data and pretend it is our test data

tdc = TextDataController.from_pickle('my_lightweight_tdc')

Let’s predict a few raw texts

If we only provide a raw text as follows

tdc.prepare_test_dataset_from_raws('This shirt is so comfortable I love it!')

You will encounter this error:

ValueError: There is/are metadatas in the preprocessing step. Please include a dictionary including these keys for 
metadatas: ['Title', 'Division Name'], and texture content: Review Text

Since our preprocessing includes some metadatas, you have to provide a dictionary as follows:

results = tdc.prepare_test_dataset_from_raws({'Review Text': 'This shirt is so comfortable I love it!',
                                    'Title': 'Great shirt',
                                    'Division Name': 'general'
                                   })
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Tokenization --------------------
Done
print(results[0])
{'Review Text': 'general . great shirt . this shirt is so comfortable i love it !', 'Title': 'great shirt', 'Division Name': 'general', 'input_ids': [0, 15841, 479, 372, 6399, 479, 42, 6399, 16, 98, 3473, 939, 657, 24, 27785, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
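
To process several raw texts at once, you can pass lists of equal length instead of single strings, per the signature above (a minimal sketch with made-up sentences):

results = tdc.prepare_test_dataset_from_raws({'Review Text': ['This shirt is so comfortable I love it!',
                                                              'The fabric feels cheap and it runs small'],
                                              'Title': ['Great shirt','Disappointed'],
                                              'Division Name': ['general','general petite']
                                             })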

Let’s make predictions from a pandas DataFrame

df_test = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig').sample(frac=0.2,random_state=1)
# drop NaN values in the label column
df_test = df_test[~df_test['Department Name'].isna()].reset_index(drop=True)
df_test.shape
(4692, 10)

There are a few things to pay attention to when constructing your new test set using TextDataController:

- Only a few processings will be applied to your test set: metadata concatenation, filtering (which can be omitted), content transformation, and tokenization. Therefore, all columns required to perform these processings must exist in your test dataset
- You can exclude the label column (e.g. Department Name in this example), since it’s a test set

To view all required columns, access the attribute cols_to_keep (you can omit the last column, which is the name of the label column)

tdc.cols_to_keep
['Review Text', 'Title', 'Division Name', 'Department Name']
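
For example, a minimal sketch of trimming a raw dataframe down to just the required columns (the df_test_required name is only illustrative and is not used in the rest of this section):

required_cols = tdc.cols_to_keep[:-1] # drop the label column: ['Review Text', 'Title', 'Division Name']
df_test_required = df_test[required_cols].copy()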

This test dataset might have some NaN values in the text field (Review Text), so we will turn on the filtering option to get rid of these NaNs, just as we did for the training set. If your test dataset doesn’t need any filtering, leave this option off

test_dset = tdc.prepare_test_dataset_from_df(df_test,validate=True,do_filtering=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title          758
Review Text    164
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 2 rows
-------------------- Start Test Set Transformation --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Tokenization --------------------
Done
test_dset
Dataset({
    features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'input_ids', 'attention_mask'],
    num_rows: 4528
})
for i in range(3):
    print(f"Text: {test_dset['Review Text'][i]}")
    print(f"Input_ids: {test_dset['input_ids'][i]}")
    print('-'*10)
Text: general . perfect for work and play . this shirt works for both going out and going to work , and i can wear it with everything . fits perfect , tucked and untucked , tied and untied . i love it .
Input_ids: [0, 15841, 479, 1969, 13, 173, 8, 310, 479, 42, 6399, 1364, 13, 258, 164, 66, 8, 164, 7, 173, 2156, 8, 939, 64, 3568, 24, 19, 960, 479, 10698, 1969, 2156, 21222, 8, 7587, 23289, 2156, 3016, 8, 7587, 2550, 479, 939, 657, 24, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
Text: general petite . . i don't know why i had the opposite problem most reviewers had with these ..... i tried on the regular length in the store and found that they were just a bit too short with heels . ( i'm 5 ' 5 ) . i had them ordered in a petite and when they came , they were too short with flats ! maybe it's the way i like to wear them , i like my flare jeans to barely skim the ground . i just exchanged them for regular length and will wear them with a small wedge shoe . aside from the length issues , these are super cute
Input_ids: [0, 15841, 4716, 1459, 479, 479, 939, 218, 75, 216, 596, 939, 56, 5, 5483, 936, 144, 34910, 56, 19, 209, 29942, 734, 939, 1381, 15, 5, 1675, 5933, 11, 5, 1400, 8, 303, 14, 51, 58, 95, 10, 828, 350, 765, 19, 8872, 479, 36, 939, 437, 195, 128, 195, 4839, 479, 939, 56, 106, 2740, 11, 10, 4716, 1459, 8, 77, 51, 376, 2156, 51, 58, 350, 765, 19, 20250, 27785, 2085, 24, 18, 5, 169, 939, 101, 7, 3568, 106, 2156, 939, 101, 127, 24186, 10844, 7, 6254, 28772, 5, 1255, 479, 939, 95, 11024, 106, 13, 1675, 5933, 8, 40, 3568, 106, 19, 10, 650, 27288, 12604, 479, 4364, 31, 5, 5933, 743, 2156, 209, 32, 2422, 11962, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
Text: general petite . great pants . thes e cords are great--lightweight for fl winters , and the bootcut flare bottom is super cute with ballet flats or booties . i am 5 ' 10 " and typically a size 8 ; the size 29 fit perfectly . they have a little stretch to them , which is great . very flattering--wish i could order in more colors ! !
Input_ids: [0, 15841, 4716, 1459, 479, 372, 9304, 479, 5, 29, 364, 37687, 32, 372, 5579, 6991, 4301, 13, 2342, 31000, 2156, 8, 5, 9759, 8267, 24186, 2576, 16, 2422, 11962, 19, 22573, 20250, 50, 9759, 918, 479, 939, 524, 195, 128, 158, 22, 8, 3700, 10, 1836, 290, 25606, 5, 1836, 1132, 2564, 6683, 479, 51, 33, 10, 410, 4140, 7, 106, 2156, 61, 16, 372, 479, 182, 34203, 5579, 605, 1173, 939, 115, 645, 11, 55, 8089, 27785, 27785, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
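
The csv and HuggingFace Dataset variants follow the same pattern. A minimal sketch reusing the sample csv from this notebook (not run here; the outputs would mirror the DataFrame example above):

# starting from a csv file on disk
test_dset_csv = tdc.prepare_test_dataset_from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                                  do_filtering=True)
# or starting from a HuggingFace Dataset you have already loaded
raw_test = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
test_dset_hf = tdc.prepare_test_dataset(raw_test,do_filtering=True)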