import pandas as pd
import numpy as np
from that_nlp_library.text_main import *
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from datasets import load_dataset
from importlib.machinery import SourceFileLoader
import os
Text Main
Helper functions
1. Content Transformation, Augmentations, and Tokenization
tokenizer_explain
tokenizer_explain (inp, tokenizer, split_word=False)
Display results from tokenizer
Type | Default | Details | |
---|---|---|---|
inp | Input sentence | ||
tokenizer | Tokenizer (preferably from HuggingFace) | ||
split_word | bool | False | Is input inp split into list or not |
We can use this function to show how HuggingFace’s tokenizer works and what its outputs look like
Let’s try PhoBert tokenizer (for Vietnamese texts). PhoBert tokenizer requires the input to be word-segmented. We will use our built-in function apply_vnmese_word_tokenize
to do this
from transformers import AutoTokenizer
_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
/home/quan/anaconda3/envs/nlp_dev/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
inp = apply_vnmese_word_tokenize('hội cư dân chung cư sen hồng - chung cư lotus sóng thần thủ đức')
print(inp)
hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức
Now we can use tokenizer_explain
to see how our PhoBert-base tokenizer processes our input inp
tokenizer_explain(inp,_tokenizer)
------- Tokenizer Explained -------
----- Input -----
hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức
----- Tokenized results -----
{'input_ids': [0, 1093, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'hội', 'cư_dân', 'chung_cư', 'sen', 'hồng', '-', 'chung_cư', 'lo@@', 'tus', 'sóng_thần', 'thủ_@@', 'đức', '</s>']
----- Results from tokenizer.decode -----
<s> hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức </s>
The Tokenized results are the raw outputs of the tokenizer, while Results from tokenizer.convert_ids_to_tokens shows what each token id really is: the newly-added start and end tokens, and even the byte-pair encoding in action
two_steps_tokenization_explain
two_steps_tokenization_explain (inp, tokenizer, content_tfms=[], aug_tfms=[])
Display results from each content transformation, then display results from tokenizer
Type | Default | Details | |
---|---|---|---|
inp | Input sentence | ||
tokenizer | Tokenizer (preferably from HuggingFace) | ||
content_tfms | list | [] | A list of text transformations |
aug_tfms | list | [] | A list of text augmentation |
This function further showcases how each text transformation and/or text augmentation affects our text input, step by step
Let’s load the PhoBert tokenizer one more time to test out this function
_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
from underthesea import text_normalize
apply_vnmese_word_tokenize
also has an option to normalize text (i.e. standardize the text input, get rid of extra spaces, normalize accents for Vietnamese text, …)
from functools import partial
inp = 'Hội cư dân chung cư sen hồng- chung cư lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh'
two_steps_tokenization_explain(inp,_tokenizer,content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)])
------- Text Transformation Explained -------
----- Raw sentence -----
Hội cư dân chung cư sen hồng- chung cư lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh
----- Content Transformations (on both train and test) -----
--- apply_vnmese_word_tokenize ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh
----- Augmentations (on train only) -----
------- Tokenizer Explained -------
----- Input -----
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh
----- Tokenized results -----
{'input_ids': [0, 792, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 5, 5043, 8, 16, 149, 2850, 214, 784, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'Hội', 'cư_dân', 'chung_cư', 'sen', 'hồng', '-', 'chung_cư', 'lo@@', 'tus', 'sóng_thần', 'thủ_@@', 'đức', '.', 'Thủ_Đức', 'là', 'một', 'huyện', 'trực_thuộc', 'thành_phố', 'Hồ_Chí_Minh', '</s>']
----- Results from tokenizer.decode -----
<s> Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức. Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh </s>
Let’s add some text augmentations
import unidecode
# to remove vietnamese accent
remove_accent = lambda x: unidecode.unidecode(x)
If you want your function to be printed with a different name:
remove_accent.__name__ = 'Remove Vietnamese Accent'
two_steps_tokenization_explain(inp,_tokenizer,
                               content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)],
                               aug_tfms=[remove_accent]
                              )
------- Text Transformation Explained -------
----- Raw sentence -----
Hội cư dân chung cư sen hồng- chung cư lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh
----- Content Transformations (on both train and test) -----
--- apply_vnmese_word_tokenize ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh
----- Augmentations (on train only) -----
--- Remove Vietnamese Accent ---
Hoi cu_dan chung_cu sen hong - chung_cu lotus song_than thu_duc . Thu_Duc la mot huyen truc_thuoc thanh_pho Ho_Chi_Minh
------- Tokenizer Explained -------
----- Input -----
Hoi cu_dan chung_cu sen hong - chung_cu lotus song_than thu_duc . Thu_Duc la mot huyen truc_thuoc thanh_pho Ho_Chi_Minh
----- Tokenized results -----
{'input_ids': [0, 3021, 1111, 56549, 17386, 22975, 13689, 3330, 27037, 31, 22975, 13689, 2029, 4885, 3227, 9380, 1510, 21605, 6190, 1894, 5, 5770, 4098, 1894, 2644, 3773, 1204, 18951, 2052, 10242, 9835, 1881, 22899, 17366, 10384, 30234, 8470, 1612, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'Ho@@', 'i', 'cu_@@', 'dan', 'chung_@@', 'cu', 'sen', 'hong', '-', 'chung_@@', 'cu', 'lo@@', 'tus', 'so@@', 'ng_th@@', 'an', 'thu_@@', 'du@@', 'c', '.', 'Thu_@@', 'Du@@', 'c', 'la', 'mo@@', 't', 'huy@@', 'en', 'tru@@', 'c_th@@', 'u@@', 'oc', 'thanh_@@', 'pho', 'Ho_@@', 'Chi_@@', 'Minh', '</s>']
----- Results from tokenizer.decode -----
<s> Hoi cu_dan chung_cu sen hong - chung_cu lotus song_than thu_duc. Thu_Duc la mot huyen truc_thuoc thanh_pho Ho_Chi_Minh </s>
You can even be creative with your augmentation functions; let’s say you only want your augmentation to be applied 50% of the time:
import random
random.seed(2) # for reproducibility
remove_accent = lambda x: unidecode.unidecode(x) if random.random()<0.5 else x
remove_accent.__name__ = 'Remove Vietnamese Accent with 0.5 prob'
two_steps_tokenization_explain(inp,_tokenizer,
                               content_tfms=[partial(apply_vnmese_word_tokenize,normalize_text=True)],
                               aug_tfms=[remove_accent]
                              )
------- Text Transformation Explained -------
----- Raw sentence -----
Hội cư dân chung cư sen hồng- chung cư lotus sóng thần thủ đức. Thủ Đức là một huyện trực thuộc thành phố Hồ Chí Minh
----- Content Transformations (on both train and test) -----
--- apply_vnmese_word_tokenize ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh
----- Augmentations (on train only) -----
--- Remove Vietnamese Accent with 0.5 prob ---
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh
------- Tokenizer Explained -------
----- Input -----
Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức . Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh
----- Tokenized results -----
{'input_ids': [0, 792, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 5, 5043, 8, 16, 149, 2850, 214, 784, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
----- Results from tokenizer.convert_ids_to_tokens -----
['<s>', 'Hội', 'cư_dân', 'chung_cư', 'sen', 'hồng', '-', 'chung_cư', 'lo@@', 'tus', 'sóng_thần', 'thủ_@@', 'đức', '.', 'Thủ_Đức', 'là', 'một', 'huyện', 'trực_thuộc', 'thành_phố', 'Hồ_Chí_Minh', '</s>']
----- Results from tokenizer.decode -----
<s> Hội cư_dân chung_cư sen hồng - chung_cư lotus sóng_thần thủ_đức. Thủ_Đức là một huyện trực_thuộc thành_phố Hồ_Chí_Minh </s>
There are more examples of interesting augmentations here
2. Tokenize Function
tokenize_function
tokenize_function (text, tok, max_length=None, is_split_into_words=False, return_tensors=None, return_special_tokens_mask=False)
This is a wrapper for HuggingFace’s tokenizer; it tokenizes and pads your input text, getting it ready for your NLP model
I will reuse PhoBert’s tokenizer to demonstrate the functionality of this function. For more information about this tokenizer: https://huggingface.co/vinai/phobert-base
_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
phobert_preprocess = partial(apply_vnmese_word_tokenize,normalize_text=True)
tokenize_function(phobert_preprocess('hội cư dân chung cư sen hồng - chung cư lotus sóng thần thủ đức'),
                  _tokenizer,max_length=512)
{'input_ids': [0, 1093, 1838, 1574, 3330, 2025, 31, 1574, 2029, 4885, 8554, 25625, 7344, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
_inp = ['hội cần mở thẻ tín dụng tại hà nội, đà nẵng, tp. hồ chí minh',"biti's cao lãnh - đồng tháp"]
_inp = [phobert_preprocess(i) for i in _inp]
_inp
['hội cần mở thẻ_tín_dụng tại hà_nội , đà_nẵng , tp . hồ chí_minh',
"biti's cao_lãnh - đồng tháp"]
tokenize_function(_inp,_tokenizer,max_length=512)
{'input_ids': [[0, 1093, 115, 548, 10603, 35, 44068, 2151, 4, 62295, 1301, 24931, 4, 1187, 2380, 5, 1005, 43647, 9534, 2], [0, 3907, 2081, 51899, 1118, 10109, 8271, 31, 80, 3186, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
You can change the type of the tokenizer’s outputs, such as PyTorch tensors, TensorFlow objects, or NumPy arrays
tokenize_function(_inp,_tokenizer,max_length=512,return_tensors='pt')
{'input_ids': tensor([[ 0, 1093, 115, 548, 10603, 35, 44068, 2151, 4, 62295,
1301, 24931, 4, 1187, 2380, 5, 1005, 43647, 9534, 2],
[ 0, 3907, 2081, 51899, 1118, 10109, 8271, 31, 80, 3186,
2, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
results = tokenize_function(_inp,_tokenizer,max_length=512)
print(_tokenizer.convert_ids_to_tokens(results['input_ids'][0]))
['<s>', 'hội', 'cần', 'mở', 'thẻ_tín_dụng', 'tại', 'hà_@@', 'nội', ',', 'đà_@@', 'n@@', 'ẵng', ',', 't@@', 'p', '.', 'hồ', 'chí_@@', 'minh', '</s>']
You can change max_length, which allows truncation when the sentence length is higher than max_length
results = tokenize_function(_inp,_tokenizer,
                            max_length=5)
results
{'input_ids': [[0, 1093, 115, 548, 2], [0, 3907, 2081, 51899, 2]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
3. Metadatas Processing
concat_metadatas
concat_metadatas (dset:dict, main_text, metadatas, process_metas=True, sep='.', is_batched=True)
Extract, process (optional) and concatenate metadatas to the front of text
Type | Default | Details | |
---|---|---|---|
dset | dict | HuggingFace Dataset | |
main_text | Text feature name | ||
metadatas | Metadata (or a list of metadatas) | ||
process_metas | bool | True | Whether apply simple metadata processing, i.e. space strip and lowercase |
sep | str | . | Separator, for multiple metadatas concatenation |
is_batched | bool | True | whether batching is applied |
This function allows you to concatenate any text metadatas to the front of your main text. Adding metadatas might help your model utilize the extra information in its downstream task
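As a minimal sketch (not from the original notebook), here is what this looks like on a made-up toy batch, assuming concat_metadatas can be called directly on a batched dictionary as the signature above suggests; the exact output format may differ slightly:
_batch = {'Review Text': ['Great fit and fabric', 'Runs small'],
          'Title': ['Love it ', ' Sizing Issue']}
concat_metadatas(_batch, main_text='Review Text', metadatas='Title',
                 process_metas=True, sep='.', is_batched=True)
# expected: 'Review Text' becomes roughly ['love it . Great fit and fabric', 'sizing issue . Runs small']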
Class TextDataController
TextDataController
TextDataController (inp, main_text:str, label_names=[], sup_types=[], class_names_predefined=[], filter_dict={}, label_tfm_dict={}, metadatas=[], process_metas=True, metas_sep='.', content_transformations=[], val_ratio:int|float|None=0.2, stratify_cols=[], upsampling_list={}, content_augmentations=[], seed=None, batch_size=1024, num_proc=4, cols_to_keep=None, verbose=True)
Initialize self. See help(type(self)) for accurate signature.
Type | Default | Details | |
---|---|---|---|
inp | HuggingFace Dataset or DatasetDict ||
main_text | str | Name of the main text column | |
label_names | list | [] | Names of the label (dependent variable) columns |
sup_types | list | [] | Type of supervised learning for each label name (‘classification’ or ‘regression’) |
class_names_predefined | list | [] | List of names associated with the labels (same index order). Use empty list for regression |
filter_dict | dict | {} | A dictionary: {feature: filtering_function_for_that_feature} |
label_tfm_dict | dict | {} | A dictionary: {label_name: transform_function_for_that_label} |
metadatas | list | [] | Names of the metadata columns |
process_metas | bool | True | Whether to do simple text processing on the chosen metadatas |
metas_sep | str | . | Separator, for multiple metadatas concatenation |
content_transformations | list | [] | A list of text transformations |
val_ratio | int | float | None | 0.2 | Ratio of data for validation set |
stratify_cols | list | [] | Column(s) needed to do stratified shuffle split |
upsampling_list | dict | {} | A list of tuples. Each tuple: (feature, upsampling_function_based_on_the_feature) |
content_augmentations | list | [] | A list of text augmentations |
seed | NoneType | None | Random seed |
batch_size | int | 1024 | CPU batch size |
num_proc | int | 4 | Number of processes for multiprocessing |
cols_to_keep | NoneType | None | Columns to keep after all processings |
verbose | bool | True | Whether to print processing information |
TextDataController.do_all_preprocessing
TextDataController.do_all_preprocessing (shuffle_trn=True, check_val_leak=True)
Type | Default | Details | |
---|---|---|---|
shuffle_trn | bool | True | To shuffle the train set before tokenization |
check_val_leak | bool | True | To check (and remove) training data which is leaked to validation set |
TextDataController.do_tokenization
TextDataController.do_tokenization (tokenizer, max_length=None, trn_size=None, tok_num_proc=None)
Type | Default | Details | |
---|---|---|---|
tokenizer | Tokenizer (preferably from HuggingFace) | ||
max_length | NoneType | None | pad to model’s allowed max length (default is max_sequence_length). Use -1 for no padding at all |
trn_size | NoneType | None | The number of training data to be tokenized |
tok_num_proc | NoneType | None | Number of processes for tokenization |
TextDataController.process_and_tokenize
TextDataController.process_and_tokenize (tokenizer, max_length=None, trn_size=None, tok_num_proc=None, shuffle_trn=True, check_val_leak=True)
This will perform do_all_preprocessing
then do_tokenization
Type | Default | Details | |
---|---|---|---|
tokenizer | Tokenizer (preferably from HuggingFace) | ||
max_length | NoneType | None | pad to model’s allowed max length (default is max_sequence_length) |
trn_size | NoneType | None | The number of training data to be tokenized |
tok_num_proc | NoneType | None | Number of processes for tokenization |
shuffle_trn | bool | True | To shuffle the train set before tokenization |
check_val_leak | bool | True | To check (and remove) training data which is leaked to validation set |
1. Load data + Basic use case
TextDataController.from_csv
TextDataController.from_csv (file_path, **kwargs)
TextDataController.from_df
TextDataController.from_df (df, validate=True, **kwargs)
You can create a TextDataController
from a csv, pandas DataFrame, or directly from a HuggingFace dataset object. Currently, TextDataController
is designed for text classification and text regression, as we will explore in this documentation
We will load a sample dataset to prepare for a classification task: predicting which Department Name
a comment (Review Text
) belongs to
Dataset source: https://www.kaggle.com/datasets/kavita5/review_ecommerce
import pandas as pd
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df.shape
(23486, 10)
df.sample(5)
Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name | |
---|---|---|---|---|---|---|---|---|---|---|
15253 | 831 | 31 | Great work top | I snagged this top with the 25% off sale and i... | 5 | 1 | 0 | General Petite | Tops | Blouses |
1254 | 850 | 49 | Flattering, comfy top | Everyone has said it, so i'll just add my two ... | 4 | 1 | 0 | General Petite | Tops | Blouses |
5105 | 824 | 38 | Adore this top! | Saw this one online and when it came it did no... | 5 | 1 | 2 | General Petite | Tops | Blouses |
8611 | 920 | 29 | Great spring sweater | This sweater is classy and comfortable. it has... | 4 | 1 | 0 | General | Tops | Fine gauge |
17574 | 1110 | 37 | Super cute! | I'm not sure why the other reviewers think tha... | 5 | 1 | 6 | General | Dresses | Dresses |
You can create a TextDataController
from a dataframe. This also provides a quick input validation check (NaN check and Duplication check)
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
You can also create a TextDataController
directly from the csv file. The good thing about using HuggingFace Dataset as the main backend of the TextDataController is that you can utilize lots of its useful functionality, such as caching
tdc = TextDataController.from_csv('sample_data/Womens_Clothing_Reviews.csv',
                                  main_text='Review Text',
                                  sup_types='classification',
                                  label_names='Department Name',
                                 )
You can also create a TextDataController
from a HuggingFace Dataset
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
dset
Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 23486
})
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         seed=42
                        )
In the “Input Validation Precheck” above, we notice that our dataset has missing values in the text field and the label field. For now, let’s load the data as a pandas DataFrame, perform some cleaning, and create our TextDataController
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df = df[(~df['Review Text'].isna()) & (~df['Department Name'].isna())].reset_index(drop=True)
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 2966
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 1 rows
At this point you can perform 2 important steps on your data:
- Text preprocessings, Label Encoding, Train/Validation Split
- Tokenization
We haven’t provided any preprocessings to the TextDataController
; we will see more on how to use preprocessings (step by step) as we progress
ddict = tdc.do_all_preprocessing(shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label'],
num_rows: 18099
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label'],
num_rows: 4526
})
})
Our DatasetDict now has two splits: train and validation. Note that the train split has been shuffled and flattened during processing, for efficiency
ddict['train'][:3]
{'Review Text': ['I wanted to love this. i have been looking for a poncho-type sweater for our cold midwestern winters. the cream colored one that i really wanted sold out instantly and i missed the window for the xxsp. i ordered this in the xs/sp (smallest size available). i am 5\'1" and 108 lbs with small shoulders. the neck opening is huge. my collar bones and a seciton of my upper back were exposed. this would not keep me warm due to so much exposed skin on my neck, back, and shoulders. i suppose i could get a',
'Love the movement of the blouse and how it falls. great quality material.',
"Loved these beach pants! i purchased the size medium in the coral. i loved the accents on the ties and the little pom pom details. i did get many compliments on them. the only thing i don't love about them is the material is very thin. i know they are beach pants but i personally would have liked slightly more weight to them. i wore them once with a pair of cropped leggings underneath and i thought it was a very cute way to wear them with some additional substance underneath."],
'Department Name': ['Tops', 'Tops', 'Intimate'],
'label': [4, 4, 2]}
ddict['validation'][:3]
{'Review Text': ["The raspberry color is really stunning! i have been looking for colored tights for a while and had difficulty finding really rich colors. i was thrilled when i saw these! i've worn them once so far. very comfortable and seem like they will last.",
'I just received this dress and i feel like a goddess in it! it is perfect for graduations, weddings, romantic dinners, tropical va cations....heck, i\'ll wear it to the grocery store! i love it that much!\r\n\r\ni am 5\'7" with a 34c bust.....i have this dress in a size 4 and it fits very well. this dress is slim cut from the shoulder down to the waist. the dress length hits me at the lower calf, just like the model online. i think the armholes are cut a little high...this being said; this dress would',
"When i saw this top online, i thought i'd love it and immediately ordered it in both colors. they arrived today and i am soooo disappointed. i have never seen such drab colors. the blue is a muddy grayish hue (like an overcast day) and the pink is a dusty shade of peach. yuck. and i was hoping the ruffle at the bottom would have a chiffon-like flowy effect. instead, the ruffle is made of a cheap looking knit. back these go..."],
'Department Name': ['Intimate', 'Dresses', 'Tops'],
'label': [2, 1, 4]}
Now we can start with the tokenization
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
/home/quan/anaconda3/envs/nlp_dev/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
ddict = tdc.do_tokenization(tokenizer,max_length=512)
-------------------- Tokenization --------------------
Done
ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 18099
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
print(ddict['train'][0]['input_ids'][:150])
[0, 100, 770, 7, 657, 42, 4, 939, 33, 57, 546, 13, 10, 181, 261, 11156, 12, 12528, 23204, 13, 84, 2569, 1084, 16507, 31000, 4, 5, 6353, 20585, 65, 14, 939, 269, 770, 1088, 66, 11764, 8, 939, 2039, 5, 2931, 13, 5, 37863, 4182, 4, 939, 2740, 42, 11, 5, 3023, 29, 73, 4182, 36, 23115, 990, 1836, 577, 322, 939, 524, 195, 108, 134, 113, 8, 13955, 23246, 19, 650, 10762, 4, 5, 5397, 1273, 16, 1307, 4, 127, 19008, 12396, 8, 10, 15636, 24899, 9, 127, 2853, 124, 58, 4924, 4, 42, 74, 45, 489, 162, 3279, 528, 7, 98, 203, 4924, 3024, 15, 127, 5397, 6, 124, 6, 8, 10762, 4, 939, 19792, 939, 115, 120, 10, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(ddict['validation'][0]['input_ids'][:150])
[0, 133, 41345, 3195, 16, 269, 5835, 328, 939, 33, 57, 546, 13, 20585, 326, 6183, 13, 10, 150, 8, 56, 9600, 2609, 269, 4066, 8089, 4, 939, 21, 8689, 77, 939, 794, 209, 328, 939, 348, 10610, 106, 683, 98, 444, 4, 182, 3473, 8, 2045, 101, 51, 40, 94, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
You can combine Text Processing and Tokenization with 1 method call
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name'
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 2966
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 1 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 4, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
You can access the DatasetDict from the instance variable main_ddict
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 18098
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
This DatasetDict is ready to be put into any HuggingFace text model.
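For instance, here is a minimal, hedged sketch (not part of the library; the model choice and training arguments are illustrative only) of plugging the processed splits into a plain HuggingFace Trainer:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# label_lists (documented further below) holds the class names; 6 department names here
num_labels = len(tdc.label_lists[0])
model = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels)

args = TrainingArguments(output_dir='tmp_trainer',
                         per_device_train_batch_size=16,
                         num_train_epochs=1)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=tdc.main_ddict['train'],
                  eval_dataset=tdc.main_ddict['validation'])
# trainer.train()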
2. Filtering
This preprocessing step allows you to filter out certain values of a given column in your dataset. Let’s say I want to filter out any None value in the column ‘Review Text’
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df[(~df['Review Text'].isna())].isna().sum()
Clothing ID 0
Age 0
Title 2966
Review Text 0
Rating 0
Recommended IND 0
Positive Feedback Count 0
Division Name 13
Department Name 13
Class Name 13
dtype: int64
We will provide a dictionary containing the name of the column and the filtering function to apply on that column. Note that the filtering function will receive an item from the column, and the function should return a boolean
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None},
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 18111
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4529
})
})
Let’s check if we have filtered out all NaN/None values
for i in tdc.main_ddict['train']['Review Text']:
    assert i is not None
for i in tdc.main_ddict['validation']['Review Text']:
    assert i is not None
We can even add multiple filtering functions. Remember from our precheck, there are also None values in our label ‘Department Name’. While we are at it, let’s filter out any rating that is less than 3 (just to showcase what our filtering can do)
df.Rating.value_counts()
Rating
5 13131
4 5077
3 2871
2 1565
1 842
Name: count, dtype: int64
Note that TextDataController
will only keep the text, the labels and the metadatas columns; any other column will be dropped. To keep the ‘Rating’, we need to define the cols_to_keep
argument
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              'Rating': lambda x: x>=3
                                             },
                                 cols_to_keep=['Review Text','Rating','Department Name'],
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Rating -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
for i in tdc.main_ddict['train']['Department Name']:
    assert i is not None
for i in tdc.main_ddict['validation']['Department Name']:
    assert i is not None
for i in tdc.main_ddict['validation']['Rating']:
    assert i >= 3
3. Taking a sample from training data
If you only want to extract a training sample of your data, you can use the trn_size
argument of the method process_and_tokenize
(or do_tokenization
). Since we use sharding to extract a sample from a DatasetDict, if trn_size
is an integer, an approximate size will be returned
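As a rough illustration (this arithmetic is an assumption about the sharding behaviour, not the library’s exact code), asking for trn_size=1000 out of roughly 18,100 training rows keeps one shard out of 18, i.e. about 1006 rows, which is what the output below shows:
train_len  = 18099                   # approximate train size after the earlier split
trn_size   = 1000
num_shards = train_len // trn_size   # 18 shards
train_len / num_shards               # ~1005.5 -> one shard of roughly 1006 rows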
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True,trn_size=1000)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 1006
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True,trn_size=0.1) # return 10% of the data
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 1810
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
4. Metadatas concatenation
If we think metadatas can be helpful, we can concatenate them to the front of the text, so that our text classification model is aware of them.
In this example, let’s add ‘Title’ as our metadata
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 metadatas='Title',
                                 process_metas=True, # to preprocess the metadata (currently it's just empty space stripping and lowercasing)
                                 seed=42
                                )
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:5]
['beautiful! . I love this top. it was everything i hoped it would be. it is lined so it is not see through in the chest/back; sleeves are sheer. soft. gorgeous color. love the layers. runs large so definitely size down. i am usually a m and ordered the s. i\'m 5\'8" curvy 32dd',
'very flattering . This dress fits to a t! true to size. very flattering. fabric is soft and comfortable.',
"the worst . I don't typically write bad reviews, but this dress is so bad and i want to save someone else from buying it. i read the mostly bad reviews and still purchased anyway (my fault i know). the dress is super stiff ( i know denim can be that way and it is possible it would soften up after a few washes). i'm typically a 6/8 and the size small swallowed me, and the xs was big everywhere except through the bust (i ordered both sizes to try). i wouldn't recommend buying this if you are a size 8 or small",
"love this jacket! . I was on the lookout for a denim jacket when i found this beauty on line. i fell in love immediately and didn't think twice about paying full price. i wear it with moss green chinos and it looks really good. the little dots in the jacket are actually a pale green, which gives it extra character. very well made. i was a bit skeptical about the hook and eye fastenings, but they are very secure. \r\n\r\ni ordered my usual xl and found it roomy enough in the bust and arms. i would definitely call it tru",
'great spring/summer dress. . I am excited for spring so i can wear this. i purchased the orange. it is actually more of a red, but i like it. colorful and flattering fit.']
tdc.main_ddict['validation']['Review Text'][:5]
[' . Such a fun jacket! great to wear in the spring or to the office as an alternative to a formal blazer. very comfortable!',
'simple and elegant . I thought this shirt was really feminine and elegant. only downsides is some of the punched out holes had fabric still attached which you have cut off with scissors- otherwise the shirt will snag. and the second issue of bigger importance are the low armholes. lots of bra showing- not really sure how to get around that so i always wear it with a cardigan. but it would be nice not to have to. \r\nother than that it looks nice and pairs nicely with almost anything.',
'retro and pretty . This top has a bit of a retro flare but so adorable on. looks really cute with a pair of faded boot cut jeans.',
'summer/fall wear . I first spotted this on an retailer employee, she paired it with a peasant top & wore it open w/jeans & boots- so darn cute. love how this peice transitions from summer to fall. i\'m 5\'4" so i had to order the small petite which is perfect. note that this dress is very long! it\'s just a must have garment. the colors/ print are just beautiful.',
"perfect except slip . This is my new favorite dress! my only complaint is the slip is too small and the dress cannot be worn without it. i can't order a size up as the dress would then be huge. not sure what the solution is but the dress itself is stunning."]
You can add multiple metadatas. Let’s say ‘Division Name’ is the second metadata.
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 metadatas=['Title','Division Name'],
                                 process_metas=True,
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:5]
["general petite . meh . This tunic is way over priced for the style and quality. it fit comfortably (runs a size larger) but it's not really flattering, it jut kind of hangs there looking ok. it is a little too deep of a v cut for a work top as well. this top does not support the price at all. it felt like something i could find at department store for way less. i will be returning it.",
'general . awesome buy! . I am so happy i took a chance on this jumpsuit! i am post-baby (six weeks) and although i intend on slimming down more i would say that it is flattering even at my current size. and it will only get better! \r\nthe quality and color are great!',
'general . warm grey . These are a lovely neutral to slightly warm grey pair of jeans from a great line. i wore my usual size without issues.',
"general . loved it, but it didn't work for me. . I wanted this top to work so bad. unfortunately the way the bust of the top is designed it isn't flattering if you aren't flat chested. it squishes on side of your chest and leaves the other side alone. i'm a b cup and had this problem so if you are a b or larger, i don't recommend. however, if you are smaller busted, this piece would be worth the purchase.",
"general . varying feelings and opinions . As you can see, there is an array of differing opinions on here, and i share sentiments on both:\r\n_______\r\npros:\r\n- the texture and feel of this is great; it is very comfortable and is different.\r\n- tts for the most part; i normally can wear sizes 10 and 12 (m and l) with most retailer and got the medium and the fit was overall fine but more snug at the hips. if you're more slim/straight, it'll probably fit you like on the model. \r\n- good length, not too short or too long.\r\n- the mock collar is ni"]
tdc.main_ddict['validation']['Review Text'][:5] # The metadata for this text is None
['general petite . . Such a fun jacket! great to wear in the spring or to the office as an alternative to a formal blazer. very comfortable!',
'general petite . simple and elegant . I thought this shirt was really feminine and elegant. only downsides is some of the punched out holes had fabric still attached which you have cut off with scissors- otherwise the shirt will snag. and the second issue of bigger importance are the low armholes. lots of bra showing- not really sure how to get around that so i always wear it with a cardigan. but it would be nice not to have to. \r\nother than that it looks nice and pairs nicely with almost anything.',
'general . retro and pretty . This top has a bit of a retro flare but so adorable on. looks really cute with a pair of faded boot cut jeans.',
'general petite . summer/fall wear . I first spotted this on an retailer employee, she paired it with a peasant top & wore it open w/jeans & boots- so darn cute. love how this peice transitions from summer to fall. i\'m 5\'4" so i had to order the small petite which is perfect. note that this dress is very long! it\'s just a must have garment. the colors/ print are just beautiful.',
"general petite . perfect except slip . This is my new favorite dress! my only complaint is the slip is too small and the dress cannot be worn without it. i can't order a size up as the dress would then be huge. not sure what the solution is but the dress itself is stunning."]
5. Label Encodings
Single-head prediction
We have briefly gone through the simplest case of label encoding, where we only need to predict a single label. We call this single-head classification
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
All label names will be saved in instance variable label_lists
tdc.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
… and all labels will be encoded
tdc.main_ddict['validation']['label'][:5]
[2, 4, 4, 1, 1]
We also keep the original labels, for reference
tdc.main_ddict['validation']['Department Name'][:5]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
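The encoded labels are simply indices into label_lists, so we can map them back to their names ourselves (a quick illustrative check, not part of the original notebook):
[tdc.label_lists[0][i] for i in tdc.main_ddict['validation']['label'][:5]]
# ['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']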
You can also do single-head regression. Let’s say we want to predict Rating
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
dset
Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 23486
})
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='regression',
                         label_names='Rating',
                         filter_dict={'Review Text': lambda x: x is not None},
                         seed=42,
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
print(tdc.main_ddict['train']['Rating'][:5])
print(tdc.main_ddict['train']['label'][:5])
[3.0, 1.0, 4.0, 3.0, 5.0]
[3.0, 1.0, 4.0, 3.0, 5.0]
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['label'][:5])
[5.0, 4.0, 3.0, 5.0, 5.0]
[5.0, 4.0, 3.0, 5.0, 5.0]
Multi-head prediction
What if we need to predict 2 different labels at once? We call this multi-head classification/regression. For example, let’s define our dataset so that we need to predict both Department Name
and Division Name
(both as classification)
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Division Name','Department Name'],
                         sup_types=['classification','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.label_lists
[['General', 'General Petite', 'Initmates'],
['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
We can see that we have two lists, one for label names of Division Name
, and one for label names of Department Name
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 18099
})
validation: Dataset({
features: ['Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
print(tdc.main_ddict['validation']['Division Name'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
['General Petite', 'General Petite', 'General', 'General Petite', 'General Petite']
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[1, 2], [1, 4], [0, 4], [1, 1], [1, 1]]
What if one label is classification, and another label is regression? We will predict Department Name
(classification) and Rating
(regression)
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Rating','Department Name'],
                         sup_types=['regression','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
[5.0, 5.0, 5.0, 5.0, 4.0]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[5.0, 2.0], [5.0, 4.0], [5.0, 4.0], [5.0, 1.0], [4.0, 1.0]]
Since it’s multi-head, you can define multiple classification/regression labels, as many as you want
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Division Name','Rating','Department Name'],
                         sup_types=['classification','regression','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                      'Division Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Division Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
print(tdc.main_ddict['train']['Division Name'][:5])
print(tdc.main_ddict['train']['Rating'][:5])
print(tdc.main_ddict['train']['Department Name'][:5])
print(tdc.main_ddict['train']['label'][:5])
['General Petite', 'General Petite', 'General', 'General', 'General Petite']
[4.0, 4.0, 5.0, 3.0, 5.0]
['Tops', 'Tops', 'Tops', 'Tops', 'Dresses']
[[1.0, 4.0, 4.0], [1.0, 4.0, 4.0], [0.0, 5.0, 4.0], [0.0, 3.0, 4.0], [1.0, 5.0, 1.0]]
print(tdc.main_ddict['validation']['Division Name'][:5])
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
['General Petite', 'General Petite', 'General', 'General Petite', 'General Petite']
[5.0, 5.0, 5.0, 5.0, 4.0]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[1.0, 5.0, 2.0], [1.0, 5.0, 4.0], [0.0, 5.0, 4.0], [1.0, 5.0, 1.0], [1.0, 4.0, 1.0]]
Multi-label classification
Lastly, let’s define a multi-label classification task, where a text can have one or more labels. Our data doesn’t have such labeling, so we will make a new label, just for demonstration.
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].unique()
array(['Intimate', 'Dresses', 'Bottoms', 'Tops', 'Jackets', 'Trend', nan],
dtype=object)
df['Fake Label'] = [np.random.choice(df['Department Name'].unique()[:-1],size=np.random.randint(2,6),replace=False) for _ in range(len(df))]
df.head()
Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name | Fake Label | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates | [Intimate, Dresses, Trend, Bottoms] |
1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses | [Trend, Intimate] |
2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses | [Intimate, Dresses, Bottoms, Trend] |
3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants | [Intimate, Bottoms] |
4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses | [Trend, Bottoms, Dresses, Intimate, Jackets] |
You don’t have to add any extra argument; the controller will determine whether this is for multilabel classification, based on the format of the label values
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 filter_dict={'Review Text': lambda x: x is not None},
                                 label_names='Fake Label',
                                 sup_types='classification',
                                 seed=42,
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Fake Label', 'label', 'input_ids', 'attention_mask'],
num_rows: 18111
})
validation: Dataset({
features: ['Review Text', 'Fake Label', 'label', 'input_ids', 'attention_mask'],
num_rows: 4529
})
})
tdc.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
tdc.main_ddict['validation']['Fake Label'][2]
['Trend', 'Intimate', 'Bottoms', 'Dresses']
Since this is multilabel classification, the label will be one-hot encoded
tdc.main_ddict['validation']['label'][2]
[1, 1, 1, 0, 0, 1]
tdc.main_ddict['validation']['label'][:5]
[[0, 1, 1, 0, 1, 0],
[0, 1, 1, 1, 1, 0],
[1, 1, 1, 0, 0, 1],
[1, 1, 0, 1, 0, 0],
[0, 1, 1, 1, 0, 1]]
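The positions in the one-hot vector follow the order of tdc.label_lists[0]. As a quick sanity check, here is a minimal sketch (reusing the tdc object above; this is not part of the library API) that decodes a one-hot label back into its label names:
names = tdc.label_lists[0]                           # ['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']
onehot = tdc.main_ddict['validation']['label'][2]    # [1, 1, 1, 0, 0, 1]
# keep the names whose one-hot flag is 1
print([n for n,flag in zip(names,onehot) if flag==1])
# expected: ['Bottoms', 'Dresses', 'Intimate', 'Trend']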
No label
If you don’t have a label to define, leave all label-related arguments blank
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'input_ids', 'attention_mask'],
num_rows: 18111
})
validation: Dataset({
features: ['Review Text', 'input_ids', 'attention_mask'],
num_rows: 4529
})
})
6. Label transformation
Sometimes you want to apply a light transformation to your label(s) before applying label encoding, e.g. there are typos in your string labels (classification), or you want to scale your regression label. TextDataController provides a way to do so, via the label_tfm_dict argument. In the following example, I will fix the typo ‘Initmates’ in the Division Name label, and log-scale the Rating label
import math
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names=['Division Name','Rating','Department Name'],
                         sup_types=['classification','regression','classification'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                      'Division Name': lambda x: x is not None,
                                     },
                         label_tfm_dict={'Division Name': lambda x: x if x!='Initmates' else 'Intimates',
                                         'Rating': lambda x: math.log(x)+1},
                         seed=42,
                         num_proc=1
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Division Name -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
Notice that in label_lists, the label ‘Initmates’ has been replaced by ‘Intimates’. Also, the second list is empty because it corresponds to Rating, which is a regression label
tdc.label_lists
[['General', 'General Petite', 'Intimates'],
[],
['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
print(tdc.main_ddict['train']['Division Name'][:5])
print(tdc.main_ddict['train']['Rating'][:5])
print(tdc.main_ddict['train']['Department Name'][:5])
print(tdc.main_ddict['train']['label'][:5])
['General Petite', 'General Petite', 'General', 'General', 'General Petite']
[2.386294361119891, 2.386294361119891, 2.6094379124341005, 2.09861228866811, 2.6094379124341005]
['Tops', 'Tops', 'Tops', 'Tops', 'Dresses']
[[1.0, 2.386294361119891, 4.0], [1.0, 2.386294361119891, 4.0], [0.0, 2.6094379124341005, 4.0], [0.0, 2.09861228866811, 4.0], [1.0, 2.6094379124341005, 1.0]]
print(tdc.main_ddict['validation']['Division Name'][:5])
print(tdc.main_ddict['validation']['Rating'][:5])
print(tdc.main_ddict['validation']['Department Name'][:5])
print(tdc.main_ddict['validation']['label'][:5])
['General Petite', 'General Petite', 'General', 'General Petite', 'General Petite']
[2.6094379124341005, 2.6094379124341005, 2.6094379124341005, 2.6094379124341005, 2.386294361119891]
['Intimate', 'Tops', 'Tops', 'Dresses', 'Dresses']
[[1.0, 2.6094379124341005, 2.0], [1.0, 2.6094379124341005, 4.0], [0.0, 2.6094379124341005, 4.0], [1.0, 2.6094379124341005, 1.0], [1.0, 2.386294361119891, 1.0]]
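If you later need the original rating back (e.g. when reporting predictions), you can invert the transformation. A minimal sketch, assuming the same math.log(x)+1 transformation defined above:
import math
# the transformation was y = log(x) + 1, so the inverse is x = exp(y - 1)
scaled = tdc.main_ddict['train']['Rating'][0]   # 2.386294361119891
print(math.exp(scaled - 1))                     # ~4.0, the original rating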
7. Content Transformation
This processing allows you to alter the text content in your dataset. You need to define a function that accepts a single string and returns a new, processed string. Note that this transformation will be applied to ALL of your data (both train and validation splits)
Let’s say we want to normalize our text, because the text might contain some extra spaces between words, or not follow the “single space after a period” rule
= "This is a sentence,which doesn't follow any rule!No single space is provided after period or punctuation marks. Maybe there are too many spaces!?! " _tmp
from underthesea import text_normalize
text_normalize(_tmp)
"This is a sentence , which doesn't follow any rule ! No single space is provided after period or punctuation marks . Maybe there are too many spaces ! ? !"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_transformations=text_normalize,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][0]
"This sweater is beautiful , but is definitely more for looks than warmth . it's very soft , but very thin . i prefer the way it looks open rather than buttoned . i got the moss green color on sale , and i am glad i didn't pay full price for it--it ' s lovely , but certainly not worth $ 88 ."
tdc.main_ddict['validation']['Review Text'][0]
'Such a fun jacket ! great to wear in the spring or to the office as an alternative to a formal blazer . very comfortable !'
You can chain multiple functions. Let’s say after text normalizing, I want to lowercase the text
str.lower('tHis IS NoT lowerCASE')
'this is not lowercase'
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_transformations=[text_normalize,str.lower],
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][0]
"this sweater is beautiful , but is definitely more for looks than warmth . it's very soft , but very thin . i prefer the way it looks open rather than buttoned . i got the moss green color on sale , and i am glad i didn't pay full price for it--it ' s lovely , but certainly not worth $ 88 ."
tdc.main_ddict['validation']['Review Text'][0]
'such a fun jacket ! great to wear in the spring or to the office as an alternative to a formal blazer . very comfortable !'
You can even perform more complex transformations, such as removing text inside parentheses, or replacing text that matches a pattern (which is doable with regular expressions). Let’s build an example of such a transformation, where we remove text inside parentheses and convert any hashtag into the string ‘hashtag’
import re
def process_text(s):
    # Remove texts inside parentheses
    s = re.sub(r'\(.*?\)', '', s)
    # Convert any hashtag into the string 'hashtag'
    s = re.sub(r'#\w+', 'hashtag', s)
    return s.strip()
process_text("#Promotions There's no way it works (I checked!), however it surprises me #howonearth #mindblowing")
"hashtag There's no way it works , however it surprises me hashtag hashtag"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_transformations=process_text,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- process_text -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
8. Train/Validation Split
There are several ways to perform a train/validation split with TextDataController
The first way is when you already have a validation split in your HuggingFace’s Dataset. Let’s use the Dataset built-in function train_test_split
to simulate this
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1)
# This will create a 'test' split instead of 'validation', so we will process a bit to have a validation split
ddict_with_val['validation']=ddict_with_val['test']
del ddict_with_val['test']
ddict_with_val
DatasetDict({
train: Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 21137
})
validation: Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 2349
})
})
tdc = TextDataController(ddict_with_val,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split already exists
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 1, which is 0.00% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 20374
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 2253
})
})
A second way is to split randomly, based on either a ratio (a float between 0 and 1) or the number of samples you want in your validation set
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         val_ratio=0.15,
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
print(tdc.main_ddict)
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 19231
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 3395
})
})
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         val_ratio=5000,
                         seed=42,
                         verbose=False
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
print(tdc.main_ddict)
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 17624
})
validation: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 5000
})
})
A third way is to do a random stratified split (inspired by sklearn’s). Let’s do a stratified split based on our label ‘Department Name’
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Department Name'].value_counts(normalize=True)
Department Name
Tops 0.445978
Dresses 0.269214
Bottoms 0.161852
Intimate 0.073918
Jackets 0.043967
Trend 0.005070
Name: proportion, dtype: float64
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts(normalize=True)
Tops 0.444033
Dresses 0.271602
Bottoms 0.161878
Intimate 0.072983
Jackets 0.044309
Trend 0.005193
Name: proportion, dtype: float64
pd.Series(tdc.main_ddict['validation']['Department Name']).value_counts(normalize=True)
Tops 0.444101
Dresses 0.271542
Bottoms 0.161732
Intimate 0.073133
Jackets 0.044189
Trend 0.005303
Name: proportion, dtype: float64
You can also use multiple columns for your stratification
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 sup_types='classification',
                                 label_names='Department Name',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols=['Department Name','Rating'],
                                 seed=42,
                                 verbose=False
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
And finally, you can omit any validation split if you specify val_ratio
as None
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=None,
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
tdc.main_ddict
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
No validation split defined
Done
-------------------- Dropping unused features --------------------
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
DatasetDict({
train: Dataset({
features: ['Review Text', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 22628
})
})
9. Upsampling
This is useful when you have an imbalanced dataset and want to perform some upsampling (oversampling) on the minority class. In TextDataController, you can perform upsampling on any column of the original dataset, and you can even do upsampling on multiple columns at once
Behind the scenes, upsampling consists of two steps: first, a subset of the data is collected based on the filtering condition; then, this subset is concatenated back into the original data, as sketched below
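Here is a rough sketch of the same idea written directly against HuggingFace datasets (this only mirrors the behaviour and is not the library’s actual implementation; it assumes dset is a Dataset like the one loaded earlier with load_dataset):
from datasets import concatenate_datasets
# step 1: collect the subset of rows matching the upsampling condition
trend_subset = dset.filter(lambda x: x['Department Name']=='Trend')
# step 2: concatenate that subset back into the original data, roughly doubling the 'Trend' rows
dset_upsampled = concatenate_datasets([dset,trend_subset])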
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
# fraction 0.8 because we only do upsampling on train data, which is 80% of the total data
df['Department Name'].sample(frac=0.8).value_counts()
Department Name
Tops 8379
Dresses 5044
Bottoms 3037
Intimate 1396
Jackets 831
Trend 92
Name: count, dtype: int64
df['Department Name'].sample(frac=0.8).value_counts(normalize=True)
Department Name
Tops 0.446876
Dresses 0.269372
Bottoms 0.159823
Intimate 0.073601
Jackets 0.044736
Trend 0.005592
Name: proportion, dtype: float64
Let’s say I want to upsample ‘Trend’ by a factor of 2 (i.e. double the amount of ‘Trend’ data)
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 upsampling_list=[('Department Name',lambda x: x=='Trend')],
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Upsampling data --------------------
----- Do <lambda> on Department Name -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts()
Tops 8037
Dresses 4916
Bottoms 2930
Intimate 1321
Jackets 802
Trend 188
Name: count, dtype: int64
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts(normalize=True)
Tops 0.441739
Dresses 0.270199
Bottoms 0.161042
Intimate 0.072606
Jackets 0.044080
Trend 0.010333
Name: proportion, dtype: float64
The percentage of ‘Trend’ data in the train set has approximately doubled (note that we filter out some NaN text values, so the result is not exactly doubled)
pd.Series(tdc.main_ddict['validation']['Department Name']).value_counts(normalize=True)
Tops 0.444101
Dresses 0.271542
Bottoms 0.161732
Intimate 0.073133
Jackets 0.044189
Trend 0.005303
Name: proportion, dtype: float64
Since upsampling is applied only to the train set, the label distribution of the validation set remains the same
Similarly, you can triple the amount of ‘Trend’ by repeating the procedure twice. In the following example, I will triple ‘Trend’ and double ‘Jackets’
tdc = TextDataController.from_df(df,
                                 main_text='Review Text',
                                 label_names='Department Name',
                                 sup_types='classification',
                                 filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                             },
                                 val_ratio=0.2,
                                 stratify_cols='Department Name',
                                 upsampling_list=[('Department Name',lambda x: x=='Trend'),
                                                  ('Department Name',lambda x: x=='Trend'),
                                                  ('Department Name',lambda x: x=='Jackets')
                                                 ],
                                 # This can be simplified as
                                 # upsampling_list=[('Department Name',lambda x: x=='Trend' or x=='Jackets'),
                                 #                  ('Department Name',lambda x: x=='Trend')],
                                 seed=42
                                )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 3810
Review Text 845
Division Name 14
Department Name 14
Class Name 14
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 21 rows
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio, with stratifying
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Upsampling data --------------------
----- Do <lambda> on Department Name -----
----- Do <lambda> on Department Name -----
----- Do <lambda> on Department Name -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
pd.Series(tdc.main_ddict['train']['Department Name']).value_counts()
Tops 8037
Dresses 4916
Bottoms 2930
Jackets 1604
Intimate 1321
Trend 282
Name: count, dtype: int64
A word of warning: upsampling is a slow procedure, as it requires multiple dataset concatenations.
10. Content Augmentation
Similar to Content Transformation, Content Augmentation allows you to alter the text content in your dataset. You also need to provide a function that accepts a single string and returns a new, processed string. Unlike Content Transformation, which is applied to ALL data, Content Augmentation is applied only to your TRAINING data
One popular library for data augmentation is nlpaug. We will demonstrate how to integrate its augmentation functions into our TextDataController
import nlpaug.augmenter.char as nac
_tmp = "I like my clothes loose fitting but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but was too big overall"
def nlp_aug(x,aug=None):
    results = aug.augment(x)
    if not isinstance(x,list): return results[0]
    return results
Augmentation by replacing characters with nearby ones on the keyboard
aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug,aug=aug)
nearby_aug_func(_tmp)
"I liMe my c;othes loose fitting but even for me this ran large, i am 5 ' 7 134b and medium fit in the shoulders but was too big overa:l"
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_augmentations=nearby_aug_func,
                         seed=42,
                         verbose=True
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Text Augmentation --------------------
----- nlp_aug -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:5]
["This sweater is beautiful, but is definitely more for looks hhan warmth. it ' s very soft, but vdry thin. i prefer the way it looks open rather than buttoned. i got the moss green color on sale, and i am glad i diFn ' t pay full price for it - - it ' s lovdly, but certainly not wo%th $ 88.",
"I ' m a curCy person, so my review might not be suited to everyone. my standard size in retailer tops is xl, and it is the same for this blouse. - overall: overaol gorgeo8s, wWll made blouse but i wish there was less fabric involved and the burnt out design didn ' t make a horizontal stripe across the back and biceps. this blokse just might not work out as well if you are a full figured person. - pros: g(rgeous blousf high quality unique - cons: i wish the burnt out design didj ' t make a hor",
'This blouse is wonderful. i juwt got and wore the wine Solored blouse today. i received so many compliments. i love it and with the sale price it is so w(rth it.',
'When i saw this, i ordered i<medistely thinking it was similar to the popular colorblocked stripe sweater from last yea5. the kgit is sfretchy and textured and fee:s like great quality (would wash w$ll ), but it \' s pretty lightweight. the fit is huge. .. could easily size Eown. i \' m 5 \' 7 " 128 # and found the small to be loose everywhere, including the arms. the length was at my knees, and the stripe fell awkwardly across my chest. no idea what i \' d wear this with ev@n if it fit better. sadly, it \' s goinR',
"This dress is a zillion times cuter in real life. it ' s ver% detro - swingy and girlish - it reminds me of something mia farrow would ' ve worn in her rosemary ' s baby era. i havF the black version and i ' ve paired mine with tall black gladiator Eandals for a more sIltry nighttime lo9k and also flip flops for beachy summer days. i think it ' s a total steal at the sale price."]
Again, since this is Content Augmentation, the validation set is unmodified.
tdc.main_ddict['validation']['Review Text'][:5]
['Such a fun jacket! great to wear in the spring or to the office as an alternative to a formal blazer. very comfortable!',
'I thought this shirt was really feminine and elegant. only downsides is some of the punched out holes had fabric still attached which you have cut off with scissors- otherwise the shirt will snag. and the second issue of bigger importance are the low armholes. lots of bra showing- not really sure how to get around that so i always wear it with a cardigan. but it would be nice not to have to. \r\nother than that it looks nice and pairs nicely with almost anything.',
'This top has a bit of a retro flare but so adorable on. looks really cute with a pair of faded boot cut jeans.',
'I first spotted this on an retailer employee, she paired it with a peasant top & wore it open w/jeans & boots- so darn cute. love how this peice transitions from summer to fall. i\'m 5\'4" so i had to order the small petite which is perfect. note that this dress is very long! it\'s just a must have garment. the colors/ print are just beautiful.',
"This is my new favorite dress! my only complaint is the slip is too small and the dress cannot be worn without it. i can't order a size up as the dress would then be huge. not sure what the solution is but the dress itself is stunning."]
You can even apply Content Augmentation stochastically, by adding a random condition in your augmentation function
# def nlp_aug_stochastic(x,aug=None,p=0.5):
#     results = aug.augment(x)
#     if not isinstance(x,list): return results[0] if random.random()<p else x
#     return [a if random.random()<p else b for a,b in zip(results,x)]

def nlp_aug_stochastic(x,aug=None,p=0.5):
    if not isinstance(x,list):
        if random.random()<p: return aug.augment(x)[0]
        return x
    news=[]
    originals=[]
    for _x in x:
        if random.random()<p: news.append(_x)
        else: originals.append(_x)
    # only perform augmentation when needed
    if len(news): news = aug.augment(news)
    return news+originals

aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.3) # nearby augmentation only applies 30% of the time, with p=0.3
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_augmentations=nearby_aug_func,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:10]
["This sweater is beautiful, but is definitely more for pooks than warmth. it ' s very soBt, but very thin. i prefer the way it looks opeb rzther than buttoned. i got the moss green color on sale, and i am glad i didn ' t pay full price for it - - it ' s lovely, but certain/y not worth $ 88.",
"I'm a curvy person, so my review might not be suited to everyone. my standard size in retailer tops is xl, and it is the same for this blouse.\r\n-\r\noverall:\r\noverall gorgeous, well made blouse but i wish there was less fabric involved and the burnt out design didn't make a horizontal stripe across the back and biceps. this blouse just might not work out as well if you are a full figured person.\r\n-\r\npros:\r\ngorgeous blouse\r\nhigh quality\r\nunique\r\n-\r\ncons:\r\ni wish the burnt out design didn't make a hor",
'This blouse is wonderful. i just got and wor$ the wiJe colored blouse today. i received so many compliments. i love it and with the sale priSe it is so worth it.',
'When i saw this, i ordered immediately th(nking it was similar to the popular volorGlocked stripe sweater from last year. the knit is stretchy and textured and fe#ls like greaR quality (would wash well ), but it \' s pretty lightweight. the fit is huge. .. couPd easily size d8wn. i \' m 5 \' 7 " 128 # and found the small to be loose eveGywhere, including the arms. the length was at my knees, and the stripe fell awkwardly across my chest. no idea wtat i \' d wear tyis with even if it fit better. sadly, it \' s going',
"This dress is a zillion times cuter in real life. it's very retro-swingy and girlish- it reminds me of something mia farrow would've worn in her rosemary's baby era. i have the black version and i've paired mine with tall black gladiator sandals for a more sultry nighttime look and also flip flops for beachy summer days. i think it's a total steal at the sale price.",
"This top is so soft and with a henley neck opening and longer ribbed shirttail hems, it not only feels heavenly against the skin but it gives off a casual chic vibe. it is also great for layering under shorter sweaters and sweatshirts to give my staples a little oomph. it is a bit sheer so cami is a must. i am also not sure how well it will hold up after washings, especially since it's priced quite high. i love it so much that i will most probably end up keeping it it is true to size. i ordered",
"This is my first lair of ag and i loGe them so far. they are not cutfed as shown in the picture. they are long so i had to get them altered (i ' m 5 ' ' 5 ). the color is a rich blue and Ghey have a nice stretch. i haven ' t worn tNem all day yet to see if they keep their shape. usuZlly a 28 or 29 and went with the 28 on these. got them on 20 perc off salf so very happy!",
'I liked this coat but my family said it looked too much like something hilary clinton would wear. i am 54 and i think it made me look a bit dowdy since it runs a bit big.',
'I saw a photographer wearing this at a wedding i went to in october. i absolutely fell in love. it is beautiful. i can\'t wait to wear it for the holidays! i got the small petite and i am 5\'2", 125 lbs. fit great. enjoy!',
"This dress was adorable & fit great! regrettably, i had to return it since it wasn't lined."]
One of the more advanced augmentations is the “Contextual Word Embeddings Augmenter” (code example: https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb), where you can insert/substitute words using language models such as BERT, RoBERTa …
import nlpaug.augmenter.word as naw
aug = naw.ContextualWordEmbsAug(model_path='roberta-base',
                                device='cuda:0', # if you don't have gpu, change to 'cpu'
                                action="substitute",
                                top_k=10,
                                aug_p=0.07)
/home/quan/anaconda3/envs/nlp_dev/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
contextual_aug_func = partial(nlp_aug,aug=aug)
= "I like my clothes loose fitting but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but was too big overall" _tmp
contextual_aug_func(_tmp)
"I kept my clothes slim fitting but even for me this ran large, i am 5'7 134b and medium fit in upper shoulders but was too big overall"
contextual_aug_func([_tmp for i in range(7)])
["I like my clothes big enough but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but felt too big overall",
"I like my clothes loose fitting but even for me this ran large, i am 5'7 134b and I fit in the back but still too big overall",
"I like my big loose fitting but even for me this ran large, i stand 5'7 134b and medium light in the shoulders but was too big overall",
"I liked its own loose fitting but even for me this ran large, i am 5'7 134b and medium fit in the shoulders but was too big overall",
"I like my clothes loose fitting but had given me this ran large, i am 5'7 134b is medium fit in the shoulders but was too big overall",
"I made my clothes loose fitting but even for me this ran large, i am 5'7 134b and barely fit over the shoulders but was too big overall",
"I like my clothes loose fitting but honestly for me this ran large, i am 5'7 134b and medium fit in all shoulders it was too big overall"]
For this type of augmentation, it’s wise to use a GPU to minimize processing time. You also don’t want all of your text to be augmented, so let’s reuse the stochastic augmentation.
contextual_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.3)
# add these 2 instance variables to your gpu augmentation
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         sup_types='classification',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         content_augmentations=contextual_aug_func,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Label Encoding -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 3, which is 0.02% of training set
Filtering leaked data out of training set...
Done
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict['train']['Review Text'][:10]
['This dress makes me so sad...the textured stretchy fabric, color, length, and overall swingy fit are spot on. as another reviewer noted though, the armholes and neck totally ruin the dress. the neck is tiny, which i could have gotten over once it was on, but the arm holes were just awful - too big all around (too tall of a cut plus too wide of a cut). basically, you could see my bra plus from the front there was unflattering exposure near the armholes. it could have been so good, but alas, but t',
"This top is very flattering, the fabric flows and moves. it fits perfectly (slim cut), but hides tummy bulges and other imperfections. and it's slimming too. can be dressed up or down, goes with everything. i ended up buying all three colors, and if there were more, i would buy more!",
'This blouse is wonderful. i just got and wore the wine colored blouse today. i received so many compliments. i love it and with the sale price it is so worth it.',
'This top is very versatile. i wore it out to dinner with skinny jeans on a friday night, but it can easily transition to a saturday afternoon stroll around town top.',
'This top is so soft and luxuriously comfy! i love wearing it around the house, haven\'t really "dressed" it up yet with jeans or jewelry. it runs slightly big, but if you like the oversized look, this is definitely perfect.',
"I was in love with this shirt from the moment i put it on. it is of high fit, with layers to ensure the top isn't sheer. the embroidery is incredibly pretty and the top looks way less grandma in it. i ordered the top xxs and it fits perfectly. i really appreciate that the underarm holes are just the right size and cant show off any of my bra, which sometimes happens with small tops. i can wear it with jeans and boots or with a pencil skirt and heels but is looks great with both outfits. o",
"I read the other review and from the picture it looked as though it may be a little tight, so i ordered up to a large. the medium would have fit, but since i'm in ur mid-40's i felt more comfortable with large. but if people are trim and young or young at heart your usual medium will be fine. live the material and the navy makes it classy and rich looking. could be dressed up or down. have worn it to a cocktail party fundraiser with white crop sleeves and received many reviews. i'm always challenge",
"This skirt is so ladylike and light as air! the cherry red color is beautiful - just as pictured. i can imagine so many opportunities to wear this skirt. with a sweater and tights now, and maybe a striped tee and sandals in the spring.\ni'm sure i'll have this gorgeous classic in my wardrobe for a very long time to come!",
'Very pretty dress, perfect style for my build, bigger busted, muffin top. the material/pattern is really pretty.',
"I purchased this top in the navy. the picture gives the top looking like an interesting blue with some purple in it, but in person the top is just... navy. the lace and fabric are soft. it fits true to me; i almost always wear a small and the small fit me. the wasn't quite my style, but it's a pretty top it will be great for spring and summer."]
And finally, similar to Content Transformation, you can chain multiple augmentation functions together by providing a list of those functions in content_augmentations, as sketched below
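For example, a minimal sketch reusing the two augmentation functions defined earlier in this section (nearby_aug_func and contextual_aug_func); presumably they are applied in the listed order, as with content_transformations:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         # both augmentations are applied to the training text only
                         content_augmentations=[nearby_aug_func,contextual_aug_func],
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)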
11. Save and Load TextDataController
TextDataController.save_as_pickles
TextDataController.save_as_pickles (fname, parent='pickle_files', drop_attributes=False)
Type | Default | Details | |
---|---|---|---|
fname | Name of the pickle file | ||
parent | str | pickle_files | Parent folder |
drop_attributes | bool | False | Whether to drop large-size attributes |
TextDataController.from_pickle
TextDataController.from_pickle (fname, parent='pickle_files')
Type | Default | Details | |
---|---|---|---|
fname | Name of the pickle file | ||
parent | str | pickle_files | Parent folder |
A TextDataController object can be saved and loaded with ease. This is especially useful after text processing and/or tokenization have been done
from datasets import disable_caching
# disable huggingface caching to see data size
disable_caching()
from underthesea import text_normalize
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
def nlp_aug_stochastic(x,aug=None,p=0.5):
    if not isinstance(x,list):
        if random.random()<p: return aug.augment(x)[0]
        return x
    news=[]
    originals=[]
    for _x in x:
        if random.random()<p: news.append(_x)
        else: originals.append(_x)
    # only perform augmentation when needed
    if len(news): news = aug.augment(news)
    return news+originals

aug2 = naw.ContextualWordEmbsAug(model_path='roberta-base',
                                 device='cuda:0', # if you don't have gpu, change to 'cpu'
                                 action="substitute",
                                 top_k=10,
                                 aug_p=0.07)

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug2,p=0.1)
# add these 2 instance variables to your gpu augmentation
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations=contextual_aug_func,
                         process_metas=True,
                         seed=42
                        )
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)
-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 18102
})
validation: Dataset({
features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
tdc.save_as_pickles('my_tdc')
Let’s check the file size
file_stats = os.stat(Path('pickle_files/my_tdc.pkl'))
print(f'File Size in MegaBytes is {round(file_stats.st_size / (1024 * 1024), 3)}')
File Size in MegaBytes is 479.025
Load back our object
tdc2 = TextDataController.from_pickle('my_tdc')
You can still access all its attributes, data, preprocessings, transformation/augmentation …
tdc2.main_ddict
DatasetDict({
train: Dataset({
features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 18102
})
validation: Dataset({
features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
num_rows: 4526
})
})
for i,v in enumerate(tdc2.main_ddict['train']):
    if i==3: break
    print(f"Text: {v['Review Text']}\nLabel: {v['Department Name']} => {v['label']}")
    print('-'*10)
Text: general petite . meh . this tunic is way over priced for the style and quality . it fit comfortably ( runs a size larger ) but it's not really flattering , it jut kind of hangs there looking ok . it is a little too deep of a v cut for a work top as well . this top does not support the price at all . it felt like something i could find at department store for way less . i will be returning it .
Label: Tops => 4
----------
Text: general . awesome buy ! . i am so happy i took a chance on this jumpsuit ! i am post-baby ( six weeks ) and although i intend on slimming down more i would say that it is flattering even at my current size . and it will only get better ! the quality and color are great !
Label: Bottoms => 0
----------
Text: general petite . snap neck pullover . i love this top . i ordered it in a large thinking it would be a tight rib but it is not so i reordered it in a small . i am 5 ' 7 " 145 lbs 34 g chest . the small fits perfectly and probably could have taken an xs . it is stretchy but fits wonderfully . i bought the black . i love how the neck snaps and adds a little pizzazz to a simple black turtle neck . i'm wearing it today with straight leg jeans and my leopard print ballet flats . i feel like audrey hepburn ! ! i will not be dry cleaning it . i will wash
Label: Bottoms => 0
----------
tdc2.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
tdc2.filter_dict,tdc2.content_tfms,tdc2.aug_tfms
({'Review Text': <function __main__.<lambda>(x)>,
'Department Name': <function __main__.<lambda>(x)>},
[<function underthesea.pipeline.text_normalize.text_normalize(text, tokenizer='underthesea')>,
<method 'lower' of 'str' objects>],
[functools.partial(<function nlp_aug_stochastic>, aug=<nlpaug.augmenter.word.context_word_embs.ContextualWordEmbsAug object>, p=0.1)])
If you don’t want to store the HuggingFace DatasetDict in your TextDataController
, or the augmentation functions (typically when you already have a trained model, and you only use TextDataController
to preprocess the test set), you can remove it in the save_as_pickles
step
tdc.save_as_pickles('my_lightweight_tdc',drop_attributes=True)
Let’s check the file size
file_stats = os.stat(Path('pickle_files/my_lightweight_tdc.pkl'))
print(f'File Size in MegaBytes is {round(file_stats.st_size / (1024 * 1024), 3)}')
File Size in MegaBytes is 1.911
Load it back
tdc3 = TextDataController.from_pickle('my_lightweight_tdc')
We will use this object to demonstrate the Test Set Construction in the next section
Construct a Test Dataset
TextDataController.prepare_test_dataset
TextDataController.prepare_test_dataset (test_dset, do_filtering=False)
Type | Default | Details | |
---|---|---|---|
test_dset | The HuggingFace Dataset as Test set | ||
do_filtering | bool | False | whether to perform data filtering on this test set |
TextDataController.prepare_test_dataset_from_csv
TextDataController.prepare_test_dataset_from_csv (file_path, do_filtering=False)
Type | Default | Details | |
---|---|---|---|
file_path | path to csv file | ||
do_filtering | bool | False | whether to perform data filtering on this test set |
TextDataController.prepare_test_dataset_from_df
TextDataController.prepare_test_dataset_from_df (df, validate=True, do_filtering=False)
Type | Default | Details | |
---|---|---|---|
df | Pandas Dataframe | ||
validate | bool | True | whether to perform input data validation |
do_filtering | bool | False | whether to perform data filtering on this test set |
TextDataController.prepare_test_dataset_from_raws
TextDataController.prepare_test_dataset_from_raws (content)
Details | |
---|---|
content | Either a single sentence, list of sentence or a dictionary with keys are metadata columns and values are list |
Let’s say you have done your preprocessing and tokenization on your training set, and have a nicely trained model, ready to do inference on new data. Here is how you can use TextDataController to apply all the necessary preprocessings to your new data
We will reuse the lightweight tdc object we created in the previous section (since we don’t really need all the training data just to process new data). Also, we will take a small sample of our training data and pretend it is our test data
tdc = TextDataController.from_pickle('my_lightweight_tdc')
Let’s predict a few raw texts
If we only provide a raw text as follows
tdc.prepare_test_dataset_from_raws('This shirt is so comfortable I love it!')
You will encounter this error:
ValueError: There is/are metadatas in the preprocessing step. Please include a dictionary including these keys for
metadatas: ['Title', 'Division Name'], and texture content: Review Text
Since our preprocessing includes some metadatas, you have to provide a dictionary as follows:
results = tdc.prepare_test_dataset_from_raws({'Review Text': 'This shirt is so comfortable I love it!',
                                              'Title': 'Great shirt',
                                              'Division Name': 'general'
                                             })
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Tokenization --------------------
Done
print(results[0])
{'Review Text': 'general . great shirt . this shirt is so comfortable i love it !', 'Title': 'great shirt', 'Division Name': 'general', 'input_ids': [0, 15841, 479, 372, 6399, 479, 42, 6399, 16, 98, 3473, 939, 657, 24, 27785, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
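As a quick sanity check, you can decode the produced input_ids back to text with the same tokenizer (a minimal sketch, assuming the tokenizer variable used throughout this page):
print(tokenizer.decode(results[0]['input_ids']))
# should show the processed text wrapped in the tokenizer's special tokens, e.g. <s> ... </s>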
Let’s make predictions from a pandas DataFrame
df_test = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig').sample(frac=0.2,random_state=1)
# drop NaN values in the label column
df_test = df_test[~df_test['Department Name'].isna()].reset_index(drop=True)
df_test.shape
(4692, 10)
There are a few things to pay attention to when constructing your new test set using TextDataController:
- Only a few processings will be applied to your test set: metadata concatenation, filtering (can be omitted), content transformation, and tokenization. Therefore, all columns required to perform these processings must exist in your test dataset
- You can exclude the label column (e.g. Department Name in this example), since it’s a test set
To view all required columns, access the attribute cols_to_keep
(you can omit the last column, which is the name of the label column)
tdc.cols_to_keep
['Review Text', 'Title', 'Division Name', 'Department Name']
This test dataset might have some NaN values in the text field (Review Text), thus we will turn on the filtering option to get rid of these NaNs, as this is what we did with the training set. If your test dataset doesn’t need any filtering, turn off this option
test_dset = tdc.prepare_test_dataset_from_df(df_test,validate=True,do_filtering=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title 758
Review Text 164
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 2 rows
-------------------- Start Test Set Transformation --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Tokenization --------------------
Done
test_dset
Dataset({
features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'input_ids', 'attention_mask'],
num_rows: 4528
})
for i in range(3):
    print(f"Text: {test_dset['Review Text'][i]}")
    print(f"Input_ids: {test_dset['input_ids'][i]}")
    print('-'*10)
Text: general . perfect for work and play . this shirt works for both going out and going to work , and i can wear it with everything . fits perfect , tucked and untucked , tied and untied . i love it .
Input_ids: [0, 15841, 479, 1969, 13, 173, 8, 310, 479, 42, 6399, 1364, 13, 258, 164, 66, 8, 164, 7, 173, 2156, 8, 939, 64, 3568, 24, 19, 960, 479, 10698, 1969, 2156, 21222, 8, 7587, 23289, 2156, 3016, 8, 7587, 2550, 479, 939, 657, 24, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
Text: general petite . . i don't know why i had the opposite problem most reviewers had with these ..... i tried on the regular length in the store and found that they were just a bit too short with heels . ( i'm 5 ' 5 ) . i had them ordered in a petite and when they came , they were too short with flats ! maybe it's the way i like to wear them , i like my flare jeans to barely skim the ground . i just exchanged them for regular length and will wear them with a small wedge shoe . aside from the length issues , these are super cute
Input_ids: [0, 15841, 4716, 1459, 479, 479, 939, 218, 75, 216, 596, 939, 56, 5, 5483, 936, 144, 34910, 56, 19, 209, 29942, 734, 939, 1381, 15, 5, 1675, 5933, 11, 5, 1400, 8, 303, 14, 51, 58, 95, 10, 828, 350, 765, 19, 8872, 479, 36, 939, 437, 195, 128, 195, 4839, 479, 939, 56, 106, 2740, 11, 10, 4716, 1459, 8, 77, 51, 376, 2156, 51, 58, 350, 765, 19, 20250, 27785, 2085, 24, 18, 5, 169, 939, 101, 7, 3568, 106, 2156, 939, 101, 127, 24186, 10844, 7, 6254, 28772, 5, 1255, 479, 939, 95, 11024, 106, 13, 1675, 5933, 8, 40, 3568, 106, 19, 10, 650, 27288, 12604, 479, 4364, 31, 5, 5933, 743, 2156, 209, 32, 2422, 11962, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
Text: general petite . great pants . thes e cords are great--lightweight for fl winters , and the bootcut flare bottom is super cute with ballet flats or booties . i am 5 ' 10 " and typically a size 8 ; the size 29 fit perfectly . they have a little stretch to them , which is great . very flattering--wish i could order in more colors ! !
Input_ids: [0, 15841, 4716, 1459, 479, 372, 9304, 479, 5, 29, 364, 37687, 32, 372, 5579, 6991, 4301, 13, 2342, 31000, 2156, 8, 5, 9759, 8267, 24186, 2576, 16, 2422, 11962, 19, 22573, 20250, 50, 9759, 918, 479, 939, 524, 195, 128, 158, 22, 8, 3700, 10, 1836, 290, 25606, 5, 1836, 1132, 2564, 6683, 479, 51, 33, 10, 410, 4140, 7, 106, 2156, 61, 16, 372, 479, 182, 34203, 5579, 605, 1173, 939, 115, 645, 11, 55, 8089, 27785, 27785, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------