Text Main Streaming

This module contains the main Python class for the streaming version of TextDataController.
import pandas as pd
import numpy as np
from functools import partial
from pathlib import Path
from datasets import load_dataset, Dataset
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from importlib.machinery import SourceFileLoader
import os
import random

Class TextDataControllerStreaming


source

TextDataControllerStreaming

 TextDataControllerStreaming (inp, main_text:str, label_names=[],
                              sup_types=[], class_names_predefined=[],
                              filter_dict={}, label_tfm_dict={},
                              metadatas=[], process_metas=True,
                              metas_sep='.', content_transformations=[],
                              content_augmentations=[], seed=None,
                              batch_size=1024, num_proc=1,
                              cols_to_keep=None, verbose=True)

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
inp HuggingFace Dataset or DatasetDict
main_text str Name of the main text column
label_names list [] Names of the label (dependent variable) columns
sup_types list [] Type of supervised learning for each label name (‘classification’ or ‘regression’)
class_names_predefined list [] List of names associated with the labels (same index order)
filter_dict dict {} A dictionary: {feature: filtering_function_based_on_the_feature}
label_tfm_dict dict {} A dictionary: {label_name: transform_function_for_that_label}
metadatas list [] Names of the metadata columns
process_metas bool True Whether to do simple text processing on the chosen metadatas
metas_sep str . Separator for concatenating multiple metadata columns
content_transformations list [] A list of text transformations
content_augmentations list [] A list of text augmentations
seed NoneType None Random seed
batch_size int 1024 CPU batch size
num_proc int 1 Number of processes for multiprocessing. This is applied to the non-streamed validation set
cols_to_keep NoneType None Columns to keep after all processing
verbose bool True Whether to print processing information

source

TextDataControllerStreaming.process_and_tokenize

 TextDataControllerStreaming.process_and_tokenize (tokenizer,
                                                   max_length=None,
                                                   tok_num_proc=None,
                                                   line_by_line=False)
Type Default Details
tokenizer Tokenizer (preferably from HuggingFace)
max_length NoneType None Pad to the model’s allowed max length (default is max_sequence_length)
tok_num_proc NoneType None Number of processes for tokenization
line_by_line bool False Whether to process and tokenize each sentence separately. Faster, but no padding is applied

1. Streaming Capability

The majority of the streaming capability of TextDataControllerStreaming is adapted from HuggingFace’s streaming functionality.

Streaming lets you work with data without storing it on your hard drive. This is especially helpful when the dataset size exceeds the amount of disk space on your machine.

Here are a few things to be aware of when using TextDataControllerStreaming’s streaming functionality (versus TextDataController):

  • The list of label names must be available beforehand (except for regression labels)
  • To avoid out-of-memory errors, reduce the batch_size argument
  • There will NOT be any validation split functionality. If you want a validation set, provide a validation split in your HuggingFace DatasetDict beforehand
  • There is no upsampling, and no shuffling of the training set

To stream, you must provide a streamed HuggingFace dataset.
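
For reference, here is a minimal sketch (using standard datasets APIs) of two ways to obtain such a streamed dataset; the file name reuses this tutorial’s sample CSV:

from datasets import load_dataset

# Option 1: stream directly from the source files; nothing is materialized on disk
streamed_dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],
                             split='train',streaming=True)

# Option 2: load a regular Dataset first, then convert it to an IterableDataset
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
streamed_dset = dset.to_iterable_dataset()

The examples below use the second option, so that the validation split can stay as a regular (non-streamed) Dataset.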

Let’s repeat a few examples from this tutorial, but with a streaming dataset.

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
from underthesea import text_normalize
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
def nlp_aug_stochastic(x,aug=None,p=0.5):
    if not isinstance(x,list): 
        if random.random()<p: return aug.augment(x)[0]
        return x
    news=[]
    originals=[]
    for _x in x:
        if random.random()<p: news.append(_x)
        else: originals.append(_x)
    # only perform augmentation when needed
    if len(news): news = aug.augment(news)
    return news+originals
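
As a quick sanity check of this helper, here is a sketch using a keyboard-typo augmenter (nac.KeyboardAug is just an illustrative choice; any nlpaug augmenter works):

aug_demo = nac.KeyboardAug()
print(nlp_aug_stochastic('this is a test',aug=aug_demo,p=1.0)) # p=1.0: always augmented
print(nlp_aug_stochastic(['first text','second text'],aug=aug_demo,p=0.5)) # roughly half augmented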

a) Filtering + Metadatas + Label Transformation + Content Transformation (for Single Head)

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']
tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              },
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  metadatas=['Title','Division Name'],
                                  content_transformations=[text_normalize,str.lower],
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42
                                 )
tdc.process_and_tokenize(tokenizer,max_length=256)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
%%time
for i,v in enumerate(tdc.main_ddict['train']):
    if i%100==0:
        print(i)
    if i==1000-1:
        break
    pass
0
100
200
300
400
500
600
700
800
900
CPU times: user 1.85 s, sys: 857 ms, total: 2.7 s
Wall time: 2.69 s
for i,v in enumerate(tdc.main_ddict['train']):
    if i==5:break
    print(f"Text: {v['Review Text']}\nLabel: {v['Department Name']} => {v['label']}")
    print('-'*10)
Text: general petite . beautiful top , worth the necessary tailoring . the beautiful bold print drew me to this top and it did not disappoint upon receipt . however , the bottom ruffle belled so far out on each side that it was laughable ! the actual fit is nothing like the picture ; clearly the model's arms are placed in front of all the extra fabric to hold the ruffle back . however , the fabric is beautiful , the fit was perfect ( size 2 , 5 ' 4 " , 106 lbs . ) , the quality is great and i love the print so i decided to take it to my tailor to " sew away " the " wings " on both si
Label: Tops => 4
----------
Text: general . not as short on me ( petite ) . i ordered the xxs p as this dress is not a fitted dress , and that was the right size for me . only thing is the length is a bit linger still 9 lower on calf for me ) , the straps are almost tight , so i would say the dress is a reversed taper shape . color is beautiful , i ordered green as the other color ( plum ) doesn't have petite available . green is rich , and classy , the fabric is surprisingly soft . i love the little details in the velvet . definitely need a strapless bra for this one . 115 lbsm 30 d
Label: Dresses => 1
----------
Text: general . perfect .... for two wears . ok ladies .... you need to know that this type of fabric is the one that will get holes ( i bought the white one ) . it is super thin and lovely , but i was only able to get two wears out of it . i did wash it and it maintained it's size because i restretched it while wet then hung to dry . i was super disappointed about the wear but appreciated being able to return it without question at my local retailer .
Label: Tops => 4
----------
Text: initmates . . i love this dress . it is so soft and comfortable , perfect for summer ! ! i wish it came in more colors because i would buy everyone ! !
Label: Intimate => 2
----------
Text: general petite . adorable and excellent quality . this is such a clean and cute printed dress and i knew that i had to try the dress when i first saw it online . after reading other reviews , i sized up . i am normally a 0 or 2 in retailer . i ordered the 2 and it fits nicely and looks great . however , i feel like the 2 buttons at the lowered rib cage area gape slightly . the tie covers it and holds it in place , unless i sit , then it gapes freely . i am a 32 a so not big chested at all , and yet this fit snug in the chest area . i would worry about
Label: Dresses => 1
----------
for i in range(5):
    print(f"Text: {tdc.main_ddict['validation']['Review Text'][i]}")
    print(f"Label: {tdc.main_ddict['validation']['Department Name'][i]} => {tdc.main_ddict['validation']['label'][i]}")
    print('-'*10)
Text: general . soft , feminine and fun pockets ! . i love this tunic . purchased the dark orange in medium ( i am 5 ' 9 and 140 lbs ) . tried the small and almost kept it but i felt seams around my arm pits a tad , so went with the medium and glad i did - this top should be comfortable . feels very fall and perfect for casual get-togethers and running around town . only comment is that it is rayon ... and for me anyway rayon doesn't wash too well - so we shall see how this one fairs .
Label: Tops => 4
----------
Text: general petite . a new staple ! . tried these on out of sheer curiosity -- i've got a long torso & was pleasantly surprised how flattering they are ! they manage to look flowing & sleek without shortening the legs . took a size 6 with my 27 " waist , 37 " hips . it's a bit of a generous fit , especially around the waist , but they're extremely comfortable & have room to tuck tops into . i have the cowled sweater tank in gray & it looks fantastic over these ! couldn't resist getting both the rust and black . perfect for a dressy casual look
Label: Bottoms => 0
----------
Text: general . maybe swing is for me ! . i love swing dresses but they never seem to work out for me . however , lately i have been trying on swing tops like this one and they are super scores ! i love this top ! in my store , they had a rack of test materials where they don't have the full line but they have a look at some online features or clothes that are very new releases . this was on the rack . i knew it wasn't my size but i tried it on anyway and i am absolutely in love . i am waiting for a sale ( as always ) but i am going to get this i
Label: Tops => 4
----------
Text: general . too flare . too small ... too flare ... nice thick fabric . not my favorite pant .
Label: Bottoms => 0
----------
Text: general . love . i love this top it is easy to wear fun and very comfortable . i was thinking about it for weeks and kept coming back to it after i read a review about going up a size i decided to go for it and i am very happy i did ! ! ! my new favorite ! ! !
Label: Tops => 4
----------

Compare to non-streamed version

# redefine streaming data controller with verbose=False
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              },
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  metadatas=['Title','Division Name'],
                                  content_transformations=[text_normalize,str.lower],
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42,
                                  verbose=False
                                 )

tdc.process_and_tokenize(tokenizer,max_length=256,tok_num_proc=1)
from that_nlp_library.text_main import TextDataController
dset2 = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val2 = dset2.train_test_split(test_size=0.1,seed=42)
ddict_with_val2['validation'] = ddict_with_val2['test']
del ddict_with_val2['test']


tdc2 = TextDataController(ddict_with_val2,
                         main_text='Review Text',
                         label_names='Department Name',
                         sup_types='classification',
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         process_metas=True,
                         batch_size=1000,
                         num_proc=4,
                         seed=42,
                         verbose=False
                        )
tdc2.process_and_tokenize(tokenizer,max_length=256,shuffle_trn=False,tok_num_proc=1)
# check whether train sets are the same
assert len(list(tdc.main_ddict['train']))==len(tdc2.main_ddict['train'])
iter1 = iter(tdc.main_ddict['train'])
iter2 = iter(tdc2.main_ddict['train'])
for a,b in zip(iter1,iter2):
    assert a==b
# check whether validation set is the same
assert len(list(tdc.main_ddict['validation']))==len(tdc2.main_ddict['validation'])

iter1 = iter(tdc.main_ddict['validation'])
iter2 = iter(tdc2.main_ddict['validation'])
for a,b in zip(iter1,iter2):
    assert a==b

b) Filtering + Metadatas + Label Transformation + Content Transformation + Content Augmentation (for Multi Head: Classification + Regression + Classification)

aug2 = naw.ContextualWordEmbsAug(model_path='roberta-base', 
                                device='cuda:0', # if you don't have gpu, change to 'cpu'
                                action="substitute",
                                top_k=10,
                               aug_p=0.07)

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug2,p=0.5)
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']
tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names=['Division Name','Rating','Department Name'],
                                  sup_types=['classification','regression','classification'],
                                  class_names_predefined=[['General', 'General Petite', 'Initmates'],
                                                          [], # empty list for regression
                                                          ['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending']],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                               'Department Name': lambda x: x is not None,
                                              },
                                  metadatas=['Title'],
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  content_transformations=[text_normalize,str.lower],
                                  content_augmentations=contextual_aug_func,
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=1,
                                  seed=42
                                 )
tdc.process_and_tokenize(tokenizer,max_length=256)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
for i,v in enumerate(tdc.main_ddict['train']):
    if i==10:break
    print(f"Text: {v['Review Text']}\nLabel: {v['Division Name'],v['Rating'],v['Department Name']} => {v['label']}")
    print('-'*10)
Text: not as short on me ( petite ). i ordered black xxs p as this model is not a fitted dress, and that was the right size for me. only thing is the length is 1 bit linger still 9 lower on calf for me ). the straps are almost tight, so i would say the dress is a reversed taper shape. color = beautiful! i ordered green as the other color ( ) ( doesn't have petite available. green is rich, and classy, the fabric is surprisingly soft. i love the little details in the front. definitely need a strapless bra for this one. 115 < 30 d
Label: ('General Petite', 4.0, 'Tops') => [1.0, 4.0, 4.0]
----------
Text: perfect.... for two wears. ( ladies.... we need to know that this type of fabric is the one that will get holes when i bought the white one ). it is super thin and lovely, but i was only able to get 2 wears out of it. i did wash it and it maintained it's size because i restretched it while wet then hung to dry. i been super disappointed about the wear but appreciated being able to find it without question at my favourite retailer.
Label: ('General', 5.0, 'Dresses') => [0.0, 5.0, 1.0]
----------
Text: . i love this dress. it is in soft is comfortable, perfect for summer!! i wish it came in more colors ; i would buy everyone!!
Label: ('General', 1.0, 'Tops') => [0.0, 1.0, 4.0]
----------
Text: great fit and what. love this flowing a cute top! casual, but can be easily dressed up. great fit.
Label: ('Initmates', 5.0, 'Intimate') => [2.0, 5.0, 2.0]
----------
Text: please bring back more!!!. love this tank! prettier color of yellow in person. love the wide width and length. wish they'd bring back more in lots of colors. only sized up one from my usual size and will have too shorten and length of the strap myself, but it's an amazing alteration and is totally worth it! especially good at sale price. i really hope they make more in the style or lots of colors. great find!
Label: ('General Petite', 4.0, 'Dresses') => [1.0, 4.0, 1.0]
----------
Text: roomy and flows. comfortable and flows well, does run large one order one size smaller fuck
Label: ('General', 5.0, 'Bottoms') => [0.0, 5.0, 0.0]
----------
Text: not as pictured. i just received this dress wadded around in a bag. it is nothing like the colors shown off line. in the middle it appears blue and brown. in front it is teal and a funky color of beigey pink. the top is see though but it does see have a slip under the skirt. will be sending it back and will remind myself why i shouldn't try to buy any dress on line.
Label: ('General Petite', 2.0, 'Tops') => [1.0, 2.0, 4.0]
----------
Text: comfy and pretty. i bought this in red and wear it a lot. it is so soft and comfortable!! i love all the fabric and the fit. the longer sleeve length seems amazing! it is a perfect tee! you can tell from the front there's a lot a fabric and it's supposed to be smooth and drape like that. it fits like in the pics this is true to size. wear a tank underneath just case there's a strong wind.
Label: ('General Petite', 5.0, 'Tops') => [1.0, 5.0, 4.0]
----------
Text: so so pretty. sunday by brooklyn has bee killing it lately with their tops! i could not decide between this and another of theirs, but i am positive i chose this one out first, the design is subtle but very cunning style : the sleeve length matches the rest of the top tier, which gives this top a very finished look. the color is a beautiful jewel-like raspberry red with a bluish undertone. and, the fabric is a nice washable texture. i wore my usual size xs at retailer.
Label: ('Initmates', 5.0, 'Intimate') => [2.0, 5.0, 2.0]
----------
Text: great flares. wearing jeans, fit in to size and look as pictured. i recommended those to meet friend and she bought and liked them as well.
Label: ('General Petite', 5.0, 'Tops') => [1.0, 5.0, 4.0]
----------
for i in range(5):
    print(f"Text: {tdc.main_ddict['validation']['Review Text'][i]}")
    print(f"Label: {tdc.main_ddict['validation']['Division Name'][i],tdc.main_ddict['validation']['Rating'][i],tdc.main_ddict['validation']['Department Name'][i]} => {tdc.main_ddict['validation']['label'][i]}")
    print('-'*10)
Text: soft , feminine and fun pockets ! . i love this tunic . purchased the dark orange in medium ( i am 5 ' 9 and 140 lbs ) . tried the small and almost kept it but i felt seams around my arm pits a tad , so went with the medium and glad i did - this top should be comfortable . feels very fall and perfect for casual get-togethers and running around town . only comment is that it is rayon ... and for me anyway rayon doesn't wash too well - so we shall see how this one fairs .
Label: ('General', 5.0, 'Tops') => [0.0, 5.0, 4.0]
----------
Text: a new staple ! . tried these on out of sheer curiosity -- i've got a long torso & was pleasantly surprised how flattering they are ! they manage to look flowing & sleek without shortening the legs . took a size 6 with my 27 " waist , 37 " hips . it's a bit of a generous fit , especially around the waist , but they're extremely comfortable & have room to tuck tops into . i have the cowled sweater tank in gray & it looks fantastic over these ! couldn't resist getting both the rust and black . perfect for a dressy casual look
Label: ('General Petite', 5.0, 'Bottoms') => [1.0, 5.0, 0.0]
----------
Text: maybe swing is for me ! . i love swing dresses but they never seem to work out for me . however , lately i have been trying on swing tops like this one and they are super scores ! i love this top ! in my store , they had a rack of test materials where they don't have the full line but they have a look at some online features or clothes that are very new releases . this was on the rack . i knew it wasn't my size but i tried it on anyway and i am absolutely in love . i am waiting for a sale ( as always ) but i am going to get this i
Label: ('General', 5.0, 'Tops') => [0.0, 5.0, 4.0]
----------
Text: too flare . too small ... too flare ... nice thick fabric . not my favorite pant .
Label: ('General', 2.0, 'Bottoms') => [0.0, 2.0, 0.0]
----------
Text: love . i love this top it is easy to wear fun and very comfortable . i was thinking about it for weeks and kept coming back to it after i read a review about going up a size i decided to go for it and i am very happy i did ! ! ! my new favorite ! ! !
Label: ('General', 5.0, 'Tops') => [0.0, 5.0, 4.0]
----------

c) Filtering + Metadatas + Content Transformation + Content Augmentation (for Multi Label)

aug2 = naw.ContextualWordEmbsAug(model_path='roberta-base', 
                                device='cuda:0', # if you don't have gpu, change to 'cpu'
                                action="substitute",
                                top_k=10,
                               aug_p=0.07)

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug2,p=0.5)
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
df = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig')
df['Fake Label'] = [np.random.choice(df['Department Name'].unique()[:-1],size=np.random.randint(2,6),replace=False) for _ in range(len(df))]
dset = Dataset.from_pandas(df)
ddict_with_val = dset.train_test_split(test_size=0.1)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']
ddict_with_val
DatasetDict({
    train: IterableDataset({
        features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name', 'Fake Label'],
        n_shards: 1
    })
    validation: Dataset({
        features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name', 'Fake Label'],
        num_rows: 2349
    })
})
tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Fake Label',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                                  filter_dict={'Review Text': lambda x: x is not None},
                                  metadatas=['Title','Division Name'],
                                  content_transformations=[text_normalize,str.lower],
                                  content_augmentations= contextual_aug_func, 
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42
                                 )
tdc.process_and_tokenize(tokenizer,max_length=512)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
for i,v in enumerate(tdc.main_ddict['train']):
    if i==10:break
    print(f"Text: {v['Review Text']}\nLabel: {v['Fake Label']} => {v['label']}")
    print('-'*10)
Text: general. beautiful, stunning, cozy top!. i checked the first review on here and ordered both a small and a medium as i thought this would run small. i have to totally disagree of the reviewer! i find that this top runs true on size or even generous! the sky color is so pretty and this top can be dressed up with many nice jewels and a necklace or it can be comfy casual, i usually wear a small in hh brand and this one was true to fit ( 5 " 2 ", broad shoulders, 120 ml )
Label: ['Jackets', 'Tops', 'Intimate'] => [0, 0, 1, 1, 1, 0]
----------
Text: general. love!. love love love this dress! but, and you are not wearing a slip... you should be like please wear a slip, you can see right through this dress."
Label: ['Bottoms', 'Dresses'] => [1, 1, 0, 0, 0, 0]
----------
Text: general. runs big. i liked the idea on these pair as i've been looking for an updated pair of tuxedo pants. i wear 26 in most of their jeans. i'm not super skinny & find my legs medium ( not too skinny & not too athletic ). i tried these on in xs ( 36 european as marked on them ) & they were big around the waist & hip area. there was so much gap in the back it made them look frumpy! u did however liked the length. the material is nice & heavy which i also loved. sadly enough i didn't work for me though. really wish t
Label: ['Bottoms', 'Tops', 'Trend', 'Dresses'] => [1, 1, 0, 0, 1, 1]
----------
Text: general petite. so flattering, no need for petite. i just try this top on in xs regular even : i generally wear xspetite in retailer and it fit great ( 34 aa - 35 - 34 ). i think it's flattering on any short narrow arms with the halter neck, fitted waist and peplum? i would guess it would be flattering on many body types ; it highlights shoulders beautifully. it was very hard to put my head through the small, not-too-stretchy opening so you may want to try it on without makeup. i knocked off one star, the neck band wasn't symmetrical,
Label: ['Dresses', 'Bottoms', 'Trend', 'Tops', 'Intimate'] => [1, 1, 1, 0, 1, 1]
----------
Text: general. a new wardrobe staple!. love this jacket! i purchased both the green and gray versions and will have them constantly this winter! although some reviews were critical of the length of the yarn, i do not find this to be a problem... so happy with these products! for reference to size, i bought the m and am 5'11... the sweater hits me exactly where it should on the model.
Label: ['Bottoms', 'Tops'] => [1, 0, 0, 0, 1, 0]
----------
Text: general. unique and adorable. the photos don't do the top justice. the split back is very unique and beautiful. i typically take a size 8 in tops, however ordered a 15 since a reviewer suggested it was narrow in the shoulders. the 12 will fit perfectly, but the body is way too big that i was swimming it in. i like it enough that i'm going to visit the tailor to take in the fit. with the sale price, i is worth tailoring.
Label: ['Bottoms', 'Intimate', 'Dresses', 'Tops'] => [1, 1, 1, 0, 1, 0]
----------
Text: general. great slouchy sweater, perfect color. this sweater has the perfect slouchy coat for fall. i wish I were a little bit softer and heavier - the fabric is pretty lightweight - yet it layers beautifully and will be a highlight for me this season.
Label: ['Intimate', 'Bottoms', 'Jackets', 'Trend', 'Tops'] => [1, 0, 1, 1, 1, 1]
----------
Text: general. feminine plus makeup. this top is gorgeous and versatile. i wear me with jeans and dress it up with a skirt. so happy to have this in my wardrobe.
Label: ['Jackets', 'Bottoms'] => [1, 0, 0, 1, 0, 0]
----------
Text: general. great pants that don't get baggy.. e they are shit - please make them in other colors besides black and navy!
Label: ['Trend', 'Dresses', 'Intimate'] => [0, 1, 1, 0, 0, 1]
----------
Text: general petite. a very blouse-like dress - very simple. this dress is very cute on. very flouncy. i know it looks like a gingham collar, but it's a more like a silk dress with a gingham collar. the one issue i had was that it stood out at the back where you tied it on any size ( large or small ), and i am larger in top, so i don't know what would happen if it with a small chest. i wouldn't be minded if it did the same thing on the front, but it was just the back.
Label: ['Trend', 'Jackets', 'Tops', 'Bottoms'] => [1, 0, 0, 1, 1, 1]
----------
for i in range(5):
    print(f"Text: {tdc.main_ddict['validation']['Review Text'][i]}")
    print(f"Label: {tdc.main_ddict['validation']['Fake Label'][i]} => {tdc.main_ddict['validation']['label'][i]}")
    print('-'*10)
Text: general petite . . this top has great detailing and color . does run a little big , but adds to the style and movement of the tank . the stitching around the bottom makes it cute for layering .
Label: ['Dresses', 'Intimate', 'Trend', 'Tops', 'Bottoms'] => [1, 1, 1, 0, 1, 1]
----------
Text: general . . i love this top . i got it on sale and am so glad that i did . it is a short too but still super flattering . it isn't too boxy on me .
Label: ['Intimate', 'Trend', 'Jackets', 'Dresses', 'Tops'] => [0, 1, 1, 1, 1, 1]
----------
Text: general . beautiful idea ... . i ordered my normal size in this dress . i am 6 foot tall , but the regular sizes were too large and too long ( mid-calf ) . i returned the dress for a size smaller in petite for a more flattering hemline . the dress is lovely , especially on the models in the pictures , but didn't quite work out for me . also , it feels like there are hundreds of closure hooks that make putting on / taking off the dress seem to take an unusually long time !
Label: ['Dresses', 'Tops'] => [0, 1, 0, 0, 1, 0]
----------
Text: general petite . comfy , but not made to last . this sweater is fine for the casual days . i bought this in cream and i have to say after one wash it looks old . i'm a huge retailer lover and buy a lot of clothes from them . this is just not the best quality and looks tired after a few wears . very soft , but poor material . not my favorite purchase .
Label: ['Trend', 'Jackets'] => [0, 0, 0, 1, 0, 1]
----------
Text: general . great cool looking jeans . i just bought these jeans today & they are really cute & comfortable on . i love pilcro jeans as they fit really well , they are made well & they are always on style . i did have to go down a size as well but they fit beautifully ! comfy & stylish !
Label: ['Tops', 'Intimate', 'Jackets', 'Bottoms'] => [1, 0, 1, 1, 1, 0]
----------
tdc.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]

2. Batch process vs Line-by-line process

So far, we have applied processing functions and tokenization to each batch (of size 1000). There is a potentially faster way to process: apply these functions immediately to each row, instead of waiting for a batch to form. Let’s discuss when to use each approach.

  1. When to use batch process over line-by-line process
    • You have processing functions that perform more efficiently when applied to a whole batch (instead of line-by-line), e.g. contextual_aug_func (contextual augmentation), which utilizes a language model on GPU
    • For tokenization, you need to apply padding (to the whole batch)
  2. When to use line-by-line process over batch process
    • All your processing functions perform efficiently line-by-line
    • For tokenization, you don’t need padding (perhaps you will pad via DataCollatorWithPadding during training)
    • If you choose DataCollatorWithPadding to pad, be aware that it has no option to truncate; you therefore have to make sure the maximum length of your token_ids does not exceed the model’s maximum sequence length (a length-check sketch follows this list)
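
For the last caveat, here is a minimal sketch of such a length check, assuming a controller tdc processed with line_by_line=True and the tokenizer from above (tokenizer.model_max_length holds the model’s limit, e.g. 512 for roberta-base):

# DataCollatorWithPadding pads but never truncates, so verify token lengths first
max_len = tokenizer.model_max_length
for i,v in enumerate(tdc.main_ddict['train']):
    assert len(v['input_ids'])<=max_len, f"row {i} has {len(v['input_ids'])} tokens (limit {max_len})"
    if i==999: break # checking the first batch is usually enough for a sanity check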

a) Line-by-line over batch process

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              },
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  metadatas=['Title','Division Name'],
                                  content_transformations=[text_normalize,str.lower],
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42
                                 )

For the above data controller, both of its processing functions (text_normalize and str.lower) work line-by-line just fine, so we can utilize line-by-line processing.

tdc.process_and_tokenize(tokenizer,line_by_line=True)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
%%time
# time it takes to go through 3 batches (1000 x 3)
for i,v in enumerate(tdc.main_ddict['train']):
    if i%500==0:
        print(i)
    if i==1000*3-1:
        break
    pass
0
500
1000
1500
2000
2500
CPU times: user 2.6 s, sys: 2.53 s, total: 5.13 s
Wall time: 5.12 s
for i,v in enumerate(tdc.main_ddict['train']):
    print(v['input_ids'])
    print(f"Length of input_ids: {len(v['input_ids'])}")
    if i==1:
        break
    print('-'*10)
[0, 15841, 4716, 1459, 479, 2721, 299, 2156, 966, 5, 2139, 7886, 5137, 479, 5, 2721, 7457, 5780, 4855, 162, 7, 42, 299, 8, 24, 222, 45, 17534, 2115, 18245, 479, 959, 2156, 5, 2576, 910, 15315, 28, 9970, 98, 444, 66, 15, 349, 526, 14, 24, 21, 38677, 27785, 5, 3031, 2564, 16, 1085, 101, 5, 2170, 25606, 2563, 5, 1421, 18, 3701, 32, 2325, 11, 760, 9, 70, 5, 1823, 10199, 7, 946, 5, 910, 15315, 124, 479, 959, 2156, 5, 10199, 16, 2721, 2156, 5, 2564, 21, 1969, 36, 1836, 132, 2156, 195, 128, 204, 22, 2156, 13442, 23246, 479, 4839, 2156, 5, 1318, 16, 372, 8, 939, 657, 5, 5780, 98, 939, 1276, 7, 185, 24, 7, 127, 26090, 7, 22, 35043, 409, 22, 5, 22, 11954, 22, 15, 258, 3391, 2]
Length of input_ids: 136
----------
[0, 15841, 479, 45, 25, 765, 15, 162, 36, 4716, 1459, 4839, 479, 939, 2740, 5, 37863, 29, 181, 25, 42, 3588, 16, 45, 10, 15898, 3588, 2156, 8, 14, 21, 5, 235, 1836, 13, 162, 479, 129, 631, 16, 5, 5933, 16, 10, 828, 18277, 202, 361, 795, 15, 16701, 13, 162, 4839, 2156, 5, 31622, 32, 818, 3229, 2156, 98, 939, 74, 224, 5, 3588, 16, 10, 13173, 326, 15888, 3989, 479, 3195, 16, 2721, 2156, 939, 2740, 2272, 25, 5, 97, 3195, 36, 36838, 4839, 630, 75, 33, 4716, 1459, 577, 479, 2272, 16, 4066, 2156, 8, 30228, 2156, 5, 10199, 16, 10262, 3793, 479, 939, 657, 5, 410, 1254, 11, 5, 29986, 479, 2299, 240, 10, 18052, 16979, 11689, 13, 42, 65, 479, 12312, 23246, 119, 389, 385, 2]
Length of input_ids: 133

Just by looking at the first two input_ids, we can see there’s no padding at all. Let’s use DataCollatorWithPadding to add padding to these tokens.

from transformers import DataCollatorWithPadding
# no truncation strategy! We will pad to multiple of 8, to utilize NVIDIA hardware
data_collator = DataCollatorWithPadding(tokenizer,padding=True,pad_to_multiple_of=8)
tdc.set_data_collator(data_collator)
train_ddict = tdc.main_ddict['train'].remove_columns(tdc.cols_to_keep)

%%time
iter1 = iter(train_ddict)
out = tdc.data_collator([next(iter1) for i in range(3000)]) # apply data collator on 3 batches of 1000
CPU times: user 2.46 s, sys: 2.27 s, total: 4.72 s
Wall time: 4.72 s
out['input_ids'].shape
torch.Size([3000, 160])
out['input_ids'][:2]
tensor([[    0, 15841,  4716,  1459,   479,  2721,   299,  2156,   966,     5,
          2139,  7886,  5137,   479,     5,  2721,  7457,  5780,  4855,   162,
             7,    42,   299,     8,    24,   222,    45, 17534,  2115, 18245,
           479,   959,  2156,     5,  2576,   910, 15315,    28,  9970,    98,
           444,    66,    15,   349,   526,    14,    24,    21, 38677, 27785,
             5,  3031,  2564,    16,  1085,   101,     5,  2170, 25606,  2563,
             5,  1421,    18,  3701,    32,  2325,    11,   760,     9,    70,
             5,  1823, 10199,     7,   946,     5,   910, 15315,   124,   479,
           959,  2156,     5, 10199,    16,  2721,  2156,     5,  2564,    21,
          1969,    36,  1836,   132,  2156,   195,   128,   204,    22,  2156,
         13442, 23246,   479,  4839,  2156,     5,  1318,    16,   372,     8,
           939,   657,     5,  5780,    98,   939,  1276,     7,   185,    24,
             7,   127, 26090,     7,    22, 35043,   409,    22,     5,    22,
         11954,    22,    15,   258,  3391,     2,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1],
        [    0, 15841,   479,    45,    25,   765,    15,   162,    36,  4716,
          1459,  4839,   479,   939,  2740,     5, 37863,    29,   181,    25,
            42,  3588,    16,    45,    10, 15898,  3588,  2156,     8,    14,
            21,     5,   235,  1836,    13,   162,   479,   129,   631,    16,
             5,  5933,    16,    10,   828, 18277,   202,   361,   795,    15,
         16701,    13,   162,  4839,  2156,     5, 31622,    32,   818,  3229,
          2156,    98,   939,    74,   224,     5,  3588,    16,    10, 13173,
           326, 15888,  3989,   479,  3195,    16,  2721,  2156,   939,  2740,
          2272,    25,     5,    97,  3195,    36, 36838,  4839,   630,    75,
            33,  4716,  1459,   577,   479,  2272,    16,  4066,  2156,     8,
         30228,  2156,     5, 10199,    16, 10262,  3793,   479,   939,   657,
             5,   410,  1254,    11,     5, 29986,   479,  2299,   240,    10,
         18052, 16979, 11689,    13,    42,    65,   479, 12312, 23246,   119,
           389,   385,     2,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1]])

Our tokens have been padded
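
In actual training, you would typically let a PyTorch DataLoader call the collator, so each batch is padded on the fly as it is drawn from the stream. A sketch, assuming the train_ddict and data collator defined above:

from torch.utils.data import DataLoader

# the collator pads each incoming list of examples to a common length
train_dl = DataLoader(train_ddict,batch_size=32,collate_fn=tdc.data_collator)
batch = next(iter(train_dl))
print(batch['input_ids'].shape) # e.g. torch.Size([32, <padded length>])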

Now let’s compare the runtime if we use batch-processing instead

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              },
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  metadatas=['Title','Division Name'],
                                  content_transformations=[text_normalize,str.lower],
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42
                                 )

tdc.process_and_tokenize(tokenizer,line_by_line=False,max_length=152)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
%%time
# time it takes to go through 3 batches (1000 x 3)
for i,v in enumerate(tdc.main_ddict['train']):
    if i%500==0:
        print(i)
    if i==1000*3-1:
        break
    pass
0
500
1000
1500
2000
2500
CPU times: user 5.5 s, sys: 2.52 s, total: 8.02 s
Wall time: 7.98 s

This took a bit longer than the line-by-line run.

for i,v in enumerate(tdc.main_ddict['train']):
    print(v['input_ids'])
    print(f"Length of input_ids: {len(v['input_ids'])}")
    if i==1:
        break
    print('-'*10)
[0, 15841, 4716, 1459, 479, 2721, 299, 2156, 966, 5, 2139, 7886, 5137, 479, 5, 2721, 7457, 5780, 4855, 162, 7, 42, 299, 8, 24, 222, 45, 17534, 2115, 18245, 479, 959, 2156, 5, 2576, 910, 15315, 28, 9970, 98, 444, 66, 15, 349, 526, 14, 24, 21, 38677, 27785, 5, 3031, 2564, 16, 1085, 101, 5, 2170, 25606, 2563, 5, 1421, 18, 3701, 32, 2325, 11, 760, 9, 70, 5, 1823, 10199, 7, 946, 5, 910, 15315, 124, 479, 959, 2156, 5, 10199, 16, 2721, 2156, 5, 2564, 21, 1969, 36, 1836, 132, 2156, 195, 128, 204, 22, 2156, 13442, 23246, 479, 4839, 2156, 5, 1318, 16, 372, 8, 939, 657, 5, 5780, 98, 939, 1276, 7, 185, 24, 7, 127, 26090, 7, 22, 35043, 409, 22, 5, 22, 11954, 22, 15, 258, 3391, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Length of input_ids: 149
----------
[0, 15841, 479, 45, 25, 765, 15, 162, 36, 4716, 1459, 4839, 479, 939, 2740, 5, 37863, 29, 181, 25, 42, 3588, 16, 45, 10, 15898, 3588, 2156, 8, 14, 21, 5, 235, 1836, 13, 162, 479, 129, 631, 16, 5, 5933, 16, 10, 828, 18277, 202, 361, 795, 15, 16701, 13, 162, 4839, 2156, 5, 31622, 32, 818, 3229, 2156, 98, 939, 74, 224, 5, 3588, 16, 10, 13173, 326, 15888, 3989, 479, 3195, 16, 2721, 2156, 939, 2740, 2272, 25, 5, 97, 3195, 36, 36838, 4839, 630, 75, 33, 4716, 1459, 577, 479, 2272, 16, 4066, 2156, 8, 30228, 2156, 5, 10199, 16, 10262, 3793, 479, 939, 657, 5, 410, 1254, 11, 5, 29986, 479, 2299, 240, 10, 18052, 16979, 11689, 13, 42, 65, 479, 12312, 23246, 119, 389, 385, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Length of input_ids: 149

But at least our tokens have been padded appropriately.

b) Batch-process over line-by-line

aug2 = naw.ContextualWordEmbsAug(model_path='roberta-base', 
                                device='cuda:0', # if you don't have gpu, change to 'cpu'
                                action="substitute",
                                top_k=10,
                               aug_p=0.07)

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug2,p=0.5)
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              },
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  metadatas=['Title','Division Name'],
                                  content_augmentations=[contextual_aug_func],
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42
                                 )

For the above data controller, there’s an augmentation function (contextual_aug_func) that can take advantage of batch processing.

tdc.process_and_tokenize(tokenizer,line_by_line=False,max_length=152)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
%%time
# time it takes to go through 3 batches (1000 x 3)
for i,v in enumerate(tdc.main_ddict['train']):
    if i%500==0:
        print(i)
    if i==1000*3-1:
        break
    pass
0
500
1000
1500
2000
2500
CPU times: user 43.7 s, sys: 4.3 s, total: 48 s
Wall time: 47.9 s
for i,v in enumerate(tdc.main_ddict['train']):
    print(v['input_ids'])
    print(f"Length of input_ids: {len(v['input_ids'])}")
    if i==1:
        break
    print('-'*10)
[0, 15841, 4716, 1459, 4, 2721, 299, 6, 966, 5, 2139, 7886, 5137, 4, 20, 2721, 7457, 5780, 4855, 162, 7, 42, 1836, 8, 24, 222, 45, 17534, 2115, 18245, 4, 2223, 6, 5, 2576, 526, 28, 9970, 21, 444, 66, 11, 349, 526, 6, 24, 21, 38677, 328, 5, 3031, 2564, 21, 1085, 101, 5, 2170, 131, 2563, 5, 1421, 18, 3701, 32, 2325, 11, 760, 9, 70, 5, 1823, 10199, 7, 946, 5, 910, 15315, 124, 4, 50121, 50118, 9178, 6294, 6, 5, 10199, 16, 2721, 6, 5, 2408, 21, 1969, 36, 10799, 132, 6, 195, 108, 306, 1297, 13442, 23246, 12345, 5, 1318, 16, 372, 8, 939, 657, 5, 5780, 98, 939, 1276, 7, 492, 24, 7, 127, 26090, 7, 22, 1090, 605, 409, 113, 5, 22, 42932, 113, 15, 2185, 3391, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Length of input_ids: 152
----------
[0, 15841, 4, 1969, 17220, 1990, 80, 15033, 4, 5148, 10717, 17220, 6968, 33, 7, 216, 14, 42, 1907, 9, 10199, 16, 5, 65, 14, 40, 120, 6538, 36, 118, 33, 5, 1104, 65, 322, 24, 16, 2422, 11962, 8, 9869, 6, 53, 939, 21, 129, 441, 7, 120, 80, 15033, 66, 9, 24, 4, 939, 393, 10397, 24, 8, 24, 4925, 24, 18, 1836, 142, 939, 1079, 47904, 24, 683, 7727, 172, 10601, 7, 3841, 4, 939, 21, 2422, 5779, 15, 5, 3568, 53, 10874, 145, 441, 7, 671, 24, 396, 864, 23, 103, 400, 6215, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Length of input_ids: 152

Let’s compare this runtime to the line-by-line process.

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.1,seed=42)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trending'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                              'Department Name': lambda x: x is not None,
                                              },
                                  label_tfm_dict={'Department Name': lambda x: x if x!='Trend' else 'Trending'},
                                  metadatas=['Title','Division Name'],
#                                   content_transformations=[text_normalize,str.lower],
                                  content_augmentations=[contextual_aug_func],
                                  process_metas=True,
                                  batch_size=1000,
                                  num_proc=4,
                                  seed=42
                                 )

tdc.process_and_tokenize(tokenizer,line_by_line=True)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Label Transformation --------------------
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
%%time
# time it takes to go through 3 batches (1000 x 3)
for i,v in enumerate(tdc.main_ddict['train']):
    if i%500==0:
        print(i)
    if i==1000*3-1:
        break
    pass
0
500
1000
1500
2000
2500
CPU times: user 1min, sys: 2.44 s, total: 1min 2s
Wall time: 1min 2s

This took longer to run than batch-processing

for i,v in enumerate(tdc.main_ddict['train']):
    print(v['input_ids'])
    print(f"Length of input_ids: {len(v['input_ids'])}")
    if i==3:
        break
    print('-'*10)
[0, 15841, 4716, 1459, 479, 2721, 299, 6, 966, 5, 2139, 7886, 5137, 479, 20, 2721, 7457, 5780, 4855, 162, 7, 42, 299, 8, 24, 222, 45, 17534, 2115, 18245, 4, 959, 6, 5, 2576, 910, 15315, 28, 9970, 98, 444, 66, 15, 349, 526, 14, 24, 21, 38677, 328, 5, 3031, 2564, 16, 1085, 101, 5, 2170, 131, 2563, 5, 1421, 18, 3701, 32, 2325, 11, 760, 9, 70, 5, 1823, 10199, 7, 946, 5, 910, 15315, 124, 4, 50121, 50118, 9178, 6294, 6, 5, 10199, 16, 2721, 6, 5, 2564, 21, 1969, 36, 10799, 132, 6, 195, 108, 306, 1297, 13442, 23246, 12345, 5, 1318, 16, 372, 8, 939, 657, 5, 5780, 98, 939, 1276, 7, 185, 24, 7, 127, 26090, 7, 22, 1090, 605, 409, 113, 5, 22, 42932, 113, 15, 258, 3391, 2]
Length of input_ids: 137
----------
[0, 15841, 479, 45, 25, 765, 15, 162, 36, 13713, 1459, 43, 479, 38, 2740, 5, 37863, 29, 181, 25, 42, 3588, 16, 45, 10, 15898, 3588, 6, 8, 14, 21, 5, 235, 1836, 13, 162, 4, 129, 631, 16, 5, 5933, 16, 10, 828, 18277, 202, 361, 29668, 15, 16701, 13, 162, 238, 5, 31622, 32, 818, 3229, 6, 98, 939, 74, 224, 5, 3588, 16, 10, 13173, 326, 15888, 3989, 4, 3195, 16, 2721, 6, 939, 2740, 2272, 25, 5, 97, 3195, 36, 2911, 783, 43, 630, 75, 33, 4716, 1459, 577, 4, 2272, 16, 4066, 6, 8, 30228, 6, 5, 10199, 16, 10262, 3793, 4, 939, 657, 5, 410, 1254, 11, 5, 29986, 4, 2299, 240, 10, 18052, 16979, 11689, 13, 42, 65, 4, 50121, 50118, 50121, 50118, 15314, 23246, 119, 389, 417, 2]
Length of input_ids: 137
----------
[0, 15841, 479, 1969, 17220, 1990, 80, 15033, 479, 5148, 10717, 17220, 6968, 240, 7, 216, 14, 42, 1907, 9, 10199, 16, 5, 65, 14, 40, 120, 6538, 36, 118, 2162, 5, 1104, 65, 322, 24, 16, 2422, 7174, 8, 9869, 6, 53, 939, 21, 129, 441, 7, 120, 80, 15033, 66, 9, 24, 4, 939, 222, 10397, 24, 8, 24, 4925, 24, 18, 1836, 142, 939, 1079, 47904, 24, 150, 7727, 172, 10601, 7, 3841, 4, 939, 21, 2422, 5779, 59, 5, 3568, 53, 10874, 145, 441, 7, 671, 24, 396, 864, 23, 127, 400, 6215, 4, 2]
Length of input_ids: 99
----------
[0, 25153, 11139, 328, 479, 38, 657, 42, 3588, 4, 24, 1299, 98, 3793, 8, 3473, 6, 1969, 13, 1035, 12846, 939, 2813, 24, 376, 11, 80, 8089, 142, 939, 74, 907, 961, 12846, 2]
Length of input_ids: 35

3. Save and Load TextDataControllerStreaming


source

TextDataControllerStreaming.save_as_pickles

 TextDataControllerStreaming.save_as_pickles (fname,
                                              parent='pickle_files',
                                              drop_attributes=False)
Type Default Details
fname Name of the pickle file
parent str pickle_files Parent folder
drop_attributes bool False Whether to drop large-size attributes

source

TextDataControllerStreaming.from_pickle

 TextDataControllerStreaming.from_pickle (fname, parent='pickle_files')
Type Default Details
fname Name of the pickle file
parent str pickle_files Parent folder

A TextDataControllerStreaming object can be saved and loaded with ease. This is especially useful once text processing and/or tokenization have already been done.
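Under the hood this is essentially a pickle round trip. Here is a minimal sketch of the idea, not the library's actual implementation (save_obj and load_obj are hypothetical helpers):

import pickle
from pathlib import Path

def save_obj(obj, fname, parent='pickle_files'):
    # serialize any picklable object to parent/fname.pkl
    Path(parent).mkdir(parents=True, exist_ok=True)
    with open(Path(parent)/f'{fname}.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load_obj(fname, parent='pickle_files'):
    # restore the object with all its picklable attributes intact
    with open(Path(parent)/f'{fname}.pkl', 'rb') as f:
        return pickle.load(f)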

from functools import partial
from datasets import disable_caching, load_dataset, Dataset
disable_caching() # disable huggingface caching to see data size
from underthesea import text_normalize
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
def nlp_aug_stochastic(x,aug=None,p=0.5):
    # single string: augment it with probability p
    if not isinstance(x,list):
        if random.random()<p: return aug.augment(x)[0]
        return x
    # batch of strings: randomly split into items to augment vs. items to keep as-is
    news=[]
    originals=[]
    for _x in x:
        if random.random()<p: news.append(_x)
        else: originals.append(_x)
    # only perform augmentation when needed
    if len(news): news = aug.augment(news)
    return news+originals
aug2 = naw.ContextualWordEmbsAug(model_path='roberta-base',
                                 device='cuda:0', # if you don't have a gpu, change to 'cpu'
                                 action="substitute",
                                 top_k=10,
                                 aug_p=0.07)

contextual_aug_func = partial(nlp_aug_stochastic,aug=aug2,p=0.1)
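# the two attributes below act as hints for the controller (inferred from their
# names): run this augmentation on GPU, applying it in sub-batches of 32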
contextual_aug_func.run_on_gpu=True
contextual_aug_func.batch_size=32
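As a quick sanity check (a hypothetical snippet, not part of the original pipeline), you can call the augmenter directly on a few strings; with p=0.1, most of them should come back unchanged:

print(contextual_aug_func(['i love this dress', 'the fabric feels cheap']))

Note that the batched version may reorder its output: augmented strings are returned before the untouched originals.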
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict_with_val = dset.train_test_split(test_size=0.2)
ddict_with_val['validation'] = ddict_with_val['test']
ddict_with_val['train'] = ddict_with_val['train'].to_iterable_dataset()
del ddict_with_val['test']

tdc = TextDataControllerStreaming(ddict_with_val,
                                  main_text='Review Text',
                                  label_names='Department Name',
                                  sup_types='classification',
                                  class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                                  filter_dict={'Review Text': lambda x: x is not None,
                                               'Department Name': lambda x: x is not None,
                                              },
                                  metadatas=['Title','Division Name'],
                                  content_transformations=[text_normalize,str.lower],
                                  content_augmentations= contextual_aug_func,
                                  process_metas=True,
                                  batch_size=100,
                                  num_proc=4,
                                  seed=42
                                 )
tdc.process_and_tokenize(tokenizer,max_length=256)
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on validation set -----
Done
----- Creating a generator for content transformation, augmentation and tokenization on train set -----
Done
tdc.main_ddict
DatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 3
    })
    validation: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4529
    })
})
tdc.save_as_pickles('my_tdc_stream')

Let’s check the file size

file_stats = os.stat(Path('pickle_files/my_tdc_stream.pkl'))
print(f'File Size in MegaBytes is {round(file_stats.st_size / (1024 * 1024), 3)}')
File Size in MegaBytes is 479.023

Load back our object

tdc2 = TextDataControllerStreaming.from_pickle('my_tdc_stream')

You can still access all of its attributes: the data, the preprocessing steps, the transformation/augmentation functions, and so on.

tdc2.main_ddict
DatasetDict({
    train: IterableDataset({
        features: Unknown,
        n_shards: 3
    })
    validation: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4529
    })
})
for i,v in enumerate(tdc2.main_ddict['train']):
    if i==3:break
    print(f"Text: {v['Review Text']}\nLabel: {v['Department Name']} => {v['label']}")
    print('-'*10)
Text: general petite.. i love it soft brown glistening, flowy beauty! it's my favorite color too! i'm 5'5 ". 34 d, size 6 and a small fit and with room to spare. don't wait!
Label: Jackets => 3
----------
Text: general. not the same... as i agree a other reviewer, the material of these jeans is not the same! thin, short, and you end up pulling them up all the time. me am a short, curvy girl and would prefer to have the old jean fabric back! this seems to be the trend in jeans? nydj also uses this fabric? probably too much.
Label: Dresses => 1
----------
Text: general. not for the busty, simple fabric, very versatile but the knit length and style accentuates the bust. probably not an issue for most but if your a d or up it's more attention than you may want.
Label: Tops => 4
----------
tdc2.label_lists
[['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend']]
tdc2.filter_dict,tdc2.content_tfms,tdc2.aug_tfms
({'Review Text': <function __main__.<lambda>(x)>,
  'Department Name': <function __main__.<lambda>(x)>},
 [<function underthesea.pipeline.text_normalize.text_normalize(text, tokenizer='underthesea')>,
  <method 'lower' of 'str' objects>],
 [functools.partial(<function nlp_aug_stochastic>, aug=<nlpaug.augmenter.word.context_word_embs.ContextualWordEmbsAug object>, p=0.1)])

If you don’t want to store the HuggingFace DatasetDict or the augmentation functions inside your TextDataControllerStreaming (typically when you already have a trained model and only use the controller to preprocess a test set), you can drop them in the save_as_pickles step by setting drop_attributes=True.
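Conceptually, drop_attributes blanks out the heavyweight attributes before pickling. A rough sketch of the idea (not the library's actual code; which attributes get dropped is an assumption here):

import copy

light = copy.copy(tdc)                  # shallow copy, so the original keeps its data
for attr in ('main_ddict', 'aug_tfms'): # assumed to be among the large attributes
    if hasattr(light, attr):
        setattr(light, attr, None)
# 'light' can now be pickled into a much smaller file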

tdc.save_as_pickles('my_lightweight_tdc_stream',drop_attributes=True)

Let’s check the file size

file_stats = os.stat(Path('pickle_files/my_lightweight_tdc_stream.pkl'))
print(f'File Size in MegaBytes is {round(file_stats.st_size / (1024 * 1024), 3)}')
File Size in MegaBytes is 1.907

Load it back

tdc3 = TextDataControllerStreaming.from_pickle('my_lightweight_tdc_stream')

We will use this object to demonstrate how to construct a test dataset in the next section.

4. Construct a Test Dataset


source

TextDataControllerStreaming.prepare_test_dataset

 TextDataControllerStreaming.prepare_test_dataset (test_dset,
                                                   do_filtering=False)
Type Default Details
test_dset The HuggingFace Dataset as Test set
do_filtering bool False whether to perform data filtering on this test set

source

TextDataControllerStreaming.prepare_test_dataset_from_csv

 TextDataControllerStreaming.prepare_test_dataset_from_csv (file_path,
                                                             do_filtering=False)
Type Default Details
file_path path to csv file
do_filtering bool False whether to perform data filtering on this test set

source

TextDataControllerStreaming.prepare_test_dataset_from_df

 TextDataControllerStreaming.prepare_test_dataset_from_df (df,
                                                            validate=True,
                                                            do_filtering=False)
Type Default Details
df Pandas Dataframe
validate bool True whether to perform input data validation
do_filtering bool False whether to perform data filtering on this test set

source

TextDataControllerStreaming.prepare_test_dataset_from_raws

 TextDataControllerStreaming.prepare_test_dataset_from_raws (content)
Details
content Either a single sentence, a list of sentences, or a dictionary whose keys are metadata columns and whose values are lists

Let’s say you have done your preprocessing and tokenization on your training set and have a nicely trained model, ready to do inference on new data. Here is how you can use TextDataControllerStreaming to apply all the necessary preprocessing steps to that new data.

We will reuse the lightweight tdc object we created in the previous section (since we don’t need the full training data just to preprocess new data). We will also take a small sample of our training data and pretend it is our test data.

tdc = TextDataControllerStreaming.from_pickle('my_lightweight_tdc_stream')
df_test = pd.read_csv('sample_data/Womens_Clothing_Reviews.csv',encoding='utf-8-sig').sample(frac=0.2,random_state=1)
# drop NaN values in the label column
df_test = df_test[~df_test['Department Name'].isna()].reset_index(drop=True)
df_test.shape
(4692, 10)
df_test.head()
Clothing ID Age Title Review Text Rating Recommended IND Positive Feedback Count Division Name Department Name Class Name
0 872 42 Perfect for work and play This shirt works for both going out and going ... 5 1 0 General Tops Knits
1 1033 40 NaN I don't know why i had the opposite problem mo... 4 1 0 General Petite Bottoms Jeans
2 1037 45 Great pants These cords are great--lightweight for fl wint... 5 1 1 General Petite Bottoms Jeans
3 829 35 Surprisingly comfy for a button down I am a 10 m and got the 10. it fits perfectly ... 5 1 1 General Petite Tops Blouses
4 872 29 Short and small The shirt is mostly a thick sweatshirt materia... 3 0 15 General Petite Tops Knits
test_dset = tdc.prepare_test_dataset_from_df(df_test,validate=True,do_filtering=True)
- Input Validation Precheck -
Data contains missing values!
-----> List of columns and the number of missing values for each
Title          758
Review Text    164
dtype: int64
Data contains duplicated values!
-----> Number of duplications: 2 rows
-------------------- Start Test Set Transformation --------------------
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on test set -----
Done
for i in range(3):
    print(f"Text: {test_dset['Review Text'][i]}")
    print(f"Input_ids: {test_dset['input_ids'][i]}")
    print('-'*10)
Text: general . perfect for work and play . this shirt works for both going out and going to work , and i can wear it with everything . fits perfect , tucked and untucked , tied and untied . i love it .
Input_ids: [0, 15841, 479, 1969, 13, 173, 8, 310, 479, 42, 6399, 1364, 13, 258, 164, 66, 8, 164, 7, 173, 2156, 8, 939, 64, 3568, 24, 19, 960, 479, 10698, 1969, 2156, 21222, 8, 7587, 23289, 2156, 3016, 8, 7587, 2550, 479, 939, 657, 24, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
Text: general petite . . i don't know why i had the opposite problem most reviewers had with these ..... i tried on the regular length in the store and found that they were just a bit too short with heels . ( i'm 5 ' 5 ) . i had them ordered in a petite and when they came , they were too short with flats ! maybe it's the way i like to wear them , i like my flare jeans to barely skim the ground . i just exchanged them for regular length and will wear them with a small wedge shoe . aside from the length issues , these are super cute
Input_ids: [0, 15841, 4716, 1459, 479, 479, 939, 218, 75, 216, 596, 939, 56, 5, 5483, 936, 144, 34910, 56, 19, 209, 29942, 734, 939, 1381, 15, 5, 1675, 5933, 11, 5, 1400, 8, 303, 14, 51, 58, 95, 10, 828, 350, 765, 19, 8872, 479, 36, 939, 437, 195, 128, 195, 4839, 479, 939, 56, 106, 2740, 11, 10, 4716, 1459, 8, 77, 51, 376, 2156, 51, 58, 350, 765, 19, 20250, 27785, 2085, 24, 18, 5, 169, 939, 101, 7, 3568, 106, 2156, 939, 101, 127, 24186, 10844, 7, 6254, 28772, 5, 1255, 479, 939, 95, 11024, 106, 13, 1675, 5933, 8, 40, 3568, 106, 19, 10, 650, 27288, 12604, 479, 4364, 31, 5, 5933, 743, 2156, 209, 32, 2422, 11962, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
Text: general petite . great pants . thes e cords are great--lightweight for fl winters , and the bootcut flare bottom is super cute with ballet flats or booties . i am 5 ' 10 " and typically a size 8 ; the size 29 fit perfectly . they have a little stretch to them , which is great . very flattering--wish i could order in more colors ! !
Input_ids: [0, 15841, 4716, 1459, 479, 372, 9304, 479, 5, 29, 364, 37687, 32, 372, 5579, 6991, 4301, 13, 2342, 31000, 2156, 8, 5, 9759, 8267, 24186, 2576, 16, 2422, 11962, 19, 22573, 20250, 50, 9759, 918, 479, 939, 524, 195, 128, 158, 22, 8, 3700, 10, 1836, 290, 25606, 5, 1836, 1132, 2564, 6683, 479, 51, 33, 10, 410, 4140, 7, 106, 2156, 61, 16, 372, 479, 182, 34203, 5579, 605, 1173, 939, 115, 645, 11, 55, 8089, 27785, 27785, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
----------
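The csv variant works the same way. A hypothetical call reusing the sample file (the csv must contain the same columns the controller was built with):

test_dset_csv = tdc.prepare_test_dataset_from_csv('sample_data/Womens_Clothing_Reviews.csv', do_filtering=True)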

Let’s make our test data streamed as well

test_dset_raw = Dataset.from_pandas(df_test).to_iterable_dataset()

This test dataset might have some NaN values in the text field (Review Text), so we will turn on the filtering option to get rid of them, just as we did for the training set. If your test dataset doesn’t need any filtering, leave this option off.

test_dset = tdc.prepare_test_dataset(test_dset_raw,do_filtering=True)
-------------------- Start Test Set Transformation --------------------
-------------------- Data Filtering --------------------
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Dropping unused features --------------------
Done
----- Performing Content Transformation and Tokenization on test set -----
Done
for i,v in enumerate(test_dset):
    if i==3:break
    print(f"Text: {v['Review Text']}\Input_ids: {v['input_ids']}\nAttention mask: {v['attention_mask']}")
    print('-'*10)
Text: general . perfect for work and play . this shirt works for both going out and going to work , and i can wear it with everything . fits perfect , tucked and untucked , tied and untied . i love it .
Input_ids: [0, 15841, 479, 1969, 13, 173, 8, 310, 479, 42, 6399, 1364, 13, 258, 164, 66, 8, 164, 7, 173, 2156, 8, 939, 64, 3568, 24, 19, 960, 479, 10698, 1969, 2156, 21222, 8, 7587, 23289, 2156, 3016, 8, 7587, 2550, 479, 939, 657, 24, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
----------
Text: general petite . . i don't know why i had the opposite problem most reviewers had with these ..... i tried on the regular length in the store and found that they were just a bit too short with heels . ( i'm 5 ' 5 ) . i had them ordered in a petite and when they came , they were too short with flats ! maybe it's the way i like to wear them , i like my flare jeans to barely skim the ground . i just exchanged them for regular length and will wear them with a small wedge shoe . aside from the length issues , these are super cute
Input_ids: [0, 15841, 4716, 1459, 479, 479, 939, 218, 75, 216, 596, 939, 56, 5, 5483, 936, 144, 34910, 56, 19, 209, 29942, 734, 939, 1381, 15, 5, 1675, 5933, 11, 5, 1400, 8, 303, 14, 51, 58, 95, 10, 828, 350, 765, 19, 8872, 479, 36, 939, 437, 195, 128, 195, 4839, 479, 939, 56, 106, 2740, 11, 10, 4716, 1459, 8, 77, 51, 376, 2156, 51, 58, 350, 765, 19, 20250, 27785, 2085, 24, 18, 5, 169, 939, 101, 7, 3568, 106, 2156, 939, 101, 127, 24186, 10844, 7, 6254, 28772, 5, 1255, 479, 939, 95, 11024, 106, 13, 1675, 5933, 8, 40, 3568, 106, 19, 10, 650, 27288, 12604, 479, 4364, 31, 5, 5933, 743, 2156, 209, 32, 2422, 11962, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
----------
Text: general petite . great pants . thes e cords are great--lightweight for fl winters , and the bootcut flare bottom is super cute with ballet flats or booties . i am 5 ' 10 " and typically a size 8 ; the size 29 fit perfectly . they have a little stretch to them , which is great . very flattering--wish i could order in more colors ! !
Input_ids: [0, 15841, 4716, 1459, 479, 372, 9304, 479, 5, 29, 364, 37687, 32, 372, 5579, 6991, 4301, 13, 2342, 31000, 2156, 8, 5, 9759, 8267, 24186, 2576, 16, 2422, 11962, 19, 22573, 20250, 50, 9759, 918, 479, 939, 524, 195, 128, 158, 22, 8, 3700, 10, 1836, 290, 25606, 5, 1836, 1132, 2564, 6683, 479, 51, 33, 10, 410, 4140, 7, 106, 2156, 61, 16, 372, 479, 182, 34203, 5579, 605, 1173, 939, 115, 645, 11, 55, 8089, 27785, 27785, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
----------
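Finally, if you only have a few raw inputs at inference time, prepare_test_dataset_from_raws accepts them directly. A hypothetical call (the dictionary layout below mirrors the metadata columns used above; the exact expected keys may differ):

# a single sentence
one_dset = tdc.prepare_test_dataset_from_raws('I love this dress!')
# several sentences with metadata, as a dictionary of lists
many_dset = tdc.prepare_test_dataset_from_raws({'Title': ['Great find'],
                                                'Division Name': ['General'],
                                                'Review Text': ['I love this dress!']})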