Text Feature Extraction From Hidden States

import os
# This will specify a GPU (or a list of GPUs) for training
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main_lm import *
from that_nlp_library.utils import seed_everything
from that_nlp_library.model_lm_main import *
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from transformers import AutoTokenizer, AutoConfig, AutoModelForMaskedLM
from datasets import load_dataset
import pandas as pd
import numpy as np
from transformers import DataCollatorForLanguageModeling
Finetune a Roberta Language Model (with line-by-line tokenization)
Create a TextDataLMController object
We will reuse the data and the preprocessing steps from this tutorial
In order to extract a feature vector from a review sentence in the dataset, we can directly use pretrained models such as Roberta, GPT2, … But if our dataset is vastly different from the datasets these pretrained models were trained on, we can finetune them on our dataset before extracting the feature vector. And that’s exactly what we are going to do now.
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
ddict = dset.train_test_split(test_size=0.2,seed=42)
ddict['validation'] = ddict['test']
del ddict['test']

# For now, we will filter missing review text rows here instead of relying on the data controller
ddict = ddict.filter(lambda x: x['Review Text'] is not None)
ddict
DatasetDict({
train: Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 18111
})
validation: Dataset({
features: ['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'],
num_rows: 4530
})
})
tdc = TextDataLMController(ddict,
                           main_text='Review Text',
                           metadatas='Title',
                           content_transformations=[text_normalize,str.lower],
                           seed=42,
                           verbose=True
                          )
Define our tokenizer for Roberta
_tokenizer = AutoTokenizer.from_pretrained('roberta-base')
Process and tokenize our dataset (using line-by-line tokenization)
block_size=112
tdc.process_and_tokenize(_tokenizer,line_by_line=True,max_length=block_size)
# set max_length=-1 if you want the data collator to pad
-------------------- Start Main Text Processing --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split already exists
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 2, which is 0.01% of training set
Filtering leaked data out of training set...
Done
-------------------- Shuffling and flattening train set --------------------
Done
-------------------- Tokenization --------------------
Done
tdc.main_ddict
DatasetDict({
train: Dataset({
features: ['Title', 'Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 18109
})
validation: Dataset({
features: ['Title', 'Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask'],
num_rows: 4530
})
})
And set the data collator
tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)
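Under the hood, this presumably configures a masked-language-modeling collator that randomly masks 15% of tokens per batch. A minimal stand-alone sketch of an equivalent collator, using the DataCollatorForLanguageModeling class imported earlier (the assumption being that set_data_collator wraps something similar):

# Sketch only: a HuggingFace MLM collator that randomly masks 15% of tokens per batch
collator = DataCollatorForLanguageModeling(tokenizer=_tokenizer, mlm=True, mlm_probability=0.15)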
Initialize and train Roberta Language Model
_config = AutoConfig.from_pretrained('roberta-base',vocab_size=len(_tokenizer))
_model = language_model_init(AutoModelForMaskedLM,
                             config=_config,
                             cpoint_path='roberta-base',
                             seed=42
                            )
Total parameters: 124697433
Total trainable parameters: 124697433
Create a model controller
controller = ModelLMController(_model,data_store=tdc,seed=42)
And we can start training our model
lr = 1e-4
bs = 32
wd = 0.01
epochs = 6
warmup_ratio = 0.25

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
              )
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | No log | 1.502509 | 0.657702 |
2 | 1.706200 | 1.509666 | 0.656255 |
3 | 1.706200 | 1.423058 | 0.671800 |
4 | 1.476100 | 1.344976 | 0.684443 |
5 | 1.476100 | 1.309737 | 0.691151 |
6 | 1.308800 | 1.293480 | 0.694945 |
Perplexity on validation set: 3.634
Finetuning from a pretrained model results in a large improvement in these metrics
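The perplexity here is presumably the exponential of the validation cross-entropy loss, so it can be sanity-checked directly (small differences can come from re-running evaluation):

import math
math.exp(1.293480)   # ≈ 3.65, close to the reported 3.634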
controller.trainer.model.save_pretrained('./sample_weights/roberta_lm_model')
What can we do with feature vectors?
Dimensionality Reduction
Typically, the two main downstream tasks we can perform on the reduced-dimension data are visualization and feeding it as input to simpler machine learning models (regression, tree-based algorithms)
For dimensionality reduction, it’s important to determine the target number of dimensions to project the original data onto. We will use a popular algorithm called PCA, so those dimensions are the principal components (PCs). We can therefore use the explained variance ratio (calculated from each PC’s eigenvalue) to determine the number of PCs needed for our dimensionality reduction task
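As a quick illustration of that relationship: each PC’s explained variance ratio is its eigenvalue divided by the total variance, which in scikit-learn (when all components are kept) reduces to the following toy sketch, where X_demo is hypothetical random data:

from sklearn.decomposition import PCA
import numpy as np

X_demo = np.random.RandomState(0).randn(100, 5)                 # hypothetical toy data
pca_demo = PCA(svd_solver='full').fit(X_demo)                   # keep all 5 components
ratios = pca_demo.explained_variance_ / pca_demo.explained_variance_.sum()
assert np.allclose(ratios, pca_demo.explained_variance_ratio_)  # eigenvalues normalized to ratios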
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
def plot_cummulative_variance(X,figsize=(6,4),dim_to_visualize=None):
    if dim_to_visualize is None: dim_to_visualize = X.shape[1]
    _pca = PCA(svd_solver='full',random_state=42)
    _tmp = _pca.fit_transform(X)
    fig,ax = plt.subplots(figsize=figsize)
    ax.bar(1+np.arange(dim_to_visualize),_pca.explained_variance_ratio_[:dim_to_visualize],color='maroon')
    ax.set_ylabel('Explained Variance')
    ax2 = ax.twinx()
    ax2.plot(1+np.arange(dim_to_visualize),np.cumsum(_pca.explained_variance_ratio_[:dim_to_visualize]),label='Cummulative line',color='blue')
    ax2.set_xlabel('n_components')
    ax2.set_ylabel('Cummulative Explained Variance')
    ax2.legend(loc='upper right')
    plt.show()
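The plot call below uses hidden_from_val, the hidden-state features of the validation split, which were extracted outside this excerpt. A minimal sketch of how they could be obtained, mirroring the full-dataset extraction shown later in this section (val_controller is a hypothetical name):

_config = AutoConfig.from_pretrained('./sample_weights/roberta_lm_model',output_hidden_states=True)
trained_model = language_model_init(AutoModelForMaskedLM,
                                    cpoint_path='./sample_weights/roberta_lm_model',
                                    config=_config)
val_controller = ModelLMController(trained_model,data_store=tdc,seed=42)
hidden_from_val = val_controller.get_hidden_states_from_raw_dset(dset=ddict['validation'],
                                                                 state_name='hidden_states',
                                                                 state_idx=[-1,0])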
plot_cummulative_variance(hidden_from_val['hidden_states'],dim_to_visualize=55)
We can capture roughly 80% of the total variance with only ~55 components (out of 768, i.e. only ~7% of the original dimensions). The first 3 PCs are the most important, with the highest explained variance.
Let’s try to visualize the first 3 PCs
_pca = PCA(svd_solver='full',n_components=3,random_state=42)
_pca_coord = _pca.fit_transform(hidden_from_val['hidden_states'])

pca_df = pd.DataFrame(_pca_coord,columns=['pc1','pc2','pc3'])
pca_df['DN'] = ddict['validation']['Department Name']
pca_df['DN'] = pca_df['DN'].fillna('Tops')
fig,axs = plt.subplots(3,1,figsize=(10,14))
sns.scatterplot(data=pca_df,x='pc1',y='pc2',hue='DN',ax=axs[0],s=10)
sns.scatterplot(data=pca_df,x='pc1',y='pc3',hue='DN',ax=axs[1],s=10)
sns.scatterplot(data=pca_df,x='pc2',y='pc3',hue='DN',ax=axs[2],s=10)
We can distinguish a fairly clear cluster in the top right when visualizing PC1 against PC3. Combining this with the Department Name hue, this cluster contains mostly “Bottoms” and “Intimate” clothing items. Therefore, using these PCs, we could build a simple classification model to classify Department Name, as sketched below.
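A minimal sketch of such a model, using logistic regression on the three PCs (the choice of classifier is an assumption for illustration; no accuracy figures are claimed here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_pc = pca_df[['pc1','pc2','pc3']].values   # the 3 principal components as features
y_dn = pca_df['DN'].values                  # Department Name as the target
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X_pc, y_dn, cv=5).mean())   # rough cross-validated accuracy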
Sentence Similarity
When each sentence is represented by a dense feature vector, we can perform one of the most popular similarity tasks: which sentences are the most similar to a given sentence? In this experiment, we will use the FAISS library, which is designed specifically for efficient similarity search and clustering of dense vectors
import faiss
import time
import pandas as pd
import numpy as np
from datasets import load_from_disk
def faiss_find_NN(index,X,k):
    tic = time.perf_counter()
    D, I = index.search(X, k=k)
    toc = time.perf_counter()
    print(f"Execution time: {toc - tic:0.4f} seconds")
    return I,D
def faiss_prepare_gpu_index(X):
    res = faiss.StandardGpuResources()
    index = faiss.IndexFlatL2(X.shape[1])
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
    gpu_index.add(X)
    return gpu_index
def faiss_prepare_gpu_index_fast(X,nlist,nprobe=1):
    """
    https://github.com/facebookresearch/faiss/wiki/Faster-search
    - The nlist parameter of the IndexIVFFlat index in Faiss refers to the number of Voronoi cells,
    or clusters, that the index uses to partition the vector space.
    - The nprobe parameter specifies the number of cells to visit during the search,
    and can be used in conjunction with nlist to further fine-tune the performance of the index.
    For example, you can increase nlist to speed up the search
    and then increase nprobe to compensate for the potential decrease in accuracy.
    """
    res = faiss.StandardGpuResources()
    d = X.shape[1]
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer,d,nlist)
    assert not index.is_trained
    index.train(X)
    assert index.is_trained
    index.add(X)
    index.nprobe = nprobe
    gpu_index = faiss.index_cpu_to_gpu(res,0,index)
    return gpu_index
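The rest of this tutorial uses the exact (flat) index; the approximate IVF variant above could be used in the same way once the feature vectors are available, for example (illustrative nlist/nprobe values):

# Hypothetical usage of the faster IVF index on the same feature matrix
# gpu_index_fast = faiss_prepare_gpu_index_fast(all_hiddens['hidden_states'], nlist=100, nprobe=4)
# idxs, D = faiss_find_NN(gpu_index_fast, all_hiddens['hidden_states'], k=5)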
We will produce the feature vectors for the entire dataset
full_dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
# For now, we will filter missing review text rows here instead of relying on the data controller
full_dset = full_dset.filter(lambda x: x['Review Text'] is not None)
_config = AutoConfig.from_pretrained('./sample_weights/roberta_lm_model',output_hidden_states=True)

trained_model = language_model_init(AutoModelForMaskedLM,
                                    cpoint_path='./sample_weights/roberta_lm_model',
                                    config=_config
                                   )
controller2 = ModelLMController(trained_model,data_store=tdc,seed=42)
Total parameters: 124697433
Total trainable parameters: 124697433
all_hiddens = controller2.get_hidden_states_from_raw_dset(dset=full_dset,
                                                          state_name='hidden_states',
                                                          state_idx=[-1,0])
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Tokenization --------------------
Done
# all_hiddens = load_from_disk("./sample_weights/all_hiddens")
all_hiddens['hidden_states'].shape
(22641, 768)
gpu_index = faiss_prepare_gpu_index(all_hiddens['hidden_states'])
idxs,D = faiss_find_NN(gpu_index,all_hiddens['hidden_states'],k=5) # find 5 nearest neighbors
Execution time: 0.0659 seconds
i=1
print(f"1. Original Sentence: {all_hiddens['Review Text'][i]}\n")
print('2. Neighbors:')
for a,b in enumerate(idxs[i]):
    print(f" - Sentence: {all_hiddens['Review Text'][b]}")
    print(f" - Distance: {D[i][a]}")
    print('-'*10)
1. Original Sentence: . love this dress ! it's sooo pretty . i happened to find it in a store , and i'm glad i did bc i never would have ordered it online bc it's petite . i bought a petite and am 5 ' 8 " . i love the length on me - hits just a little below the knee . would definitely be a true midi on someone who is truly petite .
2. Neighbors:
- Sentence: . love this dress ! it's sooo pretty . i happened to find it in a store , and i'm glad i did bc i never would have ordered it online bc it's petite . i bought a petite and am 5 ' 8 " . i love the length on me - hits just a little below the knee . would definitely be a true midi on someone who is truly petite .
- Distance: 3.0517578125e-05
----------
- Sentence: this dress is beautiful . when i received the dress , it was true to size on the bust and length . i didn't purchase the petite , but i wish i did since i am 5 ' 2 . i tried to wear my wedges , but it's still long . i love the dress and would definitely recommend it .
- Distance: 0.445709228515625
----------
- Sentence: pretty everyday dress . this is a great everyday dress . i bought it in black and it looks cute with wedges . it's a little boxy but that also makes it super comfy so i don't mind that too much . definitely order one-size down if you're between sizes . you may want to consider petite , if it's available , becuase i found it to be a bit long for my taste . ( i'm 5 ' 5 " and it hits at the knee ) . i've already worn it a few times and have had received several compliments .
- Distance: 0.4746246337890625
----------
- Sentence: adorable ! . i ran into retailer to actually return a dress and saw this hanging in the petite section and had to try it . at almost 5 ' 4 " i can sometimes go between regular and petite sizing . and thankfully i could grab the petite with this dress . ( they didn't have the regular size ) it's airy and summery , i love the design of the swing . it's loose , but not tent like . the length came to about 3 inches above my knee . it's not see though at all ! i had black bottoms on and you couldn't see them . another plus it tha
- Distance: 0.479583740234375
----------
- Sentence: love this dress ! . i'm 5 ft 3 inches , 130 lbs . this dress fits perfect ! i bought a 4 r . the petite 4 probably would have fit and been a inch or two shorter but i don't the regular looked bad . it hits an inch or two below my knee . medium thickness of fabric is forgiving even though the dress hugs in all the right places . beautiful dress ! !
- Distance: 0.48602294921875
----------
i=2
print(f"1. Original Sentence: {all_hiddens['Review Text'][i]}\n")
print('2. Neighbors:')
for a,b in enumerate(idxs[i]):
    print(f" - Sentence: {all_hiddens['Review Text'][b]}")
    print(f" - Distance: {D[i][a]}")
    print('-'*10)
1. Original Sentence: some major design flaws . i had such high hopes for this dress and really wanted it to work for me . i initially ordered the petite small ( my usual size ) but i found this to be outrageously small . so small in fact that i could not zip it up ! i reordered it in petite medium , which was just ok . overall , the top half was comfortable and fit nicely , but the bottom half had a very tight under layer and several somewhat cheap ( net ) over layers . imo , a major design flaw was the net over layer sewn directly into the zipper - it c
2. Neighbors:
- Sentence: some major design flaws . i had such high hopes for this dress and really wanted it to work for me . i initially ordered the petite small ( my usual size ) but i found this to be outrageously small . so small in fact that i could not zip it up ! i reordered it in petite medium , which was just ok . overall , the top half was comfortable and fit nicely , but the bottom half had a very tight under layer and several somewhat cheap ( net ) over layers . imo , a major design flaw was the net over layer sewn directly into the zipper - it c
- Distance: 0.0
----------
- Sentence: wanted to love . . i was so excited for this dress to arrive , but when it did , it was a major disappointment . there is no lining on the skirt portion of the dress leaving it sheer . also , the top part along the chest puckered in a weird unflattering way . the colors were beautiful , but not enough to to make up for the other design flaws . i am 5 ' 6 " 140 lb and the small fit tts . will be sending it back
- Distance: 0.29474639892578125
----------
- Sentence: sloppy fit . tried this dress in a small . i am 5 ' 6 " , 135 lbs , 34 c . the small was huge - the sash helped some , but looked like a potato sack on me . also , i noticed the fabric was piling slighly in the store . i had high hopes for this dress but left this one in the store .
- Distance: 0.34699249267578125
----------
- Sentence: gorgeous but ... . this dress caught my eye online and i had to have it . when it arrived it was still just as pretty as i remembered . beautiful , vibrant colors and a unique oversized floral on a flirty , swishy fabric . unfortunately the cut was really strange . after reading several reviews i went with an xs , but i'm not sure a s would have solved all the fit problems i was having . while everything from the waist down fit perfectly , the arms were cut too tight , the chest was huge and yet i struggled to zip it up . an
- Distance: 0.35202789306640625
----------
- Sentence: wanted to love it , but ... . i saw the colors and texture of this top and wanted to love it however when i put it on it just did not work . the cut was so strange . it runs very small in the shoulders and then flairs out so much at the waist . i even ordered up a size based on reviews . i ordered an 8 when i normally where a small ( 6 ) in things . the shoulders were still too small and it was way too big every where else .
- Distance: 0.35425567626953125
----------