Text Transformation

This module contains text transformation utilities.

apply_vnmese_word_tokenize

 apply_vnmese_word_tokenize (sentence:str, normalize_text=False,
                             fixed_words=[])

Apply UnderTheSea Vietnamese word tokenization to a sentence.

	Type	Default	Details
sentence	str		Input sentence
normalize_text	bool	False	To ‘normalize’ the text before tokenization
fixed_words	list	[]	List of specific words to keep together as single tokens

For non-Vietnamese words, the results are hit-or-miss, since UnderTheSea works best on Vietnamese sentences:

text = 'This is a cat. New York city. San Francisco. New York and San Francisco Bay area. George Bush, Barrack Obama'
apply_vnmese_word_tokenize(text)

'This is a_cat . New_York city . San_Francisco . New_York and_San_Francisco Bay area . George Bush , Barrack Obama'

Here’s an example with a clean Vietnamese sentence:

text = 'Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò'
apply_vnmese_word_tokenize(text)

'Chàng trai 9X Quảng_Trị khởi_nghiệp từ nấm sò'
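
As the output above shows, the function returns a single string in which the syllables of a multi-syllable word are joined by underscores. A minimal sketch of turning such a string back into a list of words, assuming tokens are separated by whitespace; `to_word_list` is a hypothetical helper for illustration, not part of this library:

```python
def to_word_list(tokenized: str) -> list:
    """Split a tokenized sentence into words, restoring spaces inside each word."""
    # Tokens are separated by whitespace; underscores mark syllable boundaries.
    return [token.replace("_", " ") for token in tokenized.split()]


words = to_word_list('Chàng trai 9X Quảng_Trị khởi_nghiệp từ nấm sò')
print(words)
```

Note that this sketch replaces every underscore, so a token that legitimately contains one would also be changed.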

What if the sentence is not clean?

text = "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò.Anh ấy không nuôi   nấm😊. nhưng anh này nuôi. Chị ấy lại không nuôi?(ai biết tại sao 😊😊? )Rồi? Rồi sao?rồi ?Rồi ủa...chứ chị ấy nuôi gì, #mộthaiba cũng không rõ =)) 😊. Haha :) 😊 hehe 😊."

apply_vnmese_word_tokenize(text)

'Chàng trai 9X Quảng_Trị khởi_nghiệp từ nấm sò . Anh ấy không nuôi nấm 😊 . nhưng anh này nuôi . Chị ấy lại không nuôi ? ( ai biết tại_sao 😊_😊 ? ) Rồi ? Rồi sao ? rồi ? Rồi ủa ... chứ chị ấy nuôi gì , #_mộthaiba cũng không rõ =))_😊 . Haha :) 😊 hehe 😊 .'

To normalize the text before tokenization, set normalize_text=True:

apply_vnmese_word_tokenize(text, normalize_text=True)

'Chàng trai 9X Quảng_Trị khởi_nghiệp từ nấm sò . Anh ấy không nuôi nấm 😊 . nhưng anh này nuôi . Chị ấy lại không nuôi ? ( ai biết tại_sao 😊_😊 ? ) Rồi ? Rồi sao ? rồi ? Rồi ủa ... chứ chị ấy nuôi gì , #_mộthaiba cũng không rõ =))_😊 . Haha :) 😊 hehe 😊 .'
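
For readers who want an explicit cleaning step before calling the tokenizer, here is a minimal sketch using only the Python standard library. It is an assumption about what a useful pre-cleaning pass might look like (Unicode NFC composition plus whitespace collapsing) and is not necessarily what `normalize_text=True` does internally; `basic_clean` is a hypothetical helper name:

```python
import re
import unicodedata


def basic_clean(text: str) -> str:
    """Hypothetical pre-cleaning step; not the library's own normalization."""
    # Compose base letters and combining marks into single code points (NFC).
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# 'nuôi   nấm' written with combining diacritics and extra spaces
decomposed = "nuo\u0302i   na\u0302\u0301m"
# The same words with precomposed characters and a single space
composed = "nu\u00f4i n\u1ea5m"
print(basic_clean(decomposed) == composed)
```

Composing characters first can matter because visually identical Vietnamese strings may otherwise differ at the code-point level.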

We can also pass a list of specific words that should each be kept as a single token:

text = "Viện Nghiên Cứu chiến lược quốc gia về học máy"
apply_vnmese_word_tokenize(text)

'Viện Nghiên_Cứu chiến_lược quốc_gia về học máy'

apply_vnmese_word_tokenize(text, fixed_words=["Viện Nghiên Cứu", "học máy"])

'Viện_Nghiên_Cứu chiến_lược quốc_gia về học_máy'
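
To illustrate the effect of `fixed_words` in isolation, the sketch below joins each listed phrase with underscores using plain string replacement. It only mimics the visible result for the fixed phrases and is not how UnderTheSea or this library implements the feature; `join_fixed_words` is a hypothetical helper:

```python
def join_fixed_words(text: str, fixed_words: list) -> str:
    """Join each listed phrase into a single underscore-separated token."""
    # Handle longer phrases first so a short phrase cannot split a longer one.
    for phrase in sorted(fixed_words, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


text = "Viện Nghiên Cứu chiến lược quốc gia về học máy"
joined = join_fixed_words(text, ["Viện Nghiên Cứu", "học máy"])
print(joined)
```

Unlike the real tokenizer, this sketch leaves all other words untouched and matches raw substrings, so it does not respect word boundaries.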