helpers

This module contains the helper functions used by the other modules.
Important

Note: This notebook contains a large collection of profane and offensive language to use as a word filter. It is not recommended for children or the highly sensitive.



get_words

 get_words (text:str)

Extracts all the words in a string using a custom regex.

Parameter  Type  Details
text       str   The text to extract words from
Returns    list
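
The exact pattern used by get_words is not shown on this page. As a minimal sketch, assuming a word is a run of word characters optionally joined by internal apostrophes, a regex-based extractor could look like this (WORD_RE and get_words_sketch are hypothetical names, not part of the module):

```python
import re

# Assumed pattern: a run of word characters, optionally joined by apostrophes.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def get_words_sketch(text: str) -> list:
    """Return every word-like token found in `text`, in order of appearance."""
    return WORD_RE.findall(text)


print(get_words_sketch("Hello, how are you? I'm fine."))
```

Punctuation and whitespace are dropped because they never match the pattern; a different definition of "word" would only require changing WORD_RE.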

The following code is adapted from this awesome blog post by C Chaitanya.



FastTextLanguageDetector

 FastTextLanguageDetector (model_path:str='/tmp/lid.176.bin')

Detects the language of a text with a fastText language identification model loaded from model_path (by default /tmp/lid.176.bin).

fasttext_model = FastTextLanguageDetector.from_pretrained()

# test spanish
lang, prob = fasttext_model.get_language("Hola, como estas?")
assert lang == "es"
assert prob > 0.9

# test english
lang, prob = fasttext_model.get_language("Hello, how are you?")
assert lang == "en"
assert prob > 0.9

# test combination
lang, prob = fasttext_model.get_language("Hello, how are you? Hola, como estas?")
assert prob < 0.9
# test with multiple lines

lang, prob = fasttext_model.get_language("Hello, how are you?\nI am fine, thank you.")
assert lang == "en"
assert prob > 0.9

lang, prob = fasttext_model.get_language("Hello, how are you?\n\nI am fine, thank you.")
assert lang == "en"
assert prob > 0.9
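
fastText's predict call processes one line at a time and rejects input that contains a newline, and it reports languages as labels such as __label__en. The multi-line tests above suggest that get_language handles both, although its implementation is not shown on this page. A minimal sketch of those two steps, using hypothetical helper names and no fastText dependency:

```python
def prepare_text(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def parse_label(label: str, prefix: str = "__label__") -> str:
    """Strip the fastText label prefix, e.g. '__label__en' becomes 'en'."""
    return label[len(prefix):] if label.startswith(prefix) else label


print(prepare_text("Hello, how are you?\n\nI am fine, thank you."))
print(parse_label("__label__en"))
```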
# check pickling works
import pickle

with open("/tmp/fasttext_model.pkl", "wb") as f:
    pickle.dump(fasttext_model, f)

with open("/tmp/fasttext_model.pkl", "rb") as f:
    pickled_fasttext_model = pickle.load(f)

lang, prob = fasttext_model.get_language("Hello, how are you?")
p_lang, p_prob = pickled_fasttext_model.get_language("Hello, how are you?")
assert lang == p_lang
assert prob == p_prob
assert pickled_fasttext_model == fasttext_model
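
Native model handles, such as a loaded fastText model, generally cannot be pickled directly. One common way to make a wrapper class picklable and comparable, as the test above requires, is to pickle only the model path and reload the model on unpickling. The sketch below illustrates that pattern with a stand-in loader; PicklableDetector and _load_model are hypothetical, and this is not necessarily how FastTextLanguageDetector is implemented:

```python
import pickle


def _load_model(path: str) -> object:
    """Stand-in for an expensive, unpicklable loader such as fasttext.load_model."""
    return lambda text: ("en", 1.0)  # a lambda cannot be pickled


class PicklableDetector:
    def __init__(self, model_path: str):
        self.model_path = model_path
        self.model = _load_model(model_path)

    def __reduce__(self):
        # Pickle only the path; __init__ reloads the model on unpickling.
        return (self.__class__, (self.model_path,))

    def __eq__(self, other):
        # Two detectors are equal when they wrap the same model file.
        return (
            isinstance(other, PicklableDetector)
            and self.model_path == other.model_path
        )


detector = PicklableDetector("/tmp/lid.176.bin")
restored = pickle.loads(pickle.dumps(detector))
assert restored == detector
assert restored.model("Hello") == detector.model("Hello")
```

Storing the path instead of the model keeps the pickle small, at the cost of requiring the model file to exist wherever the object is unpickled.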

The following code has been copied from the awesome Hugging Face Space by edugp.



SentencePiece

 SentencePiece (model:str)

Thin wrapper around a SentencePiece model, specified by the model argument.



KenlmModel

 KenlmModel (model_dataset:str, language:str, lower_case:bool=False,
             remove_accents:bool=False, normalize_numbers:bool=True,
             punctuation:int=1)

Scores text with a KenLM language model selected by model_dataset and language. The remaining arguments control how the text is normalized before scoring.
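
The constructor flags suggest that text is normalized before it is scored. The exact rules are not documented on this page; the sketch below assumes that lower_case lowercases the text, remove_accents strips combining marks after Unicode decomposition, and normalize_numbers maps every digit to 0. normalize_sketch is a hypothetical helper, and the punctuation option is left out:

```python
import re
import unicodedata


def normalize_sketch(
    text: str,
    lower_case: bool = False,
    remove_accents: bool = False,
    normalize_numbers: bool = True,
) -> str:
    if lower_case:
        text = text.lower()
    if remove_accents:
        # Decompose characters, then drop the combining marks (category Mn).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    if normalize_numbers:
        # Assumption: every digit becomes 0 so that all numbers share one form.
        text = re.sub(r"\d", "0", text)
    return text


print(normalize_sketch("¿Cómo estás en 2023?", lower_case=True, remove_accents=True))
```

Normalizing in this way reduces the vocabulary the language model has to cover, which is why the same flags should be used at scoring time as were used when the model was trained.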

To run this test, you need to have kenlm installed: pip install https://github.com/kpu/kenlm/archive/master.zip

model = KenlmModel.from_pretrained(
    model_dataset="wikipedia",
    language="en",
    lower_case=True,
    remove_accents=True,
    normalize_numbers=True,
    punctuation=1,
)

# Get perplexity
perplex_1 = model.get_perplexity("I am very perplexed")
perplex_2 = model.get_perplexity("im hella trippin")

assert perplex_1 < perplex_2
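
Perplexity is the inverse probability of a text normalized by its length, so text the model finds likely scores lower than unusual text. KenLM reports log10 probabilities, so the perplexity follows as 10 ** (-total_log10_prob / token_count). A minimal sketch with made-up log probabilities (no KenLM model is loaded, and how get_perplexity counts tokens is not shown on this page):

```python
def perplexity(log10_probs: list) -> float:
    """Perplexity from per-token log10 probabilities."""
    return 10.0 ** (-sum(log10_probs) / len(log10_probs))


# Hypothetical per-token scores, for illustration only.
likely = [-1.0, -1.0, -1.0, -1.0]
unlikely = [-3.0, -4.0, -5.0]

print(perplexity(likely))
print(perplexity(unlikely))
assert perplexity(likely) < perplexity(unlikely)
```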