Tutorial: Using another library

This tutorial shows how to use another library in a notebook. We will use the scrubadub library to remove personal information from text.

First off, we need to install the library:

pip install scrubadub

Now we will use the same (wikitext) dataset as in the previous tutorial.

from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")
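
As a quick sanity check, we can peek at the first record of the slice (the exact content you see will depend on the dataset version):

print(ds[0]["text"])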

We will use the scrubadub library to remove personal information from the text. scrubadub usually defaults to removing the following types:

* credential - username and password combinations
* credit_card - credit card numbers
* drivers_license - drivers license numbers
* email - email addresses
* national_insurance_number - GB National Insurance numbers (NINOs)
* phone - phone numbers
* postalcode - British postal codes
* social_security_number - US Social Security numbers (SSNs)
* tax_reference_number - UK PAYE temporary reference numbers (TRNs)
* twitter - Twitter handles
* url - URLs
* vehicle_license_plate - British vehicle license plates

However, while experimenting with the library, it seems that some of these are not enabled by default. Either way, we are only going to focus on the credit_card, drivers_license, email, phone, and social_security_number detectors, so we turn the others off:

from scrubadub import Scrubber
from scrubadub.detectors import CredentialDetector, TwitterDetector, UrlDetector

scrubber = Scrubber()

# Turn off the detectors we are not interested in.
scrubber.remove_detector(CredentialDetector)
scrubber.remove_detector(TwitterDetector)
scrubber.remove_detector(UrlDetector)
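
As a quick smoke test, we can run the scrubber on a made-up string; {{EMAIL}} is scrubadub's default replacement placeholder:

# The email address below is invented for illustration.
print(scrubber.clean("Contact me at jane.doe@example.com"))
# Contact me at {{EMAIL}}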

datasources = [
    {
        "dataset": ds,  # the Hugging Face dataset to process
        "name": "wikitext",  # an identifier for this datasource
        "columns": ["text"],  # the columns to filter and clean
        "filters": [],  # no filters for now
        "cleaners": [scrubber.clean],  # str -> str cleaning functions
    },
    # ...
]

Essentially, any function that takes in a string and returns a string will work out of the box with squeakily as a cleaner. Luckily for us, the scrubber's clean method does just that, so we can use it to remove personal information from the text!
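
For example, here is a minimal sketch of a custom cleaner; normalize_whitespace is our own hypothetical function, not part of squeakily or scrubadub:

def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

# It can be dropped into a datasource's "cleaners" list alongside scrubber.clean:
# "cleaners": [scrubber.clean, normalize_whitespace],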

A similar process can be used for filters, except the return type is a bool instead of a str, denoting whether the text should be kept.
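
As a sketch, a hypothetical filter of our own (long_enough is not a squeakily built-in) might keep only non-trivial texts:

def long_enough(text: str) -> bool:
    # Keep the text only if it is at least 10 characters long.
    return len(text) >= 10

# In a datasource, this would go in the "filters" list:
# "filters": [long_enough],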

Note: If you want to mix and match, it is super easy!

from squeakily.clean import remove_empty_lines, remove_ip
datasources = [
    {
        "dataset": ds,
        "name": "wikitext",
        "columns": ["text"],
        "filters": [],
        "cleaners": [scrubber.clean, remove_empty_lines, remove_ip],
    },
    # ...
]

Now we can process the datasources as before with a Pipeline object.

from squeakily.core import Pipeline

# Apply the filters and cleaners to each datasource.
pipeline = Pipeline(datasources)
pipeline.run()
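
Assuming the pipeline updates the dataset objects stored in each datasource in place, as in the previous tutorial (an assumption worth verifying against the squeakily docs), the cleaned text can then be inspected directly:

# Assumes pipeline.run() updated the dataset stored in the datasource.
cleaned_ds = datasources[0]["dataset"]
print(cleaned_ds[0]["text"])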