Tutorial: Using another library
First off, we need to install the library.
pip install scrubadub
Now we will use the same (wikitext) dataset as in the previous tutorial.

from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")
We will use the scrubadub library to remove personal information from the text. scrubadub usually defaults to removing the following types:

* credential - username and password combinations
* credit_card - credit card numbers
* drivers_license - drivers license numbers
* email - email addresses
* national_insurance_number - GB National Insurance numbers (NINOs)
* phone - phone numbers
* postalcode - British postal codes
* social_security_number - US Social Security numbers (SSNs)
* tax_reference_number - UK PAYE temporary reference number (TRN)
* twitter - twitter handles
* url - URLs
* vehicle_license_plate - British vehicle license plates
However, while experimenting with the library, it seems some of these are not enabled by default. Either way, we are only going to focus on the credit_card, drivers_license, email, phone, and social_security_number detectors. Therefore, we must turn the others off:
from scrubadub import Scrubber
from scrubadub.detectors import CredentialDetector, TwitterDetector, UrlDetector
scrubber = Scrubber()
scrubber.remove_detector(CredentialDetector)
scrubber.remove_detector(TwitterDetector)
scrubber.remove_detector(UrlDetector)
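As a quick sanity check that the scrubber behaves as expected (the sample string is our own; scrubadub substitutes placeholders such as {{EMAIL}} for matches, though the exact placeholder text may vary by version):

# Quick sanity check on our own sample string.
print(scrubber.clean("Contact Jane at jane.doe@example.com"))
# -> 'Contact Jane at {{EMAIL}}' (placeholder format may vary by version)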
datasources = [
    {
        "dataset": ds,
        "name": "wikitext",
        "columns": ["text"],
        "filters": [],
        "cleaners": [scrubber.clean],
    },
    # ...
]
Essentially, any function that takes in a string and returns a string will work out of the box with squeakily. Luckily for us, scrubadub has a clean function that does just that. We can use this function to remove personal information from the text!
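For instance, here is a minimal sketch of a custom cleaner of our own (the normalize_whitespace name is hypothetical, not part of squeakily or scrubadub); since it takes a string and returns a string, it can be dropped into a "cleaners" list as-is:

# A hypothetical custom cleaner: any str -> str callable plugs in.
def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())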
A similar process can be used for filters, except the return type is a bool instead of a str, denoting whether or not the text should be kept.
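For example, a minimal sketch of a custom filter (the check_length name and the 20-character threshold are our own, not squeakily's):

# A hypothetical custom filter: return True to keep the text.
def check_length(text: str) -> bool:
    return len(text.strip()) >= 20

This could then be added to the "filters" list of a datasource, just like the cleaners above.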
Note: If you want to mix and match squeakily's built-in functions with external ones, it is super easy!
from squeakily.clean import remove_empty_lines, remove_ip
datasources = [
    {
        "dataset": ds,
        "name": "wikitext",
        "columns": ["text"],
        "filters": [],
        "cleaners": [scrubber.clean, remove_empty_lines, remove_ip],
    },
    # ...
]
Now we can process the datasources as before with a Pipeline object.
from squeakily.core import Pipeline
pipeline = Pipeline(datasources)
pipeline.run()
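Once run() finishes, the filters and cleaners have been applied to each listed column. One way to eyeball the result (this assumes, per our reading of squeakily, that the datasets in datasources are updated in place; check the docs for your version):

# Inspect the processed split (assumes in-place updates to datasources).
print(datasources[0]["dataset"][0]["text"])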