In Section 4.2 (PI Type Classification) of our paper, we detail our methodology for identifying the types of personal information (PI) requested in each web form field. The relevant code is located in this folder.
As described in Section 4.2 -- List of PI Types, we use prelabel-gpt-freeform.py to obtain a list of PI types present in our dataset:
$ python prelabel-gpt-freeform.py ~/webform-data gpt-pi-types.jsonl
The output is saved in gpt-pi-types.jsonl:
$ shuf -n1 gpt-pi-types.jsonl
{"domain": "dslreports.com", "job_hash": "b689489d7a532483653a5c72775f2237057974d07421878c4adb0c59d1cefc14", "filename": "form-0.json", "response": ["Username", "Email Address", "Password"]}
As outlined in the paper, we manually reviewed these outputs to create the final list of PI type labels.
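Before the manual review, it can help to see how often each free-form label occurs. The following is a minimal sketch, not part of the released scripts. It assumes only the record layout shown in the sample above (one JSON object per line with a response list); the helper names count_pi_types and count_pi_types_in_file are hypothetical.

```python
import json
from collections import Counter


def count_pi_types(lines):
    """Tally free-form PI type labels across JSONL records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: "response" is a list of label strings, as in the sample.
        counts.update(record.get("response") or [])
    return counts


def count_pi_types_in_file(path):
    """Convenience wrapper that reads the JSONL file at `path`."""
    with open(path, encoding="utf-8") as f:
        return count_pi_types(f)
```

Calling count_pi_types_in_file("gpt-pi-types.jsonl").most_common(20) would then list the most frequent labels, which may make near-duplicates easier to spot.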
As discussed in Section 4.2 -- Feature Extraction and Labeling, we use extract-features.py to convert the HTML code of each web form field into a human-readable YAML string for manual labeling:
$ python extract-features.py ~/webform-data pi-unlabeled.jsonl
Each line in pi-unlabeled.jsonl is a JSON object containing the extracted YAML string and additional metadata about the web form:
$ shuf -n1 pi-unlabeled.jsonl | jq -r '.text'
phone_number
tagName: INPUT
label: Phone number *
attributes:
  type: text
  id: edit-phone-number
  autocomplete: off
isVisible: true
We use Label Studio, an open-source data labeling platform, for manual labeling. Set up a "Text Classification" project in Label Studio, and use import-to-ls.py to import the unlabeled data:
$ python import-to-ls.py -P <API_KEY> <LABEL_STUDIO_URL> <PROJECT_ID> pi-unlabeled.jsonl
In this command line:
- Replace <LABEL_STUDIO_URL> with your Label Studio instance URL (e.g., http://localhost:8080/).
- <PROJECT_ID> is the project number in the URL (e.g., if the URL is http://localhost:8080/projects/1/data, then PROJECT_ID is 1).
- <API_KEY> can be found in your account settings.
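If you prefer to derive the first two arguments from the project page URL in your browser, the rule above can be sketched as follows. This is an illustrative helper, not part of the released scripts: the function name split_label_studio_url is hypothetical, and it assumes Label Studio is served from the root of the host rather than from a sub-path.

```python
import re
from urllib.parse import urlparse


def split_label_studio_url(project_url):
    """Return (base_url, project_id) from a Label Studio project page URL."""
    parsed = urlparse(project_url)
    # The project number is the path segment that follows "/projects/".
    match = re.search(r"/projects/(\d+)(?:/|$)", parsed.path)
    if match is None:
        raise ValueError(f"no /projects/<id> segment in {project_url!r}")
    # Assumption: the instance lives at the host root (no sub-path prefix).
    base_url = f"{parsed.scheme}://{parsed.netloc}/"
    return base_url, int(match.group(1))
```

For the example URL above, the helper yields the instance URL and the project number that the scripts expect as <LABEL_STUDIO_URL> and <PROJECT_ID>.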
We also provide a helper script, merge-ls-verified-samples.py, to merge manual annotations from Label Studio back into the unlabeled .jsonl file:
$ python merge-ls-verified-samples.py -P <API_KEY> <LABEL_STUDIO_URL> <PROJECT_ID> pi-unlabeled.jsonl pi-v2.jsonl
The output pi-v2.jsonl contains both labeled and unlabeled data, which can be re-imported into Label Studio using import-to-ls.py. This is useful if the dataset changes and you wish to incorporate new data without discarding previous work.
Once labeling is complete, use train-setfit-script.py to train the PI type classifier:
$ python train-setfit-script.py -P <API_KEY> <LABEL_STUDIO_URL> <PROJECT_ID> -o model/
The trained model checkpoint will be stored in the model/latest/ folder. Keep it for future use:
$ cp -rT model/latest ~/webform-classifiers/pi-type
The trained PI type classifier is included in the released artifacts as classifier-pi-type.tar.zst. To extract the classifier:
$ mkdir -p ~/webform-classifiers/pi-type
$ tar xf classifier-pi-type.tar.zst -C ~/webform-classifiers/pi-type
In extra/label-studio_pi-types-classification.tar.zst, you can find an export of the annotations from Label Studio, provided primarily for archival purposes.
Note that some PI type names in our code and model differ from those used in the paper:
| Name in the Code/Model | Name in the Paper |
|---|---|
| LocationCityOrCoarser | Coarse Location |
| AgeOrAgeGroup | Age |
| CitizenshipOrImmigrationStatus | Immigration Status |
| UsernameOrOtherId | Online Alias |
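When comparing model output against the paper, the table can be applied programmatically. The sketch below is illustrative and not part of the released code: the names CODE_TO_PAPER_NAME and paper_name are hypothetical, and returning unlisted labels unchanged is an assumption (the table only covers the four names listed above).

```python
# Mapping copied from the table above: code/model name -> name in the paper.
CODE_TO_PAPER_NAME = {
    "LocationCityOrCoarser": "Coarse Location",
    "AgeOrAgeGroup": "Age",
    "CitizenshipOrImmigrationStatus": "Immigration Status",
    "UsernameOrOtherId": "Online Alias",
}


def paper_name(code_name):
    """Translate a code/model label into the name used in the paper."""
    # Assumption: labels that are not in the table are returned as-is.
    return CODE_TO_PAPER_NAME.get(code_name, code_name)
```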
We use the same script as in Step 4.2.1, extract-features.py, to convert the HTML code of web form fields into YAML strings for the classifier input:
$ python extract-features.py ~/webform-data pi-unlabeled.jsonl
100%|██████████| 970644/970644 [04:36<00:00, 3506.68it/s]
Use prelabel-model.py to process the YAML strings of form fields and classify the PI types:
$ python prelabel-model.py pi-unlabeled.jsonl ~/webform-classifiers/pi-type pi-labeled.jsonl
774477it [40:12, 320.98it/s]
Finally, use import-classification.py to import the PI type classification results back into the main database:
$ python import-classification.py -i pi-labeled.jsonl ~/webform-data
100%|██████████| 970644/970644 [04:37<00:00, 3493.70it/s]
The results are saved in the field_classification table:
$ sqlite3 -header ~/webform-data.db 'SELECT * FROM field_classification ORDER BY RANDOM() LIMIT 3'
domain|job_hash|form_filename|field_list
primus.ca|24771f1319126c41765064260e83614ae9121b689fc16dd9047a63439f402466|form-0.json|["PhoneNumber", "PhoneNumber", "PostalCode"]
squarespace.com|52b8a1127937903c9223687b3b9a4a9278ade734d45648e1978ddfab32604321|form-0.json|["EmailAddress", "Password"]
arin.net|6c135bee885e61ac64e875577425cad9de2119a2f79bc5d659977292d78766d4|form-0.json|["EmailAddress"]
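The same table can also be read from Python. The following minimal sketch tallies how often each PI type appears across all forms. It assumes what the sample rows above suggest, namely that field_list holds a JSON array with one label per form field; the function name count_field_types is hypothetical.

```python
import json
import sqlite3
from collections import Counter


def count_field_types(conn):
    """Count PI type labels across all rows of field_classification."""
    counts = Counter()
    query = "SELECT field_list FROM field_classification"
    for (field_list,) in conn.execute(query):
        # Assumption: field_list is JSON text, one label per form field.
        counts.update(json.loads(field_list))
    return counts


def count_field_types_in_db(path):
    """Open the SQLite database at `path` and tally its PI type labels."""
    conn = sqlite3.connect(path)
    try:
        return count_field_types(conn)
    finally:
        conn.close()
```

Pass the path of your webform-data.db file to count_field_types_in_db to get a Counter keyed by PI type.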
To evaluate the model, we created a separate validation dataset using the same procedure as for the training data. Use manual-eval.py to run the classifier on the validation dataset and generate performance metrics (as shown in Table 3 of our paper):
$ python manual-eval.py -P <API_KEY> <LABEL_STUDIO_URL> <PROJECT_ID> model/latest/
[50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50]
precision recall f1-score support
Address 0.920 1.000 0.958 46
DateOfBirth 0.900 0.938 0.918 48
......
micro avg 0.935 0.952 0.943 786
......
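As a reading aid for the report, the f1-score column is the harmonic mean of the precision and recall columns. The sketch below recomputes it for two rows copied from the output above. It is only a sanity check, not part of manual-eval.py, and because the printed inputs are already rounded to three decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the report above: (precision, recall, reported f1-score).
reported = {
    "Address": (0.920, 1.000, 0.958),
    "micro avg": (0.935, 0.952, 0.943),
}

for name, (p, r, f1) in reported.items():
    # Inputs are rounded, so compare with a small tolerance in mind.
    print(f"{name}: recomputed {f1_score(p, r):.4f}, reported {f1:.3f}")
```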
The PI type annotations for the dataset are stored in the field_classification table in the released results database (webform-data.db).