Ignore uniform dates under redactions #30

Open
mlissner opened this issue Sep 11, 2021 · 1 comment

Comments

@mlissner
Member

It seems to be common to put dates under the redaction boxes, as you can see in the highlighted screenshot below:

[Screenshot from 2021-09-11 11-21-13]

Note that the date isn't semantically relevant to the sentence. Here is the text found under the redactions throughout this document:

{2: [{'bbox': (390.3498229980469,
               536.0278930664062,
               415.180419921875,
               552.8250122070312),
      'text': '03/23/2019'}],
 20: [{'bbox': (434.0060119628906,
                293.506103515625,
                446.1649169921875,
                307.0159912109375),
       'text': '03/23/2'}],
 29: [{'bbox': (197.58200073242188,
                75.3205795288086,
                224.60189819335938,
                89.5059814453125),
       'text': '03/23/2019'},
      {'bbox': (232.70700073242188,
                75.31907653808594,
                269.1838073730469,
                88.8289794921875),
       'text': '03/23/2019'},
      {'bbox': (278.6400146484375,
                75.99359130859375,
                319.1697998046875,
                87.47698974609375),
       'text': '03/23/2019'},
      {'bbox': (348.2170104980469,
                75.3205795288086,
                421.17059326171875,
                89.5059814453125),
       'text': '03/23/2019'},

There's a pattern here: the text is always the same date (one entry is a clipped '03/23/2'). When this is the case, we should nuke all such redactions from our list as false positives.
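
A rough sketch of what that filter might look like, assuming the redactions come in the dict shape shown above. The function name `drop_uniform_dates`, the `min_count` threshold, and the MM/DD/YYYY pattern are all hypothetical choices made for illustration, not existing code in this project:

```python
import re
from collections import Counter

# Assumed date shape (MM/DD/YYYY), based only on the sample output above.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def drop_uniform_dates(redactions, min_count=3):
    """Remove redaction hits whose text is one repeated date.

    `redactions` maps a key to a list of {'bbox': ..., 'text': ...} dicts,
    as in the output above. `min_count` is a hypothetical threshold for how
    often the date must recur before it is treated as a stamp. The input is
    returned unchanged when no single repeated date is found.
    """
    texts = [hit["text"].strip() for hits in redactions.values() for hit in hits]
    full_dates = Counter(t for t in texts if DATE_RE.fullmatch(t))

    # Only act when exactly one distinct date appears, and often enough.
    if len(full_dates) != 1:
        return redactions
    date, count = next(iter(full_dates.items()))
    if count < min_count:
        return redactions

    def is_stamp(text):
        # Also catch clipped copies of the date, such as '03/23/2'.
        text = text.strip()
        return len(text) >= 5 and date.startswith(text)

    cleaned = {}
    for key, hits in redactions.items():
        kept = [hit for hit in hits if not is_stamp(hit["text"])]
        if kept:
            cleaned[key] = kept
    return cleaned


# Bounding boxes rounded from the output above for brevity.
sample = {
    2: [{"bbox": (390.3, 536.0, 415.2, 552.8), "text": "03/23/2019"}],
    20: [{"bbox": (434.0, 293.5, 446.2, 307.0), "text": "03/23/2"}],
    29: [
        {"bbox": (197.6, 75.3, 224.6, 89.5), "text": "03/23/2019"},
        {"bbox": (232.7, 75.3, 269.2, 88.8), "text": "03/23/2019"},
    ],
}
print(drop_uniform_dates(sample))  # {}
```

Any non-date text under a redaction is kept, and nothing is dropped if two different dates show up. A document whose genuinely redacted text is one repeated date would be dropped too, so a filter like this trades false positives for possible false negatives.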

gov.uscourts.cacd.45170.569.9_2.pdf

@mlissner
Member Author

I wrote some code for this over the weekend, but shelved it. I decided it's better to have false positives than false negatives, and there was no way to avoid false negatives (imagine a doc where actual dates are redacted).

I'd like to see how many false positives this kind of thing creates. If it's a common pattern to put date strings into redactions, maybe we can take another swing at this.
