# x-ray: A tool to detect whether a PDF has a bad redaction
`x-ray` is a Python library for finding bad redactions in PDF documents.

At Free Law Project, we collect millions of PDFs. An ongoing problem is that people fail to properly redact things. Instead of doing it the right way, they just draw a black rectangle or a black highlight on top of black text and call it a day. When that happens, you can simply select the text under the rectangle and read it again. Not great.

After witnessing this problem for years, we decided it would be good to figure out how common it is, so, with some help, we built this simple tool. You give the tool the path to a PDF, and it tells you if it has worthless redactions in it.

Right now, x-ray works pretty well, and we are using it to analyze documents in our collections. It could be better, though. Bad redactions take many forms. See the issues tab for examples we don't yet support. We'd love your help solving some of the tougher cases.

## Installation

With uv, do:

```sh
uv add x-ray
```

With pip, that'd be:

```sh
pip install x-ray
```

`uvx` lets you run this without even installing it. For example, here's an amicus brief we filed that doesn't have any bad redactions:

```sh
uvx --from x-ray xray https://storage.courtlistener.com/recap/gov.uscourts.ca3.125346/gov.uscourts.ca3.125346.45.0.pdf
{}
```

## Usage

Once installed, you can easily use x-ray on the command line:

```sh
% xray path/to/your/file.pdf
{
  "1": [
    {
      "bbox": [
        58.550079345703125,
        72.19873046875,
        75.65007781982422,
        739.3987426757812
      ],
      "text": "The Ring travels by way of Cirith Ungol"
    }
  ]
}
```

Or if you have the file on a server somewhere, give it a URL. If the argument starts with `https://`, it will be interpreted as a PDF to download. Here's congressional testimony our director gave (it doesn't have any bad redactions):

```sh
% xray https://free.law/pdf/congressional-testimony-michael-lissner-free-law-project-hearing-on-ethics-and-transparency-2021-10-26.pdf
{}
```

A fun trick you can do is to make a file with one URL per line, call it `urls.txt`, and then run this to check each URL:

```sh
xargs -n 1 xray < urls.txt
```

However you run `xray` on the command line, you'll get JSON as output, which you can then use with tools like `jq`. The format is as follows:

 - It's a dict.
 - The keys are page numbers.
 - Each page number maps to a list of dicts.
 - Each of those dicts has two keys. The first key is `bbox`, a four-tuple giving the x,y positions of the upper-left corner and then the lower-right corner of the bad redaction. The second key is `text`, the text under the bad rectangle.

Simple enough.

You can also run it as a Python module, if you prefer the long form:

```sh
% python -m xray some-file.pdf
```

But that's not as easy to remember.

If you want a bit more, you can, of course, use `xray` in Python:

```python
from pprint import pprint

import xray

bad_redactions = xray.inspect("some/path/to/your/file.pdf")  # Pathlib works too
pprint(bad_redactions)
```

Which prints:

```
{1: [{'bbox': (58.550079345703125,
               72.19873046875,
               75.65007781982422,
               739.3987426757812),
      'text': 'The Ring travels by way of Cirith Ungol'}]}
```
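If you want to act on those results programmatically, the dictionary described above is easy to walk. Here's a minimal sketch, assuming the output format documented above; the file path is just a placeholder:

```python
import xray

# Placeholder path; any local PDF (or a pathlib.Path) works here.
bad_redactions = xray.inspect("some/path/to/your/file.pdf")

if not bad_redactions:
    print("No bad redactions found.")

for page_number, redactions in bad_redactions.items():
    for redaction in redactions:
        # bbox holds the upper-left and lower-right corners of the bad redaction.
        x0, y0, x1, y1 = redaction["bbox"]
        print(f"Page {page_number} at ({x0:.0f}, {y0:.0f}): {redaction['text']!r}")
```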
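The batch-checking idea behind the `urls.txt` trick also translates to Python. This is a sketch, not part of the library: the `pdfs/` directory is hypothetical, and it relies only on `inspect` accepting local paths (including `pathlib.Path` objects), as shown above:

```python
import json
from pathlib import Path

import xray

# Hypothetical folder of PDFs to audit; point this at your own collection.
pdf_dir = Path("pdfs")

report = {}
for pdf_path in sorted(pdf_dir.glob("*.pdf")):
    bad_redactions = xray.inspect(pdf_path)  # pathlib.Path is fine, per the docs
    if bad_redactions:
        report[pdf_path.name] = bad_redactions

# Same shape as the CLI output, keyed by filename, ready for jq and friends.
print(json.dumps(report, indent=2))
```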