Mueller Report Redaction
The DOJ failed to apply best practices, doing a disservice to researchers.
While the discussion surrounding the Justice Department’s release of their redacted version of the Mueller report rightly focused on the substantive issues of the investigation, many of my Twitter feed were also complaining that it was released as a non-searchable PDF file. Duff Johnson, president of something called the PDF Association, weighs in with “A Technical and Cultural Assessment of the Mueller Report PDF.”
His “Key Take-Aways”:
- If Mueller delivered a “born digital” PDF to Justice, that file was printed and scanned back into a set of low-quality images for release; a disservice to all future users of the document, and also a violation of Section 508 regulations.
- If Mueller delivered a paper document to Justice which was subsequently scanned, DoJ’s treatment of the document is more understandable, but still non-conforming with Section 508.
- Irrespective of the evidence and conclusions about the Trump campaign, the Special Counsel’s report showcases the essential qualities of static, self-contained, reliable, sharable PDF in a world that increasingly runs on HTML.
It would be bizarre, indeed, if Mueller had delivered his report only as a paper document. But either that’s what happened or DOJ used some rather primitive means of redacting the report for no good reason.
Regardless, an amazing amount of information is discoverable from the PDF’s metadata. Johnson tells us,
From a PDF technology perspective the file uses PDF 1.6 technology. It is of acceptable quality, but does not conform to ISO 19005 (PDF/A), the archival standard for PDF files. It is not digitally signed or encrypted for security.
Based on its metadata, the PDF released by the Department of Justice was produced using Ricoh MP 6C502 software, probably a typical office network copier / printer. The file was produced on April 17 after 6:23 pm.
The document consists of 448 200 dpi RGB (color) images all 2200 x 1700 pixels in size. The images were compressed with lossy compression more appropriate to photographs than to text. This is the cause of the “noise” associated with the text.
Analysis: The fact that DOJ chose to deliver an “images only” PDF forces a much larger file-size and loss of searchable text. Effectively, this process “dumbed down” the PDF to a set of images – the same type of content that comes out of a scanner. Admittedly, it is also a crude but effective means of ensuring (beyond redaction) that nothing is released besides images of pages… but the redaction software available to DoJ (see below) is fully effective at redacting born-digital PDF files, so image conversion was unnecessary.
From the scanner artifacts left on the images (e.g. the horizontal yellow streak and the gray vertical streak on the right edge) and the voluminous compression artifacts, we assess that the document has certainly been scanned and compressed at least once and more probably twice.
Although DoJ did not OCR the report prior to its release, those downloading the file are free to use their own OCR. Results will not be ideal or identical since the source images are of relatively low quality. In particular, OCR errors will be more common adjacent to underlines and redactions.
Analysis: We assess that the document was most likely scanned twice, with redactions being added to the first scanned document using software. This implies that the document may have been provided to DoJ on paper rather than as an electronic document. If it was provided by Mueller to DoJ electronically, then printing it just to scan it back into another, far larger and less capable PDF is difficult to understand.
Indeed. Regardless, the results are less than ideal from a usability standpoint.
In addition to not being searchable, the file contains no text, is not tagged, and is therefore not accessible to disabled users.
The US Department of Justice has a clear policy of ensuring that public documents comply with Section 508 regulations, and are therefore accessible to users with disabilities. The Mueller Report PDF does not conform with these regulations.
If the Mueller report was delivered to DoJ as a high-quality born-digital PDF, it would have been tagged from the outset. DoJ could have easily redacted it without resorting to printing the result and and re-scanning the printed paper.
Analysis: If Mueller had delivered a paper document instead of a PDF, then DoJ’s process, while not best practice or even within the regulations, is more understandable due to time pressures. If Mueller had delivered a high-quality PDF, however, then it’s exceptionally unfortunate that DoJ chose to “dumb it down” when processing and releasing it.
Oddly, despite the ham-handed delivery format, DOJ applied sophisticated techniques to the redaction itself:
Due to their consistency and regularity of form and application, it’s clear that the redactions were performed by software rather than manual methods (i.e., to a printed document). The redaction implementation (style, spacing, label) is completely consistent throughout the document, indicating expert use of professional-class redaction software.
Using high-quality redaction software allows organizations to collaborate effectively on such projects, ensuring that the type of redaction used, as well as the color-codes and other features, are consistent for all collaborators. It is to be expected that DoJ possesses and is expert in the use of such software.
Instead of delivering “native” redactions, however, it’s obvious that DoJ printed and then scanned the document after it was redacted. We know this because on many pages a scanner artifact (the faint yellow line) crosses a redacted area. This deliberate and unnecessary act made the document substantially harder for anyone and everyone to use, forever.
Analysis: I asked Mark Gavin, CTO of Appligent Document Solutions, and the developer of the first PDF redaction tool, for his comments on the redaction method used in the Mueller Report. Mark said:
“Native PDF redaction has been available now for more than 20 years, yet this document is just images of redacted pages. As such, there is no searchable text, the document will not reflow on different devices and most importantly this document is not Section 508 compliant. The document cannot be read by a screen reader for people with visual disabilities and it cannot be analyzed using any text analysis tools. The Mueller Report as a redacted PDF document is really kind of sad.”
Johnson’s conclusion on the technical matters:
It’s interesting – and deeply unfortunate – that DoJ clearly used advanced redaction software but nonetheless chose to deliver a paper-age “images only” PDF. In so doing they:
- Dramatically increased the file’s size, probably by 8-10x.
- Permanently and substantially reduced the visual text and image quality of a document of historical interest
- Permanently reduced text searchability (assuming they received a searchable PDF from Mueller)
- Delivered a documents that’s inherently inaccessible to users who require assistive technology (AT) in order to read, requiring substantial remediation efforts to recover any useful degree of accessibility, let alone full compliance with applicable regulations.
Johnson provides no speculation of nefarious intent here and neither do experts on my Twitter feed. Given that Barr has gone out of his way to spin Mueller’s findings in ways favorable to President Trump, it would be plausible that would provide a less-than-user-friendly release if it were helpful to the President. But I can think of no benefit in handling the redactions in this manner.
In addition to the technical dissection, Johnson spends quite a bit of time on the “cultural” aspects of using the PDF format. It’s mostly an ode to the file type but it’s a fair one. He begins:
Everyone knew that the US Department of Justice and Attorney General Barr would release the Mueller Report as a PDF file.
In fact, it’s safe to say that AG Barr never considered delivering anything else. No one would have even suggested a Word file, or a set of TIFF images, or a website, or an XPS file, or EPUB, or plain text. It’s 2019, but it seems safe to say that they simply assumed they’d use PDF.
It’s also somewhat amusing because, in the early days of the blog, my hatred of PDFs on the web was a running joke of sorts. In those days, PDF files were almost all of the scanned image variety and thus slow-to-load, not searchable, and not cut-and-pasteable (at least for the vast majority of us who relied on Acrobat Reader or other non-premium software). In the intervening years, though, all of those defects have been overcome and PDF is now a universal standard—the only way to ensure everyone sees the same document in the same format, same page numbering, etc.
Additionally, Johnson observes,
Once he was done writing and editing, Mueller needed to unambiguously “freeze” or fix his document for the purposes of submitting a report. PDF is the only mainstream document format offering this capability.
Why is the fixed nature (“rendering”) so important? It contains the clues humans use to judge authenticity, such as layout, formatting, dates, logos and signatures, and in many other, more subtle ways.
Everyone knows this, which is why people exchange contracts rather than simply share access to a wiki page. The need for a rendering made it easy to predict ahead of time that Barr would release the Mueller report as a PDF, and would never have considered converting its text to DOCX, or posting the text as HTML on a website.
In releasing the redacted PDF of the report to the public, Barr avoids suspicion that the document had been edited (changed) in addition to straightforward redactions. PDF serves the need to unambiguously assure the press and the public that they are seeing Mueller’s actual report.
Correct. Which is why it’s so hard for me to believe that Mueller would have delivered only a hard copy of the report. A PDF is simply de rigueur.