Mueller Report Redaction

The DOJ failed to apply best practices, doing a disservice to researchers.

While the discussion surrounding the Justice Department’s release of its redacted version of the Mueller report rightly focused on the substantive issues of the investigation, many in my Twitter feed were also complaining that it was released as a non-searchable PDF file. Duff Johnson, president of something called the PDF Association, weighs in with “A Technical and Cultural Assessment of the Mueller Report PDF.”

His “Key Take-Aways”:

  • If Mueller delivered a “born digital” PDF to Justice, that file was printed and scanned back into a set of low-quality images for release; a disservice to all future users of the document, and also a violation of Section 508 regulations.
  • If Mueller delivered a paper document to Justice which was subsequently scanned, DoJ’s treatment of the document is more understandable, but still non-conforming with Section 508.
  • Irrespective of the evidence and conclusions about the Trump campaign, the Special Counsel’s report showcases the essential qualities of static, self-contained, reliable, sharable PDF in a world that increasingly runs on HTML.

It would be bizarre, indeed, if Mueller had delivered his report only as a paper document. But either that’s what happened or DOJ used some rather primitive means of redacting the report for no good reason.

Regardless, an amazing amount of information is discoverable from the PDF’s metadata. Johnson tells us,

From a PDF technology perspective the file uses PDF 1.6 technology. It is of acceptable quality, but does not conform to ISO 19005 (PDF/A), the archival standard for PDF files. It is not digitally signed or encrypted for security.

Based on its metadata, the PDF released by the Department of Justice was produced using Ricoh MP 6C502 software, probably a typical office network copier / printer. The file was produced on April 17 after 6:23 pm.
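
An aside on the metadata Johnson describes: the version header and the date fields are plain text inside a PDF, so they are easy to inspect. Below is a minimal Python sketch, assuming the file begins with the standard `%PDF-x.y` header and that dates use the `D:YYYYMMDDHHmmSS` form defined in the PDF specification. The function names and the sample values are hypothetical illustrations, not data taken from the actual report file.

```python
import re
from datetime import datetime
from typing import Optional


def pdf_version(first_bytes: bytes) -> Optional[str]:
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    match = re.match(rb"%PDF-(\d\.\d)", first_bytes)
    return match.group(1).decode("ascii") if match else None


def parse_pdf_date(value: str) -> datetime:
    """Parse the leading 'D:YYYYMMDDHHmmSS' part of a PDF date string.

    Any trailing time-zone suffix is ignored in this sketch.
    """
    digits = value[2:] if value.startswith("D:") else value
    return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")


# Made-up sample values that only illustrate the two formats.
sample_header = b"%PDF-1.6\n"
sample_date = "D:20200102030405+01'00'"

print(pdf_version(sample_header))   # 1.6
print(parse_pdf_date(sample_date))  # 2020-01-02 03:04:05
```

Reading the producer string (the copier model Johnson mentions) takes a real PDF parser, since the Info dictionary is not at a fixed position in the file.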

[…]

The document consists of 448 200 dpi RGB (color) images all 2200 x 1700 pixels in size. The images were compressed with lossy compression more appropriate to photographs than to text. This is the cause of the “noise” associated with the text.
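
The figures in that paragraph can be checked with a little arithmetic. The sketch below uses only the numbers Johnson quotes, plus one assumption of mine: 3 bytes per pixel (8-bit RGB) for the uncompressed size. The function names are hypothetical.

```python
PAGES = 448                        # page images, as quoted
WIDTH_PX, HEIGHT_PX = 2200, 1700   # pixel dimensions, as quoted
DPI = 200                          # scan resolution, as quoted
BYTES_PER_PIXEL = 3                # assumption: 8-bit RGB, no alpha channel


def page_inches(width_px: int, height_px: int, dpi: int) -> tuple:
    """Convert pixel dimensions to physical inches at the given resolution."""
    return width_px / dpi, height_px / dpi


def raw_bytes(pages: int, width_px: int, height_px: int, bpp: int) -> int:
    """Total size of all page images before any compression."""
    return pages * width_px * height_px * bpp


print(page_inches(WIDTH_PX, HEIGHT_PX, DPI))  # (11.0, 8.5)

total = raw_bytes(PAGES, WIDTH_PX, HEIGHT_PX, BYTES_PER_PIXEL)
print(f"{total / 1e9:.2f} GB uncompressed")   # 5.03 GB uncompressed
```

The dimensions work out to US Letter paper, and a raw total of about 5 GB suggests why a scanner would compress the pages heavily, at the cost of the noise Johnson describes.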

Analysis:  The fact that DOJ chose to deliver an “images only” PDF forces a much larger file-size and loss of searchable text. Effectively, this process “dumbed down” the PDF to a set of images – the same type of content that comes out of a scanner. Admittedly, it is also a crude but effective means of ensuring (beyond redaction) that nothing is released besides images of pages… but the redaction software available to DoJ (see below) is fully effective at redacting born-digital PDF files, so image conversion was unnecessary.

From the scanner artifacts left on the images (e.g. the horizontal yellow streak and the gray vertical streak on the right edge) and the voluminous compression artifacts, we assess that the document has certainly been scanned and compressed at least once and more probably twice.

Although DoJ did not OCR the report prior to its release, those downloading the file are free to use their own OCR. Results will not be ideal or identical since the source images are of relatively low quality. In particular, OCR errors will be more common adjacent to underlines and redactions.

Analysis: We assess that the document was most likely scanned twice, with redactions being added to the first scanned document using software. This implies that the document may have been provided to DoJ on paper rather than as an electronic document. If it was provided by Mueller to DoJ electronically, then printing it just to scan it back into another, far larger and less capable PDF is difficult to understand.

Indeed. Regardless, the results are less than ideal from a usability standpoint.

In addition to not being searchable, the file contains no text, is not tagged, and is therefore not accessible to disabled users.

The US Department of Justice has a clear policy of ensuring that public documents comply with Section 508 regulations, and are therefore accessible to users with disabilities. The Mueller Report PDF does not conform with these regulations.

If the Mueller report was delivered to DoJ as a high-quality born-digital PDF, it would have been tagged from the outset. DoJ could have easily redacted it without resorting to printing the result and re-scanning the printed paper.

Analysis:  If Mueller had delivered a paper document instead of a PDF, then DoJ’s process, while not best practice or even within the regulations, is more understandable due to time pressures. If Mueller had delivered a high-quality PDF, however, then it’s exceptionally unfortunate that DoJ chose to “dumb it down” when processing and releasing it.

Oddly, despite the ham-handed delivery format, DOJ applied sophisticated techniques to the redaction itself:

Due to their consistency and regularity of form and application, it’s clear that the redactions were performed by software rather than manual methods (i.e., to a printed document). The redaction implementation (style, spacing, label) is completely consistent throughout the document, indicating expert use of professional-class redaction software.

Using high-quality redaction software allows organizations to collaborate effectively on such projects, ensuring that the type of redaction used, as well as the color-codes and other features, are consistent for all collaborators. It is to be expected that DoJ possesses and is expert in the use of such software.

Instead of delivering “native” redactions, however, it’s obvious that DoJ printed and then scanned the document after it was redacted. We know this because on many pages a scanner artifact (the faint yellow line) crosses a redacted area. This deliberate and unnecessary act made the document substantially harder for anyone and everyone to use, forever.

Analysis: I asked Mark Gavin, CTO of Appligent Document Solutions, and the developer of the first PDF redaction tool, for his comments on the redaction method used in the Mueller Report. Mark said:

“Native PDF redaction has been available now for more than 20 years, yet this document is just images of redacted pages.  As such, there is no searchable text, the document will not reflow on different devices and most importantly this document is not Section 508 compliant.  The document cannot be read by a screen reader for people with visual disabilities and it cannot be analyzed using any text analysis tools. The Mueller Report as a redacted PDF document is really kind of sad.”
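
To make the distinction concrete, here is a toy Python model of the two approaches. It is a conceptual sketch only: the `Page` class, both functions, and the sample sentence are hypothetical, and real PDF redaction tools work on page content streams rather than plain strings.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str  # stand-in for the page's searchable text layer


def overlay_redact(page: Page, start: int, end: int) -> dict:
    """Cosmetic redaction: draw a box but leave the text layer intact."""
    return {"text": page.text, "boxes": [(start, end)]}


def native_redact(page: Page, start: int, end: int) -> dict:
    """True redaction: remove the covered characters, then draw the box."""
    blanked = page.text[:start] + "\u2588" * (end - start) + page.text[end:]
    return {"text": blanked, "boxes": [(start, end)]}


# Hypothetical sample sentence and name.
page = Page("Witness Jane Doe met the committee.")
start = page.text.index("Jane Doe")
end = start + len("Jane Doe")

print("Jane Doe" in overlay_redact(page, start, end)["text"])  # True
print("Jane Doe" in native_redact(page, start, end)["text"])   # False
print("committee" in native_redact(page, start, end)["text"])  # True
```

In this model the natively redacted page keeps the rest of its text searchable while the covered words are gone for good, which is the property Gavin says the image-only release gave up.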

Johnson’s conclusion on the technical matters:

It’s interesting – and deeply unfortunate – that DoJ clearly used advanced redaction software but nonetheless chose to deliver a paper-age “images only” PDF. In so doing they:

  • Dramatically increased the file’s size, probably by 8-10x.
  • Permanently and substantially reduced the visual text and image quality of a document of historical interest
  • Permanently reduced text searchability (assuming they received a searchable PDF from Mueller)
  • Delivered a document that’s inherently inaccessible to users who require assistive technology (AT) in order to read, requiring substantial remediation efforts to recover any useful degree of accessibility, let alone full compliance with applicable regulations.

Johnson provides no speculation of nefarious intent here and neither do experts on my Twitter feed. Given that Barr has gone out of his way to spin Mueller’s findings in ways favorable to President Trump, it would be plausible that he would provide a less-than-user-friendly release if it were helpful to the President. But I can think of no benefit in handling the redactions in this manner.

In addition to the technical dissection, Johnson spends quite a bit of time on the “cultural” aspects of using the PDF format. It’s mostly an ode to the file type but it’s a fair one. He begins:

Everyone knew that the US Department of Justice and Attorney General Barr would release the Mueller Report as a PDF file.

In fact, it’s safe to say that AG Barr never considered delivering anything else. No one would have even suggested a Word file, or a set of TIFF images, or a website, or an XPS file, or EPUB, or plain text. It’s 2019, but it seems safe to say that they simply assumed they’d use PDF.

That’s right.

It’s also somewhat amusing because, in the early days of the blog, my hatred of PDFs on the web was a running joke of sorts. In those days, PDF files were almost all of the scanned image variety and thus slow-to-load, not searchable, and not cut-and-pasteable (at least for the vast majority of us who relied on Acrobat Reader or other non-premium software). In the intervening years, though, all of those defects have been overcome and PDF is now a universal standard—the only way to ensure everyone sees the same document in the same format, same page numbering, etc.

Additionally, Johnson observes,

Once he was done writing and editing, Mueller needed to unambiguously “freeze” or fix his document for the purposes of submitting a report. PDF is the only mainstream document format offering this capability.

Why is the fixed nature (“rendering”) so important? It contains the clues humans use to judge authenticity, such as layout, formatting, dates, logos and signatures, and in many other, more subtle ways.

Everyone knows this, which is why people exchange contracts rather than simply share access to a wiki page. The need for a rendering made it easy to predict ahead of time that Barr would release the Mueller report as a PDF, and would never have considered converting its text to DOCX, or posting the text as HTML on a website.

In releasing the redacted PDF of the report to the public, Barr avoids suspicion that the document had been edited (changed) in addition to straightforward redactions. PDF serves the need to unambiguously assure the press and the public that they are seeing Mueller’s actual report.

Correct. Which is why it’s so hard for me to believe that Mueller would have delivered only a hard copy of the report. A PDF is simply de rigueur.

James Joyner
James Joyner is a Security Studies professor at Marine Corps University's Command and Staff College and a nonresident senior fellow at the Scowcroft Center for Strategy and Security at the Atlantic Council. He's a former Army officer and Desert Storm vet. Views expressed here are his own. Follow James on Twitter @DrJJoyner.

Comments

  1. mattbernius says:

    Thanks for finding this. It totally speaks to my publishing tech geeky side.

    I had some similar thoughts when I first saw the PDF… In particular about why they went for clearly scanned pages and didn’t do the OCR to make it searchable.

    I suspect that part of the reason was, theoretically, security. I am not sure how well text PDFs handle redactions and there could be a concern (possibly fair depending on who rendered the PDF) that if the redactions were handled electronically, they could somehow be hacked and removed.

    Or at least create the illusion that they were unredacted.

  2. mattbernius says:

    Also, the fact those choices slowed down analysis was a feature, not a bug.

  3. Kit says:

    I wish I had that feral energy, so common on the Right, to take this and demonstrate how it’s a shock, a scandal, and a conspiracy of the deep state, and as such serves as proof that my enemies must be prosecuted.

  4. Dave Schuler says:

    All federal departments and agencies have an official policy of Section 508 compliance. Sadly, it’s routinely violated. Quis custodiet ipsos custodes?

  5. OzarkHillbilly says:

    The US Department of Justice has a clear policy of ensuring that public documents comply with Section 508 regulations, and are therefore accessible to users with disabilities. The Mueller Report PDF does not conform with these regulations.

    At some point, one has to come to the conclusion that with the people in the trump admin, violating rules, regulations, and laws is just how they roll.

  6. Blue Galangal says:

    it would be plausible that would provide a less-than-user-friendly release if it were helpful to the President. But I can think of no benefit in handling the redactions in this manner.

    “Petty” and “venal” are two reasons that spring to mind. As for benefits: to make it harder for analysts, journalists, and interested parties to search. No question. The longer they can keep their “NO COLLUSION!” narrative in the forefront – in a 24 hour news cycle – the better, as far as they’re concerned.

  7. Blue Galangal says:

    @mattbernius: Possibly, but we are also talking about people without a deep understanding of technology (as Dr. Joyner clearly shows). At least from Adobe Acrobat 9, redactions are permanent. You have to click through about 10 warnings reminding you of that fact every time you use the feature. My sense – not being more tech-savvy than I need to be for my position – is that Acrobat removes the actual text and leaves a “blank” space, it doesn’t just cover words with black “marker” anymore.

  8. James Joyner says:

    @OzarkHillbilly:

    At some point, one has to come to the conclusion that with the people in the trump admin, violating rules, regulations, and laws is just how they roll.

    @Blue Galangal:

    The longer they can keep their “NO COLLUSION!” narrative in the forefront – in a 24 hour news cycle – the better, as far as they’re concerned.

    Well, sure. But I think this is DOJ acting as a bureaucracy rather than some illicit action by Barr and other appointees. I could be proven wrong, however, if this process was unusual by DOJ standards.

  9. mattbernius says:

    @Blue Galangal: thanks for the info. I am pretty familiar with the PDF standard, but never have had to deal with redactions. My guess is that it would handle them in the way you suggested, but I don’t want to assume anything.

    From a document creation standpoint, it would be interesting to see the history of the redacted report and the number of transformations it went through (the partial archeology that James linked to is a great start).

    All that said, I also appreciate how easy it is to screw this stuff up. Let’s not forget that Manafort was in part undone by the metadata of his doctored documents (yes, I get that .docx is fundamentally different from .PDF, but the capacity to screw up complex technologies still remains regardless of file type).

  10. OzarkHillbilly says:

    @James Joyner:

    But I think this is DOJ acting as a bureaucracy rather than some illicit action by Barr and other appointees.

    This is certainly possible James, the DoJ is a bureaucracy after all, but I have found that whenever analyzing any actions taken by the trump admin and the possible motivations for such, it always comes down to 2 choices:

    Are they being,
    a) stupid
    b) evil

    Inevitably the answer always seems to be

    c) Both.

    YMMV 😉

  11. OzarkHillbilly says:

    Stopped by Marcy Wheeler’s place for the 1st time since the Mueller report was released and she has linked to a searchable copy of the report.

  12. Michael Reynolds says:

    The whole strategy was to stall and obfuscate and count on the short attention spans of voters.

    Because that’s what innocent people do, dontcha know? They don’t want people seeing how innocent they are.

  13. JKB says:

    Conspiracy theory on conspiracy theory.

    It might be wise to remember that no one legally had a right to see the report. Not the public, not the “researchers”, not the Congress. Congress was to receive a summary which they did. Then, with President Trump’s blessing, the report was lightly redacted to comply with the law and not endanger national security and released.

    So make up conspiracy theories all you want, but ultimately, if the Trump administration wanted to make it hard to see what’s in the report, you wouldn’t even know how many pages it contains unless that came out in the court proceedings.

    Couple this with Dems in Congress demanding an unredacted copy, in violation of the law on grand jury information, passed by Congress. It’s all smoke or they’d have legislation to modify the law in the process.

  14. The abyss that is the soul of cracker says:

    @mattbernius:

    I suspect that part of the reason was, theoretically, security.

    Whereas I am quite confident that most, if not all, of the reason was, actually and quite literally, obfuscation. Barr was picked because he, like Comey before him turned out to be, is a partisan hack. With that as a given, he has produced a document where, in the spirit of Noam Chomsky’s dissection of bureaucratese, the document pretends to communicate where no actual communication is forthcoming. When, at some later date, the redacted documents are compared to the originals, it will also be found that many of the redactions were not for the protection of the innocent/investigation process, but rather to hide pertinent information from Congress. Probably at the request of Republicans of the self same Congress, if the truth were to come out.

    ETA: @Ozark: Both stupid and evil is exactly right! Of the 2, evil seems more important, too.

  15. DrDaveT says:

    I think it’s about 9:1 odds that whoever was responsible for issuing the redacted version had been burned in the past by documents that were in theory redacted but had been improperly prepared, so that the recipients were able to see through the redactions. So they did the one thing they knew would for sure prevent that. That’s not a defense, but it’s my best guess explanation.

  16. James Joyner says:

    @DrDaveT: That’s as good an explanation as any.

  17. JKB says:

    @The abyss that is the soul of cracker:

    Certain members of Congress can see the document with all but grand jury redactions now, but they declined. And if they want to see the grand jury redactions, Congress can either seek judicial approval as required by the law, or pass legislation modifying the law.

    As it is, the no respect for the Rule of Law Democrats are trying to pressure the Attorney General or other DoJ employees to commit federal crime.

  18. An Interested Party says:

    As it is, the no respect for the Rule of Law Democrats are trying to pressure the Attorney General or other DoJ employees to commit federal crime.

    As opposed to Trump and the Republicans who always respect the Rule of Law? It’s nice to see that you continue to be delusional and live in some fantasy land…

  19. Duff Johnson says:

    @DrDaveT: This cannot be the reason because printing and scanning the document can’t improve on existing redactions.