Is it possible to redact PDF areas with PDFBox by position?


The Context
Currently, I have a solution where I loop through a PDF and draw black rectangles throughout it.
So I already have a PDRectangle list representing the right areas I need to fill/cover on the pdf, hiding all the texts I want to.
The Problems
Problem number 1: The text underneath the black rectangle is easily copied, searchable, or extracted by other tools.
I solved this by flattening my pdf (converting it into an image so that it becomes a single layer document and the black rectangle can no longer be tricked). Same solution as described here:
Disable pdf-text searching with pdfBox
This is not actual redaction; it's more of a workaround.
Which leads me to
Problem number 2:
My final PDF becomes an image document, where I lose all the pdf properties, including searching, copying... also it's a much slower process. I wanted to keep all the pdf properties while the redacted areas are not readable by any means.
What I want to accomplish
That said, I'd like to know whether it is possible, and how, to do actual redaction with PDFBox: blacking out rectangular areas (I already have all the positions I need) while keeping the PDF properties and making sure the redacted areas cannot be read.
Note: I'm aware of the problems PDFBox had with the old ReplaceText function, but here I have the positions I need to make sure I'd blank precisely the areas I need.
Also, I'm accepting other free library suggestions.
Technical Specification:
PDFBox 2.0.21
Java 11.0.6+10, AdoptOpenJDK
MacOS Catalina 10.15.4, 16gb, x86_64
My Code
This is how I draw the black rectangle:
private void draw(PDPage page, PDRectangle hitPdRectangle) throws IOException {
    // try-with-resources closes the content stream even if drawing fails
    try (PDPageContentStream content = new PDPageContentStream(pdDocument, page,
            PDPageContentStream.AppendMode.APPEND, false, false)) {
        content.setNonStrokingColor(0f);
        content.addRect(hitPdRectangle.getLowerLeftX(),
                hitPdRectangle.getLowerLeftY() - 0.5f,
                hitPdRectangle.getUpperRightX() - hitPdRectangle.getLowerLeftX(),
                hitPdRectangle.getUpperRightY() - hitPdRectangle.getLowerLeftY());
        content.fill();
    }
}
This is how I convert it into an Image PDF:
private PDDocument createNewRedactedPdf() throws IOException {
    PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
    PDDocument redactedDocument = new PDDocument();
    for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
        BufferedImage image = pdfRenderer.renderImageWithDPI(pageIndex, 200);
        String formatName = "jpg";
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(image, formatName, baos);
        byte[] bimg = baos.toByteArray();
        PDPage page = pdDocument.getPage(pageIndex);
        float pageWidth = page.getMediaBox().getWidth();
        float pageHeight = page.getMediaBox().getHeight();
        PDPage pageDraw = new PDPage(new PDRectangle(pageWidth, pageHeight));
        redactedDocument.addPage(pageDraw);
        String imgSuffixName = pageIndex + "." + formatName;
        PDImageXObject img = PDImageXObject.createFromByteArray(redactedDocument, bimg,
                pdDocument.getDocument().getDocumentID() + imgSuffixName);
        try (PDPageContentStream contentStream
                = new PDPageContentStream(redactedDocument, pageDraw, PDPageContentStream.AppendMode.OVERWRITE, false)) {
            contentStream.drawImage(img, 0, 0, pageWidth, pageHeight);
        }
    }
    return redactedDocument;
}
Any thoughts?

What you want to have, a true redaction feature, is possible to implement based on PDFBox but it requires a lot of coding on top of it (similar to the pdfSweep add-on implemented on top of iText).
In particular you have found out yourself that it does not suffice to draw black rectangles over the areas to redact as text extraction or copy&paste from a viewer usually completely ignores whether text is visible or covered by something.
Thus, in your code you have to find the actual instructions that draw the text to redact and remove them. But you cannot simply remove them without replacement; otherwise, additional text on the same line may be moved by your redaction.
But you cannot simply replace them with the same number of spaces or a move-right by the width of the removed text: Just consider the case of a table you want to redact a column from with only "yes" and "no" entries. If after redaction a text extractor returns three spaces where there was a "yes" and two spaces where there was a "no", anyone looking at those results knows what there was in the redacted area.
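To make the length leak concrete, here is a minimal, library-free sketch. It does not touch PDF content streams at all, and the class and method names are hypothetical. It only contrasts a naive mask that keeps one space per removed character with a mask that always emits the same fixed placeholder:

```java
import java.util.Arrays;
import java.util.List;

class RedactionLeakDemo {

    // Naive masking: one space per removed character, so the original length survives.
    static String maskKeepingLength(String secret) {
        char[] spaces = new char[secret.length()];
        Arrays.fill(spaces, ' ');
        return new String(spaces);
    }

    // Fixed-width masking: every redacted value becomes the same placeholder,
    // whatever the original length was.
    static String maskFixedWidth(String secret) {
        return "[redacted]";
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("yes", "no", "yes");
        for (String value : column) {
            // With the naive mask, the extracted length alone tells "yes" from "no".
            System.out.println("naive length = " + maskKeepingLength(value).length()
                    + ", fixed length = " + maskFixedWidth(value).length());
        }
    }
}
```

In a real redactor the same idea applies to the horizontal advance that replaces the removed glyphs: the replacement must keep the following text in place without encoding anything about what was removed.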
You also have to clean up instructions around the actual text drawing instruction. Consider the example of the column to redact with "yes"/"no" information again, but this time for more clarity the "yes" is drawn in green and the "no" in red. If you only replace the text drawing instructions, someone with an extractor that also extracts attributes like the color will immediately know the redacted information.
In case of tagged PDFs, the tag attributes have to be inspected too. There in particular is an attribute ActualText which contains the actual text represented by the tagged instructions (in particular for screen readers). If you only remove the text drawing instructions but leave the tags with their attributes, anyone reading using a screen reader may not even realize that you tried to redact something as his screen reader reads the complete, original text to him.
For a proper redaction, therefore, you essentially have to interpret all the current instructions, determine the actual content they draw, and create a new set of instructions which draws the same content without unnecessary extra instructions which may give away something about the redacted content.
And here we only looked at redacting the text; redacting vector and bitmap graphics on a PDF page has a similar amount of challenges to overcome for proper redaction.
...
Thus, the code required for actual redaction is beyond the scope of a Stack Overflow answer. Nonetheless, the items above may help someone implementing a redactor avoid the typical traps of overly naive redaction code.

Related

iText: Change Colour of existing PDF to Grayscale

We are using an old Version of iText (2.x) with Java 6 at the moment.
What we now try to do is to open an existing PDF and change its Color to grayscale. I found the method PdfWriter.setDefaultColorspace(PdfName key, PdfObject cs)
but I'm not really sure how to use it.
Can anyone tell me, how to use it in the right way? Or maybe anybody knows how to change PDF to grayscale in another way with this old iText version.
Many thanks in advance!

I implemented the code here using iText 5.5.14 but it should also work with iText 2.1.7 with minimal changes.
There are two ways to remove color from PDF pages,
either one actually iterates through all color related instructions of its content streams and replaces the colors set therein by an equivalent gray
or one appends instructions to each page content stream which remove the color saturation of all that the existing instructions create.
The former option is beyond the scope of a Stack Overflow answer (there are many different kinds of colors in PDFs, embedded bitmaps also bring color along, and one also has to consider the effects of transparency and blend modes used), but the latter option is fairly easy to implement by overlaying the page with a grayscale color in blend mode Saturation:
void dropSaturation(PdfStamper pdfStamper) {
    PdfGState gstate = new PdfGState();
    gstate.setBlendMode(PdfName.SATURATION);
    PdfReader pdfReader = pdfStamper.getReader();
    for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
        PdfContentByte canvas = pdfStamper.getOverContent(i);
        canvas.setGState(gstate);
        Rectangle mediaBox = pdfReader.getPageSize(i);
        canvas.setColorFill(BaseColor.BLACK);
        canvas.rectangle(mediaBox.getLeft(), mediaBox.getBottom(), mediaBox.getWidth(), mediaBox.getHeight());
        canvas.fill();
        canvas = pdfStamper.getUnderContent(i);
        canvas.setColorFill(BaseColor.WHITE);
        canvas.rectangle(mediaBox.getLeft(), mediaBox.getBottom(), mediaBox.getWidth(), mediaBox.getHeight());
        canvas.fill();
    }
}
(ColorToGray method)
You can apply it like this:
PdfReader pdfReader = new PdfReader(SOURCE_PDF);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
dropSaturation(pdfStamper);
pdfStamper.close();
Beware, this is a proof-of-concept. For a complete solution you actually have to do the same to all annotations of the pages.
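For the first option (replacing each color by an equivalent gray), the core step is a per-color conversion. The following is a minimal sketch of only that step, assuming DeviceRGB colors with components in the range 0..1 and using the weights the PDF specification gives for converting DeviceRGB to DeviceGray. The class and method names are hypothetical, and locating and rewriting the color operators in the content streams is not shown:

```java
import java.util.Locale;

class GrayConversionSketch {

    // gray = 0.3 * red + 0.59 * green + 0.11 * blue (DeviceRGB to DeviceGray)
    static float toGray(float r, float g, float b) {
        return 0.3f * r + 0.59f * g + 0.11f * b;
    }

    // Formats a "g" operator, which sets a gray non-stroking (fill) color,
    // as a replacement for an "r g b rg" instruction.
    static String fillGrayOperator(float r, float g, float b) {
        return String.format(Locale.ROOT, "%.3f g", toGray(r, g, b));
    }

    public static void main(String[] args) {
        // Pure red and pure green map to different grays.
        System.out.println(fillGrayOperator(1f, 0f, 0f));
        System.out.println(fillGrayOperator(0f, 1f, 0f));
    }
}
```

This only covers plain RGB fills; the other color spaces, bitmaps, and transparency effects mentioned above would each need their own handling.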

Coordinates are wrong when appending line in PDFBox to the existing page

I'm using PDFBox 1.8.11.
I'm trying to draw a line from (0,0) to (x,y). That's how I do it:
PDPageContentStream stream = new PDPageContentStream(document, page, true, false);
stream.setStrokingColor(80, 100, 200);
stream.setLineWidth(1.0f);
stream.drawLine(0, 0, x, y);
stream.close();
All works fine for almost all PDFs. But for one PDF if I append to the stream (the third parameter of new PDPageContentStream()) the line is drawn from the right bottom corner and goes beyond the page right border. If I don't append to the content stream, the line is drawn as expected.
It happens only for this PDF (maybe some others), and I'm wondering if I miss anything. Maybe I need to reset some coordinate system before drawing or the like?
P.S. The media box of the page starts from (0,0) and is equal to the page size.
Thanks in advance

Actually this post (PDFBox : PDPageContentStream's append mode misbehaving) explains the issue.
Setting the last parameter resetContext to true in the constructor below solved my problem.
public PDPageContentStream(PDDocument document, PDPage sourcePage,
        boolean appendContent,
        boolean compress, boolean resetContext)
        throws IOException
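Why this helps can be shown with a small simulation. The sketch below is not PDFBox code: it is a hypothetical, library-free model of the PDF graphics state, assuming the existing page content changes the current transformation matrix and never restores it. Appended drawing then inherits that matrix, unless the existing content is wrapped in a save/restore pair (the q and Q operators), which is the effect resetContext is meant to have:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;

class GraphicsStateSketch {
    private AffineTransform ctm = new AffineTransform(); // starts as identity
    private final Deque<AffineTransform> stack = new ArrayDeque<>();

    void save() { stack.push(new AffineTransform(ctm)); }        // operator q
    void restore() { ctm = stack.pop(); }                        // operator Q
    void concatenate(AffineTransform m) { ctm.concatenate(m); }  // operator cm

    // Where a point given in user space ends up on the page.
    Point2D toPage(double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    // Existing content shifts the coordinate system and leaves it that way.
    static Point2D originAfterPlainAppend() {
        GraphicsStateSketch gs = new GraphicsStateSketch();
        gs.concatenate(AffineTransform.getTranslateInstance(500, 0));
        return gs.toPage(0, 0); // appended drawing starts far to the right
    }

    // The same content wrapped in q ... Q: the shift is undone before appending.
    static Point2D originAfterResetAppend() {
        GraphicsStateSketch gs = new GraphicsStateSketch();
        gs.save();
        gs.concatenate(AffineTransform.getTranslateInstance(500, 0));
        gs.restore();
        return gs.toPage(0, 0); // appended drawing starts at the page origin
    }

    public static void main(String[] args) {
        System.out.println("plain append: " + originAfterPlainAppend());
        System.out.println("reset append: " + originAfterResetAppend());
    }
}
```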

In PDFBox, how to create a link annotation with "rollover" / "mouse over" effects?

Question:
With PDFBox, how can I create a link annotation with "mouse over" color effect (aka rollover / mouse hover)?
This means that when I hover my mouse cursor over a link in a PDF file (without clicking it), the link changes to a different color. And when I move the cursor away, the link changes back to the original color.
For example:
The effect that I am looking for is similar to the links on the Stack Overflow website. When you hover the mouse cursor over (without clicking) the "Ask Question" button, the link changes from grey to orange. When you move the cursor away, the color changes back to grey. I want to achieve exactly the same effect in a PDF file.
What I have tried:
In PDF Reference Sixth Edition, it is described that:
the rollover appearance is used when the user moves the cursor into the annotation’s active area without pressing the mouse button
and
[rollover appearance] are defined in an appearance dictionary, which in turn is the value of the AP entry in the annotation dictionary
Also,
In the PDFBox, there is a PDAppearanceDictionary class, which has a setRolloverAppearance() method.
This is the farthest I can get. I don't know how to use PDAppearanceDictionary class (if this is indeed the right class to use) in conjunction with a PDAnnotationLink class, in order to achieve my desired result.
I have tried finding examples on Google in vain.

In short
There was some uncertainty about whether or not such a rollover effect is possible. Using fairly current Adobe PDF viewers (Reader XI and Acrobat 9.5) for displaying, the desired rollover effect did not turn up for a link annotation. The effect did turn up, though, for a button widget (carrying the same URL action).
In detail
The test code feeds either a PDAnnotationLink or a PDAnnotationWidget customized as a pushbutton to a method which embeds the respective annotation in a document and adds normal and rollover appearances to it:
void createRollover(PDAnnotation annotation, String filename) throws IOException, COSVisitorException
{
    PDDocument document = new PDDocument();
    PDPage page = new PDPage();
    document.addPage(page);
    List<PDAnnotation> annotations = page.getAnnotations();
    float x = 100;
    float y = 500;
    String text = "PDFBox";
    PDFont font = PDType1Font.HELVETICA_BOLD;
    float textWidth = font.getStringWidth(text) / 1000 * 18;
    PDPageContentStream contents = new PDPageContentStream(document, page);
    contents.beginText();
    contents.setFont(font, 18);
    contents.moveTextPositionByAmount(x, y);
    contents.drawString(text);
    contents.endText();
    contents.close();
    PDAppearanceDictionary appearanceDictionary = new PDAppearanceDictionary();
    PDAppearanceStream normal = createAppearanceStream(document, textWidth, font, "0.5 0.5 0.5 rg");
    PDAppearanceStream rollover = createAppearanceStream(document, textWidth, font, "1 0.7 0.5 rg");
    PDAppearanceStream down = createAppearanceStream(document, textWidth, font, "0 0 0 rg");
    appearanceDictionary.setNormalAppearance(normal);
    appearanceDictionary.setRolloverAppearance(rollover);
    appearanceDictionary.setDownAppearance(down);
    annotation.setAppearance(appearanceDictionary);
    PDRectangle position = new PDRectangle();
    position.setLowerLeftX(x);
    position.setLowerLeftY(y - 5);
    position.setUpperRightX(x + textWidth);
    position.setUpperRightY(y + 20);
    annotation.setRectangle(position);
    annotations.add(annotation);
    document.save(new File(RESULT_FOLDER, filename));
    document.close();
}
In case of the PDAnnotationLink, the rollover effect did not turn up in the Adobe viewers tested.
In case of the pushbutton PDAnnotationWidget, the rollover effect did turn up.
Backgrounds:
The OP in his question and @Tilman in a comment referred to the PDF specification:
An annotation may define as many as three separate appearances: [...]
• The rollover appearance shall be used when the user moves the cursor into the annotation’s active area without pressing the mouse button. [...]
The normal, rollover, and down appearances shall be defined in an appearance dictionary, which in turn is the value of the AP entry in the annotation dictionary
and, therefore, thought:
So it should be possible
They did not consider, though, that the specification introduces the appearance dictionary as:
AP dictionary (Optional; PDF 1.2) An appearance dictionary specifying how the annotation shall be presented visually on the page (see 12.5.5, “Appearance Streams”). Individual annotation handlers may ignore this entry and provide their own appearances.
Thus, what at first glance seemed to be an unconditional requirement ("The rollover appearance shall be used when...") turns out to be ignorable if the annotation handler in a PDF viewer has its own ideas.
tl;dr: it is completely up to the PDF viewer in question to decide which appearance streams it uses and which it ignores and replaces in its own ways.
If making use of annotation appearance streams, one should always make sure that one also supplies the information most plausibly used if the given appearances are ignored, e.g. having regular page content beneath a link annotation.

It's important to understand that a "link" annotation in PDF simply represents a selectable area. It's a rectangle that may or may not have text under it, and is not tied to any specific text in any way ("hyperlinked" text just happens to be in the linked zone of the document). Acrobat and Reader have some "extra" features to "guess" at which text is used in links, and mark used links a different color, but from a PDF perspective a link is just a rectangle. You can give the link annotation itself a rollover effect; this changes the appearance of the link rectangle itself. Examples include having the previously invisible rectangular outline appear when you mouse over it, or having a visible rectangular outline change color. You can play around with these in Acrobat's link properties menu to get a better understanding.
However, that is the only type of rollover you will be able to achieve using link annotations. To reproduce what happens with web links, you will want to look into other workarounds. Examples include creating a form XObject of the text with an alternate rollover appearance, creating the text as an image-based button with a rollover appearance, or even using Flash. I hope this helps explain what is and isn't possible with link annotations themselves!

reversed Arabic when printing PDF

I'm trying to print Arabic in some PDF documents using the Java code found here :
http://www.java2s.com/Code/Java/PDF-RTF/ArabicTextinPDF.htm
The example works great, except that the text comes out backwards. For example, changing the example slightly :
String txt = "\u0623\u0628\u062c\u062f\u064a\u0629 \u0639\u0631\u0628\u064a\u0629";
System.out.println(txt);
g2.drawString(txt, 100, 30);
What is printed on the screen are the same characters but in the opposite direction, compared to the PDF. The console output is correct, the PDF is not.
I don't want to simply reverse the characters because otherwise I would lose bi-directional support ...
Thanks much

IIRC, iText supports Arabic shaping at a higher level than drawString. Let's see here...
Ah! ColumnText.showTextAligned(PdfContentByte canvas, int alignment, Phrase phrase, float x, float y, float rotation, int runDirection, int arabicOptions)
Alignment is one of Element.ALIGN_*. Run direction is one of PdfWriter.RUN_DIRECTION_*. Arabic options are bit flags, ColumnText.AR_*
That should do the trick, with one caveat: I'm not sure that it'll handle multiple directions in the same phrase. Your test string has CJKV, Arabic, and Latin characters, so there should be two direction changes.
Good luck.

Figured it out, here is the complete process :
document.open();
java.awt.Font font = new java.awt.Font("times", 0, 30);
PdfContentByte cb = writer.getDirectContent();
java.awt.Graphics2D g2 = cb.createGraphicsShapes(PageSize.A4.width(), PageSize.A4.height());
g2.setFont(font);
String txt = "日本人 أبجدية عربية Dès Noël où";
System.out.println(txt);
java.awt.font.FontRenderContext frc = g2.getFontRenderContext();
java.awt.font.TextLayout layout = new java.awt.font.TextLayout(txt, font, frc);
layout.draw(g2, 15, 55);
g2.dispose();
document.close();
You'll notice it does multiple languages with bi-directional support. Only thing is it's impossible to copy/paste the resulting PDF text, as it is an image. I can live with that.

Unicode Arabic (or anything else) is always in logical order in a Java program. Some PDFs are made in visual order, though this is quite rare in the modern world. The program you cite might be a hack that ends up with PDFs that work, sort of, for some purposes.
If I were you, I'd start by examining some PDFs produced in Arabic by some modern tool.
This sort of 'graphics' approach to PDF construction seems risky to me at best.
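The point about logical order can be checked with the JDK's own java.text.Bidi class, independent of any PDF library. A minimal sketch, where the class name and the mixed test string are assumptions for illustration: the string is stored in logical (reading) order, and Bidi reports which runs have to be laid out right-to-left when the text is displayed:

```java
import java.text.Bidi;

class BidiOrderSketch {

    // Analyzes the text, taking the paragraph direction from its first strong character.
    static Bidi analyze(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        // The Arabic sample from the question, in logical order.
        String arabic = "\u0623\u0628\u062c\u062f\u064a\u0629 \u0639\u0631\u0628\u064a\u0629";
        String mixed = "PDF " + arabic;

        Bidi bidi = analyze(mixed);
        for (int i = 0; i < bidi.getRunCount(); i++) {
            // An odd embedding level marks a right-to-left run.
            boolean rightToLeft = (bidi.getRunLevel(i) & 1) == 1;
            System.out.println("run " + i + ": chars " + bidi.getRunStart(i)
                    + ".." + bidi.getRunLimit(i)
                    + (rightToLeft ? " right-to-left" : " left-to-right"));
        }
    }
}
```

Reversing the characters by hand would throw this run information away, which is why it breaks bidirectional text.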

Read pdf uploadstream one page at a time with java

I am trying to read a PDF document in a J2EE application.
For a web application I have to store PDF documents on disk. To make searching easy, I want to build a reverse index of the text inside the document, provided it has been OCR'd.
With the PDFBox library it's possible to create a PDDocument object which contains an entire PDF file. However, to save memory and improve overall performance, I'd rather handle the document as a stream and read one page at a time into a buffer.
I wonder if it is possible to read a file stream containing a PDF page by page, or even one line at a time.

For a given generic PDF document you have no way of knowing where one page ends and another one starts, using PDFBox at least.
If your concern is the use of resources, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from the COSDocument using getObjects(), which will give you a java.util.List. This should be easy to fit into whatever scarce resources you have.
Note that you can easily convert your parsed pdf documents into Lucene indexes through the PDFBox API.
Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.
For parsing the PDF document from an InputStream, look at the COSDocument class
For writing lucene indexes, look at LucenePDFDocument class
For in-memory representations of COSDocuments, look at FDFDocument

In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This will set up buffering to use only temporary file(s) (no main memory) with no size restriction.
This was answered here.

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.
Here is an example copied from the link above which shows how to draw a PDF page into an image:
File file = new File("test.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel channel = raf.getChannel();
ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
PDFFile pdffile = new PDFFile(buf);
// draw the first page to an image
PDFPage page = pdffile.getPage(0);
// get the width and height for the doc at the default zoom
Rectangle rect = new Rectangle(0, 0,
        (int) page.getBBox().getWidth(),
        (int) page.getBBox().getHeight());
// generate the image
Image img = page.getImage(
        rect.width, rect.height, // width & height
        rect, // clip rect
        null, // null for the ImageObserver
        true, // fill background with white
        true // block until drawing is done
);

I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.
