I am trying to read a PDF document in a J2EE application.
For a web application I have to store PDF documents on disk. To make searching easy I want to build a reverse index of the text inside each document, provided it contains text (i.e. it has been OCRed).
With the PDFBox library it is possible to create a PDDocument object which holds an entire PDF file. However, to conserve memory and improve overall performance, I would rather handle the document as a stream and read one page at a time into a buffer.
Is it possible to read a file stream containing a PDF page by page, or even one line at a time?
For a given generic PDF document you have no way of knowing where one page ends and another one starts, at least not with PDFBox.
If your concern is resource usage, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from it using getObjects(), which gives you a java.util.List. This should be easy to fit into whatever scarce resources you have.
Note that you can easily convert your parsed PDF documents into Lucene indexes through the PDFBox API.
Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox can build an in-memory representation of quite large PDF documents without much effort.
For parsing a PDF document from an InputStream, look at the COSDocument class.
For writing Lucene indexes, look at the LucenePDFDocument class.
For in-memory representations of COSDocuments, look at FDFDocument.
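As a rough sketch of the first suggestion, accessing the underlying COSDocument and its parsed object list could look like this (class and method names are from the PDFBox 2.x API; the older 1.x API instead has you construct a PDFParser over the stream and call parse() yourself):

```java
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public class CosObjectsSketch {
    // Parses a PDF from a stream and counts its low-level COS objects.
    public static int countObjects(InputStream in) throws IOException {
        try (PDDocument doc = PDDocument.load(in)) {
            // getDocument() exposes the underlying COSDocument
            List<COSObject> objects = doc.getDocument().getObjects();
            return objects.size();
        }
    }
}
```

Iterating the list instead of counting it would let you inspect or discard objects one at a time.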
In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This will set up memory buffering to use only temporary file(s) (no main memory), with no size restriction.
This was answered here.
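For the reverse-index use case from the question, text can then be extracted one page at a time with PDFTextStripper, so only a single page's text is buffered at once. A sketch, assuming the 2.0.* API:

```java
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageByPageText {
    // Extracts text page by page; document buffering goes to temp files only.
    public static List<String> extractPages(File file) throws IOException {
        List<String> pages = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly())) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                stripper.setStartPage(p); // page numbers are 1-based here
                stripper.setEndPage(p);
                pages.add(stripper.getText(doc)); // feed this to your index
            }
        }
        return pages;
    }
}
```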
Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.
Here is an example copied from the link above which shows how to draw a PDF page into an image:
import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;

import java.awt.Image;
import java.awt.Rectangle;
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

File file = new File("test.pdf");
RandomAccessFile raf = new RandomAccessFile(file, "r");
FileChannel channel = raf.getChannel();
ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
PDFFile pdffile = new PDFFile(buf);

// draw the first page to an image
PDFPage page = pdffile.getPage(0);

// get the width and height for the doc at the default zoom
Rectangle rect = new Rectangle(0, 0,
        (int) page.getBBox().getWidth(),
        (int) page.getBBox().getHeight());

// generate the image
Image img = page.getImage(
        rect.width, rect.height, // width & height
        rect,                    // clip rect
        null,                    // null for the ImageObserver
        true,                    // fill background with white
        true                     // block until drawing is done
);
I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.
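As a sketch of what such a byte-level scan could look like, here is a plain marker search, with the caveat that in most real-world PDFs the page objects sit inside compressed streams, so a marker like /Type /Page will often not appear in the raw bytes at all:

```java
public class MarkerScan {
    // Counts occurrences of a byte marker (e.g. "/Type /Page") in raw PDF bytes.
    // Note: unreliable for modern PDFs, whose objects are usually compressed.
    public static int count(byte[] data, byte[] marker) {
        int hits = 0;
        for (int i = 0; i + marker.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) hits++;
        }
        return hits;
    }
}
```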
I have a PDF file to save, but first I have to compress it with the best possible quality, and I must use an open source library (like Apache PDFBox®).
So far what I do is get all the image resources, compress them, and put them back in the PDF, but the compression ratio is too low. This is just a fragment of the code where I set the compression parameters:
PDImageXObject imageXObject = (PDImageXObject) pdxObject;
ImageWriter imageWriter = ImageIO
.getImageWritersByFormatName(FileType.JPEG.name().toLowerCase()).next();
ImageWriteParam imageWriteParam = imageWriter.getDefaultWriteParam();
imageWriteParam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
imageWriteParam.setCompressionQuality(COMPRESSION_FACTOR);
Is there some other mechanism to optimize a PDF? So far, compressing only the images gives rather poor results.
On compression: indeed, the images are probably the largest culprits.
Images: the image dimensions (width and height) contribute to the file size too, not only the lossy image quality (your COMPRESSION_FACTOR). In general I would start by compressing a JPEG file outside the PDF; that way you can find the strongest compression that still displays and prints (!) adequately. JPEG suits photos; vector graphics (like diagrams) are better kept in a vector format such as Encapsulated PostScript.
Repeated images, like a per-page logo, should be stored once and referenced from every page rather than embedded repeatedly.
Fonts: the standard fonts need no space, while fully embedded fonts need the most (necessary for PDFs with forms, for instance). Subsetted fonts are a third possibility, embedding only the glyphs actually used.
The PDF's own binary data: text and other streams can be stored uncompressed, compressed into 7-bit ASCII, or compressed using all byte values. The ASCII option is a bit outdated.
At the moment I am not using PDFBox myself, hence I leave the API details to you.
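A minimal sketch of the first suggestion, recompressing a JPEG outside any PDF using only the standard ImageIO API, so you can compare quality settings before touching the document (the image must have no alpha channel, e.g. TYPE_INT_RGB, for the JPEG writer):

```java
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class JpegQuality {
    // Encodes an image as JPEG at the given quality (0.0 = smallest, 1.0 = best).
    public static byte[] encode(BufferedImage image, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(baos)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return baos.toByteArray();
    }
}
```

Encoding the same image at several qualities and inspecting the byte counts makes the size/quality trade-off concrete.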
The Context
Currently, I have a solution where I loop through a PDF and draw black rectangles over it.
So I already have a list of PDRectangles representing the areas I need to fill/cover on the PDF, hiding all the text I want to.
The Problems
Problem number 1: The text underneath the black rectangle can still be copied, searched, or extracted by other tools.
I solved this by flattening my PDF (converting it into an image) so that it becomes a single-layer document and the black rectangle can no longer be circumvented. Same solution as described here:
Disable pdf-text searching with pdfBox
This is not actual redaction; it's more of a workaround.
Which leads me to
Problem number 2:
My final PDF becomes an image document, so I lose all the PDF properties, including searching and copying; it is also a much slower process. I want to keep all the PDF properties while making the redacted areas unreadable by any means.
What I want to accomplish
That being said, I'd like to know whether, and how, I could perform actual redaction with PDFBox: blacking out the rectangular areas (I already have all the positions I need) while keeping the PDF properties and not allowing the redacted areas to be read.
Note: I'm aware of the problems PDFBox had with the old ReplaceText function, but here I have the positions I need, so I can make sure I blank precisely the right areas.
Also, I'm open to suggestions of other free libraries.
Technical Specification:
PDFBox 2.0.21
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB RAM, x86_64
My Code
This is how I draw the black rectangle:
private void draw(PDPage page, PDRectangle hitPdRectangle) throws IOException {
    PDPageContentStream content = new PDPageContentStream(pdDocument, page,
            PDPageContentStream.AppendMode.APPEND, false, false);
    content.setNonStrokingColor(0f);
    content.addRect(hitPdRectangle.getLowerLeftX(),
            hitPdRectangle.getLowerLeftY() - 0.5f,
            hitPdRectangle.getUpperRightX() - hitPdRectangle.getLowerLeftX(),
            hitPdRectangle.getUpperRightY() - hitPdRectangle.getLowerLeftY());
    content.fill();
    content.close();
}
This is how I convert it into an Image PDF:
private PDDocument createNewRedactedPdf() throws IOException {
    PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
    PDDocument redactedDocument = new PDDocument();
    for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
        BufferedImage image = pdfRenderer.renderImageWithDPI(pageIndex, 200);
        String formatName = "jpg";
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(image, formatName, baos);
        byte[] bimg = baos.toByteArray();
        PDPage page = pdDocument.getPage(pageIndex);
        float pageWidth = page.getMediaBox().getWidth();
        float pageHeight = page.getMediaBox().getHeight();
        PDPage pageDraw = new PDPage(new PDRectangle(pageWidth, pageHeight));
        redactedDocument.addPage(pageDraw);
        String imgSuffixName = pageIndex + "." + formatName;
        PDImageXObject img = PDImageXObject.createFromByteArray(redactedDocument, bimg,
                pdDocument.getDocument().getDocumentID() + imgSuffixName);
        try (PDPageContentStream contentStream = new PDPageContentStream(redactedDocument,
                pageDraw, PDPageContentStream.AppendMode.OVERWRITE, false)) {
            contentStream.drawImage(img, 0, 0, pageWidth, pageHeight);
        }
    }
    return redactedDocument;
}
Any thoughts?
What you want to have, a true redaction feature, is possible to implement based on PDFBox but it requires a lot of coding on top of it (similar to the pdfSweep add-on implemented on top of iText).
In particular, you have found out yourself that it does not suffice to draw black rectangles over the areas to redact, as text extraction or copy & paste from a viewer usually ignores completely whether text is visible or covered by something.
Thus, in the code you do have to find the actual instructions drawing the text to redact and remove them. But you cannot simply remove them without replacement; otherwise additional text on the same line may be shifted by your redaction.
But you cannot simply replace them with the same number of spaces or a move-right by the width of the removed text: Just consider the case of a table you want to redact a column from with only "yes" and "no" entries. If after redaction a text extractor returns three spaces where there was a "yes" and two spaces where there was a "no", anyone looking at those results knows what there was in the redacted area.
You also have to clean up instructions around the actual text drawing instruction. Consider the example of the column to redact with "yes"/"no" information again, but this time for more clarity the "yes" is drawn in green and the "no" in red. If you only replace the text drawing instructions, someone with an extractor that also extracts attributes like the color will immediately know the redacted information.
In the case of tagged PDFs, the tag attributes have to be inspected too. There is in particular an attribute ActualText which contains the actual text represented by the tagged instructions (used by screen readers, for example). If you only remove the text drawing instructions but leave the tags with their attributes, anyone using a screen reader may not even realize that you tried to redact something, as the screen reader reads the complete, original text to them.
For a proper redaction, therefore, you essentially have to interpret all the current instructions, determine the actual content they draw, and create a new set of instructions which draws the same content without unnecessary extra instructions which may give away something about the redacted content.
And here we only looked at redacting the text; redacting vector and bitmap graphics on a PDF page has a similar amount of challenges to overcome for proper redaction.
...
Thus, the code required for actual redaction is beyond the scope of a stack overflow answer. Nonetheless, the items above may help someone implementing a redactor not to fall into typical traps of too naive redaction code.
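That said, to give a feel for the machinery involved, here is a deliberately naive sketch (my own construction, not an official PDFBox feature) that removes every text-showing operator (Tj, TJ, ', ") from a page's content stream. A real redactor would instead track glyph positions and drop only the glyphs inside the redaction rectangles, and would also handle form XObjects, annotations, and tagged-PDF attributes as described above:

```java
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NaiveTextRemover {
    private static final List<String> TEXT_SHOWING = Arrays.asList("Tj", "TJ", "'", "\"");

    // Rewrites the page content stream with all text-showing operators
    // (and their operands) removed. Naive: removes ALL text on the page.
    public static void stripAllText(PDDocument doc, PDPage page) throws IOException {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();

        List<Object> kept = new ArrayList<>();
        List<Object> operands = new ArrayList<>();
        for (Object token : tokens) {
            if (token instanceof Operator) {
                Operator op = (Operator) token;
                if (!TEXT_SHOWING.contains(op.getName())) {
                    kept.addAll(operands); // keep operands of harmless operators
                    kept.add(op);
                }
                operands.clear(); // operands of text-showing operators are dropped
            } else {
                operands.add(token); // operands precede their operator
            }
        }

        PDStream newContents = new PDStream(doc);
        try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
            new ContentStreamWriter(out).writeTokens(kept);
        }
        page.setContents(newContents);
    }
}
```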
I've been trying to use Flying Saucer to take HTML content and render it to an SVG. I've easily achieved this when I want a JPEG or PNG, but SVGs seem to be a bit more complicated, and I've been struggling to figure out the process (mainly because I'm a Java noob, as it were).
I've been trying to follow this, but it looks like the examples don't quite work, and I struggled a bit with the bundled demos as well; plus most examples try to get a browser involved, which I don't really care about. All I want is to take the HTML and convert it to an SVG so I can use it as a resource in a PDF. This is what I have for PNG conversion:
BufferedImage buff = Graphics2DRenderer.renderToImage("content.html", 1000, 1000);
File outputfile = new File("test.png");
ImageIO.write(buff, "PNG", outputfile);
I am using iText to read and manipulate PDF files (reports). I read in a PDF report using iText and save each page as a separate PDF file, but I am unable to reduce the size of the generated PDFs. Is there any compression technique or other way to reduce the size of the PDF? Does PDFBox help in this regard?
You can try to set a compression level when using iText:
Document document = ...
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
writer.setCompressionLevel(9);
Level 9 is the slowest, but gives you the best compression available in iText.
Please note that the compression effect largely depends on the PDF content. If your PDF file contains large binary streams, such as images, stream compression will have little to no effect on your document. Also, iText never compresses the XMP metadata stream, regardless of the configuration options.
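Since the question is about splitting an existing PDF, the same setting can be applied when copying pages with PdfReader/PdfCopy. A sketch, assuming the iText 5 API (setFullCompression() additionally enables PDF 1.5 object/cross-reference streams and must be called before opening the document):

```java
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.FileOutputStream;

public class SplitCompressed {
    // Copies a single page into its own PDF with maximum stream compression.
    public static void savePage(PdfReader reader, int pageNumber, String target) throws Exception {
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream(target));
        copy.setCompressionLevel(PdfWriter.BEST_COMPRESSION); // level 9
        copy.setFullCompression(); // object/xref streams, before open()
        document.open();
        copy.addPage(copy.getImportedPage(reader, pageNumber));
        document.close();
    }
}
```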