Adobe PDFBox produces unexpected changes for load followed by save - java

I'm working on a tool to automatically generate PDF files based on a template file and other source data. I'm using Mac OS 10.14.4, Java 1.8 and PDFBox version 2.0.15.
As a basic test, I trimmed the open-and-save code down to two lines, which have an obvious problem for one particular PDF and more subtle issues for all other PDFs I've tried:
PDDocument targetPDF = PDDocument.load(new File(templatePath));
targetPDF.save(targetFileName)
The observed problem for one particular PDF is that unexpected characters are inserted at the top of the first page. (They appear to be in an alphabet which is not otherwise used, and are clipped.) Other PDFs are visually similar, but very different when I run them through diff. Is this something tricky I should do to save the files? Is this a problem with that one file? Is PDFBox is doing something odd?
I've looked for similar reports, and found a few that are concerned with the size of the output files: In PDFBox, why does file size becomes extremely large after saving? and Split and merge pdf files using PDFBOX produces large file I do see a noticeable increase in file size, but not as much as those questions report. In one case, the input and output files are visually different. In others, diff -y --text template.pdf target.pdf reports large differences but I don't detect any differences by eye alone. (In Mac's built-in "Preview" document viewer. The breaking "template.pdf" is created in Adobe Acrobat. I don't know about the non-breaking files.)
After comparison to https://issues.apache.org/jira/browse/PDFBOX-2690 and http://useof.org/java-open-source/org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject I tried adding targetPDF.close() after the targetPDF.save(), but that made no difference.
https://pdfbox.apache.org/2.0/faq.html seems to suggest closing a content stream before saving the file, but I don't know how to do that. (File itself doesn't have a close() method. None of the methods on PDDocument seem related to streams or closing them, except PDDocument.close() itself, which prevents saving.)
I would paste in some log files here, but I'm not getting any log messages from PDFBox classes....

Related

Preview LaTeX output with Java

Right now I'm working on displaying LaTeX generated document with Java.
Strictly speaking, LaTeX source can be used to directly generate two formats:
DVI using latex, the first one to be supported;
PDF using pdflatex, more recent.
However rendering dvi or pdf is not available as far as I know.
Is there any way to handle those formats ? Or maybe others that makes sense ?
There are not enough details with regards to how you wish to "render" DVI or PDF from a LaTeX document. However, you could always just render the pdf using pdflatex and DVI using latex and use ICEpdf for viewing PDFs and javaDVI for viewing DVIs.
Another neat hack to display pdf in a panel is to pass the file path to an embedded web component in the application, and the web component will use whatever pdf rendering tool is available on your machine (Acrobat, Foxit, Preview, etc.)
I remember there was a post about this a long time ago.
I don't think there's a generic way to preview the rendered output without generating the file itself. You can write your own LaTeX engine which caches the output every few seconds and displays that but regardless of the storage, you have to output it somewhere physically and then render the output separately using any of the steps mentioned above.
Another approach is to convert the div output to an svg image file and render that with SVGGraphics2D. That will produce nice scalable results. Dvi files can be converted to svg on the command line (or in a script) using:
dvisvgm --no-fonts input.dvi -o output.svg
For more conversion options see this thread on how to convert pdf to clean svg.

Creating an editable document via java web application

I am looking for a convenient method to export some data from my database into a form that would be editable afterwards. The perfect scenario would be to export a word document, and perhaps a brutally simple solution would be to generate HTML and copy/paste it into Word.
I've looked at several open source libraries for generating word documents, but they seem a bit too simple or incomplete. I need support for tables and embedded images and control over formatting the fonts, table borders etc. (too much formatting seems to be lost when copying html and pasting into word).
Although Word is the end format, it'd be fine to generate it in any format that word would be able to open and subsequently save as DOCX.
I really haven't been able to find anything about generating ODT files (server side without client installation).
I would just dive into the ASPOSE libraries, but it'll take ages (and significant pain) to get a purchase order sorted out so I need to make sure its the only viable option before taking that route.
I could generate it as an excel file and copy that to word - this is looking like the best option currently.

How to edit files to change md5 hash without corrupting?

I need to duplicate various kinds of file types, change them a bit so that the original's md5 hash won't match the modified one, but keep them readable and not corrupted.
TXT files - that's obvious. I just add a random string to the end of the file.
PDF file - well I started looking for a java library to edit pdf files, but then I accidentally tried to open a pdf file in notepad++, and thought - why don't I try to add a random string to the end of the not readable content that I see there. Well, to my surprise it worked and the file wasn't corrupted.
ZIP file - I've tried the same that I did with pdf, and it also worked.
DOCX- the same method stopped working here. Appending just a space (" ") at the end of the binary content of a docx file that I open in a text editor, corrupts the file.
So what I need is:
java libraries for modifying office documents :doc, docx, xls, xlsx, ppt, pptx.
There are still file types that I need to change there md5 hash output, but I don't think they are modifiable in java - media files for example, executables and etc..
So, nevertheless, how can i perform what I want on these files? Is there a way to just "touch" the file, change a header or something and make it nonidentical to an untouched one?
edit:
Ok, here's the motivation - I want to generate massive amount of data as I asked here: How to produce massive amount of data?
At the time of that question, the answers I got there were enough, but not they dont.
I need the data to be nonidentical. Pairs of files must fail md5 hash test.
i can't just generate random strings, because I need to simulate real files and documnets.
I can't use existing data dumps, because I need various sizes of these data sets that include various file types. I need something that I'll give as an input the size, and it will generate the data for me.
So I figured that I should use a starting data set of all the file types that I eventually need, and just duplicate this data set.
java libraries for modifying office documents :doc, docx, xls, xlsx, ppt, pptx.
Apache POI is used to modify MS Office files. Note that newer formats (xlsx, docx, etc.) are simply ZIP files containing XML. Unzipping them and modifying plain text XML might work as well.
The same advice goes to ZIP files: try unzipping and modifying the easiest file.
But what are you actually trying to achieve? Note that randomly attaching some string at the end of the file works only by chance. On other computer or other version of software the file might be considered as corrupted...
I would advice you to either store some metadata external to the file rather than comparing MD5 or look deeper into file formats. There are almost always headers and various pieces of metadata hidden in the file (ID3 tags in MP3, EXIF in images, etc.) It is much safer to modify it instead.
Also look for reserved/not used bytes - it is quite often. But again - why? are you doing it on the first place?

How to extract images from pdf using Java (not using pdfbox)

I've being researching on how to extract images from a big (> 300MB) PDF file. I'm using pdfbox but for some particular reason that I can't figure out, some pages are not correctly extracted.
I'm using the PDFToImage class of pdfbox as base for my code.
So, do you know another library that may help me to do this? I know that iText may be used, but I read that it can't be used for commercial products.
I've installed the packages xpdf and xpdf-utils, and the utility called pdfimages is working perfect. But I need to solve this problem from Java and it should be portable.
I think you're talking about two different things here: extracting images from a PDF, and converting PDF pages to images. PDFToImage will output an image for every page, while pdfimages extracts all embedded images (e.g. a text document has 0 images).
Take a look at org.apache.pdfbox.tools.ExtractImages (source code) to see if it does what you want.
The most likely reason why it is hard working with 300 Mb PDF's is that you run out of memory. If it works well for smaller PDF's I would have a closer look at why it fails.
Have you tried icepdf or JPedal (both pure java)?

View PDF files in IFrame with Named Destinations

We've got an application that displays PDF files in an IFrame at specific Named Destinations. This works well on Windows systems but not Mac. In Safari, with Acrobat, the Named Destination is ignored and the document is displayed at the start.
Does anyone have any suggestions on how we might accomplish the task of displaying this information? Our initial thoughts are to:
Convert the PDF to HTML on the fly and display the HTML version in the IFrame
Convert the PDF on the page referenced to another format such as PNG etc. and display that in the IFrame
Utilize some kind of Java app that allowed us to render the PDF while honouring the Named Destination (not sure if this exists)
Any other ideas on a potential method of better displaying PDF files at Named Destination points that is a little more cross platform?
EDIT: I guess another option is to store the data in XSL/XSLT type format and convert to HTML for veiwing or PDF for saving to the desktop.
Not much help, but I found that alternative ways to display PDF files (other than the Acrobat Reader client) are few and far between. As you say, the commonly accepted way to render PDF's in something that doesn't natively support it seems to be converting it "something else", which is supported (even Acrobat.com does it this way in their Flex client if I remember it correctly).
Even converting the PDF document to other formats may be disappointing - especially if you expect a certain level of quality. It may also introduce server-side performance issues.
I realise this doesn't help anyone much but I'm interested to see if any other suggestions come up. We've dealt with this problem before in the same way, using IFrame controls (but without named destinations) but I'm very much interested in other suggestions/ideas as well.

Categories

Resources