Java Pdf Diff library - java

Does anybody know of an open-source Java library that will do robust diffing of the text parts of PDF files?
Ideally I would like something that would produce a diff in the form of a patch.

Extract the pdf text with http://incubator.apache.org/pdfbox/ and create a diff with http://code.google.com/p/google-diff-match-patch.
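A rough sketch of the diff half of this suggestion, assuming the text of each PDF has already been extracted into a String (for example with PDFBox). It does not call google-diff-match-patch; as a stand-in it runs a plain longest-common-subsequence line diff and emits lines prefixed with `- `, `+ ` or two spaces, similar to the body of a patch. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: line-based diff of two already-extracted text strings.
final class TextLineDiff {

    // Returns diff lines prefixed with "  " (unchanged), "- " (removed) or "+ " (added).
    static List<String> diff(String oldText, String newText) {
        String[] a = oldText.split("\\R", -1);
        String[] b = newText.split("\\R", -1);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the start, emitting one diff line per step.
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i]);
                i++;
            } else {
                out.add("+ " + b[j]);
                j++;
            }
        }
        while (i < a.length) out.add("- " + a[i++]);
        while (j < b.length) out.add("+ " + b[j++]);
        return out;
    }

    public static void main(String[] args) {
        diff("Title\nfirst line\nlast line", "Title\nfirst line changed\nlast line")
                .forEach(System.out::println);
    }
}
```

The table is quadratic in the number of lines, which is fine for typical documents; for very large texts a dedicated diff library is the better choice.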

If the PDFs are different only in text, you could also rasterize the pages and then look at the differences that way - we use that for regression testing output on our PDF code.
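A minimal sketch of the comparison step, assuming each page has already been rasterized to a `BufferedImage` by whatever renderer you use. It only counts differing pixels; the class name and the way a size mismatch is handled are illustrative choices, not part of any library.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

// Hypothetical helper: compares two already-rasterized pages pixel by pixel.
final class PageImageDiff {

    // Counts pixels whose ARGB values differ.
    // If the images differ in size, every pixel of the larger bounding box counts as different.
    static long countDifferentPixels(BufferedImage expected, BufferedImage actual) {
        int w = expected.getWidth();
        int h = expected.getHeight();
        if (w != actual.getWidth() || h != actual.getHeight()) {
            return (long) Math.max(w, actual.getWidth()) * Math.max(h, actual.getHeight());
        }
        long different = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    different++;
                }
            }
        }
        return different;
    }

    public static void main(String[] args) {
        // Two all-black 4x4 images; one of them gets a single white pixel.
        BufferedImage a = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        b.setRGB(1, 2, Color.WHITE.getRGB());
        System.out.println(countDifferentPixels(a, b));
    }
}
```

For regression testing you would typically fail the test when the count exceeds a small tolerance, since anti-aliasing can produce a few stray pixel differences.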

You can take a look at xdiffweb.com. It's a pure Java open-source project based on Apache PDFBox.

Related

How to modify .docx template in either Java or Python

I want to be able to generate .docx files through either Java or Python based off of a template .docx file. I need to be able to insert in simple text, some bullets and a table or two.
I would like suggestions on specific libraries/modules for either Python or Java that would allow me to load a template, insert basic text and tables and then save it.
I have been looking into JACOB for Java and docx for Python. Any alternatives? Or will one of these be able to do what I need?
Thanks in advance
If you want to generate a docx, then you might like docxtemplater, a library I maintain that generates docx files from a template (much like Mustache for HTML).
It runs on node but has a command line interface so you can use it from any language.
DocxTemplater Library
Demo Site
Consider docx4j as an option; it covers similar ground to Apache POI but has better documentation.
Did you look at the Apache POI project (the Java API for Microsoft Documents)?
http://poi.apache.org/
Good luck!

Creating Docx, PDF, XSL-FO

[Background Info]
We had a solution in place that used server-side Word automation to convert HTM documents into Docx, PDF or print documents. This solution broke in the latest version of Windows Server 2012. We learned that MS does not intend Word to work in this manner, and after troubleshooting with MS support engineers we have come to the conclusion that it will never work.
[Currently]
I am currently researching potential technologies and tools that my company can use to regain this functionality. We need to be able to create Docx, PDF and print files to a local printer.
I have looked into a number of tools already and I am currently leaning towards Apache FOP, as it seems to handle PDF and printing for us.
However, I'm looking for advice and suggested tools that we could use to implement a pure Java approach. Currently our application creates HTM files with all the required information, so ideally we would like to take these HTM files and "convert" them into Docx/XSL-FO format.
[Question]
So my question, which I'm hoping you will be able to help me with:
What are the best tools that I can use to get from
HTM to Docx
HTM to PDF
Or what would be the best process for achieving this? Has anyone had success finding a solution for this in the past?
Thank You
It depends on the level of control and the complexity of the source HTML. There are HTML to FO stylesheets but you might find them wanting for your specific need.
So you could use the Jericho parser to read the HTML and generate FO yourself. Or you could generate the target formats directly, using Apache PDFBox for PDF and Apache POI for Docx.
It all boils down to the level of control you want or need.
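As a rough illustration of the "generate FO" step, the sketch below wraps plain-text paragraphs in a minimal XSL-FO document. It assumes the paragraphs have already been pulled out of the HTML by a parser of your choice, and the class name, page size and margins are arbitrary picks for the example. The resulting string is what you would hand to an FO processor such as Apache FOP.

```java
import java.util.List;

// Hypothetical helper: wraps plain-text paragraphs in a minimal XSL-FO document.
final class SimpleFoWriter {

    // Escapes the characters that are special in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds one A4 page sequence with one fo:block per paragraph.
    // XSL-FO requires at least one block in the flow, so pass a non-empty list.
    static String toFo(List<String> paragraphs) {
        StringBuilder fo = new StringBuilder();
        fo.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
          .append("<fo:root xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">\n")
          .append("  <fo:layout-master-set>\n")
          .append("    <fo:simple-page-master master-name=\"page\"")
          .append(" page-height=\"29.7cm\" page-width=\"21cm\">\n")
          .append("      <fo:region-body margin=\"2cm\"/>\n")
          .append("    </fo:simple-page-master>\n")
          .append("  </fo:layout-master-set>\n")
          .append("  <fo:page-sequence master-reference=\"page\">\n")
          .append("    <fo:flow flow-name=\"xsl-region-body\">\n");
        for (String p : paragraphs) {
            fo.append("      <fo:block space-after=\"6pt\">")
              .append(escape(p))
              .append("</fo:block>\n");
        }
        fo.append("    </fo:flow>\n")
          .append("  </fo:page-sequence>\n")
          .append("</fo:root>\n");
        return fo.toString();
    }

    public static void main(String[] args) {
        System.out.println(toFo(List.of("Invoice 42", "Terms & conditions apply.")));
    }
}
```

Real HTML needs more than this (tables, lists, inline styling), which is where the level-of-control trade-off mentioned above comes in.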
docx4j-ImportXHTML will get you from XHTML to docx. From there, you can use docx4j (or some other solution, e.g. LibreOffice/OpenOffice) to go from docx to PDF.
docx4j supports docx to XSL FO, and by default uses FOP.

Is there a way to extract text from PostScript (.ps , .eps) files using Java?

I am looking for a solution similar to PDFBox for PDFs or Apache Tika, but for PS files.
thanks.
Like James Black says, it's probably best just to convert to PDF and use your familiar tools.
However, there does exist pstotext which is available in, e.g., the Ubuntu universe in its own package.
Ghostscript itself also comes with both ps2txt and ps2ascii which can also do this.
You could use Ghostscript to convert to a pdf, http://www.osalt.com/ghostscript, then there are various libraries to handle a pdf.
This has an advantage in that you are only pulling from PDFs, so you can handle other formats as long as you can convert them to PDFs.
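A minimal sketch of driving that conversion from Java, assuming a Ghostscript executable named `gs` is on the PATH (on Windows the console binary has a different name). The class and method names are made up; the switches are the usual Ghostscript ones for batch conversion with the `pdfwrite` device.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the Ghostscript command line.
final class PsToPdf {

    // Builds the command; assumes the executable is called "gs" and is on the PATH.
    static List<String> command(Path psFile, Path pdfFile) {
        return List.of(
                "gs", "-q", "-dBATCH", "-dNOPAUSE",
                "-sDEVICE=pdfwrite",
                "-sOutputFile=" + pdfFile,
                psFile.toString());
    }

    // Runs Ghostscript and returns its exit code (0 normally means success).
    static int convert(Path psFile, Path pdfFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(psFile, pdfFile))
                .inheritIO() // show Ghostscript's own messages on our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command here; call convert(...) to actually run it.
        System.out.println(String.join(" ", command(Path.of("in.ps"), Path.of("out.pdf"))));
    }
}
```

Once the PDF exists, the text can be extracted with whichever PDF library you already use.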

Creating PDF, HTML, and optionally RTF documents from the same source using Java?

I was looking at using iText to create both a pdf and html version of a document with RTF as a possible option. According to this question this is no longer possible with iText. Is there a library that will allow me to create a document in Java and output it as both PDF and HTML? The ability to output RTF would be nice but is not required.
As that answer to the other question states, you can just use the iText RTF Library.
I have used PD4ML to convert HTML to PDF. Even though it is a commercial app, it is very reliable and supports CSS well.
JasperReports. If you look at this package it supports export to:
pdf
html
rtf
xls
xml
You have two options to create the documents:
via iReport - a visual designer for reports
via an API, where you construct everything with Java code.
Note that even though JasperReports's main function is to create reports, it can very well create other documents, with no tabular data for example.
You could also try Docmosis since that supports the output formats provided by OpenOffice (including the ones you specified) and you can often do the job with a lot less code.

Java - PDF Generation Framework

I am generating PDF documents for various purposes using the iText library.
It's like one class per PDF document. There are a lot of similarities among the classes, which are listed below:
The fields have (x,y) location
A field can be wrapped after a certain number of words
A field can have a value which is a function of one or more parameters
Subsequent pages of the PDF have to be kept either the same or different
I am thinking of handling this layout business through an XML file. Any thoughts or innovative ideas for solving this are welcome.
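One way the XML idea could look, as a sketch only: a layout file lists fields with a position and a wrap limit, and a small loader turns it into objects that the PDF-writing code can iterate over. The element and attribute names (`field`, `name`, `x`, `y`, `wrapAfterWords`) and the class names are invented for this example; the drawing itself is left to the PDF library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical loader for an invented XML layout format.
final class PdfLayout {

    static final class Field {
        final String name;
        final float x;
        final float y;
        final int wrapAfterWords;

        Field(String name, float x, float y, int wrapAfterWords) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.wrapAfterWords = wrapAfterWords;
        }
    }

    // Reads every <field name=".." x=".." y=".." wrapAfterWords=".."/> element.
    static List<Field> parse(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList nodes = root.getElementsByTagName("field");
            List<Field> fields = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                fields.add(new Field(
                        e.getAttribute("name"),
                        Float.parseFloat(e.getAttribute("x")),
                        Float.parseFloat(e.getAttribute("y")),
                        Integer.parseInt(e.getAttribute("wrapAfterWords"))));
            }
            return fields;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Invalid layout XML", ex);
        }
    }

    // Splits a value into lines of at most wordsPerLine words (wordsPerLine must be > 0).
    static List<String> wrap(String value, int wordsPerLine) {
        String[] words = value.trim().split("\\s+");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i += wordsPerLine) {
            int end = Math.min(i + wordsPerLine, words.length);
            lines.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<layout><field name=\"address\" x=\"72\" y=\"700\" wrapAfterWords=\"3\"/></layout>";
        for (Field f : parse(xml)) {
            System.out.println(f.name + " at (" + f.x + ", " + f.y + "): "
                    + wrap("12 Example Street Springfield Anystate", f.wrapAfterWords));
        }
    }
}
```

With this split, one generic class can render any document from its layout file, and computed values can be supplied by name from the calling code instead of being hard-coded per PDF class.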
Take a look at the PDFBox library, which is now in the Apache Incubator.
PDFBox is nice. I have used it before and got good help from the developer. You might also want to have a look at XSL-FO. It is an XML-based formatting language whose output can be rendered as PDF (and other formats) using Apache FOP.
What about Prince? It's a formatting engine that plays a similar role to FOP but uses CSS files for styling, and it has a Java API. It's not free though (apart from the free Personal License).
Flying Saucer supports using XHTML/CSS to create PDFs.
