I'm trying to make a tool that can import XML output files from a certain tool and 'convert' them into a nice PDF report that sums things up in understandable language for normal people.
The output files always contain specific data, and I just want my application to automatically create a report that's easy to read for someone who's not very familiar with technology.
I know this can't be fully automated, so I guess I'll have to use some kind of pre-defined templates or something, but I'm not sure what the best solution is.
Importing and reading the XML files isn't hard, but how do I convert them to a readable PDF document?
Apache FOP may help. It is a print formatter that uses XSL-FO (XSL Formatting Objects) as its template language and supports PDF as an output format. XSL-FO is a language for formatting XML data.
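To make the template idea more concrete, here is a minimal sketch of the first half of that pipeline: an XSLT template that turns tool output into XSL-FO with plain-language sentences, using only the JDK's built-in XSLT support. The input format (`scan`, `host`, `issue`) and the wording of the sentence are invented for illustration, since the question does not show the real XML. The resulting FO would then be handed to FOP to render the PDF, which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ReportToFo {
    // Hypothetical tool output; the element and attribute names are assumptions.
    static final String SAMPLE_XML =
        "<scan><host name=\"server01\">"
      + "<issue severity=\"high\">Outdated TLS version</issue>"
      + "</host></scan>";

    // The template: one plain-language fo:block per issue on a single A4 page master.
    static final String TEMPLATE_XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
      + "<xsl:template match=\"/scan\">"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name=\"A4\" page-height=\"29.7cm\""
      + " page-width=\"21cm\" margin=\"2cm\">"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference=\"A4\">"
      + "<fo:flow flow-name=\"xsl-region-body\">"
      + "<xsl:for-each select=\"host/issue\">"
      + "<fo:block>The machine <xsl:value-of select=\"../@name\"/> has a "
      + "<xsl:value-of select=\"@severity\"/>-priority problem: "
      + "<xsl:value-of select=\".\"/>.</fo:block>"
      + "</xsl:for-each>"
      + "</fo:flow></fo:page-sequence></fo:root>"
      + "</xsl:template></xsl:stylesheet>";

    /** Applies the XSLT template to the XML and returns the XSL-FO document as text. */
    static String toFo(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toFo(SAMPLE_XML, TEMPLATE_XSLT));
    }
}
```

Keeping the wording in the stylesheet means the report text can be changed without touching the Java code.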
Related
I am currently working on a project which generates contracts. The idea is that I put the data in a form and save it in a simple database.
So far, this has been my favorite place to search for good ideas and simple solutions.
Now I am facing another problem and I don't know how I can solve that. I want to create a PDF and replace some placeholders with some data from my form.
One idea was to use an existing Word template with some bookmarks and replace them with the data from my form. Maybe there is a way to do that, and I am just too stupid to find it.
Another idea was to use XML. I thought I was clever and converted the Word template to a PDF, so that I could convert that PDF to XML. Attached, you will find the XML file. But now I need the XSL file - is there an easy way to create the XSL file?
Or maybe there is another simple solution to my problem.
In these attachments you will find the PDF file, the Word template and the XML:
Thanks a lot :)
Using a template is a good idea - it makes some changes much quicker to make and then deploy. The comments above are focused on conversion, but don't forget you need to merge your data in (population) first.
If you can use Adobe tools, you can have a PDF template and use the Adobe tools to populate. This saves a "conversion" stage.
You mentioned using Word for templates. This means you have to run through two stages of processing:
population - docx is a zipped set of XML files, so you can process them with your own code or using a library.
conversion - you need PDF, so you have to convert the docx to PDF. You also have to watch out for fonts at this stage (i.e. make sure they are available on your host).
You could do the population stage yourself since you are familiar with XML, but it is definitely complicated. The conversion stage needs a tool built for that job. There are a few mentioned in the comments already.
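As a rough illustration of the population stage done by hand, the sketch below uses only `java.util.zip` to copy a docx archive and replace `${...}` placeholders in `word/document.xml`. The `${customer}` placeholder syntax and the tiny stand-in archive are assumptions made for this example. Real Word files often split a placeholder across several runs, which is one reason this gets complicated and a library may be the better choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxPopulator {

    /** Copies every zip entry, substituting ${key} placeholders in word/document.xml only. */
    static byte[] populate(byte[] docx, Map<String, String> values) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(docx));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] data = in.readAllBytes(); // reads to the end of the current entry
                if (entry.getName().equals("word/document.xml")) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    for (Map.Entry<String, String> e : values.entrySet()) {
                        xml = xml.replace("${" + e.getKey() + "}", escapeXml(e.getValue()));
                    }
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(data);
                out.closeEntry();
            }
        }
        return result.toByteArray();
    }

    /** Escapes the characters that must not appear raw in XML text. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Returns the content of one zip entry as text, or null if it is missing. */
    static String readEntry(byte[] zip, String name) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    /** Builds a stand-in archive holding only word/document.xml (not a complete docx). */
    static byte[] sampleTemplate() throws IOException {
        String xml = "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
            + "<w:body><w:p><w:r><w:t>Contract for ${customer}</w:t></w:r></w:p></w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("word/document.xml"));
            out.write(xml.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] filled = populate(sampleTemplate(), Map.of("customer", "Smith & Co"));
        System.out.println(readEntry(filled, "word/document.xml"));
    }
}
```

The output of `populate` is still a docx, so the conversion stage to PDF has to follow as a separate step.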
There are some free/open-source and commercial tools that can do both parts:
docx4j
JOD Reports
LibreOffice (using the Java UNO API) (I blogged this once - Java Convert Word to PDF with UNO)
Docmosis (please note I work for Docmosis)
I suggest starting with the simple example you have attached and proving that you can both populate and convert it. Then switch to a more complicated example to see if you can do the other things that might be required (e.g. repeating sections, conditions or other logic) during the population stage.
I have to generate Word documents from my application for an entity, containing some information about that entity, and I am using POI for this. But with POI I have to decide in code where to create a paragraph, where to make text bold/italic, etc., based on configuration in the entity object, which I can easily handle in the code.
But is there any way so that i can just define all these style/alignments etc information in any XML/XSL or in any other type of config so i can get rid of styling in my java code ?
Regarding your title question, see Where can I find the XSDs of DOCX XML files?
Regarding your body question,
But is there any way so that i can just define all these
style/alignments etc information in any XML/XSL or in any other type
of config so i can get rid of styling in my java code ?
Yes, of course, and it would be a wise design decision to do so. Since DOCX is OOXML (within OPC), your XSLT will be able to generate OOXML character-level formatting via w:rPr settings such as w:b, w:i, etc.
The challenge you'll be facing, however, is that you'll be forgoing the convenience provided by the POI API. You'll also have to reconstruct the OPC if you want to produce a proper DOCX file rather than just an importable OOXML file. For small projects, the learning curve required to wield OOXML directly is likely to be too steep to merit a direct-to-OOXML approach.
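For a feel of what the direct-to-OOXML route looks like, here is a small sketch in which an XSLT stylesheet maps styling flags from a config-like XML onto `w:rPr` children. The input format (`entity`, `field`, and the `bold`/`italic` attributes) is invented for this example. The output is only a fragment of what would go into `word/document.xml`; packaging it into a proper OPC container is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StyleToOoxml {
    // Hypothetical entity data with styling flags kept outside the Java code.
    static final String ENTITY_XML =
        "<entity>"
      + "<field bold=\"true\">Name: Alice</field>"
      + "<field italic=\"true\">Role: Reviewer</field>"
      + "</entity>";

    // Each field becomes one paragraph with one run; w:b and w:i are emitted on demand.
    static final String STYLE_XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\" indent=\"no\"/>"
      + "<xsl:template match=\"/entity\">"
      + "<w:body><xsl:apply-templates select=\"field\"/></w:body>"
      + "</xsl:template>"
      + "<xsl:template match=\"field\">"
      + "<w:p><w:r><w:rPr>"
      + "<xsl:if test=\"@bold='true'\"><w:b/></xsl:if>"
      + "<xsl:if test=\"@italic='true'\"><w:i/></xsl:if>"
      + "</w:rPr><w:t><xsl:value-of select=\".\"/></w:t></w:r></w:p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Runs the stylesheet and returns the generated WordprocessingML fragment. */
    static String toOoxml(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toOoxml(ENTITY_XML, STYLE_XSLT));
    }
}
```

All styling decisions now live in the stylesheet, at the cost of working with raw WordprocessingML instead of the POI API.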
I am looking to batch convert a large number of ALTO-format XML documents to various formats on Windows: txt at least, rtf if possible, and PDF would be convenient as well.
ALTO is an XML standard used by libraries and archives to hold metadata/format/font/layout-aware text for reconstruction in PDF images.
I have only the XML files for a large archive that I would like to convert for text mining. The software I am using requires clean text or rtf files, so converting the XML to plain text is kind of the goal. Because ALTO is a standard, the conversion should be possible, no?
A bonus would be the ability to either embed the metadata in a PDF or convert it to a bibliographical format file like LaTeX. This could be a separate program.
I'd appreciate any ideas,
Thanks.
In order to get plain text from the ALTO xml, you may try implementing the simple method used in this (hacky) Python script in Java: https://github.com/cneud/alto-ocr-text.
I am not currently aware of a straight conversion to PDF or LaTeX, though you may be able to do this with a stylesheet, depending on what exactly your ALTO files look like.
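For the plain-text part, a minimal Java sketch could look like the following. It is not a port of the linked script; it simply assumes the usual ALTO layout where each `TextLine` contains `String` elements whose `CONTENT` attribute holds one word, and it ignores namespaces versions, hyphenation (`HYP`) and all layout information. The sample document is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AltoToText {
    // Tiny made-up ALTO document; real files carry far more attributes.
    static final String SAMPLE =
        "<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\">"
      + "<Layout><Page><PrintSpace><TextBlock>"
      + "<TextLine><String CONTENT=\"Hello\"/><SP/><String CONTENT=\"world\"/></TextLine>"
      + "<TextLine><String CONTENT=\"second\"/><SP/><String CONTENT=\"line\"/></TextLine>"
      + "</TextBlock></PrintSpace></Page></Layout></alto>";

    /** Joins the words of each TextLine with spaces and the lines with newlines. */
    static String extractText(String altoXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the local-name lookups below
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(altoXml.getBytes(StandardCharsets.UTF_8)));

        List<String> lines = new ArrayList<>();
        // "*" matches any namespace, so different ALTO schema versions are accepted.
        NodeList textLines = doc.getElementsByTagNameNS("*", "TextLine");
        for (int i = 0; i < textLines.getLength(); i++) {
            Element line = (Element) textLines.item(i);
            NodeList words = line.getElementsByTagNameNS("*", "String");
            List<String> parts = new ArrayList<>();
            for (int j = 0; j < words.getLength(); j++) {
                parts.add(((Element) words.item(j)).getAttribute("CONTENT"));
            }
            lines.add(String.join(" ", parts));
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText(SAMPLE));
    }
}
```

For a batch run, the same method could be called for every file in a directory and the result written to a `.txt` file next to the source.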
[Background Info]
We had a solution in place that used server-side Word automation to convert HTM documents into Docx, PDF or printed documents. This solution broke on the latest version of Windows Server 2012. We learned that MS does not intend Word to work in this manner, and after troubleshooting with MS support engineers we have come to the conclusion that it will never work.
[Currently]
I am currently researching potential technologies and tools that my company can use to regain this functionality. We need to be able to create Docx, PDF and print files to a local printer.
I have looked into a number of tools already and I am currently leaning towards Apache FOP, as this seems to handle PDF and printing for us.
However, I'm looking for some advice and suggested tools that we could use to implement a pure Java approach. Currently our application creates HTM files with all the required information. So ideally we would like to take these HTM files and "convert" them into Docx/XSL-FO format.
[Question]
So here is my question, which I'm hoping you will be able to help me with.
What are the best tools that I can use to get from
HTM to Docx
HTM to PDF
Or what would be the best process for achieving this? Has anyone had success finding a solution for this in the past?
Thank You
It depends on the level of control and the complexity of the source HTML. There are HTML to FO stylesheets but you might find them wanting for your specific need.
So you could use the Jericho parser to read the HTML and generate FO. Or you could generate the target format directly using Apache PDFBox and Apache POI.
It all boils down to the level of control you want/need.
docx4j-ImportXHTML will get you from XHTML to docx. From there, you can use docx4j (or some other solution eg LibreOffice/OpenOffice) to do docx to PDF.
docx4j supports docx to XSL-FO, and by default uses FOP.
I have the following problem: I have an XML file with an XSL stylesheet that renders this XML file as a neat table in HTML when I load it in a web browser. Now I need to make a PDF that is looking EXACTLY like that XSL-styled XML in web browser, without the need to make custom FO's for every file. Everything must be done in Java.
I need to make a PDF that is looking EXACTLY like that XSL-styled XML in web browser
Think again about this requirement. Paged media such as PDF and non-paged media such as HTML may only look "close enough", but never "exactly like" each other. This is even more obvious if you consider your HTML being displayed on devices with different screen sizes.
If you relax the above requirement somewhat, you'll probably agree that XSL-FO is the best choice. You definitely do not need to write "custom FO's for every file": write an XSLT just once, and use it on-the-fly to convert your XML to XSL-FO, and then use a rendering engine to process XSL-FO to PDF. Simple.
XSL-FO does sound like exactly what you need. But if that's not an option, you could first run the XSLT transform on the XML explicitly in Java and then convert the resulting HTML (which by then is a String/byte array/DOM/whatever you want) to PDF using an additional library. There are some libraries that support HTML to PDF, like iText for example. XSLT transformations in Java are really simple; there is little code involved.
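To show how little code the transform step needs, here is a minimal sketch using the JDK's built-in `javax.xml.transform` API. The sample XML and stylesheet are made up for the example and stand in for your own files. Passing the resulting HTML string to an HTML-to-PDF library is a separate step that is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlToHtml {
    // Made-up data and stylesheet, standing in for the real XML and XSL files.
    static final String SAMPLE_XML =
        "<people><person><name>Ann</name><age>31</age></person></people>";

    static final String SAMPLE_XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/people\">"
      + "<table><xsl:for-each select=\"person\">"
      + "<tr><td><xsl:value-of select=\"name\"/></td><td><xsl:value-of select=\"age\"/></td></tr>"
      + "</xsl:for-each></table>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML and returns the rendered HTML as a String. */
    static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter html = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(html));
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSLT));
    }
}
```

With files on disk, the `StringReader` sources can be swapped for `new StreamSource(new File(...))` without changing anything else.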