I am looking to batch convert a large number of ALTO-format XML documents to various formats on Windows: txt at least, rtf if possible, and pdf would be convenient as well.
ALTO is an xml standard used by libraries and archives to hold metadata/format/font/layout aware text for reconstruction in PDF images.
I have only the XML files for a large archive that I would like to convert for text mining. The software I am using requires clean text or rtf files, so converting the xml to plain text is kind of the goal. Because ALTO is a standard the conversion should be possible, no?
A bonus would be the ability to either embed the metadata in a pdf or convert it to a bibliographical format file like LaTeX. This could be a separate program.
I'd appreciate any ideas,
Thanks.
In order to get plain text from the ALTO xml, you may try implementing the simple method used in this (hacky) Python script in Java: https://github.com/cneud/alto-ocr-text.
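As a rough illustration of that kind of extraction in Java, here is a minimal sketch using only the JDK's StAX parser. It assumes the usual ALTO layout, in which each TextLine element contains String elements whose CONTENT attribute holds one word. The class name AltoText and the method extract are made up for this example, and hyphenation (HYP elements) and reading order are not handled.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any ALTO library.
final class AltoText {

    /** Returns the words of an ALTO document, one output line per TextLine. */
    static String extract(String altoXml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            // Archive files should not need DTDs; switch DTD support off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(altoXml));
            boolean firstInLine = true;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Compare local names only, so any ALTO namespace version matches.
                    String name = reader.getLocalName();
                    if (name.equals("TextLine")) {
                        firstInLine = true;
                    } else if (name.equals("String")) {
                        String content = reader.getAttributeValue(null, "CONTENT");
                        if (content != null) {
                            // Separate words with a single space; SP elements are ignored.
                            if (!firstInLine) {
                                out.append(' ');
                            }
                            out.append(content);
                            firstInLine = false;
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("TextLine")) {
                    out.append('\n');
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse ALTO XML", e);
        }
        return out.toString();
    }
}
```

Calling AltoText.extract on the contents of each file and writing the result to a .txt file would cover the plain-text part of the batch job.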
I am not currently aware of a straight conversion to PDF or LaTeX, though you may be able to do this with a stylesheet, depending on exactly what your ALTO files look like.
Related
I have hundreds of rich PDFs that need to be generated from my application; they have images and colorful content. I was looking to build a framework which supports a template and a data model and can take care of the rest, so that adding a new PDF would just mean adding a new template. In the past I have used FreeMarker to generate HTML and then printed that HTML to PDF. Are there any better, more recent solutions to this problem?
There are various things you could do:
generate xml data, apply xslt transformation to style it, and convert the html document to a pdf
code a small class that converts whatever data format you have to a pdf document (you would need to do all the layout through code)
create a template (using whatever design program you want) pdf document, insert form fields, and have iText fill the form (several of our customers go for this approach)
Keep in mind that JasperReports uses a proprietary format, whereas the approaches I suggested use only open and well-established formats.
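The transformation step of the first option (XML data styled by an XSLT stylesheet into HTML) needs nothing beyond the JDK. A minimal sketch follows; the class name XmlToHtml and the method render are illustrative only, and the HTML-to-PDF step is left to whichever library is chosen.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies an XSLT stylesheet to XML data and returns the HTML.
final class XmlToHtml {

    static String render(String xml, String xslt) {
        try {
            // Build a transformer from the stylesheet (this is the "template").
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            // Apply it to the XML data (this is the "data model").
            StringWriter html = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)),
                    new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

With this split, adding a new kind of document would mean adding a new stylesheet, while the Java code stays the same.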
Take a look at JasperReports.
Is there any Java library which can be used to convert Microsoft Word files (doc/docx) to OpenDocument Text format (.odt)? A free library would be preferable.
I don't know of any libraries that do it directly, but it should be relatively easy to extract the bits you're interested in from a .docx using POI:
http://poi.apache.org/
and then write them to an ODT format using ODFDOM:
http://incubator.apache.org/odftoolkit/odfdom/index.html
This should be relatively straightforward for simple documents, but if your use case calls for complex documents containing pictures etc., this might become a LOT harder.
Anyway, hope this helps at least some ;)
I believe everything you need is in this post: http://angelozerr.wordpress.com/2012/12/06/how-to-convert-docxodt-to-pdfhtml-with-java/
For instance:
JODConverter : JODConverter automates conversions between office
document formats using OpenOffice.org or LibreOffice. Supported
formats include OpenDocument, PDF, RTF, HTML, Word, Excel, PowerPoint,
and Flash. It can be used as a Java library, a command line tool, or a
web application.
I'm trying to make a tool that's able to import XML output files from a certain tool and 'convert' them into a nice PDF report that sums things up in understandable language for normal people.
Output files always contain specific data, but I just want my application to automatically create a report that's not hard to read for someone who's not very familiar with technology.
I know it's impossible to do completely automated, so I guess I'll have to use some kind of pre-defined templates or something, but I'm not sure what the best solution is.
Importing and reading the XML files isn't hard, but how do I convert them to a readable PDF document?
Apache FOP may help. It is a print formatter that uses XSL-FO (XSL Formatting Objects) as its template language and supports PDF as an output format. XSL-FO is a language for formatting XML data.
I have the following problem: I have an XML file with an XSL stylesheet that renders this XML file as a neat table in HTML when I load it in a web browser. Now I need to make a PDF that is looking EXACTLY like that XSL-styled XML in web browser, without the need to make custom FO's for every file. Everything must be done in Java.
I need to make a PDF that is looking EXACTLY like that XSL-styled XML in web browser
Think again about this requirement. Paged media such as PDF and non-paged media such as HTML may only look "close enough", but never "exactly like" each other. This is even more obvious if you consider your HTML being displayed on devices with different screen sizes.
If you relax the above requirement somewhat, you'll probably agree that XSL-FO is the best choice. You definitely do not need to write "custom FO's for every file": write an XSLT just once, and use it on-the-fly to convert your XML to XSL-FO, and then use a rendering engine to process XSL-FO to PDF. Simple.
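To make the "write an XSLT just once" idea concrete, here is a hedged sketch in which a stylesheet is compiled once into a Templates object and then reused for every incoming document. The input shape (a rows element with row children), the class name XmlToFo and the page dimensions are all assumptions made for illustration. The resulting XSL-FO string would then be handed to a rendering engine such as Apache FOP, which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: converts <rows><row>..</row></rows> into XSL-FO.
final class XmlToFo {

    // Illustrative stylesheet: one fo:block per row on an A4-sized page.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
            + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
            + "<xsl:template match='/rows'>"
            + "<fo:root>"
            + "<fo:layout-master-set>"
            + "<fo:simple-page-master master-name='page'"
            + " page-height='29.7cm' page-width='21cm' margin='2cm'>"
            + "<fo:region-body/>"
            + "</fo:simple-page-master>"
            + "</fo:layout-master-set>"
            + "<fo:page-sequence master-reference='page'>"
            + "<fo:flow flow-name='xsl-region-body'>"
            + "<xsl:for-each select='row'>"
            + "<fo:block><xsl:value-of select='.'/></fo:block>"
            + "</xsl:for-each>"
            + "</fo:flow>"
            + "</fo:page-sequence>"
            + "</fo:root>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Compiled once; a Templates object can be shared between threads.
    private static final Templates COMPILED = compile();

    private static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(STYLESHEET)));
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Stylesheet does not compile", e);
        }
    }

    /** Transforms one XML document to XSL-FO using the precompiled stylesheet. */
    static String toFo(String xml) {
        try {
            // A fresh Transformer per document; only the compiled form is reused.
            Transformer transformer = COMPILED.newTransformer();
            StringWriter fo = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)),
                    new StreamResult(fo));
            return fo.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XML to XSL-FO transformation failed", e);
        }
    }
}
```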
XSL-FO does sound like exactly what you need. But if that's not an option, you could first do the XSLT transform on the XML explicitly in Java and then convert the resulting HTML (which by then is a String/byte array/DOM/whatever you want) to PDF using an additional library. There are some libraries that support HTML to PDF, iText for example. XSLT transformations in Java are really simple, with little code involved.
Does iText support any kind of style sheet?
What I mean is, like in Apache FOP, the data is represented in the XML and the formatting is programmed in the XSL. So then we pass the XML and XSL to the FOP engine which in turn converts the data in XML using the formatting specified in the XSL to create a PDF.
Does iText support similar functionality, or is the only way to program the whole formatting in the Java code itself, meaning specifying the table/cell (its dimensions etc.) and paragraph (its font, color etc.)?
iText isn't FOP, no. The only way is to program the whole formatting in the Java code itself. OTOH, your program could read formatting information from various files in the format of your choice, but you'd have to write that code yourself.
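As a sketch of what "read formatting information from a file" might look like, the following uses a java.util.Properties file as the format of choice. The key names (paragraph.font, paragraph.size, paragraph.color), the default values and the class name ParagraphStyle are all invented for this example. Applying the values to iText objects is not shown and would still have to be written by hand.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for formatting settings read from a properties file.
final class ParagraphStyle {
    final String fontFamily;
    final float fontSize;
    final String color;

    private ParagraphStyle(String fontFamily, float fontSize, String color) {
        this.fontFamily = fontFamily;
        this.fontSize = fontSize;
        this.color = color;
    }

    /** Parses properties-file text; missing keys fall back to defaults. */
    static ParagraphStyle parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ParagraphStyle(
                props.getProperty("paragraph.font", "Helvetica"),
                Float.parseFloat(props.getProperty("paragraph.size", "12")),
                props.getProperty("paragraph.color", "#000000"));
    }
}
```

The PDF-building code could then read fontFamily, fontSize and color from the parsed object instead of hard-coding them, which keeps the styling editable without recompiling.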
iText in Action 2nd ed has a sample that outlines building an XML parser and using that to feed iText. Nothing about style info other than what's written in the code.