Java Utility to convert content of any file to text file. - java

I am looking for a java utility through which a user can convert any type of file (pdf, doc, docx, xls, xlsx, csv, rtf, txt). We have a requirement in which user can upload any type of file and we need to read the content of the file(only text), convert it and store it in an object. That can be done using Apachi poi but I am wondering if any java utility exists?

You may be interested in Apache Tika, which includes the functionality of Apache POI and PDFBox. From the project description, the toolkit: "detects and extracts metadata and structured text content from various documents using existing parser libraries."

I imagine you can't have some sort of universal function for every type of file. You will need to implement conversion methods for each file type. This link helps with PDF files, and will also give you a template to work with your other file types.

Related

xsl stylesheet for generating word documents from Java

I was wondering if there is any Java API which allows to generate Word document similarly Apache FOP does.
With FOP, it is possible to specify a style sheet which defines the layout of the page in which the data (stored in an xml file) are printed.
The Transformer object within FOP library is in charge of that.
Is there any equivalent API for word document?
With FOP, you can try XML to RTF, which Word accepts.
From their webpage, XMLmind XSL-FO Converter apparently generates:
RTF (can be opened in Word 2000+),
WordprocessingML (can be opened inWord 2003+),
Office Open XML (.docx, can be opened in Word 2007+),
Putting FO to one side, here are 2 different approaches:
The first would be to write an XSLT to convert your XML to Flat OPC XML. Most parts in the Flat OPC XML would simply be copied there by your XSLT. (Generate that template content in Word, using "save as XML"). You'll be focusing mainly on populating the document.xml part. Word can open a Flat OPC XML file, or you can use docx4j (a project I work on) to convert Flat OPC XML to docx.
The second would be to use the docx4j Flying Saucer fork to convert your XML + CSS to docx content. See the code samples. You may need to customise it a bit; one way of feeding it CSS is this file. This actually ought to work pretty well; there is stuff there for mapping class attributes to Word styles, so if you could adorn your XML with class attributes, you could get even better results.
I will assume that your input is an XML document, or at least a CSV file.
1) Create an XSLT stylesheet to transform your input into the Word document format. The result will be a file we will call content.xml. You can apply the stylesheet to the input from Java.
2) Create a MS Word shell and put the content.xml into the shell. There are tools within Apache POI to do this.
I may have taken your question too literally. You might also be able to generate the document using Apache POI API. Also, if your MS Word document doesn't have tables, you can use Apache FOP to generate an RTF document, which can then be easily translated to a .docx file.
Apache POI provides Java APIs for reading and writing Microsoft Excel, Word, and PowerPoint files.
You can checkout POI's Javadocs here.

Generate PDF files using iText and apache velocity template(.vm)

What is the general workflow to generate a PDF using iText and an Apache Velocity template file (.vm) in Java?
I am interested in knowing steps like: parse template file, put Java object in context and steps to be performed to generate pdf etc.
I know this is a very basic question. But I am not able to find even a single example of this type on the web. I found XDocReport, but I am interested to know other alternatives as well.
Please help me with some sample project link or at least the steps to get started.
Yes, you can.
It all depends on how complex you want the PDFs to be.
Here are the steps for basic functionality
Generate a HTML file using Apache Velocity template file (.vm).
Use com.itextpdf.text.html.simpleparser.HTMLWorker (deprecated) to parse/convert that HTML file into a PDF.
Additionally, you can use com.itextpdf.text.pdf.PdfCopy.PageStamp to add content (borders, stamps, notes, annotations etc) to an existing PDF.
There is also com.itextpdf.tool.xml.XMLWorker for more advanced HTML conversion (adding style sheets etc)
Generating PDF using iText and an Apache Velocity template file (.vm) in Java directly is not possible because:
PDF is binary format,
Velocity generates plain text content.
On other words, Velocity cannot generate PDF.
XDocReport is able to generate a docx/odt report by merging a docx/odt template which contains some Velocity/Freemarker syntax with Java context. The generated docx/odt report can be convert it to pdf/xhtml.
It works because docx/odt are a zip which contains several xml entries. If you unzip a docx you will see word/document.xml. In this entry, you will see the content that you have typed with MS Word. word/document.xml is a plain text, so Velocity can be used in this case.
Here the XDocReport process to generate pdf from a docx template which uses Velocity:
Load docx template. this step consist to unzip the docx and stores in a map each xml entries (name entry as key and byte array as value). For instance map contains a key with word/document.xml and the xml content of this entry as value.
Loop for each xml entries which must be merged with Java context. For instance word/document.xml is merged with Java context by using Velocity and the result of merge replace the word/document.xml value of the map
Rebuild a new docx by zipping each entries of the map.
At this step we have a generated docx (the report).
To convert it to another format, XDocReport provides a docx-to-pdf converter based on Apache POI and iText. Here the XDocReport process to convert a docx to pdf:
Load docx with Apache POI
Loop for each structures of POI (XWPFParagraph, etc.) to create iText structure (iText Paragraph).
Note that XDocReport is modular and you can use other converters as well.
At first,we use freemarker template to generate a html file,and then render html to a pdf file by IItextRender .Finally, we can view pdf file in browser,there has a very useful javascript tools called pdfjs. Maybe you can try it.

convert any file type to pdf using Java API

Can you please let me know which java api (open source - Devlopment & Commercial) can be used to convert any file type (e.g. doc, docx, xls, xlsx, ppt, pptx) to pdf. Those files may contain text, image, graph,chart, style etc.
Thanks in advance.
You can user iText library for creating pdf documents, see: http://itextpdf.com/. For reading the doc, docx, xls, etc. files I suggest using apache poi library, see: http://poi.apache.org/

Generation of pdf file using java

I am developing a standalone application in Java. I want to generate a pdf file using Java code. I have a display form in which all the details are fetched from database and displayed in the window. Details are Customer Name, Order Details etc.
Now I want to have a button there which says Convert to pdf.
I want to convert this to pdf file with proper alignment and formatting like tables, font etc.
What can be an ideal way to go about it?
I'd suggest you to use reporting tool like a jasperreports.
JasperReports is entirely written in
Java and it is able to use data coming
from any kind of data source and
produce pixel-perfect documents that
can be viewed, printed or exported in
a variety of document formats
including HTML, PDF, Excel, OpenOffice
and Word.
Have a look at other open source projects (pdf api):
Apache PDFBox
Apache Tika (Toolkit for detecting and extracting metadata and structured text content from various documents using POI and PDFBOX parser libs.)
PDFjet
Use iText:
http://itextpdf.com/

Parse Pdf File and write content in word file using java

how to Parse a PDF file and write the content in word file using Java?
For parsing a PDF file in Java, you can use Apache PDFBox: http://incubator.apache.org/pdfbox/
For reading/writing Word (or other Office) file formats in Java, try POI: http://poi.apache.org/
Both are free.
Try the iText java library:
iText is an ideal library for developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation.
It can be used for your parsing step.
As for generating word documents - the OpenOffice Java API might be able to generate Word compatible docs (no personal experience with this API).
You might want to try any of these:
http://incubator.apache.org/pdfbox/
https://pdf-renderer.dev.java.net/
Once you are reading the contents of the PDF file, you can as well store them in a ODT file or a text file. For ODT file, try http://odftoolkit.openoffice.org.
Best!
You could use iText if the source PDF is mostly text. Images and such are quite hard to handle while parsing. If it's text only, it's as easy as 10 lines of code. See the iText manual for examples.
For writing word files there's only Apache POI. It can be a little tricky to figure out, but for such a simple task it shouldn't be any problem.

Categories

Resources