I have a large PDF file (20 MB, 800 pages) that contains some information.
It has an index with hyperlinks, and most of the remaining information is in tabular format. I need to retrieve this information using Java and store it in SQL Server.
Which is the best API available to read this kind of file from Java?
It is unlikely to be in tabular format inside the PDF, because a PDF does not contain structure information unless it was explicitly added at creation time. I wrote an article explaining some of the issues with text extraction from a PDF at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
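To illustrate the point: a text extractor usually hands back lines of text rather than table cells, so the rows have to be rebuilt heuristically. Here is a minimal sketch in plain Java, assuming the lines were already extracted by some PDF library and that columns are separated by two or more spaces. Both are assumptions, and `TableRowSplitter` and the sample lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: rebuilds table rows from text lines extracted from a PDF. */
final class TableRowSplitter {

    /** Splits one line into cells, assuming columns are separated by 2+ whitespace characters. */
    static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split("\\s{2,}")) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    /** Keeps only the lines that yield the expected number of columns; other lines are skipped. */
    static List<List<String>> parseRows(List<String> lines, int expectedColumns) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            List<String> cells = splitColumns(line);
            if (cells.size() == expectedColumns) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample of what extracted text might look like.
        List<String> extracted = Arrays.asList(
            "Order ID   Customer Name   Amount",
            "1001       Jane Doe        250.00",
            "Page 3 of 800");
        for (List<String> row : parseRows(extracted, 3)) {
            System.out.println(row);
        }
    }
}
```

Each resulting row could then be written to SQL Server with a JDBC `PreparedStatement`. The heuristic will break on cells that contain wide gaps or that wrap across lines, so expect to tune it for the actual document.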
Have you tried iText:
iText
Download iText
iText in Action — 2nd Edition
List of the Examples
I currently have a requirement to download multiple files (PDF, XLSX, PPT, JPEG, PNG) from an SFTP server, merge them into one PDF file, and provide it to the client for printing. I thought of using the iText library to convert all the files to PDF and then merge the PDFs, but I don't know whether that is possible, so I am asking for a better approach. I have already implemented the file download from the SFTP server using JSch.
You can merge multiple PDF documents into a single PDF document using PDFBox's PDFMergerUtility class, which provides methods to merge two or more PDF documents into one.
Answering my own question for the benefit of others.
To convert files with the extensions docx, xlsx, and pptx, I used
Spire.Office for Java (free evaluation version available).
I also tried the Aspose.Cells library (free evaluation available) to convert xlsx to PDF. Both libraries worked fine and were hassle-free, but neither is free.
I then merged all the PDF files using the iText library.
If someone has a better alternative, kindly share it.
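Before converting and merging, it can help to sort the downloaded files by type so each group goes to the right converter. This is only a sketch in plain Java: `MergePlanner`, its `Kind` groups, and the extension list are assumptions for illustration, and the actual conversion and merge calls of whichever library is used are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: groups downloaded files by the kind of handling they need. */
final class MergePlanner {

    enum Kind { PDF, OFFICE, IMAGE, UNSUPPORTED }

    /** Classifies a file by its extension, ignoring case. */
    static Kind classify(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = (dot < 0) ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "pdf":
                return Kind.PDF;      // can be merged as-is
            case "docx": case "xlsx": case "pptx": case "ppt":
                return Kind.OFFICE;   // needs an office-to-PDF converter first
            case "jpeg": case "jpg": case "png":
                return Kind.IMAGE;    // needs to be placed on a PDF page first
            default:
                return Kind.UNSUPPORTED;
        }
    }

    /** Groups file names by kind, preserving the input order within each group. */
    static Map<Kind, List<String>> plan(List<String> fileNames) {
        Map<Kind, List<String>> groups = new EnumMap<>(Kind.class);
        for (String name : fileNames) {
            groups.computeIfAbsent(classify(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up file names for illustration.
        System.out.println(plan(Arrays.asList("invoice.pdf", "sheet.XLSX", "photo.png", "notes.txt")));
    }
}
```

Grouping first keeps the conversion step and the merge step separate, so a converter library can be swapped out without touching the merge code.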
For merging multiple files, you can refer to this example.
I am attempting to extract data from a PDF file, but I could not find any example on the internet. Is it possible to use the JPedal library to open a PDF file and read its data?
You can use PDFBox from Apache.
I am not familiar with JPedal, but I write lots of code that generates and processes PDF files. I use iText and highly recommend it. If you have a specific question on how to process a PDF file, let me know.
Which Java APIs help extract table metadata from a PDF and present that table in a web page?
When the page source is viewed, it should show the HTML code of that table.
iText is useful in this context:
http://itextpdf.com/
I assume that you need a PDF library for Java.
PDFBox is one of the popular libraries for PDF manipulation, and I think it is worth a look.
Try the Metadata Extract Tool, which extracts metadata from specific file types, including PDF. You can then parse the XML output with any Java XML parser. Once you are able to parse it, the elements can easily be laid out in your view page.
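Whichever library does the extraction, the last step is turning the cell values into markup. Here is a rough sketch in plain Java, assuming the table has already been extracted into a list of rows and that the first row is the header. `HtmlTableRenderer` and the sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: renders already-extracted table rows as an HTML table. */
final class HtmlTableRenderer {

    /** Escapes the characters that have special meaning in HTML text. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Renders the rows as an HTML table; the first row is treated as the header row. */
    static String render(List<List<String>> rows) {
        StringBuilder html = new StringBuilder("<table>\n");
        for (int i = 0; i < rows.size(); i++) {
            String tag = (i == 0) ? "th" : "td";
            html.append("  <tr>");
            for (String cell : rows.get(i)) {
                html.append('<').append(tag).append('>')
                    .append(escape(cell))
                    .append("</").append(tag).append('>');
            }
            html.append("</tr>\n");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        // Made-up rows standing in for data extracted from the PDF.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Name", "Amount"),
            Arrays.asList("A & B Ltd", "250.00"));
        System.out.println(render(rows));
    }
}
```

Because the string is plain HTML, writing it into the response of a servlet or JSP makes the table visible in the page source. Escaping the cell text matters, since extracted PDF text can contain characters such as `<` or `&`.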
I am developing a standalone application in Java, and I want to generate a PDF file from Java code. I have a display form in which all the details (customer name, order details, etc.) are fetched from a database and displayed in the window.
I want to add a button there labelled "Convert to PDF".
It should convert the form to a PDF file with proper alignment and formatting, such as tables and fonts.
What would be an ideal way to go about it?
I'd suggest using a reporting tool like JasperReports.
JasperReports is entirely written in Java and it is able to use data coming from any kind of data source and produce pixel-perfect documents that can be viewed, printed or exported in a variety of document formats including HTML, PDF, Excel, OpenOffice and Word.
Have a look at other open source projects (PDF APIs):
Apache PDFBox
Apache Tika (a toolkit for detecting and extracting metadata and structured text content from various documents, using the POI and PDFBox parser libraries)
PDFjet
Use iText:
http://itextpdf.com/
I was looking at using iText to create both a PDF and an HTML version of a document, with RTF as a possible option. According to this question, this is no longer possible with iText. Is there a library that will allow me to create a document in Java and output it as both PDF and HTML? The ability to output RTF would be nice but is not required.
As that answer to the other question states, you can just use the iText RTF Library.
I have used PD4ML to convert HTML to PDF. Even though it is a commercial product, it is very reliable and supports CSS well.
JasperReports. If you look at this package, it supports export to:
pdf
html
rtf
xls
xml
You have two options to create the documents:
via iReport - a visual designer for reports
via an API, where you construct everything with Java code.
Note that even though JasperReports's main function is to create reports, it can just as well create other documents, for example ones with no tabular data.
You could also try Docmosis, since it supports the output formats provided by OpenOffice (including the ones you specified), and you can often do the job with a lot less code.