I have a PDF file that was created with JasperReports and produced via iText (I don't know whether that's relevant), and I was wondering whether there is an API or anything else that lets me see its structure, because I need to extract text from it.
I tried iText, PDFBox, and other Java libraries, but I only get the text line by line, and that's not what I need.
I also tried converting to HTML, XML, and DOM, but I get the same result as plain text extraction: no structure is parsed.
If I open it in Word as DOCX, I see that Word recognizes some sort of structure; for example, an area that looks like a table in the PDF actually becomes a table after conversion to DOCX.
I need to understand how the PDF was created, if that's possible. I know that working with PDFs is not easy, but I need to start with something useful. Thanks!
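For reference, this is roughly the kind of extraction I have been doing, a minimal sketch assuming PDFBox 2.x (the file name is a placeholder); it gives me the text line by line but no table or paragraph structure:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PlainTextDump {
        public static void main(String[] args) throws Exception {
            // Load the PDF and dump its text; the stripper only reconstructs
            // reading order, so tables come out as flat lines.
            try (PDDocument doc = PDDocument.load(new File("report.pdf"))) {
                System.out.println(new PDFTextStripper().getText(doc));
            }
        }
    }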
PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing all the appropriate HTML tags for tables and paragraphs.
See this blog for more details.
https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition
You can download the PDFGenie command-line tool for Windows/macOS/Linux here:
https://www.pdftron.com/downloads/linux
One more option: you can also extract with Aspose.PDF. If you want, look into the link below:
https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/
I know there are already PDFBox and iText, but they don't have the ability to do visual content extraction, and they need to work with the PDF offline. However, I want a way to do some text and visual content extraction online; I don't want to download the PDF file and then do stuff. What kind of API or library is there for Java?
EDIT: for those who find this unclear, let me explain some more:
Just imagine that with any HTML parser you can parse a page online, build the DOM or SAX tree, walk through its elements, and then extract photos and text based on the content of the nodes in those trees. At the very least, for photos you can get their corresponding HTML tags, and for text the same, plus you can get the actual text. Now I want to know: is there anything similar for PDFs, going through text and images without downloading?
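To make it concrete, here is a minimal sketch of what I mean by working with the PDF without downloading it first, assuming PDFBox 2.x and a placeholder URL; it streams the text, but the visual/structural extraction is the part I am still missing:

    import java.io.InputStream;
    import java.net.URL;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class RemotePdfText {
        public static void main(String[] args) throws Exception {
            // Stream the PDF straight from the server; nothing is saved to disk.
            URL url = new URL("https://example.com/some.pdf"); // placeholder URL
            try (InputStream in = url.openStream();
                 PDDocument doc = PDDocument.load(in)) {
                System.out.println(new PDFTextStripper().getText(doc));
            }
        }
    }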
Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not in a DOM like an HTML or XML document; text just appears at various x-y coordinates and magically looks well-formatted. However, PDFOne has some PDF text extraction methods that reconstruct those text elements into user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.
I want to convert an HTML page into MS Word. I want to know which APIs would be helpful, and whether there is any other option to do the same.
The entire page is to be converted into .doc (e.g. if there is a table in the HTML page, a similar table must be created in the Word doc).
Apache POI does not provide an option to format the Word document the way it is formatted in the HTML page.
I need something that can give me a completely formatted Word document.
Some of the things I am looking at are jsoup, docx4j, JasperReports, and JODConverter.
I tried parsing the HTML page using jsoup, and I get the contents of the page in my Java program. Now I need to pass these contents to a doc/docx file. Can docx4j be helpful to get a formatted docx file?
Please help.
Thank you.
I would go with Ashwini Raman's suggestion. It won't work in every scenario; in the case of a complex HTML document with many images and the like, Word will not do a good job, but for most cases it should be fine. Otherwise, there is a complex task ahead of you: you will have to parse your HTML document, using the jsoup library for example, and then use the docx4j library to create your Word document.
Links to both are here:
http://www.docx4java.org/trac/docx4j
http://jsoup.org/
Even when you do it that way, the formatting might be iffy.
To answer your original question: no, there is no ready-made library that does what you are expecting. At least, I haven't come across any.
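As a rough illustration of the jsoup + docx4j route, here is a minimal sketch; it assumes the docx4j-ImportXHTML add-on is on the classpath, and the file names are placeholders:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
    import org.jsoup.Jsoup;

    public class HtmlToDocx {
        public static void main(String[] args) throws Exception {
            // Normalise the HTML into well-formed XHTML with jsoup.
            org.jsoup.nodes.Document html =
                    Jsoup.parse(new File("page.html"), StandardCharsets.UTF_8.name());
            html.outputSettings().syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
            String xhtml = html.html();

            // Create an empty .docx and import the XHTML content, tables included.
            WordprocessingMLPackage pkg = WordprocessingMLPackage.createPackage();
            XHTMLImporterImpl importer = new XHTMLImporterImpl(pkg);
            pkg.getMainDocumentPart().getContent().addAll(importer.convert(xhtml, null));
            pkg.save(new File("page.docx"));
        }
    }

The importer needs well-formed XHTML, which is why the jsoup output is switched to XML syntax first; how faithfully tables and styling survive still depends on the source page.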
I found a way around this. First I get the parsed objects using jsoup and pass them to a document template. I am now looking for options that make it easy to create templates and build the document dynamically.
I have asked another question regarding the same.
Which APIs in Java help in extracting table metadata from a PDF and presenting that table in a web page?
The result should be that when the page source is viewed, it shows the HTML code of that table.
iText is useful in this context:
http://itextpdf.com/
I assume that you need a PDF library for Java.
PDFBox is one of the popular libraries for PDF manipulation, and I think it is worth looking at.
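For example, here is a minimal sketch of pulling a known table region out of a page with PDFBox 2.x and wrapping it in HTML rows; the coordinates and file name are placeholders, and splitting each line into real columns is left to you:

    import java.awt.geom.Rectangle2D;
    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripperByArea;

    public class TableRegionToHtml {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("report.pdf"))) {
                PDPage firstPage = doc.getPage(0);

                // Restrict extraction to the rectangle where the table sits
                // (the coordinates are placeholders and depend on your layout).
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.addRegion("table", new Rectangle2D.Float(50, 100, 500, 300));
                stripper.extractRegions(firstPage);

                // Emit each extracted line as a one-cell HTML row; splitting a line
                // into real columns needs extra logic (fixed positions, delimiters, ...).
                StringBuilder html = new StringBuilder("<table>\n");
                for (String line : stripper.getTextForRegion("table").split("\\r?\\n")) {
                    html.append("  <tr><td>").append(line).append("</td></tr>\n");
                }
                html.append("</table>");
                System.out.println(html);
            }
        }
    }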
Try the Metadata Extract Tool, which extracts metadata from specific file types, including PDF. Then you can parse the XML output with any Java XML parser. Once you're able to parse it, the elements can easily be laid out in your view page.
Any clue about a good library to programmatically produce a PDF in Java, using a PDF as a template?
Try iText. It has many, many goodies. There is also a book about it from Manning, iText in Action.
If you want to be able to edit the text in the template, you should set up the template carefully and use form fields for text content. You can't easily replace text in PDFs because they do not contain text structure.
There is a blog article highlighting some of the issues at http://pdf.jpedal.org/java-pdf-blog/bid/17370/Problems-editing-PDF-files
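As a rough sketch of the form-field approach with iText 5 (the field and file names are placeholders):

    import java.io.FileOutputStream;
    import com.itextpdf.text.pdf.AcroFields;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;

    public class FillTemplate {
        public static void main(String[] args) throws Exception {
            // Open the template PDF and prepare a stamper that writes the filled copy.
            PdfReader reader = new PdfReader("template.pdf");
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("filled.pdf"));

            // Write text into the form fields defined in the template.
            AcroFields fields = stamper.getAcroFields();
            fields.setField("customerName", "Jane Doe"); // hypothetical field name

            // Flatten so the filled values become ordinary page content.
            stamper.setFormFlattening(true);
            stamper.close();
            reader.close();
        }
    }

Flattening at the end turns the filled fields back into ordinary page content, so the result looks like a normal, non-editable PDF.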
After cleaning a folder full of HTML files with Tidy, how can the tables' content be extracted for further processing?
I've used BeautifulSoup for such things in the past with great success.
Depends on what sort of processing you want to do. You can tell Tidy to generate XHTML, which is a type of XML, which means you can use all the usual XML tools like XSLT and XQuery on the results.
If you want to process them in Microsoft Excel, then you should be able to slice the table out of the HTML and put it in a file, then open that file in Excel: it will happily convert an HTML table into a spreadsheet page. You could then save it as CSV, as an Excel workbook, etc. (You can even use this on a web server -- return an HTML table but set the Content-Type header to application/vnd.ms-excel: Excel will open and import the table and turn it into a spreadsheet.)
If you want CSV to feed into a database, then you could go via Excel as before; or, if you want to automate the process, you could write a program that uses the XML-navigating API of your choice to iterate over the table rows and save them as CSV. Python's ElementTree and csv modules would make this pretty easy.
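If you would rather stay in Java, the same idea looks roughly like this with the JDK's built-in DOM parser (the file names are placeholders, and it assumes Tidy produced well-formed XHTML):

    import java.io.File;
    import java.io.PrintWriter;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class XhtmlTableToCsv {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Don't fetch the XHTML DTD over the network (works with the JDK's default parser).
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder().parse(new File("page.xhtml"));

            try (PrintWriter csv = new PrintWriter("table.csv")) {
                NodeList rows = doc.getElementsByTagName("tr");
                for (int r = 0; r < rows.getLength(); r++) {
                    NodeList cells = ((Element) rows.item(r)).getElementsByTagName("td");
                    StringBuilder line = new StringBuilder();
                    for (int c = 0; c < cells.getLength(); c++) {
                        if (c > 0) line.append(',');
                        // Quote each cell so embedded commas don't break the CSV.
                        String text = cells.item(c).getTextContent().trim().replace("\"", "\"\"");
                        line.append('"').append(text).append('"');
                    }
                    csv.println(line);
                }
            }
        }
    }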
After reviewing the suggestions, I wound up using HtmlUnit.
With HtmlUnit, I was able to write Java code that opens each HTML file in the folder, navigates to the TABLE tag, queries each column's content, and extracts the data I needed to create a CSV file.
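The core of it looked roughly like this, a minimal sketch assuming a recent HtmlUnit 2.x (the paths are placeholders):

    import java.io.File;
    import java.io.PrintWriter;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlTable;
    import com.gargoylesoftware.htmlunit.html.HtmlTableCell;
    import com.gargoylesoftware.htmlunit.html.HtmlTableRow;

    public class TablesToCsv {
        public static void main(String[] args) throws Exception {
            WebClient client = new WebClient();
            client.getOptions().setJavaScriptEnabled(false); // static local files, no JS needed
            try (PrintWriter csv = new PrintWriter("tables.csv")) {
                for (File file : new File("cleaned-html").listFiles((d, n) -> n.endsWith(".html"))) {
                    HtmlPage page = client.getPage(file.toURI().toURL());
                    HtmlTable table = page.getFirstByXPath("//table");
                    if (table == null) continue;
                    for (HtmlTableRow row : table.getRows()) {
                        StringBuilder line = new StringBuilder();
                        for (HtmlTableCell cell : row.getCells()) {
                            if (line.length() > 0) line.append(',');
                            line.append(cell.asText().trim()); // asNormalizedText() on newer HtmlUnit
                        }
                        csv.println(line);
                    }
                }
            }
            client.close();
        }
    }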
In .NET you could use HTMLAgilityPack.
See this previous question on StackOverflow for more information.
If you want to extract the content from the HTML markup, you should use some type of HTML parser. To that end, there are plenty out there; here are two that might suit your needs:
http://jtidy.sourceforge.net/
http://htmlparser.sourceforge.net/
Iterate through the text and use a regular expression :)