I know about PDFBox and iText, but as far as I can tell they don't offer visual content extraction, and they need the PDF to be available offline. I want a way to extract text and visual content online, without first downloading the PDF file and then processing it. What API or library is there for Java?
EDIT: for those who find this unclear, here is some more explanation:
With any HTML parser you can parse a page online, build the DOM tree (or handle SAX events), walk through the elements, and extract images and text based on the content of the nodes. For images you at least get the corresponding HTML tags; for text you get the tags plus the actual text. Is there anything similar for PDFs, that is, a way to go through text and images without downloading the file?
Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not held in a DOM as in an HTML or XML document. Text fragments are simply placed at various x-y coordinates, and only the rendered result looks well-formatted. However, PDFOne has some PDF text extraction methods that reconstruct those text elements into user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
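To illustrate what "reconstructing" positioned text involves, here is a minimal sketch in plain Java. The `Fragment` type and the `toLines` method are hypothetical names made up for this example; they are not part of PDFOne, PDFBox, or iText, and real libraries use far more elaborate heuristics. The sketch assumes y grows downward from the top of the page and that fragments on the same line have nearly equal y values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not a real library API.
final class TextLineGrouper {

    /** A piece of text plus the page coordinates where it starts. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines. A fragment joins the current line when its
     * y value is within yTolerance of that line's first fragment; each line
     * is then read left to right.
     */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Fragments arrive in arbitrary order, as they may in a PDF content stream.
        List<Fragment> page = List.of(
                new Fragment("world", 60, 100.4),
                new Fragment("Second", 10, 120),
                new Fragment("Hello", 10, 100),
                new Fragment("line", 70, 120));
        toLines(page, 2.0).forEach(System.out::println);
    }
}
```

The tolerance is needed because fragments on one visual line rarely share exactly the same y value; choosing it is a judgment call that depends on font size and line spacing.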
PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.
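On the "without downloading" part: the PDF's bytes have to be transferred either way, but they do not have to be saved to a file first. Below is a minimal sketch, using only the Java standard library, that reads a PDF from a URL into memory and checks for the `%PDF-` marker that normally opens a PDF file. The class and method names (`InMemoryPdf`, `readAll`, `looksLikePdf`) are made up for this example, and the URL is a placeholder taken from the command line.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class InMemoryPdf {

    /** Reads the whole stream into memory; nothing is written to disk. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /** Returns true if the data starts with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a placeholder for the address of the PDF.
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            byte[] pdf = readAll(in);
            System.out.println(pdf.length + " bytes, PDF header found: " + looksLikePdf(pdf));
            // The byte array can now be handed to a parsing library. PDFBox 2.x,
            // for instance, accepts PDDocument.load(byte[]) or
            // PDDocument.load(InputStream), so a local file is not required.
        }
    }
}
```

Holding the whole file in memory is fine for typical documents; for very large PDFs, passing the stream directly to the library may be the better choice, if the library supports it.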
My requirement is just to show the DVT hierarchyView in HTML format in a PDF; I don't want to embed the Flash content into the PDF.
Also, the links above seem to suggest using FLEX, which we are not comfortable with.
Do you have any other pointers, please?
I have a PDF file that was produced with iText and created with JasperReports (I don't know if it's relevant) and I was wondering if I can find some API or anything to see the structure because I need to extract text from it.
I tried with iText, PDFBox and other Java libraries but I only get text line by line and that's not what I need.
I also tried converting it to HTML, XML, and DOM, but I get the same result as with text extraction: no structure is parsed.
If I open it as DOCX, I see that Word recognizes some kind of structure. For example, an area that looks like a table in the PDF is an actual table after conversion to DOCX.
I need to understand how the PDF was created, if this is possible. I know that working with PDFs is not easy, but I need to start with something useful. Thanks!
PDFTron PDFGenie can do full semantic table and paragraph extraction from a PDF file. It can generate a reflowable HTML file containing all the appropriate HTML tags for tables and paragraphs.
See this blog for more details.
https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/#a-idpart7aevaluating-accuracy-of-pdf-table-recognition
You can download the PDFGenie command-line tool for Windows/macOS/Linux here:
https://www.pdftron.com/downloads/linux
One more option: you can also extract text with Aspose.PDF. If you are interested, see the link below:
https://blog.aspose.com/2018/02/28/extract-text-by-paragraphs-and-convert-files-to-pdf-with-aspose.pdf/
Sample pdf
A sample PDF is shown in the image. We need to create a two-column structure that can contain text, images, figures, etc. We also need to change the text format (font, size, automatic wrapping, and so on). The text content is dynamic: we don't know it at compile time, so the layout should align itself after each paragraph ends, and we should not have to provide hard-coded heights. Giving absolute positions for components in Qoppa is not feasible for us because of the dynamic content.
We have already explored the Qoppa library and couldn't figure out how to make a PDF like the one shown in the image. If anyone has worked with Qoppa, please share any useful resources available online. And please let me know whether it is possible to create a PDF like this with Qoppa.
I know this may sound silly to some of you experienced guys out there, but it's really important for me and my group at school. We need to create software that lets the user create a new RTF document from scratch (like an editor where you can center text, change font size and style, save, and insert pictures). It also needs to be able to read a .docx document, including images and formatting, and save it as an RTF document.
So far we can open the .docx document, extract the text without formatting, and write it out to an RTF document. In other words, using the docx4j library we have been able to convert the text of a .docx document to .rtf: no pictures, no formatting, just plain text surrounded by [ ].
We made some progress today, but we can't figure out the next steps. Considering that the delivery date is in 72 hours, I thought it would be a good idea to ask for help from people more experienced than us.
Please leave your answers or ask for more info about the project; we'll be glad to learn from you guys.
To convert a .docx to .rtf use a library like https://code.google.com/p/jodconverter/. It will do all the heavy lifting for you.
Anyway, now about your editor itself. If I had to do it as fast as I could, I would use JavaFX to make my interface. There is a control called "Rich Text Editor" (http://docs.oracle.com/javafx/2/ui_controls/editor.htm) which you can just put into your application.
The trick here is that you can extract the editor's content as HTML using getHtmlText(), and then you can convert the HTML to RTF using... yes, a library. I suspect that jodconverter can do this too, but if not, you can look at this question: Convert HTML to RTF in java?.
This should give you a better idea of how to do your project. There are Java libraries to handle conversion between HTML and RTF, so you can use an HTML editor (provided by JavaFX). And of course, a .docx can be converted to HTML too. Let libraries do all the dirty work :).
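A library remains the practical route, but it can help when debugging to know what the RTF output looks like underneath. The following is a minimal sketch, not a replacement for a converter: it writes a tiny RTF document by hand with bold, centered, and sized paragraphs. The class `MiniRtf` and its methods are hypothetical names for this example; the control words used (`\rtf1`, `\pard`, `\qc`, `\fs`, `\b`, `\par`, `\u`) are standard RTF.

```java
// Hypothetical helper for illustration only; covers a tiny subset of RTF.
final class MiniRtf {

    /** Escapes plain text so it can be embedded in an RTF document. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '{' || c == '}') {
                // Backslash and braces are RTF syntax and must be escaped.
                sb.append('\\').append(c);
            } else if (c == '\n') {
                sb.append("\\line ");
            } else if (c > 127) {
                // \\uN takes a signed 16-bit decimal value; '?' is the
                // fallback shown by readers that cannot handle Unicode.
                sb.append("\\u").append((int) (short) c).append('?');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one paragraph with optional bold and centering. */
    static String paragraph(String text, boolean bold, boolean centered, int pointSize) {
        StringBuilder sb = new StringBuilder("\\pard");
        if (centered) {
            sb.append("\\qc");
        }
        sb.append("\\fs").append(pointSize * 2); // \fs is measured in half-points
        if (bold) {
            sb.append("\\b");
        }
        sb.append(' ').append(escape(text));
        if (bold) {
            sb.append("\\b0");
        }
        return sb.append("\\par\n").toString();
    }

    /** Wraps paragraphs in an RTF header with a one-entry font table. */
    static String document(String... paragraphs) {
        StringBuilder sb = new StringBuilder(
                "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n");
        for (String p : paragraphs) {
            sb.append(p);
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        String rtf = document(
                paragraph("Project report", true, true, 14),
                paragraph("Body text with {braces} and a backslash \\.", false, false, 12));
        System.out.println(rtf); // save this output as a .rtf file to open it
    }
}
```

Images, tables, and styles are where hand-written RTF becomes painful, which is the point of letting a conversion library handle them.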
While googling, I found charting software such as amCharts and FusionCharts, and PDF creation software such as iText, but each does only one job: the former create graphs and the latter writes PDFs. amCharts has an export-to-PDF option, but it exports only the graph, not the HTML content. My web application has both HTML content (tables, text in tags) and graphs, and I have to write both to a single PDF file using JavaScript or Java. Is there any way to do this? I need to generate a graph from the data in a table and write both the table and the generated graph to the PDF. Please help me.
I don't think that converting HTML to PDF is a simple task, because the HTML and PDF structures are very different. You need an engine (or library, if you prefer) to render/print the HTML file into a PDF. The HTML file can then contain anything (text, images, elements specially laid out to look like a graph, etc.), and a good enough engine can render it.
For HTML to PDF Java library you can see this thread A java library for converting xml/html to pdf
For HTML to PDF JavaScript library you can see this thread https://stackoverflow.com/questions/20029132/javascript-pdf-generator-library
PS: Some versions of amCharts work with Flash only, and that output cannot be rendered into a PDF by a library. I recommend using something else, for example d3.js; it is an awesome JavaScript library for data visualization (such as interactive charts).