I can read textboxes with anchors directly from document and table streams as mentioned in microsoft office format specifications.
But I am not getting idea about reading inline textboxes.
Please suggest an idea..
while reading paragraph with textbox I am getting a field character at the beginning of textbox. Please provide any code if you already have it.
Related
community!
My project is simple: I have a link to a website that has multiple information on different chemical substances and I want to extract some data and put in into pdf. Thing is that I want to keep the formatting of the original HTML (using it's css, of course).
Example of substance: http://www.molbase.com/en/msds_1659-31-0-moldata-2.html#tabs
I used jsoup to read the HTML of the table on the bottom of the page, the MSDS one, containing multiple sections with different information about the substance, but I really don't know how to save the exact HTML format into my pdf file. I have tried with iText too, but it gives me "missing ending tag" error, and if it worked, it would print the full page, not only that msds table.
Here is what I have tried to do, but ain't effective:
Document docu = Jsoup.connect(urlbun).get();
Element tableHeader = docu.select("div[class=\"msds\"]")
.first();
String[] finSyn = tableHeader.text().split(" ");
String moreText =" ";
I tried to split the text that the webpage has under that div ("class = "msds"") but I cannot find a way to split it the good way.
Please, could you please give me a hint on what to do? Even if the formating is not the same, I would like to be able to display the information in the same way, with indentation and such.
Thank you!
You can put the content that you want to convert to PDF inside a CSS ID (such as a DIV) and then use the PDFmyURL API to convert only that section to PDF.
Please refer to this on our website about how to select pieces from a page to convert to PDF
Disclosure: I work for the company that owns this site
PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);
Extracted text: http://pastebin.com/BXFfMy0z
Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
What can I do to extract correct text from this pdf file?
In addition to #karthik27's answer:
Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction from a given document is possible at all.
Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.
In case of your document I do get a semi-random collection of invisible and special characters copying and pasting from Adobe Reader like you did with PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
I think the problem is encoding.. The pdf text is encoded in different format.. if you right click on the document and click on document properties.. you can find the encoding. I think the below links will give you more explanation
link1
link2
The original file should contain mapping to Unicode. This part is absent, thus you have got broken text after extraction.
How can I add text to a pdf document, which is not visible?
The document manipulation should be done in java. The usecase is to add further metadata to a document (in a proprietary format, about 40kb), before the document is signed and archived.
I tried:
annotation field with size 0,0
.txt file attachment
but, this annoys readers of the PDF, because they see a difference (comment / attachment bar).
Is there a comment object or a syntax to comment out lines in a PDF document?
EDIT:
I've tried adding text between PDF objects. This works, the problem is: acrobat reader asks to resave the file when closing window.
Adding the text after %EOF is not a solution, because signing is not applied to the metadata, which is a needed feature.
The proper way to add metadata to a PDF would be through XMP. It allows you to add arbitrary metadata and allows defining the metadata types inside of the same PDF file (which you really should do if you're archiving and which is a requirement in archival standards such as PDF/A).
XMP data can be extracted by readers who don't understand the PDF format using a simple text scanning algorithm yet at the same time it will be inside of the document so will be protected by the digital signature you apply.
You can read more about it here: http://www.adobe.com/products/xmp/
I have seen PDF's who had a bunch of metadata in the footer, just in color white while the background was also white, so normally you wouldn't recognize it when you're looking at the PDF. But that's quite nasty..
How to convert the pdf into the word doc file?
The pdf file was generated by JasperReports and which has one table in which one column contains text with html body part like <p><b>test</b></p>
So I just want to convert this pdf file in doc with proper formating like text display in bold format.
Much of the format information is removed in converting a file into a PDF so you can not just convert it back unless the PDF was created as Marked content with additional meta tags in it.
I wrote a blog article explaining about PDF text at http://www.jpedal.org/PDFblog/2009/04/pdf-text/
Pro grammatically you can do it with Apachi POI. You can first read the PDF and then write it to a Word Doc using the API.
I am using iText library for writing a PDF file.
I want to give page numbers and page header on every page of file
How can I do that?
If you don't break lines manually then it's really hard almost impossible to precisely get the line count. This number depends on font measurement and the text layout on a page. iText is a great tool for PDF generation not for parsing.
I suggest you buy the iText book "iText in Action" which is very helpful. If you don't generate your page breaks yourself, you could checkout the PdfPageEvent methods such as onStartPage() and the getPageNumber() method.