converting docx document to pdf with docx4j not the same


Good evening!
I convert a docx document to PDF programmatically (Java, docx4j).
The conversion produces a PDF, but it is not exactly the same as the docx document (please see the attached documents).
Comparing the two shows two differences: 1) the headlines in the PDF are no longer bold, and 2) more importantly, under number nine (§9) the line breaks between the numbers (1), (2), (3) are missing in the PDF, although they are present in the docx.
How can I produce an identical PDF from my docx file?
Thanks in advance
http://www.janolaw.de/export/LivingWillGeneratedByMe.pdf
http://www.janolaw.de/export/LivingWillorg.docx

Regarding "no new lines between the numbers (1),(2),(3)", it appears w:br is not being handled correctly.
I've created https://github.com/plutext/docx4j/issues/90 to track this.
Update: fixed in docx4j 3.0 beta 2.
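To confirm that the source document really contains the breaks that go missing in the PDF, one could count the w:br elements in its WordprocessingML. The sketch below uses only the JDK's XML parser, not docx4j; the class name and the sample fragment are hypothetical, and in a real .docx the XML would come from word/document.xml inside the zip archive.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of docx4j.
final class WordBreakCounter {
    // WordprocessingML main namespace, normally bound to the "w" prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts w:br elements in a WordprocessingML document or fragment. */
    static int countBreaks(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagNameNS(W_NS, "br").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up paragraph with two line breaks between three numbered items.
        String sample =
            "<w:p xmlns:w=\"" + W_NS + "\"><w:r>"
          + "<w:t>(1) first</w:t><w:br/>"
          + "<w:t>(2) second</w:t><w:br/>"
          + "<w:t>(3) third</w:t>"
          + "</w:r></w:p>";
        System.out.println(countBreaks(sample)); // prints 2
    }
}
```

Note that w:br is also used for page and column breaks, so a non-zero count only shows that break elements exist, not what kind they are.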

Related

PdfBox text extraction not working properly

PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);
Extracted text: http://pastebin.com/BXFfMy0z
Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
What can I do to extract correct text from this pdf file?

In addition to @karthik27's answer:
Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction from a given document is possible at all.
Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.
In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just as PDFBox gave you, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
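A rough automated version of the "does it look like garbage?" check can be sketched in plain Java. This is only a heuristic under stated assumptions: the class name, the choice of "suspicious" character categories, and the 0.2 threshold are all made up for illustration, and it cannot detect text that was mapped to the wrong but otherwise ordinary letters.

```java
// Hypothetical heuristic, not part of PDFBox.
final class ExtractionCheck {
    // Assumed threshold: more than 20% suspicious code points counts as garbage.
    static final double THRESHOLD = 0.2;

    /** True for code points that rarely appear in properly extracted text. */
    static boolean isSuspicious(int cp) {
        if (Character.isWhitespace(cp)) {
            return false; // tabs and line breaks are control chars but harmless
        }
        if (cp == 0xFFFD) {
            return true; // Unicode replacement character
        }
        switch (Character.getType(cp)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.PRIVATE_USE:
            case Character.SURROGATE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    /** Fraction of code points in the text that are suspicious (0.0 if empty). */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    static boolean looksLikeGarbage(String text) {
        return suspiciousRatio(text) > THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGarbage("Leiter Projektmanagement"));   // prints false
        System.out.println(looksLikeGarbage("\u0001\u0002\uE000\uE001ab")); // prints true
    }
}
```

The string returned by the text stripper could be passed to looksLikeGarbage to decide whether to fall back to OCR.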

I think the problem is the encoding. The PDF text is encoded in a different format. If you right-click on the document and open the document properties, you can find the encoding. I think the links below will give you more explanation:
link1
link2

The original file should contain a mapping to Unicode. That part is absent, so the text comes out broken after extraction.

How to hide text in a PDF file?

How can I add text to a PDF document so that it is not visible?
The document manipulation should be done in Java. The use case is to add further metadata to a document (in a proprietary format, about 40 KB) before the document is signed and archived.
I tried:
an annotation field with size 0,0
a .txt file attachment
But this annoys readers of the PDF, because they see a difference (comment / attachment bar).
Is there a comment object or a syntax to comment out lines in a PDF document?
EDIT:
I've tried adding text between PDF objects. This works, but Acrobat Reader then asks to resave the file when the window is closed.
Adding the text after %%EOF is not a solution, because the signature then does not cover the metadata, and that coverage is required.

The proper way to add metadata to a PDF would be through XMP. It allows you to add arbitrary metadata and allows defining the metadata types inside of the same PDF file (which you really should do if you're archiving and which is a requirement in archival standards such as PDF/A).
XMP data can be extracted by readers that don't understand the PDF format, using a simple text-scanning algorithm. At the same time it sits inside the document, so it is protected by the digital signature you apply.
You can read more about it here: http://www.adobe.com/products/xmp/
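The "simple text scanning" mentioned above relies on the XMP packet wrapper: the metadata is enclosed between an `<?xpacket begin=...?>` and an `<?xpacket end=...?>` processing instruction. A minimal sketch of such a scanner follows. It assumes the metadata stream is stored uncompressed and unencrypted and is UTF-8 encoded, and it returns only the first packet found; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical scanner; it knows nothing about PDF object structure.
final class XmpScanner {
    /** Returns the first XMP packet found in the raw file bytes, if any. */
    static Optional<String> extractXmpPacket(byte[] file) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes are also byte offsets.
        String raw = new String(file, StandardCharsets.ISO_8859_1);
        int start = raw.indexOf("<?xpacket begin=");
        if (start < 0) {
            return Optional.empty();
        }
        int trailer = raw.indexOf("<?xpacket end=", start);
        if (trailer < 0) {
            return Optional.empty();
        }
        int close = raw.indexOf("?>", trailer);
        if (close < 0) {
            return Optional.empty();
        }
        int end = close + 2; // include the closing "?>"
        // Assumption: the packet itself is UTF-8 encoded.
        return Optional.of(
            new String(file, start, end - start, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up file content: PDF-like bytes around a tiny XMP packet.
        String fake = "%PDF-1.4\nstream\n"
            + "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"></x:xmpmeta>"
            + "<?xpacket end=\"w\"?>\nendstream\n%%EOF\n";
        byte[] bytes = fake.getBytes(StandardCharsets.UTF_8);
        System.out.println(extractXmpPacket(bytes).orElse("no XMP packet found"));
    }
}
```

Because the scanner works on raw bytes, it can be pointed at any file format that embeds XMP this way, not only PDF.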

I have seen PDFs that had a bunch of metadata in the footer, set in white on a white background, so you normally wouldn't notice it when looking at the PDF. But that's quite nasty.

extract formatted text from pdf to html

I need to convert PDF documents into HTML, achieving the following:
1- Extract the text from the PDF.
2- Extract the images.
3- Retain the formatting, so that the converted HTML page looks the same as the PDF page.
4- Embed the images into the converted HTML page at the same places as in the PDF.
5- Apply a color scheme to the HTML page.
Any help would be appreciated.

To extract images from PDF
Image Extraction
To extract text from PDF
Text Extraction
All the other things you are asking about are possible with any web application setup.

Issue with pdf creation using itext

I have to edit an existing PDF file using iText in Java. The existing PDF contains many pages. Given a page number, I have to replace the footer of that page with new text and output only that page, with the edited footer and the rest of the page contents. The remaining pages must not be output. Also, the existing PDF is in A6 format and the output PDF has to be in A4 format. How can this be done?

You can split and merge PDF files using iText. That means you need to split your original document into three parts and keep only the middle (required) part. You can also delete and add objects, so you can find the footer object, delete it, and add a new object in its place. I do not think you would be able to change the format, unless you create a brand new document in the target format and copy the objects from the source into the new document. Worth trying.
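If the page is copied into a new A4 document as suggested, the content would need to be scaled up. The arithmetic can be sketched independently of any PDF library: ISO 216 defines A6 as 105 × 148 mm and A4 as 210 × 297 mm, and PDF user space uses points (1/72 inch). The class and method names below are hypothetical, and how the resulting scale and offsets are applied depends on the PDF library in use.

```java
// Hypothetical helper: pure arithmetic, no PDF library involved.
final class PageScale {
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /** Largest uniform scale at which a source page still fits the target page. */
    static double fitScale(double srcW, double srcH, double dstW, double dstH) {
        return Math.min(dstW / srcW, dstH / srcH);
    }

    /** Offset that centers a scaled source length inside a target length. */
    static double centerOffset(double srcLen, double dstLen, double scale) {
        return (dstLen - srcLen * scale) / 2.0;
    }

    public static void main(String[] args) {
        double a6w = mmToPoints(105), a6h = mmToPoints(148);
        double a4w = mmToPoints(210), a4h = mmToPoints(297);

        double scale = fitScale(a6w, a6h, a4w, a4h); // roughly 2, limited by the width
        double dx = centerOffset(a6w, a4w, scale);
        double dy = centerOffset(a6h, a4h, scale);

        System.out.printf("scale=%.4f dx=%.2fpt dy=%.2fpt%n", scale, dx, dy);
    }
}
```

A uniform scale keeps the aspect ratio; since A4 is slightly taller than twice A6, a small vertical margin remains, which the offset splits evenly between top and bottom.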

Convert pdf into word doc file

How can I convert a PDF into a Word doc file?
The PDF file was generated by JasperReports and has one table in which one column contains text with an HTML body part like <p><b>test</b></p>
So I just want to convert this PDF file into a doc with proper formatting, e.g. with the text displayed in bold.

Much of the format information is removed when a file is converted into a PDF, so you cannot simply convert it back unless the PDF was created as marked content with additional meta tags in it.
I wrote a blog article explaining about PDF text at http://www.jpedal.org/PDFblog/2009/04/pdf-text/

Programmatically, you can do it with Apache POI. You can first read the PDF and then write its content to a Word doc using the API.
