How can I add text to a PDF document so that it is not visible?
The document manipulation should be done in Java. The use case is to add further metadata (in a proprietary format, about 40 KB) to a document before the document is signed and archived.
I tried:
annotation field with size 0,0
.txt file attachment
but this annoys readers of the PDF, because they see a difference (a comment or attachment bar).
Is there a comment object or a syntax to comment out lines in a PDF document?
EDIT:
I've tried adding text between PDF objects. This works, but the problem is that Acrobat Reader asks to resave the file when the window is closed.
Adding the text after %%EOF is not a solution, because the signature then does not cover the metadata, and I need it to be covered.
The proper way to add metadata to a PDF would be through XMP. It allows you to add arbitrary metadata and allows defining the metadata types inside of the same PDF file (which you really should do if you're archiving and which is a requirement in archival standards such as PDF/A).
XMP data can be extracted by readers that don't understand the PDF format, using a simple text-scanning algorithm. At the same time it sits inside the document, so it is protected by the digital signature you apply.
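To illustrate the "simple text scanning" idea, here is a minimal sketch in plain Java that looks for the XMP packet wrapper (`<?xpacket begin=` ... `<?xpacket end=`) in the raw bytes of a file. The class and method names are made up for this example. It assumes the metadata stream is stored uncompressed and the packet is encoded as UTF-8; it returns only the first packet it finds.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Optional;

// Hypothetical helper, not part of any PDF library.
final class XmpScanner {
    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    /** Returns the first XMP packet found in the raw file bytes, if any. */
    static Optional<String> findFirstPacket(byte[] fileBytes) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes are also byte offsets.
        String raw = new String(fileBytes, StandardCharsets.ISO_8859_1);
        int start = raw.indexOf(BEGIN);
        if (start < 0) {
            return Optional.empty();
        }
        int endMarker = raw.indexOf(END, start);
        if (endMarker < 0) {
            return Optional.empty();
        }
        // The trailing processing instruction ends with "?>".
        int close = raw.indexOf("?>", endMarker);
        if (close < 0) {
            return Optional.empty();
        }
        byte[] packet = Arrays.copyOfRange(fileBytes, start, close + 2);
        // Assumption: the packet itself is UTF-8 encoded.
        return Optional.of(new String(packet, StandardCharsets.UTF_8));
    }
}
```

It could be fed with the result of `Files.readAllBytes(...)` on the archived file. Writing the XMP into the PDF in the first place is a job for the PDF library you use and is not shown here.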
You can read more about it here: http://www.adobe.com/products/xmp/
I have seen PDFs that had a bunch of metadata in the footer, set in white on a white background, so you normally wouldn't notice it when looking at the PDF. But that's quite nasty.
Related
I have to create a PDF using iText that contains a button which, when clicked, should add a row to an existing PdfPTable. I wrote some code to create a PushbuttonField. When trying to set an action, the only thing I can find is PdfAction.javaScript, and I can't figure out how to use it to add a row to a table. Searching online didn't turn up anything else.
Any help would be greatly appreciated. Thank you.
When you create a PDF file, you draw text, lines and shapes to a canvas. That is also what happens when you add a PdfPTable to a Document. If you look at the syntax of the PDF page, you won't recognize a table. You'll find text (the content of the cells), lines (the borders), and shapes (the backgrounds), but you won't find a table. If the table is distributed over different pages, the "table" on one page won't know that it is related to the "table" on the other page.
Sure, you can add semantic structure to the document by introducing marked content, and by creating a structure tree, but that mechanism which we call Tagged PDF can't be used to make the PDF "editable" the same way a Word document is editable. Tagged PDF is (among others) used to allow assistive technology to present the content to the visually impaired (e.g. in the context of PDF/UA). The presence of structure doesn't change the fact that all text, all lines, and all shapes are added at absolute positions.
This is very different from HTML where the position on a page of a <table>, <tr>, <th>, or <td> is calculated at the moment the page is rendered. In HTML this position can even change when you resize the browser window.
There is no such thing in PDF (except if you use XFA (*), a technology that has been deprecated since ISO 32000-2). All content on a page has a fixed position, hardcoded into the page's content stream. Changing the size of the PDF viewer window won't change the position of the page content.
Because of all of this, your question is invalid. It is impossible to create a button in PDF that adds a row to a table, because:
In many cases there is no table: there is just a bunch of text, lines, and shapes at absolute positions,
Even if there is the notion of a table (using Tagged PDF): the visual representation of that table is fixed at creation time, it can't be changed at consumption time.
You want to use an ordinary PDF viewer as if it were a PDF editor. That is impossible for all the reasons listed above.
(*) XFA was deprecated for different reasons. One of the most important reasons is the lack of support for XFA: there aren't many viewers that support it. If you were to post a follow-up question asking "How can I create an XFA document?", the answer would be: "Don't do this!" Creating XFA is extremely complex, and once you've succeeded in creating an XFA form, you'll discover that many of your customers won't be able to consume the file because their viewer doesn't support the format.
PDFTextStripper stripper = new PDFTextStripper();
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);
Extracted text: http://pastebin.com/BXFfMy0z
Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
What can I do to extract correct text from this pdf file?
In addition to @karthik27's answer:
Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction from a given document is possible at all.
Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.
In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just as you got with PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
I think the problem is the encoding: the PDF text is encoded in a different format. If you right-click on the document and click on document properties, you can find the encoding. I think the links below will give you more explanation:
link1
link2
The original file should contain a mapping from its font codes to Unicode (a ToUnicode CMap). This part is absent, which is why you get broken text after extraction.
Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I have seen classes like Paragraph and Chunk in code that creates a new PDF file, but I cannot find any way to get at these classes when reading a file. Any idea is appreciated.
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.
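For untagged PDFs, the kind of heuristic mentioned above could start out as simple as the following sketch. It is plain Java and does not use iText: it assumes you have already extracted the text lines of one page together with the y-coordinate of their baselines (the `Line` type and the `maxGap` threshold are made up for this example), and it starts a new paragraph whenever the vertical gap between two consecutive lines exceeds the threshold. Real documents will need more than this (indentation, font changes, columns).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: guesses paragraphs from line positions.
final class ParagraphGuesser {

    /** One extracted line of text and the y-coordinate of its baseline
     *  (PDF user space, so larger values are higher on the page). */
    static final class Line {
        final String text;
        final float baselineY;

        Line(String text, float baselineY) {
            this.text = text;
            this.baselineY = baselineY;
        }
    }

    /** Groups lines (ordered top to bottom) into paragraphs. A vertical
     *  gap larger than maxGap between two lines starts a new paragraph. */
    static List<String> group(List<Line> linesTopToBottom, float maxGap) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : linesTopToBottom) {
            boolean startsNewParagraph = previous != null
                    && (previous.baselineY - line.baselineY) > maxGap;
            if (startsNewParagraph) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

With a normal line spacing of about 12 units, a threshold of 15 would treat any larger jump as a paragraph break; the number of paragraphs is then simply the size of the returned list.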
I have to edit an existing PDF file using iText in Java. The existing PDF contains lots of pages. Given a page number, I have to replace the footer of that page with new text and output only that page, with its edited footer and the rest of its content. The remaining pages should not be output. Also, the existing PDF is in A6 format and the output PDF has to be in A4 format. How is this possible?
You can split and merge PDF files using iText. That means you need to split your original document into three parts and keep only the middle (required) part. You can also delete and add objects, so you can find the footer object, delete it, and add a new object in its place. I do not think you will be able to change the format, unless you create a brand new document in the target format and copy the objects from the source into the new document. Worth trying.
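If you try the "new document in the target format" route, the A6 content has to be scaled and positioned on the A4 page. The arithmetic is independent of the PDF library; the sketch below (plain Java, class and method names made up for this example) computes a uniform scale factor and the offsets that centre the scaled page, using the ISO sizes A6 = 105 x 148 mm and A4 = 210 x 297 mm and 72 points per inch.

```java
// Hypothetical helper: fits one page size into another.
final class PageFit {
    /** PDF user space uses 72 units per inch; one inch is 25.4 mm. */
    static final double POINTS_PER_MM = 72.0 / 25.4;

    /**
     * Returns {scale, offsetX, offsetY}; the offsets are in points and
     * centre the uniformly scaled source page on the target page.
     */
    static double[] fit(double srcWidthMm, double srcHeightMm,
                        double dstWidthMm, double dstHeightMm) {
        double srcW = srcWidthMm * POINTS_PER_MM;
        double srcH = srcHeightMm * POINTS_PER_MM;
        double dstW = dstWidthMm * POINTS_PER_MM;
        double dstH = dstHeightMm * POINTS_PER_MM;
        // Use the smaller ratio so the whole page fits without distortion.
        double scale = Math.min(dstW / srcW, dstH / srcH);
        double offsetX = (dstW - srcW * scale) / 2;
        double offsetY = (dstH - srcH * scale) / 2;
        return new double[] {scale, offsetX, offsetY};
    }
}
```

For A6 to A4, `PageFit.fit(105, 148, 210, 297)` gives a scale of 2 with no horizontal offset and a vertical offset of about 1.4 points. How the scaled page is actually placed depends on the library and version you use.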
I am working on a billing program - right now when you click the appropriate button it generates a frame that shows the various charges etc, basically an invoice. Is there a way to give the user an option of saving that frame as a document, either Microsoft Word, Microsoft Works or PDF?
One approach would be to save the frame as an image. You can render it into an image with the following code:
// Render the component into an off-screen image
Dimension size = myComponent.getSize();
BufferedImage myImage = new BufferedImage(size.width, size.height,
        BufferedImage.TYPE_INT_RGB);
Graphics2D g2 = myImage.createGraphics();
myComponent.paint(g2);
g2.dispose(); // release the graphics context
You can then save this image and pass it into a Jasper report. From the JasperPrint object you can then save in a few different formats, including PDF. A better but similar approach would be to pass the Graphics context into JasperReports (there is a renderer to do this in Jasper, and the quality is much better).
Paint the JFrame into a BufferedImage, using the paint() method of JFrame
Save the image as jpg or png or whatever image format
Take some pdf library and create a blank pdf (e.g. iText)
Insert the image into the PDF document
Save it - done
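The first two steps above can be sketched with the JDK alone. The class below is made up for this example: it renders any Swing component at its preferred size into a BufferedImage and writes it as a PNG with ImageIO. Creating the PDF and inserting the image (steps three to five) depend on the PDF library you choose and are not shown.

```java
import java.awt.Color;
import java.awt.Dimension;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;
import javax.swing.JComponent;

// Hypothetical helper: turns a Swing component into a PNG file.
final class ComponentSnapshot {

    /** Paints the component at its preferred size into a new image. */
    static BufferedImage render(JComponent component) {
        Dimension size = component.getPreferredSize();
        component.setSize(size);
        component.doLayout();
        BufferedImage image = new BufferedImage(
                size.width, size.height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g2 = image.createGraphics();
        try {
            // White background in case the component is not opaque.
            g2.setColor(Color.WHITE);
            g2.fillRect(0, 0, size.width, size.height);
            component.paint(g2);
        } finally {
            g2.dispose();
        }
        return image;
    }

    /** Writes the image to the given file in PNG format. */
    static void savePng(BufferedImage image, File target) {
        try {
            if (!ImageIO.write(image, "png", target)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a frame that is already on screen you would typically pass its content pane (cast to JComponent). PNG is usually a better choice than JPEG here because it keeps text and lines sharp.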
Instead of generating a Word document, I'd rather use a Java library like iText to produce a PDF document (more portable) or, even better, the JasperReports reporting library, which can output reports in a wide range of formats (PDF, XML, HTML, CSV, XLS, RTF, TXT), as suggested by bigbrother82 in a comment. This looks cleaner to me than using an image, especially for printing (not to mention that your invoice may be a multi-page document).
I'd likely look at this from a slightly different direction: instead of asking how to splat the GUI form as-is into a PDF or Word document, I'd ask how to get that content into a Word/PDF document.
The answer to that question is Apache FOP. Generate an XSL-FO file and ask FOP to convert it into an RTF document (with a .DOC extension) or a PDF.
Normally one does this by generating an XML file containing the data you need printed, and then using an XSLT to convert that XML to XSL-FO. However, I found it easier to generate an XSL-FO file directly using a templating language (such as Freemarker).
You might want to look at the online demo for Docmosis as an example; it gives the user the option of choosing the document format up front. That demo does a download, but it could direct the document into a frame instead and leave it to the browser to display. This style of working (as mentioned by others) looks at the problem from a different angle: you decide on the format up front, rather than after the fact and then trying to save the frame contents.
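As a rough illustration of the XML-to-XSL-FO step, the sketch below uses only the XSLT support that ships with the JDK (javax.xml.transform). The invoice XML layout (`invoice`, `charge`, `label`, `amount`) and the class name are invented for this example; the resulting XSL-FO string is what you would then hand to FOP, which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: converts invoice XML into an XSL-FO document.
final class InvoiceToFo {

    // Minimal stylesheet: one A4 page, one fo:block per charge.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
        + "<xsl:template match=\"/invoice\">"
        + "<fo:root><fo:layout-master-set>"
        + "<fo:simple-page-master master-name=\"A4\" page-height=\"297mm\""
        + " page-width=\"210mm\" margin=\"20mm\"><fo:region-body/>"
        + "</fo:simple-page-master></fo:layout-master-set>"
        + "<fo:page-sequence master-reference=\"A4\">"
        + "<fo:flow flow-name=\"xsl-region-body\">"
        + "<xsl:for-each select=\"charge\">"
        + "<fo:block><xsl:value-of select=\"@label\"/>: "
        + "<xsl:value-of select=\"@amount\"/></fo:block>"
        + "</xsl:for-each>"
        + "</fo:flow></fo:page-sequence></fo:root>"
        + "</xsl:template></xsl:stylesheet>";

    /** Applies the stylesheet to the invoice XML and returns XSL-FO. */
    static String toFo(String invoiceXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(invoiceXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

In a real application the stylesheet would live in its own .xsl file rather than in a string constant, and the XML would be generated from the same data that fills the invoice frame.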