PDF Generation using iText7 with Java

PDF Generation using iText7 with Java - java

I am trying to add content to an existing PDF using iText7. I have been able to create new PDFs and add content to them using Paragraphs and Tables. However, once I go to reopen a PDF that I have created and attempt to write more content to it, the new content starts overwriting the old content. I want the new content to be appended to the Document after the old content. How can I achieve this?
Edit
This is the Class which sets up some common methods that will be executed with each change done to a PDF document.
public class PDFParent {
private static Document document;
private static PdfWriter writer;
private static PdfReader reader;
private static PageSize ps;
private static PdfDocument pdfDoc;
public static Document getDocument() {
return document;
}
public static void setDocument(Document document) {
PDFParent.document = document;
}
public static void setupPdf(byte[] inParamInPDFBinary){
writer = new PdfWriter(new ByteArrayOutputStream());
try {
reader = new PdfReader(new ByteArrayInputStream(inParamInPDFBinary));
} catch (IOException e) {
e.printStackTrace();
}
pdfDoc = new PdfDocument(reader, writer);
ps = PageSize.A4;
document = new Document(pdfDoc, ps);
}
public static byte[] writePdf(){
ByteArrayOutputStream stream = (ByteArrayOutputStream) writer.getOutputStream();
return stream.toByteArray();
}
public static void closePdf(){
pdfDoc.close();
}
And this is how I am adding the content to the pdf
public class ActAddParagraphToPDF extends PDFParent{
// output parameters
public static byte[] outParamOutPDFBinary;
public static ActAddParagraphToPDF mosAddParagraphToPDF(byte[] inParamInPDFBinary, String inParamParagraph) throws IOException {
ActAddParagraphToPDF result = new ActAddParagraphToPDF();
setupPdf(inParamInPDFBinary);
//---------------------begin content-------------------//
getDocument().add((Paragraph) new Paragraph(inParamParagraph));
//---------------------end content-------------------//
closePdf();
outParamOutPDFBinary = writePdf();
return result;
}
When I go to execute this second class, it appears to be treating the original document as if it is blank. Then writes the new Paragraph on top of the original content. I know that I am missing something, just not sure what that is.

Is reopening the document every time a requirement? If you keep the document open, you can append as much content as you want and you won't have to deal with content overlapping problems.
If it is a requirement, then you will have to track the last free content position yourself and reset it to new DocumentRenderer.
A Rectangle would be enough to store the free area that is left on the last page. Right before closing the document, save the free area in some Rectangle in the following way:
Rectangle savedBbox = document.getRenderer().getCurrentArea().getBBox();
After that, when you have to reopen the document, first jump to the last page:
document.add(new AreaBreak(AreaBreakType.LAST_PAGE));
And then reset the free occupied area left from the previous time you dealt with the document:
document.getRenderer().getCurrentArea().setBBox(savedBbox);
After that you are free to add new content to the document and it will appear at the saved position:
document.add(new Paragraph("Hello again"));
Please note that this approach works if you know which documents you are dealing with (i.e. you can associate last "free" position with the document's ID) and this document is not changed outside of your environment. If this is not the case, I recommend that you look into content extraction and in particular PdfDocumentContentParser. It can help you to extract the content you have on the page and determine which positions it occupies. Then you can calculate the free area on a page and use document.getRenderer().getCurrentArea().setBBox approach I described above to point DocumentRenderer to the correct place to write content to.

Related

apache pdfbox - how to test if a document is flattened?

I have written the following small Java main method. It takes in a (hardcoded for testing purposes!) PDF document I know contains active elements in the form and need to flatten it.
public static void main(String [] args) {
try {
// for testing
Tika tika = new Tika();
String filePath = "<path-to>/<pdf-document-with-active-elements>.pdf";
String fileName = filePath.substring(0, filePath.length() -4);
File file = new File(filePath);
if (tika.detect(file).equalsIgnoreCase("application/pdf")) {
PDDocument pdDocument = PDDocument.load(file);
PDAcroForm pdAcroForm = pdDocument.getDocumentCatalog().getAcroForm();
if (pdAcroForm != null) {
pdAcroForm.flatten();
pdAcroForm.refreshAppearances();
pdDocument.save(fileName + "-flattened.pdf");
}
pdDocument.close();
}
}
catch (Exception e) {
System.err.println("Exception: " + e.getLocalizedMessage());
}
}
What kind of test would assert the File(<path-to>/<pdf-document-with-active-elements>-flattened.pdf) generated by this code would, in fact, be flat?

What kind of test would assert that the file generated by this code would, in fact, be flat?
Load that document anew and check whether it has any form fields in its PDAcroForm (if there is a PDAcroForm at all).
If you want to be thorough, also iterate through the pages and assure that there are no Widget annotations associated to them anymore.
And to really be thorough, additionally determine the field positions and contents before flattening and apply text extraction at those positions to the flattened pdf. This verifies that the form has not merely been dropped but indeed flattened.

AcroForm comb TextField in Itextpdf7

Sorry for my English....
Using sluduyuschy code, I want to add a field in a PDF document that it had been a combination of several characters, and were filled, but it becomes a combination of just after I click on it from Adober reader or fill in it from the Adobe Reader again.
That see as enter image description here, only when i click that see good enter image description here
What am I doing wrong
public class TestClass {
public static void main(String[] args) throws IOException {
PdfWriter writer = new PdfWriter("D:\\TestPdf.pdf");
PdfDocument pdf = new PdfDocument(writer);
PdfAcroForm myAcro = PdfAcroForm.getAcroForm(pdf,true);
pdf.addNewPage();
PdfFormField text = PdfFormField.createText(pdf,new Rectangle(0f,800f,200f,50f),"textValue","")
.setFileSelect(false) //Meaningful only if the MaxLen entry is present in the text field dictionary and if the Multiline, Password, and FileSelect flags are clear. If true, the field is automatically divided into as many equally spaced positions, or combs, as the value of MaxLen, and the text is laid out into those combs.
.setMultiline(false)
.setPassword(false)
.setMaxLen(5)
.setComb(true);
text.setValue("What not good!");
text.regenerateField();
myAcro.addField(text);
pdf.close();
}

try to set myAcro.setNeedAppearances(true);

Page is cropped in new document in PDFBox while copying

I am trying to split the single PDF into multiple. Like 10 page document into 10 single page document.
PDDocument source = PDDocument.load(input_file);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(file);
output.close();
Here the problem is, the new document's page size is different than original document. So some text are cropped or missing in new document. I am using PDFBox 2.0 and how I can avoid this?
UPDATE:
Thanks #mkl.
Splitter did the magic. Here is the updated working part,
public static void extractAndCreateDocument(SplitMeta meta, PDDocument source)
throws IOException {
File file = new File(meta.getFilename());
Splitter splitter = new Splitter();
splitter.setStartPage(meta.getStart());
splitter.setEndPage(meta.getEnd());
splitter.setSplitAtPage(meta.getEnd());
List<PDDocument> docs = splitter.split(source);
if(docs.size() > 0){
PDDocument output = docs.get(0);
output.save(file);
output.close();
}
}
public class SplitMeta {
private String filename;
private int start;
private int end;
public SplitMeta() {
}
}

Unfortunately the OP has not provided a sample document to reproduce the issue. Thus, I have to guess.
I assume that the issue is based in objects not immediately linked to the page object but inherited from its parents.
In that case using PDDocument.addPage is the wrong choice as this method only adds the given page object to the target document page tree without consideration of inherited stuff.
Instead one should use PDDocument.importPage which is documented as:
/**
* This will import and copy the contents from another location. Currently the content stream is stored in a scratch
* file. The scratch file is associated with the document. If you are adding a page to this document from another
* document and want to copy the contents to this document's scratch file then use this method otherwise just use
* the {#link #addPage} method.
*
* Unlike {#link #addPage}, this method does a deep copy. If your page has annotations, and if
* these link to pages not in the target document, then the target document might become huge.
* What you need to do is to delete page references of such annotations. See
* here for how to do this.
*
* #param page The page to import.
* #return The page that was imported.
*
* #throws IOException If there is an error copying the page.
*/
public PDPage importPage(PDPage page) throws IOException
Actually even this method might not suffice as is as it does not consider all inherited attributes, but looking at the Splitter utility class one gets an impression what one has to do:
PDPage imported = getDestinationDocument().importPage(page);
imported.setCropBox(page.getCropBox());
imported.setMediaBox(page.getMediaBox());
// only the resources of the page will be copied
imported.setResources(page.getResources());
imported.setRotation(page.getRotation());
// remove page links to avoid copying not needed resources
processAnnotations(imported);
making use of the helper method
private void processAnnotations(PDPage imported) throws IOException
{
List<PDAnnotation> annotations = imported.getAnnotations();
for (PDAnnotation annotation : annotations)
{
if (annotation instanceof PDAnnotationLink)
{
PDAnnotationLink link = (PDAnnotationLink)annotation;
PDDestination destination = link.getDestination();
if (destination == null && link.getAction() != null)
{
PDAction action = link.getAction();
if (action instanceof PDActionGoTo)
{
destination = ((PDActionGoTo)action).getDestination();
}
}
if (destination instanceof PDPageDestination)
{
// TODO preserve links to pages within the splitted result
((PDPageDestination) destination).setPage(null);
}
}
// TODO preserve links to pages within the splitted result
annotation.setPage(null);
}
}
As you are trying to split the single PDF into multiple, like 10 page document into 10 single page document, you might want to use this Splitter utility class as is.
Tests
To test those methods I used the output of the PDF Clown sample output AnnotationSample.Standard.pdf because that library heavily depends on inheritance of page tree values. Thus, I copied the content of its only page to a new document using either PDDocument.addPage, PDDocument.importPage, or Splitter like this:
PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(new File(RESULT_FOLDER, "PageAddedFromAnnotationSample.Standard.pdf"));
output.close();
(CopyPages.java test testWithAddPage)
PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.importPage(page);
output.save(new File(RESULT_FOLDER, "PageImportedFromAnnotationSample.Standard.pdf"));
output.close();
(CopyPages.java test testWithImportPage)
PDDocument source = PDDocument.load(resource);
Splitter splitter = new Splitter();
List<PDDocument> results = splitter.split(source);
Assert.assertEquals("Expected exactly one result document from splitting a single page document.", 1, results.size());
PDDocument output = results.get(0);
output.save(new File(RESULT_FOLDER, "PageSplitFromAnnotationSample.Standard.pdf"));
output.close();
(CopyPages.java test testWithSplitter)
Only the final test copied the page faithfully.

Copy contents from docx with bullets intact with Apache POI

I am trying to copy contents from a docx file to the clipboard eventually. The code I have come up with so far is:
package config;
public class buffer {
public static void main(String[] args) throws IOException, XmlException {
XWPFDocument srcDoc = new XWPFDocument(new FileInputStream("D:\\rules.docx"));
XWPFDocument destDoc = new XWPFDocument();
OutputStream out = new FileOutputStream("D:\\test.docx");
for (IBodyElement bodyElement : srcDoc.getBodyElements()) {
XWPFParagraph srcPr = (XWPFParagraph) bodyElement;
XWPFParagraph dstPr = destDoc.createParagraph();
dstPr.createRun();
int pos = destDoc.getParagraphs().size() - 1;
destDoc.setParagraph(srcPr, pos);
}
destDoc.write(out);
out.close();
}
}
This does fetch the bullets but numbers them. I want to retain the original bullet format. Is there a way to do this?

You'll need to handle the numbering definition (in the numbering part) correctly.
The most reliable thing to do would be to copy the definition (both the instance list and the abstract one) across, and renumber it (ie give it a new ID) so that it is unique.
Then of course you'll need to update the ID's in your paragraph to match.
Note that the above is a solution only for the question you have asked.
You'll run into problems if your content contains a rel to some other part (eg an image). And you'tr not handling the style definition etc.

iText add Watermark to selected pages

I need to add a watermark to every page that has certain text, such as "PROCEDURE DELETED".
Based on Bruno Lowagie's suggestion in Adding watermark directly to the stream
So far have the PdfWatermark Class with:
protected Phrase watermark = new Phrase("DELETED", new Font(FontFamily.HELVETICA, 60, Font.NORMAL, BaseColor.PINK));
ArrayList<Integer> arrPages = new ArrayList<Integer>();
boolean pdfChecked = false;
#Override
public void onEndPage(PdfWriter writer, Document document) {
if(pdfChecked == false) {
detectPages(writer, document);
pdfChecked = true;
}
int pageNum = writer.getPageNumber();
if(arrPages.contains(pageNum)) {
PdfContentByte canvas = writer.getDirectContentUnder();
ColumnText.showTextAligned(canvas, Element.ALIGN_CENTER, watermark, 298, 421, 45);
}
}
And this works fine if I add, say, the number 3 to the arrPages ArrayList in my custom detectPages method - it shows the desired watermark on page 3.
What I am having trouble with is how to search through the document for the text string, which I have access to here, only from the PdfWriter writer or the com.itextpdf.text.Document document sent to onEndPage method.
Here is what I have tried, unsuccessfully:
private void detectPages(PdfWriter writer, Document document) {
try {
//arrPages.add(3);
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfWriter.getInstance(document, byteArrayOutputStream);
//following code no work
PdfReader reader = new PdfReader(writer.getDirectContent().getPdfDocument());
PdfContentByte canvas = writer.getDirectContent();
PdfImportedPage page;
for (int i = 0; i < reader.getNumberOfPages(); ) {
page = writer.getImportedPage(reader, ++i);
canvas.addTemplate(page, 1f, 0, 0.4f, 0.4f, 72, 50 * i);
canvas.beginText();
canvas.showTextAligned(Element.ALIGN_CENTER,
//search String
String.valueOf((char)(181 + i)), 496, 150 + 50 * i, 0);
//if(detected) arrPages.add(i);
canvas.endText();
}
Am I on the right track with this as a solution or do I need to back out?
Can anyone supply the missing link needed to scan the doc and pick out "PROCEDURE DELETED" pages?
EDIT: I am using iText 5.0.4 - cannot upgrade to 5.5.X at this time, but could probably upgrade to latest version below that.
EDIT2: More information: This is approach to adding text to the document (doc):
String processed = processText(template);
List<Element> objects = HTMLWorker.parseToListLog(new StringReader(processed),
styles, interfaceProps, errors);
for (Element elem : objects) {
doc.add(elem);
}
That is called in an addText method I control. The template is simply html from a database LOB. The processText checks the html for custom markers contained by curlies as in ${replaceMe}.
This seems to be the place to identify the "PROCEDURE DELETED" string during generation of the document, but I don't see the path to Chunk.setGenericTags().
EDIT3: Table difficulties
List<Element> objects = HTMLWorker.parseToListLog(new StringReader(processed),
styles, interfaceProps, errors);
for (Element elem : objects) {
//Code below no work
if (elem instanceof PdfPTable) {
PdfPTable table = (PdfPTable) elem;
ArrayList<Chunk> chks = table.getChunks();
for(Chunk chk : chks){
if(chk.toString().contains("TEST DELETED")) {
chk.setGenericTag("delete_tag");
}
}
}
doc.add(elem);
}
Commenters mlk and Bruno suggested to detect the "PROCEDURE DELETED" keywords at the time they are added to the doc. However, since the keywords are necessarily inside a table, they have to be detected through PdfPTable rather than the simpler Element.
I could not do it with the code above. Any suggestions exactly how to find text inside a table cell and do a string comparison on it?
EDIT4: Based on some experimentation, I would like to make some assertions and please show me the way through them:
Using Chunk.setGenericTag() is required to trigger the handler onGenericTag
For some reason (PdfPTable) table.getChunks() does not return chunks, at least that my system picks up. This is counterintuitive and possibly there is a setup, version, or code bug causing this behavior.
Therefore, a selection text string inside a table cannot be used to trigger a watermark.

Thanks to all the help from #mkl and #Bruno, I finally got a workable solution. Anybody interested who can give a more elegant approach is welcome - there is certainly room.
Two classes were in play in all the snippets given in the original question. Below is partial code that embodies the working solution.
public class PdfExporter
//a custom class to build content
//create some content to add to pdf
String text = "<div>Lots of important content here</div>";
//assume the section is known to be deleted
if(section_deleted) {
//add a special marker to the text, in white-on-white
text = "<span style=\"color:white;\">.!^DEL^?.</span>" + text + "</span>";
}
//based on some indicator, we know we want to force a new page for a new section
boolean ensureNewPage = true;
//use an instance of PdfDocHandler to add the content
pdfDocHandler.addText(text, ensureNewPage);
public class PdfDocHandler extends PdfPageEventHelper
private boolean isDEL;
public boolean addText(String text, boolean ensureNewPage)
if (ensureNewPage) {
//turn isDEL off: a forced pagebreak indicates a new section
isDEL = false;
}
//attempt to find the special DELETE marker in first chunk
//NOTE: this can be done several ways
ArrayList<Chunk> chks = elem.getChunks();
if(chks.size()>0) {
if(chks.get(0).getContent().contains(".!^DEL^?.")) {
//special delete marker found in text
isDEL = true;
}
}
//doc is an iText Document
doc.add(elem);
public void onEndPage(PdfWriter writer, Document document) {
if(isDEL) {
//set the watermark
Phrase watermark = new Phrase("DELETED", new Font(Font.FontFamily.HELVETICA, 60, Font.NORMAL, BaseColor.PINK));
PdfContentByte canvas = writer.getDirectContentUnder();
ColumnText.showTextAligned(canvas, Element.ALIGN_CENTER, watermark, 298, 421, 45);
}
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

PDF Generation using iText7 with Java - java

Related

apache pdfbox - how to test if a document is flattened?

AcroForm comb TextField in Itextpdf7

Page is cropped in new document in PDFBox while copying

Copy contents from docx with bullets intact with Apache POI

iText add Watermark to selected pages

Categories

Resources