How to replace a text in pdf file with ITextPDF library?

How to replace a text in pdf file with ITextPDF library? - java

I have a requirement to replace a placeholder like ${placeholder} with an actual value, but I could not find any working solution... I've ben following by https://itextpdf.com/en/resources/examples/itext-7/replacing-pdf-objects and it doesn't work. Does anybody know how to do it?

In general, it's not so easy to "replace" the content of a pdf file, since it could have been written in a different way. For example, suppose that you want to replace a chunk "Hello" with a chunk "World". You'd be lucky if "Hello" has been written to a pdf as a whole word. It might have been written as "He" and "llo", or even "o", "l" , "l", "e", "H", and the letters migth be placed in a different parts of the content stream.
However one can remove the content and then place some other content on the same place.
Let's look at how it could be done.
1) I advice you to use iText's pdfSweep, since this tool is able to detect the areas on which the content has been placed and remove the content (it's important to mention that pdfSweep doesn't hide content, it removes it completely)
Please look at the next sample: https://github.com/itext/i7j-pdfsweep/blob/develop/src/test/java/com/itextpdf/pdfcleanup/BigDocumentAutoCleanUpTest.java
Let's discuss redactTonySoprano test. As you can see, one can provide some regexes (for example, ""Tony( |_)Soprano", "Soprano" and "Sopranoes") and iText will redact all the matches of the content.
Then you can just write some text upon these areas using iText either via lowlevel api (PdfCanvas) or via more complex highlevel api (Canvas, etc).
Let's modify the soprano sample I've mentioned before a bit:
2) Let's add some text upon the redacted areas:
for (IPdfTextLocation location : strategy.getResultantLocations()) {
PdfPage page = pdf.getPage(location.getPageNumber()+1);
PdfCanvas pdfCanvas = new PdfCanvas(page.newContentStreamAfter(), page.getResources(), page.getDocument());
Canvas canvas = new Canvas(pdfCanvas, pdf, location.getRectangle());
canvas.add(new Paragraph("SECURED").setFontSize(8));
}
The result is not ideal, but that is just a proof of concept. It's possible to override the extraction strategies and define the font of the redacted content, so that it could be used for the new text to be placed on the redacted area.

Sample code below for replace content in PDF using iText
File dir = new File("./");
File [] files = dir.listFiles(new FilenameFilter() {
#Override
public boolean accept(File dir, String name) {
return name.endsWith(".pdf");
}
});
for (File pdffile : files) {
System.out.println(pdffile.getName());
PdfReader reader = null;
reader = new PdfReader(pdffile.toString());
PdfDictionary dict = reader.getPageN(1);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
PRStream stream = (PRStream)object;
byte[] data = PdfReader.getStreamBytes(stream);
String dd = new String(data);
dd = dd.replace("0 0 0 rg\n()Tj", "0 0 0 rg\n(Plan Advanced Payment)Tj");
System.out.print(dd);
stream.setData(dd.getBytes());
}
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream("./output/"+pdffile.getName())); // output PDF
stamper.close();
reader.close();
}

Related

Aspose word styles copy from template

I am trying to use a template with predefined styles, fonts and fonts size so when I create the new document it has styles, fonts defined in my template.
This is my code:
private static byte[] joinSections(Document document, List<byte[]> list) throws Exception {
if (list != null && list.size() > 0) {
document.removeAllChildren();
for (byte[] item : list) {
ByteArrayInputStream partInStream = new ByteArrayInputStream(item);
Document target = new Document(partInStream);
// Make the document appear straight after the destination documents content.
target.getFirstSection().getPageSetup().setSectionStart(SectionStart.CONTINUOUS);
document.appendDocument(target, ImportFormatMode.USE_DESTINATION_STYLES);
}
ByteArrayOutputStream bs = new ByteArrayOutputStream();
document.save(bs, SaveFormat.DOCX);
return bs.toByteArray();
}
return null;
}
The document is the template where styles are defined.
The list is the list of paragraphs I want to insert into empty templates. The list is populated before calling this function
I have two issues:
In result document, my styles are applied, but font name and font size are inherited from source paragraphs, even I set to use USE_DESTINATION_STYLES. So for example, if my template has Arial 14 for Heading 1, and source document, from where I extracted that paragraph, for Heading 1 have font Times New Roman, in result document it will be Times New Roman? What I am missing here
What ever I do, I am always getting style Normal_0 in my result document. Not sure how to ignore the source and avoid having Normal_0 in my result document?
Thanks

PDFBox Form fill - saveIncremental does not work

I have a pdf file with some form field that I want to fill from java. Right now I'm trying to fill just one form which I am finding by its name. My code looks like this:
File file = new File("c:/Testy/luxmed/Skierowanie3.pdf");
PDDocument document = PDDocument.load(file);
PDDocumentCatalog doc = document.getDocumentCatalog();
PDAcroForm Form = doc.getAcroForm();
String formName = "topmostSubform[0].Page1[0].pana_pania[0]";
PDField f = Form.getField(formName);
setField(document, formName, "Artur");
System.out.println("New value 2nd: " + f.getValueAsString());
document.saveIncremental(new FileOutputStream("c:/Testy/luxmed/nowy_pd3.pdf"));
document.close();
and this:
public static void setField(PDDocument pdfDocument, String name, String Value) throws IOException
{
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field instanceof PDCheckBox){
field.setValue("Yes");
}
else if (field instanceof PDTextField){
System.out.println("Original value: " + field.getValueAsString());
field.setValue(Value);
System.out.println("New value: " + field.getValueAsString());
}
else{
System.out.println("Nie znaleziono pola");
}
}
As system.out states, the value was set correctly, but in new the generated pdf file, new value is not showing up (original String is presented) so I guess the incremental saving does not work properly. What am I missing?
I use 2.0.2 version of pdfbox, and here is pdf file with which I working: pdf

In general
When saving changes to a PDF as an incremental update with PDFBox 2.0.x, you have to set the property NeedToBeUpdated to true for every PDF object changed. Furthermore, the object must be reachable from the PDF catalog via a chain of references, and each PDF object in this chain also has to have the property NeedToBeUpdated set to true.
This is due to the way PDFBox saves incrementally, starting from the catalog it inspects the NeedToBeUpdated property, and if it is set to true, PDFBox stores the object, and only in this case it recurses deeper into the objects referenced from this object in search for more objects to store.
In particular this implies that some objects unnecessarily have to be marked NeedToBeUpdated, e.g. the PDF catalog itself, and in some cases this even defeats the purpose of the incremental update at large, see below.
In case of the OP's document
Setting the NeedToBeUpdated properties
On one hand one has to extend the setField method to mark the chain of field dictionaries up to and including the changed field and also the appearance:
public static void setField(PDDocument pdfDocument, String name, String Value) throws IOException
{
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name);
if (field instanceof PDCheckBox) {
field.setValue("Yes");
}
else if (field instanceof PDTextField) {
System.out.println("Original value: " + field.getValueAsString());
field.setValue(Value);
System.out.println("New value: " + field.getValueAsString());
}
else {
System.out.println("Nie znaleziono pola");
}
// vvv--- new
COSDictionary fieldDictionary = field.getCOSObject();
COSDictionary dictionary = (COSDictionary) fieldDictionary.getDictionaryObject(COSName.AP);
dictionary.setNeedToBeUpdated(true);
COSStream stream = (COSStream) dictionary.getDictionaryObject(COSName.N);
stream.setNeedToBeUpdated(true);
while (fieldDictionary != null)
{
fieldDictionary.setNeedToBeUpdated(true);
fieldDictionary = (COSDictionary) fieldDictionary.getDictionaryObject(COSName.PARENT);
}
// ^^^--- new
}
(FillInFormSaveIncremental method setField)
On the other hand the main code has to be extended to mark a chain from the catalog to the fields array:
PDDocument document = PDDocument.load(...);
PDDocumentCatalog doc = document.getDocumentCatalog();
PDAcroForm Form = doc.getAcroForm();
String formName = "topmostSubform[0].Page1[0].pana_pania[0]";
PDField f = Form.getField(formName);
setField(document, formName, "Artur");
System.out.println("New value 2nd: " + f.getValueAsString());
// vvv--- new
COSDictionary dictionary = document.getDocumentCatalog().getCOSObject();
dictionary.setNeedToBeUpdated(true);
dictionary = (COSDictionary) dictionary.getDictionaryObject(COSName.ACRO_FORM);
dictionary.setNeedToBeUpdated(true);
COSArray array = (COSArray) dictionary.getDictionaryObject(COSName.FIELDS);
array.setNeedToBeUpdated(true);
// ^^^--- new
document.saveIncremental(new FileOutputStream(...));
document.close();
(FillInFormSaveIncremental test testFillInSkierowanie3)
Beware: for use with generic PDFs one obviously should introduce some null tests...
Opening the result file in Adobe Reader one will unfortunately see that the program complains about changes which disable extended features in the file.
This is due to the quirk in PDFBox' incremental saving that it requires some unnecessary objects in the update section. In particular the catalog is saved there which contains a usage rights signature (the technology granting extended features). The re-saved signature obviously is not at its original position in its original revision anymore. Thus, is invalidated.
Most likely the OP OP wanted to save the PDF incrementally to not break this signature but PDFBox does not permit this. Oh well...
Thus, the only thing one can do is prevent the warning by completely removing the signature.
Removing the usage rights signature
We already have retrieved the catalog object in the additions above, so removing the signature is easy:
COSDictionary dictionary = document.getDocumentCatalog().getCOSObject();
// vvv--- new
dictionary.removeItem(COSName.PERMS);
// ^^^--- new
dictionary.setNeedToBeUpdated(true);
(FillInFormSaveIncremental test testFillInSkierowanie3)
Opening the result file in Adobe Reader one will unfortunately see that the program complains about missing extended features in the file to save it.
This is due to the fact that Adobe Reader requires extended features to save changes to XFA forms, extended features we had to remove in this step.
But the document at hand is a hybrid AcroForm & XFA form document, and Adobe Reader requires no extended features to save AcroForm documents. Thus, all we have to do is remove the XFA form. As our code only sets the AcroForm value, this is a good idea anyways...
Removing the XFA form
We already have retrieved the acroform object in the additions above, so removing the XFA form referenced from there is easy:
dictionary = (COSDictionary) dictionary.getDictionaryObject(COSName.ACRO_FORM);
// vvv--- new
dictionary.removeItem(COSName.XFA);
// ^^^--- new
dictionary.setNeedToBeUpdated(true);
(FillInFormSaveIncremental test testFillInSkierowanie3)
Opening the result file in Adobe Reader one will see that one now can without further ado edit the form and save the file.
Beware, a sufficiently new Adobe Reader version is required for this, earlier versions (up to at least version 9) did require extended features even for saving changes to an AcroForm form

Remain only used font subsets while splitting PDF in Java

I am using iText to split a PDF document into separate pages as PDF files. Each file seems to be too large as all fonts used in the input PDF are saved into all result pages, which is apparently not very clean.
Code of splitting is as below. Notice PdfSmartCopy and setFullCompression doesn't help to reduce size (which I have no idea why).
public List<byte[]> split(byte[] input) throws IOException, DocumentException {
PdfReader pdfReader = new PdfReader(input);
List<byte[]> pdfFiles = new ArrayList<>();
int pageCount = pdfReader.getNumberOfPages();
int pageIndex = 0;
while (++pageIndex <= pageCount) {
Document document = new Document(pdfReader.getPageSizeWithRotation(pageIndex));
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfCopy pdfCopy = new PdfSmartCopy(document, byteArrayOutputStream);
pdfCopy.setFullCompression();
PdfImportedPage pdfImportedPage = pdfCopy.getImportedPage(pdfReader, pageIndex);
document.open();
pdfCopy.addPage(pdfImportedPage);
document.close();
pdfCopy.close();
pdfFiles.add(byteArrayOutputStream.toByteArray());
}
return pdfFiles;
}
So is there a way in Java (iText or not) to solve these problem?
Update with demo PDF
Here is a 377KB PDF using multiple CJK fonts where any page in it using 1 or 2 fonts. The summary size of sub-PDFs is 1.2MB. Considering CJK fonts are very bloated, I would like to find a way to remove unused font and even remove unused characters in used fonts.
So my idea is to remain only used characters in used fonts and embed them in sub files and then un-embed all other fonts. Any advice?

iText Android - Adding text to existing PDF

we have a PDF with some fields in order to collect some data, and I have to fill it programmatically with iText on Android by adding some text in those positions. I've been thinking about different ways to achieve this, with little success in each one.
Note: I'm using the Android version of iText (iTextG 5.5.4) and a Samsung Galaxy Note 10.1 2014 (Android 4.4) for most of my tests.
The approach I took from the start was to "draw" the text on a given coordinates, for a given page. This has some problems with the management of the fields (I have to be aware of the length of the strings, and it could be hard to position each text in the exact coordinate of the pdf). But most importantly, the performance of the process is really slow in some devices/OSVersions (it works great in Nexus 5 with 5.0.2, but takes several minutes with a 5MB Pdf on the Note 10.1).
pdfReader = new PdfReader(is);
document = new Document();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
pdfCopy = new PdfCopy(document, baos);
document.open();
PdfImportedPage page;
PdfCopy.PageStamp stamp;
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
page = pdfCopy.getImportedPage(pdfReader, i); // First page = 1
stamp = pdfCopy.createPageStamp(page);
for (int i=0; i<10; i++) {
int posX = i*50;
int posY = i*100;
Phrase phrase = new Phrase("Example text", FontFactory.getFont(FontFactory.HELVETICA, 12, BaseColor.RED));
ColumnText.showTextAligned(stamp.getOverContent(), Element.ALIGN_CENTER, phrase, posX, posY, 0);
}
stamp.alterContents();
pdfCopy.addPage(page);
}
We though about adding "forms fields" instead of drawing. That way I can configure a TextField and avoid managing the texts myself. However, the final PDF shouldn't have any annotations, so I would need to copy it into a new Pdf without annotations and with those "forms fields" drawn. I don't have an example of this because I wasn't able to perform this, I don't even know if this is possible/worthwhile.
The third option would be to receive a Pdf with the "forms fields" already added, that way I only have to fill them. However I still need to create a new Pdf with all those fields and without annotations...
I'd like to know what's be the best way in performance to do this process, and any help about achieving it. I am really newbie with iText and any help would be really appreciated.
Thanks!
EDIT
At the end I used the third option: a PDF with editable fields that we fill, and then we use the "flattening" to create a non-editable PDF with all texts already there.
The code is as follows:
pdfReader = new PdfReader(is);
FileOutputStream fios = new FileOutputStream(outPdf);
PdfStamper pdfStamper = new PdfStamper(pdfReader, fios);
//Filling the PDF (It's totally necessary that the PDF has Form fields)
fillPDF(pdfStamper);
//Setting the PDF to uneditable format
pdfStamper.setFormFlattening(true);
pdfStamper.close();
and the method to fill the forms:
public static void fillPDF(PdfStamper stamper) throws IOException, DocumentException{
//Getting the Form fields from the PDF
AcroFields form = stamper.getAcroFields();
Set<String> fields = form.getFields().keySet();
for(String field : fields){
form.setField("name", "Ernesto");
form.setField("surname", "Lage");
}
}
}
The only thing about this approach is that you need to know the name of each field in order to fill it.

There is a process in iText known as 'flattening', which takes the form fields, and replaces them with the text that the fields contain.
I haven't used iText in a few years (and not at all on Android), but if you search the manual or online examples for 'flattening', you should find how to do it.

Replace fonts in a PDF using iText (Java)

I'd like to convert all the fonts, embedded or otherwise, of a PDF to another font using iText. I understand that line-height, kerning and a bunch of other things would be bungled up, but this I truly don't mind how ugly the output is.
I have seen how to embed fonts into existing pdfs here, but I don't know how to set ALL EXISTING text in the document to that font.
I understand that this isn't as straightforward as I make it out to be. Perhaps it would be easier just to take all the raw text from the document, and create a new document using the new font (again, layout/readability is a non-issue to me)

The example EmbedFontPostFacto.java from chapter 16 of iText in Action — 2nd Edition shows how to embed an originally not embedded font. The central method is this:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
// the font file
RandomAccessFile raf = new RandomAccessFile(FONT, "r");
byte fontfile[] = new byte[(int)raf.length()];
raf.readFully(fontfile);
raf.close();
// create a new stream for the font file
PdfStream stream = new PdfStream(fontfile);
stream.flateCompress();
stream.put(PdfName.LENGTH1, new PdfNumber(fontfile.length));
// create a reader object
PdfReader reader = new PdfReader(RESULT1);
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT2));
PdfName fontname = new PdfName(FONTNAME);
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary())
continue;
font = (PdfDictionary)object;
if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
&& fontname.equals(font.get(PdfName.FONTNAME))) {
PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
font.put(PdfName.FONTFILE2, objref.getIndirectReference());
}
}
stamper.close();
reader.close();
}
This (without the fontname.equals(font.get(PdfName.FONTNAME)) test) may be a starting point for the simple cases of your task.
You'll have to do quite a lot of tests concerning encoding and add some individual translations for a more generic solution. You may want to study section 9 Text of the PDF specification ISO 32000-1 for this.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to replace a text in pdf file with ITextPDF library? - java

I have a requirement to replace a placeholder like ${placeholder} with an actual value, but I could not find any working solution... I've ben following by https://itextpdf.com/en/resources/examples/itext-7/replacing-pdf-objects and it doesn't work. Does anybody know how to do it?

Related

Aspose word styles copy from template

PDFBox Form fill - saveIncremental does not work

Remain only used font subsets while splitting PDF in Java

iText Android - Adding text to existing PDF

Replace fonts in a PDF using iText (Java)

Categories

Resources