iText Android - Adding text to existing PDF - java

We have a PDF with some fields for collecting data, and I have to fill it programmatically with iText on Android by adding text at those positions. I've been considering different ways to achieve this, with little success in each one.
Note: I'm using the Android version of iText (iTextG 5.5.4) and a Samsung Galaxy Note 10.1 2014 (Android 4.4) for most of my tests.
The approach I took from the start was to "draw" the text at given coordinates on a given page. This has some problems with managing the fields (I have to be aware of the length of the strings, and it can be hard to position each text at the exact coordinates in the PDF). But most importantly, the process is really slow on some devices/OS versions (it works great on a Nexus 5 with 5.0.2, but takes several minutes with a 5 MB PDF on the Note 10.1).
pdfReader = new PdfReader(is);
document = new Document();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
pdfCopy = new PdfCopy(document, baos);
document.open();
PdfImportedPage page;
PdfCopy.PageStamp stamp;
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
    page = pdfCopy.getImportedPage(pdfReader, i); // First page = 1
    stamp = pdfCopy.createPageStamp(page);
    for (int j = 0; j < 10; j++) {
        int posX = j * 50;
        int posY = j * 100;
        Phrase phrase = new Phrase("Example text", FontFactory.getFont(FontFactory.HELVETICA, 12, BaseColor.RED));
        ColumnText.showTextAligned(stamp.getOverContent(), Element.ALIGN_CENTER, phrase, posX, posY, 0);
    }
    stamp.alterContents();
    pdfCopy.addPage(page);
}
We thought about adding "form fields" instead of drawing. That way I can configure a TextField and avoid managing the texts myself. However, the final PDF shouldn't have any annotations, so I would need to copy it into a new PDF without annotations and with those "form fields" drawn. I don't have an example of this because I wasn't able to get it working; I don't even know if it is possible or worthwhile.
The third option would be to receive a PDF with the "form fields" already added, so that I only have to fill them. However, I would still need to create a new PDF with all those fields and without annotations...
I'd like to know which way would perform best, and I'd welcome any help in achieving it. I am new to iText, so any help would be really appreciated.
Thanks!
EDIT
In the end I used the third option: a PDF with editable fields that we fill, and then we use "flattening" to create a non-editable PDF with all the texts already in place.
The code is as follows:
pdfReader = new PdfReader(is);
FileOutputStream fios = new FileOutputStream(outPdf);
PdfStamper pdfStamper = new PdfStamper(pdfReader, fios);
//Filling the PDF (It's totally necessary that the PDF has Form fields)
fillPDF(pdfStamper);
//Setting the PDF to uneditable format
pdfStamper.setFormFlattening(true);
pdfStamper.close();
and the method to fill the forms:
public static void fillPDF(PdfStamper stamper) throws IOException, DocumentException {
    // Getting the form fields from the PDF
    AcroFields form = stamper.getAcroFields();
    // The field names must match the names defined in the PDF form
    form.setField("name", "Ernesto");
    form.setField("surname", "Lage");
}
The only drawback of this approach is that you need to know the name of each field in order to fill it.

There is a process in iText known as 'flattening', which takes the form fields, and replaces them with the text that the fields contain.
I haven't used iText in a few years (and not at all on Android), but if you search the manual or online examples for 'flattening', you should find how to do it.

Related

Replace itext to pdfbox performance

I am evaluating replacing iText with PDFBox for our PDF processing. I ran some tests with 200 single-page PDFs (94KB, 469KB, 937KB) and merged them into one PDF in our application. PDFBox version: 2.0.23. iText version: 2.1.7.
Here are the test results:
Here is the itext implementation:
byte[] l_PDFPage = null;
PdfReader l_PDFReader = null;
PdfCopy l_Copier = null;
Document l_PDFDocument = null;
OutputStream l_Stream = new FileOutputStream(m_File);
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
l_PDFPage = l_Page.getAsPdf();
l_PDFReader = new PdfReader(l_PDFPage);
l_PDFReader.getPageN(1).put(PdfName.ROTATE, new PdfNumber(l_PDFReader.getPageRotation(1) + l_Page.getRotation() % 360));
l_PDFReader.consolidateNamedDestinations();
if( i == 0 ) {
l_PDFDocument = new Document(l_PDFReader.getPageSizeWithRotation(1));
l_Copier = new PdfCopy(l_PDFDocument, l_Stream);
l_PDFDocument.open();
}
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
if( l_PDFReader.getAcroForm() != null )
l_Copier.copyAcroForm(l_PDFReader);
l_Copier.flush();
l_Copier.freeReader(l_PDFReader);
}
l_PDFDocument.close();
l_Stream.close();
Here is the pdfbox implementation:
byte[] l_PDFPage = null;
List<PDDocument> pageDocuments = new ArrayList<>();
PDDocument saveDocument = new PDDocument();
try {
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
// our wrapper object for a page
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
PDDocument document = PDDocument.load(l_PDFPage);
// save page document to close it later
pageDocuments.add(document);
PDPage page = document.getPage(0);
saveDocument.addPage(saveDocument.importPage(page));
}
saveDocument.save(l_Stream);
}
finally {
// close every page document
for(PDDocument doc : pageDocuments) {
doc.close();
}
saveDocument.close();
}
I have also tried PDFBox's PDFMergerUtility. The performance was nearly the same as with the other PDFBox implementation, but with the 937KB files I ran into an out-of-memory error with this implementation:
byte[] l_PDFPage = null;
OutputStream l_Stream = new FileOutputStream(m_File);
PDFMergerUtility merger = new PDFMergerUtility();
// do it for all pages in the editor
for( int i = 0; i < m_Editor.getCountOfElements(); i++ ) {
l_Page = m_Editor.getPageAt(i);
// page as byte[]
l_PDFPage = l_Page.getAsPdf();
merger.addSource(new ByteArrayInputStream(l_PDFPage));
}
merger.setDestinationStream(l_Stream);
merger.mergeDocuments(null);
So my questions:
Why is the performance (time needed AND memory usage) of PDFBox so bad in comparison to iText?
Am I missing something in our PDFBox implementation?
Why can't I close the "page document" after I have added the page to "saveDocument"? If I close it there, I get an error while saving, so I have to store the "page documents" and close them at the end.
PDFBox and iText are architecturally different and, therefore, perform differently well for different tasks.
In particular iText attempts to write out new contents early, in your case much of the page is written to the output already during
l_Copier.addPage(l_Copier.getImportedPage(l_PDFReader, 1));
and
l_PDFDocument.close();
eventually only finalizes the PDF and writes last remaining objects and the file trailer.
PDFBox on the other hand saves everything in the end at once:
saveDocument.save(l_Stream);
The approach of iText has the advantage of a smaller memory footprint (as you observed) and the disadvantage that you cannot change data of a page once it is written.
(As an aside: the iText architecture changed from iText 5 to iText 7. In iText 7 you have the choice and can keep everything in memory, but the price there is also a big memory footprint.)
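To make the trade-off concrete, here is a minimal, library-free sketch of the two strategies. It is not how either library is implemented: the class names `EagerWriter` and `DeferredWriter` are invented for this illustration, and a page is modelled as a plain byte array.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical classes, not iText or PDFBox code.
final class WriteStrategySketch {

    /** Writes each page to the sink as soon as it is added (the iText approach described above). */
    static final class EagerWriter {
        private final ByteArrayOutputStream sink;
        private long peakHeldBytes = 0;

        EagerWriter(ByteArrayOutputStream sink) {
            this.sink = sink;
        }

        void addPage(byte[] page) {
            // Only the page currently being written is held by the writer.
            peakHeldBytes = Math.max(peakHeldBytes, page.length);
            sink.write(page, 0, page.length);
        }

        long peakHeldBytes() {
            return peakHeldBytes;
        }
    }

    /** Keeps every page in memory and writes them all in save() (the PDFBox approach described above). */
    static final class DeferredWriter {
        private final List<byte[]> pages = new ArrayList<>();
        private long heldBytes = 0;

        void addPage(byte[] page) {
            pages.add(page);
            heldBytes += page.length;
        }

        void save(ByteArrayOutputStream sink) {
            for (byte[] page : pages) {
                sink.write(page, 0, page.length);
            }
        }

        long peakHeldBytes() {
            return heldBytes;
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream eagerOut = new ByteArrayOutputStream();
        ByteArrayOutputStream deferredOut = new ByteArrayOutputStream();
        EagerWriter eager = new EagerWriter(eagerOut);
        DeferredWriter deferred = new DeferredWriter();
        for (int i = 0; i < 200; i++) {
            byte[] page = new byte[1000]; // stand-in for one single-page PDF
            eager.addPage(page);
            deferred.addPage(page);
        }
        deferred.save(deferredOut);
        System.out.println("eager peak held bytes:    " + eager.peakHeldBytes());
        System.out.println("deferred peak held bytes: " + deferred.peakHeldBytes());
    }
}
```

Both writers produce the same output; the difference is only in how much data the writer itself has to hold at once, which grows with the number of pages in the deferred case.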
Thus,
Why is the performance (needed time AND memory usage) of pdfbox so bad in comparison to itext?
The difference in memory use can partially be explained by the above. Also in iText after
l_Copier.freeReader(l_PDFReader);
the PdfReader can be closed (which you leave to the garbage collection to do for you) to free its resources while in your PDFBox code you keep all the source documents open, holding the resources up to the end. (Actually I would have assumed that when you're using importPage, you needn't keep them.)
Concerning the time, I'm not sure. You should do some finer-grained timing and determine where exactly the extra time is spent in PDFBox; thus, I second @Tilman's request for profiling data. I assume it's during the final save, but that's only a hunch. Also, such time differences might depend on structural details of the PDFs in question and may be less extreme for other documents.
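One simple way to get such per-phase timings without a profiler is to accumulate `System.nanoTime()` differences per named phase. The following is only a sketch: `PhaseTimer` and `Step` are hypothetical helper names, and the phase names in `main` are placeholders for the load, import and save calls of the real merge loop.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for coarse timing; not part of PDFBox or iText.
final class PhaseTimer {

    /** A unit of work that may throw a checked exception, e.g. an IOException from loading a PDF. */
    interface Step {
        void run() throws Exception;
    }

    // Insertion-ordered so the report lists phases in the order they first ran.
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the step and adds its elapsed time to the running total for the phase. */
    void time(String phase, Step step) {
        long start = System.nanoTime();
        try {
            step.run();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("phase failed: " + phase, e);
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Accumulated nanoseconds per phase. */
    Map<String, Long> totals() {
        return Collections.unmodifiableMap(nanosByPhase);
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        for (int i = 0; i < 200; i++) {
            timer.time("load", () -> { /* e.g. load one page document here */ });
            timer.time("import", () -> { /* e.g. import the page into the target here */ });
        }
        timer.time("save", () -> { /* e.g. save the target document here */ });
        timer.totals().forEach((phase, nanos) ->
                System.out.println(phase + ": " + nanos / 1_000_000 + " ms"));
    }
}
```

Wall-clock totals like these are rough (JIT warm-up and garbage collection add noise), but they are usually enough to tell whether the time goes into loading, importing or the final save.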

How to replace a text in pdf file with ITextPDF library?

I have a requirement to replace a placeholder like ${placeholder} with an actual value, but I could not find any working solution... I've been following https://itextpdf.com/en/resources/examples/itext-7/replacing-pdf-objects and it doesn't work. Does anybody know how to do it?
In general, it's not so easy to "replace" the content of a PDF file, since the same text can be written in different ways. For example, suppose that you want to replace a chunk "Hello" with a chunk "World". You'd be lucky if "Hello" has been written to the PDF as a whole word. It might have been written as "He" and "llo", or even "o", "l", "l", "e", "H", and the letters might be placed in different parts of the content stream.
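A small, library-free sketch shows why a plain string search over the content stream is unreliable. It works on a heavily simplified content stream: only literal strings shown with the `Tj` operator are handled, and escapes, `TJ` arrays, font encodings and positioning are ignored. The class and method names are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a toy model of a PDF content stream, not a PDF parser.
final class SplitTextDemo {

    // Matches a literal string operand followed by the show-text operator, e.g. "(He) Tj".
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^)]*)\\)\\s*Tj");

    /** Naive search: looks for the word as one literal string operand. */
    static boolean naiveContains(String contentStream, String word) {
        return contentStream.contains("(" + word + ")");
    }

    /** Concatenates all strings shown with Tj, in stream order. */
    static String joinShownText(String contentStream) {
        StringBuilder shown = new StringBuilder();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            shown.append(m.group(1));
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        String whole = "BT /F1 12 Tf (Hello) Tj ET";
        String split = "BT /F1 12 Tf (He) Tj (llo) Tj ET";
        System.out.println(naiveContains(whole, "Hello")); // prints "true"
        System.out.println(naiveContains(split, "Hello")); // prints "false"
        System.out.println(joinShownText(split));          // prints "Hello"
    }
}
```

Even the joined text is only correct here because the pieces appear in reading order; in a real file they need not, which is why location-based tools are used instead.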
However one can remove the content and then place some other content on the same place.
Let's look at how it could be done.
1) I advise you to use iText's pdfSweep, since this tool is able to detect the areas in which the content has been placed and remove the content (it's important to mention that pdfSweep doesn't hide content, it removes it completely).
Please look at the next sample: https://github.com/itext/i7j-pdfsweep/blob/develop/src/test/java/com/itextpdf/pdfcleanup/BigDocumentAutoCleanUpTest.java
Let's discuss the redactTonySoprano test. As you can see, one can provide some regexes (for example, "Tony( |_)Soprano", "Soprano" and "Sopranoes") and iText will redact all the matches in the content.
Then you can just write some text over these areas using iText, either via the low-level API (PdfCanvas) or via the more complex high-level API (Canvas, etc.).
Let's modify the soprano sample I've mentioned before a bit:
2) Let's add some text upon the redacted areas:
for (IPdfTextLocation location : strategy.getResultantLocations()) {
PdfPage page = pdf.getPage(location.getPageNumber()+1);
PdfCanvas pdfCanvas = new PdfCanvas(page.newContentStreamAfter(), page.getResources(), page.getDocument());
Canvas canvas = new Canvas(pdfCanvas, pdf, location.getRectangle());
canvas.add(new Paragraph("SECURED").setFontSize(8));
}
The result is not ideal, but that is just a proof of concept. It's possible to override the extraction strategies and define the font of the redacted content, so that it could be used for the new text to be placed on the redacted area.
Sample code below for replacing content in a PDF using iText:
File dir = new File("./");
File[] files = dir.listFiles(new FilenameFilter() {
    @Override
    public boolean accept(File dir, String name) {
        return name.endsWith(".pdf");
    }
});
for (File pdffile : files) {
System.out.println(pdffile.getName());
PdfReader reader = null;
reader = new PdfReader(pdffile.toString());
PdfDictionary dict = reader.getPageN(1);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
PRStream stream = (PRStream)object;
byte[] data = PdfReader.getStreamBytes(stream);
String dd = new String(data);
dd = dd.replace("0 0 0 rg\n()Tj", "0 0 0 rg\n(Plan Advanced Payment)Tj");
System.out.print(dd);
stream.setData(dd.getBytes());
}
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream("./output/"+pdffile.getName())); // output PDF
stamper.close();
reader.close();
}

How to copy/move AcroForm fields from one document to new blank one using IText5 or IText7?

I need to copy whole AcroForm including field positions and values from template PDF to a new blank PDF file. How can I do that?
In short, I need to get rid of the "background" from the template and leave only the form fields.
The whole point of this is to create a PDF with content that will be printed on pre-printed templates.
I am using iText 5, but I can switch to 7 if useful examples are provided.
After a lot of trial and error I have found the solution to "How to copy AcroForm fields into another PDF". It is an iText 7 version. I hope it will help somebody someday.
private byte[] copyFormElements(byte[] sourceTemplate) throws IOException {
PdfReader completeReader = new PdfReader(new ByteArrayInputStream(sourceTemplate));
PdfDocument completeDoc = new PdfDocument(completeReader);
ByteArrayOutputStream out = new ByteArrayOutputStream();
PdfWriter offsetWriter = new PdfWriter(out);
PdfDocument offsetDoc = new PdfDocument(offsetWriter);
offsetDoc.initializeOutlines();
PdfPage blank = offsetDoc.addNewPage();
PdfAcroForm originalForm = PdfAcroForm.getAcroForm(completeDoc, false);
// originalForm.getPdfObject().copyTo(offsetDoc,false);
PdfAcroForm offsetForm = PdfAcroForm.getAcroForm(offsetDoc, true);
for (String name : originalForm.getFormFields().keySet()) {
PdfFormField field = originalForm.getField(name);
PdfDictionary copied = field.getPdfObject().copyTo(offsetDoc, false);
PdfFormField copiedField = PdfFormField.makeFormField(copied, offsetDoc);
offsetForm.addField(copiedField, blank);
}
offsetDoc.close();
completeDoc.close();
return out.toByteArray();
}
Did you check the PdfCopyForms object:
Allows you to add one (or more) existing PDF document(s) to create a new PDF and add the form of another PDF document to this new PDF.
I didn't find an example, but you could try something like this:
PdfReader reader1 = new PdfReader(src1); // a document without a form
PdfReader reader2 = new PdfReader(src2); // a document with a form
PdfCopyForms copy = new PdfCopyForms(new FileOutputStream(dest));
copy.addDocument(reader1); // add the document without the form
copy.copyDocumentFields(reader2); // add the fields of the document with the form
copy.close();
reader1.close();
reader2.close();
I see that the class is deprecated. I'm not sure if that's because iText 7 makes it much easier to do this, or if it's because there were technical problems with the class.

iText pdf Multiple Pages with same Content

How can I generate a PDF report of multiple pages with the same content on each page? The following is the code for a single-page report. The multiple pages should be in a single PDF file.
<%
response.setContentType( "application/pdf" );
response.setHeader ("Content-Disposition","attachment;filename=TEST1.pdf");
Document document=new Document(PageSize.A4,25,25,35,0);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
PdfWriter writer=PdfWriter.getInstance( document, buffer);
document.open();
Font fontnormalbold = FontFactory.getFont("Arial", 10, Font.BOLD);
Paragraph p1=new Paragraph("",fontnormalbold);
float[] iwidth = {1f,1f,1f,1f,1f,1f,1f,1f};
float[] iwidth1 = {1f};
PdfPTable table1 = new PdfPTable(iwidth);
table1.setWidthPercentage(100);
PdfPCell cell = new PdfPCell(new Paragraph("Testing Page",fontnormalbold));
cell.setHorizontalAlignment(1);
cell.setColspan(8);
cell.setPadding(5.0f);
table1.addCell(cell);
PdfPTable outerTable = new PdfPTable(iwidth1);
outerTable.setWidthPercentage(100);
PdfPCell containerCell = new PdfPCell();
containerCell.addElement(table1);
outerTable.addCell(containerCell);
p1.add(outerTable);
document.add(new Paragraph(p1));
document.close();
DataOutput output = new DataOutputStream( response.getOutputStream() );
byte[] bytes = buffer.toByteArray();
response.setContentLength(bytes.length);
for( int i = 0; i < bytes.length; i++ ) { output.writeByte( bytes[i] ); }
response.getOutputStream().flush();
response.getOutputStream().close();
%>
There are different ways to solve this problem. Not all of the solutions are elegant.
Approach 1: add the same table many times.
I see that you are creating a PdfPTable object named outerTable. I'm going to ignore the silly things you do with this table (e.g. why are you adding this table to a Paragraph? Why are you adding a single cell with colspan 8 to a table with 8 columns? Why are you nesting this table into a table with a single column? All of these shenanigans are really weird), but having that outerTable, you could do this:
for (int i = 0; i < x; i++) {
document.add(outerTable);
document.newPage();
}
This will add the table x times and it will start a new page for every table. This is also what the people in the comments advised you, and although the code looks really elegant, it doesn't result in an elegant PDF. That is: if you were my employee, I'd fire you if you did this.
Why? Because adding a table requires CPU and you are using x times the CPU you need. Moreover, with every table you create, you create new content streams. The same content will be added x times to your document. Your PDF will be about x times bigger than it should be.
Why would this be a reason to fire a developer? Because applications like this usually live in the cloud. In the cloud, one usually pays for CPU and bandwidth. A developer who writes code that requires a multiple of the necessary CPU and bandwidth causes a cost that is unacceptable. In many cases, it is more cost-efficient to fire bad developers, hire slightly more expensive developers and buy slightly more expensive software, and then save plenty of money in the long term thanks to code that is more efficient in terms of CPU and bandwidth.
Approach 2: add the table to a PdfTemplate, reuse the PdfTemplate.
Please take a look at my answer to the StackOverflow question How to resize a PdfPTable to fit the page?
In this example, I create a PdfPTable named table. I know how wide I want the table to be (PageSize.A4.getWidth()), but I don't know in advance how high it will be. So I lock the width, I add the cells I need to add, and then I can calculate the height of the table like this: table.getTotalHeight().
I create a PdfTemplate that is exactly as big as the table:
PdfContentByte canvas = writer.getDirectContent();
PdfTemplate template = canvas.createTemplate(
table.getTotalWidth(), table.getTotalHeight());
I now add the table to this template:
table.writeSelectedRows(0, -1, 0, table.getTotalHeight(), template);
I wrap the table inside an Image object. This doesn't mean we're rasterizing the table, all text and lines are preserved as vector-data.
Image img = Image.getInstance(template);
I scale the img so that it fits the page size I have in mind:
img.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
Now I position the table vertically in the middle.
img.setAbsolutePosition(
0, (PageSize.A4.getHeight() - table.getTotalHeight()) / 2);
If you want to add the table multiple times, this is how you'd do it:
for (int i = 0; i < x; i++) {
document.add(img);
document.newPage();
}
What is the difference with Approach 1? Well, by using PdfTemplate, you are creating a Form XObject. A Form XObject is a content stream that is external to the page stream. A Form XObject is stored in the PDF file only once, and it can be reused many times, e.g. on every page of a document.
Approach 3: create a PDF document with a single page; concatenate the file many times
You are creating your PDF in memory. The PDF is stored in the buffer object. You could read this PDF using PdfReader like this:
PdfReader reader = new PdfReader(buffer.toByteArray());
Then you reuse this content like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Document doc = new Document();
PdfSmartCopy copy = new PdfSmartCopy(doc, baos);
doc.open();
for (int i = 0; i < x; i++) {
copy.addDocument(reader);
}
doc.close();
reader.close();
Now you can send the bytes stored in baos to the OutputStream of your response object. Make sure that you use PdfSmartCopy instead of PdfCopy. PdfCopy just copies the pages AS-IS without checking if there is redundant information. The result is a bloated PDF similar to the one you'd get if you'd use Approach 1. PdfSmartCopy looks at the bytes of the content streams and will detect that you're adding the same page over and over again. That page will be reused the same way as is done in Approach 2.
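The idea of detecting redundant streams by their bytes can be sketched without any PDF library. The class below is an illustration of content-based deduplication in general, not PdfSmartCopy's actual implementation; `StreamDedupSketch` and its methods are invented names, and a SHA-256 digest stands in for whatever comparison the library really performs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: stores each distinct stream once and hands out its id on repeats.
final class StreamDedupSketch {

    private final Map<String, Integer> idsByDigest = new HashMap<>();
    private final List<byte[]> storedStreams = new ArrayList<>();

    /** Returns the id of the stored stream, storing it first only if its bytes are new. */
    int add(byte[] stream) {
        String key = digest(stream);
        Integer existing = idsByDigest.get(key);
        if (existing != null) {
            return existing; // same bytes seen before: reuse instead of storing again
        }
        storedStreams.add(stream.clone());
        int id = storedStreams.size() - 1;
        idsByDigest.put(key, id);
        return id;
    }

    int storedCount() {
        return storedStreams.size();
    }

    private static String digest(byte[] data) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(sha256.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        StreamDedupSketch store = new StreamDedupSketch();
        byte[] page = "q 1 0 0 1 0 0 cm /Xf1 Do Q".getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < 3; i++) {
            store.add(page);
        }
        System.out.println(store.storedCount()); // prints "1"
    }
}
```

The extra hashing work per stream is the price for the smaller file, which matches the trade-off between PdfCopy and PdfSmartCopy described above.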

Reading a table or cell value in a pdf file using java?

I have gone through Java and PDF forums looking for a way to extract a text value from a table in a PDF file, but couldn't find any solution except JPedal (which is not open source and requires a license).
So, I would like to know of any open-source APIs, like PDFBox or iText, that achieve the same result as JPedal.
Ref. Example:
In comments the OP clarified that he locates the text value from the table in a pdf file he wants to extract
By providing X and Y co-ordinates
Thus, while the question initially sounded like generic extraction of tabular data from PDFs (which can be difficult at least), it actually is essentially about extracting the text from a rectangular region on a page given by coordinates.
This is possible using either of the libraries you mentioned (and surely others, too).
iText
To restrict the region from which you want to extract text, you can use the RegionTextRenderFilter in a FilteredTextRenderListener, e.g.:
/**
 * Parses a specific area of a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
}
(ExtractPageContentArea sample from iText in Action, 2nd edition)
Beware, though, iText extracts text based on the basic text chunks in the content stream, not based on each individual glyph in such a chunk. Thus, the whole chunk is processed if only the tiniest part of it is in the area.
This may or may not suit you.
If you run into the problem that more is extracted than you wanted, you should split the chunks into their constituent glyphs beforehand. This stackoverflow answer explains how to do that.
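The difference between filtering whole chunks and filtering individual glyphs can be seen in a simplified, library-free model. This is not iText code: `Rect` and `Chunk` are hypothetical types, and every glyph is assumed to have the same width, which real fonts do not guarantee.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: a simplified model, not iText's actual classes.
final class RegionFilterSketch {

    /** Axis-aligned rectangle given by its lower-left corner, width and height. */
    static final class Rect {
        final double x, y, w, h;

        Rect(double x, double y, double w, double h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }

        /** True if the interiors overlap; rectangles that merely touch do not count. */
        boolean intersects(Rect o) {
            return x < o.x + o.w && o.x < x + w && y < o.y + o.h && o.y < y + h;
        }
    }

    /** A run of text drawn in one go; all glyphs are assumed to be equally wide. */
    static final class Chunk {
        final String text;
        final double x, y, glyphWidth, height;

        Chunk(String text, double x, double y, double glyphWidth, double height) {
            this.text = text; this.x = x; this.y = y;
            this.glyphWidth = glyphWidth; this.height = height;
        }

        Rect bounds() {
            return new Rect(x, y, glyphWidth * text.length(), height);
        }

        Rect glyphBounds(int index) {
            return new Rect(x + index * glyphWidth, y, glyphWidth, height);
        }
    }

    /** Chunk-level filter: a chunk is taken whole as soon as any part of it is in the region. */
    static String extractByChunk(List<Chunk> chunks, Rect region) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            if (c.bounds().intersects(region)) {
                out.append(c.text);
            }
        }
        return out.toString();
    }

    /** Glyph-level filter: each glyph is tested against the region on its own. */
    static String extractByGlyph(List<Chunk> chunks, Rect region) {
        StringBuilder out = new StringBuilder();
        for (Chunk c : chunks) {
            for (int i = 0; i < c.text.length(); i++) {
                if (c.glyphBounds(i).intersects(region)) {
                    out.append(c.text.charAt(i));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(new Chunk("Total: 42", 0, 0, 10, 12));
        Rect region = new Rect(70, 0, 20, 12); // covers only the two digits
        System.out.println(extractByChunk(chunks, region)); // prints "Total: 42"
        System.out.println(extractByGlyph(chunks, region)); // prints "42"
    }
}
```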
PDFBox
To restrict the region from which you want to extract text, you can use the PDFTextStripperByArea, e.g.:
PDDocument document = PDDocument.load( args[0] );
if( document.isEncrypted() )
{
document.decrypt( "" );
}
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 10, 280, 275, 60 );
stripper.addRegion( "class1", rect );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 0 );
stripper.extractRegions( firstPage );
System.out.println( "Text in the area:" + rect );
System.out.println( stripper.getTextForRegion( "class1" ) );
(ExtractTextByArea from the PDFBox 1.8.8 examples)
Try PDFTextStream. With it, I am at least able to identify the column values. Earlier, I was using iText and got stuck defining an extraction strategy. It's hard.
This API separates column cells by inserting extra spaces. The spacing is consistent, so you can build your own parsing logic on top of it (this was missing in iText).
import com.snowtide.PDF;
import com.snowtide.pdf.Document;
import com.snowtide.pdf.OutputTarget;
public class PDFText {
public static void main(String[] args) throws java.io.IOException {
String pdfFilePath = "xyz.pdf";
Document pdf = PDF.open(pdfFilePath);
StringBuilder text = new StringBuilder(1024);
pdf.pipe(new OutputTarget(text));
pdf.close();
System.out.println(text);
}
}
Question has been asked related to this on stackoverflow!
