Editing PDF text using Java

Is there a way I can edit a PDF from Java?
I have a PDF document containing text placeholders that I need to replace from Java, but all the libraries I have seen create PDFs from scratch and offer only limited editing functionality.
Is there any way I can edit a PDF, or is this impossible?

You can do it with iText. I tested it with the following code. It adds a chunk of text and a red circle over each page of an existing PDF.
/* requires itextpdf-5.1.2.jar or similar */
import java.io.*;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.*;
public class AddContentToPDF {
public static void main(String[] args) throws IOException, DocumentException {
/* example inspired from "iText in action" (2006), chapter 2 */
PdfReader reader = new PdfReader("C:/temp/Bubi.pdf"); // input PDF
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream("C:/temp/Bubi_modified.pdf")); // output PDF
BaseFont bf = BaseFont.createFont(
BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED); // set font
//loop on pages (1-based)
for (int i=1; i<=reader.getNumberOfPages(); i++){
// get object for writing over the existing content;
// you can also use getUnderContent for writing in the bottom layer
PdfContentByte over = stamper.getOverContent(i);
// write text
over.beginText();
over.setFontAndSize(bf, 10); // set font and size
over.setTextMatrix(107, 740); // set x,y position (0,0 is at the bottom left)
over.showText("I can write at page " + i); // set text
over.endText();
// draw a red circle
over.setRGBColorStroke(0xFF, 0x00, 0x00);
over.setLineWidth(5f);
over.ellipse(250, 450, 350, 550);
over.stroke();
}
stamper.close();
}
}

I modified the code I found a bit, and it worked as follows:
import java.io.*;
import java.nio.charset.StandardCharsets;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.*;
public class Principal {
public static final String SRC = "C:/tmp/244558.pdf";
public static final String DEST = "C:/tmp/244558-2.pdf";
public static void main(String[] args) throws IOException, DocumentException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new Principal().manipulatePdf(SRC, DEST);
}
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfDictionary dict = reader.getPageN(1);
// /Contents is either a single stream or an array of streams; treat both as an array
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
PdfArray refs = object.isArray() ? (PdfArray) object : new PdfArray(dict.get(PdfName.CONTENTS));
for (int i = 0; i < refs.size(); i++) {
PRStream stream = (PRStream) refs.getDirectObject(i);
byte[] data = PdfReader.getStreamBytes(stream);
// ISO-8859-1 maps every byte to one char, so bytes outside the replaced text survive the round trip
String content = new String(data, StandardCharsets.ISO_8859_1);
stream.setData(content.replace("NULA", "Nulo").getBytes(StandardCharsets.ISO_8859_1));
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.close();
reader.close();
}
}
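A note on the byte-to-string round trip in this kind of replacement: decoded content stream bytes are not guaranteed to be valid text in a multi-byte charset, so a charset that maps every byte to exactly one character, such as ISO-8859-1, keeps all untouched bytes intact. The sketch below uses only the JDK. The class name, the helper replaceInStream and the sample stream content are made up for illustration and are not part of iText.

```java
import java.nio.charset.StandardCharsets;

public class StreamTextReplace {

    /** Replaces a literal token in raw stream bytes without altering any other byte. */
    public static byte[] replaceInStream(byte[] data, String search, String replacement) {
        // ISO-8859-1 maps bytes 0..255 one-to-one onto chars, so the round trip is lossless
        String content = new String(data, StandardCharsets.ISO_8859_1);
        return content.replace(search, replacement).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // hypothetical decoded page content: a text object that shows the string "NULA"
        byte[] data = "BT /F1 12 Tf (NULA) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] edited = replaceInStream(data, "NULA", "Nulo");
        System.out.println(new String(edited, StandardCharsets.ISO_8859_1));
    }
}
```

This kind of replacement only finds the placeholder if the PDF stores it as one contiguous literal string in a simple single-byte encoding. Text that is split across several show-text operators, hex-encoded, or drawn with a subset font will not match.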

Take a look at iText and this sample code

Take a look at aspose and this sample code

I've done this using LibreOffice Draw.
You start by manually opening a pdf in Draw, checking that it renders OK, and saving it as a Draw .odg file.
That's a zipped xml file, so you can modify it in code to find and replace the placeholders.
Next (from code) you use a command line call to Draw to generate the pdf.
Success!
The main issue is that Draw doesn't handle fonts embedded in a PDF. If the font isn't also installed on your system, it will render oddly, because Draw replaces it with a standard font that inevitably has different sizing.
If this approach is of interest, I'll put together some shareable code.
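The find-and-replace step described above can be sketched with the JDK's zip support. This is a minimal sketch under assumptions: that the placeholders live in the content.xml entry of the .odg archive, that each placeholder appears as one uninterrupted string in that XML, and that the replacement text needs no XML escaping. The class and method names are hypothetical, and InputStream.readAllBytes requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdgPlaceholderReplacer {

    /** Copies an .odg archive, replacing placeholder strings inside content.xml. */
    public static byte[] replacePlaceholders(byte[] odg, Map<String, String> values) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(odg));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] data = in.readAllBytes(); // reads the current entry only
                if (entry.getName().equals("content.xml")) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    for (Map.Entry<String, String> e : values.entrySet()) {
                        xml = xml.replace(e.getKey(), e.getValue());
                    }
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                ZipEntry copy = new ZipEntry(entry.getName());
                if (entry.getName().equals("mimetype")) {
                    // ODF packages keep the mimetype entry uncompressed
                    CRC32 crc = new CRC32();
                    crc.update(data);
                    copy.setMethod(ZipEntry.STORED);
                    copy.setSize(data.length);
                    copy.setCompressedSize(data.length);
                    copy.setCrc(crc.getValue());
                }
                out.putNextEntry(copy);
                out.write(data);
                out.closeEntry();
            }
        }
        // the zip output stream is closed here, so the archive is complete
        return result.toByteArray();
    }
}
```

The PDF can then be produced from the modified file with a call such as "soffice --headless --convert-to pdf filled.odg", started for example through ProcessBuilder. This assumes the soffice executable is on the PATH; filled.odg is a placeholder file name.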

Related

Background image itextpdf 5.5

I'm using itextpdf 5.5 and can't change it to 7.
I have a problem with a background image.
I have a document (text and tables) without a stamp, and I want to add a stamp to it.
This is how I load the existing document:
PdfReader pdfReader = new PdfReader("/doc.pdf");
PdfImportedPage page = writer.getImportedPage(pdfReader, 1);
PdfContentByte pcb = writer.getDirectContent();
pcb.addTemplate(page, 0,0);
And this is how I load the stamp image and add it to my document:
PdfContentByte canvas = writer.getDirectContentUnder();
URL resource = getClass().getResource(getStamp());
Image background = new Jpeg(resource);
background.scaleToFit(463F, 132F);
background.setAbsolutePosition(275F, 100F);
canvas.addImage(background);
But when I download my document, I don't see the stamp. I tried changing getDirectContent() to getDirectContentUnder() where I load the document, but this leads to the opposite situation: my stamp isn't in the background.
My first document is generated this way:
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Document document = new Document(PageSize.A4);
try {
PdfWriter writer = PdfWriter.getInstance(document, outputStream);
document.open();
Paragraph title = new Paragraph(formatUtil.msg("my.header"), fontBold);
title.setAlignment(Element.ALIGN_CENTER);
document.add(title);
Template tmpl = fmConfig.getConfiguration().getTemplate("template.ftl");
Map<String, Object> params = new HashMap<>();
// renamed from "writer": that name is already taken by the PdfWriter above
StringWriter textWriter = new StringWriter();
params.put("param", "param");
tmpl.process(params, textWriter);
document.add(new Paragraph(textWriter.toString(), fontCommon));
PdfPTable table = new PdfPTable(2);
document.add(table);
PdfContentByte canvas = writer.getDirectContentUnder();
Image background = new Jpeg(getClass().getResource("background.jpg"));
background.scaleAbsolute(PageSize.A4);
background.setAbsolutePosition(0,0);
canvas.addImage(background);
} finally {
if (document.isOpen()) {
document.close();
}
}

In comments it became clear that the task was to add some content (a bitmap image) to a PDF so that it is over the background (another bitmap image) added to the UnderContent during generation and under the text in the original DirectContent.
iText does not contain high-level code for such content manipulation. While the structure of iText generated PDFs would allow for such code, PDFs generated or manipulated by other PDF libraries may have a different structure; the structure in iText generated PDFs may even be changed during post-processing using other libraries. Thus, it is understandable that no high-level feature for this is provided by iText.
To implement the task nonetheless, we have to base our code on lower-level iText APIs. Here we can make use of the PdfContentStreamEditor helper class from this answer, which already abstracts some details away. (That question was originally about iTextSharp for C#, but Java versions of the code are provided further down in the answer.)
In detail, we extend the PdfContentStreamEditor to remove the former UnderContent (and provide it as list of instructions). Now we can in a first step apply this editor to your file and in a second step add this former UnderContent plus an image over it to the intermediary file. (This could also be done in a single step but that would require a more complex, less maintainable editor class.)
First the new UnderContentRemover content stream editor class:
public class UnderContentRemover extends PdfContentStreamEditor {
/**
* Clears state of {@link UnderContentRemover}, in particular
* the collected content. Use this if you use this instance for
* multiple edit runs.
*/
public void clear() {
afterUnderContent = false;
underContent.clear();
depth = 0;
}
/**
* Retrieves the collected UnderContent instructions
*/
public List<List<PdfObject>> getUnderContent() {
return new ArrayList<List<PdfObject>>(underContent);
}
/**
* Adds the given instructions (which may previously have been
* retrieved using {@link #getUnderContent()}) to the given
* {@link PdfContentByte} instance.
*/
public static void write (PdfContentByte canvas, List<List<PdfObject>> operations) throws IOException {
for (List<PdfObject> operands : operations) {
int index = 0;
for (PdfObject object : operands) {
object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(operands.size() > ++index ? (byte) ' ' : (byte) '\n');
}
}
}
protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
String operatorString = operator.toString();
if (afterUnderContent) {
super.write(processor, operator, operands);
return;
} else if ("q".equals(operatorString)) {
depth++;
} else if ("Q".equals(operatorString)) {
depth--;
if (depth < 1)
afterUnderContent = true;
} else if (depth == 0) {
afterUnderContent = true;
super.write(processor, operator, operands);
return;
}
underContent.add(new ArrayList<>(operands));
}
boolean afterUnderContent = false;
List<List<PdfObject>> underContent = new ArrayList<>();
int depth = 0;
}
(UnderContentRemover)
As you see, its write method stores the leading instructions forwarded to it in the underContent list until it finds the restore-graphics-state (Q) instruction matching the initial save-graphics-state instruction (q). After that it instead forwards all further instructions to the parent write implementation which writes them to the edited page content.
We can use this for the task at hand as follows:
PdfReader pdfReader = new PdfReader(YOUR_DOCUMENT);
List<List<List<PdfObject>>> underContentByPage = new ArrayList<>();
byte[] sourceWithoutUnderContent = null;
try ( ByteArrayOutputStream outputStream = new ByteArrayOutputStream() ) {
PdfStamper pdfStamper = new PdfStamper(pdfReader, outputStream);
UnderContentRemover underContentRemover = new UnderContentRemover();
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
underContentRemover.clear();
underContentRemover.editPage(pdfStamper, i);
underContentByPage.add(underContentRemover.getUnderContent());
}
pdfStamper.close();
pdfReader.close();
sourceWithoutUnderContent = outputStream.toByteArray();
}
Image background = YOUR_IMAGE_TO_ADD_INBETWEEN;
background.scaleToFit(463F, 132F);
background.setAbsolutePosition(275F, 100F);
pdfReader = new PdfReader(sourceWithoutUnderContent);
byte[] sourceWithStampInbetween = null;
try ( ByteArrayOutputStream outputStream = new ByteArrayOutputStream() ) {
PdfStamper pdfStamper = new PdfStamper(pdfReader, outputStream);
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
PdfContentByte canvas = pdfStamper.getUnderContent(i);
UnderContentRemover.write(canvas, underContentByPage.get(i-1));
canvas.addImage(background);
}
pdfStamper.close();
pdfReader.close();
sourceWithStampInbetween = outputStream.toByteArray();
}
Files.write(new File("PdfLikeVladimirSafonov-WithStampInbetween.pdf").toPath(), sourceWithStampInbetween);
(AddImageInBetween test testForVladimirSafonov)
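The splitting rule implemented by UnderContentRemover.write can be modelled without iText, which makes it easier to see what ends up in each part. The following is a simplified sketch, not the code used above: it works on bare operator names instead of operand lists, and the class name UnderContentSplit and its fields are hypothetical. The branch order mirrors the write method shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

public class UnderContentSplit {
    public final List<String> underContent = new ArrayList<>();
    public final List<String> remaining = new ArrayList<>();
    private boolean afterUnderContent = false;
    private int depth = 0;

    /** Feeds one operator name, following the same branches as UnderContentRemover.write. */
    public void accept(String operator) {
        if (afterUnderContent) {
            remaining.add(operator);          // everything after the leading block is kept
            return;
        } else if ("q".equals(operator)) {
            depth++;                          // save graphics state: one level deeper
        } else if ("Q".equals(operator)) {
            depth--;                          // restore graphics state
            if (depth < 1) {
                afterUnderContent = true;     // matching Q of the initial q reached
            }
        } else if (depth == 0) {
            afterUnderContent = true;         // page does not start with q: no leading block
            remaining.add(operator);
            return;
        }
        underContent.add(operator);
    }

    public static UnderContentSplit split(List<String> operators) {
        UnderContentSplit result = new UnderContentSplit();
        for (String operator : operators) {
            result.accept(operator);
        }
        return result;
    }

    public static void main(String[] args) {
        UnderContentSplit s = split(List.of("q", "q", "cm", "Do", "Q", "Q", "BT", "Tj", "ET"));
        System.out.println(s.underContent); // [q, q, cm, Do, Q, Q]
        System.out.println(s.remaining);    // [BT, Tj, ET]
    }
}
```

Note that the rule treats any leading q ... Q block as former UnderContent. If a page's regular content happens to start with a save-graphics-state instruction, that block is collected too, so the approach relies on the page structure that iText produces.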

Extracting an embedded object from a pdf

I had embedded a byte array into a pdf file (Java).
Now I am trying to extract that same array.
The array was embedded as a "MOVIE" file.
I couldn't find any clue on how to do that...
Any ideas?
Thanks!
EDIT
I used this code to embed the byte array:
public static void pack(byte[] file) throws IOException, DocumentException{
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);
document.open();
RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));
PdfFileSpecification fs
= PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
RichMediaParams flashVars = new RichMediaParams();
instance.setAsset(asset);
configuration.addInstance(instance);
RichMediaActivation activation = new RichMediaActivation();
richMedia.setActivation(activation);
PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
writer.addAnnotation(richMediaAnnotation);
document.close();
}

I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:
public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new ExtractStreams().parse(SRC, DEST);
}
public void parse(String src, String dest) throws IOException {
PdfReader reader = new PdfReader(src);
PdfObject obj;
for (int i = 1; i <= reader.getXrefSize(); i++) {
obj = reader.getPdfObject(i);
if (obj != null && obj.isStream()) {
PRStream stream = (PRStream)obj;
byte[] b;
try {
b = PdfReader.getStreamBytes(stream);
}
catch(UnsupportedPdfException e) {
b = PdfReader.getStreamBytesRaw(stream);
}
FileOutputStream fos = new FileOutputStream(String.format(dest, i));
fos.write(b);
fos.flush();
fos.close();
}
}
}
Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:
When I use PdfReader.getStreamBytes(stream), iText will look at the filter. For instance: page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you will get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream) which is also the method you need to get your AVI-bytes from your PDF.
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library
You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.
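As an aside on the difference between the decoded and the raw bytes of a stream: /FlateDecode is zlib/deflate compression, so decoding such a stream gives back the original PDF syntax, while the raw bytes are the compressed form. The sketch below shows that transformation with java.util.zip only. It is not how iText implements getStreamBytes; the class name and the sample content string are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateDemo {

    /** Compresses bytes the way a /FlateDecode stream stores them (zlib format). */
    public static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decodes zlib-compressed bytes back into the original content. */
    public static byte[] inflate(byte[] raw) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(raw);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated or unsupported input: stop instead of looping forever
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] syntax = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] raw = deflate(syntax);      // comparable to what getStreamBytesRaw would return
        byte[] decoded = inflate(raw);     // comparable to what getStreamBytes would return
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // BT /F1 12 Tf (Hello) Tj ET
    }
}
```

For an embedded file such as the AVI, the raw bytes are only identical to the original file if no filter was applied when embedding. If the stream dictionary lists a filter, the decoded bytes are the ones to compare against the original array.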

How to remove headers and footers from PDF file using iText in Java

I am using the PDF iText library to convert PDF to text.
Below is my code to convert PDF to text file using Java.
public class PdfConverter {
/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";
/**
* Parses a PDF to a plain text file.
* @param pdf the original PDF
* @param txt the resulting text
* @throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
System.out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
/**
* Main method.
* @param args no arguments needed
* @throws IOException
*/
public static void main(String[] args) throws IOException {
new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}
The above code works for extracting PDF to text. But my requirement is to ignore header and footer and extract only content from PDF file.

Because your PDF has headers and footers, they would normally be marked as artifacts (if not, they are just text or content placed at the position of a header or footer). If they are marked as artifacts, you can extract the content using ParseTaggedPdf. You can also make use of ExtractPageContentArea if ParseTaggedPdf doesn't work. You can check a few examples related to them.
The above solution is general and depends on the file. If you really need an alternative, you can use Apache libraries such as PDFBox or Tika, or others such as PDFTextStream. Note that this alternative won't help if you have to stay with iText and can't move to another library. In PDFBox you can use PDFTextStripperByArea or PDFTextStripper. Look at the JavaDoc or some examples if you need to know how to use them.

Using iText, I found an example on this site: http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/
In it, you create a rectangle that defines the bounds of the text you want to extract.
PdfReader reader = new PdfReader(pdf);
PrintWriter out= new PrintWriter(new FileOutputStream(txt));
//Creating the rectangle
Rectangle rect=new Rectangle(70,80,420,500);
//creating a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for(int i=1;i<=reader.getNumberOfPages();i++){
//setting the filter on the text extraction strategy
strategy= new FilteredTextRenderListener(
new LocationTextExtractionStrategy(),filter);
out.println(PdfTextExtractor.getTextFromPage(reader,i,strategy));
}
out.flush();out.close();
As the web page describes, this should work even if the PDF is not tagged.
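One way to pick the rectangle is to start from the page size and cut off fixed header and footer bands. The sketch below only does that arithmetic. It assumes an unrotated page whose lower-left corner is at (0,0), and the band heights are guesses that have to be measured for the actual document; the class and method names are hypothetical. The four numbers are returned in the order lower-left x, lower-left y, upper-right x, upper-right y, which is the order the Rectangle constructor in the snippet above takes.

```java
public class BodyRegion {

    /**
     * Returns {llx, lly, urx, ury} of the page area between footer and header,
     * in PDF user-space points (1 point = 1/72 inch).
     */
    public static float[] bodyOf(float pageWidth, float pageHeight,
                                 float headerHeight, float footerHeight) {
        if (headerHeight < 0 || footerHeight < 0 || headerHeight + footerHeight >= pageHeight) {
            throw new IllegalArgumentException("header and footer bands must fit inside the page");
        }
        // the origin is at the bottom left, so the footer band occupies y = 0 .. footerHeight
        return new float[] {0f, footerHeight, pageWidth, pageHeight - headerHeight};
    }

    public static void main(String[] args) {
        // A4 is 595 x 842 points; a 60 pt header and a 50 pt footer are assumed values
        float[] box = bodyOf(595f, 842f, 60f, 50f);
        System.out.printf("%.0f %.0f %.0f %.0f%n", box[0], box[1], box[2], box[3]);
    }
}
```

The resulting values could then be passed as new Rectangle(box[0], box[1], box[2], box[3]) to the region filter.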

You can read specific locations of a PDF file. Just mark the areas you need to get text from and leave out the areas where the header and footer are shown. I have done it; the complete code is in the question "itext reading specific location from pdf file runs in intellij and gives desired output but executable jar throws error".

Remove rectangles from PDF file

I'd like to have a program that removes all rectangles from a PDF file. One use case for this is to unblacken a given PDF file to see if there is any hidden information behind the rectangles. The rest of the PDF file should be kept as-is.
Which PDF library is suitable to this task? In Java, I would like the code to look like this:
PdfDocument doc = PdfDocument.load(new File("original.pdf"));
PdfDocument unblackened = doc.transform(new CopyingPdfVisitor() {
public void visitRectangle(PdfRect rect) {
if (rect.getFillColor().getBrightness() >= 0.1) {
super.visitRectangle(rect);
}
}
});
unblackened.save(new File("unblackened.pdf"));
The CopyingPdfVisitor would copy a PDF document exactly as-is, and my custom code would leave out all the dark rectangles.

The iText PDF library has ways to modify PDF content.
The iText ContentParser example may give you an idea. The "qname" parameter (qualified name) may be used to detect rectangle elements.
http://itextpdf.com/book/chapter.php?id=15
Another option, if you want to obtain the text of the document, is to use PdfReaderContentParser to extract the text content:
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
The example is at http://itextpdf.com/examples/iia.php?id=277

Replace fonts in a PDF using iText (Java)

I'd like to convert all the fonts, embedded or otherwise, of a PDF to another font using iText. I understand that line-height, kerning and a bunch of other things would be bungled up, but I truly don't mind how ugly the output is.
I have seen how to embed fonts into existing PDFs here, but I don't know how to set ALL EXISTING text in the document to that font.
I understand that this isn't as straightforward as I make it out to be. Perhaps it would be easier just to take all the raw text from the document and create a new document using the new font (again, layout/readability is a non-issue to me).

The example EmbedFontPostFacto.java from chapter 16 of iText in Action — 2nd Edition shows how to embed an originally not embedded font. The central method is this:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
// the font file
RandomAccessFile raf = new RandomAccessFile(FONT, "r");
byte fontfile[] = new byte[(int)raf.length()];
raf.readFully(fontfile);
raf.close();
// create a new stream for the font file
PdfStream stream = new PdfStream(fontfile);
stream.flateCompress();
stream.put(PdfName.LENGTH1, new PdfNumber(fontfile.length));
// create a reader object
PdfReader reader = new PdfReader(RESULT1);
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT2));
PdfName fontname = new PdfName(FONTNAME);
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary())
continue;
font = (PdfDictionary)object;
if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
&& fontname.equals(font.get(PdfName.FONTNAME))) {
PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
font.put(PdfName.FONTFILE2, objref.getIndirectReference());
}
}
stamper.close();
reader.close();
}
This (without the fontname.equals(font.get(PdfName.FONTNAME)) test) may be a starting point for the simple cases of your task.
You'll have to do quite a lot of tests concerning encoding and add some individual translations for a more generic solution. You may want to study section 9 Text of the PDF specification ISO 32000-1 for this.
