Replace fonts in a PDF using iText (Java) - java

I'd like to convert all the fonts, embedded or otherwise, of a PDF to another font using iText. I understand that line-height, kerning and a bunch of other things would be bungled up, but this I truly don't mind how ugly the output is.
I have seen how to embed fonts into existing pdfs here, but I don't know how to set ALL EXISTING text in the document to that font.
I understand that this isn't as straightforward as I make it out to be. Perhaps it would be easier just to take all the raw text from the document, and create a new document using the new font (again, layout/readability is a non-issue to me)

The example EmbedFontPostFacto.java from chapter 16 of iText in Action — 2nd Edition shows how to embed an originally not embedded font. The central method is this:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
// the font file
RandomAccessFile raf = new RandomAccessFile(FONT, "r");
byte fontfile[] = new byte[(int)raf.length()];
raf.readFully(fontfile);
raf.close();
// create a new stream for the font file
PdfStream stream = new PdfStream(fontfile);
stream.flateCompress();
stream.put(PdfName.LENGTH1, new PdfNumber(fontfile.length));
// create a reader object
PdfReader reader = new PdfReader(RESULT1);
int n = reader.getXrefSize();
PdfObject object;
PdfDictionary font;
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT2));
PdfName fontname = new PdfName(FONTNAME);
for (int i = 0; i < n; i++) {
object = reader.getPdfObject(i);
if (object == null || !object.isDictionary())
continue;
font = (PdfDictionary)object;
if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
&& fontname.equals(font.get(PdfName.FONTNAME))) {
PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
font.put(PdfName.FONTFILE2, objref.getIndirectReference());
}
}
stamper.close();
reader.close();
}
This (without the fontname.equals(font.get(PdfName.FONTNAME)) test) may be a starting point for the simple cases of your task.
You'll have to do quite a lot of tests concerning encoding and add some individual translations for a more generic solution. You may want to study section 9 Text of the PDF specification ISO 32000-1 for this.

Related

How to replace a text in pdf file with ITextPDF library?

I have a requirement to replace a placeholder like ${placeholder} with an actual value, but I could not find any working solution... I've ben following by https://itextpdf.com/en/resources/examples/itext-7/replacing-pdf-objects and it doesn't work. Does anybody know how to do it?
In general, it's not so easy to "replace" the content of a pdf file, since it could have been written in a different way. For example, suppose that you want to replace a chunk "Hello" with a chunk "World". You'd be lucky if "Hello" has been written to a pdf as a whole word. It might have been written as "He" and "llo", or even "o", "l" , "l", "e", "H", and the letters migth be placed in a different parts of the content stream.
However one can remove the content and then place some other content on the same place.
Let's look at how it could be done.
1) I advice you to use iText's pdfSweep, since this tool is able to detect the areas on which the content has been placed and remove the content (it's important to mention that pdfSweep doesn't hide content, it removes it completely)
Please look at the next sample: https://github.com/itext/i7j-pdfsweep/blob/develop/src/test/java/com/itextpdf/pdfcleanup/BigDocumentAutoCleanUpTest.java
Let's discuss redactTonySoprano test. As you can see, one can provide some regexes (for example, ""Tony( |_)Soprano", "Soprano" and "Sopranoes") and iText will redact all the matches of the content.
Then you can just write some text upon these areas using iText either via lowlevel api (PdfCanvas) or via more complex highlevel api (Canvas, etc).
Let's modify the soprano sample I've mentioned before a bit:
2) Let's add some text upon the redacted areas:
for (IPdfTextLocation location : strategy.getResultantLocations()) {
PdfPage page = pdf.getPage(location.getPageNumber()+1);
PdfCanvas pdfCanvas = new PdfCanvas(page.newContentStreamAfter(), page.getResources(), page.getDocument());
Canvas canvas = new Canvas(pdfCanvas, pdf, location.getRectangle());
canvas.add(new Paragraph("SECURED").setFontSize(8));
}
The result is not ideal, but that is just a proof of concept. It's possible to override the extraction strategies and define the font of the redacted content, so that it could be used for the new text to be placed on the redacted area.
Sample code below for replace content in PDF using iText
File dir = new File("./");
File [] files = dir.listFiles(new FilenameFilter() {
#Override
public boolean accept(File dir, String name) {
return name.endsWith(".pdf");
}
});
for (File pdffile : files) {
System.out.println(pdffile.getName());
PdfReader reader = null;
reader = new PdfReader(pdffile.toString());
PdfDictionary dict = reader.getPageN(1);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
PRStream stream = (PRStream)object;
byte[] data = PdfReader.getStreamBytes(stream);
String dd = new String(data);
dd = dd.replace("0 0 0 rg\n()Tj", "0 0 0 rg\n(Plan Advanced Payment)Tj");
System.out.print(dd);
stream.setData(dd.getBytes());
}
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream("./output/"+pdffile.getName())); // output PDF
stamper.close();
reader.close();
}

How to copy/move AcroForm fields from one document to new blank one using IText5 or IText7?

I need to copy whole AcroForm including field positions and values from template PDF to a new blank PDF file. How can I do that?
In short words - I need to get rid of "background" from the template and leave only filed forms.
The whole point of this is to create a PDF with content that would be printed on pre-printed templates.
I am using IText 5 but I can switch to 7 if usefull examples would be provided
After a lot of trial and error I have found the solution to "How to copy AcfroForm fields into another PDF". It is a iText v7 version. I hope it will help somebody someday.
private byte[] copyFormElements(byte[] sourceTemplate) throws IOException {
PdfReader completeReader = new PdfReader(new ByteArrayInputStream(sourceTemplate));
PdfDocument completeDoc = new PdfDocument(completeReader);
ByteArrayOutputStream out = new ByteArrayOutputStream();
PdfWriter offsetWriter = new PdfWriter(out);
PdfDocument offsetDoc = new PdfDocument(offsetWriter);
offsetDoc.initializeOutlines();
PdfPage blank = offsetDoc.addNewPage();
PdfAcroForm originalForm = PdfAcroForm.getAcroForm(completeDoc, false);
// originalForm.getPdfObject().copyTo(offsetDoc,false);
PdfAcroForm offsetForm = PdfAcroForm.getAcroForm(offsetDoc, true);
for (String name : originalForm.getFormFields().keySet()) {
PdfFormField field = originalForm.getField(name);
PdfDictionary copied = field.getPdfObject().copyTo(offsetDoc, false);
PdfFormField copiedField = PdfFormField.makeFormField(copied, offsetDoc);
offsetForm.addField(copiedField, blank);
}
offsetDoc.close();
completeDoc.close();
return out.toByteArray();
}
Did you check the PdfCopyForms object:
Allows you to add one (or more) existing PDF document(s) to create a new PDF and add the form of another PDF document to this new PDF.
I didn't find an example, but you could try something like this:
PdfReader reader1 = new PdfReader(src1); // a document with a form
PdfReader reader2 = new PdfReader(src2); // a document without a form
PdfCopyForms copy = new PdfCopyForms(new FileOutputStream(dest));
copy.AddDocument(reader1); // add the document without the form
copy.CopyDocumentFields(reader2); // add the fields of the document with the form
copy.close();
reader1.close();
reader2.close();
I see that the class is deprecated. I'm not sure of that's because iText 7 makes it much easier to do this, or if it's because there were technical problems with the class.

Remain only used font subsets while splitting PDF in Java

I am using iText to split a PDF document into separate pages as PDF files. Each file seems to be too large as all fonts used in the input PDF are saved into all result pages, which is apparently not very clean.
Code of splitting is as below. Notice PdfSmartCopy and setFullCompression doesn't help to reduce size (which I have no idea why).
public List<byte[]> split(byte[] input) throws IOException, DocumentException {
PdfReader pdfReader = new PdfReader(input);
List<byte[]> pdfFiles = new ArrayList<>();
int pageCount = pdfReader.getNumberOfPages();
int pageIndex = 0;
while (++pageIndex <= pageCount) {
Document document = new Document(pdfReader.getPageSizeWithRotation(pageIndex));
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfCopy pdfCopy = new PdfSmartCopy(document, byteArrayOutputStream);
pdfCopy.setFullCompression();
PdfImportedPage pdfImportedPage = pdfCopy.getImportedPage(pdfReader, pageIndex);
document.open();
pdfCopy.addPage(pdfImportedPage);
document.close();
pdfCopy.close();
pdfFiles.add(byteArrayOutputStream.toByteArray());
}
return pdfFiles;
}
So is there a way in Java (iText or not) to solve these problem?
Update with demo PDF
Here is a 377KB PDF using multiple CJK fonts where any page in it using 1 or 2 fonts. The summary size of sub-PDFs is 1.2MB. Considering CJK fonts are very bloated, I would like to find a way to remove unused font and even remove unused characters in used fonts.
So my idea is to remain only used characters in used fonts and embed them in sub files and then un-embed all other fonts. Any advice?

Extracting an embedded object from a pdf

I had embedded a byte array into a pdf file (Java).
Now I am trying to extract that same array.
The array was embedded as a "MOVIE" file.
I couldn't find any clue on how to do that...
Any ideas?
Thanks!
EDIT
I used this code to embed the byte array:
public static void pack(byte[] file) throws IOException, DocumentException{
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);
document.open();
RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0,0,0,0));
PdfFileSpecification fs
= PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
RichMediaParams flashVars = new RichMediaParams();
instance.setAsset(asset);
configuration.addInstance(instance);
RichMediaActivation activation = new RichMediaActivation();
richMedia.setActivation(activation);
PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
writer.addAnnotation(richMediaAnnotation);
document.close();
I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:
public static final String SRC = "resources/pdfs/image.pdf";
public static final String DEST = "results/parse/stream%s";
public static void main(String[] args) throws IOException {
File file = new File(DEST);
file.getParentFile().mkdirs();
new ExtractStreams().parse(SRC, DEST);
}
public void parse(String src, String dest) throws IOException {
PdfReader reader = new PdfReader(src);
PdfObject obj;
for (int i = 1; i <= reader.getXrefSize(); i++) {
obj = reader.getPdfObject(i);
if (obj != null && obj.isStream()) {
PRStream stream = (PRStream)obj;
byte[] b;
try {
b = PdfReader.getStreamBytes(stream);
}
catch(UnsupportedPdfException e) {
b = PdfReader.getStreamBytesRaw(stream);
}
FileOutputStream fos = new FileOutputStream(String.format(dest, i));
fos.write(b);
fos.flush();
fos.close();
}
}
}
Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:
When I use PdfReader.getStreamBytes(stream), iText will look at the filter. For instance: page content streams consists of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you will get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream) which is also the method you need to get your AVI-bytes from your PDF.
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library
You loop over the page dictionaries, then loop over the /Annots array of this dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.

Remove rectangles from PDF file

I'd like to have a program that removes all rectangles from a PDF file. One use case for this is to unblacken a given PDF file to see if there is any hidden information behind the rectangles. The rest of the PDF file should be kept as-is.
Which PDF library is suitable to this task? In Java, I would like the code to look like this:
PdfDocument doc = PdfDocument.load(new File("original.pdf"));
PdfDocument unblackened = doc.transform(new CopyingPdfVisitor() {
public void visitRectangle(PdfRect rect) {
if (rect.getFillColor().getBrightness() >= 0.1) {
super.visitRectangle(rect);
}
}
});
unblackened.save(new File("unblackened.pdf"));
The CopyingPdfVisitor would copy a PDF document exactly as-is, and my custom code would leave out all the dark rectangles.
Itext pdf library have ways to modify pdf content.
The *ITEXT CONTENTPARSER Example * may give you any idea. "qname" parameter (qualified name) may be used to detected rectangle element.
http://itextpdf.com/book/chapter.php?id=15
Other option, if you want obtain the text on the document use the PdfReaderContentParser to extract text content
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
}
out.flush();
out.close();
reader.close();
}
example at http://itextpdf.com/examples/iia.php?id=277

Categories

Resources