PDFBox incorrect text appearance after copy/paste

I’m using PDFBox 2.0.4 to create PDF documents with acroForms. Here is my test code example:
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
String dir = "../testPdfBox/src/main/resources/fonts/";
PDType0Font font = PDType0Font.load(document, new File(dir + "Roboto-Regular.ttf"));
PDResources resources = new PDResources();
String fontName = resources.add(font).getName();
acroForm.setDefaultResources(resources);
String defaultAppearanceString = format("/%s 12 Tf 0 g", fontName);
acroForm.setDefaultAppearance(defaultAppearanceString);
PDTextField field = new PDTextField(acroForm);
field.setPartialName("SampleField");
field.setDefaultAppearance(defaultAppearanceString);
acroForm.getFields().add(field);
PDAnnotationWidget widget = field.getWidgets().get(0);
PDRectangle rect = new PDRectangle(50, 750, 200, 50);
widget.setRectangle(rect);
widget.setPage(page);
widget.setPrinted(true);
page.getAnnotations().add(widget);
field.setValue("Sample field 123456");
acroForm.flatten();
document.save("target/SimpleForm.pdf");
document.close();
Everything works fine, but when I copy text from the created document and paste it into Notepad or Word, it turns into squares:
􀀷􀁅􀁑􀁔􀁐􀁉􀀄􀁊􀁍􀁉􀁐􀁈􀀄􀀕􀀖􀀗􀀘􀀙􀀚
I searched a lot about this problem. The most common answer is that the created PDF has no ToUnicode CMap. So I explored my document with CanOpener for Acrobat.
Indeed, there is no ToUnicode CMap, but everything works properly if I do not call acroForm.flatten(). When the form fields are not flattened, I can copy and paste text from the document and it looks correct. Nevertheless, I need all fields to be flattened.
So, I have two questions:
Why is there a problem with copying and pasting text from a flattened form, while everything is fine in a non-flattened one?
What can I do to avoid the copy/paste problem?
Is the only solution to create the ToUnicode CMap on my own, like in this example?
My test pdf files are available here.

Please replace
PDType0Font font = PDType0Font.load(document, new File(dir + "Roboto-Regular.ttf"));
with
PDType0Font font = PDType0Font.load(document, new FileInputStream(dir + "Roboto-Regular.ttf"), false);
Passing false as the third argument (embedSubset) makes sure that the font is embedded in full and not just as a subset.
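The squares in the pasted text can be examined directly. The following is a minimal sketch, using only the JDK, that lists the code points of a pasted string and reports whether they all fall in a Unicode private use area, which would suggest that the viewer found no usable Unicode mapping for the glyphs. The class name PastedTextInspector and the sample character are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PastedTextInspector {

    /** Returns one "U+XXXX BLOCK_NAME" entry per code point of the input. */
    public static List<String> describe(String pasted) {
        List<String> out = new ArrayList<>();
        pasted.codePoints().forEach(cp ->
            out.add(String.format("U+%04X %s", cp, Character.UnicodeBlock.of(cp))));
        return out;
    }

    /** True if the input is non-empty and consists only of private use code points. */
    public static boolean isAllPrivateUse(String pasted) {
        return !pasted.isEmpty()
            && pasted.codePoints().allMatch(cp -> Character.getType(cp) == Character.PRIVATE_USE);
    }

    public static void main(String[] args) {
        // Hypothetical pasted character U+100037, written as a surrogate pair
        String sample = "\uDBC0\uDC37";
        System.out.println(describe(sample));
        System.out.println(isAllPrivateUse(sample));
        System.out.println(isAllPrivateUse("Sample field 123456"));
    }
}
```

If isAllPrivateUse returns true for text copied out of a PDF, the text layer most likely carries glyph codes rather than characters, which is consistent with a missing ToUnicode CMap.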

Related

What is the proper way to add rich text annotation in PDFBox?

I add three annotations into empty PDF:
call .setContents("...")
call .setRichContents("...");
call .setContents("..."); and .setRichContents("...");
The first annotation displays properly in Adobe Reader and in Preview (on Mac).
The second annotation displays properly in Adobe Reader only (as formatted text) and as an empty box in Preview.
The third annotation displays as plain text ("Simple text content for Test Case 3.", the value passed to .setContents) in both Adobe Reader and Preview, not as the rich text passed to .setRichContents (the /RC entry in the PDF).
The comment text in my PDF contains rich formatted elements, and I need to show them in annotations. I assume that Preview doesn't support rich text in annotations.
I've tried re-saving the PDF in Adobe Reader and then opening it in Preview. After that, I see all comments in Adobe Reader (with rich text) and in Preview (as unformatted text).
Question: how can I show rich formatted annotations in Adobe Reader and plain text in Preview?
My code:
PDDocument document = new PDDocument();
PDPage blankPage = new PDPage();
document.addPage(blankPage);
List<PDAnnotation> annotations = document.getPage(0).getAnnotations();
float pageHeight = document.getPage(0).getCropBox().getHeight();
PDAnnotationText text_TC1 = new PDAnnotationText();
PDRectangle positionTC1 = new PDRectangle();
positionTC1.setLowerLeftX(10);
positionTC1.setLowerLeftY(pageHeight - 10);
positionTC1.setUpperRightX(20);
positionTC1.setUpperRightY(pageHeight - 20);
text_TC1.setContents("Simple text content for Test Case 1.");
text_TC1.setRectangle(positionTC1);
text_TC1.setOpen(true);
text_TC1.constructAppearances();
annotations.add(text_TC1);
PDAnnotationText text_TC2 = new PDAnnotationText();
PDRectangle positionTC2 = new PDRectangle();
positionTC2.setLowerLeftX(10);
positionTC2.setLowerLeftY(pageHeight - 110);
positionTC2.setUpperRightX(20);
positionTC2.setUpperRightY(pageHeight - 120);
text_TC2.setRichContents("<body xmlns=\"http://www.w3.org/1999/xhtml\">" +
"<p><span style=\"font-weight:bold\">Rich</span> <span style=\"font-style:italic\">text content</span> for Test Case 2.</p>" +
"</body>");
text_TC2.setRectangle(positionTC2);
text_TC2.setOpen(true);
text_TC2.constructAppearances();
annotations.add(text_TC2);
PDAnnotationText text_TC3 = new PDAnnotationText();
PDRectangle positionTC3 = new PDRectangle();
positionTC3.setLowerLeftX(10);
positionTC3.setLowerLeftY(pageHeight - 210);
positionTC3.setUpperRightX(20);
positionTC3.setUpperRightY(pageHeight - 220);
text_TC3.setContents("Simple text content for Test Case 3.");
text_TC3.setRichContents("<body xmlns=\"http://www.w3.org/1999/xhtml\">" +
"<p><span style=\"font-weight:bold\">Rich</span> <span style=\"font-style:italic\">text content</span> for Test Case 3.</p>" +
"</body>");
text_TC3.setRectangle(positionTC3);
text_TC3.setOpen(true);
text_TC3.constructAppearances();
annotations.add(text_TC3);
document.save("test_so.pdf");
document.close();
Update 2022-05-08: I've found a workaround solution - add comments via import XFDF:
PDDocument document = new PDDocument();
PDPage blankPage = new PDPage();
document.addPage(blankPage);
FDFDocument fdfDoc = FDFDocument.loadXFDF("test.xfdf");
List<FDFAnnotation> fdfAnnots = fdfDoc.getCatalog().getFDF().getAnnotations();
List<PDAnnotation> pdfannots = new ArrayList<>();
for (int i=0; i<fdfDoc.getCatalog().getFDF().getAnnotations().size(); i++) {
FDFAnnotation fdfannot = fdfAnnots.get(i);
PDAnnotation pdfannot = PDAnnotation.createAnnotation(fdfannot.getCOSObject());
pdfannots.add(pdfannot);
}
document.getPage(0).setAnnotations(pdfannots);
fdfDoc.close();
document.save("test.pdf");
document.close();

Subsetting OpenType Collection font in pdfbox

I'm trying to embed a subset of noto-regular in my code, but I keep getting:
java.lang.UnsupportedOperationException: OTF fonts do not have a glyf table
at org.apache.fontbox.ttf.OpenTypeFont.getGlyph(OpenTypeFont.java:66)
at org.apache.fontbox.ttf.TTFSubsetter.addCompoundReferences(TTFSubsetter.java:481)
at org.apache.fontbox.ttf.TTFSubsetter.getGIDMap(TTFSubsetter.java:136)
at org.apache.pdfbox.pdmodel.font.TrueTypeEmbedder.subset(TrueTypeEmbedder.java:306)
at org.apache.pdfbox.pdmodel.font.PDType0Font.subset(PDType0Font.java:162)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1138)
I downloaded the font file NotoSansCJK-Regular.ttc from https://www.google.com/get/noto/help/cjk/
Font subsetting works for .ttf fonts, as I haven't had any issues if the document I saved contains no special characters.
EDIT
It appears that true type collection fonts can have shared glyf table (makes sense since the font collection contains Japanese glyphs). So the individual PDType0Font parsed from .ttc can't be treated as an individual font.
I loaded the font using:
ttc.processAllFonts((TrueTypeFont ttf) -> {
PDFont font = PDType0Font.load(doc, ttf, true);
fontList.add(font);
});
I'm guessing there is extra work I need to do to make this work, but I can't find any code samples anywhere.
EDIT2
It seems the problem is that certain OpenType font files (including the ones the font collection contains) turn on an internal flag, isPostScript. When such a font is subset, the flag is checked as soon as getGlyph() is called, and the process is aborted.
The following code generates the glyf table error when creating the PDF document:
// downloaded from Noto project site
String OTF_FILE = "./src/test/resources/NotoSansJP-Regular.otf";
PDDocument doc = new PDDocument();
PDFont otf = null;
try (InputStream inputStream = new FileInputStream(new File(OTF_FILE))) {
otf = PDType0Font.load(doc, new OTFParser().parse(inputStream), true);
PDPage page = new PDPage();
PDPageContentStream stream = new PDPageContentStream(doc, page);
stream.setFont(otf, 10f);
stream.beginText();
stream.newLineAtOffset(100f, 600f);
stream.showText("二ろほス反2化みた大第リきやね景手ハニエ者性ルヤリウ円脱");
stream.endText();
stream.close();
doc.addPage(page);
doc.save("test.pdf");
} catch (IOException iox) {
// failed
}
However, it generates the PDF fine as soon as I set the subsetting parameter to false in the PDType0Font.load call.
Similarly, if I load the OTF font through the collection:
String OTF_FILE = "./src/test/resources/NotoSansCJK-Regular.ttc";
PDDocument doc = new PDDocument();
PDFont otf = null;
try (InputStream inputStream = new FileInputStream(new File(OTF_FILE))) {
TrueTypeCollection ttc = new TrueTypeCollection(inputStream);
otf = PDType0Font.load(doc, ttc.getFontByName("NotoSansCJKjp-Regular"), true);
PDPage page = new PDPage();
PDPageContentStream stream = new PDPageContentStream(doc, page);
stream.setFont(otf, 10f);
stream.beginText();
stream.newLineAtOffset(100f, 600f);
stream.showText("二ろほス反2化みた大第リきやね景手ハニエ者性ルヤリウ円脱");
stream.endText();
stream.close();
doc.addPage(page);
doc.save("test.pdf");
} catch (IOException iox) {
// failed
}
So I either need to embed the whole font, or subsetting throws the error.
EDIT 3
I ended up circumventing this by downloading the OTF font from "Language-specific OpenType/CFF (OTF)", which contains characters from all 4 regions, and converting it using otf2ttf from fonttools.
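Whether a font file carries TrueType (glyf) or CFF outlines can be told from its first four bytes, so a check like the one below could be used to decide up front whether to request subsetting. This is a minimal sketch using only the JDK; the class name SfntSniffer and the enum are hypothetical. It relies on the tags defined by the OpenType format: "OTTO" for CFF outlines, 0x00010000 or "true" for TrueType outlines, and "ttcf" for a collection. For a collection, each member font has its own version tag at its own offset, which this sketch does not follow.

```java
import java.nio.charset.StandardCharsets;

class SfntSniffer {

    enum Kind { TRUETYPE_OUTLINES, CFF_OUTLINES, COLLECTION, UNKNOWN }

    /** Classifies a font file by the first four bytes of its header. */
    public static Kind sniff(byte[] header) {
        if (header == null || header.length < 4) {
            return Kind.UNKNOWN;
        }
        String tag = new String(header, 0, 4, StandardCharsets.ISO_8859_1);
        if (tag.equals("OTTO")) {
            return Kind.CFF_OUTLINES;      // no glyf table, TTFSubsetter cannot handle it
        }
        if (tag.equals("ttcf")) {
            return Kind.COLLECTION;        // members must be inspected individually
        }
        boolean version1 = header[0] == 0 && header[1] == 1 && header[2] == 0 && header[3] == 0;
        if (version1 || tag.equals("true")) {
            return Kind.TRUETYPE_OUTLINES; // glyf-based, subsetting is expected to work
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(sniff("OTTO".getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(sniff(new byte[] {0, 1, 0, 0}));
        System.out.println(sniff("ttcf".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

With such a check, the embedSubset argument of PDType0Font.load could be set to true only for TRUETYPE_OUTLINES and to false otherwise, at the cost of a larger output file.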

write on existing form-pdf with pdfbox

I am relatively new to Java and I want to replace an existing iText based Javascript with pdfbox. (Java 2.0)
I have a PDF form sheet (but this sheet has no AcroForm entries) and I want to fill it with information (name, birth date and so on). The PDF has a special rectangular size (like a contact card).
My code so far:
File file = new File("ToBeFilled.pdf");
PDDocument document = PDDocument.load(file);
System.out.println("PDF loaded");
//Retrieving the page
PDPage page = (PDPage)document.getPages().get( 0 );
PDFont font = PDType1Font.HELVETICA_BOLD;
PDPageContentStream content = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true);
content.beginText();
//Setting the font to the Content stream
content.setFont(font, 30);
//Setting the position for the line (float x, float y), (0,0) = lower left corner
content.newLineAtOffset(100, 400);
String text = "This is the sample document and we are adding content to it.";
String text1 = "This is an example of adding text to a page in the pdf document. we can add as many lines";
String text2 = "as we want like this using the ShowText() method of the ContentStream class";
//Adding text in the form of string
content.showText(text);
//Adding text in the form of string
content.newLine();
content.showText(text1);
content.newLine();
content.showText(text2);
//Ending the content stream
content.endText();
System.out.println("Text added");
content.close();
//Saving the document
document.save("newPrint.pdf");
//Closing the document
document.close();
The text does not show. What am I missing here? I thought with the correct text-positions I could simply write on the pdf?

The source code works.
Maybe the offset in content.newLineAtOffset(100, 400); is too large, that is, out of bounds for your small card.
By the way, you have to call setLeading(float) for newLine() to have a meaningful effect.
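Both points can be illustrated with plain arithmetic. In a PDF content stream, the operator behind newLine() moves the text position down by the current leading, and the leading is 0 until it is set, so every line lands on the same baseline. The sketch below uses only the JDK; the class name LeadingSketch and the card size of 243 x 153 points are assumptions made for illustration.

```java
class LeadingSketch {

    /** Baseline y of each line when every newLine() moves down by the leading. */
    public static float[] baselines(float startY, float leading, int lines) {
        float[] ys = new float[lines];
        for (int i = 0; i < lines; i++) {
            ys[i] = startY - i * leading;
        }
        return ys;
    }

    /** True if the point (x, y) lies within a page of the given size. */
    public static boolean isOnPage(float x, float y, float pageWidth, float pageHeight) {
        return x >= 0 && x <= pageWidth && y >= 0 && y <= pageHeight;
    }

    public static void main(String[] args) {
        // Without setLeading, all three lines share the baseline y = 400
        System.out.println(java.util.Arrays.toString(baselines(400f, 0f, 3)));
        // With a leading of 14.5, each line is placed 14.5 points lower
        System.out.println(java.util.Arrays.toString(baselines(400f, 14.5f, 3)));
        // Hypothetical card of 243 x 153 points: the offset (100, 400) is above the page
        System.out.println(isOnPage(100f, 400f, 243f, 153f));
    }
}
```

In PDFBox, the page size to compare against would come from page.getMediaBox(), and the leading would be set with content.setLeading(...) before the first call to newLine().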

Write cyrillic chars into PDF form fields with PDFBox

I am using pdfbox 2.0.5 to fill out form fields of a PDF document using this code:
doc = PDDocument.load(inputStream);
PDDocumentCatalog catalog = doc.getDocumentCatalog();
PDAcroForm form = catalog.getAcroForm();
for (PDField field : form.getFieldTree()){
field.setValue("должен");
}
I get this error: U+0434 ('afii10069') is not available in this font Times-Roman (generic: TimesNewRomanPSMT) encoding: StandardEncoding with differences
The PDF document itself contains cyrillic text which is displayed fine. I have tried using different fonts. For "Arial Unicode MS" it wants to download a 50MB "Adobe Acrobat Reader DC Font Pack". Is this a requirement for cyrillic characters?
Which font do I have to specify in the text field to handle cyrillic (or asian) characters?
Thanks,
Ropo

Adobe handles that by reusing the embedded font file in the {/Ubuntu} font and creates a new font resource from that. Here is a quick hack which can serve as a guide of how to achieve something similar. The code is specific to a sample I've got.
PDDocument doc = PDDocument.load(new File(...));
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDResources formResources = acroForm.getDefaultResources();
PDTrueTypeFont font = (PDTrueTypeFont) formResources.getFont(COSName.getPDFName("Ubuntu"));
// here is the 'magic' to reuse the font as a new font resource
TrueTypeFont ttFont = font.getTrueTypeFont();
PDFont font2 = PDType0Font.load(doc, ttFont, true);
ttFont.close();
formResources.put(COSName.getPDFName("F0"), font2);
PDTextField formField = (PDTextField) acroForm.getField("Text2");
formField.setDefaultAppearance("/F0 0 Tf 0 g");
formField.setValue("öäüинформацию");
doc.save(...);
doc.close();
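Before setting a value, it can help to know whether the text could be shown by a simple single-byte font at all, or whether a Type 0 font such as the one loaded above is needed. The following is a rough sketch using only the JDK. It uses ISO-8859-1 as a stand-in for the single-byte encodings of fonts like Times-Roman, which is an assumption: the actual encoding of a field's font may differ, and this is not the check PDFBox performs internally. The class name SimpleFontCheck is hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class SimpleFontCheck {

    /** True if every character of the text can be encoded in ISO-8859-1. */
    public static boolean fitsSingleByteLatin(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        String umlauts = "\u00F6\u00E4\u00FC";                      // "öäü"
        String cyrillic = "\u0434\u043E\u043B\u0436\u0435\u043D";   // "должен"
        System.out.println(fitsSingleByteLatin(umlauts));
        System.out.println(fitsSingleByteLatin(cyrillic));
    }
}
```

When the check fails, as it does for the Cyrillic value, the field needs a font that covers those characters, for example a PDType0Font referenced from the default appearance string.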

The solution was trivial:
form.setNeedAppearances(true);
And then I remove the blue box of the field with:
field.setReadOnly(true);

How to add text watermark to pdf in Java using Apache PDFBox?

I can't find any tutorial for adding a text watermark to a PDF file. Can you please guide me? I am very new to PDFBox.
It's not a duplicate; the link in the comment didn't help me. I want to add text, not an image, to the PDF.

Here is an example using PDFBox 2.0.2. It loads a PDF and writes some text at the bottom left corner (the page origin) in semi-transparent red. For a multi-page PDF, the watermark appears on every page. It might not be production ready, as I am not sure whether additional null conditions need to be checked, but it should get you going in the right direction.
Keep in mind that this block of code does not modify the original PDF; it creates a new PDF named Tmp_(filename) as the output.
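import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState;
import org.apache.pdfbox.util.Matrix;

private static void watermarkPDF(File fileStored) throws IOException {
    // The output is written next to the input as Tmp_<filename>
    File tmpPDF = new File(fileStored.getParent(), "Tmp_" + fileStored.getName());
    try (PDDocument doc = PDDocument.load(fileStored)) {
        for (PDPage page : doc.getPages()) {
            // Append to the existing page content; the last flag resets the graphics context
            try (PDPageContentStream cs = new PDPageContentStream(doc, page, AppendMode.APPEND, true, true)) {
                String ts = "Some sample text";
                PDFont font = PDType1Font.HELVETICA_BOLD;
                float fontSize = 14.0f;
                // 50% opacity for fills, which includes the text
                PDExtendedGraphicsState r0 = new PDExtendedGraphicsState();
                r0.setNonStrokingAlphaConstant(0.5f);
                cs.setGraphicsStateParameters(r0);
                cs.setNonStrokingColor(255, 0, 0); // red
                cs.beginText();
                cs.setFont(font, fontSize);
                cs.setTextMatrix(Matrix.getTranslateInstance(0f, 0f));
                cs.showText(ts);
                cs.endText();
            } // the content stream is closed here, once per page
        }
        doc.save(tmpPDF);
    }
}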
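The code above places the text at the origin (0, 0), which is the bottom left corner of the page. To put it in the bottom right corner instead, the x offset has to account for the width of the text. The sketch below shows only that arithmetic, using the JDK alone; the class name WatermarkPosition, the margin parameter and the sample numbers are hypothetical. It assumes the string width is given in 1/1000 text-space units, which is the unit PDFBox's PDFont.getStringWidth reports.

```java
class WatermarkPosition {

    /** Width in points of a string whose width is given in 1/1000 text-space units. */
    public static float textWidth(float widthUnits, float fontSize) {
        return widthUnits / 1000f * fontSize;
    }

    /** X offset that right-aligns the text against the page edge, minus a margin. */
    public static float bottomRightX(float pageWidth, float widthUnits, float fontSize, float margin) {
        return pageWidth - textWidth(widthUnits, fontSize) - margin;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: page 600 pt wide, string 8000 units wide, 14 pt font, 10 pt margin
        System.out.println(bottomRightX(600f, 8000f, 14f, 10f));
    }
}
```

In the watermark method, widthUnits would come from font.getStringWidth(ts) and pageWidth from page.getMediaBox().getWidth(); the result would then replace the first 0f in Matrix.getTranslateInstance, with the margin as the y offset.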
