Subsetting OpenType Collection font in pdfbox

Subsetting OpenType Collection font in pdfbox - java

I'm trying to embed a subset of noto-regular in my code. but I keeping on getting:
java.lang.UnsupportedOperationException: OTF fonts do not have a glyf table
at org.apache.fontbox.ttf.OpenTypeFont.getGlyph(OpenTypeFont.java:66)
at org.apache.fontbox.ttf.TTFSubsetter.addCompoundReferences(TTFSubsetter.java:481)
at org.apache.fontbox.ttf.TTFSubsetter.getGIDMap(TTFSubsetter.java:136)
at org.apache.pdfbox.pdmodel.font.TrueTypeEmbedder.subset(TrueTypeEmbedder.java:306)
at org.apache.pdfbox.pdmodel.font.PDType0Font.subset(PDType0Font.java:162)
at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1138)
I downloaded the font file NotoSansCJK-Regular.ttc from https://www.google.com/get/noto/help/cjk/
Font subsetting works for .ttf fonts, as I haven't had any issues if the document I saved contains no special characters.
EDIT
It appears that true type collection fonts can have shared glyf table (makes sense since the font collection contains Japanese glyphs). So the individual PDType0Font parsed from .ttc can't be treated as an individual font.
I loaded the font using:
ttc.processAllFonts((TrueTypeFont ttf) -> {
PDFont font = PDType0Font.load(doc, ttf, true);
fontList.add(font);
});
I'm guessing that there are extra work I needed to do to make this work, but I can't find any code samples anywhere.
EDIT2
Seems like the problem is that when subsetting specific OpenType font files, (which font collection contains) turns on an internal flag isPostScript. The flag is then checked and process is aborted when getGlyph() is called.
The following code generates the glyf table error when creating the pdf documents
// downloaded from Noto project site
String OTF_FILE = "./src/test/resources/NotoSansJP-Regular.otf";
PDDocument doc = new PDDocument();
PDFont otf = null;
try (InputStream inputStream = new FileInputStream(new File(OTF_FILE))) {
otf = PDType0Font.load(doc, new OTFParser().parse(inputStream), true);
PDPage page = new PDPage();
PDPageContentStream stream = new PDPageContentStream(doc, page);
stream.setFont(otf, 10f);
stream.beginText();
stream.newLineAtOffset(100f, 600f);
stream.showText("二ろほス反2化みた大第リきやね景手ハニエ者性ルヤリウ円脱");
stream.endText();
stream.close();
doc.addPage(page);
doc.save("test.pdf");
} catch (IOException iox) {
// failed
}
but it will generate the pdf fine as soon as I set the subsetting parameter to true in the PDType0Font.load call
Similarily if I load the otf font through the collection:
String OTF_FILE = "./src/test/resources/NotoSansCJK-Regular.ttc";
PDDocument doc = new PDDocument();
PDFont otf = null;
try (InputStream inputStream = new FileInputStream(new File(OTF_FILE))) {
TrueTypeCollection ttc = new TrueTypeCollection(inputStream);
otf = PDType0Font.load(doc, ttc.getFontByName("NotoSansCJKjp-Regular"), true);
PDPage page = new PDPage();
PDPageContentStream stream = new PDPageContentStream(doc, page);
stream.setFont(otf, 10f);
stream.beginText();
stream.newLineAtOffset(100f, 600f);
stream.showText("二ろほス反2化みた大第リきやね景手ハニエ者性ルヤリウ円脱");
stream.endText();
stream.close();
doc.addPage(page);
doc.save("test.pdf");
} catch (IOException iox) {
// failed
}
I either need to embed the whole font or subsetting will throw the error
EDIT 3
I ended up circumvent this by downloading the OTF font from "Language-specific OpenType/CFF (OTF)", which contains characters from all 4 regions and converted it using otf2ttf from fonttools

Related

pdfbox: how to load a font once and use it many times?

I am trying to create a lot of pdf files in a loop.
for(int i=0; i<10000; ++i){
PDDocument doc = PDDocument.load(inputstream);
PDPage page = doc.getPage(0);
PDPageContentStream content = new PDPageContentStream(doc, page, PDPageContentStream.AppendMode.APPEND, true, true);
content.beginText();
//what happens here?
PDFont font = PDType0Font.load(doc, Thread.currentThread().getContextClassLoader().getResourceAsStream("font/simsun.ttf") );
content.setFont(font, 10);
//...
doc.save(outstream);
doc.close();
}
what does it happen by calling PDType0Font.load... ? Because the ttf file is large (10M), will it create ephemeral big objects of font 10000 times? If so, is there a way to make the font as embedded as PDType1Font, so I can just load it once and use it many times in the loop?
I encountered a full GC problem here, and I'm trying to figure it out.

Create the font at the fontbox level:
TrueTypeFont ttf = new TTFParser().parse(...);
You can now reuse ttf in different PDDocument objects like this:
PDFont font = PDType0Font.load(doc, ttf, true);
When done with all documents, don't forget to close ttf.
See also PDFontTest.testPDFBox3826() in the source code.

PDFBox incorrect text appearance after copy/paste

I’m using PDFBox 2.0.4 to create PDF documents with acroForms. Here is my test code example:
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
String dir = "../testPdfBox/src/main/resources/fonts/";
PDType0Font font = PDType0Font.load(document, new File(dir + "Roboto-Regular.ttf"));
PDResources resources = new PDResources();
String fontName = resources.add(font).getName();
acroForm.setDefaultResources(resources);
String defaultAppearanceString = format("/%s 12 Tf 0 g", fontName);
acroForm.setDefaultAppearance(defaultAppearanceString);
PDTextField field = new PDTextField(acroForm);
field.setPartialName("SampleField");
field.setDefaultAppearance(defaultAppearanceString);
acroForm.getFields().add(field);
PDAnnotationWidget widget = field.getWidgets().get(0);
PDRectangle rect = new PDRectangle(50, 750, 200, 50);
widget.setRectangle(rect);
widget.setPage(page);
widget.setPrinted(true);
page.getAnnotations().add(widget);
field.setValue("Sample field 123456");
acroForm.flatten();
document.save("target/SimpleForm.pdf");
document.close();
Everything works fine. But when I try to copy text from the created document and paste it to the NotePad or Word it becomes squares.
􀀷􀁅􀁑􀁔􀁐􀁉􀀄􀁊􀁍􀁉􀁐􀁈􀀄􀀕􀀖􀀗􀀘􀀙􀀚
I search a lot about this problem. The most popular answer is that there is no toUnicode cmap in created PDF. So I explore my document with CanOpener for Acrobat:
Yes, there is no toUnicode cmap, but everything works properly, if not to use acroForm.flatten(). When form fields are not flattened, I can copy/paste text from the document and it looks correct. Nevertheless I need all fields to be flattened.
So, I have two questions:
Why there is a problem with copy/pasting text in flattened form, and everything is ok in non-flattened?
What can I do to avoid problem with text copy/pasting?
Is there only one solution - to create toUnicode CMap by my own, like in this example?
My test pdf files are available here.

Please replace
PDType0Font font = PDType0Font.load(document, new File(dir + "Roboto-Regular.ttf"));
with
PDType0Font font = PDType0Font.load(document, new FileInputStream(dir + "Roboto-Regular.ttf"), false);
This makes sure that the font is embedded in full and not just as a subset.

PDFBox leaves font files open after generating PDF

I am experiencing problems with TTF fonts loaded for generating PDFs which eventually result in too many open files on Linux. The PDFBox version I tested last was 2.0.0-RC3.
Basically for each PDF I create a document and load two fonts which I want to use. After the document is generated I close all the resources, but the file descriptors for both fonts remain open.
My question is how do I close these two files?
This is my basic code:
doc = new PDDocument();
page = new PDPage(PDRectangle.A4);
doc.addPage(page);
PDFont font = PDType0Font.load(doc, new File(settings.getProperty("font.location")));
PDFont boldFont = PDType0Font.load(doc, new File(settings.getProperty("bold.font.location")));
PDPageContentStream content = new PDPageContentStream(doc, page);
// add content stuff
content.close();
bos = new ByteArrayOutputStream();
doc.save(bos);
bos.flush();
byte[] bytes = bos.toByteArray();
doc.close();

Adding Header to existing PDF File using PDFBox

I am trying to add a Header to an existing PDF file. It works but the table header in the existing PDF are messed up by the change in the font. If I remove setting the font then the header doesn't show up. Here is my code:
// the document
PDDocument doc = null;
try
{
doc = PDDocument.load( file );
List allPages = doc.getDocumentCatalog().getAllPages();
//PDFont font = PDType1Font.HELVETICA_BOLD;
for( int i=0; i<allPages.size(); i++ )
{
PDPage page = (PDPage)allPages.get( i );
PDRectangle pageSize = page.findMediaBox();
PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true,true);
PDFont font = PDType1Font.TIMES_ROMAN;
float fontSize = 15.0f;
contentStream.beginText();
// set font and font size
contentStream.setFont( font, fontSize);
contentStream.moveTextPositionByAmount(700, 1150);
contentStream.drawString( message);
contentStream.endText();
//contentStream.
contentStream.close();}
doc.save( outfile );
}
finally
{
if( doc != null )
{
doc.close();
}
}
}`

Essentially you are running into a PDFBox bug in the current version 1.8.2.
A workaround:
Add a getFonts call of the page resources after creating the new content stream before using a font:
PDPage page = (PDPage)allPages.get( i );
PDRectangle pageSize = page.findMediaBox();
PDPageContentStream contentStream = new PDPageContentStream(doc, page, true, true,true);
page.getResources().getFonts(); // <<<<<<<<
PDFont font = PDType1Font.TIMES_ROMAN;
float fontSize = 15.0f;
contentStream.beginText();
The bug itself:
The bug is in the method PDResources.addFont which is called from PDPageContentStream.setFont:
public String addFont(PDFont font)
{
return addFont(font, MapUtil.getNextUniqueKey( fonts, "F" ));
}
It uses the current content of the fonts member variable to determine a unique name for the new font resource on the page at hand. Unfortunately this member variable still can be (and in your case is) uninitialized at this time. This results in the MapUtil.getNextUniqueKey( fonts, "F" ) call to always return F0.
The font variable then is initialized implicitly during the addFont(PDFont, String) call later.
Thus, if unfortunately there already existed a font named F0 on that page, it is replaced by the new font.
Having tested with your PDF this is exactly what happens in your case. As the existing font F0 uses some custom encoding while your replacement font uses a standard one, the text originally written using F0 now looks like gibberish.
The work-around mentioned above implicitly initializes that member variable and, thus, prevents the font replacement.
If you plan to use PDFBox in production for this task, you might want to report the bug.
PS: As mentioned in the comments above there is another bug to observe in context with inherited resources. It should be brought to the PDFBox development's attention, too.
PPS: The issue at hand meanwhile has been fixed in PDFBox for versions 1.8.3 and 2.0.0, cf. PDFBOX-1753.

Cannot figure out how to use PDFBox

I am trying to create a PDF file with a lot of text boxes in the document and textfields from another class. I am using PDFBox.
OK, creating a new file is easy and writing one line of text is easy. Now, when I am trying to insert the next text line or textfield, it overwrites the content.
PDDocument doc = null;
PDPage page = null;
try{
doc = new PDDocument();
page = new PDPage();
doc.addPage(page);
PDFont font = PDType1Font.HELVETICA_BOLD;
PDPageContentStream title = new PDPageContentStream(doc, page);
title.beginText();
title.setFont( font, 14 );
title.moveTextPositionByAmount( 230, 720 );
title.drawString("DISPATCH SUMMARY");
title.endText();
title.close();
PDPageContentStream title1 = new PDPageContentStream(doc, page);
title1.beginText();
title1.setFont( font, 11 );
title1.moveTextPositionByAmount( 30, 620 );
title1.drawString("DEPARTURE");
title1.endText();
title1.close();
doc.save("PDFWithText.pdf");
doc.close();
} catch (Exception e){
System.out.println(e);
}
It does give me an error: "You are overwriting an existing content, you should use the append mode".
So I am trying title1.appendRawCommands(String), but it is not working.
How would I add new text boxes and textfields (from another class)? I have read tens of tutorials on Internet, but they only show creating one line.

PDPageContentStream title1 = new PDPageContentStream(doc, page, true, true);
OP posted this as the answer, so this will flag to the system that there was an answer
Furthermore, if the first content stream contains operations substantially changing the graphics state, e.g. by changing the current transformation matrix, and one wants the new content stream to start with these changes reverted, one should use the constructor with three boolean parameters:
PDPageContentStream title1 = new PDPageContentStream(doc, page, true, true, true);

This implementation is deprecated.
PDPageContentStream title1 = new PDPageContentStream(doc, page, true, true);
The new implementation would be
PDPageContentStream title1 = new PDPageContentStream(doc, page, PDPageContentStream.AppendMode.OVERWRITE, true);

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Subsetting OpenType Collection font in pdfbox - java

Related

pdfbox: how to load a font once and use it many times?

PDFBox incorrect text appearance after copy/paste

PDFBox leaves font files open after generating PDF

Adding Header to existing PDF File using PDFBox

Cannot figure out how to use PDFBox

Categories

Resources