I wrote some code in Java using the PDFBox API that splits a PDF document into its individual pages, looks through the pages for a specific string, and then makes a new PDF from the page the string appears on. My problem is that when the new page is saved, I lose my font. I made a quick Word document to test it, and the default font was Calibri, so when I run the program I get an error box that reads: "Cannot extract the embedded font..." The font then gets replaced with some other default.
I have seen a lot of example code that shows how to change the font when you are adding text to a PDF, but nothing that sets the font for an existing page.
If anyone is familiar with a way to do this, (or can find documentation/examples), I would greatly appreciate it!
Edit: I forgot to include some sample code:
if (pageContent.indexOf(findThis) >= 0) {
    PDPage pageToRip = pages.get(i);
    // set the font of pageToRip here
    res.importPage(pageToRip); // res is the new document that will be saved
}
I don't know if that helps any, but I figured I'd include it.
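For reference, the page-selection logic around that snippet can be sketched in a self-contained way. The page texts here stand in for what PDFTextStripper would extract per page, and findMatchingPages is a hypothetical helper for this sketch, not part of the PDFBox API:

```java
import java.util.ArrayList;
import java.util.List;

public class PageFinder {

    // Returns the zero-based indices of the pages whose extracted text
    // contains the search string. In the real program, each entry of
    // pageTexts would come from PDFTextStripper, and each matching index i
    // would lead to res.importPage(pages.get(i)) as in the snippet above.
    static List<Integer> findMatchingPages(List<String> pageTexts, String findThis) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).indexOf(findThis) >= 0) {
                matches.add(i);
            }
        }
        return matches;
    }
}
```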
Also, this is what the change looks like if the PDF is written in Calibri and split:
Note: This might be a nonissue; it depends on the fonts used in the files that will need to be processed. I tried some fonts besides Calibri and they worked out fine.
From How to extract fonts from a PDF:
You actually cannot extract a font from a PDF, not even if the font is fully embedded. There are two reasons why this is not feasible:
• Most fonts are copyrighted, making it illegal to use an extractor.
• When a font is embedded in a PDF, not all of the font data are included. Obviously the font outline data are included, as well as the font width tables. Other information, such as data about ligatures, is irrelevant within the PDF, so those data do not get enclosed in a PDF.
I am not aware of any font extraction tools, but if you come across one, the above reasons should make it clear that these utilities are to be avoided.
Related
The PDFlib example search and replace text copies pages and pastes rectangles and text.
Instead of loading a font from my hard disk (like it is done in the example with int font = p.load_font(REPLACEMENT_FONT, "unicode", "");) I'd like to use the original font from the source document.
How can I achieve this?
What I tried is this:
When using int font = 0 (which is equivalent to the value of tet.fontid in line 244), PDFlib throws an exception like this:
com.pdflib.PDFlibException: Option 'font' has bad font handle 0
at com.pdflib.pdflib.PDF_fit_textline(Native Method)
at com.pdflib.pdflib.fit_textline(pdflib.java:1086)
What could work (but what I'm also not able to get running):
Maybe I could read the fonts in the target document. Reading the fonts in the source document is feasible with (int) lib.pcos_get_number(pdiHandle, "length:fonts");. However, trying to read the fonts in the target document with (int) lib.pcos_get_number(outputPdfHandle, "length:fonts"); (with outputPdfHandle = p.begin_document(outfilename, "") from example line 560) throws this exception:
com.pdflib.PDFlibException: Handle parameter or option of type 'PDI document' has bad value 1
at com.pdflib.pdflib.PDF_pcos_get_number(Native Method)
at com.pdflib.pdflib.pcos_get_number(pdflib.java:1539)
It is not possible to use a font from a document imported via PDI to create text in an output document. In theory, the idea of accessing the font data from the input document via pCOS functions sounds attractive. One could think that it should be possible to reassemble the font data into, for example, a valid TrueType font that could then be loaded via the PDFlib load_font() function.
But that is not possible for the following reasons:
The font data that is stored in a PDF document is not the complete data that is stored in a TrueType font. Important TrueType tables are missing and cannot be reconstructed from the font data in the PDF file.
A font in a PDF file is almost always a subset that contains only the glyphs that are actually used in the document. So even if it were possible to use the font data from the input document, you could use only glyphs from the subset to create new text in an output document.
Also the fontid value provided by TET cannot be used as a font handle when creating new output via PDFlib. The fontid value is the index in the pCOS pseudo object array fonts[], and it is totally unrelated to any handles used to create new output via the PDFlib API.
I have some input PDFs, all with fully embedded fonts, and I want to "shrink" them by creating font subsets. I know there is a way to unembed fonts and embed subset fonts, but the problem is that I don't have the source files for the fonts. I only have the fonts embedded in the source PDFs.
Can someone help me troubleshoot this issue?
ENV: java8, itext7.1.5
Here's a thread on a similar question (about embedding, not subsetting, despite the OP's question): How to subset fonts into an existing PDF file. The following statement is relevant:
If you want to subset it, you'd need to parse all the content streams
in the PDF to find out which glyphs are used. That's NOT a trivial
task.
I wouldn't recommend attempting this in iText unless it's really necessary. It would likely end up buggy unless you have a very complete understanding of the PDF specs. It might be worth pursuing other avenues, such as changing the way the PDFs are created, or using something like Distiller that can do this for you.
If you do want to do this in iText, I'm afraid you will likely have to use a PdfCanvasProcessor and some custom operator handlers. You would need to find all text fields, determine which font they use, build a new subset font with the applicable glyphs, and replace the fonts with new subset copies. This is how you would create a copy of the complete font to prepare for subsetting (assuming you don't have copies of the font files):
String encoding = PdfEncodings.WINANSI; // or another encoding if needed for more glyph support
PdfFont completeFont = ...; // get complete font from font dictionary
PdfFont subsetFont = PdfFontFactory.createFont(completeFont.getFontProgram(), encoding, true);
subsetFont.setSubset(true);
When you encounter a Font change operator (Tf), you would need to look up that font in the font dictionary and create a new (or lookup an already created) subset copy of that font to prepare for upcoming text fields. Don't forget to keep the font in a stack so you can pop back to the previous font (look for q and Q operators). And don't forget to check parent forms and page groups for the fonts if they don't exist in the current XObject or page resource dictionary.
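As a rough, library-free illustration of that stack bookkeeping, here is a sketch using simplified tokens instead of real content-stream operators ("Tf:<name>" sets the font, "q" pushes the graphics state, "Q" pops it); FontStackTracker and its token format are invented for this example:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class FontStackTracker {

    // Tracks which font is current after a sequence of simplified
    // operators, mirroring how a PdfCanvasProcessor handler would need
    // to save the font on q and restore it on Q.
    static String currentFontAfter(List<String> ops) {
        Deque<String> stack = new ArrayDeque<>();
        String current = null;
        for (String op : ops) {
            if (op.startsWith("Tf:")) {
                current = op.substring(3);           // font change
            } else if (op.equals("q")) {
                stack.push(current == null ? "" : current); // save state
            } else if (op.equals("Q")) {
                String popped = stack.pop();         // restore state
                current = popped.isEmpty() ? null : popped;
            }
        }
        return current;
    }
}
```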
When you encounter text (a Tj, TJ, ', or " operator), you would need to decode the text using the complete font, then re-encode it to the new subset font's encoding (unless you know for sure that all your source fonts are ASCII-compatible). Add that text's characters to the subset like this:
subsetFont.addSubsetRange(new int[]{character});
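A small helper for collecting the code points actually used could look like the sketch below. It is deliberately library-free; in the real processor, each decoded string would come from the complete font, and each collected code point would then be registered via subsetFont.addSubsetRange(new int[]{codePoint}) as shown above:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SubsetCollector {

    // Collects the distinct Unicode code points used across all decoded
    // text strings. Each element of the returned set is a candidate for
    // subsetFont.addSubsetRange(new int[]{codePoint}).
    static Set<Integer> usedCodePoints(List<String> decodedStrings) {
        Set<Integer> codePoints = new TreeSet<>();
        for (String s : decodedStrings) {
            s.codePoints().forEach(codePoints::add);
        }
        return codePoints;
    }
}
```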
I have a PDF with a text field which contains some characters, but the language-specific characters are overlapping.
When the field gains focus, the text changes and displays correctly. When it loses focus, it displays incorrectly.
When the text is edited, it also displays correctly.
See the file test_extended_filled.pdf below.
How I created the PDF:
1. Created an .odg template in OpenOffice Draw 4.0.1 -> test.odg
2. Exported it as PDF -> test.pdf
3. Edited test.pdf with Adobe Acrobat X Pro 10.0.0 and resaved it with extended functions (needed to save on a local PC) -> test_extended.pdf
4. Filled the form via Java (PdfStamper) -> test_extended_filled.pdf
Bonus: when I change the font via PdfStamper in Java, it looks like the changes are applied only to focused text as well. -> test_extended_filled_font_size.pdf
Note: When I fill test.pdf from step 2, it's displayed correctly -> text_filled.pdf
Attached files (go to download section):
https://rapidshare.com/share/ACC0D81E9235A6DA2CC2353BD21A4C37
After I added
stamper.getAcroFields().addSubstitutionFont
it's better, but some characters still overlap. -> test_extended_filled_font_size_with_substitution_font.pdf
http://rapidshare.com/share/0EE3238F37E9115C36A7A74706B09826
Any ideas?
Please take a look at the FillFormSpecialChars example and the resulting PDF.
Open Office doesn't really create nice forms. As mkl already indicated, the NeedAppearances flag can cause problems, the borders of the fields are drawn onto the page content instead of being part of the widget annotations, etc.
In your case, you've defined a font that isn't optimal for special characters. Using a substitution font isn't ideal, because you can clearly see that drawing the glyphs isn't that much of a problem. The problem is that the metrics are all wrong. It's as if the special characters have an advance of 0 glyph units. In this case, you should change the font using the setFieldProperty() method.
We have a set of forms in PDF. In our program we read these forms, fill in data, and then write them out. Using Foxit PDF Editor we found that the font used on these forms is the standard font Helvetica. When writing the forms, we set the font as follows:
bf=BaseFont.createFont(BaseFont.HELVETICA_BOLD, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
The problem is: on the original forms there are some characters whose fonts we cannot identify using Foxit PDF Editor, i.e., the font property is blank for those characters. On the printed forms, those characters are then not rendered correctly. In Foxit Editor, these characters have the font property "Non-embedded font: EuropeanPi-Three", while we never set any font to EuropeanPi-Three when writing the PDF forms. We use the package com.lowagie.text to handle PDF in Java. Does anyone know how to handle this problem? Thanks
I've had a similar problem with iTextSharp.
The solution was setting a "substitution" font. The method is called something like setSubstitutionFont(BaseFont).
Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I saw classes like Paragraph and Chunk in code that creates a new PDF file, but I cannot find any way to get these classes when reading a file. Every idea is appreciated.
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line, and so on.
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.