PDFBox embedded fonts not working when filling form - Java

I fill forms with field.setValue. However, even though the PDF document has fonts embedded in it, I get the error "is not available in this font's encoding: WinAnsiEncoding" no matter which type of font it is. Note that this happens with Chinese and Russian characters.

Your PDF documents may have embedded fonts, but those fonts apparently have been embedded with an Encoding value of WinAnsiEncoding.
WinAnsiEncoding essentially contains the Latin-15 character set, so it is intended for “Western European” languages (see the Wikipedia article on it) and in particular for neither Cyrillic nor CJK languages.
If you want to fill Chinese or Russian characters into form fields using PDFBox, therefore, you have to
either embed a font into your PDF using an appropriate encoding beforehand,
or replace the embedded font with PDFBox right before setting the form field value; see for example this answer.
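A minimal sketch of the replacement approach, assuming PDFBox 2.x; the font file, field name, and file names here are placeholders, not taken from the question:

import java.io.File;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class FillWithUnicodeFont {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("form.pdf"))) {
            PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
            // Embed a font that actually covers the characters to be filled in
            PDType0Font font = PDType0Font.load(doc, new File("NotoSansSC-Regular.ttf"));
            PDResources dr = acroForm.getDefaultResources();
            COSName fontName = dr.add(font);
            // Rewrite the field's default appearance so it uses the new font
            PDField field = acroForm.getField("name");
            field.getCOSObject().setString(COSName.DA, "/" + fontName.getName() + " 0 Tf 0 g");
            field.setValue("你好");
            doc.save("filled.pdf");
        }
    }
}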

Related

Re-embed subset font in PDF with iText 7

I have some input PDFs, all with fully embedded fonts, and I want to "shrink" them by creating font subsets. I know there is a way to unembed fonts and embed subset fonts, but the problem is that I don't have the fonts' source files. I only have the fonts embedded in the source PDFs.
Can someone help me troubleshoot this issue?
ENV: Java 8, iText 7.1.5
Here's a thread on a similar question (about embedding, not subsetting, despite the OP's question): How to subset fonts into an existing PDF file. The following statement is relevant:
If you want to subset it, you'd need to parse all the content streams
in the PDF to find out which glyphs are used. That's NOT a trivial
task.
I wouldn't recommend attempting this in iText unless it's really necessary. It would likely end up buggy unless you have a very complete understanding of the PDF specs. It might be worth pursuing other avenues, such as changing the way the PDFs are created, or using something like Distiller that can do this for you.
If you do want to do this in iText, I'm afraid you will likely have to use a PdfCanvasProcessor and some custom operator handlers. You would need to find all text fields, determine which font they use, build a new subset font with the applicable glyphs, and replace the fonts with new subset copies. This is how you would create a copy of the complete font to prepare for subsetting (assuming you don't have copies of the font files):
String encoding = PdfEncodings.WINANSI; // or another encoding if needed for more glyph support
PdfFont completeFont = ...; // get complete font from font dictionary
PdfFont subsetFont = PdfFontFactory.createFont(completeFont.getFontProgram(), encoding, true);
subsetFont.setSubset(true);
When you encounter a Font change operator (Tf), you would need to look up that font in the font dictionary and create a new (or lookup an already created) subset copy of that font to prepare for upcoming text fields. Don't forget to keep the font in a stack so you can pop back to the previous font (look for q and Q operators). And don't forget to check parent forms and page groups for the fonts if they don't exist in the current XObject or page resource dictionary.
When you encounter text (a Tj, TJ, ', or " operator), you would need to decode the text using the complete font, then re-encode it to the new subset font's encoding (unless you know for sure that all your source fonts are ASCII-compatible). Add that text's characters to the subset like this:
subsetFont.addSubsetRange(new int[]{character});
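To gather those characters in the first place, the parser's text-render events can serve as a starting point instead of hand-rolled operator handlers. A rough sketch against the iText 7 kernel parser; the class and variable names are mine, and it only sees text drawn in page content streams:

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;

// Records which code points each font actually renders
class GlyphUsageCollector implements IEventListener {
    final Map<PdfFont, Set<Integer>> usage = new HashMap<>();

    public void eventOccurred(IEventData data, EventType type) {
        if (type == EventType.RENDER_TEXT) {
            TextRenderInfo info = (TextRenderInfo) data;
            Set<Integer> chars = usage.computeIfAbsent(info.getFont(), f -> new HashSet<>());
            info.getText().codePoints().forEach(chars::add);
        }
    }

    public Set<EventType> getSupportedEvents() {
        return Collections.singleton(EventType.RENDER_TEXT);
    }
}

Feeding every page through the processor then fills the map:

PdfDocument pdf = new PdfDocument(new PdfReader("input.pdf"));
GlyphUsageCollector collector = new GlyphUsageCollector();
PdfCanvasProcessor processor = new PdfCanvasProcessor(collector);
for (int p = 1; p <= pdf.getNumberOfPages(); p++) {
    processor.processPageContent(pdf.getPage(p));
}
// usage now maps each PdfFont to the code points to pass to addSubsetRange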

Java: Characters in PDF cannot be rendered correctly

We have a set of forms in PDF. In our program we read these forms, fill in data, and then write them out. Using Foxit PDF Editor we found that the font used on these forms is the standard font Helvetica. When writing the forms, we set the font as follows:
bf = BaseFont.createFont(BaseFont.HELVETICA_BOLD, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
The problem is: on the original forms there are some characters whose fonts we cannot identify using Foxit PDF Editor, i.e., the font property is blank for those characters. On the printed forms, those characters are not rendered correctly. In Foxit Editor, these characters have the font property "Non-embedded font: EuropeanPi-Three", although we never set any font to EuropeanPi-Three when writing the PDF forms. We use the package com.lowagie.text to handle PDFs in Java. Does anyone know how to handle this problem? Thanks
I've had a similar problem with iTextSharp.
The solution was setting a "substitution" font. The method is called something like setSubstitutionFont(BaseFont).
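In the Java com.lowagie API the corresponding method is AcroFields.addSubstitutionFont. A minimal sketch; the file names, field name, and choice of Arial Unicode as the fallback are placeholders:

import java.io.FileOutputStream;
import com.lowagie.text.pdf.AcroFields;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;

public class FillWithSubstitutionFont {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("form.pdf");
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("out.pdf"));
        AcroFields fields = stamper.getAcroFields();
        // Fallback font consulted when the field's own font lacks a glyph
        BaseFont fallback = BaseFont.createFont("arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        fields.addSubstitutionFont(fallback);
        fields.setField("someField", "text with unusual characters");
        stamper.close();
        reader.close();
    }
}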

Japanese in JTextArea

I have a database with Japanese words, and an algorithm that reads these words and puts them into a JTextArea.
The problem is that I see rectangles instead of Japanese characters.
But when I copy such a set of rectangles (Ctrl+C) from the JTextArea and paste them into, e.g., the command input of Total Commander or a Word document, the characters are displayed properly. But only under Win7.
Because I run Eclipse on a virtual machine under WinXP, I can also copy the rectangles into the command input of Total Commander under WinXP. There they remain rectangles, just as in my Java app.
This means the JTextArea holds the information about the particular characters, but cannot interpret it.
Of course I have the proper font installed.
I've tried many things with fonts:
textArea.setFont(new Font(blablabla));
and similar, but without effect.
What should I do?
The problem with your JTextArea is most probably that the font you're using isn't suitable for UTF-8 and Japanese. The font doesn't provide a mapping table from Unicode values to characters. For example, 0x41 is the letter 'A' in ASCII, as well as in UTF-8 and even in Shift-JIS, but the font you linked resolves 0x41 to a Kanji character. And the font doesn't contain Hiragana and Katakana characters at all; please see also the comments section on the site where you got this font, here.
After using ChapMap: it has a WSIfonts tag and does NOT support all the Chinese characters; it only has 90 characters and assigns one character per char, except caps.
It's a Chinese font, not a Japanese one. But it doesn't even provide all Chinese characters, and it has no useful mapping table included, so it's pretty useless.
Try another font; that should work just fine if it really contains Japanese characters and provides an applicable mapping table for UTF-8.
You can find fonts that would work, e.g., here.
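Once you have a candidate font file, a quick way to verify coverage before wiring it into the UI is Font.canDisplayUpTo. A minimal sketch; the font file name and sample string are placeholders:

import java.awt.Font;
import java.io.File;
import javax.swing.JTextArea;

public class JapaneseTextArea {
    public static void main(String[] args) throws Exception {
        JTextArea textArea = new JTextArea();
        // Load a font that really contains Hiragana, Katakana and Kanji
        Font base = Font.createFont(Font.TRUETYPE_FONT, new File("NotoSansJP-Regular.ttf"));
        Font jp = base.deriveFont(Font.PLAIN, 14f);
        // canDisplayUpTo returns -1 when every character in the string is covered
        if (jp.canDisplayUpTo("日本語のテスト") == -1) {
            textArea.setFont(jp);
        }
    }
}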

WebView Malayalam Unicode complex/combined letters

I am having a problem with my Android application: it does not display some special letters, i.e., complex/combined letters (KOOTTAKSHARAM) from the Malayalam language.
In my application I am using a WebView to load HTML prepared with Unicode characters received from the server. The font 'Thoolika.ttf' is loaded from the assets.
Earlier I used ASCII text from the server together with the .ttf font file, and that worked without problems. I tried UTF-8 conversion as well, but it didn't help.
So I would like to know: is it possible to display complex/combined letters (KOOTTAKSHARAM) from the Malayalam language using Unicode characters and a Unicode font file (.ttf)?
The split rendering of Koottaksharam and Chillu in Malayalam is not the real issue. The real issue is that only a few manufacturers support Malayalam Unicode fonts, and few of them render Malayalam correctly.
You can read Malayalam on Samsung devices, but NOT on HTC, LG, Sony, etc. Google added native support for Malayalam in Jelly Bean (v4.1).
The only workaround is to convert the Unicode text into ASCII codes, use that ASCII text in your components, and load the font dynamically. You can see this at Manoramaonline.com: look at the HTML source; they are not using Unicode but rather some symbols, and they display those symbols using their own font, which eventually looks like Malayalam text.
Mathrubhumi.com has a mobile version of their website which uses the same technique. You can read Malayalam perfectly even when there is no system support for it. I think they first type out the ASCII version (to publish for print and Android) and convert it into Unicode later (to publish on websites).
There are many ASCII-to-Unicode converters, e.g. http://aksharangal.com/, and one well-known Unicode-to-ASCII converter is http://smc.org.in/silpa/ASCII2Unicode
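For the "load the font dynamically" part in a WebView, the usual approach is an @font-face rule pointing into the assets. A sketch, where webView and textFromServer are assumed to exist already and 'Thoolika.ttf' is the asset named in the question; note that complex-script shaping may still fail on devices without Malayalam support, whatever the font:

// Build HTML that declares the asset font and applies it to the body
String html = "<html><head><style type=\"text/css\">"
    + "@font-face { font-family: Thoolika; src: url('file:///android_asset/Thoolika.ttf'); }"
    + "body { font-family: Thoolika; }"
    + "</style></head><body>" + textFromServer + "</body></html>";
// The base URL lets the WebView resolve the asset path
webView.loadDataWithBaseURL("file:///android_asset/", html, "text/html", "UTF-8", null);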

Splitting a PDF with PDFBox, but losing the font

I wrote some code in Java using the PDFBox API that splits a PDF document into its individual pages, looks through the pages for a specific string, and then makes a new PDF from the page the string is on. My problem is that when the new page is saved, I lose my font. I made a quick Word document to test it, and the default font was Calibri, so when I run the program I get an error box that reads: "Cannot extract the embedded font..." It then replaces the font with some other default.
I have seen a lot of example code that shows how to change the font when you are adding text to be placed in a PDF, but nothing that sets the font for the PDF itself.
If anyone is familiar with a way to do this, (or can find documentation/examples), I would greatly appreciate it!
Edit: forgot to include some sample code
if (pageContent.indexOf(findThis) >= 0) {
    PDPage pageToRip = pages.get(i);
    // >> set the font of pageToRip here <<
    res.importPage(pageToRip); // res is the new document that will be saved
}
I don't know if that helps any, but I figured I'd include it.
Also, this is what the change looks like if the PDF is written in Calibri and split: (screenshot not included here)
Note: this might be a non-issue; it depends on the fonts used in the files that will need to be processed. I tried some fonts besides Calibri and they worked out fine.
From How to extract fonts from a PDF:
You actually cannot extract a font from a PDF, not even if the font is fully embedded. There are two reasons why this is not feasible:
• Most fonts are copyrighted, making it illegal to use an extractor.
• When a font is embedded in a PDF, not all of the font data are included. Obviously the font outline data are included, as well as the font width tables. Other information, such as data about ligatures, is irrelevant within the PDF, so those data do not get enclosed in a PDF.
I am not aware of any font extraction tools, but if you come across one, the above reasons should make it clear that these utilities are to be avoided.
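As an aside, for the splitting step itself PDFBox ships a Splitter utility that clones each page into its own document; whether it preserves the problematic font depends on the file, but it is worth trying before manipulating fonts by hand. A sketch assuming PDFBox 2.x, with placeholder file names:

import java.io.File;
import java.util.List;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class SplitPdf {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            Splitter splitter = new Splitter();
            // Produces one single-page PDDocument per page of the source
            List<PDDocument> pages = splitter.split(doc);
            for (int i = 0; i < pages.size(); i++) {
                try (PDDocument page = pages.get(i)) {
                    page.save("page-" + (i + 1) + ".pdf");
                }
            }
        }
    }
}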
