I'm bundling TTF files in a JAR file and intend to use them as physical fonts for rendering text.
All the sample code I've seen on the internet looks like this:
InputStream is=Essai.class.getResourceAsStream(resourcePath);
Font f=Font.createFont(fontFormat, is);
I have two questions about this code:
First, a single font family, say DejaVu, comes as 4 different TTF files, one per style (regular, bold, italic and bold-italic); is it enough to create a font from one single TTF (any one of the 4)?
Second, none of the code I've seen closes the stream after creating the font, which kept me wondering: was it intentional? Maybe the created font (which will be registered in the local GraphicsEnvironment later) needs the stream to stay open?
The javadoc of the createFont() method says: "This method does not close the InputStream."
So my second question is: should I close the stream after creating the font, or keep it open?
I don't believe loading one variant will load them all. But then, it doesn't matter: Java can derive a Font variant with any combination of bold and italic. I'd only put the regular font in the JAR in the first place.
Close the stream. Many examples cut corners for the sake of brevity (though that is generally a bad idea).
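To make both points concrete, here is a minimal sketch: try-with-resources closes the stream as soon as the font object has been created, and the bold/italic styles are derived from the regular face rather than loaded from separate files. The resource path `/fonts/DejaVuSans.ttf` is a hypothetical example, not a path from the question.

```java
import java.awt.Font;
import java.awt.FontFormatException;
import java.awt.GraphicsEnvironment;
import java.io.IOException;
import java.io.InputStream;

public class FontLoader {

    // Loads a TTF from the classpath and registers it. The stream is closed
    // by try-with-resources; the created Font does not need it afterwards.
    static Font loadBaseFont(String resourcePath) throws IOException, FontFormatException {
        try (InputStream is = FontLoader.class.getResourceAsStream(resourcePath)) {
            if (is == null) {
                throw new IOException("Resource not found: " + resourcePath);
            }
            Font f = Font.createFont(Font.TRUETYPE_FONT, is);
            GraphicsEnvironment.getLocalGraphicsEnvironment().registerFont(f);
            return f;
        } // stream closed here
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical resource path; bundle your own regular-style TTF.
        Font regular = loadBaseFont("/fonts/DejaVuSans.ttf");
        // Java synthesizes the other styles from the regular face:
        Font bold = regular.deriveFont(Font.BOLD, 14f);
        Font boldItalic = regular.deriveFont(Font.BOLD | Font.ITALIC, 14f);
    }
}
```

Note that a synthesized bold or italic is an algorithmic approximation; if you need the typographically correct faces, load all four TTFs the same way.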
Related
I have some input PDFs, all with fully embedded fonts, and I want to "shrink" them by creating font subsets. I know there is a way to unembed fonts and embed a subset font, but the problem is that I don't have the source files of the fonts. I just have the fonts embedded in the source PDFs.
Can someone help me troubleshoot this issue?
ENV: java8, itext7.1.5
Here's a thread on a similar question (about embedding, not subsetting, despite the OP's question): How to subset fonts into an existing PDF file. The following statement is relevant:
If you want to subset it, you'd need to parse all the content streams
in the PDF to find out which glyphs are used. That's NOT a trivial
task.
I wouldn't recommend attempting this in iText unless it's really necessary. It would likely end up buggy unless you have a very complete understanding of the PDF specs. It might be worth pursuing other avenues, such as changing the way the PDFs are created, or using something like Distiller that can do this for you.
If you do want to do this in iText, I'm afraid you will likely have to use a PdfCanvasProcessor and some custom operator handlers. You would need to find all text fields, determine which font they use, build a new subset font with the applicable glyphs, and replace the fonts with new subset copies. This is how you would create a copy of the complete font to prepare for subsetting (assuming you don't have copies of the font files):
String encoding = PdfEncodings.WINANSI; // or another encoding if needed for more glyph support
PdfFont completeFont = ...; // get complete font from font dictionary
PdfFont subsetFont = PdfFontFactory.createFont(completeFont.getFontProgram(), encoding, true);
subsetFont.setSubset(true);
When you encounter a Font change operator (Tf), you would need to look up that font in the font dictionary and create a new (or lookup an already created) subset copy of that font to prepare for upcoming text fields. Don't forget to keep the font in a stack so you can pop back to the previous font (look for q and Q operators). And don't forget to check parent forms and page groups for the fonts if they don't exist in the current XObject or page resource dictionary.
When you encounter text (a Tj, TJ, ', or " operator), you would need to decode the text using the complete font, then re-encode it to the new subset font's encoding (unless you know for sure that all your source fonts are ASCII-compatible). Add that text's characters to the subset like this:
subsetFont.addSubsetRange(new int[]{character});
I'm using an Ubuntu PC to create PDFs with iText that are partly in Chinese. To read them I use Evince. So far there have hardly been any problems.
On my PC I tried the following three BaseFonts, and they all worked:
bf = BaseFont.createFont("MSungStd-Light", "UniCNS-UCS2-H", BaseFont.NOT_EMBEDDED);
bf = BaseFont.createFont("STSong-Light", "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
bf = BaseFont.createFont("MSung-Light","UniCNS-UCS2-H", BaseFont.NOT_EMBEDDED);
Unfortunately, the moment the final PDF is opened on Windows with Acrobat Reader, the document can't be displayed correctly any more.
After I googled the fonts to find a solution, I came across this forum post, where the problem is explained in an understandable way (here MSung-Light was used): http://community.jaspersoft.com/questions/531457/chinese-font-cannot-be-seen
You are using a built-in Chinese font in PDF. I'm not sure about the
ability of this font to support both English and Chinese, or mixed
language anyway.
The advantage of using an Acrobat Reader built-in font is that it
produces smaller PDF files, because it relies on those fonts being
available on the client machine that displays the PDF, through the
pre-installed Acrobat Asian Font Pack.
However, using the PDF built-in fonts has some disadvantages that were
discovered through testing on different machines, when we investigated
a similar problem related to a built-in Korean font.
What should I do about it?
It's not so important to be able to copy the Chinese letters. Can iText convert a paragraph to an image? Or are there any better solutions?
You're using a CJK font. CJK fonts are never embedded and they require a font pack when opening such a file in Adobe Reader. Normally, Adobe Reader will ask you if you want to install such a font pack automatically. If it doesn't, you can download the appropriate font pack here.
It seems that you want to avoid having an end user install a font pack. That's understandable to some extent. What is really bad, is your suggestion to avoid using a font and to draw the glyphs one by one instead. This is possible with iText (and documented in my book), but it comes with a severe warning: Don't do this! Your file will be bloated and print results risk being awful!
An alternative is to use another font, e.g. arialuni.ttf, YaHei, SimHei,... These fonts contain Chinese glyphs and you can embed a subset of these fonts into your PDF (embedding the whole font would be overkill). See for instance the FontTest example.
If you have a font program such as arialuni.ttf, you can use this code to create a BaseFont object:
BaseFont.createFont("c:/windows/fonts/arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
With this font, you can display Chinese characters that will be visible using any viewer on any OS. If you don't have arialuni.ttf, you need to look for another font and use the FontTest example to test whether Chinese is supported (if you don't see any text after "Chinese:", then Chinese isn't supported).
Extra answer in reply to your comment:
Please forget about iText-Asian as that is a jar you need when you want to use CJK fonts. You explicitly say you don't want to use CJK fonts, so you don't need to use iText-Asian.
If you want to embed the font (as opposed to rely on a font pack), you need to pick a font program that knows how to draw Chinese characters. This immediately makes your question regarding "Can you point me to an example that draws Chinese characters?" void. I could point you to such an example, but you'd still need a font program.
Once you have that font program: why wouldn't you use it the correct way? You should use that font program the way you're supposed to use it. You shouldn't use that font program to draw your glyphs as images as that would result in a PDF file with a huge filesize and a bad resolution (bad quality of the glyphs because you draw each separate character instead of using the font program in the PDF).
Did you look for a font program yet? There was a similar question about Vietnamese fonts a while ago: Can't export Vietnamese characters to PDF using iText It took me less than a quarter of my time to Google for a font that could be used. Why don't you spend a quarter of your time finding a font that supports Chinese?
Extra answer in reply to your extra comment:
When we refer to CJK, we refer to a specific approach in which fonts aren't embedded, but rely on a font pack being installed on the end user's machine, so that Adobe Reader can use that font. You don't want this, so all your questions about using the iText-Asian jar and MSung-Light and so on are irrelevant.
The Chinese character set is huge and many computers ship without any Chinese fonts (especially in the US), so the answer to your question "Isn't there any way to use a built-in arialuni" is "No, you shouldn't count on that!"
What you say about Vietnamese is irrelevant. A font is a font is a font. You have a character code on one side and a glyph on the other side. The glue that connects one with the other is the encoding. For instance: you have the hexadecimal character code B2E2 and the hexadecimal character code CAD4. If the encoding is GBK, the corresponding glyphs are 测 and 试. Note that when you want to represent the very same characters in Unicode, you use the code points 6D4B and 8BD5. There is very little difference with other systems. For instance: you have the hexadecimal character code 41 (65 in decimal) and if the encoding is Latin-1, the corresponding glyph is A.
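This encoding glue can be verified with a few lines of plain Java (no iText needed), since the GBK charset ships with the JDK:

```java
import java.nio.charset.Charset;

public class GbkDemo {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // The GBK byte pairs B2E2 and CAD4 map to the glyphs 测 and 试,
        // which are the Unicode code points U+6D4B and U+8BD5.
        byte[] encoded = {(byte) 0xB2, (byte) 0xE2, (byte) 0xCA, (byte) 0xD4};
        String decoded = new String(encoded, gbk);
        System.out.println(decoded); // prints 测试
        System.out.printf("U+%04X U+%04X%n",
                (int) decoded.charAt(0), (int) decoded.charAt(1)); // prints U+6D4B U+8BD5
    }
}
```

The same characters travel through the PDF as glyph references; which bytes represent them depends entirely on the encoding chosen for the font.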
I have asked you to search for a font that supports Chinese. I have opened Google and I searched for the keywords "Chinese fonts". I found this page: http://www.freechinesefont.com/ and I picked a font that seemed OK to me: http://www.freechinesefont.com/simplified-hxb-mei-xin-download/
Now I use this code snippet:
import java.io.FileOutputStream;
import java.io.IOException;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;

public class ChineseTest {
    /** Path to the resulting PDF file. */
    public static final String DEST = "results/test.pdf";
    /** Path to the Chinese font. */
    public static final String FONT = "resources/hxb-meixinti.ttf";

    /**
     * Creates a PDF file: results/test.pdf
     * @param args no arguments needed
     */
    public static void main(String[] args) throws DocumentException, IOException {
        new ChineseTest().createPdf(DEST);
    }

    /**
     * Creates a PDF document.
     * @param filename the path to the new PDF document
     * @throws DocumentException
     * @throws IOException
     */
    public void createPdf(String filename) throws DocumentException, IOException {
        // step 1
        Document document = new Document();
        // step 2
        PdfWriter.getInstance(document, new FileOutputStream(filename));
        // step 3
        document.open();
        BaseFont bf = BaseFont.createFont(FONT, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        Font font = new Font(bf, 15);
        // step 4
        document.add(new Paragraph("\u6d4b\u8bd5", font));
        // step 5
        document.close();
    }
}
The result on Windows: the word test (测试) is displayed correctly in Chinese. How is this different from Vietnamese? A subset of the font is embedded, which means you can keep the file size low. The text is not embedded as an image, which means the quality of the text is excellent.
Extra answer in answer to your extra comment: In your comment, you claim that the example that uses the file hxb-meixinti.ttf requires the installation of a font. That is incorrect. hxb-meixinti.ttf is merely a file that is read by iText and used to embed the definition of specific glyphs (a subset of the font) into a PDF.
When you write "Related to a Font-Program: Java seems to be able to do it without using external software": Java is able to use fonts because Java uses font files, just the same way as iText uses font files.
For more info, read Supported Fonts in the Java manual. I quote:
Physical fonts need to be installed in locations known to the Java
runtime environment. The JRE looks in two locations: the lib/fonts
directory within the JRE itself, and the normal font location(s)
defined by the host operating system. If fonts with the same name
exist in both locations, the one in the lib/fonts directory is used.
What I tried explaining (and what you have been ignoring since the start of this thread) is that iText needs access to a physical font. iText can accept a font from file or as a byte[], but you need to provide something like a TTF, OTF, TTC, or AFM+PFB. This is no different from how Java works.
In your comment you also say that you want Adobe Reader to accept a byte stream instead of reading a PDF from file. This is not possible. Adobe Reader always requires the presence of the PDF file on disk. Even if the PDF file is served by a browser, the bytes of the PDF are stored as a temporary file. This is inherent to your request that the file needs to be viewed in Adobe Reader.
The rest of your comment is unclear. What do you mean by "If everyone would just upload anything he might need a switch causes difficulties"? Are you talking about downloading instead of uploading? Also: I gave you a solution that doesn't require downloading anything extra on the client side, yet you keep on nagging that no one will install anything on Acrobat.
As for your remark "For BS I got a solution recently", I have no idea what you mean by BS.
I want to change the background color of an existing PDF to transparent or white, and I am using PDFBox for performing other tasks on the PDF. I found some documentation here:
setBackgroundColor - PDFBox
But I am completely unaware of how to use it, as I am not accustomed to Java.
Can someone possibly provide some example code for doing this?
I want to change the background color of an existing PDF to transparent or white
According to the PDF specification ISO 32000-1, section 11.4.7:
Ordinarily, the page shall be imposed directly on an output medium, such as paper or a display screen. The page group shall be treated as an isolated group, whose results shall then be composited with a backdrop colour appropriate for the medium. The backdrop is nominally white, although varying according to the actual properties of the medium. However, some conforming readers may choose to provide a different backdrop, such as a checker board or grid to aid in visualizing the effects of transparency in the artwork.
PDF viewers most often do use this white backdrop. Thus, if your PDF on standard viewers displays a different color in the back, this normally is due to some area filling operation(s) somewhere in the page content stream.
Thus, there is no simple single attribute of the PDF to set somewhere; instead you have to parse the page content, find the operations which paint what you perceive as background, and change them. There are numerous different operations which may be used for this task, though, and these operations may also be used for purposes other than background coloring. Thus, there is no one method to change backgrounds.
If you have a single specific PDF, or PDFs generated alike, please provide a sample document to make it possible to help you find a way to change the perceived background color.
PS: The PDLayoutAttributeObject.setBackgroundColor method you found refers to the creation of so called Layout Attributes which
specify parameters of the layout process used to produce the appearance described by a
document’s PDF content. [...]
NOTE The intent is that these parameters can be used to reflow the content or export it to some other document format with at least basic styling preserved.
(section 14.8.5.4 in the PDF specification ISO 32000-1)
Thus, they are provided only in PDFs intended for content reflow or content export and are not used by regular PDF viewers.
I know that Java supports TrueType fonts (.ttf) and that .ttc is an extension of the TrueType format, but I can't find information on whether Java also supports a TrueType collection (.ttc) being explicitly set as the font on a JLabel, for example.
I made an example, where I successfully load a .ttc file in my application with the following code:
InputStream is = getClass().getResourceAsStream("/resources/simsun.ttc");
Font font = Font.createFont(Font.TRUETYPE_FONT, is);
Font fontBase = font.deriveFont(15f);
field.setFont(fontBase);
The code is working well, there are no exceptions related to the creation, loading or setting of the .ttc file as a font in Swing components.
My question is: can someone confirm that this works well and that all glyphs from the fonts inside the .ttc are usable in components, or are there any disadvantages related to this?
Also, is there any difference if the .ttc is loaded from a jar on the client machine, or does it have to be installed in the system fonts?
I'm using Windows 7.
First of all, the difference between TTC and TTF is: a TTC can (and usually does) contain multiple fonts, but a TTF has only one font defined. The reason to put multiple fonts into one file is to save space by sharing glyphs (or sub-glyphs). For example, most of the glyphs in SimSun and NSimSun are the same, so storing them together saves a lot of space.
Second, Java supports the TTC font format, but by using Font.createFont() you can only get the first font defined in the TTC file. Currently, there is no way to specify the font index. Take a look at sun.font.FontManager.createFont2D(): when they invoke new TrueTypeFont(), the fontIndex is always zero. Shame!
For your question: if all you need is the first font in the TTC file, then everything should be okay. All the glyphs defined for the first font will be available. But if you expect the second or another font defined in that file, then you hit a block. You cannot even get that font's name using this API.
There is no difference between system-loaded fonts and created fonts. However, since there is no good way to specify the font index, you may have to hack into FontManager and come up with some platform-specific code.
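The collection structure is visible in the file itself: a TTC begins with a 'ttcf' header whose third 32-bit field is the number of member fonts (the index Font.createFont() effectively pins to 0). Here is a small sketch that reads that field, using a synthetic in-memory header rather than a real simsun.ttc, so the example is self-contained:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class TtcHeader {

    // Returns the number of fonts in a TrueType Collection, or 1 if the
    // data is a plain single-font TTF (no 'ttcf' tag at the start).
    static int countFonts(byte[] fontData) {
        ByteBuffer buf = ByteBuffer.wrap(fontData); // TTC headers are big-endian
        byte[] tag = new byte[4];
        buf.get(tag);
        if (!new String(tag, StandardCharsets.US_ASCII).equals("ttcf")) {
            return 1; // plain TTF
        }
        buf.getInt();        // skip the header version (e.g. 0x00010000)
        return buf.getInt(); // numFonts
    }

    public static void main(String[] args) {
        // Synthetic 'ttcf' header declaring two member fonts. A real file
        // would follow with one table-directory offset per font.
        ByteBuffer header = ByteBuffer.allocate(12);
        header.put("ttcf".getBytes(StandardCharsets.US_ASCII));
        header.putInt(0x00010000); // version 1.0
        header.putInt(2);          // numFonts
        System.out.println(countFonts(header.array())); // prints 2
    }
}
```

This only inspects the header; actually extracting the second font would mean following its table-directory offset, which is what the JDK API gives you no way to request.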
I have a scenario where I need a Java app to be able to extract content from a PDF file in one of 2 modes: TEXT_ONLY or ALL. In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings. In all mode, all content (text, images, etc.) is read out of the file.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
while (page.hasMoreText())
    textList.add(page.nextTextChunk());
if (allMode)
    while (page.hasMoreImages())
        imageList.add(page.nextImage());
I know Apache Tika uses PDFBox under the hood, but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
To expound on some aspects of why @markStephens points you towards resources giving some background on PDF:
In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings.
Your definition "visible" as if a human being was reading the PDF is not yet very well-defined:
Is text 1 pt in size visible? When zooming in, a human can read it; at standard magnification, though, one cannot. Which size would be the limit?
Is text in RGB (128, 129, 128) on a background of (128, 128, 128) visible? How different do the colors have to be?
Is text displayed in some white-noise pattern on a background of some other white-noise pattern visible? How different do the patterns have to be?
Is text only partially on-screen visible? If yes, is one visible pixel enough? And what about some character 'I' at a giant size where the visible page area fits into the dot on the letter?
What about text covered by some annotation which can easily be moved, possibly even by some automatically executed JavaScript code in the file?
What about text in some optional content group, only visible when printing?
...
I would expect most available PDF text parsing libraries to ignore all these circumstances and extract the text, at most respecting a crop box. In case of images with added, invisible OCR'ed text the extraction of that text in general is desired.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
PDF (in general) does not know about paragraphs, just groups of glyphs positioned somewhere on the page. Recognizing paragraphs is a task which cannot be guaranteed to work properly, as heuristics are at work. If, furthermore, you have multi-column text with an irregular separation, maybe even some image in between (making it hard to decide whether there are two columns divided by the image or one column with an integrated image), you can count on recognition of the text flow, let alone of text elements like paragraphs, sections, etc., failing miserably.
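To illustrate why paragraph recognition is heuristic, here is a toy sketch (not any library's actual algorithm): it groups positioned text chunks into paragraphs purely by vertical gap. The Chunk type and the 1.5x threshold are invented for the example, and a layout whose paragraph spacing equals its line spacing, or a two-column page, immediately defeats it.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphHeuristic {

    // Toy model of a positioned text chunk. PDF itself only knows glyphs
    // at coordinates; y decreases as we move down the page.
    record Chunk(double y, String text) {}

    // A vertical gap noticeably larger than the regular line spacing is
    // taken to start a new paragraph. Purely heuristic.
    static List<String> paragraphs(List<Chunk> chunks, double lineGap) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Double lastY = null;
        for (Chunk c : chunks) {
            if (lastY != null && lastY - c.y() > 1.5 * lineGap) {
                result.add(current.toString().trim());
                current.setLength(0);
            }
            current.append(c.text()).append(' ');
            lastY = c.y();
        }
        if (current.length() > 0) result.add(current.toString().trim());
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> page = List.of(
                new Chunk(700, "First paragraph,"), new Chunk(686, "line two."),
                new Chunk(650, "Second paragraph.")); // 36 pt gap > 1.5 * 14
        System.out.println(paragraphs(page, 14));
        // prints [First paragraph, line two., Second paragraph.]
    }
}
```

Real extractors must additionally handle columns, images, rotated text, and out-of-order content streams, which is exactly where such heuristics break down.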
If your PDFs are either properly tagged or all generated by a tool chain whose patterns in the created PDF content streams betray the text structures, you may have more luck. In the latter case, though, your solution would have to be custom-made for that tool chain.
but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
There you point towards another point of interest: PDFs can be marked so that text extraction is forbidden while they otherwise can be displayed by anyone. While technically PDFs marked like that can be handled just like documents without that mark with just one extra decoding step (essentially they are encrypted with a publicly known password), doing so is clearly acting against the declared intention of the author and violating their copyright.
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
As long as you expect 100% accuracy for generic input, you should reconsider your architecture.
If, on the other hand, the PDFs are all you have and a solution that is as effective as possible is OK, there are multiple libraries for you, iText and PDFBox to name but two. Which is best for you depends on more factors, e.g. on whether you need a generic solution or all PDFs are created by a tool chain as above.
In any case, you'll have to do some programming yourself to fine-tune them for your use case.