Setting character spacing with PDFBox - java

I'm currently using Java and the PDFBox library to create some PDFs on the fly.
I need to be able to set the character spacing/tracking of some text but can't seem to figure it out.
It looks as though there is a method to do so: http://ci.apache.org/projects/pdfbox/javadoc/index.html?org/apache/pdfbox/util/operator/SetCharSpacing.html
But I'm not quite sure how to apply it in my situation.
cs.beginText();
cs.setFont( font, fontSize );
cs.setNonStrokingColor(color);
cs.moveTextPositionByAmount(position[0], position[1]);
cs.drawString(text);
cs.endText();
Any help would be appreciated! Thanks.

You need to do it the hard way, because the "Tc" operator isn't supported by the PDPageContentStream class:
cs.appendRawCommands("0.25 Tc\n");
The SetCharSpacing class you mentioned is used when parsing existing PDFs, not when writing new ones.
PS: don't forget to call close after finishing writing to the content stream!
PPS: setCharacterSpacing() is available in version 2.0.4 and higher.
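A minimal sketch of building that raw operator string safely, for versions where you still have to append raw commands. The class and method names (CharSpacingOperator, charSpacing) are made up for illustration and are not part of PDFBox. The only point is to keep the number in plain decimal notation with a dot, since PDF content streams accept neither "0,25" nor "1.0E-4".

```java
import java.math.BigDecimal;

final class CharSpacingOperator {

    /**
     * Builds the raw "Tc" (character spacing) line for a PDF content stream.
     * Hypothetical helper, not a PDFBox API.
     */
    static String charSpacing(float spacing) {
        if (Float.isNaN(spacing) || Float.isInfinite(spacing)) {
            throw new IllegalArgumentException("spacing must be a finite number");
        }
        // Float.toString may produce scientific notation (e.g. "1.0E-4");
        // going through BigDecimal.toPlainString() avoids that and is locale independent.
        String number = new BigDecimal(Float.toString(spacing))
                .stripTrailingZeros()
                .toPlainString();
        return number + " Tc\n";
    }

    public static void main(String[] args) {
        // The result would be handed to cs.appendRawCommands(...) from the answer above.
        System.out.print(charSpacing(0.25f)); // prints "0.25 Tc"
    }
}
```

Remember to emit a "0 Tc" line afterwards if later text should go back to normal spacing, because the setting stays in effect for the rest of the content stream's graphics state.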

Related

opencv find block of text areas / detect document layout

I have a color document image containing text, images, and tables.
The document can have two columns.
The document is composed of areas, each with a header and text (the header uses a bigger font, can have a different font color, and may be followed by something like a sub-header with additional data).
This is an example image, but the real one can be in color:
What I need to do:
I need to find these text areas, together with their headers, in the document image.
What I need to know:
A method for dividing the document into its individual parts.
I tried with OpenCV in Java (if someone has a Python or C++ version, I can convert it to Java myself). I found a few similar problems on Stack Overflow, but none of them helped me. Note that my OpenCV knowledge is limited and comes only from online tutorials and Stack Overflow.
Is there a good way to solve this with OpenCV, or do I need to use something else, such as a different library or application?
The one and only requirement is that it must be runnable from the command line.
Once I have these areas I can do the rest of what I need, but this is the step that is blocking me.
Have you solved the problem?
I'm working on a similar problem.
My solution is to use HoughLines: https://docs.opencv.org/3.4.0/d9/db0/tutorial_hough_lines.html
You can use text detection combined with dilation to detect bold text, i.e. headers, and then group the text boxes between two consecutive headers as the text under the first header.
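The grouping step from that last suggestion can be sketched in plain Java, independent of OpenCV. Everything here is an assumption for illustration: the Box class, its fields, and the header flag stand in for whatever your text detection and dilation step produces, and the sketch only handles a single column (for two columns you would split the boxes by x position first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderGrouping {

    /** Hypothetical detected text box: a label, its top y coordinate, and a header flag. */
    static final class Box {
        final String label;
        final int top;
        final boolean header;

        Box(String label, int top, boolean header) {
            this.label = label;
            this.top = top;
            this.header = header;
        }
    }

    /** Assigns every non-header box to the closest header above it (single column only). */
    static Map<String, List<String>> groupUnderHeaders(List<Box> boxes) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt((Box b) -> b.top)); // top to bottom

        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = null;
        for (Box b : sorted) {
            if (b.header) {
                // A header starts a new section; duplicate labels would overwrite each other.
                current = new ArrayList<>();
                sections.put(b.label, current);
            } else if (current != null) {
                current.add(b.label);
            }
            // Boxes above the first header are ignored in this sketch.
        }
        return sections;
    }

    public static void main(String[] args) {
        List<Box> boxes = Arrays.asList(
                new Box("p2", 80, false),
                new Box("H1", 10, true),
                new Box("p1", 30, false),
                new Box("H2", 60, true));
        System.out.println(groupUnderHeaders(boxes)); // prints {H1=[p1], H2=[p2]}
    }
}
```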

Workaround for known Bug in PDFBox

First of all: my goal is simply to load a PDF, highlight words on a page of that PDF, and show that page to the user as an image.
So far I parse the PDF with a custom text stripper to get all word positions with their coordinates (needed to generate a rectangle for highlighting later).
After that I started generating PDAnnotationTextMarkup objects. I am now at the point where I can see my annotations fine if I save the PDF to a file and view it with a PDF reader of my choice. But if I use the convertToImage method provided by PDFBox, I only get a normal page rendered without annotations.
After a little time on Google I found PDFBOX-2019, which was mentioned in another Stack Overflow question.
Now I'm looking for a workaround, because the ticket history suggests that nobody is going to fix the issue within about a year.
Does anybody have a good idea for how to work around this and achieve my goal?
Thanks in advance,
ben

Add Background Image and Add Text on the Image

I am new to the iText library. My requirement is that my servlet creates a mark sheet (PDF). It should add an image covering the complete page of the document and place text at specific locations on top of that image.
Can you please help?
It is unclear what the parameter text is about. Maybe you picked the direct content that goes under the image, but that's not the main issue.
You must have read some documentation, because you're using beginText(), setFontAndSize(), showText() and endText(), but you didn't read the documentation very well because:
(1) You use lineTo() without a moveTo() first and without a stroke() after. In other words: you're creating a strange path that is never drawn.
(2) You use showText(), but I don't see you defining coordinates for the text anywhere. What happened to your setTextMatrix() method?
(3) You're a newbie, but instead of using simple code, such as:
ColumnText.showTextAligned(canvas, Element.ALIGN_LEFT,
new Phrase("This is a test"), 100, 100, 0);
it seems like you want to be able to run before you've learned to walk.
Also: you're probably using an old version of iText, because you don't mention that an exception is thrown when you use the illegal statement lineTo() inside a text block. You can't use lineTo() inside a beginText()/endText() sequence.
Please follow the advice given by mkl and read the documentation first.

itext font UnsupportedCharsetException

I am trying to create pdf documents using iText (version 5.4.0) in a java web application and I have come across an issue with fonts.
The web application is multilingual, so users may save information into the system in various languages (e.g. English, French, Lithuanian, Chinese, Japanese, Arabic, etc.).
When I tried to configure the pdf to output some sample japanese text it didn't show up, so I started following the examples in the official "iText in Action" book. The problem I have encountered is that when I try and configure a font with BaseFont.IDENTITY_H encoding I get the following error:
java.nio.charset.UnsupportedCharsetException: Identity-H
at java.nio.charset.Charset.forName(Charset.java:505)
at com.itextpdf.text.pdf.PdfEncodings.convertToBytes(PdfEncodings.java:186)
at com.itextpdf.text.pdf.Type1Font.<init>(Type1Font.java:276)
at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:692)
at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:615)
at com.itextpdf.text.pdf.BaseFont.createFont(BaseFont.java:450)
Nothing in the book or searching Google mentions this issue.
Any suggestions as to what I might have missed?
As you probably understood from the answers from two Michaels, you made the wrong assumption that the standard Type 1 font Times Roman and IDENTITY_H are compatible. You'll have to change the font if you want to use IDENTITY_H, or change the encoding if you want to use a standard Type 1 font (in which case using BaseFont.EMBEDDED doesn't make sense because standard Type 1 fonts are never embedded). I'm sorry if I didn't mention this in my book. I thought it was kind of trivial. One can deduce it from what I wrote about composite fonts.
I don't think there's any one encoding that works for all languages, with font embedding. For example, you'd assume that choosing the UTF-8 encoding, with font embedding set to true will embed the font, but it doesn't.
I find myself having to do this, because I don't know the language of the text ahead of time:
try {
    // Try to embed the font.
    // This doesn't work for type 1 fonts.
    return FontFactory.getFont(fontFace, BaseFont.IDENTITY_H,
            true, fontSize, fontStyle, textColor);
} catch (ExceptionConverter e) {
    return FontFactory.getFont(fontFace, "UTF-8", true,
            fontSize, fontStyle, textColor);
}
(The exception class may be different since I'm using an older version of iText -- 2.1.)
As with a lot of iText stuff, this is poorly documented, and makes the easy stuff unnecessarily hard.

Splitting a pdf with pdfbox, but losing the font

I wrote some code in Java using the PDFBox API that splits a PDF document into its individual pages, looks through the pages for a specific string, and then makes a new PDF from the page with the string on it. My problem is that when the new page is saved, I lose my font. I made a quick Word document to test it and the default font was Calibri, so when I run the program I get an error box that reads: "Cannot extract the embedded font..." It then replaces the font with some other default.
I have seen a lot of example code that shows how to change the font when you are inputting text to be placed in the pdf, but nothing that sets the font for the pdf.
If anyone is familiar with a way to do this, (or can find documentation/examples), I would greatly appreciate it!
Edit: forgot to include some sample code
if (pageContent.indexOf(findThis) >= 0) {
    PDPage pageToRip = pages.get(i);
    // >> set the font of pageToRip here
    res.importPage(pageToRip); // res is the new document that will be saved
}
I don't know if that helps any, but I figured I'd include it.
Also, this is what the change looks like if the pdf is written in calibri and split:
Note: This might be a nonissue, it depends on the font used in the files that will need to be processed. I tried some things besides Calibri and it worked out fine.
From How to extract fonts from a PDF:
You actually cannot extract a font from a PDF, not even if the font is fully embedded. There are two reasons why this is not feasible:
• Most fonts are copyrighted, making it illegal to use an extractor.
• When a font is embedded in a PDF, not all of the font data are included. Obviously the font outline data are included as well as the font width tables. Other information, such as data about ligatures, are irrelevant within the PDF so those data do not get enclosed in a PDF.
I am not aware of any font extraction tools but if you come across one, the above reasons should make it clear that these utilities are to be avoided.
