tl;dr:
In iText, is there a way to access the font's name, or figure out the language of a font that has been applied to a Phrase from a FontSelector?
This question relates to an issue we've been having on the printwikipedia project --- github -- issue .
I have a body of text that I do not have control over coming into a FontSelector to be processed. Some of that text is in Arabic and some is in Hebrew, and I am trying to figure out the best way of detecting the type of font in order to have it print correctly, as seen here: http://developers.itextpdf.com/question/how-create-persian-content-pdf
using
pdfCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
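In context, the cell setup looks roughly like this (a sketch assuming iText 5; the one-column table is a placeholder, and ph stands for a Phrase produced by the FontSelector, as shown further down):

// Sketch of the cell setup; table shape and surrounding document are placeholders.
PdfPTable table = new PdfPTable(1);
PdfPCell pdfCell = new PdfPCell(ph); // ph is a Phrase produced by the FontSelector
pdfCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
table.addCell(pdfCell);
document.add(table);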
After running the strings through the FontSelector to become Phrases and then placing them within a cell formatted with the code above, all the text ends up on the right.
Formatting of this text is very important, so I cannot have all of my text aligned to the right; it should only be right-aligned when it is meant to be read right to left. So what I believe should be the best course of action is to detect the font that has been applied to the Phrase and then alter the cell if necessary.
public FontSelector fs = new FontSelector();
// add a whole lot of fonts to fs
// incoming line of some sort of text
Phrase ph = fs.process(line);
System.out.println(ph.getFont().toString());
The above code outputs extremely varied results: pretty much a new id for every font created for each piece of text. I can't figure out a way to compare a font that exists in the FontSelector object with a font that has been applied to the incoming text.
Is this the best method to figure out what fonts are in a phrase?
How can I access the font's name, or figure out the language of a font that has been applied to a Phrase from a FontSelector?
I went through the iText documentation a bit more carefully and discovered my answer.
Phrases are made up of Chunks, which contain the fonts. If one simply calls .getChunks() on the output Phrase and then iterates over the chunks, one can compare the fonts by calling .getFont() on each Chunk in the returned list and proceed to apply whatever styles one wishes from there.
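For example (a minimal sketch, assuming iText 5; arabicFont stands for one of the fonts previously added to the FontSelector, and pdfCell for the target cell):

Phrase ph = fs.process(line);
for (Chunk chunk : ph.getChunks()) {
    Font font = chunk.getFont();
    BaseFont bf = font.getBaseFont();
    // compare against the RTL font registered with the FontSelector
    if (bf != null && bf.getPostscriptFontName().equals(
            arabicFont.getBaseFont().getPostscriptFontName())) {
        pdfCell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
    }
}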
I would like to make a program that searches for words in a PDF using PDFBox.
Here is my little program:
List<String> words = new ArrayList<String>(); // words to search for
PDDocument document = PDDocument.load("D:\\INIT.pdf");
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Comparing(content, words); // method for searching those words in the text
System.out.println(content);
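(The Comparing method is not shown in the question; a naive sketch of such a word search, using a simple substring match, might be:)

static void Comparing(String content, List<String> words) {
    for (String word : words) {
        if (content.contains(word)) {
            System.out.println("Found: " + word);
        }
    }
}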
But is it possible to search the PDF directly, without first extracting the text with getText?
getText returns a String. If the PDF contains a very large amount of text, can a String hold it all, or is there another type to use when the text is too big for a String?
I hope you find a solution for this within PDFBox.
The whole process is rather more difficult than it seems. For example, PDF text is broken into discontinuous fragments, and spaces are often represented as gaps rather than space characters. There's a need both to abstract the fragments and to retain the link between the human-readable text and the underlying fragments within the PDF. It is quite tricky.
Anyhow, if you don't find a satisfactory solution within PDFBox, ABCpdf will do this for you. For example, the link below shows how to find and highlight keywords in a PDF.
http://www.websupergoo.com/helppdf9net/source/8-abcpdf.operations/8-textoperation/1-methods/group.htm
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)
Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I saw classes like Paragraph and Chunk in some code that creates a new PDF file, but I cannot find any way to get at these classes when reading a file. Every idea is appreciated.
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.
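For reference, a minimal sketch of that approach, assuming iText 5 and a correctly tagged input file ("tagged.pdf" and "structure.xml" are placeholders):

PdfReader reader = new PdfReader("tagged.pdf");
TaggedPdfReaderTool tool = new TaggedPdfReaderTool();
// writes the structure tree as XML; paragraphs show up as <P> elements
tool.convertToXml(reader, new FileOutputStream("structure.xml"));
reader.close();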
I have a scenario where I need a Java app to be able to extract content from a PDF file in one of 2 modes: TEXT_ONLY or ALL. In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings. In all mode, all content (text, images, etc.) is read out of the file.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
while (page.hasMoreText()) {
    textList.add(page.nextTextChunk());
}
if (allMode) {
    while (page.hasMoreImages()) {
        imageList.add(page.nextImage());
    }
}
I know Apache Tika uses PDFBox under the hood, but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from PDFBox).
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
To expound on some aspects of why @markStephens points you towards some resources giving background on PDF.
In text mode, only visible text ("visible" as if a human being was reading the PDF) is read out into strings.
Your definition "visible" as if a human being was reading the PDF is not yet very well-defined:
Is text 1 pt in size visible? When zooming in, a human can read it; at standard magnification, though, one cannot. Which size would be the limit?
Is text in RGB (128, 129, 128) on a background of (128, 128, 128) visible? How different do the colors have to be?
Is text displayed in some white-noise pattern on a background of some other white-noise pattern visible? How different do the patterns have to be?
Is text only partially on-screen visible? If yes, is one visible pixel enough? And what about some character 'I' in a giant size where the visible page area fits into the dot on the letter?
What about text covered by some annotation which can easily be moved, probably even by some automatically executed JavaScript code in the file?
What about text in some optional content group only visible when printing?
...
I would expect most available PDF text parsing libraries to ignore all these circumstances and extract the text, at most respecting a crop box. In the case of images with added, invisible OCR'ed text, extraction of that text is generally desired.
For instance, if a PDF file was to have 1 page in it, and that page had 3 paragraphs of contiguous text, and was word-wrapping 2 images, then TEXT_ONLY would extract all 3 paragraphs, and ALL would extract all 3 paragraphs and both images:
PDF (in general) does not know about paragraphs, just some groups of glyphs positioned somewhere on the page. Recognizing paragraphs is a task which cannot be guaranteed to work properly, as heuristics are at work. If, furthermore, you have multi-column text with an irregular separation, maybe even some image in between (making it hard to decide whether there are two columns divided by the image or one column with an integrated image), you can count on recognition of the text flow, let alone of text elements like paragraphs, sections, etc., failing miserably.
If your PDFs are either properly tagged or all generated by a tool chain for which patterns in the created PDF content streams betray the text structures, you may have more luck. In the case of the latter, though, your solution would have to be custom-made for that tool chain.
but am worried that this kind of functionality is shaded/prohibited by Tika (in which case, I probably need to do this directly from pdfBox).
There you point towards another point of interest: PDFs can be marked to forbid text extraction while they otherwise can be displayed by anyone. While technically PDFs marked like that can be handled just like documents without that mark, with just one decoding step (essentially they are encrypted with a publicly known password), doing so is clearly acting against the declared intention of the author and violating their copyright.
So I ask: is this possible, and if so, which library is more appropriate for me to use? Am I going about this entirely the wrong way? Any pitfalls/caveats I am not considering here?
As long as you expect 100% accuracy for generic input, you should reconsider your architecture.
If, on the other hand, the PDFs are all you have and a solution that is as effective as possible is OK, there are multiple libraries for you, iText and PDFBox to name but two, and there are more. Which is best for you depends on further factors, e.g. on whether you need a generic solution or all PDFs are created by a tool chain as above.
In any case you'll have to do some programming yourself, though, to fine-tune them for your use case.
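As a rough starting point, a sketch of the two modes with PDFBox 2.x (the input path and the allMode flag are placeholders, and this deliberately ignores all the visibility questions above; it also misses images nested in form XObjects and inline images):

PDDocument doc = PDDocument.load(new File("input.pdf"));
String text = new PDFTextStripper().getText(doc);   // TEXT_ONLY part
List<BufferedImage> images = new ArrayList<BufferedImage>();
if (allMode) {                                      // ALL part
    for (PDPage page : doc.getPages()) {
        PDResources res = page.getResources();
        for (COSName name : res.getXObjectNames()) {
            PDXObject xo = res.getXObject(name);
            if (xo instanceof PDImageXObject) {
                images.add(((PDImageXObject) xo).getImage());
            }
        }
    }
}
doc.close();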
I have an application I'm writing which reads a docx file. It appears that I may need to read the formatting of the text, and not just the content. I have googled the matter, but I'm struggling to find a search term that turns up what I'm looking for; most results point me to using formatted text inputs and the like.
Does anyone know what class I should be using?
Apache POI should give you (at least some) access to Excel styles, such as colors, fonts, etc. I'm not sure about exotic cases, but it's certainly possible to obtain the cell font color, for example. The following code works for me:
XSSFCell cell = ...
if (IndexedColors.WHITE.getIndex() == cell.getCellStyle().getFont().getColor()) {
...
}
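In context that might look like this (a sketch; the workbook path and cell address are placeholders):

XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream("book.xlsx"));
XSSFCell cell = wb.getSheetAt(0).getRow(0).getCell(0);
// compare the cell font's indexed color against a known color
if (IndexedColors.WHITE.getIndex() == cell.getCellStyle().getFont().getColor()) {
    System.out.println("Cell text is white");
}
wb.close();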
I am writing a Java program to read encrypted PDF files and extract the contents page by page, including the text, images, and their positions (x,y coordinates). I'm using PDFBox for this purpose and I'm getting the text and images, but I couldn't get the text and image positions. There are also some problems reading some encrypted PDF files.
Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful for analyzing the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like this:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
Which you can easily parse and use to plot each element's position, bounding box, and the "flow" (trajectory through all the elements) for each page. As I'm sure you are already aware, you'll find that PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world" but that jumps randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of a "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
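If you want the positions programmatically rather than by parsing that output, PrintTextLocations is essentially a PDFTextStripper subclass; a minimal sketch of the same idea (assuming the PDFBox 1.x API matching the jar above, where document is an open PDDocument):

PDFTextStripper stripper = new PDFTextStripper() {
    @Override
    protected void processTextPosition(TextPosition text) {
        // each TextPosition carries one glyph plus its coordinates and metrics
        System.out.println(text.getCharacter()
                + " at x=" + text.getX() + ", y=" + text.getY());
        super.processTextPosition(text);
    }
};
stripper.getText(document);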
For the second part of your question, I had good results using xpdf version 3.02, after patching XRef.cc (making XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That handles locked documents, not encrypted ones (there are other utils out there for that).