I am using iText to extract text from a PDF into a String, but I have run into a problem with some PDFs. On some of them, the extraction returns only blanks or destroyed text.
Example of destroyed text:
"th isbe long to t he t est fo r extr act ion tex t"
What is the cause of this problem?
I am thinking of replacing the fonts with one that the reader can handle. I have researched this, but nothing I found helps me.
This is caused by the way text is stored in the PDF file: it just places letters together with rendering and positioning information. The text extraction algorithm tries to be smart in that it finds letters that appear close together and joins them; if they aren't that close, it inserts a space.
I can't tell you what to do about it, though.
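That said, comparing the two built-in strategies at least shows whether the spacing heuristic is the culprit. Here is a minimal sketch, assuming iText 5.x ("input.pdf" is a placeholder file name):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

public class CompareStrategies {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf"); // placeholder file name
        // "simple" concatenates the text in content-stream order;
        // "located" sorts the chunks by position and inserts spaces based on gaps
        String simple = PdfTextExtractor.getTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
        String located = PdfTextExtractor.getTextFromPage(reader, 1, new LocationTextExtractionStrategy());
        System.out.println(simple);
        System.out.println(located);
        reader.close();
    }
}

If both strategies produce the same scrambled output, the text really is stored as those tiny fragments and extra post-processing would be needed.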
Related
I am trying to extract accented words from a PDF e-book. The best results are produced when using the iText library, but I fail to get the accents from the words.
Example:
побеђивање -should come out as- побеђи́ва̄ње (accents are missing)
The letters are Serbian Cyrillic.
I tried many of the OCR solutions but they all give bad results. Is there a way for me to extract all of this PDF data the way it appears in the PDF using iText? I know that this has a lot to do with the way PDF works and that this is a hard thing to get, but I really need this; the alternative is to retype all of the data.
The PDF file: pdf example file
The sample document actually contains one big image, a scanned page, and invisible text information on top of the scanned printed letters. Most likely this text information is the result of some OCR process.
Unfortunately, this text information is already missing the accents in question. E.g. the text for the first entry
is added as
(\340\361\362\340\353\367\355)Tj 0 Tc (\236)Tj
...
As you can see, the same letter \340 is used at positions 1 and 4, while according to the scanned page one of the matching printed letters has an accent and the other does not.
This happens throughout the whole page.
Thus, any attempt at regular text extraction will fail to return the accents in question. The only chance you have is to use OCR.
You say you
tried many of the OCR solutions but they all give bad results
Probably you applied the OCR applications to the PDF or a rendered version of it. I would suggest you instead extract the scanned images; this way you get all the quality there is. iText can help you with image extraction.
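If it helps, here is a minimal sketch of such an image extraction, assuming iText 5.x; the input file name and the output naming scheme are placeholders:

import java.io.FileOutputStream;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class ExtractScannedImages {

    static int counter = 0;

    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("scanned.pdf"); // placeholder file name
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            parser.processContent(page, new RenderListener() {
                public void beginTextBlock() {}
                public void endTextBlock() {}
                public void renderText(TextRenderInfo renderInfo) {}
                public void renderImage(ImageRenderInfo renderInfo) {
                    try {
                        PdfImageObject image = renderInfo.getImage();
                        if (image == null) return;
                        // write the raw image bytes using their native file type as the extension
                        FileOutputStream os = new FileOutputStream(
                                "image-" + (++counter) + "." + image.getFileType());
                        os.write(image.getImageAsBytes());
                        os.close();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        reader.close();
    }
}

You can then feed the extracted images, at their original scan resolution, to the OCR tool of your choice.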
I have a PDF to read that is driving me crazy.
The PDF represents a customer's electricity bill (in Italian), and he wants me to read the text from it.
Now the problem: when I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters...
After a lot of research I found my answer: the PDF contains all the fonts, but not the corresponding CMap that would allow the text to be exported. I found this link, which however refers to an older version of iText (I'm using version 5.5.5).
What I want to achieve, if possible, is the conversion of the text from glyph codes to Unicode.
I've found some references to CMap-related classes but don't know how to use them, and apparently there are no examples on the net :(
This is what I've tried:
PdfReader reader = new PdfReader("MyFile.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy = parser.processContent(1, new SimpleTextExtractionStrategy());
String text = strategy.getResultantText();
String cmapFile = "UnicodeBigUnmarked";
byte[] encodedText = text.getBytes();
String cid = PdfEncodings.convertToString(encodedText, cmapFile);
The resulting cid is a sequence of what looks like Japanese characters.
I also tried adding:
FontFactory.registerDirectory("myDirectoryWithAllFonts");
just before attempting the conversion, but this doesn't seem to give any results either.
Any help will be appreciated.
You say: When I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters. I assume that you are talking about selecting text in Adobe Reader and trying to paste it into a text editor.
If that doesn't succeed, you have a PDF that doesn't allow text extraction because the text isn't stored in it correctly. Watch this video for the full explanation.
Let's take a look at your PDF from the inside:
We see the start of a text object (where it says BT which stands for Begin Text). A font /C2_1 is defined with font size 1. At first sight, this may look odd, but the font will be scaled to size 6.9989 in a transformation. Then we see some text arrays containing strings of double byte characters such as I R H E Z M W M S R I H I P.
How should iText interpret these characters? To find out, we need to look at the encoding that is used for the font corresponding with /C2_1:
Aha, the Unicode characters stored in the content stream correspond with the actual characters we need: IRHE ZMWMSRI HIP and so on. That's exactly what we see when we convert the PDF to text using iText.
But wait a minute! How come we see other characters when we look at the PDF in Adobe Reader? Well, characters such as I, R, H and so on are addresses that correspond with the "program" of a glyph. This program is responsible for drawing the character on the page. One would expect that in this case the character I would correspond with the glyph (or "the drawing", if you prefer that word) of the letter I. No such luck in your PDF.
Now what does Adobe do when you use "Copy with formatting"? Plenty of magic that currently isn't implemented in iText. Why not? Hmm... I don't know the budget of Adobe, but it's probably much, much higher than the budget of the iText Group. Extracting text from documents that contain confusing information about fonts isn't on the technical roadmap of the iText Group.
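If you want to verify for yourself whether the fonts in your PDF carry the /ToUnicode CMap that text extraction relies on, here is a minimal sketch using iText 5.x low-level objects ("bill.pdf" is a placeholder file name):

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class CheckToUnicode {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("bill.pdf"); // placeholder file name
        PdfDictionary page = reader.getPageN(1);
        PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
        PdfDictionary fonts = resources == null ? null : resources.getAsDict(PdfName.FONT);
        if (fonts != null) {
            for (PdfName key : fonts.getKeys()) {
                PdfDictionary font = fonts.getAsDict(key);
                // a font without /ToUnicode gives extractors no reliable way
                // to map glyph codes back to text
                System.out.println(key + " -> ToUnicode present: "
                        + (font.get(PdfName.TOUNICODE) != null));
            }
        }
        reader.close();
    }
}

If a font has no /ToUnicode entry, or has one that maps the codes to the wrong characters as in the PDF discussed above, registering fonts on the extraction side won't recover the original text.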
I would like to make a program that searches for words in a PDF using PDFBox.
Here is my little program:
List<String> words; // list of words to search for
PDDocument document = PDDocument.load("D:\\INIT.pdf");
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Comparing(content, words); // method that searches for those words in the extracted text
System.out.println(content);
But is it possible to search directly in the PDF instead of in the text returned by getText?
getText returns a String. If the PDF file contains a lot of text, can this String still hold all of it, or is there another type to use when the text is too big for a String?
I hope you find a solution for this within PDFBox.
The whole process is rather more difficult than it seems. For example PDF text is broken into discontinuous fragments and spaces are often represented as gaps rather than space characters. There's a need both to abstract the fragments and also to retain the link between the human-readable text and the underlying fragments within the PDF. It is quite tricky.
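That said, if plain substring matching on the extracted text is good enough, a minimal per-page search could look like the sketch below (PDFBox 1.x API assumed; the file name and search terms are placeholders):

import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class SearchWords {
    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("invoice", "total"); // placeholder search terms
        PDDocument document = PDDocument.load("D:\\INIT.pdf");
        PDFTextStripper stripper = new PDFTextStripper();
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            // restrict the stripper to one page at a time so hits can be reported per page
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            String content = stripper.getText(document);
            for (String word : words) {
                if (content.contains(word)) {
                    System.out.println("'" + word + "' found on page " + page);
                }
            }
        }
        document.close();
    }
}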
Anyhow if you don't find a satisfactory solution within PDFBox ABCpdf will do this for you. For example the link below shows how to find and highlight keywords in a PDF.
http://www.websupergoo.com/helppdf9net/source/8-abcpdf.operations/8-textoperation/1-methods/group.htm
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)
Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I saw classes like Paragraph and Chunk in code that creates a new PDF file, but I can't find any way to get these objects when reading a file. Every idea is appreciated.
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.
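If the document is tagged, a minimal sketch in the spirit of that example (iText 5.x assumed; the file names are placeholders) dumps the structure tree, including paragraph boundaries, as XML:

import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.TaggedPdfReaderTool;

public class DumpStructure {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("tagged.pdf"); // placeholder file name
        // writes the tag structure (P, H1, Table, ...) together with its content as XML;
        // this only works if the PDF actually contains tagging information
        new TaggedPdfReaderTool().convertToXml(reader, new FileOutputStream("structure.xml"));
        reader.close();
    }
}

From that XML you can count the P elements or read the content of a specific one.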
I am writing a Java program to read encrypted PDF files and extract the contents page by page, including the text, the images, and their positions (x, y coordinates) in the file. I'm currently using PDFBox for this and I'm getting the text and the images, but I can't get the text and image positions. There are also problems reading some encrypted PDF files.
Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit and it's very helpful to make analyses on the layout of elements and bounding boxes in PDF documents. It also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like that:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
Which you can easily parse and use to plot each element's position, bounding box, and the "flow" (trajectory through all the elements) for each page. As I'm sure you are already aware, you'll find that PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world" but jumps randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
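If you prefer to hook into those positions from your own code rather than parse the example's output, a minimal sketch that overrides PDFTextStripper (PDFBox 1.x API assumed; the file name is a placeholder) could look like this:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class TextWithPositions extends PDFTextStripper {

    public TextWithPositions() throws Exception {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // called once per glyph; print the character with its coordinates and font size
        System.out.println(text.getCharacter()
                + " x=" + text.getXDirAdj()
                + " y=" + text.getYDirAdj()
                + " fontSize=" + text.getFontSizeInPt());
    }

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("mydoc.pdf")); // placeholder file name
        new TextWithPositions().getText(document); // triggers processTextPosition for every glyph
        document.close();
    }
}

Image positions are a different matter: they require walking the content stream, e.g. with PDFBox's PDFStreamEngine along the lines of the PrintImageLocations example.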
For the second part of your question, I had good results using xpdf version 3.02, after fixing Xref.cc (making XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That's to handle locked documents, not encrypted ones (there are other utils out there for that).