iText PDF bad character conversion - java

I have a PDF to read that is driving me crazy.
The PDF is the electricity bill (in Italian) of a customer, and he wants me to read the text from it.
Now the problem: when I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters...
After a lot of research I found my answer: the PDF contains all the fonts, but it does not contain the corresponding CMap that would allow the text to be exported. I found a link about this, but it refers to an older version of iText (I'm using version 5.5.5).
What I want to achieve, if possible, is the conversion of the text from glyph codes to Unicode.
I've found some references to CMap-related classes, but I don't know how to use them, and there are apparently no examples on the net :(
This is what I've tried:
import java.io.FileOutputStream;
import java.io.PrintWriter;
import com.itextpdf.text.pdf.PdfEncodings;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

PdfReader reader = new PdfReader("MyFile.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt)); // txt: output file path
TextExtractionStrategy strategy =
        parser.processContent(1, new SimpleTextExtractionStrategy());
String extractedText = strategy.getResultantText();
// Attempt to remap the extracted glyph codes to Unicode:
String cmapFile = "UnicodeBigUnmarked";
byte[] bytes = extractedText.getBytes();
String cid = PdfEncodings.convertToString(bytes, cmapFile);
The cid string comes out as a sequence of what look like Japanese characters.
I also tried:
FontFactory.registerDirectory("myDirectoryWithAllFonts");
just before attempting the conversion. This seems to give no results either.
Any help will be appreciated.

You say: "When I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters." I assume that you are talking about selecting text in Adobe Reader and trying to paste it into a text editor.
If this doesn't succeed, you have a PDF that doesn't allow you to extract text, because the text isn't stored in the PDF correctly. Watch this video for the full explanation.
Let's take a look at your PDF from the inside:
We see the start of a text object (where it says BT, which stands for Begin Text). A font /C2_1 is defined with font size 1. At first sight this may look odd, but the font will be scaled to size 6.9989 by a transformation. Then we see some text arrays containing strings of double-byte characters, such as I R H E Z M W M S R I H I P.
How should iText interpret these characters? To find out, we need to look at the encoding that is used for the font corresponding to /C2_1:
Aha: the Unicode characters stored in the content stream correspond to IRHE ZMWMSRI HIP and so on. That's exactly what we see when we convert the PDF to text using iText.
But wait a minute! How come we see other characters when we look at the PDF in Adobe Reader? Well, characters such as I, R, H and so on are addresses that correspond to the "program" of a glyph. This program is responsible for drawing the character on the page. One would expect that the character I would correspond to the glyph (or "the drawing", if you prefer that word) of the letter I. No such luck in your PDF.
Now what does Adobe do when you use "Copy with formatting"? Plenty of magic that currently isn't implemented in iText. Why not? Hmm... I don't know the budget of Adobe, but it's probably much, much higher than the budget of the iText Group. Extracting text from documents that contain confusing information about fonts isn't on the technical roadmap of the iText Group.
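If you want to check for yourself whether the fonts in a PDF like this carry a /ToUnicode CMap at all (the mapping a text extractor needs to turn glyph codes back into Unicode), a quick inspection with iText 5 is possible. This is only a diagnostic sketch, not part of the original answer; the file name and the choice of page 1 are placeholders:

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class ToUnicodeCheck {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("MyFile.pdf");
        // Inspect the font resources of page 1.
        PdfDictionary resources = reader.getPageN(1).getAsDict(PdfName.RESOURCES);
        PdfDictionary fonts = resources == null ? null : resources.getAsDict(PdfName.FONT);
        if (fonts != null) {
            for (PdfName key : fonts.getKeys()) {
                PdfDictionary font = fonts.getAsDict(key);
                // Without a /ToUnicode CMap, extractors have no way to map
                // the glyph codes in the content stream back to Unicode.
                boolean hasToUnicode = font.get(PdfName.TOUNICODE) != null;
                System.out.println(key + " -> ToUnicode present: " + hasToUnicode);
            }
        }
        reader.close();
    }
}

If this prints false for /C2_1, it confirms the diagnosis above: the font simply doesn't ship the information needed for text extraction.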

Related

PdfBox text extraction not working properly

PDFTextStripper stripper = new PDFTextStripper(); // inputStream below: the PDF as an InputStream
PDDocument document = PDDocument.load(inputStream);
String text = stripper.getText(document);
Extracted text: http://pastebin.com/BXFfMy0z
Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
What can I do to extract correct text from this pdf file?
In addition to @karthik27's answer:
Adobe Reader is fairly good at text extraction and can, therefore, generally be used as an indicator of whether text extraction from a given document is possible at all.
Thus, whenever you have a document your own text extraction cannot handle, open it in Adobe Reader and try copying and pasting from it. If that results in garbage, the document most likely is not authored properly for text extraction, either by mistake or by design.
In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, i.e. garbage, just like you got with PDFBox. Most likely, therefore, nothing short of OCR will allow text extraction from it.
I think the problem is encoding: the text in the PDF is encoded in a different format. If you right-click on the document and open the document properties, you can find the encoding. I think the links below give more explanation:
link1
link2
The original file should contain a mapping to Unicode. That part is absent, which is why you get broken text after extraction.
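To verify this with PDFBox itself, you can check whether each font in the document carries a /ToUnicode entry. A minimal sketch, assuming the PDFBox 2.x API (in 1.x the classes live in different packages):

import java.io.File;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class FontCheck {
    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("inserat_leiter_pm.pdf"))) {
            int pageNo = 1;
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // A font without /ToUnicode cannot be mapped back to text.
                    boolean hasToUnicode = font.getCOSObject()
                            .containsKey(COSName.getPDFName("ToUnicode"));
                    System.out.println("page " + pageNo + ", " + name.getName()
                            + " -> ToUnicode present: " + hasToUnicode);
                }
                pageNo++;
            }
        }
    }
}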

Program with PDFBox searching for words

I would like to make a program that searches for words in a PDF using PDFBox.
Here is my little program:
List<String> words;                                // list of words to search for
PDDocument document = PDDocument.load("D:\\INIT.pdf");
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Comparing(content, words);                         // method that searches for those words in the text
System.out.println(content);
But is it possible to search the PDF directly, without first extracting the text with getText?
Also, getText returns a String. If the PDF contains a lot of text, can a String hold all of it, or is there another type to use when the text is too big for a String?
I hope you find a solution for this within PDFBox.
The whole process is rather more difficult than it seems. For example, PDF text is broken into discontinuous fragments, and spaces are often represented as gaps rather than space characters. There's a need both to abstract the fragments and also to retain the link between the human-readable text and the underlying fragments within the PDF. It is quite tricky.
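That said, if plain extracted text is good enough for your purposes, a page-by-page search is easy to write with PDFBox alone. A minimal sketch, assuming the PDFBox 2.x API; the word list and file path are placeholders. Restricting the stripper to one page at a time also sidesteps the concern about holding the whole document in a single String:

import java.io.File;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class WordSearch {
    public static void main(String[] args) throws Exception {
        List<String> words = Arrays.asList("invoice", "total"); // placeholder words
        try (PDDocument document = PDDocument.load(new File("D:\\INIT.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                // Extract one page at a time instead of the whole document.
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                String content = stripper.getText(document);
                for (String word : words) {
                    if (content.contains(word)) {
                        System.out.println("'" + word + "' found on page " + page);
                    }
                }
            }
        }
    }
}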
Anyhow, if you don't find a satisfactory solution within PDFBox, ABCpdf will do this for you. For example, the link below shows how to find and highlight keywords in a PDF.
http://www.websupergoo.com/helppdf9net/source/8-abcpdf.operations/8-textoperation/1-methods/group.htm
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)

How to get given paragraph content of pdf file using iText library?

Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I saw classes like Paragraph and Chunk in some code that creates a new PDF file, but I cannot find any way to get at these classes when reading a file. Every idea is appreciated.
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line, and so on.
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.
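A sketch along the lines of that example, using iText 5's TaggedPdfReaderTool: it dumps the structure tree to XML, where paragraphs of a properly tagged PDF appear as <P> elements that you can then count or read individually. The file names are placeholders:

import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.TaggedPdfReaderTool;

public class ParseTagged {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("tagged.pdf");
        // Dump the structure tree (with P, H1, Table, ... tags) to XML;
        // paragraphs can then be read as <P> elements.
        new TaggedPdfReaderTool().convertToXml(reader, new FileOutputStream("structure.xml"));
        reader.close();
    }
}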

iText PDF Text Extraction with fonts and styles

I am using iText to extract text from a PDF into a String, but I have encountered a problem with some PDFs. When I try to extract the text, the reader extracts only blanks or destroyed text on SOME PDFs.
Example of destroyed text:
"th isbe long to t he t est fo r extr act ion tex t"
What is the cause of this problem?
I am thinking of removing the fonts and changing the font to a suitable one that the reader can read. I have tried researching this, but what I found does not help me.
This is caused by the way text is stored in the PDF file. The file just puts letters on the page, with information about rendering and location. The text extraction algorithm is smart in that it finds letters that seem to be close together and, if so, puts them together. If they aren't that close, it inserts a space.
I can't tell you what to do about it, though.
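One thing that is worth trying, though: the result depends on which TextExtractionStrategy you use, and LocationTextExtractionStrategy sorts text chunks by their position on the page before joining them, which sometimes repairs exactly this kind of scrambled output. A hedged sketch, with no guarantee it helps for your particular file (the file name is a placeholder):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

public class StrategyComparison {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("problem.pdf");
        // Order chunks as they appear in the content stream:
        String simple = PdfTextExtractor.getTextFromPage(
                reader, 1, new SimpleTextExtractionStrategy());
        // Order chunks by their position on the page:
        String byLocation = PdfTextExtractor.getTextFromPage(
                reader, 1, new LocationTextExtractionStrategy());
        System.out.println("Simple:\n" + simple);
        System.out.println("By location:\n" + byLocation);
        reader.close();
    }
}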

Read Text and Image Locations (x.y coordinates) using PDFBox

I am writing a Java program to read encrypted PDF files and extract the contents page by page, including the text, the images, and their positions (x,y coordinates) in the file. I'm using PDFBox for this, and I'm getting the text and the images, but I can't get the text positions or the image positions. There are also some problems reading certain encrypted PDF files.
Take a look at org.apache.pdfbox.examples.util.PrintTextLocations. I've used it quite a bit, and it's very helpful for analysing the layout of elements and bounding boxes in PDF documents. It has also revealed items printed in white ink, or outside the printable area (presumably document watermarks, or "forgotten" items pushed out of sight by the author).
Usage example:
java -cp app/target/pdfbox-app-1.5.0.jar org.apache.pdfbox.examples.util.PrintTextLocations ~/tmp/mydoc.pdf >~/tmp/out-text-locations.txt
You'll get something like this:
Processing page: 0
String[53.9,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=4.6679993]A
String[58.568,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=2.6640015]f
String[61.232002,59.856995 fs=-6.0 xscale=6.0 height=-3.666 space=1.3320001 width=1.6679993]e
...
You can easily parse this output and use it to plot each element's position and bounding box, as well as the "flow" (the trajectory through all the elements) for each page.

As I'm sure you are already aware, PDF can be almost impossible to convert to text. It is really just a graphic description format (i.e. for the printer or the screen), not a markup language. You could easily make a PDF that prints "Hello world" while jumping randomly through the character positions (and that uses different glyphs than any ISO char encoding, if you so choose), making the PDF very hard to convert to text. There is no notion of "word" or "paragraph". A two-column document, for example, can be a nightmare to parse into text.
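If you'd rather collect the positions in your own code than parse the console output of that example, you can subclass PDFTextStripper and override processTextPosition, which PDFBox calls once per glyph with its coordinates. A minimal sketch, assuming the PDFBox 2.x API (in the 1.x versions mentioned above, PDFTextStripper lives in org.apache.pdfbox.util and getUnicode() is called getCharacter()); the file name is a placeholder:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class TextLocationStripper extends PDFTextStripper {

    public TextLocationStripper() throws IOException {
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Called once per glyph; print its character and position.
        System.out.println("'" + text.getUnicode() + "' at ("
                + text.getXDirAdj() + ", " + text.getYDirAdj() + ")");
        super.processTextPosition(text);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("mydoc.pdf"))) {
            TextLocationStripper stripper = new TextLocationStripper();
            // getText drives the parsing; its return value is ignored here.
            stripper.getText(document);
        }
    }
}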
For the second part of your question, I had good results using xpdf version 3.02, after patching Xref.cc (making XRef::okToPrint(), XRef::okToChange(), XRef::okToCopy() and XRef::okToAddNotes() all return gTrue). That handles locked documents, not encrypted ones (there are other utilities out there for those).
