PDDocument document = PDDocument.load(inputStream);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
document.close();
Extracted text: http://pastebin.com/BXFfMy0z
Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
What can I do to extract correct text from this pdf file?
In addition to @karthik27's answer:
Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction from a given document is possible at all.
Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.
In the case of your document, copying and pasting from Adobe Reader gives me the same semi-random collection of invisible and special characters that you got with PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
I think the problem is the encoding: the PDF text is encoded in a different format. If you right-click the document and open its document properties, you can find the encoding. I think the links below will give you more explanation:
link1
link2
The original file should contain a mapping to Unicode (a ToUnicode CMap). That part is absent, which is why you get broken text after extraction.
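For reference, a ToUnicode CMap is a small PostScript-like stream embedded with the font that maps glyph codes to Unicode code points. A minimal fragment (the hex values here are invented for illustration) looks like this:

```
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0049> <0044>   % glyph code <0049> maps to Unicode U+0044 ("D")
<0052> <0045>   % glyph code <0052> maps to Unicode U+0045 ("E")
endbfchar
endcmap
```

When this table is missing, extractors can only report the raw glyph codes, which is the garbage you are seeing.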
Related
I have a PDF to read that is driving me crazy.
The PDF is the electricity bill (in Italian) of a customer, who wants me to read the text from it.
Now the problem: when I copy and paste text from the PDF into Notepad, I get a bunch of incomprehensible characters...
After a lot of research I found the cause: the PDF contains all the fonts, but it does not contain the corresponding CMap that would allow the text to be exported. I found this link, which however refers to an older version of iText (I'm using version 5.5.5).
What I want to achieve, if possible, is the conversion of the text from glyph codes to Unicode.
I've found some references to CMap-related classes, but I don't know how to use them, and apparently there are no examples on the net :(
This is what I've tried:
PdfReader reader = new PdfReader("MyFile.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
strategy = parser.processContent(1, new SimpleTextExtractionStrategy());
String text = strategy.getResultantText();
String cmapFile = "UnicodeBigUnmarked";
byte[] bytes = text.getBytes();
String cid = PdfEncodings.convertToString(bytes, cmapFile);
The resulting cid is a sequence of Japanese-looking characters.
and also:
FontFactory.registerDirectory("myDirectoryWithAllFonts");
just before trying the conversion. This solution seems to give no results either.
Any help will be appreciated.
You say: when I copy and paste text from the PDF to Notepad, I get a bunch of incomprehensible characters. I assume you mean selecting text in Adobe Reader and trying to paste it into a text editor.
If this doesn't succeed, you have a PDF from which text can't be extracted, because the text isn't stored in the PDF correctly. Watch this video for the full explanation.
Let's take a look at your PDF from the inside:
We see the start of a text object (where it says BT which stands for Begin Text). A font /C2_1 is defined with font size 1. At first sight, this may look odd, but the font will be scaled to size 6.9989 in a transformation. Then we see some text arrays containing strings of double byte characters such as I R H E Z M W M S R I H I P.
How should iText interpret these characters? To find out, we need to look at the encoding that is used for the font corresponding with /C2_1:
Aha, the Unicode characters stored in the content stream correspond with the actual characters we need: IRHE ZMWMSRI HIP and so on. That's exactly what we see when we convert the PDF to text using iText.
But wait a minute! How come we see other characters when we look at the PDF in Adobe Reader? Well, characters such as I, R, H and so on are addresses that correspond to the "program" of a glyph. This program is responsible for drawing the character on the page. One would expect that in this case the character I would correspond to the glyph (or "the drawing", if you prefer that word) of the letter I. No such luck in your PDF.
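Interestingly, the glyph addresses in this particular file look like the real letters shifted forward by four alphabet positions: shifting ZMWMSRI back gives VISIONE, and HIP gives DEL, which is plausible Italian. That observation is mine, not something the PDF declares, so treat the following as a file-specific hack rather than a general fix:

```java
public class ShiftDecoder {

    // Undo a constant Caesar-like shift applied to the letters A-Z,
    // leaving all other characters (spaces, digits, ...) untouched.
    static String unshift(String s, int shift) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                sb.append((char) ('A' + ((c - 'A' - shift + 26) % 26)));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The codes observed in the content stream, shifted back by 4:
        System.out.println(unshift("IRHE ZMWMSRI HIP", 4)); // prints "ENDA VISIONE DEL"
    }
}
```

This only works because the broken encoding in this one document happens to be a constant alphabetic shift; any other document would need its own mapping.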
Now what does Adobe do when you use "Copy with formatting"? Plenty of magic that currently isn't implemented in iText. Why not? Hmm... I don't know the budget of Adobe, but it's probably much, much higher than the budget of the iText Group. Extracting text from documents that contain confusing information about fonts isn't on the technical roadmap of the iText Group.
I would like to make a program that searches for words in a PDF using PDFBox.
Here is my little program:
List<String> words; // the list of words to search for
PDDocument document = PDDocument.load("D:\\INIT.pdf");
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Comparing(content, words); // method that searches for those words in the extracted text
System.out.println(content);
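The Comparing method isn't shown in the question; a minimal sketch of what such a search might look like (the class and method names here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class WordSearch {

    // Return the subset of words that occur somewhere in the extracted
    // text, ignoring case. A real implementation might use word
    // boundaries or regular expressions instead of plain contains().
    static List<String> comparing(String content, List<String> words) {
        String haystack = content.toLowerCase();
        List<String> found = new ArrayList<>();
        for (String w : words) {
            if (haystack.contains(w.toLowerCase())) {
                found.add(w);
            }
        }
        return found;
    }
}
```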
But is it possible to search directly in the PDF, without first extracting the text with getText?
Also, getText returns a String. If the PDF contains a very large amount of text, can a String still hold all of it, or is there another type to use when the text is too big for a String?
I hope you find a solution for this within PDFBox.
The whole process is rather more difficult than it seems. For example PDF text is broken into discontinuous fragments and spaces are often represented as gaps rather than space characters. There's a need both to abstract the fragments and also to retain the link between the human-readable text and the underlying fragments within the PDF. It is quite tricky.
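To make the gap problem concrete, here is a toy sketch (not PDFBox code; the classes, values, and threshold are invented for illustration) of the kind of heuristic an extractor has to apply to decide which fragments belong to the same word:

```java
import java.util.Arrays;
import java.util.List;

public class FragmentJoiner {

    // A text fragment with its horizontal position on the page.
    static class Fragment {
        final String text;
        final float x;      // left edge
        final float width;  // advance width
        Fragment(String text, float x, float width) {
            this.text = text; this.x = x; this.width = width;
        }
    }

    // Insert a space whenever the gap between consecutive fragments
    // exceeds the threshold (in practice derived from the font size).
    static String join(List<Fragment> fragments, float gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Fragment prev = null;
        for (Fragment f : fragments) {
            if (prev != null && f.x - (prev.x + prev.width) > gapThreshold) {
                sb.append(' ');
            }
            sb.append(f.text);
            prev = f;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> line = Arrays.asList(
                new Fragment("Hel", 0f, 30f),
                new Fragment("lo", 31f, 20f),    // tiny gap: same word
                new Fragment("world", 60f, 50f)  // larger gap: new word
        );
        System.out.println(join(line, 5f)); // prints "Hello world"
    }
}
```

A real extractor also has to handle rotated text, multiple columns, and fragments emitted out of reading order, which is why the problem is harder than it looks.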
Anyhow if you don't find a satisfactory solution within PDFBox ABCpdf will do this for you. For example the link below shows how to find and highlight keywords in a PDF.
http://www.websupergoo.com/helppdf9net/source/8-abcpdf.operations/8-textoperation/1-methods/group.htm
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)
How can I add text to a PDF document so that it is not visible?
The manipulation should be done in Java. The use case is to add further metadata (in a proprietary format, about 40 KB) to a document before it is signed and archived.
I tried:
an annotation field with size 0,0
a .txt file attachment
but both annoy readers of the PDF, because they see a difference (a comment/attachment bar).
Is there a comment object or a syntax to comment out lines in a PDF document?
EDIT:
I've tried adding text between PDF objects. This works, but the problem is that Adobe Reader asks to resave the file when the window is closed.
Adding the text after %%EOF is not a solution, because then the signature does not cover the metadata, which is a required feature.
The proper way to add metadata to a PDF would be through XMP. It allows you to add arbitrary metadata and allows defining the metadata types inside of the same PDF file (which you really should do if you're archiving and which is a requirement in archival standards such as PDF/A).
XMP data can be extracted by readers who don't understand the PDF format using a simple text scanning algorithm yet at the same time it will be inside of the document so will be protected by the digital signature you apply.
You can read more about it here: http://www.adobe.com/products/xmp/
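For reference, an XMP packet is plain XML wrapped in xpacket markers, which is what makes the simple text-scanning extraction possible. A minimal illustrative example (the title value is made up; the packet id is the standard XMP constant):

```xml
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Archived invoice</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
```

The packet is stored in the document catalog's Metadata stream, so it sits inside the byte range covered by a digital signature.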
I have seen PDFs that had a bunch of metadata in the footer, in white text on a white background, so normally you wouldn't notice it when looking at the PDF. But that's quite nasty...
Is there any way to get the number of paragraphs, or the content of a given paragraph, in a PDF file using the iText library? I saw classes like Paragraph and Chunk in code that creates a new PDF file, but I can't find any way to use these classes when reading a file. Every idea is appreciated.
Is the PDF you're talking about a Tagged PDF? If not, you are making the wrong assumptions about PDF. In a PDF, content is drawn on a page. For instance: an iText PdfPTable is converted into text state operators that draw snippets of text to a canvas, as well as graphics state operators that draw paths and shapes. If the PDF isn't tagged, the lines don't know that they are borders of a table; a word doesn't know to which cell it belongs.
The same goes for paragraphs: a snippet of text doesn't know whether it belongs to a sentence, to a paragraph, to a title line,...
Due to the very nature of PDF, what you're looking for may be impossible (using iText or any other software product), or may require heuristics (artificial intelligence) to examine all text state operators and the semantics of the content to get a result that mimics how humans would interpret text.
It's very easy to achieve if your PDF is tagged correctly. See the ParseTaggedPdf example.
I am using iText to extract text from a PDF into a String, but I have encountered a problem with some PDFs: on SOME of them, the reader extracts only blanks or destroyed text.
Example of destroyed text:
"th isbe long to t he t est fo r extr act ion tex t"
What is the cause of this problem?
I am thinking of removing the fonts and changing the font to a suitable one that the reader can handle. I have tried researching this, but what I found does not help me.
This is caused by the way text is stored in the PDF file: it just places letters, each with rendering and location information. The text extraction algorithm is smart in that it finds letters that appear close together and, if so, puts them together; if they aren't that close, it inserts a space.
I can't tell you what to do about it, though.