Java - Issue with data extraction from PDF (PDFBox - 2.02)

Java - Issue with data extraction from PDF (PDFBox - 2.02) - java

I am trying to extract data from a PDF file which contains data in separate tables & convert to excel. Based on this link as my need is more or less the same, I am using PDFBOX jar to do the extraction.
To test whether I can first extract the data from different tables in the pdf, tried with the code specified below. But it does not extract & gives an error stating Corrupt object reference, don't know what it means.
To see if there was any issue with the pdf itself, I checked with https://online2pdf.com & it successfully converted the pdf file to excel, so I believe there is no issue with the pdf file.
Hope the issue I face is clear & await inputs on what needs to be done to extract the data from the pdf
Error message:
2016-07-21 13:49:11 WARN BaseParser:682 - Corrupt object reference at offset 6371
2016-07-21 13:49:11 WARN BaseParser:682 - Corrupt object reference at offset 6373
java.io.IOException: Expected string 'null' but missed at character 'u' at offset 6376
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1017)
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1000)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:879)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:651)
at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:479)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:136)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
at main.Test.readPDF(Test.java:170)
at main.Test.main(Test.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Code :
public static void main(String[] args){
try {
File filePDF = new File("C:\\test.pdf");
PDDocument document = PDDocument.load(filePDF);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
System.out.println(content);
} catch (IOException e) {
e.printStackTrace();
}
}

Finally found a jar (PDFxStream) file which extracts all the data from the PDF in this case. Although its a paid version, but its able to extract the complete info which the other paid ones was not able to extract.
The only thing is, it extracts as a String & I would need to parse this String & extract the specific info from it.

Related

PDFBOX digit garble

I met some problems when I used PDFBOX to extract text. There are Tyep3 embedded fonts in my PDF, but the numbers cannot be displayed normally when extracting this part. Can someone give me some guidance? thank you
My version is 2.0.22
The correct output is [USD-001], the wrong output is [USD- ]
public static String readPDF(File file) throws IOException {
RandomAccessBufferedFileInputStream rbi = null;
PDDocument pdDocument = null;
String text = "";
try {
rbi = new RandomAccessBufferedFileInputStream(file);
PDFParser parser = new PDFParser(rbi);
parser.setLenient(false);
parser.parse();
pdDocument = parser.getPDDocument();
PDFTextStripper textStripper = new PDFTextStripper();
text = textStripper.getText(pdDocument);
} catch (IOException e) {
e.printStackTrace();
} finally {
rbi.close();
}
return text;
}
I tried to use PDFBOX to convert the PDF to an image and found that everything was fine. I just wanted to get it as normal text
PDFDebugger output
The pdf file : http://tmp.link/f/6249a07f6e47f

There are a number of aspects of this file making text extraction difficult.
First of all the font itself boycotts text extraction. In its ToUnicode stream we find the mappings:
1 begincodespacerange
<00> <ff> endcodespacerange
2 beginbfchar
<22> <0000> <23> <0000> endbfchar
I.e. the two character codes of interest both are mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
Also the Encoding is not helping at all:
<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>
The glyph names /g121 and /g122 don't have a standardized meaning either.
PdfBox for text extraction works with these two properties of a font and, therefore, fails here.
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
In the file there are such entries. Unfortunately, though, they are erroneous, like this for the digit '0':
/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC
The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
Due to that Adobe Acrobat also used to not extract the actual text here. Only recently, probably as recently as the early 2022 releases, Acrobat started extracting a '0' here.
Actually one known "trick" to prevent one's PDFs to be text extracted by regular text extractor programs is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So it's possible the error in your file is actually an application of this trick, maybe even by design with the erroneous ActualText twist to lead text extractors with some ActualText support astray while still allowing copy&paste from Adobe Acrobat.

Preflight validate() is invalid but in Console valid

Currently I'm decoding a Base64 with Console:
base64 -di "myfile.txt" > mypdf.pdf
Which returns a valid pdf file.
But when I try this
DataSource dataSource = new ByteArrayDataSource(
new ByteArrayInputStream(Base64.getDecoder().decode(pdf.getEncodedContent())));
PreflightParser parser = new PreflightParser(dataSource);
parser.parse();
try (PreflightDocument document = parser.getPreflightDocument()) {
document.validate();
return !document.isEncrypted();
}
catch (ValidationException ex) {
return false;
}
I always get a validationException (pdf is not valid).
I think I need to change the configuration. I've already tried the following but that doesn't seem to help:
PreflightConfiguration config = document.getContext().getConfig();
config.setLazyValidation(true);
Stacktrace:
test.pdf is not valid: Unable to parse font metadata due to : Excepted xpacket 'end' attribute (must be present and placed in first)

I've solved this ticket. For those who are interested:
The validation worked perfectly and the pdf files where not correct even if the reader /browser could open it (the pdf reader/browser did not show any warnings or error messages).
Try to convert your pdfs to binary text and check at least if your first two lines and the last line are 'pdf default' like:
%PDF-1.7
%µµµµ
...
%%EOF
if not, then the pdf has been generated wrong and the validation will fail.

Cannot read the first line of a JSON file in Java

I am trying to read some data from a JSON file that I generated from a MongoDB document. But when trying to read the first entry in the document, i get an exception:
org.json.JSONException: JSONObject["Uhrzeit"] not found.
This only happens with the first entry, reading other entrys does not cause an exception.
Using jsonObject.getString("") on any entry that is not the first returns the values as expected.
//Initiate Mongodb and declare the database and collection
MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
MongoDatabase feedbackDb = mongoClient.getDatabase("local");
MongoCollection<Document> feedback = feedbackDb.getCollection("RückmeldungenShort");
//gets all documents in a collection. "new Document()" is the filter, that returns all Documents
FindIterable<Document> documents = feedback.find(new Document());
//Iterates over all documents and converts them to JSONObjects for further use
for(Document doc : documents) {
JSONObject jsonObject = new JSONObject(doc.toJson());
System.out.print(jsonObject.toString());
System.out.print(jsonObject.getString("Uhrzeit"));
}
Printing jsonObject.toString() produces the JSON String for testing purposes (in one line):
{
"Ort":"Elsterwerda",
"Wetter-Keyword":"Anderes",
"Feedback\r":"Test Gelb\r",
"Betrag":"Gelb",
"Datum":"18.05.2018",
"Abweichung":"",
"Typ":"Vorhersage",
"_id":{
"$oid":"5b33453b75ef3c23f80fc416"
},
"Uhrzeit":"05:00"
}
Note, that the order in which the entries appear is mixed up and the first one appearing in the database was "Uhrzeit".
This is how it looks like:
The JSON file is valid according to https://jsonformatter.curiousconcept.com/ .
The "Uhrzeit" is even recognized within the JSONObject while in debug mode:
I assumed it might have something to do with the entries themselves, so I switched "Datum" and "Ort" to the first place in the document but that produced the same results.
There are lots of others that have posted on this error message, but it seems to me like they all had slightly different problems.

I imported a .csv with my data into MongoDB and read the documents from there. Somewhere in the process of reading the data, "\r"s were automatically generated where the line breaks were in my .csv (aka. at the end of each dataset). In this case at the key value pair "Feedback" (as seen in the last picture).
When checking my output again with another JSON validator, I noticed that there was an "invisible" symbol in my JSON file that caused the key not to be found. Now this symbol is located in front of the first key (after the MongoDB-id) when importing a .csv document to my DB. I imported a correct version of the .csv into my MongoDB and exported it again and the symbol reappeared.
The problem was that my .csv was in "Windows" format. Converting it to "Unix" format will get rid of the generated "\r"s. The "invisible" symbol was the UTF-8-BOM code that is added at the beginning of a document. You can reformat your .csv to be just UTF-8 and get rid of it that way.

java: Protocol message tag had invalid wire type error when reading .pb file

I try to read .pb extension file.
Specifically, I would like to read this dataset (in .tgz).
I write the following code:
Path path = Paths.get(filename);
byte[] data = Files.readAllBytes(path);
Document document = Document.parseFrom(data);
But then I received the following error.
com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
The last line of the code caused this error, but I do not know how to solve it.

Your files are actually in "delimited" format: each one contains multiple messages, each with a length prefix.
InputStream stream = new FileInputStream(filename);
Document document = Document.parseDelimitedFrom(steam);
Keep calling parseDelimitedFrom(stream) to read more messages until it returns null (end of file).
Also note that the file I looked at -- testNegative.pb in heldout_relations.tgz -- appeared to contain instances of Relation, not Document. Make sure you are parsing the correct type, because the protobuf implementation can't tell the difference -- you'll get garbage if you parse the wrong type.

extracting text from using pdfclown function 'textextractor'

i am getting an error while using textextractor of pdfclown library. The code i used is
TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");
// Extract the page text!
Map textStrings = textExtractor.extract(page);
a part of the error i got is
exception in thread 'main' java.lang.exceptionininitializer error
at org.pdfclown.document.contents.fonts.encoding.put
at ......
at ......
<about 30 such lines>
caused by java.lang.nullpointerexception
at java.io.reader.<init><Reader.java:78>
at java.io.inputstreamreader
<about 30 lines more>
I also found out that this happens when my pdf contains some bullets for example
item 1
item 2
item 3
Plz help me out to extract the text from such pdfs.

(The following comment turned out to be the solution:)
Using your highlighter.java class (provided on your google drive in a comment) together with the current PDF Clown trunk version as jar, the PDF was processed without incident, especially without NullPointerException (the highlights partially were not at the right position, though).
After looking at your shared google drive contents, though, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.
The PDF Clown jar files contain additional ressources, though, which your setup consequentially did not include. Thus:
Your highlighter.java has to be used with pdfclown.jar in the classpath.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java - Issue with data extraction from PDF (PDFBox - 2.02) - java

Related

PDFBOX digit garble

Preflight validate() is invalid but in Console valid

Cannot read the first line of a JSON file in Java

java: Protocol message tag had invalid wire type error when reading .pb file

extracting text from using pdfclown function 'textextractor'

Categories

Resources