I have a question about pdfbox 1.8.13. I am trying to read the entire text from a one-page PDF document. Adobe Reader can do the job; pdfbox reads almost the entire page but scrambles the first two and the last two lines of the document so that letters are interchanged.
Does anybody know how to solve such an issue? First, where should I ask this; second, how can I share the PDF with you; and third, could someone check whether the problem also exists in version 2.0.7 of pdfbox, which I understand is completely different and thus not straightforward to switch to?
Thank you in advance for your help
Stephan
Adobe Reader:
ScalableCapitalHRB217778,AmtsgerichtMünchenSeite1von1
VermögensverwaltungGmbHUSt-IdNr.DE300434774
Prinzregentenstr.
48Geschäftsführung:80538München
ErikPodzuweit,FlorianPrucker
pdfbox:
SVecramlaöbgleenCsavpeitrawlaltung GmbH UHSRtB-I2d1N7r7.7D8E,3A0m0t4s3g4e7ri7c4ht München Seite 1 von 1
8P0ri5n3zr8egMeünntcehnesntr. 48 GEreikscPhoädftzsufwüheritu,nFglo: rian Prucker
Link to the PDF (I have verified that the problem is the same with the unmodified and the modified PDF that I have uploaded):
https://wetransfer.com/downloads/5930649bce9a1d1a686a0da63f1b9bce20170808071518/9b9140
P.S.: In the meantime, I have also tried the PDDocument.loadNonSeq variant in pdfbox 1.8.13, but this resulted in the same problem.
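For reference, the loadNonSeq call was along these lines (a sketch; in the 1.8.x API, passing null as the scratch file makes PDFBox buffer in memory):

PDDocument doc = PDDocument.loadNonSeq(file, null);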
Thank you #tilman-hausherr for your helpful hints. With them, I managed to debug my problem.
You were right that leaving out the sorting option (I don't know why it was used in the project that I now work on) resolves the scrambling issue even in pdfbox-1.8.13. And you were right that text extraction with pdfbox-2.0.7 gives even better results.
The relevant Java code snippets that I was using with pdfbox-1.8.13 were:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
...
PDDocument doc = PDDocument.load(file);
PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true); // this sorting option turned out to cause the scrambling
String text = textStripper.getText(doc);
If I understand correctly, the API for simple text extraction is not identical going from pdfbox-1.8.13 to pdfbox-2.0.7, but it is very similar; PDFTextStripper has simply moved from util to text:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
...
PDDocument doc = PDDocument.load(file);
PDFTextStripper textStripper = new PDFTextStripper();
// textStripper.setSortByPosition(true);
String text = textStripper.getText(doc);
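For completeness, a self-contained version of the pdfbox-2.0.7 snippet might look like this (a minimal sketch; the class name and file name are placeholders, and try-with-resources just ensures the document gets closed):

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextExample {
    public static void main(String[] args) throws IOException {
        // Placeholder file name; substitute the PDF you want to extract.
        File file = new File("20170801 Rechnung.pdf");
        try (PDDocument doc = PDDocument.load(file)) {
            PDFTextStripper textStripper = new PDFTextStripper();
            // Deliberately no setSortByPosition(true): sorting is what scrambled this PDF.
            String text = textStripper.getText(doc);
            System.out.println(text);
        }
    }
}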
To find all of this out, the command-line tool you mentioned was very helpful. Here are the results of text extraction with the different options (https://pdfbox.apache.org/1.8/commandline.html and https://pdfbox.apache.org/2.0/commandline.html):
java -jar pdfbox-app-1.8.13.jar ExtractText -sort "20170801 Rechnung.pdf":
SVecramlaöbgleenCsavpeitrawl HRBPrinzregentenstra.l4tu8ng GmbH GUSest-I
2d1N7r7.7D8E,3A0m0t4s3g4e7ri7c4ht München Seite 1 von 1
80538 München ErikcPhoädftzsufwüheritu,nFglo: rian Prucker
java -jar pdfbox-app-1.8.13.jar ExtractText "20170801 Rechnung.pdf":
Scalable CapitalVermögensverwaltung GmbHPrinzregentenstr. 4880538 München
HRB 217778, Amtsgericht MünchenUSt-IdNr. DE300434774Geschäftsführung:Erik
Podzuweit, Florian Prucker
Seite 1 von 1
java -jar pdfbox-app-2.0.7.jar ExtractText -sort "20170801 Rechnung.pdf":
Scalable Capital HRB 217778, Amtsgericht München Seite 1 von 1
Vermögensverwaltung GmbH USt-IdNr. DE300434774
Prinzregentenstr. 48 Geschäftsführung:
80538 München Erik Podzuweit, Florian Prucker
java -jar pdfbox-app-2.0.7.jar ExtractText "20170801 Rechnung.pdf":
Scalable Capital
Vermögensverwaltung GmbH
Prinzregentenstr. 48
80538 München
HRB 217778, Amtsgericht München
USt-IdNr. DE300434774
Geschäftsführung:
Erik Podzuweit, Florian Prucker
Seite 1 von 1
So I think pdfbox-2.0.7 gives the nicest result in this case, especially without the -sort option, even if I don't know why the algorithms behave differently; pdfbox-1.8.13, by contrast, gave the same result with or without the -nonSeq option.
I am trying to create a scanner with JFlex. I created my .jflex file and it compiles fine. The problem is that when I try to test it, it sometimes gives me an ArrayIndexOutOfBoundsException: 769 in the .java class that JFlex generated.
I am also using the CUP parser generator. I don't know if the problem is related to the CUP analysis part, but here is how I call my analyzers:
ScannerLexico lexico = new ScannerLexico(new BufferedReader(new StringReader( jTextPane1.getText())));
ParserSintactico sintaxis = new ParserSintactico(lexico);
I don't know how to fix it. Please help me.
Here are the links to my code:
JFlex File "ScannerFranklin.jflex"
Java Class generated "ScannerLexico.java"
The part where I have the problem is in the next_token() function of the .java class generated by JFlex (line 899 of the Java file):
int zzNext = zzTransL[ zzRowMapL[zzState] + zzCMapL[zzInput] ];
if (zzNext == -1) break zzForAction;
zzState = zzNext;
Thanks.
According to its documentation, JFlex throws ArrayIndexOutOfBounds exceptions whenever it encounters Unicode characters while the %7bit or %8bit/%full encoding options are in effect. It recommends always using the %unicode option instead, which is the default.
You are using the %unicode option, but you're also using %full. Apparently, when both options are present, %full takes precedence. So remove %full and the error should go away. (The 769 in your exception is presumably the code of the first input character that doesn't fit the 8-bit table.)
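In the options section of your .jflex file, that means something like this (a sketch; the class name is taken from the question):

%%
%class ScannerLexico
%unicode
/* %full  <- removed: the 8-bit character table it generates is what overflows */
%%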
I'm calling PDFBox from Matlab to figure out how many pages there are in a PDF. Everything works great with Matlab 2016b and prior. I can import the library and load a PDF without a problem:
import org.apache.pdfbox.pdmodel.PDDocument;
pdfFile = PDDocument.load(filename);
When I run the same thing in 2017a, I get the following error:
No method 'load' with matching signature found for class
'org.apache.pdfbox.pdmodel.PDDocument'.
I can change the line after the import so that the function signature matches:
jFilename = java.lang.String(filename);
pdfFile = PDDocument.load(jFilename.getBytes());
However, this causes PDFBox to have problems when I call load:
Java exception occurred:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1111)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:1874)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:1853)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:242)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1093)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
This error seems to occur regardless of the PDF I'm trying to load. I'm getting the same exception with PDFBox 1.8.10 and 2.0.6.
I'm left with 2 questions:
Did Matlab 2017a change how it passes strings to Java? I didn't see anything in the release notes about this.
What could be causing the PDFBox error? Matlab is still on Java 1.7 in 2017a so I wouldn't think there should be any difference in how PDFBox works.
It seems like the method you are calling exists in PDDocument version 1.8.11, but in the latest version, PDDocument 2.0.2, the signature accepting a file name no longer exists. (Your getBytes() workaround fails because load(byte[]) treats the byte array as the PDF content itself, so PDFBox ends up parsing the characters of the file name as a document, hence the "Error: End-of-File, expected line" exception.)
Change your code to the following, and it should work:
pdfFile = PDDocument.load(java.io.File(filename));
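Under the hood this is the plain Java API; a standalone sketch of the same call (class name just for illustration) that also closes the document afterwards:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;

public class PageCount {
    public static void main(String[] args) throws IOException {
        // PDFBox 2.x dropped load(String); pass a File (or InputStream/byte[]) instead.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            System.out.println("Pages: " + doc.getNumberOfPages());
        }
    }
}

In Matlab, the equivalent follow-up calls are pdfFile.getNumberOfPages() and pdfFile.close().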
I am using the Stanford POS tagger toolkit to tag a list of words from academic papers. Here is my code for this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize. They come from normal academic papers, so there are usually several thousand words (mostly 3000 - 4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set with 270 files, but when the number of files gets bigger, the program gives this error (Java heap space 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after execution starts; it happens after some time of running. I really don't know the reason. Is this because my 3000 - 4000 words are too much? Thank you very much for your help! (Sorry for the poor formatting; the error output is too long.)
Here is my solution, after I faced the same error: basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000:
def __init__(self, [...], java_options='-mx1000m')
To test whether the problem is the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging, as sketched below.
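A minimal sketch of that sentence-by-sentence approach (the model/jar paths and file name are placeholders):

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths: point these at your local model and jar.
st = StanfordPOSTagger('models/english-bidirectional-distsim.tagger',
                       'stanford-postagger.jar',
                       encoding='utf8', java_options='-mx2048m')

with open('paper.txt', encoding='utf8') as f:
    text = f.read()

# Tagging one sentence at a time shows whether a single oversized
# input is what exhausts the heap.
for sentence in sent_tokenize(text):  # sent_tokenize uses the Punkt tokenizer
    print(st.tag(word_tokenize(sentence)))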
I request your kind help in solving the "Java command failed" error, which is thrown whenever I try to tag an Arabic corpus of 2 megabytes. I have searched the web and the Stanford POS tagger mailing list, but I did not find a solution. I read some posts on similar problems, and it was suggested that memory runs out; I am not sure about that, since I still have 19 GB of free memory. I tried every solution offered, but the same error keeps showing.
I have an average command of Python and a good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, an NLTK alpha release and the Stanford POS tagger model for Arabic. This is my code:
import nltk
from nltk.tag.stanford import POSTagger
arabic_postagger = POSTagger("/home/mohammed/postagger/models/arabic.tagger", "/home/mohammed/postagger/stanford-postagger.jar", encoding='utf-8')
print("Executing tag_corpus.py...\n")
# Import corpus file
print("Importing data...\n")
file = open("test.txt", 'r', encoding='utf-8').read()
text = file.strip()
print("Tagging the corpus. Please wait...\n")
tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))
If the corpus size is less than 1 MB (= 100,000 words), there is no error. But when I try to tag the 2 MB corpus, the following error message is shown:
Traceback (most recent call last):
File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
return self.batch_tag([tokens])[0]
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
raise OSError('Java command failed!')
OSError: Java command failed!
I intend to tag 300 Million words to be used in my Ph.D. research project. If I keep tagging 100 thousand words at a time, I will have to repeat the task 3000 times. It will kill me!
I really appreciate your kind help.
After your import lines, add this line:
nltk.internals.config_java(options='-Xmx2G')
This increases the maximum RAM that Java allows the Stanford POS Tagger to use. '-Xmx2G' (note the capital X; JVM flags are case-sensitive) raises the maximum heap to 2 GB instead of the default 512 MB.
See What are the Xms and Xmx parameters when starting JVMs? for more information
If you're interested in how to debug your code, read on.
So we see that the command fails when handling a huge amount of data, so the first thing to look at is how Java is initialized in NLTK before the Stanford tagger is called, from https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19 :
from nltk.internals import find_file, find_jar, config_java, java, _java_options
We see that the nltk.internals package is handling the different Java configurations and parameters.
Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and we see that no value is set for Java's memory allocation by default.
In version 3.9.2, the StanfordTagger class constructor accepts a parameter called java_options which can be used to set the memory for the POSTagger and also the NERTagger.
E.g. pos_tagger = StanfordPOSTagger('models/english-bidirectional-distsim.tagger', path_to_jar='stanford-postagger-3.9.2.jar', java_options='-mx3g')
I found the answer by #alvas did not work for me, because StanfordTagger was overriding my memory setting with its built-in default of 1000m. Perhaps using nltk.internals.config_java after initializing StanfordPOSTagger might work, but I haven't tried that.
The following was produced using the most recent version of encog-workbench (3.2.0).
I was wondering if this is a bug or if I do not grasp the purpose of the output file.
When I run the [sunspot example][1] in the Encog Workbench, without segregation, I expect the output file to contain the fitted values from the model. When I create the validation chart, it presents me with the figure found in the tutorial, so this seems correct.
But when I go to the sunspots_output.csv output file, I get the following output:
ssn(t-29) ssn(t+1) Output:ssn(t+1)
... first thirty values have output Null ...
-0.600472813 -0.947202522 null
-0.477541371 -1 8.349050184
-0.528762805 -0.976359338 8.334476431
-0.814814815 -0.986603625 8.314903157
-0.817178881 -0.892040977 8.292847897
...
All the output values are around 8 for the rest of the file.
Now when I go back to the validation chart, there is a Data tab, which contains the following columns:
Ideal Result
-0.477541371 -0.52449577
-0.528762805 -0.526507195
-0.814814815 -0.535029097
-0.817178881 -0.653884012
If I denormalize the values in these columns, I get the following:
66.3 60.3414868
59.8 60.08623701
23.5 59.00480764
23.2 43.92211894
These seem to be the correct actual values (if I compare them with the original data), and thus these should be the predicted values in the output column.
Is this a bug, or do the values in the Output:ssn(t+1) column mean something else?
I copied these values to Excel and denormalized them by typing in the formula for (-1, 1).
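For reference, the inverse mapping is simple (a sketch, assuming Encog's standard linear (-1, 1) range normalization; min of about 0 and max of about 253.8 are inferred from the rows above):

// Forward: n = 2 * (x - min) / (max - min) - 1; inverting gives:
public class Denormalize {
    public static double denormalize(double n, double min, double max) {
        return (n + 1.0) / 2.0 * (max - min) + min;
    }
    public static void main(String[] args) {
        // Reproduces the first row above: prints roughly 66.3
        System.out.println(denormalize(-0.477541371, 0.0, 253.8));
    }
}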
I was hoping not to have to do this every time I run an experiment.
I am going to move to code eventually; I just wanted to get some preliminary results with the workbench. Using segregation results in the same problem, by the way.
If it's a bug, I'll report it on the Encog website.
Thanks for your answers,
Florian
UPDATE
Hey Jeff, I downloaded your zip and reproduced the problem using my workbench.
The problem only arises when I do not segregate, which I do not want to do.
There are some clear differences in the .ega file created by workbench-executable 3.2.0.
When I use your .ega file and remove the segregate section, it works.
When I use mine, it doesn't. That's why I uploaded my project [here][2]:
Maybe you can discover whether something new interferes with outputting the correct values.
Hope it helps!
Update 3:
My actual goal is to build a forecaster of which the project can be found here:
http://wikisend.com/download/477372/Myproject.rar
I was wondering if you could tell me whether I am doing something fundamentally wrong, because currently my output is total rubbish.
Thanks again.
I tried to reproduce the error, but when I ran my own sunspots prediction I got predicted values closer to the expected range. You might try running the zipped version of the example, found here:
http://www.heatonresearch.com/dload/encog/example/workbench/SunspotExample.zip
You should be able to run the EGA file and it will produce an output file. Some of my data are as follows:
"year" "mon" "ssn" "dev" "Output:ssn(t+1)"
1948 5 174.0 69.3 156.3030108771
1948 6 167.8 26.6 168.4791037592
1948 7 142.2 28.3 208.1090604116
1948 8 157.9 35.3 186.0234029962
1948 9 143.3 55.9 131.5008296846
1948 10 136.3 44.9 93.0720770479
1948 11 95.8 21.8 89.8269594386
Perhaps compare the EGA file from the zip above with your own EGA file.