Apache OpenNLP bug: it doesn't load en-pos-maxent.bin

I'm trying to use the Apache OpenNLP POSTagger example code, and I've run into an error. Below is the code:
public String[] SentenceDetect(String qwe) throws IOException
{
    POSModel model = new POSModelLoader().load(new File("/home/jebard/chabacano/Chabacano1/src/en-pos-maxent.bin"));
    PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
    POSTaggerME tagger = new POSTaggerME(model);
    String input = "Hi. How are you? This is Mike.";
    ObjectStream<String> lineStream = new PlainTextByLineStream(
            new StringReader(input));
    perfMon.start();
    String line;
    while ((line = lineStream.read()) != null) {
        String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE
                .tokenize(line);
        String[] tags = tagger.tag(whitespaceTokenizerLine);
        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());
        perfMon.incrementCounter();
    }
    perfMon.stopAndPrintFinalResult();
The error occurs at this line:
.load(new File("/home/jebard/chabacano/Chabacano1/src/en-pos-maxent.bin"))
The method load(java.io.File) in type ModelLoader is not applicable for the arguments (org.apache.tomcat.jni.File)

This is not actually a bug in OpenNLP; it is a bug in your code. You have imported the class File from the package (aka namespace) org.apache.tomcat.jni.
The OpenNLP API, however, expects the class File from the standard JDK package java.io, i.e. you should import java.io.File instead.
In general, this should fix your problem.
Important hint
You should also migrate your code, because models should not be loaded via POSModelLoader. Its JavaDoc says:
Loads a POS Tagger Model for the command line tools.
Note: Do not use this class, internal use only!
Instead, you can use the constructor POSModel(InputStream in) to load your model file via an InputStream that references the actual model file.
Moreover, the class POSModelLoader was only present in earlier releases of OpenNLP (versions <= 1.5.x). In OpenNLP 1.6.0, the latest version at the time of writing, it was removed completely. You can and should now use the constructor of the POSModel class to load and initialize the model you need.
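To make the two points above concrete, here is a minimal sketch that uses only the JDK. It refers to java.io.File by its fully qualified name, so no stray import can substitute another File class, and it opens the model as an InputStream. The class name ModelFileCheck and the helper openModel are hypothetical names chosen for illustration. The call to the POSModel(InputStream) constructor is left as a comment so that the sketch compiles without OpenNLP on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

class ModelFileCheck {

    // Hypothetical helper: validates the path and opens it as a stream.
    static InputStream openModel(String path) throws IOException {
        // Fully qualified on purpose, so no other File class can be picked up by an import.
        java.io.File file = new java.io.File(path);
        if (!file.isFile() || !file.canRead()) {
            throw new FileNotFoundException("Model file not readable: " + file.getAbsolutePath());
        }
        return new BufferedInputStream(new FileInputStream(file));
    }

    public static void main(String[] args) throws IOException {
        // The default path is a placeholder; pass the location of your own model file.
        String path = args.length > 0 ? args[0] : "en-pos-maxent.bin";
        try (InputStream in = openModel(path)) {
            // With OpenNLP on the classpath, this is where new POSModel(in) would go.
            System.out.println("Opened model stream, first byte: " + in.read());
        }
    }
}
```

Failing early with a FileNotFoundException that names the absolute path tends to be easier to debug than a model constructor that fails later with a less specific message.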

There is a problem with XML parsing. Try this; it worked for me:
System.setProperty("org.xml.sax.driver", "org.xmlpull.v1.sax2.Driver");
try {
    AssetFileDescriptor fileDescriptor =
            context.getAssets().openFd("en_pos_maxent.bin");
    FileInputStream inputStream = fileDescriptor.createInputStream();
    POSModel posModel = new POSModel(inputStream);
    posTaggerME = new POSTaggerME(posModel);
} catch (Exception e) {
    // Don't swallow the exception silently; at least log it.
    e.printStackTrace();
}

Related

How to read from PDF using Selenium webdriver and Java

I am trying to read the contents of a PDF file using Java and Selenium. Below is my code. getWebDriver is a custom method in the framework; it returns the WebDriver instance.
URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();
String output = new PDFTextStripper().getText(parser.getPDDocument());
The PDFParser line gives a compile-time error if I don't cast the stream to the RandomAccessRead type.
And when I do cast it, I get this runtime error:
java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead
I need help getting rid of these errors.

First off, unless you want to interfere in the PDF loading process, there is no need to use the PDFParser class explicitly. You can instead use the static PDDocument.load method:
URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
PDDocument document = PDDocument.load(fileToParse);
String output = new PDFTextStripper().getText(document);
Otherwise, if you do want to interfere in the loading process, you have to create a RandomAccessRead instance for your BufferedInputStream. You cannot simply cast it, because the classes are not related.
You can do that like this:
URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
ScratchFile scratchFile = new ScratchFile(memUsageSetting);
PDFParser parser;
try
{
    RandomAccessRead source = scratchFile.createBuffer(fileToParse);
    parser = new PDFParser(source);
    parser.parse();
}
catch (IOException ioe)
{
    IOUtils.closeQuietly(scratchFile);
    throw ioe;
}
String output = new PDFTextStripper().getText(parser.getPDDocument());
(This essentially is copied and pasted from the source of PDDocument.load.)

How to use OpenNLP parser models in an Android app?

I went through this tutorial for Java NLP: https://www.tutorialspoint.com/opennlp/index.htm
I tried the code below on Android:
try {
    File file = copyAssets();
    // InputStream inputStream = new FileInputStream(file);
    ParserModel model = new ParserModel(file);
    // Creating a parser
    Parser parser = ParserFactory.create(model);
    // Parsing the sentence
    String sentence = "Tutorialspoint is the largest tutorial library.";
    Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    for (Parse p : topParses) {
        p.show();
    }
} catch (Exception e) {
}
I downloaded the file **en-parser-chunking.bin** from the internet and placed it in the assets of the Android project, but the code stops on the third line, i.e. ParserModel model = new ParserModel(file);, without throwing any exception. How can I make this work on Android? If it cannot work, is there any other NLP support for Android that does not rely on external services?

The reason the code stalls or breaks at runtime is that you need to use an InputStream instead of a File to load the binary file resource. Most likely, the File instance is null when you "load" it the way shown in line 2 of your snippet. In theory, this constructor of ParserModel should detect this and throw an IOException. Sadly, the JavaDoc of OpenNLP is not precise about this kind of situation, and you are not handling the exception properly in the catch block.
Moreover, the code snippet you presented should be improved, so that you know what actually went wrong.
Therefore, loading a ParserModel from within an Activity should be done differently. Here is a variant that takes care of both aspects:
AssetManager assetManager = getAssets();
InputStream in = null;
try {
    in = assetManager.open("en-parser-chunking.bin");
    ParserModel parserModel;
    if (in != null) {
        parserModel = new ParserModel(in);
        if (parserModel != null) {
            // From here, <parserModel> is initialized and you can start playing with it...
            // Creating a parser
            Parser parser = ParserFactory.create(parserModel);
            // Parsing the sentence
            String sentence = "Tutorialspoint is the largest tutorial library.";
            Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
            for (Parse p : topParses) {
                p.show();
            }
        }
        else {
            // model not initialized - whatever you want to do in this case
            Log.w("NLP", "ParserModel could not be initialized.");
        }
    }
    else {
        // resource file not found - whatever you want to do in this case
        Log.w("NLP", "OpenNLP binary model file could not be found in assets.");
    }
}
catch (Exception ex) {
    Log.e("NLP", "message: " + ex.getMessage(), ex);
    // proper exception handling here...
}
finally {
    if (in != null) {
        try {
            in.close();
        } catch (IOException ignored) {
            // nothing sensible to do if closing fails
        }
    }
}
This way, you're using an InputStream approach and at the same time you take care of proper exception and resource handling. Moreover, you can now use a debugger in case something remains unclear with the resource path references of your model files. For reference, see the official JavaDoc of AssetManager#open(String fileName).
Note well:
Loading OpenNLP's binary resources can consume quite a lot of memory. For this reason, your Android app's request to allocate the memory needed for this operation may not be granted by the actual runtime (i.e., smartphone) environment.
Therefore, carefully monitor the amount of requested/required RAM while parserModel = new ParserModel(in); is invoked.
Hope it helps.
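The resource handling described above can also be written with try-with-resources, which closes the stream automatically. The following is a minimal sketch using only the JDK. ResourceLoadingSketch, ModelFactory and load are hypothetical names, and the byte array in main merely stands in for an asset stream. On Android, and assuming OpenNLP is on the classpath, the call would presumably look like load(getAssets().open("en-parser-chunking.bin"), ParserModel::new).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ResourceLoadingSketch {

    // Hypothetical callback standing in for a model constructor that takes an InputStream.
    interface ModelFactory<T> {
        T create(InputStream in) throws IOException;
    }

    // Builds the model from the stream and closes the stream afterwards,
    // whether or not the factory throws.
    static <T> T load(InputStream source, ModelFactory<T> factory) throws IOException {
        try (InputStream in = new BufferedInputStream(source)) {
            return factory.create(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; a real caller would pass the asset stream instead.
        InputStream fake = new ByteArrayInputStream(new byte[] {7, 8, 9});
        Integer firstByte = load(fake, in -> in.read());
        System.out.println("First byte read by the factory: " + firstByte);
    }
}
```

Any IOException still propagates to the caller, so it can be logged in one place instead of being swallowed by an empty catch block.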

Weka model Read Error in android

I created my Weka model on my machine and imported it into the Android project. When I try to deserialise the model to create the classifier, I get a java.io.StreamCorruptedException. The code works perfectly on the machine.
This is my code:
InputStream fis = null;
fis = new InputStream("/modle.model");
InputStream is = fis;
Classifier cls = null;
//here im getting the error when trying to read the Classifier
cls = (Classifier) SerializationHelper.read(is);
FileInputStream datais = null;
datais = new FileInputStream("/storage/emulated/0/window.arff");
InputStream dataIns = datais;
DataSource source = new DataSource(dataIns);
Instances data = null;
try {
data = source.getDataSet();
} catch (Exception e) {
e.printStackTrace();
}
data.setClassIndex(data.numAttributes() - 1);
Instance in = new Instance(13);
in.setDataset(data);
in.setValue(0, testWekaModle1[0]);
in.setValue(1, testWekaModle1[1]);
in.setValue(2, testWekaModle1[2]);
in.setValue(3, testWekaModle1[3]);
in.setValue(4, testWekaModle1[4]);
in.setValue(5, testWekaModle1[5]);
in.setValue(6, testWekaModle1[6]);
in.setValue(7, testWekaModle1[7]);
in.setValue(8, testWekaModle1[8]);
in.setValue(9, testWekaModle1[9]);
in.setValue(10, testWekaModle1[10]);
in.setValue(11, testWekaModle1[11]);
double value = 0;
value = cls.classifyInstance(in);
in.setClassValue(value);
This is the full stack trace:
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:2109)
java.io.ObjectInputStream.<init>(ObjectInputStream.java:372)
weka.core.SerializationHelper.read(SerializationHelper.java:288)
info.androidhive.sleepApp.model.ControllerWeka.wekaModle(ControllerWeka.java:81)
info.androidhive.sleepApp.activity.HomeFragment.extract(HomeFragment.java:278)
info.androidhive.sleepApp.activity.HomeFragment.stop(HomeFragment.java:146)
info.androidhive.sleepApp.activity.HomeFragment$2.onClick(HomeFragment.java:107)
android.view.View.performClick(View.java:4475)
android.view.View$PerformClick.run(View.java:18786)
android.os.Handler.handleCallback(Handler.java:730)
dalvik.system.NativeStart.main(Native Method)
com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1025)
com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1209)
java.lang.reflect.Method.invoke(Method.java:525)
java.lang.reflect.Method.invokeNative(Native Method)
android.app.ActivityThread.main(ActivityThread.java:5419)
android.os.Looper.loop(Looper.java:137)
android.os.Handler.dispatchMessage(Handler.java:92)
Please help me to overcome this problem.

This is resolved. The model was created in a different environment (a PC) and then deserialised in the Android environment, which failed because the two Java runtimes were not the same.

Make sure that both weka.jar files have the same version.
And do NOT use the GUI version of Weka to save the model, since the Android runtime does not contain the GUI-related packages used by Weka.
It should be fine to build and save the model programmatically on the desktop and deserialise it on Android.
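The top frame of the stack trace, ObjectInputStream.readStreamHeader, is where Java checks the first bytes of the stream before reading any object. As a diagnostic, here is a minimal sketch (JDK only) that tests whether a stream starts with the standard Java serialization header. SerializationHeaderCheck and looksLikeJavaSerialized are hypothetical names, and the serialized String in main is only a stand-in for a real classifier file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.nio.charset.StandardCharsets;

class SerializationHeaderCheck {

    // Returns true if the stream begins with the Java serialization magic number and version.
    // Note: this consumes the first four bytes, so reopen the stream before deserialising.
    static boolean looksLikeJavaSerialized(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        try {
            short magic = data.readShort();
            short version = data.readShort();
            return magic == ObjectStreamConstants.STREAM_MAGIC
                    && version == ObjectStreamConstants.STREAM_VERSION;
        } catch (EOFException tooShort) {
            return false; // fewer than four bytes cannot be a serialized object
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject("stand-in for a classifier");
        }
        byte[] serialized = buffer.toByteArray();
        byte[] plainText = "not a model".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeJavaSerialized(new ByteArrayInputStream(serialized))); // true
        System.out.println(looksLikeJavaSerialized(new ByteArrayInputStream(plainText)));  // false
    }
}
```

If the check fails for the file as it arrives on the device, the bytes were probably altered or truncated on the way there, and the problem lies in how the file is packaged or opened rather than in the classifier itself.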

Specific characters not rendering properly in Java

I have an issue when displaying strings received from a server in a JTable. Some specific characters appear as little white squares instead of "é" or "à" etc. I tried a lot of things but none of them fixed my problem. I'm working with Eclipse under Windows. The server was developed using Visual Studio 2010.
The server sends an XML file using tinyXML2, and the client uses JDom to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea?
Thx
Arnaud
EDIT: As requested, this is how I use JDom
public static Player fromXML(Element e)
{
    Player result = new Player();
    String e_text = null;
    try
    {
        e_text = e.getChildText(XMLTags.XML_Player_playerId);
        if (e_text != null) result.setID(Integer.parseInt(e_text));
        e_text = e.getChildText(XMLTags.XML_Player_lastName);
        if (e_text != null) result.setName(e_text);
        e_text = e.getChildText(XMLTags.XML_Player_point_scored);
        if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));
        e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
        if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
    }
    catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}
public static Document load(String filename) {
    File XMLFile = new File(CLIENT_TO_SERVER, filename);
    SAXBuilder sxb = new SAXBuilder();
    Document document = new Document();
    try
    {
        document = sxb.build(new File(XMLFile.getPath()));
    } catch (Exception e) {
        e.printStackTrace();
    }
    return document;
}

Read the file using the correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF8")));
Notes:
1. First determine which character encoding is used in that file, and specify that charset instead of UTF8 above.
2. If the encoding is not known, or the file is generated by various systems with different encodings, you may use Mozilla's encoding detector library. See https://code.google.com/p/juniversalchardet/
3. You need to handle UnsupportedEncodingException.
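To illustrate why the charset passed to InputStreamReader matters, here is a minimal sketch that needs only the JDK. It decodes the same bytes twice, once with the charset they were written in and once with a wrong one. CharsetReadSketch and readAll are hypothetical names, and the in-memory byte array stands in for the XML file. The accented characters are written as Unicode escapes so the sketch does not depend on the encoding of the source file itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadSketch {

    // Reads the whole stream as text, decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "\u00e9 \u00e0" is "e acute, space, a grave", encoded here as UTF-8 bytes.
        byte[] utf8Bytes = "\u00e9 \u00e0".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println("Decoded as UTF-8:      " + right.length() + " characters");
        System.out.println("Decoded as ISO-8859-1: " + wrong.length() + " characters");
    }
}
```

Passing a Charset constant from StandardCharsets instead of a charset name also avoids the UnsupportedEncodingException mentioned in note 3.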

Using boilerpipe to extract non-english articles

I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for texts in English, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe FAQ, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is: are there any parameters when using boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library:
(first attempt based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, based on the HTML source code):
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

You don't have to modify Boilerpipe's internal classes.
Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!

Well, from what I see, when you use it like that, the library will automatically choose which encoding to use. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();
    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if (m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
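The charset selection in the quoted excerpt can be tried in isolation. Below is a minimal standalone sketch of the same idea: take the charset from a Content-Type header and fall back to a default when the header is missing or names an unknown charset. ContentTypeCharset, charsetFrom and the regular expression are assumptions made for illustration. The library's own PAT_CHARSET is not shown in the excerpt and may differ.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {

    // Hypothetical pattern for "charset=..." in a Content-Type header value.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback if none can be determined.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names: keep the default.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = Charset.forName("Cp1252"); // same default as the quoted excerpt
        System.out.println(charsetFrom("text/html; charset=UTF-8", fallback));
        System.out.println(charsetFrom("text/html", fallback));
    }
}
```

The second call shows the case behind the question: when the server sends no charset, the default is used, and a UTF-8 page decoded with that default will show garbled accented characters.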

OK, I got a solution.
As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I had to add two lines and change the last one:
final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs.name()).getBytes("UTF-8"); // new line (conversion)
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
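The conversion step can be checked on its own, outside boilerpipe. A minimal sketch using only the JDK: decode the bytes with the charset they actually arrived in, then encode the resulting String as UTF-8. TranscodeSketch and toUtf8 are hypothetical names, and the single ISO-8859-1 byte is just sample data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {

    // Decodes data using its source charset and re-encodes the text as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // "e acute" as a single ISO-8859-1 byte
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        // The same character takes two bytes in UTF-8 (0xC3 0xA9).
        System.out.println(utf8.length);
    }
}
```

The conversion only helps if the source charset is right. Decoding with the wrong charset first and then encoding as UTF-8 just stores the garbled text in a different encoding.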

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e. every other language), these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.

Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());
            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other (UTF-8), then click Run.

I had the same problem; cnr's solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks!
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
