Apache Tika fails to detect language on short sentence. Why?

I tried to detect the language of a short phrase and was surprised that the detection result was wrong.
LanguageDetector detector = new OptimaizeLangDetector();
try {
detector.loadModels();
} catch (IOException e) {
LOG.error(e.getMessage(), e);
throw new ExceptionInInitializerError(e);
}
LanguageResult languageResult = detector.detect("Hello, my friend!");
The languageResult contains Norwegian with "medium" probability. Why? I think it should be English instead. Longer phrases seem to be detected properly. Does this mean that Apache Tika should not be used on short text?

This will not work well on short text. As the documentation says:
Implementation of the LanguageDetector API that uses
https://github.com/optimaize/language-detector
From https://tika.apache.org/1.13/api/org/apache/tika/langdetect/OptimaizeLangDetector.html
If you look at that GitHub project, its Challenges section mentions issues with short texts:
This software does not work as well when the input text to analyze is
short, or unclean. For example tweets.
From the Challenges section of https://github.com/optimaize/language-detector
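To get a feel for why short input is hard, it may help to count how little evidence a phrase provides. The sketch below is not Tika or Optimaize code. It assumes, as the Optimaize README describes, that detection is based on character n-grams; the names NgramSketch and charNgrams are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NgramSketch {
    // Returns all character n-grams of length n, in order of appearance.
    static List<String> charNgrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

For "Hello, my friend!" (17 characters), charNgrams(text, 3) yields only 15 trigrams, which is very little to compare against dozens of language profiles.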

I could reproduce the issue.
This may not directly answer the question, but it can be considered a workaround.
It seems that if you know which languages to expect, you can pass them to the detector via the loadModels(models) method. This approach helps to detect English correctly:
try {
Set<String> models=new HashSet<>();
models.add("en");
models.add("ru");
models.add("de");
LanguageDetector detector = new OptimaizeLangDetector()
// .setShortText(true)
.loadModels(models);
// .loadModels();
LanguageResult enResult = detector.detect("Hello, my friend!");
// LanguageResult ruResult = detector.detect("Привет, мой друг!");
// LanguageResult deResult = detector.detect("Hallo, mein Freund!");
System.out.println(enResult.getLanguage());
} catch (IOException e) {
throw new ExceptionInInitializerError(e);
}

Related

OpenNLP-Document Categorizer- how to classify documents based on status; language of docs not English, also default features?

I want to classify my documents using OpenNLP's Document Categorizer, based on their status: pre-opened, opened, locked, closed etc.
I have 5 classes and I'm using the Naive Bayes algorithm, with 60 documents in my training set, trained for 1000 iterations with a cutoff of 1.
But without success: when I test them I don't get good results. I was thinking maybe it is because of the language of the documents (they are not in English), or maybe I should somehow add the statuses as features. I have set the default features in the categorizer, although I'm not very familiar with them.
The result should be locked, but it's categorized as opened.
InputStreamFactory in=null;
try {
in= new MarkableFileInputStreamFactory(new
File("D:\\JavaNlp\\doccategorizer\\doccategorizer.txt"));
}
catch (FileNotFoundException e2) {
System.out.println("Creating new input stream");
e2.printStackTrace();
}
ObjectStream lineStream=null;
ObjectStream sampleStream=null;
try {
lineStream = new PlainTextByLineStream(in, "UTF-8");
sampleStream = new DocumentSampleStream(lineStream);
}
catch (IOException e1) {
System.out.println("Document Sample Stream");
e1.printStackTrace();
}
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 1000+"");
params.put(TrainingParameters.CUTOFF_PARAM, 1+"");
params.put(AbstractTrainer.ALGORITHM_PARAM,
NaiveBayesTrainer.NAIVE_BAYES_VALUE);
DoccatModel model=null;
try {
model = DocumentCategorizerME.train("en", sampleStream, params, new
DoccatFactory());
}
catch (IOException e)
{
System.out.println("Training...");
e.printStackTrace();
}
System.out.println("\nModel is successfully trained.");
BufferedOutputStream modelOut=null;
try {
modelOut = new BufferedOutputStream(new
FileOutputStream("D:\\JavaNlp\\doccategorizer\\classifier-maxent.bin"));
}
catch (FileNotFoundException e) {
System.out.println("Creating output stream");
e.printStackTrace();
}
try {
model.serialize(modelOut);
}
catch (IOException e) {
System.out.println("Serialize...");
e.printStackTrace();
}
System.out.println("\nTrained model is kept in: "+"model"+File.separator+"en-cases-classifier-maxent.bin");
DocumentCategorizer doccat = new DocumentCategorizerME(model);
String[] docWords = "Some text here...".replaceAll("[^A-Za-z]", " ").split(" ");
double[] aProbs = doccat.categorize(docWords);
System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
for(int i=0;i<doccat.getNumberOfCategories();i++){
System.out.println(doccat.getCategory(i)+" : "+aProbs[i]);
}
System.out.println("---------------------------------");
System.out.println("\n"+doccat.getBestCategory(aProbs)+" : is the category for the given sentence");
Can someone suggest how to categorize my documents well? For example, should I add a language detector first, or add new features?
Thanks in advance

By default, the document classifier takes the document text and forms a bag of words. Each word in the bag becomes a feature. As long as the language can be tokenized by an English tokenizer (again by default a white space tokenizer), I would guess that the language is not your problem. I would check the format of the data you are using for the training data. It should be formatted like this:
category<tab>document text
The text should fit on one line. The OpenNLP documentation for the document classifier can be found at http://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training.tool
It would be helpful if you could provide a line or two of training data to help examine the format.
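A quick way to check the format yourself is to run the training file's lines through a small validator before training. This is only a sketch using plain Java, not an OpenNLP API: the class name DoccatFormatCheck and the method countPerCategory are hypothetical, and it assumes each line is a category followed by whitespace and the document text.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DoccatFormatCheck {
    // Counts samples per category; throws if a line lacks a category or text.
    static Map<String, Integer> countPerCategory(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            // Split into at most two parts: the category, then the rest of the line.
            String[] parts = lines.get(i).trim().split("\\s+", 2);
            if (parts.length < 2 || parts[0].isEmpty()) {
                throw new IllegalArgumentException(
                        "Line " + (i + 1) + " is not 'category<whitespace>text'");
            }
            counts.merge(parts[0], 1, Integer::sum);
        }
        return counts;
    }
}
```

The returned counts also show at a glance whether some of the classes have only a handful of training samples.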
Edit: Another potential issue: 60 documents may not be enough to train a good classifier, particularly if you have a large vocabulary. Also, even though this is not English, please tell me it is not multiple languages. Finally, is the document text the best way to classify the document? Would metadata from the document itself produce better features?
Hope it helps.

How to use OpenNLP parser models in an Android app?

I went through this tutorial on Java NLP: https://www.tutorialspoint.com/opennlp/index.htm
I tried the code below on Android:
try {
File file = copyAssets();
// InputStream inputStream = new FileInputStream(file);
ParserModel model = new ParserModel(file);
// Creating a parser
Parser parser = ParserFactory.create(model);
// Parsing the sentence
String sentence = "Tutorialspoint is the largest tutorial library.";
Parse topParses[] = ParserTool.parseLine(sentence, parser,1);
for (Parse p : topParses) {
p.show();
}
} catch (Exception e) {
}
I downloaded the file en-parser-chunking.bin from the internet and placed it in the assets of my Android project, but the code stops on the third line, i.e. ParserModel model = new ParserModel(file);, without throwing any exception. How can I make this work on Android? If it cannot work, is there any other NLP support for Android that does not rely on external services?

The reason the code stalls or breaks at runtime is that you need to use an InputStream instead of a File to load the binary model resource. Most likely, the File instance is null when you "load" it the way indicated in line 2. In theory, the ParserModel constructor should detect this and throw an IOException. Sadly, the OpenNLP JavaDoc is not precise about this kind of situation, and you are not handling the exception properly in the catch block.
Moreover, the code snippet you presented should be improved so that you know what actually went wrong.
Therefore, loading a ParserModel from within an Activity should be done differently. Here is a variant that takes care of both aspects:
AssetManager assetManager = getAssets();
InputStream in = null;
try {
    // open() throws an IOException if the asset cannot be found
    in = assetManager.open("en-parser-chunking.bin");
    ParserModel model = new ParserModel(in);
    // From here, <model> is initialized and you can start playing with it...
    // Creating a parser
    Parser parser = ParserFactory.create(model);
    // Parsing the sentence
    String sentence = "Tutorialspoint is the largest tutorial library.";
    Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);
    for (Parse p : topParses) {
        p.show();
    }
}
catch (Exception ex) {
    // asset not found, or the model could not be initialized
    Log.e("NLP", "message: " + ex.getMessage(), ex);
    // proper exception handling here...
}
finally {
    if (in != null) {
        try {
            in.close();
        } catch (IOException ex) {
            Log.w("NLP", "Could not close the model stream.", ex);
        }
    }
}
This way, you use an InputStream and at the same time take care of proper exception and resource handling. Moreover, you can now use a debugger in case something remains unclear about the resource path of your model files. For reference, see the official JavaDoc of AssetManager#open(String resourceName).
Note well:
Loading OpenNLP's binary resources can consume quite a lot of memory. For this reason, your Android app's request to allocate the memory needed for this operation might not be granted by the actual runtime (i.e., smartphone) environment.
Therefore, carefully monitor the amount of requested/required RAM while model = new ParserModel(in); is invoked.
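One rough way to keep an eye on memory is to ask the runtime how much heap is still available before loading the model. This is a minimal sketch using only java.lang.Runtime. The class name HeapHeadroom and the idea of comparing against a required size are assumptions for illustration, and the numbers the runtime reports are estimates only.

```java
final class HeapHeadroom {
    // Bytes the VM could still allocate before reaching its maximum heap size.
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // True if at least `required` bytes appear to be available right now.
    static boolean hasRoomFor(long required) {
        return availableBytes() >= required;
    }
}
```

You could log HeapHeadroom.availableBytes() just before and after the model is constructed to see how much heap the load actually takes on a given device.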
Hope it helps.

Accessing Windows disks directly with Java NIO

I am using a library that uses Java NIO in order to directly map files to memory, but I am having trouble reading disks directly.
I can read the disks directly using FileInputStream with UNC, such as
File disk = new File("\\\\.\\PhysicalDrive0\\");
try (FileInputStream fis = new FileInputStream(disk);
BufferedInputStream bis = new BufferedInputStream(fis)) {
byte[] somebytes = new byte[10];
bis.read(somebytes);
} catch (Exception ex) {
System.out.println("Oh bother");
}
However, I can't extend this to NIO:
File disk = new File("\\\\.\\PhysicalDrive0\\");
Path path = disk.toPath();
try (FileChannel fc = FileChannel.open(path, StandardOpenOption.READ)){
System.out.println("No exceptions! Yay!");
} catch (Exception ex) {
System.out.println("Oh bother");
}
The stacktrace (up to the cause) is:
java.nio.file.FileSystemException: \\.\PhysicalDrive0\: The parameter is incorrect.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:334)
at hdreader.HDReader.testcode(HDReader.java:147)
I haven't been able to find a solution, though I saw something close in "How to access specific raw data on disk from java". The answer by Daniel Alder suggesting the use of GLOBALROOT seems relevant, as it uses FileChannel, but I can't seem to find the drive using this pattern. Is there a way to list all devices under GLOBALROOT or something like that?
At the moment I am looking at replacing uses of NIO with straight InputStreams, but I want to avoid this if I can. Firstly, NIO was used for a reason, and secondly, it runs through a lot of code and will require a lot of work. Finally, I'd like to know how to implement something like Daniel's solution so that I can write to devices or use NIO in the future.
So in summary: how can I access drives directly with Java NIO (not InputStreams), and/or is there a way to list all devices accessible through GLOBALROOT so that I might use Daniel Alder's solution?
Summary of Answers:
I have kept the past edits (below) to avoid confusion. With the help of EJP and Apangin I think I have a workable solution. Something like
private ByteBuffer rafMethod(long posn) {
    ByteBuffer buffer = ByteBuffer.allocate(512);
    buffer.rewind();
    try (RandomAccessFile raf = new RandomAccessFile(disk.getPath(), "r");
         SeekableByteChannel sbc = raf.getChannel()) {
        // seek first; posn must be a multiple of the sector size
        sbc.position(posn);
        sbc.read(buffer);
    } catch (Exception ex) {
        System.out.println("Oh bother: " + ex);
        ex.printStackTrace();
    }
    return buffer;
}
This will work as long as the posn parameter is a multiple of the sector size (set at 512 in this case). Note that this also works with Channels.newChannel(FileInputStream), which seems to always return a SeekableByteChannel in this case, so it appears safe to cast to one.
From quick and dirty testing it appears that these methods truly do seek and don't just skip. I searched for a thousand locations at the start of my drive and it read them. I did the same but added an offset of half of the disk size (to search the back of the disk). I found:
Both methods took almost the same time.
Searching the start or the end of the disk did not affect time.
Reducing the range of the addresses did reduce time.
Sorting the addresses did reduce time, but not by much.
This suggests to me that this is truly seeking and not merely reading and skipping (as a stream tends to). The speed is still terrible at this stage and it makes my hard drive sound like a washing machine, but the code was designed for a quick test and has yet to be made pretty. It may still work fine.
Thanks to both EJP and Apangin for the help. Read more in their respective answers.
Edit:
I have since run my code on a Windows 7 machine (I didn't have one originally), and I get a slightly different exception (see below). This was run with admin privileges, and the first piece of code still works under the same conditions.
java.nio.file.FileSystemException: \\.\PhysicalDrive0\: A device attached to the system is not functioning.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:115)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at testapp.TestApp.doStuff(TestApp.java:30)
at testapp.TestApp.main(TestApp.java:24)
Edit 2:
In response to EJP, I have tried:
byte[] bytes = new byte[20];
ByteBuffer bb = ByteBuffer.wrap(bytes);
bb.rewind();
File disk = new File("\\\\.\\PhysicalDrive0\\");
try (FileInputStream fis = new FileInputStream(disk);
ReadableByteChannel rbc = Channels.newChannel(new FileInputStream(disk))) {
System.out.println("Channel created");
int read = rbc.read(bb);
System.out.println("Read " + read + " bytes");
System.out.println("No exceptions! Yay!");
} catch (Exception ex) {
System.out.println("Oh bother: " + ex);
}
When I try this I get the following output:
Channel created
Oh bother: java.io.IOException: The parameter is incorrect
So it appears that I can create a FileChannel or ReadableByteChannel, but I can't use it; that is, the error is simply deferred.

When accessing a physical drive without buffering, you can read only complete sectors. This means that if the sector size is 512 bytes, you can read only multiples of 512 bytes. Change your buffer length to 512 or 4096 (whatever your sector size is) and FileChannel will work fine:
ByteBuffer buf = ByteBuffer.allocate(512);
try (RandomAccessFile raf = new RandomAccessFile("\\\\.\\PhysicalDrive0", "r");
FileChannel fc = raf.getChannel()) {
fc.read(buf);
System.out.println("It worked! Read bytes: " + buf.position());
} catch (Exception e) {
e.printStackTrace();
}
See Alignment and File Access Requirements.
Your original FileInputStream code works because of the BufferedInputStream, which has a default buffer size of 8192. Take it away and the code will fail with the same exception.
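If you need a few bytes at an arbitrary offset, the sector rule can be wrapped in a small helper that widens the request to whole sectors and then cuts out the part you asked for. This is only a sketch: the class name SectorAlign and its methods are made up for illustration, the sector size is passed in as an assumption rather than queried from the device, and I have only reasoned about it for regular files, not for a real \\.\PhysicalDrive0.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class SectorAlign {
    // Largest multiple of sectorSize that is <= pos.
    static long alignDown(long pos, int sectorSize) {
        return pos - (pos % sectorSize);
    }

    // Smallest multiple of sectorSize that is >= len.
    static long alignUp(long len, int sectorSize) {
        return ((len + sectorSize - 1) / sectorSize) * sectorSize;
    }

    // Returns len bytes starting at pos, issuing one sector-aligned read.
    static byte[] read(FileChannel ch, long pos, int len, int sectorSize) throws IOException {
        long start = alignDown(pos, sectorSize);
        int offset = (int) (pos - start);
        int span = (int) alignUp(offset + len, sectorSize);
        ByteBuffer buf = ByteBuffer.allocate(span);
        ch.read(buf, start); // aligned position, aligned length
        if (buf.position() < offset + len) {
            throw new IOException("Short read: got " + buf.position() + " bytes");
        }
        buf.flip();
        buf.position(offset);
        byte[] out = new byte[len];
        buf.get(out);
        return out;
    }
}
```

For example, asking for 20 bytes at offset 1000 with 512-byte sectors turns into a single 512-byte read at offset 512, from which bytes 488 to 507 are returned.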

Using NIO, your original code needs to change only slightly.
Path disk = Paths.get("d:\\.");
try (ByteChannel bc = Files.newByteChannel(disk, StandardOpenOption.READ)) {
ByteBuffer buffer = ByteBuffer.allocate(10);
bc.read(buffer);
} catch (Exception e){
e.printStackTrace();
}
This is fine, workable code, but I get an access denied error with both your version and mine.

Run this as administrator. It really does work, as it's only a thin wrapper over java.io:
try (FileInputStream fis = new FileInputStream(disk);
ReadableByteChannel fc = Channels.newChannel(fis))
{
System.out.println("No exceptions! Yay!");
ByteBuffer bb = ByteBuffer.allocate(4096);
int count = fc.read(bb);
System.out.println("read count="+count);
}
catch (Exception ex)
{
System.out.println("Oh bother: "+ex);
ex.printStackTrace();
}
EDIT If you need random access, you're stuck with RandomAccessFile. There's no mapping from that via Channels. But the solution above isn't NIO anyway, just a Java NIO layer over FileInput/OutputStream.

iText Out of memory on a 5440032-byte allocation

I have encountered a problem while extracting text from a PDF.
01-29 09:44:15.397: E/dalvikvm-heap(8037): Out of memory on a 5440032-byte allocation.
I looked at the contents of the page, and it has an image above the text. What I want to know is: how do I catch the error and skip that page? I have tried:
try {
pages = new String[pdfPage];
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for (int pageNum = 1; pageNum <= pdfPage; pageNum++) {
// String original_content = "";
// original_content = PdfTextExtractor.getTextFromPage(reader,
// pageNum, new SimpleTextExtractionStrategy());
Log.e("MyActivity", "PageCatch: " + (pageNum + fromPage));
strategy = parser.processContent(pageNum,
new SimpleTextExtractionStrategy());
readPDF(strategy.getResultantText(), pageNum - 1);
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
The try/catch above does not catch the error thrown by strategy = parser.processContent(pageNum, new SimpleTextExtractionStrategy());
I already tried commenting out all the lines inside the for loop, and there was no error. But when I leave only strategy = parser.processContent(pageNum, new SimpleTextExtractionStrategy()); uncommented, the error occurs.

As I understand it, this error occurs when there is not enough memory to hold the data you are reading. I believe you can't catch that error.
I would strongly suggest dropping old data you no longer need and keeping only lightweight data in your variables.
Or refer to this question:
Out of memory error due to large number of image thumbnails to display

You want to catch the error and skip that page and tried using
try {
...
} catch (Exception e) {
...
}
which didn't do the trick. Unless the DalvikVM handles out-of-memory situations completely differently from Java VMs, this is no surprise: the Throwable used by Java in such situations is an OutOfMemoryError, i.e. not an Exception but an Error, the other big subtype of Throwable. Thus, you might want to try
} catch (OutOfMemoryError e) {
or
} catch (Error e) {
or even
} catch (Throwable e) {
to handle your issue. Beware, though, when an Error is thrown, this generally means something bad is happening; catching and ignoring it, therefore, might result in a weird program state.
Obviously, though, if you (as you said) only want to try and skip a single page and otherwise continue, you'll have to position the try { ... } catch() { ... } differently, more specifically around the handling of the single page, i.e. inside the loop.
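As a sketch of that positioning, the loop below puts the try/catch around a single page so that one failing page does not abort the rest. It deliberately uses no iText classes: PageExtractor is a hypothetical stand-in for the call to parser.processContent(...), and recording null for a skipped page is just one possible choice.

```java
import java.util.ArrayList;
import java.util.List;

final class PageLoopSketch {
    // Hypothetical stand-in for the per-page extraction call.
    interface PageExtractor {
        String extract(int pageNum);
    }

    // Extracts pages 1..pageCount; a page that runs out of memory is recorded as null.
    static List<String> extractAll(int pageCount, PageExtractor extractor) {
        List<String> pages = new ArrayList<>();
        for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
            try {
                pages.add(extractor.extract(pageNum));
            } catch (OutOfMemoryError e) {
                // Only this page is lost; the loop continues with the next one.
                pages.add(null);
            }
        }
        return pages;
    }
}
```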
On the other hand, dropping all references to objects held by the PDF library and re-opening the PDF might help; remember Kevin's answer to your question "Search Text and Capacity of iText to read" on the iText-Questions mailing list. Following that advice, you'd have all iText use and a limited loop (for a confined number of pages) inside the try { ... } catch() { ... }; you'd merely remember the last page read in some outer variables.
Furthermore you can limit memory usage by using a PdfReader constructor taking a RandomAccessFileOrArray parameter --- readers constructed that way don't hold all the PDF in memory but instead only the cross reference table and some central objects. All else is read on demand.

Is there a good way to persist printer settings in a Swing app?

We are using the new Java printing API which uses PrinterJob.printDialog(attributes) to display the dialog to the user.
Wanting to save the user's settings for the next time, I wanted to do this:
PrintRequestAttributeSet attributes = loadAttributesFromPreferences();
if (printJob.printDialog(attributes)) {
// print, and then...
saveAttributesToPreferences(attributes);
}
However, what I found by doing this is that sometimes (I haven't figured out how, yet) the attributes get some bad data inside, and then when you print, you get a white page of nothing. Then the code saves the poisoned settings into the preferences, and all subsequent print runs get poisoned settings too. Additionally, the entire point of the exercise, making the settings for the new run the same as the user chose for the previous run, is defeated, because the new dialog does not appear to use the old settings.
So I would like to know if there is a proper way to do this. Surely Sun didn't intend that users have to select the printer, page size, orientation and margin settings every time the application starts up.
Edit to show the implementation of the storage methods:
private PrintRequestAttributeSet loadAttributesFromPreferences()
{
PrintRequestAttributeSet attributes = null;
byte[] marshaledAttributes = preferences.getByteArray(PRINT_REQUEST_ATTRIBUTES_KEY, null);
if (marshaledAttributes != null)
{
try
{
@SuppressWarnings({"IOResourceOpenedButNotSafelyClosed"})
ObjectInput objectInput = new ObjectInputStream(new ByteArrayInputStream(marshaledAttributes));
attributes = (PrintRequestAttributeSet) objectInput.readObject();
}
catch (IOException e)
{
// Can occur due to invalid object data e.g. InvalidClassException, StreamCorruptedException
Logger.getLogger(getClass()).warn("Error trying to read print attributes from preferences", e);
}
catch (ClassNotFoundException e)
{
Logger.getLogger(getClass()).warn("Class not found trying to read print attributes from preferences", e);
}
}
if (attributes == null)
{
attributes = new HashPrintRequestAttributeSet();
}
return attributes;
}
private void saveAttributesToPreferences(PrintRequestAttributeSet attributes)
{
ByteArrayOutputStream storage = new ByteArrayOutputStream();
try
{
ObjectOutput objectOutput = new ObjectOutputStream(storage);
try
{
objectOutput.writeObject(attributes);
}
finally
{
objectOutput.close(); // side-effect of flushing the underlying stream
}
}
catch (IOException e)
{
throw new IllegalStateException("I/O error writing to a stream going to a byte array", e);
}
preferences.putByteArray(PRINT_REQUEST_ATTRIBUTES_KEY, storage.toByteArray());
}
Edit: Okay, it seems like the reason it isn't remembering the printer is that it isn't in the PrintRequestAttributeSet at all. Indeed, the margins and page sizes are remembered, at least until the settings get poisoned at random. But the printer chosen by the user is not here:
[0] = {java.util.HashMap$Entry@9494} class javax.print.attribute.standard.Media -> na-letter
[1] = {java.util.HashMap$Entry@9501} class javax.print.attribute.standard.Copies -> 1
[2] = {java.util.HashMap$Entry@9510} class javax.print.attribute.standard.MediaPrintableArea -> (10.0,10.0)->(195.9,259.4)mm
[3] = {java.util.HashMap$Entry@9519} class javax.print.attribute.standard.OrientationRequested -> portrait

It appears that what you're looking for is the PrintServiceAttributeSet, rather than the PrintRequestAttributeSet.
Take a look at the PrintServiceAttribute interface, and see if the elements you need have been implemented as classes. If not, you can implement your own PrintServiceAttribute class(es).
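Another option, offered here as an assumption rather than something confirmed in this thread, is to persist the chosen printer's name separately and look the service up again on the next run. The sketch below uses only javax.print; the class name PrinterLookupSketch and the method findByName are made up for illustration.

```java
import javax.print.PrintService;

final class PrinterLookupSketch {
    // Returns the service whose name equals savedName, or null if none matches.
    static PrintService findByName(PrintService[] services, String savedName) {
        if (savedName == null) {
            return null;
        }
        for (PrintService service : services) {
            if (savedName.equals(service.getName())) {
                return service;
            }
        }
        return null;
    }
}
```

Assumed usage: after a successful dialog, store printJob.getPrintService().getName() in the preferences. On the next run, pass PrintServiceLookup.lookupPrintServices(null, null) and the stored name to findByName, and if the result is not null, call printJob.setPrintService(...) before showing the dialog.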
