I have encountered a problem while extracting text from a PDF.
01-29 09:44:15.397: E/dalvikvm-heap(8037): Out of memory on a 5440032-byte allocation.
I looked up the contents of the page and it has an image above the text. What I want to know is: how do I catch the error and skip that page? I have tried:
try {
    pages = new String[pdfPage];
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    TextExtractionStrategy strategy;
    for (int pageNum = 1; pageNum <= pdfPage; pageNum++) {
        // String original_content = "";
        // original_content = PdfTextExtractor.getTextFromPage(reader,
        //         pageNum, new SimpleTextExtractionStrategy());
        Log.e("MyActivity", "PageCatch: " + (pageNum + fromPage));
        strategy = parser.processContent(pageNum,
                new SimpleTextExtractionStrategy());
        readPDF(strategy.getResultantText(), pageNum - 1);
    }
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
The try/catch above does not catch the error thrown by strategy = parser.processContent(pageNum, new SimpleTextExtractionStrategy());. I already tried commenting out all the lines inside the for loop and there was no error, but as soon as I leave that processContent call in, it errors again.
As I understand the error, it occurs when there is not enough memory to hold the data you are reading, and I believe you can't catch that error.
I would strongly suggest dropping some old data and making sure you don't hold overly heavy data in your variables.
Or refer to this:
Out of memory error due to large number of image thumbnails to display
You want to catch the error and skip that page and tried using
try {
...
} catch (Exception e) {
...
}
which didn't do the trick. Unless the Dalvik VM handles out-of-memory situations completely differently than Java VMs, this is no surprise: the Throwable used by Java in such situations is an OutOfMemoryError, i.e. not an Exception but an Error, the other big subtype of Throwable. Thus, you might want to try
} catch (OutOfMemoryError e) {
or
} catch (Error e) {
or even
} catch (Throwable e) {
to handle your issue. Beware, though, when an Error is thrown, this generally means something bad is happening; catching and ignoring it, therefore, might result in a weird program state.
Obviously, though, if you (as you said) only want to try and skip a single page and otherwise continue, you'll have to position the try { ... } catch() { ... } differently, more specifically around the handling of the single page, i.e. inside the loop.
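A minimal sketch of that restructuring, assuming the iText 5 classes and the readPDF(...) helper from your own snippet, might look like this (each page whose extraction runs out of memory is logged and skipped):
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int pageNum = 1; pageNum <= pdfPage; pageNum++) {
    try {
        TextExtractionStrategy strategy = parser.processContent(pageNum,
                new SimpleTextExtractionStrategy());
        readPDF(strategy.getResultantText(), pageNum - 1);
    } catch (OutOfMemoryError e) {
        // this page needed more memory than the VM could provide;
        // log it and continue with the next page
        Log.e("MyActivity", "Skipping page " + pageNum + " (out of memory)", e);
    }
}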
On the other hand, dropping all references to objects held by the PDF library and re-opening the PDF might help; remember Kevin's answer to your question Search Text and Capacity of iText to read on the iText-Questions mailing list. Following that advice you'd put all iText use and a limited loop (over a confined number of pages) inside the try { ... } catch() { ... }, and merely remember the last page read in some outer variables.
Furthermore, you can limit memory usage by using a PdfReader constructor taking a RandomAccessFileOrArray parameter: readers constructed that way don't hold the whole PDF in memory but only the cross-reference table and some central objects. Everything else is read on demand.
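A rough sketch of that partial-read setup (assuming iText 5.x, where such a constructor exists; pdfPath is a made-up variable for your file path, and exception handling is omitted):
// only the cross-reference table and central objects are kept in memory;
// page content is loaded on demand
RandomAccessFileOrArray raf = new RandomAccessFileOrArray(pdfPath);
PdfReader reader = new PdfReader(raf, null);
try {
    // ... per-page extraction as above ...
} finally {
    reader.close();
}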
I tried to detect the language of a short phrase and was surprised that the detection result is wrong.
LanguageDetector detector = new OptimaizeLangDetector();
try {
    detector.loadModels();
} catch (IOException e) {
    LOG.error(e.getMessage(), e);
    throw new ExceptionInInitializerError(e);
}
LanguageResult languageResult = detector.detect("Hello, my friend!");
The languageResult contains Norwegian with "medium" probability. Why? I think it should be English instead. Longer phrases seem to be detected properly. Does this mean that Apache Tika should not be used on short text?
This will not work well on short text. As the documentation says:
Implementation of the LanguageDetector API that uses
https://github.com/optimaize/language-detector
From https://tika.apache.org/1.13/api/org/apache/tika/langdetect/OptimaizeLangDetector.html
Looking at that GitHub project and its known challenges, they acknowledge issues with short texts:
This software does not work as well when the input text to analyze is
short, or unclean. For example tweets.
From the "Challenges" section of https://github.com/optimaize/language-detector
I could reproduce the issue.
It may not directly answer the question, but it can be considered a workaround...
It seems that if you know which languages can be expected, you can pass them to the detector via the loadModels(models) method. This approach helps to detect English correctly:
try {
    Set<String> models = new HashSet<>();
    models.add("en");
    models.add("ru");
    models.add("de");
    LanguageDetector detector = new OptimaizeLangDetector()
            // .setShortText(true)
            .loadModels(models);
            // .loadModels();
    LanguageResult enResult = detector.detect("Hello, my friend!");
    // LanguageResult ruResult = detector.detect("Привет, мой друг!");
    // LanguageResult deResult = detector.detect("Hallo, mein Freund!");
    System.out.println(enResult.getLanguage());
} catch (IOException e) {
    throw new ExceptionInInitializerError(e);
}
I want to classify my documents using OpenNLP's Document Categorizer, based on their status: pre-opened, opened, locked, closed etc.
I have 5 classes, I'm using the Naive Bayes algorithm, I have 60 documents in my training set, and I trained the model with 1000 iterations and a cutoff parameter of 1.
But no success: when I test them, I don't get good results. I was thinking maybe it is because of the language of the documents (it is not English), or maybe I should somehow add the statuses as features. I have kept the default features in the categorizer, and I'm also not very familiar with them.
The result should be locked, but it's categorized as opened.
InputStreamFactory in = null;
try {
    in = new MarkableFileInputStreamFactory(
            new File("D:\\JavaNlp\\doccategorizer\\doccategorizer.txt"));
} catch (FileNotFoundException e2) {
    System.out.println("Creating new input stream");
    e2.printStackTrace();
}

ObjectStream lineStream = null;
ObjectStream sampleStream = null;
try {
    lineStream = new PlainTextByLineStream(in, "UTF-8");
    sampleStream = new DocumentSampleStream(lineStream);
} catch (IOException e1) {
    System.out.println("Document Sample Stream");
    e1.printStackTrace();
}

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 1000 + "");
params.put(TrainingParameters.CUTOFF_PARAM, 1 + "");
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

DoccatModel model = null;
try {
    model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
} catch (IOException e) {
    System.out.println("Training...");
    e.printStackTrace();
}
System.out.println("\nModel is successfully trained.");

BufferedOutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(
            new FileOutputStream("D:\\JavaNlp\\doccategorizer\\classifier-maxent.bin"));
} catch (FileNotFoundException e) {
    System.out.println("Creating output stream");
    e.printStackTrace();
}
try {
    model.serialize(modelOut);
} catch (IOException e) {
    System.out.println("Serialize...");
    e.printStackTrace();
}
System.out.println("\nTrained model is kept in: "
        + "model" + File.separator + "en-cases-classifier-maxent.bin");

DocumentCategorizer doccat = new DocumentCategorizerME(model);
String[] docWords = "Some text here...".replaceAll("[^A-Za-z]", " ").split(" ");
double[] aProbs = doccat.categorize(docWords);

System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
    System.out.println(doccat.getCategory(i) + " : " + aProbs[i]);
}
System.out.println("---------------------------------");
System.out.println("\n" + doccat.getBestCategory(aProbs) + " : is the category for the given sentence");
Can someone suggest how I can categorize my documents better? For example, should I add a language detector first, or add new features?
Thanks in advance.
By default, the document classifier takes the document text and forms a bag of words. Each word in the bag becomes a feature. As long as the language can be tokenized by an English tokenizer (again by default a white space tokenizer), I would guess that the language is not your problem. I would check the format of the data you are using for the training data. It should be formatted like this:
category<tab>document text
The text should all be on one line. The OpenNLP documentation for the document classifier can be found at http://opennlp.apache.org/docs/1.9.0/manual/opennlp.html#tools.doccat.training.tool
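For illustration, using the statuses from your question, a few made-up training lines would look like this (a tab separates the category from the text):
locked	The case was locked by the administrator after the final review was completed.
opened	The case has been opened and assigned to an agent for processing.
closed	The case was closed because no further action was required.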
It would be helpful if you could provide a line or two of training data to help examine the format.
Edit: Another potential issue. 60 documents may not be enough to train a good classifier, particularly if you have a large vocabulary. Also, even though this is not English, please tell me it is not multiple languages. Finally, is the document text the best way to classify the document? Would metadata from the document itself produce better features?
Hope it helps.
I went through this link for Java NLP: https://www.tutorialspoint.com/opennlp/index.htm
I tried the below code in Android:
try {
    File file = copyAssets();
    // InputStream inputStream = new FileInputStream(file);
    ParserModel model = new ParserModel(file);
    // Creating a parser
    Parser parser = ParserFactory.create(model);
    // Parsing the sentence
    String sentence = "Tutorialspoint is the largest tutorial library.";
    Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    for (Parse p : topParses) {
        p.show();
    }
} catch (Exception e) {
}
I downloaded the file **en-parser-chunking.bin** from the internet and placed it in the assets of my Android project, but the code stops on the third line, i.e. ParserModel model = new ParserModel(file);, without giving any exception. I need to know how this can work in Android. If it can't, is there any other support for NLP on Android that doesn't consume any services?
The reason the code stalls/breaks at runtime is that you need to use an InputStream instead of a File to load the binary model resource. Most likely, the File instance is null when you "load" it the way indicated in line 2. In theory, this constructor of ParserModel should detect this and an IOException should be thrown. Yet, sadly, the JavaDoc of OpenNLP is not precise about this kind of situation, and you are not handling this exception properly in the catch block.
Moreover, the code snippet you presented should be improved, so that you know what actually went wrong.
Therefore, loading a ParserModel from within an Activity should be done differently. Here is a variant that takes care of both aspects:
AssetManager assetManager = getAssets();
InputStream in = null;
try {
    in = assetManager.open("en-parser-chunking.bin");
    ParserModel model = new ParserModel(in);
    // From here, <model> is initialized and you can start playing with it...
    // Creating a parser
    Parser parser = ParserFactory.create(model);
    // Parsing the sentence
    String sentence = "Tutorialspoint is the largest tutorial library.";
    Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    for (Parse p : topParses) {
        p.show();
    }
} catch (Exception ex) {
    // also reached if the model file could not be found in assets or could not be read
    Log.e("NLP", "message: " + ex.getMessage(), ex);
    // proper exception handling here...
} finally {
    if (in != null) {
        try {
            in.close();
        } catch (IOException ignored) {
            // nothing sensible left to do if closing fails
        }
    }
}
This way, you're using an InputStream approach, and at the same time you take care of proper exception and resource handling. Moreover, you can now use a debugger in case something remains unclear with the resource path references of your model files. For reference, see the official JavaDoc of AssetManager#open(String resourceName).
Note well:
Loading OpenNLP's binary resources can consume quite a lot of memory. For this reason, it may be that your Android app's request to allocate the needed memory for this operation will not be granted by the actual runtime (i.e., smartphone) environment.
Therefore, carefully monitor the amount of requested/required RAM while model = new ParserModel(in); is invoked.
Hope it helps.
I am getting a Java heap space error while writing large amounts of data from a database to an Excel sheet.
I don't want to use the JVM -Xmx option to increase memory.
Following are the details:
1) I am using the org.apache.poi.hssf API for Excel sheet writing.
2) JDK version 1.5
3) Tomcat 6.0
The code I have written works well for around 23 thousand records, but it fails for more than 23K records.
Following is the code:
ArrayList l_objAllTBMList= new ArrayList();
l_objAllTBMList = (ArrayList) m_objFreqCvrgDAO.fetchAllTBMUsers(p_strUserTerritoryId);
ArrayList l_objDocList = new ArrayList();
m_objTotalDocDtlsInDVL= new HashMap();
Object l_objTBMRecord[] = null;
Object l_objVstdDocRecord[] = null;
int l_intDocLstSize=0;
VisitedDoctorsVO l_objVisitedDoctorsVO=null;
int l_tbmListSize=l_objAllTBMList.size();
System.out.println(" getMissedDocDtlsList_NSM ");
for(int i=0; i<l_tbmListSize;i++)
{
l_objTBMRecord = (Object[]) l_objAllTBMList.get(i);
l_objDocList = (ArrayList) m_objGenerateVisitdDocsReportDAO.fetchAllDocDtlsInDVL_NSM((String) l_objTBMRecord[1], p_divCode, (String) l_objTBMRecord[2], p_startDt, p_endDt, p_planType, p_LMSValue, p_CycleId, p_finYrId);
l_intDocLstSize=l_objDocList.size();
try {
l_objVOFactoryForDoctors = new VOFactory(l_intDocLstSize, VisitedDoctorsVO.class);
/* Factory class written to create and maintain limited no of Value Objects (VOs)*/
} catch (ClassNotFoundException ex) {
m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:"+ex);
} catch (InstantiationException ex) {
m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:"+ex);
} catch (IllegalAccessException ex) {
m_objLogger.debug("DEBUG:getMissedDocDtlsList_NSM :Exception:"+ex);
}
for(int j=0; j<l_intDocLstSize;j++)
{
l_objVstdDocRecord = (Object[]) l_objDocList.get(j);
l_objVisitedDoctorsVO = (VisitedDoctorsVO) l_objVOFactoryForDoctors.getVo();
if (((String) l_objVstdDocRecord[6]).equalsIgnoreCase("-"))
{
if (String.valueOf(l_objVstdDocRecord[2]) != "null")
{
l_objVisitedDoctorsVO.setPotential_score(String.valueOf(l_objVstdDocRecord[2]));
l_objVisitedDoctorsVO.setEmpcode((String) l_objTBMRecord[1]);
l_objVisitedDoctorsVO.setEmpname((String) l_objTBMRecord[0]);
l_objVisitedDoctorsVO.setDoctorid((String) l_objVstdDocRecord[1]);
l_objVisitedDoctorsVO.setDr_name((String) l_objVstdDocRecord[4] + " " + (String) l_objVstdDocRecord[5]);
l_objVisitedDoctorsVO.setDoctor_potential((String) l_objVstdDocRecord[3]);
l_objVisitedDoctorsVO.setSpeciality((String) l_objVstdDocRecord[7]);
l_objVisitedDoctorsVO.setActualpractice((String) l_objVstdDocRecord[8]);
l_objVisitedDoctorsVO.setLastmet("-");
l_objVisitedDoctorsVO.setPreviousmet("-");
m_objTotalDocDtlsInDVL.put((String) l_objVstdDocRecord[1], l_objVisitedDoctorsVO);
}
}
}// End of inner for loop
writeExcelSheet(); // Pasting this method at the end
// Clean up code
l_objVOFactoryForDoctors.resetFactory();
m_objTotalDocDtlsInDVL.clear();// Clear the used map
l_objDocList=null;
l_objTBMRecord=null;
l_objVstdDocRecord=null;
}// End of outer for loop
l_objAllTBMList=null;
m_objTotalDocDtlsInDVL=null;
-------------------------------------------------------------------
private void writeExcelSheet() throws IOException
{
HSSFRow l_objRow = null;
HSSFCell l_objCell = null;
VisitedDoctorsVO l_objVisitedDoctorsVO = null;
Iterator l_itrDocMap = m_objTotalDocDtlsInDVL.keySet().iterator();
while (l_itrDocMap.hasNext())
{
Object key = l_itrDocMap.next();
l_objVisitedDoctorsVO = (VisitedDoctorsVO) m_objTotalDocDtlsInDVL.get(key);
l_objRow = m_objSheet.createRow(m_iRowCount++);
l_objCell = l_objRow.createCell(0);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(String.valueOf(l_intSrNo++));
l_objCell = l_objRow.createCell(1);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getEmpname() + " (" + l_objVisitedDoctorsVO.getEmpcode() + ")"); // TBM Name
l_objCell = l_objRow.createCell(2);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getDr_name());// Doc Name
l_objCell = l_objRow.createCell(3);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getPotential_score());// Freq potential score
l_objCell = l_objRow.createCell(4);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getDoctor_potential());// Doctor potential
l_objCell = l_objRow.createCell(5);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getSpeciality());//CP_GP_SPL
l_objCell = l_objRow.createCell(6);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getActualpractice());// Actual practise
l_objCell = l_objRow.createCell(7);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getPreviousmet());// Previous met
l_objCell = l_objRow.createCell(8);
l_objCell.setCellStyle(m_objCellStyle4);
l_objCell.setCellValue(l_objVisitedDoctorsVO.getLastmet());// Last met
}
// Write OutPut Stream
try {
out = new FileOutputStream(m_objFile);
outBf = new BufferedOutputStream(out);
m_objWorkBook.write(outBf);
} catch (Exception ioe) {
ioe.printStackTrace();
System.out.println(" Exception in chunk write");
} finally {
if (outBf != null) {
outBf.flush();
outBf.close();
out.close();
l_objRow=null;
l_objCell=null;
}
}
}
Instead of populating the complete list in memory before starting to write to Excel, you need to modify the code so that each object is written to the sheet as it is read from the database. Take a look at this question to get an idea of that other approach.
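A minimal sketch of that restructuring, reusing the names from your snippet (fillRow(...) is a hypothetical helper doing the per-cell work currently in writeExcelSheet(); exception handling is omitted for brevity):
int rowIndex = 0;
for (int i = 0; i < l_tbmListSize; i++) {
    Object[] tbmRecord = (Object[]) l_objAllTBMList.get(i);
    ArrayList docList = (ArrayList) m_objGenerateVisitdDocsReportDAO
            .fetchAllDocDtlsInDVL_NSM((String) tbmRecord[1], p_divCode,
                    (String) tbmRecord[2], p_startDt, p_endDt, p_planType,
                    p_LMSValue, p_CycleId, p_finYrId);
    for (int j = 0; j < docList.size(); j++) {
        Object[] docRecord = (Object[]) docList.get(j);
        // write this record's row immediately instead of putting a VO into a map
        fillRow(m_objSheet.createRow(rowIndex++), tbmRecord, docRecord);
    }
    docList = null; // this chunk can be garbage collected before the next DAO call
}
// then open the output stream and call m_objWorkBook.write(...) once at the end,
// as writeExcelSheet() already does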
Well, I'm not sure whether POI can handle incremental updates, but if so, you might want to write chunks of, say, 10,000 rows to the file. If not, you might have to use CSV instead (so no formatting) or increase memory.
The problem is that you need to make the objects written to the file eligible for garbage collection (no references from a live thread anymore) before writing the file is finished (before all rows have been generated and written to the file).
Edit:
If you can write smaller chunks of data to the file, you'd also have to load only the necessary chunks from the DB. It doesn't make sense to load 50,000 records at once and then try to write 5 chunks of 10,000, since those 50,000 records are likely to consume a lot of memory already.
As Thomas points out, you have too many objects taking up too much space and need a way to reduce that. There are a couple of strategies for this I can think of:
Do you need to create a new factory each time in the loop, or can you reuse it?
Can you start with a loop getting the information you need into a new structure, and then discarding the old one?
Can you split the processing into a thread chain, sending information forwards to the next step, avoiding building a large memory consuming structure at all?
We are using the new Java printing API which uses PrinterJob.printDialog(attributes) to display the dialog to the user.
Wanting to save the user's settings for the next time, I wanted to do this:
PrintRequestAttributeSet attributes = loadAttributesFromPreferences();
if (printJob.printDialog(attributes)) {
// print, and then...
saveAttributesToPreferences(attributes);
}
However, what I found by doing this is that sometimes (I haven't figured out how, yet) the attributes get some bad data inside, and then when you print, you get a white page of nothing. Then the code saves the poisoned settings into the preferences, and all subsequent print runs get poisoned settings too. Additionally, the entire point of the exercise, making the settings for the new run the same as the user chose for the previous run, is defeated, because the new dialog does not appear to use the old settings.
So I would like to know if there is a proper way to do this. Surely Sun didn't intend that users have to select the printer, page size, orientation and margin settings every time the application starts up.
Edit to show the implementation of the storage methods:
private PrintRequestAttributeSet loadAttributesFromPreferences()
{
    PrintRequestAttributeSet attributes = null;
    byte[] marshaledAttributes = preferences.getByteArray(PRINT_REQUEST_ATTRIBUTES_KEY, null);
    if (marshaledAttributes != null)
    {
        try
        {
            @SuppressWarnings({"IOResourceOpenedButNotSafelyClosed"})
            ObjectInput objectInput = new ObjectInputStream(new ByteArrayInputStream(marshaledAttributes));
            attributes = (PrintRequestAttributeSet) objectInput.readObject();
        }
        catch (IOException e)
        {
            // Can occur due to invalid object data e.g. InvalidClassException, StreamCorruptedException
            Logger.getLogger(getClass()).warn("Error trying to read print attributes from preferences", e);
        }
        catch (ClassNotFoundException e)
        {
            Logger.getLogger(getClass()).warn("Class not found trying to read print attributes from preferences", e);
        }
    }
    if (attributes == null)
    {
        attributes = new HashPrintRequestAttributeSet();
    }
    return attributes;
}

private void saveAttributesToPreferences(PrintRequestAttributeSet attributes)
{
    ByteArrayOutputStream storage = new ByteArrayOutputStream();
    try
    {
        ObjectOutput objectOutput = new ObjectOutputStream(storage);
        try
        {
            objectOutput.writeObject(attributes);
        }
        finally
        {
            objectOutput.close(); // side-effect of flushing the underlying stream
        }
    }
    catch (IOException e)
    {
        throw new IllegalStateException("I/O error writing to a stream going to a byte array", e);
    }
    preferences.putByteArray(PRINT_REQUEST_ATTRIBUTES_KEY, storage.toByteArray());
}
Edit: Okay, it seems like the reason it isn't remembering the printer is that it isn't in the PrintRequestAttributeSet at all. Indeed, the margins and page sizes are remembered, at least until the settings get poisoned at random. But the printer chosen by the user is not here:
[0] = {java.util.HashMap$Entry#9494} class javax.print.attribute.standard.Media -> na-letter
[1] = {java.util.HashMap$Entry#9501} class javax.print.attribute.standard.Copies -> 1
[2] = {java.util.HashMap$Entry#9510} class javax.print.attribute.standard.MediaPrintableArea -> (10.0,10.0)->(195.9,259.4)mm
[3] = {java.util.HashMap$Entry#9519} class javax.print.attribute.standard.OrientationRequested -> portrait
It appears that what you're looking for is the PrintServiceAttributeSet, rather than the PrintRequestAttributeSet.
Take a look at the PrintServiceAttribute interface, and see if the elements you need have been implemented as classes. If not, you can implement your own PrintServiceAttribute class(es).
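If what you mainly miss is the printer itself, one hedged way to complement the PrintRequestAttributeSet is to remember the selected PrintService by name and restore it before showing the dialog. In the sketch below, the "printServiceName" preferences key is made up for the example; the lookup and setPrintService calls are the standard javax.print and PrinterJob API:
// after a successful dialog: remember which printer the user picked
PrintService chosen = printJob.getPrintService();
preferences.put("printServiceName", chosen.getName());

// on the next run: try to restore that printer before showing the dialog
String savedName = preferences.get("printServiceName", null);
if (savedName != null) {
    for (PrintService service : PrintServiceLookup.lookupPrintServices(null, null)) {
        if (savedName.equals(service.getName())) {
            try {
                printJob.setPrintService(service);
            } catch (PrinterException e) {
                // the saved printer is no longer available; fall back to the default
            }
            break;
        }
    }
}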