Java heap space when adding documents to List of documents - java

I am using import org.w3c.dom.Document; for the Document type.
I have this block of code that parses the XML files listed in the ArrayList fileList. There are more than 2000 XML files to parse, each around 30-50 KB. Parsing the files on its own causes no problem:
try {
    for (int i = 0; i < fileList.size(); i++) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(fileList.get(i)); // <------ the error points here when docList.add(doc) is active
        docList.add(doc);
    }
} catch (ParserConfigurationException | SAXException | IOException e) {
    e.printStackTrace();
}
but whenever I add them to the list this error comes up:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createChunk(Unknown Source)
at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.ensureCapacity(Unknown Source)
at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createNode(Unknown Source)
at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.createDeferredTextNode(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractDOMParser.characters(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
at com.test.parser.Parser.getDocs(Parser.java:146)
at com.test.parser.Parser.main(Parser.java:50)
Commenting out docList.add(doc) makes the exception go away. Any idea why this is happening?
EDIT: I added -Xmx1024M to VMArguments in Run Configurations and it worked.

Commenting out docList.add(doc) makes the exception go away. Any idea why this is happening?
It's simple: without storing the doc reference in docList, the doc variable is overwritten with a new object on each iteration (Document doc = builder.parse(fileList.get(i));), so the Document from the previous iteration becomes an orphan, an object with no references to it. The JVM garbage collector can reclaim such objects, so typically only a couple of Document objects need to be alive on the heap at any point during the loop.
But, with docList.add(doc) active, you will still have references to all doc objects created in loop: exactly fileList.size() instances. They aren't collected (and removed from heap) by garbage collector, because docList will have valid, active references to them.
How to avoid OutOfMemoryError? Just parse / process document one by one, after destroying DOM object of previous doc, or consider using streaming parser, for example SAXParser.
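A minimal sketch of the one-at-a-time approach, assuming that only a small piece of data is needed from each file (here the root element name, chosen purely for illustration). The class and method names are hypothetical and not part of the original code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public final class OneAtATimeParser {

    /** Parses each input in turn and keeps only the root element name, not the DOM tree. */
    public static List<String> rootElementNames(List<? extends InputStream> inputs) {
        List<String> names = new ArrayList<>();
        try {
            // One builder is enough; it can be reused for successive parse() calls.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            for (InputStream in : inputs) {
                Document doc = builder.parse(in);
                // Copy out the small result; the String does not keep the tree alive.
                names.add(doc.getDocumentElement().getTagName());
                // doc is not stored anywhere, so the whole tree becomes collectable here.
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return names;
    }
}
```

The list that survives the loop holds short strings rather than DOM trees, so its memory use grows with the number of files, not with their total size.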

Right-click the project folder.
Click Run As -> Run Configurations -> Arguments tab, and under VM arguments add (the option is case-sensitive, with a capital X):
-Xmx512M
or, for a larger heap,
-Xmx2048M
Click Apply, then Run.
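To check whether the VM argument was actually picked up, a small sketch like the following can print the heap limit the JVM is running with. The class name HeapInfo is hypothetical:

```java
public final class HeapInfo {

    /** Returns the maximum heap size the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // With -Xmx1024M this should report a value close to 1024.
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

The reported value can be slightly below the -Xmx setting, depending on the garbage collector in use.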

Related

Exception while trying to load openNLP POS models

I have been trying to use POS Models for POS tagging, but while loading the Models I get the following exception, and this happens for both maxent as well as perceptron models:
java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at java.io.DataInputStream.readDouble(Unknown Source)
at opennlp.model.BinaryFileDataReader.readDouble(BinaryFileDataReader.java:53)
at opennlp.model.AbstractModelReader.readDouble(AbstractModelReader.java:75)
at opennlp.model.AbstractModelReader.getParameters(AbstractModelReader.java:146)
at opennlp.perceptron.PerceptronModelReader.constructModel(PerceptronModelReader.java:69)
at opennlp.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:231)
at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:190)
at opennlp.tools.postag.POSModel.<init>(POSModel.java:86)
at nlpcheck.NlpPOC.POSTag(NlpPOC.java:54)
at nlpcheck.NlpPOC.main(NlpPOC.java:86)
I have tried loading the tokenization model (en-token.bin), and it loads and works fine.
Following is java snippet that I am using to load Model:
InputStream is = new FileInputStream(MODEL_PATH);
POSModel model = new POSModel(is);
I have downloaded the models (en-pos-perceptron.bin, en-pos-maxent.bin) from http://www.opennlp.org/models-1.5/.
It turns out the model files hosted on the site mentioned above were corrupt. I was trying a different tool, GATE (General Architecture for Text Engineering), which uses the same model files, so I copied its model files onto my build path and it worked.
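The stack trace shows the model being read through java.util.zip.ZipInputStream, so a truncated download can be detected before handing the file to OpenNLP. The following is a rough sketch using only the JDK; the class and method names are hypothetical, and it only checks that the archive can be read to the end, not that it is a valid model:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

public final class ModelFileCheck {

    /**
     * Returns true if the stream contains at least one zip entry and every
     * entry can be decompressed to its end without an I/O or format error.
     */
    public static boolean isReadableZip(InputStream in) {
        byte[] buffer = new byte[8192];
        int entries = 0;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                entries++;
                // Drain the entry; a truncated file fails here with an EOFException.
                while (zip.read(buffer) != -1) {
                    // nothing to do, we only care that reading succeeds
                }
            }
        } catch (IOException e) {
            return false;
        }
        return entries > 0;
    }
}
```

It could be called as ModelFileCheck.isReadableZip(new FileInputStream(MODEL_PATH)) before constructing the POSModel.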

What causes an InternalError to be thrown by sun.awt.shell.Win32ShellFolder2.initSpecial()?

Some of our Windows users get this stack trace shortly after launching our app:
java.lang.InternalError: Could not bind shell folder to interface
at sun.awt.shell.Win32ShellFolder2.initSpecial(Native Method) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2.access$300(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2$1.call(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2$1.call(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolderManager2$ComInvoker.invoke(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.ShellFolder.invoke(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2.<init>(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolderManager2.getNetwork(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2.getFileSystemPath(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2.access$400(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2$10.call(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolder2$10.call(Unknown Source) ~[na:1.7.0_25]
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) ~[na:1.7.0_25]
at java.util.concurrent.FutureTask.run(Unknown Source) ~[na:1.7.0_25]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[na:1.7.0_25]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[na:1.7.0_25]
at sun.awt.shell.Win32ShellFolderManager2$ComInvoker$3.run(Unknown Source) ~[na:1.7.0_25]
at java.lang.Thread.run(Unknown Source) ~[na:1.7.0_25]
Observations:
Every frame in the stack trace is something in the JDK, not in our code.
This happens only on Windows, but we've had reports of it on both Vista and Windows 7.
This happens with various versions of Java: 1.6.0_19, 1.6.0_21, 1.7.0_11, 1.7.0_25.
The problem happens to only a handful of our users, but is 100% repeatable for those users. Unfortunately, we haven't been able to see anything which those users' systems have in common other than exhibiting this issue, and none of our developers has ever experienced it themselves.
My search of Oracle's bug database turned up no bugs with the same or a similar stack trace.
There seem to be a lot of posts on the net about this particular issue without anyone having any idea what causes it.
I'm not holding out any hope of Oracle fixing whatever the problem is, if it is indeed a JDK bug---but if we knew what triggered the bug, we could at least help our users afflicted by it. Can anyone shed light on what causes this to happen?
Edit: The relevant native function is this one, from ShellFolder2.cpp:
JNIEXPORT void JNICALL Java_sun_awt_shell_Win32ShellFolder2_initSpecial
    (JNIEnv* env, jobject folder, jlong desktopIShellFolder, jint folderType)
{
    // Get desktop IShellFolder interface
    IShellFolder* pDesktop = (IShellFolder*)desktopIShellFolder;
    if (pDesktop == NULL) {
        JNU_ThrowInternalError(env, "Desktop shell folder missing");
        return;
    }
    // Get special folder relative PIDL
    LPITEMIDLIST relPIDL;
    HRESULT res = fn_SHGetSpecialFolderLocation(NULL, folderType,
        &relPIDL);
    if (res != S_OK) {
        JNU_ThrowIOException(env, "Could not get shell folder ID list");
        return;
    }
    // Set field ID for relative PIDL
    env->CallVoidMethod(folder, MID_relativePIDL, (jlong)relPIDL);
    // Get special folder IShellFolder interface
    IShellFolder* pFolder;
    res = pDesktop->BindToObject(relPIDL, NULL, IID_IShellFolder,
        (void**)&pFolder);
    if (res != S_OK) {
        JNU_ThrowInternalError(env,
            "Could not bind shell folder to interface");
        return;
    }
    // Set field ID for pIShellFolder
    env->CallVoidMethod(folder, MID_pIShellFolder, (jlong)pFolder);
}
In order to reach the "Could not bind" exception, it looks like pDesktop != NULL and relPIDL is retrieved successfully, but pDesktop->BindToObject() returns something other than S_OK. pDesktop is an IShellFolder*, an interface declared in the Windows shell headers (available via <shlobj.h>). Aggravatingly, Java throws away the error code returned by IShellFolder::BindToObject.
So, I guess the question reduces to: What can cause IShellFolder::BindToObject to fail?
Edit 2: Since Win32ShellFolderManager2.getNetwork() is what's calling the Win32ShellFolder2 constructor at Win32ShellFolderManager2.java:181, we can see that the last argument to Win32ShellFolder2.initSpecial must be Win32ShellFolder2.NETWORK. So, is something wrong with the user's Network Neighborhood folder, perhaps?
Well, there are several reports similar to yours (like this one from jEdit, this one from codenameone and this one, in German, which looks like a JFileChooser bug with a stack trace very close to yours on Windows 7 and Java 6). All seem related to JFileChooser and/or file browsing in one way or another.
So I would approach this in one of two ways:
Either go down the time-consuming, non-speculative road and take dumps of the affected installations with tools such as jstack, VisualVM or JConsole until you are able to isolate the root cause of the problem (which may or may not be the JFileChooser). If you choose that path, remember that either remote access to one of the affected installations or help from a technically savvy user is a must.
Or try to take a shortcut: assume that the problem is indeed the JFileChooser (as anecdotal evidence suggests) and release a custom version that replaces the JFileChooser with FileDialog. If it runs as expected on the affected machines, case closed; otherwise give yourself a pat on the back for trying and take "The Long and Winding Road".
Had same problem and the answer was
java.awt.FileDialog
If you look up VbScript solutions for "Open File Dialog", it seems that there is no COM class doing the job for most of the windows platforms
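For reference, a minimal sketch of what an open-file prompt based on java.awt.FileDialog might look like. The class name NativeFileChooser and its methods are hypothetical; whether this avoids the InternalError on an affected machine would still need to be verified there:

```java
import java.awt.FileDialog;
import java.awt.Frame;
import java.io.File;

public final class NativeFileChooser {

    /** Combines the dialog's directory and file name; null means the user cancelled. */
    public static File toFile(String directory, String name) {
        return name == null ? null : new File(directory, name);
    }

    /** Shows the platform's native open dialog and returns the selection, or null. */
    public static File chooseFileToOpen(Frame parent, String title) {
        FileDialog dialog = new FileDialog(parent, title, FileDialog.LOAD);
        dialog.setVisible(true); // modal: blocks until the user closes the dialog
        try {
            return toFile(dialog.getDirectory(), dialog.getFile());
        } finally {
            dialog.dispose();
        }
    }
}
```

A call such as NativeFileChooser.chooseFileToOpen(mainFrame, "Open"), where mainFrame is the application's own Frame, would stand in for JFileChooser.showOpenDialog.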

how to solve this java.lang.ExceptionInInitializerError in java code?

An error occurs on the line below and I do not understand what it means. I am also using a library, qoppa.jar. How can I solve this issue? Can anyone help me?
m_LoadedDoc = new PDFDocument(new FilePDFSource((String) path[0]), PDFViewer.this);
java.lang.ExceptionInInitializerError
at com.qoppa.android.pdfProcess.PDFDocument$1.b(Unknown Source)
at com.qoppa.android.pdfViewer.e.p.b(Unknown Source)
at com.qoppa.android.pdfProcess.PDFDocument.b(Unknown Source)
at com.qoppa.android.pdfProcess.PDFDocument.<init>(Unknown Source)
at com.qoppa.android.pdfProcess.PDFDocument.<init>(Unknown Source)
at com.pdfplugin.PDFViewer$LoadDocument.doInBackground(PDFViewer.java:469)
There is a magic line that needs to be called before reading the document:
// Magic: Register asset manager for font loading.
StandardFontTF.mAssetMgr = getContext().getAssets();
// Now you can read the document.
PDFDocument doc = new PDFDocument(new FilePDFSource(path), PDFViewer.this);
You might also need to include the assets/fonts and assets/cmaps directories from the sample project.

Why is WSDL parser still importing external documents?

I tried to turn off importing documents in WSDL4J (1.6.2) in the way suggested
by the API documentation:
wsdlReader.setFeature("javax.wsdl.importDocuments", false);
In fact, it stops importing XML schema files declared with the wsdl:import tag, but does not stop importing files declared with xs:import tags.
The following code snippet [see at the end of the letter] for the example file
http://www.ibspan.waw.pl/~gawinec/example.wsdl
returns the following exception:
javax.wsdl.WSDLException: WSDLException (at /definitions/types/xs:schema):
faultCode=OTHER_ERROR: An error occurred trying to resolve schema referenced
at 'EchoExceptions.xsd', relative to
'http://www.ibspan.waw.pl/~gawinec/example.wsdl'.:
java.io.FileNotFoundException: This file was not found:
http://www.ibspan.waw.pl/~gawinec/EchoExceptions.xsd
at com.ibm.wsdl.xml.WSDLReaderImpl.parseSchema(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.parseSchema(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.parseTypes(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.parseDefinitions(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at IsolatedExample.main(IsolatedExample.java:15)
Caused by: java.io.FileNotFoundException: This file was not found:
http://www.ibspan.waw.pl/~gawinec/EchoExceptions.xsd
at com.ibm.wsdl.util.StringUtils.getContentAsInputStream(Unknown Source)
... 10 more
Can you suggest any solution to this problem? I just don't want to import
external XML schemata.
Regards,
Maciej
import javax.wsdl.WSDLException;
import javax.wsdl.factory.WSDLFactory;
import javax.wsdl.xml.WSDLReader;

public class IsolatedExample {
    public static void main(String[] args) {
        WSDLFactory wsdlFactory;
        try {
            wsdlFactory = WSDLFactory.newInstance();
            WSDLReader wsdlReader = wsdlFactory.newWSDLReader();
            wsdlReader.setFeature("javax.wsdl.verbose", false);
            wsdlReader.setFeature("javax.wsdl.importDocuments", false);
            wsdlReader.readWSDL("http://www.ibspan.waw.pl/~gawinec/example.wsdl");
        } catch (WSDLException e) {
            e.printStackTrace();
        }
    }
}
A quick look at WSDL4J (it's been a while since I've worked directly with this project) suggests that there is no option specifically to prevent the reading of imported schemas. You may have stumbled upon a bug in WSDL4J's mechanism for deserializing schemas. That said, if you're not interested in the contents of any schemas, including those inlined in the WSDL document, you can register your own extension registry (simply modify the PopulatedExtensionRegistry class to leave out the SchemaDeserializer).
Specifically, leave out the following lines:
mapExtensionTypes(Types.class, SchemaConstants.Q_ELEM_XSD_1999,
SchemaImpl.class);
registerDeserializer(Types.class, SchemaConstants.Q_ELEM_XSD_1999,
new SchemaDeserializer());
registerSerializer(Types.class, SchemaConstants.Q_ELEM_XSD_1999,
new SchemaSerializer());
mapExtensionTypes(Types.class, SchemaConstants.Q_ELEM_XSD_2000,
SchemaImpl.class);
registerDeserializer(Types.class, SchemaConstants.Q_ELEM_XSD_2000,
new SchemaDeserializer());
registerSerializer(Types.class, SchemaConstants.Q_ELEM_XSD_2000,
new SchemaSerializer());
mapExtensionTypes(Types.class, SchemaConstants.Q_ELEM_XSD_2001,
SchemaImpl.class);
registerDeserializer(Types.class, SchemaConstants.Q_ELEM_XSD_2001,
new SchemaDeserializer());
registerSerializer(Types.class, SchemaConstants.Q_ELEM_XSD_2001,
new SchemaSerializer());
I haven't used Java for webservices, but have you tried setting an absolute path to the schemas you import? Perhaps it's trying to load a local file.
You could also try sniffing the wire to see if you're making a request, perhaps it's malformed.
$0.02

Unmarshaling xml with html entities using JAXB

I need to load Wikipedia revision histories into POJOs, so I'm using JAXB to unmarshal the Wikipedia data dump (well, individual pages of it). The problem is that the text nodes occasionally contain entities that are not defined in the Wikipedia XML dump, e.g. &deg; (°). Please keep in mind that I do not know the complete set of entities that I need to be able to read. My input file is 3 TB, so let's just assume that everything HTML can render is in there.
How can I configure JAXB to handle entities that are not valid xml?
Here is the SAX Exception that JAXB throws when it encounters an undefined entity:
Exception in thread "main" javax.xml.bind.UnmarshalException
- with linked exception:
[org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.]
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:481)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:199)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:168)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:137)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:184)
at com.stottlerhenke.tools.wikiparse.WikipediaIO.readPage(WikipediaIO.java:73)
at com.stottlerhenke.tools.wikiparse.WikipediaIO.main(WikipediaIO.java:53)
Caused by: org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:195)
Edit: The input that triggered that exception is the complete revision history for the wikipedia article on the Arctic Circle. The XSD used to generate the JAXB classes is here: http://www.mediawiki.org/xml/export-0.3.xsd
Edit: The source of this problem was an error on my part -- I was using an initial extractor that did not maintain encoded entities properly. However, I did find a way around this, should anyone have the problem I thought I had. See below.
Resolving entities is not JAXB's job. It's the job of the underlying
XML parser.
What you could do is:
read the data yourself using DOM
replace all unresolved entities by something you wish
then, let JAXB handle the result
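One way the "replace unresolved entities" step might look is sketched below. Note that this swaps in a plain text rewrite before any XML parsing rather than a DOM pass, because a parser would already reject the undeclared entity. The class name, the method, and the entity table passed in are all assumptions for illustration, and the sketch does not treat CDATA sections specially:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class EntityRewriter {

    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // The five entities every XML parser understands without a DTD.
    private static final Set<String> XML_PREDEFINED =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    /**
     * Rewrites named entities into numeric character references using the given
     * name-to-code-point table. Unknown names are escaped so they survive as text.
     */
    public static String rewrite(String xml, Map<String, Integer> codePoints) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement;
            if (XML_PREDEFINED.contains(name)) {
                replacement = m.group();                       // leave &amp; etc. alone
            } else if (codePoints.containsKey(name)) {
                replacement = "&#" + codePoints.get(name) + ";"; // e.g. &deg; -> &#176;
            } else {
                replacement = "&amp;" + name + ";";            // keep as literal text
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The rewritten text can then be handed to the unmarshaller through a StringReader, one page at a time.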
This is a hack, but it works in a pinch.
I downloaded the html entity definitions from w3.org, and set the doctype of the input xml file to xhtml-transitional, but directed the doctype url to a local dtd:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">
xhtml1-transitional.dtd, in turn, requires:
xhtml-lat1.ent
xhtml-special.ent
xhtml-symbol.ent
which I downloaded and put alongside xhtml1-transitional.dtd
(All files are available at: http://www.w3.org/TR/xhtml1/DTD/ )
Like I said, ugly as hell, but it did seem to do the job.
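The same idea, serving the entity declarations from a local source instead of letting the parser fetch the DTD, can also be expressed with an org.xml.sax.EntityResolver. The sketch below is an illustration under assumptions, not the setup described above: it answers every external DTD request with a single in-memory declaration for deg, where a real setup would return the local xhtml1-transitional.dtd and .ent files. The class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public final class LocalDtdParser {

    // Stand-in for the contents of the locally stored DTD and entity files.
    private static final String LOCAL_ENTITIES = "<!ENTITY deg \"&#176;\">";

    /** Parses the XML text and returns the text content of its root element. */
    public static String textOf(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Answer every external DTD request from the local copy, never the network.
            builder.setEntityResolver((publicId, systemId) ->
                    new InputSource(new StringReader(LOCAL_ENTITIES)));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

The input still needs a DOCTYPE declaration for the parser to ask for a DTD at all, which matches the approach of setting the doctype on the input file.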
