Mallet: java.lang.OutOfMemoryError with 1024GB Memory allocation - java

I am trying to use Mallet to run topic modeling on a ~1GB text file with 11,403,956 rows. From the Mallet directory, I cd to bin and raise the memory allocation to 1024GB:
set MALLET_MEMORY=1024G
I then try to run the command:
bin/mallet import-file --input combined_bios.txt --output dh_size.mallet --keep-sequence --remove-stopwords
However, this throws a memory error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at gnu.trove.TObjectIntHashMap.rehash(TObjectIntHashMap.java:170)
at gnu.trove.THash.postInsertHook(THash.java:359)
at gnu.trove.TObjectIntHashMap.put(TObjectIntHashMap.java:155)
at cc.mallet.types.Alphabet.lookupIndex(Alphabet.java:115)
at cc.mallet.types.Alphabet.lookupIndex(Alphabet.java:123)
at cc.mallet.types.FeatureSequence.add(FeatureSequence.java:131)
at cc.mallet.pipe.TokenSequence2FeatureSequence.pipe(TokenSequence2FeatureSequence.java:44)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:294)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
at cc.mallet.classify.tui.Csv2Vectors.main(Csv2Vectors.java:290)
Is there a workaround for such situations? Any help others can offer would be greatly appreciated!

If you are on Linux or OS X, I think you might be altering the wrong variable. The one you are changing is found in bin/mallet.bat, but you want to change the one in the executable at bin/mallet (i.e. without the .bat file extension):
MEMORY=1g
This is also described under "Issues with Big Data" in this Mallet tutorial:
http://programminghistorian.org/lessons/topic-modeling-and-mallet
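As a concrete sketch (the default value and its surrounding lines vary by Mallet version, and 4g here is just an example), the edit in bin/mallet followed by a re-run of the import would look like this:
MEMORY=4g
bin/mallet import-file --input combined_bios.txt --output dh_size.mallet --keep-sequence --remove-stopwords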

Related

java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main" for a file of more than 4GB

I am using this tool, https://github.com/citygml4j/citygml-tools, and its to-cityjson command to convert a CityGML file to a CityJSON file. The file is 4.36 GB, but I get the following error:
java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread main or
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at jdk.internal.reflect.GeneratedConstructorAccessor183.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at com.sun.xml.bind.v2.ClassFactory.create0(ClassFactory.java:102)
at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.createInstance(ClassBeanInfoImpl.java:255)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.createInstance(UnmarshallingContext.java:672)
at com.sun.xml.bind.v2.runtime.unmarshaller.StructureLoader.startElement(StructureLoader.java:158)
at com.sun.xml.bind.v2.runtime.unmarshaller.ProxyLoader.startElement(ProxyLoader.java:30)
at com.sun.xml.bind.v2.runtime.ElementBeanInfoImpl$IntercepterLoader.startElement(ElementBeanInfoImpl.java:223)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext._startElement(UnmarshallingContext.java:547)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallingContext.startElement(UnmarshallingContext.java:526)
at com.sun.xml.bind.v2.runtime.unmarshaller.InterningXmlVisitor.startElement(InterningXmlVisitor.java:45)
at com.sun.xml.bind.v2.runtime.unmarshaller.StAXStreamConnector.handleStartElement(StAXStreamConnector.java:216)
at com.sun.xml.bind.v2.runtime.unmarshaller.StAXStreamConnector.bridge(StAXStreamConnector.java:150)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:385)
at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:356)
at org.citygml4j.builder.jaxb.xml.io.reader.JAXBSimpleReader.nextFeature(JAXBSimpleReader.java:133)
at org.citygml4j.tools.command.ToCityJSONCommand.execute(ToCityJSONCommand.java:133)
at org.citygml4j.tools.CityGMLTools.handleParseResult(CityGMLTools.java:102)
at org.citygml4j.tools.CityGMLTools.handleParseResult(CityGMLTools.java:35)
at picocli.CommandLine.parseWithHandlers(CommandLine.java:1526)
at org.citygml4j.tools.CityGMLTools.main(CityGMLTools.java:44)
I found one solution, which would be to use java -Xmx15G, but I don't know how to implement it.
You can use the JAVA_OPTS or CITYGML_TOOLS_OPTS environment variable, which is read by the citygml-tools executable. Or you can modify the DEFAULT_JVM_OPTS option in the citygml-tools start script:
# Add default JVM options here. You can also use JAVA_OPTS and CITYGML_TOOLS_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xms1G"'
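To raise the maximum heap this way, you would change that line to something like the following (a sketch; keep the nested quoting, which the generated start script expects):
DEFAULT_JVM_OPTS='"-Xmx15G"'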
If you are using Linux you can set in the terminal:
export JAVA_OPTS="-Xmx15G"
citygml-tools <file>
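You can also scope the variable to a single invocation instead of exporting it (assuming a POSIX shell; to-cityjson is the subcommand named in the question):
JAVA_OPTS="-Xmx15G" citygml-tools to-cityjson <file>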
java -Xmx6144M -d64
Run your program from the command line with these options; -Xmx6144M sets the maximum heap to 6144 MB (6 GiB, not 64 GB), and -d64 requests the 64-bit JVM.
Source: Increase heap size in Java
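-Xmx accepts K, M, and G suffixes, so the following two invocations request the same 6 GiB heap (MyApp is a placeholder main class):
java -Xmx6144M MyApp
java -Xmx6G MyApp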
You can try a brute-force approach and assign a very large heap. Grep the build script for the line
defaultJvmOpts = ['-Xms1G']
which grants only 1 GiB of heap space. Also make sure you are using a 64-bit Java and have enough RAM.
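A sketch of the change, assuming a Gradle-built distribution as the option name suggests: swap the 1 GiB initial-heap setting for a larger maximum heap, e.g.
defaultJvmOpts = ['-Xmx15G']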
Another option, if you are using Eclipse, is to raise the heap limit in the run configuration's VM arguments. This worked for me; I hope it helps you.

PySpark: java.lang.OutOfMemoryError: Java heap space

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It is running on only one machine. In my process, I want to collect a huge amount of data, as shown in the code below:
train_dataRDD = (train.map(lambda x:getTagsAndText(x))
.filter(lambda x:x[-1]!=[])
.flatMap(lambda (x,text,tags): [(tag,(x,text)) for tag in tags])
.groupByKey()
.mapValues(list))
When I do
training_data = train_dataRDD.collectAsMap()
It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses its connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.
It looks like the heap space is too small. How can I set it to a bigger limit?
EDIT:
Things that I tried before running:
sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultSize','0')
I changed the Spark options as per the documentation here (if you do Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html
It says that I can avoid OOMs by setting the spark.executor.memory option. I did the same thing, but it does not seem to be working.
After trying out loads of configuration parameters, I found that only one needs to be changed to enable more heap space: spark.driver.memory. It takes effect only when the driver JVM starts, which is why setting it on an already-running SparkContext (as in the question) has no effect.
sudo vim $SPARK_HOME/conf/spark-defaults.conf
#uncomment the spark.driver.memory and change it according to your use. I changed it to below
spark.driver.memory 15g
# press Esc, then type :wq! to save and exit vim
Close your existing Spark application and re-run it. You will not encounter this error again. :)
If you're looking for the way to set this from within the script or a jupyter notebook, you can do:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.config("spark.driver.memory", "15g") \
.appName('my-cool-app') \
.getOrCreate()
I had the same problem with pyspark (installed with brew). In my case it was installed at the path /usr/local/Cellar/apache-spark.
The only configuration file I had was apache-spark/2.4.0/libexec/python/test_coverage/conf/spark-defaults.conf.
As suggested here, I created the file spark-defaults.conf in the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.
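A sketch of the same fix from the shell (the version segment of the path will differ with your install):
mkdir -p /usr/local/Cellar/apache-spark/2.4.0/libexec/conf
echo "spark.driver.memory 12g" >> /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf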
I got the same error, and I just assigned memory to Spark while creating the session:
spark = SparkSession.builder.master("local[10]").config("spark.driver.memory", "10g").getOrCreate()
or
SparkSession.builder.appName('test').config("spark.driver.memory", "10g").getOrCreate()
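Equivalently, when launching an application with spark-submit, the same setting can be passed on the command line (my_script.py is a placeholder):
spark-submit --driver-memory 10g my_script.py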

I'm facing "Java heap space error" ,when I'm trying to give entire folder as input to Mapreduce Program

I'm facing "Java Heap space error",when I'm trying to run the mapreduce program by giving entire folder as input to the MR Job.When I'm giving a single file as input to MR job,I'm facing no error.The job has run successfully.
Changes I tried in hadoop-env.sh file:
=====================================
I increased the memory size from 1024 MB to 2048 MB:
export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"
Changes in mapred-site.xml:
===========================
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
Even after making changes in these files, I'm still facing the "Java heap space" error. Can anyone please advise me on this issue?
You can turn on HPROF profiling for your job with something like this:
conf.setBoolean("mapred.task.profile", true);   // enable per-task profiling
conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples," +
    "heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");   // heap=sites records allocation sites
conf.set("mapred.task.profile.maps", "0-2");    // profile the first three map task attempts
conf.set("mapred.task.profile.reduces", "0-2"); // and the first three reduce task attempts
This will help you to diagnose what exhausted the heap. See more details in "Hadoop: The Definitive Guide", pages 178-181.
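Separately, if the tasks themselves are exhausting their heap, check that the property name matches your Hadoop version: on Hadoop 2.x and later, the per-task heap is set via the newer property names rather than the older mapred.child.java.opts, e.g. with the same 2048 MB value as above:
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2048m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2048m</value>
</property>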

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space: Parsing dblp.xml

I am using Mac OS X 10.5.8 and Java 1.5. I am trying to parse a big file, dblp.xml. I am following the instructions in this link in order to parse the file using SAX: http://www.informatik.uni-trier.de/~%20ley/db/about/simpleparser/index.html . I run the code from the Mac Terminal. Here are the commands:
javac Parser.java
java -mx900M -DentityExpansionLimit=2500000 Parser dblp.xml > out.txt
Unfortunately, I got the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$ScannedEntity.<init>(XMLEntityManager.java:2437)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:1117)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:905)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:843)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1334)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1756)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:311)
at Parser.<init>(Parser.java:152)
at Parser.main(Parser.java:179)
mohammed-al-refais-macbook:src mohammedal-refai$ export JVM_ARGS="-Xmx1024m -XX:MaxPermSize=256m"
mohammed-al-refais-macbook:src mohammedal-refai$ java -mx900M -DentityExpansionLimit=2500000 Parser dblp.xml > out.txt
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
(same stack trace as above)
Although the instructions in the link indicate that the file can be parsed with Java 1.5 without problems, I am still getting that exception. Could anyone please help me solve this problem? Your assistance would be very much appreciated.
You are not allocating enough memory to the JVM. I suspect you set the parameter wrong: the documented form is -Xmx900M, not -mx900M (on many JVMs -mx survives only as an undocumented legacy alias). The 'X' stands for non-standard options.
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
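For reference, the corrected form of the command from the question is:
java -Xmx900M -DentityExpansionLimit=2500000 Parser dblp.xml > out.txt
If 900 MB is still not enough for the current dblp.xml, raise the value further, e.g. -Xmx1500M.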

tdbloader on Cygwin: Getting FileNotFoundException: d:\cygdrive\d\....\node2id.idn

I am completely new to Jena/TDB. All I want to do is load data from a sample RDF, N3, etc. file using the TDB scripts or the Java API.
I tried to use tdbloader on Cygwin to load data (tdb-0.9.0, on Windows XP with IBM Java 1.6). The following are the commands that I ran:
$ export TDBROOT=/cygdrive/d/Project/Store_DB/jena-tdb-0.9.0-incubating
$ export PATH=$TDBROOT/bin:$PATH
I also changed classpath for java in the tdbloader script as mentioned at tdbloader on Cygwin: java.lang.NoClassDefFoundError :
exec java $JVM_ARGS $SOCKS -cp "PATH_OF_JAR_FILES" "tdb.$TDB_CMD" $TDB_SPEC "$@"
So when I run $ tdbloader --help it shows the help correctly.
But when I run
$ tdbloader --loc /cygdrive/d/Project/Store_DB/data1
OR
$ tdbloader --loc /cygdrive/d/Project/Store_DB/data1 test.rdf
I am getting following exception:
com.hp.hpl.jena.tdb.base.file.FileException: Failed to open: d:\cygdrive\d\Project\Store_DB\data1\node2id.idn (mode=rw)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.open$(ChannelManager.java:83)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.openref$(ChannelManager.java:58)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.acquire(ChannelManager.java:47)
at com.hp.hpl.jena.tdb.base.file.FileBase.<init>(FileBase.java:57)
at com.hp.hpl.jena.tdb.base.file.FileBase.<init>(FileBase.java:46)
at com.hp.hpl.jena.tdb.base.file.FileBase.create(FileBase.java:41)
at com.hp.hpl.jena.tdb.base.file.BlockAccessBase.<init>(BlockAccessBase.java:46)
at com.hp.hpl.jena.tdb.base.block.BlockMgrFactory.createStdFile(BlockMgrFactory.java:98)
at com.hp.hpl.jena.tdb.base.block.BlockMgrFactory.createFile(BlockMgrFactory.java:82)
at com.hp.hpl.jena.tdb.base.block.BlockMgrFactory.create(BlockMgrFactory.java:58)
at com.hp.hpl.jena.tdb.setup.Builder$BlockMgrBuilderStd.buildBlockMgr(Builder.java:196)
at com.hp.hpl.jena.tdb.setup.Builder$RangeIndexBuilderStd.createBPTree(Builder.java:165)
at com.hp.hpl.jena.tdb.setup.Builder$RangeIndexBuilderStd.buildRangeIndex(Builder.java:134)
at com.hp.hpl.jena.tdb.setup.Builder$IndexBuilderStd.buildIndex(Builder.java:112)
at com.hp.hpl.jena.tdb.setup.Builder$NodeTableBuilderStd.buildNodeTable(Builder.java:85)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd$NodeTableBuilderRecorder.buildNodeTable(DatasetBuilderStd.java:389)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd.makeNodeTable(DatasetBuilderStd.java:300)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd._build(DatasetBuilderStd.java:167)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd.build(DatasetBuilderStd.java:157)
at com.hp.hpl.jena.tdb.setup.DatasetBuilderStd.build(DatasetBuilderStd.java:70)
at com.hp.hpl.jena.tdb.StoreConnection.make(StoreConnection.java:132)
at com.hp.hpl.jena.tdb.transaction.DatasetGraphTransaction.<init>(DatasetGraphTransaction.java:46)
at com.hp.hpl.jena.tdb.sys.TDBMakerTxn._create(TDBMakerTxn.java:50)
at com.hp.hpl.jena.tdb.sys.TDBMakerTxn.createDatasetGraph(TDBMakerTxn.java:38)
at com.hp.hpl.jena.tdb.TDBFactory._createDatasetGraph(TDBFactory.java:166)
at com.hp.hpl.jena.tdb.TDBFactory.createDatasetGraph(TDBFactory.java:74)
at com.hp.hpl.jena.tdb.TDBFactory.createDataset(TDBFactory.java:53)
at tdb.cmdline.ModTDBDataset.createDataset(ModTDBDataset.java:95)
at arq.cmdline.ModDataset.getDataset(ModDataset.java:34)
at tdb.cmdline.CmdTDB.getDataset(CmdTDB.java:137)
at tdb.cmdline.CmdTDB.getDatasetGraph(CmdTDB.java:126)
at tdb.cmdline.CmdTDB.getDatasetGraphTDB(CmdTDB.java:131)
at tdb.tdbloader.loadQuads(tdbloader.java:163)
at tdb.tdbloader.exec(tdbloader.java:122)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
at tdb.tdbloader.main(tdbloader.java:53)
Caused by: java.io.FileNotFoundException: d:\cygdrive\d\Project\Store_DB\data1\node2id.idn (The system cannot find the path specified.)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:222)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:107)
at com.hp.hpl.jena.tdb.base.file.ChannelManager.open$(ChannelManager.java:80)
... 37 more
I am not sure what node2id.idn file is and why is it expecting it?
The file node2id.idn is one of TDB's internal index files. It's not something that you have to create or manage yourself. I've just tried tdbloader on Cygwin myself, and it worked OK for me. I can think of two basic possibilities:
your disk is full
the TDB index is corrupted
If this is the first file you are loading into an otherwise empty TDB, the second possibility is unlikely. If you are loading into a non-empty TDB, try deleting the TDB image and starting again. Note that TDB by itself does not manage concurrent writes: if you have more than one process writing to a single TDB image, you must handle locking at the application level, or use TDB's transactions.
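Deleting the image simply means clearing the files out of the --loc directory and re-running the load; a sketch, assuming the directory holds nothing but TDB's own files:
$ rm /cygdrive/d/Project/Store_DB/data1/*
$ tdbloader --loc /cygdrive/d/Project/Store_DB/data1 test.rdf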
The final possibility, of course, is that your disk is flaky. You might want to try your code on another machine.
If none of these suggestions help, please send a complete minimal test case to the Jena users list.
