PySpark: java.lang.OutofMemoryError: Java heap space - java

I have been using PySpark with Ipython lately on my server with 24 CPUs and 32GB RAM. Its running only on one machine. In my process, I want to collect huge amount of data as is give in below code:
train_dataRDD = (train.map(lambda x:getTagsAndText(x))
.filter(lambda x:x[-1]!=[])
.flatMap(lambda (x,text,tags): [(tag,(x,text)) for tag in tags])
.groupByKey()
.mapValues(list))
When I do
training_data = train_dataRDD.collectAsMap()
It gives me outOfMemory Error. Java heap Space. Also, I can not perform any operations on Spark after this error as it looses connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.
It looks like heap space is small. How can I set it to bigger limits?
EDIT:
Things that I tried before running:
sc._conf.set('spark.executor.memory','32g').set('spark.driver.memory','32g').set('spark.driver.maxResultsSize','0')
I changed the spark options as per the documentation here(if you do ctrl-f and search for spark.executor.extraJavaOptions) : http://spark.apache.org/docs/1.2.1/configuration.html
It says that I can avoid OOMs by setting spark.executor.memory option. I did the same thing but it seem not be working.

After trying out loads of configuration parameters, I found that there is only one need to be changed to enable more Heap space and i.e. spark.driver.memory.
sudo vim $SPARK_HOME/conf/spark-defaults.conf
#uncomment the spark.driver.memory and change it according to your use. I changed it to below
spark.driver.memory 15g
# press : and then wq! to exit vim editor
Close your existing spark application and re run it. You will not encounter this error again. :)

If you're looking for the way to set this from within the script or a jupyter notebook, you can do:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master('local[*]') \
.config("spark.driver.memory", "15g") \
.appName('my-cool-app') \
.getOrCreate()

I had the same problem with pyspark (installed with brew). In my case it was installed on the path /usr/local/Cellar/apache-spark.
The only configuration file I had was in apache-spark/2.4.0/libexec/python//test_coverage/conf/spark-defaults.conf.
As suggested here I created the file spark-defaults.conf in the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/spark-defaults.conf and appended to it the line spark.driver.memory 12g.

I got the same error and I just assigned memory to spark while creating session
spark = SparkSession.builder.master("local[10]").config("spark.driver.memory", "10g").getOrCreate()
or
SparkSession.builder.appName('test').config("spark.driver.memory", "10g").getOrCreate()

Related

I've got java.lang.OutOfMemoryError: Java heap space testing ActiveMQ by JMeter on Linux build agent

I run JMeter test for ActiveMQ using Linux build agent I've got java.lang.OutOfMemoryError: Java heap space. Detailed log:
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136
at org.apache.jmeter.protocol.jms.sampler.SubscriberSampler.extractContent(SubscriberSampler.java:282) ~[ApacheJMeter_jms.jar:5.3]
at org.apache.jmeter.protocol.jms.sampler.SubscriberSampler.sample(SubscriberSampler.java:186) ~[ApacheJMeter_jms.jar:5.3
at org.apache.jmeter.protocol.jms.sampler.BaseJMSSampler.sample(BaseJMSSampler.java:98) ~
at org.apache.jmeter.threads.JMeterThread.doSampling(JMeterThread.java:635) ~[ApacheJMeter_core.jar:5.4]
at org.apache.jmeter.threads.JMeterThread.executeSamplePackage(JMeterThread.java:558) ~[ApacheJMeter_core.jar:5.4]
at org.apache.jmeter.threads.JMeterThread.processSampler(JMeterThread.java:489) ~[ApacheJMeter_core.jar:5.4]
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:256) ~[ApacheJMeter_core.jar:5.4]
I've already allocated maximum HEAP memory (-Xmx8g), but it doesn't help. Yet the same test with the same configuration on Windows build agent passed without Out of memory error.
How can it be handled? Maybe some configuration should be done for Linux machine?
Are you sure your Heap setting gets applied on Linux?
You can check it my creating a simple test plan with single JSR223 Sampler using the following code:
println('Max heap size: ' + Runtime.getRuntime().maxMemory() / 1024 / 1024 + ' megabytes')
and when you run JMeter in command-line non-GUI mode you will see the current maximum JVM heap size printed:
In order to make the change permanent amend this line in jmeter startup script according to your requirements.
The issue was resolved after updating Java to 11 version on Linux machines.

Java connection via Globals API causes a StackOverflowError

I am trying to connect a Java application to an InterSystems Caché database via Globals API.
import com.intersys.globals.*;
public class Assignment {
public static void main(String[] args) {
final String user = "Andrew";
final String password = "Tobilko";
Connection connection = ConnectionContext.getConnection();
connection.connect("USER", user, password);
}
}
The stacktrace:
Exception in thread "main" java.lang.StackOverflowError
at com.intersys.globals.internal.GlobalsConnectionJNI.connectImpl(Native Method)
at com.intersys.globals.internal.GlobalsConnectionJNI.connect(GlobalsConnectionJNI.java:107)
at com.tobilko.a3.Assignment.main(Assignment.java:12)
The credentials and the namespace are correct.
The Cache instance has been initialised correctly and by instruction.
All the global environment variables including GLOBALS_HOME and DYLD_LIBRARY_PATH have been set.
The following libraries have been soft-linked:
ln -s $GLOBALS_HOME/bin/libisccache.dylib /usr/local/lib
ln -s $GLOBALS_HOME/bin/liblcbjni.dylib /usr/local/lib
ln -s $GLOBALS_HOME/bin/liblcbindnt.dylib /usr/local/lib
ln -s $GLOBALS_HOME/bin/liblcbclientnt.dylib /usr/local/lib
ln -s $GLOBALS_HOME/bin/libmdsjni.dylib /usr/local/lib
-Djava.library.path=/usr/local/lib has been specified.
The jars have been included.
These steps led me to a StackOverflowError exception.
I have no idea where I could have made a mistake.
Any help would be appreciated.
Andrew, I'm not so familiar with GlobalsAPI. But, I did some research and found that this GlobalsAPI was in previous versions of Java CacheExtreme library cacheextreme.jar, in Caché lib folder. In version which you tried to use, GlobalsAPI already disappeared, and only Event Persistent still there. And with IRIS this old library will disappear at all. And in IRIS documentation nothing more about GlobalsAPI. I think it would be better if you ask about GlobalsAPI future on the Developer Community portal.
I skipped the Window configuration part because it isn't my OS.
Apparently, the next configuration is required for all systems:
Configuration for Windows
The default stack size of the Java Virtual Machine on Windows is too small for running eXTreme applications (running them with the
default stack size causes Java to report EXCEPTION_STACK_OVERFLOW). To
optimize performance, heap size should also be increased.
To
temporarily modify the stack size and heap size when running an
eXTreme application, add the following command line arguments:
-Xss1024k -Xms2500m -Xmx2500m
Increasing the stack size has resolved the issue.

Use docker-machine create from java

I have an application that (I want to) uses Java to start and stop Docker containers. It seems that the way to do this is using docker-machine create, which works fine when I test from the command line.
However, when running using Commons-Exec from Java I get the error:
(aa4567c1-058f-46ae-9e97-56fb8b45211c) Creating SSH key...
Error creating machine: Error in driver during machine creation: /usr/local/bin/VBoxManage modifyvm aa4567c1-058f-46ae-9e97-56fb8b45211c --firmware bios --bioslogofadein off --bioslogofadeout off --bioslogodisplaytime 0 --biosbootmenu disabled --ostype Linux26_64 --cpus 1 --memory 1024 --acpi on --ioapic on --rtcuseutc on --natdnshostresolver1 off --natdnsproxy1 on --cpuhotplug off --pae on --hpet on --hwvirtex on --nestedpaging on --largepages on --vtxvpid on --accelerate3d off --boot1 dvd failed:
VBoxManage: error: Could not find a registered machine with UUID {aa4567c1-058f-46ae-9e97-56fb8b45211c}
VBoxManage: error: Details: code VBOX_E_OBJECT_NOT_FOUND (0x80bb0001), component VirtualBoxWrap, interface IVirtualBox, callee nsISupports
VBoxManage: error: Context: "FindMachine(Bstr(a->argv[0]).raw(), machine.asOutParam())" at line 500 of file VBoxManageModifyVM.cpp
I have set my VBOX_USER_HOME variable in an initializationScript that I'm using to start the machine:
export WORKERID=$1
export VBOX_USER_HOME=/Users/me/Library/VirtualBox
# create the machine
docker-machine create $WORKERID && \ # create the worker using docker-machine
eval $(docker-machine env $WORKERID) && \ # load the env of the newly created machine
docker run -d myimage
And I'm executing this from Java via the Commons Exec CommandLine class:
CommandLine cmdline = new CommandLine("/bin/sh");
cmdline.addArgument(initializeWorkerScript.getAbsolutePath());
cmdline.addArgument("test");
Executor executor = new DefaultExecutor();
If there is another library that can interface with docker-machine from Java I'm happy to use that, or to change out Commons Exec if that's the issue (though I don't understand why). The basic requirement is that I have some way to get docker-machine to create a machine using Java and then later to be able to use docker-machine to stop that machine.
As it turns out the example that I posted should work, the issue that I was having is that I was provisioning machines with a UUID name. That name contained dash (-) characters which apparently break VBoxManage. This might be because of some kind of path problem but I'm just speculating. When I changed my UUID to have dot (.) instead of dash it loaded and started the machine just fine.
I'm happy to remove this post if the moderators want, but will leave it up here in case people are looking for solutions to problems with docker-machine create naming issues.

CentOS Java application error with SELinux

I have a CentOS box hosting a Drupal 7 site. I've attempted to run a Java application called Tika on it, to index files using Apache Solr search.
I keep running into an issue only when SELinux is enabled:
extract using tika: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f1ed9000000, 2555904, 1) failed; error='Permission denied' (errno=13)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2555904 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-2356/hs_error.log
This does not happen if I disable selinux. If I run the command from SSH, it works fine -- but not in browser. This is the command it is running:
java '-Dfile.encoding=UTF8' -cp '/var/www/drupal/sites/all/modules/contrib/apachesolr_attachments/tika' -jar '/var/www/drupal/sites/all/modules/contrib/apachesolr_attachments/tika/tika-app-1.11.jar' -t '/var/www/drupal/sites/all/modules/contrib/apachesolr_attachments/tests/test-tika.pdf'
Here is the log from SELinux at /var/log/audit/audit.log:
type=AVC msg=audit(1454636072.494:3351): avc: denied { execmem } for pid=11285 comm="java" scontext=unconfined_u:system_r:httpd_t:s0 tcontext=unconfined_u:system_r:httpd_t:s0 tclass=process
type=SYSCALL msg=audit(1454636072.494:3351): arch=c000003e syscall=9 success=no exit=-13 a0=7fdfe5000000 a1=270000 a2=7 a3=32 items=0 ppid=2377 pid=11285 auid=506 uid=48 gid=48 euid=48 suid=48 fsuid=48 egid=48 sgid=48 fsgid=48 tty=(none) ses=1 comm="java" exe="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.95.x86_64/jre/bin/java" subj=unconfined_u:system_r:httpd_t:s0 key=(null)
Is there a way I can run this with SELinux enabled? I do not know the policy name of Tika (or should I use Java?) so I'm unsure where to go from here...
This worked for me...
I have tika at /var/apache-tika/tika-app-1.14.jar
setsebool -P httpd_execmem 1
chcon -t httpd_exec_t /var/apache-tika/tika-app-1.14.jar
Using the sealert tools (https://wiki.centos.org/HowTos/SELinux) helped track down the correct selinux type.
All of your context messages reference httpd_t, so I would run
/usr/sbin/getsebool -a | grep httpd
And experiment with enabling properties that show as off. It's been a while since I ran a database-backed website (Drupal, WordPress, etc.) on CentOS, but as I recall, these two were required to be enabled:
httpd_can_network_connect
httpd_can_network_connect_db
to enable a property with persistence, run
setsebool -P httpd_can_network_connect on
etc.
The booleans you're looking for are:
httpd_execmem
httpd_read_user_content
How to find:
audit2why -i /var/log/audit/audit.log will tell you this.
Part of package: policycoreutils-python-utils

Mallet: java.lang.OutOfMemoryError with 1024GB Memory allocation

I am trying to use Mallet to run topic modeling on a ~1GB text file, with 11403956 rows. From the mallet directory, I cd to bin and upgrade the memory requirement to 1024GB:
set MALLET_MEMORY=1024G
I then try to run the command:
bin/mallet import-file --input combined_bios.txt --output dh_size.mallet --keep-sequence --remove-stopwords
However, this throws a memory error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at gnu.trove.TObjectIntHashMap.rehash(TObjectIntHashMap.java:170)
at gnu.trove.THash.postInsertHook(THash.java:359)
at gnu.trove.TObjectIntHashMap.put(TObjectIntHashMap.java:155)
at cc.mallet.types.Alphabet.lookupIndex(Alphabet.java:115)
at cc.mallet.types.Alphabet.lookupIndex(Alphabet.java:123)
at cc.mallet.types.FeatureSequence.add(FeatureSequence.java:131)
at cc.mallet.pipe.TokenSequence2FeatureSequence.pipe(TokenSequence2FeatureSequence.java:44)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:294)
at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282)
at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
at cc.mallet.classify.tui.Csv2Vectors.main(Csv2Vectors.java:290)
Is there a workaround for such situations? Any help others can offer would be greatly appreciated!
If you are on Linux or OS X, I think you might be altering the wrong variable. The one you are changing is found in bin/mallet.bat, but you want to change the one in the executable at bin/mallet (i.e. without the .bat file extension):
MEMORY=1g
This is also described under "Issues with Big Data" in this Mallet tutorial:
http://programminghistorian.org/lessons/topic-modeling-and-mallet

Categories

Resources