I'm getting an out-of-memory exception due to lack of Java heap space when I try to download tweets using Flume and pipe them into Hadoop.
I have currently set the heap space to 4GB in Hadoop's mapred-site.xml, like so:
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx4096m</value>
</property>
I am hoping to download tweets continually for two days but can't get past 45 minutes without errors.
Since I do have the disk space to hold all of this, I am assuming the error is coming from Java having to handle so many things at once. Is there a way for me to slow down the speed at which these tweets are downloaded, or do something else to solve this problem?
Edit: flume.conf included
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <required>
TwitterAgent.sources.Twitter.consumerSecret = <required>
TwitterAgent.sources.Twitter.accessToken = <required>
TwitterAgent.sources.Twitter.accessTokenSecret = <required>
TwitterAgent.sources.Twitter.keywords = manchester united, man united, man utd, man u
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Edit 2
I've tried increasing the memory to 8GB which still doesn't help. I am assuming I am placing too many tweets in Hadoop at once and need to write them to disk and release the space again (or something to that effect). Is there a guide anywhere on how to do this?
Set the JAVA_OPTS value in flume-env.sh and start the Flume agent.
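For example (the sizes here are only a guess; tune them to your machine), a line like the following in conf/flume-env.sh raises the agent's heap:
# example heap settings for the Flume agent; adjust to your hardware
export JAVA_OPTS="-Xms1024m -Xmx4096m"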
It appears the problem had to do with the batch size and transactionCapacity: the sink's hdfs.batchSize (1000) was larger than the channel's transactionCapacity (100). I changed them to the following:
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
This works without me even needing to change the JAVA_OPTS value.
Related
I want to use this function (here is the code on GitHub) to split my dataset into two parts: 90% as the training set (for example) and 10% (the rest) as the test set. I tried this code:
library(XLConnect)
library(readxl)
library(xlsx)
ybi <- read_excel("D:/ii.xls")
#View(ybi)
test = stratified(ybi, 8, .1)   # stratified() comes from the linked GitHub code; sample 10% within column 8 (the class)
no = test$ID_unit               # indices of the test-set rows
train = ybi[-no, ]              # drop those rows to keep the remaining 90% as training data
write.xlsx(train, "D:/mm.xlsx", sheetName = "Newdata")
In fact my data has 8 attributes and 65,534 rows.
With the code above I selected just 10% based on the eighth attribute, which is the class. It gives me the test set without any problem, but not the training data; the error is shown in the attached figure.
How can I fix it?
It looks like your JVM does not have enough memory allocated for the heap.
As a quick fix, export the system variable _JAVA_OPTIONS:
export _JAVA_OPTIONS="-Xmx8G -Xms1G -Xcheck:jni"
you can also use:
options(java.parameters = "-Xmx8G")
and set -Xmx to a value that will make R happy.
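For instance, a minimal sketch assuming the heap is being exhausted by the rJava-backed xlsx package; note that the option only takes effect if it is set before any rJava-based package is loaded:
options(java.parameters = "-Xmx8g")   # must come before rJava is loaded
library(xlsx)                         # xlsx attaches rJava with the larger heap
write.xlsx(train, "D:/mm.xlsx", sheetName = "Newdata")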
I am using the Stanford POS tagger toolkit to tag a list of words from academic papers. Here is the code for this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize; they come from normal academic papers, so usually there are several thousand words (mostly 3000 - 4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set with 270 files, but when the number of files gets bigger, the program gives this error (the Java heap space is 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after execution; it happens after some time of running. I really don't know the reason. Is it because my 3000 - 4000 words are too many? Thank you very much for your help! (Sorry for the bad formatting, the error output is too long.)
Here is my solution, after I faced the same error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf the docs, default is 1000:
def __init__(self, [...], java_options='-mx1000m')
In order to test whether it is a problem with the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging.
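For example, a rough sketch of that idea (the model/jar paths and input file name here are placeholders; the heap option mirrors the one in the question):
import nltk
from nltk.tag.stanford import StanfordPOSTagger

# hypothetical paths -- point these at your own model and jar
st = StanfordPOSTagger('models/english-bidirectional-distsim.tagger',
                       'stanford-postagger.jar',
                       encoding='utf8', java_options='-mx2048m')

with open('paper.txt', encoding='utf8') as f:
    text = f.read()

for sentence in nltk.sent_tokenize(text):   # Punkt-based sentence splitting
    tokens = nltk.word_tokenize(sentence)
    print(st.tag(tokens))                   # print right after tagging, so a failure pinpoints the sentence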
I am facing an error while executing R code through Java using Rengine. I don't know how to set the max memory in Java code using Rengine. Is there a way to set max memory for Rengine? Maybe like below:
Rengine re = getREngine();
re.eval("options(java.parameters = '-Xmx10g')", false);
I know this is the wrong syntax. Kindly guide me on how to set this.
Have you seen this: http://www.bramschoenmakers.nl/en/node/726
> options( java.parameters = "-Xmx4g" )
> library( "rJava" )
but I think this option sets the max memory for Java when Java is executed from R. However, you want it the other way around: R is executed from Java here, so maybe you should extend R's memory instead: memory.limit(memory.limit()*2)
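One more note: as far as I know, the maximum heap of a running JVM cannot be changed after startup, so if it really is the Java heap of the process hosting Rengine that fills up, the -Xmx flag has to be passed when that JVM is launched, for example (jar and class names are placeholders):
java -Xmx10g -cp your-app.jar your.MainClass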
This is my Java code, in which I am querying data from Hive using Apache Spark SQL.
JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("LoadData").setMaster("MasterUrl"));
HiveContext sqlContext = new HiveContext(ctx.sc());
List<Row> result = sqlContext.sql("Select * from Tablename").collectAsList();
When I run this code it throws java.lang.OutOfMemoryError: GC overhead limit exceeded. How do I solve this, or how do I increase the memory in the Spark configuration?
If you are using the spark-shell to run it, then you can use the --driver-memory flag to bump the memory limit:
spark-shell --driver-memory Xg [other options]
If the executors are having problems then you can adjust their memory limits with --executor-memory XG
You can find more info on how exactly to set them in the guides: submission for executor memory, configuration for driver memory.
#Edit: since you are running it from NetBeans, you should be able to pass them as JVM arguments -Dspark.driver.memory=XG and -Dspark.executor.memory=XG. I think it was in Project Properties under Run.
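If you prefer to set it in code, the executor memory can also go on the SparkConf before the context is created; spark.driver.memory, on the other hand, only helps if it is supplied before the driver JVM starts (e.g. via the JVM arguments above), because a running JVM's heap cannot be enlarged. A rough sketch along the lines of the code in the question (the 4g value is just an example):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

SparkConf conf = new SparkConf()
        .setAppName("LoadData")
        .setMaster("MasterUrl")                 // placeholder master URL, as in the question
        .set("spark.executor.memory", "4g");    // executor heap; pick a size your cluster can provide
JavaSparkContext ctx = new JavaSparkContext(conf);
HiveContext sqlContext = new HiveContext(ctx.sc());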
Have you found any solution to your issue yet?
Please share it if you have :D
Here is my idea: RDD (and also JavaRDD) has a method toLocalIterator(); the Spark documentation says that
The iterator will consume as much memory as the largest partition in
this RDD.
It means the iterator will consume less memory than a List if the RDD is divided into many partitions. You can try it like this:
Iterator<Row> iter = sqlContext.sql("Select * from Tablename").javaRDD().toLocalIterator();
while (iter.hasNext()) {
    Row row = iter.next();
    // your code here
}
PS: it's just an idea and I haven't tested it yet.
I request your kind help in solving the "Java command failed" error, which keeps being thrown whenever I try to tag an Arabic corpus of 2 megabytes. I have searched the web and the Stanford POS tagger mailing list, but I did not find a solution. I read some posts on similar problems, and it was suggested that the memory runs out. I am not sure of that; I still have 19GB of free memory. I tried every possible solution offered, but the same error keeps showing.
I have average command of Python and good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, NLTK alpha and the Stanford POS tagger model for Arabic. This is my code:
import nltk
from nltk.tag.stanford import POSTagger
arabic_postagger = POSTagger("/home/mohammed/postagger/models/arabic.tagger", "/home/mohammed/postagger/stanford-postagger.jar", encoding='utf-8')
print("Executing tag_corpus.py...\n")
# Import corpus file
print("Importing data...\n")
file = open("test.txt", 'r', encoding='utf-8').read()
text = file.strip()
print("Tagging the corpus. Please wait...\n")
tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))
If the corpus size is less than 1MB (= 100,000 words), there will be no error. But when I try to tag the 2MB corpus, the following error message is shown:
Traceback (most recent call last):
File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
return self.batch_tag([tokens])[0]
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
raise OSError('Java command failed!')
OSError: Java command failed!
I intend to tag 300 million words for my Ph.D. research project. If I keep tagging 100 thousand words at a time, I will have to repeat the task 3000 times. It will kill me!
I really appreciate your kind help.
After your import lines, add this line:
nltk.internals.config_java(options='-Xmx2G')
This will increase the maximum RAM size that Java allows the Stanford POS Tagger to use. The '-Xmx2G' option changes the maximum allowable RAM to 2GB instead of the default 512MB.
See What are the Xms and Xmx parameters when starting JVMs? for more information
If you're interested in how to debug your code, read on.
So we see that the command fails when handling a huge amount of data, so the first thing to look at is how Java is initialized in NLTK before calling the Stanford tagger, from https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19 :
from nltk.internals import find_file, find_jar, config_java, java, _java_options
We see that the nltk.internals package is handling the different Java configurations and parameters.
Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and we see that no value is added for the memory allocation for Java.
In version 3.9.2, the StanfordTagger class constructor accepts a parameter called java_options which can be used to set the memory for the POSTagger and also the NERTagger.
E.g. pos_tagger = StanfordPOSTagger('models/english-bidirectional-distsim.tagger', path_to_jar='stanford-postagger-3.9.2.jar', java_options='-mx3g')
I found the answer by @alvas did not work for me because StanfordTagger was overriding my memory setting with the built-in default of 1000m. Perhaps using nltk.internals.config_java after initializing StanfordPOSTagger might work, but I haven't tried that.