While training a classifier in Mallet, the process stopped because of an OutOfMemoryError. The MEMORY variable in bin/mallet has already been set to 3GB. The training file output.mallet is only 31 MB. I have tried reducing the training data size, but it still throws the same error:
a161115@a161115-Inspiron-3250:~/dev/test_models/Mallet$ bin/mallet train-classifier --input output.mallet --trainer NaiveBayes --training-portion 0.0001 --num-trials 10
Training portion = 1.0E-4
Unlabeled training sub-portion = 0.0
Validation portion = 0.0
Testing portion = 0.9999
-------------------- Trial 0 --------------------
Trial 0 Training NaiveBayesTrainer with 7 instances
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at cc.mallet.types.Multinomial$Estimator.setAlphabet(Multinomial.java:309)
at cc.mallet.classify.NaiveBayesTrainer.setup(NaiveBayesTrainer.java:251)
at cc.mallet.classify.NaiveBayesTrainer.trainIncremental(NaiveBayesTrainer.java:200)
at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:193)
at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:59)
at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:415)
I would appreciate any help or insights into this problem.
EDIT: This is my bin/mallet file:
#!/bin/bash
malletdir=`dirname $0`
malletdir=`dirname $malletdir`
cp=$malletdir/class:$malletdir/lib/mallet-deps.jar:$CLASSPATH
#echo $cp
MEMORY=10g
CMD=$1
shift
help()
{
cat <<EOF
Mallet 2.0 commands:
import-dir load the contents of a directory into mallet instances (one per file)
import-file load a single file into mallet instances (one per line)
import-svmlight load SVMLight format data files into Mallet instances
info get information about Mallet instances
train-classifier train a classifier from Mallet data files
classify-dir classify data from a single file with a saved classifier
classify-file classify the contents of a directory with a saved classifier
classify-svmlight classify data from a single file in SVMLight format
train-topics train a topic model from Mallet data files
infer-topics use a trained topic model to infer topics for new documents
evaluate-topics estimate the probability of new documents under a trained model
prune remove features based on frequency or information gain
split divide data into testing, training, and validation portions
bulk-load for big input files, efficiently prune vocabulary and import docs
Include --help with any option for more information
EOF
}
CLASS=
case $CMD in
import-dir) CLASS=cc.mallet.classify.tui.Text2Vectors;;
import-file) CLASS=cc.mallet.classify.tui.Csv2Vectors;;
import-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Vectors;;
info) CLASS=cc.mallet.classify.tui.Vectors2Info;;
train-classifier) CLASS=cc.mallet.classify.tui.Vectors2Classify;;
classify-dir) CLASS=cc.mallet.classify.tui.Text2Classify;;
classify-file) CLASS=cc.mallet.classify.tui.Csv2Classify;;
classify-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Classify;;
train-topics) CLASS=cc.mallet.topics.tui.TopicTrainer;;
infer-topics) CLASS=cc.mallet.topics.tui.InferTopics;;
evaluate-topics) CLASS=cc.mallet.topics.tui.EvaluateTopics;;
prune) CLASS=cc.mallet.classify.tui.Vectors2Vectors;;
split) CLASS=cc.mallet.classify.tui.Vectors2Vectors;;
bulk-load) CLASS=cc.mallet.util.BulkLoader;;
run) CLASS=$1; shift;;
*) echo "Unrecognized command: $CMD"; help; exit 1;;
esac
java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"
It's also worth mentioning that my original training file has 60,000 items. When I reduce the number of items to 20,000 instances, training runs normally but uses about 10GB of RAM.
Check the call to Java in bin/mallet and add the flag -Xmx3g, making sure there isn't already another -Xmx flag (if there is, edit that one instead).
I usually change both Mallet launcher files and set the memory to the maximum I can spare.
In mallet.bat:
java -Xmx%MALLET_MEMORY% -ea -Dfile.encoding=%MALLET_ENCODING% -classpath %MALLET_CLASSPATH% %CLASS% %MALLET_ARGS%
and in bin/mallet:
java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"
I replace %MALLET_MEMORY% and $MEMORY with the amount of memory I want, e.g. 4g.
I have a matrix pipeline job which performs multiple stages (roughly 200), most of which are functional tests whose results are recorded by the following code:
stage('Report') {
    script {
        def summary = junit allowEmptyResults: true, testResults: "**/artifacts/${product}/test-reports/*.xml"
        def buildURL = "${env.BUILD_URL}"
        def TestAnalyzer = buildURL.replace("/${env.BUILD_NUMBER}", "/test_results_analyzer")
        def TestsURL = buildURL.replace("job/${env.JOB_NAME}/${env.BUILD_NUMBER}", "blue/organizations/jenkins/${env.JOB_NAME}/detail/${env.JOB_NAME}/${env.BUILD_NUMBER}/tests")
        slackSend (
            color: summary.failCount == 0 ? 'good' : 'warning',
            message: "*<${TestsURL}|Test Summary>* for *${env.JOB_NAME}* on *${env.HOSTNAME} - ${product}* <${env.BUILD_URL}| #${env.BUILD_NUMBER}> - <${TestAnalyzer}|${summary.totalCount} Tests>, Failed: ${summary.failCount}, Skipped: ${summary.skipCount}, Passed: ${summary.passCount}"
        )
    }
}
The problem is that this Report stage regularly fails with the following error:
> Archive JUnit-formatted test results 9m 25s
[2022-11-16T02:51:49.569Z] Recording test results
Java heap space
I have increased the heap space of the Jenkins server to 8GB by modifying the systemd service configuration this way:
software-dev@magnet:~$ sudo cat /etc/systemd/system/jenkins.service.d/override.conf
[Service]
Environment="JAVA_OPTS=-Djava.awt.headless=true -Xmx8g"
which was taken into account, as I verified with the following command:
software-dev@magnet:~$ tr '\0' '\n' < /proc/$(pidof java)/cmdline
/usr/bin/java
-Djava.awt.headless=true
-Xmx10g
-jar
/usr/share/java/jenkins.war
--webroot=/var/cache/jenkins/war
--httpPort=8080
I just increased the heap size to 10GB and will wait for the result of tonight's build, but this amount of heap space really looks excessive, so I suspect that a plugin, maybe the JUnit one, may be buggy and consuming too much memory.
Is anyone aware of such a thing? Could there be workarounds?
More importantly, which methods could I use to track whether one plugin is consuming too much memory?
I have some notions of Java from my CS degree, but I'm not familiar with the Jenkins development ecosystem.
Thank you in advance.
You can try splitting the tests into chunks/batches/groups, but this solution requires changes to the code.
More details:
https://semaphoreci.com/community/tutorials/how-to-split-junit-tests-in-a-continuous-integration-environment
Grouping JUnit tests
I want to use this function (here is the code on GitHub) to sample my dataset into two parts: 90% training data (for example) and 10% (the rest) as the test set. I tried this code:
library(XLConnect)
library(readxl)
library(xlsx)
ybi <- read_excel("D:/ii.xls")
#View(ybi)
test= stratified(ybi, 8, .1)
no= (test$ID_unit) # to get indices of the testdataset samples
train = ybi [-no,] # the indices for training data
write.xlsx(train,"D:/mm.xlsx",sheetName = "Newdata")
In fact my data has 8 attributes and 65,534 rows.
With the code above I have selected just 10% based on the eighth attribute, which is the class. It gives me the test set without any problem, but not the training data; the error is shown in the attached figure.
How can I fix it?
It looks like your JVM does not have enough memory allocated for the heap.
As a quick fix, export the system variable _JAVA_OPTIONS:
export _JAVA_OPTIONS="-Xmx8G -Xms1G -Xcheck:jni"
You can also use:
options(java.parameters = "-Xmx8G")
and set -Xmx to a value that will make R happy. Note that java.parameters must be set before the Java-backed package (rJava/xlsx) is loaded, because the JVM options are fixed once the JVM has started.
I am using the Stanford POS tagger toolkit to tag lists of words from academic papers. Here is the code for this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize. They come from normal academic papers, so there are usually several thousand words (mostly 3000-4000). I need to process over 10,000 files, so I keep calling these functions. My program works fine on a small test set with 270 files, but when the number of files gets bigger, the program throws this error (Java heap space is 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after execution starts; it happens after some time of running. I really don't know the reason. Is this because my 3000-4000 words are too many? Thank you very much for your help! (Sorry for the bad formatting, the error output is too long.)
Here is my solution, after I too faced this error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap size via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000:
def __init__(self, [...], java_options='-mx1000m')
In order to test whether it is a problem with the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging.
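A minimal sketch of that approach, assuming a recent NLTK, a plain-text input file, and placeholder model/jar paths (adjust them to your installation):
from nltk.tokenize import sent_tokenize, word_tokenize  # sent_tokenize uses the Punkt tokenizer; requires nltk.download('punkt') once
from nltk.tag.stanford import StanfordPOSTagger

# Placeholder paths -- point these at your own model and jar.
st = StanfordPOSTagger('path/to/your-model.tagger', 'path/to/stanford-postagger.jar',
                       encoding='utf8', java_options='-mx2048m')

with open('paper.txt', encoding='utf8') as f:
    text = f.read()

# Tag one sentence at a time so each call to the tagger handles a short token list,
# and print the result right after tagging.
for sentence in sent_tokenize(text):
    print(st.tag(word_tokenize(sentence)))
If the per-sentence calls go through without error, the failure is more likely related to passing the whole document to a single call.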
I have a small gap in understanding how a JVM process allocates its own memory. As far as I know,
RSS = Heap size + MetaSpace + OffHeap size
where OffHeap consists of thread stacks, direct buffers, mapped files (libraries and jars), and the JVM code itself.
At the moment I’m trying to analyze my Java application (Spring Boot + Infinispan) whose RSS is 779M (it runs in a Docker container, so pid 1 is ok):
[ root@daf5a5ae9bb7:/data ]$ ps -o rss,vsz,sz 1
RSS VSZ SZ
798324 6242160 1560540
According to jvisualvm, committed Heap size is 374M
Metaspace size is 89M
In other words, I want to explain 779M - (374M + 89M) = 316M of OffHeap memory.
My app has (on average) 36 live threads.
Each of these threads consumes 1M:
[ root@fac6d0dfbbb4:/data ]$ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize
intx CompilerThreadStackSize = 0
intx ThreadStackSize = 1024
intx VMThreadStackSize = 1024
So, here we can add 36M.
The only place where the app uses DirectBuffer is NIO. As far as I can see from JMX, it doesn’t consume a lot of resources - only 98K
The last step is mapped libs and jars. But according to pmap (full output)
[ root@daf5a5ae9bb7:/data ]$ pmap -x 1 | grep ".so.*" | awk '{ sum+=$3} END {print sum}'
12896K
plus
[ root@daf5a5ae9bb7:/data ]$ pmap -x 1 | grep ".jar" | awk '{ sum+=$3} END {print sum}'
9720K
we only have 20M here.
Hence, we still have to explain 316M - (36M + 20M) = 260M :(
Does anyone have any idea what I missed?
Approach:
You may want to use Java HotSpot Native Memory Tracking (NMT).
This may give you an exact list of the memory allocated by the JVM, split up into the different areas: heap, classes, threads, code, GC, compiler, internal, symbols, memory tracking, pooled free chunks, and unknown.
Usage:
You can start your application with -XX:NativeMemoryTracking=summary.
Observations of the current heap can be done with jcmd <pid> VM.native_memory summary.
Where to find jcmd / pid:
On a default OpenJDK installation on Ubuntu, jcmd can be found at /usr/bin/jcmd.
By just running jcmd without any parameter, you get a list of running Java applications.
user@pc:~$ /usr/bin/jcmd
5169 Main <-- 5169 is the pid
Output:
You will then receive a complete overview of the JVM's memory, looking something like the following:
Total: reserved=664192KB, committed=253120KB <--- total memory tracked by Native Memory Tracking
Java Heap (reserved=516096KB, committed=204800KB) <--- Java Heap
(mmap: reserved=516096KB, committed=204800KB)
Class (reserved=6568KB, committed=4140KB) <--- class metadata
(classes #665) <--- number of loaded classes
(malloc=424KB, #1000) <--- malloc'd memory, #number of malloc
(mmap: reserved=6144KB, committed=3716KB)
Thread (reserved=6868KB, committed=6868KB)
(thread #15) <--- number of threads
(stack: reserved=6780KB, committed=6780KB) <--- memory used by thread stacks
(malloc=27KB, #66)
(arena=61KB, #30) <--- resource and handle areas
Code (reserved=102414KB, committed=6314KB)
(malloc=2574KB, #74316)
(mmap: reserved=99840KB, committed=3740KB)
GC (reserved=26154KB, committed=24938KB)
(malloc=486KB, #110)
(mmap: reserved=25668KB, committed=24452KB)
Compiler (reserved=106KB, committed=106KB)
(malloc=7KB, #90)
(arena=99KB, #3)
Internal (reserved=586KB, committed=554KB)
(malloc=554KB, #1677)
(mmap: reserved=32KB, committed=0KB)
Symbol (reserved=906KB, committed=906KB)
(malloc=514KB, #2736)
(arena=392KB, #1)
Memory Tracking (reserved=3184KB, committed=3184KB)
(malloc=3184KB, #300)
Pooled Free Chunks (reserved=1276KB, committed=1276KB)
(malloc=1276KB)
Unknown (reserved=33KB, committed=33KB)
(arena=33KB, #1)
This gives a detailed overview of the different memory areas used by the JVM, and also shows the reserved and committed memory.
I don't know of a technique that gives you a more detailed memory consumption list.
Further reading:
You can also use -XX:NativeMemoryTracking=detail in combination with further jcmd commands. A more detailed explanation can be found in the Java Platform, Standard Edition Troubleshooting Guide - 2.6 The jcmd Utility. You can check the possible commands via "jcmd <pid> help".
I request your kind help in solving the "Java Command Fails" error, which keeps being thrown whenever I try to tag an Arabic corpus of 2 megabytes. I have searched the web and the Stanford POS tagger mailing list, but I did not find a solution. I read some posts on similar problems, and it was suggested that the memory runs out; I am not sure of that, since I still have 19GB of free memory. I tried every possible solution offered, but the same error keeps showing.
I have an average command of Python and a good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, NLTK 3.0 alpha, and the Stanford POS tagger model for Arabic. This is my code:
import nltk
from nltk.tag.stanford import POSTagger
arabic_postagger = POSTagger("/home/mohammed/postagger/models/arabic.tagger", "/home/mohammed/postagger/stanford-postagger.jar", encoding='utf-8')
print("Executing tag_corpus.py...\n")
# Import corpus file
print("Importing data...\n")
file = open("test.txt", 'r', encoding='utf-8').read()
text = file.strip()
print("Tagging the corpus. Please wait...\n")
tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))
If the corpus size is less than 1MB (= 100,000 words), there will be no error. But when I try to tag the 2MB corpus, the following error message is shown:
Traceback (most recent call last):
File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
return self.batch_tag([tokens])[0]
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
raise OSError('Java command failed!')
OSError: Java command failed!
I intend to tag 300 million words to be used in my Ph.D. research project. If I keep tagging 100,000 words at a time, I will have to repeat the task 3,000 times. It will kill me!
I really appreciate your kind help.
After your import lines, add this line:
nltk.internals.config_java(options='-Xmx2G')
This will increase the maximum RAM size that Java allows the Stanford POS Tagger to use. '-Xmx2G' raises the maximum allowable RAM to 2GB instead of the default 512MB.
See What are the Xms and Xmx parameters when starting JVMs? for more information.
If you're interested in how to debug your code, read on.
So we see that the command fails when handling a huge amount of data, so the first thing to look at is how Java is initialized in NLTK before calling the Stanford tagger, from https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19 :
from nltk.internals import find_file, find_jar, config_java, java, _java_options
We see that the nltk.internals package is handling the different Java configurations and parameters.
Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and we see that no value is added for the memory allocation for Java.
In version 3.9.2, the StanfordTagger class constructor accepts a parameter called java_options which can be used to set the memory for the POSTagger and also the NERTagger.
E.g. pos_tagger = StanfordPOSTagger('models/english-bidirectional-distsim.tagger', path_to_jar='stanford-postagger-3.9.2.jar', java_options='-mx3g')
I found that the answer by @alvas did not work for me because StanfordTagger was overriding my memory setting with its built-in default of 1000m. Perhaps using nltk.internals.config_java after initializing StanfordPOSTagger might work, but I haven't tried that.
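To make the constructor route concrete, here is a minimal sketch based on the parameters mentioned above; the model and jar paths are placeholders for your own installation, and the config_java alternative is left commented out since it is untested here:
from nltk.tag.stanford import StanfordPOSTagger

# NLTK 3.9.2: pass the heap size to the constructor so it is not overridden
# by the built-in '-mx1000m' default. Paths below are placeholders.
pos_tagger = StanfordPOSTagger(
    'models/english-bidirectional-distsim.tagger',
    path_to_jar='stanford-postagger-3.9.2.jar',
    java_options='-mx3g')  # 3 GB heap for the tagger's Java subprocess

# Untested alternative mentioned above:
# import nltk
# nltk.internals.config_java(options='-Xmx3g')

print(pos_tagger.tag('This is a test sentence'.split()))
Either way, it is worth confirming on a small input that the larger heap setting is actually picked up before starting a long tagging job.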