I am seeing df -h give output like below:
root@vrni-platform:/var/lib# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-var 110G 94G 11G 91% /var
root@vrni-platform:/var/lib# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg-var 114756168 98318504 10585300 91% /var
But if I do the same from java like below
final File dataPath = new File("/var");
final long totalBytes = dataPath.getTotalSpace();
final long usedBytes = totalBytes - dataPath.getFreeSpace();
System.out.printf("Disk utilization: %.2f, Total bytes: %d, Used Bytes: %d", ((double)usedBytes/totalBytes * 100), totalBytes, usedBytes);
It prints the following:
Disk utilization: 85.68, Total bytes: 117510316032, Used Bytes: 100678909952
Can someone let me know why there is this discrepancy in disk utilization?
Environment
Ubuntu 18.04
Java - Zulu OpenJDK 11.0.11
As I also mentioned in the comments, the primary reason is that getFreeSpace seems to report something other than df's 'Avail' or 'Available'. Going by df's '1K-blocks' and 'Used' you also get 85.68%, while going by '1K-blocks' and 'Available' yields 91%. Also observe how df's 'Used' and 'Available' (and 'Used' and 'Avail') do not add up to '1K-blocks' (or 'Size').
As suggested by user16320675, getUsableSpace might be a better method to use than getFreeSpace. As for the reasons for the difference between '1K-blocks' - 'Used' and 'Available' in df, that might be better asked on https://unix.stackexchange.com/.
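For what it's worth, here is a minimal sketch of that suggestion: the same calculation as in the question, just with getUsableSpace in place of getFreeSpace. getUsableSpace roughly corresponds to df's 'Avail', so on the /var mount above it should land much closer to df's 91%.
import java.io.File;

public class DiskUtilization {
    public static void main(String[] args) {
        final File dataPath = new File("/var");
        final long totalBytes = dataPath.getTotalSpace();
        // getUsableSpace() excludes blocks reserved for root, like df's 'Avail';
        // getFreeSpace() counts those reserved blocks as free, hence the lower percentage.
        final long usedBytes = totalBytes - dataPath.getUsableSpace();
        System.out.printf("Disk utilization: %.2f, Total bytes: %d, Used Bytes: %d%n",
                (double) usedBytes / totalBytes * 100, totalBytes, usedBytes);
    }
}
Keep in mind that df computes Use% as Used / (Used + Available) and rounds the percentage up, so a small difference can remain.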
While training data in Mallet, the process stopped because of an OutOfMemoryError. The MEMORY attribute in bin/mallet has already been set to 3GB. The size of the training file output.mallet is only 31 MB. I have tried reducing the training data size, but it still throws the same error:
a161115#a161115-Inspiron-3250:~/dev/test_models/Mallet$ bin/mallet train-classifier --input output.mallet --trainer NaiveBayes --training-portion 0.0001 --num-trials 10
Training portion = 1.0E-4
Unlabeled training sub-portion = 0.0
Validation portion = 0.0
Testing portion = 0.9999
-------------------- Trial 0 --------------------
Trial 0 Training NaiveBayesTrainer with 7 instances
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at cc.mallet.types.Multinomial$Estimator.setAlphabet(Multinomial.java:309)
at cc.mallet.classify.NaiveBayesTrainer.setup(NaiveBayesTrainer.java:251)
at cc.mallet.classify.NaiveBayesTrainer.trainIncremental(NaiveBayesTrainer.java:200)
at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:193)
at cc.mallet.classify.NaiveBayesTrainer.train(NaiveBayesTrainer.java:59)
at cc.mallet.classify.tui.Vectors2Classify.main(Vectors2Classify.java:415)
I would appreciate any help or insights into this problem.
EDIT: this is my bin/mallet file.
#!/bin/bash
malletdir=`dirname $0`
malletdir=`dirname $malletdir`
cp=$malletdir/class:$malletdir/lib/mallet-deps.jar:$CLASSPATH
#echo $cp
MEMORY=10g
CMD=$1
shift
help()
{
cat <<EOF
Mallet 2.0 commands:
import-dir load the contents of a directory into mallet instances (one per file)
import-file load a single file into mallet instances (one per line)
import-svmlight load SVMLight format data files into Mallet instances
info get information about Mallet instances
train-classifier train a classifier from Mallet data files
classify-dir classify data from a single file with a saved classifier
classify-file classify the contents of a directory with a saved classifier
classify-svmlight classify data from a single file in SVMLight format
train-topics train a topic model from Mallet data files
infer-topics use a trained topic model to infer topics for new documents
evaluate-topics estimate the probability of new documents under a trained model
prune remove features based on frequency or information gain
split divide data into testing, training, and validation portions
bulk-load for big input files, efficiently prune vocabulary and import docs
Include --help with any option for more information
EOF
}
CLASS=
case $CMD in
import-dir) CLASS=cc.mallet.classify.tui.Text2Vectors;;
import-file) CLASS=cc.mallet.classify.tui.Csv2Vectors;;
import-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Vectors;;
info) CLASS=cc.mallet.classify.tui.Vectors2Info;;
train-classifier) CLASS=cc.mallet.classify.tui.Vectors2Classify;;
classify-dir) CLASS=cc.mallet.classify.tui.Text2Classify;;
classify-file) CLASS=cc.mallet.classify.tui.Csv2Classify;;
classify-svmlight) CLASS=cc.mallet.classify.tui.SvmLight2Classify;;
train-topics) CLASS=cc.mallet.topics.tui.TopicTrainer;;
infer-topics) CLASS=cc.mallet.topics.tui.InferTopics;;
evaluate-topics) CLASS=cc.mallet.topics.tui.EvaluateTopics;;
prune) CLASS=cc.mallet.classify.tui.Vectors2Vectors;;
split) CLASS=cc.mallet.classify.tui.Vectors2Vectors;;
bulk-load) CLASS=cc.mallet.util.BulkLoader;;
run) CLASS=$1; shift;;
*) echo "Unrecognized command: $CMD"; help; exit 1;;
esac
java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"
It's also worth mentioning that my original training file has 60,000 items. When I reduce the number of items (to 20,000 instances), training runs normally, but uses about 10GB of RAM.
Check the call to Java in bin/mallet and add the flag -Xmx3g, making sure there isn't another -Xmx in it (if so, edit that one).
I usually change both Mallet launcher files and set the memory to the maximum:
Mallet.bat:
java -Xmx%MALLET_MEMORY% -ea -Dfile.encoding=%MALLET_ENCODING% -classpath %MALLET_CLASSPATH% %CLASS% %MALLET_ARGS%
and bin/mallet:
java -Xmx$MEMORY -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server -classpath "$cp" $CLASS "$@"
I replace %MALLET_MEMORY% and $MEMORY with the amount of memory I want, e.g. 4g.
I want to use this function (here is the code on GitHub) to split my dataset into two parts: 90% as the training set (for example) and 10% (the rest) as the test set. I tried this code:
library(XLConnect)
library(readxl)
library(xlsx)

ybi <- read_excel("D:/ii.xls")
# View(ybi)
test <- stratified(ybi, 8, .1)  # stratified() comes from the GitHub code linked above
no <- test$ID_unit              # indices of the test-set samples
train <- ybi[-no, ]             # remaining rows form the training data
write.xlsx(train, "D:/mm.xlsx", sheetName = "Newdata")
In fact my data has 8 attributes and 65,534 rows.
With the code above I select just 10%, stratified on the eighth attribute, which is the class. It gives me the test set without any problem, but not the training data; the error is shown in the attached figure.
How can I fix it?
It looks like your JVM does not have enough memory allocated for the heap.
As a quick fix, export the system variable _JAVA_OPTIONS:
export _JAVA_OPTIONS="-Xmx8G -Xms1G -Xcheck:jni"
You can also use:
options(java.parameters = "-Xmx8G")
and set -Xmx to a value that will make R happy.
I have an external SSD disk (/dev/sda).
When typing df -h:
Size  Used  Avail  Use%
587G  383G  175G   69%
When typing df -H:
Size  Used  Avail  Use%
630G  411G  188G   69%
When using getTotalSpace() / (1024*1024) I'm getting: 600772
When using getUsableSpace() / (1024*1024) I'm getting: 178568
When using getFreeSpace() / (1024*1024) I'm getting: 209108
If I try to calculate the usage percentage, I will not get 69%.
What is the bug?
The 69% of df is calculated as "used / (used + available)" which is 383 / (383 + 175) = 69%. It's not calculated as "used / size".
You don't have "used" in Java (it's not the same as "size - available") so you can't make the same calculation.
But you can calculate "available / size" in both cases:
188 / 630 ≈ 30%
178568 / 600772 ≈ 30% (taking getUsableSpace as "available", since it corresponds to df's 'Avail')
There is no bug. You were comparing different things.
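As a quick sanity check, here is a small sketch that just plugs in the megabyte figures reported in the question (as hard-coded constants, purely for illustration) and reproduces both numbers: df's 69% from used / (used + available), and the 30% from available / size.
public class DfPercentCheck {
    public static void main(String[] args) {
        // Figures from the question, in MB (getXxxSpace() / (1024 * 1024))
        final long total = 600772;      // getTotalSpace()
        final long free = 209108;       // getFreeSpace()
        final long available = 178568;  // getUsableSpace(), roughly df's 'Avail'

        final long used = total - free; // roughly df's 'Used'

        // df's Use% column: used / (used + available) -> about 69%
        System.out.printf("Use%%: %.0f%%%n", (double) used / (used + available) * 100);
        // The ratio from above: available / size -> about 30%
        System.out.printf("Avail / Size: %.0f%%%n", (double) available / total * 100);
    }
}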
I am using the Stanford POS tagger toolkit to tag lists of words from academic papers. Here is my code for this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize; they come from normal academic papers, so usually there are several thousand words (mostly 3000-4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set with 270 files, but when the number of files gets bigger, the program gives out this error (Java heap space 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after execution; it happens after some time of running. I really don't know the reason. Is this because my 3000-4000 words are too many? Thank you very much for your help! (Sorry for the bad formatting; the error output is too long.)
Here is my solution, after I too faced the error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000:
def __init__(self, [...], java_options='-mx1000m')
In order to test whether it is a problem with the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging.
I'm getting an out-of-memory exception due to lack of Java heap space when I try to download tweets using Flume and pipe them into Hadoop.
I have currently set the heap space to 4GB in Hadoop's mapred-site.xml, like so:
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx4096m</value>
</property>
I am hoping to download tweets continually for two days but can't get past 45 minutes without errors.
Since I do have the disk space to hold all of this, I am assuming the error is coming from Java having to handle so many things at once. Is there a way for me to slow down the speed at which these tweets are downloaded, or do something else to solve this problem?
Edit: flume.conf included
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <required>
TwitterAgent.sources.Twitter.consumerSecret = <required>
TwitterAgent.sources.Twitter.accessToken = <required>
TwitterAgent.sources.Twitter.accessTokenSecret = <required>
TwitterAgent.sources.Twitter.keywords = manchester united, man united, man utd, man u
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Edit 2
I've tried increasing the memory to 8GB, which still doesn't help. I am assuming I am placing too many tweets in Hadoop at once and need to write them to disk and release the space again (or something to that effect). Is there a guide anywhere on how to do this?
Set the JAVA_OPTS value in flume-env.sh and start the Flume agent.
It appears the problem had to do with the batch size and transactionCapacity. I changed them to the following:
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
This works without me even needing to change the JAVA_OPTS value.