R H2O - Memory management - java

I'm trying to use H2O via R to build multiple models using subsets of one large-ish data set (~10 GB). The data is one year's worth of data, and I'm trying to build 51 models (i.e. train on week 1, predict on week 2, etc.), with each week being about 1.5-2.5 million rows with 8 variables.
I've done this inside a loop, which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I created a function to remove all of them except the main data set.
h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()){
  # Find all objects on server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop thru and remove items to be removed
  for(i in keysToKill){
    h2o.rm(object = clust, keys = i)
    if(verbose == TRUE){
      print(i); flush.console()
    }
  }
  # Print remaining objects in cluster.
  h2o.ls(clust)
}
The script runs fine for a while and then crashes, often with a complaint about running out of memory and swapping to disk.
Here's some pseudo code describing the process:
# load h2o library
library(h2o)

# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")

# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")

# Start loop
for(i in 1:51){
  # create test/train hex objects
  train1.hex <- dat1.hex[dat1.hex$week_num == i,]
  test1.hex <- dat1.hex[dat1.hex$week_num == i + 1,]

  # train gbm
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
                      , nfolds = 3
                      , importance = T
                      , distribution = 'bernoulli'
                      , n.trees = 100
                      , interaction.depth = 10
                      , shrinkage = 0.01
                      )

  # calculate out of sample performance
  test2.hex <- cbind.H2OParsedData(test1.hex, h2o.predict(dat1.gbm, test1.hex))
  colnames(test2.hex) <- names(head(test2.hex))
  gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)#model$auc

  # clean h2o entity
  h2o.clean(clust = localH2O, verbose = F, vte = c('dat1.hex'))
} # end loop
My question is what, if any, is the correct way to manage data and memory in a standalone instance (this is NOT running on Hadoop or a cluster, just a large EC2 instance with ~64 GB RAM and 12 CPUs) for this type of process? Should I be killing and recreating the H2O instance after each loop iteration (this was the original process, but reading the data from file every time adds ~10 minutes per iteration)? Is there a proper way to garbage collect or release memory after each loop?
Any suggestions would be appreciated.

This answer is for the original H2O project (releases 2.x.y.z).
In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.
These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.
What I recommend doing is:
- at the bottom of each loop iteration, use h2o.assign() to do a deep copy of anything you want to save to a known key name
- use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps
- call gc() explicitly in R somewhere in the loop (a combined end-of-loop sketch is shown below, after the test link)
Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:
removeLastValues <- function(conn) {
  df <- h2o.ls(conn)
  keys_to_remove <- grep("^Last\\.value\\.", perl = TRUE, x = df$Key, value = TRUE)
  unique_keys_to_remove <- unique(keys_to_remove)
  if (length(unique_keys_to_remove) > 0) {
    h2o.rm(conn, unique_keys_to_remove)
  }
}
Here is a link to an R test in the H2O github repository that uses this technique and can run indefinitely without running out of memory:
https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R
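Putting those three steps together, the tail of each loop iteration in the question's pseudo code could look roughly like the sketch below. This is only a sketch against the H2O 2.x R API: the key name passed to h2o.assign() is illustrative, and removeLastValues() is the helper defined above.
# sketch: end of one loop iteration (H2O 2.x API; key name is illustrative)

# deep-copy anything worth keeping to a stable, known key before cleaning up
scored.hex <- h2o.assign(test2.hex, key = paste0("scored_week_", i))

# remove the "Last.value" temporaries created by slicing, cbind and predict
removeLastValues(localH2O)

# drop the dead R references, then run R's GC so H2O can reclaim their keys
rm(train1.hex, test1.hex, test2.hex, dat1.gbm)
gc()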

New suggestion as of 12/15/2015: update to latest stable (Tibshirani 3.6.0.8 or later).
We've completely reworked how R & H2O handle internal temp variables, and the memory management is much smoother.
Next: H2O temps can be held "alive" by dead R variables, so run an R gc() every loop iteration. Once R's GC removes the dead variables, H2O will reclaim that memory.
After that, your cluster should only hold on to specifically named things, like loaded datasets and models. These you'll need to delete roughly as fast as you make them, to avoid accumulating large data in the K/V store.
Please let us know if you have any more problems by posting to the h2ostream Google group:
https://groups.google.com/forum/#!forum/h2ostream
Cliff

The most current answer to this question is that you should probably just use the h2o.grid() function rather than writing a loop.
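As a rough illustration, here is what a grid search could look like with the current H2O 3.x R API. The frame and column names are carried over from the question, and the hyperparameter values are arbitrary; treat this as a sketch of the API rather than tuned settings.
# minimal h2o.grid() sketch (H2O 3.x; hyperparameter values are arbitrary)
hyper_params <- list(max_depth  = c(5, 10),
                     learn_rate = c(0.01, 0.05),
                     ntrees     = c(100, 200))

gbm_grid <- h2o.grid("gbm",
                     grid_id        = "gbm_grid_week1",
                     x              = xVars,
                     y              = "click_target2",
                     training_frame = train1.hex,
                     hyper_params   = hyper_params)

# retrieve the grid's models sorted by AUC
sorted_grid <- h2o.getGrid("gbm_grid_week1", sort_by = "auc", decreasing = TRUE)
Note that h2o.grid() searches hyperparameters over a single training frame, so it replaces a hand-rolled model-tuning loop; the week-by-week slicing from the question would still happen outside it.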

With the new H2O version (currently 3.24.0.3), the following recommendations are suggested:
my for loop {
  # perform loop

  rm(R object that isn't needed anymore)
  rm(R object of h2o thing that isn't needed anymore)

  # trigger removal of h2o back-end objects that got rm'd above, since the rm can be lazy.
  gc()
  # optional extra one to be paranoid. this is usually very fast.
  gc()

  # optionally sanity check that you see only what you expect to see here, and not more.
  h2o.ls()

  # tell back-end cluster nodes to do three back-to-back JVM full GCs.
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
}
Here is the source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html
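Applied to the week-by-week GBM loop from the question, that pattern could look roughly like the sketch below. This is only a sketch against the H2O 3.x R API (the h2o.gbm() argument names changed between 2.x and 3.x, so training_frame, ntrees, max_depth and learn_rate stand in for the question's data, n.trees, interaction.depth and shrinkage), and the cleanup calls simply follow the recommendation above.
for (i in 1:51) {
  train.hex <- dat1.hex[dat1.hex$week_num == i, ]
  test.hex  <- dat1.hex[dat1.hex$week_num == i + 1, ]

  gbm_model <- h2o.gbm(y = "click_target2", x = xVars,
                       training_frame = train.hex,
                       distribution = "bernoulli",
                       ntrees = 100, max_depth = 10, learn_rate = 0.01)

  perf   <- h2o.performance(gbm_model, newdata = test.hex)
  gbmAuc <- h2o.auc(perf)

  # remove the back-end model, then the R references to the per-week objects
  h2o.rm(gbm_model)
  rm(gbm_model, train.hex, test.hex, perf)

  # let the lazy rm() propagate to the H2O back end, then sanity-check the store
  gc()
  gc()
  h2o.ls()

  # optional: ask the back-end node(s) for full JVM GCs
  h2o:::.h2o.garbageCollect()
}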

Related

Spark MLlib: PCA on 9570 columns takes too long

1) I am doing a PCA on 9570 columns, giving it 12288 MB of RAM in local mode (which means driver only), and it takes from 1.5 up to 2 hours. This is the code (very simple):
System.out.println("level1\n");
VectorAssembler assemblerexp = new VectorAssembler()
.setInputCols(metincols)
.setOutputCol("intensity");
expoutput = assemblerexp.transform(expavgpeaks);
System.out.println("level2\n");
PCAModel pcaexp = new PCA()
.setInputCol("intensity")
.setOutputCol("pcaFeatures")
.setK(2)
.fit(expoutput);
System.out.println("level3\n");
So the time it takes to print level3 is what takes long (1.5 to 2 hours). Is it normal that it takes so long? I have tried different numbers of partitions (2, 4, 6, 8, 50, 500, 10000); for some of them it also takes almost 2 hours, while for others I get a Java heap space error. Also some screenshots from the Spark user interface:
[screenshots: Executors, Jobs, Stages, Environment]
2) Is it also normal that I get different results with the PCA every time?
If you are setting the RAM programmatically, it does not take effect; the proper way is to provide JVM arguments (for example, spark-submit's --driver-memory option, since in local mode everything runs in the driver).

Load Neo4J in memory on demand for heavy computations

How could I load Neo4J into memory on demand?
At different stages of my long-running jobs I'm persisting nodes and relationships to Neo4J. So Neo4J should be on disk, since it may consume too much memory and I don't know when I am going to run read queries against it.
But at some point (only once) I will want to run a pretty heavy read query against my Neo4J server, and it has very poor performance (hours). As a solution, I want to load all of Neo4J into RAM for better performance.
What is the best option for this? Should I use a RAM disk, or are there better solutions?
P.S.
A query with [r:LINK_REL_1*2] works pretty fast, [r:LINK_REL_1*3] takes 17 seconds, and [r:LINK_REL_1*4] takes more than 5 minutes; I do not even know how much longer, since I have a 5-minute timeout. But I need the [r:LINK_REL_1*2..4] query to perform in reasonable time.
My heavy query explanation
PROFILE
MATCH path = (start:COLUMN)-[r:LINK_REL_1*2]->(col:COLUMN)
WHERE start.ENTITY_ID = '385'
WITH path UNWIND NODES(path) AS col
WITH path,
COLLECT(DISTINCT col.DATABASE_ID) as distinctDBs
WHERE LENGTH(path) + 1 = SIZE(distinctDBs)
RETURN path
Updated query with explanation (got the same performance in tests)
PROFILE
MATCH (start:COLUMN)
WHERE start.ENTITY_ID = '385'
MATCH path = (start)-[r:LINK_REL_1*2]->(col:COLUMN)
WITH path, REDUCE(dbs = [], col IN NODES(path) |
CASE WHEN col.DATABASE_ID in dbs
THEN dbs
ELSE dbs + col.DATABASE_ID END) as distinctDbs
WHERE LENGTH(path) + 1 = SIZE(distinctDbs)
RETURN path
The APOC procedures library has apoc.warmup.run(), which may get much of Neo4j into cached memory. See if that makes a difference.
It looks like you're trying to create a query in which the path contains only :Persons from distinct countries. Is this right?
If so, I think we can find a better query that can do this without hanging.
First, let's go for low-hanging fruit and see if avoiding the UNWIND can make a difference.
PROFILE or EXPLAIN the query and see if any numbers look significantly different compared to the original query.
MATCH (start:PERSON)
WHERE start.ID = '385'
MATCH path = (start)-[r:FRIENDSHIP_REL*2..5]->(person:PERSON)
WITH path, REDUCE(countries = [], person IN NODES(path) |
CASE WHEN person.country in countries
THEN countries
ELSE countries + person.COUNTRY_ID END) as distinctCountries
WHERE LENGTH(path) + 1 = SIZE(distinctCountries)
RETURN path

Memory requirements for Stanford NER retraining

I am retraining the Stanford NER model on my own training data for extracting organizations. But, whether I use a 4GB RAM machine or an 8GB RAM machine, I get the same Java heap space error.
Could anyone tell what is the general configuration of machines on which we can retrain the models without getting these memory issues?
I used the following command :
java -mx4g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop newdata_retrain.prop
I am working with training data (multiple files, each with about 15,000 lines in the following format): one word and its category on each line
She O
is O
working O
at O
Microsoft ORGANIZATION
Is there anything else we could do to make these models run reliably? I did try reducing the number of classes in my training data, but that impacts the accuracy of extraction. For example, some locations and other entities get classified as organization names. Can we reduce the number of classes without impacting accuracy?
One data set I am using is the Alan Ritter Twitter NLP data: https://github.com/aritter/twitter_nlp/tree/master/data/annotated/ner.txt
The properties file looks like this:
#location of the training file
trainFile = ner.txt
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model-twitter.ser.gz
#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1
#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
The error I am getting: the stack trace is this:
CRFClassifier invoked on Mon Dec 01 02:55:22 UTC 2014 with arguments:
-prop twitter_retrain.prop
usePrevSequences=true
useClassFeature=true
useTypeSeqs2=true
useSequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk=true
useTypeySequences=true
useDisjunctive=true
noMidNGrams=true
serializeTo=ner-model-twitter.ser.gz
maxNGramLeng=6
useNGrams=true
usePrev=true
useNext=true
maxLeft=1
trainFile=ner.txt
map=word=0,answer=1
useWord=true
useTypeSeqs=true
[1000][2000]numFeatures = 215032
setting nodeFeatureIndicesMap, size=149877
setting edgeFeatureIndicesMap, size=65155
Time to convert docs to feature indices: 4.4 seconds
numClasses: 21 [0=O,1=B-facility,2=I-facility,3=B-other,4=I-other,5=B-company,6=B-person,7=B-tvshow,8=B-product,9=B-sportsteam,10=I-person,11=B-geo-loc,12=B-movie,13=I-movie,14=I-tvshow,15=I-company,16=B-musicartist,17=I-musicartist,18=I-geo-loc,19=I-product,20=I-sportsteam]
numDocuments: 2394
numDatums: 46469
numFeatures: 215032
Time to convert docs to data/labels: 2.5 seconds
Writing feature index to temporary file.
numWeights: 31880772
QNMinimizer called on double function of 31880772 variables, using M = 25.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:923)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:885)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:879)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:91)
at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1911)
at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1718)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:759)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:747)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2937)
One way you can try to reduce the number of classes is to not use B-I notation, for example by clubbing B-facility and I-facility together into facility. Of course, another way is to use a machine with more memory.
Shouldn't that be -Xmx4g not -mx4g?
Sorry for getting to this a bit late! I suspect the problem is the input format of the file; in particular, my first guess is that the file is being treated as a single long sentence.
The expected format of the training file is in the CoNLL format, which means each line of the file is a new token, and the end of a sentence is denoted by a double newline. So, for example, a file could look like:
Cats O
have O
tails O
. O

Felix ANIMAL
is O
a O
cat O
. O
Could you let me know if it's indeed in this format? If so, could you include a stack trace of the error, and the properties file you are using? Does it work if you run on just the first few sentences of the file?
--Gabor
If you are going to do analysis on non-transactional data sets, you may want to use another tool like Elasticsearch (simpler) or Hadoop (exponentially more complicated). MongoDB is a good middle ground as well.
First uninstall the existing Java JDK and reinstall it.
Then you can use as large a heap size as you want, limited by your hard disk size.
In the term "-mx4g", 4g is not the RAM; it is the heap size.
I faced the same error initially; after doing this it was gone. I too misunderstood 4g as RAM initially.
Now I am able to start my server even with 100g of heap size.
Next, instead of using a customised NER model, I suggest you use a custom RegexNER model, with which you can add millions of words of the same entity name within a single document too.
These are the 2 errors I faced initially.
For any queries, comment below.

Cascading join two files very slow

I am using Cascading to do a HashJoin of two 300 MB files, with the following Cascading workflow:
// select the field which I need from the first file
Fields f1 = new Fields("id_1");
docPipe1 = new Each( docPipe1, scrubArguments, new ScrubFunction( f1 ), Fields.RESULTS );

// select the fields which I need from the second file
Fields f2 = new Fields("id_2", "category");
docPipe2 = new Each( docPipe2, scrubArguments, new ScrubFunction( f2 ), Fields.RESULTS );

// hashJoin
Pipe tokenPipe = new HashJoin( docPipe1, new Fields("id_1"),
                               docPipe2, new Fields("id_2"), new LeftJoin() );

// count the number of each "category" based on the id_1 matching id_2
Pipe pipe = new Pipe( tokenPipe );
pipe = new GroupBy( pipe, new Fields("category") );
pipe = new Every( pipe, Fields.ALL, new Count(), Fields.ALL );
I am running this Cascading program on a Hadoop cluster which has 3 datanodes, each with 8 GB of RAM and 4 cores (I set mapred.child.java.opts to 4096 MB), but it takes about 30 minutes to get the final result. I think that is too slow, yet I don't believe there is a problem in my program or in the cluster. How can I make this Cascading join faster?
As given in the Cascading user guide:
HashJoin attempts to keep the entire right-hand stream in memory for rapid comparison (not just the current grouping, as no grouping is performed for a HashJoin). Thus a very large tuple stream in the right-hand stream may exceed a configurable spill-to-disk threshold, reducing performance and potentially causing a memory error. For this reason, it's advisable to use the smaller stream on the right-hand side.
Or use CoGroup, which might be helpful.
It may be that your Hadoop cluster is busy or dedicated to some other job, hence the time taken. I don't think that replacing HashJoin with CoGroup will help you, because CoGroup is a reduce-side join while HashJoin does a map-side join, and hence HashJoin is going to be more performant than CoGroup. I think you should try again with a less busy cluster, because your code looks fine.

any quick sorting for a huge csv file

I am looking for a Java implementation of a sorting algorithm. The file could be HUGE, say 20000*600 = 12,000,000 lines of records. Each line is comma-delimited with 37 fields, and we use 5 fields as keys. Is it possible to sort it quickly, say within 30 minutes?
If you have an approach other than Java, it is welcome as long as it can be easily integrated into a Java system, for example a Unix utility.
Thanks.
Edit: The lines that need to be sorted are dispersed across 600 files, with 20,000 lines each, about 4 MB per file. Finally I would like them to be one big sorted file.
I am trying to time a Unix sort and will update afterwards.
Edit:
I appended all the files into one big file and tried the Unix sort; it is pretty good. The time to sort the 2 GB file was 12-13 minutes. The append action required 4 minutes for the 600 files.
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt
How does the data get into CSV format? Does it come from a relational database? You can make it such that whatever process creates the file writes its entries in the right order, so you don't have to solve this problem down the line.
If you are doing a simple lexicographic ordering, you can try the Unix sort, but I am not sure how that will perform on a file of that size.
Calling the Unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with Java's Runtime, but the output of the process is redirected, so you have to do some juggling to get the redirect to work right:
public static void sortInUnix(File fileIn, File sortedFile)
        throws IOException, InterruptedException {
    String[] cmd = {
            "cmd", "/c",
            // above should be changed to "sh", "-c" if on Unix system
            "sort " + fileIn.getAbsolutePath() + " > "
                    + sortedFile.getAbsolutePath() };
    Process sortProcess = Runtime.getRuntime().exec(cmd);
    // capture error messages (if any)
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            sortProcess.getErrorStream()));
    String outputS = reader.readLine();
    while (outputS != null) {
        System.err.println(outputS);
        outputS = reader.readLine();
    }
    sortProcess.waitFor();
}
Use the Java library big-sorter, which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting into intermediate files, then sorting and merging the intermediate files repeatedly until only one is left. Note also that the maximum group size for a merge is configurable (useful when there are a large number of input files).
Here's an example:
Given the CSV file below, we will sort on the second column (the "number" column):
name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
Serializer<CSVRecord> serializer = Serializer.csv(
        CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withRecordSeparator("\n"),
        StandardCharsets.UTF_8);

Comparator<CSVRecord> comparator = (x, y) -> {
    int a = Integer.parseInt(x.get("number"));
    int b = Integer.parseInt(y.get("number"));
    return Integer.compare(a, b);
};

Sorter
        .serializer(serializer)
        .comparator(comparator)
        .input(inputFile)
        .output(outputFile)
        .sort();
The result is:
name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the 2.7 GB file on the 11th column using big-sorter, and it took 8 minutes single-threaded on an i7 with an SSD and the max heap set to 512 MB (-Xmx512m).
See the project README for more details.
Java Lists can be sorted; you can try starting there.
Python on a big server.
import csv

def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']

with open('some_file.csv','rb') as source:
    rdr = csv.DictReader( source )
    data = [ row for row in rdr ]
    data.sort( key=sort_key )
    fields = rdr.fieldnames

with open('some_file_sorted.csv', 'wb') as target:
    wtr = csv.DictWriter( target, fields )
    # write the header row, then the sorted data
    wtr.writeheader()
    wtr.writerows( data )
This should be reasonably quick. And it's very flexible.
On a small machine, break this into three passes: decorate, sort, undecorate
Decorate:
import csv

def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']

with open('some_file.csv','rb') as source:
    rdr = csv.DictReader( source )
    with open('temp.txt','w') as target:
        for row in rdr:
            # prefix each original record with its "|"-separated sort key
            target.write( "|".join( map(str, sort_key(row)) ) + "|" + ",".join( row[f] for f in rdr.fieldnames ) + "\n" )
Part 2 is the operating system sort using "|" as the field separator
Undecorate:
with open('sorted_temp.txt','r') as source:
    with open('sorted.csv','w') as target:
        for row in source:
            # strip the decoration up to and including the last "|"
            keys, _, data = row.rpartition('|')
            target.write( data )
You don't mention the platform, so it is hard to come to terms with the time specified. 12x10^6 records isn't that many, but sorting is a pretty intensive task. Say 37 fields at 100 bytes/field: that would be about 45 GB, which is a bit much for most machines. But if the records average 10 bytes/field, your server should be able to fit the entire file in RAM, which would be ideal.
My suggestion: break the file into chunks that are 1/2 the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect is causing the slow-down.
Say (1G chunks, in a directory you can play around in):
split --line-bytes=1000000000 original_file chunk
for each in chunk*
do
sort $each > $each.sorted
done
sort -m chunk*.sorted > original_file.sorted
As you have mentioned, your data set is huge. Sorting it all in one go will be time-consuming, depending on your machine (if you try quicksort). But since you would like it done within 30 minutes, I would suggest that you have a look at MapReduce using Apache Hadoop as your application server.
Please keep in mind it's not an easy approach, but in the longer run you can easily scale up depending on your data size. I am also pointing you to an excellent link on Hadoop setup: work your way through the single-node setup and move to a Hadoop cluster.
I would be glad to help if you get stuck anywhere.
You really do need to make sure you have the right tools for the job. (Today, I am hoping to get a 3.8 GHz PC with 24 GB memory for home use. It's been a while since I bought myself a new toy. ;)
However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data, because it's in 600 files already.
Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once). It's not as simple as doing them all at once, but you could probably do it on a mobile phone. ;)
Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?
waitlist=
for f in ${SOURCE}/*
do
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
waitlist="$waitlist $!"
done
wait $waitlist
LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}
This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.
Use Hadoop MapReduce to do the sorting. I recommend Spring Data Hadoop (Java).
Well, since you're talking about HUGE datasets, you'll need some external sorting algorithm anyhow. There are implementations for Java and pretty much any other language out there; since the result will have to be stored on disk anyhow, which language you use is pretty uninteresting.
