Spark MLlib: PCA on 9570 columns takes too long - java

1) I am doing a PCA on 9570 columns, giving it 12288 MB of RAM in local mode (which means driver only), and it takes from 1.5 to 2 hours. This is the code (very simple):
System.out.println("level1\n");
VectorAssembler assemblerexp = new VectorAssembler()
        .setInputCols(metincols)
        .setOutputCol("intensity");
expoutput = assemblerexp.transform(expavgpeaks);
System.out.println("level2\n");
PCAModel pcaexp = new PCA()
        .setInputCol("intensity")
        .setOutputCol("pcaFeatures")
        .setK(2)
        .fit(expoutput);
System.out.println("level3\n");
So the time up to the point where level3 is printed (i.e. the PCA fit) is what takes so long (1.5 to 2 hours). Is it normal that it takes so long? I have tried different numbers of partitions (2, 4, 6, 8, 50, 500, 10000); for some of them it also takes almost 2 hours, while for others I get a Java heap space error. Also some pictures from the Spark User Interface:
[Screenshots from the Spark UI: Executors, Jobs, Stages, Environment]
2) Is it also normal that I get different results with the PCA every time?

If you are setting the RAM programmatically, it does not take effect, because the driver JVM has already started by then; the proper way is to provide it as a JVM/launch argument.
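For example, when launching with spark-submit (a sketch; the class and jar names here are placeholders):

spark-submit --driver-memory 12g --class com.example.PcaJob pca-job.jar

In local mode everything runs inside the driver JVM, whose heap size is fixed at startup, so a SparkConf setting made inside main() arrives too late; when running from an IDE, pass the equivalent -Xmx JVM option instead.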

Related

java for loop performance difference

I am running the simple program below. I know this is not the best way to measure performance, but the results are surprising to me, hence I wanted to post the question here.
public class findFirstTest {
    public static void main(String[] args) {
        for (int q = 0; q < 10; q++) {
            long start2 = System.currentTimeMillis();
            int k = 0;
            for (int j = 0; j < 5000000; j++) {
                if (j > 4500000) {
                    k = j;
                    break;
                }
            }
            System.out.println("for value " + k + " with time " + (System.currentTimeMillis() - start2));
        }
    }
}
The results after running the code multiple times look like this:
for value 4500001 with time 3
for value 4500001 with time 25 (surprised, as it took 25 ms in the 2nd iteration)
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
So I am not understanding why the 2nd iteration took 25 ms while the 1st took 3 ms and later ones 0 ms, and also why it is always the 2nd iteration whenever I run the code.
If I move the start and end time printing outside of the outer for loop, then the result I get is:
for value 4500001 with time 10
In the first iteration, the code runs interpreted.
In the second iteration, the JIT kicks in, slowing it down a bit while it compiles the loop to native code.
In the remaining iterations, the native code runs very fast.
Because your winamp needed to decode another few frames of your mp3 to queue it into the sound output buffers. Or because the phase of the moon changed a bit and your dynamic background needed changing, or because someone in east Croydon farted and your computer is subscribed to the 'smells from London' twitter feed. Who knows?
This isn't how you performance test. Your CPU is not such a simple machine after all; it has many cores, and each core has pipelines and multiple hierarchies of caches. Any given core can only interact with one of its caches, and because of this, if a core runs an instruction that operates on memory which is not currently in cache, the core will stall for a while: it sends the memory controller a request to load the page of memory you need to access into a given cache page, and then waits until it is there; this can take many, many cycles.
On the other end you have an OS that is juggling hundreds of thousands of processes and threads, many of them internal to the kernel, pre-empting like there is no tomorrow, and trying to give extra precedence to processes that are time-sensitive, such as the aforementioned winamp, which must get a chance to decode some more mp3 frames before the sound buffer is fully exhausted, or you'd notice skipping. This is non-trivial: on ye olde Windows you just couldn't get this done, which is why ye olde winamp was a magical marvel of engineering, more or less hacking into Windows to ensure it got the priority it needed. Those days are long gone, but if you remember them, well, draw the conclusion that this isn't trivial, and thus, OSes do pre-empt with prejudice all the time these days.
A third significant factor is the JVM itself, which is doing all sorts of borderline voodoo magic: it has both a hotspot engine (which does bookkeeping on your code so that it can eventually conclude it is worth spending considerable CPU resources to analyse the heck out of a method and rewrite it in optimized machine code, because that method seems to be taking a lot of CPU time) and a garbage collector.
The solution is to forget entirely about trying to measure time with such mere banalities as currentTimeMillis or nanoTime around a few hand-written loops. It's just way too complicated for that to actually work.
No. Use JMH.
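For instance, the loop above could be measured with a JMH benchmark along these lines (a sketch; the class and method names are illustrative). JMH runs warmup iterations first, so the interpreted and JIT-compiling phases are excluded from the measurement, and returning the result keeps the JIT from eliminating the loop as dead code:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class FindFirstBenchmark {
    @Benchmark
    public int findFirst() {
        // Same loop as in the question; returning k prevents dead-code elimination.
        int k = 0;
        for (int j = 0; j < 5000000; j++) {
            if (j > 4500000) {
                k = j;
                break;
            }
        }
        return k;
    }
}

Built with the jmh-core and jmh-generator-annprocess artifacts (e.g. via the JMH Maven archetype), this reports a statistically meaningful average rather than a single wall-clock sample.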

Is there a faster way (or more efficient) to stream triples/quads than RDFDataMgr.parse(..) or RDFParser..?

I am looking to find more efficient ways to stream RDF quads/triples from large datasets. I have been using both RDFDataMgr.parse(rdfStream, datasetLocation); and RDFParser.source(datasetLocation).parse(rdfStream); however, both solutions suffer as the dataset grows. I ran an experiment to see how much time the RDFDataMgr takes to parse datasets of different sizes. I considered 300000, 500000, 1000000, 5000000, 7500000, and 10000000 triples as a basis, and then 182011590 (a real dataset). The processing time is linear in the dataset size, but it does take a lot of time. The average time for each dataset size is as follows:
300000 - 4421.1ms
500000 - 7184.6ms
1000000 - 14431.6ms
5000000 - 71926.3ms
7500000 - 116375.7ms
10000000 - 148588.8ms
On average, the parser is parsing 68265.38333 triples per second (based on the average of benchmarking runs over all test datasets). That is actually quite time-consuming, considering that 300k triples take ~4 s to be parsed.
The PipedRDFIterator and PipedRDFStream were initialised as follows:
int buffer = (int) (size * 0.10);
PipedRDFIterator<Triple> iterator = new PipedRDFIterator<>(buffer, true, 50, 10000);
PipedRDFStream<Triple> rdfStream = new PipedTriplesStream(iterator);
ExecutorService executor = Executors.newSingleThreadExecutor();
As expected, the larger dataset (182011590 triples) took much more time: 1450001 ms (around 24 minutes). Interestingly enough, on average the parser was parsing 125525 triples per second.
I do understand that parsing is not just reading from a file and dumping it, but includes syntax checking etc.; however, is there any way, or any Jena parser, which could speed up the parsing time?
I hope this makes sense to you.
Thanks!
Edit: I ran these benchmarks with the following VM parameters: -Xmx10g -Xms8g, on a 3.1 GHz Intel Core i5 with 16 GB 2133 MHz RAM and an SSD drive.
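One avenue worth testing (a sketch, not from the original post) is to drop the PipedRDFIterator/PipedRDFStream handoff and parse directly into a StreamRDF callback, which avoids the inter-thread buffer entirely; assuming a triples file at datasetLocation:

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDFBase;

class CountingSink extends StreamRDFBase {
    long count = 0;

    @Override
    public void triple(Triple triple) {
        count++; // handle each triple on the parser thread, no queueing
    }
}

CountingSink sink = new CountingSink();
// RDFParser pushes triples straight into the sink, no piping thread needed
RDFParser.source(datasetLocation).parse(sink);
System.out.println("Parsed " + sink.count + " triples");

Whether this helps depends on how expensive the per-triple work is; the syntax checking itself still costs the same.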

Is there intermediate computation optimization when using functions.window [Spark]

I am using functions.window to create sliding window computation using Spark and Java. Example code:
Column slidingWindow = functions.window(singleIPPerRow.col("timestamp"), "3 hours", "1 seconds");
Dataset<Row> aggregatedResultsForWindow = singleIPPerRow.groupBy(slidingWindow, singleIPPerRow.col("area")).count();
The data looks like this:
+----------+-------+------+
| timestamp| area|events|
+----------+-------+------+
|1514452990|domain1| 41|
|1514452991|domain1| 42|
|1514452991|domain1| 50|
|1514452993|domain2| 53|
|1514452994|domain2| 54|
|1514452994|domain3| 54|
|1514452993|domain1| 35|
+----------+-------+------+
In real life there are a lot of events per timestamp; also note the large ratio between the window size and the step.
My question is: how many times will the counts be calculated? Every timestamp-area count is used by window/step different rows in the result. Will Spark save the intermediate results, so that every count for a pair is calculated once, or will it calculate the result window/step times for every pair?
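For what it's worth (using the variable names from the snippet above), the physical plan answers this: for a sliding window Spark plans an Expand step that copies each input row into every window it overlaps, roughly window/step copies, and a single aggregation then counts per (window, area); the per-pair counts are not shared between overlapping windows. Printing the plan makes the Expand node visible:

// Prints the parsed, analyzed, optimized and physical plans
aggregatedResultsForWindow.explain(true);

With a 3-hour window and a 1-second step, that is on the order of 10800 copies per input row, which is why such a large window/step ratio is expensive.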

HBase - Java Client Limit Scan Results Explanation?

Is there a way to limit the total number of rows of sequential data when performing a scan of the data?
Notes:
I'm working with 500,000 total rows
I've tried both setMaxResultSize and setMaxResultsPerColumnFamily. This proves to be ineffectual (there does seem to be some behavior when both are set to low numbers, or when setMaxResultSize is larger; what is the relationship between these two functions?)
I've worked with setting a PageFilter (size 10), and the behavior displays 5 different sequence data sets of 10.
I actually got it pseudo-working while typing this out by setting the PageFilter size and the setMaxResultSize equal. When I change either, it conforms to the PageFilter. It will also jump to another large subset of PageFilter size if I make setMaxResultSize significantly larger.
HBase version is 1.1.1
Can someone better explain what is happening here and how to get the results I want?
You can use either the HBase shell or the HBase Java client.
1. HBase shell: use this command, pipe the results to a file, and do wc -l on it:
count 'table_name', 1
2. Java HBase client API:
long count = 0;
// Each Result returned by the scanner corresponds to one row,
// so counting Results counts rows directly.
for (Result res : scanner) {
    count++;
}
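For actually limiting (rather than counting) the rows, note that PageFilter is evaluated independently on each region server, which would explain seeing several separate batches of 10 with 500,000 rows spread over multiple regions: each region returns its own page. A common pattern (a sketch; the Table instance is assumed to be obtained from a Connection elsewhere) is to combine PageFilter with a client-side cap:

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

long limit = 10;
Scan scan = new Scan();
scan.setFilter(new PageFilter(limit)); // caps rows per region server, not globally
scan.setCaching((int) limit);          // don't fetch more rows per RPC than needed

// table: org.apache.hadoop.hbase.client.Table, created elsewhere
try (ResultScanner scanner = table.getScanner(scan)) {
    long rows = 0;
    for (Result res : scanner) {
        if (++rows > limit) {
            break; // enforce the global limit on the client side
        }
        // process res
    }
}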

R H2O - Memory management

I'm trying to use H2O via R to build multiple models using subsets of one largish data set (~10 GB). The data is one year's worth, and I'm trying to build 51 models (i.e. train on week 1, predict on week 2, etc.), with each week being about 1.5-2.5 million rows with 8 variables.
I've done this inside of a loop which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I created a function to remove all of them except the main data set.
h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()) {
  # Find all objects on server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop through and remove the remaining items
  for (i in keysToKill) {
    h2o.rm(object = clust, keys = i)
    if (verbose == TRUE) {
      print(i); flush.console()
    }
  }
  # Print remaining objects in cluster
  h2o.ls(clust)
}
The script runs fine for a while and then crashes - often with a complaint about running out of memory and swapping to disk.
Here's some pseudo code to describe the process:
# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")
# Start loop
for (i in 1:51) {
  # create test/train hex objects
  train1.hex <- dat1.hex[dat1.hex$week_num == i, ]
  test1.hex <- dat1.hex[dat1.hex$week_num == i + 1, ]
  # train gbm
  dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex,
                      nfolds = 3,
                      importance = T,
                      distribution = 'bernoulli',
                      n.trees = 100,
                      interaction.depth = 10,
                      shrinkage = 0.01)
  # calculate out-of-sample performance
  test2.hex <- cbind.H2OParsedData(test1.hex, h2o.predict(dat1.gbm, test1.hex))
  colnames(test2.hex) <- names(head(test2.hex))
  gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2) # model$auc
  # clean h2o entity
  h2o.clean(clust = localH2O, verbose = F, vte = c('dat1.hex'))
} # end loop
My question is: what, if any, is the correct way to manage data and memory in a standalone entity (this is NOT running on Hadoop or a cluster, just a large EC2 instance with ~64 GB RAM and 12 CPUs) for this type of process? Should I be killing and recreating the H2O entity after each loop iteration (this was the original process, but reading the data from file every time adds ~10 minutes per iteration)? Is there a proper way to garbage collect or release memory after each loop?
Any suggestions would be appreciated.
This answer is for the original H2O project (releases 2.x.y.z).
In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.
These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.
What I recommend doing is:
at the bottom of each loop iteration, use h2o.assign() to do a deep copy of anything you want to save to a known key name
use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps
call gc() explicitly in R somewhere in the loop
Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:
removeLastValues <- function(conn) {
  df <- h2o.ls(conn)
  keys_to_remove <- grep("^Last\\.value\\.", perl = TRUE, x = df$Key, value = TRUE)
  unique_keys_to_remove <- unique(keys_to_remove)
  if (length(unique_keys_to_remove) > 0) {
    h2o.rm(conn, unique_keys_to_remove)
  }
}
Here is a link to an R test in the H2O github repository that uses this technique and can run indefinitely without running out of memory:
https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R
New suggestion as of 12/15/2015: update to the latest stable release (Tibshirani 3.6.0.8 or later).
We've completely reworked how R & H2O handle internal temp variables, and the memory management is much smoother.
Next: H2O temps can be held "alive" by R dead variables... so run an R gc() every loop iteration. Once R's GC removes the dead variables, H2O will reclaim that memory.
After that, your cluster should only hold on to specifically named things, like loaded datasets, and models. These you'll need to delete roughly as fast as you make them, to avoid accumulating large data in the K/V store.
Please let us know if you have any more problems by posting to the google group h2o stream:
https://groups.google.com/forum/#!forum/h2ostream
Cliff
The most current answer to this question is that you should probably just use the h2o.grid() function rather than writing a loop.
With the new H2O version (currently 3.24.0.3), they suggest the following recommendations:
my for loop {
  # perform loop
  rm(R object that isn't needed anymore)
  rm(R object of h2o thing that isn't needed anymore)
  # trigger removal of h2o back-end objects that got rm'd above, since the rm can be lazy
  gc()
  # optional extra one to be paranoid; this is usually very fast
  gc()
  # optionally sanity-check that you see only what you expect to see here, and not more
  h2o.ls()
  # tell back-end cluster nodes to do three back-to-back JVM full GCs
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
}
Here is the source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html
