Cascading join two files very slow - java

I am using Cascading to do a HashJoin of two 300MB files, with the following workflow:
// select the field which I need from the first file
Fields f1 = new Fields("id_1");
docPipe1 = new Each( docPipe1, scrubArguments, new ScrubFunction( f1 ), Fields.RESULTS );
// select the fields which I need from the second file
Fields f2 = new Fields("id_2","category");
docPipe2 = new Each( docPipe2, scrubArguments, new ScrubFunction( f2), Fields.RESULTS );
// hashJoin
Pipe tokenPipe = new HashJoin( docPipe1, new Fields("id_1"),
docPipe2, new Fields("id_2"), new LeftJoin());
// count the number of each "category" based on the id_1 matching id_2
Pipe pipe = new Pipe(tokenPipe );
pipe = new GroupBy( pipe , new Fields("category"));
pipe = new Every( pipe, Fields.ALL, new Count(), Fields.ALL );
I am running this Cascading program on a Hadoop cluster with 3 datanodes, each with 8 GB of RAM and 4 cores (I set mapred.child.java.opts to 4096MB), but it takes about 30 minutes to get the final result. That seems too slow to me, yet I don't see a problem in my program or in the cluster. How can I make this Cascading join faster?

As given in the Cascading user guide:
HashJoin attempts to keep the entire right-hand stream in memory for rapid comparison (not just the current grouping, as no grouping is performed for a HashJoin). Thus a very large tuple stream in the right-hand stream may exceed a configurable spill-to-disk threshold, reducing performance and potentially causing a memory error. For this reason, it's advisable to use the smaller stream on the right-hand side.
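As a rough sketch of that advice (assuming that, after scrubbing, docPipe1 turns out to be the smaller of the two streams, which you would need to verify since both input files are ~300MB), you could swap the sides so the smaller stream is the one held in memory, switching to a RightJoin to keep the same left-join semantics:
// Sketch only: put the smaller stream on the right-hand (in-memory) side.
// cascading.pipe.joiner.RightJoin keeps all tuples from docPipe1, matching the original LeftJoin.
Pipe tokenPipe = new HashJoin( docPipe2, new Fields("id_2"),
docPipe1, new Fields("id_1"), new RightJoin());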
Alternatively, using CoGroup might help.

It may be that your Hadoop cluster is busy with, or dedicated to, other jobs, and hence the time taken. I don't think replacing HashJoin with CoGroup will help, because CoGroup is a reduce-side join while HashJoin does a map-side join, so HashJoin should be more performant than CoGroup. Your code also looks good, so I think you should try again on a less busy cluster.

Related

Iterate over large collection in mongo [duplicate]

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not), would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases, skip will become slower and more CPU intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined the two answers.
The problem is that when you use skip and limit without a sort, the pagination just follows the table's natural order, i.e. the sequence in which the data was written, so the engine first has to build a temporary index. It is better to use the ready-made _id index: sort by _id and it is very quick even with large collections, like this:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case, really) with range-based buckets on your sort key, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and you have sub-pages within that range if you still need skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire inventory to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no other way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with time-based value because you can't calculate where to jump based on time, so skipping is the only option in the latter case.
On the other hand, assigning a counting number to each document hurts write performance, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just write this down as a note to my future self. It is probably too much hassle to fix this issue with the current application I am dealing with, but next time, I'll build a better one if I were to encounter a similar situation.
If you have MongoDB's default ObjectId as the _id, use it instead. This is probably the most viable option for most projects anyway.
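A rough sketch of the counting-integer jump described above, using the synchronous MongoDB Java driver (the seq field name and the database/collection names are hypothetical, and this assumes seq is a dense, 1-based, indexed counter):
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class SeqPagination {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("myCollection");

            int page = 5432;   // page to jump to (1-based)
            int pageSize = 16;

            // Jump straight to the page via the index on "seq" instead of skip().
            // Assumes seq = 1, 2, 3, ... with no gaps (deletes would need $inc fix-ups).
            long firstSeqOnPage = (long) (page - 1) * pageSize + 1;
            for (Document doc : coll.find(Filters.gte("seq", firstSeqOnPage))
                                    .sort(Sorts.ascending("seq"))
                                    .limit(pageSize)) {
                System.out.println(doc.toJson());
            }
        }
    }
}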
As stated in the official MongoDB docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $lt: startValue } } )
    .sort( { _id: -1 } )
    .limit( nPerPage )
    .forEach( student => {
      print( student.name );
      endValue = student._id;
    } );
  return endValue;
}
Ascending order example here.
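For the ascending direction, here is a rough sketch of the same pattern using the synchronous MongoDB Java driver (the collection and field names mirror the shell example above; this is an illustration under those assumptions, not the linked example):
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.conversions.Bson;
import org.bson.types.ObjectId;

public class AscendingPager {
    // Returns the _id of the last printed student; pass it back in as startValue for the next page.
    static ObjectId printStudents(MongoCollection<Document> students, ObjectId startValue, int nPerPage) {
        ObjectId endValue = null;
        Bson filter = (startValue == null) ? new Document() : Filters.gt("_id", startValue);
        FindIterable<Document> page = students.find(filter)
                .sort(Sorts.ascending("_id"))
                .limit(nPerPage);
        for (Document student : page) {
            System.out.println(student.getString("name"));
            endValue = student.getObjectId("_id");
        }
        return endValue;
    }
}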
If you know the ID of the element from which you want to start:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a neat little solution which works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), where you query on the last id of the preceding page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(),"product");
int page =0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; //this is the last id of the precedent page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}

Spark Issue with Dataframe.collectAsList method when ran on Multi-node Spark Cluster

I am creating a list from a Spark DataFrame using the collectAsList method and reading the columns by iterating through the rows. The Spark Java job runs on a multi-node cluster, where the config is set to spawn multiple executors. Please suggest an alternative way to implement the functionality below in Java.
List<Row> list = df.collectAsList();
List<Row> responseList = new ArrayList<>();
for(Row r: list) {
String colVal1 = r.getAs(colName1);
String colVal2 = r.getAs(colName2);
String[] nestedValues = new String[allCols.length];
nestedValues[0]=colVal1 ;
nestedValues[1]=colVal2 ;
.
.
.
responseList.add(RowFactory.create(nestedValues));
}
Thanks
The benefit of Spark is that you can process a large amount of data using the memory and processors across multiple executors on multiple nodes. The problem you are having might be due to using collectAsList and then processing the data. collectAsList brings all the data into the "driver", which is a single JVM on a single node. It should be used as a final step, to gather results after you have processed the data. If you are trying to bring a very large amount of data into the driver and then process it, it could be very slow or fail, and you are not actually using Spark to process your data at that point. Instead of using collectAsList, use the methods available on the DataFrame to process the data, such as map().
Definitions of driver and executor https://spark.apache.org/docs/latest/cluster-overview.html
In the Java API a DataFrame is a Dataset<Row>. Here's the Dataset documentation; use the methods there to process your data: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html
I faced the same issue. I used the toLocalIterator() method, which holds less data at once and returns an Iterator. collectAsList() returns all of the data as a single List, whereas the iterator fetches the data to the driver incrementally as you read it.
The code looks like:
Iterator<Row> itr = df.toLocalIterator();
while (itr.hasNext()) {
    Row row = itr.next();
    // do something with row
}
@Tony is on the spot. Here are a few more points:
Big data needs to be scalable, so that with more processing power more data can be processed in the same time. This is achieved with parallel processing using multiple executors.
Spark is also resilient: if some executors die, it can recover easily.
Using collect() makes your processing heavily dependent on a single process/node, the driver. It won't scale and is more prone to failure.
Essentially you can use all Spark APIs in production-grade code except a few such as collect, collectAsList, and show. Those are fine for testing and for small amounts of data; you shouldn't use them for large amounts of data.
In your case you can simply do something like:
Dataset<Row> df = // create your dataframe from a source like file, table etc.
df.select("column1","column2",array("column1","column2").as("column3")).save(.... file name ..)
You can use the tons of column-based functions available at https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html. These functions cover pretty much everything you need; if not, you can always read the data into Scala/Java/Python and operate on it using that language's syntax.
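For example, here is a minimal sketch of that select-based approach (the input/output paths, Parquet format, and column names are placeholders, not from the original question):
import static org.apache.spark.sql.functions.array;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SelectWithFunctions {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("select-example").getOrCreate();

        // Read the source data; path and format are placeholders.
        Dataset<Row> df = spark.read().parquet("/path/to/input");

        // Build the derived column with built-in functions instead of collecting to the driver;
        // this work is distributed across the executors.
        Dataset<Row> result = df.select(
                col("column1"),
                col("column2"),
                array(col("column1"), col("column2")).as("column3"));

        result.write().mode("overwrite").parquet("/path/to/output");
        spark.stop();
    }
}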
Creating a new answer to address the new specifics in your question.
Call .map on the dataframe, and put logic in a lambda to convert one row to a new row.
// Do your data manipulation in a call to `.map`,
// which will return another DataFrame.
DataFrame df2 = df.map(
// This work will be spread out across all your nodes,
// which is the real power of Spark.
r -> {
// I'm assuming the code you put in the question works,
// and just copying it here.
// Note the type parameter of <String> with .getAs
String colVal1 = r.<String>getAs(colName1);
String colVal2 = r.<String>getAs(colName2);
String[] nestedValues = new String[allCols.length];
nestedValues[0]=colVal1;
nestedValues[1]=colVal2;
.
.
.
// Return a single Row
return RowFactory.create(nestedValues);
}
);
// When you are done, get local results as Rows.
List<Row> localResultRows = df2.collectAsList();
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html

QueryRecord vs. PartitionRecord for better performance?

In a NiFi dataflow if I want to split a single flowfile into two sets based on the value of a particular field, is it faster, in terms of performance, to use QueryRecord or PartitionRecord in the following manners?
QueryRecord:
SELECT * FROM FLOWFILE WHERE WEIGHT < 1000;
PartitionRecord
1. In UpdateRecord, in RecordPath mode, populate a new "string" field greater_or_less with the value of /weight
2. In UpdateRecord, in Literal Value mode, update greater_or_less to ${field.value:toNumber():lt(1000)}
3. In PartitionRecord, partition the flowfile on greater_or_less
In the PartitionRecord method, I will have two schemas, with one being the original data format, and the other having the greater_or_less field in addition to the original data format. We'll begin step 1 in the original schema, output from step 1 in the new schema, and then output step 3 in the original schema. The output of step 3 should be two flowfiles, one being equivalent to the output of the QueryRecord method.
In summation, although QueryRecord is a bit simpler to implement, I don't have any knowledge of the back-end machinations of NiFi, or how the overheads of these processors compare, so I am not sure which method is optimal. My instincts tell me that QueryRecord is expensive, but I am not sure how it compares to the type-switching and record-reading-and-writing of the PartitionRecord method.
I don't know which is faster off the top of my head, but both run on Apache Calcite under the covers which is very quick.
Have you considered using GenerateFlowfile to produce test data and try it out?
I would expect that PartitionRecord would be best, but use a filter with a predicate instead of generating a new field in your schema with UpdateRecord.
Both use a Record Reader and Writer for record-level processing, so there is no difference between the two in the record reading/writing part of the implementation.
The difference is that PartitionRecord accesses records natively, which is faster for record-level processing, whereas QueryRecord has the extra overhead of running SQL, for which it has to structure its records and metadata according to Calcite's specifications.
Some quick five-minute stats: I was able to process 47 GB of data with a task time of 1:18:00 on QueryRecord versus 0:47:00 on PartitionRecord, with the same number of threads.

Using multiple Leaves in Lucene Classifiers

I am trying to use the KNearestNeighbor document classifier in Lucene. The classifier accepts a LeafReader in its constructor, for training.
The problem is that the index I am using to train the classifier has multiple leaves, but the constructor only accepts one leaf, and I could not find a way to add the remaining LeafReaders to the class. I might be missing something. Could anyone please help me out with this?
Here is the code I am currently using:
FSDirectory index = FSDirectory.open(Paths.get(indexLoc));
IndexReader reader = DirectoryReader.open(index);
List<LeafReaderContext> leaves = reader.leaves();
LeafReaderContext leaf = leaves.get(0);
LeafReader atomicReader = leaf.reader();
KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(atomicReader, BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");
Leaves represent each segment of your index. In terms of performance and resource usage, you should iterate over the leaves, run the classification for each segment, and accumulate your results.
for (LeafReaderContext context : indexReader.getContext().leaves()) {
LeafReader reader = context.reader();
// run for each leaf
}
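For instance, here is a rough sketch of that accumulation, reusing the BM25 similarity and field2analyzer map from your code and assuming a docToClassify Document you want to classify; it simply keeps the best-scoring ClassificationResult across segments, which is one simple way to combine them:
// Sketch only: one classifier per segment, keeping the best-scoring result.
ClassificationResult<BytesRef> best = null;
for (LeafReaderContext context : indexReader.getContext().leaves()) {
    KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(
            context.reader(), BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");
    ClassificationResult<BytesRef> result = knn.assignClass(docToClassify);
    if (best == null || result.getScore() > best.getScore()) {
        best = result;
    }
}
// best.getAssignedClass() then holds the combined prediction across leaves.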
If that is not possible, you can use the SlowCompositeReaderWrapper which, as the name suggests, might be very slow as it aggregates all the leaves on the fly.
LeafReader singleLeaf = SlowCompositeReaderWrapper.wrap(indexReader);
// run classifier on singleLeaf
Depending on your Lucene version, this sits in lucene-core or lucene-misc (since Lucene 6.0, I think). Also, this class is deprecated and scheduled for removal in Lucene 7.0.
The third option might be to run forceMerge(1) so that you only have one segment and can use its single leaf directly. However, forcing a merge down to a single segment has other issues and might not work for your use case. If your data is write-once and then only used for reading, you could use a forceMerge. If you have regular updates, you'll have to end up using the first option and aggregate the classification results yourself.

R H2O - Memory management

I'm trying to use H2O via R to build multiple models using subsets of one large-ish data set (~10GB). The data is one year's worth and I'm trying to build 51 models (i.e. train on week 1, predict on week 2, etc.), with each week being about 1.5-2.5 million rows with 8 variables.
I've done this inside of a loop which I know is not always the best way in R. One other issue I found was that the H2O entity would accumulate prior objects, so I created a function to remove all of them except the main data set.
h2o.clean <- function(clust = localH2O, verbose = TRUE, vte = c()){
  # Find all objects on server
  keysToKill <- h2o.ls(clust)$Key
  # Remove items to be excluded, if any
  keysToKill <- setdiff(keysToKill, vte)
  # Loop thru and remove items to be removed
  for(i in keysToKill){
    h2o.rm(object = clust, keys = i)
    if(verbose == TRUE){
      print(i); flush.console()
    }
  }
  # Print remaining objects in cluster.
  h2o.ls(clust)
}
The script runs fine for a while and then crashes - often with a complaint about running out of memory and swapping to disk.
Here's some pseudo code to describe the process
# load h2o library
library(h2o)
# create h2o entity
localH2O = h2o.init(nthreads = 4, max_mem_size = "6g")
# load data
dat1.hex = h2o.importFile(localH2O, inFile, key = "dat1.hex")
# Start loop
for(i in 1:51){
# create test/train hex objects
train1.hex <- dat1.hex[dat1.hex$week_num == i,]
test1.hex <- dat1.hex[dat1.hex$week_num == i + 1,]
# train gbm
dat1.gbm <- h2o.gbm(y = 'click_target2', x = xVars, data = train1.hex
, nfolds = 3
, importance = T
, distribution = 'bernoulli'
, n.trees = 100
, interaction.depth = 10
, shrinkage = 0.01
)
# calculate out of sample performance
test2.hex <- cbind.H2OParsedData(test1.hex,h2o.predict(dat1.gbm, test1.hex))
colnames(test2.hex) <- names(head(test2.hex))
gbmAuc <- h2o.performance(test2.hex$X1, test2.hex$click_target2)#model$auc
# clean h2o entity
h2o.clean(clust = localH2O, verbose = F, vte = c('dat1.hex'))
} # end loop
My question is: what, if any, is the correct way to manage data and memory in a stand-alone instance (this is NOT running on Hadoop or a cluster, just a large EC2 instance with ~64 GB RAM and 12 CPUs) for this type of process? Should I be killing and recreating the H2O instance after each loop iteration (this was the original process, but reading the data from file every time adds ~10 minutes per iteration)? Is there a proper way to garbage collect or release memory after each loop?
Any suggestions would be appreciated.
This answer is for the original H2O project (releases 2.x.y.z).
In the original H2O project, the H2O R package creates lots of temporary H2O objects in the H2O cluster DKV (Distributed Key/Value store) with a "Last.value" prefix.
These are visible both in the Store View from the Web UI and by calling h2o.ls() from R.
What I recommend doing is:
at the bottom of each loop iteration, use h2o.assign() to do a deep copy of anything you want to save to a known key name
use h2o.rm() to remove anything you don't want to keep, in particular the "Last.value" temps
call gc() explicitly in R somewhere in the loop
Here is a function which removes the Last.value temp objects for you. Pass in the H2O connection object as the argument:
removeLastValues <- function(conn) {
  df <- h2o.ls(conn)
  keys_to_remove <- grep("^Last\\.value\\.", perl=TRUE, x=df$Key, value=TRUE)
  unique_keys_to_remove = unique(keys_to_remove)
  if (length(unique_keys_to_remove) > 0) {
    h2o.rm(conn, unique_keys_to_remove)
  }
}
Here is a link to an R test in the H2O github repository that uses this technique and can run indefinitely without running out of memory:
https://github.com/h2oai/h2o/blob/master/R/tests/testdir_misc/runit_looping_slice_quantile.R
New suggestion as of 12/15/2015: update to latest stable (Tibshirani 3.6.0.8 or later).
We've completely reworked how R & H2O handle internal temp variables, and the memory management is much smoother.
Next: H2O temps can be kept "alive" by dead R variables, so run an R gc() every loop iteration. Once R's GC removes the dead variables, H2O will reclaim that memory.
After that, your cluster should only hold on to specifically named things, like loaded datasets, and models. These you'll need to delete roughly as fast as you make them, to avoid accumulating large data in the K/V store.
Please let us know if you have any more problems by posting to the h2ostream Google group:
https://groups.google.com/forum/#!forum/h2ostream
Cliff
The most current answer to this question is that you should probably just use the h2o.grid() function rather than writing a loop.
With the new H2O version (currently 3.24.0.3), the following pattern is recommended:
my for loop {
# perform loop
rm(R object that isn’t needed anymore)
rm(R object of h2o thing that isn’t needed anymore)
# trigger removal of h2o back-end objects that got rm’d above, since the rm can be lazy.
gc()
# optional extra one to be paranoid. this is usually very fast.
gc()
# optionally sanity check that you see only what you expect to see here, and not more.
h2o.ls()
# tell back-end cluster nodes to do three back-to-back JVM full GCs.
h2o:::.h2o.garbageCollect()
h2o:::.h2o.garbageCollect()
h2o:::.h2o.garbageCollect()
}
Here is the source: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html
