Is there intermediate computation optimization when using functions.window [Spark] - java

I am using functions.window to create sliding window computation using Spark and Java. Example code:
Column slidingWindow = functions.window(singleIPPerRow.col("timestamp"), "3 hours", "1 seconds");
Dataset<Row> aggregatedResultsForWindow = singleIPPerRow.groupBy(slidingWindow, singleIPPerRow.col("area")).count();
The data looks like this:
+----------+-------+------+
| timestamp| area|events|
+----------+-------+------+
|1514452990|domain1| 41|
|1514452991|domain1| 42|
|1514452991|domain1| 50|
|1514452993|domain2| 53|
|1514452994|domain2| 54|
|1514452994|domain3| 54|
|1514452993|domain1| 35|
+----------+-------+------+
In real life there are a lot of events per timestamp; also note the large ratio between the window size and the step.
My question is: how many times will the counts be calculated? Every (timestamp, area) count is used by window/step different rows in the result. Will Spark save the intermediate results, so that each count for a pair is calculated only once, or will it recalculate the result window/step times for every pair?
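One way to see how Spark plans to evaluate the window expression, before worrying about what gets reused, is to print the query plan; a minimal sketch, reusing the DataFrame name from the snippet above:
// Prints the parsed, analyzed, optimized and physical plans; the physical plan
// shows how window(...) is expanded and where the partial aggregation happens.
aggregatedResultsForWindow.explain(true);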


Iterate over large collection in mongo [duplicate]

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make this faster?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date:true});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date:true});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself: how often do you really need the 40,000th page? Also see this article.
I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still reports a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, the pagination just follows the order of the table, i.e. the sequence in which the data was written, so the engine first needs to build a temporary index. It is better to use the ready-made _id index: sort by _id and it is very fast even with large tables, for example:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => array('_id' => 1),
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach: combine skip/limit (as an edge case, really) with sort-range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and you have sub-pages within that range if you need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index, this avoids the cursor having to traverse the entire collection to reach the final page.
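A rough sketch of that bucket idea with the MongoDB Java driver; the created_date field, the bucket bounds, and the database/collection names are assumptions for illustration, not part of the original answer:
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.time.Instant;
import java.util.Date;

public class BucketPaging {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("myCollection");

            // A "page" is a time bucket, e.g. one day; within the bucket a small
            // skip/limit stays cheap because the range filter already uses the index.
            Date bucketStart = Date.from(Instant.parse("2018-01-01T00:00:00Z"));
            Date bucketEnd = Date.from(Instant.parse("2018-01-02T00:00:00Z"));

            for (Document doc : coll.find(Filters.and(
                            Filters.gte("created_date", bucketStart),
                            Filters.lt("created_date", bucketEnd)))
                    .sort(Sorts.ascending("created_date"))
                    .skip(0)       // sub-page inside the bucket, usually small
                    .limit(100)) {
                System.out.println(doc.toJson());
            }
        }
    }
}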
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, assigning a counting number to each document hurts write performance, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but still possible, because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I am currently dealing with, but next time I'll build a better one if I encounter a similar situation.
If you have MongoDB's default _id, which is an ObjectId, use that instead. This is probably the most viable option for most projects anyway.
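A minimal sketch of that counting-integer jump with the Java driver, assuming each document carries an indexed seq field assigned at insert time (the field name, class, and method are made up for illustration):
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class SeqPaging {
    // Page N is simply the documents whose seq falls in [N * pageSize, (N + 1) * pageSize):
    // an index on seq turns the jump into a range scan, with no skip() involved.
    static FindIterable<Document> page(MongoCollection<Document> coll, long page, long pageSize) {
        return coll.find(Filters.and(
                        Filters.gte("seq", page * pageSize),
                        Filters.lt("seq", (page + 1) * pageSize)))
                .sort(Sorts.ascending("seq"));
    }
}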
As stated from the official mongo docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $lt: startValue } } )
        .sort( { _id: -1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}
Ascending order example here.
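Since the ascending example is only linked, here is the same pattern in the opposite direction, sketched with the MongoDB Java driver rather than the shell; the collection handle and the name field are assumptions:
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.types.ObjectId;

public class AscendingPaging {
    // Returns the _id of the last document on the page, to be passed as
    // startValue for the next call; mirrors the descending shell example above.
    static ObjectId printStudents(MongoCollection<Document> students,
                                  ObjectId startValue, int nPerPage) {
        ObjectId endValue = null;
        for (Document student : students.find(Filters.gt("_id", startValue))
                .sort(Sorts.ascending("_id"))
                .limit(nPerPage)) {
            System.out.println(student.getString("name"));
            endValue = student.getObjectId("_id");
        }
        return endValue;
    }
}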
If you know the ID of the element from which you want to limit.
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a genius little solution that works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), where you query on the last _id of the previous page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the previous page

for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
            Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
            Aggregation.sort(Sort.Direction.ASC, "_id"),
            new CustomAggregationOperation(queryOffersByProduct),
            Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
            mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}

Spark MLlib: PCA on 9570 columns takes too long

1) I am doing a PCA on 9570 columns, giving it 12288 MB of RAM in local mode (which means driver only), and it takes from 1.5 up to 2 hours. This is the code (very simple):
System.out.println("level1\n");
VectorAssembler assemblerexp = new VectorAssembler()
.setInputCols(metincols)
.setOutputCol("intensity");
expoutput = assemblerexp.transform(expavgpeaks);
System.out.println("level2\n");
PCAModel pcaexp = new PCA()
.setInputCol("intensity")
.setOutputCol("pcaFeatures")
.setK(2)
.fit(expoutput);
System.out.println("level3\n");
So the time it takes to print level3 is what takes long (1.5 to 2 hours). Is it normal that it takes so long? I have tried different numbers of partitions (2, 4, 6, 8, 50, 500, 10000); for some of them it also takes almost 2 hours, while for others I get a Java heap space error. Some screenshots from the Spark user interface:
[Screenshots: Executors, Jobs, Stages, Environment]
2) Is it also normal that I get different results with the PCA every time?
If you are setting the RAM programmatically, it does not take effect; the proper way is to provide it as a JVM argument when the driver is launched.
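For example, with spark-submit the driver heap can be set on the command line; the class and jar names below are placeholders:
spark-submit --class com.example.PcaJob --master "local[*]" --driver-memory 12g pca-job.jar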

Anylogic moving average of processing times

In my model I have 9 different service blocks and each service can produce 9 different features. Each combination has a different delay time and standard deviation. For example, feature 3 needs 5 minutes in service block 8 with a deviation of 0.05, but only needs 3 minutes with a deviation of 0.1 in service block 4.
How can I permanently track the last 5 processing times of each combination and calculate their average (like a moving average)? I want to use the average to let the products decide which service block to choose for a given feature, based on the shortest time when comparing the past times of all machines for that feature. The product agents already have a parameter for the time they enter the service, and one that calculates the processing time by subtracting the entering time from the time they leave the service block.
Thank you for your support!
I am not sure if I understand what you are asking, but this may be an answer:
To track the last 5 needed times you can use a dataset from the analysis palette, limiting the number of samples to 5.
You will update the dataset using dataset.add(yourTimeVariable), so you can leave the vertical axis value of the dataset empty.
I assume you would need one dataset per feature.
Then you can calculate your moving average with:
dataset.getYMean();
If you need 81 datasets, you can create a collection as an ArrayList with element type DataSet.
Then, in the Main agent's properties, under "On startup", you can add the following code and it will have the same effect:
for (int i = 0; i < 81; i++) {
    collection.add(new DataSet(5, new DataUpdater_xjal() {
        double _lastUpdateX = Double.NaN;

        @Override
        public void update(DataSet _d) {
            if (time() == _lastUpdateX) { return; }
            _d.add(time(), 0);
            _lastUpdateX = time();
        }

        @Override
        public double getDataXValue() {
            return time();
        }
    }));
}
You will only need to remember which index corresponds to which service block and feature, and then you can just do
collection.get(4).getYMean();
and to add a new value to the dataset:
collection.get(2).add(yourTimeVariable);
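One way to keep that bookkeeping straight is a small helper function; a sketch assuming 9 features per service block, both numbered from 1, and datasets added to the collection in that order (the function name is made up):
// Maps a (serviceBlock, feature) pair to the position of its dataset in the collection,
// assuming the datasets were added in service-block-major order by the startup loop above.
int datasetIndex(int serviceBlock, int feature) {
    return (serviceBlock - 1) * 9 + (feature - 1);
}

// Example usage when a product leaves service block 4 with feature 3:
// collection.get(datasetIndex(4, 3)).add(yourTimeVariable);
// double movingAverage = collection.get(datasetIndex(4, 3)).getYMean();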

What determines the number of reducers and how to avoid bottlenecks regarding reducers?

Suppose I have a big tsv file with this kind of information:
2012-09-22 00:00:01.0 249342258346881024 47268866 0 0 0 bo
2012-09-22 00:00:02.0 249342260934746115 1344951 0 0 4 ot
2012-09-22 00:00:02.0 249342261098336257 346095334 1 0 0 ot
2012-09-22 00:05:02.0 249342261500977152 254785340 0 1 0 ot
I want to implement a MapReduce job that enumerates time intervals of five minutes and filter some information of the tsv inputs. The output file would look like this:
0 47268866 bo
0 134495 ot
0 346095334 ot
1 254785340 ot
The key is the number of the interval, e.g., 0 refers to the interval from 2012-09-22 00:00:00.0 to 2012-09-22 00:04:59.
I don't know whether this problem doesn't fit the MapReduce approach or whether I'm just not thinking about it the right way. In the map function I'm just passing the timestamp as key and the filtered information as value. In the reduce function I count the intervals by using global variables and produce the output mentioned.
i. Does the framework determine the number of reducers automatically, or is it user-defined? With one reducer I think there is no problem with my approach, but I'm wondering whether a single reducer can become a bottleneck when dealing with really large files. Can it?
ii. How can I solve this problem with multiple reducers?
Any suggestions would be really appreciated!
Thanks in advance!
EDIT:
The first question is answered by @Olaf, but the second still leaves me with some doubts regarding parallelism. The output of my map function is currently this (I'm just passing the timestamp with minute precision):
2012-09-22 00:00 47268866 bo
2012-09-22 00:00 344951 ot
2012-09-22 00:00 346095334 ot
2012-09-22 00:05 254785340 ot
So in the reduce function I receive inputs where the key represents the minute when the information was collected and the values are the information itself, and I want to enumerate five-minute intervals beginning at 0. I'm currently using a global variable to store the beginning of the interval, and when a key falls outside of it I increment the interval counter (which is also a global variable).
Here is the code:
private long stepRange = TimeUnit.MINUTES.toMillis(5);
private long stepInitialMillis = 0;
private int stepCounter = 0;

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    long millis = Long.valueOf(key.toString());
    if (stepInitialMillis == 0) {
        stepInitialMillis = millis;
    } else {
        if (millis - stepInitialMillis > stepRange) {
            stepCounter = stepCounter + 1;
            stepInitialMillis = millis;
        }
    }
    for (Text value : values) {
        context.write(new Text(String.valueOf(stepCounter)),
                new Text(key.toString() + "\t" + value));
    }
}
So, with multiple reducers, my reduce function will run on two or more nodes, in two or more JVMs; I will lose the control given by the global variables, and I can't think of a workaround for my case.
The number of reducers depends on the configuration of the cluster, although you can limit the number of reducers used by your MapReduce job.
A single reducer would indeed become a bottleneck in your MapReduce job if you are dealing with any significant amount of data.
The Hadoop MapReduce engine guarantees that all values associated with the same key are sent to the same reducer, so your approach should work with multiple reducers. See the Yahoo! tutorial for details: http://developer.yahoo.com/hadoop/tutorial/module4.html#listreducing
EDIT: To guarantee that all values for the same time interval go to the same reducer, you would have to use some unique identifier of the time interval as the key. You would have to do it in the mapper. I'm reading your question again and, unless you want to somehow aggregate the data between the records corresponding to the same time interval, you don't need any reducer at all.
EDIT: As @SeanOwen pointed out, the number of reducers depends on the configuration of the cluster. Usually it is configured to between 0.95 and 1.75 times the maximum number of tasks per node times the number of data nodes. If the mapred.reduce.tasks value is not set in the cluster configuration, the default number of reducers is 1.
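The reducer count can also be set per job in the driver code; a minimal sketch with the new MapReduce API (the job name and class are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IntervalJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "five-minute intervals");
        job.setJarByClass(IntervalJobDriver.class);
        // Overrides the cluster default; roughly 0.95 to 1.75 x (nodes x reduce slots)
        // is the usual starting point mentioned above.
        job.setNumReduceTasks(8);
        // ... set mapper, reducer, input/output paths as usual ...
    }
}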
It looks like you're wanting to aggregate some data by five-minute blocks. Map-reduce with Hadoop works great for this sort of thing! There should be no reason to use any "global variables". Here is how I would set it up:
The mapper reads one line of the TSV. It grabs the timestamp, and computes which five-minute bucket it belongs in. Make that into a string, and emit it as the key, something like "20120922:0000", "20120922:0005", "20120922:0010", etc. As for the value that is emitted along with that key, just keep it simple to start with, and send on the whole tab-delimited line as another Text object.
Now that the mapper has determined how the data needs to be organized, it's the reducer's job to do the aggregation. Each reducer will get a key (one of the five-minute buckets), along with the list of all the lines that fit into that bucket. It can iterate over that list, extract whatever it wants from it, and write output to the context as needed.
As for mappers, just let Hadoop figure that part out. Set the number of reducers to however many nodes you have in the cluster, as a starting point. It should run just fine.
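A sketch of what such a mapper might look like; the TSV column order follows the sample data in the question, and the key format matches the "20120922:0000"-style buckets described above (the class name and format details are assumptions):
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FiveMinuteBucketMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final long FIVE_MINUTES = 5 * 60 * 1000L;
    private final SimpleDateFormat inFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S");
    private final SimpleDateFormat outFormat = new SimpleDateFormat("yyyyMMdd:HHmm");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        try {
            // Round the timestamp down to the start of its five-minute bucket and
            // emit that as the key, e.g. "20120922:0000", "20120922:0005", ...
            long millis = inFormat.parse(fields[0]).getTime();
            long bucketStart = (millis / FIVE_MINUTES) * FIVE_MINUTES;
            context.write(new Text(outFormat.format(new Date(bucketStart))), value);
        } catch (ParseException e) {
            // Skip malformed lines rather than failing the whole job.
        }
    }
}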
Hope this helps.

Construct document-term matrix via Java and MapReduce

Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
"Product A","1","Product A is very, very expensive.","2"
"Product G","2","Awesome product!!","5"
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the “Matrix Market” (MM) format, which is the industry standard for compressing matrices with many zeros. This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stackoverflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1
Well, you can use something analogous to an inverted index concept.
I'm suggesting this because I'm assuming both files are big; hence, comparing them one-to-one would be a real performance bottleneck.
Here's a way that can be used -
You can feed both the Input File Format csv file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then, in each mapper, you branch on the source file name, something like this:
Pseudo code for mapper -
map(key, row, context) {
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and the words from the row
        ....
        context.write(new Text(word), new Text("-1" + "|" + review_id));
    } else if (filename.startsWith("term_index_file")) {
        // split index and word
        ....
        context.write(new Text(word), new Text(index + "|" + "0"));
    }
}
e.g. output from different mappers
Key Value source
product -1|1 datafile
very 5|0 term_index_file
very -1|1 datafile
product -1|2 datafile
very -1|1 datafile
product 2|0 term_index_file
...
...
Explanation (the example):
As it clearly shows, the key will be the word and the value will be made of two parts separated by a delimiter "|".
If the source is a datafile, then you emit key=product and value=-1|1, where -1 is a dummy element and 1 is a review_id.
If the source is the term_index_file, then you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we will use for sorting, explained later.
Definitely, no duplicate index will be processed by two different mappers if we provide the term_index_file as a normal input file to the job.
So, 'product', 'very', or any other indexed word in the term_index_file will only be available to one mapper. Note this is only valid for the term_index_file, not the datafile.
Next step:
Hadoop mapreduce framework, as you might well know, will group by keys
So, you will have something like this going to different reducers,
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem in the above case: we want the values sorted by the part after '|', i.e. in reduce-1 -> <2|0, -1|1, -1|2> and in reduce-2 -> <5|0, -1|1, -1|1>.
To achieve that you can use a secondary sort, implemented using a sort comparator. Please google it; here's a link that might help. Explaining it here would get really lengthy.
In reduce-1, since the values are sorted as above, the first iteration yields the '0' record and with it index_id=2, which can then be used in the subsequent iterations. In the next two iterations we get review ids 1 and 2 consecutively, and we use a counter to keep track of repeated review ids; a repeated review id means the word appeared more than once in the same review_id row. We reset the counter only when we find a different review_id, and emit the previous review_id's details for that index_id, something like this:
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with a single previous_review_id, which we finally emit in the same fashion.
Pseudo code for reducer -
reduce(key, Iterable<Text> values, context) {
    String index_id = null;
    int count = 1;
    String previousReview_id = null;

    for (Text value : values) {
        String[] split = value.toString().split("\\|");
        ....
        // When consecutive review_ids are the same, we increment count;
        // as soon as the review_id differs, we emit the previous review_id
        // detected, then reset the counter.
        if (split[0].equals("-1") && split[1].equals(previousReview_id)) {
            count++;
        } else if (split[0].equals("-1") && !split[1].equals(previousReview_id)) {
            if (previousReview_id != null) {
                context.write(previousReview_id + "\t" + index_id + "\t" + count);
            }
            previousReview_id = split[1]; // starting over with the new review_id
            count = 1;                    // resetting count for the new review_id
        } else {
            index_id = split[0];
        }
    }
    // The last previousReview_id is left over, so write it after the loop completes.
    context.write(previousReview_id + "\t" + index_id + "\t" + count);
}
This job is done with multiple reducers in order to leverage Hadoop for what it is best known for, performance; as a result, the final output will be scattered, something like the following, deviating from your desired output.
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything to be sorted by review_id (as in your desired output), you can write one more job that does that for you, using a single reducer and the output of the previous job as input, and at the same time calculate the header line (2 6 7) and put it at the front of the output.
This is just an approach (or an idea) that I think might help you. You will definitely want to modify it, plug in a better algorithm, and use it in the way that benefits you most.
You can also use composite keys for better clarity, instead of a delimiter such as "|".
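If you go the composite-key route, a sketch of what such a key might look like; the class and field names are made up for illustration, and you would still need a partitioner and a grouping comparator on the word alone:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// The natural key is the word; the tag orders term_index_file records (0) ahead of
// datafile records (1), replacing the "index|0" / "-1|review_id" string trick.
public class WordTaggedKey implements WritableComparable<WordTaggedKey> {

    private final Text word = new Text();
    private final IntWritable tag = new IntWritable();

    public void set(String w, int t) {
        word.set(w);
        tag.set(t);
    }

    public Text getWord() {
        return word;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        word.write(out);
        tag.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word.readFields(in);
        tag.readFields(in);
    }

    @Override
    public int compareTo(WordTaggedKey other) {
        int cmp = word.compareTo(other.word);
        return cmp != 0 ? cmp : Integer.compare(tag.get(), other.tag.get());
    }
}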
I am open to any clarification. Please ask if you think it might be useful to you.
Thank you!
You can load the term index list into the Hadoop distributed cache so that it is available to the mappers and reducers. For instance, in Hadoop streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it for your purposes. If you give a more detailed description of your input and desired output, I can give you more details.
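If you are on the Java MapReduce API rather than streaming, the equivalent mechanism is the distributed cache; a sketch assuming Hadoop 2.x, with the HDFS path and class names as placeholders:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class TermIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, Integer> termIndex = new HashMap<>();

    // Driver side: ship the term index file with the job under the symlink name "terms".
    public static Job buildJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "document-term matrix");
        job.setJarByClass(TermIndexMapper.class);
        job.addCacheFile(new URI("/user/me/myTermIndexList.txt#terms"));
        return job;
    }

    // Mapper side: read the cached copy once per task and build an in-memory lookup,
    // so each word resolves to its column number in O(1) instead of a file scan.
    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("terms"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");
                termIndex.put(parts[1], Integer.parseInt(parts[0]));
            }
        }
    }

    // map(...) would then call termIndex.get(word) for each word in a review.
}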
Approach #1 is not good, but it's very common if you don't have much Hadoop experience. Starting jobs is very expensive. What you want to do is have 2-3 jobs that feed each other to get the desired result. A common solution to similar problems is to have the mapper tokenize the input and output pairs, group them in the reducer while executing some kind of calculation, and then feed that into job 2. In the mapper of job 2 you invert the data in some way, and in the reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly Cloudera's dev course has a very similar problem to the one you are trying to address. Alternatively or perhaps in addition to a course I would look at "Data-Intensive Text Processing with MapReduce" specifically the sections on "COMPUTING RELATIVE FREQUENCIES" and "Inverted Indexing for Text Retrieval"
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
