How do I create an EmitterProcessor that retains only the latest n elements, such that it also works even if there are no subscribers?
At the moment I create a processor like this:
EmitterProcessor<Integer> processor = EmitterProcessor.create();
And an external system provides temperature updates randomly throughout the day. In the callback from that system I do:
void tempConsumer(int temp) {
    processor.onNext(temp);
}
However, onNext(...) blocks once processor.getBufferSize() elements have been added.
How can I create a processor that discards the oldest element, in this case, rather than blocking?
This seems to be covered to some degree in reactor-core #763. Simon Baslé first discusses a proposed change to EmitterProcessor such that when "sending data while there are NO subscribers [and] the queue contains bufferSize elements, the oldest element is dropped and the onNext is enqueued." But then in the next comment, he says "we won't go ahead with my suggested change above. We instead advise you to use the sink() rather than directly the onNext. Namely, to use the onRequest callback inside the sink() to perform exactly as many sink.next(...) as there are requests."
However, if I understand things correctly this only covers the case where you can calculate new elements on demand, e.g. like so:
FluxSink<Integer> sink = processor.sink();
Random random = new Random();
// Generate exactly as many elements as are requested, on demand.
sink.onRequest(n -> {
    for (long i = 0; i < n; i++) sink.next(random.nextInt());
});
But in my situation, I can't generate the latest n temperature readings on demand. Of course, I could maintain my own external bounded buffer of the latest readings and then read from that in onRequest(...) but I'm assuming Reactor can do this for me?
I presume this question is a dupe, but my Google-fu has failed me here.
Ricard Kollcaku's answer that one should use ReplayProcessor seems to be the right way to do things. Here is another example that I wrote to get clear in my head how to use it:
ReplayProcessor<Integer> flux = ReplayProcessor.create(Queues.SMALL_BUFFER_SIZE);
FluxSink<Integer> sink = flux.sink();
// ReplayProcessor.getBufferSize() returns unbounded,
// while CAPACITY returns the capacity of the underlying buffer.
int capacity = flux.scan(Scannable.Attr.CAPACITY);
// Add twice as many elements as the underlying buffer can take.
int count = capacity * 2;
for (int i = 0; i < count; i++) {
    sink.next(i);
}
// If `capacity` is 256, this will print values 256 through 511.
flux.subscribe(System.out::println);
I also found this section, in Hands-On Reactive Programming with Reactor, useful in explaining things.
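For reference, here is a minimal sketch (my own, not from the book or the linked issue) of how this bounded replay approach could be wired into the tempConsumer callback from the question; the buffer size of 100 is an arbitrary assumption:
// Keeps only the latest 100 readings; new subscribers replay whatever is buffered.
ReplayProcessor<Integer> processor = ReplayProcessor.create(100);
FluxSink<Integer> sink = processor.sink();

// Callback invoked by the external temperature system.
void tempConsumer(int temp) {
    sink.next(temp); // does not block; the replay buffer only keeps the newest 100
}

// Later, a subscriber first receives the buffered readings, then live ones.
processor.subscribe(temp -> System.out.println("temp=" + temp));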
You should use ReplayProcessor, as in this example:
ReplayProcessor<Integer> directProcessor = ReplayProcessor.cacheLast();
Flux.range(1, 10)
    .map(integer -> {
        directProcessor.onNext(integer);
        return integer;
    })
    .doOnComplete(() -> {
        directProcessor.subscribe(System.out::println);
        directProcessor.subscribe(System.out::println);
    })
    .subscribe();
I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not), would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date: { $gt: max_created_date_from_last_result }}).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two approaches (a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still shows a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, the documents are just paginated in the order they were written to the collection, so the engine has to build a temporary index first. It is better to use the existing _id index: sort by _id. Then it is very fast even with large collections, like this:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case, really) with sorted range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that are each a range of time, and you have sub-pages within that range of time if you need to skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire inventory to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
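As an illustration (my own, not from the original answer), a jump by counting integer might look like this with the MongoDB Java sync driver; the field name seq, the connection string, and the database/collection names are all assumptions:
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

// Assumes every document carries an indexed, sequential integer field "seq"
// assigned at insert time.
MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("mydb").getCollection("myCollection");

long pageSize = 16;
long page = 5432;                    // zero-based page number
long startSeq = page * pageSize;     // first seq value on the requested page

FindIterable<Document> pageDocs = coll
        .find(Filters.gte("seq", startSeq))   // index seek, no skip
        .sort(Sorts.ascending("seq"))
        .limit((int) pageSize);
pageDocs.forEach(doc -> System.out.println(doc.toJson()));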
On the other hand, assigning a counting number to each document means write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
It is also harder if you have to deal with deletes, but still possible, because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
I'm just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I am currently dealing with, but next time I'll build a better one if I encounter a similar situation.
If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated in the official MongoDB docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $lt: startValue } } )
        .sort( { _id: -1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}
An ascending-order example works the same way, using $gt and .sort( { _id: 1 } ).
If you know the ID of the element after which you want to start, you can use:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a genius little solution which works like a charm.
For faster pagination, don't use skip(). Use find() and limit(), querying past the last _id of the preceding page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the preceding page

for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
            Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
            Aggregation.sort(Sort.Direction.ASC, "_id"),
            new CustomAggregationOperation(queryOffersByProduct),
            Aggregation.limit((long) pageSize));

    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
            mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class)
                    .getMappedResults();

    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}
I have written this piece of code and I would like to convert it to streams while keeping the same behavior.
List<String> str1 = new ArrayList<String>();
str1.add("A");
str1.add("B");
str1.add("C");
str1.add("D");
int index = str1.indexOf("C");
if ((str1.contains("C")) && (index != 0)) {
str1.remove("C");
str1.add(0, "C");
}
You could simplify:
int index = str1.indexOf("C");
if ((str1.contains("C")) && (index != 0)) {
str1.remove("C");
str1.add(0, "C");
}
as
if (str1.remove("C")) {
str1.add(0, "C");
}
(Check the javadoc for remove(Object). It removes the first instance of the object that it finds, and returns true if an object was removed. There is no need to find and test the object's index.)
At this point, you should probably stop.
A small gotcha is that if "C" is the first element, you are removing it and adding it again ... unnecessarily. You could deal with this as follows:
int index = str1.index("C");
if (index > 1) {
str1.add(0, str1.remove(index));
}
You could optimize further to avoid the double copying of a remove and an add. For example (thanks to #Holger):
Collections.rotate(str1.subList(0, str1.indexOf("C") + 1), 1);
(But if you say that is simple, a regular Java programmer will probably throw something at you. It requires careful reading of the javadocs to understand that this is actually efficient.)
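For reference, here is a quick check (my own) of what that rotate call does with the list from the question:
List<String> str1 = new ArrayList<>(Arrays.asList("A", "B", "C", "D"));
Collections.rotate(str1.subList(0, str1.indexOf("C") + 1), 1);
System.out.println(str1); // prints [C, A, B, D]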
You could rewrite this as two stream operations, but there is no point in doing it. Both operations (remove the first match, and insert into a stream) are inefficient and tricky to implement, and tricky to implement means hard to read.
Note: Using streams to solve every problem is not a sensible goal. The main point of streams is to express complex transformations in a way that is easier to read and maintain than conventional loops. If the only stream-based solutions to a problem are actually harder to read and maintain than the non-stream solutions, that is clear evidence that you shouldn't be trying to use streams to solve that problem!
Note that the above solutions are for the problem as stated. However, if object identity of the elements actually matters, the only solution above that actually moves an object to the start of the list is the one that uses rotate.
We have a system that processes flat files and (with only a couple of validations) inserts them into a database.
This code:
// There can be 8 million lines
for (String line : lines) {
    if (!Class.isBranchNoValid(validBranchNoArr, obj.branchNo)) {
        continue;
    }
    list.add(line);
}
Definition of isBranchNoValid:
// The array length ranges from 2 to 5 only
public static boolean isBranchNoValid(String[] validBranchNoArr, String branchNo) {
    for (int i = 0; i < validBranchNoArr.length; i++) {
        if (validBranchNoArr[i].equals(branchNo)) {
            return true;
        }
    }
    return false;
}
The validation is at line level (we have to filter out or skip any line whose branchNo isn't in the array). Earlier, this filter wasn't there.
Now, severe performance degradation is troubling us.
I understand (maybe I am wrong) that this repeated function call is causing a lot of stack creation, resulting in very frequent GC invocations.
I can't figure out a way (is it even possible?) to perform this filter without such a high performance cost (a small difference is fine).
This is certainly not a stack problem: your function is not recursive, so nothing is kept on the stack between calls; after each call the local variables are discarded since they are not needed anymore.
You can put the valid numbers in a set and use that as an optimization, but in your case I am not sure it will bring any benefit at all, since you have at most 5 elements.
So there are several possible bottlenecks in your scenario.
1. reading the lines of the file
2. parsing each line to construct the object to insert into the database
3. checking the applicability of the object (i.e. the branch-no filter)
4. inserting into the db
Generally, you'd say I/O is the slowest, so 1 and 4. You're saying nothing except 3 changed, right? That is weird.
Anyway, if you want to optimize that, I wouldn't be passing the array around 8 million times, and I wouldn't iterate it every time either. Since your valid branches are known, create a HashSet from it - it has O(1) access.
Set<String> validBranches = Arrays.stream(branches)
        .collect(Collectors.toCollection(HashSet::new));
Then, iterate the lines
for (String line : lines) {
    YourObject obj = parse(line);
    if (validBranches.contains(obj.branchNo)) {
        writeToDb(obj);
    }
}
or, in the stream version
Files.lines(yourPath)
    .map(this::parse)
    .filter(o -> validBranches.contains(o.branchNo))
    .forEach(this::writeToDb);
I'd also check whether it isn't more efficient to first collect a batch of objects and then write them to the db. Also, it's possible that handling the lines in parallel gains some speed, in case the parsing is time-intensive.
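A rough sketch of the batching idea (the batch size and the writeBatchToDb helper are assumptions, not part of the original code):
List<YourObject> batch = new ArrayList<>();
int batchSize = 1_000;                   // assumed; tune against your database
for (String line : lines) {
    YourObject obj = parse(line);
    if (validBranches.contains(obj.branchNo)) {
        batch.add(obj);
        if (batch.size() == batchSize) {
            writeBatchToDb(batch);       // hypothetical bulk-insert helper
            batch.clear();
        }
    }
}
if (!batch.isEmpty()) {
    writeBatchToDb(batch);               // flush the final partial batch
}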
Context
I've stumbled upon a rather annoying problem: I have a program with a lot of data sources that are able to stream the same type of elements, and I want to "map" each available element in the program (element order doesn't matter).
Therefore I've tried to reduce my Stream<Stream<T>> streamOfStreamOfT; into a simple Stream<T> streamOfT; using streamOfT = streamOfStreamOfT.reduce(Stream.empty(), Stream::concat);
Since element order is not important for me, I've tried to parallelize the reduce operation with .parallel(): streamOfT = streamOfStreamOfT.parallel().reduce(Stream.empty(), Stream::concat); But this triggers a java.lang.IllegalStateException: stream has already been operated upon or closed
Example
To experience it yourself, just play with the following main (Java 1.8u20) by commenting/uncommenting the .parallel():
public static void main(String[] args) {
    // GIVEN
    List<Stream<Integer>> listOfStreamOfInts = new ArrayList<>();
    for (int j = 0; j < 10; j++) {
        IntStream intStreamOf10Ints = IntStream.iterate(0, i -> i + 1)
                .limit(10);
        Stream<Integer> genericStreamOf10Ints = StreamSupport.stream(
                intStreamOf10Ints.spliterator(), true);
        listOfStreamOfInts.add(genericStreamOf10Ints);
    }
    Stream<Stream<Integer>> streamOfStreamOfInts = listOfStreamOfInts
            .stream();

    // WHEN
    Stream<Integer> streamOfInts = streamOfStreamOfInts
            // ////////////////
            // PROBLEM
            // |
            // V
            .parallel()
            .reduce(Stream.empty(), Stream::concat);

    // THEN
    System.out.println(streamOfInts.map(String::valueOf).collect(
            joining(", ")));
}
Question
Can someone explain this limitation, or suggest a better way of handling parallel reduction of a stream of streams?
Edit 1
Following #Smutje's and #LouisWasserman's comments, it seems that .flatMap(Function.identity()) is a better option that tolerates .parallel() streams.
The form of reduce you are using takes an identity value and an associative combining function. But Stream.empty() is not a value; it has state. Streams are not data structures like arrays or collections; they are carriers for pushing data through possibly-parallel aggregate operations, and they have some state (like whether the stream has been consumed or not.) Think about how this works; you're going to build a tree where the same "empty" stream appears in more than one leaf. When you try to use this stateful not-an-identity twice (which won't happen sequentially, but will happen in parallel), the second time you try and traverse through that empty stream, it will quite correctly be seen to be already used.
So the problem is, you're simply using this reduce method incorrectly. The problem is not with the parallelism; it is simply that the parallelism exposed the underlying problem.
Secondly, even if this "worked" the way you think it should, you would only get parallelism building the tree that represents the flattened stream-of-streams; when you go to do the joining, that's a sequential stream pipeline there. Ooops.
Thirdly, even if this "worked" the way you think it should, you're going to add a lot of element-access overhead by building up concatenated streams, and you're not going to get the benefit of parallelism that you are seeking.
The simple answer is to flatten the streams:
String joined = streamOfStreams.parallel()
        .flatMap(s -> s)
        .collect(joining(", "));
I have updated this question (the previous version was not clear; if you want to refer to it, check the revision history). The current answers so far do not work because I failed to explain my question clearly (sorry, second attempt).
Goal:
Trying to take a set of numbers (positive or negative, thus needing bounds to limit the growth of a specific variable) and find their linear combinations that can be used to get to a specific sum. For example, to get to a sum of 10 using [2,4,5] we get:
5*2 + 0*4 + 0*5 = 10
3*2 + 1*4 + 0*5 = 10
1*2 + 2*4 + 0*5 = 10
0*2 + 0*4 + 2*5 = 10
How can I create an algorithm that is scalable for a large number of variables and target sums? I can write the code on my own if an algorithm is given, but if there's a library available, I'm fine with any library, though I prefer Java.
One idea would be to break out of the loop once you set T[z][i] to true, since you are only basically modifying T[z][i] here, and if it does become true, it won't ever be modified again.
for i = 1 to k:
    for z = 0 to sum:
        for j = z - x_i to 0:
            if T[j][i-1]:
                T[z][i] = true
                break
EDIT2: Additionally, if I am getting it right, T[z][i] depends on the array T[z-x_i..0][i-1], and T[z+1][i] depends on T[z+1-x_i..0][i-1]. So once you know whether T[z][i] is true, you only need to check one additional element (T[z+1-x_i][i-1]) to know if T[z+1][i] will be true.
Let's say you represent whether T[z][i] was updated by a variable changed. Then you can simply say that T[z][i] = changed && T[z-1][i]. So you should be done in two loops instead of three. This should make it much faster.
Now, to scale it: since T[z,i] depends only on T[z-1,i] and T[z-1-x_i,i-1], you do not need to wait until the whole (i-1)th column is populated before populating T[z,i]. You can start working on T[z,i] as soon as the required values are populated. I can't implement it without knowing the details, but you can try this approach.
I take it this is something like unbounded knapsack? You can dispense with the loop over c entirely.
for i = 1 to k:
    for z = 0 to sum:
        T[z][i] = z >= x_i cand (T[z - x_i][i - 1] or T[z - x_i][i])
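For concreteness, here is a small sketch of my own, using the standard unbounded-knapsack reachability form (which also covers the case where x_i is used zero times), applied to the example values from the question:
int[] x = {2, 4, 5};
int target = 10;
boolean[] reachable = new boolean[target + 1];
reachable[0] = true;                     // the empty combination sums to 0
for (int xi : x) {
    for (int z = xi; z <= target; z++) {
        if (reachable[z - xi]) {
            reachable[z] = true;         // z is reachable by adding one more xi
        }
    }
}
System.out.println(reachable[target]);   // prints true, e.g. 5*2 = 10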
Based on the original example data you gave (linear combination of terms) and your answer to my question in the comments section (there are bounds), would a brute force approach not work?
c0*x0 + c1*x1 + c2*x2 + ... + cn*xn = SUM
I'm guessing I'm missing something important but here it is anyway:
Brute Force Divide and Conquer:
the main controller generates coefficients for, say, half of the terms (or however many make sense)
it then sends each partial set of fixed coefficients to a work queue
a worker picks up a partial set of fixed coefficients and proceeds to brute force its own way through the remaining combinations
it doesn't use much memory at all as it works sequentially on each valid set of coefficients
could be optimized to ignore equivalent combinations and probably many other ways
Pseudocode for Multiprocessing
class Controller:
    work_queue = Queue
    solution_queue = Queue
    solution_sets = []

    create x number of workers with access to work_queue and solution_queue

    # say, for 2000 terms:
    for partial_set in coefficient_generator(start_term=0, end_term=999):
        if worker_available():  # generate just in time
            push partial_set onto work_queue

    while solution_queue:
        add any solutions to solution_sets
        # there is an efficient way to do this type of polling but I forget

class Worker:
    while true:  # actually stops when a stop-work token is received
        get partial_set from the work_queue
        for remaining_set in coefficient_generator(start_term=1000, end_term=1999):
            combine the two sets (partial_set.extend(remaining_set))
            if is_solution(full_set):
                push full_set onto the solution_queue
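For a sense of the core idea, here is a tiny single-threaded Java sketch of the brute force for the three-term example from the question (the coefficient bound of 5 is an assumption; the real approach splits the coefficient space across workers as described above):
int[] x = {2, 4, 5};
int target = 10;
int bound = 5;                           // assumed upper bound per coefficient
for (int c0 = 0; c0 <= bound; c0++) {
    for (int c1 = 0; c1 <= bound; c1++) {
        for (int c2 = 0; c2 <= bound; c2++) {
            if (c0 * x[0] + c1 * x[1] + c2 * x[2] == target) {
                // Prints the four combinations listed in the question.
                System.out.printf("%d*%d + %d*%d + %d*%d = %d%n",
                        c0, x[0], c1, x[1], c2, x[2], target);
            }
        }
    }
}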