Losing old aggregated records after adding repartitioning in Kafka - java

I had a Kafka Streams application with the following operations:
stream.mapValues.groupByKey.aggregate. The aggregation basically adds records to a list.
Now I have changed the implementation to this:
stream.flatMap.groupByKey.aggregate. The flatMap duplicates each record: the first record is exactly the same as in the old implementation, and the second one has a changed key. So after the change repartitioning happens, whereas before it didn't (which is fine). My problem is that after releasing the change, the old aggregated records for the old key disappeared. From the moment of the change everything works as it should, but I don't understand this behaviour. As I understand it, since I did not change the key, it should land on the same partition as before, and the aggregation should continue adding messages to the old list rather than starting from scratch. Could anyone help me understand why this is happening?

If you change your processing topology, in general, you need to reset your application and reprocess all data from the input topic to recompute the state.
In your case, I assume that the aggregation operator has a different name after the change and thus, does not "find" its local state and changelog topic any longer.
You can compare the names for both topologies via Topology#describe().
To allow for a smooth upgrade, you would need to provide a fixed name for aggregate() via Materialized.as(...). If you provide a fixed name (i.e., the same in the old and new topology), the issue goes away. However, because your original topology did not provide a fixed name, it's hard to get out of this situation.
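For future releases, a minimal sketch of the new topology with a pinned store name might look like this (the topic name, value types, deriveNewKey, listSerde and "my-aggregate-store" are assumptions, not taken from the question):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("input-topic");

stream
    .flatMap((key, value) -> Arrays.asList(
            KeyValue.pair(key, value),                  // same record as in the old topology
            KeyValue.pair(deriveNewKey(value), value))) // deriveNewKey(...): however you compute the second key
    .groupByKey()
    .aggregate(
            ArrayList::new,
            (key, value, list) -> { list.add(value); return list; },
            // fixed store name -> fixed changelog topic name across releases
            Materialized.<String, List<String>, KeyValueStore<Bytes, byte[]>>as("my-aggregate-store")
                    .withValueSerde(listSerde));        // listSerde: whatever serde you already use for the list

// Compare the generated operator/store names of the old and new topology:
System.out.println(builder.build().describe());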

Related

Indexes update in DynamoDB

I've been working with LSI and GSI in DynamoDB, but I guess I'm missing something.
I created an index so I can always query for the latest results using attributes other than the partition key, and without reading the entire item set, only the items that really matter. But with the GSI, at some point my query returns data that is not up to date; I understand this is due to the eventual consistency described in the docs (correct me if I'm wrong).
And what about LSIs? Even using ConsistentRead, at some point my data is not queried correctly and the results are not up to date. From the docs I thought that LSIs were updated synchronously with their table, and that with the ConsistentRead property set I'd always get the latest results, but this is not happening.
I'm using a REST endpoint (API Gateway) to perform inserts into my DynamoDB table (I apply some processing before the insertion), so I've been wondering whether that has something to do with it: maybe the code (currently Java) or DynamoDB is slow to update, and since everything seems to work fine in my endpoint I fire the next request too quickly, or I should wait a little longer before interacting with the table because the index is still being updated. However, I have already tried waiting longer and I still receive the same wrong results. I'm a bit lost here.
This is the code I'm using to query the index:
QuerySpec spec = new QuerySpec()
        .withKeyConditionExpression("#c = :v_attrib1 and #e = :v_attrib2")
        .withNameMap(new NameMap()
                .with("#c", "attrib1")
                .with("#e", "attrib2"))
        .withValueMap(new ValueMap()
                .withString(":v_attrib1", attrib1Value)
                .withString(":v_attrib2", attrib2Value))
        .withMaxResultSize(1)         // to only bring the latest one
        .withConsistentRead(true)     // is this wrong?
        .withScanIndexForward(false); // what about this one?
I don't know if the Maven library version could interfere, but in any case the version I'm using is 1.11.76 (I know there are a lot of newer versions, but if that is the problem we'll update it).
Thank you all in advance.
After searching for quite some time and running a few more tests, I finally figured out that the problem was not in the DynamoDB indexes, which are working as expected, but in the Lambda functions.
Sending a lot of requests one right after another was not giving the indexes time to stay up to date: Lambda functions execute asynchronously (I should have known), so the requests received by the database were not ordered and my data was not being updated properly. So we changed our implementation to use atomic counters: that way our data stays correct no matter the number or order of the requests.
See: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
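For reference, a minimal sketch of such an atomic counter update with the same document API used in the question (the table name and the "counter" attribute are made up):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.NameMap;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;

DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
Table table = dynamoDB.getTable("my-table");

// ADD is applied atomically on the server side, so concurrent or out-of-order
// requests still produce the correct final count.
table.updateItem(new UpdateItemSpec()
        .withPrimaryKey("attrib1", attrib1Value)
        .withUpdateExpression("ADD #cnt :inc")
        .withNameMap(new NameMap().with("#cnt", "counter"))
        .withValueMap(new ValueMap().withNumber(":inc", 1)));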

Efficiently update an element in a DelayQueue

I am facing a similar problem as the author in:
DelayQueue with higher speed remove()?
The problem:
I need to process continuously incoming data and check whether the data has been seen within a certain timeframe before. Therefore I calculate a unique ID for the incoming data and add the data, indexed by that ID, to a map. At the same time I store the ID and its timeout timestamp in a PriorityQueue, which lets me efficiently check which ID times out next. Unfortunately, if the data comes in again before the specified timeout, I need to update the timeout stored in the PriorityQueue. So far I have simply removed the old ID and re-added it along with the new timeout. This works well, except that the remove method becomes very time-consuming once my PriorityQueue grows beyond 300k elements.
Possible Solution:
I just thought about using a DelayQueue instead, which would make it easier to wait for the first data to time out; unfortunately I have not found an efficient way to update a timeout element stored in such a DelayQueue without running into the same problem as with the PriorityQueue: the remove method!
Any ideas on how to solve this problem in an efficient way even for a huge Queue?
This actually sounds a lot like a Guava Cache, which is a concurrent on-heap cache supporting "expire this long after the most recent lookup for this entry." It might be simplest just to reuse that, if you can use third-party libraries.
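If a third-party library is an option, the usage is short; a rough sketch (the key/value types and the 30-minute window are placeholders, not taken from the question):

import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Entries silently expire this long after their most recent read or write.
Cache<String, byte[]> recentlySeen = CacheBuilder.newBuilder()
        .expireAfterAccess(30, TimeUnit.MINUTES)
        .build();

// On arrival (id and payload are placeholders): a non-null hit means the ID was
// seen within the window, and the lookup itself pushes the expiry forward.
byte[] previous = recentlySeen.getIfPresent(id);
recentlySeen.put(id, payload);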
Failing that, the approach that implementation uses looks something like this: it has a hash table, so entries can be efficiently looked up by their key, but the entries are also in a concurrent, custom linked list -- you can't do this with the built-in libraries. The linked list is in the order of "least recently accessed first." When an entry is accessed, it gets moved to the end of the linked list. Every so often, you look at the beginning of the list -- where all the least recently accessed entries live -- and delete the ones that are older than your threshold.
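As a single-threaded illustration of that idea only (not the concurrent structure Guava actually uses; id and timeoutMillis are placeholders):

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// The third constructor argument turns on access ordering: every put/get moves
// the touched entry to the tail, so the head holds the least recently seen IDs.
LinkedHashMap<String, Long> lastSeen = new LinkedHashMap<>(16, 0.75f, true);

// On every arrival, record the current time; put() also refreshes the position.
lastSeen.put(id, System.currentTimeMillis());

// Periodically sweep from the head, where the least recently touched IDs live.
long cutoff = System.currentTimeMillis() - timeoutMillis;
Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry<String, Long> e = it.next();
    if (e.getValue() >= cutoff) {
        break;          // everything from here on was touched more recently
    }
    it.remove();        // O(1) removal, no scan of the whole structure
}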

Concurrency Control on my Code

I am working on an order capture and generation application. The application works fine with concurrent users working on different orders. The problem starts when two users from different systems/locations try to work on the same order. The business impact is that, for the same order, the application generates duplicate data because two users are working on it simultaneously.
I have tried synchronizing the method where I generate the order, but that would mean that no other user can work on any new order, since synchronized puts a lock on the whole method. This would block all users from generating a new order while one order is in progress, because they all hit the synchronized code.
I have also tried criteria initialization for an order, but with no success.
Can anyone please suggest a proper approach?
All suggestions/comments are welcome. Thanks in advance.
Instead of synchronizing at the method level, you can use block-level synchronization for the blocks of code that must be operated on by only one thread at a time. This way you increase the scope for parallel processing of the same order.
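For example, a minimal sketch of per-order locking (all names are made up); note that this only serializes users within one JVM, which is why the database-level approach below still matters for users coming from different systems:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// One lock object per order id: users on different orders never block each
// other, while two users on the same order are serialized.
private final ConcurrentMap<String, Object> orderLocks = new ConcurrentHashMap<>();

public void generateOrder(String orderId) {
    Object lock = orderLocks.computeIfAbsent(orderId, k -> new Object());
    synchronized (lock) {
        // only the code that actually generates data for this order goes here
    }
}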
On a grander scale, if you are backing your entities with a database, I would advise you to look at optimistic locking.
Add a version field to your order entity. When the order is first placed, the version is 1. Every update then has to move forward from that version, so imagine two concurrent processes:
a -> Read data (version=1)
     Update data
     Store data (set version=2 if version=1)
b -> Read data (version=1)
     Update data
     Store data (set version=2 if version=1)
If the processing of these two is concurrent rather than serialized, you will notice that one of them will indeed fail to store its data. That is the losing user, who will have to retry his edits (and will read version=2 this time).
If you use JPA, optimistic locking is as easy as adding a @Version attribute to your model. If you use raw JDBC, you will need to add it to the update condition yourself:
update table set version=2, data=xyz where orderid=x and version=1
That is by far the best and in fact preferred solution to your general problem.
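With JPA the whole thing looks roughly like this (entity and field names are only an example):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Order {

    @Id
    private Long id;

    @Version                 // JPA checks and increments this on every update
    private int version;

    private String data;     // placeholder for the real order attributes

    // getters/setters omitted
}

When two transactions load the same version and both try to update it, the second commit fails with an OptimisticLockException, which is the point where you tell the losing user to reload and retry.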

How to create a copy of a table in HBase on the same cluster? Or, how to serve requests from the original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which is described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I don't know whether it has been designed to handle that scenario efficiently.
Use the Export+Import jobs. Doing that sounds like a hack, but since I'm new to HBase maybe it's actually a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is to serve your requests with reads that use this parameter, pointed at a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
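A rough sketch of such a timestamp-bounded read, assuming the HBase 1.x client API (the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

long snapshotTs = System.currentTimeMillis(); // captured just before the batch starts

Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("orders"))) {

    Scan scan = new Scan();
    scan.setTimeRange(0, snapshotTs);         // only cells written before the snapshot
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            // serve the request from the pre-batch state of each row
        }
    }
}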

Best practice to realize a long-term history mode for an O/RM system (Hibernate)?

I have mapped several Java classes like Customer, Assessment, Rating, ... to a database with Hibernate.
Now I am thinking about a history mode for all changes to the persistent data. The application is a web application. When data is deleted (or edited), another user should have the possibility to see the changes and undo them. Since the changes are out of the scope of the current session, I don't know how to solve this with something like the Command pattern, which is usually recommended for undo functionality.
For single-value edits an approach like the one in this question sounds OK. But what about the deletion of a whole persistent entity? The simplest way is a flag in the table marking whether the customer is deleted or not. The most complex way is a separate table for each class where deleted entities are stored. Is there anything in between? And how can I integrate these two things into an O/RM system (in my case Hibernate) comfortably, without messing around too much with SQL (which I want to avoid for portability) and still keep enough flexibility?
Is there a best practice?
One approach to maintaining an audit/undo trail is to mark each version of an object's record with a version number. Finding the current version would be painful if this were a simple incrementing version number, so reverse version numbering works best: "version" 0 is always the current one, and on an update the version numbers of all previous versions are incremented. Deleting an object is done by incrementing the version numbers on the current records and not inserting a new one at 0.
Compared to an attribute-by-attribute approach this makes for far simpler rollbacks or historic views, but it does take more space.
One way to do it would be to have a "change history" entity with properties for the id of the entity changed, the action (edit/delete), the property name, the original value and the new value, and maybe also a reference to the user performing the edit. A deletion would create entries for all properties of the deleted entity with the action "delete".
This entity would provide enough data to perform undos and to view the change history.
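A rough sketch of such an entity (all names are illustrative, not prescriptive):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class ChangeHistory {

    @Id
    @GeneratedValue
    private Long id;

    private Long entityId;       // id of the entity that was changed
    private String action;       // "edit" or "delete"
    private String propertyName;
    private String originalValue;
    private String newValue;
    private String changedBy;    // user performing the edit

    // getters/setters omitted
}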
Hmm, I'm looking for an answer to this too. So far the best I've found is the www.jboss.org/envers/ framework, but even that seems to me like more work than should be necessary.
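If Envers does turn out to be acceptable, the minimum setup is a single annotation on the audited entity (assuming hibernate-envers is on the classpath); Hibernate then records every insert, update and delete in a revision table (CUSTOMER_AUD by default):

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.envers.Audited;

@Entity
@Audited          // Envers keeps a full revision history for this entity
public class Customer {

    @Id
    private Long id;

    private String name;

    // getters/setters omitted
}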
