I am currently doing a POC for a distributed, fault-tolerant ETL ecosystem. I have selected Hazelcast for my clustering (data + notification) purposes. Googling through Hazelcast resources took me to this link, and it matches exactly how I was thinking of going about it, using a map-based solution.
I need to understand one point. Before that, allow me to give a canonical idea of our architecture:
Say we have 2 nodes A and B running our server instance, clustered through Hazelcast. One of them, say A, is the listener accepting requests (though this can change on failover).
A gets a request and puts it into a distributed map. This map is write-through, backed by a persistent store, and a single in-memory backup is configured on the nodes.
Each instance has a local map entry listener which, on an entry-added event, processes that entry (asynchronously, via a queue) and then removes it from the distributed map.
This is working as expected.
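For illustration, here is a simplified sketch of the setup (the map name, the Request type, the RequestStore class and the process() hand-off are illustrative placeholders, not our actual code; Hazelcast 3.x API assumed):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.config.MapStoreConfig;
    import com.hazelcast.core.*;

    Config config = new Config();
    MapConfig mapConfig = config.getMapConfig("requests");
    mapConfig.setBackupCount(1);                        // single memory backup
    mapConfig.setMapStoreConfig(new MapStoreConfig()
            .setImplementation(new RequestStore())      // our MapStore/MapLoader
            .setWriteDelaySeconds(0));                  // 0 = write-through

    HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    final IMap<String, Request> requests = hz.getMap("requests");
    requests.addLocalEntryListener(new EntryAdapter<String, Request>() {
        @Override
        public void entryAdded(EntryEvent<String, Request> event) {
            process(event.getValue());                  // hand off to a worker queue in reality
            requests.remove(event.getKey());
        }
    });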
Question:
Say 10 requests have been received and distributed with 5 on each node. 2 entries on each node have been processed, and then both instances crash.
So there are a total of 6 entries present in the backing datastore now.
Now we bring up both instances. As per the documentation: "As of 1.9.3 MapLoader has the new MapLoader.loadAllKeys API. It is used for pre-populating the in-memory map when the map is first touched/used".
We implement loadAllKeys() by simply loading all the keys present in the store.
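Something along these lines (only the loader side is shown; the dao calls are placeholders for our store access, and in reality the class implements MapStore for the write-through behaviour):

    import com.hazelcast.core.MapLoader;
    import java.util.*;

    public class RequestStore implements MapLoader<String, Request> {
        @Override
        public Request load(String key) {
            return dao.findById(key);                   // placeholder store lookup
        }

        @Override
        public Map<String, Request> loadAll(Collection<String> keys) {
            Map<String, Request> result = new HashMap<String, Request>();
            for (String key : keys) {
                result.put(key, dao.findById(key));
            }
            return result;
        }

        @Override
        public Set<String> loadAllKeys() {
            // simply return every key currently present in the persistent store
            return new HashSet<String>(dao.findAllIds());
        }
    }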
So does that mean there is a possibility that both instances will now load the 6 entries and process them (thus resulting in duplicate processing)? Or is it handled in a synchronized way, so that loading is done only once in the cluster?
On server startup I need to process the pending entries (if any). I see that the data is loaded; however, the entryAdded event is not fired. How can I make the entryAdded event fire (or is there any other elegant way by which I will know that there are pending entries on startup)?
Requesting suggestions.
Thanks,
Sutanu
On initialization, loadAllKeys() will be called, which will return all 6 keys in the persistent store. Then each node will select the keys it owns and load only those. So A might load 2 entries, while B loads the remaining 4.
store.load doesn't fire entry listeners. How about this: right after initialization, and after registering your listener, you can get the local entries and process the existing ones.
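As a sketch (the map name, Request type, listener and process() are placeholders; localKeySet() is the standard way to get the keys this node owns):

    IMap<String, Request> requests = hz.getMap("requests");
    requests.addLocalEntryListener(listener);       // register the listener first
    for (String key : requests.localKeySet()) {     // then sweep what this node already owns
        Request pending = requests.get(key);
        process(pending);                           // same processing as the listener path
        requests.remove(key);
    }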
I have 3 nodes of my application (Java Spring Boot) which use a Hazelcast IMap as a distributed cache. My logic requires frequently updating a value in the map, and I have implemented an EntryProcessor for it.
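Roughly, the processor looks like this (a simplified sketch, not my actual code; Hazelcast 4.x EntryProcessor API assumed, hz being the HazelcastInstance):

    import com.hazelcast.map.EntryProcessor;
    import com.hazelcast.map.IMap;
    import java.util.Map;

    public class IncrementProcessor implements EntryProcessor<String, Long, Long> {
        @Override
        public Long process(Map.Entry<String, Long> entry) {
            long updated = (entry.getValue() == null ? 0L : entry.getValue()) + 1;
            entry.setValue(updated);        // written back on the partition that runs this
            return updated;
        }
    }

    // invoked as:
    IMap<String, Long> counters = hz.getMap("counters");
    Long value = counters.executeOnKey("some-key", new IncrementProcessor());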
Whilst testing, I created a Hazelcast cluster with 3 nodes.
What I noticed is the following:
If node 1 invokes the entry processor, it is not guaranteed to be executed on node 1; it is executed on any one of the 3 nodes. The same holds for the backup entry processor.
The same happens for the other 2 nodes.
Is there any way to ensure/enforce that the entry processor is executed on the node where it was invoked? I read through the documentation and could not find an answer to my question.
Thanks in advance.
The entry processor runs on nodes that host the primary copy of the entry and any backup copy.
The behaviour you are seeing is due to the data not being hosted in the same place from run to run. There is a degree of randomness.
This is normal, and what you want. Any attempts to "pin" the data to a specific place ALWAYS go wrong. Hazelcast, and distributed stores in general, need to be unconstrained in where they can place the data to balance the cluster.
You can run a Runnable/Callable on a specific node if you need predictability, but that's not the use-case here.
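For completeness, a minimal sketch of that option (the choice of member here is arbitrary and purely illustrative; hz is the HazelcastInstance):

    import com.hazelcast.core.*;
    import java.io.Serializable;

    public class PinnedTask implements Runnable, Serializable {
        @Override
        public void run() {
            System.out.println("running on whichever member I was sent to");
        }
    }

    // pick a member explicitly and run the task there
    IExecutorService executor = hz.getExecutorService("exec");
    Member target = hz.getCluster().getMembers().iterator().next();
    executor.executeOnMember(new PinnedTask(), target);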
I have a network of nodes represented as a graph (or more specifically, a DAG). The vertices and edges are just IDs pointing to large objects in the cache.
I am using Hazelcast and have the following caches:
1. A ReferenceObject for the graph
2. An IMap for the large objects
3. An IMap for the edges in the graph
When I insert a large object, I have an entry listener that will update this graph in the cache. Similarly, when I add edge data, there is also an entry listener that will update the graph.
However, I have one problem: if I create an edge and it introduces a cycle, the operation fails (as it's a DAG), but the IMap retains the records.
Any ideas how I can have transactions across the main thread and the entry listener?
@Pilo, the problem is that EntryListener listens to events fired after the data is already populated in the map. So when you insert the data into your first map and listen for an update event, the data is already in the first map.
You can either
Manually remove the record from the first map if the operation fails on the second one.
Use transactions and make sure that either all of the maps are updated or none are, instead of using listeners; a sketch of this option follows below.
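A minimal sketch of the transactional option (map names and the Edge/Graph value types are illustrative; hz is the HazelcastInstance):

    TransactionContext ctx = hz.newTransactionContext();
    ctx.beginTransaction();
    try {
        TransactionalMap<String, Edge> edges = ctx.getMap("edges");
        TransactionalMap<String, Graph> graph = ctx.getMap("graph");
        edges.put(edgeId, edge);
        graph.put(graphId, updatedGraph);   // the cycle check fails here if it's not a DAG
        ctx.commitTransaction();            // both maps updated together...
    } catch (Exception e) {
        ctx.rollbackTransaction();          // ...or neither is
    }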
Though it's a completely different approach, have you looked at Hazelcast Jet? It's a DAG-based event stream processing engine built on top of Hazelcast IMDG. It might fit your use case better and take care of the lower-level stuff for you.
https://jet.hazelcast.org
You would have a Jet cluster, which is also a Hazelcast cluster, but you get all the processing machinery with it. It extends the Java Streams programming model, so you just write your app as if it were a Java stream and run it on the cluster. Something to think about anyway.
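Just to give a taste, a tiny sketch in the newer Jet Pipeline API (the source map and the mapping step are illustrative):

    import com.hazelcast.jet.Jet;
    import com.hazelcast.jet.JetInstance;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    Pipeline p = Pipeline.create();
    p.readFrom(Sources.<String, String>map("largeObjects"))    // read an IMap as a batch source
     .map(entry -> entry.getKey() + " -> " + entry.getValue())
     .writeTo(Sinks.logger());                                 // any sink would do here

    JetInstance jet = Jet.newJetInstance();
    jet.newJob(p).join();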
We are working on a distributed data processing system, and Hazelcast is one of the components we are using.
We have streaming data input coming into the cluster, and we have to process the data (update/accumulate etc.). There is a distributed request map, which has local entry listeners. We process a new request (update/accumulate in memory) and put it into another distributed map, which is the actual datagrid.
Thus we can process each request concurrently without locking. However, putting the data into the main datagrid might involve a network trip.
Is there a way I can force/specify which node is selected? Basically, I would want to put it into the local map of the datagrid. This should improve the overall throughput by avoiding the network trip.
By using a partition-aware key, I can specify that all such keys go to the same partition; however, I am looking to actually 'specify' the partition. Is this possible?
You can create a key for a specific partition. We do this often for testing.
Once you have created such a key for every partition, you can use
map.put("yourkey#partitionkey", value)
Check out the Git repo and look for HazelcastTestSupport.generateKeyOwnedBy(hz).
Important: a partition may belong to a member at some point in time, but partitions can move around in the system, e.g. when a member joins or leaves the cluster, so this solution could be fragile.
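If you do go down that path, you can at least check ownership at runtime before relying on locality (standard PartitionService API; the key, map name and value are illustrative):

    PartitionService partitionService = hz.getPartitionService();
    Partition partition = partitionService.getPartition("yourkey#partitionkey");
    Member owner = partition.getOwner();                // can be null briefly during migration
    if (owner != null && owner.localMember()) {
        // the put stays on this node right now (no network trip)
        hz.getMap("datagrid").put("yourkey#partitionkey", value);
    }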
I have events which should be accumulated in a persistent key-value store. 24 hours after a key is first inserted, the accumulated record should be processed and removed from the store.
Expired-data processing is distributed among multiple nodes, so using a database introduces processing-synchronization problems. I don't want to use any SQL database.
The best fit for me is probably some cache with an expiration policy configurable to my needs. Is there any? Or can this be solved with some NoSQL database?
It should be possible with products like Infinispan or Hazelcast.
Both are JSR107 compatible.
With a JSR107-compatible cache API, a possible approach is to set your 24-hour expiry via the CreatedExpiryPolicy. Next, you implement and register a CacheEntryExpiredListener to get a call when the entry is expired.
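A sketch of that wiring with the plain JSR107 API (the cache name and String value type are illustrative; the listener body just stands in for your processing):

    import javax.cache.Cache;
    import javax.cache.Caching;
    import javax.cache.configuration.FactoryBuilder;
    import javax.cache.configuration.MutableCacheEntryListenerConfiguration;
    import javax.cache.configuration.MutableConfiguration;
    import javax.cache.event.CacheEntryEvent;
    import javax.cache.event.CacheEntryExpiredListener;
    import javax.cache.event.CacheEntryListenerException;
    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;
    import java.io.Serializable;

    public class ExpiredHandler
            implements CacheEntryExpiredListener<String, String>, Serializable {
        @Override
        public void onExpired(Iterable<CacheEntryEvent<? extends String, ? extends String>> events)
                throws CacheEntryListenerException {
            for (CacheEntryEvent<? extends String, ? extends String> e : events) {
                System.out.println("process expired record: " + e.getKey()); // your processing here
            }
        }
    }

    MutableConfiguration<String, String> cfg = new MutableConfiguration<String, String>()
        .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_DAY))
        .addCacheEntryListenerConfiguration(new MutableCacheEntryListenerConfiguration<String, String>(
            FactoryBuilder.factoryOf(ExpiredHandler.class),
            null,      // no event filter
            true,      // request the old value so it can still be processed
            false));   // asynchronous delivery

    Cache<String, String> cache = Caching.getCachingProvider()
        .getCacheManager()
        .createCache("accumulator", cfg);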
The call on the CacheEntryExpiredListener may be lenient and implementation dependent. Actually, the event is triggered on the "eviction due to expiry". For example, one implementation may do a periodic scan and remove expired entries every 30 minutes. However, I think that "lag time" is adjustable in most implementations, so you will be able to operate within defined bounds.
Also check whether there are resource constraints on the event callbacks that you may run into, such as thread pools.
I mentioned Infinispan and Hazelcast for two reasons:
You may need the distribution capabilities.
Since you do long-running processing and store data that is not recoverable, you may need the persistence and fault-tolerance features. So I would say a simple in-memory cache like Google Guava is out of scope.
Good luck!
Does anybody know of a library or a good code sample that could be used to re-index all/some entities in all/some namespaces?
If I implement this on my own, is MapReduce what I should consider?
"I need to re-index?" feels like a problem many developers have run into, but the closest I could find is this, which may be a good start?
The other option is a home-brewed solution using Task Queues that iterate over the datastore namespaces and entities, but I'd prefer not to re-invent the wheel and instead go for a robust, proven solution.
What are the options?
I'm afraid I don't know of any pre-built system. I think you basically need to create a cursor to iterate through all your entities and then do a get and a put on all of them (or optionally check if they're in the index before doing the put - if you have some that won't need updating, that would save you a write at the cost of a read and/or a small operation).
Follow the example here:
https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify#Cursors
Create a java.util.concurrent.SynchronousQueue to hold batches of datastore keys.
Create 10 new consumer threads (the current limit) using ThreadManager:
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/ThreadManager
Those threads should do the following:
Create a new Objectify instance and turn off the session cache and memcache for Objectify.
Get a batch of keys from the SynchronousQueue.
Fetch all of those entities using a batch get.
Optionally do a keys-only query for all those entities using the relevant property.
Put all those entities (or exclude the ones that were returned above).
Repeat from step 2.
In a loop, fetch the next 30 keys using a keys-only cursor query and put them into the SynchronousQueue.
Once you've put all of the items into the SynchronousQueue, set a property to stop all the consumer threads once they've done their work.
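To make the shape concrete, here's a rough sketch of that producer/consumer loop (all datastore/Objectify calls are stand-ins, not working code):

    import com.google.appengine.api.ThreadManager;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.SynchronousQueue;

    public class ReindexJob {
        private static final List<String> STOP = Collections.emptyList(); // empty batch = stop signal

        public static void run() throws InterruptedException {
            final SynchronousQueue<List<String>> batches = new SynchronousQueue<List<String>>();

            // 10 consumer threads (the current limit), created via ThreadManager
            for (int i = 0; i < 10; i++) {
                ThreadManager.createThreadForCurrentRequest(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                List<String> keys = batches.take();
                                if (keys.isEmpty()) {
                                    return;            // stop signal received
                                }
                                reindex(keys);         // batch get + put (stand-in)
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }).start();
            }

            // producer: keys-only cursor query, 30 keys at a time (stand-in)
            List<String> keys;
            while (!(keys = nextKeyBatch(30)).isEmpty()) {
                batches.put(keys);
            }
            for (int i = 0; i < 10; i++) {
                batches.put(STOP);                     // tell every consumer to finish
            }
        }

        private static List<String> nextKeyBatch(int size) {
            return Collections.emptyList();            // replace with a keys-only cursor query
        }

        private static void reindex(List<String> keys) {
            // replace with: batch get, optional index check, then batch put
        }
    }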