Hazelcast entryprocessor - java

I have 3 nodes of my application (Java Spring Boot) which use a Hazelcast IMap as a distributed cache. My logic requires frequently updating a value in the map, and I have implemented an EntryProcessor for it.
Whilst testing, I created a Hazelcast cluster with 3 nodes.
What I noticed is the following:
If node 1 invokes the entry processor, it is not guaranteed that it will be executed on node 1. It is executed on any one of the 3 nodes. The same applies to the backup entry processor.
The same happens for the other 2 nodes.
Is there any way to ensure/enforce that the entry processor is executed on the node where it was invoked? I read through the documentation and could not find an answer to my question.
Thanks in advance.

The entry processor runs on nodes that host the primary copy of the entry and any backup copy.
The behaviour you are seeing is due to the data not being hosted in the same place from run to run. There is a degree of randomness.
This is normal, and what you want. Any attempts to "pin" the data to a specific place ALWAYS go wrong. Hazelcast, and distributed stores in general, need to be unconstrained in where they can place the data to balance the cluster.
You can run a Runnable/Callable on a specific node if you need predictability, but that's not the use-case here.
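As an illustration, here is a minimal sketch (Hazelcast 4.x/5.x API assumed; the map name "counters" and the increment logic are made up) showing that executeOnKey() runs the processor on whichever member currently owns the key's partition, regardless of which member makes the call:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.EntryProcessor;
import com.hazelcast.map.IMap;

import java.io.Serializable;
import java.util.Map;

public class EntryProcessorLocality {

    // Increments the value in place; Hazelcast runs it on the member owning the key's partition.
    static class IncrementProcessor implements EntryProcessor<String, Long, Long>, Serializable {
        @Override
        public Long process(Map.Entry<String, Long> entry) {
            entry.setValue(entry.getValue() + 1);
            return entry.getValue();
        }
        // The default getBackupProcessor() re-applies this processor on the backup replica.
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Long> counters = hz.getMap("counters");
        counters.put("visits", 0L);

        // Runs on whichever member owns the partition of "visits",
        // not necessarily on the member that issues this call.
        Long updated = counters.executeOnKey("visits", new IncrementProcessor());
        System.out.println("new value = " + updated);
    }
}

If you genuinely need member-level control, an IExecutorService task targeted at a specific member is the tool for that, but as said, it is not the right fit for per-entry updates.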

Related

Multiple Operations on Multiple Sets (Tables) in Aerospike cluster

Current system state:
Currently, I maintain three sets (the equivalent of tables in an RDBMS) in my Aerospike namespace (the equivalent of a database in an RDBMS), backed by a RESTful service.
Use-case:
I want to perform CRUD operations on at least one set, and sometimes on all of them, based on some bulk inputs into my system.
Expectation:
I want to perform all these CRUD operations in an atomic manner (meaning either all happen or none. This also covers the edge case where some sets are successfully updated with their respective latest changes and then even a single set fails; in that case I would want to roll my data back to the previous state in each set.)
My workaround:
First I tried to find the equivalent of InsertOnSubmit in Aerospike so I could use the approaches explained in this answer on Stack Overflow, but it seems that doesn't exist.
Second, I thought of creating an intermediate rollback workflow module. Pseudocode shown below:
1. Temporarily save the new data in some data type, segregated set-wise.
2. Loop through the set-wise data, pick the primary key from each, fetch the older data from Aerospike, and save it into some other data type, again segregated set-wise.
3. Loop through all the sets one by one from the first data type and perform the CRUD operations accordingly. IF [everything runs till the end]: GOTO step 6; ELSE: GOTO step 4.
4. Start rolling back by looping through all the sets one by one from the second data type and performing the CRUD operations. IF [everything runs till the end]: GOTO step 7; ELSE: GOTO step 5.
5. Log the error, including all the details, and report it to the alert system. Someone will get paged to have a look. GOTO step 7.
6. Terminate, operation successful.
7. Terminate, operation unsuccessful.
Help Needed:
Is there any chance to incorporate InsertOnSubmit behaviour on an Aerospike cluster without creating my own rollback workflow?
If not, is there any better way to optimize my second approach?
1 - No. Aerospike offers atomicity only at the single-record level. While inserting the master record and then replicating its copy to another node does follow true two-phase commit semantics in Aerospike's Strong Consistency (SC) mode, any multi-record transaction has to be implemented at the application level.
2 - Any scheme implementing multi-record transactions, such as the one you are thinking of, typically involves creating some kind of "lock" bin in a record that you set, performing the multi-record updates, building a before-and-after state of your data, and enforcing some maximum time to complete so the client application can roll back and clear abandoned operations and locks. Any such scheme will only work reliably under Aerospike's Strong Consistency mode.
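As a rough sketch only (the namespace "test", the lock set "txn_locks", the TTL, and the helper methods are assumptions for illustration, not Aerospike features), the lock-record part of such a scheme could look like this with the Aerospike Java client; as noted above, it only behaves reliably under Strong Consistency:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.AerospikeException;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.ResultCode;
import com.aerospike.client.policy.RecordExistsAction;
import com.aerospike.client.policy.WritePolicy;

public class MultiSetTxnSketch {

    private final AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

    public void updateAllSetsAtomically(String txnId) {
        Key lockKey = new Key("test", "txn_locks", txnId);

        WritePolicy createOnly = new WritePolicy();
        createOnly.recordExistsAction = RecordExistsAction.CREATE_ONLY;
        createOnly.expiration = 60; // max seconds to complete; abandoned locks expire

        try {
            // Acquire the "lock" by creating a lock record that must not already exist.
            client.put(createOnly, lockKey, new Bin("owner", "etl-service"));
        } catch (AerospikeException e) {
            if (e.getResultCode() == ResultCode.KEY_EXISTS_ERROR) {
                throw new IllegalStateException("Transaction already in progress: " + txnId);
            }
            throw e;
        }

        try {
            saveBeforeImages(txnId);    // snapshot the current records, segregated set-wise
            applyUpdatesToAllSets();    // the actual CRUD across the three sets
        } catch (RuntimeException updateFailed) {
            restoreBeforeImages(txnId); // roll every set back to the saved state
            throw updateFailed;
        } finally {
            client.delete(null, lockKey); // release the lock either way
        }
    }

    // Hypothetical helpers; how before-images are persisted is application-specific.
    private void saveBeforeImages(String txnId) { /* ... */ }
    private void applyUpdatesToAllSets() { /* ... */ }
    private void restoreBeforeImages(String txnId) { /* ... */ }
}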

how to have transactions across main thread and entry listener in hazelcast?

I have a network of nodes represented with a graph (or more specifically a DAG). The vertices and edges are just ids pointing to large objects in the cache.
I am using Hazelcast and have 3 caches:
1. ReferenceObject for the graph
2. IMap for the large objects
3. IMap for the edges in the graph
When I insert a large object, I have an entry listener that updates this graph in the cache. Similarly, when I add edge data, there is also an entry listener that updates the graph.
However, I have one problem: if I create an edge that introduces a cycle, the update fails (as it's a DAG), but the IMap retains the records.
Any ideas how I can have transactions across the main thread and the entry listener?
@Pilo, the problem is that an EntryListener listens to events fired after the data has already been put into the map. So when you insert the data into your first map and listen to an update event, the data is already in the first map.
You can either
manually remove the record from the first map if the operation fails on the second one, or
use transactions and make sure either all or none of the maps are updated, instead of using listeners (see the sketch below).
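A minimal sketch of that second option, assuming the Hazelcast transaction API (4.x/5.x names) and modelling the graph as a one-entry transactional map, since TransactionContext covers maps rather than references; the map names and DAG helpers are placeholders:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.transaction.TransactionContext;
import com.hazelcast.transaction.TransactionalMap;

public class AddEdgeTransactionally {

    public static void addEdge(HazelcastInstance hz, String edgeId, String fromId, String toId) {
        TransactionContext ctx = hz.newTransactionContext();
        ctx.beginTransaction();
        try {
            TransactionalMap<String, String> edges = ctx.getMap("edges");
            TransactionalMap<String, Object> graph = ctx.getMap("graph");

            edges.put(edgeId, fromId + "->" + toId);
            Object updatedGraph = rebuildGraphWithEdge(graph.get("dag"), fromId, toId);
            if (hasCycle(updatedGraph)) {
                // Abort: neither the new edge nor the graph update becomes visible.
                throw new IllegalArgumentException("edge would create a cycle");
            }
            graph.put("dag", updatedGraph);

            ctx.commitTransaction();
        } catch (RuntimeException e) {
            ctx.rollbackTransaction();
            throw e;
        }
    }

    // Hypothetical stand-ins for the application's DAG logic.
    private static Object rebuildGraphWithEdge(Object dag, String from, String to) { return dag; }
    private static boolean hasCycle(Object dag) { return false; }
}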
Though it's a completely different approach, have you looked at Hazelcast Jet? It's a DAG-based event stream processing engine built on top of Hazelcast IMDG. It might fit your use case better and take care of the lower-level stuff for you.
https://jet.hazelcast.org
You would have a Jet cluster, which is also a Hazelcast cluster, but you get all the processing machinery with it. It extends the Java Streams programming model, so you just write your app as if it were a Java stream and run it on the cluster. Something to think about anyway.
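Just to make the programming model concrete, a toy pipeline might look like this (Jet 4.x API assumed; the map name and logging sink are illustrative only):

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.util.Map;

public class ToyJetJob {
    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        Pipeline p = Pipeline.create();
        p.readFrom(Sources.<String, String>map("large-objects")) // batch-read an IMap
         .map(Map.Entry::getKey)                                 // plain stream-style transform
         .writeTo(Sinks.logger());                               // log the results on the cluster

        jet.newJob(p).join(); // runs distributed across the Jet/Hazelcast cluster
    }
}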

Force put objects to Hazelcast local map

We are working on a distributed data processing system, and Hazelcast is one of the components we are using.
We have streaming data coming into the cluster, and we have to process it (update/accumulate, etc.). There is a distributed request map, which has local entry listeners. We process a new request (update/accumulate in memory) and put it into another distributed map, which is the actual data grid.
Thus we can process each request concurrently without locking. However, putting the data into the main data grid might involve a network trip.
Is there a way I can force/specify which node is selected? Basically, I would want to put it into the local map of the data grid. This should improve the overall throughput by avoiding the network trip.
By using a partition-aware key, I can specify that all such keys go to the same partition; however, I am looking to actually 'specify' the partition. Is this possible?
You can create a key for a specific partition. We do this often for testing.
Once you have created such a key for every partition, you can use
map.put("yourkey#partitionkey", value)
Check out the git repo and look for HazelcastTestSupport.generateKeyOwnedBy(hz).
Important: a partition may belong to a member at some point in time, but partitions can move around in the system, e.g. when a member joins or leaves the cluster, so this solution could be fragile.
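If you cannot use that test helper, the same idea can be brute-forced through the public PartitionService API (a sketch assuming Hazelcast 5.x package names; the key prefix is arbitrary). As the warning above says, ownership is only valid until the next rebalance:

import com.hazelcast.cluster.Member;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.partition.PartitionService;

public class LocalKeys {

    // Returns a String key that, at this moment, maps to a partition owned by the local member.
    public static String keyOwnedByLocalMember(HazelcastInstance hz) {
        Member local = hz.getCluster().getLocalMember();
        PartitionService partitions = hz.getPartitionService();
        for (int i = 0; ; i++) {
            String candidate = "key-" + i;
            Member owner = partitions.getPartition(candidate).getOwner();
            if (local.equals(owner)) {
                return candidate; // map.put(candidate, value) stays on this member, for now
            }
        }
    }
}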

How do I run two Hazelcast nodes that share single data structure?

Say I have 2 nodes with IPs 192.168.5.101 and 192.168.5.102. I'd like to launch first one with some task initializing a distributed map and, in a couple of minutes, the second one (on those two hosts). How should I configure them to be able to see one another and to share that Map?
UPD. I had a glance at the Hazelcast docs and managed to run two instances with the following code:
Config config = new Config();
config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
config.getNetworkConfig().getJoin().getTcpIpConfig()
        .addMember("192.168.4.101")
        .addMember("192.168.4.102")
        .setRequiredMember("192.168.4.101")
        .setEnabled(true);
config.getNetworkConfig().getInterfaces().setEnabled(true).addInterface("192.168.4.*");
And somewhere further:
HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance(config);
MultiMap<Long, Long> idToPids = hazelcast.getMultiMap("mapName");
IMap<Long, EntityDesc> idToDesc = hazelcast.getMap("multiMapName");
All that followed by some job-performing code.
I run this class on two different nodes; they successfully see each other and communicate (and even share the resources, as far as I can tell).
But the problem is that the work on two nodes seems a lot slower than in the case of a single local node. What am I doing wrong?
One of the reasons for a slowdown is that the data used in the tasks (I don't know anything about them) could be stored on a different member than the one the task is running on. With a single-node cluster you don't have this problem, but with a multi-node cluster the map is partitioned, so every member only stores a subset of the data.
Also, with a single node there is no backup, and therefore it is a lot faster than in a true clustered setup (i.e. more than one member).
These are some of the obvious reasons why things could slow down, but without additional information it is very hard to guess the actual cause.
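One common mitigation, sketched here under assumptions (Hazelcast's IExecutorService; the map and executor names and the task body are illustrative), is to send the task to the member that owns the key, so the data access stays local instead of pulling entries over the network:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.map.IMap;

import java.io.Serializable;

public class ProcessWhereTheDataLives {

    static class TouchEntry implements Runnable, Serializable, HazelcastInstanceAware {
        private final Long id;
        private transient HazelcastInstance hz;

        TouchEntry(Long id) { this.id = id; }

        @Override
        public void setHazelcastInstance(HazelcastInstance hz) { this.hz = hz; }

        @Override
        public void run() {
            // Runs on the member owning the partition of 'id', so this get() is local.
            IMap<Long, Object> idToDesc = hz.getMap("idToDesc");
            Object desc = idToDesc.get(id);
            // ... process desc in place ...
        }
    }

    public static void process(HazelcastInstance hz, Long id) {
        IExecutorService exec = hz.getExecutorService("workers");
        exec.executeOnKeyOwner(new TouchEntry(id), id);
    }
}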

Hazelcast write-through map store pre-population on startup

I am currently doing a POC for developing a distributed, fault-tolerant ETL ecosystem. I have selected Hazelcast for my clustering (data + notification) purposes. Googling through Hazelcast resources took me to this link, and it exactly matches how I was thinking of going about it, using a map-based solution.
I need to understand one point. Before that, allow me to give a canonical idea of our architecture:
Say we have 2 nodes, A and B, running our server instances clustered through Hazelcast. One of them, say A, is a listener accepting requests (but this can change on a failover).
A gets a request and puts it into a distributed map. This map is write-through, backed by a persistent store, and a single in-memory backup is configured on the nodes.
Each instance has a local map entry listener which, on an entryAdded event, processes that entry (asynchronously, via a queue) and then removes it from the distributed map.
This is working as expected.
Question:
Say 10 requests have been received and distributed with 5 on each node. 2 entries on each node have been processed, and now both instances crash.
So there are a total of 6 entries present in the backing datastore now.
Now we bring up both the instances. As per documentation - "As of 1.9.3 MapLoader has the new MapLoader.loadAllKeys API. It is used for pre-populating the in-memory map when the map is first touched/used"
We implement loadAllKeys() by simply loading all the key values present in the store.
So does that mean there is a possibility that both instances will now load the 6 entries and process them (thus resulting in duplicate processing)? Or is it handled in a synchronized way so that loading is done only once in the cluster?
On server startup I need to process the pending entries (if any). I see that the data is loaded, but the entryAdded event is not fired. How can I make the entryAdded event fire (or is there any other elegant way by which I will know that there are pending entries on startup)?
Requesting suggestions.
Thanks,
Sutanu
On initialization, loadAllKeys() will be called, which will return all 6 keys in the persistent store. Then each node will select only the keys it owns and load them. So A might load 2 entries, while B loads the remaining 4.
store.load doesn't fire entry listeners. How about this: right after initialization, after registering your listener, you can get the local entries and process the existing ones (a sketch follows below).
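A possible shape for that suggestion (Hazelcast 4.x/5.x names assumed; the map name and the processing logic are placeholders). Processing should be idempotent, since an entry arriving between the two steps could be seen twice:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import com.hazelcast.map.listener.EntryAddedListener;

public class RequestMapBootstrap {

    public static void start(HazelcastInstance hz) {
        IMap<String, String> requests = hz.getMap("requests");

        // 1. Listen for new requests landing on this member from now on.
        requests.addLocalEntryListener(
                (EntryAddedListener<String, String>) event -> process(event.getKey(), event.getValue()));

        // 2. Sweep entries already present locally (e.g. loaded by the MapLoader on startup).
        for (String key : requests.localKeySet()) {
            process(key, requests.get(key));
        }
    }

    private static void process(String key, String value) {
        // ... handle the pending request, then remove it from the map ...
    }
}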
