Say I have 2 nodes with IPs 192.168.5.101 and 192.168.5.102. I'd like to launch the first one with a task that initializes a distributed map and, a couple of minutes later, the second one (on those two hosts). How should I configure them so they can see one another and share that map?
UPD. I had a glance at the Hazelcast docs and managed to run two instances with the following code:
Config config = new Config();
// disable multicast discovery; join over TCP/IP instead
config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
config.getNetworkConfig().getJoin().getTcpIpConfig()
        .addMember("192.168.4.101").addMember("192.168.4.102")
        .setRequiredMember("192.168.4.101").setEnabled(true);
// bind only to interfaces in the 192.168.4.* range
config.getNetworkConfig().getInterfaces().setEnabled(true).addInterface("192.168.4.*");
And somewhere further down:
HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance(config);
MultiMap<Long, Long> idToPids = hazelcast.getMultiMap("multiMapName");
IMap<Long, EntityDesc> idToDesc = hazelcast.getMap("mapName");
All of that is followed by some job-performing code.
I run this class on two different nodes; they successfully see each other and communicate (and even share resources, as far as I can tell).
But the problem is that the work on two nodes seems a lot slower than in the case of a single local node. What am I doing wrong?
One of the reasons for a slowdown is that the data used in the tasks (I don't know anything about them) could be stored on a different member than the one the task is running on. With a single-node cluster you don't have this problem, but with a multi-node cluster the map will be partitioned, so every member will only store a subset of the data.
Also, with a single node there is no backup, and therefore it is a lot faster than a true clustered setup (i.e. more than one member).
These are some of the obvious reasons why things could slow down, but without additional information it is very hard to guess the cause.
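To see how much of that comes from remote data access and backups, one option is to route each task to the member that owns its key and to experiment with the backup count. A minimal sketch, assuming Hazelcast 3.x imports and reusing a map name from the question; the task class and executor name are hypothetical, not the asker's code:

import java.io.Serializable;
import java.util.concurrent.Callable;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.IMap;

public class KeyOwnerExample {

    // Hypothetical task: must be Serializable so it can be shipped to the key owner.
    static class DescTask implements Callable<Long>, Serializable, HazelcastInstanceAware {
        private final long id;
        private transient HazelcastInstance hz;
        DescTask(long id) { this.id = id; }
        public void setHazelcastInstance(HazelcastInstance hz) { this.hz = hz; }
        public Long call() {
            // Runs on the member that owns key 'id', so this map access stays local.
            IMap<Long, Object> idToDesc = hz.getMap("mapName");
            return idToDesc.containsKey(id) ? 1L : 0L;
        }
    }

    public static void main(String[] args) throws Exception {
        Config config = new Config();
        // Zero sync backups removes the backup round trip, at the price of data loss if a member dies.
        config.getMapConfig("mapName").setBackupCount(0).setAsyncBackupCount(1);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        IExecutorService executor = hz.getExecutorService("tasks");
        long id = 42L;
        // Ship the work to the member owning the key instead of pulling the data over the network.
        Long result = executor.submitToKeyOwner(new DescTask(id), id).get();
        System.out.println("result = " + result);
    }
}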
I have 3 nodes of my application (Java Spring Boot) which use a Hazelcast IMap as a distributed cache. My logic requires frequently updating a value in the map, and I have implemented an EntryProcessor for it.
Whilst testing, I have created a Hazelcast cluster with 3 nodes.
What I noticed is the following:
If node 1 invokes the entry processor, it is not guaranteed that it will be executed on node 1; it is executed on any one of those 3 nodes. The same goes for the backup entry processor.
The same happens for the other 2 nodes.
Is there any way to ensure/enforce that the entry processor is executed on the node where it was invoked? I read through the documentation and could not find an answer for my question.
Thanks in advance.
The entry processor runs on nodes that host the primary copy of the entry and any backup copy.
The behaviour you are seeing is due to the data not being hosted in the same place from run to run. There is a degree of randomness.
This is normal, and what you want. Any attempts to "pin" the data to a specific place ALWAYS go wrong. Hazelcast, and distributed stores in general, need to be unconstrained in where they can place the data to balance the cluster.
You can run a Runnable/Callable on a specific node if you need predictability, but that's not the use-case here.
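For completeness, a minimal sketch of that Runnable route, assuming Hazelcast 3.x imports; the task class and executor name are made up:

import java.io.Serializable;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;

public class RunOnSpecificMemberExample {

    // Hypothetical task; must be Serializable so it can be sent across the cluster.
    static class RefreshTask implements Runnable, Serializable {
        public void run() {
            System.out.println("running on the member this task was sent to");
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("exec");

        // Pin the execution to this member explicitly ...
        Member local = hz.getCluster().getLocalMember();
        executor.executeOnMember(new RefreshTask(), local);

        // ... or let Hazelcast route it to wherever the key's owner happens to be.
        executor.executeOnKeyOwner(new RefreshTask(), "someKey");
    }
}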
I have a network of nodes represented as a graph (or, more specifically, a DAG). The vertices and edges are just IDs pointing to large objects in the cache.
I am using Hazelcast and have these caches:
1. ReferenceObject for the graph
2. IMap for the large objects
3. IMap for the edges in the graph
When I insert a large object, I have an entry listener that updates this graph in the cache. Similarly, when I add edge data, there is also an entry listener that updates the graph.
However, I have one problem: if I create an edge that introduces a cycle, the update fails (as it's a DAG), but the IMap retains the records.
Any ideas on how I can have transactions across the main thread and the entry listener?
@Pilo, the problem is that an EntryListener listens to events fired after the data has already been populated in the map. So when you insert the data into your first map and listen for an update event, the data is already in the first map.
You can either:
Manually remove the record from the first map if the operation fails on the second one, or
Use transactions and make sure that either all or none of the maps are updated, instead of using listeners (see the sketch below).
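A minimal sketch of the transactional option, assuming Hazelcast 3.x imports; the map names (objects, edges) and the cycle check are placeholders, not the asker's real code:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.TransactionalMap;
import com.hazelcast.transaction.TransactionContext;

public class EdgeInsertTxExample {

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        TransactionContext tx = hz.newTransactionContext();
        tx.beginTransaction();
        try {
            TransactionalMap<String, String> objects = tx.getMap("objects");
            TransactionalMap<String, String> edges = tx.getMap("edges");

            objects.put("nodeA", "large-object-A");
            edges.put("A->B", "edge-data");

            if (createsCycle("A->B")) {
                // abort: neither map keeps the new records
                throw new IllegalStateException("edge would create a cycle");
            }
            tx.commitTransaction();
        } catch (Exception e) {
            tx.rollbackTransaction();
        }
    }

    // hypothetical cycle check standing in for the real DAG validation
    private static boolean createsCycle(String edge) {
        return false;
    }
}

With this approach the entry listeners are no longer needed to keep the maps consistent.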
Though it's a completely different approach, have you looked at Hazelcast Jet? It's a DAG-based event stream processing engine built on top of Hazelcast IMDG. It might fit your use case better and take care of the lower-level stuff for you.
https://jet.hazelcast.org
You would have a Jet cluster, which is also a Hazelcast cluster, but you get all the processing stuff with it. It extends the Java Streams programming model, so you just write your app as if it were a Java stream and run it on the cluster. Something to think about anyway.
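For a flavour of what that looks like, here is a rough sketch against the classic Jet pipeline API (drawFrom/drainTo; later Jet versions renamed these to readFrom/writeTo); the map names are made up:

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Util;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class JetSketch {

    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        Pipeline p = Pipeline.create();
        // read all entries of an IMap, transform them, write them to another IMap
        p.drawFrom(Sources.<Long, String>map("objects"))
         .map(e -> Util.entry(e.getKey(), e.getValue().toUpperCase()))
         .drainTo(Sinks.map("processedObjects"));

        jet.newJob(p).join();
        jet.shutdown();
    }
}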
I have a Hazelcast cluster that performs several calculations for a Java client, triggered from the command line. I need to persist parts of the calculated results on the client system while the nodes are still working. I am going to store parts of the data in Hazelcast maps. Now I am looking for a way to inform the client that a node has stored data in the map and that it can start using it. Is there a way to trigger client operations from any Hazelcast node?
Your question is not very clear, but it looks like you could use a com.hazelcast.core.EntryListener to trigger a callback that notifies the client when a new entry is stored in the data map.
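A minimal sketch of that idea on the client side, assuming Hazelcast 3.x imports; the map name (results) and value type are made up:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.EntryAdapter;
import com.hazelcast.core.EntryEvent;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class ClientListenerExample {

    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> results = client.getMap("results");

        // EntryAdapter implements EntryListener; we only care about additions here
        results.addEntryListener(new EntryAdapter<String, String>() {
            @Override
            public void entryAdded(EntryEvent<String, String> event) {
                // a member has stored a partial result; the client can start using it
                System.out.println("partial result available: " + event.getKey());
            }
        }, true); // true = deliver the value along with the event
    }
}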
Your member node can publish intermediate results (or just a notification message) to a Hazelcast IQueue, ITopic or Ringbuffer.
A flow looks like this:
A client registers a listener for, say, the Ringbuffer.
The client submits a command to perform on the cluster.
A member persists intermediate results to IMaps or any other data structure.
The member sends a message to the topic about the availability of partial results.
The client receives the messages and accesses the data in the IMap.
The member sends a message when it is done with its task.
Something like that.
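A minimal sketch of the topic-based variant, assuming Hazelcast 3.x imports; the topic and map names are made up, and both sides are shown in one program only for brevity:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Message;
import com.hazelcast.core.MessageListener;

public class TopicNotificationExample {

    public static void main(String[] args) {
        HazelcastInstance member = Hazelcast.newHazelcastInstance();

        // client side: listen for announcements and fetch the data from the map
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        client.<String>getTopic("partial-results").addMessageListener(new MessageListener<String>() {
            @Override
            public void onMessage(Message<String> message) {
                Object partial = client.getMap("results").get(message.getMessageObject());
                System.out.println("got partial result: " + partial);
            }
        });

        // member side: store a partial result, then announce its key on the topic
        member.getMap("results").put("job-42/part-1", "partial result data");
        member.<String>getTopic("partial-results").publish("job-42/part-1");
    }
}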
You can find some examples here
Let me know if you have any questions about it.
Cheers,
Vik
There are several ways to solve the problem. The simplest one is using a dedicated IMap or any other of Hazelcast's synchronized collections. One can simply write data into such a map and retrieve/remove it after it has been added. But this causes a huge overhead, because the data has to be synchronized throughout the cluster. If the data is quite big and the cluster is huge, with a few hundred nodes all over the world or at least the USA, the data will be synchronized across all nodes just to be deleted a few moments later, which also has to be synchronized. Not deleting is not an option, because the data can grow to several GB, which makes synchronization of the data even more expensive. The question got answered, but the solution is not suited for every scenario.
We are working on a distributed data processing system, and Hazelcast is one of the components we are using.
We have streaming data input coming into the cluster, and we have to process the data (update/accumulate etc.). There is a distributed request map, which has local entry listeners. We process a new request (update/accumulate in memory) and put it into another distributed map, which is the actual data grid.
Thus we can process each request concurrently without locking. However, putting data into the main data grid might involve a network trip.
Is there a way I can explicitly specify which node should be selected? Basically, I would want to put it into the local map of the data grid. This should improve overall throughput by avoiding the network trip.
By using a partition-aware key, I can specify that all such keys go to the same partition; however, I am looking to actually 'specify' the partition. Is this possible?
You can create a key for a specific partition. We do this often for testing.
Once you have created such a key for every partition, you can use
map.put("yourkey#partitionkey", value)
Check out the git repo and look for HazelcastTestSupport.generateKeyOwnedBy(hz).
Important: a partition may belong to a member at some point in time, but partitions can move around in the system, e.g. when a member joins or leaves the cluster, so the solution could be fragile.
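If the goal is co-locating related entries rather than targeting a concrete partition ID, a partition-aware key is usually enough and survives rebalancing. A minimal sketch with made-up class and map names, assuming Hazelcast 3.x imports:

import java.io.Serializable;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.PartitionAware;

public class PartitionAwareKeyExample {

    // Hypothetical key: Hazelcast routes it by routingKey instead of by the whole key,
    // so all entries sharing the same routingKey land in the same partition.
    static class RequestKey implements PartitionAware<String>, Serializable {
        private final String requestId;
        private final String routingKey;

        RequestKey(String requestId, String routingKey) {
            this.requestId = requestId;
            this.routingKey = routingKey;
        }

        @Override
        public String getPartitionKey() {
            return routingKey;
        }

        @Override
        public int hashCode() { return requestId.hashCode(); }

        @Override
        public boolean equals(Object o) {
            return o instanceof RequestKey && ((RequestKey) o).requestId.equals(requestId);
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // both entries end up in the partition derived from "node-7"
        hz.getMap("datagrid").put(new RequestKey("req-1", "node-7"), "value-1");
        hz.getMap("datagrid").put(new RequestKey("req-2", "node-7"), "value-2");
    }
}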
We are writing a high-load order processing engine. Every cluster node processes some set of contracts and writes an action log to a local file. This file should be distributed among some other nodes (for fault tolerance). If a node fails, there should be a way to restore its state on one of the replication nodes as fast as possible. Currently we use Cassandra, but there are some problems with the partitioner: there is no way to specify which nodes should be used for a specific table.
So we need to replicate the file. Is there a solution?
Edit: peak load will be about 200k records per second.
With respect to your Cassandra issue: while you can't have a different replication layout per table/column family, you can have a different layout per keyspace. This includes a case like yours, where it sounds like you want some set of nodes S1 to be wholly responsible for some parts of the data, and some other set S2 to be responsible for another part.
If you represent S1 and S2 as different datacenters to Cassandra (via PropertyFileSnitch or whatever), then you can configure, say, keyspace K1 to have X copies on S1 and none on S2, and vice versa for keyspace K2.