Force put objects to Hazelcast local map - java

We are working on a distributed data processing system, and Hazelcast is one of the components we are using.
We have a streaming data input coming into the cluster, and we have to process the data (update/accumulate etc.). There is a distributed request map, which has local entry listeners. We process a new request (update/accumulate in memory) and put the result into another distributed map, which is the actual datagrid.
Thus we can process each request concurrently without locking. However, putting the data into the main datagrid might involve a network trip.
Is there a way I can explicitly specify which node is selected? Basically, I would want to put the entry into the local portion of the datagrid map. This should improve the overall throughput by avoiding the network trip.
By using a partition-aware key, I can specify that all such keys go to the same partition; however, I am looking to actually 'specify' the partition. Is this possible?

You can create a key for a specific partition. We do this often for testing.
Once you have created such a key for every partition, you can use
map.put("yourkey#partitionkey", value)
Check out the Git repo and look for HazelcastTestSupport.generateKeyOwnedBy(hz).
Important: a partition may belong to a member at one point in time, but partitions can move around in the system, e.g. when a member joins or leaves the cluster, so this solution could be fragile.
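For illustration, here is a minimal sketch of that idea without the test-support dependency: it probes candidate keys until one hashes to a partition owned by the local member, similar in spirit to generateKeyOwnedBy. It assumes Hazelcast 3.x package names; the map name and key prefix are made up, and the same caveat about partition migration applies.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.core.Member;
    import com.hazelcast.core.Partition;
    import com.hazelcast.core.PartitionService;

    public class LocalPutExample {

        // Probe candidate keys until one hashes to a partition owned by this member,
        // similar in spirit to HazelcastTestSupport.generateKeyOwnedBy(hz).
        static String locallyOwnedKey(HazelcastInstance hz, String base) {
            PartitionService partitionService = hz.getPartitionService();
            for (int i = 0; ; i++) {
                String candidate = base + "#" + i;
                Partition partition = partitionService.getPartition(candidate);
                Member owner = partition.getOwner();
                if (owner != null && owner.localMember()) {
                    return candidate;
                }
            }
        }

        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> datagrid = hz.getMap("datagrid"); // hypothetical map name

            // This put lands on a locally owned partition, so no network hop is needed
            // (until the partition migrates to another member).
            datagrid.put(locallyOwnedKey(hz, "request-42"), "value");
        }
    }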

Related

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

I have a pipeline that takes URLs for files and downloads them, generating BigQuery table rows for each line apart from the header.
To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.
For this to work I need to either store the history in a database that enforces unique values, or (perhaps easier) use BigQuery for this as well, but then access to the table must be strictly serial.
Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?
(After this point, each of my hundreds of URLs/files would be suitable for processing on a separate thread; each single file gives rise to 10,000-10,000,000 rows, so throttling at that point will almost certainly not cause performance issues.)
Beam is designed for parallel processing of data, and it deliberately stops you from synchronizing or blocking except through a few built-in primitives, such as Combine.
It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.
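As a rough sketch with the Beam Java SDK (the in-memory source and URLs below are just placeholders for your real input), Distinct can be dropped into the pipeline before the download step:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Distinct;
    import org.apache.beam.sdk.values.PCollection;

    public class DistinctUrls {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            // In-memory placeholder for the real URL input.
            PCollection<String> urls = p.apply(Create.of(
                    "https://example.com/a.csv",
                    "https://example.com/a.csv", // duplicate, dropped by Distinct
                    "https://example.com/b.csv"));

            // Distinct emits each URL exactly once across the whole PCollection.
            PCollection<String> uniqueUrls = urls.apply(Distinct.<String>create());

            // ... downstream: download each unique URL and write its rows to BigQuery.

            p.run().waitUntilFinish();
        }
    }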

how to have transactions across main thread and entry listener in hazelcast?

I have a network of nodes represented as a graph (or, more specifically, a DAG). The vertices and edges are just IDs pointing to large objects in the cache.
I am using Hazelcast and have 3 caches:
1. ReferenceObject for the graph
2. IMap for the large objects
3. IMap for the edges in the graph
When I insert a large object, I have an entry listener that will update this graph in the cache. Similarly, when I add edge data, there is also an entry listener that will update the graph.
However, I have one problem: if I create an edge and it creates a cycle, the operation fails (as it's a DAG), but the IMap retains the records.
Any ideas how I can have transactions across the main thread and the entry listener?
@Pilo, the problem is that an EntryListener listens to events fired after the data has already been populated in the map. So when you insert the data into your first map and listen for an update event, the data is already in the first map.
You can either:
Manually remove the record from the first map if the operation fails on the second one, or
Use transactions and make sure either all of the maps are updated or none are, instead of using listeners (a sketch follows below).
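A minimal sketch of the transactional option, assuming Hazelcast 3.x packages, made-up map names, and a placeholder cycle check:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.TransactionalMap;
    import com.hazelcast.transaction.TransactionContext;

    public class GraphUpdateTx {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            TransactionContext ctx = hz.newTransactionContext();
            ctx.beginTransaction();
            try {
                // Hypothetical map names standing in for the object and edge caches.
                TransactionalMap<String, String> objects = ctx.getMap("large-objects");
                TransactionalMap<String, String> edges = ctx.getMap("edges");

                objects.put("node-42", "payload");
                edges.put("edge-7", "node-42->node-43");

                // Placeholder for the DAG validation; if it throws,
                // both puts are rolled back together.
                validateNoCycle();

                ctx.commitTransaction();
            } catch (RuntimeException e) {
                ctx.rollbackTransaction();
                throw e;
            }
        }

        private static void validateNoCycle() {
            // application-specific cycle detection would go here
        }
    }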
Though it's a completely different approach, have you looked at Hazelcast Jet? It's a DAG-based event stream processing engine built on top of Hazelcast IMDG. It might fit your use case better and take care of the lower-level stuff for you.
https://jet.hazelcast.org
You would have a Jet cluster, which is also a Hazelcast cluster, but you get all the processing machinery with it. It extends the Java Streams programming model, so you just write your app as if it were a Java stream and run it on the cluster. Something to think about, anyway.

Contact Java client from a Hazelcast node

I have a Hazelcast cluster that performs several calculations for a Java client triggered from the command line. I need to persist parts of the calculated results on the client system while the nodes are still working. I am going to store parts of the data in Hazelcast maps. Now I am looking for a way to inform the client that a node has stored data in the map and that it can start using it. Is there a way to trigger client operations from any Hazelcast node?
Your question is not very clear, but it looks like you could use com.hazelcast.core.EntryListener to trigger a callback that notifies the client when a new entry is stored in the data map.
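For example, a sketch of the client-side registration using the standard addEntryListener API (the map name "results" is made up):

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.map.listener.EntryAddedListener;

    public class ResultListenerClient {
        public static void main(String[] args) {
            // Connect as a Hazelcast client rather than a full member.
            HazelcastInstance client = HazelcastClient.newHazelcastClient();

            // "results" is a made-up name for the map the members write to.
            IMap<String, String> results = client.getMap("results");

            // Fires on the client whenever any member adds an entry;
            // 'true' asks Hazelcast to include the new value in the event.
            results.addEntryListener((EntryAddedListener<String, String>) event ->
                    System.out.println("New result: " + event.getKey() + " -> " + event.getValue()),
                    true);

            // ... keep the client alive and start consuming the stored data here.
        }
    }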
Your member node can publish some intermediate results (or just a notification message) to a Hazelcast IQueue, ITopic or Ringbuffer.
A flow could look like this:
1. The client registers a listener for, say, a ringbuffer.
2. The client submits a command to perform on the cluster.
3. A member persists intermediate results to IMaps or any other data structure.
4. A member sends a message to the topic about the availability of partial results.
5. The client receives the message and accesses the data in the IMap.
6. A member sends a message when it's done with its task.
Something like that.
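A minimal sketch of the topic-based variant of that flow (Hazelcast 3.x API; the topic/map names and keys are invented):

    import com.hazelcast.client.HazelcastClient;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ITopic;

    public class PartialResultNotification {

        // --- member side: persist a partial result, then announce it ---
        static void publishPartialResult(HazelcastInstance member, String key, Object partial) {
            member.getMap("results").put(key, partial);            // hypothetical map name
            ITopic<String> topic = member.getTopic("partial-results");
            topic.publish(key);                                    // tell listeners which key is ready
        }

        // --- client side: listen for announcements and fetch the data ---
        public static void main(String[] args) {
            HazelcastInstance client = HazelcastClient.newHazelcastClient();
            client.<String>getTopic("partial-results").addMessageListener(message -> {
                String key = message.getMessageObject();
                Object partial = client.getMap("results").get(key);
                System.out.println("Partial result ready: " + key + " -> " + partial);
            });
            // ... keep the client alive while the cluster works.
        }
    }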
You can find some examples here
Let me know if you have any questions about it.
Cheers,
Vik
There are several paths to solve this problem. The simplest one is to use a dedicated IMap or any other of Hazelcast's synchronized collections: one can simply write data into such a map and retrieve/remove it after it has been added. But this causes a large overhead, because the data has to be synchronized throughout the cluster. If the data is quite big and the cluster is huge, with a few hundred nodes all over the world or at least the USA, the data will be synchronized across all nodes just to be deleted a few moments later, and the deletion also has to be synchronized. Not deleting is not an option, because the data can grow to several GB, which would make synchronizing it even more expensive. The question has been answered, but that solution is not suited to every scenario.

Hazelcast write-through map store pre-population on startup

I am currently doing a POC for developing a distributed, fault-tolerant ETL ecosystem. I have selected Hazelcast for my clustering (data+notification) purposes. Googling through Hazelcast resources took me to this link, and it exactly matches how I was thinking of going about it, using a map-based solution.
I need to understand one point. Before that, allow me to give a canonical idea of our architecture:
Say we have 2 nodes, A and B, running our server instances clustered through Hazelcast. One of them, say A, is a listener accepting requests (but this can change on a failover).
A gets a request and puts it into a distributed map. This map is write-through, backed by a persistent store, and a single in-memory backup is configured on the nodes.
Each instance has a local map entry listener which, on an entryAdded event, would (asynchronously, via a queue) process that entry and then remove it from the distributed map.
This is working as expected.
Question:
Say 10 requests have been received and distributed, with 5 on each node. 2 entries on each node have been processed, and now both instances crash.
So there are a total of 6 entries present in the backing datastore now.
Now we bring up both the instances. As per documentation - "As of 1.9.3 MapLoader has the new MapLoader.loadAllKeys API. It is used for pre-populating the in-memory map when the map is first touched/used"
We implement loadAllKeys() by simply loading all the key values present in the store.
So does that mean there is a possibility that both instances will now load the 6 entries and process them (thus resulting in duplicate processing)? Or is it handled in a synchronized way so that loading is done only once per cluster?
On server startup I need to process the pending entries (if any). I see that the data is loaded; however, the entryAdded event is not fired. How can I make the entryAdded event fire (or is there any other elegant way by which I will know that there are pending entries on startup)?
Requesting suggestions.
Thanks,
Sutanu
On initialization, loadAllKeys() will be called, which will return all 6 keys in the persistent store. Then each node will select the keys it owns and load only those. So A might load 2 entries, while B loads the remaining 4.
store.load() doesn't fire entry listeners. How about this: right after initialization, after registering your listener, you can get the local entries and process the existing ones.
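A sketch of that ordering (register the listener first, then sweep the locally owned keys via localKeySet(); the map name and processing logic are placeholders, and processing should be idempotent in case an entry is seen by both paths):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.map.listener.EntryAddedListener;

    public class StartupRecovery {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // "requests" is a made-up name for the write-through map.
            IMap<String, String> requests = hz.getMap("requests");

            // 1. Register the listener first so no new entry is missed.
            requests.addEntryListener(
                    (EntryAddedListener<String, String>) e -> process(e.getKey(), e.getValue()),
                    true);

            // 2. Then sweep the entries this member already owns; these were loaded
            //    by the MapLoader without firing entryAdded events.
            for (String key : requests.localKeySet()) {
                process(key, requests.get(key));
            }
        }

        private static void process(String key, String value) {
            // application-specific processing, followed by removing the entry;
            // keep it idempotent in case an entry is seen by both paths
        }
    }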

how to create a copy of a table in HBase on the same cluster? or, how to serve requests using the original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I have no knowledge of whether it has been designed to handle that scenario efficiently.
Use the export+import jobs. Doing that sounds like a hack, but since I'm new to HBase, maybe it is a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your requests with reads that use this parameter, pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
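As a sketch with the modern HBase client API (table, family, column and row names are invented), a Get restricted to cell versions written before the batch started would look like this:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SnapshotRead {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            long snapshotTs = Long.parseLong(args[0]); // timestamp captured before the batch started

            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("objects"))) { // made-up table name

                Get get = new Get(Bytes.toBytes("row-123"));
                // Only consider cell versions written strictly before snapshotTs,
                // i.e. the state as of the moment the batch began.
                get.setTimeRange(0L, snapshotTs);

                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
                // ... serve the request from 'value'
            }
        }
    }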
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in a cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
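For the first caveat, the number of retained versions is part of the column-family configuration; here is a sketch using the HBase 2.x admin API (table and family names are invented):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateVersionedTable {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {

                // Keep several versions per cell so the pre-batch value is still
                // readable while (and after) the daily batch overwrites it.
                admin.createTable(TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("objects"))   // made-up table name
                        .setColumnFamily(ColumnFamilyDescriptorBuilder
                                .newBuilder(Bytes.toBytes("cf"))
                                .setMaxVersions(3)                  // more versions than you expect to overwrite
                                .build())
                        .build());
            }
        }
    }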
