Hazelcast - Persist data when shutting down the last node - java

I'm currently researching Hazelcast to use as a message queue and shared in-memory storage in a cluster.
I was wondering how to handle the situation when the last node goes down. I'd want to persist all Hazelcast-managed data, queues, etc. to disk, with the ability to start up again at a later time.
The MapStore and MapLoader features look interesting, but when are they used? The documentation says they are used whenever needed, but I would only need them when shutting down the last node. There is no need to keep all data persisted during normal operation.
Also, the writing to disk should happen at the very end, so that no new data gets added in the meantime.
Does anyone have experience or advice on how to handle this type of situation for a newbie?
PS: I'm also using Spring and Mongo, btw.
Thanks in advance.

Currently we don't have functionality like this available out of the box.
You might want to have a look at the QueueStore/QueueLoader interface. It provides the same functionality for the Queue as the MapStore/MapLoader for the map.
We are working on a disk-based storage solution for all data structures, but that isn't ready yet.
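For reference, a minimal MapStore sketch might look like the following (assuming a recent Hazelcast release where the interface lives in com.hazelcast.map and loadAllKeys() returns an Iterable; the ConcurrentHashMap here is a stand-in for whatever disk or MongoDB persistence you would actually use):

```java
import com.hazelcast.map.MapStore;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store; replace the backing map with real disk or Mongo I/O.
public class CalculationMapStore implements MapStore<String, String> {

    private final Map<String, String> backing = new ConcurrentHashMap<>();

    @Override
    public void store(String key, String value) {
        backing.put(key, value);            // single write-through/write-behind call
    }

    @Override
    public void storeAll(Map<String, String> map) {
        backing.putAll(map);                // batched writes, e.g. on write-behind flush
    }

    @Override
    public void delete(String key) {
        backing.remove(key);
    }

    @Override
    public void deleteAll(Collection<String> keys) {
        keys.forEach(backing::remove);
    }

    @Override
    public String load(String key) {
        return backing.get(key);            // called lazily on a cache miss
    }

    @Override
    public Map<String, String> loadAll(Collection<String> keys) {
        Map<String, String> result = new HashMap<>();
        keys.forEach(k -> result.put(k, backing.get(k)));
        return result;
    }

    @Override
    public Iterable<String> loadAllKeys() {
        return backing.keySet();            // drives pre-loading when the map starts
    }
}
```

The store is plugged into a map via MapStoreConfig; setWriteDelaySeconds controls write-through (0) versus write-behind (>0), which determines how eagerly Hazelcast calls it during normal operation.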

Related

How to build a Java object from multiple data sources when data is not available in all sources at creation time?

I have this design question that I'm trying to solve, and I hope to hear back from you if you have any suggestions.
Let's say we have a queue that you are listening to. The queue receives a message, the listener in your application grabs it, builds an object, and pushes it to the cache. At this moment the object only holds the data it received from the queue; it is still waiting on other data that is not yet available in the other data sources (assume it's a DB). What's the best approach to update the object when the data becomes available in those other sources?
Should I have a thread running in the background to fetch the data periodically?
I'm thinking of using the decorator design pattern to build the object. Is that a good approach?
Any suggestions are welcome.
Update: Some friends in the comments asked if I was overcomplicating the question, and that's not the case. To add more detail on why the case is not simple: imagine you plan to drive from point A to point B, so you create a record for your trip, but there are some dependencies that must be fulfilled before the record is complete, like waiting on a friend to confirm his pickup location and waiting on your payment to be received. The moment you receive the pickup location you update the record; later, when you receive the payment, you go ahead and update it again. I hope that explains the case in layman's terms.
In order to fetch the data from the remaining sources, use an executor in conjunction with a Future.
If you are not familiar with futures, here are some tutorials:
https://www.baeldung.com/java-future
http://tutorials.jenkov.com/java-util-concurrent/java-future.html
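A minimal sketch of that idea (the TripRecord class and the load* methods are hypothetical stand-ins for your cached object and your other data sources):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class EnrichmentExample {

    // Hypothetical cached object built from the queue message alone.
    static class TripRecord {
        final String id;
        String pickupLocation;
        boolean paymentReceived;

        TripRecord(String id) { this.id = id; }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        TripRecord trip = new TripRecord("trip-42");

        // Fetch the missing pieces from the other data sources in parallel.
        Future<String> pickup = executor.submit(() -> loadPickupLocation(trip.id));
        Future<Boolean> payment = executor.submit(() -> loadPaymentStatus(trip.id));

        // Block (with a timeout) until the data is available, then enrich the object.
        trip.pickupLocation = pickup.get(30, TimeUnit.SECONDS);
        trip.paymentReceived = payment.get(30, TimeUnit.SECONDS);

        executor.shutdown();
    }

    // Stand-ins for queries against the DB or other services.
    private static String loadPickupLocation(String tripId) { return "Main St & 3rd"; }

    private static Boolean loadPaymentStatus(String tripId) { return Boolean.TRUE; }
}
```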
You can use a decorator design pattern; however, that is generally done to extend the functionality of objects, not to add data.
If these are primarily plain Java objects (POJOs) used only to store data, I would recommend just setting the data via setters when you get it from the other data source.
If there is a large amount of data and it makes sense to divide it by data source, you can have one outer object with inner objects, each from a different source (builder pattern).

Writing hundreds of data objects to a Mongo database

I am working on a Minecraft network which has several servers manipulating 'user objects', each of which is just a Mongo document. After a user object is modified it needs to be written to the database immediately, otherwise it may be overwritten by other servers (which have an older version of the user object). Sometimes, however, hundreds of objects need to be written in a short amount of time (a few seconds). My question is: how can I easily write objects to a MongoDB database without really overloading the database?
I have been thinking up an idea but I have no idea if it is relevant:
- Create some sort of queue in another thread; every time a data object needs to be saved to the database it goes into the queue, and then in the 'queue thread' the objects are saved one by one with some sort of interval.
Thanks in advance
Btw, I'm using Morphia as the framework in Java.
"hundreds of objects [...] in a few seconds" doesn't sound that much. How much can you do at the moment?
The setting most important for the speed of write operations is the WriteConcern. What are you using at the moment and is this the right setting for your project (data safety vs speed)?
If you need to do many write operations at once, you can probably speed things up with bulk operations. They were added in MongoDB 2.6, and Morphia supports them as well (see this unit test).
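For illustration, a rough sketch with the plain MongoDB Java driver's bulkWrite (assuming driver 3.7 or newer; Morphia sits on top of the same driver, and the database, collection, and field names below are made up):

```java
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.ReplaceOneModel;
import com.mongodb.client.model.ReplaceOptions;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class BulkUserSave {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users = client.getDatabase("minecraft")
                    .getCollection("users")
                    // WriteConcern is the data-safety vs. speed knob mentioned above.
                    .withWriteConcern(WriteConcern.ACKNOWLEDGED);

            List<WriteModel<Document>> ops = new ArrayList<>();
            for (int i = 0; i < 500; i++) {
                Document user = new Document("_id", "user-" + i).append("score", i);
                // Upsert each modified user object; all of them go over in one batch.
                ops.add(new ReplaceOneModel<>(new Document("_id", "user-" + i), user,
                        new ReplaceOptions().upsert(true)));
            }
            // Unordered lets the server apply the writes without waiting on each other.
            users.bulkWrite(ops, new BulkWriteOptions().ordered(false));
        }
    }
}
```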
I would be very cautious with a queue:
- Do you really need it? Depending on your hardware and configuration you should be able to do hundreds or even thousands of write operations per second.
- Is async really the best approach for you? The producer of the write operation / message can only assume his change has been applied, but it probably has not and is still waiting in the queue to be written. Is this the intended behaviour?
- Does it make your life easier? You need to know another piece of software, which adds many new and most likely unforeseen problems.
- If you need to scale your writes, why not use sharding? No additional technology and your code will behave the same with and without it.
You might want to read the following blogpost on why you probably want to avoid queues for this kind of operation in general: http://widgetsandshit.com/teddziuba/2011/02/the-case-against-queues.html

Cache update with db changes

We have a Java-based product which keeps a Calculation object in the database as a blob. At runtime we keep this object in memory for fast performance. Now there is another process which updates this Calculation object in the database at regular intervals. What would be the best strategy so that when the object gets updated in the database, the cache evicts the stored object and fetches it again from the database?
I'd prefer not to use any caching framework unless it's a must.
I'd appreciate any response on this.
It is very difficult to give you a good answer to your question without any knowledge of your system architecture, design constraints, IT strategy, etc.
Personally I would use the Messaging pattern to solve this issue. A few advantages of that pattern are as follows:
- Your system components (Calculation process, update process) can be loosely coupled.
- Depending on the implementation of the Messaging pattern you can "connect" many Calculation processes (scaling out) and many update processes (with a master-slave approach).
However, implementing the Messaging pattern might be a very challenging task, and I would recommend taking one of the existing frameworks or products.
I hope that will help at least a bit.
I did some work similar to your scenario before; generally there are two ways.
One: the cache holder polls the database regularly, fetches the data it needs, and keeps it in memory. The data can be stored in a HashMap or some other collection. This approach is simple and easy to implement, and no extra framework or library is needed, but users will have to endure dirty data from time to time. Besides, polling will put a lot of pressure on the DB if the number of pollers is huge or the query is not fast enough. However, it is generally not a bad option if your real-time requirement is not that high and the scale of your system is relatively small.
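A minimal polling sketch along those lines (loadAllCalculationsFromDb is a hypothetical stand-in for your blob query):

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class PollingCalculationCache {

    // Readers always see a complete, consistent snapshot.
    private final AtomicReference<Map<String, byte[]>> snapshot =
            new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Poll the database every 30 seconds and swap in the fresh copy.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                snapshot.set(loadAllCalculationsFromDb());
            } catch (Exception e) {
                // Keep serving the previous snapshot if a refresh fails.
                e.printStackTrace();
            }
        }, 0, 30, TimeUnit.SECONDS);
    }

    public byte[] get(String id) {
        return snapshot.get().get(id);
    }

    // Hypothetical loader; in the real system this deserializes the blob column.
    private Map<String, byte[]> loadAllCalculationsFromDb() {
        return Map.of("calc-1", new byte[0]);
    }
}
```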
The other approach is that the cache holder subscribes to notifications from the data updater and updates its data after being notified. It provides a better user experience, but it brings more complexity to your system because you have to get some messaging infrastructure, such as JMS, involved. Developing and tuning are more time-consuming.
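If you go the notification route, the consumer side might look roughly like this (a sketch assuming a JMS broker such as ActiveMQ; the broker URL and topic name are made up):

```java
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class CalculationUpdateListener {
    public static void main(String[] args) throws Exception {
        // Assumed broker URL and topic name for this sketch.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("calculation.updated");

        // The updater process publishes to the topic after each DB write;
        // on every notification we evict and reload the cached Calculation object.
        session.createConsumer(topic).setMessageListener((Message message) -> {
            reloadCalculationFromDatabase();
        });
        connection.start();
    }

    private static void reloadCalculationFromDatabase() {
        // Stand-in for fetching the fresh blob and replacing the in-memory copy.
        System.out.println("cache refreshed");
    }
}
```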
I know I am quite late responding to this, but it might help somebody searching for the same issue.
Here was my problem: I was storing requestPerMinute information in a HashMap in a Java filter which gets loaded when the application starts. The problem is that if somebody updates the DB with new information, the map doesn't know about it.
Solution: I added an updateTime variable to my Java filter which simply stores when the HashMap was last updated, and with every request it checks whether more than 24 hours have passed since then; if yes, it reloads the HashMap from the database. So every 24 hours it just refreshes the whole HashMap.
My use case did not require real-time updates, so this approach was a good fit.
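A rough sketch of that 24-hour refresh check (the loadFromDatabase method and field names are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class RequestsPerMinuteCache {

    private static final long REFRESH_INTERVAL_MS = TimeUnit.HOURS.toMillis(24);

    private volatile Map<String, Integer> requestsPerMinute = new ConcurrentHashMap<>();
    private volatile long lastUpdateTime = 0L;

    // Called from the filter on every request.
    public Integer lookup(String key) {
        long now = System.currentTimeMillis();
        if (now - lastUpdateTime > REFRESH_INTERVAL_MS) {
            synchronized (this) {
                if (now - lastUpdateTime > REFRESH_INTERVAL_MS) {
                    requestsPerMinute = loadFromDatabase();  // refresh the whole map
                    lastUpdateTime = now;
                }
            }
        }
        return requestsPerMinute.get(key);
    }

    // Hypothetical stand-in for the real DB query.
    private Map<String, Integer> loadFromDatabase() {
        return new ConcurrentHashMap<>();
    }
}
```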

which NOSQL database tool is better to choose for my application?

I am planning to develop an application for connecting with friends of friends of friends. It may look like Facebook or Twitter, but initially I am planning to implement it to learn more about NoSQL databases.
There are a number of NoSQL database tools. I have gone through many database types like document stores, key-value stores, column stores, and graph databases, and finally I came up with two database tools: Cassandra and Neo4j. Is it right to choose either one? If not, correct me and share your valuable opinions.
One more thing: the language binding I chose is Java.
My question is:
Which database tool suits my application?
Awaiting your valuable opinions. Thanks for spending your valuable time.
Tim, you really should have posted your question separately, rather than as an answer to the OP, which it wasn't.
But to answer, first, go read Ben Black's slides at http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency.
Done? Okay, now for the specific questions:
"How would differences in [replica] data-state be reconciled on a subsequent read?"
The highest timestamp wins.
"Do all zones work off the same system clock?"
Timestamps are provided by clients (i.e., your app server). They should be synchronized with e.g. ntpd (which is good practice anyway), but high precision is not required because if ordering matters you should be avoiding conflict either by using unique column names or by using external locking.
For example: if you have a list of users following you in a Twitter clone, you should give each follower its own column and there will be no way to lose data no matter how out of sync the clocks are.
If you have an admin tool for your website and two admins upload a new favicon "simultaneously," one update is going to win and it doesn't really matter which. Here, you do want your clocks synchronized but "within a few ms" is close enough.
If you are managing user registration and you want to allow creating the account "jbellis" only if it doesn't already exist, you need a lock manager no matter how closely synchronized your clocks are.
"Would stale data get returned?"
A node (a better unit to think about than a "zone") will not have data it missed during its downtime until it is sent that data by read repair, hinted handoff, or anti-entropy repair. In the meantime, it will reply to read requests with stale data; if you use a high enough ConsistencyLevel, read requests will wait for enough other replies to make sure you always see the most recent version anyway, which may mean not being able to fulfil requests if enough other replicas are down.
Otherwise, a low ConsistencyLevel (e.g. ONE) implicitly means "I understand that the higher availability and lower latency I get with this lower ConsistencyLevel means I'm okay with seeing stale data temporarily after downtime."
I'm not sure I understand all of the implications of the Cassandra consistency model with respect to data agreement across multiple availability zones.
Given multiple zones, and given that the coordinator node in Cassandra has used a consistency level that does not require all zones to report back, but only a quorum, how would differences in zone data-state be reconciled on a subsequent read?
Do all zones work off the same system clock? Or does each zone have its own clock? If they don't work off the same clock, how are they synchronized so that timestamps can be compared during the "healing" process when differences are reconciled?
Let's say that a zone that does have accurate, up-to-date data is now offline, and a zone that was offline during a previous write (so it didn't get updated and contains stale data) is now back online. Would stale data get returned? Would the coordinator have any way to know the data were stale?
If you don't need to scale in the short term I'd go with Neo4j because it is designed to store networks like the one you described. (If you eventually do need to scale, maybe you can throw Gizzard in front of it or something. Good luck!)
Have you looked at the Riak database? It has the same background as Cassandra, but you don't need to care about timestamp synchronization (it uses a different method for resolving conflicting data).
My first application was built on a Cassandra database, but I am now trying Riak because it is more suitable. It is not only the difference in keys (key-value vs. super column / keys / values); it goes further with the document store feature.
It has a way to create complex queries using MapReduce. Cassandra has this option too, using Hadoop, but it sounds difficult.
Furthermore, it uses a well-known, well-defined access protocol over HTTP/S, so it's easy to manage the servers when you have a lot of traffic.
The only bad point is that it is slower than Cassandra. But usually you will read records more than you write them (and Cassandra is optimised for writes, not reads), so the end result should be OK.

Implementing a file based queue

I have an in-memory bounded queue in which multiple threads enqueue objects. Normally the queue is emptied by a single reader thread that processes the items in the queue.
However, there is a possibility that the queue fills up. In that case I would like to persist any additional items to disk, to be processed by another background reader thread that scans a directory for such files and processes the entries within them. I am familiar with ActiveMQ but would prefer a more lightweight solution. It is OK if FIFO order is not strictly followed (since the persisted entries may be processed out of order).
Are there any open source solutions out there? I did not find any, but thought I would ping this list for suggestions before I embark on the implementation myself.
Thank you!
Take a look at http://square.github.io/tape/, and its impressive QueueFile.
(thanks to Brian McCallister's "The Long Tail Treasure Trove" for pointing me at that).
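For what it's worth, using QueueFile as the disk spill-over might look something like this (a sketch assuming Tape 1.x, where QueueFile lives in com.squareup.tape):

```java
import com.squareup.tape.QueueFile;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class OverflowToDisk {
    public static void main(String[] args) throws Exception {
        // Append-only, crash-safe FIFO file that survives process restarts.
        QueueFile overflow = new QueueFile(new File("overflow.queue"));

        // Producer side: spill an element when the in-memory bounded queue is full.
        overflow.add("task-payload".getBytes(StandardCharsets.UTF_8));

        // Background reader: drain the persisted entries one by one.
        byte[] head;
        while ((head = overflow.peek()) != null) {
            process(new String(head, StandardCharsets.UTF_8));
            overflow.remove();
        }
        overflow.close();
    }

    private static void process(String payload) {
        System.out.println("processing " + payload);
    }
}
```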
You could use something like SQLite to store the objects in.
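For example, a simple table can serve as the overflow; a sketch assuming the xerial sqlite-jdbc driver is on the classpath and a made-up table layout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteOverflowQueue {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:overflow.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS overflow ("
                        + "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)");
            }

            // Producer: spill an item to disk when the in-memory queue is full.
            try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO overflow (payload) VALUES (?)")) {
                insert.setString(1, "task-payload");
                insert.executeUpdate();
            }

            // Background reader: take the oldest item, process it, then delete it.
            long id = -1;
            String payload = null;
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT id, payload FROM overflow ORDER BY id LIMIT 1")) {
                if (rs.next()) {
                    id = rs.getLong("id");
                    payload = rs.getString("payload");
                }
            }
            if (payload != null) {
                System.out.println("processing " + payload);
                try (PreparedStatement delete =
                             conn.prepareStatement("DELETE FROM overflow WHERE id = ?")) {
                    delete.setLong(1, id);
                    delete.executeUpdate();
                }
            }
        }
    }
}
```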
Ehcache can overflow to disk. It's also highly concurrent, though you don't really need that.
Why is the queue bounded? Why not use a dynamically expandable data structure? That seems much simpler than involving the disk.
Edit:
It's hard to answer your question without more context.
Can you clarify what you mean by "run out of memory"? How big is the queue? How much memory do you have?
Are you on an embedded system with very little memory? Or do you have 2 GB or more of stuff in the queue?
If either is true, you really ought to use a "swappable" data structure like a B-tree. Implementing one yourself for one queue seems like overkill. I would just use an embedded database like SQLite.
If neither of those is true, then just use a vector or a linked list.
Edit 2:
You probably don't need a B-tree or a database. You could just use a linked list of pages. But again, I have to ask: is this necessary?
Or, if you are willing to process things non-serially, why not have multiple reader threads all the time?
Ultimately though I don't think your proposal is the way to go.
You could embed Berkeley DB Java Edition to keep queue elements in files.
You can look at working example here:
http://sysgears.com/articles/lightweight-fast-persistent-queue-in-java-using-berkley-db
Hope this helps
MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap memory. It is a fast and easy-to-use embedded Java database engine.
https://github.com/jankotek/MapDB
http://www.mapdb.org/
The most performant and GC-friendly solution I've found so far is Chronicle Queue.
It has extremely low write latency, on the order of tens of nanoseconds, several orders of magnitude lower than MapDB or SQLite.
