Writing hundreds of data objects to a Mongo database - java

I am working on a Minecraft network which has several servers manipulating 'user objects', each of which is just a Mongo document. After a user object is modified it needs to be written to the database immediately, otherwise it may be overwritten by other servers (which still hold an older version of the user object), but sometimes hundreds of objects need to be written away in a short amount of time (a few seconds). My question is: how can I write that many objects to a MongoDB database without overloading it?
I have been thinking up an idea but I have no idea if it is relevant:
- Create some sort of queue in another thread; every time a data object needs to be saved to the database it goes into the queue, and the 'queue thread' then saves the objects one by one at some interval.
Thanks in advance
By the way, I'm using Morphia as the framework in Java.

"hundreds of objects [...] in a few seconds" doesn't sound that much. How much can you do at the moment?
The setting most important for the speed of write operations is the WriteConcern. What are you using at the moment and is this the right setting for your project (data safety vs speed)?
If you need to do many write operations at once, you can probably speed up things with bulk operations. They have been added in MongoDB 2.6 and Morphia supports them as well — see this unit test.
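For illustration, here is a rough sketch using the plain MongoDB Java driver (Morphia sits on top of the same driver, so the equivalent calls should be reachable from it as well). The database name, collection name and the pendingUserDocuments() helper are assumptions for the example, not part of your code:

    import com.mongodb.WriteConcern;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.ReplaceOneModel;
    import com.mongodb.client.model.ReplaceOptions;
    import com.mongodb.client.model.WriteModel;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.List;

    public class UserBulkWriter {

        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // Pick a write concern deliberately: ACKNOWLEDGED is the usual default,
                // MAJORITY is safer but slower, W1/UNACKNOWLEDGED trade safety for speed.
                MongoCollection<Document> users = client
                        .getDatabase("network")            // database name is an assumption
                        .getCollection("users")            // collection name is an assumption
                        .withWriteConcern(WriteConcern.ACKNOWLEDGED);

                // Collect all pending user documents into one bulk operation
                // instead of issuing hundreds of individual saves.
                List<WriteModel<Document>> ops = new ArrayList<>();
                for (Document user : pendingUserDocuments()) {
                    ops.add(new ReplaceOneModel<>(
                            Filters.eq("_id", user.get("_id")),
                            user,
                            new ReplaceOptions().upsert(true)));
                }
                if (!ops.isEmpty()) {
                    users.bulkWrite(ops);                  // one round trip for the whole batch
                }
            }
        }

        // Placeholder for wherever the modified user objects come from.
        private static List<Document> pendingUserDocuments() {
            return new ArrayList<>();
        }
    }

The main point is that one bulkWrite() with hundreds of models is far cheaper than hundreds of single saves, regardless of the write concern you settle on.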
I would be very cautious with a queue:
Do you really need it? Depending on your hardware and configuration you should be able to do hundreds or even thousands of write operations per second.
Is async really the best approach for you? The producer of the write operation / message can only assume his change has been applied, but it probably has not and is still waiting in the queue to be written. Is this the intended behaviour?
Does it make your life easier? You need to know another piece of software, which adds many new and most likely unforeseen problems.
If you need to scale your writes, why not use sharding? No additional technology and your code will behave the same with and without it.
You might want to read the following blogpost on why you probably want to avoid queues for this kind of operation in general: http://widgetsandshit.com/teddziuba/2011/02/the-case-against-queues.html

Related

Sorted Array vs Hashtable: Which data structure would be more efficient in searching over a range of dates in a calendar app?

I have about a year of experience in coding in Java. To hone my skills I'm trying to write a Calendar/journal entry desktop app in Java. I've realized that I still have no experience in data persistence and still don't really understand what the data persistence options would be for this program -- So perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each using a hash table).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer provided by this following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: Whats the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use the NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
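A small sketch of that idea, using LocalDate keys and half-open ranges (the entry type and class name are just placeholders):

    import java.time.LocalDate;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class JournalIndex {

        // Dates map to journal entries; TreeMap keeps keys in chronological order.
        private final NavigableMap<LocalDate, String> entries = new TreeMap<>();

        public void addEntry(LocalDate date, String text) {
            entries.put(date, text);
        }

        // Half-open range [from, to): the lower bound is included, the upper excluded,
        // so consecutive ranges never overlap and never leave gaps.
        public NavigableMap<LocalDate, String> entriesBetween(LocalDate from, LocalDate to) {
            return entries.subMap(from, true, to, false);
        }

        public static void main(String[] args) {
            JournalIndex index = new JournalIndex();
            index.addEntry(LocalDate.of(2024, 1, 1), "New year planning");
            index.addEntry(LocalDate.of(2024, 1, 15), "Mid-month review");
            index.addEntry(LocalDate.of(2024, 2, 1), "February goals");

            // Everything in January 2024, excluding 1 February.
            index.entriesBetween(LocalDate.of(2024, 1, 1), LocalDate.of(2024, 2, 1))
                 .forEach((date, entry) -> System.out.println(date + ": " + entry));
        }
    }

Range selection like this is O(log n) to locate the bounds plus the size of the result, which is usually plenty for a calendar app.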
Depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient" and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs etc. It might, but you only know once you run your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work" but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between List and JSON. You could then simply save it to a file for persistence.
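For example, something along these lines (the JournalEntry type and its fields are assumptions for the sketch):

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    public class JsonFileStore {

        private final ObjectMapper mapper = new ObjectMapper();
        private final File file;

        public JsonFileStore(File file) {
            this.file = file;
        }

        // Write the whole list out as JSON.
        public void save(List<JournalEntry> entries) throws IOException {
            mapper.writeValue(file, entries);
        }

        // Read it back into a typed list.
        public List<JournalEntry> load() throws IOException {
            return mapper.readValue(file, new TypeReference<List<JournalEntry>>() {});
        }

        // Minimal example entry type; fields and names are assumptions.
        public static class JournalEntry {
            public String date;   // kept as a String here; java.time types need the JavaTimeModule
            public String text;
        }
    }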
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC and suitable drivers for your database vendor (e.g. MySQL) to connect to, read from and write to the database.
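A minimal JDBC sketch of that approach; the URL, credentials, table and column names are all made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class JdbcJournalDao {

        // URL, table and column names are assumptions for illustration.
        private static final String URL = "jdbc:mysql://localhost:3306/journal";

        public void insertEntry(String date, String text) throws SQLException {
            String sql = "INSERT INTO entries (entry_date, entry_text) VALUES (?, ?)";
            try (Connection conn = DriverManager.getConnection(URL, "user", "password");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, date);
                ps.setString(2, text);
                ps.executeUpdate();
            }
        }

        public void printEntriesBetween(String from, String to) throws SQLException {
            String sql = "SELECT entry_date, entry_text FROM entries "
                       + "WHERE entry_date >= ? AND entry_date < ? ORDER BY entry_date";
            try (Connection conn = DriverManager.getConnection(URL, "user", "password");
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, from);
                ps.setString(2, to);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + ": " + rs.getString(2));
                    }
                }
            }
        }
    }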
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas about what data structure to use for range queries.
Data structure for range query
And it might even be easier to use a database and an API to query for the desired range.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
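As a small illustration of the "what event is occurring at time T" query with Guava (event names and times are made up):

    import com.google.common.collect.Range;
    import com.google.common.collect.RangeMap;
    import com.google.common.collect.TreeRangeMap;

    import java.time.Instant;

    public class EventTimeline {

        public static void main(String[] args) {
            RangeMap<Instant, String> events = TreeRangeMap.create();

            Instant nineAm = Instant.parse("2024-03-01T09:00:00Z");
            Instant tenAm  = Instant.parse("2024-03-01T10:00:00Z");
            Instant noon   = Instant.parse("2024-03-01T12:00:00Z");

            // Half-open ranges again: [09:00, 10:00) and [10:00, 12:00) don't overlap.
            events.put(Range.closedOpen(nineAm, tenAm), "Stand-up meeting");
            events.put(Range.closedOpen(tenAm, noon), "Design review");

            // "What event is occurring at time T?"
            Instant t = Instant.parse("2024-03-01T10:30:00Z");
            System.out.println(events.get(t)); // prints "Design review"
        }
    }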

How to publish to KDB Ticker Plant from Java effectively

We have market data handlers which publish quotes to a KDB tickerplant. We use the exxeleron q Java library for this purpose. Unfortunately latency is quite high: hundreds of milliseconds when we try to insert a batch of records. Can you suggest some latency tips for the KDB + Java binding, as we need to publish quite fast?
There's not enough information in this message to give a fully qualified response, but having done the same with Java+KDB it really comes down to eliminating the possibilities. This is common sense, really, nothing super technical.
Make sure you're inserting asynchronously
Verify that it's exxeleron q Java that is causing the latency. I don't think there's hundreds of millis of overhead there.
Verify the CPU that your tickerplant is on isn't overloaded. Consider re-nicing, core binding, etc
Analyse your network latencies. Also, if you're using Linux, there's a few tcp tweaks you can try, e.g. TCP_QUICKACK
As you're using Java, be smarter about garbage collection. It's highly configurable, although not directly controllable.
If you find out the tickerplant is the source of latency, you could either recode it to not write to disk, or get a faster local disk.
There's so many more suggestions, but the question is a bit too ambiguous.
EDIT
Back in 2007, with old(ish) servers and a very old version of KDB+ we were managing an insertion rate of 90k rows per second using the vanilla c.java. That was after many rounds of the above points. I'm sure you can achieve way more now, it's a matter of finding where the bottlenecks are and fixing them one by one.
Make sure the data published to the tickerplant is batched: wait a little and insert, say, a few rows of data in one batch, rather than inserting row by row as each new record comes in.
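A generic batching buffer along these lines could sit between the feed handler and the tickerplant. The actual publish call depends on the q library you use, so sendBatch() is only a placeholder here:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class BatchingPublisher<T> {

        private final BlockingQueue<T> buffer = new LinkedBlockingQueue<>();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private final int maxBatchSize;

        public BatchingPublisher(int maxBatchSize, long flushIntervalMillis) {
            this.maxBatchSize = maxBatchSize;
            // Flush whatever has accumulated on a fixed interval.
            scheduler.scheduleAtFixedRate(this::flush, flushIntervalMillis,
                    flushIntervalMillis, TimeUnit.MILLISECONDS);
        }

        // Called by the market data handler for every new quote.
        public void publish(T quote) {
            buffer.offer(quote);
            if (buffer.size() >= maxBatchSize) {
                flush();
            }
        }

        private synchronized void flush() {
            List<T> batch = new ArrayList<>();
            buffer.drainTo(batch);
            if (!batch.isEmpty()) {
                sendBatch(batch);
            }
        }

        // Placeholder: this is where the batch would go to the tickerplant,
        // e.g. as a single async call through the q Java library.
        private void sendBatch(List<T> batch) {
            System.out.println("Publishing batch of " + batch.size() + " rows");
        }
    }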

Hazelcast - Persist data when shutting down the last node

I'm currently researching Hazelcast to use as a message queue and shared in-memory storage in a cluster.
I was wondering how to handle the situation when the last node goes down. I'd want to persist all hazelcast-managed data, queues, etc to disk with the ability to startup again at a later time.
The MapStore and MapLoader feature looks interesting, but when is it used? The documentation says it is used whenever needed, but I would only need it when shutting down the last node. There is no need to keep all data persisted during normal operation.
Also the writing to disk should happen at the very end, so no new data gets added in the meantime.
Does anyone have experience or advice on how to handle this type of situation for a newbie?
PS: I'm also using Spring and Mongo, btw.
Thanks in advance.
Currently we don't have functionality like this available out of the box.
You might want to have a look at the QueueStore/QueueLoader interface. It provides the same functionality for the Queue as the MapStore/MapLoader for the map.
We are working on a disk based storage solution for all data-structures, but that isn't ready yet.
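In case it helps, a rough skeleton of a MapStore implementation (the map counterpart of the QueueStore mentioned above). The package name varies between Hazelcast versions and every backing-store call is left as a stub, so treat this as a shape rather than working persistence:

    import com.hazelcast.core.MapStore;   // in newer Hazelcast versions the interface lives under com.hazelcast.map

    import java.util.Collection;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    // A minimal write-through MapStore sketch. The backing store here is a stub;
    // in practice it could be MongoDB, a file, or any other persistent store.
    public class PersistentMapStore implements MapStore<String, String> {

        @Override
        public void store(String key, String value) {
            // persist a single entry to the backing store
        }

        @Override
        public void storeAll(Map<String, String> map) {
            map.forEach(this::store);
        }

        @Override
        public void delete(String key) {
            // remove the entry from the backing store
        }

        @Override
        public void deleteAll(Collection<String> keys) {
            keys.forEach(this::delete);
        }

        @Override
        public String load(String key) {
            return null; // look the entry up in the backing store
        }

        @Override
        public Map<String, String> loadAll(Collection<String> keys) {
            Map<String, String> result = new HashMap<>();
            keys.forEach(k -> result.put(k, load(k)));
            return result;
        }

        @Override
        public Iterable<String> loadAllKeys() {
            return Collections.emptyList(); // or return all keys from the backing store
        }
    }

The implementation is then registered on the map via configuration (MapStoreConfig), where you also choose between write-through and write-behind behaviour.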

Cache update with db changes

We have a Java based product which keeps a Calculation object in the database as a blob. At runtime we keep this in memory for fast performance. Now there is another process which updates this Calculation object in the database at regular intervals. What would be the best strategy to implement so that when this object gets updated in the database, the cache drops the stored object and fetches it again from the database?
I would prefer not to use a caching framework unless it is really necessary.
I'd appreciate any responses on this.
It is very difficult to give you a good answer to your question without any knowledge of your system architecture, design constraints, your IT strategy etc.
Personally I would use the Messaging pattern to solve this issue. A few advantages of that pattern are as follows:
Your system components (Calculation process, update process) can be loosely coupled
Depending on the implementation of the Messaging pattern you can "connect" many Calculation processes (scaling out) and many update processes (with a master-slave approach).
However, implementing the Messaging pattern might be a very challenging task and I would recommend using one of the existing frameworks or products.
I hope that will help at least a bit.
I did some work similar to your scenario before; generally there are two ways.
In the first, the cache holder polls the database regularly, fetches the data it needs and keeps it in memory. The data can be stored in a HashMap or some other collection. This approach is simple and easy to implement, with no extra framework or library needed, but users will have to endure dirty data from time to time. Besides, polling will put a lot of pressure on the DB if the number of pollers is huge or the query is not fast enough. However, it is generally not a bad choice if your real-time requirements are not that high and the scale of your system is relatively small.
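A minimal sketch of that polling approach, assuming a scheduled executor and a stubbed-out DB query (all names here are illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PollingCache {

        private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // Re-read the Calculation blobs every 60 seconds; staleness is bounded
            // by the polling interval.
            scheduler.scheduleAtFixedRate(this::refresh, 0, 60, TimeUnit.SECONDS);
        }

        public byte[] get(String id) {
            return cache.get(id);
        }

        private void refresh() {
            // loadAllFromDatabase() stands in for the actual DB query.
            cache.putAll(loadAllFromDatabase());
        }

        private Map<String, byte[]> loadAllFromDatabase() {
            return Map.of(); // stub
        }
    }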
The other approach is that the cache holder subscribes to notifications from the data updater and updates its data after being notified. It provides a better user experience, but it brings more complexity to your system because you have to get some messaging infrastructure, such as JMS, involved. Developing and tuning it is more time-consuming.
I know I am quite late responding to this but it might help somebody searching for the same issue.
Here was my problem: I was storing requestPerMinute information in a HashMap in a Java filter which gets loaded at application startup. The problem was that if somebody updated the DB with new information, the map didn't know about it.
Solution: I added an updateTime variable to my Java filter which simply stores when the HashMap was last updated. On every request it checks whether more than 24 hours have passed; if so, it reloads the HashMap from the database. So every 24 hours the whole HashMap is refreshed.
My use case did not require real-time updates, so this approach fit it well.
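For reference, a small sketch of that lazy, time-based refresh (class, field and method names are illustrative, not the original filter code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;

    public class RequestStatsCache {

        private static final long MAX_AGE_MILLIS = TimeUnit.HOURS.toMillis(24);

        private volatile long lastUpdateTime = 0L;
        private volatile Map<String, Integer> requestsPerMinute = new ConcurrentHashMap<>();

        // Called on every request passing through the filter.
        public Map<String, Integer> getStats() {
            long now = System.currentTimeMillis();
            if (now - lastUpdateTime > MAX_AGE_MILLIS) {
                synchronized (this) {
                    // Re-check inside the lock so only one thread reloads.
                    if (now - lastUpdateTime > MAX_AGE_MILLIS) {
                        requestsPerMinute = loadFromDatabase();
                        lastUpdateTime = now;
                    }
                }
            }
            return requestsPerMinute;
        }

        private Map<String, Integer> loadFromDatabase() {
            return new ConcurrentHashMap<>(); // placeholder for the real DB query
        }
    }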

How to create a server that distributes workunits to clients?

I need a Java application that manages a database to distribute work units to its clients.
Effectively it's a grid application: the database is filled with input parameters for clients, and all its tuples must be distributed to the clients that request them. After the clients send their results, the server modifies the database accordingly (for example marking the tuples as computed).
Now let's suppose that I have a database (SQLite or MySQL) filled with tuples and that clients request a group of input tuples: I want a group of work units to be sent exclusively to a single client, so I need to mark them "already requested by another client".
If I query the DB for the first few (for example 5) tuples and meanwhile another client makes the same request (in a multi-threaded server architecture and without any synchronization), I think there is a possibility that both clients receive the same work units.
I imagined that solutions could be:
1) make a single-threaded server architecture (ServerSocket.accept() is called again only after the previous client request has been served, so that the server is effectively accessed by only one client at a time)
2) in a multi-threaded architecture, make the query and tuples-lock operations synchronized, so that I obtain a kind of atomicity (effectively serializing operations over the database)
3) use atomic query operations on the database server (or file, in the case of SQLite), but in this case I need help because I don't know how things really work...
However, I hope you understand my problem: it's very similar to SETI@home, which distributes its work units so that the intersection of all units distributed to its multitude of clients is (theoretically) empty.
My non-functional requirements are that the language is Java and the database is SQLite or MySQL.
Some feedback for each of your potential solutions ...
1) make a single-threaded server architecture (ServerSocket.accept() is called again only after the previous client request has been served, so that the server is effectively accessed by only one client at a time)
ServerSocket.accept() will not allow you to do that; you might need some other type of synchronization to allow only one thread at a time to be in the position of fetching tuples. This basically leads you to your solution (2).
2) in a multi-threaded architecture, make the query and tuples-lock operations synchronized, so that I obtain a kind of atomicity (effectively serializing operations over the database)
Feasible, easy to implement and a common way to approach the problem. The only issue is how much you care about performance, latency and throughput, because if you have many of those clients and the work units' time span is very short, the clients might end up spending 90% of their time locked, waiting to get the "token".
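A minimal sketch of such a synchronized claim, assuming a work_units table with status and client_id columns (the table, column and class names are all made up for illustration):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class WorkUnitDispatcher {

        private final Connection connection; // shared JDBC connection, created elsewhere

        public WorkUnitDispatcher(Connection connection) {
            this.connection = connection;
        }

        // Only one thread at a time may select and mark tuples, so two clients
        // can never be handed the same work units.
        public synchronized List<Long> claimWorkUnits(String clientId, int batchSize) throws SQLException {
            List<Long> claimed = new ArrayList<>();
            String select = "SELECT id FROM work_units WHERE status = 'NEW' LIMIT ?";
            try (PreparedStatement ps = connection.prepareStatement(select)) {
                ps.setInt(1, batchSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        claimed.add(rs.getLong(1));
                    }
                }
            }
            String update = "UPDATE work_units SET status = 'REQUESTED', client_id = ? WHERE id = ?";
            try (PreparedStatement ps = connection.prepareStatement(update)) {
                for (Long id : claimed) {
                    ps.setString(1, clientId);
                    ps.setLong(2, id);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            return claimed;
        }
    }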
A possible solution to that contention issue is to use a hash-based distribution of work units. Let's say you have 500 work units to be shared between 50 clients. You give IDs to your work units in such a way that you know which clients will get which work units. In the end, you can assign nodes with a simple modulo operation:
assigned_node_id = work_unit_id % number_of_working_nodes
This technique, called pre-allocation, doesn't work for all types of problems, so it depends on your application. Use this approach if you have many short-running processes.
3) use atomic query operations on the database server (or file, in the case of SQLite), but in this case I need help because I don't know how things really work...
It's in essence the same as (2), but if you are able to do this, which I doubt you can with just SQL, you would end up tied to specific features of your RDBMS. Most likely you would have to use some non-standard SQL procedures to achieve this solution. And it doesn't fix the issues you would find with solution 2.
Summary
Solution 2 is likely to work in 90% of cases; the longer the tasks are, the better it suits this solution. If the tasks are very short, definitely go for a pre-allocation based algorithm.
With solution 3 you give up portability and flexibility.
DRY: try some other Open Source systems ...
There are a few Open Source Java projects that already deal with this kind of issue; they might be overkill for you, but I think it's worth mentioning them ...
http://www.gridgain.com/
http://www.jppf.org/
I advise you to read some articles like this one, to see how the DB can do the synchronization job for you.
