Architecture for distributed data storage

Currently I have two separate applications.
The first is a RESTful API.
The second is a data storage application that processes raw data and stores the processed data on the file system. The data is grouped into folders, and folder ids are grouped by user id.
These applications are connected through a message queue (ActiveMQ) using queueCount queues.
Files are also sent through this queue using an embedded file server.
I want to distribute this data storage across several nodes.
1) First variant
On each of the n nodes, set up ActiveMQ and the current storage application.
Create a master node that will serve queries to these shards.
This way, data for different users will be stored on different nodes.
2) Second variant
Set up n nodes with the storage app. Set up one instance of ActiveMQ. Create n*queueCount queues in ActiveMQ. Each storage node consumes messages from its corresponding queues.
Neither variant is perfect, though. Can you give me some advice?
Thanks in advance
Update:
What is the best way to evenly distribute data based on uuid?
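One common way to approach the question in the update is to hash the UUID and take the result modulo the number of nodes. The sketch below is only an illustration under that assumption: the class name UuidSharding and the method shardFor are hypothetical, and it assumes randomly generated (version 4) UUIDs, for which the bits are close to uniformly distributed.

```java
import java.util.UUID;

final class UuidSharding {
    // Maps a UUID to a shard index in the range [0, shardCount).
    static int shardFor(UUID id, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // Fold both 64-bit halves together so every bit of the UUID contributes.
        long mixed = id.getMostSignificantBits() ^ id.getLeastSignificantBits();
        // floorMod keeps the result non-negative even when mixed is negative.
        return (int) Math.floorMod(mixed, (long) shardCount);
    }

    public static void main(String[] args) {
        int shards = 4;
        int[] counts = new int[shards];
        for (int i = 0; i < 100_000; i++) {
            counts[shardFor(UUID.randomUUID(), shards)]++;
        }
        for (int s = 0; s < shards; s++) {
            System.out.println("shard " + s + ": " + counts[s]);
        }
    }
}
```

Note that plain modulo sharding moves most keys to a different node when the node count changes. If nodes are added often, a consistent hashing scheme may be a better fit.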

Why don't you use a distributed file system like HDFS to distribute your data store? This way replication is covered, the data is distributed, and you can even use Hadoop to send jobs that process your data in parallel.

@vvsh, what you are attempting is distributed storage with load balancing (though I did not understand how you plan to keep a specific user's files on a specific node and get even load distribution at the same time). Anyway, before I go any further: the mechanism you are attempting is quite difficult to achieve in a stable manner. Consider instead using some of the infrastructure mentioned in the comments; it may not fit your requirements 100%, but it will do a much better job.
Now, to achieve even distribution, your architecture essentially needs to be some kind of hub-and-spoke model, where the hub (in your case the master server) collects the load from a single queue with multiple JMS clients running on multiple threads. The master server essentially has to do round-robin dispatching (you may choose a different scheme: based on file count if file sizes are fairly constant, or based on file size and the net total dispatched to each node).
The persistence agents must run on every node to actually take the files, process them, and persist them in the datastore. The communication between the master and the agents could go through a web service or a direct socket (depending on the performance you require); queue-based communication with the agents could potentially choke your JMS server.
One observation: the files could be staged in another location, such as a document store or CMS, and only the ID communicated to the master and the agents, thereby reducing the network load and the JMS persistence load.
The above mechanism needs to take care of exceptions, failures, and re-dispatching (i.e. guaranteed delivery), as well as horizontal scaling and concurrency handling, and it must be optimized for performance. In my view you would be better off using some proven infrastructure, but if you really want to do it yourself, the above architecture will get the job done.
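The two dispatching schemes mentioned above (plain round-robin, or weighting by the total size already dispatched to each node) could look roughly like this on the master. This is a minimal sketch, not a full implementation: the Dispatcher class and its method names are hypothetical, nodes are identified by plain strings, and failure handling and re-dispatching are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class Dispatcher {
    private final List<String> nodes;
    private final long[] bytesSent; // running total dispatched to each node
    private int next = 0;

    Dispatcher(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.nodes = new ArrayList<>(nodes);
        this.bytesSent = new long[nodes.size()];
    }

    // Plain round-robin: adequate when file sizes are fairly constant.
    synchronized String nextRoundRobin() {
        String node = nodes.get(next);
        next = (next + 1) % nodes.size();
        return node;
    }

    // Size-aware: picks the node with the fewest bytes dispatched so far.
    // Ties go to the node that appears first in the list.
    synchronized String nextBySize(long fileSize) {
        int best = 0;
        for (int i = 1; i < bytesSent.length; i++) {
            if (bytesSent[i] < bytesSent[best]) {
                best = i;
            }
        }
        bytesSent[best] += fileSize;
        return nodes.get(best);
    }
}
```

The size-aware variant only balances what has been dispatched, not what has been completed, so a real master would also want feedback from the agents.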

Related

Limit on the amount of queues and verticles in Vert.x

We are now in the process of refactoring our messaging application written in Vert.x. The application processes incoming messages from users. Initially, it was implemented so that there is a single verticle instance that listens to a single queue in the event bus and processes all the incoming messages.
What we are thinking of doing is refactoring it so that it works somewhat like the actor model: we deploy an instance of a verticle for each active user and make it listen to a user-specific queue. This way the verticle instance can maintain user-specific state, and parallelizing the message processing becomes much easier.
The issue, however, is that this would lead to a huge number of deployed verticles (30k - 50k in parallel) and a huge number of queues in the event bus. We would also need to manage the verticles manually (undeploy unused verticles and deploy new ones when a message arrives from a new user).
The question is: is this actor-style architecture a good fit for Vert.x, and can it handle a large number of deployed verticles and event bus queues at the same time?

There's one major correction to be made here: the EventBus is a single queue. So you won't have a "huge number of queues". There will be only one. You'll have a huge number of addresses on a single queue.
But is this number so huge? Well, can a HashMap of 50K elements be considered huge? Probably not, at least in terms of keys. Note that this applies only to Vert.x in non-clustered mode. Clustered Vert.x is different (it should still work, though).
Having that many verticles is another matter. Each verticle is a separate object, and if you plan to store some data in it, it will be even larger. But if you can afford machines with a decent amount of RAM (16GB+), it should work just fine.
What does concern me about this solution, though, is that you plan to deploy verticles on demand and then undeploy them. That incurs delays, so your users will experience degraded performance for the first message they send.

What you call "actor-style" does not mean that you have to instantiate a new verticle per user. If you do so, you are going to get a system with 98% redundancy.
It's absolutely enough to register an event-bus address for each user and use some sort of persistent storage to keep track of them. Such storage can be any DB for long-term persistence, a cluster-wide SharedMap for short-term persistence, or a combination of both.
Perhaps you don't even need an address-per-user scheme. Such a scheme is nice when the users are constantly connected to your system via some sort of EventBusBridge. If this is not the case, you can register a single event-bus address for all users and process messages based on the payload.
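To illustrate the last point in plain Java (no Vert.x API is used here, so treat this purely as an analogy): a single handler can serve all users and keep user-specific state in a map keyed by the user id taken from the payload. The names UserMessageRouter, Message, and UserState are hypothetical, and the per-user state is just a counter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class UserMessageRouter {
    // Hypothetical message shape: the user id travels in the payload.
    static final class Message {
        final String userId;
        final String body;

        Message(String userId, String body) {
            this.userId = userId;
            this.body = body;
        }
    }

    // Hypothetical per-user state; here just a message counter.
    static final class UserState {
        int messagesSeen;
    }

    private final Map<String, UserState> states = new ConcurrentHashMap<>();

    // Single entry point for all users. State is created lazily on the
    // first message, so nothing has to be deployed per user up front.
    int handle(Message msg) {
        UserState state = states.computeIfAbsent(msg.userId, id -> new UserState());
        synchronized (state) {
            return ++state.messagesSeen;
        }
    }

    // Drops the state of a user who is no longer active.
    void evict(String userId) {
        states.remove(userId);
    }

    int activeUsers() {
        return states.size();
    }
}
```

In an actual Vert.x application the handle method would be the body of the consumer registered on the shared address, and the map could be replaced by whatever storage you choose for tracking users.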

Adding new node to a scalable system with zero downtime

I am working as a developer on a batch processing solution. It works by splitting a big file and processing it across JVMs. We have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway JVM's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request that is consumed by the processor JVMs. The REST request carries all the details, such as the location to pick the file up from.
Now, if I want to add another processor JVM without any downtime, is there a way to do it? Currently we maintain the URLs of the 4 JVMs in a property file. Is there a better way to do it, one that would let me add more JVMs without restarting any component?

You can consider setting up a load balancer and putting your JVM(s) behind it. The load balancer would be responsible for distributing the incoming requests to the JVMs.
This way you can scale the number of JVMs up or down depending on the workload. Also, if one of the JVMs is not working, the other parts of your system no longer need to care about it.

I'm not sure what your use case and tech stack are, but it seems that you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?
If you cannot use any of these, then the solution is to maintain the list of JVMs in some datastore (let's say in a table). It is dynamic data, meaning one can add, remove, or update JVMs. Then you need a resource manager that can decide whether to spin up a new JVM based on load or any other conditional logic. This resource manager needs to monitor the entire system. Also, whenever you create a task, chunk, or slice of data, distribute it using a message queue such as Apache ActiveMQ. You can also consider Kafka for complex use cases. Nowadays, application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capability, so if you are already using such an application server, you can think about making use of that capability. I hope this helps.
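The idea of a dynamic JVM list can be sketched as follows. This is a simplified, in-memory illustration with hypothetical names (WorkerRegistry, planChunks); in practice the list would be backed by the table or datastore mentioned above. The point is that the gateway reads the current list for every file, so a newly registered processor is picked up without a restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class WorkerRegistry {
    private final CopyOnWriteArrayList<String> workerUrls = new CopyOnWriteArrayList<>();

    // Processors can be added or removed while the gateway keeps running.
    void register(String url) {
        workerUrls.addIfAbsent(url);
    }

    void unregister(String url) {
        workerUrls.remove(url);
    }

    List<String> snapshot() {
        return new ArrayList<>(workerUrls);
    }

    // Splits [0, fileLength) into one contiguous range per registered worker.
    // Each entry is {startInclusive, endExclusive}, in worker order.
    List<long[]> planChunks(long fileLength) {
        List<String> workers = snapshot();
        if (workers.isEmpty()) {
            throw new IllegalStateException("no workers registered");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = fileLength / workers.size();
        long extra = fileLength % workers.size();
        long start = 0;
        for (int i = 0; i < workers.size(); i++) {
            // The first 'extra' workers take one additional unit each.
            long len = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + len});
            start += len;
        }
        return ranges;
    }
}
```

The chunk boundaries here are plain offsets. A real splitter would probably need to align them to record boundaries in the file.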

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This may not be possible, but I thought I might just give it a try. I have some work that processes some data. It makes one of 3 decisions for each piece of data it processes: keep, discard, or modify/reprocess (because it's unsure whether to keep or discard). This generates a very large amount of data, because the reprocessing may break the data into many different parts.
My initial method was to send it to the executor service that was processing the data, but because the number of items to process was large, I would run out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network IO. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps old messages to the local drive, so if I have 8 gigs of memory on my server I can still have a 100 gig message queue.
So my question is: is there any library that has a similar feature in Java? Something that I can use as a non-blocking queue that keeps only X items in memory (either by number of items or by size) and writes the rest to the local drive.
Note: Right now I'm only asking for this to be used on one server. In the future I might add more servers, but because each server generates its own data, I would try to take messages from one queue and push them to another if one server's queue is empty. The library would not need to have network access, but I would need to access the queue from another Java process. I know this is a long shot, but I thought if anyone knew, it would be SO.

Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, then commit to the DB to save them on disk, and later query them with a convenient SQL query.

Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" type it provides, you can specify the maximum amount of memory used, and any data beyond that will be written to the hard drive.

Use the filesystem. It's old-school, yet so many engineers get bitten with libraries because they are lazy. True that HSQLDB provides lots of value-add features, but in the context of being light weight....
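As a rough idea of what the filesystem approach could look like: keep up to a fixed number of items in memory and append the overflow to a file, reading it back once the in-memory part drains. This is only a sketch with hypothetical names (SpillQueue, memoryLimit). It uses String items, is limited to one process, and does not try to survive a crash.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;

final class SpillQueue implements AutoCloseable {
    private final ArrayDeque<String> memory = new ArrayDeque<>();
    private final int memoryLimit;
    private final RandomAccessFile disk;
    private long readPos = 0;  // file offset of the oldest spilled item
    private long writePos = 0; // file offset where the next item is appended
    private long onDisk = 0;   // number of items currently in the file

    SpillQueue(int memoryLimit, File spillFile) {
        this.memoryLimit = memoryLimit;
        try {
            this.disk = new RandomAccessFile(spillFile, "rw");
            this.disk.setLength(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized void offer(String item) {
        // To keep FIFO order, new items may only go to memory while the file is empty.
        if (onDisk == 0 && memory.size() < memoryLimit) {
            memory.addLast(item);
            return;
        }
        try {
            disk.seek(writePos);
            disk.writeUTF(item); // writeUTF caps an item at 65535 encoded bytes
            writePos = disk.getFilePointer();
            onDisk++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the oldest item, or null if the queue is empty.
    synchronized String poll() {
        if (memory.isEmpty()) {
            refill();
        }
        return memory.pollFirst();
    }

    private void refill() {
        try {
            while (onDisk > 0 && memory.size() < memoryLimit) {
                disk.seek(readPos);
                memory.addLast(disk.readUTF());
                readPos = disk.getFilePointer();
                onDisk--;
            }
            if (onDisk == 0) {
                // Everything has been read back, so the file can be reused from the start.
                disk.setLength(0);
                readPos = 0;
                writePos = 0;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized long size() {
        return memory.size() + onDisk;
    }

    @Override
    public synchronized void close() {
        try {
            disk.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The file only shrinks when it has been fully drained, so a queue that never empties would keep growing on disk. Segment files that are deleted once read would be one way around that.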

How to make my Java application scalable and fault tolerant?

In a simplified manner my Java application can be described as follows:
It is a web application running on a Tomcat server with a SOAP interface. The application uses JPA/Hibernate to store data in a MySQL database. The data stored consists of list of users, a list of hosts, and a list of URIs pointing to huge files (10GB) in the filesystem.
The whole system consists of a central server, on which my application runs, and a bunch of worker hosts. A user can connect to the SOAP interface and ask the system to copy the files that belong to him to a specific worker host, where he can then analyze the data in some way (we cannot use NFS; we need to copy the data to the local disk storage of a worker host). The database then stores, for each user, which worker host his files are on.
At the moment the system is running with one central server (hosting the Tomcat application and the MySQL database), 10 worker hosts, and about 30 users, who have 100 files (10GB each on average) stored across the worker hosts.
But in the future I have to scale the system by a factor of 100-1000. So I might have to deal with 10000 users, 100000 files, and 10000 hosts. The system should also become fault tolerant, so that I don't have a single central server (which is the single point of failure in the system now), but maybe several. Also, if one of the worker hosts fails, the system should be notified, so that it doesn't try to copy files to that host.
My question is now: Which Java technologies could I use to make my application scalable and fault tolerant? What kind of architecture would you recommend? Should I still have a huge database storing all the information about all files, hosts and users in the system in one place, or should I better distribute my database on several hosts and synchronize them somehow?

The technology you need is called Architecture.
No matter which technology you use, you need to have a well-architected system for scalability and redundancy. Make a diagram of the entire architecture of the system as it currently works. Mark each component with its limitations for users, jobs, bandwidth, hard drive space, memory, or whatever parts are limiting for your application. This will give you the baseline design.
Now draw that same diagram as it would need to be to meet your scalability and redundancy requirements. You might have to break apart pieces to make it work, or develop entirely new pieces. This diagram will make it very clear what you need.
One specific thing I want to address is the database. If you can split the database along logical lines so that no query joins data from one part to another, then you should have separate databases. Beyond that, the best configuration for a database is to have each database on one fast machine with lots of storage and very fast access times. If you do this, the only things that will slow down your database are bad queries or poorly indexed tables. In my experience, synchronizing databases is to be avoided unless you have one master database that has write access and replicates to other databases that are read-only. Regardless, this can be a last step, after you've profiled all of your queries and you literally need additional hardware.

Workload Distribution / Parallel Execution in JAVA

I have a situation where I need to distribute work across multiple Java processes running in different JVMs, probably on different machines.
Let's say I have a table with records 1 to 1000. I am looking for the work to be collected and distributed in sets of 10. Let's say records 1-10 go to workerOne, then records 11-20 go to workerThree, and so on. Needless to say, workerOne never does the work of workerTwo unless and until workerTwo couldn't do it.
This example is purely database-based, but I believe it could be extended to any system, be it file processing, email processing, and so forth.
I have a feeling that the immediate response would be to go for a Master/Worker approach. However, here we are talking about different JVMs. Even if one JVM were to go down, the other JVMs should just keep doing their work.
Now the million dollar question: are there any good (production-ready) frameworks that would let me do this? Even concrete implementations for specific needs like database records, file processing, email processing, and the like would help.
I have seen the Java Parallel Execution Framework, but I am not sure whether it can be used across different JVMs, or whether the others would keep going if one were to go down. I believe workers could be on multiple JVMs, but what about the master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. That's a bit too much.
Thanks,
Franklin

Might want to look into MapReduce and Hadoop

You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each of the workers just keeps waiting on the queue for something to show up. When something does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. It's simple, and people have been doing it that way for a long time, so there's a lot of information about it on the net.
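The pattern described above can be tried out inside a single JVM before bringing in a broker. The sketch below uses a java.util.concurrent BlockingQueue as a stand-in for the real message queue and threads as stand-ins for the worker JVMs; the class name ChunkQueueDemo and its methods are hypothetical, and the "work" is just counting records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ChunkQueueDemo {
    // Producer side: packages record ids first..last into {from, to} chunks.
    static List<int[]> chunk(int first, int last, int chunkSize) {
        List<int[]> chunks = new ArrayList<>();
        for (int start = first; start <= last; start += chunkSize) {
            chunks.add(new int[] {start, Math.min(start + chunkSize - 1, last)});
        }
        return chunks;
    }

    // Starts workerCount workers that pull chunks until the queue is empty.
    // Returns the total number of records the workers processed.
    static int run(int first, int last, int chunkSize, int workerCount) {
        BlockingQueue<int[]> queue = new LinkedBlockingQueue<>(chunk(first, last, chunkSize));
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int w = 0; w < workerCount; w++) {
            pool.submit(() -> {
                int[] range;
                // Whichever worker is free takes the next chunk.
                while ((range = queue.poll()) != null) {
                    processed.addAndGet(range[1] - range[0] + 1); // placeholder for real work
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return processed.get();
    }
}
```

With a real broker, a chunk taken by a worker that then dies would need to be redelivered, which is usually handled through message acknowledgements rather than in application code.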

Check out Hadoop

I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself, you will need a work manager that keeps track of jobs to do, jobs in progress, and jobs that were never finished and need to be rescheduled. The workers then ask for something to do, do it, send the result back, and ask for more.
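A bare-bones version of such a work manager might look like the following. It is a sketch under simplifying assumptions: jobs are identified by strings, time is passed in explicitly to keep the example easy to test, and the names (WorkManager, claim, rescheduleStale) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

final class WorkManager {
    private final ArrayDeque<String> pending = new ArrayDeque<>();
    private final Map<String, Long> inProgress = new HashMap<>(); // jobId -> claim time (ms)
    private final Set<String> done = new HashSet<>();
    private final long timeoutMillis;

    WorkManager(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    synchronized void submit(String jobId) {
        pending.addLast(jobId);
    }

    // A worker asks for something to do. Returns null if nothing is pending.
    synchronized String claim(long nowMillis) {
        String job = pending.pollFirst();
        if (job != null) {
            inProgress.put(job, nowMillis);
        }
        return job;
    }

    // A worker reports a finished job. Unknown job ids are ignored.
    synchronized void complete(String jobId) {
        if (inProgress.remove(jobId) != null) {
            done.add(jobId);
        }
    }

    // Moves jobs that were claimed at least timeoutMillis ago back to the
    // pending queue. Returns how many jobs were rescheduled.
    synchronized int rescheduleStale(long nowMillis) {
        int moved = 0;
        Iterator<Map.Entry<String, Long>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                pending.addLast(entry.getKey());
                it.remove();
                moved++;
            }
        }
        return moved;
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized int doneCount() {
        return done.size();
    }
}
```

Across JVMs the claim and complete calls would go over whatever remoting you choose, and the manager itself becomes the component whose availability you have to think about.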
You may want to elaborate on what kind of work you want to do.

The problem you've described is definitely best solved using the master/worker pattern.
You should have a look at JavaSpaces (part of the Jini framework); it's really well suited to this kind of thing. Basically, you just want to encapsulate each task to be carried out inside a Command object, subclassing as necessary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble the results when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.

If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain from processing the records on different machines might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in a (shared) filesystem might put heavy I/O pressure on the OS.
And the cost of maintaining multiple JVMs on multiple machines might be overkill too.
As for the question: I once used JADE (Java Agent Development Environment) for some distributed simulation. Its multi-machine support and message-passing nature might help you.

I would consider using JGroups for that. You can cluster your JVMs; one of your nodes can be selected as master and can then distribute the work to the other nodes by sending messages over the network. Or you can partition your work items up front and have the master node manage the distribution of the partitions, e.g. partition-1 goes to JVM-4, partition-2 goes to JVM-3, partition-3 goes to JVM-2, and so on. If JVM-4 goes down, the master node will notice and will tell one of the other nodes to pick up partition-1 as well.
One other alternative, which is easier to use, is Redis pub/sub support: http://redis.io/topics/pubsub . But then you will have to maintain Redis servers, which I don't like.
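The partition bookkeeping that the master would do can be sketched independently of JGroups (no JGroups API is used below; how the master learns about a failed node is left out). The names PartitionAssigner, ownerOf, and nodeFailed are hypothetical, and the initial assignment is a simple round-robin over the node list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionAssigner {
    private final Map<Integer, String> owner = new TreeMap<>(); // partition -> node
    private final List<String> nodes;

    PartitionAssigner(List<String> nodes, int partitionCount) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.nodes = new ArrayList<>(nodes);
        // Initial assignment: spread partitions over the nodes in turn.
        for (int p = 0; p < partitionCount; p++) {
            owner.put(p, this.nodes.get(p % this.nodes.size()));
        }
    }

    synchronized String ownerOf(int partition) {
        return owner.get(partition);
    }

    // Called by the master once it has noticed that a node is gone.
    // Each orphaned partition goes to the surviving node that currently
    // owns the fewest partitions.
    synchronized void nodeFailed(String failed) {
        nodes.remove(failed);
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no surviving nodes");
        }
        for (Map.Entry<Integer, String> entry : owner.entrySet()) {
            if (entry.getValue().equals(failed)) {
                entry.setValue(leastLoaded());
            }
        }
    }

    private String leastLoaded() {
        String best = null;
        long bestCount = Long.MAX_VALUE;
        for (String node : nodes) {
            long count = owner.values().stream().filter(node::equals).count();
            if (count < bestCount) {
                best = node;
                bestCount = count;
            }
        }
        return best;
    }
}
```

If the master itself can fail, this table has to live somewhere the next master can rebuild it from, which is a large part of why a ready-made framework is attractive here.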
