Simplified, my Java application can be described as follows:
It is a web application running on a Tomcat server with a SOAP interface. The application uses JPA/Hibernate to store data in a MySQL database. The stored data consists of a list of users, a list of hosts, and a list of URIs pointing to huge files (about 10 GB each) in the filesystem.
The whole system consists of a central server, on which my application runs, and a bunch of worker hosts. A user can connect to the SOAP interface and ask the system to copy the files that belong to him to a specific worker host, where he can then analyze the data in some way (we cannot use NFS; we need to copy the data to the local disk storage of a worker host). The database then stores, for each user, on which worker host his files are stored.
At the moment the system runs with one central server (hosting the Tomcat application and the MySQL database), 10 worker hosts, and about 30 users with 100 files (averaging 10 GB each) stored distributed across the worker hosts.
But in the future I will have to scale the system by a factor of 100-1000, so I might have to deal with 10,000 users, 100,000 files, and 10,000 hosts. The system should also become fault tolerant, so that I don't have a single central server (the single point of failure in the system now) but maybe several. Also, if one of the worker hosts fails, the system should be notified so it doesn't try to copy files to that host.
My question now is: which Java technologies could I use to make my application scalable and fault tolerant? What kind of architecture would you recommend? Should I still have one huge database storing all the information about all files, hosts, and users in the system in one place, or should I distribute my database across several hosts and synchronize them somehow?
The technology you need is called Architecture.
No matter which technology you use, you need to have a well-architected system for scalability and redundancy. Make a diagram of the entire architecture of the system as it currently works. Mark each component with its limitations for users, jobs, bandwidth, hard drive space, memory, or whatever parts are limiting for your application. This will give you the baseline design.
Now draw that same diagram as it would need to be to meet your scalability and redundancy requirements. You might have to break apart pieces to make it work, or develop entirely new pieces. This diagram will make it very clear what you need.
One specific thing I want to address is the database. If you can split the database along logical lines so that you never join queries from one to another, then you should have separate databases. Beyond that, the best configuration for a database is to have each database on one fast machine with lots of storage and very fast access times. If you do this, the only things that will slow down your database are bad queries or poorly indexed tables. In my experience, synchronizing databases is to be avoided unless you have one master database with write access that replicates to other, read-only databases. Regardless, this can be a last step, taken after you've profiled all of your queries and you literally need additional hardware.
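To make the master/read-replica configuration concrete, here is a minimal sketch in plain JDBC; the class, connection URLs, and credentials are invented for illustration, not a definitive implementation:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch: writes go to the master, reads are spread over the
// read-only replicas. URLs and credentials are placeholders.
public class ReplicatedDataSource {
    private static final String MASTER_URL = "jdbc:mysql://db-master/app";
    private static final List<String> REPLICA_URLS = Arrays.asList(
            "jdbc:mysql://db-replica1/app",
            "jdbc:mysql://db-replica2/app");

    public Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(MASTER_URL, "app", "secret");
    }

    public Connection readConnection() throws SQLException {
        // Pick a replica at random; a production version would check
        // replica health and fall back to the master if none responds.
        String url = REPLICA_URLS.get(
                ThreadLocalRandom.current().nextInt(REPLICA_URLS.size()));
        Connection c = DriverManager.getConnection(url, "app", "secret");
        c.setReadOnly(true); // hint that this connection only reads
        return c;
    }
}
```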
Related
I am working as a developer on a batch processing solution. It works by splitting a big file and processing it across JVMs: we have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway JVM's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send each one a REST request, which the processor JVMs consume. The REST request carries all the details: the file location to pick the chunk up from, and some other details.
Now, if I want to add another processor JVM without any downtime, is there any way to do it? Currently we maintain the URLs for the 4 JVMs in a property file. Is there a better approach, one that gives me the ability to add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVMs behind it. The load balancer would be responsible for distributing incoming requests to the JVMs.
This way you can scale your JVMs up or down depending on the workload. Also, if one of the JVMs stops working, the rest of your system no longer needs to care about it.
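If a dedicated load balancer is not an option, the gateway itself can approximate one. Below is a minimal sketch of client-side round-robin dispatch; the class is invented for illustration, and the worker list is assumed to be refreshed from some shared registry (a database table, ZooKeeper, etc.) rather than a static property file:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: client-side round-robin over the processor JVMs. The worker
// list is hypothetical and would be refreshed from a shared registry
// instead of a static property file.
public class RoundRobinDispatcher {
    private final AtomicReference<List<String>> workers =
            new AtomicReference<List<String>>(Collections.<String>emptyList());
    private final AtomicInteger next = new AtomicInteger();

    // Called whenever a processor JVM joins or leaves -- no restart needed.
    public void updateWorkers(List<String> urls) {
        workers.set(Collections.unmodifiableList(new ArrayList<String>(urls)));
    }

    public String nextWorker() {
        List<String> current = workers.get();
        if (current.isEmpty()) {
            throw new IllegalStateException("no processor JVMs registered");
        }
        return current.get(Math.floorMod(next.getAndIncrement(), current.size()));
    }
}
```

Because updateWorkers can be called at runtime, a new processor JVM only has to register itself; nothing needs to be restarted.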
I'm not sure what your use case is or which tech stack you are following, but it seems that you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?
If you cannot use any of those, then the solution is to maintain the list of JVMs in some datastore (let's say a table); it is dynamic data, meaning JVMs can be added, removed, or updated. Then you need a resource manager that can decide whether to spin up a new JVM based on load or any other conditional logic. This resource manager needs to monitor the entire system.

Also, whenever you create a task, chunk, or slice of data, distribute it using a message queue such as Apache ActiveMQ; you can also consider Kafka for complex use cases. Nowadays application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capability, so if you are already using such an application server you can think of making use of that. I hope this helps.
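As a sketch of the queue-based distribution mentioned above: the gateway publishes each chunk descriptor to one shared queue, and every processor JVM consumes from it. The broker URL, queue name, and payload format below are invented for illustration:

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

// Sketch: the gateway publishes chunk descriptors to one shared queue.
// Every processor JVM subscribed to "file.chunks" competes for messages,
// so adding a fifth JVM requires no change to the gateway.
public class ChunkPublisher {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://broker-host:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session =
                    connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("file.chunks");
            MessageProducer producer = session.createProducer(queue);

            // The message carries the file location and chunk boundaries.
            TextMessage msg = session.createTextMessage(
                    "{\"file\":\"/data/big.dat\",\"from\":0,\"to\":1048576}");
            producer.send(msg);
        } finally {
            connection.close();
        }
    }
}
```

Since the consumers compete for messages, a newly started processor JVM simply opens a consumer on the same queue and starts receiving work; the gateway never needs to know how many JVMs exist.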
Currently I have two separate applications.
The first is a RESTful API.
The second is a data storage service that can process raw data and store the processed data on the file system. The data is grouped into folders, and the folder IDs are grouped by user IDs.
These applications are connected through a message queue (ActiveMQ) using queueCount queues.
Files are also sent through this queue, using an embedded file server.
I want to distribute this data storage across several nodes.
1) First variant
On each of the n nodes, set up ActiveMQ and the current storage application.
Create a master node that will serve queries to these shards.
This way, data for different users will be stored on different nodes.
2) Second variant
Set up n nodes with the storage app. Set up a single ActiveMQ instance and create n*queueCount queues in it; each storage node consumes messages from its corresponding queues.
But neither variant is perfect; maybe you can give me some advice?
Thanks in advance
Update:
What is the best way to evenly distribute data based on a UUID?
Why don't you use a distributed file system like HDFS to distribute your data store? That way replication is covered, data is distributed, and you can even use Hadoop to send jobs that process your data in parallel.
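For illustration, writing a file into HDFS from Java takes only a few lines; the namenode URI, replication factor, and paths below are assumptions:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: store a processed file in HDFS, which replicates it for you.
// The namenode URI and the paths are placeholders.
public class HdfsStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // keep 3 copies of each block

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        fs.copyFromLocalFile(
                new Path("/tmp/processed/user42/result.bin"),
                new Path("/storage/user42/result.bin"));
        fs.close();
    }
}
```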
#vvsh, what you are attempting is distributed storage with load balancing (though I did not understand how you plan to keep a specific user's files on a specific node and at the same time get an even load distribution). Anyway, before I go any further: the mechanism you are attempting is quite difficult to achieve in a stable manner. Consider instead using some of the infrastructures mentioned in the comments; they may not fit your requirements 100%, but they will do a much better job.
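As an aside on the UUID question from the update: hashing the UUID modulo the node count gives an even spread but reshuffles almost all data whenever a node is added. A consistent-hash ring is the usual compromise; here is a minimal sketch (the class name and virtual-node count are arbitrary):

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.UUID;

// Sketch: map a user's UUID to a storage node. A plain hash modulo the
// node count is even but reshuffles almost everything when nodes change;
// a consistent-hash ring moves only about 1/n of the keys.
public class NodeLocator {
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();

    public NodeLocator(List<String> nodes) {
        for (String node : nodes) {
            // A few "virtual" points per node smooth the distribution;
            // real implementations use hundreds and a stronger hash.
            for (int v = 0; v < 16; v++) {
                ring.put((node + "#" + v).hashCode(), node);
            }
        }
    }

    public String nodeFor(UUID userId) {
        SortedMap<Integer, String> tail = ring.tailMap(userId.hashCode());
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```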
Now, to achieve even distribution, your architecture essentially needs to be some kind of hub-and-spoke model, where the hub (in your case the master server) collects the load from a single queue with multiple JMS clients running on multiple threads. The master server then has to do the dispatching, essentially round-robin (you may choose different schemes: based on file count, if file sizes are fairly constant, or on file size and the net total dispatched to each node).
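Here is a sketch of the size-based variant of that dispatching scheme, with an invented class name: track the net total bytes sent to each node and always pick the least-loaded one.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "net total dispatched" scheme: always send the next file
// to the node that has received the fewest bytes so far. With roughly
// constant file sizes this degenerates to plain round-robin.
public class LeastLoadedDispatcher {
    private final Map<String, Long> bytesSent = new HashMap<String, Long>();

    public LeastLoadedDispatcher(Iterable<String> nodes) {
        for (String node : nodes) {
            bytesSent.put(node, 0L);
        }
    }

    public synchronized String dispatch(long fileSizeBytes) {
        String best = null;
        long min = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : bytesSent.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                best = e.getKey();
            }
        }
        bytesSent.put(best, min + fileSizeBytes);
        return best;
    }
}
```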
The persistence agents must run on every node to actually take the files, process them, and persist them in the datastore. The communication between the master and the agents could be through a web service or a direct socket (depending on the performance you require); queue-based communication with the agents could potentially choke your JMS server.
One observation: the files could be staged in another location, like a document store/CMS, with only the ID communicated to the master and the agents, thereby reducing the network load and the JMS persistence load.
The above mechanism needs to take care of exceptions, failures, and re-dispatching (i.e. guaranteed delivery), horizontal scaling, and concurrency handling, and it must be optimized for performance. In my view you would be better off using some proven infrastructure, but if you really want to do it yourself, the above architecture will get the job done.
I have the following problem:
I have a web application that stores data in the database. I would like clients to be able to extract the data of, e.g., 2 tables to a file (local to the client).
The database could be arbitrarily big (meaning I have no idea how much data could potentially be in it; it could be huge).
What is the best approach for this?
Should all the data be SELECTed out of the tables and returned to the client as a single structure to be stored in a file?
Or should the data be retrieved in parts, e.g. the first 100 entries, then the next 100, and so on, with the single structure assembled on the client?
Are there any pros and cons to consider here?
I've built something similar, and there are some really awkward problems here, especially as the file size can grow beyond what you can comfortably handle in a browser. As the amount of data grows, the time to generate the file increases; long-running requests are not what a web application is good at, so you run the risk of your web server getting unhappy with even a smallish number of visitors all requesting a large file.
What we did is split the application into 3 parts.
The "file request" was a simple web page, in which authenticated users can request their file. This kicks off the second part outside the context of the web page request:
File generator.
In our case, this was a Windows service which watched a database table of file requests, picked the latest one, ran the appropriate SQL query, wrote the output to a CSV file, and zipped that file before moving it to the output directory and mailing the user a link. It set the state of the record in the database to make sure only one process ran at any one point in time.
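The original generator was a Windows service, but the polling pattern translates directly to Java. A hedged sketch, with an invented file_request table and states:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

// Sketch of the generator loop: claim the oldest pending request, export
// it, and mark it done. The file_request table, its columns, and the
// states are invented for illustration.
public class FileRequestWorker implements Runnable {
    private final DataSource ds;

    public FileRequestWorker(DataSource ds) {
        this.ds = ds;
    }

    @Override
    public void run() { // scheduled, e.g., every minute
        try (Connection c = ds.getConnection()) {
            long id;
            try (PreparedStatement find = c.prepareStatement(
                         "SELECT id FROM file_request WHERE state = 'PENDING' " +
                         "ORDER BY id LIMIT 1");
                 ResultSet rs = find.executeQuery()) {
                if (!rs.next()) {
                    return; // nothing to do this tick
                }
                id = rs.getLong(1);
            }
            // The state check makes the claim atomic: if another worker
            // got there first, executeUpdate() returns 0 and we back off.
            try (PreparedStatement claim = c.prepareStatement(
                         "UPDATE file_request SET state = 'RUNNING' " +
                         "WHERE id = ? AND state = 'PENDING'")) {
                claim.setLong(1, id);
                if (claim.executeUpdate() == 0) {
                    return;
                }
            }
            // ... run the export query, stream it to a CSV, zip it, move it
            // to the output folder, mail the link, set state = 'DONE' ...
        } catch (Exception e) {
            e.printStackTrace(); // a real service would log and retry
        }
    }
}
```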
FTP/WebDAV site:
The ZIP files were written to a folder accessible via FTP and WebDAV; these protocols tend to handle huge files better than a standard HTTP download.
This worked pretty well - users didn't like to wait for their files, but the delay was rarely more than a few minutes.
We have a similar use case with an Oracle cluster containing approx. 40 GB of data. The solution working best for us is retrieving the maximum amount of data per select statement, as this reduces DB overhead significantly.
That being said, there are three optimizations which worked very well for us (a combined sketch follows after the list):
1.) We partition the data into 10 roughly equal-sized sets and select them from the database in parallel. For our cluster we found that 8 parallel connections work approx. 8 times faster than a single connection. There is some additional speedup with up to 12 connections, but that depends on your database and your DBA.
2.) Keep away from Hibernate and other ORMs and use hand-written JDBC once you are talking about large amounts of data. Use every optimization you can get there (e.g. ResultSet.setFetchSize()).
3.) Our data compresses very well, and pushing it through a gzip stream saves lots of I/O time. In our case it eliminated I/O from the critical path. By the way, this is also true for storing the data in a file.
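Here is the combined sketch promised above: the key space is split into ranges, each range is read over its own JDBC connection with a fetch-size hint, and rows are streamed straight into gzipped files. The table, columns, file names, and partition count are placeholders:

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;
import javax.sql.DataSource;

// Sketch combining all three points: ranged selects over parallel
// connections, a fetch-size hint, and streaming into gzipped output.
// Note that setFetchSize is only a hint: Oracle honors positive values,
// PostgreSQL needs autocommit off, MySQL streams only with
// Integer.MIN_VALUE.
public class ParallelGzipExport {
    public static void export(final DataSource ds, long maxId, int parts)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Future<?>> results = new ArrayList<>();
        final long step = maxId / parts + 1;
        for (int p = 0; p < parts; p++) {
            final long lo = p * step;
            final long hi = lo + step;
            final String file = "export-part" + p + ".csv.gz";
            results.add(pool.submit(() -> {
                try (Connection c = ds.getConnection();
                     PreparedStatement ps = c.prepareStatement(
                             "SELECT id, name FROM big_table " +
                             "WHERE id >= ? AND id < ?");
                     BufferedWriter out = new BufferedWriter(
                             new OutputStreamWriter(
                                     new GZIPOutputStream(
                                             new FileOutputStream(file)),
                                     StandardCharsets.UTF_8))) {
                    ps.setFetchSize(1000); // cursor-style fetching
                    ps.setLong(1, lo);
                    ps.setLong(2, hi);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            out.write(rs.getLong(1) + "," + rs.getString(2));
                            out.newLine();
                        }
                    }
                }
                return null;
            }));
        }
        for (Future<?> f : results) {
            f.get(); // surfaces any worker failure
        }
        pool.shutdown();
    }
}
```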
Although this is a Java-centric question it really applies to any system utilizing a multi-tier architecture.
In 3-tier architectures, you typically have 3 tiers:
A client/presentation tier where the client code lives; and
A middleware tier where business logic lives; and
A data/EIS tier where the RDBMS and other data-heavy systems live
In Java land, for a web application, this might look like:
An application server, such as GlassFish running both the "web tier" (WARs comprising the client tier in a web app) as well as the "business tier" (EJBs, middleware, etc.); and
An RDBMS server embodying the data tier
In a virtualized/clustered environment, these applications (GlassFish, an RDBMS such as Oracle or PostgreSQL, etc.) will run on VMs.
My question: What are the standard ways of allocating/distributing this 3-tier architecture across these VMs? Meaning, any one of the following "strategies" might be viable, but none is obviously preferable:
One VM (let's say all VMs are Ubuntu Servers so cost/price doesn't factor into the equation) running both GlassFish and the RDBMS (all 3 tiers)
Two VMs: an application server VM running GlassFish, and a database server VM running, say, PostgreSQL
Three VMs: two app server VMs both running GlassFish, where one GlassFish instance runs only the WARs (web tier) and the second runs the middleware/business logic; plus a third VM as the DB server
Obviously if all the servers (all tiers) ran on the same VM, they might run faster or more efficiently because they wouldn't be bogged down by network latency. But they'd be on the same VM, which means I would need mega-hardware to support them. There might also be security concerns with this setup.
There will be pros/cons to each. I'm interested in what strategies would best accomplish the following goals: (1) maximizes throughput/speed, (2) is best suited for a clustered/cloud environment and (3) maximizes security.
Thanks in advance!
(1) maximizes throughput/speed,
This depends entirely on your application, e.g. the database could be your bottleneck, in which case what you do in the JVM doesn't matter so much.
(2) is best suited for a clustered/cloud environment
If you are going to distribute your system, you most likely want to distribute your presentation layer. This is because the work it does depends on the number of clients, and the work each client generates is largely independent (in the presentation layer).
and (3) maximizes security.
Having more VMs doesn't guarantee improved security. Your JVM should be set up so that the different applications running in it are fairly well separated anyway. If you want to guard against denial-of-service attacks and your back-end services are used by other systems, you may want to separate them; otherwise, it doesn't make much difference.
I think the answer to your questions depends heavily on the application's behavior, database usage, and so on; nothing definitive can be said without looking at current performance metrics (as other answers have already mentioned). I'll post some of my thoughts as guidelines.
Database
Most organizations run the RDBMS on separate hosts, and some engineers would choose never to virtualize these, depending on what their DB vendor's best practices say for their case (that said, I normally consider VMs equivalent to physical hosts and use them whenever possible).
Performance-wise, RDBMSs often require kernel tuning or unconventional filesystem strategies, and having them on separate hosts can help. If the DB ever needs to be set up in high-availability mode or in a cluster, having it separated from the application servers can also make things easier. Note that database tuning, if you ever need to do it, can be a difficult topic if taken seriously, and it often involves things like aligning partitions on disk, reducing disk head movement by cleverly allocating database data segments/files, and considering DBMS and OS cache sizes and strategies. All of this can impact other applications running on the same host, so I'd rather leave the DB alone.
In addition, an RDBMS often serves several applications (there are good reasons for this: some integrations require access to more than one application's database). Having it separated from the application servers helps.
Also, DB systems have their own upgrade, backup, distribution/clustering, and administration procedures, and are often maintained by different people than the application servers. The whole database administration topic is thus easier to deal with if you consider it separately. And if the database becomes a bottleneck, you can work on the database alone without wondering whether the other tiers are impacting performance.
I do recommend keeping the RDBMS alone on a single host for reasonably sized production environments. But of course, if you don't have performance, administration, or availability requirements, you can consider using a shared server for everything.
GlassFish
In general terms, when you want to deploy a Java EE application to several servers (for load balancing or high availability), you install the same application server and artifacts on all of the application servers in the cluster. Then you can choose which artifacts to enable on each node of the cluster. Some application servers can enable or disable components depending on server load. In this case, your application server is the "unit" that you need to distribute.
Now, there may be cases where you or your organization prefer completely separated network layers for the web tier and the business tier (e.g. for security reasons). In this case, you would use separate hosts. If your web tier is really heavy and you find that you need to scale it separately from the business tier (i.e. you find you need 6 web servers but can make do with 1 or 2 EJB containers), I'd separate these two tiers too.
As a note: there are slight benefits in running the web and EJB tiers in the same Glassfish instance: as they share the JVM, calls between the web tier and the business tier can use call-by-reference semantics. Depending on your work load and the size and serialization cost of responses, this can result in a noticeable performance increase.
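To illustrate that point (bean and interface names below are invented): a co-located web tier can inject the @Local view, which passes object references, while a split deployment has to use the @Remote view, which serializes every argument and result:

```java
import java.io.Serializable;
import javax.ejb.Local;
import javax.ejb.Remote;
import javax.ejb.Stateless;

// When the WAR and the EJBs run in the same GlassFish instance, callers
// can use the @Local view, which hands them a reference (no copying).
// A split deployment forces the @Remote view, which serializes arguments
// and results on every call.
@Local
interface OrderServiceLocal {
    OrderSummary summarize(long orderId);
}

@Remote
interface OrderServiceRemote {
    OrderSummary summarize(long orderId);
}

@Stateless
public class OrderService implements OrderServiceLocal, OrderServiceRemote {
    @Override
    public OrderSummary summarize(long orderId) {
        // ... business logic would aggregate the order here ...
        return new OrderSummary(orderId);
    }
}

// Must be Serializable so the remote view can marshal it.
class OrderSummary implements Serializable {
    final long orderId;

    OrderSummary(long orderId) {
        this.orderId = orderId;
    }
}
```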
In most cases, for many corporate applications, I'd use just one or two servers (depending if you need high availability) containing both layers, because even if load rises you can still grow vertically (increase server power or VM resources) or horizontally (add another server and load-balance requests).
World-available, high-throughput applications need to consider many other aspects in order to be scalable (simply adding nodes to a Java EE cluster won't cut it), so I don't think any of these solutions is inherently better or worse for deployment to "the Cloud". In general terms, though, if you plan to deploy to elastic virtualization services and your requirements justify it, I do recommend separating the web and business tiers.
Security
In my opinion, the topics discussed don't have a direct impact on security.
Lastly, I'm sure that much more can be said about this topic, and my experience is limited, so please get more opinions ;).
I have this in mind:
On each server (they are all set up identically):
A free database like MySQL or PostgreSQL.
Tomcat 6.x for hosting Servlet based Java applications
Hibernate 3.x as the ORM tool
Spring 2.5 for the business layer
Wicket 1.3.2 for the presentation layer
I place a load balancer in front of the servers and a replacement load balancer in case my primary load balancer goes down.
I use Terracotta to have the session information replicated between the servers. If a server goes down the user should be able to continue their work at another server, ideally as if nothing happened.
What is left to "solve" (as I haven't actually tested this and, for example, do not know what I should use as a load balancer) is the database replication that is needed.
If a user interacts with the application and the database changes, that change must be replicated to the database servers on the other server machines. How should I go about doing that? Should I use MySQL, PostgreSQL, or something else (ideally free, as we have a limited budget)? Do the other things above sound sensible?
Clarification: I cluster to get high availability first and foremost, and I want to be able to add servers and use them all at the same time to get high scalability.
Since you're already using Terracotta, and you believe that a second DB is a good idea (agreed), you might consider expanding Terracotta's role. We have customers who use Terracotta for database replication. Here's a brief example/description, though I think they have stopped supporting clients for this product:
http://www.terracotta.org/web/display/orgsite/TCCS+Asynchronous+Data+Replication
You are trying to create multi-master replication, which is a very bad idea, as any change to any database has to be replicated to every other database. This is terribly slow: on one server you can get several hundred transactions per second using a couple of fast disks and RAID 1 or RAID 10, and considerably more with a good RAID controller with a battery-backed cache. If you add the overhead of communicating with all your servers, you'll get at most tens of transactions per second.
If you want high availability, you should go for a warm standby solution, where you have a server that is replicated but not used; when the main server dies, the replacement takes over. You can lose some recent transactions if your main server dies.
You can also go for one-master, multiple-slave asynchronous replication. Every change to the database has to be performed on the one master server, but you can have several read-only slave servers. Data on these slave servers can be several transactions behind the master, so you can also lose some recent transactions if the master dies.
PostgreSQL has both types of replication: warm standby using log shipping, and one master with multiple slaves using Slony.
Only if you have a very small number of writes should you go for synchronous replication. This can also be set up for PostgreSQL using Pgpool-II or Sequoia.
Please read the High Availability, Load Balancing, and Replication chapter in the PostgreSQL documentation for more.
For my (Perl-driven) website, I am using MySQL on two servers with database replication. Each MySQL server is slave and master at the same time. I did this for redundancy, not for performance, but the setup has worked fine for the past 3 years; we have had almost no downtime at all during this period.
Regarding Kent's question / comment: I am using the standard replication that comes with MySQL.
Regarding the failover mechanism: I am using DNSMadeEasy.com's failover functionality. I have a Perl script run every 5 minutes via cron that checks whether replication is still running (and also lots of other things, such as server load, HDD health, RAM usage, etc.). During normal operation, the faster of the two servers delivers all web pages. If the script detects that something is wrong with the server (or if the server is just plain down), DNSMadeEasy switches the DNS entries so that the secondary server becomes primary. Once the "real" primary server is back up, MySQL automatically catches up on the missing database changes and DNSMadeEasy automatically switches back.
Here's an idea: read Theo Schlossnagle's book Scalable Internet Architectures.
What you're proposing is not the best idea.
Load balancers are expensive and not as valuable as they would appear. Use something simpler for distributing the load between your servers (something like Wackamole).
Rather than fooling around with DB replication, spend your money on a reliable DB server separate from your front-end web servers. Do regular backups, and in the very unlikely event of DB failure, get back up and running as quickly as possible from ordinary backups.
AFAIK, MySQL does a better job of being scalable. See the documentation:
http://dev.mysql.com/doc/mysql-ha-scalability/en/ha-overview.html
And there is a blog where you can take a look at real-life examples:
http://highscalability.com/tags/mysql