Distributing Tiers Across VMs (Java)

Although this is a Java-centric question, it really applies to any system utilizing a multi-tier architecture.
In 3-tier architectures, you typically have 3 tiers:
A client/presentation tier where the client code lives; and
A middleware tier where business logic lives; and
A data/eis tier where the RDBMS and other data-heavy systems live
In Java land, for a web application, this might look like:
An application server, such as GlassFish running both the "web tier" (WARs comprising the client tier in a web app) as well as the "business tier" (EJBs, middleware, etc.); and
A RDBMS server embodying the data tier
In a virtualized/clustered environment, these applications (GlassFish, an RDBMS such as Oracle or PostgreSQL, etc.) will run on VMs.
My question: What are the standard ways of allocating/distributing this 3-tier architecture across these VMs? Meaning, any one of the following "strategies" might be viable, but not preferential:
One VM (let's say all VMs are Ubuntu Servers so cost/price doesn't factor into the equation) running both GlassFish and the RDBMS (all 3 tiers)
Two VMs: an application server VM running GlassFish, and a database server VM running, say, PostgreSQL
Three VMs: two app server VMs both running GlassFish, where one GlassFish instance runs only the WARs (web tier) and the second GlassFish instance runs the middleware/business logic; plus a third DB server VM
Obviously if all the servers (all tiers) were running on the same VM, they might run faster or more efficiently because they wouldn't be bogged down with network latency. But they'd all be on the same VM, which means I would need mega-hardware to support them. There might also be security concerns with this setup.
There will be pros/cons to each. I'm interested in what strategies would best accomplish the following goals: (1) maximizes throughput/speed, (2) is best suited for a clustered/cloud environment and (3) maximizes security.
Thanks in advance!

(1) maximizes throughput/speed,
This depends entirely on your application, e.g. the database could be your bottleneck, in which case what you do in the JVM doesn't matter so much.
(2) is best suited for a clustered/cloud environment
If you are going to distribute your system, you are most likely to want to distribute your presentation layer. This is because the work it does depends on the number of clients, and the work each client does is largely independent (in the presentation layer).
and (3) maximizes security.
Having more VMs doesn't guarantee improved security. Your JVM should be set up so the different applications running in it are fairly separate anyway. If you want to guard against denial-of-service attacks and your back-end services are used by other systems, you may want to separate them; otherwise, it doesn't make much difference.

I think the answer to your questions is highly dependent on the application's behavior, database usage, etc., and nothing can be said without looking at current performance metrics (as other answers already mentioned). I'll post some of my thoughts as guidelines.
Database
Most organizations run RDBMS in separate hosts, and some engineers would choose to never virtualize these, depending on what their DB vendor best practices say for their case (that said, I do normally consider VMs to be equivalent to physical hosts and I use them whenever possible).
Performance-wise, RDBMS often require kernel tuning or unconventional filesystem strategies, and having them in separate hosts can help. If the DB ever needs to be set up in high availability mode or in a cluster, having it separated from application servers can also make things easier. Note that database tuning, if you ever need to do it, can be a difficult topic if taken seriously, and it often involves stuff like aligning partitions in the disk, trying to reduce disk head movement by cleverly allocating database data segments/files, considering DBMS and OS cache sizes and strategies... All this can impact other applications running in the same host so I'd rather leave the DB alone.
In addition, RDBMS often serve several applications (there are good reasons for this: sometimes some integrations require access to more than one application database). Having them separated from application servers helps.
Also, DB systems have their own upgrade, backup, distribution/clustering and administration procedures, and are often maintained by different people than the application servers. Thus the whole database administration topic is easier to deal with if you consider it separately. And if the database becomes a bottleneck, you can work on the database alone without considering whether the other tiers are impacting performance.
I do recommend keeping the RDBMS alone on a single host for reasonably-sized production environments. But of course, if you don't have performance, administration, or availability requirements, you can consider using a shared server for everything.
Glassfish
In general terms, when you want to deploy a Java EE application to several servers (for load balancing or high availability), you install the same application server and artifacts on all of the application servers in the cluster. Then you can choose which artifacts to enable on each node of the cluster. Some application servers can enable or disable components depending on the server load. In this case, your application server is the "unit" that you need to distribute.
Now, there may be cases where you or your organization prefer to have completely separated network layers for the web tier and the business tier (e.g., for security concerns). In this case, you would use separate hosts. If your web tier is really heavy and you find that you need to scale it separately from the business tier (i.e. you find you need 6 web servers but can make do with 1 or 2 EJB containers), I'd separate these two tiers too.
As a note: there are slight benefits in running the web and EJB tiers in the same Glassfish instance: as they share the JVM, calls between the web tier and the business tier can use call-by-reference semantics. Depending on your work load and the size and serialization cost of responses, this can result in a noticeable performance increase.
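To illustrate the call-by-reference point, here is a minimal sketch (EJB 3.x style; all names are hypothetical): the same bean exposed through a local interface, where same-JVM calls pass references, and a remote interface, where cross-JVM calls serialize arguments and results.

```java
import java.io.Serializable;

import javax.ejb.Local;
import javax.ejb.Remote;
import javax.ejb.Stateless;

// Must be Serializable so it can cross JVM boundaries via the remote view.
class OrderSummary implements Serializable {
    final long orderId;
    OrderSummary(long orderId) { this.orderId = orderId; }
}

@Local
interface OrderServiceLocal {
    // Web tier in the same GlassFish instance: call-by-reference, no copying.
    OrderSummary summarize(long orderId);
}

@Remote
interface OrderServiceRemote {
    // Web tier on a separate host: arguments and results are serialized.
    OrderSummary summarize(long orderId);
}

@Stateless
public class OrderServiceBean implements OrderServiceLocal, OrderServiceRemote {
    public OrderSummary summarize(long orderId) {
        // ... business logic would go here ...
        return new OrderSummary(orderId);
    }
}
```

With large response objects, the per-call serialization on the remote view is exactly the cost referred to above.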
In most cases, for many corporate applications, I'd use just one or two servers (depending on whether you need high availability) containing both layers, because even if load rises you can still grow vertically (increase server power or VM resources) or horizontally (add another server and load-balance requests).
World-available, high-throughput applications need to consider many other aspects in order to be scalable (simply adding nodes to a Java EE cluster won't cut it), so I don't think any of these solutions is better or worse for deploying to "the Cloud"; but in general terms, if you plan to deploy to elastic virtualization services and your requirements justify it, I do recommend separating the web and business tiers.
Security
In my opinion, the topics discussed don't have a direct impact on security.
Lastly, I'm sure that much more can be said about this topic, and my experience is limited, so please get more opinions ;).

Related

Running portlets on Liferay on 1Gig Server - Performance Issue

We have a couple of custom portlet applications running inside Liferay Portal.
The solution is installed on the client's computers, which are entry-level (RAM <= 1 GB). Due to red tape, it is rather unlikely that the client will switch to higher-end computers in the short term.
The issue is that the applications are very slow.
What are some hints for optimizing the Liferay configuration (or the portlet applications) so we are able to run decently on entry-level computers?
Or is it a good move to switch the portlets to lighter portlet container alternatives such as Apache Pluto or GateIn?
Or is running a portal like Liferay on entry-level computers simply not an option? Should we consider porting the existing portlets to separate standard Java web applications to achieve better performance?
Compare the price of tuning, minimizing the footprint and measuring the result to the price of just 1 more Gigabyte of RAM - which you might not even be able to purchase in this size any more.
Then compare the price for porting from a portal environment into Java Web Applications: You can't even be sure that this will result in a lower footprint, as you'll have to redo quite a bit of functionality that Liferay provides out of the box. Identity Management for example. Content Management as another one. This will take time (equaling money) that might be better spent with just a new server.
For ~40€/month you can get a hosted server, including network connectivity, power and even support, that is way more capable of serving an application like this than a server the size of a Raspberry Pi (<40€ total, I've seen Raspberry Pi hosting for less than 40€ per year).
I don't know what you mean by "red tape", but I'd say you're definitely going for the wrong target. While there is a point to tuning Liferay, I'd not go for this kind of optimization.
You're not mentioning the version you're using - with that hardware I'm assuming it's an ancient version. Before the current version, Liferay was largely monolithic; while you can configure quite a bit (caches, deactivating some functionality), that won't bring drastic advantages. The current version has been modularized and you can remove components that you don't use, lowering the footprint - however, it hasn't been built for that size of infrastructure.
And when you're running the portal on that kind of hardware, you're not running the database and an extra webserver on the same box as well, right? This would be the first thing to change: Minimize everything that's running outside of Liferay on the same OS/Box.

Decision to go for distributed application?

I have a legacy product in the financial domain, running on Tomcat 6. We get millions of requests - about 10k requests per hour. I am wondering, at a high level,
whether I should go for a distributed application where my MVC components are on one system and the service/DAO layer on another box (using Spring Remoting or EJB).
The reason I am planning to go in this direction is to distribute the load and get better performance; it would also make the system more scalable.
I only see the positive side of it, but somehow I am not able to figure out what the negative aspects might be. If some expert can help:
What are the criteria I should consider for going to a distributed model, and what are its pros and cons? I also tried Googling for stats
on how much load a given web server (Tomcat in my case) can handle efficiently on given hardware (16 GB RAM, Windows 7, etc.).
Yes, I am going to do a POC where I will measure performance with and without the distributed model, but some high-level input would be highly appreciated.
It is impossible to answer this question without more details - how long does it take to reply to one request on the current server? How many resources are allocated for one request?
Having 10k requests per hour means ~3 requests per second. If performing the necessary operations and replying to a request takes ~300 ms on 1 CPU, one simple machine is totally fine. This is simple math and doesn't always hold; I'd guess you still have peaks within those 10k requests per hour and they aren't evenly distributed.
If we assume one reply can take up to 1 second, then you can handle as many replies per second as your system has CPUs (given that the CPU is the bottleneck). If the CPU isn't the bottleneck for your application server, there's probably something wrong. You should set up the database(s) on a different machine and only perform computation tasks on the application server machine.
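To make the back-of-the-envelope math concrete, here is a tiny sketch of the estimate using Little's law (requests in flight ≈ arrival rate × service time); the numbers are the assumptions from this answer, not measurements:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        double requestsPerHour = 10000.0;              // stated load
        double arrivalRate = requestsPerHour / 3600.0; // ~2.8 requests/second
        double serviceTimeSec = 0.3;                   // assumed ~300 ms per request

        // Little's law: average requests in flight = arrival rate * service time
        double inFlight = arrivalRate * serviceTimeSec; // ~0.8, i.e. less than one busy CPU

        System.out.printf("arrival rate: %.2f req/s, avg in flight: %.2f%n",
                arrivalRate, inFlight);
        // Caveat from above: traffic peaks aren't evenly distributed, so size
        // for the peak rate (often several times the hourly average), not the average.
    }
}
```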
Especially in the financial sector with legacy software, I wouldn't try splitting a running product. How old is the current server? I believe that a new server should be cheaper than rewriting the application. Unless you expect 50-100k requests per hour very soon, I don't think splitting up such small parts makes sense.
Instead, run it on up-to-date server hardware, split the application server and data storage, and you should be fine.
I am wondering at a high level if I should go for a distributed application where my MVC components are on one system and the service/DAO layer on another box (using Spring Remoting or EJB).
I'm not sure what you mean by "system" in this context, but if it means that you are planning to run your application on two servers,
one dedicated to the presentation layer and the other dedicated to the business layer, keep in mind that a simpler approach (and probably one more suitable for your app)
is to build a co-located architecture.
Basically, the idea is to replicate your app on several servers (at least two) and put a load balancer in front of them that routes incoming requests among the available servers.
All servers share the same database instance. This will give you horizontal scalability and will also improve the availability of your system.
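For illustration only (a real deployment would use something like Apache mod_proxy_balancer or HAProxy rather than hand-rolled code), a toy round-robin router shows the routing idea behind this setup; the server addresses are placeholders:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch of what the load balancer does: each request goes to the next
// app server in the pool. All servers are identical replicas of the app
// (web + service/DAO layers together) sharing one database.
class RoundRobinRouter {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinRouter(List<String> servers) {
        this.servers = servers;
    }

    String route() {
        // Modulo rotation; Math.abs guards against int overflow going negative.
        int i = Math.abs(next.getAndIncrement() % servers.size());
        return servers.get(i);
    }

    public static void main(String[] args) {
        RoundRobinRouter router = new RoundRobinRouter(
                Arrays.asList("app1.example.com:8080", "app2.example.com:8080"));
        for (int r = 0; r < 4; r++) {
            System.out.println("request " + r + " -> " + router.route());
        }
    }
}
```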
I only see the positive side of it, but somehow I am not able to figure out what the negative aspects might be.
Distributing your business logic will probably involve refactoring your application code, and even if the system is working well now, you will certainly introduce some bugs.
The necessary remote calls will add latency, and the fact that you execute your business logic on several servers doesn't resolve performance problems in the presentation tier.
In Expert One-on-One J2EE Development Without EJB (page 65), you can find a good discussion of why not to distribute your business logic.

How to make my Java application scalable and fault tolerant?

In a simplified manner my Java application can be described as follows:
It is a web application running on a Tomcat server with a SOAP interface. The application uses JPA/Hibernate to store data in a MySQL database. The data stored consists of list of users, a list of hosts, and a list of URIs pointing to huge files (10GB) in the filesystem.
The whole system consists of a central server, where my application is running on, and a bunch of worker hosts. A user can connect to the SOAP interface and ask the system to copy the files that belong to him to a specific worker host, where he then can analyze the data in some way (We cannot use NFS, we need to copy the data to the local disc storage of a worker host). The database then stores for each user on which worker host his files are stored.
At the moment the system is running with one central server hosting the Tomcat application and the MySQL database, 10 worker hosts, and about 30 users who have around 100 files (on average 10 GB in size) stored distributed over the worker hosts.
But in the future I will have to scale the system by a factor of 100-1000, so I might have to deal with 10,000 users, 100,000 files and 10,000 hosts. The system should also become fault tolerant, so that I don't have a single central server (which is the single point of failure in the system now), but maybe several. Also, if one of the worker hosts fails, the system should be notified so it doesn't try to copy files to that server.
My question now is: which Java technologies could I use to make my application scalable and fault tolerant? What kind of architecture would you recommend? Should I still have one huge database storing all the information about all files, hosts and users in the system in one place, or would it be better to distribute my database across several hosts and synchronize them somehow?
The technology you need is called Architecture.
No matter which technology you use, you need to have a well-architected system for scalability and redundancy. Make a diagram of the entire architecture of the system as it currently works. Mark each component with its limitations for users, jobs, bandwidth, hard drive space, memory, or whatever parts are limiting for your application. This will give you the baseline design.
Now draw that same diagram as it would need to be to meet your scalability and redundancy requirements. You might have to break apart pieces to make it work, or develop entirely new pieces. This diagram will make it very clear what you need.
One specific thing I want to address is the database. If you can split the database along logical lines so that you do not join any queries from one to another, then you should have separate databases. Beyond that, the best configuration for a database is to have each database on one fast machine with lots of storage and very fast access times. If you do this, the only things that will slow down your database are bad queries or poorly-indexed tables. In my experience, synchronizing databases is to be avoided unless you have one master database that has write access and it replicates to other databases which are read-only. Regardless, this can be a last step after you've profiled all of your queries and you literally need additional hardware.
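If you do end up with a writable master and read-only replicas, a common pattern on the Java side is routing reads and writes to different DataSources. A minimal sketch using Spring's AbstractRoutingDataSource (the lookup keys and wiring are illustrative assumptions, not part of the question):

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Routes JDBC work to the master or a read replica based on a thread-local
// flag that the service layer sets before read-only operations.
public class ReplicaAwareRoutingDataSource extends AbstractRoutingDataSource {

    private static final ThreadLocal<Boolean> READ_ONLY =
            new ThreadLocal<Boolean>() {
                @Override
                protected Boolean initialValue() {
                    return Boolean.FALSE;
                }
            };

    public static void markReadOnly(boolean readOnly) {
        READ_ONLY.set(readOnly);
    }

    @Override
    protected Object determineCurrentLookupKey() {
        // Keys must match the targetDataSources map configured in Spring,
        // e.g. {"master" -> masterDataSource, "replica" -> replicaDataSource}.
        return READ_ONLY.get() ? "replica" : "master";
    }
}
```

A service method (or an AOP interceptor around read-only transactions) would call markReadOnly(true) before queries and reset the flag afterwards.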

Scalability of a single server for running a Java Web application

I want to gain more insight regarding the scale of workload a single-server Java web application deployed to a single Tomcat instance can handle. In particular, let's pretend that I am developing a Wiki application that has a usage pattern similar to Wikipedia's. How many simultaneous requests can my server handle reliably before running out of memory or showing signs of excess stress, if I deploy it on a machine with the following configuration:
4-Core high-end Intel Xeon CPU
8GB RAM
2 HDDs in RAID-1 (No SSDs, no PCIe based Solid State storages)
RedHat or Centos Linux (64-bit)
Java 6 (64-bit)
MySQL 5.1 / InnoDB
Also let's assume that the MySQL DB is installed on the same machine as Tomcat and that all the Wiki data are stored inside the DB. Furthermore, let's pretend that the Java application is built on top of the following stack:
SpringMVC for the front-end
Hibernate/JPA for persistence
Spring for DI and Security, etc.
If you haven't used the exact configuration but have experience in evaluating the scalability of a similar architecture, I would be very interested in hearing about that as well.
Thanks in advance.
EDIT: I think I have not articulated my question properly. I'll mark the answer with the most upvotes as the best answer, and I'll rewrite my question in the community wiki area. In short, I just want to learn about your experiences with the scale of workload your Java application has been able to handle on one physical server, as well as some description of the type and architecture of the application itself.
You will need to use a group of tools:
Load testing tool - JMeter can be used.
Monitoring tool - this will be used to monitor the load on various resources. There are lots of paid as well as free ones: JProfiler, VisualVM, etc.
Collection and reporting tool. (I have not used any particular tool here.)
With the above tools you can find the optimal value. I would approach it in the following way:
Get to know what the ratio of pages being accessed should be, and what the background processes and their frequencies are.
Configure JMeter accordingly (for those ratios), monitor performance under the applied load (time to serve a page can be measured in JMeter), and monitor other resources using the monitoring tool. Also check the error ratio. (Note: you need to decide what error ratio is unacceptable.)
Keep increasing the load step by step, and keep recording the various numbers of interest until the server fails completely.
You can then decide on the optimal value based on several criteria: low error rate, maximum serving time, etc.
JMeter supports a lot of ways to apply load.
To be honest, it's almost impossible to say. There are probably about 3 ways (off the top of my head) to build such a system, and each would have fairly different performance characteristics. Your best bet is to build and test.
First, try to get some idea of the volumes you expect and the latency constraints you'll need to meet.
Come up with a basic architecture and implement a thin slice end to end through the system (ideally the most common use case). Use a load testing tool (like Grinder or Apache JMeter) to inject load and start measuring performance. If the performance is acceptable - be conservative, since your simple implementation will likely include less functionality and be faster than the full system - continue building the system and testing to make sure you don't introduce a major performance bottleneck. If not, come up with a different design.
If your code is reasonable, the bottleneck will likely be the database, somewhere in the region of hundreds of DB ops per second. If that is insufficient, then you may need to think about caching, as in the sketch below.
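A minimal sketch of what such caching could look like (a hand-rolled read-through cache over a hypothetical page loader; in practice you would more likely enable Hibernate's second-level cache or use something like Ehcache or memcached):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Minimal read-through cache: look in memory first, fall back to the
// database on a miss. No eviction or TTL; a real cache needs both.
class PageCache {
    private final ConcurrentMap<String, String> cache =
            new ConcurrentHashMap<String, String>();
    private final PageLoader loader; // hypothetical DB-backed loader

    PageCache(PageLoader loader) {
        this.loader = loader;
    }

    String get(String pageId) {
        String page = cache.get(pageId);
        if (page == null) {
            page = loader.load(pageId);          // hits the database
            String prev = cache.putIfAbsent(pageId, page);
            if (prev != null) {
                page = prev;                     // another thread won the race
            }
        }
        return page;
    }

    interface PageLoader {
        String load(String pageId);
    }
}
```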
Definitely take a look at Spring Insight for performance monitoring and analysis.
English Wikipedia has about 14 GB of data. An 8 GB memory cache would have a very high hit ratio, and I think hard disk reads would be well within capacity. Therefore, the app is most likely network bound.
English Wikipedia gets about 3000 page views per second. It is possible that Tomcat could handle that load with careful tuning, and that the network has enough throughput to serve the traffic (at, say, 50 KB per page, 3000 pages per second is roughly 150 MB/s, i.e. more than 1 Gbit/s).
So could the entire Wikipedia site be hosted on one moderate machine? Probably not. Just an idea.
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm
http://stats.wikimedia.org/EN/TablesPageViewsMonthly.htm
Tomcat by itself doesn't spread an application over multiple machines. If you really are concerned about scalability, you must consider what to do when your application outgrows a single machine.

Database replication for redundancy using a free database and a Java with Spring & Hibernate web application

I have this in mind:
On each server (they are all set up identically):
A free database like MySQL or PostgreSQL.
Tomcat 6.x for hosting Servlet-based Java applications
Hibernate 3.x as the ORM tool
Spring 2.5 for the business layer
Wicket 1.3.2 for the presentation layer
I place a load balancer in front of the servers and a replacement load balancer in case my primary load balancer goes down.
I use Terracotta to have the session information replicated between the servers. If a server goes down the user should be able to continue their work at another server, ideally as if nothing happened.
What is left to "solve" (as I haven't actually tested this and, for example, do not know what I should use as a load balancer) is the database replication that is needed.
If a user interacts with the application and the database changes, then that change must be replicated to the database servers on the other server machines. How should I go about doing that? Should I use MySQL, PostgreSQL, or something else (which ideally is free, as we have a limited budget)? Do the other things above sound sensible?
Clarification: I cluster to get high availability first and foremost, and I want to be able to add servers and use them all at the same time to get high scalability.
Since you're already using Terracotta, and you believe that a second DB is a good idea (agreed), you might consider expanding Terracotta's role. We have customers who use Terracotta for database replication. Here's a brief example/description, though I think they have stopped supporting clients for this product:
http://www.terracotta.org/web/display/orgsite/TCCS+Asynchronous+Data+Replication
You are trying to create multi-master replication, which is a very bad idea, as any change to any database has to be replicated to every other database. This is terribly slow - on one server you can get several hundred transactions per second using a couple of fast disks and RAID 1 or RAID 10, and much more if you have a good RAID controller with a battery-backed cache. If you add the overhead of communicating with all your servers, you'll get at most tens of transactions per second.
If you want high availability, you should go for a warm standby solution, where you have a server that is replicated but not used - when the main server dies, the replacement takes over. You can lose some recent transactions if your main server dies.
You can also go for one-master, multiple-slave asynchronous replication. Every change to the database has to be performed on the one master server, but you can have several read-only slave servers. Data on these slave servers can be several transactions behind the master, so you can also lose some recent transactions if the master dies.
PostgreSQL has both types of replication - warm standby using log shipping, and one-master, multiple-slaves using Slony.
Only if you have a very small number of writes can you go for synchronous replication. This can also be set up for PostgreSQL using pgpool-II or Sequoia.
Please read the High Availability, Load Balancing, and Replication chapter in the Postgres documentation for more.
For my (Perl-driven) website, I am using MySQL on two servers with database replication. Each MySQL server is slave and master at the same time. I did this for redundancy, not for performance, but the setup has worked fine for the past 3 years; we have had almost no downtime at all during this period.
Regarding Kent's question / comment: I am using the standard replication that comes with MySQL.
Regarding the failover mechanism: I am using DNSMadeEasy.com's failover functionality. I have a Perl script, run every 5 minutes via cron, that checks whether replication is still running (and also lots of other things such as server load, HDD sanity, RAM usage, etc.). During normal operation, the faster of the two servers delivers all web pages. If the script detects that something is wrong with the server (or if the server is just plain down), DNSMadeEasy switches the DNS entries so that the secondary server becomes primary. Once the "real" primary server is back up, MySQL automatically catches up on missing database changes and DNSMadeEasy automatically switches back.
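The original check is a Perl script, but the replication part translates to a few lines of JDBC against MySQL's SHOW SLAVE STATUS. A rough Java sketch of the same check (connection details are placeholders, and this assumes the classic MySQL replication commands of that era):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Checks whether MySQL replication is running on a slave by inspecting the
// Slave_IO_Running / Slave_SQL_Running columns of SHOW SLAVE STATUS.
public class ReplicationCheck {
    public static boolean isReplicationHealthy(String jdbcUrl,
                                               String user,
                                               String pass) throws SQLException {
        Connection conn = DriverManager.getConnection(jdbcUrl, user, pass);
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SHOW SLAVE STATUS");
            if (!rs.next()) {
                return false; // not configured as a replication slave at all
            }
            return "Yes".equals(rs.getString("Slave_IO_Running"))
                && "Yes".equals(rs.getString("Slave_SQL_Running"));
        } finally {
            conn.close(); // also closes the statement and result set
        }
    }
}
```

A cron-driven monitor would call this (plus the load/disk/RAM checks) and trigger the DNS failover when it returns false.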
Here's an idea: read Theo Schlossnagle's book Scalable Internet Architectures.
What you're proposing is not the best idea.
Load balancers are expensive and not as valuable as they would appear. Use something simpler for distributing the load between your servers (something like Wackamole).
Rather than fooling around with DB replication, spend your money on a reliable DB server separate from your front-end web servers. Do regular backups, and in the very unlikely event of DB failure, get back up and running as quickly as possible from ordinary backups.
AFAIK, MySQL does a better job of being scalable. See the documentation:
http://dev.mysql.com/doc/mysql-ha-scalability/en/ha-overview.html
And there is a blog where you can look at real-life examples:
http://highscalability.com/tags/mysql
