building a high scale java app, what stack would you use?

building a high scale java app, what stack would you use? - java

if you needed to build a highly scalable web application using java, what framework would you use and why?
I'm just reading thinking-in-java, head first servlets and manning's spring framework book, but really I want to focus on highly scalable architectures etc.
would you use tomcat, hibernate, ehcache?
(just assume you have to design for scale, not looking for the 'worry about it when you get traffic type responses)

The answer depends on what we mean by "scalable". A lot depends on your application, not on the framework you choose to implement it with.
No matter what framework you choose, the fact is that the hardware you deploy it on will have an upper limit on the number of simultaneous requests it'll be able to handle. If you want to handle more traffic, you'll have to throw more hardware at it and include load balancing, etc.
The part that's pertinent in that case has to do with shared state. If you have a lot of shared state, you'll have to make sure that it's thread safe, "sticky" when it needs to be, replicated throughout a cluster, etc. All that has to do with the app server you deploy it to and the way you design your app, not the framework.
Tomcat's not a "framework", it's a servlet/JSP engine. It's got clustering capabilities, but so do most other Java EE app servers. You can use Tomcat if you've already chosen Spring, because it implies that you don't have EJBs. Jetty, Resin, WebLogic, JBOSS, Glassfish - any of them will do.
Spring is a good choice if you already know it well. I think following the Spring idiom will make it more likely that your app is layered and architecturally sound, but that's not the deciding factor when it comes to scalability.
Hibernate will make your development life easier, but the scalability of your database depends a great deal on the schema, indexes, etc. Hibernate isn't a guarantee.
"Scalable" is one of those catch-all terms (like "lightweight") that is easy to toss off but encompasses many considerations. I'm not sure that a simple choice of framework will solve the issue once and for all.

I would check out Apache Mina. From the home page:
Apache MINA is a network application
framework which helps users develop
high performance and high scalability
network applications easily. It
provides an abstract · event-driven ·
asynchronous API over various
transports such as TCP/IP and UDP/IP
via Java NIO.
It has an HTTP engine AsyncWeb built on top of it.
A less radical suggestion (!) is Jetty - a servlet container geared towards performance and a small footprint.

The two keywords I would mainly focus on are Asynchronous and Stateless. Or at least "as stateless as possible: Of course you need state but maybe, instead of going for a full fledged RDBMS, have a look at document centered datastores.
Have a look at AKKA concerning async and CouchDB or MongoDB as datastores...

Frameworks are more geared towards speeding up development, not performance. There will be some overhead with any framework because of use cases it handles that you don't need. Granted, the overhead may be low, and most frameworks will point you towards patterns that have been proven to scale, but those patterns can be used without the framework as well.
So I would design your architecture assuming 'bare metal', i.e. pure servlets (yes, you could go even lower level, but I'm assuming you don't want to write your own http socket layer), straight JDBC, etc. Then go back and figure out which frameworks best fit your architecture, speed up your development, and don't add too much overhead. Tomcat versus other containers, Hibernate versus other ORMs, Struts versus other web frameworks - none of that matters if you make the wrong decisions about the key performance bottlenecks.
However, a better approach might be to choose a framework that optimizes for development time and then find the bottlenecks and address those as they occur. Otherwise, you could spin your wheels optimizing prematurely for cases that never occur. But that probably falls in the category of 'worry about it when you get traffic'.

All popular modern frameworks (and "stacks") are well-written and don't pose any threat to performance and scaling, if used correctly. So focus on what stack will be best for your requirements, rather than starting with the scalability upfront.
If you have a particular requirement, then you can ask a question about it and get recommendations about what's best for handling it.

There is no framework that is magically going to make your web service scalable.
The key to scalability is replicating the functionality that is (or would otherwise be) a bottleneck. If you are serious about making your service, you need to start with a good understanding of the characteristics of your application, and hence an idea of where the bottlenecks are likely to be:
Is it a read-only service or do user requests cause primary data to change?
Do you have / need sessions, or is the system RESTful?
Are the requests normal HTTP requests with HTML responses, or are you doing AJAX or callbacks or something.
Are user requests computation intensive, I/O intensive, rendering intensive?
How big/complicated is your backend database?
What are the availability requirements?
Then you need to decide how scalable you want it to be. Do you need to support hundreds, thousands, millions of simultaneous users? (Different degrees of scalability require different architectures, and different implementation approaches.)
Once you have figured these things out, then you decide whether there is an existing framework that can cope with the level traffic that you need to support. If not, you need to design your own system architecture to be scalable in the problem areas.

If you are able to work with a commercial system, then I'd suggest taking a look at Jazz Foundation at http://jazz.net. It's the base for IBM Rational's new products. The project is led by the guys that developed Eclipse within IBM before it was open-sourced. It has pluggable DB layer as well as supporting multiple App Servers. It's designed to handle clustering and multi-site type deployments. It has nice capabilities like OAuth support and License management.

In addition to the above:
Take a good look at JMS (Java Message Service). This is a much under rated technology. There are vendor solutions such as TibCo EMS, Oracle etc. But there are also free stacks such as Active MQ.
JMS will allow you to build synch and asynch solutions using queues. You can choose to have persistent or non-persistent queues.

As others already have replied scalability isn't about what framework you use. Sure it is nice to squeeze out as much performance as possible from each node, but what you ideally want is that by adding another node you scale your app in a linear fashion.
The application should be architected in distinct layers so it is possible to add more power to different layers of the application without a rewrite and also to add different layered caching. Caching is key to archive speed.
One example of layers for a big webapp:
Load balancers (TCP level)
Caching reverse proxies
CDN for static content
Front end webservers
Appservers (business logic of the app)
Persistent storage (RDBMS, key/value, document)

Related

Choosing Java Restful framework for heavily loaded server

I'm trying to figure out what Java Restful framework is the best suitable fom heavily loaded identity manager server.
Did someone run load tests for Restful frameworks and is willing to share the conclusion?
Thanks a lot!

Great question! You'll probably find that the framework choice is not your primary determiner of performance/scalability. We've used Restlet, based on a very strong recommendation from a former colleague who used it to develop Overstock.com (a very large e-commerce site). It has good performance, and it works fine for Overstock.com. But we didn't do any head to head comparisons.
One of the big drivers for REST is its scalability, a quality of a distributed system whereby you can accomodate an increase in usage with a proportional increase in system size and cost. Caching is a key technique to achieve scalability. So if you allow your representations to be cached, much of the load is actually not borne by the identity management system but by web caches downstream. This is independent of the REST framework.
Your backend database technology is likely another primary factor in system performance and scaling. Tuning the database system and optimizing queries may pay off here. Also consider whether adding a database cache layer makes sense (eg, OpenSymphony).
We found that serialization costs were quite significant for us. Overall request rates were best if we used the Kryo or Smile binary serializations. If you need a textual serialization, we found that the Jackson JSON serializer was much faster then the XStream XML serializer, doubling the overall request rate. This might be an area to consider.
So if you haven't done so, examine your system from a scaling perspective. See http://www.highscalability.com, Richardson and Ruby's Restful Web Services (O'Reilly), Cal Henderson's Building Scalable Web Sites, and Theo Schlossnagle's Scalable Internet Architectures for a start.

Feedback on different backends for GWT

I have to re-design an existing application which uses Pylons (Python) on the backend and GWT on the frontend.
In the course of this re-design I can also change the backend system.
I tried to read up on the advantages and disadvantages of various backend systems (Java, Python, etc) but I would be thankful for some feedback from the community.
Existing application:
The existing application was developed with GWT 1.5 (runs now on 2.1) and is a multi-host-page setup.
The Pylons MVC framework defines a set of controllers/host pages in which GWT widgets are embedded ("classical website").
Data is stored in a MySQL database and accessed by the backend with SQLAlchemy/Elixir. Server/client communication is done with RequestBuilder (JSON).
The application is not a typical business like application with complex CRUD functionality (transactions, locking, etc) or sophisticated permission system (tough a simple ACL is required).
The application is used for visualization (charts, tables) of scientific data. The client interface is primarily used to display data in read-only mode. There might be some CRUD functionality but it's not the main aspect of the app.
Only a subset of the scientific data is going to be transfered to the client interface but this subset is generated out of large datasets.
The existing backend uses numpy/scipy to read data from db/files, create matrices and filter them.
The numbers of users accessing or using the app is relatively small, but the burden on the backend for each user/request is pretty high because it has to read and filter large datasets.
Requirements for the new system:
I want to move away from the multi-host-page setup to the MVP architecture (one single host page).
So the backend only serves one host page and acts as data source for AJAX calls.
Data will be still stored in a relational database (PostgreSQL instead of MySQL).
There will be a simple ACL (defines who can see what kind of data) and maybe some CRUD functionality (but it's not a priority).
The size of the datasets is going to increase, so the burden on the backend is probably going to be higher. There won't be many concurrent requests but the few ones have to be handled by the backend quickly. Hardware (RAM and CPU) for the backend server is not an issue.
Possible backend solutions:
Python (SQLAlchemy, Pylons or Django):
Advantages:
Rapid prototyping.
Re-Use of parts of the existing application
Numpy/Scipy for handling large datasets.
Disadvantages:
Weakly typed language -> debugging can be painful
Server/Client communication (JSON parsing or using 3rd party libraries).
Python GIL -> scaling with concurrent requests ?
Server language (python) <> client language (java)
Java (Hibernate/JPA, Spring, etc)
Advantages:
One language for both client and server (Java)
"Easier" to debug.
Server/Client communication (RequestFactory, RPC) easer to implement.
Performance, multi-threading, etc
Object graph can be transfered (RequestFactory).
CRUD "easy" to implement
Multitear architecture (features)
Disadvantages:
Multitear architecture (complexity,requires a lot of configuration)
Handling of arrays/matrices (not sure if there is a pendant to numpy/scipy in java).
Not all features of the Java web application layers/frameworks used (overkill?).
I didn't mention any other backend systems (RoR, etc) because I think these two systems are the most viable ones for my use case.
To be honest I am not new to Java but relatively new to Java web application frameworks. I know my way around Pylons though in the new setup not much of the Pylons features (MVC, templates) will be used because it probably only serves as AJAX backend.
If I go with a Java backend I have to decide whether to do a RESTful service (and clearly separate client from server) or use RequestFactory (tighter coupling). There is no specific requirement for "RESTfulness". In case of a Python backend I would probably go with a RESTful backend (as I have to take care of client/server communication anyways).
Although mainly scientific data is going to be displayed (not part of any Domain Object Graph) also related metadata is going to be displayed on the client (this would favor RequestFactory).
In case of python I can re-use code which was used for loading and filtering of the scientific data.
In case of Java I would have to re-implement this part.
Both backend-systems have its advantages and disadvantages.
I would be thankful for any further feedback.
Maybe somebody has experience with both backend and/or with that use case.
thanks in advance

We had the same dilemma in the past.
I was involved in designing and building a system that had a GWT frontend and Java (Spring, Hibernate) backend. Some of our other (related) systems were built in Python and Ruby, so the expertise was there, and a question just like yours came up.
We decided on Java mainly so we could use a single language for the entire stack. Since the same people worked on both the client and server side, working in a single language reduced the need to context-switch when moving from client to server code (e.g. when debugging). In hindsight I feel that we were proven right and that that was a good decision.
We used RPC, which as you mentioned yourself definitely eased the implementation of c/s communication. I can't say that I liked it much though. REST + JSON feels more right, and at the very least creates better decoupling between server and client. I guess you'll have to decide based on whether you expect you might need to re-implement either client or server independently in the future. If that's unlikely, I'd go with the KISS principle and thus with RPC which keeps it simple in this specific case.
Regarding the disadvantages for Java that you mention, I tend to agree on the principle (I prefer RoR myself), but not on the details. The multitier and configuration architecture isn't really a problem IMO - Spring and Hibernate are simple enough nowadays. IMO the advantage of using Java across client and server in this project trumps the relative ease of using python, plus you'll be introducing complexities in the interface (i.e. by doing REST vs the native RPC).
I can't comment on Numpy/Scipy and any Java alternatives. I've no experience there.

Is CQRS a good approach for implementing a social application on Google App Engine?

It seems to me that the CQRS (Command and Query Responsibility Segregation) approach might be suitable for implementing a robust and responsive social application server on GAE, because:
CQRS doesn't require a SQL database (which GAE doesn't provide)
It does require a database capable of holding serialised objects, which GAE does in fact provide
It requires event queues, which GAE also provides
It supports a non-blocking, asynchronous, message based architecture, which neatly works round GAE's limitations on long-running transactions
It is advertised as being highly scalable, which is after all why optimists choose GAE
The trouble is, I'm a rusty Java programmer with little experience relevant to this choice, and I would very much appreciate any comments from anyone who has used the two together, or at least investigated using one from the experience of using the other.
I think that my main questions are:
Is CQRS over-complex for the early stages of a new application?
Are there any booby-traps that would make them a poor match, such as such as GAE's Datastore perhaps not being a good match to CQRS requirements?
Can anyone recommend either Axon or Jdon as being particularly suitable (or unsuitable) for GAE?
What other questions should I be asking?

CQRS is not overly complex or difficult, but it does take time to adjust your thinking away from the traditional request/response and client/server interactions that have been pounded into our heads over the years.
In CQRS with event sourcing, the data store is insignificant because you don't require a lot from your storage engine--the NEventStore project (written in C#) can easily support 40-50 different types of storage engines without much difficulty.
Both pure Amazon Web Services and Google App Engine are both great platforms for a CQRS application because they guides you to all of the correct infrastructure choices--asynchronous, non-blocking communication using messaging.
I've never heard of Jdon, but Axon has been around for a while. Try not to lean too heavily on the framework. As your understanding of CQRS deepens, this will become more apparent--basically it's like trying to avoid the use of Hibernate everywhere in your code. You should only use the Axon (or whichever you choose) exactly where it should be used and no more.
Some of the better questions that you might ask surround where to go for help and what resources are already available to help guide your understanding of CQRS. There are a number of good blogs and websites--including cqrsinfo.com--which can help you get started. Also, Greg Young's six-hour video is a must if you're going to start with CQRS.

Key factors for designing scalable web based application

Currently I am working on web based application. I want to know what are the key factors a designer should take care while designing scalable web based application ?

That's a fairly vague and broad question and something you could write books about. How far do you take it? At some point the performance of SQL JOINs breaks down and you have to implement some sharding/partitioning strategy. Is that the level you mean?
General principles are:
Cache and version all static content (images, CSS, Javascript);
Put such content on another domain to stop needless cookie traffic;
GZip/deflate everything;
Only execute required Javascript;
Never do with Javascript what you can do on the serverside (eg style table rows with CSS rather than using fancy jQuery odd/even tricks, which can be a real time killer);
Keep external HTTP requests to a minimum. That means very few CSS, Javascript and image files. That may mean implementing some form of CSS spriting and/or combining CSS or JS files;
Use serverside caching where necessary but only after you find there's a problem. Memory is an expensive but often effective tradeoff for more performance;
Test and tune all database queries;
Minimize redirects.

Having a good read of highscalability.com should give you some ideas. I highly recommend the Amazon articles.

Every application is different. You'll have to profile your application to see where you should concentrate your optimization efforts. Some web applications might require database access optimizations, while others have complicated business logic that cause the bottleneck.
Don't attempt to optimize random arbitrary parts of you application without first profiling. You might end up having to support complicated optimized code that doesn't actually make your application snappier.

I get the sense from the other answers here that there is a general confusion between scalability and performance. High performance means that the response is quick. High scalability means that you get a response no matter how many others are also using the site at the same time. There's a big difference.
In fact, you actually have to sacrifice a little performance just to get good scalability. A general pattern to scalability is distributed computing. Factoring functionality out into separate tiers of clustered servers (web, business rules, database) is the usual approach to scalability. That extra round trip will slow down page load a little bit.
Everyone always wants to focus on high scalability but also don't forget that, for software vendors who sell licenses to customers who self host the application, scaling down can be just as important as scaling up. An application that can run on a single server for ten users but can also be configured to run on a ten server web cluster, a three server middle tier, and a four server database cluster for 10,000 users would be a system well designed for scalability.

None. Just code the application using proper design techniques (separation of concerns, etc) and then when the application is done or nearly done, do your performance testing. You'll find the real bottlenecks then - they won't be what you might have guessed in the beginning. This is where your proper design from the beginning comes into play - it makes it easy to make changes to fix the bottlenecks.

Sometimes, a specific answer is more helpful than just generic tips.
If you want to scale, the only thing to target is SPEED (in hardware and software) and RESOURCES (in hardware).
Hardware, the latter is expensive (more servers, load-balancers, etc.).
So, by carefully selecting your initial development framework you will save a lot of time and resources -up to several orders of magnitude.
For example, nginx is (much) faster than Apache.
Other solutions are faster than nginx (for both static and dynamic contents) but I could not disclose them without being censored on StackOverflow (it was rated SPAM & advertising despite the fact that this is a FREE solution).
That's the limits of "sharing": we must share only "acceptable" solutions rather than efficient solutions.
Cheers,
Pierre.

Why is Java EE scalable?

I heard from various sources that Java EE is highly scalable, but to me it seems that you could never scale a Java EE application to the level of the google search engine or any other large website.
I would like to hear the technical reasons why it is so scalable.

Java EE is considered scalable because if you consider the EJB architecture and run on an appropriate application server, it includes facilities to transparently cluster and allow the use of multiple instances of the EJB to serve requests.
If you managed things manually in plain-old-java, you would have to figure out all of this yourself, for example by opening ports, synchronizing states, etc.
I am not sure you could define Google as a "large website". That would be like likening the internet to your office LAN. Java EE was not meant to scale to the global level, which is why sites like Amazon and Google use their own technologies (e.g., with use of MapReduce).
There are many papers discussing the efficiency of Java EE scalability.
For example this

What makes Java EE scalable is what makes anything scalable: separation of concerns. As your processing or IO needs increase, you can add new hardware and redistribute the load semi-transparently (mostly transparent to the app, obviously less so to the configuration monkeys) because the separated, isolated concerns don't know or care if they're on the same physical hardware or on different processors in a cluster.
You can make scalable applications in any language or execution platform. (Yes, even COBOL on ancient System 370 mainframes.) What application frameworks like Java EE (and others, naturally -- Java EE is hardly unique in this regard!) give you is the ability to easily (relatively speaking) do this by doing much of the heavy lifting for you.
When my web app uses, say, an EJB to perform some business logic, that EJB may be on the same CPU core, on a different core in the same CPU, on a different CPU entirely or, in extreme cases, perhaps even across the planet. I don't know and, for the most part, provided the performance is there, I don't care. Similarly when I send a message out on the message bus to get handled, I don't know nor do I care where that message goes, which component does the processing and where that processing takes place, again as long as the performance falls within my needs. That's all for the configuration monkeys to work out. The technology permits this and the tools are in place to assess what pieces have to go where to get acceptable performance as the system scales up in size.
Now when I try and hand roll all of this, I start with the problems right away. If I don't think about all the proxying and scheduling and distribution and such in advance, when my app expands beyond the bounds of a single machine's handling I now have major rewrites in place as I shift some of the application to another box. And then each time my capacities grow I have to do this again and again.
If I do think about all of this in advance, I'm writing a whole lot of boilerplate code for each application that does minor variations of all the same things. I can code things in a scalable way, but do I want to do this every. damned. time. I write an app?
So what Java EE (and other frameworks) bring to the table is pre-written boilerplate for the common requirements of making scalable applications. Writing my apps to these doesn't guarantee they'd be scalable, of course, but the frameworks make writing said scalable apps a whole lot easier.

One could look at a scalable architecture from the point of view of what the base framework (like Java EE) provides. But that's just the beginning.
Designing for a scalable infrastructure is an architectural art. It's like the art of projection ... how will it behave when it's blown up real big. The base questions are:
Where do I keep commonly accessed stuff so that when so many persons are asking for it, I don't have to go for it so many time (cache)?
Where do I keep each individual's stuff so that when there are so many individuals needing stuff kept, I won't have trouble managing them all.
How do I remember what a person did here the last time they came here, since they may not be coming back to the same particular node they visited the last time.
How long will I have to wait for (block on) a long-running procedure if so many persons are requesting it?
...
that sort of thing is beyond what a framework can wrap. In other words, the framework could be scalable but the product is wired too tight to scale.
Java EE, as a framework is quite scalable, like most modern microprocessor-targeting enterprise frameworks. But I have seen amazing (not in a good way) stuff build out of even the best of them.
For a plethora of references, please search Google for "Designing for Scalability"

The "scalability" thing talks about "what will you do when your application doesn't fit in a single computer anymore?".
Scalable applications can grow over more computers than one.
Note that large servers can have VERY large applications with lots of memory and lots of cpu's - see http://www.sun.com/servers/highend/m9000/ or http://www-03.ibm.com/systems/i/hardware/595/index.html - but it is usually more expensive than having lots of small servers with the application spreading over them.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.