I heard from various sources that Java EE is highly scalable, but to me it seems that you could never scale a Java EE application to the level of the Google search engine or any other large website.
I would like to hear the technical reasons why it is so scalable.
Java EE is considered scalable because, if you use the EJB architecture and run on an appropriate application server, the server includes facilities to transparently cluster the application and to serve requests from multiple instances of an EJB.
If you managed things manually in plain old Java, you would have to figure all of this out yourself, for example by opening ports, synchronizing state, and so on.
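As a rough illustration (the bean and its interface below are invented, not taken from any particular application), the application code only declares a session bean; pooling of instances, and clustering on servers that support it, stays the container's problem:

    // OrderService.java - hypothetical remote business interface
    import javax.ejb.Remote;

    @Remote
    public interface OrderService {
        double quote(String productId, int quantity);
    }

    // OrderServiceBean.java - the implementation; note there is no
    // clustering, pooling or networking code anywhere in it.
    import javax.ejb.Stateless;

    @Stateless
    public class OrderServiceBean implements OrderService {
        public double quote(String productId, int quantity) {
            return quantity * 9.99; // plain business logic
        }
    }

A client (a servlet, say) would then just inject the interface with @EJB and call it; whether the call lands in the same JVM or on another node of the cluster is the server's concern, not the application's.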
I am not sure you could define Google as a "large website". That would be like likening the internet to your office LAN. Java EE was not meant to scale to the global level, which is why sites like Amazon and Google use their own technologies (e.g., with use of MapReduce).
There are many papers discussing the scalability of Java EE.
For example, this one.
What makes Java EE scalable is what makes anything scalable: separation of concerns. As your processing or IO needs increase, you can add new hardware and redistribute the load semi-transparently (mostly transparent to the app, obviously less so to the configuration monkeys) because the separated, isolated concerns don't know or care if they're on the same physical hardware or on different processors in a cluster.
You can make scalable applications in any language or execution platform. (Yes, even COBOL on ancient System 370 mainframes.) What application frameworks like Java EE (and others, naturally -- Java EE is hardly unique in this regard!) give you is the ability to easily (relatively speaking) do this by doing much of the heavy lifting for you.
When my web app uses, say, an EJB to perform some business logic, that EJB may be on the same CPU core, on a different core in the same CPU, on a different CPU entirely or, in extreme cases, perhaps even across the planet. I don't know and, for the most part, provided the performance is there, I don't care. Similarly when I send a message out on the message bus to get handled, I don't know nor do I care where that message goes, which component does the processing and where that processing takes place, again as long as the performance falls within my needs. That's all for the configuration monkeys to work out. The technology permits this and the tools are in place to assess what pieces have to go where to get acceptable performance as the system scales up in size.
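To make the message-bus half of that concrete, here is a minimal sketch of a message-driven bean (the queue name and class are made up); the sender never knows which instance, on which box, ends up running onMessage:

    import javax.ejb.ActivationConfigProperty;
    import javax.ejb.MessageDriven;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    // Hypothetical consumer: the container decides where, and on how many
    // nodes, this runs; senders just drop messages on the queue.
    @MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType",
                                  propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destination",
                                  propertyValue = "jms/orders")
    })
    public class OrderProcessorBean implements MessageListener {
        public void onMessage(Message message) {
            try {
                if (message instanceof TextMessage) {
                    String payload = ((TextMessage) message).getText();
                    System.out.println("Processing: " + payload); // business logic goes here
                }
            } catch (JMSException e) {
                throw new RuntimeException(e);
            }
        }
    }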
Now when I try to hand-roll all of this, I run into problems right away. If I don't think about all the proxying and scheduling and distribution and such in advance, then when my app expands beyond what a single machine can handle I'm facing major rewrites as I shift parts of the application to another box. And each time my capacity grows I have to do this again and again.
If I do think about all of this in advance, I'm writing a whole lot of boilerplate code for each application that does minor variations of all the same things. I can code things in a scalable way, but do I want to do this every. damned. time. I write an app?
So what Java EE (and other frameworks) bring to the table is pre-written boilerplate for the common requirements of making scalable applications. Writing my apps against these doesn't guarantee they'll be scalable, of course, but the frameworks make writing such scalable apps a whole lot easier.
One could look at a scalable architecture from the point of view of what the base framework (like Java EE) provides. But that's just the beginning.
Designing for a scalable infrastructure is an architectural art. It's like the art of projection ... how will it behave when it's blown up real big. The base questions are:
Where do I keep commonly accessed stuff so that when so many people are asking for it, I don't have to go fetch it so many times (cache)?
Where do I keep each individual's stuff so that when there are so many individuals needing stuff kept, I won't have trouble managing them all?
How do I remember what a person did the last time they came here, since they may not come back to the same particular node they visited last time?
How long will I have to wait for (block on) a long-running procedure if so many persons are requesting it?
...
That sort of thing is beyond what a framework can wrap. In other words, the framework may be scalable but the product is wired too tightly to scale.
Java EE, as a framework, is quite scalable, like most modern microprocessor-targeting enterprise frameworks. But I have seen amazing (not in a good way) stuff built out of even the best of them.
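Coming back to the last question above (blocking on long-running work), the usual answer is to hand the work to a pool and not hold the request thread. The sketch below is invented for illustration; sizing that pool, and deciding what happens when it fills, is exactly the sort of wiring decision that determines whether the product scales:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Long-running work is queued onto a bounded pool instead of tying up
    // one request thread per report.
    public class ReportRunner {
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        public Future<String> runReportAsync(final String reportId) {
            return pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    Thread.sleep(2000); // stand-in for an expensive computation or slow I/O
                    return "report " + reportId + " ready";
                }
            });
        }
    }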
For a plethora of references, please search Google for "Designing for Scalability"
"Scalability" is about what you will do when your application no longer fits on a single computer.
Scalable applications can grow across more than one computer.
Note that large servers can host VERY large applications with lots of memory and lots of CPUs - see http://www.sun.com/servers/highend/m9000/ or http://www-03.ibm.com/systems/i/hardware/595/index.html - but it is usually more expensive than having lots of small servers with the application spread over them.
Related
I need to choose a language/platform for the new development of a series of services in a SOA. I'm looking into Scala and Clojure but don't think the community and products are mature enough for a real-world enterprise product yet.
Update/Clarifications:
Of course we can use many languages/platforms for SOA, but some languages/platforms are easier and better suited to an SOA. IMO the best ones for SOA should allow interface programming (to ease the definition of contracts), should have options for hosting the services (like Felix for Java or WCF in .NET), and should scale well (see Twitter's issues with RoR).
Java has always been the favourite in the enterprise market. However, many developers are looking into dynamic languages as well as talking about the stagnation of Java after v6. As a result, many new post-Java languages have arrived: Scala, Clojure and Groovy, to name a few that still run on the JVM but are not Java.
I hope these clarify the question.
Depends what you mean by "mature enough for a real-world enterprise product", and your relative level of tolerance for living on the cutting edge.
For example, I'm currently building a "real-world enterprise product" in Clojure (I'd have been equally happy with Scala; it was only that Clojure fitted my needs slightly better from the concurrency and meta-programming perspective).
I'm very happy with my decision.
Some quick perspectives if you are considering this "post-Java" path:
The communities are great and supportive, but you'll still have to solve problems yourself, if only because nobody else has run into the same problem yet. None of these are likely to be insurmountable, but it does present a bit of extra risk to delivery schedules.
Both Scala and Clojure can be very productive (in terms of value delivered to customers per hour coding), but you can equally well write bad and unmaintainable code in any language. Java pretty much forces you to write things in a standardised, somewhat verbose but syntactically simple and understandable way. With Scala and Clojure you get a whole new arsenal of crazy ways to hit your target or shoot yourself in the foot. Is your team going to be able to make the best use of Scala/Clojure advantages?
It's harder (though by no means impossible) to bring skilled people on board with existing Clojure/Scala skills. On the flipside, the people who do have these skills (or are keen to acquire them) are likely to be among the more talented / motivated developers so the search may still be productive.
Be prepared to make tough decisions regarding whether to target language/library features that are "just round the corner". For example, do you wait for the enhanced primitive support coming in Clojure 1.3? Or make do with the perfectly adequate but slower boxed primitive functions in Clojure 1.2?
A great benefit of being on the JVM is that you can still take full advantage of the Java ecosystem without being tied to Java as a language. Don't underestimate how useful this is: for example, I use a number of extremely well tested, mature Java libraries (e.g. Netty) pretty much transparently in my Clojure application. This significantly reduces your risk and the amount of new development that you need to do.
At the moment (having just completed a services/integration project), Jersey on top of Spring is right up there on my favourites list for web services.
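For what it's worth, a bare-bones JAX-RS resource of the kind Jersey serves looks something like this (the path and payload are invented; in a real service the method would delegate to a Spring-managed bean):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    // Jersey (or any JAX-RS runtime) maps the URI and handles content
    // negotiation; the class itself is just plain Java.
    @Path("/customers")
    public class CustomerResource {

        @GET
        @Path("/{id}")
        @Produces(MediaType.APPLICATION_JSON)
        public String getCustomer(@PathParam("id") long id) {
            return "{\"id\": " + id + ", \"name\": \"example\"}";
        }
    }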
I can't offer any suggestions for a SOA framework; the last time I was involved in that type of thing we used Oracle BPEL Process Manager, and I have mixed feelings about it. We weren't using REST then either, and I'm not sure how well the Oracle software works with it.
For me, Python seems the easiest way to do some SOA and have interoperability with Windows machines. I don't have framework names to hand, but there are a lot of them covering SOAP, REST, RPC...
I work with .Net professionally in a lot of different contexts, so it's easy for me to read about new frameworks, runtime internals, advanced techniques/design and put them to use and understand them. In the Java world, I have limited experience and am really only working with it for Android development these days. I've been able to learn the language well enough to build out the functionality I'm looking for, but I want to learn more about good practices and design that the Java guys agree on, whatever modern frameworks everyone's using, and more about the internals of the VM and how my programming choices affect how my code is compiled and executed.
Examples from the .NET world of what I'm looking for are:
There's a series of books called Effective C# that outlines 50 items per book of subtle changes to your programming style and how they will make your code cleaner and more performant in specific scenarios.
Entity Framework is a framework from Microsoft for hooking up directly to a data source and building out a configurable entity model automatically
Managed Extensibility Framework is a new framework from Microsoft for writing extensible applications and pluggable libraries by exposing extension points on both ends
There is documentation galore on the internet about how the .Net garbage collector works and how your programming choices affect how this interacts with your applications
What kinds of resources, books, tutorials and frameworks exist like this in the Java world?
There's a book called Effective Java too (by Joshua Bloch); it fills much the same niche as Effective C#.
There are different categories of data binding in Java. The most advanced are the object models, like JDO, JPA, etc. They basically use a mapping to move data between objects and tables, and you never touch the database directly as it is all handled transparently. Another is the typical "one object binds to one row" technique offered by simpler row-mapping tools. Finally, there is handling the database directly, for which you use JDBC. Use the tool most appropriate to your code logic.
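As a rough sketch of the two ends of that spectrum (the table and class names are invented, and the two snippets would live in separate files): with JPA you annotate a class and let the provider do the mapping; with JDBC you write the SQL yourself.

    // Customer.java - JPA end of the spectrum: the provider maps class to table.
    import javax.persistence.Entity;
    import javax.persistence.Id;

    @Entity
    public class Customer {
        @Id
        private Long id;
        private String name;
        // getters and setters omitted
    }

    // CustomerDao.java - JDBC end of the spectrum: you talk to the database directly.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class CustomerDao {
        public String findName(Connection connection, long id) throws Exception {
            PreparedStatement ps =
                connection.prepareStatement("SELECT name FROM customer WHERE id = ?");
            try {
                ps.setLong(1, id);
                ResultSet rs = ps.executeQuery();
                return rs.next() ? rs.getString("name") : null;
            } finally {
                ps.close();
            }
        }
    }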
In general, you'll find that with Java it's not a "one solution only" environment. Some of the problems have been solved multiple times in different ways to achieve different results.
It sounds like the "Managed Extensibility Framework" is a subtle copy of the J2EE server concept. J2EE has undergone at least three major revisions over the past decade. If you want to use J2EE, remember that it provides services to items within wrappers called "containers". This means you will have to adapt your code to meet the container service agreements. There is a bit of up-front learning involved, but once you understand the environment it isn't hard. You also don't need to use the entire J2EE environment, and you can substitute your own solutions for those provided by the J2EE server. It's a pick-and-choose arrangement; precious little is forced on you.
J2EE also describes a lot of corporate technologies that may live independently of a J2EE server, so if you don't like the J2EE environment (for whatever reason) you can always include the JAR files and use the libraries without the J2EE server.
Some people have decided that the initial J2EE servers were too restrictive, so you have an almost-J2EE server called Spring. The J2EE web containers arrived pretty early on the scene in Java, so you can get "web container only" servers, like Tomcat or Jetty.
With Java, there is probably even more documentation about the garbage collector, but you have to deal with its behaviour less. Java's garbage collector is generally well behaved, and it doesn't have to cope with raw pointers, which is part of what makes .NET's garbage collector something you do need to tend to from time to time.
That said, do release your references to anything you want collected. If your logic stores items in a HashMap as a cache, consider using SoftReferences, which don't prevent collection on their own. Java doesn't use reference counting, so don't worry about circular references: you can drop a cycle of references and they will all be collected.
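A sketch of that idea (the class is invented for illustration): wrap cached values in SoftReference so the collector can reclaim them under memory pressure instead of the cache growing until you hit an OutOfMemoryError.

    import java.lang.ref.SoftReference;
    import java.util.concurrent.ConcurrentHashMap;

    // Cache whose values the GC is free to reclaim when memory gets tight.
    public class SoftCache<K, V> {
        private final ConcurrentHashMap<K, SoftReference<V>> map =
            new ConcurrentHashMap<K, SoftReference<V>>();

        public void put(K key, V value) {
            map.put(key, new SoftReference<V>(value));
        }

        public V get(K key) {
            SoftReference<V> ref = map.get(key);
            V value = (ref == null) ? null : ref.get();
            if (ref != null && value == null) {
                map.remove(key); // the entry was collected; drop the stale reference
            }
            return value;
        }
    }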
The algorithm the GC uses changes depending on memory availability. In low memory-utilization situations, it copies live objects to a new page and frees the old page (so compaction comes nearly for free). In higher memory situations, it uses the mark, sweep, and compact cycle typical of other garbage collectors. It also stages its memory in three generational segments to order objects by how frequently they need to be checked for liveness in the running program.
All of that said, the real kicker is that Android uses the Java language, but it doesn't run a JVM. It runs an "I-can't-believe-it's-not-Java!" JVM work-alike that makes significant changes to the class loader and class file format. That means you need to learn how the Dalvik virtual machine operates and differs from the JVM.
Have fun! There is a lot more choice in Java land than you're probably accustomed to; however, many of the most popular Java tools have been ported to .NET land, so you won't find the entire landscape foreign.
If you needed to build a highly scalable web application using Java, what framework would you use and why?
I'm just reading Thinking in Java, Head First Servlets and Manning's Spring framework book, but really I want to focus on highly scalable architectures etc.
Would you use Tomcat, Hibernate, Ehcache?
(Just assume you have to design for scale; I'm not looking for the 'worry about it when you get traffic' type of responses.)
The answer depends on what we mean by "scalable". A lot depends on your application, not on the framework you choose to implement it with.
No matter what framework you choose, the fact is that the hardware you deploy it on will have an upper limit on the number of simultaneous requests it'll be able to handle. If you want to handle more traffic, you'll have to throw more hardware at it and include load balancing, etc.
The part that's pertinent in that case has to do with shared state. If you have a lot of shared state, you'll have to make sure that it's thread safe, "sticky" when it needs to be, replicated throughout a cluster, etc. All that has to do with the app server you deploy it to and the way you design your app, not the framework.
Tomcat's not a "framework", it's a servlet/JSP engine. It's got clustering capabilities, but so do most other Java EE app servers. You can use Tomcat if you've already chosen Spring, because that implies you don't have EJBs. Jetty, Resin, WebLogic, JBoss, Glassfish - any of them will do.
Spring is a good choice if you already know it well. I think following the Spring idiom will make it more likely that your app is layered and architecturally sound, but that's not the deciding factor when it comes to scalability.
Hibernate will make your development life easier, but the scalability of your database depends a great deal on the schema, indexes, etc. Hibernate isn't a guarantee.
"Scalable" is one of those catch-all terms (like "lightweight") that is easy to toss off but encompasses many considerations. I'm not sure that a simple choice of framework will solve the issue once and for all.
I would check out Apache Mina. From the home page:
Apache MINA is a network application framework which helps users develop high performance and high scalability network applications easily. It provides an abstract, event-driven, asynchronous API over various transports such as TCP/IP and UDP/IP via Java NIO.
It has an HTTP engine AsyncWeb built on top of it.
A less radical suggestion (!) is Jetty - a servlet container geared towards performance and a small footprint.
The two keywords I would mainly focus on are Asynchronous and Stateless. Or at least "as stateless as possible": of course you need state, but maybe, instead of going for a full-fledged RDBMS, have a look at document-centred datastores.
Have a look at Akka for async, and CouchDB or MongoDB as datastores...
Frameworks are geared more towards speeding up development than towards performance. There will be some overhead with any framework because of use cases it handles that you don't need. Granted, the overhead may be low, and most frameworks will point you towards patterns that have been proven to scale, but those patterns can be used without the framework as well.
So I would design your architecture assuming 'bare metal', i.e. pure servlets (yes, you could go even lower level, but I'm assuming you don't want to write your own http socket layer), straight JDBC, etc. Then go back and figure out which frameworks best fit your architecture, speed up your development, and don't add too much overhead. Tomcat versus other containers, Hibernate versus other ORMs, Struts versus other web frameworks - none of that matters if you make the wrong decisions about the key performance bottlenecks.
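A "bare metal" starting point of the sort described here is just a servlet registered with the container (the path and body below are arbitrary); frameworks can then be layered on where they earn their keep:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Plain servlet: nothing between the container and your code.
    // (On a Servlet 3.0 container; older containers map this in web.xml instead.)
    @WebServlet("/ping")
    public class PingServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            resp.setContentType("text/plain");
            resp.getWriter().println("pong");
        }
    }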
However, a better approach might be to choose a framework that optimizes for development time and then find the bottlenecks and address those as they occur. Otherwise, you could spin your wheels optimizing prematurely for cases that never occur. But that probably falls in the category of 'worry about it when you get traffic'.
All popular modern frameworks (and "stacks") are well-written and don't pose any threat to performance and scaling, if used correctly. So focus on what stack will be best for your requirements, rather than starting with the scalability upfront.
If you have a particular requirement, then you can ask a question about it and get recommendations about what's best for handling it.
There is no framework that is magically going to make your web service scalable.
The key to scalability is replicating the functionality that is (or would otherwise be) a bottleneck. If you are serious about making your service scale, you need to start with a good understanding of the characteristics of your application, and hence an idea of where the bottlenecks are likely to be:
Is it a read-only service or do user requests cause primary data to change?
Do you have / need sessions, or is the system RESTful?
Are the requests normal HTTP requests with HTML responses, or are you doing AJAX, callbacks, or something else?
Are user requests computation intensive, I/O intensive, rendering intensive?
How big/complicated is your backend database?
What are the availability requirements?
Then you need to decide how scalable you want it to be. Do you need to support hundreds, thousands, millions of simultaneous users? (Different degrees of scalability require different architectures, and different implementation approaches.)
Once you have figured these things out, then you decide whether there is an existing framework that can cope with the level of traffic you need to support. If not, you need to design your own system architecture to be scalable in the problem areas.
If you are able to work with a commercial system, then I'd suggest taking a look at Jazz Foundation at http://jazz.net. It's the base for IBM Rational's new products. The project is led by the people who developed Eclipse within IBM before it was open-sourced. It has a pluggable DB layer and supports multiple app servers. It's designed to handle clustering and multi-site deployments. It has nice capabilities like OAuth support and license management.
In addition to the above:
Take a good look at JMS (Java Message Service). This is a much underrated technology. There are vendor solutions such as TIBCO EMS, Oracle, etc., but there are also free stacks such as ActiveMQ.
JMS will allow you to build synchronous and asynchronous solutions using queues. You can choose to have persistent or non-persistent queues.
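A minimal producer sketch using the standard javax.jms API (the JNDI names are made up; they come from the broker's configuration, whether that's ActiveMQ, TIBCO EMS or something else):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.naming.InitialContext;

    public class OrderPublisher {
        public void publish(String payload) throws Exception {
            InitialContext jndi = new InitialContext();
            ConnectionFactory factory =
                (ConnectionFactory) jndi.lookup("jms/ConnectionFactory"); // hypothetical name
            Queue queue = (Queue) jndi.lookup("jms/orders");              // hypothetical name

            Connection connection = factory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer = session.createProducer(queue);
                producer.send(session.createTextMessage(payload)); // consumers pick this up
            } finally {
                connection.close();
            }
        }
    }

Whether the matching consumer processes messages synchronously or asynchronously, and whether messages survive a broker restart, then becomes a configuration choice rather than a code change.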
As others have already replied, scalability isn't about what framework you use. Sure, it is nice to squeeze out as much performance as possible from each node, but what you ideally want is that by adding another node you scale your app in a linear fashion.
The application should be architected in distinct layers so it is possible to add more power to different layers of the application without a rewrite, and also to add caching at different layers. Caching is key to achieving speed.
One example of layers for a big webapp:
Load balancers (TCP level)
Caching reverse proxies
CDN for static content
Front end webservers
Appservers (business logic of the app)
Persistent storage (RDBMS, key/value, document)
I am currently working on a web-based application. I want to know which key factors a designer should take care of while designing a scalable web-based application.
That's a fairly vague and broad question and something you could write books about. How far do you take it? At some point the performance of SQL JOINs breaks down and you have to implement some sharding/partitioning strategy. Is that the level you mean?
General principles are:
Cache and version all static content (images, CSS, Javascript) - see the filter sketch after this list;
Put such content on another domain to stop needless cookie traffic;
GZip/deflate everything;
Only execute required Javascript;
Never do with Javascript what you can do on the server side (e.g. style table rows with CSS rather than using fancy jQuery odd/even tricks, which can be a real time-killer);
Keep external HTTP requests to a minimum. That means very few CSS, Javascript and image files. That may mean implementing some form of CSS spriting and/or combining CSS or JS files;
Use serverside caching where necessary but only after you find there's a problem. Memory is an expensive but often effective tradeoff for more performance;
Test and tune all database queries;
Minimize redirects.
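As a sketch of the first item in that list (the URL patterns and max-age are arbitrary), a servlet filter can stamp long-lived cache headers onto static content; combine it with versioned file names so a release can bust the cache:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.annotation.WebFilter;
    import javax.servlet.http.HttpServletResponse;

    // Marks static resources as cacheable for a week.
    @WebFilter(urlPatterns = {"*.css", "*.js", "*.png"})
    public class StaticCacheFilter implements Filter {
        private static final long ONE_WEEK_SECONDS = 7L * 24 * 60 * 60;

        public void init(FilterConfig config) { }

        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            HttpServletResponse http = (HttpServletResponse) response;
            http.setHeader("Cache-Control", "public, max-age=" + ONE_WEEK_SECONDS);
            http.setDateHeader("Expires",
                System.currentTimeMillis() + ONE_WEEK_SECONDS * 1000);
            chain.doFilter(request, response);
        }

        public void destroy() { }
    }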
Having a good read of highscalability.com should give you some ideas. I highly recommend the Amazon articles.
Every application is different. You'll have to profile your application to see where you should concentrate your optimization efforts. Some web applications might require database access optimizations, while others have complicated business logic that cause the bottleneck.
Don't attempt to optimize random, arbitrary parts of your application without first profiling. You might end up having to support complicated optimized code that doesn't actually make your application snappier.
I get the sense from the other answers here that there is a general confusion between scalability and performance. High performance means that the response is quick. High scalability means that you get a response no matter how many others are also using the site at the same time. There's a big difference.
In fact, you actually have to sacrifice a little performance just to get good scalability. A general pattern to scalability is distributed computing. Factoring functionality out into separate tiers of clustered servers (web, business rules, database) is the usual approach to scalability. That extra round trip will slow down page load a little bit.
Everyone always wants to focus on high scalability but also don't forget that, for software vendors who sell licenses to customers who self host the application, scaling down can be just as important as scaling up. An application that can run on a single server for ten users but can also be configured to run on a ten server web cluster, a three server middle tier, and a four server database cluster for 10,000 users would be a system well designed for scalability.
None. Just code the application using proper design techniques (separation of concerns, etc) and then when the application is done or nearly done, do your performance testing. You'll find the real bottlenecks then - they won't be what you might have guessed in the beginning. This is where your proper design from the beginning comes into play - it makes it easy to make changes to fix the bottlenecks.
Sometimes, a specific answer is more helpful than just generic tips.
If you want to scale, the only things to target are SPEED (in hardware and software) and RESOURCES (in hardware).
On the hardware side, the latter is expensive (more servers, load balancers, etc.).
So, by carefully selecting your initial development framework you will save a lot of time and resources - up to several orders of magnitude.
For example, nginx is (much) faster than Apache.
Other solutions are faster than nginx (for both static and dynamic content), but I could not disclose them without being censored on StackOverflow (it was rated SPAM & advertising despite the fact that this is a FREE solution).
Those are the limits of "sharing": we must share only "acceptable" solutions rather than efficient ones.
Looking to write an SNMP and NetFlow tool for Linux/BSD and seeking advice on language selection: C or Java.
The tool will collect NetFlow data, send and receive SNMP queries, connect to a PostgreSQL database, and will be fronted by a web interface (PHP); in the future it will interface with devices using web services.
Normally I would have reached for C to implement the above (plenty of robust libraries and low-level access to the network stack), but the database access and web services could be implemented more easily (better?) in Java.
The question is whether Java is up to the task of processing all this network information under load, or should I stick with the lower-level access provided by C?
Supplemental question, I've been considering making this a hybrid application. Heavy lifting in C and doing the higher level stuff in Java. Experiences and thoughts on this are welcome.
Java's implementations today are robust and mature, so your worry about whether they're "up to processing ... under load" is misplaced. C has its advantages (tiniest memory footprint, fastest startup times), but you pay dearly for them in terms of the programming work needed to do your own memory management. It doesn't appear from what you say that minimizing memory or optimizing for frequent restarts is a big deal for this app anyway. Why don't you start with Java (or whatever other high-level language you're most comfortable with) and only consider recoding some parts in C, maybe, if and when your profiling shows CPU or memory bottlenecks arising from the higher-level language? (I'd bet you'll most likely end up not needing such recoding, btw.)
I would absolutely go with Java on this. It's entirely capable of handling the "load". I work on Java projects which are responsible for processing extremely large amounts of data in real-time and without issue.
Java won't struggle one bit with what you are talking about doing here, and it will be far easier and quicker to develop in.
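As a rough sketch of the Java side (the port and buffer size are arbitrary, and real NetFlow parsing is omitted), NIO keeps the UDP collection loop short; the received datagrams would be handed to a parser and batched into PostgreSQL:

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;

    // Minimal UDP listener of the kind a NetFlow collector starts from.
    public class FlowCollector {
        public static void main(String[] args) throws Exception {
            DatagramChannel channel = DatagramChannel.open();
            channel.socket().bind(new InetSocketAddress(2055)); // a commonly used NetFlow port
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

            while (true) {
                buffer.clear();
                channel.receive(buffer); // blocks until a datagram arrives
                buffer.flip();
                System.out.println("Received flow packet of " + buffer.remaining() + " bytes");
            }
        }
    }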
Everyone's nailed this; the modern JVM implementations are in the same ballpark as C for speed, unless you're doing direct hardware access.
I'm curious why you'd consider a Java backend for a PHP frontend. C/PHP would make sense, but if you're going for Java on the backend, it might help to have the same language throughout for easier maintainability.