Choosing a Java RESTful framework for a heavily loaded server

I'm trying to figure out which Java RESTful framework is best suited for a heavily loaded identity manager server.
Has anyone run load tests on RESTful frameworks and is willing to share the conclusions?
Thanks a lot!

Great question! You'll probably find that the framework choice is not your primary determinant of performance/scalability. We've used Restlet, based on a very strong recommendation from a former colleague who used it to develop Overstock.com (a very large e-commerce site). It has good performance, and it works fine for Overstock.com. But we didn't do any head-to-head comparisons.
One of the big drivers for REST is its scalability, a quality of a distributed system whereby you can accommodate an increase in usage with a proportional increase in system size and cost. Caching is a key technique to achieve scalability. So if you allow your representations to be cached, much of the load is actually not borne by the identity management system but by web caches downstream. This is independent of the REST framework.
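To make that concrete, here is a minimal sketch of marking a representation cacheable from a JAX-RS resource (the resource, its fields, and the 5-minute lifetime are illustrative assumptions on my part; Restlet offers comparable cache-control support):

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.CacheControl;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/users")
public class UserResource {

    // Minimal stand-in representation; a real identity server returns far more.
    public static class User {
        public String id;
        public User(String id) { this.id = id; }
    }

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    public Response getUser(@PathParam("id") String id) {
        User user = new User(id); // hypothetical identity lookup

        // Let downstream HTTP caches hold this representation for 5 minutes,
        // so repeated reads never reach the identity management system.
        CacheControl cc = new CacheControl();
        cc.setMaxAge(300);

        return Response.ok(user).cacheControl(cc).build();
    }
}
```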
Your backend database technology is likely another primary factor in system performance and scaling. Tuning the database system and optimizing queries may pay off here. Also consider whether adding a database cache layer makes sense (e.g., OpenSymphony).
We found that serialization costs were quite significant for us. Overall request rates were best if we used the Kryo or Smile binary serializations. If you need a textual serialization, we found that the Jackson JSON serializer was much faster than the XStream XML serializer, doubling the overall request rate. This might be an area to consider.
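For illustration, here is a minimal Jackson sketch (the class and field names are made up; the main point is to build the ObjectMapper once and reuse it, since its reflective setup is the expensive part):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class SerializationDemo {

    // Hypothetical payload class, standing in for a real identity record.
    public static class User {
        public String id = "u-1001";
        public String name = "alice";
    }

    public static void main(String[] args) throws Exception {
        // ObjectMapper is thread-safe after configuration; share one instance
        // across request threads rather than creating one per request.
        ObjectMapper mapper = new ObjectMapper();

        String json = mapper.writeValueAsString(new User());
        System.out.println(json); // {"id":"u-1001","name":"alice"}

        User back = mapper.readValue(json, User.class);
        System.out.println(back.name); // alice
    }
}
```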
So if you haven't done so, examine your system from a scaling perspective. See http://www.highscalability.com, Richardson and Ruby's RESTful Web Services (O'Reilly), Cal Henderson's Building Scalable Web Sites, and Theo Schlossnagle's Scalable Internet Architectures for a start.

Related

Choosing a database service - MongoHQ vs DynamoDB

Currently I am gathering information about which database service we should use.
I am still very new to web development, but we think we want a NoSQL database.
We are using Java with Play! 2.
We only need a database for user registration.
Now, I am already familiar with GAE's ndb, which is a key-value store like DynamoDB. MongoDB is a document DB.
I am not sure what advantages each solution has.
I also know that DynamoDB runs on SSDs and that MongoDB is memory-mapped.
An advantage of MongoDB would be that Play! already "supports" MongoDB.
Now, we don't expect too much database usage, but we would need to scale pretty fast if our app grows.
What alternatives do I have? What pros/cons do they have?
Considering:
Pricing
Scaling
Ease of use
Play! support?
(Disclosure: I'm a founder of MongoHQ, and would obviously prefer you choose us)
The biggest difference from a developer perspective is the querying capability. On DynamoDB, you need the exact key for a given document, or you need to build your keys in such a way that you can use them for range based queries. In Mongo, you can query on the structure of the document, add secondary indexes, do aggregations, etc.
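For example, with the MongoDB Java driver a structure-based query with a secondary index looks something like this (the connection string, database, collection, and field names are all hypothetical):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Indexes.ascending;

public class MongoQueryDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("app").getCollection("users");

            // Secondary index on a field inside the document, something a
            // key-only model does not give you directly.
            users.createIndex(ascending("email"));

            // Query on document structure rather than by primary key.
            Document found = users.find(eq("email", "alice@example.com")).first();
            System.out.println(found);
        }
    }
}
```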
The advantage of doing it with k/v only is that it forces you to build your application in a way that DynamoDB can scale. The advantage of Mongo's flexible queries against your docs is that you can do much faster development, even if you discount what the Play framework includes. It's always going to be quicker to do new development with something like Mongo, because you don't have to make your scaling decisions from the get-go.
Implementation-wise, both Mongo and DynamoDB can grow basically unbounded. Dynamo abstracts away most of the decisions on storage, RAM and processor power. Mongo requires that you (or someone like us) make decisions on how much RAM to have, what kind of disks to use, how to manage bottlenecks, etc. The operations hurdles are different, but the end result is very similar. We run multiple Mongo DBs on top of very fast SSDs and it works phenomenally well.
Pricing is incredibly difficult to compare, unfortunately. DynamoDB pricing is based on a nominal per GB fee, but you pay for data access. You need to be sure you understand how your costs are going to grow as your database gets more active. I'm not sure I can predict DynamoDB pricing effectively, but I know we've had customers who've been surprised (to say the least) at how expensive Dynamo ended up being for the stuff they wanted to do.
Running Mongo is much more predictable cost-wise. You likely need 1GB of RAM for every 10GB of data; running a redundant setup doubles your price; and so on. It's a much easier equation to wrap your head around, and you're not in for quite as nasty a shock if you have a huge amount of traffic one day.
By far the biggest advantage of Mongo (and MongoHQ) is this: you can leave your provider at any time. If you get irked at your Mongo provider, it's only a little painful to migrate away. If you get irked at Amazon, you're going to have to rewrite your app to work with an entirely different engine. This has huge implications for the support you should expect to receive: hosting Mongo is competitive enough that you get very good support from just about any Mongo-specific company you choose (or we'd die).
I addressed scaling a little bit above, but the simplest answer is this: if you define your data model well, either option will scale out just about as far as you can imagine you'd need to go. You are likely not to get this right with Mongo at first, though, since you'll probably be developing quickly. This means that once you can't scale vertically any more (by adding RAM, disk speed, etc. to a single server), you will have to be careful about how you choose to shard. The biggest difference between Mongo and Dynamo scaling is when you choose to make your "how do I scale my data?" decisions, not overall scaling ability.
So I'd choose Mongo (duh!). I think you can build a fantastic app on top of DynamoDB, though.
As you said, MongoDB is one step ahead of the other options, because you can use the Morphia plugin to simplify DB interactions (you have JPA support as well). The Play framework provides a CRUD module (admin console) and a secure module as well (for your overall login system), so I strongly suggest you have a look at them.

Java web framework benchmark

To compare with Django, I would like to find benchmarks of the main Java web frameworks (Struts, JSF, etc.).
I searched on Google, but I was unable to find a benchmark showing how many req/s Java frameworks can handle.
Do you know of any benchmarks for Java web frameworks?
Techempower benchmark:
http://www.techempower.com/benchmarks
They compare a lot of frameworks and accept new frameworks for comparison. The interface is very intuitive. In my view, it is the best benchmark available right now.
World Wide Wait - a 1-hour talk
http://www.parleys.com/#id=2942&st=5
Django is not in it; it's a benchmark of JVM frameworks only. But still, it is quite scientific and worth watching.
This one was just recently published: http://www.jtict.com/blog/rails-wicket-grails-play-lift-jsp It contains quite a few different Java-based frameworks and their response-time comparisons.
I'm not sure that what you are requesting is available. There are too many variables to measure this accurately. It ALL depends on what your web application is doing, and how you do it.
For example, do you use a DB? How do I measure Struts or Faces throughput on something that heavily depends on your schema, your DB hardware, your network setup, and the complexity of your pages?
Do you do any type of intensive processing? How do I measure Struts or Faces throughput on something that heavily depends on your algorithm, data size, memory and processor resources?
I could measure the throughput of Hello World, but how valuable would that be to you? How realistic?
In my experience, the biggest bottleneck with most web applications is NOT the framework. It's the network and/or the DB. The only way to get reliable numbers for this is to make a reasonable proof of concept of your application and measure it.
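For reference, this is roughly the kind of trivial endpoint a "Hello World" benchmark exercises; a sketch using the Servlet 3.0 annotation API (the URL pattern is arbitrary). Point a load generator such as ab or JMeter at it for a ceiling figure, then compare against a proof of concept that actually touches your DB and pages:

```java
import java.io.IOException;

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A trivial baseline endpoint: no DB, no templates, no business logic.
// Its req/s number is an upper bound, not a prediction for a real app.
@WebServlet("/hello")
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().write("Hello World");
    }
}
```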
The most current (started on 20/Jul/2019) TechEmpower Web Framework benchmark, filtered by:
Java *
Python Django
Full Stack
Full ORM
https://www.techempower.com/benchmarks/#section=test&runid=66ee924e-3bc2-4bd8-aaf0-2cd8443f65db&hw=ph&test=fortune&l=zijzb3-f&c=6&o=e&f=0-0-4fti4g-0-4fti4g-0-b8jk-0-0-0

building a high scale java app, what stack would you use?

If you needed to build a highly scalable web application using Java, what framework would you use, and why?
I'm currently reading Thinking in Java, Head First Servlets, and Manning's Spring framework book, but really I want to focus on highly scalable architectures, etc.
Would you use Tomcat, Hibernate, Ehcache?
(Just assume you have to design for scale; I'm not looking for the 'worry about it when you get traffic' type of responses.)
The answer depends on what we mean by "scalable". A lot depends on your application, not on the framework you choose to implement it with.
No matter what framework you choose, the fact is that the hardware you deploy it on will have an upper limit on the number of simultaneous requests it'll be able to handle. If you want to handle more traffic, you'll have to throw more hardware at it and include load balancing, etc.
The part that's pertinent in that case has to do with shared state. If you have a lot of shared state, you'll have to make sure that it's thread safe, "sticky" when it needs to be, replicated throughout a cluster, etc. All that has to do with the app server you deploy it to and the way you design your app, not the framework.
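As a small illustration of the shared-state concern, here is a sketch of per-node state kept thread safe with java.util.concurrent (the class and its purpose are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-node counter of logins. Safe for concurrent request
// threads on one server, but note it is NOT replicated: in a cluster each
// node sees only its own numbers unless this state moves to a shared store
// (database, distributed cache, replicated sessions, ...).
public class LoginCounter {
    private final ConcurrentMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

    public long recordLogin(String userId) {
        return counts.computeIfAbsent(userId, k -> new AtomicLong())
                     .incrementAndGet();
    }
}
```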
Tomcat's not a "framework", it's a servlet/JSP engine. It's got clustering capabilities, but so do most other Java EE app servers. You can use Tomcat if you've already chosen Spring, because it implies that you don't have EJBs. Jetty, Resin, WebLogic, JBoss, GlassFish - any of them will do.
Spring is a good choice if you already know it well. I think following the Spring idiom will make it more likely that your app is layered and architecturally sound, but that's not the deciding factor when it comes to scalability.
Hibernate will make your development life easier, but the scalability of your database depends a great deal on the schema, indexes, etc. Hibernate isn't a guarantee.
"Scalable" is one of those catch-all terms (like "lightweight") that is easy to toss off but encompasses many considerations. I'm not sure that a simple choice of framework will solve the issue once and for all.
I would check out Apache Mina. From the home page:
Apache MINA is a network application framework which helps users develop high performance and high scalability network applications easily. It provides an abstract, event-driven, asynchronous API over various transports such as TCP/IP and UDP/IP via Java NIO.
It has an HTTP engine AsyncWeb built on top of it.
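A minimal MINA 2.x sketch, just to show the event-driven shape of the API (the port and the text-line codec are arbitrary choices):

```java
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

import org.apache.mina.core.service.IoHandlerAdapter;
import org.apache.mina.core.session.IoSession;
import org.apache.mina.filter.codec.ProtocolCodecFilter;
import org.apache.mina.filter.codec.textline.TextLineCodecFactory;
import org.apache.mina.transport.socket.nio.NioSocketAcceptor;

public class EchoServer {
    public static void main(String[] args) throws Exception {
        NioSocketAcceptor acceptor = new NioSocketAcceptor();

        // Decode/encode the byte stream as text lines before the handler sees it.
        acceptor.getFilterChain().addLast("codec",
                new ProtocolCodecFilter(new TextLineCodecFactory(StandardCharsets.UTF_8)));

        // Event-driven: one handler serves many NIO connections, with no
        // thread-per-connection needed.
        acceptor.setHandler(new IoHandlerAdapter() {
            @Override
            public void messageReceived(IoSession session, Object message) {
                session.write("echo: " + message);
            }
        });

        acceptor.bind(new InetSocketAddress(10023));
    }
}
```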
A less radical suggestion (!) is Jetty - a servlet container geared towards performance and a small footprint.
The two keywords I would mainly focus on are asynchronous and stateless. Or at least "as stateless as possible": of course you need state, but maybe, instead of going for a full-fledged RDBMS, have a look at document-centered datastores.
Have a look at Akka for async, and at CouchDB or MongoDB as datastores...
Frameworks are more geared towards speeding up development, not performance. There will be some overhead with any framework because of use cases it handles that you don't need. Granted, the overhead may be low, and most frameworks will point you towards patterns that have been proven to scale, but those patterns can be used without the framework as well.
So I would design your architecture assuming 'bare metal', i.e. pure servlets (yes, you could go even lower level, but I'm assuming you don't want to write your own HTTP socket layer), straight JDBC, etc. Then go back and figure out which frameworks best fit your architecture, speed up your development, and don't add too much overhead. Tomcat versus other containers, Hibernate versus other ORMs, Struts versus other web frameworks - none of that matters if you make the wrong decisions about the key performance bottlenecks.
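For instance, a 'bare metal' data access sketch with straight JDBC might look like the following (the table, columns, and DAO shape are made-up illustrations):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import javax.sql.DataSource;

public class UserDao {
    // In practice this would be a pooled DataSource managed by the container.
    private final DataSource dataSource;

    public UserDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Straight JDBC: no ORM overhead, full control over the SQL that runs.
    public String findNameById(long id) throws SQLException {
        String sql = "SELECT name FROM users WHERE id = ?";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        }
    }
}
```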
However, a better approach might be to choose a framework that optimizes for development time and then find the bottlenecks and address those as they occur. Otherwise, you could spin your wheels optimizing prematurely for cases that never occur. But that probably falls in the category of 'worry about it when you get traffic'.
All popular modern frameworks (and "stacks") are well-written and don't pose any threat to performance and scaling, if used correctly. So focus on what stack will be best for your requirements, rather than starting with the scalability upfront.
If you have a particular requirement, then you can ask a question about it and get recommendations about what's best for handling it.
There is no framework that is magically going to make your web service scalable.
The key to scalability is replicating the functionality that is (or would otherwise be) a bottleneck. If you are serious about making your service scale, you need to start with a good understanding of the characteristics of your application, and hence an idea of where the bottlenecks are likely to be:
Is it a read-only service or do user requests cause primary data to change?
Do you have / need sessions, or is the system RESTful?
Are the requests normal HTTP requests with HTML responses, or are you doing AJAX or callbacks or something else?
Are user requests computation-intensive, I/O-intensive, or rendering-intensive?
How big/complicated is your backend database?
What are the availability requirements?
Then you need to decide how scalable you want it to be. Do you need to support hundreds, thousands, or millions of simultaneous users? (Different degrees of scalability require different architectures and different implementation approaches.)
Once you have figured these things out, then you decide whether there is an existing framework that can cope with the level of traffic you need to support. If not, you need to design your own system architecture to be scalable in the problem areas.
If you are able to work with a commercial system, then I'd suggest taking a look at Jazz Foundation at http://jazz.net. It's the base for IBM Rational's new products. The project is led by the guys who developed Eclipse within IBM before it was open-sourced. It has a pluggable DB layer and supports multiple app servers. It's designed to handle clustering and multi-site deployments, and it has nice capabilities like OAuth support and license management.
In addition to the above:
Take a good look at JMS (Java Message Service). This is a much underrated technology. There are vendor solutions such as TIBCO EMS, Oracle, etc., but there are also free stacks such as ActiveMQ.
JMS will allow you to build synchronous and asynchronous solutions using queues. You can choose persistent or non-persistent message delivery.
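A minimal send-side sketch using the JMS API with ActiveMQ (the broker URL and queue name are made up):

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsSendDemo {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");

        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("work.requests");

            // Persistent delivery by default: the broker stores the message,
            // so the producer can fire and forget (asynchronous decoupling).
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage("process order 42");
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
```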
As others have already replied, scalability isn't about which framework you use. Sure, it is nice to squeeze out as much performance as possible from each node, but what you ideally want is that by adding another node you scale your app in a linear fashion.
The application should be architected in distinct layers so it is possible to add more power to different layers of the application without a rewrite, and also to add caching at different layers. Caching is key to achieving speed.
One example of layers for a big webapp:
Load balancers (TCP level)
Caching reverse proxies
CDN for static content
Front end webservers
Appservers (business logic of the app)
Persistent storage (RDBMS, key/value, document)

Key factors for designing a scalable web-based application

Currently I am working on a web-based application. I want to know: what are the key factors a designer should take care of while designing a scalable web-based application?
That's a fairly vague and broad question and something you could write books about. How far do you take it? At some point the performance of SQL JOINs breaks down and you have to implement some sharding/partitioning strategy. Is that the level you mean?
General principles are:
Cache and version all static content (images, CSS, JavaScript) - see the sketch after this list;
Put such content on another domain to stop needless cookie traffic;
GZip/deflate everything;
Only execute required JavaScript;
Never do with JavaScript what you can do on the server side (e.g., style table rows with CSS rather than using fancy jQuery odd/even tricks, which can be a real time killer);
Keep external HTTP requests to a minimum. That means very few CSS, JavaScript and image files. That may mean implementing some form of CSS spriting and/or combining CSS or JS files;
Use server-side caching where necessary, but only after you find there's a problem. Memory is an expensive but often effective tradeoff for more performance;
Test and tune all database queries;
Minimize redirects.
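As a sketch of the first point, here is a servlet filter that marks versioned static assets as cacheable for a year (this assumes a Servlet 4.0+ container, where Filter's init/destroy have default implementations; the URL patterns are illustrative):

```java
import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletResponse;

// Mark static assets as cacheable for a year. Combine this with versioned
// file names (e.g., app.v2.css) so a new release simply references new URLs
// and never serves stale content.
@WebFilter(urlPatterns = {"*.css", "*.js", "*.png"})
public class StaticCacheFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        ((HttpServletResponse) resp).setHeader("Cache-Control", "public, max-age=31536000");
        chain.doFilter(req, resp);
    }
}
```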
Having a good read of highscalability.com should give you some ideas. I highly recommend the Amazon articles.
Every application is different. You'll have to profile your application to see where you should concentrate your optimization efforts. Some web applications might require database access optimizations, while others have complicated business logic that causes the bottleneck.
Don't attempt to optimize random, arbitrary parts of your application without profiling first. You might end up having to support complicated optimized code that doesn't actually make your application snappier.
I get the sense from the other answers here that there is a general confusion between scalability and performance. High performance means that the response is quick. High scalability means that you get a response no matter how many others are also using the site at the same time. There's a big difference.
In fact, you actually have to sacrifice a little performance just to get good scalability. A general pattern to scalability is distributed computing. Factoring functionality out into separate tiers of clustered servers (web, business rules, database) is the usual approach to scalability. That extra round trip will slow down page load a little bit.
Everyone always wants to focus on high scalability, but also don't forget that, for software vendors who sell licenses to customers who self-host the application, scaling down can be just as important as scaling up. An application that can run on a single server for ten users but can also be configured to run on a ten-server web cluster, a three-server middle tier, and a four-server database cluster for 10,000 users would be a system well designed for scalability.
None. Just code the application using proper design techniques (separation of concerns, etc) and then when the application is done or nearly done, do your performance testing. You'll find the real bottlenecks then - they won't be what you might have guessed in the beginning. This is where your proper design from the beginning comes into play - it makes it easy to make changes to fix the bottlenecks.
Sometimes, a specific answer is more helpful than just generic tips.
If you want to scale, the only things to target are SPEED (in hardware and software) and RESOURCES (in hardware).
Of the two, hardware resources are the expensive part (more servers, load balancers, etc.).
So, by carefully selecting your initial development framework, you can save a lot of time and resources, up to several orders of magnitude.
For example, nginx is (much) faster than Apache.
Other solutions are faster than nginx (for both static and dynamic contents) but I could not disclose them without being censored on StackOverflow (it was rated SPAM & advertising despite the fact that this is a FREE solution).
Those are the limits of "sharing": we must share only "acceptable" solutions rather than efficient ones.
Cheers,
Pierre.

Is Web Service suitable for ETL purpose?

My company is considering using web services as a means of ETL processing. However, I don't think web services fit this purpose, for several reasons:
1. a web service could possibly consume a lot of memory when generating large XML;
2. XML is a bloated format;
3. the request could time out if the server takes a huge amount of time to generate the data;
4. file size limitations? (for Windows, it's 2GB, if my memory serves me right)
I am not a web service expert, so I need your opinions. :)
Thanks.
There are plenty of technologies in the Web Services tool shed that circumvent all the problems you list. There is stream-oriented XML shredding, there are XML compression formats for delivery, there are protocols that deal with fragmentation and fairness, and there are many storage systems that can hold terabytes upon terabytes of data.
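As one concrete example of stream-oriented shredding, a StAX pull parser keeps memory use flat no matter how large the document grows (the file name, "record" element, and "id" attribute below are hypothetical):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingShredDemo {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Pull-parse one event at a time: each record is processed and then
        // becomes garbage, so a multi-gigabyte export never sits in memory.
        try (InputStream in = new FileInputStream("export.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // Hand one record to the transform/load stage here.
                    System.out.println("record id=" + reader.getAttributeValue(null, "id"));
                }
            }
            reader.close();
        }
    }
}
```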
If by web service you imagine some college freshman's homework concoction of an interface that accepts a single glop argument with a 2GB serialized table in it, then all your arguments are valid. But if you give your requirements to an experienced team with knowledge of the concepts involved in WS-ReliableMessaging and WS-Transaction, then there is no reason not to build an ETL process around Web Services. Note that I do not advocate the SOAP protocols per se, but I do advocate knowledge and understanding of the concepts involved.
Now, that being said, whether a Web Service oriented ETL process makes sense for you depends on a whole set of other factors. However, your rebuttal of the Web Service technologies does not hold water.
I would not use a web service for an ETL task. There are specialized tools for that task (e.g., Ab Initio, Informatica, etc.) that are better suited.
If you have a large amount of data, I'd say that the price of the extra latency that the network would introduce would be prohibitive.
It really does depend on what you are doing and how you are trying to accomplish it. In general webservices require more care and feeding than you would normally put into an ETL process, but they can be surprisingly effective at the task as well. I did not get enough specifics for your scenario to say whether it would work.
I have worked on web services which transmit and receive 100+ MB documents, some encoded in XML, some not, and do it in seconds (on a closed local network). These services required a good deal of tuning and planning, but they did work well for our scenario, and they allowed a wide variety of clients to connect and transmit differing amounts of data through a fairly standard interface. This differed from some of the other ETL jobs we had, where the job was specific to each client and had to be set up and maintained for each client.
It all depends on what you are doing and what your constraints are.
If you are going to pursue this route sit down and draft out the process from beginning to end, including how you want clients to connect, verify that the data was received and verify that the job is finished. Consider some of the scenarios, the clients and the types of data being transmitted and then work out what would be needed. Contrast that with what is already available in other tools, and how much time you have to get it done.
I'm really wondering why your company is not considering a real ETL tool like those mentioned by duffymo in his answer, or Talend or CloverETL if open source is an option.
They are in general good for ETL purposes :)
Building your own solution sounds like reinventing the wheel.
Many of them have web-service-oriented features (see "Export a job as webservice" in Talend's wiki or CloverETL Server HTTP Launch Services, for example).
I'm not an ETL product expert and I didn't check them all but I'm pretty sure this is something to consider.
Look up MTOM, to start with, which allows arbitrary non-XML data to be streamed in a web service.
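A minimal JAX-WS sketch of that idea (the service and method names are hypothetical): annotating the endpoint with @MTOM lets binary parameters travel as raw attachment parts instead of base64 text inside the XML envelope.

```java
import javax.activation.DataHandler;
import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.soap.MTOM;

// With MTOM enabled, the DataHandler payload is sent as a binary attachment
// part; the runtime can stream it rather than buffering it all in memory.
@MTOM
@WebService
public class BulkLoadService {

    @WebMethod
    public void upload(DataHandler payload) {
        try (java.io.InputStream in = payload.getInputStream()) {
            // Feed the ETL pipeline from the stream here.
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```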
Web services are just fine for ETL tasks. Remember that each task is going to get handled in its own thread for free, and you're guaranteed proper cleanup between requests. Using web services inside something like Tomcat wouldn't be nearly as heavy as you think.
If you're concerned about the bloat of XML, consider the JSON format.
