Huge data processing/ HPC in Java - suggest me how to begin

Huge data processing/ HPC in Java - suggest me how to begin - java

I am thinking to work on a programming problem for which, I suppose, I will need to know a lot of advanced programming concepts. For some reasons I have decided to code it in Java - even though I am not proficient in it.
So I want you to help me with suggestions, guidance, pointers to resources, books, tutorials or any generic advises that you think is pertinent.
Here is the basic nature of my problem:
I need to create a client-server architecture. Server supports multiple concurrent clients. Clients send it simple instructions (may be server exposes some kind of API/ runs listener on specific port), server executes the instructions and send result back to client.
The main job of the server is to do huge volume of data processing based on the instructions given to it. It takes data from backend database/ file systems. Data volume can easily surge up to ~ 200GB - 700GB. Data will be usually streamed to it, but it may require to hold huge volume of data in memory cache during processing (and if RAM is not enough, then page it to disk). Computations are generally numerically intensive in nature (let's say taking the inverse of a matrix)
The server should be able to do multithreading (I don't know what this term mean in Java, what I wish is, the server should be able to distribute the job in multiple parallel sub-processes.)
The server itself should be very lightweight. I Do NOT need any GUI Interface.
It will be great if I design it in a way so that I can integrate it later with HPC frameworks like Hadoop.
Now if I got to do this, what kind of programming do I need to learn? By the way, I have good understanding on OOP, I am somewhat familiar with Data Structures and algorithms, I know basic Java (never done any network or multithreaded programming in Java before, but have used typical oop concepts, generics, comparable interfaces etc.). I basically work in database programming, but have also done lot of C, C++, C#, Python in the past.
Given the requirement and my background, please suggest,
How should I begin to work on this project? What is the way to architect the project?
Should I create some basic API definitions first and then start working on the details?
Should I follow any particular design pattern? Where to learn them from?
What are the things I need to learn in Java and where to learn them from?
What is the best way to read huge data in memory? Is Java nio good solution?
If I instantiate a class with huge amount of data, would it work? (example, let's say I have a Vector class to represent a matrix with millions of elements and the constructor of the class reads huge data set in the memory). What's the best way to handle that?

You will want to define how the client and server will talk to eachother. The easiest way is to use established protocols such as HTTP by creating REST services that the client can call without much coding.
Most frameworks that support HTTP create several listeners that run in different threads. This gives you multi threading out of the box.
I'd suggest looking into I prefer Spring Controllers. Spring is fairly light weight.
If you want to use these frameworks, you will want to quickly find, and incorporate them into your application for compilation and packaging.
I would suggest looking into Maven for this. It's a big time saver. In particular using archetypes to create your project's folder structure, and auto download dependencies, and their dependencies.
Finally my words of wisdom. Ensure your services are singleton stateless services. This means you only create the objects once, and each thread uses the same objects. There is lots less garbage collection happening. This makes a huge difference when processing large amounts of requests.
Be careful not to use class level variables to hold state, in these services. If you do, different threads will over write each others data.

First thing I would like to say that as per your explanation of the things you seem to be in a pretty good shape to use java as your server side language.
The kind of client server architecture you choose may depend on what kind of clients actually you are serving to. Would they be typical GUI or CUI based desktop clients or the web clients.
In the latter case you could use Spring Framework in a normal fashion and for the former one you could go further to explore Spring's support for Restful Web services. I would advise not to go with socket or TCP based networking solutions or use java networking.
Spring's RESTful API gives you a very cool abstraction over things like networking and multi threading even for a desktop based client. In case of a desktop client you can use JSON/XML as response and can use HttpClient library for making calls to server, which is a very cool abstraction of the underlying networking stuff.
Further up Spring's design patterns follow a very linear flow of data. A lot of your fundamental design considerations are catered by the Spring itself using Dependency Injection and Inversion of Control which are extremely simple to incorporate.
For a detailed analysis of design patterns related to specific requirements I would suggest you to read the book called Java Design Patterns: A Tutorial of Addison Wesley publications and the author is James W. Cooper.
One more thing about the API design. It would be preferable for you to first create a API specification and then go further to implement them.

Related

How can microservices communicate with each other?

I want to transition my currently monolithic application to a a microservices model. I'm currently stuck on figuring out the best way for independent microservices to communicate with each other. Each independent microservice could theoretically be written in a different language, so the communication protocol has to be language independent. Additionally, files don't work since the services may run a separate machines, or even on separate continents. Here are my current ideas and issues with them:
Use a DB/message broker like Redis: this relies on a central DB, so if it goes down the entire application goes down. Not ideal.
Use websockets: the overhead? How do I handle authentication?
Expose a HTTP(S) API: same downsides as websockets
Some other protocol that's designed for end users like RSS: This seems like a bad idea.
Any additional thoughts on the 4 methods I stated? Which one is the best? If possible, could I also be given some examples on how to best implement them? Java is the language that the first microservices will be written in. However, remember that it has to work on most other languages, NOT ONLY Java! If there's a better way to achieve my goals, please state it and elaborate on how it works.

Python requests vs java webservices

I have a legacy web application sys-1 written in cgi that currently uses a TCP socket connection to communicated with another system sys-2. Sys-1 sends out the data in the form a unix string. Now sys-2 is upgrading to java web service which in turn requires us to upgrade. Is there any way to upgrade involving minimal changes to the existing legacy code. I am contemplating the creating of a code block which gets the output of Sys-1 and changes it into a format required by Sys-2 and vice versa.
While researching, I found two ways of doing this:
By using the "requests" library in python.
Go with the java webservices.
I am new to Java web services and have some knowledge in python. Can anyone advise if this method works and which is a better way to opt from a performance and maintenance point of view? Any new suggestions are also welcome!

Is there any way to upgrade involving minimal changes to the existing legacy code.
The solution mentioned, adding a conversion layer outside of the application, would have the least impact on the existing code base (in that it does not change the existing code base).
Can anyone advise if this method works
Would writing a Legacy-System-2 to Modern-System-2 converter work? Yes. You could write this in any language you feel comfortable in. Web Services are Web Services, it matters not what they are implemented in. Same with TCP sockets.
better way to opt from a performance
How important is performance? If this is used once in a blue moon then who cares. Adding a box between services will make the communication between services slower. If implemented well and running close to either System 1 or System 2 likely not much slower.
maintenance point of view?
Adding additional infrastructure adds complexity thus more problems with maintenance. It also adds a new chunk of code to maintain, and if System 1 needs to use System 2 in a new way you have two lots of code to maintain (Legacy System 1 and Legacy/Modern converter).
Any new suggestions are also welcome!
How bad is legacy? Could you rip the System-1-to-System-2 code out into some nice interfaces that you could update to use Modern System 2 without too much pain? Long term this would have a lower overall cost, but would have a (potentially significantly) larger upfront cost. So you have to make a decision on what, for your organisation, is more important. Time To Market or Long Term Maintenance. No one but your organisation can answer that.

Maybe you could add a man-in-the-middle. A socket server who gets the unix strings, parse them into a sys-2 type of message and send it to sys-2. That could be an option to not re-write all calls between the two systems.

Java TCP Server-Client Design Solution

I'm in the process of developing a highly object-oriented solution, (i.e., I want as little coupling as possible, lots of reusability and modular code, good use of design patterns, clean code, etc). I am currently implementing the client-server aspect of the application, and I am new to it. I know how to use Sockets, and how to send streams and receive them. However, I am unsure of actually how to design my solution.
What patterns (if any) are there for TCP Java solutions? I will be sending lots of serialized objects over the network, how do I handle the different requests/objects? In fact, how do I handle a request itself? Do I wrap each object I'm sending inside another object, and then when the object arrives I parse it for a 'command/request', then handle the object contained within accordingly? It is this general design that I am struggling with.
All the tutorials online just seem to be bog-standard, echo servers, that send back the text the client sent. These are only useful when learning about actual sockets, but aren't useful when applying to a real situation. Lots of case statements and if statements just seems poor development. Any ideas? I'd much rather not use a framework at this stage.
Cheers,
Tim.

Consider using a higher level protocol then TCP/IP, don't reinvent the wheel. rmi is a good option and you should be able to find good tutorials on it.

I suggest you either use RMI, or look at it in details so you can determine how you would do things differently. At a minimum I suggest you play with RMI to see how it works before attempting to do it yourself.

If high performance and low latency aren't main requirements then just use existing solutions.
And if you decide to use rmi than consider using J2EE with EJB - it'll provide you a transaction management on top of rmi.
Otherwise if you need extremely low latency take a look on sources of existing solutions that use custom protocols on top of tcp.
For example OpenChord sends serialized Request and Response objects and Project Voldemort uses custom messages for its few operations.

Feedback on different backends for GWT

I have to re-design an existing application which uses Pylons (Python) on the backend and GWT on the frontend.
In the course of this re-design I can also change the backend system.
I tried to read up on the advantages and disadvantages of various backend systems (Java, Python, etc) but I would be thankful for some feedback from the community.
Existing application:
The existing application was developed with GWT 1.5 (runs now on 2.1) and is a multi-host-page setup.
The Pylons MVC framework defines a set of controllers/host pages in which GWT widgets are embedded ("classical website").
Data is stored in a MySQL database and accessed by the backend with SQLAlchemy/Elixir. Server/client communication is done with RequestBuilder (JSON).
The application is not a typical business like application with complex CRUD functionality (transactions, locking, etc) or sophisticated permission system (tough a simple ACL is required).
The application is used for visualization (charts, tables) of scientific data. The client interface is primarily used to display data in read-only mode. There might be some CRUD functionality but it's not the main aspect of the app.
Only a subset of the scientific data is going to be transfered to the client interface but this subset is generated out of large datasets.
The existing backend uses numpy/scipy to read data from db/files, create matrices and filter them.
The numbers of users accessing or using the app is relatively small, but the burden on the backend for each user/request is pretty high because it has to read and filter large datasets.
Requirements for the new system:
I want to move away from the multi-host-page setup to the MVP architecture (one single host page).
So the backend only serves one host page and acts as data source for AJAX calls.
Data will be still stored in a relational database (PostgreSQL instead of MySQL).
There will be a simple ACL (defines who can see what kind of data) and maybe some CRUD functionality (but it's not a priority).
The size of the datasets is going to increase, so the burden on the backend is probably going to be higher. There won't be many concurrent requests but the few ones have to be handled by the backend quickly. Hardware (RAM and CPU) for the backend server is not an issue.
Possible backend solutions:
Python (SQLAlchemy, Pylons or Django):
Advantages:
Rapid prototyping.
Re-Use of parts of the existing application
Numpy/Scipy for handling large datasets.
Disadvantages:
Weakly typed language -> debugging can be painful
Server/Client communication (JSON parsing or using 3rd party libraries).
Python GIL -> scaling with concurrent requests ?
Server language (python) <> client language (java)
Java (Hibernate/JPA, Spring, etc)
Advantages:
One language for both client and server (Java)
"Easier" to debug.
Server/Client communication (RequestFactory, RPC) easer to implement.
Performance, multi-threading, etc
Object graph can be transfered (RequestFactory).
CRUD "easy" to implement
Multitear architecture (features)
Disadvantages:
Multitear architecture (complexity,requires a lot of configuration)
Handling of arrays/matrices (not sure if there is a pendant to numpy/scipy in java).
Not all features of the Java web application layers/frameworks used (overkill?).
I didn't mention any other backend systems (RoR, etc) because I think these two systems are the most viable ones for my use case.
To be honest I am not new to Java but relatively new to Java web application frameworks. I know my way around Pylons though in the new setup not much of the Pylons features (MVC, templates) will be used because it probably only serves as AJAX backend.
If I go with a Java backend I have to decide whether to do a RESTful service (and clearly separate client from server) or use RequestFactory (tighter coupling). There is no specific requirement for "RESTfulness". In case of a Python backend I would probably go with a RESTful backend (as I have to take care of client/server communication anyways).
Although mainly scientific data is going to be displayed (not part of any Domain Object Graph) also related metadata is going to be displayed on the client (this would favor RequestFactory).
In case of python I can re-use code which was used for loading and filtering of the scientific data.
In case of Java I would have to re-implement this part.
Both backend-systems have its advantages and disadvantages.
I would be thankful for any further feedback.
Maybe somebody has experience with both backend and/or with that use case.
thanks in advance

We had the same dilemma in the past.
I was involved in designing and building a system that had a GWT frontend and Java (Spring, Hibernate) backend. Some of our other (related) systems were built in Python and Ruby, so the expertise was there, and a question just like yours came up.
We decided on Java mainly so we could use a single language for the entire stack. Since the same people worked on both the client and server side, working in a single language reduced the need to context-switch when moving from client to server code (e.g. when debugging). In hindsight I feel that we were proven right and that that was a good decision.
We used RPC, which as you mentioned yourself definitely eased the implementation of c/s communication. I can't say that I liked it much though. REST + JSON feels more right, and at the very least creates better decoupling between server and client. I guess you'll have to decide based on whether you expect you might need to re-implement either client or server independently in the future. If that's unlikely, I'd go with the KISS principle and thus with RPC which keeps it simple in this specific case.
Regarding the disadvantages for Java that you mention, I tend to agree on the principle (I prefer RoR myself), but not on the details. The multitier and configuration architecture isn't really a problem IMO - Spring and Hibernate are simple enough nowadays. IMO the advantage of using Java across client and server in this project trumps the relative ease of using python, plus you'll be introducing complexities in the interface (i.e. by doing REST vs the native RPC).
I can't comment on Numpy/Scipy and any Java alternatives. I've no experience there.

Java Backend and Rails Frontend

I have a startup considering building a Java backend and a Rails frontend. The Java backend will take care of creating a caching layer for the database and offer other additional services. The Rails frontend will mostly be for creating the webapp and monitoring tools.
What startups/companies out there are using this kind of setup? What are some gotchas in terms of development speed, deployment, scalability, and integration?
(What would be helpful for me is personal experience or informal case studies. I'd like to de-priorities answers addressing alternatives like Grails or JRuby unless it turns out to be a big part of the equation)
Thanks!

I've never done any rails development but here are my thoughts. Why not just use Grails? I've done a fair amount of Grails development and it works well for rapid prototyping. It also offers all the power of Java, Spring, and Hibernate. Instead of dealing with communication between two different technologies you could take advantage of the fact that Grails uses Spring and Hibernate under the covers to deal with caching as well as any other requirements these technologies support. If you have Java developers Grails should not be to hard to pick up. The grails plugin story is decent. All of them are stored in a central place and are easy to obtain but quality differs depending on the plugin author. You also have to remember that since Grails uses Groovy which is syntactically similar to Java and runs on the JVM it is very easy to use existing java code including the multitude of available Java libraries. I don't know this for certain but I assume there are a lot more libraries out there for use with Java and there for with Grails then are available for the Ruby language and Rails. I can't make a resource usage call for you but my question would be how many people do you have with experience designing Rails systems that use Java on the back end with all the gotchas that will go along with that? You may find that you have no one who is proficient at debugging when something goes wrong in communication between the two technologies.

I have used similar pair for one of my projects, but instead of RoR I used Python. I don't think there's large difference.
In general, there's nothing specific in this kind of programming. Two most important things you must care about:
good modularity;
well thought-out protocol between RoR and Java.
First is about exact functionality decomposition between parts of the system. We've got some troubles because of not understanding what part of a job must be done by Java, and what part by Python. In general, you must tie together all functions that are close to each other, and thing that are far must be connected in a very few places. I guess you know rules of good modularity, but in case of composing different languages this must be thought-out much more carefully. You can also be interested in creating several distinct Java services (e.g. one for database caching and another one for all the rest) to be able freely combine them or even use in other projects later.
Second is about communication between your parts. I can see two ways to communicate: through database and through pure network protocol. The former need some network communication anyway, so we used pure network protocol, without any other way to connect parts.
We had experimented a lot with SOAP, but it gave hundreds of errors: this protocol is quite ok to connect services written in one language (i.d. Java to Java), but terrible for connecting services in different languages - automatic tools for generating WSDLs gave different results for Java and Python, and manual creating of scheme was hard and labor-intensive.
So we used to REST. It uses all the features of HTTP protocol such as all four main HTTP methods (POST, GET, PUT, DELETE), error codes and many other things, so it covers almost everything you may want. The only restriction with REST is that it can't hold state, so you may need to implement your own sessions mechanism.
If you're not very comfortable with REST and seek some real example, see Facebook Graph API and for implementing REST services in Java you can use Restlets.

Perhaps I'm stating the obvious here, but the biggest gotcha here is the actual fact you're combining two technologies. Not only will development be slower - Rails and all Java frameworks are expecting to have a full Ruby/Java application - but for the other issues you mention (deployment, scalability, integration), the existing tools and solutions will not work, or at least not very well.
If you have a compelling reason to combine the two, go ahead but expect to spend more time on these issues than with a single technology solution. If you don't, pick either Rails or Java.

I would strongly urge you to consider having the frontend and backend in the same JVM for performance reasons.
For Rails, JRuby can run Rails, and you can have it in a Java EE container providing fast communication between components.
Having multiple architechtures which do not run in the same process requires you to do network based communication which takes much, much longer than just passing a reference to an object (I do not expect you to want to use shared memory).

Twitter.com is supposedly using a Java backend, and Rails for their web interface. Here are some resources:
http://www.radicalbehavior.com/5-question-interview-with-twitter-developer-alex-payne/
http://blog.adsdevshop.com/2008/05/02/twitter-is-not-abandoning-rails/
http://twitter.com/#!/ev/status/801530348
More specifically, they use Scala on the backend (which compiles to Java byte code):
http://www.artima.com/scalazine/articles/twitter_on_scala.html
I don't think there is anything wrong with this approach. However, you will be supporting two different platforms/VMs. I have seen this kind of situtation result in a sort of fracturing of expertise within an organization. And you end up paying to a lot to have two sets of expertise in house.
If the problem of segmenting expertise concerns you, I would suggest running JRuby of Rails. It is very mature now - and runs better than Rails on Ruby 1.8.x.

We have used Ruby on Rails for front-end and Core-Java SOAP Api for backend. It worked very well.
The main issue is not with performance but with the people. It is very difficult to hire expert people in Ruby on Rails for everything, instead you hire few or you can easily mould to just learn Rails UI.
As startup point of view, it is best strategy. So many believe ROR is best for startups... but very few knew it because in most of MNC's we use Java, so there is pressure to learn Rails and Code, or hire very few ROR developers. So instead you can use ROR for frontend (only few people do that, or it is easy to learn) and use experienced Java developers for DB updates and business logic.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.