JVM instructions for executing a set of algorithum

JVM instructions for executing a set of algorithum - java

This more of a concept question. Wondering if any one of you have come across any way to capture instructions passed by JVM to OS while executing a set of algorithm.
The thing is if we take say performance of an application, it has way too many variables such as # of process, # of CPU's, speed of CPU, available memory etc. etc.
What I am looking for is some way to abstract all those dependencies out, so basically it all boils down to the number of instructions passed by JVM and depending on other variables those instructions can be executed faster or slower.
Is there any way one can hook in such code may be native and get those information ?
I know its really abstract but not sure if I can put that in any simpler form.
Thanks

You can't really do that.
Once upon a time you could count clock cycles and calculate the speed of a given (small) piece of code (I distinctly remember the millisecond duration of a given x86 assembler command being available in the reference I used, it was only valid for a Intel 8086 CPU, however).
But modern hardware does so many kinds of optimizations (not to speak of software optimizations) that there is no easy way to "abstract away" those things. Even the relations might be different (a calculation might be 10 times faster than memory access in architecture #1 and 20 times faster in architecture #2). The different CPU cache level sizes alone can have a huge influence in the actual speed of a piece of code.
But if you find a way, be sure to patent it and hide it away. You should be able to make a lot of money with that.

Related

Is there any way to know JVM speed while executing byte Code?

i came to know that java applications performance is also based on speed of the JVM that executing the byte code.So, I would like to know JVM speed while executing byte Code.Is this possible?

The JVM speed varies as it runs, i.e. it optimises the code it executes more the more often it is run.
You can write a micro-benchmark which you can measure and compare with other system.
Perhaps you could clarify why you need to know this?

first of all beware while reading some Java performance related material , you may find black, white, grey , depending from the creation date , the JVM used and so on...
Don't try to deliver overkilled applications , performance should remain a question of logic and should not induce to have a non human understandable code...
What do you mean with JVM speed ? JVM speed depends from many parameters:
* size of byte code
* performance of CPu used
* tuning of the OS and the JVM
* code you write
The main Java advantage remains portability (WORA acronym) so trying to write code behaving in different ways following one 'speed' parameter would be the worth thing to do ....
You may access to different of those parameters (JVM version, CPU , memory and so on) but to do what ? I totally agree with Peter Lawrey in this point....
I guess that you are a Java newcomer and you try to learn quickly , very good..
But try to put things in order....
Starting with writing code that works in a clear, robust and efficient way , easy to maintain is a very good starting point (life's work ????)
HTH
My 2 cents
Jerome

Best to use a profiler to work with,
Some reasons why your machine (probably jvm) works faster is due to a different power savings system your machine employs, eg, no bluetooth, wifi etc. However, this is disputable.
If you use Linux/Unix or any gnu tools, use the 'time' command, eg, time java classname to get the exact time it takes to execute the process.
But from my experience, I feel that I was more alert / productive working out of my office, hence seeing my laptop perform faster. perhaps its physiological.

How to Optimize JVM & GC through Load Testing

Edit: Of the several extremely generous and helpful responses this question has already received, it is obvious to me that I didn't make an important part of this question clear when I asked it earlier this morning. The answers I've received so far are more about optimizing applications & removing bottlenecks at the code level. I am aware that this is way more important than trying to get an extra 3- or 5% out of your JVM!
This question assumes we've already done just about everything we could to optimize our application architecture at the code level. Now we want more, and the next place to look is at the JVM level and garbage collection; I've changed the question title accordingly. Thanks again!
We've got a "pipeline" style backend architecture where messages pass from one component to the next, with each component performing different processes at each step of the way.
Components live inside of WAR files deployed on Tomcat servers. Altogether we have about 20 components in the pipeline, living on 5 different Tomcat servers (I didn't choose the architecture or the distribution of WARs for each server). We use Apache Camel to create all the routes between the components, effectively forming the "connective tissue" of the pipeline.
I've been asked to optimize the GC and general performance of each server running a JVM (5 in all). I've spent several days now reading up on GC and performance tuning, and have a pretty good handle on what each of the different JVM options do, how the heap is organized, and how most of the options affect the overall performance of the JVM.
My thinking is that the best way to optimize each JVM is not to optimize it as a standalone. I "feel" (that's about as far as I can justify it!) that trying to optimize each JVM locally without considering how it will interact with the other JVMs on other servers (both upstream and downstream) will not produce a globally-optimized solution.
To me it makes sense to optimize the entire pipeline as a whole. So my first question is: does SO agree, and if not, why?
To do this, I was thinking about creating a LoadTester that would generate input and feed it to the first endpoint in the pipeline. This LoadTester might also have a separate "Monitor Thread" that would check the last endpoint for throughput. I could then do all sorts of processing where we check for average end-to-end travel time for messages, maximum throughput before faulting, etc.
The LoadTester would generate the same pattern of input messages over and over again. The variable in this experiment would be the JVM options passed to each Tomcat server's startup options. I have a list of about 20 different options I'd like to pass the JVMs, and figured I could just keep tweaking their values until I found near-optimal performance.
This may not be the absolute best way to do this, but it's the best way I could design with what time I've been given for this project (about a week).
Second question: what does SO think about this setup? How would SO create an "optimizing solution" any differently?
Last but not least, I'm curious as to what sort of metrics I could use as a basis of measure and comparison. I can really only think of:
Find the JVM option config that produces the fastest average end-to-end travel time for messages
Find the JVM option config that produces the largest volume throughput without crashing any of the servers
Any others? Any reasons why those 2 are bad?
After reviewing the play I could see how this might be construed as a monolithic question, but really what I'm asking is how SO would optimize JVMs running along a pipeline, and to feel free to cut-and-dice my solution however you like it.
Thanks in advance!

Let me go up a level and say I did something similar in a large C app many years ago.
It consisted of a number of processes exchanging messages across interconnected hardware.
I came up with a two-step approach.
Step 1. Within each process, I used this technique to get rid of any wasteful activities.
That took a few days of sampling, revising code, and repeating.
The idea is there is a chain, and the first thing to do is remove inefficiences from the links.
Step 2. This part is laborious but effective: Generate time-stamped logs of message traffic.
Merge them together into a common timeline.
Look carefully at specific message sequences.
What you're looking for is
Was the message necessary, or was it a retransmission resulting from a timeout or other avoidable reason?
When was the message sent, received, and acted upon? If there is a significant delay between being received and acted upon, what is the reason for that delay? Was it just a matter of being "in line" behind another process that was doing I/O, for example? Could it have been fixed with different process priorities?
This activity took me about a day to generate logs, combine them, find a speedup opportunity, and revise code.
At this rate, after about 10 working days, I had found/fixed a number of problems, and improved the speed dramatically.
What is common about these two steps is I'm not measuring or trying to get "statistics".
If something is spending too much time, that very fact exposes it to a dilligent programmer taking a close meticulous look at what is happening.

I would start with finding the optimum recommended jvm values specified for your hardware/software mix OR just start with what is already out there.
Next I would make sure that I have monitoring in place to measure Business throughputs and SLAs
I would not try to tweak just the GC if there is no reason to.
First you will need to find what are the major bottlenecks in your application. If it is I/O bound, SQL bound etc.
Key here is to MEASURE, IDENTIFY TOP bottlenecks, FIX them and conduct another iteration with a repeatable load.
HTH...

The biggest trick I am aware of when running multiple JVMs on the same machine is limiting the number of core the GC will use. Otherwise what can happen when one JVM does a full GC is it will attempt to grab every core, impacting the performance of all the JVMs even though they are not performing a GC. One suggestion is to limit the number of gc threads to 5/8 or less. (I can't remember where it is written)
I think you should test the system as a whole to ensure you have realistic interaction between the services. However, I would assume you may need to tune each service differently.
Changing command line options is useful if you cannot change the code. However if you profile and optimise the code you can make far for difference than tuning the GC parameters (in which cause you need to change them again)
For this reason, I would only change the command line parameters as a last resort, after you there is little improvement which can be made in the code of the application.

Detecting and pinpointing performance regressions

Are there any known techniques (and resources related to them, like research papers or blog entries) which describe how do dynamically programatically detect the part of the code that caused a performance regression, and if possible, on the JVM or some other virtual machine environment (where techniques such as instrumentation can be applied relatively easy)?
In particular, when having a large codebase and a bigger number of committers to a project (like, for example, an OS, language or some framework), it is sometimes hard to find out the change that caused a performance regression. A paper such as this one goes a long way in describing how to detect performance regressions (e.g. in a certain snippet of code), but not how to dynamically find the piece of the code in the project that got changed by some commit and caused the performance regression.
I was thinking that this might be done by instrumenting pieces of the program to detect the exact method which causes the regression, or at least narrowing the range of possible causes of the performance regression.
Does anyone know about anything written about this, or any project using such performance regression detection techniques?
EDIT:
I was referring to something along these lines, but doing further analysis into the codebase itself.

Perhaps not entirely what you are asking, but on a project I've worked on with extreme performance requirements, we wrote performance tests using our unit testing framework, and glued them into our continuous integration environment.
This meant that every check-in, our CI server would run tests that validated we hadn't slowed down the functionality beyond our acceptable boundaries.
It wasn't perfect - but it did allow us to keep an eye on our key performance statistics over time, and it caught check-ins that affected the performance.
Defining "acceptable boundaries" for performance is more an art than a science - in our CI-driven tests, we took a fairly simple approach, based on the hardware specification; we would fail the build if the performance tests exceeded a response time of more than 1 second with 100 concurrent users. This caught a bunch of lowhanging fruit performance issues, and gave us a decent level of confidence on "production" hardware.
We explicitly didn't run these tests before check-in, as that would slow down the development cycle - forcing a developer to run through fairly long-running tests before checking in encourages them not to check in too often. We also weren't confident we'd get meaningful results without deploying to known hardware.

With tools like YourKit you can take a snapshot of the performance breakdown of a test or application. If you run the application again, you can compare performance breakdowns to find differences.
Performance profiling is more of an art than a science. I don't believe you will find a tool which tells you exactly what the problem is, you have to use your judgement.
For example, say you have a method which is taking much longer than it used to do. Is it because the method has changed or because it is being called a different way, or much more often. You have to use some judgement of your own.

JProfiler allows you to see list of instrumented methods which you can sort by average execution time, inherent time, number of invocations etc. I think if this information is saved over releases one can get some insight into regression. Offcourse the profiling data will not be accurate if the tests are not exactly same.

Some people are aware of a technique for finding (as opposed to measuring) the cause of excess time being taken.
It's simple, but it's very effective.
Essentially it is this:
If the code is slow it's because it's spending some fraction F (like 20%, 50%, or 90%) of its time doing something X unnecessary, in the sense that if you knew what it was, you'd blow it away, and save that fraction of time.
During the general time it's being slow, at any random nanosecond the probability that it's doing X is F.
So just drop in on it, a few times, and ask it what it's doing.
And ask it why it's doing it.
Typical apps are spending nearly all their time either waiting for some I/O to complete, or some library function to return.
If there is something in your program taking too much time (and there is), it is almost certainly one or a few function calls, that you will find on the call stack, being done for lousy reasons.
Here's more on that subject.

Performance in Java through code? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
First of all I should mention that I'm aware of the fact that performance optimizations can be very project specific. I'm mostly not facing these special issues right now. I'm facing a bunch of performance issues with the JVM itself.
I wonder now:
which code-optimization make sense
from a compiler perspective: for
example to support the garbage
collector I declared variables as
final - very much following PMD's
suggestions here from Eclipse.
what best practices there are for: vmargs,
heap and other stuff passed to the
JVM for initialization. How do I get
the right values here? Is there any
formula or is it try and error?
Java automates a lot, does many optimization on byte-code level and stuff. However I think most of that must be planed by a developer in order to work.
So how do you speed up your programs in Java? :)

Which code-optimization make sense from a compiler perspective: for example to support the garbage collector I declared variables as final - very much following PMD's suggestions here from Eclipse.
Assuming you are talking about potential micro-optimizations you can make to your code, the answer is pretty much none. The best way to increase your application performance is to run a profiler to figure out where the performance bottlenecks are, then figure out if there is anything you can do to speed them up.
All of the classic tricks like declaring classes, variables and methods final, reorganizing loops, changing primitive types are pretty much a waste of effort in most cases. The JIT compiler can typically do a much better job than you can. For example, recent JIT compilers will analyse all loaded classes to figure out which method calls are not subject to overloading, without you declaring the classes or methods as final. It will then use a quicker call sequence, or even inline the method body.
Indeed, the Sun experts say that some programmer attempts at optimization fail because they actually make it harder for JIT compiler to apply the optimizations it knows about.
On the other hand, higher level algorithmic optimizations are definitely worthwhile ... provided that your profiler tells you that your application is spending a significant amount of time in that area of the code.
Using arrays instead of collections can be a worthwhile optimization in unusual cases, and in rare cases using object pools might be too. But these optimizations 1) will make your code more complicated and bug prone and 2) can slow your application down if used inappropriately. These kinds of optimizations should only be tried as a last resort. For example, if your profiling says that such and such a HashMap<Integer,Integer> is a CPU bottleneck or a memory hog, then it is a better idea to look for an existing specialized Map or Map-like library class than to try and implement the map yourself using arrays. In other words, optimize at the high level.
If you spend long enough or your application is small enough, careful micro-optimization will probably give you a faster application (on a given JVM version / hardware platform) than just relying on the JIT compiler. If you are implementing a smallish application to do large-scale number crunching in Java, the pay-off of micro-optimization may well be considerable. But this is clearly not a typical case! For typical Java applications, the effort is large enough and the performance difference is small enough that micro-optimization is not worthwhile.
(Incidentally, I don't see how declaring a variable can make any possible difference to GC performance. The GC has to trace a variable every time it is encountered whether or not it is final. Besides, it is an open secret that final variables can actually change under certain circumstances, so it would be unsafe for the GC to assume that they don't. Unsafe as in "creates a dangling pointer resulting in a JVM crash".)

I see this a lot. The sequence generally goes:
Thinking performance is about compiler optimizations, big-O, and so on.
Designing software using the recommended ideas, lots of classes, two-way linked lists, trees with pointers up, down, left, and right, hash sets, dictionaries, properties that invoke other properties, event handlers that invoke other event handlers, XML writing, parsing, zipping and unzipping, etc. etc.
Since all those data structures were like O(1) and the compiler's optimizing its guts out, the app should be "efficient", right? Well, then, what's that little voice telling one that the startup is slow, the shutdown is slow, the loading and unloading could be faster, and why is the UI so sluggish?
Hand it off to the "performance expert". With luck, that person finds out, all this stuff is done in the recommended way, but that's why it's cranking its heart out. It's doing all that stuff because it's the recommended way to do things, not because it's needed.
With luck, one has the chance to re-engineer some of that stuff, to make it simple, and gradually remove the "bottlenecks". I say, "with luck" because often it's just not possible, so development relies on the next generation of faster processors to take away the pain.
This happens in every language, but moreso in Java, C#, C++, where abstraction has been carried to extremes. So by all means, be aware of best practices, but also understand what simple software is. Typically it consists of saving those best practices for the circumstances that really need them.

which code-optimization make sense
from a compiler perspective?
All the ones that a compiler can't reason about, because a compiler is very dumb and Java doesn't have "design by contract" (which, hence, cannot help the dumb compiler reason about your code).
For example if you're crunching data and using use int[] or long[] arrays, you may know something about your data that is IMPOSSIBLE for the compiler to figure out and you may use low-level bit-packing/compacting to improve the locality of reference in that part of your code.
Been there, done that, saw gigantic speedup. So much for the "super smart compiler".
This is just one example. There are a huge number of cases like this.
Remember that a compiler is really stupid: it cannot know that if ( Math.abs(42) > 0 ) will always return true.
This should give some food for thoughts to people that think that those compilers are "smart" (things would be different here if Java had DbC, but it doesn't).
what best practices there are for:
vmargs, heap and other stuff passed to
the JVM for initialization. How do I
get the right values here? Is there
any formula or is it try and error?
The real answer is: there shouldn't be. Sadly the situation is so pathetic that such low-level hackery is needed, due to serious failure on Java's part. Oh, one more "tiny" detail: playing with VM fine-tuning only works for server-side app. It doesn't work for desktop apps.
Anyone who has worked on Java desktop applications installed on hundreds or thousands of machines, on various OSes knows all too well what the issue is: full GC pauses making your app look like it's broken. The Apple VM on OS X 10.4 comes to mind for it's particularly afwul, but ALL the JVMs are subject to that issue.
What is worse: it is impossible to "fine tune" the GC's parameters across different OSes / VMs / memory configuration when your application is going to be run on hundreds/thousands of different configuration.
Anyone disputing that: please tell me how you "fine tune" your app knowing that it is going to be run both on octo-cores Mac loaded with 20 GB of ram (I've got users with such setups) and old OS X 10.4 PowerBook that have 768 MB of ram. Please?
But it is not bad: you should not, in the first place, have to be concerned with super-low-level detail like GC "fine tuning". The very fact that this is hinted to is a testimony to one area where Java has a major issue.
Java fans will keep on saying "the GC is super fast, object creation is cheap" while this is blatantly wrong. There's a reason with Trove' TIntIntHashMap runs around circles an HashMap<Integer,Integer>.
There's also a reason why at every new JVM release you'll get countless release notes explaining why -XXGCHyperSteroidMultiTopNotch offers better performance than the last "big JVM param" that every cool Java programmer had to know: maybe the JVM wasn't that great at GC'ing after all.
So to answer your question: how do you speed up Java programs? Easy, do like what the Trove guys did: stop needlessly creating gigantic amount of objects and stop needlessly auto(un)boxing primitives because they will kill your app's perfs.
A TIntIntHashMap OWNS the default HashMap<Integer,Integer> for a reason: for the same reason my apps are now much faster than before.
I stopped believing in crap like "object creation costs nothing" and "the GC is super-optimized, don't worry about it".
I'm using Java to crunch data (I know, I'm a bit crazy) and the one thing that made my app faster was to stop believing all the propaganda surrounding the "cheap object creation" and "amazingly fast GC".
The truth is: INSTEAD OF TRYING TO FINE-TUNE YOUR GC SETTINGS, STOP CREATING THAT MUCH GARBAGE IN THE FIRST PLACE. This can be stated this way: if changing the GC settings radically changes the way your app run, it may be time to wonder if all the needless junk objects your creating are really needed.
Oh, you know what, I'm betting we'll see more and more release notes explaining why Java version x.y.z's GC is faster than version x.y.z-1's GC ;)

Generally there are two kinds of performance optimizations you need to do with Java:
Algorithmic optimization. Choose an algorithm which behaves like you need to. For instance, a simple algorithm may perform best for small datasets, but the overhead of preparing a smarter algorithm may first pay off for much larger datasets.
Bottleneck identification. Here you need to be familiar with a profiler that can tell you what the problem is (humans always guess wrong) - memory leak?, slow method? etc... A good one to start with is VisualVM which can attach to a running program, and is available in the latest Sun JDK. When you know the problem, you can fix it.

Todays JVM's are surprisingly robust when it comes to performance. Any microoptimizations you can apply will, in practically all cases, have only very minor impact on performance. This is easy to understand if you take a look on how typical language constructs (e.g. FOR vs WHILE) translate to bytecode - they are almost indistinguishable.
Making methods/variables final has absolutely no impact on performance on a decent JIT'd JVM. The JIT will keep track of which methods are really polymorphic and optimize away the dynamic dispatch where possible. Static methods can still be faster, since they don't have a this-reference = one less local variable (which at the same time, limits their application). Most efficient micro optimizations are not so much Java specific, for example code with lots of conditional statements can become very slow due to branch mispredictions by the processor. Sometimes conditionals can be replaced by other, sequential code flow constructs (often at the cost of readability), reducing the number of mispredicted branches (and this applies to all languages that somehow compile to native code).
Note that profilers tend to inflate the time spent in short, frequently called methods. This is due to the fact that profilers need to instrument the code to keep track of invocations - this can interfere with the JIT's ability to inline those methods (and the instrumentation overhead becomes significantly larger than the time spent actually executing the methods body). Manual inlining, while apparently very performance boosting under a profiler has in most cases no effect under "real world" conditions. Don't rely purely on the profilers results, verify that optimizations you make have real impact under real runtime conditions, too.
Notable performance boosts can only be expected from changes that reduce the amount of work done, more cache friendly data layout or superior algorytms. Java partially limits your possibilities for cache friendly data layouts, since you have no control where the parts (arrays/objects) that form your data structure will be located in memory in relation to each other. Still, there are plenty of opportunities where choosing the right data structure for the job can make a huge difference (e.g. ArrayList vs LinkedList).
There is little you can do to aid the garbage collector. However, a point worth noting is, while object allocation in Java is very very fast, there is still the cost of object initialization (which is mostly under your control). Poor performance of applications that creating lots of (short lived) objects is more likely to be attributed to poor cache utilization than to the garbage collectors work.
Different applications types require different optimization strategies - so before asking about specific optimizations, find out where your application really spends its time.

If you are experiencing performance issues with your application, you should seriously consider trying some profiling (eg: hprof) to see whether the problem is algorithmic in nature, and also checking the GC performance logging (eg: -verbose:gc) to see if you could benefit from tuning your JVM GC options.

It is worth noting that the compiler does next to no optimisations, and the JVM doesn't optimise at the byte code level either. Most of the optimisations are performed by the JIT in the JVM and it optmises how the code is converted to native machine code.
The best way to optimise your code is to use a profiler which measures how much time and resources your application is using when you give it a realistic data set. Without this information you are just guessing and you can change alot of code where it really, really doesn't matter and find you have added bugs in the process.
Many come to the conclusion that its never worth optmising you code, even counter productive as it can waste time and introduce bugs and I would say that is true for 95+% of your code. However, with aprofiler you can measure the critical pieces of code and optmise the <5% worth optimising and done carefully, you won't get too many issues from trying to optimise your code.

It's hard to answer this too thoroughly because you haven't even mentioned what sort of project you're talking about. Is it a desktop application? A server-side application?
Desktop applications favor application startup time, so the HotSpot client VM is a good start. Client applications don't necessarily need all of their heap space all the time, so a good balance between starting heap and max heap is useful. (Like, maybe -Xms128m -Xmx512m)
Server applications favor overall throughput, which is something the HotSpot server VM is tuned for. You should always allocate the min and max heap sizes the same on a server application. There is an added cost at the system level to it having to malloc() and free() during garbage collection. Use something like -Xms1024m -Xmx1024m.
There are several different garbage collectors also, which are tuned to different application types.
Take a read through the Java SE 6 Performance White Paper if you want more info on the garbage collector and other performance related items from Java 6.

Which programming language for compute-intensive trading portfolio simulation?

I am building a trading portfolio management system that is responsible for production, optimization, and simulation of non-high frequency trading portfolios (dealing with 1min or 3min bars of data, not tick data).
I plan on employing Amazon web services to take on the entire load of the application.
I have four choices that I am considering as language.
Java
C++
C#
Python
Here is the scope of the extremes of the project scope. This isn't how it will be, maybe ever, but it's within the scope of the requirements:
Weekly simulation of 10,000,000 trading systems.
(Each trading system is expected to have its own data mining methods, including feature selection algorithms which are extremely computationally-expensive. Imagine 500-5000 features using wrappers. These are not run often by any means, but it's still a consideration)
Real-time production of portfolio w/ 100,000 trading strategies
Taking in 1 min or 3 min data from every stock/futures market around the globe (approx 100,000)
Portfolio optimization of portfolios with up to 100,000 strategies. (rather intensive algorithm)
Speed is a concern, but I believe that Java can handle the load.
I just want to make sure that Java CAN handle the above requirements comfortably. I don't want to do the project in C++, but I will if it's required.
The reason C# is on there is because I thought it was a good alternative to Java, even though I don't like Windows at all and would prefer Java if all things are the same.
Python - I've read somethings on PyPy and pyscho that claim python can be optimized with JIT compiling to run at near C-like speeds... That's pretty much the only reason it is on this list, besides that fact that Python is a great language and would probably be the most enjoyable language to code in, which is not a factor at all for this project, but a perk.
To sum up:
real time production
weekly simulations of a large number of systems
weekly/monthly optimizations of portfolios
large numbers of connections to collect data from
There is no dealing with millisecond or even second based trades. The only consideration is if Java can possibly deal with this kind of load when spread out of a necessary amount of EC2 servers.
Thank you guys so much for your wisdom.

Pick the language you are most familiar with. If you know them all equally and speed is a real concern, pick C.

While I am a huge fan of Python and personaly I'm not a great lover of Java, in this case I have to concede that Java is the right way to go.
For many projects Python's performance just isn't a problem, but in your case even minor performance penalties will add up extremely quickly. I know this isn't a real-time simulation, but even for batch processing it's still a factor to take into consideration. If it turns out the load is too big for one virtual server, an implementation that's twice as fast will halve your virtual server costs.
For many projects I'd also argue that Python will allow you to develop a solution faster, but here I'm not sure that would be the case. Java has world-class development tools and top-drawer enterprise grade frameworks for parallell processing and cross-server deployment and while Python has solutions in this area, Java clearly has the edge. You also have architectural options with Java that Python can't match, such as Javaspaces.
I would argue that C and C++ impose too much of a development overhead for a project like this. They're viable inthat if you are very familiar with those languages I'm sure it would be doable, but other than the potential for higher performance, they have nothing else to bring to the table.
C# is just a rewrite of Java. That's not a bad thing if you're a Windows developer and if you prefer Windows I'd use C# rather than Java, but if you don't care about Windows there's no reason to care about C#.

I would pick Java for this task. In terms of RAM, the difference between Java and C++ is that in Java, each Object has an overhead of 8 Bytes (using the Sun 32-bit JVM or the Sun 64-bit JVM with compressed pointers). So if you have millions of objects flying around, this can make a difference. In terms of speed, Java and C++ are almost equal at that scale.
So the more important thing for me is the development time. If you make a mistake in C++, you get a segmentation fault (and sometimes you don't even get that), while in Java you get a nice Exception with a stack trace. I have always preferred this.
In C++ you can have collections of primitive types, which Java hasn't. You would have to use external libraries to get them.
If you have real-time requirements, the Java garbage collector may be a nuisance, since it takes some minutes to collect a 20 GB heap, even on machines with 24 cores. But if you don't create too many temporary objects during runtime, that should be fine, too. It's just that your program can make that garbage collection pause whenever you don't expect it.

Why only one language for your system? If I were you, I will build the entire system in Python, but C or C++ will be used for performance-critical components. In this way, you will have a very flexible and extendable system with fast-enough performance. You can find even tools to generate wrappers automatically (e.g. SWIG, Cython). Python and C/C++/Java/Fortran are not competing each other; they are complementing.

Write it in your preferred language. To me that sounds like python. When you start running the system you can profile it and see where the bottlenecks are. Once you do some basic optimisations if it's still not acceptable you can rewrite portions in C.
A consideration could be writing this in iron python to take advantage of the clr and dlr in .net. Then you can leverage .net 4 and parallel extensions. If anything will give you performance increases it'll be some flavour of threading which .net does extremely well.
Edit:
Just wanted to make this part clear. From the description, it sounds like parallel processing / multithreading is where the majority of the performance gains are going to come from.

It is useful to look at the inner loop of your numerical code. After all you will spend most of your CPU-time inside this loop.
If the inner loop is a matrix operation, then I suggest python and scipy, but of the inner loop if not a matrix operation, then I would worry about python being slow. (Or maybe I would wrap c++ in python using swig or boost::python)
The benefit of python is that it is easy to debug, and you save a lot of time by not having to compile all the time. This is especially useful for a project where you spend a lot of time programming deep internals.

I would go with pypy. If not, http://lolcode.com/.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.