I'm working on a real-time game application. Most of it is written in Java, but I recently decided to experiment with moving some of the initialization procedures into JRuby scripts, to make it as easy as possible for players to modify how the world is generated.
As a start, I decided to move the map generation into a JRuby script. At present, this boils down to the following Java code:
ScriptingContainer container = new ScriptingContainer();
container.put("$data", dataPackage);
container.runScriptlet(PathType.RELATIVE, scriptName);
dataPackage = (BlockMapGenerationDataPackage)container.get("$data");
The data package contains all the information necessary for the Java program to produce the final terrain and render it, and it also contains the necessary data in order for the Ruby script to be able to craft all manner of maps. In particular, it contains a rather large array (currently 1000 x 1000 x 15). To test whether the Ruby script was working, I've stripped out the entire map generation algorithm and put in the following extremely simple test:
require 'java'
Dir["../../dist/*.jar"].each { |jar| require jar }

for i in (0...$data.getWidth())
  for j in (0...$data.getDepth())
    $data.blocks[i][j][0] = Java::BlockMap::BlockType::GRASS
  end
end
This is executed only once, upon initialization. When all of this was implemented in Java, with far more memory-intensive generation algorithms, there were no performance or memory issues of any kind: the game ran smoothly at hundreds of frames per second at very high resolutions on an old laptop with a 1000 x 1000 x 15 map. However, when the Java generation code is replaced by the above JRuby script, the program appears to suffer memory consumption issues: the frame rate drops by about 30-40 fps, and the program freezes for maybe a tenth of a second at an impressively consistent rate of about once every three seconds. Profiling and various tests reveal that the only possible culprit is the Ruby script.
Moreover, if the size of the map is drastically reduced, say to 100 x 100 x 15, then these issues more or less disappear.
I've tried various things like adding container.terminate(); or container.clear(); after the Java code that executes the script, but I really don't understand the source of the issue or how to fix it. I'd greatly appreciate it if someone could explain what's going wrong here and whether or not it can be fixed!
It might be best to make the map creation routine a separate app altogether that chains to the Java app.
I'm pretty sure the memory layout of arrays in JRuby is different, and that could be causing your problems. The map object itself may be created with a memory layout that requires ongoing JRuby interaction whenever it is accessed, or it could be something as simple as creating Integers instead of ints without noticing it because of autoboxing (again, a total guess, since I can't see the data types).
You should at least experiment with the order of the subscripts: [i][j][0] compared to [0][i][j] and [0][j][i].
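If the per-element accesses across the Java/Ruby boundary turn out to be the cost (each assignment in that loop goes through JRuby's Java integration layer and can create wrapper objects), one experiment is to push the tight loops back into Java and expose a coarse-grained helper for the script to call. This is only a sketch: fillLayer is a hypothetical method added to BlockMapGenerationDataPackage, not something that already exists.

// Hypothetical bulk helper on BlockMapGenerationDataPackage: the nested loops
// run as plain Java array writes, so the script makes one cross-language call
// instead of width * depth of them.
public void fillLayer(int layer, BlockType type) {
    for (int i = 0; i < getWidth(); i++) {
        for (int j = 0; j < getDepth(); j++) {
            blocks[i][j][layer] = type;
        }
    }
}

The test script then shrinks to a single call, $data.fillLayer(0, Java::BlockMap::BlockType::GRASS), which also makes it easy to check whether the periodic pauses come from the per-cell accesses or from something else the container is doing.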
I want to measure the performance of a small fragment of my code using Android Studio.
I am writing a small part of my code here to explain my question:
params.rightMargin = (int) (getResources().getDimensionPixelOffset(R.dimen.rightAndLeftMargin));
params.leftMargin = (int) (getResources().getDimensionPixelOffset(R.dimen.rightAndLeftMargin));
Alternatively, these lines can also be written as:
int margin = getResources().getDimensionPixelOffset(R.dimen.rightAndLeftMargin);
params.rightMargin = margin;
params.leftMargin = margin;
So, with the help of the Android Studio IDE, how do I compare the performance (memory usage, CPU load, execution time, etc.) of these two versions of the code?
NOTE: This is not the only case; I have dozens of similar cases, so I would like a general solution that works for all kinds of code.
With the Profiler built into Android Studio you can easily see what methods on what threads are being called in a selected time frame. Personally, I recommend using a Flame Chart to see which operations take the most time.
Keep in mind that having a profiler attached to your app's process slows it down significantly, so if a certain method call appears to take, say, 1 second, in reality it will take considerably less.
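If you just want rough numbers rather than a full profile, a crude manual timing loop is another option. This is only a sketch: it assumes it runs inside your Activity or View (so getResources() and params are in scope) and that the process is warmed up, and micro-benchmarks like this are easily distorted by the JIT and device state.

// Crude micro-benchmark: time many iterations of each variant and compare.
final int iterations = 10000;

long start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    params.rightMargin = getResources().getDimensionPixelOffset(R.dimen.rightAndLeftMargin);
    params.leftMargin  = getResources().getDimensionPixelOffset(R.dimen.rightAndLeftMargin);
}
long twoLookups = System.nanoTime() - start;

start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    int margin = getResources().getDimensionPixelOffset(R.dimen.rightAndLeftMargin);
    params.rightMargin = margin;
    params.leftMargin  = margin;
}
long oneLookup = System.nanoTime() - start;

android.util.Log.d("Timing", "two lookups: " + twoLookups / 1000000
        + " ms, one lookup: " + oneLookup / 1000000 + " ms");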
I don't know Android at all, but as a general rule, when you have a computation to make it is better to make it once and reuse the result. In your example, the latter version is better.
I assume the resources are already in the process's memory, so the footprint here is probably minor: one extra stack frame for getResources and another for getDimensionPixelOffset per call. Creating a local variable is much cheaper than that.
The cost increases significantly if you are performing IO operations, such as accessing files or making HTTP calls. In those cases it is much better to store the result in a local variable and reuse it.
Before I begin, I apologize for the lack of comments in my code. I am currently writing an OBJ file loader (in Java). Although my code works as expected for small files, when files become large (for example, I am currently attempting to load an OBJ file which has 25,958 lines) my entire system crashes. I recently migrated the entire project over from C++, where it could load this model quickly. I used a profiler alongside a debugger to determine where the process crashes my system. I noticed a few things: first, it was hanging during initialization; second, my heap was nearly completely used up (about 90% of the heap).
My code can be found here:
http://pastebin.com/VjN0pzyi
I was curious about methods I could employ to optimize this code.
When you're really low on memory, everything slows down a lot. I guess you should improve your coding skills; things like
startChar = line[i].toCharArray()[k];
probably don't get optimized to
startChar = line[i].charAt(k);
automagically. Maybe interning your strings could save a lot of memory; try String.intern or Guava's Interner.
HotSpot loves short methods, so refactor. The code as it is is hard to read, and I guess that given its size no optimizations get done at all!
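For illustration, a minimal sketch of both suggestions (charAt instead of toCharArray, and interning the repeated keyword tokens); the parsing details and method name are placeholders, not taken from your pastebin:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

static void scanObj(String path) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) continue;
            char startChar = line.charAt(0);              // no throwaway char[] per character
            String keyword = line.split(" ")[0].intern(); // "v", "vt", "f", ... share one instance
            // ... parse the remaining tokens as needed ...
        }
    }
}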
I know this is an old question, but I wanted to throw in my two cents on your performance issues. You're saying that your code not only runs slowly but also takes up 90% of the heap. I think saying 90% is an egregious exaggeration, but it still lets me point out the biggest flaw with Java game development: Java does not support value types, such as structs. That means that in order to gain speed you have to avoid OOP, because every time your loader instantiates a class the object is allocated on the heap. You must then invariably wait for GC to kick in to get rid of the clutter and leftover instances that your loader created. Now take a language like C# as an example of how to create a real language: C# fully supports structs, and you could replace every class of your loader with them. Face, group, vertex and normal classes would then be treated as value types; they are deleted when the stack unwinds. No garbage is generated, or at least very little if you're required to use a class or two.
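One way to approximate value types in Java, for what it's worth, is to keep the mesh data in flat primitive arrays instead of one object per vertex or face. A rough sketch, with made-up counts:

// Flat primitive arrays instead of one object per vertex/face: a handful of
// allocations in total, and no per-vertex garbage for the collector to chase.
int vertexCount = 100000, faceCount = 200000;     // assumed to be known or grown up front
float[] positions = new float[vertexCount * 3];   // x, y, z packed together
float[] normals   = new float[vertexCount * 3];
int[]   faces     = new int[faceCount * 3];       // three vertex indices per triangle

// storing vertex i
int i = 0;
positions[i * 3]     = 1.0f;  // x
positions[i * 3 + 1] = 2.0f;  // y
positions[i * 3 + 2] = 3.0f;  // z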
In my opinion, don't use Java for game development. I used it for years before discovering C#. Strictly my opinion, here, but Java is a horrible language; I will never use it again.
I have looked high and low for this answer so I resorted to posting here. Is there any expectation of any noticeable latency if I have a Linux C++ program call an R script/function with something like RCpp? Would this be expected or even sound reasonable? What if I use something like RCaller from a Java JAR file? Do you think C++ is still faster than Java if it is calling the same R script/function? Any expected differences?
Thanks
I think you want RInside which makes it very easy to embed R in your C++ application. It ships with numerous examples in four directories, including some to use it with Qt, Wt (for webapps) and MPI.
In that framework, you instantiate R once at startup and then have your own instance. Round-trip latency will be whatever time it takes you to send a command to the R instance, plus however long R takes (which may well dominate) plus the return.
RInside uses Rcpp so you get whole object transfer and all the other niceties. Have a look at the RInside example, and post questions on the rcpp-devel list.
I don't have special knowledge of the R foreign function interface or Rcpp, but I have worked with quite a few others. Your questions can't be answered with certainty; there are only some rules of thumb.
The job of an FFI is to satisfy the assumptions of both the calling and called environments. This is usually about matching the data layouts of both languages by copying from one to the other (which is what Rcpp is all about), or you can be very lucky and have them match already. If the runtime environments are similar, or the data moved across the language boundary is small, this can be very efficient: not much more costly than an ordinary function call. Calling C from Fortran is often very fast for this reason.
If the environments are very different, large data structures must be copied, and copies consume resources: memory and processor cycles. Garbage collection is the poster child for differences between environments: separate collectors will seldom (read: never) cooperate. R and Java (both garbage collected) will probably require copying for this reason. If you are writing the C++ specifically to call R, you may be able to set up your data in Rcpp structures so that no copies are needed.
I'd write some small tests starting with C++ that mimic the amount of data you must move through the interface. Run and time them to get the overhead cost. From this you can make real decisions.
I am rendering a rather heavy object consisting of about 500k triangles. I use an OpenGL display list, and in the render method I just call glCallList. I thought that once the graphics primitives are compiled into a display list the CPU work is done and the call just tells the GPU to draw. But now one CPU core is loaded up to 100%.
Could you give me some clues as to why this happens?
UPDATE: I have checked how long glCallList takes to run; it's fast, about 30 milliseconds.
Most likely you are hitting the limits on list length, which are around 64k vertices per list. Try splitting your 500k triangles (1500k vertices?) into smaller chunks and see what you get.
By the way, which graphics chip are you using? If the vertices are processed on the CPU, that might also be a problem.
It's a bit of a myth that display lists magically offload everything to the GPU. If that was really the case, texture objects and vertex buffers wouldn't have needed to be added to OpenGL. All the display list really is, is a convenient way of replaying a sequence of OpenGL calls and hopefully saving some of the function call/data conversion overhead (see here). None of the PC HW implementations I've used seem to have done anything more than that so far as I can tell; maybe it was different back in the days of SGI workstations, but these days buffer objects are the way to go. (And modern OpenGL books like OpenGL Distilled give glBegin/glEnd etc the briefest of mentions only before getting stuck into the new stuff).
The one place I have seen display lists make a huge difference is the GLX/X11 case where your app is running remotely to your display (X11 "server"); in that case using a display list really does push all the display-list state to the display side just once, whereas a non-display-list immediate-mode app needs to send a bunch of stuff again each frame using lots more bandwidth.
However, display lists aside, you should be aware of some issues around vsync and busy waiting (or the illusion of it)... see this question/answer.
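To make the buffer-object suggestion above concrete, here is a rough sketch using LWJGL-style static bindings and the old fixed-function client-state path (an assumption on my part; the question doesn't say which binding is in use, and modern GL would use VAOs and shaders instead):

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;
import org.lwjgl.BufferUtils;
import java.nio.FloatBuffer;

class MeshVbo {
    private final int vbo;
    private final int vertexCount;

    MeshVbo(float[] xyz) {                          // upload once at load time
        FloatBuffer data = BufferUtils.createFloatBuffer(xyz.length);
        data.put(xyz).flip();
        vbo = glGenBuffers();
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, data, GL_STATIC_DRAW);
        vertexCount = xyz.length / 3;
    }

    void draw() {                                   // per frame: no per-vertex CPU work
        glEnableClientState(GL_VERTEX_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glVertexPointer(3, GL_FLOAT, 0, 0L);
        glDrawArrays(GL_TRIANGLES, 0, vertexCount);
        glDisableClientState(GL_VERTEX_ARRAY);
    }
}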
I need a simple opinion from all the gurus!
I developed a program to do some matrix calculations. It works fine with small matrices. However, when I start calculating BIG matrices with thousands of rows and columns, it kills the speed.
I was thinking of processing each row, writing the result to a file, freeing the memory, then processing the second row and writing it to a file, and so on.
Will this help improve the speed? I would have to make big changes to implement it, which is why I need your opinion. What do you think?
Thanks
P.S.: I know about the Colt and Jama matrix packages. I cannot use them due to company rules.
Edited
In my program I store the whole matrix in a two-dimensional array, and with a small matrix that is fine. However, when the matrix has thousands of columns and rows, holding it all in memory for the calculation causes performance issues. The matrix contains floating-point values. For processing, I read the entire matrix into memory and then start calculating; after calculating, I write the result to a file.
Is memory really your bottleneck? Because if it isn't, then stopping to write things out to a file is always going to be much, much slower than the alternative. It sounds like you are probably experiencing some limitation of your algorithm.
Perhaps you should consider optimising the algorithm first.
And as I always say for all performance issues: asking people is one thing, but there is no substitute for trying it! Opinions don't matter when the real-world performance is measurable.
I suggest using profiling tools and timing statements in your code to work out exactly where your performance problem is before you start making changes.
You could spend a long time 'fixing' something that isn't the problem. I suspect that the file IO you suggest would actually slow your code down.
If your code effectively has a loop nested within another loop to process each element then you will see your speed drop away quickly as you increase the size of the matrix. If so, an area to look at would be processing your data in parallel, allowing your code to take advantage of multiple CPUs/cores.
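A sketch of that row-wise parallelism for C = A x B with plain double[][] arrays (assuming square matrices that already fit in memory, as in the question):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelMultiply {
    static double[][] multiply(double[][] a, double[][] b) throws InterruptedException {
        int n = a.length;
        double[][] c = new double[n][n];
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int from = t * n / threads, to = (t + 1) * n / threads;
            pool.execute(() -> {
                for (int i = from; i < to; i++)     // each task owns a band of rows of C
                    for (int k = 0; k < n; k++)     // i-k-j order keeps access to b sequential
                        for (int j = 0; j < n; j++)
                            c[i][j] += a[i][k] * b[k][j];
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return c;
    }
}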
Consider a more efficient sparse matrix data structure instead of a multidimensional array (if that is what you are using now).
You need to remember that performing an NxN multiplied by an NxN takes 2xN^3 operations. Even so, it shouldn't take hours. You should get an improvement by transposing the second matrix (about 30%), but it really shouldn't be taking hours.
So as you double N you increase the time by 8x. Worse than that, a matrix which fits into your cache is very fast, but at more than a few MB the data has to come from main memory, which slows down your operations by another 2-5x.
Putting the data on disk will really slow down your calculation; I only suggest doing this if your matrix doesn't fit in memory, and it will make things 10x-100x slower, so buying a little more memory is a better idea. (In your case your matrices should be small enough to fit into memory.)
I tried Jama, which is a very basic library that uses two-dimensional arrays instead of one-dimensional ones, and on a 4-year-old laptop it took 7 minutes. You should be able to halve this time just by using current hardware, and with multiple threads cut it to below one minute.
EDIT: Using a Xeon X5570, Jama multiplied two 5000x5000 matrices in 156 seconds. Using a parallel implementation I wrote, that time dropped to 27 seconds.
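For reference, a sketch of the transpose trick mentioned above: transpose B once so that the innermost loop walks both operands along contiguous rows, which is much friendlier to the cache (this illustrates the idea only; it is not the parallel implementation referred to in the edit):

class TransposedMultiply {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] bt = new double[n][n];
        for (int i = 0; i < n; i++)                 // transpose B once up front
            for (int j = 0; j < n; j++)
                bt[j][i] = b[i][j];

        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0;
                for (int k = 0; k < n; k++)
                    sum += a[i][k] * bt[j][k];      // both rows are read sequentially
                c[i][j] = sum;
            }
        return c;
    }
}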
Use the profiler in jvisualvm in the JDK to identify where the time is spent.
I would do some simple experiments to see how your algorithm scales, because it sounds like it might have a higher runtime complexity than you think. If it runs in N^3 (which is common when multiplying matrices) then doubling the input size will increase the run time roughly eightfold.