LWJGL 2.9.0 GL20.glUniformMatrix4 causes random stuttering - java

I am running the renderer in a separate thread at 60 FPS (16 ms).
The following code produces random stuttering...
long testTime = System.nanoTime();
GL20.glUniformMatrix4(
    GL20.glGetUniformLocation(getProgram(), "projectionMatrix"),
    false,
    matrix4fBuffer // holds the projection matrix
);
testTime = System.nanoTime() - testTime;
if (testTime > 1000000) {
    System.out.println("DELAY " + (testTime / 1000000)); // 22-30ms
}
The GL20.glUniformMatrix4 call randomly takes around 22-30 ms (every 10 s, 30 s, 45 s, ...), which causes random slowdown (stuttering). Normally it takes ~0 ms (a couple of nanoseconds).
I am testing with only one object being rendered (using programmable pipeline - shaders, OpenGL >= 3.3).
Other pieces of this example:
getProgram() // simply returns an integer program handle

// This is called before GL20.glUniformMatrix4
FloatBuffer matrix4fBuffer = BufferUtils.createFloatBuffer(16);
projectionMatrix.store(matrix4fBuffer);
matrix4fBuffer.flip();
Any idea what is happening here?
EDIT:
I forgot to mention that I am running render and update in separate threads. I guess it could be related to thread scheduling?
EDIT:
Okay, I also tested this in a single-threaded environment and the problem persists... I have also found out that other calls to glUniformMatrix4 do not cause problems, e.g.:
long testTime = System.nanoTime();
state.model.store(buffer);
buffer.flip();
GL20.glUniformMatrix4(
    GL20.glGetUniformLocation(shader.getProgram(), "modelMatrix"),
    false,
    buffer
);
testTime = System.nanoTime() - testTime;
if (testTime > 16000000) {
    System.out.println("DELAY MODEL " + (testTime / 1000000));
}

Stop doing this:
GL20.glUniformMatrix4(
GL20.glGetUniformLocation(getProgram(), "projectionMatrix"),
[...]
Uniform locations do not change after you link your program, and querying anything from OpenGL is a great way to kill performance.
This Get function is especially expensive because it uses a string to identify the location you are searching for. String comparisons are slow unless optimized into something like a trie or hash table, and the cost grows as you add more potential matches to the set of searched strings. Neither OpenGL nor GLSL defines how this function has to be implemented, but if you are concerned about performance you should assume your implementation is as stupid as they come.
Keep an int (GLint, in C terms) handy for your frequently used named uniforms. I would honestly suggest writing a class that encapsulates a GLSL program object, and then subclassing it for any specialization. The specialized classes store all of the uniform locations they need, so you never have to query GL for uniform locations at draw time.
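A minimal sketch of that idea, assuming LWJGL 2's GL20 bindings (the class and method names here are illustrative, not from the question):

import java.nio.FloatBuffer;
import org.lwjgl.opengl.GL20;

public abstract class ShaderProgram {
    private final int program;

    protected ShaderProgram(int program) {
        this.program = program;
    }

    public int getProgram() {
        return program;
    }

    // Query once, right after linking, instead of every frame.
    protected int uniformLocation(String name) {
        return GL20.glGetUniformLocation(program, name);
    }
}

class BasicShader extends ShaderProgram {
    private final int projectionMatrixLocation;

    BasicShader(int program) {
        super(program);
        projectionMatrixLocation = uniformLocation("projectionMatrix");
    }

    void loadProjectionMatrix(FloatBuffer matrix4fBuffer) {
        // No string lookup on the hot path, just the upload.
        GL20.glUniformMatrix4(projectionMatrixLocation, false, matrix4fBuffer);
    }
}

With this, the render loop only ever calls loadProjectionMatrix, and glGetUniformLocation runs exactly once per program.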

Related

n-body Simulation expected performance barnes hut

I made a 2D n-body simulation using brute force at first, but then, following this guide (http://arborjs.org/docs/barnes-hut), I implemented a Barnes-Hut approximation algorithm. However, this didn't give me the speedup I was looking for.
Ex:
Barnes-Hut -> 2000 bodies: avg. frame time 32 ms; 5000 bodies: 164 ms
Brute force -> 2000 bodies: avg. frame time 31 ms; 5000 bodies: 195 ms
These values are with rendering turned off.
Am I correct to assume that I haven't correctly implemented the algorithm and am thus not getting a substantial increase in performance?
Theta is currently set to s/d < 0.5. Changing this value to e.g. 1 does increase performance, but it's quite obvious why this isn't preferred.
Single threaded
My code along general lines:
while (!close)
{
    long newTime = System.currentTimeMillis();
    long frameTime = newTime - currentTime;
    System.out.println(frameTime);
    currentTime = newTime;
    // update the bodies
}
Within the function that updates the bodies:
first insert all bodies into the quadtree with all its subnodes
for all bodies
{
    compute the physics using Barnes-Hut, which yields a net force per planet (doPhysics(body))
    calculate instantaneous acceleration from net force
    update the instantaneous velocity
}
The barneshut function:
doPhysics(body)
{
    if (node is external (contains 1 body) and that body is not itself)
    {
        calculate the force between those two bodies
    }
    else if (node is internal and s/d < 0.5)
    {
        create a pseudobody at the COM with the node's total mass
        calculate the force between the body and pseudobody
    }
    else (it is internal but s/d >= 0.5)
    {
        (this is where recursion comes in)
        doPhysics on the same body but on the NorthEast subnode
        doPhysics on the same body but on the NorthWest subnode
        doPhysics on the same body but on the SouthEast subnode
        doPhysics on the same body but on the SouthWest subnode
    }
}
Actually calculating the force:
calculateforce(body, otherbody)
{
    if (body is not at exactly the same position (avoid division by 0))
    {
        calculate the force using Newton's law of gravitation in vector form
        add the force to the body's net force for this frame
    }
}
Your code is still incomplete (read up on SSCCEs), and in-depth debugging of incomplete code is not the purpose of this site. However, this is how I would approach the next steps of figuring out what, if anything, is wrong:
time only the function you are worried about (let us call it barnes_hut_update()), not the whole update loop. Compare that to the equivalent non-B-H code, not to the whole update loop without B-H. This gives a much more meaningful comparison.
you seem to have hard-coded s/d < 0.5 into your algorithm. Leave it as an argument: you should then notice speedups when it is set higher, and performance very similar to a naive non-B-H implementation when it is set to 0. The speedup in B-H comes from evaluating fewer nodes (far-away nodes are lumped together); do you know how many nodes you are managing to skip? No skipped nodes, no speedup. On the other hand, skipping nodes introduces small errors in the calculation; have you quantified those?
have a look at other implementations of B-H online. D3's force layout uses it internally, and is quite readable. There are multiple existing quadtree implementations. If you have built your own, they may be sub-optimal (or even buggy). Unless you are trying to learn-by-doing, it is always better to use a tested library instead of rolling your own.
the slowdown may be due to the use of quadtrees rather than the force summation itself. It would be useful to know how long building and updating the quadtree takes compared to the B-H force approximation itself, because the quadtree is, in this case, pure overhead: B-H needs it, but the naive non-B-H implementation does not. For small numbers of bodies, naive will be faster (but it gets slower very quickly as you add more). How does the performance scale as you add more and more bodies?
are you creating and discarding large numbers of objects? Lots of news plus the resulting garbage collection can cause significant slowdowns; you can avoid that overhead with a memory pool, as sketched below.
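A minimal pool sketch (the names are assumptions, not from the question's code); reusing the scratch vectors that the force calculation churns through keeps the garbage collector quiet:

import java.util.ArrayDeque;

final class Vec2Pool {
    private final ArrayDeque<double[]> free = new ArrayDeque<double[]>();

    // Hand out a recycled vector if one is available, otherwise allocate.
    double[] obtain() {
        double[] v = free.poll();
        return (v != null) ? v : new double[2];
    }

    // Zero the vector and return it to the pool for the next frame.
    void release(double[] v) {
        v[0] = 0.0;
        v[1] = 0.0;
        free.push(v);
    }
}

Obtain a vector at the start of each force computation and release it once the result has been accumulated; after the pool warms up, no allocation happens per frame.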

Run code every X seconds (Java)

This is not super necessary, I am just curious to see what others think. I know it is useless, it's just for fun.
Now I know how to do this; it's fairly simple. I am just trying to figure out a way to do it differently that doesn't require creating new variables that crowd up my class.
Here's how I would do it:
float timePassed = 0f;

public void update() {
    // deltatime is the time passed from one update to the next, in seconds (a float)
    timePassed += deltatime;
    if (timePassed >= 5) {
        // code to be run every 5 seconds
        timePassed -= 5f;
    }
}
What I want to know is if there is a way to do this without the time passed variable. I have a statetime (time since loop started) variable that I use for other things that could be used for this.
If the goal really is to run code every X seconds, my first choice would be a java.util.Timer. Another option is a ScheduledExecutorService, which adds a couple of enhancements over util.Timer (better exception handling, for one).
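A minimal ScheduledExecutorService sketch (the task body and the 5-second period are made up for illustration):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class EveryFiveSeconds {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Initial delay of 5 seconds, then run every 5 seconds.
        scheduler.scheduleAtFixedRate(
                new Runnable() {
                    public void run() {
                        System.out.println("ran at " + System.currentTimeMillis());
                    }
                },
                5, 5, TimeUnit.SECONDS);
    }
}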
I tend to avoid the Swing Timer, as I prefer to keep the EDT (event dispatch thread) uncluttered.
Many people write a "game loop" which is closer to what you have started. A search on "game loop" will probably get you several variants, depending on whether you wish to keep a steady rate or not.
Sometimes, in situations where one doesn't want to continually test and reset, one can combine the two functions via an AND operation, as in the snippet below. For example, if you AND a counter with 63, you get the range 0-63 to iterate through. This works well for ranges that are a power of 2.
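To make that concrete, a tiny sketch (the 64-update period is just an example):

int frame = 0;

public void update() {
    if ((frame & 63) == 0) {
        // runs on every 64th update; works because 64 is a power of 2,
        // so (frame & 63) cycles 0-63 with no separate test-and-reset
    }
    frame++;
}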
Depending on the structure of your calling code, you might pass in the "statetime" variable as a parameter and test if it is larger than your desired X. If you did this, I assume that a step in the called code will reset "statetime" to zero.
Another idea is to pass in a "startTime" to the update method. Then, your timer will test the difference between currentTimeMillis and startTime to see if X seconds has elapsed or not. Again, the code you call should probably set a new "startTime" as part of the process. The nice thing about this method is that there is no need to increment elapsed time.
As long as I am churning out ideas: you could also create a future "targetTime" variable and test whether currentTimeMillis() - targetTime > 0.
startTime or targetTime can be immutable, which often provides a slight plus, depending on how they are used.
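A sketch of the startTime variant (names assumed; X fixed at 5 seconds for illustration):

private long startTime = System.currentTimeMillis();

public void update() {
    long now = System.currentTimeMillis();
    if (now - startTime >= 5000) { // has X seconds elapsed?
        // code to run every 5 seconds
        startTime = now; // the called code sets a new startTime
    }
}

The targetTime variant is the same test rearranged: check currentTimeMillis() - targetTime > 0, and advance targetTime by X when it fires.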

Basic (arithmetic) operations and their dependence on JVM and CPU

In Java I want to measure the time for:
1000 integer comparisons (the "<" operator),
1000 integer additions (a + b, each case with different a and b),
and other simple operations.
I know I can do it in the following way:
Random rand = new Random();
long elapsedTime = 0;
for (int i = 0; i < 1000; i++) {
    int a = Integer.MIN_VALUE + rand.nextInt(Integer.MAX_VALUE);
    int b = Integer.MIN_VALUE + rand.nextInt(Integer.MAX_VALUE);
    long start = System.currentTimeMillis();
    if (a < b) {}
    long stop = System.currentTimeMillis();
    elapsedTime += (stop - start);
}
System.out.println(elapsedTime);
I know that this question may seem somewhat unclear.
How do those values depend on my processor and JVM (i.e., what is the relation between the time for those operations and my processor)? Any suggestions?
I'm looking for understandable readings...
How do those values depend on my processor and JVM (i.e., what is the relation between the time for those operations and my processor)? Any suggestions?
It is not dependent on your processor, at least not directly.
Normally, once you run code enough times, the JVM compiles it to native code. When it does this, it removes code which doesn't do anything, so what you would really be measuring here is the time it takes to perform a System.currentTimeMillis() call, which is typically about 0.00003 ms. This means you will get 0 about 99.997% of the time and see a 1 very rarely.
I say normally, but in this case your code won't be compiled to native code, as the default threshold is 10,000 iterations; i.e., you would be testing how long it takes the interpreter to execute the bytecode. This is much slower, but still a fraction of a millisecond, so you have a higher chance of seeing a 1, though it is still unlikely.
If you want to learn more about low-level benchmarking in Java, I suggest you read about JMH and the author's blog at http://shipilev.net/
If you want to see what machine code is generated from Java code, I suggest you try JITWatch.
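For comparison, a minimal JMH sketch (assuming the org.openjdk.jmh dependency is on the classpath; class and method names are made up). JMH takes care of warm-up, forking, and dead-code elimination, which is exactly what defeats hand-rolled loops like the one in the question:

import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class IntOpsBenchmark {
    int a, b;

    @Setup(Level.Iteration)
    public void pickOperands() {
        a = ThreadLocalRandom.current().nextInt();
        b = ThreadLocalRandom.current().nextInt();
    }

    @Benchmark
    public int addition() {
        return a + b; // returning the result stops the JIT from eliminating it
    }

    @Benchmark
    public boolean comparison() {
        return a < b;
    }
}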

Limit draw processing in a game loop

I am using the code structure below, from http://www.koonsolo.com/news/dewitters-gameloop/, to set up a game loop that updates at a fixed rate but renders/draws as often as possible.
How would one implement a cap on the drawing FPS so as not to use up all the processing power/battery life, or to limit it for v-syncing?
const int TICKS_PER_SECOND = 60;
const int SKIP_TICKS = 1000000000 / TICKS_PER_SECOND;
const int MAX_FRAMESKIP = 5;

DWORD next_game_tick = GetTickCount();
int loops;
float interpolation;

bool game_is_running = true;
while( game_is_running ) {
    loops = 0;
    while( GetTickCount() > next_game_tick && loops < MAX_FRAMESKIP ) {
        update_game();
        next_game_tick += SKIP_TICKS;
        loops++;
    }

    interpolation = float( GetTickCount() + SKIP_TICKS - next_game_tick )
                    / float( SKIP_TICKS );
    display_game( interpolation );
}
I assume that you are actually doing proper motion interpolation? Otherwise it doesn't make sense to render faster than your game update: you'll just be rendering all the objects again in exactly the same position.
I'd suggest the following:
Put a Thread.sleep(millis) call in to stop the busy-looping. Thread.sleep(5) is probably fine, since you are just doing a quick check for whether you are ready for the next update.
Put a conditional test on the display_game call to check whether at least a certain number of milliseconds has elapsed since the last display_game. For example, with a 10 ms minimum your frame rate will be limited to 100 FPS. A sketch combining both suggestions follows this list.
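A rough Java sketch of the two suggestions together (updateGame, displayGame, and the loop state are adapted from the question's pseudocode; the 10 ms cap is an example):

static final long MIN_FRAME_TIME_MS = 10; // caps rendering at roughly 100 FPS

void runLoop() throws InterruptedException {
    long lastRender = System.currentTimeMillis();
    while (gameIsRunning) {
        // ... fixed-rate updateGame() logic from the question goes here ...
        long now = System.currentTimeMillis();
        if (now - lastRender >= MIN_FRAME_TIME_MS) {
            displayGame(interpolation);
            lastRender = now;
        }
        Thread.sleep(5); // quick nap instead of busy-looping
    }
}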
There are also a couple of other things that are a bit unclear in your code:
What is DWORD? Is this really Java? It looks like a rough C/C++ conversion. The normal way to get the current time in Java would be long time = System.nanoTime() or similar.
What graphics framework are you using? If it is Swing, then you need to be careful about what thread you are running on, as you don't want to be blocking the GUI thread....
Finally, you should also consider whether you want to decouple your update loop from the rendering code and have them running on different threads. This is trickier to get right since you may need to lock or take snapshots of certain objects to ensure they don't change while you are rendering them, but it will help your performance and scalability on multi-core machines (which is most of them nowadays!)
I think you can update your display_game to compare the FPS being painted against the desired limit. If it has reached that limit, you can add a wait, for example:
Thread.sleep(500); // wait for 500 milliseconds

What's wrong with System.nanoTime?

I have a very long string with the pattern </value> at the very end. I am trying to test the performance of some function calls, so I made the following test to find out which is fastest... but I think I might be using nanoTime incorrectly, because the results don't make sense no matter how I swap the order around...
long start, end;

start = System.nanoTime();
StringUtils.indexOf(s, "</value>");
end = System.nanoTime();
System.out.println(end - start);

start = System.nanoTime();
s.indexOf("</value>");
end = System.nanoTime();
System.out.println(end - start);

start = System.nanoTime();
sb.indexOf("</value>");
end = System.nanoTime();
System.out.println(end - start);
I get the following:
163566 // StringUtils
395227 // String
30797 // StringBuilder
165619 // StringBuilder
359639 // String
32850 // StringUtils
No matter which order I swap them around, the numbers will always be somewhat the same... What's the deal here?
From java.sun.com website's FAQ:
Using System.nanoTime() between various points in the code to perform elapsed time measurements should always be accurate.
Also:
http://download.oracle.com/javase/1.5.0/docs/api/java/lang/System.html#nanoTime()
The difference between the two runs is on the order of microseconds, and that is expected. There are many things going on on your machine that make the execution environment never quite the same between two runs of your application. That is why you get that difference.
EDIT: Java API says:
This method provides nanosecond precision, but not necessarily
nanosecond accuracy.
Most likely there are memory-initialization issues or other things that happen at JVM startup that are skewing your numbers. You should use a bigger sample to get more accurate numbers. Play around with the order, run it multiple times, etc.
It is more than likely that the methods you are checking share some common code behind the scenes. But the JIT only does its work after about 10,000 invocations, which could be why your first two examples always seem slower.
Quick fix: just make the three method calls, on a long enough string, before the first measurement.
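In other words (reusing s, sb, and StringUtils from the question), something like:

// Warm-up: exercise the shared code paths before the timed runs so the
// JIT has compiled them; ~10,000 iterations clears the default threshold.
for (int i = 0; i < 10000; i++) {
    StringUtils.indexOf(s, "</value>");
    s.indexOf("</value>");
    sb.indexOf("</value>");
}
// ... then run the nanoTime measurements from the question ...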
