Using OptaPlanner 7.3.0 with multiple planning variables, hence multiple (2) construction heuristic phases. Here is what my CH phase 2 looks like:
<constructionHeuristic>
  <queuedEntityPlacer>
    <entitySelector id="taskChainEntitySelector">
      <entityClass>....Task</entityClass>
    </entitySelector>
    <changeMoveSelector>
      <entitySelector mimicSelectorRef="taskChainEntitySelector"/>
      <valueSelector>
        <variableName>previousTaskOrEmployee</variableName>
      </valueSelector>
    </changeMoveSelector>
  </queuedEntityPlacer>
</constructionHeuristic>
I have 814 tasks, and previousTaskOrEmployee is a chained planning variable. This CH phase takes around 7-8 minutes, or even longer, unless I use one of these:
1. Caching and sorting in the valueSelector (cacheType: PHASE, selectionOrder: SORTED), as sketched below
2. selectedCountLimit=100
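For reference, item 1 corresponds to a valueSelector configured roughly like this (a sketch; note that SORTED selection also needs a sorter configured, e.g. a sorterManner or a custom comparator, which is elided here):

<changeMoveSelector>
  <entitySelector mimicSelectorRef="taskChainEntitySelector"/>
  <valueSelector>
    <variableName>previousTaskOrEmployee</variableName>
    <cacheType>PHASE</cacheType>
    <selectionOrder>SORTED</selectionOrder>
    <!-- a sorter (e.g. sorterManner) must also be configured for SORTED -->
  </valueSelector>
</changeMoveSelector>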
The reason selectedCountLimit helps is that this CH phase creates more and more moves per step as it proceeds. Here is a simple trace without caching/limiting/filtering:
-814init = selectedMovesCount: 1
..
-621init = selectedMovesCount: 300
..
-421init = selectedMovesCount: 500
..
-221init = selectedMovesCount: 800 // keeps increasing as initialization proceeds
In some cases, I've seen more than 50k moves per step, which is crazy.
My questions:
A. Is it normal for a changeMoveSelector to generate so many moves per step in a CH?
B. Filtering does not make much sense in a CH, as the variables are not initialised yet. So, should selectedCountLimit ideally be used?
C. My first CH phase takes hardly any time, just 4-5 seconds, because it has relatively fewer entities and no chains. What would be an ideal time for my CH phase 2 with 814 chained entities?
See the docs section "Scaling Construction Heuristics". There's a multi-variable, non-cartesian alternative that is much faster; a sketch follows below. If that doesn't help enough, there's Partitioned Search (with only a CH phase). Both have trade-offs, though.
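The non-cartesian variant queues one changeMoveSelector per planning variable under a single queuedEntityPlacer, so the variables are initialized one after another instead of as a cartesian product. A rough sketch (the second variable name here is hypothetical, standing in for whatever your other CH phase assigns):

<constructionHeuristic>
  <queuedEntityPlacer>
    <entitySelector id="taskSelector">
      <entityClass>....Task</entityClass>
    </entitySelector>
    <!-- one changeMoveSelector per planning variable, applied sequentially -->
    <changeMoveSelector>
      <entitySelector mimicSelectorRef="taskSelector"/>
      <valueSelector>
        <variableName>previousTaskOrEmployee</variableName>
      </valueSelector>
    </changeMoveSelector>
    <changeMoveSelector>
      <entitySelector mimicSelectorRef="taskSelector"/>
      <valueSelector>
        <variableName>otherVariable</variableName><!-- hypothetical second variable -->
      </valueSelector>
    </changeMoveSelector>
  </queuedEntityPlacer>
</constructionHeuristic>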
Also check that your score calculation speed isn't too low (it should be above 1000/sec at the very least).
Related
I am running the simple program below. I know this is not the best way to measure performance, but the results are surprising to me, hence I wanted to post a question here.
public class findFirstTest {
    public static void main(String[] args) {
        for (int q = 0; q < 10; q++) {
            long start2 = System.currentTimeMillis();
            int k = 0;
            for (int j = 0; j < 5000000; j++) {
                if (j > 4500000) {
                    k = j;
                    break;
                }
            }
            System.out.println("for value " + k + " with time " + (System.currentTimeMillis() - start2));
        }
    }
}
The results look like this after running the code multiple times:
for value 4500001 with time 3
for value 4500001 with time 25 (surprised that it took 25 ms in the 2nd iteration)
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
So I don't understand why the 2nd iteration took 25 ms while the 1st took 3 ms and the later ones 0 ms, and also why it's always the 2nd iteration, every time I run the code.
If I move the start/end time measurement outside the outer for loop, the result I get is:
for value 4500001 with time 10
In the first iteration, the code runs interpreted.
In the second iteration, the JIT kicks in, slowing things down a bit while it compiles the hot loop to native code.
In the remaining iterations, the native code runs very fast.
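If you want to watch this happen, run the program with HotSpot's compilation logging switched on (a standard JVM flag); you should see the method being compiled right around the second iteration:

java -XX:+PrintCompilation findFirstTest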
Because your Winamp needed to decode another few frames of your MP3 to queue into the sound output buffers. Or because the phase of the moon changed a bit and your dynamic background needed changing, or because someone in East Croydon farted and your computer is subscribed to the 'smells from London' Twitter feed. Who knows?
This isn't how you performance test. Your CPU is not such a simple machine after all; it has many cores, and each core has pipelines and multiple hierarchies of caches. Any given core can only interact with its own caches, and because of this, if a core runs an instruction that operates on memory which is not currently in cache, the core will stall for a while: it sends the memory controller a request to load the page of memory it needs into a given cache page, and then waits until it is there; this can take many, many cycles.
On the other end you have an OS that is juggling hundreds of thousands of processes and threads, many of them internal to the kernel, pre-empting like there is no tomorrow, and trying to give extra precedence to processes that are time-sensitive, such as the aforementioned Winamp, which must get a chance to decode some more MP3 frames before the sound buffer is fully exhausted, or you'd notice skipping. This is non-trivial: on ye olde Windows you just couldn't get this done, which is why ye olde Winamp was a magical marvel of engineering, more or less hacking into Windows to ensure it got the priority it needed. Those days are long gone, but if you remember them, well, draw the conclusion that this isn't trivial, and thus, OSes do pre-empt with prejudice all the time these days.
A third significant factor is the JVM itself, which is doing all sorts of borderline voodoo magic: it has both a HotSpot engine (which does bookkeeping on your code so that it can eventually conclude that it is worth spending considerable CPU resources to analyse the heck out of a method and rewrite it in optimized machine code, because that method seems to be taking a lot of CPU time) and a garbage collector.
The solution is to forget entirely about trying to measure time using such mere banalities as measuring currentTimeMillis or nanoTime and writing a few loops. It's just way too complicated for that to actually work.
No. Use JMH.
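For completeness, a minimal JMH sketch of the same measurement (assuming the jmh-core and jmh-generator-annprocess dependencies are on the classpath; the class and method names here are illustrative):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class FindFirstBenchmark {

    @Benchmark
    public int findFirst() {
        int k = 0;
        for (int j = 0; j < 5000000; j++) {
            if (j > 4500000) {
                k = j;
                break;
            }
        }
        return k; // returning the result keeps the JIT from eliminating the loop
    }
}

JMH handles the warmup (interpreted vs. JIT-compiled) and forking concerns automatically, which is exactly what the hand-rolled timing loop above gets wrong.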
I made a 2D n-body simulation, using brute force at first, and then implemented a Barnes-Hut approximation algorithm following http://arborjs.org/docs/barnes-hut. However, this didn't give me the effect I was looking for.
Ex:
Barnes-Hut → 2000 bodies: avg. frame time 32 ms; 5000 bodies: 164 ms
Brute force → 2000 bodies: avg. frame time 31 ms; 5000 bodies: 195 ms
These values are with rendering turned off.
Am I correct to assume that I haven't implemented the algorithm correctly and am therefore not getting a substantial increase in performance?
Theta is currently set to s/d < 0.5. Changing this value to e.g. 1 does increase performance, but it's quite obvious why this isn't preferred.
Single threaded
My code along general lines:
while (!close)
{
    long newTime = System.currentTimeMillis();
    long frameTime = newTime - currentTime;
    System.out.println(frameTime);
    currentTime = newTime;
    // update the bodies
}
Within the function that updates the bodies:
first insert all bodies into the quadtree with all its subnodes
for all bodies
{
    compute the physics using Barnes-Hut, which yields a net force per planet (doPhysics(body))
    calculate the instantaneous acceleration from the net force
    update the instantaneous velocity
}
The Barnes-Hut function:
doPhysics(body)
{
    if (node is external (contains 1 body) and that body is not itself)
    {
        calculate the force between those two bodies
    }
    else if (node is internal and s/d < 0.5)
    {
        create a pseudobody at the COM with the node's total mass
        calculate the force between the body and the pseudobody
    }
    else (it is internal but s/d >= 0.5)
    {
        (this is where recursion comes in)
        doPhysics on the same body but on the NorthEast subnode
        doPhysics on the same body but on the NorthWest subnode
        doPhysics on the same body but on the SouthEast subnode
        doPhysics on the same body but on the SouthWest subnode
    }
}
Actually calculating the force:
calculateforce(body, otherbody)
{
    if (body is not at exactly the same position (avoid division by 0))
    {
        calculate the force using Newton's law of gravitation in vector form
        add the force to the body's net force for this frame
    }
}
Your code is still incomplete (read up on SSCCEs), and in-depth debugging of incomplete code is not the purpose of this site. However, this is how I would approach the next steps of figuring out what, if anything, is wrong:
time only the function that you are worried about (let us call it barnes_hutt_update()), and not the whole update loop. Compare that to the equivalent non-B-H code, not to the whole update loop without B-H. That results in a much more meaningful comparison.
you seem to have hard-coded s/d < 0.5 into your algorithm. Leaving it as an argument, you should notice speedups when it is set higher, and performance very similar to a naive non-B-H implementation when it is set to 0. The speedup in B-H comes from evaluating fewer nodes (because far-away nodes are lumped together); do you know how many nodes you are managing to skip? No skipped nodes, no speedup. On the other hand, skipping nodes introduces small errors in the calculation - have you quantified those?
have a look at other implementations of B-H online. D3's force layout uses it internally and is quite readable. There are multiple existing quadtree implementations; if you have built your own, it may be sub-optimal (or even buggy). Unless you are trying to learn by doing, it is always better to use a tested library than to roll your own.
the slowdown may be due to the use of the quadtree rather than to the force summation itself. It would be useful to know how long building and updating the quadtree takes compared to the B-H force approximation itself, because the quadtree is, in this case, pure overhead: B-H needs it, but a naive non-B-H implementation does not. For small numbers of bodies, naive will be faster (but will get slower very quickly as you add more and more). How does the performance scale as you add more bodies?
are you creating and discarding large numbers of objects? You can avoid the associated overhead (yes, lots of news plus garbage collection can result in significant slowdowns) by using a memory pool; see the sketch after this list.
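A minimal sketch of such a pool, assuming a hypothetical mutable Vec2 class for the per-frame force vectors (these names are illustrative, not from the question's code):

import java.util.ArrayDeque;

// Reuses Vec2 instances across frames instead of allocating fresh ones,
// keeping GC pressure out of the inner physics loop.
final class Vec2Pool {
    private final ArrayDeque<Vec2> free = new ArrayDeque<>();

    Vec2 acquire() {
        Vec2 v = free.poll();
        return (v != null) ? v : new Vec2();
    }

    void release(Vec2 v) {
        v.x = 0; // reset state before the instance is reused
        v.y = 0;
        free.push(v);
    }
}

final class Vec2 {
    double x, y;
}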
Simple question, which I've been wondering about. Of the following two versions of code, which is better optimized? Assume that the time value resulting from the System.currentTimeMillis() call only needs to be pretty accurate, so caching should only be considered from a performance point of view.
This (with value caching):
long time = System.currentTimeMillis();
for (long timestamp : times) {
    if (time - timestamp > 600000L) {
        // Do something
    }
}
Or this (no caching):
for (long timestamp : times) {
    if (System.currentTimeMillis() - timestamp > 600000L) {
        // Do something
    }
}
I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.
How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?
Is this better to do from a CPU or memory optimization point of view?
A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's loop-invariant it can even optimize your loop condition to -timestamp > 600000L - time or timestamp < time - 600000L. I.e. the loop condition becomes a trivial compare between the iterator value and a loop-invariant constant in a register.
So yes it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
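Concretely, the hoisted-and-simplified loop described above looks like this (a sketch of the transformation; it is behaviorally equivalent to the cached version in the question):

long cutoff = System.currentTimeMillis() - 600000L; // loop-invariant, computed once
for (long timestamp : times) {
    if (timestamp < cutoff) { // same test as time - timestamp > 600000L
        // Do something
    }
}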
Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result¹. So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).
If your // Do something is simple, like just a few memory accesses that usually hit in cache, or some simpler calculation, or anything else that out-of-order execution can do a good job with, then the timing call could be most of the cost of your loop. Unless each loop iteration typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Small vs. huge will depend on how simple your loop body is.
If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.
Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice; running more iterations "too late".
Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead time source, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)
IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.
So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. That can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.
So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.
In your case, method calls should always be hoisted out of loops.
System.currentTimeMillis() simply reads a value from OS memory, so it is very cheap (a few nanoseconds), as opposed to System.nanoTime(), which involves a call to hardware, and therefore can be orders of magnitude slower.
I have a variable that gets read and updated thousands of times a second. It needs to be reset regularly. But "half" the time, the value is already the reset value. Is it a good idea to check the value first (to see if it needs resetting) before resetting (a write operation), or should I just reset it regardless? The main goal is to optimize the code for performance.
To illustrate:
Random r = new Random();
int val = Integer.MAX_VALUE;
for (int i = 0; i < 100000000; i++) {
    if (i % 2 == 0)
        val = Integer.MAX_VALUE;
    else
        val = r.nextInt();
    if (val != Integer.MAX_VALUE) // skip check?
        val = Integer.MAX_VALUE;
}
I tried to use the above program to test the two scenarios (by un/commenting the 2nd if line), but any difference is masked by the natural variance of the run duration.
Thanks.
Don't check it.
It's more execution steps = more cycles = more time.
As an aside, you are breaking one of the basic software golden rules: "Don't optimise early". Unless you have hard evidence that this piece of code is a performance problem, you shouldn't be looking at it. (Note that this doesn't mean you code without performance in mind; you still follow normal best practice, but you don't add any special code whose only purpose is "performance related".)
The check has no actual performance impact. We'd be talking about a single clock cycle or something, which is usually not relevant in a Java program (as hard-core number crunching usually isn't done in Java).
Instead, base the decision on readability. Think of the maintainer who's going to change this piece of code five years on.
In the case of your example, using my rationale, I would skip the check.
Most likely the JIT will optimise the code away because it doesn't do anything.
Rather than worrying about performance, it is usually better to worry about what is
simpler to understand
cleaner to implement
In both cases, you might remove the code, as it doesn't do anything useful, and removing it could make the code faster as well.
Even if it did make the code a little slower, the cost would be very small compared to the cost of calling r.nextInt(), which is not cheap.
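If you do want to measure this properly, a JMH sketch along these lines sidesteps the dead-code elimination and run-to-run variance problems mentioned above (class and method names are illustrative):

import java.util.Random;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ResetBenchmark {
    int val;
    Random r = new Random();

    @Benchmark
    public void alwaysReset(Blackhole bh) {
        val = r.nextBoolean() ? Integer.MAX_VALUE : r.nextInt();
        val = Integer.MAX_VALUE;        // unconditional write
        bh.consume(val);                // stops the JIT from removing the work
    }

    @Benchmark
    public void checkThenReset(Blackhole bh) {
        val = r.nextBoolean() ? Integer.MAX_VALUE : r.nextInt();
        if (val != Integer.MAX_VALUE) { // read first, write only if needed
            val = Integer.MAX_VALUE;
        }
        bh.consume(val);
    }
}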
I am writing code in Java where I branch off based on whether a string starts with certain characters while looping through a dataset and my dataset is expected to be large.
I was wondering whether startsWith is faster than indexOf. I experimented with 2000 records but found no difference.
startsWith only needs to check for the presence at the very start of the string - it's doing less work, so it should be faster.
My guess is that your 2000 records finished in a few milliseconds (if that). Whenever you want to benchmark one approach against another, try to do it for enough time that differences in timing will be significant. I find that 10-30 seconds is long enough to show significant improvements, but short enough to make it bearable to run the tests multiple times. (If this were a serious investigation I'd probably try for longer times. Most of my benchmarking is for fun.)
Also make sure you've got varied data - indexOf and startsWith should have roughly the same running time in the case where indexOf returns 0. So if all your records match the pattern, you're not really testing correctly. (I don't know whether that was the case in your tests of course - it's just something to watch out for.)
In general, the golden rule of micro-optimization applies here:
"Measure, don't guess".
As with all optimizations of this type, the difference between the two calls almost certainly won't matter unless you are checking millions of strings that are each tens of thousands of characters long.
Run a profiler over your code, and only optimize this call when you can measure that it's slowing you down. Till then, go with the more readable options (startsWith, in this case). Once you know that this block is slowing you down, try both and use whichever is faster. Rinse. Repeat ;-)
Academically, my guess is that startsWith will likely be implemented using indexOf. Check the source code and see if you're interested. (Turns out that startsWith does not call indexOf)
Even without looking into the sources, it should be clear that startsWith() is faster, at least for large strings and short patterns.
The running time of a.startsWith(b) is bounded by the length of b: after at most the first |b| characters are checked, the search is finished.
The running time of a.indexOf(b) is larger (depending on the actual algorithm); any algorithm's running time depends at least on the length of a. Roughly speaking, you have to look at each character of a once to check whether the pattern starts at that position.
However, as always, it depends on the actual use case whether you really see a difference in practice. Measuring the difference in real life is never a bad idea.
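For intuition, a naive sketch of the O(|b|) check that startsWith performs (the real JDK implementation differs in its details):

// Returns true if 'a' starts with 'b'; inspects at most b.length() characters.
static boolean startsWithNaive(String a, String b) {
    if (b.length() > a.length()) {
        return false;
    }
    for (int i = 0; i < b.length(); i++) {
        if (a.charAt(i) != b.charAt(i)) {
            return false; // first mismatch: stop immediately
        }
    }
    return true;
}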
Probably: if it doesn't match, it can stop looking, whereas indexOf needs to look for occurrences later in the string.
startsWith is clearer than indexOf == 0.
Have you identified the test as a performance bottleneck for which you need to sacrifice readability?
public class Test {
    public static void main(String[] args) {
        long value1 = System.currentTimeMillis();
        for (long i = 0; i < 100000000; i++) {
            "abcd".indexOf("a");
        }
        long value2 = System.currentTimeMillis();
        System.out.println(value2 - value1);

        value1 = System.currentTimeMillis();
        for (long i = 0; i < 100000000; i++) {
            "abcd".startsWith("a");
        }
        value2 = System.currentTimeMillis();
        System.out.println(value2 - value1);
    }
}
I tested it with this piece of code, and performance for startsWith seems to be better, for the obvious reason that it doesn't have to traverse the whole string. In the best case both should perform about the same, while in the worst case startsWith will always perform better than indexOf.
You mentioned the dataset is expected to be large, so I'd bet a lot of the time will go into accessing the dataset and handling it in memory. That means using one or the other will not change performance significantly. But if this is important to you, you could write your own startsWith method, which could be significantly faster than the standard library methods, or at least you would know exactly what it does.
Unfortunately, startsWith is not working as it's supposed to! It uses indexOf behind the scene (lazy developers :D), so indexOf is 10x faster than the implemented startsWith.