Very slow iteration on Chronicle Map - java

I'm seeing very slow times iterating over a Chronicle Map: in the example below, 93 ms per iteration over 1M entries on my 2013 MacBook Pro. I'm wondering if there's a better way to iterate, or something I'm doing wrong, or if this is expected. I know Chronicle Map isn't optimized for iterating, but this ticket from a few years ago made me expect much faster iteration times. Toy example below:
public static void main(String[] args) throws Exception {
    int numEntries = 1_000_000;
    int numIterations = 1_000;
    int avgEntrySize = BitUtil.SIZE_OF_LONG + BitUtil.SIZE_OF_INT;

    ChronicleMap<IntValue, ByteBuffer> map = ChronicleMap.of(IntValue.class, ByteBuffer.class)
            .name("test").entries(numEntries).averageValueSize(avgEntrySize)
            .putReturnsNull(true).create();

    IntValue value = Values.newHeapInstance(IntValue.class);
    ByteBuffer buffer = ByteBuffer.allocate(avgEntrySize);
    for (int i = 0; i < numEntries; i++) {
        value.setValue(i);
        buffer.clear();
        buffer.putLong(i);
        buffer.putInt(i);
        buffer.flip();
        map.put(value, buffer);
    }
    System.out.println("Finished insertion");

    for (int i = 0; i < numIterations; i++) {
        map.forEachEntry(entry -> {
            Data<ByteBuffer> data = entry.value();
            ByteBuffer val = data.get();
        });
    }
    System.out.println("Finished priming");

    long start = System.currentTimeMillis();
    for (int i = 0; i < numIterations; i++) {
        map.forEachEntry(entry -> {
            Data<ByteBuffer> data = entry.value();
            ByteBuffer val = data.get();
        });
    }
    System.out.println(
            "Elapsed: " + (System.currentTimeMillis() - start) + " for " + numIterations
                    + " iterations");
}
Output:
Finished insertion
Finished priming
Elapsed: 93327 for 1000 iterations

Your result of 93 milliseconds per 1 million keys exactly matches the result of the benchmark here: http://jetbrains.github.io/xodus/#benchmarks, so it's in the expected ballpark. 93 ms per 1M keys is 93 ns per key; "very slow" compared to what? Your map contains 16 MB of payload and its total off-heap size is about 30 MB (FYI, you can check that with map.offHeapMemoryUsed()), which is much more than the L3 cache in consumer laptops, so iteration speed is bound by main-memory latency. Chronicle Map's iteration is mostly not sequential, so memory prefetching doesn't help. I've created an issue about this.
Also several notes about your code:
In your case the value size of the map is constant, so you should use constantValueSizeBySample(ByteBuffer.allocate(12)) instead of averageValueSize(). Even if the map's value size weren't constant, it's preferable to use averageValue() instead of averageValueSize(), because you cannot be sure how many bytes the serializers use for the values.
Your value looks like a good use case for a value interface with two fields; you already use a value interface as the key type (IntValue). A sketch follows this list.
Do your benchmarks using JMH.
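For illustration, here is a minimal sketch of the value-interface approach, assuming the Chronicle Map and Chronicle Values artifacts are on the classpath. The LongIntValue interface and its accessor names are my own invention, not part of the Chronicle API; Chronicle Values generates a fixed-size flyweight implementation for interfaces like this:

import net.openhft.chronicle.core.values.IntValue;
import net.openhft.chronicle.map.ChronicleMap;

public class ValueInterfaceSketch {

    // Hypothetical value interface mirroring the long+int payload.
    public interface LongIntValue {
        long getLongField();
        void setLongField(long v);
        int getIntField();
        void setIntField(int v);
    }

    public static void main(String[] args) {
        // The value size is constant by construction, so no averageValueSize()
        // or constantValueSizeBySample() hint is needed.
        ChronicleMap<IntValue, LongIntValue> map =
                ChronicleMap.of(IntValue.class, LongIntValue.class)
                        .name("test")
                        .entries(1_000_000)
                        .create();
        map.close();
    }
}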

Is Arrays.stream(array_name).sum() slower than iterative approach?

I was coding a leetcode problem : https://oj.leetcode.com/problems/gas-station/ using Java 8.
My solution got TLE when I used Arrays.stream(integer_array).sum() to compute the sum, while the same solution was accepted when I used iteration to calculate the sum of the array elements. The best possible time complexity for this problem is O(n), and I was surprised to get TLE when using the streaming APIs from Java 8, since my solution is O(n) as well.
import java.util.Arrays;

public class GasStation {
    public int canCompleteCircuit(int[] gas, int[] cost) {
        int start = 0, i = 0, runningCost = 0, totalGas = 0, totalCost = 0;
        totalGas = Arrays.stream(gas).sum();
        totalCost = Arrays.stream(cost).sum();
        // for (int item : gas) totalGas += item;
        // for (int item : cost) totalCost += item;
        if (totalGas < totalCost)
            return -1;
        while (start > i || (start == 0 && i < gas.length)) {
            runningCost += gas[i];
            if (runningCost >= cost[i]) {
                runningCost -= cost[i++];
            } else {
                runningCost -= gas[i];
                if (--start < 0)
                    start = gas.length - 1;
                runningCost += (gas[start] - cost[start]);
            }
        }
        return start;
    }

    public static void main(String[] args) {
        GasStation sol = new GasStation();
        int[] gas = new int[] { 10, 5, 7, 14, 9 };
        int[] cost = new int[] { 8, 5, 14, 3, 1 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
        gas = new int[] { 10 };
        cost = new int[] { 8 };
        System.out.println(sol.canCompleteCircuit(gas, cost));
    }
}
The solution gets accepted when,
I comment the following two lines: (calculating sum using streaming)
totalGas = Arrays.stream(gas).sum();
totalCost = Arrays.stream(cost).sum();
and uncomment the following two lines (calculating sum using iteration):
//for (int item : gas) totalGas += item;
//for (int item : cost) totalCost += item;
Now the solution gets accepted. Why is the Java 8 streaming API slower than iteration over primitives for large inputs?
The first step in dealing with problems like this is to bring the code into a controlled environment. That means running it in the JVM you control (and can invoke) and running tests inside a good benchmark harness like JMH. Analyze, don't speculate.
Here's a benchmark I whipped up using JMH to do some analysis on this:
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ArraySum {
    static final long SEED = -897234L;

    @Param({"1000000"})
    int sz;

    int[] array;

    @Setup
    public void setup() {
        Random random = new Random(SEED);
        array = new int[sz];
        Arrays.setAll(array, i -> random.nextInt());
    }

    @Benchmark
    public int sumForLoop() {
        int sum = 0;
        for (int a : array)
            sum += a;
        return sum;
    }

    @Benchmark
    public int sumStream() {
        return Arrays.stream(array).sum();
    }
}
Basically this creates an array of a million ints and sums them twice: once using a for-loop and once using streams. Running the benchmark produces a bunch of output (elided for brevity and for dramatic effect) but the summary results are below:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 3 514.473 398.512 us/op
ArraySum.sumStream 1000000 avgt 3 7355.971 3170.697 us/op
Whoa! That Java 8 streams stuff is teh SUXX0R! It's 14 times slower than a for-loop, don't use it!!!1!
Well, no. First let's go over these results, and then look more closely to see if we can figure out what's going on.
The summary shows the two benchmark methods, with the "sz" parameter of a million. It's possible to vary this parameter but it doesn't turn out to make a difference in this case. I also only ran the benchmark methods 3 times, as you can see from the "samples" column. (There were also only 3 warmup iterations, not visible here.) The score is in microseconds per operation, and clearly the stream code is much, much slower than the for-loop code. But note also the score error: that's the amount of variability in the different runs. JMH helpfully prints out the standard deviation of the results (not shown here) but you can easily see that the score error is a significant fraction of reported score. This reduces our confidence in the score.
Running more iterations should help. More warmup iterations will let the JIT do more work and settle down before running the benchmarks, and running more benchmark iterations will smooth out any errors from transient activity elsewhere on my system. So let's try 10 warmup iterations and 10 benchmark iterations:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 504.803 34.010 us/op
ArraySum.sumStream 1000000 avgt 10 7128.942 178.688 us/op
Performance is overall a little faster, and the measurement error is also quite a bit smaller, so running more iterations has had the desired effect. But the streams code is still considerably slower than the for-loop code. What's going on?
A large clue can be obtained by looking at the individual timings of the streams method:
# Warmup Iteration 1: 570.490 us/op
# Warmup Iteration 2: 491.765 us/op
# Warmup Iteration 3: 756.951 us/op
# Warmup Iteration 4: 7033.500 us/op
# Warmup Iteration 5: 7350.080 us/op
# Warmup Iteration 6: 7425.829 us/op
# Warmup Iteration 7: 7029.441 us/op
# Warmup Iteration 8: 7208.584 us/op
# Warmup Iteration 9: 7104.160 us/op
# Warmup Iteration 10: 7372.298 us/op
What happened? The first few iterations were reasonably fast, but then the 4th and subsequent iterations (and all the benchmark iterations that follow) were suddenly much slower.
I've seen this before. It was in this question and this answer elsewhere on SO. I recommend reading that answer; it explains how the JVM's inlining decisions in this case result in poorer performance.
A bit of background here: a for-loop compiles to a very simple increment-and-test loop, and can easily be handled by usual optimization techniques like loop peeling and unrolling. The streams code, while not very complex in this case, is actually quite complex compared to the for-loop code; there's a fair bit of setup, and each loop requires at least one method call. Thus, the JIT optimizations, particularly its inlining decisions, are critical to making the streams code go fast. And it's possible for it to go wrong.
Another background point is that integer summation is about the simplest possible operation you can think of to do in a loop or stream. This will tend to make the fixed overhead of stream setup look relatively more expensive. It is also so simple that it can trigger pathologies in the inlining policy.
The suggestion from the other answer was to add the JVM option -XX:MaxInlineLevel=12 to increase the amount of code that can be inlined. Rerunning the benchmark with that option gives:
Benchmark (sz) Mode Samples Score Score error Units
ArraySum.sumForLoop 1000000 avgt 10 502.379 27.859 us/op
ArraySum.sumStream 1000000 avgt 10 498.572 24.195 us/op
Ah, much nicer. Disabling tiered compilation using -XX:-TieredCompilation also had the effect of avoiding the pathological behavior. I also found that making the loop computation even a bit more expensive, e.g. summing squares of integers -- that is, adding a single multiply -- also avoids the pathological behavior.
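For reference, the sum-of-squares variant can be added to the ArraySum class above as one more benchmark method (a sketch; the method name is mine):

@Benchmark
public int sumStreamSquares() {
    // One extra multiply per element was enough to avoid the pathological
    // inlining behavior in my runs.
    return Arrays.stream(array).map(a -> a * a).sum();
}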
Now, your question is about running in the context of the leetcode environment, which seems to run the code in a JVM that you don't have any control over, so you can't change the inlining or compilation options. And you probably don't want to make your computation more complex to avoid the pathology either. So for this case, you might as well just stick to the good old for-loop. But don't be afraid to use streams, even for dealing with primitive arrays. It can perform quite well, aside from some narrow edge cases.
The normal iteration approach is going to be pretty much as fast as anything can be, but streams have a variety of overheads: even though the data comes directly from an array, there's probably going to be a primitive Spliterator involved and lots of other objects being generated.
In general, you should expect the "normal approach" to usually be faster than streams unless you're both using parallelization and your data is very large.
My benchmark (see code below) shows that the streaming approach is about 10-15% slower than the iterative one. Interestingly enough, the parallel stream results vary greatly on my 4-core (i7) MacBook Pro: while I have seen them a few times being about 30% faster than the iterative version, the most common result is almost three times slower than the sequential stream.
Here is the benchmark code:
import java.util.*;
import java.util.function.*;

public class StreamingBenchmark {
    private static void benchmark(String name, LongSupplier f) {
        long start = System.currentTimeMillis(), sum = 0;
        for (int count = 0; count < 1000; count++) sum += f.getAsLong();
        System.out.println(String.format(
                "%10s in %d millis. Sum = %d",
                name, System.currentTimeMillis() - start, sum
        ));
    }

    public static void main(String argv[]) {
        int data[] = new int[1000000];
        Random randy = new Random();
        for (int i = 0; i < data.length; i++) data[i] = randy.nextInt();
        benchmark("iterative", () -> { int s = 0; for (int n : data) s += n; return s; });
        benchmark("stream", () -> Arrays.stream(data).sum());
        benchmark("parallel", () -> Arrays.stream(data).parallel().sum());
    }
}
Here is the output from a few runs:
iterative in 350 millis. Sum = 564821058000
stream in 394 millis. Sum = 564821058000
parallel in 883 millis. Sum = 564821058000
iterative in 340 millis. Sum = -295411382000
stream in 376 millis. Sum = -295411382000
parallel in 1031 millis. Sum = -295411382000
iterative in 365 millis. Sum = 1205763898000
stream in 379 millis. Sum = 1205763898000
parallel in 1053 millis. Sum = 1205763898000
etc.
This got me curious, and I also tried running equivalent logic in scala:
object Scarr {
  def main(argv: Array[String]) = {
    val randy = new java.util.Random
    val data = (1 to 1000000).map { _ => randy.nextInt }.toArray
    val start = System.currentTimeMillis
    var sum = 0L
    for (_ <- 1 to 1000) sum += data.sum
    println(sum + " in " + (System.currentTimeMillis - start) + " millis.")
  }
}
This took 14 seconds! About 40 times(!) longer than streaming in java. Ouch!
The sum() method is literally implemented as return reduce(0, Integer::sum); in OpenJDK. Over a large array, all those method calls add more overhead than the basic by-hand for-loop iteration; the bytecode for the for (int i : numbers) iteration is only very slightly more complicated than that generated by the by-hand for-loop. The stream operation can be faster in parallel-friendly environments (though maybe not for operations this simple on primitives), but unless we know the environment is parallel-friendly, we shouldn't count on it; leetcode is probably designed to favor low-level code over abstractions, since it measures efficiency rather than legibility.
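To make the equivalence concrete, here is a minimal sketch (the array is illustrative):

import java.util.Arrays;

public class SumEquivalence {
    public static void main(String[] args) {
        int[] numbers = { 1, 2, 3, 4 };
        // sum() delegates to reduce() in the OpenJDK IntPipeline
        // implementation, so these two stream calls do the same work.
        int viaSum = Arrays.stream(numbers).sum();
        int viaReduce = Arrays.stream(numbers).reduce(0, Integer::sum);
        // The by-hand loop computes the same value with plain bytecode.
        int viaLoop = 0;
        for (int n : numbers) viaLoop += n;
        System.out.println(viaSum + " " + viaReduce + " " + viaLoop); // 10 10 10
    }
}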
The sum operation done in any of the three ways (Arrays.stream(ints).sum(), for (int i : ints) { total += i; }, and for (int i = 0; i < ints.length; i++) { total += ints[i]; }) should be relatively similar in efficiency. I used the following test class, which sums a hundred million integers between 0 and 4096 a hundred times each and records the average times. All of them returned in very similar timeframes. It even attempts to limit parallel processing by occupying all but one of the available cores in busy-spin loops, but I still found no particular difference:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.ToLongFunction;
import java.util.stream.IntStream;

public class SumTester {
    private static final int ARRAY_SIZE = 100_000_000;
    private static final int ITERATION_LIMIT = 100;
    private static final int INT_VALUE_LIMIT = 4096;

    public static void main(String[] args) {
        Random random = new Random();
        int[] numbers = new int[ARRAY_SIZE];
        IntStream.range(0, ARRAY_SIZE).forEach(i -> numbers[i] = random.nextInt(INT_VALUE_LIMIT));
        Map<String, ToLongFunction<int[]>> inputs = new HashMap<String, ToLongFunction<int[]>>();
        NanoTimer initializer = NanoTimer.start();
        System.out.println("initialized NanoTimer in " + initializer.microEnd() + " microseconds");

        inputs.put("sumByStream", SumTester::sumByStream);
        inputs.put("sumByIteration", SumTester::sumByIteration);
        inputs.put("sumByForLoop", SumTester::sumByForLoop);

        System.out.println("Parallelables: ");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        int cores = Runtime.getRuntime().availableProcessors();
        List<CancelableThreadEater> threadEaters = new ArrayList<CancelableThreadEater>();
        if (cores > 1) {
            threadEaters = occupyThreads(cores - 1);
        }
        // Only one core should be left to our class
        System.out.println("\nSingleCore (" + threadEaters.size() + " of " + cores + " cores occupied)");
        averageTimeFor(ITERATION_LIMIT, inputs, Arrays.copyOf(numbers, numbers.length));

        for (CancelableThreadEater cte : threadEaters) {
            cte.end();
        }
        System.out.println("Complete!");
    }

    public static long sumByStream(int[] numbers) {
        return Arrays.stream(numbers).sum();
    }

    public static long sumByIteration(int[] numbers) {
        int total = 0;
        for (int i : numbers) {
            total += i;
        }
        return total;
    }

    public static long sumByForLoop(int[] numbers) {
        int total = 0;
        for (int i = 0; i < numbers.length; i++) {
            total += numbers[i];
        }
        return total;
    }

    public static void averageTimeFor(int iterations, Map<String, ToLongFunction<int[]>> testMap, int[] numbers) {
        Map<String, Long> durationMap = new HashMap<String, Long>();
        Map<String, Long> sumMap = new HashMap<String, Long>();
        for (String methodName : testMap.keySet()) {
            durationMap.put(methodName, 0L);
            sumMap.put(methodName, 0L);
        }
        for (int i = 0; i < iterations; i++) {
            for (String methodName : testMap.keySet()) {
                int[] newNumbers = Arrays.copyOf(numbers, ARRAY_SIZE);
                ToLongFunction<int[]> function = testMap.get(methodName);
                NanoTimer nt = NanoTimer.start();
                long sum = function.applyAsLong(newNumbers);
                long duration = nt.microEnd();
                sumMap.put(methodName, sum);
                durationMap.put(methodName, durationMap.get(methodName) + duration);
            }
        }
        for (String methodName : testMap.keySet()) {
            long duration = durationMap.get(methodName) / iterations;
            long sum = sumMap.get(methodName);
            System.out.println(methodName + ": result '" + sum + "', elapsed time: " + duration
                    + " microseconds on average over " + iterations + " iterations");
        }
    }

    private static List<CancelableThreadEater> occupyThreads(int numThreads) {
        List<CancelableThreadEater> result = new ArrayList<CancelableThreadEater>();
        for (int i = 0; i < numThreads; i++) {
            CancelableThreadEater cte = new CancelableThreadEater();
            result.add(cte);
            new Thread(cte).start();
        }
        return result;
    }

    // Minimal stand-in for the NanoTimer helper, which was not shown in the
    // original post: starts on creation and reports elapsed microseconds.
    private static class NanoTimer {
        private final long startNanos = System.nanoTime();
        static NanoTimer start() { return new NanoTimer(); }
        long microEnd() { return (System.nanoTime() - startNanos) / 1000; }
    }

    private static class CancelableThreadEater implements Runnable {
        // A volatile flag instead of synchronizing on a Boolean field: the
        // original synchronized on 'stop', which is reassigned in end() and
        // so is not a stable lock object.
        private volatile boolean stop = false;

        public void run() {
            while (!stop) {
                // busy-spin to keep the core occupied
            }
        }

        public void end() {
            stop = true;
        }
    }
}
which returned
initialized NanoTimer in 22 microseconds
Parallelables:
sumByIteration: result '-1413860413', elapsed time: 35844 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 35414 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 35218 microseconds on average over 100 iterations
SingleCore (3 of 4 cores occupied)
sumByIteration: result '-1413860413', elapsed time: 37010 microseconds on average over 100 iterations
sumByStream: result '-1413860413', elapsed time: 38375 microseconds on average over 100 iterations
sumByForLoop: result '-1413860413', elapsed time: 37990 microseconds on average over 100 iterations
Complete!
That said, there's no real reason to do the sum() operation separately in this case. You end up iterating through each array, for a total of three iterations, and the last one may be longer than a normal iteration. It's possible to calculate the result correctly with one full simultaneous iteration of both arrays plus one short-circuiting iteration. It may be possible to do it even more efficiently, but I couldn't figure out a better way than I did. My solution ended up being one of the fastest Java solutions on the chart: it ran in 223 ms, which put it in among the middle pack of the Python solutions.
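For illustration, here is a sketch of the one-pass idea (not necessarily the exact solution I submitted; the variable names are mine):

// Single simultaneous pass: computes the feasibility check (total) and the
// candidate start index in one loop.
public static int canCompleteCircuit(int[] gas, int[] cost) {
    int total = 0; // net gas over the whole circuit; negative means impossible
    int tank = 0;  // running balance since the current candidate start
    int start = 0;
    for (int i = 0; i < gas.length; i++) {
        int diff = gas[i] - cost[i];
        total += diff;
        tank += diff;
        if (tank < 0) {    // the candidate start cannot reach station i + 1
            start = i + 1; // restart the candidate just past the failure
            tank = 0;
        }
    }
    return total < 0 ? -1 : start;
}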
I'll add my solution to the problem if you care to see it, but I hope the actual question is answered here.
Stream functions are relatively slow, so during a leetcode contest (or any algorithms contest), always prefer classic loops over stream functions, as large inputs are prone to TLE. A TLE can in turn cause a penalty, which would affect your final ranking.
A detailed explanation is mentioned here https://stackoverflow.com/a/27994074/6185191
I also came across this issue while doing a pretty basic LeetCode problem. My first submission used the Java Stream API's Arrays.stream().sum() to compute the array sum, which gave a time of 6 ms.
The classic for loop took just 1 ms to iterate through the same array. Now that's insane! The Stream API method takes at least 6x the time of a simple for loop here. So yeah, always go with the simpler and classic method.

How do apps measure CPU usage (as a %)?

So I'm trying to write an app that measures CPU usage (i.e., the time the CPU is working vs. the time it isn't). I've done some research, but unfortunately there are a bunch of different opinions on how it should be done.
These different solutions include, but aren't limited to:
Get Memory Usage in Android
and
http://juliano.info/en/Blog:Memory_Leak/Understanding_the_Linux_load_average
I've tried writing some code myself that I thought might do the trick, because the links above don't take into consideration when the core is off (or do they?):
long[][] cpuUseVal = { { 2147483647, 0 }, { 2147483647, 0 }, { 2147483647, 0 },
        { 2147483647, 0 }, { 2147483647, 0 } };

public float[] readCPUUsage(int coreNum) {
    int j = 1;
    String[] entries; // Array to hold entries in the /proc/stat file
    int cpu_work;
    float percents[] = new float[5];
    Calendar c = Calendar.getInstance();
    // Write the dataPackage
    long currentTime = c.getTime().getTime();
    for (int i = 0; i <= coreNum; i++) {
        try {
            // Point the app to the file where CPU values are located
            RandomAccessFile reader = new RandomAccessFile("/proc/stat", "r");
            String load = reader.readLine();
            while (j <= i) {
                load = reader.readLine();
                j++;
            }
            // Reset j for use later in the loop
            j = 1;
            entries = load.split("[ ]+");
            // Pull the CPU working time from the file
            cpu_work = Integer.parseInt(entries[1]) + Integer.parseInt(entries[2]) + Integer.parseInt(entries[3])
                    + Integer.parseInt(entries[6]) + Integer.parseInt(entries[6]) + Integer.parseInt(entries[7]);
            reader.close();
            percents[i] = (float) (cpu_work - cpuUseVal[i][1]) / (currentTime - cpuUseVal[i][0]);
            cpuUseVal[i][0] = currentTime;
            cpuUseVal[i][1] = cpu_work;
        // In case of an error, print a stack trace
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
    // Return the array holding the usage values for the CPU, and all cores
    return percents;
}
So here is the idea of the code I wrote: I have a global array with some dummy values that should return negative percentages the first time the function is run. The values are stored in a database, so I know to disregard anything negative. Anyway, the function runs, getting the amounts of time the CPU has spent doing certain things and comparing them to the values from the last time the function was run (with the help of the global array). These deltas are divided by the amount of wall-clock time that has passed between the function runs (with the help of the Calendar).
I've downloaded some of the existing cpu usage monitors and compared them to values I get from my app, and mine are never even close to what they get. Can someone explain what I'm doing wrong?
Thanks to some help I have changed my function to look like the following, hope this helps others who have this question
// Function to read values from /proc/stat and do computations to compute CPU %
public float[] readCPUUsage(int coreNum) {
    int j = 1;
    String[] entries;
    int cpu_total;
    int cpu_work;
    float percents[] = new float[5];
    for (int i = 0; i <= coreNum; i++) {
        try {
            // Point the app to the file where CPU values are located
            RandomAccessFile reader = new RandomAccessFile("/proc/stat", "r");
            String load = reader.readLine();
            // Loop to read down to the line that corresponds to the core
            // whose values we are trying to read
            while (j <= i) {
                load = reader.readLine();
                j++;
            }
            // Reset j for use later in the loop
            j = 1;
            // Break the line into separate array elements. The end of each
            // element is determined by any number of spaces
            entries = load.split("[ ]+");
            // Pull the CPU total on-time and "working time" from the file
            cpu_total = Integer.parseInt(entries[1])
                    + Integer.parseInt(entries[2])
                    + Integer.parseInt(entries[3])
                    + Integer.parseInt(entries[4])
                    + Integer.parseInt(entries[5])
                    + Integer.parseInt(entries[6])
                    + Integer.parseInt(entries[7]);
            cpu_work = Integer.parseInt(entries[1])
                    + Integer.parseInt(entries[2])
                    + Integer.parseInt(entries[3])
                    + Integer.parseInt(entries[6])
                    + Integer.parseInt(entries[7]);
            reader.close();
            // If it was off the whole time, say 0
            if ((cpu_total - cpuUseVal[i][0]) == 0)
                percents[i] = 0;
            // If it was on for any amount of time, compute the %
            else
                percents[i] = (float) (cpu_work - cpuUseVal[i][1])
                        / (cpu_total - cpuUseVal[i][0]);
            // Save the values measured for future comparison
            cpuUseVal[i][0] = cpu_total;
            cpuUseVal[i][1] = cpu_work;
        // In case of an error, print a stack trace
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
    // Return the array holding the usage values for the CPU, and all cores
    return percents;
}
Apps don't measure CPU usage; the kernel does, by interrupting the process 100 times per second (or at some other frequency, depending on how the kernel is tuned) and incrementing a counter corresponding to what it was doing when interrupted:
If in the process => increment the user counter.
If in the kernel => increment the system counter.
If waiting for disk or network or a device => increment the waiting-for-IO counter.
Otherwise => increment the idle counter.
The load average (the numbers reported by uptime) is determined by the decaying average length of the run queue, i.e., how many threads are waiting to run; the first number is the average over the last minute. You can get the load average via JMX, for example:
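A minimal sketch of reading it on a standard JVM (Android may not expose this):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverage {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // One-minute load average; returns a negative value if the
        // platform does not provide one.
        double load = os.getSystemLoadAverage();
        System.out.println("1-minute load average: " + load);
    }
}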

Running time of the same code blocks is different in java. why is that? [duplicate]

This question already has answers here:
Java benchmarking - why is the second loop faster?
(6 answers)
Closed 9 years ago.
I had the code below. I just wanted to check the running time of a code block, and I had mistakenly copied and pasted the same code again and got an interesting result. Though the code blocks are the same, their running times are different, and code block 1 takes more time than the others. If I switch the code blocks (say, move code block 4 to the top), then code block 4 takes more time than the others.
I used two different types of arrays in my code blocks to check whether it depends on that, and the result is the same: if the code blocks contain the same type of array, the topmost code block takes more time. See the code and the output below.
import java.util.Arrays;

public class ABBYtest {
    public static void main(String[] args) {
        long startTime;
        long endTime;

        // code block 1
        startTime = System.nanoTime();
        Long a[] = new Long[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = 12L;
        }
        Arrays.sort(a);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 1 = " + (endTime - startTime));

        // code block 6
        startTime = System.nanoTime();
        Long aa[] = new Long[10];
        for (int i = 0; i < aa.length; i++) {
            aa[i] = 12L;
        }
        Arrays.sort(aa);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 6 = " + (endTime - startTime));

        // code block 7
        startTime = System.nanoTime();
        Long aaa[] = new Long[10];
        for (int i = 0; i < aaa.length; i++) {
            aaa[i] = 12L;
        }
        Arrays.sort(aaa);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 7 = " + (endTime - startTime));

        // code block 2
        startTime = System.nanoTime();
        long c[] = new long[10];
        for (int i = 0; i < c.length; i++) {
            c[i] = 12L;
        }
        Arrays.sort(c);
        endTime = System.nanoTime();
        System.out.println("code block (has long array) 2 = " + (endTime - startTime));

        // code block 3
        startTime = System.nanoTime();
        long d[] = new long[10];
        for (int i = 0; i < d.length; i++) {
            d[i] = 12L;
        }
        Arrays.sort(d);
        endTime = System.nanoTime();
        System.out.println("code block (has long array) 3 = " + (endTime - startTime));

        // code block 4
        startTime = System.nanoTime();
        long b[] = new long[10];
        for (int i = 0; i < b.length; i++) {
            b[i] = 12L;
        }
        Arrays.sort(b);
        endTime = System.nanoTime();
        System.out.println("code block (has long array) 4 = " + (endTime - startTime));

        // code block 5
        startTime = System.nanoTime();
        Long e[] = new Long[10];
        for (int i = 0; i < e.length; i++) {
            e[i] = 12L;
        }
        Arrays.sort(e);
        endTime = System.nanoTime();
        System.out.println("code block (has Long array) 5 = " + (endTime - startTime));
    }
}
The running times:
code block (has Long array) 1 = 802565
code block (has Long array) 6 = 6158
code block (has Long array) 7 = 4619
code block (has long array) 2 = 171906
code block (has long array) 3 = 4105
code block (has long array) 4 = 3079
code block (has Long array) 5 = 8210
As we can see, the first code block containing a Long array takes more time than the other blocks containing Long arrays, and the same holds for the first code block containing a long array.
Can anyone explain this behavior, or am I making a mistake here?
Faulty benchmarking. A non-exhaustive list of what is wrong:
No warmup: single-shot measurements are almost always wrong.
Mixing several codepaths in a single method: we probably start compiling the method with the execution data available only for the first loop in the method.
Sources are predictable: should the loop compile, we can actually predict the result.
Results are dead-code eliminated: should the loop compile, we can throw the loop away.
Here is how to do it arguably right with JMH:
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3, time = 1)
@Fork(10)
@State(Scope.Thread)
public class Longs {
    public static final int COUNT = 10;

    private Long[] refLongs;
    private long[] primLongs;

    /*
     * Implementation notes:
     *   - copying the array from the field keeps the constant
     *     optimizations away, but we are implicitly counting the
     *     costs of arraycopy() in;
     *   - two additional baseline experiments quantify the
     *     scale of the arraycopy effects (note you can't directly
     *     subtract the baseline scores from the test scores, because
     *     the code is mixed together);
     *   - the resulting arrays are always fed back into JMH
     *     to prevent dead-code elimination
     */

    @Setup
    public void setup() {
        primLongs = new long[COUNT];
        for (int i = 0; i < COUNT; i++) {
            primLongs[i] = 12L;
        }
        refLongs = new Long[COUNT];
        for (int i = 0; i < COUNT; i++) {
            refLongs[i] = 12L;
        }
    }

    @GenerateMicroBenchmark
    public long[] prim_baseline() {
        long[] d = new long[COUNT];
        System.arraycopy(primLongs, 0, d, 0, COUNT);
        return d;
    }

    @GenerateMicroBenchmark
    public long[] prim_sort() {
        long[] d = new long[COUNT];
        System.arraycopy(primLongs, 0, d, 0, COUNT);
        Arrays.sort(d);
        return d;
    }

    @GenerateMicroBenchmark
    public Long[] ref_baseline() {
        Long[] d = new Long[COUNT];
        System.arraycopy(refLongs, 0, d, 0, COUNT);
        return d;
    }

    @GenerateMicroBenchmark
    public Long[] ref_sort() {
        Long[] d = new Long[COUNT];
        System.arraycopy(refLongs, 0, d, 0, COUNT);
        Arrays.sort(d);
        return d;
    }
}
...this yields:
Benchmark Mode Samples Mean Mean error Units
o.s.Longs.prim_baseline avgt 30 19.604 0.327 ns/op
o.s.Longs.prim_sort avgt 30 51.217 1.873 ns/op
o.s.Longs.ref_baseline avgt 30 16.935 0.087 ns/op
o.s.Longs.ref_sort avgt 30 25.199 0.430 ns/op
At this point you may start to wonder why sorting Long[] and sorting long[] take different times. The answer lies in the Arrays.sort() overloads: OpenJDK sorts primitive and reference arrays via different algorithms (references with TimSort, primitives with dual-pivot quicksort). Here's the highlight of choosing another algorithm with -Djava.util.Arrays.useLegacyMergeSort=true, which falls back to merge sort for references:
Benchmark Mode Samples Mean Mean error Units
o.s.Longs.prim_baseline avgt 30 19.675 0.291 ns/op
o.s.Longs.prim_sort avgt 30 50.882 1.550 ns/op
o.s.Longs.ref_baseline avgt 30 16.742 0.089 ns/op
o.s.Longs.ref_sort avgt 30 64.207 1.047 ns/op
Hope that helps to explain the difference.
The explanation above barely scratches the surface of sorting performance. The performance is very different when presented with different source data (including available pre-sorted subsequences, their patterns and run lengths, and the size of the data itself).
Can anyone explain this behavior, or am I making a mistake here?
Your problem is a badly written benchmark. You do not take account of JVM warmup effects: things like the overheads of loading code, initial expansion of the heap, and JIT compilation. In addition, startup of an application always generates extra garbage that needs to be collected.
In addition, if your application itself generates garbage (and I expect that sort and/or println do), then you need to take account of possible GC runs during the "steady state" phase of your benchmark application's run.
See this Q&A for hints on how to write valid Java benchmarks:
How do I write a correct micro-benchmark in Java?
There are numerous other articles on this. Google for "how to write a java benchmark".
In this example, I suspect that the first code block takes so much longer than the rest because of (initially) bytecode interpretation followed by the overhead of JIT compilation. You may well be garbage collecting to deal with temporary objects created during loading and JIT compilation. The high value for the 4th measurement is most likely due to another garbage collection cycle.
However, one would need to turn on some JVM logging to figure out the real cause.
Just to add to what everyone else is saying: Java will not necessarily compile everything. When it analyzes the code for optimization, Java will quite often choose to keep interpreting code that is not used extensively. If you look at the bytecode, your Long arrays should always take more time (and certainly more space) than your long arrays, but as has been pointed out, warmup effects will also play a role.
This could be due to a few things:
As noted by syrion, Java's virtual machine is allowed to perform optimizations on your code as it is running. Your first block is likely taking longer because Java hasn't yet optimized your code fully. As the first block runs, the JVM is applying changes which can then be utilized in the other blocks.
Your processor could be caching the results of your code, speeding up future blocks. This is similar to the previous point, but can vary even between identical JVM implementations.
While your program is running, your computer is also performing other tasks. These include handling the OS's UI, checking for program updates, etc. For this reason, some blocks of code can be slower than others, because your computer isn't concentrating as much resources towards its execution.
Java's virtual machine is garbage collected. That is to say, at unspecified points during your program's execution, the JVM takes some time to clean up any objects that are no longer used.
Points 1 and 2 are likely the cause for the large difference in the first block's execution time. Point 3 could be the reason for the smaller fluctuations, and point 4, as noted by Stephen, probably caused the large stall in block 3.
Another thing worth noting is your use of both long and Long. The boxed form carries a larger memory overhead, and the two are subject to different optimizations.

why it is so slow with 100,000 records when using pipeline in redis?

It is said that pipelining is the better way when many sets/gets are required in redis, so this is my test code:
public class TestPipeline {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        JedisShardInfo si = new JedisShardInfo("127.0.0.1", 6379);
        List<JedisShardInfo> list = new ArrayList<JedisShardInfo>();
        list.add(si);
        ShardedJedis jedis = new ShardedJedis(list);
        long startTime = System.currentTimeMillis();
        ShardedJedisPipeline pipeline = jedis.pipelined();
        for (int i = 0; i < 100000; i++) {
            Map<String, String> map = new HashMap<String, String>();
            map.put("id", "" + i);
            map.put("name", "lyj" + i);
            pipeline.hmset("m" + i, map);
        }
        pipeline.sync();
        long endTime = System.currentTimeMillis();
        System.out.println(endTime - startTime);
    }
}
When I ran it, there was no response from this program for a while, but when I don't use the pipeline, it takes only 20073 ms, so I am confused why it is actually better without the pipeline, and by such a wide gap!
Thanks for the answer; a few questions: how do you calculate the 6 MB of data?
When I send 10K items, the pipeline is always faster than the normal mode, but with 100K the pipeline gives no response. I think 100-1000 operations per pipeline is an advisable choice, as said below. Is there anything more to the JIT aspect? I don't understand it.
There are a few points you need to consider before writing such a benchmark (and especially a benchmark using the JVM):
on most (physical) machines, Redis is able to process more than 100K ops/s when pipelining is used. Your benchmark only deals with 100K items, so it does not last long enough to produce meaningful results. Furthermore, there is no time for the successive stages of the JIT to kick in.
the absolute time is not a very relevant metric. Displaying the throughput (i.e. the number of operation per second) while keeping the benchmark running for at least 10 seconds would be a better and more stable metric.
your inner loop generates a lot of garbage. If you plan to benchmark Jedis+Redis, then you need to keep the overhead of your own program low.
because you have defined everything into the main function, your loop will not be compiled by the JIT (depending on the JVM you use). Only the inner method calls may be. If you want the JIT to be efficient, make sure to encapsulate your code into methods that can be compiled by the JIT.
optionally, you may want to add a warm-up phase before performing the actual measurement to avoid accounting the overhead of running the first iterations with the bare-bone interpreter, and the cost of the JIT itself.
Now, regarding Redis pipelining, your pipeline is way too long. 100K commands in the pipeline means Jedis has to build a 6MB buffer before sending anything to Redis. It means the socket buffers (on client side, and perhaps server-side) will be saturated, and that Redis will have to deal with 6 MB communication buffers as well.
Furthermore, your benchmark is still synchronous (using a pipeline does not magically make it asynchronous). In other words, Jedis will not start reading replies until the last query of your pipeline has been sent to Redis. When the pipeline is too long, it has the potential to block things.
Consider limiting the size of the pipeline to 100-1000 operations. Of course, it will generate more roundtrips, but the pressure on the communication stack will be reduced to an acceptable level. For instance, consider the following program:
import redis.clients.jedis.*;
import java.util.*;

public class TestPipeline {
    /**
     * @param args
     */
    int i = 0;
    Map<String, String> map = new HashMap<String, String>();
    ShardedJedis jedis;

    // Number of iterations
    // Use 1000 to test with the pipeline, 100 otherwise
    static final int N = 1000;

    public TestPipeline() {
        JedisShardInfo si = new JedisShardInfo("127.0.0.1", 6379);
        List<JedisShardInfo> list = new ArrayList<JedisShardInfo>();
        list.add(si);
        jedis = new ShardedJedis(list);
    }

    public void push(int n) {
        ShardedJedisPipeline pipeline = jedis.pipelined();
        for (int k = 0; k < n; k++) {
            map.put("id", "" + i);
            map.put("name", "lyj" + i);
            pipeline.hmset("m" + i, map);
            ++i;
        }
        pipeline.sync();
    }

    public void push2(int n) {
        for (int k = 0; k < n; k++) {
            map.put("id", "" + i);
            map.put("name", "lyj" + i);
            jedis.hmset("m" + i, map);
            ++i;
        }
    }

    public static void main(String[] args) {
        TestPipeline obj = new TestPipeline();
        long startTime = System.currentTimeMillis();
        for (int j = 0; j < N; j++) {
            // Use push2 instead to test without pipeline
            obj.push(1000);
            // Uncomment to see the acceleration
            //System.out.println(obj.i);
        }
        long endTime = System.currentTimeMillis();
        double d = 1000.0 * obj.i;
        d /= (double) (endTime - startTime);
        System.out.println("Throughput: " + d);
    }
}
With this program, you can test with or without pipelining. Be sure to increase the number of iterations (the N parameter) when pipelining is used, so that it runs for at least 10 seconds. If you uncomment the println in the loop, you will see that the program is slow at the beginning and gets quicker as the JIT starts to optimize things (that's why the program should run for at least several seconds to give a meaningful result).
On my hardware (an old Athlon box), I can get 8-9 times more throughput when the pipeline is used. The program could be further improved by optimizing key/value formatting in the inner loop and by adding a warm-up phase.
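For example, the warm-up could be as simple as this sketch (a hypothetical helper; the batch count of 50 is arbitrary):

// Run some pipelined batches before starting the clock so the JIT has
// compiled the hot paths; main() would call warmUp(obj) before taking
// startTime. The extra keys written during warm-up are harmless here.
public static void warmUp(TestPipeline obj) {
    for (int j = 0; j < 50; j++) {
        obj.push(1000);
    }
}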

Moving from EDU to java.util.concurrent cuts the performance twice

Cross-post from http://forums.oracle.com/forums/thread.jspa?threadID=2195025&tstart=0
There is a telecom application server (JAIN SLEE based) and the application running in it.
The application is receiving a message from the network, processes it and sends back to the network a response.
The requirement for request/response latency is 250 ms for 95% of calls and 3000 ms for 99.999% of calls.
We use EDU.oswego.cs.dl.util.concurrent.ConcurrentHashMap, 1 instance. For one call (one call is several messages) processing the following methods are invoked:
"put", "get", "get", "get", then in 180 seconds "remove".
There are 4 threads which invoke these methods.
(A small note: working with the ConcurrentHashMap is not the only activity. For each network message there are also a lot of other activities: protocol message parsing, querying a DB, writing an SDR into a file, and creating short-lived and long-lived objects.)
When we move from EDU.oswego.cs.dl.util.concurrent.ConcurrentHashMap to java.util.concurrent.ConcurrentHashMap, we see a performance degradation from 1400 to 800 calls per second.
At that reduced rate of 800 calls per second, the first bottleneck is latency, which no longer meets the requirement above.
This performance degradation is reproduced on hosts with the following CPUs:
2 x Quad-Core AMD Opteron 2356, 2312 MHz, 8 HW threads in total;
2 x Intel Xeon E5410, 2.33 GHz, 8 HW threads in total.
It is not reproduced on the X5570 CPU (Intel Xeon Nehalem X5570, 2.93 GHz, 16 HW threads in total).
Did anybody face similar issues? How to solve them?
I assume you are talking about nanoseconds rather than milliseconds. (That is one million times smaller!)
OR the use of ConcurrentHashMap is a trivial portion of your delay.
EDIT: I have edited the example to be multi-threaded, using 100 tasks.
/*
Average operation time for a map of 10,000,000 was 48 ns
Average operation time for a map of 5,000,000 was 51 ns
Average operation time for a map of 2,500,000 was 48 ns
Average operation time for a map of 1,250,000 was 46 ns
Average operation time for a map of 625,000 was 45 ns
Average operation time for a map of 312,500 was 44 ns
Average operation time for a map of 156,200 was 38 ns
Average operation time for a map of 78,100 was 34 ns
Average operation time for a map of 39,000 was 35 ns
Average operation time for a map of 19,500 was 37 ns
*/
public static void main(String... args) {
    ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
        for (int size = 100000; size >= 100; size /= 2)
            test(es, size);
    } finally {
        es.shutdown();
    }
}

private static void test(ExecutorService es, final int size) {
    int tasks = 100;
    final ConcurrentHashMap<Integer, String> map = new ConcurrentHashMap<Integer, String>(tasks * size);
    List<Future> futures = new ArrayList<Future>();
    long start = System.nanoTime();
    for (int j = 0; j < tasks; j++) {
        final int offset = j * size;
        futures.add(es.submit(new Runnable() {
            public void run() {
                for (int i = 0; i < size; i++)
                    map.put(offset + i, "" + i);
                int total = 0;
                for (int j = 0; j < 10; j++)
                    for (int i = 0; i < size; i++)
                        total += map.get(offset + i).length();
                for (int i = 0; i < size; i++)
                    map.remove(offset + i);
            }
        }));
    }
    try {
        for (Future future : futures)
            future.get();
    } catch (Exception e) {
        throw new AssertionError(e);
    }
    long time = System.nanoTime() - start;
    System.out.printf("Average operation time for a map of %,d was %,d ns%n", size * tasks, time / tasks / 12 / size);
}
At first, did you check that the hash map is indeed the culprit? Assuming that you did: there is a lock-free hash map designed to scale to hundreds of processors without introducing a lot of contention. It's authored by Cliff Click, a well-known engineer on the original HotSpot compiler team, now working on scaling the JDK to machines with hundreds of CPUs. So I assume he knows what he is doing in that hash map implementation. More info about this hash map can be found in these slides.
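If you want to try it, the usage is close to a drop-in replacement for ConcurrentHashMap; a minimal sketch, assuming the high-scale-lib jar is on the classpath:

import java.util.concurrent.ConcurrentMap;
import org.cliffc.high_scale_lib.NonBlockingHashMap;

public class NonBlockingSketch {
    public static void main(String[] args) {
        // NonBlockingHashMap implements ConcurrentMap, so the put/get/remove
        // pattern from the call processing code maps over directly.
        ConcurrentMap<String, Object> calls = new NonBlockingHashMap<String, Object>();
        calls.put("call-1", new Object());
        Object session = calls.get("call-1");
        calls.remove("call-1");
    }
}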
Have you tried changing the concurrencyLevel in the ConcurrentHashMap? Try some lower values like 8, and try some bigger values. And remember that the performance and concurrency of ConcurrentHashMap depend on the quality of your hashCode function.
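For example (a sketch; the capacity and levels are placeholders to experiment with):

import java.util.concurrent.ConcurrentHashMap;

public class ConcurrencyLevelSketch {
    public static void main(String[] args) {
        // The third constructor argument is the concurrency level: the
        // estimated number of concurrently updating threads (default 16).
        // With 4 writer threads, try values around 4-8, then larger ones.
        ConcurrentHashMap<String, Object> lower =
                new ConcurrentHashMap<String, Object>(1024, 0.75f, 8);
        ConcurrentHashMap<String, Object> higher =
                new ConcurrentHashMap<String, Object>(1024, 0.75f, 64);
    }
}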
And yes, java.util.concurrent.ConcurrentHashMap has the same origin (Doug Lea from edu.oswego) as edu.oswego.cs.dl..., but it was totally rewritten by him so it can scale better.
I think it may be good for you to check out the Javolution FastMap; it may be better suited for real-time applications. A sketch follows.
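A minimal sketch; FastMap implements java.util.Map, so the basic operations below are safe, but how you enable thread-safe/shared mode depends on the Javolution version (setShared(true) in 5.x, shared() in 6.x), so treat that part as an assumption:

import java.util.Map;
import javolution.util.FastMap;

public class FastMapSketch {
    public static void main(String[] args) {
        // Plain Map usage; enable the version-appropriate shared mode for
        // concurrent access (see the note above).
        Map<String, Object> calls = new FastMap<String, Object>();
        calls.put("call-1", new Object());
        calls.get("call-1");
        calls.remove("call-1");
    }
}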
