Fastest way to strip all non-printable characters from a Java String - java

What is the fastest way to strip all non-printable characters from a String in Java?
So far I've tried and measured on a 138-byte, 131-character String:
String's replaceAll() - slowest method
517009 results / sec
Precompile a Pattern, then use Matcher's replaceAll()
637836 results / sec
Use StringBuffer, get codepoints using codepointAt() one-by-one and append to StringBuffer
711946 results / sec
Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
1052964 results / sec
Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
2022653 results / sec
Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
2502502 results / sec
Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
857485 results / sec
Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
791076 results / sec
Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
370164 results / sec
My best try was the following:
char[] oldChars = new char[s.length()];
s.getChars(0, s.length(), oldChars, 0);
char[] newChars = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
char ch = oldChars[j];
if (ch >= ' ') {
newChars[newLen] = ch;
newLen++;
}
}
s = new String(newChars, 0, newLen);
Any thoughts on how to make it even faster?
Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?
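For reference, the precompiled-Pattern variant from the list above looks roughly like this (a minimal sketch; the character class [\x00-\x1F] mirrors the ch >= ' ' test used in the hand-rolled versions):

```java
import java.util.regex.Pattern;

public class PatternStrip {
    // Compiled once; [\x00-\x1F] matches exactly the characters
    // rejected by the ch >= ' ' test (like that test, it keeps 0x7F).
    private static final Pattern CONTROL = Pattern.compile("[\\x00-\\x1F]");

    public static String strip(String s) {
        return CONTROL.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("foo\u0001bar\nbaz")); // prints "foobarbaz"
    }
}
```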
Update
Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.
Update 2
I've tried my best to collect all the proposed solutions and their cross-mutations and published them as a small benchmarking framework on GitHub. Currently it sports 17 algorithms. One of them is "special": the Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks to achieve stellar speeds, but since it messes up the JVM's string state, it's benchmarked separately.
You're welcome to check it out and run it to determine the results on your box. Here's a summary of the results I got on mine. Its specs:
Debian sid
Linux 2.6.39-2-amd64 (x86_64)
Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Different algorithms show completely different results given different sets of input data. I ran the benchmark in 3 modes:
Same single string
This mode works on the same single string, provided by the StringSource class as a constant. The showdown:
Ops / s │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
994 727 │ ArrayOfByteUTF8String
918 611 │ ArrayOfByteUTF8Const
756 086 │ MatcherReplace
598 945 │ StringReplaceAll
460 045 │ ArrayOfByteWindows1251
In charted form: [chart image; source: greycat.ru]
Multiple strings, 100% of strings contain control characters
The source string provider pre-generated lots of random strings using the (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.
Ops / s │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
893 176 │ ArrayOfByteUTF8String
817 127 │ ArrayOfByteUTF8Const
778 089 │ StringBuilderChar
734 754 │ StringBuilderCodePoint
377 829 │ ArrayOfByteWindows1251
224 140 │ MatcherReplace
211 104 │ StringReplaceAll
In charted form: [chart image; source: greycat.ru]
Multiple strings, 1% of strings contain control characters
Same as the previous mode, but only 1% of the strings were generated with control characters - the other 99% were generated using the [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes closest to the real-world application of this algorithm in my case.
Ops / s │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
939 055 │ StringBuilderChar
907 194 │ ArrayOfByteUTF8Const
841 963 │ StringBuilderCodePoint
606 465 │ MatcherReplace
501 555 │ StringReplaceAll
381 185 │ ArrayOfByteWindows1251
In charted form: [chart image; source: greycat.ru]
It's very hard for me to decide who provided the best answer, but given that the best real-world solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks to all who took part in this; your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).
References
GitHub repository with a benchmarking suite

Using a single char array could work a bit better:
int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;
newLen++;
}
}
s = new String(oldChars, 0, newLen);
I also avoided repeated calls to s.length().
Another micro-optimization that might work is:
int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
// if there are none it ends on the null char I appended
for (int j = newLen; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
newLen++;
}
}
s = new String(oldChars, 0, newLen);

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:
char [] oldChars = new char[5];
String stripControlChars(String s)
{
final int inputLen = s.length();
if ( oldChars.length < inputLen )
{
oldChars = new char[inputLen];
}
s.getChars(0, inputLen, oldChars, 0);
etc...
This is a big win - 20% or so over the current best case, as I understand it.
If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
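A minimal sketch of the weak-reference variant hinted at above (class and field names are my own invention): the buffer is cached between calls, but the GC can still reclaim it if a huge string once inflated it.

```java
import java.lang.ref.WeakReference;

public class WeakBufferStripper {
    // The GC may clear this reference under memory pressure;
    // we then simply reallocate the buffer.
    private WeakReference<char[]> bufferRef = new WeakReference<>(new char[64]);

    public String stripControlChars(String s) {
        final int len = s.length();
        char[] buf = bufferRef.get();
        if (buf == null || buf.length < len) {
            buf = new char[len];
            bufferRef = new WeakReference<>(buf);
        }
        s.getChars(0, len, buf, 0);
        int newLen = 0;
        for (int j = 0; j < len; j++) {
            char ch = buf[j];
            if (ch >= ' ') {
                buf[newLen++] = ch;
            }
        }
        return new String(buf, 0, newLen);
    }
}
```

Same caveat as the original: this instance must not be shared across threads.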

Well, I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measurements. How? By selling my soul.
As I'm sure everyone who has followed the discussion so far knows, this violates pretty much every basic programming principle, but oh well. Anyway, the following only works if the string's underlying character array is not shared with other strings - if it is, whoever has to debug this will have every right to kill you. As long as there are no substring() calls and you don't use it on literal strings, this should work, since I don't see why the JVM would intern unique strings read from an outside source. Don't forget to make sure the benchmark code doesn't share arrays either - that's extremely likely and would obviously help the reflection solution.
Anyways here we go:
// Has to be done only once - so cache those! Prohibitively expensive otherwise
private Field value;
private Field offset;
private Field count;
private Field hash;
{
try {
value = String.class.getDeclaredField("value");
value.setAccessible(true);
offset = String.class.getDeclaredField("offset");
offset.setAccessible(true);
count = String.class.getDeclaredField("count");
count.setAccessible(true);
hash = String.class.getDeclaredField("hash");
hash.setAccessible(true);
}
catch (NoSuchFieldException e) {
throw new RuntimeException(e);
}
}
@Override
public String strip(final String old) {
final int length = old.length();
char[] chars = null;
int off = 0;
try {
chars = (char[]) value.get(old);
off = offset.getInt(old);
}
catch(IllegalArgumentException e) {
throw new RuntimeException(e);
}
catch(IllegalAccessException e) {
throw new RuntimeException(e);
}
int newLen = off;
for(int j = off; j < off + length; j++) {
final char ch = chars[j];
if (ch >= ' ') {
chars[newLen] = ch;
newLen++;
}
}
if (newLen - off != length) {
// We changed the internal state of the string, so at least
// be friendly enough to correct it.
try {
count.setInt(old, newLen - off);
// Have to recompute hash later on
hash.setInt(old, 0);
}
catch(IllegalArgumentException e) {
e.printStackTrace();
}
catch(IllegalAccessException e) {
e.printStackTrace();
}
}
// Well we have to return something
return old;
}
For my test string this gets 3477148.18 ops/s vs. 2616120.89 ops/s for the old variant. I'm quite sure the only way to beat that would be to write it in C (probably not, though) or some completely different approach nobody has thought of so far. I'm absolutely not sure whether the timing is stable across different platforms, but it produces reliable results on my box (Java 7, Win7 x64) at least.

You could split the task into several parallel subtasks, depending on the number of processors.
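For a stream of many independent strings, that idea can be sketched with parallel streams over the common fork-join pool (the inner loop is the char[] approach from the question; class and method names are mine):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelStrip {
    // Single-string strip, same ch >= ' ' test as in the question.
    public static String strip(String s) {
        final int len = s.length();
        char[] out = new char[len];
        int n = 0;
        for (int i = 0; i < len; i++) {
            char ch = s.charAt(i);
            if (ch >= ' ') out[n++] = ch;
        }
        return new String(out, 0, n);
    }

    // Each string is independent, so the work partitions trivially
    // across the cores of the common fork-join pool.
    public static List<String> stripAll(List<String> inputs) {
        return inputs.parallelStream()
                     .map(ParallelStrip::strip)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(stripAll(Arrays.asList("a\u0001b", "ok"))); // prints [ab, ok]
    }
}
```

Splitting one string into parallel chunks is harder, since the compacted pieces must be stitched back together; for many independent strings, per-string parallelism is the easy win.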

I took the liberty of writing a small benchmark for the different algorithms. It's not perfect, but I take the minimum of 1000 runs, each executing a given algorithm 10000 times over a random string (with about 32/200 = 16% non-printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't get at least one run without much hindrance.
It's not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200-character string with uniformly distributed chars in the range [0, 200).

I'm not a low-level Java performance junkie, but have you tried unrolling your main loop? It appears that it could allow some CPUs to perform checks in parallel.
Also, this has some fun ideas for optimizations.

It can go even faster. Much faster*. How? By leveraging System.arraycopy, which is a native method. So to recap:
Return the same String if it's "clean".
Avoid allocating a new char[] on every iteration
Use System.arraycopy for moving the elements x positions back
public class SteliosAdamantidis implements StripAlgorithm {
private char[] copy = new char[128];
@Override
public String strip(String s) throws Exception {
int length = s.length();
if (length > copy.length) {
int newLength = copy.length * 2;
while (length > newLength) newLength *= 2;
copy = new char[newLength];
}
s.getChars(0, length, copy, 0);
int start = 0; //where to start copying from
int offset = 0; //number of non printable characters or how far
//behind the characters should be copied to
int index = 0;
//fast forward to the first non printable character
for (; index < length; ++index) {
if (copy[index] < ' ') {
start = index;
break;
}
}
//string is already clean
if (index == length) return s;
for (; index < length; ++index) {
if (copy[index] < ' ') {
if (start != index) {
System.arraycopy(copy, start, copy, start - offset, index - start);
}
++offset;
start = index + 1; //handling subsequent non printable characters
}
}
if (length != start) {
//copy the residue -if any
System.arraycopy(copy, start, copy, start - offset, length - start);
}
return new String(copy, 0, length - offset);
}
}
This class is not thread-safe, but I guess that if one wants to handle a gazillion strings on separate threads, they can afford 4-8 instances of the StripAlgorithm implementation inside a ThreadLocal<>.
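The ThreadLocal idea could look roughly like this (a sketch, not code from the benchmark repo): each thread gets its own grow-on-demand buffer, so there is no sharing and no locking.

```java
public class ThreadLocalStrip {
    // One reusable buffer per thread; ThreadLocal.withInitial is Java 8+.
    private static final ThreadLocal<char[]> BUFFER =
            ThreadLocal.withInitial(() -> new char[128]);

    public static String strip(String s) {
        final int len = s.length();
        char[] buf = BUFFER.get();
        if (buf.length < len) {
            buf = new char[len];
            BUFFER.set(buf); // remember the bigger buffer for this thread
        }
        s.getChars(0, len, buf, 0);
        int n = 0;
        for (int j = 0; j < len; j++) {
            char ch = buf[j];
            if (ch >= ' ') buf[n++] = ch;
        }
        return new String(buf, 0, n);
    }
}
```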
Trivia
I used the RatchetFreak2EdStaub1GreyCat2 solution as a reference. I was surprised that it wasn't performing well on my machine. At first I wrongly thought that the "bailout" mechanism didn't work, so I moved it to the end - and performance skyrocketed. Then I thought "wait a minute" and realized that the condition always works; it's simply faster at the end. I don't know why.
...
6. RatchetFreak2EdStaub1GreyCatEarlyBail 3508771.93 3.54x +3.9%
...
2. RatchetFreak2EdStaub1GreyCatLateBail 6060606.06 6.12x +13.9%
The test is not 100% accurate. At first I selfishly put my test second in the array of algorithms. It had some lousy results on the first run, so I moved it to the end (let the others warm up the JVM for me :) ) and then it came in first.
Results
Oh, and of course the results: Windows 7, jdk1.8.0_111 on a relatively old machine, so expect different results on newer hardware and/or OS.
Rankings: (1.000.000 strings)
17. StringReplaceAll 990099.01 1.00x +0.0%
16. ArrayOfByteWindows1251 1642036.12 1.66x +65.8%
15. StringBuilderCodePoint 1724137.93 1.74x +5.0%
14. ArrayOfByteUTF8Const 2487562.19 2.51x +44.3%
13. StringBuilderChar 2531645.57 2.56x +1.8%
12. ArrayOfByteUTF8String 2551020.41 2.58x +0.8%
11. ArrayOfCharFromArrayOfChar 2824858.76 2.85x +10.7%
10. RatchetFreak2 2923976.61 2.95x +3.5%
9. RatchetFreak1 3076923.08 3.11x +5.2%
8. ArrayOfCharFromStringCharAt 3322259.14 3.36x +8.0%
7. EdStaub1 3378378.38 3.41x +1.7%
6. RatchetFreak2EdStaub1GreyCatEarlyBail 3508771.93 3.54x +3.9%
5. EdStaub1GreyCat1 3787878.79 3.83x +8.0%
4. MatcherReplace 4716981.13 4.76x +24.5%
3. RatchetFreak2EdStaub1GreyCat1 5319148.94 5.37x +12.8%
2. RatchetFreak2EdStaub1GreyCatLateBail 6060606.06 6.12x +13.9%
1. SteliosAdamantidis 9615384.62 9.71x +58.7%
Rankings: (10.000.000 strings)
17. ArrayOfByteWindows1251 1647175.09 1.00x +0.0%
16. StringBuilderCodePoint 1728907.33 1.05x +5.0%
15. StringBuilderChar 2480158.73 1.51x +43.5%
14. ArrayOfByteUTF8Const 2498126.41 1.52x +0.7%
13. ArrayOfByteUTF8String 2591344.91 1.57x +3.7%
12. StringReplaceAll 2626740.22 1.59x +1.4%
11. ArrayOfCharFromArrayOfChar 2810567.73 1.71x +7.0%
10. RatchetFreak2 2948113.21 1.79x +4.9%
9. RatchetFreak1 3120124.80 1.89x +5.8%
8. ArrayOfCharFromStringCharAt 3306878.31 2.01x +6.0%
7. EdStaub1 3399048.27 2.06x +2.8%
6. RatchetFreak2EdStaub1GreyCatEarlyBail 3494060.10 2.12x +2.8%
5. EdStaub1GreyCat1 3818251.24 2.32x +9.3%
4. MatcherReplace 4899559.04 2.97x +28.3%
3. RatchetFreak2EdStaub1GreyCat1 5302226.94 3.22x +8.2%
2. RatchetFreak2EdStaub1GreyCatLateBail 5924170.62 3.60x +11.7%
1. SteliosAdamantidis 9680542.11 5.88x +63.4%
* Reflection -Voo's answer
I've put an asterisk on the "much faster" statement. I don't think anything can go faster than reflection in this case: it mutates the String's internal state and avoids new String allocations. I don't think one can beat that.
I tried to uncomment and run Voo's algorithm, and I got an error that the offset field doesn't exist (offset and count were removed from String in Java 7u6). IntelliJ complains that it can't resolve count either. Also (if I'm not mistaken) the security manager might cut off reflective access to private fields, in which case the solution won't work. That's why this algorithm doesn't appear in my test run. Otherwise I was curious to see for myself, although I believe that a non-reflective solution can't be faster.

why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?
If you mean String#getBytes("utf-8") etc.: This shouldn't be faster - except for some better caching - since Charset.forName("utf-8") is used internally, if the charset is not cached.
One possibility is that you're using different charsets (or maybe some of your code does so transparently), but the charset cached in StringCoding doesn't change.
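For what it's worth, the three call forms look like this. In the HotSpot of that era, getBytes(String) went through a per-thread cached encoder in StringCoding, while getBytes(Charset) defensively created a fresh encoder on every call (the Charset could be an untrusted subclass) - one plausible reason the name-based form measured faster. Java 7's StandardCharsets.UTF_8 constant avoids both the lookup and the checked exception:

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesForms {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "hello";
        // Name-based form: hits StringCoding's cached encoder on repeat calls.
        byte[] byName = s.getBytes("UTF-8");
        // Charset-based form: historically allocated a fresh encoder per call.
        byte[] byCharset = s.getBytes(Charset.forName("UTF-8"));
        // Java 7+ constant: no lookup and no checked exception to handle.
        byte[] byConstant = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(byName.length + " " + byCharset.length + " " + byConstant.length); // prints "5 5 5"
    }
}
```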

Related

Why 2 similar loop codes costs different time in java

I was confused by the codes as follows:
public static void test(){
long currentTime1 = System.currentTimeMillis();
final int iBound = 10000000;
final int jBound = 100;
for(int i = 1;i<=iBound;i++){
int a = 1;
int tot = 10;
for(int j = 1;j<=jBound;j++){
tot *= a;
}
}
long updateTime1 = System.currentTimeMillis();
System.out.println("i:"+iBound+" j:"+jBound+"\nIt costs "+(updateTime1-currentTime1)+" ms");
}
That's the first version; it costs 443 ms on my computer.
public static void test(){
long currentTime1 = System.currentTimeMillis();
final int iBound = 100;
final int jBound = 10000000;
for(int i = 1;i<=iBound;i++){
int a = 1;
int tot = 10;
for(int j = 1;j<=jBound;j++){
tot *= a;
}
}
long updateTime1 = System.currentTimeMillis();
System.out.println("i:"+iBound+" j:"+jBound+"\nIt costs "+(updateTime1-currentTime1)+" ms");
}
The second version costs 832 ms.
The only difference is that I simply swapped i and j.
This result is incredible; I tested the same code in C, and the difference there is not that huge.
Why are these 2 similar pieces of code so different in Java?
My jdk version is openjdk-14.0.2
TL;DR - This is just a bad benchmark.
I did the following:
Create a Main class with a main method.
Copy in the two versions of the test as test1() and test2().
In the main method do this:
while(true) {
test1();
test2();
}
Here is the output I got (Java 8).
i:10000000 j:100
It costs 35 ms
i:100 j:10000000
It costs 33 ms
i:10000000 j:100
It costs 33 ms
i:100 j:10000000
It costs 25 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
i:100 j:10000000
It costs 0 ms
i:10000000 j:100
It costs 0 ms
....
So as you can see, when I run two versions of the same method alternately in the same JVM, the times for each method are roughly the same.
But more importantly, after a small number of iterations the time drops to ... zero! What has happened is that the JIT compiler has compiled the two methods and (probably) deduced that their loops can be optimized away.
It is not entirely clear why people are getting different times when the two versions are run separately. One possible explanation is that on the first run the JVM executable is being read from disk, while on the second it is already cached in RAM. Or something like that.
Another possible explanation is that JIT compilation kicks in earlier1 with one version of test(), so the proportion of time spent in the slower interpreted (pre-JIT) phase differs between the two versions. (It may be possible to tease this out using JIT logging options.)
But it is immaterial really ... because the performance of a Java application while the JVM is warming up (loading code, JIT compiling, growing the heap to its working size, loading caches, etc) is generally speaking not important. And for the cases where it is important, look for a JVM that can do AOT compilation; e.g. GraalVM.
1 - This could be because of the way that the interpreter gathers stats. The general idea is that the bytecode interpreter accumulates statistics on things like branches until it has "enough". Then the JVM triggers the JIT compiler to compile the bytecodes to native code. When that is done, the code typically runs 10 or more times faster. The different looping patterns might make it reach "enough" earlier in one version compared to the other. NB: I am speculating here. I offer zero evidence ...
The bottom line is that you have to be careful when writing Java benchmarks because the timings can be distorted by various JVM warmup effects.
For more information read: How do I write a correct micro-benchmark in Java?
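To make the warm-up point concrete, here is a hand-rolled sketch of the pattern (JMH does all of this properly; this only illustrates the idea of a warm-up phase plus a sink value that keeps the JIT from deleting the loops as dead code):

```java
public class WarmupAwareBench {
    // A tiny deterministic workload: sums 0..99999.
    static long workload() {
        long tot = 0;
        for (int i = 0; i < 100_000; i++) tot += i;
        return tot;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up phase: give the JIT a chance to compile workload()
        // before any timing starts.
        for (int w = 0; w < 1_000; w++) sink += workload();
        // Timed phase: measure only after warm-up.
        long t0 = System.nanoTime();
        for (int r = 0; r < 100; r++) sink += workload();
        long t1 = System.nanoTime();
        // Printing 'sink' keeps the result live so the loops can't be elided.
        System.out.println("avg ns/call: " + (t1 - t0) / 100 + " (sink=" + sink + ")");
    }
}
```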
I tested it myself and I get the same kind of difference (around 16 ms and 4 ms).
After testing, I found that declaring a variable 1M times takes less time than multiplying by 1 1M times. How?
I made a loop with 100 multiplications:
final int nb = 100000000;
for(int i = 1;i<=nb;i++){
i *= 1;
i *= 1;
[... written 20 times]
i *= 1;
i *= 1;
}
And one with 100 declarations:
final int nb = 100000000;
for(int i = 1;i<=nb;i++){
int a = 0;
int aa = 0;
[... written 20 times]
int aaaaaaaaaaaaaaaaaaaaaa = 0;
int aaaaaaaaaaaaaaaaaaaaaaa = 0;
}
And I respectively get 8 and 3 ms, which seems to correspond to what you get.
You may get different results on a different processor.
You'll find the answer in the first chapter of algorithm books: the cost of producing and assigning a value is 1. So in the first version you perform the 2 declarations and assignments 10000000 times, while in the second you perform them only 100 times - that's what reduces the time...
In the first version: 5 operations in the outer loop and 3 in the inner loop -> the inner loop costs 3 * 100 = 300, then (300 + 5) * 10000000 = 3050000000.
In the second version: 3 * 10000000 = 30000000 -> (30000000 + 5) * 100 = 3000000500.
So in theory the second one should be faster, but I think it comes down to multi-core CPUs, which can do 10000000 parallel jobs in the first case but only 100 in the second... so the first one comes out faster.

Execution Time: Iterative vs Instance

I just stumbled across a strange thing while coding in Java:
I read a file into a byte array (byte[] file_bytes), and what I want is a hexdump output (like the utilities hexdump or xxd on Linux). Basically this works (see the for-loop code that is not commented out), but for larger files (>100 KiB) it takes a while to go through the byte-array chunks, do proper formatting, and so on.
But if I swap the for-loop code with the code that is commented out (using a class with the same for-loop code for the calculation!), it runs very fast.
What is the reason for this behavior?
Code snippet:
[...]
long counter = 1;
int chunk_size = 512;
int chunk_count = (int) Math.ceil((double) file_bytes.length / chunk_size);
for (int i = 0; i < chunk_count; i++) {
byte[] chunk = Arrays.copyOfRange(file_bytes, i * chunk_size, (i + 1) * chunk_size);
// these two commented lines calculate way faster than the for loop below, even though the calculation algorithm is the same!
/*
* String test = new BytesToHexstring(chunk).getHexstring();
* hex_string = hex_string.concat(test);
*/
for (byte b : chunk) {
if( (counter % 4) != 0 ){
hex_string = hex_string.concat(String.format("%02X ", b));
} else{
hex_string = hex_string.concat(String.format("%02X\n", b));
}
counter++;
}
}
[...]
class BytesToHexstring:
class BytesToHexstring {
private String m_hexstring;
public BytesToHexstring(byte[] ba) {
m_hexstring = "";
m_hexstring = bytes_to_hex_string(ba);
}
private String bytes_to_hex_string(byte[] ba) {
String hexstring = "";
int counter = 1;
// same calculation algorithm like in the codesnippet above!
for (byte b : ba) {
if ((counter % 4) != 0) {
hexstring = hexstring.concat(String.format("%02X ", b));
} else {
hexstring = hexstring.concat(String.format("%02X\n", b));
}
counter++;
}
return hexstring;
}
public String getHexstring() {
return m_hexstring;
}
}
String hex_string:
00 11 22 33
44 55 66 77
88 99 AA BB
CC DD EE FF
Benchmarks:
file_bytes.length = 102400 bytes = 100 KiB
via class: ~0.7 sec
without class: ~5.2 sec
file_bytes.length = 256000 bytes = 250 KiB
via class: ~1.2 sec
without class: ~36 sec
There's an important difference between the two options. In the slow version, you concatenate onto the entire hex string built up so far, once per byte. String concatenation is a slow operation, since it requires copying the entire string; as the string gets larger the copy takes longer, and you copy the whole thing for every byte.
In the faster version, you build each chunk up individually and only concatenate whole chunks onto the output string, rather than individual bytes. This means far fewer expensive concatenations. You still use concatenation while building up a chunk, but because a chunk is much smaller than the whole output, those concatenations are faster.
You could do much better, though, by using StringBuilder instead of string concatenation. StringBuilder is a class designed for efficiently building up strings incrementally; it avoids the full copy on every append that concatenation incurs. I expect that if you reworked this to use StringBuilder, both versions would perform about the same, and be faster than either version you already have.
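A sketch of what that rework might look like (hypothetical names): one StringBuilder for the whole dump, so each append is amortized O(1) instead of an O(n) full-string copy.

```java
public class HexDump {
    // Formats 4 bytes per line, matching the output shown in the question.
    public static String hexDump(byte[] data) {
        // Pre-size: 3 chars per byte ("XX " or "XX\n").
        StringBuilder sb = new StringBuilder(data.length * 3);
        for (int i = 0; i < data.length; i++) {
            sb.append(String.format("%02X", data[i]));
            sb.append((i + 1) % 4 == 0 ? '\n' : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {0x00, 0x11, 0x22, 0x33};
        System.out.print(hexDump(bytes)); // prints "00 11 22 33" and a newline
    }
}
```

String.format per byte still has some overhead (a nibble lookup table would be faster), but the StringBuilder is the main fix.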

Why does my project produce a "code too large" error at compile time? [duplicate]

This question already has answers here:
"Code too large" compilation error in Java
(14 answers)
Closed 6 years ago.
I have this code and when I try to compile it, it returns:
E:\temp\JavaApplication12\src\javaapplication12\JavaApplication12.java:15: error: code too large
public static void main(String[] args) {
1 error
My code is a sudoku solver. First I need to load all the numbers, and then process which numbers are already present in rows and columns, to decide what I can solve. But it doesn't compile! I have spent weeks working on this.
My sudoku solver's approach solves the problem in constant time. So I am not using loops or arrays, because that would make the problem O(n). I want O(k), where k is a constant.
Even if the code compiled, it wouldn't solve a game of Sudoku. All it actually does is set the 9 variables bN to true if any of the 81 variables aPQ is equal to N.
And it doesn't even do this efficiently: there are 1458 (= 18 * 81) conditions setting each of the bN variables to true. (Simple check: each condition is 3 lines; 1458 checks for each of 9 variables: 3 * 1458 * 9 = 39366, the approximate length of the file.)
All of the setters of bN are independent and idempotent, so they can be arbitrarily rearranged, and the 17 repeated checks of each condition can be removed.
An equivalent (and adequately efficient) version of this code - using arrays - is:
// Using 10 as array size, as OP's code is one-based;
// first element is unused.
int a[][] = new int[10][10];
// Initialize the elements of a.
boolean b[] = new boolean[10];
for (int i = 1; i <= 9; i++) {
for (int j = 1; j <= 9; j++) {
if (a[i][j] >= 1 && a[i][j] <= 9) {
b[a[i][j]] = true;
}
}
}
which should fit inside the maximum size of a method quite easily.
You should focus on writing correct, maintainable code before considering how to make it efficient - this code doesn't work for its stated purpose, and I would not want to be the one working out where the bug is in 40k lines of code. The only reason I was able to analyse this much code is that it appears to be generated, as it is very uniform in its pattern.
I did the analysis above using a (very hacky) Python script.
Run using:
curl http://pastebin.com/raw/NbyTTAdX | python script.py
script.py:
import sys
import re
with open('/dev/stdin') as fh:
lines = fh.readlines()
bequals = re.compile(r'^b\d\s*= true;$')
i = 0
bvariablesetters = {}
while i < len(lines):
if lines[i].strip().startswith('if (') and lines[i].strip().endswith('{'):
# Match the conditionals setting one of the b variables.
if lines[i+2].strip() == '}' and bequals.search(lines[i+1].strip()):
newline = ' '.join(map(str.strip, lines[i:i+3]))
spl = newline.split()
# This is the "b=" variable
bvar = spl[5]
bvariablesetters.setdefault(bvar, []).append(newline)
i += 3
continue
else:
# Print out lines which don't match the conditional-set-b pattern, so you
# can see that there's nothing else going on.
sys.stdout.write(lines[i])
i += 1
# Print the number of conditionals setting each of the b variables.
print {(k, len(v)) for k, v in bvariablesetters.iteritems()}
# Print the number of unique conditionals setting each of the b variables.
print {(k, len(set(v))) for k, v in bvariablesetters.iteritems()}
# Print one of the lists of conditions to set a b variable.
print bvariablesetters['b1=']
# Print one of the sets of conditions to set a b variable.
print sorted(set(bvariablesetters['b1=']))

Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Let's say the bottleneck of my Java program really is some tight loops to compute a bunch of vector dot products. Yes I've profiled, yes it's the bottleneck, yes it's significant, yes that's just how the algorithm is, yes I've run Proguard to optimize the byte code, etc.
The work is, essentially, dot products. As in, I have two float[50] and I need to compute the sum of pairwise products. I know processor instruction sets exist to perform these kind of operations quickly and in bulk, like SSE or MMX.
Yes I can probably access these by writing some native code in JNI. The JNI call turns out to be pretty expensive.
I know you can't guarantee what a JIT will compile or not compile. Has anyone ever heard of a JIT generating code that uses these instructions? and if so, is there anything about the Java code that helps make it compilable this way?
Probably a "no"; worth asking.
So, basically, you want your code to run faster. JNI is the answer. I know you said it didn't work for you, but let me show you that you are wrong.
Here's Dot.java:
import java.nio.FloatBuffer;
import org.bytedeco.javacpp.*;
import org.bytedeco.javacpp.annotation.*;
@Platform(include = "Dot.h", compiler = "fastfpu")
public class Dot {
static { Loader.load(); }
static float[] a = new float[50], b = new float[50];
static float dot() {
float sum = 0;
for (int i = 0; i < 50; i++) {
sum += a[i]*b[i];
}
return sum;
}
static native @MemberGetter FloatPointer ac();
static native @MemberGetter FloatPointer bc();
static native @NoException float dotc();
public static void main(String[] args) {
FloatBuffer ab = ac().capacity(50).asBuffer();
FloatBuffer bb = bc().capacity(50).asBuffer();
for (int i = 0; i < 10000000; i++) {
a[i%50] = b[i%50] = dot();
float sum = dotc();
ab.put(i%50, sum);
bb.put(i%50, sum);
}
long t1 = System.nanoTime();
for (int i = 0; i < 10000000; i++) {
a[i%50] = b[i%50] = dot();
}
long t2 = System.nanoTime();
for (int i = 0; i < 10000000; i++) {
float sum = dotc();
ab.put(i%50, sum);
bb.put(i%50, sum);
}
long t3 = System.nanoTime();
System.out.println("dot(): " + (t2 - t1)/10000000 + " ns");
System.out.println("dotc(): " + (t3 - t2)/10000000 + " ns");
}
}
and Dot.h:
float ac[50], bc[50];
inline float dotc() {
float sum = 0;
for (int i = 0; i < 50; i++) {
sum += ac[i]*bc[i];
}
return sum;
}
We can compile and run that with JavaCPP using this command:
$ java -jar javacpp.jar Dot.java -exec
With an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Fedora 30, GCC 9.1.1, and OpenJDK 8 or 11, I get this kind of output:
dot(): 39 ns
dotc(): 16 ns
Or roughly 2.4 times faster. We need to use direct NIO buffers instead of arrays, but HotSpot can access direct NIO buffers as fast as arrays. On the other hand, manually unrolling the loop does not provide a measurable boost in performance, in this case.
To address some of the scepticism expressed by others here, I suggest anyone who wants to prove it to themselves or others use the following method:
Create a JMH project
Write a small snippet of vectorizable math.
Run the benchmark flipping between -XX:-UseSuperWord and -XX:+UseSuperWord (the default)
If no difference in performance is observed, your code probably didn't get vectorized
To make sure, run your benchmark such that it prints out the assembly. On Linux you can enjoy the perfasm profiler ('-prof perfasm'); have a look and see if the instructions you expect get generated.
Example:
@Benchmark
@CompilerControl(CompilerControl.Mode.DONT_INLINE) // makes looking at the assembly easier
public void inc() {
for (int i=0;i<a.length;i++)
a[i]++;// a is an int[], I benchmarked with size 32K
}
The result with and without the flag (on recent Haswell laptop, Oracle JDK 8u60):
-XX:+UseSuperWord : 475.073 ± 44.579 ns/op (nanoseconds per op)
-XX:-UseSuperWord : 3376.364 ± 233.211 ns/op
The assembly for the hot loop is a bit much to format and stick in here, but here's a snippet (hsdis.so is failing to format some of the AVX2 vector instructions so I ran with -XX:UseAVX=1): -XX:+UseSuperWord (with '-prof perfasm:intelSyntax=true')
9.15% 10.90% │││ │↗ 0x00007fc09d1ece60: vmovdqu xmm1,XMMWORD PTR [r10+r9*4+0x18]
10.63% 9.78% │││ ││ 0x00007fc09d1ece67: vpaddd xmm1,xmm1,xmm0
12.47% 12.67% │││ ││ 0x00007fc09d1ece6b: movsxd r11,r9d
8.54% 7.82% │││ ││ 0x00007fc09d1ece6e: vmovdqu xmm2,XMMWORD PTR [r10+r11*4+0x28]
│││ ││ ;*iaload
│││ ││ ; - psy.lob.saw.VectorMath::inc@17 (line 45)
10.68% 10.36% │││ ││ 0x00007fc09d1ece75: vmovdqu XMMWORD PTR [r10+r9*4+0x18],xmm1
10.65% 10.44% │││ ││ 0x00007fc09d1ece7c: vpaddd xmm1,xmm2,xmm0
10.11% 11.94% │││ ││ 0x00007fc09d1ece80: vmovdqu XMMWORD PTR [r10+r11*4+0x28],xmm1
│││ ││ ;*iastore
│││ ││ ; - psy.lob.saw.VectorMath::inc@20 (line 45)
11.19% 12.65% │││ ││ 0x00007fc09d1ece87: add r9d,0x8 ;*iinc
│││ ││ ; - psy.lob.saw.VectorMath::inc@21 (line 44)
8.38% 9.50% │││ ││ 0x00007fc09d1ece8b: cmp r9d,ecx
│││ │╰ 0x00007fc09d1ece8e: jl 0x00007fc09d1ece60 ;*if_icmpge
Have fun storming the castle!
In HotSpot versions beginning with Java 7u40, the server compiler provides support for auto-vectorisation, according to JDK-6340864.
However, this seems to be true only for "simple loops" - at least for the moment. For example, accumulating an array cannot be vectorised yet (JDK-7192383).
Here is a nice article about experimenting with Java and SIMD instructions, written by my friend:
http://prestodb.rocks/code/simd/
Its general outcome is that you can expect the JIT to use some SSE operations in 1.8 (and some more in 1.9). Though you should not expect much, and you need to be careful.
You could write an OpenCL kernel to do the computing and run it from Java: http://www.jocl.org/.
Code can be run on the CPU and/or GPU, and the OpenCL language also supports vector types, so you should be able to take explicit advantage of e.g. SSE3/4 instructions.
Have a look at Performance comparison between Java and JNI for optimal implementation of computational micro-kernels. They show that the Java HotSpot VM server compiler supports auto-vectorization using Super-word Level Parallelism, which is limited to simple cases of inside-the-loop parallelism. This article will also give you some guidance on whether your data size is large enough to justify going the JNI route.
I'm guessing you wrote this question before you found out about netlib-java ;-) it provides exactly the native API you require, with machine-optimised implementations, and does not have any cost at the native boundary, thanks to memory pinning.
Java 16 introduced the Vector API (JEP 417, JEP 414, JEP 338). It is currently "incubating" (i.e., beta), although anyone can use it. It will probably become GA in Java 19 or 20.
It's a little verbose, but is meant to be reliable and portable.
The following code can be rewritten:
void scalarComputation(float[] a, float[] b, float[] c) {
    assert a.length == b.length && b.length == c.length;
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}
Using the Vector API:
static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

void vectorComputation(float[] a, float[] b, float[] c) {
    assert a.length == b.length && b.length == c.length;
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    for (; i < upperBound; i += SPECIES.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i);
    }
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}
Newer builds (i.e., Java 18) are trying to get rid of that last for loop using predicate instructions, but support for that is still supposedly spotty.
I don't believe most, if any, VMs are ever smart enough for this sort of optimisation. To be fair, most optimisations are much simpler, such as shifting instead of multiplying when a power of two is involved. The Mono project introduced its own vector types and other methods with native backings to help performance.

Loop counter in Java API

All,
While going through some of the files in the Java API, I noticed many instances where the looping counter is being decremented rather than incremented, e.g. in for and while loops in the String class. Though this might be trivial, is there any significance to decrementing the counter rather than incrementing it?
I've compiled two simple loops with eclipse 3.6 (java 6) and looked at the byte code whether we have some differences. Here's the code:
for(int i = 2; i >= 0; i--){}
for(int i = 0; i <= 2; i++){}
And this is the bytecode:
// 1st for loop - decrement 2 -> 0
0 iconst_2
1 istore_1      // i := 2
2 goto 8
5 iinc 1 -1     // i += (-1)
8 iload_1
9 ifge 5        // if (i >= 0) goto 5
// 2nd for loop - increment 0 -> 2
12 iconst_0
13 istore_1     // i := 0
14 goto 20
17 iinc 1 1     // i += 1
20 iload_1
21 iconst_2
22 if_icmple 17 // if (i <= 2) goto 17
The increment/decrement operation should make no difference; it's either +1 or +(-1). The main difference in this typical(!) example is that in the first case we compare to 0 (ifge), in the second we compare to a value (if_icmple with 2). And the comparison is done in each iteration. So if there is any (slight) performance gain, I think it's because it's less costly to compare with 0 than to compare with other values. So I guess it's not incrementing/decrementing that makes the difference but the stop criterion.
So if you're in need to do some micro-optimization on source code level, try to write your loops in a way that you compare with zero, otherwise keep it as readable as possible (and incrementing is much easier to understand):
for (int i = 0; i <= 2; i++) {} // readable
for (int i = -2; i <= 0; i++) {} // micro-optimized and "faster" (hopefully)
Addition
Yesterday I did a very basic test - just created a 2000x2000 array and populated the cells based on calculations with the cell indices, once counting from 0->1999 for both rows and cells, another time backwards from 1999->0. I wasn't surprised that both scenarios had similar performance (185..210 ms on my machine).
So yes, there is a difference on the byte code level (eclipse 3.6) but, hey, we're in 2010 now, it doesn't seem to make a significant difference nowadays. So again, and using Stephen's words, "don't waste your time" with this kind of optimization. Keep the code readable and understandable.
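(That basic test can be sketched as follows — a minimal reconstruction, not the original code; class and method names are mine, and exact timings will differ per machine. The point is that forward and backward iteration produce identical results and comparable times:)

```java
public class FillDirectionTest {
    static final int N = 2000;

    // Fill a 2000x2000 grid counting indices upward, 0 -> 1999.
    static long fillForward() {
        int[][] grid = new int[N][N];
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                grid[row][col] = row * col; // cell value from the indices
        return checksum(grid);
    }

    // Same fill, counting indices downward, 1999 -> 0.
    static long fillBackward() {
        int[][] grid = new int[N][N];
        for (int row = N - 1; row >= 0; row--)
            for (int col = N - 1; col >= 0; col--)
                grid[row][col] = row * col;
        return checksum(grid);
    }

    static long checksum(int[][] grid) {
        long sum = 0;
        for (int[] row : grid)
            for (int v : row)
                sum += v;
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long fwd = fillForward();
        long t1 = System.nanoTime();
        long bwd = fillBackward();
        long t2 = System.nanoTime();
        System.out.println("forward:  " + (t1 - t0) / 1_000_000 + " ms, checksum " + fwd);
        System.out.println("backward: " + (t2 - t1) / 1_000_000 + " ms, checksum " + bwd);
    }
}
```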
When in doubt, benchmark.
public class IncDecTest
{
    public static void main(String[] av)
    {
        long up = 0;
        long down = 0;
        long upStart, upStop;
        long downStart, downStop;
        long upStart2, upStop2;
        long downStart2, downStop2;

        upStart = System.currentTimeMillis();
        for (long i = 0; i < 100000000; i++)
        {
            up++;
        }
        upStop = System.currentTimeMillis();

        downStart = System.currentTimeMillis();
        for (long j = 100000000; j > 0; j--)
        {
            down++;
        }
        downStop = System.currentTimeMillis();

        upStart2 = System.currentTimeMillis();
        for (long k = 0; k < 100000000; k++)
        {
            up++;
        }
        upStop2 = System.currentTimeMillis();

        downStart2 = System.currentTimeMillis();
        for (long l = 100000000; l > 0; l--)
        {
            down++;
        }
        downStop2 = System.currentTimeMillis();

        assert (up == down);

        System.out.println("Up: " + (upStop - upStart));
        System.out.println("Down: " + (downStop - downStart));
        System.out.println("Up2: " + (upStop2 - upStart2));
        System.out.println("Down2: " + (downStop2 - downStart2));
    }
}
With the following JVM:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
it has the following output (I ran it multiple times to make sure the JVM was loaded and the numbers settled down a little):
$ java -ea IncDecTest
Up: 86
Down: 84
Up2: 83
Down2: 84
These all come extremely close to one another and I have a feeling that any discrepancy is a fault of the JVM loading some code at some points and not others, or a background task happening, or simply falling over and getting rounded down on a millisecond boundary.
While at one point (early days of Java) there might have been some performance voodoo to be had, it seems to me that that is no longer the case.
Feel free to try running/modifying the code to see for yourself.
It is possible that this is a result of Sun engineers doing a whole lot of profiling and micro-optimization, and those examples that you found are the result of that. It is also possible that they are the result of Sun engineers "optimizing" based on deep knowledge of the JIT compilers ... or based on shallow / incorrect knowledge / voodoo thinking.
It is possible that these sequences:
are faster than the increment loops,
are no faster or slower than increment loops, or
are slower than increment loops for the latest JVMs, and the code is no longer optimal.
Either way, you should not emulate this practice in your code, unless thorough profiling with the latest JVMs demonstrates that:
your code really will benefit from optimization, and
the decrementing loop really is faster than the incrementing loop for your particular application.
And even then, you may find that your carefully hand optimized code is less than optimal on other platforms ... and that you need to repeat the process all over again.
These days, it is generally recognized that the best first strategy is to write simple code and leave optimization to the JIT compiler. Writing complicated code (such as loops that run in reverse) may actually foil the JIT compiler's attempts to optimize.
