I'm trying to find a concise example that shows auto-vectorization in Java on an x86-64 system.
I've implemented the code below using y[i] = y[i] + x[i] in a for loop. This code can benefit from auto-vectorization, so I think Java should compile it at runtime using SSE or AVX instructions to speed it up.
However, I couldn't find any vectorized instructions in the resulting native machine code.
VecOpMicroBenchmark.java should benefit from auto vectorization:
/**
 * Run with this command to show native assembly:<br/>
 * java -XX:+UnlockDiagnosticVMOptions
 * -XX:CompileCommand=print,VecOpMicroBenchmark.profile VecOpMicroBenchmark
 */
public class VecOpMicroBenchmark {
    private static final int LENGTH = 1024;
    private static long profile(float[] x, float[] y) {
        long t = System.nanoTime();
        for (int i = 0; i < LENGTH; i++) {
            y[i] = y[i] + x[i]; // line 14
        }
        t = System.nanoTime() - t;
        return t;
    }
    public static void main(String[] args) throws Exception {
        float[] x = new float[LENGTH];
        float[] y = new float[LENGTH];
        // to let the JIT compiler do its work, repeatedly invoke
        // the method under test and then do a little nap
        long minDuration = Long.MAX_VALUE;
        for (int i = 0; i < 1000; i++) {
            long duration = profile(x, y);
            minDuration = Math.min(minDuration, duration);
        }
        Thread.sleep(10);
        System.out.println("\n\nduration: " + minDuration + "ns");
    }
}
To find out if it gets vectorized, I did the following:
open Eclipse and create the above file
right-click the file and, from the dropdown menu, choose Run > Java Application (ignore the output for now)
in the Eclipse menu, click Run > Run Configurations...
in the window that opens, find VecOpMicroBenchmark, click it, and choose the Arguments tab
in the Arguments tab, under VM arguments:, enter: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,VecOpMicroBenchmark.profile
get libhsdis and copy (possibly rename) the file hsdis-amd64.so (.dll on Windows) to the java/lib directory. In my case, this was /usr/lib/jvm/java-11-openjdk-amd64/lib .
run VecOpMicroBenchmark again
It should now print lots of information to the console, part of it being the disassembled native machine code produced by the JIT compiler. If you see lots of messages but no assembly instructions such as mov, push, or add, look for the following message somewhere in the output:
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
This means that Java couldn't find the file hsdis-amd64.so: either it's not in the right directory or it doesn't have the right name.
hsdis-amd64.so is the disassembler required for showing the resulting native machine code. After the JIT compiler compiles the Java bytecode to native machine code, hsdis-amd64.so is used to disassemble that machine code and make it human-readable. You can find more information on how to get and install it at How to see JIT-compiled code in JVM? .
After finding assembly instructions in the output, I skimmed through it (too much to post all of it here) and looked for line 14. I found this:
0x00007fac90ee9859: nopl 0x0(%rax)
0x00007fac90ee9860: cmp 0xc(%rdx),%esi ; implicit exception: dispatches to 0x00007fac90ee997f
0x00007fac90ee9863: jnb 0x7fac90ee9989
0x00007fac90ee9869: movsxd %esi,%rbx
0x00007fac90ee986c: vmovss 0x10(%rdx,%rbx,4),%xmm0 ;*faload {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile#16 (line 14)
0x00007fac90ee9872: cmp 0xc(%rdi),%esi ; implicit exception: dispatches to 0x00007fac90ee9997
0x00007fac90ee9875: jnb 0x7fac90ee99a1
0x00007fac90ee987b: movsxd %esi,%rbx
0x00007fac90ee987e: vmovss 0x10(%rdi,%rbx,4),%xmm1 ;*faload {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile#20 (line 14)
0x00007fac90ee9884: vaddss %xmm1,%xmm0,%xmm0
0x00007fac90ee9888: movsxd %esi,%rbx
0x00007fac90ee988b: vmovss %xmm0,0x10(%rdx,%rbx,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile#22 (line 14)
So it's using the AVX instruction vaddss. But, if I'm correct here, vaddss means add scalar single-precision floating-point values, and it only adds one float value to another (scalar means just one value, and single means 32 bit, i.e. float rather than double).
What I expect here is vaddps, which means add packed single-precision floating-point values and which is a true SIMD instruction (SIMD = single instruction, multiple data = vectorized instruction). Here, packed means multiple floats packed together in one register.
About the ..ss and ..ps, see http://www.songho.ca/misc/sse/sse.html :
SSE defines two types of operations; scalar and packed. Scalar operation only operates on the least-significant data element (bit 0~31), and packed operation computes all four elements in parallel. SSE instructions have a suffix -ss for scalar operations (Single Scalar) and -ps for packed operations (Parallel Scalar).
Question:
Is my java example incorrect, or why is there no SIMD instruction in the output?
In the main() method, put in i < 1000000 instead of just i < 1000. Then the JIT also produces AVX vector instructions like below, and the code runs faster:
0x00007f20c83da588: vmovdqu 0x10(%rbx,%r11,4),%ymm0
0x00007f20c83da58f: vaddps 0x10(%r13,%r11,4),%ymm0,%ymm0
0x00007f20c83da596: vmovdqu %ymm0,0x10(%rbx,%r11,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}
; - VecOpMicroBenchmark::profile#22 (line 14)
The code from the question can in fact be auto-vectorized by the JIT compiler. However, as Peter Cordes pointed out in a comment, JIT compilation itself takes considerable processing, so the JIT is rather reluctant to decide that some piece of code deserves full optimization.
The solution is simply to execute the code more often during one execution of the program, not just 1000 times, but 100000 times or a million times.
When executing the profile() method this many times, the JIT compiler is convinced that the code is very important and the overall runtime will benefit from full optimization, thus it optimizes the code again and then it also uses true vector instructions like vaddps.
More details in Auto Vectorization in Java
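As a rough sketch of the point above, the same loop can be timed once after a thousand calls and once after a million. The names WarmUpSketch, add and fastestCallNanos are made up for this illustration, and the timing is moved out of the loop method so that only the loop itself gets compiled. The numbers it prints depend on the JVM, its flags and the CPU, so treat them as an indication only.

```java
import java.util.Arrays;

final class WarmUpSketch {
    static final int LENGTH = 1024;

    /** The loop under test: element-wise y[i] = y[i] + x[i]. */
    static void add(float[] x, float[] y) {
        for (int i = 0; i < LENGTH; i++) {
            y[i] = y[i] + x[i];
        }
    }

    /** Calls add() the given number of times and returns the fastest single call in ns. */
    static long fastestCallNanos(float[] x, float[] y, int calls) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < calls; i++) {
            long start = System.nanoTime();
            add(x, y);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        float[] x = new float[LENGTH];
        float[] y = new float[LENGTH];
        Arrays.fill(x, 1.0f);
        // Few calls: the method may only have reached a lower compilation tier.
        System.out.println("after 1,000 calls:     " + fastestCallNanos(x, y, 1_000) + " ns");
        // Many calls: the JIT has had the chance to recompile with full optimization.
        System.out.println("after 1,000,000 calls: " + fastestCallNanos(x, y, 1_000_000) + " ns");
    }
}
```

To check which instructions were actually emitted, the flags from the question should work with the method name swapped in, i.e. -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,WarmUpSketch.add.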
What am I doing?
I am writing a data analysis program in Java which relies on R's arulesViz library to mine association rules.
What do I want?
My purpose is to store the rules in a String variable in Java so that I can process them later.
How does it work?
The code combines Java's String.format with RJava's eval. Its behavior can be summarized as:
Given properly formatted Java data structures, creates a data frame in R.
Formats the recently created data frame into a transaction list using the arules library.
Runs the apriori algorithm with the transaction list and some necessary values passed as parameter.
Reorders the generated association rules.
Since the association rules cannot be printed directly, writes them to standard output with R's write method, captures that output, and stores it in a variable. At this point the association rules have been converted into a string variable.
Returns the string.
The code is the following:
// Step 1
Rutils.rengine.eval("dataFrame <- data.frame(as.factor(c(\"Red\", \"Blue\", \"Yellow\", \"Blue\", \"Yellow\")), as.factor(c(\"Big\", \"Small\", \"Small\", \"Big\", \"Tiny\")), as.factor(c(\"Heavy\", \"Light\", \"Light\", \"Heavy\", \"Heavy\")))");
//Step 2
Rutils.rengine.eval("transList <- as(dataFrame, 'transactions')");
//Step 3
Rutils.rengine.eval(String.format("info <- apriori(transList, parameter = list(supp = %f, conf = %f, maxlen = 2))", supportThreshold, confidenceThreshold));
// Step 4
Rutils.rengine.eval("orderedRules <- sort(info, by = c('count', 'lift'), order = FALSE)");
// Step 5
REXP res = Rutils.rengine.eval("rulesAsString <- paste(capture.output(write(orderedRules, file = stdout(), sep = ',', quote = TRUE, row.names = FALSE, col.names = FALSE)), collapse='\n')");
// Step 6
return res.asString().replaceAll("'", "");
What's wrong?
Running the code on Linux works perfectly, but when I try to run it on Windows, I get the following error, which refers to the return line:
Exception in thread "main" java.lang.NullPointerException
This is a common error I get whenever the R code generates a null result and passes it to Java. There's no way to syntax-check the R code inside Java, so whenever it's wrong, this error message appears.
However, when I run the R code in brackets in the R command line in Windows, it works flawlessly, so both the syntax and the data flow are OK.
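One thing that may be worth ruling out (an assumption, not a confirmed diagnosis of this case): String.format without an explicit Locale uses the JVM's default locale, and in locales that write decimals with a comma, %f produces text such as 0,500000, which R would not read as a single number. The sketch below builds the apriori call from Step 3 with Locale.ROOT and adds a small null check. RCommandFormat, aprioriCall and requireResult are hypothetical helper names, not part of RJava.

```java
import java.util.Locale;

final class RCommandFormat {

    /** Builds the Step 3 command with '.' as decimal separator, whatever the default locale is. */
    static String aprioriCall(double support, double confidence) {
        return String.format(Locale.ROOT,
                "info <- apriori(transList, parameter = list(supp = %f, conf = %f, maxlen = 2))",
                support, confidence);
    }

    /** Hypothetical guard: fail with a readable message instead of a later NullPointerException. */
    static <T> T requireResult(T result, String rCode) {
        if (result == null) {
            throw new IllegalStateException("R returned no result for: " + rCode);
        }
        return result;
    }

    public static void main(String[] args) {
        // A locale with a decimal comma changes what %f prints.
        System.out.println(String.format(Locale.GERMANY, "supp = %f", 0.5));
        // Locale.ROOT always prints a decimal point.
        System.out.println(aprioriCall(0.5, 0.8));
    }
}
```

Wrapping each eval result in requireResult before calling asString() would at least show which R command came back empty.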
Technical information
In Linux, I am using R with OpenJDK 10.
In Windows, I am currently using Oracle's latest JDK release, but running the program with OpenJDK 12 for Windows does not solve anything.
Everything is 64 bits.
The IDE used in both operating systems is IntelliJ IDEA 2019.
Screenshots
Linux run configuration:
Windows run configuration:
I am currently working on a big legacy project with many huge classes and generated code.
I want to find all methods whose bytecode is longer than 8000 bytes (because, out of the box, Java will not optimize them).
I found a manual way to do this: How many bytes of bytecode has a particular method in Java?
However, my goal is to scan many files automatically.
I tried to use jboss-javassist, but AFAIK getting the bytecode length is only available at the class level.
Huge methods might indeed never get inlined, but I have my doubts about the threshold of 8000. This comment suggests a much smaller limit, though it is platform- and configuration-dependent anyway.
You are right that getting the bytecode length requires processing classes at that low level; however, you didn't specify what actual obstacle you encountered when trying to do that with Javassist. A simple program doing that with Javassist would be
// uses java.io.InputStream, java.io.DataInputStream and
// javassist.bytecode.ClassFile, MethodInfo, CodeAttribute
try(InputStream is=javax.swing.JComponent.class.getResourceAsStream("JComponent.class")) {
    ClassFile cf = new ClassFile(new DataInputStream(is));
    for(MethodInfo mi: cf.getMethods()) {
        CodeAttribute ca = mi.getCodeAttribute();
        if(ca == null) continue; // abstract or native
        int bLen = ca.getCode().length;
        if(bLen > 300)
            System.out.println(mi.getName()+" "+mi.getDescriptor()+", "+bLen+" bytes");
    }
}
This has been written and tested with a recent version of Javassist that uses Generics in the API. If you have a different/older version, you have to use
try(InputStream is=javax.swing.JComponent.class.getResourceAsStream("JComponent.class")) {
    ClassFile cf = new ClassFile(new DataInputStream(is));
    for(Object miO: cf.getMethods()) {
        MethodInfo mi = (MethodInfo)miO;
        CodeAttribute ca = mi.getCodeAttribute();
        if(ca == null) continue; // abstract or native
        int bLen = ca.getCode().length;
        if(bLen > 300)
            System.out.println(mi.getName()+" "+mi.getDescriptor()+", "+bLen+" bytes");
    }
}
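For scanning a whole directory tree without any library on the classpath, the same number can be read straight from the class file. The sketch below is not Javassist: it is a hand-written reader of the class file layout that only extracts code_length from each method's Code attribute. MethodSizeScanner and its methods are names invented for this example, and the 8000 threshold in main is simply the figure from the question.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MethodSizeScanner {

    /** Maps "name + descriptor" to the bytecode length of every method that has a Code attribute. */
    static Map<String, Integer> codeLengths(InputStream classFile) throws IOException {
        DataInputStream in = new DataInputStream(classFile);
        if (in.readInt() != 0xCAFEBABE) throw new IOException("not a class file");
        skip(in, 4);                                    // minor_version, major_version
        int poolCount = in.readUnsignedShort();
        String[] utf8 = new String[poolCount];          // only Utf8 entries are kept
        for (int i = 1; i < poolCount; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = in.readUTF(); break;            // Utf8
                case 5: case 6: skip(in, 8); i++; break;          // Long, Double take two slots
                case 15: skip(in, 3); break;                      // MethodHandle
                case 7: case 8: case 16: case 19: case 20: skip(in, 2); break;
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                    skip(in, 4); break;
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        skip(in, 6);                                    // access_flags, this_class, super_class
        skip(in, 2 * in.readUnsignedShort());           // interfaces
        Map<String, Integer> lengths = new LinkedHashMap<>();
        readMembers(in, utf8, null);                    // fields: read past them
        readMembers(in, utf8, lengths);                 // methods: record code lengths
        return lengths;
    }

    /** Convenience overload for a class that is already loaded. */
    static Map<String, Integer> codeLengths(Class<?> type) {
        String resource = "/" + type.getName().replace('.', '/') + ".class";
        try (InputStream is = type.getResourceAsStream(resource)) {
            return codeLengths(is);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void readMembers(DataInputStream in, String[] utf8,
                                    Map<String, Integer> lengths) throws IOException {
        int count = in.readUnsignedShort();
        for (int m = 0; m < count; m++) {
            skip(in, 2);                                // access_flags
            String key = utf8[in.readUnsignedShort()] + utf8[in.readUnsignedShort()];
            int attributes = in.readUnsignedShort();
            for (int a = 0; a < attributes; a++) {
                String attribute = utf8[in.readUnsignedShort()];
                int length = in.readInt();
                if (lengths != null && "Code".equals(attribute)) {
                    skip(in, 4);                        // max_stack, max_locals
                    lengths.put(key, in.readInt());     // code_length
                    skip(in, length - 8);               // rest of the Code attribute
                } else {
                    skip(in, length);
                }
            }
        }
    }

    private static void skip(DataInputStream in, int n) throws IOException {
        in.readFully(new byte[n]);                      // readFully never skips short
    }

    public static void main(String[] args) throws IOException {
        int threshold = 8000;                           // figure taken from the question
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> classFiles = paths.filter(p -> p.toString().endsWith(".class"))
                                         .collect(Collectors.toList());
            for (Path file : classFiles) {
                try (InputStream is = new BufferedInputStream(Files.newInputStream(file))) {
                    codeLengths(is).forEach((method, bytes) -> {
                        if (bytes > threshold) {
                            System.out.println(file + " " + method + ", " + bytes + " bytes");
                        }
                    });
                }
            }
        }
    }
}
```

Run it with the root of the compiled classes as the only argument. The Javassist loop shown earlier could be dropped into the same directory walk if depending on that library is acceptable.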
This is Peter Norvig's repl function:
def repl(prompt='lis.py> '):
    "A prompt-read-eval-print loop."
    while True:
        val = eval(parse(raw_input(prompt)))
        if val is not None:
            print(schemestr(val))

def schemestr(exp):
    "Convert a Python object back into a Scheme-readable string."
    if isinstance(exp, List):
        return '(' + ' '.join(map(schemestr, exp)) + ')'
    else:
        return str(exp)
Which works:
>>> repl()
lis.py> (define r 10)
lis.py> (* pi (* r r))
314.159265359
lis.py> (if (> (* 11 11) 120) (* 7 6) oops)
42
lis.py>
I'm trying to write a program with the same functionality in Java. I tried classes from the Java docs, but nothing works like that. Any ideas? Thanks.
A REPL is called REPL because it is a Loop that Reads and Evaluates code and Prints the results. In Lisp, the code is literally:
(LOOP (PRINT (EVAL (READ))))
In an unstructured language, it would be something like:
#loop:
$code ← READ;
$res ← EVAL($code);
PRINT($res);
GOTO #loop;
That's where the name comes from.
In Java, it would be something like:
while (true) {
    Code code = read(System.in);
    Object res = eval(code);
    System.out.println(res);
}
But, there are no methods corresponding to READ or EVAL in Java or the JRE. You will have to write read, eval, and Code yourself. Note that read is essentially a parser for Java, and eval is an interpreter for Java. Both the syntax and the semantics for Java are described in the Java Language Specification, all you have to do is read the JLS and implement those two methods.
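To make the read/eval/print split concrete, here is a minimal sketch for a tiny Scheme-like subset, closer to what the lis.py session in the question shows than to a Java interpreter. MiniRepl and everything in it is invented for this example. It supports only numbers, the binary operators + - * / > <, and if. There is no define, no pi and no variables.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Scanner;

final class MiniRepl {

    /** READ: turns "(* 7 6)" into nested Lists holding Strings (symbols) and Doubles. */
    static Object read(String source) {
        String spaced = source.replace("(", " ( ").replace(")", " ) ").trim();
        return parse(new ArrayDeque<>(Arrays.asList(spaced.split("\\s+"))));
    }

    private static Object parse(Deque<String> tokens) {
        if (tokens.isEmpty()) throw new IllegalArgumentException("unexpected end of input");
        String token = tokens.poll();
        if (token.equals("(")) {
            List<Object> list = new ArrayList<>();
            while (!")".equals(tokens.peek())) {
                if (tokens.isEmpty()) throw new IllegalArgumentException("missing )");
                list.add(parse(tokens));
            }
            tokens.poll();                              // drop the closing ")"
            return list;
        }
        if (token.equals(")")) throw new IllegalArgumentException("unexpected )");
        try {
            return Double.valueOf(token);               // number
        } catch (NumberFormatException e) {
            return token;                               // anything else is a symbol
        }
    }

    /** EVAL: numbers evaluate to themselves, lists are (operator arg arg) or (if test a b). */
    static Object eval(Object expr) {
        if (expr instanceof Double) return expr;
        if (expr instanceof String) throw new IllegalArgumentException("unbound symbol: " + expr);
        List<?> list = (List<?>) expr;
        String op = (String) list.get(0);
        if (op.equals("if")) {
            boolean test = (Boolean) eval(list.get(1));
            return eval(list.get(test ? 2 : 3));        // the other branch is never evaluated
        }
        double a = (Double) eval(list.get(1));
        double b = (Double) eval(list.get(2));
        switch (op) {
            case "+": return a + b;
            case "-": return a - b;
            case "*": return a * b;
            case "/": return a / b;
            case ">": return a > b;
            case "<": return a < b;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    /** LOOP + PRINT: reads one line at a time until standard input is closed. */
    static void repl(String prompt) {
        Scanner in = new Scanner(System.in);
        System.out.print(prompt);
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (!line.trim().isEmpty()) {
                try {
                    System.out.println(eval(read(line)));
                } catch (RuntimeException e) {
                    System.out.println("error: " + e.getMessage());
                }
            }
            System.out.print(prompt);
        }
    }

    public static void main(String[] args) {
        repl("lis.java> ");
    }
}
```

All numbers are doubles here, so (* 7 6) prints 42.0 rather than 42. Adding define would mean passing an environment map into eval, which is left out to keep the sketch short.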
I just started using Wala Java Slicer to do some source code analysis tasks. I have a question about the proper use of the library. Assuming I have the following example code:
public void main(String[] args) {
    ...
    UserType ut = userType;
    int i = ut.getInt();
    ...
    System.out.println(i);
}
Calculating a slice for the println statement with Wala gives the following statements:
NORMAL_RET_CALLER:Node: < Application, LRTExecutionClass, main([Ljava/lang/String;)V > Context: Everywhere[15]13 = invokevirtual < Application, LUserType, getInt()I > 11 #27 exception:12
NORMAL main:23 = getstatic < Application, Ljava/lang/System, out, <Application,Ljava/io/PrintStream> > Node: < Application, LRTExecutionClass, main([Ljava/lang/String;)V > Context: Everywhere
NORMAL main:invokevirtual < Application, Ljava/io/PrintStream, println(I)V > 23,13 #63 exception:24 Node: < Application, LRTExecutionClass, main([Ljava/lang/String;)V > Context: Everywhere
The code I am using to create the slice with Wala is shown below:
AnalysisScope scope = AnalysisScopeReader.readJavaScope("...",
null, WalaJavaSlicer.class.getClassLoader());
ClassHierarchy cha = ClassHierarchy.make(scope);
Iterable<Entrypoint> entrypoints = Util.makeMainEntrypoints(scope, cha);
AnalysisOptions options = new AnalysisOptions(scope, entrypoints);
// Build the call graph
CallGraphBuilder cgb = Util.makeZeroCFABuilder(options, new AnalysisCache(),cha, scope, null, null);
CallGraph cg = cgb.makeCallGraph(options, null);
PointerAnalysis pa = cgb.getPointerAnalysis();
// Find seed statement
Statement statement = findCallTo(findMainMethod(cg), "println");
// Context-sensitive thin slice
Collection<Statement> slice = Slicer.computeBackwardSlice(statement, cg, pa, DataDependenceOptions.NO_BASE_NO_HEAP, ControlDependenceOptions.NONE);
dumpSlice(slice);
There are a number of statements that I expect to find in the slice but are not present:
The assign statement ut = userType is not included even though the dependent method call ut.getInt(), IS included in the slice
No statements from the implementation of getInt() are included. Is there an option to activate "inter-procedural" slicing? I should mention here that the .class file is included in the path used to create the AnalysisScope.
As you can see, I am using DataDependenceOptions.NO_BASE_NO_HEAP and ControlDependenceOptions.NONE for the dependence options. But even when I use FULL for both, the problem persists.
What am I doing wrong?
The assign statement ut = userType is not included even though the
dependent method call ut.getInt(), IS included in the slice
I suspect that assignment never makes it into the byte code since it's an un-required local variable and hence will not be visible to WALA:
Because the SSA IR has already been somewhat optimized, some
statements such as simple assignments (x=y, y=z) do not appear in the
IR, due to copy propagation optimizations done automatically during
SSA construction by the SSABuilder class. In fact, there is no SSA
assignment instruction; additionally, a javac compiler is free to do
these optimizations, so the statements may not even appear in the
bytecode. Thus, these Java statements will never appear in the slice.
http://wala.sourceforge.net/wiki/index.php/UserGuide:Slicer#Warning:_exclusion_of_copy_statements_from_slice
I have found the source of a Java compiler written in OCaml that should work.
But when I run make, it finishes with an error:
unzip.o: In function `camlUnzip__59':
(.data+0x540): undefined reference to `camlzip_deflateEnd'
unzip.o: In function `camlUnzip__59':
(.data+0x544): undefined reference to `camlzip_deflate'
unzip.o: In function `camlUnzip__59':
(.data+0x548): undefined reference to `camlzip_deflateInit'
collect2: ld returned 1 exit status
File "caml_startup", line 1, characters 0-1:
Error: Error during linking
make: *** [javacx] Error 2
It is odd that the file "caml_startup" does not even exist in the folder. Could anyone help? Thank you very much.
caml_startup is part of the OCaml runtime.
The project's website mentions that it works with OCaml 3.09, which is quite old. It worked for me with 3.10 (which is still quite old; latest release is 3.12) - maybe it just doesn't work with more recent versions.
However, as a first guess, I would try simply deleting these definitions from unzip.ml - they are never called, and declare external routines which are not actually implemented (whereas other external routines in unzip.ml are implemented in zlib.c):
external deflate_init: int -> bool -> stream = "camlzip_deflateInit"
external deflate:
stream -> string -> int -> int -> string -> int -> int -> flush_command
-> bool * int * int
= "camlzip_deflate_bytecode" "camlzip_deflate"
external deflate_end: stream -> unit = "camlzip_deflateEnd"