I made a Java agent which is attached to a JVM at runtime, instruments all the loaded project classes, and inserts some logging statements. There are 11k classes in total. I measured the total time taken by the transform method of my ClassFileTransformer and it was 3 seconds. But the whole instrumentation process takes about 30 seconds.
This is how I retransform my classes:
instrumentation.retransformClasses(myClassesArray);
I assume most of the time is taken up by the JVM reloading the changed classes. Is that right? How can I speed up the instrumentation process?
Update:
When my agent is attached,
instrumentation.addTransformer(new MyTransformer(), true);
instrumentation.retransformClasses(retransformClassArray);
is called only once.
Then the MyTransformer class instruments the classes and measures the total duration of the instrumentation:
public class MyTransformer implements ClassFileTransformer {

    private long total = 0;
    private long min = ..., max = ...;

    public final byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                                  ProtectionDomain protectionDomain, byte[] classFileBuffer) {
        long s = System.currentTimeMillis();
        if (s < min) min = s;
        if (s > max) max = s;
        byte[] transformed = this.transformInner(loader, className, classFileBuffer);
        this.total += System.currentTimeMillis() - s;
        return transformed;
    }
}
After all the classes from the initial array are instrumented (a global cache keeps track of the instrumented classes), total is printed: it is ~3 seconds. But max - min is ~30 seconds.
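For reference, the duration of the whole call can also be measured directly from the agent side; a trivial sketch around the same retransformClasses call as above:

long before = System.currentTimeMillis();
instrumentation.retransformClasses(retransformClassArray);
System.out.println("retransformClasses() took " + (System.currentTimeMillis() - before) + " ms");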
Update 2:
After looking at the stack trace this is what happens:
I call
instrumentation.retransformClasses(retransformClassArray);
which calls the native method retransformClasses0(). After some time(!) the JVM calls the transform() method of the sun.instrument.InstrumentationImpl class (this method takes only one class at a time, so the JVM calls it multiple times consecutively), which calls transform() on the sun.instrument.TransformerManager object. That object holds a list of all the registered ClassFileTransformers and calls each of them to transform the class (I have only one transformer registered!).
So in my opinion, most of the time is spent in the JVM (after retransformClasses0() is called and before each call to sun.instrument.InstrumentationImpl.transform()). Is there a way to reduce the time the JVM needs to carry out this task?
Correction:
I previously stated that retransformClasses(classArr) will not retransform all the elements of classArr at once, but will instead retransform each of them lazily as needed (e.g. while linking). That was wrong: it does retransform all of them at once (refer to the JDK sources for [VM_RedefineClasses][1] and [jvmtiEnv][2]).
What retransformClasses() does:
1. Transfer control to the native layer and hand it the list of classes we want to transform.
2. For every class to be transformed, the native code tries to get a new version by calling our Java transformer; this causes control to pass back and forth between Java code and native code.
3. The native code replaces the appropriate parts of the internal class representation with the new class version, one class after another.
In step 1:
java.lang.instrument.Instrumentation#retransformClasses calls sun.instrument.InstrumentationImpl#retransformClasses0, which is a JNI method, so control is transferred to the native layer.
// src/hotspot/share/prims/jvmtiEnv.cpp
jvmtiError
JvmtiEnv::RetransformClasses(jint class_count, const jclass* classes) {
...
VM_RedefineClasses op(class_count, class_definitions, jvmti_class_load_kind_retransform);
VMThread::execute(&op);
...
} /* end RetransformClasses */
In step 2:
This step is implemented by KlassFactory::create_from_stream. That procedure posts a ClassFileLoadHook event, whose callback can acquire the transformed bytecode by invoking the Java transformer method. In this step, control switches back and forth between native code and Java code.
// src/hotspot/share/classfile/klassFactory.cpp
// check and post a ClassFileLoadHook event before loading a class
// Skip this processing for VM hidden or anonymous classes
if (!cl_info.is_hidden() && (cl_info.unsafe_anonymous_host() == NULL)) {
stream = check_class_file_load_hook(stream,
name,
loader_data,
cl_info.protection_domain(),
&cached_class_file,
CHECK_NULL);
}
//src/java.instrument/share/native/libinstrument/JPLISAgent.c :
//call java code sun.instrument.InstrumentationImpl#transform
transformedBufferObject = (*jnienv)->CallObjectMethod(
jnienv,
agent->mInstrumentationImpl, //sun.instrument.InstrumentationImpl
agent->mTransform, //transform
moduleObject,
loaderObject,
classNameStringObject,
classBeingRedefined,
protectionDomain,
classFileBufferObject,
is_retransformer);
In step 3:
The VM_RedefineClasses::redefine_single_class(jclass the_jclass, InstanceKlass* scratch_class, TRAPS) method replaces parts (such as the constant pool, methods, etc.) of the target class with the corresponding parts of the transformed class.
// src/hotspot/share/prims/jvmtiRedefineClasses.cpp
for (int i = 0; i < _class_count; i++) {
redefine_single_class(_class_defs[i].klass, _scratch_classes[i], thread);
}
So how to speed up runtime Java code instrumentation?
In my project, the total time and the max - min time are almost the same if the app is paused while transforming. Can you provide some demo code?
It's impossible to change the way the JVM works, so multithreading may not be a bad idea. It got several times faster after using multithreading in my demo project.
From your description it seems like the complete transformation runs in a single thread.
You could create multiple threads, each transforming one class at a time. As the transformation of one class should be independent of any other class, this should improve the overall transformation time roughly by a factor of the number of cores available on the executing system.
You can count the cores with:
int cores = Runtime.getRuntime().availableProcessors();
Chunk the list of classes to be transformed into that number of parts and create that many threads to process the chunks in parallel, as in the sketch below.
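A minimal sketch of that idea, assuming the registered transformer is thread-safe and that the Instrumentation object and class array are the ones from the question (an illustration, not a drop-in implementation):

import java.lang.instrument.Instrumentation;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelRetransform {

    // Splits the class array into one chunk per core and retransforms each chunk on its own thread.
    // The registered ClassFileTransformer must be thread-safe for this to be correct.
    public static void retransform(Instrumentation instrumentation, Class<?>[] classes)
            throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        int chunkSize = (classes.length + cores - 1) / cores;
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Callable<Object>> tasks = new ArrayList<>();
        for (int start = 0; start < classes.length; start += chunkSize) {
            Class<?>[] chunk = Arrays.copyOfRange(classes, start, Math.min(start + chunkSize, classes.length));
            tasks.add(() -> {
                instrumentation.retransformClasses(chunk); // may throw UnmodifiableClassException
                return null;
            });
        }
        pool.invokeAll(tasks); // blocks until every chunk has been processed; inspect the returned futures for exceptions
        pool.shutdown();
    }
}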
Related
We want to enable our customers to customize certain aspects of their request processing by letting them write something (currently looking at Groovy scripts), then have those scripts saved in a DB and applied when necessary; this way we won't have to maintain all those tiny processing details that might apply to certain customers only.
So, with Groovy, a naive implementation would go like this:
GroovyShell shell = new GroovyShell();    // 1. prepare execution engine - probably once per thread
// 2. retrieve script body from the DB, when necessary
Script script = shell.parse(scriptBody);  // 3. parse/compile execution unit
Binding binding = prepareBinding(..);     // 4. provide script instance with execution context
script.setBinding(binding);
script.run();                             // 5. run the script
doSomething(binding);
When run one after the other, step 1 takes approx. 800 ms, step 3 takes almost 2000 ms, and step 5 takes about 150 ms. Absolute numbers will vary, but the relative numbers are quite stable. Assuming that step 1 is not going to be executed per request, and that the execution time of step 5 is quite tolerable, I am very much concerned with step 3: parsing the Script instance from the Groovy source code. I did some reading across the documentation and code, and some googling as well, but have not thus far discovered any solution, so here's the question:
Can we somehow pre-compile Groovy code ONCE, then persist it in the DB and re-hydrate it whenever necessary, to obtain an executable Script instance (which we could also cache when necessary)?
Or (as I am just thinking now) we could just compile Java code to bytecode and persist it in the DB? Anyway, I am not so much concerned about the particular language used for the scripts, but sub-second execution time is a must. Thanks for any hints!
NB: I am aware that GroovyShellEngine will likely cache the compiled script; that still risks too long of a delay for first time execution, also risks memory overconsumption...
UPD1: based on the excellent suggestion by @daggett, I've modified the solution to look as follows:
GroovyShell shell = new GroovyShell();
final Class<? extends MetaClass> theClass = shell.parse(scriptBody).getMetaClass().getTheClass();
Script script = InvokerHelper.createScript(theClass, binding);
script.run();
this works all fine and well! Now, we need to de-couple metaclass creation and usage; for that, I've created a helper method:
private Class dehydrateClass(Class theClass) throws IOException, ClassNotFoundException {
    final ByteArrayOutputStream stream = new ByteArrayOutputStream();
    ObjectOutputStream outputStream = new ObjectOutputStream(stream);
    outputStream.writeObject(theClass);
    InputStream in = new ByteArrayInputStream(stream.toByteArray());
    final ObjectInputStream inputStream = new ObjectInputStream(in);
    return (Class) inputStream.readObject();
}
which I've tested as follows:
@Test
void testDehydratedClass() throws IOException, ClassNotFoundException, IllegalAccessException, InstantiationException {
    RandomClass instance = (RandomClass) dehydrateClass(RandomClass.class).newInstance();
    assertThat(instance.getName()).isEqualTo("Test");
}

public static class RandomClass {

    private final String name;

    public RandomClass() {
        this("Test");
    }

    public RandomClass(String name) {
        this.name = name;
    }

    public String getName() {
        return this.name;
    }
}
which passes OK, meaning that, in general, this approach works.
However, when I try to apply this dehydrateClass approach to theClass, returned by compile phase, I get this exception:
java.lang.ClassNotFoundException: Script1
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:686)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1866)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1749)
at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1714)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1554)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
so my impression is that this de-serialization trick will not do any good if the ClassLoader in question does not already know what a Script1 is. It seems like the only way to make this kind of approach work is to save those pre-compiled classes somehow, somewhere, or maybe learn to serialize them differently.
You can parse/compile scripts/classes during editing and store the compiled version somewhere - in a database, the file system, memory, ...
Here is a Groovy code snippet that compiles a script/class to bytecode and then defines/loads classes from that bytecode.
import org.codehaus.groovy.control.BytecodeProcessor
import org.codehaus.groovy.control.CompilerConfiguration
//bytecode processor that could be used to store bytecode to cache(file,db,...)
@groovy.transform.CompileStatic
class BCP implements BytecodeProcessor {
    Map<String, byte[]> bytecodeMap = [:]

    byte[] processBytecode(String name, byte[] original) {
        println "$name >> ${original.length}"
        bytecodeMap[name] = original //here we could store bytecode to a database or file system instead of a memory map
        return original
    }
}
def bcp = new BCP()
//------ COMPILE PHASE
def cc1 = new CompilerConfiguration()
cc1.setBytecodePostprocessor(bcp)
def gs1 = new GroovyShell(new GroovyClassLoader(), cc1)
//the next line will define 2 classes: MyConst and MyAdd (extends Script) named after the filename
gs1.parse("class MyConst{static int cnt=0} \n x+y+(++MyConst.cnt)", "MyAdd.groovy")
//------ RUN PHASE
// let's create another classloader that has no information about classes MyAdd and MyConst
def cl2 = new GroovyClassLoader()
//this try-catch is just to test that MyAdd fails to load at this point,
//because it is unknown to the 2nd class loader
try {
    cl2.loadClass("MyAdd")
    assert 1==0: "this should not happen because the previous line should throw an exception"
} catch(ClassNotFoundException e) {}

//now define the previously compiled classes from the bytecode
//you can load the bytecode from the file system or from a database
//for test purposes let's take them from the map
bcp.bytecodeMap.each { String name, byte[] bytes ->
    cl2.defineClass(name, bytes)
}
def myAdd = cl2.loadClass("MyAdd").newInstance()
assert myAdd instanceof groovy.lang.Script //it's a script
myAdd.setBinding([x: 1000, y: 2000] as Binding)
assert myAdd.run() == 3001 // +1 because we have x+y+(++MyConst.cnt)
myAdd.setBinding([x: 1100, y: 2200] as Binding)
assert myAdd.run() == 3302
println "OK"
I'm using JavaCompiler to dynamically create a Java class, compile it and load it in my application.
My problem is the following: the execution time with JavaCompiler is much slower than the standard way of instantiating the same class.
Here an example:
static void function() throws Exception {
    long startTime = System.currentTimeMillis();

    String source = "package myPackage; import java.util.BitSet; public class MyClass{ static {";
    while (!OWLMapping.axiomStack.isEmpty()) {
        source += OWLMapping.axiomStack.pop() + ";";
    }
    source += "} }";

    File root = new File("/java");
    File sourceFile = new File(root, "myPackage/MyClass.java");
    sourceFile.getParentFile().mkdirs();
    Files.write(sourceFile.toPath(), source.getBytes(StandardCharsets.UTF_8));

    // Compile source file.
    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    compiler.run(null, null, null, sourceFile.getPath());

    // Load and instantiate compiled class.
    URLClassLoader classLoader = URLClassLoader.newInstance(new URL[] { root.toURI().toURL() });
    Class<?> cls = Class.forName("myPackage.MyClass", true, classLoader);

    long stopTime = System.currentTimeMillis();
    long elapsedTime = stopTime - startTime;
    System.out.println("EXECUTION TIME: " + elapsedTime);
}
After measuring this code, I created a new Java class with the same content as the var source to test the performance: it is much faster than the JavaCompiler way. (I cannot use a standard class because in my application I need to create it dynamically.)
So, is it possible to improve the performance of this code? Or is this low performance normal?
EDIT: the generated code I also tested is a simple sequence of OWLAPI axioms:
package myPackage;

public class MyClass {
    static {
        myPackage.OWLMapping.manager.addAxiom(myPackage.OWLMapping.ontology, myPackage.OWLMapping.factory.getOWLSubClassOfAxiom(/*whatever*/));
        myPackage.OWLMapping.manager.addAxiom(myPackage.OWLMapping.ontology, myPackage.OWLMapping.factory.getOWLSubClassOfAxiom(/*whatever*/));
        myPackage.OWLMapping.manager.addAxiom(myPackage.OWLMapping.ontology, myPackage.OWLMapping.factory.getOWLSubClassOfAxiom(/*whatever*/));
    }
}
and this is exactly what the variable source contains.
The number of axioms depends on the user's input.
You have two areas which are likely to be slow (but your benchmark combines the two).
The first is building the Java String which contains your source code. When you append Strings across separate statements, the JVM can't optimize them into a single StringBuilder chain: it first creates the String on one side of the append, then the String on the other, and then a third String resulting from the two being appended. This puts a lot of pressure on the heap and garbage collection, generating lots of objects which are almost immediately garbage collected.
To fix the first problem, create a StringBuilder and call its .append(...) method.
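For example, a sketch reusing the same OWLMapping.axiomStack from the question:

StringBuilder sb = new StringBuilder("package myPackage; import java.util.BitSet; public class MyClass{ static {");
while (!OWLMapping.axiomStack.isEmpty()) {
    sb.append(OWLMapping.axiomStack.pop()).append(';');
}
sb.append("} }");
String source = sb.toString();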
The second problem is that you are instantiating a JavaCompiler. The compiler used to compile Java programs may have one class driving it at the top level, but it pulls in tons of supporting classes to fill out its private fields and embedded dependencies. Finally, when it runs, more objects are created to hold the code: the lexer, the parser, the AST of the CompilationUnit, and eventually the bytecode emitter. This means that these two lines of code
JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
compiler.run(null, null, null, sourceFile.getPath());
are likely (again, they are not independently benchmarked) to take some time.
Finally, the class loader lines interact with the class loading system, which might be poorly adapted for performance. While there is a smaller chance that it is a big performance hit, I'd benchmark that line independently too.
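One rough way to split the measurement inside the question's function() is to take intermediate timestamps; buildSource() below is a hypothetical helper standing in for the string-building loop, and the other variables are the ones from the question:

long t0 = System.nanoTime();
String source = buildSource();                          // hypothetical helper wrapping the string-building loop
long t1 = System.nanoTime();
Files.write(sourceFile.toPath(), source.getBytes(StandardCharsets.UTF_8));
compiler.run(null, null, null, sourceFile.getPath());   // compilation
long t2 = System.nanoTime();
Class<?> cls = Class.forName("myPackage.MyClass", true, classLoader); // class loading
long t3 = System.nanoTime();
System.out.printf("build=%dms compile=%dms load=%dms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);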
I have a Java program that needs to call the same external executable 6 times. This executable produces an output file, and once all 6 runs are complete I "merge" these files together. I originally just had a for-loop where I ran the executable, waited for it to end, then called it again, and so on.
I found this highly time-consuming, averaging 52.4 s for the 6 runs. I figured it would be pretty easy to speed up by running the external executable 6 times all at once, especially since the runs aren't dependent on one another. I used ExecutorService, Runnable, etc. to achieve this.
With my current implementation, I shave about ~5s off my time, making it only ~11% faster.
Here is some (simplified) code that explains what I'm doing:
private final List<Callable<Object>> tasks = new ArrayList<Callable<Object>>();
....
private void setUpThreadsAndRun() {
    ExecutorService executor = Executors.newFixedThreadPool(6);
    for (int i = 0; i < 6; i++) {
        //create the params object
        tasks.add(Executors.callable(new RunThread(params)));
    }
    try {
        executor.invokeAll(tasks);
    } catch (InterruptedException ex) {
        //uh-oh
    }
    executor.shutdown();
    System.out.println("Finished all threads!");
}
private class RunThread implements Runnable {

    private final ModelParams params;

    public RunThread(ModelParams params) {
        this.params = params;
    }

    @Override
    public void run() {
        try {
            //NOTE: cmdarray is constructed from the params object
            ProcessBuilder pb = new ProcessBuilder(cmdarray);
            pb.directory(new File(location));
            Process p = pb.start();
            p.waitFor(); // wait for the external executable to finish
        } catch (IOException | InterruptedException e) {
            //uh-oh
        }
    }
}
I'm hoping there is a more efficient way to do this...or maybe I'm "clogging" my computer's resources by trying to run this process 6 times at once. This process does involve file I/O and writes files that are about 30mb in size.
The only time that forking the executable 6 times will yield a performance boost is if you have at least 6 CPU cores and your application is CPU bound -- i.e. mostly doing processor operations. Since each run writes a 30 MB file, it sounds like it is doing a large amount of IO and the processes are IO bound instead -- limited by your hardware's ability to service the IO requests.
To speed up your program, you might try 2 concurrent processes to see if you get an improvement. However, if your program is IO bound, then you will never get much of a speed improvement by forking multiple copies.
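A sketch of that variant, reusing the RunThread class from the question (and assuming it calls p.waitFor() so each task represents a complete run):

private void setUpThreadsAndRun() throws InterruptedException {
    // Only 2 external processes run at any one time.
    ExecutorService executor = Executors.newFixedThreadPool(2);
    List<Callable<Object>> tasks = new ArrayList<>();
    for (int i = 0; i < 6; i++) {
        //create the params object
        tasks.add(Executors.callable(new RunThread(params)));
    }
    executor.invokeAll(tasks); // blocks until all six runs have completed
    executor.shutdown();
    System.out.println("Finished all runs!");
}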
I want to launch a java subprocess, with the same java classpath and dynamically loaded classes as the current java process. The following is not enough, because it doesn't include any dynamically loaded classes:
String classpath = System.getProperty("java.class.path");
Currently I'm searching for the location of each needed class with the code below. However, on some machines this fails for some classes/libs: the source variable is null. Is there a more reliable and simpler way to get the locations of the libs used by the current JVM process?
String stax = ClassFinder.classPath("javax.xml.stream.Location");
public static String classPath(String qualifiedClassName) throws NotFoundException {
    try {
        Class qc = Class.forName( qualifiedClassName );
        CodeSource source = qc.getProtectionDomain().getCodeSource();
        if ( source != null ) {
            URL location = source.getLocation();
            String f = location.getPath();
            f = URLDecoder.decode(f, "UTF-8"); // decode URL to avoid spaces being replaced by %20
            return f.substring(1);
        } else {
            throw new ClassFinder().new NotFoundException(qualifiedClassName + " (unknown source, likely rt.jar)");
        }
    } catch ( Exception e ) {
        throw new ClassFinder().new NotFoundException(qualifiedClassName);
    }
}
See my previous question which covers getting the classpath as well as how to launch a sub-process.
I want to launch a java subprocess, with the same java classpath and dynamically loaded classes as the current java process.
You mean invoke a new JVM?
Given that...
it is possible to plug in all sorts of agents and instrumentation into a JVM that can transform classes at load time
it is possible to take a byte array and turn it into a class
it is possible to have complex class loader hierarchies with varying visibility between classes and have the same classes loaded multiple times
...there is no general, magic, catch-all and foolproof way to do this. You should design your application and its class loading mechanisms to achieve this goal. If you allow 3rd party plug-ins, you'll have to document how this works and how they have to register their libraries.
If you look at the javadoc for Class.getClassLoader, you'll see that the "bootstrap" classloader is typically represented as null: String.class.getClassLoader() will return null on the usual Sun JVM implementations. I think this implementation detail carries over into the CodeSource machinery. As such, I wouldn't worry about any class that comes from the bootstrap classloader, as long as your sub-process uses the same JVM implementation as the current process.
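A sketch of that rule of thumb (illustrative names, not the asker's code): classes whose ClassLoader or CodeSource is null are treated as JVM-supplied and simply contribute no classpath entry:

import java.io.File;
import java.security.CodeSource;

public final class ClassPathLookup {

    // Returns the classpath entry that supplied the given class,
    // or null for bootstrap/JVM-supplied classes that need no entry in a child process's classpath.
    public static String classPathEntryOrNull(String qualifiedClassName) throws Exception {
        Class<?> qc = Class.forName(qualifiedClassName);
        if (qc.getClassLoader() == null) {
            return null; // bootstrap class, e.g. java.lang.String
        }
        CodeSource source = qc.getProtectionDomain().getCodeSource();
        if (source == null) {
            return null; // no known location; treat like a JVM-supplied class
        }
        // Going through URI/File avoids manual URL-decoding of spaces etc.
        return new File(source.getLocation().toURI()).getAbsolutePath();
    }
}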
I'm constructing a framework in Java that will listen for events and then process them in Jython. Different event types will be sent to different scripts.
Since Jython takes quite some time to compile a script when PythonInterpreter.exec() is called, I will have to pre-compile the scripts. I'm doing it the following way:
// initialize the script as string (would load it from file in final version)
String script = "print 'foo'";
// get the compiled code object
PyCode compiled = org.python.core.__builtin__.compile( script, "<>", "exec" );
The compiled PyCode object would be pushed to a repository and used as events come in:
PythonInterpreter pi = new PythonInterpreter();
pi.set( "variable_1", "value_1");
pi.set( "variable_x", "value_x");
pi.exec( compiled );
Now for my conundrum: it might happen that multiple events of a certain type occur at the same time, and thus multiple instances of the same script run at the same time.
Almost all scripts would probably remain short-lived - up to 100 lines, no loops. The number and frequency of events are completely random (they are user generated) and could be from 0 to about 200 per second per event type.
What would be the best way to do this? I'm looking at a few possibilities:
use synchronization at the trigger-event point - this would prevent multiple instances of the same script, but events wouldn't be processed as quickly as they should be
create a pool of same-type scripts, somehow populated by cloning the original PyCode object - the biggest problem would probably be optimizing the pool sizes
dynamically clone the script object from the parent whenever needed and then discard it when exec() finishes - this removes the compile lag, but the cost of the clone method remains
Probably a combination of numbers 2 and 3 would be best - creating dynamically sized pools?
So, any thoughts? ;)
It is a pity that PyCode instances aren't immutable (there are a lot of public members on the classes).
You can precompile a reusable script using this code:
// TODO: generate this name
final String name = "X";
byte[] scriptBytes = PyString.to_bytes(script);
CompilerFlags flags = Py.getCompilerFlags();
ByteArrayOutputStream ostream = new ByteArrayOutputStream();
Module.compile(parser.parse(new ByteArrayInputStream(scriptBytes), "exec",
"<>", flags), ostream, name, "<>", false, false, false, flags);
byte[] buffer = ostream.toByteArray();
Class<PyRunnable> clazz = BytecodeLoader.makeClass(name, null, buffer);
final Constructor<PyRunnable> constructor = clazz
.getConstructor(new Class[] { String.class });
You can then use the constructor to produce PyCode instances for the script whenever you need one:
PyRunnable r = constructor.newInstance(name);
PyCode pc = r.getMain();
I would be the first to admit that this is not a good way of doing things and probably speaks volumes about my inexperience with Jython. However, it is significantly faster than compiling every time. The code works under Jython 2.2.1, but won't compile under Jython 2.5 (nor will yours).
PythonInterpreter is expensive, this code will use only one.
#action.py
def execute(filename, action_locals):
    #add caching of compiled scripts here
    exec(compile(open(filename).read(), filename, 'exec'), action_locals)
//class variable, only one interpreter
PythonInterpreter pi;

//run once in init() or constructor
pi = new PythonInterpreter(); //could do more initialization here
pi.exec("import action");

//every script execution
PyObject pyActionRunner = pi.eval("action.execute");
PyString pyActionName = new PyString(script_path);
PyDictionary pyActionLocals = new PyDictionary();
pyActionLocals.put("variable_1", "value_1");
pyActionLocals.put("variable_x", "value_x");
pyActionRunner.__call__(pyActionName, pyActionLocals);
#example_script.py
print variable_1, variable_x