Handle Multiple Access to dll function - java

I have a problem while accessing DLL functions from multiple threads.
I am using my own compiled DLL. I call a DLL function from Java (JNA) with multiple Java threads.
The operation I am performing is image processing.
With this method I observe a small loss in frame-generation speed.
I am wondering if it is because of the multi-threaded access to the DLL function.
Here is the function I am using:
__declspec(dllexport) int iterate(double z_r, double z_i, double c_r, double c_i, double maxIteration) {
    double tmp;
    int i = 0;
    /* escape-time loop: bail out once |z| >= 2 or the iteration cap is hit */
    while (z_r * z_r + z_i * z_i < 4 && i < maxIteration) {
        tmp = z_r;
        z_r = z_r * z_r - z_i * z_i + c_r;
        z_i = 2 * z_i * tmp + c_i;
        i++;
    }
    return i;
}

The problem probably isn't that you are accessing the function from multiple threads; more likely it is the cost of the external call itself. I don't know how big your values for, say, maxIteration are, but it seems to me that this code snippet doesn't run very long, just very often.
Especially when using JNA, there is probably serious overhead in invoking this method. So you should try to do more work per native call before returning to Java (and invoking the external method again...). That way, the performance advantage you might have in C can make up for the overhead.
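A minimal sketch of what that batching could look like on the Java side with JNA. Everything beyond iterate itself is an assumption: iterateBlock is a hypothetical batched native function you would add next to iterate, and the library name "fractal" is a placeholder:
import com.sun.jna.Library;
import com.sun.jna.Native;

public interface Fractal extends Library {
    // assumes the DLL is named fractal.dll and is on jna.library.path
    Fractal INSTANCE = (Fractal) Native.loadLibrary("fractal", Fractal.class);

    // existing per-pixel call: one Java-to-native crossing per pixel
    int iterate(double z_r, double z_i, double c_r, double c_i, double maxIteration);

    // hypothetical batched call: one crossing per row or tile;
    // JNA maps double[] to double* and int[] to int* automatically
    void iterateBlock(double[] c_r, double[] c_i, int count, int maxIteration, int[] out);
}
Each thread can then fill its own c_r/c_i/out arrays independently, so the native side stays re-entrant exactly as before, but the per-call JNA overhead is paid once per row or tile instead of once per pixel.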
That said, it is not certain that this method runs faster written in C than written in Java. Without a citation at hand at the moment (I will try to find one), I heard in a lecture a few weeks ago that Java is supposed to be remarkably fast at simple arithmetic operations, and that is the only thing your method does. You should also check that you enabled compiler optimizations when compiling your C library.
Edit: This Wikipedia article states that Java's performance on arithmetic operations is similar to that of equivalent programs written in C++. So the performance advantage might be slight, and the overhead I mentioned before might be the deciding factor.

Related

Will the C# JIT implement "inline virtual method" optimizations in future versions, taking inspiration from Java?

Or should I consider refactoring my virtual indexing method (and its class) into a code-duplicated but faster one?
Here is the issue I'm stuck on: I had some duplicated code, refactored it, and unified it into a single class with just a single virtual method in the child classes, to minimize future code duplication. Now it is 50% slower than before at accomplishing this:
arr[i]=3.14f; // arr is derived from a base class with `[]` override
(so the derived class implementation is used).
but it is now 500% easier to add new types.
How many if-else checks in a non-virtual method make it as fast as a virtual one with no if-else checks inside (on today's CPUs with 20-30 stage pipelines)? With float + char + double + some other structs there would be more than 15 different types in my library, so 15x code duplication would make the code 1500% harder to implement and refactor without virtual methods. (A Java analogue of this dispatch comparison is sketched below, after the environment details.)
Example of my issue:
// Implemented IList<T> so that C# arrays can be used
// in the same wrapper property too!
// Reduced even more code duplication.
public class Foo<T> : IList<T>
{
    public virtual T this[int i]
    { ... }
}
public unsafe class Bar : Foo<byte>
{
    // pByte is an unmanaged byte* field pointing into the GPGPU buffer
    public override byte this[int i]
    {
        get
        {
            return *(pByte + i);
        }
        set
        {
            *(pByte + i) = value;
        }
    }
}
Bar b = new Bar(); // Can't use Foo<byte> directly,
// because I prevented that by making its constructor `internal`:
// misusing it would produce undefined behaviour (worse than an
// out-of-bounds access) at a random time, in a random place.
b[400] = 50;
The reason I would have to duplicate code without virtual methods is that pointers are not allowed for generic type parameters T.
The reason I have to use pointers is that I have fast unmanaged GPGPU C++ arrays that should behave exactly like pure C# arrays when viewed from outside.
The reason I had to use unmanaged arrays for GPGPU is that they run at top speed when aligned to values like 4096 (unobtainable for managed arrays) and pinned, and they also reduce C#-to-C++ transition overheads.
Note: maybe it is not only virtual dispatch but also the IList<T> interface contributing to the slowness. Many answers say it comes with a cost, but if Java can work around it, why can't C#?
Here is the environment:
.NET 3.5
MSVS 2015 Community Edition, all optimizations enabled
Windows 10, 64-bit
64-bit release build
C3060 CPU with single-channel DDR3 RAM
For benchmarking, a warm-up phase is included; timings are taken after many iterations and used in real data visualizations.
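As referenced above, here is a Java analogue of the two dispatch shapes being compared, as a hypothetical sketch (all class and field names are invented for illustration; on the JVM the monomorphic virtual call is typically devirtualized and inlined by the JIT, which is part of why "if Java can work around it" comes up):
// virtual dispatch: one overridable call through a base type
abstract class Buffer {
    abstract void set(int i, double value);
}

final class DoubleBuffer extends Buffer {
    private final double[] data = new double[1024];
    @Override void set(int i, double value) { data[i] = value; }
}

// non-virtual alternative: branch on a type tag instead of dispatching
class TaggedBuffer {
    static final int TYPE_DOUBLE = 0, TYPE_FLOAT = 1;
    final int typeTag;
    final double[] doubles = new double[1024];
    final float[] floats = new float[1024];

    TaggedBuffer(int typeTag) { this.typeTag = typeTag; }

    void set(int i, double value) {
        if (typeTag == TYPE_DOUBLE) doubles[i] = value;
        else if (typeTag == TYPE_FLOAT) floats[i] = (float) value;
    }
}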

Where to patch back the information gathered during program analysis

I'm new to compiler design and have a few years of experience with Java.
Using this and the paper, it looks like class hierarchy analysis and rapid type analysis will yield the information needed to do devirtualization. But where do I patch that information back: into the source code, or into the bytecode? And how do I check the results?
I'm trying to understand how this really happens, but I'm stuck here.
For example, here is a program taken from the paper specified above.
public class MyProgram {
    public static void main(String[] args) {
        EUCitizen citizen = getCitizen();
        citizen.hasRightToVote(); // Call site 1
        Estonian estonian = getEstonian();
        estonian.hasRightToVote(); // Call site 2
    }

    private static EUCitizen getCitizen() {
        return new Estonian();
    }

    private static Estonian getEstonian() {
        return new Estonian();
    }
}
Using the class hierarchy method, we can conclude that since none of the subclasses override hasRightToVote(), the dynamic method invocation can be replaced with a static procedure call to Estonian#hasRightToVote(). But where do we put that information back, and how? How do we tell the JVM (feed it the information) what we have gathered during analysis?
We can't change the source code and put it there, can we? Could anyone provide an example so I can start trying new ways to do analysis and still be able to patch that information back?
Thanks.
Class hierarchy analysis is an optimization done by the virtual machine itself at runtime; you do not have to tell the VM anything. It simply does the analysis by itself, based on the information available in the class files.
What generally happens is that analysis results are stored as some kind of association with a program representation, or are used immediately to effect the optimization, so "nothing" needs to be stored.
You are right: there is generally no "good" way to annotate the source code with an analysis result (you could use Java annotations as one way). But the compiler has already read the source code and isn't going to read it again.
In general, the program is parsed and a variety of compiler-like structures are built (ASTs, symbol tables, control flow graphs, data flow arcs, ...) by the compiler pretty much before any serious analysis/optimization begins. A low-level model of the program (data flow over the operators) is normally what gets analyzed, and the optimization analyzer will either decorate this structure with its conclusions or, often, just directly modify the structure to achieve the effect of the optimization.
With Java, there are two opportunities to do this: in JavaC, and in the JITter. My understanding (probably wrong, and probably varying across JavaC implementations) is that not much optimization occurs in JavaC at all; it just generates naive JVM bytecode, and all the real work is done in the JITter. The JITter doesn't have the source code, but it can do all the same kinds of analysis (control flow, data flow, ...) on the bytecode that one can do on classic compiler structures, and thus achieve the same effect.
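As for the "how do I check the results?" part of the question: HotSpot can print its compilation and inlining decisions at runtime, which is the practical way to watch inlining and devirtualization happen. These are standard HotSpot diagnostic flags:
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyProgram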
I had some of the same doubts, and Rohan Padhey cleared them up.
In Java, I don't think there is a way to specify monomorphism of virtual method calls in bytecode. The devirtualization analysis usually happens in the JIT compiler, which compiles bytecode to native code, and it does so using dynamic analysis.
Why patching is a problem:
In Java bytecode, the only method call instructions are invokestatic, invokedynamic, invokevirtual, invokeinterface and invokespecial (the last is used for constructors and the like). Of these, invokestatic does not involve a virtual method table lookup, since static methods cannot be overridden and used polymorphically on objects. (invokespecial is also dispatched directly, but only to constructors, private methods and super calls.)
Hence, while there is no way to specify the target method at compile time, you can replace virtual calls with static calls. How? Consider an object "x" with a method "foo" and a call site:
x.foo(arg1, arg2, ...)
If you know for sure that "x" is of the class "A", then you can transform this to:
A.static_foo(x, arg1, arg2, ...)
where "static_foo" is a newly created static method in class A whose body contains exactly everything that the body of "foo()" in "A" would have done, except that references to "this" inside the body should now be replaced by the first parameter, whatever you may call it.
That is exactly what the Whole-Jimple-Optimization-Pack (WJOP) in Soot does.
As regards static analysis using Soot, there is an optimization pack that does devirtualization using a work-around: https://github.com/Sable/soot/wiki/Whole-program-Devirtualization-Optimizations
But that's just a hack.
Why the JIT does this better:
The JIT does this better because static analysis has to be sound: to apply the transformation, you need to be sure that the target of the virtual call is one class 100% of the time. With JIT compilation you can find more optimization opportunities, because even if the target is a single class only 90% of the time, you can just-in-time compile the code to take the most frequent route and fall back to bytecode in the 10% of cases where the prediction is wrong, since the mistake can be checked dynamically. While the fall-back is expensive, correct predictions in the common 90% of cases give an overall benefit. With a static transformation, you must decide once whether or not to optimize, and that decision had better be sound.
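Conceptually, the speculative version the JIT produces looks like this guarded inline, written here as Java source for the earlier EUCitizen example (the real thing happens in compiled code; the inlined body shown is a placeholder, and hasRightToVote() is assumed to return a boolean):
// guard: cheap exact-class check against the profiled hot receiver type
boolean canVote;
if (citizen.getClass() == Estonian.class) {
    canVote = true; // fast path: placeholder for the inlined body of Estonian.hasRightToVote()
} else {
    // slow path taken on a wrong guess: ordinary virtual dispatch,
    // or deoptimization back to the interpreter
    canVote = citizen.hasRightToVote();
}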

Using scala's ParHashMap in Java's project instead of ConcurrentHashMap

I've got a fairly complicated project which heavily uses Java's multithreading. In an answer to one of my previous questions I described an ugly hack which is supposed to overcome the inherent inability to iterate over Java's ConcurrentHashMap in parallel. Although it works, I don't like ugly hacks, and I've had a lot of trouble trying to introduce the proposed proof of concept into the real system. Trying to find an alternative solution, I encountered Scala's ParHashMap, which claims to implement a foreach method that seems to operate in parallel. Before I start learning a new language to implement a single feature, I'd like to ask the following:
1) Is the foreach method of Scala's ParHashMap scalable?
2) Is it simple and straightforward to call Java code from Scala and vice versa? As a reminder, the code is concurrent and uses generics.
3) Is there going to be a performance penalty for switching a part of the codebase to Scala?
For reference, this is my previous question about parallel iteration of ConcurrentHashMap:
Scalable way to access every element of ConcurrentHashMap<Element, Boolean> exactly once
EDIT
I have implemented the proof of concept, in probably very non-idiomatic Scala, but it works just fine. AFAIK it is IMPOSSIBLE to implement a corresponding solution in Java given the current state of its standard library and any available third-party libraries.
import scala.collection.parallel.mutable.ParHashMap

class Node(value: Int, id: Int) {
  var v = value
  var i = id
  override def toString(): String = v.toString
}

object testParHashMap {
  def visit(entry: Tuple2[Int, Node]) {
    entry._2.v += 1
  }

  def main(args: Array[String]) {
    val hm = new ParHashMap[Int, Node]()
    for (i <- 1 to 10) {
      val node = new Node(0, i)
      hm.put(node.i, node)
    }
    println("========== BEFORE ==========")
    hm.foreach { println }
    hm.foreach { visit }
    println("========== AFTER ==========")
    hm.foreach { println }
  }
}
I come to this with some caveats:
Though I can do some things, I consider myself relatively new to Scala.
I have only read about but never used the par stuff described here.
I have never tried to accomplish what you are trying to accomplish.
If you still care what I have to say, read on.
First, here is an academic paper describing how the parallel collections work.
On to your questions.
1) When it comes to multi-threading, Scala makes life so much easier than Java. The abstractions are just awesome. The ParHashMap you get from a par call will distribute the work to multiple threads. I can't say how that will scale for you without a better understanding of your machine, configuration, and use case, but done right (particularly with regard to side effects) it will be at least as good as a Java implementation. However, you might also want to look at Akka to have more control over everything. It sounds like that might be more suitable to your use case than simply ParHashMap.
2) It is generally simple to convert between Java and Scala collections using JavaConverters and the asJava and asScala methods. I would suggest though making sure that the public API for your method calls "looks Java" since Java is the least common denominator. Besides, in this scenario, Scala is an implementation detail, and you never want to leak those anyway. So keep the abstraction at a Java level.
3) I would guess there will actually be a performance gain with Scala at runtime. However, you will find much slower compile times (which can be worked around, ish). This Stack Overflow post by the author of Scala is old but still relevant.
Hope that helps. That's quite a problem you got there.
Since Scala compiles to the same bytecode as Java, doing the same thing in both languages is entirely possible, no matter the task. There are, however, some things that are easier to solve in Scala; whether that is worth learning a new language is a different question. Especially since Java 8 will include exactly what you ask for: simple parallel execution of functions on lists.
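For reference, the Java 8 shape of this (using the stream API as currently proposed; process is the caller's own method, as in the snippet below) would be roughly:
myMap.entrySet().parallelStream().forEach(entry -> process(entry));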
But even now you can do this in Java, you just need to write what Scala already has on your own.
import java.util.Map.Entry;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final ExecutorService executor =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
//...
// toArray(new Entry[0]) is needed here: a plain toArray() returns Object[],
// which cannot be cast to Entry[] at runtime.
@SuppressWarnings("unchecked")
final Entry<String, String>[] elements =
        myMap.entrySet().toArray(new Entry[0]);
final AtomicInteger index = new AtomicInteger(elements.length);
for (int i = Runtime.getRuntime().availableProcessors(); i > 0; --i) {
    executor.submit(new Runnable() {
        public void run() {
            int myIndex;
            // each thread atomically claims the next unprocessed element
            while ((myIndex = index.decrementAndGet()) >= 0) {
                process(elements[myIndex]);
            }
        }
    });
}
The trick is to pull the elements into a temporary array so that threads can claim elements in a thread-safe way. Obviously, some caching here (rather than re-creating the Runnables and the array each time) is encouraged, because creating the Runnables might already take longer than the actual task.
It is also possible to copy the elements into a (reusable) LinkedBlockingQueue instead and have the threads poll/take from it. However, this adds more overhead and is only reasonable for tasks that require at least some computation time.
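A minimal sketch of that queue-based variant, assuming the same process() method as above plus java.util.concurrent.BlockingQueue and LinkedBlockingQueue imports:
final BlockingQueue<Entry<String, String>> queue =
        new LinkedBlockingQueue<Entry<String, String>>(myMap.entrySet());
final Runnable worker = new Runnable() {
    public void run() {
        Entry<String, String> entry;
        // poll() returns null once the queue is drained, ending the loop
        while ((entry = queue.poll()) != null) {
            process(entry);
        }
    }
};
// submit one worker per core, exactly as in the array-based version above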
I don't know how Scala actually works internally, but given that it runs on the same JVM, it must do something similar in the background; it just happens to be easily accessible in the standard library.

Java - calling static methods vs manual inlining - performance overhead

I am interested in whether I should manually inline small methods which are called 100k to 1 million times in some performance-sensitive algorithm.
At first I thought that, by not inlining, I was incurring some overhead, since the JVM has to determine whether or not to inline the method (or may even fail to do so).
However, the other day I replaced some manually inlined code with invocations of static methods and saw a performance boost. How is that possible? Does this suggest that there is actually no overhead, and that letting the JVM inline at will actually boosts performance? Or does this depend heavily on the platform/architecture?
(The example in which a performance boost occurred was replacing array swapping (int t = a[i]; a[i] = a[j]; a[j] = t;) with a static method call swap(int[] a, int i, int j). Another example, in which there was no performance difference, was when I inlined a 10-line method which was called 1,000,000 times.)
I have seen something similar. "Manual inlining" isn't necessarily faster; the resulting program can be too complex for the optimizer to analyze.
In your example, let's make some educated guesses. When you use the swap() method, the JVM may be able to analyze the method body and conclude that, since i and j don't change, only 2 range checks are needed for the 4 array accesses instead of 4. Also, the local variable t isn't necessary; the JVM can do the job in 2 registers, without reading or writing t on the stack.
Later, the body of swap() is inlined into the caller method. That happens after the previous optimization, so the savings are still in place. It's even possible that the caller's body proves that i and j are always within range, in which case the 2 remaining range checks are dropped as well.
Now in the manually inlined version, the optimizer has to analyze the whole program at once; there are too many variables and too many actions, and it may fail to prove that it's safe to drop the range checks or eliminate the local variable t. In the worst case this version may cost 6 more memory accesses per swap, which is a huge overhead. Even a single extra memory read would be noticeable.
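For concreteness, the helper under discussion (reconstructed from the signature given in the question) is presumably:
static void swap(int[] a, int i, int j) {
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
}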
Of course, we have no basis to believe that it's always better to do manual "outlining", i.e. extracting small methods in the wishful hope that it will help the optimizer.
--
What I've learned is: forget manual micro-optimizations. It's not that I don't care about micro performance improvements, and it's not that I always trust the JVM's optimization. It's that I have absolutely no idea what would do more good than harm. So I gave up.
The JVM can inline small methods very efficiently. The only benefit of inlining yourself is if you can remove code in the process, i.e. simplify what the method does by exploiting the context it is inlined into.
The JVM looks for certain structures and has some "hand coded" optimisations when it recognises those structures. By using a swap method, the JVM may recognise the structure and optimise it differently with a specific optimisation.
You might be interested in trying a debug build of OpenJDK 7, which has an option to print out the native code it generates (-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which also requires the hsdis disassembler plugin).
Sorry for my late reply, but I just found this topic and it got my attention.
When developing in Java, try to write "simple and stupid" code. Reasons:
the optimization is done at runtime (since compilation itself happens at runtime). The compiler will figure out what optimizations to make anyway, since it compiles not the source code you write but the internal representation it uses (several transformations, AST to VM code to VM code ... to native binary code, are performed at runtime by the JVM compiler and the JVM interpreter)
When optimizing, the compiler relies on common programming patterns to decide what to optimize; so help it help you! Write a private static (maybe also final) method and it will figure out immediately that it can:
inline the method
compile it to native code
If the method is manually inlined, it's just part of another method, which the compiler first tries to understand, to see whether it's time to transform it into binary code or whether it must wait a bit to understand the program flow. Also, depending on what the method does, several re-JITs are possible at runtime, so the JVM produces optimal binary code only after a "warm up"... and maybe your program ends before the JVM warms itself up (I expect that in the end the performance should be fairly similar).
Conclusion: it makes sense to optimize code in C/C++ (since the translation to binary happens statically), but the same optimizations usually make no difference in Java, where the compiler JITs bytecode, not your source code. And by the way, from what I've seen, javac doesn't even bother to make optimizations :)
However, the other day, I replaced this manually inlined code with invocation of static methods and seen a performance boost. How is that possible?
Probably the JVM's profiler spots the bottleneck more easily if it is in one place (a static method) than if it is implemented several times separately.
The HotSpot JIT compiler is capable of inlining a lot of things, especially in -server mode, although I don't know how you got an actual performance boost. (My guess would be that inlining is driven by method invocation counts, and the method swapping the two values isn't called often enough.)
By the way, if performance really matters, you could try this for swapping two int values. (I'm not saying it will be faster, but it may be worth a punt.)
// XOR swap; caveat: if i == j this zeroes the element,
// and on modern CPUs it is rarely faster than using a temporary.
a[i] = a[i] ^ a[j];
a[j] = a[i] ^ a[j];
a[i] = a[i] ^ a[j];

Automatic Java to C++ conversion [duplicate]

This question already has answers here:
Does a Java to C++ converter/tool exist? [closed]
(10 answers)
Closed 7 years ago.
Has anyone tried automatic Java to C++ conversion for speed improvements? Is it a maintenance nightmare in the long run?
I just read that it was used to generate the HTML5 parsing engine in Gecko: http://ejohn.org/blog/html-5-parsing/
In general, automatic conversion from one language to another will not be an improvement. Different languages have different idioms that affect performance.
The simplest example is loops and variable creation. In Java's GC world, creating objects with new is almost free, and they vanish into oblivion just as easily. In C++, memory allocation is (generally speaking) expensive:
// Sample Java code
for ( int i = 0; i < 10000000; ++i )
{
    String str = new String( "hi" ); // new is free, GC is almost free for young objects
}
A direct conversion to C++ results in bad performance (here using shared_ptr as the memory handler instead of GC):
for ( int i = 0; i < 10000000; ++i )
{
    std::shared_ptr< std::string > str( new std::string( "hi" ) );
}
The equivalent loop written in C++ would be:
for ( int i = 0; i < 10000000; ++i )
{
    std::string str( "hi" );
}
Direct translation from one language to another usually ends up with the worst of both worlds, and harder-to-maintain code.
The positive side of such a conversion is that you need a proper object-oriented design in order to switch from Java to C++ (the design has to live in the intersection of both paradigms).
However, some people say that coding in C++ does not bring a speed improvement over Java code.
Even if it worked, I am not so sure you would see much speed improvement. Java's HotSpot JIT compiler has become pretty good.
It's nearly impossible for a tool to replace Java's automatic memory management with manual memory management. So you will most likely end up with a program that has memory leaks, or with C++ code that uses a garbage collector. But a garbage collector in Java has much more it can rely on (for instance, no pointer arithmetic), so a C++ garbage collector that has to be safe pays a performance penalty. Your automatic conversion will therefore most likely decrease performance.
Instead, try to port the code by hand to C++, or optimize the Java code.
There is hardly a chance that this type of conversion will lead to better performance. Usually, when the JVM is working, it converts most of the code to native machine code. What you are suggesting is converting the Java code to C++ and from there to native machine code, i.e. adding an extra phase.
However, there are some trivial cases in which some gain may be achieved, because:
1) It takes some time to load the JVM from scratch.
2) It takes some time to JIT the code. When running a very short program many times, you might prefer to pay this cost once, up front.
3) You may not get the same level of machine code from Java if you are not running in server mode. (In server mode you can expect top-notch machine code, tuned at runtime to the exact CPU it detects, which is usually lacking in most C/C++ builds of programs.)
The languages have such different usage styles that a mindless conversion would be of little use, and an intelligent converter would be next to impossible to write because of those differing styles.
Some Problem areas:
Resource allocation is controlled by try{} finally{} blocks in Java, while C++ uses RAII.
Java checks exceptions at compile time (checked exceptions); C++ checks only at run time.
Exceptions thrown while another exception is propagating are handled differently:
In Java, the last throw is propagated.
In C++, the application is terminated.
Java has a massive standard library.
C++ has much of the same functionality, but you need to go find it on the web (which is a pain).
Java uses references (pointers) for everything.
A straight, unthinking conversion will leave you with a program consisting entirely of shared_ptr objects.
Anyway, with JIT compilation, Java is comparable to C++ in speed.
Speaking of such converters in general, they can't be expected to produce maintainable or high-performance code, since those generally require an actual human who understands the language writing the code. What converters are very useful for is making it easy to port programs: any language with a converter to C, for example, can be implemented swiftly on a wide range of platforms. I used f2c in the mid-90s to run a Fortran routine on a Macintosh.
You may or may not see performance improvements if you rewrite the code in C++ by hand. You'll probably get a slowdown if you use an automatic converter.
