I have a C program that will be storing and retrieving a lot of data in a Java store. I am putting a lot of stress on my C program, and multiple threads are adding and retrieving data from the Java store. How will Java handle such load? If there is only one main thread running the JVM and handling all the requests from C, it may become a bottleneck for me. Will Java create multiple threads to handle the load, or is it the programmer's job to create and later terminate the threads?
My Java store is just a Hashtable that stores the data from C, as is, against a provided key.
You definitely want to check the JNI documentation on threading, which covers attaching multiple native threads to the JVM. Also consider which Map implementation you need. Hashtable will work when accessed from multiple threads, but it may introduce a bottleneck because it is synchronized on every call, which effectively means only a single thread can read or write at a time. Consider ConcurrentHashMap instead, which uses lock striping and provides better concurrent throughput.
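For illustration, a minimal sketch of such a store backed by ConcurrentHashMap; the String/byte[] key and value types are just assumptions, since you don't say what exactly you store:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class Store {
        // Unlike Hashtable, ConcurrentHashMap does not lock the whole table
        // on every call, so many threads can read and write concurrently.
        private final Map<String, byte[]> data = new ConcurrentHashMap<>();

        public void put(String key, byte[] value) {
            data.put(key, value);
        }

        public byte[] get(String key) {
            return data.get(key);
        }
    }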
A couple of things to consider if you are concerned about bottlenecks and latency.
On a heavily loaded system, locking can introduce a high overhead. If the size of your map and the frequency of writes allow, consider using an immutable map and a copy-on-write approach, where a single thread handles writes by applying updates to a copy of the map and then replacing the original with the new version (make sure the reference is a volatile variable). This allows reads to proceed without blocking.
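A rough, untested sketch of that copy-on-write idea (key/value types are again assumptions):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class CopyOnWriteStore {
        // volatile guarantees readers always see the most recently published map.
        private volatile Map<String, byte[]> snapshot = Collections.emptyMap();

        // Reads never block: they just dereference the current snapshot.
        public byte[] get(String key) {
            return snapshot.get(key);
        }

        // Writes are funnelled through one thread (synchronized here):
        // copy the current map, mutate the copy, then publish it atomically.
        public synchronized void put(String key, byte[] value) {
            Map<String, byte[]> copy = new HashMap<>(snapshot);
            copy.put(key, value);
            snapshot = Collections.unmodifiableMap(copy);
        }
    }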
Calling from C to Java via JNI will probably become a bottleneck too; it's not as fast as calling in the other direction (Java to C). You can pass direct ByteBuffers through to Java that contain references to the C data structures and allow Java to call back down to C via the direct ByteBuffer.
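The Java side of that could look something like the sketch below. The library name and the native method are hypothetical placeholders; the matching C function would access the buffer's memory via GetDirectBufferAddress without any copying:

    import java.nio.ByteBuffer;

    public class NativeBridge {
        static {
            System.loadLibrary("store"); // hypothetical native library name
        }

        // Hypothetical native method; the C side reads the buffer directly.
        public native void process(ByteBuffer buffer);

        public void send(byte[] payload) {
            // A direct buffer lives outside the Java heap, so the C side
            // can access the same memory that Java sees.
            ByteBuffer buf = ByteBuffer.allocateDirect(payload.length);
            buf.put(payload).flip();
            process(buf);
        }
    }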
Plain Java requires that you write your own threading.
If you are communicating with Java via web services, it's likely that the web container will manage threads for you.
I guess you are using JNI, in which case the situation is potentially more complex. Depending on exactly how you make your JNI calls, you can get multiple threads running in the JVM.
I've got to ask ... JNI is pretty gnarly and error-prone; it's all too easy to bring down the whole process and get all manner of mysterious errors. Are there not C libraries containing a hash table you could use? Or even write one; it's got to be less work than doing JNI.
I think this depends on the Java code's implementation. If it proves not to be threaded, here's a potentially cleaner alternative to messy JNI:
Create a Java daemon process that communicates with your store, which INTERNALLY is threaded on requests, to guarantee efficient load handling. Use a single ExecutorService created by java.util.concurrent.Executors to service a work queue of store/retrieve operations. Each store/retrieve method call submits a Callable to the work queue and waits for it to be run. The ExecutorService will automagically queue and multithread the store/retrieval operations. This whole thing should be less than 100 lines of code, aside from communications with the C program.
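A rough sketch of the core of that daemon (pool size and key/value types are arbitrary assumptions):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class StoreDaemon {
        // Concurrent map, since the pool runs operations on several threads.
        private final Map<String, byte[]> map = new ConcurrentHashMap<>();
        // Pool size is an arbitrary assumption; tune it for your load.
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        // Each call submits a Callable to the work queue and waits for it to run.
        public byte[] retrieve(String key) throws Exception {
            Future<byte[]> result = pool.submit(() -> map.get(key));
            return result.get();
        }

        public void store(String key, byte[] value) throws Exception {
            pool.submit(() -> map.put(key, value)).get();
        }
    }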
You can communicate with this Java daemon from C using inter-process communication techniques (probably a socket), which would avoid JNI and let one Java daemon thread service numerous instances of the C program.
Alternatively, you could use JNI to call the basic store/retrieve operations on your daemon. Same as now, except the Java daemon can decorate the methods to provide caching, synchronization, and all sorts of fancy goodies associated with threading.
I have been wondering why JDBC offers only blocking operations and why I can't register a listener for some hypothetical event handler onResultSetArrived(ResultSet rs). Why do I have to block one thread for each JDBC query?
After a while I dived into Java sockets (I suppose JDBC is built on top of them) and realised that there isn't any event handling there either. The only way to get a non-blocking read is the available() method, but that is very inefficient, since it has to be polled in a loop.
As far as I'm aware, interrupts are a fundamental mechanism in computing. They propagate from the hardware up to the operating system. In Java they could back an event-driven approach to reading values from a Socket.
Now, my question is: am I missing something? Does some workaround exist, or does the current Java architecture really require one thread per blocking operation? And if so, isn't that inefficient?
In Java, you can have many threads. A thread does its work until it blocks somewhere (typically on a mutex or an I/O operation). Of course, this does not block other threads.
The fundamental scenario of multithreaded applications is that you use multiple threads when waiting for a blocked thread would introduce too much waiting. The definition of "too much" here depends entirely on you, but in general this is how you achieve better performance through better utilization of resources.
There are some limitations in how threads in Java work, however. Most, if not all, of them arise when the thread is blocked somewhere "outside" of Java, such as in an OS call or an external (native) library. Theoretically, if native code blocks a thread, Java cannot do anything about it. Normally this should not be a problem unless the native code has a bug.
So in the case of a blocking JDBC response, you would create a new thread which does other work while the first thread is waiting for the database to complete. Alternatively, you could dedicate a thread just to doing JDBC. You could make it work exactly as you want (with listeners etc.), except for limitations imposed by the OS. So it's possible, but it's probably not provided out-of-the-box by JDBC drivers. There is a lot of infrastructure already in core Java which you might find useful (thread pools, workers, synchronized collections). But as with any multithreading, you need to be very careful when accessing data from different threads simultaneously.
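For example, here is a sketch of pushing a blocking JDBC call onto a dedicated pool with a callback, which approximates your hypothetical onResultSetArrived. The query, table name, and pool size are placeholders:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AsyncJdbc {
        // A small pool dedicated to blocking JDBC work; size is an arbitrary choice.
        private static final ExecutorService jdbcPool = Executors.newFixedThreadPool(4);

        // The blocking query runs on the pool; the caller registers a callback
        // instead of blocking its own thread.
        static CompletableFuture<Integer> countRows(Connection conn, String table) {
            return CompletableFuture.supplyAsync(() -> {
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + table)) {
                    rs.next();
                    return rs.getInt(1);
                } catch (SQLException e) {
                    throw new RuntimeException(e);
                }
            }, jdbcPool);
        }
    }

Usage would then be countRows(conn, "mytable").thenAccept(n -> System.out.println(n)); the calling thread is never blocked.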
There is also support for non-blocking I/O in Java: the NIO package has been around since Java 1.4, and Java 7 added asynchronous channels (NIO.2). The latter is almost exactly what you are describing: I/O is offloaded to the OS, so your operations return immediately and you get a callback when the operation is finished. However, not all libraries support it. For my work, I have never had a reason to use it, because I could always implement the same thing with my own threads at least as well.
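A minimal NIO.2 example of that callback style, assuming some server is listening on localhost:9000 (host and port are placeholders):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousSocketChannel;
    import java.nio.channels.CompletionHandler;

    public class NioReadExample {
        public static void main(String[] args) throws Exception {
            AsynchronousSocketChannel ch = AsynchronousSocketChannel.open();
            ch.connect(new InetSocketAddress("localhost", 9000)).get();

            ByteBuffer buf = ByteBuffer.allocate(1024);
            // read() returns immediately; the handler fires when data arrives.
            ch.read(buf, null, new CompletionHandler<Integer, Void>() {
                @Override
                public void completed(Integer bytesRead, Void attachment) {
                    System.out.println("Received " + bytesRead + " bytes");
                }

                @Override
                public void failed(Throwable exc, Void attachment) {
                    exc.printStackTrace();
                }
            });
            Thread.sleep(5000); // keep the demo alive long enough for the callback
        }
    }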
If the question is whether the "current architecture in Java really is one thread per one blocking operation", and by "blocking operation" you mean "database operation", then the answer is no. Most database drivers currently available for Java are JDBC-based and do work that way. But there are usable alternatives (https://spring.io/blog/2016/11/28/going-reactive-with-spring-data) and more on the way (https://blogs.oracle.com/java/jdbc-next:-a-new-asynchronous-api-for-connecting-to-a-database, https://dzone.com/articles/spring-5-webflux-and-jdbc-to-block-or-not-to-block). For how this works, see "How is ReactiveMongo implemented so that it is considered non-blocking?".
For JDBC there are also ways to wrap the blocking calls ("Wrapping blocking I/O in project reactor", "Spring webflux and reading from database") and projects pursuing this approach (https://dzone.com/articles/myth-asynchronous-jdbc).
I have a small Java application running a set of computationally heavy tasks. For processing the tasks, I use an external library which does most of the computation via native methods and some C code. Unfortunately, after solving one task, the library suffers from heavy memory leaks and can therefore only solve one task per application execution.
The memory problem is known to the library's developers, but it is not fixed yet and maybe never will be (it supposedly has something to do with the Java garbage collector not working properly with the native interface). Since there is no alternative to this particular library, I am looking for ways to solve the tasks via sequential application executions.
Currently, I have a bash wrapper script which gets a list of tasks to execute and, for each task, calls the application with just that single task.
Since tasks often need the results of previous tasks, this involves serializing and deserializing execution results to files. This does not seem like good practice to me, also because the user has basically no way to interact with the program's control flow.
Does anybody have an idea how I can do this sequential task execution inside one single Java application? I guess this would involve starting a new JVM for each task execution, hopefully transferring only the task result, and not the memory leaks, from the new JVM to my application.
Edit providing further information:
Changing the root of the problem: unfortunately, the library is not open source, and I have access neither to the native methods nor to the Java interface API.
New processes / JVMs: is that the same thing in this context? I don't have much experience with the Java process API or with starting new JVMs. My assumption is that this would involve starting a separate Java program with its own main function using ProcessBuilder.start()?
Exchange of data: it is only a couple of kilobytes, so performance is not an issue. Still, a solution without files would be preferable, but if I understand correctly, memory-mapped files also use local files. Sockets, on the other hand, do sound promising.
Funnily enough, I've faced the same issue. By definition, you need to accept that nothing will be best practice or nice when you are stuck with a faulty library that you must use but cannot fix.
The solution we came up with was to isolate calls to the library in its own process. This process was a child of a master process. The master process contains the good code and the child the bad. We were then able to keep track of the number of invocations of the child process and tear it down once it reached a certain number. We knew that we could get away with X invocations before the child process was corrupted.
Because of the nature of our problem, bringing up a fresh process enabled us to have another X invocations before repeating.
Any state was returned to the master process on a successful invocation. Any state gathered during an unsuccessful invocation was discarded and we started again.
Again, none of the above is "nice" but it worked for us.
For what it's worth, if I did this again, I'd use Akka and remote actors, which would make all the sub-process handling, remoting, etc. far simpler.
That depends. Do you have the source code of this external library, i.e. can you recompile it? The easiest approach is obviously to fix the leak at its root. This might, however, be impractical. If the library, as you say, is implemented via native methods and some C code, I do not think the problem has anything to do with the Java garbage collector not working properly. Native methods and C code do not normally store their data on the JVM's heap and are therefore not garbage collected, i.e. it is the job of the library to clean up after itself.
If the leak is indeed in the bit of Java code that the library exposes, then there is a way. Memory leaks in Java occur when references are unintentionally retained, e.g. consider the following example:
class Foo {
    private ExpensiveObject eo;

    Foo(ExpensiveObject eo) {
        this.eo = eo;
    }
}
The ExpensiveObject stays alive (at least) as long as the Foo instance that references it. If you (or your library) do not manage instance life-cycles carefully enough, you get into trouble. If you do not have a chance to refactor, you can however use reflection to clean up the biggest mess from another place in your code:
// requires: import java.lang.reflect.Field;
void release(Foo foo) throws ReflectiveOperationException {
    // Null out the private field so the ExpensiveObject becomes eligible for GC.
    Field f = Foo.class.getDeclaredField("eo");
    f.setAccessible(true);
    f.set(foo, null);
}
This should, however, be considered a last resort, as it is quite a hack.
Alternatively, a better approach is normally to fork another JVM instance to do the dirty work. It seems like you are doing something similar already. By forking a JVM, you isolate memory use at the process level. Once the process dies, all its memory is released by the OS. The usual problem with this approach is platform compatibility, but as you already use a native library, this does not worsen your situation.
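A sketch of such a fork with ProcessBuilder; com.example.TaskMain is a placeholder for whatever main class executes a single task, and reading the result from the child's stdout avoids temporary files:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class TaskRunner {
        // Launches a fresh JVM per task; all leaked memory dies with the process.
        static String runTaskInFreshJvm(String taskArg) throws Exception {
            String javaBin = System.getProperty("java.home") + "/bin/java";
            Process p = new ProcessBuilder(
                    javaBin, "-cp", System.getProperty("java.class.path"),
                    "com.example.TaskMain", taskArg)
                .redirectErrorStream(true)
                .start();

            // Collect the child's result from its standard output.
            StringBuilder out = new StringBuilder();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            p.waitFor();
            return out.toString();
        }
    }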
You say that you currently use files to communicate between these different processes. Why do you need to store the data in a file? Rather consider using sockets or memory-mapped files (NIO) if performance is important here.
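You are right that a memory-mapped file is still backed by a local file, but both processes access the same memory pages directly, without a read/write system call per access. A minimal sketch of mapping the shared region:

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class SharedBuffer {
        // Map the same file in both processes; writes by one side become
        // visible to the other through the shared mapping.
        static MappedByteBuffer map(Path file, int size) throws Exception {
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.READ,
                    StandardOpenOption.WRITE)) {
                return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            }
        }
    }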
I am creating a distributed service, and I am looking at restricting a set of time-consuming operations to a single thread of execution across all JVMs at any given time. (I will have to deal with 3 JVMs max.)
My initial investigations point me towards java.util.concurrent.Executors and java.util.concurrent.Semaphore. Using a singleton with Executors or Semaphore does not guarantee a single thread of execution across multiple JVMs.
I am looking for a core Java API (or at least a pattern) that I can use to accomplish this.
P.S.: I have access to ActiveMQ within my existing project, which I was planning to use to achieve a single thread of execution across multiple JVMs, but only if I don't have another choice.
There is no simple solution for this with a core Java API. If the 3 JVMs have access to a shared file system, you could use it to track state across the JVMs.
So basically you do something like creating a lock file when you start the expensive operation and deleting it at the conclusion, and then have each JVM check for the existence of this lock file before starting the operation. However, there are issues with this approach, such as the JVM dying in the middle of the expensive operation and leaving the stale lock file behind.
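One way around the stale-file problem is to hold an OS-level file lock rather than testing for the file's existence: the OS releases the lock automatically when the holding process dies. A sketch, with the lock-file path as a placeholder:

    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class CrossJvmLock {
        // tryLock() acquires an exclusive OS-level lock, or returns null
        // if another JVM already holds it.
        static boolean runIfFree(Runnable expensiveOperation) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("/tmp/op.lock", "rw");
                 FileChannel ch = raf.getChannel()) {
                FileLock lock = ch.tryLock();
                if (lock == null) {
                    return false; // another JVM is running the operation
                }
                try {
                    expensiveOperation.run();
                    return true;
                } finally {
                    lock.release();
                }
            }
        }
    }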
ZooKeeper is a nice solution for problems like this and for any other cross-process synchronization issue. Check it out if that is a possibility for you. I think it's a much more natural way to solve a problem like this than a JMS queue.
What is the difference between a Thread and a Process in the Java context?
How is inter-Process communication and inter-Thread communication achieved in Java?
Please point me at some real life examples.
The fundamental difference is that threads live in the same address space, while processes live in different address spaces. This means that inter-thread communication is about passing references to objects and changing shared objects, whereas inter-process communication is about passing serialized copies of objects.
In practice, Java inter-thread communication can be implemented as plain Java method calls on a shared object with appropriate synchronization thrown in. Alternatively, you can use the java.util.concurrent classes to hide some of the nitty-gritty (and error-prone) synchronization issues.
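For example, a minimal shared object where one thread hands a value to another using synchronized/wait/notify:

    public class SharedBox {
        private String value;

        // Producer thread publishes a value and wakes any waiting consumer.
        public synchronized void put(String v) {
            value = v;
            notifyAll();
        }

        // Consumer thread blocks until a value has been published.
        public synchronized String take() throws InterruptedException {
            while (value == null) {
                wait(); // releases the lock while waiting
            }
            return value;
        }
    }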
By contrast, Java inter-process communication is based, at the lowest level, on turning state, requests, etc. into sequences of bytes that can be sent as messages or as a stream to another Java process. You can do this work yourself, or you can use a variety of "middleware" technologies of various levels of complexity to abstract away the implementation details. Technologies that may be used include Java object serialization, XML, JSON, RMI, CORBA, SOAP / "web services", message queuing, and so on.
At a practical level, inter-thread communication is many orders of magnitude faster than inter-process communication, and it allows you to do many things a lot more simply. But the downside is that everything has to live in the same JVM, so there are potential scalability, security, and robustness issues.
A thread can access memory inside a process, even memory that could be manipulated by another thread within the same process. Since all threads are internal to the same running process, they can communicate more quickly (because they don't need the operating system to referee).
A process cannot access memory inside another process, although you can communicate between processes through various means like:
Network packets
Files
Pipes
Shared memory
Semaphores
CORBA messages
RPC calls
The important thing to remember with process to process communication is that the communication must be managed through the operating system, and like all things which require a middle man, that adds overhead.
On the downside, if a thread misbehaves, it does so within the running process, and odds are high that it will take down all the well-behaved threads. If a process misbehaves, it can't directly write into the memory of the other processes, and odds are that only the misbehaving process will die.
Inter-Thread Communication = threads inside the same JVM talking to each other
Inter-Process Communication (IPC) = threads inside the same machine but running in different JVMs talking to each other
Threads inside the same JVM can use pipelining through lock-free queues to talk to each other with nanosecond latency.
Threads in different JVMs can use off-heap shared memory (usually acquired through the same memory-mapped file) to talk to each other with nanosecond latency.
Threads in different machines can use the network to talk to each other with microsecond latency.
For a complete explanation about lock-free queues and IPC you can check CoralQueue.
Disclaimer: I am one of the developers of CoralQueue.
I like to think of a single instance of a JVM as a process. So inter-process communication would be between instances of JVMs, for example through sockets (message passing).
Threads in Java implement Runnable and are contained within a JVM. They share data simply by passing references around inside the JVM. Whenever threads share data, you almost always need to protect that data so that multiple threads don't clobber each other. There are many protection mechanisms, all of which involve preventing multiple threads from entering critical sections of code at the same time.
Using Java IO, it seems that forking a new process gives process B a better ability to read data written to a file by process A than thread B gets when thread A writes to a file that thread B is trying to read (within the same process).
It seems like the rules are not comparable to the memory model. So how does file-based concurrency work? References would be appreciated.
Observations like this are bound to be operating-system specific, and may be specific to particular versions of the operating system (kernel). What you are hitting here is probably related to the way the OS implements threads and thread scheduling. The Java platform provides little in the way of tuning for this kind of thing.
IMO, if you need better performance, you probably should not be using a file as a data-transfer channel between two threads in the same JVM. Code your application to detect that the threads are colocated in the same JVM and use (say) Java pipe streams instead.
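For example, piped streams connect two threads directly in memory, with no file and no file I/O in between. A small self-contained demo:

    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;
    import java.nio.charset.StandardCharsets;

    public class PipeDemo {
        public static void main(String[] args) throws Exception {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out);

            Thread writer = new Thread(() -> {
                try {
                    out.write("hello from thread A".getBytes(StandardCharsets.UTF_8));
                    out.close(); // signals end-of-stream to the reader
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();

            // Thread B reads directly from the in-memory pipe.
            byte[] buf = in.readAllBytes();
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
    }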
Maybe it has to do with thread and process blocking.
When a process wants a resource (writing/reading a file), it blocks until the OS fulfills the request and returns something to the process.
If you are not using hyper-threading, a process with two threads will block both threads while each task is fulfilled. But if you separate them into two processes, maybe the OS can optimize access and parallelize the reads/writes better.
(just guessing :)