Thread safety vs performance in Java and Python - java

Use case: a single data structure (hashtable, array, etc.) whose members are accessed frequently by multiple threads and modified infrequently by those same threads. How do I maintain performance while guaranteeing thread safety (i.e., preventing dirty reads)?
Java: use the concurrent version of the data structure (ConcurrentHashMap, Vector, etc.).
Python: no need if only threads are accessing it, because of the GIL. If it's multiple processes that will be reading and updating the data structure, then use threading.Lock. Force each process's code to acquire the lock before accessing the data structure and release it afterwards.
Does that sound reasonable? Will Java's concurrent data structures impose too much of a penalty on read speed? Is there a higher-level concurrency mechanism in Python?
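For concreteness, here is a minimal sketch of the Java approach described above (the class and field names are invented for illustration): reads hit the ConcurrentHashMap directly with no external locking, and the occasional write is a single call.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SharedConfig {
        // Read-mostly structure shared by many threads; ConcurrentHashMap gives
        // thread-safe reads without any external locking.
        private final Map<String, Integer> settings = new ConcurrentHashMap<>();

        // Called frequently from many threads.
        public Integer read(String key) {
            return settings.get(key);
        }

        // Called infrequently; still just one atomic map operation.
        public void update(String key, int value) {
            settings.put(key, value);
        }
    }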

Instead of reasoning about performance, I highly recommend measuring it for your application. Don't risk threading problems for a performance improvement that you most probably won't ever notice.
So: write thread-safe code without any performance tricks, use a decent profiler to find the percentage of time spent in data structure access, and then decide if that part is worth any improvement.
I bet there will be other bottlenecks, not the shared data structure.
If you like, come back to us with your code and the profiler results.
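If you do get to the measuring stage, a harness such as JMH avoids the usual microbenchmark pitfalls (dead code elimination, missing warm-up). A minimal sketch, assuming JMH is on the classpath; the map sizes and keys are arbitrary:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    public class MapReadBenchmark {
        private final Map<Integer, String> plain = new HashMap<>();
        private final Map<Integer, String> concurrent = new ConcurrentHashMap<>();

        @Setup
        public void fill() {
            // Populate both maps with the same small data set.
            for (int i = 0; i < 5000; i++) {
                plain.put(i, "v" + i);
                concurrent.put(i, "v" + i);
            }
        }

        @Benchmark
        public String readPlain() {
            return plain.get(42);
        }

        @Benchmark
        public String readConcurrent() {
            return concurrent.get(42);
        }
    }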

Related

Is there any disadvantage of using thread-safe collection classes like Hashtable in a single-threaded environment?

I was asked by an interviewer about the disadvantages of using a thread-safe class like Hashtable in a single-threaded environment. Are there any disadvantages? If not, then why were non-thread-safe classes introduced later?
I was asked by an interviewer about the disadvantages of using a thread-safe class like Hashtable in a single-threaded environment.
There are, although most of the disadvantages are around performance. Even single-threaded environments have multiple threads in them (think GC, finalizers, signal handlers, JMX, etc.), so the language still needs to obey synchronization constructs such as synchronized, volatile, and the native lock implementations. These language features flush or invalidate memory caches and affect code reordering, both of which can dramatically affect overall runtime performance.
If not, then why were non-thread-safe classes introduced later?
Non-thread-safe objects always perform better than their thread-safe counterparts in either single or multi-threaded applications. The ability to deal with local CPU cached memory is one of the main speed increases provided by modern hardware. If you don't have to reach out to the main memory bus, you can execute operations orders of magnitude faster. Synchronization constructs decrease the ability of cache memory to be used.
Lastly, thread-safe classes are typically more complicated both in terms of the data structures involved as well as the logic necessary for them to operate correctly in a multi-threaded application. This means that even if we ignore the synchronization constructs, they may use more memory and run slower, although the degree to which this is the case is very dependent on the particular class in question.
They are slower in a single-threaded environment. The modern JIT is very effective at working with synchronized classes in a single-threaded environment, but it is not perfect.
They are much slower in a multi-threaded environment. An immutable collection can be used safely from different threads as-is, but a synchronized collection will work much slower.
[design] Its locking semantics are mostly useless, so additional synchronization is needed anyway. You rarely need just a read or a write; most of the time you read and then write, and you want that pair to be atomic. Or you want to allow multiple simultaneous reads.
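To make that concrete, here is a minimal sketch (the counter map is an invented example): each call on the synchronized map is atomic on its own, but the read-then-write pair is not, whereas ConcurrentHashMap offers compound atomic operations such as merge.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class CountersExample {
        // Thread-safe for single calls, but the get-then-put below is still a race.
        private final Map<String, Integer> syncMap =
                Collections.synchronizedMap(new HashMap<>());

        private final ConcurrentMap<String, Integer> concMap = new ConcurrentHashMap<>();

        public void incrementBroken(String key) {
            Integer old = syncMap.get(key);                  // read...
            syncMap.put(key, old == null ? 1 : old + 1);     // ...then write: not atomic as a pair
        }

        public void incrementSafe(String key) {
            // The whole read-modify-write is atomic for this key.
            concMap.merge(key, 1, Integer::sum);
        }
    }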

Java mechanical sympathy through thread pinning

Given that we have an application that is heavily polluted with concurrency constructs: multiple techniques are used (different people worked without a clear architecture in mind), there are multiple questionable locks that are there "just in case", and thread-safe queues. CPU usage is around 20%.
Now my goal is to optimize it so that it makes better use of caches and generally improves its performance and service time.
I'm considering pinning the parent process to a single core, removing everything that causes memory barriers, replacing all thread-safe data structures, and replacing all locks with some UnsafeReentrantLock which would simply use a normal reference field but take care of the exclusive-execution needs...
I expect that we would end up with a much more cache-friendly application, since we wouldn't have rapid cache flushes all the time (no memory barriers). We would have less overhead since we wouldn't need thread-safe data structures, volatiles, or atomics, and would replace all sorts of locks. I would assume that service time would improve too, since we would no longer synchronize on multiple thread-safe queues...
Is there something that I'm overlooking here?
Maybe blocking operations would have to be paid attention to since they would not show up in that 20% usage?

Simulation thread and data writer thread parallelism

This is a general programming question. Let's say I have a thread doing a specific simulation, where speed is quite important. At every iteration I want to extract data from it and write it to a file.
Is it better practice to hand the data over to a different thread and let the simulation thread focus on its job, or, since speed is very important, to make the simulation thread do the data recording too, without any copying of data? (In my case it is 3-5 deques of integers with a size of 1000-10000.)
Firstly, it surely depends on how much data we are copying, but what else can it depend on? Can the cost of synchronization and copying be worth it? Is it good practice to create small runnables at each iteration to handle the recording task, in the case of 50 or more iterations per second?
If you truly want low latency on this stat capturing, and you want it during the simulation itself then two techniques come to mind. They can be used together very effectively. Please note that these two approaches are fairly far from the standard Java trodden path, so measure first and confirm that you need these techniques before abusing them; they can be difficult to implement correctly.
The fastest way to write the data to file during a simulation, without slowing down the simulation, is to hand the work off to another thread. However, care has to be taken in how the hand-off occurs, as a memory barrier in the simulation thread will slow the simulation. Given that the writer only cares that the values will arrive eventually, I would consider using the memory barrier that sits behind AtomicLong.lazySet: it requests a thread-safe write out to a memory address without blocking until the write actually becomes visible to the other thread. Unfortunately direct access to this memory barrier is currently only available via lazySet or via the class sun.misc.Unsafe, which obviously is not part of the public Java API. However, that should not be too large a hurdle, as it is on all current JVM implementations and Doug Lea is talking about moving parts of it into the mainstream.
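A minimal sketch of that hand-off idea, assuming one long per statistic slot (names are invented): the simulation thread publishes with lazySet, avoiding a full store fence on the hot path, and the writer thread reads the values at its leisure.

    import java.util.concurrent.atomic.AtomicLongArray;

    public class StatsHandoff {
        // One slot per statistic; written by the simulation thread, read by the writer thread.
        private final AtomicLongArray stats = new AtomicLongArray(8);

        // Simulation thread: cheap publication, no full memory fence on the hot path.
        public void publish(int slot, long value) {
            stats.lazySet(slot, value);
        }

        // Writer thread: values become visible "eventually", which is fine for logging.
        public long read(int slot) {
            return stats.get(slot);
        }
    }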
To avoid the slow, blocking file IO that Java uses by default, make use of a memory-mapped file. This lets the OS perform async IO on your behalf, and is very efficient. It also supports use of the same memory barrier mentioned above.
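A minimal sketch of the memory-mapped approach (file layout and record format are arbitrary): writes go into a MappedByteBuffer, and the OS flushes the pages asynchronously.

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedWriter implements AutoCloseable {
        private final FileChannel channel;
        private final MappedByteBuffer buffer;

        public MappedWriter(Path file, long size) throws IOException {
            channel = FileChannel.open(file,
                    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
            // Map the file once; writes below are plain memory stores, flushed by the OS.
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
        }

        public void writeRecord(long iteration, long value) {
            buffer.putLong(iteration);
            buffer.putLong(value);
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }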
For examples of both techniques, I strongly recommend reading the source code to HFT Chronicle by Peter Lawrey. In fact, HFT Chronicle may be just the library for you to use here. It offers a highly efficient and simple to use disk backed queue that can sustain a million or so messages per second.
In my work on a stress-testing HTTP client I stored the stats into an array and, when the array was ready to send to the GUI, I would create a new array for the tester client and hand off the full array to the network layer. This means that you don't need to pay for any copying, just for the allocation of a fresh array (an ultra-fast operation on the JVM, involving hand-coded assembler macros to utilize the best SIMD instructions available for the task).
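A minimal sketch of that swap-instead-of-copy pattern (the batch size and the hand-off queue are assumptions, not from the original code): the producer fills an array, hands off the reference, and immediately starts on a fresh one, so no element-by-element copy ever happens.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BatchProducer {
        private static final int BATCH = 1024;

        private final BlockingQueue<long[]> handoff = new ArrayBlockingQueue<>(16);
        private long[] current = new long[BATCH];
        private int index;

        // Called on the hot path; only ever touches the producer's own array.
        public void record(long stat) throws InterruptedException {
            current[index++] = stat;
            if (index == BATCH) {
                handoff.put(current);          // hand the full array to the consumer thread
                current = new long[BATCH];     // fresh array: allocation, not copying
                index = 0;
            }
        }

        // The consumer thread drains full batches from here.
        public long[] takeBatch() throws InterruptedException {
            return handoff.take();
        }
    }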
I would also suggest not throwing yourself head-on into the realms of optimal memory barrier usage; the difference between a plain volatile write and an AtomicReference.lazySet() will only be measurable if your thread does almost nothing else but exercise the memory barrier (at least millions of writes per second). Depending on your target I/O throughput, you may not even need NIO to meet the goal. Better to try first with simple, easily maintainable code than dig elbows-deep into highly specialized APIs without a confirmed need for that.

Does the Java Memory Model (JSR-133) imply that entering a monitor flushes the CPU data cache(s)?

There is something that bugs me with the Java memory model (if I even understand everything correctly). If there are two threads A and B, there are no guarantees that B will ever see a value written by A, unless both A and B synchronize on the same monitor.
For any system architecture that guarantees cache coherency between threads, there is no problem. But if the architecture does not support cache coherency in hardware, this essentially means that whenever a thread enters a monitor, all memory changes made before must be committed to main memory, and the cache must be invalidated. And it needs to be the entire data cache, not just a few lines, since the monitor has no information about which variables in memory it guards.
But that would surely impact the performance of any application that needs to synchronize frequently (especially things like job queues with short-running jobs). So can Java work reasonably well on architectures without hardware cache coherency? If not, why doesn't the memory model make stronger guarantees about visibility? Wouldn't it be more efficient if the language required information about what is guarded by a monitor?
As I see it, the memory model gives us the worst of both worlds: the absolute need to synchronize, even if cache coherency is guaranteed in hardware, and on the other hand bad performance on incoherent architectures (full cache flushes). So shouldn't it be more strict (require information about what is guarded by a monitor), or more loose, and restrict potential platforms to cache-coherent architectures?
As it is now, it doesn't make too much sense to me. Can somebody clear up why this specific memory model was chosen?
EDIT: My use of strict and loose was a bad choice in retrospect. I used "strict" for the case where fewer guarantees are made and "loose" for the opposite. To avoid confusion, it's probably better to speak in terms of stronger or weaker guarantees.
the absolute need to synchronize, even if cache coherency is guaranteed in hardware
Yes, but then you only have to reason against the Java Memory Model, not against a particular hardware architecture that your program happens to run on. Plus, it's not only about the hardware; the compiler and JIT themselves might reorder the instructions, causing visibility issues. Synchronization constructs in Java address visibility and atomicity consistently at all possible levels of code transformation (e.g. compiler/JIT/CPU/cache).
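A small illustration of the visibility point (names invented): without the volatile modifier, the compiler/JIT is free to hoist the read of the flag out of the loop, so the worker thread might never observe the other thread's write.

    public class StopFlag {
        // Without volatile, the read below may be hoisted and the loop might never stop.
        private volatile boolean running = true;

        public void workLoop() {
            while (running) {
                // ... do a unit of work ...
            }
        }

        // Called from another thread; the volatile write is guaranteed to become visible.
        public void stop() {
            running = false;
        }
    }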
and on the other hand bad performance on incoherent architectures (full cache flushes)
Maybe I misunderstood something, but with incoherent architectures you have to synchronize critical sections anyway. Otherwise, you'll run into all sorts of race conditions due to the reordering. I don't see why the Java Memory Model makes the matter any worse.
shouldn't it be more strict (require information about what is guarded by a monitor)
I don't think it's possible to tell the CPU to flush any particular part of the cache at all. The best the compiler can do is emit memory fences and let the CPU decide which parts of the cache need flushing - that's still more coarse-grained than what you're looking for, I suppose. Even if more fine-grained control were possible, I think it would make concurrent programming even more difficult (it's difficult enough already).
AFAIK, the Java 5 MM (just like the .NET CLR MM) is more "strict" than the memory models of common architectures like x86 and IA64. Therefore, reasoning about it is relatively simpler. Yet, it obviously shouldn't offer something closer to sequential consistency, because that would hurt performance significantly as fewer compiler/JIT/CPU/cache optimizations could be applied.
Existing architectures guarantee cache coherency, but they do not guarantee sequential consistency - the two things are different. Since seq. consistency is not guaranteed, some reorderings are allowed by the hardware and you need critical sections to limit them. Critical sections make sure that what one thread writes becomes visible to another (i.e., they prevent data races), and they also prevent the classical race conditions (if two threads increment the same variable, you need that for each thread the read of the current value and the write of the new value are indivisible).
Moreover, the execution model isn't as expensive as you describe. On most existing architectures, which are cache-coherent but not sequentially consistent, when you release a lock you must flush pending writes to memory, and when you acquire one you might need to do something to make sure future reads will not read stale values - mostly that means just preventing reads from being moved too early, since the cache is kept coherent; but reads must still not be moved.
Finally, you seem to think that Java's Memory Model (JMM) is peculiar, while the foundations are nowadays fairly state-of-the-art, and similar to Ada, POSIX locks (depending on the interpretation of the standard), and the C/C++ memory model. You might want to read the JSR-133 cookbook which explains how the JMM is implemented on existing architectures: http://g.oswego.edu/dl/jmm/cookbook.html.
The answer would be that most multiprocessors are cache-coherent, including big NUMA systems, which are almost always ccNUMA.
I think you are somewhat confused as to how cache coherency is accomplished in practice. First, caches may be coherent/incoherent with respect to several other things on the system:
Devices
(Memory modified by) DMA
Data caches vs instruction caches
Caches on other cores/processors (the one this question is about)
...
Something has to be done to maintain coherency. When working with devices and DMA, on architectures with caches that are incoherent with respect to DMA/devices, you would either bypass the cache (and possibly the write buffer), or invalidate/flush the cache around operations involving DMA/devices.
Similarly, when dynamically generating code, you may need to flush the instruction cache.
When it comes to CPU caches, coherency is achieved using some coherency protocol, such as MESI, MOESI, ... These protocols define messages to be sent between caches in response to certain events (e.g: invalidate-requests to other caches when a non-exclusive cacheline is modified, ...).
While this is sufficient to maintain (eventual) coherency, it doesn't guarantee ordering, or that changes are immediately visible to other CPUs. Then, there are also write buffers, which delay writes.
So, each CPU architecture provides ordering guarantees (e.g. accesses before an aligned store cannot be reordered after the store) and/or provide instructions (memory barriers/fences) to request such guarantees. In the end, entering/exiting a monitor doesn't entail flushing the cache, but may entail draining the write buffer, and/or stall waiting for reads to end.
The caches that the JVM has access to are really just the CPU registers. Since there aren't many of them, flushing them upon monitor exit isn't a big deal.
EDIT: (in general) the memory caches are not under the control of the JVM; the JVM cannot choose to read/write/flush these caches, so forget about them in this discussion.
Imagine each CPU has 1,000,000 registers. The JVM happily exploits them to do crazy-fast computations - until it bumps into a monitor enter/exit and has to flush 1,000,000 registers to the next cache layer.
If we lived in that world, either Java would have to be smart enough to analyze which objects aren't shared (the majority of objects aren't), or it would have to ask programmers to do that.
The Java memory model is a simplified programming model that allows average programmers to write OK multithreading algorithms. By 'simplified' I mean there might be 12 people in the entire world who have really read chapter 17 of the JLS and actually understood it.

Performance of ConcurrentHashMap vs HashMap

How does the performance of ConcurrentHashMap compare to HashMap, especially the .get() operation (I'm especially interested in the case of only a few items, in the range of maybe 0-5000)?
Is there any reason not to use ConcurrentHashMap instead of HashMap?
(I know that null values aren't allowed)
Update
Just to clarify: obviously the performance in the case of actual concurrent access will suffer, but how does the performance compare in the case of no concurrent access?
I was really surprised to find that this topic is so old and yet no one has provided any tests for this case. Using ScalaMeter I have created tests of add, get and remove for both HashMap and ConcurrentHashMap in two scenarios:
using a single thread
using as many threads as I have cores available. Note that because HashMap is not thread-safe, I simply created a separate HashMap for each thread, but used one shared ConcurrentHashMap.
Code is available on my repo.
The results are as follows:
The X axis (size) shows the number of elements written to the map(s)
The Y axis (value) shows time in milliseconds
The summary
If you want to operate on your data as fast as possible, use all the threads available. That seems obvious: each thread has 1/nth of the full work to do.
If you choose single-threaded access, use HashMap; it is simply faster. For the add method it is even as much as 3x more efficient. Only get is faster on ConcurrentHashMap, but not by much.
Operating on a ConcurrentHashMap with many threads is similarly effective to operating on separate HashMaps for each thread. So there is no need to partition your data into different structures.
To sum up, the performance of ConcurrentHashMap is worse when you use it with a single thread, but adding more threads to do the work will definitely speed up the process.
Testing platform
AMD FX6100, 16GB Ram
Xubuntu 16.04, Oracle JDK 8 update 91, Scala 2.11.8
Thread safety is a complex question. If you want to make an object thread safe, do it consciously, and document that choice. People who use your class will thank you if it is thread safe when it simplifies their usage, but they will curse you if an object that once was thread safe becomes not so in a future version. Thread safety, while really nice, is not just for Christmas!
So now to your question:
ConcurrentHashMap (at least in Sun's current implementation) works by dividing the underlying map into a number of separate buckets. Getting an element does not require any locking per se, but it does use atomic/volatile operations, which implies a memory barrier (potentially very costly, and interfering with other possible optimisations).
Even if all the overhead of atomic operations can be eliminated by the JIT compiler in a single-threaded case, there is still the overhead of deciding which of the buckets to look in - admittedly this is a relatively quick calculation, but nevertheless, it is impossible to eliminate.
As for deciding which implementation to use, the choice is probably simple.
If this is a static field, you almost certainly want to use ConcurrentHashMap, unless testing shows this is a real performance killer. Your class has different thread safety expectations from the instances of that class.
If this is a local variable, then chances are a HashMap is sufficient - unless you know that references to the object can leak out to another thread. By coding to the Map interface, you allow yourself to change it easily later if you discover a problem.
If this is an instance field, and the class hasn't been designed to be thread safe, then document it as not thread safe, and use a HashMap.
If you know that this instance field is the only reason the class isn't thread safe, and are willing to live with the restrictions that promising thread safety implies, then use ConcurrentHashMap, unless testing shows significant performance implications. In that case, you might consider allowing a user of the class to choose a thread safe version of the object somehow, perhaps by using a different factory method.
In either case, document the class as being thread safe (or conditionally thread safe) so people who use your class know they can use objects across multiple threads, and people who edit your class know that they must maintain thread safety in future.
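A minimal sketch of the factory-method idea from the last point (class and method names are invented): callers who need the thread-safe variant opt in explicitly, and everyone else avoids the overhead.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class Registry {
        private final Map<String, Object> entries;

        private Registry(Map<String, Object> entries) {
            this.entries = entries;
        }

        // Default: not thread-safe, no synchronization overhead.
        public static Registry newInstance() {
            return new Registry(new HashMap<>());
        }

        // Opt-in thread-safe variant for callers sharing the instance across threads.
        public static Registry newConcurrentInstance() {
            return new Registry(new ConcurrentHashMap<>());
        }

        public Object get(String key) {
            return entries.get(key);
        }

        public void put(String key, Object value) {
            entries.put(key, value);
        }
    }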
I would recommend you measure it, since (for one reason) there may be some dependence on the hashing distribution of the particular objects you're storing.
The standard HashMap provides no concurrency protection, whereas ConcurrentHashMap does. Before it was available, you could wrap the HashMap to get thread-safe access, but this was coarse-grained locking and meant all concurrent access got serialised, which could really impact performance.
ConcurrentHashMap uses lock striping and only locks the items affected by a particular lock. If you're running on a modern VM such as HotSpot, the VM will try to use lock biasing, coarsening and elision if possible, so you'll only pay the penalty for the locks when you actually need them.
In summary, if your map is going to be accessed by concurrent threads and you need to guarantee a consistent view of its state, use ConcurrentHashMap.
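For comparison, a minimal sketch of the two options just mentioned: the wrapped map serialises every call on one monitor, while ConcurrentHashMap lets readers proceed without blocking each other.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MapChoices {
        // Every get/put goes through one monitor: coarse-grained locking.
        static final Map<String, String> WRAPPED = Collections.synchronizedMap(new HashMap<>());

        // Reads are non-blocking; writes contend only on the affected bin.
        static final Map<String, String> CONCURRENT = new ConcurrentHashMap<>();

        static void iterateWrapped() {
            // Iteration over the wrapped map still needs manual locking.
            synchronized (WRAPPED) {
                for (Map.Entry<String, String> e : WRAPPED.entrySet()) {
                    System.out.println(e.getKey() + "=" + e.getValue());
                }
            }
        }

        static void iterateConcurrent() {
            // ConcurrentHashMap iteration is weakly consistent but needs no locking.
            for (Map.Entry<String, String> e : CONCURRENT.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }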
In the case of a 1000-element hash table, using 10 locks for the whole table saves close to half the time when 10000 threads are inserting and 10000 threads are deleting from it.
Always use the concurrent data structure, except when the downside of striping (mentioned below) becomes a frequent operation. In that case you will have to acquire all the locks; I read that the best way to do this is by recursion.
Lock striping is useful when there is a way of breaking a high-contention lock into multiple locks without compromising data integrity. Whether this is possible should take some thought, and is not always the case. The data structure is also a contributing factor to the decision. If we use a large array to implement a hash table, using a single lock for the entire hash table to synchronize it will lead to threads accessing the data structure sequentially. If they are accessing the same location in the hash table then that is necessary, but what if they are accessing the two extremes of the table?
The downside of lock striping is that it is difficult to get the state of the data structure that is affected by striping. In the example, getting the size of the table, or trying to list/enumerate the whole table, may be cumbersome, since we need to acquire all of the striped locks.
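A minimal sketch of lock striping over a fixed set of buckets (the stripe count and structure are invented for illustration): threads touching different stripes don't block each other, but a whole-table operation such as size() has to visit every stripe.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class StripedMap<K, V> {
        private static final int STRIPES = 16;

        private final List<Map<K, V>> buckets = new ArrayList<>();
        private final Object[] locks = new Object[STRIPES];

        public StripedMap() {
            for (int i = 0; i < STRIPES; i++) {
                buckets.add(new HashMap<>());
                locks[i] = new Object();
            }
        }

        private int stripe(Object key) {
            return (key.hashCode() & 0x7fffffff) % STRIPES;
        }

        public V get(K key) {
            int s = stripe(key);
            synchronized (locks[s]) {          // only one stripe is locked
                return buckets.get(s).get(key);
            }
        }

        public void put(K key, V value) {
            int s = stripe(key);
            synchronized (locks[s]) {
                buckets.get(s).put(key, value);
            }
        }

        // The downside mentioned above: size() has to visit every stripe (and would
        // have to hold all the locks at once for an exact snapshot).
        public int size() {
            int total = 0;
            for (int i = 0; i < STRIPES; i++) {
                synchronized (locks[i]) {
                    total += buckets.get(i).size();
                }
            }
            return total;
        }
    }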
What answer are you expecting here?
It is obviously going to depend on the number of reads happening at the same time as writes and how long a normal map must be "locked" on a write operation in your app (and whether you would make use of the putIfAbsent method on ConcurrentMap). Any benchmark is going to be largely meaningless.
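For reference, the putIfAbsent call mentioned above collapses a check-then-act into a single atomic operation; a minimal usage sketch (names invented):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class PutIfAbsentExample {
        static final ConcurrentMap<String, String> CACHE = new ConcurrentHashMap<>();

        static String cache(String key, String value) {
            // Inserts only if the key is missing; returns the previous value, or null if none.
            String previous = CACHE.putIfAbsent(key, value);
            return previous != null ? previous : value;
        }
    }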
It's not clear what you mean. If you need thread safety, you have almost no choice - only ConcurrentHashMap. And it definitely has performance/memory penalties in the get() call - access to volatile variables, and a lock if you're unlucky.
Of course a Map without any locking wins against one with thread-safe behaviour, which needs more work.
The point of the concurrent one is to be thread-safe without using synchronized, so as to be faster than Hashtable.
The same graphs would be very interesting for ConcurrentHashMap vs Hashtable (which is synchronized).
