In numerous articles, YouTube videos, etc., I have seen Java's volatile keyword explained as a problem of cache memory, where declaring a variable volatile ensures that reads/writes are forced to main memory, and not cache memory.
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
jls-8.3.1.4 simply states
A field may be declared volatile, in which case the Java Memory Model ensures that all threads see a consistent value for the variable (§17.4).
It says nothing about caching. As long as cache coherency is in effect, volatile variables simply need to be written to a memory address, as opposed to being stored locally in a CPU register, for example. There may also be other optimizations that need to be avoided to guarantee the contract of volatile variables, and thread visibility.
I am simply amazed at the number of people who imply that CPUs do not implement cache coherency, such that I have been forced to go to StackOverflow because I am doubting my sanity. These people go to a great deal of effort, with diagrams, animated diagrams, etc. to imply that cache memory is not coherent.
jls-8.3.1.4 really is all that needs to be said, but if people are going to explain things in more depth, wouldn't it make more sense to talk about CPU Registers (and other optimizations) than blame CPU Cache Memory?
CPUs are very, very fast. That memory is physically a few centimeters away. Let's say 15 centimeters.
The speed of light is 300,000 kilometers per second, give or take. That's 30,000,000,000 centimeters every second. The speed of light in a medium is not as fast as in vacuum, but it's close, so let's ignore that part. That means sending a single signal from the CPU to the memory, even if the CPU and memory both can instantly process it all, is already limiting you to 1,000,000,000 round trips per second, or 1GHz (you need to cover 30 centimeters to get from the core to the memory and back, so you can do that 1,000,000,000 times every second. If you can do it any faster, you're travelling backwards in time. Or some such. You get a Nobel Prize if you figure out how to manage that one).
Processors are about that fast! We measure core speeds in GHz these days, as in, in the time it takes the signal to travel, the CPU's clock has already ticked. In practice, of course, that memory controller is not instantaneous either, nor is the CPU pipelining system.
Thus:
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
Yes, you are wrong. QED.
I don't know why you think that or where you read that. You misremember, or you misunderstood what was written, or whatever was written was very very wrong.
In actual fact, an actual update to 'main memory' takes on the order of a thousand cycles! A CPU is just sitting there, twiddling its thumbs, doing nothing, in a time window where it could roll through a thousand, on some cores, multiple thousands of instructions, memory is that slow. Molasses level slow.
The fix is not registers; you are missing about 20 years of CPU improvement. There aren't 2 layers (registers, then main memory), no. There are more like 5: registers, on-die cache in multiple hierarchical levels, and then, eventually, main memory. To make it all very very fast these things are very very close to the core. So close, in fact, that each core has its own, and, drumroll here - modern CPUs cannot read main memory. At all. They are entirely incapable of it.
Instead what happens is that the CPU sees you write or read to main memory and translates that, as it can't actually do any of that, by figuring out which 'page' of memory that is trying to read/write to (each chunk of e.g. 64k worth of memory is a page; actual page size depends on hardware). The CPU then checks if any of the pages loaded in its on-die cache is that page. If yes, great, and it's all mapped to that. Which does mean that, if 2 cores both have that page loaded, they both have their own copy, and obviously anything that one core does to its copy is entirely invisible to the other core.
If the CPU does -not- find this page in its own on-die cache you get what's called a cache miss, and the CPU will then check which of its loaded pages is least used, and will purge this page. Purging is 'free' if the CPU hasn't modified it, but if that page is 'dirty', it will first send a ping to the memory controller followed by blasting the entire sequence of 64k bytes into it (because sending a burst is way, way faster than waiting for the signal to bounce back and forth or to try to figure out which part of that 64k block is dirty), and the memory controller will take care of it. Then, that same CPU pings the controller to blast the correct page to it and overwrites the space that was just purged out. Now the CPU 'retries' the instruction, and this time it does work, as that page IS now in 'memory', in the sense that the part of the CPU that translates the memory location to cachepage+offset now no longer throws a CacheMiss.
And whilst all of that is going on, THOUSANDS of cycles can pass, because it's all very very slow. Cache misses suck.
This explains many things:
It explains why volatile is slow and synchronized is slower. Dog slow. In general if you want big speed, you want processes that run [A] independently (no need to share memory between cores, except at the very start and very end perhaps, to load in the data needed to operate on and to send out the result of the complex operation), and [B] fit all memory needs to perform the calculation in 64k or so, depending on CPU cache sizes and how many pages of L1 cache it has.
It explains why one thread can observe a field having value A while another thread observes the same field having a different value for DAYS on end, if you're unlucky. If the cores aren't doing all that much and the threads checking the values of those fields do it often enough, that page is never purged, and the 2 cores go on their merry way with their local core value for days. A CPU doesn't sync pages for funsies. It only does this if that page is the 'loser' and gets purged.
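To make that concrete, here is a minimal sketch of the classic stale-flag situation (the class and field names are made up); whether the reader thread actually spins forever depends on the JIT and the hardware, so treat it as an illustration rather than a guaranteed reproduction:

// Without volatile, the reader may never observe the write below: the JIT can
// hoist the read out of the loop, and the core may keep using its cached copy.
public class StaleFlag {
    static boolean running = true; // try adding 'volatile' to fix visibility

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (running) {
                // busy-wait; may spin forever without volatile
            }
            System.out.println("Reader saw the update and stopped.");
        });
        reader.start();

        Thread.sleep(1000);
        running = false; // this write may remain invisible to the reader
        System.out.println("Writer set running = false.");
    }
}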
It explains why Spectre happened.
It explains why LinkedList is slower than ArrayList even in cases where basic fundamental informatics says it should be faster (big-O notation, analysing computational complexity). Because as long as the arraylist's stuff is limited to a single page you can more or less consider it all virtually instant - it takes about the same order of magnitude to fly through an entire page of on-die cache as it takes for that same CPU to wait around for a single cache miss. And LinkedList is horrible on this front: Every .add on it creates a tracker object (the linkedlist has to store the 'next' and 'prev' pointers somewhere!) so for every item in the linked list you have to read 2 objects (the tracker and the actual object), instead of just the one (as the arraylist's array is in contiguous memory, that page is worst-case scenario read into on-die once and remains active for your entire loop), and it's very easy to end up with the tracker object and the actual object being on different pages.
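A crude way to see this yourself (this is not a proper benchmark - no JMH, no warmup control - so take the numbers as an order-of-magnitude hint only):

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TraversalCost {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> array = new ArrayList<>();
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array.add(i);
            linked.add(i);
        }
        System.out.println("ArrayList traversal:  " + time(array) + " ms");
        System.out.println("LinkedList traversal: " + time(linked) + " ms");
    }

    // Walks the list once: contiguous array vs. pointer chasing across pages.
    static long time(List<Integer> list) {
        long start = System.nanoTime();
        long sum = 0;
        for (int v : list) {
            sum += v;
        }
        if (sum < 0) System.out.println("never happens, keeps 'sum' live");
        return (System.nanoTime() - start) / 1_000_000;
    }
}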
It explains the Java Memory Model rules: any line of code may or may not observe the effect of any other line of code on the value of any field, unless you have established a happens-before/happens-after relationship using any of the many rules set out in the JMM. That's to give the JVM the freedom to, you know, not run literally 1000x slower than necessary, because guaranteeing consistent reads/writes can only be done by flushing memory on every read, and that is 1000x slower than not doing that.
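For example, a volatile write/read pair is one of those rules: everything written before the volatile write is visible after a volatile read that observes it. A minimal sketch (names are made up):

public class Publish {
    static int data;               // plain field
    static volatile boolean ready; // volatile flag establishes the HB edge

    static void writer() {
        data = 42;    // (1) ordinary write
        ready = true; // (2) volatile write publishes (1)
    }

    static void reader() {
        if (ready) {                  // (3) volatile read
            System.out.println(data); // guaranteed to see 42, not 0
        }
    }
}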
NB: I have massively oversimplified things. I do not have the skill to fully explain ~20 years of CPU improvements in a mere SO answer. However, it should explain a few things, and it is a marvellous thing to keep in mind as you try to analyse what happens when multiple Java threads try to write/read to the same field and you haven't gone out of your way to make very very sure you have an HB/HA relationship between the relevant lines. If you're scared now, good. You shouldn't be attempting to communicate between 2 threads often, or even via fields, unless you really, really know what you are doing. Toss it through a message bus, use designs where the data flow is bounded to the start and end of the entire thread's process (make a job, initialize the job with the right data, toss it in an ExecutorPool queue, set up that you get notified when it's done, read out the result, don't ever share anything whatsoever with the actual thread that runs it), or talk to each other via the database.
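As a sketch of that "make a job, hand it to a pool, read out the result" style (the squaring task is just a stand-in for real work):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class JobRunner {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> results = new ArrayList<>();

        // Give each job its input up front; share nothing with the worker thread.
        for (long input = 1; input <= 10; input++) {
            final long in = input;
            results.add(pool.submit(() -> in * in)); // Callable<Long>
        }

        // Read the results back out; Future.get() gives you the happens-before edge.
        for (Future<Long> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}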
Related
I have this statement, which came from Goetz's Java Concurrency In Practice:
Runtime overhead of threads due to context switching includes saving and restoring execution context, loss of locality, and CPU time spent scheduling threads instead of running them.
What is meant by "loss of locality"?
When a thread works, it often reads data from memory and from disk. The data is often stored in contiguous or close locations in memory/on the disk (for example, when iterating over an array, or when reading the fields of an object). The hardware bets on that by loading blocks of memory into fast caches so that access to contiguous/close memory locations is faster.
When you have a high number of threads and you switch between them, those caches often need to be flushed and reloaded, which makes the code of a thread take more time than if it was executed all at once, without having to switch to other threads and come back later.
A bit like we humans need some time to get back to a task after being interrupted, find where we were, what we were doing, etc.
Just to elaborate on the point about "cache miss" made by JB Nizet.
As a thread runs on a core, it keeps recently used data in the L1/L2 cache which are local to the core. Modern processors typically read data from L1/L2 cache in about 5-7 ns.
When, after a pause (from being interrupted, put on a wait queue, etc.) a thread runs again, it will most likely run on a different core. This means that the L1/L2 cache of this new core has no data related to the work that the thread was doing. It now needs to go to main memory (which takes about 100 ns) to load data before proceeding with its work.
There are ways to mitigate this issue, such as pinning threads to specific cores using a thread affinity library.
I have an application that produces large results objects and puts them in a queue. Multiple worker threads create the results objects and queue them, and a single writer thread de-queues the objects, converts them to CSV, and writes them to disk. Due to both I/O and the size of the results objects, writing the results takes far longer than generating them. This application is not a server, it is simply a command-line app that runs through a large batch of requests and finishes.
I would like to decrease the overall memory footprint of the application. Using a heap analysis tool (IBM HeapAnalyzer), I am finding that just before the program terminates, most of the large results objects are still on the heap, even though they were de-queued and have no other references to them. That is, they are all root objects. They take up the majority of the heap space.
To me, this means that they made it into tenured heap space while they were still in the queue. As no full GC is ever triggered during the run, that is where they remain. I realize that they should be tenured, otherwise I'd be copying them back and forth within the Eden spaces while they are still in the queue, but at the same time I wish there was something I could do to facilitate getting rid of them after de-queueing, short of calling System.gc().
I realize one way of getting rid of them would be to simply shrink the maximum heap size and trigger a full GC. However the inputs to this program vary considerably in size and I would prefer to have one -Xmx setting for all runs.
Added for Clarification: this is all an issue because there is also a large memory overhead in Eden for actually writing the object out (mostly String instances, which also appear as roots in the heap analysis). There are frequent minor GC's in Eden as a result. These would be less frequent if the result objects were not hanging around in the tenured space. The argument could be made that my real problem is the output overhead in Eden, and I am working on that, but wanted to pursue this tenured issue at the same time.
As I research this, are there any particular garbage collector settings or programmatic approaches I should be focusing on? Note I am using JDK 1.8.
Answer Update: #maaartinus made some great suggestions that helped me avoid queueing (and thus tenuring) the large objects in the first place. He also suggested bounding the queue, which would surely cut down on the tenuring of what I am now queueing instead (the CSV byte[] representations of the results objects). The right mix of thread count and queue bounds will definitely help, though I have not tried this as the problem basically disappeared by finding a way to not tenure the big objects in the first place.
I'm sceptical concerning a GC-related solution, but it looks like you're creating a problem you needn't have:
Multiple worker threads create the results objects and queue them, and a single writer...
... writing the results takes far longer than generating them ...
So it looks like it should actually be the other way round: single producer and many consumers to keep the game even.
Multiple writers mightn't give you much of a speed-up, but I'd try it, if possible. The number of producers doesn't matter much as long as you use a bounded queue for their results (I'm assuming they have no substantially sized input as you haven't mentioned it). This bounded queue could also ensure that the objects never get too old.
In any case, you can use multiple to-CSV converters, effectively replacing a big object with a big String or byte[], or ByteBuffer, or whatever (assuming you want to do the conversion in memory). The nice thing about the buffer is that you can recycle it (so the fact that it gets tenured is no problem anymore).
You could also use some unmanaged memory, but I really don't believe it's necessary. Simply bounding the queue should be enough, unless I'm missing something.
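A minimal sketch of that bounding, assuming the workers do the to-CSV conversion themselves and only queue the resulting bytes (all names here are made up):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedPipeline {
    // A small bound applies back-pressure: producers block in put() instead of
    // piling up results that live long enough to get tenured.
    static final BlockingQueue<byte[]> QUEUE = new ArrayBlockingQueue<>(16);

    static void produce(byte[] csvBytes) throws InterruptedException {
        QUEUE.put(csvBytes); // blocks while the queue is full
    }

    static byte[] consume() throws InterruptedException {
        return QUEUE.take(); // blocks while the queue is empty
    }
}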
And by the way, quite often the cheapest solution is to buy more RAM. Really, one hour of work is worth a couple of gigabytes.
Update
how much should I be worried about contention between multiple writer threads, since they would all be sharing one thread-safe Writer?
I can imagine two kinds of problems:
Atomicity: While synchronization ensures that each executed operation happens atomically, it doesn't mean that the output makes any sense. Imagine multiple writers, each of them generating a single CSV, where the resulting file should contain all the CSVs (in any order). Using a PrintWriter would keep each line intact, but it'd intermix lines from different CSVs (a sketch of one way around this follows below).
Concurrency: For example, a FileWriter performs the conversion from chars to bytes, which may in this context end up in a synchronized block. This could reduce parallelism a bit, but as the IO seems to be the bottleneck, I guess, it doesn't matter.
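One way to keep whole records intact, as a sketch (assuming each worker builds its complete CSV record as a String first; the class name is made up):

import java.io.IOException;
import java.io.Writer;

public class CsvSink {
    private final Writer out;

    CsvSink(Writer out) {
        this.out = out;
    }

    // Build the whole record first, then write it in one synchronized call,
    // so records from different threads never interleave.
    void writeRecord(String csvRecord) throws IOException {
        synchronized (out) {
            out.write(csvRecord);
            out.write(System.lineSeparator());
        }
    }
}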
I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?
It's the resources required to set up an operation. It might seem unrelated to the actual work, but it's necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".
The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, and TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes, and UDP has a smaller header and no handshake.
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead.
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.
You're tired and can't do any more work, so you eat food. The energy spent looking for food, getting it and actually eating it is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to make overhead very very small.
In computer science, let's say you want to print a number; that's your task. But storing the number, setting up the display to print it, calling routines to print it, and then accessing the number from the variable are all overhead.
Wikipedia has us covered:
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal. It is a special case of engineering overhead.
Overhead typically refers to the amount of extra resources (memory, processor, time, etc.) that different programming algorithms take.
For example, the overhead of inserting into a balanced Binary Tree could be much larger than the same insert into a simple Linked List (the insert will take longer and use more processing power to balance the Tree, which results in a longer perceived operation time for the user).
For a programmer, overhead refers to those system resources which are consumed by your code when it's running on a given platform on a given set of input data. Usually the term is used in the context of comparing different implementations or possible implementations.
For example, we might say that a particular approach might incur considerable CPU overhead, while another might incur more memory overhead, and yet another might be weighted toward network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by "end of file" EOF, or some sentinel value, or some GUI button, whatever) we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
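A sketch of that running-total approach in Java (reading doubles from standard input is just an assumption for the example):

import java.util.Scanner;

public class StreamingAverage {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        double total = 0;
        long count = 0;
        // Keep only a running total and a count; memory use stays constant
        // no matter how many numbers arrive.
        while (in.hasNextDouble()) {
            total += in.nextDouble();
            count++;
        }
        System.out.println(count == 0 ? "no input" : "average = " + total / count);
    }
}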
Another possible approach is to "slurp" the input into a list, iterate over the list to calculate the sum, then divide that by the number of valid items in the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
In a particularly bad implementation, we might perform the sum operation using recursion but without tail-elimination. Now, in addition to the memory overhead for our list, we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
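A sketch of that less frugal version, under the same standard-input assumption; the whole input is kept in memory and the non-tail recursion adds a stack frame per element:

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class SlurpedAverage {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        List<Double> values = new ArrayList<>(); // memory overhead: whole input retained
        while (in.hasNextDouble()) {
            values.add(in.nextDouble());
        }
        double total = sum(values, 0); // stack overhead: one frame per element
        System.out.println(values.isEmpty() ? "no input"
                : "average = " + total / values.size());
    }

    // Non-tail recursion: a large input can throw StackOverflowError long
    // before the heap runs out.
    static double sum(List<Double> values, int i) {
        if (i == values.size()) return 0;
        return values.get(i) + sum(values, i + 1);
    }
}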
Yet another (arguably more absurd) approach would be to post all of the inputs to some SQL table in an RDBMS, then simply call the SQL SUM function on that column of that table. This shifts our local memory overhead to some other server, and incurs network overhead and external dependencies on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task --- it might shove all the values immediately out to storage, for example).
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example, compilation of some code for 32- or 64-bit processors might entail greater overhead than one would see for an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues) or CPU overhead (where the CPU is forced to adjust bit ordering or use non-aligned instructions, etc.) or both.
Note that the disk space taken up by your code and its libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." Also the base memory your program consumes (without regard to any data set that it's processing) is called its "footprint" as well.
Overhead is simply extra time consumed during program execution. For example, when we call a function, control is passed to where it is defined, its body is executed, and then control is passed back to the former position. That makes the CPU run through a longer process (jumping to another place in memory, executing there, then jumping back), which costs performance time; hence, overhead. One goal is to reduce this overhead, for example by declaring the function inline: the compiler copies the body of the function to the call site, so we don't pass control to some other location but continue the program in a straight line.
You could use a dictionary. The definition is the same. But to save you time, Overhead is work required to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do its work. This memory allocation takes time, and is not directly related to the work being done, therefore is overhead.
You can check Wikipedia. But mainly it's when more actions or resources are used. For example, if you are familiar with .NET, it has value types and reference types; reference types have memory overhead as they require more memory than value types.
A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether it's a local, in-memory call or a distributed, network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution times between the two calls are dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.
Think about the overhead as the time required to manage the threads and coordinate among them. It is a burden if the thread does not have enough work to do. In such a case, the overhead cost overcomes the time saved by threading, and the code takes more time than the sequential version.
To answer you, I would give you an analogy of cooking Rice, for example.
Ideally when we want to cook, we want everything to be available: we want the pots to be already clean and the rice available in sufficient quantity. If this is true, then we take less time to cook our rice (fewer overheads).
On the other hand, let's say you don't have clean water available immediately and you don't have rice, so you need to go buy the rice from the shops first and also get clean water from the tap outside your house. These extra tasks are not part of the standard process; to cook rice you shouldn't have to spend so much time gathering your ingredients. Ideally, your ingredients should be present at the time you want to cook your rice.
So the cost of the time spent going to buy your rice from the shops and water from the tap are overheads to cooking rice. They are costs that we can avoid or minimize, compared to the standard way of cooking rice (everything is around you, and you don't have to waste time gathering your ingredients).
The time wasted in collecting ingredients is what we call the Overheads.
In Computer Science, for example in multithreading, communication overhead among threads happens when threads have to take turns accessing a certain resource or when they pass information or data to each other. Overhead also arises from context switching. Even though this is crucial, it is still wasted time (CPU cycles) compared to the traditional way of single-threaded programming, where no time is wasted on communication; a single-threaded program does the work straight away.
It's anything other than the data itself, i.e. TCP flags, headers, CRC, FCS, etc.
There is something that bugs me with the Java memory model (if I even understand everything correctly). If there are two threads A and B, there are no guarantees that B will ever see a value written by A, unless both A and B synchronize on the same monitor.
For any system architecture that guarantees cache coherency between threads, there is no problem. But if the architecture does not support cache coherency in hardware, this essentially means that whenever a thread enters a monitor, all memory changes made before must be committed to main memory, and the cache must be invalidated. And it needs to be the entire data cache, not just a few lines, since the monitor has no information about which variables in memory it guards.
But that would surely impact performance of any application that needs to synchronize frequently (especially things like job queues with short running jobs). So can Java work reasonably well on architectures without hardware cache-coherency? If not, why doesn't the memory model make stronger guarantees about visibility? Wouldn't it be more efficient if the language would require information what is guarded by a monitor?
As I see it, the memory model gives us the worst of both worlds: the absolute need to synchronize, even if cache coherency is guaranteed in hardware, and on the other hand bad performance on incoherent architectures (full cache flushes). So shouldn't it be more strict (require information about what is guarded by a monitor) or more loose and restrict potential platforms to cache-coherent architectures?
As it is now, it doesn't make too much sense to me. Can somebody clear up why this specific memory model was chosen?
EDIT: My use of strict and loose was a bad choice in retrospect. I used "strict" for the case where fewer guarantees are made and "loose" for the opposite. To avoid confusion, it's probably better to speak in terms of stronger or weaker guarantees.
the absolute need to synchronize, even if cache coherency is guaranteed in hardware
Yes, but then you only have to reason against the Java Memory Model, not against a particular hardware architecture that your program happens to run on. Plus, it's not only about the hardware; the compiler and JIT themselves might reorder the instructions, causing visibility issues. Synchronization constructs in Java address visibility & atomicity consistently at all possible levels of code transformation (e.g. compiler/JIT/CPU/cache).
and on the other hand bad performance on incoherent architectures (full cache flushes)
Maybe I misunderstood something, but with incoherent architectures, you have to synchronize critical sections anyway. Otherwise, you'll run into all sorts of race conditions due to the reordering. I don't see why the Java Memory Model makes the matter any worse.
shouldn't it be more strict (require information about what is guarded by a monitor)
I don't think it's possible to tell the CPU to flush any particular part of the cache at all. The best the compiler can do is emit memory fences and let the CPU decide which parts of the cache need flushing - it's still more coarse-grained than what you're looking for, I suppose. Even if more fine-grained control were possible, I think it would make concurrent programming even more difficult (it's difficult enough already).
AFAIK, the Java 5 MM (just like the .NET CLR MM) is more "strict" than the memory models of common architectures like x86 and IA64. Therefore, it makes reasoning about it relatively simpler. Yet, it obviously shouldn't offer something closer to sequential consistency because that would hurt performance significantly, as fewer compiler/JIT/CPU/cache optimizations could be applied.
Existing architectures guarantee cache coherency, but they do not guarantee sequential consistency - the two things are different. Since seq. consistency is not guaranteed, some reorderings are allowed by the hardware and you need critical sections to limit them. Critical sections make sure that what one thread writes becomes visible to another (i.e., they prevent data races), and they also prevent the classical race conditions (if two threads increment the same variable, you need that for each thread the read of the current value and the write of the new value are indivisible).
Moreover, the execution model isn't as expensive as you describe. On most existing architectures, which are cache-coherent but not sequentially consistent, when you release a lock you must flush pending writes to memory, and when you acquire one you might need to do something to make sure future reads will not read stale values - mostly that means just preventing reads from being moved too early, since the cache is kept coherent; but reads still must not be moved.
Finally, you seem to think that Java's Memory Model (JMM) is peculiar, while the foundations are nowadays fairly state-of-the-art, and similar to Ada, POSIX locks (depending on the interpretation of the standard), and the C/C++ memory model. You might want to read the JSR-133 cookbook which explains how the JMM is implemented on existing architectures: http://g.oswego.edu/dl/jmm/cookbook.html.
The answer would be that most multiprocessors are cache-coherent, including big NUMA systems, which are (almost?) always ccNUMA.
I think you are somewhat confused as to how cache coherency is accomplished in practice. First, caches may be coherent/incoherent with respect to several other things on the system:
Devices
(Memory modified by) DMA
Data caches vs instruction caches
Caches on other cores/processors (the one this question is about)
...
Something has to be done to maintain coherency. When working with devices and DMA, on architectures with caches that are incoherent with respect to DMA/devices, you would either bypass the cache (and possibly the write buffer), or invalidate/flush the cache around operations involving DMA/devices.
Similarly, when dynamically generating code, you may need to flush the instruction cache.
When it comes to CPU caches, coherency is achieved using some coherency protocol, such as MESI, MOESI, ... These protocols define messages to be sent between caches in response to certain events (e.g: invalidate-requests to other caches when a non-exclusive cacheline is modified, ...).
While this is sufficient to maintain (eventual) coherency, it doesn't guarantee ordering, or that changes are immediately visible to other CPUs. Then, there are also write buffers, which delay writes.
So, each CPU architecture provides ordering guarantees (e.g. accesses before an aligned store cannot be reordered after the store) and/or provide instructions (memory barriers/fences) to request such guarantees. In the end, entering/exiting a monitor doesn't entail flushing the cache, but may entail draining the write buffer, and/or stall waiting for reads to end.
The caches that the JVM has access to are really just CPU registers. Since there aren't many of them, flushing them upon monitor exit isn't a big deal.
EDIT: (in general) the memory caches are not under the control of the JVM; the JVM cannot choose to read/write/flush these caches, so forget about them in this discussion.
Imagine each CPU has 1,000,000 registers. The JVM happily exploits them to do crazy fast computations - until it bumps into monitor enter/exit, and has to flush 1,000,000 registers to the next cache layer.
If we lived in that world, either Java would have to be smart enough to analyze which objects aren't shared (the majority of objects aren't), or it would have to ask programmers to do that.
The Java memory model is a simplified programming model that allows average programmers to write OK multithreaded algorithms. By 'simplified' I mean there might be 12 people in the entire world who have really read chapter 17 of the JLS and actually understood it.
I recently inherited a small Java program that takes information from a large database, does some processing and produces a detailed image regarding the information. The original author wrote the code using a single thread, then later modified it to allow it to use multiple threads.
In the code he defines a constant;
// number of threads
public static final int THREADS = Runtime.getRuntime().availableProcessors();
Which then sets the number of threads that are used to create the image.
I understand his reasoning that the number of threads cannot be greater than the number of available processors, so he set it to that amount to get the full potential out of the processor(s). Is this correct? Or is there a better way to utilize the full potential of the processor(s)?
EDIT: To give some more clarification, the specific algorithm that is being threaded scales to the resolution of the picture being created (1 thread per pixel). That is obviously not the best solution though. The work that this algorithm does is what takes all the time, and it is wholly mathematical operations; there are no locks or other factors that will cause any given thread to sleep. I just want to maximize the program's CPU utilization to decrease the time to completion.
Threads are fine, but as others have noted, you have to be highly aware of your bottlenecks. Your algorithm sounds like it would be susceptible to cache contention between multiple CPUs - this is particularly nasty because it has the potential to hit the performance of all of your threads (normally you think of using multiple threads to continue processing while waiting for slow or high latency IO operations).
Cache contention is a very important aspect of using multiple CPUs to process a highly parallelized algorithm: make sure that you take your memory utilization into account. If you can construct your data objects so each thread has its own memory that it is working on, you can greatly reduce cache contention between the CPUs. For example, it may be easier to have a big array of ints and have different threads working on different parts of that array - but in Java, the bounds checks on that array are going to be trying to access the same address in memory, which can cause a given CPU to have to reload data from L2 or L3 cache.
Split the data into its own data structures, and configure those data structures so they are thread-local (it might even be more optimal to use ThreadLocal - that actually uses constructs in the OS that provide guarantees the CPU can use to optimize its cache).
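As a sketch of the "give each thread its own slice" idea (the summing work is just a placeholder for the real per-pixel computation):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[8_000_000];
        java.util.Arrays.fill(data, 1);

        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> parts = new ArrayList<>();

        // Each worker gets a disjoint slice and a private accumulator, so the
        // hot data each core touches stays in its own cache lines.
        int chunk = data.length / threads;
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk;
            final int to = (t == threads - 1) ? data.length : from + chunk;
            parts.add(pool.submit(() -> {
                long local = 0;
                for (int i = from; i < to; i++) {
                    local += data[i];
                }
                return local;
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) {
            total += f.get(); // combine per-thread results only at the end
        }
        pool.shutdown();
        System.out.println("sum = " + total);
    }
}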
The best piece of advice I can give you is test, test, test. Don't make assumptions about how CPUs will perform - there is a huge amount of magic going on in CPUs these days, often with counterintuitive results. Note also that the JIT runtime optimization will add an additional layer of complexity here (maybe good, maybe not).
On the one hand, you'd like to think Threads == CPU/Cores makes perfect sense. Why have a thread if there's nothing to run it?
The detail boils down to "what are the threads doing". A thread that's idle waiting for a network packet or a disk block is CPU time wasted.
If your threads are CPU heavy, then a 1:1 correlation makes some sense. If you have a single "read the DB" thread that feeds the other threads, and a single "dump the data" thread that pulls data from the CPU threads and creates output, those two could most likely easily share a CPU while the CPU-heavy threads keep churning away.
The real answer, as with all sorts of things, is to measure it. Since the number is configurable (apparently), configure it! Run it with 1:1 threads to CPUs, 2:1, 1.5:1, whatever, and time the results. Fast one wins.
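A small sketch of making it configurable and timing it (the -Dthreads property name and the doWork body are made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConfigurableThreads {
    public static void main(String[] args) throws InterruptedException {
        // Default to one thread per core, but allow e.g. -Dthreads=6 to override.
        int threads = Integer.getInteger("threads",
                Runtime.getRuntime().availableProcessors());

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            pool.submit(ConfigurableThreads::doWork);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        System.out.printf("%d threads took %d ms%n",
                threads, (System.nanoTime() - start) / 1_000_000);
    }

    // Stand-in for the real CPU-bound work (e.g. rendering a band of pixels).
    static void doWork() {
        double x = 0;
        for (int i = 0; i < 50_000_000; i++) {
            x += Math.sqrt(i);
        }
        if (x < 0) System.out.println("never happens, keeps 'x' live");
    }
}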
The number that your application needs; no more, and no less.
Obviously, if you're writing an application which contains some parallelisable algorithm, then you can probably start benchmarking to find a good balance in the number of threads, but bear in mind that hundreds of threads won't speed up any operation.
If your algorithm can't be parallelised, then no number of additional threads is going to help.
Yes, that's a perfectly reasonable approach. One thread per processor/core will maximize processing power and minimize context switching. I'd probably leave that as-is unless I found a problem via benchmarking/profiling.
One thing to note is that the JVM does not guarantee availableProcessors() will be constant, so technically, you should check it immediately before spawning your threads. I doubt that this value is likely to change at runtime on typical computers, though.
P.S. As others have pointed out, if your process is not CPU-bound, this approach is unlikely to be optimal. Since you say these threads are being used to generate images, though, I assume you are CPU bound.
The number of processors is a good start; but if those threads do a lot of I/O, then you might be better off with more... or fewer.
First think about what resources are available and what you want to optimise (least time to finish, least impact on other tasks, etc.), then do the math.
Sometimes it can be better to dedicate a thread or two to each I/O resource and let the others fight for the CPU; the analysis is usually easier with these designs.
The benefit of using threads is to reduce wall-clock execution time of your program by allowing your program to work on a different part of the job while another part is waiting for something to happen (usually I/O). If your program is totally CPU bound adding threads will only slow it down. If it is fully or partially I/O bound, adding threads may help but there's a balance point to be struck between the overhead of adding threads and the additional work that will get accomplished. To make the number of threads equal to the number of processors will yield peak performance if the program is totally, or near-totally CPU-bound.
As with many questions with the word "should" in them, the answer is, "It depends". If you think you can get better performance, adjust the number of threads up or down and benchmark the application's performance. Also take into account any other factors that might influence the decision (if your application is eating 100% of the computer's available horsepower, the performance of other applications will be reduced).
This assumes that the multi-threaded code is written properly etc. If the original developer only had one CPU, he would never have had a chance to experience problems with poorly-written threading code. So you should probably test behaviour as well as performance when adjusting the number of threads.
By the way, you might want to consider allowing the number of threads to be configured at run time instead of compile time to make this whole process easier.
After seeing your edit, it's quite possible that one thread per CPU is as good as it gets. Your application seems quite parallelizable. If you have extra hardware you can use GridGain to grid-enable your app and have it run on multiple machines. That's probably about the only thing, beyond buying faster / more cores, that will speed it up.