Lock implementation deadlock - java

I have two threads, one with id 0 and the other with id 1. Based on this I tried to construct a lock, but it seems to deadlock and I don't understand why. Can someone please help me?
I tried to construct a scenario in which the deadlock happens, but it is really difficult.
private int turn = 0;
private boolean[] flag = new boolean[2];

public void lock(int tid) {
    int other;
    other = (tid == 0) ? 1 : 0;
    flag[tid] = true;
    while (flag[other] == true) {
        if (turn == other) {
            flag[tid] = false;
            while (turn == other) {}
            flag[tid] = true;
        }
    }
}

public void unlock(int tid) {
    //turn = 1-t;
    turn = (tid == 0) ? 1 : 0;
    flag[tid] = false;
}

Yup, this code is broken. It's a complicated topic. If you want to know more, dive into the Java Memory Model (JMM) - that's a term you can search the web for.
Basically, every field in a JVM has the following property, unless it is marked volatile:
Any thread is free to make a local clone of this field, or not.
The VM is free to sync these clones up, in arbitrary fashion, at any time. Could be in 5 seconds, could be in 5 years. The VM is free to do whatever it wants, the spec intentionally does not make many guarantees.
To solve this problem, you establish HB/HA (a happens-before/happens-after relationship).
In other words, if 2 threads both look at the same field, and you haven't established HB/HA, your code is broken, and it's broken in a way that you can't reliably test for, because the JVM can do whatever it wants. This includes working as you expect during tests.
To establish HB/HA, you need volatile/synchronized, pretty much. There's a long list, though:
x.start() is HB to the first statement executed in that thread (the HA side).
Exiting a synchronized(x) {} block is HB to the start of a synchronized (x) {} block, assuming both x's are pointing at the same object.
Volatile access establishes HB/HA, but you can't really control in which direction.
Note that many library calls, such as to ReadWriteLock and co, may well be synchronizing/using volatile under the hood, thus in passing establishing some form of HB/HA.
"Natural" HB/HA: Within a single thread, a line executed earlier is HB to a line executed later. One might say 'duh' on this one.
Note that HB/HA doesn't actually mean they actually happen before. It merely means that it is not possible to observe the state as it was before the HB line from the HA line, other than by timing attacks. This usually doesn't matter.
The problem
Your code has two threads that both look at the same fields, contains no volatile or synchronized, does not otherwise establish HB/HA anywhere, and doesn't use library calls that do so. Therefore, this code is broken.
The solution is to introduce that someplace. Not sure why you're rolling your own locks - the folks who wrote ReadWriteLock and friends in the java.util.concurrent package are pretty smart. There is zero chance you can do better than them, so just use that.
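For illustration, here is a minimal sketch of what that looks like with ReentrantLock (one of those java.util.concurrent classes); the tid parameter is kept only to match the original API, the library lock doesn't need it:

import java.util.concurrent.locks.ReentrantLock;

// Sketch: delegating to a library lock instead of rolling your own.
// ReentrantLock establishes the happens-before edges for you.
public class SimpleLock {
    private final ReentrantLock lock = new ReentrantLock();

    public void lock(int tid) {   // tid kept only to match the original signature
        lock.lock();
    }

    public void unlock(int tid) {
        lock.unlock();
    }
}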

Dekker's algorithm is a great example to explain sequential consistency.
Let's first give you a bit of background information. When a CPU does a store, it won't wait for the store to be inserted into the (coherent) cache; once it is inserted into the cache it is globally visible to all other CPUs. The reason the CPU doesn't wait is that it can take a long time before the cache line containing the field you want to write is in the appropriate state: the cache line first needs to be invalidated on all other CPUs before the write can be performed on it, and this can stall the CPU.
To resolve this problem modern CPUs make use of store buffers. So the store is written to the store buffer and at some point in the future, the store will be committed to the cache.
If the store is followed by a load to a different address, it could be that the store and the load are reordered. E.g. the cache line for the later load might already be in the right state while the cache line for the earlier store isn't. So the load is 'globally' performed before the store is 'globally' performed; so we have a reordering.
Now let's have a look at your example:
flag[tid] = true;
while (flag[other] == true) {
So what we have here is a store followed by a load to a different address, and hence the store and the load can be reordered. The consequence is that the lock won't provide mutual exclusion, since 2 threads could enter the critical section at the same time.
[PS]
It is likely that the 2 flags will be on the same cache line, but it isn't a guarantee.
This is typically what you see on x86: an older store can be reordered with a newer load. The memory model of x86 is called TSO. I will not go into the details because there is an optimization called store-to-load forwarding which complicates the TSO model. TSO is a relaxation of sequential consistency.
Dekker's and Peterson's mutex algorithms do not rely on atomic instructions like compare-and-set, but they do rely on a sequentially consistent system, and one of the requirements of sequential consistency is that loads and stores do not get reordered (or at least that nobody should be able to prove they were reordered).
So what the compiler needs to do is add the appropriate CPU memory fence to make sure that the older store and the newer load to a different address don't get reordered. In this case that would be an expensive [StoreLoad] fence. It depends on the ISA, but on Intel x86 the [StoreLoad] will cause the load/store unit to stop executing loads until the store buffer has been drained.
You can't declare the elements of an array to be volatile. But there are a few options:
don't use an array, but explicit fields.
use an AtomicIntegerArray (a sketch follows below).
start messing around with fences using VarHandle. Make sure the fields are accessed through a VarHandle with at least opaque memory order, to prevent the compiler from optimizing out any loads/stores.
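As a hedged sketch of the second option applied to the code from the question (the class name PetersonLock is mine): the flags live in an AtomicIntegerArray, whose get/set have volatile semantics, and turn is volatile, so the loads and stores now get the ordering the algorithm needs.

import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: Peterson's lock with volatile-semantics accesses.
// All reads/writes of the shared state are volatile, which the JMM
// guarantees to be sequentially consistent.
public class PetersonLock {
    private volatile int turn = 0;
    private final AtomicIntegerArray flag = new AtomicIntegerArray(2); // 0 = false, 1 = true

    public void lock(int tid) {
        int other = 1 - tid;
        flag.set(tid, 1);
        while (flag.get(other) == 1) {
            if (turn == other) {
                flag.set(tid, 0);
                while (turn == other) { /* spin */ }
                flag.set(tid, 1);
            }
        }
    }

    public void unlock(int tid) {
        turn = 1 - tid;
        flag.set(tid, 0);
    }
}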
Apart from what I said above, it isn't only the CPU that can cause problems. Because there is no appropriate happens-before edge between writing and reading the flags, the JIT can also optimize this code.
For example:
if (!flag[other])
    return;
while (true) {
    if (turn == other) {
        flag[tid] = false;
        while (turn == other) {}
        flag[tid] = true;
    }
}
And now it is clear that this lock could wait indefinitely (deadlock) and doesn't need to see a change in the flag. More optimizations are possible that could totally mess up your code.
if (turn == other) {
    while (true) {}
}
If you add the appropriate happens-before edges, the compiler will add:
compiler fences
CPU memory fences.
And Dekker's algorithm should work as expected.
I would also suggest setting up a test using JCStress (https://github.com/openjdk/jcstress) and having a look at this test:
https://github.com/openjdk/jcstress/blob/master/jcstress-samples/src/main/java/org/openjdk/jcstress/samples/concurrency/mutex/Mutex_02_DekkerAlgorithm.java

Related

lack of volatile variables in Java Spring applications and its consequences

Those who have developed professional, multi-threaded Java Spring applications can probably testify that the use of the volatile keyword is almost non-existent (as is that of other threading controls, for that matter), despite the potentially disastrous consequences of missing it when needed.
Let me provide an example of very common code:
@Service
public class FeatureFlagHolder {
    private boolean featureFlagActivated = false;

    public void activateFeatureFlag() {
        featureFlagActivated = true;
    }

    // similar code to de-activate

    public boolean isFeatureFlagActivated() {
        return featureFlagActivated;
    }
}
Suppose the threads changing and reading the state of featureFlagActivated are different. The thread reading the boolean could, AFAIK, according to the JMM, cache its value and never refresh it. In practice, I've never seen that happen. Actually, I've never even seen the boolean not being updated immediately on a read.
Why is that?
At the most basic level it has to be said that a lack of volatile doesn't guarantee that it will fail. It just means that the JVM is allowed to do optimizations that could lead to failure. But whether those optimizations happen and whether they then lead to failure is influenced by many different factors. Therefore it's often very hard to actually detect these problems, until they become catastrophic.
For starters, I'd like to summarize conditions that frequently coincide when it does go wrong:
the non-volatile variable is usually read in a tight loop
the non-volatile variable is changed rarely, but when it changes it's "important" in some sense.
the amount of code executed inside that loop is small (roughly small enough to be fully inlined by an aggressive compiler)
the tight loop over-running has a very visible effect (for example it leads to an exception and not just silently doing unnecessary work).
Note that not all of those are necessary, but they tend to be true when I actually observe the issue.
My personal interpretation (plus some reading on the topic) leads me to these rules of thumb:
if reading the wrong value won't be noticed, then you simply won't notice if the volatile is missing. If the only bad thing that happens is that you run through a loop a couple of times unnecessarily, then chances are you will never realize that it happens.
when the reads of the volatile variable happen with enough "distance" between them (where distance is measured by other read access to other parts of memory) then it can often behave as if it was volatile, simply because it drops out of the cache
any kind of synchronization on anything inside the loops tends to have the effect of invalidating some caches at least and thus causes the variable to act as if it was volatile.
These three alone make it rather hard to actually spot the problem except in very extreme cases (i.e. when executing once too many causes a big crash in your system).
In your specific example, I assume that the feature flag is not something that will be toggled multiple times per second. It's more likely that it's set once per process and then stays untouched.
For example, if you have multiple incoming requests in the same second and halfway through that second you toggle the feature flag, it can happen that some of the requests that arrive after the toggle still use the old value, because they have it cached from earlier.
Will you notice? Unlikely. It'll be extremely hard to distinguish "this request came in just before the change" from "this request came in just after the change and wrongly used the old value". If 6 out of 10 requests use the old value instead of the correct 5 out of 10, there's a good chance no one will ever notice.
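For completeness, the usual fix for the FeatureFlagHolder example above is a one-word change; this is a sketch, and an AtomicBoolean would work just as well:

import org.springframework.stereotype.Service;

// Sketch: marking the field volatile establishes the happens-before edge
// the plain field lacks, so the reading thread is guaranteed to see the toggle.
@Service
public class FeatureFlagHolder {
    private volatile boolean featureFlagActivated = false;

    public void activateFeatureFlag() {
        featureFlagActivated = true;
    }

    public boolean isFeatureFlagActivated() {
        return featureFlagActivated;
    }
}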

Memory barriers on entry and exit of Java synchronized block

I came across answers here on SO about Java flushing the working copies of variables within a synchronized block on exit. Similarly, it syncs all the variables from main memory once on entry into the synchronized section.
However, I have some fundamental questions around this:
What if I access mostly non-volatile instance variables inside my synchronized section? Will the JVM automatically cache those variables in CPU registers at the time of entering the block and then carry out all the necessary computations before finally flushing them back?
I have a synchronized block as below:
The underscored variables (e.g. _callStartsInLastSecondTracker) are all instance variables which I access heavily in this critical section.
public CallCompletion startCall()
{
    long currentTime;
    Pending pending;
    synchronized (_lock)
    {
        currentTime = _clock.currentTimeMillis();
        _tracker.getStatsWithCurrentTime(currentTime);
        _callStartCountTotal++;
        _tracker._callStartCount++;
        if (_callStartsInLastSecondTracker != null)
            _callStartsInLastSecondTracker.addCall();
        _concurrency++;
        if (_concurrency > _tracker._concurrentMax)
        {
            _tracker._concurrentMax = _concurrency;
        }
        _lastStartTime = currentTime;
        _sumOfOutstandingStartTimes += currentTime;
        pending = checkForPending();
    }
    if (pending != null)
    {
        pending.deliver();
    }
    return new CallCompletionImpl(currentTime);
}
Does this mean that all these operations (e.g. +=, ++, >, etc.) require the JVM to interact with main memory repeatedly? If so, can I use local variables to cache them (preferably stack allocation for primitives), perform the operations, and in the end assign them back to the instance variables? Will that help optimise the performance of this block?
I have such blocks in other places as well. On running JProfiler, it has been observed that most of the time threads are in the WAITING state and throughput is also very low. Hence the need for optimisation.
Appreciate any help here.
(I don't know Java that well, just the underlying locking and memory-ordering concepts that Java is exposing. Some of this is based on assumptions about how Java works, so corrections welcome.)
I'd assume that the JVM can and will optimize them into registers if you access them repeatedly inside the same synchronized block.
i.e. the opening { and closing } are memory barriers (acquiring and releasing the lock), but within that block the normal rules apply.
The normal rules for non-volatile vars are like in C++: the JIT-compiler can keep private copies / temporaries and do full optimization. The closing } makes any assignments visible before marking the lock as released, so any other thread that runs the same synchronized block will see those changes.
But if you read/write those variables outside a synchronized(_lock) block while this synchronized block is executing, there's no ordering guarantee and only whatever atomicity guarantee Java has. Only volatile would force a JVM to re-read a variable on every access.
most of the time threads are in WAITING state and throughput is also very low. Hence the optimisation necessity.
The things you're worried about wouldn't really explain this. Inefficient code-gen inside the critical section would make it take somewhat longer, and that could lead to extra contention.
But there wouldn't be a big enough effect to make most threads be blocked waiting for locks (or I/O?) most of the time, compared to having most threads actively running most of the time.
@Kayaman's comment is most likely correct: this is a design issue, doing too much work inside one big mutex. I don't see loops inside your critical section, but presumably some of the methods you call contain loops or are otherwise expensive, and no other thread can enter this synchronized(_lock) block while one thread is in it.
The theoretical worst case slowdown for store/reload from memory (like compiling C in anti-optimized debug mode) vs. keeping a variable in a register would be for something like while (--shared_var >= 0) {}, giving maybe a 6x slowdown on current x86 hardware. (1 cycle latency for dec eax vs. that plus 5 cycle store-forwarding latency for a memory-destination dec). But that's only if you're looping on a shared var, or otherwise creating a dependency chain through repeated modification of it.
Note that a store buffer with store-forwarding still keeps it local to the CPU core without even having to commit to L1d cache.
In the much more likely case of code that just reads a var multiple times, anti-optimized code that really loads every time can have all those loads hit in L1d cache very efficiently. On x86 you'd probably barely notice the difference, with modern CPUs having 2/clock load throughput, and efficient handling of ALU instructions with memory source operands, like cmp eax, [rdi] being basically as efficient as cmp eax, edx.
(CPUs have coherent caches so there's no need for flushing or going all the way to DRAM to ensure you "see" data from other cores; a JVM or C compiler only has to make sure the load or store actually happens in asm, not optimized into a register. Registers are thread-private.)
But as I said, there's no reason to expect that your JVM is doing this anti-optimization inside synchronized blocks. But even if it were, it might make a 25% slowdown.
You are accessing members of a single object, so when the CPU reads the _lock member it needs to load the cache line containing _lock first. Quite a few of the other member variables will probably be on the same cache line, which is then already in your cache.
I would be more worried about the synchronized block itself, IF you have determined it is actually a problem; it might not be a problem at all. For example, Java uses quite a few lock optimization techniques, like biased locking and adaptive spinning, to reduce the cost of locks.
But if it is a contended lock, you might want to make the duration of the lock shorter by moving as much as possible out of the lock, and perhaps even get rid of the whole lock and switch to a lock-free approach.
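As a hedged illustration of that idea (not a drop-in replacement for the posted code): if a counter such as _callStartCountTotal does not have to stay consistent with the other fields guarded by _lock - and only you can know whether that invariant holds - it can become an AtomicLong and leave the critical section entirely.

import java.util.concurrent.atomic.AtomicLong;

// Sketch: shrinking the critical section. The lone counter is updated
// without the lock; state that must change together stays inside it.
public class CallStats {
    private final Object lock = new Object();
    private final AtomicLong callStartCountTotal = new AtomicLong();
    private long concurrency; // still guarded by lock, coupled to other state

    public void startCall() {
        callStartCountTotal.incrementAndGet(); // outside the lock
        synchronized (lock) {
            concurrency++;                     // coupled updates stay inside
        }
    }
}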
I would not trust JProfiler for a second.
http://psy-lob-saw.blogspot.com/2016/02/why-most-sampling-java-profilers-are.html
So it might be that JProfiler is pointing you in the wrong direction.

Usage of lazySet on AtomicXXX in Java

From this question: AtomicInteger lazySet vs. set, and from this link: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/atomic/package-summary.html
I could gather the following points:
lazySet could be faster than set
lazySet uses a store-store barrier (writes before it are honored, but not later writes, which have yet to happen)
I could find one use-case where it could be applied, from the documentation :
Use lazySet when you want to null out a pointer to aid GC.
Are there any other practical use-cases for lazySet ?
Caffeine uses lazy or relaxed writes in many of its data structures.
When nulling out a field (e.g. ConcurrentLinkedStack)
When writing to volatile fields before publishing (e.g. SingleConsumerQueue)
When publish is safely delayable (e.g. BoundedBuffer)
When races are benign (e.g. cache expiration timestamps)
When inside a lock (e.g. BoundedLocalCache)
ConcurrentLinkedQueue uses relaxed writes prior to publishing a node and may lazily set a node's next field (prior to publishing, or to indicate a stale traversal).
You may also enjoy reading the Linux Kernel Memory Barriers paper.
TL;DR How to use .lazySet()? With care, if at all.
The main problem here is that AtomicXXX.lazySet() is a low-level performance optimization and it is outside the scope of the current JLS. You can't prove the correctness of your concurrent code with JMM tools if you are using lazySet().
Why is it much faster than a volatile write?
The main difference between set and lazySet is the absence of a StoreLoad barrier.
JSR-133 Cookbook for Compiler Writers:
StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind.
Moreover, on most popular x86-based hardware StoreLoad is the only explicit barrier (others are just no-op's and cost nothing), so with lazySet you eliminate all (explicit) memory barriers.
Guarantees of lazySet
From the point of view of the JLS, there aren't any.
But you can actually reason about lazySet as a delayed write which cannot be reordered with any previous write and will happen eventually. "Eventually" means within finite time, provided your process makes any progress (e.g. any synchronization action occurs; in addition, the size of the processor's store buffer is finite). If the written value has become visible to another thread, you can be sure that all previous writes are visible as well (although you cannot formally prove it). So you can treat it as a delayed happens-before relationship (but, of course, it's not even close to its strict and formal definition).
Usage
Most practical usage (apart from nulling out references) is making writes far cheaper in the context of progress. The simplest example is using lazySet() instead of set() within a synchronized block (but in this case there is no great performance impact). Or you can use it instead of plain writes in single-producer cases, where no write contention occurs.
The Disruptor developers use lazySet exactly for this purpose in their lock-free implementation. Again, it's very hard to argue about the correctness of such code, but it's a good trick to be aware of.
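A hypothetical single-producer sketch of that pattern (the names are mine, not Disruptor's): only one thread ever writes the published sequence, so there is no write contention and the cheaper lazySet is enough; consumers reading get() will see the new value eventually, which is exactly the trade-off described above.

import java.util.concurrent.atomic.AtomicLong;

// Sketch: single-producer publication with a relaxed write.
public class SingleProducerSequence {
    private final AtomicLong published = new AtomicLong(-1);

    // called by the single producer thread only
    public void publish(long sequence) {
        published.lazySet(sequence);
    }

    // called by consumer threads
    public long lastPublished() {
        return published.get();
    }
}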
I would think many uses of AtomicBoolean would benefit from the use of lazySet(), because they are often used as flags to indicate whether something is complete or not, or whether an outer loop should finish.
This is because in such cases the value is initially one value, it eventually becomes another value, and then it stays there. Obviously this argument applies to almost any atomic that is used in that way.
public void test() {
    final AtomicBoolean finished = new AtomicBoolean(false);
    new Thread(new Runnable() {
        @Override
        public void run() {
            while (!finished.get()) {
                // A long process.
                if (wereAllDone()) {
                    finished.lazySet(true);
                }
            }
        }
    }).start();
}

How does the JVM internally handle race conditions?

If multiple threads try to update the same member variable, it is called a race condition. But I was more interested in knowing how the JVM handles it internally if we don't handle it in our code by making it synchronised or something else? Will it hang my program? How will the JVM react to it? I thought the JVM would temporarily create a sync block for this situation, but I'm not sure what exactly would be happening.
If any of you have some insight, it would be good to know.
The precise term is a data race, which is a specialization of the general concept of a race condition. The term data race is an official, precisely specified concept, which means that it arises from a formal analysis of the code.
The only way to get the real picture is to go and study the Memory Model chapter of the Java Language Specification, but this is a simplified view: whenever you have a data race, there is almost no guarantee as to the outcome, and a reading thread may see any value which has ever been written to the variable. Therein also lies the only guarantee: the thread will not observe an "out-of-thin-air" value, i.e. one which was never written. Well, unless you're dealing with longs or doubles; then you may see torn writes.
Maybe I'm missing something but what is there to handle? There is still a thread that will get there first. Depending on which thread that is, that thread will just update/read some variable and proceed to the next instruction. It can't magically construct a sync block, it doesn't really know what you want to do. So in other words what happens will depend on the outcome of the 'race'.
Note I'm not heavily into the lower level stuff so perhaps I don't fully understand the depth of your question.
Java provides synchronized and volatile to deal with these situations. Using them properly can be frustratingly difficult, but keep in mind that Java is only exposing the complexity of modern CPU and memory architectures. The alternatives would be to always err on the side of caution, effectively synchronizing everything which would kill performance; or ignore the problem and offer no thread safety whatsoever. And fortunately, Java provides excellent high-level constructs in the java.util.concurrent package, so you can often avoid dealing with the low-level stuff.
In short, the JVM assumes that code is free of data races when translating it into machine code. That is, if code is not correctly synchronized, the Java Language Specification provides only limited guarantees about the behavior of that code.
Most modern hardware likewise assumes that code is free of data races when executing it. That is, if code is not correctly synchronized, the hardware makes only limited guarantees about the result of its execution.
In particular, the Java Language Specification guarantees the following only in the absence of a data race:
visibility: reading a field yields the value last assigned to it (it is unclear which write was last, and writes of long or double variables need not be atomic)
ordering: if a write is visible, so are any writes preceding it. For instance, if one thread executes:
x = new FancyObject();
another thread can read x only after the constructor of FancyObject has executed completely.
In the presence of a data race, these guarantees are null and void. It is possible for a reading thread to never see a write. It is also possible to see the write of x, without seeing the effect of the constructor that logically preceded the write of x. It is very unlikely that the program is correct if such basic assumptions can not be made.
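A small sketch of that ordering guarantee (FancyObject here is just a stand-in class): with x volatile there is no data race, so a reader that observes the reference also sees the fully constructed object; with a plain field, neither is guaranteed.

class Publisher {
    static class FancyObject {
        int value;
        FancyObject() { this.value = 42; }
    }

    private volatile FancyObject x;           // volatile write = safe publication

    void writer() {
        x = new FancyObject();
    }

    void reader() {
        FancyObject local = x;                // volatile read
        if (local != null) {
            System.out.println(local.value);  // guaranteed to print 42
        }
    }
}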
A data race will however not compromise the integrity of the Java Virtual Machine. In particular, the JVM will not crash or halt, and still guarantee memory safety (i.e. prevent memory corruption) and certain semantics of final fields.
The JVM will handle the situation just fine (ie it will not hang or complain), but you may not get a result that you like!
When multiple threads are involved, Java becomes fiendishly complicated and even code that looks obviously correct can turn out to be horribly broken. As an example:
public class IntCounter {
    private int i;

    public IntCounter(int i) {
        this.i = i;
    }

    public void incrementInt() {
        i++;
    }

    public int getInt() {
        return i;
    }
}
is flawed in many ways.
First, let's say that i is currently 0 and thread A and thread B both call incrementInt() at about the same time. There is a danger that they will both see that i is 0, then both increment it to 1, and then both save the result. So at the end of the two calls, i is only 1, not 2!
That's the race condition problem with the code, but there are other problems concerning memory visibility. When thread A changes a shared variable, there is no guarantee (without synchronization) that thread B will ever see the changes!
So thread A could increment i 100 times, and an hour later, thread B, calling getInt(), might see i as 0, or 100 or anywhere in between!
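One straightforward way to repair the example (a sketch; making both methods synchronized would also work) is to use AtomicInteger, which fixes both the lost-update race and the visibility problem:

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the increment is now atomic and its result is visible to all threads.
public class IntCounter {
    private final AtomicInteger i;

    public IntCounter(int i) {
        this.i = new AtomicInteger(i);
    }

    public void incrementInt() {
        i.incrementAndGet();
    }

    public int getInt() {
        return i.get();
    }
}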
The only sane thing to do if you are delving into Java concurrency is to read Java Concurrency in Practice by Brian Goetz et al. (OK, there are probably other good ways to learn about it, but this is a great book co-written by Joshua Bloch, Doug Lea and others.)

Java synchronization and performance in an aspect

I just realized that I need to synchronize a significant amount of data collection code in an aspect but performance is a real concern. If performance degrades too much my tool will be thrown out. I will be writing ints and longs individually and to various arrays, ArrayLists and Maps. There will be multiple threads of an application that will make function calls that will be picked up by my aspect. What kind of things should I look out for that will negatively affect performance? What code patterns are more efficient?
In particular I have a method that calls many other data recording methods:
void foo() {
    bar();
    woz();
    ...
}
The methods mostly do adding and incrementing of aspect fields:
void bar() {
    f++; // f is a field of the aspect
    for (int i = 0; i < ary.length; i++) {
        // get some values from aspect point cut
        if (some condition) {
            ary[i] += someValue; // ary is a field of the aspect
        }
    }
}
Should I synchronize foo, or bar, woz and others individually, or should I move all the code in bar, woz, etc into foo and just synchronize it? Should I synchronize on this, on a specifically created synchronization object:
private final Object syncObject = new Object();
(see this post), or on individual data elements within the methods:
ArrayList<Integer> a = new ArrayList<Integer>();

void bar() {
    synchronized (a) {
        // synchronized code
    }
}
Concurrency is extremely tricky. It's very easy to get it wrong, and very hard to get right. I wouldn't be too terribly worried about performance at this point. My first and foremost concern would be to get the concurrent code to work safely (no deadlocks or race conditions).
But on the issue of performance: when in doubt, profile. It's hard to say just how different synchronization schemes will affect performance. It's even harder for us to give you suggestions. We'd need to see a lot more of your code and gain a much deeper understanding of what the application does to give you a truly useful answer. In contrast, profiling gives you hard evidence as to whether one approach is slower than another. It can even help you identify where the slowdown is.
There are a lot of great profiling tools for Java these days. The Netbeans and Eclipse profilers are good.
Also, I'd recommend staying away from raw synchronization altogether. Try using some of the classes in the java.util.concurrent package. They make writing concurrent code much easier, and much less error prone.
Also, I recommend you read Java Concurrency in Practice by Brian Goetz, et al. It's very well written and covers a lot of ground.
A rule of thumb is not to synchronize on this - most of the time it is a performance hit, because all methods end up synchronized on the one object.
Consider using locks - they're a very nice abstraction with many fine features, like trying to lock for a time period and then giving up:
if (commandsLock.tryLock(100, TimeUnit.MILLISECONDS)) {
    try {
        // Do something
    } finally {
        commandsLock.unlock();
    }
} else {
    // couldn't acquire lock for 100 ms
}
I second the opinion on using java.util.concurrent. I'd make two levels of synchronization:
synchronize collection access (if it is needed)
synchronize field access
Collection access
If your collections are read-only, i.e. no elements get removed or inserted (but elements may change), I would say that you should use synchronized collections (though this may not be needed...) and don't synchronize iterations:
Read only:
for (int i = 0; i < ary.length; i++) {
    // get some values from aspect point cut
    if (some condition) {
        ary[i] += someValue; // ary is a field of the aspect
    }
}
and ary is an instance obtained from Collections.synchronizedList.
Read-write
synchronized (ary) {
    for (int i = 0; i < ary.length; i++) {
        // get some values from aspect point cut
        if (some condition) {
            ary[i] += someValue; // ary is a field of the aspect
        }
    }
}
Or use a concurrent collection (like CopyOnWriteArrayList), which is inherently thread-safe.
The main difference is that in the first, read-only version any number of threads may iterate over the collection, while in the second only one thread at a time may iterate. In both cases only one thread at a time should increment any given field.
Field access
Synchronize increments on fields separately from synchronizing iterations, like:
Integer foo = ary.get(ii);
synchronized (foo) {
    foo++;
}
Get rid of synchronization
Use concurrent collections (from java.util.concurrent - not from Collections.synchronizedXXX; the latter still needs synchronizing on traversal).
Use the java.util.concurrent.atomic classes, which enable you to atomically increment fields (a sketch follows below).
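Here is a sketch of what that can look like for the earlier bar() example, assuming each counter can be updated independently of the others; the array size and the condition are placeholders, not taken from your code.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: no lock needed for these increments, each update is atomic on its own.
public class AspectCounters {
    private final AtomicInteger f = new AtomicInteger();
    private final AtomicIntegerArray ary = new AtomicIntegerArray(16); // size is illustrative

    void bar(int someValue) {
        f.incrementAndGet();
        for (int i = 0; i < ary.length(); i++) {
            if (someCondition(i)) {
                ary.addAndGet(i, someValue);
            }
        }
    }

    private boolean someCondition(int i) {
        return i % 2 == 0; // placeholder for the real condition from the point cut
    }
}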
Something you should watch:
the Java Memory Model talk - it gives a very nice understanding of how synchronization and data alignment in Java work.
Update: since writing the below, I see you've updated the question slightly. Forgive my ignorance - I have no idea what an "aspect" is - but from the sample code you posted, you could also consider using atomics/concurrent collections (e.g. AtomicInteger, AtomicIntegerArray) or atomic field updaters. This could mean quite a re-factoring of your code, though. (In Java 5 on a dual-proc hyperthreading Xeon, the throughput of AtomicIntegerArray was significantly better than that of a synchronized array; sorry, I haven't got round to repeating the test on more procs/a later JVM version yet - note that the performance of 'synchronized' has improved since then.)
Without more specific information or metrics about your particular program, the best you can do is just follow good program design. It's worth noting that the performance and optimisation of synchronization locks in the JVM has been one of the areas (if not the area) that has received the most research and attention over the last few years. And so in the latest versions of JVMs, it ain't all that bad.
So in general, I'd say synchronize minimally without "going mad". By 'minimally', I mean hold on to the lock for as little time as possible, and make sure that only the parts that need to use that specific lock use that specific lock. But only if the change is easy to make and it's easy to prove that your program is still correct. For example, instead of doing this:
synchronized (a) {
    doSomethingWith(a);
    longMethodNothingToDoWithA();
    doSomethingWith(a);
}
consider doing this if and only if your program will still be correct:
synchronized (a) {
    doSomethingWith(a);
}
longMethodNothingToDoWithA();
synchronized (a) {
    doSomethingWith(a);
}
But remember, the odd simple field update with a lock held unnecessarily probably won't make much tangible difference, and could actually improve performance. Sometimes, holding a lock for a bit longer and doing less lock "housekeeping" can be beneficial. But the JVM can make some of those decisions, so you don't need to be too paranoid - just do generally sensible things and you should be fine.
In general, try and have a separate lock for each set of methods/accesses that together form an "independent process". Other than that, having a separate lock object can be a good way of encapsulating the lock within the class it's used by (i.e. preventing it from being used by outside callers in a way you didn't predict), but there's probably no performance difference per se from using one object to another as the lock (e.g. using the instance itself vs a private Object declared just to be a lock within that class as you suggest), provided the two objects would otherwise be used in exactly the same way.
There should be a performance difference between a built-in language construct and a library, but experience has taught me not to guess when it comes to performance.
If you compile the aspect into the application then you will have basically no performance hit; if you do it at runtime (load-time weaving) then you will see a performance hit.
If you make each aspect per-instance, it may reduce the need for synchronization.
You should have as little synchronization as possible, for as short a time as possible, to reduce any problems.
If possible you may want to share as little state as possible between threads, keeping as much local as possible, to reduce any deadlock problems.
More information would lead to a better answer btw. :)
