I have the following code for Spark:
package my.spark;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
public class ExecutionTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("ExecutionTest")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
int slices = 2;
int n = slices;
List<String> list = new ArrayList<>(n);
for (int i = 0; i < n; i++) {
list.add("" + i);
}
JavaRDD<String> dataSet = jsc.parallelize(list, slices);
dataSet.foreach(str -> {
System.out.println("value: " + str);
Thread.sleep(10000);
});
System.out.println("done");
spark.stop();
}
}
I have started a master node and two workers (everything on localhost; Windows) using the commands:
bin\spark-class org.apache.spark.deploy.master.Master
and (twice):
bin\spark-class org.apache.spark.deploy.worker.Worker spark://<local-ip>:7077
Everything started correctly.
After submitting my job using the command:
bin\spark-submit --class my.spark.ExecutionTest --master spark://<local-ip>:7077 file:///<pathToFatJar>/FatJar.jar
The command started, but the value: 0 and value: 1 outputs are written by only one of the workers (as shown under Logs > stdout on the page associated with that worker). The second worker has nothing in its Logs > stdout. As far as I understand, this means that both iterations are executed by the same worker.
How can I run these tasks on two different running workers?
It is possible, although I'm not sure whether it will work correctly every time and everywhere. However, while testing, it worked as expected every time.
I have tested my code on a host machine running Windows 10 x64 and on 4 virtual machines (VMs): VirtualBox with Debian 9 (stretch), kernel 4.9.0 x64, host-only network, Java 1.8.0_144, Apache Spark 2.2.0 for Hadoop 2.7 (spark-2.2.0-bin-hadoop2.7.tar.gz).
I used a master and 3 slaves on VMs, plus one more slave on Windows:
debian-master - 1 CPU, 1 GB RAM
debian-slave1 - 1 CPU, 1 GB RAM
debian-slave2 - 1 CPU, 1 GB RAM
debian-slave3 - 2 CPU, 1 GB RAM
windows-slave - 4 CPU, 8 GB RAM
I submitted my jobs from the Windows machine to the master located on a VM.
The beginning is the same as before:
SparkSession spark = SparkSession
.builder()
.config("spark.cores.max", coresCount) // not necessary
.appName("ExecutionTest")
.getOrCreate();
[important] coresCount is essential for partitioning - the data has to be partitioned using the number of cores in use, not the number of workers/executors.
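Here, coresCount is simply the total number of cores the application will use (the same value passed to spark.cores.max above); for example:
int coresCount = 6; // illustrative value only - set it to the number of cores your workers provide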
Next, I have to create the JavaSparkContext and the RDD. Reusing the RDD makes it likely that repeated executions run on the same set of workers.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Integer> rddList
= jsc.parallelize(
IntStream.range(0, coresCount * 2)
.boxed().collect(Collectors.toList()))
.repartition(coresCount);
I have created rddList with coresCount * 2 elements. A number of elements equal to coresCount did not allow the job to run on all associated workers (in my case). Maybe coresCount + 1 would be enough, but I have not tested it, and coresCount * 2 is not much anyway.
The next thing to do is to run the commands:
List<String> hostsList
= rddList.map(value -> {
Thread.sleep(3_000);
return InetAddress.getLocalHost().getHostAddress();
})
.distinct()
.collect();
System.out.println("-----> hostsList = " + hostsList);
Thread.sleep(3_000) is necessary for proper distribution of the tasks. 3 seconds are enough for me. The value could probably be smaller, and sometimes a higher value may be necessary (I guess it depends on how fast the workers receive tasks from the master).
The above code runs once per core associated with a worker, so more than once per worker. To run exactly one command per worker, I have used the following code:
/* as static field of class */
private static final AtomicBoolean ONE_ON_WORKER = new AtomicBoolean(false);
...
long nodeCount
= rddList.map(value -> {
Thread.sleep(3_000);
if (ONE_ON_WORKER.getAndSet(true) == false) {
System.out.println("Executed on "
+ InetAddress.getLocalHost().getHostName());
return 1;
} else {
return 0;
}
})
.filter(val -> val != 0)
.count();
System.out.println("-----> finished using #nodes = " + nodeCount);
And of course, at the end, the stop:
spark.stop();
I have a Flink streaming program that has branching logic after a long transformation. Will the long transformation be executed multiple times? Pseudo code:
env = getEnvironment();
DataStream<Event> inputStream = getInputStream();
tempStream = inputStream.map(very_heavy_computation_func)
output1 = tempStream.map(func1);
output1.addSink(sink1);
output2 = tempStream.map(func2);
output2.addSink(sink2);
env.execute();
Questions:
How many times would inputStream.map(very_heavy_computation_func) be executed?
Once or twice?
If twice, how can I cache tempStream (or other method) to avoid the previous transformation being executed multiple times?
You can actually answer (1) easily by just trying out more or less exactly your example:
public class TestProgram {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
SingleOutputStreamOperator<Integer> stream = env.fromElements(1, 2, 3)
.map(i -> {
System.out.println("Executed expensive computation for: " + i);
return i;
});
stream.map(i -> i).addSink(new PrintSinkFunction<>());
stream.map(i -> i).addSink(new PrintSinkFunction<>());
env.execute();
}
}
produces (on my machine, for example):
Executed expensive computation for: 3
Executed expensive computation for: 1
Executed expensive computation for: 2
9> 3
8> 2
8> 2
9> 3
7> 1
7> 1
You can also find a more technical answer here which explains how records are replicated to downstream operators, rather than running the source/operator multiple times.
I use a toolbox written in Java, javaplex: https://github.com/javaplex/javaplex.github.io
Here is the link for the input matrix mat: https://drive.google.com/file/d/0B3uM9Np2kJYoTmtkRHV2WU5JeGc/view?usp=sharing
The first loop, which works very well, is a plain for loop:
cd C:\ProjetBCodesMatlab\Jplex
javaaddpath('./lib/javaplex.jar');
import edu.stanford.math.plex4.*;
javaaddpath('./lib/plex-viewer.jar');
import edu.stanford.math.plex_viewer.*;
cd './utility';
addpath(pwd);
cd '..';
max_dimension = 3;
max_filtration_value = 1000;
num_divisions = 10000;
options.max_filtration_value = max_filtration_value;
options.max_dimension = max_dimension - 1;
%------------------------------------------------------------
for i=1:10
maColonne = mat(i,:);
intervals= Calcul_interval(maColonne,options,max_dimension, max_filtration_value, num_divisions)
intervals
multinterval{i}= intervals;
end
I use an i7. When I execute the function feature('numCores'), I get:
MATLAB detected: 4 physical cores.
MATLAB detected: 8 logical cores.
MATLAB was assigned: 8 logical cores by the OS.
MATLAB is using: 4 logical cores.
MATLAB is not using all logical cores because hyper-threading is enabled.
I run the same code with parfor:
cd C:\ProjetBCodesMatlab\Jplex
javaaddpath('./lib/javaplex.jar');
import edu.stanford.math.plex4.*;
javaaddpath('./lib/plex-viewer.jar');
import edu.stanford.math.plex_viewer.*;
cd './utility';
addpath(pwd);
cd '..';
max_dimension = 3;
max_filtration_value = 1000;
num_divisions = 10000;
options.max_filtration_value = max_filtration_value;
options.max_dimension = max_dimension - 1;
%------------------------------------------------------------
parfor i=1:10
maColonne = mat(i,:);
intervals= Calcul_interval(maColonne,options,max_dimension, max_filtration_value, num_divisions)
intervals
multinterval{i}= intervals;
end
I get this error: "is not serializable".
Finally, I reproduced it. It cannot be serialized because your variable intervals holds an object of type edu.stanford.math.plex4.homology.barcodes.BarcodeCollection, which is not serializable. You have to make it serializable or extract the relevant data on the workers. The Parallel Computing Toolbox can only transport data that can be serialized.
Assume the following scenario: a set of dependent jobs which are sent to Hadoop. Hadoop executes the first one, then the second one that depends on the first, and so on. The jobs are submitted in one go using JobControl (see the code below).
Using Hadoop 2.x (in Java), is it possible to change the number of reducers of a job at runtime? More specifically, how can I change the number of reducers in job 2 after job 1 has been executed?
Also, is there a way to let Hadoop automatically infer the number of reducers by estimating the map output? It always uses 1, and I cannot find a way to change the default (other than explicitly setting the number myself).
// 1. create JobControl
JobControl jc = new JobControl(name);
// 2. add all the controlled jobs to the job control
// note that this is done in one go by using a collection
jc.addJobCollection(jobs);
// 3. execute the jobcontrol in a Thread
Thread workflowThread = new Thread(jc, "Thread_" + name);
workflowThread.setDaemon(true); // will not prevent the JVM from shutting down
// 4. we wait for it to complete
LOG.info("Waiting for thread to complete: " + workflowThread.getName());
while (!jc.allFinished()) {
Thread.sleep(REFRESH_WAIT);
}
Your first question: yes, you can set the number of reducers of job 2 after job 1 has executed, in your driver program:
Job job1 = new Job(conf, "job 1");
//your job setup here
//...
job1.submit();
job1.waitForCompletion(true);
int job2Reducers = ... //compute based on job1 results here
Job job2 = new Job(conf, "job 2");
job2.setNumReduceTasks(job2Reducers);
//your job2 setup here
//...
job2.submit();
job2.waitForCompletion(true);
Second question: to my knowledge, no, you can't make Hadoop automatically choose the number of reducers based on your mapper output.
The number of maps is usually driven by the number of DFS blocks in the input files, although that leads people to adjust their DFS block size to tune the number of maps.
So we can set the number of reduce tasks using the same logic the map side uses.
To make the reducer count dynamic, I wrote logic that sets the number of reduce tasks at runtime to match the number of map tasks.
In Java code:
long defaultBlockSize = 0;
int NumOfReduce = 10; // default value you can give any number.
long inputFileLength = 0;
try {
FileSystem fileSystem = FileSystem.get(this.getConf()); // hdfs file system
inputFileLength = fileSystem.getContentSummary(
new Path(PROP_HDFS_INPUT_LOCATION)).getLength();// input files stored in hdfs location
defaultBlockSize = fileSystem.getDefaultBlockSize(new Path(
hdfsFilePath.concat("PROP_HDFS_INPUT_LOCATION")));// getting default block size
if (inputFileLength > 0 && defaultBlockSize > 0) {
NumOfReduce = (int) (((inputFileLength / defaultBlockSize) + 1) * 2);// calculating number of tasks
}
System.out.println("NumOfReduce : " + NumOfReduce);
} catch (Exception e) {
LOGGER.error(" Exception{} ", e);
}
job.setNumReduceTasks(NumOfReduce);
The method getNextAvailableVm() allots virtual machines for a particular data center in a round-robin fashion. (The integer returned by this method is the machine allotted)
In a data center there could be virtual machines with different sets of configurations. For example:
5 VMs with 1024 memory
4 VMs with 512 memory
Total: 9 VMs
For this data center, a machine with 1024 memory will get tasks twice as often as a machine with 512 memory.
So the machines for this data center are returned by getNextAvailableVm() in the following order:
0 0 1 1 2 2 3 3 4 4 5 6 7 8
This is how the machines are currently being returned. But there is a problem.
There could be cases when a particular machine is busy and cannot be allotted the task. Instead, the next available machine with the highest memory must be allotted the task. I have not been able to implement this.
For example :
0 (allotted the first time)
0 (to be allotted the second time)
but if 0 is busy...
allot 1, if 1 is not busy
in the next cycle, check whether 0 is busy
if not busy, allot 0 (only when machine 0 has not yet handled all the requests it is entitled to handle)
if busy, allot the next machine
The cloudSimEventFired method in the following class is called whenever a machine is freed or allotted.
public class TempAlgo extends VmLoadBalancer implements CloudSimEventListener {
/**
* Key : Name of the data center
* Value : List of objects of class 'VmAllocationUIElement'.
*/
private Map<String,LinkedList<DepConfAttr>> confMap = new HashMap<String,LinkedList<DepConfAttr>>();
private Iterator<Integer> availableVms = null;
private DatacenterController dcc;
private boolean sorted = false;
private int currentVM;
private boolean calledOnce = false;
private boolean indexChanged = false;
private LinkedList<Integer> busyList = new LinkedList<Integer>();
private Map<String,LinkedList<AlgoAttr>> algoMap = new HashMap<String, LinkedList<AlgoAttr>>();
private Map<String,AlgoHelper> map = new HashMap<String,AlgoHelper>();
private Map<String,Integer> vmCountMap = new HashMap<String,Integer>();
public TempAlgo(DatacenterController dcb) {
confMap = DepConfList.dcConfMap;
this.dcc = dcb;
dcc.addCloudSimEventListener(this);
if(!this.calledOnce) {
this.calledOnce = true;
// Make a new map using dcConfMap that lists 'DataCenter' as a 'key' and 'LinkedList<AlgoAttr>' as 'value'.
Set<String> keyst =DepConfList.dcConfMap.keySet();
for(String dataCenter : keyst) {
LinkedList<AlgoAttr> tmpList = new LinkedList<AlgoAttr>();
LinkedList<DepConfAttr> list = dcConfMap.get(dataCenter);
int totalVms = 0;
for(DepConfAttr o : list) {
tmpList.add(new AlgoAttr(o.getVmCount(), o.getMemory()/512, 0));
totalVms = totalVms + o.getVmCount();
}
Temp_Algo_Static_Var.algoMap.put(dataCenter, tmpList);
Temp_Algo_Static_Var.vmCountMap.put(dataCenter, totalVms);
}
this.algoMap = new HashMap<String, LinkedList<AlgoAttr>>(Temp_Algo_Static_Var.algoMap);
this.vmCountMap = new HashMap<String,Integer>(Temp_Algo_Static_Var.vmCountMap);
this.map = new HashMap<String,AlgoHelper>(Temp_Algo_Static_Var.map);
}
}
@Override
public int getNextAvailableVm() {
synchronized(this) {
String dataCenter = this.dcc.getDataCenterName();
int totalVMs = this.vmCountMap.get(dataCenter);
AlgoHelper ah = (AlgoHelper)this.map.get(dataCenter);
int lastIndex = ah.getIndex();
int lastCount = ah.getLastCount();
LinkedList<AlgoAttr> list = this.algoMap.get(dataCenter);
AlgoAttr aAtr = (AlgoAttr)list.get(lastIndex);
indexChanged = false;
if(lastCount < totalVMs) {
if(aAtr.getRequestAllocated() % aAtr.getWeightCount() == 0) {
lastCount = lastCount + 1;
this.currentVM = lastCount;
if(aAtr.getRequestAllocated() == aAtr.getVmCount() * aAtr.getWeightCount()) {
lastIndex++;
if(lastIndex != list.size()) {
AlgoAttr aAtr_N = (AlgoAttr)list.get(lastIndex);
aAtr_N.setRequestAllocated(1);
this.indexChanged = true;
}
if(lastIndex == list.size()) {
lastIndex = 0;
lastCount = 0;
this.currentVM = lastCount;
AlgoAttr aAtr_N = (AlgoAttr)list.get(lastIndex);
aAtr_N.setRequestAllocated(1);
this.indexChanged = true;
}
}
}
if(!this.indexChanged) {
aAtr.setRequestAllocated(aAtr.getRequestAllocated() + 1);
}
this.map.put(dataCenter, new AlgoHelper(lastIndex, lastCount));
//System.out.println("Current VM : " + this.currentVM + " for data center : " + dataCenter);
return this.currentVM;
}}
System.out.println("--------Before final return statement---------");
return 0;
}
@Override
public void cloudSimEventFired(CloudSimEvent e) {
if(e.getId() == CloudSimEvents.EVENT_CLOUDLET_ALLOCATED_TO_VM) {
int vmId = (Integer) e.getParameter(Constants.PARAM_VM_ID);
busyList.add(vmId);
System.out.println("+++++++++++++++++++Machine with vmID : " + vmId + " attached");
}else if(e.getId() == CloudSimEvents.EVENT_VM_FINISHED_CLOUDLET) {
int vmId = (Integer) e.getParameter(Constants.PARAM_VM_ID);
busyList.remove(vmId);
//System.out.println("+++++++++++++++++++Machine with vmID : " + vmId + " freed");
}
}
}
In the above code, all the lists are already sorted with the highest memory first. The whole idea is to balance the load by allocating more tasks to machines with more memory.
Each time a machine is allotted, its request-allocated count is incremented by one. Each set of machines has a weight count attached to it, which is calculated by dividing memory_allotted by 512.
The method getNextAvailableVm() is called by multiple threads at a time. For 3 data centers, 3 threads will simultaneously call getNextAvailableVm(), but on different class objects. The data center returned by the statement this.dcc.getDataCenterName() in that method is determined by the data center broker policy selected earlier.
How do I make sure that the machine I am currently returning is free, and, if it is not free, allot the next available machine with the highest memory? I also have to make sure that a machine entitled to process X tasks does eventually process X tasks, even if it is currently busy.
This is a general description of the data structure used here:
The code of this class is hosted here on github.
This is the link for the complete project on github.
Most of the data structures/classes used here are inside this package
Perhaps you are overthinking the problem. A simple strategy is to have a broker which is aware of all the pending tasks. Each worker or thread asks the broker for a new message/task to work on. The broker gives out work in the order it was asked for. This is how JMS queues work. For the JVMs which can handle two tasks, you can start two threads.
There are many standard JMS implementations which do this, but I suggest looking at ActiveMQ, as it is simple to get started with.
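As a rough sketch of the pattern (not your CloudSim code; the broker URL, queue name, and task payloads are placeholders), with plain JMS and ActiveMQ:
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TaskQueueSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue tasks = session.createQueue("vm.tasks");

        // "Broker side": whoever produces work just drops tasks on the queue.
        MessageProducer producer = session.createProducer(tasks);
        for (int i = 0; i < 10; i++) {
            producer.send(session.createTextMessage("task-" + i));
        }

        // "Worker side": each VM (or each thread on a VM that can handle two tasks)
        // creates a consumer and pulls the next task only when it is free.
        MessageConsumer consumer = session.createConsumer(tasks);
        TextMessage msg;
        while ((msg = (TextMessage) consumer.receive(1000)) != null) {
            System.out.println("processing " + msg.getText());
        }

        connection.close();
    }
}
In reality the producer and the consumers would run in separate JVMs; they share one process here only to keep the sketch self-contained.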
Note that in your case a simpler solution is to have one machine with 8 GB of memory. You can buy 8 GB for a server for very little ($40 - $150 depending on the vendor), and it will be used more efficiently in one instance by sharing resources. I assume you are looking at much larger instances; anything smaller than 8 GB is better off just being upgraded.
How do I make sure that the machine I am currently returning is free
This is your scenario; if you don't know how to tell whether a machine is free, I don't see how anyone else would have more knowledge of your application.
and if the machine is not free I allot the next machine with highest memory available.
You need to look at the free machines and pick the one with the most available memory. I don't see what the catch is here other than doing what you have stated.
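As a sketch of that selection step (VmInfo and the busy set are made-up stand-ins for illustration, not the VmLoadBalancer/DatacenterController classes from your code, and the weighted-entitlement bookkeeping is left out):
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for one VM entry: id plus configured memory.
class VmInfo {
    final int id;
    final int memory;
    VmInfo(int id, int memory) { this.id = id; this.memory = memory; }
}

class FreeVmPicker {
    /** Returns the id of a free VM with the largest memory, or empty if every VM is busy. */
    static Optional<Integer> pickFreeVm(List<VmInfo> vms, Set<Integer> busy) {
        return vms.stream()
                .filter(vm -> !busy.contains(vm.id))
                .max(Comparator.comparingInt((VmInfo vm) -> vm.memory))
                .map(vm -> vm.id);
    }
}
If this returns empty, the request can be queued until a machine frees up; otherwise the chosen id is also counted against that machine's entitlement.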
I also have to make sure that the machine that is entitled to process X tasks does process X tasks, even if that machine is currently busy.
You need a data source or store for this information: what is allowed to run where. In JMS you would have multiple queues and only hand certain queues to the machines which can process those queues.
What are the semantics of compare-and-swap in Java? Namely, does the compare-and-swap method of an AtomicInteger just guarantee ordered access between different threads to the particular memory location of the atomic integer instance, or does it guarantee ordered access to all locations in memory, i.e. does it act as if it were a volatile (a memory fence)?
From the docs:
weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of any variables other than the target of the weakCompareAndSet.
compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.
It's apparent from the API documentation that compareAndSet acts as if it were a volatile variable. However, weakCompareAndSet is supposed to just change its specific memory location. Thus, if that memory location is exclusive to the cache of a single processor, weakCompareAndSet is supposed to be much faster than the regular compareAndSet.
I'm asking this because I've benchmarked the following methods by running threadnum different threads, varying threadnum from 1 to 8, and having totalwork = 1e9 (the code is written in Scala, a statically compiled JVM language, but both its meaning and its bytecode translation are isomorphic to Java in this case - these short snippets should be clear):
val atomic_cnt = new AtomicInteger(0)
val atomic_tlocal_cnt = new java.lang.ThreadLocal[AtomicInteger] {
override def initialValue = new AtomicInteger(0)
}
def loop_atomic_tlocal_cas = {
var i = 0
val until = totalwork / threadnum
val acnt = atomic_tlocal_cnt.get
while (i < until) {
i += 1
acnt.compareAndSet(i - 1, i)
}
acnt.get + i
}
def loop_atomic_weakcas = {
var i = 0
val until = totalwork / threadnum
val acnt = atomic_cnt
while (i < until) {
i += 1
acnt.weakCompareAndSet(i - 1, i)
}
acnt.get + i
}
def loop_atomic_tlocal_weakcas = {
var i = 0
val until = totalwork / threadnum
val acnt = atomic_tlocal_cnt.get
while (i < until) {
i += 1
acnt.weakCompareAndSet(i - 1, i)
}
acnt.get + i
}
on an AMD with 4 dual 2.8 GHz cores, and a 2.67 GHz 4-core i7 processor. The JVM is Sun Server Hotspot JVM 1.6. The results show no performance difference.
Specs: AMD 8220 4x dual-core @ 2.8 GHz
Test name: loop_atomic_tlocal_cas
Thread num.: 1
Run times: (showing last 3)
7504.562 7502.817 7504.626 (avg = 7415.637 min = 7147.628 max = 7504.886 )
Thread num.: 2
Run times: (showing last 3)
3751.553 3752.589 3751.519 (avg = 3713.5513 min = 3574.708 max = 3752.949 )
Thread num.: 4
Run times: (showing last 3)
1890.055 1889.813 1890.047 (avg = 2065.7207 min = 1804.652 max = 3755.852 )
Thread num.: 8
Run times: (showing last 3)
960.12 989.453 970.842 (avg = 1058.8776 min = 940.492 max = 1893.127 )
Test name: loop_atomic_weakcas
Thread num.: 1
Run times: (showing last 3)
7325.425 7057.03 7325.407 (avg = 7231.8682 min = 7057.03 max = 7325.45 )
Thread num.: 2
Run times: (showing last 3)
3663.21 3665.838 3533.406 (avg = 3607.2149 min = 3529.177 max = 3665.838 )
Thread num.: 4
Run times: (showing last 3)
3664.163 1831.979 1835.07 (avg = 2014.2086 min = 1797.997 max = 3664.163 )
Thread num.: 8
Run times: (showing last 3)
940.504 928.467 921.376 (avg = 943.665 min = 919.985 max = 997.681 )
Test name: loop_atomic_tlocal_weakcas
Thread num.: 1
Run times: (showing last 3)
7502.876 7502.857 7502.933 (avg = 7414.8132 min = 7145.869 max = 7502.933 )
Thread num.: 2
Run times: (showing last 3)
3752.623 3751.53 3752.434 (avg = 3710.1782 min = 3574.398 max = 3752.623 )
Thread num.: 4
Run times: (showing last 3)
1876.723 1881.069 1876.538 (avg = 4110.4221 min = 1804.62 max = 12467.351 )
Thread num.: 8
Run times: (showing last 3)
959.329 1010.53 969.767 (avg = 1072.8444 min = 959.329 max = 1880.049 )
Specs: Intel i7 quad-core @ 2.67 GHz
Test name: loop_atomic_tlocal_cas
Thread num.: 1
Run times: (showing last 3)
8138.3175 8130.0044 8130.1535 (avg = 8119.2888 min = 8049.6497 max = 8150.1950 )
Thread num.: 2
Run times: (showing last 3)
4067.7399 4067.5403 4068.3747 (avg = 4059.6344 min = 4026.2739 max = 4068.5455 )
Thread num.: 4
Run times: (showing last 3)
2033.4389 2033.2695 2033.2918 (avg = 2030.5825 min = 2017.6880 max = 2035.0352 )
Test name: loop_atomic_weakcas
Thread num.: 1
Run times: (showing last 3)
8130.5620 8129.9963 8132.3382 (avg = 8114.0052 min = 8042.0742 max = 8132.8542 )
Thread num.: 2
Run times: (showing last 3)
4066.9559 4067.0414 4067.2080 (avg = 4086.0608 min = 4023.6822 max = 4335.1791 )
Thread num.: 4
Run times: (showing last 3)
2034.6084 2169.8127 2034.5625 (avg = 2047.7025 min = 2032.8131 max = 2169.8127 )
Test name: loop_atomic_tlocal_weakcas
Thread num.: 1
Run times: (showing last 3)
8132.5267 8132.0299 8132.2415 (avg = 8114.9328 min = 8043.3674 max = 8134.0418 )
Thread num.: 2
Run times: (showing last 3)
4066.5924 4066.5797 4066.6519 (avg = 4059.1911 min = 4025.0703 max = 4066.8547 )
Thread num.: 4
Run times: (showing last 3)
2033.2614 2035.5754 2036.9110 (avg = 2033.2958 min = 2023.5082 max = 2038.8750 )
While it's possible that thread locals in the example above end up in the same cache lines, it seems to me that there is no observable performance difference between regular CAS and its weak version.
This could mean that, in fact, a weak compare-and-swap acts as a fully fledged memory fence, i.e. as if it were a volatile variable.
Question: Is this observation correct? Also, is there a known architecture or Java distribution for which a weak compare and set is actually faster? If not, what is the advantage of using a weak CAS in the first place?
A weak compare and swap could act as a full volatile variable, depending on the implementation of the JVM, sure. In fact, I wouldn't be surprised if on certain architectures it is not possible to implement a weak CAS in a notably more performant way than the normal CAS. On these architectures, it may well be the case that weak CASes are implemented exactly the same as a full CAS. Or it might simply be that your JVM has not had much optimisation put into making weak CASes particularly fast, so the current implementation just invokes a full CAS because it's quick to implement, and a future version will refine this.
The JLS says only that a weak CAS does not establish a happens-before relationship, so there is no guarantee that the modification it causes is visible to other threads. All you get is the guarantee that the compare-and-set operation itself is atomic, but no guarantee about the visibility of the (potentially) new value. That is not the same as a guarantee that the value won't be seen, so your tests are consistent with this.
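To make that concrete, here is a small sketch (my own illustration, not from your benchmark) of what the volatile-like memory effects of compareAndSet guarantee:
import java.util.concurrent.atomic.AtomicInteger;

public class PublishSketch {
    static int payload;                               // plain, non-volatile field
    static final AtomicInteger flag = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            payload = 42;                             // plain write
            flag.compareAndSet(0, 1);                 // has the memory effects of a volatile write
        });
        Thread reader = new Thread(() -> {
            while (flag.get() != 1) { }               // volatile read
            // Because of the happens-before edge, the earlier write to payload is visible here.
            System.out.println(payload);              // guaranteed to print 42
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}
If the writer used weakCompareAndSet instead, the JLS would no longer guarantee that the reader sees payload == 42, even though on x86 it will in practice.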
In general, try to avoid making any conclusions about concurrency-related behaviour through experimentation. There are so many variables to take into account, that if you don't follow what the JLS guarantees to be correct, then your program could break at any time (perhaps on a different architecture, perhaps under more aggressive optimisation that's prompted by a slight change in the layout of your code, perhaps under future builds of the JVM that don't exist yet, etc.). There's never a reason to assume you can get away with something that's stated not to be guaranteed, because experiments show that "it works".
The x86 instruction for "atomically compare and swap" is LOCK CMPXCHG. This instruction creates a full memory fence.
There is no instruction that does this job without creating a memory fence, so it is very likely that both compareAndSet and weakCompareAndSet map to LOCK CMPXCHG and perform a full memory fence.
But that's for x86, other architectures (including future variants of x86) may do things differently.
weakCompareAndSet is not guaranteed to be faster; it's just permitted to be faster. You can look at the open-source code of OpenJDK to see what some smart people decided to do with this permission:
source code of compareAndSet
source code of weakCompareAndSet
Namely: They're both implemented as the one-liner
return unsafe.compareAndSwapObject(this, valueOffset, expect, update);
They have exactly the same performance, because they have exactly the same implementation! (in OpenJDK at least). Other people have remarked on the fact that you can't really do any better on x86 anyway, because the hardware already gives you a bunch of guarantees "for free". It's only on simpler architectures like ARM that you have to worry about it.