COMPSs - Nodes already filled error - java

After submitting a COMPSs application, I received the following error message and the application is not executed.
MPI_CMD=mpirun -timestamp-output -n 1 -H s00r0
/apps/COMPSs/1.3/Runtime/scripts/user/runcompss
--project=/tmp/1668183.tmpdir/project_1458303603.xml
--resources=/tmp/1668183.tmpdir/resources_1458303603.xml
--uuid=2ed20e6a-9f02-49ff-a71c-e071ce35dacc
/apps/FILESPACE/pycompssfile arg1 arg2 : -n 1 -H s00r0
/apps/COMPSs/1.3/Runtime/scripts/system/adaptors/nio/persistent_worker_starter.sh
/apps/INTEL/mkl/lib/intel64 null
/home/myhome/kmeans_python/src/ true
/tmp/1668183.tmpdir 4 5 5 s00r0-ib0 43001 43000 true 1
/apps/COMPSs/1.3/Runtime/scripts/system/2ed20e6a-9f02-49ff-a71c-e071ce35dacc : -n 1 -H s00r0
/apps/COMPSs/1.3/Runtime/scripts/system/adaptors/nio/persistent_worker_starter.sh
/apps/INTEL/mkl/lib/intel64 null
/home/myhome/kmeans_python/src/ true
/tmp/1668183.tmpdir 4 5 5 s00r0-ib0 43001 43000 true 2
/apps/COMPSs/1.3/Runtime/scripts/system/2ed20e6a-9f02-49ff-a71c-e071ce35dacc
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
I am using COMPSs 1.3.
Why is this happening?

You are trying to run the master and a worker on the same node. On clusters, COMPSs 1.3 with the NIO adaptor (the default option) uses mpirun to spawn the master and worker processes on the different nodes of the cluster, and the mpirun installed in your cluster does not allow several of these processes to share a node.
The options to solve it are the following (sketched in the commands below):
Do not specify --tasks_in_master= in the enqueue_compss command.
Execute with the GAT adaptor (--comm=integratedtoolkit.gat.master.GATAdaptor), which has more overhead.
The next COMPSs release will use the spawn command available in the different cluster resource managers (such as blaunch or srun), which should solve this issue.
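As a rough sketch, the first two options would look something like this (application name, arguments, and node count are illustrative, not taken from the report above):

# Option 1: omit --tasks_in_master so the master keeps a node to itself
enqueue_compss --num_nodes=3 myapp.Main arg1 arg2

# Option 2: switch to the GAT adaptor
enqueue_compss --num_nodes=3 --comm=integratedtoolkit.gat.master.GATAdaptor myapp.Main arg1 arg2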

Related

Readiness and liveness failed with smallrye metrics in kubernetes

I'm deploying a Quarkus pod in Kubernetes and the startup seems to go fine, but there's a problem with the readiness and liveness probes, which report unhealthy.
For metrics I'm using smallrye metrics configured on port 8080 and on path:
quarkus.smallrye-metrics.path=/metrics
If I enter the pod and execute
curl localhost:8080/metrics
the response is
# HELP base_classloader_loadedClasses_count Displays the number of classes that are currently loaded in the Java virtual machine.
# TYPE base_classloader_loadedClasses_count gauge
base_classloader_loadedClasses_count 7399.0
# HELP base_classloader_loadedClasses_total Displays the total number of classes that have been loaded since the Java virtual machine has started execution.
# TYPE base_classloader_loadedClasses_total counter
base_classloader_loadedClasses_total 7403.0
# HELP base_classloader_unloadedClasses_total Displays the total number of classes unloaded since the Java virtual machine has started execution.
# TYPE base_classloader_unloadedClasses_total counter
base_classloader_unloadedClasses_total 4.0
# HELP base_cpu_availableProcessors Displays the number of processors available to the Java virtual machine. This value may change during a particular invocation of the virtual machine.
# TYPE base_cpu_availableProcessors gauge
base_cpu_availableProcessors 1.0
# HELP base_cpu_processCpuLoad_percent Displays the "recent cpu usage" for the Java Virtual Machine process. This value is a double in the [0.0,1.0] interval. A value of 0.0 means that none of the CPUs were running threads from the JVM process during the recent period of time observed, while a value of 1.0 means that all CPUs were actively running threads from the JVM 100% of the time during the recent period being observed. Threads from the JVM include the application threads as well as the JVM internal threads. All values between 0.0 and 1.0 are possible depending of the activities going on in the JVM process and the whole system. If the Java Virtual Machine recent CPU usage is not available, the method returns a negative value.
# TYPE base_cpu_processCpuLoad_percent gauge
base_cpu_processCpuLoad_percent 2.3218608761411404E-7
# HELP base_cpu_systemLoadAverage Displays the system load average for the last minute. The system load average is the sum of the number of runnable entities queued to the available processors and the number of runnable entities running on the available processors averaged over a period of time. The way in which the load average is calculated is operating system specific but is typically a damped time-dependent average. If the load average is not available, a negative value is displayed. This attribute is designed to provide a hint about the system load and may be queried frequently. The load average may be unavailable on some platforms where it is expensive to implement this method.
# TYPE base_cpu_systemLoadAverage gauge
base_cpu_systemLoadAverage 0.15
# HELP base_gc_time_total Displays the approximate accumulated collection elapsed time in milliseconds. This attribute displays -1 if the collection elapsed time is undefined for this collector. The Java virtual machine implementation may use a high resolution timer to measure the elapsed time. This attribute may display the same value even if the collection count has been incremented if the collection elapsed time is very short.
# TYPE base_gc_time_total counter
base_gc_time_total_seconds{name="Copy"} 0.032
base_gc_time_total_seconds{name="MarkSweepCompact"} 0.071
# HELP base_gc_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
# TYPE base_gc_total counter
base_gc_total{name="Copy"} 4.0
base_gc_total{name="MarkSweepCompact"} 2.0
# HELP base_jvm_uptime_seconds Displays the time from the start of the Java virtual machine in milliseconds.
# TYPE base_jvm_uptime_seconds gauge
base_jvm_uptime_seconds 624.763
# HELP base_memory_committedHeap_bytes Displays the amount of memory in bytes that is committed for the Java virtual machine to use. This amount of memory is guaranteed for the Java virtual machine to use.
# TYPE base_memory_committedHeap_bytes gauge
base_memory_committedHeap_bytes 8.5262336E7
# HELP base_memory_maxHeap_bytes Displays the maximum amount of heap memory in bytes that can be used for memory management. This attribute displays -1 if the maximum heap memory size is undefined. This amount of memory is not guaranteed to be available for memory management if it is greater than the amount of committed memory. The Java virtual machine may fail to allocate memory even if the amount of used memory does not exceed this maximum size.
# TYPE base_memory_maxHeap_bytes gauge
base_memory_maxHeap_bytes 1.348141056E9
# HELP base_memory_usedHeap_bytes Displays the amount of used heap memory in bytes.
# TYPE base_memory_usedHeap_bytes gauge
base_memory_usedHeap_bytes 1.2666888E7
# HELP base_thread_count Displays the current number of live threads including both daemon and non-daemon threads
# TYPE base_thread_count gauge
base_thread_count 11.0
# HELP base_thread_daemon_count Displays the current number of live daemon threads.
# TYPE base_thread_daemon_count gauge
base_thread_daemon_count 7.0
# HELP base_thread_max_count Displays the peak live thread count since the Java virtual machine started or peak was reset. This includes daemon and non-daemon threads.
# TYPE base_thread_max_count gauge
base_thread_max_count 11.0
# HELP vendor_cpu_processCpuTime_seconds Displays the CPU time used by the process on which the Java virtual machine is running in nanoseconds. The returned value is of nanoseconds precision but not necessarily nanoseconds accuracy. This method returns -1 if the the platform does not support this operation.
# TYPE vendor_cpu_processCpuTime_seconds gauge
vendor_cpu_processCpuTime_seconds 4.36
# HELP vendor_cpu_systemCpuLoad_percent Displays the "recent cpu usage" for the whole system. This value is a double in the [0.0,1.0] interval. A value of 0.0 means that all CPUs were idle during the recent period of time observed, while a value of 1.0 means that all CPUs were actively running 100% of the time during the recent period being observed. All values betweens 0.0 and 1.0 are possible depending of the activities going on in the system. If the system recent cpu usage is not available, the method returns a negative value.
# TYPE vendor_cpu_systemCpuLoad_percent gauge
vendor_cpu_systemCpuLoad_percent 2.3565253563367224E-7
# HELP vendor_memory_committedNonHeap_bytes Displays the amount of non heap memory in bytes that is committed for the Java virtual machine to use.
# TYPE vendor_memory_committedNonHeap_bytes gauge
vendor_memory_committedNonHeap_bytes 5.1757056E7
# HELP vendor_memory_freePhysicalSize_bytes Displays the amount of free physical memory in bytes.
# TYPE vendor_memory_freePhysicalSize_bytes gauge
vendor_memory_freePhysicalSize_bytes 5.44448512E9
# HELP vendor_memory_freeSwapSize_bytes Displays the amount of free swap space in bytes.
# TYPE vendor_memory_freeSwapSize_bytes gauge
vendor_memory_freeSwapSize_bytes 0.0
# HELP vendor_memory_maxNonHeap_bytes Displays the maximum amount of used non-heap memory in bytes.
# TYPE vendor_memory_maxNonHeap_bytes gauge
vendor_memory_maxNonHeap_bytes -1.0
# HELP vendor_memory_usedNonHeap_bytes Displays the amount of used non-heap memory in bytes.
# TYPE vendor_memory_usedNonHeap_bytes gauge
vendor_memory_usedNonHeap_bytes 4.7445384E7
# HELP vendor_memoryPool_usage_bytes Current usage of the memory pool denoted by the 'name' tag
# TYPE vendor_memoryPool_usage_bytes gauge
vendor_memoryPool_usage_bytes{name="CodeHeap 'non-nmethods'"} 1357184.0
vendor_memoryPool_usage_bytes{name="CodeHeap 'non-profiled nmethods'"} 976128.0
vendor_memoryPool_usage_bytes{name="CodeHeap 'profiled nmethods'"} 4787200.0
vendor_memoryPool_usage_bytes{name="Compressed Class Space"} 4562592.0
vendor_memoryPool_usage_bytes{name="Eden Space"} 0.0
vendor_memoryPool_usage_bytes{name="Metaspace"} 3.5767632E7
vendor_memoryPool_usage_bytes{name="Survivor Space"} 0.0
vendor_memoryPool_usage_bytes{name="Tenured Gen"} 9872160.0
# HELP vendor_memoryPool_usage_max_bytes Peak usage of the memory pool denoted by the 'name' tag
# TYPE vendor_memoryPool_usage_max_bytes gauge
vendor_memoryPool_usage_max_bytes{name="CodeHeap 'non-nmethods'"} 1369600.0
vendor_memoryPool_usage_max_bytes{name="CodeHeap 'non-profiled nmethods'"} 976128.0
vendor_memoryPool_usage_max_bytes{name="CodeHeap 'profiled nmethods'"} 4793088.0
vendor_memoryPool_usage_max_bytes{name="Compressed Class Space"} 4562592.0
vendor_memoryPool_usage_max_bytes{name="Eden Space"} 2.3658496E7
vendor_memoryPool_usage_max_bytes{name="Metaspace"} 3.5769312E7
vendor_memoryPool_usage_max_bytes{name="Survivor Space"} 2883584.0
vendor_memoryPool_usage_max_bytes{name="Tenured Gen"} 9872160.0
So it seems metrics are working fine, but kubernetes returns this error:
Warning Unhealthy 24m (x9 over 28m) kubelet Liveness probe errored: strconv.Atoi: parsing "metrics": invalid syntax
Warning Unhealthy 4m2s (x70 over 28m) kubelet Readiness probe errored: strconv.Atoi: parsing "metrics": invalid syntax
Any help?
Thanks
First I needed to fix Dockerfile.jvm:
FROM openjdk:11
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en'
# We make four distinct layers so if there are application changes the library layers can be re-used
# RUN ls -la target
COPY --chown=185 target/quarkus-app/lib/ /deployments/lib/
COPY --chown=185 target/quarkus-app/*.jar /deployments/
COPY --chown=185 target/quarkus-app/app/ /deployments/app/
COPY --chown=185 target/quarkus-app/quarkus/ /deployments/quarkus/
RUN java -version
EXPOSE 8080
USER root
ENV AB_JOLOKIA_OFF=""
ENV JAVA_OPTS="-Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager"
ENV JAVA_DEBUG="true"
ENV JAVA_APP_JAR="/deployments/quarkus-run.jar"
CMD java ${JAVA_OPTS} -jar ${JAVA_APP_JAR}
This way the jar started working; without that CMD, the openjdk image just starts jshell.
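For a quick local sanity check, the image can be built and run by hand; the Dockerfile path and the tag below are assumptions (the standard Quarkus layout), not taken from the post:

docker build -f src/main/docker/Dockerfile.jvm -t myapp-jvm .
docker run --rm -p 8080:8080 myapp-jvm

After that I saw the log below.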
2022-09-21 19:56:00,450 INFO [io.sma.health] (executor-thread-1) SRHCK01001: Reporting health down status: {"status":"DOWN","checks":[{"name":"Database connections health check","status":"DOWN","data":{"<default>":"Unable to execute the validation check for the default DataSource: Communications link failure\n\nThe last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server."}}]}
DB connection in kubernetes is not working.
deploy command: mvn clean package -DskipTests -Dquarkus.kubernetes.deploy=true
"minikube dashboard" shows the deployment running (screenshot not included here).
I used the endpoints below:
quarkus.smallrye-health.root-path=/health
quarkus.smallrye-health.liveness-path=/health/live
quarkus.smallrye-metrics.path=/metrics
and this is what the liveness URL looks like in Firefox (screenshot not included here).
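Without a browser, the same checks can be run from inside the pod (assuming the paths configured above):

curl localhost:8080/health/live
curl localhost:8080/health/ready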
I needed to change some dependencies in the pom because I use minikube locally, and I had to delete some Java code because of the DB connection problems. You can find a working example at https://github.com/ozkanpakdil/quarkus-examples/tree/master/liveness-readiness-kubernetes
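For reference, these health endpoints are served by the Quarkus SmallRye Health extension, whose standard dependency is shown below (a reminder of the usual coordinates, not a copy of the project's pom):

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-smallrye-health</artifactId>
</dependency>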
You can see the deployment's definition YAML below.
mintozzy#mintozzy-MACH-WX9:~$ kubectl get deployments.apps app-version-checker -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    app.quarkus.io/build-timestamp: 2022-09-21 - 20:29:23 +0000
    app.quarkus.io/commit-id: 7d709651868d810cd9a906609c8edad3f9d796c0
    deployment.kubernetes.io/revision: "3"
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
    prometheus.io/scheme: http
    prometheus.io/scrape: "true"
  creationTimestamp: "2022-09-21T20:13:21Z"
  generation: 3
  labels:
    app.kubernetes.io/name: app-version-checker
    app.kubernetes.io/version: 1.0.0-SNAPSHOT
  name: app-version-checker
  namespace: default
  resourceVersion: "117584"
  uid: 758d420b-ed22-48f8-9d6f-150422a6b38e
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: app-version-checker
      app.kubernetes.io/version: 1.0.0-SNAPSHOT
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        app.quarkus.io/build-timestamp: 2022-09-21 - 20:29:23 +0000
        app.quarkus.io/commit-id: 7d709651868d810cd9a906609c8edad3f9d796c0
        prometheus.io/path: /metrics
        prometheus.io/port: "8080"
        prometheus.io/scheme: http
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: app-version-checker
        app.kubernetes.io/version: 1.0.0-SNAPSHOT
    spec:
      containers:
      - env:
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: mintozzy/app-version-checker:1.0.0-SNAPSHOT
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health/live
            port: 8080
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 10
        name: app-version-checker
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health/ready
            port: 8080
            scheme: HTTP
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 10
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2022-09-21T20:13:21Z"
    lastUpdateTime: "2022-09-21T20:30:03Z"
    message: ReplicaSet "app-version-checker-5cb974f465" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2022-09-22T16:09:48Z"
    lastUpdateTime: "2022-09-22T16:09:48Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Spring Cloud message serialisation problem with Kafka running in docker container

I'm trying to write a test for my Spring Cloud service while it runs against Kafka and Schema Registry, which run inside Docker containers.
Kafka and Schema Registry communicate with each other via a Docker network and have ports that are exposed on the host. The service I am testing runs on the host; it is able to communicate with both the Dockerized Kafka broker and the Dockerized Schema Registry. I am starting it up from a JUnit test which is annotated as shown below.
@ExtendWith(SpringExtension.class)
@SpringBootTest
@EnableAutoConfiguration(exclude = TestSupportBinderAutoConfiguration.class)
@AutoConfigureTestDatabase(replace = AutoConfigureTestDatabase.Replace.NONE)
public class MyTest {
    ...
}
My service spins up and is able to write a message to the Kafka broker running inside the Docker container. However, when my service is started using the various Spring / JUnit test annotations, there appears to be something different about the way the message it writes is serialized, compared to when my service runs in 'production mode' (i.e. when I run it using java -jar com.xyz.MyService).
The message needs to be written in Avro format, so I've configured the binder in application.yml as
my-topic:
  destination: my-topic
  contentType: application/*+avro
  producer:
    useNativeEncoding: true
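For context, this binding normally sits under spring.cloud.stream in application.yml; the surrounding keys in the sketch below are the standard Spring Cloud Stream layout, filled in as an assumption since the full file isn't shown:

spring:
  cloud:
    stream:
      bindings:
        my-topic:
          destination: my-topic
          contentType: application/*+avro
          producer:
            useNativeEncoding: true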
When attempting to consume the message that my service has written, AbstractKafkaAvroDeserializer blows up, complaining that it was unable to marshal it into a completely unrelated Avro type:
{"logger_name":"org.apache.kafka.streams.errors.LogAndFailExceptionHandler","message":"Exception caught during Deserialization, taskId: 0_0, topic: my-topic, partition: 0, offset: 1","stack_trace":"org.apache.kafka.common.errors.SerializationException: Could not find class com.xyz.SomeOtherMessageType specified in writer's schema whilst finding reader's schema for a SpecificRecord.
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.getSpecificReaderSchema(AbstractKafkaAvroDeserializer.java:265)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.getReaderSchema(AbstractKafkaAvroDeserializer.java:247)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.getDatumReader(AbstractKafkaAvroDeserializer.java:194)
...
This does not happen if my service runs in 'production mode'.
I think therefore that some setting is being applied to my service when I spin it up in 'test mode', which changes the way messages are encoded or serialized.
Can anyone suggest some things I can try to resolve this?
Update 1
So, it turns out that the messages look almost identical when they are written to the topic and then read back (UUIDs are random for each test run). The one consistent difference is in bytes 1-4, which in the Confluent wire format carry the 4-byte schema ID after the 0x00 magic byte: 0x00000001 in test mode versus 0x00000057 in production, hinting that the two runs resolve their schemas against different registries:
Written to topic by service running in 'test mode':
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
------- -------- -------- -------- -------- ----------------
000000: 00000000 01483335 63616366 62642D30 .....H35cacfbd-0
000010: 3165642D 34653564 2D613936 652D6665 1ed-4e5d-a96e-fe
000020: 30626339 65313033 34664832 35313436 0bc9e1034fH25146
000030: 6237392D 66643334 2D346430 322D6261 b79-fd34-4d02-ba
000040: 37362D36 61396535 62623861 31343448 76-6a9e5bb8a144H
000050: 30653364 30326536 2D383732 372D3466 0e3d02e6-8727-4f
000060: 64312D38 3730662D 33646633 35353166 d1-870f-3df3551f
000070: 37343861 084D7220 54064D72 730A4A69 748a.Mr T.Mrs.Ji
000080: 6D6D790A 57686974 6514536E 6F772068 mmy.White.Snow h
000090: 6F757365 00000012 4C697665 72706F6F ouse....Liverpoo
0000A0: 6C0C4C4C 32335252 0E456E67 6C616E64 l.XXXXXX.England
0000B0: 16303735 31323334 35363738 021E4D72 .XXXXXXXXXXX..Mr
0000C0: 20542773 20427573 696E6573 73483737 T's BusinessH77
0000D0: 32383064 36352D36 3633362D 34376565 280d65-6636-47ee
0000E0: 2D393864 302D6361 36646531 32373838 -98d0-ca6de12788
0000F0: 63610000 ca..
Written to topic by service running in 'production mode':
Address 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
------- -------- -------- -------- -------- ----------------
000000: 00000000 57483433 64343264 61372D30 ....WH43d42da7-0
000010: 6533392D 34646665 2D383966 362D6531 e39-4dfe-89f6-e1
000020: 37363036 34383730 61344833 38663864 76064870a4H38f8d
000030: 3561342D 65386532 2D346134 372D6235 5a4-e8e2-4a47-b5
000040: 30662D37 31623435 36653837 33393348 0f-71b456e87393H
000050: 63666463 33653964 2D303362 612D3464 cfdc3e9d-03ba-4d
000060: 62372D62 3034622D 31393137 37323634 b7-b04b-19177264
000070: 36623665 084D7220 54064D72 730A4A69 6b6e.Mr T.Mrs.Ji
000080: 6D6D790A 57686974 6514536E 6F772068 mmy.White.Snow h
000090: 6F757365 00000012 4C697665 72706F6F ouse....Liverpoo
0000A0: 6C0C4C4C 32335252 0E456E67 6C616E64 l.XXXXXX.England
0000B0: 16303735 31323334 35363738 021E4D72 .XXXXXXXXXXX..Mr
0000C0: 20542773 20427573 696E6573 73486161 T's BusinessHaa
0000D0: 35326636 34662D36 6131642D 34393030 52f64f-6a1d-4900
0000E0: 2D616537 612D3432 33326333 65613938 -ae7a-4232c3ea98
0000F0: 38330000 83..
Testcontainers' Kafka module runs a single-node Kafka installation. It doesn't spin up a Schema Registry, which I suspect might be the problem for Avro serialization.
You can add one manually in the tests. Testcontainers allows you to run any Docker image programmatically with a simple API call:
var schemaRegistry = new GenericContainer(DockerImageName.parse("confluentinc/cp-schema-registry:version"));
I don't know for certain, but you probably need to connect Kafka and the schema registry, which you can do with a Network; see the "Advanced networking" chapter in the Testcontainers docs.
Unfortunately, I don't have a good example to refer to.
You can also look at something like this: https://github.com/kreuzwerker/kafka-consumer-testing.
They mock the schema registry URL, so there's no separate schema registry container.
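To flesh out the networking point above, here is a minimal sketch of wiring the broker and the registry onto one Docker network; the image tags, the alias, and the env wiring are assumptions based on common Testcontainers usage, not a verified recipe:

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.utility.DockerImageName;

// One shared network so the registry can reach the broker by its alias.
Network network = Network.newNetwork();

KafkaContainer kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))
        .withNetwork(network)
        .withNetworkAliases("kafka");

// The registry stores schemas in Kafka itself, so it only needs the broker address.
GenericContainer<?> schemaRegistry =
        new GenericContainer<>(DockerImageName.parse("confluentinc/cp-schema-registry:7.4.0"))
                .withNetwork(network)
                .withExposedPorts(8081)
                .withEnv("SCHEMA_REGISTRY_HOST_NAME", "schema-registry")
                .withEnv("SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS", "PLAINTEXT://kafka:9092");

kafka.start();
schemaRegistry.start();

// Point the service under test at the mapped registry port.
String registryUrl = "http://" + schemaRegistry.getHost() + ":" + schemaRegistry.getMappedPort(8081);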

Dataproc Hive Job - Tez Java heap OOM

I have a problem with my cluster.
The cluster has:
2 primary workers
2 secondary workers
30 GB of RAM
The cluster runs correctly and executes the Hive jobs for about 10 hours. After roughly 10 hours I get a Java heap space error:
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) ~[?:1.8.0_292]
at org.apache.hadoop.ipc.ResponseBuffer.toByteArray(ResponseBuffer.java:53) ~[hadoop-common-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1159) ~[hadoop-common-3.2.2.jar:?]
... 5 more
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
INFO : Completed executing command(queryId=hive_20210923102707_66b4cd11-7cfb-4910-87bc-7f062ce1b00e); Time taken: 75.101 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask (state=08S01,code=1)
I tried to set this configuration, but it didn't help:
SET hive.execution.engine = tez;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET mapreduce.job.reduces=1;
SET hive.auto.convert.join=false;
set hive.stats.column.autogather=false;
set hive.optimize.sort.dynamic.partition=true;
Is there any way to clean the Java heap space, or have I got some configuration wrong?
The problem is solved by restarting the cluster.
It seems that the default Tez container and heap sizes set by Dataproc are too small for your job. You can update the following Hive properties to increase them:
hive.tez.container.size: The YARN container size in MB for Tez. If set to "-1" (default value), it picks the value of mapreduce.map.memory.mb. Consider increasing the value if the query / Tez app fails with something like "Container is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 6.0 GB of 20 GB virtual memory used. Killing container.". Example: SET hive.tez.container.size=8192 in Hive, or --properties hive:hive.tez.container.size=8192 when creating the cluster.
hive.tez.java.opts: The JVM options for the Tez YARN application. If not set, it picks the value of mapreduce.map.java.opts. This value should be less than or equal to the container size. Consider increasing the JVM heap size if the query / Tez app fails with an OOM exception. Example: SET hive.tez.java.opts=-Xmx8g or --properties hive:hive.tez.java.opts=-Xmx8g when creating the cluster.
You can check /etc/hadoop/conf/mapred-site.xml to get the value of mapreduce.map.java.opts, and /etc/hive/conf/hive-site.xml for the two Hive properties mentioned above.
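As a concrete sketch, the properties can be set per-session in Hive or baked in at cluster creation time; the cluster name and the sizes below are illustrative only (keep the JVM heap at or below the container size):

SET hive.tez.container.size=8192;
SET hive.tez.java.opts=-Xmx6g;

gcloud dataproc clusters create my-cluster \
  --properties "hive:hive.tez.container.size=8192,hive:hive.tez.java.opts=-Xmx6g"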

Does "nice" affect the priority of Java threads [duplicate]

This question already has answers here:
is nice() used to change the thread priority or the process priority?
(3 answers)
Closed 1 year ago.
On a Unix system, you can run a process at lower CPU "priority" (pedantically, it does not change the thing that is called the priority, but rather influences what share of available CPU time is used, which is "priority" in the general sense) using the nice command:
nice program
And you could use that to run a JVM process:
nice java -jar program.jar
The Java program run by that JVM process will start multiple threads.
Does the nice change affect the scheduling of those Java threads? That is, will the Java threads have a lower CPU priority when run as
nice java -jar program.jar
than when run as
java -jar program.jar
In general, this will be system dependent, so I am interested in the Linux case.
According to what ps reports, niceness is applied to Java threads. I ran this quick test with a Java application that waits for user input:
Start process with : nice -n 19 java Main
Output of ps -m -l 20746
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 - 1000 20746 10006 0 - - - 1739135 - pts/2 0:00 java Main
0 S 1000 - - 0 99 19 - - futex_ - 0:00 -
1 S 1000 - - 0 99 19 - - wait_w - 0:00 -
1 S 1000 - - 0 99 19 - - futex_ - 0:00 -
1 S 1000 - - 0 99 19 - - futex_ - 0:00 -
Start process with : nice -n 15 java Main
Output of ps -m -l 21488
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 - 1000 21488 10006 0 - - - 1722494 - pts/2 0:00 java Main
0 S 1000 - - 0 95 15 - - futex_ - 0:00 -
1 S 1000 - - 0 95 15 - - wait_w - 0:00 -
1 S 1000 - - 0 95 15 - - futex_ - 0:00 -
1 S 1000 - - 0 95 15 - - futex_ - 0:00 -
The NI column seems to reflect what I passed to nice, and the priority changes accordingly too. I got the process IDs (20746, 21488) using jps.
Note that running jstack 21488, for example, will not give the above information.
I ran the above on Ubuntu 16.04 LTS (64-bit).
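For completeness, the Main class used in this test can be as trivial as the sketch below (the original source wasn't shown; any program that keeps the JVM alive will do):

import java.util.Scanner;

// Blocks on stdin so the JVM stays alive long enough
// to inspect its threads with ps -m -l <pid>.
public class Main {
    public static void main(String[] args) {
        System.out.println("Waiting for input...");
        new Scanner(System.in).nextLine();
    }
}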
Actually, niceness is a property of the application according to POSIX.1. Here is a more detailed post: is nice() used to change the thread priority or the process priority?
Java is not special. It's just a process, and the OS sets its "niceness" the same way as with any other process.
On Linux, Java threads are implemented using native threads, so again, "niceness" is subject to OS controls in the same way as any other native thread.
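To double-check this on a running system, per-thread nice values can also be listed directly (a sketch; the grep pattern is only illustrative):

ps -eLo pid,tid,ni,comm | grep java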

COMPSs application blocked

When executing the sample application increment given in the manual (http://compss.bsc.es/releases/compss/latest/docs/COMPSs_User_Manual_App_Exec.pdf) the runtime gets blocked and no error message is displayed in the terminal.
OUTPUT:
$ runcompss increment.Increment 3 1 2 3
Using default location for project file: /opt/COMPSs/Runtime/configuration/xml/projects/project.xml
Using default location for resources file: /opt/COMPSs/Runtime/configuration/xml/resources/resources.xml
----------------- Executing increment.Increment --------------------------
WARNING: IT Properties file is null. Setting default values
[ API] - Deploying COMPSs Runtime v1.3
[ API] - Starting COMPSs Runtime v1.3
Initial counter values:
- Counter1 value is 1
- Counter2 value is 2
- Counter3 value is 3
How can I know what is blocking my application?
Thank you in advance
EDIT:
Checking the $HOME/.COMPSs/increment*/runtime.log, all the tasks seem to be blocked:
grep "Blocked" runtime.log
[(410)(2016-03-04 15:48:09,864) TaskScheduler] #scheduleTask - Blocked: Task(1, increment)
[(411)(2016-03-04 15:48:09,865) TaskScheduler] #scheduleTask - Blocked: Task(2, increment)
[(412)(2016-03-04 15:48:09,866) TaskScheduler] #scheduleTask - Blocked: Task(3, increment)
The runtime.log is in the home folder of the user who executed runcompss:
$HOME/.COMPSs/increment*
EDIT: If all tasks are blocked, check whether the constraints defined in IncrementItf.java match the description in resources.xml. Another possible problem is that the resources could not be started.
Tasks get blocked for two reasons:
Resources are not properly configured in the XML files.
The available resources do not fulfill the task constraints (this does not apply in the simple example).
You should check the project and resources XML files: there should be one resource with the same name in both files, as sketched below.
You should also check the runtime.log file; it contains all the master's output.
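As an illustrative sketch of the name matching (hostname made up, elements abbreviated; check the templates shipped with your COMPSs installation for the exact schema):

<!-- resources.xml -->
<ResourceList>
  <Resource Name="worker01.example.com">
    ...
  </Resource>
</ResourceList>

<!-- project.xml -->
<Project>
  <Worker Name="worker01.example.com">
    ...
  </Worker>
</Project>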
