Java Batch Multithreading

I have some JSR-352 batch jobs that run quite well, apart from their long runtime.
Now I am thinking of distributing the work across several threads.
How is this pattern supported by JSR-352?
Edit: Now that I know the keywords to search for, I can find more resources on this problem:
Batch job definition: How to run a dynamically-calculated number of partitions?
JSR 352 batch application example
https://github.com/javaee-samples/jakartaee-samples/tree/main/ee7/batch
https://www.baeldung.com/java-ee-7-batch-processing#partitionof-job
https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf page 24+
I want to create a partitioned batchlet, and the partitions must be calculated at runtime. The idea is to split the processing of all records into a predefined maximum number of partitions, or a number of partitions with maximum size.

Here is another example of using partitioned chunk processing in JBeret. A partitioned batchlet should be similar in terms of job configuration. But partitioned batchlets are less common than partitioned chunk steps, because a batchlet is mainly a task-oriented step, which is not easy to split into multiple partitions. So you may want to check whether a batchlet step is really suitable here.
The JBeret project's test-apps module contains some examples with partition configuration. For example, the cdiScopes tests use a batchlet with partitions, and the throttle tests use a chunk with partitions.
The JBeret test-apps/chunkPartition test module contains a job XML that uses a mapper for dynamic partitioning.
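For the dynamic partition calculation mentioned above, the standard JSR-352 hook is a PartitionMapper that builds the PartitionPlan at runtime. Below is a minimal sketch; the class name, the partition property names, and the record-counting logic are my own placeholders, not from the question:

import java.util.Properties;

import javax.batch.api.partition.PartitionMapper;
import javax.batch.api.partition.PartitionPlan;
import javax.batch.api.partition.PartitionPlanImpl;
import javax.inject.Named;

@Named
public class DynamicPartitionMapper implements PartitionMapper {

    private static final int MAX_PARTITIONS = 8;        // assumed upper bound on partitions
    private static final int MAX_PARTITION_SIZE = 1000; // assumed max records per partition

    @Override
    public PartitionPlan mapPartitions() throws Exception {
        int totalRecords = countRecords(); // placeholder for your own lookup, e.g. a COUNT query

        // Either cap the number of partitions or cap the partition size, whichever bites first.
        int partitions = Math.max(1, Math.min(MAX_PARTITIONS,
                (int) Math.ceil((double) totalRecords / MAX_PARTITION_SIZE)));
        int chunk = (int) Math.ceil((double) totalRecords / partitions);

        Properties[] props = new Properties[partitions];
        for (int i = 0; i < partitions; i++) {
            props[i] = new Properties();
            props[i].setProperty("firstRecord", String.valueOf(i * chunk));
            props[i].setProperty("lastRecord",
                    String.valueOf(Math.min((i + 1) * chunk, totalRecords) - 1));
        }

        PartitionPlanImpl plan = new PartitionPlanImpl();
        plan.setPartitions(partitions);
        plan.setThreads(partitions); // one thread per partition
        plan.setPartitionProperties(props);
        return plan;
    }

    private int countRecords() {
        return 5000; // hypothetical; query your data source here
    }
}

The mapper is referenced from the step's job XML via <partition><mapper ref="dynamicPartitionMapper"/></partition>, and each partition's properties can be injected into the batchlet with the #{partitionPlan['firstRecord']} substitution.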

Related

Most efficient number of threads in Kafka streams

I am using Kafka Streams with one topic (which has 3 partitions).
I want to know the most efficient number of threads for the Kafka Streams num.stream.threads option.
1 thread with 3 tasks vs. 3 threads with 1 task each: which one is better?
P.S. The server has a 3-core CPU.
The answer is: it depends! Typically, it will be more efficient to have as many threads as partitions/tasks, as this gives you better parallelism. But having too many threads can also be disastrous due to context switching if you don't have enough CPU.
You must also consider the throughput of the data to be processed, as well as the cost of the operation performed on each record. If your streams application is not really data-intensive, there may be little benefit in allocating a large number of threads, as they will be idle most of the time.
It is therefore best to start with a single thread and perform load tests to measure the performance of your application. For doing this, you can use the command-line tools available in the Apache Kafka (or Confluent) distribution, e.g. bin/kafka-producer-perf-test.sh, and monitor the metrics exposed by Kafka Streams using JMX (see: Monitoring Kafka Streams - Confluent Documentation).
Moreover, you should note that the maximum number of threads you can allocate to your application is not exactly equal to the number of partitions of the input topic you have declared in your topology. Actually, you should also consider all the topics from all the sub-topologies that have been generated by your application.
For example, let's say you are consuming a stream topic with 3 partitions, but your application performs a repartition operation. Then you will end up with two sub-topologies, each consuming one topic with 3 partitions. So you will have a total of 6 tasks, which means you can configure up to 6 threads.
Note: Usually, it is recommended to deploy a KafkaStreams instance with a single thread and to scale horizontally by adding more instances. This simplifies the scaling model, especially when using Kubernetes (i.e. 1 pod = 1 KafkaStreams instance = 1 thread).
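For completeness, a minimal sketch of where num.stream.threads is set; the application id, broker address, and topic names are placeholders, and the topology is deliberately trivial:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsThreading {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Start with 1 and raise it up to the total task count while load testing.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        // Trivial topology: a 3-partition input topic yields 3 tasks for this sub-topology.
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}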

What is partition in Spark?

I'm trying to understand: what is a partition in Spark?
My understanding is: when we read from a source into a specific Dataset, that Dataset can be split into multiple sub-Datasets, and those sub-Datasets are called partitions. It is up to the Spark framework where and how it distributes them in the cluster. Is that correct?
I got confused when I read some online articles, which say:
Under the hood, these RDDs or Datasets are stored in partitions on
different cluster nodes. Partition basically is a logical chunk of a
large distributed data set
This statement breaks my understanding. As per the above statement, RDDs or Datasets sit inside partitions. But I was thinking of the RDD itself as a partition (after splitting).
Can anyone help me clear up this doubt?
Here is my code snippet, where I am reading from JSON.
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath);
So while reading itself, how can I split this into multiple partitions? Or is there any other way around it?
What is a partition?
As per the Spark documentation, a partition in Spark is an atomic chunk of
data (a logical division of the data) stored on a node in the cluster.
Partitions are the basic units of parallelism in Apache Spark. An RDD/DataFrame/Dataset in
Apache Spark is a collection of partitions.
So, when you do
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath);
Spark reads your source JSON data, creates logical divisions of the data (which are the partitions), and then processes those partitions in parallel across the cluster.
For example, in layman's terms:
If you have the task of moving a 1-ton load of wheat from one place to another and you have only one worker (similar to a single thread), there are a few possibilities:
1) Your worker might not be able to move such a huge weight in one go (similar to not having enough CPU or RAM).
2) If he is capable (similar to a high-spec machine), it might take a huge amount of time and he might get stressed out.
3) And your worker can't take on any other work while he is doing the transfer, and so on.
What if you divide the 1-ton load of wheat into 1 kg blocks (similar to logical partitions of the data), hire more workers, and then ask them to move it?
Now it is a lot easier for them, you can add a few more workers (similar to scaling up the cluster), and you can finish the actual task very easily and quickly.
Similar to the above approach, Spark does a logical division of the data so that you can process the data in parallel, use your cluster resources optimally, and finish your task much faster.
Note: RDD/Dataset and DataFrame are just abstractions over logical partitions of data.
There are other concepts in RDD and DataFrame which I didn't cover in the example (i.e. resilience and immutability).
How can I split this into multiple partitions?
You can use the repartition API to increase the number of partitions:
spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath).repartition(number)
and you can use the coalesce() API to reduce the number of partitions.
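Putting it together, a minimal runnable sketch; the input path and partition counts are placeholders, not from the question, and schema inference is used for brevity:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RepartitionDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("repartition-demo")
                .master("local[*]")
                .getOrCreate();

        // In your code you would keep the schema() call; inference is used here for brevity.
        Dataset<Row> ds = spark.read().json("/data/input.json"); // placeholder path

        System.out.println("partitions after read: " + ds.rdd().getNumPartitions());

        Dataset<Row> more = ds.repartition(16);   // full shuffle; can increase or decrease the count
        Dataset<Row> fewer = more.coalesce(4);    // no full shuffle; can only decrease the count

        System.out.println("after repartition(16): " + more.rdd().getNumPartitions());
        System.out.println("after coalesce(4)    : " + fewer.rdd().getNumPartitions());

        spark.stop();
    }
}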

What is the difference between SPARK Partitions and Worker Cores?

I used a standalone Spark cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.
Now, I've read about partitions, but I couldn't work out whether they are different from worker cores or not.
Is there a difference between setting the number of cores and the number of partitions?
Simplistic view: Partition vs Number of Cores
When you invoke an action on an RDD,
a "job" is created for it. So a job is a piece of work submitted to Spark.
Jobs are divided into "stages" based on shuffle boundaries.
Each stage is further divided into tasks based on the number of partitions of the RDD. So a task is the smallest unit of work for Spark.
Now, how many of these tasks can be executed simultaneously depends on the number of cores available.
Partition (or task) refers to a unit of work. If you have a 200 GB Hadoop file loaded as an RDD and chunked into 128 MB blocks (the Spark default), then you have roughly 1,600 partitions in this RDD. The number of cores determines how many partitions can be processed at any one time, and up to that many tasks (capped at the number of partitions) can execute on this RDD in parallel.
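A small sketch of how those two quantities show up in practice; the local master with 4 cores and the input path are illustrative placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionsVsCores {
    public static void main(String[] args) {
        // Cores are a property of the cluster/executors: 4 local cores here,
        // so at most 4 tasks can run at the same time.
        SparkSession spark = SparkSession.builder()
                .appName("partitions-vs-cores")
                .master("local[4]")
                .getOrCreate();

        // Partitions are a property of the data: the file is chunked into splits.
        Dataset<Row> ds = spark.read().text("/data/big-file.txt"); // placeholder path
        int partitions = ds.rdd().getNumPartitions();
        int cores = spark.sparkContext().defaultParallelism();

        System.out.println("tasks per stage          : " + partitions);
        System.out.println("tasks running at once    : " + Math.min(partitions, cores));
        System.out.println("waves of tasks per stage : " + (int) Math.ceil((double) partitions / cores));

        spark.stop();
    }
}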

Flink - structuring job to maximize throughput

I have 4 types of Kafka topics and 65 topics of each type. The goal is to do some simple windowed aggregation on the data and write it to a DB.
The topologies will look something like:
kafka -> window -> reduce -> db write
Somewhere in this mix I want / need to do a union - or possibly several (depending on how many topics are combined each time).
The data flow in the topics ranges from 10K to >200K messages / min.
I have a four-node Flink cluster with 30 cores per node. How do I build these topologies to spread the load out?
I am writing this answer assuming that each of the 65 topics of the same type contains the same type of data.
The simplest solution to this problem would be to change the Kafka setup such that you have 4 topics with 65 partitions each. Then you have 4 data sources in the program, with high parallelism (65) and this distributes across the cluster naturally.
If it is not possible to change the setup, I see two things you can do:
One possible solution is to create a modified version of the FlinkKafkaConsumer where one source may consume multiple topics (rather than only multiple partitions of one topic). With that change, it would work pretty much as if you were using many partitions, rather than many topics. If you want to go with this solution, I would ping the mailing list to get some support for doing this. It would be a valuable addition to the Flink code anyways.
You can give each source a separate resource group, which will give it a dedicated slot. You can do this via "env.addSource(new FlinkKafkaConsumer(...)).startNewResourceGroup();". But here, the observation is that you are trying to execute 260 distinct sources on a cluster with 120 cores (and thus probably 120 task slots). You would need to increase the number of slots to hold all the tasks.
I think the first option is the preferable option.
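For reference on the first option: newer versions of the Flink Kafka connector already let a single source subscribe to a list of topics, so no modification of the consumer is needed there anymore. A minimal sketch, with placeholder broker address, group id, and topic names:

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MultiTopicSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder broker
        props.setProperty("group.id", "flink-aggregator");    // placeholder group id

        // One source for all 65 topics of one type (topic names are placeholders).
        List<String> typeATopics = Arrays.asList("typeA-01", "typeA-02" /* ..., "typeA-65" */);

        env.addSource(new FlinkKafkaConsumer<>(typeATopics, new SimpleStringSchema(), props))
           .name("typeA-source")
           .setParallelism(65)   // roughly one subtask per topic
           .print();             // placeholder sink; the real job would window/reduce and write to the DB

        env.execute("multi-topic-union");
    }
}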

Benchmarking Kafka - mediocre performance

I'm benchmarking Kafka 0.8.1.1 by streaming 1 KB messages on EC2 servers.
I installed ZooKeeper on two m3.xlarge servers with the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Second, I installed a single Kafka server on an i2.2xlarge machine with 32 GB RAM and 6 additional SSD drives, each disk mounted as /mnt/a, /mnt/b, etc. On the server I have one broker and a single topic on port 9092 with 8 partitions and a replication factor of 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance, and the latency between instances is less than 1 ms.
I wrote a producer/consumer Java client with a single producer thread and 8 consumer threads, where the partition key is a random number from 0 to 7.
I serialized each message as JSON using a custom encoder.
My consumer and producer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
Now when I'm sending 100k messages, I'm getting a capacity of 10k messages per second and about 1 ms latency.
That means I have 10 megabytes per second, which equals 80 Mbit/s. This is not bad, but I would expect better performance from instances located in the same zone.
Am I missing something in the configuration?
I suggest you break down the problem. How fast is it without JSON encoding? How fast is one node without replication vs. with replication? Build a picture of how fast each component should be.
I also suggest you test bare-metal machines to see how they compare, as they can be significantly faster (unless you are CPU bound, in which case they can be much the same).
According to this benchmark, you should be able to get 50 MB/s from one node: http://kafka.apache.org/07/performance.html
I would expect you should be able to get close to saturating your 1 Gb links (I assume that's what you have).
Disclaimer: I work on Chronicle Queue, which is quite a bit faster: http://java.dzone.com/articles/kafra-benchmark-chronicle
If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects and converting the byte arrays to JSON objects in the last step of your pipeline.
You might also get better performance if each consumer thread consistently reads from the same topic partition. I think Kafka only allows one consumer to read from a partition at a time, so depending on how you're randomly selecting partitions, it's possible that a consumer would be briefly blocked if it's trying to read from the same partition as another consumer thread.
It's also possible that you might get better performance using fewer consumer threads or different Kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like the number of threads per consumer and batch sizes. Here are some examples I wrote which illustrate that concept; a stripped-down sketch follows the links:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
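The sketch below shows only the parameterized-test scaffolding (JUnit 4); the runBenchmark helper is a placeholder where you would wire in your own producer/consumer loop and return the measured throughput:

import static org.junit.Assert.assertTrue;

import java.util.Arrays;
import java.util.Collection;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class KafkaTuningTest {

    @Parameters(name = "threads={0}, batch={1}")
    public static Collection<Object[]> settings() {
        // Candidate consumer thread counts and batch sizes to compare.
        return Arrays.asList(new Object[][] {
                {1, 100}, {4, 100}, {8, 100},
                {1, 200}, {4, 200}, {8, 200},
        });
    }

    private final int consumerThreads;
    private final int batchSize;

    public KafkaTuningTest(int consumerThreads, int batchSize) {
        this.consumerThreads = consumerThreads;
        this.batchSize = batchSize;
    }

    @Test
    public void measureThroughput() {
        long messagesPerSecond = runBenchmark(consumerThreads, batchSize);
        System.out.printf("threads=%d batch=%d -> %d msg/s%n",
                consumerThreads, batchSize, messagesPerSecond);
        assertTrue(messagesPerSecond > 0);
    }

    // Placeholder: run your producer/consumer against the test cluster and return msg/s.
    private long runBenchmark(int threads, int batch) {
        return 10_000;
    }
}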
I hope that helps.
