I'm trying to understand: what is a partition in Spark?
My understanding is: when we read from a source into a specific Dataset, that Dataset can be split into multiple sub-Datasets, and those sub-Datasets are called partitions. It is up to the Spark framework where and how it distributes them in the cluster. Is that correct?
I came to doubt this when I read some online articles, which say:
Under the hood, these RDDs or Datasets are stored in partitions on
different cluster nodes. Partition basically is a logical chunk of a
large distributed data set
This statement breaks my understanding. As per the statement above, RDDs or Datasets sit inside partitions. But I was thinking of the RDD itself as a partition (after splitting).
Can anyone help me clear up this doubt?
Here is my code snippet, where I am reading from JSON.
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath);
So how can I split this into multiple partitions while reading itself? Or is there another way around it?
What is a Partition?
As per the Spark documentation, a partition in Spark is an atomic chunk of
data (a logical division of data) stored on a node in the cluster.
Partitions are the basic units of parallelism in Apache Spark. RDDs/DataFrames/Datasets in
Apache Spark are collections of partitions.
So, when you do
Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath);
Spark reads your source JSON data, creates logical divisions of that data (which are the partitions), and then processes those partitions in parallel across the cluster.
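For instance, you can check how many partitions Spark created for that read (a quick sketch; spark, the schema, and the path are the ones from the question):

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath);

// Each JSON input split becomes one partition; this prints the count.
System.out.println("partitions: " + ds.javaRDD().getNumPartitions());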
For example, in layman's terms...
If you have the task of moving a 1-ton load of wheat from one place to another, and you have only one worker (similar to a single thread) to do it, there are a few possibilities:
1) Your worker might not be able to move such a huge weight all at once (similar to not having enough CPU or RAM).
2) If he is capable (similar to a high-spec machine), it might take a huge amount of time, and he might be stressed out.
3) And your worker can't take on any other work while he is doing the transfer. And so on...
What if you divide the 1-ton load of wheat into 1 kg blocks (similar to logical partitions of the data) and hire more workers?
Now it is a lot easier for them, you can add a few more workers (similar to scaling up the cluster), and you can achieve your actual task very easily and quickly.
Similar to the above approach, Spark does a logical division of the data so that you can process it in parallel, use your cluster resources optimally, and finish your task much faster.
Note: RDDs/Datasets and DataFrames are just abstractions over logical partitions of data.
There are other concepts in RDDs and DataFrames which I didn't cover in the example (i.e. resilience and immutability).
How can I split this into multiple partitions?
You can use the repartition API to split the data into more partitions:
spark.read().schema(Jsonreadystructure.SCHEMA)
.json(JsonPath).repartition(number)
And you can use the coalesce() API to bring the number of partitions down.
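A short sketch putting the two together (the partition counts here are just illustrative):

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath);

// repartition(n) does a full shuffle and can increase or decrease the count.
Dataset<Row> wide = ds.repartition(16);

// coalesce(n) only merges existing partitions (no full shuffle),
// so it is the cheaper choice when reducing the count.
Dataset<Row> narrow = wide.coalesce(4);

System.out.println(narrow.javaRDD().getNumPartitions()); // 4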
Related
I have some JSR-352 batch jobs that run quite well - despite their long runtime.
Now I am thinking of distributing the work across several threads.
How is this pattern supported by JSR-352?
Edit: Now that I know the keywords to search for, I can find more resources on this problem:
Batch job definition: How to run a dynamically-calculated number of partitions?
JSR 352 batch application example
https://github.com/javaee-samples/jakartaee-samples/tree/main/ee7/batch
https://www.baeldung.com/java-ee-7-batch-processing#partitionof-job
https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf page 24+
I want to create a partitioned batchlet, where the partitions must be calculated at runtime. The idea is to split the processing of all records into a predefined maximum number of partitions, or into a number of partitions with a maximum size.
Here is another example of using partitioned chunk processing in JBeret. A partitioned batchlet should be similar in terms of job configuration. But partitioned batchlets are not common compared to partitioned chunk steps, because a batchlet is mainly a task-oriented step, which is not easy to split into multiple partitions. So you may want to check whether a batchlet step is really suitable.
The JBeret project's test-apps module contains some examples with partition configuration. For example, the cdiScopes tests use a batchlet with partitions, and the throttle tests use a chunk with partitions.
The JBeret test-apps/chunkPartition test module contains a job XML that uses a mapper for dynamic partitioning.
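For the dynamic-partition part, a partition mapper is the standard JSR-352 hook: the container calls it at runtime and you return a PartitionPlan. A minimal sketch, assuming the record count comes from some lookup of your own, and with an arbitrary cap of 8 partitions and 10,000 records per partition (none of these numbers come from the spec):

import java.util.Properties;
import javax.batch.api.partition.PartitionMapper;
import javax.batch.api.partition.PartitionPlan;
import javax.batch.api.partition.PartitionPlanImpl;
import javax.inject.Named;

@Named
public class DynamicPartitionMapper implements PartitionMapper {

    private static final int MAX_PARTITIONS = 8;             // assumed cap
    private static final long MAX_PARTITION_SIZE = 10_000L;  // assumed max records per partition

    @Override
    public PartitionPlan mapPartitions() throws Exception {
        long total = countRecords(); // hypothetical helper, e.g. SELECT COUNT(*)
        int partitions = (int) Math.min(
                (total + MAX_PARTITION_SIZE - 1) / MAX_PARTITION_SIZE,
                MAX_PARTITIONS);

        // Hand each partition its range bounds as partition properties.
        long chunk = (total + partitions - 1) / partitions;
        Properties[] props = new Properties[partitions];
        for (int i = 0; i < partitions; i++) {
            props[i] = new Properties();
            props[i].setProperty("firstRecord", String.valueOf(i * chunk));
            props[i].setProperty("lastRecord",
                    String.valueOf(Math.min((i + 1) * chunk, total) - 1));
        }

        PartitionPlan plan = new PartitionPlanImpl();
        plan.setPartitions(partitions);
        plan.setThreads(partitions);
        plan.setPartitionProperties(props);
        return plan;
    }

    private long countRecords() {
        return 1_000_000L; // stand-in for the real lookup
    }
}

In the job XML, the batchlet step's partition element then points at this mapper with <mapper ref="dynamicPartitionMapper"/>, and each partition can read its bounds via #{partitionPlan['firstRecord']}.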
I have a billion records which are unsorted and unrelated to each other, and I have to call a function processRecord on each record using Java.
The easy way to do so is using a for loop, but it is taking a lot of time.
The other way I could think of is using multithreading, but the question is how to divide the dataset of records efficiently, and among how many threads?
Is there an efficient way to process this large dataset?
Measure
Before figuring out which implementation path to choose, you should measure how long it takes to process a single item. Based on that you can choose the size of the work chunk submitted to a thread pool, queue, or cluster. Very small work chunks increase coordination overhead. Too big a work chunk takes a long time to process, so you will have less granular progress information.
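A rough way to get that number (a sketch; processRecord is the function from the question, and the sample size is arbitrary):

List<Record> sample = records.subList(0, 1_000); // small representative sample
long start = System.nanoTime();
for (Record r : sample) {
    processRecord(r);
}
long nanosPerItem = (System.nanoTime() - start) / sample.size();
System.out.println("~" + nanosPerItem + " ns per item");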
Single machine processing is simpler to implement, troubleshoot, maintain, and reason about.
Processing on single machine
Use java.util.concurrent.ExecutorService
Submit each piece of work using the submit(Callable<T> task) method: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#submit-java.util.concurrent.Callable-
Create an instance of ExecutorService using java.util.concurrent.Executors.newFixedThreadPool(int nThreads). Choose a reasonable value for nThreads; the number of CPU cores is a reasonable starting value. You may use more threads if there are blocking IO calls (database, HTTP) in the processing.
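Putting those pieces together, a minimal sketch (the Record type, the chunking, and processRecord are placeholders for your own code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RecordProcessor {
    public static void main(String[] args) throws Exception {
        int nThreads = Runtime.getRuntime().availableProcessors(); // starting point
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);

        List<Future<Integer>> results = new ArrayList<>();
        for (List<Record> chunk : chunks()) {            // one task per work chunk
            results.add(pool.submit(() -> {
                for (Record r : chunk) {
                    processRecord(r);
                }
                return chunk.size();
            }));
        }

        int done = 0;
        for (Future<Integer> f : results) {
            done += f.get();                             // waits, and surfaces any task error
        }
        pool.shutdown();
        System.out.println("processed " + done + " records");
    }

    // Placeholders for your own record type, chunking, and processing logic:
    static List<List<Record>> chunks() { return new ArrayList<>(); }
    static void processRecord(Record r) { }
    static class Record { }
}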
Processing on multiple machines - cluster
Submit processing jobs to cluster processing technologies like Spark, Hadoop, or Google BigQuery.
Processing on multiple machines - queue
You can submit your records to any queue system (Kafka, RabbitMQ, ActiveMQ, etc.), then have multiple machines consume items from the queue. You will be able to add and remove consumers at any time. This approach is fine if you do not need a single place with the processing results.
A parallel stream could also be used here to process your data in parallel. By default, a parallel stream uses the common fork/join pool, whose parallelism is one less than the number of processor cores.
Wide and useful information about that can be found here: https://stackoverflow.com/a/21172732/8184084
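A minimal sketch of that approach (records and processRecord again stand in for your own data and function):

// Uses the common fork/join pool (parallelism = cores - 1 by default).
records.parallelStream().forEach(r -> processRecord(r));

// To control the parallelism explicitly, run the stream inside your own pool:
ForkJoinPool pool = new ForkJoinPool(8);
pool.submit(() -> records.parallelStream().forEach(r -> processRecord(r))).join();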
I have 4 types of Kafka topics, with 65 topics of each type. The goal is to do some simple windowed aggregation on the data and write it to a DB.
The topologies will look something like:
kafka -> window -> reduce -> db write
Somewhere in this mix I want / need to do a union - or possibly several (depending on how many topics are combined each time).
The data flow in the topics ranges from 10K to >200K messages / min.
I have a four-node Flink cluster with 30 cores per node. How do I build these topologies to spread the load out?
I am writing this answer assuming that each of the 65 topics of the same type contains the same type of data.
The simplest solution to this problem would be to change the Kafka setup so that you have 4 topics with 65 partitions each. Then you have 4 data sources in the program, each with high parallelism (65), and this distributes across the cluster naturally.
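A sketch of what that reshaped job could look like (topic name, schema, and connection properties are assumptions, and the Kafka connector class name has varied across Flink versions):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092"); // assumed address
props.setProperty("group.id", "window-agg");

// One source per topic type; 65 source subtasks line up with the 65 partitions.
DataStream<String> typeA = env
        .addSource(new FlinkKafkaConsumer<>("type-a", new SimpleStringSchema(), props))
        .setParallelism(65);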
If it is not possible to change the setup, I see two things you can do:
One possible solution is to create a modified version of the FlinkKafkaConsumer where one source may consume multiple topics (rather than only multiple partitions of one topic). With that change, it would work pretty much as if you were using many partitions, rather than many topics. If you want to go with this solution, I would ping the mailing list to get some support for doing this. It would be a valuable addition to the Flink code anyways.
You can give each source a separate resource group, which will give it a dedicated slot. You can do this via "env.addSource(new FlinkKafkaConsumer(...)).startNewResourceGroup();". But note that you would be trying to execute 260 distinct sources on a cluster with 120 cores (and thus probably 120 task slots), so you would need to increase the number of slots to hold all the tasks.
I think the first option is the preferable option.
I'm benchmarking Kafka 0.8.1.1 by streaming messages of 1 KB in size on EC2 servers.
I installed ZooKeeper on two m3.xlarge servers with the following configuration:
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.server1=zoo1:2888:3888
server.server2=zoo2:2888:3888
Second, I installed a single Kafka server on an i2.2xlarge machine with 32 GB RAM and 6 additional SSD drives, each mounted as /mnt/a, /mnt/b, etc. On the server I have one broker and a single topic on port 9092, with 8 partitions and a replication factor of 1:
broker.id=1
port=9092
num.network.threads=4
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/mnt/a/dfs-data/kafka-logs,/mnt/b/dfs-data/kafka-logs,/mnt/c/dfs-data/kafka-logs,/mnt/d/dfs-data/kafka-logs,/mnt/e/dfs-data/kafka-logs,/mnt/f/dfs-data/kafka-logs
num.partitions=8
log.retention.hours=168
log.segment.bytes=536870912
log.cleanup.interval.mins=1
zookeeper.connect=172.31.26.252:2181,172.31.26.253:2181
zookeeper.connection.timeout.ms=1000000
kafka.metrics.polling.interval.secs=5
kafka.metrics.reporters=kafka.metrics.KafkaCSVMetricsReporter
kafka.csv.metrics.dir=/tmp/kafka_metrics
kafka.csv.metrics.reporter.enabled=false
replica.lag.max.messages=10000000
All my tests are done from another instance, and the latency between instances is less than 1 ms.
I wrote a producer/consumer Java client using one producer thread and 8 consumer threads, where the partition key is a random number from 0 to 7.
I serialized each message as JSON by providing a custom encoder.
My producer and consumer properties are the following:
metadata.broker.list = 172.31.47.136:9092
topic = mytopic
group.id = mytestgroup
zookeeper.connect = 172.31.26.252:2181,172.31.26.253:2181
serializer.class = com.vanilla.kafka.JsonEncoder
key.serializer.class = kafka.serializer.StringEncoder
producer.type=async
queue.enqueue.timeout.ms = -1
batch.num.messages=200
compression.codec=0
zookeeper.session.timeout.ms=400
zookeeper.sync.time.ms=200
auto.commit.interval.ms=1000
number.messages = 100000
Now when I'm sending 100k messages, I'm getting about 10k messages per second and about 1 ms latency.
That means I have 10 megabytes per second, which equals 80 Mbit/s. This is not bad, but I would expect better performance from instances located in the same zone.
Am I missing something in the configuration?
I suggest you break down the problem. How fast is it without JSON encoding? How fast is one node without replication vs. with replication? Build a picture of how fast each component should be.
I also suggest you test bare-metal machines to see how they compare, as they can be significantly faster (unless you are CPU bound, in which case they can be much the same).
According to this benchmark, you should be able to get 50 MB/s from one node: http://kafka.apache.org/07/performance.html
I would expect you should be able to get close to saturating your 1 Gbit links (I assume that's what you have).
Disclaimer: I work on Chronicle Queue, which is quite a bit faster: http://java.dzone.com/articles/kafra-benchmark-chronicle
If it makes sense for your application, you could get better performance by streaming byte arrays instead of JSON objects, and converting the byte arrays to JSON objects in the last step of your pipeline.
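For example, with a byte-array value serializer the producer hot path does no JSON work at all. A sketch using the newer Kafka client API rather than the 0.8 producer from the question; the broker address and topic are from your config, and record.toBytes() and key are placeholders:

Properties props = new Properties();
props.put("bootstrap.servers", "172.31.47.136:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
    byte[] payload = record.toBytes();   // raw bytes; JSON decoding happens downstream
    producer.send(new ProducerRecord<>("mytopic", key, payload));
}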
You might also get better performance if each consumer thread consistently reads from the same topic partition. I think Kafka only allows one consumer per partition at a time, so depending on how you're randomly selecting partitions, it's possible that a consumer would be briefly blocked if it tries to read from the same partition as another consumer thread.
It's also possible you might get better performance using fewer consumer threads or different Kafka batch sizes. I use parameterized JUnit tests to help find the best settings for things like the number of threads per consumer and batch sizes. Here are some examples I wrote which illustrate that concept:
http://www.bigendiandata.com/2016-10-02-Junit-Examples-for-Kafka/
https://github.com/iandow/kafka_junit_tests
I hope that helps.
I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them. After processing, it jumps to the next 10,000, and so on. To speed things up, I have 25 threads doing the same task (reading + processing), and I have 4 physical servers running the same Java program. So effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is a SQL procedure which does the work of grabbing the next 10,000 records and marking them as being processed by a particular thread. However, I have noticed that the threads seem to wait for some time when invoking the procedure and getting a response back. What other strategy can I use to speed up this data selection?
My database server is MySQL and my programming language is Java.
The idiomatic way of handling such a scenario is the producer-consumer design pattern, and the idiomatic way of implementing it in Java land is by using JMS.
Essentially you need one master server reading records and pushing them to a JMS queue. Then you have an arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you implement this in detail: do you want to send a message with the whole record or only the ID? All 10,000 records in one message or one record per message?
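A minimal sketch of the master side using the JMS 2.0 API, going with one message per record ID (the queue name and connection factory lookup are assumptions):

import java.util.List;
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.JMSProducer;
import javax.jms.Queue;

public class RecordIdDispatcher {
    public void dispatch(ConnectionFactory factory, List<Long> ids) {
        try (JMSContext ctx = factory.createContext()) {
            Queue queue = ctx.createQueue("recordIds");   // assumed queue name
            JMSProducer producer = ctx.createProducer();
            for (Long id : ids) {
                producer.send(queue, id.toString());      // consumers fetch the row by ID
            }
        }
    }
}

Each consumer then reads IDs from the same queue, loads the row, and processes it; adding capacity is just starting more consumer processes.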
Another approach is map-reduce; check out Hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I would suspect that you are heavily database-IO bound with this scheme. If you are trying to increase the performance of your system, I would suggest partitioning your data across multiple database servers if you can. MySQL has some partitioning modes that I have no experience with. If you partition the data yourself, it can add a lot of complexity to the database schema, and you'd have to add some sort of routing layer that uses a hash mechanism to divide your records across the partitions, as sketched below. But I suspect you'd get a significant speed increase, and your threads would not be waiting nearly as much.
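The routing layer can be as simple as hashing the primary key to pick a shard (a sketch; NUM_SHARDS and the list of per-shard DataSources are assumptions):

// Maps every record ID to one of NUM_SHARDS database servers, uniformly.
int shard = Math.floorMod(Long.hashCode(recordId), NUM_SHARDS);
DataSource target = shardDataSources.get(shard);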
If you cannot partition your data, then moving your database to an SSD would be a huge win, I suspect -- anything to increase the IO rates on those partitions. Stay away from RAID 5 because of its inherent performance issues. If you need a reliable file system, then mirroring or RAID 10 would have much better performance, with RAID 50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database IO bus. This depends on a number of factors, including concurrent queries and database layout. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.