I have a billion records that are unsorted and unrelated to each other, and I have to call a function processRecord on each record using Java.
The easy way to do this is with a for loop, but it is taking a lot of time.
The other way I could think of is using multithreading, but the question is how to divide the dataset of records efficiently, and among how many threads?
Is there an efficient way to process this large dataset?
Measure
Before deciding which implementation path to take, measure how long it takes to process a single item. Based on that you can choose the size of the work chunk submitted to a thread pool, queue, or cluster. Very small work chunks increase coordination overhead; very large chunks take a long time to process, so you get less granular progress information.
Single-machine processing is simpler to implement, troubleshoot, maintain and reason about.
Processing on single machine
Use java.util.concurrent.ExecutorService
Submit each piece of work using the submit(Callable<T> task) method: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#submit-java.util.concurrent.Callable-
Create an instance of ExecutorService using java.util.concurrent.Executors.newFixedThreadPool(int nThreads). Choose a reasonable value for nThreads; the number of CPU cores is a reasonable starting value. You may use more threads if the processing involves blocking IO calls (database, HTTP).
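A minimal sketch of this setup; RecordProcessor, the dummy processRecord and the in-memory record list below are placeholders for illustration, not your actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class RecordProcessor {

    // Stand-in for the real processRecord logic.
    static String processRecord(String record) {
        return record.toUpperCase();
    }

    public static void main(String[] args) throws Exception {
        // Number of CPU cores is a reasonable starting value for nThreads.
        int nThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);

        List<String> records = Arrays.asList("a", "b", "c"); // stand-in for the real data source

        // Submit one Callable per record (for a billion records, submit chunks of records instead).
        List<Future<String>> futures = new ArrayList<>();
        for (String record : records) {
            futures.add(pool.submit(() -> processRecord(record)));
        }

        // get() blocks until the corresponding task has finished.
        for (Future<String> f : futures) {
            System.out.println(f.get());
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

For a billion records you would batch many records into each submitted task, per the measurement advice above, to keep coordination overhead down.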
Processing on multiple machines - cluster
Submit processing jobs to cluster processing technologies like Spark, Hadoop, or Google BigQuery.
Processing on multiple machines - queue
You can submit your records to any queue system (Kafka, RabbitMQ, ActiveMQ, etc.). Then have multiple machines consume items from the queue. You will be able to add or remove consumers at any time. This approach is fine if you do not need a single place to collect the processing results.
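For illustration, a consumer loop with Kafka might look roughly like this (the broker address, topic and group id are made up, and it assumes a reasonably recent Kafka client where poll takes a Duration):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QueueRecordConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "record-processors");          // all consumers share one group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("records")); // assumed topic name
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : batch) {
                    processRecord(record.value()); // your existing per-record logic
                }
            }
        }
    }

    static void processRecord(String value) {
        // placeholder for the real processing
    }
}
```

Scaling out is then just a matter of starting more instances of this consumer with the same group id.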
A parallel stream could also be used here to perform parallel processing of your data. By default, parallel streams use the common ForkJoinPool, which has one thread fewer than the number of available processors.
A lot of useful information about this can be found here: https://stackoverflow.com/a/21172732/8184084
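A minimal sketch, assuming processRecord is thread-safe and has no shared mutable state:

```java
import java.util.Arrays;
import java.util.List;

public class ParallelStreamExample {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c"); // stand-in for the real data

        // Each record is processed by the common ForkJoinPool.
        records.parallelStream()
               .forEach(ParallelStreamExample::processRecord);
    }

    static void processRecord(String record) {
        // placeholder for the real per-record work
        System.out.println(Thread.currentThread().getName() + " -> " + record);
    }
}
```

The size of the common pool can be tuned with the java.util.concurrent.ForkJoinPool.common.parallelism system property.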
Related
In one of my use cases I need to fetch data from multiple nodes. Each node maintains a range (partition) of data. The goal is to read the data as fast as possible. The constraint is that the cardinality of a partition is not known beforehand. Using a work-sharing approach, I could split the partitions into sub-partitions and fetch the data in parallel. One drawback with this approach is that one thread could fetch a lot of data and take more time while another thread finishes faster. The other approach is to use work stealing, where we break the partitions into much smaller ranges and use ForkJoinPool. The drawback with this approach is that, if the partition is sparse, we could make many round trips to the server only to find there is no data for a sub-partition.
The question I have is: if I want to use ForkJoinPool, where the tasks can do some I/O operations, how do I do that? From the documentation of the FJ pool and from the best practices I have read so far, it appears that the FJ pool is not good for blocking IO operations. If I want to use non-blocking IO, how can I do that?
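For concreteness, here is a rough sketch of the work-stealing split I am considering; fetchRange is a made-up placeholder for the actual call to a node (and, as noted, it blocks, which is exactly my concern with the FJ pool's worker threads):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Splits [start, end) into smaller sub-ranges, then "fetches" each small range.
class RangeFetchTask extends RecursiveTask<List<String>> {
    private static final long THRESHOLD = 1_000;
    private final long start, end;

    RangeFetchTask(long start, long end) {
        this.start = start;
        this.end = end;
    }

    @Override
    protected List<String> compute() {
        if (end - start <= THRESHOLD) {
            return fetchRange(start, end); // blocking I/O happens here
        }
        long mid = (start + end) / 2;
        RangeFetchTask left = new RangeFetchTask(start, mid);
        RangeFetchTask right = new RangeFetchTask(mid, end);
        left.fork();                                   // left half may be stolen by other workers
        List<String> result = new ArrayList<>(right.compute());
        result.addAll(left.join());
        return result;
    }

    // Placeholder for the real (blocking) call to the remote node.
    private List<String> fetchRange(long from, long to) {
        return new ArrayList<>();
    }
}
```

It would be invoked with something like new ForkJoinPool().invoke(new RangeFetchTask(0, partitionSize)).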
I've written a web server on spring-boot-2.0 which uses netty under the hood.
This application uses Schedulers.elastic() threads.
When it started, about 100 elastic threads were created. These threads were rarely used and we had little load. But after a working day, the number of threads in the elastic pool had increased to 1300. And now execution runs on the elastic-1XXX, elastic-12XX threads (the numbers in the names are above 100 and even 900).
Elastic, as I understand it, uses a cachedThreadPool under the hood.
Why have new elastic threads been created and why has task switched to new threads?
What are the criteria for adding new threads?
And why haven't the old threads (elastic-XX, elastic-1XX) been shut down?
Without more information about the type of workload, the maximum concurrency, burst and average task duration, it’s really hard to tell if there’s a problem here.
By definition the elastic scheduler is creating an unbounded number of threads, as long as new tasks are added to the queue.
If the workload is bursty, with high concurrency at regular times, then it's not unexpected to find a large number of threads. You could leverage the newElastic variants to reduce the TTL (the default is 60s).
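For example, something along these lines, assuming the newElastic(String name, int ttlSeconds) overload from Reactor 3 (the name and TTL here are arbitrary):

```java
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class ElasticTtlExample {
    public static void main(String[] args) {
        // Elastic scheduler whose idle threads are evicted after 30s instead of the default 60s.
        Scheduler ioScheduler = Schedulers.newElastic("io-elastic", 30);

        // Use it only for blocking / latency-bound work, e.g. mono.subscribeOn(ioScheduler).
        ioScheduler.dispose();
    }
}
```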
Again, without more information it's hard to tell, but your workload might not fit this scheduler. If your workload is CPU-bound, the parallel scheduler is a better fit. The elastic one is tailored for IO/latency-bound tasks.
The problem was that I was using Schedulers.elastic() even though there were no blocking operations that needed it. When I removed elastic(), my service started working correctly (without elastic's threads).
I need to migrate a Java program to Apache Spark. The current Java program heavily utilizes the functionality provided by java.util.concurrent and runs on a single machine. Since the initialization of a worker (Callable) is expensive, the workers are reused again and again - i.e. a worker reinserts itself into the pool once it terminates and has returned its result.
More precisely:
The current implementation works on small data sets in the range of 10E06 entries/few GBs.
The data contains entries that can be processed independently. That is, one could fire up one worker per task and submit it to the java thread pool.
However, setting up a worker for processing an entry involves loading more data, building graphs... altogether some GB of memory AND CPU time in the range of minutes.
Some data can indeed be shared among the workers, e.g. some look-up tables, but it does not need to be. Some data is private to the worker and thus not shared. The worker may change the data while processing the entry and only later reset it in a fast manner, e.g. caches specific to the entry currently processed. Thus, the worker can reinsert itself into the pool and start working on the next entry without going through the expensive initialization.
Runtime per worker and entry is in the range of seconds.
The workers hand back their results via an ExecutorCompletionService, i.e. the results are later retrieved by calling pool.take().get() in a central part of the program.
Getting to know Apache Spark I find most examples just use standard transformations and actions. I also find examples that add their own functions to the DAG by extending the API. Still, those examples all stick to simple lightweight calculations and come without initialization cost.
I now wonder what is the best approach to design a Spark application that reuses some kind of "heavy worker". The executors seem to be the only persistent entities that could possibly hold a pool of such workers. However, being new to the world of Spark I most likely miss some point...
edited 20161007
I found an answer that points to a (possible) solution using Functions. So the question is, can I
- split my data into partitions according to the number of executors,
- have each executor get exactly one partition to work on,
- have my Function (called setup in the linked solution) create a thread pool and reuse the workers,
- have a separate combine function later merge the results?
Your current architecture is a monolithic, multi-threaded architecture with shared state between the threads. Given that the size of your dataset is relatively modest for modern hardware you can parallelize it quite easily with Spark, where you will replace the threads with executors in the cluster's nodes.
From your question I understand that your two main concerns are whether Spark can handle complex parallel computations and how to share the necessary bits of state in a distributed environment.
Complicated business logic: Regarding the first part, you can run arbitrarily complicated business logic in the Spark Executors, which are the equivalent of the worker threads in your current architecture.
This blog post from cloudera explains well the concept along with other important concepts of the execution model:
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
One aspect you will need to pay attention to, though, is the configuration of your Spark job, in order to avoid timeouts due to executors taking too long to finish, which may be expected for an application with complicated business logic like yours.
Refer to the excellent page from DataBricks for more details, and more specifically to the execution behavior:
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
Shared state: You can share complicated data structures like graphs and application configuration in Spark among the nodes. One approach which works well is broadcast variables, where a copy of the state is distributed to every node. Below are some very nice explanations of the concept:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
http://g-chi.github.io/2015/10/21/Spark-why-use-broadcast-variables/
This will shave latency off your application, while ensuring data locality.
The processing of your data can be performed on a per-partition basis (more here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html), with the results aggregated on the driver or with the use of accumulators (more here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-accumulators.html). If the resulting data are complicated, the partition approach may work better and also gives you more fine-grained control over your application's execution.
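A rough sketch of the per-partition approach with the Java API, where the shared look-up tables are broadcast once per node and the expensive worker is built once per partition; everything except the Spark calls themselves (ExpensiveWorker, the dummy data) is a placeholder, and it assumes Spark 2.x where mapPartitions expects an Iterator to be returned:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class PartitionProcessing {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "partition-processing");

        // Shared, read-only state (e.g. look-up tables) sent once to each node.
        Map<String, String> lookup = new HashMap<>();
        Broadcast<Map<String, String>> lookupBc = sc.broadcast(lookup);

        JavaRDD<String> entries = sc.parallelize(Arrays.asList("e1", "e2", "e3"), 4);

        // One expensive initialization per partition, reused for all entries in it.
        JavaRDD<String> results = entries.mapPartitions((Iterator<String> it) -> {
            ExpensiveWorker worker = new ExpensiveWorker(lookupBc.value()); // heavy setup once
            List<String> out = new ArrayList<>();
            while (it.hasNext()) {
                out.add(worker.process(it.next()));
            }
            return out.iterator();
        });

        results.collect().forEach(System.out::println);
        sc.close();
    }

    // Placeholder for the heavy worker described in the question.
    static class ExpensiveWorker {
        ExpensiveWorker(Map<String, String> lookup) { /* expensive setup */ }
        String process(String entry) { return entry + "-done"; }
    }
}
```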
Regarding the hardware resource requirements, it seems that your application needs a few gigabytes for the shared state, which will need to stay in memory, and additionally a few more gigabytes for the data on every node. You can set the persistence model to MEMORY_AND_DISK in order to ensure that you won't run out of memory; more details at
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
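Continuing the sketch above, persistence is a one-liner on the RDD (MEMORY_AND_DISK() is the Java accessor for the storage level):

```java
// Keep partitions in memory, spilling to disk rather than failing when memory is tight.
results.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK());
```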
I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them. After processing it jumps to the next 10,000, and so on. In order to speed things up, I have 25 threads doing the same task (reading + processing), and then I have 4 physical servers running the same Java program. So effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is to have an SQL procedure which does the work of grabbing the next 10,000 records and marking them as being processed by a particular thread. However, I have noticed that the threads seem to wait for some time when invoking the procedure and getting a response back. What other strategy can I use to speed up this process of data selection?
My database server is MySQL and the programming language is Java.
The idiomatic way of handling such a scenario is the producer-consumer design pattern. And the idiomatic way of implementing it in Java land is by using JMS.
Essentially you need one master server reading records and pushing them to a JMS queue. Then you'll have an arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you want to implement this in detail: do you want to send a message with the whole record or only the ID? All 10,000 records in one message or one record per message?
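A bare-bones consumer sketch with the javax.jms API; the broker URL and queue name are invented, the "send only the ID" variant is shown, and the ConnectionFactory comes from whichever broker you pick (ActiveMQ here, as one example):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsRecordConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616"); // assumed URL
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("records.to.process"); // assumed queue name
        MessageConsumer consumer = session.createConsumer(queue);

        while (true) {
            // Competing consumers: each message is delivered to exactly one consumer on the queue.
            TextMessage message = (TextMessage) consumer.receive();
            long recordId = Long.parseLong(message.getText());
            process(recordId);
        }
    }

    static void process(long recordId) {
        // read the record from the database by ID and process it
    }
}
```

You can then run as many instances of this consumer as you like, on as many machines as you like.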
Another approach is map-reduce; check out Hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I would suspect that you are heavily database-IO bound with this scheme. If you are trying to increase the performance of your system, I would suggest partitioning your data across multiple database servers if you can do so. MySQL has some partitioning modes that I have no experience with. If you do the partitioning yourself, it can add a lot of complexity to the database schema, and you'd have to add some sort of routing layer using a hash mechanism to divide your records across the multiple partitions somehow. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
If you cannot partition your data, then moving your database to an SSD would be a huge win, I suspect -- anything to increase the IO rates on those partitions. Stay away from RAID5 because of its inherent performance issues. If you need a reliable file system, then mirroring or RAID10 would have much better performance, with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database IO bus. This depends on a number of factors including concurrent queries, database layout, etc. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.
Our company is running a Java application (on a single CPU Windows server) to read data from a TCP/IP socket and check for specific criteria (using regular expressions) and if a match is found, then store the data in a MySQL database. The data is huge and is read at a rate of 800 records/second and about 70% of the records will be matching records, so there is a lot of database writes involved. The program is using a LinkedBlockingQueue to handle the data. The producer class just reads the record and puts it into the queue, and a consumer class removes from the queue and does the processing.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using a single CPU)? I am looking for suggestions on how to speed things up (without changing hardware).
Any suggestions would be really appreciated. Thanks
Simple: Try it and see.
This is one of those questions where you can argue several points on either side. But it sounds like you already have most of the infrastructure set up. Just create another consumer thread and see if that helps.
But the first question you need to ask yourself:
What is better?
How do you measure better?
Answer those two questions then try it.
Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multithreaded then look into the Executor concept in the concurrent utilities (There are plenty to choose from in the Executors helper class), as this will handle all the tedious details with threading that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here, indexing and RAM help a lot, but that is a different question.
It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.
First of all:
It is wise to build your application using the Java 5 concurrency API (java.util.concurrent).
If your application is built around the ExecutorService, it is fairly easy to change the number of threads used. For example, you could create a thread pool where the number of threads is specified by configuration. So if you ever want to change the number of threads, you only have to change some properties.
About your question:
- About the reading of your socket: as far as I know, it is not useful (if possible at all) to have two threads read data from one socket. Just use one thread that reads the socket, and do as few actions in that thread as possible (for example: read socket - put data in queue - read socket - etc.).
- About the consuming of the queue: it is wise to construct this part as pointed out above; that way it is easy to change the number of consuming threads.
- Note: you cannot really predict what is better; there might be another part that is the bottleneck, etcetera. Only monitoring / profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different numbers of threads.
So in short:
- Producer part: one thread that only reads from socket and puts in queue
- Consumer part: created around the ExecutorService so it is easy to adapt the number of consuming threads
Then use profiling to find the bottlenecks, and use A/B testing to find the optimal number of consuming threads for your system.
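A skeletal version of that structure; the socket endpoint, the "consumer.threads" property name and the process stub are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class SocketPipeline {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100_000);

        // Consumer part: thread count comes from configuration so it is easy to tune.
        int consumerThreads = Integer.getInteger("consumer.threads", 4);
        ExecutorService consumers = Executors.newFixedThreadPool(consumerThreads);
        for (int i = 0; i < consumerThreads; i++) {
            consumers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    String record = queue.take();   // blocks until a record is available
                    process(record);                // regex check + database write
                }
                return null;
            });
        }

        // Producer part: a single thread that only reads the socket and enqueues records.
        try (Socket socket = new Socket("data-source-host", 9000); // placeholder endpoint
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                queue.put(line); // blocks if consumers fall behind, applying backpressure
            }
        }
        consumers.shutdownNow();
    }

    static void process(String record) {
        // placeholder for the regex matching and MySQL insert
    }
}
```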
As an update on my earlier question:
We did run some comparison tests between a single consumer thread and multiple threads (adding 5, 10, 15 and so on), monitoring the queue size of yet-to-be-processed records. The difference was minimal and, what's more, the queue size was getting slightly bigger after the number of threads crossed 25 (compared to running 5 threads). This leads me to the conclusion that the overhead of maintaining the threads was greater than the processing benefit gained. Maybe this is particular to our scenario, but I'm just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using multi-row INSERT statements in MySQL instead of single inserts. If we did not have that to start with, we could not have handled this load.
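For anyone interested, the multi-insert approach looks roughly like this with JDBC batching (table and column names are invented; MySQL's Connector/J can additionally rewrite the batch into a single multi-row INSERT when rewriteBatchedStatements=true is set on the connection URL):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    static void insertBatch(List<String> matchedRecords) throws Exception {
        String url = "jdbc:mysql://dbhost/mydb?rewriteBatchedStatements=true"; // assumed URL
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO matched_records (payload) VALUES (?)")) {
            for (String record : matchedRecords) {
                ps.setString(1, record);
                ps.addBatch();              // queue the row instead of sending it immediately
            }
            ps.executeBatch();              // one round trip for the whole batch
        }
    }
}
```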
End result: I am still not convinced how multi-threading will benefit processing time. Maybe it has other benefits, but I am looking only at the processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again thanks for all your input.
In your scenario where a) the processing is minimal, b) there is only one CPU, and c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front-end and back-end threads are I/O bound, with minimal processing in the middle. That's why you don't see much improvement.
What you can do is try three stages: the 1st is a single thread pulling data from the socket, the 2nd is a thread pool that does the processing, and the 3rd is a single thread that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporary growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.