I have two JSON files, each containing over 1 million objects.
I have to compare each object across the two files and record it if there is a difference. (Each object is identified by a key, and that key is what gets written to the output file.)
Currently I am using an ExecutorService to run the comparisons on multiple threads and writing mismatches to a shared ConcurrentHashMap.
The map is dumped to a file at the end.
I would like to update the file periodically rather than waiting for the entire execution to complete.
If I want to write to the file once every 2 minutes, how can I achieve this?
I know this could be done using another thread, but I could not work out how exactly to implement it alongside the ExecutorService.
ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
executor.scheduleAtFixedRate(() -> {
// some code to execute every 2 minutes
}, 0, 2, TimeUnit.MINUTES);
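One way the scheduled task could drain the shared map, as a minimal sketch rather than a definitive implementation. The class name MismatchFlusher, the Boolean map value and the one-key-per-line output format are assumptions made for illustration; the comparison threads only need to put keys into the map.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: collects mismatching keys and appends them to a file in batches.
class MismatchFlusher {
    // Comparison threads put mismatching keys here (the value is unused in this sketch).
    final Map<String, Boolean> mismatches = new ConcurrentHashMap<>();
    private final Path out;

    MismatchFlusher(Path out) {
        this.out = out;
    }

    // Appends the keys currently in the map to the file and removes exactly those keys,
    // so keys added while the flush is running are picked up by the next flush.
    synchronized void flush() {
        List<String> batch = new ArrayList<>();
        for (String key : mismatches.keySet()) {
            if (mismatches.remove(key) != null) {
                batch.add(key);
            }
        }
        if (batch.isEmpty()) {
            return;
        }
        try {
            Files.write(out, batch, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // Note: an exception thrown here suppresses all later scheduled runs.
            throw new UncheckedIOException(e);
        }
    }

    // Runs flush() on its own thread at a fixed rate, e.g. every 2 minutes.
    ScheduledExecutorService startPeriodicFlush(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::flush, period, period, unit);
        return scheduler;
    }
}
```

After the comparison pool has terminated, the scheduler would be shut down and flush() called one last time so that the final partial batch is not lost.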
I want to read records (1000k) from one table and push them to some service.
So I have grouped 200 records (based on the service's limitations) into one event, used the executor framework, and created 10 executors. 10 events will be processed in parallel (i.e. 10*200 records).
Now I want to maintain the status of these events, like statistics on how many were processed successfully and how many failed.
So I was thinking of
Approach 1:
Before starting the execution,
writing each event id + record id with status
event1 + record1 -> start
and on completion
event1 + record1-> end
Later I would check how many entries have both start and end in the file and how many do not have end.
Approach 2 :
Write all record ids in one file with status pending and
writing all successful records in another file
And then check which records are missing from the successful file by using a pivot
Is there a better way to maintain the status of the records?
In my view, if you want to process items in parallel, it is better to create one log file per batch of records. Why? Because a single file is a bottleneck for multithreading: you need to lock the file to prevent a race condition. If you lock the file, each thread has to wait until the log file is released, and that waiting nullifies the benefit of multithreaded processing.
So one batch should have one log file.
Create an array and start threads with the passed id so they can write to the array cell by their id.
The main thread will read this array and print it.
You can use ReadWriteLock (threads will hold the read lock to write and the main thread will hold the write lock while reading the entire array).
You can store anything in this array, it can be very useful.
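A minimal sketch of that array idea, assuming one cell per event and a hypothetical three-state Status enum (neither name comes from the question). Workers take the read lock because each one writes only its own cell; the main thread takes the write lock to get a consistent copy.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical status board: one cell per event, indexed by event id.
class StatusBoard {
    enum Status { PENDING, SUCCESS, FAILED }

    private final Status[] cells;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    StatusBoard(int events) {
        cells = new Status[events];
        Arrays.fill(cells, Status.PENDING);
    }

    // Called by the worker that owns eventId. Many workers may hold the read lock
    // at the same time, which is safe because each one writes only its own cell.
    void report(int eventId, Status status) {
        lock.readLock().lock();
        try {
            cells[eventId] = status;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called by the main thread. The write lock keeps all workers out while copying.
    Status[] snapshot() {
        lock.writeLock().lock();
        try {
            return cells.clone();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Counts the events currently in the given state.
    long count(Status wanted) {
        return Arrays.stream(snapshot()).filter(s -> s == wanted).count();
    }
}
```

The main thread could call count(Status.FAILED) at the end, or periodically, to get the statistics without any file at all.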
A Java method that returns a String (a file name) consumes memory internally for a few image operations that cannot be optimized further; let's say it consumes 20 MB of heap space per execution.
This method runs as part of ProcessingImageData and needs to return the file name as output to the REST web service caller.
When some n threads run it in parallel, it fails with an OutOfMemoryError.
To avoid the OutOfMemoryError (heap space), can you please provide suggestions, such as
allowing only a fixed number of threads to execute this method?
public String deleteImageAndProvideFile(String inputImage, int deletePageNum){
// process image
//find page and delete
//merge pages to new file
// return new file Name
}
If you have a number of tasks but you want to limit the number of threads performing them, use an ExecutorService with a bounded thread pool.
The Executors class has a helper method for creating what you need:
newFixedThreadPool(int nosThreads) (javadoc).
Adjust the nosThreads parameter according to how much memory you want to use.
The ExecutorService documentation explains how to use the API (javadoc). You submit tasks and get Future objects that can be used to wait until a given task is finished.
In your use case, one of your web requests might submit a task to a "global" executor service and then wait for the task to complete. Alternatively, you could design your system so that the processing is done asynchronously with respect to the web requests; e.g. submit a task in one request, and then make another request to see if it has been done yet.
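A minimal sketch of the first variant (the request thread submits and waits). The class name BoundedImageService, the pool size of 4 and the stand-in method body are assumptions for illustration; the real image code would go where the stub is.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedImageService {
    // Assumption: 4 concurrent executions at roughly 20 MB each fit in the heap.
    private static final int MAX_CONCURRENT = 4;
    private static final ExecutorService POOL = Executors.newFixedThreadPool(MAX_CONCURRENT);

    // Stand-in for the memory-hungry method from the question.
    static String deleteImageAndProvideFile(String inputImage, int deletePageNum) {
        return inputImage + "-without-page-" + deletePageNum;
    }

    // Called from a web request thread. However many requests arrive, at most
    // MAX_CONCURRENT of them run the image code at once; the rest wait in the queue.
    static String handleRequest(String inputImage, int deletePageNum) {
        Future<String> result =
                POOL.submit(() -> deleteImageAndProvideFile(inputImage, deletePageNum));
        try {
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("image processing failed", e.getCause());
        }
    }

    // Call once when the application stops so the pool threads can exit.
    static void shutdown() {
        POOL.shutdown();
    }
}
```

The queue of a fixed thread pool is unbounded by default, so waiting requests are limited only by whatever limits the web container applies.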
I'm running a program where I download large files, parse them and then write the data I have extracted from the file into another file.
The files take a long time to download and parse, but the write task only takes a minute or so on average. The solution I threw together was to have three fixed thread pools of three threads each.
ExecutorService downloadExecutor = Executors.newFixedThreadPool(3);
ExecutorService parseExecutor = Executors.newFixedThreadPool(3);
ExecutorService writeExecutor = Executors.newFixedThreadPool(3);
A thread in the download pool downloads the file, then submits a new task to the parser thread pool, with the filename as a parameter. This is done within the thread itself. The download thread then gets to work downloading another file from a list of file URLs.
Once the parser thread has finished parsing the data I want from the file, it submits a new task containing the data to the write thread pool, where it is written to a .csv file.
My question is whether there is a more elegant solution to this. I have not really done much complex threading. Since I have a lot of files to download and parse, I do not want any of the threads to be idle at any time. The idea, again, is that since parsing a file can take a while, I might as well have separate threads devoted to downloading those files first.
Why not use only one thread pool? Download, parse and save must wait for each other anyway, so the best separation of tasks would be to use one thread per file.
This is not bad practice; many developers write similar code. But there are some things you need to keep in mind.
Number one: you can't expect performance to increase just because you have more threads. There is an optimum number of threads, which depends on the number of CPUs.
Number two: you must decide how exceptions are handled.
Number three: you must make sure you can shut down all the thread pools if you need to stop the application.
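Points two and three could look roughly like this. It is only a sketch: the class and method names are made up, and the timeout is whatever suits the application. An exception thrown inside a submitted task is not lost; it is stored in the Future and rethrown, wrapped in an ExecutionException, when get() is called.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helpers for inspecting task failures and stopping a pool.
class PoolLifecycle {
    // Reports whether a submitted task succeeded or which exception it threw.
    static String outcome(Future<?> future) {
        try {
            future.get();
            return "ok";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    // Two-phase shutdown: first let queued tasks finish, then interrupt what is left.
    // Returns true if the pool terminated within the allowed time.
    static boolean shutdownGracefully(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // stop accepting new tasks
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow(); // interrupt running tasks and drop queued ones
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With three pools, shutdownGracefully would be called in pipeline order (download, parse, write) so that work already in flight can still drain into the later stages.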
So your problem has two aspects:
Compute bound
IO bound
Reading and writing to the file is IO bound. Async IO is the best for IO bound tasks. Java has AsynchronousFileChannel that allows you to read and write files without worrying about thread pools where continuation is achieved through completion handlers.
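Example:
// Needs java.nio.ByteBuffer, java.nio.channels.AsynchronousFileChannel
// and java.nio.channels.CompletionHandler.
AsynchronousFileChannel ch = AsynchronousFileChannel.open(path);
final ByteBuffer buf = ByteBuffer.allocate(1024);
// Read from file position 0; the third argument is an attachment handed to the handler.
ch.read(buf, 0, 0,
    new CompletionHandler<Integer, Integer>() {
        // result is the number of bytes read
        public void completed(Integer result, Integer attachment) {
            ..
        }
        public void failed(Throwable exc, Integer attachment) {
            ..
        }
    }
);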
You do the same for writes; you just write to the channel:
ch.write(...
Now for parsing the file: that's a compute-bound task, and you should get your CPU cores hot for that. You can give it a thread pool with as many threads as you have cores.
executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Now, what you need to remember is: you need to test your code, and testing concurrent code is hard. If you can't prove its correctness, don't do it.
I read a huge file (almost 5 million lines). Each line contains a date and a request, and I must parse the requests between specific dates. I use a BufferedReader to read the file up to the start date and then start parsing lines. Can I use threads for parsing the lines, because it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
Firstly, 5 million lines of 1000 characters is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map based on the date. So every date is a key in the map and points to a list of line numbers which contain the requests. You can then go directly to the relevant line numbers.
Something of the form
TreeMap<Date, ArrayList<Integer>> ()
would do nicely (a TreeMap rather than a HashMap, so that the keys stay ordered by date). The line numbers themselves need on the order of 5,000,000*32/8 bytes = 20 MB, which should be fine.
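A minimal sketch of such a date index. It assumes java.time.LocalDate instead of java.util.Date and a made-up class name DateIndex; the point is only that a sorted map can hand back every entry in a date range without scanning the rest.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical index from date to the line numbers carrying that date.
class DateIndex {
    private final NavigableMap<LocalDate, List<Integer>> linesByDate = new TreeMap<>();

    // Called once per line while the file is read for the first time.
    void add(LocalDate date, int lineNumber) {
        linesByDate.computeIfAbsent(date, d -> new ArrayList<>()).add(lineNumber);
    }

    // All line numbers whose date lies in [from, to], in date order.
    List<Integer> linesBetween(LocalDate from, LocalDate to) {
        List<Integer> result = new ArrayList<>();
        for (List<Integer> lines : linesByDate.subMap(from, true, to, true).values()) {
            result.addAll(lines);
        }
        return result;
    }
}
```

Storing byte offsets instead of line numbers would make it possible to seek straight to a record, at the cost of a long per entry instead of an int.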
You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. It also allows memory mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future will be queued for background processing. To avoid creating and destroying too many threads, the ThreadPoolExecutor will only create as many threads as you specified and execute the futures one after another.
To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started yet), this method will block until it has completed. But other futures get executed in the background while you wait.
Remember to call executor.shutdown() when you don't need it anymore, to make sure it terminates the background threads it otherwise keeps around until the keepalive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file
    create new FutureTask which parses that line
    pass future task to executor
    add future task to a list
for each entry in task list
    call entry.get() to retrieve result
executor.shutdown()
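The pseudocode above could be written in Java roughly as follows. The parse method is a stand-in (it only measures the line), and the class name LineParser is made up; Executors.newFixedThreadPool is used as a shortcut for constructing the ThreadPoolExecutor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

class LineParser {
    // Hypothetical parse step; the real request parsing would go here.
    static int parse(String line) {
        return line.trim().length();
    }

    // Parses all lines in parallel and returns the results in input order.
    static List<Integer> parseAll(List<String> lines) {
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<FutureTask<Integer>> tasks = new ArrayList<>();
            for (String line : lines) {
                FutureTask<Integer> task = new FutureTask<>(() -> parse(line));
                executor.execute(task); // queued for background processing
                tasks.add(task);
            }
            List<Integer> results = new ArrayList<>();
            for (FutureTask<Integer> task : tasks) {
                results.add(task.get()); // blocks until this task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("parsing failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }
}
```

For 5 million lines, one task per line is probably too fine-grained; submitting batches of a few thousand lines per task follows the same pattern with far less queueing overhead.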
I have an application that processes data stored in a number of files from an input directory and then produces some output depending on that data.
So far, the application works on a sequential basis, i.e. it launches a "manager" thread that
Reads the contents of the input directory into a File[] array
Processes each file in sequence and stores results
Terminates when all files are processed
I would like to convert this into a multithreaded application, in which the "manager" thread
Reads the contents of the input directory into a File[] array
Launches a number of "processor" threads, each of which processes a single file, stores results and returns a summary report for that file to the "manager" thread
Terminates when all files have been processed
The number of "processor" threads would be at most equal to the number of files, since they would be recycled via a ThreadPoolExecutor.
Any solution avoiding the use of join() or wait()/notify() would be preferable.
Based on the above scenario:
What would be the best way of having those "processor" threads reporting back to the "manager" thread? Would an implementation based on Callable and Future make sense here?
How can the "manager" thread know when all "processor" threads are done, i.e. when all files have been processed?
Is there a way of "timing" a processor thread and terminating it if it takes "too long" (i.e., it hasn't returned a result despite the lapse of a pre-configured amount of time)?
Any pointers to, or examples of, (pseudo-)source code would be greatly appreciated.
You can definitely do this without using join() or wait()/notify() yourself.
You should take a look at java.util.concurrent.ExecutorCompletionService to start with.
The way I see it you should write the following classes:
FileSummary - Simple value object that holds the result of processing a single file
FileProcessor implements Callable<FileSummary> - The strategy for converting a file into a FileSummary result
FileManager - The high-level manager that creates FileProcessor instances, submits them to a work queue and then aggregates the results.
The FileManager would then look something like this:
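class FileManager {
    private CompletionService<FileSummary> cs; // Initialize this in constructor

    public FinalResult processDir(File dir) throws InterruptedException, ExecutionException {
        FinalResult result = new FinalResult(); // assumed aggregate type, not shown here
        int fileCount = 0;
        for (File f : dir.listFiles()) {
            cs.submit(new FileProcessor(f));
            fileCount++;
        }
        for (int i = 0; i < fileCount; i++) {
            // take() blocks until the next task, in completion order, has finished
            FileSummary summary = cs.take().get();
            // aggregate summary into final result;
        }
        return result;
    }
}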
If you want to implement a timeout you can use the poll() method on CompletionService instead of take().
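A minimal sketch of the poll() variant, with made-up names (TimedCollector, collect) and String results standing in for FileSummary. Note that poll(timeout, unit) only limits how long the manager waits for the next result; it does not stop the task itself, which is why the sketch cancels whatever is still outstanding at the end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class TimedCollector {
    // Collects results in completion order and gives up as soon as no task
    // finishes within timeoutMs of the previous one.
    static List<String> collect(List<Callable<String>> jobs, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        List<Future<String>> submitted = new ArrayList<>();
        for (Callable<String> job : jobs) {
            submitted.add(cs.submit(job));
        }
        List<String> results = new ArrayList<>();
        try {
            for (int i = 0; i < jobs.size(); i++) {
                Future<String> done = cs.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (done == null) {
                    break; // nothing finished in time
                }
                results.add(done.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            for (Future<String> f : submitted) {
                f.cancel(true); // interrupts running tasks; no effect on finished ones
            }
            pool.shutdown();
        }
        return results;
    }
}
```

cancel(true) only interrupts the worker thread, so a task stops early only if its code responds to interruption (blocking calls such as sleep or most I/O waits do, a tight CPU loop does not).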
wait()/notify() are very low level primitives and you are right in wanting to avoid them.
The simplest solution would be to use a thread-safe queue (or stack, etc. -- it doesn't really matter in this case). Before starting the worker threads, your main thread can add all the Files to the thread-safe queue/stack. Then start the worker threads, and let them all pull Files and process them until there are none left.
The worker threads can add results to another thread-safe queue/stack, where the main thread can get them from. The main thread knows how many Files there were, so when it has retrieved the same number of results, it will know that the job is finished.
Something like a java.util.concurrent.BlockingQueue would work, and there are other thread-safe collections in java.util.concurrent which would also be fine.
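A rough sketch of the two-queue arrangement, assuming file names as plain strings and a placeholder "processing" step; QueueWorkers and processAll are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

class QueueWorkers {
    static List<String> processAll(List<String> files, int workers) {
        // The work queue is filled completely before any worker starts.
        BlockingQueue<String> work = new LinkedBlockingQueue<>(files);
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                String file;
                // poll() returns null once the queue is empty, which ends the worker.
                while ((file = work.poll()) != null) {
                    results.add("processed " + file); // stand-in for the real work
                }
            });
        }
        pool.shutdown(); // workers finish their loops, then the threads exit

        // The main thread knows how many results to expect.
        List<String> collected = new ArrayList<>();
        try {
            for (int i = 0; i < files.size(); i++) {
                collected.add(results.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return collected;
    }
}
```

If the real processing step can throw, the worker should catch the exception and still put a failure marker on the results queue; otherwise the main thread would wait forever for a result that never arrives.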
You also asked about terminating worker threads which are taking too long. I will tell you right up front: if you can make the code which runs on the worker threads robust enough that you can safely leave this feature out, you will make things a lot simpler.
If you do need this feature, the simplest and most reliable solution is to have a per-thread "terminate" flag, and make the worker task code check that flag frequently and exit if it is set. Make a custom class for workers, and include a volatile boolean field for this purpose. Also include a setter method (because of volatile, it doesn't need to be synchronized).
If a worker discovers that its "terminate" flag is set, it could push its File object back on the work queue/stack so another thread can process it. Of course, if there is some problem which means the File cannot be successfully processed, this will lead to an infinite cycle.
The best is to make the worker code very simple and robust, so you don't need to worry about it "not terminating".
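For reference, the "terminate" flag described above could be sketched like this. CancellableWorker and its method names are invented for the example, and the unit of work is just a placeholder.

```java
// Hypothetical worker with a cooperative "terminate" flag.
class CancellableWorker implements Runnable {
    // volatile: a write by the manager thread is visible to the worker thread
    private volatile boolean terminate;

    void requestTerminate() {
        terminate = true;
    }

    @Override
    public void run() {
        // The flag is checked between small units of work, never in the middle of one.
        while (!terminate) {
            doOneUnitOfWork();
        }
    }

    // Stand-in for reading and processing the next piece of a file.
    private void doOneUnitOfWork() {
        Thread.yield();
    }

    // Demo: starts a worker, asks it to stop after millis, reports whether it exited.
    static boolean stopsWhenAsked(long millis) {
        CancellableWorker worker = new CancellableWorker();
        Thread thread = new Thread(worker);
        thread.start();
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        worker.requestTerminate();
        try {
            thread.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !thread.isAlive();
    }
}
```

The flag only helps if each unit of work is short; a worker stuck inside a single long blocking call will not see it until that call returns.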
No need for them to report back. Just keep a count of the jobs remaining to be done and have each thread decrement that count (atomically, e.g. with an AtomicInteger) when it finishes a job.
When the count of remaining jobs reaches zero, all the "processor" threads are done.
Sure, just add that code to the thread. When it starts working, check the time and compute the stop time. Periodically (say when you go to read more from the file), check to see if it's past the stop time and, if so, stop.
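Both ideas together, as a sketch under some assumptions: a CountDownLatch plays the role of the "jobs remaining" counter, the work is split into chunks so the deadline can be checked between them, and all names (DeadlineJobs, processWithDeadline, runAll) are made up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class DeadlineJobs {
    // Processes 'chunks' units of work but gives up once the time budget is used up.
    static boolean processWithDeadline(int chunks, long budgetNanos) {
        long stopAt = System.nanoTime() + budgetNanos;
        for (int i = 0; i < chunks; i++) {
            if (System.nanoTime() - stopAt > 0) {
                return false; // past the stop time
            }
            // ... read and process the next chunk of the file here ...
        }
        return true;
    }

    // Runs all jobs on a pool and returns how many finished within their budget.
    static int runAll(int jobs, int chunksPerJob, long budgetNanos) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CountDownLatch remaining = new CountDownLatch(jobs); // jobs still to be done
        AtomicInteger finishedInTime = new AtomicInteger();
        for (int j = 0; j < jobs; j++) {
            pool.execute(() -> {
                try {
                    if (processWithDeadline(chunksPerJob, budgetNanos)) {
                        finishedInTime.incrementAndGet();
                    }
                } finally {
                    remaining.countDown(); // decrement even if the job failed
                }
            });
        }
        pool.shutdown();
        try {
            remaining.await(); // returns when the count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finishedInTime.get();
    }
}
```

System.nanoTime() is used rather than System.currentTimeMillis() because it is not affected by changes to the system clock.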