I am working on a plagiarism detection framework in Java. My document set contains about 100 documents, and I have to preprocess them and store them in a suitable data structure. My big question is how to process this set of documents efficiently while avoiding bottlenecks. The main focus of my question is how to improve the preprocessing performance.
Thanks
Regards
Nuwan
You're a bit light on specifics there. The appropriate optimizations will depend on things like the document format, the average document size, how you are processing them, and what sort of information you are storing in your data structure. Without knowing any of that, some general optimizations are:
Assuming that the pre-processing of each document is independent of the pre-processing of any other document, and assuming you are running on a multi-core CPU, your workload is a good candidate for multi-threading. Allocate one thread per CPU core and farm jobs out to those threads; then you can process multiple documents in parallel (a sketch follows the next point).
More generally, do as much in memory as you can. Try to avoid reading from/writing to disk as much as possible. If you must write to disk, try to wait until you have all the data you want to write, and then write it all in a single batch.
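To make the multi-threading suggestion concrete, here is a minimal sketch. It assumes plain-text files in a local `documents` directory; `preprocess` and `store` are placeholders for whatever parsing and data structure your framework actually uses:

```java
// Minimal sketch: one task per document on a pool sized to the CPU core count.
// The "documents" directory, preprocess() and store() are placeholder assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Preprocessor {

    public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);      // one thread per core

        try (Stream<Path> paths = Files.list(Path.of("documents"))) {
            List<Future<String>> results = paths
                    .map(path -> pool.submit(() -> preprocess(path)))    // one task per document
                    .collect(Collectors.toList());
            for (Future<String> result : results) {
                store(result.get());                                     // gather results on the main thread
            }
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder: read the whole document into memory and do all processing there.
    private static String preprocess(Path path) throws IOException {
        return Files.readString(path).toLowerCase();
    }

    private static void store(String processed) {
        // put the processed text into your data structure
    }
}
```

Note that the whole document is read into memory and processed there, in line with the second suggestion about avoiding repeated disk access.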
You give very little information on which to make any good suggestions.
My default would be to process them using an executor backed by a thread pool with the same number of threads as cores in your machine, with each thread processing one document.
Related
I have one database, and in it we have millions of records. We read the records one by one using Java and insert them into another system daily, after end of day. We have been told to make this faster.
I told them we would create multiple threads using a thread pool, and these threads would read data in parallel and inject it into the other system, but I don't know how we can stop our threads from reading the same data again. How can we make it faster and achieve data consistency as well? In other words, how can we make this process faster using multithreading in Java, or is there any way other than multithreading to achieve it?
One possible solution for your task would be to take the ids of the records in your database, split them into chunks (e.g. 1000 ids each), and call JpaRepository.findAllById(Iterable<ID>) within Runnables passed to ExecutorService.submit().
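A rough sketch of that idea, assuming a Spring Data JPA repository; `RecordRepository`, `MyRecord` and `pushToTargetSystem` are hypothetical names, not anything from your project:

```java
// Sketch of chunked parallel reads. RecordRepository (extends JpaRepository<MyRecord, Long>),
// MyRecord and pushToTargetSystem(...) are placeholders.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedExporter {

    private static final int CHUNK_SIZE = 1000;

    private final RecordRepository repository;   // hypothetical Spring Data repository

    public ChunkedExporter(RecordRepository repository) {
        this.repository = repository;
    }

    public void exportAll(List<Long> allIds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Each task owns a disjoint slice of the ids, so no two threads ever read the same records.
        for (int from = 0; from < allIds.size(); from += CHUNK_SIZE) {
            List<Long> chunk = allIds.subList(from, Math.min(from + CHUNK_SIZE, allIds.size()));
            pool.submit(() -> {
                List<MyRecord> records = repository.findAllById(chunk);  // one bulk SELECT per chunk
                pushToTargetSystem(records);                              // placeholder for the downstream insert
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void pushToTargetSystem(List<MyRecord> records) {
        // insert into the other system, ideally in batches
    }
}
```

Because each task gets its own disjoint slice of ids, no two threads read the same data, which addresses the consistency concern in the question.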
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of round trips between the Java application (via the JDBC driver) and the database: stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the DB into memory and process the whole bulk (a JDBC sketch follows this list). Consider even larger numbers (like 5000), but you should really measure this; it depends on the memory of the Java application and other factors. Anyway, if there is an issue, discard the bulk.
The data itself might not be organized correctly: when you read the bulk of data you might need to order it by some criteria, so make sure it doesn't cause a full table scan, define indices properly, etc.
If applicable, talk to your DBA, he/she might provide additional insights about data management itself: partitioning, storage related optimizations, etc.
If all this fails and reading from the DB is still a bottleneck, consider redesigning the flow (for instance, publish messages to Kafka if you have it); those can be naturally partitioned, so you could scale out the whole process, but this might be beyond the scope of this question.
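As referenced in the first item, here is a minimal plain-JDBC sketch of bulk reading; the table, the column names and `pushBatch` are placeholder assumptions:

```java
// Sketch of bulk reading with JDBC. Table/column names and pushBatch(...) are placeholders.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BulkReader {

    private static final int FETCH_SIZE = 2000;   // tune by measuring, as suggested above

    public void copyRecords(Connection connection) throws SQLException {
        try (PreparedStatement stmt = connection.prepareStatement(
                "SELECT id, payload FROM source_records ORDER BY id")) {
            stmt.setFetchSize(FETCH_SIZE);        // fewer driver round trips per result set
            try (ResultSet rs = stmt.executeQuery()) {
                List<String> batch = new ArrayList<>(FETCH_SIZE);
                while (rs.next()) {
                    batch.add(rs.getString("payload"));
                    if (batch.size() == FETCH_SIZE) {
                        pushBatch(batch);          // hand the whole bulk to the target system
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    pushBatch(batch);
                }
            }
        }
    }

    private void pushBatch(List<String> batch) {
        // insert into the other system, ideally with JDBC batch inserts
    }
}
```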
I am working on a task where I need to delete very large numbers of records from MongoDB; sometimes there are between 2M and 3M records. I am trying to make that as fast as it can be.
My idea was to use some kind of thread pool and divide the work among something like 20 threads, each deleting a part of the collection. Before I go further with this approach I would like to know whether it is a good (promising) approach or not. My main concern is that this may not be possible in Mongo and I will get blocking behaviour in the DB, with the threads basically waiting for each other to finish deleting.
I would also be happy if any other approaches/solutions were suggested.
The project language is Java/Spring.
Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using 2 application threads for deleting may be a good idea to reduce this overhead when measuring wallclock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).
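If you do try a small number of delete threads, a hedged sketch with the MongoDB Java sync driver might look like this; the collection name, the `createdAt` field and the month-by-month partitioning are purely illustrative assumptions:

```java
// Sketch: split one large delete into range-partitioned deleteMany calls on a small pool.
// Database/collection names, the createdAt field and the date ranges are placeholders.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDelete {

    public static void main(String[] args) throws Exception {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("app").getCollection("events");

            // Small pool: as noted above, a couple of threads is usually enough to hide
            // per-request overhead; many more rarely helps for deletes.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            Instant start = Instant.now().minus(365, ChronoUnit.DAYS);
            for (int i = 0; i < 12; i++) {
                Instant lo = start.plus(30L * i, ChronoUnit.DAYS);
                Instant hi = start.plus(30L * (i + 1), ChronoUnit.DAYS);
                pool.submit(() -> events.deleteMany(
                        Filters.and(Filters.gte("createdAt", lo), Filters.lt("createdAt", hi))));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }
}
```

An index on the field used for partitioning (here, the assumed `createdAt`) matters more for the server-side cost than the number of client threads.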
This is a general programming question. Let's say I have a thread doing a specific simulation, where speed is quite important. At every iteration I want to extract data from it and write it to a file.
Is it better practice to hand the data over to a different thread and let the simulation thread focus on its job, or, since speed is very important, to make the simulation thread do the data recording too, without any copying of data? (In my case it is 3-5 deques of integers with a size of 1000-10000.)
Firstly, it surely depends on how much data we are copying, but what else can it depend on? Can the cost of synchronization and copying be worth it? Is it good practice to create small Runnables at each iteration to handle the recording task, in the case of 50 or more iterations per second?
If you truly want low latency on this stat capturing, and you want it during the simulation itself then two techniques come to mind. They can be used together very effectively. Please note that these two approaches are fairly far from the standard Java trodden path, so measure first and confirm that you need these techniques before abusing them; they can be difficult to implement correctly.
The fastest way to write the data to a file during a simulation, without slowing the simulation down, is to hand the work off to another thread. However, care has to be taken over how the hand-off occurs, as a full memory barrier in the simulation thread will slow the simulation. Given that the writer only cares that the values arrive eventually, I would consider using the weaker memory barrier that sits behind AtomicLong.lazySet: it requests a thread-safe write to a memory address without blocking until the write actually becomes visible to the other thread. Unfortunately, direct access to this memory barrier is currently only available via lazySet or via sun.misc.Unsafe, which obviously is not part of the public Java API. However, that should not be too large a hurdle, as it is present on all current JVM implementations and Doug Lea has talked about moving parts of it into the mainstream.
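To illustrate the lazySet hand-off, here is a rough single-producer/single-consumer sketch. The names and the ring-buffer size are my own; it assumes exactly one simulation thread writing, one logging thread draining, and that the reader never falls more than CAPACITY entries behind:

```java
// Single-producer/single-consumer hand-off published with AtomicLong.lazySet.
// Assumes one simulation (producer) thread and one logging (consumer) thread, and
// that the consumer never falls more than CAPACITY entries behind (no overrun check).
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

public class StatRing {

    private static final int CAPACITY = 1 << 16;              // power of two for cheap index masking
    private final long[] buffer = new long[CAPACITY];
    private final AtomicLong published = new AtomicLong(-1);  // highest index visible to the consumer
    private long writeSeq = -1;                               // touched only by the producer thread

    // Simulation thread: store the value, then publish with lazySet (an ordered/release
    // store) rather than a full volatile write, keeping the producer's overhead low.
    public void put(long value) {
        long seq = ++writeSeq;
        buffer[(int) (seq & (CAPACITY - 1))] = value;
        published.lazySet(seq);
    }

    // Logging thread: drain everything published so far; the volatile read of
    // published pairs with lazySet, making the earlier array writes visible.
    public long drain(long lastSeen, LongConsumer sink) {
        long limit = published.get();
        for (long seq = lastSeen + 1; seq <= limit; seq++) {
            sink.accept(buffer[(int) (seq & (CAPACITY - 1))]);
        }
        return limit;
    }
}
```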
To avoid the slow, blocking file IO that Java uses, make use of a memory-mapped file. This lets the OS perform async IO on your behalf and is very efficient. It also supports use of the same memory barrier mentioned above.
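A minimal sketch of the memory-mapped approach; the file name and fixed mapping size are assumptions, and there is no bounds checking:

```java
// Minimal sketch of writing stats through a memory-mapped file; the OS writes dirty
// pages back asynchronously, so record() is just a memory store. No bounds checking:
// the mapping size must be large enough for everything you intend to write.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStatWriter implements AutoCloseable {

    private final RandomAccessFile file;
    private final MappedByteBuffer buffer;

    public MappedStatWriter(String path, long sizeBytes) throws IOException {
        file = new RandomAccessFile(path, "rw");
        buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
    }

    public void record(long value) {
        buffer.putLong(value);           // plain memory store into the mapped region
    }

    @Override
    public void close() throws IOException {
        buffer.force();                  // optionally flush outstanding changes before closing
        file.close();
    }
}
```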
For examples of both techniques, I strongly recommend reading the source code to HFT Chronicle by Peter Lawrey. In fact, HFT Chronicle may be just the library for you to use here. It offers a highly efficient and simple to use disk backed queue that can sustain a million or so messages per second.
In my work on a stress-testing HTTP client I stored the stats into an array and, when the array was ready to send to the GUI, I would create a new array for the tester client and hand off the full array to the network layer. This means that you don't need to pay for any copying, just for the allocation of a fresh array (an ultra-fast operation on the JVM, involving hand-coded assembler macros to utilize the best SIMD instructions available for the task).
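A rough sketch of that swap-instead-of-copy hand-off; the blocking queue stands in for the network layer, and the batch size is my own placeholder:

```java
// Sketch of the "swap the array instead of copying it" hand-off described above.
// The outbound queue stands in for the network/GUI layer; sizes are placeholders.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StatsBuffer {

    private static final int BATCH = 10_000;
    private final BlockingQueue<long[]> outbound = new ArrayBlockingQueue<>(8);
    private long[] current = new long[BATCH];
    private int count;

    // Called by the hot thread: no copying, only an occasional fresh allocation.
    public void add(long sample) throws InterruptedException {
        current[count++] = sample;
        if (count == BATCH) {
            outbound.put(current);        // hand the full array off to the consumer thread
            current = new long[BATCH];    // fresh allocation is cheap on the JVM (TLAB bump)
            count = 0;
        }
    }

    public BlockingQueue<long[]> outbound() {
        return outbound;
    }
}
```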
I would also suggest not throwing yourself head-first into the realm of optimal memory barrier usage; the difference between a plain volatile write and an AtomicReference.lazySet() is only measurable if your thread does almost nothing else but exercise the memory barrier (at least millions of writes per second). Depending on your target I/O throughput, you may not even need NIO to meet the goal. Better to try first with simple, easily maintainable code than to dig elbow-deep into highly specialized APIs without a confirmed need for them.
Currently I'm running a thread-less (single-threaded) model, and it isn't working simply because I'm running out of memory before I can process the data I'm being handed. I've made all the changes I can to optimize the code, and it's still just not quite quick enough.
Clearly I should move on to a threaded model. I'm wondering what the simplest, easiest way to do the following is:
The main thread passes some info to the worker
That worker performs some work that I'll refactor out of the main method
The workers will disappear and new ones will be instantiated when needed
I've never worked with Java threading, and from what I've read it seems pretty complicated, even though what I'm looking for seems pretty simple.
If you have multiple independent units of work of equal priority, the best solution is generally some sort of work queue, where a limited number of threads (the number chosen to optimize performance) sit in a while(true) loop dequeuing work units from the queue and executing them.
Generally the optimum number of threads is going to be the number of processors +/- 1, though in some cases a larger number will be optimal if the threads tend to get stalled by disk I/O requests or some such.
But keep in mind that tuning the entire system may be required. E.g., you may need more disk arms, and certainly more RAM may be required.
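A small sketch of that explicit work-queue pattern; the bounded queue size and the poison-pill shutdown are illustrative choices, not the only way to do it:

```java
// Sketch of the work-queue pattern: a fixed set of threads loop, taking work units
// from a shared bounded queue. Queue size and poison-pill shutdown are illustrative.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class WorkQueue {

    private static final Runnable POISON = () -> { };
    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);
    private final Thread[] workers;

    public WorkQueue(int threads) {
        workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        Runnable unit = queue.take();   // blocks until work is available
                        if (unit == POISON) {
                            return;                     // one pill stops one worker
                        }
                        unit.run();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
    }

    public void submit(Runnable unit) throws InterruptedException {
        queue.put(unit);                                // blocks if the queue is full
    }

    public void shutdown() throws InterruptedException {
        for (int i = 0; i < workers.length; i++) {
            queue.put(POISON);                          // one pill per worker
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}
```

A thread count around Runtime.getRuntime().availableProcessors(), adjusted after measuring, matches the sizing advice above; in practice the Executors API mentioned in the next answer gives you the same structure with less code to maintain.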
I'd start by having a read through Java Concurrency as refresher ;)
In particular, I would spend some time getting to know the Executors API, as it will do most of what you've described without a lot of the overhead of dealing with too many locks ;)
Distributing the work across multiple threads will not change overall memory consumption. From what I read in your question, I would step forward and tell you: increase the heap of the JVM (for example via the -Xmx startup parameter); this will help. It looks like you have to optimize the Java startup parameters and not your code. If I am wrong, then you will have to buffer the data. To disk! Not to a thread in the same memory model.
I am trying to develop a piece of code in Java that will be able to process large amounts of data fetched by a JDBC driver from an SQL database and then persisted back to the DB.
I thought of creating a manager containing one reader thread, one writer thread, and a customizable number of worker threads processing data. The reader thread would read data into DTOs and pass them to a queue labelled 'ready for processing'. Worker threads would process the DTOs and put the processed objects into another queue labelled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching data? Are there any ready-made libraries in Java for doing this sort of thing that I am not aware of?
Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately, and then decide on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
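A hedged sketch of that design with two bounded queues; `Dto`, `readNextBatch`, `process` and `persist` are placeholders for the JDBC fetch, the per-item work and the batched write-back, and clean shutdown handling is omitted:

```java
// Hedged sketch of the reader -> workers -> writer pipeline with bounded queues.
// Dto, readNextBatch(), process() and persist() are placeholders; shutdown is omitted.
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EtlPipeline {

    // Bounded queues: if the workers or the writer fall behind, the reader blocks on
    // put() instead of filling the heap with unprocessed DTOs.
    private final BlockingQueue<Dto> readyForProcessing = new ArrayBlockingQueue<>(1_000);
    private final BlockingQueue<Dto> readyForPersistence = new ArrayBlockingQueue<>(1_000);

    public void run(int workerThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(workerThreads + 2);

        // One reader thread.
        pool.submit(() -> {
            try {
                List<Dto> batch;
                while (!(batch = readNextBatch()).isEmpty()) {
                    for (Dto dto : batch) {
                        readyForProcessing.put(dto);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // N worker threads.
        for (int i = 0; i < workerThreads; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        readyForPersistence.put(process(readyForProcessing.take()));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // One writer thread; ideally drain and persist in batches rather than one at a time.
        pool.submit(() -> {
            try {
                while (true) {
                    persist(readyForPersistence.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    private List<Dto> readNextBatch() { return List.of(); }   // placeholder JDBC read
    private Dto process(Dto dto) { return dto; }              // placeholder transformation
    private void persist(Dto dto) { }                         // placeholder write-back
    private static class Dto { }
}
```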
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural, row-by-row processing)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example: a long time ago I implemented a (Java) program whose purpose was to load purchases, payments and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one by one, and for each piece of data, perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally this did not scale once the volume increased.
Then I made another mistake. I decided that the database was the problem (because I had heard that SELECTs are slow), so I decided to pull all the data out of the database and do ALL the processing in Java, and then finally persist all the data back to the database. I implemented all kinds of layers with callback mechanisms to make the load process easy to extend, but I just couldn't get it to perform well.
Looking in the rear-view mirror, what I should have done was insert the (laughably small amount of) 100,000 rows temporarily into a table and process them from there. What took nearly half a day to process would have taken a few minutes at most if I had played to the strengths of all the technologies I had at my disposal.
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
You're describing writing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing with it. Parallel and multithreaded processing, several different database readers/writers, and a whole bunch of other things are provided.
Use Spring Batch! That is exactly what you need.