I have a program that at the start generates a large amount of data (several GB, possibly more than 10 GB) and then repeatedly processes it: process all data, do something, process all data, do something... That much data doesn't fit into my RAM, and once it starts paging it's really painful. What is the optimal way to store my data, and in general, how should I solve this problem?
Should I use a DB even though I don't need to keep the data after my program ends?
Should I split my data somehow and save it into files, loading them when I need them? Or just keep using RAM and put up with the paging?
With a DB and with files there is a problem: I have to process the data in pieces. So I load a chunk of data (let's say 500 MB), calculate, load the next chunk, and after I have loaded and calculated everything, I can do something and repeat the cycle. That means I would read the same chunks of data from the HDD in every cycle.
Try to reduce the amount of data.
Try to modify the algorithm to extract the relevant data at an early stage.
Try to divide and/or parallelize the problem, and execute it over several clients in a cluster of computing nodes.
A file-based approach will be enough for your task. A couple of samples:
Use the BufferedReader.skip() method
Use RandomAccessFile
Read up on these two, and the problem of re-reading duplicate chunks should go away.
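As a minimal sketch of the second suggestion: RandomAccessFile.seek() lets each cycle jump straight to the chunk it needs without re-reading everything before it. The chunk size and file contents below are made up for illustration; in practice the chunk would be hundreds of MB.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkReader {
    static final int CHUNK_SIZE = 4; // tiny for illustration; think 500 MB in practice

    // Jump straight to the n-th chunk and read it; returns null past end of file.
    static byte[] readChunk(RandomAccessFile file, int chunkIndex) throws IOException {
        file.seek((long) chunkIndex * CHUNK_SIZE); // no need to re-read earlier chunks
        byte[] buf = new byte[CHUNK_SIZE];
        int read = file.read(buf);
        return read < 0 ? null : buf;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("data", ".bin");
        Files.write(tmp, "ABCDEFGHIJKL".getBytes());
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            // A later cycle can revisit any chunk by seeking, in any order.
            System.out.println(new String(readChunk(raf, 1))); // prints EFGH
            System.out.println(new String(readChunk(raf, 0))); // prints ABCD
        } finally {
            Files.delete(tmp);
        }
    }
}
```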
You should definitely try to reduce the amount of data and use multiple threads to handle it.
FutureTask could help you:
ExecutorService exec = Executors.newFixedThreadPool(5);
FutureTask<BigDecimal> task1 = new FutureTask<>(new Callable<BigDecimal>() {
    @Override
    public BigDecimal call() throws Exception {
        return doBigProcessing();
    }
});
// start the future task asynchronously
exec.execute(task1);
// do other stuff
// block until processing is over
BigDecimal result = task1.get();
In the same way, you could consider caching the future task's result to speed up your application where possible.
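One way to sketch that caching idea: memoize the Future per key in a ConcurrentHashMap, so the expensive computation is submitted at most once per key and later callers just reuse the same Future. doBigProcessing here is a hypothetical stand-in for the real work.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ResultCache {
    private final ExecutorService exec = Executors.newFixedThreadPool(5);
    private final Map<String, Future<BigDecimal>> cache = new ConcurrentHashMap<>();

    // Submits the computation only once per key; later callers get the cached Future.
    public Future<BigDecimal> compute(String key) {
        return cache.computeIfAbsent(key, k -> exec.submit(() -> doBigProcessing(k)));
    }

    // Placeholder for the expensive computation (hypothetical).
    private BigDecimal doBigProcessing(String key) {
        return new BigDecimal(key.length());
    }

    public void shutdown() {
        exec.shutdown();
    }
}
```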
If that's not enough, you could use the Apache Spark framework to process large datasets.
Before you think about performance, you must consider the following:
Find a good data structure for the data.
Find good algorithms to process the data.
If you do not have enough memory space,
use a memory-mapped file to work on the data.
If you can process the data without loading all of it at once,
divide and conquer.
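A minimal sketch of the memory-mapped approach via FileChannel.map: the OS pages data in and out on demand, so the whole file never has to fit into the Java heap. Note that a single MappedByteBuffer is limited to 2 GB, so a file larger than that needs several mappings.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedSum {
    // Sum all bytes of a file through a memory mapping; the buffer is backed
    // by the file itself, paged in by the OS as we touch it.
    static long sumBytes(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF; // treat bytes as unsigned
            }
            return sum;
        }
    }
}
```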
And please give us more details.
I have one database with millions of records in it. We read the records one by one using Java and insert them into another system on a daily basis, after end of day. We have been told to make it faster.
I told them we would create multiple threads using a thread pool, and these threads would read data in parallel and inject it into the other system, but I don't know how to stop our threads from reading the same data again. How can I make it faster and achieve data consistency as well? I mean, how can we make this process faster using multithreading in Java, or is there any way other than multithreading to achieve it?
One possible solution for your task would be to take the ids of the records in your database, split them into chunks (e.g. 1000 each), and call JpaRepository.findAllById(Iterable<ID>) within Runnables passed to ExecutorService.submit().
If you don't want to do this manually, have a look at Spring Batch. It is designed specifically for bulk transformation of large amounts of data.
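Without pulling in Spring, the chunking part of that idea might look like the sketch below. Because each id lands in exactly one chunk and each chunk becomes one task, no two workers ever read the same record. handleChunk is a hypothetical stand-in for whatever does the actual fetch (e.g. findAllById).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ChunkedDispatch {
    // Split ids into fixed-size chunks so no two workers see the same record.
    static List<List<Long>> partition(List<Long> ids, int chunkSize) {
        List<List<Long>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            chunks.add(ids.subList(i, Math.min(i + chunkSize, ids.size())));
        }
        return chunks;
    }

    // One task per chunk; handleChunk would do the real read/insert work.
    static void dispatch(List<Long> ids, int chunkSize, int threads,
                         Consumer<List<Long>> handleChunk) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<Long> chunk : partition(ids, chunkSize)) {
            pool.submit(() -> handleChunk.accept(chunk));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```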
I think you should identify the slowest part of this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of "roundtrips" between the Java application and the database: stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the DB into memory and process the whole bulk. Consider even larger numbers (like 5000), but you should really measure this; it depends on the memory available to the Java application and other factors. Anyway, if there is an issue, discard the bulk.
The data itself might not be organized correctly: when you read a bulk of data you might need to order it by some criteria, so make sure the query doesn't do a full table scan, define indices properly, etc.
If applicable, talk to your DBA; he/she might provide additional insights about the data management itself: partitioning, storage-related optimizations, etc.
If all this fails and reading from the DB is still a bottleneck, consider redesigning the flow (for instance, publish messages to Kafka if you have it); these can be naturally partitioned so you could scale out the whole process, but this might be beyond the scope of this question.
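The bulk-reading step above can be sketched independently of any particular driver; with JDBC you would combine it with Statement.setFetchSize(...) on the query so the driver also fetches rows in batches. Here the record source is abstracted as a plain Iterator, which is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class BulkProcessor {
    // Drain a record source in fixed-size bulks instead of one row at a time,
    // cutting the number of application <-> database roundtrips.
    static <T> int processInBulks(Iterator<T> rows, int bulkSize, Consumer<List<T>> handle) {
        List<T> bulk = new ArrayList<>(bulkSize);
        int bulks = 0;
        while (rows.hasNext()) {
            bulk.add(rows.next());
            if (bulk.size() == bulkSize) {
                handle.accept(bulk);            // process the whole bulk at once
                bulk = new ArrayList<>(bulkSize);
                bulks++;
            }
        }
        if (!bulk.isEmpty()) {                   // final, possibly short, bulk
            handle.accept(bulk);
            bulks++;
        }
        return bulks;
    }
}
```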
I'm writing an application that listens on UDP for incoming messages. My main thread receives message after message from the network and passes each of them to a new thread for handling using an executor.
Each handling thread does the required processing on the message it's responsible for and adds it to a LinkedBlockingQueue that is shared between all the handling threads.
Then, I have a DB worker thread that drains the queue by block of 10000 messages and inserts the block of messages in the DB.
Since the arrival rate of messages may be high (more than 20000 messages per second), I thought that using LOAD DATA INFILE would be more efficient. So this DB worker thread drains the queue as described, creates a temporary file containing all the messages in CSV format, and passes the created file to another thread using another executor. This new thread executes the LOAD DATA INFILE statement using JDBC.
After testing my application, I think the performance is not so good, and I'm looking for ways to improve it both at the multithreading level and at the DB access level.
I should mention that I use MySQL as the DBMS.
Thanks
You need to determine why your performance is poor.
E.g. it's quite likely you don't need multiple threads if you are writing the data sequentially to a database, which is far more likely to be your bottleneck. The problem with using multiple threads when you don't need to is that it adds complexity, which is an overhead in itself, and it can be slower than using a single thread.
I would try to see what the performance is like if you do everything except load the data into the database, i.e. write the file and then discard it.
It's hard to tell without any profiler output, but my (un-)educated guess is that the bottleneck is that you are writing your changes to a file on the hard drive and then prompting your database to read and parse this file. Storage access is always much, much slower than memory access, so this is very likely much slower than just feeding the database the queries from memory.
But that's just guessing. Maybe the bottleneck is somewhere you or I would never have expected it. When you really want to know which part of your application eats how much CPU time, you should use a profiler like Profiler4j to analyze your program.
In my application, I have to read a huge amount of data. After I have got all of it, I put it in a list, process it, and work accordingly.
Now I was wondering if I can do anything, anything at all, to speed up getting the data from the database. My database sits on a different server, and I am working with Java to interact with it.
I don't have a definite size for the data, i.e. a specific number of rows that I need to process. Also, I hear I can go for multithreading, but then how do I go about it, since I won't know how to partition my data given that its size is indefinite? I.e., if the following pseudo code is to be applied:
for (i = 0 to numberOfPartitions)   // not certain of the number of partitions
    create new thread and get data
Or maybe I can hash the data on the basis of some attribute and later tell each thread to fetch a particular index of the map, but then how do I map it before even fetching the data?
What possible solutions can I look into, and how do I go about them? Let me know if you need any more info.
Thanks.
I hear I can go for multithreading, but then how do I go about it?
This is definitely a good choice for speeding up querying information from a remote server.
Usually in these tasks the IO with the server is the main bottleneck, and by multithreading one can "ask for" multiple rows concurrently, effectively reducing the IO wait times.
but then how do I go about it?
The idea is to split the work into smaller tasks. Have a look at java high level concurrency API for more details.
One solution is to let each thread read chunks of size M from the server, repeating while the server still has data for it. Something like this (for each thread):
int chunk = threadNumber;
Object data = requestChunk(chunk);
while (data != null) {
    process(data);
    chunk += numberOfThreads;
    data = requestChunk(chunk);
}
I assume here that once you are "out of bounds" the server returns null (or requestChunk() detects this and returns null).
Or maybe i can hash data on the basis of some attribute and later tell each thread to fetch a particular index of the map
If you need to iterate the data and retrieve all of it, hashing is usually a bad solution. It is very cache-inefficient, and the overhead is just too big in this case.
I am working on a project related to a plagiarism-detection framework using Java. My document set contains about 100 documents, and I have to preprocess them and store them in a suitable data structure. My big question is how I am going to process this large set of documents efficiently while avoiding bottlenecks. The main focus of my question is how to improve the preprocessing performance.
Thanks
Regards
Nuwan
You're a bit light on specifics there. Appropriate optimizations will depend on things like the document format, the average document size, how you are processing them, and what sort of information you are storing in your data structure. Not knowing any of these, some general optimizations are:
Assuming that the pre-processing of a given document is independent of the pre-processing of any other document, and assuming you are running a multi-core CPU, then your workload is a good candidate for multi-threading. Allocate one thread per CPU core, and farm out jobs to your threads. Then you can process multiple documents in parallel.
More generally, do as much in memory as you can. Try to avoid reading from/writing to disk as much as possible. If you must write to disk, try to wait until you have all the data you want to write, and then write it all in a single batch.
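A sketch of the one-thread-per-core idea above, assuming (as the answer does) that preprocessing one document is independent of the others. Tokenizing stands in for the real preprocessing, which is a made-up placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPreprocessor {
    // One fixed thread per CPU core, one task per document; results are
    // collected in submission order via the returned Futures.
    static List<Integer> tokenCounts(List<String> docs)
            throws InterruptedException, ExecutionException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String doc : docs) {
                // "Preprocessing" here is just counting whitespace-separated tokens.
                futures.add(pool.submit(() -> doc.trim().split("\\s+").length));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) {
                counts.add(f.get()); // blocks until that document is done
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```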
You give very little information on which to base any good suggestions.
My default would be to process them using an executor with a thread pool containing the same number of threads as cores in your machine, each thread processing one document.
I am trying to develop a piece of code in Java that will be able to process large amounts of data fetched by a JDBC driver from an SQL database and then persisted back to the DB.
I thought of creating a manager containing one reader thread, one writer thread, and a customizable number of worker threads processing the data. The reader thread would read data into DTOs and pass them to a queue labeled 'ready for processing'. Worker threads would process the DTOs and put the processed objects on another queue labeled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching data? Are there any ready-made libraries in Java for doing this sort of thing that I am not aware of?
Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately and then decide on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
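A minimal sketch of the proposed design with bounded queues: a slow stage back-pressures the stage before it instead of letting a queue grow without limit, and a poison-pill marker shuts the pipeline down. Uppercasing stands in for the worker's processing, and an in-memory list stands in for the DB write; both are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Pipeline {
    private static final String POISON = "<eof>"; // end-of-stream marker

    // Reader -> bounded queue -> worker -> bounded queue -> writer.
    static List<String> run(List<String> input) throws InterruptedException {
        BlockingQueue<String> toProcess = new LinkedBlockingQueue<>(100); // bounded
        BlockingQueue<String> toPersist = new LinkedBlockingQueue<>(100); // bounded
        List<String> persisted = Collections.synchronizedList(new ArrayList<>());

        Thread reader = new Thread(() -> {
            try {
                for (String row : input) toProcess.put(row); // blocks if queue is full
                toProcess.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread worker = new Thread(() -> {
            try {
                for (String dto; !(dto = toProcess.take()).equals(POISON); ) {
                    toPersist.put(dto.toUpperCase()); // the "processing" step
                }
                toPersist.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread writer = new Thread(() -> {
            try {
                for (String row; !(row = toPersist.take()).equals(POISON); ) {
                    persisted.add(row); // stand-in for the DB write
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        reader.start(); worker.start(); writer.start();
        reader.join(); worker.join(); writer.join();
        return persisted;
    }
}
```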
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example: a long time ago I implemented a (Java) program whose purpose was to load purchases, payments, and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one by one, and for each piece of data, perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally this did not scale once the volume increased.
Then I made another mistake. I deemed that the database was the problem (because I had heard that SELECT is slow), so I decided to pull all data out of the database and do ALL the processing in Java, and then finally persist all the data back to the database. I implemented all kinds of layers with callback mechanisms to easily extend the load process, but I just couldn't get it to perform well.
Looking in the rear-view mirror, what I should have done was insert the (laughably small amount of) 100,000 rows temporarily into a table and process them from there. What took nearly half a day would have taken a few minutes at most if I had played to the strengths of all the technologies at my disposal.
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
You're describing writing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing using it. Parallel and multithreaded processing, and several different database readers/writers and whole bunch of other stuff are provided.
Use Spring Batch! That is exactly what you need.