How to process large log file in java

I have 4 files of 200 MB each. I created 4 threads that run in parallel; each thread processes its input and adds the results to an ArrayBlockingQueue.
Another thread takes items from the ArrayBlockingQueue, processes them and adds them to a batch. The batch size is 5000; each full batch is executed to insert the records into the database. Even so, processing all 4 files takes around 6 minutes.
How can I improve performance in this case?

1) Make sure you have enough memory for the queue, the processor buffers and the DB buffers.
2) A batch size of 5k is a bit more than needed; in general you reach full speed at around 100, not that it makes much difference here.
3) You can push data into Oracle from multiple threads. If you fetch sequence values for the ID fields ahead of time, you can insert into one table in parallel, provided it does not have many indexes. Otherwise consider disabling and rebuilding the indexes, or insert into a temporary table and then move everything into the main one.
4) Review the Oracle settings with your DB admin. Things like extent size and increment can change performance.
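
A minimal sketch of points 2 and 3 in plain Java, with no database attached: a consumer drains the bounded queue and flushes batches of about 100 rows. The names BatchingConsumer, POISON_PILL and the sink callback are hypothetical; in a real loader the sink would presumably wrap PreparedStatement.addBatch() and executeBatch(), and several such consumers, each with its own connection, could run in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Sketch: one consumer drains a bounded queue and flushes fixed-size batches. */
class BatchingConsumer {
    /** Hypothetical end-of-input marker; a producer puts it once all input is read. */
    static final String POISON_PILL = "\u0000EOF";

    /**
     * Takes lines from the queue until POISON_PILL arrives, handing each full
     * batch (and the final partial one) to the sink. Returns the number of lines seen.
     */
    static int drain(BlockingQueue<String> queue, int batchSize, Consumer<List<String>> sink) {
        List<String> batch = new ArrayList<>(batchSize);
        int total = 0;
        try {
            while (true) {
                String line = queue.take();              // blocks while the queue is empty
                if (line.equals(POISON_PILL)) {
                    break;
                }
                batch.add(line);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(new ArrayList<>(batch)); // stand-in for executing a JDBC batch
                    batch.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();          // keep the interrupt flag and stop early
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));         // flush the remainder
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1_000);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 250; i++) {
                    queue.put("line-" + i);              // blocks when the queue is full
                }
                queue.put(POISON_PILL);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int seen = drain(queue, 100, b -> System.out.println("flush " + b.size()));
        producer.join();
        System.out.println("total " + seen);
    }
}
```

Timing the sink separately from the parsing threads is one way to see whether the database or the file processing is the slower side before adding more threads.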

Related

Is multi threaded delete from mongodb blocking?

I am working on a task where I need to delete a very large number of records from MongoDB, sometimes between 2M and 3M. I am trying to make that as fast as possible.
My idea was to use some kind of thread pool and divide the work across something like 20 threads that each delete a part of the collection. Before I go further with this approach, I would like to know whether it is a good (promising) one. My main concern is that this may not be possible in Mongo: the database might block, and the threads would basically wait for each other to finish deleting.
I would also be happy to hear any other approaches or solutions.
The project language is Java/Spring.

Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using 2 application threads for deleting may be a good idea to reduce this overhead when measuring wallclock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).

Reading from Database through multiple threads in java

I am reading data from a Vertica database using multiple threads in Java.
I have around 20 million records, and I am opening 5 different threads that run select queries like this:
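int start = threadNum;  // a different value for each of the 5 threads
while ((long) start * 20000 < totalRecords) {
    String sql = "select * from tableName order by colname limit 20000 offset " + ((long) start * 20000);
    // execute sql and read the 20K rows it returns
    start += 5;
}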
The loop above assigns distinct 20K-record pages to each thread.
For example, the first thread reads the first 20K records, then the 20K records starting at position 100,000, and so on.
But I am not getting any performance improvement. In fact, if a single thread takes x seconds to read all 20 million records, each of the 5 threads also takes almost x seconds to read its share.
Shouldn't there be some improvement over x seconds (I was expecting x/5 seconds)?
Can anybody pinpoint what is going wrong?

Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
As for why reading 20 thousand records appears to perform almost as badly as reading 20 million records, I am not sure, but it could be due to a silly implementation in the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you should be able to trick the system to work correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system that I do not remember anymore is that it would result in an error saying that offset and limit cannot appear in a subquery, they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign each row an incrementing unique integer id with no gaps in the numbering, so that you can select ... where unique_id >= start and unique_id < (start + 20000), which will definitely be executed on the server rather than on the client. (The upper bound is exclusive so that each range holds exactly 20000 rows and adjacent ranges do not overlap.)
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1000 batches of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
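
The id-range idea can be sketched as follows. This assumes the ids are gap-free and start at a known first value, and it uses half-open ranges so that neighbouring chunks never share a row; IdRanges, Range and whereClause are hypothetical names, and the string-built clause is for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a gap-free id column into half-open ranges, one query per range. */
class IdRanges {
    static final class Range {
        final long startInclusive;
        final long endExclusive;

        Range(long startInclusive, long endExclusive) {
            this.startInclusive = startInclusive;
            this.endExclusive = endExclusive;
        }

        /** Illustration only; real code should bind both bounds as PreparedStatement parameters. */
        String whereClause() {
            return "unique_id >= " + startInclusive + " AND unique_id < " + endExclusive;
        }
    }

    /** Returns consecutive ranges covering ids firstId .. firstId + rowCount - 1. */
    static List<Range> split(long firstId, long rowCount, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        long end = firstId + rowCount;                   // one past the last id
        for (long start = firstId; start < end; start += chunkSize) {
            ranges.add(new Range(start, Math.min(start + chunkSize, end)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<Range> ranges = split(1, 20_000_000, 20_000);
        System.out.println(ranges.size() + " ranges, first: " + ranges.get(0).whereClause());
    }
}
```

Unlike OFFSET paging, each range query can be answered from an index on unique_id without producing and discarding the preceding rows, although the total amount of data crossing the wire stays the same.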

I will not repeat what Mike Nakis says, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading.
Nevertheless, I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000
on the client side you handle the result of the query, and that handling is something you could speed up with multiple threads.
On the database side, however, you have no control over how the query is processed, and Vertica is probably designed to execute your query with parallel tasks, as far as the machine allows.
So whether the client splits the work across one, two or three parallel threads should not change much in the end, because a professional database is designed to optimize response time according to the number of requests it receives and the capabilities of the machine.

No, you shouldn't get x/5 seconds. You are not thinking about the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.

In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much of a memory and processing overhead.
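import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Forward-only, read-only cursor, so the driver can stream rows instead of buffering them all
try (PreparedStatement pstmt = conn.prepareStatement(
        sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    pstmt.setFetchSize(Integer.MIN_VALUE); // MySQL Connector/J convention for row-by-row streaming
    try (ResultSet rs = pstmt.executeQuery()) {
        while (rs.next()) {
            // process the current row
        }
    }
}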
Using OFFSET x LIMIT 20000 results in the same query being executed again and again. For 20 million records at 20K records per execution, the query is executed 1000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 alone will take a lot of time, because the query is executed in full and the first 19980000 records are skipped just to return the last 20000.
Using the ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY options and setting the fetch size to Integer.MIN_VALUE, on the other hand, results in the query being executed only ONCE, with the records streamed in chunks that can be processed in a single thread. Note that Integer.MIN_VALUE is the streaming convention of MySQL Connector/J; other drivers typically expect a positive fetch size instead.
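
To put rough numbers on the OFFSET cost described above, here is a small worked calculation. It assumes the server really does produce and discard the skipped rows on every execution, which depends on the database and its indexes; OffsetCost and its method names are hypothetical.

```java
/** Sketch: how many rows a server produces under OFFSET paging versus one streamed pass. */
class OffsetCost {
    /** Number of page queries needed to cover totalRows. */
    static long pages(long totalRows, long pageSize) {
        return (totalRows + pageSize - 1) / pageSize;
    }

    /** Each page re-runs the query, producing (offset + pageSize) rows before discarding the offset. */
    static long rowsProducedWithOffsetPaging(long totalRows, long pageSize) {
        long produced = 0;
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            produced += Math.min(offset + pageSize, totalRows);
        }
        return produced;
    }

    public static void main(String[] args) {
        long total = 20_000_000L;
        long page = 20_000L;
        long paged = rowsProducedWithOffsetPaging(total, page);
        System.out.println("queries executed:     " + pages(total, page));
        System.out.println("rows produced, paged: " + paged);
        System.out.println("rows produced, once:  " + total);
        System.out.println("overhead factor:      " + (double) paged / total);
    }
}
```

Under that assumption the 1000 paged queries produce 20000 * (1 + 2 + ... + 1000) = 10,010,000,000 rows in total, about 500 times the 20 million rows of a single streamed pass.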

Performance in Spring MVC app

I've got an app that adds a file's content row by row into a database. If the file is not too big (smaller than 100 kB) it works well, but I cannot say the same about big files. I found out that one INSERT query takes about 1 ms, so 50k INSERTs take 50 seconds, which I find very slow. This is my plan:
if the file is big enough, do the INSERTs in another thread
if not, do them synchronously
So every user with a big file will start a new thread; I cannot share one instance of this thread between users. Is it a good idea or not? How would you do it?

Two points:
Why don't you use batch updates? I mean sending several inserts to the database at one time. Network round trips cost a lot of time, so batching can increase performance significantly.
Performing the update asynchronously is a good idea, but it doesn't mean that you need to create a new thread per user. A fixed pool of threads (let's say 5) can do the job for all the users.
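
A minimal sketch of both points combined: one fixed pool of 5 threads shared by all users, with each upload written in batches. ImportService and BatchWriter are hypothetical names; BatchWriter stands in for whatever performs the actual JDBC batch insert, so the example runs without a database.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: all uploads share one fixed thread pool; rows are written in batches. */
class ImportService {
    /** Hypothetical abstraction over "insert these rows as one batch". */
    interface BatchWriter {
        void write(List<String> rows);
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(5);
    private final BatchWriter writer;
    private final int batchSize;

    ImportService(BatchWriter writer, int batchSize) {
        this.writer = writer;
        this.batchSize = batchSize;
    }

    /** Queues one uploaded file; the future completes with the number of rows written. */
    CompletableFuture<Integer> submit(List<String> rows) {
        return CompletableFuture.supplyAsync(() -> {
            for (int from = 0; from < rows.size(); from += batchSize) {
                int to = Math.min(from + batchSize, rows.size());
                writer.write(rows.subList(from, to));    // one round trip per batch, not per row
            }
            return rows.size();
        }, pool);
    }

    /** Lets already queued uploads finish, then stops the pool threads. */
    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        ImportService service = new ImportService(
                batch -> System.out.println("inserting " + batch.size() + " rows"), 1_000);
        List<String> upload = Collections.nCopies(2_500, "row");
        int written = service.submit(upload).join();     // a web handler would not block here
        System.out.println("written " + written);
        service.shutdown();
    }
}
```

With a fixed pool, a burst of uploads queues up behind the 5 workers instead of creating one thread per user, so the number of concurrent writers against the database stays bounded.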

Distribute database records evenly across multiple processes

I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them, then moves on to the next 10,000, and so on. To speed things up, I have 25 threads doing the same task (reading + processing), and 4 physical servers running the same Java program. So effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is a SQL procedure that grabs the next 10,000 records and marks them as being processed by a particular thread. However, I have noticed that the threads seem to wait for some time when invoking the procedure and getting a response back. What other strategy can I use to speed up this data selection?
My database server is MySQL and the programming language is Java.

The idiomatic way of handling such a scenario is the producer-consumer design pattern, and the idiomatic way of implementing it in Java land is with JMS.
Essentially you need one master server reading records and pushing them to a JMS queue. Then you have an arbitrary number of consumers reading from that queue and competing with each other. It is up to you how to implement this in detail: do you want to send a message with the whole record or only its ID? All 10,000 records in one message, or one record per message?
Another approach is map-reduce; check out Hadoop. But the learning curve is a bit steeper.

Sounds like a job for Hadoop to me.

I would suspect that you are majorly database IO bound with this scheme. If you are trying to increase performance of your system, I would suggest partitioning your data across multiple database servers if you can do so. MySQL has some partitioning modes that I have no experience with. If you do partition yourself, it can add a lot of complexity to a database schema and you'd have to add some sort of routing layer using a hash mechanism to divide up your records across the multiple partitions somehow. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
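
The hash-based routing layer mentioned above could be as small as the sketch below. ShardRouter is a hypothetical name, and the sketch assumes records are keyed by a numeric id and that the number of database servers is fixed; changing the shard count later would move most records to a different shard.

```java
/** Sketch: map a record id to one of N database partitions with a stable hash. */
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Returns a shard index in [0, shardCount); floorMod keeps negative hashes in range. */
    int shardFor(long recordId) {
        return Math.floorMod(Long.hashCode(recordId), shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        for (long id = 1; id <= 8; id++) {
            System.out.println("record " + id + " -> shard " + router.shardFor(id));
        }
    }
}
```

Each worker would then open its connection against the server chosen by shardFor, so reads and writes for a given record always land on the same partition.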
If you cannot partition your data, then moving your database to an SSD would be a huge win, I suspect; anything that increases the IO rates on those partitions helps. Stay away from RAID5 because of its inherent performance issues. If you need a reliable file system, then mirroring or RAID10 would perform much better, with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database IO bus. This depends on a number of factors, including concurrent queries and database layout. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.

HyperSQL (HSQLDB): massive insert performance

I have an application that has to insert about 13 million rows of about 10 average length strings into an embedded HSQLDB. I've been tweaking things (batch size, single threaded/multithreaded, cached/non-cached tables, MVCC transactions, log_size/no logs, regular calls to checkpoint, ...) and it still takes 7 hours on a 16 core, 12 GB machine.
I chose HSQLDB because I figured I might have a substantial performance gain if I put all of those cores to good use but I'm seriously starting to doubt my decision.
Can anyone show me the silver bullet?

With CACHED tables, disk IO takes most of the time. There is no need for multiple threads because you are inserting into the same table. One thing that noticeably improves performance is reusing a single parameterized PreparedStatement and setting its parameters for each row insert.
On your machine, you can improve IO significantly by using a large NIO limit for memory-mapped IO. For example SET FILES NIO SIZE 8192. A 64 bit JVM is required for larger sizes to have an effect.
http://hsqldb.org/doc/2.0/guide/management-chapt.html
To reduce IO for the duration of the bulk insert use SET FILES LOG FALSE and do not perform a checkpoint until the end of the insert. The details are discussed here:
http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#dec_bulk_operations
UPDATE: An insert test with 16 million rows (results below) produced a 1.9 GB .data file and took just a few minutes on an average 2-core processor with a 7200 RPM disk. The key is a large NIO allocation.
connection time -- 47
complete setup time -- 78 ms
insert time for 16384000 rows -- 384610 ms -- 42598 tps
shutdown time -- 38109
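
The advice above (one reused parameterized PreparedStatement, logging off during the load, a single checkpoint at the end) might look roughly like the sketch below. HsqldbBulkInsert, insertSql and load are hypothetical names, all columns are assumed to be strings, and turning the log back on after the checkpoint is an assumption rather than something stated in the answer. The load method needs an open HSQLDB connection, so main only prints the generated SQL.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Collections;

/** Sketch: bulk insert through one reused PreparedStatement with HSQLDB logging disabled. */
class HsqldbBulkInsert {
    /** Builds "INSERT INTO <table> VALUES (?, ?, ...)" with one placeholder per column. */
    static String insertSql(String table, int columns) {
        return "INSERT INTO " + table + " VALUES ("
                + String.join(", ", Collections.nCopies(columns, "?")) + ")";
    }

    static void load(Connection conn, String table, int columns,
                     Iterable<String[]> rows, int batchSize) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET FILES LOG FALSE");           // from the answer: no log writes during the load
        }
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(insertSql(table, columns))) {
            int pending = 0;
            for (String[] row : rows) {
                for (int i = 0; i < columns; i++) {
                    ps.setString(i + 1, row[i]);         // same statement, new parameters per row
                }
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();                       // flush the final partial batch
                conn.commit();
            }
        }
        try (Statement st = conn.createStatement()) {
            st.execute("CHECKPOINT");                    // single checkpoint once everything is inserted
            st.execute("SET FILES LOG TRUE");            // assumption: restore logging afterwards
        }
    }

    public static void main(String[] args) {
        System.out.println(insertSql("LOG_LINES", 10));
    }
}
```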

Check what your application is doing. The first thing would be to look at resource utilization in Task Manager (or the comparable tool for your OS) and in VisualVM.
Good candidates for causing bad performance:
disk IO
garbage collector

H2Database may give you slightly better performance than HSQLDB (while maintaining syntax compatibility).
In any case, you might want to try a higher delay for syncing to disk to reduce random-access disk I/O (i.e. SET WRITE_DELAY <num>).
Hopefully you're doing bulk INSERT statements, rather than a single insert per row. If not, do that if possible.
Depending on your application requirements, you might be better off with a key-value store than an RDBMS. (Do you regularly need to insert 1.3*10^7 entries?)
Your main limiting factor is going to be random access operations to disk. I highly doubt that anything you're doing will be CPU-bound. (Take a look at top, then compare it to iotop!)

With so many records, maybe you could consider switching to a NoSQL DB. It depends on the nature/format of the data you need to store, of course.
