Right now in my application, at certain points we are logging some heavy data into the log files.
Essentially, purely for logging purposes, we build a JSON representation of the available data and then write it to the log files. Logging the data in JSON format is a business requirement.
Building the JSON from the available data and then writing it to the file takes a lot of time and increases the response time of the original request.
The idea now is to improve this situation.
One of the things that we have discussed is to create a thread pool using
Executors.newSingleThreadExecutor()
in our code and then submit to it a task that converts the data to JSON and does the subsequent logging.
Is this a good approach? Since we would be managing the thread pool ourselves, is it going to create any issues?
I would appreciate it if someone could share better solutions.
Is there some way to use Log4j for this? I tried AsyncAppender but didn't achieve the desired result.
We are using EJB 3, JBoss 5.0, Log4j, and Java 6.
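For reference, the approach under discussion looks roughly like this (a minimal sketch; the payload object and the use of Jackson's ObjectMapper for the JSON conversion are illustrative assumptions, not part of the original code):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.log4j.Logger;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncJsonLogger {

    private static final Logger LOG = Logger.getLogger(AsyncJsonLogger.class);

    // Single background thread; tasks are queued and executed in submission order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final ObjectMapper mapper = new ObjectMapper(); // thread-safe once configured

    /** Hands off JSON serialization and logging so the request thread returns immediately. */
    public void logAsync(final Object payload) {
        executor.submit(new Runnable() {          // Java 6: anonymous class instead of a lambda
            public void run() {
                try {
                    LOG.info(mapper.writeValueAsString(payload));
                } catch (Exception e) {
                    LOG.error("Failed to log payload asynchronously", e);
                }
            }
        });
    }

    /** Call on application shutdown so queued log tasks are not lost. */
    public void shutdown() {
        executor.shutdown();
    }
}
```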
I believe you are on the right track in using a separate thread pool for logging. Many products offer an asynchronous logging feature: logs are accumulated and pushed to the log files by a thread other than the request thread. This matters especially in production environments with millions of incoming requests, where response times need to stay within a few seconds and you cannot afford to let anything such as logging slow the system down. So the approach is to add log entries to an in-memory buffer and flush them asynchronously in reasonably sized chunks.
A word of caution when using a thread pool for logging
As multiple threads will be working on the log file(s) and on the in-memory log buffer, you need to be careful about the logging. Add log entries to a FIFO-style buffer so they end up in the log files sorted by timestamp. Also make sure file access is synchronized, so you don't run into a situation where the log file is out of order or otherwise messed up.
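A minimal sketch of that pattern, assuming a plain file writer and a bounded FIFO queue (class, queue size, and file handling are all illustrative):

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferedFileLogger {

    // Bounded FIFO queue: preserves insertion (timestamp) order and
    // applies back-pressure if the writer cannot keep up.
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<String>(10000);

    public BufferedFileLogger(final String fileName) throws Exception {
        final PrintWriter out = new PrintWriter(new FileWriter(fileName, true));
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        out.println(buffer.take());   // single writer -> no file contention
                        if (buffer.isEmpty()) {
                            out.flush();
                        }
                    }
                } catch (InterruptedException e) {
                    out.flush();
                    out.close();
                }
            }
        }, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Called by request threads; blocks only if the buffer is full. */
    public void log(String message) throws InterruptedException {
        buffer.put(System.currentTimeMillis() + " " + message);
    }
}
```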
Have a look at Logback's AsyncAppender. It already provides a separate worker thread, a queue, etc., and is easily configurable. It does almost the same thing you are doing, but saves you from reinventing the wheel.
Have you considered using MongoDB for logging?
MongoDB inserts can be done asynchronously. One wouldn't want a user's experience to grind to a halt if logging were slow, stalled, or down. MongoDB provides the ability to fire off an insert into a log collection and not wait for a response code. (If one wants a response, one calls getLastError() — we would skip that here.)
Old log data automatically ages out, LRU-style. By using capped collections, we preallocate space for the logs, and once the collection is full, the log wraps around and reuses the specified space. There is no risk of filling up a disk with excessive log information, and no need to write log archival/deletion scripts.
It's fast enough for the problem. First, MongoDB is very fast in general, fast enough for problems like this. Second, when using a capped collection, insertion order is automatically preserved: we don't need to create an index on the timestamp. This makes things even faster, and it matters given that the logging use case has a very high number of writes compared to reads (the opposite of most database problems).
Document-oriented/JSON is a great format for log information. It is very flexible and "schemaless" in the sense that we can throw in an extra field any time we want.
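A minimal sketch with the MongoDB Java driver, assuming a database named `app` and a capped collection named `log` (both names, the 100 MB size, and the log document fields are illustrative; this uses the current driver API with an unacknowledged write concern rather than the older getLastError() style):

```java
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.CreateCollectionOptions;
import org.bson.Document;

public class MongoLogExample {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("app");

        // Capped collection: fixed size, insertion order preserved, oldest entries overwritten.
        db.createCollection("log",
                new CreateCollectionOptions().capped(true).sizeInBytes(100L * 1024 * 1024));

        // UNACKNOWLEDGED ~ fire-and-forget: the driver does not wait for a server response.
        MongoCollection<Document> log = db.getCollection("log")
                .withWriteConcern(WriteConcern.UNACKNOWLEDGED);

        log.insertOne(new Document("ts", System.currentTimeMillis())
                .append("level", "INFO")
                .append("msg", "order processed")
                .append("orderId", 12345));

        client.close();
    }
}
```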
There is also Log4j 2: http://logging.apache.org/log4j/2.x/manual/async.html. Additionally, read this article about why it is so fast: http://www.grobmeier.de/log4j-2-performance-close-to-insane-20072013.html#.UzwywI9Bow4
You can also try CoralLog to asynchronously log data using the disruptor pattern. That way you spend minimum time in the logger thread and all the hard work is passed to the thread doing the actual file I/O. It also provides Memory Mapped Files to speed up the consumer thread and reduce queue contention.
Disclaimer: I am one of the developers of CoralLog
Related
I have a database with millions of records. We read the records one by one using Java and insert them into another system daily, after end of day. We have been told to make this faster.
I told them we would create multiple threads using a thread pool, and these threads would read data in parallel and push it into the other system, but I don't know how to stop the threads from reading the same data twice. How can we make it faster and achieve data consistency as well? In other words, how can we speed this process up using multithreading in Java, or is there any way other than multithreading to achieve it?
One possible solution for your task would be to take the IDs of the records in your database, split them into chunks (e.g. of 1000 each), and call JpaRepository.findAllById(Iterable&lt;ID&gt;) within Runnables passed to ExecutorService.submit(). A sketch of this is shown below.
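A minimal sketch of that idea, assuming a hypothetical `SourceRecord` entity, a Spring Data repository for it, and a `pushToOtherSystem` call into the target system (all of these names are illustrative):

```java
import org.springframework.data.jpa.repository.JpaRepository;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedExporter {

    static class SourceRecord { }                                // hypothetical entity

    private final JpaRepository<SourceRecord, Long> repository;  // injected elsewhere
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public ChunkedExporter(JpaRepository<SourceRecord, Long> repository) {
        this.repository = repository;
    }

    public void exportAll(List<Long> allIds) throws InterruptedException {
        int chunkSize = 1000;
        for (int i = 0; i < allIds.size(); i += chunkSize) {
            // Each task gets a distinct slice of IDs, so no record is read twice.
            final List<Long> chunk =
                    new ArrayList<>(allIds.subList(i, Math.min(i + chunkSize, allIds.size())));
            pool.submit(() -> {
                for (SourceRecord record : repository.findAllById(chunk)) {
                    pushToOtherSystem(record);                    // hypothetical target-system call
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void pushToOtherSystem(SourceRecord record) {
        // write the record to the other system
    }
}
```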
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of "round trips" between the Java application (via the JDBC driver) and the database: stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the DB into memory and process the whole bulk (a sketch follows this list). Consider even larger numbers (like 5000), but you really should measure this; it depends on the memory available to the Java application and other factors. If the bulk causes issues, drop its size back down.
The data itself might not be organized correctly: when you read a bulk of data you might need to order it by some criterion, so make sure the query doesn't do a full table scan, define the indexes properly, etc.
If applicable, talk to your DBA; he or she might provide additional insights about the data management itself: partitioning, storage-related optimizations, etc.
If all this fails and reading from the DB is still a bottleneck, consider redesigning the flow (for instance, publish messages to Kafka if you have it); these can be naturally partitioned so you could scale out the whole process, but this might be beyond the scope of this question.
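A minimal sketch of bulk reading over plain JDBC, assuming a hypothetical `records` table and a `pushToOtherSystem` helper (the connection string, table, and columns are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class BulkReader {

    private static final int BULK_SIZE = 2000;

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:...", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, payload FROM records ORDER BY id")) {

            // Hint to the driver to fetch rows in chunks rather than one
            // network round trip per row (exact behaviour is driver-specific).
            ps.setFetchSize(BULK_SIZE);

            List<String> bulk = new ArrayList<>(BULK_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    bulk.add(rs.getString("payload"));
                    if (bulk.size() == BULK_SIZE) {
                        pushToOtherSystem(bulk);   // hypothetical: hand off the whole chunk
                        bulk.clear();
                    }
                }
            }
            if (!bulk.isEmpty()) {
                pushToOtherSystem(bulk);
            }
        }
    }

    private static void pushToOtherSystem(List<String> bulk) {
        // write the chunk to the target system
    }
}
```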
I am working on a task where I need to delete a very large number of records from MongoDB, sometimes between 2 and 3 million. I am trying to make that as fast as possible.
My idea was to use some kind of thread pool and divide the work across something like 20 threads that each delete a part of the collection. Before I go further with this approach, I would like to know whether it is a good (promising) approach or not. My main concern is that this might not work in Mongo: the DB could exhibit blocking behaviour and the threads would basically wait for each other to finish deleting.
I would also be happy to hear any other approaches/solutions.
The project uses Java/Spring.
Before making anything "as fast as it could be" you need to understand where the bottleneck is (typically CPU, memory or disk) so that your changes actually make a difference.
When it comes to deletes, there is some overhead in the delete operation (client has to send the command to the server, server has to parse it, etc.).
Assuming you have a large number of deletes, using a couple of application threads for deleting may be a good idea to reduce this per-operation overhead, as measured in wall-clock time.
The size of documents being deleted doesn't matter.
If you are assuming that the server will be I/O bound due to document size, then sending more requests to it concurrently wouldn't help at all (in fact that would be counterproductive).
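If you do try concurrent deletes, a minimal sketch with the MongoDB Java driver might look like this (the `events` collection, the `ts` field, the range boundaries, and the choice of 4 threads are all illustrative assumptions):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDelete {
    public static void main(String[] args) throws InterruptedException {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> events = client.getDatabase("app").getCollection("events");

        long minTs = 0L;                          // illustrative lower bound of the data
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000; // older than 30 days
        int threads = 4;
        long step = (cutoff - minTs) / threads;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            final long from = minTs + i * step;
            final long to = (i == threads - 1) ? cutoff : from + step;
            // Each thread deletes a disjoint ts range; the ts field should be indexed.
            pool.submit(() -> events.deleteMany(
                    Filters.and(Filters.gte("ts", from), Filters.lt("ts", to))));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.close();
    }
}
```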
I am trying to log asynchronously in a heavily multi-threaded environment in Java on the Linux platform. What would be a suitable (lock-free) data structure to keep thread contention low?
I need to log gigabytes of messages, and I need to do it in an async/lock-free manner so I don't kill the performance of the main logic (the code that invokes the logger APIs).
Logback has an AsyncAppender that might meet your needs.
The simplest way to do it is to write into multiple files - one for each thread.
Make sure you put timestamps at the start of each record, so it is easier to merge them into a single log file.
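A minimal sketch of the per-thread file idea (the file naming scheme and the timestamp format are illustrative):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class PerThreadLogger {

    // Each thread lazily gets its own writer, so there is no contention on a shared file.
    private static final ThreadLocal<PrintWriter> WRITER = new ThreadLocal<PrintWriter>() {
        @Override
        protected PrintWriter initialValue() {
            try {
                return new PrintWriter(
                        new FileWriter("app-" + Thread.currentThread().getName() + ".log", true),
                        true); // autoflush
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    };

    /** Sortable timestamp first, so merging the files by sorting interleaves them correctly. */
    public static void log(String message) {
        WRITER.get().println(System.currentTimeMillis() + " " + message);
    }
}
```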
example unix command:
cat *.log | sort | less
But for a better / more useful answer you do need to clarify your question by adding a lot more detail.
I would use Java Chronicle, mostly because I wrote it, but I suggest it here because it lets you do lock-free and garbage-free logging with a minimum of OS calls. It requires one log per thread, but I assume you will have kept the number of threads to a minimum already.
I have used this library to write 1 GB/second from two threads. You may find that having more threads will not help as much as you think if logging is a bottleneck for you.
BTW: You have given me an idea of how the log can be updated from multiple threads/processes, but it will take a while to implement and test.
To reduce contention, you can first put log messages in a buffer that is private to each thread. When the buffer is full, hand it to a queue serviced by a separate log thread, which then merges the messages from the different threads and writes them to a file. Note that you need that separate thread in any case, so the working threads are not slowed down while the next buffer is being written to disk.
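A minimal sketch of that buffering scheme (the buffer size and record format are illustrative assumptions):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BufferedAsyncLog {

    private static final int BUFFER_CHARS = 8192;

    // Full buffers from all threads are handed to this queue.
    private static final BlockingQueue<StringBuilder> FULL_BUFFERS =
            new LinkedBlockingQueue<StringBuilder>();

    // Each worker thread accumulates messages locally, with no contention.
    private static final ThreadLocal<StringBuilder> LOCAL = new ThreadLocal<StringBuilder>() {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder(BUFFER_CHARS);
        }
    };

    public static void log(String message) throws InterruptedException {
        StringBuilder buf = LOCAL.get();
        buf.append(System.currentTimeMillis()).append(' ').append(message).append('\n');
        if (buf.length() >= BUFFER_CHARS) {
            FULL_BUFFERS.put(buf);                  // hand the full buffer to the log thread
            LOCAL.set(new StringBuilder(BUFFER_CHARS));
        }
    }

    /** Dedicated log thread: merges buffers and writes them out. */
    public static void runWriterLoop(java.io.Writer out) throws Exception {
        while (true) {
            StringBuilder buf = FULL_BUFFERS.take();
            out.write(buf.toString());              // merge point; sort by timestamp if required
            out.flush();
        }
    }
}
```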
It is impossible to avoid queue contention entirely, as your logging threads will most likely produce messages faster than your writer (disk I/O) thread can keep up with, but with some smart wait strategies and thread pinning you can minimize latency and maximize throughput.
Take a look at CoralLog, developed by Coral Blocks (with which I am affiliated), which uses a lock-free queue and can log a 64-byte message in 52 nanoseconds on average. It is capable of writing more than 5 million messages per second.
I'm writing an application that listens on UDP for incoming messages. My main thread receives message after message from the network and passes each of them to a new thread for handling using an executor.
Each handling thread does the required processing on the message it is responsible for and adds it to a LinkedBlockingQueue that is shared between all the handling threads.
Then, I have a DB worker thread that drains the queue in blocks of 10,000 messages and inserts each block into the DB.
Since the arrival rate of messages may be high (more than 20,000 messages per second), I thought that using LOAD DATA INFILE would be more efficient. So this DB worker thread drains the queue as described above, creates a temporary file containing all the messages in CSV format, and passes the created file to another thread via another executor. This new thread executes the LOAD DATA INFILE statement using JDBC.
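For reference, the DB worker described above looks roughly like this (a minimal sketch; the `messages` table, the CSV layout, and the block size are illustrative assumptions):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class DbWorker implements Runnable {

    private final BlockingQueue<String> queue;   // CSV-formatted rows from the handler threads
    private final Connection connection;         // typically needs local-infile enabled on the URL

    public DbWorker(BlockingQueue<String> queue, Connection connection) {
        this.queue = queue;
        this.connection = connection;
    }

    public void run() {
        List<String> block = new ArrayList<String>(10000);
        try {
            while (true) {
                block.clear();
                block.add(queue.take());                  // wait for at least one message
                queue.drainTo(block, 9999);               // then grab up to a full block

                File csv = File.createTempFile("messages", ".csv");
                PrintWriter out = new PrintWriter(new FileWriter(csv));
                for (String row : block) {
                    out.println(row);
                }
                out.close();

                Statement st = connection.createStatement();
                st.execute("LOAD DATA LOCAL INFILE '" + csv.getAbsolutePath().replace('\\', '/')
                        + "' INTO TABLE messages FIELDS TERMINATED BY ','");
                st.close();
                csv.delete();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```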
After testing my application, I find the performance is not so good, and I'm looking for ways to improve it both at the multithreading level and at the DB access level.
To be precise, I'm using MySQL as the DBMS.
Thanks
You need to determine why your performance is poor.
E.g. it's quite likely you don't need multiple threads if you are writing the data sequentially to a database, which is far more likely to be your bottleneck. The problem with using multiple threads when you don't need them is that they add complexity, which is an overhead in itself, and they can be slower than using a single thread.
I would try to see what the performance is like if you do everything except load the data into the database, i.e. write the file and then discard it.
It's hard to tell without any profiler output, but my (un-)educated guess is that the bottleneck is that you are writing your changes to a file on the hard drive and then prompting your database to read and parse this file. Storage access is always much, much slower than memory access, so this is very likely much slower than just feeding the database the queries from memory.
But that's just guessing. Maybe the bottleneck is somewhere neither you nor I would ever have expected. When you really want to know which part of your application eats how much CPU time, you should use a profiler like Profiler4j to analyze your program.
I am trying to develop a piece of code in Java that will be able to process large amounts of data fetched via the JDBC driver from an SQL database and then persist it back to the DB.
I thought of creating a manager containing one reader thread, one writer thread, and a customizable number of worker threads for processing the data. The reader thread would read the data into DTOs and pass them to a queue labeled 'ready for processing'. The worker threads would process the DTOs and put the processed objects into another queue labeled 'ready for persistence'. The writer thread would persist the data back to the DB. Is such an approach optimal? Or should I perhaps allow more readers for fetching data? Are there any ready-made libraries in Java for doing this sort of thing that I am not aware of?
Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately, and then decide on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.
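A minimal sketch of the two-queue pipeline with bounded queues (the DTO types, queue sizes, thread counts, and the DB access stubs are illustrative assumptions):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {

    static class Dto { }                 // hypothetical raw row
    static class ProcessedDto { }        // hypothetical processed result

    // Bounded queues: if a stage falls behind, the stage before it blocks
    // instead of exhausting memory.
    private final BlockingQueue<Dto> readyForProcessing =
            new ArrayBlockingQueue<Dto>(1000);
    private final BlockingQueue<ProcessedDto> readyForPersistence =
            new ArrayBlockingQueue<ProcessedDto>(1000);

    public void start(int workerCount) {
        new Thread(this::readLoop, "reader").start();
        for (int i = 0; i < workerCount; i++) {
            new Thread(this::workLoop, "worker-" + i).start();
        }
        new Thread(this::writeLoop, "writer").start();
    }

    private void readLoop() {
        try {
            while (true) {
                readyForProcessing.put(fetchNextFromDb());    // hypothetical JDBC read
            }
        } catch (InterruptedException ignored) { }
    }

    private void workLoop() {
        try {
            while (true) {
                readyForPersistence.put(process(readyForProcessing.take()));
            }
        } catch (InterruptedException ignored) { }
    }

    private void writeLoop() {
        try {
            while (true) {
                persist(readyForPersistence.take());          // hypothetical JDBC write
            }
        } catch (InterruptedException ignored) { }
    }

    private Dto fetchNextFromDb() { return new Dto(); }
    private ProcessedDto process(Dto dto) { return new ProcessedDto(); }
    private void persist(ProcessedDto dto) { }
}
```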
I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing, which outperforms procedural row-by-row processing (see the sketch after this list)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
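A minimal sketch of the in-database approach: batch-insert the external data into a staging table, then do the heavy lifting with one set-based statement (the table and column names are purely illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class InDatabaseProcessing {

    /** Load external rows into a staging table in batches instead of row-by-row lookups. */
    public void loadStaging(Connection con, List<String[]> externalRows) throws Exception {
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO staging_purchases (customer_ref, amount) VALUES (?, ?)");
        for (String[] row : externalRows) {
            ps.setString(1, row[0]);
            ps.setBigDecimal(2, new java.math.BigDecimal(row[1]));
            ps.addBatch();
        }
        ps.executeBatch();
        ps.close();
    }

    /** One set-based statement replaces thousands of per-row SELECT + INSERT round trips. */
    public void process(Connection con) throws Exception {
        Statement st = con.createStatement();
        st.executeUpdate(
                "INSERT INTO purchases (customer_id, amount) " +
                "SELECT c.id, s.amount " +
                "FROM staging_purchases s " +
                "JOIN customers c ON c.reference = s.customer_ref");
        st.close();
    }
}
```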
To give an example: a long time ago I implemented a (Java) program whose purpose was to load purchases, payments, and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one by one, and for each piece of data perform several database lookups (SQL) and finally a number of inserts into the appropriate tables. Naturally this did not scale once the volume increased.
Then I made another mistake. I decided that the database was the problem (because I had heard that SELECTs are slow), so I pulled all the data out of the database, did ALL the processing in Java, and finally persisted all the data back to the database. I implemented all kinds of layers with callback mechanisms to make the load process easy to extend, but I just couldn't get it to perform well.
Looking in the rear-view mirror, what I should have done was insert the (laughably small) 100,000 rows temporarily into a table and process them from there. What took nearly half a day to process would have taken a few minutes at most if I had played to the strengths of all the technologies at my disposal.
An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manage the pool of threads.
You're describing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing with it. Parallel and multithreaded processing, several different database readers/writers, and a whole bunch of other things are provided.
Use Spring Batch! That is exactly what you need