kafka vs chronicle queue vs disruptor

kafka vs chronicle queue vs disruptor - java

Anyone can compare the underlying design and performance among kafka, chronicle queue and disruptor, in terms of logging? Seems kafka has most users but don't avoid GC.

I think you might be confused about how Kafka is used in logging pipelines - typically it's for "shipping logs" from a single process (local disk) to a log database like Elasticsearch or Splunk and the performance is going to be on the order of 100K messages/s for a single machine e.g. see https://www.confluent.io/blog/kafka-fastest-messaging-system/. The reason you might use Kafka is to "protect" your database from bursts e.g. https://logz.io/blog/deploying-kafka-with-elk/.
Chronicle Queue and Disrupter would be for simply writing the logs to the local disk and can achieve on the order of 10M lines/s e.g. see https://grobmeier.solutions/log4j-2-performance-close-to-insane-20072013.html#.Ue02Z2RATzc.
You might further wonder what's the point of writing 10M lines/s to disk if you can only ship 100K/s.
The reason is that you probably only write 10M lines/s when something bad is happening (or you're debugging) and only for a short period of time so if you're reading from disk into Kafka then your "reader" will get behind but can eventually catch up once the burst is over.

Related

direct logging on elasticsearch vs using logstash and filebeat

I'm using a Spring Boot back-end to provide some restful API and need to log all of my request-response logs into ElasticSearch.
Which of the following two methods has better performance?
Using Spring Boot ResponseBodyAdvice to log every request and response that is sent to the client directly to ElasticSearch.
Log every request and response into a log file and using filebeat and/or logstash to send them to ElasticSearch.

First off, I assume, that you have a distributed application, otherwise just write your stuff in a log file and that's it
I also assume that you have quite a log of logs to manage, otherwise, if you're planning to log like a couple of messages in a hour, then it doesn't really matter which way you go - both will do the job.
Technically both ways can be implemented, although for the first path I would suggest a different approach, at least I did something similar ~ 5 years ago in one of my projects:
Create a custom log appender that throws everything into some queue (for async processing) and from that took an Apache Flume project that can write stuff to the DB of your choice in a transaction manner with batch support, "all-or-nothing" semantics, etc.
This approach solves issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
If I compare the first and the second option that you've presented,
I think you better off with filebeat / logstash or even both to write to ES, here is why:
When you log in the advice - you will "eat" the resources of your JVM - memory, CPU to maintain ES connections pool, thread pool for doing an actual log (otherwise the business flow might slow down because of logging the requests to ES).
In addition you won't be able to write "in batch" into the elasticsearch without the custom code and instead will have to create an "insert" per log message that might be wasty.
One more "technicality" - what happens if the application gets restarted for some reason, will you be able to write all the logs prior to the restart if everything gets logged in the advice?
Yet another issue - what happens if you want to "rotate" the indexes in the ES, namely create an index with TTL and produce a new index every day.
filebeat/logstash potentially can solve all these issues, however they might require a more complicated setup.
Besides, obviously you'll have more services to deploy and maintain:
logstash is way heavier than filebeat from the resource consumption standpoint, and usually you should parse the log message (usually with grok filter) in logstash.
filebeat is much more "humble" when it comes to the resource consumption, and if you have like many instances to log (really distributed logging, that I've assumed you have anyway) consider putting a service of filebeat (deamon set if you have k8s) on each node from which you'll gather the logs, so that a single filebeat process could handle different instances, and then deploy a cluster of instances of logstash on a separate machine so that they'll do a heavy log-crunching all the time and stream the data to the ES.
How does logstash/filebeat help?
Out of my head:
It will run in its own pace, so even if process goes down, the messages produced by this process will be written to the ES after all
It even can survive short outages of the ES itself I think (should check that)
It can handle different processes written in different technologies, what if tomorrow you'll want to gather logs from the database server, for example, that doesn't have spring/not written java at all
It can handle indices rotation, batch writing internally so you'll end up with effective ES management that otherwise you had to write by yourself.
What are the drawbacks of the logstash/filebeat approach?
Again, out of my head, not a full list or something:
Well, much more data will go through the network all-in-all
If you use "LogEvent" you don't need to parse the string, so this conversion is redundant.
As for performance implications - it basically depends on what do you measure how exactly does your application look like, what hardware do you have, so I'm afraid I won't be able to give you a clear answer on that - you should measure in your concrete case and come up with a way that works for you better.

Not sure if you can expect a clear answer to that. It really depends on your infrastructure and used hardware.
And do you mean by performance the performance of your spring boot backend application or performance in terms of how long it takes for your logs to arrive at ElasticSearch?
I just assume the first one.
When sending the logs directly to ElasticSearch your bottleneck will be the used network and while logging request and responses into a log file first, your bottleneck will probably be the used harddisk and possible max I/O operations.
Normally I would say that sending the logs directly to ElasticSearch via network should be the faster option when you are operating inside your company/network because writing to a disk is always quite slow in comparison. But if you are using fast SSDs the effect should be neglectable. And if you need to send your network packages to a different location/country this can also change fast.
So in summary:
If you have a fast network connection to your ElasticSearch and HDDs/slower SSDs the performance might be better using the network.
If your ElasticSearch is not at your location and you can use fast SSD, writing the logs into a file first might be the faster option.
But in the end you maybe have to try out both approaches, implement some timers and check for yourself.

we are using both solution. first approach have less complexity.
we choose second approach when we dont want to touch the code and have too many instance of app.
about performance. with writing directly on elasticsearch you have better performance because you are not occupying disk I/O. but assume that when the connection between your app and elasticsearch server is dropped. you would have lost log after some retrying attempts.
using rsyslog and logstash is more reliable for big clusters.

How to get threads to stop blocking each other from writing to log file on disk?

My threads have fallen behind schedule and a thread dump reveals they're all stuck in blocking IO writing log output to hard disk. My quick fix is just to reduce logging, which is easy enough to do with respect to my QA requirements. Of course, this isn't vertically scalable which will be a problem soon enough.
I thought about just increasing the thread count but I'm guessing the bottleneck is on file contention and this could be rather bad if it's the wrong thing to do.
I have a lot of ideas but really no idea which are fruitful.
I thought about increasing the thread count but I'm guessing they're bottlenecked so this won't do anything. Is this correct? How to determine it? Could decreasing the threadcount help?
How do I profile the right # of threads to be writing to disk? Is this a function of number of write requests, number of bytes written per second, number of bytes per write op, what else?
Can I toggle a lower-level setting (filesystem, OS, etc.) to reduce locking on a file in exchange for out-of-order lines being possible? Either in my Java application or lower level?
Can I profile my system or hard disk to ensure it's not somehow being overworked? (Vague, but I'm out of my domain here).
So my question is: how to profile to determine the right number of threads that can safely write to a common file? What variables determine this - number of write operations, number of bytes written per second, number of bytes per write request, any OS or hard disk information.
Also is there any way to make the log file more liberal to be written to? We timestamp everything so I'm okay with a minority of out-of-order lines if it reduces blocking.

My threads have fallen behind schedule and a thread dump reveals they're all stuck in blocking IO writing log output to hard disk.
Typically in these situations, I schedule a thread just for logging. Most logging classes (such as PrintStream) are synchronized and write/flush each line of output. By moving to a central logging thread and using some sort of BlockingQueue to queue up log messages to be written, you can make use of a BufferedWriter or some such to limit the individual IO requests. The default buffer size is 8k but you should increase that size. You'll need to make sure that you properly close the stream when your application shuts down.
With a buffered writer, you could additionally write through a GZIPOutputStream which would significantly lower your IO requirements if your log messages repeat a lot.
That said, if you are outputting too much debugging information, you may be SOL and need to either decrease your logging bandwidth or increase your disk IO speeds. Once you've optimized your application, the next steps include moving to SSD on your log server to handle the load. You could also try distributing the log messages to multiple servers to be persisted but a local SSD would most likely be faster.
To simulate the benefits of a SSD, a local RAM disk should give you a pretty good idea about increased IO bandwidth.
I thought about increasing the thread count but I'm guessing they're bottlenecked so this won't do anything. Is this correct?
If all your threads are blocked in IO, then yes, increasing the thread count will not help.
How do I profile the right # of threads to be writing to disk?
Tough question. You are going to have to do some test runs. See the throughput of your application with 10 threads, with 20, etc.. You are trying to maximize your overall transactions processed in some time. Make sure your test runs execute for a couple of minutes for best results. But, it is important to realize that a single thread can easily swamp a disk or network IO stream if it is spewing too much output.
Can I toggle a lower-level setting (filesystem, OS, etc.) to reduce locking on a file in exchange for out-of-order lines being possible? Either in my Java application or lower level?
No. See my buffered thread writer above. This is not about file locking which (I assume) is not happening. This is about number of IO requests/second.
Can I profile my system or hard disk to ensure it's not somehow being overworked? (Vague, but I'm out of my domain here).
If you are IO bound then the IO is slowing you down so it is "overworked". Moving to a SSD or RAM disk is an easy test to see if your application runs faster.

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This maybe not possible but I thought I might just give it a try. I have some work that process some data, it makes 3 decisions with each data it proceses: keep, discard or modify/reprocess(because its unsure to keep/discard). This generates a very large amount of data because the reprocess may break the data into many different parts.
My initial method was to send it to my executionservice that was processing the data but because the number of items to process was large I would run out of memory very quickly. Then I decided to maybe offload the queue off to a messaging server(rabbitmq) which works fine but now I'm bound by network IO. What I like about rabbitmq is it keeps messages in memory up to a certain level and then dumps old messages to the local drive so if I have 8 gigs of memory on my server I can still have a 100 gig message queue.
So my question is, is there any library that has a similar feature in Java? Something that I can use as a nonblocking queue that keeps only X items in queue(either by number of items or size) and writes the rest to the local drive.
note: Right now I'm only asking for this to be used on one server. In the future I might add more servers but because each server is self-generating data I would try to take messages from one queue and push them to another if one server's queue is empty. The library would not need to have network access but I would need to access the queue from another Java process. I know this is a long shot but thought if anyone knew it would be SO.

Not sure if it id the approach you are looking for, but why not using a lightweight database like hsqldb and a persistence layer like hibernate? You can have your messages in memory, then commit to db to save on disk, and later query them, with a convenient SQL query.

Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" provided, you can specify the maximum amount of memory used, exceeding data will be sent to the hard drive.

Use the filesystem. It's old-school, yet so many engineers get bitten with libraries because they are lazy. True that HSQLDB provides lots of value-add features, but in the context of being light weight....

Synchronizing several threads writing to the same file in java

Of course there is the obvious way of using "synchronized".
But I'm creating a system designed for running on several cores
and writing to that file various times at the same milisecond.
So I believe that using synchronize will hurt performance badly.
I was thinking of using the Pipe class of java (but not sure if it will help)
or having each thread write to a different file and an additional thread collecting
those writings, creating the final result.
I should mention that the order of the writings isn't important and it is timestamped
in nanotime anyway.
What is the better idea of those two? have any other suggestions?
thanks.

Using some sort of synchronization (eg. single mutexes) is quite easy to implement.
If I had enough RAM I would create a queue of some sort for each log-producer thread, and a log-consumer thread to read from all the queues in a round-robin fashion (or something like that).

Not a direct answer to your question, but the logback project has synchronization facilities built into it for writing to the same file from different threads, so you might try to use it if it suits your needs, or at least take a look at it's source code. Since it's built for speed, I'm pretty sure the implementation isn't to be taken for granted.

You are right to be concerned, you are not going to be able to have all the threads write to the same file without a performance problem.
When I had this problem (writing my own logging, way back before Log4j) I created two fixed-size buffers in memory and had all the producer threads write to one buffer while a dedicated consumer thread read from the other buffer and wrote to a file. That way the writer threads had to synchronize only on getting and incrementing the index to the buffer and when the buffers were being swapped, and it only blocked when the current buffer was full. It was memory-intensive but fast.
For other ideas you could check out how loggers like Log4j and Logback work, they would have had to solve this problem.

Try to use JMS. All your processes running on different machines may send JMS message instead of writing to file. Create only one queue receiver that receives messages and writes them to file. Log4J already has this functionality: see JMSAppender.

Fast Oracle Select [Huge Data]

I have a project whereby I'm reading huge volumes of data from an Oracle database from Java.
I have the feeling that the application we are writing is going to process the data far faster than it will be given to us using a single threaded SELECT query and so I've been trying to research faster ways of obtaining the data.
Does anyone have anything I could read that would help me with my plight?

You haven't given us a lot of information on why it will be necessary to bring "huge volumes of data" into the Java application instead of processing it on the database side. Although there can be exceptions, usually this is signal to re-think the design. As a general rule with Oracle it is most efficient to do as much work as you can with pure set operations (SQL), followed by procedural processing with the rdbms engine (PL/SQL) before bringing results back to the client application.

Oracle supports parallel DML. In particular this applies to SELECT queries. Ultimately the bottleneck will probably be the IO read speed. Either use faster disks or stripe the data accross many disks.
Update
As APC noted in the comments Parallel Queries/DML is an Entreprise Edition feature and is not available in the Standard Edition.
Also, Parallel DML/Query is not the solution to all performance problems. Since more than one process will be used by the query it may improve throughput, but at the cost of concurrency. The purpose of parallelism is to use more resources to process the query faster. If the query is IO-bound or CPU-bound, there is no extra resources to use and adding parallelism will only make matter worse.
From the link above:
Parallel execution is not normally
useful for:
Environments in which the CPU, memory, or I/O resources are already
heavily utilized. Parallel execution
is designed to exploit additional
available hardware resources; if no
such resources are available, then
parallel execution will not yield any
benefits and indeed may be detrimental
to performance.

Use the setFetchSize(int) method on the Statement or PreparedStatement before you open the query. You should experiment with different sizes. Try 75 as a starting point.
On a slightly different useage, people have said that the PL/SQL bulk fetch "sweet spot" is between 2000 and 3000 but I saw one benchmark that indicated that 75 was optimum.
A large fetch size will tend to reduce the number of round trips between client and server. But if it is too large the database has to have a big buffer and the networking software may have to break up the big message into a lot of packets.

Firstly, 'huge data' to database people is [at least] gigabytes, in which case I suspect your problems are going to be reading those sort of volumes into your processes memory and aggregating them there. Why do you think a single-threaded select will be the bottleneck ?
If the bottleneck were getting the data from disk, then having multiple threads pulling data from the same disk wouldn't necessarily be faster and may even be slower. But if you could spread the data over separate disks, separate threads would be faster. If, using SSD, you don't think disks will be a contention point,we can look elsewhere.
If the bottleneck was network bandwidth, again multiple threads wouldn't fit any more data through the pipe any faster. You may even benefit from unloading the data to a flat file, compressing it and transferring that.
If the select is being sorted or comes from a hash-join, you may use memory more efficiently with a single thread. Multiple sessions would have to share the machine's memory.
If there is a CPU intensive processing then multiple threads may help. That could be as simple as having multiple connections from java, each getting a different slice of data (eg A-K and L-Z), but it would very much depend on the SELECT.
I agree with dpbradley that you should determine the bottleneck first. If you have the data and select, it should be simple enough to determine how long it takes (both on the local machine and through the network), and a trace would be a necessary starting point to really go into how it could be speeded up.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.