I am interested in implementing the following simple flow:
A client sends a simple message to a server process, which the server stores. Since the message has no hierarchical structure, IMO the best approach is to save it in a file instead of a relational database.
But I want to figure out how to optimize this, since as I see it there are two choices:

1. The server sends a 200 OK to the client and then stores the message, so the client does not notice any delay.

2. The server saves the message and then sends the 200 OK, but then the client notices the overhead of the file I/O.
I prefer the performance of (1), but this could lead to the client thinking all went OK when the message was actually never saved (in various error cases).
So I was wondering whether I could use NIO and memory-mapped files.
But is this a good candidate for memory-mapped files? Would using a memory-mapped file guarantee that, e.g., if the process crashed, the message would still be saved?
In my mind the flow would involve creating/opening and closing many files, so is this a good use case for memory-mapping files?
The server saves the message and then sends the 200 OK, but then the client notices the overhead of the file I/O.
I suggest you test this. I doubt a human will notice a 10-millisecond delay, and I expect you should do better than that for smaller messages.
So I was wondering whether I could use NIO and memory-mapped files.
I use memory mapping as it can reduce the overhead per write by up to 5 microseconds. Is this important to you? If not, I would stick with the simplest approach.
Would using a memory-mapped file guarantee that, e.g., if the process crashed, the message would still be saved?
As long as the OS doesn't crash, yes.
In my mind the flow would involve creating/opening and closing many files, so is this a good use case for memory-mapping files?
Opening and closing files is likely to be far more expensive than writing the data (by an order of magnitude), so I would suggest keeping such operations to a minimum.
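To make the "open once, write many" point concrete, here is a minimal sketch of appending length-prefixed messages to a single pre-mapped region. MappedLog and its method names are illustrative, not from any library:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedLog {
    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    // Map one region up front and reuse it for many messages,
    // rather than opening and closing a file per message.
    public MappedLog(Path path, int size) throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
    }

    public void append(String msg) {
        byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
        buffer.putInt(bytes.length); // length prefix so a reader can recover messages
        buffer.put(bytes);
    }

    // If only the JVM dies, the OS still flushes the dirty pages eventually;
    // force() asks the OS to write them to the device now.
    public void sync() { buffer.force(); }

    public void close() throws IOException { channel.close(); }
}
```

The key property, matching the answer above: a write to the mapped buffer survives a JVM crash because the dirty page belongs to the OS, but only force() (at some I/O cost) also protects against an OS crash.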
You might find this library of mine interesting: https://github.com/peter-lawrey/Java-Chronicle It allows you to persist messages in single-digit microseconds for text and sub-microsecond for a small binary message.
I'm using a Spring Boot back-end to provide some restful API and need to log all of my request-response logs into ElasticSearch.
Which of the following two methods has better performance?
1. Use Spring Boot's ResponseBodyAdvice to log every request and response sent to the client directly to ElasticSearch.
2. Log every request and response into a log file and use filebeat and/or logstash to send them to ElasticSearch.
First off, I assume that you have a distributed application; otherwise just write your stuff to a log file and that's it.
I also assume that you have quite a lot of logs to manage; otherwise, if you're planning to log a couple of messages an hour, it doesn't really matter which way you go - both will do the job.
Technically both ways can be implemented, although for the first option I would suggest a different approach; at least, I did something similar about 5 years ago in one of my projects:
I created a custom log appender that threw everything onto a queue (for async processing), and from there used the Apache Flume project, which can write to the DB of your choice in a transactional manner with batch support, "all-or-nothing" semantics, etc.
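The appender-plus-queue idea boils down to something like the following sketch. AsyncBatcher is a made-up name; a real appender would plug into Logback/Log4j, and the sink would be a Flume or bulk-ES client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.function.Consumer;

// Hypothetical AsyncBatcher: log() only enqueues; a background thread
// drains the queue and hands batches to a sink (e.g. a bulk writer).
public class AsyncBatcher {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final Thread drainer;

    public AsyncBatcher(Consumer<List<String>> batchSink, int maxBatch) {
        drainer = new Thread(() -> {
            try {
                while (true) {
                    List<String> batch = new ArrayList<>(maxBatch);
                    batch.add(queue.take());            // block for the first event
                    queue.drainTo(batch, maxBatch - 1); // then grab whatever else is ready
                    batchSink.accept(batch);
                }
            } catch (InterruptedException e) { /* shutdown */ }
        });
        drainer.setDaemon(true);
        drainer.start();
    }

    // Returns false instead of blocking the business thread when the queue is full.
    public boolean log(String event) { return queue.offer(event); }
}
```

The business thread never waits on the network; the bounded queue caps memory use, and the batch sink decides the write semantics.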
This approach solves issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
If I compare the first and the second option you've presented, I think you're better off with filebeat/logstash (or even both) writing to ES. Here is why:
When you log in the advice, you "eat" your JVM's resources: memory and CPU to maintain the ES connection pool, plus a thread pool for doing the actual logging (otherwise the business flow might slow down because of logging requests to ES).
In addition, you won't be able to write to ElasticSearch "in batch" without custom code, and will instead have to issue an "insert" per log message, which can be wasteful.
One more "technicality": what happens if the application gets restarted for some reason? Will you be able to write all the logs from before the restart if everything gets logged in the advice?
Yet another issue: what happens if you want to "rotate" the indices in ES, i.e. create an index with a TTL and produce a new index every day?
filebeat/logstash potentially can solve all these issues, however they might require a more complicated setup.
Besides, obviously you'll have more services to deploy and maintain:
logstash is way heavier than filebeat from a resource-consumption standpoint, and you usually have to parse the log message (typically with a grok filter) in logstash.
filebeat is much more "humble" when it comes to resource consumption. If you have many instances to log from (really distributed logging, which I've assumed you have anyway), consider putting a filebeat service (a DaemonSet if you're on k8s) on each node from which you'll gather the logs, so that a single filebeat process can handle different instances. Then deploy a cluster of logstash instances on a separate machine so that they do the heavy log-crunching all the time and stream the data to ES.
How does logstash/filebeat help?
Out of my head:
It runs at its own pace, so even if the process goes down, the messages produced by that process will still be written to ES eventually
It can probably even survive short outages of ES itself (you should check that)
It can handle processes written in different technologies - what if tomorrow you want to gather logs from the database server, for example, which doesn't use Spring and isn't written in Java at all
It handles index rotation and batch writing internally, so you end up with effective ES management that you would otherwise have to write yourself
What are the drawbacks of the logstash/filebeat approach?
Again, off the top of my head, not a full list:
Well, much more data will go through the network all in all
If you log structured "LogEvent" objects directly, you don't need to parse the string back, so going through a text file makes that conversion redundant
As for performance implications: it basically depends on what you measure, what exactly your application looks like, and what hardware you have, so I'm afraid I can't give you a clear answer on that - you should measure in your concrete case and pick the way that works better for you.
Not sure if you can expect a clear answer to that. It really depends on your infrastructure and used hardware.
And do you mean by performance the performance of your spring boot backend application or performance in terms of how long it takes for your logs to arrive at ElasticSearch?
I just assume the first one.
When sending the logs directly to ElasticSearch your bottleneck will be the used network and while logging request and responses into a log file first, your bottleneck will probably be the used harddisk and possible max I/O operations.
Normally I would say that sending the logs directly to ElasticSearch over the network should be the faster option when you are operating inside your own company/network, because writing to disk is always quite slow in comparison. But if you are using fast SSDs the effect should be negligible. And if you need to send your network packets to a different location/country, this can also change quickly.
So in summary:
If you have a fast network connection to your ElasticSearch and HDDs/slower SSDs the performance might be better using the network.
If your ElasticSearch is not at your location and you can use fast SSD, writing the logs into a file first might be the faster option.
But in the end you maybe have to try out both approaches, implement some timers and check for yourself.
We are using both solutions. The first approach has less complexity.
We choose the second approach when we don't want to touch the code and have too many instances of the app.
About performance: writing directly to ElasticSearch gives better performance because you are not occupying disk I/O. But suppose the connection between your app and the ElasticSearch server drops: you would lose logs after some retry attempts.
Using rsyslog and logstash is more reliable for big clusters.
My java application logs a fair amount of information to a logfile on disk. Some of this logged information is more important than the rest; except that in rare cases the less-important info is needed to explain to the end-user why the code in production took a certain decision.
I was wondering if it will be a good idea to log the less important information to a socket instead of the file on disk. Is socket write significantly faster than disk write?
Update: Basically, I wanted to log to a socket in the same subnet or even the same machine, assuming that it would be faster than writing to disk. Another process (not part of my application) would then read from that socket at its convenience. I was thinking this would be logstash pulling from a socket. Async logging to disk using another thread is another alternative but I wanted to consider the socket option first if that is an easy solution with minimal performance hit.
You have a few choices:
local storage is usually faster than network
you could use async logging to disk, so your process fires and forgets (which is fast!)
logstash can read from Unix domain sockets if you are on *nix; these are usually faster than disk I/O
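On Java 16+ the Unix-domain-socket option is available directly in NIO. A sketch of one log line making the round trip (the helper name and paths are illustrative; in real use the reading side would be logstash):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class UdsLogDemo {
    // Sends one log line over a Unix domain socket and returns what the
    // collector side received.
    static String roundTrip(String line) throws Exception {
        Path sock = Files.createTempDirectory("uds").resolve("log.sock");
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(sock);

        StringBuilder received = new StringBuilder();
        ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX);
        server.bind(addr);
        Thread collector = new Thread(() -> {
            try (SocketChannel ch = server.accept()) {
                ByteBuffer buf = ByteBuffer.allocate(256);
                while (ch.read(buf) != -1) {  // read until the writer closes
                    buf.flip();
                    received.append(StandardCharsets.UTF_8.decode(buf));
                    buf.clear();
                }
            } catch (IOException e) { throw new UncheckedIOException(e); }
        });
        collector.start();

        // Application side: a local, kernel-buffered write - no disk I/O on this path.
        try (SocketChannel ch = SocketChannel.open(addr)) {
            ch.write(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
        }
        collector.join();
        server.close();
        Files.deleteIfExists(sock);
        return received.toString();
    }
}
```

Note the application's write completes as soon as the kernel buffers the bytes, which is exactly where the buffering question below comes in.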
If you are writing somewhere fast and the data is forwarded from there in a slower fashion (logstash shipping over the network to some Elastic instance), where does the buffering happen? Such a setup will build up a growing backlog of messages yet to be shipped if logging happens at a high rate for a prolonged period of time.
In the above scenarios buffering will happen (respectively):
direct sync write to disk: final log file on the disk is the buffer
async logging framework: buffers can eat into your heap or process memory (when outside the heap, or in some kernel area, it is still RAM)
unix domain sockets: buffered in the kernel space, so RAM again
In the last 2 options things will get increasingly creaky in constant high volume scenario.
Test and profile...
or just log to the local disk and rotate the files, deleting old ones.
A socket is not a destination; it's a transport. Your question "send data to socket" should therefore be rephrased as "send data to the network", "send data to disk" or "send data to another process".
In all these cases, the socket itself is unlikely to be the bottleneck. The bottleneck will be the network, the disk, or the application's CPU usage, depending on where the data from the socket actually goes. At the OS level, sockets are usually implemented as a zero-copy mechanism, which means the data is essentially passed to the other side as a pointer and is therefore highly efficient.
This may not be possible, but I thought I'd give it a try. I have a job that processes some data and makes one of three decisions for each item it processes: keep, discard, or modify/reprocess (because it's unsure whether to keep or discard). This generates a very large amount of data, because reprocessing may break the data into many different parts.
My initial method was to send it to the ExecutorService that was processing the data, but because the number of items to process was large, I would run out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network I/O. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps older messages to the local drive, so if I have 8 GB of memory on my server I can still have a 100 GB message queue.
So my question is: is there any library with a similar feature in Java? Something I can use as a non-blocking queue that keeps only X items in memory (either by number of items or by size) and writes the rest to the local drive.
Note: right now I'm only asking for this to be used on one server. In the future I might add more servers, but because each server generates its own data, I would try to take messages from one queue and push them to another when one server's queue is empty. The library would not need network access, but I would need to access the queue from another Java process. I know this is a long shot, but I thought if anyone knew, it would be SO.
Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, commit to the DB to save them on disk, and later query them with a convenient SQL query.
Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached tables" it provides, you can specify the maximum amount of memory used; data exceeding that will be sent to the hard drive.
Use the filesystem. It's old-school, yet so many engineers get bitten by libraries because they are lazy. True, HSQLDB provides lots of value-added features, but in the context of being lightweight...
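In the filesystem spirit, the overflow behaviour the question asks for can be sketched in plain Java. SpillQueue is a made-up name; this is single-threaded, assumes items contain no newlines, and has no crash recovery - it only shows the shape of the idea:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;

// At most maxInMemory items stay on the heap; the overflow is appended
// to a file and reloaded when the in-memory portion drains.
public class SpillQueue {
    private final ArrayDeque<String> memory = new ArrayDeque<>();
    private final int maxInMemory;
    private final Path spillFile;
    private long spilled; // items currently parked on disk

    public SpillQueue(Path spillFile, int maxInMemory) {
        this.spillFile = spillFile;
        this.maxInMemory = maxInMemory;
    }

    public void offer(String item) throws IOException {
        // Once anything has spilled, keep appending to disk to preserve FIFO order.
        if (memory.size() < maxInMemory && spilled == 0) {
            memory.add(item);
        } else {
            Files.writeString(spillFile, item + "\n", StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            spilled++;
        }
    }

    public String poll() throws IOException {
        String head = memory.poll();
        if (memory.isEmpty() && spilled > 0) reload();
        return head;
    }

    private void reload() throws IOException {
        // Naive: pull everything back at once; a real implementation would read a chunk.
        for (String line : Files.readAllLines(spillFile, StandardCharsets.UTF_8)) memory.add(line);
        spilled = 0;
        Files.delete(spillFile);
    }
}
```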
What's the best way to go about things in terms of speed/performance?
Where do things like "Apache Thrift" come in and what are the benefits?
Please add some good resources I can use to learn about any recommendations!
Thanks all
Presuming you mean both processes are already running, then it's going to be via sockets.
Writing a file to the disk from one process then reading it from the other is going to incur the performance hit of the disk write and read (and of course whatever method you employ to keep the reader from accessing the file until it's done being written; either locks or an atomic rename on the disk).
Even ignoring that, your localhost interface is going to have a faster transfer rate than your disk controller, with the possible exception of a 10Gb fiber channel RAID array with 15k RPM drives in it.
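The atomic-rename handoff mentioned above is straightforward with NIO (names here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPublish {
    // Write to a temp file in the same directory, then rename it into place;
    // a same-directory rename is atomic on POSIX filesystems, so the reader
    // can never observe a half-written file.
    static void publish(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```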
Try it out. There's just no other way to find out.
Using sockets or the file system should be comparably fast, since both methods rely on some system calls that are very similar.
Always be aware that this communication involves these steps:
Encoding your data to a stream of bytes (JSON, XML, YAML, X.509 DER, Java Serialization)
Transferring this stream of bytes (TCP socket, UNIX socket, filesystem, ramdisk, pipes)
Decoding the stream of bytes into data (same as step 1)
Step 1 and 2 are completely independent, so take that into account when you benchmark.
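That independence is easy to exploit when benchmarking: encode to an in-memory byte array first, with no transport involved at all. Java Serialization below stands in for whichever codec you choose in step 1:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class EncodeOnly {
    // Step 1 alone: encode to bytes in memory, no socket or file involved.
    static byte[] encode(Serializable data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(data);
        }
        return bytes.toByteArray();
    }

    // Step 3 alone: decode the same bytes, whatever transport carried them.
    static Object decode(byte[] raw) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }
}
```

Timing encode/decode separately from the transfer tells you which of the three steps actually dominates.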
I'm writing code that generates a large XML document and writes it directly into a client stream using the StAX XMLStreamWriter.
I'm afraid that if the network becomes extremely slow, the bytes written into the stream will actually stay in memory buffers for a relatively long time and consume a lot of memory on my server.
My question is: is there any way I can keep writing directly to the client stream, and avoid the potential memory problem I described above?
There wouldn't seem to be: if you generate faster than you can stream out, the data has to sit in memory. If this becomes a major problem, you would need a way to move it out of memory, such as generating a file, but that file still needs to be loaded and streamed by something. The main advantage of a file is that you can reuse it for many requests.
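One mitigation, assuming the container hands you a blocking OutputStream: flush the StAX writer regularly, so bytes go to the client (and a slow client makes your generator block, i.e. backpressure) rather than piling up in your own buffers. A sketch with made-up names:

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.io.OutputStream;

public class StaxStream {
    // Writes rows straight to the (client) stream; only StAX's small internal
    // buffer is held in memory, never the whole document.
    static void writeRows(OutputStream clientStream, int rows) throws XMLStreamException {
        XMLStreamWriter w = XMLOutputFactory.newFactory()
                .createXMLStreamWriter(clientStream, "UTF-8");
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("rows");
        for (int i = 0; i < rows; i++) {
            w.writeStartElement("row");
            w.writeCharacters(Integer.toString(i));
            w.writeEndElement();
            w.flush(); // hand bytes to the stream now; a slow client blocks us here
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
    }
}
```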