Generic Log Parser Algorithm


My application writes logs while it runs. I need to find out whether indexing has completed by checking whether a status message has been written to the logs (note that logging happens dynamically while the process is running). The application does not send any signal when indexing completes; it just logs the fact and carries on with other work. Should I poll the logs continuously to see whether the status has been written? That seems like an anti-pattern or bad design. I can't use busy-waiting or a do-nothing loop followed by a check either, which is just as bad. What is the best way to detect the entry in the logs without querying them repeatedly, and while consuming as few CPU cycles as possible?

Polling is the usual solution. Other solutions require the
collaboration of the generating process in some way; if this is
possible, it's obviously a preferable solution, but if the generating
process is to remain unaware of the listener (in the sense of not
knowing about its existence), then polling is about the only valid
solution. (Depending on the logging facilities, you might be able to
arrange for the log to go into a named pipe, and read that.)
Note that polling isn't necessarily that expensive, if you aren't doing
it too often.

If you control both programs (i.e. the one writing the log and the one reading it), then the easiest solution is to have the writer notify all listeners when it is done, using some form of inter-process communication (e.g. signals).
Only if IPC is not possible should you look at smarter methods of waiting or polling for changes. Most operating systems let you register a callback for when a file or directory is modified. Take a look at this question for some suggestions.

Assuming that log parsing is the only alternative you have, the idiom you are looking for has the following high-level representation (UNIX CLI style):
# tail -f logfile.txt | grep STATUS_PATTERN
Here (1) "tail -f" prints any new lines that are appended to logfile.txt and passes them to (2) "grep", which performs the actual pattern matching.
Both (1) and (2) are trivial to implement in Java/C++ as a separate thread/process, and they impose a lighter load than periodic polling of the whole log.
You will also need a little extra functionality to detect log rotation.
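As a rough illustration of the "tail -f | grep" idea in Java, here is a minimal sketch. The class name LogFollower and its methods are made up for this example, and it makes two simplifying assumptions: the log is UTF-8 text, and rotation is detected only when the file becomes shorter than the last read position.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: returns only the matching lines appended since the previous call. */
final class LogFollower {
    private final Path file;
    private final Pattern pattern;
    private long offset = 0; // position of the first byte that has not been consumed yet

    LogFollower(Path file, String regex) {
        this.file = file;
        this.pattern = Pattern.compile(regex);
    }

    List<String> pollNewMatches() throws IOException {
        List<String> matches = new ArrayList<>();
        if (!Files.exists(file)) {
            return matches;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (length < offset) {
                offset = 0; // file shrank: assume it was rotated or truncated
            }
            if (length == offset) {
                return matches; // nothing new, so nothing is read or parsed
            }
            // Sketch only: assumes the unread part fits comfortably in memory.
            byte[] chunk = new byte[(int) (length - offset)];
            raf.seek(offset);
            raf.readFully(chunk);

            // Consume up to the last newline so a half-written line is retried next time.
            int lastNewline = -1;
            for (int i = chunk.length - 1; i >= 0; i--) {
                if (chunk[i] == '\n') {
                    lastNewline = i;
                    break;
                }
            }
            if (lastNewline < 0) {
                return matches;
            }
            offset += lastNewline + 1;

            String text = new String(chunk, 0, lastNewline, StandardCharsets.UTF_8);
            for (String raw : text.split("\n")) {
                String line = raw.endsWith("\r") ? raw.substring(0, raw.length() - 1) : raw;
                if (pattern.matcher(line).find()) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}
```

Calling pollNewMatches() from a scheduled task (say, once a second) only costs a file-length check when nothing has been appended, which is why this kind of polling tends to be cheap.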

Related

direct logging on elasticsearch vs using logstash and filebeat

I'm using a Spring Boot back-end to provide some restful API and need to log all of my request-response logs into ElasticSearch.
Which of the following two methods has better performance?
Using Spring Boot ResponseBodyAdvice to log every request and response that is sent to the client directly to ElasticSearch.
Log every request and response into a log file and using filebeat and/or logstash to send them to ElasticSearch.

First off, I assume that you have a distributed application; otherwise just write your stuff to a log file and that's it.
I also assume that you have quite a lot of logs to manage; if you're planning to log a couple of messages an hour, it doesn't really matter which way you go, as both will do the job.
Technically both ways can be implemented, although for the first path I would suggest a different approach; at least, I did something similar ~5 years ago in one of my projects:
Create a custom log appender that puts everything into a queue (for async processing), and from there use Apache Flume, which can write to the DB of your choice transactionally, with batch support, "all-or-nothing" semantics, etc.
This approach solves issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
If I compare the first and the second option that you've presented,
I think you are better off with filebeat / logstash, or even both, to write to ES. Here is why:
When you log in the advice, you "eat" the resources of your JVM: memory and CPU to maintain an ES connection pool, plus a thread pool for doing the actual logging (otherwise the business flow might slow down because of logging the requests to ES).
In addition, you won't be able to write "in batch" to Elasticsearch without custom code, and will instead have to create an "insert" per log message, which might be wasteful.
One more "technicality": what happens if the application gets restarted for some reason? Will you be able to write all the logs produced prior to the restart if everything gets logged in the advice?
Yet another issue: what happens if you want to "rotate" the indexes in ES, namely create an index with a TTL and produce a new index every day?
filebeat/logstash potentially can solve all these issues, however they might require a more complicated setup.
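To give an idea of the custom code that batching and daily indices would require if you wrote to ES yourself, here is a small sketch that only builds the newline-delimited body expected by Elasticsearch's _bulk endpoint. The class name, the "app-logs" prefix used in the usage note and the date-suffixed index naming are assumptions for illustration; sending the body over HTTP, retries and error handling are left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;

/** Hypothetical helper that prepares one bulk request for a batch of log documents. */
final class BulkBody {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    /** Assumed naming scheme: one index per day, e.g. "app-logs-2024.01.31". */
    static String dailyIndex(String prefix, LocalDate day) {
        return prefix + "-" + DAY.format(day);
    }

    /**
     * Builds the NDJSON body: for every document, one action line followed by the document itself.
     * Each element of jsonDocs must already be a single-line JSON object.
     */
    static String build(String index, List<String> jsonDocs) {
        StringBuilder body = new StringBuilder();
        for (String doc : jsonDocs) {
            body.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            body.append(doc).append('\n'); // the bulk body must end with a newline
        }
        return body.toString();
    }
}
```

For example, BulkBody.build(BulkBody.dailyIndex("app-logs", LocalDate.now()), docs) yields a single request body for the whole batch instead of one insert per log message.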
Besides, obviously you'll have more services to deploy and maintain:
logstash is way heavier than filebeat from the resource consumption standpoint, and usually you should parse the log message (usually with grok filter) in logstash.
filebeat is much more "humble" when it comes to resource consumption. If you have many instances to log (really distributed logging, which I've assumed you have anyway), consider putting a filebeat service (a DaemonSet if you use k8s) on each node from which you gather logs, so that a single filebeat process can handle several instances, and then deploy a cluster of logstash instances on separate machines so that they do the heavy log-crunching all the time and stream the data to ES.
How does logstash/filebeat help?
Off the top of my head:
It runs at its own pace, so even if the process goes down, the messages produced by this process will still be written to ES.
It can even survive short outages of ES itself, I think (should check that).
It can handle different processes written in different technologies. What if tomorrow you want to gather logs from the database server, for example, which doesn't use Spring or isn't written in Java at all?
It can handle index rotation and batch writing internally, so you end up with efficient ES management that you would otherwise have to write yourself.
What are the drawbacks of the logstash/filebeat approach?
Again, off the top of my head, and not a full list:
Well, much more data will go through the network all-in-all
If you use "LogEvent" you don't need to parse the string, so this conversion is redundant.
As for performance implications, it basically depends on what you measure, what exactly your application looks like and what hardware you have, so I'm afraid I won't be able to give you a clear answer on that. You should measure in your concrete case and pick the way that works better for you.

Not sure if you can expect a clear answer to that. It really depends on your infrastructure and used hardware.
And do you mean by performance the performance of your spring boot backend application or performance in terms of how long it takes for your logs to arrive at ElasticSearch?
I just assume the first one.
When sending the logs directly to ElasticSearch, your bottleneck will be the network; when logging requests and responses to a log file first, your bottleneck will probably be the hard disk and its maximum I/O operations.
Normally I would say that sending the logs directly to ElasticSearch over the network should be the faster option when you are operating inside your company network, because writing to a disk is quite slow in comparison. But if you are using fast SSDs, the effect should be negligible. And if you need to send your network packets to a different location/country, this can also change quickly.
So in summary:
If you have a fast network connection to your ElasticSearch and HDDs/slower SSDs the performance might be better using the network.
If your ElasticSearch is not at your location and you can use fast SSD, writing the logs into a file first might be the faster option.
But in the end you may have to try out both approaches, implement some timers and check for yourself.

We are using both solutions. The first approach has less complexity.
We choose the second approach when we don't want to touch the code and have too many instances of the app.
About performance: writing directly to Elasticsearch gives you better performance because you are not occupying disk I/O. But consider what happens when the connection between your app and the Elasticsearch server is dropped: you will have lost logs after some retry attempts.
Using rsyslog and logstash is more reliable for big clusters.

How do I make 2 Java applications talk with each other?

I have 2 Java applications. The first I may edit as much as I wish, but I will compile it to machine code later on. The second I am not able to edit, but I may write an addon for it. I need to make that addon able to talk to the first application, generally by simply sending strings back and forth. The input and output streams of a process are not an option for me. I am thinking of using a TCP socket client/server or a file which will act as a buffer. But both ways seem a little bit ugly to me; could anyone propose a better idea?

It depends on what kind of data you wish to transfer.
If it is only Strings, then:
If the number of processes is 2 and you are sure of it, then stdin & stdout is the best way forward. You can create a Process using ProcessBuilder and then get its streams to communicate. The other process can just write to System.out to transfer a message. This is preferred to a Socket because you don't have to handle graceful closing of the socket etc. (If it fails and the port is not unbound successfully, it can cause big trouble.)
If the number of processes is > 2 and less than, say, 10, you can probably use Sockets and communicate through them. This should work well, though extra effort goes into managing the sockets gracefully.
If the number of processes is large, then JMS should be used. It does a lot of things that you then don't need to handle yourself. It is too big a task if the number of processes is small.
So in your case, a process is the best way forward.
If the data you wish to transfer can also be objects, RMI can be used provided the number of processes is small. If there are more, use JMS again.
Edit: For all of the above, there is a lot of dirty work involved. For a change, if you are looking for something new & exciting, I would advise Akka. It is an actor-based model in which actors communicate with each other using messages.
The beauty is that the actors can be on the same JVM or on another one (with very little config), and Akka takes care of the rest for you. I haven't seen a cleaner way of doing this :)
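For the socket option mentioned above, a minimal sketch of exchanging strings over the loopback interface could look like the following. The class names LineServer and LineClient and the "ACK: " reply format are invented for this example, and the server deliberately handles only one connection at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical server side: answers every received line with "ACK: <line>". */
final class LineServer implements AutoCloseable {
    private final ServerSocket server;

    LineServer() throws IOException {
        // Port 0 lets the OS pick a free port; binding to loopback keeps it local to the host.
        server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        Thread worker = new Thread(this::serve, "line-server");
        worker.setDaemon(true);
        worker.start();
    }

    int port() {
        return server.getLocalPort();
    }

    private void serve() {
        while (!server.isClosed()) {
            try (Socket socket = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.println("ACK: " + line); // autoflush is on, so the reply is sent immediately
                }
            } catch (IOException e) {
                // accept() fails once close() has been called; the loop condition then ends the thread.
            }
        }
    }

    @Override
    public void close() throws IOException {
        server.close();
    }
}

/** Hypothetical client side: sends one line and waits for the one-line reply. */
final class LineClient {
    static String send(int port, String message) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(message);
            return in.readLine();
        }
    }
}
```

Using try-with-resources on both sides covers most of the "graceful closing" work; how the addon learns the chosen port (a fixed port, a config file, and so on) is left open here.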

What about using JMS?
You can use either the Publish/Subscribe or the Point-to-Point model, according to your needs.

Another approach is to have a DB table to store your data: one process inserts and the other process reads whenever required. When you are using JMS there is a likelihood of losing data, but storing it in a DB would be failsafe and future-proof.

Time a piece of java code in production.

I need to time/performance check a piece of code, in production.
The code runs on a Java stack and most probably has log4j integrated. It interacts with JMS: it sends some request to it and picks up some response from it. I need to prove that the path from the user event (i.e. a click on the front end) to the point where the code waits for JMS is relatively fast. In other words, I need to prove (know) that most of the round-trip time is spent waiting for a message from JMS.
I am currently looking at http://perf4j.codehaus.org/devguide.html. However, I would like to poll the group for suggestions. A few restrictions that I need to work with are:
I need something that can be run on production. It needs to be something that I can switch on and off relatively easily.
It needs to be something that is not too heavy in terms of memory / CPU usage.
It needs to be something that I can put into the existing code base with the least amount of change to the existing code.
So, does anyone have any suggestions apart from http://perf4j.codehaus.org/devguide.html?

Use aspects, together with JVM system arguments for enabling/disabling (this requires a restart), or JMX if you need to switch timing on and off at run time.
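As a rough sketch of what such switchable timing could look like without any framework: the SectionTimer class and its method names below are hypothetical, and the on/off flag is a plain static field here; in a real setup it could be flipped through a JMX MBean or wired in by an aspect instead of being called directly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical low-overhead timer that can be switched on and off at run time. */
final class SectionTimer {
    private static volatile boolean enabled = false;
    private static final Map<String, LongAdder> TOTAL_NANOS = new ConcurrentHashMap<>();
    private static final Map<String, LongAdder> CALLS = new ConcurrentHashMap<>();

    static void setEnabled(boolean on) {
        enabled = on;
    }

    /** Runs the body; when timing is enabled, adds its duration to the named section. */
    static <T> T time(String section, Supplier<T> body) {
        if (!enabled) {
            return body.get(); // disabled: only the cost of one volatile read
        }
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            TOTAL_NANOS.computeIfAbsent(section, k -> new LongAdder()).add(System.nanoTime() - start);
            CALLS.computeIfAbsent(section, k -> new LongAdder()).increment();
        }
    }

    static long calls(String section) {
        LongAdder adder = CALLS.get(section);
        return adder == null ? 0 : adder.sum();
    }

    static long totalNanos(String section) {
        LongAdder adder = TOTAL_NANOS.get(section);
        return adder == null ? 0 : adder.sum();
    }
}
```

Wrapping the JMS wait in one section and the rest of the request in another would show where the round-trip time actually goes, for example by comparing totalNanos("jms-wait") with totalNanos("whole-request") (both section names are placeholders).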

Is Logging using FileHandler a bottleneck?

I am considering logging business events in a J2EE web application by using Java logging and FileHandler.
I am wondering whether that could cause a performance bottleneck, since many log records will be written to one file.
What are your experiences and opinions?
Is logging a busy web application to one file with Java logging and FileHandler likely to become performance bottleneck?

It all depends on how many log statements you add. If you add logging after every line of code, then performance will most certainly degrade.
Use logging for the important cases, set the correct logging level for your current purposes (testing or actual deployment) and use constructions like
    // "logger" is an instance of a logger that offers isDebugEnabled(), e.g. log4j's
    if (logger.isDebugEnabled()) {
        logger.debug("Value is " + costlyOperation());
    }
to avoid calling code that is costly to run when the message would be discarded anyway.
You might also want to check this article
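Since the question is about java.util.logging, here is a small sketch of the same guard with that API, next to the Supplier overload (available since Java 8) that defers building the message. The logger name "demo.lazy" and the costlyOperation() stand-in are made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LazyLogging {
    // Counts how often the expensive call really runs, so the effect of the guard is visible.
    static final AtomicInteger COSTLY_CALLS = new AtomicInteger();
    private static final Logger LOG = Logger.getLogger("demo.lazy"); // hypothetical logger name

    static String costlyOperation() {
        COSTLY_CALLS.incrementAndGet();
        return "expensive value";
    }

    static void logValue(Level threshold) {
        LOG.setLevel(threshold);

        // Explicit guard: the string is built only if FINE would actually be logged.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Value is " + costlyOperation());
        }

        // Supplier overload: the lambda is invoked only if FINE is loggable.
        LOG.fine(() -> "Value is " + costlyOperation());
    }
}
```

With the level at INFO neither form calls costlyOperation(); with the level at FINE both do.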

In order to avoid generalities like "it depends" or "a little" etc. you should measure the performance of your application with and without the logging overhead. Apache JMeter can help you generate the load for the test.
The information you can collect through logging is usually so essential for the integrity of the application, that you can not operate blindly. There is also a slight overhead if you use Google Analytics, but the benefits prevail.
In order to keep your log files within reasonable sizes, you can always use rotating log files.
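With java.util.logging, rotation is built into FileHandler itself. The sketch below shows one way to set it up; the RotatingLog class, the "business.%g.log" file name and the choice of SimpleFormatter are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

final class RotatingLog {
    /**
     * Attaches a size-based rotating handler to the logger. Files are named
     * business.0.log (newest) up to business.(count-1).log (oldest) inside dir.
     */
    static FileHandler attach(Logger logger, Path dir, int limitBytes, int count) throws IOException {
        // %g is replaced by the generation number of the rotated file.
        String pattern = dir.resolve("business.%g.log").toString();
        FileHandler handler = new FileHandler(pattern, limitBytes, count, true);
        handler.setFormatter(new SimpleFormatter()); // FileHandler defaults to XML output
        logger.addHandler(handler);
        return handler;
    }
}
```

The caller should close() the returned handler on shutdown so the lock file is released; total disk usage is then bounded by roughly limitBytes times count.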

I think that JavaRevisited blog has a pretty good post on a problem with performance: Top 10 Tips on Logging in Java

In a recent project I log audit events to a database table. I was concerned about performance, so I added the ability to log in 'asynchronous' mode. In this mode the logger runs in a low-priority background thread, and logging from the main thread just puts the log event onto a queue, from which events are lazily retrieved and written by the background logging thread.
This approach will only work, however, if there are natural 'breaks' in the processing; if your system is constantly busy then the queue will never be emptied. One way to solve this is to make the background thread more active depending on the number of the log messages in the queue (an enhancement I've yet to implement).
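A minimal sketch of such a queue-based logger is shown below. It is not the implementation described above: the AsyncAuditLog name, the batch size of 100 and the sink callback (standing in for the database write) are all assumptions. Draining in batches is one possible way to let the writer keep up when the queue fills faster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical asynchronous logger: callers enqueue, a background thread writes in batches. */
final class AsyncAuditLog implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final Consumer<List<String>> sink; // stands in for e.g. a batched database insert
    private final Thread writer;
    private volatile boolean running = true;

    AsyncAuditLog(int capacity, Consumer<List<String>> sink) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.sink = sink;
        this.writer = new Thread(this::drainLoop, "audit-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Never blocks the caller; returns false if the queue is full and the event was dropped. */
    boolean log(String event) {
        return queue.offer(event);
    }

    private void drainLoop() {
        try {
            // Keep going after close() until everything already queued has been written.
            while (running || !queue.isEmpty()) {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                List<String> batch = new ArrayList<>();
                batch.add(first);
                queue.drainTo(batch, 99); // a busier queue simply yields larger batches
                sink.accept(batch);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Stops the writer after the remaining events have been flushed. */
    @Override
    public void close() throws InterruptedException {
        running = false;
        writer.join();
    }
}
```

The bounded queue makes the trade-off explicit: when the writer cannot keep up, events are dropped (log() returns false) instead of slowing down the main thread or exhausting memory.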

You should:
Define an appropriate metric of performance (e.g., responsiveness, throughput, etc.). Then you should measure this metric with all logging turned off and then on. The difference would be the cost of logging.
Then you should experiment with different logging libraries and the modes they provide and document the observed differences.
In my personal experience, for all the three projects I worked on, I found that asynchronous logging helped improve the application throughput a lot. But the same may not hold for you, so make sure you make your decision after careful measurements.
The following does not directly relate to your question.
I noticed that you specifically mentioned business logging. In this case, you may also want to keep logging relevant and clean, in case you find your log files growing huge and difficult to understand. There is a generally accepted design pattern in this area: log per function. This means that business logging (e.g., customer requested a refund) goes to one destination, interface logging goes to another (e.g., user clicked the upvote button, which is not the same as user upvoted an answer), and cross-system calls go to a third (e.g., requesting clearance through a payment gateway). Some people also keep a master log file with all events just to see a timeline of the process, while others design log miners/scrapers to construct timelines when required.
Hope this helps,

Reliable non-network IPC in Java

Is there a reliable, cross-platform way to do IPC (between two JVMs running on the same host) in Java (J2SE) that doesn't rely on the network stack?
To be more specific, I have a server application that I'd like to provide a small "monitoring" GUI app for. The monitor app would simply talk to the server process and display simple status information. The server app has a web interface for most of its interaction, but sometimes things go wrong (port conflict, user forgot password) that require a local control app.
In the past I've done this by having the server listen on 127.0.0.1 on a specific port, with the client communicating that way. However, this isn't as reliable as I'd like. Certain things can make this not work (Windows's network stack can be bizarre with VPN adapters, MediaSense, laptop lid closing/power saving modes). You can imagine the user's confusion when the tool they use to diagnose the server doesn't even think the server is running.
Named Pipes seem plausible, but Java doesn't seem to have an API for them unless I'm mistaken. Ideas? Third party libraries that support this? My performance requirements are obviously extremely lax in case that helps.

One of my specialties is really low-tech solutions. Especially if your performance requirements aren't critical:
The low-low tech alternative to named pipes is named FILES. Think yourself up a protocol where one app writes a file and another reads it. If need be, you can do semaphoring between them.
Remember that a rename is pretty much an atomic operation, so you could calmly write a file in some process and then make it magically appear in its entirety by renaming/moving it from somewhere that wasn't previously visible.
You can poll for data by checking for appearance of a file (in a loop with a SLEEP in it), and you can signal completion by deleting the file.
An added benefit is that you can debug your app using the DIR command :)
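The write-then-rename trick can be sketched in Java roughly as follows. The FileMailbox class and the file names "status.msg" and "status.msg.tmp" are invented for this example, and both processes are assumed to agree on the directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

/** Hypothetical one-slot mailbox backed by a file in a shared directory. */
final class FileMailbox {
    private final Path message; // the file the reader looks for
    private final Path staging; // written first, under a name the reader ignores

    FileMailbox(Path dir) {
        this.message = dir.resolve("status.msg");
        this.staging = dir.resolve("status.msg.tmp");
    }

    /** Writer side: the message becomes visible in one step, never half-written. */
    void publish(String text) throws IOException {
        Files.write(staging, text.getBytes(StandardCharsets.UTF_8));
        // ATOMIC_MOVE throws AtomicMoveNotSupportedException if the file system cannot do it;
        // whether an existing target is replaced is implementation specific.
        Files.move(staging, message, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Reader side: returns the message if one is present and deletes it to signal completion. */
    Optional<String> take() throws IOException {
        if (!Files.exists(message)) {
            return Optional.empty();
        }
        String text = new String(Files.readAllBytes(message), StandardCharsets.UTF_8);
        Files.delete(message);
        return Optional.of(text);
    }
}
```

The reader would call take() in a loop with a sleep in between, and the writer can check that the file has disappeared before publishing the next message.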

Depending on how much data you need to pass between the server and the diagnostic tool you could:
go low-tech and have a background thread check a file in the file system, fetch commands from it, and write output into a second file to be picked up by the diagnostic tool.
build a component that manages an input/output queue in shared memory, connecting to it via JNI.

Consider JMX. I do not know if any of the Windows JVMs allow JMX over shared memory.

Does Windows even have named pipes? I was going to suggest it. You'd just have to use an exec() to create it.

Map a read/write byte buffer (FileChannel.MapMode.READ_WRITE) into memory from a FileChannel. Write status information into the byte buffer, then call force() to get it written out. On the monitor side, open the same file and map it into memory too. Poll it periodically to find out the status.
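A small sketch of that idea follows. The StatusFile class and the record layout (a 4-byte length followed by UTF-8 text in a fixed 256-byte region) are assumptions made for the example, and a real monitor would need a more careful protocol to rule out reading a half-updated record.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical status record shared through a memory-mapped file. */
final class StatusFile {
    private static final int SIZE = 256; // assumed layout: 4-byte length + UTF-8 text

    /** Server side: writes the status and forces it out to the file. */
    static void write(Path file, String status) throws IOException {
        byte[] bytes = status.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > SIZE - 4) {
            throw new IllegalArgumentException("status too long");
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE past the end of the file grows the file to SIZE bytes.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, SIZE);
            buffer.position(4);
            buffer.put(bytes);
            buffer.putInt(0, bytes.length); // length is written last; not a real synchronization barrier
            buffer.force();
        }
    }

    /** Monitor side: returns the current status, or an empty string if none has been written. */
    static String read(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            if (channel.size() < SIZE) {
                return ""; // a read-only mapping cannot extend past the end of the file
            }
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
            int length = buffer.getInt(0);
            if (length < 0 || length > SIZE - 4) {
                return "";
            }
            byte[] bytes = new byte[length];
            buffer.position(4);
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

For brevity the sketch maps the file on every call; a long-running server and monitor would more likely keep their mappings open and only update or poll the buffer.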
