How to write a Generic Log Parser - Java

We need to parse several log files and run some statistics on the log entries found (things such as the number of occurrences of certain messages, spikes of occurrences, etc.). The problem is with writing a log parser that will handle several log formats and will allow me to add a new log format with very little work.
To make things easier for now I'm only looking at logs that will basically look similar to this:
[11/17/11 14:07:14:030 EST] MyXmlParser E Premature end of file
so each log entry will contain a timestamp, originator (of the log message), level and log message. One important detail is that a message may span more than one line (e.g. a stack trace).
Another instance of the log entry could be:
17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file
I'm looking for a good way to specify the log format as well as the most adequate technology to implement the parser for it.
I thought about regular expressions, but I think it will be tricky to handle situations such as multi-line messages (e.g. stack traces).
Actually, the task of writing a parser for a specific log format does not sound so easy in itself when I consider the possibility of multi-line messages. How do you go about parsing such files?
Ideally I would be able to specify something like this as a log format:
[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE
or
%TIMESTAMP %LEVEL %ORIGIN - %MESSAGE
Obviously I would have to assign the right converter to each field so it would be handled correctly (e.g. the timestamp).
Could anyone give me some good ideas on how to implement this in a robust and modular way (I'm using Java)?

AWStats is a great log parser, open source, and you can do whatever you want with the resulting database that it generates.

You can use a Scanner for example, and some regexes. Here is a snippet of what I did to parse some complex logs:

private static final Pattern LINE_PATTERN = Pattern.compile(
        "(\\S+:)?(\\S+? \\S+?) \\S+? DEBUG \\S+? - DEMANDE_ID=(\\d+?) - listener (\\S+?) : (\\S+?)");

public static EventLog parse(String line) throws ParseException {
    String demandeId;
    String listenerClass;
    long startTime;
    long endTime;
    SimpleDateFormat sdf = new SimpleDateFormat(DATE_PATTERN);
    Matcher matcher = LINE_PATTERN.matcher(line);
    if (matcher.matches()) {
        int offset = matcher.groupCount() - 4; // 4 interesting groups, the first is optional
        demandeId = matcher.group(2 + offset);
        listenerClass = matcher.group(3 + offset);
        long time = sdf.parse(matcher.group(1 + offset)).getTime();
        if ("starting".equals(matcher.group(4 + offset))) {
            startTime = time;
            endTime = -1;
        } else {
            startTime = -1;
            endTime = time;
        }
        return new EventLog(demandeId, listenerClass, startTime, endTime);
    }
    return null;
}
So, with regexes and groups, it works pretty well.
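For the multi-line messages from the question, one option is to group raw lines into logical entries before applying the field regex: any line that matches the entry-start pattern begins a new entry, and every other line is appended to the previous entry's message. Here is a minimal sketch of that idea (my own illustration, not from the answer above; the class name, the EntryHandler callback and the start-of-entry pattern, which matches the first sample format, are all made up and would need one variant per log format):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Pattern;

// Sketch: group raw lines into complete entries before parsing the fields,
// so stack traces stay attached to the entry that produced them.
public class MultiLineEntryReader {
    // matches lines like "[11/17/11 14:07:14:030 EST] MyXmlParser E Premature end of file"
    private static final Pattern ENTRY_START = Pattern.compile("^\\[\\d{2}/\\d{2}/\\d{2} .*");

    public interface EntryHandler {
        void handle(String completeEntry);
    }

    public static void read(Reader source, EntryHandler handler) throws IOException {
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (ENTRY_START.matcher(line).matches() && current.length() > 0) {
                    handler.handle(current.toString()); // previous entry is complete
                    current.setLength(0);
                }
                if (current.length() > 0) {
                    current.append('\n'); // continuation line (e.g. part of a stack trace)
                }
                current.append(line);
            }
            if (current.length() > 0) {
                handler.handle(current.toString());
            }
        }
    }
}

The handler then receives whole entries (including their stack traces) and can apply a per-format regex such as the one above.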

If you have the possibility (and you should, with a good logging framework) I would recommend duplicating the logs in a parsable format. For example, with log4j use an XMLLayout or something similar.
It will be a lot easier to parse because you will know the exact format of the logs.
You can do this quite transparently to the running app, just through configuration. Think about using an asynchronous appender so you don't disturb the running application too much.
Also, if the XMLLayout suits your needs, have a look at Apache Chainsaw.
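For illustration, a rough sketch of what that could look like if set up programmatically with log4j 1.x (the appender name and file path are made up; in practice you would normally declare the same thing in log4j.xml or log4j.properties):

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.xml.XMLLayout;

// Sketch: mirror all logging into an XML-formatted file through an async appender,
// so the parsable copy does not slow the application down.
public class ParsableLogSetup {
    public static void configure() throws java.io.IOException {
        FileAppender xmlFile = new FileAppender(new XMLLayout(), "logs/app-parsable.xml", true);
        xmlFile.setName("XML_FILE");

        AsyncAppender async = new AsyncAppender();
        async.addAppender(xmlFile);

        Logger.getRootLogger().addAppender(async);
    }
}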

Log4j's LogFilePatternReceiver does exactly that...
This log entry:
17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file
can be parsed using the following logFormat (assuming the origin is the same as the 'logger'), with a timestamp that leverages Java's SimpleDateFormat: dd-MM-yyyy kk:mm:ss
TIMESTAMP LEVEL LOGGER - MESSAGE
The timezone and the level in the other format are a little trickier... there is the ability to remap strings to levels (E to ERROR), but I don't know that the timezone will quite work.
Try it out, check out the source, and play with support for it in the latest developer snapshot of Chainsaw:
http://people.apache.org/~sdeboy

I ended up not writing my own and using Logstash instead.

At work we rolled our own log parser (in Java) so we could filter the known stacktraces out of the production logs to identify new potential production problems. It uses regex and it's tightly coupled to our log4j log format.
We've also got a python script that runs over the live production transaction logs and reports (to SiteScope - our infrastructure monitoring tool) when the count for particular errors is too high.
While both are useful, they are awful to maintain, and I would recommend trying any open source log parsing tool first, and resorting to writing your own only if necessary. Heck, I would even pay for a tool that did this ;)

Maybe you could write a Log4j CustomAppender? For example as described here: http://mytechattempts.wordpress.com/2011/05/10/log4j-custom-memory-appender/
Your custom appender could use a database, or simple Java objects queried over JMX, to gather your statistics. It all depends on how much data needs to be persisted.
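As a rough sketch of that idea with log4j 1.x (the class and field names are my own; the counters are kept in memory here and could just as well be exposed over JMX or flushed to a database):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

// Sketch: count how often each distinct message is logged, instead of (or in addition to)
// writing it somewhere. The counters map is what a JMX bean or stats report would read.
public class CountingAppender extends AppenderSkeleton {
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    @Override
    protected void append(LoggingEvent event) {
        String key = event.getLevel() + " " + event.getRenderedMessage();
        counters.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }

    public Map<String, AtomicLong> getCounters() {
        return counters;
    }

    @Override
    public void close() {
        // nothing to release
    }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}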

Related

How to automatically collapse repetitive log output in log4j

Every once in a while, a server or database error causes thousands of the same stack trace in the server log files. It might be a different error/stacktrace today than a month ago. But it causes the log files to rotate completely, and I no longer have visibility into what happened before. (Alternately, I don't want to run out of disk space, which for reasons outside my control right now is limited--I'm addressing that issue separately). At any rate, I don't need thousands of copies of the same stack trace--just a dozen or so should be enough.
I would like it if I could have log4j/log4j2/another system automatically collapse repetitive errors, so that they don't fill up the log files. For example, a threshold of maybe 10 or 100 exceptions from the same place might trigger log4j to just start counting, and wait until they stop coming, then output a count of how many more times they appeared.
What pre-made solutions exist (a quick survey with links is best)? If this is something I should implement myself, what is a good pattern to start with and what should I watch out for?
Thanks!
Will the BurstFilter do what you want? If not, please create a Jira issue with the algorithm that would work for you and the Log4j team would be happy to consider it. Better yet, if you can provide a patch it would be much more likely to be incorporated.
Log4j's BurstFilter will certainly help prevent you filling your disks. Remember to configure it so that it applies in as limited a section of code as you can, or you'll filter out messages you might want to keep (that is, don't use it on your appender, but on a particular logger that you isolate in your code).
I wrote a simple utility class at one point that wrapped a logger and filtered based on n messages within a given Duration. I used instances of it around most of my warning and error logs to protect the off chance that I'd run into problems like you did. It worked pretty well for my situation, especially because it was easier to quickly adapt for different situations.
Something like:
...
public DurationThrottledLogger(Logger logger, Duration throttleDuration, int maxMessagesInPeriod) {
    ...
}

public void info(String msg) {
    getMsgAddendumIfNotThrottled().ifPresent(addendum -> logger.info(msg + addendum));
}

private synchronized Optional<String> getMsgAddendumIfNotThrottled() {
    LocalDateTime now = LocalDateTime.now();
    String msgAddendum;
    if (throttleDuration.compareTo(Duration.between(lastInvocationTime, now)) <= 0) {
        // last one was sent longer than throttleDuration ago - send it and reset everything
        if (throttledInDurationCount == 0) {
            msgAddendum = " [will throttle future msgs within throttle period]";
        } else {
            msgAddendum = String.format(" [previously throttled %d msgs received before %s]",
                    throttledInDurationCount, lastInvocationTime.plus(throttleDuration).format(formatter));
        }
        totalMessageCount++;
        throttledInDurationCount = 0;
        numMessagesSentInCurrentPeriod = 1;
        lastInvocationTime = now;
        return Optional.of(msgAddendum);
    } else if (numMessagesSentInCurrentPeriod < maxMessagesInPeriod) {
        msgAddendum = String.format(" [message %d of %d within throttle period]",
                numMessagesSentInCurrentPeriod + 1, maxMessagesInPeriod);
        // within throttle period, but haven't sent max messages yet - send it
        totalMessageCount++;
        numMessagesSentInCurrentPeriod++;
        return Optional.of(msgAddendum);
    } else {
        // throttle it
        totalMessageCount++;
        throttledInDurationCount++;
        return emptyOptional;
    }
}
I'm pulling this from an old version of the code, unfortunately, but the gist is there. I wrote a bunch of static factory methods, mainly because they let me create one of these with a single line of code for a single log message:

} catch (IOException e) {
    DurationThrottledLogger.error(logger, Duration.ofSeconds(1), "Received IO Exception. Exiting current reader loop iteration.", e);
}
This probably won't be as important in your case; for us, we were using a somewhat underpowered Graylog instance that we could hose down fairly easily.

Encrypted logger for Java

I'll put the question upfront:
Is there a logger available in Java that does encryption(preferably 128-bit AES or better)?
I've done a lot of searching for this over the last couple of days. There are a few common themes to what I've found:
Dissecting information between log4j and log4j2 is giving me headaches(but mostly unrelated to the task at hand)
Most threads are dated, including the ones here on SO. This one is probably the best I've found on SO, and one of the newer answers links to a roll-your-own version.
The most common answer is "roll-your-own", but these answers are also a few years old at this point.
A lot of people question why I or anyone would do this in Java anyway, since it's simple enough to analyze Java code even without the source.
For the last point, it's pretty much a moot point for my project. We also use a code obfuscator and could employ other obfuscation techniques. The point of using encryption is simply to raise the bar of figuring out our logs above "trivially easy", even if it's only raised to "mildly time-consuming". A slightly relevant aside - the kind of logging we're going to encrypt is intended merely for alpha/beta, and will likely only include debug, warn, and error levels of logging(so the number of messages to encrypt should be fairly low).
The best I've found for Log4j2 is in their documentation:
KeyProviders
Some components within Log4j may provide the ability to perform data encryption. These components require a secret key to perform the encryption. Applications may provide the key by creating a class that implements the SecretKeyProvider interface.
But I haven't really found anything other than wispy statements along the lines of 'plug-ins are capable of doing encryption'. I haven't found a plug-in that actually has that capability.
I have also just started trying to find other loggers for Java to see if they have one implemented, but nothing is really jumping out for searches like 'java logging encryption'.
Basically, log encryption is not best practice; there are only limited situations where you need this functionality. The main reason is that people who have access to the logs usually also have access to the JVM, and in the JVM all log messages are generated as Strings, so even if you encrypt them in the log file or console, the real values are still available in the JVM's string pool. If anyone ever wants to hack your logs, it will be as easy as having a look at the string pool.
But anyway, if you need a way to encrypt the logs, and since there is no generic way to do this, the best way in my opinion is to go with AspectJ. This will have minimal impact on your sources: you write code as you have done before, but the logs will be encrypted. Following is a simple application which encrypts all the logs from all compiled sources using AspectJ, with SLF4J as the logging facade and Log4j2 as the logging implementation.
A simple class which logs "Hello World":
public class Main {
    private static final transient Logger LOG = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        LOG.info("Hello World");
        LOG.info("Hello {0}", "World 2");
    }
}
The aspect which encrypts (in this case it just edits the text):
@Aspect
public class LogEncryptAspect {

    @Around("call(* org.slf4j.Logger.info(..))")
    public Object encryptLog(ProceedingJoinPoint thisJoinPoint) throws Throwable {
        Object[] arguments = thisJoinPoint.getArgs();
        if (arguments[0] instanceof String) {
            String encryptedLog = encryptLogMessage((String) arguments[0],
                    arguments.length > 1 ? Arrays.copyOfRange(arguments, 1, arguments.length) : null);
            arguments[0] = encryptedLog;
        }
        return thisJoinPoint.proceed(arguments);
    }

    // TODO change this to apply some kind of encryption
    public final String encryptLogMessage(String message, Object... args) {
        if (args != null) {
            return MessageFormat.format(message, args) + " encrypted";
        }
        return message + " encrypted";
    }
}
The output is:
[main] INFO xxx.Main - Hello World encrypted
[main] INFO xxx.Main - Hello World 2 encrypted
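To go from the "encrypted" placeholder above to the actual 128-bit AES that the question asks for, the encryptLogMessage method could be backed by something along these lines. This is my own sketch using the standard javax.crypto API; the class name is made up, and key handling, IV storage and error handling are deliberately simplified (a hard-coded key like the one below must not be used in practice):

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.text.MessageFormat;
import java.util.Base64;

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch: AES/CBC encryption of the formatted log message, Base64-encoded so the
// ciphertext stays printable in a text log file.
public class LogEncryptor {
    // placeholder 16-byte (128-bit) key; load it from a real secret store instead
    private static final SecretKey KEY =
            new SecretKeySpec("0123456789abcdef".getBytes(StandardCharsets.UTF_8), "AES");
    private static final SecureRandom RANDOM = new SecureRandom();

    public static String encryptLogMessage(String message, Object... args) {
        String plain = (args != null && args.length > 0) ? MessageFormat.format(message, args) : message;
        try {
            byte[] iv = new byte[16];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, KEY, new IvParameterSpec(iv));
            byte[] cipherText = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
            // prepend the IV so the message can be decrypted later
            return Base64.getEncoder().encodeToString(iv) + ":" + Base64.getEncoder().encodeToString(cipherText);
        } catch (Exception e) {
            // never let logging break the application; fall back to a fixed marker
            return "[log encryption failed]";
        }
    }
}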

My own Logging Handler for GAE/J (using appengine.api.log?)

I need to write my own logging handler on GAE/J. I have Android code that I'm trying to adapt such that it can be shared between GAE/J and Android. The GAE code I'm trying to write would allow the log statements in my existing code to work on GAE.
The docs say that I can just print to system.out and system.err, and it works, but badly. My logging shows up in the log viewer with too much extraneous text:
2013-03-08 19:37:11.355 [s~satethbreft22/1.365820955097965155].: [my_log_msg]
So, I started looking at the GAE log API. This looked hopeful initially: I can construct an AppLogLine and set the log records for a RequestLogs object.
However, there is no way to get the RequestLogs instance for the current request - the docs say so explicitly here:
Note: Currently, App Engine doesn't support the use of the request ID to directly look up the related logs.
I guess I could invent a new requestID and add log lines to that, but it is starting to look like this is just not meant to be?
Has anyone used this API to create their own log records, or otherwise managed to do their own logging to the log console?
Also, where can I find the source for GAE's java.util.logging? Is this public? I would like to see how that works if I can.
If what I'm trying to do is impossible then I will need to consider other options, e.g. writing my log output to a FusionTable.
I ended up just layering my logging code on top of GAE's java.util.logging. This feels non-optimal since it increases the complexity and overhead of my logging, but I guess this is what any 3rd-party logging framework for GAE must do (unless it is OK with the extra cruft that gets added when you just print to stdout).
Here is the crux of my code:
public int println(int priority, String msg) {
    Throwable t = new Throwable();
    StackTraceElement[] stackTrace = t.getStackTrace();
    // Optional: translate from Android log levels to GAE log levels.
    final Level[] levels = { Level.FINEST, Level.FINER, Level.FINE, Level.CONFIG,
            Level.INFO, Level.WARNING, Level.SEVERE, Level.SEVERE };
    Level level = levels[priority];
    LogRecord lr = new LogRecord(level, msg);
    if (stackTrace.length > 2) { // should always be true
        lr.setSourceClassName(stackTrace[2].getClassName());
        lr.setSourceMethodName(stackTrace[2].getMethodName());
    }
    log.log(lr);
    return 0;
}
Note that I use a stack depth of 2, but that number will depend on the 'depth' of your logging code.
I hope that Google will eventually support getting the current com.google.appengine.api.log.RequestLogs instance and inserting our own AppLogLine instances into it. (The APIs are actually there to do that, but they explicitly don't support it, as above.)

Log4j not printing name before message

I have a log4j logger that I instantiate like this:
logger = Logger.getLogger("EQUIP(" + id + ")");
Doing so, when I call logger.info("message"), I should get an output like this (with some date formatting):
13/11/12 15:08:27 INFO: EQUIP(1): message
But I'm only getting:
13/11/12 15:08:27 INFO: message
I'm also printing logger.getName() to the console for debugging and it gives me back the correct "EQUIP(1)" name. This behaviour is happening in some cases in my program, where I have several loggers like this, but mostly in this specific class. I want to know if I'm doing something wrong, if this name should be only the class/package name, or if it can be anything (it works well in 80+% of my loggers). I need to print the ID of each equipment because I have several of them working simultaneous, and tracking them without this would be next to impossible.
How should I fix this, preferably without resourcing to changing all my log calls to include this prefix?
The output format depends on the pattern you've configured for the appender. If the pattern string includes %c then you'll get the logger name included, if it doesn't then you won't.
An alternative approach might be to use the mapped diagnostic context, which is designed to disambiguate between log output from different threads writing to the same logger.
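A rough sketch of that MDC approach with log4j 1.x (the key name "equipId" and the pattern in the comment are my own illustration): put the id into the mapped diagnostic context when the equipment's thread starts working, and reference it from the layout with %X instead of baking it into the logger name.

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

// Sketch: keep a plain per-class logger and carry the equipment id in the MDC.
// The appender's ConversionPattern would then include %X{equipId}, e.g.
//   %d{dd/MM/yy HH:mm:ss} %p: EQUIP(%X{equipId}): %m%n
public class Equipment implements Runnable {
    private static final Logger LOGGER = Logger.getLogger(Equipment.class);

    private final int id;

    public Equipment(int id) {
        this.id = id;
    }

    @Override
    public void run() {
        MDC.put("equipId", String.valueOf(id));
        try {
            LOGGER.info("message"); // prints "... EQUIP(1): message" with the pattern above
        } finally {
            MDC.remove("equipId");
        }
    }
}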

How to mask credit card numbers in log files with Log4J?

Our web app needs to be made PCI compliant, i.e. it must not store any credit card numbers. The app is a frontend to a mainframe system which handles the CC numbers internally and - as we have just found out - occasionally still spits out a full CC number on one of its response screens. By default, the whole content of these responses is logged at debug level, and the content parsed from them can also be logged in lots of different places. So I can't hunt down the source of such data leaks; I must make sure that CC numbers are masked in our log files.
The regex part is not an issue, I will reuse the regex we already use in several other places. However, I just can't find any good source on how to alter a part of a log message with Log4J. Filters seem to be much more limited: they can only decide whether to log a particular event or not, but can't alter the content of the message. I also found the ESAPI security wrapper API for Log4J, which at first sight promises to do what I want. However, apparently I would need to replace all the loggers in the code with the ESAPI logger class - a pain in the butt. I would prefer a more transparent solution.
Any idea how to mask out credit card numbers from Log4J output?
Update: Based on @pgras's original idea, here is a working solution:
public class CardNumberFilteringLayout extends PatternLayout {
    private static final String MASK = "$1++++++++++++";
    private static final Pattern PATTERN = Pattern.compile("([0-9]{4})([0-9]{9,15})");

    @Override
    public String format(LoggingEvent event) {
        if (event.getMessage() instanceof String) {
            String message = event.getRenderedMessage();
            Matcher matcher = PATTERN.matcher(message);
            if (matcher.find()) {
                String maskedMessage = matcher.replaceAll(MASK);
                @SuppressWarnings({ "ThrowableResultOfMethodCallIgnored" })
                Throwable throwable = event.getThrowableInformation() != null
                        ? event.getThrowableInformation().getThrowable() : null;
                LoggingEvent maskedEvent = new LoggingEvent(event.fqnOfCategoryClass,
                        Logger.getLogger(event.getLoggerName()), event.timeStamp,
                        event.getLevel(), maskedMessage, throwable);
                return super.format(maskedEvent);
            }
        }
        return super.format(event);
    }
}
Notes:
I mask with + rather than *, because I want to tell apart the cases where the CID was masked by this logger from those where it was done by the backend server or whoever else
I use a simplistic regex because I am not worried about false positives
The code is unit tested so I am fairly convinced it works properly. Of course, if you spot any possibility to improve it, please let me know :-)
You could write your own layout and configure it for all appenders...
Layout has a format method which builds a String from a LoggingEvent that contains the logging message...
A better implementation of credit card number masking is at http://adamcaudill.com/2011/10/20/masking-credit-cards-for-pci/ .
You want to log the issuer and the checksum, but not the PAN (Primary Account Number).
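A sketch of that variant (my own code, not from the linked article): keep the first six digits (the issuer/BIN) and the last four (which include the check digit) and mask everything in between, using the same '+' mask character as the update above, so the log line still identifies the card without containing the full PAN.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: replace the middle digits of anything that looks like a 13-19 digit card number.
public final class CardMasker {
    private static final Pattern CARD = Pattern.compile("\\b([0-9]{6})([0-9]{3,9})([0-9]{4})\\b");

    public static String mask(String message) {
        Matcher m = CARD.matcher(message);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String middle = m.group(2).replaceAll("[0-9]", "+"); // keep the original length
            m.appendReplacement(sb, m.group(1) + middle + m.group(3));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    private CardMasker() {
    }
}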
