How to automatically collapse repetitive log output in log4j - java

Every once in a while, a server or database error causes the same stack trace to appear thousands of times in the server log files. It might be a different error/stack trace today than a month ago, but it causes the log files to rotate completely, and I no longer have visibility into what happened before. (Alternatively, I don't want to run out of disk space, which for reasons outside my control is limited right now; I'm addressing that issue separately.) At any rate, I don't need thousands of copies of the same stack trace; just a dozen or so should be enough.
I would like it if I could have log4j/log4j2/another system automatically collapse repetitive errors, so that they don't fill up the log files. For example, a threshold of maybe 10 or 100 exceptions from the same place might trigger log4j to just start counting, and wait until they stop coming, then output a count of how many more times they appeared.
What pre-made solutions exist (a quick survey with links is best)? If this is something I should implement myself, what is a good pattern to start with and what should I watch out for?
Thanks!

Will the BurstFilter do what you want? If not, please create a Jira issue with the algorithm that would work for you and the Log4j team would be happy to consider it. Better yet, if you can provide a patch it would be much more likely to be incorporated.

Log4j's BurstFilter will certainly help prevent you from filling your disks. Remember to configure it so that it applies to as limited a section of code as you can, or you'll filter out messages you might want to keep (that is, don't put it on your appender, but on a particular logger that you isolate in your code).
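To make the isolation idea concrete, here is a minimal sketch assuming Log4j 2's API; the class, logger name, and query method are placeholders, and the BurstFilter itself would be attached to the dedicated logger in your log4j2 configuration rather than shown in this code:
import java.sql.SQLException;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class DatabasePoller {
    // Dedicated logger name for the noisy error path. Attach the BurstFilter to this
    // logger in log4j2.xml so only these messages are throttled; the class's normal
    // logger stays unfiltered.
    private static final Logger NOISY_LOG = LogManager.getLogger("com.example.noisy.dbretries");
    private static final Logger LOG = LogManager.getLogger(DatabasePoller.class);

    public void poll() {
        LOG.info("Polling database");
        try {
            runQuery();
        } catch (SQLException e) {
            // Thousands of these in a row get collapsed by the BurstFilter on NOISY_LOG.
            NOISY_LOG.error("Database poll failed", e);
        }
    }

    private void runQuery() throws SQLException {
        // ... actual query elided ...
    }
}
That way the filter only ever sees the repetitive error, and the rest of the class's logging is untouched.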
I wrote a simple utility class at one point that wrapped a logger and throttled based on n messages within a given Duration. I used instances of it around most of my warning and error logs to protect against the off chance that I'd run into problems like you did. It worked pretty well for my situation, especially because it was easy to quickly adapt for different situations.
Something like:
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

// Fields and constructor below are reconstructed from the usages further down; the
// original declarations were elided. The formatter pattern is a placeholder.
// (Logger is whichever logging facade you use, e.g. org.slf4j.Logger.)
public class DurationThrottledLogger {

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME;
    private static final Optional<String> emptyOptional = Optional.empty();

    private final Logger logger;
    private final Duration throttleDuration;
    private final int maxMessagesInPeriod;

    private LocalDateTime lastInvocationTime = LocalDateTime.MIN;
    private int numMessagesSentInCurrentPeriod;
    private int throttledInDurationCount;
    private long totalMessageCount;

    public DurationThrottledLogger(Logger logger, Duration throttleDuration, int maxMessagesInPeriod) {
        this.logger = logger;
        this.throttleDuration = throttleDuration;
        this.maxMessagesInPeriod = maxMessagesInPeriod;
    }

    public void info(String msg) {
        getMsgAddendumIfNotThrottled().ifPresent(addendum -> logger.info(msg + addendum));
    }

    private synchronized Optional<String> getMsgAddendumIfNotThrottled() {
        LocalDateTime now = LocalDateTime.now();
        String msgAddendum;
        if (throttleDuration.compareTo(Duration.between(lastInvocationTime, now)) <= 0) {
            // Last message was sent longer than throttleDuration ago - send it and reset everything.
            if (throttledInDurationCount == 0) {
                msgAddendum = " [will throttle future msgs within throttle period]";
            } else {
                msgAddendum = String.format(" [previously throttled %d msgs received before %s]",
                        throttledInDurationCount, lastInvocationTime.plus(throttleDuration).format(formatter));
            }
            totalMessageCount++;
            throttledInDurationCount = 0;
            numMessagesSentInCurrentPeriod = 1;
            lastInvocationTime = now;
            return Optional.of(msgAddendum);
        } else if (numMessagesSentInCurrentPeriod < maxMessagesInPeriod) {
            // Within the throttle period, but haven't sent the max number of messages yet - send it.
            msgAddendum = String.format(" [message %d of %d within throttle period]",
                    numMessagesSentInCurrentPeriod + 1, maxMessagesInPeriod);
            totalMessageCount++;
            numMessagesSentInCurrentPeriod++;
            return Optional.of(msgAddendum);
        } else {
            // Throttle it.
            totalMessageCount++;
            throttledInDurationCount++;
            return emptyOptional;
        }
    }
}
I'm pulling this from an old version of the code, unfortunately, but the gist is there. I also wrote a bunch of static factory methods, mainly because they let me write a single line of code to create and use one of these loggers for a single log message:
} catch (IOException e) {
    DurationThrottledLogger.error(logger, Duration.ofSeconds(1), "Received IO Exception. Exiting current reader loop iteration.", e);
}
This probably won't be as important in your case; for us, we were using a somewhat underpowered graylog instance that we could hose down fairly easily.
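For reference, a minimal sketch of how one of those static helpers could work, caching a wrapper per call site inside DurationThrottledLogger; the ConcurrentHashMap cache, the key format, and the instance-level error(String, Throwable) method (analogous to info(String)) are assumptions rather than the original code:
    private static final java.util.concurrent.ConcurrentMap<String, DurationThrottledLogger> INSTANCES =
            new java.util.concurrent.ConcurrentHashMap<>();

    public static void error(Logger logger, Duration throttleDuration, String msg, Throwable t) {
        DurationThrottledLogger throttled = INSTANCES.computeIfAbsent(
                logger.getName() + "|" + msg,   // one wrapper per logger + message
                key -> new DurationThrottledLogger(logger, throttleDuration, 1));
        throttled.error(msg, t);
    }
The single-line call shown above then reuses the same throttled wrapper every time that catch block fires.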

Related

Issues with Dynamic Destinations in Dataflow

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is based on YYYY/MM/DD. This allows files to be generated in folders based on date, and uses Apache Beam's FileIO and Dynamic Destinations.
About two weeks ago, I noticed an unusual buildup of unacknowledged messages. Upon restarting the Dataflow job, the errors disappeared and new files were being written to GCS.
After a couple of days, writing stopped again, except this time there were errors claiming that processing was stuck. After some trusty SO research, I found out that this was likely caused by a deadlock issue in pre-2.9.0 Beam, because it used the Conscrypt library as the default security provider. So, I upgraded to Beam 2.11 from Beam 2.8.
Once again, it worked, until it didn't. I looked more closely at the error and noticed that it had a problem with a SimpleDateFormat object, which isn't thread-safe. So, I switched to using java.time and DateTimeFormatter, which is thread-safe. It worked until it didn't. However, this time the error was slightly different and didn't point to anything in my code:
The error is provided below.
Processing stuck in step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:469)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(MetricTrackingWindmillServerStub.java:202)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:409)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:311)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$BagPagingIterable$1.computeNext(WindmillStateReader.java:700)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:47)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:701)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
This error started occurring approximately 5 hours after job deployment and at an increasing rate over time. Writing slowed significantly within 24 hours. I have 60 workers and I suspect that one worker fails every time there is an error, which eventually kills the job.
In my writer, I parse the lines for certain keywords (which may not be the best way) in order to determine which folder each line belongs in, and then write the file to GCS with the determined filename. The relevant code is below.
The partition function is the following:
@SuppressWarnings("serial")
public static class datePartition implements SerializableFunction<String, String> {

    private String filename;

    public datePartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(String input) {
        String folder_name = "NaN";
        String date_dtf = "NaN";
        String date_literal = "NaN";
        try {
            Matcher foldernames = Pattern.compile("\"foldername\":\"(.*?)\"").matcher(input);
            if (foldernames.find()) {
                folder_name = foldernames.group(1);
            } else {
                Matcher folderid = Pattern.compile("\"folderid\":\"(.*?)\"").matcher(input);
                if (folderid.find()) {
                    folder_name = folderid.group(1);
                }
            }

            Matcher date_long = Pattern.compile("\"timestamp\":\"(.*?)\"").matcher(input);
            if (date_long.find()) {
                date_literal = date_long.group(1);
                if (Utilities.isNumeric(date_literal)) {
                    LocalDateTime date = LocalDateTime.ofInstant(Instant.ofEpochMilli(Long.valueOf(date_literal)), ZoneId.systemDefault());
                    date_dtf = date.format(dtf);
                } else {
                    date_dtf = date_literal.split(":")[0].replace("-", "/").replace("T", "/");
                }
            }
            return folder_name + "/" + date_dtf + "h/" + filename;
        } catch (Exception e) {
            LOG.error("ERROR with either foldername or date");
            LOG.error("Line : " + input);
            LOG.error("folder : " + folder_name);
            LOG.error("Date : " + date_dtf);
            return folder_name + "/" + date_dtf + "h/" + filename;
        }
    }
}
And the actual place where the pipeline is deployed and run can be found below:
public void streamData() {
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply("Read PubSub Events",
            PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
        .apply(options.getWindowDuration() + " Window",
            Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                .triggering(AfterWatermark.pastEndOfWindow())
                .discardingFiredPanes()
                .withAllowedLateness(parseDuration("24h")))
        .apply(new GenericFunctions.extractMsg())
        .apply(FileIO.<String, String>writeDynamic()
            .by(new datePartition(options.getOutputFilenamePrefix()))
            .via(TextIO.sink())
            .withNumShards(options.getNumShards())
            .to(options.getOutputDirectory())
            .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
            .withDestinationCoder(StringUtf8Coder.of()));
    pipeline.run();
}
The error 'Processing stuck ...' indicates that some particular operation took longer than 5m, not that the job is permanently stuck. However, since the step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles is the one that is stuck and the job gets cancelled/killed, I would suspect an issue while the job is writing temp files.
I found the BEAM-7689 issue, which is related to the second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) used to name temporary files. Because of this, several concurrent jobs can share the same temporary directory, and one job can delete it before the other jobs finish.
According to the previous link, the mitigation is to upgrade to SDK 2.14. Please let us know if the error is gone.
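Until you can upgrade, one hedged workaround sketch is to give each job its own temp directory, so concurrent jobs cannot delete each other's temp files. This reuses the datePartition function and options getters from the question; the UUID suffix is only an illustration (a unique job name would work as well):
String perJobTempDir = options.getTempLocation() + "/" + java.util.UUID.randomUUID();

FileIO.Write<String, String> write = FileIO.<String, String>writeDynamic()
        .by(new datePartition(options.getOutputFilenamePrefix()))
        .via(TextIO.sink())
        .withNumShards(options.getNumShards())
        .to(options.getOutputDirectory())
        .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
        .withDestinationCoder(StringUtf8Coder.of())
        .withTempDirectory(perJobTempDir); // unique per run, never shared between jobs
You would then pass write to the .apply(...) call in streamData() in place of the inline FileIO.writeDynamic() chain.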
Since posting this question, I've optimized the Dataflow job to dodge bottlenecks and increase parallelization. Much like rsantiago explained, 'processing stuck' isn't an error, but simply the way Dataflow communicates that a step is taking significantly longer than other steps; essentially, it is a bottleneck that can't be cleared with the given resources. The changes I made seem to have addressed it. The new code is as follows:
public void streamData() {
    try {
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events",
                PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
            .apply(options.getWindowDuration() + " Window",
                Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                    .triggering(AfterWatermark.pastEndOfWindow())
                    .discardingFiredPanes()
                    .withAllowedLateness(parseDuration("24h")))
            .apply(FileIO.<String, PubsubMessage>writeDynamic()
                .by(new datePartition(options.getOutputFilenamePrefix()))
                .via(Contextful.fn(
                        (SerializableFunction<PubsubMessage, String>) inputMsg -> new String(inputMsg.getPayload(), StandardCharsets.UTF_8)),
                    TextIO.sink())
                .withDestinationCoder(StringUtf8Coder.of())
                .to(options.getOutputDirectory())
                .withNaming(type -> new CrowdStrikeFileNaming(type))
                .withNumShards(options.getNumShards())
                .withTempDirectory(options.getTempLocation()));
        pipeline.run();
    } catch (Exception e) {
        LOG.error("Unable to deploy pipeline");
        LOG.error(e.toString(), e);
    }
}
The biggest change involved removing the extractMsg() function and changing partitioning to only use metadata. Both of these steps forced deserialization/reserialization of messages and heavily impacted performance.
Additionally, since my data set was unbounded, I had to set a non-zero number of shards. I wanted to simplify my filenaming policy, so I set it to 1 without knowing how much it hurt performance. Since then, I've found a good balance of workers/shards/machine type for my job (mostly based on guess & check, unfortunately).
Although it's still possible that a bottleneck might appear under a large enough data load, the pipeline has been performing well despite heavy load (3-5 TB per day). The changes also significantly improved autoscaling, though I'm not sure why; the Dataflow job now reacts to spikes and valleys a lot more quickly.
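For anyone making a similar change, a hedged sketch of metadata-only partitioning is below; the attribute names ("foldername", "timestamp") and the yyyy/MM/dd/HH pattern are assumptions about how the messages are published, not taken from the job above:
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.SerializableFunction;

public class AttributePartition implements SerializableFunction<PubsubMessage, String> {

    private static final long serialVersionUID = 1L;
    private static final DateTimeFormatter HOUR_FORMAT = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    private final String filename;

    public AttributePartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(PubsubMessage msg) {
        // Read folder and timestamp from message attributes - no payload parsing.
        String folder = msg.getAttribute("foldername");
        String ts = msg.getAttribute("timestamp");
        String datePath = "NaN";
        if (ts != null && ts.matches("\\d+")) {
            LocalDateTime date = LocalDateTime.ofInstant(
                    Instant.ofEpochMilli(Long.parseLong(ts)), ZoneId.systemDefault());
            datePath = date.format(HOUR_FORMAT);
        }
        return (folder != null ? folder : "NaN") + "/" + datePath + "h/" + filename;
    }
}
Because it only reads attributes, the payload is never deserialized just to pick a destination.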

Trying to suppress errors while attaching browser

EDIT: I'm using the LeanFT Java SDK 14.50
EDIT2: for text clarification
I'm writing test scripts for a web application that sometimes opens popup browsers for specific actions. So naturally, when that happens, I attach the new browser using BrowserFactory.attach(...). The problem is that LeanFT does not seem to have a way to validate that the browser exists before attaching it, and if I try to attach it too early, it will fail. I don't like using an arbitrary wait/sleep time, as I can never really know how much time it's going to take for the browser to be ready. So my solution is below:
private Browser attachPopUpBrowser(BrowserType bt, RegExpProperty url) {
    Browser browser = null;
    int iteration = 0;
    // TimeoutLimit.SHORT = 15000
    while (browser == null && iteration < TimeoutLimit.SHORT.getLimit()) {
        try {
            Reporter.setReportLevel(ReportLevel.Off);
            browser = BrowserFactory.attach(
                new BrowserDescription.Builder()
                    .type(bt)
                    .url(url)
                    .build()
            );
            Reporter.setReportLevel(ReportLevel.All);
        } catch (GeneralLeanFtException e) {
            try {
                Thread.sleep(1000);
                iteration += 1000;
            } catch (InterruptedException e1) {
                // ignore and retry
            }
        }
    }
    return browser;
}
Now, this works wonderfully, with one exception: it generates errors in the LeanFT test result. Errors that I want to ignore, because I know the attach will fail a few times before it succeeds. As you can see, I've tried changing the ReportLevel while doing this in order to suppress the error logging, but it doesn't work. I've tried using
Browser[] browsers = BrowserFactory.getallOpenBrowsers(BrowserDescription);
thinking that it would return an empty array if it finds nothing, but I still get errors while the browser is not ready. Does anyone have suggestions as to how I could work around this?
TL;DR
I'm looking for a way to either suppress the errors generated within my while loop or to validate that the browser is ready before attaching it. All of that, so that I can have a nice and clean run result at the end of my test (because these errors will present false negatives in nearly all of my tests).
Addendum
Also, when the attach fails for the first time, I get an exception
com.hp.lft.sdk.ReplayObjectNotFoundException: attachApplication
as expected, but all subsequent failures are throwing
com.hp.lft.sdk.GeneralLeanFtException: Cannot read property 'match' of null
I've compared both stack traces and they are identical except for the last 2 lines, which happen within ReplayExceptionFactory.CreateDefault(), so I think something gets corrupted during the exception generation; but that is within the leanft.sdk.internal package, so there might not be a lot we can do about it right now. I'm guessing that if I did not get that second "cannot read property" exception, I would correctly get the ReplayObjectNotFoundException until the browser is correctly attached.
I'd rather not force an attach endlessly until it works. Even if we solved the false negatives, we'd still have a less-than-ideal approach to the problem.
The cleanest solution would be to see if there is anything to attach to in the first place.
And you can do just that by getting all the browser instances that meet your description.
Browser[] browsers = BrowserFactory.getAllOpenBrowsers(new BrowserDescription.Builder().build());
Any element in this collection is an already "attached" browser - you can start using it.
If the list doesn't contain your browser instance, rerun the query.
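A hedged sketch of that polling approach, reusing the BrowserType/RegExpProperty arguments and TimeoutLimit from the question (the sleep interval and the null return on timeout are arbitrary choices, not part of the LeanFT API):
private Browser waitForPopUpBrowser(BrowserType bt, RegExpProperty url)
        throws GeneralLeanFtException, InterruptedException {
    BrowserDescription description = new BrowserDescription.Builder()
            .type(bt)
            .url(url)
            .build();
    long deadline = System.currentTimeMillis() + TimeoutLimit.SHORT.getLimit();
    while (System.currentTimeMillis() < deadline) {
        Browser[] browsers = BrowserFactory.getAllOpenBrowsers(description);
        if (browsers.length > 0) {
            return browsers[0]; // already attached - no failed attach, no error in the run result
        }
        Thread.sleep(500); // re-query until the popup shows up or we time out
    }
    return null; // caller decides how to handle a popup that never appeared
}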

JMX results are confusing

I have been trying to learn JMX for the last few days and have now gotten confused.
I have written a simple JMX program that uses the APIs in the java.lang.management package and tries to extract the PID, CPU time, and user time. In my results I am only getting data for the threads of the current JVM, which is my JMX program itself, but I thought I would get results for all Java processes running on the same machine. How will I get the PIDs, CPU time, and user time for all Java processes running in a JVM (Linux/Windows)?
How can I get the PIDs, CPU time, and user time for all non-Java processes running on my machine (Linux/Windows)?
My code is below:
public void update() throws Exception {
    final ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    final long[] ids = bean.getAllThreadIds();
    final ThreadInfo[] infos = bean.getThreadInfo(ids);
    for (long id : ids) {
        if (id == threadId) {
            continue; // Exclude polling thread
        }
        final long c = bean.getThreadCpuTime(id);
        final long u = bean.getThreadUserTime(id);
        if (c == -1 || u == -1) {
            continue; // Thread died
        }
    }
    String name = null;
    for (int i = 0; i < infos.length; i++) {
        name = infos[i].getThreadName();
        System.out.println("The name of the id is " + name);
    }
}
I am always getting the result:
The name of the id is Attach Listener
The name of the id is Signal Dispatcher
The name of the id is Finalizer
The name of the id is Reference Handler
The name of the id is main
I have some other Java processes running on my machine, but they are not included in the results of the bean.getAllThreadIds() API.
Ah, now I see what you want to do. I'm afraid I have some bad news.
The APIs that are exposed through ManagementFactory allow you to monitor only the JVM in which your code is running. To monitor other JVMs, you have to use the JMX Remoting API (javax.management.remote), and that introduces a whole new range of issues you have to deal with.
It sounds like what you want to do is basically write your own management console using the stock APIs provided by the out-of-the-box JDK. Short answer: you can't get there from here. Slightly longer answer: you can get there from here, but the road is long, winding, uphill (nearly) the entire way, and when you're done you will most likely wish you had gone a different route (read that: use a management console that has already been written).
I recommend you use JConsole or some other management console to monitor your application(s). In my experience it is usually only important that a human (not a program) interpret the stats that are provided by the various MBeans whose references are obtainable through the ManagementFactory static methods. After all, if a program had access to, say, the amount of CPU used by some other process, what conceivable use would it have with that information (other than to provide it in some human-readable format)?
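If you do decide to go down the JMX remoting road anyway, a minimal sketch is below. It assumes the target JVM was started with JMX remoting enabled (for example -Dcom.sun.management.jmxremote.port=9999, with authentication and SSL disabled for local testing); the host and port here are placeholders:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteThreadDump {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            // Proxy for the remote JVM's ThreadMXBean instead of the local one.
            ThreadMXBean remoteThreads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (long id : remoteThreads.getAllThreadIds()) {
                ThreadInfo info = remoteThreads.getThreadInfo(id);
                if (info != null) {
                    System.out.println(info.getThreadName()
                            + " cpu=" + remoteThreads.getThreadCpuTime(id) + "ns");
                }
            }
        }
    }
}
Even then, this only reaches other JVMs that expose JMX; it still won't see non-Java processes.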

Simultaneous downloading of webpages/files in EJB (Java)

I have a small problem with creating threads in EJB. OK, I understand why I cannot use them in EJB, but I don't know how to replace them while keeping the same functionality. I am trying to download 30-40 webpages/files, and I need to start downloading all of the files at (approximately) the same time. This is needed because, if I run them in a queue on one thread, it will take more than 3 minutes to execute.
I tried the @Asynchronous annotation, but nothing happened.
public void execute(String lang2, String lang1, int number) {
    Stopwatch timer = new Stopwatch().start();
    htmlCodes.add(URL2String(URLs.get(number)));
    timer.stop();
    System.out.println(number + ":" + Thread.currentThread().getName() + timer.elapsedMillis() + " milliseconds");
}

private void findMatches(String searchedWord, String lang1, String lang2) {
    articles = search(searchedWord);
    for (int i = 0; i < articles.size(); i++) {
        execute(lang1, lang2, i);
    }
}
Here are two really good SO answers that can help. This one gives you your options, and this one explains why you shouldn't spawn threads in an EJB. The problem with the first answer is that it doesn't contain a lot of knowledge about EJB 3.0 options. So, here's a tutorial on using @Asynchronous.
No offense, but I don't see any evidence in your code that you've read this tutorial yet. Your asynchronous method should return a Future. As the tutorial says:
The client may retrieve the result using one of the Future.get methods. If processing hasn’t been completed by the session bean handling the invocation, calling one of the get methods will result in the client halting execution until the invocation completes. Use the Future.isDone method to determine whether processing has completed before calling one of the get methods.
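To make that concrete, here is a minimal sketch of an @Asynchronous EJB 3.1 bean that returns a Future; it is not the poster's actual bean, and urlToString(...) merely stands in for the URL2String(...) call from the question:
import java.util.concurrent.Future;
import javax.ejb.AsyncResult;
import javax.ejb.Asynchronous;
import javax.ejb.Stateless;

@Stateless
public class PageDownloader {

    @Asynchronous
    public Future<String> download(String url) {
        String html = urlToString(url);   // runs on the container's async thread
        return new AsyncResult<>(html);   // wraps the result for the caller
    }

    private String urlToString(String url) {
        // ... fetch the page contents; omitted here ...
        return "";
    }
}
The caller invokes download(...) once per URL, collects the returned Futures, and only then calls get() on each one, so the 30-40 downloads overlap instead of running sequentially.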

My own Logging Handler for GAE/J (using appengine.api.log?)

I need to write my own logging handler on GAE/J. I have Android code that I'm trying to adapt such that it can be shared between GAE/J and Android. The GAE code I'm trying to write would allow the log statements in my existing code to work on GAE.
The docs say that I can just print to System.out and System.err, and it works, but badly. My logging shows up in the log viewer with too much extraneous text:
2013-03-08 19:37:11.355 [s~satethbreft22/1.365820955097965155].: [my_log_msg]
So, I started looking at the GAE log API. This looked hopeful initially: I can construct an AppLogLine and set the log records for a RequestLogs object.
However, there is no way to get the RequestLogs instance for the current request - the docs say so explicitly here:
Note: Currently, App Engine doesn't support the use of the request ID to directly look up the related logs.
I guess I could invent a new requestID and add log lines to that, but it is starting to look like this is just not meant to be?
Has anyone used this API to create their own log records, or otherwise managed to do their own logging to the log console?
Also, where can I find the source for GAE's java.util.logging? Is this public? I would like to see how that works if I can.
If what I'm trying to do is impossible then I will need to consider other options, e.g. writing my log output to a FusionTable.
I ended up just layering my logging code on top of GAE's java.util.logging. This feels non-optimal, since it increases the complexity and overhead of my logging, but I guess this is what any third-party logging framework for GAE must do (unless it is OK with the extra cruft that gets added when you just print to stdout).
Here is the crux of my code:
public int println(int priority, String msg) {
    Throwable t = new Throwable();
    StackTraceElement[] stackTrace = t.getStackTrace();
    // Optional: translate from Android log levels to GAE (java.util.logging) log levels.
    final Level[] levels = { Level.FINEST, Level.FINER, Level.FINE, Level.CONFIG,
            Level.INFO, Level.WARNING, Level.SEVERE, Level.SEVERE };
    Level level = levels[priority];
    LogRecord lr = new LogRecord(level, msg);
    if (stackTrace.length > 2) { // should always be true
        lr.setSourceClassName(stackTrace[2].getClassName());
        lr.setSourceMethodName(stackTrace[2].getMethodName());
    }
    log.log(lr);
    return 0;
}
Note that I use a stack depth of 2, but that number will depend on the 'depth' of your logging code.
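For illustration, calling the wrapper might look like this; GaeLogger is a hypothetical name for the class that holds the println method and the java.util.logging Logger field named log:
GaeLogger gaeLog = new GaeLogger();
// Android-style priority 5 (android.util.Log.WARN) maps to Level.WARNING via the levels array above.
gaeLog.println(5, "Quota nearly exhausted for datastore writes");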
I hope that Google will eventually support getting the current com.google.appengine.api.log.RequestLogs instance and inserting our own AppLogLine instances into it. (The APIs are actually there to do that, but they explicitly don't support it, as above.)
