How do I log from a mapper? (Hadoop with Common Crawl) - java

I'm using the Common Crawl example code from their "MapReduce for the Masses" tutorial. I'm trying to make modifications to the mapper, and I'd like to be able to log strings to some output. I'm considering setting up a NoSQL db and just pushing my output to it, but that doesn't feel like a good solution. What's the standard way to do this kind of logging from Java?

While there is no special solution for logs beyond the usual logger (at least none I am aware of), I can suggest a few approaches.
a) If the logs are for debugging, write ordinary debug log statements. For failed tasks you can find the task logs via the web UI and analyze them.
b) If these logs are a kind of output you want alongside the other output of your job, assign them a special key and write them to the context. The reducer will then need some special logic to route them to the output.
c) You can create a directory on HDFS and have the mapper write there. This is not the classic MR way because it is a side effect, but in some cases it can be fine. Especially since each mapper creates its own file, you can use the command hadoop fs -getmerge ... to get all the logs as one file.
d) If you want to be able to monitor the progress of your job, the number of errors, etc., you can use counters; a sketch combining (a) and (d) is below.
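As a minimal sketch of (a) and (d) together, assuming the new org.apache.hadoop.mapreduce API; the class and counter names are illustrative, not from the Common Crawl tutorial:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

    // Counter names are arbitrary; they show up in the job UI and job counters.
    private enum Stats { RECORDS_SEEN, PARSE_ERRORS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter(Stats.RECORDS_SEEN).increment(1);
        try {
            // ... real record processing would go here ...
            context.write(new Text(value.toString()), new LongWritable(1));
        } catch (RuntimeException e) {
            // Goes to the per-task log, which you can reach from the web UI.
            LOG.warn("Failed to process record at offset " + key, e);
            context.getCounter(Stats.PARSE_ERRORS).increment(1);
        }
    }
}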

Related

Retrieving previous n lines of log4j specific to one logger class

We implemented a small routine so that, in case of specific errors, we get an email letting us know something happened. It's pretty easy to include some error information in the email, but we would also like to include what was logged just before it happened.
At first I tried to retrieve the last lines from the file where the log4j output is saved. The problem is that many other threads are working at the same time and I never get relevant information: either the buffer doesn't have time to write the logs, or something else is faster at writing something else.
Is it possible to add an appender to a specific log4j logger at runtime so I can retrieve only those specific logs, and only if needed?
I would like to still be able to log with the usual commands (log.error, log.warn, ...) but also retrieve the log content in a try/catch situation. Something like a memory-only buffer appender?
Would that work for all the child loggers?
Class 1 - call Class 2
Class 2 - call Class 3
Class 3 try/catch an error
log.getLogs returns all logged information from Class 1..3
Or am I dreaming here?
We are using Jira, and the logs are in files, so I guess no query would work.
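One way to realize the "memory-only buffer appender" idea in log4j 1.x is a small ring buffer attached to the root logger, so events from all child loggers pass through it. A minimal sketch, with all names illustrative:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

public class RingBufferAppender extends AppenderSkeleton {

    private final int capacity;
    private final Deque<LoggingEvent> buffer = new ArrayDeque<>();

    public RingBufferAppender(int capacity) {
        this.capacity = capacity;
    }

    @Override
    protected synchronized void append(LoggingEvent event) {
        if (buffer.size() == capacity) {
            buffer.removeFirst(); // drop the oldest entry
        }
        buffer.addLast(event);
    }

    public synchronized List<String> getLogs() {
        List<String> lines = new ArrayList<>();
        for (LoggingEvent e : buffer) {
            lines.add(e.getLevel() + " " + e.getLoggerName() + " - " + e.getRenderedMessage());
        }
        return lines;
    }

    @Override
    public void close() { }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}

Attaching it once at startup (Logger.getRootLogger().addAppender(new RingBufferAppender(200))) lets a catch block call getLogs() and include the result in the notification email.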

Multiple reader/processor/writer in spring batch

I am new to Spring Batch and I have a peculiar problem. I want to get results from 3 different JPA queries with JpaPagingItemReader, process them individually, and write them into one consolidated XML file using StaxEventItemWriter.
For example, the resultant XML would look like:
<root>
<query1>
...
</query1>
<query2>
...
</query2>
<query3>
...
</query3>
</root>
Please let me know how to achieve this.
Also, I currently implemented my configurer with one query, but the reader/writer is also quite slow: it took around 59 minutes to generate a 20MB file, as I am running it in a single-threaded environment for now rather than a multithreaded one. If there are any other suggestions around this, please do let me know. Thanks.
EDIT:
I tried following this approach:
Created 3 different steps and added 1 reader, processor, and writer to each of them, but the problem I am facing now is that the writer is not able to write to the same file or append to it.
This is written in StaxEventItemWriter class:
FileUtils.setUpOutputFile(file, restarted, false, overwriteOutput);
Here 3rd argument append is false by default.
The second approach in your question seems the right direction: you could create 3 different readers/processors/writers and write a custom writer extending AbstractFileItemWriter, where append can be enabled (setAppendAllowed); a sketch of such a writer is below. Also, I have seen that xmlWriter writes XML faster than StaxEventItemWriter, but there is some trade-off in writing boilerplate code.
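A sketch of that writer idea, assuming Spring Batch 4.x (where AbstractFileItemWriter exposes setAppendAllowed) and that items arrive as XML fragment strings already serialized by the step's processor; the class and writer names are made up:

import java.util.List;

import org.springframework.batch.item.support.AbstractFileItemWriter;

public class AppendingXmlFragmentWriter extends AbstractFileItemWriter<String> {

    public AppendingXmlFragmentWriter() {
        setAppendAllowed(true);    // the switch StaxEventItemWriter does not expose
        setName("fragmentWriter"); // used for ExecutionContext keys on restart
    }

    @Override
    protected String doWrite(List<? extends String> items) {
        StringBuilder sb = new StringBuilder();
        for (String fragment : items) {
            sb.append(fragment).append('\n');
        }
        return sb.toString();
    }
}

Each of the three steps can then point an instance of this writer (via setResource) at the same output file, with each step's processor responsible for producing its <queryX> fragments.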
One option off the top of my head is to
create a StaxEventItemWriter
create 3 instances of a step that has a JpaPagingItemReader and writes the corresponding <queryX>...</queryX> section to the shared writer
write the <root> and </root> tags in a JobExecutionListener, so the steps don't care about the envelope
There are other considerations here, like whether it's always 3 files, etc., but the general idea is to separate concerns between processors, steps, jobs, and listeners so that each performs a clear piece of work. A sketch of the listener part is below.
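For the envelope part, a hedged sketch of a JobExecutionListener that writes the <root> and </root> tags around whatever the steps append; the output path is made up:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

public class RootTagListener implements JobExecutionListener {

    private final Path output = Paths.get("out/consolidated.xml");

    @Override
    public void beforeJob(JobExecution jobExecution) {
        try {
            Files.createDirectories(output.getParent());
            Files.write(output, "<root>\n".getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        } catch (IOException e) {
            throw new IllegalStateException("Could not start output file", e);
        }
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        try {
            Files.write(output, "</root>\n".getBytes(), StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new IllegalStateException("Could not finish output file", e);
        }
    }
}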
Use JVisualVM to monitor the bottlenecks inside your application.
Since you said it is taking 59 minutes to create a 20MB file, this will give you better insight into where you are taking performance hits.
VisualVM tutorial:
Open VisualVM and connect your application => Sampler => CPU => CPU Samples.
Take snapshots at various times and analyse where it is spending the most time. Checking just this will give you enough data for optimisation.
Note: JVisualVM ships with the Oracle JDK 8 distribution; you can simply type jvisualvm at the command prompt/terminal. If not, download it from here.

Publishing metrics to ganglia using gmetric4j

I'm considering using gmetric4j to publish metrics to Ganglia. So far the only documented way I have found for doing this is to use its GSampler class to make a metric-data-polling Runnable that runs at scheduled times.
In my application, though, it would be easier to have its components themselves publish the metric data when they see fit (i.e. not at regular scheduled intervals). From inspecting the gmetric4j source code I can see that this can be done with GMetric objects, but I am not sure if this would produce meaningful results in the end.
So what I would like to know is:
Can you publish data to ganglia at irregular intervals, and if yes how are data aggregations and time series formed in this case?
Also, I failed to understand the meaning of the "tmax" (-x on the command line) and "dmax" (-d on the command line) parameters of gmetric calls, and whether they have anything to do with the above problem. Does anyone know anything more about these?
Have you tried the Metrics Library? It has a Ganglia reporter that takes care of when and how to send your measurements to gmond/gmetad. You can also check out the source if you want a code example.
For dmax, tmax, and how often to report, I found this to be a good source.
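To illustrate the direct GMetric route for irregular publishing, a hedged sketch; the host, port, and metric names are made up, and the tmax/dmax comments paraphrase the usual gmond interpretation:

import java.io.IOException;

import info.ganglia.gmetric4j.gmetric.GMetric;
import info.ganglia.gmetric4j.gmetric.GMetric.UDPAddressingMode;
import info.ganglia.gmetric4j.gmetric.GMetricSlope;
import info.ganglia.gmetric4j.gmetric.GMetricType;
import info.ganglia.gmetric4j.gmetric.GangliaException;

public class QueueMetrics {

    private final GMetric ganglia;

    public QueueMetrics() throws IOException {
        // Address/port of the gmond multicast group; values here are made up.
        ganglia = new GMetric("239.2.11.71", 8649, UDPAddressingMode.MULTICAST, 1);
    }

    public void reportQueueDepth(int depth) throws GangliaException {
        // tmax (here 60s): the interval gmond expects between updates; once it
        // passes, the value is treated as stale. dmax (here 0): how long gmond
        // keeps the metric after the last update before deleting it; 0 means
        // never delete. For irregular publishing, pick a tmax comfortably
        // larger than your longest expected gap between announcements.
        ganglia.announce("queue_depth", String.valueOf(depth),
                GMetricType.UINT32, "items", GMetricSlope.BOTH,
                60, 0, "myapp");
    }
}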

dynamically creating & destroying logging appenders

I have a legacy PSVM (public static void main) application whose logging output I'd like to redirect to a unique file per execution. So, if I invoke it at 10:00, it should redirect its output to {thread-id}-10:00.log; another thread of execution may begin at 10:01, and its output would go to {thread-id}-10:01.log. I understand that this is not elegant.
My questions are:
is this possible?
does someone have an idea of how to approach?
is it possible to release/destroy an appender when it's no longer needed?
Thanks!
I would start with FileAppender and derive from that to create your own. Simply modify your version to get the current thread id and append a suitable thread-id/timestamp to the file name prior to creation. You would maintain (say) a map of (buffered) FileWriters keyed on thread id.
Writing appenders is quite trivial - here's a Javaworld guide on how to do it.
In the above, is it at all likely that your program will start up twice in one minute? Would you want to append a process id or similar to maintain uniqueness?
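A minimal sketch of that idea with plain log4j 1.x, creating and later releasing a per-execution FileAppender; the class name and pattern are illustrative, and colons are avoided in the file name for filesystem portability:

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class PerRunLogging {

    public static FileAppender attach(Logger logger) throws IOException {
        String stamp = new SimpleDateFormat("HH-mm").format(new Date());
        String file = Thread.currentThread().getId() + "-" + stamp + ".log";
        FileAppender appender = new FileAppender(
                new PatternLayout("%d %-5p %c - %m%n"), file, false);
        logger.addAppender(appender);
        return appender;
    }

    public static void detach(Logger logger, FileAppender appender) {
        // Release the appender once the execution is done.
        logger.removeAppender(appender);
        appender.close();
    }
}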
It is not possible, or at least not easy, to do in log4j. However, if you look at the SiftingAppender shipped with logback (log4j's successor), it is designed to handle the creation of appenders on runtime criteria, as well as their removal when no longer needed.
If your application needs to create just one log file per application launch, you could simply name your log file based on a timestamp. Shout on the logback-user mailing list if you need further assistance.

How to validate Java logging properties files?

I have a basic facility for allowing users to remotely apply changes to the logging configuration files in my application. Some logs are configured using java.util.logging properties files, and some are configured using log4j/log4cplus-style properties files. I'd like to do some basic validation of the properties that users try to apply. Namely, I want to assure the following:
Every logging.properties file must always contain at least a root logger/logging level
The logger/level must be set to a valid value. That is, they should not be able to set .level = GIBBERISH or anything like that.
I'll probably allow them to set MaxFileSize and MaxBackupIndex (log4j), and .limit and .count properties (java.util.logging), too.
What's the best way to accomplish this? I can obviously just loop over the keys and values in a Properties object and look their values up in a hard-coded Map or some other data structure that records which properties are valid, but I'm trying to come up with a solution that's a little more elegant than that.
The problem with running any set of partial syntax checks against the properties files is that they will always be inadequate by definition, unless you capture every variation acceptable to the logging system, in which case you will have recreated a portion of the logging system. No matter which properties you choose to validate, there's bound to be an additional way to submit a broken file.
Rather than testing for individual properties, why not create an additional (temporary, for the scope of the check only) logger object based on the input file and detect if it throws an error?
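A hedged sketch along those lines for the java.util.logging side, combining the two explicit checks from the question with a "just try it" step; note that j.u.l. is fairly lenient about bad values, so the explicit checks still matter, and the class and method names here are illustrative:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.LogManager;

public class LoggingConfigValidator {

    public static void validate(byte[] candidate) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(candidate));

        // 1. A root level must be present...
        String rootLevel = props.getProperty(".level");
        if (rootLevel == null) {
            throw new IllegalArgumentException("Missing root '.level' entry");
        }
        // 2. ...and it must parse to a real Level (rejects '.level = GIBBERISH').
        try {
            Level.parse(rootLevel.trim());
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Invalid root level: " + rootLevel, e);
        }

        // 3. Hand the file to the logging system itself. Beware: this replaces
        // the live configuration, so a real application should restore the
        // previous configuration afterwards (or validate in a throwaway JVM).
        try (InputStream in = new ByteArrayInputStream(candidate)) {
            LogManager.getLogManager().readConfiguration(in);
        }
    }
}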
The "elegant" solution would be to write a rule-based engine for checking sets of name-value pairs. But IMO that is totally over the top for this use-case ... unless the checks are far more complex than I imagine.
I'd say that the simple (inelegant) solution is best in this case.
