Hadoop Map/Reduce Mapper 'map' method and logs - java

I've recently been asked to look into speeding up a mapreduce project.
I'm trying to view log4j log information which is being generated within the 'map' method of a class which implements: org.apache.hadoop.mapred.Mapper
Within this class there are the following methods:
@Override
public void configure( .. ) { .. }
public static void doCompileAndAdd( .. ) { .. }
public void map( .. ) { .. }
Logging information is available for the configure method and the doCompileAndAdd method (which is called from the configure method); however, no log information is being displayed for the 'map' method.
I've also tried simply using System.out.println( .. ) within the map method without success.
Is there anyone who might be able to help to shed some light on this issue?
Thanks,
Telax

Since the mapper classes actually run in tasks distributed across nodes in the cluster, the stdout from those tasks appears in the individual logs for each task. The simplest way to see those logs is to go to the job tracker page for the cluster, usually at http://namenode:50030/jobtracker.jsp. From there you can select the job and then select the map tasks you are interested in the logs for.
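For illustration, here is a minimal sketch (class name and key/value types are illustrative, using the old org.apache.hadoop.mapred API from the question) showing that logging inside map() works as usual; its output simply lands in the per-task logs described above rather than on the client console:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.log4j.Logger;

public class ExampleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Logger LOG = Logger.getLogger(ExampleMapper.class);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // This ends up in the task attempt's own log, viewable from the
        // job tracker web UI, not on the console of the submitting machine.
        LOG.info("map called for key " + key);
        output.collect(value, new IntWritable(1));
    }
}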

Related

Creating a global transaction Id that is accessible through multiple packages

Hi to all Java experts!
I am working on onboarding a new and shiny process visualization service and I need your help!
My project structure goes like this:
The Service package depends on the Core package, which depends on the Util package. Something like this:
Service
|-|- Core
|-|-|- Util
The application package has the main method from which our code begins. It calls some of the Core methods, which use the Util package to read information from the input.
package com.dummy.service;

public void main(Object input) {
    serviceCore.call(input);
}

package com.dummy.core;

public void call(Object input) {
    String stringInput = util.readFromInput(input);
    // Do stuff
}

package com.dummy.util;

public String readFromInput(Object input) {
    // return stuff;
}
The problem starts when I want to onboard to the visualization service. One requirement is to use a unique transaction Id for each call to the service.
My question is: how can I share the process ID between all of these methods without doing too much refactoring of the code? To see the entire process in the Process Visualization tool I will have to use the same ID across the entire call. My vision is that it will look something like this:
package com.dummy.service;

public void main(Object input) {
    processVisualization.signal(PROCESS_ID, "transaction started");
    serviceCore.call(input);
    processVisualization.signal(PROCESS_ID, "transaction ended");
}

package com.dummy.core;

public void call(Object input) {
    processVisualization.signal(PROCESS_ID, "Method call is invoked");
    String stringInput = util.readFromInput(input);
    // Do stuff
}

package com.dummy.util;

public String readFromInput(Object input) {
    processVisualization.signal(PROCESS_ID, "Reading from input");
    // return stuff;
}
I was thinking about the following options, but these are just abstract ideas and I am not even sure they can be implemented. And if so, how?
1) Creating a new package that all three packages depend on, which would "hold" the process ID for each call. But how? Should I use a static class in this package? A singleton?
2) Using ThreadLocal variables. I've read this post: When and how should I use a ThreadLocal variable? But I am not familiar with them and I'm not sure how to implement this idea. Should it go in a separate package, as mentioned in option 1?
3) Changing the method signatures to pass the ID as a parameter. This is, unfortunately, too pricey in terms of time and the danger of a large refactoring.
4) Using file writing: save the ID in some file that is accessible throughout the process.
5) Constructing a unique ID from the input. I think this could be the perfect solution, but we may receive the same input in separate calls to the service.
6) Accessing the JVM for some unique transaction ID. I know that when we log, the RequestId is printed in the log line. This is the pattern we use in our Log4j configuration:
<pattern>%d{dd MMM yyyy HH:mm:ss,SSS} %highlight{[%p]} %X{RequestId} (%t) %c: %m%n</pattern>
This RequestId is a variable on the ThreadContext that is created before the job runs. Is it possible and/or recommended to access this parameter and use it as a unique transaction ID?
In the end we utilized Log4j's ThreadContext.
It's probably not the best solution, since we are repurposing a logging mechanism, but this is how we did it:
The process ID is extracted like this:
org.apache.logging.log4j.ThreadContext.get("RequestId");
And it is initialized in the handler chain (depending on which service you are using):
ThreadContext.put("RequestId", Objects.toString(job.getId(), (String)null));
This is happening on every job that is received.
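For illustration, a minimal sketch of wrapping this in a small helper so the rest of the code doesn't touch ThreadContext directly; the class and method names here are ours, not part of any library:

import org.apache.logging.log4j.ThreadContext;

// Hypothetical helper; "RequestId" is the key our Log4j pattern already uses.
public final class TransactionId {

    private TransactionId() {}

    // Called once per job, e.g. from the handler chain shown above.
    public static void set(String id) {
        ThreadContext.put("RequestId", id);
    }

    // Called from any package on the same thread to tag visualization signals.
    public static String get() {
        return ThreadContext.get("RequestId");
    }
}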
Disclaimer: This solution hasn't been fully tested yet, but this is the direction we're going with.

How can I find all paths to access a Java API?

I'm analyzing an Android App, looking for security flaws. I've decompiled the APK with JEB and I found a vulnerable method in it.
My problem is: The App logic is too complex and it is very difficult to find a way to trigger this vulnerable method.
I would like to know if there exists a tool to find all the "paths" in the code to access some method.
For example, for the code below:
private void methodX() {
    // This is the method I want to call
}

private void methodA() {
    methodX();
}

private void methodB() {
    methodA();
}

private void methodC() {
    methodX();
}
The paths to access methodX are:
methodA() -> methodX()
methodC() -> methodX()
methodB() -> methodA() -> methodX()
By the way, I'm using Eclipse for the analysis; maybe there is a command in it to do this, but I haven't found one yet.
In Eclipse, Ctrl+Alt+H will open the call hierarchy for a method, showing a tree view you can expand to find "indirect" references to that method.
Here is an example tracing a method from Spring MVC's DispatcherServlet (screenshot of the Call Hierarchy view omitted).
Ctrl+Shift+G on methodX() will show you all references

Configuring DropWizard Programmatically

I have essentially the same question as here but am hoping to get a less vague, more informative answer.
I'm looking for a way to configure DropWizard programmatically, or at the very least, to be able to tweak configs at runtime. Specifically I have a use case where I'd like to configure metrics in the YAML file to be published with a frequency of, say, 2 minutes. This would be the "normal" default. However, under certain circumstances, I may want to speed that up to, say, every 10 seconds, and then throttle it back to the normal/default.
How can I do this, and not just for the metrics.frequency property, but for any config that might be present inside the YAML config file?
Dropwizard reads the YAML config file and configures all the components only once on startup. Neither the YAML file nor the Configuration object is ever used again. That means there is no direct way to change the configuration at run-time.
It also doesn't provide special interfaces/delegates where you can manipulate the components. However, you can access the objects of the components (usually; if not you can always send a pull request) and configure them manually as you see fit. You may need to read the source code a bit but it's usually easy to navigate.
In the case of metrics.frequency you can see that the MetricsFactory class creates ScheduledReporterManager objects per metric type using the frequency setting, and it doesn't look like you can change them at runtime. But you can probably work around it somehow or, even better, modify the code and send a pull request to the dropwizard community.
Although this feature isn't supported out of the box by dropwizard, you're able to accomplish this fairly easily with the tools they give you. Note that the below solution definitely works on config values you've provided, but it may not work for built-in configuration values.
Also note that this doesn't persist the updated config values to the config.yml. However, this would be easy enough to implement yourself simply by writing to the config file from the application. If anyone would like to write this implementation feel free to open a PR on the example project I've linked below.
Code
Start off with a minimal config:
config.yml
myConfigValue: "hello"
And its corresponding configuration class:
ExampleConfiguration.java
public class ExampleConfiguration extends Configuration {
    private String myConfigValue;

    public String getMyConfigValue() {
        return myConfigValue;
    }

    public void setMyConfigValue(String value) {
        myConfigValue = value;
    }
}
Then create a task which updates the config:
UpdateConfigTask.java
public class UpdateConfigTask extends Task {
    ExampleConfiguration config;

    public UpdateConfigTask(ExampleConfiguration config) {
        super("updateconfig");
        this.config = config;
    }

    @Override
    public void execute(Map<String, List<String>> parameters, PrintWriter output) {
        config.setMyConfigValue("goodbye");
    }
}
Also for demonstration purposes, create a resource which allows you to get the config value:
ConfigResource.java
#Path("/config")
public class ConfigResource {
private final ExampleConfiguration config;
public ConfigResource(ExampleConfiguration config) {
this.config = config;
}
#GET
public Response handleGet() {
return Response.ok().entity(config.getMyConfigValue()).build();
}
}
Finally wire everything up in your application:
ExampleApplication.java (excerpt)
environment.jersey().register(new ConfigResource(configuration));
environment.admin().addTask(new UpdateConfigTask(configuration));
Usage
Start up the application then run:
$ curl 'http://localhost:8080/config'
hello
$ curl -X POST 'http://localhost:8081/tasks/updateconfig'
$ curl 'http://localhost:8080/config'
goodbye
How it works
This works simply by passing the same reference to the constructor of ConfigResource.java and UpdateConfigTask.java. If you aren't familiar with the concept see here:
Is Java "pass-by-reference" or "pass-by-value"?
The linked classes above are to a project I've created which demonstrates this as a complete solution. Here's a link to the project:
scottg489/dropwizard-runtime-config-example
Footnote: I haven't verified that this works with the built-in configuration. However, the dropwizard Configuration class, which you need to extend for your own configuration, does have various "setters" for internal configuration, but it may not be safe to update those outside of run().
Disclaimer: The project I've linked here was created by me.
I solved this with bytecode manipulation via Javassist.
In my case, I wanted to change the "influx" reporter,
and modifyInfluxDbReporterFactory should be run BEFORE Dropwizard starts:
private static void modifyInfluxDbReporterFactory() throws Exception {
    ClassPool cp = ClassPool.getDefault();
    // Do NOT use InfluxDbReporterFactory.class.getName() here, as that would
    // force the class into the classloader before it is modified.
    CtClass cc = cp.get("com.izettle.metrics.dw.InfluxDbReporterFactory");
    CtMethod m = cc.getDeclaredMethod("setTags");
    m.insertAfter(
        "if (tags.get(\"cloud\") != null) tags.put(\"cloud_host\", tags.get(\"cloud\") + \"_\" + host);"
        + "tags.put(\"app\", \"sam\");");
    cc.toClass();
}
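For context, a sketch of how this might be invoked; the application class name is a placeholder, not part of the original answer:

public static void main(String[] args) throws Exception {
    // Patch the factory class before Dropwizard bootstraps the metrics reporters.
    modifyInfluxDbReporterFactory();
    new MyDropwizardApplication().run(args); // placeholder for your Application subclass
}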

Eclipse e4: Accessing properties in PostContextCreate

I am using the PostContextCreate part of the life cycle in an e4 RCP application to create the back-end "business logic" part of my application. I then inject it into the context using an IEclipseContext. I now have a requirement to persist some business logic configuration options between executions of my application. I have some questions:
It looks like properties (e.g. accessible from MContext) would be really useful here, a straightforward Map<String,String> sounds ideal for my simple requirements, but how can I get them in PostContextCreate?
Will my properties persist if my application is being run with clearPersistedState set to true? (I'm guessing not).
If I turn clearPersistedState off then will it try and persist the other stuff that I injected into the context?
Or am I going about this all wrong? Any suggestions would be welcome. I may just give up and read/write my own properties file.
I think the Map returned by MApplicationElement.getPersistedState() is intended to be used for persistent data. This will be cleared by -clearPersistedState.
The PostContextCreate method of the life cycle is run quite early in the startup and not everything is available at this point. So you might have to wait for the app startup complete event (UIEvents.UILifeCycle.APP_STARTUP_COMPLETE) before accessing the persisted state data.
You can always use the traditional Platform.getStateLocation(bundle) to get a location in the workspace .metadata to store arbitrary data. This is not touched by clearPersistedState.
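For the state-location option, a minimal sketch (the file name and keys are illustrative) of reading and writing a properties file there:

import java.io.*;
import java.util.Properties;

import org.eclipse.core.runtime.IPath;
import org.eclipse.core.runtime.Platform;
import org.osgi.framework.Bundle;
import org.osgi.framework.FrameworkUtil;

public class BusinessLogicSettings {

    public Properties load() throws IOException {
        Properties props = new Properties();
        File file = settingsFile();
        if (file.exists()) {
            try (InputStream in = new FileInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }

    public void save(Properties props) throws IOException {
        try (OutputStream out = new FileOutputStream(settingsFile())) {
            props.store(out, "business logic configuration");
        }
    }

    private File settingsFile() {
        Bundle bundle = FrameworkUtil.getBundle(getClass());
        // Lives under .metadata in the workspace; not touched by -clearPersistedState.
        IPath state = Platform.getStateLocation(bundle);
        return state.append("settings.properties").toFile();
    }
}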
Update:
To subscribe to the app startup complete event:
@PostContextCreate
public void postContextCreate(IEventBroker eventBroker)
{
    eventBroker.subscribe(UIEvents.UILifeCycle.APP_STARTUP_COMPLETE,
        new AppStartupCompleteEventHandler());
}

private static final class AppStartupCompleteEventHandler implements EventHandler
{
    @Override
    public void handleEvent(final Event event)
    {
        // ... your code here
    }
}
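And a hedged sketch of using the persisted-state map once startup is complete; the key name is illustrative, and the MApplication can be obtained from the IEclipseContext after APP_STARTUP_COMPLETE:

import java.util.Map;

import org.eclipse.e4.ui.model.application.MApplication;

private void restoreOptions(MApplication app)
{
    // getPersistedState() is saved with the application model on shutdown
    // and cleared by -clearPersistedState.
    Map<String, String> state = app.getPersistedState();
    String option = state.get("my.config.option"); // illustrative key
    if (option != null) {
        // ... apply the option to the business logic
    }
    state.put("my.config.option", "new-value");
}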

Add Quartz Source Java Files on the Fly

I have looked around and around, but I have not been able to find a good answer to this. I would like to create a system based on Quartz that allows people to schedule their own tasks. I will use a pseudo example.
Let's say the main method of my Quartz program lives in quartz.java.
Then I have a file called sweep.java that implements the Quartz Job interface.
So in quartz.java, I schedule sweep.java to run every hour. I run quartz.java, and it works fine. GREAT. However, now I want to add a dust.java to the Quartz scheduler. Since this is a production service, I don't want to have to stop quartz.java, add in dust.java, recompile, and run quartz.java again. This downtime would be unacceptable.
Does anyone have any ideas on how I could accomplish this? It seems impossible, because how could you ever feed another Java file into the program without recompiling, linking, etc.?
I hope that this example is clear. Please let me know if I need to clarify any part of it.
Partial answer: it is possible to compile, and then instantiate, a class programmatically.
Here are links to example code:
how to compile from a String;
CompilerOutput;
CompilerOutputDirectory.
The extracted class is grabbed in the third source file (see method getGeneratedClass, which returns a Class<?> object).
HOWEVER: keep in mind that doing this is potentially dangerous. One problem, which can be quite serious if you are not careful, is that when you dynamically instantiate a class, its static initialization blocks are executed. These can potentially wreak havoc on your application. So, in addition, you'll have to create an appropriate SecurityContext.
In the code above, I actually only ever get the Class<?> object and never instantiate it in any way, so no code is executed. But your usage scenario is quite different.
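As a minimal sketch of the compile-and-load step using the standard javax.tools API (directory, file, and class names are illustrative; error handling kept to a minimum):

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class DynamicJobLoader {

    // Compiles <jobsDir>/<className>.java and loads the resulting class,
    // without restarting the running JVM. Nothing is instantiated here, and
    // loadClass() does not initialize the class, so no static initializers
    // run until the class is actually used.
    public static Class<?> compileAndLoad(File jobsDir, String className) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // requires a JDK, not a JRE
        int result = compiler.run(null, null, null,
                new File(jobsDir, className + ".java").getPath());
        if (result != 0) {
            throw new IllegalStateException("Compilation failed for " + className);
        }
        // Keep the loader open for as long as classes from it are in use.
        URLClassLoader loader = new URLClassLoader(new URL[] { jobsDir.toURI().toURL() });
        return loader.loadClass(className);
    }
}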
I have not tried any of these, but they are worth trying.
1) Consider using a Quartz Camel endpoint.
If my understanding is right, Apache Camel lets you create Camel routes on the fly.
It just needs the camel-context.xml deployed into a container, taking into consideration that the required classes are already available on the classpath of the container.
2) Quartz lets you create a job declaratively, i.e. with an XML configuration of the job and trigger.
You can find more information here.
3) Now this requires some effort ;-)
Create an interface with a method that you will execute as part of the job. Let's say it looks like this:
public interface MyDynamicJob
{
    public void executeThisAsPartOfJob();
}
Create your implementations of the job interface:
public class EmailJob implements MyDynamicJob
{
    @Override
    public void executeThisAsPartOfJob()
    {
        System.out.println("Sending Email");
    }
}
Now in your main scheduler engine, use the Observer pattern to store/initiate the job dynamically.
Something like:

Map<String, MyDynamicJob> jobs = new HashMap<String, MyDynamicJob>();

// Call this method to add a job dynamically.
// If you add a job after the scheduler engine has started, find a way here to
// reiterate over this map without shutting down the scheduler :-).
public void addJob(String someJobName, MyDynamicJob job)
{
    jobs.put(someJobName, job);
}
public void initiateScheduler()
{
    // Iterate over the jobs map to get all registered jobs and create a
    // JobDetail instance dynamically for each entry. Add your custom job
    // instance to the job data map.
    JobDetail jd1 = JobBuilder.newJob(GenericJob.class)
            .withIdentity("FirstJob", "First Group").build();
    JobDataMap jobDataMap = jd1.getJobDataMap();
    jobDataMap.put("dynamicjob", jobs.get("dynamicjob1"));
}
public class GenericJob implements Job {

    public void execute(JobExecutionContext arg0) throws JobExecutionException {
        System.out.println("Executing job");
        JobDataMap jdm = arg0.getJobDetail().getJobDataMap();
        MyDynamicJob mdj = (MyDynamicJob) jdm.get("dynamicjob");
        // Now execute your custom job method here.
        mdj.executeThisAsPartOfJob();
        System.out.println("Job Execution complete");
    }
}
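For completeness, a sketch of actually scheduling the JobDetail built above; this assumes the Quartz 2.x builder API, and the hourly schedule is just an example:

import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public void scheduleJob(JobDetail jd1) throws SchedulerException {
    Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
    scheduler.start();

    // Fire now and then every hour; the schedule itself is just an example.
    Trigger trigger = TriggerBuilder.newTrigger()
            .withIdentity("FirstTrigger", "First Group")
            .startNow()
            .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                    .withIntervalInHours(1)
                    .repeatForever())
            .build();

    scheduler.scheduleJob(jd1, trigger);
}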
