I would like to trigger a few Hadoop jobs simultaneously. I've created a pool of threads using Executors.newFixedThreadPool. The idea is that if the pool size is 2, my code will trigger 2 Hadoop jobs at exactly the same time using ToolRunner.run. In my testing, I noticed that these 2 threads keep stepping on each other.
When I looked under the hood, I noticed that ToolRunner creates a GenericOptionsParser, which in turn calls a static method buildGeneralOptions. This method uses OptionBuilder.withArgName, which relies on a shared field called argName. This doesn't look thread safe to me, and I believe it is the root cause of the issues I am running into.
Any thoughts?
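For reference, the submission code looks roughly like this (a minimal sketch; MyJobTool, the pool size and the input paths are placeholders, not the actual job):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ParallelJobLauncher {

    // Hypothetical placeholder Tool; the real job setup would go in run().
    static class MyJobTool extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // build and submit the MapReduce job here
            return 0;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // submit two jobs at the same time; both calls go through ToolRunner.run
        for (String input : new String[] {"/data/input1", "/data/input2"}) {
            pool.submit(() -> ToolRunner.run(new Configuration(), new MyJobTool(),
                    new String[] {input}));
        }
        pool.shutdown();
    }
}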
Confirmed that ToolRunner is NOT thread-safe:
Original code (which runs into problems):
public static int run(Configuration conf, Tool tool, String[] args)
        throws Exception {
    if (conf == null) {
        conf = new Configuration();
    }
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    // set the configuration back, so that Tool can configure itself
    tool.setConf(conf);
    // get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);
}
New code (which works):
public static int run(Configuration conf, Tool tool, String[] args)
        throws Exception {
    if (conf == null) {
        conf = new Configuration();
    }
    GenericOptionsParser parser = getParser(conf, args);
    tool.setConf(conf);
    // get the args w/o generic hadoop args
    String[] toolArgs = parser.getRemainingArgs();
    return tool.run(toolArgs);
}

private static synchronized GenericOptionsParser getParser(Configuration conf, String[] args) throws Exception {
    return new GenericOptionsParser(conf, args);
}
How can I get the Flink job name at runtime? While getJobId is available from the RuntimeContext, the job name isn't available there.
https://nightlies.apache.org/flink/flink-docs-release-1.13/api/java/org/apache/flink/api/common/functions/RuntimeContext.html
Trying to get it from the Configuration doesn't seem to work either:
@Override
public void open(Configuration parameters) throws Exception {
    String jobName = parameters.getString(PipelineOptions.NAME); // this is null
}
This is how we run a standalone sample pipeline:
public static void main(String... args) {
    try {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // some pipeline setup
        env.execute("This-is-job-name");
    } catch (Exception e) {
        // logging
    }
}
Assuming you are passing in the job name as a parameter to the job, you want to set it up like this:
public static void main(String... args) {
    ParameterTool parameterTool = ParameterTool.fromArgs(args);
    final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.getConfig().setGlobalJobParameters(parameterTool);
and then this should work
@Override
public void open(Configuration parameters) throws Exception {
    ParameterTool params = (ParameterTool)
        getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
    String jobName = params.get(nameOfParameterWithJobName);
}
The Configuration passed to open is always empty -- that's an obsolete mechanism that is no longer used. The method signature hasn't been changed in order to avoid breaking a public API.
Another good way to pass info like this to a RichFunction is to pass it to the constructor.
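For instance, a minimal sketch of that approach (the class and field names here are illustrative, not from the original pipeline):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Hypothetical function that receives the job name through its constructor.
public class JobNameAwareMapper extends RichMapFunction<String, String> {

    private final String jobName;

    public JobNameAwareMapper(String jobName) {
        this.jobName = jobName;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // jobName was serialized together with the function and is available here
    }

    @Override
    public String map(String value) {
        return jobName + ": " + value;
    }
}

The same string can then be passed both to the constructor and to env.execute(...), e.g. stream.map(new JobNameAwareMapper("This-is-job-name")), so the name used for the job and the name seen inside the function always match.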
I run code using the Flink Java API that reads bytes from Kafka and parses them, then inserts the results into a Cassandra database using another library's static method (both parsing and inserting are done by the library). Running the code locally in the IDE I get the desired result, but running on a YARN cluster the parse method doesn't work as expected!
public class Test {
    static HashMap<Integer, Object> ConfigHashMap = new HashMap<>();

    public static void main(String[] args) throws Exception {
        CassandraConnection.connect();
        Parser.setInsert(true);

        stream.flatMap(new FlatMapFunction<byte[], Void>() {
            @Override
            public void flatMap(byte[] value, Collector<Void> out) throws Exception {
                Parser.parse(ByteBuffer.wrap(value), ConfigHashMap);
                // Parser.parse(ByteBuffer.wrap(value));
            }
        });

        env.execute();
    }
}
There is a static HashMap field in the Parser class that the parsing configuration is based on, and data is inserted into it during execution. The problem when running on YARN was that this data was not available to the task managers, and they just printed that the config is not available!
So I redefined that HashMap as a parameter of the parse method, but the results were no different!
How can I fix the problem?
Changing the static methods and fields to non-static and using a RichFlatMapFunction solved the problem.
stream.flatMap(new RichFlatMapFunction<byte[], Void>() {
    CassandraConnection con = new CassandraConnection();
    int i = 0;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        con.connect();
    }

    @Override
    public void flatMap(byte[] value, Collector<Void> out) throws Exception {
        ByteBuffer tb = ByteBuffer.wrap(value);
        np.parse(tb, ConfigHashMap, con);
    }
});
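The underlying issue is that static state initialized in main() lives only in the JVM where main() runs and is not shipped to the task managers; only the serialized function instance travels to the cluster. A variant of the same fix, assuming the parser exposes a non-static parse(ByteBuffer, HashMap, CassandraConnection) method as in the snippet above, is to pass the config map through the function's constructor so it travels with the function (the class name is illustrative):

import java.nio.ByteBuffer;
import java.util.HashMap;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class ParsingFlatMap extends RichFlatMapFunction<byte[], Void> {

    private final HashMap<Integer, Object> config; // serialized with the function

    private transient Parser parser;               // recreated per task in open()
    private transient CassandraConnection con;

    public ParsingFlatMap(HashMap<Integer, Object> config) {
        this.config = config;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        parser = new Parser();
        con = new CassandraConnection();
        con.connect();
    }

    @Override
    public void flatMap(byte[] value, Collector<Void> out) throws Exception {
        parser.parse(ByteBuffer.wrap(value), config, con);
    }
}

In main() the map is built and passed in, e.g. stream.flatMap(new ParsingFlatMap(configHashMap)), so the task managers see the same contents that the client built.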
I have set up a Hadoop job like so:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Legion");

    job.setJarByClass(Legion.class);
    job.setMapperClass(CallQualityMap.class);
    job.setReducerClass(CallQualityReduce.class);

    // Explicitly configure map and reduce outputs, since they're different classes
    job.setMapOutputKeyClass(CallSampleKey.class);
    job.setMapOutputValueClass(CallSample.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    job.setInputFormatClass(CombineRepublicInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    CombineRepublicInputFormat.setMaxInputSplitSize(job, 128000000);
    CombineRepublicInputFormat.setInputDirRecursive(job, true);
    CombineRepublicInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
This job completes, but something strange happens. I get one output line per input line. Each output line consists of the output from a CallSampleKey.toString() method, then a tab, then something like CallSample@17ab34d.
This means that the reduce phase is never running and the CallSampleKey and CallSample are getting passed directly to the TextOutputFormat. But I don't understand why this would be the case. I've very clearly specified job.setReducerClass(CallQualityReduce.class);, so I have no idea why it would skip the reducer!
Edit: Here's the code for the reducer:
public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
    public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
        Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
        while (inValues.hasNext()) {
            call.addSample(inValues.next());
        }
        context.write(NullWritable.get(), new Text(call.getStats()));
    }
}
What if you try to change your
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
to use Iterable instead of Iterator?
public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context) throws IOException, InterruptedException {
You'll have to then use inValues.iterator() to get the actual iterator.
If the method signature doesn't match, then it just falls through to the default identity-reducer implementation. It's perhaps unfortunate that the underlying default implementation doesn't make it easy to detect this kind of typo, but the next best thing is to always use @Override on all methods you intend to override so that the compiler can help.
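Applied to the reducer from the question, the corrected method would look roughly like this (Call, CallSampleKey and CallSample are the classes from the original code):

public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
    @Override
    public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context)
            throws IOException, InterruptedException {
        Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
        // an Iterable can be consumed with a for-each loop instead of hasNext()/next()
        for (CallSample sample : inValues) {
            call.addSample(sample);
        }
        context.write(NullWritable.get(), new Text(call.getStats()));
    }
}

With @Override in place, the earlier Iterator-based signature would have been rejected at compile time instead of silently falling through to the identity reducer.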
I've developed a custom InputFormat for Hadoop (including a custom InputSplit and a custom RecordReader) and I'm experiencing a rare NullPointerException.
These classes are going to be used for querying a third-party system which exposes a REST API for retrieving records. Thus, I took inspiration from DBInputFormat, which is also a non-HDFS InputFormat.
The error I get is the following:
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
I've searched the code of MapTask (Hadoop version 2.1.0) and I've seen that the problematic part is the initialization of the RecordReader:
472 NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
473 org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
474 TaskReporter reporter,
475 org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
476 throws InterruptedException, IOException {
...
491 this.real = inputFormat.createRecordReader(split, taskContext);
...
494 }
...
519 @Override
520 public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
521 org.apache.hadoop.mapreduce.TaskAttemptContext context
522 ) throws IOException, InterruptedException {
523 long bytesInPrev = getInputBytes(fsStats);
524 real.initialize(split, context);
525 long bytesInCurr = getInputBytes(fsStats);
526 fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
527 }
Of course, the relevant parts of my code:
// MyInputFormat.java
public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
    backend = new Backend(host, port, ssl, APIKey);
}

public static void addResId(Job job, String resId) {
    Configuration conf = job.getConfiguration();
    String inputs = conf.get(INPUT_RES_IDS, "");
    if (inputs.isEmpty()) {
        inputs += resId;
    } else {
        inputs += "," + resId;
    }
    conf.set(INPUT_RES_IDS, inputs);
}
@Override
public List<InputSplit> getSplits(JobContext job) {
    // resulting splits container
    List<InputSplit> splits = new ArrayList<InputSplit>();
    // get the Job configuration
    Configuration conf = job.getConfiguration();
    // get the inputs, i.e. the list of resource IDs
    String input = conf.get(INPUT_RES_IDS, "");
    String[] resIDs = StringUtils.split(input);
    // iterate on the resIDs
    for (String resID : resIDs) {
        splits.addAll(getSplitsResId(resID, job.getConfiguration()));
    }
    // return the splits
    return splits;
}
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    if (backend == null) {
        logger.info("Unable to create a MyRecordReader, it seems the environment was not properly set");
        return null;
    }
    // create a record reader
    return new MyRecordReader(backend, split, context);
}
// MyRecordReader.java
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    // get start, end and current positions
    MyInputSplit inputSplit = (MyInputSplit) this.split;
    start = inputSplit.getFirstRecordIndex();
    end = start + inputSplit.getLength();
    current = 0;
    // query the third-party system for the related resource, seeking to the start of the split
    records = backend.getRecords(inputSplit.getResId(), start, end);
}
// MapReduceTest.java
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MapReduceTest(), args);
    System.exit(res);
}

@Override
public int run(String[] args) throws Exception {
    Configuration conf = this.getConf();
    Job job = Job.getInstance(conf, "MapReduce test");
    job.setJarByClass(MapReduceTest.class);
    job.setMapperClass(MyMap.class);
    job.setCombinerClass(MyReducer.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(MyInputFormat.class);
    MyInputFormat.addInput(job, "ca73a799-9c71-4618-806e-7bd0ca1911f4");
    InputFormat.setEnvironmnet(job, "my.host.com", "443", true, "my_api_key");
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    return job.waitForCompletion(true) ? 0 : 1;
}
Any ideas about what is wrong?
BTW, which is the "good" InputSplit the RecordReader must use, the one given to the constructor or the one given in the initialize method? Anyway I've tried both options and the resulting error is the same :)
The way I read your stack trace, real is null on line 524.
But don't take my word for it. Slip an assert or System.out.println in there and check the value of real yourself.
NullPointerException almost always means you dotted off something you didn't expect to be null. Some libraries and collections will throw it at you as their way of saying "this can't be null".
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
To me this reads as: in the org.apache.hadoop.mapred package the MapTask class has an inner class NewTrackingRecordReader with an initialize method that threw a NullPointerException at line 524.
524 real.initialize( blah, blah) // I actually stopped reading after the dot
this.real was set on line 491.
491 this.real = inputFormat.createRecordReader(split, taskContext);
Assuming you haven't left out any more closely scoped reals that are masking this.real, then we need to look at inputFormat.createRecordReader(split, taskContext). If this can return null, then it might be the culprit.
Turns out it will return null when backend is null.
@Override
public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split,
        TaskAttemptContext context) {
    if (backend == null) {
        logger.info("Unable to create a MyRecordReader, " +
                "it seems the environment was not properly set");
        return null;
    }
    // create a record reader
    return new MyRecordReader(backend, split, context);
}
It looks like setEnvironmnet is supposed to set backend
// MyInputFormat.java
public static void setEnvironmnet(
        Job job,
        String host,
        String port,
        boolean ssl,
        String APIKey) {
    backend = new Backend(host, port, ssl, APIKey);
}
backend must be declared somewhere outside setEnvironment (or you'd be getting a compiler error).
If backend hasn't been set to something non-null upon construction and setEnvironmnet was not called before createRecordReader then you should expect to get exactly the NullPointerException you got.
UPDATE:
As you've noted, since setEnvironmnet() is static, backend must be static as well. This means that you must be sure other instances aren't setting it to null.
Solved. The problem is that the backend variable is declared as static, i.e. it belongs to the Java class, and thus any other object changing that variable (e.g. setting it to null) affects all the other objects of the same class.
Now, setEnvironment adds the host, port, SSL usage and the API key to the configuration (the same as addResId already did with the resource ID); when createRecordReader is invoked this configuration is read and the backend object is created.
Thanks to CandiedOrange who put me in the right path!
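A minimal sketch of that fix (the configuration key names are illustrative; Backend and MyRecordReader are the classes from the question):

// MyInputFormat.java (sketch of the configuration-based fix)
public static void setEnvironment(Job job, String host, String port, boolean ssl, String APIKey) {
    Configuration conf = job.getConfiguration();
    // hypothetical keys; any unique names work
    conf.set("myinputformat.host", host);
    conf.set("myinputformat.port", port);
    conf.setBoolean("myinputformat.ssl", ssl);
    conf.set("myinputformat.apikey", APIKey);
}

@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    Configuration conf = context.getConfiguration();
    // rebuild the backend inside the task JVM instead of relying on a static field
    Backend backend = new Backend(
            conf.get("myinputformat.host"),
            conf.get("myinputformat.port"),
            conf.getBoolean("myinputformat.ssl", true),
            conf.get("myinputformat.apikey"));
    return new MyRecordReader(backend, split, context);
}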
public class Test
{
    private static RestAPI rest = new RestAPIFacade("myIp", "username", "password");

    public static void main(String[] args)
    {
        Map<String, Object> foo = new HashMap<String, Object>();
        foo.put("Test key", "testing");
        rest.createNode(foo);
    }
}
No output; it just hangs on the connection indefinitely.
Environment:
Eclipse
JDK 7
neo4j-rest-binding 1.9: https://github.com/neo4j/java-rest-binding
Heroku
Any ideas as to why this just hangs?
The following code works:
public class Test
{
    private static RestAPI rest = new RestAPIFacade("myIp", "username", "password");

    public static void main(String[] args)
    {
        Node node = rest.getNodeById(1);
    }
}
So it stands that I can correctly retrieve remote values.
I guess this is caused by not using transactions. By default, neo4j-rest-binding aggregates multiple operations into one request (aka one transaction). There are two ways to deal with this:
change the transactional behaviour to "1 operation = 1 transaction" by setting
-Dorg.neo4j.rest.batch_transaction=false for your JVM (an example of passing the flag follows the code below). Be aware this could impact performance, since every atomic operation becomes a separate REST request.
use transactions in your code:
RestGraphDatabase db = new RestGraphDatabase("http://localhost:7474/db/data", username, password);
Transaction tx = db.beginTx();
try {
    Node node = db.createNode();
    node.setProperty("key", "value");
    tx.success();
} finally {
    tx.finish();
}
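For the first option, the flag is an ordinary JVM system property, so it is passed on the java command line (the jar name here is just a placeholder), for example:

java -Dorg.neo4j.rest.batch_transaction=false -jar my-app.jar

In Eclipse the same flag goes under Run Configurations > Arguments > VM arguments.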