I've recently started experimenting with Hadoop and have just created my own InputFormat to handle PDFs.
For some reason my custom RecordReader class never has its initialize method called. (I checked with a sysout, because I haven't set up a debugging environment.)
I'm running Hadoop 2.2.0 on Windows 7 32-bit, and I launch my jobs with yarn jar, as hadoop jar is bugged under Windows...
import ...
public class PDFInputFormat extends FileInputFormat<Text, Text>
{
@Override
public RecordReader<Text, Text> getRecordReader(InputSplit arg0,
JobConf arg1, Reporter arg2) throws IOException
{
return new PDFRecordReader();
}
public static class PDFRecordReader implements RecordReader<Text, Text>
{
private FSDataInputStream fileIn;
public String fileName=null;
HashSet<String> hset=new HashSet<String>();
private Text key=null;
private Text value=null;
private byte[] output=null;
private int position = 0;
@Override
public Text createValue() {
// find the first newline after the current position
int endpos = -1;
for (int i = position; i < output.length; i++){
if (output[i] == (byte) '\n') {
endpos = i;
break;
}
}
if (endpos == -1) {
return new Text(Arrays.copyOfRange(output,position,output.length));
}
return new Text(Arrays.copyOfRange(output,position,endpos));
}
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext job) throws
IOException, InterruptedException
{
System.out.println("initialization is called");
FileSplit split=(FileSplit) genericSplit;
Configuration conf=job.getConfiguration();
Path file=split.getPath();
FileSystem fs=file.getFileSystem(conf);
fileIn= fs.open(split.getPath());
fileName=split.getPath().getName().toString();
System.out.println(fileIn.toString());
PDDocument docum = PDDocument.load(fileIn);
ByteArrayOutputStream boss = new ByteArrayOutputStream();
OutputStreamWriter ow = new OutputStreamWriter(boss);
PDFTextStripper stripper=new PDFTextStripper();
stripper.writeText(docum, ow);
ow.flush();
output = boss.toByteArray();
}
}
}
As I figured it out last night, and it might help someone else:
RecordReader in the old org.apache.hadoop.mapred API is a deprecated interface and doesn't actually declare an initialize method, which explains why it never gets called automatically.
Extending the RecordReader class from org.apache.hadoop.mapreduce instead does let you override its initialize method.
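For reference, a minimal sketch of the new-API skeleton, with the PDF-stripping logic omitted (only the Hadoop types are real; the empty method bodies are placeholders, not a working reader):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // New-API (org.apache.hadoop.mapreduce) skeleton: the framework calls
    // createRecordReader(), then initialize() on the reader it returns.
    public class PDFInputFormat extends FileInputFormat<Text, Text> {

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            return new PDFRecordReader();
        }

        public static class PDFRecordReader extends RecordReader<Text, Text> {

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                // called automatically by the framework before the first nextKeyValue()
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException { return false; }

            @Override
            public Text getCurrentKey() { return null; }

            @Override
            public Text getCurrentValue() { return null; }

            @Override
            public float getProgress() { return 0f; }

            @Override
            public void close() throws IOException { }
        }
    }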
System.out.println() may not help while the job is running. To check whether your initialize() is being called or not, try throwing a RuntimeException there, as below:
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext job) throws
IOException, InterruptedException
{
throw new NullPointerException("inside initialize()");
....
That will settle it definitively.
Related
I am using Spring Batch and have an ItemWriter as follows:
public class MyItemWriter implements ItemWriter<Fixing> {
private final FlatFileItemWriter<Fixing> writer;
private final FileSystemResource resource;
public MyItemWriter () {
this.writer = new FlatFileItemWriter<>();
this.resource = new FileSystemResource("target/output-teste.txt");
}
@Override
public void write(List<? extends Fixing> items) throws Exception {
this.writer.setResource(new FileSystemResource(resource.getFile()));
this.writer.setLineAggregator(new PassThroughLineAggregator<>());
this.writer.afterPropertiesSet();
this.writer.open(new ExecutionContext());
this.writer.write(items);
}
@AfterWrite
private void close() {
this.writer.close();
}
}
When I run my spring batch job, the items are written to file as:
Fixing{id='123456', source='TEST', startDate=null, endDate=null}
Fixing{id='1234567', source='TEST', startDate=null, endDate=null}
Fixing{id='1234568', source='TEST', startDate=null, endDate=null}
1/ How can I write just the data so that the values are comma separated and, where a value is null, it is not written? The target file should look like this:
123456,TEST
1234567,TEST
1234568,TEST
2/ Secondly, I am having an issue where the file only appears once I exit the Spring Boot application. I would like the file to be available as soon as all the items have been processed and written, without closing the Spring Boot application.
There are multiple options for writing the CSV file. Regarding the second question, flushing the writer will solve the issue.
https://howtodoinjava.com/spring-batch/flatfileitemwriter-write-to-csv-file/
We prefer to use OpenCSV with Spring Batch, as it gives us more speed and control over huge files. An example snippet is below:
class DocumentWriter implements ItemWriter<BaseDTO>, Closeable {
private static final Logger LOG = LoggerFactory.getLogger(StatementWriter.class);
private ColumnPositionMappingStrategy<Statement> strategy ;
private static final String[] columns = new String[] { "csvcolumn1", "csvcolumn2", "csvcolumn3",
"csvcolumn4", "csvcolumn5", "csvcolumn6", "csvcolumn7"};
private BufferedWriter writer;
private StatefulBeanToCsv<Statement> beanToCsv;
public DocumentWriter() throws Exception {
strategy = new ColumnPositionMappingStrategy<Statement>();
strategy.setType(Statement.class);
strategy.setColumnMapping(columns);
filename = env.getProperty("globys.statement.cdf.path")+"-"+processCount+".dat";
File cdf = new File(filename);
if(cdf.exists()){
writer = Files.newBufferedWriter(Paths.get(filename), StandardCharsets.UTF_8,StandardOpenOption.APPEND);
}else{
writer = Files.newBufferedWriter(Paths.get(filename), StandardCharsets.UTF_8,StandardOpenOption.CREATE_NEW);
}
beanToCsv = new StatefulBeanToCsvBuilder<Statement>(writer).withQuotechar(CSVWriter.NO_QUOTE_CHARACTER)
.withMappingStrategy(strategy).withSeparator(',').build();
}
@Override
public void write(List<? extends BaseDTO> items) throws Exception {
List<Statement> settlementList = new ArrayList<Statement>();
for (int i = 0; i < items.size(); i++) {
BaseDTO baseDTO = items.get(i);
settlementList.addAll(baseDTO.getStatementList());
}
beanToCsv.write(settlementList);
writer.flush();
}
@PreDestroy
@Override
public void close() throws IOException {
writer.close();
}
}
Since you are using PassThroughLineAggregator, which calls item.toString() to write the object, overriding the toString() method of the classes extending Fixing should fix it.
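If you go that route, a rough sketch of such a toString() (the field names and String types are assumptions based on the printed output):

    import java.util.Objects;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class Fixing {
        private String id;
        private String source;
        private String startDate;
        private String endDate;

        // Emit only the non-null fields, comma separated, e.g. "123456,TEST"
        @Override
        public String toString() {
            return Stream.of(id, source, startDate, endDate)
                    .filter(Objects::nonNull)
                    .collect(Collectors.joining(","));
        }
    }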
1/ How can I write just the data so that the values are comma separated and where it is null, it is not written.
You need to provide a custom LineAggregator that filters out null fields.
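A minimal sketch of such an aggregator, assuming Fixing exposes getId() and getSource() accessors (adjust the fields to your class):

    import java.util.Objects;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.springframework.batch.item.file.transform.LineAggregator;

    public class FixingLineAggregator implements LineAggregator<Fixing> {

        // Join the non-null fields with commas, e.g. "123456,TEST"
        @Override
        public String aggregate(Fixing item) {
            return Stream.of(item.getId(), item.getSource())
                    .filter(Objects::nonNull)
                    .collect(Collectors.joining(","));
        }
    }

You would then set it on the writer with writer.setLineAggregator(new FixingLineAggregator()) instead of the PassThroughLineAggregator.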
2/ Secondly, I am having an issue where only when I exit spring boot application, I am able to see the file get created
This is probably because you are calling this.writer.open in the write method, which is not correct. You need to make your item writer implement ItemStream and call this.writer.open and this.writer.close respectively in ItemStream#open and ItemStream#close.
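A rough sketch of that structure, reusing the question's names and the aggregator sketched above (a sketch under those assumptions, not a drop-in fix):

    import java.util.List;

    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.batch.item.ItemStream;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.file.FlatFileItemWriter;
    import org.springframework.core.io.FileSystemResource;

    public class MyItemWriter implements ItemWriter<Fixing>, ItemStream {

        private final FlatFileItemWriter<Fixing> writer = new FlatFileItemWriter<>();

        public MyItemWriter() {
            // configure once; the line aggregator must be set before open()
            writer.setResource(new FileSystemResource("target/output-teste.txt"));
            writer.setLineAggregator(new FixingLineAggregator());
        }

        @Override
        public void open(ExecutionContext executionContext) {
            writer.open(executionContext);   // open once, before the first chunk
        }

        @Override
        public void update(ExecutionContext executionContext) {
            writer.update(executionContext);
        }

        @Override
        public void close() {
            writer.close();                  // flushes and releases the file
        }

        @Override
        public void write(List<? extends Fixing> items) throws Exception {
            writer.write(items);
        }
    }

When this writer is registered as the step's item writer, Spring Batch drives open/update/close for you, so the file is written and closed when the step finishes rather than only when the JVM exits.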
I am trying to write a custom partitioner that allocates each unique key to a single reducer, after the default HashPartitioner failed.
An alternative to the default HashPartitioner provided with Hadoop.
I keep getting the following error. From what I can tell from doing some research, it has something to do with the constructor not receiving its arguments, but in this context, with Hadoop, aren't the arguments passed automatically by the framework? I can't find an error in the code.
18/04/20 17:06:51 INFO mapred.JobClient: Task Id : attempt_201804201340_0007_m_000000_1, Status : FAILED
java.lang.RuntimeException: java.lang.NoSuchMethodException: biA3pipepart$parti.<init>()
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:587)
This is my partitioner:
public class Parti extends Partitioner<Text, Text> {
    String partititonkey;
    int result = 0;

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String partitionKey = key.toString();
        if (numPartitions >= 9) {
            if (partitionKey.charAt(0) == '0') {
                if (partitionKey.charAt(2) == '0')
                    result = 0;
                else if (partitionKey.charAt(2) == '1')
                    result = 1;
                else
                    result = 2;
            } else if (partitionKey.charAt(0) == '1') {
                if (partitionKey.charAt(2) == '0')
                    result = 3;
                else if (partitionKey.charAt(2) == '1')
                    result = 4;
                else
                    result = 5;
            } else if (partitionKey.charAt(0) == '2') {
                if (partitionKey.charAt(2) == '0')
                    result = 6;
                else if (partitionKey.charAt(2) == '1')
                    result = 7;
                else
                    result = 8;
            }
        } else {
            result = 0;
        }
        return result;
    } // close method
} // close class
My mapper signature
public static class JoinsMap extends Mapper<LongWritable,Text,Text,Text>{
public void Map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
My reducer signature
public static class JoinsReduce extends Reducer<Text,Text,Text,Text>{
public void Reduce (Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
main class:
public static void main( String[] args ) throws Exception {
Configuration conf1 = new Configuration();
Job job1 = new Job(conf1, "biA3pipepart");
job1.setJarByClass(biA3pipepart.class);
job1.setNumReduceTasks(9);//***
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(Text.class);
job1.setMapperClass(JoinsMap.class);
job1.setReducerClass(JoinsReduce.class);
job1.setInputFormatClass(TextInputFormat.class);
job1.setOutputFormatClass(TextOutputFormat.class);
job1.setPartitionerClass(Parti.class); //+++
// inputs to map.
FileInputFormat.addInputPath(job1, new Path(args[0]));
// single output from reducer.
FileOutputFormat.setOutputPath(job1, new Path(args[1]));
job1.waitForCompletion(true);
}
keys emitted by Mapper are the following:
0,0
0,1
0,2
1,0
1,1
1,2
2,0
2,1
2,2
and the Reducer only writes keys and values it receives.
SOLVED
I just added static to my Parti class, like the mapper and reducer classes, as suggested in a comment (user238607).
public static class Parti extends Partitioner<Text, Text> {
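As an aside, given the fixed "row,column" key format listed above, the same nine-way mapping can be computed arithmetically; a sketch assuming keys always look like "0,0" through "2,2" (nested inside the job class like the mapper and reducer):

    public static class Parti extends Partitioner<Text, Text> {

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String k = key.toString();              // e.g. "2,1"
            int row = k.charAt(0) - '0';            // 0..2
            int col = k.charAt(2) - '0';            // 0..2
            return (row * 3 + col) % numPartitions; // 0..8 when 9 reducers are configured
        }
    }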
Getting a NullPointerException in the driver class's conf.getStrings() method. This driver class is invoked from my custom website.
Below are the driver class details:
@SuppressWarnings("unchecked")
public void doGet(HttpServletRequest request,
HttpServletResponse response)
throws ServletException, IOException
{
Configuration conf = new Configuration();
//conf.set("fs.default.name", "hdfs://localhost:54310");
//conf.set("mapred.job.tracker", "localhost:54311");
//conf.set("mapred.jar","/home/htcuser/Desktop/ResumeLatest.jar");
Job job = new Job(conf, "ResumeSearchClass");
job.setJarByClass(HelloForm.class);
job.setJobName("ResumeParse");
job.setInputFormatClass(FileInputFormat.class);
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:54310/usr/ResumeDirectory"));
job.setMapperClass(ResumeMapper.class);
job.setReducerClass(ResumeReducer.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setSortComparatorClass(ReverseComparator.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(FileOutPutFormat.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/usr/output" + System.currentTimeMillis()));
long start = System.currentTimeMillis();
var = job.waitForCompletion(true) ? 0 : 1;
I am getting the NullPointerException from the following two lines of code:
String[] keytextarray=conf.getStrings("Keytext");
for(int i=0;i<keytextarray.length;i++) //GETTING NULL POINTER EXCEPTION HERE IN keytextarray.length
{
//some code here
}
if(var==0)
{
RequestDispatcher dispatcher = request.getRequestDispatcher("/Result.jsp");
dispatcher.forward(request, response);
long finish= System.currentTimeMillis();
System.out.println("Time Taken "+(finish-start));
}
}
I have removed some unrelated code from the driver class method above.
Below is the RecordWriter class, where I use conf.setStrings() in the write() method to set the values:
public class RecordWrite extends org.apache.hadoop.mapreduce.RecordWriter<IntWritable, Text> {
TaskAttemptContext context1;
Configuration conf;
public RecordWrite(DataOutputStream output, TaskAttemptContext context)
{
out = output;
conf = context.getConfiguration();
HelloForm.context1=context;
try {
out.writeBytes("result:\n");
out.writeBytes("Name:\t\t\t\tExperience\t\t\t\t\tPriority\tPriorityCount\n");
} catch (IOException e) {
e.printStackTrace();
}
}
public RecordWrite() {
// TODO Auto-generated constructor stub
}
@Override
public void close(TaskAttemptContext context) throws IOException,
InterruptedException
{
out.close();
}
int z=0;
@Override
public void write(IntWritable value,Text key) throws IOException,
InterruptedException
{
conf.setStrings("Keytext", key1string); //setting values here
conf.setStrings("valtext", valuestring);
String[] keytext=key.toString().split(Pattern.quote("^"));
//some code here
}
}
I suspect this NullPointerException happens because I call conf.getStrings() after the job has completed (job.waitForCompletion(true)). Please help me fix this issue.
If the above code is not the correct way of passing values from the RecordWriter to the driver class, please let me know how to pass values from the RecordWriter to the driver class.
I have also tried setting the values from RecordWriter into a custom static class and accessing that object from the static class in the driver class; that again returns a NullPointerException if I run the code on a cluster.
If you already have the values of key1string and valuestring in the job class, try setting them there rather than in the RecordWriter.write() method. The Configuration is copied out to the task JVMs when the job is submitted, so anything a task sets on its own copy (as RecordWriter.write() does) never propagates back to the driver's Configuration object.
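A sketch of that direction, using the question's "Keytext" key (the sample values are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DriverConfigExample {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Values known before submission go into the Configuration here;
            // the framework ships a copy of this Configuration to every task.
            conf.setStrings("Keytext", "java", "hadoop", "spring");

            Job job = Job.getInstance(conf, "ResumeSearchClass");
            // ... setJarByClass, mapper/reducer, input/output paths, waitForCompletion ...

            // The driver can read back what the driver itself set:
            String[] keytext = job.getConfiguration().getStrings("Keytext");
            System.out.println(String.join(",", keytext));

            // Anything a task writes into its own copy of the Configuration
            // (e.g. conf.setStrings(...) inside RecordWriter.write) never
            // travels back to this object, which is why getStrings returned null.
        }
    }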
I've developed a custom InputFormat for Hadoop (including a custom InputSplit and a custom RecordReader) and I'm experiencing a strange NullPointerException.
These classes are going to be used for querying a third-party system which exposes a REST API for record retrieval. Thus, I took inspiration from DBInputFormat, which is a non-HDFS InputFormat as well.
The error I get is the following:
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:762)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
I've looked through the MapTask code (the 2.1.0 version of Hadoop) and the problematic part is the initialization of the RecordReader:
472 NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
473 org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
474 TaskReporter reporter,
475 org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
476 throws InterruptedException, IOException {
...
491 this.real = inputFormat.createRecordReader(split, taskContext);
...
494 }
...
519 @Override
520 public void initialize(org.apache.hadoop.mapreduce.InputSplit split,
521 org.apache.hadoop.mapreduce.TaskAttemptContext context
522 ) throws IOException, InterruptedException {
523 long bytesInPrev = getInputBytes(fsStats);
524 real.initialize(split, context);
525 long bytesInCurr = getInputBytes(fsStats);
526 fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
527 }
Of course, the relevant parts of my code:
// MyInputFormat.java
public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
public static void addResId(Job job, String resId) {
Configuration conf = job.getConfiguration();
String inputs = conf.get(INPUT_RES_IDS, "");
if (inputs.isEmpty()) {
inputs += resId;
} else {
inputs += "," + resId;
}
conf.set(INPUT_RES_IDS, inputs);
}
@Override
public List<InputSplit> getSplits(JobContext job) {
// resulting splits container
List<InputSplit> splits = new ArrayList<InputSplit>();
// get the Job configuration
Configuration conf = job.getConfiguration();
// get the inputs, i.e. the list of resource IDs
String input = conf.get(INPUT_RES_IDS, "");
String[] resIDs = StringUtils.split(input);
// iterate on the resIDs
for (String resID: resIDs) {
splits.addAll(getSplitsResId(resID, job.getConfiguration()));
}
// return the splits
return splits;
}
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
// MyRecordReader.java
@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
// get start, end and current positions
MyInputSplit inputSplit = (MyInputSplit) this.split;
start = inputSplit.getFirstRecordIndex();
end = start + inputSplit.getLength();
current = 0;
// query the third-party system for the related resource, seeking to the start of the split
records = backend.getRecords(inputSplit.getResId(), start, end);
}
// MapReduceTest.java
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapReduceTest(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = this.getConf();
Job job = Job.getInstance(conf, "MapReduce test");
job.setJarByClass(MapReduceTest.class);
job.setMapperClass(MyMap.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(MyInputFormat.class);
MyInputFormat.addInput(job, "ca73a799-9c71-4618-806e-7bd0ca1911f4");
InputFormat.setEnvironmnet(job, "my.host.com", "443", true, "my_api_key");
FileOutputFormat.setOutputPath(job, new Path(args[0]));
return job.waitForCompletion(true) ? 0 : 1;
}
Any ideas about what is wrong?
BTW, which is the "good" InputSplit the RecordReader must use, the one given to the constructor or the one given in the initialize method? Anyway I've tried both options and the resulting error is the same :)
The way I read your stack trace, real is null on line 524.
But don't take my word for it. Slip an assert or a System.out.println in there and check the value of real yourself.
NullPointerException almost always means you dotted off something you didn't expect to be null. Some libraries and collections will throw it at you as their way of saying "this can't be null".
Error: java.lang.NullPointerException at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:524)
To me this reads as: in the org.apache.hadoop.mapred package the MapTask class has an inner class NewTrackingRecordReader with an initialize method that threw a NullPointerException at line 524.
524 real.initialize( blah, blah) // I actually stopped reading after the dot
this.real was set on line 491.
491 this.real = inputFormat.createRecordReader(split, taskContext);
Assuming you haven't left out any more closely scoped reals that are masking this.real, we need to look at inputFormat.createRecordReader(split, taskContext). If that can return null, it might be the culprit.
Turns out it will return null when backend is null.
@Override
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context) {
if (backend == null) {
logger.info("Unable to create a MyRecordReader, " +
"it seems the environment was not properly set");
return null;
}
// create a record reader
return new MyRecordReader(backend, split, context);
}
It looks like setEnvironmnet is supposed to set backend
// MyInputFormat.java
public static void setEnvironmnet(
Job job,
String host,
String port,
boolean ssl,
String APIKey) {
backend = new Backend(host, port, ssl, APIKey);
}
backend must be declared somewhere outside setEnvironment (or you'd be getting a compiler error).
If backend hasn't been set to something non-null upon construction and setEnvironmnet was not called before createRecordReader then you should expect to get exactly the NullPointerException you got.
UPDATE:
As you've noted, since setEnvironmnet() is static backend must be static as well. This means that you must be sure other instances aren't setting it to null.
Solved. The problem was that the backend variable is declared static, i.e. it belongs to the Java class, and thus any other object changing that variable (e.g. setting it to null) affects all the other objects of the same class.
Now, setEnvironment adds the host, port, SSL usage and the API key to the configuration (the same as addResId already did with the resource ID); when createRecordReader is invoked, this configuration is read and the backend object is created.
Thanks to CandiedOrange, who put me on the right path!
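A rough sketch of that configuration-based pattern (the property names are made up; the method and class names mirror the question):

    // MyInputFormat.java (sketch)
    public static void setEnvironmnet(Job job, String host, String port, boolean ssl, String APIKey) {
        Configuration conf = job.getConfiguration();
        conf.set("myinputformat.host", host);
        conf.set("myinputformat.port", port);
        conf.setBoolean("myinputformat.ssl", ssl);
        conf.set("myinputformat.apikey", APIKey);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        Configuration conf = context.getConfiguration();
        // Build the backend here, in the task JVM, instead of relying on a static
        // field that was only ever set in the driver JVM.
        Backend backend = new Backend(conf.get("myinputformat.host"),
                                      conf.get("myinputformat.port"),
                                      conf.getBoolean("myinputformat.ssl", true),
                                      conf.get("myinputformat.apikey"));
        return new MyRecordReader(backend, split, context);
    }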
Which is the best way to read distributed cache data?
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
ArrayList<String> globalFreq = new ArrayList<String>();
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//Accessing "globalFreq" data .and do further processing
}
OR
public class TrailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
URI[] cacheFiles;
FileSystem fs;
public void setup(Context context) throws IOException{
Configuration conf = context.getConfiguration();
fs = FileSystem.get(conf);
cacheFiles = DistributedCache.getCacheFiles(conf);
}
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
ArrayList<String> globalFreq = new ArrayList<String>();
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
String setupData = null;
while ((setupData = bf.readLine()) != null) {
String [] parts = setupData.split(" ");
globalFreq.add(parts[0]);
}
}
So if we write it like code 2, does that mean that, say we have 5 map tasks, every map task reads the same copy of the data, and because the read happens inside map(), each task reads the data multiple times, am I right (5 times)?
Code 1: since the read is written in setup(), the data is read once per task and the global data is then accessed in map().
Which is the right way of using the distributed cache?
Do as much as you can in the setup method: it will be called once by each mapper, and whatever it builds stays in memory for every record passed to that mapper. Parsing your data for each record is overhead you can avoid, since nothing in that parsing depends on the key, value and context variables you receive in the map method.
The setup method will be called per map task, but map will be called for each record processed by that task (which can clearly be a very high number).
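For completeness, a sketch of the driver side that pairs with option 1 (the cache path is made up; DistributedCache is deprecated in newer releases in favour of job.addCacheFile() / context.getCacheFiles(), but the read-once-in-setup pattern is the same):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TrailDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "trail");
            job.setJarByClass(TrailDriver.class);
            job.setMapperClass(TrailMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Ship the lookup file to every task; read it ONCE per task in setup().
            job.addCacheFile(new URI("hdfs:///user/me/global-freq.txt"));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }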