Hadoop InputFormat set Key to Input File Path - java

My Hadoop job needs to know the input path that each record is derived from.
For example, assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to reduce over key-value pairs such as:
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?
I know that in a mapper this information is accessible from the MapContext (How to get the input file name in the mapper in a Hadoop program?). However, I am using Apache Crunch and cannot control whether any of my steps will be maps or reduces. I can reliably control the InputFormat, so it seemed to me to be the place to do this.

Please have a look at my blog article on customizing the InputSplit and RecordReader.
The code in that article sets the key and value as below (lines 69-70 of the RecordReader code):
value = new Text(line);
key = new LongWritable(splitstart);
In your case you need to set the key to the path of the split instead. I didn't test it, and it assumes the key is declared as Text rather than LongWritable:
key = new Text(fsplit.getPath().toString());

Related

Reading xml file in Flink

I am trying to use Flink to read XML files from a local file system and sync them to S3.
I need to parse a tag inside each XML file and use its value to send the file to the respective folder in S3.
For ex: my file contains folder1 .... xxx
I need to read the value from and send it to /folder1
I was able to read the file content and sync it to S3, but the content arrives line by line.
I used TextInputFormat as suggested in
NFS (Netapp server)-> Flink ->s3
I have tried other formats such as DelimitedInputFormat, but without success. I searched through Google but couldn't find any solution. Isn't this something that is supported?
Is there a way to read the entire file, or at least the value between tags?
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
TextInputFormat format = new TextInputFormat(
    new org.apache.flink.core.fs.Path("file:///tmp/dir/"));
// monitor directory, checking for new files
// every 100 milliseconds
DataStream<String> inputStream = env.readFile(
    format,
    "file:///tmp/dir/",
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    100,
    FilePathFilter.createDefaultFilter());
First off, I assume that this is for a batch (DataSet) workflow. I typically handle this by creating a list of file paths as the input to the workflow, using a custom source that handles splitting these up for parallelism. Then I've got a MapFunction that takes the file path as input, opens/reads the XML file and parses it, and sends the interesting extracted data bits downstream.
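As a rough sketch of the parsing step inside such a MapFunction, the snippet below uses only the JDK's DOM parser to pull the text of one tag out of an XML document. The class name XmlTagExtractor, the method firstTagValue and the tag name folder are placeholders made up for illustration, and the Flink wiring (custom source, MapFunction) is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: the class, method and tag names are illustrative only.
class XmlTagExtractor {

    // Returns the trimmed text of the first <tagName> element, or null if the tag is absent.
    static String firstTagValue(InputStream xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(xml);
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Parser configuration, I/O and malformed-XML errors all end up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample document; a real file would be opened from its path instead.
        String sample = "<doc><folder>folder1</folder><body>xxx</body></doc>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println("/" + firstTagValue(in, "folder")); // prints "/folder1"
    }
}
```

Inside the MapFunction you would open the file behind the incoming path, pass its stream to a helper like this, and emit the extracted value together with the file content.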
The other approach is to use one of several Hadoop XmlInputFormat implementations that are out there (e.g. this one that is part of Mahout). There's a bit of work required to use a HadoopInputFormat with Flink, but it's doable. E.g. something like (untested!!!):
Job job = Job.getInstance();
FileInputFormat.addInputPath(job, new Path(inputDir));
HadoopInputFormat<LongWritable, Text> inputFormat = HadoopInputs.createHadoopInput(new XmlInputFormat(), LongWritable.class, Text.class, job);
Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);
inputFormat.configure(parameters);
...
env.createInput(inputFormat);

Use option p of weka filter (RemoveType) in java code

I am using the Weka API in my Java code and have a dataset with a string ID attribute to keep track of instances. Weka mentions in this page that there is an option p that prints the ID of each instance in the prediction results even if the attribute has been removed. But how can this be done in Java code, since none of the options listed for the RemoveType filter is p?
Thank you
The p option on the Weka page you mentioned is a parameter that you can set through the classes in the package weka.classifiers.evaluation.output.prediction.
With these classes you can control what goes into the prediction output, e.g. the output distribution, the attribute indices (p) that you want in the output, the number of decimal places in prediction probabilities, etc.
You can use any of the classes below, depending on the output format you want:
PlainText
HTML
XML
CSV
Setting the parameters through code:
Evaluation eval = new Evaluation(data);
StringBuffer forPredictionsPrinting = new StringBuffer();
PlainText classifierOutput = new PlainText();
classifierOutput.setBuffer(forPredictionsPrinting);
// Also print the class distribution for each prediction
classifierOutput.setOutputDistribution(true);
// Counterpart of the p option: range of attributes to print.
// "1" assumes the ID is the first attribute of the dataset.
classifierOutput.setAttributes("1");
You can find detailed usage of this class at
https://www.programcreek.com/java-api-examples/?api=weka.classifiers.evaluation.output.prediction.PlainText

Java OpenCSV - 2 List comparison and duplication

I am going to write an application that compares two .csv lists using OpenCSV. It should work like this:
Open 2 .csv files (each file has the columns Name, Emails)
Save the results (and here is the problem: I don't know whether they should be saved to a table or something else)
Compare the values of the "Emails" column in List1 and List2.
If an email from List1 appears in List2, delete it (from List1)
Export the results to a new .csv file
I don't know if this is a good algorithm. Please tell me which way of storing the results of reading the .csv files is best in this case.
Kind Regards
You can get around this more easily with univocity-parsers as it can read your data into columns:
CsvParserSettings parserSettings = new CsvParserSettings(); //parser config with many options, check the tutorial
parserSettings.setHeaderExtractionEnabled(true); // uses the first row as headers
// To get the values of all columns, use a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
parserSettings.setRowProcessor(rowProcessor);
CsvParser parser = new CsvParser(parserSettings);
//This will parse everything and pass the data to the column processor
parser.parse(new FileReader(new File("/path/to/your/file.csv")));
//Finally, we can get the column values:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
Let's say you parsed the second CSV with that. Just grab the emails and create a set:
Set<String> emails = new HashSet<>(columnValues.get("Emails")); // header name as given in the question
Now just iterate over the first CSV and check if the emails are in the emails set.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
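The set-based comparison described above can be sketched with plain JDK collections, independent of which CSV library did the parsing. This is only an illustration: the EmailFilter class and removeKnownEmails method are hypothetical names, and each row is assumed to be a String[] of {name, email}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; rows are modelled as String[]{name, email} for illustration.
class EmailFilter {

    // Keeps only the rows of list1 whose email (column 1) does not occur in list2Emails.
    static List<String[]> removeKnownEmails(List<String[]> list1, List<String> list2Emails) {
        Set<String> known = new HashSet<>(list2Emails); // constant-time lookups on average
        List<String[]> kept = new ArrayList<>();
        for (String[] row : list1) {
            if (!known.contains(row[1])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> list1 = Arrays.asList(
                new String[] {"Ann", "ann@example.com"},
                new String[] {"Bob", "bob@example.com"});
        List<String> list2Emails = Arrays.asList("bob@example.com");
        for (String[] row : removeKnownEmails(list1, list2Emails)) {
            System.out.println(row[0] + "," + row[1]); // prints "Ann,ann@example.com"
        }
    }
}
```

The comparison here is exact and case-sensitive. If the two files may differ in case or surrounding whitespace, normalize the emails (for example with trim() and toLowerCase()) before adding them to the set and before each lookup.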
If you have a hard requirement to use openCSV, then here is what I believe is the easiest solution:
First off, I like Jeronimo's suggestion about the HashSet. Read the second CSV file first using CsvToBean and save the email addresses in a HashSet.
Then create a filter class that implements the CsvToBeanFilter interface. Pass the set into its constructor, and in the allowLine method look up the email address and return true if it is not in the set (so you have a quick lookup).
Then pass the filter to CsvToBean.parse when reading the first file, and all you will get are the records from the first file whose email addresses are not in the second file. The CsvToBeanFilter javadoc has a good example that shows how this works.
Lastly, use BeanToCsv to create a file from the filtered list.
In the interest of fairness: I am the maintainer of the openCSV project, and it is also open source and free (Apache V2.0 license).

in MultipleOutputs - avoid my key to be written in the files

Hi, I'm using Hadoop MapReduce with MultipleOutputs. Below is my code:
mos = new MultipleOutputs(context);
mos.write(key, value, propertyName.trim());
But it generates multiple files with the suffix -m-0000. How can I eliminate it?
Also, I don't want my key printed in the file. How can I avoid writing the key to the files?
Look into using LazyOutputFormat - it won't create the default output files if nothing is written via context.write:
// Registers LazyOutputFormat as the job's output format and wraps the real one.
// The wrapped format can be any file based output format.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Hadoop (1.1.2) XML processing & re-writing file

First question here... and learning Hadoop...
I've spent the last 2 weeks trying to understand everything about Hadoop, but it seems every hill has a mountain behind it.
Here's the setup:
Lots (1 million) of small (<50MB) XML files (documents formatted as XML).
Each file is a single record
Pseudo-distributed Hadoop cluster (1.1.2)
Using the old mapred API (can change, if the new API supports what's needed)
I have found XmlInputFormat ("Mahout XMLInputFormat") as a good starting point for reading files, as I can specify the entire XML document as
My understanding is that XmlInputFormat will take care of ensuring each file is its own record (as 1 tag exists per file/record).
My issue is this: I want to use Hadoop to process every document, search for information, and then, for each file/record, rewrite or output a new XML document with a new XML tag added.
I'm not afraid of reading and learning, but a skeleton to play with would really help me 'play' and learn Hadoop.
Here is my driver:
public static void main(String[] args) {
    JobConf conf = new JobConf(myDriver.class);
    conf.setJobName("bigjob");
    // Input/Output Directories
    if (args[0].length()==0 || args[1].length()==0) System.exit(-1);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.set("xmlinput.start", "<document>");
    conf.set("xmlinput.end", "</document>");
    // Mapper & Combiner & Reducer
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(0);
    // Input/Output Types
    conf.setInputFormat(XmlInputFormat.class);
    conf.setOutputFormat(?????);
    conf.setOutputKeyClass(????);
    conf.setOutputValueClass(????);
    try {
        JobClient.runJob(conf);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I would say a simple solution would be to use TextOutputFormat and then use Text as the output key and NullWritable as the output value.
TextOutputFormat uses a delimiting character to separate the key and value pairs you output from your job. For your requirement, you don't need this arrangement, but you'd just like to output a single body of XML. If you pass null or NullWritable as the output key or value, TextOutputFormat will not write the null or the delimiter, just the non-null key or value.
An alternative to XmlInputFormat would be a WholeFileInputFormat (as detailed in Tom White's Hadoop: The Definitive Guide).
Either way, you'll need to write your mapper to consume the input value Text object (maybe with an XML SAX or DOM parser) and then output the transformed XML as a Text object.
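As a starting point for the transformation inside the mapper, here is a minimal JDK-only sketch that parses an XML string with DOM, appends a new tag to the root element and serializes the result again. The class XmlTagAdder, the method addTag and the tag name found are made-up names for illustration; the Hadoop mapper around it is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: class, method and tag names are illustrative only.
class XmlTagAdder {

    // Parses xml, appends <tagName>text</tagName> to the root element, returns the new XML.
    static String addTag(String xml, String tagName, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element added = doc.createElement(tagName);
            added.setTextContent(text);
            doc.getDocumentElement().appendChild(added);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so only the document body is emitted.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Parsing and serialization failures are reported as one unchecked exception.
            throw new IllegalArgumentException("Could not transform XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; in the mapper this would come from value.toString().
        System.out.println(addTag("<document><title>a</title></document>", "found", "yes"));
    }
}
```

In the mapper you would call something like this on value.toString() and emit the result as a Text key with a NullWritable value, so that TextOutputFormat writes only the XML.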
