Hi, I'm using Hadoop MapReduce with MultipleOutputs. Below is my code:
mos = new MultipleOutputs(context);
mos.write(key, value, propertyName.trim());
But it generates multiple files with the suffix -m-00000. How can I eliminate it?
Also, I don't want my key printed in the file. How can I avoid the key being written to the output files?
Look into using LazyOutputFormat - it won't create the default output files if nothing is written via context.write:
// This can be any file based output format
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
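For completeness, here's a minimal, untested sketch of how the pieces could fit together, assuming a reducer-side MultipleOutputs (the same pattern works in a mapper). Writing NullWritable as the key keeps the key out of the output files; note that Hadoop still appends a task suffix such as -r-00000 to the base name:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MultipleOutputsSketch {

    public static class MyReducer extends Reducer<Text, Text, NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // NullWritable key => only the value lands in the file.
                // The third argument is the base file name; Hadoop still
                // appends the task suffix (e.g. -r-00000) to it.
                mos.write(NullWritable.get(), value, key.toString().trim());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    public static void configureJob(Job job) {
        // Wrap the real output format so empty default part files are not created.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
    }
}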
I am trying to use Flink to set up a process that reads XML files from a LocalFileSystem and syncs them to S3.
I need to parse a tag inside each XML file and use it to send the file to the respective folder in S3.
For example, my file contains a tag whose value is folder1 (among other content); I need to read that value and send the file to /folder1.
I was able to read the file content and sync it to S3, but the content came through line by line.
I used TextInputFormat as suggested in NFS (Netapp server) -> Flink -> S3.
I have tried different formats like DelimitedInputFormat etc. but without success. I searched through Google but couldn't find any solution. Isn't this something that's supported?
Is there a way to read the entire file, or at least the value between tags?
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

// monitor the directory, checking for new files every 100 milliseconds
TextInputFormat format = new TextInputFormat(
        new org.apache.flink.core.fs.Path("file:///tmp/dir/"));

DataStream<String> inputStream = env.readFile(
        format,
        "file:///tmp/dir/",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        100,
        FilePathFilter.createDefaultFilter());
First off, I assume that this is for a batch (DataSet) workflow. I typically handle this by creating a list of file paths as the input to the workflow, using a custom source that handles splitting these up for parallelism. Then I've got a MapFunction that takes the file path as input, opens/reads the XML file and parses it, and sends the interesting extracted data bits downstream.
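For illustration, here's a minimal, untested sketch of that approach. It assumes the file paths arrive as a DataSet<String> (the fromElements source and the class names here are stand-ins; in practice the paths would come from your own custom, parallel source), and simply reading the whole file into a String stands in for the XML parsing step:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class WholeFileBatchSketch {

    // Reads the whole file the incoming path points to and emits it as one String.
    public static class WholeFileReader extends RichMapFunction<String, String> {
        @Override
        public String map(String pathString) throws Exception {
            Path path = new Path(pathString);
            FileSystem fs = path.getFileSystem();
            try (FSDataInputStream in = fs.open(path);
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return out.toString(StandardCharsets.UTF_8.name());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a custom, parallel source of file paths.
        DataSet<String> filePaths =
                env.fromElements("file:///tmp/dir/one.xml", "file:///tmp/dir/two.xml");

        // One whole XML document per element; parse/extract downstream as needed.
        DataSet<String> xmlDocuments = filePaths.map(new WholeFileReader());

        xmlDocuments.print();
    }
}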
The other approach is to use one of several Hadoop XmlInputFormat implementations that are out there (e.g. this one that is part of Mahout). There's a bit of work required to use a HadoopInputFormat with Flink, but it's doable. E.g. something like (untested!!!):
Job job = Job.getInstance();
FileInputFormat.addInputPath(job, new Path(inputDir));

HadoopInputFormat<LongWritable, Text> inputFormat =
        HadoopInputs.createHadoopInput(new XmlInputFormat(), LongWritable.class, Text.class, job);

Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);
inputFormat.configure(parameters);
...
env.createInput(inputFormat);
I am using the OpenCSV library in Java to export a CSV, but I have a problem. When I use a string that begins with zero, such as 0123456, the export drops the 0 and my CSV shows 123456. The zero is missing. I tried this:
"\"\t"+"0123456"+ "\"";
but then the exported CSV shows "0123456", which I don't want either. I want 0123456. I don't want to fix it in Excel because some end users don't know how to edit it. How can I export a CSV with OpenCSV and keep the leading 0? Please help.
I think the problem is not really in generating the CSV but in how Excel treats the data when the file is opened from Explorer.
I tried this code and viewed the CSV in a text editor (not Excel); notice that it shows up correctly, though when opened in Excel the leading 0s are lost:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "0123131#21212#021213".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
//don't apply quotes
writer.writeAll(a,false);
writer.close();
If you really want the leading 0s of numeric values to show when the file is opened in Excel, then each cell entry should be in the form ="dataHere"; see the code below:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "=\"0123131\"#=\"21212\"#=\"021213\"".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
writer.writeAll(a);
writer.close();
This is how Excel now shows the data when the file is opened from Windows Explorer (double clicking): the leading 0s are preserved. But if we view the CSV in a text editor, the data has been modified to "suit" Excel viewing: each field appears wrapped as ="...".
Also see: format-number-as-text-in-csv-when-open-in-both-excel-and-notepad
Have you tried using a String like this: "'" + "0123456"? The ' character marks the number as text when Excel parses it.
For me OpenCSV works correctly (version 5.6).
For example, my CSV file has a row like the following:
"999739059";;;"abcdefgh";"001024";
and OpenCSV reads the field "001024" as 001024 correctly. Of course, I have mapped the field to a String, not to a Double.
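As a side note, here is a minimal sketch (not from the original answer) of what that String mapping might look like with OpenCSV's bean binding; the Row class and the column position are made up to match the sample line above:

import java.io.StringReader;
import java.util.List;

import com.opencsv.bean.CsvBindByPosition;
import com.opencsv.bean.CsvToBeanBuilder;

public class LeadingZeroReadSketch {

    public static class Row {
        // Binding the column to a String keeps "001024" intact;
        // binding it to a Double would drop the leading zeros.
        @CsvBindByPosition(position = 4)
        private String code;

        public String getCode() {
            return code;
        }
    }

    public static void main(String[] args) {
        String csv = "\"999739059\";;;\"abcdefgh\";\"001024\";\n";

        List<Row> rows = new CsvToBeanBuilder<Row>(new StringReader(csv))
                .withType(Row.class)
                .withSeparator(';')
                .build()
                .parse();

        System.out.println(rows.get(0).getCode()); // prints 001024
    }
}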
But if you still have problems, you can grab a simple yet powerful parser that fully adheres to the RFC 4180 standard:
mykong.com
It shows some examples using OpenCSV directly and, at the end, provides a simple parser you can use if you don't want to import OpenCSV. The parser works very well, and since its source code is easy to understand, you can modify it to suit your needs.
My Hadoop job needs to be aware of the input path that each record is derived from.
For example, assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to produce key/value pairs such as
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?
I know that in a mapper this information is accessible from the MapContext (How to get the input file name in the mapper in a Hadoop program?), but I am using Apache Crunch and cannot control whether any of my steps will be maps or reduces. However, I can reliably control the InputFormat, so it seemed like the right place to do this.
Please have a look at my blog article on customizing the InputSplit and RecordReader.
The code in that blog sets the key and value as below (lines 69-70 of the RecordReader code):
value = new Text(line);
key = new LongWritable(splitstart);
In your case you need to set the key as below (I didn't test it, though):
key = new Text(fsplit.getPath().toString()); // the key field's type must change from LongWritable to Text
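If it helps, here is a minimal, untested sketch of the same idea against the new mapreduce API: a FileInputFormat whose record key is the source file path. The class names are made up, and the value side just delegates to LineRecordReader:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class PathKeyedTextInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new PathKeyedRecordReader();
    }

    // Delegates line reading to LineRecordReader but reports the file path as the key.
    public static class PathKeyedRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private final Text key = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            key.set(((FileSplit) split).getPath().toString());
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}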
I am trying to write the name of a file into Accumulo. I am using accumulo-core 1.4.3.
For some reason, certain files seem to be written into Accumulo with trailing \x00 characters at the end of the name. The upload comes through a Java servlet (using the jQuery File Upload plugin). In the servlet, I check the name of the file with a System.out.println and it looks normal, and I even tried unescaping the string with
org.apache.commons.lang.StringEscapeUtils.unescapeJava(...);
The actual write to Accumulo looks like this:
Mutation mut = new Mutation(new Text(checkSum));
Value val = new Value(new Text(filename).getBytes());
long timestamp = System.currentTimeMillis();
mut.put(new Text(colFam), new Text(EMPTY_BYTES), timestamp, val);
but nothing unusual showed up there (perhaps \x00 isn't escaped?). But then if I do a scan on my table in Accumulo, there are one or more \x00 characters in the file name.
The problem this seems to cause: I return that string within XML when I retrieve a list of files (where the \x00 shows up) and pass that back to the browser, and the XSL that is supposed to render the information in the XML no longer works when these extra characters are present (I'm not sure why that is, either).
In Chrome, in the response for these calls, I see three red dots after the file name, and when I hover over them, \u0 pops up (which I think is a different representation of 0/null?).
Anyway, I'm just trying to figure out why this happens, or at the very least how I can filter out the \x00 characters before returning the file name in Java. Any ideas?
You are likely using the Hadoop Text class incorrectly -- this is not a bug in Accumulo. Specifically, the mistake is in this line from your example:
Value val = new Value(new Text(filename).getBytes());
You must adhere to the length reported by the Text class: getBytes() returns the backing byte array, which can be longer than the actual data. See the Text javadoc for more information. If you're using Hadoop 2.2.0, you can use the provided copyBytes method on Text. If you're on an older version of Hadoop where this method doesn't yet exist, you can use something like the ByteBuffer class or the System.arraycopy method to get a copy of the byte[] with the proper limits enforced.
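For reference, a small sketch (not from the original answer) of what respecting the length looks like when building the Accumulo Value; the helper name is made up:

import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class ValueFromText {
    // Text.getBytes() returns the backing array, which can be longer than the
    // actual data and padded with \x00 bytes, so copy only getLength() bytes.
    static Value valueFor(String filename) {
        Text text = new Text(filename);
        // On Hadoop 2.2.0+, this could simply be: return new Value(text.copyBytes());
        byte[] trimmed = new byte[text.getLength()];
        System.arraycopy(text.getBytes(), 0, trimmed, 0, text.getLength());
        return new Value(trimmed);
    }
}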
First question here... and learning Hadoop...
I've spent the last two weeks trying to understand everything about Hadoop, but it seems every hill has a mountain behind it.
Here's the setup:
Lots (1 million) of small (<50MB) XML files (documents formatted as XML).
Each file is a single record.
Pseudo-distributed Hadoop cluster (1.1.2).
Using the old mapred API (can change, if the new API supports what's needed).
I have found XmlInputFormat ("Mahout XMLInputFormat") to be a good starting point for reading files, as I can specify start and end tags so the entire XML document is treated as one record.
My understanding is that XmlInputFormat will take care of ensuring each file is its own record (as one tag pair exists per file/record).
My issue is this: I want to use Hadoop to process every document, search for information, and then, for each file/record, re-write or output a new XML document with a new XML tag added.
I'm not afraid of reading and learning, but a skeleton to play with would really help me 'play' and learn Hadoop.
Here is my driver:
public static void main(String[] args) {
    JobConf conf = new JobConf(myDriver.class);
    conf.setJobName("bigjob");

    // Input/Output Directories
    if (args.length < 2 || args[0].isEmpty() || args[1].isEmpty()) System.exit(-1);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.set("xmlinput.start", "<document>");
    conf.set("xmlinput.end", "</document>");

    // Mapper & Combiner & Reducer
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(0);

    // Input/Output Types
    conf.setInputFormat(XmlInputFormat.class);
    conf.setOutputFormat(?????);
    conf.setOutputKeyClass(????);
    conf.setOutputValueClass(????);

    try {
        JobClient.runJob(conf);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I would say a simple solution would be to use TextOutputFormat and then use Text as the output key and NullWritable as the output value.
TextOutputFormat uses a delimiting character to separate the key and value pairs you output from your job. For your requirement you don't need this arrangement; you just want to output a single body of XML. If you pass null or NullWritable as the output key or value, TextOutputFormat will not write the null or the delimiter, just the non-null key or value.
Another approach, instead of XmlInputFormat, would be to use a WholeFileInputFormat (as detailed in Tom White's Hadoop: The Definitive Guide).
Either way, you'll need to write your mapper to consume the input value Text object (maybe with an XML SAX or DOM parser) and then output the transformed XML as a Text object.
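For illustration, a minimal, untested sketch of such a mapper using the old mapred API (as in the question); the class name and the "transformation" it applies are made up:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XmlRewriteMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        // XmlInputFormat hands the mapper one <document>...</document> block as the value.
        String xml = value.toString();

        // Hypothetical rewrite: add a new tag before the closing element.
        String rewritten = xml.replace("</document>",
                "  <processed>true</processed>\n</document>");

        // With NullWritable as the value, TextOutputFormat writes only the key,
        // so the output file contains just the rewritten XML.
        output.collect(new Text(rewritten), NullWritable.get());
    }
}

In the driver, the question marks would then become TextOutputFormat.class, Text.class, and NullWritable.class respectively.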