Hadoop (1.1.2) XML processing & re-writing file - java

first question here... and learning hadoop...
I've spent the last 2 weeks trying to understand everything about hadoop, but it seems every hill has a mountain behind it.
Here's the setup:
Lots (1 million) of small (<50MB) XML files (Documents formatted into XML).
Each file is one record
Pseudo-distributed Hadoop cluster (1.1.2)
using old mapred API (can change, if new API supports what's needed)
I have found XmlInputFormat ("Mahout XMLInputFormat") as a good starting point for reading files, as I can specify the entire XML document as
My understanding is that XmlInputFormat will take care of ensuring each file is its own record (as 1 tag exists per file/record).
My issue is this: I want to use Hadoop to process every document, search for information, and then, for each file/record, re-write or output a new XML document with a new XML tag added.
Not afraid of reading and learning, but a skeleton to play with would really help me 'play' and learn hadoop
here is my driver:
public static void main(String[] args) {
    JobConf conf = new JobConf(myDriver.class);
    conf.setJobName("bigjob");
    // Input/Output Directories
    if (args[0].length()==0 || args[1].length()==0) System.exit(-1);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.set("xmlinput.start", "<document>");
    conf.set("xmlinput.end", "</document>");
    // Mapper & Combiner & Reducer
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(0);
    // Input/Output Types
    conf.setInputFormat(XmlInputFormat.class);
    conf.setOutputFormat(?????);
    conf.setOutputKeyClass(????);
    conf.setOutputValueClass(????);
    try {
        JobClient.runJob(conf);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

I would say a simple solution would be to use TextOutputFormat and then use Text as the output key and NullWritable as the output value.
TextOutputFormat uses a delimiting character to separate the key and value pairs you output from your job. For your requirement, you don't need this arrangement, but you'd just like to output a single body of XML. If you pass null or NullWritable as the output key or value, TextOutputFormat will not write the null or the delimiter, just the non-null key or value.
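To make the null handling concrete, here is a small plain-Java imitation of how one output line is laid out. It is not Hadoop code: the class and method names (TextLineSketch, formatLine) are made up for illustration, plain Strings stand in for Writables, and a Java null stands in for both null and NullWritable. The tab separator is TextOutputFormat's default.

```java
// Simplified imitation of how TextOutputFormat lays out one output line.
// Not Hadoop code: Strings stand in for Writables, and null stands in
// for both null and NullWritable.
final class TextLineSketch {
    // TextOutputFormat separates key and value with a tab by default.
    static final String SEPARATOR = "\t";

    static String formatLine(String key, String value) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        if (nullKey && nullValue) {
            return ""; // nothing is written for this record
        }
        StringBuilder line = new StringBuilder();
        if (!nullKey) {
            line.append(key);
        }
        if (!nullKey && !nullValue) {
            line.append(SEPARATOR); // separator only when both sides are present
        }
        if (!nullValue) {
            line.append(value);
        }
        return line.append('\n').toString();
    }
}
```

So with the XML as the key and a null value, only the XML body and a trailing newline end up in the output file.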
An alternative to XmlInputFormat would be a WholeFileInputFormat (as detailed in Tom White's Hadoop: The Definitive Guide).
Either way, you'll need to write your mapper to consume the input value Text object (perhaps with an XML SAX or DOM parser) and then output the transformed XML as a Text object.
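As a rough skeleton for the transformation step, something like the following could sit inside the mapper. It uses only the JDK's DOM and Transformer classes, so it can be tried outside Hadoop first. The names XmlTagAdder and addTag, and the tag name passed in, are hypothetical. Inside map() you would presumably call it with value.toString() and collect the result as a new Text with a NullWritable on the other side.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: parse one XML record, add a tag, serialize it again.
final class XmlTagAdder {
    static String addTag(String xml, String tagName, String text) {
        try {
            // Parse the whole record into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Append <tagName>text</tagName> as the last child of the root element.
            Element added = doc.createElement(tagName);
            added.setTextContent(text);
            doc.getDocumentElement().appendChild(added);

            // Serialize the modified tree back to a String, without an XML declaration.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not transform XML record", e);
        }
    }
}
```

DOM keeps the whole document in memory, which should be acceptable for files under 50MB; for much larger records a SAX or StAX approach would be the safer choice.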

Related

Reading xml file in Flink

I am trying to use Flink to read XML files from a local file system and sync them to S3.
I need to parse a tag inside each XML file and use its value to send the file to the respective folder in S3.
For ex: my file contains folder1 .... xxx
I need to read the value from and send it to /folder1
I was able to read the file content and sync it to S3, but the content arrived line by line rather than as a whole file.
I used TextInputFormat as suggested in
NFS (Netapp server)-> Flink ->s3
I have tried different formats such as DelimitedInputFormat, but without success. I searched on Google but couldn't find any solution. Isn't this supported?
Is there a way to read the entire file, or at least the value between tags?
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// monitor directory, checking for new files
// every 100 milliseconds
TextInputFormat format = new TextInputFormat(
    new org.apache.flink.core.fs.Path("file:///tmp/dir/"));
DataStream<String> inputStream = env.readFile(
    format,
    "file:///tmp/dir/",
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    100,
    FilePathFilter.createDefaultFilter());
First off, I assume that this is for a batch (DataSet) workflow. I typically handle this by creating a list of file paths as the input to the workflow, using a custom source that handles splitting these up for parallelism. Then I've got a MapFunction that takes the file path as input, opens/reads the XML file and parses it, and sends the interesting extracted data bits downstream.
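The body of such a MapFunction could look roughly like this. It is a sketch using only JDK classes and assumes the files sit on a file system the task can read directly. XmlFieldExtractor and extractTag are hypothetical names, and the tag name is whatever element holds your target folder. In Flink you would call it from map(String path) and emit the extracted value together with the path.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: open one XML file and pull out the text of one tag.
final class XmlFieldExtractor {
    /** Returns the trimmed text of the first element named tagName, or null if there is none. */
    static String extractTag(Path file, String tagName) {
        try (InputStream in = Files.newInputStream(file)) {
            // Parse the entire file as one document instead of line by line.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            NodeList matches = doc.getElementsByTagName(tagName);
            if (matches.getLength() == 0) {
                return null;
            }
            return matches.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse " + file, e);
        }
    }
}
```

Because each file is opened and parsed inside a single call, the whole document is available at once, which avoids the line-by-line behavior of TextInputFormat.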
The other approach is to use one of several Hadoop XmlInputFormat implementations that are out there (e.g. this one that is part of Mahout). There's a bit of work required to use a HadoopInputFormat with Flink, but it's doable. E.g. something like (untested!!!):
Job job = Job.getInstance();
FileInputFormat.addInputPath(job, new Path(inputDir));
HadoopInputFormat<LongWritable, Text> inputFormat = HadoopInputs.createHadoopInput(new XmlInputFormat(), LongWritable.class, Text.class, job);
Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);
inputFormat.configure(parameters);
...
env.createInput(inputFormat);

How does Apache Flink parallelize reading of a CSV file

I am using readCsvFile(path) function in Apache Flink api to read a CSV file and store it in a list variable. How does it work using multiple threads?
For example, is it splitting the file based on some statistics? if yes, what statistics? Or does it read the file line by line and then send the lines to threads to process them?
Here is the sample code:
// default parallelism is 4
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
String csvPath = "data/weather.csv";
List<Tuple2<String, Double>> csv = env.readCsvFile(csvPath)
    .types(String.class, Double.class)
    .collect();
Suppose that we have an 800 MB CSV file on local disk. How does it distribute the work between those 4 threads?
The readCsvFile() API method internally creates a data source with a CsvInputFormat which is based on Flink's FileInputFormat. This InputFormat generates a list of so-called InputSplits. An InputSplit defines which range of a file should be scanned. The splits are then distributed to data source tasks.
So, each parallel task scans a certain region of a file and parses its content. This is very similar to how it is done by MapReduce / Hadoop.
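To illustrate the idea of splits as byte ranges, here is a deliberately simplified sketch that divides a file evenly. It is not Flink's implementation, which as far as I know also takes block sizes and a minimum split size into account. SplitPlanner, Split and plan are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of input splits as byte ranges of one file.
final class SplitPlanner {
    /** Half-open byte range [start, start + length) of the file. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Divides fileSize bytes into at most numSplits contiguous ranges of near-equal size. */
    static List<Split> plan(long fileSize, int numSplits) {
        List<Split> splits = new ArrayList<>();
        long base = fileSize / numSplits;
        long remainder = fileSize % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            // The first 'remainder' splits get one extra byte each.
            long length = base + (i < remainder ? 1 : 0);
            if (length > 0) {
                splits.add(new Split(start, length));
            }
            start += length;
        }
        return splits;
    }
}
```

With an 800 MB file and 4 splits, each task would scan one quarter of the byte range. The ranges are cut without looking at line boundaries, which is why the reader has to deal with records that straddle two splits.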
This is the same issue as in "How does Hadoop process records split across block boundaries?"
I extracted the following code from DelimitedInputFormat in flink-release-1.1.3.
// else ..
int toRead;
if (this.splitLength > 0) {
    // if we have more data, read that
    toRead = this.splitLength > this.readBuffer.length ? this.readBuffer.length : (int) this.splitLength;
}
else {
    // if we have exhausted our split, we need to complete the current record, or read one
    // more across the next split.
    // the reason is that the next split will skip over the beginning until it finds the first
    // delimiter, discarding it as an incomplete chunk of data that belongs to the last record in the
    // previous split.
    toRead = this.readBuffer.length;
    this.overLimit = true;
}
It is clear that if the reader does not find a line delimiter within its own split, it reads on into the next split to find one. (I haven't found the corresponding code yet; I will keep looking.)
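The convention described in that comment can be sketched in memory as follows. This is my own simplified version and not Flink's buffered implementation: SplitLineReader and readSplit are hypothetical names, the "file" is a byte array, and the delimiter is assumed to be a single newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of reading newline-delimited records from one split.
final class SplitLineReader {
    /** Returns the records owned by the split [start, start + length) of data. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(data.length, start + length);
        int pos = start;
        if (start > 0) {
            // Skip up to and including the first delimiter: whatever precedes it
            // is read by the previous split.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        // A record belongs to this split if it starts at or before 'end',
        // so the last record may be read past the split boundary.
        while (pos <= end && pos < data.length) {
            int recordEnd = pos;
            while (recordEnd < data.length && data[recordEnd] != '\n') {
                recordEnd++;
            }
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = recordEnd + 1;
        }
        return records;
    }
}
```

Because every split except the first discards its leading chunk and every split reads on past its end to finish the last record, each line is returned by exactly one split, wherever the boundaries happen to fall.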
Plus: the image below shows how I traced the code from readCsvFile() to DelimitedInputFormat.

Hadoop InputFormat set Key to Input File Path

My hadoop job needs to be aware of the input path that each record is derived from.
For example assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to reduce key value pairs such as
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?
I know that in a mapper this information is accessible from the MapContext (How to get the input file name in the mapper in a Hadoop program?) but I am using Apache Crunch and I cannot control whether any of my steps will be Maps or Reduces, however I can reliably control the InputFormat so it seemed to me to be the place to do this.
Please have a look at my blog article to customize inputsplit and recordreader.
The code in that blog sets key as below (Line 69-70 of recordreader code)
value = new Text(line);
key = new LongWritable(splitstart);
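In your case you need to set the key as below (I didn't test it, though). Note that the record reader's key type then has to be Text instead of LongWritable.
key = new Text(fsplit.getPath().toString());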

XML File loses its format after reading and writing in Java

I'm writing a program in Java that reads an XML file, makes some modifications, and then writes the file back in the same format.
The following is the code block that reads and writes the XML file:
final Document fileDocument = parseFileAsDocument(file);
final OutputFormat format = new OutputFormat(fileDocument);
try {
    final FileWriter out = new FileWriter(file);
    final XMLSerializer serializer = new XMLSerializer(out,format);
    serializer.serialize(fileDocument);
}
catch (final IOException e) {
    System.out.println(e.getMessage());
}
This is the method used to parse the file:
private Document parseFileAsDocument(final File file) {
    Document inputDocument = null;
    try {
        inputDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
    }//catching some exceptions{}
    return inputDocument;
}
I'm noticing two changes after the file is written:
Before I had a node similar to this:
<instance ref='filter'>
<value></value>
</instance>
After reading and writing, the node looks like this:
<instance ref="filter">
<value/>
</instance>
As you can see from above, the 'filter' has been changed to "filter" with double quote.
The second change is <value></value> has been changed to <value/>. This change happens across the XML file whenever we have a node similar to <tag></tag> with no value in between. So if we have something like <tag>somevalue</tag>, there is no issue.
Any thoughts on how to keep the format of the XML nodes the same after writing?
I'd appreciate it!
You can't, and you shouldn't try. It's a bit like complaining that when you add 0123 and 0234, you get 357 without the leading zeroes. Leading zeroes in integers aren't considered significant, so arithmetic operations don't preserve them. The same happens to insignificant details of your XML, like the distinction between double quotes and single quotes, and the distinction between a self-closing tag and a start/end tag pair for an empty element. If any consumer of the XML is depending on these details, they need to be sent for retraining.
The most usual reason for asking for lexical details to be preserved is that you want to detect changes. But this means you are doing your comparisons the wrong way: you should be comparing at the logical level, not the physical level. One way to do comparisons is to canonicalize the XML, so whenever there is an arbitrary choice to be made between equivalent representations, it is made the same way.
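As a rough illustration of comparing at the logical level, the two forms from the question can be parsed and compared as DOM trees. This is a sketch using the JDK's DOM isEqualNode, not full XML canonicalization, and XmlLogicalCompare and sameContent are made-up names. It ignores quote style and empty-element syntax, but whitespace differences between elements would still count as differences.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: compare two XML strings by their parsed content.
final class XmlLogicalCompare {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** True if both strings parse to equal element trees (names, attributes, text, children). */
    static boolean sameContent(String a, String b) {
        Document left = parse(a);
        Document right = parse(b);
        // Merge adjacent text nodes so the comparison is not affected by how text was chunked.
        left.normalize();
        right.normalize();
        return left.getDocumentElement().isEqualNode(right.getDocumentElement());
    }

    public static void main(String[] args) {
        String before = "<instance ref='filter'><value></value></instance>";
        String after = "<instance ref=\"filter\"><value/></instance>";
        System.out.println(sameContent(before, after));
    }
}
```

The two spellings of the node produce the same tree once parsed, so a comparison done this way does not report the rewritten file as changed.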

in MultipleOutputs - avoid my key to be written in the files

Hi, I'm using Hadoop MapReduce with MultipleOutputs. Below is my code:
mos = new MultipleOutputs(context);
mos.write(key, value, propertyName.trim());
But it generates multiple files with the suffix -m-0000. How can I eliminate it?
Also, I don't want to print my key in the file. How can I avoid my key being written to the files?
Look into using LazyOutputFormat - it won't create the default output files if nothing is written via context.write:
// Registers LazyOutputFormat as the job's output format, wrapping TextOutputFormat.
// This can be any file based output format
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
