Reading XML files in Flink - Java

I am trying to use Flink to set up a process that reads XML files from a LocalFileSystem and syncs them to S3.
I need to parse a tag inside each XML file and use its value to send the file to the corresponding folder in S3.
For example: a file contains a tag whose value is folder1, along with other content.
I need to read that value and send the file to /folder1.
I was able to read the file content and sync it to S3, but the content was coming through line by line.
I used TextInputFormat as suggested in
NFS (Netapp server)-> Flink ->s3
I have tried different formats like DelimitedInputFormat etc. but without success. I searched through Google but couldn't find any solution. Isn't this something that is supported?
Is there a way to read an entire file, or at least the value between the tags?
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// monitor directory, checking for new files
// every 100 milliseconds
TextInputFormat format = new TextInputFormat(
        new org.apache.flink.core.fs.Path("file:///tmp/dir/"));

DataStream<String> inputStream = env.readFile(
        format,
        "file:///tmp/dir/",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        100,
        FilePathFilter.createDefaultFilter());

First off, I assume that this is for a batch (DataSet) workflow. I typically handle this by creating a list of file paths as the input to the workflow, using a custom source that handles splitting these up for parallelism. Then I've got a MapFunction that takes the file path as input, opens/reads the XML file and parses it, and sends the interesting extracted data bits downstream.
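A minimal sketch of that first approach (my own illustration, not the answerer's actual code): it uses fromElements in place of the custom splitting source, and a placeholder <folder> tag stands in for whatever element holds the routing value.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// In a real workflow this would be a custom source that splits the path list
// across parallel instances; fromElements keeps the sketch short.
DataSet<String> paths = env.fromElements(
        "file:///tmp/dir/a.xml",
        "file:///tmp/dir/b.xml");

DataSet<Tuple2<String, String>> folderAndXml = paths.map(
        new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String path) throws Exception {
                // Read the whole file in one go
                String xml = new String(
                        Files.readAllBytes(Paths.get(URI.create(path))),
                        StandardCharsets.UTF_8);
                // Extract the routing value; <folder> is a placeholder tag name
                Matcher m = Pattern.compile("<folder>(.*?)</folder>", Pattern.DOTALL)
                        .matcher(xml);
                String folder = m.find() ? m.group(1).trim() : "unknown";
                return new Tuple2<>(folder, xml);
            }
        });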
The other approach is to use one of several Hadoop XmlInputFormat implementations that are out there (e.g. this one that is part of Mahout). There's a bit of work required to use a HadoopInputFormat with Flink, but it's doable. E.g. something like (untested!!!):
Job job = Job.getInstance();
FileInputFormat.addInputPath(job, new Path(inputDir));

HadoopInputFormat<LongWritable, Text> inputFormat =
        HadoopInputs.createHadoopInput(new XmlInputFormat(), LongWritable.class, Text.class, job);

Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);
inputFormat.configure(parameters);
...
env.createInput(inputFormat);
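As a rough follow-up (again my own sketch, not from the answer): the resulting DataSet contains (LongWritable, Text) tuples, where the Text value holds one XML record that you can then parse in a map.
DataSet<Tuple2<LongWritable, Text>> xmlRecords = env.createInput(inputFormat);

DataSet<String> documents = xmlRecords.map(
        new MapFunction<Tuple2<LongWritable, Text>, String>() {
            @Override
            public String map(Tuple2<LongWritable, Text> record) {
                // record.f1 is the raw XML snippet produced by XmlInputFormat
                return record.f1.toString();
            }
        });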

Related

How does Apache Flink parallelize reading of a CSV file

I am using the readCsvFile(path) function in the Apache Flink API to read a CSV file and store it in a list variable. How does it work using multiple threads?
For example, is it splitting the file based on some statistics? If yes, what statistics? Or does it read the file line by line and then send the lines to threads to process them?
Here is the sample code:
// default parallelism is 4
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
String csvPath = "data/weather.csv";
List<Tuple2<String, Double>> csv = env.readCsvFile(csvPath)
        .types(String.class, Double.class)
        .collect();
Suppose we have an 800 MB CSV file on local disk; how does it distribute the work between those 4 threads?
The readCsvFile() API method internally creates a data source with a CsvInputFormat which is based on Flink's FileInputFormat. This InputFormat generates a list of so-called InputSplits. An InputSplit defines which range of a file should be scanned. The splits are then distributed to data source tasks.
So, each parallel task scans a certain region of a file and parses its content. This is very similar to how it is done by MapReduce / Hadoop.
This is the same as How does Hadoop process records split across block boundaries?
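To make the split generation concrete, here is a small illustration (my own sketch, not part of the original answer; it uses TextInputFormat as a stand-in, since CsvInputFormat shares the same FileInputFormat split logic). It asks the format for its splits directly and prints the byte range each parallel task would scan.
// org.apache.flink.api.java.io.TextInputFormat and org.apache.flink.configuration.Configuration
TextInputFormat format = new TextInputFormat(
        new org.apache.flink.core.fs.Path("data/weather.csv"));
format.configure(new Configuration());

// One FileInputSplit per byte range; with parallelism 4 these are handed to 4 source tasks
FileInputSplit[] splits = format.createInputSplits(4);
for (FileInputSplit split : splits) {
    System.out.println(split.getPath() + "  start=" + split.getStart()
            + "  length=" + split.getLength());
}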
Here is an excerpt from the DelimitedInputFormat class in flink-release-1.1.3:
// else ..
int toRead;
if (this.splitLength > 0) {
    // if we have more data, read that
    toRead = this.splitLength > this.readBuffer.length ? this.readBuffer.length : (int) this.splitLength;
}
else {
    // if we have exhausted our split, we need to complete the current record, or read one
    // more across the next split.
    // the reason is that the next split will skip over the beginning until it finds the first
    // delimiter, discarding it as an incomplete chunk of data that belongs to the last record in the
    // previous split.
    toRead = this.readBuffer.length;
    this.overLimit = true;
}
It's clear that if the reader doesn't hit a line delimiter within its own split, it keeps reading into the next split to complete the record. (I haven't found the corresponding code for the skipping side yet; I will keep looking.)
Plus: I found this code by following the call chain from readCsvFile() down to DelimitedInputFormat.

saveAsTextFile() to write the final RDD as a single text file - Apache Spark

I am working on a batch application using Apache Spark. I want to write the final RDD as a text file; currently I am using the saveAsTextFile("filePath") method available on RDD.
My text file contains fields delimited with the \u0001 delimiter, so in the model class's toString() method I added all the fields separated by the \u0001 delimiter.
Is this the correct way to handle this, or is there a better approach?
Also, what if I iterate over the RDD and write the file content using the FileWriter class available in Java?
Please advise on this.
Regards,
Shankar
To write as a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you can collect the data to the driver and then use a file writer.
public static boolean copyMerge(JavaSparkContext sc, JavaRDD<String> rdd, String dstPath,
                                String awsAccessKey, String awsSecretKey)
        throws IOException, URISyntaxException {
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);

    // Step 1: write the RDD as multiple part files into a temporary folder
    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder);

    // Step 2: merge the part files into a single file at dstPath
    FileSystem fs = FileSystem.get(new URI(tempFolder), hadoopConf);
    return FileUtil.copyMerge(fs, new Path(tempFolder), fs, new Path(dstPath),
            false, hadoopConf, null);
}
This solution works for S3 or any HDFS-compatible file system, and is achieved in two steps:
Save the RDD with saveAsTextFile; this generates multiple part files in the folder.
Run Hadoop's copyMerge to combine them into a single file.
Instead of doing a collect and bringing everything to the driver, I would rather suggest using coalesce, which helps avoid memory problems on the driver.
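A minimal sketch of the coalesce variant (my own example; the input and output paths are placeholders):
SparkConf conf = new SparkConf().setAppName("single-file-output");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("data/input");  // placeholder input path
lines.coalesce(1)                    // everything moves into one partition, so it must fit on a single worker
     .saveAsTextFile("data/output-single");  // still a directory, but containing a single part file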

Hadoop InputFormat set Key to Input File Path

My Hadoop job needs to be aware of the input path that each record is derived from.
For example assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to reduce over key-value pairs such as
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?
I know that in a mapper this information is accessible from the MapContext (How to get the input file name in the mapper in a Hadoop program?), but I am using Apache Crunch and I cannot control whether any of my steps will be maps or reduces. However, I can reliably control the InputFormat, so it seemed to me to be the right place to do this.
Please have a look at my blog article on customizing the InputSplit and RecordReader.
The code in that blog sets the key and value as below (lines 69-70 of the RecordReader code):
value = new Text(line);
key = new LongWritable(splitstart);
In your case you would need to set the key to the file path instead (which also means changing the key type from LongWritable to Text); I didn't test it though:
key = new Text(fsplit.getPath().toString());
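For completeness, here is a rough sketch of that idea (untested, my own illustration rather than the blog's code): an InputFormat whose RecordReader delegates to the standard LineRecordReader but reports the input file path as the key.
public class PathKeyInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new PathKeyRecordReader();
    }

    public static class PathKeyRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private Text key;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            // The split knows which file it came from, e.g. s3://bucket/file1
            key = new Text(((FileSplit) split).getPath().toString());
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}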

MultipleOutputs - avoid my key being written to the files

Hi, I'm using Hadoop MapReduce with MultipleOutputs. Below is my code:
mos = new MultipleOutputs(context);
mos.write(key, value, propertyName.trim());
But it generates multiple files with the suffix -m-0000. How can I eliminate it?
Also, I don't want my key printed in the file. How can I avoid the key being written to the files?
Look into using LazyOutputFormat - it won't create the default output files if nothing is written via context.write:
// This can be any file-based output format; LazyOutputFormat wraps it and only
// creates output files when something is actually written
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
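For the second part of the question (keeping the key out of the file): TextOutputFormat skips a null/NullWritable key and the separator, so one option is to write NullWritable as the key. A rough, untested sketch, assuming the new-API MultipleOutputs and that the job's output key class is set accordingly:
// In the driver: make NullWritable the output key type
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);

// In the mapper/reducer: write NullWritable so only the value lands in the file
mos = new MultipleOutputs<NullWritable, Text>(context);
mos.write(NullWritable.get(), value, propertyName.trim());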

Hadoop (1.1.2) XML processing & re-writing file

First question here... and learning Hadoop...
I've spent the last 2 weeks trying to understand everything about Hadoop, but it seems every hill has a mountain behind it.
Here's the setup:
Lots (1 million) of small (<50 MB) XML files (documents formatted as XML).
Each file is a single record (one top-level XML element per file).
Pseudo-distributed Hadoop cluster (1.1.2)
Using the old mapred API (can change, if the new API supports what's needed)
I have found XmlInputFormat ("Mahout XMLInputFormat") to be a good starting point for reading files, as I can specify the entire XML document as a single record.
My understanding is that XmlInputFormat will take care of ensuring each file is its own record (as one start/end tag pair exists per file/record).
My issue is this: I want to use Hadoop to process every document, search for information, and then, for each file/record, rewrite or output a new XML document with a new XML tag added.
I am not afraid of reading and learning, but a skeleton to play with would really help me 'play' and learn Hadoop.
Here is my driver:
public static void main(String[] args) {
    JobConf conf = new JobConf(myDriver.class);
    conf.setJobName("bigjob");

    // Input/Output Directories
    if (args[0].length() == 0 || args[1].length() == 0) System.exit(-1);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.set("xmlinput.start", "<document>");
    conf.set("xmlinput.end", "</document>");

    // Mapper & Combiner & Reducer
    conf.setMapperClass(Mapper.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(0);

    // Input/Output Types
    conf.setInputFormat(XmlInputFormat.class);
    conf.setOutputFormat(?????);
    conf.setOutputKeyClass(????);
    conf.setOutputValueClass(????);

    try {
        JobClient.runJob(conf);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I would say a simple solution would be to use TextOutputFormat and then use Text as the output key and NullWritable as the output value.
TextOutputFormat uses a delimiting character to separate the key and value pairs you output from your job. For your requirement, you don't need this arrangement, but you'd just like to output a single body of XML. If you pass null or NullWritable as the output key or value, TextOutputFormat will not write the null or the delimiter, just the non-null key or value.
An alternative to using XmlInputFormat would be a WholeFileInputFormat (as detailed in Tom White's Hadoop: The Definitive Guide).
Either way, you'll need to write your mapper to consume the input value Text object (maybe with an XML SAX or DOM parser) and then output the transformed XML as a Text object.
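A rough skeleton for such a mapper (untested, my own sketch; it assumes XmlInputFormat delivers (LongWritable, Text) pairs in the old mapred API, and the <processed> tag plus the string-based rewrite are placeholders - a SAX or DOM parser would be the robust choice):
public static class XmlRewriteMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    public void map(LongWritable offset, Text xmlRecord,
                    OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        // xmlRecord holds one whole <document>...</document>, per the xmlinput.start/end settings
        String xml = xmlRecord.toString();
        // Placeholder rewrite: inject a new tag just before the closing element
        String rewritten = xml.replace("</document>",
                "  <processed>true</processed>\n</document>");
        // Text key + NullWritable value, so TextOutputFormat writes only the XML
        output.collect(new Text(rewritten), NullWritable.get());
    }
}

// and in the driver, one way to fill in the placeholders:
// conf.setMapperClass(XmlRewriteMapper.class);
// conf.setOutputFormat(TextOutputFormat.class);
// conf.setOutputKeyClass(Text.class);
// conf.setOutputValueClass(NullWritable.class);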
