This is a basic question about mapreduce outputs.
I'm trying to create a map function that takes in an XML file and makes a PDF using Apache FOP. However, I'm a little confused as to how to output it, since I know that it goes out as a (key, value) pair.
I'm also not using streaming to do this.
The point of map-reduce is to tackle large amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
Input and output must be specified in key-value format:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
I have not tried this, but this is what I would do:
Have the mapper emit output in this form: the key is the filename as Text (keep the filenames unique) and the value is the output of FOP. Write it using TextOutputFormat.
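A rough sketch of what such a mapper could look like (untested; it assumes FOP 2.x, the newer org.apache.hadoop.mapreduce API, and that each map input value carries one complete XSL-FO document, e.g. via a whole-file input format; the class name and key-naming scheme are illustrative). It carries the PDF bytes as a BytesWritable so nothing gets mangled by text encoding; how to persist them afterwards is a separate question, discussed below:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.fop.apps.FOPException;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlToPdfMapper extends Mapper<LongWritable, Text, Text, BytesWritable> {

    private FopFactory fopFactory;

    @Override
    protected void setup(Context context) {
        // FOP 2.x factory; the base URI resolves relative resources such as fonts and images
        fopFactory = FopFactory.newInstance(new File(".").toURI());
    }

    @Override
    protected void map(LongWritable offset, Text xslFo, Context context)
            throws IOException, InterruptedException {
        ByteArrayOutputStream pdf = new ByteArrayOutputStream();
        try {
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, pdf);
            // identity transform: assumes the input is already XSL-FO (add an XSLT step otherwise)
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new StreamSource(new StringReader(xslFo.toString())),
                                  new SAXResult(fop.getDefaultHandler()));
        } catch (TransformerException | FOPException e) {
            throw new IOException("FOP rendering failed", e);
        }
        // key: a unique output file name, value: the rendered PDF bytes
        context.write(new Text("doc-" + offset.get() + ".pdf"),
                      new BytesWritable(pdf.toByteArray()));
    }
}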
Suggestion:
I am assuming that your use case is just reading input XML (maybe doing some operation on its data) and writing the data to PDF files using FOP. I don't think this is a Hadoop use case in the first place, because whatever you want to do can be done by a batch script. How big are your XML files? How many XML files do you have to process?
EDIT:
SequenceFileOutputFormat will write to a SequenceFile. A SequenceFile has its own headers and other metadata along with the text that is stored. Also, it stores data in the form of key/value pairs.
SequenceFile Common Header
version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.
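These header fields can be inspected programmatically (a small sketch, assuming the Hadoop 2.x SequenceFile.Reader options API; the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class InspectSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(
                conf, SequenceFile.Reader.file(new Path("/tmp/output.seq")))) {
            System.out.println("keyClassName    = " + reader.getKeyClassName());
            System.out.println("valueClassName  = " + reader.getValueClassName());
            System.out.println("compressed      = " + reader.isCompressed());
            System.out.println("blockCompressed = " + reader.isBlockCompressed());
            System.out.println("metadata        = " + reader.getMetadata());
        }
    }
}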
Using a SequenceFile will ruin your application, as you will end up with corrupted output PDF files. Try it out and see for yourself.
You have lots of input files, and this is where Hadoop sucks (read this). Still, I feel that you can do your desired operation using a script that invokes FOP on every document one by one. If you have multiple nodes, run the same script on different subsets of the input documents. Trust me, this will run FASTER than Hadoop, considering the overhead involved in creating maps and reduces (you don't need reduces, I know).
Related
We are running Spark Java in local mode on a single AWS EC2 instance using
"local[*]"
However, profiling using New Relic tools and a simple 'top' show that only one CPU core of our 16-core machine is ever in use for the three different Java Spark jobs we've written (we've also tried different AWS instances, but only one core is ever used).
Runtime.getRuntime().availableProcessors() reports 16 processors and
sparkContext.defaultParallelism() reports 16 as well.
I've looked at various Stack Overflow local-mode questions, but none seem to have resolved the issue.
Any advice much appreciated.
Thanks
EDIT: Process
1) Use sqlContext to read gzipped CSV file 1 using com.databricks.spark.csv from disk (S3) into DataFrame DF1.
2) Use sqlContext to read gzipped CSV file 2 using com.databricks.spark.csv from disk (S3) into DataFrame DF2.
3) Use DF1.toJavaRDD().mapToPair(new mapping function that returns a Tuple2 of (key, List<value>)) to get RDD1
4) Use DF2.toJavaRDD().mapToPair(new mapping function that returns a Tuple2 of (key, List<value>)) to get RDD2
5) Call union on the RDDs
6) Call reduceByKey() on the unioned RDDs to "merge by key", so we have a Tuple2 of (key, List<value>) with only one instance of a particular key (as the same key appears in both RDD1 and RDD2).
7) Call .values().map(new mapping Function which iterates over all the items in the provided List and merges them as required to return a List of the same or smaller length).
8) Call .flatMap() to get an RDD<DomainClass>.
9) Use sqlContext to create a DataFrame from the flat map of type DomainClass
10) Use DF.coalesce(1).write() to write the DF as gzipped CSV to S3.
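In code, the shape of steps 3) to 8) is roughly the following (a sketch only: Row and the mergeItems placeholder stand in for our actual domain classes and merge logic, and the lambdas assume the Spark 1.x Java API, where flatMap returns an Iterable):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

import scala.Tuple2;

public class MergePipeline {

    // Placeholder for the real per-key merge in step 7.
    static List<Row> mergeItems(List<Row> items) {
        return items;
    }

    static JavaRDD<Row> merge(DataFrame df1, DataFrame df2) {
        JavaPairRDD<String, List<Row>> rdd1 = df1.toJavaRDD()                 // step 3
                .mapToPair(row -> new Tuple2<>(row.getString(0), Collections.singletonList(row)));
        JavaPairRDD<String, List<Row>> rdd2 = df2.toJavaRDD()                 // step 4
                .mapToPair(row -> new Tuple2<>(row.getString(0), Collections.singletonList(row)));

        return rdd1.union(rdd2)                                               // step 5
                .reduceByKey((a, b) -> {                                      // step 6
                    List<Row> all = new ArrayList<>(a);
                    all.addAll(b);
                    return all;
                })
                .values()
                .map(MergePipeline::mergeItems)                               // step 7
                .flatMap(list -> list);                                       // step 8 (Spark 1.x: Iterable)
    }
}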
I think your problem is that your CSV files are gzipped. When Spark reads files, it loads them in parallel, but it can only do this if the file codec is splittable*. Plain (non-gzipped) text and parquet are splittable, as well as the bgzip codec used in genomics (my field). Your entire files are ending up in one partition each.
Try decompressing the csv.gz files and running this again. I think you'll see much better results!
* Splittable formats mean that if you are given an arbitrary file offset at which to start reading, you can find the beginning of the next record in your block and interpret it. Gzipped files are not splittable.
Edit: I replicated this behavior on my machine. Using sc.textFile on a 3G gzipped text file produced 1 partition.
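For what it's worth, the partition count is easy to check from code (a sketch, assuming Spark 1.6+ where JavaRDD exposes getNumPartitions(); the S3 path is illustrative, and repartitioning only spreads the data after the single-threaded decompression has already happened):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("partition-check").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("s3n://my-bucket/big-file.csv.gz");
            System.out.println("partitions = " + lines.getNumPartitions());   // 1 for a gzipped file

            // repartition spreads the already-loaded data across all cores
            JavaRDD<String> spread = lines.repartition(sc.defaultParallelism());
            System.out.println("partitions = " + spread.getNumPartitions());
        }
    }
}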
Background:
See this question: Parsing XmlInputFormat element larger than hdfs block size and its answer.
Suppose I want to parse Wikipedia's pages-articles bz2 compressed XML dumps using Hadoop using Mahout's XMLInputFormat. My input can be of two types:
XXwiki-latest-pages-articles-multistream-index.txt.bz2
XXwiki-latest-pages-articles.xml.bz2
The first one is a splittable compressed type, and by default (given BZip2Codec is enabled) it will be split, and multiple mappers will process the decompressed BZ2 splits in parallel.
The second type is directly bz2 compressed, and therefore not splittable, I guess? The default getSplits implementation that XMLInputFormat uses is FileInputFormat.getSplits, which I presume makes vanilla splits that have no knowledge of my XML schema. I think this might impact performance.
Question:
Does XMLInputFormat work with unsplittable bz2 compressed files, that is, raw text directly bzipped and not multistream bz2 (concatenated)? By work I mean work in parallel, with multiple mappers, rather than feeding everything into a single mapper.
Will it be beneficial to me if I implement a custom getSplits method like https://github.com/whym/wikihadoop#splitting? Also see https://github.com/whym/wikihadoop/blob/master/src/main/java/org/wikimedia/wikihadoop/StreamWikiDumpInputFormat.java#L146 - this InputFormat uses a custom getSplits implementation. Why?
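As a reference point for the first question: whether Hadoop's input machinery considers a codec splittable can be checked directly, since in Hadoop 2.x TextInputFormat (which Mahout's XMLInputFormat extends) bases its isSplitable() decision on exactly this test (a small sketch, using the dump file name above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(
                new Path("XXwiki-latest-pages-articles.xml.bz2"));
        // BZip2Codec implements SplittableCompressionCodec in Hadoop 2.x
        System.out.println(codec instanceof SplittableCompressionCodec);
    }
}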
There is a very large image (~200MB) in HDFS (block size 64MB). I want to know the following:
How do I read the image in a MapReduce job?
Many topics suggest WholeInputFormat. Is there any other alternative, and how do I do it?
When WholeInputFormat is used, will there be any parallel processing of the blocks? I guess no.
If your block size is 64 MB, HDFS has most probably split your image file into chunks and replicated them across the cluster, depending on your cluster configuration.
Assuming that you want to process your image file as one record rather than multiple blocks/line by line, here are a few options I can think of to process the image file as a whole.
You can implement a custom input format and a record reader. The isSplitable() method in the input format should return false. The RecordReader.next(LongWritable pos, RecType val) method should read the entire file and set val to the file contents. This will ensure that the entire file goes to one map task as a single record.
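A sketch of that first approach, using the newer org.apache.hadoop.mapreduce API (the description above uses the older mapred RecordReader.next() signature); the class names follow the common WholeFileInputFormat pattern and are illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // hand the whole file to a single mapper
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file as one (NullWritable, BytesWritable) record.
    public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public NullWritable getCurrentKey()    { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress()             { return processed ? 1.0f : 0.0f; }
        @Override public void close()                    { }
    }
}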
You can sub-class the input format and override the isSplitable() method so that it returns false. This example shows how to create a sub-class of SequenceFileInputFormat to implement a NonSplittableSequenceFileInputFormat.
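That sub-class amounts to just this (again the newer mapreduce API; the old mapred API has an equivalent protected isSplitable(FileSystem, Path) to override):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class NonSplittableSequenceFileInputFormat<K, V> extends SequenceFileInputFormat<K, V> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split, so each sequence file goes to one mapper whole
    }
}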
I guess it depends on what type of processing you want to perform. If you are trying to do something that can be done by first splitting the big input into smaller image files, then processing those blocks independently, and finally stitching the output parts back into the large final output, then it may be possible. I'm no image expert, but suppose you want to turn a color image into grayscale: you could cut the large image into small images, convert them in parallel using MR, and, once the mappers are done, stitch them back into one large grayscale image.
If you understand the format of the image, you can write your own RecordReader to help the framework understand the record boundaries and prevent corruption when they are fed to the mappers.
Although you can use WholeFileInputFormat or SequenceFileInputFormat or something custom to read the image file, the actual issue (in my view) is to draw something out of the read file. OK, you have read the file, now what? How are you going to process your image to detect any object inside your mapper? I'm not saying it's impossible, but it would require a lot of work.
IMHO, you are better off using something like HIPI. HIPI provides an API for performing image processing tasks on top of MapReduce framework.
Edit:
If you really want to do it your way, then you need to write a custom InputFormat. Since images are not like text files, you can't use delimiters like \n for split creation. One possible workaround could be to create splits based on some given number of bytes. For example, if your image file is 200MB, you could write an InputFormat which creates splits of 100MB (or whatever you give as a parameter in your job configuration). I faced such a scenario long ago while dealing with some binary files, and this project helped me a lot.
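If a fully custom getSplits() turns out to be unnecessary, note that the stock FileInputFormat already produces byte-range splits capped by a configurable maximum size, so 100MB splits can be requested from the driver (a sketch; a custom RecordReader is still needed to make sense of each byte range of the image):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "image-chunks");
        // cap each input split at 100 MB; the input format's isSplitable() must return true for this to apply
        FileInputFormat.setMaxInputSplitSize(job, 100L * 1024 * 1024);
        // ... set mapper, input/output paths, etc., then job.waitForCompletion(true)
    }
}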
HTH
I have a Map<String, Object>. Here the Object (the value in the map) could be either a String or a Map<String, String>.
So if I split up the Map, it could be something like this:
Map<String, Map<String, String>> and Map<String, String>
I want to write this Map to HDFS as a (key, value) pair in a SequenceFile. I want this Map to be the value in the SequenceFile, and hence I need to make it Writable. I have written one class, but it gives me issues when I write multiple records: while reading them back, values from adjacent records get mixed up.
Please suggest a solution to this problem, or sample code to make this nested Map Writable.
You aren't required to make your map Writable; you can just serialize it to bytes with your serialization framework of choice, like Java serialization or protobuf, and write the bytes to your sequence file.
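A minimal sketch of that approach with plain Java serialization (assuming the Hadoop 2.x SequenceFile.Writer options API; the path and contents are illustrative). The nested map goes out as an opaque BytesWritable, so nothing from one record can bleed into the next:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class NestedMapToSequenceFile {
    public static void main(String[] args) throws Exception {
        Map<String, Object> data = new HashMap<>();
        data.put("plain", "a string value");
        Map<String, String> nested = new HashMap<>();
        nested.put("inner-key", "inner-value");
        data.put("nested", nested);

        // serialize the nested map to bytes (HashMap and String are both Serializable)
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(data);
        }

        Configuration conf = new Configuration();
        Path path = new Path("/tmp/records.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            writer.append(new Text("record-1"), new BytesWritable(bytes.toByteArray()));
        }

        // reading it back: deserialize exactly getLength() bytes,
        // since BytesWritable's backing array may be padded
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                ObjectInputStream in = new ObjectInputStream(
                        new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));
                @SuppressWarnings("unchecked")
                Map<String, Object> restored = (Map<String, Object>) in.readObject();
                System.out.println(key + " -> " + restored);
            }
        }
    }
}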
I would like to write a hadoop application which takes as input a file and an input folder which contains several files. The single file contains keys whose records need to be selected and extracted out of the other files in the folder. How can I achieve this?
By the way, I have a running hadoop mapreduce application which takes as input a path to a folder, does the processing and writes out the result into a different folder.
I am kind of stuck on how to use a file to get the keys that need to be selected and extracted out of the other files in a specific directory. The file containing the keys is big, so it cannot fit into main memory directly. How can I do it?
Thx!
If the number of keys is too large to fit in memory, then consider loading the key set into a bloom filter (of suitable size to yield a low false positive rate) and then process the files, checking each key for membership in the bloom filter (Hadoop comes with a BloomFilter class, check the Javadocs).
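A small sketch of the Hadoop BloomFilter usage (the sizing follows the standard m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2 formulas; the expected key count is illustrative):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class KeyFilterDemo {
    public static void main(String[] args) {
        int expectedKeys = 10_000_000;          // illustrative
        double falsePositiveRate = 0.01;

        // m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions
        int vectorSize = (int) Math.ceil(
                -expectedKeys * Math.log(falsePositiveRate) / (Math.log(2) * Math.log(2)));
        int nbHash = Math.max(1, (int) Math.round((double) vectorSize / expectedKeys * Math.log(2)));

        BloomFilter filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);

        // build phase: add every key from the key file
        filter.add(new Key("some-key".getBytes(StandardCharsets.UTF_8)));

        // probe phase (in the mapper): false means definitely absent, true means "probably present"
        boolean maybePresent = filter.membershipTest(
                new Key("some-key".getBytes(StandardCharsets.UTF_8)));
        System.out.println(maybePresent);
    }
}

Since the Hadoop BloomFilter is itself Writable, it can be serialized once in the driver and shipped to the mappers (for example via the distributed cache) rather than rebuilt per task.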
You'll also need to perform a second MR Job to do a final validation (most probably in a reduce side join) to eliminate the false positives output from the first job.
I would read the single file before you run your job and store all the needed keys in the job configuration. You can then write a job to read the files from the folder. In your mapper/reducer setup(context) method, read the keys out of the configuration and store them globally, so that you can read them during map or reduce.
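A sketch of that approach (only workable when the serialized key list actually fits in the job configuration; the property name, delimiter, and record layout are illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class KeyFilterJob {

    // Mapper: rebuild the key set once per task, then filter records by their first field.
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Set<String> keys = new HashSet<>();

        @Override
        protected void setup(Context context) {
            Collections.addAll(keys, context.getConfiguration().get("filter.keys", "").split(","));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String recordKey = line.toString().split("\t", 2)[0];   // assumes tab-separated records
            if (keys.contains(recordKey)) {
                context.write(new Text(recordKey), line);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Driver: read the key file up front and push the keys into the configuration
        Configuration conf = new Configuration();
        List<String> keyLines = Files.readAllLines(Paths.get(args[0]));
        conf.set("filter.keys", String.join(",", keyLines));

        Job job = Job.getInstance(conf, "key-filter");
        job.setJarByClass(KeyFilterJob.class);
        job.setMapperClass(FilterMapper.class);
        // ... set input/output paths and formats, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}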