Background:
See this question: Parsing XmlInputFormat element larger than hdfs block size and its answer.
Suppose I want to parse Wikipedia's pages-articles bz2 compressed XML dumps using Hadoop using Mahout's XMLInputFormat. My input can be of two types:
XXwiki-latest-pages-articles-multistream-index.txt.bz2
XXwiki-latest-pages-articles.xml.bz2
The first one is a splittable compressed type, and by default (given BZip2Codec is enabled) it will be split, and multiple mappers will process the decompressed BZ2 splits in parallel.
The second type is bz2-compressed directly, and therefore not splittable, I guess? The default getSplits implementation that XMLInputFormat uses is FileInputFormat.getSplits, which I presume makes plain byte-boundary splits with no knowledge of my XML schema. I think this might impact performance.
Question:
Does XMLInputFormat work with unsplittable bz2 compressed files, that is, raw text directly bzipped rather than multistream (concatenated) bz2? By work I mean process them in parallel with multiple mappers, not feed everything into a single mapper.
Will it be beneficial to me if I implement a custom getSplits method like https://github.com/whym/wikihadoop#splitting? Also see https://github.com/whym/wikihadoop/blob/master/src/main/java/org/wikimedia/wikihadoop/StreamWikiDumpInputFormat.java#L146 - this InputFormat uses a custom getSplits implementation. Why?
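For context on the splittability question: in the new mapreduce API, whether a compressed input is split at all comes down to an isSplitable check against the codec. Below is a minimal sketch of that check, mirroring what TextInputFormat does; the subclass name is made up for illustration.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical subclass name; the check itself mirrors TextInputFormat.isSplitable.
public class SplitAwareXmlInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
                new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (codec == null) {
            return true; // uncompressed input can always be split
        }
        // Splitting a compressed file is only safe when the codec can seek to a
        // split boundary, e.g. BZip2Codec in recent Hadoop versions.
        return codec instanceof SplittableCompressionCodec;
    }
}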
Related
I need to search and extract a subset of a very large XML file based on a certain key. On very large files (~20 GB), because the StAX algorithm I am using searches the document linearly, it takes 7 minutes if the subset is at the end of the file. I am new to Java and concurrency, so my question might be naive and not very clear. Is it possible to run the same search method on multiple threads, such that the first starts at the beginning of the XML document, the second at the 100,000th byte, the third at the 200,000th byte, and so on? I am using a FileInputStream together with StAX parsing. I've read that there is a FileInputStream.skip() method which allows one to start reading a file at a certain byte location.
Basically, I want to run the same process on chunks of the document in parallel. Is this possible and a good optimization, or do I just not know what I'm doing/saying?
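A minimal sketch of the idea being asked about: a fixed thread pool where each worker opens its own FileInputStream, skips to its byte offset and searches its own chunk. The searchChunk body is left as a placeholder, and the main catch (a stream that starts mid-document is not well-formed XML) is noted in a comment.
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedSearch {

    // Placeholder: scans one byte range of the file and returns any match found there.
    static String searchChunk(File file, long offset, long length) throws Exception {
        try (FileInputStream in = new FileInputStream(file)) {
            long skipped = 0;
            while (skipped < offset) {
                long n = in.skip(offset - skipped);
                if (n <= 0) break;
                skipped += n;
            }
            // NOTE: a stream that starts mid-document is not well-formed XML, so a real
            // implementation must first scan forward to the next record boundary
            // (e.g. the next "<page>" tag) before running the StAX search over `length` bytes.
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        File file = new File(args[0]);
        int threads = 4;
        long chunk = file.length() / threads; // last chunk should extend to end of file

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final long offset = i * chunk;
            Callable<String> task = () -> searchChunk(file, offset, chunk);
            results.add(pool.submit(task));
        }
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}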
I'm looking for a library/framework to generate/parse TXT files from/into Java objects.
I'm thinking of something like Castor or JAXB, where the mapping between the file and the objects can be defined programmatically or with XML/annotations. The TXT file is not homogeneous and has no separators (fields are at fixed positions). The file is not big, so DOM-like handling is fine; no streaming required.
For instance:
TextWriter.write(Collection objects) -> FileOutputStream
TextReader.read(FileInputStream fis) -> Collection
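Before reaching for a framework, a hand-rolled reader for fixed-position records can be quite small. A rough sketch, with made-up field offsets purely for illustration:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type; assumes each line has name at 0-9, date at 10-17, amount at 18-23.
class Record {
    String name;
    String date;
    String amount;
}

public class FixedWidthReader {
    public static List<Record> read(String path) throws Exception {
        List<Record> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                Record r = new Record();
                r.name = line.substring(0, 10).trim();
                r.date = line.substring(10, 18).trim();
                r.amount = line.substring(18, 24).trim();
                records.add(r);
            }
        }
        return records;
    }
}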
I suggest you use Google's Protocol Buffers:
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
Protobuf messages can be exported/read in binary or text format.
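A rough sketch of what the text-format round trip looks like in Java; the Person message and its .proto definition are hypothetical, and the class would be generated with protoc first.
import com.google.protobuf.TextFormat;

// Assumed .proto definition, compiled with protoc to generate the Person class:
//   message Person {
//     string name = 1;
//     int32 id = 2;
//   }
public class ProtoTextExample {
    public static void main(String[] args) throws Exception {
        Person p = Person.newBuilder()
                .setName("Alice")
                .setId(42)
                .build();

        // Generated messages print in protobuf text format, suitable for a TXT file
        String text = p.toString();

        // Parse the text format back into a message
        Person.Builder builder = Person.newBuilder();
        TextFormat.merge(text, builder);
        Person restored = builder.build();
        System.out.println(restored.getName());
    }
}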
Other solutions depend on what you call a text file: if Base64 is texty enough for you, you could simply use standard Java serialization with Base64 encoding of the binary stream.
You can do this using Jackson to serialize to JSON and back:
http://jackson.codehaus.org/
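For example, with Jackson's current com.fasterxml coordinates (the Item POJO is made up for illustration):
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonExample {

    // Hypothetical POJO; public fields are picked up by Jackson's defaults
    public static class Item {
        public String name;
        public int quantity;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        Item item = new Item();
        item.name = "widget";
        item.quantity = 3;

        String json = mapper.writeValueAsString(item);   // object -> JSON text
        Item back = mapper.readValue(json, Item.class);  // JSON text -> object

        System.out.println(json + " -> " + back.name);
    }
}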
Just generate and parse it as XML or JSON; there's a whole load of libraries out there that will do the work for you.
This is a basic question about mapreduce outputs.
I'm trying to create a map function that takes in an XML file and makes a PDF using Apache FOP. However, I'm a little confused about how to output it, since I know it goes out as a (key, value) pair.
I'm also not using streaming to do this.
The point of map-reduce is to tackle large amounts of data that would usually not fit in memory, so input and output are usually stored on disk somehow (a.k.a. files).
Input and output must be specified in key-value format:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
I have not tried this, but this is what I would do:
Write the mapper output in this form: the key is the filename as Text (keep filenames unique) and the value is the output of FOP; write it out using TextOutputFormat.
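A rough sketch of what such a mapper could look like, assuming FOP 2.x is on the classpath and each input record's value holds a complete XSL-FO document; note it emits BytesWritable rather than Text for the value, because PDF output is binary.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;

import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FoToPdfMapper extends Mapper<LongWritable, Text, Text, BytesWritable> {

    private FopFactory fopFactory;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ByteArrayOutputStream pdf = new ByteArrayOutputStream();
        try {
            if (fopFactory == null) {
                fopFactory = FopFactory.newInstance(new File(".").toURI());
            }
            // Render the XSL-FO held in this record's value into an in-memory PDF
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, pdf);
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new StreamSource(new StringReader(value.toString())),
                                  new SAXResult(fop.getDefaultHandler()));
        } catch (Exception e) {
            throw new IOException("FOP rendering failed", e);
        }
        // Key: something unique per document (here the record's byte offset); value: the PDF bytes
        context.write(new Text("doc-" + key.get() + ".pdf"),
                      new BytesWritable(pdf.toByteArray()));
    }
}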
Suggestion:
I am assuming that your use case is just reading input XML (maybe doing some operation on its data) and writing the data to PDF files using FOP. I don't think this is a Hadoop use case in the first place, because whatever you want to do can be done by a batch script. How big are your XML files? How many XML files do you have to process?
EDIT:
SequenceFileOutputFormat writes to a SequenceFile. A SequenceFile has its own header and other metadata along with the data that is stored, and it stores data in the form of key:value pairs.
SequenceFile Common Header
version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.
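To make that layout concrete, here is a minimal writer sketch (path and content are placeholders). The header fields listed above, plus sync markers and the key bytes, all end up in the same file as the value bytes, which is why the resulting file is not itself a valid PDF.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("pdfs.seq"); // placeholder output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] pdfBytes = new byte[0]; // placeholder for real PDF content
            // The header, sync markers, key and value are all written into pdfs.seq
            writer.append(new Text("report.pdf"), new BytesWritable(pdfBytes));
        }
    }
}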
Using SequenceFile will ruin your application, as you will end up with corrupted output PDF files. Try it out and see for yourself.
You have lots of input files... and this is where Hadoop sucks (read this). Still, I feel you can do your desired operation using a script that invokes FOP on every document one by one. If you have multiple nodes, run the same script but on different subsets of the input documents. Trust me, this will run FASTER than Hadoop considering the overhead involved in creating maps and reduces (you don't need reduces, I know).
I want to run a Hadoop unit test using the local filesystem mode... I would ideally like to see several part-m-* files written out to disk (rather than just one). However, since it is just a test, I don't want to process 64 MB of data (the default block size is ~64 MB, I believe).
In distributed mode we can set this using
dfs.block.size
I am wondering whether there is a way I can get my local filesystem to write out small part-m files, i.e. so that my unit test will mimic large-scale data with several (albeit very small) files.
Assuming your input format can handle splittable files (see the org.apache.hadoop.mapreduce.lib.input.FileInputFormat.isSplitable(JobContext, Path) method), you can amend the input split size to process a smaller file with multiple mappers (I'm going to assume you're using the new-API mapreduce package):
For example, if you're using the TextInputFormat (or most input formats that extend FileInputFormat), you can call the static util methods:
FileInputFormat.setMaxInputSplitSize(Job, long)
FileInputFormat.setMinInputSplitSize(Job, long)
The long argument is the size of the split in bytes, so just set it to your desired size.
Under the hood, these methods set the following job configuration properties:
mapred.min.split.size
mapred.max.split.size
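For example, a job setup sketch forcing splits of at most 1 MB (assuming Hadoop 2.x-style Job.getInstance):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-split-test");

        // Force splits of at most 1 MB so even a small test file
        // produces several mappers (and several part-m-* files).
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024);

        // ... input/output paths, mapper class, etc. ...
    }
}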
A final note: some input formats may override the FileInputFormat.getFormatMinSplitSize() method (which defaults to 1 byte for FileInputFormat), so be wary if you set a value and Hadoop appears to ignore it.
A final point - have you considered MRUnit http://incubator.apache.org/mrunit/ for actual 'unit' testing of your MR code?
Try doing this; it will work:
hadoop fs -D dfs.block.size=16777216 -put 25090206.P .
I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
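A minimal StAX (cursor API) loop looks like this; the element name "record" is just a placeholder:
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // handle one record; only the current element is held in memory
                }
            }
            reader.close();
        }
    }
}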
Use a SAX-based parser that presents you with the contents of the document in a stream of events.
The StAX API is easier to deal with than SAX. Here is a short tutorial.
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then store it on the fly someplace else (a database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on; you have to pay attention to some things, like when the characters() method is called. But the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current path in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
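A rough sketch of such a path-tracking content handler (the "page/title" path is just a placeholder):
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element's text
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String current = String.join("/", path);
        if (current.endsWith("page/title")) { // placeholder path of interest
            System.out.println("title: " + text);
        }
        path.removeLast();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new java.io.File(args[0]), new PathTrackingHandler());
    }
}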
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but which wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection then results in pages being accessed in random order and objects being moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to a file on disk (the filesystem can obviously handle a sequential write of 3 GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over the elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.