Hadoop: split a big image file using a custom input format - java

I am working on big geographic image files that are larger than an HDFS block. I need to split the images into several strips (with a height of 100 px, for instance), then apply some processing to them and finally rebuild the final image.
To do so, I have created a custom input format (derived from FileInputFormat) and a custom record reader. I am splitting the image in the input format by defining several FileSplits (each corresponding to one strip), which are then read in the record reader.
I am not sure my splitting process is optimal, because a strip can span two HDFS blocks, and I don't know how to "send" the split to the best worker (the one that will need the fewest remote reads).
For the moment I am using FileInputFormat.getBlockIndex() with the split's starting offset in order to get the host for the split.
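A minimal sketch of that lookup, assuming a FileInputFormat subclass (class and variable names are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public abstract class StripInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

    // Build one FileSplit per strip; the host hints come from the HDFS block that
    // contains the strip's starting offset (getBlockIndex() is the protected helper
    // inherited from FileInputFormat).
    protected FileSplit makeStripSplit(Path file, long stripStart, long stripLength,
                                       BlockLocation[] blocks) throws IOException {
        int blockIndex = getBlockIndex(blocks, stripStart);
        return new FileSplit(file, stripStart, stripLength, blocks[blockIndex].getHosts());
    }
}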
Do you have any advice to help me solve this problem?
P.S. I am using the new Hadoop API

Image processing on Hadoop using HIPI. Check this out: http://hipi.cs.virginia.edu/

If it is realistic to process an entire image in a single mapper then you may find it simpler to achieve full data locality by making the block size of the image files larger than the size of each image, and get parallelism by processing multiple images at a time.
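If you go that route, one way to get the larger block size is to set it per file when writing the images into HDFS (a sketch; paths and sizes are placeholders):

import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

static void putWithLargeBlock(Configuration conf, String localPath, Path target) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 512L * 1024 * 1024;   // larger than any single image, so one block per image
    try (FileInputStream in = new FileInputStream(localPath);
         FSDataOutputStream out = fs.create(target, true, 4096,
                 fs.getDefaultReplication(target), blockSize)) {
        IOUtils.copyBytes(in, out, 4096);  // copy the local image into HDFS with that block size
    }
}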

Related

JAI: How do I extract a single page input stream from a multipaged TIFF image container?

I have a component that converts PDF documents to images, one image per page. Since the component uses converters producing in-memory images, it hits the JVM heap heavily and takes some time to finish conversions.
I'm trying to improve the overall performance of the conversion process, and I found a native library with a JNI binding that converts PDFs to TIFFs. That library can only convert PDFs to single TIFF files (it requires intermediate file system storage and does not even consume conversion streams), so the resulting TIFF files have the converted pages embedded rather than per-page images on the file system. Having a native library improves the conversion drastically and performance gets much better, but there is a real bottleneck: since I have to make a source-page to destination-page conversion, I now must extract every page from the result file and write all of them elsewhere. A simple and naive approach with RenderedImages:
// (Using JAI's com.sun.media.jai.codec classes; createImageDecoder is
// ImageCodec.createImageDecoder, statically imported here.)
final SeekableStream seekableStream = new FileSeekableStream(tempFile);
final ImageDecoder imageDecoder = createImageDecoder("tiff", seekableStream, null);
...
// V--- heap is wasted here: the whole page is decoded into memory
final RenderedImage renderedImage = imageDecoder.decodeAsRenderedImage(pageNumber);
// ... do the rest of the stuff ...
Really, I would just like to extract a concrete page input stream from the TIFF container file (tempFile) and redirect it elsewhere without having to store it as an in-memory image. I would imagine an approach similar to container processing, where I seek to a specific entry to extract data from it (say, something like ZIP file processing, etc.). But I couldn't find anything like that in ImageDecoder, or perhaps my expectations are wrong and I'm just missing something important here...
Is it possible to extract TIFF container page input streams using JAI API or probably third-party alternatives? Thanks in advance.
I could be wrong, but I don't think JAI has support for splitting TIFFs without decoding the files to in-memory images. And, sorry for promoting my own library, but I think it does exactly what you need (the main part of the solution used to split TIFFs was contributed by a third party).
By using the TIFFUtilities class from com.twelvemonkeys.contrib.tiff, you should be able to split your multi-page TIFF to multiple single-page TIFFs like this:
TIFFUtilities.split(tempFile, new File("output"));
No decoding of the images is done; each IFD is only split into a separate file, and the streams are written with corrected offsets and byte counts.
Files will be named output/0001.tif, output/0002.tif etc. If you need more control over the output name or have other requirements, you can easily modify the code. The code comes with a BSD-style license.

Reading large images from HDFS in mapreduce

There is a very large image (~200MB) in HDFS (block size 64MB). I want to know the following:
How to read the image in a mapReduce job?
Many topics suggest WholeInputFormat. Is there any other alternative and how to do it?
When WholeInputFormat is used, will there be any parallel processing of the blocks? I guess no.
If your block size is 64 MB, HDFS will most probably have split your image file into chunks and replicated them across the cluster, depending on your cluster configuration.
Assuming you want to process your image file as one record rather than multiple blocks/lines, here are a few options I can think of for processing the image file as a whole.
You can implement a custom input format and a record reader. The isSplitable() method in the input format should return false. The RecordReader.next(LongWritable pos, RecType val) method should read the entire file and set val to the file contents. This will ensure that the entire file goes to one map task as a single record (see the sketch below).
You can also sub-class an input format and override the isSplitable() method so that it returns false. This example shows how to sub-class SequenceFileInputFormat to implement a NonSplittableSequenceFileInputFormat.
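A minimal sketch of the first option (the whole file as one record) using the new API; class names are illustrative, and the image is emitted as a BytesWritable value:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split per file, so one mapper gets the whole image
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Assumes the file fits in a byte array (i.e. in the mapper's heap).
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length); // the whole image as a single record
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}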
I guess it depends on what type of processing you want to perform. If you are trying to do something that can be done by first splitting the big input into smaller image files, then independently processing those pieces, and finally stitching the output parts back into one large final output, then it may be possible. I'm no image expert, but suppose you want to turn a color image into grayscale: you could cut the large image into small images, convert them in parallel using MapReduce, and once the mappers are done, stitch them back into one large grayscale image.
If you understand the format of the image, you can write your own record reader to help the framework understand the record boundaries, preventing corruption when they are fed to the mappers.
Although you can use WholeFileInputFormat or SequenceFileInputFormat or something custom to read the image file, the actual issue (in my view) is to draw something out of the read file. OK, you have read the file, now what? How are you going to process your image to detect any object inside your mapper? I'm not saying it's impossible, but it would require a lot of work.
IMHO, you are better off using something like HIPI. HIPI provides an API for performing image processing tasks on top of the MapReduce framework.
Edit:
If you really want to do it your way, then you need to write a custom InputFormat. Since images are not like text files, you can't use delimiters like \n for split creation. One possible workaround is to create splits based on a given number of bytes. For example, if your image file is 200 MB, you could write an InputFormat which creates splits of 100 MB (or whatever you pass as a parameter in your job configuration). I faced such a scenario long ago while dealing with some binary files, and this project helped me a lot.
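As a rough illustration of the fixed-byte-size idea (an alternative to writing getSplits() by hand; it assumes your record reader can cope with arbitrary byte boundaries), the split size can simply be capped on an otherwise splittable FileInputFormat:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

static void configureSplits(Job job) {
    long splitSize = 100L * 1024 * 1024;                  // ~100 MB per split
    FileInputFormat.setMinInputSplitSize(job, splitSize); // lower bound
    FileInputFormat.setMaxInputSplitSize(job, splitSize); // upper bound
}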
HTH

Getting metadata of an APNG image

I am trying to get the metadata of an APNG image at the moment. I have been able to get the different frames from one APNG file flawlessly, and I am using PNGJ (a really great standalone Java library for reading and writing PNG images), but I am not able to get the information that is stored for every APNG frame, such as the delay of each frame.
At the moment I am only able to get the basic PNG image info that is stored in the header, by using:
PngReader pngr = FileHelper.createPngReader(File);
pngr.imgInfo;
But I don't know how to get the information stored in the fcTL chunk. How can I do that?
You omitted the information that you are using the PNGJ library. As I mentioned in the other answer, this library does not parse the APNG chunks (fcTL, fdAT). It loads them (you can inspect them in the ChunksList property), but they will be instantiated as "UNKNOWN" chunks, so the binary data is left in raw form. If you want to look inside the content of the fcTL chunks, you'll either have to parse the binary yourself, or implement the logic for that chunk type yourself and register it in the reader (here's an example for a custom chunk).
Look at how you're currently reading the 4-byte integer 'seq' from fdAT.
You can read information from fcTL the same way.
Just keep in mind that some fields are stored in fcTL as 4 bytes, some as 2 bytes, and some as 1 byte.
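For reference, a sketch of parsing an fcTL payload; the field layout comes from the APNG spec, and how you obtain the raw bytes depends on how your reader exposes unknown chunks:

import java.nio.ByteBuffer;

// 'data' is the raw payload of one fcTL chunk.
static void printFctlInfo(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);   // big-endian by default, as PNG requires
    long seq = buf.getInt() & 0xFFFFFFFFL;    // 4 bytes: sequence number
    long width = buf.getInt() & 0xFFFFFFFFL;  // 4 bytes: frame width
    long height = buf.getInt() & 0xFFFFFFFFL; // 4 bytes: frame height
    long xOff = buf.getInt() & 0xFFFFFFFFL;   // 4 bytes: x offset
    long yOff = buf.getInt() & 0xFFFFFFFFL;   // 4 bytes: y offset
    int delayNum = buf.getShort() & 0xFFFF;   // 2 bytes: delay numerator
    int delayDen = buf.getShort() & 0xFFFF;   // 2 bytes: delay denominator (0 means 100)
    int disposeOp = buf.get() & 0xFF;         // 1 byte: dispose_op
    int blendOp = buf.get() & 0xFF;           // 1 byte: blend_op
    System.out.printf("frame %d: %dx%d at (%d,%d), delay %d/%d, dispose %d, blend %d%n",
            seq, width, height, xOff, yOff, delayNum, delayDen, disposeOp, blendOp);
}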

Reduce PDF file size in itext (java)

I'm creating a web-based label printing system. For every label there should be a unique s/n, so when a user decides to create 1000 labels (with the same data), each of them should have a unique s/n. The PDF will therefore have 1000 pages, which increases the file size.
My problem is that when the user decides to create more copies, the file size gets bigger.
Is there any way I can reduce the file size of the PDF using iText? Or is there any way I can generate the PDF and output it in the browser without saving it to either the server's or the client's HDD?
Thanks for the help!
One approach is to compress the file. It should be highly compressible.
(I imagine that you should be able to generate the PDF on the server side without writing it to disc, though you could use a lot of memory / Java heap in the process. I don't think it is possible to deliver a PDF to the browser without the file going to the client PC's hard drive in some form.)
If everything except the s/n is the same for the thousands of labels, you only have to add the shared content once as a template and put the s/n text on top of it.
Take a look at PdfTemplate in iText. If I recall correctly, that creates an XObject for the recurring drawing/label/image..., and it is exactly the same object every time you use it.
Even with thousands of labels, the only thing that grows your document is the s/n (and each page itself); the graphics or text of the 'label' are only added once. That should reduce your file size.
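A minimal iText 5 sketch of that idea (sizes, coordinates and the label artwork are placeholders): the shared label graphics go into one PdfTemplate (XObject) that is reused on every page, only the serial number is written per page, and the PDF is kept in memory so nothing is written to the server's disk:

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfTemplate;
import com.itextpdf.text.pdf.PdfWriter;
import java.io.ByteArrayOutputStream;

public class LabelSheet {
    public static byte[] createLabels(int copies) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // PDF stays in memory
        Document doc = new Document();
        PdfWriter writer = PdfWriter.getInstance(doc, out);
        doc.open();
        PdfContentByte cb = writer.getDirectContent();

        // Shared label artwork: added to the PDF once as an XObject and reused on every page.
        PdfTemplate label = cb.createTemplate(400, 200);
        label.rectangle(10, 10, 380, 180);      // placeholder for the real label graphics
        label.stroke();

        BaseFont font = BaseFont.createFont();  // Helvetica
        for (int sn = 1; sn <= copies; sn++) {
            if (sn > 1) {
                doc.newPage();                  // one label per page
            }
            cb.addTemplate(label, 100, 500);    // reuse the same XObject, no size growth
            cb.beginText();
            cb.setFontAndSize(font, 10);
            cb.setTextMatrix(110, 520);
            cb.showText("S/N " + sn);           // only the serial number differs per page
            cb.endText();
        }
        doc.close();
        return out.toByteArray();               // can be streamed straight to the browser
    }
}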

Swing Large Files Performance

We need to load and display large rich-text files using Swing, about 50 MB. The problem is that the performance of rendering the files is incredibly poor. We tried both JTextPane and JEditorPane with no luck.
Does someone have experience with this and could give me some advice?
thanks,
I don't have any experience with this, but if you really need to load big files I suggest you do some kind of lazy loading with JTextPane/JEditorPane.
Define a limit that JTextPane/JEditorPane can handle well (like 500 KB or 1 MB). You'll only need to load a chunk of the file of this size into the control.
Start by loading the 1st partition of the file.
Then you need to interact with the scroll container and see if it has reached the end/beginning of the current chunk of the file. If so, show a nice waiting cursor, and load the previous/next chunk into memory and into the text control.
The chunk to load is calculated from your current cursor position in the file (offset):
loading chunk = offset - limit/2 to offset + limit/2
The text in the JTextPane/JEditorPane must not change when loading chunks, or else the user will feel like they are in a different position in the file.
This is not a trivial solution, but if you don't find any 3rd-party control that does this, this is the way I would go.
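A bare-bones sketch of that chunk read (no Swing wiring; the names and the single-byte charset are assumptions):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Read 'limit' bytes centered on 'offset' and return them as the text to display.
static String loadChunk(RandomAccessFile file, long offset, int limit) throws IOException {
    long start = Math.max(0, offset - limit / 2);            // chunk = offset - limit/2 ...
    long end = Math.min(file.length(), offset + limit / 2);  //         ... to offset + limit/2
    byte[] buf = new byte[(int) (end - start)];
    file.seek(start);
    file.readFully(buf);
    return new String(buf, StandardCharsets.ISO_8859_1);     // substitute the file's real encoding
}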
You could use Memory Mapped File I/O to create a 'window' into the file and let the operating system handle the reading of the file.
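Roughly, a memory-mapped window could look like this (the path is a placeholder):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map only the region currently being displayed; the OS pages the bytes in on demand.
static byte[] readWindow(String path, long windowStart, int windowSize) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
         FileChannel channel = raf.getChannel()) {
        long length = Math.min(windowSize, channel.size() - windowStart);
        MappedByteBuffer window =
                channel.map(FileChannel.MapMode.READ_ONLY, windowStart, length);
        byte[] bytes = new byte[(int) length];
        window.get(bytes);                       // hand this chunk to the text component
        return bytes;
    }
}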
Writing an efficient WYSIWYG text editor that can handle large documents is a pretty hard problem. Even Word has problems when you get into large books.
Swing is general purpose, but you have to build up a toolset around it involving managing documents separately and paging them.
You might look at Open Office, you can embed an OO document editor screen right into your app. I believe it's called OOBean...
JTextPane/JEditorPane do not handle even 1 MB of text well (especially text with long lines).
You can try JEdit's StandaloneTextArea; it is much faster than the Swing text components, but I doubt it will handle this much text. I tried with a 45 MB file: it loaded (in about 25 seconds) and I could scroll down, but I started getting OutOfMemoryError with a 1700 MB heap.
In order to build a really scalable solution, there are two obvious options:
Use pagination. You can do just fine with standard Swing by displaying text in pages.
Build a custom text renderer. It can be as simple as a scrollable pane where only the visible part is drawn, using a BufferedReader to skip to the desired line in the file and read a limited number of lines to display. I did this before and it is a workable solution. If you need 'text selection' capabilities, that is a little more work, of course.
For really large files you could build an index file that contains the offset of each line in characters, so getting the offset is a quick random-access lookup by line number, and reading the text is a skip to that offset. Very large files can be viewed with this technique.
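A rough sketch of that index idea (one pass records each line's byte offset, then any line is a seek plus a read; this assumes a single-byte encoding, as RandomAccessFile.readLine() does):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// One pass over the file, recording the byte offset at which each line starts.
static List<Long> buildLineIndex(RandomAccessFile file) throws IOException {
    List<Long> offsets = new ArrayList<>();
    offsets.add(0L);
    file.seek(0);
    while (file.readLine() != null) {
        offsets.add(file.getFilePointer());   // offset where the next line starts
    }
    return offsets;
}

// Random-access lookup by line number: seek to the recorded offset, then read the line.
static String readLineAt(RandomAccessFile file, List<Long> offsets, int lineNumber)
        throws IOException {
    file.seek(offsets.get(lineNumber));
    return file.readLine();
}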
