I am in the process of developing a program with Hadoop, which is relatively new to me, so I would be grateful for advice on how to design what I am planning to do.
I have a large ordered set of 1...n images. The images are logically divided into several groups, and each group can be processed independently. Within a group, however, the images are interdependent and should therefore be processed by a single map task. The images themselves are small, so loading them into memory simultaneously should be no problem.
I thought of packing each group into a separate SequenceFile, but there seems to be no way to read a SequenceFile from an InputStream. Alternatively, is there a way to allocate M different nodes to a single MapReduce job so that each node reads its SequenceFile directly from HDFS?
I have solved similar problems by encoding each image into a Base64 string and putting them all into an array field of a JSON object during a preprocessing stage.
Furthermore, if you store the JSON in Avro format, you get the benefit of an out-of-the-box object-oriented interface to your object in the mapper.
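For illustration, a minimal sketch of that preprocessing step in plain Java, using java.util.Base64 and hand-built JSON to avoid extra dependencies (the directory and output paths are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Base64-encode every image in one group directory and emit them
// as a single JSON object with an "images" array.
public class GroupToJson {
    public static void main(String[] args) throws IOException {
        Path groupDir = Path.of("/data/images/group-01");   // hypothetical input group
        List<String> encoded;
        try (Stream<Path> files = Files.list(groupDir)) {
            encoded = files
                    .filter(Files::isRegularFile)
                    .map(GroupToJson::encode)
                    .collect(Collectors.toList());
        }
        // one JSON object per group: {"images":["<base64>","<base64>", ...]}
        String json = "{\"images\":[\"" + String.join("\",\"", encoded) + "\"]}";
        Files.writeString(Path.of("/data/json/group-01.json"), json);
    }

    private static String encode(Path image) {
        try {
            return Base64.getEncoder().encodeToString(Files.readAllBytes(image));
        } catch (IOException e) {
            throw new RuntimeException("failed to read " + image, e);
        }
    }
}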
You might want to have a look at CombineFileInputFormat, which lets you group inputs according to a PathFilter, for example folder-wise. Each group can then be built into a single split that is processed by a single map task, since the number of map tasks equals the number of splits.
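A rough sketch of what that could look like with the newer mapreduce API is below; it skips the PathFilter-based pooling needed for strict folder-wise grouping and just shows the whole-file reading side, with each small image delivered to the mapper as (path, bytes):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Combines many small image files into fewer splits; each map task receives
// (file path, file bytes) pairs for every file in its split.
public class GroupedImageInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    public GroupedImageInputFormat() {
        setMaxSplitSize(64 * 1024 * 1024);   // tune to the total size of one group
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split an individual image file
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, WholeFileReader.class);
    }

    // Reads one whole file of the combined split per call to nextKeyValue().
    public static class WholeFileReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final TaskAttemptContext context;
        private boolean done = false;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        public WholeFileReader(CombineFileSplit split, TaskAttemptContext context,
                               Integer index) {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.context = context;
        }

        @Override public void initialize(InputSplit split, TaskAttemptContext ctx) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (done) {
                return false;
            }
            byte[] bytes = new byte[(int) length];
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, bytes, 0, bytes.length);
            }
            key.set(path.toString());
            value.set(bytes, 0, bytes.length);
            done = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return done ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}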
Your needs seem to be similar to this link; please check it.
I need to write a batch job in Java that uses multiple threads to perform various operations on a bunch of data.
I have almost 60k rows of data and need to perform different operations on them. Some of the operations work on the same data but produce different outputs.
So the question is: is it right to create this big 60k-element ArrayList and pass it through the various operators, so that each one can add its own output, or is there a better architecture someone can suggest?
EDIT:
I need to create these objects:
MyObject, with an ArrayList of MyObject2, 3 different Integers, 2 Strings.
MyObject2, with 12 floats
MyBigObject, with an ArrayList of MyObject, usually of 60k elements, and some Strings.
My different operators work on the same ArrayList of MyObject2 but write their outputs to the integers. For example, Operator1 reads from the ArrayList of MyObject2, performs some calculation, and writes its result to MyObject.Integer1; Operator2 reads from the same ArrayList, performs a different calculation, and writes its result to MyObject.Integer2; and so on.
Is this architecture "safe"? The ArrayList of MyObject2 has to be read-only, never modified by any operator.
EDIT:
Actually I don't have any code yet because I'm studying the architecture first, and then I'll start writing something.
Trying to rephrase my question:
Is it OK, in a batch job written in pure Java (without any framework; I'm not using, for example, Spring Batch because it would be like shooting a fly with a shotgun for my project), to create a macro object and pass it around so that every thread can read from the same data but write its results to different fields?
Can it be dangerous if different threads read from the same data at the same time?
It depends on your operations.
Generally it's possible to partition work on a dataset horizontally or vertically.
Horizontally means splitting your dataset into several smaller sets and letting each individual thread handle one such set. This is the safest approach, yet usually slower, because each individual thread has to perform several different operations. It's also a bit more complex to reason about, for the same reason.
Vertically means each thread performs some operation on a specific "field" or "column", or whatever the individual data unit is in your data set.
This is generally easier to implement (each thread does one thing on the whole set) and can be faster. However, each operation on the dataset needs to be independent of the other operations.
If you are unsure about multi-threading in general, I recommend doing work horizontally in parallel.
Now, to the question of whether it is OK to pass your full dataset around (some ArrayList): sure it is! It's just a reference, so passing it around doesn't really matter. What matters are the operations you perform on the dataset.
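A minimal sketch of the vertical approach, using the class names from the question (fields simplified; loadRows and the two compute methods are placeholders). Each operator thread only reads the shared list and writes to a field that no other operator touches, so no synchronization is needed on the reads:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class VerticalBatchSketch {

    static class MyObject2 { final float[] values = new float[12]; }

    static class MyObject {
        final List<MyObject2> data;   // shared, read-only
        volatile int result1;         // written only by operator 1
        volatile int result2;         // written only by operator 2
        MyObject(List<MyObject2> data) { this.data = List.copyOf(data); }
    }

    public static void main(String[] args) throws InterruptedException {
        List<MyObject> rows = loadRows();   // hypothetical loader, not shown

        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> rows.forEach(r -> r.result1 = computeSum(r.data)));
        pool.submit(() -> rows.forEach(r -> r.result2 = computeMax(r.data)));
        pool.shutdown();
        // wait for both operators to finish before reading the results
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // placeholder calculations standing in for the real operators
    static int computeSum(List<MyObject2> d) { return d.size(); }
    static int computeMax(List<MyObject2> d) { return d.isEmpty() ? 0 : 1; }
    static List<MyObject> loadRows() { return List.of(); }
}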
I haven't really done any serious multithreading before (with the exception of the typical for-loop textbook example), so I thought I might give it a try. The task that I am trying to accomplish is the following:
Read an identification code from a file called ids.txt
Search for that identification code in a separate file called sequence.txt
Once the identification code is found, extract the string that follows it.
Create an object of type DataSequence (which encapsulates the identification code and the extracted sequence) and add it to an ArrayList.
Repeat for 3000+ ids.
I have tried this the "regular" way within a single thread, but the process is way too slow. How can I approach this in a multi-threaded fashion?
Without seeing profiling data, it's hard to know what to recommend. But as a blind guess, I'd say that repeatedly opening, searching, and closing sequence.txt is taking most of the time. If this guess is accurate, then the biggest improvement (by far) would be to find a way to process sequence.txt only once. The easiest way to do that would be to read the relevant information from the file into memory, building a hash map from each id to the string that follows it. The entire file is only 53.3 MB, so this is an eminently reasonable approach. Then, as you process ids.txt, you only need to look up the relevant string in the map, which is a very quick operation.
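As a rough sketch of that approach (assuming each line of sequence.txt starts with the id followed by a space; adjust the parsing to the real layout):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Read sequence.txt once into a map of id -> following string,
// then resolve every id from ids.txt with a constant-time lookup.
public class SequenceLookup {
    public static void main(String[] args) throws IOException {
        Map<String, String> sequencesById = new HashMap<>();
        for (String line : Files.readAllLines(Path.of("sequence.txt"))) {
            int split = line.indexOf(' ');
            if (split > 0) {
                sequencesById.put(line.substring(0, split), line.substring(split + 1));
            }
        }

        List<DataSequence> results = new ArrayList<>();
        for (String id : Files.readAllLines(Path.of("ids.txt"))) {
            String sequence = sequencesById.get(id.trim());
            if (sequence != null) {
                results.add(new DataSequence(id.trim(), sequence));
            }
        }
        System.out.println("resolved " + results.size() + " ids");
    }

    // stand-in for the DataSequence class mentioned in the question
    record DataSequence(String id, String sequence) {}
}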
An alternative would be to use the java.nio classes to create a memory-mapped file for sequence.txt.
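A minimal illustration of that alternative, assuming the whole file is mapped and decoded as UTF-8:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Memory-map sequence.txt with java.nio; at ~53 MB the whole file
// fits comfortably in a single mapping.
public class MappedSequenceFile {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Path.of("sequence.txt"), StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            String contents = new String(bytes, StandardCharsets.UTF_8);
            // search the in-memory contents, e.g. contents.indexOf(id)
            System.out.println("mapped " + contents.length() + " characters");
        }
    }
}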
I'd be hesitant about looking to multithreading to improve what seems to be a disk-bound operation, particularly if the threads will all end up contending for access to the same file (even if it is only read access). This does not strike me as a good problem with which to learn multithreading techniques; the payoff is just not likely to be there.
Multi-threading could be overkill here. Try the following algorithmic approach:
1. Open the file sequence.txt in read mode
2. Declare a HashMap for storing key-value pairs
3. Loop until the end of the file
3A. Read a line as a string
3B. Parse the line into an id (key) and the rest of the line (value) and store the pair in the HashMap
4. Now look up each id from ids.txt in the HashMap, or do whatever else you need with it.
Note: steps 3A and 3B can be put into two different tasks for two different threads in a producer-consumer design, as sketched below.
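A sketch of what that producer/consumer split might look like, using a BlockingQueue between the two threads (same file name and line-layout assumptions as in the algorithm above):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// One thread reads lines (step 3A), the other parses them into the map (step 3B).
public class ProducerConsumerIndex {
    private static final String POISON_PILL = "";   // signals end of input

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> lines = new ArrayBlockingQueue<>(1024);
        Map<String, String> index = new ConcurrentHashMap<>();

        Thread producer = new Thread(() -> {
            try (BufferedReader reader = Files.newBufferedReader(Path.of("sequence.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!line.isEmpty()) {
                        lines.put(line);            // 3A: read a line
                    }
                }
                lines.put(POISON_PILL);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                String line;
                while (!(line = lines.take()).equals(POISON_PILL)) {
                    int split = line.indexOf(' ');  // 3B: parse id and value
                    if (split > 0) {
                        index.put(line.substring(0, split), line.substring(split + 1));
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println("indexed " + index.size() + " entries");
    }
}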
I want to stream protobuf messages onto a file.
I have a protobuf message
message car {
... // some fields
}
My Java code would create multiple objects of this car message.
How should I stream these messages to a file?
As far as I know, there are two ways of going about it.
Option 1: have another message, like cars
message cars {
repeated car c = 1;
}
and have the Java code create a single cars object and then stream it to a file.
Option 2: just stream the car messages to a single file, appropriately, using the writeDelimitedTo function.
I am wondering which is the more efficient way to go about streaming using protobuf.
When should I use pattern 1 and when should I be using pattern 2?
This is what I got from https://developers.google.com/protocol-buffers/docs/techniques#large-data
I am not clear on what they are trying to say.
Large Data Sets
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
That said, Protocol Buffers are great for handling individual messages
within a large data set. Usually, large data sets are really just a
collection of small pieces, where each small piece may be a structured
piece of data. Even though Protocol Buffers cannot handle the entire
set at once, using Protocol Buffers to encode each piece greatly
simplifies your problem: now all you need is to handle a set of byte
strings rather than a set of structures.
Protocol Buffers do not include any built-in support for large data
sets because different situations call for different solutions.
Sometimes a simple list of records will do while other times you may
want something more like a database. Each solution should be developed
as a separate library, so that only those who need it need to pay the
costs.
Have a look at Previous Question. Any difference in size and time will be minimal
(option 1 perhaps faster, option 2 smaller).
My advice would be:
Option 2 for big files. You process message by message.
Option 1 if multiple languages are needed. In the past, delimited was not supported in all languages; this seems to be changing, though.
Otherwise, it's personal preference.
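For reference, a sketch of option 2 using the delimited helpers from protobuf-java; CarProtos.Car stands in for whatever class your .proto actually generates:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

public class CarStreamExample {

    static void writeCars(List<CarProtos.Car> cars, String path) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path)) {
            for (CarProtos.Car car : cars) {
                car.writeDelimitedTo(out);   // length prefix + message bytes
            }
        }
    }

    static void readCars(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            CarProtos.Car car;
            // parseDelimitedFrom returns null at end of stream
            while ((car = CarProtos.Car.parseDelimitedFrom(in)) != null) {
                System.out.println(car);
            }
        }
    }
}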
I have used Pig and Hive before but am new to Hadoop MapReduce. I need to write an application which has multiple small files as input (say, 10). They have different file structures, so I want to process them in parallel on separate nodes so that they can be processed quickly. I know that the strong point of Hadoop is processing large data, but these input files, though small, require a lot of processing, so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?
It is possible but you're probably not going to get much value. You have these forces against you:
Confused input
You'll need to write a mapper which can handle all of the different input formats (either by detecting the input format, or using the filename of the input to decide which format to expect)
Multiple outputs
You need to either use the slightly tricky multiple output file handling functionality of Hadoop or write your output as a side effect of the reducer (or mapper if you can be sure that each file will go to a different node)
High cost of initialization
Every Hadoop MapReduce job comes with a hefty start-up cost: about 30 seconds on a small cluster, much more on a larger one. This point alone will probably lose you more time than you could ever hope to gain through parallelism.
In brief: give NLineInputFormat a try.
There is no problem with copying all your input files to all nodes (you can put them in the distributed cache if you like). What you really want to distribute is the check processing.
With Hadoop you can create a (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed a specified number of checks to each of your nodes (mapreduce.input.lineinputformat.linespermap controls the number of lines fed to each mapper).
Note: Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about blocks.
Depending on the nature of your checks, you may be able to compute a linespermap value that covers all files/checks in one wave of mappers (or you may be unable to use this approach at all :) )
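Driver wiring for this approach might look roughly like the following; CheckMapper and the paths are placeholders, and the control-line format is assumed to be filename,check2run:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CheckDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "file-checks");
        job.setJarByClass(CheckDriver.class);
        job.setMapperClass(CheckMapper.class);
        job.setNumReduceTasks(0);                      // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);  // one check per mapper
        NLineInputFormat.addInputPath(job, new Path("/control/checks.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output/checks"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Each mapper receives one control line, e.g. "hdfs://.../file1.csv,rowCountCheck"
    public static class CheckMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length < 2) {
                return;                                // malformed control line
            }
            // the real check on parts[0] would run here; emit (filename, result)
            context.write(new Text(parts[0]), new Text("TODO: run " + parts[1]));
        }
    }
}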
I have a list of lists whose indices reach into the hundreds of millions. Let's say each of the lists inside the list is a sentence of a text. I would like to partition this data for processing in different threads. I used subList to split the data and send it to different threads for processing. Is this a standard approach for partitioning data? If not, could you please suggest a standard approach for it?
This will work as long as you do not "structurally modify" the list or any of these sub-lists. Read-only processing is fine.
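A minimal sketch of that pattern, with placeholder loading and processing methods; each subList is just a read-only view over the one backing list:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SubListPartition {
    public static void main(String[] args) throws InterruptedException {
        List<List<String>> sentences = loadSentences();   // hypothetical loader
        int threads = Runtime.getRuntime().availableProcessors();
        int chunk = (sentences.size() + threads - 1) / threads;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int start = 0; start < sentences.size(); start += chunk) {
            // a view over the backing list, never structurally modified
            List<List<String>> slice =
                    sentences.subList(start, Math.min(start + chunk, sentences.size()));
            pool.submit(() -> slice.forEach(SubListPartition::processSentence));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processSentence(List<String> sentence) { /* real work goes here */ }
    static List<List<String>> loadSentences() { return List.of(); }
}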
There are many other "big data" approaches to handling 100s of millions of records, because there are other problems you might hit:
If your program fails (e.g. OutOfMemoryError), you probably don't want to have to start over from the beginning.
You might want to throw >1 machine at the problem, at which point you can't share the data within a single JVM's memory.
After you've processed each sentence, are you building some intermediate result and then processing that as a step 2? You may need to put together a pipeline of steps where you re-partition the data before each step.
You might find you have too many sentences to fit them all into memory at once.
A really common tool for this kind of work is Hadoop. You'd copy the data into HDFS, run a map-reduce job (or more than one job) on the data and then copy the data out of HDFS when you're done.
A simpler approach to implement is to just use a database: assign different ranges of an integer sentence_id column to different threads and build your output in another table.