I have a list of lists whose indices reach into the hundreds of millions. Let's say each of the inner lists is a sentence of a text. I would like to partition this data for processing in different threads. I used subList to split
the data and send it to different threads for processing. Is this a standard approach for partitioning data? If not, could you please suggest a standard approach for it?
This will work as long as you do not "structurally modify" the list or any of these sub-lists. Read-only processing is fine.
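For illustration, a minimal sketch of that pattern could look like the following (the sentence list, the per-sentence work and the thread count are assumptions, not the asker's code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SubListPartitionDemo {

        // Hypothetical per-sentence work; replace with your own processing.
        static void processSentence(List<String> sentence) {
            // read-only access to the sentence
        }

        public static void main(String[] args) throws InterruptedException {
            List<List<String>> sentences = new ArrayList<>(); // your huge list of sentences
            int nThreads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(nThreads);

            int chunk = (sentences.size() + nThreads - 1) / nThreads; // ceiling division
            for (int start = 0; start < sentences.size(); start += chunk) {
                // subList is only a view; safe as long as nobody structurally modifies the backing list
                List<List<String>> slice =
                        sentences.subList(start, Math.min(start + chunk, sentences.size()));
                pool.submit(() -> slice.forEach(SubListPartitionDemo::processSentence));
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

Each worker only reads its own subList view, so no structural modification of the backing list takes place.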
There are many other "big data" approaches to handling 100s of millions of records, because there are other problems you might hit:
If your program fails (e.g. OutOfMemoryError), you probably don't want to have to start over from the beginning.
You might want to throw >1 machine at the problem, at which point you can't share the data within a single JVM's memory.
After you've processed each sentence, are you building some intermediate result and then processing that as a step 2? You may need to put together a pipeline of steps where you re-partition the data before each step.
You might find you have too many sentences to fit them all into memory at once.
A really common tool for this kind of work is Hadoop. You'd copy the data into HDFS, run a map-reduce job (or more than one job) on the data and then copy the data out of HDFS when you're done.
A simpler approach to implement is just to use a database and assign different ranges for the integer sentence_id column to different threads and build your output in another table.
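As a hedged sketch of that database approach (table and column names such as sentences, sentence_id, text and results are purely hypothetical), each worker thread could look roughly like this:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // One worker per id range; the schema here is invented for illustration.
    public class RangeWorker implements Runnable {

        private final String url;   // e.g. "jdbc:postgresql://localhost/corpus"
        private final long fromId;
        private final long toId;

        RangeWorker(String url, long fromId, long toId) {
            this.url = url;
            this.fromId = fromId;
            this.toId = toId;
        }

        @Override
        public void run() {
            String select = "SELECT sentence_id, text FROM sentences WHERE sentence_id BETWEEN ? AND ?";
            String insert = "INSERT INTO results (sentence_id, result) VALUES (?, ?)";
            try (Connection conn = DriverManager.getConnection(url);
                 PreparedStatement in = conn.prepareStatement(select);
                 PreparedStatement out = conn.prepareStatement(insert)) {
                in.setLong(1, fromId);
                in.setLong(2, toId);
                try (ResultSet rs = in.executeQuery()) {
                    while (rs.next()) {
                        String processed = process(rs.getString("text")); // your per-sentence work
                        out.setLong(1, rs.getLong("sentence_id"));
                        out.setString(2, processed);
                        out.executeUpdate();
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        private String process(String sentence) {
            return sentence; // placeholder
        }
    }

Because progress lands in the output table, a crashed run can be resumed by re-processing only the id ranges that are still missing.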
I need to write a batch job in Java that uses multiple threads to perform various operations on a bunch of data.
I have almost 60k rows of data and need to do different operations on them. Some of them work on the same data but produce different outputs.
So, the question is: is it right to create this big 60k-length ArrayList and pass it through the various operators, so that each one can add its own output, or is there a better architecture design that someone can suggest?
EDIT:
I need to create these objects:
MyObject, with an ArrayList of MyObject2, 3 different Integers, 2 Strings.
MyObject2, with 12 floats
MyBigObject, with an ArrayList of MyObject, usually of 60k elements, and some Strings.
My different operators work on the same ArrayList of MyObject2 but write their outputs to the integers. For example, Operator1 reads from the ArrayList of MyObject2, performs some calculation and writes its result to MyObject.Integer1; Operator2 reads from the same ArrayList of MyObject2, performs a different calculation and writes its result to MyObject.Integer2; and so on.
Is this architecture "safe"? The ArrayList of MyObject2 has to be read-only, never edited by any operator.
EDIT:
Actually I don't have any code yet, because I'm studying the architecture first; then I'll start writing something.
Trying to rephrase my question:
Is it ok, in a batch job written in pure Java (without any framework; I'm not using Spring Batch, for example, because it would be like shooting a fly with a shotgun for my project), to create a macro object and pass it around so that every thread can read from the same data but write its results to different fields?
Can it be dangerous if different threads read from the same data at the same time?
It depends on your operations.
Generally it's possible to partition work on a dataset horizontally or vertically.
Horizontally means splitting your dataset into several smaller sets and letting each individual thread handle one such set. This is the safest approach, yet usually slower, because each individual thread has to perform several different operations. It's also a bit more complex to reason about for the same reason.
Vertically means each thread performs some operation on a specific "field" or "column", or whatever the individual data unit is in your data set.
This is generally easier to implement (each thread does one thing on the whole set) and can be faster. However, each operation on the dataset needs to be independent of the other operations.
If you are unsure about multi-threading in general, I recommend doing work horizontally in parallel.
Now, to the question about whether it is ok to pass your full dataset around (some ArrayList): sure it is! You are only passing a reference, so it doesn't really matter. What matters are the operations you perform on the dataset.
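To make the vertical split concrete, here is a minimal sketch (MyObject, MyObject2 and the two operators are assumptions based on the question, not the asker's code):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class VerticalSplitDemo {

        static class MyObject2 {
            float[] values = new float[12];
        }

        static class MyObject {
            final List<MyObject2> parts;   // shared, read-only input
            int integer1;                  // written only by operator 1
            int integer2;                  // written only by operator 2

            MyObject(List<MyObject2> parts) {
                this.parts = parts;
            }
        }

        public static void run(List<MyObject> rows) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // Operator 1: reads the shared list, writes only integer1.
            Future<?> op1 = pool.submit(() ->
                    rows.forEach(o -> o.integer1 = o.parts.size()));

            // Operator 2: reads the same shared list, writes only integer2.
            Future<?> op2 = pool.submit(() ->
                    rows.forEach(o -> o.integer2 = o.parts.isEmpty() ? 0 : o.parts.get(0).values.length));

            op1.get();   // waiting on the futures also makes the writes visible here
            op2.get();
            pool.shutdown();
        }
    }

Concurrent reads of the shared, never-modified list are safe; the only writes go to distinct fields, and the Future.get() calls guarantee those writes are visible afterwards.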
I have a huge file with a unique word on each line. The size of the file is around 1.6 GB (and I have to sort other files after this which are around 15 GB). Until now, for smaller files I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to avoid writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort()?
If I have to write a sorting program, which one should I use: quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GBs. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really strong candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it amounts to insertion sort, i.e. non-optimal run time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the "first generation" of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
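As a rough illustration of the chunk-sort-then-merge idea, using the heap-based multiway merge mentioned above (this is one possible sketch, not the answerer's code; the chunk size is an assumption you would tune to your heap):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSort {

        private static final int LINES_PER_CHUNK = 1_000_000; // tune to your heap

        // Wraps a sorted chunk file and remembers its current head line.
        private static final class ChunkCursor {
            final BufferedReader reader;
            String head;
            ChunkCursor(BufferedReader reader) throws IOException {
                this.reader = reader;
                this.head = reader.readLine();
            }
        }

        public static void sort(Path input, Path output) throws IOException {
            List<Path> chunks = new ArrayList<>();

            // Phase 1: read chunks small enough for memory, sort them, spill to temp files.
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                List<String> buffer = new ArrayList<>();
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == LINES_PER_CHUNK) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
            }

            // Phase 2: k-way merge, always emitting the smallest head among all chunks.
            PriorityQueue<ChunkCursor> heap =
                    new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.head));
            for (Path chunk : chunks) {
                heap.add(new ChunkCursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8)));
            }
            try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
                while (!heap.isEmpty()) {
                    ChunkCursor smallest = heap.poll();
                    out.write(smallest.head);
                    out.newLine();
                    smallest.head = smallest.reader.readLine();
                    if (smallest.head != null) heap.add(smallest);
                    else smallest.reader.close();
                }
            }
            for (Path chunk : chunks) Files.deleteIfExists(chunk);
        }

        private static Path writeSortedChunk(List<String> lines) throws IOException {
            Collections.sort(lines);
            Path tmp = Files.createTempFile("chunk-", ".txt");
            Files.write(tmp, lines, StandardCharsets.UTF_8);
            return tmp;
        }
    }

Phase 1 only ever keeps one chunk in memory; phase 2 keeps one line per chunk file, so the memory footprint stays small regardless of the input size.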
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're facing a streaming problem. Streaming is the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by writing either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements will be at the start of each. You can then just iterate over the files until you have processed all the elements, repeatedly finding the smallest element and writing it to the new final output.
This approach will reduce the amount of RAM needed and instead rely on drive space and will allow you to handle sorting of any file size.
Build an array of record positions inside the file (a kind of index); maybe it would fit into memory instead. You need one 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file using the index pointers to fetch the records in the needed order.
This will also work if the records are not all the same size.
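A minimal sketch of this index-based approach (not the answerer's code; it assumes line-delimited records and ASCII/Latin-1 content, since RandomAccessFile.readLine does not decode UTF-8, and it boxes the offsets for brevity where a primitive long[] would be leaner):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class IndexSort {

        // Sorts the lines of 'inputPath' into 'outputPath' using an in-memory index of
        // byte offsets only; record bytes are read on demand via RandomAccessFile.
        public static void sort(String inputPath, String outputPath) throws IOException {
            try (RandomAccessFile in = new RandomAccessFile(inputPath, "r");
                 RandomAccessFile out = new RandomAccessFile(outputPath, "rw")) {

                // 1. Build the index: one offset per record (line).
                List<Long> offsets = new ArrayList<>();
                long pos = 0;
                while (true) {
                    offsets.add(pos);
                    if (in.readLine() == null) {
                        offsets.remove(offsets.size() - 1);
                        break;
                    }
                    pos = in.getFilePointer();
                }

                // 2. Sort the index; records are loaded only for comparison, never retained.
                offsets.sort((a, b) -> readLineAt(in, a).compareTo(readLineAt(in, b)));

                // 3. Write the output in index order.
                for (long offset : offsets) {
                    out.write(readLineAt(in, offset).getBytes(StandardCharsets.ISO_8859_1));
                    out.write('\n');
                }
            }
        }

        private static String readLineAt(RandomAccessFile file, long offset) {
            try {
                file.seek(offset);
                return file.readLine(); // readLine treats bytes as ISO-8859-1
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }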
I am in the process of developing a program with Hadoop, which is relatively new to me, so I would be grateful for advice on how to design what I am planning to do.
I have a large ordered set of 1...n images. The images are logically divided into several groups, and each of these groups can be processed independently. However, inside one group all the images depend on each other and therefore should be processed by a single Map task. The images themselves are small, so loading them into memory simultaneously should be no problem.
I thought of packing each group into a separate SequenceFile, but there seems to be no way to read a SequenceFile from an InputStream... Or maybe there is a way to somehow allocate M different nodes for a single MapReduce job, so that each node reads its SequenceFile directly from HDFS?
I was solving similar problems by encoding the images into strings with Base64 and then putting them all into an array field of a JSON object at the preprocessing stage.
Furthermore, if you store the JSON in Avro format, you get the benefit of an out-of-the-box object-oriented interface to your object in the mapper.
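A minimal sketch of that preprocessing step (plain Java, no JSON or Avro library; the directory-per-group layout is an assumption):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Base64;
    import java.util.List;
    import java.util.StringJoiner;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class GroupEncoder {

        // Packs every image of one group into a single JSON line, so the whole group
        // travels as one record and therefore reaches one mapper.
        // Assumes the directory contains only image files.
        public static String encodeGroup(Path groupDir) throws IOException {
            List<Path> images;
            try (Stream<Path> files = Files.list(groupDir)) {
                images = files.sorted().collect(Collectors.toList());
            }
            StringJoiner array = new StringJoiner(",", "[", "]");
            for (Path image : images) {
                byte[] bytes = Files.readAllBytes(image);
                array.add("\"" + Base64.getEncoder().encodeToString(bytes) + "\"");
            }
            // In practice you would build this with a JSON library or an Avro schema.
            return "{\"group\":\"" + groupDir.getFileName() + "\",\"images\":" + array + "}";
        }
    }

Each returned line holds a whole group, so any input format that splits the preprocessed file line by line hands a complete group to a single mapper.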
You might want to have a look at CombineFileInputFormat, which can group inputs according to a PathFilter, say folder-wise. Each group can then be turned into a single split, which will be processed by a single map task, since the number of map tasks equals the number of splits.
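A hedged driver sketch using CombineTextInputFormat, the text-based concrete subclass (folder-wise grouping via PathFilter pools would require subclassing CombineFileInputFormat and is omitted here; the split size and the identity mapper are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineInputJob {

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-inputs");
            job.setJarByClass(CombineInputJob.class);

            // Pack many small files into few splits instead of one split per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // ~64 MB per split

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(Mapper.class);   // identity mapper as a placeholder; plug in your own
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }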
I have used Pig and Hive before but am new to Hadoop MapReduce. I need to write an application which has multiple small files as input (say 10). They have different file structures, so I want to process them in parallel on separate nodes so that they can be processed quickly. I know that the strong point of Hadoop is processing large data, but these input files, though small, require a lot of processing, so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?
It is possible but you're probably not going to get much value. You have these forces against you:
Confused input
You'll need to write a mapper which can handle all of the different input formats (either by detecting the input format, or using the filename of the input to decide which format to expect)
Multiple outputs
You need to either use the slightly tricky multiple-output-file handling functionality of Hadoop, or write your output as a side effect of the reducer (or of the mapper, if you can be sure that each file will go to a different node); see the sketch after this list.
High cost of initialization
Every Hadoop MapReduce job comes with a hefty start-up cost, about 30 seconds on a small cluster and much more on a larger cluster. This point alone will probably lose you more time than you could ever hope to gain through parallelism.
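For the multiple-output route mentioned above, Hadoop's MultipleOutputs class is the usual tool; a minimal sketch (the named outputs, routing rule and key/value types are assumptions):

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class PerFileReducer extends Reducer<Text, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> outputs;

        // In the driver, register one named output per logical result, e.g.:
        //   MultipleOutputs.addNamedOutput(job, "fileA", TextOutputFormat.class, NullWritable.class, Text.class);
        //   MultipleOutputs.addNamedOutput(job, "fileB", TextOutputFormat.class, NullWritable.class, Text.class);

        @Override
        protected void setup(Context context) {
            outputs = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Route each record to the output matching its source (here the key carries the source name).
            String namedOutput = key.toString().startsWith("fileA") ? "fileA" : "fileB";
            for (Text value : values) {
                outputs.write(namedOutput, NullWritable.get(), value);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            outputs.close();
        }
    }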
In brief: give NLineInputFormat a try.
There is no problem with copying all your input files to all nodes (you can put them in the distributed cache if you like). What you really want to distribute is the check processing.
With Hadoop you can create a (single!) input control file in the format (filename, check2run) or (filename, format, check2run) and use NLineInputFormat to feed a specified number of checks to each of your nodes (mapreduce.input.lineinputformat.linespermap controls the number of lines fed to each mapper).
Note: Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about blocks.
Depending on the nature of your checks, you may be able to compute a linespermap value that covers all files/checks in one wave of mappers (or you may be unable to use this approach at all :) )
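A minimal driver sketch of this setup (the control file name, the comma-separated line format and the CheckMapper body are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CheckDriver {

        // One mapper invocation receives one control line: "filename,check2run".
        public static class CheckMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text controlLine, Context context)
                    throws java.io.IOException, InterruptedException {
                String[] parts = controlLine.toString().split(",");
                String filename = parts[0];
                String check = parts[1];
                // ... run 'check' against 'filename' (the files could come from the distributed cache)
                context.write(new Text(filename), new Text(check + ": OK"));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "distributed-checks");
            job.setJarByClass(CheckDriver.class);

            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path("control.txt")); // the single control file
            NLineInputFormat.setNumLinesPerSplit(job, 1);                // one check per mapper

            job.setMapperClass(CheckMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }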
I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially, each reading in all the data. I assume it will be faster to combine the tasks and execute their similar parts (like feeding all data to the mapper) only once.
I was wondering if and how I can combine these tasks. For every input key/value pair, the mapper could emit a "super key" that includes a task id and the task-specific key data, along with a value. This way, reducers would get key/value pairs for a task and a task-specific key, and could decide, when seeing the "super key", which task to perform on the included key and values.
In pseudo code:
    map(key, value):
        emit(SuperKey("Task 1", IncludedKey), value)
        emit(SuperKey("Task 2", AnotherIncludedKey), value)

    reduce(key, values):
        if key.taskid == "Task 1":
            for value in values:
                // do stuff with key.includedkey and value
        else:
            // do something else
The key could be a WritableComparable which can include all the necessary information.
Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
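For concreteness, such a SuperKey WritableComparable could be sketched as follows (the field names follow the pseudo code; the layout is one possible choice, not an existing implementation):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    // A composite key carrying the task id plus the task-specific key.
    // Sorting by taskId first keeps all records of one task together in the reducer.
    public class SuperKey implements WritableComparable<SuperKey> {

        private final Text taskId = new Text();
        private final Text includedKey = new Text();

        public SuperKey() { }                       // required by Hadoop serialization

        public SuperKey(String taskId, String includedKey) {
            this.taskId.set(taskId);
            this.includedKey.set(includedKey);
        }

        public String getTaskId() { return taskId.toString(); }
        public String getIncludedKey() { return includedKey.toString(); }

        @Override
        public void write(DataOutput out) throws IOException {
            taskId.write(out);
            includedKey.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            taskId.readFields(in);
            includedKey.readFields(in);
        }

        @Override
        public int compareTo(SuperKey other) {
            int byTask = taskId.compareTo(other.taskId);
            return byTask != 0 ? byTask : includedKey.compareTo(other.includedKey);
        }

        @Override
        public int hashCode() {                     // used by the default HashPartitioner
            return taskId.hashCode() * 31 + includedKey.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SuperKey)) return false;
            SuperKey other = (SuperKey) o;
            return taskId.equals(other.taskId) && includedKey.equals(other.includedKey);
        }
    }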
My questions are:
Is this a sensible approach?
Are there better alternatives?
Does it have some terrible drawback?
Would I need a custom Partitioner class for this approach?
Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.
The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.
Is this a sensible approach?
There's nothing inherently wrong with it, other than the coupling of the maintenance of the different jobs' logic. I believe it will save you on some disk I/O, which could be a win if your disk is a bottleneck for your process (on small clusters this can be the case).
Are there better alternatives?
It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
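A rough sketch of what such a delegating Mapper might look like (the configuration key and the MapLogic interface are invented for illustration; they are not an existing Hadoop API):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.ReflectionUtils;

    // Pluggable per-task map logic; concrete tasks implement this interface.
    interface MapLogic {
        void map(Text key, Text value, Mapper<Text, Text, Text, Text>.Context context)
                throws IOException, InterruptedException;
    }

    // Generic mapper that defers to whatever MapLogic class the job configuration names.
    // The driver would set it with: conf.setClass(LOGIC_CLASS_KEY, Task1Logic.class, MapLogic.class)
    public class DelegatingTaskMapper extends Mapper<Text, Text, Text, Text> {

        public static final String LOGIC_CLASS_KEY = "job.map.logic.class"; // hypothetical key

        private MapLogic logic;

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            Class<? extends MapLogic> logicClass =
                    conf.getClass(LOGIC_CLASS_KEY, null, MapLogic.class);
            logic = ReflectionUtils.newInstance(logicClass, conf);
        }

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            logic.map(key, value, context);
        }
    }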
Does it have some terrible drawback?
The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
Would I need a custom Partitioner class for this approach?
Probably, depending on what you're doing. I think in general, if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be some library Partitioner that could be configured for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and use String field separators instead of rolling your own).
HTH. If you can give a little more context, maybe I could offer more advice. Good luck!
You can use:
Cascading
Oozie
Both are used to write workflows in Hadoop.
I think Oozie is the best option for this. It's a workflow scheduler where you can combine multiple Hadoop jobs, with the output of one action node serving as the input to the next action node. And if any of the actions fails, then the next time you execute it, the scheduler starts from the point where the error was encountered.
http://www.infoq.com/articles/introductionOozie