I need to write a batch job in Java that uses multiple threads to perform various operations on a bunch of data.
I have almost 60k rows of data and need to perform different operations on them. Some of the operations work on the same data but produce different outputs.
So the question is: is it right to create this big 60k-element ArrayList and pass it through the various operators, so that each one can add its own output, or is there a better architecture someone can suggest?
EDIT:
I need to create these objects:
MyObject, with an ArrayList of MyObject2, 3 different Integers, 2 Strings.
MyObject2, with 12 floats
MyBigObject, with an ArrayList of MyObject, usually of 60k elements, and some Strings.
My different operators work on the same ArrayList of MyObject2 but write their outputs to the Integers. For example, Operator1 reads from the ArrayList of MyObject2, performs some calculation, and writes its result to MyObject.Integer1; Operator2 reads from the ArrayList of MyObject2, performs a different calculation, and writes its result to MyObject.Integer2; and so on.
Is this architecture "safe"? The ArrayList of MyObject2 has to be read-only, never modified by any operator.
EDIT:
Actually, I don't have any code yet because I'm studying the architecture first, and then I'll start writing something.
Trying to rephrase my question:
Is it OK, in a batch job written in pure Java (without any framework; I'm not using, for example, Spring Batch, because it would be like shooting a fly with a shotgun for my project), to create a macro object and pass it around so that every thread can read from the same data but write its results to different fields?
Can it be dangerous if different threads read from the same data at the same time?
It depends on your operations.
Generally it's possible to partition work on a dataset horizontally or vertically.
Horizontal partitioning means splitting your dataset into several smaller sets and letting each individual thread handle one such set. This approach is the safest, yet usually slower, because each individual thread has to perform several different operations. It's also a bit more complex to reason about, for the same reason.
Vertical partitioning means each thread performs some operation on a specific "field" or "column", or whatever the individual data unit is in your dataset.
This is generally easier to implement (each thread does one thing on the whole set) and can be faster. However, each operation on the dataset needs to be independent of the other operations.
If you are unsure about multi-threading in general, I recommend doing work horizontally in parallel.
Now, to the question of whether it is OK to pass your full dataset around (as some ArrayList): sure it is! It's just a reference that gets passed, so the size doesn't really matter. What matters are the operations you perform on the dataset.
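As a rough sketch of the design you describe (the names and the per-operator calculations are made up): each operator thread reads the same shared, read-only list and writes only to its own field, so no two threads ever write to the same memory.

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class VerticalBatch {
        static class MyObject2 { float[] values = new float[12]; }

        static class MyObject {
            final List<MyObject2> details;          // shared read-only input
            volatile int result1, result2, result3; // each operator owns exactly one field
            MyObject(List<MyObject2> details) {
                this.details = Collections.unmodifiableList(details);
            }
        }

        static void runOperators(List<MyObject> bigList) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(3);
            pool.submit(() -> bigList.forEach(o -> o.result1 = computeSum(o.details)));
            pool.submit(() -> bigList.forEach(o -> o.result2 = computeMax(o.details)));
            pool.submit(() -> bigList.forEach(o -> o.result3 = computeCount(o.details)));
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Placeholder calculations, only here so the sketch compiles.
        static int computeSum(List<MyObject2> d)   { return d.size(); }
        static int computeMax(List<MyObject2> d)   { return d.size(); }
        static int computeCount(List<MyObject2> d) { return d.size(); }
    }

As long as no operator modifies the shared list, concurrent reads are safe; the only thing to be careful about is visibility of the results, which the volatile fields (or joining the worker threads before reading) take care of.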
Related
What I plan to do is create a large set of objects (~500). Currently I have a large Java file in which I create all objects in the following form:
MyCollection.add(String name, int strength, int size, int price);
But this takes up a lot of space, and I don't think it is the best way.
I don't really know my way around in this area. How is such a problem normally handled? Would it make sense to create a CSV file or use some kind of database?
It depends on where this data comes from. Do you generate it at runtime? If so, consider abstraction: define a proper structure of objects and encapsulate the data and the process of creating it properly. If the data is static, then yes, a database of some sort makes more sense. Specifically, that means making your data persistent and simplifying the process of reading it as much as possible, e.g. reading the data in a loop, encapsulating each entry in an object such as MyFancyEntry, and then adding it to your collection.
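A minimal sketch of that read loop, assuming a hypothetical comma-separated file whose columns match the name/strength/size/price structure from the question (Java 16+ for the record):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    class CsvLoader {
        // One small value object per row instead of hundreds of hand-written add(...) calls.
        record MyFancyEntry(String name, int strength, int size, int price) { }

        static List<MyFancyEntry> load(String csvPath) throws IOException {
            List<MyFancyEntry> entries = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get(csvPath))) {
                if (line.isBlank()) continue;
                String[] parts = line.split(",");
                entries.add(new MyFancyEntry(
                        parts[0].trim(),
                        Integer.parseInt(parts[1].trim()),
                        Integer.parseInt(parts[2].trim()),
                        Integer.parseInt(parts[3].trim())));
            }
            return entries;
        }
    }

Changing the structure then means changing one record definition and one data file, not 500 lines of code.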
You are, in my opinion, also right that 500 manual add calls on a Collection are bad design. The reason is simple: maintainability.
It is hard to change the structure quickly if a structural change means editing 500 lines by hand.
There is also more room for mistakes when you have so many nearly identical lines.
I have an ArrayList that I am using in my multi-threaded application, and I want some way to iterate through the list without causing any exceptions to be thrown when I add an element to it as I am iterating. Is there a way to stop the ArrayList from being modified while I iterate through it?
Edit:
I now realize that my question was very poorly written, and the downvotes are deserved. This is an attempt to fix my question.
What I want to do is have some way to 'block' the list before I iterate through it so that I don't get a ConcurrentModificationException. The issue is that the iteration will take quite a bit of processor time to complete, because the list will be very large and the action I wish to carry out on each element will take a fair amount of time. This is a problem if I use synchronized methods, because the add method would block for a long time and decrease application performance. So what I am trying to do is create a class that imitates an ArrayList, except that when it is 'blocked' and a method tries to modify it, it stores that request, and when the list is 'unblocked' it performs all the stored requests in a separate thread. The issue is that when I try to implement this strategy, I have to store the requests in some sort of list, and I run into the same problem as before: having to block additions to the request list while the request thread is iterating over the requests. I'm at a loss for how to implement this solution, or even whether it is the right one. If anyone could help me, that would be much appreciated.
Your options are either to work with synchronization or to use one of the implementations provided in the java.util.concurrent package. You also need to read up on the issue: you are asking a very fundamental and classical question, and there is a LOT of information about it. But here are your basic options:
Use synchronization. It is very expensive performance-wise but absolutely bullet-proof. Read up on the synchronized keyword, or on the Lock interface and its implementations. Also look at the Semaphore class. Note that this option can create a very serious bottleneck in your performance.
As one of the comments said, use the CopyOnWriteArrayList class. It also has some drawbacks, but in the majority of cases it is a better option than full synchronization.
When choosing between the two options, consider the following points: while synchronization is a bullet-proof solution if done right, it is very tedious work that demands a lot of testing, which is not trivial. So that is already a big drawback. Plus, in the majority of cases reads outnumber writes, or the list is small enough (say, up to a few hundred elements) that copy-on-write is acceptable. So my guess is that in the majority of cases CopyOnWriteArrayList would be preferable. The point is that there is no clear-cut answer when choosing between the two options above; a programmer needs to look at the circumstances and choose the option that fits his or her case better.
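A tiny sketch of the second option, just to show the behaviour: the iterator works on a snapshot of the list, so a concurrent add never throws a ConcurrentModificationException (the sleep only simulates your slow per-element work):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.TimeUnit;

    class SnapshotIterationDemo {
        public static void main(String[] args) throws InterruptedException {
            List<String> list = new CopyOnWriteArrayList<>();
            list.add("a");
            list.add("b");

            // A writer thread keeps adding while the main thread iterates.
            Thread writer = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    list.add("x" + i);
                }
            });
            writer.start();

            // The for-each loop iterates over the snapshot taken when it starts,
            // so it never throws ConcurrentModificationException.
            for (String s : list) {
                TimeUnit.MILLISECONDS.sleep(1); // simulate slow per-element processing
                System.out.println(s);
            }

            writer.join();
            System.out.println("final size: " + list.size());
        }
    }

The trade-off is exactly the one described above: every add copies the whole backing array, so it only pays off when reads clearly outnumber writes.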
I want to implement a word-length program that categorizes the words of a large corpus into 4 categories, using local aggregation methods, but I don't have deep knowledge of how these methods work, because I am very new to the MapReduce field. For example, what are the sharpest differences between a combiner and an in-mapper combiner? In addition, I should add a combiner and an in-mapper combiner to my code and measure the differences between them. But I have no idea where to start; if someone could help me, I would appreciate it.
Implementing an in-map combiner (as best described here) is the process of writing code within the scope of a map() method which stores multiple key-value pairs and performs some kind of aggregation before outputting. This is different from typical map() methods, which tend to deal with only a single key-value pair at a time. This is quite risky, as the developer is required to be very careful with memory allocation.
In-map combiners are typically used for ranking lists, i.e. an ArrayList is used to store the X highest-scoring entries seen by the mapper, and these are output once all key-value pairs have passed through the mapper. There's obviously little risk of running out of memory (unless X, or the keys or values, are very large), and so lots of data can be discarded immediately.
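For your word-length case the in-map combiner is even simpler, because there are only four possible keys. A rough sketch, assuming the newer org.apache.hadoop.mapreduce API and made-up category boundaries:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In-map combining: aggregate counts in a HashMap inside the mapper and
    // emit them once, in cleanup(), instead of emitting one pair per word.
    public class WordLengthInMapCombiner
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, Integer> counts = new HashMap<>();

        // Hypothetical categorisation by word length.
        private String categorize(String word) {
            int len = word.length();
            if (len <= 3)  return "tiny";
            if (len <= 6)  return "small";
            if (len <= 10) return "medium";
            return "big";
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            for (String word : value.toString().split("\\s+")) {
                if (word.isEmpty()) continue;
                counts.merge(categorize(word), 1, Integer::sum);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the aggregated counts once per mapper instance.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }

Because the map only ever holds four entries, the usual memory concern doesn't really apply here.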
Alternatively, regular combiners are basically reducers that are executed immediately after a map phase finishes, on the same node. The advantage is that the developer doesn't have to worry about implementing their own grouping (unlike with the in-map combiner), and therefore memory issues are less likely. The main disadvantage is that you can't guarantee that a combiner will run.
Regular combiners are often used for things such as counts; WordCount with a combiner is the classic example.
For your case, I would always look to a regular combiner. Let it do all the work of grouping your categories, and avoid worrying about memory.
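Wiring up a regular combiner is usually just one extra line in the driver. A sketch assuming hypothetical WordLengthMapper and IntSumReducer classes, where the mapper emits (category, 1) pairs and the same summing reducer is reused as the combiner (safe because summing is associative and commutative, and the combiner may run zero or more times):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordLengthDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word length categories");
            job.setJarByClass(WordLengthDriver.class);

            job.setMapperClass(WordLengthMapper.class);  // emits (category, 1) per word
            job.setCombinerClass(IntSumReducer.class);   // may or may not run; must be safe either way
            job.setReducerClass(IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }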
At my company, developers go to great lengths to avoid creating objects inside mappers/reducers, e.g. working with the basic Avro record (using positions), working with byte arrays and streams instead of objects, etc.
This sounds to me like over-optimization. Java-based servers need to be performant as well, but people don't program like this.
So what is right?
I don't think you can say right or wrong, but perhaps overkill. You're (presumably) sacrificing readability and maintainability for some performance gains. Remember that if you get your reducer to run 1 second faster and your job uses 100 nodes to reduce, it doesn't finish 100 seconds faster, only 1 second faster, assuming an equal distribution of keys and available resources at the start.
Personally, I declare class-level variables and initialize them in the constructor (see tip #6). Then I set them rather than creating new objects within the mapper or reducer; this way you only incur the allocation cost once. You just have to make sure to clear or reset the objects at the start of the map or reduce method, to ensure you don't carry state over from a previous invocation.
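A small sketch of that pattern, assuming the standard Hadoop Mapper API (here the reusable objects are initialized at declaration rather than in a constructor, which amounts to the same thing):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReusingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Created once per mapper instance, not once per input record.
        private final Text outKey = new Text();
        private final IntWritable outValue = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (word.isEmpty()) continue;
                outKey.set(word);   // reset the reused objects at each use,
                outValue.set(1);    // so no state carries over between records
                context.write(outKey, outValue);
            }
        }
    }

Reusing the output objects like this is safe because Hadoop serializes the key and value during context.write(), so the framework does not hold on to the object references.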
I have a list of lists whose size reaches into the hundreds of millions. Let's say each of the lists inside the list is a sentence of a text. I would like to partition this data for processing in different threads. I used subList to split the data and send it to different threads for processing. Is this a standard approach for partitioning data? If not, could you please suggest a standard approach?
This will work as long as you do not "structurally modify" the list or any of these sub-lists. Read-only processing is fine.
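A minimal sketch of that read-only split, assuming the data is a List<List<String>> and process(...) stands in for whatever each thread actually does with its partition:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class SentencePartitioner {

        // Stand-in for the per-partition work.
        static void process(List<List<String>> partition) { /* ... */ }

        static void run(List<List<String>> sentences, int threads) throws InterruptedException {
            int chunk = (sentences.size() + threads - 1) / threads;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int start = 0; start < sentences.size(); start += chunk) {
                // subList returns a view on the backing list: cheap, but the backing
                // list must not be structurally modified while these views are in use.
                List<List<String>> part =
                        sentences.subList(start, Math.min(start + chunk, sentences.size()));
                pool.submit(() -> process(part));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }

Because subList only returns views, partitioning this way costs almost nothing; the one rule is that nothing may structurally modify the backing list while the workers are running.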
There are many other "big data" approaches to handling 100s of millions of records, because there are other problems you might hit:
If your program fails (e.g. OutOfMemoryError), you probably don't want to have to start over from the beginning.
You might want to throw >1 machine at the problem, at which point you can't share the data within a single JVM's memory.
After you've processed each sentence, are you building some intermediate result and then processing that as a step 2? You may need to put together a pipeline of steps where you re-partition the data before each step.
You might find you have too many sentences to fit them all into memory at once.
A really common tool for this kind of work is Hadoop. You'd copy the data into HDFS, run a map-reduce job (or more than one job) on the data and then copy the data out of HDFS when you're done.
A simpler approach to implement is just to use a database: assign different ranges of the integer sentence_id column to different threads and build your output in another table.
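A sketch of that database variant, assuming a sentences table with an integer sentence_id column; the JDBC URL, the results table, and the per-sentence processing are all made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class RangeWorker implements Runnable {
        private final long fromId;
        private final long toId;

        RangeWorker(long fromId, long toId) {
            this.fromId = fromId;
            this.toId = toId;
        }

        @Override
        public void run() {
            String select = "SELECT sentence_id, sentence FROM sentences WHERE sentence_id BETWEEN ? AND ?";
            String insert = "INSERT INTO sentence_results (sentence_id, result) VALUES (?, ?)";
            try (Connection con = DriverManager.getConnection("jdbc:...", "user", "password");
                 PreparedStatement read = con.prepareStatement(select);
                 PreparedStatement write = con.prepareStatement(insert)) {
                read.setLong(1, fromId);
                read.setLong(2, toId);
                try (ResultSet rs = read.executeQuery()) {
                    while (rs.next()) {
                        write.setLong(1, rs.getLong("sentence_id"));
                        write.setString(2, processSentence(rs.getString("sentence")));
                        write.addBatch();
                    }
                }
                write.executeBatch();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }

        // Stand-in for the actual per-sentence processing.
        private static String processSentence(String sentence) { return sentence.toUpperCase(); }
    }

Each thread gets its own RangeWorker with a disjoint fromId/toId range, so the workers never contend on the same rows.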