In my company, developers go to great lengths to avoid creating objects inside mappers/reducers, e.g., working with the basic Avro record (using positions), or working with byte arrays and streams instead of objects.
This sounds like over-optimization to me. Java-based servers need to be performant as well, but people don't program like this there.
So what is right?
I don't think you can say right or wrong, but perhaps overkill. You're (presumably) sacrificing readability and maintainability for some performance gain. Remember that if you get your reducer to run 1 second faster and your job uses 100 nodes to reduce, it doesn't finish 100 seconds faster, only 1, assuming an equal distribution of keys and available resources at the start.
Personally I declare class variables and initialize them in my constructor (see tip #6). Then I set them rather than creating new objects within the mapper or reducer. This way you only incur the hit once. You just have to make sure to clear the object at the start of the map or reduce method to ensure you don't have carryover from a previous invocation.
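A minimal sketch of that reuse pattern in a Hadoop mapper (the class name and the tokenizing logic are illustrative, not from the original post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Allocated once per task, not once per record.
    private final Text outKey = new Text();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            // Reset the reused objects instead of allocating new ones;
            // context.write() serializes immediately, so this is safe.
            outKey.set(token);
            outValue.set(1);
            context.write(outKey, outValue);
        }
    }
}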
Related
Let's assume I want to store (integer) x/y values. What is considered more efficient: storing them in a primitive value like long (which fits perfectly, since sizeof(long) = 2 * sizeof(int)) using bit operations like shift, or, and a mask, or creating a Point class?
Keep in mind that I want to create and store many(!) of these points (in a loop). Would there be a performance issue when using classes? The only reason I would prefer storing in primitives over storing in a class is the garbage collector. I guess generating new objects in a loop would trigger the GC way too much, is that correct?
Of course, packing those into a long[] is going to take less memory (though it is going to have to be contiguous). For each object (a Point) you will pay at least 12 bytes more for the two header words.
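For reference, a sketch of the long-packing approach (the helper names are mine):

public final class PackedPoints {

    // Pack two 32-bit ints into one 64-bit long: x in the high bits, y in the low bits.
    static long pack(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL); // mask y to avoid sign extension
    }

    static int unpackX(long p) {
        return (int) (p >>> 32);
    }

    static int unpackY(long p) {
        return (int) p; // the cast keeps the low 32 bits
    }

    public static void main(String[] args) {
        long[] points = new long[1_000_000]; // one contiguous allocation, no per-point objects
        points[0] = pack(-3, 42);
        System.out.println(unpackX(points[0]) + ", " + unpackY(points[0])); // prints -3, 42
    }
}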
On the other hand, if you are creating them in a loop and escape analysis can prove they don't escape, the JIT can apply an optimization called "scalar replacement" (though it is very fragile), where your objects will not be allocated at all. Instead, those objects will be "desugared" into their fields.
The general rule is that you should write code in the way that is easiest to maintain and read. If and only if you see performance issues (via a profiler, let's say, or too many pauses) should you look at the GC logs and potentially optimize the code.
As an addendum, the JDK code itself is full of such longs where each bit means something different, so JDK developers do pack them. But then, neither I nor (I suspect) you are JDK developers. There, such things matter; for us, I have serious doubts.
I'm programming something in Java; for context, see this question: Markov Model decision process in Java
I have two options:
byte[][] mypatterns = new byte[MAX][4];
or
ArrayList<byte[]> mypatterns
I can use a Java ArrayList and append new arrays whenever I create them, or use a static array by calculating all possible data combinations, then looping through to see which indexes are 'on or off'.
Essentially, I'm wondering if I should allocate a large block that may contain uninitialized values, or use the dynamic array.
This runs every frame, so looping through 200 elements each frame could be very slow, especially because I will have multiple instances of this loop.
Based on theory and what I have heard, dynamic arrays are very inefficient.
My question is: Would looping through an array of say, 200 elements be faster than appending an object to a dynamic array?
Edit>>>
More information:
I will know the max length of the array, if it is static.
The items in the array will frequently change, but their sizes are constant, so I can easily overwrite them.
Allocating it statically would be akin to using a memory pool.
Some instances may have more or less of the data initialized than others.
You're right, really, that I should use a profiler first, but I'm also just curious about the question 'in theory'.
The "theory" is too complicated. There are too many alternatives (different ways to implement this) to analyse. On top of that, the actual performance for each alternative will depend on the the hardware, JIT compiler, the dimensions of the data structure, and the access and update patterns in your (real) application on (real) inputs.
And the chances are that it really doesn't matter.
In short, nobody can give you an answer that is well founded in theory. The best we can give is recommendations that are based on intuition about performance, and / or based on software engineering common sense:
simpler code is easier to write and to maintain,
a compiler is a more consistent1 optimizer than a human being,
time spent on optimizing code that doesn't need to be optimized is wasted time.
1 - Certainly over a large code-base. Given enough time and patience, a human can do a better job for some problems, but that is not sustainable over a large code-base, and it doesn't take into account the facts that 1) compilers are always being improved, 2) optimal code can depend on things that a human cannot take into account, and 3) a compiler doesn't get tired and make mistakes.
The fastest way to iterate over bytes is as a single array. A faster way to process them is as int or long values, since processing 4-8 bytes at a time is faster than processing one byte at a time; however, it rather depends on what you are doing. Note: a byte[4] is actually 24 bytes on a 64-bit JVM, which means you are not making efficient use of your CPU cache. If you don't know the exact size you need, you might be better off creating one buffer larger than you need, even if you don't use all of it. I.e., in the case of the byte[][], you are already using 6x the memory you really need.
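A sketch of the flat single-array layout suggested here (MAX and the 4-byte width come from the question; the helper names are mine):

public class FlatPatterns {

    static final int MAX = 200; // number of patterns, from the question
    static final int WIDTH = 4; // bytes per pattern

    // One contiguous allocation instead of MAX separate byte[4] objects,
    // each of which would carry its own object header and padding.
    final byte[] data = new byte[MAX * WIDTH];

    byte get(int pattern, int offset) {
        return data[pattern * WIDTH + offset];
    }

    void set(int pattern, int offset, byte value) {
        data[pattern * WIDTH + offset] = value;
    }
}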
Any performance difference will not be visible when you set initialCapacity on the ArrayList. You say that your collection's size can never change, but what if this logic changes?
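For example (a sketch, assuming MAX is the known maximum from the question):

// Pre-sizing avoids re-allocations of the backing array as the list grows.
List<byte[]> mypatterns = new ArrayList<>(MAX);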
Using an ArrayList, you also get access to a lot of methods, such as contains.
As other people have said already, use ArrayList unless performance benchmarks say it is a bottleneck.
I need to write a batch job in Java that uses multiple threads to perform various operations on a bunch of data.
I have almost 60k rows of data and need to do different operations on them. Some of the operations work on the same data but produce different outputs.
So, the question is: is it right to create this big 60k-length ArrayList and pass it through the various operators, so each one can add its own output, or is there a better architecture design that someone can suggest?
EDIT:
I need to create these objects:
MyObject, with an ArrayList of MyObject2, 3 different Integers, and 2 Strings.
MyObject2, with 12 floats.
MyBigObject, with an ArrayList of MyObject (usually of 60k elements) and some Strings.
My different operators work on the same ArrayList of MyObject2 but write their outputs to the Integers; so, for example, Operator1 fetches from the ArrayList of MyObject2, performs some calculation, and writes its result to MyObject.Integer1; Operator2 fetches from the ArrayList of MyObject2, performs some different calculation, and writes its result to MyObject.Integer2; and so on.
Is this architecture "safe"? The ArrayList of MyObject2 has to be read-only, never edited by any operator.
EDIT:
Actually, I don't have any code yet because I'm working out the architecture first, and then I'll start writing something.
Trying to rephrase my question:
Is it OK, in a batch job written in pure Java (without any framework; I'm not using, for example, Spring Batch because it would be like shooting a fly with a shotgun for my project), to create one macro object and pass it around so that every thread can read from the same data but write its results to different data?
Can it be dangerous if different threads read from the same data at the same time?
It depends on your operations.
Generally it's possible to partition work on a dataset horizontally or vertically.
Horizontal means splitting your dataset into several smaller sets and letting each individual thread handle such a set. This is the safest approach yet usually slower, because each individual thread does several different operations. It's also a bit more complex to reason about, for the same reason.
Vertical means each thread performs some operation on a specific "field" or "column" or whatever the individual data unit in your dataset is.
This is generally easier to implement (each thread does one thing on the whole set) and can be faster. However, each operation on the dataset needs to be independent of your other operations.
If you are unsure about multi-threading in general, I recommend doing work horizontally in parallel.
Now to the question about whether it's OK to pass your full dataset around (some ArrayList): sure it is! It's just a reference being passed, and it won't really matter. What matters are the operations you perform on the dataset.
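A minimal sketch of that arrangement, reusing names from the question where possible (the executor wiring, the float[] stand-in for MyObject2, and the operator bodies are illustrative assumptions):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchSketch {

    static class MyObject {
        final List<float[]> details; // stands in for the ArrayList of MyObject2
        volatile int result1;        // written only by operator 1
        volatile int result2;        // written only by operator 2

        MyObject(List<float[]> details) {
            // Wrap the shared input so no operator can modify it.
            this.details = Collections.unmodifiableList(details);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<MyObject> rows = new ArrayList<>(); // the ~60k-element list from the question
        // ... populate rows fully before starting any threads ...

        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Each operator reads the same effectively-immutable input and
        // writes to a field no other operator touches.
        pool.submit(() -> rows.forEach(r -> r.result1 = r.details.size()));
        pool.submit(() -> rows.forEach(r -> r.result2 = r.details.size() * 2));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}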
I want to implement a word-length program that categorizes words into 4 categories over a large corpus using local aggregation methods, but I don't have deep knowledge of how these methods work, because I am so new to the MapReduce field. For example, what are the sharpest differences between a combiner and an in-mapper combiner? In addition, I should add a combiner and an in-mapper combiner to my code and measure the differences between them. But I don't have any idea where I should start; if someone can help me, I'd appreciate it.
Implementing an in-map combiner (as best described here) is the process of writing code within the scope of a map() method which stores multiple key-value pairs and performs some kind of aggregation function before outputting. This is different from a typical map() method, which tends to deal with only a single key-value pair at a time. This is quite risky, as the developer is required to be very careful with memory allocation.
In-map combiners are typically used for ranking lists - i.e. an ArrayList is used to store the X highest-scoring entries seen by the mapper, which are output once all key-value pairs have entered the mapper. There's obviously little risk of running out of memory (unless X, or the key or value, is very large), and so lots of data can be immediately discarded.
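For illustration, a sketch of an in-mapper combiner for the word-length problem in the question (the four category names and boundaries are my assumptions):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombinerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Per-task aggregation buffer; with only 4 possible keys it cannot grow unboundedly.
    private final Map<String, Integer> counts = new HashMap<>();

    private static String category(int length) {
        if (length <= 3) return "short";
        if (length <= 6) return "medium";
        if (length <= 9) return "long";
        return "very-long";
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(category(word.length()), 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the aggregated totals once, after all input records have been seen.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}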
Alternatively, regular combiners are basically reducers that are executed immediately after a map phase finishes, on the same node. The advantage is that the developer doesn't have to worry about implementing their own groupings (unlike with the in-map combiner), and therefore memory issues are less likely. The main disadvantage is that you can't guarantee that a combiner will run.
Regular combiners are often used for things such as counts - WordCount with a combiner (such as this) is the classic example.
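Wiring a regular combiner up is then just a driver-side setting. A sketch, assuming a Job named job and illustrative class names (the reducer's operation must be commutative and associative for this reuse to be valid):

// Reuse the sum reducer as the combiner. Hadoop may run it zero or
// more times per map output, so the operation must be commutative
// and associative.
job.setMapperClass(WordLengthMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);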
For your case, I would always look to a regular combiner. Let it do all the work of grouping your categories, and avoid worrying about memory.
All,
I have been going through a lot of sites that post about the performance of various Collection classes for various actions, i.e. adding an element, searching, and deleting. But I also notice that all of them describe different environments in which the tests were conducted, i.e. OS, memory, running threads, etc.
My question is: is there any site/material that provides the same performance information on a best-test-environment basis? I.e., the configuration should not be an issue or a catalyst for the poor performance of any specific data structure.
[Updated]: For example, HashSet and LinkedHashSet both have a complexity of O(1) for inserting an element. However, Bruce Eckel's test claims that insertion is going to take more time for LinkedHashSet than for HashSet [http://www.artima.com/weblogs/viewpost.jsp?thread=122295]. So should I still go by the Big-O notation?
Here are my recommendations:
First of all, don't optimize :) Not that I am telling you to design crap software, but just to focus on design and code quality more than on premature optimization. Assuming you've done that, and now you really need to worry about which collection is best beyond purely conceptual reasons, let's move on to point 2.
Really, don't optimize yet (roughly stolen from M. A. Jackson)
Fine. So your problem is that even though you have theoretical time complexity formulas for best cases, worst cases and average cases, you've noticed that people say different things and that practical settings are a very different thing from theory. So run your own benchmarks! You can only read so much, and while you do that your code doesn't write itself. Once you're done with the theory, write your own benchmark - for your real-life application, not some irrelevant mini-application for testing purposes - and see what actually happens to your software and why. Then pick the best algorithm. It's empirical, it could be regarded as a waste of time, but it's the only way that actually works flawlessly (until you reach the next point).
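If you want a concrete starting point for such a benchmark on the JVM, JMH is the usual harness. A minimal sketch comparing the two sets discussed in this question (the element count is arbitrary):

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class SetAddBenchmark {

    private static final int N = 100_000;

    @Benchmark
    public Set<Integer> hashSetAdd() {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < N; i++) set.add(i);
        return set; // returning the result prevents dead-code elimination
    }

    @Benchmark
    public Set<Integer> linkedHashSetAdd() {
        Set<Integer> set = new LinkedHashSet<>();
        for (int i = 0; i < N; i++) set.add(i);
        return set;
    }
}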
Now that you've done that, you have the fastest app ever. Until the next update of the JVM. Or of some underlying component of the operating system your particular performance bottleneck depends on. Guess what? Maybe your clients have different ones. Here comes the fun part: you need to be sure that your benchmark is valid for others or in most cases (or have fun writing code for different cases). You need to collect data from users. LOTS. And then you need to do that over and over again to see what happens and whether it still holds true. And then re-write your code accordingly over and over again. (The - now terminated - Engineering Windows 7 blog is actually a nice example of how user data collection helps to make educated decisions to improve user experience.)
Or you can... you know... NOT optimize. Platforms and compilers will change, but a good design should - on average - perform well enough.
Other things you can also do:
Have a look at the JVM's source code. It's very educational and you'll discover a host of hidden things (I'm not saying that you have to use them...)
See that other thing on your TODO list that you need to work on? Yes, the one near the top that you always skip because it's too hard or not fun enough. That one right there. Well, get to it and leave the optimization thingy alone: it's the evil child of a Pandora's box and a Moebius band. You'll never get out of it, and you'll deeply regret you tried to have your way with it.
That being said, I don't know why you need the performance boost so maybe you have a very valid reason.
And I am not saying that picking the right collection doesn't matter. Just that once you know which one to pick for a particular problem, and you've looked at the alternatives, you've done your job without having to feel guilty. Collections usually have a semantic meaning, and as long as you respect it you'll be fine.
In my opinion, all you need to know about a data structure is the Big-O of the operations on it, not subjective measures from different architectures. Different collections serve different purposes.
Maps are dictionaries
Sets assert uniqueness
Lists provide grouping and preserve iteration order
Trees provide cheap ordering and quick searches on dynamically changing contents that require constant ordering
Edited to include bwawok's statement on the use case of tree structures
Update
From the javadoc on LinkedHashSet
Hash table and linked list implementation of the Set interface, with predictable iteration order.
...
Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list, with one exception: Iteration over a LinkedHashSet requires time proportional to the size of the set, regardless of its capacity. Iteration over a HashSet is likely to be more expensive, requiring time proportional to its capacity.
Now we have moved from the very general case of choosing an appropriate data-structure interface to the more specific case of which implementation to use. However, we still ultimately arrive at the conclusion that specific implementations are well suited for specific applications based on the unique, subtle invariant offered by each implementation.
What do you need to know about them, and why? The reason that benchmarks show a given JDK and hardware setup is so that they could (in theory) be reproduced. What you should get from benchmarks is an idea of how things will work. For an ABSOLUTE number, you will need to run it vs your own code doing your own thing.
The most important thing to know is the Big O runtime of various collections. Knowing that getting an element out of an unsorted ArrayList is O(n), but getting it out of a HashMap is O(1) is HUGE.
If you are already using the correct collection for a given job, you are 90% of the way there. The times when you need to worry about how fast you can, say, get items out of a HashMap should be pretty darn rare.
Once you leave single-threaded land and move into multi-threaded land, you will need to start worrying about things like ConcurrentHashMap vs Collections.synchronizedMap. Until you are multi-threaded, you can just not worry about this kind of stuff and focus on which collection to use for which purpose.
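For reference, a sketch of the two thread-safe options mentioned (the key and value types are arbitrary):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ThreadSafeMaps {
    // Locks the entire map for every operation.
    static final Map<String, Integer> synced =
            Collections.synchronizedMap(new HashMap<>());

    // Finer-grained internal locking: concurrent reads, mostly concurrent writes.
    static final Map<String, Integer> concurrent = new ConcurrentHashMap<>();
}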
Update to HashSet vs LinkedHashSet
I haven't ever found a use case where I needed a LinkedHashSet (because if I care about order I tend to have a List, and if I care about O(1) gets, I tend to use a HashSet). Realistically, most code will use ArrayList, HashMap, or HashSet. If you need anything else, you are in an "edge" case.
The different collection classes have different big-O performances, but all that tells you is how they scale as they get large. If your set is big enough the one with O(1) will outperform the one with O(N) or O(logN), but there's no way to tell what value of N is the break-even point, except by experiment.
Generally, I just use the simplest possible thing, and then if it becomes a "bottleneck", as indicated by operations on that data structure taking a significant percentage of the time, I'll switch to something with a better big-O rating. Quite often, either the number of items in the collection never comes near the break-even point, or there's another simple way to resolve the performance problem.
Both HashSet and LinkedHashSet have O(1) performance. The same goes for HashMap and LinkedHashMap (in fact, the former pair are implemented on top of the latter). This only tells you how these algorithms scale, not how they actually perform. In this case, LinkedHashSet does all the same work as HashSet but also always has to update a previous and a next pointer to maintain the order. This means that the constant factor (an important value, too, when talking about actual algorithm performance) is lower for HashSet than for LinkedHashSet.
Thus, since these two have the same Big-O, they scale essentially the same - that is, as n changes, both see the same change in performance, and with O(1) the performance, on average, does not change.
So now your choice is based on functionality and your requirements (which really should be what you consider first anyway). If you only need fast add and get operations, you should always pick HashSet. If you also need consistent ordering - such as last accessed or insertion order - then you must also use the Linked... version of the class.
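A quick sketch of that difference in practice:

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class IterationOrderDemo {
    public static void main(String[] args) {
        Set<String> hash = new HashSet<>();
        Set<String> linked = new LinkedHashSet<>();
        for (String s : new String[] {"banana", "apple", "cherry"}) {
            hash.add(s);
            linked.add(s);
        }
        System.out.println(hash);   // order depends on the hash table layout
        System.out.println(linked); // always [banana, apple, cherry] (insertion order)
    }
}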
I have used the "linked" class in production applications, well LinkedHashMap. I used this in one case for a symbol like table so wanted quick access to the symbols and related information. But I also wanted to output the information in at least one context in the order that the user defined those symbols (insertion order). This makes the output more friendly for the user since they can find things in the same order that they were defined.
If I had to sort millions of rows I'd try to find a different way. Maybe I could improve my SQL, improve my algorithm, or perhaps write the elements to disk and use the operating system's sort command.
I've never had a case where collections were the cause of my performance issues.
I created my own experiment with HashSets and LinkedHashSets. For add() and contains() the running time is O(1), not taking a large number of collisions into consideration. In the add() method for a LinkedHashSet, I put the object into a user-created hash table, which is O(1), and then put the object into a separate linked list to account for order. So to remove an element from a LinkedHashSet, you must find the element in the hash table and then search through the linked list that holds the order. The running times are O(1) and O(n) respectively, which makes remove() O(n).