ArrayList or Multiple LinkedHashMap - java

I have an ArrayList of a custom object A. I need to retrieve 2 variables from A based on certain conditions. Should I simply use a for loop to retrieve data from the list each time, or create 2 LinkedHashMaps and store the required variables in them as key/value pairs for faster access later? Which is more efficient? Does the faster search justify creating 2 additional map objects?
The list will contain about 100-150 objects, and so will the two maps.
It will be used by concurrent users on a daily basis.

Asking about "efficiency" is like asking about "beauty". What is "efficiency"? I argue that efficiency is what gets the code out soonest without bugs or other misbehavior. What's most efficient in terms of software costs is what saves programmer time, both for initial development and maintenance. In the time it took you to find "answers" on SO, you could have had a correct implementation coded and correct, and still had time to test your alternatives rigorously under controlled conditions to see which made any difference in the program's operation.
If you save 10 ms of program run time at the cost of horridly complex, over-engineered code that is rife with bugs and stupidly difficult to refactor or fix, is that "efficient"?
Furthermore, as phrased, the question is useless on SO. You provided no definition of "efficient" from your context. You provided no information on how the structures in question fit into your project architecture, or the extent of their use, or the size of the problem, or anything else relevant to any definition of "efficiency".
Even if you had, we'd have no more ability to answer such a question than if you asked a roomful of lawyers, "Should I sue so-and-so for what they did?" It all depends. You need advice, if you need advice at all, that is very specific to your situation and the exact circumstances of your development environment and process, your runtime environment, your team, the project goals, budget, and other relevant data.
If you are interested in runtime "efficiency", do the following. Precisely define what exactly you mean by "efficient", including an answer to "how 'efficient' is 'efficient' enough?", and including criteria to measure such "efficiency". Once you have such a precise and (dis)provable definition, then set up a rigorous test protocol to compare the alternatives in your context, and actually measure "efficiency".
When defining "efficiency", make sure that what you define matters. It makes no difference to be "efficient" in an area that has very low project cost or impact, and ignore an area that has huge cost or impact.
Don't expect any meaningful answer for your situation here on SO.

Use LinkedHashMap, because it is made for key/value pairs (which matches your requirement) and because the data will grow in a production environment.
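To make the two options from the question concrete, here is a minimal sketch of both. The class Item stands in for the question's custom object A, and its fields code and label are assumptions made up for illustration, as are the method names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the question's custom object A.
final class Item {
    final String code;   // assumed lookup key
    final String label;  // assumed value to retrieve

    Item(String code, String label) {
        this.code = code;
        this.label = label;
    }
}

final class ItemLookup {
    // Option 1: linear scan, O(n) per lookup.
    static String findByScan(List<Item> items, String code) {
        for (Item item : items) {
            if (item.code.equals(code)) {
                return item.label;
            }
        }
        return null; // not found
    }

    // Option 2: build the index once; each later lookup is O(1) on average.
    static Map<String, String> buildIndex(List<Item> items) {
        Map<String, String> index = new LinkedHashMap<>();
        for (Item item : items) {
            index.put(item.code, item.label);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Item> items = List.of(new Item("a1", "Alpha"), new Item("b2", "Beta"));
        Map<String, String> index = buildIndex(items);
        System.out.println(findByScan(items, "b2"));
        System.out.println(index.get("b2"));
    }
}
```

With 100-150 elements either option is likely to be fast enough, so measuring is the only way to know whether the map pays off. Note that LinkedHashMap is not thread-safe: if the map is modified after it has been built while other threads read it, it needs synchronization or a concurrent map instead.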

Related

Performance tuning for searching

I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible ?
On the spot I could not think of exact answer so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?
After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they give me an actual name and explain why it is not possible to name a Java algorithm/library. For all you know, the data structure could've been String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity.
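As a rough illustration of how the same O(lg n) search looks against the two structures mentioned, the sketch below uses the JDK's Arrays.binarySearch on a sorted array and TreeSet (a red-black tree, not a plain binary search tree) for the tree case. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.TreeSet;

final class SortedSearch {
    // Binary search over an already sorted array, using the JDK's implementation.
    static boolean inSortedArray(String[] sortedWords, String word) {
        return Arrays.binarySearch(sortedWords, word) >= 0;
    }

    // TreeSet keeps its elements in a balanced tree; contains() walks it in O(log n).
    static boolean inTree(TreeSet<String> words, String word) {
        return words.contains(word);
    }

    public static void main(String[] args) {
        String[] sorted = {"apple", "banana", "cherry"};
        TreeSet<String> tree = new TreeSet<>(Arrays.asList(sorted));
        System.out.println(inSortedArray(sorted, "banana"));
        System.out.println(inTree(tree, "durian"));
    }
}
```

The array version only works if the array is sorted beforehand; the tree keeps itself ordered as elements are added.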
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS, but couldn't think of a specific method name or such at the time, I also think it should be considered reasonable to mention the interface or at least a relevant package and say that further details can be checked in the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it avoids duplicates and stores single values rather than key/value pairs. Don't count on a large memory saving, though: in OpenJDK, java.util.HashSet is itself backed by a HashMap.
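A minimal sketch of the Set-based approach, assuming the words fit in memory. The class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WordSet {
    // Builds a HashSet from the given words; duplicates are stored only once.
    static Set<String> fromWords(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        Set<String> words = fromWords("apple", "banana", "apple");
        System.out.println(words.size());              // duplicates collapse
        System.out.println(words.contains("banana"));  // average O(1) lookup
    }
}
```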
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.
There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (also called a prefix tree; a radix tree is a compressed variant), which lets you search character by character. A lookup then costs O(k) in the length k of the word, independent of the number of entries. If the hash function is expensive or there are many collisions, this could be a good alternative!
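For reference, a trie lookup can be sketched as follows. This is a minimal, uncompressed version that visits one node per character of the word; the class name and its methods are assumptions for illustration, not a JDK API (the JDK does not ship a trie).

```java
import java.util.HashMap;
import java.util.Map;

final class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    // Walks down the tree, creating one node per character as needed.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Follows one child per character; fails as soon as a character is missing.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }
}
```

A per-node HashMap is simple but memory-hungry; for billions of entries a more compact child representation would probably be needed.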
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If there were only several thousand entries, a simple HashMap would be sufficient because there is no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.
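The "map of words to the documents they are used in" can be sketched as a toy in-memory inverted index. This is not how Lucene's API looks; the class, its methods, and the very naive tokenization are assumptions for illustration only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TinyInvertedIndex {
    // word -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Splits the text on non-word characters and records each token for docId.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of all documents containing the word (empty if none).
    Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A real search index adds persistence, compression, and ranking on top of this idea, which is why a framework is usually the better choice at the scale discussed here.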
The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem as discussed in the below post
1. Sub-string match. If your entry blob is a single string or word (without any white space) and you need to search for an arbitrary sub-string within it, you need to parse every entry to find the best possible matches. One uses algorithms like the Boyer-Moore algorithm. See this and this for details. This is also equivalent to grep, because grep uses similar stuff inside.
2. Indexed search. Here you are assuming that each entry contains a set of words and the search is limited to fixed word lengths. In this case, entries are indexed over all the possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., as below:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, the second approach will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers a very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java Heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB Bitset), you'll not want to store it in main memory, but on disk.
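The compact membership check mentioned above uses one bit per possible value: 2^32 possible ints at one bit each is 512 MB. A minimal sketch with java.util.BitSet follows. BitSet only accepts non-negative indices, so this version is limited to non-negative ints; covering negative values as well (for example with a second BitSet) is left out. The class name is made up for illustration.

```java
import java.util.BitSet;

final class IntMembership {
    // One bit per value; the BitSet grows as larger values are added.
    private final BitSet bits = new BitSet();

    void add(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        bits.set(value);
    }

    boolean contains(int value) {
        return value >= 0 && bits.get(value);
    }
}
```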
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-Tree, that's what many databases use internally :-)

Set of integers. Possible performance gain in case of increasing new entries

If you were a highly skilled low-latency Java developer (I am not) and you were told to implement a set of int (primitive or not), would it be possible for you to get an extra performance gain given the guaranteed pre-condition that every new entry is higher than any other value previously stored in the set?
How significant that gain could be for the add, contains and remove operations in best/worst case scenarios?
On the one hand, it seems natural that such restriction would result in better performance. On the other hand, non-decreasing entries is a very common situation (e.g. in generating unique id) and if the gain were worth fighting for, then a more or less known implementation would have been already developed.
When you check this question, you find that add and contains are already O(1). So there is not much to improve there.
And I think that those two would be the only ones that could benefit from this constraint:
"adding" becomes easier because you could simply remember the last value that was added; so only need one check when a new value is coming in
similarly, when asking for "contained"; you have a first pre-check that tells you instantly when a given value can not be in the set
But that is about it.
And beyond that: when your constraint is really that each "new" entry that is about to be added is larger than the last one - then you don't need a Set in the first place. Because your constraint guarantees that all items will be unique. So in that sense, you could be looking into Lists, too ...
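One way to picture the list-based idea is an append-only sorted array. This is a sketch under the question's precondition (every new value is larger than all stored ones), not a claim that it beats HashSet: contains becomes a binary search, O(log n) rather than O(1), in exchange for a compact primitive int[]; remove is left out. The class name is hypothetical.

```java
import java.util.Arrays;

final class IncreasingIntSet {
    private int[] values = new int[16];
    private int size;

    // Appends the value; only one comparison against the last element is needed.
    void add(int value) {
        if (size > 0 && value <= values[size - 1]) {
            throw new IllegalArgumentException("values must be strictly increasing");
        }
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    boolean contains(int value) {
        // Pre-check: anything outside [first, last] cannot be present.
        if (size == 0 || value < values[0] || value > values[size - 1]) {
            return false;
        }
        // The array is sorted by construction, so binary search applies.
        return Arrays.binarySearch(values, 0, size, value) >= 0;
    }
}
```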
Regarding the comment that the question is asking about possible deltas between O(1) and O(1.5); my response:
The difference between O(1) and O(n) is of theoretical nature, you answer that using a pen and piece of paper. The difference between O(1.0) and O(1.005) ... there I would start with experiments and benchmarks.
Meaning: these "real" factors depend on various elements which are "close" to the underlying implementation. You would start by looking into how the Set that you are currently using is implemented for your platform; and how the JVM on your platform is doing its just-in-time-compiling. From there on, you could draw conclusions about things that could be improved by taking this constraint into consideration.
Finally; regarding the constraint degrading existing implementations. I guess this could also happen; as said above: such details really depend on the specific implementation. And beyond that: you named three different operations; and actual results can be very much different; depending on the operation type.
If I had to work on this problem; I would start by creating reasonably large files with "test data" (random numbers, increasing-only numbers; and variations of that). And then I would use a real profiler (or at least sophisticated benchmarking) and start measuring.
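A very crude starting point for such an experiment might look like the sketch below. It generates the two kinds of test data in memory instead of in files and uses plain wall-clock timing, which is easily distorted by JIT warm-up and garbage collection; a proper harness such as JMH would be more trustworthy. All names here are made up for illustration.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class SetTimingSketch {
    // Strictly increasing test data: 0, 2, 4, ...
    static int[] increasing(int n) {
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i * 2;
        }
        return data;
    }

    // Random test data; a fixed seed keeps runs reproducible.
    static int[] random(int n, long seed) {
        return new Random(seed).ints(n).toArray();
    }

    // Times n add() calls on a fresh HashSet, in nanoseconds.
    static long timeAddsNanos(int[] data) {
        Set<Integer> set = new HashSet<>();
        long start = System.nanoTime();
        for (int v : data) {
            set.add(v);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("increasing: " + timeAddsNanos(increasing(n)) + " ns");
        System.out.println("random:     " + timeAddsNanos(random(n, 42L)) + " ns");
    }
}
```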

Can you/How do you save CPU and memory by choosing wisely [closed]

I understand the JVM optimizes some things for you (not clear on which things yet), but let's say I were to do this:
while(true) {
    int var = 0;
}
would doing:
int var;
while(true) {
    var = 0;
}
take less space? Since you aren't declaring a new reference every time, you don't have to specify the type every time.
I understand I would really only need to put var outside of the while loop if I wanted to use it outside of that loop (instead of only being able to use it locally like in the first example). Also, what about objects, would it be different than primitive types in that situation? I understand it's a small situation, but build-up of this kind of stuff can cause my application to take a lot of memory/cpu. I'm trying to use the least amount of operations possible, but I don't completely understand what's going on behind the scenes.
If someone could help me out, even maybe link me to somewhere I can learn about saving cpu by decreasing amount of operations, it would be highly appreciated. Please no books (unless they're free! :D), no way of getting one right now /:
Don't. Premature optimization is the root of all evil.
Instead, write your code as it makes most sense conceptually. Write it thoughtfully, yes. But don't think you can be a 'human compiler' and optimize and still write good code.
Once you have written your code (more or less naively, depending on your level of experience) you write performance tests for it. Try to think of different ways in which the code may be used (many times in a row, from front to back or reversed, many concurrent invocations etc) and try to cover these in test cases. Then benchmark your code.
If you find that some test cases are not performing well, investigate why. Measure parts of the test case to see where the time is going. Zoom into the parts where most time is spent.
Mostly, you will find weird loops where, upon reading the code again, you will think 'that was silly to write it that way. Of course this is slow' and easily fix it. In my experience most performance problems can be solved this way and 'hardcore optimization' is hardly ever needed.
In the end you will find that 99* percent of all performance problems can be solved by touching only 1 percent of the code. The other code never comes into play. This is why you should not 'prematurely' optimize. You will be spending valuable time optimizing code that had no performance issues in the first place. And making it less readable in the process.
* Numbers made up of course but you know what I mean :)
Hot Licks points out the fact that this isn't much of an answer, so let me expand on this with some good ol' performance tips:
Keep an eye out for I/O
Most performance problems are not in pure Java. Instead they are in interfacing with other systems. In particular disk access is notoriously slow. So is the network. So minimize their use.
Optimize SQL queries
SQL queries will add seconds, even minutes, to your program's execution time if you don't watch out. So think about those very carefully. Again, benchmark them. You can write very optimized Java code, but if it first spends ten seconds waiting for the database to run some monster SQL query then it will never be fast.
Use the right kind of collections
Most performance problems are related to doing things lots of times. Usually when working with big sets of data. Putting your data in a Map instead of in a List can make a huge difference. Also there are specialized collection types for all sorts of performance requirements. Study them and pick wisely.
Don't write code
When performance really matters, squeezing the last 'drops' out of some piece of code becomes a science all in itself. Unless you are writing some very exotic code, chances are great there will be some library or toolkit to solve your kind of problems. It will be used by many in the real world. Tried and tested. Don't try to beat that code. Use it.
We humble Java developers are end-users of code. We take the building blocks that the language and its ecosystem provide and tie them together to form an application. For the most part, performance problems are caused by us not using the provided tools correctly, or not using any tools at all for that matter. But we really need specifics to be able to discuss those. Benchmarking gives you that specificity. And when the slow code is identified it is usually just a matter of changing a collection from list to map, or sorting it beforehand, or dropping a join from some query etc.
Attempting to optimise code which doesn't need to be optimised increases complexity and decreases readability.
However, there are cases where improving readability also comes with improved performance.
For example,
if a numeric value cannot be null, use a primitive instead of a wrapper. This makes it clearer that the value cannot be null but also uses less memory and reduces pressure on the GC.
use a Set when you have a collection which cannot have duplicates. Often a List is used when in fact a Set would be more appropriate. Depending on the operations you perform, this can also be faster by reducing time complexity.
consider using an enum with one instance for a singleton (if you have to use singletons at all). This is much simpler as well as faster than double-checked locking. Hint: try to only have stateless singletons.
writing simpler, well structured code is also easier for the JIT to optimise. This is where trying to outsmart the JIT with more complex solutions will backfire because you end up confusing the JIT and what you think should be faster is actually slower. (And it's more complicated as well)
try to reduce how much you write to the console (and IO in general) in critical sections. Writing to the console is so expensive, both for the program and the poor human having to read it, that it is worth spending more time producing concise console output.
try to use a StringBuilder when you have a loop of elements to add. Note: avoid StringBuilder for one-liners that are just a series of append() calls, as this can actually be slower and harder to read.
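Two of the tips above, the enum singleton and the StringBuilder in a loop, can be sketched as follows. The names Greeter and JoinDemo are made up for illustration.

```java
import java.util.List;

// Enum singleton: the JVM guarantees exactly one instance, with no locking code to write.
enum Greeter {
    INSTANCE;

    // Stateless, as the hint above recommends.
    String greet(String name) {
        return "Hello, " + name;
    }
}

final class JoinDemo {
    // A single StringBuilder avoids creating a new String on every iteration.
    static String join(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(parts.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(Greeter.INSTANCE.greet("world"));
        System.out.println(join(List.of("a", "b", "c")));
    }
}
```

For this particular joining task the JDK's String.join does the same thing in one call; the loop is shown only to illustrate the StringBuilder pattern.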
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. --
Antoine de Saint-Exupery,
French writer (1900 - 1944)
Developers like to solve hard problems and there is a very strong temptation to solve problems which don't need to be solved. This is a very common behaviour for developers of up to 10 years experience (it was for me anyway ;). After about this point you have already solved most common problems before and you start selecting the best/minimum set of solutions which will solve a problem. This is the point you want to get to in your career and you will be able to develop quality software in far less time than you could before.
If you dream up an interesting problem to solve, go ahead and solve it in your own time, see what difference it makes, but don't include it in your working code unless you know (because you measured) that it really makes a difference.
However, if you find a simpler, elegant solution to a problem, this is worth including not because it might be faster (though it might be), but because it should make the code easier to understand and maintain and this is usually a far more valuable use of your time. Successfully used software usually costs three times as much to maintain as it cost to develop. Do what will make the life of the poor person who has to understand why you did something easier (which is harder if you didn't do it for any good reason in the first place) as this might be you one day ;)
A good example of when you might make an application slower to improve reasoning is in the use of immutable values and concurrency. Immutable values are usually slower than mutable ones, sometimes much slower, however when used with concurrency, mutable state is very hard to get provably right, and you need this because testing it is good but not reliable. Using concurrency you have much more CPU to burn so a bit more cost in using immutable objects is a very sensible trade off. In some cases using immutable objects can allow you to avoid using locks and actually improve throughput. e.g. CopyOnWriteArrayList, if you have a high read to write ratio.
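A small sketch of the CopyOnWriteArrayList behaviour referred to above: every write copies the backing array, while readers iterate over the snapshot that existed when their loop started, without taking a lock. The wrapper class and its method names are assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class ReadMostlyList {
    private final List<String> items = new CopyOnWriteArrayList<>();

    // Each add copies the backing array, so writes are expensive.
    void add(String item) {
        items.add(item);
    }

    int size() {
        return items.size();
    }

    // Iterates over the snapshot taken when the loop starts. Adding during
    // iteration neither blocks nor throws ConcurrentModificationException,
    // and the new elements are not visited by this loop.
    int visitAndDuplicate() {
        int visited = 0;
        for (String item : items) {
            items.add(item + "-copy");
            visited++;
        }
        return visited;
    }
}
```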

Sort a list with SQL or as a collection?

I have some entries with dates in my database. What is best?:
Fetch them with a sql statement and also apply order by.
Get the list with sql, and order them within the application with collection.sort or so?
Thanks
This is a very broad question that is very difficult to answer, and it depends a lot on what you mean by "best".
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask yourself: "Where does the knowledge that the data is sorted belong?" and "What would happen if I were to change from relational database storage to something different?"
To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some even don't support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…
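For the case where the data is already in the application, sorting by date might look like the sketch below. DatedEntry and its fields are hypothetical stand-ins for the question's entries. List.sort is stable, so entries with equal dates keep their original relative order.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row type standing in for "entries with dates".
final class DatedEntry {
    final String title;
    final LocalDate date;

    DatedEntry(String title, LocalDate date) {
        this.title = title;
        this.date = date;
    }
}

final class EntrySorting {
    // Returns a new list sorted by date, oldest first; the input is left untouched.
    static List<DatedEntry> sortedByDate(List<DatedEntry> entries) {
        List<DatedEntry> copy = new ArrayList<>(entries);
        copy.sort(Comparator.comparing((DatedEntry e) -> e.date));
        return copy;
    }
}
```

The database-side alternative is simply an ORDER BY on the date column in the query, ideally backed by an index as described above.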
Database management systems (DBMS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP (or another scripting language), it might be slower to perform that task using a script. You might also reach the memory limit PHP is allowed to use if you sort the array using a script.
I don't mean to raise a question of performance of different programming languages, just want to point out that it is a very good practice to rely on the DBMS whenever you can.
This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. Just want to present the other side.
When I started in my career, I was working on mainframe DB2, and the old-timers that taught me were VERY INSISTENT that sorting be done OUTSIDE of the db. Their rationale for this is that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your db is very busy, if you have a lot of OLTP processing on clients or app servers, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.

Performance of Collection class in Java

All,
I have been going through a lot of sites that post about the performance of various Collection classes for various actions i.e. adding an element, searching and deleting. But I also notice that all of them provide different environments in which the test was conducted i.e. O.S, memory, threads running etc.
My question is, if there is any site/material that provides the same performance information on best test environment basis? i.e. the configurations should not be an issue or catalyst for poor performance of any specific data structure.
[Updated]: For example, HashSet and LinkedHashSet both have a complexity of O(1) for inserting an element. However, Bruce Eckel's test claims that insertion is going to take more time for LinkedHashSet than for HashSet [http://www.artima.com/weblogs/viewpost.jsp?thread=122295]. So should I still go by the Big-O notation?
Here are my recommendations:
First of all, don't optimize :) Not that I am telling you to design crap software, but just to focus on design and code quality more than premature optimization. Assuming you've done that, and now you really need to worry about which collection is best beyond purely conceptual reasons, let's move on to point 2
Really, don't optimize yet (roughly stolen from M. A. Jackson)
Fine. So your problem is that even though you have theoretical time complexity formulas for best cases, worst cases and average cases, you've noticed that people say different things and that practical settings are a very different thing from theory. So run your own benchmarks! You can only read so much, and while you do that your code doesn't write itself. Once you're done with the theory, write your own benchmark - for your real-life application, not some irrelevant mini-application for testing purposes - and see what actually happens to your software and why. Then pick the best algorithm. It's empirical, it could be regarded as a waste of time, but it's the only way that actually works flawlessly (until you reach the next point).
Now that you've done that, you have the fastest app ever. Until the next update of the JVM. Or of some underlying component of the operating system your particular performance bottleneck depends on. Guess what? Maybe your clients have different ones. Here comes the fun: you need to be sure that your benchmark is valid for others or in most cases (or have fun writing code for different cases). You need to collect data from users. LOTS. And then you need to do that over and over again to see what happens and if it still holds true. And then re-write your code accordingly over and over again (the - now terminated - Engineering Windows 7 blog is actually a nice example of how user data collection helps to make educated decisions to improve user experience).
Or you can... you know... NOT optimize. Platforms and compilers will change, but a good design should - on average - perform well enough.
Other things you can also do:
Have a look at the JVM's source code. It's very educative and you discover a herd of hidden things (I'm not saying that you have to use them...)
See that other thing on your TODO list that you need to work on? Yes, the one near the top but that you always skip because it's too hard or not fun enough. That one right there. Well get to it and leave the optimization thingy alone: it's the evil child of a Pandora's Box and a Moebius band. You'll never get out of it, and you'll deeply regret you tried to have your way with it.
That being said, I don't know why you need the performance boost so maybe you have a very valid reason.
And I am not saying that picking the right collection doesn't matter. Just that once you know which one to pick for a particular problem, and you've looked at the alternatives, you've done your job and don't have to feel guilty. The collections usually have a semantic meaning, and as long as you respect it you'll be fine.
In my opinion, all you need to know about a data structure is the Big-O of the operations on it, not subjective measures from different architectures. Different collections serve different purposes.
Maps are dictionaries
Sets assert uniqueness
Lists provide grouping and preserve iteration order
Trees provide cheap ordering and quick searches on dynamically changing contents that require constant ordering
Edited to include bwawok's statement on the use case of tree structures
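To make those roles concrete, here is a small sketch that runs the same words through each kind of collection. The class and method names are made up for illustration; only the `java.util` types are real.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

class CollectionRoles {
    // Map as dictionary: look up a count by word.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Set asserts uniqueness: duplicates collapse into one entry.
    static Set<String> uniqueWords(List<String> words) {
        return new HashSet<>(words);
    }

    // Tree keeps its contents sorted as elements come and go.
    static SortedSet<String> sortedWords(List<String> words) {
        return new TreeSet<>(words);
    }

    public static void main(String[] args) {
        // List groups the elements and preserves the order they were added in.
        List<String> words = Arrays.asList("pear", "apple", "pear", "fig");

        System.out.println("list:   " + words);
        System.out.println("counts: " + countWords(words));
        System.out.println("unique: " + uniqueWords(words));
        System.out.println("sorted: " + sortedWords(words));
    }
}
```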
Update
From the javadoc on LinkedHashSet
Hash table and linked list implementation of the Set interface, with predictable iteration order.
...
Performance is likely to be just slightly below that of HashSet, due to the added expense of maintaining the linked list, with one exception: Iteration over a LinkedHashSet requires time proportional to the size of the set, regardless of its capacity. Iteration over a HashSet is likely to be more expensive, requiring time proportional to its capacity.
Now we have moved from the very general case of choosing an appropriate data-structure interface to the more specific case of which implementation to use. However, we still ultimately arrive at the conclusion that specific implementations are well suited for specific applications based on the unique, subtle invariant offered by each implementation.
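As an illustration of that "predictable iteration order" invariant, the sketch below removes duplicates from a list while keeping the first-seen order, which a plain HashSet does not promise. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;

class FirstSeenOrder {
    // LinkedHashSet drops duplicates but iterates in insertion order.
    static List<String> distinctInFirstSeenOrder(List<String> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("c", "a", "b", "a", "c");

        // HashSet iteration order is unspecified; do not rely on it.
        System.out.println("HashSet:       " + new HashSet<>(input));
        // LinkedHashSet order follows the order of first insertion.
        System.out.println("LinkedHashSet: " + distinctInFirstSeenOrder(input));
    }
}
```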
What do you need to know about them, and why? The reason that benchmarks state a given JDK and hardware setup is so that they can (in theory) be reproduced. What you should get from benchmarks is an idea of how things will work. For an ABSOLUTE number, you will need to run the benchmark against your own code doing your own thing.
The most important thing to know is the Big O runtime of the operations on the various collections. Knowing that finding an element in an unsorted ArrayList is O(n), while looking one up by key in a HashMap is O(1) on average, is HUGE.
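One way to see that difference without a stopwatch is to count how many `equals()` calls each lookup needs. The sketch below does that with a hypothetical `Key` class whose `equals()` increments a counter; the names are illustrative, and the exact count for the map depends on how the keys hash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ComparisonCounter {
    // Incremented on every equals() call so the work per lookup is visible.
    static int equalsCalls = 0;

    // Hypothetical key type used only for this demonstration.
    static final class Key {
        final int id;

        Key(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object other) {
            equalsCalls++;
            return other instanceof Key && ((Key) other).id == id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    // equals() calls needed to find key `id` in a list of `size` keys.
    static int listSearchCost(int size, int id) {
        List<Key> list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add(new Key(i));
        }
        equalsCalls = 0;
        list.indexOf(new Key(id));
        return equalsCalls;
    }

    // equals() calls needed to look up key `id` in a map of `size` keys.
    static int mapLookupCost(int size, int id) {
        Map<Key, String> map = new HashMap<>();
        for (int i = 0; i < size; i++) {
            map.put(new Key(i), "value-" + i);
        }
        equalsCalls = 0;
        map.get(new Key(id));
        return equalsCalls;
    }

    public static void main(String[] args) {
        // Searching for the last element forces the list to compare every entry.
        System.out.println("list equals() calls: " + listSearchCost(150, 149));
        // The map jumps to the right bucket and compares only what it finds there.
        System.out.println("map equals() calls:  " + mapLookupCost(150, 149));
    }
}
```

The list count grows with the size of the list; the map count stays at a small constant as long as the keys hash well.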
If you are already using the correct collection for a given job, you are 90% of the way there. The times when you need to worry about how fast you can, say, get items out of a HashMap should be pretty darn rare.
Once you leave single-threaded land and move into multi-threaded land, you will need to start worrying about things like ConcurrentHashMap vs a HashMap wrapped with Collections.synchronizedMap. Until you are multi-threaded, you can just not worry about this kind of stuff and focus on which collection to use for which job.
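For reference, here is a minimal sketch of both thread-safe options under concurrent updates. The class name, the `hammer` method and the `"hits"` key are invented for the example. Both maps make single calls such as `merge` atomic; the practical difference is that the synchronized wrapper takes one lock for the whole map on every call (and needs manual synchronization while iterating), whereas ConcurrentHashMap is designed for concurrent access.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SharedCounterDemo {
    // Starts `threads` workers that each increment the same key `perThread` times.
    static int hammer(Map<String, Integer> counts, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // merge is a single atomic read-modify-write on both map types.
                    counts.merge("hits", 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counts.get("hits");
    }

    public static void main(String[] args) {
        Map<String, Integer> concurrent = new ConcurrentHashMap<>();
        Map<String, Integer> synced = Collections.synchronizedMap(new HashMap<>());

        System.out.println("ConcurrentHashMap: " + hammer(concurrent, 4, 10_000));
        System.out.println("synchronizedMap:   " + hammer(synced, 4, 10_000));
        // A plain HashMap is not safe here and may lose updates or corrupt itself.
    }
}
```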
Update to HashSet vs LinkedHashSet
I haven't ever found a use case where I needed a LinkedHashSet (because if I care about order I tend to use a List, and if I care about O(1) gets I tend to use a HashSet). Realistically, most code will use ArrayList, HashMap, or HashSet. If you need anything else, you are in an "edge" case.
The different collection classes have different big-O performances, but all that tells you is how they scale as they get large. If your set is big enough the one with O(1) will outperform the one with O(N) or O(logN), but there's no way to tell what value of N is the break-even point, except by experiment.
Generally, I just use the simplest possible thing, and then if it becomes a "bottleneck", as indicated by operations on that data structure taking a large percentage of the run time, I switch to something with a better big-O rating. Quite often, either the number of items in the collection never comes near the break-even point, or there's another simple way to resolve the performance problem.
Both HashSet and LinkedHashSet have O(1) performance. Same with HashMap and LinkedHashMap (actually the former are implemented on top of the latter). This only tells you how these algorithms scale, not how they actually perform. In this case, LinkedHashSet does all the same work as HashSet but also always has to update a previous and a next pointer to maintain the order. This means that the constant factor (which is also an important value when talking about actual performance) is lower for HashSet than for LinkedHashSet.
Thus, since these two have the same Big-O, they scale the same essentially - that is, as n changes, both have the same performance change and with O(1) the performance, on average, does not change.
So now your choice is based on functionality and your requirements (which really should be what you consider first anyway). If you only need fast add and get operations, you should always pick HashSet. If you also need consistent ordering - such as insertion order, or last-accessed order in the case of LinkedHashMap - then you must use the Linked... version of the class.
I have used the "linked" classes in production applications, specifically LinkedHashMap. In one case I used it for a symbol-table-like structure, where I wanted quick access to the symbols and their related information. But I also wanted to output the information, in at least one context, in the order in which the user defined those symbols (insertion order). This makes the output friendlier for the user, since they can find things in the same order in which they were defined.
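The two ordering modes can be seen side by side with LinkedHashMap's three-argument constructor, whose last parameter switches from insertion order to access order. This is only a small sketch; the class and method names are made up. Note that LinkedHashSet itself only offers insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderingModes {
    // Inserts a, b, c, then reads "a", and reports the resulting key order.
    static List<String> keysAfterReadingA(boolean accessOrder) {
        Map<String, Integer> map = new LinkedHashMap<>(16, 0.75f, accessOrder);
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.get("a"); // in access-order mode this moves "a" to the end
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println("insertion order: " + keysAfterReadingA(false)); // [a, b, c]
        System.out.println("access order:    " + keysAfterReadingA(true));  // [b, c, a]
    }
}
```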
If I had to sort millions of rows I'd try to find a different way. Maybe I could improve my SQL, improve my algorithm, or perhaps write the elements to disk and use the operating system's sort command.
I've never had a case where collections were the cause of my performance issues.
I ran my own experiment with a HashSet and a hand-written linked hash set. For add() and contains() the running time is O(1), as long as there are not a lot of collisions. In the add() method of my linked hash set, I put the object in a user-created hash table, which is O(1), and then append it to a separate linked list to keep track of the order. To remove an element, you must find it in the hash table and then search through the linked list that holds the order, so the running time is O(1) + O(n), which is O(n) for remove(). (java.util.LinkedHashSet avoids that search by keeping the previous/next links on the hash entries themselves, so its remove() stays O(1).)
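The O(n) search goes away if the hash table stores a reference to each element's list node, so that remove() can unlink the node directly. Below is a minimal sketch of that idea, assuming a doubly linked list; `SimpleLinkedHashSet` and its members are illustrative names, not the JDK's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SimpleLinkedHashSet<E> {
    // Doubly linked node, so it can be unlinked without walking the list.
    private static final class Node<E> {
        final E value;
        Node<E> prev;
        Node<E> next;

        Node(E value) {
            this.value = value;
        }
    }

    // Maps each element to its node: this is what makes remove() O(1).
    private final Map<E, Node<E>> nodes = new HashMap<>();
    private Node<E> head;
    private Node<E> tail;

    boolean add(E value) {
        if (nodes.containsKey(value)) {
            return false; // already present, order unchanged
        }
        Node<E> node = new Node<>(value);
        if (tail == null) {
            head = node;
        } else {
            tail.next = node;
            node.prev = tail;
        }
        tail = node;
        nodes.put(value, node);
        return true;
    }

    boolean contains(E value) {
        return nodes.containsKey(value);
    }

    boolean remove(E value) {
        Node<E> node = nodes.remove(value); // O(1) average hash lookup
        if (node == null) {
            return false;
        }
        // Unlink in O(1) by rewiring the neighbours.
        if (node.prev == null) {
            head = node.next;
        } else {
            node.prev.next = node.next;
        }
        if (node.next == null) {
            tail = node.prev;
        } else {
            node.next.prev = node.prev;
        }
        return true;
    }

    // Elements in insertion order.
    List<E> toList() {
        List<E> result = new ArrayList<>();
        for (Node<E> n = head; n != null; n = n.next) {
            result.add(n.value);
        }
        return result;
    }

    public static void main(String[] args) {
        SimpleLinkedHashSet<String> set = new SimpleLinkedHashSet<>();
        set.add("a");
        set.add("b");
        set.add("c");
        set.remove("b");
        System.out.println(set.toList()); // [a, c]
    }
}
```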
