I am trying to convert a very long int[] (length 1,000,000) to Integer[] so that I can sort it with a custom comparator (which sorts elements based on the length of their corresponding lists in a defined Map<Integer, List<Integer>>).
I have done the following:
private static Integer[] convert(int[] arr) {
    Integer[] ans = new Integer[arr.length];
    for (int i = 0; i < arr.length; i++) {
        ans[i] = arr[i];
    }
    return ans;
}
It works well for me but I have also come across
Integer[] ans = Arrays.stream(intArray).boxed().toArray( Integer[]::new );
and
Integer[] ans = IntStream.of(intArray).boxed().toArray( Integer[]::new );
Is there any of them that is significantly faster than the rest? Or is there any other approach that is fast enough to shorten the run-time?
Is there any of them that is significantly faster than the rest?
You realize the question you're asking is akin to:
"I have 500,000 screws to screw into place. Unfortunately, I can't be bothered to go out and buy a screwdriver. I do have a clawhammer and an old shoe. Should I use the clawhammer to bash these things into place, or is the shoe a better option?"
The answer is clearly: Uh, neither. Go get a screwdriver, please.
To put it differently, if the 'cost' of converting to Integer[] first is 1000 points of cost in some arbitrary unit, then the difference in the options you listed is probably between 0.01 and 0.05 points - i.e. dwarfed so much, it's irrelevant. Thus, the direct answer to your question? It just does not matter.
You have 2 options:
Performance is completely irrelevant. In which case this is fine, and there's absolutely no point to actually answering this question.
You care about performance quite a bit. In which case, this Integer[] plan needs to be off the table.
Assuming you might be intrigued by option 2, you have various options.
The easiest one is to enjoy the extensive Java ecosystem. Someone's been here before and made an excellent class just for this purpose. It abstracts the concept of an int array and gives you all sorts of useful methods, including sorting, and the team that made it is extremely concerned about performance, so they put in the many, many, many person-weeks it takes to do proper performance analysis (between hotspot, pipelining CPUs, and today's complex OSes, it's much harder than you might think!).
Thus, I present you: IntArrayList. It has a .sortThis() method, as well as a .sortThis(IntComparator c) method, which you can use for sorting purposes.
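For the use case in the question (sorting by the length of each value's list in the map), a minimal sketch might look like the following. It assumes Eclipse Collections is on the classpath and that your version of IntArrayList has sortThis(IntComparator); the method name is made up.
import java.util.List;
import java.util.Map;
import org.eclipse.collections.impl.list.mutable.primitive.IntArrayList;

// Sketch only: sort the ints by the size of the list each value maps to,
// with no boxing to Integer at any point. Note that IntArrayList's varargs
// constructor may wrap rather than copy the array it is given.
private static int[] sortByListLength(int[] arr, Map<Integer, List<Integer>> lists) {
    IntArrayList values = new IntArrayList(arr);
    values.sortThis((a, b) -> Integer.compare(lists.get(a).size(), lists.get(b).size()));
    return values.toArray();
}
Compared to boxing a million ints into an Integer[], this keeps everything primitive, which is where the real cost in the original plan lives.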
There are a few others out there; searching the web for 'java primitive collections' should find them all, if for some reason the excellent Eclipse Collections project isn't to your liking (NB: You don't need eclipse-the-IDE to use it. It's a general-purpose library that so happens to be maintained by the Eclipse team).
If you must hand-roll it, a web search for how to implement quicksort in Java will turn up plenty of material, so you can easily write your own 'sort this int array for me' code. Not that I would reinvent that particular wheel. Just pointing out that it's not too difficult if you must.
Related
public static ArrayList<Integer> duplicates(int[] arr) {
    ArrayList<Integer> doubles = new ArrayList<Integer>();
    boolean isEmpty = true;
    for (int i = 0; i < arr.length; i++) {
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[i] == arr[j] && !doubles.contains(arr[i])) {
                doubles.add(arr[i]);
                isEmpty = false;
                break;
            }
        }
    }
    if (isEmpty) doubles.add(-1);
    Collections.sort(doubles);
    return doubles;
}

public static void main(String[] args) {
    System.out.println(duplicates(new int[]{1, 2, 3, 4, 4, 4})); // Return: [4]
}
I made this function in Java which returns the duplicates in an input int array, or returns -1 if the input array is empty or when there are no duplicates.
It works, but there is probably a way to make it faster.
Are there any good practices to make functions more efficient and faster in general?
There are, in broad strokes, 2 completely unrelated performance improvements you can make:
Reduce algorithmic complexity. This is a highly mathematical concept.
Reduce actual performance characteristics - literally, just make it run faster and/or use less memory (often, 'use less memory' and 'goes faster' go hand in hand).
The first is easy enough, but can be misleading: You can write an algorithm that does the same job in an algorithmically less complex way which nevertheless actually runs slower.
The second is also tricky: Your eyeballs and brain cannot do the job. The engineers who write the JVM itself are on record as stating that in general they have no idea how fast any given code actually runs. That's because the JVM is way too complicated: It has so many complicated avenues for optimizing how fast stuff runs (not just complicated in the code that powers such things, but also complicated in how they work. For example, hotspot kicks in eventually, and uses the characteristics of previous runs to determine how best to rewrite a given method into finely tuned machine code, and the hardware you run it on also matters rather a lot).
This leads to the following easy conclusions:
Don't do anything unless there is an actual performance issue.
You really want a profiler report that actually indicates which code is 'relevant'. Generally, for any given java app, literally 1% of all of your lines of code is responsible for 99% of the load. There is just no point at all optimizing anything, except that 1%. A profiler report is useful in finding the 1% that requires the attention. Java ships with a profiler and there are commercial offerings as well.
If you want to micro-benchmark (time a specific slice of code against specific inputs), that's really difficult too, with many pitfalls. There's really only one way to do it right: Use the Java Microbenchmark Harness.
Whilst you can decide to focus on algorithmic complexity, you may still want a profiler report or JMH run because algorithmic complexity is all about 'Eventually, i.e. with large enough inputs, the algorithmic complexity overcomes any other performance aspect'. The trick is: Are your inputs large enough to hit that 'eventually' space?
For this specific algorithm, given that I have no idea what reasonable inputs might be, you're going to have to do the work on setting up JMH and/or profiler runs. However, as far as algorithmic complexity goes:
That doubles.contains call has O(N) algorithmic complexity: The amount of time that call takes grows linearly with the number of elements already in the doubles list.
You can get O(1) algorithmic complexity if you use a HashSet instead.
From the point of view of just plain performance, an ArrayList<Integer> generally carries a sizable performance and memory overhead compared to an int[].
This gives 2 alternate obvious strategies to optimize this code:
Replace the ArrayList<Integer> with an int[].
Replace the ArrayList<Integer> with a HashSet<Integer> instead.
You can't really combine these two, not without spending a heck of a long time hand-rolling a primitive-int-array-backed hash set implementation. Fortunately, someone did the work for you: Eclipse Collections has a primitive int hash set implementation.
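As a rough sketch of what that combination looks like (assuming Eclipse Collections is available; the method mirrors the original's contract of returning -1 when there are no duplicates):
import org.eclipse.collections.impl.set.mutable.primitive.IntHashSet;

// Sketch only: primitive int hash sets, so no Integer boxing and no O(N) contains scan.
public static int[] duplicatesPrimitive(int[] arr) {
    IntHashSet seen = new IntHashSet();
    IntHashSet duplicates = new IntHashSet();
    for (int v : arr) {
        if (!seen.add(v)) {          // add(int) returns false if the value was already present
            duplicates.add(v);
        }
    }
    return duplicates.isEmpty() ? new int[] {-1} : duplicates.toSortedArray();
}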
Theoretically it's hard to imagine how replacing this with IntHashSet can be slower. However, I can't go on record and promise you that it'll be any faster: I can imagine if your input is an int array with a few million ints in there, IntHashSet is probably going to be many orders of magnitude faster. But you really need test data and a profiler report and/or a JMH run or we're all just guessing, which is a bad idea, given that the JVM is such a complex beast.
So, if you're serious about optimizing this:
Write a bunch of test cases.
Write a wrapper around this code so you can run those tests in a JMH setup (a minimal sketch follows after this list).
Replace the code with IntHashSet and compare that vs. the above in your JMH harness.
If that really improves things and the performance now fits your needs, great. You're done.
If not, you may have to re-evaluate where and how you use this code, or if there's anything else you can do to optimize things.
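To make the JMH step concrete, here is a minimal benchmark sketch. The annotations and Blackhole are standard JMH; the input size and distribution are just assumptions, and duplicates refers to the method from the question:
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class DuplicatesBenchmark {

    int[] input;

    @Setup
    public void setup() {
        // Assumed input shape: one million ints in a narrow range, so lots of duplicates.
        input = new Random(42).ints(1_000_000, 1, 1001).toArray();
    }

    @Benchmark
    public void original(Blackhole bh) {
        bh.consume(duplicates(input));   // swap in the IntHashSet variant for comparison
    }
}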
It works, but there is probably a way to make it faster.
I think you will find this approach significantly faster. I omitted the sort from both methods just to check. This does not discuss general optimizations as rzwitserloot's excellent answer already does that.
The two main problems with your method are:
you are using a nested loop, which is essentially an O(N*N) problem,
and you use contains on a list which must do a linear search each time to find the value.
A better way is to use a HashSet which works very close to O(1) lookup time (relatively speaking and depending on the set threshold values).
The idea is as follows.
Create two sets, one for the result and one for what's been seen.
iterate over the array
try to add the value to the seen set; if add returns true, the value has not been seen before, so nothing more needs to be done.
if it returns false, the value is already in the seen set, so it is a duplicate and is added to the duplicate set.
Note the use of the bang (!) to invert the above conditions.
once the loop is finished, return the duplicates in a list as required.
public static List<Integer> duplicatesSet(int[] arr) {
    Set<Integer> seen = new HashSet<>();
    Set<Integer> duplicates = new HashSet<>();
    for (int v : arr) {
        if (!seen.add(v)) {
            duplicates.add(v);
        }
    }
    return duplicates.isEmpty()
            ? new ArrayList<>(List.of(-1))
            : new ArrayList<>(duplicates);
}
The sort is easily added back in. That will take additional computing time but that was not the real problem.
To test this I generated a list of random values and put them in an array. The following generates an array of 1,000,000 ints between 1 and 1000 inclusive.
Random r = new Random();
int[] val = r.ints(1_000_000, 1, 1001).toArray();
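For a quick sanity check against that array (real numbers should come from JMH, as the other answer explains), something like this will do:
// Naive wall-clock check only; use JMH for numbers you intend to trust.
long start = System.nanoTime();
List<Integer> result = duplicatesSet(val);
long elapsedMs = (System.nanoTime() - start) / 1_000_000;
System.out.println(result.size() + " duplicates found in " + elapsedMs + " ms");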
I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
    Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
    List<String> u2Tokens = u2.getTokenlist();
    int intersection = 0;
    int union = u1Tokens.size();
    for (String s : u2Tokens) {
        if (u1Tokens.contains(s)) {
            intersection++;
        } else {
            union++;
        }
    }
    return ((double) intersection / union);
}
My question is: is there anything I can do to improve this, given that I'm working with Strings, which may be more time-consuming to compare for equality than other data types?
I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u1 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set).
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this used the "clone the set, then retainAll the items in the other set" approach to find the intersection, and then shortcut out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall versus being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past where the current value would be, and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
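A sketch of that sorted-array idea, assuming u1's tokens have been copied into an array and sorted once up front (the method name is made up):
// Linear scan over a pre-sorted array that aborts as soon as it has passed
// the point where the token would have to be.
static boolean containsSorted(String[] sortedTokens, String token) {
    for (String t : sortedTokens) {
        int cmp = t.compareTo(token);
        if (cmp == 0) {
            return true;
        }
        if (cmp > 0) {       // already past where the token would sit; stop early
            return false;
        }
    }
    return false;
}
The array would be built once per u1 (e.g. Arrays.sort on a copy of u1.getTokenlist()) and reused across all the u2 comparisons.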
If you want to optimize this function (not sure if it actually works in your context), you could assign each unique String an int value; when the String is added to the UVertex, set that int as a bit in a BitSet.
This function should then become a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings, that could be efficient.
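A rough sketch of that idea; the token-to-id map and the way the BitSet would be attached to each UVertex are assumptions, not part of the original code:
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch only: give every distinct token a small int id, represent each vertex's
// tokens as a BitSet of those ids, then intersection and union become bitwise and/or.
private final Map<String, Integer> tokenIds = new HashMap<>();

BitSet toBitSet(Iterable<String> tokens) {
    BitSet bits = new BitSet();
    for (String token : tokens) {
        bits.set(tokenIds.computeIfAbsent(token, k -> tokenIds.size()));
    }
    return bits;
}

double jaccard(BitSet a, BitSet b) {
    BitSet intersection = (BitSet) a.clone();
    intersection.and(b);
    BitSet union = (BitSet) a.clone();
    union.or(b);
    return union.isEmpty() ? 0.0 : (double) intersection.cardinality() / union.cardinality();
}
Since most of the intersections are reportedly empty, BitSet.intersects(other) can also be used as a cheap shortcut before doing the full cardinality work.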
I have a large array of strings that looks something like this:
String temp[] = new String[200000];
I have another String, let's call it bigtext. What I need to do is iterate through each entry of temp, checking to see if that entry is found in bigtext and then do some work based on it. So, the skeletal code looks something like this:
for (int x = 0; x < temp.length; x++) {
    if (bigtext.indexOf(temp[x]) > -1) {
        // do some stuff
    } else continue;
}
Because there are so many entries in temp and there are many instances of bigtext as well, I want to do this in the most efficient way. I am wondering if what I've outlined is the most efficient way to iterate through this search, or if there are better ways to do this.
Thanks,
Elliott
I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick which are designed to search in parallel for a large number of sub-strings in a text.
Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).
By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!
Building a suffix tree is O(|S1|), and each search is O(|S2|). You don't have to go through bigtext to find it, just the relevant piece of the suffix tree. Since it is done n times, you get a total of O(|S1| + n*|S2|), which is asymptotically better than the naive implementation.
If you have additional information about temp, you can maybe improve the iteration.
You can also reduce the time spent, if you parallelize the iteration.
Efficiency depends heavily on what is valuable to you.
Are you willing to increase memory for reduced time? Are you willing to increase time for efficient handling of large data sets? Are you willing to increase contention for CPU cores? Are you willing to do pre-processing (perhaps one or more forms of indexing) to reduce the lookup time in a critical section?
With what you've offered, you indicate the entire portion you want made more efficient, but that means you have excluded any portion of the code or system where the trade-off can be made. This forces one to imagine what you care about and what you don't care about. Odds are excellent that all the posted answers are both correct and incorrect depending on one's point of view.
An alternative approach would be to tokenize the text - let's say split by common punctuation. Then put these tokens into a Set and find the intersection with the main container.
Instead of an array, hold the words in a Set too. The intersection can be calculated by simply doing
bigTextSet.retainAll(mainWordsSet);
What remains will be the words that occur in bigtext and are in your "dictionary".
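A minimal sketch of that, assuming "words" can be approximated by splitting on non-word characters (which may or may not match how the temp entries were produced):
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Tokenize bigtext once, then intersect with the dictionary built from temp.
Set<String> bigTextSet = new HashSet<>(Arrays.asList(bigtext.split("\\W+")));
Set<String> mainWordsSet = new HashSet<>(Arrays.asList(temp));
bigTextSet.retainAll(mainWordsSet);   // bigTextSet now holds only the dictionary words that occur
Note this only works if the temp entries are whole words rather than arbitrary substrings of bigtext.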
Use a search algorithm like Boyer-Moore. Search for Boyer-Moore and you'll find lots of links that explain how it works. For instance, there is a Java example.
I'm afraid it's not efficient at all in any case!
To pick the right algorithm, you need to provide some answers:
What can be computed off-line? That is, is bigText known in advance? I guess temp is not, from its name.
Are you actually searching for words? If so, index them. A Bloom filter can help, too.
If you need a bit of fuzziness, maybe stemming or soundex can do the job?
Sticking to a strict inclusion test, you might build a trie from your temp array. It would prevent searching the same sub-string several times.
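A rough sketch of that trie approach - naive, without the failure links that Aho-Corasick adds, and with made-up names:
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Build a trie of all temp entries, then walk bigtext once per starting index
// following trie edges, so entries that share a prefix are checked together.
static class TrieNode {
    Map<Character, TrieNode> next = new HashMap<>();
    String word;                          // non-null if a temp entry ends at this node
}

static TrieNode buildTrie(String[] temp) {
    TrieNode root = new TrieNode();
    for (String s : temp) {
        TrieNode node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.next.computeIfAbsent(s.charAt(i), c -> new TrieNode());
        }
        node.word = s;
    }
    return root;
}

static Set<String> findAll(String bigtext, TrieNode root) {
    Set<String> found = new HashSet<>();
    for (int i = 0; i < bigtext.length(); i++) {
        TrieNode node = root;
        for (int j = i; j < bigtext.length(); j++) {
            node = node.next.get(bigtext.charAt(j));
            if (node == null) {
                break;                    // no temp entry continues with this character
            }
            if (node.word != null) {
                found.add(node.word);     // a temp entry matches starting at position i
            }
        }
    }
    return found;
}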
That is a very efficient approach. You can improve it slightly by only evaluating temp.length once
for(int x = 0, len = temp.length; x < len; x++)
Although you don't provide enough detail about your program, it's quite possible you can find a more efficient approach by redesigning your program.
I have some objects in an ArrayList, and I want to perform collision detection and such. Is it okay to do something like:
List<Person> people;
Iterator<Person> iterA0 = people.iterator();
while (iterA0.hasNext()) {
    Person a = iterA0.next();
    Iterator<Person> iterA1 = people.iterator();
    while (iterA1.hasNext()) {
        Person b = iterA1.next();
        a.getsDrunkAndRegretsHookingUpWith(b);
    }
}
That's gotta be terrible coding, right? How would I perform this nested iteration appropriately?
You can iterate over the same list multiple times concurrently, as long as it's not being modified during any of the iterations. For example:
List<Person> people = ...;
for (Person a : people) {
    for (Person b : people) {
        a.getsDrunkAndRegretsHookingUpWith(b);
    }
}
As long as the getsDrunkAndRegretsHookingUpWith method doesn't change the people list, this is all fine.
This is an example of the classic handshake problem. In a room full of n people, there are n choose 2 different possible handshakes. You can't avoid the quadratic runtime.
@Chris' answer shows you a better way to code it.
Re: OP comment
What I've been cooking up is some code where an event causes a particle to explode, which causes all nearby particles to explode...chain reaction. The objects are stored in one list and the non-exploded particles only explode if they are within a defined radius of an exploding one. So I could dish up some conditionals to make it a bit faster, but still need the n^2 traversal.
You should not use a list to store the particles. If you're modeling particles in 2 dimensions, use a quadtree. If 3 dimensions, an octree.
The number of iterations can be reduced if, in your case:
A.getsDrunkAndRegretsHookingUpWith(B) implies B.getsDrunkAndRegretsHookingUpWith(A) too,
and A.getsDrunkAndRegretsHookingUpWith(A) will always be the same for all elements;
then instead of using an iterator or foreach, you can take a more traditional approach and exclude collision with itself and with the elements with which a comparison has already taken place.
List<Person> people;
for (int i = 0; i < people.size(); i++) {
    Person a = people.get(i);
    for (int j = i + 1; j < people.size(); j++) { // compare with next element and onwards in the list
        a.getsDrunkAndRegretsHookingUpWith(people.get(j)); // call method
    }
}
That's gotta be terrible coding, right?
Most people would probably agree that if you were using Java 5+, and there was no particular need to expose the iterator objects, then you should simplify the code by using a "for each" loop. However, this should make no difference to the performance, and certainly not to the complexity.
Exposing the iterators unnecessarily is certainly not terrible programming. On a scale of 1 to 10 of bad coding style, this is only a 1 or 2. (And anyone who tells you otherwise hasn't seen any truly terrible coding recently ... )
So how to do n^2 over oneself?
Your original question is too contrived to be able to give an answer to that question.
Depending on what the real relation is, you may be able to exploit symmetry / anti-symmetry, or associativity to reduce the amount of work.
However, without that information (or other domain information), you can't improve on this solution.
For your real example (involving particles), you can avoid the O(N^2) comparison problem by dividing the screen into regions; e.g. using quadtrees. Then you iterate over points in the same and adjacent regions.
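A sketch of the region idea using a flat grid of cells rather than a full quadtree; the Particle class with x/y fields and the choice of cell size are assumptions for illustration:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Bucket particles into cells the size of the explosion radius, so each particle
// only needs to be checked against its own and the 8 neighbouring cells.
static Map<Long, List<Particle>> buildGrid(List<Particle> particles, double cellSize) {
    Map<Long, List<Particle>> grid = new HashMap<>();
    for (Particle p : particles) {
        grid.computeIfAbsent(cellKey(p.x, p.y, cellSize), k -> new ArrayList<>()).add(p);
    }
    return grid;
}

static long cellKey(double x, double y, double cellSize) {
    long cx = (long) Math.floor(x / cellSize);
    long cy = (long) Math.floor(y / cellSize);
    return (cx << 32) ^ (cy & 0xffffffffL);   // pack both cell coordinates into one key
}

static List<Particle> candidates(Map<Long, List<Particle>> grid, Particle p, double cellSize) {
    List<Particle> result = new ArrayList<>();
    long cx = (long) Math.floor(p.x / cellSize);
    long cy = (long) Math.floor(p.y / cellSize);
    for (long dx = -1; dx <= 1; dx++) {
        for (long dy = -1; dy <= 1; dy++) {
            List<Particle> cell = grid.get(((cx + dx) << 32) ^ ((cy + dy) & 0xffffffffL));
            if (cell != null) {
                result.addAll(cell);
            }
        }
    }
    return result;
}
Only the particles returned by candidates need the exact radius check, so the quadratic all-pairs traversal disappears for reasonably spread-out particles.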
I'm programming a java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.
The application needs to store all 120,000+ words. It needs to name them word_1, word_2, etc. And it also needs to access these words to perform various methods on them.
The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.
In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.
My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?
My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.
Thank you for your help in advance!!
Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.
Please forgive me for any fuzziness; and let me address your questions:
Q) English?
A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%
Q) Homework?
A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.
Q) Duplicates?
A) Yes. And probably every five or so words, considering conjunctions, articles, etc.
Q) Access?
A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…
Q) Iterate over the whole list?
A) Yes.
Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)
Cheers!
I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.
Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.
In fact, if you can set an upper bound to your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of string objects with an integer holding the actual number. This is likely to be faster than a class-based approach.
This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.
Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).
If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.
Just confirming pax's assumptions, with a very naive benchmark:
public static void main(String[] args)
{
    int size = 120000;
    String[] arr = new String[size];
    ArrayList al = new ArrayList(size);
    for (int i = 0; i < size; i++)
    {
        String put = Integer.toHexString(i).toString();
        // System.out.print(put + " ");
        al.add(put);
        arr[i] = put;
    }

    Random rand = new Random();
    Date start = new Date();
    for (int i = 0; i < 10000000; i++)
    {
        int get = rand.nextInt(size);
        String fetch = arr[get];
    }
    Date end = new Date();
    long diff = end.getTime() - start.getTime();
    System.out.println("array access took " + diff + " ms");

    start = new Date();
    for (int i = 0; i < 10000000; i++)
    {
        int get = rand.nextInt(size);
        String fetch = (String) al.get(get);
    }
    end = new Date();
    diff = end.getTime() - start.getTime();
    System.out.println("array list access took " + diff + " ms");
}
and the output:
array access took 578 ms
array list access took 907 ms
Running it a few times, the actual times seem to vary some, but generally array access is between 200 and 400 ms faster, over 10,000,000 iterations.
If you will access these Strings sequentially, the LinkedList would be the best choice.
For random access, ArrayLists have a nice memory usage/access speed trade-off.
My take:
For a non-threaded program, an ArrayList is always fastest and simplest.
For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.
If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph.) This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Use a Hashtable? This will give you your best lookup speed.
ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.
I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.
You're not going to get any methods that help you in the ways you've asked for in the examples above from the Collections Framework classes, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.
How about a radix tree or Patricia trie?
http://en.wikipedia.org/wiki/Radix_tree
The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.
I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.
Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"
Depends on what the problem is - speed or memory.
If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.
Now - that's not a very good solution. A better solution is to decide how much memory you want to use: let's say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located - do this in such a way that the bookmarks are more-or-less evenly spaced through the file.
Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.
An even quicker solution, using rather more memory, is to build some sort of cache for the blocks - on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word # X, and it's not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.
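One way to sketch that cache is a LinkedHashMap in access order, evicting the least recently used block; the block size constant and the idea of caching whole blocks of words are assumptions:
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Maps a block number to the words in that block; the eldest (least recently
// accessed) block is dropped once the cache is full.
private static final int MAX_CACHED_BLOCKS = 64;   // assumed limit

private final Map<Integer, List<String>> blockCache =
        new LinkedHashMap<Integer, List<String>>(MAX_CACHED_BLOCKS, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, List<String>> eldest) {
                return size() > MAX_CACHED_BLOCKS;
            }
        };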
And so on. It depends, really, on what the problem is - on what kind of patterns of retrieval are likely.
I don't understand why so many people are suggesting Arraylist, or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access them as key/value pairs ("word_348"="pedantic").
For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.
http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html
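For example, the key/value access pattern would look roughly like this (how the words are read in is assumed):
import java.util.TreeMap;

// word number -> word, with O(log n) lookups by number.
TreeMap<Integer, String> words = new TreeMap<Integer, String>();
int index = 1;
for (String w : wordsFromFile) {       // wordsFromFile is a hypothetical collection read earlier
    words.put(index++, w);
}
String word348 = words.get(348);       // the word the poster calls "word_348"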