I am trying to solve an algorithmic task where speed is of primary importance. In the algorithm, I am using a DFS search in a graph and in every step, I add a char and a String. I am not sure whether this is the bottleneck of my algorithm (probably not) but I am curious what is the fastest and most efficient way to do this.
At the moment, I use this:
transPred.symbol + word
I think there might be a better alternative than the "+" operator, but most String methods only work with other Strings (would converting my char into a String and using one of them make a difference?).
Thanks for answers.
EDIT:
for (Transition transPred : state.transtitionsPred) {
walk(someParameters, transPred.symbol + word);
}
transPred.symbol is a char and word is a string
A very common problem / concern.
Bear in mind that each String in Java is immutable. Thus, if you "modify" a string, you actually create a new object. This results in one new object for each concatenation you're doing above. This isn't great, as it's simply creating garbage that will have to be collected at some point.
If your graph is very large, that collection might happen during your traversal logic, and it may slow down your algorithm.
To avoid creating a new String for each concatenation, use the StringBuilder. You can declare one outside your loop and then append each character with StringBuilder.append(char). This does not incur a new object creation for each append() operation.
After your loop you can use StringBuilder.toString(), this will create a new object (the String) but it will only be one for your entire loop.
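A minimal sketch of that pattern, assuming the characters arrive in an array (the class name `BuilderLoop` and the `symbols` parameter are made up for illustration, not taken from the question):

```java
final class BuilderLoop {
    // Builds one String from many chars using a single StringBuilder.
    static String join(char[] symbols) {
        // Pre-size the buffer so it never has to grow (capacity is a hint, not a limit).
        StringBuilder sb = new StringBuilder(symbols.length);
        for (char c : symbols) {
            sb.append(c); // writes into the builder's buffer; no new String per step
        }
        return sb.toString(); // the only String created for the whole loop
    }
}
```

Note that this only helps when you need a single String at the end of the loop; if every iteration needs its own String, you still pay for one toString() per iteration.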
Since you replace one char in the string at each iteration I don't think that there is anything faster than a simple + append operation. As mentioned, Strings are immutable, so when you append a char to it, you will get a new String object, but this seems to be unavoidable in your case since you need a new string at each iteration.
If you really want to optimize this part, consider using something mutable like an array of chars. This would allow you to replace the first character without any excessive object creation.
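Here is a rough sketch of the char array idea, under a few assumptions of mine: the graph is acyclic, the buffer is long enough for the deepest path, and `State`/`Transition` below are hypothetical stand-ins for your own classes. Symbols are written right-to-left into one shared buffer, so "prepending" is just overwriting a slot, and a String is only created when a path ends.

```java
import java.util.ArrayList;
import java.util.List;

final class BackwardWalk {
    // Hypothetical stand-ins for the asker's State and Transition classes.
    static final class State {
        final List<Transition> transitionsPred = new ArrayList<>();
    }

    static final class Transition {
        final char symbol;
        final State source;

        Transition(char symbol, State source) {
            this.symbol = symbol;
            this.source = source;
        }
    }

    // 'start' is the index of the first used char in buf; chars from start to the end form the word.
    static void walk(State state, char[] buf, int start, List<String> out) {
        if (state.transitionsPred.isEmpty() || start == 0) {
            // Path ended (or buffer is full): this is the only place a String is allocated.
            out.add(new String(buf, start, buf.length - start));
            return;
        }
        for (Transition t : state.transitionsPred) {
            buf[start - 1] = t.symbol; // "prepend" by overwriting one slot
            walk(t.source, buf, start - 1, out);
        }
    }
}
```

You would call it as `walk(endState, new char[maxDepth], maxDepth, results)`, where `maxDepth` is whatever upper bound on path length you can justify.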
Also, I think you're right when you say that this probably isn't your bottleneck. And remember that premature optimization is the root of all evil etc. (Don't mind the irony that the most popular example of good optimization is avoiding excessive string concatenation).
I'm writing a function with Java which can be simplified as following:
StringBuilder s = new StringBuilder(50);
Queue<String> q = new LinkedList<>();
for(some condition){
s.setLength(0);
s.append("...").append("...").append("...").append("...");//add several strings to s
q.add(s.toString());
}
I save those strings in the queue q, and if the size of the queue exceeds a certain value, it writes q to a database. However, it becomes slow, especially when the number of loop iterations is huge (millions). I assume that is because the concatenation takes a large amount of time. So is there any better way to do the concatenation? Thanks in advance for your help!
Update:
I want to use the same StringBuilder to create the strings, so I call s.setLength(0) to reset it at the beginning of each iteration. These strings hold the information for new nodes, such as an ID and some properties, so I need to retrieve this information by calling some functions and append the results to s. The idea behind it is that when the queue reaches a specific size, I pop the information from the queue to create nodes in the Neo4j database, since it would cost more to use one transaction for every new node.
If you are calling functions to get the strings, then the fastest way is to not use the StringBuilder at all. Instead, use the concat() method, which is the best choice performance-wise.
Here is an example:
Queue<String> q = new LinkedList<>();
for(any condition){
q.add(getStringOne().concat(getStringTwo()).concat(getStringThree()));
}
Update: After a comment claimed that StringBuilder is better than concat() in this case, I ran some tests. StringBuilder took more than twice as long as concat(). This is because we are not using the loop to build a single string, but creating a new string in every iteration.
Here is the repl which I used to perform test: https://repl.it/#ankitbeniwal/StringBuilderVersusStringConcat
Let's say I'm constructing a set of Strings where each String is a prefix of the next one. For example, imagine I write a function:
public Set<String> example(List<String> strings) {
Set<String> result = new HashSet<>();
String incremental = "";
for (String s : strings) {
incremental = incremental + ":" + s;
result.add(incremental);
}
return result;
}
Would it ever be worthwhile to rewrite it to use a StringBuilder rather than concatenation? Obviously that would avoid constructing a new StringBuilder in each iteration of the loop, but I'm not sure whether that would be a significant benefit for large lists or whether the overhead that you normally want to avoid by using StringBuilders in loops is mostly just the unnecessary String constructions.
This answer is only correct for Java 8; as @user85421 points out, + on strings is no longer compiled to StringBuilder operations in Java 9 and later.
Theoretically at least, there is still a reason to use a StringBuilder in your example.
Let's consider how string concatenation works: the assignment incremental = incremental + ":" + s; actually creates a new StringBuilder, appends incremental to it by copying, then appends ":" to it by copying, then appends s to it by copying, then calls toString() to build the result by copying, and assigns a reference to the new string to the variable incremental. The total number of characters copied from one place to another is (N + 1 + s.length()) * 2 where N is the original length of incremental, because of copying every character into the StringBuilder's buffer once, and then back out again once.
In contrast, if you use a StringBuilder explicitly - the same StringBuilder across all iterations - then inside the loop you would write incremental.append(":").append(s); and then explicitly call toString() to build the string to add to the set. The total number of characters copied here would be (1 + s.length()) * 2 + N, because the ":" and s have to be copied in and out of the StringBuilder, but the N characters from the previous state only have to be copied out of the StringBuilder in the toString() method; they don't also have to be copied in, because they were already there.
So, by using a StringBuilder instead of concatenation, you are copying N fewer characters into the buffer on each iteration, and the same number of characters out of the buffer. The value of N grows from initially 0, to the sum of the lengths of all of the strings (plus the number of colons), so the total saving is quadratic in the sum of the lengths of the strings. That means the saving could be quite significant; I'll leave it to someone else to do the empirical measurements to see how significant it is.
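For reference, the explicit-StringBuilder version described above could look like the sketch below. The wrapping class name `PrefixSet` is mine; the method body is just the question's example() with the String replaced by one shared builder.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PrefixSet {
    static Set<String> example(List<String> strings) {
        Set<String> result = new HashSet<>();
        // One builder for all iterations: earlier characters stay in its buffer.
        StringBuilder incremental = new StringBuilder();
        for (String s : strings) {
            incremental.append(':').append(s); // copies only the new characters in
            result.add(incremental.toString()); // copies the whole current prefix out
        }
        return result;
    }
}
```

Each iteration still copies the full prefix out in toString(), which is why the total work stays quadratic; only the copy into the buffer is saved.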
Generally you always want to go for StringBuilder in a loop, as concatenation turns O(n) algorithms into O(n^2). However, this is already O(n^2). Even the required memory usage is O(n^2). It looks as if it won't hugely matter, but perhaps there could be a factor of two performance difference. Also, as you can see from the comments, readers are expecting StringBuilder - don't surprise them unnecessarily.
In general, whilst some may say measure, O(n^2) can blow up in situations that don't occur in testing. In any case, who wants to microbenchmark all their code? Avoid big-O inefficiencies as a matter of course.
On some implementations, String.substring would share the backing char[] between original and substring. However, I don't believe this is currently usually done. That doesn't stop you writing your own little String class.
A program I'm working on converts an array of integers to a string using string builder. I'm trying to determine the time complexity of this approach.
Check out: https://stackoverflow.com/a/7156703/7294647
Basically, it's not clear what the time complexity is for StringBuilder#append as it depends on its implementation, so you shouldn't have to worry about it.
There might be a more efficient way of approaching your int[]-String conversion depending on what you're actually trying to achieve.
If the StringBuilder needs to increase its capacity, that involves copying the entire character array to a new array. You can avoid this by initially setting the capacity so it won't have to do this. (This should be easy since you know the length of the int array and the maximum number of characters in the String representation of an int.)
If you avoid the need to increase the capacity, the complexity would seem to just be O(n). When you append, you're just copying the character array from the String to the end of the character array in the StringBuilder.
(Yes, it depends on the implementation, but it would be a rather poor implementation of StringBuilder if it couldn't append in O(n) time.)
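A small sketch of the pre-sizing suggestion, assuming the ints are joined with a comma (the separator and the class name `IntArrayToString` are my assumptions, since the question doesn't show its code). The longest decimal form of an int is "-2147483648", which is 11 characters.

```java
final class IntArrayToString {
    static String join(int[] values) {
        // 11 chars for the longest int plus 1 for the separator, so the buffer never regrows.
        StringBuilder sb = new StringBuilder(values.length * 12);
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(','); // separator between elements, not after the last one
            }
            sb.append(values[i]); // StringBuilder.append(int) writes the decimal digits
        }
        return sb.toString();
    }
}
```

This over-allocates when most values are short; whether that trade is worth it depends on your data.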
I'm programming a java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.
The application needs to store all +120,000 words. It needs to name them word_1, word_2, etc. And it also needs to access these words to perform various methods on them.
The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.
In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.
My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?
My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.
Thank you for your help in advance!!
Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.
Please forgive me for any fuzziness; and let me address your questions:
Q) English?
A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%
Q) Homework?
A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.
Q) Duplicates?
A) Yes. And probably every five or so words, considering conjunctions, articles, etc.
Q) Access?
A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…
Q) Iterate over the whole list?
A) Yes.
Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)
Cheers!
I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.
Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.
In fact, if you can set an upper bound to your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of string objects with an integer holding the actual number. This is likely to be faster than a class-based approach.
This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.
Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).
If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.
Just confirming pax's assumptions with a very naive benchmark:
public static void main(String[] args)
{
int size = 120000;
String[] arr = new String[size];
ArrayList<String> al = new ArrayList<>(size);
for (int i = 0; i < size; i++)
{
String put = Integer.toHexString(i);
// System.out.print(put + " ");
al.add(put);
arr[i] = put;
}
Random rand = new Random();
Date start = new Date();
for (int i = 0; i < 10000000; i++)
{
int get = rand.nextInt(size);
String fetch = arr[get];
}
Date end = new Date();
long diff = end.getTime() - start.getTime();
System.out.println("array access took " + diff + " ms");
start = new Date();
for (int i = 0; i < 10000000; i++)
{
int get = rand.nextInt(size);
String fetch = (String) al.get(get);
}
end = new Date();
diff = end.getTime() - start.getTime();
System.out.println("array list access took " + diff + " ms");
}
and the output:
array access took 578 ms
array list access took 907 ms
running it a few times the actual times seem to vary some, but generally array access is between 200 and 400 ms faster, over 10,000,000 iterations.
If you will access these Strings sequentially, the LinkedList would be the best choice.
For random access, ArrayLists have a nice memory usage/access speed trade-off.
My take:
For a non-threaded program, an ArrayList is always fastest and simplest.
For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.
If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph.) This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Use a Hashtable? This will give you your best lookup speed.
ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.
I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.
You're not going to get any methods that help you as you've asked for in the examples above from your Collections Framework class, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.
How about a radix tree or Patricia trie?
http://en.wikipedia.org/wiki/Radix_tree
The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.
I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.
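To make that concrete, here is a sketch of indexed access on top of an ArrayList, with the two comparisons from the question. The class and method names (`WordStore`, `longer`, `sharedLetters`) are invented for illustration, and I'm assuming "word_N" simply means the N-th word read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class WordStore {
    private final List<String> words = new ArrayList<>();

    void add(String word) {
        words.add(word);
    }

    // word_1 is the first word, so subtract 1 for the zero-based list index.
    String word(int n) {
        return words.get(n - 1);
    }

    // Returns the number of the longer word (the first argument on a tie).
    int longer(int a, int b) {
        return word(a).length() >= word(b).length() ? a : b;
    }

    // Letters that occur in both words, in sorted order.
    Set<Character> sharedLetters(int a, int b) {
        Set<Character> shared = new TreeSet<>();
        for (char c : word(a).toCharArray()) {
            shared.add(c);
        }
        Set<Character> other = new TreeSet<>();
        for (char c : word(b).toCharArray()) {
            other.add(c);
        }
        shared.retainAll(other); // keep only letters present in both words
        return shared;
    }
}
```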
Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"
Depends on what the problem is - speed or memory.
If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.
Now - that's not a very good solution. A better solution is to decide how much memory you want to use: let's say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located - do this in such a way that the bookmarks are more-or-less evenly spaced through the file.
Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.
An even quicker solution, using rather more memory, is to build some sort of cache for the blocks - on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word # X, and it's not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.
And so on. It depends, really, on what the problem is - on what kind of patterns of retrieval are likely.
I don't understand why so many people are suggesting ArrayList, or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access them as key/value pairs ("word_348"="pedantic").
For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.
http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html
What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that
equal strings map to equal integers, and
different strings map to different integers?
So, similar to hashCode(), but different strings are required to produce different integers. In a sense, it would be a hashCode() without the possibility of collisions.
An obvious solution would maintain a mapping table from strings to integers, and a counter to guarantee that new strings are assigned a new integer. I'm just wondering how this problem is usually solved.
Would also be interesting to extend it to other objects than strings.
Have a look at perfect hashing.
This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.
A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation that can handle up to 2^32 = 4294967296 different strings. Never mind that it uses lots of memory.
import java.util.HashMap;
import java.util.Map;
public class StringToInt {
private Map<String, Integer> map;
private int counter = Integer.MIN_VALUE;
public StringToInt() {
map = new HashMap<String, Integer>();
}
public int toInt(String s) {
Integer i = map.get(s);
if (i == null) {
map.put(s, counter);
i = counter;
++counter;
}
return i;
}
}
There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.
In most hashcode() type implementations, collisions are accepted as inevitable and tested for.
If you absolutely must have no collisions, guaranteed, the solution you outline will work.
Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.
Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)
Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.
edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.
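As a sketch of the crypto hash idea, here is one way to turn a string into a non-negative BigInteger with SHA-256 from the standard java.security API. The class name `HashToBigInteger` and the choice of UTF-8 are my assumptions; any fixed encoding would do as long as it is used consistently.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashToBigInteger {
    static BigInteger sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Signum 1 interprets the 32 digest bytes as a non-negative number.
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

The result is the same for the same string regardless of insertion order, but as noted above it cannot be reversed without a separate map from hash to string.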
I'd try to do it by introducing an object holding a Map<String, Integer> and a Map<Integer, String>. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting an Integer value for a String already registered will return the same value.
Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String.
Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)
For the more general case, you might create a high level subclass of Object, introduce a "integerize" method there and extend every single class from that. I think, however, that road leads to tears.
Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:
String s = "my string";
BigInteger bi = new BigInteger(s.getBytes());
Reverse mapping:
String str = new String(bi.toByteArray());
Can you use a Map to indicate which Strings you already have assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into a Map<String, Integer> so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map<Integer, String>.
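A minimal sketch of that "primary key" approach, assuming IDs start at 1 and are handed out in order of first appearance. Since the IDs are dense, I've used an ArrayList for the reverse direction instead of a second Map; the class name `StringIds` is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringIds {
    private final Map<String, Integer> idOf = new HashMap<>();
    private final List<String> stringOf = new ArrayList<>(); // list index == id - 1

    // Returns the existing id for s, or assigns the next positive integer.
    int id(String s) {
        Integer existing = idOf.get(s);
        if (existing != null) {
            return existing;
        }
        int id = stringOf.size() + 1; // ids are 1, 2, 3, ...
        idOf.put(s, id);
        stringOf.add(s);
        return id;
    }

    // Reverse lookup; throws IndexOutOfBoundsException for an id that was never assigned.
    String string(int id) {
        return stringOf.get(id - 1);
    }
}
```

The IDs depend on insertion order, so they are only stable across runs if you persist the table.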
As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.
If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because
The pointers can still be used as C strings.
Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.
If Java offers a test for object identity on String objects then you can play the same game there.
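Java does offer this: `==` on references tests object identity, and `String.intern()` returns a canonical instance, so equal strings intern to the same object. A tiny sketch (the class and method names are mine):

```java
final class InternDemo {
    // True when both strings have equal contents, decided by identity after interning.
    static boolean sameAfterIntern(String a, String b) {
        // intern() returns one canonical String per distinct content,
        // so reference comparison is enough here.
        return a.intern() == b.intern();
    }
}
```

In practice you would intern each string once when it enters the system and keep only the canonical references, so that later comparisons are plain `==` checks.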
If by integer you mean the data type, then as other posters have explained this is quite impossible, due to the fact that the integer data type is of fixed size, and strings are unbounded.
However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)
Implementation speaking, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway and on the off chance there is a collision, it still works out to be constant time. However if this isn't applicable, I'm not sure what the best solution would be.
Interesting question.
I don't know if this is practical, but if we take only the lowercase letter alphabet, then every word can be viewed as a number in a base-26 positional system. For example, if a is 0 and z is 25, then boom is 1*26^3 + 14*26^2 + 14*26^1 + 12*26^0 = 27416
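A quick sketch of that encoding, assuming only the letters a to z are allowed and the result must fit in a long (the class name `Base26` is made up). One caveat: because a is 0, leading a's vanish, so "a", "aa" and "" all map to 0; mapping a to 1 instead would avoid that.

```java
final class Base26 {
    // Interprets a lowercase word as a base-26 number with a = 0 ... z = 25.
    static long encode(String word) {
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + word);
            }
            // The exact methods throw ArithmeticException instead of silently overflowing.
            value = Math.addExact(Math.multiplyExact(value, 26L), c - 'a');
        }
        return value;
    }
}
```

With this, `Base26.encode("boom")` gives 27416, matching the calculation above. A long overflows after about 13 letters, so longer words would need BigInteger.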