So, might sound like an odd question, but is it faster to compare 2 String's, or byte[]'s (using Arrays.equals())? I'm working with Hadoop/Hbase, and I get byte[] as the value from Hbase, and I have a value that is passed in. Will it be faster to convert the value I get to a String and compare? Or compare them as to byte arrays?
Without actually testing this it would seem that Array.equals() is your friend. To make a string you end up making a copy of the byte array in the String constructor, then you have to decode the unicode, which involves creating a decoder for the default Unicode encoding, and converting the byte array into a char array, then you have to do the equals, which involves iterating through every character in each of the strings.
So on a O() type calculation you already have to read every byte in the array to do the conversion to a character, so I'd say the complexity is worse for converting to String for equals.
Update:
Given the comments added to the question, it sounds like you are given a String and are using it to compare to multiple results in the MapReduce job. In this case it seems that there is one conversion of the input String to bytes and them multiple byte array comparisons. This seems faster than leaving the input String and converting every byte array returned in the job.
Firstly, You have to consider whether both the strings are of same encoding.
Then if you just want to have an equals check then go ahead with byte comparison. But if you want to have the compareTo behavior of String, then you may have to figure out how to know which string is greater or lesser, in which case I would prefer converting to String first and then compare.
If they are not of same encoding, then its better to create Strings and then compare as the decoding part will be done by String class itself.
First, you should ask yourself if it really matters. Given that you are dealing with HBase, and thus network communication, whatever you do may be completely swamped, time-wise. Like #Clint and #Suraj, I think your probably better off with fewer method calls (i.e. using Array.equals() ). Just think of what has to happen when you do a String equals, and then add in the overhead of converting the byte-arrays to Strings.
Related
there are strings like '20190709','20190929', etc. Their length are the same and all 'nummber'. I am trying to sort them without transfering to other type and I hope the result is same as int sorting. We don't talk about effeciency problem. Is there other problem that would make a wrong result?
If you want to sort a list of Strings in numerical order, the easiest way is:
Arrays.sort(strings, Comparator.comparingInt(Integer::parseInt)); for a String[]
Collections.sort(strings, Comparator.comparingInt(Integer::parseInt)); for a List<String>.
You say "without transferring to another type". The above methods will implicitly convert the Strings to ints, but you will still keep your String references.
Most methods to sort Strings will use alphabetical order, which will give a different result to numerical order.
These methods are also relatively efficient.
Why does String class in java have char[] value, int offset and int count fields. What is their purpose and what task do they accomplish?
The char[] array holds the array of characters making up that string.
The offset and count are used for the String.substring() operation. When you take a substring of a string the resultant String references the original character array, but stores an associated offset and length (this is known as a flyweight pattern and is a commonly used technique to save memory)
e.g. String.substring("ABCDEF", 1, 2);
would reference the original array of A,B,C,D,E,F but set offset to 1 and length to 1 (since the substring method uses start and end indices). Note you can do this trivially since the character array is immutable. You can't change it.
Note: This has changed recently (7u6, I believe) and is no longer true in recent versions. I suspect this is due to the realisation that this optimisation isn't really used much.
They allow passing back and forth an array as a backing for routines that are primarily interested in a portions of the array. This allows one to not worry about constructing tons of small arrays, avoiding the costs associated with object construction for particular operations.
For example, one might use an array as the input buffer, but then need additional arrays to handle the chunked up characters within that buffer, with the triple arguments of array, offset, and count, you can "simulate" reading from the middle of the buffer without the need to create a new (secondary) array.
This is important, as while you might reasonably want an array (an object in java) to hold the input characters, you probably don't want to allocate and garbage collect thousands of arrays (and copy the characters into them) to pass the data into something that only expects a single word, as delimited by white space (hey, it's just an example).
I have this 4 Dimensional array to store String values which are used to create a map which is then displayed on screen with the paintComponent. I have read many articles saying that using huge arrays is very inefficient. (Especially since the array dimensions are 16x16x3x3) I was wondering if there was any way to store the string values (I use them as ID values) differently to save memory or reduce fetching time. If you have any ideas or methods I would appreciate it. Thanks!
Well, if your matrix is full, that is every element contains data, then I believe an array is optimally efficient. But if your matrix is sparse, you could look into using more Linking-based data types.
the first thing I would do is to not use strings as IDs, use ints. It'll reduce the size of your structure a lot.
Also, that array really isn't that big, I wouldn't worry about efficiency if that's the only data structure you have. It's only 2304 elements large.
First off, 16*16*3*3 = 2304 - quite modest really. At this size I'd be more worried about the confusion likely to be caused by a 4D array than the size it is taking!
As others have said, if it fully populated, arrays are ok. If it has gaps, an ArrayList or similar would be better.
If the Strings are just IDs, why not store an enum (or even Integers) instead of a string?
Keep in mind that the String values are separate from the array. The array itself takes the same memory space regardless of what string values it links to. Accessing a specific address in your array will take the same amount of time regardless of what type of object you have saved there, or what the value of that object is.
However, if you find that many of your string values represent exactly the same string, you can avoid having multiple copies of the same string by leveraging String.intern(). If you store the interned string, and you don't have any other references to the non-interned string, that frees the non-interned string up to be garbage-collected. Your array will then have multiple entries that point to the same memory space, rather than different memory addresses with equivalent string objects.
See also:
Is it good practice to use java.lang.String.intern()?
http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html
Depending on the requirements of your IDs, you may also want to look into using a different data structure than strings. For example, while the array itself would be the same size, storing int values would avoid the need to allocate extra space for each individual entry.
Also, a 4-dimensional array may not be the best data structure for your needs in the first place. Can you describe why you've chosen this data structure for what you're trying to represent?
The Strings only take up the space of a reference in each array element. There could be a savings if the strings come from a very small set of values. A more important question is are your 4-dimensional arrays sparse or mostly filled? If you have very few values actually specified then you might have a big savings replacing the 4-d array with a Map from the indicies to the String. Let me know if you want a code sample.
Do you actually have a 4D array of 16x16x3x3 (i.e. 2k) string objects? That doesn't sound that big to me. An array is the most efficient way to store a collection of objects, in terms of memory. An ArrayList can be slightly less efficient (up to 50% wasted space).
The only other way I can think of is to store the Strings end-to-end in one giant String and then use substring() to get the bit you need from that, but you would still need to store the indexes somewhere.
Are you running out of memory? If so check that the Strings in your array are the size you think they are - the backing array of a String instance in Java can be much larger than the string itself. If you do subString() on a 1 GB string, the returned string instance shares the 1 GB array of the first string so will keep it from being GC'd longer than you might expect.
I've got 2 2D arrays, one int and one String, and I want them to appear one next to the other since they have the same number of rows. Is there a way to do this? I've thought about concatenating but that requires that they be the same type of array, so in that case, is there a way I could make my int array a String array?
If you want an array that can hold both Strings and ints, you basically have two choices:
Treat them both as Objects, so
effectively Object[][]
concatArray. Autoboxing will
convert your ints to Integers.
Treat them both as Strings (using
String.valueOf(int) and
Integer.parseInt(String)).
I don't know for a fact, but would guess autoboxing is a less expensive operation that converting ints to string and back.
Further, you can always find out the value type of a cell in the array by using instanceof operator; if values are converted to String, you actually need to parse a value to find out if its just a bit of text or a text representation of a number.
These two considerations -- one a guess, the other possibly irrelevant in your case -- would support using option 1 above.
Just cast the ints to Strings as you concatenate. The end result would be a 2D array of type String.
To concatenate an int to a String, you can just use
int myInt = 5;
String newString = myInt + "";
It's dirty, but it's commonly practiced, thus recognizable, and it works.
There are two ways to do this that I can see:
You can create a custom data object that holds the strings and ints. This would be the best way if the two items belong together. It also makes comparing rows easier.
You can create a 4D array of objects and put all the values together like this. I wouldn't recommend it, but it does solve the problem,
I hear the curse of dimensionality lurking in the background, that said - first answer that comes to mind is:
final List<int,String> list = new List<int,String>();
then reviewing op's statement, we see two [][] which raises the question of ordering. Looking at two good replies already we can do Integer.toString( int ); to get concatenation which fulfills op's problem definition, then it's whether ordering is significant or flat list and ( again ) what to do with the second array? Is it a tuple? If so, how to "hold" the data in the row ... we could then do List<Pair<Integer,String>> or List<Pair<String,String>>, which seems the canonical solution to me.
What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that
equal strings map to equal integers, and
different strings map to different integers?
So, similar to hashCode() but different strings are required to produce different integers. So, in a sense, it would be a hasCode() without the collision possibility.
An obvious solution would maintain a mapping table from strings to integers,
and a counter to guarantee that new strings are assigned a new integer. I'm just wondering
how is this problem usually solved.
Would also be interesting to extend it to other objects than strings.
Have a look at perfect hashing.
This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.
A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation where all (2^32 = 4294967296 different strings) can be used. Never mind that it uses lots of memory.
import java.util.HashMap;
import java.util.Map;
public class StringToInt {
private Map<String, Integer> map;
private int counter = Integer.MIN_VALUE;
public StringToInt() {
map = new HashMap<String, Integer>();
}
public int toInt(String s) {
Integer i = map.get(s);
if (i == null) {
map.put(s, counter);
i = counter;
++counter;
}
return i;
}
}
There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.
In most hashcode() type implementations, collisions are accepted as inevitable and tested for.
If you absolutely must have no collisions, guaranteed, the solution you outline will work.
Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.
Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)
Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.
edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.
I'd try to do by introducing an object holding Map and Map. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting a Integer value for a String already registered will return the same value.
Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String.
Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)
For the more general case, you might create a high level subclass of Object, introduce a "integerize" method there and extend every single class from that. I think, however, that road leads to tears.
Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:
String s = "my string";
BigInteger bi = new BigInteger(s.getBytes());
Reverse mapping:
String str = new String(bi.toByteArray());
Can you use a Map to indicate which Strings you already have assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into a Map so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map.
As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.
If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because
The pointers can still be used as C strings.
Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.
If Java offers a test for object identity on String objects then you can play the same game there.
If by integer you mean the data type, then as other posters have explained this is quite impossible, due to the fact that the integer data type is of fixed size, and strings are unbound.
However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)
Implementation speaking, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway and on the off chance there is a collision, it still works out to be constant time. However if this isn't applicable, I'm not sure what the best solution would be.
Interesting question.
I don't know if this is practical, but if we take only lowercase letter alphabet, than every word can be viewed as a number in 26-base positional system. For example, if a is 0 and z is 25 than boom is 1*26^3 + 14*26^2 + 14*26^1 + 12*26^0 = 27416