I need a BitSet which allows easy concatenation of multiple BitSets to create a new BitSet. The default implementation doesn't have such a method.
Is there an implementation in some external library that you know of which allows easy concatenation?
For example, let's say I have a bit array 11111 and another bit array 010101. I want append-like functionality, so concatenating them would result in 11111010101.
Well, there's no way to implement this terribly efficiently (performance- and memory-wise, that is) since there's no left-shift method.
What you can do is use the obvious nextSetBit for loop - slow, but memory efficient.
The presumably faster method would be to call toLongArray on one set, copy that, correctly shifted, into a large enough long array, create a BitSet from that array, and OR it with the other set. That way you don't do any bit shifting on single bits but instead work on word-sized chunks.
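An untested sketch of that word-based approach might look like the following. It assumes n is the logical length of the leading BitSet (a BitSet doesn't track trailing zero bits, so the caller has to supply it), and the method name is just for illustration; it needs Java 7+ for toLongArray and BitSet.valueOf.

static BitSet concatenate(BitSet first, int n, BitSet second) {
    long[] src = second.toLongArray();
    int wordShift = n / 64;
    int bitShift = n % 64;
    // One extra word catches bits that spill over a word boundary.
    long[] dst = new long[wordShift + src.length + 1];
    for (int i = 0; i < src.length; i++) {
        dst[i + wordShift] |= src[i] << bitShift;
        if (bitShift != 0) {
            dst[i + wordShift + 1] |= src[i] >>> (64 - bitShift);
        }
    }
    // Bits of the second set now start at position n; OR in the first set.
    BitSet result = BitSet.valueOf(dst);
    result.or(first);
    return result;
}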
This worked for me:
BitSet concatenate_vectors(BitSet vector_1_in, BitSet vector_2_in) {
    BitSet vector_1_in_clone = (BitSet) vector_1_in.clone();
    BitSet vector_2_in_clone = (BitSet) vector_2_in.clone();
    int n = 5; // desired length of the first (leading) vector
    int index = -1;
    while (index < (vector_2_in_clone.length() - 1)) {
        index = vector_2_in_clone.nextSetBit(index + 1);
        // copy each set bit of the second vector, shifted past the first one
        vector_1_in_clone.set(index + n);
    }
    return vector_1_in_clone;
}
Result: 11111010101
I know in advance that there will be 84 strings appended with a comma separator to create one string.
Which way is better: a fixed array, String concatenation, or StringBuilder?
If by "best" you mean "most memory and/or runtime efficient" then you're probably best off with a StringBuilder you pre-allocate. (Having looked at the implementation of String.join in the JDK, it uses StringJoiner, which uses a StringBuilder with the default initial capacity [16 chars] with no attempt to avoid reallocation and copying.)
You'd sum up the lengths of your 84 strings, add in the number of commas, create a StringBuilder with that length, add them all, and call toString on it. E.g.:
int length = 0;
for (int i = 0; i < strings.length; ++i) {
    length += strings[i].length();
}
length += strings.length - 1; // For the commas
StringBuilder sb = new StringBuilder(length);
sb.append(strings[0]);
for (int i = 1; i < strings.length; ++i) {
    sb.append(',');
    sb.append(strings[i]);
}
String result = sb.toString();
There are a lot of ways of doing that.
My preferred way of doing it (which may or may not be the best) would be to convert my 84 strings into a stream (with Arrays.stream() or list.stream(), depending how the strings are actually stored) and then do Collectors.joining(",").
That said, if you already have an array, String.join(",", array) will do the trick as well, as noted in another answer.
You could also use StringJoiner to build the String. It would be like using StringBuilder, but you don't need to worry about the commas (and you can even append and prepend a value if you want).
This is mainly useful when you're building the result in parts, or when you may omit some elements. Otherwise it offers no benefits vs. Collectors.joining() or String.join() (which internally uses StringJoiner anyway).
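For reference, a quick sketch of the three shortcuts mentioned above (String.join, a stream with Collectors.joining, and StringJoiner), assuming the 84 strings are already in a String[] called strings:

import java.util.Arrays;
import java.util.StringJoiner;
import java.util.stream.Collectors;

String viaJoin = String.join(",", strings);

String viaStream = Arrays.stream(strings)
        .collect(Collectors.joining(","));

StringJoiner joiner = new StringJoiner(",");
for (String s : strings) {
    joiner.add(s);
}
String viaJoiner = joiner.toString();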
I'm attempting to write my own hash function in Java. I'm aware that this is the same one that Java implements, but I wanted to test it out myself. I'm getting collisions when I input different values and am not sure why.
public static int hashCodeForString(String s) {
    int m = 1;
    int myhash = 0;
    for (int i = 0; i < s.length(); i++, m++) {
        myhash += s.charAt(i) * Math.pow(31, (s.length() - m));
    }
    return myhash;
}
Kindly remember just how a hash-table (in any language ...) actually works: it consists of a (usually, prime) number of "buckets." The purpose of the hash-function is simply to convert any incoming key-value into a bucket-number. (The worst-case scenario is always that 100% of the incoming keys wind-up in a single bucket, leaving you with "a linked list.") You simply strive to devise a hash-function that will "typically" produce a "widely scattered" distribution of values so that, when calculated modulo the (prime ...) number of buckets, "most of the time, most of the buckets" will be "more-or-less equally" filled. (But remember: you can never be sure.)
"Collisions" are entirely to be expected: in fact, "they happen all the time."
In my humble opinion, you're "over-thinking" the hash-function: I see no compelling reason to use Math.pow() at all. Expect that any value which you produce will be converted to a hash-bucket number by taking its absolute value modulo the number of buckets. The best way to see if you came up with a good one (for your data ...) is to observe the resulting distribution of bucket-sizes. (Is it "good enough" for your purposes yet?)
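If you still want the exact String.hashCode result without Math.pow, a sketch of the usual integer-only formulation (Horner's rule) is below; it avoids the double rounding and truncation that the pow-based version suffers on longer strings, which is a likely source of the extra collisions:

public static int hashCodeForString(String s) {
    int myhash = 0;
    for (int i = 0; i < s.length(); i++) {
        // 31 * previous + current char, with ordinary int overflow,
        // which is what String.hashCode effectively computes.
        myhash = 31 * myhash + s.charAt(i);
    }
    return myhash;
}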
In my code I am trying to check whether a number exists in the HashMap or not. My code is the following:
BitSet arp = new BitSet();
for i = 0 to 10 million
    HashMap.get(i)
    if number exist
        arp.set(i, true)
    else
        arp.set(i, false)
After that, I read from the BitSet whether number i exists or not. However, I found this BitSet operation quite slow (I also tried string = string + 0/1, which was even slower). Can anybody help me replace this operation with a faster one?
Your code is really difficult to read clearly, but I suspect you're just trying to set bits in the BitSet that are keys from your HashMap?
In that case, your code should just be more or less
BitSet bits = new BitSet(10000000);
for (Integer k : map.keySet()) {
    bits.set(k);
}
Even if this wasn't what you meant, as a general rule, BitSet is blazing fast; I suspect it's the rest of your code that's slow.
If you had provided your actual relevant code, we could have looked for performance problems there in the first place. But assuming your code is OK and you have profiled your application to make sure that the BitSet operations are actually slow:
If you have enough memory space available, you can always just go for a boolean[] instead of a BitSet.
BitSet internally uses long[] to store the separate bits, so it's very good memory-wise, but can sometimes be a little bit too slow.
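A minimal sketch of that boolean[] alternative, assuming the same map of Integer keys as above: roughly one byte per flag instead of one bit, but no shifting or masking on access.

boolean[] present = new boolean[10_000_000];
for (Integer k : map.keySet()) {
    present[k] = true; // everything else stays false by default
}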
I am looking for a Java data structure for storing a large text (about a million words), such that I can get a word by its index (for example, get the 531467th word).
The problem with String[] or ArrayList is that they take too much memory - about 40 bytes per word in my environment.
I thought of using a String[] where each element is a chunk of 10 words, joined by a space. This is much more memory-efficient - about 20 bytes per word; but the access is much slower.
Is there a more efficient way to solve this problem?
As Jon Skeet already mentioned, 40 MB isn't too large.
But you stated that you are storing a text, so there may be many repeated Strings.
For example, stop words like "and" and "or".
You can use String.intern() [1]. This will pool your String and return a reference to an already existing String.
intern() is quite slow, though, so you can replace it with a HashMap that will do the same trick for you.
[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29
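A small sketch of that HashMap-based pooling, with a hypothetical canonical() helper: the first occurrence of each distinct word is kept, and later duplicates are swapped for a reference to it.

import java.util.HashMap;
import java.util.Map;

Map<String, String> pool = new HashMap<>();

String canonical(String word) {
    // Keep the first instance of each distinct word, reuse it afterwards (Java 8+).
    String existing = pool.putIfAbsent(word, word);
    return existing != null ? existing : word;
}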
You could look at memory-mapping the data structure, but performance might be completely horrible.
Store all the words in a single string:
import java.util.Arrays;
import java.util.Collection;

class WordList {
    private final String content;
    private final int[] indices;

    public WordList(Collection<String> words) {
        StringBuilder buf = new StringBuilder();
        indices = new int[words.size()];
        int currentWordIndex = 0;
        int previousPosition = 0;
        for (String word : words) {
            buf.append(word);
            indices[currentWordIndex++] = previousPosition;
            previousPosition += word.length();
        }
        content = buf.toString();
    }

    public String wordAt(int index) {
        if (index == indices.length - 1) return content.substring(indices[index]);
        return content.substring(indices[index], indices[index + 1]);
    }

    public static void main(String... args) {
        WordList list = new WordList(Arrays.asList(args));
        for (int i = 0; i < args.length; ++i) {
            System.out.printf("Word %d: %s%n", i, list.wordAt(i));
        }
    }
}
Apart from the characters they contain, each word has an overhead of four bytes using this solution (the entry in indices). Retrieving a word with wordAt will always allocate a new string; you could avoid this by saving the toString() of the StringBuilder rather than the builder itself, although it uses more memory on construction.
Depending on the kind of text, language, and more, you might want a solution that deals with recurring words better (like the one previously proposed).
-XX:+UseCompressedStrings
Use a byte[] for Strings which can be represented as pure ASCII.
(Introduced in Java 6 Update 21 Performance Release)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
Seems like an interesting article:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
I hear ropes are quite good in terms of speed for storing large strings, though I'm not sure about memory. But you might want to check it out.
http://ahmadsoft.org/ropes/
http://en.wikipedia.org/wiki/Rope_%28computer_science%29
One option would be to store byte arrays instead with the text encoded in UTF-8:
byte[][] words = ...;
Then:
public String getWord(int index)
{
    // StandardCharsets.UTF_8 (java.nio.charset) avoids the checked
    // UnsupportedEncodingException that new String(byte[], "UTF-8") would throw.
    return new String(words[index], StandardCharsets.UTF_8);
}
This will be smaller in two ways:
The data for each string is directly in a byte[], rather than the String having a couple of integer members and a reference to a separate char[] object
If your text is mostly-ASCII, you'll benefit from UTF-8 using a single byte per character for those ASCII characters
I wouldn't really recommend this approach though... again it will be slower on access, as it needs to create a new String each time. Fundamentally, if you need a million string objects (so you don't want to pay the recreation penalty each time) then you're going to have to use the memory for a million string objects...
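For completeness, a small sketch of building that byte[][] up front, assuming the words start out in a List<String> called words:

import java.nio.charset.StandardCharsets;
import java.util.List;

byte[][] encode(List<String> words) {
    byte[][] encoded = new byte[words.size()][];
    for (int i = 0; i < words.size(); i++) {
        encoded[i] = words.get(i).getBytes(StandardCharsets.UTF_8);
    }
    return encoded;
}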
You could create a datastructure like this:
List<string> wordlist
Dictionary<string, int> tsildrow // for reverse lookup while building the structure
List<int> wordindex
wordlist will contain a list of all (unique) words,
tsildrow will give the index of a word in wordlist, and wordindex will tell you, for each position in the text, the index of the corresponding word in wordlist.
You would operate it in the following fashion:
for word in text:
    if not word in tsildrow:
        wordlist.append(word)
        tsildrow.add(word, wordlist.last_index)
    wordindex.append(tsildrow[word])
this fills up your datastructure. Now, to find the word at index 531467:
print wordlist[wordindex[531467]]
you can reproduce the entire text like this:
for index in wordindex:
    print wordlist[index] + ' '
except that you will still have a problem with punctuation, etc.
If you won't be adding any more words (i.e. your text is stable), you can delete tsildrow to free up some memory, if that is a concern of yours. A hedged Java sketch of this structure follows.
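Here is that sketch, with names mirroring the pseudocode (the class and method names are just for illustration):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexedText {
    private final List<String> wordlist = new ArrayList<>();      // unique words
    private final Map<String, Integer> tsildrow = new HashMap<>(); // reverse lookup while building
    private final List<Integer> wordindex = new ArrayList<>();     // one index per word position

    void append(String word) {
        Integer id = tsildrow.get(word);
        if (id == null) {
            id = wordlist.size();
            wordlist.add(word);
            tsildrow.put(word, id);
        }
        wordindex.add(id);
    }

    String wordAt(int position) {
        return wordlist.get(wordindex.get(position));
    }
}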
OK, I have experimented with several of your suggestions, and here are my results (I checked (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory()) before filling the array, and checked again after filling the array and gc()):
Original (array of strings): 54 bytes/word (not 40 as I mistakenly wrote)
My solution (array of chunks of strings, separated by spaces):
2 words per chunk - 36 b/w (but unacceptable performance)
10 words per chunk - 18 b/w
100 words per chunk - 14 b/w
byte arrays - 40 b/w
char arrays - 36 b/w
HashMap, either mapping a string to itself, or mapping a string to its index - 26 b/w
(not sure I implemented this correctly)
intern - 10 b/w
baseline (empty array) - 4 b/w
The average word length is about 3 chars, and most chars are non-ASCII so it's probably about 6 bytes. So, it seems that intern is close to the optimum. It makes sense, since it's an array of words, and many of the words appear much more than once.
I would probably consider using a file, with either fixed-size words or some sort of index. FileInputStream with skip can be pretty efficient.
If you are on a mobile device you can use TIntArrayList, which uses 4 bytes per int value. If you use one index per word it will only need a couple of MB. You can also use int[].
If you have a PC or server, this is a trivial amount of memory. Memory costs about £6 per GB, or 1 cent per MB.
I have an array of floats and I would like to convert it to an array of doubles in Java. I am aware of the obvious way of iterating over the array and creating a new one. I expected Java to digest a float[] smoothly wherever it expects a double[]... but it cannot work with that.
What is the elegant, effective way of doing this conversion?
Basically something has to do the conversion of each value. There isn't an implicit conversion between the two array types because the code used to handle them after JITting would be different - they have a different element size, and the float would need a conversion whereas the double wouldn't. Compare this to array covariance for reference types, where no conversions are required when reading the data (the bit pattern is the same for a String reference as an Object reference, for example) and the element size is the same for all reference types.
In short, something will have to perform conversions in a loop. I don't know of any built-in methods to do this. I'm sure they exist in third party libraries somewhere, but unless you happen to be using one of those libraries already, I'd just write your own method. For the sake of convenience, here's a sample implementation:
public static double[] convertFloatsToDoubles(float[] input)
{
    if (input == null)
    {
        return null; // Or throw an exception - your choice
    }
    double[] output = new double[input.length];
    for (int i = 0; i < input.length; i++)
    {
        output[i] = input[i];
    }
    return output;
}
In Java 8 you can, if you really want to, do:
IntStream.range(0, floatArray.length).mapToDouble(i -> floatArray[i]).toArray();
But it's better (cleaner, faster, better semantics) to use Jon Skeet's function.
Do you actually need to copy your float array to a double array? If you are having trouble with the compiler and types when using the floats you can use this...
float x = 0;
double d = Double.valueOf(x);
What advantage do you get by taking a copy? If you need greater precision in the results of computations based on the floats then make the results double. I can see quite a lot of downsides to having a copy of the array, and performing the copy function, especially if it is large. Make sure you really need to do it.
HTH