I am looking for a Java data structure for storing a large text (about a million words), such that I can get a word by index (for example, get the 531,467th word).
The problem with String[] or ArrayList is that they take too much memory - about 40 bytes per word in my environment.
I thought of using a String[] where each element is a chunk of 10 words, joined by a space. This is much more memory-efficient - about 20 bytes per word; but the access is much slower.
Is there a more efficient way to solve this problem?
As Jon Skeet already mentioned, 40 MB isn't too large.
But you stated that you are storing a text, so there may be many identical Strings.
For example stop words like "and" and "or".
You can use String.intern()[1]. This will pool your Strings and return a reference to an already existing equal String.
intern() is quite slow, though, so you can replace it with a HashMap that does the same trick for you.
[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29
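For example, a minimal sketch of such a manual pool (the class and method names are my own):

import java.util.HashMap;
import java.util.Map;

class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns one canonical instance per distinct word, so repeated words
    // ("and", "or", ...) are kept in memory only once.
    String canonicalize(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}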
You could look at using memory mapping the data structure, but performance might be completely horrible.
Store all the words in a single string:
import java.util.Arrays;
import java.util.Collection;

class WordList {
    private final String content;
    private final int[] indices;   // start offset of each word within content

    public WordList(Collection<String> words) {
        StringBuilder buf = new StringBuilder();
        indices = new int[words.size()];
        int currentWordIndex = 0;
        int previousPosition = 0;
        for (String word : words) {
            buf.append(word);
            indices[currentWordIndex++] = previousPosition;
            previousPosition += word.length();
        }
        content = buf.toString();
    }

    public String wordAt(int index) {
        if (index == indices.length - 1) return content.substring(indices[index]);
        return content.substring(indices[index], indices[index + 1]);
    }

    public static void main(String... args) {
        WordList list = new WordList(Arrays.asList(args));
        for (int i = 0; i < args.length; ++i) {
            System.out.printf("Word %d: %s%n", i, list.wordAt(i));
        }
    }
}
Apart from the characters they contain, each word has an overhead of four bytes in this solution (its entry in indices). Retrieving a word with wordAt always allocates a new String; you can mitigate this by saving the toString() of the StringBuilder rather than the builder itself, although that uses more memory during construction.
Depending on the kind of text, language, and more, you might want a solution that deals with recurring words better (like the one previously proposed).
-XX:+UseCompressedStrings
Use a byte[] for Strings which can be represented as pure ASCII.
(Introduced in Java 6 Update 21 Performance Release)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
This seems like an interesting article:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
I hear ropes are quite good in terms of speed for storing large strings, though I'm not sure about memory. You might want to check them out.
http://ahmadsoft.org/ropes/
http://en.wikipedia.org/wiki/Rope_%28computer_science%29
One option would be to store byte arrays instead with the text encoded in UTF-8:
byte[][] words = ...;
Then:
public String getWord(int index)
{
    // java.nio.charset.StandardCharsets.UTF_8 avoids the checked
    // UnsupportedEncodingException that the (byte[], String) constructor declares
    return new String(words[index], StandardCharsets.UTF_8);
}
This will be smaller in two ways:
The data for each string is directly in a byte[], rather than the String having a couple of integer members and a reference to a separate char[] object
If your text is mostly-ASCII, you'll benefit from UTF-8 using a single byte per character for those ASCII characters
I wouldn't really recommend this approach though... again it will be slower on access, as it needs to create a new String each time. Fundamentally, if you need a million string objects (so you don't want to pay the recreation penalty each time) then you're going to have to use the memory for a million string objects...
You could create a data structure like this:
List<string> wordlist
Dictionary<string, int> tsildrow // for reverse lookup while building the structure
List<int> wordindex
wordlist will contain all the (unique) words,
tsildrow will give you the index of a word in wordlist (for reverse lookup), and wordindex will tell you, for each position in your text, the index into wordlist.
You would operate it in the following fashion:
for word in text:
    if not word in tsildrow:
        wordlist.append(word)
        tsildrow.add(word, wordlist.last_index)
    wordindex.append(tsildrow[word])
This fills up your data structure. Now, to find the word at index 531467:
print wordlist[wordindex[531467]]
you can reproduce the entire text like this:
for index in wordindex:
print wordlist[index] + ' '
except that you will still have a problem with punctuation etc...
if you won't be adding any more words (i.e. your text is stable), you can delete tsildrow to free up some memory if this is a concern of yours.
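A rough Java sketch of this structure (the names are my own; an int[] or a growable int buffer would avoid the Integer boxing in wordIndex):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexedText {
    private final List<String> wordList = new ArrayList<>();        // unique words
    private final Map<String, Integer> reverse = new HashMap<>();   // word -> index in wordList
    private final List<Integer> wordIndex = new ArrayList<>();      // text position -> index in wordList

    void append(String word) {
        Integer idx = reverse.get(word);
        if (idx == null) {
            idx = wordList.size();
            wordList.add(word);
            reverse.put(word, idx);
        }
        wordIndex.add(idx);
    }

    String wordAt(int position) {
        return wordList.get(wordIndex.get(position));
    }
}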
OK, I have experimented with several of your suggestions, and here are my results (I checked Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() before filling the array, and checked again after filling the array and calling gc()):
Original (array of strings): 54 bytes/word (not 40 as I mistakenly wrote)
My solution (array of chunks of strings, separated by spaces):
2 words per chunk - 36 b/w (but unacceptable performance)
10 words per chunk - 18 b/w
100 words per chunk - 14 b/w
byte arrays - 40 b/w
char arrays - 36 b/w
HashMap, either mapping a string to itself, or mapping a string to its index - 26 b/w
(not sure I implemented this correctly)
intern - 10 b/w
baseline (empty array) - 4 b/w
The average word length is about 3 chars, and most chars are non-ASCII so it's probably about 6 bytes. So, it seems that intern is close to the optimum. It makes sense, since it's an array of words, and many of the words appear much more than once.
I would probably consider using a file, with either fixed-size words or some sort of index. FileInputStream with skip() can be pretty efficient.
If you are on a mobile device you can use TIntArrayList, which uses 4 bytes per int value. If you use one index per word it will only need a couple of MB. You could also use a plain int[].
If you have a PC or server, this is a trivial amount of memory. Memory costs about £6 per GB, or about 1 cent per MB.
Related
How to join a list of millions of values into a single String by appending '\n' at the end of each line
Input data is in a List:
list[0] = And the good south wind still blew behind,
list[1] = But no sweet bird did follow,
list[2] = Nor any day for food or play
list[3] = Came to the mariners' hollo!
The code below joins the list into a string, appending a newline character at the end of each element:
String joinedStr = list.stream().collect(Collectors.joining("\n", "{", "}"));
But the problem is that if the list has millions of entries, the joining fails. My guess is that the String object can't handle millions of lines due to its large size.
Please give suggestions.
The problem with trying to compose a gigantic string is that you have to keep the entire thing in memory before you do anything further with it.
If the string is too big to fit in memory, you have only two options:
increase the available memory, or
avoid keeping a huge string in memory in the first place
This string is presumably destined for some further processing - maybe it's being written to a blob in a database, or maybe it is the body of an HTTP response. It's not being constructed just for fun.
It is probably preferable to write to some kind of stream (maybe an implementation of OutputStream) that can be read one character at a time. The consumer can optionally buffer based on the delimiter if they are aware of the context of what you're sending, or they can wait until they have the entire thing.
Preferably you would use something which supports back pressure so that you can pause writing if the consumer is too slow.
Exactly how this looks will depend on what you're trying to accomplish.
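For instance, here is a hedged sketch that streams the same "{", "\n", "}" framing from the question straight to a file instead of materializing one huge String (the path and method names are placeholders):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class StreamJoin {
    // Writes the joined output piece by piece, so the joined result never has to fit in memory.
    static void writeJoined(List<String> lines, String path) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get(path), StandardCharsets.UTF_8)) {
            out.write("{");
            boolean first = true;
            for (String line : lines) {
                if (!first) out.write("\n");
                out.write(line);
                first = false;
            }
            out.write("}");
        }
    }
}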
Maybe you can do it with a StringBuilder, which is designed specifically for handling large Strings. Here's how I'd do it:
StringBuilder sb = new StringBuilder();
for (String s : list) sb.append(s).append("\n");
return sb.toString();
I haven't tested this code, but it should work.
private static String buildSomeString(Map<String, String> data) {
    StringBuilder result = new StringBuilder();
    for (Map.Entry<String, String> field : data.entrySet()) {
        result.append("some literal")
              .append(field.getKey())
              .append("another literal")
              .append(field.getKey())
              .append("and another one")
              .append(field.getValue())
              .append("and the last in this iteration");
    }
    return result.toString();
}
When I run PMD on this I get the following error:
StringBuffer constructor is initialized with size 16, but has at least 83 characters appended.
The number of characters is probably wrong, because I changed literals before posting.
Thanks
StringBuilder's constructor can optionally receive an int with the size of the internal buffer to use. If none is given (as in your code), it defaults to 16.
As you append data to the StringBuilder, it automatically resizes the internal buffer as needed. This resizing means creating a new, larger array and copying the old data into it. This is "a costly" operation (note the quotes; this is a micro-optimization, and if you are using bad algorithms such as bubble sort you have bigger problems).
Making a more educated guess about the expected size of the string can avoid or minimize such reallocations.
PMD doesn't know what the contents of the map are, but it knows it will include at least 83 chars (given the map is not empty).
This can be resolved by doing a more educated guess on the size, such as:
StringBuilder result = new StringBuilder(83 * data.size()); // 83 or whatever you constant strings account for
This can be further refined if you can better approximate the expected size of the map's keys and values. Usually, going slightly over the actual expected output is better: even if it means allocating a bit more memory, it has a better chance of avoiding reallocations completely.
When you create a StringBuilder with the default capacity, its internal array has to be extended if you append beyond that capacity.
If you know the length of the final String that you need to create, then you can create a StringBuilder with that capacity; it will then know that you need that many characters, and its internal array will not need to be extended.
I work with text files containing short strings (10 digits each). The file size is approx 1.5 GB, so the number of rows reaches 100 million.
Every day I get another file and need to extract the new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried to load the data into an ArrayList - it takes around 20 seconds for each file, but the subtraction of the arrays takes forever.
I use this code:
dataNew.removeAll(dataOld);
Tried to load the data into HashSets - creation of the HashSets is endless.
The same with LinkedHashSet.
Tried to load into ArrayLists and to sort only one of them
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Also memory consumption is rather high - sort() finishes only with heap of 15Gb (13Gb is not enough).
I've tried the good old Linux util diff, and it finished the task in 76 minutes (while eating 8 GB of RAM).
So, my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with consumption of 15 GB of RAM (or better, 8-10 GB).
Any suggestions, please?
Maybe I don't need alphabetic sorting of the ArrayList, but something else?
UPDATE:
This is a country-wide list of invalid passports. It is published as a global list, so I need to extract the delta myself.
The data is unsorted and each row is unique, so I must compare 100M elements with 100M elements. A data line is, for example, "2404,107263". Converting it to an integer is not possible.
Interestingly, when I increased the maximum heap size to 16 GB
java -Xms5G -Xmx16G -jar utils.jar
loading into a HashSet became fast (50 seconds for the first file), but the program gets killed by the system's out-of-memory killer, as it eats an enormous amount of RAM while loading the second file into a second HashSet or ArrayList.
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
on the second file the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
UPDATE2:
Thanks for all your ideas!
Final solution is: converting lines to Long + using fastutil library (LongOpenHashSet)
RAM consumption became 3.6Gb and processing time only 40 seconds!
An interesting observation: while starting Java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), starting with -Xmx16G sped the loading up to 1 minute. But memory consumption was ridiculous (around 20 GB); the processing speed was fine, though - 2 minutes.
If you are not limited by RAM, the native JDK HashSet is not so bad in terms of speed.
P.S. Maybe the task is not clearly explained, but I do not see any way to avoid loading at least one file entirely, so I doubt memory consumption can be lowered much further.
First of all, don't do Files.readAllLines(Paths.get("filename")) and then pass everything to a Set - that holds unnecessarily huge amounts of data at once. Try to hold as few lines as possible at all times.
Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.
Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // process your line, maybe add to the Set for the old data?
        oldData.add(line);
    }
}

Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
        boolean oldRemoved = oldData.remove(line);
        if (!oldRemoved) {
            newData.add(line);
        }
    }
}
You'll end up with two sets containing only the data that is present in the old, or the new dataset, respectively.
Second of all, try to presize your containers if at all possible. Their size (usually) doubles when they reach their capacity, and that could potentially create a lot of overhead when dealing with big collections.
Also, if your data are numbers, you could just store each entry as a long instead of holding String instances. There are a lot of collection libraries that let you do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, Trove. Even their collections for Objects might serve you very well, as a standard HashSet has a lot of unnecessary object allocation.
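A sketch of that long-based idea with fastutil; packing the two numeric parts of a line such as "2404,107263" into a single long is my assumption about the data format:

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

class PackedLines {
    // Assumes every line is "<number>,<number>" with both parts fitting in 32 bits.
    static long pack(String line) {
        int comma = line.indexOf(',');
        long left = Long.parseLong(line.substring(0, comma));
        long right = Long.parseLong(line.substring(comma + 1));
        return (left << 32) | right;
    }

    public static void main(String[] args) {
        LongOpenHashSet oldData = new LongOpenHashSet();
        oldData.add(pack("2404,107263"));
        System.out.println(oldData.contains(pack("2404,107263"))); // true
    }
}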
Split each line into two strings, and whichever part (str1 or str2) repeats most, call intern() on it to avoid storing duplicates of the same String in the heap. Here I used intern() on both parts just to show the idea, but don't use it unless the values really do repeat a lot.
Set<MyObj> lineData = new HashSet<MyObj>();
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file.getAbsoluteFile()))) {
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        String[] data = line.split(",");
        MyObj myObj = new MyObj();
        myObj.setStr1(data[0].intern());
        myObj.setStr2(data[1].intern());   // was setStr1 twice in the original
        lineData.add(myObj);
    }
}
public class MyObj {

    private String str1;
    private String str2;

    public String getStr1() {
        return str1;
    }

    public void setStr1(String str1) {
        this.str1 = str1;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((str1 == null) ? 0 : str1.hashCode());
        result = prime * result + ((str2 == null) ? 0 : str2.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MyObj other = (MyObj) obj;
        if (str1 == null) {
            if (other.str1 != null)
                return false;
        } else if (!str1.equals(other.str1))
            return false;
        if (str2 == null) {
            if (other.str2 != null)
                return false;
        } else if (!str2.equals(other.str2))
            return false;
        return true;
    }

    public String getStr2() {
        return str2;
    }

    public void setStr2(String str2) {
        this.str2 = str2;
    }
}
Use a database; to keep things simple, use a Java-embedded database (Derby, HSQL, H2, ...). With that much information, you can really benefit from standard DB caching, time-efficient storage, and querying. Your pseudo-code would be:
if first use:
    define new one-column table, setting column as primary key
    iterate through input records, for each:
        insert record into table
otherwise:
    open database with previous records
    iterate through input records, for each:
        lookup record in DB, update/report as required
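A minimal JDBC sketch of this first approach, using an embedded H2 database (the JDBC URL, table name, and sample record are placeholders; Derby or HSQL would look much the same):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class EmbeddedDbLookup {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./records")) {
            conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS records (line VARCHAR PRIMARY KEY)");
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO records (line) VALUES (?)")) {
                insert.setString(1, "2404,107263");
                try {
                    insert.executeUpdate();          // succeeded: this record is new
                } catch (SQLException duplicate) {
                    // primary-key violation: the record was already present
                }
            }
        }
    }
}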
Alternatively, you can do even less work if you use existing "table-diff" libraries, such as DiffKit - from their tutorial:
java -jar ../diffkit-app.jar -demoDB
Then configure a connection to this demo database within your favorite
JDBC enabled database browser
[...]
Your DB browser will show you the tables TEST10_LHS_TABLE and
TEST10_RHS_TABLE (amongst others) populated with the data values from
the corresponding CSV files.
That is: DiffKit does essentially what I proposed above, loading files into database tables (they use embedded H2) and then comparing these tables through DB queries.
They accept input as CSV files; but conversion from your textual input to their CSV can be done in a streaming fashion in less than 10 lines of code. And then you just need to call their jar to do the diff, and you would get the results as tables in their embedded DB.
I made a very simple spell checker; just checking whether a word was in the dictionary was too slow for whole documents. I created a map structure, and it works great.
Map<String, List<String>> dictionary;
For the key, I use the first two letters of the word. The list holds all the words that start with that key. To speed it up a bit more, you can sort each list and then use a binary search to check for existence. I'm not sure what the optimum key length is, and if your key gets too long you could nest the maps. Eventually it becomes a tree; a trie structure is possibly the best fit, actually.
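A compact sketch of that two-letter-key idea (my own class; it assumes the buckets are sorted once after loading):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixDictionary {
    private final Map<String, List<String>> buckets = new HashMap<>();

    void add(String word) {
        buckets.computeIfAbsent(key(word), k -> new ArrayList<>()).add(word.toLowerCase());
    }

    void sortBuckets() {                       // call once after loading the dictionary
        for (List<String> bucket : buckets.values()) Collections.sort(bucket);
    }

    boolean contains(String word) {
        List<String> bucket = buckets.get(key(word));
        return bucket != null && Collections.binarySearch(bucket, word.toLowerCase()) >= 0;
    }

    private static String key(String word) {
        return word.substring(0, Math.min(2, word.length())).toLowerCase();
    }
}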
You can use a trie data structure for such cases: http://www.toptal.com/java/the-trie-a-neglected-data-structure
The algorithm would be as follows:
Read the old file line by line and store each line in the trie.
Read the new file line by line and test whether each line is in the trie: if it is not, then it is a newly added line.
A further memory optimization can take advantage of the fact that there are only 10 digits, so 4 bits are enough to store a digit (instead of 2 bytes per character in Java). You may need to adapt the trie data structure from one of the following links:
Trie data structures - Java
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
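A minimal sketch of such a trie, specialized for the digit-and-comma lines from the question (the 4-bits-per-digit packing mentioned above is left out for clarity; it assumes lines contain only digits and a comma):

class DigitTrie {
    private static final int RADIX = 11;                   // digits 0-9 plus ','
    private final DigitTrie[] next = new DigitTrie[RADIX];
    private boolean terminal;                              // true if a full line ends here

    private static int slot(char c) {
        return c == ',' ? 10 : c - '0';
    }

    void insert(String line) {
        DigitTrie node = this;
        for (int i = 0; i < line.length(); i++) {
            int k = slot(line.charAt(i));
            if (node.next[k] == null) node.next[k] = new DigitTrie();
            node = node.next[k];
        }
        node.terminal = true;
    }

    boolean contains(String line) {
        DigitTrie node = this;
        for (int i = 0; i < line.length(); i++) {
            node = node.next[slot(line.charAt(i))];
            if (node == null) return false;
        }
        return node.terminal;
    }
}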
A String object holding 11 characters (up to 12, in fact) will have a size of 64 bytes (on 64-bit Java with compressed oops). The only structure that can hold so many elements and be of a reasonable size is an array:
100,000,000 * (64b per String object + 4b per reference) = 6,800,000,000b ~ 6.3Gb
So you can immediately forget about Maps, Sets, etc., as they introduce too much memory overhead. An array is actually all you need. My approach would be:
Load the "old" data into an array, sort it (this should be fast enough)
Create a back-up array of primitive booleans with same size as the loaded array (you can use the BitSet here as well)
Read the new data file line by line. Use binary search to check whether the passport data exists in the old data array. If the item exists, mark its index in the boolean array/BitSet as true (you get the index back from the binary search). If the item does not exist, just save it somewhere (an ArrayList will do).
When all lines are processed, remove from the old array all the items that have false in the boolean array/BitSet (by index, of course), and finally add to the array all the new data you saved.
Optionally sort the array again and save to disk, so next time you load it you can skip the initial sorting.
This should be fast enough, imo. The initial sort is O(n log n), and the binary search is O(log n) per line, so you end up with (excluding the final removal and adding, which is at most 2n):
n log(n) (sort) + n log(n) (binary check for n elements) = 2 n log(n)
Other optimizations would be possible if you explained more about the structure of that String (whether there is some pattern to it or not).
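A condensed sketch of the approach described above (file names are placeholders and error handling is omitted):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class ArrayDiff {
    public static void main(String[] args) throws IOException {
        // Load and sort the old data once; O(n log n).
        List<String> oldLines = Files.readAllLines(Paths.get("old.txt"));
        String[] oldData = oldLines.toArray(new String[0]);
        Arrays.sort(oldData);

        BitSet seen = new BitSet(oldData.length);   // marks old entries found again
        List<String> added = new ArrayList<>();     // lines present only in the new file

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("new.txt"))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                int idx = Arrays.binarySearch(oldData, line);   // O(log n) per line
                if (idx >= 0) {
                    seen.set(idx);
                } else {
                    added.add(line);
                }
            }
        }
        // Old entries whose bit is still clear no longer appear in the new file.
        System.out.println("newly added lines: " + added.size());
    }
}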
The main problem is the repeated resizing of the ArrayList while readAllLines() runs. A better choice is a LinkedList for inserting the data:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    List<String> result = new LinkedList<>();
    for (;;) {
        String line = reader.readLine();
        if (line == null)
            break;
        result.add(line);
    }
    return result;
}
I have a method in the engine I'm using (AndEngine):
public final void setText(String pString){...}
My app updates the score every second from a static int:
mScoreText.setText(""+PlayerSystem.mScore);
The problem is that this way a new String object is created every second, and after 1 minute I have 59 String objects for the GC to collect, plus additional AbstractStringBuilders and inits...
I've found a partial solution on the AndEngine forums, like this:
private static StringBuilder mScoreValue = new StringBuilder("000000");
private static final char[] DIGITS = {'0','1','2','3','4','5','6','7','8','9'};
mScoreValue.setCharAt(0, DIGITS[(PlayerSystem.mScore% 1000000) / 100000]);
mScoreValue.setCharAt(1, DIGITS[(PlayerSystem.mScore% 100000) / 10000]);
mScoreValue.setCharAt(2, DIGITS[(PlayerSystem.mScore% 10000) / 1000]);
mScoreValue.setCharAt(3, DIGITS[(PlayerSystem.mScore% 1000) / 100]);
mScoreValue.setCharAt(4, DIGITS[(PlayerSystem.mScore% 100) / 10]);
mScoreValue.setCharAt(5, DIGITS[(PlayerSystem.mScore% 10)]);
mScoreText.setText(mScoreValue.toString());
But the main problem remains: toString() returns a new object on every call.
Is there any way to solve this?
It sounds like a good candidate to use StringBuilder:
http://developer.android.com/reference/java/lang/StringBuilder.html
Or StringBuffer:
http://developer.android.com/reference/java/lang/StringBuffer.html
Reasoning is:
StringBuffer is used to store character strings that will be changed (String objects cannot be changed). It automatically expands as needed. Related classes: String, CharSequence.
StringBuilder was added in Java 5. It is identical in all respects to StringBuffer except that it is not synchronized, which means that if multiple threads are accessing it at the same time, there could be trouble. For single-threaded programs, the most common case, avoiding the overhead of synchronization makes the StringBuilder very slightly faster.
Edit:
One thing you have to be careful of is how you use whichever SB class you pick.
The reason is (and the same applies in .NET) that if you have a usage like this
StringBuilder sb = new StringBuilder(score.toString() + "hello, world!");
you've still got two string concatenation operations: you're possibly making three strings there, one for score.toString(), one to turn the literal "hello, world!" into a string, and one that contains the two concatenated together.
To get the best results, you need to use the StringBuilder's append/insert/replace methods.
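A rough sketch of what that looks like for the score case from the question, reusing one builder so only the final toString() allocates (the class and label text are my own):

class ScoreFormatter {
    private final StringBuilder sb = new StringBuilder(16);

    // Builds the label text without any intermediate concatenated Strings.
    String format(int score) {
        sb.setLength(0);                  // reuse the same builder on every update
        sb.append("Score: ").append(score);
        return sb.toString();             // still one new String per call
    }
}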
As far as I know, there is no way to get around the fact that Strings are immutable and if your method takes a String, a new one will have to be created every time.
First, 120 objects in two minutes is nothing you should worry about, unless they are very large.
Second, the String class keeps a pool of String literals (and explicitly interned Strings). So, if you do
String a = new String("Nabucodonosor King of Babilonia");
String b = new String("Nabucodonosor King of Babilonia");
then Nabucodonosor King of Babilonia is stored only once in memory (but there are two String objects pointing at it). Check String#intern() for details.
And last, as Daniel points out, since Strings are immutable there is no workaround while you are using Strings. You could do some tricks (comparing the new value with the old one and creating the String only if they differ), but I doubt they compensate for the added complexity.
I was given an English vocabulary assignment by my teacher:
Choose a random letter, say 'a'.
Write a word starting with that letter, say 'apple'. Take its last letter, 'e', and write a word starting with 'e', say 'elephant'. Now continue from 't', and so on. No repetition allowed.
Make a list of 500 words.
Mail the list to the teacher. :)
So Instead of doing it myself, I am working on a Java code which will do my homework for me.
The code seems to be simple.
The core of the algorithm:
Pick a random word from a dictionary that satisfies the requirement (seek() with a RandomAccessFile), and try to put it in a Set that preserves insertion order (maybe a LinkedHashSet).
But the problem is the huge size of the dictionary, with 300,000+ entries. :|
Brute-force random algorithms won't work.
What could be the best, quickest and most efficient way out?
UPDATE: Now that I have written the code and it's working, how can I make it efficient so that it chooses common words?
Are there any text files around containing a list of common words?
Either look for a data structure allowing you to keep a compacted dictionary in memory, or simply give your process more memory. Three hundred thousand words is not that much.
Hope this doesn't spoil your fun or something, but if I were you I'd take this approach..
Pseudo java:
abstract class Word {
    String word;
    char last();
    char first();
}

abstract class DynamicDictionary {
    Map<Character, Set<Word>> first_indexed;

    Word removeNext(Word word) {
        Set<Word> candidates = first_indexed.get(word.last());
        return removeRandom(candidates);
    }

    /**
     * Remove a random word out from the entire dic.
     */
    Word removeRandom();

    /**
     * Remove and return a random word out from the set provided.
     */
    Word removeRandom(Set<Word> wordset);
}
and then
Word primer = dynamicDictionary.removeRandom();
List<Word> list = new ArrayList<Word>(500);
list.add(primer);
Word cur = primer;
for (int i = 0; i < 499; i++) {
    cur = dynamicDictionary.removeNext(cur);
    list.add(cur);
}
NOTE: Not intended to be viewed as actual Java code, just a way to roughly explain the approach (no error handling, not a good class structure if it were really used, no encapsulation, etc.).
Should I encounter memory issues, maybe I'll do this:
abstract class Word {
    int lineNumber;
    char last();
    char first();
}
If that is not sufficient, I guess I'll use a binary search on the file or put it in a DB, etc.
I think one way to do this could be to use a TreeSet into which you put the whole dictionary, then use the subSet method to retrieve all the words beginning with the desired letter and pick a random element from that subset.
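A small sketch of that subSet() idea (the dictionary is passed in; the class and method names are mine):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;

class SubSetPick {
    // All words >= "t" and < "u" are exactly the words starting with 't'.
    static String randomWordStartingWith(TreeSet<String> dictionary, char letter, Random rnd) {
        SortedSet<String> range = dictionary.subSet(String.valueOf(letter),
                                                    String.valueOf((char) (letter + 1)));
        if (range.isEmpty()) return null;
        List<String> candidates = new ArrayList<>(range);
        return candidates.get(rnd.nextInt(candidates.size()));
    }
}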
But in my opinion the best way to do this, given the quantity of data, would be to use a database with SQL queries instead of Java.
If I do this:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

class LoadWords {
    public static void main(String... args) {
        try {
            Scanner s = new Scanner(new File("/usr/share/dict/words"));
            ArrayList<String> ss = new ArrayList<String>();
            while (s.hasNextLine())
                ss.add(s.nextLine());
            System.out.format("Read %d words\n", ss.size());
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
        }
    }
}
I can run it with java -mx16m LoadWords, which limits the Java heap size to 16 Mb, which is not that much memory for Java. My /usr/share/dict/words file has approximately 250,000 words in it, so it may be a bit smaller than yours.
You'll need to use a different data structure than the simple ArrayList<String> I've used here. Perhaps a HashMap of ArrayList<String>, keyed on the starting letter of each word, would be a good starting choice.
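For example, a sketch of that keyed-by-first-letter map (building on a word list loaded as above; the names are my own):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordIndex {
    // Groups words by their first letter so the "next word must start with X" lookup is O(1).
    static Map<Character, List<String>> index(List<String> words) {
        Map<Character, List<String>> byFirstLetter = new HashMap<>();
        for (String w : words) {
            if (w.isEmpty()) continue;
            char first = Character.toLowerCase(w.charAt(0));
            byFirstLetter.computeIfAbsent(first, k -> new ArrayList<>()).add(w);
        }
        return byFirstLetter;
    }
}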
Here are some word frequency lists:
http://www.robwaring.org/vocab/wordlists/vocfreq.html
This text file, reachable from the link above, contains the 2000 most frequently used words:
http://www.robwaring.org/vocab/wordlists/1-2000.txt
The goal is to increase your English language vocabulary - not to increase your computer's English language vocabulary.
If you do not share this goal, why are you (or your parents) paying tuition?