Working with huge text files in Java


I was given an English vocabulary assignment by my teacher.
1. Choose a random letter, say 'a'.
2. Write a word starting with that letter, say 'apple'.
3. Take the last letter, 'e', and write a word starting with it, say 'elephant'.
4. Continue from 't', and so on. No repetition allowed.
5. Make a list of 500 words.
6. Mail the list to the teacher. :)
So instead of doing it myself, I am working on Java code that will do my homework for me.
The code seems to be simple.
The core of the algorithm:
Pick a random word from a dictionary that satisfies the requirement, using seek() with RandomAccessFile. Try to put it in a Set with ordering (maybe LinkedHashSet).
But the problem is the huge size of the dictionary, with 300,000+ entries. :|
Brute-force random algorithms won't work.
What could be the best, quickest and most efficient way out?
UPDATE: Now that I have written the code and it's working, how can I make it choose common words?
Are there any text files containing a list of common words around?

Either look for a data structure allowing you to keep a compacted dictionary in memory, or simply give your process more memory. Three hundred thousand words is not that much.

Hope this doesn't spoil your fun or something, but if I were you I'd take this approach..
Pseudo java:
abstract class Word {
String word;
char last();
char first();
}
abstract class DynamicDictionary {
Map<Character,Set<Word>> first_indexed;
Word removeNext(Word word){
Set<Word> candidates = first_indexed.get(word.last());
return removeRandom(candidates);
}
/**
* Remove a random word out from the entire dic.
*/
Word removeRandom();
/**
* Remove and return a random word out from the set provided.
*/
Word removeRandom(Set<Word> wordset);
}
and then
Word primer = dynamicDictionary.removeRandom();
List<Word> list = new ArrayList<Word>(500);
list.add(primer);
Word cur = primer;
for(int i=0;i<499;i++){
cur = dynamicDictionary.removeNext(cur);
list.add(cur);
}
NOTE: Not intended to be viewed as actual Java code, just a way to roughly explain the approach (no error handling, not a good class structure if it were really used, no encapsulation, etc.)
Should I encounter memory issues, maybe I'll do this:
abstract class Word {
int lineNumber;
char last();
char first();
}
If that is not sufficient, guess I'll use a binary search on the file or put it in a DB etc..

I think one way to do this would be to put the whole dictionary in a TreeSet, then use the subSet method to retrieve all the words beginning with the desired letter and pick a random word from that subset.
But in my opinion the best way to do this, given the quantity of data, would be to use a database with SQL queries instead of Java.
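A minimal sketch of the TreeSet idea, assuming all words are lowercase. The class name PrefixLookup and both method names are made up for illustration. It relies on the set's sort order: every word starting with 'b' sorts at or after "b" and before "c".

```java
import java.util.Collection;
import java.util.NavigableSet;
import java.util.Random;
import java.util.TreeSet;

final class PrefixLookup {
    // Returns a view of all words that start with the given letter.
    static NavigableSet<String> startingWith(TreeSet<String> dict, char letter) {
        String from = String.valueOf(letter);
        String to = String.valueOf((char) (letter + 1));
        return dict.subSet(from, true, to, false);
    }

    // Picks a random element by walking the collection; O(n), which is
    // acceptable for a one-off pick. Throws if the collection is empty.
    static String pickRandom(Collection<String> words, Random rnd) {
        int target = rnd.nextInt(words.size());
        for (String w : words) {
            if (target-- == 0) {
                return w;
            }
        }
        throw new IllegalStateException("unreachable for a non-empty collection");
    }
}
```

The subset is a live view, so removing a used word from it also removes it from the dictionary, which helps with the no-repetition rule.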

If I do this:
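import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
class LoadWords {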
public static void main(String... args) {
try {
Scanner s = new Scanner(new File("/usr/share/dict/words"));
ArrayList<String> ss = new ArrayList<String>();
while (s.hasNextLine())
ss.add(s.nextLine());
System.out.format("Read %d words\n", ss.size());
} catch (FileNotFoundException e) {
e.printStackTrace(System.err);
}
}
}
I can run it with java -mx16m LoadWords, which limits the Java heap size to 16 MB, which is not much memory for Java. My /usr/share/dict/words file has approximately 250,000 words in it, so it may be a bit smaller than yours.
You'll need to use a different data structure than the simple ArrayList<String> that I've used. Perhaps a HashMap of ArrayList<String>, keyed on the starting letter of the word would be a good starting choice.
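Here is a rough sketch of that idea: a HashMap of ArrayList<String> keyed on the first letter, used to build the chain. The class and method names are hypothetical, and it assumes the words are already lowercase. A greedy random walk like this can hit a dead end, in which case it just returns a shorter chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class WordChain {
    // Groups distinct, non-empty words by their first letter.
    static Map<Character, List<String>> indexByFirstLetter(List<String> words) {
        Map<Character, List<String>> index = new HashMap<>();
        for (String w : new LinkedHashSet<>(words)) { // drop duplicate entries
            if (!w.isEmpty()) {
                index.computeIfAbsent(w.charAt(0), k -> new ArrayList<>()).add(w);
            }
        }
        return index;
    }

    // Builds a chain of up to `length` words; stops early at a dead end.
    static List<String> build(List<String> words, int length, Random rnd) {
        Map<Character, List<String>> index = indexByFirstLetter(words);
        List<String> chain = new ArrayList<>();
        List<Character> letters = new ArrayList<>(index.keySet());
        if (letters.isEmpty()) {
            return chain;
        }
        char next = letters.get(rnd.nextInt(letters.size()));
        while (chain.size() < length) {
            List<String> bucket = index.get(next);
            if (bucket == null || bucket.isEmpty()) {
                break; // no unused word starts with this letter
            }
            // Remove a random word in O(1): overwrite it with the last
            // element, then drop the last element.
            int i = rnd.nextInt(bucket.size());
            String word = bucket.get(i);
            bucket.set(i, bucket.get(bucket.size() - 1));
            bucket.remove(bucket.size() - 1);
            chain.add(word);
            next = word.charAt(word.length() - 1);
        }
        return chain;
    }
}
```

Usage would be something like WordChain.build(ss, 500, new Random()) with the list loaded above. Removing each word from its bucket as it is used is what enforces the no-repetition rule.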

Here are some word frequency lists:
http://www.robwaring.org/vocab/wordlists/vocfreq.html
This text file, reachable from the above link, contains the first 2000 words that are used most frequently:
http://www.robwaring.org/vocab/wordlists/1-2000.txt

The goal is to increase your English language vocabulary - not to increase your computer's English language vocabulary.
If you do not share this goal, why are you (or your parents) paying tuition?

Related

Adding Char to Char Array in Java

I'm still pretty new to Java so bear with me. I'm making a simple hangman game that takes input from the user. I am trying to append the guessed letter to the knownLetters array but I get a type mismatch. I tried .concat() and got the same type error. Here is where I am now. Any ideas or documentation resources (that a novice can read) would be helpful. Thanks!
Edit: Thanks for the comments, everyone! These are very helpful.
public static boolean updateWithGuess(char[] knownLetters,
char guessedLetter,
String word) {
System.out.println(Arrays.toString(knownLetters));
int i = 0;
while(word.indexOf(i, guessedLetter) != -1) {
i = word.indexOf(i, guessedLetter) + 1;
knownLetters += guessedLetter;

Arrays are not the best choice here because they have a fixed size and you cannot dynamically add or remove items; a List would be a better fit.
If you need a simple introduction, here you are:
https://www.geeksforgeeks.org/list-interface-java-examples/

Others have mentioned StringBuilder and List, which are a step in the right direction as they are dynamic (in number of elements). However, since this question is only concerned with keeping track of each letter that was guessed, a set makes the most sense. One common example is HashSet. Sets do not allow duplicates. However, a set will allow both the uppercase and lowercase versions of a character to be present (since they have different values), so you should normalize all the guesses and known letters to either uppercase or lowercase.
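To make that concrete, here is a small sketch of tracking guesses in a HashSet with lowercase normalization. GuessTracker and its methods are names I made up, not part of the question's code.

```java
import java.util.HashSet;
import java.util.Set;

final class GuessTracker {
    private final Set<Character> guessed = new HashSet<>();

    // Records a guess; returns false if the letter was already guessed
    // (in either case).
    boolean add(char letter) {
        return guessed.add(Character.toLowerCase(letter));
    }

    // Shows the word with every unguessed letter replaced by '_'.
    String mask(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            sb.append(guessed.contains(Character.toLowerCase(c)) ? c : '_');
        }
        return sb.toString();
    }
}
```

Because add() reports whether the letter was new, the game loop can tell the player about a repeated guess without any extra bookkeeping.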

How to read each line of the input and output them in sorted order?

Duplicate lines should be printed the same number of times they occur in the input. Special care needs to be taken so that a file with a lot of duplicate lines does not use more memory than what is required for the number of unique lines.
I've tried all the collection interfaces but none seems to be working for this question :(
Can someone please help me??
Thanks.
The code below is memory inefficient, as it stores duplicate lines in the PriorityQueue. Hope this helps
public static void doIt(BufferedReader r, PrintWriter w) throws IOException {
PriorityQueue<String> s=new PriorityQueue<String>();
String line;
int n=0;
while ((line = r.readLine()) != null) {
s.add(line);
n++;
}
while (n!=0) {
w.println(s.remove());
n--;
}
}

The ideal approach would be to use a sorted multiset, such as Guava's TreeMultiset.
If you're not allowed to use external libraries, you can replace s.add(line) with s.add(line.intern()). This tells the JVM to put a copy of each unique line into the String pool and share the same object among all the references.
Note that putting Strings into the pool may cause them to stick around for a long time, which can cause problems in long-running applications, so you don't want to do this casually in a production application, but for your homework problem it's fine. In the case of a production application, you'd want to put the Strings into a SortedMap where the value is the count of times the line appeared, but this is more complicated to code correctly.
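For what it's worth, the SortedMap version is not too bad with TreeMap. This is only a sketch with the same doIt signature as the question; the class name SortedLines is an assumption. Each unique line is stored once, with a count.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

final class SortedLines {
    static void doIt(BufferedReader r, PrintWriter w) throws IOException {
        // Key: a unique line. Value: how many times it occurred.
        Map<String, Integer> counts = new TreeMap<>();
        String line;
        while ((line = r.readLine()) != null) {
            counts.merge(line, 1, Integer::sum);
        }
        // TreeMap iterates in sorted key order; repeat each line count times.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                w.println(e.getKey());
            }
        }
        w.flush();
    }
}
```

Memory use grows with the number of unique lines rather than the total number of lines, which is what the assignment asks for.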

If lines arrive on the fly, you are looking for insertion sort, which is an online algorithm. In the offline case (a text file that isn't being modified while you read it), you can use any sort algorithm: treat each line as a String and the complete file as an array of Strings. Sort the array, then loop through it and print each element, and you have the sorted lines.

using java to parse a csv then save in 2D array

Okay, so I am working on a game based on a trading card game in Java. I scraped all of the game pieces' "information" into a CSV file where each row is a game piece and each column is a type of attribute for that piece. I have spent hours upon hours writing code with BufferedReader etc., trying to extract the information from my CSV file into a 2D array, but to no avail. My CSV file is linked here: http://dl.dropbox.com/u/3625527/MonstersFinal.csv I have one year of computer science under my belt but I still cannot figure out how to do this.
So my main question is: how do I place this into a 2D array so that I can keep the rows and columns?

Well, as mentioned before, some of your strings contain commas, so initially you're starting from a bad place, but I do have a solution and it's this:
If possible, rescrape the site, but perform a simple encoding operation when you do. You'll want to do something like what you'll notice tends to be done in autogenerated XML files which contain HTML; reserve a 'control character' (a printable character works best, here, for reasons of debugging and... well... sanity) that, once encoded, is never meant to be read directly as an instance of itself. Ampersand is what I like to use because it's uncommon enough but still printable, but really what character you want to use is up to you. What I would do is write the program so that, at every instance of ",", that comma would be replaced by "&c" before being written to the CSV, and at every instance of an actual ampersand on the site, that "&" would be replaced by "&a". That way, you would never have the issue of accidentally separating a single value into two in the CSV, and you could simply decode each value after you've separated them by the method I'm about to outline in...
Assuming you know how many columns will be in each row, you can use the StringTokenizer class (look it up; it's awesome and built into Java. A good place to look for information is, as always, the Java Tutorials) to pull out the values you need one token at a time.
It works by your passing in a string and a delimiter (in this case, the delimiter would be ','), and it spitting out all the substrings which were separated by those commas. If you know how many pieces there are in total from the get-go, you can instantiate a 2D array at the beginning and just plug in each row as the StringTokenizer gives it to you. If you don't, it's still okay, because you can use an ArrayList. An ArrayList is nice because it's a higher-level abstraction of an array that automatically asks for more memory such that you can continue adding to it and know that retrieval by index will always take constant time. However, if you plan on dynamically adding pieces, and doing that more often than retrieving them, you might want to use a LinkedList instead, because it has linear retrieval time but cheaper insertion and removal than an ArrayList once you are at the right position. Or, if you're awesome, you could use a skip list instead. Java ships skip-list-based collections in java.util.concurrent (ConcurrentSkipListMap and ConcurrentSkipListSet), and they're awesome. Fair warning, though; the speed on retrieval, removal, and placement comes with increased overhead in terms of memory. Skip lists maintain a lot of pointers.
If you know there should be the same number of values in each row, and you want them to be positionally organized, but for whatever reason your scraper doesn't handle the lack of a value for a row, and just doesn't put that value, you've some bad news... it would be easier to rewrite the part of the scraper code that deals with the lack of values than it would be to write a method that interprets varying length arrays and instantiates a Piece object for each array. My suggestion for this would again be to use the control character and fill empty columns with &n (for 'null') to be interpreted later, but then specifics are of course what will individuate your code and coding style so it's not for me to say.
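Here is a minimal sketch of the "&c" / "&a" encoding described above, in case it helps. FieldCodec and its method names are hypothetical; only the two escape codes come from the text. The "&n" marker for missing values is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldCodec {
    // Escape '&' first, so the '&' introduced by "&c" is not escaped again.
    static String encode(String raw) {
        return raw.replace("&", "&a").replace(",", "&c");
    }

    // Undo the escapes in the reverse order of encode().
    static String decode(String encoded) {
        return encoded.replace("&c", ",").replace("&a", "&");
    }

    // Joins encoded fields with plain commas, which are now unambiguous.
    static String encodeRow(List<String> fields) {
        List<String> out = new ArrayList<>();
        for (String f : fields) {
            out.add(encode(f));
        }
        return String.join(",", out);
    }

    // Splits on commas (keeping trailing empty fields) and decodes each part.
    static List<String> decodeRow(String line) {
        List<String> out = new ArrayList<>();
        for (String part : line.split(",", -1)) {
            out.add(decode(part));
        }
        return out;
    }
}
```

The order of the replace calls matters in both directions: after encoding, every '&' is followed by either 'a' or 'c', so decoding "&c" first can never misread an escaped ampersand.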
edit: I think the main thing you should focus on is learning the different standard library datatypes available in Java, and maybe learn to implement some of them yourself for practice. I remember implementing a binary search tree- not an AVL tree, but alright. It's fun enough, good coding practice, and, more importantly, necessary if you want to be able to do things quickly and efficiently. I don't know exactly how Java implements arrays, because the definition is "a contiguous section of memory", yet you can allocate memory for them in Java at runtime using variables... but regardless of the specific Java implementation, arrays often aren't the best solution. Also, knowing regular expressions makes everything much easier. For practice, I'd recommend working them into your Java programs, or, if you don't want to have to compile and jar things every time, your bash scripts (if you're using *nix) and/or batch scripts (if you're using Windows).

I think the way you've scraped the data makes this problem more difficult than it needs to be. Your scrape seems inconsistent and difficult to work with given that most values are surrounded by quotes inconsistently, some data already has commas in it, and not each card is on its own line.
Try re-scraping the data in a much more consistent format, such as:
R1C1|R1C2|R1C3|R1C4|R1C5|R1C6|R1C7|R1C8
R2C1|R2C2|R2C3|R2C4|R2C5|R2C6|R2C7|R3C8
R3C1|R3C2|R3C3|R3C4|R3C5|R3C6|R3C7|R3C8
R4C1|R4C2|R4C3|R4C4|R4C5|R4C6|R4C7|R4C8
A/D Changer|DREV-EN005|Effect Monster|Light|Warrior|100|100|You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
Where each line is definitely its own card (As opposed to the example CSV you posted with new lines in odd places) and the delimiter is never used in a data field as something other than a delimiter.
Once you've gotten the input into a consistently readable state, it becomes very simple to parse through it:
BufferedReader br = new BufferedReader(new FileReader(new File("MonstersFinal.csv")));
String line = "";
ArrayList<String[]> cardList = new ArrayList<String[]>(); // Use an arraylist because we might not know how many cards we need to parse.
while((line = br.readLine()) != null) { // Read a single line from the file until there are no more lines to read
StringTokenizer st = new StringTokenizer(line, "|"); // "|" is the delimiter of our input file.
String[] card = new String[8]; // Each card has 8 fields, so we need room for the 8 tokens.
for(int i = 0; i < 8; i++) { // For each token in the line that we've read:
String value = st.nextToken(); // Read the token
card[i] = value; // Place the token into the ith "column"
}
cardList.add(card); // Add the card's info to the list of cards.
}
for(int i = 0; i < cardList.size(); i++) {
for(int x = 0; x < cardList.get(i).length; x++) {
System.out.printf("card[%d][%d]: ", i, x);
System.out.println(cardList.get(i)[x]);
}
}
Which would produce the following output for my given example input:
card[0][0]: R1C1
card[0][1]: R1C2
card[0][2]: R1C3
card[0][3]: R1C4
card[0][4]: R1C5
card[0][5]: R1C6
card[0][6]: R1C7
card[0][7]: R1C8
card[1][0]: R2C1
card[1][1]: R2C2
card[1][2]: R2C3
card[1][3]: R2C4
card[1][4]: R2C5
card[1][5]: R2C6
card[1][6]: R2C7
card[1][7]: R3C8
card[2][0]: R3C1
card[2][1]: R3C2
card[2][2]: R3C3
card[2][3]: R3C4
card[2][4]: R3C5
card[2][5]: R3C6
card[2][6]: R3C7
card[2][7]: R3C8
card[3][0]: R4C1
card[3][1]: R4C2
card[3][2]: R4C3
card[3][3]: R4C4
card[3][4]: R4C5
card[3][5]: R4C6
card[3][6]: R4C7
card[3][7]: R4C8
card[4][0]: A/D Changer
card[4][1]: DREV-EN005
card[4][2]: Effect Monster
card[4][3]: Light
card[4][4]: Warrior
card[4][5]: 100
card[4][6]: 100
card[4][7]: You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
I hope re-scraping the information is an option here and I hope I haven't misunderstood anything; Good luck!
On a final note, don't forget to take advantage of OOP once you've gotten things worked out. A Card class could make working with the data even simpler.

I'm working on a similar problem for use in machine learning, so let me share what I've been able to do on the topic.
1) If you know before you start parsing the row - whether it's hard-coded into your program or whether you've got some header in your file that gives you this information (highly recommended) - how many attributes per row there will be, you can reasonably split it by comma. For example, the first attribute will be RowString.substring(0, RowString.indexOf(',')), the second attribute will be the substring from the first comma to the next comma (writing a function to find the nth instance of a comma, or simply chopping off bits of the string as you go through it, should be fairly trivial), and the last attribute will be RowString.substring(RowString.lastIndexOf(',') + 1, RowString.length()), where the + 1 skips the comma itself. The String class's methods are your friends here.
2) If you are having trouble distinguishing between commas which are meant to separate values, and commas which are part of a string-formatted attribute, then (if the file is small enough to reformat by hand) do what Java does - represent characters with special meaning that are inside of strings with '\,' rather than just ','. That way you can search for the index of ',' and not '\,' so that you will have some way of distinguishing your characters.
3) As an alternative to 2), CSVs (in my opinion) aren't great for strings, which often include commas. There is no real common format to CSVs, so why not make them colon-separated-values, or dash-separated-values, or even triple-ampersand-separated-values? The point of separating values with commas is to make it easy to tell them apart, and if commas don't do the job there's no reason to keep them. Again, this applies only if your file is small enough to edit by hand.
4) Looking at your file for more than just the format, it becomes apparent that you can't do it by hand. Additionally, it would appear that some strings are surrounded by triple double quotes ("""string""") and some are surrounded by single double quotes ("string"). If I had to guess, I would say that anything enclosed in quotes is a single attribute - there are, for example, no pairs of quotes that start in one attribute and end in another. So I would say that you could:
Make a class with a method to break a string into its comma-separated fields.
Write that method such that it ignores commas preceded by an odd number of double quotes (this way, if the quote-pair hasn't been closed, it knows that it's inside a string and that the comma is not a value separator). This strategy, however, fails if the creator of your file did something like enclose some strings in double double quotes (""string""), so you may need a more comprehensive approach.
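A rough sketch of such a method, under the assumption stated above that quotes always open and close within one attribute. QuoteAwareSplitter is a made-up name, and the quotes are left in the returned fields rather than stripped.

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteAwareSplitter {
    static List<String> split(String row) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // true after an odd number of '"' so far
        for (int i = 0; i < row.length(); i++) {
            char c = row.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // the last field has no trailing comma
        return fields;
    }
}
```

Triple quotes work because three toggles leave the parser inside a string, exactly like one. Doubled quotes toggle twice and cancel out, which is the failure case mentioned above.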

A memory-efficient large array of words

I am looking for a Java data structure for storing a large text (about a million words), such that I can get a word by index (for example, get the 531467 word).
The problem with String[] or ArrayList is that they take too much memory - about 40 bytes per word on my environment.
I thought of using a String[] where each element is a chunk of 10 words, joined by a space. This is much more memory-efficient - about 20 bytes per word; but the access is much slower.
Is there a more efficient way to solve this problem?

As Jon Skeet already mentioned, 40 MB isn't too large.
But you stated that you are storing a text, so there may be many identical Strings.
For example, stop words like "and" and "or".
You can use String.intern()[1]. This will pool your String and return a reference to an already existing String.
intern() is quite slow, so you can replace it with a HashMap that will do the same trick for you.
[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29
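A sketch of the HashMap replacement for intern(), assuming Java 8 or later for putIfAbsent. WordPool and canonical are names I picked for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class WordPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen that equals s, storing s itself
    // if no equal string has been seen yet.
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```

You would call words[i] = pool.canonical(word) while filling the array, then drop the pool once loading is done so that only the shared String instances remain referenced.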

You could look at using memory mapping the data structure, but performance might be completely horrible.

Store all the words in a single string:
class WordList {
private final String content;
private final int[] indices;
public WordList(Collection<String> words) {
StringBuilder buf = new StringBuilder();
indices = new int[words.size()];
int currentWordIndex = 0;
int previousPosition = 0;
for (String word : words) {
buf.append(word);
indices[currentWordIndex++] = previousPosition;
previousPosition += word.length();
}
content = buf.toString();
}
public String wordAt(int index) {
if (index == indices.length - 1) return content.substring(indices[index]);
return content.substring(indices[index], indices[index + 1]);
}
public static void main(String... args) {
WordList list = new WordList(Arrays.asList(args));
for (int i = 0; i < args.length; ++i) {
System.out.printf("Word %d: %s%n", i, list.wordAt(i));
}
}
}
Apart from the characters they contain, each word has an overhead of four bytes using this solution (the entry in indices). Retrieving a word with wordAt will always allocate a new string; you could avoid this by saving the toString() of the StringBuilder rather than the builder itself, although it uses more memory on construction.
Depending on the kind of text, language, and more, you might want a solution that deals with recurring words better (like the one previously proposed).

-XX:+UseCompressedStrings
Use a byte[] for Strings which can be represented as pure ASCII.
(Introduced in Java 6 Update 21 Performance Release)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
This seems like an interesting article:
http://www.javamex.com/tutorials/memory/string_saving_memory.shtml
I hear ropes are quite good in terms of speed in storing large strings, though not sure memory wise. But you might want to check it out.
http://ahmadsoft.org/ropes/
http://en.wikipedia.org/wiki/Rope_%28computer_science%29

One option would be to store byte arrays instead with the text encoded in UTF-8:
byte[][] words = ...;
Then:
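public String getWord(int index)
{
// StandardCharsets (Java 7+) avoids the checked UnsupportedEncodingException
return new String(words[index], java.nio.charset.StandardCharsets.UTF_8);
}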
This will be smaller in two ways:
The data for each string is directly in a byte[], rather than the String having a couple of integer members and a reference to a separate char[] object
If your text is mostly-ASCII, you'll benefit from UTF-8 using a single byte per character for those ASCII characters
I wouldn't really recommend this approach though... again it will be slower on access, as it needs to create a new String each time. Fundamentally, if you need a million string objects (so you don't want to pay the recreation penalty each time) then you're going to have to use the memory for a million string objects...

You could create a datastructure like this:
List<string> wordlist
Dictionary<string, int> tsildrow // for reverse lookup while building the structure
List<int> wordindex
wordlist will contain a list of all (unique) words,
tsildrow will give the index of a word in wordlist and wordindex will tell you the index in wordlist of a specific index in your text.
You would operate it in the following fashion:
for word in text:
if not word in tsildrow:
wordlist.append(word)
tsildrow.add(word, wordlist.last_index)
wordindex.append(tsildrow[word])
this fills up your datastructure. Now, to find the word at index 531467:
print wordlist[wordindex[531467]]
you can reproduce the entire text like this:
for index in wordindex:
print wordlist[index] + ' '
except, that you will still have a problem of punctuation etc...
if you won't be adding any more words (i.e. your text is stable), you can delete tsildrow to free up some memory if this is a concern of yours.
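Since the question is about Java, here is roughly how the pseudocode above might look. It is only a sketch: IndexedText is a made-up class name, while wordlist and wordindex follow the names used above. The reverse map is a local variable, so it is released as soon as the constructor finishes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexedText {
    private final List<String> wordlist = new ArrayList<>(); // unique words
    private final int[] wordindex; // text position -> index into wordlist

    IndexedText(List<String> text) {
        // Reverse lookup (word -> index in wordlist), needed only while building.
        Map<String, Integer> reverse = new HashMap<>();
        wordindex = new int[text.size()];
        int position = 0;
        for (String word : text) {
            Integer id = reverse.get(word);
            if (id == null) {
                id = wordlist.size();
                wordlist.add(word);
                reverse.put(word, id);
            }
            wordindex[position++] = id;
        }
    }

    String wordAt(int position) {
        return wordlist.get(wordindex[position]);
    }

    int uniqueWords() {
        return wordlist.size();
    }
}
```

The per-word cost is one int (4 bytes) in wordindex plus the shared storage for each unique word, so it pays off when the text repeats words a lot.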

OK, I have experimented with several of your suggestions, and here are my results (I checked (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory()) before filling the array, and checked again after filling the array and gc()):
Original (array of strings): 54 bytes/word (not 40 as I mistakenly wrote)
My solution (array of chunks of strings, separated by spaces):
2 words per chunk - 36 b/w (but unacceptable performance)
10 words per chunk - 18 b/w
100 words per chunk - 14 b/w
byte arrays - 40 b/w
char arrays - 36 b/w
HashMap, either mapping a string to itself, or mapping a string to its index - 26 b/w
(not sure I implemented this correctly)
intern - 10 b/w
baseline (empty array) - 4 b/w
The average word length is about 3 chars, and most chars are non-ASCII so it's probably about 6 bytes. So, it seems that intern is close to the optimum. It makes sense, since it's an array of words, and many of the words appear much more than once.

I would probably consider using a file, with either fixed sized words or some sort of index. FileInputStream with skip can be pretty efficient

If you have a mobile device you can use TIntArrayList, which uses 4 bytes per int value. If you use one index per word it will only need a couple of MB. You can also use int[].
If you have a PC or server, this is a trivial amount of memory. Memory costs about £6 per GB or 1 cent per MB.

Printing an ArrayList of Strings to a PrintWriter with word wrap

Some classmates and I are working on a homework assignment for Java that requires we print an ArrayList of Strings to a PrintWriter using word wrap, so that none of the output passes 80 characters. We've extensively Googled this and can't find any Java API based way to do this.
I know it's generally "wrong" to ask a homework question on SO, but we're just looking for recommendations of the best way to do this, or if we missed something in the API. This isn't the major part of the homework, just a small output requirement.
Ideally, I'd like to be able to wordwrap the ArrayList's toString since it's nicely formatted already.

Well, this is a first for me, it's the first time one of my students has posted a question about one of the projects I've assigned them. The way it was phrased, that he was looking for an algorithm, and the answers you've all shared are just fine with me. However, this is a typical case of trying to make things too complicated. A part of the spec that was not mentioned was that the 80 characters limit was not a hard limit. I said that each line of the output file had to be roughly 80 characters long. It was OK to go over 80 a little. In my version of the solution, I just had a running count and did a modulus of the count to add the line end. I varied the value of the modulus until the output file looked right. This resulted in lines with small numbers being really short so I used a different modulus when the numbers were small. This wasn't a big part of the project and it's interesting that this got so much attention.

Our solution was to create a temporary string and append elements one by one, followed by a comma. Before adding an element, check if adding it will make the string longer than 80 characters and choose whether to print it and reset or just append.
This still has the issue with the extra trailing comma, but that's been dealt with so many times we'll be fine. I was looking to avoid this because it was originally more complicated in my head than it really is.
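A sketch of that approach for anyone landing here later. ListWrapper and wrap are hypothetical names. It handles the trailing comma by only adding a comma to items that have a successor, and returns the lines so the caller can println each one to the PrintWriter.

```java
import java.util.ArrayList;
import java.util.List;

final class ListWrapper {
    static List<String> wrap(List<String> items, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            // Every item except the last carries its separating comma.
            String piece = items.get(i) + (i < items.size() - 1 ? "," : "");
            // The +1 accounts for the space placed before the piece.
            if (line.length() > 0 && line.length() + 1 + piece.length() > maxWidth) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(piece);
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

A single item longer than maxWidth still gets a line to itself and overflows; splitting inside an item would need extra logic.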

I think that better solution is to create your own WrapTextWriter that wraps any other writer and overrides method public void write(String str, int off, int len) throws IOException. Here it should run in loop and perform logic of wrapping.
This logic is not as simple as str.substring(80). If you are dealing with real text and wish to wrap it correctly (i.e. do not cut words, do not move commas or periods to the next line, etc.) you have to implement some logic. It is probably not too complicated, but it is probably language dependent. For example, in English there is no space between a word and a colon, while in French they put a space between them.
So, I performed 5 second googling and found the following discussion that can help you.

private static final int MAX_CHARACTERS = 80;
public static void main(String[] args)
{
List<String> strings = new ArrayList<String>(); // Fill with the strings to print
int size = 0;
PrintWriter writer = new PrintWriter(System.out, true); // Just as example
for (String str : strings)
{
size += str.length();
if (size > MAX_CHARACTERS)
{
// Start a new line; the string just printed counts toward its length
writer.print(System.getProperty("line.separator") + str);
size = str.length();
}
else
writer.print(str);
}
writer.flush(); // print() does not auto-flush; only println() does
}
You can simply write a function, like "void printWordWrap(List<String> strings)", with that algorithm inside. I think it's a good way to solve your problem. :)
