For my coursework (binary search trees and hash tables) I would like to make a Java program that scans a text file and orders words by how frequently they occur, something like a list of the most popular tags.
Example:
1. Scan the file.
2. List the words that appear more than once:
WORD    TOTAL
Banana  10
Sun     7
Sea     3
Question 1: How do I scan a text file?
Question 2: How do I check for duplicate words in the text file and count them?
Question 3: How do I print out the words that appear more than once, in the order shown in my example?
My programming skills are not strong.
Since it is coursework, I'm not going to provide full details, but I'll try to point you in a possible direction:
Google how to read words from a text file (this is a very common problem; you should be able to find tons of examples).
Use, for instance, a HashMap (String to Integer) to count the words: if a word is not in the map yet, add it with a count of 1; if it is already in there, increment its count (you might want to do some preprocessing on the words, for instance lower-casing them if you want to ignore capitalization).
Filter the words with a count greater than 1 out of your map.
Sort the filtered list of words by their counts (one possible shape for this step is sketched after the code below).
Some very high-level implementation (with many open ends :) ):
List<String> words = readWordsFromFile(); // reading is left as an open end
Map<String, Integer> wordCounts = new HashMap<>();
for (String word : words) {
    String processedWord = preprocess(word); // e.g. lower-casing
    int count = 1;
    if (wordCounts.containsKey(processedWord)) {
        count = wordCounts.get(processedWord) + 1;
    }
    wordCounts.put(processedWord, count);
}
removeSingleOccurrences(wordCounts); // drop words that occur only once
List<String> sortedWords = sortWords(wordCounts); // sort by count
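For the sortWords open end, one possible direction (just a sketch, still leaving the reading and filtering to you; this assumes Java 8 streams) is to sort the map's entry set by value:

import java.util.*;
import java.util.stream.Collectors;

// Sort the counted words by descending count and keep just the words.
static List<String> sortWords(Map<String, Integer> wordCounts) {
    return wordCounts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
}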
You can use Multiset from Guava Lib: http://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained#Multiset
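A rough sketch of how that could look (assuming Guava is on the classpath; everything except the Guava API is my own naming):

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;

// Count words in a Multiset, then print them in descending frequency order.
Multiset<String> words = HashMultiset.create();
// ... words.add(word) for every word read from the file ...
for (Multiset.Entry<String> entry : Multisets.copyHighestCountFirst(words).entrySet()) {
    if (entry.getCount() > 1) { // only words appearing more than once
        System.out.println(entry.getElement() + " " + entry.getCount());
    }
}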
I am hoping to build a tree in which each node is an English word and a path of nodes forms a sentence; namely, a sentence tree.
I was thinking of using a Trie, but I am having trouble with inserting the nodes. I am not sure how to determine the level of the nodes. In a Trie all the nodes are single characters, so it's possible to use a fixed-size array of children indexed by character. But having words as nodes is different.
Does this make sense? I am open to other data structures as well. The goal is to create a dictionary/corpus which stores a bunch of English sentences, such that users can use the first couple of words to look up a whole sentence. I am most proficient in Java, but I also know Python and R, so those would be fine if they are easier to use for my purposes.
Thank you!
void insert(String key) {
    TrieNode pCrawl = root;
    for (int level = 0; level < key.length(); level++) {
        // One trie level per character; assumes keys are lower-case a-z.
        int index = key.charAt(level) - 'a';
        if (pCrawl.children[index] == null)
            pCrawl.children[index] = new TrieNode();
        pCrawl = pCrawl.children[index];
    }
    // mark the last node as the end of a word
    pCrawl.isEndOfWord = true;
}
A little late, but maybe I can help a bit even now.
A trie sorts each level by a unique key. Traditionally this is a character from a string, and the value stored at the final location is the string itself.
Tries can be much more than this. If I understand you correctly, you wish to sort sentences by their constituent words.
At each level of your trie you would look at the next word and seek its position in the list of children, rather than looking at the next character. Unfortunately, all the traditional implementations show sorting by character.
I have a solution for you, or rather two. The first is to use my Java source code trie. This sorts any object (in your case the string containing your sentence) by an Enumeration of integers. You would need to map your words to integers (store the words in a trie, giving each a unique number), and then write an enumerator that returns the word integers for a sentence. That would work. (Do not use a hash for the word-to-integer conversion, as two words can produce the same hash.)
The second solution is to take my code and, instead of comparing integers, compare the words as strings. This would take more work, but looks entirely feasible. In fact, I have had a suspicion that my solution can be made more generic by replacing the Enumeration of Integer with an Enumeration of Comparable. If you wish to do this, or to collaborate in doing this, I would be interested. Heck, I may even do it myself for the fun of it.
The resultant trie would have generic type
Trie<K extends Comparable, T>
and would store instances of T against a sequence of K. The coder would need to define a method
Iterator<K> getIterator(T t)
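To make the word-keyed idea concrete, here is a minimal sketch of my own (illustrative names, not the code from my repo):

import java.util.*;

// A word-level trie node: children are keyed by whole words rather than
// by single characters. A TreeMap keeps the children sorted, so an
// in-order traversal yields the sentences in sorted order.
class SentenceTrieNode {
    final SortedMap<String, SentenceTrieNode> children = new TreeMap<>();
    String sentence; // non-null only if a sentence ends at this node

    void insert(String fullSentence) {
        SentenceTrieNode node = this;
        for (String word : fullSentence.split("\\s+")) {
            node = node.children.computeIfAbsent(word, w -> new SentenceTrieNode());
        }
        node.sentence = fullSentence;
    }
}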
======================== EDIT ========================
It was actually remarkably easy to generalise my code to use Comparable instead of Integer, although there are plenty of warnings about my use of the raw type Comparable rather than Comparable<K>. Maybe I will sort those out another day.
SentenceSorter sorter = new SentenceSorter();
sorter.add("This is a sentence.");
sorter.add("This is another sentence.");
sorter.add("A sentence that should come first.");
sorter.add("Ze last sentence");
sorter.add("This is a sentence that comes somewhere in the middle.");
sorter.add("This is another sentence entirely.");
Then listing sentences by:
Iterator<String> it = sorter.iterator();
while (it.hasNext()) {
System.out.println(it.next());
}
gives
A sentence that should come first.
This is a sentence that comes somewhere in the middle.
This is a sentence.
This is another sentence entirely.
This is another sentence.
Note that the sentence split includes the full stop with the word, and that is affecting the sort. You could improve on this.
We can show that we are sorting by words rather than characters:
it = sorter.sentencesWithPrefix("This is a").iterator();
while (it.hasNext()) {
System.out.println(it.next());
}
gives
This is a sentence that comes somewhere in the middle.
This is a sentence.
whereas
it = sorter.sentencesWithPrefix("This is another").iterator();
while (it.hasNext()) {
System.out.println(it.next());
}
gives
This is another sentence entirely.
This is another sentence.
Hope that helps - the code is all up on the above mentioned repo, and freely available under Apache2.
I am trying to create an inverted index for Wikipedia pages; however, I keep running out of memory. I am not sure what else I can do to ensure it doesn't run out. We are talking about 3.9 million words.
indexer.java
public void index() {
ArrayList<Page> pages = parse(); // Parse XML pages
HashMap<String, ArrayList<Integer>> postings = getPostings(pages);
}
public HashMap<String, ArrayList<Integer>> getPostings(ArrayList<Page> pages) {
assert pages != null;
englishStemmer stemmer = new englishStemmer();
HashSet<String> stopWords = getStopWords();
HashMap<String, ArrayList<Integer>> postings = new HashMap<>();
int count = 0;
int artCount = 0;
for (Page page : pages) {
if (!page.isRedirect()) { // Skip pages that are redirects.
StringBuilder sb = new StringBuilder();
artCount = count; // All the words until now
boolean ignore = false;
for (char c : page.getText().toCharArray()) {
if (c == '<') // Ignore words inside <> tags.
ignore = true;
if (!ignore) {
if (c != '\'') { // skip apostrophes (ASCII 39)
    if (c >= '0' && c <= '9' || c >= 'a' && c <= 'z') // c is a digit 0-9 or a lower-case letter a-z
        sb.append(c);
    else if (c >= 'A' && c <= 'Z') // c is an upper-case letter A-Z
        sb.append(Character.toLowerCase(c));
else if (sb.length() > 0) { // Check if there is a word up until now.
if (sb.length() > 1) { // Ignore single character "words"
if (!stopWords.contains(sb.toString())) { // Check if the word is not a stop word.
stemmer.setCurrent(sb.toString());
stemmer.stem(); // Stem the word
String s = stemmer.getCurrent(); // Retrieve the stemmed word
if (!postings.containsKey(s)) // Check if the word already exists in the words map.
postings.put(s, new ArrayList<>()); // If the word is not in the map then create an array list for that word.
postings.get(s).add(page.getId()); // Place the id of the page in the word array list.
count++; // Increase the overall word count for the pages
}
}
sb = new StringBuilder();
}
}
}
if (c == '>')
ignore = false;
}
}
page.setCount(count - artCount);
}
System.out.println("Word count:" + count);
return postings;
}
Advantages
Some advantages for this approach are:
You can get the number of occurrences of a given word simply by getting the size of the associated ArrayList.
Looking up the number of times a given word occurs in a page is relatively easy.
Optimizations
Current optimizations:
Ignoring common words (stop words).
Stemming words to their roots and storing those.
Ignoring common Wikipedia tags that aren't English words (included in stop word list such as: lt, gt, ref .. etc).
Ignoring text within < > tags such as: <pre>, <div>
Limitations
The array lists become incredibly large with the number of occurrences per word, and the major disadvantage of this approach shows when an array list has to grow: a new backing array is created and the items from the previous array have to be copied into it. This could be a performance bottleneck. Would a linked list make more sense here? We are only appending occurrences, never reading them back during indexing. A linked list also does not rely on an array as its underlying data structure, so it can grow without bound and never needs to be reallocated and copied when it gets too large.
Alternative approaches
I have considered dumping the counts for each word into a database like MongoDB after each page has been processed, and then appending the new occurrences there, i.e. {word: [occurrences]}, letting the GC clean up the postings HashMap after each page.
I've also considered moving the page loop into the index() method so that the GC can clean up getPostings() between pages, then merging the new postings after each page, but I don't think that will alleviate the memory burden.
As for the hash maps, would a tree map be a better fit for this situation?
Execution
On my machine this program runs on all 4 cores at 90-100% and takes about 2-2.5 GB of RAM. It runs for over an hour and a half and then dies with a GC out-of-memory error.
I have also considered increasing the available memory for the program, but it needs to run on my instructor's machine as well, so it needs to operate as standard, without any "hacks".
I need help making considerable optimizations; I'm not sure what else would help.
TL;DR Most likely your data structure won't fit in memory, no matter what you do.
Side note: you should actually explain what your task is and what your approach is; you don't do that, and instead expect us to read and poke through your code.
What you're basically doing is building a multimap of word -> ids of Wikipedia articles. For this, you parse each non-redirect page, divide it into single words and build a multimap by adding a word -> page id mapping.
Let's roughly estimate how big that structure would be. Your assumption is around 4 million words. There are around 5 million articles in the English Wikipedia. The average word length in English is around 5 characters, so let's assume 10 bytes per word and 4 bytes per article id. That gives around 40 MB for the words (the keys in the map) and 20 MB for the article ids (the values in the map).
Assuming a multihashmap-like structure you could estimate the hashmap size at around 32*size + 4*capacity.
So far this seems to be manageable, a few dozen MBs.
But there will be around 4 million collections storing the ids of articles, each around 8*size bytes (if you use array lists), where size is the number of articles the word is encountered in. According to http://www.wordfrequency.info/, the top 5,000 words are mentioned in COCA over 300 million times, so I'd expect Wikipedia to be in that range.
That would be around 2.5 GB just for the article ids of the top 5,000 words alone. This is a good hint that your inverted index structure will probably take too much memory to fit on a single machine.
However, I don't think you've got problems with the size of the resulting structure yet. Your code indicates that you load the pages into memory first and process them later on, and that definitely won't work.
You'll most probably need to process pages in a stream-like fashion and use some kind of a database to store results. There's basically a thousand ways to do that, I'd personally go with a Hadoop job on AWS with PostgreSQL as the database, utilizing the UPSERT feature.
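As a rough sketch of the UPSERT idea over JDBC (the table and column names are made up, and this assumes PostgreSQL 9.5+ with a postings table keyed on (term, page_id); jdbcUrl, user, password, word and pageId are assumed to exist):

import java.sql.*;

// For each (word, pageId) occurrence, insert a new posting row or
// bump the frequency of the existing one in a single statement.
try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     PreparedStatement upsert = conn.prepareStatement(
             "INSERT INTO postings (term, page_id, freq) VALUES (?, ?, 1) " +
             "ON CONFLICT (term, page_id) DO UPDATE SET freq = postings.freq + 1")) {
    upsert.setString(1, word);
    upsert.setInt(2, pageId);
    upsert.executeUpdate();
}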
ArrayList is a candidate for replacement by a class Index you'll have to write. It should use an int[] for storing index values, with a reallocation strategy whose increment is based on the overall growth rate of the word it belongs to. (ArrayList grows by 50% of the old capacity, and this may not be optimal for rare words.) Also, it should leave room for optimizing the storage of runs by storing the first index and the negative count of the following numbers, e.g.,
..., 100, -3, ... are the index values for 100, 101, 102, 103
This may save entries for frequently occurring words at the cost of a few cycles.
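A sketch of what such an Index class could look like (my illustration, not a finished implementation; the growth strategy here is just a placeholder):

import java.util.Arrays;

// Page ids are stored in an int[]; consecutive runs are compressed as
// (firstId, -count), so 100, -3 stands for 100, 101, 102, 103.
class Index {
    private int[] ids = new int[4];
    private int size = 0;

    void add(int pageId) {
        if (size >= 2 && ids[size - 1] < 0) {
            int first = ids[size - 2];
            int run = -ids[size - 1];          // entry covers first .. first+run
            if (pageId == first + run + 1) {
                ids[size - 1] = -(run + 1);    // extend the run by one id
                return;
            }
        } else if (size >= 1 && ids[size - 1] >= 0 && pageId == ids[size - 1] + 1) {
            append(-1);                        // start a run after the previous id
            return;
        }
        append(pageId);                        // plain, non-consecutive id
    }

    private void append(int value) {
        if (size == ids.length)
            ids = Arrays.copyOf(ids, size + Math.max(4, size / 2));
        ids[size++] = value;
    }
}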
Also consider dumping the postings HashMap to a file after entering a certain number of index values and continuing with an empty map. If each file is sorted by key, this permits a relatively simple merge of two or more files.
I am kind of trying to build a dictionary of an essay (storing the words and then finding one of them afterwards).
For example:
Sorting algorithm
From Wikipedia, the free encyclopedia
A sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order. Efficient sorting is important for optimizing the use of other algorithms (such as search and merge algorithms) which require input data to be in sorted lists; it is also often useful for canonicalizing data and for producing human-readable output. More formally, the output must satisfy two conditions:
The output is in nondecreasing order (each element is no smaller than the previous element according to the desired total order);
The output is a permutation (reordering) of the input.
Further, the data is often taken to be in an array, which allows random access, rather than a list, which only allows sequential access, though often algorithms can be applied with suitable modification to either type of data.
I want to store the text above in an array or anything else without storing the same word twice, and then find a word in it afterwards.
What I have tried so far is using an array[10000] to store things (in case it's not big enough) and a Scanner to read from a .txt file, but it takes a long time (5 min+) without even starting the search.
Also, if it is a book (100,000+ words), what should I be using so as not to wait so long (under 10 min)?
I run a menu, then ask for a .txt file to be read:
int number = 0;
String[] wordlist = new String[5000000];
String readFile = keyboard.nextLine();
Scanner file = new Scanner(new File(readFile));
while (file.hasNextLine())
{
    Scanner line = new Scanner(file.nextLine()); // a second Scanner over the current line
    while (line.hasNext())
    {
        wordlist[number] = line.next();
        System.out.println(wordlist[number]); // printing every word slows this down considerably
        number++;
    }
}
Then I do the checking and the finding afterwards.
I have a collection of around 1,500 documents. I parsed each document and extracted tokens. These tokens are stored in a HashMap (as keys), and the total number of times they occur in the collection (i.e. their frequency) is stored as the value.
I have to extend this to build an inverted index, that is: term (key) | number of documents it occurs in --> DocNo | frequency in that document. For example,
Term      DocFreq   DocNum   TermFreq
data      3         1        12
                    23       31
                    100      17
customer  2         22       43
                    19       2
Currently, I have the following in Java:
Map<String, Integer> frequencies = new HashMap<>();
for (each document)
{
    extract line
    for (each line)
    {
        extract word
        for (each word)
        {
            perform some operations
            get the value for the word from the map and increment it by one
        }
    }
}
I have to build on this code, and I can't really think of a good way to implement an inverted index.
So far, I have thought of making the value a 2D array, so that the term would be the key and the value (i.e. the 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.
I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies); // don't forget to store the new object
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
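A minimal sketch of what that class could look like (my own illustration of the interface described above):

import java.util.*;

// Tracks, for one term, its total count and its per-document counts.
class TermFrequencies {
    private final String term;
    private final Map<String, Integer> occurrencesByDocument = new HashMap<>();
    private int totalOccurrences = 0;

    TermFrequencies(String term) {
        this.term = term;
    }

    void addOccurrence(String documentId) {
        occurrencesByDocument.merge(documentId, 1, Integer::sum);
        totalOccurrences++;
    }

    int getTotalNumberOfOccurrences() {
        return totalOccurrences;
    }

    Set<String> getDocumentIds() {
        return occurrencesByDocument.keySet();
    }

    int getNumberOfOccurrencesInDocument(String documentId) {
        return occurrencesByDocument.getOrDefault(documentId, 0);
    }
}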
I think it is best to have two structures: a Map<docnum, Map<term, termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size() in the values of the second map. This solution involves no custom classes and allows quick retrieval of everything needed.
The first map contains all the information, and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do it in one pass.
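A sketch of the one-pass version (all names here are mine):

import java.util.*;

// One-pass construction of both maps.
class InvertedIndex {
    final Map<Integer, Map<String, Integer>> termFreqByDoc = new HashMap<>();
    final Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    void indexWord(int docNum, String term) {
        termFreqByDoc.computeIfAbsent(docNum, d -> new HashMap<>())
                     .merge(term, 1, Integer::sum);
        docsByTerm.computeIfAbsent(term, t -> new HashSet<>())
                  .add(docNum);
    }

    int docFreq(String term) {
        // docFreq is just the size of the term's document set
        return docsByTerm.getOrDefault(term, Collections.emptySet()).size();
    }
}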
I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model terms, documents and their relationships using objects. In a first run, create the term index and the document objects, and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2D arrays in an object-oriented language. Unless you want to solve a mathematical problem or optimize something, it's not the right approach most of the time.
I don't know if this is still a hot question, but I would recommend doing it like this:
Run over all your documents and give them ids in increasing order. For each document, run over all the words.
Now you have a HashMap that maps Strings (your words) to arrays of DocTermObjects. A DocTermObject contains a docId and a term frequency.
Now, for each word in a document, you look it up in your HashMap. If the map doesn't contain an array of DocTermObjects for it, you create one; otherwise you look at its very LAST element only (this is important for runtime; think about it). If this element has the docId you are treating at the moment, you increase the term frequency. Otherwise, or if the array is empty, you add a new DocTermObject with your current docId and set the term frequency to 1 (see the sketch below).
Later you can use this data structure to compute scores, for example; the scores could of course also be saved in the DocTermObjects.
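Roughly, in code (a sketch; the class and field names are illustrative):

import java.util.*;

class DocTermIndex {
    static class DocTermObject {
        final int docId;
        int termFrequency = 1;
        DocTermObject(int docId) { this.docId = docId; }
    }

    final Map<String, List<DocTermObject>> index = new HashMap<>();

    // Documents must be processed in increasing docId order, so only the
    // LAST element of a word's list can belong to the current document.
    void addOccurrence(String word, int docId) {
        List<DocTermObject> postings =
                index.computeIfAbsent(word, w -> new ArrayList<>());
        if (!postings.isEmpty() && postings.get(postings.size() - 1).docId == docId) {
            postings.get(postings.size() - 1).termFrequency++;
        } else {
            postings.add(new DocTermObject(docId));
        }
    }
}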
Hope it helped :)
I'm doing a wee project (in Java) while uni is out, just to test myself, and I've hit a stumbling block.
I'm trying to write a program that will read in a text version of a dictionary, store it in a data structure, then ask the user for a random string (preferably a nonsense string; only letters and hyphens, no numbers or other punctuation, as I'm not interested in anything else), find all the anagrams of the inputted string, compare them against the dictionary data structure, and return a list of all the possible anagrams that are in the dictionary.
Okay, for steps 1 and 2 (reading from the dictionary): as I read everything in, I store it in a Map where the keys are the letters of the alphabet and the values are ArrayLists storing all the words beginning with that letter.
I'm stuck at finding all the anagrams. I figured out how to calculate the number of possible permutations recursively (proudly), but I'm not sure how to go about actually doing the rearranging.
Is it better to break the string up into chars and play with it that way, or to split it up and keep the pieces as string elements? I've seen sample code online on various sites, but I don't want to see code; I'd like to know the approach/ideas behind developing a solution, as I'm kind of stuck on how to even begin. :(
I mean, I think I know how I'm going to go about the comparison against the dictionary once I've generated all the permutations.
Any advice would be helpful, though preferably ideas rather than code, if that'd be alright.
P.S. If you want to see my code so far (for whatever reason), I'll post what I've got.
import java.util.ArrayList;

public class Permutations
{
    public static String str = "overflow";
    public static ArrayList<String> possibilities = new ArrayList<String>();

    public static void main(String[] args)
    {
        permu(new boolean[str.length()], "");
    }

    // Recursively build every permutation of str, marking which
    // positions of the original string have already been used.
    public static void permu(boolean[] used, String cur)
    {
        if (cur.length() == str.length())
        {
            possibilities.add(cur);
            return;
        }
        for (int a = 0; a < str.length(); a++)
        {
            if (!used[a])
            {
                used[a] = true;
                permu(used, cur + str.charAt(a));
                used[a] = false;
            }
        }
    }
}
Simple, with a really horrible run-time, but it will get the job done.
EDIT: The more advanced version of this is something called a dictionary trie. Basically, it's a tree in which each node has 26 children, one for each letter of the alphabet, and each node also has a boolean telling whether or not it is the end of a word. With this you can easily insert words into the dictionary and easily check whether you are even on a correct path towards forming a word.
I will paste the code if you would like.
Computing the permutations really seems like a bad idea in this case. The word "overflow", for instance, has 40320 permutations.
A better way to find out whether one word is a permutation of another is to count how many times each letter occurs (this gives a 26-tuple) and compare these tuples against each other.
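For instance, a sketch of that comparison (my own illustration; non-letter characters such as hyphens are simply ignored):

// Two strings are anagrams exactly when their letter counts match.
static boolean isAnagram(String a, String b) {
    int[] counts = new int[26];
    for (char c : a.toLowerCase().toCharArray())
        if (c >= 'a' && c <= 'z') counts[c - 'a']++;
    for (char c : b.toLowerCase().toCharArray())
        if (c >= 'a' && c <= 'z') counts[c - 'a']--;
    for (int n : counts)
        if (n != 0) return false;
    return true;
}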
It might be helpful if you gave an example to clarify the problem. As I understand it, you are saying that if the user typed in, say, "islent", the program would reply with "listen", "silent", and "enlist".
I think the easiest solution would be to take each word in your dictionary and store it with both the word as entered and the word with its letters rearranged into alphabetical order. Let's call the latter the "canonical value". Index on the canonical value, then convert the input into its canonical value and do a straight search for matches.
To pursue the above example: when we build the dictionary and see the word "listen", we translate it to "eilnst" and store "eilnst -> listen". We'd also store "eilnst -> silent" and "eilnst -> enlist". Then we take the input string, convert it to "eilnst", do a search, and immediately find the three hits.
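In sketch form (my illustration; the helper names are made up):

import java.util.*;

// Index every dictionary word under its canonical form (sorted letters);
// an anagram lookup is then a single map access.
static Map<String, List<String>> buildIndex(List<String> dictionary) {
    Map<String, List<String>> index = new HashMap<>();
    for (String word : dictionary)
        index.computeIfAbsent(canonical(word), k -> new ArrayList<>()).add(word);
    return index;
}

static String canonical(String word) {
    char[] letters = word.toLowerCase().toCharArray();
    Arrays.sort(letters);
    return new String(letters);
}

// Usage: index.getOrDefault(canonical("islent"), Collections.emptyList())
// yields [listen, silent, enlist] if those words are in the dictionary.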