I am writing an algorithm to extract likely keywords from a document's text. I want to count instances of words and take the top 5 as keywords. Obviously, I want to exclude "insignificant" words lest every document appear with "the" and "and" as major keywords.
Here is the strategy I've successfully used for testing:
exclusions = new ArrayList<String>();
exclusions.add("a","and","the","or");
Now that I want to do a real-life test, my exclusion list is close to 200 words long, and I'd LOVE to be able to do something like this:
exclusions = new ArrayList<String>();
exclusions.add(each word in foo.txt);
Long term, maintaining an external list (rather than a list embedded in my code) is desirable for obvious reasons. With all the file read/write methods out there in Java, I'm fairly certain that this can be done, but my search results have come up empty...I know I've got to be searching on the wrong keywords. Anyone know an elegant way to include an external list in processing?
This does not directly address the solution you are asking for, but it might give you another avenue that works better.
Instead of deciding in advance what is useless, you could count everything and then filter out what you deem insignificant (from an information-carrying standpoint) because of its overwhelming presence. It is similar to a low-pass filter in signal processing that eliminates noise.
So in short: count everything, then discard anything that appears with a frequency higher than a threshold you set. You'll have to determine that threshold by experiment; if, say, 5% of all words are 'the', that word does not carry information.
If you do it this way, it'll even work for foreign languages.
Just my two cents on this.
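That counting-and-thresholding idea can be sketched as follows; the 50%/5% cutoff is just a placeholder you would tune by experiment:

```java
import java.util.*;
import java.util.stream.*;

public class FrequencyFilter {
    // Count every word, then drop any word whose relative frequency
    // exceeds the threshold (e.g. 0.05 means "more than 5% of all words").
    public static Map<String, Long> significantCounts(List<String> words, double threshold) {
        long total = words.size();
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        counts.values().removeIf(c -> (double) c / total > threshold);
        return counts;
    }
}
```

The nice property is that no hand-maintained exclusion list is needed, so it works for any language.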
You can use a FileReader to read the Strings out of the file and add them to an ArrayList.
private List<String> createExclusions(String file) throws IOException {
    List<String> exclusions = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String word;
    while ((word = reader.readLine()) != null) {
        exclusions.add(word);
    }
    reader.close();
    return exclusions;
}
Then you can use List<String> exclusions = createExclusions("exclusions.txt"); to create the list.
Not sure if it is elegant, but here is a simple solution I created some years ago to detect the language of, or remove noise words from, tweets:
TweetDetector.java
JTweet.java, which uses the data, e.g. for English
Google Guava library contains lots of useful methods that simplify routine tasks. You can use one of them to read file contents to string and split it by space character:
String contents = Files.toString(new File("foo.txt"), Charset.defaultCharset());
List<String> exclusions = Lists.newArrayList(contents.split("\\s+"));
Apache Commons IO provides similar shortcuts:
String contents = FileUtils.readFileToString(new File("foo.txt"));
...
Commons-io has utilities that support this. Include commons-io as a dependency, then issue
File myFile = ...;
List<String> exclusions = FileUtils.readLines( myFile );
as described in:
http://commons.apache.org/io/apidocs/org/apache/commons/io/FileUtils.html
This assumes that every exclusion word is on a new line.
Reading from a file is pretty simple.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
public class ExcludeExample {
    public static HashSet<String> readExclusions(File file) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        HashSet<String> exclusions = new HashSet<String>();
        while ((line = br.readLine()) != null) {
            exclusions.add(line);
        }
        br.close();
        return exclusions;
    }

    public static void main(String[] args) throws IOException {
        File foo = new File("foo.txt");
        HashSet<String> exclusions = readExclusions(foo);
        System.out.println(exclusions.contains("the"));
        System.out.println(exclusions.contains("Java"));
    }
}
foo.txt
the
a
and
or
I used a HashSet instead of an ArrayList because it has faster lookup.
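Tying this back to the original goal, a sketch that counts the non-excluded words and takes the top n by frequency (the tokenizing regex is an assumption; adjust it for your documents):

```java
import java.util.*;
import java.util.stream.*;

public class KeywordExtractor {
    // Count words that are not in the exclusion set, then return the
    // n most frequent ones as the document's keywords.
    public static List<String> topKeywords(String text, Set<String> exclusions, int n) {
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !exclusions.contains(w))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```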
I currently have a binary tree setup and would like to create an array with the keys so I can do a heap sort operation on them. How would I go about doing that?
Here is what I currently have:
public static void main(String args[]) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader("employee.txt"));
    String line;
    Heap employee = new Heap();
    while ((line = in.readLine()) != null) {
        String[] text = line.split(" ");
        employee.insert(Double.parseDouble(text[0]), Double.parseDouble(text[1]));
    }
    in.close();
}
The binary tree that I am using is pretty standard but I can post it if needed. The "text[0]" segment is what the key is for each node.
One possibility is to use the TreeSet class in combination with a Comparator. The TreeSet can behave like a heap. The class is well documented, but if you have more questions, ask.
EDIT
Take a look here. The accepted answer shows you an implementation of a binary tree. What you need now is a sorting function implementation in that class, which may be triggered at element insertion or manually when needed.
I still don't see how you want to switch a tree into a heap, as they are different things. I guess you mean the tree should be "read", say from left to right, and its keys rearranged?
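If the goal is just to get the keys into an array for a heap sort, an in-order traversal is enough; heap sort does not care about the input order. A minimal sketch, assuming a node shape with a double key and left/right children (the real tree class in the question may differ):

```java
import java.util.*;

public class TreeToArray {
    // Hypothetical node shape for illustration only.
    static class Node {
        double key;
        Node left, right;
        Node(double key) { this.key = key; }
    }

    // In-order walk that appends every key to the output list.
    static void collect(Node n, List<Double> out) {
        if (n == null) return;
        collect(n.left, out);
        out.add(n.key);
        collect(n.right, out);
    }

    // Flatten the tree's keys into a primitive array, ready for heap sort.
    public static double[] keysToArray(Node root) {
        List<Double> keys = new ArrayList<>();
        collect(root, keys);
        double[] arr = new double[keys.size()];
        for (int i = 0; i < arr.length; i++) arr[i] = keys.get(i);
        return arr;
    }
}
```

As a bonus, if the tree is a binary search tree, the in-order walk already yields the keys in sorted order.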
I installed word2vec using this tutorial on my Ubuntu laptop. Is it completely necessary to install DL4J in order to work with word2vec vectors in Java? I'm comfortable working in Eclipse and I'm not sure that I want all the other prerequisites that DL4J wants me to install.
Ideally there would be a really easy way for me to just use the Java code I've already written (in Eclipse) and change a few lines -- so that word look-ups that I am doing would retrieve a word2Vec vector instead of the current retrieval process I'm using.
Also, I've looked into using GloVe, however, I do not have MatLab. Is it possible to use GloVe without MatLab? (I got an error while installing it because of this). If so, the same question as above goes... I have no idea how to implement it in Java.
What is preventing you from saving the word2vec (the C program) output in text format, then reading the file with a piece of Java code and loading the vectors into a HashMap keyed by the word string?
Some code snippets:
// Class to store a hashmap of wordvecs
public class WordVecs {
    HashMap<String, WordVec> wordvecmap;
    ....
    void loadFromTextFile() {
        String wordvecFile = prop.getProperty("wordvecs.vecfile");
        wordvecmap = new HashMap<>();
        try (FileReader fr = new FileReader(wordvecFile);
             BufferedReader br = new BufferedReader(fr)) {
            String line;
            while ((line = br.readLine()) != null) {
                WordVec wv = new WordVec(line);
                wordvecmap.put(wv.word, wv);
            }
        }
        catch (Exception ex) { ex.printStackTrace(); }
    }
    ....
}
// class for each wordvec
public class WordVec implements Comparable<WordVec> {
    public WordVec(String line) {
        String[] tokens = line.split("\\s+");
        word = tokens[0];
        vec = new float[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++)
            vec[i - 1] = Float.parseFloat(tokens[i]);
        norm = getNorm();
    }
    ....
}
If you want to get the nearest neighbours for a given word, you can keep a list of N nearest pre-computed neighbours associated with each WordVec object.
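Ranking those neighbours usually comes down to cosine similarity between the vectors. A minimal helper, assuming the float[] vec field shown above:

```java
public class CosineSim {
    // Cosine similarity between two vectors of equal length:
    // dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors.
    public static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

To pre-compute the N nearest neighbours of a word, you would score it against every other entry in the map and keep the N highest-scoring ones.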
Dl4j author here. Our word2vec implementation is targeted at people who need custom pipelines, so I don't blame you for going the simple route here.
Our word2vec implementation is meant for when you want to do something with the vectors, not for messing around. The C word2vec format is pretty straightforward.
Here is parsing logic in java if you'd like:
https://github.com/deeplearning4j/deeplearning4j/blob/374609b2672e97737b9eb3ba12ee62fab6cfee55/deeplearning4j-scaleout/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/embeddings/loader/WordVectorSerializer.java#L113
Hope that helps a bit
I can't figure this out for the life of me.
Steps:
Create a new project in Eclipse
Copy the provided wordlist.txt file into the Project folder
Write a single Class named "Reverser" that performs the requested tasks:
Tasks:
Use a java.util.Scanner to load each word in the wordlist.txt file into an ArrayList
Provide the Scanner a reference to a FileReader
Report the number of words placed into the ArrayList
Use the java.util.Collections class to reverse the order of the references in the ArrayList
Use a java.util.Formatter to write the re-ordered words into a new text file named "reversed.txt"
Provide the Formatter with a reference to a FileWriter
Make sure that each word is placed onto a separate line
Additionally, write code so that Java provides the correct end of line terminator for each line. Note: No \n, or \r\n allowed!
Write code to help ensure your program has no resource leaks.
Here is what I have so far
public class Reverser {
    public static void main(String[] args) {
        Scanner scan = null;
        File file = new File("C:\\Users\\Nick\\JavaWorkspace\\Lab 7\\wordlist.txt");
        ArrayList<String> list;
        try {
            scan = new Scanner(new FileReader(file));
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        list = new ArrayList<String>();
        while ((scan.nextLine()) != null) {
            list.add(scan.next());
        }
        String[] stringArr = list.toArray(new String[0]);
    }
}
I will give you general guideline to follow.
This is a nice task to get some hands on experience in the following areas:
Using Input to read (from a file) and Output stream to write (to a file)
Using the size() method for your List object.
Using a finally block with your try/catch block to clean up resources (close the Scanner in the finally block)
Using the Collections.reverse() method to reverse the list in place.
You can look up those things individually and then try to integrate piece by piece in your code.
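Once you have looked those pieces up, a possible way they fit together is sketched below (file names taken from the task; Formatter's %n emits the platform line separator, which satisfies the "no \n or \r\n" rule):

```java
import java.io.*;
import java.util.*;

public class Reverser {
    public static void main(String[] args) throws IOException {
        List<String> words = new ArrayList<>();
        // try-with-resources closes the Scanner even if reading fails
        try (Scanner scan = new Scanner(new FileReader("wordlist.txt"))) {
            while (scan.hasNextLine()) {
                words.add(scan.nextLine());
            }
        }
        System.out.println("Words read: " + words.size());
        Collections.reverse(words);
        // Formatter wraps a FileWriter; closing it flushes and closes the file
        try (Formatter out = new Formatter(new FileWriter("reversed.txt"))) {
            for (String w : words) {
                out.format("%s%n", w); // %n = platform end-of-line terminator
            }
        }
    }
}
```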
The first assignment of my algorithms class is that I have to create a program that reads a series of book titles from a provided csv file, sorts them, and then prints them out. The assignment has very specific parameters, and one of them is that I have to create a static List getList(String file) method. The specifics of what this method entails are as follows:
"The method getList should readin the data from the csv
file book.csv. If a line doesn’t follow the pattern
title,author,year then a message should be written
to the standard error stream (see sample output) The
program should continue reading in the next line. NO
exception should be thrown ."
I don't have much experience with the usage of List, ArrayList, or reading in files, so as you can guess this is very difficult for me. Here's what I have so far for the method:
public static List<Book> getList(String file)
{
List<Book> list = new ArrayList<Book>();
return list;
}
Currently, my best guess is to make a for loop and instantiate a new Book object into the List using i as the index, but I wouldn't know how high to set the loop, as I don't have any way to tell the program, say, how many lines there are in the csv. I also wouldn't know how to get it to differentiate each book's title, author, and year in the csv.
Sorry for the long-winded question. I'd appreciate any help. Thanks.
The best way to do this, would be to read the file line by line, and check if the format of the line is correct. If it is correct, add a new object to the list with the details in the line, otherwise write your error message and continue.
You can read your file using a BufferedReader. They can read line by line by doing the following:
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    // do something with the line here
}
br.close();
Now that you have the lines, you need to verify they are in the correct format. A simple method to do this, is to split the line on commas (since it is a csv file), and check that it has at least 3 elements in the array. You can do so with the String.split(regex) method.
String[] bookDetails = line.split(",");
This would populate the array with the fields from your file. So for example, if the first line was one,two,three, then the array would be ["one","two","three"].
Now you have the values from the line, but you need to verify that it is in the correct format. Since your post specified that it should have 3 fields, we can check this by checking the length of the array we got above. If the length is less than 3, we should output some error message and skip that line.
if (bookDetails.length < 3) { // title,author,year
    System.err.println("Some error message here"); // output error msg
    continue; // skip this line as the format is corrupted
}
Finally, since we have read the line and verified that the information we need is there in a valid format, we can create a new object and add it to the list. We will use Java's built-in Integer wrapper to parse the year into a primitive int for the Book class constructor; Integer.parseInt(String s) parses a String into an int value.
list.add(new Book(bookDetails[0], bookDetails[1], Integer.parseInt(bookDetails[2])));
Hopefully this helps you out, and answers your question. A full method of what we did could be the following:
public static List<Book> getList(String file) {
    List<Book> list = new ArrayList<Book>();
    try {
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            String[] bookDetails = line.split(",");
            if (bookDetails.length < 3) { // title,author,year
                System.err.println("Some error message here");
                continue;
            }
            list.add(new Book(bookDetails[0], bookDetails[1], Integer.parseInt(bookDetails[2])));
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return list;
}
And if you would like to test this, a main method can be made with the following code (this is how I tested it).
public static void main(String[] args) {
    String file = "books.csv";
    List<Book> books = getList(file);
    for (Book b : books) {
        System.out.println(b);
    }
}
To test it, make sure you have a file (mine was "books.csv") in your root directory of your Java project. Mine looked like:
bob,jones,1993
bob,dillon,1994
bad,format
good,format,1995
another,good,1992
bad,format2
good,good,1997
And with the above main method, getList function, and file, my code generated the following output (note: the error messages were in red from the stderr stream; SO doesn't show colors):
Some error message here
Some error message here
[title=bob, author=jones, years=1993]
[title=bob, author=dillon, years=1994]
[title=good, author=format, years=1995]
[title=another, author=good, years=1992]
[title=good, author=good, years=1997]
Feel free to ask questions if you are confused on any part of it. The output shown is from a toString() method I wrote on the Book class that I used for testing the code in my answer.
You can use a do-while loop and read until the end of the file. Each new line will represent one Book object's details.
In a csv, all details are comma separated, so each comma acts as a delimiter between the attributes of a Book.
I have a Hashtable.
htmlcontent is the HTML string of urlstring.
I want to write the Hashtable into a .txt file.
Can anyone suggest a solution?
How about one row for each entry, with the two strings separated by a comma? Something like:
"key1","value1"
"key2","value2"
...
"keyn","valuen"
Keep the quotes and you can write out keys that refer to null entries too, like
"key", null
To actually produce the table, you might want to use code similar to:
public void write(OutputStreamWriter out, Hashtable<String, String> table)
        throws IOException {
    String eol = System.getProperty("line.separator");
    for (String key : table.keySet()) {
        out.write("\"");
        out.write(key);
        out.write("\",\"");
        out.write(String.valueOf(table.get(key)));
        out.write("\"");
        out.write(eol);
    }
    out.flush();
}
For the I/O part, you can use a new PrintWriter(new File(filename)). Just call the println methods like you would System.out, and don't forget to close() it afterward. Make sure you handle any IOException gracefully.
If you have a specific format, you'd have to explain it, but otherwise a simple for-each loop on the Hashtable.entrySet() is all you need to iterate through the entries of the Hashtable.
By the way, if you don't need the synchronized feature, a HashMap<String,String> would probably be better than a Hashtable.
Related questions
Java io ugly try-finally block
Java hashmap vs hashtable
Iterate Over Map
Here's a simple example putting things together, omitting robust IOException handling for clarity and using a simple format:
import java.io.*;
import java.util.*;
public class HashMapText {
    public static void main(String[] args) throws IOException {
        //PrintWriter out = new PrintWriter(System.out);
        PrintWriter out = new PrintWriter(new File("map.txt"));

        Map<String,String> map = new HashMap<String,String>();
        map.put("1111", "One");
        map.put("2222", "Two");
        map.put(null, null);

        for (Map.Entry<String,String> entry : map.entrySet()) {
            out.println(entry.getKey() + "\t=>\t" + entry.getValue());
        }
        out.close();
    }
}
Running this on my machine generates a map.txt containing three lines:
null => null
2222 => Two
1111 => One
As a bonus, you can use the first (commented-out) declaration and initialization of out to print the same thing to standard output instead of a text file.
See also
Difference between java.io.PrintWriter and java.io.BufferedWriter?
java.io.PrintWriter API
Methods in this class never throw I/O exceptions, although some of its constructors may. The client may inquire as to whether any errors have occurred by invoking checkError().
For text representation, I would recommend picking a few characters that are very unlikely to occur in your strings, then outputting a CSV format file with those characters as separators, quotes, terminators, and escapes. Essentially, each row (as designated by the terminator, since otherwise there might be line-ending characters in either string) would have as the first CSV "field" the key of an entry in the hashtable, as the second field, the value for it.
A simpler approach along the same lines would be to designate one arbitrary character, say the backslash \, as the escape character. You'll have to double up backslashes when they occur in either string, and express the tab (\t) and line-end (\n) in escape form; then you can use a real (not escape-sequence) tab character as the field separator between the two fields (key and value), and a real (not escape-sequence) line-end at the end of each row.
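A minimal sketch of that backslash-escape scheme (names are illustrative):

```java
public class TsvEscape {
    // Escape backslash, tab, and newline so that a real tab can separate
    // the key and value fields and a real newline can terminate each row.
    public static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    // Invert escape(): turn \\ back into \, \t into a tab, \n into a newline.
    public static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                if (next == 't') out.append('\t');
                else if (next == 'n') out.append('\n');
                else out.append(next); // handles the doubled backslash
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```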
You can try
public static void save(String filename, Map<String, String> hashtable) throws IOException {
    Properties prop = new Properties();
    prop.putAll(hashtable);
    FileOutputStream fos = new FileOutputStream(filename);
    try {
        prop.store(fos, "saved map"); // second argument is a comment written into the file
    } finally {
        fos.close();
    }
}
This stores the hashtable (or any Map) as a properties file. You can use the Properties class to load the data back in again.
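Reading the data back is symmetric; a sketch using Properties.load (the file name is hypothetical):

```java
import java.io.*;
import java.util.*;

public class PropsRoundTrip {
    // Load a properties file back into a plain Map<String, String>.
    public static Map<String, String> load(String filename) throws IOException {
        Properties prop = new Properties();
        try (FileInputStream fis = new FileInputStream(filename)) {
            prop.load(fis);
        }
        Map<String, String> map = new HashMap<>();
        for (String name : prop.stringPropertyNames()) {
            map.put(name, prop.getProperty(name));
        }
        return map;
    }
}
```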
import java.io.*;
import java.util.*;

class FileWrite {
    public static void main(String args[]) {
        Hashtable table = //get the table
        BufferedWriter writer = null;
        try {
            // Create file
            writer = new BufferedWriter(new FileWriter("out.txt"));
            writer.write(table.toString());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try { if (writer != null) writer.close(); } catch (IOException e) { }
        }
    }
}
Since you don't have any requirements for the file format, I would not create a custom one. Just use something standard; I would recommend JSON for that!
Alternatives include XML and CSV, but I think JSON is the best option here. CSV doesn't handle complex types (like having a list in one of the keys of your map), and XML can be quite complex to encode/decode.
Using json-simple as example:
String serialized = JSONValue.toJSONString(yourMap);
and then just save the string to your file (which is not specific to your domain either; e.g. using Apache Commons IO):
FileUtils.writeStringToFile(new File(yourFilePath), serialized);
To read the file:
Map map = (JSONObject) JSONValue.parse(FileUtils.readFileToString(new File(yourFilePath)));
You can use other json library as well but I think this one fits your need.