I was just struck by an odd exception from the entrails of StanfordNLP when trying to tokenize:
java.lang.NullPointerException
    at edu.stanford.nlp.process.PTBLexer.zzRefill(PTBLexer.java:24511)
    at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:24718)
    at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:276)
    at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:163)
    at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:55)
    at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.primeNext(DocumentPreprocessor.java:270)
    at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.hasNext(DocumentPreprocessor.java:334)
The code that causes it looks like this:
DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(tweet));

// unigrams
for (List<HasWord> sentence : dp) {
    for (HasWord word : sentence) {
        // do stuff
    }
}

// bigrams
for (List<HasWord> sentence : dp) { // << exception is thrown here
    Iterator<HasWord> it = sentence.iterator();
    String st1 = it.next().word();
    while (it.hasNext()) {
        String st2 = it.next().word();
        String bigram = st1 + " " + st2;
        // do stuff
        st1 = st2;
    }
}
What is going on? Does this have to do with me looping over the tokens twice?
This is certainly an ugly stack trace, which can and should be improved. (I'm about to check in a fix for that.) But the reason this doesn't work is that a DocumentPreprocessor acts like a Reader: it only lets you make a single pass through the sentences of a document. So after the first for-loop, the document is exhausted and the underlying Reader has been closed; hence the second for-loop fails, and here it crashes out deep in the lexer. I'm going to change it so that it will just give you nothing. But to get what you want, you should either (most efficient) collect both the unigrams and the bigrams in one for-loop pass through the document, or create a second DocumentPreprocessor for the second pass.
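For example, a minimal sketch of the single-pass version, reusing the variables from the question (the // do stuff parts stand in for your own processing):

DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(tweet));

// One pass: handle unigrams and bigrams of each sentence together.
for (List<HasWord> sentence : dp) {
    String st1 = null;
    for (HasWord word : sentence) {
        String st2 = word.word();
        // do stuff with the unigram st2
        if (st1 != null) {
            String bigram = st1 + " " + st2;
            // do stuff with the bigram
        }
        st1 = st2;
    }
}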
I think it.next().word() is causing it.
Change your code so that you first check it.hasNext() and only then call it.next().word().
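In other words, guard the first call, roughly like this (same types as in the question):

for (List<HasWord> sentence : dp) {
    Iterator<HasWord> it = sentence.iterator();
    if (it.hasNext()) { // guard before the first next()
        String st1 = it.next().word();
        while (it.hasNext()) {
            String st2 = it.next().word();
            String bigram = st1 + " " + st2;
            // do stuff
            st1 = st2;
        }
    }
}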
I am trying to run a MapReduce job on Hadoop which reads the fifth entry of a tab-delimited file (the fifth entry is the user review) and then does some sentiment analysis and word counting on it.
However, as you know with user reviews, they usually include line breaks and empty lines. My code iterates through the words of each review to find keywords and checks sentiment if a keyword is found.
The problem is that as the code iterates through the review, it gives me an ArrayIndexOutOfBoundsException because of these line breaks and empty lines within one review.
I have tried using replaceAll("\r", " ") and replaceAll("\n", " ") to no avail.
I have also tried

if (tokenizer.countTokens() == 2) {
    word.set(tokenizer.nextToken());
} else {
}

also to no avail. Below is my code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KWSentiment_Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    ArrayList<String> keywordsList = new ArrayList<String>();
    ArrayList<String> posWordsList = new ArrayList<String>();
    ArrayList<String> tokensList = new ArrayList<String>();

    // output key/value, as in the standard word-count pattern
    private final Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] line = value.toString().split("\t");
        String Review = line[4].replaceAll("[\\-\\+\\\\)\\.\\(\"\\{\\$\\^:,]", "").toLowerCase();
        StringTokenizer tokenizer = new StringTokenizer(Review);

        // 1- first read the review line and store the tokens in an ArrayList
        while (tokenizer.hasMoreTokens()) {
            tokensList.add(tokenizer.nextToken());
        }

        // 2- iterate through the review to check for a keyword,
        // 3- check if there's a positive word near it (up to +3 and -2),
        // 4- set the word & context.write, 5- clear the review ArrayList
        for (int i = 0; i < tokensList.size(); i++) {
            boolean flag = false;
            for (int j = 0; j < keywordsList.size(); j++) {
                if (tokensList.get(i).startsWith(keywordsList.get(j))) {
                    for (int e = Math.max(0, i - 2); e < Math.min(tokensList.size(), i + 4); e++) {
                        if (posWordsList.contains(tokensList.get(e))) {
                            word.set(keywordsList.get(j));
                            context.write(word, one);
                            flag = true;
                            break; // breaks out of the e loop
                        }
                    }
                }
                if (flag)
                    break; // stop checking further keywords for this token
            }
        }
        tokensList.clear();
    }
}
Expected results are such that:
Take these two cases of reviews where error occurs:
Case 1: "Beautiful and spacious!
I highly recommend this place and great host."
Case 2: "The place in general was really silent but we didn't feel stayed.
Aside from this, the bathroom is big and the shower is really nice but there problem. "
The system should read the whole review as one line and iterate through the words in it. However, it just stops when it finds a line break or an empty line, as in case 2.
Case 1 should be read such as: "Beautiful and spacious! I highly recommend this place and great host."
Case 2 should be:"The place in general was really silent but we didn't feel stayed. Aside from this, the bathroom is big and the shower is really nice but there problem. "
I am running out of time and would really appreciate help here.
Thanks!
So, I hope I am understanding what you are trying to do....
If I am reading what you have above correctly, the 'value' passed into your map function contains the delimited line that you would like to parse the user review out of. If that is the case, I believe we can make use of the escaping functionality in the opencsv library, using tabs as your delimiting character instead of commas, to correctly populate the user review field:
http://opencsv.sourceforge.net
In this example we are reading one line from the input that is passed in, parsing it into 'columns' based on the tab character, and placing the results in the 'nextLine' array. This lets us use the escaping functionality of the CSVReader without reading an actual file, instead using the value of the text passed into your map function.
StringReader reader = new StringReader(value.toString());
// separator = tab, quote char = ", escape char = \, skip 0 lines
CSVReader csvReader = new CSVReader(reader, '\t', '\"', '\\', 0);
String[] nextLine = csvReader.readNext();
if (nextLine != null && nextLine.length >= 5) {
    // Do some stuff
}
In the example that you pasted above, I think even the split("\t") will be problematic, as tabs within a user review would split the review into two fields, in addition to new lines being treated as new records. But both of these characters are legal as long as they are inside a quoted value (as they should be in a properly escaped file, and as they are in your example). CSVReader should handle all of these.
Validate each line at the start of the map method, so that you know line[4] exists and isn't null.
if (value == null || value.toString() == null) {
    return;
}
String[] line = value.toString().split("\t");
if (line == null || line.length < 5 || line[4] == null) {
    return;
}
As for line breaks, you'll need to show some sample input. By default MapReduce passes each line into the map method independently, so if you do want to read multiple lines as one message, you'll have to write a custom InputSplit, or pre-format your data so that all data for each review is on the same line.
I was working on some code using try-catch and I needed empty substrings to throw an exception when doing Double.parseDouble() (in this case, I presume it would be a NullPointerException).
My question is why this code doesn't throw an exception if I enter something like , , , (space-comma-space-comma-space-comma) or similar (which I would expect to split the string into three whitespace-only substrings, if I understand correctly):
Scanner input = new Scanner(System.in);
String[] inputParts = null;
String inputLine = input.nextLine();

// the matches() here prevents this from happening, but I still don't understand
// why an exception isn't thrown
if ((inputLine.contains(",") || inputLine.contains(" ")) && !inputLine.matches("\\s+")) {
    inputParts = inputLine.split("\\s*(,*\\s+)|(,+)");
}

for (int i = 0; i < inputParts.length; ++i) {
    // this prints nothing -- not even a new line. Same behavior even if I don't parseDouble
    // and just print the string directly
    System.out.println(Double.parseDouble(inputParts[i]));
}
If I try to parseDouble an empty string "" or " " directly, without taking user input like this, it does throw an exception.
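For comparison, a minimal standalone test of the two cases (my understanding is that String.split() drops trailing empty strings, which would leave a zero-length array here):

String[] parts = " , , ,".split("\\s*(,*\\s+)|(,+)");
// Every resulting substring is empty, and split() drops trailing empty
// strings, so the array has length 0 and a loop over it never runs.
System.out.println(parts.length); // prints 0

// Whereas parsing an empty string directly does throw:
Double.parseDouble(""); // NumberFormatException: empty String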
I'm quite confused as to why this is happening, considering the code I was working on does work except when I enter something like the above (although I fixed it by checking whether each substring was only whitespace and throwing the appropriate exception manually).
Thanks.
I'm reading in a transaction file that looks like this:
1112, D
4444, A, Smith, Jones, 45000, 2, Shipping
6666, U, Jones
8900, A, Hill, Bill, 65000, 0, Accounting
When I attempt to read the file line by line using ", " as the token, the program bombs out with a NoSuchElementException at the first record. I've deduced that the condition under which I'm reading the file is causing the issue, particularly at the while loop below. I've tried using an "if" statement, setting the condition to "while (st2.hasMoreTokens())", and a combination of the two, but the error persists and I'm not sure why. Thank you in advance for any assistance. This is the code below:
Scanner transactionFile = new Scanner(new File(fileName2));

for (int i = 0; i < T_SIZE; i++) {
    line2[i] = transactionFile.nextLine();
    transaction[i] = new Transaction();
    st2 = new StringTokenizer(line2[i], ", ");

    transaction[i].setEmployeeID(Integer.parseInt(st2.nextToken()));
    transaction[i].setAction(st2.nextToken());

    while ((transaction[i].getAction() != "D")) {
        transaction[i].setLastName(st2.nextToken());
        transaction[i].setFirstName(st2.nextToken());
        transaction[i].setSalary(Integer.parseInt(st2.nextToken()));
        transaction[i].setNumOfDependants(Integer.parseInt(st2.nextToken()));
        transaction[i].setDepartment(st2.nextToken());
    }
}
Take a look at your while loop. The == operator in Java checks whether two references point to the same object, which is rarely a good idea to rely on for strings, and it probably causes this loop to loop infinitely (or at least until the program crashes with an exception). What you want to do, logically, is check that both strings are equal, i.e., that the action contains the string "D":
while (!transaction[i].getAction().equals("D"))
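Note also that even with equals(), a while loop will keep pulling tokens from the same line; since each record is parsed exactly once, an if is probably what's intended. A sketch, reusing the setters from the question:

transaction[i].setAction(st2.nextToken());
// A "D" record carries no further fields; every other action does.
if (!transaction[i].getAction().equals("D")) {
    transaction[i].setLastName(st2.nextToken());
    transaction[i].setFirstName(st2.nextToken());
    transaction[i].setSalary(Integer.parseInt(st2.nextToken()));
    transaction[i].setNumOfDependants(Integer.parseInt(st2.nextToken()));
    transaction[i].setDepartment(st2.nextToken());
}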
str.nextToken() consumes a token each time it is called and advances an internal position. You are calling it more times than there are tokens in the line, so it runs past the end and throws a NoSuchElementException.
I've been trying to upgrade my Java skills to use more of Java 5 & Java 6. I've been playing around with some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of words and output the count of each word.
My code is below.
My questions are:
Is my file input routine the most respectful of JVM resources?
Is it possible to cut steps out in regards to reading the file contents and getting the content into a collection that can make a sorted list of words?
Am I using the Collection classes and interface the most efficient way I can?
Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.
import java.io.*;
import java.util.*;

public class Sort
{
    public static void main(String[] args)
    {
        String sUnsorted = null;
        String[] saSplit = null;
        int iCurrentWordCount = 1;
        String currentword = null;
        String pastword = "";

        // Read the text file into a string
        sUnsorted = readIn("input1.txt");

        // Parse the String by white space into String array of single words
        saSplit = sUnsorted.split("\\s+");

        // Sort the String array in descending order
        java.util.Arrays.sort(saSplit, Collections.reverseOrder());

        // Count the occurrences of each word in the String array
        for (int i = 0; i < saSplit.length; i++)
        {
            currentword = saSplit[i];

            // If this word was seen before, increase the count & print the
            // word to stdout
            if (currentword.equals(pastword))
            {
                iCurrentWordCount++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // reset our counter
            else if (!currentword.equals(pastword))
            {
                if (!pastword.equals(""))
                {
                    System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);
                }
                System.out.println(currentword);
                iCurrentWordCount = 1;
            }
            pastword = currentword;
        } // end for loop

        // Print out the count for the last word processed
        System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);
    } // end function main()

    // Read The Input File Into A String
    public static String readIn(String infile)
    {
        String result = " ";
        try
        {
            FileInputStream file = new FileInputStream(infile);
            DataInputStream in = new DataInputStream(file);
            byte[] b = new byte[in.available()];
            in.readFully(b);
            in.close();
            result = new String(b, 0, b.length, "US-ASCII");
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        return result;
    } // end function readIn()
} // end class Sort
/////////////////////////////////////////////////
// Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////
import java.io.*;
import java.util.*;

public class Sort2
{
    public static void main(String[] args) throws Exception
    {
        // Scanner will tokenize on white space, like we need
        Scanner scanner = new Scanner(new FileInputStream("input1.txt"));
        ArrayList<String> wordlist = new ArrayList<String>();
        String currentword = null;
        String pastword = null;
        int iCurrentWordCount = 1;

        while (scanner.hasNext())
            wordlist.add(scanner.next());

        // Sort in descending natural order
        Collections.sort(wordlist);
        Collections.reverse(wordlist);

        for (String temp : wordlist)
        {
            currentword = temp;

            // If this word was seen before, increase the count & print the
            // word to stdout
            if (currentword.equals(pastword))
            {
                iCurrentWordCount++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // reset our counter
            else // if (!currentword.equals(pastword))
            {
                if (pastword != null)
                    System.out.println("Count for " + pastword + ": " + iCurrentWordCount);
                System.out.println(currentword);
                iCurrentWordCount = 1;
            }
            pastword = currentword;
        } // end for loop

        System.out.println("Count for " + currentword + ": " + iCurrentWordCount);
    } // end function main()
} // end class Sort2
There are more idiomatic ways of reading in all the words in a file in Java.
BreakIterator is a better way of reading in words from an input.
Use List<String> instead of an array in almost all cases. Arrays aren't technically part of the Collection API, and they don't make it as easy to swap implementations as List, Set and Map do.
You should use a Map<String,AtomicInteger> to do your word counting instead of walking the array over and over. AtomicInteger is mutable, unlike Integer, so you can just incrementAndGet() in a single operation that happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well; see the sketch after this list.
Make as many variables as possible final, even local ones, and declare them right before you use them, not at the top where their intended scope gets lost.
You should almost always use a BufferedReader or BufferedInputStream with an appropriate buffer size, equal to a multiple of your disk block size, when doing disk IO.
That said, don't concern yourself with micro optimizations until you have "correct" behavior.
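A minimal sketch of the Map-based counting suggested above (TreeMap is a SortedMap, so iteration yields the words already sorted; this assumes the same input1.txt and simple whitespace tokenization as the question):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Map;
import java.util.Scanner;
import java.util.Scanner;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

public class WordCountMap
{
    public static void main(String[] args) throws IOException
    {
        // TreeMap is a SortedMap: entries come back in natural key order.
        final Map<String, AtomicInteger> counts = new TreeMap<String, AtomicInteger>();

        final Scanner scanner = new Scanner(new FileInputStream("input1.txt"));
        while (scanner.hasNext())
        {
            final String word = scanner.next();
            AtomicInteger count = counts.get(word);
            if (count == null)
            {
                count = new AtomicInteger();
                counts.put(word, count);
            }
            // AtomicInteger is mutable, so no re-boxing on each increment.
            count.incrementAndGet();
        }
        scanner.close();

        for (Map.Entry<String, AtomicInteger> entry : counts.entrySet())
        {
            System.out.println("Word Count for " + entry.getKey() + ": " + entry.getValue());
        }
    }
}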
the SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128, where autoboxed Integer values are cached)
you can provide custom delimiters to the Scanner type for breaking streams
Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.
Also - I recommend declaring variables when you first assign them and stop assigning unwanted null values.
To elaborate, you can count words in a map like this:
void increment(Map<String, Integer> wordCountMap, String word) {
    Integer count = wordCountMap.get(word);
    wordCountMap.put(word, count == null ? 1 : ++count);
}
Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is a form.)
Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.
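If so, a rough sketch (assuming Guava is on the classpath; LinkedHashMultiset keeps elements in first-insertion order):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

import com.google.common.collect.LinkedHashMultiset;
import com.google.common.collect.Multiset;

public class GuavaWordCount
{
    public static void main(String[] args) throws IOException
    {
        Multiset<String> words = LinkedHashMultiset.create();

        Scanner scanner = new Scanner(new FileInputStream("input1.txt"));
        while (scanner.hasNext())
        {
            words.add(scanner.next()); // the multiset does the counting
        }
        scanner.close();

        for (Multiset.Entry<String> entry : words.entrySet())
        {
            System.out.println("Count for " + entry.getElement() + ": " + entry.getCount());
        }
    }
}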
Some other things you might find interesting:
To read the file you could use a BufferedReader (if it's text only).
This:
for (int i = 0; i < saSplit.length; i++) {
    currentword = saSplit[i];
    [...]
}
Could be done using an extended for-loop (the Java foreach), as shown below.
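For example, with the saSplit array from the question:

for (String currentword : saSplit) {
    [...]
}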
And this:

if (currentword.equals(pastword)) {
    [...]
} else if (!currentword.equals(pastword)) {
    [...]
}
In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).
if ( !pastword.equals("") )
I think using length is faster here:
if (pastword.length() != 0)
Input method:
Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the implementation to change from bytes to characters is already done for you. My preference would be using Scanner.
I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but, here, you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.
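For instance, a sketch of what readIn() might look like along these lines (BufferedReader over FileReader, and letting the IOException propagate to the caller):

// A sketch, not the only way: read line by line and let the exception escape.
public static String readIn(String infile) throws IOException
{
    StringBuilder result = new StringBuilder();
    BufferedReader reader = new BufferedReader(new FileReader(infile));
    try
    {
        String line;
        while ((line = reader.readLine()) != null)
        {
            result.append(line).append('\n');
        }
    }
    finally
    {
        reader.close();
    }
    return result.toString();
}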
Collections:
You're not really using the Collections classes; you're using an array. Your implementation seems fine, but...
There are certainly many ways of handling this problem. Your method, sorting then comparing to the last word, is O(n log n) on average. That's certainly not bad. Look at a way of using a Map implementation (such as HashMap) to store the data you need while only traversing the text once, in O(n) (HashMap's get() and put() -- and containsKey() -- methods are O(1)).
I'm working on a little server app in Java. I'm getting information from different clients, and when information comes in, the following method is called:
public void writeToArray(String data) {
    data = trim(data);

    String[] netInput = new String[5];
    netInput[0] = "a";
    netInput[1] = "a";
    netInput[2] = "a";
    netInput[3] = "a";
    netInput[4] = "a";

    netInput = split(data, ",");

    pos_arr = PApplet.parseInt(netInput[0]);
    rohr_value = PApplet.parseInt(netInput[1]); // THIS LINE KICKS OUT THE ERROR.

    if (pos_arr > 0 && pos_arr < 100) {
        fernrohre[pos_arr] = rohr_value;
        println("pos arr length: " + fernrohre[pos_arr]);
        println("pos arr: " + pos_arr);
    }
}
The console on OS X gives me the following error:
Exception in thread "Animation Thread" java.lang.ArrayIndexOutOfBoundsException: 1
    at server_app.writeToArray(server_app.java:108)
    at server_app.draw(server_app.java:97)
    at processing.core.PApplet.handleDraw(PApplet.java:1606)
    at processing.core.PApplet.run(PApplet.java:1503)
    at java.lang.Thread.run(Thread.java:637)
As you can see, I tried to fill the array netInput with at least 5 entries, so there shouldn't be an ArrayIndexOutOfBoundsException.
I don't understand that, and I'm thankful for your help!
It would already work for me if I could catch the error and keep the app running.
You put 5 Strings into the array, but then undo all your good work with this line:
netInput = split(data, ",");
data obviously doesn't have any commas in it.
In this line
netInput = split(data, ",");
your array is being reinitialized. Your split method probably returns an array with only one element (I would guess that the data string doesn't contain any ",").
Update
The split() method is custom, not String.split(). It too needs to be checked to see what is going wrong. Thanks @Carlos for pointing it out.
Original Answer
Consider this line:
netInput = split(data, ",");
This will split the data string using comma as a separator. It will return an array of (number of commas + 1) resulting elements. If your string has no commas, you'll get a single element array.
Apparently your input string doesn't have any commas. This will result in a single element array (first element aka index = 0 will be the string itself). Consequently when you try to index the 2nd element (index = 1) it raises an exception.
You need some defensive code:

if (netInput.length > 1) {
    pos_arr = PApplet.parseInt(netInput[0]);
    rohr_value = PApplet.parseInt(netInput[1]);
}
You assign

netInput = split(data, ",");

and

split(data, ",")

returns a one-element array.
You are re-assigning your netInput variable when the split() method is called.
The new value might not have a length of 5.
Can you provide the source for the split() method?