How to get words from a sequence of characters in Java?

I have a method getNextChar() that reads a string character by character, and I am writing a method to extract the words from the character sequence it provides.
The text contains punctuation marks and other special characters.
I am thinking of keeping an array of all the punctuation marks and special characters and, as I read each character of the text, checking whether it is in the array so that I can ignore it.
The method will recognize the end of a word when it gets a space. The words will be stored in a collection (e.g. a map), because I also need to count frequencies: I check whether the word has been inserted before and, if so, increment its counter.
Is this the best and most efficient way of doing it?
Is there a complete list of punctuation marks and special characters?

I think there is an easier way to do this.
No matter what your input source is, I would read it using the Scanner class. You can instantiate it with your input string and call Scanner.next() to get the next token. By default Scanner splits on whitespace, so each token is the next word. Then you can use String.replaceAll() with a punctuation pattern to strip the punctuation, add the words to an ArrayList, and count frequencies from there.
Scanner reader = new Scanner(string);
String word = reader.next();
word = word.replaceAll("\\p{Punct}", ""); // one possible pattern: strips ASCII punctuation
list.add(word);

You could use string.split() to break apart the string into an array of strings separated by whitespace (for your words.) You could also check each character with Character.isLetterOrDigit() to avoid punctuation. (Not necessarily in that order.)

The lookup for punctuation will perform better if you use a Set of characters.
Set<Character> punctuationChars ....
if (punctuationChars.contains(yourChar)) { ... }

Just use a Scanner to read in the Strings:
Scanner in = new Scanner(...);
while (in.hasNext()) {
String word = in.next();
/* do something with the word, check punctuation, etc. */
}
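Putting the suggestions above together, here is a minimal sketch of the frequency count the question asks for. The class name WordFrequency is made up for illustration, and it assumes that "punctuation" means anything that is not a Unicode letter or digit and that words should be compared case-insensitively; adjust both if your text needs something else.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class WordFrequency {
    // Counts how often each word occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, handy for printing
        try (Scanner in = new Scanner(text)) {
            while (in.hasNext()) {
                // Assumption: every character that is not a letter or digit is noise.
                String word = in.next()
                        .replaceAll("[^\\p{L}\\p{N}]", "")
                        .toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old count
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("The cat, the dog; the end!"));
        // prints {cat=1, dog=1, end=1, the=3}
    }
}
```

Map.merge() replaces the "check if the word is already in the map, then increment" step with a single call.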

Related

Java StreamTokenizer splits Email address at # sign

I am trying to parse a document containing email addresses, but the StreamTokenizer splits the E-mail address into two separate parts.
I already set the # sign as an ordinaryChar and space as the only whitespace:
StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('#');
tokenizer.whitespaceChars(' ', ' ');
Still, all E-mail addresses are split up.
A line to parse looks like the following:
"Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck".
The Tokenizer splits del6#uni.at to "del6" and "uni.at".
Is there a way to tell the tokenizer to not split at # signs?

So here is why it worked like it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.
So in fact it does rather sophisticated tokenizing: it recognizes comments, quoted strings and numbers. Note that in a programming language, you can have a string like a = a+b;. A simple tokenizer that merely breaks the text by whitespace would break this into a, = and a+b;. But StreamTokenizer would break this into a, =, a, +, b, and ;, and will also give you the "type" for each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
It wasn't recognizing the # as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.
A StreamTokenizer would recognize your line as:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character #
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
(This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
In fact, by telling it that # was an "ordinary character", you were telling it to take the # as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer.
It removes any special significance the character has as a comment
character, word component, string delimiter, white space, or number
character. When such a character is encountered by the parser, the
parser treats it as a single-character token and sets ttype field to
the character value.
(My emphasis).
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('#','#') it would have kept the whole e-mail together. My little demo with that added gives:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6#uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise your options depend on whether your data is line-based (each line is a separate record, there may be a different number of tokens on each line), where you would typically read lines one-by-one from a reader, then split them using String.split(), or if it is just a whitespace-delimited chain of tokens, where Scanner might suit you better.
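For reference, a demo along the lines of the one described above could look like the sketch below. It is not the answerer's original code; the class name TokenTypeDemo and the describe() helper are made up for illustration. It relies only on the default StreamTokenizer syntax table plus the optional wordChars('#', '#') call.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenTypeDemo {
    // Tokenizes one line and labels every token by its type.
    static List<String> describe(String line, boolean hashIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        if (hashIsWordChar) {
            st.wordChars('#', '#'); // make '#' part of words instead of a token of its own
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("word " + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("number " + st.nval);
                } else {
                    // ordinary characters are reported through ttype, not sval
                    out.add("character " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6#uni.at Competition";
        describe(line, false).forEach(System.out::println);
        System.out.println("--- with wordChars('#', '#') ---");
        describe(line, true).forEach(System.out::println);
    }
}
```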

In order to simply split a String, see the answer to this question (adapted for whitespace):
The best way is to not use a StringTokenizer at all, but use String's
split method. It returns an array of Strings, and you can get the
length from that.
For each line in your file you can do the following:
String[] tokens = line.split(" +");
tokens will now have 6 - 8 Strings. Use tokens.length (a field, not a
method) to find out how many, then create your object from the array.
This is sufficient for the given line, and might be sufficient for everything. Here is some code that uses it (it reads System.in):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class T {
    public static void main(String[] args) {
        BufferedReader st = new BufferedReader(new InputStreamReader(System.in));
        String line;
        try {
            // readLine() returns null at end of input; ready() only says whether a read would block
            while ((line = st.readLine()) != null) {
                String[] tokens = line.split(" +");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // handle error here
        }
    }
}

Regular expression to retrieve words from files

I have a set of files in a particular directory.
After retrieving the contents of all the (text) files in the directory, I have a List of Strings.
Each String element holds the content retrieved from one file, so the first element in the list is the content of the first file.
Now I want to split each string into words (which are later stored in an array of strings).
1) Words can be separated by a single space or multiple spaces.
2) Sentences end with a '.', so a new word can start after a '.'.
3) A new word can start after '\n'.
Can anyone suggest a regular expression that fits the split() method?

Perhaps the StringTokenizer class is a better fit for your need. The constructor takes the string to tokenize and a list of delimiters (in your case: space, ., and line break).

String[] result = myString.split("[.\\s]+");

You probably don't need a regexp for this: just remove every non-letter character from the file contents and use a tokenizer to read each word.

I would suggest tokenizing by hand for this: simply go through each character and decide what to do based on what the character is. Here's the pseudo-code:
string word = "";
while ( not EOF ){
    char = getNextChar()
    if ( char is not a space, full stop or newline ){
        append the char to the word
    }
    else {
        if ( the word is empty ){ continue /* ignore repeated delimiters */ }
        else {
            add the word to an array of words
            reset the word to ""
        }
    }
}
if ( the word is not empty ){ add the word to the array /* last word before EOF */ }
This way, you have complete control over how you process the data, and you don't have to worry about which odd cases to include in the regex rule. It is also efficient (likely better than a regex), since you make only a single pass through the data.
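In Java, the single-pass idea could be sketched roughly as follows. The class name SinglePassWords is made up for illustration, Reader.read() stands in for getNextChar(), and any whitespace character (not just a space) is assumed to end a word, which also covers the '\n' case from the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SinglePassWords {
    // Collects words separated by whitespace or full stops in one pass over the input.
    static List<String> words(Reader in) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {          // -1 signals end of input
                if (c == '.' || Character.isWhitespace(c)) {
                    if (word.length() > 0) {          // skip runs of delimiters
                        words.add(word.toString());
                        word.setLength(0);
                    }
                } else {
                    word.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (word.length() > 0) {                      // flush the last word
            words.add(word.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words(new StringReader("One  two.Three\nfour")));
        // prints [One, two, Three, four]
    }
}
```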

How to use delimiter to isolate words (Java)

I am writing a program that scans text files and then writes each word into a HashMap.
The Scanner class uses whitespace as its default delimiter, so I ended up with words stored with punctuation attached to them. I want the scanner to treat periods, commas and other common punctuation as the end of a token. Here's what I have attempted:
Scanner line_scanner = new Scanner(line).useDelimiter("[.,:;()?!\" \t]+~\\s");
The scanner basically ignored all the spaces even though I have '\\s' as part of the expression. Sorry, but I have hardly any understanding of regex.

Scanner line_scanner = new Scanner(line).useDelimiter("[.,:;()?!\"\\s]+");

You might instead treat every run of characters that are not Unicode letters as a delimiter:
useDelimiter("[^\\p{L}\\p{M}]+");
([^...] means "not"; \p{...} selects a Unicode category, where L is the letters and M the combining diacritical marks (accents).)
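A quick way to try the Unicode-category delimiter is a sketch like this one (the class name DelimiterDemo and the sample text are made up for illustration). Note that with this pattern digits count as delimiters too, so "x2y" comes out as two words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Returns the runs of letters (and combining marks) found in the line.
    static List<String> words(String line) {
        List<String> words = new ArrayList<>();
        // Any run of characters that are neither letters nor combining marks separates tokens.
        try (Scanner sc = new Scanner(line).useDelimiter("[^\\p{L}\\p{M}]+")) {
            while (sc.hasNext()) {
                words.add(sc.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // \u00e9 is an accented e; it is a letter, so it stays inside the word
        System.out.println(words("Hello, world! (Caf\u00e9) x2y"));
    }
}
```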

How to tokenize in java without using the java.util tokenizer?

Consider the following as tokens:
+, -, ), (
alphabetic characters and underscore
integer
Implement:
1. getToken() - returns a string corresponding to the next token
2. getTokPos() - returns the position of the current token in the input string
Example input: (a+b)-21)
Output: (| a| +| b| )| -| 21| )|
Note: Cannot use the java string tokenizer class
Work in progress - Successfully tokenized +,-,),(. Need to figure out characters and numbers:
OUTPUT: +|-|+|-|(|(|)|)|)|(| |

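java.util.StringTokenizer is a legacy class; it is not formally deprecated, but its use is discouraged in new code.
Tokenizing Strings in Java has been much easier with String.split() since Java 1.4:
String[] tokens = "(a+b)-21)".split("[-+()]");
(The hyphen goes first in the character class so that it is taken literally; written as [+-)(] it is read as an invalid range and throws a PatternSyntaxException. Also note that split() discards the delimiters and leaves empty strings where two delimiters are adjacent.)
If it is homework, you probably have to reimplement a "split" method:
read the String character by character
if the character is not a special char, add it to a buffer
when you encounter a special char, add the buffer content to a list and clear the buffer
Since it is (probably) homework, I will let you implement it.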

Java lets you examine the characters in a String one by one with the charAt method. So use that in a for loop and examine each character. When you encounter a TOKEN you wrap that token with the pipes, and any other character you just append to the output.
public static final char PLUS_TOKEN = '+';
// add the other tokens as further constants

public String doStuff(String input)
{
    StringBuilder output = new StringBuilder();
    for (int index = 0; index < input.length(); index++)
    {
        char current = input.charAt(index);
        if (current == PLUS_TOKEN) // extend this test to compare against all tokens
        {
            // when you see a token you need to append the pipes (|) around it
            output.append('|');
            output.append(current);
            output.append('|');
        }
        else
        {
            // just add to new output
            output.append(current);
        }
    }
    return output.toString();
}

If it's not a homework assignment, use String.split(). If it is a homework assignment, say so and tag it so that we can give the appropriate level of help (I did so for you, just in case...).

Because the string needs to be cut in several different ways, not just on whitespace or parens, using the String.split method with any of the symbols there will not work. Split removes the character used as a separator. You could try to split on the empty string, but this wouldn't get compound symbols, like 21. To correctly parse this string, you will need to effectively implement your own tokenizer. Try thinking about how you could tell you had a complete token if you looked at the string one character at a time. You could probably start a string that collects the characters until you have identified a complete token, and then you can remove the characters from the original and return the string. Starting from this point, you can probably make a basic tokenizer.
If you'd rather learn how to make a full strength tokenizer, most of them are defined by creating a regular expression that only matches the tokens.
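As an illustration of the regex-defined approach, here is a minimal sketch for the three token kinds in the question. The class name SimpleTokenizer is made up, returning null at the end of input is an assumption (the assignment does not say what should happen), and characters that match no token kind are silently skipped rather than reported as errors.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizer {
    // One alternative per token kind: a single operator or parenthesis,
    // a run of letters/underscores, or a run of digits.
    private static final Pattern TOKEN = Pattern.compile("[-+()]|[A-Za-z_]+|[0-9]+");

    private final Matcher matcher;
    private int tokPos = -1; // -1 until the first token has been read

    SimpleTokenizer(String input) {
        matcher = TOKEN.matcher(input);
    }

    // Returns the next token, or null when no token is left.
    String getToken() {
        if (!matcher.find()) {
            return null;
        }
        tokPos = matcher.start();
        return matcher.group();
    }

    // Returns the start index of the token most recently returned by getToken().
    int getTokPos() {
        return tokPos;
    }

    public static void main(String[] args) {
        SimpleTokenizer t = new SimpleTokenizer("(a+b)-21)");
        StringBuilder out = new StringBuilder();
        for (String tok = t.getToken(); tok != null; tok = t.getToken()) {
            out.append(tok).append("| ");
        }
        System.out.println(out.toString().trim());
        // prints (| a| +| b| )| -| 21| )|
    }
}
```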

String.split() - matching leading empty String prior to first delimiter?

I need to be able to split an input String by commas, semi-colons or white-space (or a mix of the three). I would also like to treat multiple consecutive delimiters in the input as a single delimiter. Here's what I have so far:
String regex = "[,;\\s]+";
return input.split(regex);
This works, except for when the input string starts with one of the delimiter characters, in which case the first element of the result array is an empty String. I do not want my result to have empty Strings, so that something like, ",,,,ZERO; , ;;ONE ,TWO;," returns just a three element array containing the capitalized Strings.
Is there a better way to do this than stripping out any leading characters that match my reg-ex prior to invoking String.split?
Thanks in advance!

No, there isn't. You can only ignore trailing delimiters by providing 0 as a second parameter to String's split() method:
return input.split(regex, 0);
but for leading delimiters, you'll have to strip them first:
return input.replaceFirst("^"+regex, "").split(regex, 0);

If by "better" you mean higher performance then you might want to try creating a regular expression that matches what you want to match and using Matcher.find in a loop and pulling out the matches as you find them. This saves modifying the string first. But measure it for yourself to see which is faster for your data.
If by "better" you mean simpler, then no I don't think there is a simpler way than the way you suggested: removing the leading separators before applying the split.
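The Matcher.find approach could look something like this sketch (the class name TokenFinder is made up for illustration). Instead of describing the delimiters, the pattern describes a token as a run of characters that are not commas, semicolons or whitespace, so leading and trailing delimiters never produce empty strings. Whether it is actually faster than replaceFirst plus split is something to measure on your own data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenFinder {
    // A token is a maximal run of characters that are not delimiters.
    private static final Pattern TOKEN = Pattern.compile("[^,;\\s]+");

    static String[] tokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens(",,,,ZERO; , ;;ONE ,TWO;,")));
        // prints [ZERO, ONE, TWO]
    }
}
```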

Pretty much all splitting facilities built into the JDK are broken one way or another. You'd be better off using a third-party class such as Guava's Splitter, which is both flexible and correct in how it handles empty tokens and whitespace:
Splitter.on(CharMatcher.anyOf(";,").or(CharMatcher.whitespace()))
    .omitEmptyStrings()
    .split(",,,ZERO;,ONE TWO");
will yield an Iterable<String> containing "ZERO", "ONE", "TWO". (Older Guava releases spell the matcher CharMatcher.WHITESPACE.)

You could also potentially use StringTokenizer to build the list, depending what you need to do with it:
StringTokenizer st = new StringTokenizer(",,,ZERO;,ONE TWO", ",; ", false);
while(st.hasMoreTokens()) {
String str = st.nextToken();
//add to list, process, etc...
}
As a caveat, however, you'll need to define each potential whitespace character separately in the second argument to the constructor.
