How to use delimiter to isolate words (Java)

I am writing a program that scans text files and then writes each word into a HashMap.
The Scanner class uses whitespace as its default delimiter, so my words end up stored with punctuation attached to them. I want the scanner to treat periods, commas and other common punctuation marks as the end of a token. Here's what I have attempted:
Scanner line_scanner = new Scanner(line).useDelimiter("[.,:;()?!\" \t]+~\\s");
The scanner basically ignored all the spaces even though I have '\\s' as part of the expression. Sorry, but I have hardly any understanding of regex.

Scanner line_scanner = new Scanner(line).useDelimiter("[.,:;()?!\"\\s]+");

You might instead treat everything that is not a Unicode letter as a delimiter:
useDelimiter("[^\\p{L}\\p{M}]+");
([^...] negates the class, \p{...} names a Unicode category, L is the letters, and M is the combining diacritical marks (accents).)
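To compare the two delimiters suggested above, here is a minimal sketch. The class name DelimiterDemo, the helper tokens, and the sample sentence are made up for illustration and do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    // Collects every token the Scanner yields for the given delimiter regex.
    static List<String> tokens(String line, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(line).useDelimiter(delimiterRegex)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00EF is the precomposed letter i with diaeresis (as in "naive")
        String line = "Hello, world! It's na\u00EFve.";
        // Explicit list of punctuation characters plus whitespace
        System.out.println(tokens(line, "[.,:;()?!\"\\s]+"));
        // Anything that is neither a Unicode letter nor a combining mark
        System.out.println(tokens(line, "[^\\p{L}\\p{M}]+"));
    }
}
```

The explicit punctuation list keeps "It's" as one token because the apostrophe is not in the list, while the letters-only delimiter splits it into "It" and "s". Which behavior is preferable depends on what should count as a word.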

Related

How do I use Scanner Delimiter to eliminate groups of chars without removing the char completely?

I'm trying to strip special symbols and fractions like "1/4" from my scans without eliminating whole ints such as "1". With the code below, the delimiter removes not only the fractions that have 1 as a numerator but also the 1 at the beginning. Is there any way around this?
Scanner sc2 = new Scanner("1.How many cups do I need?(You need 1/4)");
sc2.useDelimiter("[?.! ()\\s*1/\\s*]+");

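Just remove the 1 from the delimiter pattern. Everything inside [...] is a set of single characters, so listing 1 there makes every '1' a delimiter. Change
sc2.useDelimiter("[?.! ()\\s*1/\\s*]+");
to
sc2.useDelimiter("[?.! ()\\s*/\\s*]+"); // 1 removed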
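With the 1 taken out of the character class, the slash is still a delimiter, so "1/4" comes back as the two tokens 1 and 4. If the intent is to drop the whole fraction while keeping standalone integers, one possible approach is to make a complete fraction an alternative inside the delimiter. This is a sketch under that assumption about the asker's intent; the class name and the DELIMITER constant are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class FractionDelimiter {
    // A delimiter is one or more of: a whole fraction such as 1/4,
    // or a single punctuation or whitespace character.
    static final String DELIMITER = "(?:\\d+/\\d+|[?.!()\\s])+";

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text).useDelimiter(DELIMITER)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("1.How many cups do I need?(You need 1/4)"));
        // prints [1, How, many, cups, do, I, need, You, need]
    }
}
```

The leading "1" survives because it is followed by "." rather than "/", so it cannot match the fraction alternative.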

How to extract letter words only from an arbitrary input file

I'm writing a spell checker, and I have to extract only words (made up of letters). I'm having trouble using multiple delimiters. The Java documentation describes the use of several delimiters, but I have trouble covering every printing character that is not a letter.
in_file.useDelimiter("., !?/##$%^&*(){}[]<>\\\"'");
In this case I get at run time:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed character class near index 35
I tried using a pattern such as
("\s+,|\s+\?|""|\s:|\s;|\{}|\s[|[]|\s!");
At run time this gives:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal repetition
I'm aware of the tokenizer classes, but we are restricted to using Scanner.

The pattern in Scanner is supposed to be a regular expression that describes all the characters you don't want included in a token, repeated one or more times (this last part is because the word may be delimited by more than one space/punctuation etc.)
This means you need a pattern that describes something which is not a letter. Regular expressions give you the ability to negate a class of characters. So if a letter is [a-zA-Z], a "non-letter" is [^a-zA-Z]. So you can use [^a-zA-Z]+ to describe "1 or more non-letters".
There are other ways to express the same thing. \p{Alpha} is the same as [a-zA-Z]. And you negate it by capitalizing the P: \P{Alpha}+.
If your file contains words that are not in English, then you may want to use a Unicode category: \P{L}+ (meaning: 1 or more characters which are not Unicode letters).
Demonstration:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{Alpha}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Output:
Hello
ho
na
ve
This is because we asked for just US-ASCII alphabet (\p{Alpha}). So it broke the word naïve because ï is not a letter in the US-ASCII range. It also ignored all those words in other languages. But if we use:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{L}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Then we have used a Unicode category, and the output will be:
Hello
שלום
ho
こんにちは
naïve
漢字
مرحبا
Which gives you all the words in all the languages. So it's your choice.
Summary
To create a Scanner delimiter that allows you to get all the strings that are made of a particular category of characters (in this case, letters):
1. Create a regular expression for the category of characters you want.
2. Negate it.
3. Add + to signify 1 or more of the negated category.
This is just a common recipe, and complicated cases may require a different method.
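The same recipe applies to categories other than letters. As an illustration only (the class and method names are made up), here it is applied to ASCII digits: the category is [0-9], its negation is [^0-9], and the delimiter is [^0-9]+.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DigitRuns {
    // Returns every maximal run of ASCII digits in the text.
    static List<String> digitRuns(String text) {
        List<String> out = new ArrayList<>();
        // Category: [0-9]. Negated: [^0-9]. One or more: [^0-9]+
        try (Scanner sc = new Scanner(text).useDelimiter("[^0-9]+")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(digitRuns("Hello, 123 and 134098ho"));
        // prints [123, 134098]
    }
}
```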

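There is a metacharacter for word characters: \w. It matches a single character that counts as part of a word (by default a letter, digit or underscore), so its negation \W+ can serve as a delimiter.
If you are just interested in word boundaries you can use \b. Bear in mind that \b is zero-width, so a Scanner delimited by it will generally hand back the runs of punctuation between the words as tokens too.
See http://www.vogella.com/tutorials/JavaRegularExpressions/article.html (Chapter 3.2)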
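A small sketch of a Scanner that treats every run of non-word characters (\W+) as the delimiter. The class name and sample text are illustrative. Note that digits and underscores survive, which may or may not be wanted in a spell checker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordCharDelimiter {
    // Splits on runs of characters that are not in \w.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text).useDelimiter("\\W+")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world_1 foo-bar 42!"));
        // prints [Hello, world_1, foo, bar, 42]
    }
}
```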

Java StreamTokenizer splits Email address at # sign

I am trying to parse a document containing email addresses, but the StreamTokenizer splits the E-mail address into two separate parts.
I already set the # sign as an ordinaryChar and space as the only whitespace:
StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('#');
tokenizer.whitespaceChars(' ', ' ');
Still, all E-mail addresses are split up.
A line to parse looks like the following:
"Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck".
The Tokenizer splits del6#uni.at to "del6" and "uni.at".
Is there a way to tell the tokenizer to not split at # signs?

So here is why it worked like it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.
So in fact it does rather sophisticated tokenizing - recognizing comments, quoted strings, numbers. Note that in a programming language, you can have a string like a = a+b;. A simple tokenizer that merely breaks the text by whitespace would break this into a, = and a+b;. But StreamTokenizer would break this into a, =, a, +, b, and ;, and will also give you the "type" for each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
It wasn't recognizing the # as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.
A StreamTokenizer would recognize your line as:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character #
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
(This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
In fact, by telling it that # was an "ordinary character", you were telling it to take the # as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer.
It removes any special significance the character has as a comment
character, word component, string delimiter, white space, or number
character. When such a character is encountered by the parser, the
parser treats it as a single-character token and sets ttype field to
the character value.
(My emphasis).
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('#','#') it would have kept the whole e-mail together. My little demo with that added gives:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6#uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise your options depend on whether your data is line-based (each line is a separate record, there may be a different number of tokens on each line), where you would typically read lines one-by-one from a reader, then split them using String.split(), or if it is just a whitespace-delimited chain of tokens, where Scanner might suit you better.
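The demo mentioned above is not shown in the answer. The following is a rough reconstruction of what such a demo could look like, not the original code; the class name, the tokenize helper and its boolean flag are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class EmailTokens {
    // Tokenizes one line and labels each token by its type.
    // If hashIsWordChar is true, '#' is registered as a word character.
    static List<String> tokenize(String line, boolean hashIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        if (hashIsWordChar) {
            st.wordChars('#', '#');
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);
                        break;
                    default:
                        // Ordinary characters are reported through ttype only.
                        out.add("char " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6#uni.at Competition speech";
        System.out.println(tokenize(line, false)); // '#' is its own token
        System.out.println(tokenize(line, true));  // address stays in one word
    }
}
```

Checking ttype on every token, rather than reading only sval, is what makes the separate '#' token visible in the first run.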

In order to simply split a String, see the answer to this question (adapted for whitespace):
The best way is to not use a StringTokenizer at all, but use String's
split method. It returns an array of Strings, and you can get the
length from that.
For each line in your file you can do the following:
String[] tokens = line.split(" +");
tokens will now have 6 - 8 Strings. Use tokens.length to find out
how many, then create your object from the array.
This is sufficient for the given line, and might be sufficient for everything. Here is some code that uses it (it reads System.in):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class T {
    public static void main(String[] args) {
        BufferedReader st = new BufferedReader(new InputStreamReader(System.in));
        String line;
        try {
            // readLine() returns null once the end of the input is reached
            while ((line = st.readLine()) != null) {
                // split on runs of one or more spaces
                String[] tokens = line.split(" +");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // handle error here
        }
    }
}

Regex matching bracket in java scanner

I'm wondering whether it is possible to match brackets with a regular expression.
Say I need to find "[Detail]" in a text file. Since "[]" is reserved, the pattern instead matches any one of the characters "adeilt", which is not what I wanted in the first place.
Scanner s = new Scanner(data);
s.findInLine("[Detail]");
Many thanks.

Escape the reserved characters:
\\[Detail\\]
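A minimal sketch showing the difference, with a made-up sample string and helper name. It also uses Pattern.quote, which is an alternative to escaping by hand and is not mentioned in the answer above.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class BracketMatch {
    // Returns the first match of regex in the first line of data, or null.
    static String find(String data, String regex) {
        try (Scanner s = new Scanner(data)) {
            return s.findInLine(regex);
        }
    }

    public static void main(String[] args) {
        String data = "see [Detail] here";
        // Unescaped: a character class, matches the first of D, e, t, a, i, l
        System.out.println(find(data, "[Detail]"));                // prints e
        // Escaped brackets match the literal text
        System.out.println(find(data, "\\[Detail\\]"));            // prints [Detail]
        // Pattern.quote escapes the whole string for you
        System.out.println(find(data, Pattern.quote("[Detail]"))); // prints [Detail]
    }
}
```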

how to get words from a sequence of characters in java?

I have a method getNextChar() which reads a string character by character. And I am writing a method to get the words in a character sequence provided by getNextChar().
The text contains punctuation marks and other special characters.
I am thinking of keeping an array that contains all the punctuation marks and special characters and, as I read the characters of the text, checking whether each character is in the array so that I can ignore it.
The method will recognize the end of a word when it gets a space. The words will be stored in a Collection (e.g. a map), as I also need to count frequencies by checking whether the word has already been inserted into the map and incrementing its counter.
Is this the best and most efficient way of doing it? I am looking for the most efficient way.
Is there any complete list of punctuation marks and special characters?

I think there is an easier way to do this.
No matter what your input source is, I would read it using the Scanner class. You can instantiate this class with your input string and call Scanner.next() to get the next whitespace-delimited token in the string. Then you can use String.replaceAll() to remove punctuation, insert the words into an ArrayList, and count frequencies etc.
Scanner reader = new Scanner(string);
String word = reader.next();
word = word.replaceAll("\\p{Punct}", ""); // strip ASCII punctuation
list.add(word);

You could use string.split() to break apart the string into an array of strings separated by whitespace (for your words.) You could also check each character with Character.isLetterOrDigit() to avoid punctuation. (Not necessarily in that order.)
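A sketch of the character-by-character approach using Character.isLetterOrDigit(). The asker's getNextChar() is not shown, so this version simply walks a CharSequence instead; the class name, the count method, and the choice to lower-case words are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class WordFrequencies {
    // Counts words, where a word is a maximal run of letters or digits.
    static Map<String, Integer> count(CharSequence text) {
        Map<String, Integer> counts = new HashMap<>();
        StringBuilder word = new StringBuilder();
        // One extra iteration with a space flushes the final word.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                // Assumption: "The" and "the" should count as the same word.
                word.append(Character.toLowerCase(c));
            } else if (word.length() > 0) {
                // Any other character ends the current word.
                counts.merge(word.toString(), 1, Integer::sum);
                word.setLength(0);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("The cat. The dog!"));
    }
}
```

Because every non-alphanumeric character ends a word, no explicit list of punctuation marks is needed. The trade-off is that words like "don't" are split at the apostrophe.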

The punctuation lookup will perform better if you use a Set of characters.
Set<Character> punctuationChars ....
if (punctuationChars.contains(yourChar)) { ... }

Just use a Scanner to read in the Strings:
Scanner in = new Scanner(...);
while (in.hasNext()) {
String word = in.next();
/* do something with the word, check punctuation, etc. */
}
