Java StreamTokenizer splits Email address at # sign - java

I am trying to parse a document containing email addresses, but the StreamTokenizer splits the E-mail address into two separate parts.
I already set the # sign as an ordinaryChar and space as the only whitespace:
StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('#');
tokenizer.whitespaceChars(' ', ' ');
Still, all E-mail addresses are split up.
A line to parse looks like the following:
"Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck".
The Tokenizer splits del6#uni.at to "del6" and "uni.at".
Is there a way to tell the tokenizer to not split at # signs?

So here is why it worked like it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.
So in fact it does rather sophisticated tokenizing: recognizing comments, quoted strings, numbers. Note that in a programming language, you can have a string like a = a+b;. A simple tokenizer that merely breaks the text by whitespace would break this into a, = and a+b;. But StreamTokenizer would break this into a, =, a, +, b, and ;, and will also give you the "type" for each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
It wasn't recognizing the # as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.
A StreamTokenizer would recognize your line as:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character #
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
(This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
In fact, by telling it that # was an "ordinary character", you were telling it to take the # as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer.
It removes any special significance the character has as a comment
character, word component, string delimiter, white space, or number
character. When such a character is encountered by the parser, the
parser treats it as a single-character token and sets ttype field to
the character value.
(My emphasis).
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('#','#') it would have kept the whole e-mail together. My little demo with that added gives:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6#uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
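A demo along these lines could look like the following. This is only a sketch, not the original demo: the class name TokenizerDemo and the helper describeTokens are made up for illustration, and the label strings simply mirror the listings above.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenizerDemo {

    // Tokenizes one line and labels each token by its type.
    // hashIsWordChar switches between the default syntax table
    // and a table where '#' has been registered via wordChars('#', '#').
    static List<String> describeTokens(String line, boolean hashIsWordChar) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(line));
        if (hashIsWordChar) {
            tokenizer.wordChars('#', '#');
        }
        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    result.add("The word " + tokenizer.sval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    result.add("The number " + tokenizer.nval);
                } else {
                    // Everything else (for this input: ordinary characters)
                    // is reported through ttype rather than sval.
                    result.add("The character " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck";
        describeTokens(line, false).forEach(System.out::println);
        System.out.println("--- with wordChars('#', '#') ---");
        describeTokens(line, true).forEach(System.out::println);
    }
}
```

The only difference between the two runs is the single wordChars call, which is what keeps the address in one token.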
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise your options depend on whether your data is line-based (each line is a separate record, there may be a different number of tokens on each line), where you would typically read lines one-by-one from a reader, then split them using String.split(), or if it is just a whitespace-delimited chain of tokens, where Scanner might suit you better.
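For the "whitespace-delimited chain of tokens" case, a minimal Scanner sketch might look like this. The class and method names (ScannerTokens, tokens) are invented for the example; it relies only on Scanner's default delimiter, which is whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokens {

    // Collects whitespace-delimited tokens; no syntax table is involved,
    // so characters such as '#' stay inside their token.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        tokens("Student 6 Name6 LastName6 del6#uni.at Competition speech")
                .forEach(System.out::println);
    }
}
```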

In order to simply split a String, see the answer to this question (adapted for whitespace):
The best way is to not use a StringTokenizer at all, but use String's
split method. It returns an array of Strings, and you can get the
length from that.
For each line in your file you can do the following:
String[] tokens = line.split(" +");
tokens will now have 6 - 8 Strings. Use tokens.length (a field, not a method) to find out
how many, then create your object from the array.
This is sufficient for the given line, and might be sufficient for everything. Here is some code that uses it (it reads System.in):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class T {
    public static void main(String[] args) {
        BufferedReader st = new BufferedReader(new InputStreamReader(System.in));
        String line;
        try {
            // readLine() returns null at end of input; ready() only reports
            // whether a read would block, so it is not a reliable end test
            while ((line = st.readLine()) != null) {
                String[] tokens = line.split(" +");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // handle error here
        }
    }
}

Related

Java Split regex

Given a string S, find the number of words in that string. For this problem a word is defined by a string of one or more English letters.
Note: Space or any of the special characters like ![,?._'#+] will act as a delimiter.
Input Format: The string will only contain lower case English letters, upper case English letters, spaces, and these special characters: ![,?._'#+].
Output Format: On the first line, print the number of words in the string. The words don't need to be unique. Then, print each word in a separate line.
My code:
Scanner sc = new Scanner(System.in);
String str = sc.nextLine();
String regex = "( |!|[|,|?|.|_|'|#|+|]|\\\\)+";
String[] arr = str.split(regex);
System.out.println(arr.length);
for(int i = 0; i < arr.length; i++)
System.out.println(arr[i]);
When I submit the code, it works for just over half of the test cases. I do not know what the test cases are. I'm asking for help with the Murphy's law. What are the situations where the regex I implemented won't work?
You don't escape some special characters in your regex. Let's start with []. Since you don't escape them, the part [|,|?|.|_|'|#|+|] is treated like a set of characters |,?._'#+. This means that your regex doesn't split on [ and ].
For example x..]y+[z is split to x, ]y and [z.
You can fix that by escaping those characters. That will force you to escape more of them and you end up with a proper definition:
String regex = "( |!|\\[|,|\\?|\\.|_|'|#|\\+|\\])+";
Note that instead of defining alternatives, you could use a set which will make your regex easier to read:
String regex = "[ !\\[,?._'#+\\]]+";
In this case you only need to escape [ and ].
UPDATE:
There's also a problem with a leading special character (like in your example ".Hi?there[broski.]#####"). You need to split on it, but it produces an empty string at the start of the results. I don't think there's a way to use the split function without producing it, but you can mitigate it by removing a leading delimiter run before splitting, using the same regex anchored at the start of the string (without the anchor, replaceFirst would remove the first delimiter wherever it occurs and glue two words together):
String[] arr = str.replaceFirst("^" + regex, "").split(regex);
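Putting the pieces together, a small sketch could look like the following. The class name WordSplitter and the method words are hypothetical, the delimiter is written as a character class covering the space and the listed special characters, and the handling of an input with no words at all is an extra assumption that is not part of the answer above.

```java
import java.util.Arrays;

class WordSplitter {

    // Character class: space plus ! [ , ? . _ ' # + ]
    // Inside a class only [ and ] need escaping here.
    static final String DELIMITERS = "[ !\\[,?._'#+\\]]+";

    static String[] words(String s) {
        // Strip a leading delimiter run so split() does not emit a leading "".
        String trimmed = s.replaceFirst("^" + DELIMITERS, "");
        // Assumption: an input without any word should yield zero words;
        // "".split(...) would otherwise return a single empty string.
        return trimmed.isEmpty() ? new String[0] : trimmed.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String[] arr = words(".Hi?there[broski.]#####");
        System.out.println(arr.length);
        for (String word : arr) {
            System.out.println(word);
        }
        System.out.println(Arrays.toString(words("x..]y+[z")));
    }
}
```

Trailing delimiters need no special treatment because split drops trailing empty strings by default.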

How to extract letter words only from an arbitrary input file

I'm writing a spell checker, and I have to extract only words (made up of letters). I'm having trouble using multiple delimiters. The Java documentation describes the use of several delimiters, but I have trouble including every printing character that is not a letter.
in_file.useDelimiter("., !?/##$%^&*(){}[]<>\\\"'");
in this case - run time
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed character class near index 35
I tried using pattern such as
("\s+,|\s+\?|""|\s:|\s;|\{}|\s[|[]|\s!");
run time -
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal repetition
I'm aware of tokenizer but we are restricted to use scanner.
The pattern in Scanner is supposed to be a regular expression that describes all the characters you don't want included in a token, repeated one or more times (this last part is because the word may be delimited by more than one space/punctuation etc.)
This means you need a pattern that describes something which is not a letter. Regular expressions give you the ability to negate a class of characters. So if a letter is [a-zA-Z], a "non-letter" is [^a-zA-Z]. So you can use [^a-zA-Z]+ to describe "1 or more non-letters".
There are other ways to express the same thing. \p{Alpha} is the same as [a-zA-Z]. And you negate it by capitalizing the P: \P{Alpha}+.
If your file contains words that are not in English, then you may want to use a Unicode category: \P{L}+ (meaning: 1 or more characters which are not Unicode letters).
Demonstration:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{Alpha}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Output:
Hello
ho
na
ve
This is because we asked for just US-ASCII alphabet (\p{Alpha}). So it broke the word naïve because ï is not a letter in the US-ASCII range. It also ignored all those words in other languages. But if we use:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{L}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Then we have used a Unicode category, and the output will be:
Hello
שלום
ho
こんにちは
naïve
漢字
مرحبا
Which gives you all the words in all the languages. So it's your choice.
Summary
To create a Scanner delimiter that allows you to get all the strings that are made of a particular category of characters (in this case, letters):
Create a regular expression for the category of characters you want
Negate it
Add + to signify 1 or more of the negated category.
This is just a common recipe, and complicated cases may require a different method.
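As an illustration of the recipe applied to other categories, here is a small sketch. The helper names delimiterFor and extract, and the idea of passing the body of a bracket class as a parameter, are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterRecipe {

    // Steps 2 and 3 of the recipe: negate the class, then add + for "one or more".
    // classBody is the inside of a bracket class, e.g. "0-9" or "a-zA-Z".
    static String delimiterFor(String classBody) {
        return "[^" + classBody + "]+";
    }

    // Returns every maximal run of characters that belong to the class.
    static List<String> extract(String text, String classBody) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(delimiterFor(classBody));
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extract("abc 12, x34y 5", "0-9"));    // digit runs
        System.out.println(extract("Hello, 123 ho", "a-zA-Z"));  // ASCII letter runs
    }
}
```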
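There is also a metacharacter for word characters: \w. It matches a single letter, digit or underscore, so its negation \W+ can serve as a delimiter if digits and underscores are acceptable inside words.
If you are only interested in word boundaries you can use \b. Keep in mind that \b is zero-width, so as a delimiter it does not consume the punctuation between words, which still has to be filtered out.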
See http://www.vogella.com/tutorials/JavaRegularExpressions/article.html (Chapter 3.2)

Java String delete tokens contains numbers

I have a string like this and I would like to eliminate all the tokens that contain a number:
String[] s="In the 1980s".split(" ");
Is there a way to remove the tokens that contain numbers - in this case 1980s, but also, for example 784th or s787?
Use a \w*\d\w* regex matcher for that. It will match all words with at least one digit in them. Although I generally despise regexes, they are particularly well suited for your problem.
String[] s = input.replaceAll("\\w*\\d\\w* *", "").split(" +");
See Java lib docs for Pattern/Matcher (RegEx) for more reference how to work with regexes in general.
Test code:
http://ideone.com/LrHDsT
Remove the unwanted words first, then split:
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
Some test code:
String str = "666 In the 1980s 784th s787 foo BAR";
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
System.out.println(Arrays.toString(s));
Output:
[In, the, foo, BAR]
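You could use a regex as suggested by @vaxquis, or alternatively split the string on the delimiter first,
then check each token for a number and remove the ones that have one. Note that NumberUtils.isNumber (Apache Commons Lang) only accepts tokens that are entirely a number, such as 666, so mixed tokens like 1980s or s787 need a per-character digit check instead.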
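A split-then-filter version that uses only the JDK might look like this. It is a sketch under the assumption that "contains a number" means "contains at least one digit"; the names DigitTokenFilter, containsDigit and withoutDigitTokens are invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DigitTokenFilter {

    // True if any character of the token is a digit.
    static boolean containsDigit(String token) {
        return token.chars().anyMatch(Character::isDigit);
    }

    // Splits on whitespace first, then drops every token that contains a digit.
    static List<String> withoutDigitTokens(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())       // "" only appears for blank input
                .filter(token -> !containsDigit(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(withoutDigitTokens("666 In the 1980s 784th s787 foo BAR"));
    }
}
```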
split alone doesn't seem to be what you are looking for. Even if you remove the words that contain a digit, as in
"1foo f2oo bar whatever baz2"
you will end up with
"  bar whatever "
and if you now split on runs of spaces you will end up with ["", "bar", "whatever"].
To solve this problem you may also want to remove the spaces after each word you removed, so that
"1foo f2oo bar whatever baz2"
would become
"bar whatever "
so it can be split correctly (the space at the end is not a problem, since split by default removes trailing empty strings from the result array).
But instead of doing two passes (removing words, then splitting the string) you can achieve the same thing in a single pass. All you need to do is take the opposite approach: instead of focusing on removing the wrong elements, let's try to find the correct ones.
Correct tokens seem to be words that consist of non-space characters and contain no digits. You can represent such words with the regex \b[\S&&\D]+\b where:
\b represents word boundaries,
\S any non whitespace character
\D any non digit character
[\S&&\D] intersection of non-whitespaces and non-digits, in other words non-whitespace characters that are also non-digits
Demo:
String input = "1foo f2oo bar whatever baz2";
Pattern p = Pattern.compile("\\b[\\S&&\\D]+\\b");
Matcher m = p.matcher(input);
while(m.find())
System.out.println(m.group());
Output:
bar
whatever
BTW, to avoid an empty element at the start of the results you can use Scanner, which doesn't return an empty token when the input begins with a delimiter. So we can simply set the delimiter to a series of whitespace characters or words that contain a digit. Your code can then look like
Scanner sc = new Scanner(input);
sc.useDelimiter("(\\s|\\w*\\d\\w*)+");
while (sc.hasNext())
System.out.println(sc.next());
sc.close();

how to get words from a sequence of characters in java?

I have a method getNextChar() which reads a string character by character. And I am writing a method to get the words in a character sequence provided by getNextChar().
The text contains punctuation marks and other special characters.
I am thinking to have an array which contains all the punctuation marks and special characters, and when I read the characters of the text, check if the character is in the array to ignore it.
The method will recognize a word when it gets a space. The words will be stored in a collection (e.g. a map), as I also need to count the frequencies by checking whether the word has been inserted before and increasing its counter.
Is this the best and most efficient way of doing it? I am looking for the most efficient way.
Is there any complete list of punctuation marks and special characters?
I think there is an easier way to do this.
No matter what your input source is, I would read it using the Scanner class. You can instantiate this class with your input string and call Scanner.next() to get the next token in the string. By default this splits on whitespace. Then you can use String.replaceAll() with a punctuation pattern to strip the punctuation, insert the words into an ArrayList, and count frequencies etc.
Scanner reader = new Scanner(string);
String word = reader.next();
word = word.replaceAll("\\p{Punct}", ""); // one possible pattern: strips ASCII punctuation
list.add(word);
You could use string.split() to break apart the string into an array of strings separated by whitespace (for your words.) You could also check each character with Character.isLetterOrDigit() to avoid punctuation. (Not necessarily in that order.)
The lookup for punctuation will perform better if you use a Set of characters.
Set<Character> punctuationChars ....
if (punctuationChars.contains(yourChar)) { ... }
Just use a Scanner to read in the Strings:
Scanner in = new Scanner(...);
while (in.hasNext()) {
String word = in.next();
/* do something with the word, check punctuation, etc. */
}
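For the frequency-counting part of the question, a Scanner-based sketch could look like the following. Several choices here are assumptions rather than requirements from the question: a word is taken to be a run of Unicode letters (delimiter \P{L}+), words are lower-cased before counting, and a TreeMap is used only to get sorted output. The class name WordFrequency is made up.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

class WordFrequency {

    // Counts how often each word occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> frequencies = new TreeMap<>();
        try (Scanner in = new Scanner(text)) {
            // Everything that is not a letter separates words,
            // so no explicit list of punctuation marks is needed.
            in.useDelimiter("\\P{L}+");
            while (in.hasNext()) {
                String word = in.next().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    // merge() inserts 1 for a new word or adds 1 to the old count
                    frequencies.merge(word, 1, Integer::sum);
                }
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(count("The cat, the dog. THE end!"));
    }
}
```

Using the delimiter instead of a hand-maintained punctuation array sidesteps the question of whether a complete list of special characters exists.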

How to tokenize in java without using the java.util tokenizer?

Consider the following as tokens:
+, -, ), (
alpha characters and underscore
integer
Implement:
1. getToken() - returns a string corresponding to the next token
2. getTokPos() - returns the position of the current token in the input string
Example input: (a+b)-21)
Output: (| a| +| b| )| -| 21| )|
Note: Cannot use the java string tokenizer class
Work in progress - Successfully tokenized +,-,),(. Need to figure out characters and numbers:
OUTPUT: +|-|+|-|(|(|)|)|)|(| |
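java.util.StringTokenizer is a legacy class; its documentation discourages using it in new code.
Tokenizing Strings in Java has been much easier with String.split() since Java 1.4 (note that split discards the delimiters themselves):
String[] tokens = "(a+b)-21)".split("[-+()]");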
If this is homework, you probably have to reimplement a "split" method yourself:
read the String character by character
if the character is not a special char, add it to a buffer
when you encounter a special char, add the buffer content to a list and clear the buffer
Since it is (probably) homework, I'll let you implement it.
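For readers who want something to compare their own attempt against, the three steps listed above might be sketched as follows. The class name ManualSplit and the specialChars parameter are hypothetical, and skipping empty buffers is an assumption (String.split would keep empty strings between adjacent delimiters).

```java
import java.util.ArrayList;
import java.util.List;

class ManualSplit {

    // Splits input at any character contained in specialChars.
    static List<String> split(String input, String specialChars) {
        List<String> parts = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (specialChars.indexOf(c) < 0) {
                // not a special char: collect it
                buffer.append(c);
            } else if (buffer.length() > 0) {
                // special char: flush the buffer to the list and clear it
                parts.add(buffer.toString());
                buffer.setLength(0);
            }
        }
        if (buffer.length() > 0) {
            parts.add(buffer.toString()); // text after the last special char
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("(a+b)-21)", "+-()"));
    }
}
```

Like String.split(), this drops the special characters themselves, so it does not yet produce the operator tokens the exercise asks for.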
Java lets you examine the characters in a String one by one with the charAt method. So use that in a for loop and examine each character. When you encounter a TOKEN you wrap that token with the pipes and any other character you just append to the output.
public static final char PLUS_TOKEN = '+';
public static final char MINUS_TOKEN = '-';
public static final char OPEN_PAREN_TOKEN = '(';
public static final char CLOSE_PAREN_TOKEN = ')';

public String doStuff(String input)
{
    StringBuilder output = new StringBuilder();
    for (int index = 0; index < input.length(); index++)
    {
        char c = input.charAt(index);
        // compare the current character with all tokens
        if (c == PLUS_TOKEN || c == MINUS_TOKEN
                || c == OPEN_PAREN_TOKEN || c == CLOSE_PAREN_TOKEN)
        {
            // when you see a token you need to append the pipes (|) around it
            output.append('|');
            output.append(c);
            output.append('|');
        }
        else
        {
            // just add to new output
            output.append(c);
        }
    }
    return output.toString();
}
If it's not a homework assignment, use String.split(). If it is a homework assignment, say so and tag it so that we can give the appropriate level of help (I did so for you, just in case...).
Because the string needs to be cut in several different ways, not just on whitespace or parens, using the String.split method with any of the symbols there will not work. Split removes the character used as a separator. You could try to split on the empty string, but this wouldn't get compound symbols, like 21. To correctly parse this string, you will need to effectively implement your own tokenizer. Try thinking about how you could tell you had a complete token if you looked at the string one character at a time. You could probably start a string that collects the characters until you have identified a complete token, and then you can remove the characters from the original and return the string. Starting from this point, you can probably make a basic tokenizer.
If you'd rather learn how to make a full strength tokenizer, most of them are defined by creating a regular expression that only matches the tokens.
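A regular-expression-driven tokenizer for the token set in the question might be sketched like this. Some details are assumptions that the exercise does not specify: letters and underscores are grouped into runs, getToken() returns null once the input is exhausted, positions are zero-based, and characters that match no token (such as spaces) are silently skipped by find(). The class name SimpleTokenizer is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizer {

    // One alternative per token kind: a single + - ( or ),
    // a run of letters/underscores, or a run of digits.
    private static final Pattern TOKEN = Pattern.compile("[-+()]|[A-Za-z_]+|[0-9]+");

    private final Matcher matcher;
    private int tokPos = -1; // start index of the most recent token, -1 if none

    SimpleTokenizer(String input) {
        this.matcher = TOKEN.matcher(input);
    }

    // Returns the next token, or null when no further token is found.
    String getToken() {
        if (!matcher.find()) {
            tokPos = -1;
            return null;
        }
        tokPos = matcher.start();
        return matcher.group();
    }

    // Returns the position of the token last returned by getToken().
    int getTokPos() {
        return tokPos;
    }

    public static void main(String[] args) {
        SimpleTokenizer tokenizer = new SimpleTokenizer("(a+b)-21)");
        StringBuilder out = new StringBuilder();
        for (String tok = tokenizer.getToken(); tok != null; tok = tokenizer.getToken()) {
            out.append(tok).append("| ");
        }
        System.out.println(out.toString().trim()); // (| a| +| b| )| -| 21| )|
    }
}
```

Unlike split(), matching the tokens themselves keeps the operators and handles multi-character tokens such as 21 in the same pass.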
