How to extract letter words only from an arbitrary input file

I'm writing a spell checker, and I have to extract only words (made up of letters). I'm having trouble using multiple delimiters. The Java documentation describes the use of several delimiters, but I'm having trouble including every printable character that is not a letter.
in_file.useDelimiter("., !?/##$%^&*(){}[]<>\\\"'");
In this case, at run time I get:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Unclosed character class near index 35
I also tried using a pattern such as
("\s+,|\s+\?|""|\s:|\s;|\{}|\s[|[]|\s!");
which fails at run time with:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal repetition
I'm aware of the tokenizer classes, but we are restricted to using Scanner.

The pattern in Scanner is supposed to be a regular expression that describes all the characters you don't want included in a token, repeated one or more times (this last part is because the word may be delimited by more than one space/punctuation etc.)
This means you need a pattern that describes something which is not a letter. Regular expressions give you the ability to negate a class of characters. So if a letter is [a-zA-Z], a "non-letter" is [^a-zA-Z]. So you can use [^a-zA-Z]+ to describe "1 or more non-letters".
There are other ways to express the same thing. \p{Alpha} is the same as [a-zA-Z]. And you negate it by capitalizing the P: \P{Alpha}+.
If your file contains words that are not in English, then you may want to use a Unicode category: \P{L}+ (meaning: 1 or more characters which are not Unicode letters).
Demonstration:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{Alpha}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Output:
Hello
ho
na
ve
This is because we asked for just the US-ASCII alphabet (\p{Alpha}). So it broke the word naïve, because ï is not a letter in the US-ASCII range. It also ignored all the words in other languages. But if we use:
Scanner sc = new Scanner( "Hello, 123 שלום 134098ho こんにちは 'naïve,. 漢字 +?+?+مرحبا.");
sc.useDelimiter("\\P{L}+");
while ( sc.hasNext()) {
System.out.println(sc.next());
}
Then we have used a Unicode category, and the output will be:
Hello
שלום
ho
こんにちは
naïve
漢字
مرحبا
This gives you all the words in all the languages, so the choice is yours.
Summary
To create a Scanner delimiter that allows you to get all the strings that are made of a particular category of characters (in this case, letters):
Create a regular expression for the category of characters you want
Negate it
Add + to signify 1 or more of the negated category.
This is just a common recipe, and complicated cases may require a different method.
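The recipe above could be wrapped in a small helper, roughly as sketched below. The class name LetterWords and the method extractWords are made up for illustration; only Scanner and the \P{L}+ delimiter come from the answer itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class LetterWords {
    // Delimiter: one or more characters that are NOT Unicode letters
    // (the negated category plus "+", as in the recipe).
    private static final String NON_LETTERS = "\\P{L}+";

    /** Returns the runs of letters found in the text, in order of appearance. */
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(NON_LETTERS);
            while (sc.hasNext()) {
                words.add(sc.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Digits and punctuation act as delimiters, so only letter runs are printed.
        System.out.println(extractWords("Hello, 123 134098ho +?+?+ world."));
    }
}
```

For a file, the same method body should work with a Scanner constructed over the file instead of a String.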

There is a metacharacter for word characters: \w. It matches a single word character (by default a letter, a digit or an underscore), so \w+ matches a whole word.
If you are just interested in word boundaries you can use \b, which should be appropriate as a delimiter.
See http://www.vogella.com/tutorials/JavaRegularExpressions/article.html (Chapter 3.2)

Related

Find all keywords of a text file that have at least one letter by using regular expression

I want to write a regular expression to remove all tokens of a text file that do not have at least one letter. I used the OpenNLP tokenizer to extract the tokens of my text file. For instance, the tokens 90-87, 65#7, ---, 8/0 and ? should be removed from the given text.
I tried to follow these pages 1, 2 and 3, but I could not find the expression that I want. For example, the following code also removes tokens such as anti-age and mid-november.
String[] tokens = t.getTokens(sen);
for (String word : tokens)
if((!isstopWord(word)) && word.matches("[a-zA-Z]+"))
bufferedw.append(word+"\n");
However, I do not know how to avoid removing tokens like anti-age.
Where is the problem?

The [a-zA-Z]+ expression matches a string that only consists of one or more ASCII letters. It does not allow hyphens, apostrophes, etc.
To match a string containing no spaces and at least one letter, you can use
word.matches("\\S*\\pL\\S*")
The \S* pattern matches zero or more non-whitespace characters and \pL matches any Unicode letter.
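As a rough illustration of the check above, here is a self-contained sketch that filters an array of tokens. The class and method names are invented for the example, and the tokens are hard-coded rather than produced by the OpenNLP tokenizer from the question.

```java
import java.util.ArrayList;
import java.util.List;

final class LetterTokenFilter {
    // Non-whitespace chars, then one Unicode letter, then more non-whitespace chars.
    private static final String HAS_LETTER = "\\S*\\pL\\S*";

    /** True if the token has no whitespace and contains at least one letter. */
    static boolean hasLetter(String token) {
        return token.matches(HAS_LETTER);
    }

    /** Keeps only the tokens that contain at least one letter. */
    static List<String> keepTokensWithLetters(String[] tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (hasLetter(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"anti-age", "90-87", "65#7", "---", "8/0", "?", "mid-november"};
        System.out.println(keepTokensWithLetters(tokens));
    }
}
```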

Unique regex for first name and last name

I have a single input where users should enter their name and surname. The problem is that I need to validate it with a regex. Here is the list of requirements:
The name should start with a capital letter (not a space)
There can't be consecutive spaces
It must support names and surnames like the following (everyone should be able to enter their own name). Example:
John Smith
and
Armirat Bair Hossan
And the last character shouldn't be a space.
Please help.
At the moment I have a regex like
^\\p{L}\\[p{L} ,.'-]+$
but it rejects ALL input, which is not good.
Thanks for helping me.
UPDATE:
CORRECT INPUT:
"John Smith"
"Alberto del Muerto"
INCORRECT
" John Smith "
" John Smith"

You can use
^[\p{Lu}\p{M}][\p{L}\p{M},.'-]+(?: [\p{L}\p{M},.'-]+)*$
or
^\p{Lu}\p{M}*+(?:\p{L}\p{M}*+|[,.'-])++(?: (?:\p{L}\p{M}*+|[,.'-])++)*+$
Java declaration:
if (str.matches("[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*")) { ... }
// or if (str.matches("\\p{Lu}\\p{M}*+(?:\\p{L}\\p{M}*+|[,.'-])++(?: (?:\\p{L}\\p{M}*+|[,.'-])++)*+")) { ... }
The first regex breakdown:
^ - start of string (not necessary with matches() method)
[\p{Lu}\p{M}] - 1 Unicode letter (incl. precomposed ones as \p{M} matches diacritics and \p{Lu} matches any uppercase Unicode base letter)
[\p{L}\p{M},.'-]+ - matches 1 or more Unicode letters, a ,, ., ' or - (if 1 letter names are valid, replace + with * at the end here)
(?: [\p{L}\p{M},.'-]+)* - 0 or more sequences of
- a space
[\p{L}\p{M},.'-]+ - 1 or more characters that are either Unicode letters or commas, or periods, or apostrophes or -.
$ - end of string (not necessary with matches() method)
NOTE: Sometimes, names contain curly apostrophes, you can add them to the character classes ([‘’]).
The 2nd regex is less efficient but more accurate, as it will only match diacritics after base letters. See more about matching Unicode letters at regular-expressions.info:
To match a letter including any diacritics, use \p{L}\p{M}*+.
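To see how the first regex behaves on the inputs from the question, a small check could look like the sketch below. The class NameValidator and its method are hypothetical names; the pattern is the first one from this answer, precompiled.

```java
import java.util.regex.Pattern;

final class NameValidator {
    // First regex from the answer; ^ and $ are omitted because matches() anchors anyway.
    private static final Pattern NAME =
            Pattern.compile("[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*");

    /** True if the whole input is a capitalised name with single spaces between parts. */
    static boolean isValidName(String input) {
        return NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"John Smith", "Alberto del Muerto", " John Smith ", " John Smith", "John  Smith"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValidName(s));
        }
    }
}
```

The two inputs with leading or trailing spaces and the one with a doubled space are rejected, while the two correct examples pass.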

Try this one
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- ']))[A-Za-z- ']{2,}$

You made a typo: the second \\ should be in front of p.
However even then there is a check missing for a trailing space
"^\\p{L}[\\p{L} ,.'-]+$"
For a .matches the following would suffice
"\\p{L}[\\p{L} ,.'-]*[\\p{L}.]"
Names like "del Rey, Hidalgo" do not require an initial capital.
Also I would advise simply calling .trim() on the input; imagine a user staring at input that was rejected because of a stray blank.
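A minimal sketch of the trim-then-match idea, using the second pattern from this answer. The class and method names are placeholders chosen for the example.

```java
final class TrimmedNameCheck {
    // Pattern from the answer: a letter, then letters/space/punctuation,
    // ending in a letter or a period (so no trailing space).
    static final String NAME_PATTERN = "\\p{L}[\\p{L} ,.'-]*[\\p{L}.]";

    /** Trims surrounding whitespace before validating, so stray blanks are forgiven. */
    static boolean isValidAfterTrim(String rawInput) {
        return rawInput.trim().matches(NAME_PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(isValidAfterTrim(" John Smith "));       // trimmed first
        System.out.println("John Smith ".matches(NAME_PATTERN));    // not trimmed
        System.out.println(isValidAfterTrim("del Rey, Hidalgo"));   // lowercase start allowed
    }
}
```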

Try this
^[A-Z][a-z]+(([\s][A-Z])?[a-z]+){1,2}$
but use \\ instead of \ in a Java string literal

Java StreamTokenizer splits Email address at # sign

I am trying to parse a document containing email addresses, but the StreamTokenizer splits the E-mail address into two separate parts.
I already set the # sign as an ordinaryChar and space as the only whitespace:
StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('#');
tokenizer.whitespaceChars(' ', ' ');
Still, all E-mail addresses are split up.
A line to parse looks like the following:
"Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck".
The Tokenizer splits del6#uni.at to "del6" and "uni.at".
Is there a way to tell the tokenizer to not split at # signs?

So here is why it worked like it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.
So in fact it does rather sophisticated tokenizing - recognizing comments, quoted strings, numbers. Note that in a programming language, you can have a string like a = a+b;. A simple tokenizer that merely breaks the text by whitespace would break this into a, = and a+b;. But StreamTokenizer would break this into a, =, a, +, b, and ;, and will also give you the "type" for each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
It wasn't recognizing the # as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.
A StreamTokenizer would recognize your line as:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character #
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
(This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
In fact, by telling it that # was an "ordinary character", you were telling it to take the # as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer.
It removes any special significance the character has as a comment
character, word component, string delimiter, white space, or number
character. When such a character is encountered by the parser, the
parser treats it as a single-character token and sets ttype field to
the character value.
(My emphasis).
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('#','#') it would have kept the whole e-mail together. My little demo with that added gives:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6#uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
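The answer's demo code is not shown, but something along the following lines should reproduce both listings. This is a reconstruction under that assumption: the class name, the tokenize method and its boolean flag are invented here, while wordChars(), nextToken(), ttype, sval and nval are the StreamTokenizer members discussed above.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TokenizerDemo {
    /** Tokenizes one line and labels each token by its type. */
    static List<String> tokenize(String line, boolean hashIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        if (hashIsWordChar) {
            st.wordChars('#', '#'); // treat # as part of a word
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("The word " + st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("The number " + st.nval);
                } else {
                    // ordinary characters are returned in ttype, not in sval
                    out.add("The character " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6#uni.at Competition speech University of Innsbruck";
        tokenize(line, false).forEach(System.out::println);
        System.out.println("--- with wordChars('#', '#') ---");
        tokenize(line, true).forEach(System.out::println);
    }
}
```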
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise your options depend on the shape of your data. If it is line-based (each line is a separate record, possibly with a different number of tokens on each line), you would typically read the lines one by one from a reader and split each with String.split(). If it is just a whitespace-delimited chain of tokens, Scanner might suit you better.

In order to simply split a String, see the answer to this question (adapted for whitespace):
The best way is to not use a StringTokenizer at all, but use String's
split method. It returns an array of Strings, and you can get the
length from that.
For each line in your file you can do the following:
String[] tokens = line.split(" +");
tokens will now have 6 - 8 Strings. Use tokens.length to find out
how many, then create your object from the array.
This is sufficient for the given line, and might be sufficient for everything. Here is some code that uses it (it reads System.in):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class T {
    public static void main(String[] args) {
        BufferedReader st = new BufferedReader(new InputStreamReader(System.in));
        String line;
        try {
            // readLine() returns null at end of input; ready() only says whether a read would block
            while ((line = st.readLine()) != null) {
                // split on runs of one or more spaces
                String[] tokens = line.split(" +");
                for (String token : tokens) {
                    System.out.println(token);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e); // handle error here
        }
    }
}

Returning java regex (words, spaces, special characters, double quotes)

I am trying to use java regex to tokenize any language source file. What I want the list to return is:
words ([a-z_A-Z0-9])
spaces
any of [()*.,+-/=&:] as a single character
and quoted items left in quotes.
Here is the code I have so far:
Pattern pattern = Pattern.compile("[\"(\\w)\"]+|[\\s\\(\\)\\*\\+\\.,-/=&:]");
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();
while(matcher.find()) {
matchlist.add(matcher.group(0));
}
For example,
"I" am_the 2nd "best".
returns: list, size 8
("I", ,am_the, ,2nd, ,"best", .)
which is what I want. However, if the whole sentence is quoted, except for the period:
"I am_the 2nd best".
returns: list, size 8
("I, ,am_the, ,2nd, ,best", .)
and I want it to be able to return: list, size 2
("I am_the 2nd best", .)
If that makes sense. I believe it works for everything I want it to except for returning string literals (which I want to keep the quotes). What is it that I am missing from the pattern that will allow me to achieve this?
And by all means, if there is an easier pattern that I do not see, please help me out. The pattern shown above is the result of a lot of trial and error. Thank you very much in advance for any help.

First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:
\w+
Next there's whitespace.
\s+
To match strings as one token, you need to allow more characters than just \w. That only allows alphanumeric characters and _, which means whitespace and symbols are not allowed. You also need to move the starting and ending quotes outside of the square brackets.
And don't forget backslashes to escape characters. You want to allow \" inside of strings.
"(\\.|[^"])+"
Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like # or |. So for symbols:
[^\s\w"]
Putting the pieces together, we get this combined regex:
\w+|\s+|"(\\.|[^"])+"|[^\s\w"]
Or, escaping everything properly so it can be put into source code:
Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
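Plugged into the loop from the question, the combined pattern can be tried on both example sentences. The sketch below only adds a wrapper class and method (SourceLexer and tokenize, both made-up names) around that loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SourceLexer {
    // word | whitespace run | quoted string (with backslash escapes) | any other single symbol
    private static final Pattern TOKEN =
            Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    /** Returns every token matched by the combined pattern, in order. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(source);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"I\" am_the 2nd \"best\"."));   // 8 tokens
        System.out.println(tokenize("\"I am_the 2nd best\"."));       // 2 tokens
    }
}
```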

Typically, when parsing text, the process you're describing is called "lexical analysis" and the function used is called a 'lexer' which is used to break up an input stream into identifiable tokens like words, numbers, spaces, periods, etc.
The output of a lexer is consumed by a 'parser' which does "syntactic analysis" by identifying groups of tokens which belong together, like [double-quote] [word] [double-quote].
I would recommend you follow the same two-pass strategy, since it's been proven time and again in many, many parsers.
So, your first step might be to use this regular expression as your lexer:
\W|\w+
which will break your input text into either single non-word characters (like spaces, double and single quotation marks, commas, periods, etc.) or sequences of one or more word characters where \w is really just a shortcut for [a-zA-Z_0-9].
So, using your example above:
String str = "\"I\" am_the 2nd \"best\".";
String p = "\\W|\\w+";
Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();
while(matcher.find()) {
matchlist.add(matcher.group(0));
}
produces:
['"', 'I', '"', ' ', 'am_the', ' ', '2nd', ' ', '"', 'best', '"', '.']
which you can then decide how to treat in your code.
No, this doesn't give you a single one-size-fits-all regular expression which matches both of the cases you list above, but in my experience, regular expressions aren't really the best tool to do the kind of syntactic analysis you require because they either lack the expressiveness needed to cover all possible cases or, and this is far more likely, they quickly become far too complex for all but the true RegExp maven to fully comprehend.
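One possible shape for the second pass is sketched below: the lexer output is scanned once, and everything between a pair of double-quote tokens is glued back into a single string-literal token. This is only an illustration of the two-pass idea, not a full parser; the names TwoPassTokenizer, lex and mergeQuoted are invented, and escaped quotes inside a literal are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPassTokenizer {
    private static final Pattern LEXER = Pattern.compile("\\W|\\w+");

    /** Pass 1: single non-word characters or runs of word characters. */
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Pass 2: merge the tokens between two double quotes into one token. */
    static List<String> mergeQuoted(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder literal = null; // non-null while inside a quoted string
        for (String t : tokens) {
            if (literal == null) {
                if (t.equals("\"")) {
                    literal = new StringBuilder(t); // opening quote
                } else {
                    out.add(t);
                }
            } else {
                literal.append(t);
                if (t.equals("\"")) {               // closing quote
                    out.add(literal.toString());
                    literal = null;
                }
            }
        }
        if (literal != null) {
            out.add(literal.toString()); // unterminated literal: keep what was read
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mergeQuoted(lex("\"I am_the 2nd best\".")));
    }
}
```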

Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and
titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?

You're mixing decimal (8211) and hexadecimal (0x8211).
\x and \u both expect a hexadecimal number, so you'd need \u2013 to match the en-dash (decimal 8211) and \u2014 to match the em-dash, not \u8211 and \u8212 (and \x2D for the normal hyphen, etc.).
But why not simply use the Unicode property "Dash punctuation"?
As a Java string: "\\s\\p{Pd}\\s"
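Applied to the sample title from the question, the property-based separator could be used as in this sketch. The class and method names are made up; the en-dash is written as the escape \u2013 so the example does not depend on the source file encoding.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class DashSplitter {
    // Any Unicode dash punctuation (hyphen-minus, en-dash, em-dash, ...) with whitespace on both sides.
    private static final Pattern TITLE_SEGMENT_SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    /** Splits a title such as "foo - bar" into its segments. */
    static String[] splitTitle(String title) {
        return TITLE_SEGMENT_SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitTitle("Study Summary (1 of 10) \u2013 Competition")));
        System.out.println(Arrays.toString(splitTitle("foo - bar")));
    }
}
```

A hyphen without surrounding whitespace, as in "well-known", is left alone because the pattern requires whitespace on both sides of the dash.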
