Java literate text word parsing regexp


At first I was happy with [A-Za-z]+
Now I need to parse words that end with the letter "s", but I should skip words whose first two or more letters are upper-case.
I tried something like [\n\\ ][A-Za-z]{0,1}[a-z]*s[ \\.\\,\\?\\!\\:]+ but its first part, [\n\\ ], for some reason doesn't match at the beginning of the line.
Here is an example:
the text is Denis goeS to school every day!
but the only parsed word is goeS
Any ideas?

What about
\b[A-Z]?[a-z]*s\b
The \b is a word boundary, which I assume is what you wanted. The ? is the shorter form of {0,1}.

Try this:
Pattern p = Pattern.compile("\\b([A-Z]?[a-z]*[sS])\\b");
Matcher m = p.matcher("Denis goeS to school every day!");
while (m.find()) {
    System.out.println(m.group(1)); // prints each matching word
}
The regex matches every word that starts with at most one upper-case letter, contains only lower-case letters in the middle, and ends with either s or S.
In your example this would match Denis and goeS. If you want to match only an upper-case S, change the expression to "\\b([A-Z]?[a-z]*[S])\\b", which would match goeS and GoeS but not GOeS, gOeS or goES.
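As a quick check of the claims above, here is a minimal sketch that runs both expressions over a sentence containing all the listed spellings. The class name WordsEndingInS and the helper findWords are made up for this illustration and are not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to compare the two expressions.
class WordsEndingInS {
    // At most one leading upper-case letter, lower-case middle, final s or S.
    static final Pattern ANY_S = Pattern.compile("\\b([A-Z]?[a-z]*[sS])\\b");
    // Same shape, but the final letter must be an upper-case S.
    static final Pattern UPPER_S = Pattern.compile("\\b([A-Z]?[a-z]*[S])\\b");

    // Collects group 1 of every match of the given pattern in the text.
    static List<String> findWords(Pattern pattern, String text) {
        List<String> words = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Denis goeS GoeS GOeS gOeS goES to school every day!";
        System.out.println(findWords(ANY_S, text));   // [Denis, goeS, GoeS]
        System.out.println(findWords(UPPER_S, text)); // [goeS, GoeS]
    }
}
```

The \b at the start is what keeps GOeS out: a match can only begin at a word start, so the engine cannot skip the first upper-case letter and match the rest of the word.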

Related

Replace white spaces only in part of the string

I have a String like
"This is apple tree"
I want to remove the white space up to the word apple. After the change it should look like
"Thisisapple tree"
I need to achieve this in a single replace command combined with regular expressions.

For now it looks like you may be looking for
String s = "This is apple tree";
System.out.println(s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1"));
Output: Thisisapple tree.
Explanation:
\G represents either end of previous match or start of input (^) if there was no previous match yet (when we are attempting to find first match)
\S+ represents one or more non-whitespace characters (to match words, including non-alphabetic characters like ' or punctuation)
(?<!(?<!\S)apple)\s the negative lookbehind prevents accepting whitespace that has apple before it (I added another negative lookbehind before apple to make sure that no non-whitespace character precedes it, which ensures that it is not part of some other word)
$1 in the replacement represents the match from group 1 (the one from (\S+)), which is the word. So we are replacing the word and the space with only the word (effectively removing the space)
WARNING: This solution assumes that
the sentence doesn't start with a space,
words are separated by exactly one space.
If we want to get rid of these assumptions we would need something like:
System.out.println(s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1"));
^\s+ will allow us to match spaces at the beginning of the string (and replace them with the content of group 1, which in this case will be empty, so we simply remove this whitespace)
\s+ at the end allows us to match a word and one or more spaces after it (to remove them)
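To see how the two variants differ, here is a small sketch that wraps each replaceAll call in a method. The class name SpacesBeforeWord, the method names strict and lenient, and the extra test sentences are assumptions made for illustration.

```java
// Hypothetical wrapper around the two expressions from this answer.
class SpacesBeforeWord {
    // First variant: no leading whitespace, exactly one space between words.
    static String strict(String s) {
        return s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1");
    }

    // Second variant: also tolerates leading whitespace and runs of spaces.
    static String lenient(String s) {
        return s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(strict("This is apple tree"));           // Thisisapple tree
        System.out.println(strict("This pineapple is apple tree")); // Thispineappleisapple tree
        System.out.println(lenient("  This  is apple tree"));       // Thisisapple tree
    }
}
```

The pineapple case shows the inner lookbehind at work: the space after pineapple is still removed, because the apple there is preceded by a non-whitespace character and so does not count as the standalone word.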

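A single replace() is unlikely to solve your problem. You could do something like this (assuming the string contains "apple"):
// A limit of 2 splits only at the first "apple", so later occurrences stay in parts[1]
String[] parts = "This is an apple tree, not an orange tree".split("apple", 2);
System.out.println(new StringBuilder(parts[0].replace(" ", "")).append("apple").append(parts[1]));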

This can be achieved with a lookahead assertion, like this:
String str = "This is an apple tree";
System.out.println(str.replaceAll(" (?=.*apple)", ""));
It means: remove every space that has the word apple somewhere after it.
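A short sketch of how this lookahead behaves, including a sentence with two occurrences of apple. The class and method names are hypothetical, and the second sentence is made up for the example.

```java
// Hypothetical wrapper around the lookahead-based replacement.
class SpacesBeforeLastApple {
    // Removes every space that still has "apple" somewhere to its right.
    static String removeSpacesBeforeApple(String s) {
        return s.replaceAll(" (?=.*apple)", "");
    }

    public static void main(String[] args) {
        System.out.println(removeSpacesBeforeApple("This is an apple tree"));
        // Thisisanapple tree
        System.out.println(removeSpacesBeforeApple("an apple and an apple pie"));
        // anappleandanapple pie
    }
}
```

Because any later apple satisfies the lookahead, spaces are removed up to the last occurrence, not the first one, and apple is not required to be a whole word.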

If you want to use a regular expression you could try:
Matcher matcher = Pattern.compile("^(.*?\\bapple\\b)(.*)$").matcher("This is an apple but this apple is an orange");
System.out.println((!matcher.matches()) ? "No match" : matcher.group(1).replaceAll(" ", "") + matcher.group(2));
This checks that "apple" is an individual word and not just part of another word such as "snapple". It also splits at the first use of "apple".

Match consecutive single characters as whole word

While filtering a list of strings, I want to match consecutive single characters as a whole word,
e.g. for the strings below
'm g road'
'some a b c d limited'
the first one should match if the user types
"mg" or "m g" or "m g road" or "mg road"
and the second one should match if the user types
"some abcd" or "some a b c d" or "abcd" or "a b c d"
How can I do that? Can I achieve this using regex?
I can already handle the order of whole words by searching for the words one by one, but I am not sure how to treat consecutive single characters as a single word,
e.g. "mg road" or "road mg" I can handle by searching for "mg" and "road" one by one.
EDIT
To make the requirement clearer, below is my test case:
@Test
public void testRemoveSpaceFromConsecutiveSingleCharacters() throws Exception {
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("some a b c d limited").equals("some abcd limited"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("m g road").equals("mg road"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c").equals("bank abc"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a").equals("bank abc limited na"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("c road").equals("c road"));
}

1.) Strip out the spaces between space-surrounded single letters, in both the string to check and the user input:
.replaceAll("(?<=\\b\\w) +(?=\\w\\b)","")
(?<=\b\w) is a lookbehind that checks that the spaces are preceded by a \b word boundary and one \w word character
(?=\w\b) is a lookahead that checks that they are followed by one \w word character and a \b word boundary
2.) Check whether the string to check .contains the user input.
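The two steps above could be put together roughly as follows. This is only a sketch: the method name removeSpaceFromConsecutiveSingleCharacters comes from the question's test case, while the class name SingleLetterRuns and the helper matches are assumptions.

```java
// Hypothetical class combining both steps of this answer.
class SingleLetterRuns {
    // Step 1: drop spaces that sit between two single-character words.
    static String removeSpaceFromConsecutiveSingleCharacters(String s) {
        return s.replaceAll("(?<=\\b\\w) +(?=\\w\\b)", "");
    }

    // Step 2: normalize both strings the same way, then do a plain contains.
    static boolean matches(String stringToCheck, String userInput) {
        return removeSpaceFromConsecutiveSingleCharacters(stringToCheck)
                .contains(removeSpaceFromConsecutiveSingleCharacters(userInput));
    }

    public static void main(String[] args) {
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("some a b c d limited")); // some abcd limited
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a")); // bank abc limited na
        System.out.println(matches("m g road", "mg"));     // true
        System.out.println(matches("m g road", "m g road")); // true
    }
}
```

Since both lookarounds are zero-width, the letters themselves are never consumed, which is why a run such as a b c d collapses in a single pass.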

It sounds like you simply want to ignore white space. You can easily do this by stripping out white space from both the target string and the user input before looking for a match.

You basically want each search term to be modified to allow intervening spaces, so
"abcd" becomes regex "\ba ?b ?c ?d\b"
To achieve this, do this to each word before matching:
word = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
The word breaks \b are necessary to stop matching "comma bcd" or "abc duck".
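A minimal sketch of this transformation in use. The class name SpacedTermRegex and its two helper methods are hypothetical, and the sketch assumes the search term consists only of letters and digits (regex metacharacters in the term would need escaping first).

```java
import java.util.regex.Pattern;

// Hypothetical helper; assumes the term contains no regex metacharacters.
class SpacedTermRegex {
    // "abcd" -> \ba ?b ?c ?d\b : an optional space between every two characters.
    static Pattern toSpacedPattern(String word) {
        String regex = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
        return Pattern.compile(regex);
    }

    // True if the term occurs in the text, with or without intervening spaces.
    static boolean found(String word, String text) {
        return toSpacedPattern(word).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(toSpacedPattern("abcd").pattern());   // \ba ?b ?c ?d\b
        System.out.println(found("abcd", "some a b c d limited")); // true
        System.out.println(found("mg", "m g road"));               // true
        System.out.println(found("abcd", "comma bcd"));            // false
        System.out.println(found("abcd", "abc duck"));             // false
    }
}
```

Unlike the normalization approach, this leaves the stored strings untouched and moves all the flexibility into the search pattern.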

This regex will match all single characters separated by one or more spaces
(^(\w\s+)+)|(\s+\w)+$|((\s+\w)+\s+)

The following regex (in multiline mode) could help you out:
^(?<first>\w+)(?<chars>(?:.(?!(?:\b\w{2,}\b)))*)
# assure that it is the beginning of the line
# capture as many word characters as possible in the first group "first"
# the construction afterwards consumes everything up to (not including)
# a word which has at least two characters...
# ... and saves it to the group called "chars"
You would only need to replace the whitespaces in the second group (aka "chars").
See a demo on regex101.com

str = str.replaceAll("\\s","");

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])

Similar to what @Sotirios Delimanolis wrote in his comment, but using word boundaries so it will work if you have multiple words in a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alphanumeric character or an underscore).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog

Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.

This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out below, use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
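Both lookahead patterns from this answer can be tried against the inputs from the question with a sketch like the one below. The class name TrailingVowel and the two method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper exercising both patterns from this answer.
class TrailingVowel {
    private static final Pattern WHOLE_INPUT = Pattern.compile("^(.*?)(?=[aeiou]$|$)");
    private static final Pattern PER_WORD = Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    // Treats the whole input as one word.
    static String stripTrailingVowel(String word) {
        Matcher m = WHOLE_INPUT.matcher(word);
        return m.find() ? m.group(1) : word;
    }

    // Handles several lower-case words on one line.
    static List<String> stripEachWord(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = PER_WORD.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingVowel("ansia")); // ansi
        System.out.println(stripTrailingVowel("bello")); // bell
        System.out.println(stripTrailingVowel("ansid")); // ansid
        System.out.println(stripEachWord("ansia bello ansid")); // [ansi, bell, ansid]
    }
}
```

In both patterns the lazy quantifier grows the capture one character at a time, and the lookahead stops it as soon as only a final vowel, or nothing at all, is left in the word.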

RegEx to find the word between last Upper Case word and another word

My problem is to find a word between two words. One of these two words is an all UPPER CASE word, which can be anything, and the other word is "is". I tried a few regexes but none of them helped. Here is my example:
String :
In THE house BIG BLACK cat is very good.
Expected output :
cat
RegEx used :
(?<=[A-Z]*\s)(.*?)(?=\sis)
The above RegEx gives me BIG BLACK cat as output whereas I just need cat.

One solution is to simplify your regular expression a bit,
[A-Z]+\s(\w+)\sis
and use only the matched group (i.e., \1).
Since you came up with something more complex, I assume you understand all the parts of the above expression, but for someone who might come along later, here are more details:
[A-Z]+ will match one or more upper-case characters
\s will match a space
(\w+) will match one or more word characters ([a-zA-Z0-9_]) and store the match in the first match group
\s will match a space
is will match "is"
My example is very specific and may break down for different input. Your question didn't provide many details about what other inputs you expect, so I'm not confident my solution will work in all cases.

Try this one:
String testInput = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile(
        "(?<=\\b\\p{Lu}+\\s) # lookbehind assertion to ensure an uppercase word before\n"
        + "\\p{L}+ # matching at least one letter\n"
        + "(?=\\sis) # lookahead assertion to ensure whitespace and is ahead\n",
        Pattern.COMMENTS);
Matcher m = p.matcher(testInput);
if (m.find()) {
    System.out.println(m.group(0));
}
It matches only "cat".
\p{L} is a Unicode property for a letter in any language.
\p{Lu} is a Unicode property for an uppercase letter in any language.

You want to test a condition that depends on several pieces of information and then retrieve only one specific piece of it. The simplest way to do that with a regex is a capturing group. In Java you could do it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        // upper-case word, whitespace, captured word, whitespace, "is"
        Pattern pattern = Pattern.compile("[A-Z]+\\s(\\w+)\\sis");
        Matcher matcher = pattern.matcher("In THE house BIG BLACK cat is very good.");
        if (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}
group(1) is the part of the pattern enclosed in parentheses, in this case \w+, and that is your word. group() returns a String, so you can use it right away.

The following part has a strange behavior:
(?<=[A-Z]*\s)(.*?)
Because * also allows zero repetitions, [A-Z]* can match an empty string, and (.*?) then matches BIG BLACK as well. With a few tweaks, I think the following will work (but it still matches some false positives):
(?<=[A-Z]+\s)(\w+)(?=\sis)
A slightly better regex would be:
(?<=\b[A-Z]+\s)(\w+)(?=\sis)
Hope it helps

String input = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile("[A-Z]+\\s\\w+\\sis");
Matcher m1 = p.matcher(input);
if (m1.find()) {
    String[] parts = m1.group().split("\\s"); // split the match on whitespace
    System.out.println(parts[1]);             // the second token is the wanted word
}

Regex match for anything EXCEPT pattern

I am coding in Java here.
I know that the regex for matching any number or string of letters is
"(0|[1-9][0-9]*)(\\.[0-9]+)?|[a-zA-Z]+"
But I would like to match anything except letters or numbers, i.e. symbols like !, #, +, -
I tried using [^.. ] but it doesn't work.
For example, let's say I want to do the opposite, i.e. return all parts of the string that contain numbers, strings of letters, or #. I would do
public ArrayList<String> findMatch(String string) {
    ArrayList<String> outputArr = new ArrayList<String>();
    Pattern p = Pattern.compile("(0|[1-9][0-9]*)(\\.[0-9]+)?|[a-zA-Z]+|\\#");
    // recognizes number, string, and #
    Matcher m = p.matcher(string);
    while (m.find()) {
        outputArr.add(m.group());
    }
    return outputArr;
}
Let's say I want to find the opposite of the code above: how can I change line 3 (the Pattern.compile line)?

You'll probably want to use just this:
\W+
That will match a string of any characters that aren't "word characters", defined as:
[a-zA-Z0-9_]
or "all letters, digits, and the underscore". If you also want to match the underscore, try the following:
[\W_]+
Or, if you'd rather have it explicit:
[^A-Za-z0-9]+
Which means "everything but letters and numbers".
Hope this helps.
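As a sketch of what changing the pattern line could look like, the method below keeps the loop from the question and takes the expression as a parameter so the variants can be compared. The class name SymbolFinder, the extra regex parameter and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the question's findMatch with the pattern passed in.
class SymbolFinder {
    static List<String> findMatch(String regex, String input) {
        List<String> output = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            output.add(m.group());
        }
        return output;
    }

    public static void main(String[] args) {
        String input = "abc!#12+-x_y";
        System.out.println(findMatch("\\W+", input));           // [!#, +-]
        System.out.println(findMatch("[\\W_]+", input));        // [!#, +-, _]
        System.out.println(findMatch("[^A-Za-z0-9]+", input));  // [!#, +-, _]
    }
}
```

Note that whitespace is also a non-word character, so all three expressions will return runs of spaces as matches if the input contains any.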

The simplest regex pattern that you can use is: [^\w]+
This will match all the special characters, which are neither digits nor letters.
From the example you have provided, what I understand is that you want all the characters except letters, digits and '#'.
In regex, \w matches any letter, any digit and the underscore. So you need to negate it to get the other symbol characters such as '$' or '!'.
The expression below will solve your issue: [^\w#]+
Inside a character class, a leading '^' means negation, so [^\w] means 'match anything except letters, digits and the underscore'. I have also added '#' to the class because you want to exclude it as well.
Hope this answers your question.

If you can give some more detail about your requirement and what you expect, it will help me figure out the solution.
From your question, it looks like you want to match special characters only. Am I right?
If so you can just try:
[^A-Za-z0-9][your quantifier here]
The quantifier can be:
? for 0 or 1 occurrences
+ for 1 or more occurrences
* for 0 or more occurrences
Suppose you have a String like
String s = "shyuit6785%^7kui!#*&123f#$annds";
// and you want to find the characters other than letters, digits and '#' (I hope that is your requirement)
Pattern p = Pattern.compile("[^A-Za-z0-9#]+");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println("Found a required character " + m.group() + " at index number " + m.start());
}
