Check a string order in a sentence - java

I want to find out if a specific word comes before another. Partial words are not a match.
Some example tests:
"Hi my name is AB, I’m from London and I love it here ..."
if "from" is before "Hi" -> return false
if "Hi" is before "AB" -> return true

There are several ways of doing this:
Use indexOf - this is perhaps the simplest approach. Get the indexes of both strings and compare them. The string with the lower index comes before the other string.
Use regular expressions - construct a regex that matches the strings in the desired order, for example "from.*?Hi". This approach is likely to use multiple regular expressions.
One twist on the first approach would be to start searching for the second word at the index of the first word plus the length of the word, and avoid index comparisons. With many searches and long strings this could save you some CPU cycles.
Note: Depending on the requirements you may need to watch out for the Scunthorpe problem, when you get a false positive for a match on a substring. If your requirement is that "Hi, my friend AB" should be matched, but "Higher than AB" should not be matched, then the regex approach with \b anchors on both ends of the word would provide an easier solution than manipulating string indexes. The "from.*?Hi" regex above becomes "\\bfrom\\b.*?\\bHi\\b".
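A minimal sketch of the \b approach from the note above, wrapped in a helper. The class name WordOrder and the method isBefore are illustrative names, not from the original, and Pattern.quote is used on the assumption that the words should be treated as literal text rather than regex syntax.

```java
import java.util.regex.Pattern;

class WordOrder {
    // Returns true if some whole-word occurrence of `first` is followed,
    // later in the sentence, by a whole-word occurrence of `second`.
    static boolean isBefore(String sentence, String first, String second) {
        String regex = "\\b" + Pattern.quote(first) + "\\b.*?\\b"
                + Pattern.quote(second) + "\\b";
        // DOTALL lets ".*?" span line breaks; find() searches anywhere in the text.
        return Pattern.compile(regex, Pattern.DOTALL).matcher(sentence).find();
    }

    public static void main(String[] args) {
        String sentence = "Hi my name is AB, I'm from London and I love it here";
        System.out.println(isBefore(sentence, "Hi", "AB"));           // whole words, in order
        System.out.println(isBefore(sentence, "from", "Hi"));         // wrong order
        System.out.println(isBefore("Higher than AB", "Hi", "AB"));   // "Hi" is only a partial word
    }
}
```

Note that this checks whether any occurrence of the first word precedes any occurrence of the second, which can differ from comparing first occurrences when a word repeats.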

yourString.matches(".*? Hi\\b.*? AB\\b.*")
This will make sure that you have spaces in between and you're matching whole words.
If you're dealing with Latin American text, where punctuation can come before words, this is more general:
yourString.matches(".*?\\bHi\\b.*?\\bAB\\b.*")
Breaking that down you have
.*? = anything, even the empty string. Ignore the ? for now.
\\b = a word boundary
So that regex means
<anything><word boundary>Hi<word boundary><anything><word boundary>AB<word boundary><anything>
which is the same as
if "Hi" is before "AB" -> return true
which would be used as
if(yourString.matches(".*?\\bHi\\b.*?\\bAB\\b.*")){
return true;
}

You can take a look at indexOf(String str), which returns an integer denoting the position of the first occurrence of the substring, or -1 if it is not found. You could use that to see which string precedes the other.

You can use indexOf method and get the first occurrence of each word and then check. For example:
String sentence = "Hi my name is AB, I’m from London and I love it here …";
int fromIndex = sentence.indexOf("from");
int hiIndex = sentence.indexOf("Hi");
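// "from" comes before "Hi" only if both words occur and "from" starts earlier
boolean fromBeforeHi = fromIndex != -1 && hiIndex != -1 && fromIndex < hiIndex;
System.out.println(fromBeforeHi); // prints false for this sentence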
Note that if a word does not exist within the sentence, then indexOf will return -1.
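The indexOf idea can also be sketched so that a missing word is handled explicitly and the second word is searched for only after the end of the first. The names IndexOrder and isBefore are hypothetical, and this version matches substrings, so it does not solve the partial-word problem.

```java
class IndexOrder {
    // Returns true if `second` occurs somewhere after the first occurrence of `first`.
    // Plain substring search: "Hi" will also be found inside "Higher".
    static boolean isBefore(String sentence, String first, String second) {
        int firstIndex = sentence.indexOf(first);
        if (firstIndex == -1) {
            return false; // the first word is not present at all
        }
        // Start the second search right after the first word ends.
        return sentence.indexOf(second, firstIndex + first.length()) != -1;
    }

    public static void main(String[] args) {
        String sentence = "Hi my name is AB, I'm from London and I love it here";
        System.out.println(isBefore(sentence, "Hi", "AB"));
        System.out.println(isBefore(sentence, "from", "Hi"));
    }
}
```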

Related

Please justify the output in Regex Java program

I came across a Java regex program.
Below is the program code:
import java.util.regex.*;
public class Regex_demo01 {
public static void main(String[] args) {
boolean b=true;
Pattern p=Pattern.compile("\\d*");
Matcher m=p.matcher("ab34ef");
while(b=m.find())
{
System.out.println(b);
System.out.println(">"+m.start()+"\t"+m.group()+"<");
}
}
}
Output :
true
>0 <
true
>1 <
true
>2 34<
true
>4 <
true
>5 <
true
>6 <
Doubt: As we know, the find() method returns true if it gets a match and remembers the start position of the match. If find() returns true, you can call the start() method to get the starting position of the match, and you can call the group() method to get the string that represents the actual bit of source data that was matched.
My question is: how come ">6 <" is present in the output when the string's last index is 5?
The answer is simple: x* matches any number of x, even zero.
Replace * with +, which matches one or more of the element to its left.
My question is how come >6 < is present in the output when the string indexing is till index 5 ?
That behavior is due to your regex i.e. \\d* which matches 0 or more digits.
As you can see it is showing start position 0 as well when there is no digit at the start.
Similarly 6 is last index +1 because there is an empty match past the last character as well.
You should use \\d+ as your regex.
The star quantifier (*) is defined as "zero or more times". That said, your pattern matches zero digits most of the time.
What you actually want is probably the plus quantifier (+), which means "one or more times".
Source: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Why is there a match at index 6?
RegEx doesn't work on a char-basis, but rather in between single chars. When matching an empty string, it will look before and after every character. Duplicate findings are omitted, of course, so an empty string after the first char and before the second char will yield one match instead of two. By default the quantifier is greedy, which means it will match as many characters as possible.
Consider this example:
Input string is 1
RegEx is \\d*
In this case the RegEx engine starts before the first character and tries to match zero, one or more digits. Since it's greedy, it doesn't stop after the empty string it finds at the beginning. It finds a '1' with no digits following. This is the first match. Then it continues the search after the match. It finds an empty string and matches it too, since that equals zero digits.
For RegEx the string '1' looks rather like this:
"" + "1" + ""
The first two units (empty string and the "1") match the pattern, the third, empty string does, too.
In-depth article about this: http://www.regular-expressions.info/zerolength.html
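To compare the two quantifiers side by side, here is a small sketch that records every match as "start:text". The class DigitMatches and the helper findAll are illustrative names, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitMatches {
    // Collects every match of `regex` in `input` as "startIndex:matchedText".
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // \d* also reports empty matches, including one at index 6 (end of input).
        System.out.println(findAll("\\d*", "ab34ef")); // [0:, 1:, 2:34, 4:, 5:, 6:]
        // \d+ needs at least one digit, so only the real run of digits is reported.
        System.out.println(findAll("\\d+", "ab34ef")); // [2:34]
    }
}
```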

Regex to validate 4 different characters are in a string

I would like to enforce that 4 different characters will be in a string.
Valid examples:
"1q2w3e4r5t"
"abcd"
Invalid examples:
"good"
"1ab1"
Ideas for a pattern?
You should consider using a non-regex solution. I only write this answer to show a simpler regex solution for this problem.
Initial solution
Here is a simpler regex solution, which asserts that there are at least 4 distinct characters in the string:
(.).*?((?!\1).).*?((?!\1|\2).).*?((?!\1|\2|\3).).*
Demo on regex101 (PCRE and Java have the same behavior for this regex)
.*?((?!\1).), .*?((?!\1|\2).), ... searches for the next character which has not appeared before, which is implemented by checking that the character is not the same as whatever was captured in the previous capturing groups.
Logically, the laziness/greediness of the quantifier doesn't matter here. The lazy quantifier .*? is used to make the search start from the closest character which has not appeared before, rather than from the furthest character. It should slightly improve the performance in the matching case, since less backtracking is done.
Used with String.matches(), which asserts that the whole string matches the regex:
input.matches("(.).*?((?!\\1).).*?((?!\\1|\\2).).*?((?!\\1|\\2|\\3).).*")
Improved solution
If you are concerned about performance:
(.)(?>.*?((?!\1).))(?>.*?((?!\1|\2).))(?>.*?((?!\1|\2|\3).)).*
Demo on regex101
With String.matches():
input.matches("(.)(?>.*?((?!\\1).))(?>.*?((?!\\1|\\2).))(?>.*?((?!\\1|\\2|\\3).)).*")
The (?>pattern) construct prevents backtracking into the group once you exit from the pattern inside. This is used to "lock" the capturing groups to the first appearance of each of the distinct characters, since the result is the same even if you pick a different character later in the string.
This regex behaves the same as a normal program which loops from left-to-right, checks the current character against a set of distinct characters and adds it to the set if the current character is not in the set.
Due to this reason, the lazy quantifier .*? becomes significant, since it searches for the closest character which has not appeared so far.
You can use a regular expression to validate this, with negative look-aheads checking that each newly captured character differs from the ones captured before it.
I'd say it is very ugly, but working:
String rx = "^(.).*?((?!\\1).).*?((?!\\1|\\2).).*?((?!\\1|\\2|\\3).).*?$";
See demo
IDEONE Demo
String re = "^(.).*?((?!\\1).).*?((?!\\1|\\2).).*?((?!\\1|\\2|\\3).).*?$";
// Good
System.out.println("1q2w3e4r5t".matches(re));
System.out.println("goody".matches(re));
System.out.println("gggoooggoofr".matches(re));
// Bad
System.out.println("good".matches(re));
System.out.println("1ab1".matches(re));
Output:
true
true
true
false
false
You can count the number of distinct chars like this:
String s = "abcdefaa";
long numDistinctChars = s.chars().distinct().count();
Or if not on Java 8 (I couldn't come up with something better):
Set<Character> set = new HashSet<>();
char[] charArray = s.toCharArray();
for (char c : charArray) {
set.add(Character.valueOf(c));
}
int numDistinctChars = set.size();
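Putting the counting idea to work on the question's examples might look like the sketch below. The class DistinctChars and the method hasAtLeastDistinct are hypothetical names, and the count is over UTF-16 char values, which is an assumption that holds for plain ASCII input like the examples.

```java
class DistinctChars {
    // True if `s` contains at least `n` different char values (requires Java 8+).
    static boolean hasAtLeastDistinct(String s, int n) {
        return s.chars().distinct().count() >= n;
    }

    public static void main(String[] args) {
        System.out.println(hasAtLeastDistinct("1q2w3e4r5t", 4)); // valid example
        System.out.println(hasAtLeastDistinct("abcd", 4));       // valid example
        System.out.println(hasAtLeastDistinct("good", 4));       // only g, o, d
        System.out.println(hasAtLeastDistinct("1ab1", 4));       // only 1, a, b
    }
}
```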

How to negate a vowel condition using Regex in java

I'm trying to construct a Regex for a string which should have these following conditions:
It must contain at least one vowel.
It cannot contain three consecutive vowels or three consecutive consonants.
It cannot contain two consecutive occurrences of the same letter, except for 'ee' or 'oo'.
I'm not able to construct regex for 2nd and 3rd conditions.
e.g:
bower - accepted,
appple - not accepted,
miiixer - not accepted,
hedding - not accepted,
feeding - accepted
Thanks in advance!
Edited:
My code:
Pattern ptn = Pattern.compile("((.*[A-Za-z0-9]*)(.*[aeiou|AEIOU]+)(.*[##$%]).*)(.*[^a]{3}.*)");
Matcher mtch = ptn.matcher("zoggax");
if (mtch.find()) {
return true;
}
else
return false;
The following one should suit your needs:
(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\1).*
In Java:
String regex = "(?=.*[aeiouy])(?!.*[aeiouy]{3})(?!.*[a-z&&[^aeiouy]]{3})(?!.*([a-z&&[^eo]])\\1).*";
System.out.println("bower".matches(regex));
System.out.println("appple".matches(regex));
System.out.println("miiixer".matches(regex));
System.out.println("hedding".matches(regex));
System.out.println("feeding".matches(regex));
Prints:
true
false
false
false
true
Explanation:
(?=.*[aeiouy]): contains at least one vowel
(?!.*[aeiouy]{3}): does not contain 3 consecutive vowels
(?!.*[a-z&&[^aeiouy]]{3}): does not contain 3 consecutive consonants
[a-z&&[^aeiouy]]: any letter between a and z but none of aeiouy
(?!.*([a-z&&[^eo]])\1): does not contain 2 consecutive letters, except e and o
[a-z&&[^eo]]: any letter between a and z, but none of eo
See http://www.regular-expressions.info/charclassintersect.html.
This should work for English under the assumption that 'y' is a non-vowel:
^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\1).*[aeiou]
Explanation:
^ fixes the match to the beginning of the string.
(?!.*[aeiou]{3}) checks that you cannot find 3 consecutive vowels at any point after the current position in the string. (Since this is immediately after the ^, it checks the entire string.) It also does not advance the cursor.
Non-vowels are tested similarly. This can be done in a prettier way if your regexp flavor supports set subtraction; Java does, through character class intersection such as [a-z&&[^aeiou]].
(?!.*([^eo])\1) checks that there is no occurrence of a single-character capture group, of characters other than e or o, which is followed by a copy of itself. I.e. no character other than e and o is repeated twice.
.*[aeiou] looks for a vowel at some point in the string.
This regexp also assumes that the case-insensitive flag is set. This is not the default in Java: pass Pattern.CASE_INSENSITIVE to Pattern.compile or prefix the pattern with (?i).
It is also a regexp that will find a match in a string satisfying your criteria. It will not necessarily match the whole string. If this is needed, add .*$ to the end of the regexp.
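A minimal sketch of how the regexp above could be used from Java with the case-insensitive flag set explicitly. The class VowelRules and the method isAccepted are illustrative names; find() is used rather than matches() because the pattern is not written to consume the whole string.

```java
import java.util.regex.Pattern;

class VowelRules {
    // The regexp from the answer above, compiled with CASE_INSENSITIVE.
    private static final Pattern RULES = Pattern.compile(
            "^(?!.*[aeiou]{3})(?!.*[bcdfghjklmnpqrstvwxyz]{3})(?!.*([^eo])\\1).*[aeiou]",
            Pattern.CASE_INSENSITIVE);

    static boolean isAccepted(String word) {
        // find() is enough: the ^ anchor plus the lookaheads inspect the whole word.
        return RULES.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"bower", "appple", "miiixer", "hedding", "feeding"}) {
            System.out.println(w + " -> " + isAccepted(w));
        }
    }
}
```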
If my hunch is correct that you meant to say "three consecutive occurrences of the same letter" (looking at your examples) then you can simply say "e and o may not occur thrice, everything else may not occur twice", like so:
^(?=.*[aeiouy].*)(?!.*([eo])\1\1.*)(?!.*([a-df-np-z])\2.*).*$
Debuggex Demo. The key is that a letter occurring thrice is also occurring twice.

Matching only one occurrence of a character from a given set

I need to validate an input string such that validation returns true only if the string contains one of the special characters # # $ %, only one, and one time at the most. Letters and numbers can be anywhere and can be repeated any number of times, but at least one number or letter should be present
For example:
a# : true
#a : true
a#$: false
a#n01 : true
an01 : false
a : false
# : false
I tried
[0-9A-Za-z]*[##%$]{1}[0-9A-Za-z]*
I was hoping this would match one occurrence of any of the special characters. But, no. I need only one occurrence of any one in the set.
I also tried alternation but could not solve it.
Vivek, your regex was really close. Here is the one-line regex you are looking for.
^(?=.*?[0-9a-zA-Z])[0-9a-zA-Z]*[##$%][0-9a-zA-Z]*$
See demo
How does it work?
The ^ and $ anchors ensure that whatever we are matching is the whole string, avoiding partial matches with forbidden characters later.
The (?=.*?[0-9a-zA-Z]) lookahead ensures that we have at least one number or letter.
The [0-9a-zA-Z]*[##$%][0-9a-zA-Z]* matches zero or more letters or digits, followed by exactly one character that is either a #, #, $ or %, followed by zero or more letters or digits—ensuring that we have one special character but no more.
Implementation
I am sure you know how to implement this in Java, but to test if the string matches, you could use something like this:
boolean foundMatch = subjectString.matches("^(?=.*?[0-9a-zA-Z])[0-9a-zA-Z]*[##$%][0-9a-zA-Z]*$");
What was wrong with my regex?
Actually, your regex was nearly there. Here is what was missing.
Because you didn't have the ^ and $ anchors, the regex was able to match a subset of the string, for instance a# in a##%%, which means that special characters could appear in the string, but outside of the match. Not what you want: we need to validate the whole string by anchoring it.
You needed something to ensure that at least one letter or digit was present. You could definitely have done it with an alternation, but in this case a lookahead is more compact.
Alternative with Alternation
Since you tried alternations, for the record, here is one way to do it:
^(?:[0-9a-zA-Z]+[##$%][0-9a-zA-Z]*|[0-9a-zA-Z]*[##$%][0-9a-zA-Z]+)$
See demo.
Let me know if you have any questions.
I hope this answer will be useful for you, if not, it might be for future readers. I am going to make two assumptions here up front: 1) You do not need regex per se, you are programming in Java. 2) You have access to Java 8.
This could be done the following way:
private boolean stringMatchesChars(final String str, final List<Character> characters) {
return (str.chars()
.filter(ch -> characters.contains((char)ch))
.count() == 1);
}
Here I am:
Using as input a String and a List<Character> of the ones that are allowed.
Obtaining an IntStream (consisting of chars) from the String.
Filtering the stream so that only chars contained in the List<Character> remain.
Returning true only if count() == 1, that is, exactly one of the characters in List<Character> is present.
The code can be used as:
String str1 = "a";
String str2 = "a#";
String str3 = "a##a";
String str4 = "a##a";
List<Character> characters = Arrays.asList('#', '#', '$', '%');
System.out.println("stringMatchesChars(str1, characters) = " + stringMatchesChars(str1, characters));
System.out.println("stringMatchesChars(str2, characters) = " + stringMatchesChars(str2, characters));
System.out.println("stringMatchesChars(str3, characters) = " + stringMatchesChars(str3, characters));
System.out.println("stringMatchesChars(str4, characters) = " + stringMatchesChars(str4, characters));
Resulting in false, true, false, false.
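The stream version above only counts special characters. A plain-loop sketch that also enforces the other two rules from the question (at least one letter or digit, and no other characters) could look like this. SpecialCharValidator and isValid are hypothetical names, and the set of special characters is passed in by the caller rather than hard-coded.

```java
class SpecialCharValidator {
    // Valid when `input` has exactly one character from `specials`,
    // at least one ASCII letter or digit, and nothing else.
    static boolean isValid(String input, String specials) {
        int specialCount = 0;
        int alnumCount = 0;
        for (char c : input.toCharArray()) {
            if (specials.indexOf(c) >= 0) {
                specialCount++;
            } else if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                alnumCount++;
            } else {
                return false; // any other character is rejected outright
            }
        }
        return specialCount == 1 && alnumCount >= 1;
    }

    public static void main(String[] args) {
        String specials = "#$%"; // assumed set for this demo
        for (String s : new String[] {"a#", "#a", "a#$", "a#n01", "an01", "a", "#"}) {
            System.out.println(s + " : " + isValid(s, specials));
        }
    }
}
```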

How to convert "string" to "*s*t*r*i*n*g*"

I need to convert a string like
"string"
to
"*s*t*r*i*n*g*"
What's the regex pattern? Language is Java.
You want to match an empty string, and replace with "*". So, something like this works:
System.out.println("string".replaceAll("", "*"));
// "*s*t*r*i*n*g*"
Or better yet, since the empty string can be matched literally without regex, you can just do:
System.out.println("string".replace("", "*"));
// "*s*t*r*i*n*g*"
Why this works
It's because any string startsWith(""), endsWith(""), and contains(""). Between any two characters in any string, there's an empty string. In fact, there are an infinite number of empty strings at these locations.
(And yes, this is true for the empty string itself. That is, an "empty" string contains itself!)
The regex engine and String.replace automatically advance the index when looking for the next match in these kinds of cases to prevent an infinite loop.
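These claims are easy to check directly. The sketch below uses a hypothetical helper class EmptyMatches with a method star that wraps the replace call shown earlier.

```java
class EmptyMatches {
    // Inserts "*" at every empty-string position: before, between and after the characters.
    static String star(String s) {
        return s.replace("", "*");
    }

    public static void main(String[] args) {
        System.out.println("string".startsWith("")); // true
        System.out.println("string".endsWith(""));   // true
        System.out.println("".contains(""));         // true, even for the empty string
        System.out.println(star("string"));          // *s*t*r*i*n*g*
        System.out.println(star(""));                // * (one empty position)
    }
}
```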
A "real" regex solution
There's no need for this, but it's shown here for educational purposes. Something like this also works:
System.out.println("string".replaceAll(".?", "*$0"));
// "*s*t*r*i*n*g*"
This works by matching "any" character with ., and replacing it with * and that character, by backreferencing to group 0.
To add the asterisk after the last character, we allow . to be matched optionally with .?. This works because ? is greedy and will always take a character if possible, i.e. everywhere except at the very end of the string, where it matches the empty string instead.
If the string may contain newline characters, then use Pattern.DOTALL/(?s) mode.
References
regular-expressions.info/Dot Matches (Almost) Any Character and Grouping and Backreferences
I think "" is the regex you want.
System.out.println("string".replaceAll("", "*"));
This prints *s*t*r*i*n*g*.
If this is all you're doing, I wouldn't use a regex:
public static String glitzItUp(String text) {
return insertPeriodically(text, "*", 1);
}
Putting char into a java string for each N characters
public static String insertPeriodically(
String text, String insert, int period)
{
StringBuilder builder = new StringBuilder(
text.length() + insert.length() * (text.length()/period)+1);
int index = 0;
while (index <= text.length())
{
builder.append(insert);
builder.append(text.substring(index,
Math.min(index + period, text.length())));
index += period;
}
return builder.toString();
}
Another benefit (besides simplicity) is that it's about ten times faster than a regex.
IDEOne | Working example
Just to be a jerk, I'm going to say use J:
I've spent a school year learning Java, and taught myself a bit of J over the course of the summer, and if you're going to be doing this for yourself, it's probably most productive to use J simply because this whole inserting an asterisk thing is easily done with one simple verb definition using one loop.
asterisked =: 3 : 0
i =. 0
running_String =. '*'
while. i < #y do.
NB. #y returns tally, or number of items in y: right operand to the verb
running_String =. running_String, (i{y) , '*'
i =. >: i
end.
]running_String
)
This is why I would use J: I know how to do this, and have only studied the language for a couple months loosely. This isn't as succinct as the whole .replaceAll() method, but you can do it yourself quite easily and edit it to your specifications later. Feel free to delete this/ troll this/ get inflamed at my suggestion of J, I really don't care: I'm not advertising it.
