Java - Pattern Matching

I have some PHP code that uses preg_grep to match several words, in any order and in any context. I'm trying to convert it to Java but can't figure it out.
My PHP code for converting the keywords to a regex string is:
function createRegexSearch($keywords)
{
    $regex = '';
    foreach ($keywords as $key)
        $regex .= '(?=.*' . $key . ')';
    return '/^' . $regex . '/i';
}
It creates a regex string similar to /^(?=.*bot)/i, which should match robot, robots, bots, etc. The same regex string doesn't seem to work in Java, which is leaving me confused. Currently in Java I get a similar effect with contains, but I would rather use regex.
for (Map.Entry<String, String> entry : mKeyList.entrySet())
{
    boolean found = true;
    String val = entry.getValue().toLowerCase();
    for (int i = 0; i < keywords.length; i++)
    {
        if (!val.contains(keywords[i].toLowerCase()))
            found = false;
    }
    if (found)
        ret.add(entry.getValue());
}

One thing Java does differently from many languages is that it has two ways of "matching" a regex against a target: matches() and find(). matches() is the equivalent of putting ^ and $ at the beginning and end of your expression, while find() finds the first match, wherever it might be in the string. For example, while you can find() .*bot in the target string robots, it would not be true to say that it matches() the target. I'm not entirely sure how the lookahead might affect this...
Without posted Java code (containing the problem), it's hard to tell you where you might be going wrong, but my guess is that it could very easily be in this area.
Also, the equivalent of putting /i at the end of your expression in Java (and .NET) is putting (?i) at the beginning of your expression (or of any region you want to be case insensitive). Thus, /[a-f0-9]/i is equivalent to (?i)[a-f0-9]
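To illustrate the matches() vs find() point with the lookahead pattern from the question, here is a minimal sketch. The class name KeywordRegex is made up for the example, and Pattern.quote is an addition (not in the PHP original) so that keywords containing regex metacharacters are treated literally. A pattern built only from lookaheads consumes no characters, so matches() fails on any non-empty input while find() succeeds:

```java
import java.util.regex.Pattern;

public class KeywordRegex {
    // Builds ^(?=.*k1)(?=.*k2)... ; every keyword must occur somewhere, in any order.
    // Pattern.quote is an assumption added here to keep keywords literal.
    static Pattern createRegexSearch(String... keywords) {
        StringBuilder regex = new StringBuilder("^");
        for (String key : keywords) {
            regex.append("(?=.*").append(Pattern.quote(key)).append(")");
        }
        // CASE_INSENSITIVE plays the role of PHP's /i flag
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = createRegexSearch("bot", "ro");
        System.out.println(p.matcher("Robots").find());    // true
        System.out.println(p.matcher("Robots").matches()); // false: nothing is consumed
        System.out.println(p.matcher("bots").find());      // false: "ro" is missing
    }
}
```

If matches() is preferred, appending .* to the generated pattern lets it consume the rest of the input.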

String.contains is case sensitive, while the PHP code behaves case insensitively because of the /i flag. So the two can differ in behavior unless the Java side normalizes case.
If that is the difference, convert both sides to the same case, say with toUpperCase(), before the contains check.
Also, you are using a regex in the PHP code and not in Java; is there any specific reason for this?
Regards
Ajai G

You can use the embedded flag extension (?i), so the regex you should be using to match bot, robot, bots and robots is (?i)^(.*bots?)$. This should work with either String.matches or Pattern/Matcher.

JMPL is a simple Java library that emulates some pattern matching features using Java 8 features.
import org.kl.state.Else;
import static org.kl.pattern.DeconstructPattern.matches;
import static org.kl.pattern.DeconstructPattern.foreach;
import static org.kl.pattern.DeconstructPattern.let;
let(figure, (int w, int h) -> {
    System.out.println("border: " + w + " " + h);
});
matches(figure).as(
    Rectangle.class, (int w, int h) -> System.out.println("square: " + (w * h)),
    Circle.class, (int r) -> System.out.println("square: " + (2 * Math.PI * r)),
    Else.class, () -> System.out.println("Default square: " + 0)
);
foreach(listRectangles, (int w, int h) -> {
    System.out.println("square: " + (w * h));
});

Related

How to remove comments from a given String in Java?

How do I remove comments that start with "//" and with /**, *, etc.? I haven't found any solutions on Stack Overflow that helped me much; a lot of them were way above my head, and I'm still at the very basics.
What I have thought about so far:
for (int i = 0; i < length; i++) {
for (j = i; j < length; j++) {
if (obj.charAt(j) == '/' && obj.charAt(j + 1) == '/')
But I'm not really sure how to replace the words following those characters, or how to know when to stop the replacement for a "//" comment. With /* comments, at least conceptually, I know I should replace all words until "*/" pops up, though again I'm not sure how to limit the replacement to that point. To replace, I thought of replacing each charAt after the second "/" with an empty string until... where? I cannot figure out where to "end" the replacing.
I have looked at a few implementations on Stack Overflow, but I really didn't get them. Any help is appreciated, especially if it's at a basic level and understandable for someone not well versed in programming!
Thanks.

I have done something similar with regex (Java 9+):
// Checks for
// 1) Single char literal '"'
// 2) Single char literal '\"'
// 3) Strings; termination ignores \", \\\", ..., but allows ", \\", \\\\", ...
// 4) Single-line comment // ... to first \n
// 5) Multi-line comments /*[*] ... */
Pattern regex = Pattern.compile(
    "(?s)('\"'|'\\\"'|\".*?(?<!\\\\)(?:\\\\\\\\)*\"|//[^\n]*|/\\*.*?\\*/)");
// Assuming 'text' contains your java text
// Leaves 1,2,3) unchanged and replaces comments 4,5) with ""
// Need quoteReplacement to prevent matcher processing $ and \
String textWithoutComments = regex.matcher(text).replaceAll(
    m -> m.group().charAt(0) == '/' ? "" : Matcher.quoteReplacement(m.group()));
If you don't have Java 9+ then you could use this replace function:
String textWithoutComments = replaceAll(regex, text,
    m -> m.group().charAt(0) == '/' ? "" : m.group());
public static String replaceAll(Pattern p, String s,
        Function<MatchResult, String> replacer) {
    Matcher m = p.matcher(s);
    StringBuilder b = new StringBuilder();
    int lastStart = 0;
    while (m.find()) {
        String replacement = replacer.apply(m);
        b.append(s.substring(lastStart, m.start())).append(replacement);
        lastStart = m.end();
    }
    return b.append(s.substring(lastStart)).toString();
}
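For readers who prefer the character-by-character idea from the question, the same logic can be sketched as a small loop without regex. This is only an illustration under some assumptions: the class and method names are made up, string and char literals are copied verbatim so that a "//" inside a literal survives, and Java text blocks (""" ... """) are not handled.

```java
public class CommentStripper {
    // Removes // and /* */ comments; literals are copied unchanged.
    static String stripComments(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                // Copy a string or char literal up to its closing quote
                char quote = c;
                out.append(c);
                i++;
                while (i < n) {
                    char d = src.charAt(i);
                    out.append(d);
                    i++;
                    if (d == '\\' && i < n) {
                        out.append(src.charAt(i)); // escaped char, e.g. \" or \\
                        i++;
                    } else if (d == quote) {
                        break;
                    }
                }
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '/') {
                // Single-line comment: skip to the newline, which is kept
                while (i < n && src.charAt(i) != '\n') {
                    i++;
                }
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '*') {
                // Multi-line comment: skip past the closing */ (or to the end if unterminated)
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("int a; // first\nint b; /* second */"));
    }
}
```

The "where to end" question has a direct answer here: a "//" comment ends at the next '\n', and a "/*" comment ends at the next "*/".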

I'm not sure if you're using an IDE like IntelliJ or Eclipse, but if you're just interested in removing all comments from the project, you can do this without writing code by using the "Replace in Path" tool. With the "Regex" option checked, the search term is treated as a regular expression, so lines can be matched by pattern.
Configured that way, the tool replaces every line starting with // with an empty line.
The command to get to this on a Mac is ctrl + shift + r.

How to censor website links?

I've been working on a regex censor for quite some time and can't seem to find a decent way of censoring address links (and attempts to circumvent the censor).
Here's what I got so far, ignoring escape sequences:
([a-zA-Z0-9_-]+[\\W[_]]*)+(\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)+([\\w]{2,6})((\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)([\\w]{1,4}))*
I'm not sure what is causing the problem, but it censors the words "com" and "come" and pretty much anything that is about 3+ letters long.
Problem: I want to know how to censor website links and invalid links that are attempts to circumvent the censor. Examples:
Google.com
goo gle .com
g o o g l e . c o m
go o gl e % com
go og le (.) c om
One more thing: is there a way to add links to a whitelist for this? Thank you.

You could use a simple function such as this..
private String hideLink(String link){
    String[] split = link.split("\\.");
    String output = "";
    output += split[0] + ".";
    for(int i = 0; i < split[1].length(); i++){
        output += "*";
    }
    output += "." + split[2];
    return output;
}
calling
hideLink("www.google.com");
returns www.******.com
calling
hideLink("www.msn.net");
returns www.***.net
calling
hideLink("http://abc.12345.org");
returns http://abc.*****.org
etc...

Help building a regex

I need to build a regular expression that finds the word "int" only if it's not part of some string.
I want to find whether int is used in the code. (not in some string, only in regular code)
Example:
int i; // the regex should find this one.
String example = "int i"; // the regex should ignore this line.
logger.i("int"); // the regex should ignore this line.
logger.i("int") + int.toString(); // the regex should find this one (because of the second int)
thanks!

It's not going to be bullet-proof, but this works for all your test cases:
(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
It does a look behind and look ahead to assert that there's either none or two preceding/following quotes "
Here's the code in java with the output:
String regex = "(?<=^([^\"]*|[^\"]*\"[^\"]*\"[^\"]*))\\bint\\b(?=([^\"]*|[^\"]*\"[^\"]*\"[^\"]*)$)";
System.out.println(regex);
String[] tests = new String[] {
    "int i;",
    "String example = \"int i\";",
    "logger.i(\"int\");",
    "logger.i(\"int\") + int.toString();" };
for (String test : tests) {
    System.out.println(test.matches("^.*" + regex + ".*$") + ": " + test);
}
Output (included regex so you can read it without all those \ escapes):
(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
true: int i;
false: String example = "int i";
false: logger.i("int");
true: logger.i("int") + int.toString();
Using a regex is never going to be 100% accurate - you need a language parser. Consider escaped quotes in Strings "foo\"bar", in-line comments /* foo " bar */, etc.
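A different technique that sidesteps the quote counting (swapped in here, not the approach above) is to blank out string literals first and then search for the word. The sketch below assumes one line of code at a time; the class and method names are made up, and comments and char literals are still not handled.

```java
import java.util.regex.Pattern;

public class IntFinder {
    // A double-quoted literal: escaped characters (\" or \\) or anything except quote and backslash
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");
    private static final Pattern INT_WORD = Pattern.compile("\\bint\\b");

    // True if "int" appears as a whole word outside any string literal on the line
    static boolean usesInt(String line) {
        String withoutStrings = STRING_LITERAL.matcher(line).replaceAll("\"\"");
        return INT_WORD.matcher(withoutStrings).find();
    }

    public static void main(String[] args) {
        System.out.println(usesInt("int i;"));                              // true
        System.out.println(usesInt("String example = \"int i\";"));         // false
        System.out.println(usesInt("logger.i(\"int\");"));                  // false
        System.out.println(usesInt("logger.i(\"int\") + int.toString();")); // true
    }
}
```

Unlike the lookaround version, this also copes with escaped quotes such as "foo\"int", because the literal pattern skips over backslash escapes.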

Not exactly sure what your complete requirements are but
^\s*\bint\b
perhaps

Assuming input will be each line,
^int\s[\$_a-zA-Z\;]*$
it follows basic variable naming rules :)

If you want to parse code and search for an isolated int word, this works:
(^int|[\(\ \;,]int)
You can use it to find an int that is either the first word of the line or preceded only by a space, a comma, ";" or a left parenthesis.
You can try it here and enhance it http://www.regextester.com/
PS: this works in all your test cases.

^[^"]*\bint\b
should work. I can't think of a situation where you can use a valid int identifier after the character '"'.
Of course this only applies if the code is limited to one statement per line.

I need a Java regular expression

I am currently using the following regular expression:
^[a-zA-Z]{0,}(\\*?)?[a-zA-Z0-9]{0,}
to check that a string starts with an alpha character, ends with alphanumeric characters, and has an asterisk (*) anywhere in the string but at most once. The problem is that the given string still passes if it starts with a number but doesn't have an *, which should fail. How can I rework the regex to fail this case?
ex.
TE - pass
*TE - pass
TE* - pass
T*E - pass
*9TE - pass
*TE* - fail (multiple asterisk)
9E - fail (starts with number)
EDIT:
Sorry to introduce a late edit but I also need to ensure that the string is 8 characters or less, can I include that in the regex as well? Or should I just check the string length after the regex validation?

This passes your example:
"^([a-zA-Z]+\\*?|\\*)[a-zA-Z0-9]*$"
It says:
start with: [a-zA-Z]+\\*? (one or more letters and maybe a star)
| (or)
\\* a single star
and end with [a-zA-Z0-9]* (zero or more alphanumeric characters)
Code to test it:
public static void main(final String[] args) {
    // Same pattern as above; needs import java.util.regex.Pattern
    final Pattern p = Pattern.compile("^([a-zA-Z]+\\*?|\\*)[a-zA-Z0-9]*$");
    System.out.println(p.matcher("TE").matches());   // true
    System.out.println(p.matcher("*TE").matches());  // true
    System.out.println(p.matcher("TE*").matches());  // true
    System.out.println(p.matcher("T*E").matches());  // true
    System.out.println(p.matcher("*9TE").matches()); // true
    System.out.println(p.matcher("*TE*").matches()); // false
    System.out.println(p.matcher("9E").matches());   // false
}
Per Stargazer, if you allow alphanumeric before the star, then use this:
^([a-zA-Z][a-zA-Z0-9]*\\*?|\\*)\\w*$
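The length limit from the question's edit can also be folded into the pattern with a lookahead instead of a separate length check. A minimal sketch, assuming "8 characters or less" counts the asterisk too; the class and method names are made up, and the alphanumeric class is spelled out rather than using \\w so that underscores are rejected:

```java
import java.util.regex.Pattern;

public class AsteriskValidator {
    // (?=.{1,8}$) asserts 1 to 8 characters in total before the real match starts
    private static final Pattern VALID = Pattern.compile(
            "^(?=.{1,8}$)([a-zA-Z][a-zA-Z0-9]*\\*?|\\*)[a-zA-Z0-9]*$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] tests = {"TE", "*TE", "TE*", "T*E", "*9TE", "T9*Z", "*TE*", "9E", "ABCDEFGHI"};
        for (String test : tests) {
            System.out.println(test + " " + (isValid(test) ? "pass" : "fail"));
        }
    }
}
```

Checking s.length() <= 8 after the regex validation is just as valid and arguably easier to read; the lookahead only saves the second check.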

One possible way is to separate into 2 conditions:
^(?=[^*]*\*?[^*]*$)[a-zA-Z*][a-zA-Z0-9*]*$
The (?=[^*]*\*?[^*]*$) part ensures there is at most one * in the string.
The [a-zA-Z*][a-zA-Z0-9*]* part ensures it starts with a letter or a *, followed only by alphanumeric characters or *.

It might be easier to develop and maintain later if you just break your regular expression into a few pieces, e.g., one for the start and end, and one for the asterisk. I am not sure what the overall performance effect would be: you would have simpler expressions but would have to run a few of them.

This is Python, it'll need some massaging for Java:
>>> import re
>>> p = re.compile('^([a-z][^*]*[*]?[^*]*[a-z0-9]|[*][^*]*[a-z0-9]|[a-z][^*]*[*])$', re.I)
>>> for test in ['TE', '*TE', 'TE*', 'T*E', '*9TE', '*TE*', '9E']:
...     if p.match(test):
...         print(test, 'pass')
...     else:
...         print(test, 'fail')
...
TE pass
*TE pass
TE* pass
T*E pass
*9TE pass
*TE* fail
9E fail
Hope I didn't miss anything.

How about this, it's easier to read:
boolean pass = input.replaceFirst("\\*", "").matches("^[a-zA-Z].*\\w$");
Assuming I read right, you want to:
Start with an alpha character
End with an alphanumeric character
Allow up to one * anywhere

At most one asterisk, alphabetic characters anywhere and numbers anywhere but at start.
String alpha = "[a-zA-Z]";
String alnum = "[a-zA-Z0-9]";
String asteriskNone = "^" + alpha + "+" + alnum + "*";
String asteriskStart = "^\\*" + alnum + "*";
// alnum* (not alnum+) before the asterisk, so that "T*E" passes
String asteriskInside = "^" + alpha + "+" + alnum + "*\\*" + alnum + "*";
String yourRegex = asteriskNone + "|" + asteriskStart + "|"
        + asteriskInside;
String[] tests = {"TE","*TE","TE*","T*E","*9TE","*TE*", "9E"};
for (String test : tests)
    System.out.println(test + " " + (test.matches(yourRegex)?"PASS":"FAIL"));

Look for two possible patterns, one starting with *, and one with an alpha char:
^([a-zA-Z][a-zA-Z0-9]*\*?[a-zA-Z0-9]*|\*[a-zA-Z0-9]*)$

^([a-zA-Z][a-zA-Z0-9]*\*|\*|[a-zA-Z])([a-zA-Z0-9])*$
The parentheses around the second half are for clarity and can be safely omitted.

This was a tough one (liked the challenge), but here it is:
^(\*[a-zA-Z0-9]+|[a-zA-Z]+[\*]{1}[a-zA-Z]*)$
In order to comply with T9*Z, as pointed out on another post with StarGazer712, I had to change it to:
^(\*[a-zA-Z0-9]+|[a-zA-Z]{1}[a-zA-Z0-9]*[\*]{1}[a-zA-Z0-9]*)$

codingbat wordEnds using regex

I'm trying to solve wordEnds from codingbat.com using regex.
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
wordEnds("XYXY", "XY") → "XY"
This is the simplest as I can make it with my current knowledge of regex:
public String wordEnds(String str, String word) {
return str.replaceAll(
".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
.replace("word", java.util.regex.Pattern.quote(word)),
"$1$2"
);
}
replace is used to substitute the actual word string into the pattern, for readability. Pattern.quote isn't necessary to pass their tests, but I think it's required for a proper regex-based solution.
The regex has two major parts:
If after matching as few characters as possible ".*?", word can still be found "(?=word)", then look behind to capture any character immediately preceding it "(?<=(.|^))", match "word", and look ahead to capture any character following it "(?=(.|$))".
The initial "if" test ensures that the atomic lookbehind captures only if there's a word
Using lookahead to capture the following character doesn't consume it, so it can be used as part of further matching
Otherwise match what's left "|.+"
Groups 1 and 2 would capture empty strings
I think this works in all cases, but it's obviously quite complex. I'm just wondering if others can suggest a simpler regex to do this.
Note: I'm not looking for a solution using indexOf and a loop. I want a regex-based replaceAll solution. I also need a working regex that passes all codingbat tests.
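For anyone who wants to try the pattern locally, the method above can be wrapped in a small harness that runs the four examples from the problem statement. The class name WordEnds and the main method are scaffolding added for illustration; the pattern itself is unchanged.

```java
import java.util.regex.Pattern;

public class WordEnds {
    static String wordEnds(String str, String word) {
        // Substitute the quoted word into the pattern, then keep only groups 1 and 2
        String pattern = ".*?(?=word)(?<=(.|^))word(?=(.|$))|.+"
                .replace("word", Pattern.quote(word));
        return str.replaceAll(pattern, "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("abcXY123XYijk", "XY")); // c13i
        System.out.println(wordEnds("XY123XY", "XY"));       // 13
        System.out.println(wordEnds("XY1XY", "XY"));         // 11
        System.out.println(wordEnds("XYXY", "XY"));          // XY
    }
}
```

When the ".+" alternative matches, groups 1 and 2 did not participate, and replaceAll appends nothing for them, which is why the leftover tail disappears from the result.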
I managed to reduce the occurrence of word within the pattern to just one.
".+?(?<=(^|.)word)(?=(.?))|.+"
I'm still looking if it's possible to simplify this further, but I also have another question:
With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?

Based on your solution I managed to simplify the code a little bit:
public String wordEnds(String str, String word) {
    return str.replaceAll(".*?(?="+word+")(?<=(.|^))"+word+"(?=(.|$))|.+","$1$2");
}
Another way of writing it would be:
public String wordEnds(String str, String word) {
    return str.replaceAll(
        String.format(".*?(?=%1$s)(?<=(.|^))%1$s(?=(.|$))|.+", word),
        "$1$2");
}

With this latest pattern, I simplified .|$ to just .? successfully, but if I similarly tried to simplify ^|. to .? it doesn't work. Why is that?
In Oracle's implementation, the behavior of look-behind is as follows:
By "studying" the regex (with study() method in each node), it knows the maximum length and minimum length of the pattern in look-behind group. (The study() method is what allows for obvious look-behind length)
It verifies the look-behind by starting a match at every position from index (current - min_length) to position (current - max_length) and exits early if the condition is satisfied.
Effectively, it will try to verify the look-behind on the shortest string first.
The implementation multiplies the matching complexity by O(k) factor.
This explains why changing ^|. to .? doesn't work: due to the starting position, it effectively checks for word before .word. The quantifier doesn't have a say here, since the ordering is imposed by the match range.
You can check the code of match method in Pattern.Behind and Pattern.NotBehind inner classes to verify what I said above.
In .NET's flavor, look-behind is likely implemented by the reverse matching feature, which means that no extra factor is incurred on the matching complexity.
My suspicion comes from the fact that the capturing group in (?<=(a+))b matches all a's in aaaaaaaaaaaaaab. The quantifier is shown to have free rein in the look-behind group.
I have tested that ^|. can be simplified to .? in .NET and the regex works correctly.

I am working in .NET's regex but I was able to change your pattern to:
.+?(?<=(\w?)word)(?=(\w?))|.+
with positive results. You know it's a word (alphanumeric) type character, so why not give the parser a valid hint of that fact: instead of any character, it's an optional alphanumeric character.
It may answer why you don't need to specify the anchors of ^ and $, for what exactly is $ - is it \r or \n or other? (.NET has issues with $, and maybe you are not exactly capturing a Null of $, but the null of \r or \n which allowed you to change to .? for $)

Another solution to look at...
public String wordEnds(String str, String word) {
    if(str.equals(word)) return "";
    int i = 0;
    String result = "";
    int stringLen = str.length();
    int wordLen = word.length();
    int diffLen = stringLen - wordLen;
    while(i<=diffLen){
        if(i==0 && str.substring(i,i+wordLen).equals(word)){
            result = result + str.charAt(i+wordLen);
        }else if(i==diffLen && str.substring(i,i+wordLen).equals(word)){
            result = result + str.charAt(i-1);
        }else if(str.substring(i,i+wordLen).equals(word)){
            result = result + str.charAt(i-1) + str.charAt(i+wordLen) ;
        }
        i++;
    }
    if(result.length()==1) result = result + result;
    return result;
}

Another possible solution:
public String wordEnds(String str, String word) {
    String result = "";
    if (str.contains(word)) {
        for (int i = 0; i < str.length(); i++) {
            if (str.startsWith(word, i)) {
                if (i > 0) {
                    result += str.charAt(i - 1);
                }
                if ((i + word.length()) < str.length()) {
                    result += str.charAt(i + word.length());
                }
            }
        }
    }
    return result;
}
