Help building a regex - java

I need to build a regular expression that finds the word "int" only if it's not part of some string.
I want to find whether int is used in the code. (not in some string, only in regular code)
Example:
int i; // the regex should find this one.
String example = "int i"; // the regex should ignore this line.
logger.i("int"); // the regex should ignore this line.
logger.i("int") + int.toString(); // the regex should find this one (because of the second int)
thanks!

It's not going to be bullet-proof, but this works for all your test cases:
(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
It uses a lookbehind and a lookahead to assert that there are either zero or two quotes (") before and after the match.
Here's the code in java with the output:
String regex = "(?<=^([^\"]*|[^\"]*\"[^\"]*\"[^\"]*))\\bint\\b(?=([^\"]*|[^\"]*\"[^\"]*\"[^\"]*)$)";
System.out.println(regex);
String[] tests = new String[] {
"int i;",
"String example = \"int i\";",
"logger.i(\"int\");",
"logger.i(\"int\") + int.toString();" };
for (String test : tests) {
System.out.println(test.matches("^.*" + regex + ".*$") + ": " + test);
}
Output (included regex so you can read it without all those \ escapes):
(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
true: int i;
false: String example = "int i";
false: logger.i("int");
true: logger.i("int") + int.toString();
Using a regex is never going to be 100% accurate - you need a language parser. Consider escaped quotes in Strings "foo\"bar", in-line comments /* foo " bar */, etc.
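To illustrate the point about needing something closer to a parser, here is a minimal sketch of a hand-written scanner. It is not a real Java lexer: it only skips string literals, char literals and comments, and then compares each identifier-like word with "int". The class and method names (IntKeywordScanner, findIntOutsideLiterals) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class IntKeywordScanner {

    /** Returns the start offsets of the word "int" found outside literals and comments. */
    static List<Integer> findIntOutsideLiterals(String src) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                // Skip a string or char literal, honouring backslash escapes.
                char quote = c;
                i++;
                while (i < n && src.charAt(i) != quote) {
                    if (src.charAt(i) == '\\') i++; // skip the escaped character
                    i++;
                }
                i++; // step over the closing quote
            } else if (src.startsWith("//", i)) {
                // Line comment: skip to the end of the line.
                while (i < n && src.charAt(i) != '\n') i++;
            } else if (src.startsWith("/*", i)) {
                // Block comment: skip to the closing */ (or to the end of input).
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else if (Character.isJavaIdentifierStart(c)) {
                // Read a whole identifier or keyword and compare it with "int".
                int start = i;
                while (i < n && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                if (src.substring(start, i).equals("int")) hits.add(start);
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] tests = {
            "int i;",
            "String example = \"int i\";",
            "logger.i(\"int\");",
            "logger.i(\"int\") + int.toString();",
            "String s = \"foo\\\"bar int\"; /* int \" */"
        };
        for (String test : tests) {
            System.out.println(!findIntOutsideLiterals(test).isEmpty() + ": " + test);
        }
    }
}
```

Unlike the regex, this copes with escaped quotes and with quotes inside comments, but it still ignores things like text blocks and unicode escapes, so treat it as a starting point only.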

Not exactly sure what your complete requirements are but
^\s*\bint\b
perhaps

Assuming input will be each line,
^int\s[\$_a-zA-Z\;]*$
it follows basic variable naming rules :)

If you want to scan code for int as an isolated word, this works:
(^int|[\(\ \;,]int)
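It finds an int that is either the first word on the line or is preceded by a space, a comma, a semicolon, or a left parenthesis. Note that it does not check what follows, so it also matches the start of longer identifiers such as integer.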
You can try it here and enhance it http://www.regextester.com/
PS: this works in all your test cases.

^[^"]*\bint\b
should work. I can't think of a situation where you can use a valid int identifier after the character '"'.
Of course this only applies if the code is limited to one statement per line.

Related

How to find a String of last 2 items in colon separated string

I have a string = ab:cd:ef:gh. On this input, I want to return the string ef:gh (third colon intact).
The string apple:orange:cat:dog should return cat:dog (there's always 4 items and 3 colons).
I could have a loop that counts colons and makes a string of characters after the second colon, but I was wondering if there exists some easier way to solve it.
You can use the split() method for your string.
String example = "ab:cd:ef:gh";
String[] parts = example.split(":");
System.out.println(parts[parts.length-2] + ":" + parts[parts.length-1]);
String example = "ab:cd:ef:gh";
String[] parts = example.split(":",3); // create at most 3 Array entries
System.out.println(parts[2]);
The split function might be what you're looking for here. Use the colon, like in the documentation as your delimiter. You can then obtain the last two indexes, like in an array.
Yes, there is an easier way.
The first is to use the split method of the String class:
String txt = "ab:cd:ef:gh";
String[] arr = txt.split(":");
System.out.println(arr[arr.length-2] + ":" + arr[arr.length-1]);
and the second, is to use Matcher class.
Use overloaded version of lastIndexOf(), which takes the starting index as 2nd parameter:
str.substring(str.lastIndexOf(":", str.lastIndexOf(":") - 1) + 1)
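Spelled out step by step, the lastIndexOf() idea could look like the sketch below. The class and method names are invented for the example. On "ab:cd:ef:gh" the last colon is at index 8, the search backwards from index 7 finds the previous colon at index 5, and the substring from index 6 is "ef:gh".

```java
class LastTwoByIndex {

    /** Returns everything after the second-to-last colon. */
    static String lastTwo(String str) {
        int last = str.lastIndexOf(':');                 // position of the final colon
        int secondLast = str.lastIndexOf(':', last - 1); // search backwards from just before it
        // If there are fewer than two colons, secondLast is -1 and the whole string is returned.
        return str.substring(secondLast + 1);
    }

    public static void main(String[] args) {
        System.out.println(lastTwo("ab:cd:ef:gh"));          // ef:gh
        System.out.println(lastTwo("apple:orange:cat:dog")); // cat:dog
    }
}
```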
Another solution would be using a Pattern to match your input, something like [^:]+:[^:]+$. Using a pattern would probably be easier to maintain as you can easily change it to handle for example other separators, without changing the rest of the method.
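Using a precompiled pattern may also be more efficient than String.split(), which in general compiles its argument into a Pattern on every call and does more work than is needed here. (OpenJDK's String.split() has a fast path for single-character delimiters such as ":" that avoids regex compilation, so measure before relying on this.)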
This would give something like this:
String example = "ab:cd:ef:gh";
Pattern regex = Pattern.compile("[^:]+:[^:]+$");
final Matcher matcher = regex.matcher(example);
if (matcher.find()) {
// extract the matching group, which is what we are looking for
System.out.println(matcher.group()); // prints ef:gh
} else {
// handle invalid input
System.out.println("no match");
}
Note that you would typically extract regex as a reusable constant to avoid compiling the pattern every time. Using a constant would also make the pattern easier to change without looking at the actual code.
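A minimal sketch of that "reusable constant" suggestion, assuming a helper class of your own; the names LastTwoItems, LAST_TWO and lastTwo are hypothetical:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastTwoItems {

    // Compiled once when the class is loaded and reused by every call.
    private static final Pattern LAST_TWO = Pattern.compile("[^:]+:[^:]+$");

    /** Returns the last two colon-separated items, or empty if the input has no such tail. */
    static Optional<String> lastTwo(String input) {
        Matcher matcher = LAST_TWO.matcher(input);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lastTwo("ab:cd:ef:gh").orElse("no match")); // ef:gh
        System.out.println(lastTwo("abc").orElse("no match"));         // no match
    }
}
```

Returning an Optional is just one way to signal invalid input; throwing an exception would work equally well.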

How to properly use java Pattern object to match string patterns

I wrote some code that performs several string operations, including checking whether a given string matches a certain regular expression. It ran fine with 70,000 inputs, but it started to give me an out-of-memory error when I ran it iteratively for five-fold cross-validation. It may simply be that I have to assign more memory, but I have a feeling that my code is inefficient, so I wanted to double-check that I didn't make any obvious mistake.
static Pattern numberPattern = Pattern.compile("^[a-zA-Z]*([0-9]+).*");
public static boolean someMethod(String line) {
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++) {
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
if(numberPattern.matcher(tokens[i]).find()) return true;
}
return false;
}
and I have also many lines like below:
token.matches("[a-z]+[A-Z][a-z]+");
Which way is more memory efficient? Do they look efficient enough? Any advice is appreciated!
Edited:
Sorry, the code I posted was wrong; I intended to fix it before posting this question but forgot at the last minute. Aside from the fact that the example code did not make sense, I have many similar-looking operations all over, and I wanted to know whether the regexp comparison part is efficient.
Thanks for all of your comments, I'll look through and modify the code following the advice!
Well, first of all, take a second look at your code: it will always return true! You never read the 'match' variable, you only assign to it.
Second, String is immutable, so each time you split or replace you create new instances. Why not write a pattern that makes the matches you want while ignoring the commas and semicolons? I'm not sure, but I think it would use less memory.
Yes, this code is indeed inefficient, because you can return immediately once you've found that match = true (there is no point in continuing to loop).
Further, are you sure you need to break the line into tokens ? why not check the regex only once ?
And last, if all comparisons checks failed, you should return false (last line).
Instead of altering the text and splitting it you can put it all in the regex.
// the \\b means it must be the start of the String or a word
static Pattern numberPattern = Pattern.compile("\\b[a-zA-Z,;]*[0-9,;]*[0-9]");
// return true if the string contains
// a number which might have letters in front
public static boolean someMethod(String line) {
return numberPattern.matcher(line).find();
}
Aside from what @alfasin has mentioned in his answer, you should avoid duplicating code. Rewrite the following:
{
tokens[i] = tokens[i].replace(",", "");
tokens[i] = tokens[i].replace(";", "");
}
Into:
tokens[i] = tokens[i].replaceAll(",|;", "");
And compute this once, before the .split(), so that the operation doesn't have to be repeated inside the loop:
String[] tokens = line.replaceAll(",|;", "").split(" ");
^^^^^^^^^^^^^^^^^^^^^^
Edit: After staring at your code for a bit I think I have a better solution, using regex ;)
public static boolean someMethod(String line) {
return Pattern.compile("\\b[a-zA-Z]*\\d")
.matcher(line.replaceAll(",|;", "")).find();
}
\b is a word boundary.
It asserts a position between a word character and a non-word character, or at the start or end of the input (for example at the start of a line or right after a space).
Code Demo STDOUT:
foo does not match
bar does not match
bar1 does match
foo baz bar bar1 lolz does match
password_01 does not match
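On the original question of which form is more efficient: String.matches(regex) compiles the regex on every call, so for checks that run many times it is usually cheaper to compile each Pattern once and reuse it. A rough sketch under that assumption; the class, constant and method names here are made up, and the patterns are the ones from the question and the answer above:

```java
import java.util.regex.Pattern;

class TokenChecks {

    // Each pattern is compiled once and shared by all calls.
    private static final Pattern CAMEL = Pattern.compile("[a-z]+[A-Z][a-z]+");
    private static final Pattern HAS_NUMBER = Pattern.compile("\\b[a-zA-Z]*\\d");
    private static final Pattern PUNCTUATION = Pattern.compile("[,;]");

    /** Equivalent to token.matches("[a-z]+[A-Z][a-z]+") without recompiling the regex. */
    static boolean isCamel(String token) {
        return CAMEL.matcher(token).matches();
    }

    /** Strips commas and semicolons, then looks for a word that is letters followed by a digit. */
    static boolean containsNumber(String line) {
        String cleaned = PUNCTUATION.matcher(line).replaceAll("");
        return HAS_NUMBER.matcher(cleaned).find();
    }

    public static void main(String[] args) {
        String[] lines = {"foo", "bar1", "foo baz bar bar1 lolz", "password_01"};
        for (String line : lines) {
            System.out.println(line + (containsNumber(line) ? " does match" : " does not match"));
        }
    }
}
```

Whether this fixes an out-of-memory error is a separate matter; it mainly cuts down on short-lived garbage, so a profiler is the better tool for finding the real cause.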

Remove Punctuation issue

I'm trying to find a word in a string. However, because of a trailing period, one occurrence is not recognized. I'm trying to remove punctuation, but it seems to have no effect. Am I missing something here? This is the line of code I am using: s.replaceAll("([a-z] +) [?:!.,;]*","$1");
String test = "This is a line about testing tests. Tests are used to examine stuff";
String key = "tests";
int counter = 0;
String[] testArray = test.toLowerCase().split(" ");
for(String s : testArray)
{
s.replaceAll("([a-z] +) [?:!.,;]*","$1");
System.out.println(s);
if(s.equals(key))
{
System.out.println(key + " FOUND");
counter++;
}
}
System.out.println(key + " has been found " + counter + " times.");
}
I managed to find a solution (though it may not be ideal) by using s = s.replaceAll("\\W", ""); Thanks for everyone's guidance on how to solve this problem.
You could also take advantage of the regex in the split operation. Try this:
String[] testArray = test.toLowerCase().split("\\W+");
Note that this will also split on apostrophes, so you may need to tweak it to use a specific list of characters.
Strings are immutable. You need to assign the result of replaceAll back to the variable:
s = s.replaceAll("([a-z] +)*[?:!.,;]*", "$1");
^
Also, your regex requires a space between the word and the punctuation. In the case of tests., there is none. You can adjust your regex to make that space optional (zero or more) to account for this.
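Putting both points together (assign the result, and don't require a space before the punctuation), the loop from the question could be sketched like this. The simpler character class [?:!.,;] is my substitution for the original regex, and the class and method names are invented for the example:

```java
import java.util.Locale;

class KeywordCounter {

    /** Counts how often key occurs as a whole word, ignoring case and trailing punctuation. */
    static int count(String text, String key) {
        int counter = 0;
        for (String s : text.toLowerCase(Locale.ROOT).split(" ")) {
            // replaceAll returns a new String, so the result has to be assigned.
            s = s.replaceAll("[?:!.,;]", "");
            if (s.equals(key)) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String test = "This is a line about testing tests. Tests are used to examine stuff";
        System.out.println("tests has been found " + count(test, "tests") + " times."); // 2 times
    }
}
```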
Your regex doesn't seem to work as you want.
If you want to find something which has period after that then this will work
([a-z]*)[?(:!.,;)*]
it returns "tests." when it's run on your given string.
Also
[?(:!.,;)*]
just matches the punctuation, which can then be replaced.
However I am not sure why you are not using substring() function.

I need a Java regular expression

I am currently using the following regular expression:
^[a-zA-Z]{0,}(\\*?)?[a-zA-Z0-9]{0,}
to check that a string starts with an alpha character, ends with alphanumeric characters, and has an asterisk (*) anywhere in the string, but at most once. The problem is that a string still passes if it starts with a number and doesn't contain an *, which should fail. How can I rework the regex to fail this case?
ex.
TE - pass
*TE - pass
TE* - pass
T*E - pass
*9TE - pass
*TE* - fail (multiple asterisk)
9E - fail (starts with number)
EDIT:
Sorry to introduce a late edit but I also need to ensure that the string is 8 characters or less, can I include that in the regex as well? Or should I just check the string length after the regex validation?
This passes your example:
"^([a-zA-Z]+\\*?|\\*)[a-zA-Z0-9]*$"
It says:
start with: [a-zA-Z]+\\*? (one or more letters, optionally followed by a star)
| (or)
\\* a single star
and end with [a-zA-Z0-9]* (zero or more alphanumeric characters)
Code to test it:
public static void main(final String[] args) {
final Pattern p = Pattern.compile("^([a-zA-Z]+\\*?|\\*)[a-zA-Z0-9]*$");
System.out.println(p.matcher("TE").matches());
System.out.println(p.matcher("*TE").matches());
System.out.println(p.matcher("TE*").matches());
System.out.println(p.matcher("T*E").matches());
System.out.println(p.matcher("*9TE").matches());
System.out.println(p.matcher("*TE*").matches());
System.out.println(p.matcher("9E").matches());
}
Per Stargazer, if you allow alphanumeric before the star, then use this:
^([a-zA-Z][a-zA-Z0-9]*\\*?|\\*)\\w*$
One possible way is to separate into 2 conditions:
^(?=[^*]*\*?[^*]*$)[a-zA-Z*][a-zA-Z0-9*]*$
The (?=[^*]*\*?[^*]*$) part ensures there is at most one * in the string.
The [a-zA-Z*][a-zA-Z0-9*]* part ensures the string starts with a letter or a *, followed only by alphanumeric characters or *.
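The question's late edit also asks for a maximum length of 8. With the lookahead version this can be folded into the same regex by bounding the second character class, as in the sketch below. This is my extension of the pattern, not part of the original answer, and the class and method names are invented. Checking input.length() <= 8 separately would be just as valid and arguably easier to read.

```java
import java.util.regex.Pattern;

class StarredCode {

    // Lookahead: at most one '*' in the whole string.
    // Body: first char is a letter or '*', then up to 7 more alphanumerics or '*',
    // which gives 1 to 8 characters in total.
    private static final Pattern VALID =
            Pattern.compile("^(?=[^*]*\\*?[^*]*$)[a-zA-Z*][a-zA-Z0-9*]{0,7}$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] tests = {"TE", "*TE", "TE*", "T*E", "*9TE", "*TE*", "9E", "ABCDEFGH", "ABCDEFGHI"};
        for (String test : tests) {
            System.out.println(test + " - " + (isValid(test) ? "pass" : "fail"));
        }
    }
}
```

Note that, like the pattern it is based on, this accepts a lone "*".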
It might be easier to develop and maintain later if you just break your regular expressions into a few pieces, e.g., one for the start and end, and one for the asterisk. I am not sure what the overall performance effect would be, you would have simpler expressions but have to run a few of them.
This is Python, it'll need some massaging for Java:
>>> import re
>>> p = re.compile('^([a-z][^*]*[*]?[^*]*[a-z0-9]|[*][^*]*[a-z0-9]|[a-z][^*]*[*])$', re.I)
>>> for test in ['TE', '*TE', 'TE*', 'T*E', '*9TE', '*TE*', '9E']:
...     if p.match(test):
...         print test, 'pass'
...     else:
...         print test, 'fail'
...
TE pass
*TE pass
TE* pass
T*E pass
*9TE pass
*TE* fail
9E fail
Hope I didn't miss anything.
How about this, it's easier to read:
boolean pass = input.replaceFirst("\\*", "").matches("^[a-zA-Z].*\\w$");
Assuming I read right, you want to:
Start with an alpha character
End with an alphanumeric character
Allow up to one * anywhere
At most one asterisk, alphabetic characters anywhere and numbers anywhere but at start.
String alpha = "[a-zA-Z]";
String alnum = "[a-zA-Z0-9]";
String asteriskNone = "^" + alpha + "+" + alnum + "*";
String asteriskStart = "^\\*" + alnum + "*";
String asteriskInside = "^" + alpha + "+" + alnum + "*\\*" + alnum + "*"; // alnum* before the star so that T*E passes
String yourRegex = asteriskNone + "|" + asteriskStart + "|"
+ asteriskInside;
String[] tests = {"TE","*TE","TE*","T*E","*9TE","*TE*", "9E"};
for (String test : tests)
System.out.println(test + " " + (test.matches(yourRegex)?"PASS":"FAIL"));
Look for two possible patterns, one starting with *, and one with an alpha char:
^[a-zA-Z][a-zA-Z0-9]*\*?[a-zA-Z0-9]*|\*[a-zA-Z0-9]*
^([a-zA-Z][a-zA-Z0-9]*\*|\*|[a-zA-Z])([a-zA-Z0-9])*$
The parentheses around the second half are for clarity and can be safely omitted.
This was a tough one (liked the challenge), but here it is:
^(\*[a-zA-Z0-9]+|[a-zA-Z]+[\*]{1}[a-zA-Z]*)$
In order to comply with T9*Z, as pointed out on another post with StarGazer712, I had to change it to:
^(\*[a-zA-Z0-9]+|[a-zA-Z]{1}[a-zA-Z0-9]*[\*]{1}[a-zA-Z0-9]*)$

Regular expression, value in between quotes

I'm having a little trouble constructing the regular expression using java.
The constraint is that I need to split a string separated by !. The two strings will be enclosed in double quotes.
For example:
"value"!"value"
If I performed a java split() on the string above, I want to get:
value
value
However the catch is value can be any characters/punctuations/numerical character/spaces/etc..
So here's a more concrete example. Input:
""he! "l0"!"wor!"d1"
Java's split() should return:
"he! "l0
wor!"d1
Any help is much appreciated. Thanks!
Try this expression: (".*")\s*!\s*(".*")
Although it would not work with split, it should work with Pattern and Matcher and return the 2 strings as groups.
String input = "\" \"he\"\"\"\"! \"l0\" ! \"wor!\"d1\"";
Pattern p = Pattern.compile("(\".*\")\\s*!\\s*(\".*\")");
Matcher m = p.matcher(input);
if(m.matches())
{
String s1 = m.group(1); //" "he""""! "l0"
String s2 = m.group(2); //"wor!"d1"
}
Edit:
This would not work for all cases, e.g. "he"!"llo" ! "w" ! "orld" would produce the wrong groups. In that case it would be really hard to determine which ! is the separator. That's why rarely used characters are often chosen to separate parts of a string, like @ in email addresses :)
Split the value on "!" instead of on !
String REGEX = "\"!\"";
String INPUT = "\"\"he! \"l0\"!\"wor!\"d1\"";
Pattern p = Pattern.compile(REGEX);
String[] items = p.split(INPUT); // the line's outer quotes remain on the first and last item
It feels like you need to parse on:
DOUBLEQUOTE = "
OTHER = anything that isn't a double quote
EXCLAMATION = !
ITEM = DOUBLEQUOTE (OTHER | (DOUBLEQUOTE OTHER DOUBLEQUOTE))* DOUBLEQUOTE
LINE = ITEM (EXCLAMATION ITEM)*
It feels like it's possible to create a regular expression for the above (assuming the double quotes in an ITEM can't be nested even further), BUT it might be better served by a very simple grammar.
This might work... excusing missing escapes and the like
^"([^"]*|"[^"]*")*"(!"([^"]*|"[^"]*")*")*$
Another option would be to match against the first part, then, if there's a !and more, prune off the ! and keep matching (excuse the no-particular-language, I'm just trying to illustrate the idea):
resultList = []
while(string matches \^"([^"]*|"[^"]*")*(.*)$" => match(1)) {
resultList += match
string = match(2)
if(string.beginsWith("!")) {
string = string[1:end]
} elseif(string.length > 0) {
// throw an error, since there was no exclamation and the string isn't done
}
}
if(string.length > 0) {
// throw an exception since the string isn't done
}
resultList == the list of items in the string
EDIT: I realized that my answer doesn't really work. You can have a single doublequote inside the strings, as well as exclamation marks. As such, you really CAN'T have "!" inside one of the strings. As such, the idea of 1) pull quotes off the ends, 2) split on '"!"' is really the right way to go.
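The two steps from that edit (pull the quotes off the ends, then split on "!") could be sketched in Java as follows. The class and method names are invented for the example, and it assumes, as the edit says, that the sequence "!" never appears inside a value:

```java
import java.util.Arrays;

class QuotedSplit {

    /** Strips the line's outer double quotes, then splits on the three-character separator "!". */
    static String[] splitQuoted(String line) {
        if (line.length() < 2 || !line.startsWith("\"") || !line.endsWith("\"")) {
            throw new IllegalArgumentException("expected a line wrapped in double quotes: " + line);
        }
        String inner = line.substring(1, line.length() - 1);
        // Neither '"' nor '!' is a regex metacharacter, so no escaping is needed for split().
        // The limit of -1 keeps trailing empty values such as in ""!"".
        return inner.split("\"!\"", -1);
    }

    public static void main(String[] args) {
        String input = "\"\"he! \"l0\"!\"wor!\"d1\"";
        System.out.println(Arrays.toString(splitQuoted(input)));
    }
}
```

For the example input this yields the two values "he! "l0 and wor!"d1 that the question asks for.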
