Java REGEX: matching comments and NOT matching specific character

I'm new to Java and having some trouble with regex. I'm trying to find winged comments (/* */) and end-of-line comments (//) in a string so I can split along them and put the pieces in an array.
This is the regex I currently have:
stringofstuff.split("[!//.*?\n!]");
and it works, but my problem is that it's also matching the character "." so when the string contains a number like 90.55, my array looks like [90, 55] which is NOT what I want. I've tried adding ^\\. to the end of the regex after the closing square bracket:
stringofstuff.split("[!//.*?\n!]^\\.");
and it succeeds in not matching . but it no longer recognizes either type of comment! I have no clue where I'm going wrong, any suggestions?

You can use the Pattern and Matcher classes of the java.util.regex package to do so.
For example, to find digits:
Pattern p = Pattern.compile("\\d");
Matcher m = p.matcher(string);
if(m.find())
{
System.out.println(m.start()+" "+m.end()+" "+m.group());
}
Similarly, you can write patterns for the different pieces of text you want to separate out; after each successful find(), the matched text is available from m.group().
For more pattern constructs and more information on the regex package, see:
http://www.regular-expressions.info/java.html
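For the splitting problem in the question itself, it may help to note that square brackets define a character class, so "[!//.*?\n!]" matches any single one of the characters !, /, ., *, ? or a newline, which is why 90.55 is split at the dot. Below is a minimal sketch of one possible alternative. The class name CommentSplit, the method splitOnComments and the sample input are made up for illustration, and the pattern is an assumption that does not account for comment markers inside string literals.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CommentSplit {
    // Assumed pattern (not from the original post): either a // comment running
    // to the end of the line, or a /* ... */ comment. DOTALL lets '.' cross line
    // breaks so that winged comments may span several lines.
    static final Pattern COMMENT =
            Pattern.compile("//[^\\n]*|/\\*.*?\\*/", Pattern.DOTALL);

    // Splits the input around every comment; the comments themselves are dropped.
    static String[] splitOnComments(String source) {
        return COMMENT.split(source);
    }

    public static void main(String[] args) {
        String stringofstuff = "90.55 /* winged */ abc // trailing\nxyz";
        // The dot in 90.55 is not a delimiter here, so the number stays intact.
        System.out.println(Arrays.toString(splitOnComments(stringofstuff)));
    }
}
```

Because the dot only appears inside the comment alternatives, numbers such as 90.55 remain in one piece.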

Related

Cannot match my regular expression

I am trying to match a string that looks like "WIFLYMODULE-xxxx" where the x can be any digit. For example, I want to be able to find the following...
WIFLYMODULE-3253
WIFLYMODULE-1585
WIFLYMODULE-1632
I am currently using
final Pattern q = Pattern.compile("[WIFLYMODULE]-[0-9]{3}");
but I am not picking up the string that I want. So my question is, why is my regular expression not working? Am I going about it in the wrong way?
You should use (...) instead of [...]. [...] denotes a character class:
With a "character class", also called "character set", you can tell the regex engine to match only one out of several characters.
(WIFLYMODULE)-[0-9]{4}
Note: In this case the parentheses are not needed at all. (...) creates a capturing group, which you can access with Matcher.group(index).
Important Note: Use \b as word boundary to match the correct word.
\\bWIFLYMODULE-[0-9]{4}\\b
Sample code:
String str = "WIFLYMODULE-3253 WIFLYMODULE-1585 WIFLYMODULE-1632";
Pattern p = Pattern.compile("\\bWIFLYMODULE-[0-9]{4}\\b");
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group());
}
output:
WIFLYMODULE-3253
WIFLYMODULE-1585
WIFLYMODULE-1632
The regex should be:
"WIFLYMODULE-[0-9]{4}"
Square brackets mean: one of the characters listed inside. Also, you were matching three digits instead of four. So you were matching strings like these (where xxx is a three-digit number):
W-xxx, I-xxx, F-xxx, L-xxx, Y-xxx, M-xxx, O-xxx, D-xxx, U-xxx, L-xxx, E-xxx
You had it match on 3 digits instead of 4. And putting WIFLYMODULE inside [] makes it match on only one of those characters.
final Pattern q = Pattern.compile("WIFLYMODULE-[0-9]{4}");
[...] means that one character out of the ones in the bracket must match and not the string within it.
You, however, want to match WIFLYMODULE, thus, you have to use Pattern.compile("WIFLYMODULE-[0-9]{3}"); or Pattern.compile("(WIFLYMODULE)-[0-9]{3}");
{n} means that the character (or group) must match n-times. In your example you need 4 instead of 3: Pattern.compile("WIFLYMODULE-[0-9]{4}");
This way will work:
final Pattern q = Pattern.compile("WIFLYMODULE-[0-9]{4}");
The pattern breaks down to:
WIFLYMODULE- The literal string WIFLYMODULE-
[0-9]{4} Exactly four digits
What you had was:
[WIFLYMODULE] Any one of the characters in WIFLYMODULE
- The literal string -
[0-9]{3} Exactly three digits
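The difference between the two patterns can be checked with a small sketch like the one below. The class name CharClassDemo and the helper firstMatch are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "WIFLYMODULE-3253";
        // Character class: one letter from the set, a hyphen, then three digits.
        System.out.println(firstMatch("[WIFLYMODULE]-[0-9]{3}", input)); // E-325
        // Literal text followed by exactly four digits.
        System.out.println(firstMatch("WIFLYMODULE-[0-9]{4}", input));   // WIFLYMODULE-3253
    }
}
```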

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but there seems to be no matches being made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Before Java 9, there is no library method that directly extracts all matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
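On Java 9 and later, Matcher.results() offers another way to collect every match. The following is a rough sketch under that assumption; the class name QuotedValues and the method quotedValues are invented for this example, and the pattern simply captures whatever sits between each pair of double quotes.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotedValues {
    // Captures the text between each pair of double quotes.
    static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static String[] quotedValues(String input) {
        return QUOTED.matcher(input)
                .results()               // Java 9+: a stream of MatchResult objects
                .map(r -> r.group(1))    // keep only the text inside the quotes
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(Arrays.toString(quotedValues(input)));
    }
}
```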
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
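As a sketch of how named capturing groups might look here (Java 7 or later): the group names name and value, the class NamedGroupDemo and the method attributes are arbitrary choices for this illustration, not part of the original pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // "name" and "value" are labels chosen for this sketch.
    static final Pattern ATTRIBUTE =
            Pattern.compile("\\b(?<name>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    // Collects each attribute found, in order of appearance.
    static Map<String, String> attributes(String source) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(source);
        while (m.find()) {
            // Groups are read by name, so there is no index counting and no null check.
            found.put(m.group("name"), m.group("value"));
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(attributes(source));
    }
}
```

If an attribute appears twice, the later value overwrites the earlier one in this sketch, so duplicate detection would still need a separate check.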
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.
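For comparison, here is a minimal sketch of the XML-parser route mentioned above, using the JDK's built-in DOM parser. The class name XmlAttributeDemo and the helper attribute are hypothetical, and the sketch assumes the input is a single well-formed element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlAttributeDemo {
    // Parses the XML text and returns the named attribute of the root element.
    // Element.getAttribute returns an empty string when the attribute is absent.
    static String attribute(String xml, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(attribute(xml, "createdOn"));
        System.out.println(attribute(xml, "updatedOn"));
        System.out.println(attribute(xml, "url"));
    }
}
```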

RegEx to find the word between last Upper Case word and another word

My problem is to find a word between two words. Out of these two words one is an all UPPER CASE word which can be anything and the other word is "is". I tried out a few regexes but none are helping me. Here is my example:
String :
In THE house BIG BLACK cat is very good.
Expected output :
cat
RegEx used :
(?<=[A-Z]*\s)(.*?)(?=\sis)
The above RegEx gives me BIG BLACK cat as output whereas I just need cat.
One solution is to simplify your regular expression a bit,
[A-Z]+\s(\w+)\sis
and use only the first captured group (i.e., \1).
Since you came up with something more complex, I assume you understand all the parts of the above expression but for someone who might come along later, here are more details:
[A-Z]+ will match one or more upper-case characters
\s will match a space
(\w+) will match one or more word characters ([a-zA-Z0-9_]) and store the match in the first match group
\s will match a space
is will match "is"
My example is very specific and may break down for different input. Your question didn't provide many details about what other inputs you expect, so I'm not confident my solution will work in all cases.
Try this one:
String TestInput = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile(
"(?<=\\b\\p{Lu}+\\s) # lookbehind assertion to ensure an uppercase word before\n"
+ "\\p{L}+ # matching at least one letter\n"
+ "(?=\\sis) # lookahead assertion to ensure whitespace followed by is comes next\n",
Pattern.COMMENTS);
Matcher m = p.matcher(TestInput);
if(m.find())
System.out.println(m.group(0));
it matches only "cat".
\p{L} is a Unicode property for a letter in any language.
\p{Lu} is a Unicode property for an uppercase letter in any language.
You want to look for a condition that depends on several pieces of information and then retrieve only a specific part of that information. The straightforward way to do that in a regex is with a capturing group. In Java you can do it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // An uppercase word, whitespace, the captured word, whitespace, then "is"
        Pattern pattern = Pattern.compile("[A-Z]+\\s(\\w+)\\sis");
        Matcher matcher = pattern.matcher("In THE house BIG BLACK cat is very good.");
        if (matcher.find())
            System.out.println(matcher.group(1));
    }
}
group(1) is the part of the pattern enclosed in parentheses, in this case (\w+), and that's your word. The return type of group() is String, so you can use it right away.
The following part has strange behavior:
(?<=[A-Z]*\s)(.*?)
For some reason [A-Z]* is matching an empty string, and (.*?) is matching BIG BLACK. With a few tweaks, I think the following will work (but it still matches some false positives):
(?<=[A-Z]+\s)(\w+)(?=\sis)
A slightly better regex would be:
(?<=\b[A-Z]+\s)(\w+)(?=\sis)
Hope it helps
String m = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile("[A-Z]+\\s\\w+\\sis");
Matcher m1 = p.matcher(m);
if(m1.find()){
String[] group = m1.group().split("\\s"); // split the match on whitespace
System.out.println(group[1]); // print the second element
}

How to wrap (surround) java matcher groups with xml?

Using the following value of a text node...
MatcH one MatcHer two MarcH three
How can java matcher.find() be used to create the following output?
<wrap>MatcH</wrap> one MatcHer two <wrap>MarcH</wrap> three
Assuming a java regex that captures all words starting with capital 'M' and ending with a capital 'H'
\bM\w*H\b
Basically, I want to surround anything that matches this regex with wrap tags
String text = "MatcH one MatcHer two MarcH three";
Pattern pattern = Pattern.compile("\\bM\\w*H\\b");
Matcher matcher = pattern.matcher(text);
// replace each time the regex is found
while (matcher.find()) {
text = text.replaceAll(matcher.group(), "<wrap>"
+ matcher.group() + "</wrap>");
}
ReplaceFirst/ReplaceAll is not working for me because it results in the following...
<wrap>MatcH</wrap> one <wrap>MatcH</wrap>er two <wrap>MarcH</wrap> three
Thanks in advance...
Your code is problematic because of the replaceAll call: when the matcher finds MatcH, replaceAll replaces every occurrence of the text MatcH, including the one inside MatcHer, in that iteration of the loop. Note that the \\b isn't part of what group() returns, so nothing prevents MatcHer from being changed.
You can put a System.out.println inside the loop to print the output of group and the output of replaceAll to see what happens and why it does what it does.
Simplifying your code to just the below will work: (that's probably "hard-coding match numbers" but I don't really see a problem with that as it stands and I don't see a simpler solution)
String text = "MatcH one MatcHer two MarcH three";
text = text.replaceAll("\\b(M\\w*H)\\b", "<wrap>$1</wrap>");
The above is how regex is supposed to work. If you see that problems may arise in future using something similar to the above, regex may not be the way to go.
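If the loop over matcher.find() needs to be kept (for example, to decide per match whether to wrap it), one possible approach is Matcher.appendReplacement together with appendTail, which replaces each match at its own position instead of by its text. This is only a sketch; the class name WrapDemo and the method wrap are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapDemo {
    static String wrap(String text) {
        Matcher matcher = Pattern.compile("\\bM\\w*H\\b").matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Only the span that was actually matched is replaced.
            // quoteReplacement guards against '$' or '\' in the matched text.
            String wrapped = "<wrap>" + matcher.group() + "</wrap>";
            matcher.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrap("MatcH one MatcHer two MarcH three"));
    }
}
```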

How to make regex matching fail if checked string still has leftover characters?

I'm trying to check a string with a regular expression, and this check should only pass if the string contains only *h, *d, *w and/or *m where * can be any number.
So far I've got this:
Pattern p = Pattern.compile("([0-9]h)|([0-9]d)|([0-9]w)|([0-9]m)");
Matcher m = p.matcher(strToCheck);
if(m.find()){
//matching successful code
}
And it works to detect if there are any of the number-letter combinations present in the checked string, but it also works if the input is, for instance, "12x5d", because it has "5d" in it. I don't know if this is a code problem or a regex problem. Is there a way to achieve what I want?
EDIT:
Thank you for your answers so far, but as requested, I'll try to clarify a bit. A string like "1w 2d 3h" or "1w 1w" is valid and should pass, but something like "1w X 2d 3h", "1wX 2d" or "w d h" should fail.
Use m.matches(), or add ^ and $ to the beginning and end of the regex, respectively.
Edit: if you want sequences of these delimited by whitespace (as mentioned in the comments), you can use
Pattern p = Pattern.compile("\\b\\d[hdwm]\\b");
Matcher m = p.matcher(strToCheck);
while(m.find()){
//matching successful code
}
Firstly, I think you should use matches() instead of find(). The former matches the entire string against the regex, whereas the latter searches within the string.
Secondly, you can simplify the regex like so: "[0-9][hdwm]".
Finally, if the number can contain multiple digits, use the + operator: "[0-9]+[hdwm]"
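The difference between find() and matches() described above can be seen in a short sketch. The class name FindVsMatches and its two helper methods are hypothetical names used only here.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    // One or more digits followed by one of the unit letters.
    static final Pattern TOKEN = Pattern.compile("[0-9]+[hdwm]");

    // True if the pattern occurs anywhere inside the string.
    static boolean containsToken(String s) {
        return TOKEN.matcher(s).find();
    }

    // True only if the whole string is a single token.
    static boolean isSingleToken(String s) {
        return TOKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsToken("12x5d")); // true: "5d" is found inside
        System.out.println(isSingleToken("12x5d")); // false: the whole string must match
        System.out.println(isSingleToken("15d"));   // true
    }
}
```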
try this:
Pattern p = Pattern.compile("[0-9][hdwm]");
Matcher m = p.matcher(strToCheck);
if(m.matches()){
//matching successful code
}
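If you want to only accept things like 5d as a complete word, rather than just part of one, you can use the \b "word boundary" markers in regex. The markers must apply to every alternative, so use a character class (or group the alternatives) rather than a bare alternation:
Pattern p = Pattern.compile("\\b[0-9][hdwm]\\b");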
This will let you match a string like "Dimension: 5h" while rejecting a string like "Dimension: 12wx5h".
(If, on the other hand, you only want to match if the entire string is just 5d or the like, then use matches() as others have suggested.)
You can write it like this: "^\\d+[hdwm]$", which should match only the desired strings.
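To cover the clarified requirement from the question's edit (several tokens separated by whitespace, such as "1w 2d 3h"), the single-token pattern can be extended with an optional repeated part. This is a sketch under the assumption that tokens are separated by one or more whitespace characters and that leading or trailing whitespace is not allowed; the class name DurationList and the method isValid are made up for the example.

```java
import java.util.regex.Pattern;

class DurationList {
    // One token, followed by any number of "whitespace + token" repetitions.
    static final Pattern VALID =
            Pattern.compile("\\d+[hdwm](?:\\s+\\d+[hdwm])*");

    // matches() requires the entire string to fit the pattern.
    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"1w 2d 3h", "1w 1w", "1w X 2d 3h", "1wX 2d", "w d h", "12x5d"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```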
