How can I correct my regular expression in Java?

I ran into a problem when trying to extract a segment from a string using Java. The original string looks like test/data/20/0000893220-97-000850.txt, and I want to extract the segment that comes after the third /.
My code looks like this:
String m_str = "test/data/20/0000893220-97-000850.txt";
Pattern reg = Pattern.compile("[.*?].txt");
Matcher matcher = reg.matcher(m_str);
System.out.println(matcher.group(0));
The expected result is 0000893220-97-000850, but obviously, I failed. How can I correct this?

[^\/]+$
https://regex101.com/r/tS4nS2/2
This will extract the last segment of the string, i.e. everything after the final slash (including the .txt extension). It works well if that is what you want, as opposed to only the third section.
To find and extract the match, you don't need a capturing group (hence no ()). However, you need to call find() so that the matcher searches for the pattern within the string, since matches() attempts to match the entire string. Here is the relevant bit:
matcher.find(); //finds any occurrence of the pattern in the string
System.out.println(matcher.group()); //returns the entire occurrence
Note the lack of index inside the call .group().
On a separate note, in Java you don't necessarily need a regex; extracting a part of the path can be done with plain Java:
String matched = m_str.split("/")[2];
This would capture the third segment, while
String[] matches = m_str.split("/");
String matched = matches[matches.length-1];
would give you the last part.
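The expected result in the question excludes the .txt extension, so a capturing group can still be handy. The following is a minimal sketch, assuming the input always ends in .txt; the class name LastSegment and the helper lastSegment are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastSegment {
    // Group 1 captures the last path segment; the ".txt" suffix stays outside the group.
    private static final Pattern LAST_SEGMENT = Pattern.compile("([^/]+)\\.txt$");

    // Hypothetical helper: returns the file name without ".txt", or null if nothing matches.
    static String lastSegment(String path) {
        Matcher matcher = LAST_SEGMENT.matcher(path);
        // find() searches within the input; group(1) is only valid after it returns true
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String m_str = "test/data/20/0000893220-97-000850.txt";
        System.out.println(lastSegment(m_str)); // prints 0000893220-97-000850
    }
}
```

If the extension can vary, the pattern would need adjusting; the sketch only covers the case shown in the question.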

Related

IllegalStateException with Pattern/Matcher

I'm using Matcher to capture groups using a regular expression in Java and it keeps throwing an IllegalStateException even though I know that the expression matches.
This is my code:
String safeName = Pattern.compile("(\\.\\w+)$").matcher("google.ca").group();
I'm expecting safeName to be .ca as captured with the capturing group in the regular expression but instead I get:
IllegalStateException: No match found
I also tried with .group(0) and .group(1) but the same error occurs.
According to the documentation for group() and group(int group):
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
What am I doing wrong?

Matcher is a helper class that handles iterating over data to search for substrings matching a regex. The string may contain many substrings that can be matched, so calling group() on its own cannot tell the matcher which match you are interested in. To solve this problem, Matcher lets you iterate over all matching substrings and then use the parts you are interested in.
So before you can use group(), you need to let the Matcher iterate over your string with find() to locate a match for your regex. To check whether the regex matches the entire String, you can use the matches() method instead of find().
Generally to find all matching substrings we are using
Pattern p = Pattern.compile("yourPattern");
Matcher m = p.matcher("yourData");
while(m.find()){
String match = m.group();
//here we can do something with match...
}
Since you are assuming that the text you want to find exists only once in your string (at its end), you don't need a loop; a simple if (or the conditional operator) should solve your problem.
Matcher m = Pattern.compile("(\\.\\w+)$").matcher("google.ca");
String safeName = m.find() ? m.group() : null;
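To make the difference concrete, here is a small sketch that contrasts calling group() before and after find(). The class and method names are invented for this example; only Pattern, Matcher and their methods come from the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupAfterFind {
    private static final Pattern SUFFIX = Pattern.compile("(\\.\\w+)$");

    // Returns the trailing ".xyz" part, or null when the pattern is not found.
    static String suffix(String host) {
        Matcher m = SUFFIX.matcher(host);
        return m.find() ? m.group(1) : null;
    }

    // Shows that group() fails when no match operation has been attempted yet.
    static boolean groupThrowsBeforeFind(String host) {
        Matcher m = SUFFIX.matcher(host);
        try {
            m.group(); // no find()/matches() call has happened yet
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(suffix("google.ca"));                // prints .ca
        System.out.println(groupThrowsBeforeFind("google.ca")); // prints true
    }
}
```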

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but there seems to be no matches being made.
Thanks.

Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com

Java doesn't have a library method that extracts matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]

The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use Braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.
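As an illustration of the named capturing groups mentioned above, the sketch below applies them to the simpler (url|createdOn|updatedOn)="([^"]*)" pattern. The group names name and value, the class name, and the helper method are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroups {
    // "name" and "value" are illustrative group names, not part of the original pattern.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("(?<name>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    // Collects each matched attribute into a map, in the order it appears in the input.
    static Map<String, String> attributes(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(xml);
        while (m.find()) {
            result.put(m.group("name"), m.group("value"));
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
        System.out.println(attributes(source));
    }
}
```

With named groups there is no need to count parentheses or test several numbered groups for null; a repeated attribute would simply overwrite the earlier map entry.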

Getting a specific word in a string

I am new to Java. I have a string:
"rdl_mod_id:0123456789\n\nrdl_mod_name:Driving Test\n\nrdl_mod_type:PUBL\n\nrdl_mod_mode:Practice\n\nrdl_mod_date:2013-04-23"
What I want is to get the words Driving Test. The value changes dynamically, so what I need is the text between rdl_mod_name: and the \n.

Try the following. It should work in your case:
String str = "rdl_mod_id:0123456789\n\nrdl_mod_name:Driving Test\n\nrdl_mod_type:PUBL\n\nrdl_mod_mode:Practice\n\nrdl_mod_date:2013-04-23";
Pattern pattern = Pattern.compile("rdl_mod_name:(.*?)\n");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
More generally, you can combine a regex with Pattern and Matcher to get your desired result.
The following links will also give you a fair idea:
Extract string between two strings in java
Java- Extract part of a string between two special characters
How to get a string between two characters?

I would look into Java regular expressions (regex). The String matches method uses a regex to test whether a string matches a pattern; note that it tests the entire string, not just a part of it. For what you are doing, I would probably start from a pattern like 'rdl_mod_.*\n'. The '.*' is a wildcard, so in this context it means anything between rdl_mod_ and \n. I'm not sure how the pattern handles backslashes (they signify special characters), so you might have to escape them.

Use Java's substring() method together with indexOf().
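A minimal sketch of that approach might look like the following. The class name and the valueOf helper are invented for illustration, and the code assumes the value runs up to the next newline (or to the end of the string if there is none).

```java
public class SubstringExample {
    // Hypothetical helper: returns the text between key and the next '\n', or null if key is absent.
    static String valueOf(String data, String key) {
        int start = data.indexOf(key);
        if (start < 0) {
            return null; // key not present
        }
        start += key.length();              // skip past the key itself
        int end = data.indexOf('\n', start);
        if (end < 0) {
            end = data.length();            // value is the last thing in the string
        }
        return data.substring(start, end);
    }

    public static void main(String[] args) {
        String str = "rdl_mod_id:0123456789\n\nrdl_mod_name:Driving Test\n\nrdl_mod_type:PUBL\n\nrdl_mod_mode:Practice\n\nrdl_mod_date:2013-04-23";
        System.out.println(valueOf(str, "rdl_mod_name:")); // prints Driving Test
    }
}
```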

Try this code :
String s = "rdl_mod_id:0123456789\n\nrdl_mod_name:Driving Test\n\nrdl_mod_type:PUBL\n\nrdl_mod_mode:Practice\n\nrdl_mod_date:2013-04-23";
String sArr[] = s.split("\n\n");
String[] sArr1 = sArr[1].split(":");
System.out.println("sArr1[1] : " + sArr1[1]);
The first call, s.split("\n\n"), splits the string on \n\n.
The second call, sArr[1].split(":"), splits the second element of the array sArr on :, i.e. it splits rdl_mod_name:Driving Test into rdl_mod_name and Driving Test.
sArr1[1] is your desired result.

Using backreference to refer to a pattern rather than actual match

I am trying to write a regex which would match a (not necessarily repeating) sequence of text blocks, e.g.:
foo,bar,foo,bar
My initial thought was to use backreferences, something like
(foo|bar)(,\1)*
But it turns out that this regex only matches foo,foo or bar,bar but not foo,bar or bar,foo (and so on).
Is there any other way to refer to a part of a pattern?
In the real world, foo and bar are 50+ character long regexes and I simply want to avoid copy pasting them to define a sequence.

With a decent regex flavor you could use (foo|bar)(?:,(?-1))* or the like.
But Java does not support subpattern calls.
So you end up having a choice of doing String replace/format like in ajx's answer, or you could condition the comma if you know when it should be present and when not. For example:
(?:(?:foo|bar)(?:,(?!$|\s)|))+

Perhaps you could build your regex bit by bit in Java, as in:
String subRegex = "foo|bar";
String fullRegex = String.format("(%1$s)(,(%1$s))*", subRegex);
The second line could be factored out into a function. The function would take a subexpression and return a full regex that would match a comma-separated list of subexpressions.
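One way that function might look is sketched below. The name commaSeparatedListOf is hypothetical, and the sketch swaps in non-capturing groups (?:...) so the generated regex does not add capture groups of its own.

```java
import java.util.regex.Pattern;

public class ListRegex {
    // Hypothetical helper: builds a regex matching one or more comma-separated occurrences of subRegex.
    static String commaSeparatedListOf(String subRegex) {
        // %1$s reuses the first argument, so the subexpression is written only once
        return String.format("(?:%1$s)(?:,(?:%1$s))*", subRegex);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(commaSeparatedListOf("foo|bar"));
        System.out.println(p.matcher("foo,bar,foo,bar").matches()); // prints true
        System.out.println(p.matcher("foo,baz").matches());         // prints false
    }
}
```

Because the subexpression is passed as a format argument rather than being part of the format string, any % or backslash characters it contains are inserted as-is.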

The point of the back reference is to match the actual text that matches, not the pattern, so I'm not sure you could use that.
Can you use quantifiers like:
String s= "foo,bar,foo,bar";
String externalPattern = "(foo|bar)"; // comes from somewhere else
Pattern p = Pattern.compile(externalPattern + "(?:," + externalPattern + ")+");
Matcher m = p.matcher(s);
boolean b = m.find();
which would match two or more instances of foo or bar separated by commas.

How to make regex matching fail if checked string still has leftover characters?

I'm trying to check a string with a regular expression, and this check should only pass if the string contains only *h, *d, *w and/or *m where * can be any number.
So far I've got this:
Pattern p = Pattern.compile("([0-9]h)|([0-9]d)|([0-9]w)|([0-9]m)");
Matcher m = p.matcher(strToCheck);
if(m.find()){
//matching successful code
}
And it works to detect if there are any of the number-letter combinations present in the checked string, but it also works if the input is, for instance, "12x5d", because it has "5d" in it. I don't know if this is a code problem or a regex problem. Is there a way to achieve what I want?
EDIT:
Thank you for your answers so far, but as requested, I'll try to clarify a bit. A string like "1w 2d 3h" or "1w 1w" is valid and should pass, but something like "1w X 2d 3h", "1wX 2d" or "w d h" should fail.

Use m.matches(), or add ^ and $ to the beginning and end of the regex respectively.
Edit: if you want sequences of these delimited by whitespace (as mentioned in the comments), you can use
Pattern p = Pattern.compile("\\b\\d[hdwm]\\b");
Matcher m = p.matcher(strToCheck);
while(m.find()){
//matching successful code
}

Firstly, I think you should use matches() instead of find(). The former matches the entire string against the regex, whereas the latter searches within the string.
Secondly, you can simplify the regex like so: "[0-9][hdwm]".
Finally, if the number can contain multiple digits, use the + operator: "[0-9]+[hdwm]"

try this:
Pattern p = Pattern.compile("[0-9][hdwm]");
Matcher m = p.matcher(strToCheck);
if(m.matches()){
//matching successful code
}

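If you want to only accept things like 5d as a complete word, rather than just part of one, you can use the \b "word boundary" markers in regex:
Pattern p = Pattern.compile("\\b[0-9][hdwm]\\b");
Used with find(), this will let you match a string like "Dimension: 5h" while rejecting a string like "Dimension: 12wx5h". (The character class matters here: in a pattern such as \\b([0-9]h)|([0-9]d)|..., the \b would apply only to the first and last alternatives, because | has the lowest precedence.)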
(If, on the other hand, you only want to match if the entire string is just 5d or the like, then use matches() as others have suggested.)

You can write it like this: "^\\d+[hdwm]$". This should only match the desired strings.
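For the clarified requirement in the question's edit (strings like "1w 2d 3h" pass, "1w X 2d 3h" fails), one possible sketch combines matches() with a repeated group. It assumes tokens are separated by whitespace and that a number may have several digits; the class and method names are made up.

```java
import java.util.regex.Pattern;

public class DurationCheck {
    // One token such as "12h", followed by zero or more whitespace-separated tokens.
    private static final Pattern DURATIONS =
        Pattern.compile("\\d+[hdwm](?:\\s+\\d+[hdwm])*");

    // matches() requires the whole string to fit the pattern, so leftovers cause a failure.
    static boolean isValid(String strToCheck) {
        return DURATIONS.matcher(strToCheck).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1w 2d 3h"));   // prints true
        System.out.println(isValid("1w 1w"));      // prints true
        System.out.println(isValid("1w X 2d 3h")); // prints false
        System.out.println(isValid("12x5d"));      // prints false
    }
}
```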
