Need help in regex matching - java

It may be very simple, but I am extremely new to regex and need to match a pattern in a string and extract a number from it. Below is my code with sample input and the required output. I tried to construct the Pattern with the help of https://www.freeformatter.com/java-regex-tester.html, but matches() returns false for every sample.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above println calls print false. I am using Java 8. Is there any way to match the pattern and extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug/develop the regex. Please let me know if something is not clear in my question.

You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
See the regex demo
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
Java demo:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
    Matcher m = pattern.matcher(s);
    if (m.matches()) {
        System.out.println(s + ": \"" + m.group(1) + "\"");
    }
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
    System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
See another Java demo.
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"
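As for how to debug a pattern like the one in the question, one approach that may help is to build it up piece by piece and see where it stops matching. The sketch below does that with Matcher.lookingAt(), which only requires a match at the start of the input. The class and method names (RegexDebug, firstFailingPiece) are made up for illustration, and the way the original pattern is cut into pieces is just one possible choice.

```java
import java.util.regex.Pattern;

class RegexDebug {
    /**
     * Appends the pieces one at a time and returns the index of the first piece
     * after which the growing pattern no longer matches a prefix of the input.
     * Returns pieces.length if every piece matches but text is left over at the
     * end (so matches() fails), and -1 if the whole input matches.
     */
    static int firstFailingPiece(String input, String... pieces) {
        StringBuilder soFar = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            soFar.append(pieces[i]);
            if (!Pattern.compile(soFar.toString()).matcher(input).lookingAt()) {
                return i;
            }
        }
        return Pattern.matches(soFar.toString(), input) ? -1 : pieces.length;
    }

    public static void main(String[] args) {
        // The original pattern from the question, cut into four pieces.
        String[] pieces = {".*/", "(a-b|c-d|e-f)/", "([0-9])+", "(#[0-9]?)"};
        System.out.println(firstFailingPiece("foo/bar/Samsung-Galaxy/a-b/1", pieces));
        System.out.println(firstFailingPiece("foo/bar/Samsung-Galaxy/c-d/1#P2", pieces));
    }
}
```

For the first sample this returns 3: the last piece requires a literal #, and there is none. For the second it returns 4, meaning every piece matched but P2 is left over, which is why matches() is false.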

Because you always want the last number in the string, I would just use replaceAll with the regex .*?(\d+)$ :
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(strResult1.matches("\\d+") ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(strResult2.matches("\\d+") ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(strResult3.matches("\\d+") ? "result " + strResult3 : "no result");
If the string does not end with a number, replaceAll returns it unchanged (not empty), so the result is checked against \d+ to detect that case.
Outputs
result 1
result 2
result 69

Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
    String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
    return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.

Related

Regex to capture the string starting with a specific word or character and ending with either one of the words

Want to capture the string after the last slash and before either a (; sid=) word or a (?) character.
sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;
Expecting the below output:
1. itemSummary
2. itemList
3. ''(empty string)
I have built the regex below to capture it, but it is not 100% accurate: it captures some additional text.
Regex
Url=.*\/(.*)(; sid|\?)
Could you please help me to improve the regex to get desired output?
Thanks in advance!

You may use this regex in Java with a greedy match after Url=:
\bUrl=\S+/([^?;/]+)(?=; sid|\?)
RegEx Demo
RegEx Details:
\b: Word boundary
Url=: Match text Url=
\S+/: Match 1+ non-whitespace characters followed by a /
([^?;/]+): Match 1+ characters that are not ?, ; or / (capture group 1)
(?=; sid|\?): Lookahead to assert that we have ; sid or ? ahead
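In Java this could be used roughly as follows. Note that with [^?;/]+ the third sample (where the URL ends in /) produces no match at all, so the sketch below uses * instead of + to let the group be empty, which is what the expected output asks for. The class and method names (LastPathSegment, lastSegment) and the shortened sample lines are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastPathSegment {
    // Same pattern as above, but with '*' so that a URL ending in '/' yields "".
    private static final Pattern URL_TAIL =
            Pattern.compile("\\bUrl=\\S+/([^?;/]*)(?=; sid|\\?)");

    /** Returns the text after the last slash of the Url= value, or null if nothing matches. */
    static String lastSegment(String line) {
        Matcher m = URL_TAIL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "sessionId=1; Url=https://www.example.com/order/itemSummary; sid=abc; targetUrl=https://www.example.com/page1?id=1;",
            "sessionId=2; Url=https://www.example.com/order/itemList?id=7&type=new; sid=def",
            "sessionId=3; Url=https://www.example.com/order/; sid=ghi;"
        };
        int count = 0;
        for (String line : lines) {
            System.out.printf("%d. '%s'%n", ++count, lastSegment(line));
        }
    }
}
```

The \b in front of Url= is what keeps targetUrl= from matching, since there is no word boundary between t and U.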

Alternative solution:
Used regex:
"^Url=.*/(\\w+|)$"
Regex in test bench and context:
public static void main(String[] args) {
    String input1 = "sessionId=30a793b1-ed7e-464a-a630; "
            + "Url=https://www.example.com/mybook/order/newbooking/itemSummary; "
            + "sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; "
            + "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;";
    String input2 = "sessionId=sfdsdfsd-ba57-4e21-a39f-34; "
            + "Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; "
            + "sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW "
            + "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;";
    String input3 = "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; "
            + "Url=https://www.example.com/mybook/order/newbooking/; "
            + "sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; "
            + "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;";
    List<String> inputList = Arrays.asList(input1, input2, input3);
    // Pre-compiled Patterns should not be in loops - that is why they are placed outside the loops
    Pattern replaceWithNewLinePattern = Pattern.compile(";?\\s|\\?");
    Pattern extractWordFromUrlPattern = Pattern.compile("^Url=.*/(\\w+|)$", Pattern.MULTILINE);
    int count = 0;
    for (String input : inputList) {
        String inputWithNewLines = replaceWithNewLinePattern.matcher(input).replaceAll("\n");
        // System.out.println(inputWithNewLines); // Check the change...
        Matcher matcher = extractWordFromUrlPattern.matcher(inputWithNewLines);
        while (matcher.find()) {
            System.out.printf("%d. '%s'%n", ++count, matcher.group(1));
        }
    }
}
Output:
1. 'itemSummary'
2. 'itemList'
3. ''

java/scala: Regex for skipping odd number of backslash while splitting a String?

Here is my requirement:
Input1: adasd|adsasd\|adsadsadad|asdsad
output1: Array(adasd,adsasd\|adsadsadad,asdsad)
Input2: adasd|adsasd\\|adsadsadad|asdsad
output2: Array(adasd,adsasd\\,adsadsadad,asdsad)
Input3: adasd|adsasd\\\|adsadsadad|asdsad
output3: Array(adasd,adsasd\\\|adsadsadad,asdsad)
I was using this code:
val delimiter =Pattern.quote("|")
val esc = "\\"
val regex = "(?<!" + Pattern.quote(esc) + ")" + delimiter
But this does not work for all of the cases above.
What would be the best solution to deal with this?

Instead of splitting, use this regex for a match:
(?<=[|]|^)[^|\\]*(?:\\.[^|\\]*)*
Java Code Demo
Java code:
final String[] input = {"adasd|adsasd\\|adsadsadad|asdsad",
"adasd|adsasd\\\\|adsadsadad|asdsad",
"adasd|adsasd\\\\\\|adsadsadad|asdsad"};
final String regex = "(?<=[|]|^)[^|\\\\]*(?:\\\\.[^|\\\\]*)*";
final Pattern pattern = Pattern.compile(regex);
Matcher matcher;
for (String string : input) {
    matcher = pattern.matcher(string);
    System.out.println("\n*** Input: " + string);
    while (matcher.find()) {
        System.out.println(matcher.group(0));
    }
}
Output:
*** Input: adasd|adsasd\|adsadsadad|asdsad
adasd
adsasd\|adsadsadad
asdsad
*** Input: adasd|adsasd\\|adsadsadad|asdsad
adasd
adsasd\\
adsadsadad
asdsad
*** Input: adasd|adsasd\\\|adsadsadad|asdsad
adasd
adsasd\\\|adsadsadad
asdsad
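If a regex is not a hard requirement, the same rule (a backslash escapes whatever character follows it, so a pipe is a delimiter only when it is not escaped) can also be written as a single pass over the string. This is a different technique from the regex above, shown only as a sketch; EscapedSplit and splitUnescaped are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedSplit {
    /** Splits on delim unless it is escaped; escape characters are kept in the output. */
    static List<String> splitUnescaped(String s, char delim, char esc) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == esc && i + 1 < s.length()) {
                // Copy the escape character and the character it escapes as a pair.
                current.append(c).append(s.charAt(++i));
            } else if (c == delim) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        String[] inputs = {"adasd|adsasd\\|adsadsadad|asdsad",
                "adasd|adsasd\\\\|adsadsadad|asdsad",
                "adasd|adsasd\\\\\\|adsadsadad|asdsad"};
        for (String input : inputs) {
            System.out.println(splitUnescaped(input, '|', '\\'));
        }
    }
}
```

Because backslashes are consumed in pairs with the character they escape, an even run of backslashes before a pipe leaves the pipe unescaped and an odd run escapes it.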

For the sake of simplicity, let's take ";"(semicolon) instead of "\"(backslash) to avoid too many escape sequences here.
We can do this split with a look-behind as below:
String[] input = { "adasd|zook;|adsadsadad|asdsad", "adasd|zook;;|adsadsadad|asdsad",
"adasd|zook;;;|adsadsadad|asdsad", "blah;|blah;;;;|blah|blahblah;|blahbloooh;;|" };
String regex = "(?<!;)(;;)+\\||(?<!;)\\|";
for (String str : input) {
    System.out.println("Input : " + str);
    System.out.println("Output: ");
    String[] astr = str.split(regex);
    for (String nres : astr)
        System.out.print(nres + ", ");
    System.out.println("\n");
}
Let's have a closer look at the regex. I will split it into 2 parts:
Split on an even number of semicolons (;) followed by a pipe ("|"):
(?<!;)(;;)+\\| :
Here (;;)+ makes sure we match only an even number of semicolons, and the look-behind makes sure there is no further ";" in front of that run. Note that the matched semicolons are part of the delimiter, so split() removes them from the output.
Split on a pipe without a preceding semicolon:
(?<!;)\\| :
Here we match lone pipe symbols and use the look-behind to make sure there is no ";" before the "|".
Hope this helps! :)

Certain strings that should be found by a working Regex are missed, and I need help identifying why

I have a set of strings, which I cycle through, checking those against the following set of regex, to try and separate the first small section from the rest of the string. The regex works in almost all cases, but unfortunately I have no idea why it fails occasionally. I’ve been using Pattern Matcher to print out the string, if the pattern is found.
Two example working strings:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials; inflorescence …
Two example failed strings:
100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …
26. POA L. (Parodiochloa C.E. Hubb.) - Meadow-grasses Annuals or perennials with or without stolons or rhizomes; sheaths overlapping or some …
Regexes used so far:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusTwo = Pattern.compile("(?<=(^\\d+" + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusThree = Pattern.compile("(?<=(\\d+\\. " + genusNames[l] + "))");
Pattern endOfGenusFour = Pattern.compile("(?<=(\\d+" + genusNames[l] + "))");
Pattern endOfGenusFive = Pattern.compile("(?<=(\\. " + genusNames[l] + "))");
The first of these is the one that has produced reliable results so far.
Example Code
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Matcher endOfGenusFinder = endOfGenus.matcher(descriptionPartBits[b]);
if (endOfGenusFinder.find()) {
    System.out.print(descriptionPartBits[b] + ":- ");
    System.out.print(genusNames[l] + "\n");
    String[] genusNameBits = descriptionPartBits[b].split("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
}
Desired Output. This is what is produced by strings that work. Strings that don't work simply don't appear in the output:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials:- Sorghum
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials:- Miscanthus

From regex tutorial:
Lookahead and lookbehind, collectively called "lookaround", are
zero-length assertions just like the start and end of line, and start
and end of word anchors explained earlier in this tutorial.
Lookahead and lookbehind only return true or false.
So I changed your code example:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. ZEA L))(.+)$");
// Matcher matcher = endOfGenus.matcher("98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …");
Matcher matcher = endOfGenus.matcher("100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …");
while (matcher.find()) {
    String group1 = matcher.group(1);
    String group2 = matcher.group(2);
    System.out.println("group1=" + group1);
    System.out.println("group2=" + group2);
}
Group 1 is matched by (^\\d+\\. ZEA L). Group 2 is matched by (.+).
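The post does not show why the ZEA and POA lines fail, so the following is only a guess at a more forgiving way to do the split. It drops the lookbehind in favour of two capturing groups, wraps the genus name in Pattern.quote so any special characters in it are taken literally, and accepts any run of whitespace after the number. GenusSplit and splitAfterGenus are hypothetical names, and the sample lines are shortened.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GenusSplit {
    /**
     * Returns {head, rest}, where head is "<number>. <GENUS>" and rest is the
     * remainder of the line, or null if the line does not start that way.
     */
    static String[] splitAfterGenus(String line, String genusName) {
        String genus = Pattern.quote(genusName.toUpperCase(Locale.ROOT));
        Pattern p = Pattern.compile("^(\\d+\\.\\s+" + genus + ")\\s*(.*)$", Pattern.DOTALL);
        Matcher m = p.matcher(line.trim());
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                splitAfterGenus("100. ZEA L. - Maize Annuals; male and female", "Zea")));
        System.out.println(Arrays.toString(
                splitAfterGenus("98. SORGHUM Moench - Millets Annuals", "Sorghum")));
    }
}
```

If lines still fail with something like this, printing the failing line's characters as code points may show whether it contains non-breaking spaces or other invisible characters.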

Match the end of the string but not newlines?

Is it possible, in Java, to write a regex that matches the end of the string but not the newlines, using the Pattern.DOTALL option and searching for a line with \n?
Examples:
1)
aaa\n==test==\naaa\nbbb\naaa
2)
bbb\naaa==toast==cccdd\nb\nc
3)
aaa\n==trick==\naaaDDDaaa\nbbb
I want to match
\naaa\nbbb\naaa
and
cccdd\nb\nc
but in the third example I don't want to match the text from DDD onwards, only:
\naaa

Yes, there is. For example, (?-m)}$ will match a closing brace at the very end of a Java source file. The point is to disable multiline mode. You can disable it inline as shown, or by not setting the MULTILINE flag when you compile the Pattern.
UPDATE: I believe that multiline is off by default when you compile a Pattern, but is on in Eclipse's find by regex.
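A small sketch of the difference, counting how often a pattern ending in $ or \z matches with and without Pattern.MULTILINE. EndAnchors and countMatches are names invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndAnchors {
    /** Counts the matches of regex (compiled with the given flags) in input. */
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "aaa\nbbb\naaa";
        // Default mode: $ only matches at the end of the input.
        System.out.println(countMatches("aaa$", 0, text));                   // 1
        // MULTILINE: $ also matches before every line terminator.
        System.out.println(countMatches("aaa$", Pattern.MULTILINE, text));   // 2
        // \z is the very end of the input regardless of flags.
        System.out.println(countMatches("aaa\\z", Pattern.MULTILINE, text)); // 1
    }
}
```

One detail worth knowing: even in default mode, $ also matches just before a final line terminator, so "aaa$" finds a match in "aaa\n" while "aaa\\z" does not.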

The regex you need is:
"(?s)==(?!.*?==)([^(?:DDD)]*)"
Here is the full code:
String[] sarr = {"aaa\n==test==\naaa\nbbb\naaa", "bbb\naaa==toast==cccdd\nb\nc",
"aaa\n==trick==\naaaDDDaaa\nbbb"};
Pattern pt = Pattern.compile("(?s)==(?!.*?==)([^(?:DDD)]*)");
for (String s : sarr) {
    Matcher m = pt.matcher(s);
    System.out.print("For input: [" + s + "] => ");
    if (m.find())
        System.out.println("Matched: [" + m.group(1) + ']');
    else
        System.out.println("Didn't Match");
}
OUTPUT:
For input: [aaa\n==test==\naaa\nbbb\naaa] => Matched: [\naaa\nbbb\naaa]
For input: [bbb\naaa==toast==cccdd\nb\nc] => Matched: [cccdd\nb\nc]
For input: [aaa\n==trick==\naaaDDDaaa\nbbb] => Matched: [\naaa]
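One caveat: [^(?:DDD)] is a character class, not a negated group. It excludes the single characters (, ?, :, D and ), so the capture stops at the first D (or parenthesis, question mark or colon), not only at the sequence DDD. That happens to give the right result for the three samples. A possible variant that stops only at the literal DDD or at the end of the input uses a lazy group with a lookahead; this is a sketch, with LastSection and tail as made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSection {
    // After the last "==", capture lazily up to "DDD" or the end of the input.
    private static final Pattern TAIL =
            Pattern.compile("(?s)==(?!.*==)(.*?)(?=DDD|\\z)");

    /** Returns the captured tail, or null if the input contains no "==". */
    static String tail(String s) {
        Matcher m = TAIL.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {"aaa\n==test==\naaa\nbbb\naaa",
                "bbb\naaa==toast==cccdd\nb\nc",
                "aaa\n==trick==\naaaDDDaaa\nbbb",
                "==x==aDb"};
        for (String s : samples) {
            System.out.println("[" + tail(s) + "]");
        }
    }
}
```

For the last input the character-class version captures only "a", while this variant captures "aDb".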

Pattern for pulling strings out a string

I'm not new to Java, but have not dealt with Regex and Patterns before. What I'm looking to do is take a string like
"Class: " + data1 + "\nFrom: " + data2 + " To: " + data3 + "\nOccures: " + data4 + " In: " + data5 + " " + data6;
and pull out only data_1 to data_n.
I appreciate any help.

Use this regex:
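// The last group is greedy, (.+): with find(), a trailing lazy (.+?) would capture only one character.
Pattern pattern = Pattern.compile("Class: (.+?)\nFrom: (.+?) To: (.+?)\nOccures: (.+?) In: (.+?) (.+)");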
Matcher matcher = pattern.matcher(yourInputString);
if (matcher.find())
{
    String data1 = matcher.group(1);
    String data2 = matcher.group(2);
    String data3 = matcher.group(3);
    String data4 = matcher.group(4);
    String data5 = matcher.group(5);
    String data6 = matcher.group(6);
} else
{
    // String didn't match the specified format
}
Explanation:
.+? matches one or more of any character (except line terminators), as few as possible (non-greedy).
(): parentheses create a group. Groups are indexed starting at 1 (group 0 is the entire match).
So (.+?) creates a group that captures a run of arbitrary characters.
The matcher searches for the whole pattern somewhere in the input string. Since you specified the format, we know exactly what the entire string will look like. All you have to do is copy the format and replace each piece of data you want to extract with (.+?), which gets an index because it is a group.
Afterwards, the matcher tries to find the pattern (via matcher.find()), and you ask it for the content of groups 1 to 6.
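One thing to watch with this kind of pattern: when find() is used, a lazy group at the very end of the pattern is satisfied by a single character, so the last field needs either a greedy group or an end anchor. The sketch below compares the three variants on a made-up input (the values Math, 9, 10, Mon, B1 and Room12 are invented, as are the names LazyTail and lastGroup).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyTail {
    /** Returns the content of the last capturing group, or null if there is no match. */
    static String lastGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(m.groupCount()) : null;
    }

    public static void main(String[] args) {
        String input = "Class: Math\nFrom: 9 To: 10\nOccures: Mon In: B1 Room12";
        String head = "Class: (.+?)\nFrom: (.+?) To: (.+?)\nOccures: (.+?) In: (.+?) ";
        System.out.println(lastGroup(head + "(.+?)", input));  // lazy: R
        System.out.println(lastGroup(head + "(.+)", input));   // greedy: Room12
        System.out.println(lastGroup(head + "(.+?)$", input)); // lazy + anchor: Room12
    }
}
```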

How about using split() with ":" and then taking string[2i+1] (i starting from 0) from the resulting String[]?
