Regex word boundary - Java

I am splitting a string by word boundary.
What I am expecting is:
TOKEN 0
TOKEN 1 0
TOKEN 2
TOKEN 3 +Ve
and what I am getting is:
TOKEN 0
TOKEN 1 0
TOKEN 2 +
TOKEN 3 Ve
public void StringExample(){
String str = " 0 +Ve";
String[] token = str.split("\\b");
System.out.println("TOKEN 0 " + token[0]);
System.out.println("TOKEN 1 " + token[1]);
System.out.println("TOKEN 2 " + token[2]);
System.out.println("TOKEN 3 " + token[3]);
}
Can someone give me a clue as to where it's going wrong, and suggest possible corrections, if any?

Both @pb2q and @Hovercraft have already explained why a word boundary doesn't work in your situation. An alternative is to use a Pattern and capture each token with a group, which will give you what you want:
String str = " 0 +Ve";
Pattern p = Pattern.compile("( |[^ ]+)");
Matcher m = p.matcher(str);
List<String> tokens = new ArrayList<String>();
while (m.find()) {
tokens.add(m.group(1));
}
System.out.println("TOKEN 0 " + tokens.get(0));
System.out.println("TOKEN 1 " + tokens.get(1));
System.out.println("TOKEN 2 " + tokens.get(2));
System.out.println("TOKEN 3 " + tokens.get(3));

Nothing is going wrong, and the results are as expected. A word boundary matches before the first character of a string if that character is a word character, after the last character if that character is a word character, and between two characters in the string where one is a word character and the other is not. The last rule results in a match between '+' and 'V', so your results make perfect sense.
Perhaps you want to use lookahead and lookbehind to split on either side of a space. For example:
public class Foo001 {
// private static final String REGEX1 = "\\b";
private static final String REGEX2 = "(?= )|(?<= )";
public static void main(String[] args) {
String str = " 0 +Ve";
String[] tokens = str.split(REGEX2);
for (int i = 0; i < tokens.length; i++) {
System.out.printf("token %d: \"%s\"%n", i, tokens[i]);
}
}
}
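On Java 7 and earlier, this also matches to the left of the first space, giving an extra empty token at index 0 (from Java 8 onward, String.split never produces an empty leading substring for a zero-width match at the start of the input, so there the list starts at " "):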
token 0: ""
token 1: " "
token 2: "0"
token 3: " "
token 4: "+Ve"
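The word-boundary rules described in this answer can be checked directly. The following is a minimal sketch, not part of the original answer; the class name WordBoundaryPositions and the helper boundaryIndexes are made up for illustration. It lists every index at which \b matches in the question's string and then shows the resulting split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class WordBoundaryPositions {
    // Returns every index in the input at which \b matches.
    static List<Integer> boundaryIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        String str = " 0 +Ve";
        // No boundary at index 0 (the first character is a space), one on each
        // side of '0', one between '+' and 'V', and one after the final 'e'.
        System.out.println(boundaryIndexes(str)); // [1, 2, 4, 6]
        // Splitting at those positions gives " ", "0", " +", "Ve".
        System.out.println(Arrays.toString(str.split("\\b")));
    }
}
```

There is no boundary between the space and '+' because neither is a word character, which is why " +" stays together as one token.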

+ is not counted as a word character for word boundaries. Word characters are [a-zA-Z_0-9], that is, alphanumeric characters and the underscore.
Unless your strings get more complex than your example, this is another instance where you can just split around the space:
" 0 +Ve".split(" ");
This yields the array ["", "0", "+Ve"].
This doesn't quite match the token list that you expect, but may suit your purposes. The empty first token tells you that there is a leading space character, and you can infer the space that forms the third token of your expected list.
A problem with splitting this way is that multiple consecutive space characters will yield additional empty ("") tokens in the resulting array.
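A short sketch of how splitting on a single space behaves, including the leading separator and consecutive spaces. The class name SplitOnSpace and the helper show are assumptions made for this example, and the " +" variant is an addition that was not suggested in the answer.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class SplitOnSpace {
    // Formats the result of input.split(regex) so empty tokens are visible.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // A separator at index 0 produces a leading empty token.
        System.out.println(show(" 0 +Ve", " "));   // [, 0, +Ve]
        // Two adjacent spaces produce an empty token between them.
        System.out.println(show(" 0  +Ve", " "));  // [, 0, , +Ve]
        // Splitting on runs of spaces avoids the inner empty token.
        System.out.println(show(" 0  +Ve", " +")); // [, 0, +Ve]
    }
}
```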

Related

Regex to capture the string starting with a specific word or character and ending with either one of two words

I want to capture the string after the last slash and before either a (; sid=) token or a (?) character.
sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;
Expecting the below output:
1. itemSummary
2. itemList
3. ''(empty string)
I have built the regex below to capture it, but it is not 100% accurate; it captures some additional text.
Regex:
Url=.*\/(.*)(; sid|\?)
Could you please help me improve the regex to get the desired output?
Thanks in advance!

You may use this regex in Java with a greedy match after Url=:
\bUrl=\S+/([^?;/]+)(?=; sid|\?)
RegEx details:
\b: Word boundary
Url=: Match the text Url=
\S+/: Match 1+ non-whitespace characters followed by a /
([^?;/]+): Match 1+ of any character that is not ?, ; or / and capture it in group 1
(?=; sid|\?): Lookahead to assert that we have ; sid or ? ahead
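The pattern above can be exercised with a small sketch like the one below. The class LastPathSegment, the helper extract, and the shortened sample lines (sessionId and sid values abbreviated) are assumptions made for this example. Because the captured group requires at least one character, the third sample, whose path ends in a slash, produces no match at all rather than an empty string. The second pattern is a hypothetical variant that uses * instead of + so that the empty segment is returned, as the question expects.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class LastPathSegment {
    // The pattern from the answer: the segment must have at least one character.
    static final Pattern ONE_OR_MORE =
            Pattern.compile("\\bUrl=\\S+/([^?;/]+)(?=; sid|\\?)");
    // Assumed variant: * instead of + so an empty last segment also matches.
    static final Pattern ZERO_OR_MORE =
            Pattern.compile("\\bUrl=\\S+/([^?;/]*)(?=; sid|\\?)");

    // Returns group 1 of the first match, or null when nothing matches.
    static String extract(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened versions of the question's sample lines.
        String[] lines = {
            "sessionId=a1; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=x1; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;",
            "sessionId=a2; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&type=new; sid=x2 targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;",
            "sessionId=a3; Url=https://www.example.com/mybook/order/newbooking/; sid=x3; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;"
        };
        for (String line : lines) {
            System.out.printf("+ : '%s'   * : '%s'%n",
                    extract(ONE_OR_MORE, line), extract(ZERO_OR_MORE, line));
        }
    }
}
```

The \b before Url= is what keeps targetUrl= from matching, since there is no word boundary between "target" and "Url".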

Alternative solution:
Used regex:
"^Url=.*/(\\w+|)$"
Regex in test bench and context:
public static void main(String[] args) {
String input1 = "sessionId=30a793b1-ed7e-464a-a630; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemSummary; "
+ "sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;";
String input2 = "sessionId=sfdsdfsd-ba57-4e21-a39f-34; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; "
+ "sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;";
String input3 = "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; "
+ "Url=https://www.example.com/mybook/order/newbooking/; "
+ "sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;";
List<String> inputList = Arrays.asList(input1, input2, input3);
// Pre-compiled Patterns should not be in loops - that is why they are placed outside the loops
Pattern replaceWithNewLinePattern = Pattern.compile(";?\\s|\\?");
Pattern extractWordFromUrlPattern = Pattern.compile("^Url=.*/(\\w+|)$", Pattern.MULTILINE);
int count = 0;
for(String input : inputList) {
String inputWithNewLines = replaceWithNewLinePattern.matcher(input).replaceAll("\n");
// System.out.println(inputWithNewLines); // Check the change...
Matcher matcher = extractWordFromUrlPattern.matcher(inputWithNewLines);
while (matcher.find()) {
System.out.printf( "%d. '%s'%n", ++count, matcher.group(1));
}
}
}
Output:
1. 'itemSummary'
2. 'itemList'
3. ''

Get specific data values from a string in Java (string without commas)

I want to get the value from a string.
I have a string value like this:
String myData= "Number: 34678 Type: Internal Qty: 34";
How can I get the Number, Type and Qty values separately?
Any suggestions would be appreciated.
Input:
String myData= "Number: 34678 Type: Internal Qty: 34";
Output:
Number value is 34678
Type value is Internal
Qty value is 34

Here is one way to do it. It looks for a word followed by a colon, followed by zero or more spaces, followed by another word. This works regardless of the order or names of the fields.
String myData = "Number: 34678 Type: Internal Qty: 34";
Matcher m = Pattern.compile("(\\S+):\\s*(\\S+)").matcher(myData);
while (m.find()) {
System.out.println(m.group(1) + " value is " + m.group(2));
}
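If the values are needed separately rather than printed, the same pattern can fill a map. This is a minimal sketch under that assumption; the class FieldParser and the method parse are hypothetical names, and a LinkedHashMap is used only to keep the fields in input order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FieldParser {
    // Same pattern as in the answer: name, colon, optional whitespace, value.
    private static final Pattern FIELD = Pattern.compile("(\\S+):\\s*(\\S+)");

    // Collects each name/value pair in the order it appears in the input.
    static Map<String, String> parse(String data) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(data);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse("Number: 34678 Type: Internal Qty: 34");
        System.out.println(fields);             // {Number=34678, Type=Internal, Qty=34}
        System.out.println(fields.get("Type")); // Internal
    }
}
```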

You can use a regex to do this cleanly:
Pattern p = Pattern.compile("Number: (\\d*) Type: (.*) Qty: (\\d*)");
Matcher m = p.matcher(myData);
m.find();
If find() returns true, you get the Number with m.group(1), the Type with m.group(2) and the Qty with m.group(3).
I assume you accept a limited number of types, so you can change the regex to match only if the type is valid, e.g. either Internal or External: "Number: (\\d*) Type: (Internal|External) Qty: (\\d*)"

If you just want to print them and the input data has a fixed pattern, the simplest way is as follows (just for fun!):
System.out.print(myData.replace(" Type", "\nType")
.replace(" Qty", "\nQty")
.replace(":", " value is"));

I suppose the string is always formatted like that, i.e., n attribute names, each followed by a value that does not contain spaces. In other words, the 2n entities are separated from each other by one or more spaces.
If so, try this:
String[] parts;
int limit;
int counter;
String name;
String value;
parts = myData.split("[ ]+");
limit = (parts.length / 2) * 2; // Make sure an even number of elements is considered
for (counter = 0; counter < limit; counter += 2)
{
name = parts[counter].replace(":", "");
value = parts[counter + 1];
System.out.println(name + " value is " + value);
}

This should work:
// myData is the input string from the question.
String replace = myData.replace(": ", "|"); // "Number|34678 Type|Internal Qty|34"
StringBuilder number = new StringBuilder();
StringBuilder type = new StringBuilder();
StringBuilder qty = new StringBuilder();
String[] getValues = replace.split(" ");
int i = 0;
// Each record consists of three consecutive name|value tokens.
while (i + 2 < getValues.length) {
String[] splitNumber = getValues[i].split("\\|");
number.append(splitNumber[1]);
String[] splitType = getValues[i + 1].split("\\|");
type.append(splitType[1]);
String[] splitQty = getValues[i + 2].split("\\|");
qty.append(splitQty[1]);
i += 3;
}
System.out.println(String.format("Number value is %s", number.toString()));
System.out.println(String.format("Type value is %s", type.toString()));
System.out.println(String.format("Qty value is %s", qty.toString()));
Output
Number value is 34678
Type value is Internal
Qty value is 34

Need help in regex matching

It may be very simple, but I am extremely new to regex and have a requirement where I need to match a pattern in a string and extract the number in it. Below is my code with sample input and required output. I tried to construct the Pattern by referring to https://www.freeformatter.com/java-regex-tester.html, but my regex match itself is returning false.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above print statements return false. I am using Java 8. Is there any way to match the pattern and then extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug/develop the regex. Please feel free to let me know if something is not clear in my question.

You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
Java demo:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
Matcher m = pattern.matcher(s);
if (m.matches()) {
System.out.println(s + ": \"" + m.group(1) + "\"");
}
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"

Because you always want the last number in the string, I would just use replaceAll with the regex .*?(\d+)$ :
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(!strResult1.isEmpty() ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(!strResult2.isEmpty() ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(!strResult3.isEmpty() ? "result " + strResult3 : "no result");
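Note that if the input does not end with a number, the regex does not match and replaceAll returns the input unchanged, so the isEmpty() check only catches an empty input. A more reliable test is whether the result consists solely of digits, for example strResult1.matches("\\d+").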
Outputs
result 1
result 2
result 69
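One detail worth checking with this approach is what replaceAll returns when the regex does not match. The sketch below is an illustration, not the answerer's code; the class TrailingDigits and the method trailingDigits are hypothetical. It shows that a non-matching input comes back unchanged, and adds a digits-only check to detect that case.

```java
// Hypothetical helper class, for illustration only.
class TrailingDigits {
    private static final String REGEX = ".*?(\\d+)$";

    // Returns the run of digits at the end of the input, or null when the
    // input does not end in a digit (replaceAll leaves it unchanged then).
    static String trailingDigits(String input) {
        String result = input.replaceAll(REGEX, "$1");
        return result.matches("\\d+") ? result : null;
    }

    public static void main(String[] args) {
        System.out.println(trailingDigits("foo/bar/Samsung-Galaxy/c-d/1#P2"));    // 2
        System.out.println(trailingDigits("foo.com/Samsung-Galaxy/9090/c-d/69")); // 69
        // No trailing digits: replaceAll returns the original string as is.
        System.out.println("foo/bar/a-b/69ace".replaceAll(REGEX, "$1"));          // foo/bar/a-b/69ace
        System.out.println(trailingDigits("foo/bar/a-b/69ace"));                  // null
    }
}
```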

Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
Output:
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.

Why does the Java regular expression "|" find a matching substring for any input string?

I am trying to understand why a regular expression ending with "|" (or simply "|" itself) will find a matching substring with start index 0 and end "offset after the last character matched (as per JavaDoc for Matcher)" 0.
The following code demonstrates this:
public static void main(String[] args) {
String regExp = "|";
String toMatch = "A";
Matcher m = Pattern.compile(regExp).matcher(toMatch);
System.out.println("ReqExp: " + regExp +
" found " + toMatch + "(" + m.find() + ") " +
" start: " + m.start() +
" end: " + m.end());
}
Output is:
ReqExp: | found A(true) start: 0 end: 0
I'm confused by the fact that it is even a valid regular expression. And further confused by the fact that start and end are both 0.
Hoping someone can explain this to me.

The pipe in a regular expression means "or." So your regular expression is basically "(empty string) or (empty string)". It successfully finds an empty string at the beginning of the string, and an empty string has a length of 0.
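The effect of an empty alternative can be seen by comparing a few patterns against the same input. This is a small sketch for illustration; the class EmptyAlternative and the helper firstMatch are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class EmptyAlternative {
    // Describes the first match of regex in input as "start-end", or "none".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "none";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("|", "A"));  // 0-0: both alternatives are empty
        System.out.println(firstMatch("B|", "A")); // 0-0: "B" fails, the empty alternative succeeds
        System.out.println(firstMatch("A|", "A")); // 0-1: the left alternative is tried first
        System.out.println(firstMatch("B", "A"));  // none: no empty alternative to fall back on
    }
}
```

An empty alternative matches at any position without consuming input, which is why start and end are equal in the first two cases.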

password validation and UNICODE

How can I validate the following condition with a regex?
The password must not contain any sequence of characters immediately followed by the same sequence of characters. I have other conditions as well, and am using
(?=.*(..+)\\1)
to check for an immediately repeated sequence, but it is failing. This piece of code returns "true" for the 3rd and 4th strings passed; I need it to return false. Please help.
String s2[] = {"1newAb", "newAB1", "1234567AaAa", "123456ab3434", "love", "love1"};
boolean b3;
for(int i=0; i<s2.length; i++){
b3 = s2[i].matches("^(?=.*[0-9])(?=.*[a-zA-Z])(?=.*(..+)\\1).{5,12}$");
System.out.println("value" + b3);
}

You can try a negative lookahead: (?!.*(.{2,})\\1).
For those who are wondering what \\1 is: it is a backreference to whatever group 1 matched, which in our case is the match of (.{2,}).
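As a sketch of how the negative lookahead might be combined with the other rules from the question (the combined pattern and the names PasswordCheck and isValid are assumptions for this example, not code from the answer):

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PasswordCheck {
    // At least one digit, at least one letter, no sequence of 2+ characters
    // immediately repeated, and a total length of 5 to 12 characters.
    private static final Pattern VALID = Pattern.compile(
            "^(?=.*[0-9])(?=.*[a-zA-Z])(?!.*(.{2,})\\1).{5,12}$");

    static boolean isValid(String password) {
        return VALID.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"1newAb", "newAB1", "1234567AaAa", "123456ab3434", "love", "love1"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

With the question's sample strings, "1234567AaAa" and "123456ab3434" are now rejected because "Aa" and "34" are immediately repeated, and "love" is rejected because it has no digit; the other three are accepted.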

Ron's suggestion helped me find the right Java method: matches() and find() work differently, and find() was what I needed.
Following Guido's suggestion, I am breaking up the code into separate rules. Here's my code, yet to be refined. It checks for a repeat of any sequence using (\S+?)\1:
String regex = "(\\S+?)\\1";
String regex2 = "^(?=.*[0-9])(?=.*[a-zA-Z]).{5,12}$";
Pattern p = Pattern.compile(regex);
for (String str : s2) {
Matcher matcher = p.matcher(str);
if (matcher.find())
System.out.println(str + " got repeated: " + matcher.group(1));
else if(str.matches(regex2))
System.out.println(str + " Password correct");
else
System.out.println(str + " Password incorrect");
}
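The difference between matches() and find() mentioned above can be illustrated with the same repeat pattern. This is a minimal sketch; the class MatchesVersusFind and its two helpers are hypothetical names introduced for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class MatchesVersusFind {
    private static final Pattern REPEAT = Pattern.compile("(\\S+?)\\1");

    // matches() succeeds only if the entire input is one sequence repeated once.
    static boolean wholeInputIsRepeat(String input) {
        return REPEAT.matcher(input).matches();
    }

    // find() looks for the pattern anywhere; returns the repeated sequence or null.
    static String firstRepeat(String input) {
        Matcher m = REPEAT.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeInputIsRepeat("1234567AaAa")); // false
        System.out.println(firstRepeat("1234567AaAa"));        // Aa
        System.out.println(wholeInputIsRepeat("AaAa"));        // true
        System.out.println(firstRepeat("love"));               // null
        // Unlike (.{2,})\1, this pattern also flags a single repeated character.
        System.out.println(firstRepeat("book"));               // o
    }
}
```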
