I am puzzled by the split method with regex in Java. It is a rather theoretical question that popped up, and I can't figure it out.
I found this answer: Java split by \\S
but the advice to use \\s instead of \\S does not explain what is happening here.
Why does quote.split("\\S") return 2 results in case A and 8 in case B?
case A)
String quote = " x xxxxxx";
String[] words = quote.split("\\S");
System.out.print("\\S >>\t");
for (String word : words) {
    System.out.print(":" + word);
}
System.out.println(words.length);
Result:
\\S >> : : 2
case B)
String quote = " x xxxxxx ";
String[] words = quote.split("\\S");
System.out.print("\\S >>\t");
for (String word : words) {
    System.out.print(":" + word);
}
System.out.println(words.length);
Result:
\\S >> : : :::::: 8
It would be wonderful to understand what happens here. Thanks in advance.
As Jongware noticed, the documentation for String.split(String) says:
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
So it works somewhat like this:
"a:b:::::".split(":") === removeTrailing([a,b,,,,,]) === [a,b]
"a:b:::::c".split(":") === removeTrailing([a,b,,,,,c]) === [a,b,,,,,c]
And in your example:
" x xxxxxx".split("\\S") === removeTrailing([ , ,,,,,,]) === [ , ]
" x xxxxxx ".split("\\S") === removeTrailing([ , ,,,,,, ]) === [ , ,,,,,, ]
To collapse multiple delimiters into one, use the \S+ pattern.
" x xxxxxx".split("\\S+") === removeTrailing([ , ,]) === [ , ]
" x xxxxxx ".split("\\S+") === removeTrailing([ , , ]) === [ , , ]
As suggested in the comments, to keep the trailing empty strings we can use the overloaded version of the split method (String.split(String, int)) with a negative number passed as the limit.
"a:b:::::".split(":", -1) === [a,b,,,,,]
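The trailing-empty-string rule from the answer above can be checked directly. The following is a minimal sketch, not part of the original answer; the class and method names are made up for illustration. It compares the one-argument split (limit 0) with a negative limit on the two strings from the question.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class SplitLimitDemo {

    // One-argument split: behaves like limit 0, so trailing empty strings are dropped.
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    // Negative limit: every piece is kept, including trailing empty strings.
    static String[] splitKeepAll(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String caseA = " x xxxxxx";   // ends with a non-whitespace character
        String caseB = " x xxxxxx ";  // ends with a space
        for (String quote : new String[] { caseA, caseB }) {
            // Left: all pieces; right: what the default split returns.
            System.out.println(Arrays.toString(splitKeepAll(quote, "\\S"))
                    + " -> " + Arrays.toString(splitDefault(quote, "\\S")));
        }
    }
}
```

In case A the last six pieces are empty, so the default split drops them and two remain. In case B the final piece is a space, so nothing counts as trailing and all eight pieces survive.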
I have an issue with a regex that should check that a string contains at least one letter and at least one digit, and excludes any symbol such as #!#%^&*.
I have read many articles but still cannot find an answer that gives the result I expect.
Example: I have a string like: 111AB11111
My expected results:
For each input string:
"1111"=>false
"AAA" =>false
"1A" => true
"A1" => true
"111AAA111"=>true
"AAA11111AA"=>true
"#AA111BBB"=>false
"111AAAA$"=>false
Which Java regex pattern handles the cases above?
Thanks for your valuable time and for any suggestions.
Thank you!!
You could use String#matches with the following regex pattern:
^[A-Z0-9]*(?:[0-9][A-Z0-9]*[A-Z]|[A-Z][A-Z0-9]*[0-9])[A-Z0-9]*$
Sample Java code:
List<String> inputs = Arrays.asList(new String[] { "111AB", "111", "AB", "111AB$" });
for (String input : inputs) {
    if (input.matches("[A-Z0-9]*(?:[0-9][A-Z0-9]*[A-Z]|[A-Z][A-Z0-9]*[0-9])[A-Z0-9]*")) {
        System.out.println(input + " => VALID");
    }
    else {
        System.out.println(input + " => INVALID");
    }
}
This prints:
111AB => VALID
111 => INVALID
AB => INVALID
111AB$ => INVALID
Note that the regex pattern actually used with String#matches does not have the leading/trailing ^/$ anchors. This is because the matches API implicitly applies the regex pattern to the entire string.
As for the regex pattern itself, it matches an input made up only of uppercase letters and digits that contains at least one digit and at least one letter, in either order.
This would accomplish your goal:
^(?:[0-9]+[A-Za-z]|[A-Za-z]+[0-9])[A-Za-z0-9]*$
Digits preceding alpha, or alphas preceding a digit, followed by only alphas and digits.
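To compare the two suggested patterns against the cases listed in the question, a small harness like the one below can help. It is only a sketch: the class name, constant names, and isValid helper are hypothetical, and both patterns are written without ^/$ because Matcher.matches() already requires the whole input to match.

```java
import java.util.regex.Pattern;

// Hypothetical harness; names are illustrative only.
class LetterAndDigitCheck {

    // Pattern from the first answer (uppercase letters only).
    static final Pattern UPPER_ONLY = Pattern.compile(
            "[A-Z0-9]*(?:[0-9][A-Z0-9]*[A-Z]|[A-Z][A-Z0-9]*[0-9])[A-Z0-9]*");

    // Pattern from the second answer (letters of either case).
    static final Pattern ANY_CASE = Pattern.compile(
            "(?:[0-9]+[A-Za-z]|[A-Za-z]+[0-9])[A-Za-z0-9]*");

    // True when the whole input matches the given pattern.
    static boolean isValid(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "1111", "AAA", "1A", "A1", "111AAA111", "AAA11111AA", "#AA111BBB", "111AAAA$"
        };
        for (String input : inputs) {
            System.out.println(input + " => " + isValid(UPPER_ONLY, input)
                    + " / " + isValid(ANY_CASE, input));
        }
    }
}
```

The two patterns agree on all eight inputs from the question; they differ only on lowercase letters, which the first pattern rejects.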
I have this string:
"round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)"
I tried to split the string using the following code:
String[] tokens = function.split("[ )(*+-/^!##%&]");
Result is the following array:
"round"
""
"TOTAL_QTY"
""
""
"100"
""
""
""
"SUM"
"ORDER_ITEMS"
"->TOTAL_QTY"
""
""
""
"1"
But I need to split the string as follows:
"round",
"TOTAL_QTY",
"100",
"SUM",
"ORDER_ITEMS->TOTAL_QTY",
"1"
To make it clearer: first of all I need to ignore -> when splitting the string, and then remove the empty strings from the result array.
Solution 1
OK, I think you can do it in two steps: replace all the unnecessary characters with a space, for example, and then split on whitespace. Your regex can look like this:
[)(*+/^!##%&,]|\\b-\\b
Your code :
String[] tokens = function.replaceAll("[)(*+/^!##%&,]|\\b-\\b", " ").split("\\s+");
Note that I used \\b-\\b to replace only a - that sits between two word characters, so the - in -> is left alone.
Solution 2
Or, if you want something cleaner, you can use Pattern with Matcher like this:
Pattern.compile("\\b\\w+->\\w+\\b|\\b\\w+\\b")
.matcher("round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)")
.results()
.map(MatchResult::group)
.forEach(s -> System.out.println(String.format("\"%s\"", s)));
regex demo
Details
\b\w+->\w+\b to match that special case of ORDER_ITEMS->TOTAL_QTY
| or
\b\w+\b any other word with word boundaries
Note that this solution works from Java 9 onwards (Matcher.results() was added in Java 9); on earlier versions you can use a plain Pattern and Matcher loop.
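For versions before Java 9, the same pattern can be driven by an explicit find() loop. This is a sketch that assumes the tokens should be collected into a list; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class TokenFinder {

    // Same pattern as in Solution 2: the "a->b" special case first, then any plain word.
    static final Pattern TOKEN = Pattern.compile("\\b\\w+->\\w+\\b|\\b\\w+\\b");

    // Collects every match in order of appearance.
    static List<String> tokens(String expression) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(expression);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String function = "round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)";
        for (String token : tokens(function)) {
            System.out.println(String.format("\"%s\"", token));
        }
    }
}
```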
Outputs
"round"
"TOTAL_QTY"
"100"
"SUM"
"ORDER_ITEMS->TOTAL_QTY"
"1"
YCF_L has already provided a couple of very good solutions.
Here is one more solution:
String[] tokens = function.replace(")","").split("\\(+|\\*|/|,");
Explanation:
\\(+ splits on ( and the + ensures that runs of several opening brackets are handled, e.g. round((
|\\*|/|, OR split by * OR split by / OR split by ,
Output:
round
TOTAL_QTY
100
SUM
ORDER_ITEMS->TOTAL_QTY
1
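One caveat: the pieces produced by this split keep the spaces around them (for example "TOTAL_QTY " and " 100 "), so they need trimming to match the output listed above. A possible sketch, with a hypothetical class and method name, that trims each piece and drops any that end up empty:

```java
import java.util.Arrays;

// Hypothetical helper; names are illustrative only.
class TrimmedSplit {

    static String[] tokens(String function) {
        return Arrays.stream(function.replace(")", "").split("\\(+|\\*|/|,"))
                .map(String::trim)                    // remove the spaces left around each piece
                .filter(piece -> !piece.isEmpty())    // guard against pieces that were only spaces
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String function = "round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)";
        for (String token : tokens(function)) {
            System.out.println(token);
        }
    }
}
```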
I have a test string like this
08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] [1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts
I want to use a regex to match "ABCD" and "35" in this string
def regexString = ~ /(\s\d{1,5}[^\d\]\-\:\,\.])|([A-Z]{4}\:)/
............
while (matcher.find()) {
    acct = matcher.group(1)
    grpName = matcher.group(2)
    println ("group : " +grpName + " acct : "+ acct)
}
My Current Output is
group : ABCD: acct : null
group : null acct : 35
But I expected something like this
group : ABCD: acct : 35
Is there any way to match all the patterns in the string before it enters the while() loop, or a better way to implement this?
You may use
String s = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] [1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts"
def res = s =~ /\b([A-Z]{4}):[^\]\[\d]*(\d{1,5})\b/
if (res.find()) {
println "${res[0][1]}, ${res[0][2]}"
} else {
println "not found"
}
See the Groovy demo.
The regex - \b([A-Z]{4}):[^\]\[\d]*(\d{1,5})\b - matches a string starting with a whole word consisting of 4 uppercase ASCII letters (captured into Group 1), then followed with : and 0+ chars other than [, ] and digits, and then matches and captures into Group 2 a whole number consisting of 1 to 5 digits.
See the regex demo.
In the code, =~ operator makes the regex engine find a partial match (i.e. searches for the pattern anywhere inside the string) and the res variable contains all the match objects that hold a whole match inside res[0][0], Group 1 inside res[0][1] and Group 2 value in res[0][2].
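The same single-pass idea carries over to plain Java. The following is a minimal sketch, assuming the group name always precedes the count on the line; LogLineParser and parse are hypothetical names, and the regex is the one from this answer with Java string escaping.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class LogLineParser {

    // Group 1: four uppercase letters before ':'; group 2: the 1 to 5 digit number that follows.
    static final Pattern GROUP_AND_COUNT =
            Pattern.compile("\\b([A-Z]{4}):[^\\]\\[\\d]*(\\d{1,5})\\b");

    // Returns {groupName, count} from the first match, or empty when nothing matches.
    static Optional<String[]> parse(String line) {
        Matcher matcher = GROUP_AND_COUNT.matcher(line);
        if (matcher.find()) {
            return Optional.of(new String[] { matcher.group(1), matcher.group(2) });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String line = "08:28:57,990 DEBUG [http-0.0.0.0-18080-33] [tester] "
                + "[1522412937602-580613] [TestManager] ABCD: loaded 35 test accounts";
        parse(line).ifPresent(parts ->
                System.out.println("group : " + parts[0] + " acct : " + parts[1]));
    }
}
```

Because both values come from one match, a single find() call yields the group name and the account count together.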
I believe your issue is with the 'or' in your regex. Because of the alternation, each call to find() matches only one branch: once for the first half of the regex and then again for the second half after the '|'. You need a regex that matches both in one pass. You can reverse the groups so they match in order:
/([A-Z]{4})\:.*\s(\d{1,5})[^\d\]\-\:\,\.]/
Also notice the change in parentheses so you don't capture more than you need - Currently you are capturing the ':' after the group name and an extra space before the acct. This is assuming the "ABCD" will always come before the "35".
There is also a lot more you can do assuming that all your strings are formatted very similarly:
For example, if there is always a space after the acct number you could simplify it to:
/([A-Z]{4})\:.*\s(\d{1,5})\s/
There's probably a lot more you could do to make sure you're always capturing the correct things, but I'd have to see or know more about the dataset to do so.
Then of course you have to switch the order of the groups in your code:
while (matcher.find()) {
    grpName = matcher.group(1)
    acct = matcher.group(2)
    println ("group : " +grpName + " acct : "+ acct)
}
I can't seem to find the regex that suits my needs.
I have a .txt file of this form:
Abc "test" aBC : "Abc aBC"
Brooking "ABC" sadxzc : "I am sad"
asd : "lorem"
a22 : "tactius"
testsa2 : "bruchia"
test : "Abc aBC"
b2 : "Ast2"
From this .txt file I wish to extract everything matching this regex "([a-zA-Z]\w+)", except the ones between the quotation marks.
I want to rename every word (except the words in quotation marks), so I should have for example the following output:
A "test " B : "Abc aBC"
Z "ABC" X : "I am sad"
Test : "lorem"
F : "tactius"
H : "bruchia"
Game : "Abc aBC"
S: "Ast2"
Is this even achievable using a regex? Are there alternatives without using regex?
If quotes are balanced and there is no escaping in the input like \" then you can use this regex to match words outside double quotes:
(?=(?:(?:[^"]*"){2})*[^"]*$)(\b[a-zA-Z]\w+\b)
RegEx Demo
In java it will be:
Pattern p = Pattern.compile("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(\\b[a-zA-Z]\\w+\\b)");
This regex matches words only when they are outside double quotes, by using a lookahead to make sure there is an even number of quotes after each matched word.
A simple approach might be to split the string by ", then do the replace using your regex on every odd part (on parts 1, 3, ..., if you start the numbering from 1), and join everything back.
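The split-and-join idea might be sketched as follows. The renaming rule is not specified in the question, so it is passed in as a function; the class and method names are hypothetical, and the sketch assumes balanced quotes with no escaping. With 0-based array indices, the parts outside quotes are the even ones.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class SplitOnQuotes {

    // The word regex from the question.
    static final Pattern WORD = Pattern.compile("[a-zA-Z]\\w+");

    static String renameOutsideQuotes(String line, UnaryOperator<String> rename) {
        // Limit -1 keeps trailing empty parts, so a line ending in a quote joins back correctly.
        String[] parts = line.split("\"", -1);
        for (int i = 0; i < parts.length; i += 2) {   // parts 0, 2, 4, ... lie outside quotes
            parts[i] = renameWords(parts[i], rename);
        }
        return String.join("\"", parts);
    }

    // Replaces every word in the text with the result of the rename function.
    static String renameWords(String text, UnaryOperator<String> rename) {
        Matcher matcher = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(out, Matcher.quoteReplacement(rename.apply(matcher.group())));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Upper-casing stands in for whatever renaming is actually needed.
        System.out.println(renameOutsideQuotes("Abc \"test\" aBC : \"Abc aBC\"", String::toUpperCase));
    }
}
```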
UPD
However, it is also simple to implement manually. Just go along the line and track whether you are inside quotes or not.
insideQuotes = false
result = ""
currentPart = ""
input = input + '"' // so that we do not need to process the last part separately
for ch in input
    if ch == '"'
        if not insideQuotes
            currentPart = replace(currentPart)
        result = result + currentPart + '"'
        currentPart = ""
        insideQuotes = not insideQuotes
    else
        currentPart = currentPart + ch
drop the last symbol of result (it is that quote mark that we have added)
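A Java rendering of the pseudocode above might look like this. It is a sketch with hypothetical names: the replace step is passed in as a function, and instead of appending an extra quote and dropping it afterwards, the last part is flushed once the loop ends.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper; names are illustrative only.
class QuoteAwareScanner {

    static String transformOutsideQuotes(String input, UnaryOperator<String> replace) {
        StringBuilder result = new StringBuilder();
        StringBuilder currentPart = new StringBuilder();
        boolean insideQuotes = false;

        for (char ch : input.toCharArray()) {
            if (ch == '"') {
                // A quote closes the current part: transform it only if it was outside quotes.
                String part = currentPart.toString();
                result.append(insideQuotes ? part : replace.apply(part));
                result.append('"');
                currentPart.setLength(0);
                insideQuotes = !insideQuotes;
            } else {
                currentPart.append(ch);
            }
        }

        // Flush whatever follows the last quote (or the whole input if there was none).
        String lastPart = currentPart.toString();
        result.append(insideQuotes ? lastPart : replace.apply(lastPart));
        return result.toString();
    }

    public static void main(String[] args) {
        // Upper-casing stands in for the real replacement.
        System.out.println(transformOutsideQuotes("Brooking \"ABC\" sadxzc : \"I am sad\"",
                String::toUpperCase));
    }
}
```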
However, also consider whether you will need more advanced syntax, for example quote escaping like
word "inside quote \" still inside" outside again
If so, you will need a more advanced parser, or you might consider using some special format.
You can’t formulate a “within quotes” condition the way you might think. But you can easily search for unquoted words or quoted strings and take action only for the unquoted words:
Pattern p = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");
for(String s: lines) {
    Matcher m=p.matcher(s);
    while(m.find()) {
        if(m.group(1)!=null) {
            System.out.println("take action with "+m.group(1));
        }
    }
}
This utilizes the fact that each search for the next match starts at the end of the previous. So if you find a quoted string ("[^"]*") you don’t take any action and continue searching for other matches. Only if there is no match for a quoted string, the pattern looks for a word (([a-zA-Z]\w+)) and if one is found, the group 1 captures the word (will be non null).
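To go from "take action" to an actual rename, the same pattern can be combined with Matcher.appendReplacement. This is only a sketch: the rename rule is passed in as a function because the question does not define it, and the class and method names are hypothetical.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class QuotedOrWord {

    // Either a quoted string (no group) or a word captured in group 1.
    static final Pattern QUOTED_OR_WORD = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");

    static String renameUnquoted(String line, UnaryOperator<String> rename) {
        Matcher matcher = QUOTED_OR_WORD.matcher(line);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // Group 1 is null for a quoted string, which is copied back unchanged.
            String replacement = matcher.group(1) != null
                    ? rename.apply(matcher.group(1))
                    : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Upper-casing stands in for the real renaming.
        System.out.println(renameUnquoted("Brooking \"ABC\" sadxzc : \"I am sad\"",
                String::toUpperCase));
    }
}
```

Matcher.quoteReplacement is used so that any $ or \ in the replacement text is taken literally.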
I am trying to split a string in Java on / but I need to ignore any instances where / is found between []. For example if I have the following string
/foo/bar[donkey=King/Kong]/value
Then I would like to return the following in my output
foo
bar[donkey=King/Kong]
value
I have seen a couple of other similar posts, but I haven't found anything that fits exactly what I'm trying to do. I've tried the String.split() method as follows and have seen weird results:
Code: value.split("/[^/*\\[.*/.*\\]]")
Result: [, oo, ar[donkey=King, ong], alue]
What do I need to do in order to get back the following:
Desired Result: [, foo, bar[donkey=King/Kong], value]
Thanks,
Jeremy
You need to split on a / that is followed by 0 or more balanced pairs of brackets up to the end of the string:
String str = "/foo/bar[donkey=King/Kong]/value";
String[] arr = str.split("/(?=([^\\[\\]]*\\[[^\\[\\]]*\\])*[^\\[\\]]*$)");
System.out.println(Arrays.toString(arr));
Output:
[, foo, bar[donkey=King/Kong], value]
A more user-friendly explanation:
String[] arr = str.split("(?x)/" + // Split on `/`
"(?=" + // Followed by
" (" + // Start a capture group
" [^\\[\\]]*" + // 0 or more non-[, ] character
" \\[" + // then a `[`
" [^\\]\\[]*" + // 0 or more non-[, ] character
" \\]" + // then a `]`
" )*" + // 0 or more repetition of previous pattern
" [^\\[\\]]*" + // 0 or more non-[, ] characters
"$)"); // till the end
In the following string, the regex below will match foo and bar, but not fox and baz, because they're followed by a close bracket. Study up on negative lookahead.
fox]foo/bar/baz]
Regex:
\b(\w+)\b(?!])
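A quick way to try this negative lookahead in Java is sketched below. The class and method names are hypothetical, and the ] inside the lookahead is escaped here purely for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class NotBeforeBracket {

    // A whole word that is not immediately followed by ']'.
    static final Pattern WORD_NOT_BEFORE_BRACKET = Pattern.compile("\\b(\\w+)\\b(?!\\])");

    static List<String> find(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD_NOT_BEFORE_BRACKET.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("fox]foo/bar/baz]"));
    }
}
```

The trailing \b matters: without it the engine could backtrack to a shorter prefix such as "fo", which is not followed by ], and report that instead.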