Regex split into overlapping strings - java

I'm exploring the power of regular expressions, so I'm just wondering if something like this is possible:
public class StringSplit {
    public static void main(String args[]) {
        System.out.println(
            java.util.Arrays.deepToString(
                "12345".split(INSERT_REGEX_HERE)
            )
        ); // prints "[12, 23, 34, 45]"
    }
}
If possible, then simply provide the regex (and preemptively some explanation on how it works).
If it's only possible in some regex flavors other than Java, then feel free to provide those as well.
If it's not possible, then please explain why.
BONUS QUESTION
Same question, but with a find() loop instead of split:
Matcher m = Pattern.compile(BONUS_REGEX).matcher("12345");
while (m.find()) {
    System.out.println(m.group());
} // prints "12", "23", "34", "45"
Please note that it's not so much that I have a concrete task to accomplish one way or another, but rather I want to understand regular expressions. I don't need code that does what I want; I want regexes, if they exist, that I can use in the above code to accomplish the task (or regexes in other flavors that work with a "direct translation" of the code into another language).
And if they don't exist, I'd like a good solid explanation why.

I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside:
Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher("12345");
while (m.find())
{
    System.out.println(m.group(1));
}
Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.
As a matter of fact, it works even if the regex as a whole matches nothing. Remove the dot from the regex above ("(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.
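The lookahead-capture technique generalizes to windows of any width. The following is a minimal sketch, not part of the original answer: the class name OverlappingWindows and the helper windows are made up for illustration, and it uses the zero-width form of the regex (no trailing dot).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class OverlappingWindows {
    // Returns every substring of length n (n >= 1), ordered by start position.
    static List<String> windows(String input, int n) {
        // The lookahead captures n characters without consuming them, so each
        // match is empty and the engine advances one position per find().
        Pattern p = Pattern.compile("(?=(.{" + n + "}))");
        Matcher m = p.matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(windows("12345", 2)); // prints "[12, 23, 34, 45]"
        System.out.println(windows("12345", 3)); // prints "[123, 234, 345]"
    }
}
```

Because "." does not match line terminators by default, this sketch assumes single-line input.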

This somewhat heavy implementation using Matcher.find instead of split will also work, although by the time you have to code a for loop for such a trivial task you might as well drop the regular expressions altogether and use substrings (for similar coding complexity minus the CPU cycles):
import java.util.*;
import java.util.regex.*;
public class StringSplit {
    public static void main(String args[]) {
        ArrayList<String> result = new ArrayList<String>();
        // Restart each search one character after the start of the previous match.
        for (Matcher m = Pattern.compile("..").matcher("12345"); m.find(result.isEmpty() ? 0 : m.start() + 1); result.add(m.group()));
        System.out.println( result.toString() ); // prints "[12, 23, 34, 45]"
    }
}
EDIT1
find(): the reason nobody so far has been able to concoct a regular expression like your BONUS_REGEX lies within Matcher, which resumes looking for the next match where the previous match ended (i.e. no overlap), as opposed to just after where the previous match started, unless you explicitly respecify the start position of the search (as above). A good candidate for BONUS_REGEX would have been "(.\\G.|^..)" but, unfortunately, the \G-anchor-in-the-middle trick doesn't work with Java's Matcher (though it works just fine in Perl):
perl -e 'while ("12345"=~/(^..|.\G.)/g) { print "$1\n" }'
12
23
34
45
split(): as for INSERT_REGEX_HERE, a good candidate would have been (?<=..)(?=..) (the split point is the zero-width position with two characters to its left and two to its right), but again, because split has no concept of overlap, you end up with [12, 3, 45] (which is close, but no cigar).
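To see the near miss concretely, here is a small sketch that runs that candidate split regex. The class and method names are invented for the example.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class SplitNoOverlap {
    // Splits at every position that has two characters on each side.
    static String[] splitBetweenPairs(String input) {
        return input.split("(?<=..)(?=..)");
    }

    public static void main(String[] args) {
        // The split points are after "12" and after "123"; each character
        // still ends up in exactly one piece, so there is no overlap.
        System.out.println(Arrays.toString(splitBetweenPairs("12345"))); // prints "[12, 3, 45]"
    }
}
```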
EDIT2
For fun, you can trick split() into doing what you want by first doubling non-boundary characters (here you need a reserved character value to split around):
Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1#$1").split("#")
We can be smart and eliminate the need for a reserved character by taking advantage of the fact that zero-width look-ahead assertions (unlike look-behind) can have an unbounded length; we can therefore split around all points which are an even number of characters away from the end of the doubled string (and at least two characters away from its beginning), producing the same result as above:
Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1").split("(?<=..)(?=(..)*$)")
Alternatively, you can trick find() in a similar way (without needing a reserved character value):
Matcher m = Pattern.compile("..").matcher(
    Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1")
);
while (m.find()) {
    System.out.println(m.group());
} // prints "12", "23", "34", "45"
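The doubling tricks from EDIT2 can be collected into one runnable sketch. The class and method names below are invented for illustration; the regexes are the ones quoted in the answer, and "#" is assumed not to occur in the input.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class DoubleThenSplit {
    // Doubles every character that has a neighbour on both sides.
    static String doubleInnerChars(String input) {
        return input.replaceAll("((?<=.).(?=.))", "$1$1");
    }

    // Variant 1: put a reserved marker between the two copies and split on it.
    static String[] pairsWithMarker(String input) {
        return input.replaceAll("((?<=.).(?=.))", "$1#$1").split("#");
    }

    // Variant 2: no marker; split wherever an even number of characters remains.
    static String[] pairsWithoutMarker(String input) {
        return doubleInnerChars(input).split("(?<=..)(?=(..)*$)");
    }

    public static void main(String[] args) {
        System.out.println(doubleInnerChars("12345"));                     // prints "12233445"
        System.out.println(Arrays.toString(pairsWithMarker("12345")));    // prints "[12, 23, 34, 45]"
        System.out.println(Arrays.toString(pairsWithoutMarker("12345"))); // prints "[12, 23, 34, 45]"
    }
}
```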

Split chops a string into multiple pieces, but that doesn't allow for overlap. You'd need to use a loop to get overlapping pieces.

I don't think you can do this with split() because it throws away the part that matches the regular expression.
In Perl this works:
my $string = '12345';
my @array = ();
while ( $string =~ s/(\d(\d))/$2/ ) {
    push(@array, $1);
}
print join(" ", @array);
# prints: 12 23 34 45
The find-and-replace expression says: match the first two adjacent digits and replace them in the string with just the second of the two digits.
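For comparison with the question's Java code, the same shrinking-string idea might be written in Java roughly as follows. This is a sketch rather than a translation anyone posted; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ShrinkingReplace {
    static List<String> pairs(String input) {
        Pattern p = Pattern.compile("(\\d(\\d))");
        List<String> result = new ArrayList<>();
        String rest = input;
        Matcher m = p.matcher(rest);
        while (m.find()) {
            // Record the pair first: replaceFirst() resets the matcher.
            result.add(m.group(1));
            // Replace the first pair with just its second digit.
            rest = m.replaceFirst("$2");
            m = p.matcher(rest);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("12345")); // prints "[12, 23, 34, 45]"
    }
}
```

As in the Perl version, this still needs an explicit loop, so it does not answer the split() part of the question.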

Alternative, using plain matching with Perl. Should work anywhere where lookaheads do. And no need for loops here.
$_ = '12345';
@list = /(?=(..))./g;
print "@list";
# Output:
# 12 23 34 45
But this one, as posted earlier, is nicer if the \G trick works:
$_ = '12345';
@list = /^..|.\G./g;
print "@list";
# Output:
# 12 23 34 45
Edit: Sorry, didn't see that all of this was posted already.

Creating overlapping matches with String#split isn't possible, as the other answers have already stated. It is however possible to add a regex-replace before it to prepare the String, and then use the split to create regular pairs:
"12345".replaceAll(".(?=(.).)","$0$1")
.split("(?<=\\G..)")
The .replaceAll(".(?=(.).)","$0$1") will transform "12345" into "12233445". It basically replaces every 123 substring with 1223, then every 234 with 2334 (note that the matches overlap), etc. In other words, it duplicates every character except the first and the last.
.(?=(.).)   # Replace-regex:
.           #  A single character
 (?=    )   #  followed by (using a positive lookahead):
     . .    #   two more characters
    ( )     #   of which the first is saved in capture group 1
$0$1        # Replacement-regex:
$0          #  The entire match, which is the character itself since everything
            #  else was inside a lookahead
  $1        #  followed by capture group 1
After that, .split("(?<=\\G..)") will split this new String into pairs:
(?<=\G..)   # Split-regex:
(?<=    )   #  A positive lookbehind:
    \G      #   Matching the end of the previous match
            #   (or the start of the string initially)
      ..    #   followed by two characters
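Put together as a runnable program, the two steps look like the sketch below. The class and method names are invented for the example, and the expected output assumes a standard JVM regex implementation.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class DuplicateAndSplit {
    // Step 1: duplicate every character except the first and the last.
    static String duplicateInner(String input) {
        return input.replaceAll(".(?=(.).)", "$0$1");
    }

    // Step 2: cut the prepared string into consecutive two-character chunks.
    static String[] pairs(String input) {
        return duplicateInner(input).split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(duplicateInner("12345"));         // prints "12233445"
        System.out.println(Arrays.toString(pairs("12345"))); // prints "[12, 23, 34, 45]"
    }
}
```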
Some more information about .split("(?<=\\G..)") can be found here.

Related

Complicated regex and possible simple way to do it [duplicate]

I don't write many regular expressions, so I'm going to need some help on this one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
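A quick way to try the whitespace-tolerant alphanumeric variant is a small Java check against the examples from the question. This is only a sketch: the class name CommaListValidator is invented, and \s is used in place of the POSIX [[:space:]] class.

```java
import java.util.regex.Pattern;

// Hypothetical validator class, for illustration only.
class CommaListValidator {
    // One alphanumeric item, then zero or more ", item" groups.
    // matches() anchors at both ends, so ^ and $ are not needed here.
    private static final Pattern LIST =
            Pattern.compile("[0-9a-zA-Z]+(\\s*,\\s*[0-9a-zA-Z]+)*");

    static boolean isValid(String s) {
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123, 4A67, GGG, 767"));   // prints "true"
        System.out.println(isValid("12333, 78787&*, GH778")); // prints "false"
        System.out.println(isValid("fghkjhfdg8797<"));        // prints "false"
        System.out.println(isValid("123,"));                  // prints "false"
    }
}
```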
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 0 or more whitespace characters after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x = [ "123, 4A67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
...     print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : match one or more (but not zero) characters from the listed ranges, or a space.
(?:[...]+,)* : in non-capturing parentheses, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Matching zero times allows for input with no comma.
[...]+ : match at least one of these characters. This does not include a comma, which ensures that a trailing comma is not accepted. If a trailing comma is acceptable, then the expression is simpler: ^[a-zA-Z0-9 ,]+$
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
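The same build-by-concatenation idea can be sketched in Java. Everything named here (ListRegexBuilder, commaList, and the key=value item pattern) is hypothetical and only illustrates the $LONGSTUFF(,$LONGSTUFF)* shape; it is not the XEN validator above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ListRegexBuilder {
    // Builds ITEM(,ITEM)* from a single item regex so the item is written once.
    static Pattern commaList(String itemRegex) {
        // Wrap the item in a non-capturing group so that alternations inside
        // it cannot leak into the surrounding pattern.
        String item = "(?:" + itemRegex + ")";
        return Pattern.compile(item + "(?:," + item + ")*");
    }

    public static void main(String[] args) {
        // Assumed item format for the demo: lowercase key, "=", lowercase/digit value.
        Pattern keyValues = commaList("[a-z]+=[a-z0-9]+");
        System.out.println(keyValues.matcher("a=b,c=d").matches()); // prints "true"
        System.out.println(keyValues.matcher("a=b,").matches());    // prints "false"
    }
}
```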
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
Matching a word boundary makes sure that a comma is required when more items follow in the string.
Please use: ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here I have set the maximum word size to 45, as the longest word in English is 45 characters; this can be changed as required.

Regex backreferences in Java

I had to match a digit followed by itself 14 times. I came up with the following regular expression in the regexstor.net/tester:
(\d)\1{14}
Edit
When I paste it into my code, with the backslashes properly escaped:
"(\\d)\\1{14}"
At first I had replaced the back-reference "\1" with "$1", which is what Java uses when replacing matches.
Then I realized that it doesn't work. When you need to back-reference a group inside the regex in Java, you have to use "\N", but when you want to refer to it in a replacement, the operator is "$N".
My question is: why?
$1 is not a back reference in Java's regexes, nor in any other flavor I can think of. You only use $1 when you are replacing something:
String input="A12.3 bla bla my input";
input = StringUtils.replacePattern(
input, "^([A-Z]\\d{2}\\.\\d).*$", "$1");
// ^^^^
There is some misinformation about what a back reference is, including the very place I got that snippet from: simple java regex with backreference does not work.
Java modeled its regex syntax after other existing flavors where the $ was already a meta character. It anchors to the end of the string (or line in multi-line mode).
Similarly, Java uses \1 for back references. Because Java regexes are written as string literals, the backslash must itself be escaped: \\1.
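The two notations can be seen side by side in a short sketch that uses only the JDK (String.matches and String.replaceAll) rather than StringUtils. The class and method names are invented for the example.

```java
// Hypothetical demo class, for illustration only.
class BackrefVsReplacement {
    // Inside the pattern, \1 (written "\\1" in a Java string literal) refers
    // back to whatever group 1 matched.
    static boolean isFifteenOfSameDigit(String s) {
        return s.matches("(\\d)\\1{14}");
    }

    // In the replacement string, $1 inserts the text captured by group 1.
    static String collapseDoubledChars(String s) {
        return s.replaceAll("(\\w)\\1", "$1");
    }

    public static void main(String[] args) {
        String fifteenFives = "55555" + "55555" + "55555";
        System.out.println(isFifteenOfSameDigit(fifteenFives)); // prints "true"
        System.out.println(collapseDoubledChars("aabbcc"));     // prints "abc"
    }
}
```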
From a lexical/syntactic standpoint it is true that $1 could be used unambiguously (as a bonus it would prevent the need for the "evil escaped escape" when using back references).
To match a 1 that comes after the end of a line the regex would need to be $\n1:
this line
1
It just makes more sense to use a familiar syntax instead of changing the rules, most of which came from Perl.
The first version of Perl came out in 1987, which is much earlier than Java, which was released in beta in 1995.
I dug up the man pages for Perl 1, which say:
The bracketing construct (\ ...\ ) may also be used, in which case \<digit> matches the digit'th substring. (Outside of the pattern, always use $ instead of \ in front of the digit. The scope of $<digit> (and $\`, $& and $') extends to the end of the enclosing BLOCK or eval string, or to the next pattern match with subexpressions. The \<digit> notation sometimes works outside the current pattern, but should not be relied upon.) You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parens before the backreference. Otherwise (for backward compatibilty) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)
I think the main problem is not the backreference, which works perfectly fine with \1 in Java.
Your problem is more likely the overall escaping of a regex pattern in Java.
If you want to have the pattern
(\d)\1{14}
passed to the regex engine, you first need to escape it, because it's a Java string literal when you write it:
(\\d)\\1{14}
Voila, works like a charm: goo.gl/BNCx7B (add http://, SO does not allow Url-Shorteners, but tutorialspoint.com has no other option as it seems)
Offline-Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld{
    public static void main(String []args){
        String test = "555555555555555"; // 5 followed by 5 for 14 times.
        String pattern = "(\\d)\\1{14}";
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(test);
        if (m.find( )) {
            System.out.println("Matched!");
        }else{
            System.out.println("not matched :-(");
        }
    }
}

Program run forever when matching regex

I don't know why, but when I try to run this program, it looks like it will run forever.
package fjr.test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test3 {
    public static void main(String[] args){
        String regex = "dssdfsdfdsf wdasdads dadlkn mdsds .";
        Pattern p = Pattern.compile("^([a-zA-Z]+ *)+$");
        Matcher match = p.matcher(regex);
        if(match.matches()){
            System.out.println("Yess");
        }else{
            System.out.println("No.. ");
        }
        System.out.println("FINISH...");
    }
}
What I need to do is match a string that contains only words separated by spaces.
Your program has likely encountered what's called catastrophic backtracking.
If you have a bit of time, let's look at how your regex works...
Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.
On the left hand side, we have our pattern:
/^([a-zA-Z]+ *)+$/
And here's the String to match:
dssdfsdfdsf wdasdads dadlkn mdsds .
From the regex101 debugger, your regex took 78540 steps to fail. This is because you used quantifiers that are greedy and not possessive (backtracking).
... Long story short, because the input string fails to match, every quantifier within your regex causes extensive backtracking: every character is released from + and then * and then both, and then a whole group is released from ()+ to backtrack even more.
Here's a few solutions you should follow:
Avoid abundant quantifiers!
If you revise your expression, you'll see that the pattern is logically almost the same as:
/^[a-zA-Z]+( +[a-zA-Z]+)*$/
This uses a step of logical induction to reduce the regex above so that it fails far quicker, now at 97 steps! (The one difference is that the rewritten pattern no longer accepts trailing spaces.)
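As a sketch of how the rewritten pattern behaves in the asker's program, the check below uses matches(), so the anchors are dropped. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class WordsSeparatedBySpaces {
    // One word, then zero or more (spaces + word) groups. Each position can
    // be matched in only one way, so a failing input is rejected quickly.
    private static final Pattern WORDS =
            Pattern.compile("[a-zA-Z]+( +[a-zA-Z]+)*");

    static boolean isWords(String s) {
        return WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWords("dssdfsdfdsf wdasdads dadlkn mdsds .")); // prints "false"
        System.out.println(isWords("dssdfsdfdsf wdasdads dadlkn mdsds"));   // prints "true"
    }
}
```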
Use possessive quantifiers while you can!
As I mentioned, /^([a-zA-Z]+ *)+$/ is evil because it backtracks in a terrible manner. We're in Java, what can we do?
This solution works only because [a-zA-Z] and the space character match distinct items. We can use possessive quantifiers!
/^([a-zA-Z]++ *+)++$/
            ^  ^  ^
Each of these extra "+" signs means "don't backtrack into this quantifier if the match fails later on". This is an extremely effective solution, and it cuts off any need for backtracking. Whenever you have two distinct groups with a quantifier in between, use them.
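Here is a minimal sketch of the possessive version running against the input from the question. The class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class PossessiveWords {
    // Same shape as the original pattern, but every quantifier is possessive:
    // once letters or spaces are consumed they are never given back.
    private static final Pattern WORDS =
            Pattern.compile("^([a-zA-Z]++ *+)++$");

    static boolean isWords(String s) {
        return WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWords("dssdfsdfdsf wdasdads dadlkn mdsds .")); // prints "false"
        System.out.println(isWords("hello world"));                         // prints "true"
    }
}
```

Unlike the rewritten non-possessive pattern, this one still accepts trailing spaces, just as the original regex did.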
Read also:
The Stack Overflow Regex Reference
ReDoS - Wikipedia
This does terminate, but it does take 10 seconds or so. Some observations:
Removing the fullstop from the end of the test string makes it fast.
Changing the * in the regex to a + (which I believe is actually what you want) makes it fast. I think having the option of 0 characters in that spot expands the state space a lot.
I would use:
^(\w+ +)*\w+$
Which means a bunch of (word, spaces) groups followed by a final word. I ran it against your example and it is fast.
You can use this regex to match all words, with or without spaces.
sample:
Pattern p = Pattern.compile("([a-zA-Z ]+)");
Interesting phenomenon. This is related to the behaviour and performance of greedy quantifiers.
Based on your pattern ^([a-zA-Z]+ *)+$ and your input "dssdfsdfdsf wdasdads dadlkn mdsds .", the pattern doesn't match your input. However, the Java regex engine will backtrack through all the possibilities of ^([a-zA-Z]+ *)+ (see the examples below) before it reports that there is no match.
(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsds )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsd)(s )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mds)(ds )
...
(d)(s)(s)(d)(f)(s)(d)(f)(d)(s)(f )(w)(d)(a)(s)(d)(a)(d)(s )(d)(a)(d)(l)(k)(n )(m)(d)(s)(d)(s )
...
(dssdfsdfdsf )
...
(d)
The number of backtracking steps could exceed 200,000,000.
I'm a little curious why the Java regex engine can't optimize this: after the first attempt it has found that the next character is '.', not the end of input that '$' expects, so no further backtracking can succeed and doing it is useless.
Therefore, when we define a loop pattern, we need to pay more attention to the inner pattern (e.g. [a-zA-Z]+ * in your example) and avoid letting it match the same text in so many ways. Otherwise, the number of backtracking steps for the whole loop will be huge.
Another similar example (worse than your case):
Pattern: "(.+)*A"
Input: "abcdefghijk lmnopqrs tuvwxy zzzzz A" -- So quick
Input: "abcdefghijAk lmnopqrs tuvwxy zzzzz " -- Not bad, just wait for a while
Input: "aAbcdefghijk lmnopqrs tuvwxy zzzzz " -- it seems infinite. (Actually it isn't, but I didn't have the patience to wait for it to finish.)

Regexp: Specific characters in the text

My goal is to detect specific characters (*,^,+,?,$,[],[^]) in some text, like:
?test.test => true
test.test => false
test^test => true
test:test => false
test-test$ => true
test-test => false
I've already created a regex for the requirement above, but I am not sure about it.
^(.*)([\[\]\^\$\?\*\+])(.*)$
It would be good to know whether it can be optimized in some way.
Your regex is already quite optimized, as it's very simple. You can only make it a little simpler or more readable.
Also, if you use the matches() method of Java's String class, then you won't need the ^ and $ at both ends.
.*([\\[\\]^$?*+]).*
Use double backslashes (\\) in a Java string; otherwise use a single backslash (\).
Note that I have removed the capturing groups around .* along with the escape character \ for the characters ^$?*+, since they are inside the character class [].
TL;DR
The quickest regex to do the job is
^[^\]\[^$?*+]*+([\]\[^$?*+])
^               #start of the string
[^              #any character BUT...
  \]\[^$?*+     #...these ones (^$?*+ aren't special inside a character class)
]*+             #zero or more times (possessive quantifier)
([              #capture any of...
  \]\[^$?*+     #...these characters
])
Be careful that in a java string, you need to escape the \ as well, so you should transform every \ into \\.
Discussion
At first, two regexes come to mind:
[\]\[^$?*+], which will match only the character you want inside the string.
^.*[\]\[^$?*+], which will match your string up to the desired character.
It's actually important performance-wise to understand the difference between the case with .* at the beginning and the one with no wildcard at all.
When searching for the pattern, the first .* will make the regex engine eat all the string, then backtrack character by character to see if it's a match for your character range [...]. So the regex will actually search from the end of the string.
This is an advantage when the character you want is near the end, and a disadvantage when it is near the beginning.
On the other case, the regex engine will try every character, beginning from the left, until it matches what you want.
You can see what I mean with these two examples from the excellent regex101.com:
with the .*, the match is found in 26 steps when the character is near the beginning and 8 when it's near the end: http://regex101.com/r/oI3pS1/#debugger
without it, the match is found in 5 steps when the character is near the beginning and 23 when it's near the end
Now, if you want to combine these two approaches you can use the tl;dr answer: you eat everything that isn't your character, then you match your character (or fail if there isn't one).
On our example, it takes 7 steps wherever your character is in the string (and 7 steps even if there is no character, thanks to the possessive quantifier).
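In Java, the TL;DR regex could be used along these lines. This is a sketch with invented class and method names; it calls find() rather than matches() because the pattern deliberately stops at the first special character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class SpecialCharFinder {
    // Possessively skip everything that is not one of ] [ ^ $ ? * +,
    // then capture the first character that is.
    private static final Pattern FIRST_SPECIAL =
            Pattern.compile("^[^\\]\\[^$?*+]*+([\\]\\[^$?*+])");

    // Returns the first special character, or null if there is none.
    static String firstSpecial(String s) {
        Matcher m = FIRST_SPECIAL.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    static boolean containsSpecial(String s) {
        return firstSpecial(s) != null;
    }

    public static void main(String[] args) {
        System.out.println(containsSpecial("?test.test")); // prints "true"
        System.out.println(containsSpecial("test.test"));  // prints "false"
        System.out.println(firstSpecial("test-test$"));    // prints "$"
    }
}
```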
That should also work:
String regex = ".*[\\[\\]^$?*+].*";
String test1 = "?test.test";
String test2 = "test.test";
String test3 = "test^test";
String test4 = "test:test";
String test5 = "test-test$";
String test6 = "test-test";
System.out.println(test1.matches(regex));
System.out.println(test2.matches(regex));
System.out.println(test3.matches(regex));
System.out.println(test4.matches(regex));
System.out.println(test5.matches(regex));
System.out.println(test6.matches(regex));
