Is it possible to split by a substring without removing it?

Is it possible to split by a substring without removing it? - java

I am trying to do the follow:
I have some strings that I need to separate, they have this form:
node:info:sequence(id:ASDF,LMD)
node:info:sequence:id:QWES
Those are the possible individual string formats...
Now I have to separate them when come concatenated by comma... like this
node:info:sequence(id:ASDF,LMD),node:info:sequence:id:QWES
So I tried
entries.split(",node");
Which... kinda works but of course I cut the "node" part from the previous string, is there anyway I can detect that , followed by node but split it by the comma , only?

You may use
s.split(",(?=node\\b)")
See the regex demo
The positive lookahead (?=node\b) will make sure only those commas are matched that are followed with a whole word node (as the \b is a word boundary).

Related

Regular Expression Match after double space, and before comma

I am trying to match the bolded portion of the below String, which would represent a city.
1795 New Test Dr Test TEst Wildwood, MI 48769-1100
There are two spaces between Dr and Test, the starting portion should happen after those double spaces, and end before the comma.
I feel like I am very close to having this correct but can't quite get it 100%, as it is including the white space characters before Test.
(?=\s{2})[\w+\s]*[^,]
The above is what I have so far, also the many other alternatives did not work either they still include the white space characters I do not want at the beginning.
I feel like I missing something simple, but even after looking many places I cannot seem to find the regex that would match this pattern.
Also I know this can be easily accomplished with split and substrings, but the requirement is a regex unfortunately, as this is for a database driven automation application and the format should be able to change on the fly without requiring a deploy due to code changes.

You need a look-behind for the spaces rather than a look-ahead, as you want the match to start immediately after them. From that point on, you can simply do a greedy match for anything that is not a comma:
(?<=\s{2})[^,]*
The * is greedy and will consume as many characters as it can, ending the match immediately before the comma.

\s actually also matches whitespace other than space, which may or may not be not be what you what.
How about ^.*? ([^,]*).*$. That's a non-greedy match at the beginning of the line ^.*?, followed by two literal spaces , then capturing everything that isn't a comma, then matching everything else to the end of the line.
Be aware, though, that when I copy and paste your example text, it does not contain two spaces. This might be causing you problems, or it's just a transcription issue and your original has the two spaces.

Tokenizing a string using negations

So i have the following problem:
I have to tokenize a string using String.split() and the tokens must be in the form 07dd ddd ddd, where d is a digit. I thought of using the following regex : ^(07\\d{2}\\s\\d{3}\\d{3}) and pass it as an argument to String.split(). But for some reason, although i do have substrings under that form, it outputs the whole initial string and doesn't tokenize it.
I initially thought that it was using an empty string as a splitter, as an empty string indeed matches that regex, but even after I added & (.)+ to the regex in order to assure that the splitter hasn't got length 0, it still outputs the whole initial string.
I know that i could have used Pattern's and Matchers to solve it much faster, but i have to use String.split(). Any ideas why this happens?

A Few Pointers
Your pattern ^(07\d{2}\s\d{3}\d{3}) is missing a space between the two last groups of digits
The reason you get the whole string back is that this pattern was never found in the first place: there is no split
If you split on this pattern (once fixed), the resulting array will be strings that are in-between this pattern (these tokens are actually removed)
If you want to use this pattern (once fixed), you need a Match All not a Split. This will look like arrayOfMatches = yourString.match(/pattern/g);
If you want to split, you need to use a delimiter that is present between the token (this delimiter could in fact just be a zero-width position asserted by the 07 about to follow)
Further Reading
Match All and Split are Two Sides of the Same Coin

Java Regex to validate String

I have just bought a book on Regex to try and get my head around it but I'm still really struggling with it. I am trying to create a java regex that will satisfy a string configuration that can;
Can contain lowercase letters ([a-z])
Can contain commas (,) but only between words
Can contain colon (:) but must be separated by words or multiply (*)
Can contain hyphens (-) but must be separated by words
Can contain multiply (*) but if used it must be the only character before/between/after the colon
Cannot contain spaces, 'words' are delimitated by a hyphens (-) or commas (,) or colon (:) or the end of the string
So for example the following would be true:
foo:bar
foo-bar:foo
foo,bar:foo
foo-bar,foo:bar,foo-bar
foo:bar:foo,bar
*:foo
foo:*
*:*:*
But the following would be false:
foo :bar
,foo:bar
foo-:bar
-foo:bar
foo,:bar-
foo:bar,
foo,*:bar
foo-*:bar
This is what I have so far:
^[a-z-]|*[:?][a-z-]|*[:?][a-z-]|*

Here is a regex that will work for all your cases:
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+)([,-][a-z]+)*|\*)*
Here is a detailed analysis:
One of the basic structures used to build complicated regular expressions like this is actually pretty simple, and has the form text(separator text)*. A regex of that form will match:
one text
one text, a separator, and another text
one text, a separator, another text, another separator, and yet another text
or more, just add another separator and a text to the end.
So here is a breakdown of the code:
[a-z]+([,-][a-z]+)* is an instance of the pattern I discussed above: the text here is [a-z]+, and the separator is [,-].
([a-z]+([,-][a-z]+)*|\*) allows an asterisk to be matched instead.
([a-z]+([,-][a-z]+)*|\*)(:([a-z]+([,-][a-z]+)*|\*))* is another instance of the pattern I discussed above: the text is ([a-z]+([,-][a-z]+)*|\*), and the separator is :.
If you plan to use this as a component of an even larger regex, in which the group matches will be important, I would recommend making the internal parens non-grouping, and place grouping parens around the entire regex, like so:
((?:[a-z]+(?:[,-][a-z]+)*|\*)(?::([a-z]+)(?:[,-][a-z]+)*|\*)*)

We rarely see here somebody who can define positive and negative test cases. That makes live really easier.
Here's my regex with a 95% solution:
"(([a-z]+|\\*)[:,-])*([a-z]+|\\*)" (JAVA-Version)
(([a-z]+|\*)[:,-])*([a-z]+|\*) (plain regex)
It simply differntiates between words (a-z or *) and separators (one of :-,) and it must contain at least one word and words must be separated by a separator. It works for the positive cases and for the negative cases except the last two negative ones.
One remark: Such a complex "syntax" would in real live be implemented with a grammer definition tool like ANTLR (or a few years ago with lex/yacc, flex/bison). Regex can do that but will not be easy to maintain.

Using regex to match beginning and end of string [Java]

I have a list of files in a folder:
maze1.in.txt
maze2.in.txt
maze3.in.txt
I've used substring to remove the .txt extensions.
How do I use regex to match the front and the back of the file name?
I need it to match "maze" at the front and ".in" at the back, and the middle must be a digit (can be single or double digit).
I've tried the following
if (name.matches("name\\din")) {
//dosomething
}
It doesn't match anything. What is the correct regex expression to use?

I'm a little confused what you are asking for in particular
^(maze[0-9]*\.in)$
This will match maze(any number).in
^(maze[0-9]*\.in)\.txt$
this will match maze(any number).in.txt -- excludes the .txt NO NEED FOR USING SUB STRING!
Edit live on Debuggex
The think i would be wary about as of right now is the capture groups... I'm not particularly sure what you are doing with this regex. However, I believe explaining capture groups could benefit you.
A capture group for instance is denoted by () this is basically store them in the pattern array and is a way to parse stuff.
example maze1.in.txt
So if you want to capture the entire line minus .txt i would use this ^(maze[0-9]*\.in\.txt)$
However, if I wanted to capture things separately I would do this ^(maze)([0-9]*)(\.in)\.txt$ this will exclude .txt but include maze, the number, and .in IN separate indexes of the pattern array.

Your original solution doesn't work because string "name" is not in your text. It is "maze".
You can try this
name.matches("maze\\d{1,2}\\.in")
d{1,2} is used to match a digit(can be single or double digit).

You need regex anchors that tell the regex to
start at the beginning: ^
and signal the end of the string: $
^maze[\d]{0,2}\.in$
or in Java:
name.matches("^maze[\\d]{0,2}\\.in$");
Also, your regex wasn't matching strings with a dot (.) which would not accept your examples given. You need to add \. to the regex to accept dots because . is a special character.

It is always good to think of what you are trying to do in english, before you create regular expressions.
You want to match a word maze followed by a digit, followed by a literal period . followed by another word.
word `\w` matches a word character
digit `\d` matches a single digit
period `\.` matches a literal period
word `\w` matches a word character
putting it all together into a single string you get (keep in mind the double backslash for the Java escape and the pluses to repeat the previous match one or more times):
"\\w+\\d\\.\\w+"
The above is the generic case for any file name in the format xxx1.yyy, if you wanted to match maze and in specifically, you can just add those in as literal strings.
"maze\\d+\\.in"
example: http://ideone.com/rS7tw1

name.matches("^maze[0-9]+\\.in\\.txt$")

java regex tricky pattern

I'm stucked for a while with a regex that does me the following:
split my sentences with this: "[\W+]"
but if it finds a word like this: "aaa-aa" (not "aaa - aa" or "aaa--aaa-aa"), the word isnt splitted, but the whole word.
Basically, i want to split a sentece per words, but also considering "aaa-aa" is a word. I'have sucessfully done that by creating two separate functions, one for spliting with \w, and other to find words like "aaa-aa". Finally, i then add both, and subctract each compound word.
For example, the sentence:
"Hello my-name is Richard"
First i collect {Hello, my, name, is, Richard}
then i collect {my-name}
then i add {my-name} to {Hello, my, name, is, Richard}
then i take out {my} and {name} in here {Hello, my, name, is, Richard}.
result: {Hello, my-name, is, Richard}
this approach does what i need, but for parsing large files, this becomes too heavy, because for each sentence there's too many copies needed. So my question is, there is anything i can do to include everything in one pattern? Like:
"split me the text using this pattern "[\W+], but if you find a word like this "aaa-aa", consider it a word and not two words.

If you want to use a split() rather than explicitly matching the words you are interested in, the following should do what you want: [\s-]{2,}|\s To break that down, you first split on two or more whitespaces and/or hyphens - so a single '-' won't match so 'one-two' will be left alone but something like 'one--two', 'one - two' or even 'one - --- - two' will be split into 'one' and 'two'. That still leaves the 'normal' case of a single whitespace - 'one two' - unmatched, so we add an or ('|') followed by a single whitespace (\s). Note that the order of the alternatives is important - RE subexpressions separated by '|' are evaluated left-to-right so we need to put the spaces-and-hyphens alternative first. If we did it the other way around, when presented with something like 'one -two' we'd match on the first whitespace and return 'one', '-two'.
If you want to interactively play around with Java REs I can thoroughly recommend http://myregexp.com/signedJar.html which allows you to edit the RE and see it matching against a sample string as you edit the RE.

Why not to use pattern \\s+? This does exactly what you want without any tricks: splits text by words separated by whitespace.

Your description isn't clear enough, but why not just split it up by spaces?

I am not sure whether this pattern would work, because I don't have developer tools for Java, you might try it though, it uses character class substraction, which is supported only in Java regex as far as I know:
[\W&&[^-]]+
it means match characters if they are [\W] and [^-], that is characters are [\W] and not [-].

Almost the same regular expression as in your previous question:
String sentence = "Hello my-name is Richard";
Pattern pattern = Pattern.compile("(?<!\\w)\\w+(-\\w+)?(?!\\w)");
Matcher matcher = pattern.matcher(sentence);
while (matcher.find()) {
System.out.println(matcher.group());
}
Just added the option (...)? to also match non-hypened words.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Is it possible to split by a substring without removing it? - java

You may use s.split(",(?=node\\b)") See the regex demo The positive lookahead (?=node\b) will make sure only those commas are matched that are followed with a whole word node (as the \b is a word boundary).

Related

Regular Expression Match after double space, and before comma

Tokenizing a string using negations

Java Regex to validate String

Using regex to match beginning and end of string [Java]

java regex tricky pattern

Categories

Resources