java regex to strip root element in xpath string

What's the easiest way to strip the root element from an xpath string, that is, a leading segment matching /\w+/, but only when the path starts with that pattern? For example:
/root/foo/bar/sushi becomes foo/bar/sushi
/my/t/fine/path becomes t/fine/path
I got this working:
String path = "/root/foo/bar/sushi";
path = path.replaceFirst("\\/(.*?)\\/", "");
but if path = "root/foo/bar/sushi", I don't want anything changed, since it doesn't start with /. The regex still strips out the first occurrence of /element/, though, resulting in rootbar/sushi. I understand why; I'm just having trouble anchoring the pattern to the start of the string.

You need the ^ anchor to specify that we are looking for /root/ at the beginning of the string. At the simplest, this regex will do it:
^/[^/]*/
In Java code, this can look like:
String replaced = your_original_string.replaceAll("^/[^/]*/", "");
This works if you know that what you are looking at is a path in the first place.
Explain Regex
^ # the beginning of the string
/ # '/'
[^/]* # any character except '/' (0 or more times, matching as many as possible)
/ # '/'
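As a quick check of the anchored pattern, here is a minimal sketch run against the inputs from the question. The class and method names are made up for illustration, and replaceFirst is used instead of replaceAll because an anchored pattern can only match once anyway.

```java
class StripRoot {
    // Removes a leading "/segment/" only when the path starts with '/'.
    static String stripRoot(String path) {
        return path.replaceFirst("^/[^/]*/", "");
    }

    public static void main(String[] args) {
        System.out.println(stripRoot("/root/foo/bar/sushi")); // foo/bar/sushi
        System.out.println(stripRoot("/my/t/fine/path"));     // t/fine/path
        System.out.println(stripRoot("root/foo/bar/sushi"));  // root/foo/bar/sushi (unchanged)
    }
}
```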
Option 2: validate at the same time
On the other hand, if you are not sure that the string is a path, then this regex is not adequate, because it will accept any character after the /root/.
In that case, you can specify your characters, for instance with
^/[^/]*/([\w-/]+)
for digits, letters, underscores, hyphens and slashes. This validation can be further refined to ensure that the characters occur in the right order.
For this regex, you would replace with:
String replaced = your_original_string.replaceAll("^/[^/]*/([\\w-/]+)", "$1");
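One caveat with the replaceAll form: the pattern is not anchored at the end, so a string with an invalid tail is only partly replaced rather than rejected. If rejection is what you want, one possible sketch is to use Matcher.matches(), which requires the whole input to fit. The class name, the method name and the Optional return type are assumptions made for this example, and the hyphen is moved to the end of the character class only to make it unambiguous.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ValidatedStripRoot {
    // "/segment/" followed by one or more word chars, slashes or hyphens.
    private static final Pattern ROOTED = Pattern.compile("/[^/]*/([\\w/-]+)");

    // Returns the path without its root element, or empty if the input
    // is not a rooted path made only of the allowed characters.
    static Optional<String> stripRoot(String path) {
        Matcher m = ROOTED.matcher(path);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(stripRoot("/root/foo/bar/sushi")); // Optional[foo/bar/sushi]
        System.out.println(stripRoot("/root/foo/bar sushi")); // Optional.empty
        System.out.println(stripRoot("root/foo/bar/sushi"));  // Optional.empty
    }
}
```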

Related

Regex pattern matching with multiple strings

Forgive me. I am not familiarized much with Regex patterns.
I have created a regex pattern as below.
String regex = Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
+ Pattern.quote(value);
This regex pattern is failing with 2 different set of strings.
value = "207e/160";
Use Case 1 -
When channelStr = "207e/160, 149/80"
Then channelStr.matches(regex), returns "true".
Use Case 2 -
When channelStr = "207e/160, 149/80, 11"
Then channelStr.matches(regex), returns "false".
I am not able to figure out why. As far as I can understand, it may be because of the multiple spaces involved when more than 2 strings are present, separated by commas.
I am not sure what the correct pattern is for more than 2 strings.
Any help will be appreciated.

If you print your pattern, it is:
\Q207e/160\E, [NnoneOoff0-9\-\+\/]+|[NnoneOoff0-9\-\+\/]+, \Q207e/160\E
It consists of an alternation | in which each side requires exactly one comma: the quoted value followed by a comma and one item, or one item followed by a comma and the quoted value.
matches() has to match the whole string, which is the case for 207e/160, 149/80, so that is a match.
The string 207e/160, 149/80, 11 contains 2 commas, so the pattern only matches the first part of the string. Because the whole string is not matched, matches() returns false.
See the matches in this regex demo.
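The difference between a whole-string match and a partial match can be seen with a small sketch that rebuilds the pattern from the question and compares matches() with find(). The class and method names are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVersusFind {
    // Rebuilds the pattern from the question for a given value.
    static Pattern questionPattern(String value) {
        String item = "[NnoneOoff0-9\\-\\+\\/]+";
        return Pattern.compile(
                Pattern.quote(value) + ", " + item + "|" + item + ", " + Pattern.quote(value));
    }

    public static void main(String[] args) {
        Pattern p = questionPattern("207e/160");
        for (String s : new String[] {"207e/160, 149/80", "207e/160, 149/80, 11"}) {
            Matcher m = p.matcher(s);
            boolean whole = m.matches(); // must consume the entire string
            m.reset();
            boolean partial = m.find();  // any matching substring is enough
            System.out.println(s + " -> matches=" + whole + ", find=" + partial);
        }
    }
}
```

For the three-item string, find() succeeds on the first two items while matches() fails, which is the behaviour described above.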
To match all the values, you can use a repeating pattern:
^[NnoeOf0-9+/-]+(?:,\h*[NnoeOf0-9+/-]+)*$
^ Start of string
[NnoeOf0-9+/-]+ Match 1+ of the characters listed in the character class
(?: Non capture group
,\h* Match a comma and optional horizontal whitespace chars
[NnoeOf0-9+/-]+ Match 1+ of the characters listed in the character class
)* Close the non capture group and optionally repeat it (if there should be at least 1 comma, then the quantifier can be + instead of *)
$ End of string
Regex demo
Example using matches():
String channelStr1 = "207e/160, 149/80";
String channelStr2 = "207e/160, 149/80, 11";
String regex = "^[NnoeOf0-9+/-]+(?:,\\h*[NnoeOf0-9+/-]+)*$";
System.out.println(channelStr1.matches(regex));
System.out.println(channelStr2.matches(regex));
Output
true
true
Note that in the character class you can put - at the end so that it does not have to be escaped; the + and / do not have to be escaped either.

You can use regex101 to test your regex. It describes everything that is going on, which helps with debugging, and the quick reference section at the bottom right shows what you can do, with examples.
A few things: you can escape a character with \ to match it literally, so \" matches a literal double quote.
If you want the pattern to be one or more of something, you would use +. These are called quantifiers and can be applied to groups, tokens, etc. The token for a whitespace character is \s. So, one or more whitespace characters would be \s+.
It's difficult to tell exactly what you're trying to do, but hopefully pointing you to regex101 will help. If you want to provide examples of the current RegEx you have, what you want to match and then the strings you're using to test it I'll be happy to provide you with an example.

^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$
^ Start
(?: ... ) Non-capturing group that defines an item and its separator. After each item, except the last, the separator (,) must appear. Spaces (one, several, or none) can appear before and after the comma, which is specified with " *" (a space followed by *). This group can appear one or more times to the end of the string, as specified by the + quantifier after the group's closing parenthesis.
Regex101 Test
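A minimal sketch of how this pattern might be used from Java, assuming the double backslashes above are Java string-literal escapes. The class and method names are hypothetical.

```java
class ChannelListCheck {
    // Each item is followed either by a comma that is not last, or by the end.
    static final String ITEMS = "^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$";

    static boolean isChannelList(String s) {
        return s.matches(ITEMS);
    }

    public static void main(String[] args) {
        System.out.println(isChannelList("207e/160, 149/80, 11")); // true
        System.out.println(isChannelList("207e/160"));             // true
        System.out.println(isChannelList("207e/160,"));            // false: trailing comma
    }
}
```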

How to match Linux parent directories with Java Regex?

Linux path is ../../test/test/mydirectory/.....
I tried to remove all the ../ with this regex
[s{/././///}]
But this removes all special characters
I only want to remove ../../../../../ and leave the real path
String result = path.replaceAll("[s{/././///}]","");
I expect the regex to identify all possible ../../../../ empty parent directories and leave only the real directory where the real path name starts
start only where the letters start

You may use
s.replaceFirst("^(?:\\.{2}/)+", "")
The pattern matches
^ - start of string
(?:\.{2}/)+ - one or more repetitions of:
\.{2} - two dots
/ - slash.
The .replaceFirst will find the first occurrence of the pattern and will replace it with an empty string.
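For illustration, a small sketch that applies this replacement to a shortened version of the path from the question. The class and method names are assumptions.

```java
class StripParentDirs {
    // Drops every leading "../" and leaves the rest of the path alone.
    static String stripLeadingParents(String path) {
        return path.replaceFirst("^(?:\\.{2}/)+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingParents("../../test/test/mydirectory")); // test/test/mydirectory
        System.out.println(stripLeadingParents("test/../mydirectory"));         // test/../mydirectory
    }
}
```

Because of the ^ anchor, a ../ in the middle of the path is left in place.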

Regex to match user and user#domain

A user can log in as "user" or as "user#domain". I only want to extract "user" in both cases. I am looking for a matcher expression to fit it, but I'm struggling.
final Pattern userIdPattern = Pattern.compile("(.*)[#]{0,1}.*");
final Matcher fieldMatcher = userIdPattern.matcher("user#test");
fieldMatcher.matches(); // group(1) is only available after a match attempt succeeds
final String userId = fieldMatcher.group(1);
userId returns "user#test". I tried various expressions but it seems that nothing fits my requirement :-(
Any ideas?

If you use "(.*)[#]{0,1}.*" pattern with .matches(), the (.*) grabs the whole line first, then, when the regex index is still at the end of the line, the [#]{0,1} pattern triggers and matches at the end of the line because it can match 0 # chars, and then .* again matches at that very location as it matches any 0+ chars. Thus, the whole line lands in your Group 1.
You may use
String userId = s.replaceFirst("^([^#]+).*", "$1");
See the regex demo.
Details
^ - start of string
([^#]+) - Group 1 (referred to with $1 from the replacement pattern): any 1+ chars other than #
.* - the rest of the string.
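To see both behaviours side by side, here is a short sketch that runs the original pattern and the suggested replacement on the same input. The class and method names are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserIdExtract {
    // Original pattern from the question: the greedy (.*) takes the whole input.
    static String greedyGroup(String login) {
        Matcher m = Pattern.compile("(.*)[#]{0,1}.*").matcher(login);
        return m.matches() ? m.group(1) : null;
    }

    // Suggested fix: keep only the characters before the first '#'.
    static String userId(String login) {
        return login.replaceFirst("^([^#]+).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(greedyGroup("user#test")); // user#test
        System.out.println(userId("user#test"));      // user
        System.out.println(userId("user"));           // user
    }
}
```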

A little bit of googling came up with this:
(.*?)(?=#|$)
This will match everything before an optional #.
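A possible way to use this lookahead pattern from Java is with Matcher.find(), taking group 1 of the first match. This is only a sketch, and the class and method names are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadUserId {
    // Lazily consume characters until the next one is '#' or the input ends.
    private static final Pattern BEFORE_HASH = Pattern.compile("(.*?)(?=#|$)");

    static String userId(String login) {
        Matcher m = BEFORE_HASH.matcher(login);
        return m.find() ? m.group(1) : login;
    }

    public static void main(String[] args) {
        System.out.println(userId("user#test")); // user
        System.out.println(userId("user"));      // user
    }
}
```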

I would suggest keeping it simple and not relying on regex in this case if you are using java and have a simple case like you provided.
You could simply do something like this:
String userId = "user#test";
if (userId.indexOf("#") != -1) {
    userId = userId.substring(0, userId.indexOf("#"));
}
// from here on userId will be "user".
This will always either strip out the "#test" or just skip stripping it out when it is not there.
Using regex in most cases makes the code less maintainable by another dev in the future because most devs are not very good with regular expressions, at least in my experience.

You made the # optional, so the greedy match grabs the longest possible user name. Since the pattern does not say that a username may not contain #, it matched the whole string.
Just use:
[^#]*
as the matching subexpr for usernames (and use $0 to get the matched string)
Or you can use this one that can be used to find several matches (and to get both the user part and the domain part):
\b([^#\s]*)(#[^#\s]*)?\b
The \b anchors tie the match to word boundaries. The first group matches non-space, non-# chars (any number; it is better to use + instead of * there, as usernames must have at least one char), optionally followed by a # and another run of non-space, non-# chars. In this case, $0 matches the whole address, $1 matches the username part, and $2 the #domain part. You can refine this to capture only the domain part by adding a new pair of parentheses, as in
\b([^#\s]*)(#([^#\s]*))?\b
See demo.
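As a rough sketch of the "several matches" use case, the following scans a text for logins and reports the user and domain parts. It follows the advice above and uses + instead of *, and it makes the #domain group non-capturing so that group 2 is the bare domain. Both changes, and all names, are choices made for this example rather than part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UserDomainScan {
    // Group 1: user part, group 2: domain part without the '#' (may be absent).
    private static final Pattern LOGIN =
            Pattern.compile("\\b([^#\\s]+)(?:#([^#\\s]+))?\\b");

    // Returns "user" or "user|domain" for each login found in the text.
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LOGIN.matcher(text);
        while (m.find()) {
            String domain = m.group(2);
            out.add(domain == null ? m.group(1) : m.group(1) + "|" + domain);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("alice#example bob")); // [alice|example, bob]
    }
}
```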

Regex for partial path

I have paths like these (single lines):
/
/abc
/def/
/ghi/jkl
/mno/pqr/
/stu/vwx/yz
/abc/def/ghi/jkl
I just need patterns that match up to the third "/". In other words, paths containing just "/" and up to the first 2 directories. However, some of my directories end with a "/" and some don't. So the result I want is:
/
/abc
/def/
/ghi/jkl
/mno/pqr/
/stu/vwx/
/abc/def/
So far, I've tried (\/|.*\/) but this doesn't get the path ending without a "/".

I would recommend this pattern:
/^(\/[^\/]+){0,2}\/?$/gm
DEMO
It works like this:
^ searches for the beginning of a line
(\/[^\/]+) searches for a path element
( starts a group
\/ searches for a slash
[^\/]+ searches for some non-slash characters
{0,2} says, that 0 to 2 of those path elements should be found
\/? allows trailing slashes
$ searches for the end of the line
Use these modifiers:
g to search for several matches within the input
m to treat every line as a separate input
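The /.../gm delimiters and flags are JavaScript-style notation. In Java, one way to get the same effect is to test each line separately with matches(), which already anchors at both ends. The sketch below does that for the paths from the question. The class and method names are made up, and note that this pattern rejects deeper paths rather than truncating them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ShallowPathFilter {
    // Up to two "/element" groups, then an optional trailing slash.
    private static final Pattern SHALLOW = Pattern.compile("(/[^/]+){0,2}/?");

    // Keeps only the paths that are at most two directories deep.
    static List<String> keepShallow(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (SHALLOW.matcher(p).matches()) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("/", "/abc", "/def/", "/ghi/jkl",
                "/mno/pqr/", "/stu/vwx/yz", "/abc/def/ghi/jkl");
        System.out.println(keepShallow(input)); // [/, /abc, /def/, /ghi/jkl, /mno/pqr/]
    }
}
```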

You need a pattern like ^(\/\w+){0,2}\/?$, it checks that you have (/ and name) no more than 2 times and that it can end with /
Details :
^ : beginning of the string
(\/\w+) : slash (escaped) and word-char, all in a group
{0,2} the group can be 0/1/2 times
\/? : slash (escaped) can be 0 or 1 time
Online DEMO
Regex DEMO

Your regex (\/|.*\/) uses an alternation which matches either a forward slash or any characters 0+ times greedy followed by matching a forward slash.
So in /ghi/jkl, for example, the first match is the first forward slash. For the next match, the .* in the second alternative first matches from the first g until the end of the string, and the engine then backtracks to the last forward slash to fulfill the whole .*\/ pattern, matching ghi/.
The trailing jkl cannot be matched by either alternative.
Note that you don't have to escape the forward slash.
You could use:
^/(?:\w+/?){0,2}$
In Java:
String regex = "^/(?:\\w+/?){0,2}$";
Regex demo
Explanation
^ Start of the string
/ Match forward slash
(?: Non capturing group
\w+ Match 1+ word characters (If you want to match more than \w you could use a character class and add to that what you want match)
/? Match optional forward slash
){0,2} Close non capturing group and repeat 0 - 2 times
$ End of the string
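With the $ anchor, a longer path such as /stu/vwx/yz is rejected outright. If the goal is instead to cut such paths down to /stu/vwx/, as in the expected output of the question, one possible variant is to drop the $ and use Matcher.lookingAt(), which anchors at the start only. This is a sketch under that assumption, with hypothetical class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathPrefix {
    // Same pattern as above without the anchors; lookingAt() anchors the start.
    private static final Pattern PREFIX = Pattern.compile("/(?:\\w+/?){0,2}");

    // Returns the part of the path up to the third slash, or null if the
    // input does not start with a slash.
    static String upToThirdSlash(String path) {
        Matcher m = PREFIX.matcher(path);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(upToThirdSlash("/stu/vwx/yz"));      // /stu/vwx/
        System.out.println(upToThirdSlash("/abc/def/ghi/jkl")); // /abc/def/
        System.out.println(upToThirdSlash("/ghi/jkl"));         // /ghi/jkl
    }
}
```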

^((/[^/]+){0,2}/?)$
To break it down
^ is the start of the string
(/[^/]+) is one path element: a slash followed by one or more non-slash characters
{0,2} means repeat the previous group between 0 and 2 times.
Then it ends with an optional slash by using a ?
String end is $ so it doesn't match longer strings.
() Around the whole thing to capture it.
But I'll point out that a regex is almost always the wrong answer for directory matching. Some directories have special meaning, like /../.. which actually goes up two directories, not down. Better to use the system's directory API instead for more robust results.
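As one example of what using a directory API could look like in Java, here is a sketch based on java.nio.file.Path. It normalizes the path first, so that .. segments are resolved, and then keeps at most the first two name elements. The class and method names are assumptions, and the textual form of the result depends on the platform's path separator.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ShallowPath {
    // Resolves "." and ".." segments, then keeps at most two name elements.
    static Path firstTwoLevels(String raw) {
        Path p = Paths.get(raw).normalize();
        if (p.getNameCount() <= 2) {
            return p;
        }
        Path head = p.subpath(0, 2);   // subpath never includes the root
        Path root = p.getRoot();       // null for relative paths
        return root == null ? head : root.resolve(head);
    }

    public static void main(String[] args) {
        System.out.println(firstTwoLevels("/abc/def/ghi/jkl"));
        System.out.println(firstTwoLevels("/abc/../x/y/z"));
    }
}
```

Unlike the regex answers, this drops the trailing slash, since Path does not keep one.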

Differentiating between slashes in a string using a regular expression

A program that I'm writing (in Java) gets input data made up of three kinds of parts, separated by a slash /. The parts can be one of the following:
A name matching the regular expression \w*
A call matching the expression \w*\(.*\)
A path matching the expression <.*>|\".*\". A path can contain slashes.
An example string could look like this:
bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()
which has the following structure
name/call/call/path/name/path/call
I want to split this string into parts, and I'm trying to do this using a regular expression. My current expression captures slashes after calls and paths, but I'm having trouble getting it to capture slashes after names without also including slashes that may exist within paths. My current expression, just capturing slashes after paths and calls looks like this:
(?<=[\)>\"])/
How can I expand this expression to also capture slashes after names without including slashes within paths?

(\w+|\w+\([^/]*\)(?:/\w+\([^/]*\))*|<[^>]*>|"[^"]*")(?=/|$)
captures this from the string 'bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()'
'bar'
'foo()/foo(bar)'
'<foo/bar>'
'bar'
'"foo/bar"'
'foo()'
It does not capture the separating slashes, though (what for? - just assume they are there).
The simpler (\w+|\w+\([^/]*\)|<[^>]*>|"[^"]*")(?=/|$) would capture calls separately:
"foo()"
"foo(bar)"
EDIT: Usually, I do a regex breakdown:
( # begin group 1 (for alternation)
\w+ # at least one word character
| # or...
\w+ # at least one word character
\( # a literal "("
[^/]* # anything but a "/", as often as possible
\) # a literal ")"
| # or...
< # a "<"
[^>]* # anything but a ">", as often as possible
> # a ">"
| # or...
" # a '"'
[^"]* # anything but a '"', as often as possible
" # a '"'
) # end group 1
(?=/|$) # look-ahead: ...followed by a slash or the end of string
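In Java, the simpler pattern can be applied with a Matcher.find() loop to collect the parts one by one. The following is a minimal sketch, with class and method names that are not from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PartSplitter {
    // name | call | <path> | "path", each followed by a slash or the end.
    private static final Pattern PART = Pattern.compile(
            "(\\w+|\\w+\\([^/]*\\)|<[^>]*>|\"[^\"]*\")(?=/|$)");

    static List<String> parts(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PART.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the seven parts: name, call, call, path, name, path, call.
        for (String part : parts("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()")) {
            System.out.println(part);
        }
    }
}
```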

My first thought was to match slashes with an even number of quotes to the left of them (i.e., using a positive look-behind of something like (".*")*), but this ends up in an exception saying
Look-behind group does not have an obvious maximum length
Honestly I think you'd be better off with a Matcher, using an ORed-together version of your components (something like \w*|\w*\(.*\)|(<.*>|\".*\")) and doing while (matcher.find()).

Leaving the delimiter unescaped when it appears inside your input might not be the best choice. However, you do have the luxury of the "false" slash always being inside a regular pattern. What I suggest...
Split the whole string on "/"
Parse each part until you get to the start of the path
Put the path elements into a list until the end of the path
Rejoin the path back on "/"
I highly recommend you consider escaping the "/" in your paths to make your life easier.
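The split-and-rejoin steps could be sketched roughly as follows. This assumes that a path always starts with < or " at the beginning of a piece and ends with the matching > or " at the end of a later piece, and that only paths contain slashes. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAndRejoin {
    static List<String> parts(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = null; // path currently being rejoined, if any
        String closer = null;         // delimiter that ends the pending path
        for (String piece : input.split("/", -1)) {
            if (pending != null) {
                // Inside a path: put the slash back and look for the end.
                pending.append('/').append(piece);
                if (piece.endsWith(closer)) {
                    out.add(pending.toString());
                    pending = null;
                }
            } else if (piece.startsWith("<") && !piece.endsWith(">")) {
                pending = new StringBuilder(piece);
                closer = ">";
            } else if (piece.startsWith("\"")
                    && !(piece.length() > 1 && piece.endsWith("\""))) {
                pending = new StringBuilder(piece);
                closer = "\"";
            } else {
                out.add(piece); // name, call, or a path without slashes
            }
        }
        if (pending != null) {
            out.add(pending.toString()); // unterminated path: keep what we have
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parts("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()"));
    }
}
```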

This pattern captures all parts of your example string separately without including the delimiter into the results:
\w+\(.*?\)|<.*>|\".*\"|\w+
