Differentiating between slashes in a string using a regular expression


A program that I'm writing (in Java) gets input data made up of three kinds of parts, separated by a slash /. The parts can be one of the following:
A name matching the regular expression \w*
A call matching the expression \w*\(.*\)
A path matching the expression <.*>|\".*\". A path can contain slashes.
An example string could look like this:
bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()
which has the following structure
name/call/call/path/name/path/call
I want to split this string into parts, and I'm trying to do this using a regular expression. My current expression captures slashes after calls and paths, but I'm having trouble getting it to capture slashes after names without also including slashes that may exist within paths. My current expression, which captures only the slashes after paths and calls, looks like this:
(?<=[\)>\"])/
How can I expand this expression to also capture slashes after names without including slashes within paths?

(\w+|\w+\([^/]*\)(?:/\w+\([^/]*\))*|<[^>]*>|"[^"]*")(?=/|$)
captures this from the string 'bar/foo()/foo(bar)/<foo/bar>/bar/"foo/bar"/foo()'
'bar'
'foo()/foo(bar)'
'<foo/bar>'
'bar'
'"foo/bar"'
'foo()'
It does not capture the separating slashes, though (there is no need to; just assume they are there).
The simpler (\w+|\w+\([^/]*\)|<[^>]*>|"[^"]*")(?=/|$) would capture calls separately:
"foo()"
"foo(bar)"
EDIT: Usually, I do a regex breakdown:
( # begin group 1 (for alternation)
\w+ # at least one word character
| # or...
\w+ # at least one word character
\( # a literal "("
[^/]* # anything but a "/", as often as possible
\) # a literal ")"
| # or...
< # a "<"
[^>]* # anything but a ">", as often as possible
> # a ">"
| # or...
" # a '"'
[^"]* # anything but a '"', as often as possible
" # a '"'
) # end group 1
(?=/|$) # look-ahead: ...followed by a slash or the end of string
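As a rough sketch of how the simpler pattern could be used from Java, the following collects group 1 of every match with Matcher.find(). The class and method names (SlashParts, parts) are made up for illustration, and the sketch assumes that calls never contain a slash between their parentheses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashParts {
    // The simpler pattern from above: name | call | <path> | "path",
    // each followed by a slash or by the end of the string.
    private static final Pattern PART = Pattern.compile(
            "(\\w+|\\w+\\([^/]*\\)|<[^>]*>|\"[^\"]*\")(?=/|$)");

    // Returns the parts in order of appearance, without the separating slashes.
    static List<String> parts(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PART.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parts("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()"));
    }
}
```

Because each path is consumed as a whole match, the scan never restarts inside <foo/bar> or "foo/bar", so the slashes within paths are left alone.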

My first thought was to match slashes with an even number of quotes to the left of them (i.e., using a positive look-behind of something like (".*")*), but this ends up in an exception saying
Look-behind group does not have an obvious maximum length
Honestly, I think you'd be better off with a Matcher, using an OR-ed together version of your components (something like \w*|\w*\(.*\)|(<.*>|\".*\")), and a while (matcher.find()) loop.

Leaving the delimiter unescaped when it appears inside your input might not be the best choice. However, you do have the luxury of the "false" slashes being inside a regular pattern. What I suggest:
Split the whole string on "/"
Parse each part until you get to the start of the path
Put the path elements into a list until the end of the path
Rejoin the path back on "/"
I highly recommend you consider escaping the "/" in your paths to make your life easier.
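The split-and-rejoin steps above might be sketched in Java as follows. This is only one possible reading of the suggestion: the names SplitRejoin, split and opens are hypothetical, and the sketch assumes that only paths (starting with < or ") can contain slashes.

```java
import java.util.ArrayList;
import java.util.List;

class SplitRejoin {
    // True if the piece starts a path that is not closed within the same piece.
    private static boolean opens(String piece, String open, String close) {
        return piece.startsWith(open)
                && !(piece.length() > 1 && piece.endsWith(close));
    }

    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder path = null; // non-null while we are inside a path
        String closer = null;      // delimiter that ends the current path
        for (String piece : input.split("/", -1)) {
            if (path != null) {
                // Inside a path: rejoin on "/" until the closing delimiter shows up.
                path.append('/').append(piece);
                if (piece.endsWith(closer)) {
                    result.add(path.toString());
                    path = null;
                }
            } else if (opens(piece, "<", ">")) {
                path = new StringBuilder(piece);
                closer = ">";
            } else if (opens(piece, "\"", "\"")) {
                path = new StringBuilder(piece);
                closer = "\"";
            } else {
                result.add(piece); // a name, a call, or a path without slashes
            }
        }
        if (path != null) {
            result.add(path.toString()); // unterminated path: keep what we have
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("bar/foo()/foo(bar)/<foo/bar>/bar/\"foo/bar\"/foo()"));
    }
}
```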

This pattern captures all parts of your example string separately without including the delimiter into the results:
\w+\(.*?\)|<.*>|\".*\"|\w+

Related

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?

In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']
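For comparison, a minimal Java sketch of the same idea, using Matcher.find() and capture group 1 as the language-specific extraction mechanism. The class and method names (QuotedValues, extract) are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValues {
    // "(.*?)" : a quote, as few characters as possible, and the next quote.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    // Returns the text between each pair of double quotes.
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            values.add(m.group(1)); // group 1 excludes the surrounding quotes
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("\"Foo Bar\" \"Another Value\" something else"));
    }
}
```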

I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; ((?=(\\?))\2.) if a backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.

I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.

Let's look at two efficient ways to deal with escaped quotes. These patterns are designed to be efficient rather than concise or aesthetic.
Both use first-character discrimination to quickly find quotes in the string without the cost of an alternation. (The idea is to quickly discard characters that are not quotes without testing both branches of the alternation.)
Content between quotes is described with an unrolled loop (instead of a repeated alternation) to be more efficient too: [^"\\]*(?:\\.[^"\\]*)*
Obviously, to deal with strings that don't have balanced quotes, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround to emulate them, to prevent too much backtracking. You can also decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the last quote optional.
Notice: sometimes quotes are not escaped with a backslash but by repeating the quote. In this case the content subpattern looks like this: [^"]*(?:""[^"]*)*
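To make the unrolled loop concrete, here is a small Java sketch that wraps the backslash-escaping content subpattern in double quotes and lists the full matches. It is an illustration only: the names UnrolledQuotes and findAll are made up, and single quotes are deliberately left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledQuotes {
    // Regex: "[^"\\]*(?:\\.[^"\\]*)*"
    // normal characters, then (escaped character + normal characters) repeated.
    private static final Pattern DOUBLE_QUOTED = Pattern.compile(
            "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", Pattern.DOTALL);

    // Returns every double-quoted string, quotes included.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DOUBLE_QUOTED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The input is: say "a \"b\" c" and "d"
        System.out.println(findAll("say \"a \\\"b\\\" c\" and \"d\""));
    }
}
```

Pattern.DOTALL plays the role of (?s:...) in the Perl-like pattern, so an escaped newline is accepted as well.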
The patterns avoid the use of a capture group and a backreference (I mean something like (["']).....\1) and use a simple alternation instead, but with ["'] factored out at the beginning.
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(Note that (?s:...) is syntactic sugar that switches on dotall/singleline mode inside the non-capturing group. If this syntax is not supported, you can easily switch this mode on for the whole pattern or replace the dot with [\s\S].)
(The way this pattern is written is totally "hand-driven" and doesn't take possible internal engine optimizations into account.)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'

Peculiarly, none of these answers produce a regex where the returned match is the text inside the quotes, which is what is asked for. MA-Madden tries but only gets the inside match as a captured group rather than the whole match. One way to actually do it would be:
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<=) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character to check for a quote; if one is found, matching starts from there. The lookahead then checks the character ahead for a quote and, if one is found, stops on that character. The lookbehind group (the ["']) is wrapped in parentheses to create a group for whichever quote was found at the start; this is then used in the end lookahead (?=\1) to make sure it only stops when it finds the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.

The regex of the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are regexes that return only the values between quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.

I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string from $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
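A short Java sketch of the grouped variant, reading the content from group 2. The names QuoteContents and contents are hypothetical and only serve the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteContents {
    // Regex: (['"])((?:(?!\1|\\).|\\.)*)\1
    // group 1 = opening quote, group 2 = content (escapes allowed).
    private static final Pattern QUOTED = Pattern.compile(
            "(['\"])((?:(?!\\1|\\\\).|\\\\.)*)\\1");

    // Returns the content of every quoted string, without the quotes.
    static List<String> contents(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // The input is: 'it\'s' "x"
        System.out.println(contents("'it\\'s' \"x\""));
    }
}
```

The escape sequences are returned as written in the input (the backslash stays in it\'s); unescaping would be a separate step.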

A very late answer, but I'd like to add:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1

The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad but could be better). Mine, below, is about 20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance, it cannot capture many strings (if needed I can provide an exhaustive test case), like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.

This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/

MORE ANSWERS! Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
replace the word icon with what you're looking for inside the quotes and voila!
The way this works is that it looks for the keyword and doesn't care what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
the regex looks for a quote mark "
then it looks for any run of characters that is not "
until it finds icon
and any run of characters that is not "
it then looks for a closing "

I liked Axeman's more expansive version, but had some trouble with it (for example, it didn't match
foo "string \\ string" bar
or
foo "string1" bar "string2"
correctly), so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1

import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
In the Python string literal, \ escapes the quote character that follows it.

Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
And just add parentheses if you want to get the content in quotes, like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches quote char and $2 matches content string.

All the answers above are good... except they DO NOT support all Unicode characters in ECMAScript (JavaScript)!
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.

My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["']) -> Matches either ' or " and stores it in backreference \1 once a match is found
.* -> Greedily matches everything, zero or more times, up to the end of the line. The regex engine then backtracks character by character until the rest of the pattern can match.
\1 -> Matches the same character that was matched earlier by the first capture group.
(?![^\s]) -> Negative lookahead to ensure that no non-whitespace character follows the previous match

echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here I show each result string between > and < for clarity. With this sed command, which uses the negated class [^\"]* in place of a non-greedy match, we first throw out the junk before and after the quoted string and then replace it with the part between the quotes, surrounded by > and <.

From Greg H. I was able to create this regex to suit my needs.
I needed to match a specific value that was qualified by being inside quotes. It must be a full match; no partial match should trigger a hit,
e.g. "test" should not match "test2".
import re
reg = r"""(['"])(%s)\1"""
# re.escape keeps regex metacharacters in needle from being interpreted
if re.search(reg % re.escape(needle), haystack, re.IGNORECASE):
    print("winning...")
Hunter
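The same check could be sketched in Java roughly like this. Pattern.quote is used so that the needle is matched literally; the names QuotedNeedle and containsQuoted are assumptions made for the example.

```java
import java.util.regex.Pattern;

class QuotedNeedle {
    // True if haystack contains needle as a complete quoted value,
    // enclosed in matching single or double quotes (case-insensitive).
    static boolean containsQuoted(String haystack, String needle) {
        Pattern p = Pattern.compile(
                "(['\"])" + Pattern.quote(needle) + "\\1",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(containsQuoted("x \"test2\" y", "test")); // partial value
        System.out.println(containsQuoted("x 'Test' y", "test"));    // full value
    }
}
```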

If you're trying to find strings that only have a certain suffix, such as dot syntax, you can try this:
\"([^\"]*?[^\"]*?)\".localized
Where .localized is the suffix.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".

A supplementary answer for the subset of Microsoft VBA coders only: one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True
    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)
    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2
    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub

Regex for partial path

I have paths like these (single lines):
/
/abc
/def/
/ghi/jkl
/mno/pqr/
/stu/vwx/yz
/abc/def/ghi/jkl
I just need patterns that match up to the third "/". In other words, paths containing just "/" and up to the first 2 directories. However, some of my directories end with a "/" and some don't. So the result I want is:
/
/abc
/def/
/ghi/jkl
/mno/pqr/
/stu/vwx/
/abc/def/
So far, I've tried (\/|.*\/) but this doesn't get the path ending without a "/".

I would recommend this pattern:
/^(\/[^\/]+){0,2}\/?$/gm
DEMO
It works like this:
^ searches for the beginning of a line
(\/[^\/]+) searches for a path element
( starts a group
\/ searches for a slash
[^\/]+ searches for some non-slash characters
{0,2} says that 0 to 2 of those path elements should be found
\/? allows a trailing slash
$ searches for the end of the line
Use these modifiers:
g to search for several matches within the input
m to treat every line as a separate input

You need a pattern like ^(\/\w+){0,2}\/?$, it checks that you have (/ and name) no more than 2 times and that it can end with /
Details :
^ : beginning of the string
(\/\w+) : slash (escaped) and word-char, all in a group
{0,2} the group can be 0/1/2 times
\/? : slash (escaped) can be 0 or 1 time
Online DEMO
Regex DEMO

Your regex (\/|.*\/) uses an alternation which matches either a forward slash or any characters 0+ times greedy followed by matching a forward slash.
So in /ghi/jkl, for example, the first match will be the first forward slash. Then the .* part of the second alternative will match from the first g until the end of the string. The engine will backtrack to the last forward slash to fulfill the whole .*\/ pattern.
The trailing jkl can no longer be matched by either alternative.
Note that you don't have to escape the forward slash.
You could use:
^/(?:\w+/?){0,2}$
In Java:
String regex = "^/(?:\\w+/?){0,2}$";
Regex demo
Explanation
^ Start of the string
/ Match forward slash
(?: Non capturing group
\w+ Match 1+ word characters (If you want to match more than \w you could use a character class and add to that what you want match)
/? Match optional forward slash
){0,2} Close non capturing group and repeat 0 - 2 times
$ End of the string
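With the $ anchor, the pattern only accepts strings that already contain at most two directories. If the goal is to cut longer paths such as /stu/vwx/yz down to /stu/vwx/, one option (an adaptation, not part of the answer above) is to drop the $ and take whatever the pattern matches at the start. A minimal sketch, with hypothetical names PathPrefix and prefix:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathPrefix {
    // Same pattern as above but without the trailing $, so the match
    // simply stops after at most two directories.
    private static final Pattern PREFIX = Pattern.compile("^/(?:\\w+/?){0,2}");

    // Returns the leading part of the path up to the third slash,
    // or an empty string if the path does not start with a slash.
    static String prefix(String path) {
        Matcher m = PREFIX.matcher(path);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String[] paths = {"/", "/abc", "/def/", "/ghi/jkl", "/mno/pqr/",
                          "/stu/vwx/yz", "/abc/def/ghi/jkl"};
        for (String p : paths) {
            System.out.println(prefix(p));
        }
    }
}
```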

^((/[^/]+){0,2}/?)$
To break it down
^ is the start of the string
{0,2} means repeat the previous between 0 and 2 times.
Then it ends with an optional slash by using a ?
String end is $ so it doesn't match longer strings.
() Around the whole thing to capture it.
But I'll point out that a regex is almost always the wrong answer for directory matching. Some directories have special meaning, like /../.. which actually goes up two directories, not down. Better to use the system's directory API instead for more robust results.

java regex to strip root element in xpath string

What's the easiest way to strip the root element from an xpath string, where the root is anything matching /\w+/ at the start of the path, like this:
/root/foo/bar/sushi becomes foo/bar/sushi
/my/t/fine/path becomes t/fine/path
I got this working:
String path = '/root/foo/bar/sushi'
path.replaceFirst('\\/(.*?)\\/', '')
but if path='root/foo/bar/sushi', I don't want anything changed, since that doesn't start with /, but it still strips out the first occurrence of /element/, resulting in rootbar/sushi. I understand why, just having trouble validating the start pattern.

You need the ^ anchor to specify that we are looking for /root/ at the beginning of the string. At the simplest, this regex will do it:
^/[^/]*/
In Java code, this can look like:
String replaced = your_original_string.replaceAll("^/[^/]*/", "");
This works if you know that what you are looking at is a path in the first place.
Explain Regex
^ # the beginning of the string
/ # '/'
[^/]* # any character except: '/' (0 or more times
# (matching the most amount possible))
/ # '/'
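A quick way to check the anchored pattern against the cases from the question is a sketch like the following (StripRoot and stripRoot are illustrative names only):

```java
class StripRoot {
    // Removes a leading "/element/" and leaves every other string untouched.
    static String stripRoot(String path) {
        return path.replaceFirst("^/[^/]*/", "");
    }

    public static void main(String[] args) {
        System.out.println(stripRoot("/root/foo/bar/sushi"));
        System.out.println(stripRoot("/my/t/fine/path"));
        System.out.println(stripRoot("root/foo/bar/sushi")); // no leading slash
    }
}
```

Since the pattern is anchored with ^, it can match at most once, so replaceFirst and replaceAll behave the same here.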
Option 2: validate at the same time
On the other hand, if you are not sure that the string is a path, then this regex is not adequate because it will accept any character after the /root/
In that case, you can specify your characters, for instance with
^/[^/]*/([\w-/]+)
for digits, letters, underscores, hyphens and slashes. This validation can be further refined to ensure that the characters occur in the right order.
For this regex, you would replace with:
String replaced = your_original_string.replaceAll("^/[^/]*/([\\w-/]+)", "$1");

Regex lookaround construct in Java: advise on optimization needed

I am trying to search for filenames in a comma-separated list in:
text.txt,temp_doc.doc,template.tmpl,empty.zip
I use Java's regex implementation. Requirements for output are as follows:
Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
It should look like:
text
template
empty
So far I have managed to write more or less satisfactory regex to cope with the first task:
[^\\.,]++(?=\\.[^,]*+,?+)
I believe the best option for the second requirement is to use lookaround constructs, but I'm not sure how to write a reliable and optimized expression. While the following regex does seem to do what is required, it is obviously a flawed solution, if for no other reason than that it relies on an explicit maximum filename length.
(?!temp_|emp_|mp_|p_|_)(?<!temp_\\w{0,50})[^\\.,]++(?=\\.[^,]*+,?+)
P.S. I've been studying regexes only for a few days, so please don't laugh at this newbie-style overcomplicated code :)

Display only filenames and not their respective extensions
Exclude files that begin with "temp_"
One variant would be like this:
(?:^|,)(?!temp_)((?:(?!\.[^.]*(?:,|$)).)+)
This allows
file names that do not begin with a "word character" (Tim Pietzcker's solution does not)
file names that contain a dot (sth. like file.name.ext will be matched as file.name)
But actually, this is really complex. You'll be better off writing a small function that splits the input at the commas and strips the extension from the parts.
Anyway, here's the tear-down:
(?:^|,) # filename start: either start of the string or comma
(?!temp_) # negative look-ahead: disallow filenames starting with "temp_"
( # match group 1 (will contain your file name)
(?: # non-capturing group (matches one allowed character)
(?! # negative look-ahead (not followed by):
\. # a dot
[^.]* # any number of non-dots (this matches the extension)
(?:,|$) # filename-end (either end of string or comma)
) # end negative look-ahead
. # this character is valid, match it
)+ # end non-capturing group, repeat
) # end group 1
http://rubular.com/r/4jeHhsDuJG
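The small split-and-strip function recommended above could look something like this in Java. It is only a sketch: the names BaseNames and baseNames are invented, and it assumes that everything after the last dot is the extension.

```java
import java.util.ArrayList;
import java.util.List;

class BaseNames {
    // Splits a comma-separated list of file names, skips names starting
    // with "temp_", and strips the extension from the remaining ones.
    static List<String> baseNames(String csv) {
        List<String> result = new ArrayList<>();
        for (String file : csv.split(",")) {
            if (file.startsWith("temp_")) {
                continue;
            }
            int dot = file.lastIndexOf('.');
            // dot > 0 keeps names without an extension (or like ".hidden") intact
            result.add(dot > 0 ? file.substring(0, dot) : file);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(baseNames("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```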

How about this:
Pattern regex = Pattern.compile(
    "\\b # Start at word boundary\n" +
    "(?!temp_) # Exclude words starting with temp_\n" +
    "[^,]+ # Match one or more characters except comma\n" +
    "(?=\\.) # until the last available dot",
    Pattern.COMMENTS);
This also allows dots within filenames.

Another option:
(?:temp_[^,.]*|([^,.]*))\.[^,]*
That pattern will match all file names, but will capture only valid names.
If at the current position the pattern can match temp_file.ext, it matches it and does not capture.
If it cannot match temp_, it tries to match ([^,.]*)\.[^,]* and captures the file's name.
You can see an example here: http://www.rubular.com/r/QywiDgFxww
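In Java, the "match everything, capture only what you want" idea comes down to checking whether group 1 took part in the match. A minimal sketch, with made-up names SkipTemp and names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipTemp {
    // Every file name is matched, but group 1 is only set for names
    // that do not start with "temp_".
    private static final Pattern FILE = Pattern.compile(
            "(?:temp_[^,.]*|([^,.]*))\\.[^,]*");

    static List<String> names(String csv) {
        List<String> result = new ArrayList<>();
        Matcher m = FILE.matcher(csv);
        while (m.find()) {
            if (m.group(1) != null) { // null means the temp_ branch matched
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names("text.txt,temp_doc.doc,template.tmpl,empty.zip"));
    }
}
```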

Regex for multiple lines

I am looking for a pattern for multiple lines
I am new to regex and am using them heavily in my project.
I need to come up with a pattern that will match a few group of lines. The pattern should
match either these lines
* Source: Test *
* *
or
Ord. 429 Tckt. 1
or
Guest:
Yes, it is not clear. I got a pattern for the second line ( Ord. 429 Tckt. 1) which is:
[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

If you need one large regex to match all of these, the following should work if you have the Pattern.DOTALL and Pattern.MULTILINE flags set (see Rubular):
^\*[^\n]*\*$.*?^\*[^\n]*\*$|^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$|^Guest:[^\n]*$
Here is a breakdown of the different sections (split by the |):
Your first group of lines:
^\*[^\n]*\*$.*?^\*[^\n]*\*$
---------------------------
^ # start of a line
\* # a literal '*'
[^\n]* # any number of non-newline characters
\* # a literal '*'
$ # end of a line
.*? # any number of characters, as few as possible (includes newlines)
^\*[^\n]*\*$ # repeat of the first six elements of pattern as described above
The second line portion (for lines like 'Ord. 429 Tckt. 1') is adapted from yours with some minor changes.
^\w+\.[ \t]+\d+[ \t]+\w+\.[ \t]+\d+$
As for the third, it should be pretty basic, start of a line followed by 'Guest:' and then any number of non-newline characters.
^Guest:[^\n]*$
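Put together in Java, the combined pattern could be tried out as in the sketch below. The sample input is an assumption about what the text looks like (no leading whitespace on any line), and the names ReceiptBlocks and blocks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReceiptBlocks {
    // The three alternatives from above, compiled with DOTALL (so .*? can
    // cross line breaks) and MULTILINE (so ^ and $ work per line).
    private static final Pattern BLOCK = Pattern.compile(
            "^\\*[^\\n]*\\*$.*?^\\*[^\\n]*\\*$"
            + "|^\\w+\\.[ \\t]+\\d+[ \\t]+\\w+\\.[ \\t]+\\d+$"
            + "|^Guest:[^\\n]*$",
            Pattern.DOTALL | Pattern.MULTILINE);

    // Returns each matched block of lines in order of appearance.
    static List<String> blocks(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "* Source: Test *\n* *\nOrd. 429 Tckt. 1\nGuest: Bob";
        for (String block : blocks(text)) {
            System.out.println("[" + block + "]");
        }
    }
}
```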

Add the dotall switch (?s), which lets . match line terminators, to the front of your regex (the multi-line switch is (?m)):
(?s)[\s]+[\w]+[\.][\s]+[\d]+[\s]+[\w]+[\.][\s]+[\d]+

I'm assuming that you are using Java. You would be using java.util.regex. You are probably looking for the Pattern.DOTALL flag on Pattern. This treats line terminators as characters that you can match with the dot (.).
Pattern.compile("^\\*\\sSource: Test\\s*\\*\\s*", Pattern.DOTALL);
It depends on how strict you want to be, but the above will match the first line in the first snippet (including the line terminator).
If you need more help with the API or this is the wrong API, edit your question to be clearer.
Are you trying to match all three in a single regex? It can be done, but the pattern will be a bit ugly. I can probably help with that too.
A decent regex tester page is: http://www.fileformat.info/tool/regex.htm. You can do a google search for something like regex java tester.
Just one last thing, the pattern at the bottom won't do what you want if I understand fully.
[\s]+ matches one or more whitespace characters, so whitespace is required at the front. Also, you don't need the square brackets. They work, but are only needed when you want to match one of several characters. If you wanted to match either a or b but not both: [ab]. But if you want to match just a, you just put a in your pattern.
\s+ one or more whitespace characters
\w+ one or more word chars (letters, digits or underscore; no punctuation, etc.)
\. period
\s+ some whitespace
\d+ some digits
\s+ some whitespace
\w+ some word chars
\. period
\s+ some whitespace
\d+ one or more digits
so,
\s+\w+\.\s+\d+\s+\w+\.\s+\d+
Are there supposed to be blank lines in between the Source: Test and the line with just the stars?
You are going to end up with something like this:
(?: # non-capturing group
\s*\* Source: Test\s+\* # first line of the first block
\s+\*\s+\* # second line, assuming that there is no space
# between lines or an arbitrary amount of whitespace
) # end of first group
| # or....
(?: # second group (non capturing)
\s+\w+\.\s+\d+\s+\w+\.\s+\d+ # what we discussed before for Ord/Tckt
)
|
(?:\s+Guest:) # the last one is easy :)
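You may or may not know this, but comments like the ones above can be put into your code via the Pattern.COMMENTS flag (note that this flag also makes the engine ignore literal whitespace in the pattern, so the spaces in "Source: Test" would have to be written as \s or escaped). Some people like that. I've also broken up the different groups into their own constants and then pasted them together when compiling the pattern. I like that pretty well.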
I hope all of this helps.
