Java - Regex: Several matches in same string


I have a string: String s = "The input must be of format: '$var1$'-'$var1$'-'$var1$'".
I want to replace the text between the $ signs with other text, so that the console output looks like this:
"The input must be of format: '$REPLACED$'-'$REPLACED$'-'$REPLACED$'"
I got as far as s.replaceAll("\\$.+\\$", "\\$REPLACED\\$");, but that results in
"The input must be of format: '$REPLACED$'" (the first and the last $ are taken as the borders).
How can I tell the regex engine that there are several occurrences and that each one needs to be processed (i.e. replaced)?
Thanks for your help!
Edit: Thanks for your help. The "greedy thing" was the problem. Adding a ? to the regex fixed my issue. The solution now looks like this (for those with a similar problem):
s.replaceAll("\\$.+?\\$", "\\$REPLACED\\$");

The effect you're experiencing is called greediness: An expression like .+ will match as many characters as possible.
Use .+? instead to make the expression ungreedy and match as few characters as possible.
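To make the difference concrete, here is a minimal sketch that runs both variants against the string from the question. The class and method names are made up for illustration.

```java
class GreedyVsReluctant {
    // Greedy: .+ grabs as much as possible, so a single match spans
    // from the first $ to the last $ in the input.
    static String greedy(String s) {
        return s.replaceAll("\\$.+\\$", "\\$REPLACED\\$");
    }

    // Reluctant: .+? stops at the nearest closing $, so every
    // placeholder is matched and replaced on its own.
    static String reluctant(String s) {
        return s.replaceAll("\\$.+?\\$", "\\$REPLACED\\$");
    }

    public static void main(String[] args) {
        String s = "The input must be of format: '$var1$'-'$var1$'-'$var1$'";
        System.out.println(greedy(s));
        // The input must be of format: '$REPLACED$'
        System.out.println(reluctant(s));
        // The input must be of format: '$REPLACED$'-'$REPLACED$'-'$REPLACED$'
    }
}
```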

The + quantifier is greedy, so it tries to find the longest possible match. This means that [$].+[$] will match
a$b$c$e
 ^^^^^
If you want .+ to find the shortest possible match, you can either
add ? after the + quantifier (.+?), which makes + reluctant, or
instead of accepting every character (.) between the $ signs, accept only characters that are not $, using [^$].
So try changing your regex to
s.replaceAll("\\$.+?\\$", "\\$REPLACED\\$");
or
s.replaceAll("\\$[^$]+?\\$", "\\$REPLACED\\$");

This should work:
String s = "The input must be of format: '$var1$'-'$var1$'-'$var1$'";
System.out.println( s.replaceAll("\\$[^$]*\\$", "\\$REPLACED\\$") );
//=> The input must be of format: '$REPLACED$'-'$REPLACED$'-'$REPLACED$'
The regex \\$[^$]*\\$ matches a literal $, then any characters up to the next $, and then that closing literal $.
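If the replacement text is not a fixed literal, the $ signs in it have to be escaped for replaceAll. One way to sketch that, assuming a hypothetical helper class PlaceholderReplacer, is to precompile the pattern and let Matcher.quoteReplacement do the escaping:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderReplacer {
    // \$      a literal dollar sign
    // [^$]*   any run of characters that are not a dollar sign
    // \$      the closing dollar sign
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$[^$]*\\$");

    // quoteReplacement escapes $ and \ so the replacement is inserted literally.
    static String replace(String input, String replacement) {
        return PLACEHOLDER.matcher(input).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String s = "The input must be of format: '$var1$'-'$var1$'-'$var1$'";
        System.out.println(replace(s, "$REPLACED$"));
    }
}
```

Because the pattern uses * rather than +, an empty placeholder such as $$ is replaced as well.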

Related

Java Regex is not giving the output as expected

networks[0]/site[9785d8e8-9b1f-3fc0-8271-6e32f58fb725]/equipment/location[144ae20e-be33-32e2-8b52-798e968e88b9]
The objective is to get 9785d8e8-9b1f-3fc0-8271-6e32f58fb725 from the string above. I have written the regex below, but it gives "location" as the output.
.*\\/([^\\/]+)\\[.*\\]$
Could anyone suggest a proper regex to get 9785d8e8-9b1f-3fc0-8271-6e32f58fb725 from the string above?

You can search using this regex:
^[^/]+/[^\[/]*\[|\].*
and replace with empty string.
RegEx Explanation:
^[^/]+/[^\[/]*\[: This pattern matches the text before the first /, then the / itself, followed by text up to and including the next [
\].*: Matches ] and everything after it
Code:
String s = "networks[0]/site[9785d8e8-9b1f-3fc0-8271-6e32f58fb725]/equipment/location[144ae20e-be33-32e2-8b52-798e968e88b9]";
String r = s.replaceAll("^[^/]+/[^\\[/]*\\[|\\].*", "");
//=> "9785d8e8-9b1f-3fc0-8271-6e32f58fb725"

You can just use site\[(.+?)\] and read the ID from capture group 1.
P.S. Your current expression actually does the following:
.* consumes as much as it can,
up to the last / that still allows the rest of the pattern to match,
then captures the sequence after that / that contains neither \ nor /,
which must be followed immediately by [...] with any content, ending at the very end of the string.
So the only part that gets captured is location

This should do the trick:
^networks\[\d\]\/site\[([^]]+)\].*$
It will match
the literal string networks[, a single digit, and the literal string ]/site[
followed by your id
followed by ] and arbitrary stuff
You can then extract your ID from the first capturing group.
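In Java, reading the first capturing group could look like the sketch below. The class name SiteIdExtractor is hypothetical, and the pattern is the shorter site\[...\] variant rather than the fully anchored one.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SiteIdExtractor {
    // site\[      the literal text "site["
    // ([^\]]+)    group 1: everything up to the next ']'
    // \]          the closing bracket
    private static final Pattern SITE_ID = Pattern.compile("site\\[([^\\]]+)\\]");

    static Optional<String> extract(String path) {
        Matcher m = SITE_ID.matcher(path);
        // find() must succeed before group() may be called.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String s = "networks[0]/site[9785d8e8-9b1f-3fc0-8271-6e32f58fb725]"
                 + "/equipment/location[144ae20e-be33-32e2-8b52-798e968e88b9]";
        System.out.println(extract(s).orElse("no site id found"));
    }
}
```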

Regex for finding the text inside parentheses followed by #en : "example"#en [duplicate]

I have a value like this:
"Foo Bar" "Another Value" something else
What regex will return the values enclosed in the quotation marks (e.g. Foo Bar and Another Value)?

In general, the following regular expression fragment is what you are looking for:
"(.*?)"
This uses the non-greedy *? operator to capture everything up to but not including the next double quote. Then, you use a language-specific mechanism to extract the matched text.
In Python, you could do:
>>> import re
>>> string = '"Foo Bar" "Another Value"'
>>> print(re.findall(r'"(.*?)"', string))
['Foo Bar', 'Another Value']

I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
(["']) match a quote; (?:(?=(\\?))\2.) if a backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, so as not to eat the closing quote); \1 match the same quote that was used for opening.

I would go for:
"([^"]*)"
The [^"] is regex for any character except '"'
The reason I use this over the non greedy many operator is that I have to keep looking that up just to make sure I get it correct.
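For Java readers, a minimal sketch of the same negated-character-class idea follows. The class name QuotedValues is made up, and escaped quotes are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedValues {
    // "          opening double quote
    // ([^"]*)    group 1: any run of characters that are not a double quote
    // "          closing double quote
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    static List<String> extract(String input) {
        List<String> values = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            values.add(m.group(1)); // content without the surrounding quotes
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("\"Foo Bar\" \"Another Value\" something else"));
        // [Foo Bar, Another Value]
    }
}
```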

Let's look at two efficient ways of dealing with escaped quotes. These patterns are designed to be efficient rather than concise or elegant.
Both rely on first-character discrimination to find quotes in the string quickly, without the cost of an alternation. (The idea is to discard characters that are not quotes quickly, without having to test both branches of the alternation.)
The content between quotes is described with an unrolled loop (instead of a repeated alternation), which is also more efficient: [^"\\]*(?:\\.[^"\\]*)*
To deal with strings whose quotes are not balanced, you can use possessive quantifiers instead: [^"\\]*+(?:\\.[^"\\]*)*+ or a workaround that emulates them, to prevent excessive backtracking. Alternatively, you can decide that a quoted part runs from an opening quote to the next (non-escaped) quote or to the end of the string. In that case there is no need for possessive quantifiers; you only need to make the closing quote optional.
Note: sometimes quotes are escaped not with a backslash but by doubling the quote. In that case the content subpattern looks like this: [^"]*(?:""[^"]*)*
The patterns avoid a capture group and a backreference (something like (["']).....\1) and use a simple alternation instead, with ["'] factored out at the beginning.
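As a rough Java illustration of the unrolled loop with possessive quantifiers, the sketch below handles double quotes only. The class name UnrolledQuoteMatcher and the capture group around the content are additions for the example, not part of the patterns in this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledQuoteMatcher {
    // "                      opening quote
    // [^"\\]*+               run of ordinary characters (possessive)
    // (?:\\.[^"\\]*+)*+      an escaped character, then more ordinary characters
    // "                      closing quote
    // DOTALL lets \\. also match a backslash followed by a line break.
    private static final Pattern QUOTED = Pattern.compile(
            "\"([^\"\\\\]*+(?:\\\\.[^\"\\\\]*+)*+)\"", Pattern.DOTALL);

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // raw content, escape sequences left untouched
        }
        return result;
    }

    public static void main(String[] args) {
        // The input contains: say "a \"quoted\" word" and "plain"
        System.out.println(extract("say \"a \\\"quoted\\\" word\" and \"plain\""));
    }
}
```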
Perl like:
["'](?:(?<=")[^"\\]*(?s:\\.[^"\\]*)*"|(?<=')[^'\\]*(?s:\\.[^'\\]*)*')
(Note that (?s:...) is syntactic sugar that switches on dotall/singleline mode inside the non-capturing group. If this syntax is not supported, you can switch the mode on for the whole pattern or replace the dot with [\s\S].)
(This pattern is entirely hand-tuned and does not take possible internal engine optimizations into account.)
ECMA script:
(?=["'])(?:"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*')
POSIX extended:
"[^"\\]*(\\(.|\n)[^"\\]*)*"|'[^'\\]*(\\(.|\n)[^'\\]*)*'
or simply:
"([^"\\]|\\.|\\\n)*"|'([^'\\]|\\.|\\\n)*'

Peculiarly, none of these answers produces a regex whose returned match is the text inside the quotes, which is what was asked for. MA-Madden tries, but only gets the inside text as a captured group rather than as the whole match. One way to actually do it would be:
(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)
Examples for this can be seen in this demo https://regex101.com/r/Hbj8aP/1
The key here is the positive lookbehind at the start (the ?<=) and the positive lookahead at the end (the ?=). The lookbehind looks behind the current character for a quote and, if one is found, the match starts from there; the lookahead then checks the character ahead for a quote and, if one is found, the match stops before that character. The lookbehind group (the ["']) is wrapped in parentheses to create a group for whichever quote was found at the start; this group is then used in the closing lookahead (?=\1) to make sure the match only stops at the corresponding quote.
The only other complication is that because the lookahead doesn't actually consume the end quote, it will be found again by the starting lookbehind which causes text between ending and starting quotes on the same line to be matched. Putting a word boundary on the opening quote (["']\b) helps with this, though ideally I'd like to move past the lookahead but I don't think that is possible. The bit allowing escaped characters in the middle I've taken directly from Adam's answer.

The regex in the accepted answer returns the values including their surrounding quotation marks: "Foo Bar" and "Another Value" as matches.
Here are regexes that return only the values between the quotation marks (as the questioner was asking for):
Double quotes only (use value of capture group #1):
"(.*?[^\\])"
Single quotes only (use value of capture group #1):
'(.*?[^\\])'
Both (use value of capture group #2):
(["'])(.*?[^\\])\1
All support escaped and nested quotes.

I liked Eugen Mihailescu's solution to match the content between quotes whilst allowing to escape quotes. However, I discovered some problems with escaping and came up with the following regex to fix them:
(['"])(?:(?!\1|\\).|\\.)*\1
It does the trick and is still pretty simple and easy to maintain.
Demo (with some more test-cases; feel free to use it and expand on it).
PS: If you just want the content between quotes in the full match ($0), and are not afraid of the performance penalty use:
(?<=(['"])\b)(?:(?!\1|\\).|\\.)*(?=\1)
Unfortunately, without the quotes as anchors, I had to add a boundary \b which does not play well with spaces and non-word boundary characters after the starting quote.
Alternatively, modify the initial version by simply adding a group and extract the string form $2:
(['"])((?:(?!\1|\\).|\\.)*)\1
PPS: If your focus is solely on efficiency, go with Casimir et Hippolyte's solution; it's a good one.
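For reference, the grouped variant might be used from Java as in this sketch. The class name BackrefQuoteMatcher is hypothetical; group 2 holds the content and group 1 the quote character. Note that a repeated alternation like this one can recurse deeply on very long inputs in Java's regex engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefQuoteMatcher {
    // (['"])                     group 1: the opening quote character
    // ((?:(?!\1|\\).|\\.)*)      group 2: characters that are neither the opening quote
    //                            nor a backslash, or a backslash plus the escaped character
    // \1                         the same quote character again
    private static final Pattern QUOTED = Pattern.compile(
            "(['\"])((?:(?!\\1|\\\\).|\\\\.)*)\\1");

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("He said \"it's fine\" and 'ok'"));
        // [it's fine, ok]
    }
}
```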

A very late answer, but I'd like to add this one:
(\"[\w\s]+\")
http://regex101.com/r/cB0kB8/1

The pattern (["'])(?:(?=(\\?))\2.)*?\1 above does the job, but I am concerned about its performance (it's not bad, but it could be better). Mine, below, is ~20% faster.
The pattern "(.*?)" is just incomplete. My advice for everyone reading this is just DON'T USE IT!!!
For instance, it fails to capture many strings (if needed I can provide an exhaustive test case), like the one below:
$string = 'How are you? I\'m fine, thank you';
The rest of them are just as "good" as the one above.
If you really care both about performance and precision then start with the one below:
/(['"])((\\\1|.)*?)\1/gm
In my tests it covered every string I met but if you find something that doesn't work I would gladly update it for you.
Check my pattern in an online regex tester.

This version
accounts for escaped quotes
controls backtracking
/(["'])((?:(?!\1)[^\\]|(?:\\\\)*\\[^\\])*)\1/

Here is the solution I used:
\"([^\"]*?icon[^\"]*?)\"
TL;DR:
Replace the word icon with whatever you are looking for inside the quotes, and voilà!
It works by looking for the keyword without caring what else is between the quotes.
EG:
id="fb-icon"
id="icon-close"
id="large-icon-close"
The regex looks for a quote mark ",
then for any run of characters that are not ",
until it finds icon,
then for any run of characters that are not ",
and finally for a closing ".

I liked Axeman's more expansive version, but had some trouble with it. For example, it didn't correctly match
foo "string \\ string" bar
or
foo "string1" bar "string2"
so I tried to fix it:
# opening quote
(["'])
(
# repeat (non-greedy, so we don't span multiple strings)
(?:
# anything, except not the opening quote, and not
# a backslash, which are handled separately.
(?!\1)[^\\]
|
# consume any double backslash (unnecessary?)
(?:\\\\)*
|
# Allow backslash to escape characters
\\.
)*?
)
# same character as opening quote
\1
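Java can take a commented pattern like this more or less as written if it is compiled with Pattern.COMMENTS, which ignores whitespace and treats # as the start of a comment. The sketch below is an assumption-laden adaptation: the class name is made up, and the double-backslash alternative, which the comment above already flags as possibly unnecessary, is left out because \\. covers it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VerboseQuoteMatcher {
    private static final Pattern QUOTED = Pattern.compile(
            "([\"'])              # opening quote, remembered as group 1\n" +
            "(                    # group 2: the content between the quotes\n" +
            "  (?:\n" +
            "    (?!\\1)[^\\\\]   # a character that is neither the opening quote nor a backslash\n" +
            "  |\n" +
            "    \\\\.            # a backslash followed by the character it escapes\n" +
            "  )*?                # repeat reluctantly, stopping at the first closing quote\n" +
            ")\n" +
            "\\1                  # the same quote character that opened the string\n",
            Pattern.COMMENTS | Pattern.DOTALL);

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("foo \"string \\\\ string\" bar"));
        System.out.println(extract("foo \"string1\" bar \"string2\""));
    }
}
```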

import re
string = "\" foo bar\" \"loloo\""
print(re.findall(r'"(.*?)"', string))
Just try this out; it works like a charm!
The \ inside the Python string literal escapes the double quote that follows it.

Unlike Adam's answer, I have a simple one that works:
(["'])(?:\\\1|.)*?\1
Just add parentheses if you want to capture the content inside the quotes, like this:
(["'])((?:\\\1|.)*?)\1
Then $1 matches the quote character and $2 matches the content string.

All the answers above are good, except that they do not support all Unicode characters in ECMAScript (JavaScript).
If you are a Node user, you might want this modified version of the accepted answer, which supports all Unicode characters:
/(?<=((?<=[\s,.:;"']|^)["']))(?:(?=(\\?))\2.)*?(?=\1)/gmu
Try here.

My solution to this is below
(["']).*\1(?![^\s])
Demo link : https://regex101.com/r/jlhQhV/1
Explanation:
(["']) -> Matches either ' or " and stores it in capture group 1, so that it can be referenced as \1.
.* -> Greedily matches as many characters as possible (zero or more), then backtracks until the rest of the pattern, starting with \1, can match.
\1 -> Matches the same quote character that the first capture group matched.
(?![^\s]) -> Negative lookahead that ensures the closing quote is not followed by a non-whitespace character.

echo 'junk "Foo Bar" not empty one "" this "but this" and this neither' | sed 's/[^\"]*\"\([^\"]*\)\"[^\"]*/>\1</g'
This will result in: >Foo Bar<><>but this<
Here the result strings are shown between >< for clarity. Since sed has no non-greedy quantifier, [^\"]* is used in its place: the command first throws away the junk before and after each quoted part, then replaces the whole match with the part between the quotes, surrounded by ><.

Building on Greg H.'s answer, I was able to create this regex to suit my needs.
I needed to match a specific value only when it was enclosed in quotes. It had to be a full match; a partial match must not trigger a hit,
e.g. "test" must not match "test2".
reg = r"""(['"])(%s)\1"""
if re.search(reg % needle, haystack, re.IGNORECASE):
    print("winning...")
Hunter

If you're trying to find only strings that are followed by a certain suffix, such as a dot-syntax call, you can try this:
\"([^\"]*)\"\.localized
Here .localized is the suffix; the dot is escaped so that it matches only a literal dot.
Example:
print("this is something I need to return".localized + "so is this".localized + "but this is not")
It will capture "this is something I need to return".localized and "so is this".localized but not "but this is not".
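A Java sketch of the suffix match might look like this. The class name LocalizedStrings is hypothetical, and the dot before localized is escaped so that it matches only a literal dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalizedStrings {
    // "([^"]*)"      a double-quoted string, content captured in group 1
    // \.localized    immediately followed by the literal suffix .localized
    private static final Pattern LOCALIZED = Pattern.compile("\"([^\"]*)\"\\.localized");

    static List<String> extract(String source) {
        List<String> keys = new ArrayList<>();
        Matcher m = LOCALIZED.matcher(source);
        while (m.find()) {
            keys.add(m.group(1));
        }
        return keys;
    }

    public static void main(String[] args) {
        String line = "print(\"this is something I need to return\".localized"
                    + " + \"so is this\".localized + \"but this is not\")";
        System.out.println(extract(line));
        // [this is something I need to return, so is this]
    }
}
```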

A supplementary answer for Microsoft VBA coders only: in VBA one uses the library Microsoft VBScript Regular Expressions 5.5, which gives the following code:
Sub TestRegularExpression()
    Dim oRE As VBScript_RegExp_55.RegExp '* Tools->References: Microsoft VBScript Regular Expressions 5.5
    Set oRE = New VBScript_RegExp_55.RegExp
    oRE.Pattern = """([^""]*)"""
    oRE.Global = True
    Dim sTest As String
    sTest = """Foo Bar"" ""Another Value"" something else"
    Debug.Assert oRE.test(sTest)
    Dim oMatchCol As VBScript_RegExp_55.MatchCollection
    Set oMatchCol = oRE.Execute(sTest)
    Debug.Assert oMatchCol.Count = 2
    Dim oMatch As Match
    For Each oMatch In oMatchCol
        Debug.Print oMatch.SubMatches(0)
    Next oMatch
End Sub

Java Regex - Trying to isolate text from a line that starts with a certain string?

EDIT: MAKE SURE YOU CALL Matcher#matches or Matcher#find before trying to use group!
I'm trying to do something very simple - I'm trying to get the text from a line that starts with a word. In this case, the word is Location:. I'm reading from raw HTML so the line of interest actually looks like this:
Location: Main Hall
Obviously, I want Main Hall returned to me so I can read the location for my application.
This is what I've tried:
String t_location = "";
Pattern t_pat = Pattern.compile("^[\\s]+?(?s)Location: (?-s)(.*)$");
Matcher t_match = t_pat.matcher(t_inner_html);
t_location = t_match.group(0);
But I keep getting the error:
java.lang.IllegalStateException: No successful match so far
Breaking down my Regex, this is what (I think) I'm doing:
^ - Read from the beginning of the line
[\\s]+? - With a reluctant quantifier, read the whitespace at the beginning of the line until we hit something else
(?s)Location: (?-s) - The literal string "Location: " is read
(.*)$ - Read characters (except newlines) until the end of the line
That is what I THINK I'm doing. I'm not so good at Regex, but I've tried to follow the documentation to no avail. Can someone please help me?
For example purposes, the String t_inner_html looks like this:
8/28/2014
Alumni Reunion
Location: Main Hall
<span class="extra-info">
Blah blah blah....
</span>

If this were not Java, this regex should work, depending on what your end-of-line (EOL) character sequence is:
(.|\n)*Location:\s*(.*)\n
The string you want is at group index 2, because the leading (.|\n) is itself capturing group 1.
Now since this regex is going to be inside a Java String, and since backslashes are escape characters in Java strings, you will actually have to pollute the pure regex with double backslashes:
Pattern t_pat = Pattern.compile("(.|\\n)*Location:\\s*(.*)\\n");
In general, to test regexes, I really like this tool:
http://regexpal.com/
It's an interactive tester that will progressively highlight your sample input as it matches the regex. When you edit the regex or change the sample input, the matching highlighting will update in real time. This does not support the required double backslashes of Java, so test in the tool with the singles, paste them to Java, and then add the extra backslashes.
You may also want to play around with this tool, which is not as real-time but does support Java String regexes:
http://www.regexplanet.com/advanced/java/index.html
To break down what I have:
(.|\n)* - zero or more characters or EOL sequences
Location: - the string "Location:"
\s* - zero or more white space
(.*) - a regex group matching everything up to the end of the line, which is what you will capture
\n - EOL sequence
You may need to replace \n with \r\n if you are on Windows, but try \n first and see.
This will match everything in your sample input through "Main Hall" and will ignore everything after it (<span . . .> etc.). "Main Hall" will end up in match group 2.

Please try the following:
String t_location = "";
Pattern t_pat = Pattern.compile("^\\s+Location:\\s+(.*)$", Pattern.MULTILINE);
Matcher t_match = t_pat.matcher(t_inner_html);
if (t_match.find()) {
    t_location = t_match.group(1);
}
You need to use Pattern.MULTILINE for the expressions ^ and $ to match each line instead of the whole string.
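Put together as a self-contained sketch, the MULTILINE approach could look like this. The class name LocationFinder is made up, and [ \t]* is used instead of \s so that the pattern cannot run past the end of the line. Calling find() before group() is what avoids the IllegalStateException from the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocationFinder {
    // With MULTILINE, ^ and $ match at the start and end of every line.
    // [ \t]* skips indentation without crossing a line break.
    private static final Pattern LOCATION =
            Pattern.compile("^[ \\t]*Location:[ \\t]*(.*)$", Pattern.MULTILINE);

    static Optional<String> find(String text) {
        Matcher m = LOCATION.matcher(text);
        // group() may only be called after a successful find() or matches().
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "8/28/2014\nAlumni Reunion\n   Location: Main Hall\n"
                    + "<span class=\"extra-info\">\nBlah blah blah....\n</span>";
        System.out.println(find(html).orElse("(no location)"));
        // Main Hall
    }
}
```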

First use the String indexOf method to find out whether a line contains "Location:".
Then call str.replace("Location: ", "") on the line that contains it.
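A regex-free version along those lines might look like the sketch below. The class name and the line-by-line split are assumptions for illustration; substring is used instead of replace so that only the prefix is dropped.

```java
import java.util.Optional;

class LocationByIndexOf {
    private static final String PREFIX = "Location:";

    static Optional<String> find(String text) {
        // \R matches any line break (\n, \r\n, ...), available since Java 8.
        for (String line : text.split("\\R")) {
            int i = line.indexOf(PREFIX);
            if (i >= 0) {
                // Keep everything after the prefix and trim surrounding whitespace.
                return Optional.of(line.substring(i + PREFIX.length()).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "8/28/2014\nAlumni Reunion\n   Location: Main Hall\n<span>";
        System.out.println(find(html).orElse("(no location)"));
        // Main Hall
    }
}
```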

.*?Location:(.*?)\n
This should get you what you want.
See demo.
http://regex101.com/r/rJ1oQ3/1

Remove repeating set of characters in a string

I want to remove the sequence "-~-~-" when it repeats in a string, but only when the repetitions are adjacent.
I have tried to create a regex based on the one for removing multiple white spaces:
test.replaceAll("\\s+", " ");
Unfortunately I was unsuccessful. Can someone please help me write the correct regex? Thanks.
Example:
string test = "hello-~-~--~-~--~-~-"
output:
hello-~-~-
Another example
string test = "-~-~--~-~--~-~-hello-~-~--~-~--~-~-"
output:
-~-~-hello-~-~-

The regex is:
test.replaceAll("(-~-~-){2,}", "-~-~-")
replaceAll replaces every substring matched by the regex (the first parameter) with the second parameter.
The () groups the expression -~-~- together, and {2,} means two or more occurrences.
EDIT
As @anubhava said, instead of using -~-~- as the replacement string, you could also use $1, which refers back to the first capturing group (i.e. the expression in the regex surrounded by ()).
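Both examples from the question can be checked with a short sketch like the one below. The class RepeatCollapser and its unit parameter are hypothetical; Pattern.quote keeps the repeated sequence literal even if it contains regex metacharacters.

```java
import java.util.regex.Pattern;

class RepeatCollapser {
    // Collapses two or more adjacent copies of unit into a single copy.
    static String collapse(String input, String unit) {
        String regex = "(" + Pattern.quote(unit) + "){2,}";
        // $1 refers to the text captured by group 1, i.e. one copy of unit.
        return input.replaceAll(regex, "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello-~-~--~-~--~-~-", "-~-~-"));
        // hello-~-~-
        System.out.println(collapse("-~-~--~-~--~-~-hello-~-~--~-~--~-~-", "-~-~-"));
        // -~-~-hello-~-~-
    }
}
```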

test.replaceAll("(-~-~-)+", "-~-~-");

This is the regex you need (replace each match with a single -~-~-):
(-~-~-){2,}

How do I write a regular expression to find the following pattern?

I am trying to write a regular expression to do a find and replace operation. Assume Java regex syntax. Below are examples of what I am trying to find:
12341+1
12241+1R1
100001+1R2
So, I am searching for a string beginning with one or more digits, followed by a "1+1" substring, followed by 0 or more characters. I have the following regex:
^(\d+)(1\\+1).*
This regex will successfully find the examples above, however, my goal is to replace the strings with everything before "1+1". So, 12341+1 would become 1234, and 12241+1R1 would become 1224. If I use the first grouped expression $1 to replace the pattern, I get the wrong result as follows:
12341+1 becomes 12341
12241+1R1 becomes 12241
100001+1R2 becomes 100001
Any ideas?

Your existing regex works fine; you are just missing a \ before \d in the Java string:
String str = "100001+1R2";
str = str.replaceAll("^(\\d+)(1\\+1).*","$1");
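Run against all three inputs from the question, the corrected call can be sketched as follows. The class name SuffixStripper is made up, and the second pair of parentheses is dropped because only group 1 is needed.

```java
class SuffixStripper {
    // ^(\d+)   group 1: the leading digits (greedy, but backtracks as needed)
    // 1\+1     the literal text "1+1"
    // .*       whatever follows
    static String strip(String code) {
        return code.replaceAll("^(\\d+)1\\+1.*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("12341+1"));     // 1234
        System.out.println(strip("12241+1R1"));   // 1224
        System.out.println(strip("100001+1R2"));  // 10000
    }
}
```

The greedy \d+ first swallows all the digits, including the 1 that belongs to "1+1", and then gives that digit back so the rest of the pattern can match.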

IMHO, the regex is correct.
Perhaps you wrote it wrong in the code. If you want to put the regex ^(\d+)(1\+1).* in a Java string, you have to write something like String regex = "^(\\d+)(1\\+1).*".
Your output looks like the result of a replacement with ^(\d+)(1+1).*, which suggests that a backslash is missing somewhere in the string (e.g. "^(\\d+)(1\+1).*").

Your regex looks fine to me. I don't have access to Java, but in JavaScript the code
"12341+1".replace(/(\d+)(1\+1)/g, "$1");
returns 1234, as you'd expect. It also works on a string containing several 'codes', e.g.
"12341+1 54321+1".replace(/(\d+)(1\+1)/g, "$1");
gives 1234 5432.

Personally, I wouldn't use a regex at all (it would be like using a hammer on a thumbtack). I'd just take a substring (pseudocode):
stringName.substring(0, stringName.indexOf("1+1"))
But it looks like other posters have already mentioned the non-greedy operator.
In most regex syntaxes you can add a ? after a + or * to indicate that it should match as little as possible before moving on in the pattern. Thus ^(\d+?)(1\+1) matches as few digits as possible until it finds "1+1", so the captured group does not include the "1+1".
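The substring idea can be written out in Java roughly as below. The class name and the decision to return the input unchanged when the marker is missing are assumptions; without that check, indexOf would return -1 and substring would throw.

```java
class PrefixBeforeMarker {
    // Returns the part of s before the first occurrence of marker,
    // or s itself if the marker does not occur.
    static String before(String s, String marker) {
        int i = s.indexOf(marker);
        return i < 0 ? s : s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(before("12341+1", "1+1"));     // 1234
        System.out.println(before("100001+1R2", "1+1"));  // 10000
    }
}
```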
