Java regex to match double quoted substrings - java

I want to parse the following string:
String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
// "w1 w"2" w3 | w4 w"5 "w6 w7"
I'm using Pattern.compile(regex).matcher(text), so what I'm missing here is the proper regex.
The rules are that regex has to:
isolate any single word
any substring surrounded by double quotes is a match
double quotes within a word have to be ignored (I will later replace them with a whitespace).
So the resulting matches should be:
1. w1 w"2
2. w3
3. |
4. w4
5. w"5
6. w6 w7
Whether the double quotes are included or not in the double quotes surrounded substrings is irrelevant (e.g. 1. could be either w1 w"2 or "w1 w"2").
What I came up with is something like this:
"\"(.*)\"|(\\S+)"
I also tried many different variants of the above regex (including lookbehind/lookahead) but none of them gives me the expected result.
Any idea on how to improve this?

Try this Regex:
(?:(?<=^")|(?<=\s")).*?(?="(?:\s|$))|(?![\s"])\S+
EXPLANATION:
(?:(?<=^")|(?<=\s")) - Positive lookbehind to find a position that is preceded by a ". This " must either be at the start of the string or come right after a whitespace character
.*? - lazily matches 0+ occurrences of any character other than a newline character
(?="(?:\s|$)) - Positive lookahead to validate that whatever is matched so far is followed by a " which is in turn followed by either a whitespace character or the end of the string ($)
| - OR (either the above match or the following)
(?![\s"]) - Negative lookahead to validate that the position is not followed by either a whitespace or a "
\S+ - matches 1+ occurrences of a non-whitespace character
Java code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class MyClass {
public static void main(String args[]) {
final String regex = "(?:(?<=^\")|(?<=\\s\")).*?(?=\"(?:\\s|$))|(?![\\s\"])\\S+";
final String string = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
OUTPUT:
Full match: w1 w"2
Full match: w3
Full match: |
Full match: w4
Full match: w"5
Full match: w6 w7

This seems to do the job:
"(?:[^"]|\b"\b)+"|\S+
Note that in Java, because we're using string literals for regexes, a backslash needs to be preceded by another backslash:
String regex = "\"(?:[^\"]|\\b\"\\b)+\"|\\S+";
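As a rough sketch of how this pattern could be run against the string from the question (the class name QuotedTokenDemo and the helper tokens are made up for illustration and are not part of the answer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the class and method names are illustrative only.
class QuotedTokenDemo {
    // Either a quoted run (where a quote between two word characters is
    // treated as ordinary content) or a plain run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:[^\"]|\\b\"\\b)+\"|\\S+");

    // Collects every match of the pattern, in order of appearance.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\"w1 w\"2\" w3 | w4 w\"5 \"w6 w7\"";
        for (String token : tokens(text)) {
            System.out.println(token);
        }
    }
}
```

With this pattern the quoted matches keep their surrounding double quotes, which the question says is acceptable.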

Related

Java regex, replace certain characters except if it matches a pattern

I have this string "person","hobby","key" and I want to remove the double quotes around every word except key, so the output will be person,hobby,"key"
String str = "\"person\",\"hobby\",\"key\"";
System.out.println(str+"\n");
str=str.replaceAll("/*regex*/","");
System.out.println(str); //person,hobby,"key"
You may use the following pattern:
\"(?!key\")(.+?)\"
And replace with $1
Details:
\" - Match a double quotation mark character.
(?!key\") - Negative Lookahead (not followed by the word "key" and another double quotation mark).
(.+?) - Match one or more characters (lazy) and capture them in group 1.
\" - Match another double quotation mark character.
Substitution: $1 - back reference to whatever was matched in group 1.
Here's a full example:
String str = "\"person\",\"hobby\",\"key\"";
String pattern = "\"(?!key\")(.+?)\"";
String result = str.replaceAll(pattern, "$1");
System.out.println(result); // person,hobby,"key"

Java Regex expression to append two strings

I need to monitor a log file which will give me lines in the following way...
type=CWD<some text>cwd="something within double quotes"<some text><enter>
type=PATH<some text>name="something within double quotes"<some text>
Now I need a regex expression which will take those two variables in double quotes and append them to form a single string with a '/' in between
Eg: I need <cwd_value>/<name_value>
(\"\S*\")(\"\S*\") will give two strings within double quotes to two groups
I want the way to append these strings. Help is so very welcomed!
You can match both lines and use 2 capture groups using a negated character class [^"] as \S does not match the spaces between the double quotes.
Then concat the 2 capture groups with a /
cwd="([^"]*)".*\R.*?name="([^"]*)"
cwd="([^"]*)" Capture the content between double quotes after cwd=" in group 1
.*\R Match the rest of the line and a newline
.*?name="([^"]*)" Match as few characters as possible, then match name=" and capture the contents between the double quotes in group 2
String regex = "cwd=\"([^\"]*)\".*\\R.*?name=\"([^\"]*)\"";
String string = "type=CWD<some text>cwd=\"something within double quotes\"<some text><enter>\n"
+ "type=PATH<some text>name=\"something within double quotes\"<some text>";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(1) + "/" + matcher.group(2));
}
Output
something within double quotes/something within double quotes
If both values are on a single line and the second value may start with a /, you can optionally match that forward slash outside of the second group, so that concatenating group 1 and group 2 does not produce a double //
String regex = "\\\"(\\S*)\\\".*?\\\"/?(\\S*)\\\"";
String string = "cwd=\"/root\" name=\"/msv_backup/archives/auditlogs\"";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(1) + "/" + matcher.group(2));
}
Output
/root/msv_backup/archives/auditlogs

Merge two pattern into one

I need to write a pattern that removes the currency symbol and the commas, e.g. from Fr.-145,000.01
After matching, the pattern should return -145000.01.
The pattern I am using:
^[^0-9\\-]*([0-9\\-\\.\\,]*?)[^0-9\\-]*$
This will return -145,000.01
Then I remove the comma to get -145000.01. Is it possible to change the pattern so that I get -145000.01 directly?
String pattern = "^[^0-9\\-]*([0-9\\-\\.\\,]*?)[^0-9\\-]*$";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(str);
if(m.matches()) {
System.out.println(m.group(1));
}
I expect the output to have the commas already removed.
You can simplify it with String.replaceAll() and a simpler regex (provided you expect the input to be reasonably sane, i.e. without multiple decimal points embedded in the numbers or multiple negative signs):
String str = "Fr.-145,000.01";
str = str.replaceAll("[^\\d-.]\\.?", ""); // strings are immutable, so keep the returned value
If you are going down this route, I would sanity check it by parsing the output with BigDecimal or Double.
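A minimal sketch of that sanity check, assuming BigDecimal is the parser of choice (the class name CurrencyCleanup and the method parseAmount are hypothetical, introduced only for this example):

```java
import java.math.BigDecimal;

// Hypothetical helper; the class and method names are illustrative only.
class CurrencyCleanup {
    // Removes every character that is not a digit, '-' or '.', together
    // with a dot that directly follows such a character (the dot in "Fr.").
    static BigDecimal parseAmount(String raw) {
        String cleaned = raw.replaceAll("[^\\d-.]\\.?", "");
        // The BigDecimal(String) constructor throws NumberFormatException
        // for malformed input such as "1.2.3", which acts as the sanity check.
        return new BigDecimal(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(parseAmount("Fr.-145,000.01"));
    }
}
```

Callers that expect dirty input would probably want to catch NumberFormatException and report the offending line rather than let it propagate.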
One approach would be to just collect our desired digits, ., + and - in a capturing group followed by an optional comma, and then join them:
([+-]?[0-9][0-9.]+),?
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "([+-]?[0-9][0-9.]+),?";
final String string = "Fr.-145,000.01\n"
+ "Fr.-145,000\n"
+ "Fr.-145,000,000\n"
+ "Fr.-145\n"
+ "Fr.+145,000.01\n"
+ "Fr.+145,000\n"
+ "Fr.145,000,000\n"
+ "Fr.145\n"
+ "Fr.145,000,000,000.01";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
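The test code above only prints the groups. The joining step could look roughly like this (GroupJoinDemo and joinChunks are made-up names, and this is only one way to do the join):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the class and method names are illustrative only.
class GroupJoinDemo {
    // Group 1 holds a signed or unsigned run of digits and dots;
    // the optional trailing comma is matched but left out of the group.
    private static final Pattern CHUNK =
            Pattern.compile("([+-]?[0-9][0-9.]+),?");

    // Appends group 1 of every match, which drops the commas in between.
    static String joinChunks(String input) {
        StringBuilder joined = new StringBuilder();
        Matcher matcher = CHUNK.matcher(input);
        while (matcher.find()) {
            joined.append(matcher.group(1));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinChunks("Fr.-145,000.01"));
    }
}
```

Because the pattern requires at least two characters per chunk, very short amounts such as a single digit would not be picked up by this sketch.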
String str = "Fr.-145,000.01";
Pattern regex = Pattern.compile("^[^0-9-]*(-?[0-9]+)(?:,([0-9]{3}))?(?:,([0-9]{3}))?(?:,([0-9]{3}))?(\\.[0-9]+)?[^0-9-]*$");
Matcher matcher = regex.matcher(str);
System.out.println(matcher.replaceAll("$1$2$3$4$5"));
Output:
-145000.01
It looks for a number with up to 3 commas (up to 999,999,999,999.99) and replaces the whole input with the captured sign, digits and decimal part.
My approach would be to remove all the unnecessary parts using replaceAll.
The unnecessary parts are, apparently:
Any sequence which is not digits or minus at the beginning of the string.
Commas
The first pattern is represented by ^[^\\d-]+. The second is merely ,.
Put them together with an |:
Pattern p = Pattern.compile("(^[^\\d-]+)|,");
Matcher m = p.matcher(str);
String result = m.replaceAll("");
You could use 2 capturing groups and repeated matching with the \G anchor, which asserts the position at the end of the previous match.
(?:^[^0-9+-]+(?=[.+,\d-]*\.\d+$)([+-]?\d{1,3})|\G(?!^)),(\d{3})
In Java
String regex = "(?:^[^0-9+-]+(?=[.+,\\d-]*\\.\\d+$)([+-]?\\d{1,3})|\\G(?!^)),(\\d{3})";
Explanation
(?: Non capturing group
^[^0-9+-]+ Match 1+ times not a digit, + or -
(?= Positive lookahead, assert that what follows is:
[.+,\d-]*\.\d+$ Match 0+ times what is allowed and assert ending on . and 1+ digits
) Close positive lookahead
( Capturing group 1
[+-]?\d{1,3}) Match optional + or - followed by 1-3 digits
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start of the string
), Close the non capturing group and match ,
(\d{3}) Capture in group 2 matching 3 digits
In the replacement use the 2 capturing groups $1$2
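Put together, the replacement could be sketched as follows (AnchoredReplaceDemo and stripPrefixAndCommas are hypothetical names; note that the lookahead in the pattern only lets inputs ending in a decimal part match):

```java
// Hypothetical demo class; the class and method names are illustrative only.
class AnchoredReplaceDemo {
    private static final String REGEX =
            "(?:^[^0-9+-]+(?=[.+,\\d-]*\\.\\d+$)([+-]?\\d{1,3})|\\G(?!^)),(\\d{3})";

    // The first match consumes the prefix plus the first ",ddd" block;
    // every later match starts at \G and consumes one more ",ddd" block.
    // In those later matches group 1 did not participate, and Java's
    // replaceAll substitutes nothing for it.
    static String stripPrefixAndCommas(String input) {
        return input.replaceAll(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefixAndCommas("Fr.-145,000.01"));
    }
}
```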

Regex to match strings in-between double quotes that are not containing some other strings

How to match words between double quotes in lines not containing specific words
input:
System.log("error");
new Exception("error");
view.setText("message");
From the above input, I would like to ignore lines with log and Exception words in them(Case sensitive) and match words in between double quotes.
Expected output
message
I have been trying to use look ahead without luck
(?s)^(?!log)".+"
I need this for a search in IntelliJ using regex
In your pattern (?s)^(?!log)".+" the negative lookahead does not contain a quantifier so it will assert that what is directly after the start of the string is not log
What you could do is use a quantifier .* with an alternation to match either log or Exception and add word boundaries \b to prevent them being part of a larger word.
Then you might use negated character classes [^"] to match not a double quote and use a capturing group ([^"]+) for the value between the double quotes.
^(?!.*\b(?:log|Exception)\b)[^"]*"([^"]+)"
In Java:
String regex = "^(?!.*\\b(?:log|Exception)\\b)[^\"]*\"([^\"]+)\"";
If you want the dot to match a newline as well, you can prepend (?s) to the pattern.
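For checking the pattern outside of IntelliJ, a small Java sketch might look like this. The Pattern.MULTILINE flag is an assumption on my part so that ^ matches at the start of every line, and QuotedValueFilter and quotedValues are made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the class and method names are illustrative only.
class QuotedValueFilter {
    // MULTILINE (an assumption, not stated in the answer) makes ^ match
    // at the start of each line instead of only at the start of the input.
    private static final Pattern LINE = Pattern.compile(
            "^(?!.*\\b(?:log|Exception)\\b)[^\"]*\"([^\"]+)\"",
            Pattern.MULTILINE);

    // Returns the first quoted value of every line that passes the lookahead.
    static List<String> quotedValues(String source) {
        List<String> values = new ArrayList<>();
        Matcher matcher = LINE.matcher(source);
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String source = "System.log(\"error\");\n"
                + "new Exception(\"error\");\n"
                + "view.setText(\"message\");";
        System.out.println(quotedValues(source));
    }
}
```

One thing to keep in mind: [^"] also matches line breaks, so a line without any double quote could run on into the following line. Using [^"\n] instead would be one way to prevent that.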
My guess is that this expression would likely work for capturing the message,
^(?!.*log.*|.*exception.*).*?"(.+?)".*
Example
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^(?!.*log.*|.*exception.*).*?\"(.+?)\".*";
final String string = "System.log(\"error\");\n"
+ "new Exception(\"error\");\n"
+ "view.setText(\"message\");";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}

how to parse a double-quote delimited string that can contain escaped double quotes

I need to parse a line from a stream that looks like this: command "string1" "string2". The strings can contain spaces and escaped double quotes. I need to split the line so that I get command, string1 and string2 as array elements. I think split() with a regex matching " but not \" ( .split("(?<!\\\\)\"") ) would do the job, but I hear that this is not a good idea.
Is there any better way of doing this in Java?
Something like that should do the trick, assuming you want to remove the external double quotes when applicable (if you don't, it's just a matter of changing the first capturing group to also include the quotes):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Demo {
private static final Pattern WORD =
Pattern.compile("\"((?:[^\\\\\"]|\\\\.)*)\"|([^\\s\"]+)");
public static void main(String[] args) {
String cmd =
"command " +
"\"string with blanks\" " +
"\"anotherStringBetweenQuotes\" " +
"\"a string with \\\"escaped\\\" quotes\" " +
"stringWithoutBlanks";
Matcher matcher = WORD.matcher(cmd);
while (matcher.find()) {
String capturedGroup = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
System.out.println("Matched: " + capturedGroup);
}
}
}
Output:
Matched: command
Matched: string with blanks
Matched: anotherStringBetweenQuotes
Matched: a string with \"escaped\" quotes
Matched: stringWithoutBlanks
The regex is a bit complicated, so it well deserves a bit of explanation:
[^\\\\\"] matches everything but a backslash or double quotes
\\\\. matches a backslash followed by any character (including double quotes), namely escaped characters
(?:[^\\\\\"]|\\\\.)* matches any sequence of escaped or non-escaped characters, but without capturing the group (because of the (?:))
\"((?:[^\\\\\"]|\\\\.)*)\" matches any such sequence wrapped into double quotes and captures the inside of the quotes
([^\\s\"]+) matches any non-empty sequence of non-blank characters, and captures it in a group
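Since the question asks for the parts as array elements, here is a sketch that collects them into a list using the same pattern. Removing the backslash in front of escaped characters is an extra step that is not part of the answer above, and CommandLineSplitter and split are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the class and method names are illustrative only.
class CommandLineSplitter {
    private static final Pattern WORD =
            Pattern.compile("\"((?:[^\\\\\"]|\\\\.)*)\"|([^\\s\"]+)");

    static List<String> split(String line) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            if (matcher.group(1) != null) {
                // Quoted token: drop the backslash in front of each escaped
                // character (an assumption about the desired result).
                parts.add(matcher.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                // Unquoted token: taken as is.
                parts.add(matcher.group(2));
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "command \"string one\" \"say \\\"hi\\\"\"";
        System.out.println(split(line));
    }
}
```

If a real array is needed, parts.toArray(new String[0]) converts the list.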
