RegEx To Ignore Text Between Quotes - java

I have a Regex, which is [\\.|\\;|\\?|\\!][\\s]
This is used to split a string. But I don't want it to split . ; ? ! if it is in quotes.

I wouldn't use split here; I'd use Pattern and Matcher instead.
A demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "start. \"in quotes!\"; foo? \"more \\\" words\"; bar";
        String simpleToken = "[^.;?!\\s\"]+";
        String quotedToken =
            "(?x)               # enable inline comments and ignore white spaces in the regex \n" +
            "\"                 # match a double quote \n" +
            "(                  # open group 1 \n" +
            "  \\\\.            # match a backslash followed by any char (other than line breaks) \n" +
            "  |                # OR \n" +
            "  [^\\\\\\r\\n\"]  # any character other than a backslash, line breaks or double quote \n" +
            ")                  # close group 1 \n" +
            "*                  # repeat group 1 zero or more times \n" +
            "\"                 # match a double quote \n";
        String regex = quotedToken + "|" + simpleToken;
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            System.out.println("> " + m.group());
        }
    }
}
which produces:
> start
> "in quotes!"
> foo
> "more \" words"
> bar
As you can see, it can also handle escaped quotes inside quoted tokens.

Here is what I do in order to ignore quotes in matches.
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*? # <-- append the query you wanted to search for - don't use something greedy like .* in the rest of your regex.
To adapt this for your regex, you could do
(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*?[.;?!]\s*
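A minimal sketch of how the adapted pattern might be used with Matcher.find() to collect the pieces, rather than passing it to split. The class name QuoteAwareSplit and the sample sentence are made up for illustration; only the regex comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAwareSplit {
    // Lazily skip plain characters or whole quoted runs, then stop at the
    // first terminator (. ; ? !) that sits outside quotes.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:[^\"']|(?:\".*?\")|(?:'.*?'))*?[.;?!]\\s*");

    /** Returns each terminated piece of text, trimmed, in order of appearance. */
    public static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            parts.add(m.group().trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("He said \"Wait. Stop!\" loudly. Then left; done!")) {
            System.out.println("> " + part);
        }
    }
}
```

Two caveats with this sketch: trailing text that has no terminator is not returned, and a lone apostrophe (as in "don't") is treated as the start of a single-quoted run, so it may need different handling for ordinary prose.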

Related

Regex to capture the string starting with a specific word or character and ending with either one of two words

I want to capture the string after the last slash and before either the word (; sid=) or the character (?).
sample data:
sessionId=30a793b1-ed7e-464a-a630; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;
sessionId=sfdsdfsd-ba57-4e21-a39f-34; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW targetUrl=https://www.example.com/ mybook/order/newbooking/page1?id=123;
sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;
Expecting the below output:
1. itemSummary
2. itemList
3. ''(empty string)
I have built the regex below to capture it, but it's not 100% accurate. It captures some additional text.
Regex
Url=.*\/(.*)(; sid|\?)
Could you please help me to improve the regex to get desired output?
Thanks in advance!
You may use this regex in Java with a greedy match after Url=:
\bUrl=\S+/([^?;/]+)(?=; sid|\?)
RegEx Details:
\b: Word boundary
Url=: Match text Url=
\S+/: Match 1+ non-whitespace characters followed by a /
([^?;/]+): Match 1+ characters that are not ?, ; or /
(?=; sid|\?): Lookahead to assert that we have ; sid or ? ahead
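A small sketch of the pattern above applied in Java. The class name LastPathSegment and the shortened sample lines are made up for illustration. Because ([^?;/]+) requires at least one character, the pattern as written finds no match when the URL ends in a slash; the second pattern, with * in place of +, is an assumed variant (not part of the answer) that returns the empty string in that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastPathSegment {
    // Pattern from the answer: greedy \S+/ backs off to the last slash that is
    // followed by a run of non-?;/ characters and then "; sid" or "?".
    public static final Pattern LAST_SEGMENT =
        Pattern.compile("\\bUrl=\\S+/([^?;/]+)(?=; sid|\\?)");

    // Assumed variant: * instead of +, so an empty segment after a trailing slash matches too.
    public static final Pattern LAST_SEGMENT_OR_EMPTY =
        Pattern.compile("\\bUrl=\\S+/([^?;/]*)(?=; sid|\\?)");

    /** Returns group 1 of the first match, or null when the pattern does not match. */
    public static String extract(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "sessionId=30a7; Url=https://www.example.com/mybook/order/newbooking/itemSummary; sid=KJ4d; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;",
            "sessionId=sfds; Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&type=new; sid=Q4hW targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;",
            "sessionId=0e1a; Url=https://www.example.com/mybook/order/newbooking/; sid=hkm2; targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;"
        };
        for (String line : lines) {
            System.out.printf("'%s' / '%s'%n",
                extract(LAST_SEGMENT, line), extract(LAST_SEGMENT_OR_EMPTY, line));
        }
    }
}
```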
Alternative solution:
Used regex:
"^Url=.*/(\\w+|)$"
Regex in test bench and context:
public static void main(String[] args) {
String input1 = "sessionId=30a793b1-ed7e-464a-a630; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemSummary; "
+ "sid=KJ4dgQGdhg7dDn1h0TLsqhsdfhsfhjhsdjfhjshdjfhjsfddscg139bjXZQdkbHpzf9l6wy1GdK5XZp; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=122;";
String input2 = "sessionId=sfdsdfsd-ba57-4e21-a39f-34; "
+ "Url=https://www.example.com/mybook/order/newbooking/itemList?id=76734&para=jhjdfhj&type=new&ordertype=kjkf&memberid=273647632&iSearch=true; "
+ "sid=Q4hWgR1GpQb8xWTLpQB2yyyzmYRgXgFlJLGTc0QJyZbW "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=123;";
String input3 = "sessionId=0e1acab1-45b8-sdf3454fds-afc1-sdf435sdfds; "
+ "Url=https://www.example.com/mybook/order/newbooking/; "
+ "sid=hkm2gRSL2t5ScKSJKSJn3vg2sfdsfdsfdsfdsfdfdsfdsfdsfvJZkDD3ng0kYTjhNQw8mFZMn; "
+ "targetUrl=https://www.example.com/mybook/order/newbooking/page1?id=343;";
List<String> inputList = Arrays.asList(input1, input2, input3);
// Pre-compiled Patterns should not be in loops - that is why they are placed outside the loops
Pattern replaceWithNewLinePattern = Pattern.compile(";?\\s|\\?");
Pattern extractWordFromUrlPattern = Pattern.compile("^Url=.*/(\\w+|)$", Pattern.MULTILINE);
int count = 0;
for(String input : inputList) {
String inputWithNewLines = replaceWithNewLinePattern.matcher(input).replaceAll("\n");
// System.out.println(inputWithNewLines); // Check the change...
Matcher matcher = extractWordFromUrlPattern.matcher(inputWithNewLines);
while (matcher.find()) {
System.out.printf( "%d. '%s'%n", ++count, matcher.group(1));
}
}
}
Output:
1. 'itemSummary'
2. 'itemList'
3. ''

Finding JSON objects inside a string

Following my question about having to deal with a poorly implemented chat server, I have come to the conclusion that I should try to separate the chat messages from the other server responses.
Basically, I receive a string that would look like this:
13{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat {message 1\"}","sender":123,"recipient":321}45{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat} message 2\"}","sender":123,"recipient":321}1
And the result I would like is two substrings:
{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat {message 1\"}","sender":123,"recipient":321}
{"ts":2135646,"msg":"{\"ts\":123156,\"msg\":\"this is my chat} message 2\"}","sender":123,"recipient":321}
The output I can receive is a mix between JSON objects (possibly containing other JSON objects) and some numerical data.
I need to extract the JSON objects from that string.
I have thought about counting curly braces to pick what is between the first opening one and the corresponding closing one. However, the messages can possibly contain a curly brace.
I have thought about regular expressions but I can't get one that will work (I am not good at regexes)
Any idea about how to proceed?
This should work:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile(
"\\{ # Match an opening brace. \n" +
"(?: # Match either... \n" +
" \" # a quoted string, \n" +
" (?: # which may contain either... \n" +
" \\\\. # escaped characters \n" +
" | # or \n" +
" [^\"\\\\] # any other characters except quotes and backslashes \n" +
" )* # any number of times, \n" +
" \" # and ends with a quote. \n" +
"| # Or match... \n" +
" [^\"{}]* # any number of characters besides quotes and braces. \n" +
")* # Repeat as needed. \n" +
"\\} # Then match a closing brace.",
Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
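The regex above only handles braces that appear inside quoted strings; for an object nested directly inside another, such as {"a":{"b":1}}, it would return only the inner object. If that case can occur, one alternative is the brace-counting idea from the question, made aware of string literals. The following is a rough sketch of that different technique, not the regex approach; the class name JsonObjectScanner and the short sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class JsonObjectScanner {
    /** Returns every top-level {...} span, ignoring braces inside double-quoted strings. */
    public static List<String> extract(String s) {
        List<String> out = new ArrayList<>();
        int depth = 0;            // current brace nesting level
        int start = -1;           // index of the brace that opened the current top-level object
        boolean inString = false; // true while inside a double-quoted string of an object
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                       // skip the escaped character
                } else if (c == '"') {
                    inString = false;          // closing quote
                }
            } else if (c == '"' && depth > 0) {
                inString = true;               // opening quote inside an object
            } else if (c == '{') {
                if (depth++ == 0) start = i;   // a new top-level object begins
            } else if (c == '}' && depth > 0) {
                if (--depth == 0) out.add(s.substring(start, i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "13{\"msg\":\"a {b\",\"n\":1}45{\"msg\":\"c\\\"} d\",\"n\":2}1";
        for (String json : extract(input)) {
            System.out.println(json);
        }
    }
}
```

Neither approach validates the JSON; each extracted substring would still need to go through a real JSON parser.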

Regex for String with possible escape characters

I asked this question some time ago here, Regular expression that does not contain quote but can contain escaped quote, and got a response, but somehow I am not able to make it work in Java.
Basically I need to write a regular expression that matches a valid string beginning and ending with quotes, which may have quotes in between provided they are escaped.
In the code below, I essentially want to match all three strings and print true, but cannot.
What should be the correct regex?
Thanks
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \" ABC\"",
"\"tuco \" ABC \" DEF\""
};
Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
The problem is not so much your regex as your test strings. In the second and third example strings, the single backslash before each internal quote is consumed when the Java compiler parses the string literal, so the string passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:
import java.util.regex.*;
public class TEST
{
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \\\" ABC\"",
"\"tuco \\\" ABC \\\" DEF\""
};
//old: Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
Pattern pattern = Pattern.compile(
"# Match double quoted substring allowing escaped chars. \n" +
"\" # Match opening quote. \n" +
"( # $1: Quoted substring contents. \n" +
" [^\"\\\\]* # {normal} Zero or more non-quote, non-\\. \n" +
" (?: # Begin {(special normal*)*} construct. \n" +
" \\\\. # {special} Escaped anything. \n" +
" [^\"\\\\]* # more {normal} non-quote, non-\\. \n" +
" )* # End {(special normal*)*} construct. \n" +
") # End $1: Quoted substring contents. \n" +
"\" # Match closing quote. ",
Pattern.DOTALL | Pattern.COMMENTS);
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
}
I've replaced your regex with an improved version (taken from MRE3). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
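To make the point about string literals concrete, here is a small sketch that prints what the regex engine actually receives. It uses the asker's original pattern; the class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

public class LiteralBackslashDemo {
    // The asker's original pattern: a quote, then runs of ordinary characters
    // or backslash-escaped characters, then a closing quote.
    public static final Pattern QUOTED =
        Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    // In Java source, \" is just a quote character at run time.
    public static final String WITHOUT_BACKSLASH = "\"tuco \" ABC\"";
    // In Java source, \\\" is a backslash followed by a quote at run time.
    public static final String WITH_BACKSLASH = "\"tuco \\\" ABC\"";

    public static void main(String[] args) {
        for (String s : new String[] { WITHOUT_BACKSLASH, WITH_BACKSLASH }) {
            System.out.println(s
                + "  contains backslash: " + (s.indexOf('\\') >= 0)
                + "  matches: " + QUOTED.matcher(s).matches());
        }
    }
}
```

Only the second string contains a backslash at run time, and only that string is matched in full by the pattern.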

Using a JTextField to get a regular expression from a user. How do I make it see \t as a tab instead of a \ followed by a t

JTextField reSource; //contains the regex expression the user wants to search for
String re=reSource.getText();
Pattern p=Pattern.compile(re,myflags); //myflags defined elsewhere in code
Matcher m=p.matcher(src); //src is the text to search and comes from a JTextArea
while (m.find()==true) {
If the user enters \t it finds \t not tab.
If the user enters \\\t it finds \\\t not tab.
If the user enters [\t] or [\\\t] it finds t not tab.
I want it such that if the user enters \t it finds tab. Of course it also needs to work with \n, \r etc...
If re="\t"; is used instead of re=reSource.getText(); with \t in the JTextField then it finds tabs. How do I get it to work with the contents of the JTextField?
Example:
String src = "This\tis\ta\ttest";
System.out.println("src=\"" + src + '"'); // --> prints "This is a test"
String re="\\t";
System.out.println("re=\"" + re + '"'); // --> prints "\t" - as when you use reSource.getText();
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(src);
while (m.find()) {
System.out.println('"' + m.group() + '"');
}
Output:
src="This is a test"
re="\t"
" "
" "
" "
Try this:
re=re.replace("\\t", "\t");
OR
re=re.replace("\\t", "\\\\t");
I think the problem is in understanding that when you type:
String str = "\t";
Then it is actually the same as the following, where the quotes contain a literal tab character:
String str = " ";
But if you type:
String str = "\\t";
Then System.out.print(str) will print the two characters \t.
Matching \t should work, however, your flags might have a problem.
Here's what works for me:
String src = "A\tBC\tD";
Pattern p=Pattern.compile("\\w\\t\\w"); //simulates the user entering \w\t\w
Matcher m=p.matcher(src);
while (m.find())
{
System.out.println("Match: \"" + m.group(0) + "\"");
}
Output is:
Match: "A B"
Match: "C D"
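The asker's myflags value is not shown, so the following is only a guess at one way flags could cause the symptom: with Pattern.LITERAL the pattern text is taken character for character, so the typed sequence backslash + t no longer means "tab". The class and method names in this sketch are hypothetical.

```java
import java.util.regex.Pattern;

public class TabPatternFlags {
    /** True when the pattern text, compiled with the given flags, is found in src. */
    public static boolean finds(String patternText, int flags, String src) {
        return Pattern.compile(patternText, flags).matcher(src).find();
    }

    public static void main(String[] args) {
        // The two characters backslash + t, as a user would type them into a text field.
        String typed = "\\t";

        // No flags: the regex engine reads \t as the escape for a tab character.
        System.out.println(finds(typed, 0, "A\tB"));

        // LITERAL: the engine looks for a backslash followed by t, which a tab is not.
        System.out.println(finds(typed, Pattern.LITERAL, "A\tB"));

        // LITERAL again, but this text really contains a backslash followed by t.
        System.out.println(finds(typed, Pattern.LITERAL, "A\\tB"));
    }
}
```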
My experience is that Java Swing JTextField and JTable GUI controls escape user-entered backslashes by prefixing a backslash.
User types two-character sequence "backslash t", control's getText() method returns a String containing the three-character sequence "backslash backslash t". The SO formatter does its own thing with backslashes in text so here it is as code:
Single backslash: input is 2 char sequence \t and return value is 3 char \\t
For three-character input sequence "backsl backsl t", getText() returns the five-character sequence "backsl backsl backsl backsl t". As code:
Double backslash: input is 3 char sequence \\t and return value is 5 char \\\\t
This basically prevents the backslash from modifying the t to yield a character sequence that becomes a tab when interpreted by something like System.out.println.
Conveniently, and surprisingly to me, the regex processor accepts it either way. A two-character sequence "\t" matches a tab character, as does a three-character sequence "\\t". Please see the demo code below. The System.out calls show which sequences and patterns contain tabs, and in JDK 1.7 both matches yield true.
package my.text;
/**
* Demonstrate use of tab character in regexes
*/
public class RegexForSo {
public static void main(String [] argv) {
final String sequenceTab="x\ty\tz";
final String patternBsTab = "x\t.*";
final String patternBsBsTab = "x\\t.*";
System.out.println("sequence is >" + sequenceTab + "<");
System.out.println("pattern BsTab is >" + patternBsTab + "<");
System.out.println("pattern BsBsTab is >" + patternBsBsTab + "<");
System.out.println("matched BsTab = " + sequenceTab.matches(patternBsTab));
System.out.println("matched BsBsTab = " + sequenceTab.matches(patternBsBsTab));
}
}
Output on my JDK1.7 system is below, tabs in output might not survive SO formatter :)
sequence is >x y z<
pattern BsTab is >x .*<
pattern BsBsTab is >x\t.*<
matched BsTab = true
matched BsBsTab = true
HTH

How do I escape '+' in pattern matching to highlight keyword?

I'm implementing a keyword highlighter in Java. I'm using java.util.regex.Pattern to highlight (making bold) keyword within String content. The following piece of code is working fine for alphanumeric keywords, but it is not working for some special characters. For example, in String content, I would like to highlight the keyword c++ which has the special character + (plus), but it's not getting highlighted properly. How do I escape + character so that c++ is highlighted?
public static void main(String[] args)
{
String content = "java,c++,ejb,struts,j2ee,hibernate";
System.out.println("CONTENT: " + content);
String highlight = "C++";
System.out.println("HIGHLIGHT KEYWORD: " + highlight);
//highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("\\b" + highlight + "\\b", java.util.regex.Pattern.CASE_INSENSITIVE);
System.out.println("PATTERN: " + pattern.pattern());
java.util.regex.Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println("Match found!!!");
for (int i = 0; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
content = matcher.replaceAll("<B>" + matcher.group(i) + "</B>");
}
}
System.out.println("RESULT: " + content);
}
Output:
CONTENT: java,c++,ejb,struts,j2ee,hibernate
HIGHLIGHT KEYWORD: C++
PATTERN: \bC++\b
Match found!!!
c
RESULT: java,c++,ejb,struts,j2ee,hibernate
I even tried to escape '+' before calling Pattern.compile like this,
highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
but still I'm not able to get the syntax right. Can somebody help me solve this?
This should do what you need:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "\\b",
Pattern.CASE_INSENSITIVE);
Update: you are right, the above doesn't work for C++ (\b matches word boundaries and doesn't recognize ++ as a word). We need a more complicated solution:
Pattern pattern = Pattern.compile(
"\\b"
+ Pattern.quote(highlight)
+ "(?![^\\p{Punct}\\s])", // matches if the match is not followed by
// anything other than whitespace or punctuation
Pattern.CASE_INSENSITIVE);
Update in response to comments: it seems that you need more logic in your pattern creation. Here's a helper method to create the pattern for you:
private static final String WORD_BOUNDARY = "\\b";
// edit this to suit your needs:
private static final String ALLOWED = "[^,.!\\-\\s]";
private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";
public static Pattern createHighlightPattern(final String highlight) {
final Pattern pattern = Pattern.compile(
(Character.isLetterOrDigit(highlight.charAt(0))
? WORD_BOUNDARY : LOOKBEHIND)
+ Pattern.quote(highlight)
+ (Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1))
? WORD_BOUNDARY : LOOKAHEAD),
Pattern.CASE_INSENSITIVE);
return pattern;
}
And here is some test code to check that it works:
private static void testMatch(final String haystack, final String needle) {
final Matcher matcher = createHighlightPattern(needle).matcher(haystack);
if (!matcher.find())
System.out.println("Failed to find pattern " + needle);
while (matcher.find())
System.out.println("Found additional match: " + matcher.group() +
" for pattern " + needle);
}
public static void main(final String[] args) {
final String testString = "java,c++,hibernate,.net,asp.net,c#,spring";
testMatch(testString, "java");
testMatch(testString, "c++");
testMatch(testString, ".net");
testMatch(testString, "c#");
}
When I run this method, I don't see any output (which is good :-))
The problem is that the \b word boundary anchor does not match, because + is a non-word character and I assume the character following it (a whitespace, for example) is also a non-word character.
A word boundary \b matches at a change from a word character (a member of \w) to a non-word character (not a member of \w), or vice versa.
Also, if you want to match a + literally you have to escape it. Here you are searching for C++, which the regex engine reads as: match at least one C, where ++ is a possessive quantifier that matches one or more C and does not backtrack.
Try changing your pattern to something like this
java.util.regex.Pattern.compile("\\b" + highlight + "(?=\\s)", java.util.regex.Pattern.CASE_INSENSITIVE);
(?=\s) is a positive lookahead that will check if there is a whitespace following your highlight
Additionally, you will need to escape the + you are searching for.
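A short sketch illustrating the possessive-quantifier reading of C++ and the effect of escaping it with Pattern.quote. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessivePlusDemo {
    /** Returns the first case-insensitive match of regex in text, or null if there is none. */
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb";

        // Unescaped: "C++" means "one or more C, possessively", so only the letter is matched.
        System.out.println(firstMatch("\\bC++\\b", content));

        // Quoted: the plus signs are literal characters, so the whole keyword is matched.
        System.out.println(firstMatch(Pattern.quote("C++"), content));

        // The unescaped form happily consumes a run of the letter C.
        System.out.println(firstMatch("C++", "xCCCx"));
    }
}
```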
All you need is here :
Pattern.compile("\\Q"+highlight+"\\E", java.util.regex.Pattern.CASE_INSENSITIVE);
Assuming your keyword does not begin or end with punctuation, here is a commented regex which uses lookahead and lookbehind to achieve your desired matching behavior:
// Compile regex to match a keyword or keyphrase.
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(
"(?<=[\\s'\".?!,;:]|^) # Word preceded by ws, quote, punct or BOS.\n" +
// Escape any regex metacharacters in the keyword phrase.
java.util.regex.Pattern.quote(highlight) + " # Keyword to be matched.\n" +
"(?=[\\s'\".?!,;:]|$) # Word followed by ws, quote, punct or EOS.",
Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.COMMENTS);
Note that this solution works even if your keyword is a phrase containing spaces.
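For completeness, a sketch of how the lookbehind/lookahead pattern above might be plugged into the highlighter. It builds the same pattern without Pattern.COMMENTS and does a single replaceAll with $0 instead of calling replaceAll inside the find() loop. The class name KeywordHighlighter and the method name highlight are hypothetical.

```java
import java.util.regex.Pattern;

public class KeywordHighlighter {
    // Characters that may sit directly before or after a keyword (same set as above).
    private static final String DELIMITERS = "[\\s'\".?!,;:]";

    /** Wraps every stand-alone occurrence of keyword in <B>...</B>, ignoring case. */
    public static String highlight(String content, String keyword) {
        Pattern pattern = Pattern.compile(
            "(?<=" + DELIMITERS + "|^)"      // preceded by a delimiter or start of input
                + Pattern.quote(keyword)     // keyword with regex metacharacters neutralised
                + "(?=" + DELIMITERS + "|$)", // followed by a delimiter or end of input
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // $0 re-inserts the matched text, so the original casing is preserved.
        return pattern.matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("java,c++,ejb,struts,j2ee,hibernate", "C++"));
        System.out.println(highlight("I like C++ and c", "c"));
    }
}
```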
