I am trying to write a regex that matches a specific URL format, namely the API URLs for Stack Exchange. For example, I want both of these to match:
http://api.stackoverflow.com/1.1/questions/1234/answers
http://api.physics.stackexchange.com/1.0/questions/5678/answers
Where:
Everything not in bold must be identical. (The bold parts are the site name after "api.", the last digit of the version, and the question ID.)
The first bold part can only be made of the letters a to z, with at most one full stop.
Also, it would be good if, when there is a full stop, the word "stackexchange" had to follow it. However, this isn't crucial.
The second bold part can only be a 1 or a 0.
The last bold part can contain only the digits 0 to 9, and can be any length.
There can't be anything at all before or after the URL, not even a trailing slash.
Pattern.compile("^(?i:http://api\\.(?:[a-z]+(?:\\.stackexchange)?)\\.com)/1\\.[01]/questions/[0-9]+/answers\\z")
The ^ makes sure it starts at the start of input, and the \\z makes sure it ends at the end of input. All the dots are escaped so they are literal. The (?i:...) part makes the domain and scheme case-insensitive as per the URL spec. The [01] only matches the characters 0 or 1. The [0-9]+ matches 1 or more Arabic digits. The rest is self explanatory.
^http://api[.][a-z]+([.]stackexchange)?[.]com/1[.][01]/questions/[0-9]+/answers$
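^ matches start-of-string and $ matches end-of-string (in Java, $ also matches just before a final line terminator, so use \z if that matters). [.] is an alternative to a backslash for escaping the dot; in a Java string literal the backslash itself would need to be escaped, as \\.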
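A minimal sketch for trying the bracket-escaped pattern above against the two sample URLs and a couple of near misses. The class and method names (ApiUrlCheck, isApiUrl) are made up for illustration. It relies on Matcher.matches(), which only succeeds when the whole input matches, so the anchors are left out here.

```java
import java.util.regex.Pattern;

final class ApiUrlCheck {
    // Same pattern as the answer above, without ^ and $ because
    // matches() already requires the entire input to match.
    private static final Pattern API_URL = Pattern.compile(
        "http://api[.][a-z]+([.]stackexchange)?[.]com/1[.][01]/questions/[0-9]+/answers");

    // Hypothetical helper: true only for a complete, exact API URL.
    static boolean isApiUrl(String candidate) {
        return API_URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://api.stackoverflow.com/1.1/questions/1234/answers",
            "http://api.physics.stackexchange.com/1.0/questions/5678/answers",
            "http://api.stackoverflow.com/1.1/questions/1234/answers/", // trailing slash
            "http://api.stackoverflow.com/1.2/questions/1234/answers"   // version not 1.0 or 1.1
        };
        for (String s : samples) {
            System.out.println(isApiUrl(s) + "  " + s);
        }
    }
}
```

The first two samples are accepted and the last two are rejected.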
This tested Java program has a commented regex which should do the trick:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "http://api.stackoverflow.com/1.1/questions/1234/answers";
Pattern p = Pattern.compile(
"http://api\\. # Scheme and api subdomain.\n" +
"(?: # Group for domain alternatives.\n" +
" stackoverflow # Either one\n" +
"| physics\\.stackexchange # or the other\n" +
") # End group for domain alternatives.\n" +
"\\.com # TLD\n" +
"/1\\.[01] # Either 1.0 or 1.1\n" +
"/questions/\\d+/answers # Rest of path.",
Pattern.COMMENTS);
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.print("Match found.\n");
} else {
System.out.print("No match found.\n");
}
}
}
In Java I am using the regex \".*?\".
I use it to replace every double-quoted string with the term String.
Ex:
INPUT: Functions.unescapeJson("test")
Result : Functions.unescapeJson("String")
But now I want some double quotes inside a quoted string to be ignored rather than treated as the end of the string, so I am using / as the escape character. How can I achieve this?
Ex:
INPUT: Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson("test"), "m2m:cin.con"),"payloads_ul.dataFrameOutput"),"[/"Dimming Value/"]")
RESULT: Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson(String), String),String),String)
But the result I am getting if I use the previous regex is:
Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson(input.mIntegerm/:sgn.nev.rep), String),String),StringDimming ValueString)
How can I achieve this with a regex, so that a double quote preceded by / is skipped and does not end the quoted string?
The code that I am using
public static void main(String[] args) {
String STRINGVALIDATIONREGEX = "\".*?\"";
String formula = "Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson(input.m2m/:sgn.nev.rep), \"m2m:cin.con\"),\"payloads_ul.dataFrameOutput\"),\"[\"Dimming Value\"]\")";
System.out.println(formula.replace(STRINGVALIDATIONREGEX, "String"));
}
You can use this regex:
\"(\/?.)*?\"
Use [^/] to match anything that is not a slash.
For example, [^/]?\".*[^/]?\" would catch quotes not preceded by /
"((?:[^"]|(?<=\/)")*)"
" match a "
[^"] match a non-quote character
| or
(?<=\/)") a quote character that is preceded by a /
* match sub-expressions 2 - 4 zero or more times.
" match a "
If you believe that a string such as "abc/" is invalid, then you should use the stricter regex:
"((?:[^"\/]|\/")*)"
" match "
[^"\/] match any character that isn't a quote or a /
| or
\/" match a /" combination
* match sub-expressions 2 - 4 zero or more times.
" match a "
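As a rough sketch, the stricter pattern can be applied like this in Java. The class and method names (QuotedStringMasker, mask) are invented for the example, and the capturing group is made non-capturing because the contents are not needed. Note that String#replaceAll is used: String#replace treats its first argument as literal text, not as a regex.

```java
import java.util.regex.Pattern;

final class QuotedStringMasker {
    // Stricter pattern from above: a quote, then any run of
    // "neither quote nor slash" or an escaped quote (/"), then a quote.
    private static final Pattern QUOTED =
        Pattern.compile("\"(?:[^\"/]|/\")*\"");

    // Hypothetical helper: replaces each quoted string with the word String.
    static String mask(String formula) {
        return QUOTED.matcher(formula).replaceAll("String");
    }

    public static void main(String[] args) {
        String formula = "f(\"test\", \"[/\"Dimming Value/\"]\")";
        System.out.println(formula);       // f("test", "[/"Dimming Value/"]")
        System.out.println(mask(formula)); // f(String, String)
    }
}
```

With this pattern a slash followed by anything other than a quote makes the quoted string unmatchable, so inputs with ordinary slashes inside quotes would need a looser alternative.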
I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?
Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next following parenthesis (if any) is not a closing parenthesis. Only then the comma is allowed to match.
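For illustration, here is a small sketch that uses the same lookahead as a split delimiter on the sample string. The class and method names (LookaheadCommaSplit, split) are made up, and like the pattern itself it assumes no nested parentheses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LookaheadCommaSplit {
    // A comma that is not followed by a ")" before the next "(".
    private static final Pattern OUTSIDE_COMMA = Pattern.compile(",(?![^(]*\\))");

    // Hypothetical helper: splits on commas outside parentheses.
    static List<String> split(String subject) {
        return Arrays.asList(OUTSIDE_COMMA.split(subject));
    }

    public static void main(String[] args) {
        // prints [12, 44, foo, bar, (23,45,200), 6]
        System.out.println(split("12,44,foo,bar,(23,45,200),6"));
    }
}
```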
Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches a complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
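The splitting variant of this technique could look roughly like the sketch below: only matches where Group 1 is set count as delimiters, and parenthesised chunks are skipped over. The class and method names (SplitOutsideParens, split) are hypothetical, and this is not the code from the referenced article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitOutsideParens {
    // Left branch swallows "(...)"; right branch captures a bare comma in Group 1.
    private static final Pattern TOKEN = Pattern.compile("\\(.*?\\)|(,)");

    // Hypothetical helper: cuts the subject at every Group 1 comma.
    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(subject);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {          // a comma outside parentheses
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }                                   // otherwise: a parenthesised chunk, ignore it
        }
        parts.add(subject.substring(start));    // trailing piece after the last comma
        return parts;
    }

    public static void main(String[] args) {
        // prints [12, 44, foo, bar, (23,45,200), 6]
        System.out.println(split("12,44,foo,bar,(23,45,200),6"));
    }
}
```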
I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
String beforeParen = longString.substring(0, longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
/* do something. */
firstComma = beforeParen.indexOf(',', firstComma + 1);
}
(Of course, this assumes that there is always exactly one opening parenthesis and one matching closing parenthesis at some point after it.)
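If nested or multiple parenthesised groups are possible, a plain scan with a depth counter is one regex-free option. The sketch below is only an illustration under that assumption; the class and method names (TopLevelCommas, indexesOf) are invented, and unbalanced closing parentheses are simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class TopLevelCommas {
    // Hypothetical helper: returns the positions of commas that are
    // not enclosed in any parentheses, tracking the nesting depth.
    static List<Integer> indexesOf(String text) {
        List<Integer> positions = new ArrayList<>();
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (depth > 0) depth--;   // ignore a stray closing parenthesis
            } else if (c == ',' && depth == 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(indexesOf("12,44,foo,bar,(23,45,200),6")); // [2, 5, 9, 13, 25]
        System.out.println(indexesOf("a,(b,(c,d),e),f"));             // [1, 13]
    }
}
```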
Is it possible to subtract the characters in a Java regex back reference from a character class?
e.g., I want to use String#matches(regex) to match either:
any group of characters that are [a-z'] that are enclosed by "
Matches: "abc'abc"
Doesn't match: "1abc'abc"
Doesn't match: 'abc"abc'
any group of characters that are [a-z"] that are enclosed by '
Matches: 'abc"abc'
Doesn't match: '1abc"abc'
Doesn't match: "abc'abc"
The following regex won't compile because [^\1] isn't supported:
(['"])[a-z'"&&[^\1]]*\1
Obviously, the following will work:
'[a-z"]*'|"[a-z']*"
But, this style isn't particularly legible when a-z is replaced by a much more complex character class that must be kept the same in each side of the "or" condition.
I know that, in Java, I can just use String concatenation like the following:
String charClass = "a-z";
String regex = "'[" + charClass + "\"]*'|\"[" + charClass + "']*\"";
But, sometimes, I need to specify the regex in a config file, like XML, or JSON, etc., where java code is not available.
I assume that what I'm asking is almost definitely not possible, but I figured it wouldn't hurt to ask...
One approach is to use a negative look-ahead to make sure that every character in between the quotes is not the quotes:
(['"])(?:(?!\1)[a-z'"])*+\1
         ^^^^^^
(I also make the quantifier possessive, since there is no use for backtracking here)
This approach is, however, rather inefficient, since the pattern checks for the quote character at every single position, on top of checking that the character is one of the allowed characters.
The alternative with 2 branches in the question '[a-z"]*'|"[a-z']*" is better, since the engine only checks for the quote character once and goes through the rest by checking that the current character is in the character class.
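To make the comparison concrete, here is a small sketch that runs both patterns through String#matches on the examples from the question. The class and method names (QuoteClassCheck, viaLookahead, viaAlternation) are made up for the example.

```java
final class QuoteClassCheck {
    // Lookahead + back-reference version: (['"])(?:(?!\1)[a-z'"])*+\1
    private static final String LOOKAHEAD = "(['\"])(?:(?!\\1)[a-z'\"])*+\\1";
    // Two-branch version from the question: '[a-z"]*'|"[a-z']*"
    private static final String ALTERNATION = "'[a-z\"]*'|\"[a-z']*\"";

    static boolean viaLookahead(String s) {
        return s.matches(LOOKAHEAD);
    }

    static boolean viaAlternation(String s) {
        return s.matches(ALTERNATION);
    }

    public static void main(String[] args) {
        String[] samples = {
            "\"abc'abc\"",   // letters and ' inside double quotes
            "'abc\"abc'",    // letters and " inside single quotes
            "\"1abc'abc\"",  // digit is not allowed
            "\"abc\"abc\""   // enclosing quote repeated inside
        };
        for (String s : samples) {
            System.out.println(s + " -> " + viaLookahead(s) + " / " + viaAlternation(s));
        }
    }
}
```

Both patterns accept the first two samples and reject the last two.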
You could use two patterns in one OR-separated pattern, expressing both your cases:
// | case 1: [a-z'] enclosed by "
// | | OR
// | | case 2: [a-z"] enclosed by '
Pattern p = Pattern.compile("(?<=\")([a-z']+)(?=\")|(?<=')([a-z\"]+)(?=')");
String[] test = {
// will match group 1 (for case 1)
"abcd\"efg'h\"ijkl",
// will match group 2 (for case 2)
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
Output
efg'h
null
null
efg"h
Note
There is nothing stopping you from specifying the enclosing characters or the character class itself somewhere else, then building your Pattern with components unknown at compile-time.
Something along the lines of:
// both strings are emulating unknown-value arguments
String unknownEnclosingCharacter = "\"";
String unknownCharacterClass = "a-z'";
// probably want to catch a PatternSyntaxException here for potential
// issues with the given arguments
Pattern p = Pattern.compile(
String.format(
"(?<=%1$s)([%2$s]+)(?=%1$s)",
unknownEnclosingCharacter,
unknownCharacterClass
)
);
String[] test = {
"abcd\"efg'h\"ijkl",
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
// note: only main group here
System.out.println(m.group());
}
}
Output
efg'h
I am trying to write a regex that will match urls inside strings of text that may be html-encoded. I am having a considerable amount of trouble with lookaround though. I need something that would correctly match both links in the string below:
some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" "http://www.notarealwebsite.com" some other text
A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot;" (I don't care about accepting other non-url-safe characters as delimiters)
I have tried a few regexes using lookahead to check for &'s followed by q's followed by u's and so on, but as soon as I put one into the [^...] negation it just completely breaks down and evaluates more like: "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons" which is obviously not what I am looking for.
This will correctly match the &'s at the beginning of the "&quot;":
&(?=q(?=u(?=o(?=t(?=;)))))
But this does not work:
http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?
If your regex implementation supports it, use a positive look ahead and a backreference with a non-greedy expression in the body.
Here is one with your conditions: (["\s]|&quot;)(http://.*?)(?=\1)
For example, in Python:
import re
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
print("Found:", m.group(2))
Produces:
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Or in Java:
@Test
public void testRegex() {
Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)",
Pattern.CASE_INSENSITIVE);
final String URL = "http://test.url/here.php?var1=val&var2=val2";
final String INPUT = "some text " + URL + " more text + \"" + URL +
"\" more then &quot;" + URL + "&quot; testing greed &quot;";
Matcher m = p.matcher(INPUT);
while( m.find() ) {
System.out.println("Found: " + m.group(2));
}
}
Produces the same output.
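As for why the attempt inside [^...] fails: within a character class the lookahead syntax is not special, so every character in it is just one more excluded character. An alternative sketch, assuming the delimiter in question is the HTML entity &quot;, is to put the negative lookahead in front of each consumed character instead. The class and method names (EncodedUrlFinder, findUrls) and the sample input are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodedUrlFinder {
    // Consume one character at a time, but only while it is not whitespace,
    // not a literal quote, and not the start of the entity &quot;
    private static final Pattern URL =
        Pattern.compile("https?://(?:(?!&quot;)[^\\s\"])*");

    // Hypothetical helper: collects every URL found in the text.
    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "some text &quot;http://www.notarealwebsite.com/?q=asdf&amp;searchOrder=1&quot; "
                    + "\"http://www.notarealwebsite.com\" some other text";
        for (String url : findUrls(text)) {
            System.out.println("Found: " + url);
        }
    }
}
```

Unlike the back-reference version, this does not require the opening and closing delimiters to be the same, and it keeps other entities such as &amp; inside the URL.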