Match all urls in a domain except queries - java

I want to match (with a Java regex) all URLs belonging to a certain domain, except the ones that look like a query string.
For example, I want to match
http://www.thehindu.com/arts/music/marrying-keys-to-chips/article4061904.ece
But avoid
http://www.thehindu.com/arts/music?article=23417
I tried the following but it allows both the above patterns.
+^http://www\.thehindu\.com([^\?=])*

What about
if (yourString.matches("(http://)?www\\.thehindu\\.com[^\\?=]*")) {
// match --> doesn't look like a query
} else {
// no match --> looks like a query or completely different url
}
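One plausible reason the original attempt accepted both URLs is that it was applied with a "find"-style search, which succeeds as soon as a prefix of the input matches. The sketch below contrasts that with a whole-string match. The class and method names are hypothetical, and the pattern is a slight simplification of the one in the question.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // Same idea as the question's pattern: the domain followed by characters other than '?' and '='.
    static final Pattern NO_QUERY = Pattern.compile("^http://www\\.thehindu\\.com[^?=]*");

    // Succeeds if any prefix of the URL fits the pattern.
    static boolean prefixMatch(String url) {
        return NO_QUERY.matcher(url).find();
    }

    // Succeeds only if the entire URL fits the pattern.
    static boolean fullMatch(String url) {
        return NO_QUERY.matcher(url).matches();
    }

    public static void main(String[] args) {
        String article = "http://www.thehindu.com/arts/music/marrying-keys-to-chips/article4061904.ece";
        String query = "http://www.thehindu.com/arts/music?article=23417";
        System.out.println(prefixMatch(query)); // true: the part before '?' is enough
        System.out.println(fullMatch(query));   // false: '?' cannot be consumed
        System.out.println(fullMatch(article)); // true
    }
}
```

With a whole-string match the query URL is rejected, because the character class cannot consume the `?`.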

Try this:
(^|\s)http:\/\/www\.thehindu\.com([^\?])*(\s|$)
Where the (^|\s) and (\s|$) are delimiters you expect between urls. Add more in those if you need.

A regex isn't strictly required here; you can simply check whether the URL contains a question mark (?).
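A minimal sketch of that regex-free check, assuming the domain prefix from the question's examples. The class name `QueryFilter` and the method `isNonQueryUrl` are made up for illustration.

```java
class QueryFilter {
    // Assumed domain prefix, taken from the example URLs in the question.
    private static final String PREFIX = "http://www.thehindu.com/";

    // True when the URL starts with the domain prefix and contains no '?'.
    static boolean isNonQueryUrl(String url) {
        return url.startsWith(PREFIX) && url.indexOf('?') < 0;
    }

    public static void main(String[] args) {
        System.out.println(isNonQueryUrl(
            "http://www.thehindu.com/arts/music/marrying-keys-to-chips/article4061904.ece"));
        System.out.println(isNonQueryUrl(
            "http://www.thehindu.com/arts/music?article=23417"));
    }
}
```

The trailing slash in the prefix is a deliberate choice in this sketch: it keeps a host such as `www.thehindu.com.example.org` from passing the prefix test.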

Related

Java Regex to Match URL

I need to create Regexes to match URLs of the following forms
/collected/{deliveryId}/deliverer/{userId}
/customer/{userId}/status/active
/users/{userId}/role
Where delivery-id and user-id are UUIDs in the form of: 124r23452-124234234-123123423534 and the other string parts are constant.
For the first one I tried something like this, but it didn't work:
String urlRegex = "[a-zA-Z-]*/collected/deliverer/(?=\\S*[-])([a-zA-Z-]+)";
You can try this pattern: \/collected\/[\w-]+\/deliverer\/[\w-]+ (the character class includes the hyphen because the IDs contain hyphens), and test it on the https://regex101.com/ website. This Wikipedia page also gives some good details on regex.
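The three URL forms from the question could be covered along these lines. This is only a sketch: it assumes an ID is any run of letters, digits, underscores and hyphens (the question's sample IDs fit that), and the class and constant names are invented here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoutePatterns {
    // Assumption: an ID consists of word characters and hyphens only.
    private static final String ID = "([\\w-]+)";

    static final Pattern COLLECTED =
        Pattern.compile("/collected/" + ID + "/deliverer/" + ID);
    static final Pattern CUSTOMER_ACTIVE =
        Pattern.compile("/customer/" + ID + "/status/active");
    static final Pattern USER_ROLE =
        Pattern.compile("/users/" + ID + "/role");

    public static void main(String[] args) {
        Matcher m = COLLECTED.matcher(
            "/collected/124r23452-124234234-123123423534/deliverer/abc-123");
        if (m.matches()) {
            // Group 1 is the delivery ID, group 2 the user ID.
            System.out.println("deliveryId=" + m.group(1) + ", userId=" + m.group(2));
        }
    }
}
```

Because the ID class excludes `/`, each capture stops at the next path separator, and `matches()` rejects paths with extra trailing segments.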

Regex to validate wildcard domains with special conditions

I am looking to validate wildcards against Samsung Knox Firewall. Please see below the full criteria for all domains:
A list of URLs for specified domain names to block DNS resolution. The format of the URL must be compliant with RFC's standards and must also match one of the following rules:
Full URL: "www.google.com"
Partial URL: "android.com"; "www.samsung"; "google". The
character "*" (wildcard) must be at the beginning and/or at the end
of the URL otherwise the URL is invalid.
Special case, matches any URL : "*"
Valid domains
The following examples are considered valid by Knox.
*.test.com
*test.com
*test
*test*
test.*
test1.test.*
Invalid domains
The following examples are considered invalid by Knox.
*test-
*test.
*test.com-
*test-.com
Is anybody able to offer a hand? I am struggling to accommodate for all of the requirements with this one.
Current code:
(?=^\*|.*\*$)^(?:\*\.?)?(?:(?:[a-z0-9-]+(?(?=\.)(?<!-)\.(?!-)))+[a-z]+)(?:\.?\*)?$
Edit: Actually, it looks like conditional regex may not even be supported in Java.
BASED ON YOUR PROVIDED EXAMPLES
If you're trying to pre-filter the domains, then this one matches all of your "Valid" examples and rejects all of your "Invalid" examples
^[\w*]([\w*-]*[\w*])?(\.[\w*]([\w*-]*[\w*])?)*$
If there's a file or a newline-separated field containing all of these that you're trying to test, you may want to use the "multiline" switch, like so:
(?m)^[\w*]([\w*-]*[\w*])?(\.[\w*]([\w*-]*[\w*])?)*$
Since you tagged java, that would be encoded as a Java string as follows:
"(?m)^[\\w*]([\\w*-]*[\\w*])?(\\.[\\w*]([\\w*-]*[\\w*])?)*$"
EDIT - Matching all the rules, in addition to your provided examples
This expression seems to work:
^(\*|(\*|\*\.)?\w+(\.\w+)*(\.\*|\*)?)$
Matching/Non-matching examples:
MATCHING        NON-MATCHING
------------    ------------
*               *test-
*.test.com      *test.
*test.com       *test.com-
*test           *test-.com
*test*          test*.com
test.*          test.*com
test1.test.*    -test.com
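To check the expression against the examples in Java, a small harness like the following can be used. The class name `WildcardCheck` and the method `isValid` are hypothetical; the pattern is the one from the answer, escaped for a Java string literal.

```java
import java.util.regex.Pattern;

class WildcardCheck {
    // The expression from the answer above, with backslashes doubled for Java.
    static final Pattern WILDCARD =
        Pattern.compile("^(\\*|(\\*|\\*\\.)?\\w+(\\.\\w+)*(\\.\\*|\\*)?)$");

    static boolean isValid(String domain) {
        return WILDCARD.matcher(domain).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "*", "*.test.com", "*test.com", "*test", "*test*", "test.*", "test1.test.*",
            "*test-", "*test.", "*test.com-", "*test-.com", "test*.com", "test.*com", "-test.com"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Note that `\w` does not include the hyphen, so under this expression a name such as `my-site.com` would be rejected as well.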

How to replace xml empty tags using regex

I have a lot of empty XML tags which need to be removed from a string.
String someData = dealDataWriter.toString();
someData = someData.replaceAll("<somerandomField1/>", "");
someData = someData.replaceAll("<somerandomField2/>", "");
someData = someData.replaceAll("<somerandomField3/>", "");
someData = someData.replaceAll("<somerandomField4/>", "");
This performs a lot of string operations, which is not efficient. What would be a better way to avoid them?
I would not recommend using regex on HTML/XML in general, but for a simple case like yours it may be fine to use a rule like this one:
someData = someData.replaceAll("<\\w+?\\/>", "");
If you want to consider also the optional spaces before and after the tag names:
someData = someData.replaceAll("<\\s*\\w+?\\s*\\/>", "");
Try the following code. It removes every self-closing tag that contains no whitespace.
someData = someData.replaceAll("<\\w+/>", "");
Alternatively to using regex or string matching, you can use an xml parser to find empty tags and remove them.
See the answers given over here: Java Remove empty XML tags
If you'd like to remove <tagA></tagA> as well as <tagB/>, you can use the following regex. Note that \1 is a back-reference to the first capturing group.
// Identifies an empty tag, i.e. <tag1></tag1> or <tag1/>.
// It also allows whitespace around or within the tag; however, tags whose value is whitespace will not match.
private static final String EMPTY_VALUED_TAG_REGEX = "\\s*<\\s*(\\w+)\\s*></\\s*\\1\\s*>|\\s*<\\s*\\w+\\s*/\\s*>";
Run the code on ideone
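As a rough illustration of how that expression might be applied, here is a sketch that precompiles it once and removes every empty tag in a single pass. The class name `EmptyTagStripper` and the method `strip` are assumptions, not part of the original answer.

```java
import java.util.regex.Pattern;

class EmptyTagStripper {
    // Same expression as EMPTY_VALUED_TAG_REGEX above, compiled once for reuse.
    private static final Pattern EMPTY_TAG = Pattern.compile(
        "\\s*<\\s*(\\w+)\\s*></\\s*\\1\\s*>|\\s*<\\s*\\w+\\s*/\\s*>");

    // Removes <tag/> and <tag></tag> occurrences in one pass over the input.
    static String strip(String xml) {
        return EMPTY_TAG.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<a><b/><c></c><d>x</d></a>"));
        // prints <a><d>x</d></a>
    }
}
```

A single pass does not remove elements that only become empty after their children are stripped (for example `<a><b/></a>` leaves `<a></a>`); repeating the call until the string stops changing would handle that, and an XML parser remains the more robust option.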

Regex to Extract First Part of URL

I need a java regex to extract parts of a URL.
For example, take the following URLs:
http://localhost:81/example
https://test.com/test
http://test.com/
I would want my regex expression to return:
http://localhost:81
https://test.com
http://test.com
I will be using this in a Java patcher.
This is what I have so far, problem is it takes the whole URLs:
^https?:\/\/(?!.*:\/\/)\S+
import java.net.URL;
//snip
URL url = new URL(urlString); // throws MalformedURLException if urlString is not a valid URL
return url.getProtocol() + "://" + url.getAuthority();
The right tool for the right job.
Building off your attempt, try this:
^https?://[^/]+
I'm assuming that you want to capture everything until the first / after http://? (That's what I was getting from your examples - if not, please post some more).
Are these URLs given as one input, or is each one a separate string?
Edit: It was pointed out that there were unnecessary escapes, so fixed to a more condensed version
Language independent answer:
For the whitespace: replace /^\s+/ with the empty string.
For removing the path information from the URL, if you can assume there aren't any slashes in the path (i.e. you're not dealing with http://localhost:81/foo/bar/baz), replace /\/[^\/]+$/ with the empty string. If there might be more slashes, you might try something like replacing /(^\s*.*:\/\/[^\/]+)\/.*/ with $1.
A simple one: ^(https?://[^/]+)
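The regex and the URL-parsing approaches above can be compared side by side with a sketch like this one. It uses `java.net.URI` for the parsing variant, and the class and method names are invented for the example.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginExtractor {
    // Scheme, "://", then everything up to (not including) the first '/'.
    private static final Pattern ORIGIN = Pattern.compile("^https?://[^/]+");

    // Regex variant: returns null when the input does not start with http(s)://.
    static String byRegex(String url) {
        Matcher m = ORIGIN.matcher(url);
        return m.find() ? m.group() : null;
    }

    // Parser variant: throws IllegalArgumentException on a malformed URI.
    static String byUri(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    public static void main(String[] args) {
        for (String u : new String[] {
                "http://localhost:81/example", "https://test.com/test", "http://test.com/"}) {
            System.out.println(byRegex(u) + " | " + byUri(u));
        }
    }
}
```

For the three sample URLs both variants give the same result; the parser variant additionally validates the input.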

Match two urls with regular expressions

I have a list of URLs, and I want to match them against this URL using regular expressions:
http://investor.somehost.com/*
Here * means anything after that point; in other words, it's a wildcard.
String href = url.getURL();
Here href holds the URLs from the list.
Suppose firstentry contains the URL above (http://investor.somehost.com/*).
How can I compare href with firstentry so that, if href starts with this URL, I can take some action?
If you just want to determine whether a String starts with a particular prefix, use startsWith(String prefix).
Example:
String href = "http://google.com/mail";
if(href.startsWith("http://google.com")) {
//... Do stuff
}
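Applied to the wildcard entry from the question, the same idea might look like the sketch below. It assumes the `*` only ever appears as the last character of the entry; the class name `PrefixMatcher` and the method `matchesWildcard` are hypothetical.

```java
class PrefixMatcher {
    // Treats a trailing '*' as "anything after this prefix"; otherwise requires equality.
    static boolean matchesWildcard(String entry, String href) {
        if (entry.endsWith("*")) {
            String prefix = entry.substring(0, entry.length() - 1);
            return href.startsWith(prefix);
        }
        return href.equals(entry);
    }

    public static void main(String[] args) {
        String firstEntry = "http://investor.somehost.com/*";
        System.out.println(matchesWildcard(firstEntry, "http://investor.somehost.com/reports/q1"));
        System.out.println(matchesWildcard(firstEntry, "http://other.somehost.com/"));
    }
}
```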
"^http://investor\\.somehost\\.com/"
will match any string starting with http://investor.somehost.com/. If you want only valid URLs, you could use
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?"
If you want to allow queries,
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
If you also need fragments,
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:#!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?(#([-._~:#!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
End any of these with $ if you don't want to allow trailing (non-URL) parts of the string.
I have a regular expression in this post that extracts the domain part of a URL no matter where in a string it may occur. It's for JavaScript, so remove the leading '/' and the trailing '/ig'. Use it to extract the domains and compare them with a simple equals check.
