Java Regex to Match URL - java

I need to create Regexes to match URLs of the following forms
/collected/{deliveryId}/deliverer/{userId}
/customer/{userId}/status/active
/users/{userId}/role
Where delivery-id and user-id are UUIDs in the form of: 124r23452-124234234-123123423534 and the other string parts are constant.
For the first one I tried something like this, but it didn't work:
String urlRegex = "[a-zA-Z-]*/collected/deliverer/(?=\\S*[-])([a-zA-Z-]+)";

You can try this pattern: \/collected\/[\w-]+\/deliverer\/[\w-]+ (the IDs contain hyphens, so \w alone is not enough) and test it on https://regex101.com/. The Wikipedia page on regular expressions also gives some good details.
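A minimal Java sketch for the three URL forms from the question, assuming the IDs consist only of letters, digits and hyphens (as in the sample ID). The class and method names are hypothetical and only serve the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeliveryUrlMatcher {
    // Assumption: an ID is one or more letters, digits or hyphens.
    private static final String ID = "[A-Za-z0-9-]+";

    private static final Pattern COLLECTED =
            Pattern.compile("/collected/(" + ID + ")/deliverer/(" + ID + ")");
    private static final Pattern CUSTOMER_ACTIVE =
            Pattern.compile("/customer/(" + ID + ")/status/active");
    private static final Pattern USER_ROLE =
            Pattern.compile("/users/(" + ID + ")/role");

    // Returns {deliveryId, userId}, or null if the whole path does not match.
    static String[] parseCollected(String path) {
        Matcher m = COLLECTED.matcher(path);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    static boolean isCustomerActive(String path) {
        return CUSTOMER_ACTIVE.matcher(path).matches();
    }

    static boolean isUserRole(String path) {
        return USER_ROLE.matcher(path).matches();
    }

    public static void main(String[] args) {
        String[] ids = parseCollected(
                "/collected/124r23452-124234234-123123423534/deliverer/abc-123");
        System.out.println(ids[0] + " / " + ids[1]);
    }
}
```

matches() requires the entire path to match, so no ^ or $ anchors are needed here.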

Related

Website/URL Validation Regex in JAVA

I need a regex string to match URL starting with "http://", "https://", "www.", "google.com"
the code i tried using is:
//Pattern to check if this is a valid URL address
Pattern p = Pattern.compile("(http://|https://)(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?");
Matcher m;
m=p.matcher(urlAddress);
but this code can only match URLs such as "http://www.google.com".
I know this may be a duplicate question, but I have tried all of the regexes provided and none suits my requirement. Will someone please help me? Thank you.
You need to make the (http://|https://) part of your regex optional.
^(http:\/\/|https:\/\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$
You can use the Apache Commons Validator library (org.apache.commons.validator.routines.UrlValidator) to validate a URL:
String[] schemes = {"http", "https"};
UrlValidator urlValidator = new UrlValidator(schemes);
And then call:
urlValidator.isValid(yourUrl)
Then there is no need for a regex.
Link:
https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/UrlValidator.html
If you use Java, I recommend using this regex (I wrote it myself):
^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$
"^(https?:\\/\\/)?(www\\.)?([\\w]+\\.)+[\\w]{2,63}\\/?$" // as Java-String
to explain:
^ = line start
(https?:\/\/)? = "http://" or "https://" may occur.
(www\.)? = "www." may occur.
([\w]+\.)+ = a word (\w, i.e. [a-zA-Z0-9_]) followed by a dot has to occur one or more times. (Extend this if you need special characters like ü, ä, ö or others in your URL; remember to use IDN.toASCII(url) if you use special characters. If you need to know which characters are legal in general: https://kb.ucla.edu/articles/what-characters-can-go-into-a-valid-http-url)
[\w]{2,63} = a word ([a-zA-Z0-9_]) with 2 to 63 characters has to occur exactly once. (A TLD (top-level domain, for example .com) cannot be shorter than 2 or longer than 63 characters.)
\/? = a "/" character may occur. (Some people or servers put a / at the end.)
$ = line end
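As a quick check of the explanation above, here is a small sketch that compiles the pattern in Java and tries it on a few inputs. The class and method names are hypothetical; the pattern is the one from this answer, minus the escapes on "/", which Java does not require.

```java
import java.util.regex.Pattern;

final class SimpleUrlCheck {
    // Only the regex backslashes are doubled for the Java string literal;
    // "/" is not special in Java regexes and is left unescaped.
    private static final Pattern URL =
            Pattern.compile("^(https?://)?(www\\.)?(\\w+\\.)+\\w{2,63}/?$");

    static boolean looksLikeUrl(String candidate) {
        return URL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"http://www.google.com", "google.com", "test..com"}) {
            System.out.println(s + " -> " + looksLikeUrl(s));
        }
    }
}
```

Because at least one "word followed by a dot" is required, a bare host name such as http://localhost is rejected by this pattern.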
If you extend it with special characters, it could look like this:
^(https?:\/\/)?(www\.)?([\w\Q$-_+!*'(),%\E]+\.)+[\w]{2,63}\/?$
"^(https?:\\/\\/)?(www\\.)?([\\w\\Q$-_+!*'(),%\\E]+\\.)+[\\w]{2,63}\\/?$" // as Java-String
The answer of Avinash Raj is not fully correct.
^(http:\/\/|https:\/\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$
The dots are not escaped, which means they match any character. Also, my version is simpler, and I have never heard of a domain like "test..com" (which that pattern actually matches...)
Demo: https://regex101.com/r/vM7wT6/279
Edit:
Since I saw that some people need a regex that also matches server directories, I wrote this:
^(https?:\/\/)?([\w\Q$-_+!*'(),%\E]+\.)+(\w{2,63})(:\d{1,4})?([\w\Q/$-_+!*'(),%\E]+\.?[\w])*\/?$
This may not be the best one, since I didn't spend much time on it, but maybe it helps someone. You can see how it works here: https://regex101.com/r/vM7wT6/700
It also matches urls like "hello.to/test/whatever.cgi"
A Java-compatible version of @Avinash's answer would be
//Pattern to check if this is a valid URL address
Pattern p = Pattern.compile("^(http://|https://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$");
Matcher m;
m=p.matcher(urlAddress);
boolean matches = m.matches();
pattern="w{3}\.[a-z]+\.?[a-z]{2,3}(|\.[a-z]{2,3})"
This will only accept addresses such as www.google.com and www.google.co.in.
// I use this:
static boolean esURL(String cadena) {
    // Accepts strings that start with http://, https://, ftp://, file:// or www.
    return cadena.matches("\\b(https?://|ftp://|file://|www\\.)[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");
}

Deleting all words matching a regex pattern

I would like to remove the character sequences like "htsap://" or "ftsap://" from a String. Is it possible?
Let me illustrate my needs with an example.
Actual input String:
"Every Web page has a http unique address called a URL (Uniform Resource Locator) which identifies where it is located on the Web. For "ftsap://"example, the URL for CSM Library's home page is: "htsap://"www.smccd.edu/accounts/csmlibrary/index.htm The basic parts of a URL often provide \"clues\" to htsap://where a web page originates and who might be responsible for the information at that page or site."
Expected resulting String:
"Every Web page has a http unique address called a URL (Uniform Resource Locator) which identifies where it is located on the Web. For example, the URL for CSM Library's home page is: www.smccd.edu/accounts/csmlibrary/index.htm The basic parts of a URL often provide \"clues\" to where a web page originates and who might be responsible for the information at that page or site."
Patterns I tried (I'm not sure this is the right approach):
((.*?)(?=("htsap://|ftsap://")))
and:
((.*?)(?=("htsap://|ftsap://")))(.*)
Could anyone please suggest a solution?
Since you're escaping your quotes within your sample Strings, I'll assume you're working in Java.
You should try:
final String res = input.replaceAll("\"?\\w+://\"?", "");
How it works:
It matches and removes any sequence of word characters (letters, digits and underscores) followed by ://, optionally preceded and/or followed by a double quote.
EDIT: How to achieve the same result using a Matcher?
final String input = "Every Web page has a http unique address called a URL (Uniform Resource Locator) which identifies where it is located on the Web. For \"ftsap://\"example, the URL for CSM Library's home page is: \"htsap://\"www.smccd.edu/accounts/csmlibrary/index.htm The basic parts of a URL often provide \"clues\" to htsap://where a web page originates and who might be responsible for the information at that page or site.";
final Pattern p = Pattern.compile("\"?\\w+://\"?");
final StringBuilder b = new StringBuilder(input);
Matcher m;
while((m = p.matcher(b.toString())).find()) {
b.replace(m.start(), m.end(), "");
}
System.out.println(b.toString());
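The loop above builds a new Matcher on every iteration. A shorter sketch of the same idea, assuming the same pattern, lets Matcher.replaceAll do the work in a single pass. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class SchemeStripper {
    // Optional quote, word characters, "://", optional quote.
    private static final Pattern SCHEME = Pattern.compile("\"?\\w+://\"?");

    static String strip(String input) {
        // Replaces every match with the empty string in one pass.
        return SCHEME.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "For \"ftsap://\"example, see \"htsap://\"www.smccd.edu and htsap://where";
        System.out.println(strip(input));
    }
}
```

Note that \w+ is not limited to "htsap" and "ftsap": this pattern would also strip real schemes such as http://, so narrow it if that matters for your input.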
Use this regex:
"(ftsap|htsap).//"
and replace each match with the empty string.
Regex explained:
"(ftsap|htsap).//" with flag g

Regex to Extract First Part of URL

I need a java regex to extract parts of a URL.
For example, take the following URLs:
http://localhost:81/example
https://test.com/test
http://test.com/
I would want my regex expression to return:
http://localhost:81
https://test.com
http://test.com
I will be using this in a Java patcher.
This is what I have so far; the problem is that it matches the whole URL:
^https?:\/\/(?!.*:\/\/)\S+
import java.net.URL;
//snip
URL url = new URL(urlString);
return url.getProtocol() + "://" + url.getAuthority();
The right tool for the right job.
Building off your attempt, try this:
^https?://[^/]+
I'm assuming that you want to capture everything until the first / after http://? (That's what I was getting from your examples - if not, please post some more).
Are these URLs given as one input, or are each a different string?
Edit: It was pointed out that there were unnecessary escapes, so fixed to a more condensed version
Language independent answer:
For the whitespace: replace /^\s+/ with the empty string.
For removing the path information from the URL, if you can assume there aren't any slashes in the path (i.e. you're not dealing with http://localhost:81/foo/bar/baz), replace /\/[^\/]+$/ with the empty string. If there might be more slashes, you might try something like replacing /(^\s*.*:\/\/[^\/]+)\/.*/ with $1.
A simple one: ^(https?://[^/]+)
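For completeness, a small sketch of how the ^https?://[^/]+ pattern might be applied with a Matcher in Java. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OriginExtractor {
    // Scheme plus everything up to (but not including) the first "/" after it.
    private static final Pattern ORIGIN = Pattern.compile("^https?://[^/]+");

    static Optional<String> extract(String url) {
        Matcher m = ORIGIN.matcher(url);
        // find() is used instead of matches() because only a prefix is wanted.
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] urls = {"http://localhost:81/example", "https://test.com/test", "http://test.com/"};
        for (String url : urls) {
            System.out.println(extract(url).orElse("(no match)"));
        }
    }
}
```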

Match two urls with regular expressions

I have a list of URLs and I want to match them against this URL using regular expressions:
http://investor.somehost.com/*
Here * means anything after that point; in other words, it is a wildcard.
String href = url.getURL();
Here href contains each of the URLs.
Suppose firstentry contains the URL above (http://investor.somehost.com/*).
How can I compare href with firstentry so that, if href starts with this URL, I can take some action?
If you just want to determine whether a String starts with a particular prefix, use startsWith(String prefix).
Example:
String href = "http://google.com/mail";
if(href.startsWith("http://google.com")) {
//... Do stuff
}
"^http://investor\\.somehost\\.com/"
will match any string starting with http://investor.somehost.com/. If you want only valid URLs, you could use
"^http://investor\\.somehost\\.com/(([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?"
If you want to allow queries,
"^http://investor\\.somehost\\.com/(([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
If you also need fragments,
"^http://investor\\.somehost\\.com/(([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:@!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:@!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?(#([-._~:@!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
End any of these with $ if you don't want to allow trailing (non-URL) parts of the string.
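If the prefix comes from data rather than being hand-written, escaping each dot by hand is error-prone. One possible sketch, assuming the prefix should be treated as literal text, uses Pattern.quote together with Matcher.lookingAt, which only matches at the start of the input. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PrefixMatcher {
    static boolean startsWithPrefix(String href, String literalPrefix) {
        // Pattern.quote turns the prefix into a literal, so "." matches only ".".
        Pattern p = Pattern.compile(Pattern.quote(literalPrefix));
        // lookingAt() anchors the match at the beginning of href.
        return p.matcher(href).lookingAt();
    }

    public static void main(String[] args) {
        String prefix = "http://investor.somehost.com/";
        System.out.println(startsWithPrefix("http://investor.somehost.com/reports/q1", prefix));
        System.out.println(startsWithPrefix("http://investorXsomehost.com/", prefix));
    }
}
```

For a purely literal prefix this behaves like String.startsWith; the regex form is mainly useful when the prefix is combined with a further pattern for the rest of the URL.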
I have a regular expression in this post that extracts the domain part of a URL no matter where in a string it may occur. It's for JavaScript, so remove the leading '/' and trailing '/ig'. Use it to extract the domains and compare them with a simple equals check.

java email extraction regular expression?

I would like a regular expression that will extract email addresses from a String (using Java regular expressions).
That really works.
Here's the regular expression that really works.
I've spent an hour surfing on the web and testing different approaches,
and most of them didn't work although Google top-ranked those pages.
I want to share with you a working regular expression:
[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})
Here's the original link:
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
I had to add some dashes to allow for them. So a final result in Javanese:
final String MAIL_REGEX = "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})";
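A short sketch of how such a pattern could be used to pull every address out of a longer text, assuming the separator in the pattern is the usual @ character. The class and method names are hypothetical, and the pattern is only as strict as the regex above, not a full address validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailExtractor {
    private static final Pattern MAIL = Pattern.compile(
            "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})");

    static List<String> extract(String text) {
        List<String> emails = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        // find() advances through the text; group() is the whole current match.
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        System.out.println(extract("Contact alice@example.com or bob.smith@mail.example.org today."));
    }
}
```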
Install this regex tester plugin into eclipse, and you'd have whale of a time testing regex
http://brosinski.com/regex/.
Points to note:
In the plugin, use only one backslash for character escape. But when you transcribe the regex into a Java/C# string you would have to double them as you would be performing two escapes, first escaping the backslash from Java/C# string mechanism, and then second for the actual regex character escape mechanism.
Surround the sections of the regex whose text you wish to capture with parentheses (round brackets). Then you can use the group functions in Java or C# regex to find out the values of those sections.
([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
For example, using the above regex, the following string
abc.efg@asdf.cde
yields
start=0, end=16
Group(0) = abc.efg@asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde
Group 0 is always the whole matched string.
If you do not enclose any section in parentheses, you can only detect a match but cannot capture the text.
It might be less confusing to create a few regex than one long catch-all regex, since you could programmatically test one by one, and then decide which regexes should be consolidated. Especially when you find a new email pattern that you had never considered before.
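The group behaviour described above can be reproduced with a few lines of Java. This is a sketch assuming the separator in the sample pattern is @; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmailGroups {
    private static final Pattern P = Pattern.compile(
            "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\\.[A-Za-z0-9]+)");

    // Returns groups 0..4 of the first match, or null if nothing matches.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("abc.efg@asdf.cde");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Group(" + i + ") = " + g[i]);
        }
    }
}
```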
A little late, but OK.
Here is what I use. Just paste it into the Firebug console and run it. Then look on the web page for a textarea (most likely at the bottom of the page) that will contain a comma-separated list of all email addresses found in A tags.
var jquery = document.createElement('script');
jquery.setAttribute('src', 'http://code.jquery.com/jquery-1.10.1.min.js');
document.body.appendChild(jquery);
var list = document.createElement('textarea');
// setAttribute needs both a name and a value; the id is used by the selectors below
list.setAttribute('id', 'emaillist');
document.body.appendChild(list);
var lijst = "";
$("#emaillist").val("");
$("a").each(function(idx,el){
// keep only links whose href contains an @ (typically mailto: links)
var mail = $(el).filter('[href*="@"]').attr("href");
if(mail){
lijst += mail.replace("mailto:", "")+",";
}
});
$("#emaillist").val(lijst);
Android's built-in email address pattern (android.util.Patterns.EMAIL_ADDRESS) works perfectly:
public static List<String> getEmails(@NonNull String input) {
List<String> emails = new ArrayList<>();
Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
while (matcher.find()) {
int matchStart = matcher.start(0);
int matchEnd = matcher.end(0);
emails.add(input.substring(matchStart, matchEnd));
}
return emails;
}
