Pattern Matching to find the web address present in a web page

Can anyone suggest how to match string patterns, where the pattern is
(\\.quantserve\\.com\\/|\\/quant\\.js)$
and it needs to be matched against
edge.quantserve.com/quant.js$.
Thanks for your help.

You should use the java.util.regex.Pattern class to match strings in Java.
Your example would be something like:
Pattern p = Pattern.compile(".*\\.quantserve\\.com/.*quant\\.js");
boolean b = p.matcher("edge.quantserve.com/quant.js").matches();
System.out.println(b);
Edit: Debug in Pattern
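The pattern from the question fails with matches() because matches() must cover the whole input, while find() only needs the pattern to occur somewhere in it. A minimal sketch of that difference, assuming the trailing $ in the question is the regex end anchor and not part of the input; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

class QuantserveMatch {
    // The alternation from the question, written as a Java string literal.
    static final Pattern ORIGINAL =
            Pattern.compile("(\\.quantserve\\.com/|/quant\\.js)$");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean matchesWhole(String input) {
        return ORIGINAL.matcher(input).matches();
    }

    // find() succeeds if the pattern occurs anywhere in the input.
    static boolean foundAnywhere(String input) {
        return ORIGINAL.matcher(input).find();
    }

    public static void main(String[] args) {
        String url = "edge.quantserve.com/quant.js";
        System.out.println(matchesWhole(url));  // false
        System.out.println(foundAnywhere(url)); // true
    }
}
```

So the original pattern can be kept as it is when find() is used; the leading and inner .* are only needed when matches() is used.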

Related

How to get the middle strings with regex?

I have an input string that looks like this
DatalogSetupFile: BTS50xx1EJA\3.20\log_all.stp
The DatalogSetupFile: and \3.20\log_all.stp are constant. I wish to extract BTS50xx1EJA from the string. How should I do it?

You can write a regex in which the static content goes into non-capturing groups and the dynamic content goes into a capturing group, so that you get the dynamic content as a single group.
You can define the regex as follows:
^(?:DatalogSetupFile:\s)(.*)(?:\\3\.20\\log_all\.stp)$
Try this Demo
Here you can use the first group to get your dynamic string.
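In Java, the group approach could look roughly like the sketch below. It assumes the prefix and suffix are exactly the constants given in the question and uses Pattern.quote so their backslashes and dots are matched literally; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiddleExtractor {
    // Pattern.quote escapes the constant parts, so no manual escaping
    // of the backslashes and dots is needed.
    static final Pattern P = Pattern.compile(
            "^" + Pattern.quote("DatalogSetupFile: ")
            + "(.*)"
            + Pattern.quote("\\3.20\\log_all.stp") + "$");

    // Returns the text between the constant prefix and suffix,
    // or null if the line does not have the expected shape.
    static String extract(String line) {
        Matcher m = P.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "DatalogSetupFile: BTS50xx1EJA\\3.20\\log_all.stp";
        System.out.println(extract(line)); // prints BTS50xx1EJA
    }
}
```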

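Give this regex a try:
\s\K[^\\]+
Note that \K is PCRE syntax, which java.util.regex does not support, so in Java a lookbehind does the same job:
String myInputString = "DatalogSetupFile: BTS50xx1EJA\\3.20\\log_all.stp";
Pattern myPattern = Pattern.compile("(?<=\\s)[^\\\\]+");
Matcher myMatcher = myPattern.matcher(myInputString);
// find() must succeed before group() can be called
if (myMatcher.find()) {
System.out.println(myMatcher.group(0));
}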

Regular expression for hgsv notation in java

HGSV nomenclature has a pattern:
xxxxx.yyyy:charactersnumbercharacters
I would like to make a regex in Java and fetch all the tokens from the above, e.g.
it should have 5 tokens:
{ 'xxxxx', 'yyyy', 'characters', 'number' , 'characters'}
I have used a simple split to fetch the tokens, but I don't think it's an optimal solution.
My current code is:
String hgsv = "BRAF.p:V600E";
String[] tokens = hgsv.split("\\."); // split takes a regex, so the dot must be escaped
this.symbol = tokens[0];
String type = tokens[1].split(":")[0];
I would like to use Pattern and Matcher in Java, but I have no idea how to write the regex for the above tokens.
Any clue how to do that?
(Even to separate characters, numbers, characters I will be using a regex, so why not use a regex for the entire token?)
I found a link, but it is in Python; I need something similar in Java.

I think what you're probably looking for is to use capture groups, like this:
String s = "BRAF.p:V600E";
Pattern p = Pattern.compile("(\\w+)\\.(\\w+):([a-zA-Z]+)(\\d+)([a-zA-Z]+)");
Matcher m = p.matcher(s);
if (m.matches()) {
String[] parts = {m.group(1),
m.group(2),
m.group(3),
m.group(4),
m.group(5)};
// Prints "[BRAF, p, V, 600, E]"
System.out.println(Arrays.toString(parts));
} else {
// The input String is invalid.
}
That's really just a lot like a split, but it's more stable because you're using the pattern to validate the String beforehand.
Note that I have no idea if that is the exact right pattern that you should be using. I don't know the exact details of the HGSV notation you're talking about and your description is actually pretty vague. (What are e.g. xxxxx and yyyy? What are "characters"?) If you link me to some sort of specification or detailed description of this notation I can try to write a regex that's more definitely correct.
Anyhow, my example shows the basic idea. You might also see http://www.regular-expressions.info/brackets.html for more information.
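If numbered groups become hard to keep track of, named capture groups are one option. The sketch below uses the same pattern as above; the group names (symbol, type, ref, pos, alt) and the class name are my own labels for illustration, not part of any official notation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HgsvParser {
    // Same structure as the numbered-group pattern, with hypothetical names.
    static final Pattern P = Pattern.compile(
            "(?<symbol>\\w+)\\.(?<type>\\w+):"
            + "(?<ref>[a-zA-Z]+)(?<pos>\\d+)(?<alt>[a-zA-Z]+)");

    static final String[] NAMES = {"symbol", "type", "ref", "pos", "alt"};

    // Returns the five tokens keyed by group name, in pattern order.
    static Map<String, String> parse(String s) {
        Matcher m = P.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not in the expected form: " + s);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : NAMES) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints {symbol=BRAF, type=p, ref=V, pos=600, alt=E}
        System.out.println(parse("BRAF.p:V600E"));
    }
}
```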

Website/URL Validation Regex in JAVA

I need a regex to match URLs starting with "http://", "https://", "www." or "google.com".
The code I tried is:
//Pattern to check if this is a valid URL address
Pattern p = Pattern.compile("(http://|https://)(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?");
Matcher m;
m=p.matcher(urlAddress);
but this code can only match URLs such as "http://www.google.com".
I know this may be a duplicate question, but I have tried all of the regexes provided and none of them suits my requirement. Will someone please help me? Thank you.

You need to make the (http://|https://) part of your regex optional.
^(http:\/\/|https:\/\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$
DEMO

You can use the Apache Commons Validator library (org.apache.commons.validator.routines.UrlValidator) to validate a URL:
String[] schemes = {"http","https"};
UrlValidator urlValidator = new UrlValidator(schemes);
And use:
urlValidator.isValid(url) // url is the string to check
Then there is no need for a regex.
Link:
https://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/UrlValidator.html
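If adding a dependency is not an option, a much looser check is possible with the standard library alone. This is not what UrlValidator does; it is only a sketch that parses the string with java.net.URI and accepts it when the scheme is http or https and a host is present. The class and method names are hypothetical, and scheme-less input such as www.google.com is rejected by this check.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class SimpleUrlCheck {
    // Accepts only absolute http/https URLs that have a host component.
    static boolean isHttpUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            if (scheme == null) {
                return false; // e.g. "www.google.com" has no scheme
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            boolean httpLike = scheme.equals("http") || scheme.equals("https");
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("http://www.google.com"));
        System.out.println(isHttpUrl("www.google.com"));
    }
}
```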

If you use Java, I recommend this regex (I wrote it myself):
^(https?:\/\/)?(www\.)?([\w]+\.)+[\w]{2,63}\/?$
"^(https?:\\/\\/)?(www\\.)?([\\w]+\\.)+[\\w]{2,63}\\/?$" // as Java-String
To explain:
^ = line start
(https?:\/\/)? = "http://" or "https://" may occur.
(www\.)? = "www." may occur.
([\w]+\.)+ = a word ([a-zA-Z0-9_]) followed by a dot has to occur one or more times. (Extend here if you need special characters like ü, ä, ö or others in your URL, and remember to use IDN.toASCII(url) if you use special characters. If you need to know which characters are legal in general: https://kb.ucla.edu/articles/what-characters-can-go-into-a-valid-http-url)
[\w]{2,63} = a word ([a-zA-Z0-9_]) with 2 to 63 characters has to occur exactly once. (A TLD (top-level domain, for example .com) cannot be shorter than 2 or longer than 63 characters.)
\/? = a "/" character may occur. (Some people or servers put a / at the end.)
$ = line end
If you extend it with special characters, it could look like this:
^(https?:\/\/)?(www\.)?([\w\Q$-_+!*'(),%\E]+\.)+[\w]{2,63}\/?$
"^(https?:\\/\\/)?(www\\.)?([\\w\\Q$-_+!*'(),%\\E]+\\.)+[\\w]{2,63}\\/?$" // as Java-String
The answer of Avinash Raj is not fully correct.
^(http:\/\/|https:\/\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$
The dots are not escaped, which means each one matches any character. Also, my version is simpler, and I have never heard of a domain like "test..com" (which that pattern actually matches).
Demo: https://regex101.com/r/vM7wT6/279
Edit:
As I saw some people needing a regex which also matches server directories, I wrote this:
^(https?:\/\/)?([\w\Q$-_+!*'(),%\E]+\.)+(\w{2,63})(:\d{1,4})?([\w\Q/$-_+!*'(),%\E]+\.?[\w])*\/?$
While this may not be the best one, since I didn't spend too much time on it, maybe it helps someone. You can see how it works here: https://regex101.com/r/vM7wT6/700
It also matches URLs like "hello.to/test/whatever.cgi".
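The point about unescaped dots can be checked with a small comparison. The sketch below compiles the basic escaped pattern from this answer (written with \w instead of [\w] and / instead of \/, which are equivalent) next to the unescaped one and runs both against a few sample strings; the class name, method names, and sample list are made up for illustration.

```java
import java.util.regex.Pattern;

class UrlRegexCheck {
    // Escaped version: every literal dot is written as \\.
    static final Pattern ESCAPED =
            Pattern.compile("^(https?://)?(www\\.)?(\\w+\\.)+\\w{2,63}/?$");

    // Unescaped version: each bare dot matches any character.
    static final Pattern UNESCAPED = Pattern.compile(
            "^(http://|https://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$");

    static boolean escaped(String s) {
        return ESCAPED.matcher(s).matches();
    }

    static boolean unescaped(String s) {
        return UNESCAPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.google.com", "www.google.com", "google.com", "test..com"
        };
        for (String s : samples) {
            System.out.println(s + " escaped=" + escaped(s)
                    + " unescaped=" + unescaped(s));
        }
    }
}
```

For "test..com" the escaped pattern rejects the input while the unescaped one accepts it, because its bare dots also match the second ".".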

A Java-compatible version of @Avinash's answer would be:
//Pattern to check if this is a valid URL address
Pattern p = Pattern.compile("^(http://|https://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$");
Matcher m;
m=p.matcher(urlAddress);
boolean matches = m.matches();

pattern="w{3}\.[a-z]+\.?[a-z]{2,3}(|\.[a-z]{2,3})"
This will only accept addresses such as www.google.com and www.google.co.in.

// I use this
static boolean esURL(String cadena){
// true when the whole string looks like a URL
return cadena.matches("\\b(https?://|ftp://|file://|www\\.)[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");
}

RegEx for 4 values within a link (This or this or this etc)

Having a bit of trouble with this.
Say I have a link that could contain these values:
Balearic|Ibiza|Majorca|Menorca
Example1: http://site.com/Menorca
Example2: http://site.com/Ibiza
I just need a RegEx to say if the link contains any of those 4 (case insensitive as well)
Can someone point in the right direction - it is not in a particular language but the software I work in is Java based.
Thanks a lot - and I'll keep trying in the meantime! :)

You can just use:
// assuming url is your URL variable
if (url.matches("(?i)^http://site\\.com/(Balearic|Ibiza|Majorca|Menorca)\\b.*$")) {
// match succeeded
}
(?i) will make sure case is ignored while doing this comparison.

Your list of values is a valid regex, just add the "i" option to make it case insensitive, per http://rubular.com/r/euscuu7Fwj

Here is a regexp that matches what you ask:
^http://site\.com/(Balearic|Ibiza|Majorca|Menorca)$
To use it in Java, you may want to do the following:
String url = "http://site.com/Ibiza";
Pattern p = Pattern.compile("^http://site\\.com/(Balearic|Ibiza|Majorca|Menorca)$", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(url); // get a matcher object
boolean found = m.matches(); // true when the whole URL fits the pattern

Running a regex against a whole URL is usually not good practice.
Here's a mixed approach. It will look for your values only in the path of your URLs, thus both simplifying the regular expression and validating the URL. Case-insensitive.
try {
URI menorca = new URI("http://site.com/Menorca");
System.out.println(menorca.getPath().substring(1));
URI ibiza = new URI("http://site.com/Ibiza");
System.out.println(ibiza.getPath().substring(1));
Pattern pattern = Pattern.compile("Balearic|Ibiza|Majorca|Menorca", Pattern.CASE_INSENSITIVE);
System.out.println(pattern.matcher(menorca.getPath()).find());
System.out.println(pattern.matcher(ibiza.getPath()).find());
}
catch (URISyntaxException use) {
use.printStackTrace();
}
Output:
Menorca
Ibiza
true
true

Negating a Regular Expression for string replacement

I have the following code that can replace the email address in a String in Java:
addressStr.replaceFirst("([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})", "")
So, a string with John Smith <john@smith.com> would become John Smith <>. How do I negate it so that it instead replaces everything that doesn't match the email address, leaving just john@smith.com as the final result?
I tried to put ^ and ?<= at the front but it doesn't work.

Well, it's not the regex you need to change but the calling code. Your regex matches the e-mail address (in a weird way), and replaceFirst() removes it from the string.
So just use
Pattern regex = Pattern.compile("([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})");
Matcher regexMatcher = regex.matcher(addressStr);
if (regexMatcher.find()) {
address = regexMatcher.group();
}
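As a self-contained illustration of "find instead of replace", here is a minimal sketch. It assumes the address is written with @ and uses a deliberately simplified address pattern of my own, not the one from the question; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // Simplified pattern for illustration only; it does not try to
    // cover every valid address format.
    static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_.+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-Z]{2,}");

    // Returns the first address found in the text, or null if there is none.
    static String firstEmail(String text) {
        Matcher m = EMAIL.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // prints john@smith.com
        System.out.println(firstEmail("John Smith <john@smith.com>"));
    }
}
```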

The complete Java regex for catching e-mails would be as follows:
"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])"
Take a look at https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1 for more info on this.
It is a bit complicated, but it is valid for all known and valid email formats (yours does not allow addresses like bob+bib@gmail.com, which are valid).
For your problem, as stated multiple times, just use find (stealing Tim Pietzcker's piece of code):
Pattern regex = Pattern.compile("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])");
Matcher regexMatcher = regex.matcher(addressStr);
foundMatch = regexMatcher.find();

You can try:
Matcher m = Pattern.compile(regexp).matcher(addressStr);
String mailId = m.find() ? m.group() : null; // group() is only valid after a successful find()
The idea here is to get the matched string rather than trying to replace everything else with blanks. You can extract the compiled pattern into a field if this operation is repeated often.

Just don't replace.... use match(es) instead.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
