Match two urls with regular expressions - java

I have a list of URLs and I want to match them against this URL using regular expressions:
http://investor.somehost.com/*
Here * means anything after that point; in other words, it is a wildcard.
String href = url.getURL();
Here href holds each URL from the list in turn.
Suppose firstentry contains the URL above (http://investor.somehost.com/*).
How can I compare href with firstentry so that, if href starts with this URL, I can act on it?

If you just want to determine whether a String starts with a particular prefix, use startsWith(String prefix).
Example:
String href = "http://google.com/mail";
if(href.startsWith("http://google.com")) {
//... Do stuff
}
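For the wildcard entry from the question, one possible approach is to strip the trailing * and treat the rest as a plain prefix. This is only a sketch: the class name, the matchesWildcard and filter helpers, and the sample URLs are made up for illustration, and it assumes the wildcard only ever appears at the very end of the entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class WildcardPrefix {
    // Treats a trailing '*' in the entry as "anything after this point".
    // Entries without a trailing '*' must match exactly.
    static boolean matchesWildcard(String href, String entry) {
        if (entry.endsWith("*")) {
            String prefix = entry.substring(0, entry.length() - 1);
            return href.startsWith(prefix);
        }
        return href.equals(entry);
    }

    // Keeps only the hrefs that match the (hypothetical) wildcard entry.
    static List<String> filter(List<String> hrefs, String entry) {
        return hrefs.stream()
                .filter(h -> matchesWildcard(h, entry))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String firstEntry = "http://investor.somehost.com/*";
        List<String> hrefs = Arrays.asList(
                "http://investor.somehost.com/reports/2020",
                "http://www.somehost.com/about");
        System.out.println(filter(hrefs, firstEntry));
    }
}
```

No regular expression is involved here, so dots and other special characters in the host name need no escaping.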

"^http://investor\\.somehost\\.com/"
will match any string starting with http://investor.somehost.com/. If you want only valid URLs, you could use
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?"
If you want to allow queries,
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
If you also need fragments,
"^http://investor\\.somehost\\.com/(([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])+(/([-._~:#!$&'()*+,;=a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)*)?(\\?([-._~:#!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?(#([-._~:#!$&'()*+,;=/?a-zA-Z0-9]|%[0-9a-fA-F][0-9a-fA-F])*)?"
End any of these with $ if you don't want to allow trailing (non-URL) parts of the string.
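A rough sketch of how the shortest of these patterns might be applied to a string. The class and method names are invented for the example. One thing to watch: String.matches() and Matcher.matches() require the whole string to match, so for a prefix-style pattern find() is the call to use.

```java
import java.util.regex.Pattern;

class InvestorUrlMatcher {
    // The prefix pattern from above, compiled once and reused.
    private static final Pattern PREFIX =
            Pattern.compile("^http://investor\\.somehost\\.com/");

    static boolean startsWithInvestorHost(String href) {
        // find() searches for the pattern; the ^ anchor pins it to the start.
        return PREFIX.matcher(href).find();
    }

    public static void main(String[] args) {
        String href = "http://investor.somehost.com/news/today";
        System.out.println(startsWithInvestorHost(href));
        // matches() needs the entire input to match, so this is false
        // for any URL that continues past the trailing slash.
        System.out.println(href.matches("^http://investor\\.somehost\\.com/"));
    }
}
```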

I have a regular expression in this post that extracts the domain part of a URL no matter where in a string it may occur. It's written for JavaScript, so remove the leading '/' and the trailing '/ig'. Use it to extract the domains and compare them with a simple equals check.

Related

Java Regex to Match URL

I need to create Regexes to match URLs of the following forms
/collected/{deliveryId}/deliverer/{userId}
/customer/{userId}/status/active
/users/{userId}/role
Where delivery-id and user-id are UUIDs in the form of: 124r23452-124234234-123123423534 and the other string parts are constant.
For the first one I tried something like this, but it didn't work:
String urlRegex = "[a-zA-Z-]*/collected/deliverer/(?=\\S*[-])([a-zA-Z-]+)";
You can try this pattern: \/collected\/[\w-]+\/deliverer\/[\w-]+ (the IDs contain hyphens, so \w alone is not enough) and test it on the https://regex101.com/ website. The Wikipedia page on regular expressions also gives some good details.
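Here is one way the three URL forms could be written as Java patterns. It is a sketch under an assumption: an ID is taken to be any run of letters, digits, underscores and hyphens, which fits the sample 124r23452-124234234-123123423534 but is looser than a strict UUID check. The class, constant and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RoutePatterns {
    // Assumption: an ID is one or more word characters or hyphens.
    private static final String ID = "[\\w-]+";

    static final Pattern COLLECTED =
            Pattern.compile("/collected/(" + ID + ")/deliverer/(" + ID + ")");
    static final Pattern CUSTOMER_ACTIVE =
            Pattern.compile("/customer/(" + ID + ")/status/active");
    static final Pattern USER_ROLE =
            Pattern.compile("/users/(" + ID + ")/role");

    // Returns {deliveryId, userId}, or null when the path has a different shape.
    static String[] collectedIds(String path) {
        Matcher m = COLLECTED.matcher(path);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String path = "/collected/124r23452-124234234-123123423534/deliverer/abc-42";
        String[] ids = collectedIds(path);
        System.out.println(ids[0] + " / " + ids[1]);
    }
}
```

In a Java string literal the forward slashes need no escaping; only the backslash in \w has to be doubled.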

How to replace xml empty tags using regex

I have a lot of empty XML tags which need to be removed from a string.
String someData = dealDataWriter.toString();
someData = someData.replaceAll("<somerandomField1/>", "");
someData = someData.replaceAll("<somerandomField2/>", "");
someData = someData.replaceAll("<somerandomField3/>", "");
someData = someData.replaceAll("<somerandomField4/>", "");
This uses a lot of string operations, which is not efficient. What would be a better way to avoid these operations?
I would not suggest using regex when operating on HTML/XML, but for a simple case like yours it may be fine to use a rule like this one:
someData = someData.replaceAll("<\\w+?\\/>", "");
If you also want to allow optional spaces before and after the tag name:
someData = someData.replaceAll("<\\s*\\w+?\\s*\\/>", "");
Try the following code. It removes every self-closing tag that has no whitespace inside it.
someData = someData.replaceAll("<\\w+/>", "");
Alternatively to using regex or string matching, you can use an xml parser to find empty tags and remove them.
See the answers given over here: Java Remove empty XML tags
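For the parser route, a minimal sketch with the JDK's built-in DOM API might look like the following. The class and method names are hypothetical, and it assumes that "empty" means an element with no child nodes and no attributes, which is the same kind of tag the regex answers target.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EmptyTagRemover {
    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Removes descendant elements that have no children and no attributes.
    // The node passed in is never removed itself.
    static void removeEmptyElements(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards so removals do not shift the indices still to visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            removeEmptyElements(child);
            if (!child.hasChildNodes() && !child.hasAttributes()) {
                node.removeChild(child);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b/><c></c><d>x</d></a>");
        removeEmptyElements(doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Because children are cleaned before their parent is checked, an element that becomes empty after its children are removed is dropped as well, which a single regex pass would not do.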
If you would like to remove <tagA></tagA> as well as <tagB/>, you can use the following regex. Note that \1 is a backreference to the first capturing group.
// Identifies an empty tag, i.e. <tag1></tag1> or <tag1/>.
// It also allows whitespace around the tag name; however, tags whose value is whitespace will not match.
private static final String EMPTY_VALUED_TAG_REGEX = "\\s*<\\s*(\\w+)\\s*></\\s*\\1\\s*>|\\s*<\\s*\\w+\\s*/\\s*>";
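A short sketch of how that regex could be used. The class and method names are invented. Compiling the pattern once and reusing it also avoids recompiling the regex on every call, which is what String.replaceAll() does.

```java
import java.util.regex.Pattern;

class EmptyTagRegex {
    // Same expression as EMPTY_VALUED_TAG_REGEX above, compiled once.
    private static final Pattern EMPTY_TAG = Pattern.compile(
            "\\s*<\\s*(\\w+)\\s*></\\s*\\1\\s*>|\\s*<\\s*\\w+\\s*/\\s*>");

    // Removes <tag></tag> and <tag/> occurrences in a single pass.
    static String stripEmptyTags(String xml) {
        return EMPTY_TAG.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripEmptyTags("<a><b/><c></c><d>x</d></a>"));
    }
}
```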

How can I get value after hashtag from URL in Java

I have a URL and I want to print in my graphical user interface the ID value after the hashtag.
For example, we have www.site.com/index.php#hello and I want to print hello value on a label in my GUI.
How can I do this using Java in Netbeans?
Simple solution is getRef() in URL class:
URL url = new URL("http://www.anyhost.com/index.php#hello");
jLabel.setText(url.getRef());
EDIT: According to @Henry's comment:
I would recommend to use the java.net.URI as it also deals with encoding. The Javadocs say: "Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL()."
and this comment:
Why not just use uri.getFragment()?
URI uri = new URI("http://www.anyhost.com/index.php#hello");
jLabel.setText(uri.getFragment());
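One detail worth handling is that getFragment() returns null when the URL has no # part. A small sketch, with a hypothetical helper name, that falls back to an empty string so the result can go straight into a label:

```java
import java.net.URI;

class FragmentReader {
    // Returns the fragment of the URL, or "" when there is none.
    // URI.create throws IllegalArgumentException for malformed input.
    static String fragmentOf(String url) {
        String fragment = URI.create(url).getFragment();
        return fragment == null ? "" : fragment;
    }

    public static void main(String[] args) {
        System.out.println(fragmentOf("http://www.anyhost.com/index.php#hello"));
    }
}
```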
Use the String.split() Method.
public static String getId(String url) {
return url.split("#")[1];
}
String.split() returns an array of Strings that are delimited, or "Split," by the value you pass to it, or in this case #.
Because you want only the string after the #, you can just use the second item in the array that it returns by adding [1] to the end of it.
For more on String.split() go to Tutorials Point.
By the way, the part of the URL you are referring to is called the fragment (or fragment identifier). It is typically used to jump to the element with that ID on a web page.
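Note that split("#")[1] throws an ArrayIndexOutOfBoundsException when the URL contains no #, or when nothing follows it. A sketch of a variant that avoids this by using indexOf() instead (the class and method names are hypothetical):

```java
class FragmentSplit {
    // Returns the text after the first '#', or "" when the URL has no '#'.
    static String afterHash(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? "" : url.substring(hash + 1);
    }

    public static void main(String[] args) {
        System.out.println(afterHash("www.site.com/index.php#hello"));
    }
}
```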

Escape special characters using Regex in java [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
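To make the "$5" case concrete, here is a small sketch of both options, Pattern.quote and the Pattern.LITERAL flag. The class and method names and the sample sentence are made up for the example.

```java
import java.util.regex.Pattern;

class LiteralMatch {
    // Searches for userText as plain text by quoting it first.
    static boolean containsQuoted(String input, String userText) {
        return Pattern.compile(Pattern.quote(userText)).matcher(input).find();
    }

    // Same idea, but using the LITERAL flag instead of quoting.
    static boolean containsLiteral(String input, String userText) {
        return Pattern.compile(userText, Pattern.LITERAL).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "That costs $5 today";
        System.out.println(containsQuoted(input, "$5"));
        System.out.println(containsLiteral(input, "$5"));
        // Unquoted, "$5" means "end of input followed by 5", which never matches.
        System.out.println(Pattern.compile("$5").matcher(input).find());
    }
}
```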
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
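The failure and the fix can be reproduced in a few lines. This is a sketch: the class name, the method names and the sample message are invented, and only the "<userInput />" placeholder comes from the description above.

```java
import java.util.regex.Matcher;

class SafeTemplate {
    // Substitutes the user's text literally, whatever characters it contains.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // Shows the failure mode: "$3" in the replacement is read as a group reference.
    static boolean unsafeFillThrows(String msg, String userInput) {
        try {
            msg.replaceAll("<userInput />", userInput);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String msg = "You entered: <userInput />";
        System.out.println(unsafeFillThrows(msg, "$3.50"));
        System.out.println(fill(msg, "$3.50"));
    }
}
```

When the search text is a fixed string rather than a real pattern, String.replace(CharSequence, CharSequence) is another option, since it treats both arguments literally.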
To build a protected pattern, you can escape every character except letters and digits with a backslash ("\\\\" in a Java replacement string). You can then substitute your own special symbols into that protected pattern, so that it works as a real pattern of your own design rather than as plain quoted text, and without any special symbols supplied by the user.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It wraps the text in "\Q" and "\E", and it also handles any "\E" that already appears in the text.
However, if you need real regular expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
This prints: Some/\s/wText\*/\,\*\* (the character class as written contains a literal s, so that s is escaped as well).
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
The ^ (negation) symbol, placed at the start of a character class, matches any character that is not in the class.

Regex to Extract First Part of URL

I need a java regex to extract parts of a URL.
For example, take the following URLs:
http://localhost:81/example
https://test.com/test
http://test.com/
I would want my regex expression to return:
http://localhost:81
https://test.com
http://test.com
I will be using this in a Java patcher.
This is what I have so far; the problem is that it matches the whole URL:
^https?:\/\/(?!.*:\/\/)\S+
import java.net.URL;
//snip
URL url = new URL(urlString);
return url.getProtocol() + "://" + url.getAuthority();
The right tool for the right job.
Building off your attempt, try this:
^https?://[^/]+
I'm assuming that you want to capture everything until the first / after http://? (That's what I was getting from your examples - if not, please post some more).
Are these URLs given as one input, or is each one a separate string?
Edit: It was pointed out that there were unnecessary escapes, so fixed to a more condensed version
Language independent answer:
For the whitespace: replace /^\s+/ with the empty string.
For removing the path information from the URL, if you can assume there aren't any slashes in the path (i.e. you're not dealing with http://localhost:81/foo/bar/baz), replace /\/[^\/]+$/ with the empty string. If there might be more slashes, you might try something like replacing /(^\s*.*:\/\/[^\/]+)\/.*/ with $1.
A simple one: ^(https?://[^/]+)
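Applied to the three sample URLs from the question, that pattern could be used roughly like this. The class and method names are made up, and returning null for input that does not start with http:// or https:// is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseUrl {
    // Scheme plus everything up to, but not including, the first '/' after it.
    private static final Pattern BASE = Pattern.compile("^(https?://[^/]+)");

    // Returns e.g. "http://localhost:81", or null when the input does not match.
    static String baseOf(String url) {
        Matcher m = BASE.matcher(url.trim());
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(baseOf("http://localhost:81/example"));
        System.out.println(baseOf("https://test.com/test"));
        System.out.println(baseOf("http://test.com/"));
    }
}
```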
