Regular expression in java that encloses some url - java

I have this problem:
I need a regular expression that handles these URLs:
http://www.amazon.it/TP-LINK-TL-WR841N-Wireless-300Mbps-Ethernet/dp/B001FWYGJS?ie=UTF8&redirect=true&ref_=s9_simh_gw_p147_d0_i2
http://www.amazon.it/gp/product/B014KMQWU0/
http://www.amazon.it/gp/product/glance/B014KMQWU0/
I need a regular expression that matches the full URL up to and including the ASIN of the product (an ASIN is a 10-character identifier made of uppercase letters and digits).
I wrote this regex, but it does not do what I want:
String regex="http:\\/\\/(?:www\\.|)amazon\\.com\\/(?:gp\\ product|| gp\\ product\\ glance || [^\\/]+\\/dp|dp)\\/([^\\/]{10})";
Pattern pattern=Pattern.compile(regex);
Matcher urlAmazonMatcher = pattern.matcher(url);
while (urlAmazonMatcher.find()) {
    System.out.println("PROVA "+urlAmazonMatcher.group(0));
}

This is my solution. Finally it works :D
String regex="(http|www\\.)amazon\\.(com|it|uk|fr|de)\\/(?:gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp)\\/([^\\/]{10})";
Pattern pattern=Pattern.compile(regex);
Matcher urlAmazonMatcher = pattern.matcher(url);
String toReturn = null;
while (urlAmazonMatcher.find()) {
    toReturn=urlAmazonMatcher.group(0);
}
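A note on the solution above: because the pattern starts with the alternation (http|www\\.), find() appears to begin the match at "www." for these inputs, so group(0) would not include "http://". Here is a minimal sketch of a variant that keeps the scheme. The class name, the method name, the list of domains and the [A-Z0-9]{10} shape for the ASIN are my assumptions, not something taken from the original post:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsinUrlPrefix {
    // Assumed shape: scheme, optional "www.", a small assumed set of domains,
    // one of the path layouts from the question, then a 10-character ASIN.
    private static final Pattern ASIN_URL = Pattern.compile(
            "https?://(?:www\\.)?amazon\\.(?:com|it|co\\.uk|fr|de)"
            + "/(?:[^/]+/dp|dp|gp/product(?:/glance)?)"
            + "/([A-Z0-9]{10})");

    /** Returns the URL up to and including the ASIN, or null if nothing matches. */
    static String prefixUpToAsin(String url) {
        Matcher m = ASIN_URL.matcher(url);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.amazon.it/TP-LINK-TL-WR841N-Wireless-300Mbps-Ethernet/dp/B001FWYGJS?ie=UTF8&redirect=true&ref_=s9_simh_gw_p147_d0_i2",
            "http://www.amazon.it/gp/product/B014KMQWU0/",
            "http://www.amazon.it/gp/product/glance/B014KMQWU0/"
        };
        for (String url : urls) {
            System.out.println(prefixUpToAsin(url));
        }
    }
}
```

If you only need the ASIN itself, group(1) holds it.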

How about
/[^/?]{10}(/$|\?)
This matches 10 characters that are neither / nor ? following a slash if those characters are followed by a final slash or a question mark.
You can get the part that precedes or follows the ASIN using one of the various Matcher functions.
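To make the last sentence concrete, here is a rough sketch that wraps the 10 characters in a capturing group and uses Matcher.end(int) to cut the URL right after the ASIN. The class and method names are made up for illustration. Note that, like the pattern above, it does not match a URL that ends with the ASIN without a trailing slash or query string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AsinBySegment {
    // Same idea as the pattern above, with a capturing group around the
    // ten characters so that their position can be queried.
    private static final Pattern ASIN_SEGMENT =
            Pattern.compile("/([^/?]{10})(?:/$|\\?)");

    /** Everything from the start of the URL up to the end of the ASIN. */
    static String prefixUpToAsin(String url) {
        Matcher m = ASIN_SEGMENT.matcher(url);
        return m.find() ? url.substring(0, m.end(1)) : null;
    }

    /** Only the ASIN, i.e. the content of group 1. */
    static String asin(String url) {
        Matcher m = ASIN_SEGMENT.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.amazon.it/gp/product/glance/B014KMQWU0/";
        System.out.println(prefixUpToAsin(url));
        System.out.println(asin(url));
    }
}
```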

Here is my work from a previous project that was to extract URLs from text:
private Pattern getUriPattern() {
    if (uriPattern == null) {
        // taken from http://labs.apache.org/webarch/uri/rfc/rfc3986.html
        // TODO implement the full URI syntax
        // gen-delims: ":" / "/" / "?" / "#" / "[" / "]" / "@"
        String genDelims = "\\:\\/\\?\\#\\[\\]\\@";
        // sub-delims: "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
        String subDelims = "\\!\\$\\&\\'\\(\\)\\*\\+\\,\\;\\=";
        String reserved = genDelims + subDelims;
        String unreserved = "\\w\\-\\.\\~"; // i.e. ALPHA / DIGIT / "-" / "." / "_" / "~"
        String allowed = reserved + unreserved;
        // ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
        uriPattern = Pattern.compile("((?:[^\\:/\\?\\#]+:)?//[" + allowed + "&&[^\\?\\#]]*(?:\\?([" + allowed + "&&[^\\#]]*))?(?:\\#[" + allowed + "]*)?).*");
    }
    return uriPattern;
}
You can use the above method as follows:
Matcher uriMatcher = getUriPattern().matcher(text);
if (uriMatcher.matches()) {
    String candidateUriString = uriMatcher.group(1);
    try {
        new URI(candidateUriString); // check once again if you matched a URL
        // your code here
    } catch (Exception e) {
        // error handling
    }
}
This will catch the whole URL, including the query parameters. You can then split it at the first occurrence of '?' (if any) and take the first part. Of course, you can rework the regex too.
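The split at the first '?' could look roughly like this. It is only a sketch, and the class and method names are placeholders of mine:

```java
final class QueryStripper {
    /** Returns the part of the URL before the first '?', or the whole URL if there is none. */
    static String withoutQuery(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    public static void main(String[] args) {
        System.out.println(withoutQuery(
                "http://www.amazon.it/TP-LINK-TL-WR841N-Wireless-300Mbps-Ethernet/dp/B001FWYGJS?ie=UTF8&redirect=true"));
    }
}
```

Using indexOf and substring avoids a second regex pass for what is a plain character search.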

Related

SwiftMessage Regular expression

I have the below message:
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}
I want it converted as shown below, with spaces inserted in block 4, which should become:
{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA -}
so that the whole message reads:
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA -}{5:{CHK:1DBBF1D81EE1}{TNG:}}
I tried to extract the blocks using groups and then apply a regular expression, but I was unsuccessful and cannot find the error I am making.
public static void StringReplace() {
    String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
    Pattern pat = Pattern.compile("(({1:\\w+})({2:\\w+})({4::\\d+:\\w+:\\d+.:\\w+:\\d+.:\\d+\\w+,\\d:\\d+.:\\w+,\\d+:\\d+.:\\w+:\\d+:\\w+:\\d+.:\\w+-})({5:{\\w+:.\\w+}{\\w+.}}))");
    Matcher m = pat.matcher(data);
    if(m.matches()) {
        System.out.println(m.group(0));
    }
}
Thanks in advance
You have only matched the string and printed it; you haven't added any logic for inserting the spaces. You need to add that logic for block 4.
Looking at the expected output of your block 4, you can first catch the block 4 using this regex,
(.*?)(\\{4.*?\\})(.*?)
and then replace each colon with a space followed by a colon ( :) in the content of group 2, which is what you call block 4. I see that you are not inserting a space before every colon, only before a colon that is followed by 2-3 characters and another colon. I have implemented that logic in the replaceAll() call.
Here is the modified Java code:
public static void StringReplace() {
    String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
    Pattern pat = Pattern.compile("(.*)(\\{4.*?\\})(.*)");
    Matcher m = pat.matcher(data);
    if (m.find()) {
        String g1 = m.group(1);
        String g2 = m.group(2).replaceAll(":(?=\\w{2,3}:)", " :");
        String g3 = m.group(3);
        System.out.println(g1 + g2 + g3);
    } else {
        System.out.println("Didn't match");
    }
}
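This prints the following output, which matches what you expect except that no space is inserted before the final '-' (the replacement only targets colons):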
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}
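If the space before the closing '-' is also needed, one possible variant is to chain a second replacement on block 4. This is a sketch under the assumption that block 4 always ends in "-}", as it does in the sample message. The class and method names are invented for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SwiftBlock4Spacer {
    // Block 4 starts with "{4:" and runs to the first closing brace.
    private static final Pattern BLOCK4 = Pattern.compile("\\{4:.*?\\}");

    static String addSpaces(String message) {
        Matcher m = BLOCK4.matcher(message);
        if (!m.find()) {
            return message; // no block 4: leave the message untouched
        }
        String block = m.group()
                .replaceAll(":(?=\\w{2,3}:)", " :") // space before each ":tag:"
                .replaceAll("-\\}$", " -}");        // space before the closing "-"
        return message.substring(0, m.start()) + block + message.substring(m.end());
    }

    public static void main(String[] args) {
        String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}"
                + "{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}"
                + "{5:{CHK:1DBBF1D81EE1}{TNG:}}";
        System.out.println(addSpaces(data));
    }
}
```

Working with m.start() and m.end() also avoids the two extra (.*) groups around the block.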

Regex for finding mp4 in string

I want to get all .mp4 URLs of this String using Regex.
Also I want to know how to get only the last .mp4 URL using Regex.
Thanks
contentType=application/x-mpegURL, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8},
Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd},
Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4},
Variant{bitrate=832000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4},
Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]";
Regex:
https?.*?\.mp4
Literal http
Followed by an optional 's': s?
Remove the question mark if they will all use HTTPS.
Followed by as few characters as possible: .*?
Followed by an mp4 extension (literal dot) \.mp4
2 Approaches:
If you're sure the URLs will always begin with https:// and that no further mp4 appears on the line after the URL ends, then you can use
pattern = "https://.*mp4";
String[] arr = {
    "contentType=application/x-mpegURL, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8}",
    "Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd}",
    "Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4}",
    "Variant{bitrate=832000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4}",
    "Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]"
};
String pattern = "https://.*mp4";
Pattern r = Pattern.compile(pattern);
for (String line : arr) {
    Matcher m = r.matcher(line);
    if (m.find()) {
        System.out.println(m.group(0));
    } else {
        System.out.println("NO MATCH");
    }
}
Otherwise, to support all types of URLs, change your pattern to the one defined here, with a small modification:
String pattern =
    "(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
    "(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
    "|mil|biz|info|mobi|name|aero|jobs|museum" +
    "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
    "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
    "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
    "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
    "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
    "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
    "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b"+"mp4";
Output:
NO MATCH
NO MATCH
https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4
https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4
https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4
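For the second part of the question (getting only the last .mp4 URL), one option is to collect every match with find() and take the last element. The sketch below assumes that a URL never contains whitespace, a comma or a closing brace, which holds for the sample but is my assumption. The class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Mp4Urls {
    // The character class keeps a match inside a single URL, so it cannot
    // start at an .m3u8 link and run on to a later ".mp4".
    private static final Pattern MP4_URL =
            Pattern.compile("https?://[^\\s,}]+\\.mp4");

    /** All .mp4 URLs in the text, in order of appearance. */
    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = MP4_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    /** The last .mp4 URL, or null if the text contains none. */
    static String last(String text) {
        List<String> urls = findAll(text);
        return urls.isEmpty() ? null : urls.get(urls.size() - 1);
    }

    public static void main(String[] args) {
        String text = "url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8}, "
                + "Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4}, "
                + "Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]";
        System.out.println(findAll(text));
        System.out.println(last(text));
    }
}
```

This works on the whole string at once, so there is no need to split it into lines first.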

How to split a long string in Java?

How to edit this string and split it into two?
String asd = {RepositoryName: CodeCommitTest,RepositoryId: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef};
I want to make two strings.
String reponame;
String RepoID;
reponame should be CodeCommitTest
repoID should be 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef
Can someone help me get it? Thanks
Here is Java code using a regular expression in case you can't use a JSON parsing library (which is what you probably should be using):
String pattern = "^\\{RepositoryName:\\s(.*?),RepositoryId:\\s(.*?)\\}$";
String asd = "{RepositoryName: CodeCommitTest,RepositoryId: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef}";
String reponame = "";
String repoID = "";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(asd);
if (m.find()) {
    reponame = m.group(1);
    repoID = m.group(2);
    System.out.println("Found reponame: " + reponame + " with repoID: " + repoID);
} else {
    System.out.println("NO MATCH");
}
This code has been tested in IntelliJ and runs without error.
Output:
Found reponame: CodeCommitTest with repoID: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef
Assuming there aren't quote marks in the input, and that the repository name and ID consist of letters, numbers, and dashes, then this should work to get the repository name:
Pattern repoNamePattern = Pattern.compile("RepositoryName: *([A-Za-z0-9\\-]+)");
Matcher matcher = repoNamePattern.matcher(asd);
if (matcher.find()) {
    reponame = matcher.group(1);
}
and you can do something similar to get the ID. The above code just looks for RepositoryName:, possibly followed by spaces, followed by one or more letters, digits, or hyphen characters; then the group(1) method extracts the name, since it's the first (and only) group enclosed in () in the pattern.
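For completeness, the "something similar" for the ID might look like the sketch below. It keeps the same assumption about the allowed characters. The class and the helper method firstGroup are names I made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepoFields {
    private static final Pattern NAME =
            Pattern.compile("RepositoryName: *([A-Za-z0-9\\-]+)");
    private static final Pattern ID =
            Pattern.compile("RepositoryId: *([A-Za-z0-9\\-]+)");

    // Hypothetical helper: returns group 1 of the first match, or null.
    private static String firstGroup(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    static String repoName(String input) {
        return firstGroup(NAME, input);
    }

    static String repoId(String input) {
        return firstGroup(ID, input);
    }

    public static void main(String[] args) {
        String asd = "{RepositoryName: CodeCommitTest,RepositoryId: 425f5fc5-18d8-4ae5-b1a8-55eb9cf72bef}";
        System.out.println(repoName(asd));
        System.out.println(repoId(asd));
    }
}
```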

Regex function rename file issue

I'm using following code to rename a file automatically:
public static String getNewNameForCopyFile(final String originalName, final boolean firstCall) {
    if (firstCall) {
        final Pattern p = Pattern.compile("(.*?)(\\..*)?");
        final Matcher m = p.matcher(originalName);
        if (m.matches()) { // group 1 is the name, group 2 is the extension
            String name = m.group(1);
            String extension = m.group(2);
            if (extension == null) {
                extension = "";
            }
            return name + "-Copy1" + extension;
        } else {
            throw new IllegalArgumentException();
        }
    } else {
        final Pattern p = Pattern.compile("(.*?)(-Copy(\\d+))?(\\..*)?");
        final Matcher m = p.matcher(originalName);
        if (m.matches()) { // group 1 is the prefix, group 3 is the number, group 4 is the suffix
            String prefix = m.group(1);
            String numberMatch = m.group(3);
            String suffix = m.group(4);
            return prefix + "-Copy" + (numberMatch == null ? 1 : (Integer.parseInt(numberMatch) + 1)) + (suffix == null ? "" : suffix);
        } else {
            throw new IllegalArgumentException();
        }
    }
}
This mostly works; only with the following file name do I have a problem, and I don't know how to adapt my code:
test.abc.txt
The renamed file becomes 'test-Copy1.abc.txt' but should be 'test.abc-Copy1.txt'.
Do you know how I can achieve this with my method?
If I understand you correctly, you want to insert a copy number before the last dot ('.') in the file name if there is any, and instead you get insertion before the first dot. This arises because you are using a reluctant quantifier for the first group, and the second group is able to match a filename tail containing any number of dots. I think you will do better with this:
final Pattern p = Pattern.compile("(.*?)(\\.[^.]*)?");
Note that if it is present, the second group starts with a dot, but cannot contain other dots.
I think what you're trying to do is find the last '.' in the file name, correct? In that case you need to use greedy matching .* (which matches as much as possible) instead of .*?
final Pattern p = Pattern.compile("(.*)(\\..*)");
You will need to handle the case with no dot separately:
if (originalName.indexOf('.') == -1)
    return originalName + "-Copy1";
// your other code
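Putting the suggestions together, here is a rough sketch of a single method that handles both the first copy and later copies by restricting the extension to the part after the last dot. It is not the original method: I dropped the firstCall flag and renamed things, so treat the names and the simplification as my assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CopyNames {
    // group 1: base name (reluctant), group 2: existing copy number (optional),
    // group 3: extension starting at the last dot (optional, contains no other dot)
    private static final Pattern NAME =
            Pattern.compile("(.*?)(?:-Copy(\\d+))?(\\.[^.]*)?");

    static String nextCopyName(String originalName) {
        Matcher m = NAME.matcher(originalName);
        if (!m.matches()) {
            throw new IllegalArgumentException(originalName);
        }
        String base = m.group(1);
        int next = m.group(2) == null ? 1 : Integer.parseInt(m.group(2)) + 1;
        String extension = m.group(3) == null ? "" : m.group(3);
        return base + "-Copy" + next + extension;
    }

    public static void main(String[] args) {
        System.out.println(nextCopyName("test.abc.txt"));
        System.out.println(nextCopyName("test.abc-Copy1.txt"));
        System.out.println(nextCopyName("README"));
    }
}
```

Because the extension group cannot contain a dot, matches() forces the reluctant first group to grow until only the last ".xyz" is left for the extension.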

How can i match particular format in input using java.util.regex in java?

INPUT
Input can be in any of the form shown below with following mandatory content TXT{Any comma separated strings in any format}
String loginURL = "http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}";
String loginURL1 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}";
String loginURL2 = "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}";
String loginURL3 = "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}";
String loginURL4 = "http://ip:port/path?username=abcd&password={PASS}";
Required Output
1. OutputURL corresponding to loginURL.
String outputURL = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL1 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL2 = "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}";
String outputURL3 = "http://ip:port/path?";
String outputURL4 = "http://ip:port/path?username=abcd&password={PASS}";
2. Deleted pattern(if any)
String deletedPatteren = TXT{UE-IP,UE-Username,UE-Password}
My Attempts
String loginURLPattern = TXT+"\\{([\\w-,]*)\\}&*";
System.out.println("1. ");
getListOfTemplates(loginURL, loginURLPattern);
System.out.println();
System.out.println("2. ");
getListOfTemplates(loginURL1, loginURLPattern);
System.out.println();
private static void getListOfTemplates(String inputSequence,String pattern){
    System.out.println("Input URL : " + inputSequence);
    Matcher templateMatcher = Pattern.compile(pattern).matcher(inputSequence);
    if (templateMatcher.find() && templateMatcher.group(1).length() > 0) {
        System.out.println(templateMatcher.group(1));
        System.out.println("OutputURL : " + templateMatcher.replaceAll(""));
    }
}
OUTPUT obtained
1.
Input URL : http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}
UE-IP,UE-Username,UE-Password}&password={PASS
OutputURL : http://ip:port/path?username=abcd&location={LOCATION}&
2.
Input URL : http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}
UE-IP,UE-Username,UE-Password
OutputURL : http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&
DRAWBACK OF ABOVE PATTERN
If I add any string containing characters such as # or % between TXT{ and }, my code breaks.
How can I achieve this using the java.util.regex library, so that the user can put any comma-separated strings inside TXT{...}?
I would recommend using Matcher.appendReplacement:
public static void main(final String[] args) throws Exception {
    final String[] loginURLs = {
        "http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}",
        "http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}",
        "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}",
        "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}",
        "http://ip:port/path?username=abcd&password={PASS}"};
    final Pattern patt = Pattern.compile("(\\?)?&?(TXT\\{[^}]++})(&)?");
    for (final String loginURL : loginURLs) {
        System.out.printf("%1$-10s %2$s%n", "Processing", loginURL);
        final StringBuffer sb = new StringBuffer();
        final Matcher matcher = patt.matcher(loginURL);
        while (matcher.find()) {
            final String found = matcher.group(2);
            System.out.printf("%1$-10s %2$s%n", "Found", found);
            if (matcher.group(1) != null && matcher.group(3) != null) {
                matcher.appendReplacement(sb, "$1");
            } else {
                matcher.appendReplacement(sb, "$3");
            }
        }
        matcher.appendTail(sb);
        System.out.printf("%1$-10s %2$s%n%n", "Processed", sb.toString());
    }
}
Output:
Processing http://ip:port/path?username=abcd&location={LOCATION}&TXT{UE-IP,UE-Username,UE-Password}&password={PASS}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}&TXT{UE-IP,UE-Username,UE-Password}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd&location={LOCATION}&password={PASS}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path?username=abcd&location={LOCATION}&password={PASS}
Processing http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}
Found TXT{UE-IP,UE-Username,UE-Password}
Processed http://ip:port/path
Processing http://ip:port/path?username=abcd&password={PASS}
Processed http://ip:port/path?username=abcd&password={PASS}
As you rightly point out, there are 3 possible cases:
"?{TEXT}&" -> "?"
"&{TEXT}&" -> "&"
"?{TEXT}" -> ""
So what we need to do is test for those cases in the regex. Here is the pattern:
(\\?)?&?(TXT\\{[^}]++})(&)?
Explanation:
(\\?)? optionally matches and captures a ?
&? optionally matches an & (without capturing it)
(TXT\\{[^}]++}) matches and captures TXT, followed by {, followed by one or more characters that are not } (possessively), followed by } (closing braces don't need to be escaped)
(&)? optionally matches and captures a &
We have 3 groups:
potentially a ?
the required text
potentially an &
Now when we find a match we need to replace with the appropriate capture of case 1..3
if (matcher.group(1) != null && matcher.group(3) != null) {
    matcher.appendReplacement(sb, "$1");
} else {
    matcher.appendReplacement(sb, "$3");
}
If groups 1 and 3 are both present:
We must be in case 1; we must replace with "?" which is in group 1 so $1.
Otherwise we are in case 2 or 3:
In case 2 we need to replace with "&" and in 3 with "".
In case 2 group 3 will hold "&", and in case 3 it does not participate in the match, so $3 contributes nothing to the replacement; we can therefore replace with $3 in both of these cases.
Here I only capture the TXT{...} part using a match group. This means that although the leading ? or & is replaced, it is not in the String found. If you only want the bit between {} then just move the parentheses.
Note that I reuse the Pattern - you can also reuse the Matcher if performance is a concern. You should always reuse the Pattern as it is (very) expensive to create. Store it in a static final if you can - it's threadsafe, matchers are not. The usual way to do it is to store the Pattern in a static final and then reuse the Matcher in the context of a method.
Also, the use of Matcher.appendReplacement is much more efficient than your current approach as it only needs to process the input once. Your approach parses the string twice.
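As a rough illustration of the reuse advice, the sketch below keeps the Pattern in a static final field and re-targets one Matcher with reset(CharSequence) inside a method. The class and method names are hypothetical, and the replacement logic is the same as in the code above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TxtTemplateStripper {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern TXT =
            Pattern.compile("(\\?)?&?(TXT\\{[^}]++\\})(&)?");

    static List<String> stripAll(List<String> urls) {
        Matcher m = TXT.matcher(""); // one Matcher, confined to this call
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            m.reset(url); // point the same Matcher at the next input
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                boolean both = m.group(1) != null && m.group(3) != null;
                m.appendReplacement(sb, both ? "$1" : "$3");
            }
            m.appendTail(sb);
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://ip:port/path?TXT{UE-IP,UE-Username,UE-Password}&username=abcd",
                "http://ip:port/path?username=abcd&password={PASS}");
        for (String result : stripAll(urls)) {
            System.out.println(result);
        }
    }
}
```

The Matcher never leaves the method, so the fact that it is not thread-safe does not matter here.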
