How to ignore encoding certain characters in a url in java? - java

I have a url that looks like this: https://123.com/screen-shot-2021-02-25-at-7.31.10%2520PM.png
Here screen-shot-2021-02-25-at-7.31.10%2520PM.png is the file name, and %25 is the encoded value of %.
This gives me a 404, so I need the % not to be encoded again. Other than calling replace() afterwards, what is the proper way to skip it when encoding a URL with Google's UrlEscapers.urlFragmentEscaper().escape() in Java?
Code for encoding:
private static String FILENAME_REGEX = ".*//?(.*)$";
private static Pattern FILENAME_PATTERN = Pattern.compile(FILENAME_REGEX);
public String sanitizedURL(@NonNull String url) throws URISyntaxException {
String contentUrl = url;
Matcher matcher = FILENAME_PATTERN.matcher(url);
if (matcher.matches()) {
String filename = matcher.group(1);
String encodedFilename = UrlEscapers.urlFragmentEscaper().escape(filename);
contentUrl = url.replace(filename, encodedFilename);
//contentUrl = contentUrl.replace("%25", "%");
}
// validate this is a good URI
URI uri = new URI(contentUrl);
return uri.toString();
}

Try URLDecoder.decode(String s, String enc), e.g.:
jshell> URLDecoder.decode("https://123.com/screen-shot-2021-02-25-at-7.31.10%2520PM.png", "UTF-8")
$1 ==> "https://123.com/screen-shot-2021-02-25-at-7.31.10%20PM.png"
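Building on the decode step above, one possible way to avoid the double encoding is to decode the file name first and only then encode it, so that an already-escaped name and a raw name end up the same. The sketch below is not the Guava escaper from the question; it uses only the JDK (Java 10 or later for the Charset overloads), and the class and method names are hypothetical. It assumes the file name contains no literal "+" (URLDecoder turns "+" into a space) and no stray "%" that is not part of an escape.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the names are illustrative, not from the question.
final class FilenameNormalizer {

    // Decode once, then encode once, so existing %XX escapes are not escaped again.
    static String normalize(String rawFilename) {
        String decoded = URLDecoder.decode(rawFilename, StandardCharsets.UTF_8);
        // URLEncoder does form encoding and writes a space as "+"; a path needs "%20".
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        // Already escaped input stays as it is.
        System.out.println(normalize("screen-shot-2021-02-25-at-7.31.10%20PM.png"));
        // Raw input gets its space escaped exactly once.
        System.out.println(normalize("screen-shot-2021-02-25-at-7.31.10 PM.png"));
        // Both lines print: screen-shot-2021-02-25-at-7.31.10%20PM.png
    }
}
```

Because the result is the same for escaped and unescaped input, the commented-out replace("%25", "%") workaround from the question would not be needed under these assumptions.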

Related

ISO-8859-1 to UTF-8 only in URL, only invalid characters

Problem: sometimes we get links or phrases with an encoding that is invalid for us.
Examples and my first solution are below.
Description:
I have to fix invalidly encoded strings in one part of the application. Sometimes it is a word or phrase, but sometimes it is a URL. When it is a URL, I would like to change only the wrongly encoded characters. If I decode with ISO-8859-1 and encode to UTF-8, the URL's special characters (/ : ? = &) are encoded as well. I coded a solution that works fine for my cases, but the hashes you will see below smell bad to me.
Have you had a similar problem, or do you know a library that can decode a phrase while skipping some characters? Something like this:
decode(String value, char[] ignored)
I also thought about breaking the URL into pieces and fixing only the path and query, but parsing them would make it even messier.
TL;DR: Decode an ISO-8859-1 encoded URL and encode it to UTF-8. Don't touch the URL-specific characters (/ ? = : &).
Input/Output examples:
// wrong input
"http://some.url/xxx/a/%e4t%fcr%E4/b/%e4t%fcr%E4"
"t%E9l%E9phone"
// good output
"http://some.url/xxx/a/%C3%A4t%C3%BCr%C3%A4/b/%C3%A4t%C3%BCr%C3%A4"
"t%C3%A9l%C3%A9phone"
// very wrong output
"http%3A%2F%2Fsome.url%2Fxxx%2Fa%2F%C3%A4t%C3%BCr%C3%A4%2Fb%2F%C3%A4t%C3%BCr%C3%A4"
My first solution:
class EncodingFixer {
private static final String SLASH_HASH = UUID.randomUUID().toString();
private static final String QUESTION_HASH = UUID.randomUUID().toString();
private static final String EQUALS_HASH = UUID.randomUUID().toString();
private static final String AND_HASH = UUID.randomUUID().toString();
private static final String COLON_HASH = UUID.randomUUID().toString();
EncodingFixer() {
}
String fix(String value) {
if (isBlank(value)) {
return value;
}
return tryFix(value);
}
private String tryFix(String str) {
try {
String replaced = replaceWithHashes(str);
String fixed = java.net.URLEncoder.encode(java.net.URLDecoder.decode(replaced, ISO_8859_1), UTF_8);
return replaceBack(fixed);
} catch (Exception e) {
return str;
}
}
private String replaceWithHashes(String str) {
return str
.replaceAll("/", SLASH_HASH)
.replaceAll("\\?", QUESTION_HASH)
.replaceAll("=", EQUALS_HASH)
.replaceAll("&", AND_HASH)
.replaceAll(":", COLON_HASH);
}
private String replaceBack(String fixed) {
return fixed
.replaceAll(SLASH_HASH, "/")
.replaceAll(QUESTION_HASH, "?")
.replaceAll(EQUALS_HASH, "=")
.replaceAll(AND_HASH, "&")
.replaceAll(COLON_HASH, ":");
}
}
Or it should be more like: ???
Check if input is an URL
Create URL
Get path
Split by /
Fix every part
Put it back together
Same for query but little more complicated
??
I also thought about this, but it seems even messier than the replaceAll calls above :/
If you can clearly recognize that a string is a URL, then, following @jschnasse's answer to a similar question on SO, this might be the solution you need:
URL url= new URL("http://some.url/xxx/a/%e4t%fcr%E4/b/%e4t%fcr%E4");
URI uri = new URI(url.getProtocol(), url.getUserInfo(), IDN.toASCII(url.getHost()), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String correctEncodedURL=uri.toASCIIString();
System.out.println(correctEncodedURL);
outputs:
http://some.url/xxx/a/%25e4t%25fcr%25E4/b/%25e4t%25fcr%25E4
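Note that the output above contains %25e4 rather than %C3%A4: the multi-argument URI constructors always quote the "%" character, so existing escapes are escaped again. As a possible alternative, here is a minimal sketch that rewrites only the %XX escapes and leaves every other character (including / ? = : &) alone. The class and method names are hypothetical. It assumes every escape in the input is an ISO-8859-1 byte; escapes that are already UTF-8 would be converted a second time.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: re-encodes ISO-8859-1 percent-escapes as UTF-8 percent-escapes.
final class Latin1EscapeFixer {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    static String fix(String value) {
        Matcher m = ESCAPE.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // In ISO-8859-1, byte value N is exactly code point N.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            StringBuilder escaped = new StringBuilder();
            for (byte b : String.valueOf(decoded).getBytes(StandardCharsets.UTF_8)) {
                escaped.append('%').append(String.format("%02X", b & 0xFF));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(escaped.toString()));
        }
        // Everything outside the escapes is copied through unchanged.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fix("http://some.url/xxx/a/%e4t%fcr%E4/b/%e4t%fcr%E4"));
        // prints http://some.url/xxx/a/%C3%A4t%C3%BCr%C3%A4/b/%C3%A4t%C3%BCr%C3%A4
        System.out.println(fix("t%E9l%E9phone"));
        // prints t%C3%A9l%C3%A9phone
    }
}
```

Since the URL delimiters are never decoded or encoded, no placeholder hashes are needed with this approach.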

java regex to retrieve link from text

I have a input String as:
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
I want to convert this text to:
Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it
So here:
1) I want to replace the link tag with the plain link. If the tag contains a label, it should go in parentheses after the URL.
2) If the URL is relative, I want to prefix the base URL (http://www.google.com).
3) I want to append a parameter to the URL. (&myParam=pqr)
I am having issues retrieving the tag with URL and label, and replacing it.
I wrote something like:
public static void main(String[] args) {
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
text = text.replaceAll("&lt;", "<");
text = text.replaceAll("&gt;", ">");
text = text.replaceAll("&amp;", "&");
// this is not working
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
if (m.find()) {
url = m.group(1);
}
}
// helper method to append new query params once I have the url
public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
URI oldUri = new URI(uriToUpdate);
String newQueryParams = oldUri.getQuery();
if (newQueryParams == null) {
newQueryParams = queryParamsToAppend;
} else {
newQueryParams += "&" + queryParamsToAppend;
}
URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
oldUri.getPath(), newQueryParams, oldUri.getFragment());
return newUri;
}
Edit1:
Pattern p = Pattern.compile("HREF=\"(.*?)\"");
This works, but I want it to be case-insensitive: Href, HRef, href, hrEF, etc. should all work.
Also, how do I handle text that has several URLs?
Edit2:
Some progress.
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
url = m.group(1);
System.out.println(url);
}
This handles the case of multiple URLs.
The last pending issue is how to get hold of the label and replace the link tags in the original text with the URL and label.
Edit3:
By multiple URL cases, I mean there are multiple URLs present in the given text.
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
String url = null;
while (m.find()) {
url = m.group(1); // this variable should contain the link URL
url = appendBaseURI(url);
url = appendQueryParams(url, "license=ABCXYZ");
System.out.println(url);
}
public static void main(String args[]) throws URISyntaxException {
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
text = StringEscapeUtils.unescapeHtml4(text);
// group(1) is the href value, group(2) is the link label
Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(text);
while (m.find()) {
text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
}
System.out.println(text);
}
private static String cleanUrlPart(String url, String label) throws URISyntaxException {
if (!url.startsWith("http") && !url.startsWith("www")) {
if (url.startsWith("/")) {
url = "http://www.google.com" + url;
} else {
url = "http://www.google.com/" + url;
}
}
url = appendQueryParams(url, "myParam=pqr").toString();
if (label != null && !label.isEmpty()) url += " (" + label + ")";
return url;
}
Output
Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text
You can use Apache Commons Text's StringEscapeUtils to decode the HTML entities and then replaceAll, i.e.:
import org.apache.commons.text.StringEscapeUtils;
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it";
String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
System.out.print(output);
// Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it
// this is not working
Because your regex is case-sensitive.
Try:
Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Edit1:
To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and m.group(0).
Edit2:
To replace the tag (including the label) with your final string, use:
text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
Almost there:
public static void main(String[] args) throws URISyntaxException {
String text = "Some content which contains link as <A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A> and some text after it and another link <A HREF=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</A> and some more text";
text = StringEscapeUtils.unescapeHtml4(text);
System.out.println(text);
System.out.println("**************************************");
Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
Matcher matcherTag = patternTag.matcher(text);
while (matcherTag.find()) {
String href = matcherTag.group(1); // href
String linkText = matcherTag.group(2); // link text
System.out.println("Href: " + href);
System.out.println("Label: " + linkText);
Matcher matcherLink = patternLink.matcher(href);
String finalText = null;
while (matcherLink.find()) {
String link = matcherLink.group(1);
System.out.println("Link: " + link);
finalText = getFinalText(link, linkText);
break;
}
System.out.println("***************************************");
// replacing logic goes here
}
System.out.println(text);
}
public static String getFinalText(String link, String label) throws URISyntaxException {
link = appendBaseURI(link);
link = appendQueryParams(link, "myParam=ABCXYZ");
return link + " (" + label + ")";
}
public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
URI oldUri = new URI(uriToUpdate);
String newQueryParams = oldUri.getQuery();
if (newQueryParams == null) {
newQueryParams = queryParamsToAppend;
} else {
newQueryParams += "&" + queryParamsToAppend;
}
URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
oldUri.getPath(), newQueryParams, oldUri.getFragment());
return newUri.toString();
}
public static String appendBaseURI(String url) {
String baseURI = "http://www.google.com/";
if (url.startsWith("/")) {
url = url.substring(1, url.length());
}
if (url.startsWith(baseURI)) {
return url;
} else {
return baseURI + url;
}
}
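The "replacing logic goes here" step in the last snippet can be filled in with Matcher.appendReplacement, which rebuilds the text while visiting each match. The following is only a sketch using the JDK regex classes: the class name, method name, and parameters are hypothetical, it assumes href is the first attribute of the tag, it only prefixes URLs that start with "/", and it appends the parameter by plain string concatenation (so it ignores URL fragments). Regular expressions are fragile for general HTML, so this suits only simple, known input like the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces <a href="...">label</a> with "url (label)".
final class LinkFlattener {

    // group(1) = href value, group(2) = label; case-insensitive so HREF, Href, etc. match.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"(.*?)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String flatten(String text, String baseUrl, String extraParam) {
        Matcher m = ANCHOR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            String label = m.group(2);
            if (url.startsWith("/")) {
                url = baseUrl + url; // relative link: prefix the base URL
            }
            url += (url.contains("?") ? "&" : "?") + extraParam;
            String replacement = label.isEmpty() ? url : url + " (" + label + ")";
            // quoteReplacement keeps any '$' or '\' in the URL from being treated as a group reference
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Some content which contains link as "
                + "<A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A>"
                + " and some text after it";
        System.out.println(flatten(text, "http://www.google.com", "myParam=pqr"));
    }
}
```

Unlike calling text.replace inside the loop, this rewrites each match in place, so two identical links in the text are handled independently.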

How to encode the arabic words in a Url [duplicate]

How do you encode a URL in Android?
I thought it was like this:
final String encodedURL = URLEncoder.encode(urlAsString, "UTF-8");
URL url = new URL(encodedURL);
If I do the above, the http:// in urlAsString is replaced by http%3A%2F%2F in encodedURL and then I get a java.net.MalformedURLException when I use the URL.
You don't encode the entire URL, only parts of it that come from "unreliable sources".
Java:
String query = URLEncoder.encode("apples oranges", StandardCharsets.UTF_8.name());
String url = "http://stackoverflow.com/search?q=" + query;
Kotlin:
val query: String = URLEncoder.encode("apples oranges", Charsets.UTF_8.name())
val url = "http://stackoverflow.com/search?q=$query"
Alternatively, you can use Strings.urlEncode(String str) of DroidParts that doesn't throw checked exceptions.
Or use something like
String uri = Uri.parse("http://...")
.buildUpon()
.appendQueryParameter("key", "val")
.build().toString();
Here is one more suggestion, which avoids pulling in any external libraries.
Give this a try:
String urlStr = "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4";
URL url = new URL(urlStr);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = uri.toURL();
You can see that in this particular URL, I need to have those spaces encoded so that I can use it for a request.
This takes advantage of a couple features available to you in Android classes. First, the URL class can break a url into its proper components so there is no need for you to do any string search/replace work. Secondly, this approach takes advantage of the URI class feature of properly escaping components when you construct a URI via components rather than from a single string.
The beauty of this approach is that you can take any valid url string and have it work without needing any special knowledge of it yourself.
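To make the component-based escaping described above concrete, here is a small sketch that builds the URI directly from its parts and prints the escaped form. The class and method names are hypothetical, and example.com is a placeholder host. Two caveats under this approach: toASCIIString() is what turns non-ASCII characters into UTF-8 escapes, and a "%" already present in a component is quoted again as %25, so the input should not be pre-escaped.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper showing how the multi-argument URI constructor escapes a path.
final class ComponentEncoder {

    static String encode(String scheme, String host, String path) {
        try {
            // Illegal characters such as spaces are quoted by the constructor;
            // toASCIIString() additionally escapes non-ASCII characters as UTF-8 octets.
            return new URI(scheme, host, path, null, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("http", "abc.dev.domain.com", "/0007AC/ads/800x480 15sec h.264.mp4"));
        // prints http://abc.dev.domain.com/0007AC/ads/800x480%2015sec%20h.264.mp4

        // An Arabic path segment, written with Unicode escapes to keep the source ASCII-only.
        System.out.println(encode("http", "example.com", "/\u0628\u062D\u062B"));
        // prints http://example.com/%D8%A8%D8%AD%D8%AB
    }
}
```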
For android, I would use
String android.net.Uri.encode(String s)
Encodes characters in the given string as '%'-escaped octets using the UTF-8 scheme. Leaves letters ("A-Z", "a-z"), numbers ("0-9"), and unreserved characters ("_-!.~'()*") intact. Encodes all other characters.
Example:
String urlEncoded = "http://stackoverflow.com/search?q=" + Uri.encode(query);
You can also use this:
private static final String ALLOWED_URI_CHARS = "@#&=*+-_.,:!?()/~'%";
String urlEncoded = Uri.encode(path, ALLOWED_URI_CHARS);
This is the simplest method:
try {
query = URLEncoder.encode(query, "utf-8");
} catch (UnsupportedEncodingException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
You can use the methods below:
public static String parseUrl(String surl) throws Exception
{
URL u = new URL(surl);
return new URI(u.getProtocol(), u.getAuthority(), u.getPath(), u.getQuery(), u.getRef()).toString();
}
or
public String parseURL(String url, Map<String, String> params)
{
Uri.Builder builder = Uri.parse(url).buildUpon();
for (String key : params.keySet())
{
builder.appendQueryParameter(key, params.get(key));
}
return builder.build().toString();
}
The second one is better than the first.
Find the Arabic characters and replace them with their UTF-8 encoding.
Something like this:
for (int i = 0; i < urlAsString.length(); i++) {
if (urlAsString.charAt(i) > 255) {
urlAsString = urlAsString.substring(0, i) + URLEncoder.encode(urlAsString.charAt(i)+"", "UTF-8") + urlAsString.substring(i+1);
}
}
encodedURL = urlAsString;
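As an illustration of the "encode only the untrusted part" advice with Arabic input, the sketch below encodes just the query value and leaves the rest of the URL untouched. The class and method names are hypothetical, the Arabic word is written with Unicode escapes so the source stays ASCII-only, and the Charset overload of URLEncoder.encode requires Java 10 or later (on older versions, pass "UTF-8" as a string and handle UnsupportedEncodingException).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: only the query value is encoded, never the whole URL.
final class SearchUrlBuilder {

    static String searchUrl(String base, String term) {
        return base + "?q=" + URLEncoder.encode(term, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("http://stackoverflow.com/search", "apples oranges"));
        // prints http://stackoverflow.com/search?q=apples+oranges

        System.out.println(searchUrl("http://stackoverflow.com/search", "\u062A\u0641\u0627\u062D"));
        // prints http://stackoverflow.com/search?q=%D8%AA%D9%81%D8%A7%D8%AD
    }
}
```

The scheme and slashes survive because they never pass through the encoder, which avoids the MalformedURLException from the question.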

How to concatenate string or word inside each String index

I am having trouble encoding a URL that combines non-ASCII characters and spaces. For example, http://xxx.xx.xx.xx/resources/upload/pdf/APPLE ははは.pdf. I've read here that you need to encode only the last part of the URL's path.
Here's the code:
public static String getLastPathFromUrl(String url) {
return url.replaceFirst(".*/([^/?]+).*", "$1");
}
So now I already have APPLE ははは.pdf. The next step is to replace the spaces with %20 so that the link works, BUT the problem is that if I encode APPLE%20ははは.pdf it becomes APPLE%2520%E3%81%AF%E3%81%AF%E3%81%AF.pdf. I should get APPLE%20%E3%81%AF%E3%81%AF%E3%81%AF.pdf.
So I decided to:
1. Separate each word from the link
2. Encode it
3. Concatenate the new encoded words, for example:
3.A. APPLE (APPLE)
3.B. %E3%81%AF%E3%81%AF%E3%81%AF.pdf (ははは.pdf)
with the (space) converted to %20, now becomes APPLE%20%E3%81%AF%E3%81%AF%E3%81%AF.pdf
Here's my code:
public static String[] splitWords(String sentence) {
String[] words = sentence.split(" ");
return words;
}
The calling code:
String urlLastPath = getLastPathFromUrl(pdfUrl);
String[] splitWords = splitWords(urlLastPath);
for (String word : splitWords) {
String urlEncoded = URLEncoder.encode(word, "utf-8"); // stuck here
}
I now want to concatenate the encoded strings (urlEncoded) from each index to finally form something like APPLE%20%E3%81%AF%E3%81%AF%E3%81%AF.pdf. How do I do this?
The problem is that the % in %20 is itself encoded, which gives %2520. So just call URLEncoder.encode(word, "utf-8") on the name with its spaces still in place; you will get a result like APPLE+%E3%81%AF%E3%81%AF%E3%81%AF.pdf, and then replace + with %20 in the final result.
Do you want to do something like this?
// Get the whole URL as a string
String urlString = pdfUrl.toString();
// Keep everything up to and including the last slash
String result = urlString.substring(0, urlString.lastIndexOf("/") + 1);
String urlLastPath = getLastPathFromUrl(urlString);
String[] splitWords = splitWords(urlLastPath);
for (int i = 0; i < splitWords.length; i++) {
// Put back the space that split() removed, written as %20
if (i > 0) {
result += "%20";
}
// Add the encoded word to the URL
result += URLEncoder.encode(splitWords[i], "utf-8");
}
Now result holds your encoded URL as a string.
Possibly easy with org.apache.commons.io.FilenameUtils.
Split your url into baseUrl and the file name and extension.
Encode the file name and extension
Join them together
String url = "http://xxx.xx.xx.xx/resources/upload/pdf/APPLE ははは.pdf";
String baseUrl = FilenameUtils.getPath(url); // GIVES: http://xxx.xx.xx.xx/resources/upload/pdf/
String myFile = FilenameUtils.getBaseName(url)
+ "." + FilenameUtils.getExtension(url); // GIVES: APPLE ははは.pdf
String encoded = URLEncoder.encode(myFile, "UTF-8"); //GIVES: APPLE+%E3%81%AF%E3%81%AF%E3%81%AF.pdf
System.out.println(baseUrl + encoded);
Output:
http://xxx.xx.xx.xx/resources/upload/pdf/APPLE+%E3%81%AF%E3%81%AF%E3%81%AF.pdf
Don't reinvent the wheel. Use URLEncoder for encoding the URL.
URLEncoder.encode(yourArgumentsHere, "utf-8");
Moreover, where do you get your URL from, so that you have to split it before encoding? You should first build the arguments (last part), then just append it onto the base URL.
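Putting the pieces from the answers above together, the split, encode, and join steps from the question can be sketched as follows. This is one possible version, not the only one: the class and method names are hypothetical, it assumes the last path segment carries no query string, and it uses the Charset overload of URLEncoder.encode (Java 10 or later). The Japanese characters are written as Unicode escapes to keep the source ASCII-only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: encodes each word of the last path segment and joins them with %20.
final class LastSegmentEncoder {

    static String encodeLastSegment(String url) {
        int slash = url.lastIndexOf('/');
        String base = url.substring(0, slash + 1);   // everything up to and including the last '/'
        String[] words = url.substring(slash + 1).split(" ");

        List<String> encodedWords = new ArrayList<>();
        for (String word : words) {
            encodedWords.add(URLEncoder.encode(word, StandardCharsets.UTF_8));
        }
        // The spaces removed by split() come back as %20, so nothing is encoded twice.
        return base + String.join("%20", encodedWords);
    }

    public static void main(String[] args) {
        String url = "http://xxx.xx.xx.xx/resources/upload/pdf/APPLE \u306F\u306F\u306F.pdf";
        System.out.println(encodeLastSegment(url));
        // prints http://xxx.xx.xx.xx/resources/upload/pdf/APPLE%20%E3%81%AF%E3%81%AF%E3%81%AF.pdf
    }
}
```

String.join does the concatenation the question asks about, so no manual index bookkeeping is needed.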
