In Java, how to get a canonicalized URL

Say I have a space in a URL. What is the right way to convert it to %20? No 'replace' suggestions, please.
For example, if you put "http://test.com/test and test/a" into the browser window, it converts to http://test.com/test%20and%20test/a
If I use URLEncoder, even the / gets converted, which is not what I want.
Thanks,
This seems like the right way. To add to the question: what if the path also contains non-ASCII characters that I want to convert to a valid URL with UTF-8 encoding? E.g., for test.com:8080/test and test/pierlag2_carré/a?query=世界 I'd want it to be converted to test.com:8080/test%20and%20test/pierlag2_carr%C3%A9/a?query=%E4%B8%96%E7%95%8C

Try splitting into a URI with the aid of the URL class:
String sUrl = "http://test.com:8080/test and test/a?query=world";
URL url = new URL(sUrl);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String canonical = uri.toString();
System.out.println(canonical);
Output:
http://test.com:8080/test%20and%20test/a?query=world
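For the non-ASCII case raised in the comment on the question, a minimal sketch along the same lines is to call toASCIIString() instead of toString(), which additionally percent-encodes non-ASCII characters as UTF-8. The class and method names below are made up for illustration, and the "http" scheme is an assumption, since the example in the comment has none:

```java
import java.net.URI;
import java.net.URISyntaxException;

class NonAsciiUrlSketch {
    // Hypothetical helper: builds an ASCII-only URL string from
    // already-separated, unencoded components.
    static String toAsciiUrl(String scheme, String host, int port, String path, String query) {
        try {
            // The multi-argument constructor quotes illegal characters such as spaces.
            URI uri = new URI(scheme, null, host, port, path, query, null);
            // toASCIIString() also escapes non-ASCII characters as UTF-8 octets;
            // toString() would leave them as-is.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source ASCII-only: e-acute in the path,
        // and the two CJK characters from the question in the query.
        System.out.println(toAsciiUrl("http", "test.com", 8080,
                "/test and test/pierlag2_carr\u00e9/a", "query=\u4e16\u754c"));
        // prints: http://test.com:8080/test%20and%20test/pierlag2_carr%C3%A9/a?query=%E4%B8%96%E7%95%8C
    }
}
```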

The correct way to build URLs in Java is to create a URI object and fill out each part of the URL. The URI class handles the encoding rules for the distinct parts of the URL as they differ from one to the next.
URLEncoder is not what you want, despite its name, as that actually does HTML form encoding.
EDIT:
Based on your comments, you are receiving the URL as input to your application and do not control the initial generation of the URL. The real problem you are currently experiencing is that the input you are receiving, the URL, is not a valid URL. URLs / URIs cannot contain spaces per the spec (hence the %20 in the browser).
Since you have no control over the invalid input you are going to be forced to split the incoming URL into its parts:
scheme
host
path
Then you are going to have to split the path and separately encode each part to ensure that you do not inadvertently encode the / that delimits your path fragments.
Finally, you can put all of them back together in a URI object and then pass that around your application.
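The steps above can be sketched roughly as follows. This is only one possible approach: it splits the raw input with the generic component regex from RFC 3986 (Appendix B) and hands the pieces to the five-argument URI constructor, which leaves / intact in the path, so no manual per-segment encoding is done here. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LenientUrlSketch {
    // Component-splitting regex from RFC 3986, Appendix B.
    // Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static final Pattern PARTS = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Hypothetical helper: splits a not-yet-encoded URL string and
    // re-assembles it as a properly escaped URI.
    static URI canonicalize(String raw) {
        Matcher m = PARTS.matcher(raw);
        if (!m.matches()) {
            throw new IllegalArgumentException("Cannot split: " + raw);
        }
        try {
            // Each component is escaped according to its own rules.
            return new URI(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9));
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("http://test.com/test and test/a"));
        // prints: http://test.com/test%20and%20test/a
    }
}
```

This assumes the input is entirely unencoded: the multi-argument URI constructors always quote the % character, so an input that already contains %20 would come out as %2520.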

You may find this code useful for replacing blank spaces in your URL:
String myUrl = "http://test.com/test and test/a";
myUrl = myUrl.replaceAll(" ", "%20");
URI url = new URI(myUrl);
System.out.print(url.toString());

Related

How to encode only a part of URL

I have incomplete URLs which I am redirecting (I don't have the full URL), like
a.jsp?id=269101|14000
and
b.jsp?action=in&id=239394|2000&inmethod=
I wanted to encode only the pipe "|" character, so I started with the java.net.URI class, but it asks for a complete URL. So I used URLEncoder, but it encodes the entire URL.
I know I can look for | in the URL and encode it directly, but what would be the best approach?
Using String.replace():
String myUrl = "b.jsp?action=in&id=239394|2000&inmethod=";
myUrl = myUrl.replace("|","%7C");
You need to use the URLEncoder on each query parameter value that needs to be encoded.
String url = "b.jsp?action=in" +
"&id=" + URLEncoder.encode("239394|2000", StandardCharsets.UTF_8) +
"&inmethod=";
System.out.println(url); // prints: b.jsp?action=in&id=239394%7C2000&inmethod=
Using URLEncoder is the correct way to go. However, you should do the encoding before you create your full URL. Using it on your full URL will cause all special URL characters to be encoded, which is not what you want here.
Change your code to something like this:
String url = "a.jsp?id=" + URLEncoder.encode("269101|14000",StandardCharsets.UTF_8);
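When there are several parameters, the same idea can be wrapped in a small helper. The following is just a sketch; the class and method names are invented for illustration, and it relies on the URLEncoder.encode(String, Charset) overload used in the answers above (available since Java 10):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStringSketch {
    // Hypothetical helper: encodes every name and value separately,
    // then joins them with the unencoded delimiters "=" and "&".
    static String withQuery(String page, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return page + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "in");
        params.put("id", "239394|2000");
        params.put("inmethod", "");
        System.out.println(withQuery("b.jsp", params));
        // prints: b.jsp?action=in&id=239394%7C2000&inmethod=
    }
}
```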

How to convert URL toURI when there are unwise characters?

I've got a URL object whose path contains unwise characters (RFC 2396); in my case it is the "|" (pipe) character.
Now I need to safely convert that to a URI, but URL.toURI() throws an exception.
I've read the URL documentation, but this part confuses me:
The URL class does not itself encode or decode any URL components
according to the escaping mechanism defined in RFC2396. It is the
responsibility of the caller to encode any fields, which need to be
escaped prior to calling URL, and also to decode any escaped fields,
that are returned from URL. Furthermore, because URL has no knowledge
of URL escaping, it does not recognize equivalence between the encoded
or decoded form of the same URL.
So how should I do it? What is the pattern for encoding these characters during conversion? Do I need to create an encoded copy of my URL object?
OK, I came up with something like this:
URI uri = new URI(url.getProtocol(),
null /*userInfo*/,
url.getHost(),
url.getPort(),
(url.getPath()==null)?null:URLDecoder.decode(url.getPath(), "UTF-8"),
(url.getQuery()==null)?null:URLDecoder.decode(url.getQuery(), "UTF-8"),
null /*fragment*/);
It looks like it works; here is an example. Can someone confirm that this is a proper solution?
Edit: the initial solution had some problems when there was a query, so I've fixed it.
Use URL encoding?
From your example, you currently have:
URL url = new URL("http", "google.com", 8080, "/crapy|path with-unwise_characters.jpg");
Instead, I would use:
String path = "/crapy|path with-unwise_characters.jpg";
URL url = new URL("http", "google.com", 8080, URLEncoder.encode(path, "UTF-8"));
This escapes the unwise characters in the path, but note that URLEncoder applies HTML form encoding, so it also converts each / to %2F and each space to +.
HTTPClient 4 has an object for that org.apache.http.client.utils.URIBuilder:
URIBuilder builder =
new URIBuilder()
.setScheme(url.getProtocol())
.setHost(url.getHost())
.setPort(url.getPort())
.setUserInfo(url.getUserInfo())
.setPath(url.getPath())
.setQuery(url.getQuery());
URI uri = builder.build();
return uri;
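For comparison, here is a sketch that uses only java.net and passes the URL's components straight to the multi-argument URI constructor, without decoding them first. It assumes the components are not already percent-encoded, because that constructor always quotes %. The class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UrlToUriSketch {
    // Hypothetical helper: rebuilds the URL as a URI so that unwise
    // characters such as "|" and spaces are percent-encoded.
    static URI toEscapedUri(URL url) {
        try {
            return new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(),
                    url.getPath(), url.getQuery(), url.getRef());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        // The URL constructors are deprecated in recent JDKs but still work.
        URL url = new URL("http", "google.com", 8080, "/crapy|path with-unwise_characters.jpg");
        System.out.println(toEscapedUri(url));
        // prints: http://google.com:8080/crapy%7Cpath%20with-unwise_characters.jpg
    }
}
```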

Encode URL parameters using Java

I want to encode part of a URL parameter in Java
http://statebuild-dev.com/iit-title/size.xml?id=(102T OR 140T)
to
http://statebuild-dev.com/iit-title/size.xml?id=(102%20OR%20140)
I have tried using URI to encode, but it also encodes ?, which I do not want. In the URL I want to encode the part after '='
URI uri = new URI("http",
"statebuild-dev.com/iit-title", "/size.xml?id=(102 OR 140)", null);
//URL url = uri.toURL();
System.out.println(uri.toString());
System.out.println(url1);
Thank You
You want to use URLEncoder to encode each query parameter before adding it to the URL, e.g.:
String encodedValue = URLEncoder.encode("(102 OR 140)", "UTF-8");
This answer has a good discussion of encoding the various parts of a URI/URL. You're on the right track, but your specific problem is that you have the various parts of the URI wrong. You need to use the multi-part constructor that takes a scheme, authority, path, query, and fragment:
URI uri = new URI("http", "statebuild-dev.com", "/iit-title/size.xml", "id=(102 OR 140)", null);
System.out.println(uri.toString());
System.out.println(uri.toASCIIString());
You used the wrong constructor.
Try this:
URI uri = new URI("http","statebuild-dev.com", "/iit-title/size.xml", "id=(102 OR 140)", null);
See also java.net.URLEncoder
The java.net.URI class can help; in the documentation of URL you find
Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI
Check this thread.
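Putting the constructor-based answers together, a small runnable sketch might look like this. The class and method names are made up for illustration; the URL parts are the ones from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

class FiveArgUriSketch {
    // Hypothetical helper around the five-argument constructor:
    // URI(scheme, authority, path, query, fragment).
    static String build(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http", "statebuild-dev.com",
                "/iit-title/size.xml", "id=(102 OR 140)"));
        // prints: http://statebuild-dev.com/iit-title/size.xml?id=(102%20OR%20140)
    }
}
```

Note that the constructor treats & and = as legal query characters and leaves them alone, so a parameter value that itself contains one of them still has to be encoded separately (for example with URLEncoder) before it is placed in the query.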

Invoking Http Request In Java

I have a standalone, Swing-based application that allows the user to enter any URL and returns the status code.
I want to let the user enter any URL that works when the same URL is used in a browser, no matter what the URL is (e.g. parameters with special characters, JSON strings, etc.).
How can I implement that?
I tried to use the URL class, but in some cases one web site did not accept a JSON string I gave it, although it was accepted when I copied the URL into the browser.
You might want to look at the java.net.URI object. Some of the constructors will properly escape extended characters.
URI(String scheme,
String authority,
String path,
String query,
String fragment)
Even so, you'll need to carefully encode the query string to make sure the JSON doesn't spill over into other parameters if there are & characters, etc.
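As a rough illustration of that last point, the sketch below form-encodes a JSON value on its own before appending it to the URL, so that an & inside the JSON cannot be mistaken for a parameter separator. The host example.com, the parameter name filter, and the class and method names are all placeholders, not anything from the question.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class JsonParamSketch {
    // Hypothetical helper: encodes only the parameter value, then
    // appends it to an already valid base URL.
    static URI withJsonParam(String base, String name, String json) {
        String encoded = URLEncoder.encode(json, StandardCharsets.UTF_8);
        return URI.create(base + "?" + name + "=" + encoded);
    }

    public static void main(String[] args) {
        String json = "{\"a\":\"x&y\"}";
        URI uri = withJsonParam("http://example.com/status", "filter", json);
        System.out.println(uri);
        // prints: http://example.com/status?filter=%7B%22a%22%3A%22x%26y%22%7D

        // Round trip: the receiving side can recover the original JSON.
        String raw = uri.getRawQuery().substring("filter=".length());
        System.out.println(URLDecoder.decode(raw, StandardCharsets.UTF_8));
        // prints: {"a":"x&y"}
    }
}
```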
Finally, this worked for me:
String urlStr = "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4";
URL url = new URL(urlStr);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = uri.toURL();

How to encode URL to avoid special characters in Java? [duplicate]

I need Java code to encode a URL to avoid special characters such as spaces, %, &, etc.
URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".
RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).
In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.
Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
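To make the difference concrete, here is a small sketch comparing the two on the same path, plus decoding through the single-string constructor. The class and method names are invented for illustration, and test.com is simply the host used elsewhere in this thread.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class EncoderVsUriSketch {
    // HTML form encoding: "/" becomes %2F and spaces become "+".
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Path encoding via URI(scheme, host, path, fragment):
    // "/" is kept and spaces become %20.
    static String pathEncode(String host, String path) {
        try {
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decoding: parse the encoded string, then read the decoded component.
    static String decodePath(String encodedUrl) {
        return URI.create(encodedUrl).getPath();
    }

    public static void main(String[] args) {
        System.out.println(formEncode("/test and test/a"));
        // prints: %2Ftest+and+test%2Fa
        System.out.println(pathEncode("test.com", "/test and test/a"));
        // prints: http://test.com/test%20and%20test/a
        System.out.println(decodePath("http://test.com/test%20and%20test/a"));
        // prints: /test and test/a
    }
}
```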
I also spent quite some time on this issue, so here is my solution:
String urlString2Decode = "http://www.test.com/äüö/path with blanks/";
String decodedURL = URLDecoder.decode(urlString2Decode, "UTF-8");
URL url = new URL(decodedURL);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String decodedURLAsString = uri.toASCIIString();
If you don't want to do it manually, use the Apache Commons Codec library. The class you are looking for is org.apache.commons.codec.net.URLCodec:
final String url = "http://www.google.com?....";
final String urlSafe = new org.apache.commons.codec.net.URLCodec().encode(url); // encode is an instance method and throws EncoderException
Here is my solution which is pretty easy:
Instead of encoding the URL itself, I encoded the parameters that I was passing, because the parameter was user input and the user could enter any unexpected string of special characters. This worked fine for me :)
String review="User input"; /*USER INPUT AS STRING THAT WILL BE PASSED AS PARAMETER TO URL*/
try {
review = URLEncoder.encode(review,"utf-8");
review = review.replace(" " , "+");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
String URL = "www.test.com/test.php"+"?user_review="+review;
I would echo what Wyzard wrote but add that:
for query parameters, HTML encoding is often exactly what the server is expecting; outside these, it is correct that URLEncoder should not be used
the most recent URI spec is RFC 3986, so you should refer to that as a primary source
I wrote a blog post a while back about this subject: Java: safe character handling and URL building
