I need to create a Java URL object from a string representation that contains a delimiter belonging to the excluded US-ASCII characters. You can find the specification here: 2.4.3. Excluded US-ASCII Characters.
For example,
http://localhost:8182/a%image.tif
or
http://localhost:8182/a#image.tif
Does anybody know a workaround?
Can't you encode the character? So # => %23 and % => %25. See more information on W3Schools
Generally, a URI can be safely constructed only by encoding the individual components before assembling them into the final URI. In this case a%image.gif is a path component and must be encoded according to the path production (section 3.3 of RFC 2396).
Use java.net.URI to create legal URI (and URLs):
URI uri = URI.create("http://localhost:8182/a%25image.gif");
System.out.println(uri.toASCIIString());
System.out.println(uri.getPath());
You should see the output of the last statement unencoded.
Technically, the second URL is not illegal: image.gif would be treated as a fragment. But if the hash character is part of the path, it must of course be encoded as well.
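As a rough sketch of the same idea, the multi-argument java.net.URI constructors quote illegal characters in each component for you, including % and # in the path. The class and method names below (PathEncodingSketch, forPath) are made up for illustration; the host and port are the ones from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PathEncodingSketch {
    // Hypothetical helper: builds http://localhost:8182<path> and lets URI
    // quote any character that is not legal in the path component.
    static URI forPath(String unencodedPath) {
        try {
            // Arguments: scheme, userInfo, host, port, path, query, fragment.
            return new URI("http", null, "localhost", 8182, unencodedPath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        URI percent = forPath("/a%image.tif");
        URI hash = forPath("/a#image.tif");
        System.out.println(percent.toASCIIString()); // http://localhost:8182/a%25image.tif
        System.out.println(hash.toASCIIString());    // http://localhost:8182/a%23image.tif
        System.out.println(hash.getPath());          // /a#image.tif (decoded again)
        // A java.net.URL can then be obtained from the already-encoded URI.
        System.out.println(hash.toURL());
    }
}
```

The point of passing the path as its own argument is that URI knows the # belongs to the path and is not the start of a fragment.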
The construct new URL(new URL(new URL("http://localhost:4567"), "abc"), "def") produces (imho incorrectly) this url: http://localhost:4567/def
While the construct new URL(new URL(new URL("http://localhost:4567"), "abc/"), "def") produces the correct (wanted by me) url: http://localhost:4567/abc/def
The difference is a trailing slash in abc constructor argument.
Is this intended behavior, or is this a bug that should be fixed in the URL class?
After all the idea is not to worry about slashes when you use some helper class for URL construction.
Quoting javadoc of new URL(URL context, String spec):
Otherwise, the path is treated as a relative path and is appended to the context path, as described in RFC2396.
See section 5 "Relative URI References" of the RFC2396 spec, specifically section 5.2 "Resolving Relative References to Absolute Form", item 6a:
All but the last segment of the base URI's path component is copied to the buffer. In other words, any characters after the last (right-most) slash character, if any, are excluded.
Explanation
On a web page, the "Base URI" is the page address, e.g. http://example.com/path/to/page.html. A relative link, e.g. <a href="page2.html">, must be interpreted as a sibling to the base URI, so page.html is removed, and page2.html is added, resulting in http://example.com/path/to/page2.html, as intended.
The Java URL class implements this logic, and that is why you get what you see, and it is entirely the way it is supposed to work.
It is by design, i.e. not a bug.
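The same resolution rule can be seen with java.net.URI.resolve, which also follows RFC 2396. This is only a small sketch: the class name and the withTrailingSlash helper are hypothetical, and the helper is deliberately naive (it assumes the base has no query or fragment).

```java
import java.net.URI;

final class RelativeResolutionSketch {
    // Resolves a relative reference against a base, as a browser would for a link.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    // Hypothetical convenience: treat the base as a "directory" by making sure
    // it ends with a slash, so the last segment is not dropped on resolution.
    static String withTrailingSlash(String base) {
        return base.endsWith("/") ? base : base + "/";
    }

    public static void main(String[] args) {
        // The last segment of the base path ("abc") is replaced.
        System.out.println(resolve("http://localhost:4567/abc", "def"));
        // http://localhost:4567/def

        // With a trailing slash the last segment is empty, so "abc/" is kept.
        System.out.println(resolve("http://localhost:4567/abc/", "def"));
        // http://localhost:4567/abc/def

        // The web page example: page.html is replaced by its sibling.
        System.out.println(resolve("http://example.com/path/to/page.html", "page2.html"));
        // http://example.com/path/to/page2.html

        System.out.println(resolve(withTrailingSlash("http://localhost:4567/abc"), "def"));
        // http://localhost:4567/abc/def
    }
}
```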
Pages with spaces in the URL don't get correctly translated:
i.e.
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
Both give a 404. Please note that in the second one "Press Releases" is encoded as "Press%20Releases".
However, the following two versions work fine, where "Press Releases" is encoded as "Press+Releases":
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article title parses fine with either plus signs or hex-encoded spaces (%20):
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces, so why this behavior?
Also, what could I use in Java to get the correctly encoded URL?
Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.
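A small sketch of that difference in Java (assuming Java 10 or later for the Charset overload of URLDecoder.decode; the class and method names are invented, and the path is shortened for readability): decoding a path leaves + alone, while decoding a form-encoded query value turns it into a space.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class PlusVersusPercent20Sketch {
    // Path decoding as done by java.net.URI: only %XX escapes are decoded.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Form decoding (application/x-www-form-urlencoded): + also means space.
    static String decodedFormValue(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodedPath("http://www.streetinsider.com/Press+Releases/9778767.html"));
        // /Press+Releases/9778767.html  (the plus stays a plus)
        System.out.println(decodedPath("http://www.streetinsider.com/Press%20Releases/9778767.html"));
        // /Press Releases/9778767.html
        System.out.println(decodedFormValue("Hello+World"));   // Hello World
        System.out.println(decodedFormValue("Hello%20World")); // Hello World
    }
}
```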
Officially, + may stand for a space only in the query string (after the ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
The more general class URI follows the specification and replaces spaces with %20:
// Arguments: scheme, authority, path, query, fragment
URI uri = new URI("http", "www.streetinsider.com",
"/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
"x=ŝi estas ĉarma", null);
String u = uri.toString();
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html?x=ŝi%20estas%20ĉarma
One sometimes encounters URI as a generalisation for File and other resources, and then has to be careful not to introduce %20 into file names.
So streetinsider probably remaps + (and, as it seems, even %20) for part of the path, so that both URLs reach the same code.
Your statement
Both + and %20 represent spaces.
is not exactly true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
RFC 1866 (the HTML 2.0 specification), paragraph 8.2.1, subparagraph 1, says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped".
Here is an example of such a string in URL where RFC-1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses (in other cases, spaces should be encoded to %20). This way of encoding form data is also given in later HTML specifications, for example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
The URL that you have provided is not form data containing key/value pairs; it's just a path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So, it is wrong to use pluses for the spaces here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
I am finding that if I access Tomcat with a URL that has a percent sign in it, e.g.
http://tester:8080/blah-1.6.0-SNAPSHOT/blah/getLoginURL/http%2F
then Tomcat seems to block the request and returns a blank response. If I remove the %gt above, the request works as expected. Is there any way to prevent this behaviour?
Edit: I thought I was using URL encoding - the above URL also causes the same failure
Correctly encode the URL:
http://tester:8080/blah-1.6.0-SNAPSHOT/blah/getLoginURL/veryr%25gt
Difference here ---------------------------------------------^^^
In a URI, the % character is special: it introduces an encoded entity. To actually put a % in a URI, you must use %25 (which is the encoded form of %). This is called URI-encoding, and it's also frequently called "percent encoding".
(Complete speculation) If the %gt was meant to be a >, that would be %3E. URI encoding is a different thing from HTML character entities.
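To check on the client side whether a string is a well-formed URI before sending it, something like the following sketch could be used. It relies on java.net.URI rejecting a % that is not followed by two hex digits; the class and method names are made up, and the path is a shortened version of the one in the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PercentValiditySketch {
    // Returns true if java.net.URI can parse the string, i.e. every % is
    // followed by two hex digits and no illegal characters are present.
    static boolean isWellFormed(String uri) {
        try {
            new URI(uri);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "%gt" is a malformed escape: 'g' and 't' are not hex digits.
        System.out.println(isWellFormed("http://tester:8080/blah/getLoginURL/veryr%gt"));   // false
        // "%25" is the escape for a literal percent sign.
        System.out.println(isWellFormed("http://tester:8080/blah/getLoginURL/veryr%25gt")); // true
        System.out.println(URI.create("http://tester:8080/blah/getLoginURL/veryr%25gt").getPath());
        // /blah/getLoginURL/veryr%gt
        // "%3E" is the escape for '>'.
        System.out.println(URI.create("http://tester:8080/a%3Eb").getPath()); // /a>b
    }
}
```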
How do you encode a path parameter (not form-url-encoded) but just a single URL that's appended in the format:
public String method(@PathParam("url") String url) {
}
There are lots of references to form URL encoding, but I want to simply encode a string as in the above.
As mentioned in the previous answer, URLEncoder can only be used for query parameters, not path parameters. This matters e.g. for spaces, which are a + in a query parameter but %20 in the path.
org.springframework.web.util.UriUtils.encodePath()
can be used. An org.apache.http.client.utils.URIBuilder would also work; its setPath escapes the path part. Pure Java works too, by using a constructor of java.net.URI.
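If the value is a whole URL that has to fit into a single path segment, note that the java.net.URI constructors will not escape the slashes inside it. A minimal hand-rolled sketch, assuming it is acceptable to percent-encode everything except the RFC 3986 unreserved characters (the class and method names are invented for illustration, not a library API):

```java
import java.nio.charset.StandardCharsets;

final class PathSegmentEncoderSketch {
    // RFC 3986 "unreserved" characters, which never need escaping.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Percent-encodes every UTF-8 byte that is not an unreserved character,
    // so '/', '%', '?', '#' and spaces inside the value are all escaped.
    static String encodePathSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (value < 0x80 && UNRESERVED.indexOf(value) >= 0) {
                out.append((char) value);
            } else {
                out.append(String.format("%%%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodePathSegment("http://example.com/a b"));
        // http%3A%2F%2Fexample.com%2Fa%20b
        System.out.println(encodePathSegment("veryr%gt")); // veryr%25gt
    }
}
```

This encodes more than strictly necessary, which is allowed; whether the server accepts an encoded slash (%2F) in a path is a separate question that depends on its configuration.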
Why would you want to *en*code it there, if anything wouldn't you want to *de*code it? In any case, you would call the standard URLEncoder.
I need Java code to encode a URL to avoid special characters such as spaces, %, &, etc.
URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".
RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).
In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.
Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
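The approach described above might look roughly like this. It is a sketch, not a complete URL builder: the class name, the build helper and the sample host, path and query are all made up, and the non-ASCII characters are written as Unicode escapes only to keep the source file ASCII.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriBuildAndDecodeSketch {
    // Builds an http URI from unencoded components; URI escapes each
    // component according to that component's own rules.
    static URI build(String host, String path, String query) {
        try {
            // Arguments: scheme, userInfo, host, port (-1 = none), path, query, fragment.
            return new URI("http", null, host, -1, path, query, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        // "\u00e4\u00fc\u00f6" is a-umlaut, u-umlaut, o-umlaut.
        URI uri = build("www.test.com", "/\u00e4\u00fc\u00f6/path with blanks/", "q=a b");
        String encoded = uri.toASCIIString();
        System.out.println(encoded);
        // http://www.test.com/%C3%A4%C3%BC%C3%B6/path%20with%20blanks/?q=a%20b

        // Decoding: parse with the single-string constructor, then use the accessors.
        URI parsed = new URI(encoded);
        System.out.println(parsed.getPath());  // decoded path, umlauts and blanks restored
        System.out.println(parsed.getQuery()); // q=a b
    }
}
```

Note that the space in the query comes out as %20 here, not +; if the server expects form encoding for query values, those values need to be form-encoded separately.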
I also spent quite some time on this issue, so here's my solution:
String urlString2Decode = "http://www.test.com/äüö/path with blanks/";
String decodedURL = URLDecoder.decode(urlString2Decode, "UTF-8");
URL url = new URL(decodedURL);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String decodedURLAsString = uri.toASCIIString();
If you don't want to do it manually, use the Apache Commons Codec library. The class you are looking for is org.apache.commons.codec.net.URLCodec:
final String url = "http://www.google.com?....";
// encode(String) is an instance method and throws EncoderException
final String urlSafe = new org.apache.commons.codec.net.URLCodec().encode(url);
Here is my solution, which is pretty easy:
Instead of encoding the URL itself, I encoded the parameters that I was passing, because the parameter was user input and the user could enter any unexpected string of special characters. This worked fine for me :)
String review = "User input"; /* user input as a string that will be passed as a parameter to the URL */
try {
    review = URLEncoder.encode(review, "utf-8");
    review = review.replace(" ", "+"); // no effect: URLEncoder has already turned spaces into +
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
String URL = "www.test.com/test.php" + "?user_review=" + review;
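For more than one parameter, the same idea can be wrapped in a small helper. This is a sketch assuming Java 10 or later (for the Charset overload of URLEncoder.encode); the class name, method name and sample value are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class QueryStringSketch {
    // Form-encodes each key and value separately and joins them with '&',
    // so user input can never introduce extra parameters or break the URL.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user_review", "great & cheap 100%");
        String url = "http://www.test.com/test.php?" + toQueryString(params);
        System.out.println(url);
        // http://www.test.com/test.php?user_review=great+%26+cheap+100%25
    }
}
```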
I would echo what Wyzard wrote but add that:
for query parameters, HTML form encoding is often exactly what the server is expecting; outside these, it is correct that URLEncoder should not be used
the most recent URI spec is RFC 3986, so you should refer to that as a primary source
I wrote a blog post a while back about this subject: Java: safe character handling and URL building