java.net.URL bug in constructing URLs? - java

The expression new URL(new URL(new URL("http://localhost:4567"), "abc"), "def") produces (IMHO incorrectly) this URL: http://localhost:4567/def
The expression new URL(new URL(new URL("http://localhost:4567"), "abc/"), "def"), on the other hand, produces the URL I actually want: http://localhost:4567/abc/def
The only difference is the trailing slash in the abc constructor argument.
Is this intended behavior, or is it a bug that should be fixed in the URL class?
After all, the point of using a helper class for URL construction is not having to worry about slashes.

Quoting the Javadoc of new URL(URL context, String spec):
Otherwise, the path is treated as a relative path and is appended to the context path, as described in RFC2396.
See section 5 "Relative URI References" of the RFC2396 spec, specifically section 5.2 "Resolving Relative References to Absolute Form", item 6a:
All but the last segment of the base URI's path component is copied to the buffer. In other words, any characters after the last (right-most) slash character, if any, are excluded.
Explanation
On a web page, the "Base URI" is the page address, e.g. http://example.com/path/to/page.html. A relative link, e.g. <a href="page2.html">, must be interpreted as a sibling to the base URI, so page.html is removed, and page2.html is added, resulting in http://example.com/path/to/page2.html, as intended.
The Java URL class implements this logic, and that is why you get what you see, and it is entirely the way it is supposed to work.
It is by design, i.e. not a bug.
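A minimal sketch of the rule quoted above, using the same two-argument constructor as the question (it is deprecated in recent JDKs but still available). The class name RelativeUrlDemo and the helper resolve are made up for this illustration, and the first step of the question is folded into the base string.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeUrlDemo {
    // Resolves spec against base with new URL(URL context, String spec).
    // The checked exception is wrapped so the helper is easy to call.
    static String resolve(String base, String spec) {
        try {
            return new URL(new URL(base), spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // No trailing slash: "abc" is the last segment and gets replaced.
        System.out.println(resolve("http://localhost:4567/abc", "def"));
        // prints http://localhost:4567/def

        // Trailing slash: the last segment is empty, so "abc/" is kept.
        System.out.println(resolve("http://localhost:4567/abc/", "def"));
        // prints http://localhost:4567/abc/def

        // The web page case from the explanation.
        System.out.println(resolve("http://example.com/path/to/page.html", "page2.html"));
        // prints http://example.com/path/to/page2.html
    }
}
```

If the intent is always "append a child segment", making sure the base ends with a slash before resolving is the usual way to get it.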

Related

Why URI constructor allows missing protocol (while URL does not)?

Why does URI allow missing protocol (while URL does not)?
On Wikipedia, the scheme (and even the path) seem to be obligatory components of a URI:
The URI generic syntax consists of a hierarchical sequence of five components:
URI = scheme:[//authority]path[?query][#fragment]
Or does a missing protocol default to something (like http)? I found nothing like this in the docs.
new URI("my.html"); // 1
new URI("xabc:my.html"); // 2
new URL("my.html"); // 3
new URL("xabc:my.html"); // 4
Concerning the "obligatory" path: OK, there are opaque URIs. But why is a missing protocol allowed? (It must be present even for an opaque URI, which is required to be absolute.)
I could understand that a relative URL/URI doesn't require a protocol (<img src="/images/pic.png">), but URL throws a run-time java.net.MalformedURLException: no protocol in this case too (while URI doesn't).
Your relative path must be wrong.
Java's URI supports an empty scheme for relative URIs:
relative URI, that is, a URI that does not specify a scheme. Some examples of hierarchical URIs are:
docs/guide/collections/designfaq.html#28
Scheme is optional:
[scheme:]scheme-specific-part[#fragment]
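URL is different: a spec without a protocol is only accepted together with a context URL, e.g.:
URL url = new URL(new URL("http://example.com/"), "/guidelines.txt");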
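To check the four numbered cases from the question, here is a small sketch that reports whether each class accepts a string. The class name MissingSchemeDemo and the helpers tryUri and tryUrl are invented for the example; it assumes no custom protocol handler for xabc is registered.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class MissingSchemeDemo {
    // "ok" if java.net.URI parses the string, otherwise the exception name.
    static String tryUri(String s) {
        try {
            new URI(s);
            return "ok";
        } catch (URISyntaxException e) {
            return "URISyntaxException";
        }
    }

    // "ok" if the single-argument java.net.URL constructor accepts the string.
    static String tryUrl(String s) {
        try {
            new URL(s);
            return "ok";
        } catch (MalformedURLException e) {
            return "MalformedURLException";
        }
    }

    public static void main(String[] args) {
        String[] samples = {"my.html", "xabc:my.html", "/guidelines.txt"};
        for (String s : samples) {
            System.out.println(s + " -> URI: " + tryUri(s) + ", URL: " + tryUrl(s));
        }
    }
}
```

URI accepts all three samples because it only parses syntax, and a relative reference is valid syntax. URL rejects them all: without a context it needs a protocol, and that protocol must have a handler.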

Why is this URI valid by RFC 2396 standards?

I was playing around with non-string types for an application loader I've been developing. Because of a typo, I forgot to include the protocol part of a specific URI. I expected the Java test to fail due to an invalid URI, but this statement seems to work:
URI uri = URI.create("contacts.addresses.genericAddress");
To me, there's no standard for using a dot in a scheme part, and I thought the scheme part was always required?
Does anyone know why?
I'll add my comment as an answer because I think it's correct:
The Java URI documentation says the syntax is "specified by the grammar in RFC 2396, Appendix A", and Appendix A allows a URI to be a relative path, with no host name or scheme. So "this.and.that" might just be a file name like "this.html" (dots are valid in a path segment, i.e. they are pchars).
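A quick way to confirm this reading is to inspect the parsed components. The class name RelativeReferenceDemo and the helper describe are made up for this sketch.

```java
import java.net.URI;

class RelativeReferenceDemo {
    // Summarizes how java.net.URI parsed the given string.
    static String describe(String s) {
        URI uri = URI.create(s);
        return "scheme=" + uri.getScheme()
                + ", path=" + uri.getPath()
                + ", absolute=" + uri.isAbsolute();
    }

    public static void main(String[] args) {
        System.out.println(describe("contacts.addresses.genericAddress"));
        // prints scheme=null, path=contacts.addresses.genericAddress, absolute=false
    }
}
```

If a test should reject such input, checking uri.isAbsolute() (or that getScheme() is non-null) after parsing is one option.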

Java URI#resolve improperly resolves relative URLs in certain cases

I'm writing my own web crawler in Java, and I'm using URI#resolve to resolve URLs that appear on every HTML page that my crawler encounters. In certain cases, it's behaving in an unexpected way.
For example, while crawling https://hacks.mozilla.org, I notice that one of the URLs extracted is https://hacks.mozilla.orgabout/ (indeed, if you look at the HTML source for that page, you will find an <a href="about/">). I did some testing, and got these results:
URI uri1 = new URI("https://hacks.mozilla.org").resolve("about/");
System.out.println(uri1); // => https://hacks.mozilla.orgabout/
URI uri2 = new URI("https://hacks.mozilla.org/").resolve("about/");
System.out.println(uri2); // => https://hacks.mozilla.org/about/
I don't know how practical it is to attempt to mitigate this issue by manually adding the slash after the base URL, but I want to know if there is an actual non-hacky fix to this problem.
I did a little more experimentation, and realized that this happens when the path element is empty (null or 0-length string):
URI uri3 = new URI("http", null, "hacks.mozilla.org", 80, "", null, null).resolve("about/");
System.out.println(uri3); // => http://hacks.mozilla.org:80about/
The Javadoc for the URI constructor (http://docs.oracle.com/javase/7/docs/api/java/net/URI.html) states:
If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character ('/') or the commercial-at character ('@'), is quoted.
So giving the base URI a non-empty path, such as a single slash, will solve your problem.
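One way to apply this without hand-editing strings is to normalize the base before resolving. The following is only a sketch: the class EmptyPathDemo and the helpers withRootPath and resolveAgainst are hypothetical names, and it only touches hierarchical URIs that have an authority and an empty path.

```java
import java.net.URI;
import java.net.URISyntaxException;

class EmptyPathDemo {
    // Returns base unchanged unless it has an authority and an empty path,
    // in which case the path is replaced by "/".
    static URI withRootPath(URI base) {
        String path = base.getPath();
        boolean emptyPath = path == null || path.isEmpty();
        if (base.isOpaque() || base.getAuthority() == null || !emptyPath) {
            return base;
        }
        try {
            return new URI(base.getScheme(), base.getAuthority(), "/",
                    base.getQuery(), base.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Resolves a link found on a page against the normalized page address.
    static String resolveAgainst(String base, String link) {
        return withRootPath(URI.create(base)).resolve(link).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolveAgainst("https://hacks.mozilla.org", "about/"));
        // prints https://hacks.mozilla.org/about/
        System.out.println(resolveAgainst("https://hacks.mozilla.org/", "about/"));
        // prints https://hacks.mozilla.org/about/
    }
}
```

Bases that already have a path are passed through untouched, so the helper can be applied to every page URL the crawler sees.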

Custom URL scheme as adapter on existing URL schemes

Is there a clean and spec-conformant way to define a custom URL scheme that acts as an adapter on the resource returned by another URL?
I have already defined a custom URL protocol which returns a decrypted representation of a local file. So, for instance, in my code,
decrypted-file:///path/to/file
transparently decrypts the file you would get from file:///path/to/file. However, this only works for local files. No fun! I am hoping that the URL specification allows a clean way that I could generalize this by defining a new URL scheme as a kind of adapter on existing URLs.
For example, could I instead define a custom URL scheme decrypted: that could be used as an adapter that prefixes another absolute URL that retrieved a resource? Then I could just do
decrypted:file:///path/to/file
or decrypted:http://server/path/to/file or decrypted:ftp://server/path/to/file or whatever. This would make my decrypted: protocol composable with all existing URL schemes that do file retrieval.
Java does something similar with the jar: URL scheme but from my reading of RFC 3986 it seems like this Java technology violates the URL spec. The embedded URL is not properly byte-encoded, so any /, ?, or # delimiters in the embedded URL should officially be treated as segment delimiters in the embedding URL (even if that's not what JarURLConnection does). I want to stay within the specs.
Is there a nice and correct way to do this? Or is the only option to byte-encode the entire embedded URL (i.e., decrypted:file%3A%2F%2F%2Fpath%2Fto%2Ffile, which is not so nice)?
Is what I'm suggesting (URL adapters) done anywhere else? Or is there a deeper reason why this is misguided?
There's no built-in adaptor in Cocoa, but writing your own using NSURLProtocol is pretty straightforward for most uses. Given an arbitrary URL, encoding it like so seems simplest:
myscheme:<originalurl>
For example:
myscheme:http://example.com/path
At its simplest, NSURL only actually cares if the string you pass in is a valid URI, which the above is. Yes, there is then extra URL support layered on top, based around RFC 1808 etc. but that's not essential.
All that's required to be a valid URI is a colon to indicate the scheme, and no invalid characters (basically, ASCII without spaces).
You can then use the -resourceSpecifier method to retrieve the original URL and work with that.
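The answer above is about Cocoa, but the same idea can be sketched with java.net.URI, where getRawSchemeSpecificPart plays a role comparable to -resourceSpecifier. The scheme name decrypted comes from the question; the class AdapterSchemeDemo and the helper unwrap are invented for this illustration, and no protocol handler is registered here.

```java
import java.net.URI;

class AdapterSchemeDemo {
    // Strips the hypothetical "decrypted:" adapter scheme and returns the
    // embedded URL exactly as written (no percent-decoding).
    static String unwrap(String wrapped) {
        URI outer = URI.create(wrapped);
        if (!"decrypted".equals(outer.getScheme()) || !outer.isOpaque()) {
            throw new IllegalArgumentException("not a decrypted: URI: " + wrapped);
        }
        return outer.getRawSchemeSpecificPart();
    }

    public static void main(String[] args) {
        System.out.println(unwrap("decrypted:file:///path/to/file"));
        // prints file:///path/to/file
        System.out.println(unwrap("decrypted:http://server/path/to/file"));
        // prints http://server/path/to/file
    }
}
```

Note that a # inside the embedded URL is parsed as the fragment of the outer URI and is not part of the scheme-specific part, which is exactly the delimiter concern raised in the question.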

How to use excluded delimiters in URI

I need to create a Java URL object from a string that contains one of the characters excluded from URIs. The relevant part of the specification is RFC 2396, section 2.4.3, "Excluded US-ASCII Characters".
For example,
http://localhost:8182/a%image.tif
or
http://localhost:8182/a#image.tif
Does anybody know a workaround?
Can't you encode the character? So # => %23 and % => %25. See more information on W3Schools.
Generally, a URI can be safely constructed only by encoding the individual components before assembling them into the final URI. In this case a%image.gif is a path component and must be encoded according to the path production (section 3.3 of RFC 2396).
Use java.net.URI to create legal URIs (and URLs):
URI uri = URI.create("http://localhost:8182/a%25image.gif");
System.out.println(uri.toASCIIString());
System.out.println(uri.getPath());
The last statement prints the path decoded, i.e. with a literal % character.
Technically, the second URL is not illegal: image.tif would simply be treated as a fragment. But if the hash character is meant to be part of the path, it must of course be encoded as well.
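To avoid encoding by hand, the multi-argument URI constructors can do the quoting per component. This is a sketch using the host and port from the question; the class PathEncodingDemo and the helper encode are made-up names.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathEncodingDemo {
    // Builds an http URI for localhost:8182 from an unencoded path and lets
    // the constructor quote characters that are illegal in a path.
    static String encode(String rawPath) {
        try {
            URI uri = new URI("http", null, "localhost", 8182, rawPath, null, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("/a%image.tif"));
        // prints http://localhost:8182/a%25image.tif
        System.out.println(encode("/a#image.tif"));
        // prints http://localhost:8182/a%23image.tif
    }
}
```

Because these constructors always quote the percent character, the path passed in must be the unencoded form; passing an already encoded path would encode it twice. Calling toURL() on the resulting URI gives the java.net.URL object the question asks for.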
