How to obtain correct URI of a local file? - java

To my great surprise, the following snippet prints false on jdk1.8.0_u144:
import java.io.File;

public class Tmp {
    public static void main(String[] args) {
        File f = new File(".");
        // Compare the string forms produced by the NIO API and the legacy File API
        boolean result = f.toPath().toUri().toString().equals(f.toURI().toString());
        System.out.println("result = " + result);
    }
}
Obviously, java.io.File#toURI and java.nio.file.Path#toUri return different representations. The question is, which of them is correct (according to RFC 8089)?

TLDR version: Both forms of the URI are correct according to RFC 8089.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Your sample code highlights the difference between the values returned by Path's toUri() and File's toURI() methods for a given file. Printing those two values on my Win10 machine showed:
path.toUri() => file:///D:/NetBeansProjects/MiscTests/./
file.toURI() => file:/D:/NetBeansProjects/MiscTests/./
The results on Linux are similar:
path.toUri() => file:///home/johndoe/IdeaProjects/TestUri/./
file.toURI() => file:/home/johndoe/IdeaProjects/TestUri/./
So the only difference is the single or triple forward slashes following "file:" in the URI.
From your link, Appendix B of RFC 8089 confirms that both forms are valid URIs:
o  A traditional file URI for a local file with an empty authority.
   This is the most common format in use today. For example:
   *  "file:///path/to/file"
o  The minimal representation of a local file with no authority field
   and an absolute path that begins with a slash "/". For example:
   *  "file:/path/to/file"
Further confirmation that both URI forms are valid is that either can be entered in a browser to display the content of the directory. However, there are a few points worth noting:
Brave was the only browser that would not accept the single slash form of the URI (as given by File.toURI().toString()).
All browsers accepted the triple slash form of the URI (as given by File.toPath().toUri().toString()).
If I entered the URI in the browser's address bar with a single slash, it was converted to a triple slash.
Strangely, both Chrome and Firefox will accept any number of slashes in the URI (e.g. file:///////////D:/NetBeansProjects/MiscTests/), and still display the directory.
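Since the two forms differ only in how the empty authority is written, comparing them as strings is fragile. A minimal sketch, assuming the goal is just to decide whether two file URIs name the same local path, is to convert each one back to a Path before comparing. The class and method names here are illustrative and not part of any API.

```java
import java.io.File;
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

final class FileUriForms {

    // Hypothetical helper: treats two file URIs as equivalent when they map to the same Path.
    static boolean sameLocalPath(URI a, URI b) {
        Path pa = Paths.get(a);
        Path pb = Paths.get(b);
        return pa.equals(pb);
    }

    public static void main(String[] args) {
        File f = new File(".");
        URI legacy = f.toURI();        // single-slash form in the output shown above
        URI nio = f.toPath().toUri();  // triple-slash form in the output shown above

        System.out.println("File.toURI()  = " + legacy);
        System.out.println("Path.toUri()  = " + nio);
        System.out.println("strings equal = " + legacy.toString().equals(nio.toString()));
        System.out.println("same path     = " + sameLocalPath(legacy, nio));
    }
}
```

Comparing Path objects sidesteps the slash-count question entirely, at the cost of only working for URIs that the default file system provider accepts.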

Related

java.net.URL bug in constructing URLs?

The construct new URL(new URL(new URL("http://localhost:4567"), "abc"), "def") produces (imho incorrectly) this url: http://localhost:4567/def
While the construct new URL(new URL(new URL("http://localhost:4567"), "abc/"), "def") produces the correct (wanted by me) url: http://localhost:4567/abc/def
The difference is a trailing slash in abc constructor argument.
Is this intended behavior, or is it a bug that should be fixed in the URL class?
After all, the idea is not to have to worry about slashes when you use a helper class for URL construction.
Quoting javadoc of new URL(URL context, String spec):
Otherwise, the path is treated as a relative path and is appended to the context path, as described in RFC2396.
See section 5 "Relative URI References" of the RFC2396 spec, specifically section 5.2 "Resolving Relative References to Absolute Form", item 6a:
All but the last segment of the base URI's path component is copied to the buffer. In other words, any characters after the last (right-most) slash character, if any, are excluded.
Explanation
On a web page, the "Base URI" is the page address, e.g. http://example.com/path/to/page.html. A relative link, e.g. <a href="page2.html">, must be interpreted as a sibling to the base URI, so page.html is removed, and page2.html is added, resulting in http://example.com/path/to/page2.html, as intended.
The Java URL class implements this logic, and that is why you get what you see, and it is entirely the way it is supposed to work.
It is by design, i.e. not a bug.
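The rule quoted above can be tried out directly. This is a small sketch that uses java.net.URI#resolve instead of the URL constructor from the question (a substitution on my part; both are documented to follow RFC 2396 section 5.2), with base URIs whose path is non-empty. The class and method names are made up for the example.

```java
import java.net.URI;

final class RelativeResolution {

    // Resolves a relative reference against a base URI using the JDK's own implementation.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // No trailing slash: "abc" is the last segment of the base path and is dropped.
        System.out.println(resolve("http://localhost:4567/abc", "def"));
        // Trailing slash: the last segment is empty, so "abc/" is kept.
        System.out.println(resolve("http://localhost:4567/abc/", "def"));
        // The web page case: page.html is replaced by its sibling page2.html.
        System.out.println(resolve("http://example.com/path/to/page.html", "page2.html"));
    }
}
```

The three lines printed are http://localhost:4567/def, http://localhost:4567/abc/def and http://example.com/path/to/page2.html.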

Java URI#resolve improperly resolves relative URLs in certain cases

I'm writing my own web crawler in Java, and I'm using URI#resolve to resolve URLs that appear on every HTML page that my crawler encounters. In certain cases, it's behaving in an unexpected way.
For example, while crawling https://hacks.mozilla.org, I notice that one of the URLs extracted is https://hacks.mozilla.orgabout/ (indeed, if you look at the HTML source for that page, you will find an <a href="about/">). I did some testing, and got these results:
URI uri1 = new URI("https://hacks.mozilla.org").resolve("about/");
System.out.println(uri1); // => https://hacks.mozilla.orgabout/
URI uri2 = new URI("https://hacks.mozilla.org/").resolve("about/");
System.out.println(uri2); // => https://hacks.mozilla.org/about/
I don't know how practical it is to attempt to mitigate this issue by manually adding the slash after the base URL, but I want to know if there is an actual non-hacky fix to this problem.
I did a little more experimentation, and realized that this happens when the path element is empty (null or 0-length string):
URI uri3 = new URI("http", null, "hacks.mozilla.org", 80, "", null, null).resolve("about/");
System.out.println(uri3); // => http://hacks.mozilla.org:80about/
The URI constructor Javadoc (http://docs.oracle.com/javase/7/docs/api/java/net/URI.html) states:
If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character ('/') or the commercial-at character ('@'), is quoted.
So giving the base URI a non-empty path, for example a single slash ("/"), will solve your problem.
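For a crawler, one way to apply this without editing strings by hand is to give the base a root path whenever its path is empty and only then resolve the link. The following is a sketch under that assumption; SafeResolve and resolveLink are hypothetical names, and the behaviour for an empty base path may differ between JDK versions, which is why the helper normalizes the base first instead of relying on it.

```java
import java.net.URI;

final class SafeResolve {

    // Hypothetical helper: gives the base a root path ("/") when its path is empty, then resolves the link.
    static URI resolveLink(URI base, String href) {
        String path = base.getRawPath();
        if (!base.isOpaque() && (path == null || path.isEmpty())) {
            // "/" is an absolute-path reference, so it replaces the empty base path outright.
            // Any query or fragment on the base is dropped, which resolution would ignore anyway.
            base = base.resolve("/");
        }
        return base.resolve(href);
    }

    public static void main(String[] args) {
        System.out.println(resolveLink(URI.create("https://hacks.mozilla.org"), "about/"));
        System.out.println(resolveLink(URI.create("https://hacks.mozilla.org/"), "about/"));
    }
}
```

Both calls print https://hacks.mozilla.org/about/, so the crawler no longer depends on whether the page address was written with a trailing slash.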

When passing ampersand in the URI for file: schemes into ProducerTemplate.sendBodyAndHeader() it fails

When using ProducerTemplate.sendBodyAndHeader() to send a file using the "file" scheme to its destination, and the file path in the URI contains ampersands, it fails to deliver the file with the following errors.
org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint:
file:///c%7C/IMM_SAN/Marketing/f77333bd-f96f-4873-b846-2f1dc5531a5a/2596/PB&J%20Generic%2007064782/transcoded/21726
due to: Failed to resolve endpoint:
file:///c%7C/IMM_SAN/Marketing/f77333bd-f96f-4873-b846-2f1dc5531a5a/25964/PB&J%20Generic%2007064782/transcoded/21726
due to: Invalid uri syntax: no ? marker however the uri has & parameter separators. Check the uri if its missing a ? marker.
I spent a few days trying the different overloads for sending the file: send(), sendBody(), sendBodyAndHeader() and even sendBodyAndHeaders().
I tried to URLEncoder.encode() the path beforehand, and of course that was a no-go.
I even debugged URISupport.normalizeUri(String uri) from the camel-core source and discovered something interesting. Apparently no amount of encoding beforehand will do me any good, because the method does its own encoding, and that encoding appears to be incorrect. I think this is a bug in sendBodyAndHeader(): it puts the ampersand back into the URI before sending it. Why is it doing that? We have an application that reads files from one department and writes them to a share, where another system automatically picks them up and delivers them once processing is finished.
As shown below, Camel's URISupport.normalizeUri(String uri) method encodes the URI, and this puts the ampersand back into the file path:
URI u = new URI(UnsafeUriCharactersEncoder.encode(uri));
So no amount of preprocessing on the file path in the URI is going to work, because sendBodyAndHeader() normalizes the URI itself. I would like to add a new overload to this API that turns off normalization and sends the URI as is, but I wanted to check here first to see whether anybody has a less drastic option. Please note that this is only a problem when ampersands appear in the URI path for file schemes.
ProducerTemplate prod = exchange.getContext().createProducerTemplate();
destPath = destPath.replace(':', '|');
destPath = destPath.replaceAll("\\\\", "/");
destPath = destPath.replaceAll("&", "%26"); // replace the ampersand
String query = "file:///" + destPath;
prod.sendBodyAndHeader(query, exchange.getIn().getBody(), Exchange.FILE_NAME, destFileName);
Use the CamelFileName header to avoid messing up the endpoint URI with the reserved character & if you really need that character in the file path.
This example would put a file into c:\a&b
public void sendAnyFile(Exchange e) {
    ProducerTemplate pt = getContext().createProducerTemplate();
    // The endpoint URI stays free of '&'; the file name travels in the CamelFileName header
    pt.sendBodyAndHeader("file:///c:/", e.getIn().getBody(String.class), "CamelFileName", "a&b/hej.txt");
}

Should URL must end with / or not inside URL class of Java?

I need to connect to a middleware server using the Java URL and URLConnection classes.
Searching the net, I found some examples where the URL ends with / (http://www.oracle.com/):
URL oracle = new URL("http://www.oracle.com/");
And some examples where the URL has no trailing /:
URL ur = new URL("http://www.mydomain.com/myfile.gif");
Could anybody please tell me whether this makes any difference and, if so, how to choose a URL value?
Not necessary.
If the URL contains no path section (only the scheme and the domain name), it may or may not have a trailing slash, i.e. http://www.oracle.com/ or http://www.oracle.com. Both should be accessible. However, the normalized form of such a URL has a trailing slash.
URL normalization is a convention that allows a URL to be written in a consistent manner. Under this convention, a trailing slash indicates that the URL refers to a directory, not a file. For example:
"http://www.oracle.com/" <- root path
"http://www.oracle.com/pages/" <- "pages" is a directory
"http://www.oracle.com/pages" <- "pages" is a file
"http://www.oracle.com/myfile.gif" <- "myfile.gif" is a file
"http://www.oracle.com/myfile.gif/" <- "myfile.gif" is a directory
However, this convention applies only to normalized URLs, and whether a URL should have a trailing slash depends entirely on the service implementation.
No, a URL does not have to end with /, but some URLs do. Whether it does or not depends on which URL you are trying to access (normally you'd use the URL you have without modification).

How to use excluded delimiters in URI

I need to create a Java URL object from a string representation that contains a delimiter which is among the excluded US-ASCII characters. You can find the specification in section 2.4.3, "Excluded US-ASCII Characters".
For example,
http://localhost:8182/a%image.tif
or
http://localhost:8182/a#image.tif
Does anybody know a workaround?
Can't you encode the character? So # => %23 and % => %25. See more information on W3Schools.
Generally, a URI can be safely constructed only by encoding the individual components before assembling them into the final URI. In this case a%image.gif is a path component and must be encoded according to the path production (section 3.3 of RFC 2396).
Use java.net.URI to create legal URI (and URLs):
URI uri = URI.create("http://localhost:8182/a%25image.gif");
System.out.println(uri.toASCIIString());
System.out.println(uri.getPath());
You should see the output of the last statement unencoded.
Technically, the second URL is not illegal; image.gif would be treated as a fragment. But if the hash character is part of the path, it must of course be encoded as well.
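To avoid encoding by hand, the multi-argument java.net.URI constructors can do the quoting per component. The sketch below assumes the host, port and file names from the question; PathEncoding and build are hypothetical names chosen for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PathEncoding {

    // Builds an http URI from unencoded components; this constructor quotes illegal path characters,
    // and always quotes '%'.
    static URI build(String host, int port, String rawPath) throws URISyntaxException {
        return new URI("http", null, host, port, rawPath, null, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI percent = build("localhost", 8182, "/a%image.tif");
        System.out.println(percent.toASCIIString()); // encoded form, '%' written as %25
        System.out.println(percent.getPath());       // decoded path, as passed in

        URI hash = build("localhost", 8182, "/a#image.tif");
        System.out.println(hash.toASCIIString());    // encoded form, '#' written as %23
        System.out.println(hash.getFragment());      // null, because '#' stayed inside the path
    }
}
```

Note that this only works when the input is genuinely unencoded: passing an already encoded path such as /a%25image.tif to this constructor would encode the percent sign a second time.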
