I'm writing my own web crawler in Java, and I'm using URI#resolve to resolve URLs that appear on every HTML page that my crawler encounters. In certain cases, it's behaving in an unexpected way.
For example, while crawling https://hacks.mozilla.org, I notice that one of the URLs extracted is https://hacks.mozilla.orgabout/ (indeed, if you look at the HTML source for that page, you will find an <a href="about/">). I did some testing, and got these results:
URI uri1 = new URI("https://hacks.mozilla.org").resolve("about/");
System.out.println(uri1); // => https://hacks.mozilla.orgabout/
URI uri2 = new URI("https://hacks.mozilla.org/").resolve("about/");
System.out.println(uri2); // => https://hacks.mozilla.org/about/
I don't know how practical it is to attempt to mitigate this issue by manually adding the slash after the base URL, but I want to know if there is an actual non-hacky fix to this problem.
I did a little more experimentation, and realized that this happens when the path element is empty (null or 0-length string):
URI uri3 = new URI("http", null, "hacks.mozilla.org", 80, "", null, null).resolve("about/");
System.out.println(uri3); // => http://hacks.mozilla.org:80about/
The URI constructor Javadoc (from http://docs.oracle.com/javase/7/docs/api/java/net/URI.html) states:
If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character ('/') or the commercial-at character ('@'), is quoted.
So passing a non-empty path built from those accepted characters, such as "/", for this parameter will solve your problem.
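As a minimal sketch of that suggestion, the base URI can be given the root path "/" before resolving, whenever it has an authority but an empty path. The class name BaseUriFixer and the helper withRootPath are hypothetical names used only for illustration; they are not part of java.net.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class BaseUriFixer {
    // Hypothetical helper: if the base URI has an authority but an empty path,
    // rebuild it with the root path "/" so that resolve() has a path to work with.
    static URI withRootPath(URI base) {
        String path = base.getPath();
        if (base.getAuthority() != null && (path == null || path.isEmpty())) {
            try {
                return new URI(base.getScheme(), base.getAuthority(), "/",
                               base.getQuery(), base.getFragment());
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return base; // already has a path, leave it untouched
    }

    public static void main(String[] args) {
        URI base = withRootPath(URI.create("https://hacks.mozilla.org"));
        System.out.println(base.resolve("about/")); // https://hacks.mozilla.org/about/
    }
}
```

Base URIs that already carry a path are returned unchanged, so the helper can be applied to every page URL the crawler visits.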
I am trying to create a link that opens the New issue page on GitHub and pre-fills it with what is already known about the problem.
To do so, I am using query parameters as follows:
https://github.com/User/Repository/issues/new?title=Some text&body=More Text
That works fine. However, I am trying to format the body using Markdown, and symbols are being escaped after creating a new URL by calling
URL url = new URL("https://github.com/User/Repository/issues/new?title=Some text&body=# Header # Another header");
The result will be this:
https://github.com/User/Repository/issues/new?title=Some text&body=# Header %23 Another header
The second # is being escaped, but the first isn't, and I don't quite understand why.
Any ideas?
In short, the URL parser is treating your first # as a fragment (a.k.a. anchor, e.g. <a name="named-anchor">). Since according to RFC-3986: Section 3, the fragment must come last and # is a reserved character, anything after that first # is assumed to be part of that fragment, causing the parser to encode any further "invalid" characters, such as your second #. From the RFC:
The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
Note that fragment comes last, and is delimited by the #.
The best way to handle this would be to:
encode the body query parameter on your own or
use an HTTP client that does the escaping for you, e.g. RestTemplate from Spring or Apache HttpComponents.
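The first option can be sketched with the JDK alone. This assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload; the class name IssueLinkBuilder and the method newIssueUrl are hypothetical. Note that URLEncoder applies form encoding, so spaces become "+" rather than "%20".

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class IssueLinkBuilder {
    // Hypothetical helper: encodes each query value separately, then assembles the link.
    // The "?" and "&" separators are added afterwards, so they stay unescaped.
    static String newIssueUrl(String title, String body) {
        return "https://github.com/User/Repository/issues/new"
                + "?title=" + URLEncoder.encode(title, StandardCharsets.UTF_8)
                + "&body=" + URLEncoder.encode(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(newIssueUrl("Some text", "# Header # Another header"));
        // https://github.com/User/Repository/issues/new?title=Some+text&body=%23+Header+%23+Another+header
    }
}
```

Because every # in the body is now %23, the parser no longer sees a fragment delimiter anywhere in the link.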
To my great surprise, the following snippet prints false on jdk1.8.0_u144
import java.io.File;

public class Tmp {
    public static void main(String[] args) {
        File f = new File(".");
        // Compare the string forms of the two URI conversions
        boolean result = f.toPath().toUri().toString().equals(f.toURI().toString());
        System.out.println("result = " + result);
    }
}
Obviously, java.io.File#toURI and java.nio.file.Path#toUri return different representations. The question is, which of them is correct (according to RFC 8089)?
TLDR version: Both forms of the URI are correct according to RFC 8089.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Your sample code highlights the difference between the values returned by Path's toUri() and File's toURI() methods for a given file. Printing those two values on my Win10 machine showed:
path.toUri() => file:///D:/NetBeansProjects/MiscTests/./
file.toURI() => file:/D:/NetBeansProjects/MiscTests/./
The results on Linux are similar:
path.toUri() => file:///home/johndoe/IdeaProjects/TestUri/./
file.toURI() => file:/home/johndoe/IdeaProjects/TestUri/./
So the only difference is the single or triple forward slashes following "file:" in the URI.
From your link, Appendix B of RFC 8089 confirms that both forms are valid URIs:
o A traditional file URI for a local file with an empty authority. This is the most common format in use today. For example:
* "file:///path/to/file"
o The minimal representation of a local file with no authority field and an absolute path that begins with a slash "/". For example:
* "file:/path/to/file"
Further confirmation that both URI forms are valid is that either can be entered in a browser to display the content of the directory. However, there are a few points worth noting:
Brave was the only browser that would not accept the single slash form of the URI (as given by File.toURI().toString()).
All browsers accepted the triple slash form of the URI (as given by File.toPath().toUri().toString()).
If I entered the URI in the browser's address bar with a single slash, it was converted to a triple slash.
Strangely, both Chrome and Firefox will accept any number of slashes in the URI (e.g. file:///////////D:/NetBeansProjects/MiscTests/), and still display the directory.
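Since both spellings are valid, code that needs to compare the two results is probably better off comparing parsed components than raw strings. The following is only a sketch: the class FileUriCompare and the helper samePath are hypothetical, and the helper deliberately looks at nothing but the scheme and the decoded path.

```java
import java.io.File;
import java.net.URI;

final class FileUriCompare {
    // Hypothetical helper: treats two file URIs as matching when their schemes
    // and decoded paths agree, regardless of how many slashes follow "file:".
    static boolean samePath(URI a, URI b) {
        return a.getScheme().equalsIgnoreCase(b.getScheme())
                && a.getPath().equals(b.getPath());
    }

    public static void main(String[] args) {
        File f = new File(".");
        // Compare the File-based and Path-based URIs by component, not by string
        System.out.println("same path = " + samePath(f.toURI(), f.toPath().toUri()));
    }
}
```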
The construct new URL(new URL(new URL("http://localhost:4567"), "abc"), "def") produces (imho incorrectly) this url: http://localhost:4567/def
While the construct new URL(new URL(new URL("http://localhost:4567"), "abc/"), "def") produces the correct (wanted by me) url: http://localhost:4567/abc/def
The difference is a trailing slash in the abc constructor argument.
Is this intended behavior, or is this a bug that should be fixed in the URL class?
After all, the idea is not to have to worry about slashes when you use a helper class for URL construction.
Quoting javadoc of new URL(URL context, String spec):
Otherwise, the path is treated as a relative path and is appended to the context path, as described in RFC2396.
See section 5 "Relative URI References" of the RFC2396 spec, specifically section 5.2 "Resolving Relative References to Absolute Form", item 6a:
All but the last segment of the base URI's path component is copied to the buffer. In other words, any characters after the last (right-most) slash character, if any, are excluded.
Explanation
On a web page, the "Base URI" is the page address, e.g. http://example.com/path/to/page.html. A relative link, e.g. <a href="page2.html">, must be interpreted as a sibling to the base URI, so page.html is removed, and page2.html is added, resulting in http://example.com/path/to/page2.html, as intended.
The Java URL class implements this logic, and that is why you get what you see, and it is entirely the way it is supposed to work.
It is by design, i.e. not a bug.
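The same rule can be observed with java.net.URI#resolve, which performs the same kind of relative reference resolution. In the sketch below, the class RelativeResolution and the helper asDirectory are hypothetical; the helper simply appends a slash and assumes the base has no query or fragment.

```java
import java.net.URI;

final class RelativeResolution {
    // Hypothetical helper: appends "/" so the last segment is kept as a directory.
    // Assumes the base URI carries no query or fragment.
    static URI asDirectory(URI base) {
        String s = base.toString();
        return s.endsWith("/") ? base : URI.create(s + "/");
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/path/to/page.html");
        System.out.println(page.resolve("page2.html")); // http://example.com/path/to/page2.html

        URI base = URI.create("http://localhost:4567/abc");
        System.out.println(base.resolve("def"));              // http://localhost:4567/def
        System.out.println(asDirectory(base).resolve("def")); // http://localhost:4567/abc/def
    }
}
```

If every intermediate segment is meant to be a directory, adding the trailing slash before resolving is the caller's job; the resolver cannot guess that abc is not a file.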
When using ProducerTemplate.sendBodyAndHeader() to send a file using the "file" scheme to its destination, and the file path in the URI contains ampersands, it fails to deliver the file with the following errors.
org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint:
file:///c%7C/IMM_SAN/Marketing/f77333bd-f96f-4873-b846-2f1dc5531a5a/2596/PB&J%20Generic%2007064782/transcoded/21726
due to: Failed to resolve endpoint:
file:///c%7C/IMM_SAN/Marketing/f77333bd-f96f-4873-b846-2f1dc5531a5a/25964/PB&J%20Generic%2007064782/transcoded/21726
due to: Invalid uri syntax: no ? marker however the uri has & parameter separators. Check the uri if its missing a ? marker.
I spent a few days trying the different overloads for sending the file: send(), sendBody(), sendBodyAndHeader() and even sendBodyAndHeaders().
I tried URLEncoder.encode() on it beforehand, and of course that was a no go.
I even debugged URISupport.normalizeUri(String uri) from the camel-core source and discovered something interesting. Apparently no amount of encoding before sending the body and header will do me any good, because the method does its own encoding, and that encoding appears to be totally incorrect. I think this is a bug in sendBodyAndHeader(): it puts the ampersand back into the URI before sending it. Why is it doing that? We have an application that reads files from one department and writes them to a share, and another system that automatically picks those files up and delivers them once processing of the file is finished.
See below: the Camel URISupport.normalizeUri(String uri) method encodes the URI here, and this puts the ampersand back into the file path.
URI u = new URI(UnsafeUriCharactersEncoder.encode(uri));
So you see, no amount of preprocessing on the file path in the URI is going to work, because sendBodyAndHeader() is going to do whatever it feels like doing. I would like to add a new overload to this API that turns off normalization and just sends the URI as is, but I wanted to check here first to see if anybody has a less drastic option. Please note this is a problem when ampersands are in the URI path for file schemes.
ProducerTemplate prod = exchange.getContext().createProducerTemplate();
destPath = destPath.replace(':', '|');
destPath = destPath.replaceAll("\\\\", "/");
destPath = destPath.replaceAll("&", "%26"); // replace the ampersand
String query = "file:///" + destPath;
prod.sendBodyAndHeader(query, exchange.getIn().getBody(), Exchange.FILE_NAME, destFileName);
Use the CamelFileName header to avoid messing up the endpoint URI with the reserved character & if you really need that character in the file path.
This example would put a file into c:\a&b
public void sendAnyFile(Exchange e) {
    ProducerTemplate pt = getContext().createProducerTemplate();
    pt.sendBodyAndHeader("file:///c:/", e.getIn().getBody(String.class), "CamelFileName", "a&b/hej.txt");
}
I need to create a Java URL object from a string that contains a delimiter which is one of the excluded US-ASCII characters. You can find the specification here: 2.4.3. Excluded US-ASCII Characters.
For example,
http://localhost:8182/a%image.tif
or
http://localhost:8182/a#image.tif
Does anybody know a workaround?
Can't you encode the character? So # => %23 and % => %25. See more information on W3Schools
Generally, a URI can be safely constructed only by encoding the individual components before assembling them into the final URI. In this case a%image.gif is a path component and must be encoded according to the path production (section 3.3 in RFC 2396).
Use java.net.URI to create legal URIs (and URLs):
URI uri = URI.create("http://localhost:8182/a%25image.gif");
System.out.println(uri.toASCIIString());
System.out.println(uri.getPath());
You should see the output of the last statement unencoded.
Technically, the second URL is not illegal: image.gif would simply be treated as a fragment. But if the hash character is meant to be part of the path, it must of course be encoded as well.
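If the raw path is only known at runtime, one option is to let the multi-argument URI constructor do the quoting, since it always quotes '%' and quotes characters such as '#' that are not legal in a path. This is a sketch, not the only way to do it; the class PathQuoting and the helper fromRawPath are hypothetical, and the host and port are taken from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PathQuoting {
    // Hypothetical helper: builds a URI from an unencoded path and lets the
    // seven-argument URI constructor quote the characters that need it.
    static URI fromRawPath(String path) {
        try {
            return new URI("http", null, "localhost", 8182, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI percent = fromRawPath("/a%image.tif");
        System.out.println(percent);           // http://localhost:8182/a%25image.tif
        System.out.println(percent.getPath()); // /a%image.tif

        URI hash = fromRawPath("/a#image.tif");
        System.out.println(hash);              // http://localhost:8182/a%23image.tif
    }
}
```

Calling toURL() on the result then gives a java.net.URL with the encoded path.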