Add scheme to URL if not present - java

I have existing code that uses java.net.URL instead of java.net.URI all over the codebase.
Also, the code has a URL parser that parses URLs appearing in some text body. All URLs that do not have a protocol prefix, such as www.google.com, are considered malformed when converted to a URL object.
Is there a clean way to handle such cases in Java?

Create a URI and see if it has a scheme. Set the scheme, or reconstruct the URI with a scheme argument, if not present. Convert to URL.
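A minimal sketch of that approach, assuming "http" is an acceptable default. The class and method names (SchemeDefaulter, withDefaultScheme) are hypothetical and only serve the illustration:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class SchemeDefaulter {

    // Hypothetical helper: returns a URL for the input, prepending
    // defaultScheme + "://" when the input parses as a URI without a scheme.
    static URL withDefaultScheme(String input, String defaultScheme) {
        String text = input.trim();
        try {
            URI uri = new URI(text);
            if (uri.getScheme() == null) {
                // "www.google.com" parses as a relative URI whose path is the
                // host name, so the scheme is prepended to the original text
                // and the result is parsed again.
                uri = new URI(defaultScheme + "://" + text);
            }
            return uri.toURL();
        } catch (URISyntaxException | MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + input, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withDefaultScheme("www.google.com", "http"));
        System.out.println(withDefaultScheme("https://example.com/a", "http"));
    }
}
```

One caveat: input of the form host:port (for example www.google.com:80) parses as a URI whose scheme is the host name, so this sketch rejects it instead of fixing it. Such input would need an extra check, for instance against a list of known schemes.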

Related

Custom URL scheme as adapter on existing URL schemes

Is there a clean and spec-conformant way to define a custom URL scheme that acts as an adapter on the resource returned by another URL?
I have already defined a custom URL protocol which returns a decrypted representation of a local file. So, for instance, in my code,
decrypted-file:///path/to/file
transparently decrypts the file you would get from file:///path/to/file. However, this only works for local files. No fun! I am hoping that the URL specification allows a clean way that I could generalize this by defining a new URL scheme as a kind of adapter on existing URLs.
For example, could I instead define a custom URL scheme decrypted: that could be used as an adapter that prefixes another absolute URL that retrieves a resource? Then I could just do
decrypted:file:///path/to/file
or decrypted:http://server/path/to/file or decrypted:ftp://server/path/to/file or whatever. This would make my decrypted: protocol composable with all existing URL schemes that do file retrieval.
Java does something similar with the jar: URL scheme but from my reading of RFC 3986 it seems like this Java technology violates the URL spec. The embedded URL is not properly byte-encoded, so any /, ?, or # delimiters in the embedded URL should officially be treated as segment delimiters in the embedding URL (even if that's not what JarURLConnection does). I want to stay within the specs.
Is there a nice and correct way to do this? Or is the only option to byte-encode the entire embedded URL (i.e., decrypted:file%3A%2F%2F%2Fpath%2Fto%2Ffile, which is not so nice)?
Is what I'm suggesting (URL adapters) done anywhere else? Or is there a deeper reason why this is misguided?
There's no built-in adaptor in Cocoa, but writing your own using NSURLProtocol is pretty straightforward for most uses. Given an arbitrary URL, encoding it like so seems simplest:
myscheme:<originalurl>
For example:
myscheme:http://example.com/path
At its simplest, NSURL only actually cares if the string you pass in is a valid URI, which the above is. Yes, there is then extra URL support layered on top, based around RFC 1808 etc. but that's not essential.
All that's required to be a valid URI is a colon to indicate the scheme, and no invalid characters (basically, ASCII without spaces).
You can then use the -resourceSpecifier method to retrieve the original URL and work with that.
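The answer above is about Cocoa. As a rough Java analogue (an assumption on my part, not something the answer covers), java.net.URI treats myscheme:<originalurl> as an opaque URI and exposes the embedded URL as its scheme-specific part. The class and method names here are hypothetical:

```java
import java.net.URI;

final class AdapterUrlDemo {

    // Returns the text after the adapter scheme, still percent-encoded,
    // e.g. "myscheme:http://example.com/path" -> "http://example.com/path".
    static String embeddedUrl(String adapterUrl) {
        URI outer = URI.create(adapterUrl);
        return outer.getRawSchemeSpecificPart();
    }

    public static void main(String[] args) {
        System.out.println(embeddedUrl("myscheme:http://example.com/path"));
    }
}
```

Note that a # in the embedded URL is parsed as the fragment of the outer URI and is not part of the scheme-specific part, so it would have to be read separately with getRawFragment().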

In Java, how to get canonicalized url

Say I have a space in a URL. What is the right way to convert it to %20? No 'replace' suggestions, please.
For example, if you put "http://test.com/test and test/a" into the browser's address bar, it is converted to http://test.com/test%20and%20test/a
If I use URLEncoder, even the / gets converted, which is not what I want.
Thanks,
This seems like the right way. To add to the question: what if there are also non-ASCII characters in the path that I want converted to a valid URL with UTF-8 encoding? E.g. test.com:8080/test and test/pierlag2_carré/a?query=世界 should be converted to test.com:8080/test%20and%20test/pierlag2_carr%C3%A9/a?query=%E4%B8%96%E7%95%8C
Try splitting into a URI with the aid of the URL class:
String sUrl = "http://test.com:8080/test and test/a?query=world";
URL url = new URL(sUrl);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String canonical = uri.toString();
System.out.println(canonical);
Output:
http://test.com:8080/test%20and%20test/a?query=world
The correct way to build URLs in Java is to create a URI object and fill out each part of the URL. The URI class handles the encoding rules for the distinct parts of the URL as they differ from one to the next.
URLEncoder is not what you want, despite its name, as that actually does HTML form encoding.
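For the non-ASCII case raised in the comment, a sketch along the same lines could use toASCIIString() instead of toString(). It assumes the components are already available as separate strings; the class and method names are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class CanonicalUrlDemo {

    // Builds a URI from separate components and returns its ASCII-only form.
    // The multi-argument constructor quotes illegal characters per component;
    // toASCIIString() additionally percent-encodes non-ASCII characters as UTF-8.
    static String encode(String scheme, String host, int port, String path, String query) {
        try {
            return new URI(scheme, null, host, port, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI", e);
        }
    }

    public static void main(String[] args) {
        // \u00e9 is "é"; \u4e16\u754c is "世界" (escaped to be independent of file encoding).
        System.out.println(encode("http", "test.com", 8080,
                "/test and test/pierlag2_carr\u00e9/a", "query=\u4e16\u754c"));
        // prints http://test.com:8080/test%20and%20test/pierlag2_carr%C3%A9/a?query=%E4%B8%96%E7%95%8C
    }
}
```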
EDIT:
Based on your comments, you are receiving the URL as input to your application and do not control the initial generation of the URL. The real problem you are currently experiencing is that the input you are receiving, the URL, is not a valid URL. URLs / URIs cannot contain spaces per the spec (hence the %20 in the browser).
Since you have no control over the invalid input you are going to be forced to split the incoming URL into its parts:
scheme
host
path
Then you are going to have to split the path and separately encode each part to ensure that you do not inadvertently encode the / that delimits your path fragments.
Finally, you can put all of them back together in a URI object and then pass that around your application.
You may find this code useful for replacing blank spaces in your URL:
String myUrl = "http://test.com/test and test/a";
myUrl = myUrl.replaceAll(" ", "%20");
URI url = new URI(myUrl);
System.out.print(url.toString());

java request.getQueryString() value different between chrome and ie browser

I have a request. In the browser address bar I enter:
http://localhost:8888/cmens-tops-outwear/t-b-f-a-c-s-fLoose-p-g-e-i-o.htm?'"--></style></script><script>netsparker(0x0000E1)</script>=
On Tomcat 6.0.35 I have set URIEncoding="UTF-8".
I call request.getQueryString() in a servlet.
In Chrome, I get
'%22--%3E%3C/style%3E%3C/script%3E%3Cscript%3Enetsparker(0x0000E1)%3C/script%3E=
In IE, I get
'"--></style></script><script>netsparker(0x0000E1)</script>=
Why?
Additional
I want to use the result of request.getQueryString() to create a URI:
URI uri = URI.create(url);
In IE this fails with:
java.net.URISyntaxException: Illegal character in query at index 36: /cmens/t-b-f-a-c-s-f-p-g-e-i-o.htm?'"--></style></script><script>netsparker(0x0000E1)</script>
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.parseHierarchical(URI.java:3072)
at java.net.URI$Parser.parse(URI.java:3024)
at java.net.URI.<init>(URI.java:578)
at java.net.URI.create(URI.java:840)
How can I determine whether the query string has already been encoded?
The HttpServletRequest#getQueryString() is per definition undecoded. See also the javadoc (emphasis mine):
Returns:
a String containing the query string or null if the URL contains no query string. The value is not decoded by the container.
Basically, you need to URL-decode it yourself if you'd like to parse it manually instead of using getParameterXxx() methods for some reason (which implicitly decodes the parameters!).
String decodedQueryString = URLDecoder.decode(request.getQueryString(), "UTF-8");
As to why Chrome sends it encoded while IE not, that's because Chrome is doing a better job of handling HTTP requests the safe/proper way. This is beyond your control. Just always URL-decode the query string yourself if you intend to parse it manually for some reason. The URIEncoding="UTF-8" configuration has only effect on getParameterXxx() methods during GET requests.
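As a small illustration of that advice, the sketch below decodes a shortened form of the two query strings from the question and shows that both end up identical. The class and method names are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class QueryStringDecodeDemo {

    // Decodes a raw query string as UTF-8; returns null when there is none,
    // mirroring what getQueryString() returns for a URL without a query.
    static String decode(String rawQuery) {
        if (rawQuery == null) {
            return null;
        }
        try {
            return URLDecoder.decode(rawQuery, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String fromChrome = "'%22--%3E%3C/style%3E";
        String fromIe = "'\"--></style>";
        System.out.println(decode(fromChrome)); // prints '"--></style>
        System.out.println(decode(fromChrome).equals(decode(fromIe))); // prints true
    }
}
```

Be aware that URLDecoder also turns + into a space and throws IllegalArgumentException on a % that is not followed by two hex digits, so unencoded input containing those characters would not pass through unchanged.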
The Chrome version is URLEncoded while the IE string is decoded.
Use this tool to compare the URLEncoded and decoded versions: http://meyerweb.com/eric/tools/dencoder/
Chrome sends the query string URL-encoded, whereas IE sends the raw characters.
For example: " is %22 in URL encoding.
< is %3C
and > is %3E
Chrome is doing it the "right way"; IE simply does not behave like the other browsers.
You can find complete list of URL characters here: http://www.w3schools.com/tags/ref_urlencode.asp
Chrome sends the url encoded. Try decoding the query string using
URLDecoder.decode(queryString, "UTF-8");
As stated by the javadoc, the query string is not decoded by the container:
returns a String containing the query string or null if the URL contains no query string. The value is not decoded by the container.
javadoc

How to encode path parameter using Java Jersey

How do you encode a path parameter (not form-url-encoded) but just a single URL that's appended in the format:
public String method(@PathParam("url") String url) {
}
There are lots of references to form URL encoding, but I want to simply encode a string as in the above.
As mentioned in the previous answer, URLEncoder can only be used for query parameters, not path parameters. This matters, e.g., for spaces, which are a + in a query parameter but a %20 in the path.
org.springframework.web.util.UriUtils.encodePath()
can be used. Using an org.apache.http.client.utils.URIBuilder would also work; its setPath escapes the path part. Pure Java works too, via one of the constructors of java.net.URI.
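A minimal pure-Java sketch of the difference, using the five-argument java.net.URI constructor for the path and URLEncoder for a query value. The class and method names are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class PathEncodingDemo {

    // Escapes a path with the rules for the path component (space -> %20).
    // Slashes are kept as segment delimiters and are not escaped.
    static String encodePath(String path) {
        try {
            return new URI(null, null, path, null, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode path: " + path, e);
        }
    }

    // Form-encodes a single query parameter value (space -> +).
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/files/my report.pdf")); // prints /files/my%20report.pdf
        System.out.println(encodeQueryValue("my report.pdf"));  // prints my+report.pdf
    }
}
```

Because the URI constructor leaves / alone, a path parameter that is itself a full URL would still need its slashes escaped separately if it has to travel as a single segment.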
Why would you want to *en*code it there, if anything wouldn't you want to *de*code it? In any case, you would call the standard URLEncoder.

How to encode URL to avoid special characters in Java? [duplicate]

This question already has answers here:
HTTP URL Address Encoding in Java
(24 answers)
Closed 5 years ago.
I need Java code to encode a URL to avoid special characters such as spaces, %, &, etc.
URL construction is tricky because different parts of the URL have different rules for what characters are allowed: for example, the plus sign is reserved in the query component of a URL because it represents a space, but in the path component of the URL, a plus sign has no special meaning and spaces are encoded as "%20".
RFC 2396 explains (in section 2.4.2) that a complete URL is always in its encoded form: you take the strings for the individual components (scheme, authority, path, etc.), encode each according to its own rules, and then combine them into the complete URL string. Trying to build a complete unencoded URL string and then encode it separately leads to subtle bugs, like spaces in the path being incorrectly changed to plus signs (which an RFC-compliant server will interpret as real plus signs, not encoded spaces).
In Java, the correct way to build a URL is with the URI class. Use one of the multi-argument constructors that takes the URL components as separate strings, and it'll escape each component correctly according to that component's rules. The toASCIIString() method gives you a properly-escaped and encoded string that you can send to a server. To decode a URL, construct a URI object using the single-string constructor and then use the accessor methods (such as getPath()) to retrieve the decoded components.
Don't use the URLEncoder class! Despite the name, that class actually does HTML form encoding, not URL encoding. It's not correct to concatenate unencoded strings to make an "unencoded" URL and then pass it through a URLEncoder. Doing so will result in problems (particularly the aforementioned one regarding spaces and plus signs in the path).
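The build-then-decode round trip described above might look like the following sketch. The host, path and query values are made up for illustration, and the class and method names are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriRoundTripDemo {

    // Builds an encoded URL string from unencoded components.
    static String build(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI", e);
        }
    }

    // Parses an already-encoded URL string and returns the decoded path.
    static String decodedPath(String encodedUrl) {
        return URI.create(encodedUrl).getPath();
    }

    public static void main(String[] args) {
        String encoded = build("http", "example.com", "/docs/a b+c", "q=x y");
        System.out.println(encoded);              // prints http://example.com/docs/a%20b+c?q=x%20y
        System.out.println(decodedPath(encoded)); // prints /docs/a b+c
    }
}
```

The plus sign in the path survives unchanged in both directions, which is the behaviour URLEncoder and URLDecoder would not give you.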
I also spent quite some time with this issue, so that's my solution:
String urlString2Decode = "http://www.test.com/äüö/path with blanks/";
String decodedURL = URLDecoder.decode(urlString2Decode, "UTF-8");
URL url = new URL(decodedURL);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String decodedURLAsString = uri.toASCIIString();
If you don't want to do it manually use Apache Commons - Codec library. The class you are looking at is: org.apache.commons.codec.net.URLCodec
final String url = "http://www.google.com?....";
// encode(String) is an instance method and throws the checked EncoderException
final String urlSafe = new org.apache.commons.codec.net.URLCodec().encode(url);
Here is my solution which is pretty easy:
Instead of encoding the url itself i encoded the parameters that I was passing because the parameter was user input and the user could input any unexpected string of special characters so this worked for me fine :)
String review="User input"; /*USER INPUT AS STRING THAT WILL BE PASSED AS PARAMETER TO URL*/
try {
review = URLEncoder.encode(review,"utf-8");
review = review.replace(" " , "+"); // no-op: URLEncoder has already turned spaces into "+"
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
String URL = "www.test.com/test.php"+"?user_review="+review;
I would echo what Wyzard wrote but add that:
for query parameters, HTML encoding is often exactly what the server is expecting; outside these, it is correct that URLEncoder should not be used
the most recent URI spec is RFC 3986, so you should refer to that as a primary source
I wrote a blog post a while back about this subject: Java: safe character handling and URL building
