Encoding URL query parameters in Java

How does one encode query parameters to go on a url in Java? I know, this seems like an obvious and already asked question.
There are two subtleties I'm not sure of:
Should spaces be encoded on the url as "+" or as "%20"? In chrome if I type in "http://google.com/foo=?bar me" chrome changes it to be encoded with %20
Is it necessary/correct to encode colons ":" as %3A? Chrome doesn't.
Notes:
java.net.URLEncoder.encode doesn't seem to work, it seems to be for encoding data to be form submitted. For example, it encodes space as + instead of %20, and encodes colon which isn't necessary.
java.net.URI doesn't encode query parameters

java.net.URLEncoder.encode(String s, String encoding) can help too. It follows the HTML form encoding application/x-www-form-urlencoded.
URLEncoder.encode(query, "UTF-8");
On the other hand, Percent-encoding (also known as URL encoding) encodes space with %20. Colon is a reserved character, so : will still remain a colon, after encoding.

Unfortunately, URLEncoder.encode() does not produce valid percent-encoding (as specified in RFC 3986).
URLEncoder.encode() encodes everything just fine, except space is encoded to "+". All the Java URI encoders that I could find only expose public methods to encode the query, fragment, path parts etc. - but don't expose the "raw" encoding. This is unfortunate as fragment and query are allowed to encode space to +, so we don't want to use them. Path is encoded properly but is "normalized" first so we can't use it for 'generic' encoding either.
Best solution I could come up with:
return URLEncoder.encode(raw, "UTF-8").replaceAll("\\+", "%20");
If replaceAll() is too slow for you, I guess the alternative is to roll your own encoder...
EDIT: I had this code in here first which doesn't encode "?", "&", "=" properly:
//don't use - doesn't properly encode "?", "&", "="
new URI(null, null, null, raw, null).toString().substring(1);
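The replace-based approach above can be sketched as a small helper. This is only a sketch, not a full RFC 3986 encoder: the class name QueryEncoder and its encode method are hypothetical, and it assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload. A literal "+" in the input is already emitted as %2B by URLEncoder, so any "+" left in its output can only stand for a space, and String.replace avoids the regex overhead of replaceAll.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryEncoder {
    // Hypothetical helper: form-encode the value, then turn the "+" that
    // URLEncoder emits for each space into "%20".
    static String encode(String raw) {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8).replace("+", "%20");
    }

    public static void main(String[] args) {
        System.out.println(encode("bar me"));   // space becomes %20
        System.out.println(encode("a+b&c=d"));  // "+", "&" and "=" are percent-encoded
    }
}
```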

EDIT: URIUtil is no longer available in more recent versions, better answer at Java - encode URL or by Mr. Sindi in this thread.
URIUtil of Apache httpclient is really useful, although there are some alternatives
URIUtil.encodeQuery(url);
For example, it encodes space as "+" instead of "%20"
Both are perfectly valid in the right context. Although if you really preferred you could issue a string replace.

It is not necessary to encode a colon as %3A in the query, although doing so is not illegal.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
It also seems that only percent-encoded spaces are valid, as I doubt that space is an ALPHA or a DIGIT.
Look to the URI specification (RFC 3986) for more details.
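To make the grammar above concrete, here is a rough sketch of a check for whether a single character may appear literally in a query component according to RFC 3986. The class and method names are hypothetical. A "%" is deliberately reported as not allowed on its own, because it is only valid as the start of a pct-encoded triplet.

```java
class QueryChars {
    private static final String UNRESERVED_PUNCT = "-._~";
    private static final String SUB_DELIMS = "!$&'()*+,;=";

    // Follows: query = *( pchar / "/" / "?" )
    //          pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    // pct-encoded ("%" HEXDIG HEXDIG) is not handled here; "%" alone returns false.
    static boolean isAllowedInQuery(char c) {
        boolean alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        boolean digit = c >= '0' && c <= '9';
        return alpha || digit
                || UNRESERVED_PUNCT.indexOf(c) >= 0
                || SUB_DELIMS.indexOf(c) >= 0
                || c == ':' || c == '@' || c == '/' || c == '?';
    }

    public static void main(String[] args) {
        System.out.println(isAllowedInQuery(':')); // colon needs no encoding
        System.out.println(isAllowedInQuery(' ')); // space must be encoded
    }
}
```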

The built in Java URLEncoder is doing what it's supposed to, and you should use it.
A "+" or "%20" are both valid replacements for a space character in a URL. Either one will work.
A ":" should be encoded, as it's a separator character. i.e. http://foo or ftp://bar. The fact that a particular browser can handle it when it's not encoded doesn't make it correct. You should encode them.
As a matter of good practice, be sure to use the method that takes a character encoding parameter. UTF-8 is generally used there, but you should supply it explicitly.
URLEncoder.encode(yourUrl, "UTF-8");

I just want to add another way to resolve this problem.
If your project depends on Spring Web, you can use its utilities.
import org.springframework.web.util.UriUtils;
import java.nio.charset.StandardCharsets;
UriUtils.encode("vip:104534049:5", StandardCharsets.UTF_8);
Output:
vip%3A104534049%3A5

String param="2019-07-18 19:29:37";
param="%27"+param.trim().replace(" ", "%20")+"%27";
I observed that in the case of a datetime (timestamp) value,
URLEncoder.encode(param,"UTF-8") does not work.

The white space character " " is converted into a + sign when using URLEncoder.encode. This is opposite to other programming languages like JavaScript which encodes the space character into %20. But it is completely valid as the spaces in query string parameters are represented by +, and not %20. The %20 is generally used to represent spaces in URI itself (the URL part before ?).
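A short round trip illustrates the point. This sketch assumes Java 10 or later for the Charset overloads of URLEncoder.encode and URLDecoder.decode; the class and method names are made up for the example. A form-style decoder treats both "+" and "%20" as a space.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FormSpaceDemo {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("hello sir"));     // the space is written as "+"
        System.out.println(decode("hello+sir"));     // "+" reads back as a space
        System.out.println(decode("hello%20sir"));   // "%20" reads back as a space too
    }
}
```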

If the only problem in your URL is spaces, I have used the code below and it works fine:
String url;
URL myUrl = new URL(url.replace(" ","%20"));
Example: if url is
www.xyz.com?para=hello sir
then the output of myUrl is
www.xyz.com?para=hello%20sir

Related

Why is java returning encoded values different

I am not quite sure why does java return %27+ for special characters in the name.
For example, the value I am trying to encode is "Mc' Donald". It is encoded to "Mc%27+Donald" when it should be "Mc%27%20Donald". The reason I do the replace in the first place is that the db has &apos; instead of ', so I am replacing it and then encoding.
lastName = URLEncoder.encode(lastName.replace("&apos;", "'"), "UTF-8");

In HTML encoding, + is a valid replacement for SPACE (%20) as well.

URL encoding the character @ in query path

There are places/libraries that seem to consider "@" characters in a URL Path segment as "special character" that should be encoded, and places/libraries that do not.
I am looking to find out what is the correct version.
Example string: "someone@example.com".
If I go to https://www.urlencoder.org/ , and try to encode the above String I get
someone%40example.com
If I am using org.springframework.web.util.UriUtils I get these results:
String s1 = UriUtils.encodePathSegment("someone@example.com", "UTF-8");
String s2 = UriUtils.encodeQueryParam("someone@example.com", "UTF-8");
String s3 = UriUtils.encodePath("someone@example.com", "UTF-8");
System.out.println("----------s1: " + s1);
System.out.println("----------s2: " + s2);
System.out.println("----------s3: " + s3);
...outputs
----------s1: someone@example.com
----------s2: someone@example.com
----------s3: someone@example.com
RestEasy-Client v4.0.0.Final does not encode the "@" character in path segments
WSO2 ESB complains when receiving a Path parameter that contains # char (well, it can't find the resource at said moment).
Who is right, what should be the correct outcome, should "@" be transformed to "%40" or not?

There are places/libraries that seem to consider "@" characters in a URL Path segment as "special character" that should be encoded, and places/libraries that do not.
The standard for which characters must be escaped in a path segment is RFC 3986, Appendix A.
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
Notice that depending on the path production you are using, there are three different flavors of segment
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
but...
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
So @ is allowed in any path segment.
Is it required? As far as I can tell, the answer is no -- using the pct-encoded representation instead is permitted when @ is not serving the role of a delimiter. There's nothing explicit, but this observation about unreserved characters is a hint:
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
This suggests that pct-encodings of unreserved characters are permitted, even though that's clearly not required. So that should hold true for other characters after the delimiters have been resolved.
For reference: the unreserved set is pretty much what you would expect.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
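For comparison, the JDK's own java.net.URI follows the same reading of the grammar when it quotes a path. The sketch below uses the four-argument constructor (scheme, host, path, fragment); the class name, the build helper, and the host example.com are illustrative assumptions. The "@" is left as-is, while a character that is not legal in a path, such as a space, is percent-encoded.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathAtSignDemo {
    // Builds an https URI on a placeholder host and lets java.net.URI
    // quote whatever is not legal in the path component.
    static String build(String path) {
        try {
            return new URI("https", "example.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/users/someone@example.com")); // "@" is kept
        System.out.println(build("/users/some one"));            // space is escaped
    }
}
```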

If you call a URL like login(:password)@url.com, it will connect you to that endpoint with your credentials. So I would not escape them at that point. But if they appear after the .com, I would escape them, because they should not be used as a separator.

Different behavior when space is encoded as + and %20 in a URL

Pages with spaces in the URL don't get correctly translated:
i.e.
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
Gives 404. Please note "Press Releases" is encoded as "Press%20Releases".
However following two versions work fine where "Press Releases" is encoded as "Press+Releases".
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article parses fine with plus signs or HEX spaces %20.
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces, so why this behavior?
Also, what could I use in Java to get the correctly encoded URL?

Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.
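The difference is visible from Java as well. In this sketch (class and method names are hypothetical, and URLDecoder.decode with a Charset assumes Java 10 or later), java.net.URI decodes "%20" in a path to a space but leaves "+" alone, whereas the form-style URLDecoder turns "+" into a space.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class PlusInPathDemo {
    // getPath() returns the path with percent-escapes decoded.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Form-style decoding, as applied to query parameter values.
    static String formDecode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodedPath("http://example.com/Press+Releases/a%20b.html"));
        System.out.println(formDecode("Press+Releases"));
    }
}
```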

Officially + might only be used in the query string (after ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
The more universal class URI obeys the specification and replaces spaces using %.
URI uri = new URI("http", "www.streetinsider.com",
    "/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
    "?x=ŝi estas ĉarma");
String u = uri.toString();
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html#?x=ŝi%20estas%20ĉarma
One sometimes encounters URI as a generalisation for File and others, and then has to be careful not to introduce %20 in file names.
So there is probably a partial remapping of + (or even %20) on streetinsider, in order to reach the same code.

Your statement
Both + and %20 represent spaces.
is not exactly true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
RFC 1866 (the HTML 2.0 specification), paragraph 8.2.1, subparagraph 1, says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped".
Here is an example of such a string in URL where RFC-1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses (in other cases, spaces should be encoded to %20). This way of encoding form data is also given in later HTML specifications, for example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
The URL that you have provided is not a form data containing key/value pairs, it's just a path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So, it is illegal to use pluses here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html

urlencode() the 'asterisk' (star?) character

I'm testing PHP urlencode() vs. Java java.net.URLEncoder.encode().
Java
String all = "";
for (int i = 32; i < 256; ++i) {
    all += (char) i;
}
System.out.println("All characters: -||" + all + "||-");
try {
    System.out.println("Encoded characters: -||" + URLEncoder.encode(all, "utf8") + "||-");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
PHP
$all = "";
for($i = 32; $i < 256; ++$i)
{
    $all = $all.chr($i);
}
echo($all.PHP_EOL);
echo(urlencode(utf8_encode($all)).PHP_EOL);
All characters seem to be encoded in the same way with both functions, except for the 'asterisk' character that is not encoded by Java, and translated to %2A by PHP. Which behaviour is supposed to be the 'right' one, if any?
Note: I tried with rawurlencode(), too - no luck.

It is okay to have a * in a URL, (but it is also okay to have it in its encoded form).
RFC1738: Uniform Resource Locators (URL) states the following:
Reserved:
[...]
Usually a URL has the same interpretation when an octet is
represented by a character and when it encoded. However, this is not
true for reserved characters: encoding a character reserved for a
particular scheme may change the semantics of a URL.
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
On the other hand, characters that are not required to be encoded
(including alphanumerics) may be encoded within the scheme-specific
part of a URL, as long as they are not being used for a reserved
purpose.

Wikipedia suggests that * is a reserved character when it comes to URIs, and that it must be encoded if not used for the reserved purpose. According to RFC3986, pages 12-13:
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
(The reason the URL RFC still allows the * character to go unencoded is that it doesn't have a reserved purpose in URLs, and as such doesn't have to be encoded. So whether you have to encode it or not depends on what sort of URI you're creating.)

Javadoc of URLEncoder refers to the HTML specification:
This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format. For more information about HTML form encoding, consult the HTML specification.
HTML4 is quite unclear regarding this question and refers to RFC1738, which is quoted by aioobe:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
However, HTML5 directly states that * should not be encoded:
If the character isn't in the range U+0020, U+002A, U+002D, U+002E, U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to U+007A
Replace the character with a string formed as follows:
...
Otherwise
Leave the character as is.
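If the Java output needs to match PHP's urlencode() for this character, one option is to post-process the result. This is a sketch that relies on the question's observation that the asterisk is the only difference between the two functions; the class name PhpStyleEncoder is hypothetical, and the Charset overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PhpStyleEncoder {
    // URLEncoder leaves "*" untouched; escape it afterwards so the result
    // lines up with what the question reports for PHP's urlencode().
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("*", "%2A");
    }

    public static void main(String[] args) {
        System.out.println(URLEncoder.encode("a*b c", StandardCharsets.UTF_8)); // "*" kept
        System.out.println(encode("a*b c"));                                    // "*" escaped
    }
}
```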

What is the most efficient way to format UTF-8 strings in java?

I am doing the following:
String url = String.format(WEBSERVICE_WITH_CITYSTATE, cityName, stateName);
String urlUtf8 = new String(url.getBytes(), "UTF8");
Log.d(TAG, "URL: [" + urlUtf8 + "]");
Reader reader = WebService.queryApi(url);
The output that I am looking for is essentially to get the city name with blanks (e.g., "Overland Park") to be formatted as Overland%20Park.
Is this the best way?

Assuming you are actually wanting to encode your string for use in a URL (ie, "Overland Park" can also be formatted as "Overland+Park") you want URLEncoder.encode(url, "UTF-8"). Other unsafe characters will be converted to the %xx format you are asking for.
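In practice it is usually safer to encode each value before it goes into the template rather than the finished URL, because URLEncoder would also escape the ":" and "/" of a complete URL. A minimal sketch, in which the TEMPLATE string is a made-up stand-in for WEBSERVICE_WITH_CITYSTATE and the class name is hypothetical (the Charset overload assumes Java 10 or later):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CityStateUrl {
    // Hypothetical template; the real WEBSERVICE_WITH_CITYSTATE is not shown in the question.
    static final String TEMPLATE = "http://example.com/api?city=%s&state=%s";

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Encode each parameter value on its own, then substitute into the template.
    static String build(String cityName, String stateName) {
        return String.format(TEMPLATE, enc(cityName), enc(stateName));
    }

    public static void main(String[] args) {
        System.out.println(build("Overland Park", "KS"));
    }
}
```

If the service insists on %20 rather than +, the encoded value can be passed through replace("+", "%20") before formatting.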

The simple answer is to use URLEncoder.encode(...) as stated by @Recurse. However, if part or all of the URL has already been encoded, then this can lead to double encoding. For example:
http://foo.com/pages/Hello%20There
or
http://foo.com/query?keyword=what%3f
Another concern with URLEncoder.encode(...) is that it doesn't understand that certain characters should be escaped in some contexts and not others. So for example, a '?' in a query parameter should be escaped, but the '?' that marks the start of the "query part" should not be escaped.
I think that a safer way to add missing escapes would be the following:
String safeURI = new URI(url).toASCIIString();
However, I haven't tested this ...
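A quick experiment shows what this does and does not cover. In the sketch below (class and helper names are hypothetical; the Charset overload of URLEncoder.encode assumes Java 10 or later), existing escapes are left alone and non-ASCII characters are escaped, but the single-argument URI constructor rejects a raw space with a URISyntaxException instead of escaping it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {
    // Returns the ASCII form of the URI, or null if the input does not parse.
    static String toAscii(String url) {
        try {
            return new URI(url).toASCIIString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // URLEncoder escapes the "%" of an existing escape again.
        System.out.println(URLEncoder.encode("Hello%20There", StandardCharsets.UTF_8));
        // An already escaped URL passes through unchanged.
        System.out.println(toAscii("http://foo.com/pages/Hello%20There"));
        // Non-ASCII characters are escaped as UTF-8 octets.
        System.out.println(toAscii("http://foo.com/pages/\u0109arma"));
        // A raw space is a syntax error here, so this prints null.
        System.out.println(toAscii("http://foo.com/pages/Hello There"));
    }
}
```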
