Why is Java returning different encoded values?

I am not sure why Java returns %27+ for special characters in the name.
For example, the value I am trying to encode is "Mc' Donald". It is encoded as "Mc%27+Donald" when I expected "Mc%27%20Donald". The reason for the replace call is that the database stores &apos; instead of ', so I replace it first and then encode.
lastName = URLEncoder.encode(lastName.replace("&apos;", "'"), "UTF-8");

URLEncoder implements HTML form encoding (application/x-www-form-urlencoded), and in that encoding + is a valid replacement for a space, just as %20 is.
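A minimal sketch of the behavior described above, using only java.net.URLEncoder and java.net.URLDecoder. The class and method names (FormEncodingDemo, formEncode, percentEncode, decode) are made up for illustration. It shows that the encoder emits + for a space, that swapping + for %20 afterwards gives the form the question expected, and that the decoder accepts both.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Form encoding as done by URLEncoder: space becomes "+", "'" becomes "%27".
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // A literal "+" in the input has already become "%2B",
    // so every remaining "+" stands for a space and can be swapped safely.
    static String percentEncode(String value) {
        return formEncode(value).replace("+", "%20");
    }

    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String lastName = "Mc&apos; Donald".replace("&apos;", "'");
        System.out.println(formEncode(lastName));       // Mc%27+Donald
        System.out.println(percentEncode(lastName));    // Mc%27%20Donald
        System.out.println(decode("Mc%27+Donald"));     // Mc' Donald
        System.out.println(decode("Mc%27%20Donald"));   // Mc' Donald
    }
}
```

Whether the %20 form is actually needed depends on where the value ends up: inside a query string either form decodes to the same text.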

Related

Illegal Character in XML are not being replaced

SOLUTION So this was not an XML issue at all. My XML escapes were done properly; however, there was an encoding issue. I would like to share my solution with everyone, and I hope you find it useful.
public static String entityEncode(String text) throws UnsupportedEncodingException {
    String result = text;
    if (result == null) {
        return result;
    }
    // Re-read the ISO-8859-1 bytes of the text as UTF-8
    byte[] ptext = result.getBytes("ISO-8859-1");
    String value = new String(ptext, "UTF-8");
    String temp = XMLStringUtil.escapeControlChrs(value);
    return temp;
}
EXPLANATION The XML function above is for XML 1.0. We take the given text and convert it to a byte array, since a String does not have an associated encoding. We then create a new String from those bytes as UTF-8. That is also why Java only reported the character reference error with &#: it could not recognize the offending character. Now that the text is re-decoded as UTF-8, there are no issues and the XML escape proceeds properly!
EDIT: How do I print out all illegal XML characters in the provided string, according to the StringEscapeUtils.escapeXml parameters? The problem is that I don't want to escape everything, because the text doesn't decode properly afterwards. So right now I just need to find out which characters in the text are invalid: the ones that are causing issues and need to be encoded.
I have the following error message:
ERROR: 'Character reference "&#'
ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Character reference "&#'
It does not specifically tell me what the character is which is a problem.
I do my original XML parse to convert the text to an XML document, and after that I sanitize further to remove illegal characters:
String xml10pattern = "[^"
        + "\u0009\r\n"
        + "\u0020-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]";
However, it's not removing them, so I'm not sure how to go about this. Currently I have:
String encoded = entityEncode(temp);
String legal = encoded.replaceAll(xml10pattern, "");
item.setResponseBody(legal);
Entity encode just uses a standard xml parse class to escape characters XMLStringUtil.escapeControlChrs which is based off of StringEscapeUtils.escapeXml and just has additional escapes, nothing removed. But something is being missed.
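One way to see which characters are at fault, sketched under the assumption that "illegal" means outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The class and method names (XmlCharScan, isLegalXml10, findIllegal) are hypothetical. The scan walks the string by code point rather than by char, so a valid surrogate pair counts as one legal character and an unpaired surrogate is reported.

```java
import java.util.ArrayList;
import java.util.List;

class XmlCharScan {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one entry per illegal code point, with its char index in the string.
    static List<String> findIllegal(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXml10(cp)) {
                hits.add(String.format("index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "ok\u0001 text\u000B end";
        // prints "index 2: U+0001" and "index 8: U+000B"
        findIllegal(sample).forEach(System.out::println);
    }
}
```

Running this on the text before it is escaped shows exactly which code points need handling, instead of escaping everything.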

Different behavior when space is encoded as + and %20 in a URL

Pages with spaces in the URL don't get translated correctly. For example,
http://www.streetinsider.com/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html
or
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
gives a 404. Note that "Press Releases" is encoded as "Press%20Releases" here.
However, the following two versions, where "Press Releases" is encoded as "Press+Releases", work fine:
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
The article title resolves with either plus signs or hex-encoded spaces (%20):
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
Both + and %20 represent spaces, so why this behavior?
Also, what can I use in Java to get the correctly encoded URL?

Both + and %20 represent spaces
Only in query strings. Elsewhere in a URL a plus is a plus, not a space. In this case the web server gives you the same content for the two different URLs
http://www.streetinsider.com/Press+Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
and
http://www.streetinsider.com/Press+Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html
but the two URLs are distinct, they're not alternative representations of the same URL.
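The point that a plus in the path is just a plus can be checked with java.net.URI, whose getPath() decodes percent escapes but leaves + alone. This is a small sketch; the class and method names (PlusInPathDemo, decodedPath) are made up, and the URLs are shortened versions of the ones above.

```java
import java.net.URI;

class PlusInPathDemo {
    // Parses the URL and returns its path with percent escapes decoded.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // "+" stays a plus sign: /Press+Releases/9778767.html
        System.out.println(decodedPath("http://www.streetinsider.com/Press+Releases/9778767.html"));
        // "%20" becomes a space: /Press Releases/9778767.html
        System.out.println(decodedPath("http://www.streetinsider.com/Press%20Releases/9778767.html"));
    }
}
```

The two decoded paths differ, which is why a server is free to answer one and return 404 for the other.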

Officially, + may only be used to represent a space in the query string (after the ?).
This is what URLEncoder is for:
"?x=" + URLEncoder.encode("Hello World", "UTF-8");
"?x=" + URLEncoder.encode("ŝi estas ĉarma", "UTF-8");
?x=Hello+World
?x=%C5%9Di+estas+%C4%89arma
The more universal class URI obeys the specification and replaces spaces using %20.
URI uri = new URI("http", "www.streetinsider.com",
        "/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
        "?x=ŝi estas ĉarma");
String u = uri.toString();
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html#?x=ŝi%20estas%20ĉarma
Note that the last argument of this four-argument constructor is the fragment, which is why the result contains #?x= rather than ?x=.
One sometimes encounters URI as a generalisation for File and others, and then has to be careful not to introduce %20 into file names.
So there is probably a partial remapping of + (or even %20) on streetinsider, so that both forms reach the same code.
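For a URL that has both a path and a query, the five-argument constructor URI(scheme, authority, path, query, fragment) keeps the two apart. A sketch, with a hypothetical class UriPathDemo and helper build; the query value x=Hello World is only an illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriPathDemo {
    // Builds an http URL; the constructor quotes illegal characters such as spaces as %20.
    // Caveat: it does not escape "&" or "=" inside the query, so parameter values
    // containing those characters still need to be encoded separately.
    static String build(String host, String path, String query) {
        try {
            return new URI("http", host, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("www.streetinsider.com",
                "/Press Releases/National Trends Reflected in Plano Housing Market/9778767.html",
                "x=Hello World"));
    }
}
```

toASCIIString() is used instead of toString() so that non-ASCII characters are percent-encoded as well.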

Your statement
Both + and %20 represent spaces.
is not true in all cases.
Space characters may only be encoded as "+" in one context: application/x-www-form-urlencoded key-value pairs.
RFC 1866 (the HTML 2.0 specification), paragraph 8.2.1, subparagraph 1, says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped".
Here is an example of such a string in a URL where RFC 1866 allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So spaces can be replaced by pluses only after the "?"; in all other cases, spaces should be encoded as %20. This way of encoding form data is also given in later HTML specifications; for example, look for the paragraphs about application/x-www-form-urlencoded in the HTML 4.01 Specification.
The URL that you have provided is not form data containing key/value pairs; it is just a path to a 9778767.html file:
http://www.streetinsider.com/Press%20Releases/National+Trends+Reflected+in+Plano+Housing+Market/9778767.html
So it is illegal to use pluses for spaces here. The correct URL in this case should have been the following:
http://www.streetinsider.com/Press%20Releases/National%20Trends%20Reflected%20in%20Plano%20Housing%20Market/9778767.html

Tomcat: possible to parse URL parameters containing '&'?

I have a servlet running on tomcat 6 which should be called as follows:
http://<server>/Address/Details?summary="Acme & co"
However: when I iterate through the parameters in the servlet code:
//...
while (paramNames.hasMoreElements()) {
    paramName = (String) paramNames.nextElement();
    if (paramName.equals("summary")) {
        summary = request.getParameter(paramName).toString();
    }
}
//...
the value of summary is "Acme ".
I assume tomcat ignores the quotes - so it sees "& co" as a second parameter (albeit improperly formed: there's no =...).
So: is there any way to avoid this? I want the value of summary to be "Acme & co". I tried replacing '&' in the URL with &amp; but that doesn't work (presumably because it's decoded back to a straight '&' before the params are parsed out).
Thanks.

Use http://<server>/Address/Details?summary="Acme %26 co". Characters that have a special meaning in a URL (e.g. & and /) cannot be used literally inside parameter values; they have to be percent-encoded.

Are you encoding and decoding the URL with URLEncoder? If so, can you check what the input and output of those calls are? It seems like one of the special characters is not being properly encoded or decoded.
Try %26 for the &
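On the client side, the %26 does not have to be typed by hand: URLEncoder produces it. A sketch, where the class AmpersandParamDemo, its helpers, and the host server.example are placeholders rather than anything from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class AmpersandParamDemo {
    // Encodes only the parameter value, so "&" becomes "%26" and cannot
    // be mistaken for a parameter separator.
    static String buildUrl(String base, String summary) {
        try {
            return base + "?summary=" + URLEncoder.encode(summary, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Form decoding of a single value, shown here only to check the round trip.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // http://server.example/Address/Details?summary=Acme+%26+co
        System.out.println(buildUrl("http://server.example/Address/Details", "Acme & co"));
        System.out.println(decode("Acme+%26+co")); // Acme & co
    }
}
```

With the value encoded this way, request.getParameter("summary") in the servlet should see the whole string, since the container decodes parameter values after splitting on &.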

Try your parameter like
summary="Acme & co"
& is one of the reserved characters. Refer to RFC 2396, section 2.2, Reserved Characters.
how to encode URL to avoid special characters in java
Characters allowed in GET parameter
HTTP URL - allowed characters in parameter names
http://illegalargumentexception.blogspot.in/2009/12/java-safe-character-handling-and-url.html

Encoding URL query parameters in Java

How does one encode query parameters to go on a url in Java? I know, this seems like an obvious and already asked question.
There are two subtleties I'm not sure of:
Should spaces be encoded on the URL as "+" or as "%20"? In Chrome, if I type in "http://google.com/foo=?bar me", Chrome changes it to be encoded with %20.
Is it necessary/correct to encode colons ":" as %3A? Chrome doesn't.
Notes:
java.net.URLEncoder.encode doesn't seem to work, it seems to be for encoding data to be form submitted. For example, it encodes space as + instead of %20, and encodes colon which isn't necessary.
java.net.URI doesn't encode query parameters

java.net.URLEncoder.encode(String s, String encoding) can help too. It follows the HTML form encoding application/x-www-form-urlencoded.
URLEncoder.encode(query, "UTF-8");
On the other hand, Percent-encoding (also known as URL encoding) encodes space with %20. Colon is a reserved character, so : will still remain a colon, after encoding.

Unfortunately, URLEncoder.encode() does not produce valid percent-encoding (as specified in RFC 3986).
URLEncoder.encode() encodes everything just fine, except space is encoded to "+". All the Java URI encoders that I could find only expose public methods to encode the query, fragment, path parts etc. - but don't expose the "raw" encoding. This is unfortunate as fragment and query are allowed to encode space to +, so we don't want to use them. Path is encoded properly but is "normalized" first so we can't use it for 'generic' encoding either.
Best solution I could come up with:
return URLEncoder.encode(raw, "UTF-8").replaceAll("\\+", "%20");
If replaceAll() is too slow for you, I guess the alternative is to roll your own encoder...
EDIT: I had this code in here first which doesn't encode "?", "&", "=" properly:
//don't use - doesn't properly encode "?", "&", "="
new URI(null, null, null, raw, null).toString().substring(1);
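A slightly fuller sketch of the same idea, assuming the goal is to encode a single query component using the RFC 3986 unreserved set. Besides the space, URLEncoder differs from that set in two places: it leaves * unencoded and it encodes ~. The class and method names (Rfc3986Encoder, encodeComponent) are hypothetical, and plain replace() is used because no regular expression is needed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class Rfc3986Encoder {
    // Form-encode first, then adjust the three places where URLEncoder's
    // output differs from RFC 3986 percent-encoding of a component.
    static String encodeComponent(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8")
                    .replace("+", "%20")   // space
                    .replace("*", "%2A")   // "*" is not unreserved in RFC 3986
                    .replace("%7E", "~");  // "~" is unreserved in RFC 3986
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeComponent("a b*c~d?&=:")); // a%20b%2Ac~d%3F%26%3D%3A
    }
}
```

This is meant for individual names and values, not for a whole URL: applied to a complete URL it would also encode the separators.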

EDIT: URIUtil is no longer available in more recent versions, better answer at Java - encode URL or by Mr. Sindi in this thread.
URIUtil of Apache httpclient is really useful, although there are some alternatives
URIUtil.encodeQuery(url);
For example, it encodes space as "+" instead of "%20"
Both are perfectly valid in the right context. Although if you really preferred you could issue a string replace.

It is not necessary to encode a colon as %3A in the query, although doing so is not illegal.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
It also seems that only percent-encoded spaces are valid, as a space is neither an ALPHA nor a DIGIT, nor any of the other characters listed.
Look at the URI specification (RFC 3986) for more details.

The built in Java URLEncoder is doing what it's supposed to, and you should use it.
A "+" or "%20" are both valid replacements for a space character in a URL. Either one will work.
A ":" should be encoded, as it's a separator character. i.e. http://foo or ftp://bar. The fact that a particular browser can handle it when it's not encoded doesn't make it correct. You should encode them.
As a matter of good practice, be sure to use the method that takes a character encoding parameter. UTF-8 is generally used there, but you should supply it explicitly.
URLEncoder.encode(yourUrl, "UTF-8");

I just want to add another way to resolve this problem.
If your project depends on Spring Web, you can use its utilities.
import org.springframework.web.util.UriUtils;
import java.nio.charset.StandardCharsets;
UriUtils.encode("vip:104534049:5", StandardCharsets.UTF_8);
Output:
vip%3A104534049%3A5

String param = "2019-07-18 19:29:37";
param = "%27" + param.trim().replace(" ", "%20") + "%27";
I observed that in the case of a datetime (timestamp) value,
URLEncoder.encode(param, "UTF-8") does not work, so I replace the space by hand as shown above.

The space character " " is converted into a + sign by URLEncoder.encode. This differs from other languages such as JavaScript, which encode the space character as %20. But it is completely valid, as spaces in query string parameters are represented by +, and not %20. The %20 is generally used to represent spaces in the URI itself (the part of the URL before the ?).

If spaces are the only problem in your URL, the code below has worked fine for me:
String url = "http://www.xyz.com?para=hello sir"; // a scheme is needed, otherwise new URL(...) throws MalformedURLException
URL myUrl = new URL(url.replace(" ", "%20"));
The value of myUrl is then
http://www.xyz.com?para=hello%20sir

What is the most efficient way to format UTF-8 strings in java?

I am doing the following:
String url = String.format(WEBSERVICE_WITH_CITYSTATE, cityName, stateName);
String urlUtf8 = new String(url.getBytes(), "UTF8");
Log.d(TAG, "URL: [" + urlUtf8 + "]");
Reader reader = WebService.queryApi(url);
The output that I am looking for is essentially to get the city name with blanks (e.g., "Overland Park") to be formatted as Overland%20Park.
Is this the best way?

Assuming you are actually wanting to encode your string for use in a URL (ie, "Overland Park" can also be formatted as "Overland+Park") you want URLEncoder.encode(url, "UTF-8"). Other unsafe characters will be converted to the %xx format you are asking for.

The simple answer is to use URLEncoder.encode(...) as stated by @Recurse. However, if part or all of the URL has already been encoded, then this can lead to double encoding. For example:
http://foo.com/pages/Hello%20There
or
http://foo.com/query?keyword=what%3f
Another concern with URLEncoder.encode(...) is that it doesn't understand that certain characters should be escaped in some contexts and not others. So for example, a '?' in a query parameter should be escaped, but the '?' that marks the start of the "query part" should not be escaped.
I think that a safer way to add missing escapes would be the following:
String safeURI = new URI(url).toASCIIString();
However, I haven't tested this ...
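An alternative that sidesteps both concerns is to encode each value before it is put into the template, so the structural characters of the URL are never touched. The sketch below assumes a template of the form shown; the real WEBSERVICE_WITH_CITYSTATE is not given in the question, so the class CityUrlDemo, its helpers, and the host api.example.com are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class CityUrlDemo {
    // Hypothetical template; the question does not show the real one.
    static final String WEBSERVICE_WITH_CITYSTATE =
            "http://api.example.com/weather?city=%s&state=%s";

    // Encodes one value; "+" is swapped for "%20" to match the requested output.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String buildUrl(String cityName, String stateName) {
        return String.format(WEBSERVICE_WITH_CITYSTATE,
                encodeValue(cityName), encodeValue(stateName));
    }

    public static void main(String[] args) {
        // http://api.example.com/weather?city=Overland%20Park&state=KS
        System.out.println(buildUrl("Overland Park", "KS"));
    }
}
```

The new String(url.getBytes(), "UTF8") step from the question is not needed here, because the encoder is already told to use UTF-8.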
