request.getCharacterEncoding() returns NULL... why? - java

A coworker of mine created a basic contact-us type form, which is mangling accented characters (è, é, à, etc). We're using KonaKart, a Java e-commerce platform, on Struts 1.
I've narrowed the issue down to the data coming in through the HttpServletRequest object. Comparing a similar (properly functioning) form, I noticed that on the old form the request object's Character Encoding (request.getCharacterEncoding()) is returned as "UTF-8", but on the new form it is coming back as NULL, and the text coming out of request.getParameter() is already mangled.
Aside from that, I haven't found any significant differences between the known-good form, and the new-and-broken form.
Things I've ruled out:
Both HTML pages have the tag: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Both form tags in the HTML use POST, and do not set encodings
Checking from Firebug, both the Request and Response headers have the same properties
Both JSP pages use the same attributes in the <%@page contentType="text/html;charset=UTF-8" language="java" %> tag
There's nothing remotely interesting going on in the *Form.java files, both inherit from BaseValidatorForm
I've checked the source file encodings, they're all set to Default - inherited from Container: UTF-8
If I convert them from ISO-8859-1 to UTF-8, it works great, but I would much rather figure out the core issue.
eg: new String(request.getParameter("firstName").getBytes("ISO-8859-1"),"UTF8")
Any suggestions are welcome, I'm all out of ideas.

Modern browsers usually don't supply the character encoding in the HTTP request Content-Type header. For HTML form-based applications, however, it is the same character encoding as the one specified in the Content-Type header of the initial HTTP response that served the page with the form. You need to explicitly set the request character encoding to that same encoding yourself, which in your case is UTF-8.
request.setCharacterEncoding("UTF-8");
Do this before any request parameter is retrieved from the request (otherwise it's too late; the server platform default encoding would then be used to parse the parameters, which is indeed often ISO-8859-1). A servlet filter mapped on /* is a perfect place for this.
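To see why the order matters, here is a minimal sketch using only the JDK. It assumes the browser percent-encodes "é" as the UTF-8 bytes C3 A9, and it uses java.net.URLDecoder as a stand-in for the container's parameter parsing (the class and method names are hypothetical, not part of the servlet API).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodingDemo {
    // Decodes one percent-encoded form value with the given charset,
    // roughly what a container does when it parses POST parameters.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // The workaround from the question: recover the original bytes,
    // then decode them as UTF-8.
    static String repair(String mangled) {
        try {
            return new String(mangled.getBytes("ISO-8859-1"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "%C3%A9"; // "é" as submitted from a UTF-8 page
        String wrong = decode(raw, "ISO-8859-1"); // two characters of mojibake
        String right = decode(raw, "UTF-8");      // the single character "é"
        System.out.println(wrong.length() + " vs " + right.length());
        System.out.println(repair(wrong).equals(right));
    }
}
```

The repair step only works because ISO-8859-1 maps every byte to exactly one character, so no information is lost in the wrong decode. Setting the request encoding up front avoids the round trip entirely.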
See also:
Unicode - How to get the characters right?

request.getCharacterEncoding() relies on the Content-Type request header, not on Accept-Charset.
So application/x-www-form-urlencoded;charset=ISO8859_1 should work for the POST action. The <%@page tag doesn't affect the POST data.
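As a rough illustration of why the method returns null, the sketch below extracts the charset parameter from a Content-Type value. The helper charsetOf is hypothetical and is not the container's actual implementation; it only mirrors the idea that no charset parameter in the header means no encoding to report.

```java
import java.util.Locale;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset parameter of a Content-Type
    // header value, or null when the header carries none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // What browsers typically send for a form POST: no charset parameter.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
        // With an explicit charset parameter.
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```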

Related

Rendered JSP markup has garbled UTF-8 characters

I modified the embedded-jetty project to create a stand-alone jsp-viewer (one file with full source code). The result works fine, but it has a problem displaying JSPs that contain special glyphs. The problem is not that the Content-Type is not set when transmitting the markup, but that the rendered markup is garbled (in view-source or via curl). The JSP files must be read using the wrong character encoding, but starting the JVM with -Dfile.encoding=UTF-8 does nothing.
These strings
Butikknavn – et smartere valg
få ekstra fordeler når
become
Butikknavn â<80><93> et smartere valg
fÃ¥ ekstra fordeler nÃ¥r
Edit: Just to state the obvious, the content header is already set, as can be seen from the raw HTTP response
Content-Type:text/html;charset=utf-8
You have to add
<%@ page pageEncoding="UTF-8" %>
to your JSP file(s).
The -Dfile.encoding=UTF-8 option should do the pageEncoding="UTF-8" part for the whole Jetty instance; regrettably, as you've mentioned, it doesn't. You might also try adding <page-encoding>UTF-8</page-encoding> to your web.xml (as described here), but I've never tried it.
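The garbling in the question is consistent with UTF-8 source bytes being decoded as ISO-8859-1. A minimal sketch of that effect, assuming the JSP compiler falls back to ISO-8859-1 when no page encoding is declared (the class and method names are made up for illustration):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageEncodingDemo {
    // Decodes the raw bytes of a source file with the charset the reader
    // assumes, which is what a page compiler does with its page encoding.
    static String readAs(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    public static void main(String[] args) {
        // The en dash from "Butikknavn – et smartere valg", as stored in a
        // UTF-8 file: bytes E2 80 93.
        byte[] onDisk = "\u2013".getBytes(StandardCharsets.UTF_8);

        String wrong = readAs(onDisk, StandardCharsets.ISO_8859_1);
        String right = readAs(onDisk, StandardCharsets.UTF_8);

        // Three characters (U+00E2, U+0080, U+0093) instead of one.
        System.out.println(wrong.length());
        System.out.println(right.length());
    }
}
```

The three characters produced by the wrong decode correspond to the â<80><93> sequence shown in the question.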
Your HTTP response is probably missing the content-type header. Try adding one as follows:
Content-Type: text/html; charset=utf-8

Is Explicit decoding required here?

Say I am displaying escaped value in HTML with below code under text area:
<c:out value="${person.name}" />
My question: do I need to decode this value manually on the server side, or will the browser do it automatically?
No, you don't need to decode this value manually. All you need is:
Specify UTF-8 as the encoding in your HTTP response content type. To be precise, use HttpServletResponse.setContentType("text/html;charset=utf-8");.
Set the content type encoding to UTF-8 in your JSP as well. To be precise, add this meta tag to your JSP and you should be good to go: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
With this tag in your JSP, the browser knows that the content of the page should be rendered according to UTF-8 encoding rules.
If you don't specify the page encoding explicitly with this kind of meta tag or some other mechanism, the browser falls back to its default encoding when rendering the page, and you may not see the expected result, especially for characters from Unicode's advanced blocks of the BMP and the Supplementary Multilingual Plane. Check this on how to see the browser's default encoding.
Concept
The server should specify the desired encoding in the response stream, and the same encoding should be used in the JSP/ASP/HTML page.
Server side encoding options
PHP
header('Content-type: text/html; charset=utf-8');
Perl
print "Content-Type: text/html; charset=utf-8\n\n";
Python
Use the same solution as for Perl (except that you don't need a semicolon at the end).
Java Servlets
resource.setContentType ("text/html;charset=utf-8");
JSP
<%@ page contentType="text/html; charset=UTF-8" %>
ASP and ASP.Net
<%Response.charset="utf-8"%>
Client side encoding options
Use the following meta tag in your HTML page: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
Further reading:
HTTP-charset
This answer
When I read the request parameter for the escaped input (rendered through <c:out value="${person.name}" />), I get the escaped value and store it in the DB as is. For example, <script>test</script> is stored as &lt;script&gt;test&lt;/script&gt;. When the value is fetched from the DB and displayed in the browser, it renders correctly, i.e. &lt;script&gt;test&lt;/script&gt; is displayed as <script>test</script>.
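For reference, the escaping that <c:out> applies by default can be approximated in plain Java. This is a hand-written sketch rather than the JSTL implementation, and escapeXml here is a hypothetical helper; it replaces the five characters that <c:out> escapes when escapeXml is left at its default of true.

```java
class EscapeDemo {
    // Hypothetical helper approximating the default <c:out> escaping:
    // <, >, &, " and ' are replaced by character entities.
    static String escapeXml(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&#034;"); break;
                case '\'': out.append("&#039;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The browser turns the entities back into visible characters when it
        // renders the page, so no manual decoding step is needed.
        System.out.println(escapeXml("<script>test</script>"));
    }
}
```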

Submitting POST form data in UTF-8

I have a form in which the user enters his name in Chinese, but when I do
String strName = request.getParameter("name");
I get strName as some meaningless characters. As a solution I tried
request.setCharacterEncoding("UTF-8");
before reading any parameters from the request object. This worked. What I want to know is how to achieve this in HTML/JavaScript. I have tried the
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but this doesn't work. Any help?
It should work if you define the charset in the Content-Type request header. I believe it is usually not defined by default. For example if you use jQuery to make the request you have to add "charset=utf-8" to the contentType option: http://api.jquery.com/jQuery.ajax/
There's no way to force a web browser to send Content-Type for every request, so it's better to always call setCharacterEncoding.
What you do client-side affects how the data is sent. Apparently the data is already sent UTF-8 encoded, since the problem was in reading it server-side. So adding the meta tag (though OK) has no effect on this.
In pure HTML you need <form enctype="multipart/form-data" accept-charset="UTF-8"> to submit UTF-8 from the browser.

Why aren't UTF-8 characters being rendered correctly in this web page (generated with JSoup)?

I'm having trouble dealing with charsets while parsing and rendering a page using the JSoup library. Here is an example of the page it renders:
http://dl.dropbox.com/u/13093/charset-problem.html
As you can see, where there should be ' characters, ? is being rendered instead (even when you view the source).
This page is being generated by downloading a web page, parsing with JSoup, and then re-rendering it again having made some structural changes.
I'm downloading the page as follows:
final Document inputDoc = Jsoup.connect(sourceURL.toString()).get();
When I create the output document I do so as follows:
outputDoc.outputSettings().charset(Charset.forName("UTF-8"));
outputDoc.head().appendElement("meta").attr("charset", "UTF-8");
outputDoc.head().appendElement("meta").attr("http-equiv", "Content-Type")
.attr("content", "text/html; charset=UTF-8");
Can anyone offer suggestions as to what I'm doing wrong?
edit: Note that the source page is http://blog.locut.us/ and as you'll see, it appears to render correctly
The question marks are typical whenever you write characters to the outputstream of the response which are not covered by the response's character encoding. You seem to be relying on the platform default character encoding when serving the response. The response Content-Type header of your site also confirms this by a missing charset attribute.
Assuming that you're using a servlet to serve the modified HTML, then you should be using HttpServletResponse#setCharacterEncoding() to set the character encoding before writing the modified HTML out.
response.setCharacterEncoding("UTF-8");
response.getWriter().write(html);
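The question-mark symptom can be reproduced with the JDK alone. In this sketch the curly apostrophe U+2019 is an assumed example of a character the page contains, and ISO-8859-1 stands in for whatever default encoding the response falls back to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Encodes text to bytes and decodes it again with the same charset,
    // mimicking a writer and a reader that agree on the encoding.
    static String encodeThenDecode(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String apostrophe = "\u2019"; // right single quotation mark

        // ISO-8859-1 has no mapping for U+2019, so the encoder substitutes
        // a literal '?' byte and the original character is lost for good.
        System.out.println(encodeThenDecode(apostrophe, StandardCharsets.ISO_8859_1));

        // UTF-8 covers every Unicode character, so the text survives intact.
        System.out.println(encodeThenDecode(apostrophe, StandardCharsets.UTF_8).equals(apostrophe));
    }
}
```

Because the substitution happens at encoding time, the question marks are already in the bytes sent over the wire, which matches seeing them in view-source.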
The problem is most likely in reading the input page; you need to have the correct encoding for the source too.

converting a String to UTF8 format

In Java code, I have a string name = "örebro"; // it contains a Swedish character.
But when I use this name in a web application, some special character is printed in place of the 'ö' character.
Is there any way I can display the character as it is in "örebro"?
I tried something like this, but it did not work.
String name = "örebro";
byte[] utf8s = name.getBytes("UTF-8");
name = new String(utf8s, "UTF-8");
But the name at the end prints the same, something like this. �rebo
Please guide me
The Java code you've provided is pointless; it will do nothing. Java Strings are already perfectly capable of representing any character (though you have to be careful with literals in the source code, as they depend on the encoding the compiler uses, which is platform-dependent).
Most likely your problem is that your webpage does not declare the encoding correctly in the HTTP header or the HTML meta tags.
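That the round trip from the question changes nothing can be checked directly. A small sketch with made-up class and method names; the string is written with a Unicode escape so the result does not depend on the compiler's source encoding.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes them again, as in the question.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Renders bytes as uppercase hex so the encoded form can be inspected.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u00f6rebro"; // "örebro"

        // The round trip yields an equal string: nothing was converted.
        System.out.println(utf8RoundTrip(name).equals(name));

        // The encoding only matters once the string becomes bytes:
        // "ö" is C3 B6 in UTF-8 but the single byte F6 in ISO-8859-1.
        System.out.println(hex("\u00f6".getBytes(StandardCharsets.UTF_8)));
        System.out.println(hex("\u00f6".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```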
You need to set the encoding of your output to UTF-8.
It is likely that the browser reading the page does not know the encoding.
Send the header (before any other output), in Java something like ServletResponse resource; (...) resource.setContentType("text/html;charset=utf-8");
In your HTML page, declare the encoding by printing <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If the page that generates the output is a JSP, it's useful to specify
<%@ page contentType="text/html; charset=utf-8" %>
