converting a String to UTF8 format

converting a String to UTF8 format - java

I java code, I am having a string name = "örebro"; // its a swedish character.
But when I use this name in web application. I print some special character at 'Ö' character.
Is there anyway I can use the same character as it is in "örebro".
I did some thing like this but does not worked.
String name = "örebro";
byte[] utf8s = name .getBytes("UTF-8");
name = new String(utf8s, "UTF-8");
But the name at the end prints the same, something like this. �rebo
Please guide me

The Java code you've provided is pointless, it will do nothing. Java Strings are already perfectly capable of encoding any character (though you have to be careful with literals in the source code, as they depend on the encoding the compiler uses, which is platform-dependant).
Most likely your problem is that your webpage does not declare the encoding correctly in the HTTP header or the HTML meta tags.

You need to set the encoding of your output to UTF8.

It is likely the browser that reads the page does not know the encoding.
send the header (before any other output) something in Java like ServletResponse resource; (...)resource.setContentType ("text/html;charset=utf-8");
in your html page, mention the encoding by sending (printing)<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

If the page used to generate the output is jsp it's useful to precise
<%# page contentType="text/html; charset=utf-8" %>

Related

Is Explicit decoding required here?

Say I am displaying escaped value in HTML with below code under text area:
<c:out value="${person.name}" />
My question do I need to decode this value at server side manually or browser will do it automatically ?

No, you need not to decode this value manually .. All you need is:
Specify your HTTP response content type encoding as UTF-8. To be precise use HttpServletResponse.setContentType ("text/html;charset=utf-8");.
Your JSP should have content type encoding set as UTF-8 in your JSP .. To be precise add this meta tag in your JSP and you should be good to go <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
When you have this tag in your JSP then browser will understand that content of this page should be render as per UTF-8 encoding rules.
If don't specify page encoding explicitly using these kind of meta tags or some other mechanism then browser use default encoding associated with it while page rendering and you may not see expected result especially for characters from Unicode's advanced blocks of BMP and Supplementary Multilingual Plane. Check this on how to see the default encoding of browser.
Concept
Server should specify desired encoding scheme in "response stream" and same encoding scheme should be used in JSP/ASP/HTML page.
Server side encoding options
PHP
header('Content-type: text/html; charset=utf-8');
Perl
print "Content-Type: text/html; charset=utf-8\n\n";
Python
Use the same solution as for Perl (except that you don't need a semicolon at the end).
Java Servlets
resource.setContentType ("text/html;charset=utf-8");
JSP
<%# page contentType="text/html; charset=UTF-8" %>
ASP and ASP.Net
<%Response.charset="utf-8"%>
Client side encoding options
Use following meta tag in your HTML page <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
Further reading:
HTTP-charset
This answer

when I get the request.parameter for the escaped input (done thru) <c:out value="${person.name}" />, I get the escaped value and store it in db as it is. For example :- <script>test</script> is stored as <script>test</script> Now when value is fetched from DB and displayed on browser, it renders it correctly i.e <script>test</script> is displayed as <script>test</script>

Wrong encoding in java servlet (tomcat)

I am trying to setup the right encoding for my JSP/servlet pages in Tomcat 7. Though, I have to be successful yet. I made some tries from the suggestions given by this stackexchange thread: Character encoding JSP -displayed wrong in JSP but not in URL: "á » Ã¡ é » Ã©", but they didn't work.
The curious fact lies on the fact that if I let the pages "as is" the browser recognise them as having the encoding Windows-CP 1252 and when I change for UTF-8 the text is displayed correctly. But applying filters and other mechanisms the browser put the encoding as UTF-8 and is not possibile to display it correctly. In fact for the latter if I change the encoding the results are horrible at minimum.

I got it right now. In pages JSP I am putting as first instruction:
<%# page pageEncoding="utf-8" %>
This fixes all problems. Other possibilities like to put response.setCharacterEncoding( "UTF-8" ) as first instruction don't work.
In relation to servlets I need to setup the character encoding before to get the PrintWriter object:
response.setContentType("text/html");
response.setCharacterEncoding("UTF-8");
PrintWriter out = response.getWriter();
These things have solved my problem of strange characters. To sum up: The problem was that the response coming out from JSP/servlet didn't have pointed that itself was encoded in UTF-8

Maybe is not a JSP problem. Have you tried doing that in the page, directly?
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
</head>
Also, try to save the page in UTF-8 format

request.getCharacterEncoding() returns NULL... why?

A coworker of mine created a basic contact-us type form, which is mangling accented characters (è, é, à, etc). We're using KonaKart a Java e-commerce platform on Struts 1.
I've narrowed the issue down to the data coming in through the HttpServletRequest object. Comparing a similar (properly functioning) form, I noticed that on the old form the request object's Character Encoding (request.getCharacterEncoding()) is returned as "UTF-8", but on the new form it is coming back as NULL, and the text coming out of request.getParameter() is already mangled.
Aside from that, I haven't found any significant differences between the known-good form, and the new-and-broken form.
Things I've ruled out:
Both HTML pages have the tag: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Both form tags in the HTML use POST, and do not set encodings
Checking from Firebug, both the Request and Response headers have the same properties
Both JSP pages use the same attributes in the <%#page contentType="text/html;charset=UTF-8" language="java" %> tag
There's nothing remotely interesting going on in the *Form.java files, both inherit from BaseValidatorForm
I've checked the source file encodings, they're all set to Default - inherited from Container: UTF-8
If I convert them from ISO-8859-1 to UTF-8, it works great, but I would much rather figure out the core issue.
eg: new String(request.getParameter("firstName").getBytes("ISO-8859-1"),"UTF8")
Any suggestions are welcome, I'm all out of ideas.

Modern browsers usually don't supply the character encoding in the HTTP request Content-Type header. It's in case of HTML form based applications however the same character encoding as specified in the Content-Type header of the initial HTTP response serving the page with the form. You need to explicitly set the request character encoding to the same encoding yourself, which is in your case thus UTF-8.
request.setCharacterEncoding("UTF-8");
Do this before any request parameter is been retrieved from the request (otherwise it's too late; the server platform default encoding would then be used to parse the parameters, which is indeed often ISO-8859-1). A servlet filter which is mapped on /* is a perfect place for this.
See also:
Unicode - How to get the characters right?

The request.getCharacterEncoding() relies on the Content-Type request attribute, not Accept-Charset
So application/x-www-form-urlencoded;charset=IS08859_1 should work for the POST action. The <%#page tag doesn't affect the POST data.

jsp pages displaying junk characters in non english languages

I have a Main JSP page say jsp1 which includes two JSP pages (jsp2, jsp3). All the strings in these pages come from property files.
The non-english property files are converted using native2ascii
native2ascii –encoding="8859-1" lang.properties lang1.properties
All the JSP pages have
<%# page contentType="text/html;charset=UTF-8" language="java" %>
Now when main jsp page(jsp1) gets displayed, we see garbled characters in a few strings of jsp2 and jsp3. Till now I have seen this happening to Russian, Korean, Japanese language strings. And it happens on a random string.
Does any one have an idea what could be wrong
Updating with more details
The string in rus_utf8.proeperties is
Щелкните <strong>УСТАНОВИТЬ СЕЙЧАС</strong> и сохраните файл в некотором расположении
After Conversion using native2Ascii, String in rus.properties is
\u0429\u0435\u043b\u043a\u043d\u0438\u0442\u0435 <strong>\u0423\u0421\u0422\u0410\u041d\u041e\u0412\u0418\u0422\u042c \u0421\u0415\u0419\u0427\u0410\u0421</strong> \u0438 \u0441\u043e\u0445\u0440\u0430\u043d\u0438\u0442\u0435 \u0444\u0430\u0439\u043b \u0432 \u043d\u0435\u043a\u043e\u0442\u043e\u0440\u043e\u043c \u0440\u0430\u0441\u043f\u043e\u043b\u043e\u0436\u0435\u043d\u0438\u0438.
In JSP we use struts <s:text> to load the string from property file
In firefox the string got displayed as
��елкните УСТАНОВИТЬ СЕЙЧАС и сохраните файл в некотором расположении.
The char Щ got garbled. Same String in some other place in the page got displayed properly.

The non-english property files are converted using native2ascii
native2ascii –encoding="8859-1" lang.properties lang1.properties
This is invalid. It should have been
native2ascii –encoding ISO-8859-1 lang.properties lang1.properties
Apart from the syntax error which you have there (which should immediately have aborted native2ascii), the ISO-8859-1 encoding can impossibly be correct for Russian, Korean and Japanese strings. The ISO-8859-1 encoding does not cover those characters at all. Assuming that you saved it as UTF-8, then you should be using
native2ascii –encoding UTF-8 lang.properties lang1.properties
This way the native2ascii will convert from an UTF-8 lang.properties to an ISO-8859-1 compatible lang1.properties. The native2ascii will always convert to ASCII. The -encoding attribute concerns the encoding of the source file, not the target file.
As to the JSP pages, just a
<%#page contentType="text/html; charset=UTF-8" %>
ought to be sufficient, per http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8.
See also:
Unicode - How to get the characters right?
Update as per your update with the examples. Everything is actually working right. It only look much like that the UTF-8 BOM (Byte Order Mark) is the culprit. Notepad adds it by default. Try creating the properties file in another editor instead like Eclipse.

I have the same problem as you and I also tried the similar solutions but they didn't work. Hence I suspected that it may not be an issue with JSP config but rather it is a config issue with my tomcat.
I found this on a Chinese site, https://openhome.cc/Gossip/Encoding/Servlet.html: request.setCharacterEncoding("UTF-8");. It worked for me. I added this before my request.getParameter();.

Java String Encoding to UTF-8

I have some HTML code that I store in a Java.lang.String variable. I write that variable to a file and set the encoding to UTF-8 when writing the contents of the string variable to the file on the filesystem. I open up that file and everything looks great e.g. → shows up as a right arrow.
However, if the same String (containing the same content) is used by a jsp page to render content in a browser, characters such as → show up as a question mark (?)
When storing content in the String variable, I make sure that I use:
String myStr = new String(bytes[], charset)
instead of just:
String myStr = "<html><head/><body>→</body></html>";
Can someone please tell me why the String content gets written to the filesystem perfectly but does not render in the jsp/browser?
Thanks.

but does not render in the jsp/browser?
You need to set the response encoding as well. In a JSP you can do this using
<%# page pageEncoding="UTF-8" %>
This has actually the same effect as setting the following meta tag in HTML <head>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">

Possibilities:
The browser does not support UTF-8
You don't have Content-Type: text/html; charset=utf-8 in your HTTP Headers.

The lazy developer (=me) uses Apache Common Lang StringEscapeUtils.escapeHtml http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringEscapeUtils.html#escapeHtml(java.lang.String) which will help you handle all 'odd' characters. Let the browser do the final translation of the html entities.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.