jsp pages displaying junk characters in non english languages

jsp pages displaying junk characters in non english languages - java

I have a Main JSP page say jsp1 which includes two JSP pages (jsp2, jsp3). All the strings in these pages come from property files.
The non-english property files are converted using native2ascii
native2ascii –encoding="8859-1" lang.properties lang1.properties
All the JSP pages have
<%# page contentType="text/html;charset=UTF-8" language="java" %>
Now when main jsp page(jsp1) gets displayed, we see garbled characters in a few strings of jsp2 and jsp3. Till now I have seen this happening to Russian, Korean, Japanese language strings. And it happens on a random string.
Does any one have an idea what could be wrong
Updating with more details
The string in rus_utf8.proeperties is
Щелкните <strong>УСТАНОВИТЬ СЕЙЧАС</strong> и сохраните файл в некотором расположении
After Conversion using native2Ascii, String in rus.properties is
\u0429\u0435\u043b\u043a\u043d\u0438\u0442\u0435 <strong>\u0423\u0421\u0422\u0410\u041d\u041e\u0412\u0418\u0422\u042c \u0421\u0415\u0419\u0427\u0410\u0421</strong> \u0438 \u0441\u043e\u0445\u0440\u0430\u043d\u0438\u0442\u0435 \u0444\u0430\u0439\u043b \u0432 \u043d\u0435\u043a\u043e\u0442\u043e\u0440\u043e\u043c \u0440\u0430\u0441\u043f\u043e\u043b\u043e\u0436\u0435\u043d\u0438\u0438.
In JSP we use struts <s:text> to load the string from property file
In firefox the string got displayed as
��елкните УСТАНОВИТЬ СЕЙЧАС и сохраните файл в некотором расположении.
The char Щ got garbled. Same String in some other place in the page got displayed properly.

The non-english property files are converted using native2ascii
native2ascii –encoding="8859-1" lang.properties lang1.properties
This is invalid. It should have been
native2ascii –encoding ISO-8859-1 lang.properties lang1.properties
Apart from the syntax error which you have there (which should immediately have aborted native2ascii), the ISO-8859-1 encoding can impossibly be correct for Russian, Korean and Japanese strings. The ISO-8859-1 encoding does not cover those characters at all. Assuming that you saved it as UTF-8, then you should be using
native2ascii –encoding UTF-8 lang.properties lang1.properties
This way the native2ascii will convert from an UTF-8 lang.properties to an ISO-8859-1 compatible lang1.properties. The native2ascii will always convert to ASCII. The -encoding attribute concerns the encoding of the source file, not the target file.
As to the JSP pages, just a
<%#page contentType="text/html; charset=UTF-8" %>
ought to be sufficient, per http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8.
See also:
Unicode - How to get the characters right?
Update as per your update with the examples. Everything is actually working right. It only look much like that the UTF-8 BOM (Byte Order Mark) is the culprit. Notepad adds it by default. Try creating the properties file in another editor instead like Eclipse.

I have the same problem as you and I also tried the similar solutions but they didn't work. Hence I suspected that it may not be an issue with JSP config but rather it is a config issue with my tomcat.
I found this on a Chinese site, https://openhome.cc/Gossip/Encoding/Servlet.html: request.setCharacterEncoding("UTF-8");. It worked for me. I added this before my request.getParameter();.

Related

Internationalization using resource bundle properties in JSP, non-Latin text becomes Mojibake

I have the following index.jsp:
<%# taglib prefix="fmt" uri="http://java.sun.com/jsp/jstl/fmt" %>
<%# page contentType="text/html;charset=UTF-8" language="java" %>
<fmt:setLocale value="ru_RU"/>
<fmt:setBundle basename="messages"/>
<html>
<head>
<title></title>
</head>
<body>
<h1><fmt:message key="login"/></h1>
</body>
</html>
And property file messages_ru_RU.properties:
login = Логин
The problem is that I get the junk unicode characters in the output:
Ëîãèí
Update
Changed the .properies file encoding to UTF-8.
The latest output: ÐÐ¾Ð³Ð¸Ð½
Help me, please, to change this to the normal cyrillic letters.
Property file:
messages_ru_RU.properties

Properties files are as per specification read using ISO-8859-1.
... the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes as defined in section 3.3 of The Java™ Language Specification; only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings.
So, any character which is not covered by the ISO-8859-1 range needs to be escaped in the Unicode escape sequences \uXXXX. You can use the JDK-supplied native2ascii tool to convert them. You can find it in JDK's /bin folder.
Here's an example assuming that foo_utf8.properties is the one which you saved using UTF-8 and that foo.properties is the one which you'd like to use in your application:
native2ascii –encoding UTF-8 foo_utf8.properties foo.properties
In your particular case, the property in question would then be converted to:
login = \u041B\u043E\u0433\u0438\u043D
This can then be successfully read and displayed in a JSP page with the below minimum #page configuration:
<%# page pageEncoding="UTF-8" %>
(the remainder you had is irrelevant as those are the defaults already when above is set)
If you're using a Java-aware IDE such as Eclipse, then you can just use its builtin properties file editor which should automatically be associated with .properties files in a Java-faceted project. If you use this editor instead of the plain text editor / source editor, then it'll automatically escape the characters which are not covered by the ISO-8859-1 range.
See also:
Unicode - How to get the characters right?
How to internationalize a Java web application?

Image showing to change to unicode
I had same problem with hindi language, so i changed my pageEncoding to UTF-8 and have saved file with Unicode encoding. Since i have given unicodes in .properties file. This worked for me.

Since Java SE properties files are loaded in UTF-8 encoding.
See https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm and https://stackoverflow.com/a/46926020/548473

JSP mysql and utf8

I am trying to store data encoded in greek through my JSP page in a mysql database using an insert statement. I have set the mysql arrays collation to utf 8. I have already used the
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
statements and I have made the proper modification to the server.xml of the tomcat server ...
Any other ideas???

set collation greek_general_ci.
ALTER TABLE <table name> CONVERT TO CHARACTER SET utf8 COLLATE greek_general_ci;
EDIT:
In MySQL Workbench right click on table->Alter Table--> change Collation to greek_general_ci or greekbin

It seems to me that you are trying to read UTF-8 characters (data is inputed as UTF-8 as you stated) using the Greek charset or other encoding that do not have equivalent Unicode chars.
Question marks or equivalent are shown when the byte (or bytes) do not have any association with the encoding you are using. You are parsing bytes that your client does not understand as valid, there is no representation.
I recommend you to read this very helpful article before you continue:
For example, you could encode the Unicode string for Hello (U+0048
U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding,
or the Hebrew ANSI Encoding, or any of several hundred encodings that
have been invented so far, with one catch: some of the letters might
not show up! If there's no equivalent for the Unicode code point
you're trying to represent in the encoding you're trying to represent
it in, you usually get a little question mark: ? or, if you're really
good, a box. Which did you get? -> �
Also the problem is not in your database or table/column encoding. It would only apply if you were using stored procedures for example.
Make sure your browser is operating in UTF-8 when inputing or showing information. Using Chrome you can go to Tools > Encoding, the Unicode UTF-8 should be set after you open your JSP in the browser. You can also debug the request/response in the Network tab. If you are inputing data as UTF-8, read it as UTF-8.
You dont need to change Tomcat config or use the HttpFilter to set encoding if you are doing it in the JSP. I was able to simulate your problem using only the following config:
ISO-8859-7 is a Greek Encoding that I used to read unicode and simulate your problem
<%# page language="java" contentType="text/html; charset=ISO-8859-7" pageEncoding="ISO-8859-7"%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-7">
</head>
</html>
If this does not help I ask you to post complete details and source code of your application.

Character is corrupted when use request.getParameter() in java

I have a web page where I do a search based on the text given in a text box. This text can be in any language like japanese, chinese etc (or any mbcs character).
Now when i enter a text in japanese (or any other mbcs character), the result populates the screen (form) with some wierd characters.
For Example: testテスト will turn into testãã¹ã.
When i see the post parameters in Firebug (debugging tool) i can see that the search string goes as testテスト however when i put debug statements in my code, i can see that request.getParameter("searchString") is not able to identify the japanese characters and turn them into some wierd chars.
My JSP header already has <%# page contentType="text/html; charset=UTF-8"
I have also tried putting pageEncoding="UTF-8" in this but it didn't help.
I tried setting character encoding like request.setCharacterEncoding("UTF-8") also just before doing request.getParameter but that too didn't work for me.
After going through a few forums and blogs i also tried setting useBodyEncodingForURI=true in the <Connector> of tomcat config but that also did not help me.
Can anybody suggest me something to resolve this issue?

set the following encoding in every servlet/ action
response.setContentType("UTF-8");
response.setCharacterEncoding("UTF-8");
request.setCharacterEncoding("UTF-8");
also set following in first servlet/action
for japanese
response.setCharacterEncoding("UTF-8");
request.setCharacterEncoding("UTF-8");
session.setAttribute(Globals.LOCALE_KEY, new Locale("jp", "ja_JP"));

Java utf-8 encoding from url

I have a problem with some symbols in UTF-8 encoding.
I am reading the index.html from http://wordki.pl to get the list of sets of words with their name.
it looks like this
THE NAME<span>(20)</span><img src="krecha.png">
and when THE NAME has "Ł" it doesent work and puts there "??" but the "??" is not a sign that i can change with replaceAll("str", "str") because my console just doesent show the char hidden behind it.
But when i view the source in chrome/firefox etc it shows "Ł".
And all the other funny signs like "ó, ł, ą, ś" work fine in my program.
So I am asking if there is a way to change the "??" into "Ł" ? I tried encoding it byte by byte but then i lose all the other signs like "ó, ł, ą" etc.
EDIT: Ok i have the problem solved
I needed to save my *.java file as UTF-8 : O

You should set the page content-type as "UTF-8"
Do something like this:
request.getCharacterEncoding() = ISO-8859-1
response.getCharacterEncoding() = UTF-8
request.getParameter("query") = dÃ©jeuner
OR
if(null == request.getCharacterEncoding())
request.setCharacterEncoding(encoding);
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
Refer this URL for more info:
How to get UTF-8 working in Java webapps?
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

converting a String to UTF8 format

I java code, I am having a string name = "örebro"; // its a swedish character.
But when I use this name in web application. I print some special character at 'Ö' character.
Is there anyway I can use the same character as it is in "örebro".
I did some thing like this but does not worked.
String name = "örebro";
byte[] utf8s = name .getBytes("UTF-8");
name = new String(utf8s, "UTF-8");
But the name at the end prints the same, something like this. �rebo
Please guide me

The Java code you've provided is pointless, it will do nothing. Java Strings are already perfectly capable of encoding any character (though you have to be careful with literals in the source code, as they depend on the encoding the compiler uses, which is platform-dependant).
Most likely your problem is that your webpage does not declare the encoding correctly in the HTTP header or the HTML meta tags.

You need to set the encoding of your output to UTF8.

It is likely the browser that reads the page does not know the encoding.
send the header (before any other output) something in Java like ServletResponse resource; (...)resource.setContentType ("text/html;charset=utf-8");
in your html page, mention the encoding by sending (printing)<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

If the page used to generate the output is jsp it's useful to precise
<%# page contentType="text/html; charset=utf-8" %>

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.