Japanese character encoding using Shift_JIS in Java

Japanese character encoding using Shift_JIS in Java - java

I have a web application which is served using tomcat.
On one of the pages, it allows the users to download a file stored on my file server. The names of most of the files present there are in Japanese. However, when the user downloads the file, the name of the file is garbled. Also, it works differently on different browsers.
The original code is as below:
FileInputStream in = new FileInputStream(absolutePath);
ResponseUtil.download(new String(downloadFileName.getBytes("Shift_JIS"), "ISO-8859-1"), in);
e.g., 08_タイヨーアクリス_装置開発_実績表 gets interpreted as
08_ƒ^ƒCƒˆ-[ƒAƒNƒŠƒX_‘•’uŠJ”-_ŽÀ-Ñ• in Google Chrome
This problem is due to the presence of '5c' in the file name and seems to be a known problem in Shift_JIS. I want to know the correct way to work around this problem.

It looks like the ResponseUtil.download method from the "Seasar sastruts" framework you're using is taking the filename you provide and sticking it directly in the Content-disposition header of the HTTP response it constructs.
response.setHeader("Content-disposition", "attachment; filename=" + fileName + "\"");
As far as I can tell, HTTP and MIME headers only support ASCII characters, so this technique won't work with non-ASCII characters. (If this is the case, I'd consider it a bug in this class that it unconditionally sticks the filename in to the header.) Modifying or trying to re-encode the string before you pass it in won't work, because this encoding is at a different level.
To support non-ASCII characters, the header value needs to be encoded using the MIME encoded-word technique. There's no way to do this with that ResponseUtil class as it is, because it concatenates the name you provide directly in to a non-encoded-word string.
I think you'll need to rewrite that download() method to check for non-ASCII characters in the filename inputs it receives, and use encoded-word encoding on strings that contain them. You'd want it to look something like this, where some_base64_text is the actual base-64 encoding of the bytes of your file name encoded as Shift-JIS. (Or use UTF-8 instead.)
Content-disposition: =?Shift_JIS?B?some_base64_text?=
There's probably a lot of different browser behaviors around this, because they're trying to work around various web servers that are doing it "wrong". But it looks like encoding it this way is a good bet for getting it working and making it portable.

Thanks a lot.
I was able to solve the problem on Chrome using the following:
ResponseUtil.download(URLEncoder.encode(downloadFileName, "UTF-8"), in);
However, the encoding is still not proper in Firefox and Safari.
In Chrome, the file is named "08_タイヨーアクリス_装置開発_実績表.pdf"
But, on Firefox and Safari, it is named "08_%E3%82%BF%E3%82%A4%E3%83%A8%E3%83%BC%E3%82%A2%E3%82%AF%E3%83%AA%E3%82%B9_%E8%A3%85%E7%BD%AE%E9%96%8B%E7%99%BA_%E5%AE%9F%E7%B8%BE%E8%A1%A8.pdf".

Related

Characters altered by Lotus when receiving a POST through a Java WebAgent with OpenURL command

I have a Java WebAgent in Lotus-Domino which runs through the OpenURL command (https://link.com/db.nsf/agentName?openagent). This agent is created for receiving a POST with XML content. Before even parsing or saving the (XML) content, the webagent saves the content into a in-memory document:
For an agent run from a browser with the OpenAgent URL command, the
in-memory document is a new document containing an item for each CGI
(Common Gateway Interface) variable supported by Domino®. Each item
has the name and current value of a supported CGI variable. (No design
work on your part is needed; the CGI variables are available
automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST will be saved (by Lotus) into the request_content field. When receiving content with this character: é, like:
<Name xml:lang="en">tést</Name>
The é is changed by Lotus to a ?®. This is also what I see when reading out the request_content field in the document properties. Is it possible to save the é as a é and not a: ?® in Lotus?
Solution:
The way I fixed it is via this post:
Link which help me solve this problem
The solution but in Java:
/****** INITIALIZATION ******/
session = getSession();
AgentContext agentContext = session.getAgentContext();
Stream stream = session.createStream();
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();
stream.open("C:\\Temp\\test.txt", "UTF-8");
String Content = stream.readText();
stream.close();
System.out.println("Content: " + Content);

I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used in the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF - which is always LMBCS, and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to read REQUEST_CONTENT as UTF-8 from a String into a byte array by using getBytes("UTF-8") and then construct a new String from the byte array using new String(byte[] bytes, "UTF-16"). That's assuming that Then pass that string to NotesDocument.ReplaceItemValue() so the Notes API calls should interpret it correctly.
I may have some details wrong here. It's been a while. I built a database a long time ago that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters years ago. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF here. In a situation like this, I recall writing code that got the byte array and converted it to hex and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And, yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding - but it's not always foolproof. You're adding another layer of charsets into the process! You have to get it right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.

my heart breaks looking at this. I also just passed through this hell, found the old advice, but... I just could not write to disk to solve this trivial matter.
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content= new String (bytes, Charset.forName("UTF-8"));
Edited in response to comment by OP: There is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread that OP used for his workaround)
the guy claims that when he uses a particular http header the method fails.
Now he was working with 8.5 and using LS. In my case I cannot make it fail by sending an additional header (or in function of the string argument)
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth getValueCustomDataBytes() works only with very short payloads. Dependent on content! Starting your text with an accented character such as 'é' will increase the length it still works with... But whatever I tried I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them in jvm/lib/ext. Then the code becomes:
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content= new String (bytes, Charset.forName("UTF-8"));
and yes, will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than passing through the file system? Don't know. But kinda looks cleaner.

UTF-8 issue in linux

String departmentName = request.getParameter("dept_name");
departmentName = new String(departmentName.getBytes(Charset.forName("UTF8")),"UTF8");
System.out.println(departmentName);//O/p: composés
In windows, the displayed output is what I expected and it is also fetching the record on department name matching criteria.
But in Linux it is returning "compos??s", so my DB query fails.
Can anyone give me solution?

Maybe because the Charset UTF8 doesn't exist. You must use UTF-8. See the javadoc.

First of all, using unicode output with System.out.println is no good indicator since you are dealing with console encoding. Open the file with OutputStreamWriter, explicite setting encoding to UTF-8, then you can say if the request parameter in encoded correctly or not.
Second, there may be database connection encoding issue. For MySQL you need to explicite specify encoding in connection string, as for other, it could also be, that the default system encoding is taken, when not specified.

First of all, try to figure out the encoding you have in every particular place.
For example, the string might already have the proper encoding, if your Linux system is running with UTF-8 charset; that hack was maybe only needed on Windows.
Last but not least, how do you know it is incorrect? And it is not your viewer that is incorrect? What character set does your console or log viewer use?
You need to perform a number of checks to find out where exactly the encoding is different from what is expected at that point of your application.

Encoding problems exporting file

I'm trying to find out what has happen in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the as400. There, text is stored in the encoding EBCDIC. Exporting the data to ANSI file and then read as windows-1257. ASCII-characters works fine and some Lithuanian does, but the rest looks like crap with chars like ~, ¶ and ].
Example string going thou the pipe
Start file
Tuskulënö
as400
Tuskulënö
EAA9A9596
34224335A
exported file (after conversion to windows-1257)
Tuskulėnö
expected result for exported file
Tuskulėnų
Any ideas?
Regards,
Karl

EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
A similar problem exists with ANSI: when used for an encoding it refers to a Windows default encoding. Unfortunately the default encoding of a Windows installation can vary based on the locale configured.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family, the "normal" English one s Windows-1252).
Once you actually know what encoding you have and want at each point, you can go towards the second step: fixing it.
My personal preference for this kind of problems is this: Have only one step where encodings are converted: take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary convert UTF-8 to some other encoding in the last step (but avoid this if possible).

UTF-8 and Servlets on Tomcat/Linux

I've had some problems with reading and writing UTF-8 from servlets on Tomcat 6 / Linux. request and response were utf-8, browser was utf-8, URIEncoding was set in server.xml on both connectors and hosts. Ins short, every known thing for me in code itself, and server configuration was utf-8.
When reading request, I've had to take byte array from String, and then convert that byte array into String again. When writing request I've had to write bytes, not String itself, in order to get proper response (otherwise I would get an exception that says some non ASCII character is not valid ISO 8859-1).

Changing the LANG environment variable is one way to solve the problem.
The official way is to set the character encoding in a sevlet filter: http://wiki.apache.org/tomcat/Tomcat/UTF-8
Some background information: http://www.crazysquirrel.com/computing/general/form-encoding.jspx

Solution was to set LANG environmental variable to (in my case) en_US.UTF-8, or probably any other UTF-8 locale. I'm still puzzled with the fact, that I couldn't do anything from code to make my servlet behave properly. If there is no way to do it, than it's a bug from my point of view.

Character Encoding Trouble - Java

I've written a little application that does some text manipulation and writes the output to a file (html, csv, docx, xml) and this all appears to work fine on Mac OS X. On windows however I seem to get character encoding problems and a lot of '"' seems to disappear and be replaced with some weird stuff. Usually the closing '"' out of a pair.
I use a FreeMarker to create my output files and there is a byte[] array and in one case also a ByteArrayStream between reading the templates and writing the output. I assume this is a character encoding problem so if someone could give me advise or point me to some 'Best Practice' resource for dealing with character encoding in java.
Thanks

There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.
Typical problematic spots in the Java API are:
new String(byte[])
String.getBytes()
FileReader, FileWriter
All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/Writer unfortunately don't allow, so you have to use an InputStreamReader/Writer).
However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like it's one that inserts "smart quotes", which are part of the Windows-specific cp1251 encoding but don't exist in the more global ISO-8859-1 encoding.
What you probably need to do is to be aware which encoding your templates are saved in, and configure your template engine to use that encoding when reading in the templates. Also be aware that some texxt files, specifically XML, explicitly declare the encoding in a header, and if that header disagrees with the actual encoding used by the file, you'll invariable run into problems.

You can control which encoding your JVM will run with by supplying f,ex
-Dfile.encoding=utf-8
for (UTF-8 of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:
java -Dfile.encoding=utf-8 my.MainClass

Running the JVM with a 'standard' encoding via the confusing named -Dfile.encoding will resolve a lot of problems.
Ensuring your app doesn't make use of byte[] <-> String conversions without encoding specified is important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications)
If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.

I had to make sure that the OutputStreamWriter uses the correct encoding
OutputStream out = ...
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);
Plus if you use a ByteArrayOutputStream also make sure to call toString with the correct encoding:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.