UTF-8 and Servlets on Tomcat/Linux - java

I've had some problems with reading and writing UTF-8 from servlets on Tomcat 6 / Linux. The request and response were UTF-8, the browser was UTF-8, and URIEncoding was set in server.xml on both connectors and hosts. In short, everything I knew of, in both the code itself and the server configuration, was set to UTF-8.
When reading a request, I had to take the byte array from the String and then convert that byte array back into a String. When writing the response, I had to write bytes, not the String itself, in order to get a proper response (otherwise I would get an exception saying some non-ASCII character is not valid ISO-8859-1).

Changing the LANG environment variable is one way to solve the problem.
The official way is to set the character encoding in a servlet filter: http://wiki.apache.org/tomcat/Tomcat/UTF-8
Some background information: http://www.crazysquirrel.com/computing/general/form-encoding.jspx
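For reference, a minimal sketch of such a filter (the class name ForceUtf8Filter is mine, not from the wiki, and it still needs a filter-mapping in web.xml):
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ForceUtf8Filter implements Filter {

    public void init(FilterConfig config) throws ServletException {
    }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        // Must run before the first getParameter()/getReader() call,
        // otherwise the container has already decoded with its default charset.
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {
    }
}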

The solution was to set the LANG environment variable to (in my case) en_US.UTF-8, or probably any other UTF-8 locale. I'm still puzzled by the fact that I couldn't do anything from code to make my servlet behave properly. If there is no way to do it, then it's a bug from my point of view.

Related

URL in Referer header is detected as using multiple encoding

Using owasp.esapi to filter incoming request parameters and headers, I'm stumbling on an issue where apparently the Referer header contains a value that is considered as using "multiple encoding".
An example:
http://123.abc.xx/xyz/input.xhtml?server=http%3A%2F%2F123.abc.xx%3A7016%2Fxyz&o=1&language=en&t=a074faf3
To me though, that URL seems to be correctly encoded, and decoding it results in a perfectly readable and correct URL.
So, can anyone explain the issue here, and how to handle this?
ESAPI reports the error when running this method on the header value:
value = ESAPI.encoder().canonicalize(value);
Output:
SEVERE: [SECURITY FAILURE] INTRUSION - Mixed encoding (2x) detected
As a matter of fact, yes. I fixed this bug in the upcoming release of ESAPI, but it will require an API change, perhaps one that might have a bug based on your data here.
In short, prior to my fix, ESAPI just ran a regex against the URI. The problem, and the slew of bug reports about it, is that URIs are not a regular language; they are a language of their own. So a URI could have parameters whose random data happened to align with known HTML entities: for example, &param=foo contains &para, which would be interpreted as the entity ¶ (the paragraph sign). There were also some issues regarding ASCII vs Unicode (non-BMP encodings).
At any rate, there will be a new method to use in the release candidate for our next library: Encoder.getCanonicalizedURI().
This will be safe to regex against, as the URI will be broken down and checked for mixed/multiple encoding. The method you're currently using is now deprecated.
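A hedged sketch of using that new method once it is available (the exact signature is an assumption based on the description above, and exception handling is omitted):
import java.net.URI;
import org.owasp.esapi.ESAPI;

// Parse the Referer into a java.net.URI first, then let ESAPI canonicalize it
// per component instead of regexing the raw header value.
String referer = request.getHeader("Referer");
if (referer != null) {
    URI uri = new URI(referer);
    String canonical = ESAPI.encoder().getCanonicalizedURI(uri);
    // validate 'canonical' from here on
}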

Characters altered by Lotus when receiving a POST through a Java WebAgent with OpenURL command

I have a Java WebAgent in Lotus Domino which runs through the OpenURL command (https://link.com/db.nsf/agentName?openagent). This agent is created for receiving a POST with XML content. Before even parsing or saving the (XML) content, the webagent saves the content into an in-memory document:
For an agent run from a browser with the OpenAgent URL command, the in-memory document is a new document containing an item for each CGI (Common Gateway Interface) variable supported by Domino®. Each item has the name and current value of a supported CGI variable. (No design work on your part is needed; the CGI variables are available automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST will be saved (by Lotus) into the request_content field. When receiving content with the character é, like:
<Name xml:lang="en">tést</Name>
the é is changed by Lotus to ?®. This is also what I see when reading out the request_content field in the document properties. Is it possible to save the é as an é, and not as ?®, in Lotus?
Solution:
The way I fixed it is via this post:
Link which helped me solve this problem
The solution, but in Java:
/****** INITIALIZATION ******/
Session session = getSession();
AgentContext agentContext = session.getAgentContext();
// Round-trip the request body through a temp file: write it out in LMBCS
// (the Notes-native charset), then read the same file back in as UTF-8.
Stream stream = session.createStream();
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();
stream.open("C:\\Temp\\test.txt", "UTF-8");
String content = stream.readText();
stream.close();
System.out.println("Content: " + content);
I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used in the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF - which is always LMBCS, and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to read REQUEST_CONTENT from a String into a byte array using getBytes("UTF-8"), and then construct a new String from the byte array using new String(bytes, "UTF-16") - that's assuming the JVM has been treating the incoming data as UTF-16. Then pass that string to NotesDocument.replaceItemValue() so the Notes API calls can interpret it correctly.
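Put as code, that recipe would look roughly like this (hedged: the UTF-8/UTF-16 pair is the answer's from-memory guess, and the item name RequestContentFixed is illustrative):
// Verify the charset pair against a hex dump of your actual bytes before trusting it.
Document doc = agentContext.getDocumentContext();
String raw = doc.getItemValueString("REQUEST_CONTENT");
byte[] bytes = raw.getBytes("UTF-8");         // the bytes as the JVM currently holds them
String fixed = new String(bytes, "UTF-16");   // reinterpret in the assumed real charset
doc.replaceItemValue("RequestContentFixed", fixed); // item name is illustrative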
I may have some details wrong here; it's been a while. Years ago I built a database that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF. In a situation like this, I recall writing code that got the byte array, converted it to hex, and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And, yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding - but it's not always foolproof. You're adding another layer of charsets into the process, and you have to get it right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.
My heart breaks looking at this. I also just passed through this hell and found the old advice, but I just could not accept writing to disk to solve this trivial matter.
// Read REQUEST_CONTENT as raw bytes and decode them as UTF-8 directly,
// with no round-trip through the file system.
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content = new String(bytes, Charset.forName("UTF-8"));
Edited in response to a comment by the OP: there is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread the OP used for his workaround)
The poster claims that when he uses a particular HTTP header, the method fails.
He was working with 8.5 and using LotusScript, though. In my case I cannot make it fail by sending an additional header (or as a function of the string argument).
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth, getValueCustomDataBytes() works only with very short payloads, and the limit depends on the content! Starting your text with an accented character such as 'é' increases the length it still works with... but whatever I tried, I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place, as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them into jvm/lib/ext. Then the code becomes:
// Interpret the item text as LMBCS bytes via ICU's charset provider,
// then decode those bytes as UTF-8.
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content = new String(bytes, Charset.forName("UTF-8"));
and yes, it will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than going through the file system? I don't know, but it kinda looks cleaner.

Japanese character encoding using Shift_JIS in Java

I have a web application which is served using tomcat.
On one of the pages, it allows users to download a file stored on my file server. The names of most of the files there are in Japanese, but when a user downloads one, the file name is garbled. It also behaves differently in different browsers.
The original code is as below:
FileInputStream in = new FileInputStream(absolutePath);
// Re-encode the Shift_JIS bytes as ISO-8859-1 so they pass through the header byte-for-byte
ResponseUtil.download(new String(downloadFileName.getBytes("Shift_JIS"), "ISO-8859-1"), in);
e.g., 08_タイヨーアクリス_装置開発_実績表 gets interpreted as
08_ƒ^ƒCƒˆ-[ƒAƒNƒŠƒX_‘•’uŠJ”-_ŽÀ-Ñ• in Google Chrome
This problem is due to the presence of the 0x5C byte (the backslash) in the Shift_JIS-encoded file name, and seems to be a known problem with Shift_JIS. I want to know the correct way to work around it.
It looks like the ResponseUtil.download method from the "Seasar sastruts" framework you're using is taking the filename you provide and sticking it directly in the Content-disposition header of the HTTP response it constructs.
response.setHeader("Content-disposition", "attachment; filename=\"" + fileName + "\"");
As far as I can tell, HTTP and MIME headers only support ASCII characters, so this technique won't work with non-ASCII characters. (If that's the case, I'd consider it a bug in this class that it unconditionally sticks the filename into the header.) Modifying or trying to re-encode the string before you pass it in won't work, because this encoding happens at a different level.
To support non-ASCII characters, the header value needs to be encoded using the MIME encoded-word technique. There's no way to do this with the ResponseUtil class as it is, because it concatenates the name you provide directly into a non-encoded-word string.
I think you'll need to rewrite that download() method to check for non-ASCII characters in the filename inputs it receives, and use encoded-word encoding on strings that contain them. You'd want the header to look something like this, where some_base64_text is the actual Base64 encoding of the bytes of your file name in Shift_JIS (or use UTF-8 instead):
Content-disposition: =?Shift_JIS?B?some_base64_text?=
There are probably a lot of different browser behaviors around this, because browsers try to work around various web servers that do it "wrong". But encoding it this way looks like a good bet for getting it working and keeping it portable.
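A sketch of building such a header value by hand with the JDK's Base64 encoder (the helper name contentDispositionFor is mine, this assumes Java 8+, and whether each browser honors the encoded-word form still needs testing):
import java.nio.charset.Charset;
import java.util.Base64;

// Builds an RFC 2047 encoded-word Content-disposition value for a filename.
static String contentDispositionFor(String fileName, String charsetName) {
    byte[] raw = fileName.getBytes(Charset.forName(charsetName));
    String b64 = Base64.getEncoder().encodeToString(raw);
    return "attachment; filename=\"=?" + charsetName + "?B?" + b64 + "?=\"";
}

// Usage: response.setHeader("Content-disposition",
//                           contentDispositionFor(downloadFileName, "UTF-8"));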
Thanks a lot.
I was able to solve the problem on Chrome using the following:
ResponseUtil.download(URLEncoder.encode(downloadFileName, "UTF-8"), in);
However, the encoding is still not proper in Firefox and Safari.
In Chrome, the file is named "08_タイヨーアクリス_装置開発_実績表.pdf"
But, on Firefox and Safari, it is named "08_%E3%82%BF%E3%82%A4%E3%83%A8%E3%83%BC%E3%82%A2%E3%82%AF%E3%83%AA%E3%82%B9_%E8%A3%85%E7%BD%AE%E9%96%8B%E7%99%BA_%E5%AE%9F%E7%B8%BE%E8%A1%A8.pdf".

UTF-8 issue in Linux

String departmentName = request.getParameter("dept_name");
// Attempted round-trip "fix": re-encode and re-decode the parameter as UTF-8
departmentName = new String(departmentName.getBytes(Charset.forName("UTF8")), "UTF8");
System.out.println(departmentName); // expected output: composés
On Windows, the displayed output is what I expected, and the record is fetched by the department-name matching criteria.
But on Linux it returns "compos??s", so my DB query fails.
Can anyone give me a solution?
Maybe because the charset name UTF8 is not the canonical one. Use UTF-8 instead; see the javadoc.
First of all, System.out.println is no good indicator for Unicode output, since you are at the mercy of the console encoding. Write the value to a file with an OutputStreamWriter, explicitly setting the encoding to UTF-8; then you can tell whether the request parameter is encoded correctly or not.
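For instance, a quick diagnostic along those lines (the dump path is illustrative):
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Dump the parameter to a UTF-8 encoded file, bypassing the console encoding.
try (Writer out = new OutputStreamWriter(
        new FileOutputStream("/tmp/param-dump.txt"), "UTF-8")) {
    out.write(departmentName);
}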
Second, there may be a database connection encoding issue. For MySQL you need to explicitly specify the encoding in the connection string; for other databases it could also be that the default system encoding is used when none is specified.
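With MySQL Connector/J, for example, that means something like the following (host, schema and credentials are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;

// Pin the connection encoding instead of inheriting the system default.
String url = "jdbc:mysql://localhost:3306/mydb"
           + "?useUnicode=true&characterEncoding=UTF-8";
Connection conn = DriverManager.getConnection(url, "user", "password");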
First of all, try to figure out the encoding you have in every particular place.
For example, the string might already have the proper encoding if your Linux system is running with a UTF-8 charset; that hack may only have been needed on Windows.
Last but not least, how do you know the value is incorrect, and that it is not your viewer that is at fault? What character set does your console or log viewer use?
You need to perform a number of checks to find out where exactly the encoding is different from what is expected at that point of your application.

Outputting International Characters from MySQL to Java/Android

Let's say someone uses this letter: ë. They input it in an EditText box and it is correctly stored in the MySQL database (via a PHP script). But grabbing that database field with that special character produces an output of "null" in Java/Android.
It appears my database is set up and storing correctly; retrieving is the issue. Do I have to fix this on the PHP side or handle it in Java/Android? EDIT: I don't believe this has anything to do with the PHP side anymore, so I am more interested in the Java side.
Sounds similar to: android, UTF8 - How do I ensure UTF8 is used for a shared preference
I suspect that the problem occurs over the web interface between the web service and the Android app. One side is sending UTF-16 or ISO-8859-1 characters, and the other is interpreting them as UTF-8 (or vice versa). Make sure:
That the web request from Android is using UTF-8
That the web service replies using UTF-8.
As in the other answer, use an HTTP debugging proxy to check that the characters being sent between the Android app and the web service are what you expect.
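For the Android side, a minimal sketch of reading the reply with an explicit charset (the URL is a placeholder, and error handling is reduced to the essentials):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Decode the web service response as UTF-8 explicitly,
// rather than relying on the platform default charset.
HttpURLConnection conn =
        (HttpURLConnection) new URL("http://example.com/service").openConnection();
BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));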
I suggest extracting your database access code into a standard Java environment, then compiling and testing it there. This will help you isolate the problem.
Usually you won't get null even if there is an encoding problem. Check for other problems, and see whether an exception is being thrown.
It is definitely not a PHP problem if you are sure the string is correctly inserted.
Probably a confusion between UTF-8 and UTF-16, or whichever other character set you might be using to store these international characters. In UTF-16, the character ë is stored as two bytes, with the first byte being the null byte (0x00). If this double byte is incorrectly transmitted back as, say, UTF-8, then the null byte will be seen as the end-of-string terminator, resulting in the output of a null value instead of the international character.
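A small demonstration of that failure mode (the decode step here is Java; in practice the truncation typically happens in C or PHP code that treats 0x00 as a string terminator):
import java.nio.charset.StandardCharsets;

public class NullByteDemo {
    public static void main(String[] args) {
        // "ë" (U+00EB) in big-endian UTF-16 is the byte pair 0x00 0xEB
        byte[] utf16 = "ë".getBytes(StandardCharsets.UTF_16BE);
        for (byte b : utf16) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
        // Misread those bytes as UTF-8: the first byte is a NUL, and anything
        // that treats NUL as end-of-string sees an empty value.
        System.out.println(new String(utf16, StandardCharsets.UTF_8));
    }
}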
First, you need to be 100% sure that your international characters are stored correctly in the database. Seeing the correct result on a PHP page of a web site is no guarantee of that, as two wrongs can give you a right. In the past, I have often seen incorrectly stored characters in a database that were still displayed correctly on a web page or some system. This will look OK until you need to access your database from another system, and at that point everything breaks loose, because you cannot repeat the same kind of errors on the second system.
