How to implement UTF-8 format in a Swing application? - java

In my Swing chat application I have a Send button, a text area, and a text field.
When I press the Send button, the text from the text field should be sent to the text area. This works fine for English but not for my local language.
Please give me some idea or some code that will help me solve this.

First of all, the internal character representation of a Java String is UTF-16, so you don't need to worry once you have the text in a String in your JVM.
The problem is probably the conversion between the sequence of bytes that gets sent over the network and a String object. When decoding bytes into a String you need to provide the encoding, e.g. when using InputStreamReader you have to pass the Charset parameter:
InputStreamReader(InputStream in, Charset cs)
Create an InputStreamReader that uses the given charset.
The encoding has to be provided, because Java can't magically guess the encoding of a byte sequence.
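For example, here is a minimal sketch of decoding an incoming byte stream as UTF-8; the class and method names are illustrative, not from the original question:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8ReadDemo {

    // Decode a byte stream as UTF-8 text. The Charset must be passed explicitly;
    // otherwise the JVM silently falls back to the platform default encoding.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulate bytes arriving over the network in a non-Latin script.
        byte[] wire = "здравей, héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(wire)));
    }
}
```

The same rule applies on the sending side: convert the String to bytes with getBytes(StandardCharsets.UTF_8) so both ends agree on the encoding.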

Related

Characters altered by Lotus when receiving a POST through a Java WebAgent with OpenURL command

I have a Java WebAgent in Lotus-Domino which runs through the OpenURL command (https://link.com/db.nsf/agentName?openagent). This agent is created for receiving a POST with XML content. Before even parsing or saving the (XML) content, the webagent saves the content into an in-memory document:
For an agent run from a browser with the OpenAgent URL command, the
in-memory document is a new document containing an item for each CGI
(Common Gateway Interface) variable supported by Domino®. Each item
has the name and current value of a supported CGI variable. (No design
work on your part is needed; the CGI variables are available
automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST will be saved (by Lotus) into the request_content field. When receiving content with this character: é, like:
<Name xml:lang="en">tést</Name>
The é is changed by Lotus to ?®. This is also what I see when reading out the request_content field in the document properties. Is it possible to have Lotus save the é as an é and not as ?®?
Solution:
The way I fixed it is via this post:
Link which helped me solve this problem
The solution, but in Java:
/****** INITIALIZATION ******/
session = getSession();
AgentContext agentContext = session.getAgentContext();

// Write REQUEST_CONTENT out to a temp file as LMBCS (Domino's native charset)...
Stream stream = session.createStream();
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();

// ...then read the same file back as UTF-8, letting Domino do the conversion.
stream.open("C:\\Temp\\test.txt", "UTF-8");
String content = stream.readText();
stream.close();
System.out.println("Content: " + content);
I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used in the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF - which is always LMBCS, and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to take the REQUEST_CONTENT String, get its raw bytes with getBytes("UTF-8"), and then construct a new String from that byte array using new String(byte[] bytes, "UTF-16"). Then pass that string to NotesDocument.ReplaceItemValue(), and the Notes API calls should interpret it correctly.
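The general shape of that re-decoding trick can be shown in plain Java, outside the Notes API. This sketch uses ISO-8859-1 → UTF-8 as an illustrative charset pair (the Domino case above involves its own charsets, so treat the pair here as an assumption for demonstration only):

```java
import java.nio.charset.StandardCharsets;

public class ReDecodeDemo {

    // If text was decoded with the wrong charset, you can sometimes recover it:
    // encode it back with the same wrong charset to get the original bytes,
    // then decode those bytes with the right charset.
    static String reDecode(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "é" in UTF-8 is 0xC3 0xA9; misread as ISO-8859-1 it shows up as "Ã©".
        String garbled = new String("é".getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled);           // Ã©
        System.out.println(reDecode(garbled)); // é
    }
}
```

This only works while the wrong decoding was lossless; if any byte was replaced with ? or a substitute character along the way, the information is gone.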
I may have some details wrong here; it's been a while. Years ago I built a database that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF here. In a situation like this, I recall writing code that got the byte array, converted it to hex, and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And, yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding - but it's not always foolproof. You're adding another layer of charsets into the process! You have to get it right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.
My heart breaks looking at this. I also just passed through this hell and found the old advice, but I just could not bring myself to write to disk to solve this trivial matter.
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content = new String(bytes, Charset.forName("UTF-8"));
Edited in response to comment by OP: There is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread that OP used for his workaround)
The poster claims that the method fails when he uses a particular HTTP header.
He was working with 8.5 and using LotusScript, though. In my case I cannot make it fail by sending an additional header (or by varying the string argument).
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth, getValueCustomDataBytes() works only with very short payloads - and it depends on the content! Starting your text with an accented character such as 'é' increases the length it still works with... but whatever I tried, I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them in jvm/lib/ext. Then the code becomes:
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content = new String(bytes, Charset.forName("UTF-8"));
and yes, it will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than passing through the file system? Don't know. But kinda looks cleaner.

Reading UTF-8 encoded text from InputStream

I'm having problems reading all Japanese/Chinese characters from an input stream.
Basically, I'm retrieving a JSON object from an API.
Below is my code:
try {
    URL url = new URL(string);
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
        result = br.readLine();
    }
} catch (IOException e) {
    e.printStackTrace(); // don't swallow the exception silently
}
For some reason, not all characters are read by the input stream. What could be the problem?
To be specific, some characters appear when I print them out in the console, while some appear as black boxes with question marks. Also, there are no black boxes with questions marks when I check the actual JSON object through a browser.
What you see when "printing to a console" really has nothing to do with whether data was read or not, but has everything to do with the capabilities of your console.
If you are fetching data from a URL, and you know for sure that the bytes you have fetched represent UTF-8 encoded text, and the entire data fits on one line of text, then there is no reason why your code should not work.
It sounds like you are not sure things work because you are trying to print text to your console. Perhaps your console is not set to render UTF-8 encoded text? Perhaps your console font does not have glyphs for those characters?
Here are two things you can try:
Instead of writing the text to your console, save it to a file. Then use a command like hexdump -C (on a *nix system, I have no idea how to do that in Windows) and look at the binary representation to make sure all your expected characters are there.
Save your data to a text file, then open it in a web browser, since browsers probably have much richer font support than a console.
If you still suspect you've read the remote data incorrectly, you can run your retrieved text through a JSON validator, just to make sure.
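If you'd rather stay in Java than reach for hexdump, a small helper can render the received bytes as hex so you can inspect exactly what arrived, independent of any console font or charset issue (the names here are my own, for illustration):

```java
import java.nio.charset.StandardCharsets;

public class HexDumpDemo {

    // Render raw bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "é" is c3 a9 in UTF-8; "日" is e6 97 a5.
        byte[] data = "é日".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(data)); // c3 a9 e6 97 a5
    }
}
```

Comparing such a dump against the expected UTF-8 byte sequences tells you immediately whether the read was correct and only the rendering is at fault.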
Try "ISO-8859-1" instead - but only if the source data really is Latin-1 encoded; a hex dump of the bytes will tell you.

MacRoman vs UTF-8

I am trying to pull byte data from a source, encrypt it, and then store it in the file system.
For encryption, I am using jasypt and the BasicTextEncryptor class. And for storing on to the file system, I am using Apache's Commons IOUtils class.
When required, these files will be decrypted and then sent to the user's browser. This system works on my local machine where the default charset is MacRoman, but it fails on the server where the default charset is UTF-8.
When I explicitly set the encoding at each stage of the process to MacRoman it works on the server as well, but I am skeptical about doing this, as the rest of my code uses UTF-8.
Is there a way that I can work the code without conversion to MacRoman?
You should just use UTF-8 everywhere.
As long as you use the same encoding at each end of an operation (and as long as the encoding can handle all of the characters you need), you'll be fine.
In your comments on another answer, you claim you're not using an encoding, but that's impossible. You're using the BasicTextEncryptor class, which according to this documentation only works on Strings and char arrays. That means that, at some point, you're converting from an encoding-agnostic byte array to an encoding-specific String or char array. That means that you're relying upon an encoding somewhere, whether you realize it or not. You need to track down where that conversion is happening and ensure it has the correct encoding.
Your question states, "When I explicitly set the encoding at each stage of the process", so you will need to know how it's encoded in the database. If that doesn't make sense, read on.
It's also possible that you are simply trying to encrypt a file that you're getting out of the database, and you don't care about the string representation; you want to treat it as plain bytes, not as text. In that case, BasicTextEncryptor ("Utility class for easily performing normal-strength encryption of texts.") is not a good fit for this task. It encrypts strings. The BasicBinaryEncryptor ("Utility class for easily performing normal-strength encryption of binaries (byte arrays).") is what you need.
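To illustrate why byte-level encryption sidesteps the charset problem entirely, here is a sketch using the JDK's own javax.crypto rather than jasypt; the key, cipher mode, and helper names are illustrative assumptions, not jasypt's internals:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteEncryptDemo {

    // Encrypt or decrypt raw bytes. No String ever enters the pipeline,
    // so no charset (MacRoman, UTF-8, or otherwise) can corrupt the data.
    static byte[] crypt(int mode, byte[] key, byte[] data) {
        try {
            // ECB is used only to keep the demo short; prefer AES/GCM in practice.
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(key, "AES"));
            return cipher.doFinal(data);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII); // 16-byte AES key
        byte[] plain = {(byte) 0xC3, (byte) 0xA9, 0x00, 0x7F}; // stand-in for source bytes
        byte[] encrypted = crypt(Cipher.ENCRYPT_MODE, key, plain);
        byte[] decrypted = crypt(Cipher.DECRYPT_MODE, key, encrypted);
        // The round trip is byte-exact because no encoding step was involved.
        System.out.println(Arrays.equals(plain, decrypted)); // true
    }
}
```

This is the same design idea as jasypt's BasicBinaryEncryptor: keep the data as byte[] from source to file system, and the default charset of the machine becomes irrelevant.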

Android, mysql, and rendering non Latin Characters as well as Latin?

Are these squares a representation of chinese characters being turned into unicode?
EDIT:[Here I entered the squares with numbers inside them into the post but they didn't render]
I'd like to either turn this back into the original characters when displayed in android (or to enable mysql to just store them as chinese characters not in unicode???)
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"), 8);
While debugging it shows the strings value as
"\u001a\u001a\u001a\u001a"
byte[] bytes = chinesestringfromdatabase.getBytes();
turns it into
"[26, 26, 26, 26]"
String fresh = new String(bytes, "UTF-8");
and then this turns it back into
EDIT:[Here I entered the squares with numbers inside them into the post but they didn't render]
My phone can display chinese text.
MySQL charset: UTF-8 Unicode (utf8)
While typing my question I realize that perhaps I have the wrong charset all together.
I'm lost as to whether or not my issue will even be anything coding related or if it is just related to a setting or if php cannot handle the character set??
I'd like to store and render multiple language character sets that could contain a mixture of languages.
Here I entered the squares with numbers inside them into the post but they didn't render
With "squares with numbers inside", do you mean the same as those which you also see for some exotic languages somewhere at the bottom of the Wikipedia homepage when browsing with Firefox? (In other browsers - MSIE, Chrome, Safari, etc. - you would see only meaningless empty squares.)
If so, it simply means that there are no glyphs available for those characters in the font which the web browser/viewer has been instructed to use.
I'd like to store and render multiple language character sets that could contain a mixture of languages.
Use UTF-8 all the way. Only keep in mind that MySQL's utf8 charset only supports the BMP plane of Unicode (max 3 bytes per character), not the supplementary planes (4 bytes per character). So the SMP plane (which contains "special" CJK characters) is out of range for MySQL.
References
PHP UTF-8 cheatsheet
Unicode - How to get characters right? (for Java web developers)
MySQL reference - Unicode support
What were the numbers in the boxes? I'm guessing they were 001A? Like ?
(SO will usually filter these out as they're ASCII control characters, typically invisible in other browsers.)
While debugging it shows the strings value as "\u001a\u001a\u001a\u001a"
Well clearly there's no Chinese or any text to be recovered there. Any informational content in the original string has been lost.
Whilst I agree that you need to be using UTF-8 throughout (which for PHP means serving the form page with a UTF-8 <meta> tag, using mysql_set_charset('utf8'), and creating your MySQL tables with UTF-8 collations), I think you must have a more serious corruption problem than just UTF-8-vs-other-ASCII-compatible-encoding if you are somehow getting just identical control characters instead of a text string.
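A quick way to confirm that diagnosis in Java is to check whether the string consists entirely of U+001A SUB(stitute) characters, which some conversion layers emit when a character cannot be represented (the helper name here is mine):

```java
public class SubCharCheck {

    // 0x1A is the ASCII SUB(stitute) control character. Once a string is all
    // SUBs, the original text is unrecoverable and must be re-fetched with
    // correct charsets configured end to end.
    static boolean isAllSubstitutes(String s) {
        return !s.isEmpty() && s.chars().allMatch(c -> c == 0x1A);
    }

    public static void main(String[] args) {
        System.out.println(isAllSubstitutes("\u001a\u001a\u001a\u001a")); // true
        System.out.println(isAllSubstitutes("中文"));                      // false
    }
}
```

If this returns true for your database value, no amount of re-decoding on the Android side will help; the corruption happened earlier, most likely in the PHP/MySQL layer.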

How do I decipher garbled/gibberish characters in my networking program

I am working on a client-server networking app in Java SE. I am using strings terminated by a newline from client to server, and the server responds with a null-terminated string.
In the output window of Netbeans IDE I am finding some gibberish characters amongst the strings that I send and receive.
I can't figure out what these characters are; they mostly look like a rectangular box. When I paste a line containing such a character into Notepad++, all the characters following and including that character disappear.
How can I find out what characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render some character code that it does not understand. This may be a real character (e.g. a Japanese character, mathematical symbol or the like) or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
If you understand character set naming and have some ideas what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old Wordpad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ASCII. Make sure you are writing the exact number of bytes to the socket, not some nice number like 4096. It would be best if you posted your code so we can help you find the error(s).
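As a sketch of "reading the exact number of bytes", here is one way to consume a null-terminated response without converting leftover buffer garbage into text. It assumes the protocol really uses a 0x00 terminator and UTF-8 payloads, which the question does not actually specify:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class NullTerminatedReader {

    // Read bytes up to (but not including) the 0x00 terminator, then decode.
    // Blindly decoding a whole fixed-size buffer instead is a common source
    // of "box" characters: the trailing bytes are uninitialized garbage.
    static String readNullTerminated(InputStream in) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            int b;
            while ((b = in.read()) > 0) { // stop at 0x00 or end of stream
                buf.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulated server response followed by leftover buffer bytes.
        byte[] wire = {'o', 'k', 0, 'x', 'x'};
        System.out.println(readNullTerminated(new ByteArrayInputStream(wire))); // ok
    }
}
```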
