Let's say someone uses this letter: ë. They input it in an EditText box and it is correctly stored in the MySQL database (via a PHP script). But grabbing that database field with that special character produces an output of "null" in Java/Android.
It appears my database is set up and storing correctly, but retrieving is the issue. Do I have to fix this on the PHP side or handle it in Java/Android? EDIT: I don't believe this has anything to do with the PHP side anymore, so I am more interested in the Java side.
Sounds similar to: android, UTF8 - How do I ensure UTF8 is used for a shared preference
I suspect that the problem occurs over the web interface between the web service and the Android App. One side is sending UTF-16 or ISO 8859-1 characters, and the other is interpreting it as UTF-8 (or vice versa). Make sure:
That the web request from Android is using UTF-8
That the web service replies using UTF-8.
As in the other answer, use an HTTP debugging proxy to check that the characters being sent between the Android app and the web service are what you expect.
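For illustration only (not the poster's code), here is a minimal sketch of reading the web service response with an explicit UTF-8 reader instead of relying on the platform default charset; the URL is a placeholder:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class Utf8Fetch {
    public static String fetch(String serviceUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(serviceUrl).openConnection();
        // Ask the service for UTF-8 and decode the body as UTF-8 explicitly.
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            for (String line; (line = reader.readLine()) != null; ) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}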
I suggest extracting your database access code into a standard Java environment, then compiling and testing it there. This will help you isolate the problem.
Usually you won't get null even if there is an encoding problem. Check for other problems and see whether some other exception is being thrown.
It is definitely not a PHP problem if you are sure the string is correctly inserted.
Probably a confusion between UTF-8 and UTF-16 or any other character set that you might be using for storing these international characters. In UTF-16, the character ë will be stored as two bytes, with the first byte being the null byte (0x00). If this double byte is incorrectly transmitted back as, say, UTF-8, then the null byte will be seen as the end-of-string terminator, resulting in the output of a null value instead of the international character.
First, you need to be 100% sure that your international characters are stored correctly in the database. Seeing the correct result in a PHP page on a web site is not a guarantee of that, as two wrongs can give you a right. In the past, I have often seen incorrectly stored characters in a database that were still displayed correctly on a web page or some system. This will look OK until you need to access your database from another system, and at that point everything breaks loose because you cannot repeat the same kind of errors on the second system.
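As a quick illustration of the byte-level difference described above (my own snippet, not from the question), ë encodes to two non-null bytes in UTF-8 but to a null byte followed by 0xEB in UTF-16BE:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingPeek {
    public static void main(String[] args) {
        String s = "ë"; // U+00EB
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));    // [-61, -85] = 0xC3 0xAB
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, -21]   = 0x00 0xEB
    }
}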
Related
I have a Java web agent in Lotus Domino which runs through the OpenAgent URL command (https://link.com/db.nsf/agentName?openagent). This agent is created for receiving a POST with XML content. Before even parsing or saving the (XML) content, the web agent saves the content into an in-memory document:
For an agent run from a browser with the OpenAgent URL command, the in-memory document is a new document containing an item for each CGI (Common Gateway Interface) variable supported by Domino®. Each item has the name and current value of a supported CGI variable. (No design work on your part is needed; the CGI variables are available automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST will be saved (by Lotus) into the request_content field. When receiving content with this character: é, like:
<Name xml:lang="en">tést</Name>
The é is changed by Lotus to a ?®. This is also what I see when reading out the request_content field in the document properties. Is it possible to save the é as a é and not a: ?® in Lotus?
Solution:
The way I fixed it is via this post:
Link which helped me solve this problem
The solution, but in Java:
/****** INITIALIZATION ******/
Session session = getSession();
AgentContext agentContext = session.getAgentContext();
// Write REQUEST_CONTENT out to a temporary file using the LMBCS charset
// (the character set Domino itself stores text in)...
Stream stream = session.createStream();
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();
// ...then read the same file back in as UTF-8, which gives back the é intact.
stream.open("C:\\Temp\\test.txt", "UTF-8");
String content = stream.readText();
stream.close();
System.out.println("Content: " + content);
I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used in the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF - which is always LMBCS, and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to read REQUEST_CONTENT from a String into a byte array using getBytes("UTF-8"), and then construct a new String from the byte array using new String(byte[] bytes, "UTF-16"). Then pass that string to NotesDocument.ReplaceItemValue(), and the Notes API calls should interpret it correctly.
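Roughly, that round trip would look like the sketch below. This is only my reading of the recollection above, not tested code; the concrete charsets (UTF-8 out, UTF-16 back in) and the item name ParsedContent are assumptions that may need adjusting for your server:
import lotus.domino.Document;
import java.nio.charset.StandardCharsets;

// Inside the agent, with agentContext already obtained:
Document doc = agentContext.getDocumentContext();
String garbled = doc.getItemValueString("REQUEST_CONTENT");     // as Domino handed it to Java
byte[] raw = garbled.getBytes(StandardCharsets.UTF_8);          // undo the (assumed) wrong decode
String repaired = new String(raw, StandardCharsets.UTF_16);     // redo with the (assumed) right charset
doc.replaceItemValue("ParsedContent", repaired);                // Notes converts Unicode to LMBCS on save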
I may have some details wrong here. It's been a while. Years ago I built a database that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF here. In a situation like this, I recall writing code that got the byte array, converted it to hex and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And, yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding - but it's not always foolproof. You're adding another layer of charsets into the process! You have to get it right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.
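A minimal sketch of that idea on the Java side (my own example, assuming you can get at the raw bytes of the POST body): feed the bytes, not a pre-decoded String, to the XML parser so it can honour the encoding declared in the XML prolog:
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class XmlBytes {
    public static org.w3c.dom.Document parse(byte[] requestBytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The parser reads the <?xml ... encoding="..."?> declaration (or the BOM)
        // from the byte stream and chooses the right charset itself.
        return builder.parse(new ByteArrayInputStream(requestBytes));
    }
}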
My heart breaks looking at that write-to-disk workaround. I also just passed through this hell and found the old advice, but... I just could not write to disk to solve this trivial matter.
// Read REQUEST_CONTENT as raw bytes and decode them as UTF-8 directly, no temp file needed
// (requires java.nio.charset.Charset):
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content = new String(bytes, Charset.forName("UTF-8"));
Edited in response to comment by OP: There is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread that OP used for his workaround)
The guy claims that when he uses a particular HTTP header the method fails.
Now, he was working with 8.5 and using LotusScript. In my case I cannot make it fail by sending an additional header (or as a function of the string argument).
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth, getValueCustomDataBytes() works only with very short payloads, and it depends on the content! Starting your text with an accented character such as 'é' will increase the length it still works with... but whatever I tried, I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them in jvm/lib/ext. Then the code becomes:
// Re-encode the item text into LMBCS bytes via ICU, then decode those bytes as UTF-8
// (requires com.ibm.icu.charset.CharsetICU and java.nio.charset.Charset):
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content = new String(bytes, Charset.forName("UTF-8"));
and yes, you will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than passing through the file system? Don't know. But kinda looks cleaner.
I have a website, and need to store data from a text field into a mysql database.
The frontend is Perl. I used utf8::encode to encode the data as UTF-8.
The request is made to the Java backend which connects to the mysql db and inserts this text.
For the table the default charset is set to utf8.
This works in many cases, but it fails in some cases.
If I use テスト, the data stored in the database shows question marks: ã??ã?¹ã??.
If I try to insert the utf8 encoded string directly from the sql browser, everything works fine.
Update events set summary = ãã¹ã where event_id = 11657;
While inserting I noticed there are some blank characters that show up in the mysql query browser, something like: ã ã¹ ã.
After inserting from here, the data in the database shows some boxes instead of these spaces, and テスト displays correctly on the website after UTF-8 decoding it.
The problem is only when I insert directly from the website: these special characters come up as question marks in the database.
Can someone please help me with these special characters? Do I need to handle them differently?
We also faced a similar issue in one of our projects, so we had to write a small routine to convert those UTF-8 characters into HTML-encoded form and store that in the database.
Use StringEscapeUtils from Apache Commons Lang:
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
If the database really stored テスト, that's what you should see in the sql browser instead of mojibake.
It sounds like the Java backend is interpreting what Perl sends as ISO-8859-1 rather than UTF-8. This explains how テ gets converted into \u00E3\u0083\u0086. Then the backend tries to send the data to the database in Windows-1252 - the MySQL default encoding. Unfortunately Windows-1252 cannot represent the Unicode characters in the range \u0080-\u009F, so the last two characters are replaced by question marks.
So you have two problems:
You should make the Java backend read the request in UTF-8 rather than in ISO-8859-1.
The backend should use UTF-8 when talking with the database. The easiest way to do this is adding characterEncoding=utf8 to the connection parameters. (A sketch of both fixes follows.)
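A hedged sketch of both fixes on the Java side; it assumes the backend is a servlet, and the host, database name and credentials are placeholders:
import java.sql.Connection;
import java.sql.DriverManager;
import javax.servlet.http.HttpServletRequest;

public class Utf8Backend {
    // 1. Decode the incoming request body as UTF-8 before reading any parameter.
    public static String readParam(HttpServletRequest request, String name) throws Exception {
        request.setCharacterEncoding("UTF-8");
        return request.getParameter(name);
    }

    // 2. Tell the MySQL driver to talk to the server in UTF-8.
    public static Connection openConnection() throws Exception {
        String url = "jdbc:mysql://localhost:3306/events"
                + "?useUnicode=true&characterEncoding=utf8";
        return DriverManager.getConnection(url, "user", "password");
    }
}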
I'm assuming that you are sending POST parameters.
I think that the most likely cause of your initial problem is one of the following:
If the parameters are being sent in the HTTP request body, your Perl front-end is probably not setting the encoding in the content-type header of the request. The webserver is probably assuming ISO-8859-1. The solution to this is to set the request content type properly (see the sketch after this answer).
If the parameters are sent in the HTTP request URL, your web server is using the wrong character set when decoding the request parameters. The solution to this is going to be web-server specific ...
It sounds like there might also be a character set problem in talking to the database, but that might just be a consequence of earlier mangling.
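The poster's front-end is Perl, but purely as an illustration of the point about the content-type header, this is the shape of a POST that declares its charset and encodes the body to match (URL and parameter name are placeholders):
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PostUtf8 {
    public static void send(String endpoint, String summary) throws Exception {
        String body = "summary=" + URLEncoder.encode(summary, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setDoOutput(true);
        // Declare the charset so the server does not fall back to ISO-8859-1.
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        conn.getResponseCode(); // actually send the request
    }
}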
I am trying to encode Arabic text from a Web service. Currently the values come as question marks (???).
I have read many blogs (even stackoverflow answers/links) but nothing seems to work.
Any idea of how I can resolve this issue?
Thanks
If you use Dreamweaver's design view and paste your Arabic text there, you will get ASCII characters in Dreamweaver's code view, which will work in any web browser.
First, an important aside: check that the web service you are consuming sends you actual Arabic characters and not actual question marks. Check a network dump if you are not sure, and use wget/curl to perform a simple transaction; check the results.
If the raw data as sent by the WS is question marks, you have an uphill battle - try again and fiddle with the Accept/Accept-Charset headers. If all of that fails, it may be that the server itself isn't coded properly and there ain't much you can do after that...
Also, note that what you're trying to do is decode the text, i.e. convert it from a byte representation to abstract characters.
This has been the problem with sending UTF-8 data from Android. Your code would work fine, except that you will have to encode your String to Base64. On the server side, PHP just decodes the Base64 string back. It worked for me. I can share the code if you need it.
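A minimal sketch of that approach, not the answerer's actual code (on older Android, android.util.Base64 plays the same role as java.util.Base64):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Encode the UTF-8 bytes to Base64 before sending; base64_decode() on the PHP side restores them.
String original = "ë";
String wireValue = Base64.getEncoder().encodeToString(original.getBytes(StandardCharsets.UTF_8)); // "w6s="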
I need to send a string to server. That string is having some special characters.
Example,
String abc = "ABC Farmacéutica Corporation";
When I am sending it, it is converted into ABC Farmace#utica Corporation.
I tried using UTF-8 encoding; it gives the output ABC+Farmac%C3%A9utica+Corporation.
Please suggest how I should convert the data on the Java side.
It entirely depends on how the server is set up to receive the string in the first place. Your second example is applying URL-encoding using UTF-8 where required, by the look of things. That might be appropriate - or it might not.
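For reference, the standard call below reproduces exactly the second output shown in the question; whether the server can undo it depends on how it decodes parameters:
import java.net.URLEncoder;

String encoded = URLEncoder.encode("ABC Farmacéutica Corporation", "UTF-8");
System.out.println(encoded); // ABC+Farmac%C3%A9utica+Corporation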
If the data is within XML, for example, you shouldn't need to do anything special - whatever XML API you use should handle all of this transparently.
If you can give more details about the protocol you're using to talk to the server, we may be able to help more.
In the past I have experienced different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no 3rd parties allowed), we made a compromise and all our Java applications manually encode Unicode characters. LinkedIn's API is returning "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I am asking it here as well is quite simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but it is mostly trying to find an answer to the general encoding problem described below.
As you can see, my last name contains the letter ž, which should be \u017e, but Java (or LinkedIn's API, for that matter) returns \u009e in the JSON response and nothing in the XML response. PHP's json_decode() ignores it and my last name becomes Kurida.
After some investigation, I found that ž apparently has two representations, 9e and 17e. What exactly is going on here? Is there a solution for this problem?
U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.
The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
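This is easy to verify at the byte level (my own snippet, not from the answer):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCheck {
    public static void main(String[] args) {
        byte[] b = { (byte) 0x9E };
        System.out.println(new String(b, Charset.forName("windows-1252")));              // ž (U+017E)
        System.out.println((int) new String(b, StandardCharsets.ISO_8859_1).charAt(0));  // 158 = U+009E
    }
}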
(The confusion comes from the fact that if you write the character reference &#x9E; in an HTML page, the browser doesn't actually give you character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it, so we're stuck with it now. Except in proper XHTML served as XML, since that has to follow the more sensible XML rules.)
Looking at the forum page, the JSON-reading is clearly not wrong: your name is registered as being "David Kurid[U+009E]a". However, how that data got into their system needs looking at.