Cloudant java non-latin characters

I am having difficulty using the Cloudant Java client with Greek characters. Saving objects that include Strings with Greek characters seems to work, as they appear correctly in the Cloudant console. Below is a minimal test case. The DummyClass has a String name, an _id and a _rev.
String password = "xxxx";
CloudantClient client = new CloudantClient("xx", "xxx", password);
Database database = client.database("mydatabase", false);
DummyClass dummyobject = new DummyClass();
dummyobject.setName("ά έ ό ύ αβγδεζηθικλμνξ");
Response saveResponse = database.save(dummyobject);
String id = saveResponse.getId();
String result=new String();
DummyClass loaded = database.find(DummyClass.class,id);
result = result+"Object:"+loaded.getName()+"\n"; //Prints out garbage
result = result+"UTF-8:"+new String(loaded.getName().getBytes(),Charset.forName("utf-8"))+"\n"; //Prints most characters correct, except for some accented ones
InputStream inputStream = database.find(id);
DummyClass loadedFromStream = Json.fromJson(Json.parse(inputStream), DummyClass.class);
result = result+"From Stream:"+loadedFromStream.getName(); //prints out fine
return ok(result);
By retrieving the stream and using Jackson to deserialize, the output is correct, but then I'd have to implement many of the provided methods for views, bulk document manipulation etc.
Perhaps the problem is in the LightCouch library, specifically here: CouchDbClientBase.java, since that is the point that I have found differs between the two implementations (get() as object and as stream). However, I do not know how to confirm, fix or work around it.

We fixed this in release 1.1.0, I think:
https://github.com/cloudant/java-cloudant/releases/tag/1.1.0
[FIX] Fixed handling of non-ASCII characters when the platform's default charset is not UTF-8.

The problem was indeed in the LightCouch library. Making the following change, and the corresponding changes in the code for views, fixed it.
return getGson().fromJson(new InputStreamReader(in), classType);
to
return getGson().fromJson(new InputStreamReader(in, Charset.forName("UTF-8")), classType);
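The effect of that one-argument versus two-argument InputStreamReader constructor can be reproduced without Cloudant or LightCouch. The following is only a minimal sketch: the class name CharsetDecodeDemo and the helper readAll are made up for illustration, and ISO-8859-1 stands in for a platform default charset that is not UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeDemo {
    /** Reads the whole stream as text, decoding the bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String greek = "\u03AC \u03AD \u03CC"; // the accented Greek letters from the question
        byte[] wire = greek.getBytes(StandardCharsets.UTF_8); // JSON on the wire is UTF-8

        // Explicit UTF-8: the text survives the round trip.
        String good = readAll(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        // Assumed non-UTF-8 default (ISO-8859-1 here): each byte becomes its own char.
        String bad = readAll(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(greek)); // true
        System.out.println(bad.equals(greek));  // false
    }
}
```

Passing the charset explicitly makes the result independent of the file.encoding the JVM happens to start with.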

Related

trying to figure out what kind of unicode should i have

I'm working on a Spring Boot project that fetches data from the database and then sends it through an HTTP POST request. Everything is fine with Latin text, but the data in my database is encoded as ISO 8859-6. I have tried encoding it as UTF-8 and UTF-16, but it still comes back as unreadable text: question marks and special characters.
Test example in Arabic:
مرحبا
To be valid after the POST, it should look like this:
06450631062d06280627
I can't figure out what kind of encoding happened here. I'm now doing an integration from .NET to Java.
This is what they used in .NET:
public static String UnicodeStr2HexStr(String strMessage)
{
byte[] ba = Encoding.BigEndianUnicode.GetBytes(strMessage);
String strHex = BitConverter.ToString(ba);
strHex = strHex.Replace("-", "");
return strHex;
}
I just need to know what kind of encoding happened here so I can apply it in Java, and it would be helpful if someone could show me a way to do it.
I have tried this, but it returns a different value:
String encodedWithISO88591 = "مرحبا";
String decodedToUTF8 = new String(encodedWithISO88591.getBytes("ISO-8859-1"), "UTF-8");

What you're looking for is the hex representation of the Arabic String in UTF-16BE:
String yourVal = "مرحبا";
System.out.println(DatatypeConverter.printHexBinary(yourVal.getBytes(StandardCharsets.UTF_16BE)));
The output will be:
06450631062D06280627
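DatatypeConverter lives in javax.xml.bind, which is no longer bundled with the JDK from Java 11 onwards. If that is a concern, the same conversion can be sketched with plain java.lang classes. The class and method names below (Utf16Hex, toUtf16BeHex) are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf16Hex {
    /** Returns the upper-case hex form of the string's UTF-16BE bytes (no byte order mark). */
    static String toUtf16BeHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes do not print as 8 hex digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // The Arabic word from the question, written with Unicode escapes
        // so the source file encoding cannot interfere.
        String word = "\u0645\u0631\u062D\u0628\u0627";
        System.out.println(toUtf16BeHex(word)); // 06450631062D06280627
    }
}
```

This mirrors the .NET code in the question: Encoding.BigEndianUnicode is UTF-16BE, and BitConverter.ToString followed by removing the dashes yields the same upper-case hex digits.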

Character not displaying in html

I am having trouble displaying the "velar nasal" character (ŋ)(but I assume the same problem would arise with other rare characters).
I have a MySQL table which contains a word with this character.
When my code retrieves it to display in my HTML page, it is displayed as a question mark.
I have tried a number of things:
1) Tried using MySQL's CONVERT to convert the retrieved string to UTF-8 because I understood that the string is stored in my table as "Latin1":
SELECT CONVERT(Name USING utf8)
Instead of:
SELECT Name
This did not help. Moreover, when I hard-coded the problematic word ("Yolŋu") as a String in my Java code and passed it through the rest of the code, the problem still occurred (i.e. the problem does not lie in the character encoding that my DB uses).
2) I also tried creating a new String from bytes:
new String(name.getBytes("UTF-8"));
The String is being passed from java to the html via a JSONObject that is passed to a javascript file:
Relevant JSON code:
JSONArray names = new JSONArray();
for (int iD: iDs)
{
JSONObject nameData = new JSONObject();
String name = NameDB.getNameName(iD);
nameData.put("label", name);
nameData.put("value", iD);
names.put(nameData);
}
return names;
Relevant servlet code:
response.setContentType("application/json");
try (PrintWriter out = response.getWriter())
{
out.print(namesJSONArray);
}
Relevant js code:
An ajax call to the servlet is made via jquery ui's autocomplete "source" option.
I am pretty new to coding in general and very new to the character encoding topic.
Thank you.

First, in Java a String should already hold correct Unicode, so new String(string.getBytes(...), ...) is a hack with its own troubles.
1. The database
It would be nice if the database held the text in UTF-8. The encoding can be set at the database, table and column level. The first step is to investigate how the text is stored. A table dump (mysqldump) would be the least error-prone way.
If you can use UTF-8, it must be set for MySQL on the database engine, and for the data transfer in the Java driver.
In any case, you can check a round trip through JDBC in Java by writing a table field and reading it back, and also by reading the existing troublesome field.
Dump the code points of the string:
String dump(String s) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
int cp = s.codePointAt(i);
if (32 < cp && cp < 128) {
sb.append((char) cp);
} else {
sb.append("U+").append(Integer.toHexString(cp));
}
sb.append(' ');
i += Character.charCount(cp);
}
return sb.toString();
}
2. The output
Here probably lies the error. Call at the beginning:
response.setCharacterEncoding("UTF-8");
... response.getWriter(); // Now converts java's Unicode text to UTF-8.
For HTML a charset specification is in order too. Especially when the HTML page is saved to the file system, the encoding header would be lost.
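Why the page shows a question mark can be reproduced outside a servlet container. The sketch below is an illustration only: WriterEncodingDemo and write are made-up names, and it assumes the response writer was encoding with ISO-8859-1, which is the default the Servlet specification prescribes when no character encoding is set.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    /** Writes text through a PrintWriter using the given charset and returns the raw bytes. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the PrintWriter flushes the encoder into the sink.
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset))) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String name = "Yol\u014Bu"; // U+014B is the velar nasal from the question

        // ISO-8859-1 has no velar nasal, so the encoder substitutes '?'.
        byte[] latin1 = write(name, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // Yol?u

        // UTF-8 can represent it; the character takes two bytes.
        byte[] utf8 = write(name, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6
    }
}
```

The substitution happens silently at write time, which is why the String can look fine everywhere in the Java code and still arrive in the browser as a question mark.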

You should make sure of the following:
Your JVM must run with the file.encoding=UTF-8 parameter.
The MySQL table that contains the special characters must be configured with UTF-8 encoding.
Your web UI should specify, in a meta tag, the encoding in which your editor saved the web page, i.e. UTF-8.
If the problem persists, try using HTML entities (&entity) instead.

Parsing a URL in Java

I am looking for an equivalent to PHP's "parse_url" function in Java. I am not running in Tomcat. I have query strings saved in a database that I'm trying to break apart into individual parameters. I'm working inside of Pentaho, so I only have the Java SE classes to work with. I know I could write a plugin or something, but if I'm going to do all that I'll just write the script in PHP and be done with it.
TLDR: Looking for a Java standard class/function that takes a String and spits out an array of parameters.
Thanks,
Roger

You can accomplish that using java.net.URL:
URL url = new URL("http://hostname:port/path?arg=value#anchor");
String protocol = url.getProtocol(); // http
String host = url.getHost(); // hostname
String path = url.getPath(); // /path
int port = url.getPort(); // port
String query = url.getQuery(); // arg=value
String ref = url.getRef(); // anchor
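Note that the literal word "port" in the URL above is a placeholder; a real URL needs a numeric port or none. A concrete sketch using java.net.URI, which is also part of Java SE, might look like this. The host example.com, the class name UrlParts and the method describe are hypothetical and chosen only for illustration.

```java
import java.net.URI;

class UrlParts {
    /** Splits a URL into its components and joins them with '|' for easy inspection. */
    static String describe(String url) {
        URI uri = URI.create(url);
        return String.join("|",
                uri.getScheme(),                // protocol, e.g. http
                uri.getHost(),                  // host name
                String.valueOf(uri.getPort()),  // -1 when no port is given
                uri.getPath(),
                String.valueOf(uri.getQuery()),     // "null" when absent
                String.valueOf(uri.getFragment())); // "null" when absent
    }

    public static void main(String[] args) {
        System.out.println(describe("http://example.com:8080/path?arg=value#anchor"));
        // http|example.com|8080|/path|arg=value|anchor
    }
}
```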

Here's something quick and dirty (I have not compiled it, but you should get the idea).
URL url = new URL("http://...");
String query = url.getQuery();
String paramStrings[] = query.split("\\&");
HashMultimap<String, String> params = HashMultimap.create(); // <== google guava class
for (int i = 0; i < paramStrings.length; i++) {
String parts[] = paramStrings[i].split("=");
params.put(URLDecoder.decode(parts[0], "UTF-8"), URLDecoder.decode(parts[1], "UTF-8"));
}
Set<String> paramVals = params.get("paramName");
If you don't want to use the Guava class, you can accomplish the same thing with some additional code and a HashMap that maps each parameter name to a collection of values.
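A version that relies only on Java SE classes could be sketched as follows. This is an illustration rather than a tested library replacement: QueryParser and parse are made-up names, repeated parameters are collected into a List, and a parameter without "=" is assumed to mean an empty value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class QueryParser {
    /** Parses "a=1&b=2&a=3" into {a=[1, 3], b=[2]}, URL-decoding names and values as UTF-8. */
    static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            // Assumption: a bare name without '=' gets an empty value.
            String rawName = eq >= 0 ? pair.substring(0, eq) : pair;
            String rawValue = eq >= 0 ? pair.substring(eq + 1) : "";
            try {
                String name = URLDecoder.decode(rawName, "UTF-8");
                String value = URLDecoder.decode(rawValue, "UTF-8");
                params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is guaranteed to exist on every JVM.
                throw new IllegalStateException(e);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(parse("a=1&b=hello+world&a=2&c"));
        // {a=[1, 2], b=[hello world], c=[]}
    }
}
```

Splitting on the first "=" only, instead of calling split("="), keeps values that themselves contain an equals sign intact.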

No such thing in Java. You will need to parse the strings manually and create your own array. You could create your own parse_url using StringTokenizer, String.split, or Regular Expressions rather easily.
You could also convert those strings from the database back into URL objects and parse them that way.

String has a split function, but you will need to write your own regex to determine how to split the string.
See: http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)

ISO-8859-1 encoded strings out of /into JSON in Java

My application has a Java servlet that reads a JSONObject out of the request and constructs some Java objects that are used elsewhere. I'm running into a problem because there are strings in the JSON that are encoded in ISO-8859-1. When I extract them into Java strings, the encoding appears to get interpreted as UTF-16. I need to be able to get the correctly encoded string back at some point to put into another JSON object.
I've tried mucking around with ByteBuffers and CharBuffers, but then I don't get any characters at all. I can't change the encoding, as I have to play nicely with other applications that use ISO-8859-1.
Any tips would be greatly appreciated.
It's a legacy application using Struts 1.3.8. I'm using net.sf.json 2.2.4 for JSONObject and JSONArray.
A snippet of the parsing code is:
final JSONObject a = (JSONObject) i;
final JSONObject attr = a.getJSONObject("attribute");
final String category = attr.getString("category");
final String value = attr.getString("value");
I then create POJOs using that information, that are retrieved by another action class to create JSON to pass to the client for display, or to pass to other applications.
So to clarify, if the JSON contains the string "Juan Guzmán", the Java String contains something like Juan Guzm?_An (I don't have the exact one in front of me). I'm not sure how to get the correct diacritical back. I believe that if I can get a Java String that contains the correct representation, that Mezzie's solution, below, will allow me to create the string with the correct encoding to put back into the JSON to serve back.

I had the same issue, and I am using the same technology as you are. In our case it was UTF-8, so just change that to UTF-16.
public static String UTF8toISO( String str )
{
try
{
return new String( str.getBytes( "ISO-8859-1" ), "UTF-8" );
}
catch ( UnsupportedEncodingException e )
{
e.printStackTrace();
}
return str;
}
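What that helper does can be seen on the name from the question. The sketch below is an illustration under one assumption: the garbling came from UTF-8 bytes being decoded as ISO-8859-1. The names MojibakeRepair, misdecode and repair are made up.

```java
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    /** Simulates the bug: UTF-8 bytes are wrongly decoded as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the bug: recover the original bytes, then decode them as UTF-8. */
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Juan Guzm\u00E1n"; // a with acute accent, written as an escape
        String garbled = misdecode(name);
        System.out.println(garbled.length());            // 12, one char more than the original
        System.out.println(repair(garbled).equals(name)); // true
    }
}
```

The round trip is lossless here only because ISO-8859-1 maps every byte value to exactly one character. With other wrongly applied charsets some bytes may already have been replaced, and the original text cannot be recovered this way.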

StringBufferInputStream Question in Java

I want to read an input string and return it as a UTF-8 encoded string. So I found an example on the Oracle/Sun website that used FileInputStream. I didn't want to read a file, but a string, so I changed it to StringBufferInputStream and used the code below. The method parameter, jtext, is some Japanese text. Actually this method works great. The question is about the deprecated code. I had to add @SuppressWarnings because StringBufferInputStream is deprecated. Is there a better way to get a string input stream? Is it OK just to leave it as is? I've spent so long trying to fix this problem that I don't want to change anything now that I seem to have cracked it.
@SuppressWarnings("deprecation")
private String readInput(String jtext) {
StringBuffer buffer = new StringBuffer();
try {
StringBufferInputStream sbis = new StringBufferInputStream (jtext);
InputStreamReader isr = new InputStreamReader(sbis,
"UTF8");
Reader in = new BufferedReader(isr);
int ch;
while ((ch = in.read()) > -1) {
buffer.append((char)ch);
}
in.close();
return buffer.toString();
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
I think I found a solution - of sorts:
private String readInput(String jtext) {
String n;
try {
n = new String(jtext.getBytes("8859_1"));
return n;
} catch (UnsupportedEncodingException e) {
return null;
}
}
Before, I was desperately using getBytes(UTF8). But by chance I used Latin-1 ("8859_1") and it worked. Why it worked, I can't fathom. This is what I did step by step:
OpenOffice CSV(utf8)------>SQLite(utf8, apparently)------->java encoded as Latin-1, somehow readable.

The reason that StringBufferInputStream is deprecated is because it is fundamentally broken ... for anything other than Strings consisting entirely of Latin-1 characters. According to the javadoc it "encodes" characters by simply chopping off the top 8 bits! You don't want to use it if your application needs to handle Unicode, etc correctly.
If you want to create an InputStream from a String, then the correct way to do it is to use String.getBytes(...) to turn the String into a byte array, and then wrap that in a ByteArrayInputStream. (Make sure that you choose an appropriate encoding!).
But your sample application immediately takes the InputStream, converts it to a Reader and then adds a BufferedReader. If this is your real aim, then a simpler and more efficient approach is simply this:
Reader in = new StringReader(text);
This avoids the unnecessary encoding and decoding of the String, and also the "buffer" layer which serves no useful purpose in this case.
(A buffered stream is much more efficient than an unbuffered stream if you are doing small I/O operations on a file, network or console stream. But for a stream that is served from an in-memory data structure the benefits are much smaller, and possibly even negative.)
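Both replacements for StringBufferInputStream can be put side by side in a short sketch. The class and method names (StringSources, asUtf8Stream, readAll) are made up for illustration, and UTF-8 is just one possible choice of encoding for the byte stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StringSources {
    /** Use when an API insists on bytes: encode the String explicitly, then wrap the array. */
    static InputStream asUtf8Stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Drains a Reader into a String. */
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = in) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E"; // three Japanese characters, written as escapes

        // Option 1: characters in, characters out; no encoding step at all.
        String viaReader = readAll(new StringReader(text));

        // Option 2: through bytes; encoder and decoder must agree on the charset.
        String viaBytes = readAll(new InputStreamReader(asUtf8Stream(text), StandardCharsets.UTF_8));

        System.out.println(viaReader.equals(text)); // true
        System.out.println(viaBytes.equals(text));  // true
    }
}
```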
FOLLOWUP
I realized what you are trying to do now ... work around a character encoding / decoding issue.
My advice would be to try to figure out definitively the actual encoding of the character data that is being delivered by the database, then make sure that the JDBC drivers are configured to use the same encoding. Trying to undo the mis-translation by encoding with one encoding and decoding with another is dodgy, and can give you only a partial correction of the problems.
You also need to consider the possibility that the characters got mangled on the way into the database. If this is the case, then you may be unable to de-mangle them.

Is this what you are trying to do? I am not sure why you want to convert a String to exactly the same String.
A Java String holds a sequence of chars in which each char represents a Unicode number. So it is possible to construct the same string from two different byte sequences, say one encoded with UTF-8 and the other encoded with US-ASCII.
If you want to write it to a file, you can always convert it with String.getBytes("encoding");
private static String readInput(String jtext) {
byte[] bytes = jtext.getBytes();
try {
String string = new String(bytes, "UTF-8");
return string;
} catch (UnsupportedEncodingException ex) {
// do something
return null;
}
}
Update
Here is my assumption.
According to your comment, your SQLite DB stores text values using one encoding, say UTF-16. For some reason, your SQLite API cannot determine which encoding was used to encode the Unicode values into a sequence of bytes.
So when you use the getString method of your SQLite API, it reads a set of bytes from your DB and converts them into a Java String using an incorrect encoding. If this is the case, you should use the getBytes method and reconstruct the String yourself, i.e. new String(bytes, "encoding used in your DB"); If your DB is stored in UTF-16, then new String(bytes, "UTF-16"); should be readable.
Update
I wasn't talking about getBytes method on String class. I talked about getBytes method on your SQL result object, e.g. result.getBytes(String columnLabel).
ResultSet result = .... // from SQL query
String readableString = readInput(result.getBytes("my_table_column"));
You will need to change the signature of your readInput method to
private static String readInput(byte[] bytes) {
try {
// change encoding to your DB encoding.
// this can be UTF-8, UTF-16, 8859_1, etc.
String string = new String(bytes, "UTF-8");
return string;
} catch (UnsupportedEncodingException ex) {
// do something, at least return garbled text
return new String(bytes); // platform default charset; cannot throw
}
}
Whichever encoding you set here that makes your String readable is definitely the encoding of your column in the DB. This involves no unexplainable phenomenon, and you know exactly what your column encoding is.
But it would be good to configure your JDBC driver to use the correct encoding so that you do not need this readInput method for the conversion.
If no encoding makes your string readable, you will need to consider the possibility that the characters got mangled when they were written to the DB, as @Stephen C said. If this is the case, using a workaround may cause you to lose some of the characters during conversion. You will also need to solve the encoding problem on writing.

The StringReader class is the new alternative to the deprecated StringBufferInputStream class.
However, you state that what you actually want to do is take an existing String and return it encoded as UTF-8. You should be able to do that much more simply I expect. Something like:
s8 = new String(jtext.getBytes("UTF8"));
