Issues in Converting values to UTF-8

Issues in Converting values to UTF-8 - java

I am encountering issues in reporting in displaying names. My application uses different technologies PHP, Perl and for BI Pentaho.
We are using MYSQL as DB and my table is of CHARSET=utf8.
My table is been stored with values in rows as below which is wrong
Row1 = Ãxâ€”350
Row2 = Ã‘zâ€“401
PHP and Perl are using different in built functions to convert the above values which is stored in DB and it is displaying in UI as below which is correct
Expected Row1 = Áx—350
Expected Row2 = Ñz–401
Coming to reports which is using pentaho I am using ETL to transform the data before showing data in reports. In order to convert the above DB stored values I am trying to convert the data through Java step as below
new java.lang.String(new java.lang.String(CODE).getBytes("Windows-1252"), "UTF-8")
But it is not converting the values properly, among the above 2 wrong values only Row2 value is been converted properly but the first Row1 is wrongly converting as below
Converted Row1 = �?x—350
Converted Row2 = Ñz–401
Please suggest what way I can convert the values properly so that for example Row1 value should be converted properly to Áx—350.
I wrote a small Java program as below to convert the Ãxâ€”350 string to Áx—350
String input = "Ãxâ€”350";
byte[] b1 = input.getBytes("Windows-1252");
System.out.println("Input Get Bytes = "+b1.toString());
String szUT8 = new String(b1, "UTF-8");
System.out.println("Input Encoded = " + szUT8);
The output from the above code is as below
Input Get Bytes = [B#157ee3e5
Input Encoded = �?x—350-350—É1
If we see the output the string is wrong where the actual expected output is Áx—350.
To confirm on the encoding/decoding schemes i tried testing string online and tested with string Ãxâ€”350 and output is as expected Áx—350 which is correct.
So from this any one please point why java code is not able to convert properly although i am using the proper encoding/decoding schemes, anything else which iam missing or my approach is wrong.

The CHARSET setting in your db being set to utf-8 doesn't necessarily mean that the data there is properly encoded in utf-8 (or even in utf-8 at all), as we can see. It looks like you are dealing with mojibake - characters that that were at one time decoded using the wrong encoding scheme, then therefore in turn encoded wrong. Fixing that is a usually tedious process of figuring out past decode/encode errors and then undoing them.
Long story short: if you have mojibake, there isn't any automatic conversions you can do unless you know (or can figure out) what conversions were made in the past.
Converting is a matter of first decoding, then encoding. To convert in Perl:
my $string = "some windows-1252 string";
use Encode;
my $raw = decode('windows-1252',$string);
my $encoded = encode('utf-8',$raw);

Related

How to url encode data in Java

So Im trying to translate a working python code into Java. One of the steps required is to url encode the data. But when I encode the data in Java it looks different than the one in encoded in Python.
In one of the block of Python code theres this:
data = {'request-json': json}
print('Sending form data:', data)
data = urlencode(data)
data = data.encode('utf-8')
print('Sending data:', data)
The Output
Sending form data: {'request-json': '{"apikey": "xewpjipcpovwiiql"}'}
The output after being encoded
Sending data: b'request-json=%7B%22apikey%22%3A+%22xewpjipcpovwiiql%22%7D'
So this is what im trying to do in Java. As you can imagine Java is more involved. I used gson to convert to Json
Gson gson = new Gson();
API_Key key = new API_Key("xewpjipcpovwiiql");
String jsonInputString = gson.toJson(key);
Data data = new Data(key);
String request_form = gson.toJson(data);
System.out.println(request_form);
String urlencoded = URLEncoder.encode(request_form,StandardCharsets.UTF_8);
System.out.println(urlencoded);
The output:
Sending form data: {"request-json":{"apikey":"xewpjipcpovwiiql"}}
The output of the encoded string:
%7B%22requestjson%22%3A%7B%22apikey%22%3A%22xewpjipcpovwiiql%22%7D%7D
So they dont look the same so why are they coming differently ? How do I get the same python encoded String in Java ? I noticed in Python it used a combination of single and double quotes and in Java its only Double quotes so I dont know if that makes a difference.
Thank You!

On the Python side: The data.encode('utf-8') call is not necessary or at least the documentation describes with a different intention compared to this use https://docs.python.org/3/library/stdtypes.html#str.encode (and that's why there's a b' at the beggining).
The outer brackets are missing because it is interpreting request-json as the URL parameter name (it may be easier to understand if you add a second property at the json's top/first property level, you'll see you end with request-json=%7B%22apikey%22%3A+%22xewpjipcpovwiiql%22%7D&second-property=<second-property-value>).
On the Java side: the request_form is being completely interpreted as a single value to encode so you can put the encoded value as part of some parameter in a URL, as in: https://host:port?some-parameter-name=%7B%22requestjson%22%3A%7B%22apikey%22%3A%22xewpjipcpovwiiql%22%7D%7D

Store base64 encoded string in HBase

I have a very specific requirement of storing PDF data in Hbase columns. The source of Data is Mongo DB, from where the base64 encoded data is read and I will need to bulk upload it to Hbase table.
I realized that in base64 encoded string there are a lot of "\n" character which splits the entire string into parts. Not sure if it is because of this, but when I store the string as it is, using a put :
put.add(Bytes.toBytes(ColFamilyName), Bytes.toBytes(columnName), Bytes.toBytes(data.replaceAll("\n","").toString()));
It is storing only the first line from the entire encoded string. Eg :
If the actual content was something like this :
"JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PAovQ3JlYXRvciAoQXBhY2hlIEZPUCBWZXJzaW9uIDEu
" +
"MSkKL1Byb2R1Y2VyIChBcGFjaGUgRk9QIFZlcnNpb24gMS4xKQovQ3JlYXRpb25EYXRlIChEOjIw\n" +
"MTUwODIyMTIxMjM1KzAzJzAwJykKPj4KZW5kb2JqCjUgMCBvYmoKPDwKICAvTiAzCiAgL0xlbmd0\n" +
It is storing only the first line which is :
JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PAovQ3JlYXRvciAoQXBhY2hlIEZPUCBWZXJzaW9uIDEu
in the column. Even after trying to remove the "\n" manually it is the same output.
Could someone please guide me in the right direction here ?

Currently, I am also working on Base64 encoding. As per my understanding, you should try using
org.apache.hadoop.hbase.util.Base64.encodeBytes(byte[] source, int option)
method where DONT_BREAK_LINES can be used as an option.
Please let me know if this works fine.

Managed to solve it. The issue was when reading the Base64 encoded data from MongoDB Source. Read the data from Mongo DB document DBObject as:
jsonObj.get("receiptContent").toString().replaceAll("\n","")
And stored it as such in Hbase. Even from the Hue HBase UI Browser I can see the PDF content now.

correct way of extracting image (BLOB) from ORACLE

I have a project in which i need to extract some images in BLOB format as string, from an ORACLE database to send it through a JSON. I'm using Eclipse Java EE IDE.
Version: Mars Release (4.5.0)
Build id: 20150621-1200
Would this be the proper way to extract the BLOB data as string?
String query = "SELECT operation, c_book, x_book, x_text1, x_text2, x_text3, x_text4,"
+ "UTL_RAW.CAST_TO_VARCHAR2(DBMS_LOB.SUBSTR(img_logo,32670,1))FROM "
+ dataBaseConnectionData.getDB_SHCHEMA() + "."+ dataBaseConnectionData.getDB_TABLE_COLA()
+ " WHERE status = 'P' OR status = 'N' OR status = 'E'"
+ " ORDER BY c_book";

There are 3 ways to get BLOB data using JDBC:
Blob blob = rs.getBlob("img_logo")
InputStream stream = rs.getBinaryStream("img_logo")
byte[] bytes = rs.getBytes("img_logo")
If the blob is of limited size, which a logo would be, the third version is the easiest to use.
You will then need to convert to a string, which means you need to know which encoding was used to convert the original text to binary in the first place. The most conservative choice would be US-ASCII, so:
String text = new String(bytes, StandardCharsets.US_ASCII)
Of course, this assumes the binary data is text, but it's a logo, and that doesn't sound like a text value, so you might have meant that you want to encode the binary data for embedding in a JSON structure, to be decoded back to binary at the other end.
For that, you'd need to decide on an encoding. The two top choices are HEX and BASE64, where HEX will double the size (two hex digits per byte), and BASE64 will add 33% (4 chars per 3 bytes).
For HEX, see How to convert a byte array to a hex string in Java?
For BASE64, see How do I convert a byte array to Base64 in Java?, or use the new Base64 class (Java 8).

Character not displaying in html

I am having trouble displaying the "velar nasal" character (ŋ)(but I assume the same problem would arise with other rare characters).
I have a MySQL table which contains a word with this character.
When my code retrieves it to display in my HTML page, it is displayed as a question mark.
I have tried a number of things:
1) Tried using MySQL's CONVERT to convert the retrieved string to UTF-8 because I understood that the string is stored in my table as "Latin1":
SELECT CONVERT(Name USING utf8)
Instead of:
SELECT Name
This did not help, and, when I saved a string in my java code with the problematic word ("Yolŋu"), and then passed the String through the rest of the code the problem still occured (ie: The problem does not lie in the different character encoding that my DB uses).
2) I also tried creating a new String from bytes:
new String(name.getBytes("UTF-8"));
The String is being passed from java to the html via a JSONObject that is passed to a javascript file:
Relevant JSON code:
JSONArray names = new JSONArray();
for (int iD: iDs)
{
JSONObject namesData = new JSONObject();
String name = NameDB.getNameName(iD);
nameData.put("label", name);
nameData.put("value", iD);
names.put(nameData);
}
return names;
Relevant servlet code:
response.setContentType("application/json");
try (PrintWriter out = response.getWriter())
{
out.print(namesJSONArray);
}
Relevant js code:
An ajax call to the servlet is made via jquery ui's autocomplete "source" option.
I am pretty new to coding in general and very new to the character encoding topic.
Thank you.

First, in Java String should already hold correct Unicode, so new String(string.getBytes(...), ...) is a hack, with its own troubles.
1. The database
It would be nice if the database held the text in UTF-8. The encoding can be set on database, table and column level. The first thing is to investigate how the text is stored. A table dump (mysqldump) would be least error prone.
If you can use UTF-8, this must be set form MySQL on the database engine, and for data transfer for the java driver.
In every case you can check a round-trip in java JDBC by filling a table field, and reading it again, as also reading that existing troublesome field.
Dump the code points of the string.
String dump(String s) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
int cp = s.codePointAt(i);
if (32 < cp && cp < 128) {
sb.append((char) cp);
} else {
sb.append("U+").append(Integer.toHexString(cp));
}
sb.append(' ');
i += Character.charCount(cp);
}
return sb.toString();
}
2. The output
Here probably lies the error. Call at the beginning:
response.setCharacterEncoding("UTF-8");
... response.getWriter(); // Now converts java's Unicode text to UTF-8.
For HTML a charset specification is in order too. Especially when the HTML page is saved to the file system, the encoding header would be lost.

You should be assure about the following things:
Your JVM must work with file.encoding=UTF-8 param
Your mySQL table in which contains special characters must be parametrized with encoding=UTF-8
Your web UI should specify the meta tag with the encoding you have saved the web page in your editor, so UTF-8
If the problem persists, try to use HTML entities (&entity) instead.

Read special charatters ( æ ø å ) with Java from Oracle database

i have a problem when reading special charatters from oracle database (use JDBC driver and glassfish tooplink).
I store on database the name "GRØNLÅEN KJÆTIL" through WebService and, on database, the data are store correctly.
But when i read this String, print on log file and convert this in byte array whit this code:
int pos = 0;
byte[] msg=new byte[1024];
String F = "F" + passenger.getName();
logger.debug("Add " + F + " " + F.length());
msg = addStringToArrayBytePlusSeparator(msg, F,pos);
..............
private byte[] addStringToArrayBytePlusSeparator(byte[] arrDest,String strToAdd,int destPosition)
{
System.arraycopy(strToAdd.getBytes(Charset.forName("ISO-8859-1")), 0, arrDest, destPosition, strToAdd.getBytes().length);
arrDest = addSeparator(arrDest,destPosition+strToAdd.getBytes().length,1);
return arrDest;
}
1) In the log file there is:"Add FGRÃNLÃ " (the name isn't correct and the F.length() are not printed).
2) The code throw:
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at it.edea.ebooking.business.chi.control.VingCardImpl.addStringToArrayBytePlusSeparator(Test.java:225).
Any solution?
Tanks

You're calling strToAdd.getBytes() without specifying the character encoding, within the System.arraycopy call - that will be using the system default encoding, which may well not be ISO-8859-1. You should be consistent in which encoding you use. Frankly I'd also suggest that you use UTF-8 rather than ISO-8859-1 if you have the choice, but that's a different matter.
Why are you dealing with byte arrays anyway at this point? Why not just use strings?
Also note that your addStringToArrayBytePlusSeparator method doesn't give any indication of how many bytes it's copied, which means the caller won't have any idea what to do with it afterwards. If you must use byte arrays like this, I'd suggest making addStringToArrayBytePlusSeparator return either the new "end of logical array" or the number of bytes copied. For example:
private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
/**
* (Insert fuller description here.)
* Returns the number of bytes written to the array
*/
private static int addStringToArrayBytePlusSeparator(byte[] arrDest,
String strToAdd,
int destPosition)
{
byte[] encodedText = ISO_8859_1.getBytes(strToAdd);
// TODO: Verify that there's enough space in the array
System.arraycopy(encodedText, 0, arrDest, destPosition, encodedText.length);
return encodedText.length;
}

Encoding/Decoding problems are hard. In every process step you have to do the correct encoding/decoding. So,
familiarize yourself with the difference of bytes (inputstream) and Characters (Readers, Strings)
Choose in which character encoding you want to store your data in the database, and in which character encoding you want to expose your webservice. Make sure when you load initial data in the database it's in the right encoding
connect with the right database properties. mysql requires an addition to the connection url:?useUnicode=true&characterEncoding=UTF-8 when using UTF-8, I don't know about oracle.
if you print/debug at a certain step and it looks ok, you can't be sure you did it right. The logger can write with the wrong encoding (sometimes making something look ok, while in fact it's broken). Your terminal might not handle strange byte encodings correct. The same holds for command-line database clients. Your data might wrongly be stored, but your wrongly configured terminal interprets/shows the data as correct.
In XML, it's not only the stream encoding that matters, but also the xml-encoding attribute.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Issues in Converting values to UTF-8 - java

Related

How to url encode data in Java

Store base64 encoded string in HBase

correct way of extracting image (BLOB) from ORACLE

Character not displaying in html

Read special charatters ( æ ø å ) with Java from Oracle database

Categories

Resources