DB2 UTF-8 XML C2 85 to new line conversion


We have a problem saving UTF-8 encoded XML data to a table in DB2 9.7 LUW.
Table DDL:
CREATE TABLE DB2ADMIN.TABLE_FOR_XML
(
ID INTEGER NOT NULL,
XML_FIELD XML NOT NULL
)
The problem occurs only in rare cases with rare Unicode characters. We are using the Java JDBC DB2 driver.
For example, when the file is viewed in an editor in normal mode rather than hex view (Notepad++), the strange character after "16." is displayed as NEL in a black square.
The input XML is UTF-8 encoded, and a hex editor shows these values:
00000010h: 31 36 2E 20 C2 85 42 ; 16. Â…B
After inserting into DB2, I presume some kind of conversion occurs, because when selecting the data back the same character is now:
00000010h: 31 36 2E 20 0D 0A 42 ; 16. ..B
C2 85 has been transformed into 0D 0A, which is a new line.
Another thing I noticed: when the XML was saved into the table, its content started with
<?xml version="1.0" encoding="UTF-8"?>
but after fetching the XML from DB2 the content started with
<?xml version="1.0" encoding="UTF-16"?>
Is there a way to force DB2 to store XML in UTF-8 without conversions? Fetching with XMLSERIALIZE didn't help:
SELECT XML_FIELD AS CONTENT1, XMLSERIALIZE(XML_FIELD AS CLOB(1M)) AS CONTENT2 FROM DB2ADMIN.TABLE_FOR_XML
In CONTENT2 there is no XML header, but the new line is still there.

This behaviour is standard for XML 1.1 processors. XML 1.1, section 2.11:
the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating [the single character #x85] to a single #xA character
Line ending type is one of the many details of a document that will be lost over a parse-and-serialise cycle (e.g. attribute order, whitespace in tags, numeric character references...).
It's slightly surprising that DB2's XML fields are using XML 1.1 since not much uses that revision of XML, but not super-surprising in that support for NEL (ancient, useless mainframe line ending character) is something only IBM ever wanted.
Is there a way to force DB2 to store XML in UTF-8 without conversions?
Use a BLOB?
If you need both native-XML-field functionality and to retain the exact original serialised form of a document then you'll need two columns.
(Are you sure you need to retain NEL line endings? Nobody usually cares about line endings, and these are pretty bogus.)
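To make the normalization rule quoted above concrete, here is a minimal sketch that applies the XML 1.1 section 2.11 line-break translations to an already decoded string. The class and method names are made up for illustration, and this is not DB2's actual code; it only shows why the bytes C2 85 (NEL, U+0085) come back as a line break. The spec maps NEL to a single #xA; how that later became 0D 0A in the fetched data is not shown here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of DB2 or any XML library.
final class XmlLineBreaks {

    // Translates CR LF, CR NEL, NEL, U+2028 and a lone CR to a single LF,
    // as an XML 1.1 processor is required to behave on input.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Swallow the LF or NEL that follows a CR, if any.
                if (i + 1 < text.length()) {
                    char next = text.charAt(i + 1);
                    if (next == '\n' || next == '\u0085') {
                        i++;
                    }
                }
                out.append('\n');
            } else if (c == '\u0085' || c == '\u2028') {
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The bytes from the hex dump in the question: "16. " + C2 85 + "B".
        byte[] input = {0x31, 0x36, 0x2E, 0x20, (byte) 0xC2, (byte) 0x85, 0x42};
        String decoded = new String(input, StandardCharsets.UTF_8);
        for (char ch : normalize(decoded).toCharArray()) {
            System.out.printf("%04X ", (int) ch);
        }
        System.out.println();
    }
}
```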

As I don't generally need non-printable characters, I decided to strip x'C285' (code point 133) and, just in case, 4-byte UTF-8 characters from the XML string before saving it to DB2.
I found a similar example (How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?) and adjusted it:
public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD";
public static String toValid3ByteUTF8String(String line) {
    final int length = line.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
        final int codepoint = line.codePointAt(offset);
        if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) { // 4-byte UTF-8: replace
            b.append(REPLACEMENT_CHAR);
        } else if (codepoint == 133) { // NEL or x'C285'
            b.append(REPLACEMENT_CHAR);
        } else {
            if (Character.isValidCodePoint(codepoint)) {
                b.appendCodePoint(codepoint);
            } else {
                b.append(REPLACEMENT_CHAR);
            }
        }
        // advance by one or two chars, depending on the code point
        offset += Character.charCount(codepoint);
    }
    return b.toString();
}

Related

UTF-8 won't persist on Hibernate + MySQL

I'm trying to save some values in MySQL database by using Hibernate, but most Lithuanian characters won't get saved, including ąĄ čČ ęĘ ėĖ įĮ ųŲ ūŪ(they are saved as ?), however, šŠ žŽ do get saved.
If I do inserts manually, then those values are properly saved, so the problem is most likely in Hibernate configuration.
What I have tried so far:
hibernate.charset=UTF-8
hibernate.character_encoding=UTF-8
hibernate.use_unicode=true
---------
properties.put(PROPERTY_NAME_HIBERNATE_USE_UNICODE,
env.getRequiredProperty(PROPERTY_NAME_HIBERNATE_USE_UNICODE));
properties.put(PROPERTY_NAME_HIBERNATE_CHARSET,
env.getRequiredProperty(PROPERTY_NAME_HIBERNATE_CHARSET));
properties
.put(PROPERTY_NAME_HIBERNATE_CHARACTER_ENCODING,
env.getRequiredProperty(PROPERTY_NAME_HIBERNATE_CHARACTER_ENCODING));
---------
private void registerCharachterEncodingFilter(ServletContext aContext) {
CharacterEncodingFilter cef = new CharacterEncodingFilter();
cef.setForceEncoding(true);
cef.setEncoding("UTF-8");
aContext.addFilter("charachterEncodingFilter", cef)
.addMappingForUrlPatterns(null, true, "/*");
}
As described here, I tried adding ?useUnicode=true&characterEncoding=utf-8 to the DB connection URL.
As described here, I ensured that my DB is set to the UTF-8 charset. phpMyAdmin > information_schema > schemata:
def db_name utf8 utf8_lithuanian_ci NULL
This is how I save into db:
//Controller
buildingService.addBuildings(schema.getBuildings());
List<Building> buildings = buildingService.getBuildings();
System.out.println("-----------");
for (Building b : schema.getBuildings()) {
System.out.println(b.toString());
}
System.out.println("-----------");
for (Building b : buildings) {
System.out.println(b.toString());
}
System.out.println("-----------");
//Service:
@Override
public void addBuildings(List<Building> buildings) {
for (Building b : buildings) {
getCurrentSession().saveOrUpdate(b);
}
}
The first set of println output contains all Lithuanian characters, while the second replaces most of them with ?
EDIT: Added details
insert into buildings values (11,'ąĄčČęĘ', 'asda');
select short, hex(short) from buildings;
//Šalt. was inserted via hibernate
//letters are properly displayed:
ąĄčČęĘ | C485C484C48DC48CC499C498
MIF Šalt. | 4D494620C5A0616C742E
select address, hex(address) from buildings;
Šaltini? <...> | C5A0616C74696E693F20672E2031412C2056696C6E697573
//should contain "ų"
--------
show create table buildings;
buildings | CREATE TABLE `buildings` (
`id` int(11) NOT NULL,
`short` varchar(255) COLLATE utf8_lithuanian_ci DEFAULT NULL,
`address` varchar(255) COLLATE utf8_lithuanian_ci DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_lithuanian_ci
EDIT:
I did not find a proper solution, so I came up with a workaround. I ended up escaping/unescaping characters, storing them like this: \uXXXX.

Let's verify that they were stored correctly... Please do SELECT col, HEX(col) ... to fetch some cell with Lithuanian characters. A correctly stored ą will show C485. The others should show various hex values of C4xx or C5xx. 3F is ?.
But, more importantly, 4 characters do show. Š should be C5A0 if properly stored as utf8. However, I suspect, you will see 8A, implying that the column in the table is really declared as CHARACTER SET latin1. (The 4 characters show up in the first column of my charset blog ).
Do SHOW CREATE TABLE to see how the column is defined. If it says latin1, then the problem is with the table definition, and you probably ought to start over.
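To compare against what HEX(col) returns, it can help to print the UTF-8 bytes of the same text on the Java side. The following is only a sketch; the class and method names are made up, and the letters are written as \u escapes so that the source file encoding cannot interfere.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing Java strings with MySQL's HEX() output.
final class Utf8Hex {

    // Returns the UTF-8 encoding of s as upper-case hex, like HEX(col) in MySQL.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\u0105")); // a with ogonek
        System.out.println(hex("\u0160")); // S with caron
        System.out.println(hex("?"));      // what a lost character looks like
    }
}
```

If the hex printed here matches the hex stored in the column, the data was written correctly and the damage happens on the way out; if the column shows 3F where Java shows a C4xx or C5xx pair, the character was lost on the way in.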

You have to ensure that every component taking part in data entry uses UTF-8 encoding explicitly.
If you enter the values via a browser, make sure that the page displaying the results is served with the following header:
Content-Type: text/html; charset=utf-8
and that the input form is defined as follows:
<form action="submit" accept-charset="UTF-8">...</form>
If you are creating String objects from a byte array, make sure you explicitly state the Charset in the constructor.
If your input comes from a text file, that file has to be UTF-8 encoded.
If it is hardcoded directly in your code, then the source file has to be UTF-8 encoded.
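For the point about byte arrays, a minimal sketch of what "explicitly state the Charset" means in practice. The class and method names are invented for the example; the byte values are the UTF-8 encoding of ų (U+0173).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical example class; only the JDK calls are real.
final class ExplicitCharset {

    // Decodes bytes as UTF-8 regardless of the platform default encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encodes a string as UTF-8 regardless of the platform default encoding.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC5, (byte) 0xB3}; // UTF-8 for U+0173
        String right = decodeUtf8(bytes);
        // Decoding the same bytes with the wrong charset yields two characters.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(right.length() + " char(s) vs " + wrong.length() + " char(s)");
    }
}
```

Calling new String(bytes) or getBytes() without a charset depends on the platform default, which is exactly the kind of implicit setting this answer warns about.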

The fact that your DB holds correct UTF-8 (two or more bytes for a special letter) is reassuring.
If you get a single ? for a special letter, a conversion was attempted from UTF-8 to some encoding that does not contain that letter. That seems to be the case here: the letters that are converted correctly are in the ISO-8859-1 or Windows-1252 range, and the others are not.
Now ISO-8859-1, aka Latin-1, is the default HTTP encoding and the default in Java EE servers. You might want to call this before writing:
response.setCharacterEncoding("UTF-8");
Now one problem with System.out.println is that it uses the system default encoding. Logging to a file with a logger is more interesting. Or debugging and inspecting the String and its char array.
That the schema seemingly works may be because the schema Strings come directly from Java source, and the editor encoding and the javac compiler encoding differ. This can be checked by u-escaping the string literals in Java: "\u0105" instead of "ą".
Make a unit test that writes and reads from the database.
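The single ? described above is easy to reproduce without a database. The sketch below pushes a string through Windows-1252 and back; it assumes the JRE ships the windows-1252 charset (most do), and the class and method names are made up. Š (U+0160) exists in Windows-1252 and survives, while ą (U+0105) does not and becomes ?.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not taken from the original posts.
final class UnmappableDemo {

    // Assumption: the windows-1252 charset is available in this JRE;
    // Charset.forName throws UnsupportedCharsetException if it is not.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes to Windows-1252 and decodes again; unmappable characters become '?'.
    static String lossyRoundTrip(String s) {
        return new String(s.getBytes(CP1252), CP1252);
    }

    public static void main(String[] args) {
        // S with caron survives, a with ogonek is replaced.
        System.out.println(lossyRoundTrip("\u0160\u0105"));
    }
}
```

This matches the symptom in the question: only the letters that happen to exist in the single-byte encoding come through intact.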

Parsing a complicated CSV file

I am in a difficult situation where I need to write a parser for a formatted document from Tekla so that it can be processed in the database.
In the .CSV I have this:
,SMS-PW-BM31,,1,,,,287.9
,,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9
,------------,--------------,----,---------------,--------,------------,---------
,SMS-PW-BM32,,1,,,,405.8
,,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2
,,SMSPW-EN12,1,PLT12x175,SS400,500,8.2
,,SMSPW-EN14,1,PLT16x175,SS400,500,11
,------------,--------------,----,---------------,--------,------------,---------
That is the document generated by the Tekla software. What I expect as output is something like this:
HEAD-MARK    COMPONENT-TYPE  QUANTITY  PROFILE        GRADE  LENGTH  WEIGHT
SMS-PW-BM31                  1                                       287.9
SMS-PW-BM31  SMS-PW-BM31     1         H350*175*7*11  SS400  5805    287.9
SMS-PW-BM32                  1                                       405.8
SMS-PW-BM32  SMSPW-H707      1         H350*175*7*11  SS400  6697    332.2
SMS-PW-BM32  SMSPW-EN12      1         PLT12X175      SS400  500     8.2
SMS-PW-BM32  SMSPW-EN14      1         PLT16X175      SS400  500     11
Where do I start in Java? The most complicated part is distributing the head mark across the rows of each group, where the groups are separated by the '-' lines.

The CSV format is quite simple: there is a column delimiter, which is a comma (,), and a row delimiter, which is a new line (\n). Some columns may be surrounded by quotes (") to contain the column data, but it looks like you won't have to worry about that given your current file.
Look at String.split and you will find your answer after a bit of pondering it.
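As a starting point, here is one way the String.split idea could look for the sample above. It is only a sketch under a few assumptions read off the sample: every data line has eight comma-separated fields with an empty first field, a head-mark line carries the mark in the second field, and separator lines start with dashes in that field. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser sketch for the sample shown in the question.
final class TeklaCsvSketch {

    // Returns rows of {headMark, component, quantity, profile, grade, length, weight}.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String headMark = "";
        for (String line : lines) {
            // The -1 limit keeps trailing empty fields.
            String[] f = line.split(",", -1);
            if (f.length < 8 || f[1].startsWith("---")) {
                continue; // separator or malformed line
            }
            if (!f[1].isEmpty()) {
                headMark = f[1]; // a new group starts here
            }
            rows.add(new String[] {headMark, f[2], f[3], f[4], f[5], f[6], f[7]});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            ",SMS-PW-BM31,,1,,,,287.9",
            ",,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9",
            ",------------,--------------,----,---------------,--------,------------,---------",
            ",SMS-PW-BM32,,1,,,,405.8",
            ",,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2");
        for (String[] row : parse(lines)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

The head mark is simply remembered in a variable and copied onto every following component row until the next head-mark line appears. If the real files ever contain quoted fields, a proper CSV library would be a safer choice than split.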

Special chars in JAVA

สวัสดี Mr.Java Sp'e c'i'a'l'' '
I tried to parse the String using the code below, but I couldn't make it work; it simply shows the wrong value.
String s = "สวัสดี Mr.Java Sp'e c'i'a'l'' '";
s = s.replaceAll("'", "&apos;");
//s = s.replaceAll("'", "''");
StringEscapeUtils.escapeHtml(s);
I am trying to read the value from a JSP, save it in a SQL Server DB, then show it in a JSP again and update it.
But sometimes the JSP shows the converted &apos literally instead of the special characters.
Put simply: I have entered this String (สวัสดี Mr.Java Sp'e c'i'a'l'' ') here on StackOverflow; they save it in their DB, show it, and allow me to update it. This is what I want.

OK. So let's look at what your code does:
// line 1
String s = "สวัสดี Mr.Java Sp'e c'i'a'l'' '";
We have a String with various international characters in it ... and some "'" characters.
// line 2
s = s.replaceAll("'", "&apos;");
Assuming that those are really "'" characters, we will replace all instances of "'" with an XML / HTML character entity, giving us:
"สวัสดี Mr.Java Sp&apos;e c&apos;i&apos;a&apos;l&apos;&apos; &apos;"
And so ...
// line 3
s = StringEscapeUtils.escapeHtml(s);
This replaces any active HTML / XML characters with character references. This includes the ampersand characters "&" that you previously inserted. The result is this:
"&#xxxx;&#xxxx;&#xxxx;&#xxxx; Mr.Java Sp&amp;apos;e c&amp;apos;i&amp;apos;a&amp;apos;l&amp;apos;&amp;apos; &amp;apos;"
(The &#xxxx; numeric character references encode those Thai (?) characters.)
When you embed that in an HTML document and display it, you will see "สวัสดี Mr.Java Sp&apos;e c&apos;i&apos;a&apos;l&apos;&apos; &apos;"
See what has happened? You have HTML escaped your HTML escaped apostrophes!!
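The double escaping can be reproduced without commons-lang. In the sketch below, escapeHtml is a deliberately tiny, hypothetical stand-in for StringEscapeUtils.escapeHtml: it only handles four markup characters and does not turn non-ASCII characters into numeric references, but it is enough to show what happens to the ampersand you inserted yourself.

```java
// Hypothetical demo; escapeHtml here is NOT the commons-lang implementation.
final class DoubleEscape {

    // Minimal stand-in: escapes only & < > and the double quote.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String s = "Sp'e c'i'a'l";
        String once = s.replace("'", "&apos;"); // first, manual escaping
        String twice = escapeHtml(once);         // second escaping hits the '&'
        System.out.println(once);
        System.out.println(twice);
    }
}
```

A browser decodes one level of escaping only, so the second string is displayed with a literal &apos; where the apostrophe should be.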
So what do you really need to do?
There is no need to replace apostrophes with &apos;. Apostrophes are legal in HTML text.
There should be no need to add HTML escapes so that you can store text in a database:
Any modern database will allow you to store Unicode strings without any special encoding.
If you are trying to prevent the database's SQL parser getting confused by quotes in the text you are storing, you are doing it the wrong way. The right way to do this is to use a PreparedStatement, add parameter placeholders to the query, and use the PreparedStatement.setXxx methods to provide the parameter values. The execute (or whatever) will take care of any SQL escaping that needs to be done.

Java POST data to mySQL UTF-8 encoding issue

I have POST data that contains the Japanese string AKB48 ネ申テレビ シーズン3, defined in jQuery as data.
$("#some_div").load("someurl", { data : "AKB48 ネ申テレビ シーズン3"})
The post data is sent to Java Servlet:
String data = new String(this.request.getParameter("data").getBytes("ISO-8859-1"), "UTF-8");
My program saves it to MySQL, but after the data is saved to the database it becomes:
AKB48 u30CDu7533u30C6u30ECu30D3 u30B7u30FCu30BAu30F33
What should I do if I want to save it as it is in UTF-8? All my files are in UTF-8.
MySQL encoding is utf8 and here is the code
String sql = "INSERT INTO Inventory (uid, item_id, item_data, ctime) VALUES ("
+ inventory.getUid() + ",'"
+ inventory.getItemId() + "','"
+ StringEscapeUtils.escapeJava(inventory.getItemData()) + "',CURRENT_TIMESTAMP)";
Statement stmt = con.createStatement();
int cnt = stmt.executeUpdate(sql);

From your example above, I can verify that the Japanese string is getting saved to your MySQL database correctly, but as escaped Unicode.
I would check these items in order:
Are your tables and columns all set to have a character set and collation for utf8? I.e.,
CHARACTER SET utf8 COLLATE utf8_general_ci
Are you explicitly setting the character set encoding before reading the POST parameters? request.setCharacterEncoding("UTF-8");
Are you setting the character encoding for your db connections? I.e., jdbc:mysql://localhost:3306/YOURDB?useUnicode=true&characterEncoding=UTF8
As the others have pointed out, you should not use that getBytes trick. It will surely mess up the POSTed values.
EDIT
Do not use StringEscapeUtils.escapeJava, since that will turn your string into escaped Unicode. That is what is transforming AKB48 ネ申テレビ シーズン3 into AKB48 u30CDu7533u30C6u30ECu30D3 u30B7u30FCu30BAu30F33.

Why don't you just read the parameter value with this.request.getParameter("data")?
Your data is sent correctly using URL encoding, where each Unicode character is replaced by its code. All you have to do is get the value of the parameter. When you request the bytes using ISO-8859-1, you are actually corrupting your data, because the string is represented as a sequence of codes in textual form.

Java strings are stored in UTF-16. So, this code:
String data = new String(this.request.getParameter("data").getBytes("ISO-8859-1"), "UTF-8");
takes a UTF-16 string (which the container has already decoded from the HTTP request), encodes it into a byte array using the ISO-8859-1 charset, and then decodes that byte array using the UTF-8 charset. This is almost certainly not what you want.
What happens when you use this?
String data = this.request.getParameter("data");
System.out.println(data);
If the second line generates bad data, then your problem is likely in jQuery. Determine that you are indeed getting Unicode in your jQuery request by checking the charset declared in the Content-Type header:
System.out.println(this.request.getHeader("Content-Type"));
If it does not generate bad data, but the data doesn't get stored correctly in MySQL, your problem is at the database level. Make sure your column type supports Unicode strings.
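To see what the getBytes round trip discussed above does, here is a small sketch that runs it on two inputs: a string that was decoded correctly, and a string that was wrongly decoded as ISO-8859-1 in the first place. The class and method names are invented; the Japanese character is written as a \u escape.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of the getBytes("ISO-8859-1") / new String(..., "UTF-8") idiom.
final class RoundTripTrick {

    // The idiom from the question, with explicit Charset constants.
    static String trick(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "\u30CD"; // katakana NE, decoded correctly by the container

        // Simulates a container that decoded the UTF-8 request bytes as ISO-8859-1.
        String misdecoded = new String(correct.getBytes(StandardCharsets.UTF_8),
                                       StandardCharsets.ISO_8859_1);

        System.out.println(trick(correct).equals("?"));        // data destroyed
        System.out.println(trick(misdecoded).equals(correct)); // data repaired
    }
}
```

So the idiom only helps when the parameter was mis-decoded as ISO-8859-1 beforehand. Once the request encoding is set properly, it turns every character outside ISO-8859-1 into ?.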

What's the point of the line
String data = new String(this.request.getParameter("data").getBytes("ISO-8859-1"), "UTF-8");
You're transforming Chinese (or at least non-occidental) characters into bytes using the ISO-8859-1 encoding. Of course this can't work, since Chinese characters are not supported by the ISO-8859-1 encoding. And then you're constructing a new String from bytes that are supposed to represent ISO-8859-1-encoded characters, using the UTF-8 encoding. This, once again, doesn't make any sense. UTF-8 and ISO-8859-1 are not the same thing, and only a small set of chars have the same encoding in both formats.
Just use
String data = this.request.getParameter("data");
and everything should be OK, provided that the column in the MySQL table uses an encoding that supports these characters.
EDIT:
now that you've shown us the code used to insert the data in database, I know where all this comes from (the preceding points are still valid, though). You're doing
StringEscapeUtils.escapeJava(inventory.getItemData())
What's the point? escapeJava is used to take a String and escape special characters in order to make it a valid Java String literal. It has nothing to do with SQL. Use a prepared statement:
String sql = "INSERT INTO Inventory (uid, item_id, item_data, ctime) VALUES (?, ?, ?, CURRENT_TIMESTAMP)";
PreparedStatement stmt = con.prepareStatement(sql);
stmt.setInt(1, inventory.getUid()); // or setLong, depending on the type
stmt.setString(2, inventory.getItemId());
stmt.setString(3, inventory.getItemData());
int cnt = stmt.executeUpdate();
The PreparedStatement will take care of escaping special SQL characters correctly. Prepared statements are the best tool against SQL injection attacks, and should always be used when a query has parameters, especially if the parameters come from the end user. See http://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html.

’ character being converted to â€™ in jdbc

I am trying to read a UTF-8 string from my MySql database, which I create using:
CREATE DATABASE april
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
I make the table of interest using:
DROP TABLE IF EXISTS `article`;
CREATE TABLE `article` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`text` longtext NOT NULL,
`date_created` timestamp DEFAULT NOW(),
PRIMARY KEY (`id`)
) CHARACTER SET utf8;
If I select * from article in the MySql command line util, I get:
OIL sands output at Nexen’s Long Lake project dropped in February.
However, when I do
ResultSet rs = st.executeQuery(QUERY);
long id = -1;
String text = null;
Timestamp date = null;
while (rs.next()) {
text = rs.getString("text");
LOGGER.debug("text=" + text);
}
the output I get is:
text=OIL sands output at Nexenâ€™s Long Lake project dropped in February.
I get my Connection via:
DriverManager.getConnection("jdbc:" + this.dbms + "://" + this.serverHost + ":" + this.serverPort + "/" + this.dbName + "?useUnicode&user=" + this.username + "&password=" + this.password);
I've also tried, instead of the useUnicode parameter:
characterEncoding=UTF-8
and
characterEncoding=utf8
I also tried, instead of the line text = rs.getString("text"):
byte[] temp = rs.getBytes("text");
String[] encodings = new String[]{"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16", "Latin1"};
for (String encoding : encodings) {
text = new String(temp, encoding);
LOGGER.debug(encoding + ": " + text);
}
// Which outputted:
US-ASCII: OIL sands output at Nexen��������s Long Lake project dropped in February.
ISO-8859-1: OIL sands output at NexenÃ¢â¬â¢s Long Lake project dropped in February.
UTF-8: OIL sands output at Nexenâ€™s Long Lake project dropped in February.
UTF-16BE: 佉䰠獡湤猠潵瑰畴⁡琠乥硥滃ꋢ芬ꉳ⁌潮朠䱡步⁰牯橥捴⁤牯灰敤⁩渠䙥扲畡特�
UTF-16LE: 䥏⁌慳摮⁳畯灴瑵愠⁴敎數썮겂蓢玢䰠湯⁧慌敫瀠潲敪瑣搠潲灰摥椠⁮敆牢慵祲�
UTF-16: 佉䰠獡湤猠潵瑰畴⁡琠乥硥滃ꋢ芬ꉳ⁌潮朠䱡步⁰牯橥捴⁤牯灰敤⁩渠䙥扲畡特�
Latin1: OIL sands output at NexenÃ¢â¬â¢s Long Lake project dropped in February.
I load the strings into the DB using some pre-defined sql in a file. This file is UTF-8 encoded.
mysql -u april -p -D april < insert_articles.sql
This file includes the line:
INSERT INTO article (text) value ("OIL sands output at Nexen’s Long Lake project dropped in February.");
When I print out that file within my application using:
BufferedReader reader = new BufferedReader(new FileReader(new File("/home/path/to/file/sql_article_inserts.sql")));
String str;
while((str = reader.readLine()) != null) {
LOGGER.debug("LINE: " + str);
}
I get the correct, expected output:
LINE: INSERT INTO article (text) value ("OIL sands output at Nexen’s Long Lake project dropped in February.");
Any help would be much appreciated.
Some System Details:
I am running on linux (Ubuntu)
Edits:
* Edited to specify OS
* Edited to detail output of reading sql input file.
* Edited to specify more about how the data is inserted into the DB.
* Edited to fix typo in code, and clarify example.

Is it possible you're reading the log file using the incorrect encoding? windows-1252, I am guessing.
UTF-8: OIL sands output at Nexenâ€™s Long Lake project dropped in February.
If this is appearing in the log, do a hex dump of the log file. If the data is UTF-8, you would expect the sequence Nexen’s to become 4E 65 78 65 6E E2 80 99 73. If some other application reads this as a native ANSI encoding, it'll decode it as Nexenâ€™s.
To confirm, you can also dump the individual characters of the return value to see if they are correct in UTF-16:
//untested
for(char ch : text.toCharArray()) {
System.out.printf("%04x%n", (int) ch);
}
I'm assuming all data is in the BMP, so you can just look up the results in the Unicode charts.
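The mis-decoding described above can be reproduced directly. This sketch encodes the right single quotation mark (U+2019) as UTF-8 and then reads those bytes back as Windows-1252; it assumes the JRE provides the windows-1252 charset, and the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only the JDK calls are real.
final class MojibakeDemo {

    // Encodes as UTF-8, then decodes the bytes with the wrong charset.
    // Assumption: windows-1252 is available; forName throws otherwise.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // right single quotation mark
        for (byte b : quote.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF); // the bytes a hex dump would show
        }
        System.out.println();
        for (char ch : misread(quote).toCharArray()) {
            System.out.printf("%04x%n", (int) ch); // three chars instead of one
        }
    }
}
```

If the char dump of the string returned by rs.getString shows a single 2019, the JDBC side is fine and only the log viewer is misreading the file; if it already shows three separate characters, the data was damaged earlier in the chain.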

Try setting the database itself to UTF-8. When creating the DB:
CREATE DATABASE mydb
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
Also see MySQL reference on connection charsets and MySQL reference on configuring charsets for applications

Parameters in the JDBC URL only define how the driver should communicate with the server. If the server does not use UTF8 by default these parameters won't change it either.
Have you tried executing the following SQL query after connecting? (This should switch the current connection to UTF8 on the server-side too):
SET names utf8

There are several character encodings involved.
The terminal/cmd window in which the mysql command line tool is running (PuTTY?).
The environment in the shell (bash) where you are running your stuff (LC_CTYPE).
MySQL internal (used in tables): you have defined this as UTF-8.
The JVM internal (always UTF-16).
The character encoding used by the writers the logger uses: the default (system property), or perhaps one defined in the logging framework's configuration.
The terminal/cmd/editor that you read the logs with (PuTTY/bash?).
If the terminal settings are wrong, you might have inserted corrupted data into MySQL (if your terminal is ISO-8859-1 and you read a file that is UTF-8, for instance). Assuming Linux, mysql should look at the env LC_CTYPE (but I am not 100% sure that it does).
The JDBC driver is responsible for converting the database character encoding to the JVM's internal format (UTF-16), so that should not be a problem. But you can test this with a simple Java program that inserts a hard-coded string and reads it back. Print the original and the received string; they should be identical. But if both are wrong, you have a problem with the terminal's character set definition.
Use a string like "HejÅÄÖ" for some drama...
Also, write a small program that prints the same string to a file using a PrintWriter that converts to UTF-8, and verify that the tool you use for reading the log displays that file correctly. If not, the terminal's settings are to be suspected, again.
String test = "Test HEJ \u00C5\u00C4\u00D6 ÅÄÖ";
// here's how to define what character set to use when writing to a fileOutputStream
PrintWriter pw = new PrintWriter("test.txt","UTF8");
pw.println(test);
pw.flush();
pw.close();
System.out.println(test);
output -> Test HEJ ÅÄÖ ÅÄÖ
The contents of the file test.txt should look the same.
