My application generates long dynamic strings, which I save in a database column with a maximum length. When the maximum length is exceeded, the string is split with custom code and a new row is inserted in the database.
The problem occurs when multi-byte characters are used. If the split falls inside a word at a vowel sign (matra), junk symbols are generated, such as a diamond with a question mark inside.
int blockSize = 12;
String str1 = "<SOME STRING>";
byte[] b = str1.getBytes("UTF-8");
int loopCount = x; // dynamically generated in the actual code
String outString = "";
for (int i = 0; i <= loopCount; i++) {
    if (i != loopCount) {
        outString = new String(b, i * blockSize, blockSize, "UTF-8");
    } else {
        outString = new String(b, i * blockSize, b.length - loopCount * blockSize, "UTF-8");
    }
    // each outString is then inserted as a new row in the database
}
1. How can I avoid splitting the string in the middle of a word, and instead carry the complete word over to the next row?
2. Or is there another way to stop the junk symbols from being generated?
Text as conceived in Unicode has its problems on several levels.
As pure text composed of Unicode code points: ĉ can be represented as one code point U+0109, in UTF-16 (the binary format, a Java char) as '\u0109', or as c plus a zero-width, so-called combining diacritical mark for ^. So splitting between code points is already problematic. `java.text.Normalizer` can normalize to either the composed or the decomposed form. Then there are the Left-To-Right and Right-To-Left markers to consider when using a part of a text.
On the UTF-16 level, the Java char, some code points need 2 chars, a so-called surrogate pair. This is testable in Java using the Character class. Character and also the regular expression Pattern class have rather good Unicode support; one can find categories like the combining diacritical marks.
On the UTF-8 level, some (non-ASCII) chars or code points need multi-byte sequences, so splitting a byte array causes illegal UTF-8 garbage at the split point.
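To make that concrete: ĉ is one code point but two UTF-8 bytes, so decoding a byte array that was cut between them yields the replacement character. A small illustration:
import java.nio.charset.StandardCharsets;

byte[] b = "ĉ".getBytes(StandardCharsets.UTF_8);   // two bytes: 0xC4 0x89
String broken = new String(b, 0, 1, StandardCharsets.UTF_8); // "�", not text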
The solution?
It is probably sensible to normalize the text; mind file names.
Do not consider byte sub-arrays to be valid text.
Treat the boundaries of byte arrays carefully; even a c at the end might really be ĉ. Consider a shifting boundary buffer, as in the sketch below.
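As a rough sketch of a boundary-aware split (the helper splitUtf8Safe is my own, not a standard API; it assumes BreakIterator's default word boundaries are acceptable and does not handle a single word longer than maxBytes):
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class SafeSplitter {
    // Splits text into chunks whose UTF-8 encoding stays within maxBytes,
    // cutting only at word boundaries so characters and words stay intact.
    static List<String> splitUtf8Safe(String text, int maxBytes) {
        List<String> chunks = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance();
        words.setText(text);
        int chunkStart = 0;
        int lastFit = words.first();
        for (int b = words.next(); b != BreakIterator.DONE; b = words.next()) {
            int bytes = text.substring(chunkStart, b)
                            .getBytes(StandardCharsets.UTF_8).length;
            if (bytes > maxBytes && lastFit > chunkStart) {
                chunks.add(text.substring(chunkStart, lastFit));
                chunkStart = lastFit;
            }
            lastFit = b;
        }
        if (chunkStart < text.length()) {
            chunks.add(text.substring(chunkStart));
        }
        return chunks;
    }
}
Chunks produced this way always decode cleanly, because every cut point is a character boundary.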
For example:
public static void smallestWord() {
    String smallestWord = "~";
    List<String> words = new ArrayList<>();
    words.add("dba");
    words.add("dba");
    words.add("eba");
    words.add("dca");
    words.add("eca");
    for (String word : words) {
        if (word.compareTo(smallestWord) < 0) {
            smallestWord = word;
        }
    }
    System.out.println(smallestWord); // prints "dba" for this list
}
It finds dba as the smallest word, which is correct, but I initialized smallestWord as "~". If I leave it empty or use "." I do not get the correct answer. What value does "~" hold in Java lexicography?
All characters in Java are compared by their Unicode code point. ~ is U+007E (126) in Unicode, which is larger than all the standard ASCII Latin characters, but less than characters from all other scripts, or accented Latin characters. For more detailed information on how strings are compared, you can read the String.compareTo JavaDoc.
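A quick demonstration of both halves of that claim:
System.out.println("~".compareTo("z")); // positive: '~' (U+007E) sorts after 'z' (U+007A)
System.out.println("~".compareTo("é")); // negative: '~' sorts before 'é' (U+00E9)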
What you want to do is probably rather something like this:
public static void smallestWord() {
    String smallestWord = null;
    List<String> words = new ArrayList<>();
    words.add("dba");
    words.add("dba");
    words.add("eba");
    words.add("dca");
    words.add("eca");
    for (String word : words) {
        if ((smallestWord == null) || (word.compareTo(smallestWord) < 0)) {
            smallestWord = word;
        }
    }
}
Or, alternatively, use the standard library:
Collections.min(words);
As others have pointed out, '~' is an ASCII character / Unicode code point that is larger than all ASCII letters, i.e. upper and lowercase 'A' through 'Z'.
Therefore, according to the specification1 of the String class, "~" comes after any English word.
However, the '~' code point is NOT greater than accented letters or letters in non-Latin alphabets. So the "~" trick won't work with Cyrillic or Hindi. And if you can think of a French / German / Portuguese / etc. word that has an accented first letter, it won't work in those languages either.
And it won't work with Emojis either.
In short, that code using "~" as in your example won't work in internationalized contexts.
You could use null as per #Dolda2000's answer, or you could use the largest Unicode code point, U+10FFFF. Note that as a Java string literal it has to be written as the surrogate pair "\udbff\udfff"; the escape "\u10ffff" would actually mean U+10FF followed by the literal characters "ff".
(Even that approach is not entirely fool-proof. There are legal Java strings that compare larger; e.g. the U+10FFFF character followed by "ZZZZ". Unfortunately, the largest possible string value cannot be written as a string literal, and its representation is ridiculously large: roughly 2^31 chars!)
1 - Strings are ordered by comparing UTF-16 code units rather than Unicode code points. For strings of BMP characters the two views agree, but a supplementary character (whose first code unit is a surrogate in the 0xD800-0xDBFF range) actually sorts before BMP characters in the 0xE000-0xFFFF range, even though its code point is larger.
compareTo works on Unicode character values. ~ has a Unicode value greater than the letters of the alphabet, so it works, while space and dot have Unicode values less than the letters, so they are considered smaller and printed instead.
I couldn't find an answer to this problem, having tried several answers here, combined, to find something that works, to no avail.
An application I'm working on uses a user's name to create PDFs with that name in them. However, when someone's name contains a special character like "Yağmur", the PDF creator freaks out and omits the special character.
However, when it gets the numeric-entity equivalent ("Ya&#287;mur"), it prints "Yağmur" in the PDF as it should.
How do I check a name/string for any special character (regex = "[^a-z0-9 ]") and, when one is found, replace that character with its entity equivalent and return the new string?
I will try to give the solution in a generic way, as the framework you are using is not mentioned in your problem statement.
I faced the same kind of issue a long time back. This should be handled by the PDF engine if you set the text/char encoding to UTF-8. Please find out how to set the encoding for PDF generation in your framework and try it out. Hope it helps!
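For example, if the PDF library happens to be iText 5 (an assumption; the question does not name the framework), the usual approach is to embed a Unicode-capable font with Identity-H encoding so non-Latin glyphs survive. The font path below is hypothetical:
import com.itextpdf.text.Font;
import com.itextpdf.text.pdf.BaseFont;

// assumes a TTF with the needed glyphs is available at this (hypothetical) path
BaseFont bf = BaseFont.createFont("fonts/DejaVuSans.ttf",
        BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font font = new Font(bf, 12);
// use this font for the Paragraph/Phrase that contains the user's name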
One hackish way to do this would be as follows:
/*
 * TODO: poorly named
 */
public static String convertUnicodePoints(String input) {
    // getting char array from input
    char[] chars = input.toCharArray();
    // initializing output
    StringBuilder sb = new StringBuilder();
    // iterating input chars
    for (int i = 0; i < input.length(); i++) {
        // checking the character's code point to infer whether "conversion" is required;
        // here, picking an arbitrary code point 125 as the boundary
        if (Character.codePointAt(input, i) < 125) {
            sb.append(chars[i]);
        }
        // need to "convert": code point >= boundary
        else {
            // for a hex representation: prepend as many 0s as required
            // to get a hex string of the char code point, 4 characters long
            // sb.append(String.format("&#x%04X;", (int) chars[i]));
            // for the decimal representation, which is what you want here
            sb.append(String.format("&#%d;", (int) chars[i]));
        }
    }
    return sb.toString();
}
If you execute System.out.println(convertUnicodePoints("Yağmur"));...
... you'll get: Ya&#287;mur.
Of course, you can play with the "conversion" logic and decide which ranges get converted.
I have a variable string that might contain any unicode character. One of these unicode characters is the han 𩸽.
The thing is that this "han" character has "𩸽".length() == 2 but is written in the string as a single character.
Considering the code below, how would I iterate over all characters and compare each one, taking into account that the string might contain a character whose length is greater than 1?
for (int i = 0; i < string.length(); i++) {
    char character = string.charAt(i);
    if (character == '𩸽') {
        // Fail, it interprets as 2 chars =/
    }
}
EDIT:
This question is not a duplicate. It asks how to iterate over each character of a String while considering characters whose .length() > 1 (a character not as the char type but as the representation of a written symbol). It does not require previous knowledge of how to iterate over the Unicode code points of a Java String, although an answer mentioning that may also be correct.
int hanCodePoint = "𩸽".codePointAt(0);
for (int i = 0; i < string.length(); ) {
    int currentCodePoint = string.codePointAt(i);
    if (currentCodePoint == hanCodePoint) {
        // do something here.
    }
    // advance by 1 or 2 char positions, depending on the code point's width
    i += Character.charCount(currentCodePoint);
}
The String.charAt and String.length methods treat a String as a sequence of UTF-16 code units. You want to treat the string as Unicode code-points.
Look at the "code point" methods in the String API:
codePointAt(int index) returns the (32 bit) code point at a given code-unit index
offsetByCodePoints(int index, int codePointOffset) returns the code-unit index corresponding to codePointOffset code-points from the code-unit at index.
codePointCount(int beginIndex, int endIndex) counts the code-points between two code-unit indexes.
Indexing the string by code point index is a bit tricky, especially if the string is long and you want to do it efficiently. However, it is doable, albeit with rather cumbersome code.
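A small sketch of those methods in action; the comments show the values returned for this particular string:
String s = "a𩸽b";                              // 3 code points in 4 chars
int unitIndex = s.offsetByCodePoints(0, 1);     // 1: code-unit index of the 2nd code point
int cp = s.codePointAt(unitIndex);              // 0x29E3D, i.e. 𩸽
int count = s.codePointCount(0, s.length());    // 3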
#sstan's answer is one solution.
This will be simpler if you treat both the string and the data you're searching for as Strings. If you just need to test for the presence of that character:
if (string.contains("𩸽")) {
    // do something here.
}
If you specifically need the index where that character appears:
int i = string.indexOf("𩸽");
if (i >= 0) {
    // do something with i here.
}
And if you really need to iterate through every code point, see How can I iterate through the unicode codepoints of a Java String?
The han character has length 2 not because of ASCII versus Unicode, but because it lies outside the Basic Multilingual Plane: UTF-16, and therefore Java's String, encodes it as a surrogate pair of two char values. It is still a single Unicode code point, and it is in fact a Unicode letter; length() simply counts char units, not characters.
Thanks in advance for your patience. This is my problem.
I'm writing a program in Java that works best with a big set of different characters.
I have to store all the characters in a String. I started with
private static final String values = "0123456789";
Then I added A-Z, a-z and all the commons symbols.
But they are still too few, so I thought that maybe Unicode could be the solution.
The problem is now: what is the best way to get all the Unicode characters that can be displayed in Eclipse? (My algorithm will probably fail if there are unrecognized characters, those displayed as little rectangles.) Is it possible to build a string (or some strings) with all the characters listed here (en.wikipedia.org/wiki/List_of_Unicode_characters) correctly displayed?
I can do a rough copy-paste from http://www.terena.org/activities/multiling/euroml/tests/test-ucspages1ucs.html or http://zenoplex.jp/tools/unicoderange_generator.html, but I would appreciate some cleaner solution.
I don't know if there is a way to extract characters from a font (the Unifont one). Or maybe I should parse this (www.utf8-chartable.de/unicode-utf8-table.pl) webpage.
Moreover, by adding all the characters into a String I will probably get the error:
"The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool" (discussed in this question on SO: /questions/10798769/how-to-process-a-string-with-823237-characters).
Hybrid solutions are accepted. I can remove duplicates following this question on SO: questions/4989091/removing-duplicates-from-a-string-in-java.
Finally: any solution that yields the longest string of all-different characters is accepted.
Thanks!
You are mixing some things up. The question whether a character can be displayed in Eclipse depends on the font you have chosen; and whether the source file can be processed correctly depends on which character encoding you have set up for the source file. When choosing UTF-8 and a good unicode font you can use and display almost any character, at least more than fit into a single String literal.
But is it really required to show the character in Eclipse? You can use the unicode escapes, e.g. \u20ac to refer to characters, regardless of whether they can be displayed or if the file encoding can handle them.
And if it is not a requirement to blow up your source code, it’s easy to create a String containing all existing characters:
// all chars (i.e. UTF-16 values)
StringBuilder sb = new StringBuilder(Character.MAX_VALUE);
for (char c = 0; c < Character.MAX_VALUE; c++) sb.append(c);
String s = sb.toString();
// if it should behave like a compile-time constant:
s = s.intern();
or
// all Unicode characters (aka code points)
StringBuilder sb = new StringBuilder(2162686);
for (int c = 0; c < Character.MAX_CODE_POINT; c++) sb.appendCodePoint(c);
String s = sb.toString();
// if it should behave like a compile-time constant:
s = s.intern();
If you want the String to contain only valid Unicode characters, you can use if (Character.isDefined(c)) … inside the loop. But that's a moving target: newer JREs will most probably know more defined characters.
Simply use the Apache Commons classes; org.apache.commons.lang3.RandomStringUtils (commons-lang) can solve your purpose.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/RandomStringUtils.html
Also, please refer to the code below for API usage:
import org.apache.commons.lang3.RandomStringUtils;

public class RandomString {
    public static void main(String[] args) {
        // Random string only with numbers
        String string = RandomStringUtils.random(64, false, true);
        System.out.println("Random 0 = " + string);

        // Random alphabetic string
        string = RandomStringUtils.randomAlphabetic(64);
        System.out.println("Random 1 = " + string);

        // Random ASCII string
        string = RandomStringUtils.randomAscii(32);
        System.out.println("Random 2 = " + string);

        // Create a random string with indexes from the given array of chars
        string = RandomStringUtils.random(32, 0, 20, true, true,
                "bj81G5RDED3DC6142kasok".toCharArray());
        System.out.println("Random 3 = " + string);
    }
}
I want to display a Unicode character in Java. If I do this, it works just fine:
String symbol = "\u2202";
symbol is equal to "∂". That's what I want.
The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:
int c = 2202;
String symbol = "\\u" + c;
However, in this case, symbol is equal to "\u2202". That's not what I want.
How can I construct the symbol if I know its Unicode number (but only at run-time---I can't hard-code it in like the first example)?
If you want to get a UTF-16 code unit as a char, you can parse the integer and cast it to char, as others have suggested.
If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.
Doc says:
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
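A minimal sketch of the difference (the emoji code point is just an arbitrary supplementary example):
int bmp = 0x2202;            // ∂, fits in one char
int supplementary = 0x1F602; // 😂, needs a surrogate pair
System.out.println(new String(Character.toChars(bmp)));            // ∂
System.out.println(new String(Character.toChars(supplementary)));  // 😂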
Just cast your int to a char. You can convert that to a String using Character.toString():
String s = Character.toString((char)c);
EDIT:
Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
The other answers here either support only Unicode up to U+FFFF (the answers dealing with just one char value) or don't tell how to get to the actual symbol (the answers stopping at Character.toChars() or using an incorrect method after that), so I'm adding my answer here, too.
To support supplementary code points also, this is what needs to be done:
// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// (128149 decimal == 0x1F495, i.e. U+1F495)
int codePoint = 128149;
// converting to a char[] pair
char[] charPair = Character.toChars(codePoint);
// and to a String containing the character we want
String symbol = new String(charPair);
// we now have symbol with the desired character as the first item
// confirm that we indeed have the character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));
I also did a quick test as to which conversion methods work and which don't
int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);
System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0));        // 91, didn't work
System.out.println(new String(charPair).codePointAt(0));       // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0));  // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0)); // 128149, worked
Note: as #Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which is arguably best suited for the job.
This one worked fine for me.
String cc2 = "2202";
String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));
Now text2 will have ∂.
Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.
char c = 0x2202; // aka 8706 in decimal; \u code points are in hex
String s = String.valueOf(c);
String st = "2202";
int cp = Integer.parseInt(st, 16); // parses st as a hex number
char[] c = Character.toChars(cp);
System.out.println(c); // displays the character corresponding to '\u2202'
Although this is an old question, there is a very easy way to do this in Java 11 which was released today: you can use a new overload of Character.toString():
public static String toString(int codePoint)
Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.
Parameters: codePoint - the codePoint to be converted
Returns: the string representation of the specified codePoint
Throws: IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.
Since: 11
Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.
The code needed for the example given in the question is simply:
int codePoint = '\u2202';
String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
System.out.println(s); // Prints ∂
This approach offers several advantages:
It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].
This is how you do it:
String cc = "2202"; // the code point as a hex string
char ccc = (char) Integer.parseInt(cc, 16);
final String text = String.valueOf(ccc); // "∂"
This solution is by Arne Vajhøj.
The code below will write the 4 Unicode chars (represented by decimals) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars!
The character values are in decimal and have been read into a String[] array, using split for instance. If you have octal or hex values, parseInt takes a radix as well.
// pseudo code
// 1. init the String[] containing the 4 unicodes in decimal :: intsInStrs
// 2. allocate enough chars: up to two per code point :: c2s
// 3. using Integer.parseInt (with a radix or not) get the right int value
// 4. place it at the current offset in the array of chars
// 5. convert c2s[] to String
// 6. print

String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char[] c2s = new char[intsInStrs.length * 2];               // 2.
int offset = 0;
for (String intString : intsInStrs) {
    // 3. + 4. toChars returns how many chars it wrote (1 for BMP
    // code points, 2 for supplementary ones), so advance by that count
    offset += Character.toChars(Integer.parseInt(intString), c2s, offset);
}
String symbols = new String(c2s, 0, offset); // 5.
System.out.println("\nLooooonger code point: " + symbols); // 6.
Here is a block to print out the Unicode chars from \u00c0 to \u00ff:
char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 16; j++) {
        String sc = new String(ca);
        System.out.print(sc + " ");
        ca[0]++;
    }
    System.out.println();
}
Unfortunately, removing one backslash, as mentioned in the first comment (newbiedoodle), doesn't lead to a good result. Most (if not all) IDEs issue a syntax error. The reason is that the Java escaped-Unicode format expects the syntax "\uXXXX", where XXXX are 4 hexadecimal digits, which are mandatory. Attempts to assemble this string from pieces fail. Of course, "\u" is not the same as "\\u": the first syntax means an escaped 'u', the second means an escaped backslash (which is a backslash) followed by 'u'.
It is strange that the Apache pages present a utility which does exactly this behavior; in reality it is an escape-mimic utility. Apache has some utilities of its own (I didn't test them) which do this work for you; maybe it is still not what you want to have. Apache Escape Unicode utilities. But this utility has a good approach to the solution. In combination with the approach described above (MeraNaamJoker), my solution is to create this escaped-mimic string and then convert it back to Unicode (to avoid the real escaped-Unicode restriction). I used it for copying text, so it is possible that in the uencode method it would be better to use '\\u' instead of '\\\\u'. Try it.
/**
 * Converts a character to the mimic unicode format, i.e. '\\u0020'.
 *
 * This format is the Java source code format.
 *
 *   CharUtils.unicodeEscaped(' ') = "\\u0020"
 *   CharUtils.unicodeEscaped('A') = "\\u0041"
 *
 * @param ch the character to convert
 * @return the mimic of the escaped unicode string
 */
public static String unicodeEscaped(char ch) {
    final String charEsc = "\\u";
    String returnStr;
    if (ch < 0x10) {
        returnStr = "000" + Integer.toHexString(ch);
    } else if (ch < 0x100) {
        returnStr = "00" + Integer.toHexString(ch);
    } else if (ch < 0x1000) {
        returnStr = "0" + Integer.toHexString(ch);
    } else {
        returnStr = "" + Integer.toHexString(ch);
    }
    return charEsc + returnStr;
}
/**
 * Converts the string from UTF8 to the mimic unicode format, i.e. '\\u0020'.
 * Notice: I cannot use the real unicode format, because it is immediately
 * translated to the character at compile time and the editor (i.e. NetBeans)
 * checks it. Instead of the real unicode format, i.e. '\u0020', I use the
 * mimic unicode format '\\u0020' as a string, but it doesn't give the same
 * results, of course.
 *
 * This format is the Java source code format.
 *
 *   CharUtils.unicodeEscaped(' ') = "\\u0020"
 *   CharUtils.unicodeEscaped('A') = "\\u0041"
 *
 * @param nationalString the UTF8 string to convert
 * @return the string escaped in the Java unicode mimic format
 */
public String encodeStr(String nationalString) throws UnsupportedEncodingException {
    StringBuilder converted = new StringBuilder();
    for (int i = 0; i < nationalString.length(); i++) {
        converted.append(unicodeEscaped(nationalString.charAt(i)));
    }
    return converted.toString();
}
/**
 * Converts the string from the mimic unicode format, i.e. '\\u0020', back to UTF8.
 *
 * This format is the Java source code format.
 *
 *   CharUtils.unicodeEscaped(' ') = "\\u0020"
 *   CharUtils.unicodeEscaped('A') = "\\u0041"
 *
 * @param escapedString the string in the Java unicode mimic escaped format
 * @return the UTF8 string
 */
public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
    StringBuilder converted = new StringBuilder();
    String[] arrStr = escapedString.split("\\\\u");
    for (int i = 1; i < arrStr.length; i++) {
        String str = arrStr[i];
        if (!str.isEmpty()) {
            int codePoint = Integer.parseInt(str, 16);
            converted.append(Character.toChars(codePoint));
        }
    }
    return converted.toString();
}
char c = (char) 0x2202;
String s = "" + c;
(This answer is in .NET 4.5; in Java, a similar approach must exist.)
I am from West Bengal in INDIA.
As I understand it, your problem is...
You want to produce something similar to 'অ' (a letter in the Bengali language), which has the Unicode HEX value 0x0985.
Now, if you know this value for your language, how will you produce that language-specific Unicode symbol, right?
In Dot Net it is as simple as this :
int c = 0X0985;
string x = Char.ConvertFromUtf32(c);
Now x is your answer.
But this is a HEX-by-HEX conversion; sentence-to-sentence conversion is work for researchers :P
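For the Java side, a minimal equivalent using standard JDK API:
int c = 0x0985;
String x = new String(Character.toChars(c)); // "অ"
// or, on Java 11+: String x = Character.toString(c);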