How to print unicode character into string?

I get an invalid Unicode error with the code below.
The Unicode character I want to print is 0x16:
PrintWriter pw = new PrintWriter(System.out, true);
char aa = "\u0x16";
pw.println(aa);
What is going wrong here?

\u0x16 is not a valid Unicode character reference. There must be exactly 4 hexadecimal digits (0-9, a-f) after \u, so the "x" is not valid. Note also that a char literal takes single quotes, not double quotes.
If you meant to use the character U+0016, it's written as \u0016:
char aa = '\u0016';
The following is equivalent, but it uses an integer constant rather than a character constant.
char aa = 0x16;
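As a quick check of the equivalence above, here is a minimal sketch. The class name ControlCharDemo and the helper fromHex are made up for illustration; fromHex covers the case where the hex digits are only known at run time.

```java
class ControlCharDemo {
    // Hypothetical helper: parses hex digits (no "0x" prefix) into a single char.
    static char fromHex(String hexDigits) {
        return (char) Integer.parseInt(hexDigits, 16);
    }

    public static void main(String[] args) {
        char fromEscape = '\u0016'; // character constant with a 4-digit Unicode escape
        char fromInt = 0x16;        // integer constant assigned to a char
        char parsed = fromHex("16"); // same value built at run time

        // All three hold the same character, U+0016.
        System.out.println(fromEscape == fromInt && fromInt == parsed); // true
        System.out.println((int) parsed); // 22
    }
}
```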

Related

How to use regular expression character class properly?

String a = "77*b+7-77/98+6";
String b[] = a.split("[*+-/]"); // works fine
b[] = a.split("[+/- *]"); // gives pattern syntax exception because of " * "
b[] = a.split("[*/+-]"); // works fine
b[] = a.split("[-*]"); // works fine
Please, help me to figure out this.

In regex, square brackets [] denote a character class. A character class can contain two characters separated by a hyphen, such as a-z, to denote a range of characters.
This means that if a hyphen is used and the range it forms is invalid, the whole pattern is invalid. The hyphen must be escaped in this case, written \\- in a Java string literal.
But if the hyphen is at the very beginning or end of the character class, it is not treated as a metacharacter, because it cannot form a range. So your other patterns work because the hyphen is effectively a literal.
b[] = a.split("[*/+-]"); // works fine: the hyphen is at the end
b[] = a.split("[-*]"); // works fine: the hyphen is at the start
The first expression has +-/, which is a valid range from + to / in the ASCII character set, equivalent to the literal characters +,-./.
The errored expression has /-, i.e. the range from / to SPACE. SPACE is character 32 and / is character 47 so your range is 47-32, the range is backwards.
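To see the escaped hyphen in action, here is a small sketch. The class CharClassDemo and its two helper methods are invented for this example; only String.split and Pattern.compile are standard API.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CharClassDemo {
    // With an escaped hyphen, '-' is always a literal, wherever it sits in the class.
    static String[] splitOperators(String expr) {
        return expr.split("[+/\\-*]");
    }

    // Reports whether a regex compiles, instead of letting the exception escape.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", splitOperators("77*b+7-77/98+6"))); // 77 b 7 77 98 6
        System.out.println(compiles("[+/- *]"));   // false: backwards range from '/' to ' '
        System.out.println(compiles("[+/\\- *]")); // true: the hyphen is escaped
    }
}
```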

How does Java concatenate 2 strings?

Why does the following print 197, but not 'bc'?
System.out.println('b' + 'c');
Can someone explain how to do proper concatenation on Java?
P.S. I learnt some Python, and now transforming to learn Java.

'b' and 'c' are not Strings, they are chars. You should use double quotes "..." instead:
System.out.println("b" + "c");
You are getting an int because you are adding the unicode values of those characters:
System.out.println((int) 'b'); // 98
System.out.println((int) 'c'); // 99
System.out.println('b' + 'c'); // 98 + 99 = 197

'b' is not a String in Java; it is a char. So 'b'+'c' prints 197.
But if you use "b"+"c", this prints bc, since "" is used to represent a String.
System.out.println("b" + "c"); // prints bc

Applying + to two chars promotes both to int and adds their numeric (ASCII) values, hence the numerical output. If you want bc as output, b and c need to be Strings. Currently, your b and c are chars in Java.
In Java, String literals should be surrounded by "" and Character are surrounded by ''
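The difference between adding chars and concatenating them can be sketched like this. CharConcatDemo, sum and join are names chosen just for this example.

```java
class CharConcatDemo {
    // char + char: both operands are promoted to int, so the result is a number.
    static int sum(char a, char b) {
        return a + b;
    }

    // Starting the expression with a String makes every + a concatenation.
    static String join(char a, char b) {
        return "" + a + b;
    }

    public static void main(String[] args) {
        System.out.println(sum('b', 'c'));             // 197
        System.out.println(join('b', 'c'));            // bc
        System.out.println(String.valueOf('b') + 'c'); // bc
    }
}
```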

Yes, single quotes denote a char while double quotes denote a String, so:
System.out.println("b" + "c");
Some alternatives can be:
"" + char1 + char2 + char3;
or:
new StringBuilder().append('b').append('c').toString();

'b' and 'c' are not Strings; they are characters.
197 is the sum of the Unicode values of b and c.
To concatenate Strings, you can use either of the following 2 ways:
System.out.println("b"+"c");
System.out.println("b".concat("c"));

In Java, String literals are represented with double quotes - ""
What you have done is added two char values together. What you want is:
System.out.println("b" + "c"); // bc
What happened with your code is that it added the ASCII values of the chars to come up with 197.
The ASCII value of 'b' is 98 and the ASCII value of 'c' is 99.
So it went like this:
System.out.println('b' + 'c'); // 98 + 99 = 197
As a note on my reference to the ASCII value of the chars:
The char data type is a single 16-bit Unicode character.
From the Docs. However, in the range 0-127, chars can also be described by their ASCII value, because the ASCII values correspond directly to the Unicode code point values.
The reason I referenced ASCII values in my answer above is that the 128 ASCII values cover all English letters (uppercase and lowercase), digits, and punctuation, so they cover the main cases.
So what I said is correct, because for these characters the ASCII values are the same as the Unicode values. Strictly speaking, though, Java adds the numeric values of the chars, which are UTF-16 code units.

Character literal in Java?

So I just started reading "Java In A Nutshell", and on Chapter One it states that:
"To include a character literal in a Java program, simply place it between single quotes"
i.e.
char c = 'A';
What exactly does this do^? I thought char only took in values 0 - 65,535. I don't understand how you can assign 'A' to it?
You can also assign 'B' to an int?
int a = 'B';
The output for 'a' is 66. Where/why would you use the above^ operation?
I apologise if this is a stupid question.
My whole life has been a lie.

char is actually an integer type. It stores the 16-bit Unicode integer value of the character in question.
You can look at something like http://asciitable.com to see the different values for different characters.

In Java, char literals represent UTF-16 (a character encoding scheme) code units. What UTF-16 gives you is a mapping between integer values (and the way they are stored in memory) and the corresponding characters (the graphical representation of a code unit).
You can enclose characters in single quotes; this way you don't need to remember the UTF-16 values of the characters you use. You can still get the integer value from a char and put it, for example, in an int (but generally not in a short: both use 16 bits, but short values range from -32768 to 32767 and char values from 0 to 65535).
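The ranges mentioned above can be checked with a short sketch. CharRangeDemo and its helpers are hypothetical names; the point is only that char to int widens silently while char to short needs a cast and can change the value.

```java
class CharRangeDemo {
    // Widening conversion: every char value fits in an int, no cast needed.
    static int toInt(char c) {
        return c;
    }

    // Narrowing conversion: a cast is required, and values above 32767 wrap to negatives.
    static short toShort(char c) {
        return (short) c;
    }

    public static void main(String[] args) {
        System.out.println(toInt('A'));                    // 65
        System.out.println((int) Character.MAX_VALUE);     // 65535
        System.out.println(toShort(Character.MAX_VALUE));  // -1
        System.out.println((char) 66);                     // B
    }
}
```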

If you look at an ASCII chart, the character "A" has a value of 41 hex or 65 decimal. Using the ' character to bracket a single character makes it a character literal. Using the double-quote (") would make it a String literal.
Assigning char someChar = 'A'; is exactly the same as saying char someChar = 65;.
As to why, consider if you simply want to see if a String contains a decimal number (and you don't have a convenient function to do this). You could use something like:
boolean isDecimal = true;
for (int i = 0; i < decString.length(); i++) {
char theChar = decString.charAt(i);
if (theChar < '0' || theChar > '9') {
isDecimal = false;
break;
}
}

Creating Unicode character from its number

I want to display a Unicode character in Java. If I do this, it works just fine:
String symbol = "\u2202";
symbol is equal to "∂". That's what I want.
The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:
int c = 2202;
String symbol = "\\u" + c;
However, in this case, symbol is equal to "\u2202". That's not what I want.
How can I construct the symbol if I know its Unicode number (but only at run-time---I can't hard-code it in like the first example)?

If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast to it as others have suggested.
If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.
Doc says:
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
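A minimal sketch of what the quoted documentation describes, assuming you want a String at the end. CodePointToString and fromCodePoint are names made up here; Character.toChars is the documented call.

```java
class CodePointToString {
    // Builds a String from any valid code point: 1 char for the BMP, 2 for supplementary ones.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x2202);            // PARTIAL DIFFERENTIAL
        String supplementary = fromCodePoint(0x1F495); // outside the BMP

        System.out.println(bmp.length());            // 1
        System.out.println(supplementary.length());  // 2
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(supplementary.codePointAt(0) == 0x1F495);            // true
    }
}
```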

Just cast your int to a char. You can convert that to a String using Character.toString():
String s = Character.toString((char)c);
EDIT:
Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
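The hex/decimal pitfall is easy to demonstrate. In this sketch, HexVsDecimalDemo and fromHexDigits are illustrative names, and it assumes the number arrives as a string of hex digits in the BMP range.

```java
class HexVsDecimalDemo {
    // Interprets the digits as hexadecimal, the way a Unicode escape in source code does.
    static String fromHexDigits(String hexDigits) {
        return String.valueOf((char) Integer.parseInt(hexDigits, 16));
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("2202", 16));                  // 8706
        System.out.println(fromHexDigits("2202").charAt(0) == 0x2202);     // true
        // Decimal 2202 is a different character from hex 2202.
        System.out.println((char) 2202 == (char) 0x2202);                  // false
    }
}
```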

The other answers here either only support unicode up to U+FFFF (the answers dealing with just one instance of char) or don't tell how to get to the actual symbol (the answers stopping at Character.toChars() or using incorrect method after that), so adding my answer here, too.
To support supplementary code points also, this is what needs to be done:
// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// for equivalence with U+n, below would be 0xnnnn
int codePoint = 128149;
// converting to char[] pair
char[] charPair = Character.toChars(codePoint);
// and to String, containing the character we want
String symbol = new String(charPair);
// we now have symbol with the desired character as the first item
// confirm that we indeed have character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));
I also did a quick test as to which conversion methods work and which don't
int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);
System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0)); // 91, didn't work
System.out.println(new String(charPair).codePointAt(0)); // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0)); // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
// 128149, worked
--
Note: as @Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which would arguably be best suited for the job.

This one worked fine for me.
String cc2 = "2202";
String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));
Now text2 will have ∂.

Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.
char c = 0x2202;//aka 8706 in decimal. \u codepoints are in hex.
String s = String.valueOf(c);

String st="2202";
int cp=Integer.parseInt(st,16);// parses st as a hexadecimal number
char c[]=Character.toChars(cp);
System.out.println(c);// displays the character corresponding to '\u2202'

Although this is an old question, there is a very easy way to do this in Java 11 which was released today: you can use a new overload of Character.toString():
public static String toString(int codePoint)
Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.
Parameters:
codePoint - the codePoint to be converted
Returns:
the string representation of the specified codePoint
Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.
Since:
11
Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.
The code needed for the example given in the question is simply:
int codePoint = '\u2202';
String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
System.out.println(s); // Prints ∂
This approach offers several advantages:
It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].

This is how you do it:
int cc = 2202; // the hex digits of U+2202 held as a decimal int, as in the question
char ccc = (char) Integer.parseInt(String.valueOf(cc), 16);
final String text = String.valueOf(ccc);
This solution is by Arne Vajhøj.

The code below writes the 4 Unicode characters (given as decimal numbers) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars!
The character values are in decimal and have been read into a String[] array, using split for instance. If you have octal or hex values, parseInt takes a radix as well.
// pseudo code
// 1. init the String[] containing the 4 code points in decimal :: intsInStrs
// 2. allocate room for up to two chars per code point :: c2s
// 3. using Integer.parseInt (... with radix or not) get the right int value
// 4. write it at the next free position in the char array
// 5. convert the used part of c2s[] to String
// 6. print
String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char[] c2s = new char[intsInStrs.length * 2]; // 2. a code point needs at most 2 chars
int len = 0; // number of chars written so far
for (String intString : intsInStrs) {
// 3 + 4. toChars returns how many chars it wrote: 1 for a BMP code point, 2 for a surrogate pair
len += Character.toChars(Integer.parseInt(intString), c2s, len);
}
String symbols = new String(c2s, 0, len); // 5. only the chars actually written
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works. Enjoy

Here is a block to print out unicode chars between \u00c0 to \u00ff:
char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 16; j++) {
String sc = new String(ca);
System.out.print(sc + " ");
ca[0]++;
}
System.out.println();
}

Unfortunately, removing one backslash, as suggested in the first comment (newbiedoodle), does not give a good result. Most (if not all) IDEs report a syntax error. The reason is that the Java Unicode escape format expects the syntax "\uXXXX", where XXXX are 4 mandatory hexadecimal digits. Attempts to assemble this string from pieces fail. Of course, "\u" is not the same as "\\u": the first begins a Unicode escape, while the second is an escaped backslash (that is, a backslash) followed by 'u'.
Strangely, the Apache pages present a utility that does exactly this, but in reality it only mimics the escape format. Apache has some utilities of its own (I didn't test them) that do this work for you, though they may still not be what you want: Apache Escape Unicode utilities. That utility does take a good approach to the solution, in combination with the one described above (MeraNaamJoker). My solution is to create this mimicked escape string and then convert it back to Unicode (to avoid the restriction of real Unicode escapes). I used it for copying text, so in the uencodeStr method it may be better to use '\\u' instead of '\\\\u'. Try it.
/**
* Converts a character to the mimic Unicode format, e.g. '\\u0020'.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param ch the character to convert
* @return the character as a mimic escaped Unicode string
*/
public static String unicodeEscaped(char ch) {
String returnStr;
//String uniTemplate = "\u0000";
final String charEsc = "\\u"; // local variables cannot be declared static
if (ch < 0x10) {
returnStr = "000" + Integer.toHexString(ch);
}
else if (ch < 0x100) {
returnStr = "00" + Integer.toHexString(ch);
}
else if (ch < 0x1000) {
returnStr = "0" + Integer.toHexString(ch);
}
else
returnStr = "" + Integer.toHexString(ch);
return charEsc + returnStr;
}
/**
* Converts the string to the mimic Unicode format, e.g. '\\u0020'.
* Note: the real Unicode escape format cannot be used here, because the compiler
* (and an editor such as NetBeans) translates it to the character immediately.
* So instead of the real format, e.g. '\u0020', the mimic format '\\u0020' is used
* as a string; it does not give the same result, of course.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param nationalString the string to convert
* @return the string in the Java mimic escaped Unicode format
*/
public String encodeStr(String nationalString) throws UnsupportedEncodingException {
String convertedString = "";
for (int i = 0; i < nationalString.length(); i++) {
Character chs = nationalString.charAt(i);
convertedString += unicodeEscaped(chs);
}
return convertedString;
}
/**
* Converts the string from the mimic Unicode format, e.g. '\\u0020', back to plain text.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* @param escapedString the string in the Java mimic escaped Unicode format
* @return the decoded string
*/
public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
String convertedString = "";
String[] arrStr = escapedString.split("\\\\u");
String str;
for (int i = 1; i < arrStr.length; i++) {
str = arrStr[i];
if (!str.isEmpty()) {
Integer iI = Integer.parseInt(str, 16);
char[] chaCha = Character.toChars(iI);
convertedString += String.valueOf(chaCha);
}
}
return convertedString;
}
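For comparison, the same escape and unescape round trip can be sketched more compactly with String.format and a regex. This is not the code above, just one possible alternative; the class EscapeCodec and its method names are invented here, and it assumes every escape has exactly 4 hex digits. Unlike the split-based version, it leaves any text outside the escapes untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeCodec {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Turns every char into its six-character mimic escape.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Replaces every mimic escape with the char it names; other text is copied as-is.
    static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        Matcher m = ESCAPE.matcher(escaped);
        int last = 0;
        while (m.find()) {
            sb.append(escaped, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(escaped, last, escaped.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escape("A ");
        System.out.println(escaped.length());               // 12
        System.out.println(unescape(escaped).equals("A ")); // true
    }
}
```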

char c=(char)0x2202;
String s=""+c;

(This answer is for .NET 4.5; a similar approach must exist in Java.)
I am from West Bengal in India.
As I understand it, your problem is this:
You want to produce something like 'অ' (a letter in the Bengali language), which has the Unicode hex value 0x0985.
If you know this value for your language, how do you produce the corresponding Unicode symbol?
In .NET it is as simple as this:
int c = 0X0985;
string x = Char.ConvertFromUtf32(c);
Now x is your answer.
But this is a hex-to-character conversion; sentence-to-sentence conversion is a job for researchers :P

Java String.codePointAt returns unexpected value

If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 402. Why is this?

But if I try use ASCII characters from 128 to 255
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the code point as a 32-bit int. 16-bit chars can't hold the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method helps resolve chars to code points.
I wrote a rough guide to encoding in Java here.

Java chars are not encoded in ISO-8859-1. They use UTF-16 which has the same values for 7bit ASCII characters (only values from 0-127).
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the extended ASCII encoding, use String.getBytes("Cp437"); to get the correct values.
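Here is a rough sketch of the difference between the code point and the encoded bytes. CharsetByteDemo and firstByte are made-up names, and IBM437 (the canonical name for Cp437) is an extended charset that may be missing from a trimmed-down runtime, so the helper returns -1 in that case.

```java
import java.nio.charset.Charset;

class CharsetByteDemo {
    // Returns the first encoded byte as an unsigned value, or -1 if the charset is unavailable.
    static int firstByte(String s, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return s.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        String f = "\u0192"; // LATIN SMALL LETTER F WITH HOOK

        System.out.println(f.codePointAt(0));          // 402, the Unicode code point
        // ISO-8859-1 has no such letter, so getBytes substitutes '?' (63).
        System.out.println(firstByte(f, "ISO-8859-1")); // 63
        // Code page 437 maps it to 159, when that charset is installed.
        System.out.println(firstByte(f, "IBM437"));
    }
}
```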

in Unicode
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK

String.codePointAt returns the Unicode-Codepoint at this specified index.
The Unicode-Codepoint of ƒ is 402, see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in those charsets via getBytes(String charsetName):
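import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Arrays;

// The statements below go inside a method.
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
    try {
        final Charset cs = Charset.forName(csName);
        final CharsetEncoder encode = cs.newEncoder();
        // print only the charsets that can actually represent the character
        if (encode.canEncode(s)) {
            System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
        }
    } catch (final UnsupportedOperationException uoe) {
        // this charset does not support encoding; skip it
    } catch (final UnsupportedEncodingException e) {
        // not expected for names taken from availableCharsets(); skip it
    }
}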
