import java.io.UnsupportedEncodingException;

public class TestChar {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String cnStr = "龙";
        String enStr = "a";

        byte[] cnBytes = cnStr.getBytes("UTF-8");
        byte[] enBytes = enStr.getBytes("UTF-8");

        System.out.println("bytes size of Chinese:" + cnBytes.length);
        System.out.println("bytes size of English:" + enBytes.length);

        // In Java a char takes two bytes; the question is:
        char cnc = '龙'; // will '龙' take two or three bytes?
        char enc = 'a';  // will 'a' take one or two bytes?
    }
}
Output:
bytes size of Chinese:3
bytes size of English:1
Here my JVM's default charset is UTF-8. From the output we can see that the Chinese character '龙' takes 3 bytes and the English character 'a' takes one byte. My question is:
In Java a char takes two bytes, so for char cnc = '龙'; and char enc = 'a';, will cnc take only two bytes instead of three? And will 'a' take two bytes instead of one?
The codepoint value of 龙 is 40857. That fits inside the two bytes of a char.
It takes 3 bytes to encode in UTF-8 because not all 2-byte sequences are valid in UTF-8.
UTF-8 is a variable-length character encoding, where characters take up 1 to 4 bytes.
A Java char is 16 bits. See 3.1 Unicode in the Java Language Specification to understand how exactly Java handles Unicode.
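A minimal way to check this (it assumes the source file is saved as UTF-8 so the literal compiles correctly, and Character.BYTES needs Java 8+):

char cnc = '龙';
System.out.println((int) cnc);       // 40857 (0x9F99) - the code point fits in the 16 bits of a char
System.out.println(Character.BYTES); // 2 - a char always occupies two bytes, whatever the character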
Internally, Strings/chars are UTF-16, so it will be the same for both: each char will be 16 bits.
byte[] cnBytes = cnStr.getBytes("UTF-8");
UTF-8 is a variable-length encoding, so the Chinese char takes more bytes because it is outside the ASCII character range.
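To make the difference between the internal UTF-16 representation and the UTF-8 output visible, you can encode the same strings both ways (a small sketch; StandardCharsets is java.nio.charset.StandardCharsets, available since Java 7):

System.out.println("龙".getBytes(StandardCharsets.UTF_8).length);    // 3 bytes in UTF-8
System.out.println("龙".getBytes(StandardCharsets.UTF_16BE).length); // 2 bytes - one 16-bit code unit
System.out.println("a".getBytes(StandardCharsets.UTF_8).length);     // 1 byte in UTF-8
System.out.println("a".getBytes(StandardCharsets.UTF_16BE).length);  // 2 bytes - still one 16-bit code unit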
Related
Hello, today I have a problem printing extended ASCII codes in Java. When I try to print one, it does not display. How can I print it?
You can use the String constructor that takes a byte array and a character set to convert a code page 437 ("IBM extended ASCII") character to a Java UTF-16 char:
public static char extendedAscii(int codePoint) throws UnsupportedEncodingException {
    return new String(new byte[] { (byte) codePoint }, "Cp437").charAt(0);
}
(Note: Yes, all characters in code page 437 fit in single UTF-16 chars; I checked.)
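Calling it would look something like this (a sketch; it assumes the Cp437 charset is available in your runtime and that your console encoding can actually display the character):

char f = extendedAscii(159); // 0x9F in Cp437 is 'ƒ' (LATIN SMALL LETTER F WITH HOOK)
System.out.println(f);       // prints ƒ if the output encoding supports it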
I always get confused when people ask me to generate a random string of 300 bytes or some other predefined number of bytes. I am not sure what they mean in general. Are they asking for a string of length 300?
I am working on a project in which people have asked me to generate a random String of approximately 300 bytes.
Is it possible to do? I am confused about how we can generate a random string of 300 bytes.
I know how to generate a random string like this:
private static final Random random = new Random();
private static final String CHARACTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

public static String generateString(int length) {
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
        sb.append(CHARACTERS.charAt(random.nextInt(CHARACTERS.length())));
    }
    return sb.toString();
}
Can anyone explain what it means when we need to generate a random string of approximately 300 bytes?
Strings are made of characters, but the number of bytes required to represent a character can be 1 or 2 (or sometimes more) depending on the character and the encoding. E.g. characters encoded in UTF-8 whose code points are above 127 need 2 or more bytes, but those below, like English letters and digits, take only 1.
Normally, string size refers to the number of characters. You only need the bytes if you are writing bytes, and the bytes you write depend on the encoding used.
I would interpret the requirement as 300 characters, especially since you have listed all candidate characters and they are 1-byte characters in the common encodings.
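If the requirement really is 300 bytes rather than 300 characters, you can reuse the generateString method from the question and simply verify the encoded length (a sketch, assuming UTF-8 output, where every character in CHARACTERS encodes to a single byte):

String s = generateString(300);
int byteLength = s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
System.out.println(byteLength); // 300 - A-Z, a-z and 0-9 all take one byte in UTF-8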
Since each hexadecimal digit represents four binary digits, two hex digits represent one byte. You can generate 300 random numbers Yi such that 0x41 <= Yi <= 0x5A, so that each Yi maps to a letter in [A-Z], which encodes to a single byte in UTF-8. (Change the range if you want to include lowercase letters, digits or any other characters.)
Then you convert these numbers to a String using a byte-based encoding.
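For example (a rough sketch of that idea; it stays inside 0x41-0x5A, so the result contains uppercase letters only, one byte per character):

java.util.Random random = new java.util.Random();
byte[] bytes = new byte[300];
for (int i = 0; i < bytes.length; i++) {
    bytes[i] = (byte) (0x41 + random.nextInt(0x5A - 0x41 + 1)); // random value in [0x41, 0x5A] = 'A'..'Z'
}
String s = new String(bytes, java.nio.charset.StandardCharsets.US_ASCII); // 300 bytes, 300 characters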
A Java char is 2 bytes; an example is as follows:
public class Main {
    public static void main(String[] args) {
        String str = "Hello World";
        System.out.println(str.getBytes().length);
    }
}
The result is 11.
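Note that getBytes() reports the length of the encoded output in the default charset, not the in-memory size of the chars. To see the two bytes per char, you can encode to UTF-16 instead (a sketch using java.nio.charset.StandardCharsets):

String str = "Hello World";                                          // 11 chars
System.out.println(str.getBytes(StandardCharsets.UTF_16BE).length);  // 22 - two bytes per char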
I am trying to encode/decode a ByteArray to String, but input/output are not matching. Am I doing something wrong?
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The output is:
130021000061f8f0001a
130021000061efbfbd
Complete code:
String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};
byte[] by = new byte[arr.length];
for (int i = 0; i < arr.length; i++) {
    by[i] = (byte) (Integer.parseInt(arr[i], 16) & 0xff);
}
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.
First of all, the f8 opening byte denotes a 5-byte sequence and you've only got four. Secondly, f8 can only be followed by a byte of the 8x, 9x, ax or bx form.
Therefore it gets replaced with the Unicode replacement character (U+FFFD), whose byte sequence in UTF-8 is efbfbd.
And there (rightly) is no guarantee that the conversion of an invalid byte sequence to and from a string will result in the same byte sequence. (Note that even with two, seemingly identical strings, you might get different bytes representing them in Unicode, see Unicode equivalence. )
The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
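If the bytes really do have to travel through a String, one common alternative (a sketch, not necessarily what your surrounding code needs) is a text-safe encoding such as Base64 via java.util.Base64, available since Java 8:

byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61, (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
String s = java.util.Base64.getEncoder().encodeToString(by);
byte[] back = java.util.Base64.getDecoder().decode(s);
System.out.println(java.util.Arrays.equals(by, back)); // true - the round trip is lossless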
My UTF-8 is a bit rusty :-), but the sequence F8 F0 is IMHO not a valid UTF-8 encoding.
Look at http://en.wikipedia.org/wiki/Utf-8#Description.
When you build the String from the array of bytes, the bytes are decoded.
Since the bytes from your code do not represent valid characters, the bytes that finally compose the String are not the same as the ones you passed as a parameter.
public String(byte[] bytes)
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
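A sketch of what "more control" can look like: configuring a CharsetDecoder to report invalid input instead of silently substituting U+FFFD (the classes are in java.nio and java.nio.charset; by is the byte array from the question):

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    CharBuffer text = decoder.decode(ByteBuffer.wrap(by));
    System.out.println(text);
} catch (CharacterCodingException e) {
    System.out.println("not valid UTF-8: " + e); // thrown here because of the F8 F0 sequence
}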
If I have some binary data D and I convert it to a String S, I expect that on converting it back to binary I will get D. But that turns out to be wrong:
import java.io.IOException;

public class A {
    public static void main(String[] args) throws IOException {
        final byte[] bytes = new byte[]{-114, 104, -35}; // in hex: 8E 68 DD
        System.out.println(bytes.length); // prints 3
        System.out.println(new String(bytes, "UTF-8").getBytes("UTF-8").length); // prints 7
    }
}
Why does this happen?
Converting from a byte array to a String and back again is not a one-to-one mapping operation. Reading the docs, the String implementation uses a CharsetDecoder to convert the incoming byte array into Unicode. The first and last bytes in your input byte array evidently do not map to valid UTF-8 sequences, so the decoder replaces each of them with a replacement character.
It's likely that the bytes you're converting to a string don't actually form a valid string. If java can't figure out what you mean by each byte, it will attempt to fix them. This means that when you convert back to the byte array, it won't be the same as when you started. If you try with a valid set of bytes, then you should be more successful.
Your data can't be decoded into valid Unicode characters using the UTF-8 encoding. Look at the decoded string: it consists of 3 characters: 0xFFFD, 0x0068 and 0xFFFD. The first and last are "�", the Unicode replacement character. I think you need to choose another encoding, e.g. "CP866" produces a valid string and converts back into the same array.
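Any single-byte charset that maps all 256 byte values will round-trip this way. A minimal sketch with ISO-8859-1 (an assumption on my part, chosen because it maps every byte 0x00-0xFF to a code point):

byte[] bytes = {-114, 104, -35}; // 8E 68 DD
String s = new String(bytes, java.nio.charset.StandardCharsets.ISO_8859_1);
byte[] back = s.getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
System.out.println(java.util.Arrays.equals(bytes, back)); // true - nothing is lost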
If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 402. Why is this?
But if I try to use ASCII characters from 128 to 255
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the 32-bit code point. 16-bit chars can't contain the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method resolves such a pair of chars back into a single code point.
I wrote a rough guide to encoding in Java here.
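For example, a character outside the Basic Multilingual Plane occupies two chars but is still one code point (a small sketch; the escape sequence is the UTF-16 surrogate pair for U+1F600):

String s = "\uD83D\uDE00";                            // 😀 U+1F600
System.out.println(s.length());                       // 2 - two UTF-16 code units (a surrogate pair)
System.out.println(s.codePointCount(0, s.length()));  // 1 - a single code point
System.out.println(s.codePointAt(0));                 // 128512 (0x1F600)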
Java chars are not encoded in ISO-8859-1. They use UTF-16, which has the same values for 7-bit ASCII characters (only values from 0-127).
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the extended ASCII encoding, use String.getBytes("Cp437"); to get the correct values.
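For instance (a sketch; it assumes the Cp437 charset ships with your JRE):

byte[] b = "ƒ".getBytes(java.nio.charset.Charset.forName("Cp437"));
System.out.println(b[0] & 0xFF); // 159 - the value the extended ASCII table lists for ƒ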
In Unicode:
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
String.codePointAt returns the Unicode code point at the specified index.
The Unicode code point of ƒ is 402 (0x0192), see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in other charsets via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
try {
final Charset cs = Charset.forName(csName);
final CharsetEncoder encode = cs.newEncoder();
if (encode.canEncode(s))
{
System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
}
} catch (final UnsupportedOperationException uoe) {
} catch (final UnsupportedEncodingException e) {
}
}