How to convert UTF-8 to GBK string in java

I retrieved an HTML string from a target site, and within it there is a section
class="f9t" name="ÓÃ»§Ãû:ôâÈ»12"
I know it's in GBK encoding, as I can see from how the Firefox browser displays it. But I do not know how to convert that name string into a readable GBK string (such as 上海 or 北京).
I am using
String sname = new String(name.getBytes(), "UTF-8");
byte[] gbkbytes = sname.getBytes("gb2312");
String gbkStr = new String( gbkbytes );
System.out.println(gbkStr);
but it does not print the correct GBK text:
???¡ì??:????12
I have no clue how to proceed.

If you have already read the name with the wrong encoding and got the wrong value "ÓÃ»§Ãû:ôâÈ»12", you can try this, as @Karol S suggested:
new String(name.getBytes("ISO-8859-1"), "GBK")
Or, if you read a GBK or GB2312 string from the internet or a file, use something like this to get the right string in the first place:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "GBK"));
name = r.readLine();
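The re-decoding trick above can be checked with a small round trip. This is only a sketch: the class and method names are made up for illustration, it assumes the JVM ships the GBK charset, and it assumes the string was mis-decoded as ISO-8859-1 (which maps every byte to exactly one character, so no bytes are lost).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GbkRepairSketch {
    // Recover the raw bytes of a string that was wrongly decoded as
    // ISO-8859-1, then decode those bytes as GBK.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String original = "\u7528\u6237\u540D"; // three Chinese characters
        // Simulate the fault: GBK bytes read as if they were ISO-8859-1.
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));
        String garbled = new String(gbkBytes, StandardCharsets.ISO_8859_1);
        // The repaired string should match the original again.
        System.out.println(repair(garbled).equals(original));
    }
}
```

If the string was mis-decoded with a charset that drops or replaces bytes, this kind of repair will probably not work, and reading with the right charset in the first place is the safer route.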

Assuming that name.getBytes() returns GBK-encoded bytes, it is enough to create the string by specifying the encoding of the byte array:
new String(gbkString.getBytes(), "GBK");
According to the documentation, the name of the encoding should be GBK.
Sample code:
String gbkString = "ÓÃ»§Ãû:ôâÈ»12";
String utfString = new String(gbkString.getBytes(), "GBK");
System.out.println(utfString);
Result (not 100% sure that it's correct :) ):
脫脙禄搂脙没:么芒脠禄12

Related

Java encode special character in a String with UTF-8 character

String original = "This is my string valúe";
I'm trying to encode the above string to UTF-8 equivalent but to replace only special character (ú) with -- "&#250 ;" in this case.
I've tried using the below but I get an error:
Input is not proper UTF-8, indicate encoding !Bytes: 0xFA 0x20 0x63 0x61
Code:
String original = new String("This is my string valúe");
byte ptext[] = original.getBytes("UTF-8");
String value = new String(ptext, "UTF-8");
System.out.println("Output : " + value);
This is my string valúe

You could use String.replace(CharSequence, CharSequence) and formatted io like
String original = "This is my string valúe";
System.out.printf("Output : %s%n", original.replace("ú", "&#250;"));
Which outputs (as I think you wanted)
Output : This is my string val&#250;e

You seem to want to use XML character entities.
Apache Commons Lang has a method for this (in StringEscapeUtils).
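If pulling in a library is not an option, a plain-JDK version might look like the sketch below. It is not the Commons Lang implementation; the class and method names are invented for illustration, and it assumes that "special character" means any code point above 127.

```java
final class NumericEntitySketch {
    // Replace every non-ASCII code point with a decimal numeric character
    // reference such as &#250; and leave ASCII characters untouched.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00FA is the letter u with an acute accent.
        System.out.println(escapeNonAscii("This is my string val\u00FAe"));
        // prints: This is my string val&#250;e
    }
}
```

Iterating over code points rather than chars means characters outside the Basic Multilingual Plane come out as a single reference instead of two surrogate halves.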

> I'm trying to encode the above string to UTF-8 equivalent but to replace only special character ( ú ) with -- "&#250 ;" in this case.
I'm not sure what encoding "&#250 ;" is but have you tried looking at the URLEncoder class? It won't encode the string exactly the way you asked but it gets rid of the spooky character.

Could you please try the below lines:
byte ptext[] = original.getBytes("UTF8");
String value = new String(ptext, "UTF8");

reading only half content while converting encoded zipped string to normal string in java

String request=new String("UEsDBBQACAAIAHhoEEMAAAAAAAAAAAAAAAABAAAAMe1X247bNhD9FWPfs7Ls7GYNMEwpirLZSKTCy3r9xIfEKAJkd4F4W7R/36FuFH2RW/S18IM5Zw6PRsMZUkSf/nz+Mftj//Pw/fXl4016O7+Z7V++vn77/vLbxxtrincPN58woipzin2xTBuMSF2XnBLDpXA5MQQjrh0XOX/kuSUl3qEkBqIZXBQyRhQr8KbIXC1LZ+AJh28HlBwTohmkklYYnMa0Do2Y1CrFBN1hu36K2YMH0ZIzYRyVQsO/D/8IQYRSL+1KSYQzu5rhOxA7AVFuVZ+WncbLe2DFUBxCm4qQqA5oZNXO8RwezgvOFMw7Bw6Y1paLtSPWbKTiZjfmn/qE1oUTtsoa4ciydT5yRRasDqcsDuoMJppXI2UbRWQVVPdqq9WH+XyezheyTn9FyciDDHkaBNs8gc45sCZa11KZEO8Jkiv+yFTQTo6B3q6ZqrgZEXrgUZpgBp3zsLZKkIphYUgpUdKbfYF1tK+/H95en93b/vA2VFqvUEioSZiT4h/7t9nzfvb28y9Iz4AOhAUO8CLASzwL+BKjNRM56M5R0o1QRaAIoNS0IcZqvEDJEYKgpZmThcu4MhucrpqFglaLcejx47plVV3KHVOdKSm1dbtWXavoGSwQh+cIg5KRu5nZthBEGoyg2OTxM8lIyWY5KChOTcsM7p7NwjKdILBgXOauw3O8uIMXOwahI6msmMuIyHGa+jcNNtKkJNBXRbMd+v3jPaz0MTaKrdvv8lwxrTtr4+U6aGjj0LRDnMT0LFdJYTZ+A7nssmJcopHVGk2WktG4KKVUYUpsQpYZGwke2b3ZikaWtplVmUe7gZHbpoubv379fLcNQ8XWbaf3A6BCh214DdhoXEuXyac+iJnfgsZA7+8eGFtx1rdSff5/Df77GvyrJYiTnsRdAcetIdR0Vk25qxWvfGP1h1nO4SiBw4zKnEHv3oP8NVakQxQjsfthlT7cLWOdU1YkYljJ6o0Uw65yTuOE1EhIs4Ed4XycyTVC8J8JMJn0BudpXMmU0/sqmfGSTUU9yRgRLsV92T3yXoj8otc7WUV4OdSbr6tfYCevraZE3FJ5a9etSMxrpm5ZprlheLvd3p6d0xOQvlqCyXWKnq6/5IpfT5ZeMu3W1yrzGkFPVuakV09V5pRTX63Mqww9VXqzSOBC5JdLc9Kr/2FpnvKaqZOlOSb4w2y8pSanN7fuUtdZhRW0+SJrPvgGCzHxxXKfP/gExIt5upw/pHC3ieBBCm5rTkh897753fnjJHIg/4HtZRe3/im95U8sTVXPghun/+5szq8R3LMyRQTdwJ3Ah5oOrAAjKWCxmLOadV/hyQnSAf7aAAdTHigB2erAr/k8/fCOwjhbZ7ZNw/x+tVymKzjZRkQ/ayQaW3HGk/HN+m9QSwcI78e6fhgEAACNDwAAUEsBAhQAFAAIAAgAeGgQQ+/Hun4YBAAAjQ8AAAEAAAAAAAAAAAAAAAAAAAAAADFQSwUGAAAAAAEAAQAvAAAARwQAAAAA");
byte[] resByte=new byte[11474836];
resByte=Base64.decode(request.toString().getBytes());
InputStream input = new ByteArrayInputStream(resByte);
byte[] readByte = new byte[11474836];
ZipInputStream zip = new ZipInputStream(input);
int noofbyteRead = 0;
if ( zip.getNextEntry() != null )
{
noofbyteRead = zip.read(readByte);
}
byte[] writeByte = new byte[ noofbyteRead ];
System.arraycopy( readByte,0,writeByte,0,noofbyteRead);
zip.close();
input.close();
/* String actualXmlmessage = new String(writeByte);*/
String s1 = new String(writeByte);
System.out.println(s1);
s1 displays only half of the content. Why is it not reading the full content?

Before even answering the question, let me make some comments:
There is no need to create a new String from a String constant.
String request= "UEsDBBQACAAIAHhoE....." ;
Base64.decode does not receive a byte[] to hold the result. So, it is most probably creating the one it returns. In consequence, resByte does not need to be initialized. resByte's declaration should be just
byte[] resByte;
That said, and without many more details, the instruction String s1 = new String(writeByte); will use the platform default encoding (this depends on the OS and its configuration if you have not set it manually). If your default encoding is something like UTF-16 (where most characters take 2 bytes), you would obtain exactly half as many characters as there are bytes in writeByte.
If s1 looks garbled (which is very different from "half the content" or even "just the content", and you should have pointed this out in the question), then this is almost certainly what is happening.
The solution is to use:
String s1= new String(writeByte,charSetName) ;
Where charSetName corresponds to the character set of the very original input (not only before it was base-64-encoded, but even before it was zipped).
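Besides the charset, another thing worth ruling out is the single zip.read(readByte) call in the question: InputStream.read(byte[]) is allowed to return fewer bytes than the buffer can hold, so one call may deliver only part of the entry. The sketch below reads until read() returns -1 and then decodes with an explicit charset. The class name, the helper zipOf, and the entry name are hypothetical and exist only to make the demo self-contained; UTF-8 is an assumption about the original data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipReadSketch {
    // Read the first entry of an in-memory zip completely, then decode it.
    static String readFirstEntry(byte[] zipped, Charset charset) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            if (zip.getNextEntry() == null) {
                return "";
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            // Keep reading until the end of the entry is signalled by -1.
            while ((n = zip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper only: build a one-entry zip archive in memory.
    static byte[] zipOf(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("message.xml"));
            zip.write(text.getBytes(charset));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<msg>\u00E9 sample payload</msg>";
        byte[] zipped = zipOf(xml, StandardCharsets.UTF_8);
        System.out.println(readFirstEntry(zipped, StandardCharsets.UTF_8).equals(xml));
    }
}
```

To apply this to the question's data, the Base64-decoded byte array would take the place of the zipped argument.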

How best to convert a byte[] array to a string buffer

I have a number of byte[] array variables I need to convert to string buffers.
is there a method for this type of conversion ?
Thanks
Thank you all for your responses..However I didn't make myself clear....
I'm using some byte[] arrays pre-defined as public static "under" the class declaration
for my java program. these "fields" are reused during the "life" of the process.
As the program issues status messages (written to a file), I've defined a string buffer
(mesg_data) that is used to format a status message.
So as the program executes
I tried msg2 = String(byte_array2)
I get a compiler error:
cannot find symbol
symbol : method String(byte[])
location: class APPC_LU62.java.LU62XnsCvr
convrsID = String(conversation_ID) ;
example:
public class LU62XnsCvr extends Object
.
.
static String convrsID ;
static byte[] conversation_ID = new byte[8] ;
So I can't use a "dynamic" definition of a string variable because the same variable is used
in multiple occurrences.
I hope I made myself clear
Thanks ever so much
Guy

String s = new String(myByteArray, "UTF-8");
StringBuilder sb = new StringBuilder(s);

There is a constructor that takes a byte array and an encoding:
byte[] bytes = new byte[200];
//...
String s = new String(bytes, "UTF-8");
In order to translate bytes to characters you need to specify encoding: the scheme by which sequences (typically of length 1,2 or 3) of 0-255 values (that is: sequence of bytes) are mapped to characters. UTF-8 is probably the best bet as a default.

You can turn it to a String directly
byte[] bytearray;
// ....
String mystring = new String(bytearray);
and then to convert to a StringBuffer
StringBuffer buffer = new StringBuffer(mystring);

You may use
String str = new String(bytes);
By the way, what the code above does is create a Java String (i.e. UTF-16) by decoding the bytes with the platform default character encoding.
If the byte array was created from a string encoded in the platform default character encoding, this will work well.
If not, you need to specify the correct character encoding (Charset), as in
String str = new String(bytes, charset);

It depends entirely on the character encoding, but you want:
String value = new String(bytes, "US-ASCII");
This would work for US-ASCII values.
See Charset for other valid character encodings (e.g., UTF-8)
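The compiler error in the question comes from writing String(conversation_ID) without new, which Java parses as a call to a method named String. A sketch of the static-field setup described in the question is shown below. The field names, the sample bytes, and the choice of US-ASCII are assumptions made for illustration; if the bytes come from a host system in another encoding, that charset would be needed instead.

```java
import java.nio.charset.StandardCharsets;

final class BytesToBufferSketch {
    // Hypothetical stand-ins for the static fields described in the question.
    static byte[] conversationId = {'C', 'O', 'N', 'V', '0', '0', '0', '1'};
    static final StringBuffer messageData = new StringBuffer();

    // Decode the bytes with an explicit charset and append them to the
    // shared, reusable buffer.
    static void appendStatus(String label, byte[] bytes) {
        messageData.append(label)
                   .append(new String(bytes, StandardCharsets.US_ASCII))
                   .append('\n');
    }

    public static void main(String[] args) {
        appendStatus("conversation id: ", conversationId);
        System.out.print(messageData);
        // The buffer can be cleared and reused for the next status message.
        messageData.setLength(0);
    }
}
```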

Encode String to UTF-8

I have a String with a "ñ" character and I have some problems with it. I need to encode this String to UTF-8 encoding. I have tried it by this way, but it doesn't work:
byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");
How do I encode that string to utf-8?

How about using
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString)

String objects in Java use the UTF-16 encoding that can't be modified*.
The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).
* As a matter of implementation, String can internally use a ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).
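A minimal sketch of that distinction, using the "ñ" from the question (the class and method names are invented for illustration): the UTF-8 form exists only as a byte[], and decoding those bytes with the same charset gives the original String back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8EncodeSketch {
    // Encoding: String -> UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding: UTF-8 bytes -> String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] encoded = toUtf8("\u00F1"); // n with tilde
        System.out.println(Arrays.toString(encoded)); // [-61, -79]
        System.out.println(fromUtf8(encoded).equals("\u00F1")); // true
    }
}
```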

In Java7 you can use:
import static java.nio.charset.StandardCharsets.*;
byte[] ptext = myString.getBytes(ISO_8859_1);
String value = new String(ptext, UTF_8);
This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.
If you're using an older Java version you can declare the charset constants yourself:
import java.nio.charset.Charset;
public class StandardCharsets {
public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
public static final Charset UTF_8 = Charset.forName("UTF-8");
//....
}
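It may help to see when the ISO-8859-1 to UTF-8 re-decoding above actually does something useful. As far as I can tell, it only repairs a String whose UTF-8 bytes were earlier decoded as ISO-8859-1; applied to a String that is already correct, it tends to damage it. The sketch below simulates both cases; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepairSketch {
    // Simulate the fault: UTF-8 bytes decoded with the wrong charset.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the fault: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00D1and\u00FA";
        String garbled = misdecode(original);
        // A mis-decoded string is restored by the repair step.
        System.out.println(repair(garbled).equals(original));
        // A healthy string is not left intact by the same step.
        System.out.println(repair(original).equals(original));
    }
}
```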

Use byte[] ptext = myString.getBytes("UTF-8"); instead of getBytes(). getBytes() uses the so-called "default encoding", which may not be UTF-8.

A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.
So if you have an encoding problem, by the time you have String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.

You can try this way.
byte ptext[] = myString.getBytes("ISO-8859-1");
String value = new String(ptext, "UTF-8");

I ran into this problem at one point and managed to solve it in the following way.
First I needed to import
import java.nio.charset.Charset;
Then I had to declare constants for UTF-8 and ISO-8859-1
private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");
Then I could use them in the following way:
String textwithaccent = "Thís ís a text with accent";
String textwithletter = "Ñandú";
String text1 = new String(textwithaccent.getBytes(ISO), UTF_8);
String text2 = new String(textwithletter.getBytes(ISO), UTF_8);

String value = new String(myString.getBytes("UTF-8"));
and, if you want to read from a text file encoded in ISO-8859-1:
String line;
String f = "C:\\MyPath\\MyFile.txt";
try {
BufferedReader br = Files.newBufferedReader(Paths.get(f), Charset.forName("ISO-8859-1"));
while ((line = br.readLine()) != null) {
System.out.println(new String(line.getBytes("UTF-8")));
}
} catch (IOException ex) {
//...
}

I have used the code below to encode the special character by specifying the encoding.
String text = "This is an example é";
byte[] byteText = text.getBytes(Charset.forName("UTF-8"));
//To get original string from byte.
String originalString= new String(byteText , "UTF-8");

A quick step-by-step guide on how to configure the NetBeans default encoding as UTF-8. As a result, NetBeans will create all new files in UTF-8 encoding.
NetBeans default encoding UTF-8 step-by-step guide
Go to etc folder in NetBeans installation directory
Edit netbeans.conf file
Find netbeans_default_options line
Add -J-Dfile.encoding=UTF-8 inside quotation marks inside that line
(example: netbeans_default_options="-J-Dfile.encoding=UTF-8")
Restart NetBeans
You have now set the NetBeans default encoding to UTF-8.
Your netbeans_default_options may contain additional parameters inside the quotation marks. In such a case, add -J-Dfile.encoding=UTF-8 at the end of the string, separated from the other parameters by a space.
Example:
netbeans_default_options="-J-client -J-Xss128m -J-Xms256m
-J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true -J-Dfile.encoding=UTF-8"

This solved my problem
String inputText = "some text with escaped chars";
InputStream is = new ByteArrayInputStream(inputText.getBytes("UTF-8"));

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;
public class test {
public static void main(String[] args) throws Exception {
byte[] buf = {-27};
InputStream is = new ByteArrayInputStream(buf);
BufferedReader r = new BufferedReader(
new InputStreamReader(is, "ISO-8859-1"));
String s = r.readLine();
System.out.println("test.java:9 [byte] (char)" + (char)s.getBytes()[0] +
" (int)" + (int)s.getBytes()[0]);
System.out.println("test.java:10 [char] (char)" + (char)s.charAt(0) +
" (int)" + (int)s.charAt(0));
System.out.println("test.java:11 string below");
System.out.println(s);
System.out.println("test.java:13 string above");
}
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout? And consequently receive the expected output of the System.out.println(s) command (å).

If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
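For the base64 route mentioned above, java.util.Base64 (available since Java 8) is one option. The sketch below, with invented class and method names, carries the question's single byte -27 through text and back without involving any character encoding.

```java
import java.util.Arrays;
import java.util.Base64;

final class BinaryAsTextSketch {
    // Represent arbitrary bytes as ASCII-only text.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Turn that text back into the original bytes.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] buf = {-27};
        String text = toText(buf);
        System.out.println(text); // 5Q==
        System.out.println(Arrays.toString(fromText(text))); // [-27]
    }
}
```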

As noted, getBytes() (no-arguments) uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing it should work, provided your terminal and the default encoding match and support the character. For instance, on my system, the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
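Putting both answers together, a sketch that names the charset at every conversion might look like this. The class and method names are invented; it decodes the question's byte -27 as ISO-8859-1 and then encodes the resulting character both ways, so nothing depends on the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitCharsetSketch {
    // Decode bytes as ISO-8859-1, where each byte maps to one character.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = decodeLatin1(new byte[] {-27});
        // The character is U+00E5, i.e. 229.
        System.out.println((int) s.charAt(0)); // 229
        // Encoding with the same charset returns the original byte.
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1)[0]); // -27
        // Encoding as UTF-8 yields two bytes instead.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8))); // [-61, -91]
    }
}
```

Whether System.out.println(s) then shows the expected glyph still depends on the terminal and the encoding used by System.out, which this sketch does not address.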
