Java encode special character in a String with UTF-8 character - java

String original = "This is my string valúe";
I'm trying to encode the above string to its UTF-8 equivalent, but replacing only the special character (ú) with "&#250;" in this case.
I've tried using the below but I get an error:
Input is not proper UTF-8, indicate encoding !Bytes: 0xFA 0x20 0x63 0x61
Code:
String original = new String("This is my string valúe");
byte ptext[] = original.getBytes("UTF-8");
String value = new String(ptext, "UTF-8");
System.out.println("Output : " + value);
This is my string valúe

You could use String.replace(CharSequence, CharSequence) and formatted io like
String original = "This is my string valúe";
System.out.printf("Output : %s%n", original.replace("ú", "&#250;"));
Which outputs (as I think you wanted)
Output : This is my string val&#250;e

You seem to want to use XML character entities.
Apache Commons Lang has a method for this (in StringEscapeUtils).
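If pulling in a library is not an option, the same idea can be sketched in plain Java. This is only a minimal sketch, not the Commons Lang implementation: it assumes "special character" means any code point above 127, and the names NumericEntityEscaper and escapeNonAscii are made up for illustration.

```java
public class NumericEntityEscaper {

    // Replaces every code point above 127 with a decimal numeric
    // character reference such as &#250; and leaves ASCII untouched.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00FA is the letter ú, written as an escape so the source
        // file's own encoding cannot interfere with the demo.
        System.out.println(escapeNonAscii("This is my string val\u00FAe"));
        // prints: This is my string val&#250;e
    }
}
```

Iterating over code points rather than chars means characters outside the Basic Multilingual Plane come out as a single reference instead of two surrogate halves.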

> I'm trying to encode the above string to UTF-8 equivalent but to replace only special character (ú) with -- "&#250;" in this case.
I'm not sure what encoding "&#250;" is, but have you tried looking at the URLEncoder class? It won't encode the string exactly the way you asked, but it gets rid of the spooky character.

Could you please try the below lines:
byte ptext[] = original.getBytes("UTF8");
String value = new String(ptext, "UTF8");

Related

How to convert hex string to Shift JIS encoding in java?

How can I convert a word's HEX code string to Shift JIS encoding?
For example, I have a string:
"90DD92E882F08F898AFA89BB82B582DC82B782A9"
And I want to get the following output:
設定を初期化しますか
String s = new String(new BigInteger("90DD92E882F08F898AFA89BB82B582DC82B782A9", 16).toByteArray(), "Shift_JIS");
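will do it for you on earlier versions, with one caveat: BigInteger.toByteArray() returns a two's-complement representation, so when the first hex digit is 8 or higher (as it is here) the array gets a leading 0x00 sign byte, and the decoded string then starts with a NUL character that has to be stripped.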
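On Java versions before 17, the hex string can also be parsed by hand, which keeps the byte array exactly as long as the input implies. The following is a rough sketch under that assumption; the names HexToShiftJis, hexToBytes and decode are hypothetical, and it assumes the runtime ships the Shift_JIS charset (standard JDK distributions do).

```java
import java.nio.charset.Charset;

public class HexToShiftJis {

    // Converts a hex string such as "90DD" into the bytes {0x90, 0xDD}.
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // Decodes the parsed bytes as Shift_JIS text.
    static String decode(String hex) {
        return new String(hexToBytes(hex), Charset.forName("Shift_JIS"));
    }

    public static void main(String[] args) {
        System.out.println(decode("90DD92E882F08F898AFA89BB82B582DC82B782A9"));
    }
}
```

Whether the console shows the Japanese text correctly still depends on the terminal's own encoding and font.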
Assuming you have Java 17+, which added java.util.HexFormat, then you can use parseHex followed by a conversion from the byte array to a string:
byte[] bytes = HexFormat.of().parseHex("90DD92E882F08F898AFA89BB82B582DC82B782A9");
String str = new String(bytes, "Shift_JIS");
If you do not have Java 17+, then the related answer I linked to gives an alternative approach instead of parseHex.
I don't have the correct charset/font to show the result in my console, but here is the str variable in my debugger:

How to encode URL string without escape ':' and '/' characters using java?

I hope to encode a string to a url, but URLEncoder.encode() cannot do it quite well:
URLEncoder.encode("http://www.example.com/1/hello world", "utf8")
will result in
http%3A%2F%2Fwww.example.com%2F1%2Fhello+world
What I hope to get is:
http://www.example.com/1/hello+world
without encoding the / and : characters.
EDIT
This is just a simple example; actually I have many non-ASCII characters in the URL.
You can convert "%3A" back to ":" and "%2F" back to "/" after encoding. For example:
String ret = URLEncoder.encode("http://www.example.com/1/hello world", "utf8");
String ret2 = ret.replace("%3A", ":").replace("%2F", "/");
ret2 is what you want.
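An alternative that avoids encoding and then un-encoding is to let java.net.URI quote the individual components. This is a sketch rather than a drop-in replacement: the class and method names are invented here, it assumes the URL can be split into scheme, host and path up front, and it emits %20 for a space instead of the + that URLEncoder produces.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodingSketch {

    // Builds a URI from its parts. The multi-argument constructor quotes
    // characters that are illegal in each component, and toASCIIString()
    // percent-encodes any remaining non-ASCII characters as UTF-8.
    static String encode(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("http", "www.example.com", "/1/hello world"));
        // prints: http://www.example.com/1/hello%20world
    }
}
```

Because ":" and "/" are never escaped in the first place, a literal "%3A" that was already part of the input is not accidentally turned back into ":".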

How to convert UTF-8 to GBK string in java

I retrieved an HTML string from a target site, and within it there is a section
class="f9t" name="Óû§Ãû:ôâÈ»12"
I know it's in GBK encoding, as I can see it from the FF browser display. But I do not know how to convert that name string into a readable GBK string (such as 上海 or 北京).
I am using
String sname = new String(name.getBytes(), "UTF-8");
byte[] gbkbytes = sname.getBytes("gb2312");
String gbkStr = new String( gbkbytes );
System.out.println(gbkStr);
but it's not printed right in GBK text
???¡ì??:????12
I have no clue how to proceed.
If you have already read the name with the wrong encoding and ended up with the wrong value "Óû§Ãû:ôâÈ»12", you can try this, as @Karol S suggested:
new String(name.getBytes("ISO-8859-1"), "GBK")
Or, if you are reading a GBK or GB2312 string from the internet or a file, use something like this to get the right string in the first place:
BufferedReader r = new BufferedReader(new InputStreamReader(is, "GBK"));
name = r.readLine();
Assuming that name.getBytes() returns GBK-encoded bytes, it's enough to create a string while specifying the encoding of the byte array:
new String(gbkString.getBytes(), "GBK");
According to the documentation, the name of the encoding should be GBK.
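To see why re-encoding as ISO-8859-1 can undo the damage, here is a small round-trip sketch. It assumes the bytes were wrongly decoded as ISO-8859-1 specifically (a Windows-1252 mis-decoding can lose some bytes), that the runtime provides the GBK charset, and the names GbkMojibakeRepair, misdecode and repair are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkMojibakeRepair {

    private static final Charset GBK = Charset.forName("GBK");

    // Simulates the mistake: GBK bytes read as if they were ISO-8859-1.
    static String misdecode(String text) {
        return new String(text.getBytes(GBK), StandardCharsets.ISO_8859_1);
    }

    // Undoes it: ISO-8859-1 maps every byte to exactly one char, so
    // encoding back recovers the original bytes, which are then read as GBK.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String original = "\u4E0A\u6D77"; // the two characters of "Shanghai"
        String garbled = misdecode(original);
        System.out.println(garbled.length());                  // 4 chars, one per GBK byte
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```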
Sample code:
String gbkString = "Óû§Ãû:ôâÈ»12";
String utfString = new String(gbkString.getBytes(), "GBK");
System.out.println(utfString);
Result (not 100% sure that it's correct :) ):
脫脙禄搂脙没:么芒脠禄12

Encoding troubles with Java

How can I make String Стек look like %D0%A1%D1%82%D0%B5%D0%BA? Which encoding is it? How can I do it with Java? I thought it's UTF-8:
String myString = "Стек";
byte text[] = myString.getBytes();
String value = new String(text, "UTF-8");
System.out.println(value);
But no, I've got Стек in output.
It's not UTF-8, it's URL-like encoding, and you can get it using the URLEncoder class:
String encoded = URLEncoder.encode("Стек");
System.out.println(encoded);
Result:
%D0%A1%D1%82%D0%B5%D0%BA
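The text that you've shown is percent-encoded (URL-encoded).
You can use URLEncoder to convert it to the desired format. The charset has to be UTF-8 to get the bytes shown in the question, because ISO-8859-1 cannot represent Cyrillic letters:
String value = URLEncoder.encode("Стек", "UTF-8");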
You can use the URLEncoder class to convert a String to percent encoding:
import java.net.URLEncoder;
System.out.println(URLEncoder.encode("Стек", "utf-8"));
You'll also need to catch UnsupportedEncodingException.
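On Java 10 or later, the overloads that take a Charset instead of a charset name avoid the checked exception altogether. A minimal sketch, assuming such a Java version is available; the class and method names are illustrative only.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncodingSketch {

    // Percent-encodes text using UTF-8 bytes; no checked exception is
    // thrown because the charset is passed as an object, not a name.
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Reverses encode().
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The four Cyrillic letters of the word from the question,
        // written as escapes to stay independent of source encoding.
        String word = "\u0421\u0442\u0435\u043A";
        String encoded = encode(word);
        System.out.println(encoded);                       // %D0%A1%D1%82%D0%B5%D0%BA
        System.out.println(decode(encoded).equals(word));  // true
    }
}
```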

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;
public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = {-27};
        InputStream is = new ByteArrayInputStream(buf);
        BufferedReader r = new BufferedReader(
                new InputStreamReader(is, "ISO-8859-1"));
        String s = r.readLine();
        System.out.println("test.java:9 [byte] (char)" + (char)s.getBytes()[0] +
                " (int)" + (int)s.getBytes()[0]);
        System.out.println("test.java:10 [char] (char)" + (char)s.charAt(0) +
                " (int)" + (int)s.charAt(0));
        System.out.println("test.java:11 string below");
        System.out.println(s);
        System.out.println("test.java:13 string above");
    }
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout, and consequently get the expected output (å) from the System.out.println(s) command?
If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
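The Base64 suggestion above could look roughly like this. It is a sketch using java.util.Base64 (available since Java 8); the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {

    // Turns arbitrary bytes into ASCII-only text.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the exact original bytes from that text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] buf = {-27};
        String text = toText(buf);
        System.out.println(text);                              // 5Q==
        System.out.println(Arrays.toString(fromText(text)));   // [-27]
    }
}
```

No character encoding is involved at any point, so the platform default cannot turn the byte into a '?'.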
As noted, getBytes() (no-arguments) uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing it should work, provided your terminal and the default encoding match and support the character. For instance, on my system, the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
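Putting both answers together, the difference comes down to naming the charset on every conversion. The sketch below uses StandardCharsets constants instead of charset name strings; ExplicitCharsetSketch and its methods are hypothetical names, and it does not touch the console, whose encoding is a separate matter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetSketch {

    // Decodes raw bytes as ISO-8859-1, where each byte maps to one char.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encodes back with an explicit charset, never the platform default.
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeLatin1(new byte[] {-27});
        System.out.println((int) s.charAt(0));                 // 229
        System.out.println(Arrays.toString(latin1Bytes(s)));   // [-27]
        System.out.println(Arrays.toString(utf8Bytes(s)));     // [-61, -91]
    }
}
```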
