I am writing a Java test class and would like to iterate over all the Charsets specified in the StandardCharsets class and specify each encoding when performing the .getBytes() on the myString variable.
I want to try something like this:
String myString = "Some Junk";
for (Charset encoding : StandardCharsets) {
System.out.println("Using Encoding: " + encoding.displayName());
byte[] newBytes = myString.getBytes(encoding);
for (byte b : newBytes ) {
System.out.print(b + " ");
}
System.out.println("");
}
Obviously that is not correct. Short of doing each one manually how can I step through all the Charsets defined in the StandardCharsets class?
So, based on your suggestions, I came up with this, which works (though it can probably be improved):
String myString = "Some Junk";
for (Field charSet : StandardCharsets.class.getDeclaredFields()) {
String encoding = charSet.getName();
//This is because the Charsets in StandardCharsets all use underscores however
//when passing the string to .getBytes() you need to pass UTF-8 and not UTF_8
//** All except for ISO_8859_1 ** - Sigh I wish I could do this better.
if (encoding.startsWith("U")) {
encoding = encoding.replaceAll("_", "-");
}
System.out.println("Using Encoding: " + encoding);
byte[] newByteStr = myString.getBytes(Charset.forName(encoding));
for (byte b : newByteStr ) {
System.out.print(b + " ");
}
System.out.println("");
}
This converts the string in myString to a byte array using the encoding of every Charset found in StandardCharsets, which is what I wanted in the end.
EDIT 1: So based on MC Emperor's comment I now have this
String myString = "Some Junk";
for (Field field: StandardCharsets.class.getDeclaredFields()) {
if (field.get(null) instanceof Charset charset) {
System.out.println("Using Encoding: " + charset.displayName());
byte[] newByteStr = myString.getBytes(charset);
for (byte b : newByteStr ) {
System.out.print(b + " ");
}
System.out.println();
}
}
This seems far better as I no longer have to use string replacements.
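For reference, here is a self-contained sketch of the same reflection approach. The class name StandardCharsetLister and the helper standardCharsets() are made up for illustration. It assumes that every Charset constant in StandardCharsets is a public static field (true for the JDK versions I know of) and wraps the checked IllegalAccessException that field.get(null) can throw:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the name is illustrative only.
class StandardCharsetLister {

    // Collects every public static Charset constant declared in StandardCharsets.
    static List<Charset> standardCharsets() {
        List<Charset> result = new ArrayList<>();
        for (Field field : StandardCharsets.class.getDeclaredFields()) {
            int mods = field.getModifiers();
            boolean isConstant = Modifier.isPublic(mods) && Modifier.isStatic(mods);
            if (!isConstant || !Charset.class.isAssignableFrom(field.getType())) {
                continue; // skip anything that is not a Charset constant
            }
            try {
                result.add((Charset) field.get(null)); // static field, so no instance is needed
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String myString = "Some Junk";
        for (Charset charset : standardCharsets()) {
            System.out.println(charset.name() + ": "
                    + Arrays.toString(myString.getBytes(charset)));
        }
    }
}
```

If reflection feels too fragile, Charset.availableCharsets() is another option, although it returns every charset the JVM supports rather than only the standard ones.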
I am supposed to show, or at least print, the attached characters in a string:
String str = (attached)
System.out.println("Str : " + str);
But I am unable to print the exact characters. I tried both UTF-8 and UTF-16 encodings.
Thanks in advance.
You have what is known as a double encoding.
You have the three character sequence "你好吗" which you correctly point out is encoded in UTF-8 as E4BDA0 E5A5BD E59097.
But now, start encoding each byte of THAT encoding in UTF-8. Start with E4. What is that codepoint in UTF-8? Try it! It's C3 A4!
You get the idea.... :-)
Here is a Java app which illustrates this:
public class DoubleEncoding {
public static void main(String[] args) throws Exception {
byte[] encoding1 = "你好吗".getBytes("UTF-8");
String string1 = new String(encoding1, "ISO8859-1");
for (byte b : encoding1) {
System.out.printf("%2x ", b);
}
System.out.println();
byte[] encoding2 = string1.getBytes("UTF-8");
for (byte b : encoding2) {
System.out.printf("%2x ", b);
}
System.out.println();
}
}
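If the damage is exactly this kind of double encoding, it can usually be undone by running the steps backwards. The sketch below is only an illustration: the class and method names are invented, and it assumes the intermediate misreading used ISO-8859-1, which maps every byte value to a character and is therefore lossless. With another charset such as windows-1252 some bytes may not survive the round trip.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class DoubleEncodingRepair {

    // Simulates the bug: UTF-8 bytes are misread as ISO-8859-1 text.
    static String mangle(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the bug: recover the raw bytes, then decode them as UTF-8.
    static String repair(String mangled) {
        byte[] rawBytes = mangled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three characters from the answer above, written as escapes
        // so the source file encoding does not matter.
        String original = "\u4F60\u597D\u5417";
        String mangled = mangle(original);
        System.out.println("mangled length: " + mangled.length());
        System.out.println("repaired equals original: " + repair(mangled).equals(original));
    }
}
```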
Just like in the picture, I'd like to convert between the encoded UTF-8 String and the native String in Java.
Would anyone have some suggestions? Thanks a lot!
ps.
For example,
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = null;
// block A: processing a, and let b = "这是一个例子,this is a example"
How to implement the "block A"?
Apache Commons Lang StringEscapeUtils.unescapeXml(...) is what you want. Depending on where your original string came from, one of the HTML variants may be more appropriate.
Use like so:
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = StringEscapeUtils.unescapeXml(a);
// b is now "这是一个例子,this is a example"
System.out.println(a);
System.out.println(b);
Output:
&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example
这是一个例子,this is a example
There are methods for converting the other way also.
You can use Charset (see the java.nio.charset.Charset documentation):
Charset.forName("UTF-8").encode(text)
Or you can use the getBytes() method of the java.lang.String class:
text.getBytes(Charset.forName("UTF-8"));
Documentation:
public byte[] getBytes(Charset charset)
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
Parameters: charset - The Charset to be used to encode the String
Returns: The resultant byte array
Since: 1.6
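To illustrate the last sentence of the quoted documentation, here is a minimal sketch of using CharsetEncoder so that unmappable characters cause an exception instead of being silently replaced. The class StrictEncoding and the method encodeStrict are names invented for this example:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class; names are illustrative only.
class StrictEncoding {

    // Encodes text, throwing instead of substituting a replacement byte.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes); // copy only the bytes that were actually written
        return bytes;
    }

    public static void main(String[] args) {
        String text = "x\u2013"; // 'x' followed by an en dash, which US-ASCII cannot represent
        // getBytes() quietly substitutes the charset's replacement byte:
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.US_ASCII)));
        try {
            encodeStrict(text, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("strict encoding failed: " + e);
        }
    }
}
```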
To the right are hexadecimal numeric HTML entities.
The Apache Commons library has a StringEscapeUtils class that can convert those entities to a String, but the reverse direction is less obvious (its escape methods are worth trying, but they might produce named entities).
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.IntStream;
public static void main(String[] args) {
    String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
    String b = fromHtmlEntities(a);
    System.out.println(b);
    String a2 = toHtmlEntities(b);
    System.out.println(a2.equals(a));
    System.out.println(a);
    System.out.println(a2);
}
public static String fromHtmlEntities(String s) {
    // Matches hexadecimal numeric entities such as &#x8FD9;
    Pattern numericEntityPattern = Pattern.compile("&#[Xx]([0-9A-Fa-f]{1,6});");
    Matcher m = numericEntityPattern.matcher(s);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        int codePoint = Integer.parseInt(m.group(1), 16);
        String replacement = new String(new int[] { codePoint }, 0, 1);
        // Quote the replacement so that '$' and '\' are inserted literally.
        m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
}
// Uses Java 8
public static String toHtmlEntities(String s) {
    int[] codePoints = s.codePoints().flatMap(
            (cp) -> cp < 128 // ASCII?
                    ? IntStream.of(cp)
                    : String.format("&#x%X;", cp).codePoints())
            .toArray();
    return new String(codePoints, 0, codePoints.length);
}
String s1="\u0048\u0065\u006C\u006C\u006F"; // Hello
String s2="\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F"; // ಮುಖಪುಟ (Kannada Language)
System.out.println("s1: " + StringEscapeUtils.unescapeJava(s1)); // s1: Hello
System.out.println("s2: " + StringEscapeUtils.unescapeJava(s2)); // s2: ??????
When I print s1, I get the result as Hello.
When I print s2, I get the result as ???????.
I want the output as ಮುಖಪುಟ for s2. How can I achieve this?
ByteArrayOutputStream os = new ByteArrayOutputStream();
// Encode as UTF-8 explicitly so that it matches the decoding below.
PrintStream ps = new PrintStream(os, true, "UTF-8");
ps.println("\u0048\u0065\u006C\u006C\u006F \u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F");
String output = os.toString("UTF-8");
System.out.println("result: " + output); // Hello ಮುಖಪುಟ
You need to specify the encoding, such as "UTF-8".
Try this:
String s1="\u0048\u0065\u006C\u006C\u006F"; // Hello
String s2="\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F"; // ಮುಖಪುಟ (Kannada Language)
System.out.println("s1: " + new String(s1.getBytes("UTF-8"), "UTF-8"));
System.out.println("s2: " + new String(s2.getBytes("UTF-8"), "UTF-8"));
If you are using Eclipse then please have a look at: https://decoding.wordpress.com/2010/03/18/eclipse-how-to-change-the-console-output-encoding/
Simply print the strings to the console as follows:
String s1="\u0048\u0065\u006C\u006C\u006F";
String s2="\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F";
System.out.println("s1: " + s1); // s1
System.out.println("s2: " + s2); // s2
Hope this is helpful to you.
The problem is most probably that System.out is not prepared to deal with Unicode. It is an output stream that encodes text in the so-called default encoding.
The default encoding is often (e.g. on Windows) some proprietary 8-bit character set that simply can't represent all of Unicode.
My tip: for the sake of testing, create your own PrintStream or PrintWriter with UTF-8 encoding.
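A minimal sketch of that tip might look like the following. The class Utf8Console and its utf8() helper are invented for this example, and whether the characters actually show up correctly still depends on the terminal or IDE console being configured for UTF-8 and having a suitable font:

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class; names are illustrative only.
class Utf8Console {

    // Wraps any OutputStream in a PrintStream that always encodes as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("s1: " + "\u0048\u0065\u006C\u006C\u006F");
        out.println("s2: " + "\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F");
    }
}
```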
public static void main(String[] args) {
    try {
        String name = "i love my country";
        byte[] sigToVerify = name.getBytes();
        System.out.println("file data:" + sigToVerify);
        String name1 = "data";
        byte[] sigToVerify1 = name1.getBytes();
        System.out.println("file data1:" + sigToVerify1);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I am trying to execute the above program but getBytes() gives me different values for the same String. Is there any way to get the same byte while executing multiple times for a given string?
System.out.println("file data:" + sigToVerify);
Here you are not printing the value of a String. As owlstead correctly pointed out in the comments, the Object.toString() method is invoked on the byte array sigToVerify, which leads to output of this format:
getClass().getName() + '@' + Integer.toHexString(hashCode())
If you want to print each element in the array, you have to loop through it:
byte[] bytes = "i love my country".getBytes();
for(byte b : bytes) {
System.out.println("byte = " + b);
}
Or even simpler, use the Arrays.toString() method:
System.out.println(Arrays.toString(bytes));
Try printing the contents of the byte array instead of the toString() result of the variable:
for (byte b : sigToVerify)
    System.out.print(b + "\t");
If the bytes printed are the same, then you're good to go.
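To check this in code rather than by eye, something like the sketch below could be used (the class name StableBytes and its helper are made up). It also passes an explicit charset, on the assumption that you want the same bytes on every machine; the no-argument getBytes() depends on the platform default encoding:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class StableBytes {

    // Always encodes with UTF-8, independent of the platform default.
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] first = utf8Bytes("i love my country");
        byte[] second = utf8Bytes("i love my country");
        // Two distinct array objects, so their default toString() output differs...
        System.out.println(first == second);
        // ...but their contents are identical.
        System.out.println(Arrays.equals(first, second));
        System.out.println(Arrays.toString(first));
    }
}
```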
I have the following value in a string variable in Java which has UTF-8 characters encoded like below
Dodd\u2013Frank
instead of
Dodd–Frank
(Assume that I don't have control over how this value is assigned to this string variable)
Now how do I convert (encode) it properly and store it back in a String variable?
I found the following code
Charset.forName("UTF-8").encode(str);
But this returns a ByteBuffer, but I want a String back.
Edit:
Some more additional information.
When I use System.out.println(str); I get
Dodd\u2013Frank
I am not sure what is the correct terminology (UTF-8 or unicode). Pardon me for that.
try
str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);
from Apache Commons Lang
java.util.Properties
You can take advantage of the fact that java.util.Properties supports strings with '\uXXXX' escape sequences and do something like this:
Properties p = new Properties();
p.load(new StringReader("key="+yourInputString));
System.out.println("Escaped value: " + p.getProperty("key"));
Inelegant, but functional.
To handle the possible IOException, you may want a try-catch:
Properties p = new Properties();
try { p.load( new StringReader( "key=" + input ) ) ; } catch ( IOException e ) { e.printStackTrace(); }
System.out.println( "Escaped value: " + p.getProperty( "key" ) );
try
str = org.apache.commons.text.StringEscapeUtils.unescapeJava(str);
as org.apache.commons.lang3.StringEscapeUtils is deprecated.
Suppose you have a Unicode value, such as 00B0 (the degree symbol, which resembles the superscript 'o' used in the Spanish abbreviation for 'primero').
Here is a function that does just what you want:
public static String unicodeToString( char charValue )
{
    // String.valueOf avoids the Character constructor, which is deprecated since Java 9.
    return String.valueOf( charValue );
}
I used StringEscapeUtils.unescapeXml to unescape the string loaded from an API that gives XML result.
UnicodeUnescaper from org.apache.commons:commons-text is also acceptable. Note the doubled backslash, which keeps the escape sequence as literal text in the Java source:
new UnicodeUnescaper().translate("Dodd\\u2013Frank")
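If adding a dependency is not an option, a small regex-based unescaper is one possible alternative. This is only a sketch with invented names (UnicodeEscapes, unescape), and it makes simplifying assumptions: it handles only a backslash, a single 'u' and exactly four hex digits, and it does not treat a doubled backslash as an escaped backslash the way a full Java unescaper would.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class UnicodeEscapes {

    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and backslashes from being interpreted.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape as six literal characters.
        String raw = "Dodd\\u2013Frank";
        System.out.println(raw.length());
        System.out.println(unescape(raw).length());
    }
}
```

Surrogate pairs written as two consecutive escapes come out correctly as well, since each escape is turned into one UTF-16 code unit.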
The following solution may help: it decodes the string without any additional dependencies.
This works in a Scala REPL and should work just as well in a Java-only solution.
import java.nio.charset.StandardCharsets
import java.nio.charset.Charset
> StandardCharsets.UTF_8.decode(Charset.forName("UTF-8").encode("Dodd\u2013Frank"))
res: java.nio.CharBuffer = Dodd–Frank
You can convert that byte buffer to a String like this:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
// A decoder is obtained from a Charset. Note that a CharsetDecoder is not thread-safe.
public static CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
public static String byteBufferToString(ByteBuffer buffer)
{
    String data = "";
    try
    {
        // Remember the position, because decoding advances it.
        int oldPosition = buffer.position();
        data = decoder.decode(buffer).toString();
        // Reset the buffer's position to its original so it is not altered:
        buffer.position(oldPosition);
    }
    catch (CharacterCodingException e)
    {
        e.printStackTrace();
        return "";
    }
    return data;
}
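A shorter variant, sketched here with invented names (BufferText, toUtf8String), is to decode a duplicate of the buffer. The duplicate shares the bytes but has its own position and limit, so the caller's buffer is left untouched without saving and restoring anything. Unlike a CharsetDecoder in its default configuration, Charset.decode() replaces malformed input instead of throwing, which may or may not be what you want:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
class BufferText {

    // Decodes the remaining bytes as UTF-8 without moving the caller's position.
    static String toUtf8String(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode("Dodd\u2013Frank");
        String decoded = toUtf8String(encoded);
        System.out.println(decoded.length());
        System.out.println(encoded.position()); // unchanged by the decode
    }
}
```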