Get Unicode Encoded Characters (Kannada Language) from given String - java

String s1="\u0048\u0065\u006C\u006C\u006F"; // Hello
String s2="\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F"; // ಮುಖಪುಟ (Kannada Language)
System.out.println("s1: " + StringEscapeUtils.unescapeJava(s1)); // s1: Hello
System.out.println("s2: " + StringEscapeUtils.unescapeJava(s2)); // s2: ??????
When I print s1, I get Hello.
When I print s2, I get ??????.
I want the output to be ಮುಖಪುಟ for s2. How can I achieve this?

ByteArrayOutputStream os = new ByteArrayOutputStream();
PrintStream ps = new PrintStream(os);
ps.println("\u0048\u0065\u006C\u006C\u006F \u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F");
String output = os.toString("UTF8");
System.out.println("result: "+output); // Hello ಮುಖಪುಟ

You need to specify the encoding, such as "UTF-8".
Try this:
String s1="\u0048\u0065\u006C\u006C\u006F"; // Hello
String s2="\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F"; // ಮುಖಪುಟ (Kannada Language)
System.out.println("s1: " + new String(s1.getBytes("UTF-8"), "UTF-8"));
System.out.println("s2: " + new String(s2.getBytes("UTF-8"), "UTF-8"));

If you are using Eclipse, have a look at: https://decoding.wordpress.com/2010/03/18/eclipse-how-to-change-the-console-output-encoding/
Then simply print to the console as follows:
String s1="\u0048\u0065\u006C\u006C\u006F";
String s2="\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F";
System.out.println("s1: " + s1); // s1
System.out.println("s2: " + s2); // s2
Hope, this is helpful to you.

The problem is most probably that System.out is not prepared to deal with Unicode. It is an output stream that encodes text in the so-called default encoding.
The default encoding is often (e.g. on Windows) some proprietary 8-bit character set that simply cannot represent these Unicode characters.
My tip: for the sake of testing, create your own PrintStream or PrintWriter with UTF-8 encoding.
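A minimal sketch of that tip, assuming the console or terminal itself is configured to display UTF-8 (otherwise the bytes are correct but still render wrongly). The class and method names here are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class Utf8PrintStreamDemo {
    // Writes the text through a PrintStream that encodes as UTF-8
    // and returns the raw bytes that reached the underlying stream.
    static byte[] encodeAsUtf8(String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream ps = new PrintStream(sink, true, "UTF-8");
            ps.print(text);
            ps.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s2 = "\u0CAE\u0CC1\u0C96\u0CAA\u0CC1\u0C9F";

        // Wrap System.out so that text is encoded as UTF-8 instead of the default encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("s2: " + s2);

        // Each of the six Kannada characters takes three bytes in UTF-8.
        byte[] bytes = encodeAsUtf8(s2);
        out.println("bytes written: " + bytes.length);
        out.println("round trip ok: " + new String(bytes, StandardCharsets.UTF_8).equals(s2));
    }
}
```

If the output still shows question marks, the remaining suspect is the console's own encoding rather than the Java code.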

Related

How to iterate over each Charset in java.nio.charset.StandardCharsets

I am writing a Java test class and would like to iterate over all the Charsets specified in the StandardCharsets class and specify each encoding when performing the .getBytes() on the myString variable.
I want to try something like this:
String myString = "Some Junk";
for (Charset encoding : StandardCharsets) {
System.out.println("Using Encoding: " + encoding.displayName());
byte[] newBytes = myString.getBytes(encoding);
for (byte b : newBytes ) {
System.out.print(b + " ");
}
System.out.println("");
}
Obviously that is not correct. Short of doing each one manually, how can I step through all the Charsets defined in the StandardCharsets class?
Based on your suggestions I came up with this, which works (and probably can/should be improved on):
String myString = "Some Junk";
for (Field charSet : StandardCharsets.class.getDeclaredFields()) {
String encoding = charSet.getName();
//This is because the Charsets in StandardCharsets all use underscores however
//when passing the string to .getBytes() you need to pass UTF-8 and not UTF_8
//** All except for ISO_8859_1 ** - Sigh I wish I could do this better.
if (encoding.startsWith("U")) {
encoding = encoding.replaceAll("_", "-");
}
System.out.println("Using Encoding: " + encoding);
byte[] newByteStr = myString.getBytes(Charset.forName(encoding));
for (byte b : newByteStr ) {
System.out.print(b + " ");
}
System.out.println("");
}
This converts the string in myString to a byte array using the encoding of every Charset found in StandardCharsets, which is what I wanted in the end.
EDIT 1: Based on MC Emperor's comment I now have this:
String myString = "Some Junk";
for (Field field: StandardCharsets.class.getDeclaredFields()) {
if (field.get(null) instanceof Charset charset) {
System.out.println("Using Encoding: " + charset.displayName());
byte[] newByteStr = myString.getBytes(charset);
for (byte b : newByteStr ) {
System.out.print(b + " ");
}
System.out.println();
}
}
This seems far better as I no longer have to use string replacements.
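For reference, here is a self-contained sketch of the same reflection approach that compiles on its own. It is not the code from the comment: it assumes that using getFields() (public fields only) is acceptable, it handles the checked IllegalAccessException that field.get(null) can throw, and it uses a plain cast so it also compiles on Java versions without pattern matching. The class and method names are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StandardCharsetLister {
    // Collects every public static field of type Charset declared on StandardCharsets.
    static List<Charset> standardCharsets() {
        List<Charset> result = new ArrayList<>();
        for (Field field : StandardCharsets.class.getFields()) {
            boolean isStatic = Modifier.isStatic(field.getModifiers());
            if (!isStatic || field.getType() != Charset.class) {
                continue;
            }
            try {
                // Static field, so no instance is needed.
                result.add((Charset) field.get(null));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read " + field.getName(), e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String myString = "Some Junk";
        for (Charset charset : standardCharsets()) {
            System.out.println("Using Encoding: " + charset.displayName());
            System.out.println(Arrays.toString(myString.getBytes(charset)));
        }
    }
}
```

The number of charsets found depends on the JDK version, so the sketch makes no assumption about how many there are or in which order they are returned.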

Removing space in Java error

Basically, my program converts data into .CSV format. But I am running into a problem: when I open my file in Excel, it displays my data normally, but in Notepad it shows up as characters like ㈬㥙〳㈬㥙ㄳ㌬かㄹ㌬か㌹㌬ㅋㄳ㌬ㅋ㈳㌬ㅋ㌳㌬ㅋ㐳
Here's my code:
String resultString = stringWriter.toString();
for ( String cheese: pie.keySet() ) {
resultString += System.getProperty("line.separator") + cheese + "," +
pie.get(cheese).toString();
resultString = resultString.replaceAll(",$" , "").replaceAll(" ", "");
}
this.WriteToFile(resultString);
I have multiple files that use this method to remove spaces, but only this file has the error. I've tried several approaches, such as removing the spaces before the first resultString and after pie.get(cheese).toString().
I also tried .replace(" ", "") and replaceAll("\\s","").
The data contains special characters. These characters are not rendered properly with the default encoding. So when you write a text/CSV file, also set the character encoding to UTF-8. You can do this in the Java program itself:
String utf8String=getFromSource();
File fileDir = new File("c:\\temp\\test.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileDir), "UTF8"));
out.append(utf8String).append("\r\n");
out.flush();
out.close();
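The same idea can be sketched with try-with-resources, so the writer is closed even if a write fails. This is an illustrative variant, not the answer's code: it writes to a temporary file instead of c:\temp\test.txt, and the class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class Utf8CsvWriter {
    // Writes each line to the target file as UTF-8, terminated by CR+LF.
    static void writeLines(Path target, List<String> lines) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write("\r\n");
            }
        }
    }

    // Writes the lines to a temporary file and reads them back as UTF-8.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".csv");
            try {
                writeLines(tmp, lines);
                return Files.readAllLines(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("key,value", "\u0CAE\u0CC1\u0C96,31");
        System.out.println("round trip ok: " + roundTrip(rows).equals(rows));
    }
}
```

Whatever reads the file later (Notepad, Excel, another program) must also interpret it as UTF-8 for the characters to display correctly.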

String.getBytes() returns different values for multiple execution?

public static void main(String[] args) {
try {
String name = "i love my country";
byte[] sigToVerify = name.getBytes();
System.out.println("file data:" + sigToVerify);
String name1 = "data";
byte[] sigToVerify1 = name1.getBytes();
System.out.println("file data1:" + sigToVerify1);
} catch (Exception e) {
e.printStackTrace();
}
}
I am trying to execute the above program but getBytes() gives me different values for the same String. Is there any way to get the same byte while executing multiple times for a given string?
System.out.println("file data:" + sigToVerify);
Here you are not printing the value of a String. As owlstead correctly pointed out in the comments, the Object.toString() method will be invoked on the byte array sigToVerify, leading to output in this format:
getClass().getName() + '@' + Integer.toHexString(hashCode())
If you want to print each element in the array, you have to loop through it.
byte[] bytes = "i love my country".getBytes();
for(byte b : bytes) {
System.out.println("byte = " + b);
}
Or even simpler, use the Arrays.toString() method:
System.out.println(Arrays.toString(bytes));
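As a small sketch of the point above: printing the contents with Arrays.toString() gives the same text on every run. The example also passes an explicit charset to getBytes(), which is an addition not in the answer; the no-argument getBytes() uses the platform default charset, so the bytes can differ between machines for non-ASCII text. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteArrayPrinting {
    // Returns a readable listing of the UTF-8 bytes of the text.
    static String describe(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.toString(bytes);
    }

    public static void main(String[] args) {
        byte[] raw = "data".getBytes(StandardCharsets.UTF_8);

        // Identity-based output such as [B@1b6d3586; the hex part can change between runs.
        System.out.println("toString(): " + raw);

        // Content-based output; prints [100, 97, 116, 97] every time.
        System.out.println("contents:   " + describe("data"));
    }
}
```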
Try printing out the contents of the byte array instead of the toString() result of the variable:
for(byte b : sigToVerify)
System.out.print(b +"\t");
If the bytes printed are the same, then you're good to go.

Split strings with an "Enter"

I have code that splits a string into 3 strings, then prints them. I want each one to be separated by the equivalent of an "Enter". Here's the code:
String accval = text;
try {
PrintWriter writer = new PrintWriter(new BufferedWriter(
new FileWriter("sdcard/YS Data/Accelerometer.html",
true)));
String[] tempArr = accval.split("\\s+");
String x = tempArr[0] + "_"; //I want the enter to be where the underlines are:
String y = tempArr[1] + "_";
String z = tempArr[2] + "_";
for (String a : tempArr) {
writer.println("<h3 style=padding-left:20px;>" + x + y
+ z + "</h3><br>");
}
writer.flush();
writer.close();
} catch (IOException e) {
// put notification here later!!!
e.printStackTrace();
}
This outputs:
x=-0.125_y=0.9375_z=0.375
x=-0.125_y=0.9375_z=0.375
with the strings separated by underscores.
However, I want it to look like this:
x=-0.125
y=0.9375
z=0.375
x=-0.125
y=0.9375
z=0.375
Thanks for any help.
EDIT:
I've implemented the answer of @Julius in the following code, which prints what I wanted:
Code:
String accval = text;
try {
PrintWriter writer = new PrintWriter(new BufferedWriter(
new FileWriter("sdcard/YS Data/Accelerometer.html",
true)));
String[] tempArr = accval.split("\\s+");
String x = tempArr[0];
String y = tempArr[1];
String z = tempArr[2];
for (String a : tempArr) {
writer.println("<h3 style=padding-left:20px;>" + x + "<br>" + y + "<br>"
+ z + "</h3><br>");
}
writer.flush();
writer.close();
} catch (IOException e) {
// put notification here later!!!
e.printStackTrace();
}
Which prints:
x=0.25
y=125
z=1.23
x=0.125
y=725
z=0.935
If you want the line breaks to be displayed in the browser, this is the way to go:
writer.println("<h3 style=padding-left:20px;>" + x + "<br/>" + y + "<br/>" + z + "<br/></h3>");
You can use
System.getProperty("line.separator")
to get a line separator.
You could explicitly write CR+LF as in the other answers here. You can also use the default line break by just using println separately for each item, e.g.:
for (String a : tempArr) {
writer.println("<h3 style=padding-left:20px;>");
writer.println(x);
writer.println(y);
writer.println(z);
writer.println("</h3><br>");
}
This is slightly more verbose but won't run into inconsistent line-break issues on systems where CRLF is not the default line ending.
Note, however, that the linebreaks in the HTML probably won't be rendered, unless your CSS specifies that they should be. You probably want to just write a "<br>" tag after each element instead of an actual line break, e.g.:
writer.println("<h3 style=padding-left:20px;>");
writer.println(x + "<br>");
writer.println(y + "<br>");
writer.println(z + "<br>");
writer.println("</h3><br>");
You wouldn't have to use println for this (you could use print or just concatenate as you were doing before), but it does make the generated source a bit more readable.
Off topic, your use of linebreaks in an "<h3>" header tag isn't really semantically appropriate. A "<div>" with appropriate styling would be a more accurate representation, unless these are actually serving as section headers.
"\r\n" is the correct string for a line break on Windows systems; "\n" is correct on Linux and other Unix-based systems. This will work on either type of system:
String newline = String.format("%n");
which will set newline to "\r\n" or "\n" as appropriate (or perhaps some other sequence on some OSes from a different planet :)
EDIT: %n by definition inserts System.getProperty("line.separator"), as noted in another answer. So these are equivalent.
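A short sketch comparing the ways of obtaining the platform line separator mentioned in these answers. System.lineSeparator() (available since Java 7) is added here as a third option; the class and method names are hypothetical.

```java
class LineSeparatorDemo {
    // Joins the parts with the platform line separator between them.
    static String joinLines(String... parts) {
        return String.join(System.lineSeparator(), parts);
    }

    public static void main(String[] args) {
        String fromFormat = String.format("%n");
        String fromProperty = System.getProperty("line.separator");
        String fromMethod = System.lineSeparator();

        // All three normally agree on a given platform.
        System.out.println("format == method:   " + fromFormat.equals(fromMethod));
        System.out.println("property == method: " + fromProperty.equals(fromMethod));

        System.out.println(joinLines("x=-0.125", "y=0.9375", "z=0.375"));
    }
}
```

As noted above, these line breaks only affect the generated source; a browser still needs a "<br>" tag to show each value on its own line.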

Error reading UTF-8 file in Java

I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason it messes up the Unicode characters.
This is the code I have:
public static String readSentence(String resourceName) {
String sentence = null;
try {
InputStream refStream = ClassLoader
.getSystemResourceAsStream(resourceName);
BufferedReader br = new BufferedReader(new InputStreamReader(
refStream, Charset.forName("UTF-8")));
sentence = br.readLine();
} catch (IOException e) {
throw new RuntimeException("Cannot read sentence: " + resourceName);
}
return sentence.trim();
}
The problem is probably in the way that the string is being output.
I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:
for (char c : sentence.toCharArray()) {
System.err.println("char '" + c + "' is unicode codepoint " + ((int) c));
}
and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is output side: if not, then input side.
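A variant of that check, sketched with String.codePoints() so that characters outside the Basic Multilingual Plane are reported as one code point rather than two surrogate chars. This is an illustration under that assumption, not the answer's code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Returns each code point of the string in U+XXXX notation.
    static List<String> codePoints(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the sentence returned by readSentence(...).
        String sentence = "\u0CAE\u0CC1\u0C96";
        for (String cp : codePoints(sentence)) {
            System.err.println(cp);
        }
    }
}
```

Because the listing is plain ASCII, it stays readable even when the console encoding is the thing that is broken.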
First, you could create the InputStreamReader as
new InputStreamReader(refStream, "UTF-8")
Also, you should verify if the resource really contains UTF-8 content.
One of the most annoying reasons could be your IDE settings.
If your IDE's default console encoding is something like latin1, you'll struggle for a long time with different variations of Java code, and nothing will help until you set the IDE options correctly.
