So I'm working on a localization example, and the usual approach with ResourceBundle doesn't seem to support UTF-8, so I'm moving on to Properties.
Loading the properties themselves works fine, but the accented characters in the Spanish file come out wrong. I'm reading the file as UTF-8, yet that makes no real difference: it just displays a different wrong symbol than before.
Output:
íHola!
┐C¾mo estßs?
íAdi¾s!
Expected Output:
¡Hola!
¿Cómo estás?
¡Adiós!
Properties File:
greetings = ¡Hola!
farewell = ¡Adiós!
inquiry = ¿Cómo estás?
Code:
import java.util.*;
import java.io.*;
public class Test{
public static void main(String[] args) throws IOException {
String language;
String country;
if (args.length != 2) {
language = "en";
country = "GB";
} else {
language = args[0];
country = args[1];
}
String file = String.format("lang_%s_%s.properties",language,country);
InputStream utf8in = Test.class.getClassLoader().getResourceAsStream(file);
Reader reader = new InputStreamReader(utf8in, "UTF-8");
Properties props = new Properties();
props.load(reader);
System.out.println(props.getProperty("greetings"));
System.out.println(props.getProperty("inquiry"));
System.out.println(props.getProperty("farewell"));
}
}
I've just spent about 40 minutes reading everything I could find and they were either the exact same as what I've got now or slightly different and when trying, produced the same results.
Can someone please tell me how I can get my expected output?
I can reproduce the problem in Eclipse. Here are the steps:
Create a Java project and set its text file encoding to CP850.
Create a Run/Debug configuration and set the VM arguments to -Dfile.encoding=ISO8859-1.
Confirm that the Encoding setting in the Common tab is CP850.
Run the Java program.
When the Java program prints to standard output, the characters are encoded as ISO8859-1 bytes.
The Console view then decodes those bytes as CP850 and displays the result.
This is a configuration problem. Make sure the console Encoding is the same as the file.encoding of the running program.
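The symptom from the question can also be reproduced without Eclipse. This is only a minimal sketch: it assumes the JVM ships the IBM850 (CP850) charset, which is usual for a full JDK but not guaranteed, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens when the program and the console disagree.
    public static String reinterpret(String text, Charset written, Charset displayed) {
        return new String(text.getBytes(written), displayed);
    }

    public static void main(String[] args) {
        // Assumption: IBM850 is available in this JVM.
        Charset cp850 = Charset.forName("IBM850");
        // The greeting is written with an escape (U+00A1 is the inverted
        // exclamation mark) so the source file encoding does not matter.
        String greeting = "\u00a1Hola!";
        String shown = reinterpret(greeting, StandardCharsets.ISO_8859_1, cp850);
        // U+00ED is i with acute accent, the wrong symbol from the question.
        System.out.println(shown.equals("\u00edHola!"));
    }
}
```

The string itself is never damaged inside the program; only the final decoding step on the console side picks the wrong table.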
Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send e.g. Arabic characters to Def program through the parameters. I have control of Abc code, but not of Def.
As was mentioned here, this is not possible with plain ProcessBuilder usage and no encoding tricks: Def receives question marks ("?????").
However, I am able to get a correct result by re-decoding the string before sending it, but the encoding that works differs from one environment to another.
E.g. I try every available charset when sending to the recipient and compare what comes back with what is expected:
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how to programmatically know which correct encoding to use, since I need to have something universal?
In other words, what is the relation between the default charset and the found ones?
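import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;
public class Abc {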
private static final Path PATH = Paths.get("."); // With maven: ./target/classes
public static void main(String[] args) throws Exception {
var string = "hello أحمد";
var bytes = string.getBytes();
System.out.println("Original string: " + string);
System.out.println("Default charset: " + Charset.defaultCharset());
for (var c : Charset.availableCharsets().values()) {
var newString = new String(bytes, c);
var process = new ProcessBuilder().command("java", "-cp",
PATH.toAbsolutePath().toString(),
"Def", newString).start();
process.waitFor();
var output = asString(process.getInputStream());
if (output.contains(string)) {
System.out.println("Found " + c + " " + output);
}
}
}
private static String asString(InputStream is) throws IOException {
try (var reader = new BufferedReader(new InputStreamReader(is))) {
var builder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
if (builder.length() != 0) {
builder.append(System.lineSeparator());
}
builder.append(line);
}
return builder.toString();
}
}
}
public class Def {
public static void main(String[] args) {
System.out.println(args[0]);
}
}
Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the java method that ends up turning characters into bytes to have an overload that lets you specify charset, but, for whatever reason, it does not exist here.
How it should work is thusly:
You pass a string to ProcessBuilder
PB will turn that string into bytes using Charset.defaultCharset() (why? Because PB is all about making the OS do things, and the default charset reflects the OS's preferred charset).
These bytes are then fed to the process.
The process starts up. If it is Java, and we're talking about the args in public static void main(String[] args), the same is done in reverse: Java takes the bytes and turns them back into characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
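Whether a given argument is at risk can be checked up front. A minimal sketch using only the JDK; the class and method names are made up for illustration, and it only tells you whether the charset can represent the text, not what the receiving process will actually do with it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {
    // True if every character of text can be represented in the given charset.
    public static boolean representable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "hello" followed by the Arabic name from the question, written with escapes.
        String arg = "hello \u0623\u062d\u0645\u062f";
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Representable in default charset: "
                + representable(arg, Charset.defaultCharset()));
        System.out.println("Representable in ISO-8859-1: "
                + representable(arg, StandardCharsets.ISO_8859_1));
    }
}
```

If the check fails for the default charset, the characters would be lost before the child process ever sees them.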
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the default charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.
One possibility is to convert your String into a string of Unicode escape sequences, pass that to the other process, and convert it back to a regular String there. A string of Unicode escape sequences always contains ASCII characters only. Here is how it may look:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد");
The result is that the String encoded holds this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
The class StringUnicodeEncoderDecoder can be found in an open source library called MgntUtils. You can get this library as a Maven artifact or on GitHub (including source code and JavaDoc). JavaDoc online is available here
This library and this particular feature are used and well tested by multiple users.
Disclaimer: This library is written by me
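The escape-sequence idea can also be sketched with the plain JDK. The code below is not the MgntUtils implementation, just a hypothetical stand-in with made-up names. It assumes the decoder only ever receives strings produced by the matching encoder, and the receiving program has to do the decoding, so it only helps if you can change that program.

```java
public class UnicodeEscapes {
    // Turns every UTF-16 code unit into a backslash-u escape with four hex digits.
    public static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Reverses encode(); assumes the input consists only of six-character escapes.
    public static String decode(String escaped) {
        StringBuilder sb = new StringBuilder(escaped.length() / 6);
        for (int i = 0; i + 6 <= escaped.length(); i += 6) {
            // Skip the backslash and the letter u, then parse the four hex digits.
            sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "hello \u0623\u062d\u0645\u062f";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

Because the encoded form is pure ASCII, it survives whatever default charset sits between the two processes.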
I'm using Java and Spring, and I would like to prevent invalid characters in the message property files.
Some colleagues use different operating systems, IDEs and setups. As our language is Portuguese and the Windows default encoding is Windows-1252 (or CP-1252), it's common to have some confusion about special (accented) characters like á, ã, õ, etc. when editing files, because someone may use a different encoding and mess up the messages property file, like this:
1002 = O pedido não foi encontrado
1003 = O pedido j� est� finalizado
1004 = A situa��o do hist�rico do pedido n�o � permitida
The above file is originally a UTF-8 file, but someone edited it with Windows-1252 encoding, adding two new entries (1003 and 1004). This creates these weird question marks in place of the accents when the file is read as UTF-8.
So I'm thinking of a unit test to detect this problem in the file. How could I write this unit test? Thanks!
I found the answer with the help of @Mayamar and this answer.
@Test
public void verifyInvalidCharsOnMessages() throws IOException {
verifyInvalidChars("src/main/resources/i18n/messages.properties");
verifyInvalidChars("src/main/resources/i18n/messages_pt_BR.properties");
}
private void verifyInvalidChars(String file) throws IOException {
Properties p = new Properties();
// Decode as UTF-8; bytes that are not valid UTF-8 become the replacement character U+FFFD.
try (Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
p.load(reader);
}
for (String key : p.stringPropertyNames()) {
String value = p.getProperty(key);
// indexOf returns 0 when the replacement character comes first, so test for >= 0.
if (value.indexOf('\uFFFD') >= 0) {
fail("Invalid character in " + file + ", key " + key);
}
}
}
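A stricter variation is to make the decoder report malformed bytes instead of replacing them, so the check does not depend on spotting the replacement character afterwards. This is only a sketch with made-up names. In a real test the bytes would come from something like Files.readAllBytes on the properties file; here ISO-8859-1 stands in for Windows-1252, since both encode á as the same single byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true only if the bytes are well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One entry from the question, with the accents written as escapes.
        String line = "1003 = O pedido j\u00e1 est\u00e1 finalizado";
        System.out.println(isValidUtf8(line.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(line.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

This catches a file that was saved in a single-byte encoding, but not a file where the replacement character was itself saved as valid UTF-8, so the U+FFFD check is still worth keeping.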
I have a small java program that reads a given file with data and converts it to a csv file.
I've been trying to use the arrow symbols: ↑, ↓, → and ← (Alt+24 to 27) but unless the program is run from within Netbeans (Using F6), they will always come out as '?' in the resulting csv file.
I have tried using Unicode escapes, e.g. "\u2190", but it makes no difference.
Does anyone know why this is happening?
As requested, here is sample code that shows the same issue. It won't work when run from the .jar file, where it just creates a csv file containing '?', but running it from within Netbeans works.
import java.io.FileNotFoundException;
import java.io.PrintWriter;
public class Sample {
String fileOutName = "testresult.csv";
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws FileNotFoundException {
Sample test = new Sample();
test.saveTheArrow();
}
public void saveTheArrow() {
try (PrintWriter outputStream = new PrintWriter(fileOutName)) {
outputStream.print("←");
outputStream.close();
}
catch (FileNotFoundException e) {
// Do nothing
}
}
}
new PrintWriter(fileOutName) uses the default charset of the JVM - you may have different defaults in Netbeans and in the console.
Google Sheets uses UTF-8 according to this thread, so it would make sense to save your file using that character set:
Files.write(Paths.get("testresult.csv"), "←".getBytes(StandardCharsets.UTF_8));
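If you would rather keep the PrintWriter style from the question, the charset can be fixed at the point where the writer is created. A minimal sketch with made-up class and method names, reusing the file name from the question:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ArrowWriter {
    // Writes text as UTF-8 regardless of the JVM default charset.
    public static void write(Path file, String text) {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(file, StandardCharsets.UTF_8))) {
            out.print(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the raw bytes back so the encoding on disk can be inspected.
    public static byte[] bytesOnDisk(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get("testresult.csv");
        write(file, "\u2190"); // LEFTWARDS ARROW, written as an escape
        for (byte b : bytesOnDisk(file)) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

With the charset pinned like this, the file should contain the same bytes whether the program is started from Netbeans or from the .jar.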
The "←" character typed in your editor is certainly not stored as the desired byte 0x27.
Use
outputStream.print(new String(new byte[] { 0x27 }, StandardCharsets.US_ASCII));
When I run the following program:
public static void main(String args[]) throws Exception
{
byte str[] = {(byte)0xEC, (byte)0x96, (byte)0xB4};
String s = new String(str, "UTF-8");
}
on Linux and inspect the value of s in jdb, I correctly get:
s = "ì–´"
on Windows, I incorrectly get:
s = "?"
My byte sequence is a valid UTF-8 character in Korean, why would it be producing two very different results?
It correctly prints "어" on my computer (Ubuntu Linux), as described in Code Table Korean Hangul. Windows command prompt is known to have issues with encoding, don't bother.
Your code is fine.
It gives 어 for me. This means your console is probably not configured to display UTF-8 and it is a printing/display problem, rather than a problem with representation.
You get the correct string, it's Windows console that does not display the string correctly.
Here is a link to an article that discusses a way to make Java console produce correct Unicode output using JNI.
JDB is displaying the data incorrectly. The code works the same on both Windows and Linux. Try running this more definitive test:
public static void main(String[] args) throws Exception {
byte str[] = {(byte)0xEC, (byte)0x96, (byte)0xB4};
String s = new String(str, "UTF-8");
for(int i=0; i<s.length(); i++) {
System.out.println(Integer.toHexString(s.charAt(i)));
}
}
This prints out the hex value of every character in the string. This will correctly print out "c5b4" in both Windows and Linux.
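The same kind of check can be applied to the output side, to see where the "?" actually comes from. This is only a sketch with made-up names: it captures what a PrintStream with a given encoding would send to the console, assuming the console's encoding is what the stream is configured with.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncoding {
    // Returns the bytes a PrintStream with the given encoding emits for text.
    public static byte[] printedBytes(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encoding)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
        return sink.toByteArray();
    }

    private static void dump(byte[] bytes) {
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }

    public static void main(String[] args) {
        String s = "\uc5b4"; // the Hangul syllable from the question
        dump(printedBytes(s, "UTF-8"));    // the original three bytes
        dump(printedBytes(s, "US-ASCII")); // a single 3f, i.e. '?'
    }
}
```

The String is identical in both calls; the question mark only appears once it is encoded for an output that cannot represent the character.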
I am trying to read Japanese string values from a .properties file with the code:
Properties properties = new Properties();
InputStream in = MyClass.class.getResourceAsStream(fileName);
properties.load(in);
The problem is apparently with the above code not recognizing the encoding of the file. It reads only the English portions and replaces the Japanese characters with question marks. Incidentally, this is not a problem with displaying Japanese text in Swing or reading/writing a UTF-8 encoded .properties file in an editor. Both things work.
Is the Properties class encoding-unaware? Is there an encoding-aware alternative that does not violate the security manager settings normally found in applets?
In my opinion you have to convert the Japanese characters to Java Unicode escape sequences.
For example, this is how I did it with Vietnamese:
Currency_Converter = Chuyen doi tien te
Enter_Amount = Nh\u1eadp v\u00e0o s\u1ed1 l\u01b0\u1ee3ng
Source_Currency = \u0110\u01a1n v\u1ecb g\u1ed1c
Target_Currency = \u0110\u01a1n v\u1ecb chuy\u1ec3n
Converted_Amount = K\u1ebft qu\u1ea3
Convert = Chuy\u1ec3n \u0111\u1ed5i
Alert_Mess = Vui l\u00f2ng nh\u1eadp m\u1ed9t s\u1ed1 h\u1ee3p l\u1ec7
Alert_Title = Thong bao
load(InputStream) expects ISO 8859-1 encoding, as noted in the docs.
In general you'll want to use native2ascii to convert property files, load using a Reader with an explicit charset, or use the XML format, where you can specify the encoding.
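The Reader option could look roughly like this. It is a sketch with made-up names; the byte array stands in for a UTF-8 encoded .properties file, and the key and value are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class Utf8Properties {
    // Loads a properties stream whose bytes are UTF-8 rather than ISO 8859-1.
    public static Properties load(InputStream in) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 file; the value is a Japanese greeting written with escapes.
        String value = "\u3053\u3093\u306b\u3061\u306f";
        byte[] file = ("greeting = " + value + "\n").getBytes(StandardCharsets.UTF_8);
        Properties props = load(new ByteArrayInputStream(file));
        System.out.println(props.getProperty("greeting").equals(value));
    }
}
```

In the code from the question, the stream returned by MyClass.class.getResourceAsStream(fileName) would take the place of the ByteArrayInputStream.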
It is possible to read any language from a properties file. Get the value by key from the Properties object and then create a new string with new String(keyValue.getBytes("ISO-8859-1"), "UTF-8"). Because load(InputStream) decoded the UTF-8 file as ISO-8859-1, this recovers the original bytes and decodes them as UTF-8.
public static String getLocalizedPropertyValue(String fileName, String key, Locale locale) throws UnsupportedEncodingException {
Properties props = getPropertiesByLocale(fileName, locale);
String keyValue = props.getProperty(key);
return keyValue != null ? new String(keyValue.getBytes("ISO-8859-1"), "UTF-8") : "";
}
Have you considered using ResourceBundle ?