Reading from a property file containing UTF-8 characters - Java

I am reading a property file which consists of a message in the UTF-8 character set.
Problem
The output is not in the appropriate format. I am using an InputStream.
The property file looks like
username=LBSUSER
password=Lbs#123
url=http://localhost:1010/soapfe/services/MessagingWS
timeout=20000
message=Spanish character are = {á é í, ó,ú ,ü, ñ, ç, å, Á, É, Í, Ó, Ú, Ü, Ñ, Ç, ¿, °, 4° año = cuarto año, €, ¢, £, ¥}
And I am reading the file like this,
Properties props = new Properties();
props.load(new FileInputStream("uinsoaptest.properties"));
String username = props.getProperty("username", "test");
String password = props.getProperty("password", "12345");
String url = props.getProperty("url", "12345");
int timeout = Integer.parseInt(props.getProperty("timeout", "8000"));
String messagetext = props.getProperty("message");
System.out.println("This is soap msg : " + messagetext);
The output of the above message is
You can see the message in the console after the line
{************************ SOAP MESSAGE TEST***********************}
I will be obliged if I can get any help reading this file properly. I can read this file with another approach but I am looking for less code modification.

Use an InputStreamReader with Properties.load(Reader reader):
FileInputStream input = new FileInputStream(new File("uinsoaptest.properties"));
props.load(new InputStreamReader(input, Charset.forName("UTF-8")));
As a method, this may resemble the following:
private Properties read( final Path file ) throws IOException {
final var properties = new Properties();
try( final var in = new InputStreamReader(
new FileInputStream( file.toFile() ), StandardCharsets.UTF_8 ) ) {
properties.load( in );
}
return properties;
}
Don't forget to close your streams. Java 7 introduced StandardCharsets.UTF_8.

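Use props.load(new FileReader("uinsoaptest.properties")) instead. By default, FileReader decodes with the platform default charset (Charset.defaultCharset()), which is derived from the file.encoding system property. It can be set to UTF-8 with the command-line parameter -Dfile.encoding=UTF-8. Calling System.setProperty("file.encoding", "UTF-8") at runtime is not reliable, because the JVM reads this property once at startup. On Java 11 and later, the charset can also be passed directly: new FileReader("uinsoaptest.properties", StandardCharsets.UTF_8).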

If somebody uses the @Value annotation, they could try StringUtils.
@Value("${title}")
private String pageTitle;
public String getPageTitle() {
return StringUtils.toEncodedString(pageTitle.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("UTF-8"));
}

You should specify the UTF-8 encoding when you read the file. FileInputStream itself has no constructor that takes a charset, so wrap it in an InputStreamReader:
new InputStreamReader(new FileInputStream("uinsoaptest.properties"), "UTF-8");
If you want the JVM to read UTF-8 files by default, you can set the JAVA_TOOL_OPTIONS environment variable (or add a JVM option) like this:
-Dfile.encoding=UTF-8

If anybody comes across this problem in Kotlin, like me:
The accepted solution of @Würgspaß works here as well. The corresponding Kotlin syntax:
Instead of the usual
val properties = Properties()
filePath.toFile().inputStream().use { stream -> properties.load(stream) }
I had to use
val properties = Properties()
InputStreamReader(FileInputStream(filePath.toFile()), StandardCharsets.UTF_8).use { stream -> properties.load(stream) }
With this, special UTF-8 characters are loaded correctly from the properties file given in filePath.

Related

Which encoding for ProcessBuilder parameters

Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send e.g. Arabic characters to Def program through the parameters. I have control of Abc code, but not of Def.
Using ProcessBuilder the normal way, without any playing with the encoding, this is not possible, as was mentioned here: Def receives question marks "?????".
However, I am able to get some result, but different encodings can be used for different scenarios.
E.g. I am trying all encodings to send to the recipient, and comparing the result of what is expected.
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how to programmatically know which correct encoding to use, since I need to have something universal?
In other words, what is the relation between the default charset and the found ones?
public class Abc {
private static final Path PATH = Paths.get("."); // With maven: ./target/classes
public static void main(String[] args) throws Exception {
var string = "hello أحمد";
var bytes = string.getBytes();
System.out.println("Original string: " + string);
System.out.println("Default charset: " + Charset.defaultCharset());
for (var c : Charset.availableCharsets().values()) {
var newString = new String(bytes, c);
var process = new ProcessBuilder().command("java", "-cp",
PATH.toAbsolutePath().toString(),
"Def", newString).start();
process.waitFor();
var output = asString(process.getInputStream());
if (output.contains(string)) {
System.out.println("Found " + c + " " + output);
}
}
}
private static String asString(InputStream is) throws IOException {
try (var reader = new BufferedReader(new InputStreamReader(is))) {
var builder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
if (builder.length() != 0) {
builder.append(System.lineSeparator());
}
builder.append(line);
}
return builder.toString();
}
}
}
public class Def {
public static void main(String[] args) {
System.out.println(args[0]);
}
}
Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the java method that ends up turning characters into bytes to have an overload that lets you specify charset, but, for whatever reason, it does not exist here.
How it should work is thusly:
You pass a string to ProcessBuilder
PB will turn that string into bytes using Charset.defaultCharset() (why? Because PB is all about making the OS do things, and the default charset reflects the OS's preferred charset).
These bytes are then fed to the process.
The process starts up. If it is java, and we're talking the args in psv main(String[] args), the same is done in reverse: Java takes the bytes and turns them back to characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
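Whether a given argument can survive that round trip can be checked up front. The following is a minimal sketch, not part of the original answer: the class and method names (ArgEncodingCheck, isRepresentable) are made up for illustration, and it assumes that both processes use the same charset for their arguments.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ArgEncodingCheck {

    // Returns true if every character of the argument can be encoded in the
    // given charset, i.e. nothing would be lost when turning it into bytes.
    static boolean isRepresentable(String argument, Charset charset) {
        return charset.newEncoder().canEncode(argument);
    }

    public static void main(String[] args) {
        // "hello" followed by an Arabic name, written with escapes to keep the source ASCII-only
        String argument = "hello \u0623\u062d\u0645\u062f";
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("Default charset: " + defaultCharset);
        System.out.println("Representable in default charset: "
                + isRepresentable(argument, defaultCharset));
        System.out.println("Representable in US-ASCII: "
                + isRepresentable(argument, StandardCharsets.US_ASCII));
    }
}
```

If the check fails for the default charset, the argument would have to be passed in an ASCII-safe form or through another channel such as standard input.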
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the platform's default charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
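To see why, here is a small sketch of what decoding with a mismatched charset does. It uses UTF-8 and ISO-8859-1 as an illustrative pair (the question's loop tries every available charset instead), and the class name MojibakeDemo is made up:

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Encodes the text as UTF-8 and then (wrongly) decodes the bytes as ISO-8859-1.
    static String decodeWithWrongCharset(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, which restores the original text.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent, two bytes in UTF-8
        System.out.println(decodeWithWrongCharset(original).length()); // 2: one char per byte
        System.out.println(roundTrip(original).equals(original));      // true
    }
}
```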
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.
One possibility is to convert your String to a string of Unicode escape sequences, pass that to the other process, and convert it back to a regular String there. A string of Unicode sequences always contains ASCII characters only. Here is how it may look:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد");
As a result, the String encoded will hold this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
The class StringUnicodeEncoderDecoder can be found in an open-source library called MgntUtils. You can get this library as a Maven artifact or on GitHub (including source code and JavaDoc). The JavaDoc is also available online.
This library and this particular feature are used and well tested by multiple users.
Disclaimer: this library is written by me.
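The same idea can be sketched with the JDK alone. This is not the MgntUtils implementation, only a hypothetical stand-in (the class UnicodeEscapes and its methods are made up here) that produces the escape format shown above, one escape per UTF-16 code unit:

```java
final class UnicodeEscapes {

    // Turns every UTF-16 code unit into a backslash-u escape with four hex digits.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            out.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return out.toString();
    }

    // Reverses encode(); the input must consist of six-character escapes only.
    static String decode(String escaped) {
        if (escaped.length() % 6 != 0) {
            throw new IllegalArgumentException("Length must be a multiple of 6");
        }
        StringBuilder out = new StringBuilder(escaped.length() / 6);
        for (int i = 0; i < escaped.length(); i += 6) {
            if (!escaped.startsWith("\\u", i)) {
                throw new IllegalArgumentException("Missing escape prefix at index " + i);
            }
            out.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("hello \u0623\u062d\u0645\u062f");
        System.out.println(encoded);
        System.out.println(decode(encoded).equals("hello \u0623\u062d\u0645\u062f"));
    }
}
```

Note that this only helps when the receiving program can be changed to decode the argument, which the question says is not the case for Def.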

How to create a unit test to detect if someone edited the file using the wrong encoding?

I'm using Java, Spring and I would like to prevent some invalid chars on the message property files.
Some colleagues use different operating systems, IDEs and setups. As our language is Portuguese and the Windows default encoding is Windows-1252 (or CP-1252), it's common to have some confusion about special (accented) chars, like á, ã, õ, etc. when editing files, because some of them may use a different encoding and mess up the messages property file, like this:
1002 = O pedido não foi encontrado
1003 = O pedido j� est� finalizado
1004 = A situa��o do hist�rico do pedido n�o � permitida
The above file is originally a UTF-8 file, but someone edited it with Windows-1252 encoding, adding two new entries (1003 and 1004) and creating these weird question marks in place of the accents when the file is read as UTF-8.
So, I'm thinking of a unit test to detect this problem in the file. How could I write this unit test? Thanks!
I found the answer with the help of @Mayamar and this answer.
@Test
public void verifyInvalidCharsOnMessages() throws IOException {
    verifyInvalidChars("src/main/resources/i18n/messages.properties");
    verifyInvalidChars("src/main/resources/i18n/messages_pt_BR.properties");
}
private void verifyInvalidChars(String file) throws IOException {
    Properties p = new Properties();
    // try-with-resources closes the reader and the underlying stream
    try (Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
        p.load(reader);
    }
    for (String key : p.stringPropertyNames()) {
        String value = p.getProperty(key);
        // U+FFFD is the replacement character produced for bytes that are not valid UTF-8.
        // indexOf returns -1 when it is absent, so index 0 must count as a match too.
        if (value.indexOf('\uFFFD') >= 0) {
            fail("Invalid character in " + file + " for key " + key);
        }
    }
}
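An alternative is to validate the raw bytes with a strict decoder instead of searching for the replacement character afterwards. This is a sketch of that variant, not the approach from the answer above; the class name Utf8Validator is made up:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {

    // Returns true only if the bytes form well-formed UTF-8. With REPORT, the
    // decoder throws on invalid input instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "n\u00e3o"; // Portuguese "nao" with a tilde on the a
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(word.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

In a unit test this could be called with the result of Files.readAllBytes for each properties file. It catches bytes that were saved in a non-UTF-8 encoding, but not a file in which a literal U+FFFD was already saved as valid UTF-8.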

Gujarati text in Java String

I have a Gujarati Bible and am trying to insert each verse into a MySQL database using a parser written in Java. When I assign Gujarati text to a Java String variable, it shows junk characters in the debugger.
E.g. This is my Gujarati text
હે યહોવા તું મારો દેવ છે;
I assign it to Java String variable as shown below
verse._verseText = "હે યહોવા તું મારો દેવ છે;";
What I see in the debug window is all junk characters. Any help is appreciated. If you need more information, let me know and I will provide it.
UPDATE
Pasting my parser code here
private Boolean Insert(String _text)
{
BibleVerse verse = new BibleVerse();
String[] data = _text.split("\\|");
try
{
if (data[0].equals(bookName) || bookName.equals("All"))
{
verse._Version = "Gujarati";
verse._book = data[0];
verse._chapter = Integer.parseInt(data[1]);
verse._verse = Integer.parseInt(data[2]);
verse._verseText = new String(data[3].getBytes(), "UTF-8");
_bibleDatabase.Insert(verse);
pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - INSERTED.");
}
else
{
pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - SKIPPED.");
}
return true;
}
catch(Exception e)
{
pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
return false;
}
}
Here is the sample line from the text file
Isaiah|25|1|હે યહોવા તું મારો દેવ છે; હું તને મોટો માનીશ, હું તારા નામની સ્તુતિ કરીશ; કેમકે તેં અદભુત કાર્યો કર્યાં છે, તેં વિશ્વાસુપણે તથા સત્યતાથી પુરાતન સંકલ્પો પાર પાડ્યા છે.
UPDATE
Here is the code where I open & read file.
try
{
FileReader _file = new FileReader(this._filename);
_bufferedReader = new BufferedReader(_file);
SwingWorker parseWorker = new SwingWorker()
{
@Override
protected Object doInBackground() throws Exception
{
String line;
String[] data;
int lineno=0;
BibleVerse verse = new BibleVerse();
while ((line = _bufferedReader.readLine()) != null)
{
++lineno;
pcs.firePropertyChange("pgbupdate", null, lineno);
Insert(line);
}
_bufferedReader.close();
return null;
}
@Override
protected void done()
{
pcs.firePropertyChange("logupdate", null, "Parsing complete.");
}
};
parseWorker.execute();
}
catch (Exception e)
{
pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
}
The problem is this:
FileReader _file = new FileReader(this._filename);
This reads the file using the platform's default charset. If your data file is not encoded in that charset, you will get incorrect characters.
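On Windows, the default charset is usually a legacy code page such as windows-1252, not UTF-8. On most other systems, it's UTF-8. (Since Java 18, the default charset is UTF-8 on all platforms unless it is overridden.)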
The easiest solution is to find out the actual encoding of your data file, so you can specify it explicitly in the code. The encoding of a file can be determined with the file command on Unix and Linux systems. In Windows, you may need to examine it with a binary editor, or install something like Cygwin, which has a file command of its own.
Once you know what it is, you should pass it explicitly to the construction of your Reader:
// Replace "UTF-8" with the actual encoding of your data file (if it's not UTF-8).
Reader _file = new InputStreamReader(new FileInputStream(this._filename), "UTF-8");
Once you've done that, there is no reason for any other part of your code to concern itself with bytes. You should replace this:
verse._verseText = new String(data[3].getBytes(), "UTF-8");
with this:
verse._verseText = data[3];
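Put together, reading the verse file with an explicit charset might look like the following sketch. It assumes the data file is UTF-8 and uses the pipe-separated layout from the question; the class VerseFileReader and its method are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class VerseFileReader {

    // Reads "book|chapter|verse|text" lines, decoding the bytes as UTF-8
    // regardless of the platform's default charset.
    static List<String[]> readVerses(InputStream in) {
        List<String[]> verses = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Limit of 4 keeps any further '|' characters inside the verse text.
                verses.add(line.split("\\|", 4));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return verses;
    }
}
```

For a file on disk, the stream would come from new FileInputStream(filename) or Files.newInputStream(path). The strings in the result need no further byte-level conversion.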
how to inject chinese characters using javascript?
Not quite the same problem, but I think the same solution may work in this case:
If the script is inline (in the HTML file), then it's using the encoding of the HTML file and you won't have an issue.
If the script is loaded from another file:
Your text editor must save the file in an appropriate encoding such as UTF-8 (it's probably doing this already if you're able to save it, close it, and reopen it with the characters still displaying correctly).
Your web server must serve the file with the right HTTP header specifying that it's UTF-8 (or whatever the encoding happens to be, as determined by your text editor settings). Here's an example of how to do this with PHP: Set http header to utf-8 php
If you can't have your web server do this, try to set the charset attribute on your script tag. I tried to see what the spec said should happen in the case of mismatching charsets defined by the tag and the HTTP headers, but couldn't find anything concrete, so just test and see if it helps.
If that doesn't work, place your script inline.
If you want to store Gujarati text in a Java String independently of the source file's encoding, you can use Unicode escapes. See this: http://jrgraphix.net/r/Unicode/0A80-0AFF
So for example, the first code point of the Gujarati block:
char example = '\u0A80';
String result = Character.toString(example);

Handle special characters while writing XML through Java

Through a Java program I am creating an XML file of stock holders. The generated XML would look like this:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset>
<url>
<loc>FirstName-LastName/id/</loc>
</url>
</urlset>
Some stock holders have special characters in their names, e.g. A. Pitkänen. Now, when I look at the XML for these stock holders, it looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset>
<url>
<loc>/A-Pitk寥n/ELS_1005091/</loc>
</url>
</urlset>
This makes the XML invalid. Why is this happening? The Java program is:
FileWriter fstream = new FileWriter("c:\\stock-holders.xml");
final BufferedWriter out = new BufferedWriter(fstream);
try {
// Making Connection and query the stock holders to get the resultset
String aId = "";
String aFName = "";
String aLName = "";
out.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n");
out.write("<urlset>\n");
while (rs.next()) {
String url = "";
aFName = rs.getString(2);
if (StringUtils.isNotEmpty(aFName) ) {
aFName = aFName.trim();
url += aFName;
}
aLName = rs.getString(3);
if (StringUtils.isNotEmpty(aLName)) {
aLName = aLName.trim();
url += "-" + aLName;
}
aId = rs.getString(1);
if (StringUtils.isNotEmpty(aId)) {
aId = aId.trim();
url += "/" + aId + "/";
}
out.write("<url>\n");
out.write("<loc>" + url + "</loc>\n");
out.write("</url>\n");
out.flush();
}
out.write("</urlset>");
out.close();
}
Since your XML file is supposed to be written in UTF-8 encoding, you need to configure your Writers to use that encoding rather than the system default one:
FileOutputStream fstream = new FileOutputStream("c:\\stock-holders.xml");
OutputStreamWriter writer = new OutputStreamWriter(fstream, "UTF-8");
final BufferedWriter out = new BufferedWriter(writer);
Note that use of FileWriter is not recommended for this very reason: before Java 11 it cannot be configured to use an encoding other than the default one.
Also, perhaps it would be better to use some existing API for constructing XML files (such as DOM or StAX) rather than do it by string concatenation. For example, your solution doesn't take into account that your data may contain characters that are illegal in XML and should be escaped.
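As a rough illustration of the StAX route, the sketch below writes the same urlset structure with XMLStreamWriter, which encodes the output as UTF-8 and escapes reserved characters in text content. The class name UrlSetWriter is made up, and the document is built in memory here rather than written to c:\stock-holders.xml:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class UrlSetWriter {

    // Builds the urlset document as UTF-8 bytes, one url/loc pair per location.
    static byte[] write(String... locations) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(bytes, "UTF-8");
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("urlset");
            for (String location : locations) {
                xml.writeStartElement("url");
                xml.writeStartElement("loc");
                xml.writeCharacters(location); // escapes '&', '<' and '>'
                xml.writeEndElement(); // loc
                xml.writeEndElement(); // url
            }
            xml.writeEndElement(); // urlset
            xml.writeEndDocument();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = write("A-Pitk\u00e4nen/ELS_1005091/", "Smith & Sons/ELS_1/");
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

To write to a file instead, a FileOutputStream can be passed in place of the ByteArrayOutputStream. The stream writer does not close the underlying stream, so the caller has to.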
I suspect that the problem is that you are using a FileWriter instead of a FileOutputStream hooked up to an OutputStreamWriter, where the OSW specifies "utf-8" as the encoding.
You can use something shorter:
PrintWriter out = new PrintWriter("c:\\stock-holders.xml", "UTF-8");
This constructor is available since Java 1.5.
The Documentation says:
Creates a new PrintWriter, without automatic line flushing, with the
specified file name and charset. This convenience constructor creates
the necessary intermediate OutputStreamWriter, which will encode
characters using the provided charset.
You need to call flush() (or close(), which also flushes) once all write calls are done.

Japanese text in Java .properties file

I am trying to read Japanese string values from a .properties file with the code:
Properties properties = new Properties();
InputStream in = MyClass.class.getResourceAsStream(fileName);
properties.load(in);
The problem is apparently with the above code not recognizing the encoding of the file. It reads only the English portions and replaces the Japanese characters with question marks. Incidentally, this is not a problem with displaying Japanese text in Swing or reading/writing a UTF-8 encoded .properties file in an editor. Both things work.
Is the Properties class encoding-unaware? Is there an encoding-aware alternative that does not violate the security manager settings normally found in applets?
In my opinion, you have to convert the Japanese characters to Java Unicode escape sequences.
For example, this is the way I did it with Vietnamese:
Currency_Converter = Chuyen doi tien te
Enter_Amount = Nh\u1eadp v\u00e0o s\u1ed1 l\u01b0\u1ee3ng
Source_Currency = \u0110\u01a1n v\u1ecb g\u1ed1c
Target_Currency = \u0110\u01a1n v\u1ecb chuy\u1ec3n
Converted_Amount = K\u1ebft qu\u1ea3
Convert = Chuy\u1ec3n \u0111\u1ed5i
Alert_Mess = Vui l\u00f2ng nh\u1eadp m\u1ed9t s\u1ed1 h\u1ee3p l\u1ec7
Alert_Title = Thong bao
Properties.load(InputStream) expects ISO 8859-1 encoding, as noted in the docs.
In general you'll want to use native2ascii to convert property files, load using a reader, or use XML where you can specify encoding.
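The difference between the two load overloads can be seen without touching the file system. This is a minimal sketch that assumes the file content is UTF-8; the class name Utf8PropertiesDemo and the key greeting are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8PropertiesDemo {

    // Properties.load(InputStream) decodes the bytes as ISO-8859-1.
    static Properties loadFromStream(byte[] content) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Properties.load(Reader) uses whatever charset the Reader decodes with.
    static Properties loadFromReader(byte[] content) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String japanese = "\u3053\u3093\u306b\u3061\u306f"; // a Japanese greeting
        byte[] content = ("greeting=" + japanese).getBytes(StandardCharsets.UTF_8);
        System.out.println(japanese.equals(loadFromStream(content).getProperty("greeting"))); // false
        System.out.println(japanese.equals(loadFromReader(content).getProperty("greeting"))); // true
    }
}
```

In the question's setting, the stream returned by MyClass.class.getResourceAsStream(fileName) would take the place of the ByteArrayInputStream.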
It is possible to read any language from a properties file. Just get the value by key from the Properties object and then create a new string with new String(keyValue.getBytes("ISO-8859-1"), "UTF-8"). This turns the characters back into the original bytes and decodes those bytes as UTF-8.
public static String getLocalizedPropertyValue(String fileName, String key, Locale locale) throws UnsupportedEncodingException {
Properties props = getPropertiesByLocale(fileName, locale);
String keyValue = props.getProperty(key);
return keyValue != null ? new String(keyValue.getBytes("ISO-8859-1"), "UTF-8") : "";
}
Have you considered using ResourceBundle?
