Java URLEncoder giving different results

I have this code stub:
System.out.println(param+"="+value);
param = URLEncoder.encode(param, "UTF-8");
value = URLEncoder.encode(value, "UTF-8");
System.out.println(param+"="+value);
This gives this result in Eclipse:
p=指甲油
p=%E6%8C%87%E7%94%B2%E6%B2%B9
But when I run the same code from command line, I get the following output:
p=指甲油
p=%C3%8A%C3%A5%C3%A1%C3%81%C3%AE%E2%89%A4%C3%8A%E2%89%A4%CF%80
What could be the problem?

Your Mac is using Mac OS Roman encoding in the terminal. Those Chinese characters are being incorrectly interpreted using Mac OS Roman instead of UTF-8 before they are passed to Java.
As evidence, those Chinese characters exist in UTF-8 encoding of the following (hex) bytes:
指 = 0xE6 0x8C 0x87
甲 = 0xE7 0x94 0xB2
油 = 0xE6 0xB2 0xB9
Then check the Mac OS Roman codepage layout, those (hex) bytes represent the following characters:
0xE6 0x8C 0x87 = Ê å á
0xE7 0x94 0xB2 = Á î ≤
0xE6 0xB2 0xB9 = Ê ≤ π
Now, put them together and URL-encode them using UTF-8:
System.out.println(URLEncoder.encode("指甲油", "UTF-8"));
Look what it prints?
%C3%8A%C3%A5%C3%A1%C3%81%C3%AE%E2%89%A4%C3%8A%E2%89%A4%CF%80
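If you want to reproduce the whole chain in code, here is a minimal sketch. It assumes the JDK's extended charsets are available, where Mac OS Roman is registered under the name "x-MacRoman":
// Take the UTF-8 bytes of the original text ...
byte[] utf8Bytes = "指甲油".getBytes(StandardCharsets.UTF_8);
// ... and misinterpret them as Mac OS Roman, as the terminal did
String garbled = new String(utf8Bytes, Charset.forName("x-MacRoman"));
System.out.println(garbled); // ÊåáÁî≤Ê≤π
System.out.println(URLEncoder.encode(garbled, "UTF-8")); // %C3%8A%C3%A5%C3%A1%C3%81%C3%AE%E2%89%A4%C3%8A%E2%89%A4%CF%80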
To fix your problem, tell your Mac to use UTF-8 encoding in the terminal. Honestly, I can't answer that part off the top of my head as I don't use a Mac. Your Eclipse encoding configuration is fine, but just in case, you can configure it via Window > Preferences > General > Workspace > Text File Encoding.
Update: I missed a comment:
I am reading the value from a text file
If those variables originate from a text file rather than from command-line input, as I initially assumed, then you need to solve the problem differently. Apparently you were using a Reader implementation that relies on the runtime environment's default character encoding, like so:
Reader reader = new FileReader("/file.txt");
// ...
You should instead be explicitly specifying the desired encoding while creating the reader. You can do that with the InputStreamReader constructor.
Reader reader = new InputStreamReader(new FileInputStream("/file.txt"), "UTF-8");
// ...
This explicitly tells Java to read /file.txt using UTF-8 instead of the runtime environment's default encoding, as returned by Charset#defaultCharset().
System.out.println("This runtime environment uses as default charset " + Charset.defaultCharset());

Related

Parse csv file with special character like µ using java

I have a situation where I have to read a CSV file which contains special characters like 'µ'. This has to be done using Java. I am doing:
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
In Windows it runs OK, but in a Red Hat Linux environment it converts those special characters to '?'. Any help is highly appreciated.
Output written to System.out will be encoded using the "platform default encoding", which on Linux is determined from the locale environment variables (see the output of the locale command); those in turn are set in user- or system-level configuration files.
On server installations, the default encoding is often ASCII. "µ" is not an ASCII character so it will be converted to "?" when it is printed.
There are a couple of ways to change the default encoding:
Set the Java file.encoding system property when you run your program, e.g.
java -Dfile.encoding=utf-8 yourprogram
Set LC_CTYPE env variable before you run your program, for example:
export LC_CTYPE=en_US.UTF-8
java yourprogram
Those methods also change the default encoding for input and file names etc. You can change the encoding specifically for System.out with Java code:
PrintStream originalOut = System.out; // in case you need it later
System.setOut(new PrintStream(System.out, true, "utf-8"));
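To see which default is in effect and to verify the override, a minimal sketch (the class name is just for illustration):
import java.io.PrintStream;
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        // Shows what the JVM picked up from file.encoding / the locale
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Force UTF-8 for System.out only
        System.setOut(new PrintStream(System.out, true, "utf-8"));
        System.out.println("µ"); // written as the UTF-8 bytes 0xC2 0xB5
    }
}
Running it with java -Dfile.encoding=utf-8 EncodingCheck and then plainly with java EncodingCheck makes the difference visible on most JVMs.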

Writing hebrew to file turns to gibberish when run from exported jar

I have a small program, written in Java, that writes some Hebrew letters and some numbers to a file.
The Hebrew is written fine when I run the program from Eclipse, but if I export it into an executable JAR file and run it from there, the Hebrew turns to gibberish.
My code:
if (content.length() > 0) {
FileWriter fileWriter = new FileWriter(path);
BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
bufferedWriter.write(content);
bufferedWriter.close();
}
I have also tried using an OutputStreamWriter to set the encoding myself:
if (content.length() > 0) {
BufferedWriter bufferedWriter = new BufferedWriter
(new OutputStreamWriter(new FileOutputStream(path), "windows-1255"));
bufferedWriter.write(content);
bufferedWriter.close();
}
The encodings I tried:
ISO-8859-8
windows-1255
x-IBM856
IBM862
IBM424
UTF-8
Some of them return proper Hebrew when I run the program from Eclipse, but all of them turn the Hebrew into different kinds of gibberish when run from the JAR file.
I am not even sure whether the encoding in the code itself is the issue, or how to fix it.
I am running the JAR using a batch file on Windows 10.
My java version info:
java version "10.0.1" 2018-04-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.1+10)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.1+10, mixed mode)
Example of output when using UTF-8.
A line from the Hebrew file (generated by Eclipse):
210001 188 13 04/09/1804/09/18 50.00 1 123456789 לירון קטלן הרא"ה 291 רמת גן 6013
The same line from the gibberish file (generated from the JAR):
210001 188 13 04/09/1804/09/18 50.00 1 123456789 לירון קטלן הר�"ה 291 רמת גן 6013
Don't mind the extra whitespace; it is supposed to be there.
The second code snippet, with the explicit encoding, is correctly cross-platform.
Check that the content is valid Unicode:
String content = "\u200F\u05D0\u05D1\u05D2\u05D3\u05D4\u200E"; // "אבגדה" wrapped in RTL/LTR marks
I used \u escapes so the Java source is pure ASCII; that way, even if the encoding assumed by the compiler and the encoding used by the editor were to erroneously differ, the strings could not be corrupted.
Assuming that content is a String:
if (!content.isEmpty()) {
content = "\uFEFF" + content; // Add a BOM char in front for Windows
Path p = Paths.get(path);
Files.write(p, Collections.singletonList(content), StandardCharsets.UTF_8);
}
This writes a UTF-8 file, which will cause the fewest problems, unless it is consumed inside Israel by software that assumes a country-specific encoding such as windows-1255.
I added a BOM character as the first character of the file, so Windows can easily identify the file as UTF-8 Unicode rather than as some ANSI single-byte encoding.
Then there remains the problem of displaying the Hebrew text: there must be an adequate font.
You might opt for writing an HTML file:
content = "<!DOCTYPE html><html lang="he">"
+ "<head><meta charset=\"utf-8\"></head>"
+ "<body><pre>"
+ content.replace("&", "&")
.replace("<", "<")
.replace(">", "&gt")
+ "</pre></body></html>";
I find that better than writing a BOM.
The last thing would be to add LTR ('\u200E') and RTL ('\u200F', Right-To-Left) mark characters where needed, but I take it that this is not a problem here.
Problems like this almost always come down to an overloaded method being used somewhere without an encoding argument, so that it defaults to the current platform encoding.
Do
new InputStreamReader(..., StandardCharsets.UTF_8)
and such.
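As a concrete checklist (a sketch, not specific to the question's code; path, text and bytes are placeholder variables), these are the typical default-encoding traps and their explicit counterparts, here assuming UTF-8:
// Relies on the platform default encoding:
Reader r = new FileReader(path);
Writer w = new FileWriter(path);
byte[] b = text.getBytes();
String s = new String(bytes);

// Same operations with the encoding made explicit:
Reader r2 = new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8);
Writer w2 = new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8);
byte[] b2 = text.getBytes(StandardCharsets.UTF_8);
String s2 = new String(bytes, StandardCharsets.UTF_8);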

Error when reading non-English language character from file

I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in the Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file, I get some wicked symbol and I cannot use string.compare() for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but I still get the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
aBuffer += aDataRow + "\n";
stringList.add(aDataRow);
}
Otherwise the code works fine, except for the mentioned characters.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is ISO-8859-1 or ISO-8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
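If you decide to make it a configuration property, a minimal sketch (the property name app.file.encoding is made up for illustration):
// Falls back to UTF-8 when the hypothetical app.file.encoding property is not set
String charsetName = System.getProperty("app.file.encoding", "UTF-8");
reader = new BufferedReader(new InputStreamReader(fins, charsetName));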
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You need to know the encoding of the file.
The InputStream class reads the file as raw bytes. Although you can interpret that input as characters, doing so involves implicit guessing, which may be wrong.
The InputStreamReader class converts bytes to characters, but it has to be told the character set.
You should use the constructor overload that accepts a charset name and pass the character set explicitly.
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, Windows-1251 and KOI8, which all differ from UTF-8. You probably have some popular Albanian encoding for text files. Check your OS settings to make a guess.

reading unicode data from a file

Default encoding is ISO-8859-1
BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream("file having unicode characters"),"UTF-8"));
String strTemp = bis.readLine(); // on debugging, strTemp holds the actual Unicode data
System.out.println(strTemp); // uses the default encoding (ISO-8859-1), so it does not print the actual data
PrintStream psTemp = new PrintStream(System.out, true, "UTF-8");
psTemp.println(strTemp); // here I am giving the encoding as UTF-8, still not printing the Unicode data
Even though I am specifying UTF-8 in the PrintStream constructor, I am not able to print the Unicode data. If I change the default encoding from ISO-8859-1 to UTF-8, it works. Why is this so?
If I change the default encoding from ISO-8859-1 to UTF-8, it works. Why is this so?
I expect that this works because it is telling your console / shell / whatever is displaying the characters to expect UTF-8 characters. If the default behaviour is to expect ISO-8859-1, then sending it UTF-8 is not going to work.
Are you printing to the Eclipse console or to the shell? Try printing to a file and check the result.
For example, the Windows shell is limited to the cp850 charset. The problem might be caused by the OS shell, not the JVM.
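To follow the file suggestion, a minimal sketch (the file name check.txt is arbitrary): write the same string with an explicit UTF-8 writer and open the result in an editor that understands UTF-8:
try (PrintWriter fileOut = new PrintWriter(new OutputStreamWriter(
        new FileOutputStream("check.txt"), "UTF-8"))) {
    fileOut.println(strTemp);
}
// If check.txt shows the characters correctly, the data itself is fine and the console encoding is the culprit.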

Character Encoding in Java

In Eclipse, I changed the default encoding to ISO-8859-1. Then I wrote this:
String str = "Русский язык ";
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.print(str);
It should print the String correctly, as I am specifying UTF-8 encoding. However, it does not print correctly.
The ISO-8859-1 character encoding only supports characters between 0 and 255; anything else is likely to be turned into '?'.
If you save the source file (the .java file) as ISO-8859-1, then str will be encoded by javac using ISO-8859-1. Your problem does not lie in the creation of the PrintStream: the str you are printing is wrong from the beginning.
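Related to that: if you compile from the command line, you can tell javac explicitly which encoding the source file was saved in, for example (assuming the file Main.java is stored as UTF-8):
javac -encoding UTF-8 Main.java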
Yes, it looks like the terminal that you are sending this output to does not support this encoding.
If you are running Eclipse, you can set the encoding as follows:
In Run Configurations... > Common > Encoding > Other, select UTF-8.
You are basically telling the PrintStream writer to expect the input characters to be UTF-8 encoded and to output them as UTF-8; there is no conversion. If you set your IDE to use ISO-8859-1 as the character encoding for your file, which in turn contains the input string, then you pipe ISO-8859-1 encoded characters into a writer expecting UTF-8. The writer then treats the bytes it receives as UTF-8 encoded characters, which results in junk data.
Either set your IDE to encode your source files in UTF-8 and check that your characters are correctly displayed and stored, or tell your writer to treat them as ISO-8859-1; either way should do.
