UTF-16BE and UTF-16 issue in Java - java

I have a file that Geany* reports as UTF-16BE. If I convert this file in Java to a different encoding (say ISO-8859-1), assuming it is UTF-16BE, a question mark (?) appears at the beginning of the newly created file every time. If I instead assume it is UTF-16 (which is not what Geany reports), the file converts fine, with no question mark at the beginning.
Can anybody explain this behavior?
Below is a snippet of the code I use:
StringBuilder sb = new StringBuilder();
// inputStream, utf16beCharset and neededCharset are defined elsewhere in my code
BufferedReader buff = new BufferedReader(new InputStreamReader(inputStream, utf16beCharset));
String line = null;
while ((line = buff.readLine()) != null) {
    sb.append(line);
    sb.append('\n');
}
// re-encode with the target charset, then decode again for printing
String output = new String(sb.toString().getBytes(neededCharset), neededCharset);
System.out.println(output);
* Geany is a text editor

Your problem is the BOM (Byte Order Mark).
If you specify the charset as UTF-16, Java reads the BOM to determine the byte order (here big-endian) and strips it after reading.
If you specify UTF-16BE, you are telling Java the byte order up front, so Java does not treat the leading U+FEFF as a BOM: it is decoded as an ordinary character and carried through to your target file, where it turns into a question mark because ISO-8859-1 cannot represent it.
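If you have to decode as UTF-16BE, one workaround is to strip a leading U+FEFF yourself after decoding. A minimal sketch, assuming the input really is UTF-16BE with a BOM (the file names are placeholders):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class StripBom {
    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader buff = new BufferedReader(new InputStreamReader(
                new FileInputStream("input-utf16be.txt"), StandardCharsets.UTF_16BE))) {
            String line;
            while ((line = buff.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        String text = sb.toString();
        // UTF-16BE leaves the BOM in the stream as U+FEFF; drop it before re-encoding
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        try (OutputStream out = new FileOutputStream("output-iso.txt")) {
            out.write(text.getBytes(StandardCharsets.ISO_8859_1)); // no leading '?' anymore
        }
    }
}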

Related

Java: unrecognized character at the start of the first line

I have code that reads the content of a file in Java, using FileReader and BufferedReader. The lines are read correctly, except that the first character of the first line is an undefined symbol. I have no idea where this symbol comes from, since the content of the input file is correct.
Here is the code:
// FileReader decodes with the platform default charset
FileReader readFile = new FileReader(chosenFile);
BufferedReader input = new BufferedReader(readFile);
String line;
while ((line = input.readLine()) != null) {
    System.out.println(line);
}
If it appears only in the first line, it is probably the BOM (Byte Order Mark). Modern text editors recognize it and do not present it as part of the text; when you save a file there is usually an option to save it with or without the BOM.
If you wish to handle the BOM in Java, see Reading UTF-8 - BOM marker.
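If the file turns out to be UTF-8 with a BOM, here is a minimal sketch of skipping the marker by hand (Java's UTF-8 decoder does not strip it; chosenFile is the variable from the question):

import java.io.*;
import java.nio.charset.StandardCharsets;

BufferedReader input = new BufferedReader(new InputStreamReader(
        new FileInputStream(chosenFile), StandardCharsets.UTF_8));
input.mark(1);
// peek at the first character and skip it if it is the decoded BOM (U+FEFF)
if (input.read() != 0xFEFF) {
    input.reset(); // no BOM, rewind to the real first character
}
String line;
while ((line = input.readLine()) != null) {
    System.out.println(line);
}
input.close();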

Cannot find ZERO WIDTH NO-BREAK SPACE when reading file

I've run into a problem when trying to parse a JSON string that I read from a file. My problem is that the ZERO WIDTH NO-BREAK SPACE character (Unicode U+FEFF) is at the beginning of my string when I read it in, and I cannot get rid of it. I don't want to use a regex, because there may be other hidden characters with different code points.
Here's what I have:
StringBuilder content = new StringBuilder();
try {
    // FileReader decodes with the platform default charset
    BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
    String currentLine;
    while ((currentLine = br.readLine()) != null) {
        content.append(currentLine);
    }
    br.close();
} catch (Exception e) {
    Assert.fail();
}
And this is the start of the JSON file (it's too long to paste in full, but I have confirmed it is valid):
{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...
Here's what I've tried so far:
Copying the JSON file into Notepad++ and showing all characters
Copying the file into Notepad++ and converting it to UTF-8 without BOM, and to ISO 8859-1
Opening the JSON file in other text editors such as Sublime and saving it as UTF-8
Copying the JSON file into a .txt file and reading that in
Using Scanner instead of BufferedReader
In IntelliJ: View -> Active Editor -> Show Whitespaces
How can I read this file in without having the Zero width no-break space character at the beginning of the string?
0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. Whatever the file encoding, a BOM always decodes to the single character U+FEFF, so if U+FEFF is at the front of your String, the file was saved with a BOM and the reader decoded it without skipping it. Java's standard decoders are known not to skip the UTF-8 BOM (see bugs JDK-4508058 and JDK-6378911).
If you read the FileReader documentation, it says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
You need to read the file with a reader that is charset-aware, preferably one that reads the BOM for you and configures itself accordingly. Worst case, you can open the file yourself, read the first few bytes to detect whether a BOM is present, and then construct a reader with the appropriate charset for the rest of the file. Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:
(from https://stackoverflow.com/a/13988345/65863)
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    // the no-args form detects (and skips) the UTF-8 BOM only; pass
    // ByteOrderMark.UTF_16BE / UTF_16LE as extra arguments to detect those too
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader
} finally {
    inputStream.close();
}
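If you would rather avoid the Commons IO dependency, here is a hand-rolled sketch of the same detection, assuming only the UTF-8 and UTF-16 BOMs matter and UTF-8 is the fallback (file is a placeholder for your input file):

import java.io.*;

PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 3);
byte[] head = new byte[3];
int n = in.read(head);
String charset;
if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
    charset = "UTF-8";                 // consume the 3-byte UTF-8 BOM
} else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
    charset = "UTF-16BE";
    in.unread(head, 2, n - 2);         // push back whatever followed the 2-byte BOM
} else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
    charset = "UTF-16LE";
    in.unread(head, 2, n - 2);
} else {
    charset = "UTF-8";                 // no BOM: assume the default
    if (n > 0) in.unread(head, 0, n);  // and push all bytes back
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));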

Greek characters display issue Tomcat 7

I am facing an issue displaying Greek characters. The characters should appear as σ μυστικός αυτό? but they appear as ó ìõóôéêüò áõôü?
Some other Greek characters appear fine, but the text above is garbled.
The content is read from an HTML file by a servlet, using the following code:
public String getResponse() {
    StringBuffer sb = new StringBuffer();
    try {
        // "8859_1" is ISO-8859-1, the decoding questioned in the answer below
        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fn), "8859_1"));
        String line = null;
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
        in.close();
        return sb.toString();
    } catch (IOException e) { // catch added so the snippet compiles
        throw new RuntimeException(e);
    }
}
I am setting the encoding to UTF-8 when sending back the response:
PrintWriter out;
// character encoding must be set before getWriter() is called to take effect
response.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8");
if ((encodings != null) && (encodings.indexOf("gzip") != -1)) { // was "!= 1", a typo
    OutputStream out1 = response.getOutputStream();
    // note: this PrintWriter encodes with the platform default, not the response charset
    out = new PrintWriter(new GZIPOutputStream(out1), false);
    response.setHeader("Content-Encoding", "gzip");
}
else {
    out = response.getWriter();
}
out.println(getResponse());
The characters appear fine on my local development machine (Windows), but appear garbled when deployed on a CentOS server. Both machines have JDK 7 and Tomcat 7 installed.
I'm 99% sure the problem is your input encoding (when you read the data). You're decoding it as ISO-8859-1 when it's probably ISO-8859-7 instead. This would cause exactly the symptoms you see.
The simplest way to check is to open the HTML file in a hex editor and examine the bytes directly. If the Greek characters take up one byte each, it's almost certainly ISO-8859-7 (not -1). If they take up two bytes each, it's UTF-8.
From what you posted it looks like ISO-8859-7. In that character set the lower-case sigma σ is 0xF3, while in ISO-8859-1 the same code maps to ó, which matches the data you showed. I'm sure if you mapped the remaining characters you'd find a one-to-one match in the codes. Maybe your Windows system's default codepage is ISO-8859-7?
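You can also check from Java itself. A minimal sketch, reusing fn from the question: dump the first raw bytes, and if σ shows up as the single byte 0xF3, switch the reader to ISO-8859-7:

// dump the first bytes of the file to inspect the raw encoding
try (InputStream is = new FileInputStream(fn)) {
    byte[] head = new byte[32];
    int n = is.read(head);
    for (int i = 0; i < n; i++) {
        System.out.printf("%02X ", head[i] & 0xFF); // mask to avoid sign extension
    }
    System.out.println();
}

// decode with the matching charset instead of "8859_1"
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream(fn), "ISO-8859-7"));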

How can I change the text encoding of my Java program?

I have a Java program, which I develop with NetBeans.
I changed the settings in NetBeans so that it uses UTF-8.
But when I clean and build my program and run it on my Windows system, the text encoding changes and letters like "ü", "ä" and "ö" are no longer displayed and matched properly.
How can I tell my OS to use UTF-8?
Or is there any good workaround?
EDIT: Sorry for being so unspecific.
First of all: I use docx4j and Apache POI with their getText() methods to extract text from .doc, .docx and .pdf files and save it in a String.
Then I try to match keywords within those texts, which I read from a .txt file.
Those keywords are displayed in a combobox in the runnable JAR.
That is where I can see the encoding problems: it won't match any of the keywords containing the letters described above.
In my IDE it works fine.
I'll post some code here after I clean it up.
The .txt file is in UTF-8. If I convert it to ANSI, I see the same problems as in the JAR.
Reading from it:
if (inputfile.exists() && inputfile.canRead())
{
    try {
        // FileReader uses the platform default charset - this is the culprit
        FileReader reader = new FileReader(inputfilepath);
        BufferedReader in = new BufferedReader(reader);
        String zeile = null; // zeile = "line"
        while ((zeile = in.readLine()) != null) {
            while (zeile.startsWith("#"))
            {
                if (zeile.startsWith(KUERZELTITEL)) {
                    int cut = zeile.indexOf('=');
                    zeile = zeile.substring(cut, zeile.length());
                    eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
                    kuerzel = zeile.substring(1);
                }
                ...
This did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(readfile), "UTF8"));
Thanks!
Congratulations, I also use UTF-8 for my projects, which seems best.
Simply make sure that editor and compiler use the same encoding. This ensures that string literals in Java are correctly encoded in the .class files inside the jar.
In NetBeans 7.3 there is now a single setting for this (I am using Maven builds).
Properties files are historically ISO-8859-1 or encoded with \uXXXX escapes, so there you have to take care.
Internally Java uses Unicode, so there should be no other problems.
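Regarding the properties-file caveat: Properties.load(InputStream) assumes ISO-8859-1, while the Reader overload lets you choose the charset. A minimal sketch (the file name is a placeholder):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

Properties props = new Properties();
try (Reader r = new InputStreamReader(
        new FileInputStream("app.properties"), StandardCharsets.UTF_8)) {
    props.load(r); // the Reader overload respects the charset
}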
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(inputfilepath), "UTF-8"));
The same procedure (an explicit encoding parameter) applies to FileWriter (use OutputStreamWriter + encoding), String.getBytes(encoding) and new String(bytes, encoding).
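For instance, a minimal sketch of the writing side under the same convention (the file name is a placeholder):

import java.io.*;
import java.nio.charset.StandardCharsets;

try (Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("keywords.txt"), StandardCharsets.UTF_8))) {
    out.write("ü ä ö\n"); // written as UTF-8 regardless of the platform default
}
byte[] bytes = "ü".getBytes(StandardCharsets.UTF_8);     // explicit on encode
String back = new String(bytes, StandardCharsets.UTF_8); // and on decode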
Try passing -Dfile.encoding=UTF-8 as a JVM argument.

Why am I getting ?? when I try to read the ä character from a text file in Java?

I am trying to read text from a text file. It contains some special characters like å, ä and ö. When I build a string from it and print that string, I get ?? in place of these special characters. I am using the following code:
File fileDir = new File("files/myfile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(fileDir), "UTF8"));
String strLine;
while ((strLine = br.readLine()) != null) {
    System.out.println("strLine: " + strLine);
}
Can anybody tell me what the problem is? I want strLine to show and save å, ä and ö as they appear in the text file. Thanks in advance.
The problem might not be with the file but with the console where you are trying to print. I suggest the following steps:
Make sure the file you are reading is actually encoded in UTF-8.
Make sure the console you are printing to uses a charset that can display these characters.
Finally, the article Unicode - How to get characters right? is a must-read.
See here for the list of encodings Java supports.
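For the console step, a hedged sketch of forcing the output stream's charset (this only helps if the terminal itself renders UTF-8; the PrintStream constructor declares UnsupportedEncodingException):

import java.io.PrintStream;

PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8"); // autoflush, explicit charset
utf8Out.println("å ä ö"); // the bytes sent to the console are now UTF-8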
The most common single-byte encoding that includes non-ASCII characters is ISO-8859-1; maybe your file is in that encoding, and you should specify it for your InputStreamReader instead.
