I have some calculated data (floats and integers), which is written to a 12 MB file like this:
DataOutputStream os3;
os3 = new DataOutputStream(new FileOutputStream(Cache.class.getResource("/3w.dat").getPath()));
...... (some loops)
os3.writeFloat(f);
os3.writeInt(r);
os3.close();
And after that I read it this way
DataInputStream is3;
is3 = new DataInputStream(new FileInputStream(Cache.class.getResource("/3w.dat").getPath()));
...... (same loops)
is3.readFloat();
is3.readInt();
is3.close();
So, I wrote the file only once, on Windows 7. After that I only read it at application startup. Reading works fine on Windows 7, but when I try to do it on Ubuntu I get an EOFException (the code and the file are the same).
I suspect the problem may be caused by some NaN values written to the file.
BTW, while debugging I found that on Ubuntu the reading loop gets through about 15% of its iterations and then throws the exception. All the values it reads are "0.0", but the file doesn't contain zeros.
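For reference, here is a minimal, self-contained sketch of the write/read pattern described above; the file path, record count and values are made up for illustration (the real code resolves the path via Cache.class.getResource() and loops over the calculated data):
import java.io.*;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        File file = new File("3w.dat");   // illustrative path only
        int records = 1000;               // assumed count; writer and reader must agree on it

        // Write float/int pairs; DataOutputStream always writes big-endian, independent of the OS
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (int i = 0; i < records; i++) {
                out.writeFloat(i * 0.5f);
                out.writeInt(i);
            }
        }

        // Read the pairs back in exactly the same order and count
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            for (int i = 0; i < records; i++) {
                float f = in.readFloat();
                int r = in.readInt();
            }
        }
    }
}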
The problem is probably the difference in line breaks between Linux and Windows. As mentioned by Clark, converting the end-of-line characters (\r\n vs. \n) could resolve this. You can use Notepad++ to simplify this work. Just click on the menu:
Edit -> EOL conversion -> Convert to UNIX format
It might also be an encoding problem. If you are using Windows, the default encoding is Windows-1252 (also called CP-1252), which is a Microsoft-specific encoding and may not be recognized on Linux. Just change it to an encoding like UTF-8, which every OS can recognize. Using Notepad++:
Encoding -> Convert to UTF-8
Related
My Issue is as follows:
I'm having an issue with character encoding when writing to a text file: characters do not show the intended value. For example, I write ' ' (which is probably a tab character) and 'Â' is what is displayed in the text file.
Background information
This data is stored in an MS SQL database. The database collation is SQL_Latin1_General_CP1_CI_AS and the fields are varchar. I've come to learn that the collation and type determine what character encoding is used on the database side. Values are stored correctly, so no issues there.
My Java application runs queries to pull the data from the DB, and this also looks OK. I have debugged the code and seen that all the Strings have the correct representation before being written to the file.
Next I write the text to the .TXT file using an OutputStreamWriter as follows:
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
    String exportFileLocation = getExportedFileLocation();
    File offerFile = new File(getDatedFileName(exportFileLocation + "/" + clientAppName + "_OFFERRECORDS"));
    bufferedWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(offerFile, isAppend), "UTF-8"));
}
Now, once I open the file on the Linux server, either by running cat on it or by opening it in Notepad++, some of the characters display incorrectly.
I've run the following commands on the server to check its encoding: locale charmap prints UTF-8, echo $LANG prints en_US.UTF-8, and echo $LC_CTYPE prints nothing.
Here is what I've attempted so far.
I've attempted to change the character encoding used by the OutputStreamWriter: I've tried UTF-8 and CP1252. Switching the encoding fixes some characters while others are then improperly displayed.
My Question is this:
Which encoding should my OutputStreamWriter be using?
(Bonus question) How are we supposed to avoid issues like this? The rule of thumb I was given was "use UTF-8 and you will never run into problems", but that isn't the case for me right now.
Running the file -bi command on the server revealed that the file was encoded as ASCII instead of UTF-8. Removing the file completely and rerunning the process fixed this for me.
In Java, I am generating a file containing Unicode characters.
When I run my program on Windows (JBoss) and open the file (CSV), it displays the Unicode characters (Norwegian and Icelandic) fine in Excel.
But when I deploy the same thing to a server running Red Hat Linux (same JBoss version), run the program, generate the file and download it, all the Unicode characters are distorted when I view the file in Excel.
Could you please suggest any Linux setting that could be distorting the Unicode output, or where a change is required?
FileWriter writer = new FileWriter(fileName);
writer.append(new String(data.toString().getBytes("UTF-8"),"UTF-8"));
writer.flush();
writer.close();
//data is StringBuilder type
I have also tried ISO8859_1
Update 1
I have checked the system encoding using System.getProperty("file.encoding") and found that
Windows is Cp1252 and Linux is UTF-8
Update 2
When I print on Linux using:
log.info(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
it shows all the output perfectly fine, but when I write it to a filename.csv file with FileWriter, it does not display correctly.
It looks like you are translating from bytes
data
to a String
data.toString()
to bytes
data.toString().getBytes("UTF-8")
to a String
new String(data.toString().getBytes("UTF-8"), "UTF-8")
and back to bytes
writer.append(new String(data.toString().getBytes("UTF-8"), "UTF-8"));
Try just a single conversion from the input encoding to a String, and then write out the String. The data.toString() call needs to know what encoding it is reading. Does data support conversion from different code pages?
writer.append(data.toString(codepage));
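Alternatively, if the goal is simply to get the StringBuilder contents into the CSV as UTF-8, one explicit encoding step is enough. A minimal sketch (the output file name and sample content are made up here), replacing FileWriter, which always uses the platform default encoding, with an explicit OutputStreamWriter:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class WriteCsvUtf8 {
    public static void main(String[] args) throws IOException {
        StringBuilder data = new StringBuilder("Ø,Æ,Þ\n");   // sample Norwegian/Icelandic characters

        // One conversion only: the writer encodes the String's characters to UTF-8 bytes
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("out.csv"), "UTF-8"))) {
            writer.write(data.toString());
        }
    }
}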
I download a file from a website using a Java program and the header looks like below
Content-Disposition: attachment; filename="Textkürzung.asc";
There is no encoding specified
After downloading, I pass the name of the file to another application for further processing. I use
System.out.println(filename);
In the standard out the string is printed as Textk³rzung.asc
How can I change the Standard Out to "UTF-8" in Java?
I tried encoding to "UTF-8" and the content is still the same.
Update:
I was able to fix this without any code change. In the place where I invoke my jar file from the other application, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
thank you all for your support
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
    // Replace System.out with a PrintStream that encodes output as UTF-8 (autoflush enabled)
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC. The byte value 0xFC renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
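For example, if the other application reads the name from its standard input, it would have to decode it with the same code page. A rough sketch, assuming Cp850 on both ends:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadNameCp850 {
    public static void main(String[] args) throws IOException {
        // Decode standard input with the same code page the writing side used
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "Cp850"));
        String filename = in.readLine();
        System.out.println("Received: " + filename);
    }
}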
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(test);
Background:
I have 2 machines: one running German Windows 7, and my PC running English Windows 7 (with a Hebrew locale).
In my Perl code I'm trying to check if the file that I got from the German machine exists on my machine.
The file name is ßßßzllpoöäüljiznppü.txt
Why does it fail when I run the following code?
use Encode;
use Encode::locale;
sub UTF8ToLocale
{
    my $str = decode("utf8", $_[0]);
    return encode(locale, $str);
}

if (!-e UTF8ToLocale($read_file))
{
    print "failed to open the file";
}
else
{
    print $read_file;
}
The same thing also happens when I try to open the file:
open(wtFile, ">", UTF8ToLocale($read_file));
binmode wtFile;
shift @_;
print wtFile @_;
close wtFile;
The file name is converted from the German encoding to UTF-8 in my Java application, and this is passed to the Perl script.
The Perl script takes this file name and converts it from UTF-8 to the system locale (see the UTF8ToLocale($read_file) function call), and I believe that is the problem.
Questions:
Can you please tell me what the OS file system's charset encoding is?
When I create a German file name on an OS whose locale is Hebrew, in which charset is it saved?
How do I solve this problem?
Update:
Here is some other code that I ran with a hard-coded file name on my PC; the script file is UTF-8 encoded:
use Encode;
use Encode::locale;
my $string = encode("utf-16",decode("utf8","C:\\TestPerl\\ßßßzllpoöäüljiznppü.txt"));
if (-e $string)
{
    print "exists\r\n";
}
else
{
    print "not exists\r\n";
}
The output is "not exists".
I also tried different charsets: cp1252, cp850, utf-16le, nothing works.
If I change the file name to English or Hebrew (my default locale), it works.
Any ideas?
Windows 7 stores file names internally as UTF-16 (I don't remember the byte order). Because of that, you don't need to convert file names. However, if you transport the file via a FAT file system (e.g. an old USB stick) or another non-Unicode-aware file system, these benefits get lost.
The locale setting you are talking about only affects the language of the user interface and the apparent folder names (Programme (x86) vs. Program Files (x86) with the latter being the real name in the file system).
The larger problem I can see is the encoding of the file contents that you want to transfer, as some applications may default to different encodings depending on the locale. There is no solution to that except being explicit when the file is created; sticking to UTF-8 is generally a good idea.
And why do you convert the file names with another tool? Any Unicode encoding should be sufficient for transfer.
Your script does not work because you reference an undefined global variable called $read_file. Assuming your second code block is not enclosed in any scope, especially not in a sub, the @_ variable is not available there. To get command-line arguments you should consider using the @ARGV array. The logic of your script isn't clear anyway: you print error messages to STDOUT, not STDERR; you "decode" the file name and then print out the non-decoded string in your else branch; and you are paranoid about encodings (which is generally good) but don't specify an encoding for your output stream, etc.
I have a problem where I can't write files with accents in the file name on Solaris.
Given following code
public static void main(String[] args) {
    System.out.println("Charset = " + Charset.defaultCharset().toString());
    System.out.println("testéörtkuoë");
    FileWriter fw = null;
    try {
        fw = new FileWriter("testéörtkuoë");
        fw.write("testéörtkuoëéörtkuoë");
        fw.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I get the following output
Charset = ISO-8859-1
test??rtkuo?
and I get a file called "test??rtkuo?"
Based on info I found on StackOverflow, I tried to call the Java app by adding "-Dfile.encoding=UTF-8" at startup.
This returns following output
Charset = UTF-8
testéörtkuoë
But the filename is still "test??rtkuo?"
Any help is much appreciated.
Stef
All these characters are present in ISO-8859-1. I suspect part of the problem is that the code editor is saving files in a different encoding from the one your operating system is using.
If the editor is using ISO-8859-1, I would expect it to encode ëéö as:
eb e9 f6
If the editor is using UTF-8, I would expect it to encode ëéö as:
c3ab c3a9 c3b6
Other encodings will produce different values.
The source file would be more portable if you used Unicode escape sequences. At least be certain your compiler is using the same encoding as the editor.
Examples:
ë \u00EB
é \u00E9
ö \u00F6
You can look up these values using the Unicode charts.
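For example, a class that uses only Unicode escapes compiles to the same string no matter what encoding the editor or compiler assumes (the name below is just the test string from the question):
public class EscapeDemo {
    public static void main(String[] args) {
        // \u00E9 = é, \u00F6 = ö, \u00EB = ë; the source file itself stays pure ASCII
        String name = "test\u00E9\u00F6rtkuo\u00EB";
        System.out.println(name);   // prints testéörtkuoë if the console encoding is correct
    }
}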
Changing the default file encoding using -Dfile.encoding=UTF-8 might have unintended consequences for how the JVM interacts with the system.
There are parallels here with problems you might see on Windows.
I'm unable to reproduce the problem directly - my version of OpenSolaris uses UTF-8 as the default encoding.
If you attempt to list the filenames with the java io apis, what do you see? Are they encoded correctly? I'm curious as to whether the real problem is with encoding the filenames or with the tools that you are using to check them.
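One rough way to check, assuming the file can be created at all: create it and immediately list the directory through the same APIs, then compare with what ls shows.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ListCheck {
    public static void main(String[] args) throws IOException {
        new FileWriter("testéörtkuoë").close();   // create an empty file with the accented name

        // If ls shows the name correctly but these lines are garbled, the console or JVM
        // encoding is at fault; if ls is also garbled, the name itself was encoded wrongly.
        for (String name : new File(".").list()) {
            System.out.println(name);
        }
    }
}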
What happens when you do:
ls > testéörtkuoë
If that works (writes to the file correctly), then you know you can write to files with accents.
I had a similar problem. Contrary to that example, the program was unable to list the files correctly using System.out.println, even though ls showed the correct values.
As described in the documentation, the file.encoding property should not be used to define the charset and, in this case, the JVM ignores it.
The symptom:
I could not type accents in the shell.
ls showed correct values.
File.list() printed incorrect values.
The file.encoding property did not affect the output.
The user.language / user.country properties did not affect the output.
The solution:
Although the LC_* environment variables were set in the shell with values inherited from /etc/default/init, as listed by the set command, locale showed different values.
$ set | grep LC
LC_ALL=pt_BR.ISO8859-1
LC_COLLATE=pt_BR.ISO8859-1
LC_CTYPE=pt_BR.ISO8859-1
LC_MESSAGES=C
LC_MONETARY=pt_BR.ISO8859-1
LC_NUMERIC=pt_BR.ISO8859-1
LC_TIME=pt_BR.ISO8859-1
$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
The solution was simply exporting LANG. This environment variable really does affect the JVM.
LANG=pt_BR.ISO8859-1
export LANG
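A quick way to confirm the exported locale actually reached the JVM is to print the defaults it picked up, for example:
import java.nio.charset.Charset;

public class LocaleCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("user.language   = " + System.getProperty("user.language"));
    }
}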
Java uses the operating system's default encoding when reading and writing files. One should never rely on that; it's always good practice to specify the encoding explicitly.
In Java you can use the following for reading and writing:
Reading:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath),"UTF-8"));
Writing:
PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));