Proper Java classes for reading and writing files?

Reading some sources about Java file I/O, I learned that there is more than one alternative for input and output operations.
These are:
BufferedReader and BufferedWriter
FileReader and FileWriter
FileInputStream and FileOutputStream
InputStreamReader and OutputStreamWriter
Scanner class
Which of these is the best alternative for managing text files? What is the best alternative for serialization? And what does Java NIO have to say about all this?

Two kinds of data
Generally speaking there are two "worlds":
binary data
text data
When it's a file (or a socket, or a BLOB in a DB, or ...), then it's always binary data first.
Some of that binary data can be treated as text data (which involves something called an "encoding" or "character encoding").
Binary Data
Whenever you want to handle binary data, you need to use the InputStream/OutputStream classes (generally, everything that contains Stream in its name).
That's why there's a FileInputStream and a FileOutputStream: those read from and write to files and they handle binary data.
Text Data
Whenever you want to handle text data, then you need to use the Reader/Writer classes.
Whenever you need to convert binary data to text (or vice versa), you need some kind of encoding (common ones are UTF-8, UTF-16, ISO-8859-1 and its relatives, and the good old US-ASCII). "Luckily" the Java platform also has something called the "default platform encoding", which it will use whenever it needs an encoding and the code doesn't specify one.
The platform default encoding is a double-edged sword, however:
it makes writing code easier, because you don't have to specify an encoding for each operation but
it might not match the data you have: If the platform-default encoding is ISO-8859-1 and the file you read is actually UTF-8, then you will get a scrambled output!
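To make that second point concrete, here is a minimal, self-contained sketch (the string literal is just an illustration):
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        // "é" is two bytes in UTF-8 (0xC3 0xA9); decoding those bytes
        // as ISO-8859-1 maps each byte to a separate character:
        byte[] utf8Bytes = "é".getBytes(StandardCharsets.UTF_8);
        String scrambled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(scrambled); // prints "Ã©" instead of "é"
    }
}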
For reading, we should also mention BufferedReader, which can be wrapped around any other Reader and adds the ability to read whole lines at once.
Scanner is a special class that's meant to parse text input into tokens. It's most useful for structured text but often used on System.in to provide a very simple way to read data from stdin (i.e. from what the user inputs on the keyboard).
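A rough sketch of both in action (the file name is a placeholder, and FileReader is used here only for brevity; see the encoding caveat below):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class ReadingSketch {
    public static void main(String[] args) throws IOException {
        // BufferedReader adds line-at-a-time reading on top of any Reader:
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Scanner tokenizes its input; here, whatever the user types on stdin:
        Scanner scanner = new Scanner(System.in);
        int number = scanner.nextInt(); // parses the next whitespace-delimited token as an int
        System.out.println("You typed: " + number);
    }
}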
Bridging the gap
Now, confusingly enough there are classes that make the bridge between those worlds, which generally have both parts in their names:
an InputStreamReader consumes an InputStream and is itself a Reader.
an OutputStreamWriter is a Writer and writes to an OutputStream.
And then there are "shortcut classes" that bundle two classes which are frequently used together:
a FileReader is basically a combination of a FileInputStream with an InputStreamReader
a FileWriter is basically a combination of a FileOutputStream with an OutputStreamWriter
Note that FileReader and FileWriter have a major drawback compared to their more complicated "hand-built" alternative: they always use the platform default encoding, which might not be what you want! (Only since Java 11 do they have constructors that accept an explicit Charset.)
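A sketch of that hand-built combination with an explicit encoding (the file name is a placeholder):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        // Instead of new FileWriter("out.txt") (platform default encoding),
        // name the encoding explicitly:
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            writer.write("text in a known encoding\n");
        }

        // And the reading counterpart, instead of new FileReader("out.txt"):
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("out.txt"), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}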
What about serialization?
ObjectOutputStream and ObjectInputStream are special streams used for serialization.
As the names of the classes imply, serialization involves only binary data (even when serializing String objects), so you'll want to use the *Stream classes exclusively. As long as you avoid the Reader/Writer classes, you should be fine.
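A minimal round-trip sketch (the file name is a placeholder; whatever you serialize must implement Serializable, as String does):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerializationSketch {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Writing: ObjectOutputStream sits directly on a binary OutputStream
        try (ObjectOutputStream out = new ObjectOutputStream(
                new FileOutputStream("data.bin"))) {
            out.writeObject("any Serializable object");
        }

        // Reading it back with the matching ObjectInputStream:
        try (ObjectInputStream in = new ObjectInputStream(
                new FileInputStream("data.bin"))) {
            String restored = (String) in.readObject();
            System.out.println(restored);
        }
    }
}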
Further resources
the Basic I/O trail.
Joel's old-ish article on Unicode (good introduction, slightly light on technical detail)
On the evils of platform default encoding (also this)

Related

How to know when to use byteStream for reading data and when to use charStream for reading data from a file?

I am trying to understand which Java class I should use, and why, when I need to read data from different types of files (a .properties file, a JSON file, a text file, etc.) or from the console.
Some classes read data in the form of bytes (8 bits) and some in the form of characters (16-bit Unicode). So how do I decide which class to use for reading data?
For example, I am trying to read a .properties file. How do I decide whether I need FileInputStream or some other class to read the file?
I tried looking for answers online but I am still not clear.
It depends on the individual case.
For property files, there is an overloaded Properties.load() method. You can pass either an InputStream or a Reader.
Here it would be best to use the InputStream option (byte stream), because the load() method handles the correct character set on its own. If you use a Reader, you are responsible for reading the file with the correct encoding.
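For instance (file name and key are placeholders):
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class LoadProperties {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Hand load() the raw byte stream; it applies the encoding rules of
        // the .properties format (ISO-8859-1 plus Unicode escapes) itself:
        try (InputStream in = new FileInputStream("config.properties")) {
            props.load(in);
        }
        System.out.println(props.getProperty("some.key"));
    }
}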
Same for parsing XML files. The encoding is part of the XML header (or UTF-8 as default) so it is the best option to let the parser read and handle it.
For JSON the default encoding is UTF-8. Other encodings are possible but I don't know whether it is possible to declare the encoding inside a JSON document like in XML.
For other file types it depends on the use case. If you have a text file encoded as UTF-8 and you want to copy it to another location as-is, you can simply treat the file as a byte block. But if you have to extract some words, it is necessary to interpret the byte block as characters, so you need a Reader and the right character encoding (typically an InputStreamReader over the byte stream).
Sometimes you get the right encoding via an API, e.g. if you call a REST service over HTTP you can extract the encoding from an HTTP header. Otherwise, you simply have to know it.
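A sketch of both cases (paths are placeholders; java.nio.file.Files is used for brevity, and UTF-8 is assumed):
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class BytesVsChars {
    public static void main(String[] args) throws IOException {
        // Copying as-is: the file can stay an opaque byte block, no decoding needed
        Files.copy(Paths.get("in.txt"), Paths.get("out.txt"),
                StandardCopyOption.REPLACE_EXISTING);

        // Extracting words: now the bytes must be interpreted as characters,
        // so the encoding has to be known
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("in.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    System.out.println(word);
                }
            }
        }
    }
}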

What are prepackaged character stream classes?

While reading the Java Tutorials, the topic Basic I/O says to use InputStreamReader and OutputStreamWriter when there are no prepackaged character stream classes.
1) What are prepackaged character stream classes?
Does it mean a file already has some text?
The term is quite vague and doesn't really seem to be defined anywhere, so good question.
As best I understand it, it means things like FileReader, FileWriter, StringReader, etc.: classes that have already wrapped up a particular kind of stream for you and provide the functionality required to work with it as text.
Note that these prepackaged classes work with characters, not bytes, and that is generally what you want in Java for dealing with String data in files. On the other hand, if you are reading a source for which no such class exists, the data will come in as bytes, and you can then use InputStreamReader to convert those bytes to characters.
So a prepackaged stream reader is one that already provides you the data pre-packaged in the form that you want it.
I believe it means classes which inherit from Reader or Writer. Such classes "wrap" byte streams so as to convert them automatically to character streams. Examples: FileReader and FileWriter; they can read and write text in files directly.
If no such class exists for your particular stream, but you know that what you get out of it (or put into it) is text, then you must use these two wrapper classes.
Classical example: HTML. It is text, but what you get from a socket is a byte stream; if you want to read it as HTML, use a Reader (with the correct encoding!) over the socket stream (though of course, many APIs today don't require you to do that).
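A sketch of that classical example (host, port, and charset are assumptions; a real client would take the charset from the Content-Type header):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HtmlOverSocket {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("example.com", 80)) {
            // The request itself is plain ASCII bytes:
            socket.getOutputStream().write(
                    "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n"
                            .getBytes(StandardCharsets.US_ASCII));
            // The socket hands back bytes; InputStreamReader turns them into
            // characters using the encoding we name explicitly:
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}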

Stream chaining in Java

Is it bad style to keep references to streams "further down" a filter chain and use those lower-level streams again, or even to swap one type of stream for another? For example:
OutputStream os = new FileOutputStream("file");
PrintWriter pw = new PrintWriter(os);
pw.print("print writer stream");
pw.flush();
pw = null;
DataOutputStream dos = new DataOutputStream(os);
dos.writeBytes("dos writer stream");
dos.flush();
dos = null;
os.close();
If so, what are the alternatives if I need to use the functionality of both streams, e.g. if I want to write a few lines of text to a stream, followed by binary data, or vice versa?
This can be done in some cases, but it's error-prone. You need to be careful about buffering, and about things like the stream headers that ObjectOutputStream writes.
if I want to write a few lines of text to a stream, followed by binary data, or vice versa?
For this, all you need to know is that you can convert text to binary data and back, but you always need to specify an encoding. It is still error-prone, because people tend to use the API methods that fall back on the platform default encoding, and because you're essentially implementing a parser for a custom binary file format - lots of things can go wrong there.
All in all, if you're creating a file format, especially one that mixes text and binary data, it's best to use an existing framework like Google Protocol Buffers.
If you have to do it, then you have to do it. So if you're dealing with an external dependency that you don't have control over, you just have to do it.
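If you do roll it by hand, one way is to keep everything on a single DataOutputStream and push the text through an explicit encoding (a minimal sketch; the length-prefix framing is just one possible convention, and the file name is a placeholder):
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MixedFormatSketch {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream("mixed.bin"))) {
            // Text, written as explicitly encoded bytes behind a length prefix:
            byte[] text = "a few lines of text\n".getBytes(StandardCharsets.UTF_8);
            out.writeInt(text.length);
            out.write(text);
            // Followed by raw binary data:
            out.write(new byte[] {0x01, 0x02, 0x03});
        }
    }
}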
I think the bad style is the fact that you would need to do it at all. If you have to send binary data sometimes and text at other times, it would probably be best to have some kind of message object and send the object itself over the wire with serialization. The data overhead isn't too much if structured properly.
I don't see why not. I mean, the implementations of the various stream classes should protect you from writing invalid data. So long as you're reading it back the same way, and your code is otherwise understandable, I don't see why that would be a problem.
Style doesn't always mean you have to do it the way you've seen others do it. So long as it's logical, and someone reading the code would see what (and why) you're doing it without you needing to write a bunch of comments, then I don't see what the issue is.
Since you're flushing in between, it's probably fine. But it might be cleaner to use one OutputStream and just write the strings with os.write(string.getBytes(StandardCharsets.UTF_8)); - note the explicit charset, for the reasons discussed above.

Special characters added while writing to a Java file, but not visible everywhere

I am writing to a Java file with:
FileWriter fstream = new FileWriter("someFile.java");
BufferedWriter out = new BufferedWriter(fstream);
out.write(strContents);
// Close the output stream
out.close();
but after writing I found that it had appended some special characters shaped like boxes ([]). Those characters are only visible in certain text editors, such as EditPlus.
How do I avoid those special characters while writing, or is this specific to certain editors only?
My advice would be to avoid using FileWriter completely. It always uses the platform default encoding, which is rarely a good idea.
I would suggest using FileOutputStream wrapped in an OutputStreamWriter - then you just need to specify an appropriate encoding, such as UTF-8. Obviously you'll still need to use an editor which supports UTF-8, and you may need to tell it the encoding... but at least you'll have code which always writes in the same way, regardless of OS and system properties.
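Concretely, that could look like this (a drop-in rewrite of the snippet from the question; strContents is the string being written there):
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Wrap the byte stream in a writer with an explicit encoding:
try (Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("someFile.java"), StandardCharsets.UTF_8))) {
    out.write(strContents);
}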
The normal Notepad application can't display the special characters written to the file. There is no problem with your code; it is a limitation of Notepad.

Character Encoding Trouble - Java

I've written a little application that does some text manipulation and writes the output to a file (HTML, CSV, DOCX, XML), and this all appears to work fine on Mac OS X. On Windows, however, I seem to get character encoding problems: a lot of '"' characters disappear and are replaced with some weird stuff, usually the closing '"' of a pair.
I use FreeMarker to create my output files, and there is a byte[] array (and in one case also a ByteArrayOutputStream) between reading the templates and writing the output. I assume this is a character encoding problem, so could someone give me advice, or point me to a 'best practice' resource for dealing with character encoding in Java?
Thanks
There's really only one best practice: be aware that Strings and bytes are two fundamentally different things, and that whenever you convert between them, you are using a character encoding (either implicitly or explicitly), which you need to pay attention to.
Typical problematic spots in the Java API are:
new String(byte[])
String.getBytes()
FileReader, FileWriter
All of these implicitly use the platform default encoding, which depends on the OS and the user's locale settings. Usually, it's a good idea to avoid this and explicitly declare an encoding in the above cases (which FileReader/Writer unfortunately don't allow, so you have to use an InputStreamReader/Writer).
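The explicit counterparts look like this (a minimal sketch):
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    public static void main(String[] args) {
        String text = "façade";
        // Implicit platform default - the risky variants look like this:
        //   byte[] risky = text.getBytes();
        //   String alsoRisky = new String(risky);
        // Naming the encoding removes the platform dependency:
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        String back = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(back.equals(text)); // true on every platform
    }
}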
However, your problems with the quotation marks and your use of a template engine may have a much simpler explanation. What program are you using to write your templates? It sounds like one that inserts "smart quotes", which are part of the Windows-specific windows-1252 (cp1252) encoding but don't exist in the more universal ISO-8859-1 encoding.
What you probably need to do is be aware which encoding your templates are saved in, and configure your template engine to use that encoding when reading them. Also be aware that some text files, specifically XML, explicitly declare the encoding in a header; if that header disagrees with the actual encoding used by the file, you'll invariably run into problems.
You can control which encoding your JVM will run with by supplying, for example,
-Dfile.encoding=utf-8
(for UTF-8, of course) as an argument to the JVM. Then you should get predictable results on all platforms. Example:
java -Dfile.encoding=utf-8 my.MainClass
Running the JVM with a 'standard' encoding via the confusingly named -Dfile.encoding will resolve a lot of problems.
Ensuring your app doesn't do byte[] <-> String conversions without a specified encoding is also important, since sometimes you can't enforce the VM encoding (e.g. if you have an app server used by multiple applications).
If you're confused by the whole encoding issue, or want to revise your knowledge, Joel Spolsky wrote a great article on this.
I had to make sure that the OutputStreamWriter uses the correct encoding:
OutputStream out = ...
OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
template.process(model, writer);
Plus, if you use a ByteArrayOutputStream, also make sure to call toString with the correct encoding:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
...
baos.toString("UTF-8");
