It seems that there are many, many ways to read text files in Java (BufferedReader, DataInputStream etc.) My personal favorite is Scanner with a File in the constructor (it's just simpler, works with mathy data processing better, and has familiar syntax).
Boris the Spider also mentioned Channel and RandomAccessFile.
Can someone explain the pros and cons of each of these methods? To be specific, when would I want to use each?
(edit) I think I should be specific and add that I have a strong preference for the Scanner method. So the real question is, when wouldn't I want to use it?
Lets start at the beginning. The question is what do you want to do?
It's important to understand what a file actually is. A file is a collection of bytes on a disc, these bytes are your data. There are various levels of abstraction above that that Java provides:
File(Input|Output)Stream - read these bytes as a stream of byte.
File(Reader|Writer) - read from a stream of bytes as a stream of char.
Scanner - read from a stream of char and tokenise it.
RandomAccessFile - read these bytes as a searchable byte[].
FileChannel - read these bytes in a safe multithreaded way.
On top of each of those there are the Decorators, for example you can add buffering with BufferedXXX. You could add linebreak awareness to a FileWriter with PrintWriter. You could turn an InputStream into a Reader with an InputStreamReader (currently the only way to specify character encoding for a Reader).
So - when wouldn't I want to use it [a Scanner]?.
You would not use a Scanner if you wanted to, (these are some examples):
Read in data as bytes
Read in a serialized Java object
Copy bytes from one file to another, maybe with some filtering.
It is also worth nothing that the Scanner(File file) constructor takes the File and opens a FileInputStream with the platform default encoding - this is almost always a bad idea. It is generally recognised that you should specify the encoding explicitly to avoid nasty encoding based bugs. Further the stream isn't buffered.
So you may be better off with
try (final Scanner scanner = new Scanner(new BufferedInputStream(new FileInputStream())), "UTF-8") {
//do stuff
}
Ugly, I know.
It's worth noting that Java 7 Provides a further layer of abstraction to remove the need to loop over files - these are in the Files class:
byte[] Files.readAllBytes(Path path)
List<String> Files.readAllLines(Path path, Charset cs)
Both these methods read the entire file into memory, which might not be appropriate. In Java 8 this is further improved by adding support for the new Stream API:
Stream<String> Files.lines(Path path, Charset cs)
Stream<Path> Files.list(Path dir)
For example to get a Stream of words from a Path you can do:
final Stream<String> words = Files.lines(Paths.get("myFile.txt")).
flatMap((in) -> Arrays.stream(in.split("\\b")));
SCANNER:
can parse primitive types and strings using regular expressions.
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types.more can be read at http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
DATA INPUT STREAM:
Lets an application read primitive Java data types from an underlying input stream in a machine-independent way. An application uses a data output stream to write data that can later be read by a data input stream.DataInputStream is not necessarily safe for multithreaded access. Thread safety is optional and is the responsibility of users of methods in this class. More can be read at http://docs.oracle.com/javase/7/docs/api/java/io/DataInputStream.html
BufferedReader:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.The buffer size may be specified, or the default size may be used. The default is large enough for most purposes.In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.Programs that use DataInputStreams for textual input can be localized by replacing each DataInputStream with an appropriate BufferedReader.More detail are at http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
NOTE: This approach is outdated. As Boris points out in his comment. I will leave it here for history, but you should use methods available in JDK.
It depends on what kind of operation you are doing and the size of the file you are reading.
In most of the cases, I recommend using commons-io for small files.
byte[] data = FileUtils.readFileToByteArray(new File("myfile"));
You can read it as string or character array...
Now, you are handing big files, or changing parts of a file directly on the filesystem, then the best it to use a RandomAccessFile and potentially even a FileChannel to do "nio" style.
Using BufferedReader
BufferedReader reader;
char[] buffer = new char[10];
reader = new BufferedReader(new FileReader("FILE_PATH"));
//or
reader = Files.newBufferedReader(Path.get("FILE_PATH"));
while (reader.read(buffer) != -1) {
System.out.print(new String(buffer));
buffer = new char[10];
}
//or
while (buffReader.ready()) {
System.out.println(
buffReader.readLine());
}
reader.close();
Using FileInputStream-Read Binary Files to Bytes
FileInputStream fis;
byte[] buffer = new byte[10];
fis = new FileInputStream("FILE_PATH");
//or
fis=Files.newInoutSream(Paths.get("FILE_PATH"))
while (fis.read(buffer) != -1) {
System.out.print(new String(buffer));
buffer = new byte[10];
}
fis.close();
Using Files– Read Small File to List of Strings
List<String> allLines = Files.readAllLines(Paths.get("FILE_PATH"));
for (String line : allLines) {
System.out.println(line);
}
Using Scanner – Read Text File as Iterator
Scanner scanner = new Scanner(new File("FILE_PATH"));
while (scanner.hasNextLine()) {
System.out.println(scanner.nextLine());
}
scanner.close();
Using RandomAccessFile-Reading Files in Read-Only Mode
RandomAccessFile file = new RandomAccessFile("FILE_PATH", "r");
String str;
while ((str = file.readLine()) != null) {
System.out.println(str);
}
file.close();
Using Files.lines-Reading lines as stream
Stream<String> lines = Files.lines(Paths.get("FILE_PATH") .forEach(s -> System.out.println(s));
Using FileChannel-for increasing performance by using off-heap memory furthermore using MappedByteBuffer
FileInputStream i = new FileInputStream(("FILE_PATH");
ReadableByteChannel r = i.getChannel();
ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024);
while (r.read(buffer) != -1) {
buffer.flip();
while (buffer.hasRemaining()) {
System.out.print((char) buffer.get());
}
buffer.clear();
}
Related
I have a sample method which copies one file to another using InputStream and OutputStream. In this case, the source file is encoded in 'UTF-8'. Even if I don't specify the encoding while writing to the disk, the destination file has the correct encoding. But, if I have to write a java.lang.String to a file, I need to specify the encoding. Why is that ?
public static void copyFile() {
String sourceFilePath = "C://my_encoded.txt";
InputStream inStream = null;
OutputStream outStream = null;
try{
String targetFilePath = "C://my_target.txt";
File sourcefile =new File(sourceFilePath);
outStream = new FileOutputStream(targetFilePath);
inStream = new FileInputStream(sourcefile);
byte[] buffer = new byte[1024];
int length;
//copy the file content in bytes
while ((length = inStream.read(buffer)) > 0){
outStream.write(buffer, 0, length);
}
inStream.close();
outStream.close();
System.out.println("File "+targetFilePath+" is copied successful!");
}catch(IOException e){
e.printStackTrace();
}
}
My guess is that since the source file has thee correct encoding and since we read and write one byte at a time, it works fine. And java.lang.String is 'UTF-16' by default and if we write it to the file, it reads one byte at a time instead of 2 bytes and hence garbage values. Is that correct or am I completely wrong in my understanding ?
You are copying the file byte per byte, so you don't need to care about character encoding.
As a rule of thumb:
Use the various InputStream and OutputStream implementations for byte-wise processing (like file copy).
There are some convenience methods to handle text directly like PrintStream.println(). Be careful because most of them use the default platform specific encoding.
Use the various Reader and Writer implemenations for reading and writing text.
If you need to convert between byte-wise and text processing use InputStreamReader and OutputStreamWriter with explicit file encoding.
Do not rely on the default encoding. The default character encoding is platform specific (e.g. Windows-ANSI aka Cp1252 for Windows, usually UTF-8 on Linux).
Example: If you need to read a UTF-8 text file:
BufferedReader reader =
new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"));
Avoid using a FileReader because a FileReader uses always the default encoding.
A special case: If you need random access to a file you should use RandomAccessFile. With it you can read and write data blocks at arbitrary positions. You can read and write raw byte blocks or you can use convenience methods to read and write text. But you should read the documentation carefully. E.g. the methods readUTF() and writeUTF() use a modified UTF-8 encoding.
InputStream, OutputStream, Reader, Writer and RandomAccessFile form the basic IO functionality, enough for most use cases. For advanced IO (e.g. memory mapped files, ...) have a look at package java.nio.
Just read your code! (For the copy part at least ;-) )
When you copy the two files, you copy it byte by byte. There is no conversion to String, thus.
When you write a String into a file, you need to convert it (indirectly sometimes) in an array of byte (byte[]). There you need to specify your encoding.
When you read a file to get a String, you need to know its encoding in order to do it properly. Java doesn't 'skip' any byte but you need to make a conversion once again : from a byte[] to a String.
I'm trying to read a bz2 file using Apache Commons Compress.
The following code works for a small file.
However for a large file (over 500MB), it ends after reading a few thousands lines without any error.
try {
InputStream fin = new FileInputStream("/data/file.bz2");
BufferedInputStream bis = new BufferedInputStream(fin);
CompressorInputStream input = new CompressorStreamFactory()
.createCompressorInputStream(bis);
BufferedReader br = new BufferedReader(new InputStreamReader(input,
"UTF-8"));
String line = "";
while ((line = br.readLine()) != null) {
System.out.println(line);
}
} catch (Exception e) {
e.printStackTrace();
}
Is there another good way to read a large compressed file?
I was having the same problem with a large file, until I noticed that CompressorStreamFactory has a couple of overloaded constructors that take a boolean decompressUntilEOF parameter.
Simply changing to the following may be all that's missing...
CompressorInputStream input = new CompressorStreamFactory(true)
.createCompressorInputStream(bis);
Clearly, whoever wrote this factory seems to think it's better to create new compressor input streams at certain points, with the same underlying buffered input stream so that the new one picks up where the last one left off. They seem to think that's a better default, or preferred way of doing it over allowing one stream to decompress data all the way to the end of the file. I've no doubt they are cleverer than me, and I haven't worked out what trap I'm setting for future me by setting this parameter to true. Maybe someone will tell me in the comments! :-)
Text file can be directly read using FileReader & BufferedReader classes.
In several technote, it is mentioned to get the text file as a input stream, then convert to Inputstreamreader and then BufferedReader.
Any reasons why we need to use InputStream approach
FileReader is convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
Complement to this answer...
There really isn't a need to use a BufferedReader if you don't need to, except that it has the very convenient .readLine() method. That's the first point.
Second, and more importantly:
Use JSR 203. Don't use File anymore.
Use try-with-resources.
Both of them appeared in Java 7. So, the new way to read a text file is as such:
final Path path = Paths.get("path/to/the/file");
try (
final BufferedReader reader = Files.newBufferedReader(path,
StandardCharsets.UTF_8);
) {
// use reader here
}
In Java 8, you also have a version of Files.newBufferedReader() which does not take a charset as an argument; this will read in UTF-8 by default. Also in Java 8, you have Files.lines():
try (
final Stream<String> stream = Files.lines(thePath);
) {
// use the stream here
}
And yes, use try-with-resources for such a stream!
I've been Googling up some answers, and I can't seem to find the best one.
Here's what I have so far for reading internal files on Android:
fis = openFileInput("MY_FILE");
StringBuilder fileContent = new StringBuilder("");
byte[] buffer = new byte[fis.available()];
while (fis.read(buffer) != -1) {
fileContent.append(new String(buffer));
}
MYVARIABLE = fileContent.toString();
fis.close();
It use to leave a lot of whitespaces, but I just used .available method to only return what I need.
Is there a faster or shorter way to write this? I can't seem to find any good ones in the API guide.
1). API for available() says it should not be used for the purposes you need:
Note that this method provides such a weak guarantee that it is not very useful in practice.
Meaning it may not give you the file size.
2). When you read smth in RAM, then take under account the file can be lengthy, so try to avoid spending extra RAM. For this a relatively small (1~8 KB) buffer is used to read from source and then append to result. On the other hand using too small buffers (say, several bytes) slows down reading significantly.
3). Reading bytes differs from reading characters, because a single character may be represented by more than one byte (depends on encoding). To read chars the spesific classes are used which are aware of encoding and know how to convert bytes to chars properly. For instance InputStreamReader is one of such classes.
4). The encoding to use for reading should be the encoding tha was used for persisting the data.
Taking all the said above I would use smth like this:
public static String getStringFromStream(InputStream in, String encoding)
throws IOException {
InputStreamReader reader;
if (encoding == null) {
// This constructor sets the character converter to the encoding
// specified in the "file.encoding" property and falls back
// to ISO 8859_1 (ISO-Latin-1) if the property doesn't exist.
reader = new InputStreamReader(in);
} else {
reader = new InputStreamReader(in, encoding);
}
StringBuilder sb = new StringBuilder();
final char[] buf = new char[1024];
int len;
while ((len = reader.read(buf)) > 0) {
sb.append(buf, 0, len);
}
return sb.toString();
}
5). Make sure to always close an InputStream when done working with it.
Sure, there are more than one way to read text from file in Java/Android. This is mostly because Java API contains several generations of IO APIs. For instance, classes from java.nio package were created to be more efficient, however usually there is no strong reason of using them (don't fall into premature optimization sin).
I have a file saved as utf-8 (saved by my application in fact). How do you read it character by character?
File file = new File(folder+name);
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
DataInputStream dis = new DataInputStream(bis);
The two options seem to be:
char c = dis.readByte()
char c = dis.readChar()
The first option works as long as you only have ascii characters stored, ie english.
The second option reads the first and second byte of the file as one character.
The original file is being written as follows:
File file = File.createTempFile("file", "txt");
FileWriter fstream = new FileWriter(file);
BufferedWriter out = new BufferedWriter(fstream);
You don't want a DataInputStream, that's for reading raw bytes. Use an InputStreamReader, which lets you specify the encoding of the input (UTF-8 in your case).
You should be aware that in the Java world you use streams to process bytes, and readers/writers to process characters. These two are not the same, and you should choose the right one to handle what you have.
Have a look at http://java.sun.com/docs/books/tutorial/i18n/text/stream.html to see how to work with characters in a byte-oriented world.
The Sun Java Tutorial is a highly recommended learning resource.
Use a Reader (eg. BufferedReader)
Reader reader = new BufferedReader(new FileReader(file));
char c = reader.read();
You can read individual bytes and when you hit a byte that is less than 128 (ie. the 8th byte is 0) then that is the last byte of the character.
I'm no Java expert, but I would assume that there are better ways. Maybe some way of telling the reader what encoding it is in...
edit: see dmazzoni's answer.