I am parsing a file which is 800MB in size (with a high possibility of it being more than 2GB).
I split it into several files of approximately 1-3kb each.
I would like to consult you guys on which is better to use of the two: BufferedWriter or OutputStreamWriter.
Any guidance on the right direction is appreciated.
Ok, since you ask.
Writer - an abstract class whose concrete implementations let you write characters/strings, as opposed to raw bytes, which OutputStream implementations handle.
FileWriter - a concrete implementation that lets you write to a File. Weakness: the character encoding is hard-coded to the platform default, usually Windows-1252 on Windows and UTF-8 on Linux.
To overcome this, many people start with an OutputStream (maybe a FileOutputStream) and then convert it into a Writer using OutputStreamWriter, because the constructor lets you set the encoding.
Example:
OutputStream os = new FileOutputStream("turnip");
Writer writer = new OutputStreamWriter(os,"UTF-8");
writer.write("This string will be written as UTF-8");
Now, with OutputStreams/Writers (and their inverse classes, InputStreams/Readers), it is often useful to additionally wrap a BufferedWriter around them.
Continuing from the example above:
writer=new BufferedWriter(writer);
writer.write("Another string in UTF-8");
What does this do? A BufferedWriter basically provides a memory buffer. Everything you write is first stored in memory and then flushed to disk (or wherever) as necessary. This often provides dramatic performance improvements. To see this for yourself, create a loop of, say, 100,000 writes without the BufferedWriter, time it, and compare that to the buffered version.
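For instance, a rough sketch of that comparison might look like the following (file names are just placeholders, and actual timings will vary by machine and JVM):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class BufferTimingDemo {
    public static void main(String[] args) throws IOException {
        // Unbuffered: every write() goes through the encoder and down towards the OS
        long start = System.nanoTime();
        try (Writer w = new FileWriter("unbuffered.txt")) {
            for (int i = 0; i < 100_000; i++) {
                w.write("line " + i + "\n");
            }
        }
        System.out.println("Unbuffered: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Buffered: writes accumulate in memory and are flushed in large blocks
        start = System.nanoTime();
        try (Writer w = new BufferedWriter(new FileWriter("buffered.txt"))) {
            for (int i = 0; i < 100_000; i++) {
                w.write("line " + i + "\n");
            }
        }
        System.out.println("Buffered:   " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}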
There is no StreamWriter class in Java.
If you want to learn about input and output streams,
the best place to learn is the following link:
Yes, I know what a buffer is. But watch this:
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter("file.txt"));
How does buffering actually work here? The way I see it, we are buffering data in the FileWriter buffer and not in the BufferedWriter buffer, because when the buffer of the BufferedWriter gets full it sends it to the FileWriter buffer, which is then responsible for writing the data?
Am I missing something? The way I see it, it looks as if we are sipping water from a bigger container into a smaller one, so we end up pouring water from the smaller one.
A similar example here:
BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
String line = reader.readLine();
I have seen this everywhere. We actually end up reading from the reader line by line and not from the buffer and its 8k capacity. So what is the point of the buffer here? We read line by line from the file and not the entire buffer at once. Is the BufferedReader redundant here?
Please if someone can nicely explain this, I have been struggling for a long time.
Low level system calls to read and write data are optimized to transfer larger blocks at once. Buffering lets you take advantage of this. When you write single characters or short strings, they are all accumulated in a buffer, and written out as one large block when the buffer is full. When you read data, the read functions request to fill a large buffer, and then it returns data from that buffer.
You're right that wrapping buffered streams within other buffered streams is pointless: at best it achieves nothing, at worst it adds overhead as the data is needlessly copied from one buffer to the next. The buffer closest to the data source matters most.
On the other hand, nothing in the API specification says FileWriter and FileReader have buffers. In fact, it recommends you wrap FileWriter within a BufferedWriter and FileReader within a BufferedReader:
For top efficiency, consider wrapping an OutputStreamWriter within a BufferedWriter so as to avoid frequent converter invocations. For example:
Writer out
= new BufferedWriter(new OutputStreamWriter(System.out));
(FileWriter is a subclass of OutputStreamWriter)
How does this work internally?
If you look at how FileWriter is implemented though, the story gets complicated because FileWriter does involve a buffer. Some of the details may depend on which version of Java you're using. In OpenJDK, when you create a BufferedWriter that decorates a FileWriter like:
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter("file.txt"));
you are creating a stack of objects like the following, where one object wraps the next:
BufferedWriter -> FileWriter -> StreamEncoder -> FileOutputStream
where StreamEncoder is an internal class, part of how OutputStreamWriter is implemented.
Now, when you write characters to the BufferedWriter instance, it first accumulates them in the BufferedWriter's own buffer. The inner FileWriter does not see any of the data until you have written enough data to fill this buffer (or call flush()).
When the BufferedWriter buffer becomes full, it writes the contents of the buffer to the FileWriter with a single call to write(char[],int,int). This transfer of a large data block is where the efficiency comes from: now FileWriter has a large block of data it can write to the file, and not individual characters.
Then it gets a little complicated: the characters have to be converted to bytes so that they can be written into a file. This is where FileWriter passes the data on to the StreamEncoder.
The StreamEncoder class uses a CharsetEncoder to convert the block of characters to bytes all at once, and accumulates the bytes in a buffer of its own. When it's done, it writes the bytes to the innermost FileOutputStream, as one block. FileOutputStream then invokes operating system functions to write to an actual file.
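As a rough illustration only (this is a simplified sketch of the idea, not the actual OpenJDK source), a buffering writer behaves roughly like this:

// Simplified sketch of what a buffering writer does internally.
// The real BufferedWriter and StreamEncoder classes are more involved.
class SketchBufferedWriter {
    private final java.io.Writer out;      // e.g. the FileWriter underneath
    private final char[] buffer = new char[8192];
    private int count = 0;

    SketchBufferedWriter(java.io.Writer out) {
        this.out = out;
    }

    void write(char c) throws java.io.IOException {
        if (count == buffer.length) {
            flushBuffer();                 // hand a full block to the underlying writer
        }
        buffer[count++] = c;
    }

    void flushBuffer() throws java.io.IOException {
        if (count > 0) {
            out.write(buffer, 0, count);   // one big write(char[], int, int) call
            count = 0;
        }
    }
}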
What if you didn't use BufferedWriter?
If you write characters to the FileWriter directly, they get passed on to the StreamEncoder object, which converts them into bytes and stores them in its private buffer; they are not written directly to the FileOutputStream. This way, the internal implementation of FileWriter gives you some of the benefits of buffering. But this is not part of the API specification, so you shouldn't depend on it.
Also, every call to FileWriter.write will result in an invocation of the CharsetEncoder to encode characters into bytes. It's more efficient to encode large blocks of characters at once; writing single characters or short strings has a higher overhead.
I am studying Android development (I'm a beginner in programming in general) and learning about HTTP networking and saw this code in the lesson:
private String readFromStream(InputStream inputStream) throws IOException {
    StringBuilder output = new StringBuilder();
    if (inputStream != null) {
        InputStreamReader inputStreamReader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
        BufferedReader reader = new BufferedReader(inputStreamReader);
        String line = reader.readLine();
        while (line != null) {
            output.append(line);
            line = reader.readLine();
        }
    }
    return output.toString();
}
I don't understand exactly what InputStream, InputStreamReader and BufferedReader do. All of them have a read() method, and the BufferedReader also has readLine(). Why can't I only use the InputStream, or only add the InputStreamReader? Why do I need to add the BufferedReader? I know it has to do with efficiency but I don't understand how.
I've been researching and the documentation for the BufferedReader tries to explain this but I still don't get who is doing what:
In general, each read request made of a Reader causes a corresponding
read request to be made of the underlying character or byte stream. It
is therefore advisable to wrap a BufferedReader around any Reader
whose read() operations may be costly, such as FileReaders and
InputStreamReaders. For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each
invocation of read() or readLine() could cause bytes to be read from
the file, converted into characters, and then returned, which can be
very inefficient.
So, I understand that the InputStream can only read one byte, the InputStreamReader a single character, and the BufferedReader a whole line and that it also does something about efficiency which is what I don't get. I would like to have a better understanding of who is doing what, so as to understand why I need all three of them and what the difference would be without one of them.
I've researched a lot here and elsewhere on the web and don't seem to find any explanation about this that I can understand, almost all tutorials just repeat the documentation info. Here are some related questions that maybe begin to explain this but don't go deeper and solve my confusion: Q1, Q2, Q3, Q4. I think it may have to do with this last question's explanation about system calls and returning. But I would like to understand what is meant by all this.
Could it be that the BufferedReader's readLine() calls the InputStreamReader's read() method, which in turn calls the InputStream's read() method? And that the InputStream returns bytes converted to int, one byte at a time; the InputStreamReader reads enough of these to make a single character and returns one character at a time; and the BufferedReader reads enough of these characters to make up a whole line, returning the whole line as a String in a single call instead of several? I don't know, I'm just trying to get how things work.
Lots of thanks in advance!
This Streams in Java concepts and usage link gives a very nice explanation.
Streams, Readers, Writers, BufferedReader, BufferedWriter – these are the terminologies you will deal with in Java. These are the classes provided in Java to operate with input and output. It is really worth knowing how they are related and how they are used. This post will explore the Streams in Java and other related classes in detail. So let us start:
Let us define each of these at a high level, then dig deeper.
Streams
Used to deal with byte level data
Reader/Writer
Used to deal with character-level data. Various character encodings are also supported.
BufferedReader/BufferedWriter
Used to increase performance. Data to be read is buffered into memory for quick access.
While these are for taking input, the corresponding classes exist for output as well. For example, if there is an InputStream meant to read a stream of bytes, there is an OutputStream that will help in writing a stream of bytes.
InputStreams
There are many types of InputStreams Java provides. Each connects to a distinct data source, such as a byte array, a File, etc.
For example, FileInputStream connects to a file data source and can be used to read bytes from a File, while ByteArrayInputStream can be used to treat a byte array as an input stream.
OutputStream
This helps in writing bytes to a destination. For almost every InputStream there is a corresponding OutputStream, wherever it makes sense.
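To make the layering concrete, here is a minimal sketch (the file name is just an example) of how the three levels are typically stacked when reading text from a file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LayeringDemo {
    public static void main(String[] args) throws IOException {
        // Stream layer: raw bytes from the file
        // Reader layer: bytes decoded into characters (UTF-8 here)
        // Buffered layer: characters fetched in large chunks for performance
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream("input.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}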
UPDATE
What is Buffered Stream?
Here I'm quoting from Buffered Streams, Java documentation (With a technical explanation):
Buffered Streams
Most of the examples we've seen so far use unbuffered I/O. This means
each read or write request is handled directly by the underlying OS.
This can make a program much less efficient, since each such request
often triggers disk access, network activity, or some other operation
that is relatively expensive.
To reduce this kind of overhead, the Java platform implements buffered
I/O streams. Buffered input streams read data from a memory area known
as a buffer; the native input API is called only when the buffer is
empty. Similarly, buffered output streams write data to a buffer, and
the native output API is called only when the buffer is full.
Sometimes I lose my hair reading technical documentation. So, here I quote the more humane explanation from https://yfain.github.io/Java4Kids/:
In general, disk access is much slower than the processing performed
in memory; that’s why it’s not a good idea to access the disk a
thousand times to read a file of 1,000 bytes. To minimize the number
of times the disk is accessed, Java provides buffers, which serve as
reservoirs of data.
In reading File with FileInputStream then BufferedInputStream, the
class BufferedInputStream works as a middleman between FileInputStream
and the file itself. It reads a big chunk of bytes from a file into
memory (a buffer) in one shot, and the FileInputStream object then
reads single bytes from there, which are fast memory-to-memory
operations. BufferedOutputStream works similarly with the class
FileOutputStream.
The main idea here is to minimize disk access. Buffered streams are
not changing the type of the original streams — they just make reading
more efficient. A program performs stream chaining (or stream piping)
to connect streams, just as pipes are connected in plumbing.
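For example, the chaining described above might look like this in code (the file name is just a placeholder):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChainingDemo {
    public static void main(String[] args) throws IOException {
        // BufferedInputStream sits between the program and FileInputStream:
        // it reads a big chunk of the file at once, and each read() below is
        // then served from that in-memory buffer.
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
            long total = 0;
            int b;
            while ((b = in.read()) != -1) {
                total++;   // process one byte at a time; the disk is only hit per buffer refill
            }
            System.out.println(total + " bytes read");
        }
    }
}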
InputStream, OutputStream, byte[], ByteBuffer are for binary data.
Reader, Writer, String, char are for text, internally Unicode, so that all scripts in the world may be combined (say Greek and Arabic).
InputStreamReader and OutputStreamWriter form a bridge between the two worlds. If you have some InputStream and know that its bytes are actually text in some encoding (Charset), then you can wrap the InputStream:
try (InputStreamReader reader =
new InputStreamReader(stream, StandardCharsets.UTF_8)) {
... read text ...
}
There is a constructor without Charset, but that is not portable, as it uses the default platform encoding.
On Android StandardCharsets may not exist; use "UTF-8" instead.
The derived classes FileInputStream and BufferedReader add something to their parent classes, InputStream and Reader respectively.
A FileInputStream is for input from a File, and BufferedReader uses a memory buffer, so the actual physical reading does not happen character by character (which would be inefficient). With new BufferedReader(otherReader) you add buffering to your original reader.
With all that understood, there is the utility class Files with methods like newBufferedReader(Path, Charset), which adds extra brevity.
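For example (the path is just an illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FilesDemo {
    public static void main(String[] args) throws IOException {
        // One call gives you a buffered, charset-aware reader.
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get("notes.txt"), StandardCharsets.UTF_8)) {
            reader.lines().forEach(System.out::println);
        }
    }
}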
I have read lots of articles on this very topic. I hope this might help you in some way.
Basically, the BufferedReader maintains an internal buffer.
During its read operation, it reads bytes from the file in bulk and stores those bytes in its internal buffer.
Bytes are then passed to the program from that internal buffer for each read operation.
This reduces the number of communications between the program and the file or disk, and hence is more efficient.
I am using java.io.PrintWriter to write some text to a text file.
I was wondering if it is better to build up everything I need to write in a variable and write it only once:
PrintWriter out = new PrintWriter(outputfile);
out.printf("%s", myvariablewithalltext);
or if I can call PrintWriter n times to write blocks of text in a for loop.
It works either way and there is not much more code either way; I was just wondering which is better.
In most cases it's better to write to the stream as you go. The main reason is that your variable might take too much memory, whereas the stream will automatically flush its content. Writing text into the variable is essentially manual buffering, and a better way to do that is to use an appropriate buffering stream/writer. In your case you can just use java.io.BufferedWriter, like so:
BufferedWriter out = new BufferedWriter(new PrintWriter("file.txt"));
or, if you prefer PrintWriter interface, you can do this
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("file.txt")));
Assuming you are open to other suggestions (not just the two you mentioned in the question).
If all you want is a clean way of writing text to a file, which of course has multiple solutions, here are a few ways:
Using PrintWriter.
example:
String contentToWrite = "This is some random Text";
PrintWriter writerToFile = new PrintWriter("TheOutputFile.txt");
writerToFile.print(contentToWrite);
writerToFile.close();
Using FileOutputStream
example:
String contentToWrite = "This is some random Text";
FileOutputStream fileOPS = new FileOutputStream("TheOutputFile.txt");
fileOPS.write(contentToWrite.getBytes());
fileOPS.close();
Using Files from java.nio (see the sketch after this list)
Using FileWriter along with BufferedWriter
Using FileUtils by apache.commons.io
Using Files by guava
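As a quick illustration of the java.nio.file.Files option mentioned above (the file name and content are placeholders):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FilesWriteDemo {
    public static void main(String[] args) throws IOException {
        String contentToWrite = "This is some random Text";
        // Writes the whole string to the file in one call, creating the file if needed.
        Files.write(Paths.get("TheOutputFile.txt"),
                contentToWrite.getBytes(StandardCharsets.UTF_8));
    }
}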
Some approaches here just take the content (no parsing or conversion required, i.e. it is already in string format) and write it to a file. [ no parsing/conversion -> less code -> cleaner code ]. ;)
Some do not require you to nest objects. [ fewer objects -> less code -> cleaner code ]. ;)
Of course the choice depends on your implementation, but I hope this helps you decide what would best suit your requirement.
Note: every class name I mentioned is a link to its reference document.
It is the latter. There is no good reason whatsoever to put the entire content into a variable just to write it to a file.
If you have some additional use for that variable beyond writing to file, that might change things a little bit, but even then, there is, probably, a better way.
I think it depends on your content length.
If you have just a little text, it's better to keep it all in memory and write it in one shot.
But if your content is very large, or if some parts take a long time to compute, you should probably write it piece by piece to avoid keeping huge amounts of data in memory.
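A minimal sketch of the piece-by-piece approach (computeNextChunk is a made-up stand-in for whatever produces your content):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class PieceByPieceDemo {
    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")))) {
            for (int i = 0; i < 1000; i++) {
                String chunk = computeNextChunk(i);   // hypothetical expensive computation
                out.println(chunk);                   // written as it is produced, not held in memory
            }
        }
    }

    private static String computeNextChunk(int i) {
        return "chunk " + i;
    }
}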
There are too many java.io classes; for some of them I really don't understand when we would need them, for example:
ByteArrayInputStream, ByteArrayOutputStream
SequenceInputStream,
PushbackInputStream, PushbackReader
StringReader...
I mean some real-life usages.
Can someone please explain?
I would say that your question is too wide.
However, it is possible to give a very basic overview of the java.io package. It contains interfaces and classes for data input and output operations, such as reading bytes from a file. There are only a few basic interfaces/classes:
DataInput / ObjectInput - reading Java primitives and objects
DataOutput / ObjectOutput - writing Java primitives and objects
InputStream - reading individual bytes
OutputStream - writing individual bytes
Reader - reading character data
Writer - writing character data
There are other useful interfaces (like Closeable), but these are less significant.
It is best if you read the JavaDoc of these classes. Some examples:
It is pretty obvious that you would use FileOutputStream to write something into a file.
Character data is represented by bytes (defined by character encoding), so you can wrap any output stream using OutputStreamWriter.
You have byte[] and want to read from it just like from InputStream? Use ByteArrayInputStream.
You want to be able to push data you have read back to the reader (usually only a single pass through is supported)? Wrap your reader with PushbackReader.
You have some String and want to read from it just like from Reader? Use StringReader.
...
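A few minimal sketches of the less obvious ones (all inputs are made up for illustration):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class SpecialStreamsDemo {
    public static void main(String[] args) throws IOException {
        // ByteArrayInputStream: treat an in-memory byte[] as an InputStream
        byte[] data = {72, 105};
        try (ByteArrayInputStream bais = new ByteArrayInputStream(data)) {
            System.out.println((char) bais.read());   // 'H'
        }

        // StringReader: treat a String as a Reader
        try (StringReader sr = new StringReader("hello")) {
            System.out.println((char) sr.read());     // 'h'
        }

        // PushbackReader: peek at a character, then push it back to read it again
        try (PushbackReader pr = new PushbackReader(new StringReader("abc"))) {
            int first = pr.read();                    // 'a'
            pr.unread(first);                         // put it back
            System.out.println((char) pr.read());     // 'a' again
        }
    }
}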
So if you need some specific stream/reader/writer, check java.io package, search the internet and ask a question on SO if needed.
Of course then there is java.nio package, which you should know about. But that is for a different topic.
I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
    int bytesAvail = stream.available();
    byte[] bytesRead = new byte[bytesAvail];
    stream.read(bytesRead);
    stream.close();
    String fileContents = new String(bytesRead);
    //more code here working with fileContents
}
My problem is that for large files (>2Gb), this code causes the program to either run extremely slowly or truncate the data, depending on the computer the program is executed on. Does anyone have a recommendation regarding how to deal with large files in this situation?
You're assuming that available() returns the size of the file; it does not. It returns the number of bytes available to be read, and that may be any number less than or equal to the size of the file.
Unfortunately there's no way to do what you want in just one shot without having some other source of information on the length of the file data (i.e., by calling java.io.File.length()). Instead, you have to possibly accumulate from multiple reads. One way is by using ByteArrayOutputStream. Read into a fixed, finite-size array, then write the data you read into a ByteArrayOutputStream. At the end, pull the byte array out. You'll need to use the three-argument forms of read() and write() and look at the return value of read() so you know exactly how many bytes were read into the buffer on each call.
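A minimal sketch of that accumulation pattern (the charset is an assumption; use whatever encoding the file actually contains):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReadAllDemo {
    static String readAll(BufferedInputStream stream) throws IOException {
        ByteArrayOutputStream accumulated = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int bytesRead;
        // Keep reading until the stream reports end of file (-1),
        // copying exactly as many bytes as each read() returned.
        while ((bytesRead = stream.read(chunk, 0, chunk.length)) != -1) {
            accumulated.write(chunk, 0, bytesRead);
        }
        return new String(accumulated.toByteArray(), StandardCharsets.UTF_8);
    }
}

Note that for files larger than roughly 2 GB this still won't work, since a single Java array or String cannot hold that much data; at that size the content has to be processed in chunks rather than accumulated.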
I'm not sure why you don't think you can read it line by line. BufferedInputStream only describes how the underlying stream is accessed; it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.