I am working on a project in which I have to handle many files. The problem comes when I have to process files laid out in different ways, for example:
A file will contain one string in each line.
A file will contain a number of characters in each line, e.g.:
1st line: A B 4
2nd line: 6 C A 6 & U #
etc.
A file will contain a number of strings in each line, e.g.:
1st line: Lion Panther jaguar etc.
I have read about how to handle files efficiently, but I am confused about when to use buffered streams and when to use unbuffered ones. And if I do use buffering, should it be BufferedInputStream or BufferedReader/BufferedWriter?
Similarly, I am confused by I/O streams, file I/O streams, and byte-array I/O streams. There are so many options. Can anyone suggest when to use which one, and why? What would count as efficient handling in the different scenarios?
Well, there might not be a single direct answer, but you don't have to worry if you feel confused. The buffered-versus-unbuffered question has been discussed many times before.
For example, this link: buffered vs non-buffered gives a good hint (check the accepted answer). With buffered streams, data read from the source is first placed in a small area of memory called (unsurprisingly) a buffer. The same happens with written data: it goes into the buffer before being flushed to disk. This improves performance because it lowers the overhead of I/O operations (which are OS dependent). Check the Java tutorial: Buffered Streams.
So, to make it clear: use buffered streams when you need to improve the performance of your I/O operations. Use unbuffered streams when you want to be sure the output has actually been written before continuing, since data sitting in a buffer can be lost if a failure occurs before it is flushed; a typical example is a critical log that must reflect every write immediately.
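To make it concrete, here's a minimal sketch (the file name data.txt is just a placeholder): the unbuffered FileReader can hit the OS for every single read, while the BufferedReader serves most reads from an in-memory buffer.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferedVsUnbuffered {
    public static void main(String[] args) throws IOException {
        // Unbuffered: each read() can trigger a separate system call.
        try (FileReader unbuffered = new FileReader("data.txt")) {
            int c;
            while ((c = unbuffered.read()) != -1) {
                // process one character at a time
            }
        }

        // Buffered: reads are served from an in-memory buffer,
        // so far fewer calls actually reach the OS.
        try (BufferedReader buffered = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            while ((line = buffered.readLine()) != null) {
                // process one line at a time
            }
        }
    }
}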
Related
I'm looking for a good explanation of the difference between the "new" Streams in Java 8 and the "old" I/O Streams we had before in Java 7. For someone without any knowledge of functional programming, it's hard to grasp that these are completely different things, especially because the names are the same. I get that the Stream API is something completely new and even revolutionary in some ways, but in my naive thinking, in both cases we deal with sequences of "things", be they bytes, data or objects...
Can someone please offer a good explanation?
The two have nothing to do with each other, and I agree it's bad luck that I/O streams had their name before the "new" Streams arrived. I/O streams were meant as connections to external resources, mostly files, but others too. The new Streams are for functional programming and should be treated separately.
You can, however, use both concepts together. For example, a BufferedReader has a lines() method, which returns the lines of a file (or other resource) as a Stream of Strings.
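A small sketch of that bridge (data.txt is a placeholder): the BufferedReader is the I/O side, and lines() hands you a java.util.stream.Stream to process functionally.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LinesBridge {
    public static void main(String[] args) throws IOException {
        // An I/O reader feeding a Java 8 Stream.
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.txt"))) {
            long nonEmpty = reader.lines()              // Stream<String> over the file's lines
                                  .filter(line -> !line.isEmpty())
                                  .count();             // terminal operation
            System.out.println(nonEmpty + " non-empty lines");
        }
    }
}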
Let's take a look at the conceptual picture of an I/O stream.
There are three concepts involved in an I/O stream: a source, a destination, and elements, where
a source or destination can be a file, a network connection, a pipe, a memory buffer, etc.
an element is simply a piece of data, and a stream consists of a sequence of elements.
When to use what?
I/O streams are for reading content from a source, or writing the content to a destination. That's it, simple :-)
The new Stream concept introduced in Java 8 has nothing to do with I/O streams. These Streams are not themselves data structures, but classes that allow you to manipulate a collection of data in a declarative way (functional-style operations).
In terms of the word 'stream' there is no difference. A stream is an abstract term meaning something that has a source and a destination, and it represents a sequence of data.
In terms of the two mechanisms there are a lot of differences. For example, Java I/O streams only let you read and write data; if you want to process the data from such a stream, there is no built-in mechanism for that. Java 8 Streams offer additional processing operations such as map and filter.
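To illustrate the difference (the data here is made up), this is the kind of declarative processing the Java 8 Stream API adds, which plain I/O streams have no vocabulary for:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFilterDemo {
    public static void main(String[] args) {
        List<String> animals = Arrays.asList("Lion", "Panther", "jaguar");

        // Declarative processing: filter and map are built into the Stream API.
        List<String> upper = animals.stream()
                                    .filter(a -> a.length() > 4)  // keep longer names
                                    .map(String::toUpperCase)     // transform each element
                                    .collect(Collectors.toList());

        System.out.println(upper);   // [PANTHER, JAGUAR]
    }
}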
I am having a hard time visualizing what exactly a stream means in terms of I/O. I imagine a stream as a continuous flow of data coming from a file, socket or any other data source. Is that correct? But then I get confused about how our Java programs react to a stream, because when we write any Java code, let's say:
Customer getCustomer(Customer customer)
Doesn't the above Java code expect the whole object to be present before it gets processed?
Now let's say we are reading from a stream, something like:
FileInputStream in = new FileInputStream("abc.txt");
in.read();
Doesn't in.read() expect the whole file to be present in memory to be processed? If it does, then how come it is a stream? Why do we call them streams? Do they process data as it is read?
I have a similar confusion when reading about Hadoop streams; they look like they have a different meaning altogether.
The word "stream" is used for different things in different contexts. But you're specifically asking about streams in I/O, i.e. InputStream and OutputStream.
I imagine a stream as a continuous flow of data coming from a file, socket or any other data source. Is that correct?
Yes, a stream is a source of a sequence of bytes, which may come from a file, socket etc.
About getCustomer: You need to have a Customer object to pass to that method. But calling methods, passing objects and getting objects returned really does not have anything to do with streams.
Doesn't in.read() expect the whole file to be present in memory to be processed?
No. FileInputStream is an object which represents the stream. It's the thing that knows how to read bytes from the file.
Streams are not a fundamentally special kind of object. It's not as if there were classes, objects, and streams as three separate things; streams are just a concept that is implemented using the standard Java OO programming features (classes and objects).
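To see that incremental behaviour, here is a minimal sketch built around the read() call from the question (abc.txt as in the question):

import java.io.FileInputStream;
import java.io.IOException;

public class ReadLoop {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("abc.txt")) {
            int b;
            // Each read() fetches the next byte on demand; the file is
            // consumed as a flow of bytes, never loaded whole into memory.
            while ((b = in.read()) != -1) {
                // process byte b (0-255) here
            }
        }
    }
}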
Doesn't the above Java code expect the whole object to be present before it gets processed?
Yes. But it's not a stream.
Doesn't in.read() expect the whole file to be present in memory to be processed?
No.
If it does
It doesn't.
then how come it is a stream?
It is.
Why do we call them streams? Do they process data as it is read?
Yes.
I have a similar confusion
There is no confusion here, except your own, which comes from comparing method calls with I/O streams; that is comparing apples to oranges.
when reading about Hadoop streams; they look like they have a different meaning altogether.
Very possibly.
An I/O Stream represents an input source or an output destination. A stream can represent many different kinds of sources and destinations, including disk files, devices, other programs, and memory arrays.
Streams support many different kinds of data, including simple bytes, primitive data types, localized characters, and objects. Some streams simply pass on data; others manipulate and transform the data in useful ways.
No matter how they work internally, all streams present the same simple model to programs that use them: A stream is a sequence of data.
In Java there are two kinds of streams (byte and character); they differ in the way the data is transferred between source and destination.
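A small sketch of that difference (file names are placeholders): byte streams transfer raw bytes, while character streams decode those bytes into characters using a charset.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ByteVsCharacter {
    public static void main(String[] args) throws IOException {
        // Byte stream: transfers raw 8-bit bytes.
        try (InputStream bytes = new FileInputStream("data.bin")) {
            int b = bytes.read();   // one byte
        }
        // Character stream: decodes bytes into characters via a charset.
        try (Reader chars = new InputStreamReader(
                new FileInputStream("data.txt"), StandardCharsets.UTF_8)) {
            int c = chars.read();   // one decoded character
        }
    }
}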
Hope this answers your question. Please let me know if you need any further information.
From here.
A Stream is a freely flowing sequence of elements. Streams do not hold any storage; that responsibility lies with collections such as arrays, lists and sets. Every stream starts with a source of data, sets up a pipeline, processes the elements through that pipeline and finishes with a terminal operation. Streams allow us to parallelize heavy operations without having to write any parallel code. A new package, java.util.stream, was introduced in Java 8 for this feature.
Streams follow the common Pipes and Filters software pattern: a pipeline is created with data flowing through it, and various intermediate operations are applied to individual elements as they move through the pipeline. The stream is terminated when the pipeline is ended with a terminal operation. Keep in mind that the stream's source is expected not to be modified while the pipeline runs; doing so may raise a ConcurrentModificationException.
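Here is a minimal sketch of such a pipeline (the numbers are arbitrary), showing the source, the intermediate operations, and the terminal operation:

import java.util.Arrays;
import java.util.List;

public class PipelineDemo {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(3, 1, 4, 1, 5, 9, 2, 6);

        int sumOfEvenSquares = numbers.stream()   // source
                .filter(n -> n % 2 == 0)          // intermediate: keep even numbers
                .map(n -> n * n)                  // intermediate: square them
                .mapToInt(Integer::intValue)
                .sum();                           // terminal: the pipeline ends here

        System.out.println(sumOfEvenSquares);     // 16 + 4 + 36 = 56
    }
}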
So if there is a java.io.StringBufferInputStream, you would think that there would be a StringBufferOutputStream.
Any ideas as to why there isn't?
Likewise, there is also a SequenceInputStream but no SequenceOutputStream.
My guess is that someone never got around to making a StringBufferOutputStream in Java 1.0, since the product was somewhat "rushed to market." By the time Java 1.1 rolled around, people understood that readers and writers were for characters while input streams and output streams were for bytes, and the whole concept of using byte streams for strings was recognized as wrong. So StringBufferInputStream was rightly deprecated, with no chance of a partner ever coming along.
A SequenceInputStream is a nice way to read from a bunch of streams concatenated together, but it doesn't make much sense to write a single stream out to multiple streams. Well, you could make sense of it if you wanted to write one large stream into multiple partitions (which reminds me of Hadoop), but that's not common enough to belong in a standard library. A complication is that you would need to specify the size of each output partition, and it would really only make sense for files (which could be given names with increasing suffixes), so it would not generalize to arbitrary output streams in a nice manner.
StringBufferInputStream is deprecated, because bytes and characters are not the same thing. The correct classes to use for this are StringReader and StringWriter.
If you think about it, there is no way to make a SequenceOutputStream work. SequenceInputStream reads from the first stream until it is exhausted, then reads from the next stream. Since an OutputStream is never exhausted (unless, say, it happens to be connected to a socket whose peer closes the connection), how would a SequenceOutputStream class know when to move on to the next stream?
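For reference, a small sketch of that exhaustion-driven switching on the input side, which is exactly the signal an output-side equivalent would lack:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;

public class SequenceDemo {
    public static void main(String[] args) throws IOException {
        InputStream first  = new ByteArrayInputStream("Hello, ".getBytes());
        InputStream second = new ByteArrayInputStream("world!".getBytes());

        // Reads first until it is exhausted, then transparently switches to second.
        try (InputStream seq = new SequenceInputStream(first, second)) {
            int b;
            while ((b = seq.read()) != -1) {
                System.out.print((char) b);   // prints "Hello, world!"
            }
        }
    }
}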
StringBufferInputStream has long been deprecated.
Use StringReader and StringWriter.
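A minimal sketch of the character-based replacements (the sample text is arbitrary):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

public class StringReaderWriterDemo {
    public static void main(String[] args) throws IOException {
        // Read characters from an in-memory String...
        try (BufferedReader reader = new BufferedReader(new StringReader("Lion\nPanther\njaguar"))) {
            // ...and write characters to an in-memory buffer.
            StringWriter writer = new StringWriter();
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.toUpperCase());
                writer.write('\n');
            }
            System.out.print(writer.toString());   // LION, PANTHER, JAGUAR, one per line
        }
    }
}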
There is a region in a file (possibly small) that I want to overwrite.
Assume I call fseek, fwrite, fsync. Is there any way to ensure the atomicity of such a region-rewriting operation? I need to be sure that, in any case of failure, the region will contain either only the old (pre-modification) data or only the new (modified) data, but never a mix of the two.
There are two things I want to highlight.
First: it's OK if there is no way to atomically write a region of ANY size; we can handle that by appending the data to the file, fsync'ing, then rewriting a 'pointer' area in the file, and fsync'ing again. However, if the 'pointer' write is not atomic, we can still end up with a corrupted file containing illegal pointers.
Second: I am pretty sure writing a 1-byte region is atomic: I will never see bytes in the file that I did not put there. So we can use a trick: allocate two slots for addresses plus a 1-byte switch. Rewriting a region then becomes: append the new data, sync, rewrite the unused one of the two pointer slots, sync again, then rewrite the 'switch byte', and sync once more. The overwrite operation now involves at least three fsync invocations.
All of this would be much easier if I had atomic writes for longs, but do I really have that?
Is there any way to handle this situation without using the method mentioned in the second point?
Another question: is there any ordering guarantee between writing and syncing?
For example, if I call fseek, fwrite [1], fseek, fwrite [2], fsync, can the write at [2] be committed while the write at [1] is not?
This question applies to both Linux and Windows; answers for a particular system (e.g. Ubuntu version a.b.c) are also welcome.
It's usually safe to assume that a 512-byte chunk is written in a single operation by the HDD.
However, I would not rely on that. Instead, I'd go with your second solution, adding a checksum to each write and verifying it before changing the pointer in the file.
Generally, it's good practice to add a checksum to everything written to disk.
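A minimal sketch of that framing idea in Java (the question is phrased in C terms, but the layout translates directly; the [length][payload][crc32] record format here is purely illustrative):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class ChecksummedRecord {
    // Frame a payload as [length][payload][crc32] so a torn write is detectable:
    // a reader recomputes the CRC and only trusts the pointer if it matches.
    static byte[] frame(byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(payload.length);   // record length
        out.write(payload);             // record body
        out.writeLong(crc.getValue());  // checksum over the body
        out.flush();
        return bytes.toByteArray();
    }
}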
To answer the question about the 'sync' guarantee: you can assume it holds. Sync behaviour is file-system and disk dependent, but let's say we are talking about a 'reasonable' implementation.
After the first sync, the data is guaranteed to have been flushed to the disk (the disk itself might still hold it in its cache), and if you read it back you are expected to get whatever you wrote.
If, after the second sync, the data from both syncs is still sitting in the disk cache, the situation you described could happen, but IMHO the probability of that is very low.
Anyway, there is no other mechanism that will promise you the data is on disk. That's why you must have checksums.
Some more info: Ensure fsync did its job
I have a 2 GB file containing student records. I need to find students matching certain attributes in each record and create a new file with the results. The order of the filtered students must be the same as in the original file. What is the most efficient and fastest way of doing this using the Java I/O API and threads, without running into memory issues? The JVM max heap size is set to 512 MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: read the file line by line, parse each line, check your filter criterion; if it matches, output a result line, then go on to the next line until the file is done. This is very memory efficient, as you only have the current line (plus a buffer slightly larger) loaded at any one time, and the process needs to read through the whole file just once.
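A minimal sketch of that approach (matches() is a placeholder for whatever your attribute check actually is):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StudentFilter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("students.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("filtered.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {   // only one line in memory at a time
                if (matches(line)) {                   // your filter criterion goes here
                    out.write(line);
                    out.newLine();                     // original order is preserved
                }
            }
        }
    }

    // Placeholder predicate; replace with the actual attribute check.
    static boolean matches(String record) {
        return record.contains("2014");   // illustrative only
    }
}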
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open an input stream on the 2GB file, remembering to buffer (e.g. by wrapping it in a BufferedInputStream)
open an output stream to the filtered file you're going to create
read the first record from the input stream, look at whatever attribute you need to decide whether you "want" it; if you do, write it to the output file
repeat for the remaining records
On one of my test systems with extremely modest hardware, a BufferedInputStream around a FileInputStream read about 500 MB in 25 seconds out of the box, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine that with state-of-the-art hardware the time could well be halved.
Whether you need to go to a lot of effort to shave off those 2-3 minutes, or just go for a wee while you're waiting for the run to finish, is a decision you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do many different processing runs on the same set of data (and there are other solutions to that which don't automatically mean a database).
A 2 GB file is huge; you SHOULD go for a DB.
If you really want to use the Java I/O API, then try this: Handling large data files efficiently with Java, and this: Tuning Java I/O Performance.
I think you should use memory-mapped files. They let you map a large file into a smaller region of memory; the mapping acts like virtual memory, and as far as performance is concerned, mapped files are faster than stream reads/writes.
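A minimal sketch of mapping a region of a file (the file name and the 64 MB window size are arbitrary):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("students.txt", "r");
             FileChannel channel = file.getChannel()) {
            // Map the first 64 MB (or the whole file if it is smaller) into memory.
            long size = Math.min(channel.size(), 64L * 1024 * 1024);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            while (buffer.hasRemaining()) {
                byte b = buffer.get();   // the OS pages data in on demand
                // process byte b
            }
        }
    }
}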