I am new to random file access, and I have run into one issue. As far as I understand, the RandomAccessFile class provides random access to a file for reading and writing. I can use the seek() method to move to the position I want and start reading or writing (which of the two does not matter here). Is that what makes it random access? FileInputStream seems to give me the same ability:
read(byte[] b, int off, int len)
This method lets me read from a particular place. So what is the difference? (My guess is that InputStream reads the whole file and simply passes over everything before the off position, but that is only a guess.)
Looking at the documentation of the read method:
https://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html#read(byte[],%20int,%20int)
it states that off is "the start offset in the destination array b". So with this call you can read the next len bytes from the stream and put them in a certain place in your memory buffer. It does not let you skip forward the way the seek method of a random access file does.
The read method you mention does not let you read from any particular place. It always reads from the "next" position in the stream, where it left off, and it puts the read bytes into the byte array at position off. off is the offset in the output, not the input.
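To make the difference concrete, here is a minimal sketch. The class name OffsetVsSeek and the temp file holding the bytes 0 to 9 are made up for illustration; only the FileInputStream and RandomAccessFile calls come from the JDK.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

final class OffsetVsSeek {
    /** Creates a temp file holding the bytes 0, 1, ..., 9 (illustrative data). */
    static Path sampleFile() throws IOException {
        Path p = Files.createTempFile("offset-vs-seek", ".bin");
        p.toFile().deleteOnExit();
        Files.write(p, new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        return p;
    }

    /** off selects where in buf the bytes land; the file is still read from its start. */
    static byte[] readWithOffset(Path p) throws IOException {
        byte[] buf = new byte[8];
        try (FileInputStream in = new FileInputStream(p.toFile())) {
            in.read(buf, 2, 3); // the first file bytes are stored starting at buf[2]
        }
        return buf;
    }

    /** seek moves the file pointer, so the next read starts at that file position. */
    static int readAfterSeek(Path p, long filePos) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "r")) {
            raf.seek(filePos);
            return raf.read();
        }
    }
}
```

In readWithOffset the first bytes of the file end up in the middle of buf, while readAfterSeek(p, 5) returns the byte stored at file position 5.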
To start with, I understand buffering as a wrapper around, for instance, a FileInputStream that acts as a temporary container for content read (let's take the read scenario) from the underlying stream, in this case FileInputStream.
Say there are 100 bytes to read from a stream (with a file as the source).
Without buffering, the code has to make 100 reads (one byte at a time).
With buffering, depending on the buffer size, the code makes <= 100 reads.
Let's assume a buffer size of 50.
So the code reads the buffer (as a source) only twice to read the contents of the file.
Now, since the FileInputStream is the ultimate source of the data (the file containing 100 bytes), even though it is wrapped by a BufferedInputStream, wouldn't it have to read 100 times to read 100 bytes? The code calls the read method of BufferedInputStream, but the call is passed on to the read method of FileInputStream, which needs to make 100 read calls. This is the point I am unable to comprehend.
In other words, even when wrapped by a BufferedInputStream, the underlying stream (such as FileInputStream) still has to read one byte at a time. So where is the benefit of buffering, not for the code (which needs only two read calls to the buffer) but for the application's performance?
Thanks.
EDIT:
I'm making this a follow-up edit rather than a comment, as I think it suits the context better here and serves as a TL;DR for readers of the chat between @Kayaman and me.
The read method of BufferedInputStream says(excerpt):
As an additional convenience, it
attempts to read as many bytes as possible by repeatedly invoking the
read method of the underlying stream. This iterated read continues
until one of the following conditions becomes true:
The specified number of bytes have been read,
The read method of the underlying stream returns -1, indicating end-of-file, or
The available method of the underlying stream returns zero, indicating that further input requests would block.
I dug into the code and found the method call trace as follows:
BufferedInputStream -> read(byte b[]) (as I want to see buffering in action)
BufferedInputStream -> read(byte b[], int off, int len)
BufferedInputStream -> read1(byte[] b, int off, int len) - private
FileInputStream -> read(byte b[], int off, int len)
FileInputStream -> readBytes(byte b[], int off, int len) - private and native. Method description from source code -
Reads a subarray as a sequence of bytes.
The call to read1 (#4 above) in BufferedInputStream sits in an infinite for loop. It returns on the conditions mentioned in the excerpt of the read method description above.
As I mentioned in the original post (#6), the call does seem to be handled by the underlying stream, which matches both the API method description and the method call trace.
The question remains: does the native API call, readBytes of FileInputStream, read one byte at a time and build an array of those bytes to return?
The underlying streams(such as FileInputStream) still have to read
one byte at a time
Luckily no, that would be hugely inefficient. It allows the BufferedInputStream to make read(byte[8192] buffer) calls to the FileInputStream which will return a chunk of data.
If you then want to read a single byte (or not), it will efficiently be returned from BufferedInputStream's internal buffer instead of having to go down to the file level. So the BufferedInputStream is there to reduce the number of times we do actual reads from the filesystem, and when those are done, they're done in an efficient fashion even if the end user wanted to read just a few bytes.
It's quite clear from the code that BufferedInputStream.read() does not delegate directly to UnderlyingStream.read(), as that would bypass all the buffering.
public synchronized int read() throws IOException {
    if (pos >= count) {
        fill();
        if (pos >= count)
            return -1;
    }
    return getBufIfOpen()[pos++] & 0xff;
}
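One way to see this for yourself is to count how often the underlying stream is actually asked for data. The sketch below is only an illustration: CountingInputStream is a hypothetical wrapper written for this example, and a ByteArrayInputStream stands in for the file. It reads 100 bytes one at a time through a BufferedInputStream with a 50-byte buffer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class CountingDemo {
    /** Hypothetical wrapper that counts how often the underlying stream is asked for data. */
    static final class CountingInputStream extends FilterInputStream {
        int singleByteCalls;
        int bulkCalls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            singleByteCalls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            bulkCalls++;
            return in.read(b, off, len);
        }
    }

    /** Reads 100 bytes one at a time through a 50-byte buffer and returns the counters. */
    static CountingInputStream readHundredBytes() throws IOException {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[100]));
        try (InputStream buffered = new BufferedInputStream(source, 50)) {
            for (int i = 0; i < 100; i++) {
                buffered.read(); // served from the internal buffer
            }
        }
        return source;
    }
}
```

After readHundredBytes(), singleByteCalls stays at zero and bulkCalls is a small number (I would expect two on OpenJDK, one per refill of the 50-byte buffer), rather than one call per byte.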
That is, if I do:
channel.position(0);
channel.read(buffer); // will read in 1st byte of file and so on
vs
channel.position(1);
channel.read(buffer); // will read in 2nd byte of file and so on
Are my assumptions correct? Reading the documentation doesn't really say anything about that so I wanted to make sure
Is FileChannel position(long newPosition) 0-indexed?
Yes.
Reading the documentation doesn't really say anything about that so I wanted to make sure
It is clear to me. The javadoc for position() says:
"Returns: This channel's file position, a non-negative integer counting the number of bytes from the beginning of the file to the current position".
"[A] non-negative integer" means zero or greater. If they had meant one or greater, they would have written "a positive integer" or "a strictly positive integer".
The method is 0-indexed.
Also when you call the read method, then the file position is updated with the number of bytes actually read. The channel’s position() method returns the current position.
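If you want to verify this rather than take the javadoc's word for it, a small sketch like the following works. The class name ChannelPositionDemo and the three-byte sample file are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelPositionDemo {
    /** Creates a temp file containing the three bytes 'A', 'B', 'C' (illustrative data). */
    static Path sampleFile() throws IOException {
        Path p = Files.createTempFile("channel-pos", ".bin");
        p.toFile().deleteOnExit();
        Files.write(p, new byte[] {'A', 'B', 'C'});
        return p;
    }

    /** Returns the byte found at the given zero-based file position. */
    static byte byteAt(Path file, long position) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            channel.position(position);
            ByteBuffer one = ByteBuffer.allocate(1);
            if (channel.read(one) != 1) {
                throw new IOException("no byte at position " + position);
            }
            // read() advanced the channel position by the number of bytes read
            if (channel.position() != position + 1) {
                throw new IllegalStateException("position was not advanced");
            }
            return one.get(0);
        }
    }
}
```

With that file, byteAt(p, 0) gives 'A' and byteAt(p, 1) gives 'B', which is what a 0-indexed position means.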
Suppose the input stream's buffer reads 1000 bytes each time. At the beginning of the buffer, before the actual video content, there are some start markers and the video name, say 100 bytes. I don't want to write them into the result buffer. So the first time I want to write bytes 101-999 to the buffer, and the second time bytes 1000-1999. Currently it writes 0-999 again, and the resulting video has 900 extra bytes before the actual video content.
Is there any way to write the buffer while skipping the first buffer's length? Thanks!
I use this code for skipping bytes from a ByteBuffer:
import java.nio.ByteBuffer;

public class Utility {
    public static void skip(ByteBuffer bb, int skip) {
        bb.position(bb.position() + skip);
    }
}
Sophia, you really do need to include example code so people can help, but I see from your tags you are likely asking about NIO's ByteBuffer.
What you want to do is skip the content you don't want by way of the ByteBuffer.position(int) method. There is no magic in the ByteBuffer implementation: it is a backing data store (either a byte[] or a direct memory reference to the OS) plus a few int indices that mark conceptual positions in the buffer (mark, position, limit, capacity). You just need to "skip" the bytes you don't want by moving the position beyond them, so that the next operation that writes out the buffer starts from position and goes to limit.
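Since no code was posted, here is only a rough sketch of that idea. The class HeaderSkipper, the method name, and the assumption that the whole header sits inside the first chunk are all mine, not taken from the question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class HeaderSkipper {
    /**
     * Writes all chunks to out, dropping the first headerLength bytes of the first chunk only.
     * Assumes the whole header fits inside the first chunk.
     */
    static void copySkippingHeader(Iterable<ByteBuffer> chunks, int headerLength,
                                   WritableByteChannel out) throws IOException {
        boolean first = true;
        for (ByteBuffer chunk : chunks) {
            if (first) {
                chunk.position(chunk.position() + headerLength); // skip the header once
                first = false;
            }
            while (chunk.hasRemaining()) {
                out.write(chunk); // writes from position up to limit
            }
        }
    }
}
```

Any WritableByteChannel works as the target, for example a FileChannel or Channels.newChannel(someOutputStream). Later chunks are written in full because their position is left untouched.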
I'm using a FileReader wrapped in a LineNumberReader to index a large text file for speedy access later on. Trouble is I can't seem to find a way to read a specific line number directly. BufferedReader supports the skip() function, but I need to convert the line number to a byte offset (or index the byte offset in the first place).
I took a crack at it using RandomAccessFile, and while it worked, it was horribly slow during the initial indexing. BufferedReader's speed is fantastic, but... well, you see the problem.
Some key info:
The file can be any size (currently 35,000 lines)
It's stored on Android's internal filesystem (via getFilesDir() to be exact)
The formatting is not fixed width, unfortunately (hence the need to read by line)
Any ideas?
Describes an extended RandomAccessFile with buffering semantics
Trouble is I can't seem to find a way to read a specific line number directly
Unless you know the length of each line you can't read it directly
There is no shortcut: you will need to read the entire file up front and calculate the offsets manually.
I would just use a BufferedReader and then get the length of each string and add 1 (or 2?) for the EOL string. Note that this counts characters, so it only matches byte offsets when the file uses a single-byte encoding.
Consider saving a file index along with the large text file. If this file is something you are generating, either on your server or on the device, it should be trivial to generate an index once and distribute and/or save it along with the file.
I'd recommend an int[] where each value is the absolute offset in bytes for the n*(index+1) th line. So you could have an array of size 35,000 with the start of each line, or an array of size 350, with the start of every 100th line.
Here's an example assuming you have an index file containing a raw sequence of int values:
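public String getLineByNumber(RandomAccessFile index,
                              RandomAccessFile data,
                              int lineNum) throws IOException {
    index.seek((long) lineNum * 4); // each index entry is a 4-byte int
    data.seek(index.readInt());     // jump to the start of the requested line
    return data.readLine();
}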
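For completeness, producing such an index file could look roughly like the sketch below. LineIndexWriter and writeIndex are names invented for this example, and it assumes every line ends in '\n' (which also covers "\r\n") and that the data file is smaller than 2 GB so offsets fit in an int.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineIndexWriter {
    /**
     * Writes one 4-byte int per line to indexFile: the byte offset where that line starts.
     * Returns the number of lines indexed.
     */
    static int writeIndex(Path dataFile, Path indexFile) throws IOException {
        int lines = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(dataFile));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(indexFile)))) {
            int offset = 0;             // byte offset of the byte about to be read
            boolean atLineStart = true; // true until the first byte of a line is seen
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    out.writeInt(offset); // record where this line begins
                    lines++;
                    atLineStart = false;
                }
                offset++;
                if (b == '\n') {
                    atLineStart = true;
                }
            }
        }
        return lines;
    }
}
```

The offsets are counted in bytes, so they stay correct for multi-byte encodings. Keep in mind that RandomAccessFile.readLine() turns each byte into one char, so for anything beyond single-byte text you would read the raw bytes of the line and decode them yourself.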
I took a crack at it using
RandomAccessFile, and while it worked,
it was horribly slow during the
initial indexing
You've started the hard part already. Now for the harder part.
BufferedReader's speed is fantastic,
but...
Is there something in your use of RandomAccessFile that made it slower than it has to be? How many bytes did you read at a time? If you read one byte at a time it will be sloooooow. If you read in an array of bytes at a time, you can speed things up and use the byte array as a buffer.
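As a rough illustration of reading in chunks, the sketch below scans a RandomAccessFile 8 KB at a time and collects the byte offset of every line start in memory. ChunkedLineScanner is a made-up name, the 8 KB size is an arbitrary choice, and lines are assumed to end in '\n'.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

final class ChunkedLineScanner {
    /** Returns the byte offset of the start of every line, reading the file in 8 KB chunks. */
    static List<Long> lineStarts(RandomAccessFile file) throws IOException {
        List<Long> starts = new ArrayList<>();
        byte[] chunk = new byte[8192];
        long chunkStart = 0;        // file offset of chunk[0]
        boolean atLineStart = true;
        file.seek(0);
        int n;
        while ((n = file.read(chunk)) != -1) { // one call fetches up to 8192 bytes
            for (int i = 0; i < n; i++) {
                if (atLineStart) {
                    starts.add(chunkStart + i);
                    atLineStart = false;
                }
                if (chunk[i] == '\n') {
                    atLineStart = true;
                }
            }
            chunkStart += n;
        }
        return starts;
    }
}
```

Each call to file.read(chunk) replaces thousands of single-byte reads, which is where I would expect most of the speedup to come from.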
Just wrapping up the previous comments :
Either you use RandomAccessFile to count bytes first and then parse what you read to find the lines by hand, or you use a LineNumberReader to read line by line and count the bytes of each line's characters by hand (2 bytes per char in UTF-16?).
I am writing a utility in Java that reads a stream which may contain both text and binary data. I want to avoid waiting on I/O. To do that I create a thread that keeps reading the data (and waits for it) and puts it into a buffer, so clients can check availability and stop waiting whenever they want (by closing the input stream, which generates an IOException and ends the wait). This works very well as far as reading bytes out of it is concerned, that is, for binary data.
Now I also want to make it easy for the client to read a line out of it, with something like '.hasNextLine()' and '.readLine()'. Without using a stream that waits on I/O, such as a buffered stream: (Q1) How can I check whether a binary (byte[]) contains a valid Unicode line (in the form of the length of the first line)? I looked around the String/Charset API but could not find it (or did I miss it?). (NOTE: If possible I don't want to use a non-built-in library.)
Since I could not find one, I tried to create one. Without making it too complicated, here is my algorithm.
1). I look from the start of the byte array until I find '\n' or '\r' without '\n'.
2). Then I cut the byte array from the start to that point and use it to create a string (with a Charset if specified) using 'new String(byte[])' or 'new String(byte[], Charset)'.
3). If that succeeds without an exception, we have found the first valid line and return it.
4). Otherwise, these bytes may not be a string, so I look further to the next '\n' or '\r' without '\n', and the process repeats.
5). If the search reaches the end of the available bytes, I stop and return null (no valid line found).
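For reference, the five steps above could be sketched roughly as follows. This is one reading of them, with assumptions stated up front: the class and method names are made up, step 4 is interpreted as "decode again from the start up to the next terminator", the charset is assumed to be ASCII-compatible (such as UTF-8) so that the bytes 0x0A and 0x0D can be searched for directly, and a strict CharsetDecoder is used for the validity check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class FirstLineFinder {
    /** Returns the first run of bytes that ends at a terminator and decodes cleanly, or null. */
    static String firstValidLine(byte[] data, int available, Charset charset) {
        for (int end = 0; end < available; end++) {
            boolean isLf = data[end] == '\n';
            // a '\r' counts on its own only when the next byte is known and is not '\n'
            boolean isLoneCr = data[end] == '\r'
                    && end + 1 < available && data[end + 1] != '\n';
            if (!isLf && !isLoneCr) {
                continue; // step 1: keep looking for a terminator
            }
            // step 2: cut before the terminator (and before the '\r' of a "\r\n" pair)
            int length = (isLf && end > 0 && data[end - 1] == '\r') ? end - 1 : end;
            String line = decodeStrictly(data, length, charset);
            if (line != null) {
                return line; // step 3: first candidate that decodes
            }
            // step 4: fall through and try the next terminator
        }
        return null; // step 5: nothing valid in the available bytes
    }

    /** Decodes data[0..length) and returns null instead of replacing bad input. */
    private static String decodeStrictly(byte[] data, int length, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(data, 0, length)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }
}
```

A '\r' that is the very last available byte is deliberately not treated as a terminator yet, because the next byte might still turn out to be '\n'.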
My question is (Q2): Is the algorithm above adequate?
Just when I was about to implement it, I searched on Google and found that there are many other codes for a new line, for example U+2424, U+0085, U+000C, U+2028 and U+2029.
So my last question is (Q3): Do I really need to detect these codes? If I do, will it increase the chance of a false alarm?
I am well aware that recognizing something in binary data is never absolute. I am just trying to find the best balance.
To sum up, I have an array of bytes and I want to extract the first valid string line from it, with or without a specific Charset. This must be done in Java without using any non-built-in library.
Thank you all in advance.
I am afraid your problem is not well-defined. You write that you want to extract the "first valid string line" from your data. But whether some byte sequence is a "valid string" depends on the encoding. So you must decide which encoding(s) you want to use in testing.
Sensible choices would be:
the platform default encoding (Java property "file.encoding")
UTF-8 (as it is most common)
a list of encodings you know your clients will use (such as several Russian or Chinese encodings)
What makes sense will depend on the data, there's no general answer.
Once you have your encodings, the problem of line termination should follow, as most encodings have rules on what terminates a line. In ASCII or Latin-1, LF, CR-LF and LF-CR would suffice. In Unicode, you need all the ones you listed above.
But again, there's no general answer, as new line codes are not strictly regulated. Again, it would depend on your data.
First of all, let me ask you a question: is the data you are trying to process legacy data? In other words, are you responsible for the format of the input stream you are trying to consume here?
If you do control the input format, then you probably want to take the binary vs. text decision out of the Q1 algorithm. For me this algorithm has one troubling part.
`4). Otherwise, these bytes may not be a string, so I look further to
another '\n' or '\r' w/o '\n'. and this process repeat.`
Are you discarding the input before the line terminator and taking the bytes that start immediately after it, or are you re-evaluating the string that now contains two line terminators? If the former, you may have broken the binary data interface; if the latter, you may still not parse the text correctly.
I think having well defined markers for binary data and text data in your stream will simplify your algorithm a lot.
A couple of words on the String constructor. new String(byte[], Charset) will not throw any exception if the byte array is not valid in that particular Charset; instead it replaces malformed input with the charset's replacement string, so you get a string sprinkled with replacement characters (often displayed as question marks), which is probably not what you want. If you want an exception, you should use CharsetDecoder.
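A small sketch of the two behaviours side by side, assuming UTF-8 (StrictDecodeDemo and its method names are invented for the example):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeDemo {
    /** Lenient: malformed bytes silently become the replacement character U+FFFD. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict: malformed bytes raise a CharacterCodingException instead. */
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

For the bytes {'a', (byte) 0xFF, 'b'}, lenient returns a three-character string with U+FFFD in the middle, while strict throws.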
Also note that in Java 6 there are 2 constructors that take charset
String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). I did some simple performance test a while ago, and constructor with String charsetName is magnitudes faster than the one that takes Charset object ( Question to Sun: bug, feature? ).
I would try this:
make the IO reader put strings/lines into a thread safe collection (for example some implementation of BlockingQueue)
the main code has only reference to the synced collection and checks for new data when needed, like queue.peek(). It doesn't need to know about the io thread nor the stream.
Some Java code (imports omitted; myStreamFromSomewhere stands for whatever stream you have):
class IORunner extends Thread {
    private final BufferedReader reader;
    private final BlockingQueue<String> outputQueue;

    IORunner(InputStream in, BlockingQueue<String> outputQueue) {
        this.reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        this.outputQueue = outputQueue;
    }

    @Override
    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null)
                outputQueue.put(line); // blocks if a bounded queue is full
        } catch (IOException | InterruptedException e) {
            // stream closed or thread interrupted: stop reading
        }
    }
}
class Main {
    public static void main(String[] args) throws InterruptedException {
        ...
        BlockingQueue<String> dataQueue = new LinkedBlockingQueue<>();
        new IORunner(myStreamFromSomewhere, dataQueue).start();
        while (true) {
            if (!dataQueue.isEmpty()) { // can also use .peek() != null
                System.out.println(dataQueue.take());
            }
            Thread.sleep(1000);
        }
    }
}
The collection decouples the input(stream) more from the main code. You can also limit the number of lines stored/mem used by creating the queue with a limited capacity (see blockingqueue doc).
The BufferedReader handles the checking of new lines for you :) The InputStreamReader handles the charset (recommend setting one yourself since the default one changes depending on OS etc.).
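As a variation on the same idea, here is a sketch with a bounded queue and a timed poll instead of sleep-and-check. LinePump, the EOF marker object and the method names are assumptions made for this example, not part of any library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class LinePump {
    /** Hypothetical end-of-stream marker; compare with ==, so no real line can match it. */
    static final String EOF = new String("<eof>");

    /** Starts a daemon thread that pushes each line of in onto a queue of the given capacity. */
    static BlockingQueue<String> start(InputStream in, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        Thread reader = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) {
                    queue.put(line); // blocks while the queue is full
                }
            } catch (IOException | InterruptedException e) {
                // stream closed or thread interrupted: fall through and signal the end
            } finally {
                try {
                    queue.put(EOF); // always tell the consumer that nothing more will come
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        reader.setDaemon(true);
        reader.start();
        return queue;
    }

    /** Waits at most timeoutMillis for the next line; returns null if none arrived in time. */
    static String nextLine(BlockingQueue<String> queue, long timeoutMillis)
            throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

The consumer never blocks longer than its timeout: a null result means "nothing yet", and a result that is == LinePump.EOF means the stream has ended.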
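The java.text package is designed for this sort of natural-language operation. The BreakIterator.getLineInstance() static method returns an iterator that detects line breaks; note that it works on decoded text rather than raw bytes, and it reports positions where a line may be broken (for example when wrapping), not only explicit line terminators. You do need to know the locale and encoding for best results, though.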
Q2: The method you use seems reasonable enough to work.
Q1: Can't think of something better than the algorithm that you are using
Q3: I believe it will be enough to test for \r and \n. The others are too exotic for usual text files.
I just solved this to get a test stub working for a datagram. I did byte[] varName = someString.getBytes(); then final int len = varName.length; then sent the int through a DataOutputStream followed by the byte array. On the receiving side, just do readInt() and then read that many bytes.
Not a lib, not hard to do either. Just read up on readUTF and do what they did for the bytes.
The string should construct from the byte array recovered that way; if not, you have other problems. If the string can be reconstructed, it can be buffered ... no?
You may be able to just use readUTF()/writeUTF() in DataInputStream/DataOutputStream; why not?
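The readUTF()/writeUTF() route could look like this minimal sketch (UtfRoundTrip is a made-up class name, and in-memory byte arrays stand in for the datagram payload):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class UtfRoundTrip {
    /** Encodes text as a 2-byte length prefix followed by modified UTF-8 bytes. */
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text); // fails if the encoded form exceeds 65535 bytes
        }
        return bytes.toByteArray();
    }

    /** Reads the length prefix and then exactly that many bytes back into a String. */
    static String decode(byte[] packet) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packet))) {
            return in.readUTF();
        }
    }
}
```

The length prefix is what lets the receiver know how many bytes belong to the string, which is the same trick as sending the int by hand, only limited to 65535 encoded bytes per string.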
{ edit: per OP's request }
// Sending end (sink is a placeholder for your OutputStream)
String data = "fdsfjal;sajssaafe8e88e88aa"; // fingers pounding keyboard
byte[] payload = data.getBytes("UTF-8"); // name the charset so both ends agree
DataOutputStream dataOutputStream = new DataOutputStream(sink);
dataOutputStream.writeInt(payload.length); // byte count, not character count
dataOutputStream.write(payload);
dataOutputStream.flush();
dataOutputStream.close();
// Receiving end (source is a placeholder for your InputStream)
DataInputStream dataInputStream = new DataInputStream(source);
final int sizeToRead = dataInputStream.readInt();
byte[] datasink = new byte[sizeToRead];
dataInputStream.readFully(datasink); // plain read() may return fewer bytes than requested
dataInputStream.close();
// constructor
// String(byte[] bytes, int offset, int length, String charsetName)
final String result = new String(datasink, 0, sizeToRead, "UTF-8");
// continue coding here
Do me a favor, keep the heat off of me. This was typed very fast right in the posting tool, so the code may well contain errors; it's faster for me to just explain it by writing Java. There will be others who can translate it to other languages, which you can also do if you want it in another codebase. You will need exception trapping and so on; just do a compile and start fixing errors. When you get a clean compile, start over from the beginning and look for blunders. (That's what a blunder is called in engineering: a blunder.)