Java: reading strings from a random access file with buffered input - java

I've never worked closely with the Java IO API before and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it can be to do a simple task.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.
For example, in most scripting languages it would be a very simple 1-2-3-liner like this (in Ruby, but it would be essentially the same in Python, Perl, etc.):
f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end
It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() does not return UTF-8 strings (or even byte arrays); it returns very strange strings with broken encoding. It also has no buffering, which probably means that every read*() call is translated into a single underlying OS read() and is therefore fairly slow.
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to report the number of bytes already read, let alone the current position in the file.
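For reference, the "broken" strings from RandomAccessFile.readLine() can be repaired, because that method stores each byte of the line in one char with the high bits set to zero. Re-encoding the string as ISO-8859-1 therefore gives back the original bytes, which can then be decoded as UTF-8. The following is a minimal sketch of that idea; the class and method names (RafUtf8Lines, readLines) are made up for illustration, and it does nothing about the lack of buffering.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class RafUtf8Lines {
    /** Reads the lines that start in [pos1, pos2) and decodes them as UTF-8. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            String raw;
            while (raf.getFilePointer() < pos2 && (raw = raf.readLine()) != null) {
                // readLine() stored one byte per char; ISO-8859-1 maps them back 1:1
                byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
                lines.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return lines;
    }
}
```

Each readLine() call here still reads the file byte by byte, so this addresses only the encoding half of the problem.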
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
new InputStreamReader(
fis,
CHARSET_UTF8
)
);
... and then using fc.position() to get current file reading position and fc.position(newPosition) to set one, but it doesn't seem to work in my case: looks like it returns position of a buffer pre-filling done by BufferedReader, or something like that - these counters seem to be rounded up in 16K increments.
Do I really have to implement it all by myself, i.e. a file reading interface which would:
allow me to get/set position in a file
buffer file reading operations
allow reading UTF8 strings (or at least allow operations like "read everything till the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?

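import org.apache.commons.io.input.BoundedInputStream;
FileInputStream file = new FileInputStream(filename);
file.skip(pos1); // note: skip() may skip fewer bytes than requested; check its return value
BufferedReader br = new BufferedReader(
    // decode explicitly as UTF-8 instead of relying on the platform default charset
    new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), "UTF-8")
);
If you didn't care about pos2, then you wouldn't need Apache Commons IO.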

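I wrote this code to read UTF-8 lines using a RandomAccessFile:
//File: CyclicBuffer.java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
public class CyclicBuffer {
private static final int size = 3;
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);
public CyclicBuffer(FileChannel channel) {
this.channel = channel;
buffer.limit(0); // start empty so that the first get() reads from the channel
}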
private int read() throws IOException {
return channel.read(buffer);
}
/**
* Returns the next byte.
*
* @return the byte read, or -1 when the end of the file is reached
* @throws IOException
*/
public byte get() throws IOException {
if (buffer.hasRemaining()) {
return buffer.get();
} else {
buffer.clear();
int eof = read();
if (eof == -1) {
return (byte) eof;
}
buffer.flip();
return buffer.get();
}
}
}
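//File: UTFRandomFileLineReader.java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;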
public class UTFRandomFileLineReader {
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;
public UTFRandomFileLineReader(FileChannel channel) {
this.buffer = new CyclicBuffer(channel);
}
public String readLine() throws IOException {
if (eof) {
return null;
}
byte x = 0;
temp.clear();
while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
if (temp.position() == temp.capacity()) {
temp = addCapacity(temp);
}
temp.put(x);
}
if (x == -1) {
eof = true;
}
temp.flip();
if (temp.hasRemaining() || !eof) {
// an empty buffer before EOF is an empty line, not the end of the file
return charset.decode(temp).toString();
} else {
return null;
}
}
private ByteBuffer addCapacity(ByteBuffer temp) {
ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
temp.flip();
t.put(temp);
return t;
}
public static void main(String[] args) throws IOException {
RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
"r");
UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
.getChannel());
int i = 1;
while (true) {
String s = reader.readLine();
if (s == null)
break;
System.out.println("\n line " + i++);
s = s + "\n";
for (byte b : s.getBytes(Charset.forName("utf-8"))) {
System.out.printf("%x", b);
}
System.out.printf("\n");
}
}
}

For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started their tutorial here
Also note that this isn't using Java 7's new ARM (Automatic Resource Management) syntax, which takes care of the exception handling for file-based resources; it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
/*
 * Paths uses the default file system; note that no exception is thrown at this
 * stage if the file is missing.
 */
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
FileChannel fc = null;
try
{
    /*
     * FileChannel implements SeekableByteChannel, so it can be positioned anywhere
     * in the file. It is a blocking channel; asynchronous file I/O is a separate
     * class, AsynchronousFileChannel.
     */
    fc = FileChannel.open(file, StandardOpenOption.READ);
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        readBuffer.flip(); // switch the buffer from filling to draining
        // caveat: a multi-byte character split across two reads will not decode cleanly
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        readBuffer.clear(); // make the whole buffer writable again
    }
}
finally
{
    if (fc != null)
    {
        fc.close();
    }
}

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader using
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))
Then you can call readLine on the BufferedReader.
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
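The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class and method names (ByteRangeLines, readLines) are invented, the range is assumed to fit in an int, and UTF-8 is passed explicitly because the InputStreamReader constructor without a charset would use the platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class ByteRangeLines {
    /** Loads bytes [pos1, pos2) into memory and splits them into UTF-8 lines. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        byte[] rawBytes = new byte[(int) (pos2 - pos1)]; // assumes the range fits in an int
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            raf.readFully(rawBytes); // throws EOFException if the file ends before pos2
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Note that a line which crosses pos2 is cut off at pos2 with this approach, and if pos2 falls in the middle of a multi-byte character the last character will not decode cleanly.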

I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.
UTF-8 is a variable-width encoding: a single character takes between one and four bytes, so a byte offset does not translate directly into a character offset. I'm assuming from your post that you are using single-byte characters, in which case 412 bytes means 412 characters. If every character took two bytes instead, the same 412 bytes would hold only 206 characters.
The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader (note that it decodes with the platform default charset unless you specify one, which is only possible from Java 11 on). The skip() method in FileReader lets you pass over X characters and then start reading text. Alternatively, I prefer the overloaded read() method, since it allows you to grab all the text at one time.
If you assume your "bytes" are individual characters, try something like this:
FileReader fr = new FileReader( new File("x.txt") );
fr.skip( pos ); // skips characters, not bytes
char[] buffer = new char[ pos2 - pos ];
fr.read( buffer, 0, buffer.length ); // the second argument is an offset into buffer, not into the file
...

I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:
FileDescriptor fd = raFile.getFD();
FileReader fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF8 strings are handled correctly.
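On the UTF-8 question: FileReader(FileDescriptor) decodes with the platform default charset, so the result depends on the machine. A variant that wraps the same file descriptor in a FileInputStream and an InputStreamReader lets the charset be stated explicitly. The sketch below assumes that is acceptable; the class and method names (SeekThenRead, firstLineAt) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class SeekThenRead {
    /** Seeks to pos and returns the first line found there, decoded as UTF-8. */
    static String firstLineAt(String fileName, long pos) throws IOException {
        try (RandomAccessFile raFile = new RandomAccessFile(fileName, "r")) {
            raFile.seek(pos);
            // The stream shares the descriptor, so it starts reading at the seek position.
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(raFile.getFD()), StandardCharsets.UTF_8));
            return br.readLine();
        } // closing raFile also closes the shared descriptor
    }
}
```

Because the BufferedReader reads ahead, raFile.getFilePointer() no longer corresponds to the end of the last line returned once reading has started.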

The Java IO API is very flexible. Unfortunately, that flexibility sometimes makes it verbose. The main idea here is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same goes for output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file, you just have to say
InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
in.skip(1024); // skip() does not need mark support; it may skip fewer bytes than requested
in.read();
It is not as complicated as you fear.
Channels are something different. They are part of the so-called "new IO", or NIO. New IO can be non-blocking; that is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.

Related

Reading a String that has n length from InputStream or Reader

I know that I can do it the way shown below, but I also want to know whether there is a shorter way. For example, why is there no method with a prototype like public String readString(int len); in the Reader class hierarchy that would do what the code in this question does in a single call?
InputStream in = new FileInputStream("abc.txt");
InputStreamReader inReader = new InputStreamReader(in);
char[] foo = new char[5];
inReader.read(foo);
System.out.println(new String(foo));
// I think this way is too long
// for reading a string that has only 5 character
// from InputStream or Reader
In the Python 3 programming language, I can do it very easily for UTF-8 and other files. Consider the following code.
fl = open("abc.txt", mode="r", encoding="utf-8")
fl.read(1) # returns a string that has 1 character
fl.read(3) # returns a string that has 3 characters
How can I do it in Java?
Thanks.
How can I do it in Java ?
The way you're already doing it.
I'd recommend doing it in a reusable helper method, e.g.
final class IOUtil {
public static String read(Reader in, int len) throws IOException {
char[] buf = new char[len];
int charsRead = in.read(buf);
return (charsRead == -1 ? null : new String(buf, 0, charsRead));
}
}
Then use it like this:
try (Reader in = Files.newBufferedReader(Paths.get("abc.txt"), StandardCharsets.UTF_8)) {
System.out.println(IOUtil.read(in, 5));
}
If you want to make a best effort to read as many as the specified number of characters, you may use
int len = 4;
String result;
try(Reader r = new FileReader("abc.txt")) {
CharBuffer b = CharBuffer.allocate(len);
do {} while(b.hasRemaining() && r.read(b) > 0);
result = b.flip().toString();
}
System.out.println(result);
While the Reader may read fewer characters than requested (depending on the underlying stream), it will read at least one character before returning, or return -1 to signal the end of the stream. So the code above loops until it has either read the requested number of characters or reached the end of the stream.
That said, a FileReader will usually read all requested characters in one go and return fewer only when it reaches the end of the file.

Safe implementation of BufferedReader

I want to use a BufferedReader to read a file uploaded to my server.
The file should be a CSV file, but I can't assume this, so I wrote some tests where the file is an image or a binary file (supposing the client has sent me the wrong file or an attacker is trying to break my service) or, even worse, a valid CSV file that has a 100 MB line.
My application can deal with this problem, but it has to read the first line of the file:
...
String firstLine = bufferedReader.readLine();
//Perform some validations and reject the file if it's not a CSV file
...
But when I wrote some tests, I found a potential risk: BufferedReader places no limit on the number of bytes it reads until it finds a line terminator, so it can end up throwing an OutOfMemoryError.
This is my test:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import org.junit.Test;
public class BufferedReaderTest {
@Test(expected=OutOfMemoryError.class)
public void testReadFileWithoutReturnLineCharacter() throws IOException {
BufferedReader bf = new BufferedReader(getInfiniteReader());
bf.readLine();
bf.close();
}
private Reader getInfiniteReader() {
return new Reader(){
@Override
public int read(char[] cbuf, int off, int len) throws IOException {
return 'A';
}
@Override
public void close() throws IOException {
}
};
}
}
I've been looking for a safe BufferedReader implementation on the internet, but I can't find anything. The only class I've found was BoundedInputStream from Apache Commons IO, which limits the number of bytes read from an input stream.
I need an implementation of BufferedReader that knows how to limit the number of bytes/characters read in each line.
Something like this:
The app calls 'readLine()'
The BufferedReader reads bytes until it finds a line terminator or reaches the maximum number of bytes allowed
If it has found a line terminator, it resets the byte count (so it can read the next line) and returns the content
If it has reached the maximum number of bytes allowed, it throws an exception
Does anybody know of an implementation of BufferedReader that has this behaviour?
This is not how you should proceed to detect whether a file is binary or not.
Here is how you can check whether a file is truly text or not; note that this requires that you know the encoding beforehand:
final Charset cs = StandardCharsets.UTF_8; // or another
final CharsetDecoder decoder = cs.newDecoder()
.onMalformedInput(CodingErrorAction.REPORT); // a Reader built directly from a Charset uses REPLACE instead!
// Here, "in" is the input stream from the file
try (
final Reader reader = new InputStreamReader(in, decoder);
) {
final char[] buf = new char[4096]; // or other size
while (reader.read(buf) != -1)
; // nothing
} catch (MalformedInputException e) {
// cannot decode; binary, or wrong encoding
}
Now, since you can initialize a BufferedReader over a Reader, you can use:
try (
final Reader r = new InputStreamReader(in, decoder);
final BufferedReader reader = new BufferedReader(r);
) {
// Read lines normally
} catch (CharacterCodingException e) {
// Not a CSV, it seems
}
// etc
Now, a little more explanation about how this works... While this is a fundamental part of reading text in Java, it is a part which is equally fundamentally misunderstood!
When you read a file as text using a Reader, you have to specify a character coding; in Java, this is a Charset.
What happens internally is that Java will create a CharsetDecoder from that Charset, read the byte stream and output a char stream. And there are three ways to deal with errors:
CodingErrorAction.REPLACE (what you get when you build a Reader directly from a Charset): unmappable byte sequences are replaced with the Unicode replacement character (it does ring a bell, right?);
CodingErrorAction.IGNORE: unmappable byte sequences do not trigger the emission of a char;
CodingErrorAction.REPORT: unmappable byte sequences trigger a CharacterCodingException to be thrown, which inherits IOException; in turn, the two subclasses of CharacterCodingException are MalformedInputException and UnmappableCharacterException.
Therefore, what you need to do in order to detect whether a file is truly text is to:
know the encoding beforehand!
use a CharsetDecoder configured with CodingErrorAction.REPORT;
use it in an InputStreamReader.
This is one way; there are others. All of them however will use a CharsetDecoder at some point.
Similarly, there is a CharsetEncoder for the reverse operation (char stream to byte stream), and this is what is used by the Writer family.
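As a compact illustration of the REPORT behaviour described above, the sketch below checks whether a byte array is valid UTF-8 by running it through a strict decoder. The class and method names (StrictDecode, isValidUtf8) are invented for this example, and it works on an in-memory array rather than a stream to keep it short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StrictDecode {
    /** Returns true if data decodes as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data)); // throws on the first bad sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A check like this only shows that the bytes are decodable in the chosen charset; it says nothing about whether the content is actually CSV.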
Thank you @fge for the answer. I ended up implementing a safe Reader that can deal with files with overly long lines (or without lines at all).
If anybody wants to see the code, the project (very small project even with many tests) is available here:
https://github.com/jfcorugedo/security-io
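The bounded readLine() behaviour described in the question could be sketched as below. This is not the code from the linked project, just a minimal illustration with invented names (BoundedLineReader, readLine, maxChars); it counts characters rather than bytes, counts a '\r' towards the limit, and reads one character at a time, so it is best handed a BufferedReader.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper, not part of any library and not the linked project's code.
final class BoundedLineReader {
    /**
     * Reads one line of at most maxChars characters; returns null at end of stream.
     * Throws IOException when the limit is reached before a line terminator.
     */
    static String readLine(Reader in, int maxChars) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                int len = sb.length();
                if (len > 0 && sb.charAt(len - 1) == '\r') {
                    sb.setLength(len - 1); // treat "\r\n" as one terminator
                }
                return sb.toString();
            }
            if (sb.length() >= maxChars) {
                throw new IOException("line longer than " + maxChars + " characters");
            }
            sb.append((char) c);
        }
        // end of stream: return the last unterminated line, if there is one
        return sb.length() == 0 ? null : sb.toString();
    }
}
```

Memory use is capped at roughly maxChars per call, so a reader that never produces a line terminator fails with an exception instead of exhausting the heap.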

Java: InputStreams and OutputStreams being shared

Can I share an InputStream or OutputStream?
For example, let's say I first have:
DataInputStream incoming = new DataInputStream(socket.getInputStream());
...incoming being an object variable. Later on I temporarily do:
BufferedReader dataReader = new BufferedReader(new InputStreamReader(socket.getInputStream()));
I understand that the stream is concrete and reading from it will consume its input, no matter from where it's done... But after doing the above, can I still access both incoming and dataReader simultaneously or is the InputStream just connected to ONE object and therefore incoming loses its input once I declare dataReader? I understand that if I close the dataReader then I will close the socket as well and I will refrain from this but I'm wondering whether I need to "reclaim" the InputStream somehow to incoming after having "transferred" it to dataReader? Do I have to do:
incoming = new DataInputStream(socket.getInputStream());
again after this whole operation?
You are using a teaspoon and a shovel to move dirt from a hole.
I understand that the stream is concrete and reading from it will
consume its input, no matter from where it's done
Correct. The teaspoon and shovel both move dirt from the hole. If you are removing dirt asynchronously (i.e. concurrently) you could get into fights about who has what dirt - so use concurrent construct to provide mutually exclusive access. If access is not concurrent, in other words ...
1) move one or more teaspoons of dirt from the hole
2) move one or more shovels of dirt from the hole
3) move one or more teaspoons of dirt from the hole
...
No problem. Teaspoon and shovel both remove dirt. But once dirt gets removed, it's removed, they do not get the same dirt. Hope this helps. Let's start shovelling, I'll use the teaspoon. :)
As fast-reflexes found, be very careful about sharing streams, particularly buffered readers since they can gobble up a lot more bytes off the stream than they need, so when you go back to your other input stream (or reader) it may look like a whole bunch of bytes have been skipped.
Proof you can read from same input stream:
import java.io.*;
public class w {
public static void main(String[] args) throws Exception {
InputStream input = new FileInputStream("myfile.txt");
DataInputStream b = new DataInputStream(input);
int data, count = 0;
// read first 20 characters with DataInputStream
while ((data = b.read()) != -1 && ++count < 20) {
System.out.print((char) data);
}
// if prematurely interrupted because of count
// then spit out last char grabbed
if (data != -1)
System.out.print((char) data);
// read remainder of file with underlying InputStream
while ((data = input.read()) != -1) {
System.out.print((char) data);
}
b.close();
}
}
Input file:
hello OP
this is
a file
with some basic text
to see how this
works when moving dirt
from a hole with a teaspoon
and a shovel
Output:
hello OP
this is
a file
with some basic text
to see how this
works when moving dirt
from a hole with a teaspoon
and a shovel
Proof to show BufferedReader is NOT guaranteed to work as it gobbles up lots of chars from the stream:
import java.io.*;
public class w {
public static void main(String[] args) throws Exception {
InputStream input = new FileInputStream("myfile.txt");
BufferedReader b = new BufferedReader(new InputStreamReader(input));
// read three lines with BufferedReader
String line;
for (int i = 0; (line = b.readLine()) != null && i < 3; ++i) {
System.out.println(line);
}
// read remainder of file with underlying InputStream
int data;
while ((data = input.read()) != -1) {
System.out.print((char) data);
}
b.close();
}
}
Input file (same as above):
hello OP
this is
a file
with some basic text
to see how this
works when moving dirt
from a hole with a teaspoon
and a shovel
Output:
hello OP
this is
a file
This will be disastrous. Both streams will have corrupted data. How could Java possibly know which data to send to which Stream?
If you need to do two different things with the same data, you're better off storing it somewhere (possibly copying it into two Queue<String>), and then reading it that way.
Ok, I solved this myself.. interesting links:
http://www.coderanch.com/t/276168//java/InputStream-multiple-Readers
Multiple readers for InputStream in Java
Basically... the InputStream can be connected to multiple objects reading from it and consuming it. However, a BufferedReader reads ahead, so when involving one of those, it might be a good idea to implement some sort of signal when you're switching from for example a BufferedReader to a DataInputStream (that is you want to use the DataInputStream to process the InputStream all of a sudden instead of the BufferedReader). Therefore I stop sending data to the InputStream once I know that all data has been sent that is for the BufferedReader to handle. After this, I wait for the other part to process what it should with the BufferedReader. It then sends a signal to show that it's ready for new input. The sending part should be blocking until it receives the signal input and then it can start sending data again. If I don't use the BufferedReader after this point, it won't have a chance to buffer up all the input and "steal" it from the DataInputStream and everything works very well :) But be careful, one read operation from the BufferedReader and you will be back in the same situation... Good to know!

Extracting UTF-16 encoded file from ZIP archive in Java

In the last section of the code I print what the Reader gives me, but it's just bogus. Where did I go wrong?
public static void read_impl(File file, String targetFile) {
// Create zipfile input stream
FileInputStream stream = new FileInputStream(file);
ZipInputStream zipFile = new ZipInputStream(new BufferedInputStream(stream));
// Im looking for a specific file/entry
while (!zipFile.getNextEntry().getName().equals(targetFile)) {
zipFile.getNextEntry();
}
// Next step in api requires a reader
// The target file is a UTF-16 encoded text file
InputStreamReader reader = new InputStreamReader(zipFile, Charset.forName("UTF-16"));
// I cant make sense of what this print
char buf[] = new char[1];
while (reader.read(buf, 0, 1) != -1) {
System.out.print(buf);
}
}
I'd guess that where you went wrong was believing that the file was UTF-16 encoded.
Can you show a few initial byte values if you don't decode them?
Your use of a char array is a bit pointless, though at first glance it should work. Try this instead:
int c;
while ((c = reader.read()) != -1) {
System.out.print((char)c);
}
If that does not work either, then perhaps you got the wrong file, or the file does not contain what you think it does, or the console can't display the characters it contains.
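One concrete way to end up with the wrong file is visible in the question's loop: getNextEntry() is called once in the loop condition and once in the body, so every second entry is skipped without its name being checked, and the call returns null once the archive is exhausted. A sketch of a lookup that advances exactly one entry per iteration might look like this; the class and method names (ZipText, readEntry) are invented for illustration, and the entry is still assumed to be UTF-16 text as the question states.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of any library.
final class ZipText {
    /** Returns the text of the named entry decoded as UTF-16, or null if it is absent. */
    static String readEntry(InputStream zipBytes, String targetFile) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipBytes)) {
            ZipEntry entry;
            // one getNextEntry() call per iteration, with a null check for end of archive
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(targetFile)) {
                    continue;
                }
                // the zip stream reports end-of-stream at the end of the current entry
                Reader reader = new InputStreamReader(zip, StandardCharsets.UTF_16);
                StringBuilder sb = new StringBuilder();
                int c;
                while ((c = reader.read()) != -1) {
                    sb.append((char) c);
                }
                return sb.toString();
            }
        }
        return null;
    }
}
```

If the output is still unreadable with a loop like this, the remaining suspects are the ones listed above: the actual encoding of the entry and what the console can display.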

Is there a reason to use BufferedReader over InputStreamReader when reading all characters?

I currently use the following function to do a simple HTTP GET.
public static String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
//java.io.BufferedReader b = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s);
//b = new java.io.BufferedReader(r);
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
//if (b != null) b.close();
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
I see no reason to use the BufferedReader since I am just going to download everything in sequence. Am I right in thinking there is no use for the BufferedReader in this case?
In this case, I would do as you are doing (use a char array for buffering and not one of the stream buffers).
There are exceptions, though. One place you see buffers (output this time) is in the servlet API. Data isn't written to the underlying stream until flush() is called, allowing you to buffer output but then dump the buffer if an error occurs and write an error page instead. You might buffer input if you needed to reset the stream for rereading using mark(int) and reset(). For example, maybe you'd inspect the file header before deciding on which content handler to pass the stream to.
Unrelated, but I think you should rewrite your stream handling. This pattern works best to avoid resource leaks:
InputStream stream = new FileInputStream("in");
try { //no operations between open stream and try block
//work
} finally { //do nothing but close this one stream in the finally
stream.close();
}
If you are opening multiple streams, nest try/finally blocks.
Another thing your code is doing is making the assumption that the returned content is encoded in your VM's default character set (though that might be adequate, depending on the use case).
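To make the last two points concrete, here is a minimal sketch of the same read loop with the charset passed in explicitly and the reader closed by try-with-resources (Java 7 and later) instead of nested try/finally blocks. The class and method names (ReadAll, readAll) are made up for illustration; for an HTTP response the charset would normally come from the Content-Type header, which this sketch leaves to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library.
final class ReadAll {
    /** Reads the whole stream as text in the given charset and closes it. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        // closing the reader also closes the wrapped input stream
        try (Reader r = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4 * 1024];
            int n;
            while ((n = r.read(buffer, 0, buffer.length)) != -1) {
                content.append(buffer, 0, n);
            }
        }
        return content.toString();
    }
}
```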
You are correct. If you are reading HTTP content and headers, you will want InputStreamReader rather than BufferedReader so you can read byte for byte.
BufferedReader in this scenario sometimes does weird things, especially when it comes to reading HTTP POST headers: sometimes you will be unable to read the POST data. If you use the InputStreamReader, you can read the content length and then read that many bytes.
Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
My gut tells me that since you're already buffering with the char array, it's redundant to use the BufferedReader.
