java: read vs readNBytes of the InputStream instance

In Java, the InputStream class has the methods read(byte[], int, int) and readNBytes(byte[], int, int). These two methods seem to have exactly the same functionality, so I wonder what the differences between them are.

Edited for better visibility of discussion in the comments:
read() says it attempts to read "up to len bytes ... but a smaller number may be read. This method blocks until input data is available, end of file is detected, or an exception is thrown."
readNBytes() says "blocks until len bytes of input data have been read, end of stream is detected, or an exception is thrown."
Even though the JDK's implementation for InputStream is likely to give you identical results for both methods, the documented differences mean that other classes inheriting from it may behave differently.
E.g. given the stream '12345<end>', read(s,0,10) is allowed to return '123', whereas readNBytes() is more likely to keep going to look for an end-of-stream and give you the whole thing.
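To make the documented difference concrete, here is a minimal, hypothetical sketch (the ChunkyStream class is made up for this demo): a stream whose read(byte[], int, int) never returns more than 3 bytes per call, so a single read() returns early while the inherited readNBytes() loops until end of stream:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ChunkyStream extends InputStream {
    private final InputStream src = new ByteArrayInputStream("12345".getBytes());

    @Override public int read() throws IOException { return src.read(); }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        return src.read(b, off, Math.min(len, 3)); // never more than 3 bytes per call
    }
}

public class ReadVsReadNBytes {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[10];
        int n1 = new ChunkyStream().read(buf, 0, 10);       // returns 3
        int n2 = new ChunkyStream().readNBytes(buf, 0, 10); // returns 5: loops until EOF
        System.out.println(n1 + " vs " + n2);               // prints "3 vs 5"
    }
}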
Original answer:
You're right that the javadocs are very similar. When in doubt, always drop down to the source. Most IDEs make it easy to attach the OpenJDK source and let you drill down to it.
This is readNBytes from InputStream.java:
public int readNBytes(byte[] b, int off, int len) throws IOException {
    Objects.requireNonNull(b);
    if (off < 0 || len < 0 || len > b.length - off)
        throw new IndexOutOfBoundsException();

    int n = 0;
    while (n < len) {
        int count = read(b, off + n, len - n);
        if (count < 0)
            break;
        n += count;
    }
    return n;
}
As you can see, it actually performs a call to read(byte[], int, int). The difference in this case is that if the number of bytes actually read is less than your specified len, it will attempt to read() again until it is confirmed that there is actually nothing left to be read.
Edit: Note that
This is OpenJDK's implementation of the base InputStream. Others may differ.
Subclasses of InputStream may also have their own overridden implementation. Consult the doc/source for the relevant class.

Related

Is reading/writing in an array more efficient than reading/writing a char/byte one by one?

try (FileReader reader = new FileReader("input.txt")) {
    int c;
    while ((c = reader.read()) != -1)
        System.out.print((char) c);
} catch (Exception ignored) { }
In this code, I read char by char. Is it more efficient in some way to read into an array of chars at once? In other words, is there any kind of optimization that happens when reading into arrays?
For example, in this code I have an array of chars called arr and I read into it until there is nothing left to read. Is it more efficient?
try (FileReader reader = new FileReader("input.txt")) {
    int size;
    char[] arr = new char[100];
    while ((size = reader.read(arr)) != -1)
        for (int i = 0; i < size; i++)
            System.out.print(arr[i]);
} catch (Exception ignored) { }
The question applies to both reading and writing, and to both chars and bytes.
Depends on the reader, but the answer can be yes. Whatever Reader or InputStream is the actual 'raw' driver (the one that isn't just wrapping another Reader or InputStream, but the one that is actually talking to the OS to get the data) may well implement the single-character read() method by asking the OS to read a single character.
In the end, you have a disk, and disks return data in blocks. So if you ask for 1 byte, you have 2 options as a computer:
Ask the disk for the block that contains the byte that is to be read. Store the block in memory someplace for a while. Return one byte; for the next few moments, if more requests for bytes come in from the same block, return from the stored data in memory and don't bother asking the disk at all. NOTE: This requires memory! Who allocates it? How much memory is okay? Tricky questions. OSes tend to give low level tools and don't like just picking values for any of these questions.
Ask the disk for the block that contains the byte that is to be read. Find the 1 byte needed from within this block. Ignore the rest of the data, return just that one byte. If in a few moments another byte from that block is asked for... ask the disk, again, for the whole block, and repeat this routine.
Which of the two models you get depends on many factors: what kind of disk it is, what OS you have, and what underlying java reader you are using. But it is plausible you end up in this second mode, and that is, as you can probably tell, usually incredibly slow, because you end up reading the same block 4000+ times instead of only once.
So, how to fix this? Well, java doesn't really know what the OS is doing either, so the safest bet is to let java do the caching. Then you have no dependencies on whatever the OS is doing.
You could write it yourself, so instead of:
for (int i = in.read(); i != -1; i = in.read()) {
    processOneChar((char) i);
}
you could do:
char[] buffer = new char[4096];
while (true) {
    int r = in.read(buffer);
    if (r == -1) break;
    for (int i = 0; i < r; i++) processOneChar(buffer[i]);
}
More code, but now the second scenario (the same block is read off the disk a ton of times) can no longer occur; you have given the OS the freedom to return to you up to 4096 chars' worth of data.
Or, use a java builtin: BufferedX:
BufferedReader br = new BufferedReader(in);
for (int i = br.read(); i != -1; i = br.read()) {
    processOneChar((char) i);
}
The implementation of BufferedReader guarantees that java will take care of making some reasonably sized buffer to avoid rereads of the same block off of disk.
NB: The FileReader constructor you are using should not be used. It uses the platform default encoding (any time you convert bytes to characters, an encoding is involved), and the platform default is a recipe for untestable bugs, which are very bad. Use new FileReader(file, StandardCharsets.UTF_8) instead, or better yet, use the newer API:
Path p = Paths.get("C:/file.txt");
try (BufferedReader br = Files.newBufferedReader(p)) {
    for (int i = br.read(); i != -1; i = br.read()) {
        processOneChar((char) i);
    }
}
Note that this:
Defaults to UTF-8, because the Files API defaults to UTF-8, unlike most places in the VM.
Makes a BufferedReader immediately; no need to make it yourself.
Properly manages the resource (ensures it is closed regardless of how this code exits, be it normally or via an exception) by using an ARM block.
Because a BufferedX is involved, no risk of the 'read the same block a lot' performance hole.
NB: The same logic applies when writing; disks such as SSDs can only write a whole block at a time. Now it's not just slow as molasses to write; you're also wearing out your disk, as SSDs support a limited number of writes.
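For the write side, a minimal sketch under the same reasoning (the file name is a placeholder): funnel many tiny writes through a buffer so the OS sees large chunks instead of one byte per call.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BufferedWriteDemo {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get("output.txt"); // hypothetical output file
        try (BufferedWriter bw = Files.newBufferedWriter(p)) { // UTF-8 by default
            for (int i = 0; i < 100_000; i++) {
                bw.write('x'); // accumulates in memory, flushed in large chunks
            }
        } // close() flushes whatever remains in the buffer
    }
}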

Unable to create a torrent's info hash

I'm having trouble finding the issue with how I'm generating the corresponding info hash for a torrent file. This is the code I have so far:
InputStream input = null;
try {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    input = new FileInputStream(file);
    StringBuilder builder = new StringBuilder();
    while (!builder.toString().endsWith("4:info")) {
        builder.append((char) input.read()); // It's ASCII anyway.
    }
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    for (int data; (data = input.read()) > -1; output.write(data));
    sha1.update(output.toByteArray(), 0, output.size() - 1);
    this.infoHash = sha1.digest();
    System.out.println(new String(Hex.encodeHex(infoHash)));
} catch (NoSuchAlgorithmException | IOException e) {
    e.printStackTrace();
} finally {
    if (input != null) try { input.close(); } catch (IOException ignore) {}
}
Below is my expected and actual hash:
Expected: d4d44272ee5f5bf887a9c85ad09ae957bc55f89d
Actual: 4d753474429d817b80ff9e0c441ca660ec5d2450
The torrent I'm trying to generate an info hash for can be found here (Ubuntu 14.04 Desktop amd64).
Let me know if I can provide any more info, thanks!
Exceptions contain 4 useful bits of info: type, message, trace, and cause. You're tossing away 3 out of the 4 relevant bits of info. Also, code is part of a process, and when an error occurs, generally that process cannot be finished at all. And yet on exceptions your process continues. Stop doing this; you've written code that only hurts you. Remove the try and the catch, and add a throws clause to your method signature. If you can't, the go-to default (and update your IDE if it generated this code) is throw new RuntimeException("Unhandled", e);. This is shorter, does not destroy any of the 4 interesting bits of info, and ends the process.
Separately, the notion that the right way to handle an InputStream's close() method throwing IOException is to just ignore it is also false. It is highly unlikely to throw, but if it does, you should assume you didn't read every byte. As that would be one explanation for a mismatched hash, ignoring it is misguided.
Finally, use the proper language constructs: There is a try-with-resources statement that would work far better here.
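For illustration, a minimal sketch of that construct applied here (names are placeholders, the body is elided):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class TryWithResourcesSketch {
    void computeInfoHash(String file) throws IOException {
        // The stream is closed automatically on every exit path,
        // whether the block completes normally or throws.
        try (InputStream input = new FileInputStream(file)) {
            // ... read bytes and update the digest here ...
        }
    }
}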
You're calling update with output.size() - 1; unless you want to intentionally ignore the last byte, this is a mistake; you're lopping off the last byte read.
Reading bytes into a builder and then, per byte, converting the builder to a string and checking the last characters is incredibly inefficient; for a file as small as 1MB that'll cause quite a grind.
Reading a single byte at a time from a raw FileInputStream is also that level of inefficient, because every read will cause file access (reading 1 byte is as expensive as reading a whole buffer full, so it's about 50000 times slower than it needs to be).
Here's how to do this with somewhat newer API, and look how much nicer this code reads. It also acts better under erroneous conditions:
byte[] data = Files.readAllBytes(Paths.get(fileName));
var search = "4:info".getBytes(StandardCharsets.US_ASCII);
int searchIdx = -1;
for (int i = 0; searchIdx == -1 && i < data.length - search.length; i++) {
    for (int j = 0; j < search.length; j++) {
        if (data[i + j] != search[j]) break;
        if (j == search.length - 1) searchIdx = i + j + 1; // first byte after the "4:info" marker
    }
}
if (searchIdx == -1) throw new IOException("Input torrent file does not contain marker");
var sha1 = MessageDigest.getInstance("SHA-1");
// Hash from just past the marker up to, but not including, the final byte
// (the closing 'e' of the outer dictionary, assuming info is its last key).
sha1.update(data, searchIdx, data.length - searchIdx - 1);
byte[] hash = sha1.digest();
StringBuilder hex = new StringBuilder();
for (byte h : hash) hex.append(String.format("%02x", h));
System.out.println(hex);
While rzwitserloot's answer covers some general Java coding practices, there are also correctness issues on the BitTorrent level.
You are using string processing for a structured data format, which is pretty much the same mistake as attempting to parse HTML with regex. In this case you're assuming that the only place the data can contain the string 4:info is the top-level dictionary key for the info dict, and that the info dictionary is the last entry of the top-level dictionary.
Instead you should use a proper bencoding decoder/encoder to extract the info dict and then re-encode it for hashing, or a tokenizer to find the exact byte range covering the info value (a rough sketch of the tokenizer approach follows below). Note that you need a validating parser for the former, while the latter can also handle some out-of-spec edge cases. Unless you want to implement them yourself, you may want to find a library that handles this for you.
Additionally, you're assuming that the data is ASCII. Bencoding is in fact a binary format that merely uses ASCII by convention in some places. You should operate on byte arrays directly. Your input is already binary and the hasher expects binary, so it is quite circuitous to go through strings.
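To illustrate the tokenizer approach described above, here is a rough, non-validating sketch (class and method names are invented for this example): it walks the top-level dictionary element by element and records the exact byte range of the info value, instead of searching for the string 4:info.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class InfoRangeSketch {
    private final byte[] data;
    private int pos;

    InfoRangeSketch(byte[] data) { this.data = data; }

    // Advances pos past one bencoded element starting at pos.
    private void skipElement() {
        switch (data[pos]) {
            case 'i':                        // integer: i<digits>e
                while (data[pos++] != 'e') { }
                break;
            case 'l':
            case 'd':                        // list l...e or dict d...e
                pos++;
                while (data[pos] != 'e') skipElement();
                pos++;
                break;
            default:                         // string: <len>:<bytes>
                int len = 0;
                while (data[pos] != ':') len = len * 10 + (data[pos++] - '0');
                pos += 1 + len;              // skip ':' plus the payload
        }
    }

    // Returns {start, end} offsets of the value mapped to the top-level "info" key.
    int[] findInfoValue() {
        pos = 1;                             // skip the outer dict's 'd'
        while (data[pos] != 'e') {
            int keyStart = pos;
            skipElement();                   // keys are always strings
            String key = new String(data, keyStart, pos - keyStart, StandardCharsets.US_ASCII);
            int valStart = pos;
            skipElement();                   // the value, whatever its type
            if (key.equals("4:info")) return new int[] { valStart, pos };
        }
        throw new IllegalStateException("no info dictionary found");
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int[] range = new InfoRangeSketch(data).findInfoValue();
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(data, range[0], range[1] - range[0]);
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) hex.append(String.format("%02x", b));
        System.out.println(hex);
    }
}

Because the sketch skips whole elements, a 4:info byte sequence buried inside some other value can never be mistaken for the key; it is still not a validating parser, though.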

Why does String method regionMatches not delegate to overload method [closed]

The method boolean regionMatches(int toffset, String other, int ooffset, int len) of java.lang.String is implemented as
public boolean regionMatches(int toffset, String other, int ooffset, int len) {
    char ta[] = value;
    int to = toffset;
    char pa[] = other.value;
    int po = ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0)
            || (toffset > (long)value.length - len)
            || (ooffset > (long)other.value.length - len)) {
        return false;
    }
    while (len-- > 0) {
        if (ta[to++] != pa[po++]) {
            return false;
        }
    }
    return true;
}
Since there is an overloaded method covering the same functionality, why is this method not implemented as simple delegation, like
public boolean regionMatches(int toffset, String other, int ooffset, int len) {
    return regionMatches(false, toffset, other, ooffset, len);
}
First, this is an implementation-dependent choice, so you might encounter an alternative implementation that actually performs the delegation you suggest. That's why it is important to specify which implementation you are referring to.
In the case of Oracle's JDK or OpenJDK, which seems to be the Java 8 implementation you're referring to, the decision was most likely made for performance reasons. As you can see, the implementation of regionMatches with the boolean ignoreCase parameter re-checks this parameter within the loop whenever two characters do not match.
It might have been the starting point for implementing both operations, but it turned out to be a performance bottleneck in some cases. Usually, the decision to write a specialized implementation instead of handling an operation generically is made based on profiling widespread real-life applications.
The specialized regionMatches implementation for the case-sensitive match consists of a very short, straightforward loop over the character arrays, which can have a dramatic impact on the efficiency of the HotSpot optimizer. E.g. it might compile this loop to native code comparing more than one character at a time.
Newer JDKs had to adapt the code as, since Java 9, a byte[] array is used instead of a char[] array and might contain iso-latin-1 or utf-16 encoded data, so different scenarios have to be handled. The implementors took the opportunity to introduce delegation, though it is the other way round:
public boolean regionMatches(boolean ignoreCase, int toffset,
        String other, int ooffset, int len) {
    if (!ignoreCase) {
        return regionMatches(toffset, other, ooffset, len);
    }
    // specialized case insensitive comparison follows
So now, you get the optimized case sensitive comparison whether you invoke regionMatches without the boolean parameter or with false. Further, the case insensitive match operation is also optimized in that the boolean parameter won’t be re-checked in a loop.
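As a quick, hedged illustration of the two overloads (on Java 9+, passing false delegates to the case-sensitive method shown above):

public class RegionMatchesDemo {
    public static void main(String[] args) {
        String s = "Hello World";
        System.out.println(s.regionMatches(6, "world", 0, 5));        // false: 'W' != 'w'
        System.out.println(s.regionMatches(false, 6, "world", 0, 5)); // false: same comparison
        System.out.println(s.regionMatches(true, 6, "world", 0, 5));  // true: case is ignored
    }
}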

how to load first x bytes from URL with Java / Scala?

I want to read the first x bytes from a java.net.URLConnection (although I'm not forced to use this class - other suggestions welcome).
My code looks like this:
val head = new Array[Byte](2000)
new BufferedInputStream(connection.getInputStream).read(head)
IOUtils.toString(new ByteArrayInputStream(head), charset)
It works, but does this code load only the first 2000 bytes from the network?
Next trial
As 'JB Nizet' said, it is not useful to use a buffered input stream, so I tried it with an InputStreamReader:
val head = new Array[Char](2000)
new InputStreamReader(connection.getInputStream, charset).read(head)
new String(head)
This code may be better, but the load times are about the same. So does this procedure limit the transferred bytes?
No, it doesn't. It could read up to 8192 bytes (the default buffer size of BufferedInputStream). It could also read 0 bytes, or any number of bytes between 0 and 2000, since you don't check the number of bytes that have actually been read, which is returned by the read() method.
And finally, depending on the value of charset and the actual charset used by the HTTP response, this could return an incorrect string, or a String truncated in the middle of a multi-byte character. You should use a Reader to read text.
I suggest you read the Java IO tutorial.
You can use read(Reader, char[]) from Apache Commons IO. Just pass a 2000-character buffer to it and it will fill it with as many characters as possible, up to 2000.
Be sure you understand the objections in the other answers/comments, in particular:
Don't use Buffered... wrappers, it goes against your intentions.
If you read textual data, then use a Reader to read 2000 characters instead of an InputStream reading 2000 bytes. The proper procedure would be to determine the character encoding from the headers of the response (Content-Type) and pass that encoding to the InputStreamReader.
Calling plain read(char[]) on a Reader will not fully fill the array you give to it. It can read as little as one character no matter how big the array is!
Don't forget to close the reader afterwards.
Other than that, I'd strongly recommend you to use Apache HttpClient in favor of java.net.URLConnection. It's much more flexible.
Edit: To understand the difference between Reader.read and IOUtils.read, it's worth examining the source of the latter:
public static int read(Reader input, char[] buffer, int offset, int length)
        throws IOException {
    if (length < 0) {
        throw new IllegalArgumentException("Length must not be negative: " + length);
    }
    int remaining = length;
    while (remaining > 0) {
        int location = length - remaining;
        int count = input.read(buffer, offset + location, remaining);
        if (EOF == count) { // EOF
            break;
        }
        remaining -= count;
    }
    return length - remaining;
}
Since Reader.read can read fewer characters than the given length (we only know it's at least 1 and at most the length), we need to call it in a loop until we get the amount we want.
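For the original question, InputStream.readNBytes(byte[], int, int) (available since Java 9) performs the same fill-the-buffer loop, so a sketch along these lines may be all that's needed today (the URL and charset here are assumptions):

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HeadFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/"); // hypothetical URL
        try (InputStream in = url.openStream()) {
            byte[] head = new byte[2000];
            int n = in.readNBytes(head, 0, head.length); // loops until 2000 bytes or EOF
            // Caveat from the answers above: this may still cut a multi-byte
            // character in half; use a Reader if you need exact characters.
            System.out.println(new String(head, 0, n, StandardCharsets.UTF_8));
        }
    }
}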

PipedInputStream always blocks when calling read() with an empty buffer. Is there any way of stopping this?

I've searched through all the questions I can find relating to PipedInputStreams and PipedOutputStreams and have not found anything that can help me. Hopefully someone here will have come across something similar.
Background:
I have a class that reads data from any java.io.InputStream. The class has a method called hasNext(), which checks the given InputStream for data, returning true if data is found, false otherwise. This hasNext() method works perfectly with other InputStreams but when I try to use a PipedInputStream (fed from a PipedOutputStream in a different Thread, encapsulated in the inputSupplier variable below), it hangs. After looking into how the hasNext() method works, I recreated the problem with the following code:
public static void main(String[] args) throws IOException {
    PipedInputStream inputSourceStream = new PipedInputStream(inputSupplier.getOutputStream());
    byte[] input = new byte[4096];
    int bytes_read = inputSourceStream.read(input, 0, 4096);
}
The inputSupplier is simply an instance of a small class I wrote that runs in its own thread with a local PipedOutputStream to avoid getting deadlocks.
The Problem
So, my problem is that the hasNext() method calls the PipedInputStream's read() method to ascertain whether there is any data to be read. This causes a blocking read operation that never exits until some data arrives. This means that my hasNext() method will never return false (or return at all) if the stream is empty.
Disclaimer: I know about the available() method but all that tells me is that there are no bytes available, not that we are at the end of the stream (whatever implementation of a Stream that may be), and so read() is required to check this.
[Edit] The whole purpose of me initially using a PipedInputStream was to simulate a "bursty" source of data. That is, I need to have a Stream that I can write to sporadically to see if my hasNext() method will detect that there is new data on the Stream upon reading it. If there is a better way of doing this then I would be thrilled to hear it!
I hate to necro a question this old, but this is near the top of google's results, and I just found a solution for myself: this circular byte buffer exposes in and out streams, and the read method returns -1 immediately when no data is present. A little bit of threading, and your test classes can provide data exactly the way you want.
http://ostermiller.org/utils/src/CircularByteBuffer.java.html
Edit
Turns out I misunderstood the documentation of the above class; it only returns -1 when a thread calling read() is interrupted. I made a quick mod to the read method that gives me what I want (original code commented out; the only new code is the substitution of an else for the else if):
@Override
public int read(byte[] cbuf, int off, int len) throws IOException {
    //while (true){
    synchronized (CircularByteBuffer.this) {
        if (inputStreamClosed)
            throw new IOException("InputStream has been closed; cannot read from a closed InputStream.");
        int available = CircularByteBuffer.this.available();
        if (available > 0) {
            int length = Math.min(len, available);
            int firstLen = Math.min(length, buffer.length - readPosition);
            int secondLen = length - firstLen;
            System.arraycopy(buffer, readPosition, cbuf, off, firstLen);
            if (secondLen > 0) {
                System.arraycopy(buffer, 0, cbuf, off + firstLen, secondLen);
                readPosition = secondLen;
            } else {
                readPosition += length;
            }
            if (readPosition == buffer.length) {
                readPosition = 0;
            }
            ensureMark();
            return length;
        //} else if (outputStreamClosed){
        } else { // << new line of code
            return -1;
        }
    }
    //try {
    //    Thread.sleep(100);
    //} catch(Exception x){
    //    throw new IOException("Blocking read operation interrupted.");
    //}
    //}
}
Java SE 6 and later (correct me if I am wrong) come with the java.nio package, which is designed for asynchronous I/O, which sounds like what you are describing.
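If java.nio fits, here is a minimal sketch of the non-blocking behavior the asker wants, using java.nio.channels.Pipe in place of PipedInputStream (note this swaps out the asker's classes; the Pipe API itself is standard):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class NonBlockingPipeDemo {
    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false); // reads return immediately

        ByteBuffer buf = ByteBuffer.allocate(4096);
        System.out.println(pipe.source().read(buf)); // 0: pipe is empty, no blocking

        pipe.sink().write(ByteBuffer.wrap("burst".getBytes()));
        buf.clear();
        System.out.println(pipe.source().read(buf)); // 5: the burst arrived
    }
}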
