I've been doing some socket programming to transmit information across the wire. I've run into a problem with DataOutputStream.writeUTF(). It seems to allow strings of up to 64k but I have a few situations where I can run over this. Are there any good alternatives that support larger strings or do I need to roll my own?
It actually uses two bytes to write the length of the encoded string, and then encodes each character in one, two, or three bytes. (See the documentation for java.io.DataOutput.) The encoding is "modified UTF-8": it is close to standard UTF-8, but it encodes the null character and characters outside the Basic Multilingual Plane differently, so there are compatibility problems. If you are not terribly worried about the amount of data you will be writing, you can easily write your own by writing the length of the string first, and then the raw data of the string using the getBytes method.
// Write data
String str="foo";
byte[] data=str.getBytes("UTF-8");
out.writeInt(data.length);
out.write(data);
// Read data
int length=in.readInt();
byte[] data=new byte[length];
in.readFully(data);
String str=new String(data,"UTF-8");
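As a self-contained illustration of the length-prefix approach above, here is a minimal sketch that round-trips a string longer than writeUTF() accepts. The class and method names (LengthPrefixedStrings, writeString, readString) are hypothetical, and in-memory streams stand in for the socket streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from any library.
final class LengthPrefixedStrings {

    // Writes a 4-byte big-endian length followed by the UTF-8 bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(data.length);
        out.write(data);
    }

    // Reads the length, then exactly that many bytes, and decodes them.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] data = new byte[length];
        in.readFully(data);
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // 100,000 characters: too long for writeUTF, fine for the int prefix.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append('x');
        }
        String big = sb.toString();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeString(out, big);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println(readString(in).equals(big));
        }
    }
}
```

When the length comes from an untrusted peer, it is probably worth checking it against a sane upper bound (and rejecting negative values) before allocating the array.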
ObjectOutputStream.writeObject() properly handles long strings (verified by looking at the source code). Write the string out this way:
ObjectOutputStream oos = new ObjectOutputStream(out);
... other write operations ...
oos.writeObject(myString);
... other write operations ...
Read it this way:
ObjectInputStream ois = new ObjectInputStream(in);
... other read operations ...
String myString = (String) ois.readObject();
... other read operations ...
Another difference from DataOutputStream is that ObjectOutputStream automatically writes a 4-byte stream header when it is instantiated, but it's usually going to be a pretty small penalty to pay.
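For completeness, a minimal round-trip sketch of the ObjectOutputStream approach, using in-memory streams in place of the socket. The class name LongStringObjectStream and its methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical demo class; only the java.io calls are real API.
final class LongStringObjectStream {

    // Serializes one string, including the stream header written by the constructor.
    static byte[] serialize(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(s);
        }
        return buffer.toByteArray();
    }

    // Reads the header and then the single string written by serialize().
    static String deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (String) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a string well over the 64K limit of writeUTF.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append('y');
        }
        String big = sb.toString();
        System.out.println(deserialize(serialize(big)).equals(big));
    }
}
```

One caveat with this design: deserializing objects from a peer you do not trust is generally considered risky, so the length-prefixed byte approach may be the safer choice for a public-facing socket.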
You should be able to use OutputStreamWriter with the UTF-8 encoding. There's no explicit writeUTF method, but you can set the charset in the constructor. Try
Writer osw = new OutputStreamWriter(out, "UTF-8");
where out is whatever OutputStream you're wrapping now.
I am studying Android development (I'm a beginner in programming in general) and learning about HTTP networking and saw this code in the lesson:
private String readFromStream(InputStream inputStream) throws IOException {
StringBuilder output = new StringBuilder();
if (inputStream != null) {
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
BufferedReader reader = new BufferedReader(inputStreamReader);
String line = reader.readLine();
while (line != null) {
output.append(line);
line = reader.readLine();
}
}
return output.toString();
}
I don't understand exactly what InputStream, InputStreamReader and BufferedReader do. All of them have a read() method, and BufferedReader also has readLine(). Why can't I use only the InputStream, or add only the InputStreamReader? Why do I need to add the BufferedReader? I know it has to do with efficiency, but I don't understand how.
I've been researching and the documentation for the BufferedReader tries to explain this but I still don't get who is doing what:
In general, each read request made of a Reader causes a corresponding
read request to be made of the underlying character or byte stream. It
is therefore advisable to wrap a BufferedReader around any Reader
whose read() operations may be costly, such as FileReaders and
InputStreamReaders. For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each
invocation of read() or readLine() could cause bytes to be read from
the file, converted into characters, and then returned, which can be
very inefficient.
So, I understand that the InputStream can only read one byte, the InputStreamReader a single character, and the BufferedReader a whole line and that it also does something about efficiency which is what I don't get. I would like to have a better understanding of who is doing what, so as to understand why I need all three of them and what the difference would be without one of them.
I've researched a lot here and elsewhere on the web and don't seem to find any explanation about this that I can understand, almost all tutorials just repeat the documentation info. Here are some related questions that maybe begin to explain this but don't go deeper and solve my confusion: Q1, Q2, Q3, Q4. I think it may have to do with this last question's explanation about system calls and returning. But I would like to understand what is meant by all this.
Could it be that the BufferedReader's readLine() calls the InputStreamReader's read() method which in turn calls the InputStream's read() method? And the InputStream returns bytes converted to int, returning a single byte at a time, the InputStreamReader reads enough of these to make a single character and converts it to int and returns a single character at a time, and the BufferedReader reads enough of these characters represented as integers to make up a whole line? And returns the whole line as a String, returning only once instead of several times? I don't know, I'm just trying to get how things work.
Lots of thanks in advance!
This "Streams in Java concepts and usage" link gives a very nice explanation.
Streams, Readers, Writers, BufferedReader, BufferedWriter – these are the terms you will deal with in Java. These are the classes provided in Java to operate with input and output. It is really worth knowing how they are related and how they are used. This post will explore the Streams in Java and other related classes in detail. So let us start:
Let us define each of these in high level then dig deeper.
Streams
Used to deal with byte level data
Reader/Writer
Used to deal with character level data. They also support various character encodings.
BufferedReader/BufferedWriter
To increase performance. Data to be read will be buffered into memory for quick access.
While these are for taking input, corresponding classes exist for output as well. For example, if there is an InputStream that is meant to read a stream of bytes, an OutputStream will help in writing a stream of bytes.
InputStreams
Java provides many types of InputStreams. Each connects to a distinct data source such as a byte array, a File etc.
For example, FileInputStream connects to a file data source and can be used to read bytes from a File, while ByteArrayInputStream can be used to treat a byte array as an input stream.
OutputStream
This helps in writing bytes to a data source. For almost every InputStream there is a corresponding OutputStream, wherever it makes sense.
UPDATE
What is Buffered Stream?
Here I'm quoting from Buffered Streams, Java documentation (With a technical explanation):
Buffered Streams
Most of the examples we've seen so far use unbuffered I/O. This means
each read or write request is handled directly by the underlying OS.
This can make a program much less efficient, since each such request
often triggers disk access, network activity, or some other operation
that is relatively expensive.
To reduce this kind of overhead, the Java platform implements buffered
I/O streams. Buffered input streams read data from a memory area known
as a buffer; the native input API is called only when the buffer is
empty. Similarly, buffered output streams write data to a buffer, and
the native output API is called only when the buffer is full.
Sometimes I lose my hair reading technical documentation. So, here I quote the more humane explanation from https://yfain.github.io/Java4Kids/:
In general, disk access is much slower than the processing performed
in memory; that’s why it’s not a good idea to access the disk a
thousand times to read a file of 1,000 bytes. To minimize the number
of times the disk is accessed, Java provides buffers, which serve as
reservoirs of data.
In reading File with FileInputStream then BufferedInputStream, the
class BufferedInputStream works as a middleman between FileInputStream
and the file itself. It reads a big chunk of bytes from a file into
memory (a buffer) in one shot, and the FileInputStream object then
reads single bytes from there, which are fast memory-to-memory
operations. BufferedOutputStream works similarly with the class
FileOutputStream.
The main idea here is to minimize disk access. Buffered streams are
not changing the type of the original streams — they just make reading
more efficient. A program performs stream chaining (or stream piping)
to connect streams, just as pipes are connected in plumbing.
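To make the effect described in these quotes concrete, here is a minimal sketch that counts how many read requests reach the underlying source with and without a BufferedInputStream. CountingInputStream and BufferingDemo are hypothetical names written for this example, and a ByteArrayInputStream stands in for a file or socket.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BufferingDemo {

    // Hypothetical wrapper that counts how often the underlying source is asked for data.
    static final class CountingInputStream extends FilterInputStream {
        int calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Reads the stream one byte at a time until end of stream.
    static void drain(InputStream in) throws IOException {
        while (in.read() != -1) {
            // discard the byte; only the number of requests matters here
        }
    }

    // Every single-byte read goes straight to the source.
    static int countUnbuffered(byte[] data) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        drain(source);
        return source.calls;
    }

    // The buffer fetches large chunks, so the source sees only a few requests.
    static int countBuffered(byte[] data) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        drain(new BufferedInputStream(source));
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        System.out.println("unbuffered requests: " + countUnbuffered(data));
        System.out.println("buffered requests:   " + countBuffered(data));
    }
}
```

With a real file or socket, each of those requests to the source can turn into a system call, which is where the cost comes from.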
InputStream, OutputStream, byte[], ByteBuffer are for binary data.
Reader, Writer, String, char are for text, internally Unicode, so that all scripts in the world may be combined (say Greek and Arabic).
InputStreamReader and OutputStreamWriter form a bridge between the two. If you have some InputStream and know that its bytes are actually text in some encoding (a Charset), then you can wrap the InputStream:
try (InputStreamReader reader =
new InputStreamReader(stream, StandardCharsets.UTF_8)) {
... read text ...
}
There is a constructor without Charset, but that is not portable, as it uses the default platform encoding.
On older Android versions StandardCharsets may not exist; there, use "UTF-8" instead.
The derived classes FileInputStream and BufferedReader add something to their parents, InputStream and Reader respectively.
A FileInputStream is for input from a File, and a BufferedReader uses a memory buffer, so the actual physical reading is not done character by character (which would be inefficient). With new BufferedReader(otherReader) you add buffering to your original reader.
All this understood, there is the utility class Files with methods like newBufferedReader(Path, Charset) which add additional brevity.
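Putting the three layers together, the following sketch reads UTF-8 encoded bytes as lines of text. TextFromBytes and readLines are hypothetical names, and the byte array is a stand-in for whatever InputStream you actually have.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class showing the byte -> char -> line layering.
final class TextFromBytes {

    static List<String> readLines(InputStream stream) throws IOException {
        List<String> lines = new ArrayList<>();
        // InputStream supplies bytes, InputStreamReader decodes them to chars,
        // BufferedReader adds buffering and the readLine() convenience method.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Greek and Arabic text, written with Unicode escapes to keep the source ASCII.
        String text = "\u03b1\u03b2\u03b3\n\u0633\u0644\u0627\u0645\n";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        for (String line : readLines(new ByteArrayInputStream(bytes))) {
            System.out.println(line.length() + " characters");
        }
    }
}
```

Note that readLine() strips the line terminator, so code that appends lines to a StringBuilder has to add the newline back if it matters.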
I have read lots of articles on this very topic. I hope this might help you in some way.
Basically, the BufferedReader maintains an internal buffer.
During a read operation, it reads characters from the underlying source in bulk and stores them in its internal buffer.
Each subsequent read request from the program is then served from that internal buffer.
This reduces the number of round trips between the program and the file or disk, and is hence more efficient.
I am trying to copy the InputStream from a URLConnection, which returns a stream of type HttpInputStream (an inner class of HttpURLConnection).
In other cases, I can copy the original stream to a ByteArrayOutputStream and then use mark/reset on the original, but HttpInputStream does not support mark/reset.
Is there a way I can still copy the stream and reset the original or keep it from being consumed? The original stream inside URLConnection has to be readable because it is passed into another library. I only need to copy the stream so I can read the first two lines of data. Here is what I have for streams that support mark/reset:
InputStream input = null;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try {
input = connection.getInputStream();
byte[] buffer = new byte[200];
input.mark(200);
int len = input.read(buffer);
input.reset();
baos.write(buffer, 0, len);
baos.flush();
String content = baos.toString("UTF-8");
//I set flags based on the value of content, but omitting here for the sake of simplicity.
} catch (IOException ex) {
//I do stuff here, but omitting for sake of simplicity in this
}
InputStreams are not generally cloneable, and not all streams support mark/reset. There are some possible workarounds within the standard JRE.
Wrap the InputStream into a BufferedInputStream. That one supports mark/reset within the limits of its buffer size. That enables you to read a limited amount of data from the beginning, then reset the stream.
Another alternative is PushbackInputStream, which allows you to "unread" data previously read. You need to buffer the data to be pushed back yourself though, so it may be a bit inconvenient to handle.
If the whole stream isn't terribly big, you could also read the entire stream first, then construct as many ByteArrayInputStreams as needed from the pre-read data. Only feasible if the data fits in the heap (e.g. less than approximately 2GB max).
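The BufferedInputStream workaround might look like the sketch below. StreamPeek and peek are hypothetical names, and the ByteArrayInputStream in main is only a stand-in for connection.getInputStream(). The important detail is that the wrapped stream, not the raw one, must be handed to the other library afterwards.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; only the java.io calls are real API.
final class StreamPeek {

    // Reads up to `limit` bytes from the current position, then rewinds.
    // The stream must support mark/reset, e.g. a BufferedInputStream.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buffer = new byte[limit];
        int total = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (total < limit) {
            int n = in.read(buffer, total, limit - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream().
        InputStream raw = new ByteArrayInputStream(
                "first line\nsecond line\nrest of the body".getBytes(StandardCharsets.UTF_8));
        InputStream wrapped = new BufferedInputStream(raw);

        byte[] head = peek(wrapped, 200);
        System.out.println(new String(head, StandardCharsets.UTF_8));
        // Pass `wrapped` (not `raw`) on; it still delivers the stream from the start.
    }
}
```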
The Apache Commons IO library has a really nice TeeInputStream.
https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/input/TeeInputStream.html
I need to write files, with Headers in ASCII and values in Binary.
For now, I'm using this:
File file = new File("~/myfile");
FileOutputStream out = new FileOutputStream(file);
// Write in ASCII
out.write(("This is a header\n").getBytes());
// Write a byte[] is quite easy
byte[] buffer = new byte[4];
out.write(buffer, 0, 4);
// Write an int in binary gets complicated
out.write(ByteBuffer.allocate(4).putInt(6).array());
//Write a float in binary gets even more complicated
out.write(ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN)
.putFloat(4.5f).array());
The problem is that it's very slow (in terms of performance) to write that way, way slower than writing the values in ASCII actually. But it should be faster since I'm writing less data.
I've looked at other Java classes, and it seems to me that they are either only for ASCII writing, or only for Binary writing.
Would you have any other suggestion for this problem?
You can use FileOutputStream to write binary. To include text you have to convert it to a byte[] before writing to the stream.
The problem is that it's very long to write that way, way longer than writing the values in ASCII actually. But it should be shorter since in I'm writing less data.
Mixing text and data is complex and error prone. The size of the data does not matter much; rather, the complexity of the data is what's important. I suggest considering DataOutputStream if you want to keep things simple.
To perform your example you can do
DataOutputStream out = new DataOutputStream(
new BufferedOutputStream(
new FileOutputStream("~/myfile")));
// Write in ASCII
out.write("This is a header\n".getBytes());
// Write a 32-bit int
out.writeInt(6);
//Write a float in binary
out.writeFloat(4.5f);
out.flush(); // push the buffered bytes to the file; call out.close() when done
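To show that such a file can be read back symmetrically, here is a minimal sketch that writes the header and values into memory and then parses them again with DataInputStream. MixedFormat, write and readHeader are hypothetical names; the newline-terminated ASCII header is the convention from the question.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of an ASCII header followed by binary values.
final class MixedFormat {

    static byte[] write(String header, int count, float value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(bytes))) {
            // An explicit charset avoids depending on the platform default.
            out.write((header + "\n").getBytes(StandardCharsets.US_ASCII));
            out.writeInt(count);    // 4 bytes, big-endian
            out.writeFloat(value);  // 4 bytes, IEEE 754
        } // close() flushes the buffer
        return bytes.toByteArray();
    }

    // Reads single bytes up to (and consuming) the newline that ends the header.
    static String readHeader(DataInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = write("This is a header", 6, 4.5f);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            System.out.println(readHeader(in));
            System.out.println(in.readInt());
            System.out.println(in.readFloat());
        }
    }
}
```

The BufferedOutputStream in the chain is what addresses the speed complaint: without it, each small write goes to the file individually.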
I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
int bytesAvail = stream.available();
byte[] bytesRead = new byte[bytesAvail];
stream.read(bytesRead);
stream.close();
String fileContents = new String(bytesRead);
//more code here working with fileContents
}
My problem is that for large files (>2 GB), this code causes the program to either run extremely slowly or truncate the data, depending on the computer the program is executed on. Does anyone have a recommendation regarding how to deal with large files in this situation?
You're assuming that available() returns the size of the file; it does not. It returns an estimate of the number of bytes that can be read without blocking, and that may be any number less than or equal to the size of the file.
Unfortunately, there's no way to do what you want in one shot without some other source of information about the length of the data (e.g., java.io.File.length()). Instead, you have to accumulate the data over multiple reads. One way is to use a ByteArrayOutputStream: read into a fixed-size array, then write the bytes you read into the ByteArrayOutputStream. At the end, pull the byte array out. You'll need to use the three-argument forms of read() and write(), and check the return value of read() so you know exactly how many bytes were read into the buffer on each call.
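The accumulation loop described above can be sketched as follows. ReadAll and drainToByteArray are hypothetical names, and the 8192-byte chunk size is an arbitrary choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; only the java.io calls are real API.
final class ReadAll {

    // Reads until end of stream, however many read() calls that takes.
    static byte[] drainToByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns the number of bytes actually placed in the chunk, or -1 at the end.
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            result.write(chunk, 0, n);
        }
        return result.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[20_000];
        byte[] copy = drainToByteArray(new ByteArrayInputStream(data));
        System.out.println(copy.length);
    }
}
```

This fixes the truncation, but it still holds everything in memory. A Java array cannot hold more than about 2^31 - 1 elements, so for files larger than 2 GB the contents would have to be processed chunk by chunk (or line by line) inside the loop instead of being collected into one array or String.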
I'm not sure why you don't think you can read it line-by-line. BufferedInputStream only describes how the underlying stream is accessed, it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.
I'm trying to connect to a server, and then send it an HTTP request (GET in this case). The idea is to request a file and then receive it from the server.
It should work with both text files and binary files (images, for example). I have no problem with text files; they work perfectly. But I'm having some trouble with binary files.
First, I declare a BufferedReader (for reading the header and text files) and a DataInputStream:
BufferedReader in_text = new BufferedReader(
new InputStreamReader(socket.getInputStream()));
DataInputStream in_binary = new DataInputStream(
new BufferedInputStream(socket.getInputStream()));
Then, I read the header with in_text and discover if it's a textfile or binary file. In case it's a textfile, I read it correctly in a StringBuilder. In case it's a binary file, I declare a byte[filesize] and store the following content of in_binary.
byte[] bindata = new byte[filesize];
in_binary.readFully(bindata);
And it doesn't work. I get an EOFException.
I thought that maybe in_binary was still at the first position of the stream, so it hadn't read the header yet. So I captured the length of the header and skipped that many bytes in in_binary.
byte[] bindata = new byte[filesize];
in_binary.reset();
in_binary.skip(headersize);
in_binary.readFully(bindata);
And still the same.
What could be happening?
Thanks!
P.S.: I know I could use URLConnection and all of that. That's not the problem.
BufferedReader buffers data (hence the name) - it will almost certainly have read more data from the socket than just the header. Therefore, when you try to read the actual data some has already been read from the socket. If you try reading just a few bytes you'll probably see that they aren't the first bytes of the actual response data.
If you know how to use URLConnection, I have to wonder what reason you have for not using it.
As soon as you use any subclass of Reader, you aren't reading binary. You are converting from bytes to characters, using the default encoding of the JVM. If you really want bytes of binary, you need to stick to streams, not readers. Creating both stacks at once is asking for trouble.
Use Apache Commons IO: IOUtils.toByteArray() to read the entire content into memory as a byte[], and then decide what to do with it, unless you have a gigantic amount of data, in which case you should set up the buffered input stream, decide what to do, and only construct the reader after you push back.
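One way to follow the "single stack of streams" advice without an extra library is sketched below: the header is read byte by byte from the same buffered stream that later delivers the binary body, so nothing is swallowed by a second buffer. SingleStreamHttp, readHeaderLine and readResponseBody are hypothetical names, and the sketch assumes the response carries a Content-Length header (no chunked encoding or other HTTP details are handled).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; assumes a response with a Content-Length header.
final class SingleStreamHttp {

    // Reads bytes up to the next LF and drops CR characters.
    // Returns null if the stream ends before any byte is read.
    static String readHeaderLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                break;
            }
            if (b != '\r') {
                line.write(b);
            }
        }
        if (b == -1 && line.size() == 0) {
            return null;
        }
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Parses the header section, then reads exactly Content-Length body bytes
    // from the same stream object.
    static byte[] readResponseBody(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(raw));
        int contentLength = -1;
        String line;
        // The header section ends at the first empty line.
        while ((line = readHeaderLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0
                    && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                contentLength = Integer.parseInt(line.substring(colon + 1).trim());
            }
        }
        if (contentLength < 0) {
            throw new IOException("no Content-Length header found");
        }
        byte[] body = new byte[contentLength];
        in.readFully(body);
        return body;
    }
}
```

In the question's setting, the argument would be socket.getInputStream(), wrapped exactly once. A text body can still be decoded afterwards with new String(body, charset).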