I could use some hints or tips for a decent interface for reading a file with special characteristics.
The files in question have a header (~120 bytes), a body (1 byte to 3 GB) and a footer (4 bytes).
The header contains information about the body, and the footer is simply a CRC32 value of the body.
I use Java, so my idea was to extend the InputStream class and add a constructor such as public MyInStream(InputStream in), where I immediately read the header and then direct the overridden read()s at the body.
Problem is, I can't give the user of the class the CRC32-value until the whole body has been read.
Because the file can be 3 GB large, putting it all in memory is a bad idea.
Reading it all into a temporary file is going to be a performance hit if there are many small files.
I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket.
Looking at it again, maybe extending InputStream is a bad idea.
Thank you for reading the confused thoughts of a tired programmer. :)
"Looking at it again, maybe extending InputStream is a bad idea."
If users of the class need to access the body as a stream, it's IMO not a bad choice. Java's ObjectOutputStream/ObjectInputStream work like this.
"I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket."
Um, then your problem is not with the choice of Java class, but with the design of the file format. If you can't change the format, there isn't really anything you can do to make the data at the end of the file available before all of it is read.
But perhaps you could encapsulate the processing of the checksum completely? Presumably it's a checksum for the body, so your class could always "read ahead" 4 bytes to see when the file ends and not return the last 4 bytes to the client as part of the body and instead compare them with a CRC computed while reading the body, throwing an exception when it does not match.
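A minimal sketch of that idea (class and method names are mine, and it assumes the 4-byte footer stores the CRC32 big-endian, which your format may not): it is handed the stream with the header already consumed, keeps a 4-byte look-ahead so the trailing checksum is never returned as body data, and verifies the checksum at end of stream.

    import java.io.EOFException;
    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.CRC32;

    // Sketch: wraps the raw stream once the header has been consumed.
    // Keeps a 4-byte look-ahead so the trailing checksum is never returned
    // as body data, and verifies the CRC once the underlying stream ends.
    public class CrcCheckedBodyStream extends FilterInputStream {
        private final byte[] tail = new byte[4]; // candidate footer, read ahead
        private final CRC32 crc = new CRC32();

        public CrcCheckedBodyStream(InputStream in) throws IOException {
            super(in);
            for (int i = 0; i < 4; i++) { // prime the look-ahead
                int b = in.read();
                if (b < 0) throw new EOFException("stream shorter than footer");
                tail[i] = (byte) b;
            }
        }

        @Override
        public int read() throws IOException {
            int next = in.read();
            if (next < 0) {            // the 4 buffered bytes are the footer
                verifyCrc();
                return -1;
            }
            int out = tail[0] & 0xFF;  // oldest buffered byte is real body data
            System.arraycopy(tail, 1, tail, 0, 3);
            tail[3] = (byte) next;
            crc.update(out);
            return out;
        }

        // Must be overridden: FilterInputStream's bulk read would bypass
        // the look-ahead by reading the underlying stream directly.
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int i = 0;
            for (; i < len; i++) {
                int c = read();
                if (c < 0) return i == 0 ? -1 : i;
                b[off + i] = (byte) c;
            }
            return i;
        }

        private void verifyCrc() throws IOException {
            long stored = ((tail[0] & 0xFFL) << 24) | ((tail[1] & 0xFFL) << 16)
                    | ((tail[2] & 0xFFL) << 8) | (tail[3] & 0xFFL);
            if (stored != crc.getValue()) {
                throw new IOException("CRC mismatch: body is corrupt");
            }
        }
    }

The byte-at-a-time bulk read is deliberately naive; a real version would shift the look-ahead against larger buffers.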
Good afternoon everyone,
First of all, I'll say that this is only for personal purposes, in a certain way: it's made for little projects to improve my Java knowledge, but my idea is to make this kind of thing to better understand the way developers work with sockets and bytes, as I really like to understand these things for my future ideas.
I'm currently making a lightweight HTTP server in Java to understand the way it works, and I've been reading documentation but still have some difficulties actually understanding parts of the official documentation. The main problem I'm facing (and something I'd like to know whether it's related or not) is that the Content-Length seems to be larger than the length of the data I get from the BufferedReader. I don't know if the issue is the way bytes are parsed into chars by the BufferedReader, so that it ends up with less data; probably what I have to do is treat this part as binary, so I'd have to read the bytes from the InputStream directly. But here comes the real problem I'm facing.
Since a Reader reads ahead a certain number of bytes and keeps them in its buffer, that data from the InputStream has already been consumed by the Reader and is no longer on the stream, so calling read() on the stream would just return -1, as there are no more bytes to read. A multipart body is divided into multiple parts separated by a boundary, with a newline delimiting the part headers from the content. I still have to get the headers as a String to process them, but the content should be handled as binary data, and, without modifying the buffer length (which would require knowing the exact length of just the headers), the content will most probably already have been transferred into the BufferedReader's buffer. Is it possible to recover it even from the data already consumed by the BufferedReader, or should I find a way to read that content as binary without it being decoded?
As I said, I'm new to working with sockets and services, so I don't know exactly what the possibilities are, or whether this is another kind of issue entirely, so any help would be appreciated. Thank you in advance.
Answer from Remy Lebeau, found in the comments, which proved useful for me:
Since multipart data is both textual and binary, you are going to have to do your own buffering of the socket data so you have more control and know where the data switches back and forth. At the very least, since you can read binary data directly from a BufferedInputStream, and access its internal buffer, you can let it handle the actual buffering for you, and it is not difficult to write a custom readLine() method that can read a line of text from a BufferedInputStream without using BufferedReader.
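A rough sketch of such a readLine() (names are mine; it assumes CRLF or bare LF line endings and ISO-8859-1 for header text, the usual HTTP default):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    final class LineReading {
        // Sketch: reads one line as raw bytes, so no part content is ever
        // pulled into a Reader's char buffer and lost to it.
        static String readLine(InputStream in) throws IOException {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) >= 0) {
                if (b == '\n') break;         // LF terminates the line
                if (b != '\r') line.write(b); // drop the CR of a CRLF pair
            }
            if (b < 0 && line.size() == 0) return null; // end of stream
            return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

Once the blank line after the part headers has been read this way, you can keep reading raw bytes from the same stream until the boundary; nothing has been swallowed by a Reader.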
I know the Java libraries pretty well, so I was surprised when I realized that, apparently, there's no easy way to do something seemingly simple with a stream. I'm trying to read an HTTP request containing multipart form data (large, multiline tokens separated by delimiters that look like, for example, ------WebKitFormBoundary5GlahTkFmhDfanAn--), and I want to read until I encounter a part of the request with a given name, and then return an InputStream of that part.
I'm fine with just reading the stream into memory and returning a ByteArrayInputStream, because the files submitted should never be larger than 1MB. However, I want to make sure that the reading method throws an exception if the file is larger than 1MB, so that excessively-large files don't fill up the JVM's memory and crash the server. The file data may be binary, so that rules out BufferedReader.readLine() (it drops newlines, which could be any of \r, \n, or \r\n, resulting in loss of data).
All of the obvious tokenizing solutions, such as Scanner, read the tokens as Strings, not streams, which could cause OutOfMemoryErrors for large files--exactly what I'm trying to avoid. As far as I can tell, there's no equivalent to Scanner that returns each token as an InputStream without reading it into memory. Is there something I'm missing, or is there any way to create something like that myself, using just the standard Java libraries (no Apache Commons, etc.), that doesn't require me to read the stream a character at a time and write all of the token-scanning code myself?
Addendum: Shortly before posting this, I realized that the obvious solution to my original problem was simply to read the full request body into memory, failing if it's too large, and then to tokenize the resulting ByteArrayInputStream with a Scanner. This is inefficient, but it works. However, I'm still interested to know if there's a way to tokenize an InputStream into sub-streams, without reading them into memory, without using extra libraries, and without resorting to character-by-character processing.
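For completeness, the size-capped read can be done with just the standard library; a sketch (the names and the 8 KB chunk size are mine):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    final class CappedReader {
        // Sketch: drains `in` into memory, failing fast once the cap is hit
        // so an oversized upload cannot exhaust the heap.
        static ByteArrayInputStream readCapped(InputStream in, int maxBytes)
                throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) > 0) {
                if (buf.size() + n > maxBytes) {
                    throw new IOException("body exceeds " + maxBytes + " bytes");
                }
                buf.write(chunk, 0, n);
            }
            return new ByteArrayInputStream(buf.toByteArray());
        }
    }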
It's not possible without loading them into memory (the solution you don't want) or saving them to disk (becomes I/O heavy). Tokenizing the stream into separate streams without loading it into memory implies that you can read the stream (to tokenize it) and be able to read it again later. In short, what you want is impossible unless your stream is seekable, but these are generally specialized streams for very specific applications and specialized I/O objects, like RandomAccessFile.
I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers with information about the data, in UTF-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data, rather than the beginning of the whole file. One of FileInputStream's constructors takes a FileDescriptor as an argument and I think that's my answer, but I don't know how to get one from another file-reading class. Am I approaching this correctly?
You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.
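A minimal sketch, assuming a file named data.bin and a fixed 120-byte header (both stand-ins; in practice the header length comes from parsing the header itself):

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class SkipHeaderDemo {
        public static void main(String[] args) throws IOException {
            try (FileInputStream in = new FileInputStream("data.bin")) {
                long headerLength = 120; // stand-in: parse your real header first
                // Jump straight to the binary section; subsequent reads on `in`
                // (and on any stream wrapping it) start at this offset.
                in.getChannel().position(headerLength);
                DataInputStream data = new DataInputStream(in);
                System.out.println("first value: " + data.readDouble());
            }
        }
    }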
We have a bunch of threads that each take a block of data, compress it, and eventually concatenate the results into one large byte array. If anyone can expand on this idea or recommend another method, that'd be awesome. I've currently got two methods that I'm trying out, but neither is working the way it should:
The first: I have each thread's run() function take the input data and just use GZIPOutputStream to compress it and write it to the buffer.
The problem with this approach is that, because each thread has one block of data that is part of a longer complete stream, when I call GZIPOutputStream it treats that little block as a complete piece of data to zip. That means it sticks on the header and trailer (I also use a custom dictionary, so I've got no idea how many bits the header is now, nor how to find out).
I think you could manually cut off the header and trailer and you would just be left with compressed data (keeping the header of the first block and the trailer of the last block). The other thing I'm not sure about with this method is whether I can even do that. If I leave the header on the first block of data, will it still decompress correctly? Doesn't that header contain information for ONLY the first block of the data and not the other concatenated blocks?
The second method is to use the Deflater class. In that case, I can simply set the input, set the dictionary, and then call deflate().
The problem is, that's not gzip format. That's just "raw" compressed data. I have no idea how to make it so that gzip can recognize the final output.
You need a method that writes to a single GZIPOutputStream that is called by the other threads, with suitable co-ordination between them so the data doesn't get mixed up. Or else have the threads write to temporary files, and assemble and zip it all in a second phase.
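A sketch of the first approach (all names mine): a single shared GZIPOutputStream with synchronized writes, so the output carries exactly one gzip header and one trailer. Block ordering still has to be enforced by whoever calls writeBlock, e.g. by handing blocks over in sequence.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // Sketch: one shared gzip stream; synchronization keeps blocks from
    // interleaving, so the output carries a single header and trailer.
    class GzipCollector {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final GZIPOutputStream gzip;

        GzipCollector() throws IOException {
            gzip = new GZIPOutputStream(bytes);
        }

        synchronized void writeBlock(byte[] block) throws IOException {
            gzip.write(block);
        }

        synchronized byte[] finish() throws IOException {
            gzip.finish(); // writes the gzip trailer (CRC32 + length)
            return bytes.toByteArray();
        }
    }

Note that this serializes the compression itself; if the point of the threads is to parallelize the compression work, the temporary-file variant is the one to pursue.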
So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them to later on be able to retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file location it sees corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the location of the current line.
I cannot use InputStreamReader directly for performance reasons. Unbuffered, it would be way too slow and, besides, it lacks convenience methods for reading lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be to use a kind of buffered reader that keeps track of file location and buffer offset... but this sounds quite cumbersome. Maybe I missed something, and perhaps something already exists to do this: read files line by line while keeping track of the location (even if zipped). A rough sketch of what I mean follows below.
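(To illustrate, the counting part; names are mine. It would sit between the GZIPInputStream and my own byte-level line reader, so the offsets it reports are positions in the decompressed data.)

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Rough sketch: counts every byte handed out, so a byte-level line
    // reader on top of it can record the offset where each line starts.
    class CountingInputStream extends FilterInputStream {
        private long position = 0;

        CountingInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException {
            int b = in.read();
            if (b >= 0) position++;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) position += n;
            return n;
        }

        long position() { return position; }
    }

The line reading would then have to happen at the byte level (scanning for '\n' and decoding each line as UTF-8 afterwards), since putting a BufferedReader on top would read ahead and make the count useless again.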
Thanks for the tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared both in InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...