Java: Reading a file containing both text and binary data

Java: Reading a file containing both text and binary data - java

I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers containing information about the data in UTC-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data, rather than the beginning of the whole file. One of the FileInputStream's constructors takes a FileDescriptor as the argument and I think that's my answer, but I don't know how to get one from another file reading class. Am I approaching this correctly?

You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.

Related

Confused about inputstreams and reading from files

I tried to understand the logic behind inputstreams and reading from files, but I fail to understand how you can read from a file using an inputstream.
My understanding is that when using input devices like a keyboard, you send input data through the input stream to the system. If you are reading from an input stream, aren't you reading the input data that's being send to the system at that time?
If we are creating an inputstream with the following code:
FileInputStream test = new FileInputStream("loremipsum.txt");
And if we try to read from the newly created inputstream with test.read(); how is there any data flowing through the inputstream? As no inputdata has been input from an input device at the time, but has already been input way beforehand. Is there something I'm missing out on? It almost seems to me as input streams are used in two different ways: Java using inputstreams to read data from a source and input devices using to input data to a source.

Java streams are a general concept / interface - a stream of data that you need to open, then read the data from (or write data to for output streams), then close. The basic stream only supports sequential reading / writing, no random access. Also, the data may or may not be readily available when you attempt to read from the stream, so the read may or may not block.
This abstraction allows us to use the same approach regardless of where we read the data from - it might be keyboard, a file, a network connection, output form another program or even some kind of generator that generates an endless sequence of data. Simply put, reading the input from file behaves the same as if someone in the background opened the file and typed its content on the keyboard really fast.
There are ways in Java to read the file in another ways (e.g. random access instead of sequential), but if you need to read the file from start to end, streams are a useful abstraction.

Java - possible to modify and parse gzipped xml files without unzipping?

I have an arraylist of gzipped xml files. Is it possible to view and manipulate the contents of these xml files all without unzipping them and taking up disk space? If so, what would be the correct class(es) to use for this task?
I know I can create a gzipinputstream from a fileinputstream of the zip file but from there I'm not sure what to do. I have only this written:
GZIPInputStream in = new GZIPInputStream(new FileInputStream(zippedFiles.get(i)));
I need some way to parse text within the xml files and modify the xml itself but again, extracting all of them would take up too much disk space.

What exactly are you going to achieve? You can extract the file into memory using a ByteArrayOutputStream and convert it into a byte-Array that you forward to your XML parser library (converting it to String and passing that is not recommended as the encoding is specified inside the XML file itself and the conversion to String must therefore be done by the XML parser internally). Most XML parsers also support reading directly from any InputStream, so you could pass yours directly to it which will probably further reduce your memory consumption. Disk space will only be occupied when writing data back to it by simply reversing the described procedure. Still, as you directly replace the source file by overwriting it, there is nowhere any disk space wasted.

The fact that they're in a list doesn't change much, but no.
Ignoring compression, files are stored linearly on disks. You can append to them cheaply, you can replace bytes cheaply, but you can't replace sequences of different lengths (like replace("Testing Procedure Specification", "TPS")) without rewriting the file after the modified substring.
Gziping the file complicates things, but the same rule applies. In general, making arbitrary modifications to a file requires rewriting the file.
Your code for reading the files is on the right track, though. You can easily read through gziped files as streams and without having to decompress the entire file.

Avoid obtaining same InputStream more than once

I can see there are a number of posts regarding reuse InputStream. I understand InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining the DropBoxInputStream using the DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the download, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading the file. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am thinking that, if there are many way for me to "cache" or "backup" the InputStream before I calculate the MD5 so that I can save step 3 of obtaining the same DropBoxInputStream again?
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is that I use a MD5DigestOutputStream to calculate MD5. I stream data across the MD5DigestOutputStream and save them locally as a temp file. Once the data goes through the MD5DigestOutputStream, it will calculate the MD5.
I then call a third party library to upload the file using the calculated md5 and a FileInputStream which reads from the temp file.
However, this requires huge disk space sometime and I want to remove the needs to use temp file. The library I use only accepts a MD5 and InputStream. This means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write data to /dev/null (not keeping the file) so that I can calculate theMD5, and get the InputStream from DropBox again and pass that to the library I use. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in the memory of at the disk. Will it work?

Input streams aren't really designed for creating copies or re-using, they're specifically for situations where you don't want to read off into a byte array and use array operations on that (this is especially useful when the whole array isn't available, as in, for e.g. socket comunication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte array buffer until you have enough information.
But that's unnecessary for calculating an md5. Notice that InputStream is abstract, so it needs be implemented in an extended class. It has many implementations- GZIPInputStream, fileinputstream etc. These are, in design pattern speak, decorators of the IO stream: they add extra functionality to the abstract base IO classes. For example, GZIPInputStream gzips up the stream.
So, what you need is a stream to do this for md5. There is, joyfully, a well documented similar thing: see this answer. So you should just be able to pass your dropbox input stream (as it will be itself an input stream) to create a new DigestInputStream, and then you can both take the md5 and continue to read as before.
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class interfaces all the methods and 'beef' you need to do your IO, there's no harm in passing instances of objects inheriting from InputStream in the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question- say you still want to "cache" or "backup" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become faff when your streams get more complicated. Alternatively, try looking at a PushbackInputStream. Here, you can easily write a function to read off n bytes, perform and operation on them, and then restore them to the stream. Generally good to avoid these implementations of streams in Java, as it's bad for memory use, but no worse than buffering everything up which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.

You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).

Java 6 File Input Output Stream (same file)

I searched and looked at multiple questions like this, but my question is really different than anything I found. I've looked at Java Docs.
How do I get the equivalent of this c file open:
stream1 = fopen (out_file, "r+b");
Once I've done a partial read from the file, the first write makes the next read return EOF no matter how many bytes were in the file.
Essentially I want a file I/O stream that doesn't do that. The whole purpose of what I'm trying to do is to replace the bytes in an existing file in the current file. I don't want to do it in a copy or make a copy before I do the Read->Write.

You can use a RandomAccessFile.

As Perception mentions, you can use a RandomAccessFile. Also, in some situations, a FileChannel may work better. I've used these to handle binary file data with great success.
EDIT: you can get a FileChannel from the RandomAccessFile object using getChannel.

Reading a gz file and keeping track of position in file

So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them to later on be able to retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly since the file location corresponds to what has been buffered so far. In other words, a multiple of the internal buffer size instead of the line location.
I cannot use InputStreamReader directly for performance reasons. Unbuffered would be way to slow, and, btw, lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be use a kind of of buffered reader keeping track of file location and buffer offset ...but this sounds quite cumbersome. But maybe I missed something. Perhaps there is already something existing to do that, to read files line by lines and keep track of location (even if zipped).
Thanks for tips,
Arnaud

I think jzran could be pretty much what you're looking for:
It's a Java library based on the
zran.c sample from zlib.
You can preprocess a large gzip
archive, producing an "index" that can
be used for random read access.
You can balance between index size and
access speed.

What you are looking for is called mark(), markSupported() and skip().
This methods are declared both in InputStream and Reader, so you are welcome to use them.

GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.