Is there a way to force the temporary files created in a Java program to be kept in memory? Since I work with several large XML files, would I gain anything this way? Ideally the approach would be transparent, so that I don't have to disturb the existing application.
UPDATE: I'm looking at the source code and I noticed that it uses libraries (which I cannot change) that require the paths of those files...
Thanks
The only way I can think of is to create a RAM disk and then point the system property java.io.tmpdir to that RAM disk.
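A minimal sketch, assuming a RAM disk is already mounted at /mnt/ramdisk (a Linux tmpfs mount, for example; the path is an assumption):

public class RamDiskTempFiles {
    public static void main(String[] args) throws java.io.IOException {
        // Safer to pass -Djava.io.tmpdir=/mnt/ramdisk on the command line,
        // since parts of the JVM may read the property before main() runs.
        System.setProperty("java.io.tmpdir", "/mnt/ramdisk");

        java.io.File tmp = java.io.File.createTempFile("big-xml-", ".xml");
        System.out.println(tmp.getAbsolutePath()); // now lives on the RAM disk
    }
}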
XML is just a String, so why not just keep the Strings in memory? The File interface is a distraction here. Use StringBuilder if you need to manipulate the data, or StringBuffer if you need thread safety. Put them in a type-safe Map if you have a variable number of documents that need to be looked up by key.
If you absolutely have to keep the File interface, then create an InMemoryFileWriter that wraps ByteArrayOutputStream and ByteArrayInputStream to keep the data in memory. But again, I think the whole file-in-memory idea is a bad decision if you just want to cache things: that is a lot of overhead when a simple String would do.
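Something along these lines (a minimal sketch; InMemoryFile is a hypothetical name):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class InMemoryFile {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Write the "file" contents into memory.
    public OutputStream getOutputStream() {
        return buffer;
    }

    // Each call returns a fresh stream over the bytes written so far.
    public InputStream getInputStream() {
        return new ByteArrayInputStream(buffer.toByteArray());
    }
}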
Don't use files if you don't have to. Consider com.google.common.io.FileBackedOutputStream from Guava:
An OutputStream that starts buffering to a byte array, but switches to file buffering once the data reaches a configurable size.
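A rough usage sketch (the 1 MB threshold is an arbitrary example):

import com.google.common.io.FileBackedOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SpillExample {
    public static void main(String[] args) throws IOException {
        byte[] xmlBytes = "<root/>".getBytes(StandardCharsets.UTF_8);
        // Buffers in memory up to 1 MB, spills to a temp file beyond that.
        FileBackedOutputStream out = new FileBackedOutputStream(1024 * 1024);
        try {
            out.write(xmlBytes);
            byte[] data = out.asByteSource().read(); // reads back from RAM or disk
            System.out.println(data.length + " bytes");
        } finally {
            out.reset(); // frees the buffer and deletes the backing file, if any
        }
    }
}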
You could probably force the default behaviour of java.io.File with some reflection magic, but I'm sure you don't want to do that, as it can lead to unpredictable behaviour. You're better off providing a mechanism that can switch between the usual and in-memory behaviour, and routing all calls through it.
Look at this example; it shows how to use the file API to create in-memory files.
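A minimal sketch of such a switch point (all names here are hypothetical):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// All temp-file access goes through this interface, so swapping
// disk-backed for in-memory storage becomes one configuration change.
interface TempStore {
    OutputStream create(String name) throws IOException;
    InputStream open(String name) throws IOException;
}

class InMemoryTempStore implements TempStore {
    private final Map<String, ByteArrayOutputStream> files = new HashMap<>();

    public OutputStream create(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        files.put(name, out);
        return out;
    }

    public InputStream open(String name) {
        return new ByteArrayInputStream(files.get(name).toByteArray());
    }
}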
Assuming you have control over the streams that are being used to write to the file:
Do you absolutely need the in-memory behaviour? If all you want is to reduce the number of system calls that write to the disk, you can wrap the FileOutputStream in a BufferedOutputStream (with a suitably large buffer size) and write to that BufferedOutputStream (or BufferedWriter) instead of writing directly to the original FileOutputStream.
(This does require a change to the existing application.)
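For instance, a minimal sketch (the 1 MB buffer size is just an example to tune):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class BufferedWriteExample {
    public static void main(String[] args) throws IOException {
        // Bytes hit the disk only when the buffer fills or on flush/close.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("large.xml"), 1024 * 1024)) {
            out.write("<root/>".getBytes(StandardCharsets.UTF_8));
        }
    }
}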
I am receiving files through a socket and saving them to a database. So I'm receiving the byte stream and passing it to a back-end process, say Process1, for the DB save. I want to do this without saving the stream to disk: rather than storing the incoming stream as a file on disk and then passing that file to Process1, I'd like to pass it while it's still in memory. This is to eliminate the costly disk read and write. One way I can do this is to pass the byte[] to Process1. I'm wondering whether there's a better way of doing this.
TIA.
You can use a ByteArrayOutputStream. It is, essentially, a growable byte[] which you can write into at will, within the limit of your available heap space.
After having written to it, flushed it and closed it (although the last two operations are essentially no-ops, that's no reason to ditch sane practices), you can obtain the underlying byte array using this class's .toByteArray().
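A minimal sketch, assuming Process1 can accept a byte[]:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SocketToDb {
    // Drains the socket stream into memory; fine as long as the
    // payload fits comfortably in the heap.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray(); // hand this byte[] to Process1
    }
}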
Socket sounds like what you are looking for.
I'm on mobile (Android) and have a large text file, about 50 MB. I want to be able to open the file, seek to a particular position, and then start reading data into a buffer from that point. Is FileReader + BufferedReader the best way to do this if I want to use as little memory as possible?
BufferedReader in = new BufferedReader(new FileReader("foo.txt"));
in.skip(byteCount); // in some cases I have to read from an offset
// start reading a line at a time here
I'll also need to write to the file, only ever appending data, so:
FileWriter w = new FileWriter("foo.txt", true);
w.write(someCharacters);
I'm primarily interested in knowing whether, by using the wrong reader/writer classes, I might accidentally load the entire file contents into memory before the reads or writes.
Thanks
Basically you don't want to read the whole file, just a certain portion of it. In that case use java.io.RandomAccessFile instead:
- its seek() method is guaranteed to actually seek, instead of reading and discarding (which is what some implementations of InputStream.skip() actually do);
- the seek() method can move the file pointer backwards, something you can't do with an InputStream;
- a getFilePointer() method is provided to get the current position in the file;
- it only reads what you tell it to read, so there's no fear of accidentally loading more than you want.
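A minimal sketch covering both the seek-then-read and the append cases from the question (the offset and contents are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class SeekExample {
    public static void main(String[] args) throws IOException {
        long byteOffset = 1024; // example offset
        try (RandomAccessFile raf = new RandomAccessFile("foo.txt", "rw")) {
            raf.seek(byteOffset);         // jump straight to the offset
            String line = raf.readLine(); // caveat: decodes bytes as Latin-1, not UTF-8
            raf.seek(raf.length());       // move to the end ...
            raf.write("appended\n".getBytes(StandardCharsets.UTF_8)); // ... and append
        }
    }
}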
My dictionary app uses RandomAccessFile to access about 45 MB of data, dating back to when each Android app could only use 16 MB of RAM. A service running my dictionary engine, which operates on the same 45 MB of data, uses only about 2 MB of RAM (and most of that was probably the Dalvik VM rather than my search engine). So this class definitely works as intended.
You could try using a memory-mapped file (java.nio.channels.FileChannel.map()). I'm not sure how much heap space would be allocated for this, though.
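Roughly like this (a sketch; the mapping is paged in lazily by the OS rather than copied onto the Java heap):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapExample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("foo.txt", "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(
                    FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buf.position(1024); // a cheap "seek"
            byte b = buf.get(); // read from that position
        }
    }
}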
I can see there are a number of posts about reusing an InputStream. I understand that an InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I download the file from Dropbox by obtaining a DropBoxInputStream using Dropbox's Java SDK. I then need to upload the file to another system by passing it the InputStream. However, as part of the upload I have to provide the MD5 of the file, so I have to read the whole stream before uploading. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am wondering whether there is a way for me to "cache" or "back up" the InputStream before I calculate the MD5, so that I can skip step 3 of obtaining the same DropBoxInputStream again.
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is using an MD5DigestOutputStream to calculate the MD5. I stream the data through the MD5DigestOutputStream and save it locally as a temp file; once all the data has gone through, the MD5 has been calculated.
I then call a third-party library to upload the file, using the calculated MD5 and a FileInputStream that reads from the temp file.
However, this sometimes requires huge disk space, and I want to remove the need for a temp file. The library I use only accepts an MD5 and an InputStream, which means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write the data to /dev/null (not keeping the file) so that I can calculate the MD5, then get the InputStream from DropBox again and pass that to the library. I assume the library will be able to get the file directly from DropBox without me caching it either in memory or on disk. Will that work?
Input streams aren't really designed for copying or reuse; they're specifically for situations where you don't want to read everything into a byte array and use array operations on it (this is especially useful when the whole array isn't available, e.g. in socket communication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte-array buffer until you have enough information.
But that's unnecessary for calculating an MD5. Notice that InputStream is abstract, so it needs to be implemented in an extended class, and it has many implementations: GZIPInputStream, FileInputStream, etc. These are, in design-pattern speak, decorators of the IO stream: they add extra functionality to the abstract base IO classes. For example, GZIPInputStream decompresses a gzipped stream.
So what you need is a stream that does this for MD5. There is, joyfully, a well-documented similar thing: see this answer. So you should just be able to pass your Dropbox input stream (as it will itself be an InputStream) when creating a new DigestInputStream, and then you can both take the MD5 and continue to read as before.
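A minimal sketch of that approach (System.in stands in for the Dropbox stream; note the digest is only complete once the stream has been fully consumed):

import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class Md5WhileReading {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Every byte read through the wrapper also updates the digest.
        try (InputStream in = new DigestInputStream(System.in, md5)) {
            byte[] chunk = new byte[8192];
            while (in.read(chunk) != -1) { /* a real caller would upload here */ }
        }
        System.out.printf("%032x%n", new java.math.BigInteger(1, md5.digest()));
    }
}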
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class exposes all the methods and 'beef' you need for your IO, there's no harm in passing instances of objects inheriting from InputStream into the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question: say you still want to "cache" or "back up" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become a faff when your streams get more complicated. Alternatively, try looking at PushbackInputStream: you can easily write a function that reads off n bytes, performs an operation on them, and then restores them to the stream. It's generally good to avoid these stream implementations in Java, as they're bad for memory use, but no worse than buffering everything up, which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.
You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally: either in memory (in a byte array) or stored in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or from disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).
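A minimal sketch of the in-memory variant (the third-party upload call is hypothetical):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class CacheContents {
    public static void upload(InputStream dropboxIn) throws Exception {
        // 1. Drain the Dropbox stream once into memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = dropboxIn.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        byte[] data = buffer.toByteArray();

        // 2. Compute the MD5 from the cached bytes.
        byte[] md5 = MessageDigest.getInstance("MD5").digest(data);

        // 3. Create as many fresh InputStreams over the data as needed.
        InputStream forUpload = new ByteArrayInputStream(data);
        // thirdPartyLib.upload(md5, forUpload); // hypothetical call
    }
}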
I have a simple function used for deleting files: it checks the file size and, if it is smaller than a specific value, deletes the file. However, this function will be called thousands of times, and each call creates a new File instance. I think the File object creation will be expensive. Is there any way around this?
public void checkFile(String filePath) {
    File file = new File(filePath); // this is expensive
    if (file.length() < 500) {
        file.delete();
    }
}
The effect of new File() on performance, compared to the cost of checking the file size on disk, is minuscule. Don't worry about it.
If you really really think that it will make a difference, measure it and then optimise it.
IMHO "thinking" isn't good enough; have you really identified that File object creation is a bottle neck in your application? Anyways, I don't think you can delete a file without creating a File object, unless you are planning on writing your own "native" method which unlinks the file by just taking in the file path as a string.
Why would the code be expensive? Creating temporary objects in Java is not expensive anymore, due to generational GC. And a File is just an object encapsulating a path to the file system. It's not expensive to create one.
The standard Java API does not allow this, and thousands of calls is almost nothing for a modern computer. Creating a java.io.File instance takes less time than the deletion itself, so do not worry. If you do see a problem with this code, you can create a cache as a Map<String, File> and get the file instances from there.
But again, do not do this unless you can see that this is your problem. No premature optimization!
There is no way to delete a file in pure Java that doesn't entail creating a File object. The impure alternatives are:
using JNI or JNA to call native code that will call unlink or the Windows equivalent,
running the rm or del command as an external process.
The first is at best only marginally faster than new File().delete(). The second is significantly slower.
I'd say that 90+% of the cost of new File().delete() is in the system call and the operating system's file system layers.
So, here is the situation:
I have to read big .gz archives (GBs) and "index" them so that I can later retrieve specific pieces using random access.
In other words, I wish to read the archive line by line and record the exact location in the file of each such line, so that I can jump directly to those locations on request. (PS: it's UTF-8, so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file position it sees corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the position of the current line.
I cannot use InputStreamReader directly for performance reasons: unbuffered it would be way too slow, and, by the way, it lacks convenience methods for reading lines.
I cannot use RandomAccessFile since 1. the file is zipped, and 2. RandomAccessFile uses "modified" UTF-8.
I guess the best option would be a kind of buffered reader that keeps track of the file position and the buffer offset... but this sounds quite cumbersome. Then again, maybe I missed something; perhaps something already exists for reading files line by line while keeping track of the position (even when zipped).
Thanks for tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip(). These methods are declared in both InputStream and Reader, so you are welcome to use them.
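A minimal sketch; note that mark()/reset() can only rewind within the read-ahead limit you pass to mark(), so this is not full random access:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MarkResetExample {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("file.txt"))) {
            if (in.markSupported()) {
                in.mark(8192);                 // remember this spot (8 KB read-ahead limit)
                String peeked = in.readLine(); // look ahead ...
                in.reset();                    // ... then rewind to the mark
            }
            in.skip(100); // skip forward 100 characters
        }
    }
}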
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...