Every 5 seconds (for example), a server checks whether files have been added to a specific directory. If so, it reads and processes them. The files in question can be quite big (100+ MB, for example), so copying/uploading them to that directory can take quite a while.
What if the server tries to access a file that hasn't finished being copied/uploaded? How does Java manage these concurrent accesses? Does it depend on the OS of the server?
I ran a test, copying a ~1,300,000-line TXT file (about 200 MB) from a remote server to my local computer: it takes about 5 seconds. During that window, I ran the following Java class:
public static void main(String[] args) throws Exception {
String local = "C:\\large.txt";
BufferedReader reader = new BufferedReader(new FileReader(local));
int lines = 0;
while (reader.readLine() != null)
lines++;
reader.close();
System.out.println(lines + " lines");
}
I get the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:515)
at java.lang.StringBuffer.append(StringBuffer.java:306)
at java.io.BufferedReader.readLine(BufferedReader.java:345)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at main.Main.main(Main.java:15)
When running the class once the file has finished being copied, I get the expected output (i.e. 1229761 lines), so the exception isn't due to the size of the file itself (as one might think at first). What is Java doing in the background that throws this OutOfMemoryError?
How does Java manage these concurrent accesses? Does it depend on the OS of the server?
It depends on the specific OS. If the copy and the server run in a single JVM, the AsynchronousFileChannel class (new in 1.7) could be a great help. However, if the client and server are separate JVMs (or, even more so, started on different machines), it all becomes platform-specific.
From JavaDoc for AsynchronousFileChannel:
As with FileChannel, the view of a file provided by an instance of this class is guaranteed to be consistent with other views of the same file provided by other instances in the same program. The view provided by an instance of this class may or may not, however, be consistent with the views seen by other concurrently-running programs due to caching performed by the underlying operating system and delays induced by network-filesystem protocols. This is true regardless of the language in which these other programs are written, and whether they are running on the same machine or on some other machine. The exact nature of any such inconsistencies are system-dependent and are therefore unspecified.
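For reference, here is a minimal sketch (not taken from the answer) of reading the beginning of a file with AsynchronousFileChannel; it assumes Java 7+ and reuses the C:\large.txt path from the question purely as an example, blocking on the Future for simplicity.
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncReadSketch {
    public static void main(String[] args) throws Exception {
        // Open the file the server is polling (path reused from the question).
        try (AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                Paths.get("C:\\large.txt"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            Future<Integer> result = channel.read(buffer, 0); // read starting at position 0
            int bytesRead = result.get();                     // block until the read completes
            System.out.println("Read " + bytesRead + " bytes");
        }
    }
}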
Why are you using a buffered reader just to count the lines?
From the javadoc:
Reads text from a character-input stream, buffering characters so as to provide for the efficient reading of characters, arrays, and lines.
This means it will "buffer", i.e. save, that entire file in memory, which is what causes your OutOfMemoryError. Try a plain FileReader.
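If you only need a line count, a minimal sketch of the plain-FileReader approach suggested above could look like this; it counts newline characters one at a time, so no whole line is ever held in memory (whether a final unterminated line should count is left as an assumption).
import java.io.FileReader;
import java.io.Reader;

public class LineCount {
    public static void main(String[] args) throws Exception {
        int lines = 0;
        try (Reader reader = new FileReader("C:\\large.txt")) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '\n') {
                    lines++;   // count line terminators character by character
                }
            }
        }
        System.out.println(lines + " lines");
    }
}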
Related
I'm doing some file I/O with multiple files (writing to 19 files, it so happens). After writing to them a few hundred times I get the Java IOException: Too many open files. But I actually have only a few files opened at once. What is the problem here? I can verify that the writes were successful.
On Linux and other UNIX / UNIX-like platforms, the OS places a limit on the number of open file descriptors that a process may have at any given time. In the old days, this limit used to be hardwired1, and relatively small. These days it is much larger (hundreds / thousands), and subject to a "soft" per-process configurable resource limit. (Look up the ulimit shell builtin ...)
Your Java application must be exceeding the per-process file descriptor limit.
You say that you have 19 files open, and that after a few hundred times you get an IOException saying "too many files open". Now this particular exception can ONLY happen when a new file descriptor is requested; i.e. when you are opening a file (or a pipe or a socket). You can verify this by printing the stacktrace for the IOException.
Unless your application is being run with a small resource limit (which seems unlikely), it follows that it must be repeatedly opening files / sockets / pipes, and failing to close them. Find out why that is happening and you should be able to figure out what to do about it.
FYI, the following pattern is a safe way to write to files that is guaranteed not to leak file descriptors.
Writer w = new FileWriter(...);
try {
// write stuff to the file
} finally {
try {
w.close();
} catch (IOException ex) {
// Log error writing file and bail out.
}
}
1 - Hardwired, as in compiled into the kernel. Changing the number of available fd slots required a recompilation ... and could result in less memory being available for other things. In the days when Unix commonly ran on 16-bit machines, these things really mattered.
UPDATE
The Java 7 way is more concise:
try (Writer w = new FileWriter(...)) {
// write stuff to the file
} // the `w` resource is automatically closed
UPDATE 2
Apparently you can also encounter a "too many files open" while attempting to run an external program. The basic cause is as described above. However, the reason that you encounter this in exec(...) is that the JVM is attempting to create "pipe" file descriptors that will be connected to the external application's standard input / output / error.
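A minimal sketch of what that looks like in practice, assuming a Unix-like system and a hypothetical ls command: the pipe streams are drained and closed so their descriptors are released.
import java.io.IOException;
import java.io.InputStream;

public class ExecSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("ls", "-l").start();   // hypothetical external command
        // Drain and close the pipe descriptors the JVM created for stdout/stderr.
        try (InputStream out = p.getInputStream();
             InputStream err = p.getErrorStream()) {
            while (out.read() != -1) { /* discard stdout */ }
            while (err.read() != -1) { /* discard stderr */ }
        }
        p.waitFor();
    }
}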
For UNIX:
As Stephen C has suggested, changing the maximum file descriptor value to a higher value avoids this problem.
Try looking at your present file descriptor capacity:
$ ulimit -n
Then change the limit according to your requirements.
$ ulimit -n <value>
Note that this just changes the limits in the current shell and any child / descendant process. To make the change "stick" you need to put it into the relevant shell script or initialization file.
You're obviously not closing your file descriptors before opening new ones. Are you on Windows or Linux?
Although in most cases the error quite clearly means that file handles have not been closed, I just encountered an instance with JDK7 on Linux that is convoluted enough to be worth explaining here.
The program opened a FileOutputStream (fos), a BufferedOutputStream (bos) and a DataOutputStream (dos). After writing to the DataOutputStream, the dos was closed and I thought everything had gone fine.
Internally, however, the dos tried to flush the bos, which returned a "disk full" error. That exception was swallowed by the DataOutputStream, and as a consequence the underlying bos was not closed, hence the fos was still open.
At a later stage the file was renamed from its temporary name (something with a .tmp extension) to its real name. Thereby, the Java file descriptor tracking lost track of the original .tmp file, yet it was still open!
To solve this, I had to first flush the DataOutputStream myself, catch the IOException, and close the FileOutputStream myself.
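A minimal sketch of that workaround, assuming the same fos/bos/dos wrapping (data.tmp is a hypothetical file name): flush the DataOutputStream explicitly so a write failure surfaces as an IOException, and close the underlying FileOutputStream yourself no matter what.
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class SafeCloseSketch {
    public static void main(String[] args) throws IOException {
        FileOutputStream fos = new FileOutputStream("data.tmp");   // hypothetical file name
        DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(fos));
        try {
            dos.writeInt(42);
            dos.flush();          // flush explicitly so a "disk full" error surfaces here
        } finally {
            try {
                dos.close();      // may swallow a flush failure, as described above
            } finally {
                fos.close();      // always release the underlying file descriptor
            }
        }
    }
}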
I hope this helps someone.
If you're seeing this in automated tests: it's best to properly close all files between test runs.
If you're not sure which file(s) you have left open, a good place to start is the "open" calls which are throwing exceptions! 😄
If you have a file handle that should be open exactly as long as its parent object is alive, you could add a finalize method on the parent that calls close on the file handle, and call System.gc() between tests.
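A minimal sketch of that finalize-based safety net (my own illustration, using a hypothetical TrackedFile wrapper); note that finalize() is deprecated in recent JDKs and try-with-resources is the preferred fix.
import java.io.FileInputStream;
import java.io.IOException;

class TrackedFile {
    private final FileInputStream in;

    TrackedFile(String path) throws IOException {
        this.in = new FileInputStream(path);
    }

    void close() throws IOException {
        in.close();
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            in.close();   // last-resort close if the owner forgot to call close()
        } finally {
            super.finalize();
        }
    }
}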
Recently, I had a program batch-processing files. I was certainly closing each file in the loop, but the error was still there.
Later, I resolved the problem by forcing garbage collection every few hundred files:
int index = 0;
while (hasMoreFiles()) {   // loop condition and stream setup elided in the original
    try {
        // do with outputStream...
    } finally {
        out.close();
    }
    if (index++ % 100 == 0)
        System.gc();
}
I know this may get a downvote, but this is bothering me a lot.
I have already read all the posts on the .close() method, like:
explain the close() method in Java in Layman's terms
Why do I need to call a close() or shutdown() method?
the usage of close() method(Java Beginner)
I have these questions, which may seem too trivial:
1. What does the word 'resource' exactly mean? (Is it the file, the FileWriter object, or something else? Try to explain as broadly as possible.)
Let's consider the following code:
import java.io.*;
public class characterstreams
{
public static void main(String []args) throws Exception
{
File f=new File("thischaracter.txt");
FileWriter fw=new FileWriter(f);
char[] ch={'a','c','d'};
fw.write('a');
fw.write(ch);
fw.write("aaaa aaaaa aaaaaaa");
fw.flush();
FileReader fr=new FileReader(f);
int r=fr.read();
System.out.println(r);
char[] gh=new char[30];
System.out.println(fr.read(gh));
}
}
After compiling and executing it:
G:/>java characterstreams
Let's say the resource is the FileWriter below (since I have yet to grasp the meaning of 'resources').
The JVM starts and opens up the so-called resources; then execution completes, after which the JVM shuts down.
2. It unlocks the resources that it has opened, since it's no longer running, right? (Correct me if I am wrong.)
G:/>
At this point the JVM is not running.
3. Before shutting down, the garbage collector is called, right? (Correct me if I am wrong.) So the FileWriter objects get destroyed.
Then why are we supposed to close all the resources that we have opened up?
And:
4. I have also read that 'resources get leaked'. What is that supposed to mean?
A resource is anything which is needed by the JVM and/or the operating system to provide you with the functionality you request.
Take your example: if you open a FileWriter, the operating system in general (it depends on the operating system, file system, etc.) will do the following (assuming you want to write a file to a disc, like an HDD/SSD):
create a directory entry for the requested filename
create a data structure to maintain the writing process to the file
allocate disc space if you actually write data to the file
(note: this is not an exhaustive list)
These steps are performed for any file you open for writing. If you don't close the resource, all of this remains in memory and is still maintained by the operating system.
Assume your application runs for a long time and is constantly opening files. The number of files the operating system allows you to keep open is limited (the concrete number depends on the operating system, quota settings, ...). If the resources are exhausted, something will behave unexpectedly or fail.
Find below a small demonstration on Linux.
public static void main(String[] args) throws IOException {
List<OutputStream> files = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
files.add(Files.newOutputStream(Paths.get("/tmp/demo." + i),
StandardOpenOption.CREATE));
}
}
The code opens one thousand files for writing.
Assume your limit of open files is 1024
ulimit -n
1024
If you run the snippet, it will generate 1000 files /tmp/demo.*.
If your limit of open files is only 100, the code will fail:
ulimit -n 100
java.nio.file.FileSystemException: /tmp/demo.94: Too many open files
(It fails before reaching 100 because the JVM itself already has some files open.)
To prevent such problems (lack of resources), you should close files which you no longer need to write to. If you don't do it in Java (via close()), the operating system also doesn't know whether the memory etc. can be freed and used for another request.
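To make that concrete, here is a minimal rework of the question's example using try-with-resources, so the writer and reader (and the file descriptors behind them) are always closed; it is a sketch, not a drop-in replacement for the original class.
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class CharacterStreamsClosed {
    public static void main(String[] args) throws Exception {
        File f = new File("thischaracter.txt");
        // The writer is closed automatically, even if write() throws.
        try (FileWriter fw = new FileWriter(f)) {
            fw.write("aaaa aaaaa aaaaaaa");
        }
        // Same for the reader.
        try (FileReader fr = new FileReader(f)) {
            System.out.println(fr.read());
        }
    }
}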
Here is my code:
public void mapTrace(String Path) throws FileNotFoundException, IOException {
    FileReader arq = new FileReader(new File(Path));
    BufferedReader leitor = new BufferedReader(arq, 41943040);   // 40 MB read buffer
    Integer page;
    String std;
    Integer position = 0;
    while ((std = leitor.readLine()) != null) {
        position++;
        page = Integer.parseInt(std, 16);            // each line is a page number in hex
        LinkedList<Integer> values = map.get(page);
        if (values == null) {
            values = new LinkedList<>();
            map.put(page, values);
        }
        values.add(position);                        // record where this page was referenced
    }
    for (LinkedList<Integer> referenceList : map.values()) {
        Collections.reverse(referenceList);
    }
}
This is the HashMap structure
Map<Integer, LinkedList<Integer>> map = new HashMap<>();
For 50 MB - 100 MB trace files I don't have any problem, but for bigger files I get:
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
I don't know if the reverse method is increasing the memory use, if the LinkedList uses more space than other List structures, or if the way I'm adding the lists to the map takes more space than it should. Can anyone tell me what's using so much space?
Can anyone tell me what's using so much space?
The short answer is that it is probably the space overheads of the data structure you have chosen that is using the space.
By my reckoning, a LinkedList<Integer> on a 64 bit JVM uses about 48 bytes of storage per integer in the list including the integers themselves.
By my reckoning, a Map<?, ?> on a 64 bit machine will use in the region of 48 bytes of storage per entry, excluding the space needed to represent the key and the value objects.
Now, your trace size estimates are rather too vague for me to plug the numbers in, but I'd expect a 1.5 GB trace file to need a LOT more than 2 GB of heap.
Given the numbers you've provided, a reasonable rule-of-thumb is that a trace file will occupy roughly 10 times its file size in heap memory ... using the data structure that you are currently using.
You don't want to configure a JVM to try to use more memory than the physical RAM available. Otherwise, you are liable to push the machine into thrashing ... and the operating system is liable to start killing processes. So for an 8 GB machine, I wouldn't advise going over -Xmx8g.
Putting that together, with an 8 GB machine you should be able to cope with a 600 MB trace file (assuming my estimates are correct), but a 1.5 GB trace file is not feasible. If you really need to handle trace files that big, my advice would be to either:
design and implement custom collection types for your specific use-case that use memory more efficiently (a rough sketch follows after this list),
rethink your algorithms so that you don't need to hold the entire trace files in memory, or
get a bigger machine.
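As a rough illustration of the first option (my sketch only, assuming Java 8+ and hypothetical PositionList/CompactTraceMap names): storing each page's positions in a growable int[] avoids one Integer object plus one LinkedList node per entry.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class PositionList {
    private int[] values = new int[8];
    private int size = 0;

    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);   // grow geometrically
        }
        values[size++] = value;
    }

    int size() { return size; }
    int get(int i) { return values[i]; }
}

class CompactTraceMap {
    private final Map<Integer, PositionList> map = new HashMap<>();

    void record(int page, int position) {
        map.computeIfAbsent(page, k -> new PositionList()).add(position);
    }
}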
I did some tests before reading your comment. I set -Xmx14g and processed the 600 MB file; it took some minutes (about 10), but it did fine.
The -Xmx14g option sets the maximum heap size. Based on the observed behaviour, I expect that the JVM didn't need anywhere near that much memory ... and didn't request it from the OS. And if you'd looked at memory usage in the task manager, I expect you'd have seen numbers consistent with that.
Then I set -Xmx18g and tried to process the 1.5 GB file, and it's been running for about 20 minutes. My memory in the task manager is going from 7.80 to 7.90. I wonder if this will ever finish; how could I use MORE memory than I have? Does it use the HDD as virtual memory?
Yes, that is what it does.
Yes, each page of your process's virtual address space corresponds to a page on the hard disc.
If you've got more virtual pages than physical memory pages, at any given time some of those virtual memory pages will live on disk only. When your application tries to use one of those non-resident pages, the VM hardware generates an interrupt, and the operating system finds an unused physical page, populates it from the disc copy, and then hands control back to your program. But if your application is busy, it will have had to make that physical page available by evicting another page. And that may have involved writing the contents of the evicted page to disc.
The net result is that when you try to use significantly more virtual address pages than you have physical memory, the application generates lots of interrupts that result in lots of disc reads and writes. This is known as thrashing. If your system thrashes too badly, it will spend most of its time waiting for disc reads and writes to finish, and performance will drop dramatically. And on some operating systems, the OS will attempt to "fix" the problem by killing processes.
Further to Stephen's quite reasonable answer, everything has its limit and your code simply isn't scalable.
In cases where the input is "large" (as in your case), the only reasonable approach is a stream-based approach, which, while (usually) more complicated to write, uses very little memory/resources. Essentially you hold in memory only what you need to process the current task, then release it as soon as possible.
You may find that Unix command-line tools are your best weapon, perhaps using a combination of awk, sed, grep etc. to massage your raw data into a hopefully usable "end format".
I once stopped a colleague from writing a Java program to read in and parse XML and issue insert statements to a database: I showed him how to use a series of piped commands to produce executable SQL, which was then piped directly into the database command-line tool. It took about 30 minutes to get it right, but the job was done. And the file was massive, so in Java it would have required a SAX parser plus JDBC, which aren't fun.
To build this structure, I would put the data in a key/value datastore like Berkeley DB for Java.
Pseudo-code:
putData(db,page,value)
{
Entry key=new Entry();
Entry data=new Entry();
List<Integer> L=new LinkedList<Integer>();
IntegerBinding.intToEntry(page,key);
if(db.get(key,data)==OperationStatus.SUCCESS)
{
TupleInput t=new TupleInput(data);
int n=t.readInt();
for(int i=0;i<n;++i) L.add(t.readInt());
}
L.add(value);
TupleOutput out=new TupleOutput();
out.writeInt(L.size());
for(int v: L) out.writeInt(v);
data=new Entry(out.toByteArray());
db.put(key,data);
}
Let's say one program is reading file F.txt, and another program is writing to this file at the same moment.
(When I think about how I would implement this functionality if I were a systems programmer) I realize that there can be ambiguity in:
what will the first program see?
where does the second program write new bytes? (i.e. write "in place" vs write to a new file and then replace the old file with the new one)
how many programs can write to the same file simultaneously?
.. and maybe something not so obvious.
So, my questions are:
what are the main strategies for reading/writing files functionality?
which of them are supported in which OS (Windows, Linux, Mac OS etc)?
can it depend on the programming language used? (I can suppose that Java tries to provide some unified behaviour on all supported OSs)
A single byte read has a long journey to go, from the magnetic plate/flash cell to your local Java variable. This is the path that a single byte travels:
Magnetic plate/flash cell
Internal hard disc buffer
SATA/IDE bus
SATA/IDE buffer
PCI/PCI-X bus
Computer's data bus
Computer's RAM via DMA
OS Page-cache
Libc read buffer, aka user space fopen() read buffer
Local Java variable
For performance reasons, most of the file buffering done by the OS is kept in the page cache, storing recently read and written file contents in RAM.
That means that every read and write operation from your Java code is done from and to your local buffer:
FileInputStream fis = new FileInputStream("/home/vz0/F.txt");
// This byte comes from the user space buffer.
int oneByte = fis.read();
A page is usually a single block of 4 kB of memory. Every page has some special flags and attributes, one of them being the "dirty page" flag, which means the page has some modified data not yet written to physical media.
Some time later, when the OS decides to flush the dirty data back to the disk, it sends the data in the opposite direction from where it came.
Whenever two distinct processes write data to the same file, the resulting behaviour is:
Impossible, if the file is locked: the second process won't be able to open the file.
Undefined, if writing over the same region of the file.
Expected, if operating over different regions of the file.
A "region" is dependant on the internal buffer sizes that your application uses. For example, on a two megabytes file, two distinct processes may write:
One on the first 1kB of data (0; 1024).
The other on the last 1kB of data (2096128; 2097152)
Buffer overlapping and data corruption would occur only if the local buffer were two megabytes in size. In Java you can use channel I/O to read files with fine-grained control of what's going on inside.
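For example, here is a minimal channel I/O sketch (assuming a local F.txt, as in the question) that writes and reads a 1 kB region by absolute position rather than through a stream's implicit position:
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ChannelRegionSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("F.txt", "rw");
             FileChannel channel = raf.getChannel()) {
            // Write 1 kB at an explicit offset (the start of the file).
            ByteBuffer out = ByteBuffer.allocate(1024);
            while (out.hasRemaining()) {
                out.put((byte) 'x');
            }
            out.flip();
            channel.write(out, 0);

            // Read the same region back, again by absolute position.
            ByteBuffer in = ByteBuffer.allocate(1024);
            channel.read(in, 0);
        }
    }
}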
Many transactional databases force some writes from the local RAM buffers back to disk by issuing a sync operation. All the data related to a single file gets flushed back to the magnetic plates or flash cells, effectively ensuring that no data will be lost on power failure.
Finally, a memory-mapped file is a region of memory that enables a user process to read and write directly from and to the page cache, bypassing the user-space buffering.
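A minimal memory-mapping sketch (assuming F.txt already exists): the mapped buffer reads straight through the page cache without a user-space copy.
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("F.txt", "r");
             FileChannel channel = raf.getChannel()) {
            // Map (at most) the first 4 kB of the file read-only.
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0,
                                                  Math.min(channel.size(), 4096));
            while (mapped.hasRemaining()) {
                mapped.get();   // each access touches the page cache directly
            }
        }
    }
}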
The Page Cache system is vital to the performance of a multitasking protected mode OS, and every modern operating system (Windows NT upwards, Linux, MacOS, *BSD) supports all these features.
http://ezinearticles.com/?How-an-Operating-Systems-File-System-Works&id=980216
There can be as many strategies as there are file systems. Generally, the OS focuses on avoiding I/O operations by caching file contents before they are synchronized with the disc. Reading from the buffer will see the data previously written to it. So between the software and the hardware there is a layer of buffering (e.g. the MySQL MyISAM engine makes heavy use of this layer).
The JVM synchronizes file descriptor buffers to disk when closing a file or when a program invokes methods like fsync(), but buffers may also be synchronized by the OS when they exceed defined thresholds. In the JVM this is, of course, unified across all supported OSs.
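For instance, a minimal sketch of forcing a flush all the way to the device from Java (FileDescriptor.sync() is the JVM's counterpart to fsync()):
import java.io.FileOutputStream;
import java.io.IOException;

public class SyncSketch {
    public static void main(String[] args) throws IOException {
        try (FileOutputStream out = new FileOutputStream("F.txt")) {
            out.write("hello".getBytes());
            out.flush();           // push data out of the user-space buffer
            out.getFD().sync();    // ask the OS to flush its buffers to the device
        }
    }
}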