My program generates two files. The first one generated is usually huge, normally around 20 GB. The second is a 'one line' file, and that one line is the header for the first file. So my output should be a single file that combines the two. Due to a space constraint, I can't create a third file to combine them. What's the best way to get around that?
You cannot just "insert" data into the middle of a file. Using RandomAccessFile will overwrite the data already written at a specific position in the file.
So, the first solution is (if it is possible) to create the header and then append your 20 GB. If that is not possible but you can estimate the length (in bytes) of your header, you can write placeholder bytes of the same length to the beginning of the file, then write your data, then go back to the beginning of the file and write (overwrite) the header.
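For illustration, here is a minimal sketch of that second approach using RandomAccessFile; the file name, header text, and reserved size are hypothetical, and the real header has to fit the reserved space exactly (pad it, e.g. with spaces):

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class HeaderLastWriter {
    public static void main(String[] args) throws Exception {
        final int RESERVED = 64;                        // estimated header size in bytes
        try (RandomAccessFile raf = new RandomAccessFile("output.csv", "rw")) {
            raf.write(new byte[RESERVED]);              // step 1: reserve space for the header
            raf.write("1,100\n2,500\n".getBytes(StandardCharsets.UTF_8)); // step 2: write the big payload
            raf.seek(0);                                // step 3: jump back to the start
            String header = String.format("%-" + (RESERVED - 1) + "s\n", "ID,Value"); // pad to the reserved size
            raf.write(header.getBytes(StandardCharsets.UTF_8)); // overwrite the placeholder bytes
        }
    }
}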
Use RandomAccessFile, which is available...
Related
My task is to sort a file which is too large to fit in memory. The file contains text lines.
What I did:
read the original file in parts (of the allowed size)
sorted each part
saved each sorted part to a temp file
As I understand it, the next thing I should do is:
read the first line of each temp file
sort those lines against each other (using a local variable to store them temporarily, though I am not sure it will stay below the restricted size)
write the first line (the result of the sorting) to the final file
now I need to remove the line I just wrote from its temporary file
now I need to repeat steps 1-4 until all lines are sorted and "transferred" from the temp files to the final file
I am most unsure about step 4: is there a class that can look for a value and then erase the line with this value (at that point I won't even know which file that line came from)? I suspect this is not the proper way to reach my goal at all, but I need to remove lines which are already sorted, and I can't hold the files' data in memory.
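For what it's worth, here is a minimal sketch of the merge phase described above, assuming the temp files are already individually sorted. Instead of erasing written lines from the temp files (step 4), each file keeps its own reader that is simply advanced past consumed lines, so only one line per file is held in memory at a time (the chunk list and target path are placeholders):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class KWayMerge {

    // merges the already-sorted chunk files into a single sorted target file
    public static void merge(List<File> sortedChunks, File target) throws IOException {
        PriorityQueue<ChunkCursor> queue =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.currentLine));
        List<BufferedReader> toClose = new ArrayList<>();
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream(target), StandardCharsets.UTF_8))) {
            for (File chunk : sortedChunks) {
                BufferedReader reader = new BufferedReader(new InputStreamReader(
                        new FileInputStream(chunk), StandardCharsets.UTF_8));
                toClose.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    queue.add(new ChunkCursor(reader, first));   // one line per chunk in memory
                }
            }
            while (!queue.isEmpty()) {
                ChunkCursor smallest = queue.poll();             // globally smallest pending line
                out.write(smallest.currentLine);
                out.newLine();
                String next = smallest.reader.readLine();        // advance that chunk
                if (next != null) {
                    queue.add(new ChunkCursor(smallest.reader, next));
                }
            }
        } finally {
            for (BufferedReader r : toClose) {
                r.close();
            }
        }
    }

    private static class ChunkCursor {
        final BufferedReader reader;
        final String currentLine;
        ChunkCursor(BufferedReader reader, String currentLine) {
            this.reader = reader;
            this.currentLine = currentLine;
        }
    }
}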
Do you need to do this in Java (I'm assuming so from the tag)? Memory-wise it isn't going to be an efficient approach. The simplest option, in my opinion, would be to use sort and just sort the file directly at the OS level.
This article will give you a guide on how to use sort: https://www.geeksforgeeks.org/sort-command-linuxunix-examples/
Sort is available on Windows as well as Unix/Linux and can handle huge files.
I need to update a file. I see two ways: first, rewrite the file (merge the content); second, delete the previous file and then create a new file with the new content. I pass the content for the whole file, and it weighs around 1 KB. Which way is faster?
There's no need to delete a file before you recreate it. You can just overwrite it with the new content. You can't really "merge" a file, because on the filesystem level blocks will be completely overwritten anyway, even if you just change 1 byte.
If you have the contents you want to end up in the file already, just overwrite the file.
If you have update data that depends on the existing file, read the file into memory (provided it's small enough), merge the data in memory and then overwrite the file. It's not that complicated.
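A minimal sketch of that "read, merge in memory, overwrite" pattern, assuming a small existing text file (the file name and the merge step are hypothetical):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

public class OverwriteInPlace {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("config.txt");
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8); // read the existing content
        lines.add("new entry");                                                // merge the update in memory
        Files.write(file, lines, StandardCharsets.UTF_8);                      // overwrite; no delete needed
    }
}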
I'm not sure in practice, but in theory merging the content should be faster. Delete-and-create means removing the file's pointer, creating a new one, and allocating memory and disk space for it. Merging means the allocation is already there and only needs to be extended. When merging, the existing file content doesn't need to be pulled into memory alongside the new content being merged, whereas creating a new file puts the whole lot through memory management. Unless there are lots of files, or the file grows large over time, you won't notice the difference.
I would like to update a specific part of a text file using Java. I would like to be able to scan through the file and select specific lines to be updated, a bit like in a database. For instance, given the file:
ID Value
1 100
2 500
4 20
I would like to insert 3 and update 4, e.g.
ID Value
1 100
2 500
3 80
4 1000
Is there a way to achieve this (seemingly) easy task? I know you can append to a file, but I am more interested in random access.
You're trying to insert and delete bytes in the middle of a file. You can't do that. File systems simply don't (in general) support that. You can overwrite specific bytes, but you can't insert or delete them.
You could update specific records with random access if your records were fixed-length (in bytes) but it looks like that's not the case.
You could either load the whole file into memory, or read from the original file while writing to a new file, choosing either the old data or the new data as appropriate on a per-line basis.
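A minimal sketch of the second option (read the original, write a new file, then swap it in), assuming the "ID Value" layout from the question; the file names and the insert/update rules are placeholders:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class RecordRewrite {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("records.txt");
        Path temp = Paths.get("records.tmp");
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.split("\\s+")[0];
                if (id.equals("2")) {                  // copy the line, then insert the new record
                    out.write(line); out.newLine();
                    out.write("3 80"); out.newLine();
                } else if (id.equals("4")) {           // replace the record with ID 4
                    out.write("4 1000"); out.newLine();
                } else {                               // everything else is copied unchanged
                    out.write(line); out.newLine();
                }
            }
        }
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING); // swap in the new file
    }
}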
You can do so using RandomAccessFile in Java, where you can set your current write and read position using the available methods. You can explore more here.
Load the file into memory, change your value, and then re-write the file
If there's a way to insert into a file without loading it, I haven't heard of it. You have to move the other data out of the way first.
Unless you're dealing with huge files frequently, performance isn't too much of a concern.
As said in the previous answers, it's not possible to do that simply using streams. You could try to use Properties, which are key-value pairs that can be saved to and modified in a text file.
For example, you can add a new property to a file with the method
setProperty(String key, String value)
This method adds a new property or, if one already exists, modifies the value of the property with the chosen key.
Obviously, new properties are added at the end of the file, but the lack of ordering is not a problem for performance because access is done through the getProperty method, which delegates to the underlying Hashtable lookup.
See this tutorial for some examples:
http://docs.oracle.com/javase/tutorial/essential/environment/properties.html
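A short example of that Properties round trip (the file name and keys are made up for illustration):

import java.io.*;
import java.util.Properties;

public class PropertiesExample {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        File file = new File("values.properties");
        if (file.exists()) {
            try (FileInputStream in = new FileInputStream(file)) {
                props.load(in);                        // read the existing key/value pairs
            }
        }
        props.setProperty("3", "80");                  // add a new property
        props.setProperty("4", "1000");                // or overwrite an existing one
        try (FileOutputStream out = new FileOutputStream(file)) {
            props.store(out, "updated values");        // rewrites the whole file
        }
        System.out.println(props.getProperty("4"));    // lookup is by key, not by position in the file
    }
}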
I have a large (~100GB) text file structured like this:
A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar
Each line is a comma-separated pair of values. The file is sorted by the first value in the pair. The lines are variable length. Define a group as being all lines with a common first value, i.e. with the example quoted above all lines starting with "A," would be a group, all lines starting with "B," would be another group.
The entire file is too large to fit into memory, but all the lines from any individual group will always fit into memory.
I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:
1) Scan the file using a BufferedReader, accumulating the lines from a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable, process the previous group. Clear the accumulator, add the temporary and then continue reading the new group starting from the second line.
2) Scan the file using a BufferedReader, whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when readLine() is next invoked it starts from the first line of the group instead of the second. I have looked into mark() and reset() but these require knowing the byte-position of the start of the line.
I'm going to go with (1) at the moment (sketched below), but I would be very grateful if someone could suggest a method that smells less.
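A minimal sketch of option (1), assuming every line has the "key,value" shape from the example; the file name and the process() call are placeholders for the existing per-group routine:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class GroupByFirstValue {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("big.csv"), StandardCharsets.UTF_8))) {
            List<String> group = new ArrayList<>();
            String currentKey = null;
            String line;
            while ((line = in.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));   // first value of the pair
                if (currentKey != null && !key.equals(currentKey)) {
                    process(group);                                   // previous group is complete
                    group.clear();
                }
                currentKey = key;
                group.add(line);
            }
            if (!group.isEmpty()) {
                process(group);                                       // don't forget the last group
            }
        }
    }

    private static void process(List<String> group) {
        // placeholder for the existing per-group routine
        System.out.println("group of " + group.size() + " lines");
    }
}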
I think a PushbackReader would work:
// reader must be a PushbackReader created with a pushback buffer at least as large as the longest line
if (lineBelongsToNewGroup) {
    reader.unread('\n');                    // push back the line terminator first...
    reader.unread(lastLine.toCharArray());  // ...so the line itself is read again before it
}
I think option 1 is the simplest. I would suggest you parse the text yourself rather than use BufferedReader, as it will take a long time to parse 100 GB.
The only option which is likely to be faster is to use a binary search, accessing the file with RandomAccessFile. You can memory-map 100 GB on a 64-bit JVM. This avoids the need to parse every line, which is pretty expensive. An advantage of this approach is that you can use multiple threads. It is far, far more complicated to implement, but should be much faster. Once you have each boundary, you can copy the raw data in bulk without having to parse all the lines.
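This is not the binary-search solution itself, just a minimal sketch of how a very large file can be memory-mapped in windows, since a single MappedByteBuffer tops out at around 2 GB; the file name and window size are hypothetical:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("big.csv"), StandardOpenOption.READ)) {
            long size = ch.size();
            long window = 1L << 30;                    // 1 GB mapping windows
            for (long pos = 0; pos < size; pos += window) {
                long len = Math.min(window, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                // scan buf for ',' and '\n' bytes here without pulling the data onto the heap
            }
        }
    }
}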
So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them so that I can later retrieve specific pieces using random access.
In other words, I wish to read the archive line by line and get the specific location in the file for each such line, so that I can jump directly to these specific locations upon request. (PS: it's UTF-8, so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file location corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the line location.
I cannot use InputStreamReader directly for performance reasons. Unbuffered would be way too slow, and, by the way, it lacks convenience methods for reading lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be to use a kind of buffered reader that keeps track of the file location and buffer offset, but this sounds quite cumbersome. Maybe I missed something; perhaps there is already something out there to read files line by line while keeping track of the location (even when zipped).
Thanks for tips,
Arnaud
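For illustration, a minimal sketch of that kind of offset tracking while streaming a .gz file: it assumes UTF-8 text, '\n' line endings, and a hypothetical file name, and it counts offsets on the raw uncompressed bytes so that multi-byte characters are handled correctly.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.zip.GZIPInputStream;

public class LineOffsetIndexer {
    public static void main(String[] args) throws IOException {
        List<Long> lineOffsets = new ArrayList<>();    // uncompressed byte offset of each line start
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(Files.newInputStream(Paths.get("archive.gz"))))) {
            ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
            long offset = 0;                           // position of the next byte in the uncompressed stream
            long lineStart = 0;                        // offset where the current line began
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') {
                    String line = lineBytes.toString(StandardCharsets.UTF_8.name());
                    lineOffsets.add(lineStart);
                    // handleLine(line, lineStart);    // hypothetical per-line processing
                    lineBytes.reset();
                    lineStart = offset;
                } else {
                    lineBytes.write(b);
                }
            }
        }
        System.out.println("indexed " + lineOffsets.size() + " lines");
    }
}

Note that these offsets refer to the uncompressed stream, so jumping back to them later still means re-decompressing from the start (or building a gzip index, as in the jzran answer below), since plain GZIP does not support seeking.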
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared in both InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...