This may be a stupid question, but Google and the MATLAB documentation have failed me. I have a rather large binary file (>10 GB) from which I need to delete the last forty million bytes or so. Is there a way to do this without reading the entire file into memory in chunks and writing it out to a new file? It took 6 hours to generate the file, so I'm cringing at the thought of re-reading the whole thing.
EDIT:
The file is 14,440,000,000 bytes in size. I need to chop it to 14,400,000,000.
There is no ftruncate() in MATLAB, but you've got access to the full Java standard library in the JVM embedded in MATLAB, and can use java.io.RandomAccessFile or the Java NIO classes to truncate a file.
Here's a MATLAB function that calls into Java to lop the last n bytes off a file. It should have minimal I/O cost.
function remove_last_n_bytes_from_file(file, n)
% Truncate FILE in place by lopping off its last N bytes, using Java.
jFile = java.io.RandomAccessFile(file, 'rw');
currentLength = jFile.length();
wantLength = currentLength - n;
fprintf('Truncating file %s: resizing to %d to remove %d bytes\n', file, wantLength, n);
jFile.setLength(wantLength);  % setLength() takes the desired final size
jFile.close();
end
You could also do it as a one-liner. Note that setLength() takes the desired final size, not the number of bytes to remove (here, the 14,400,000,000 bytes from the question):
java.io.RandomAccessFile('/path/to/my/file.bin', 'rw').setLength(14400000000);
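For reference, the NIO route mentioned above looks like this in plain Java. This is a minimal sketch; the path and byte count are placeholders:
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TruncateFile {
    public static void main(String[] args) throws Exception {
        long bytesToRemove = 40000000L;  // placeholder count
        try (FileChannel ch = FileChannel.open(Paths.get("/path/to/my/file.bin"),
                                               StandardOpenOption.WRITE)) {
            ch.truncate(ch.size() - bytesToRemove);  // truncate() takes the new length
        }
    }
}
Like setLength(), FileChannel.truncate() shrinks the file by adjusting metadata rather than copying data, so the cost should be essentially independent of file size.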
I found Perl to be much quicker at this than MATLAB.
Here are two examples from the Perl Cookbook:
truncate(HANDLE, $length)
or die "Couldn't truncate: $!\n";
truncate("/tmp/$$.pid", $length)
or die "Couldn't truncate: $!\n";
You can run a Perl script from MATLAB with the perl function.
Since you don't want to read the file into MATLAB (understandably), you are dealing with system-level commands. MATLAB has a facility to call system commands using its system function.
So now your problem is reduced to finding the shell command in your OS that will do it for you (on Linux, for example, GNU coreutils has truncate -s 14400000000 file), or writing a small program that uses truncate() (Unix, as KennyTM noted) or SetEndOfFile (Windows).
I don't know if MATLAB supports this, but see ftruncate() and truncate().
I'm using Java's ProcessBuilder, and it is working great! (Shocking, I know. But just wait...)
Setup and Goal: I've got a binary app nrsc5 wrapped by a ProcessBuilder, which
Sends the regular output that would normally appear on the console; instead it is read by the Java wrapper via getInputStream() and parsed into lines. Excellent.
It is also continually writing to a binary file, TEMP.WAV.
Every so often my wrapper Java app looks at TEMP.WAV and reads some chunk of it.
This works, but it is a monster and horribly inefficient. And I just found out that yes, the nrsc5 app can write that WAV file to standard output the same as lots of other Linux apps! ... -o - ... (an output file of a dash means "write to stdout")
This is excellent! No more WAV file! But... the console lines get all mushed together with the binary output, which is very bad.
Is there a way to say "all the text goes to one place like the error stream, and all the binary goes to another place like the regular input stream"?
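One way this separation can work, sketched below: a child process already has two output channels, so if nrsc5 sends its log lines to stderr once stdout is reserved for WAV data (an assumption worth verifying against your build), the wrapper can read the two streams independently. The command-line arguments here are placeholders:
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class Nrsc5Wrapper {
    public static void main(String[] args) throws Exception {
        // Placeholder invocation; substitute your real flags.
        ProcessBuilder pb = new ProcessBuilder("nrsc5", "-o", "-", "90.5", "0");
        pb.redirectErrorStream(false);  // keep stderr separate (this is the default)
        Process p = pb.start();

        // Text log lines arrive on stderr; drain them on a background thread.
        new Thread(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getErrorStream()))) {
                for (String line; (line = r.readLine()) != null; ) {
                    System.out.println("log: " + line);
                }
            } catch (Exception ignored) {}
        }).start();

        // Binary WAV bytes arrive on stdout.
        try (InputStream audio = p.getInputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = audio.read(buf)) != -1; ) {
                // feed buf[0..n) to the audio pipeline
            }
        }
    }
}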
I have a Fortran program that is calling into Java using JNI. My Java function receives an array, writes the array to a file, and makes a system call to a Python program that computes something and writes the result to a file, which in turn is read by the Java function and passed back to Fortran. This works as expected.
Unfortunately, I cannot use Jython because Jython does not support NumPy yet.
The serial implementation of my program works as expected but when I run the parallel implementation of Fortran code that uses OpenMP, file I/O is messed up. Is there any way I can safely read/write from files with the parallel implementation?
I assume that you use hard-coded filenames. The problem is that all active threads are using the same files to pass data to the next program. Try to separate them: if you are running 3 OpenMP threads, then you need 3 files for data transfer.
For separation you could name your files based on UUIDs and pass that filename to your python program as a parameter.
String filename = "myFile" + UUID.randomUUID() + ".dat";  // unique name per thread
Process p = Runtime.getRuntime().exec("python myProgram.py " + filename);
p.waitFor();
Python program:
import sys
print 'using file:', sys.argv[1]  # argv[0] is the script name; the filename is argv[1]
I'm trying to call a Java program (Stanford Chinese Word Segmenter) from within python. The Java program needs to load a large (100M) dictionary file (word list to assist segmentation) which takes 12+ seconds. I was wondering if it is possible to speed up the loading process, and more importantly, how to avoid loading it repeatedly when I need to call the python script multiple times?
Here's the relevant part of the code:
op = subprocess.Popen(['java',
'-mx2g',
'-cp',
'seg.jar',
'edu.stanford.nlp.ie.crf.CRFClassifier',
'-sighanCorporaDict',
'data',
'-testFile',
filename,
'-inputEncoding',
'utf-8',
'-sighanPostProcessing',
'true',
'ctb',
'-loadClassifier',
'./data/ctb.gz',
'-serDictionary',
'./data/dict-chris6.ser.gz',
'0'],
stdout = subprocess.PIPE,
stdin = subprocess.PIPE,
stderr = subprocess.STDOUT,
)
In the above code, './data/ctb.gz' is where the large word-list file is loaded. I think this might be related to how the process is launched, but I don't know much about it.
You might be able to use an OS-specific solution here. Most modern operating systems can mount a partition in memory. For example, in Linux you could do
mkfs -q /dev/ram1 8192
mkdir -p /ramcache
mount /dev/ram1 /ramcache
Moving the file to that directory would greatly speed up I/O.
There might be many ways to speed up the loading of the word list, but it depends on the details. If I/O (disk read speed) is the bottleneck, then a simple way might be to zip the file and use a ZipInputStream to read it, but you would need to benchmark this.
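For illustration, reading a zipped word list from Java might look like the sketch below; wordlist.zip is a hypothetical archive name, and whether this actually wins depends on the disk-versus-CPU trade-off:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.ZipInputStream;

public class ZippedWordList {
    public static void main(String[] args) throws Exception {
        // Hypothetical archive containing a single word-list entry.
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream("wordlist.zip"))) {
            zin.getNextEntry();  // position the stream at the first entry
            BufferedReader r = new BufferedReader(new InputStreamReader(zin, "UTF-8"));
            for (String line; (line = r.readLine()) != null; ) {
                // add line to the in-memory dictionary
            }
        }
    }
}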
To avoid multiple loading, you probably need to keep the Java process running, and communicate with it from Python via files or sockets, to send it commands, rather than actually launching the Java process each time from Python.
However, both of these require modifying the Java code.
If the Java program produces output as soon as it receives input from the filename named pipe, and you can't change the Java program, then you could keep your Python script running instead and communicate with it via files/sockets, as @DNA suggested for the Java process (the same idea, but the Python program keeps running).
import os
from subprocess import Popen, PIPE

# ...
os.mkfifo(filename)
p = Popen([..., filename, ...], stdout=PIPE)
with open(filename, 'w') as f:
    while True:
        indata = read_input()  # read text to segment from files/sockets, etc.
        f.write(indata)
        # read response from the Java process
        outdata = p.stdout.readline()  # you need to figure out when to stop reading
        write_output(outdata)  # write response via files/sockets, etc.
You can run a single instance of the JVM and use named pipes to allow the python script to communicate with the JVM. This will work assuming that the program executed by the JVM is stateless and responds on its stdout (and stderr perhaps) to requests arriving via its stdin.
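The Java side of such a setup can be as small as the sketch below. The names are hypothetical stand-ins for the real classifier API; the point is that the expensive load happens once, before the request loop:
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class SegmenterServer {
    public static void main(String[] args) throws Exception {
        Object model = loadModel();  // the slow 12-second load happens only once
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(segment(model, line));  // one response line per request
        }
    }

    // Placeholders standing in for the real loading/segmenting code.
    private static Object loadModel() { return null; }
    private static String segment(Object model, String text) { return text; }
}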
Why not track whether the file has already been read on the Python side? I'm not a Python whiz, but I'm sure you could keep some list or map/dictionary of all the files that have been opened so far.
The question is pretty much what is asked in the title.
I have a lot of PNG files created by MapTiler, 24,083 files to be exact. They are within many folders which are in many folders, i.e. a tree of folders, duh. Thing is, it's the biggest waste of time to manually pngcrush all of those.
Does anyone have an algorithm to share for me please? One that could recursively crush all these PNGs?
I have a Windows PC and would rather have it in Java or PHP than in another language (since I already know those well), but otherwise something else might be fine.
Thanks!
You don't need anything special for this, just use the FOR command in the Windows Command Prompt.
Use this line:
FOR /R "yourdir" %f IN (*.png) DO pngcrush "%f" "%f.crushed.png"
The "yourdir" is the root-directory where the input files are stored.
The two %f's at the end:
The first one is the input filename
The second one is the output filename
The -ow option, added in pngcrush 1.7.22, makes the operation in-place:
FOR /R "yourdir" %f IN (*.png) DO pngcrush -ow "%f"
See the documentation for FOR for more information.
The program sweep (http://users.csc.calpoly.edu/~bfriesen/software/files/sweep32.zip) lets you run the same command on all files in a directory recursively.
See: RecursiveIteratorIterator with RecursiveDirectoryIterator and exec (or similar)
With that you can use:
$it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('%your-top-directory%'));
foreach ($it as $entry) {
    if (strtolower($entry->getExtension()) == 'png') {
        exec('pngcrush -ow ' . escapeshellarg($entry->getPathname()));  // crush in place
    }
}
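Since the question asks for Java: the sketch below walks the tree with java.nio.file.Files.walk and shells out to pngcrush for each PNG. It assumes pngcrush is on the PATH, and the root directory is a placeholder:
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CrushAll {
    public static void main(String[] args) throws Exception {
        List<Path> pngs;
        try (Stream<Path> files = Files.walk(Paths.get("C:\\tiles"))) {  // placeholder root
            pngs = files.filter(f -> f.toString().toLowerCase().endsWith(".png"))
                        .collect(Collectors.toList());
        }
        for (Path p : pngs) {
            // -ow (pngcrush 1.7.22 or later) overwrites each file in place
            new ProcessBuilder("pngcrush", "-ow", p.toString())
                    .inheritIO().start().waitFor();
        }
    }
}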
I deal with very large binary files (several GB to multiple TB per file). These files exist in a legacy format and upgrading requires writing a header to the FRONT of the file. I can create a new file and rewrite the data, but sometimes this can take a long time. I'm wondering if there is any faster way to accomplish this upgrade. The platform is limited to Linux and I'm willing to use low-level functions (ASM, C, C++) / filesystem tricks to make this happen. The primary library is Java and JNI is completely acceptable.
There's no general way to do this natively.
Maybe some filesystems provide functions to do this (I can't give any hints about that), but your code would then be filesystem-dependent.
A solution could be to simulate a filesystem: you could store your data as a set of several files, and then provide some functions to open, read and write the data as if it were a single file.
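A tiny sketch of that idea in Java: keep the new header and the legacy data as separate files and expose them as one logical stream. The class and method names here are illustrative:
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;

public class LogicalFile {
    // Present header + legacy data as one stream without rewriting the big file.
    public static InputStream open(String headerPath, String dataPath) throws Exception {
        return new SequenceInputStream(
                new FileInputStream(headerPath),
                new FileInputStream(dataPath));
    }
}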
Sounds crazy, but you could store the file data in reverse order, if it is possible to change the function that reads data from the file. In that case you can append data (in reverse order) at the end of the file. It is just a general idea, so I can't recommend anything in particular.
The code for reversing the current file could look like this:
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
std::string records;  // the current file's contents, read elsewhere
std::ofstream out("reversed.dat", std::ios::binary);
std::copy(records.rbegin(), records.rend(), std::ostream_iterator<char>(out));
It depends on what you mean by "filesystem tricks". If you're willing to get down-and-dirty with the filesystem's on-disk format, and the size of the header you want to add is a multiple of the filesystem block size, then you could write a program to directly manipulate the filesystem's on-disk structures (with the filesystem unmounted).
This enterprise is about as hairy as it sounds though - it'd likely only be worth it if you had hundreds of these giant files to process.
I would just use the standard Linux tools to do it.
Writing another application to do it seems like it would be sub-optimal.
cat headerFile oldFile > tmpFile && mv tmpFile oldFile
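If you do end up rewriting from Java (the primary library per the question), FileChannel.transferTo lets the kernel do the bulk copy. A sketch with illustrative filenames:
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PrependHeader {
    public static void main(String[] args) throws Exception {
        byte[] header = "HEADER".getBytes();  // placeholder header bytes
        try (FileChannel in = FileChannel.open(Paths.get("old.bin"),
                     StandardOpenOption.READ);
             FileChannel out = FileChannel.open(Paths.get("upgraded.bin"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            out.write(ByteBuffer.wrap(header));  // write the new header first
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);  // kernel-side bulk copy
            }
        }
    }
}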
I know this is an old question, but I hope this helps someone in the future. Similar to simulating a filesystem, you could simply use a named pipe:
mkfifo /path/to/file_to_be_read
{ echo "HEADER"; cat /path/to/source_file; } > /path/to/file_to_be_read
Then, you run your legacy program against /path/to/file_to_be_read, and the input would be:
HEADER
contents of /path/to/source_file
...
This will work as long as the program reads the file sequentially and doesn't mmap() it or rewind() past the pipe's buffer.