How to avoid loading a large file repeatedly? - java

I'm trying to call a Java program (Stanford Chinese Word Segmenter) from within python. The Java program needs to load a large (100M) dictionary file (word list to assist segmentation) which takes 12+ seconds. I was wondering if it is possible to speed up the loading process, and more importantly, how to avoid loading it repeatedly when I need to call the python script multiple times?
Here's the relevant part of the code:
op = subprocess.Popen(['java',
                       '-mx2g',
                       '-cp',
                       'seg.jar',
                       'edu.stanford.nlp.ie.crf.CRFClassifier',
                       '-sighanCorporaDict',
                       'data',
                       '-testFile',
                       filename,
                       '-inputEncoding',
                       'utf-8',
                       '-sighanPostProcessing',
                       'true',
                       'ctb',
                       '-loadClassifier',
                       './data/ctb.gz',
                       '-serDictionary',
                       './data/dict-chris6.ser.gz',
                       '0'],
                      stdout=subprocess.PIPE,
                      stdin=subprocess.PIPE,
                      stderr=subprocess.STDOUT,
                      )
In the above code, './data/ctb.gz' is where the large word-list file gets loaded. I think the answer might have something to do with how the process is managed, but I don't know much about that.

You might be able to use an OS-specific solution here. Most modern operating systems can mount a partition that lives entirely in memory. For example, on Linux you could do:
mkfs -q /dev/ram1 8192
mkdir -p /ramcache
mount /dev/ram1 /ramcache
Moving the file to that directory would greatly speed up I/O.

There might be many ways to speed up the loading of the word list, but it depends on the details. If IO (disk read speed) is the bottleneck, then a simple way might be to zip the file and use a ZipInputStream to read it - but you would need to benchmark this.
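As a rough illustration of the ZipInputStream idea (a sketch only, not the Segmenter's actual loading code; the archive name and its contents are made up for the example):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZippedDictReader {
    public static void main(String[] args) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream("dict.zip"))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                System.out.println("reading " + entry.getName());
                // Each entry decompresses on the fly; build the dictionary from it.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zis, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    // add the word on this line to the in-memory dictionary
                }
            }
        }
    }
}
Whether this actually beats reading the uncompressed file depends on the disk and CPU, so benchmark it.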
To avoid multiple loading, you probably need to keep the Java process running, and communicate with it from Python via files or sockets, to send it commands, rather than actually launching the Java process each time from Python.
However, both of these require modifying the Java code.
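For the second suggestion, here is a minimal sketch of what the modified Java side could look like, with hypothetical loadClassifier/segment helpers standing in for the real Segmenter calls; the Python side would connect to the port and send one line of text per request:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SegmenterServer {
    public static void main(String[] args) throws IOException {
        // Placeholder: load the big model/dictionary exactly once at startup.
        Object classifier = loadClassifier("./data/ctb.gz");

        try (ServerSocket server = new ServerSocket(12345)) {   // port chosen arbitrarily
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(
                             client.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter out = new PrintWriter(new OutputStreamWriter(
                             client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.println(segment(classifier, line));  // one request per line
                    }
                } catch (IOException e) {
                    e.printStackTrace();  // keep serving other clients
                }
            }
        }
    }

    // Stand-ins for the real Segmenter calls, which are not shown in the question.
    private static Object loadClassifier(String path) { return new Object(); }
    private static String segment(Object classifier, String text) { return text; }
}
The expensive loading then happens once, when the server starts, instead of on every call from Python.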

If the Java program produces output as soon as it receives input from the filename named pipe, and you can't change the Java program, then you could keep your Python script running instead and communicate with it via files/sockets, as @DNA suggested for the Java process (the same idea, but with the Python program kept running).
import os
from subprocess import Popen, PIPE

# ...
os.mkfifo(filename)
p = Popen([..., filename, ...], stdout=PIPE)
with open(filename, 'w') as f:
    while True:
        indata = read_input()  # read text to segment from files/sockets, etc.
        f.write(indata)
        f.flush()  # make sure the segmenter actually sees the input
        # read response from java process
        outdata = p.stdout.readline()  # you need to figure out when to stop reading
        write_output(outdata)  # write response via files/sockets, etc.

You can run a single instance of the JVM and use named pipes to allow the Python script to communicate with it. This will work provided that the program executed by the JVM is stateless and responds on its stdout (and perhaps stderr) to requests arriving on its stdin.

Why not track, on the Python side, whether the file has already been read? I'm not a Python whiz, but I'm sure you could keep a list or map/dictionary of all the files that have been opened so far.

Related

parallel file I/O Java

I have a Fortran program that calls into Java using JNI. My Java function receives an array, writes it to a file, and makes a system call to a Python script that computes something and writes the result to another file, which is in turn read by the Java function and passed back to Fortran. This works as expected.
Unfortunately, I cannot use Jython because Jython does not support NumPy yet.
The serial implementation of my program works as expected, but when I run the parallel Fortran implementation, which uses OpenMP, the file I/O gets messed up. Is there any way I can safely read from and write to files in the parallel implementation?
I assume that you use hard-coded filenames. The problem is that all active threads are using the same files to pass data to the next program. Try to separate them: if you are running 3 OpenMP threads, then you need 3 files for the data transfer.
To separate them, you could name your files based on UUIDs and pass each filename to your Python program as a parameter.
String filename = "myFile" + UUID.randomUUID() + ".dat";
Process p = Runtime.getRuntime().exec("python myProgram.py " + filename);
p.waitFor();
Python program:
import sys
print 'using file:', sys.argv[1]  # argv[1] is the filename argument (argv[0] is the script name)
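To show the whole round trip under that scheme, here is a hedged Java sketch; the file names, the Python script name, and its argument convention are illustrative, not taken from the question:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class PerCallFiles {
    // Write the input to a per-call file, run the (hypothetical) Python script,
    // read its per-call result file back, and clean up.
    public static String callPython(String inputData) throws IOException, InterruptedException {
        String id = UUID.randomUUID().toString();
        Path inFile = Paths.get("in-" + id + ".dat");
        Path outFile = Paths.get("out-" + id + ".dat");
        Files.write(inFile, inputData.getBytes(StandardCharsets.UTF_8));

        Process p = new ProcessBuilder("python", "myProgram.py",
                inFile.toString(), outFile.toString()).inheritIO().start();
        p.waitFor();

        String result = new String(Files.readAllBytes(outFile), StandardCharsets.UTF_8);
        Files.deleteIfExists(inFile);
        Files.deleteIfExists(outFile);
        return result;
    }
}
Because every invocation works on its own pair of files, concurrent OpenMP threads no longer overwrite each other's data.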

Is there any difference between buffer and console to make use of?

What is the difference when I use a buffer instead of console() for retrieving the output from my code?
The Console class, as returned by System.console(), seems targeted at interactive character-based I/O, as provided by an actual console such as a cmd.exe window on Windows or a terminal on Unix-like systems. As such, the system console may not always be available, depending on the underlying OS and how the JVM was started.
On the other hand, Scanner works with any input stream, including files and the standard input. It is more flexible, but it does not provide some console-specific functionality that Console does, such as the ability to read text - usually passwords - without echoing it back to the console.
The Console class makes it easy to accept input from the command line, both echoed and unechoed. Unechoed means the characters you type are not displayed back in the console, as when you enter a password. Its format() method also makes it easy to write formatted output to the command line (like printing a pyramid of *s, a formatted date, or currency values). It also helps when writing test engines for unit testing, or you can use it to provide a simple CLI (Command Line Interface) instead of a GUI (Graphical User Interface) if you want to create a really simple and small application. And yes, it is also system-dependent, which means you cannot always rely on your system to provide you with a console instance.
Now, about buffering: it's actually a technique used in I/O (both input and output) when you are interacting with a stream (be it a character stream or a byte stream, and whether it comes from a console, a socket, or a file). It is basically used to speed up I/O and save system resources by avoiding many small calls to the underlying read() and write() methods. It is suggested that you use it in almost every kind of I/O interaction.
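A small illustrative comparison (my own sketch, not code from the original answers): read input via the system console when one is available, and fall back to a BufferedReader around standard input otherwise.
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStreamReader;

public class ConsoleVsBuffered {
    public static void main(String[] args) throws IOException {
        Console console = System.console();
        if (console != null) {
            String name = console.readLine("Name: ");              // echoed input
            char[] password = console.readPassword("Password: ");  // nothing is echoed
            console.format("Hello %s (password length %d)%n", name, password.length);
        } else {
            // No interactive console (e.g. run from an IDE or with redirected streams):
            // fall back to a buffered reader on standard input.
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String name = in.readLine();
            System.out.println("Hello " + name);
        }
    }
}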

(Java) File redirection (both ways) within Runtime.exec?

I want to execute this command:
/ceplinux_work3/myName/opt/myCompany/ourProduct/bin/EXECUTE_THIS -p cepamd64linux.myCompany.com:19021/ws1/project_name < /ceplinux_work3/myName/stressting/Publisher/uploadable/00000.bin >> /ceplinux_work3/myName/stressting/Publisher/stats/ws1.project_name.19021/2011-07-22T12-45-20_PID-2237/out.up
But it doesn't work, because EXECUTE_THIS expects its input via a redirect, and the < and >> redirection operators are interpreted by a shell, not by Runtime.exec, so simply passing this command string to Runtime.exec doesn't work.
Side note: I searched all over on how to solve this before coming here to ask. There are many questions/articles on the web regarding Runtime.exec and Input/Output redirect. However, I cannot find any that deal with passing a file to a command and outputting the result to another file. Plus, I am totally unfamiliar with Input/Output streams, so I have a hard time putting all the info out there together for my specific situation.
That said, any help is much appreciated.
P.S. If there are multiple ways to do this, I prefer whatever is fastest in terms of throughput.
Edit: As discussed in my last question, I CANNOT change this to a bash call because the program must wait for this process to finish before proceeding.
Unless you are sending a file name to the standard input of the process, there is no distinction between data that came from a file and data that came from any other source.
You need to write to the OutputStream given by Process.getOutputStream(). The data you write to it can be read from a file using a FileInputStream.
Putting that together might look something like this:
Process proc = Runtime.getRuntime().exec("...");
OutputStream standardInputOfChildProcess = proc.getOutputStream();
InputStream dataFromFile = new FileInputStream("theFileWithTheData.dat");
byte[] buff = new byte[1024];
for (int count = -1; (count = dataFromFile.read(buff)) != -1; ) {
    standardInputOfChildProcess.write(buff, 0, count);
}
I've left out a lot of details; this is just to get the gist of it. You'll want to close things safely, you might want to consider buffering, and you need to worry about the pitfalls of Runtime.exec().
Edit
Writing the output to a file is similar. Obtain a FileOutputStream pointing at the output file and write the data you read from Process.getInputStream() to that OutputStream. The major caveat here is that you must do this in a second thread, since reading and writing two blocking streams from the same thread can lead to deadlock.
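A hedged sketch of that second half (my own illustration, not the answer's code): copy the child's stdout to a file from a separate thread so that feeding its stdin, as above, and draining its stdout don't block each other. The file name is illustrative.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class RedirectOutput {
    public static Thread pipeStdoutToFile(Process proc, String outFile) {
        Thread t = new Thread(() -> {
            try (InputStream fromChild = proc.getInputStream();
                 OutputStream toFile = new FileOutputStream(outFile, true)) {  // append, like >>
                byte[] buff = new byte[8192];
                int count;
                while ((count = fromChild.read(buff)) != -1) {
                    toFile.write(buff, 0, count);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        t.start();
        return t;
    }
}
You would start this thread right after launching the process and join it after waitFor() returns.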

Is it possible to prepend data to a file without rewriting?

I deal with very large binary files (several GB to multiple TB per file). These files exist in a legacy format, and upgrading requires writing a header to the FRONT of the file. I can create a new file and rewrite the data, but sometimes this can take a long time. I'm wondering if there is any faster way to accomplish this upgrade. The platform is limited to Linux, and I'm willing to use low-level functions (ASM, C, C++) / filesystem tricks to make this happen. The primary library is Java, and JNI is completely acceptable.
There's no general way to do this natively.
Some filesystems may provide functions to do this (I can't point to a specific one), but your code would then be filesystem-dependent.
A solution could be to simulate a filesystem: store your data across a set of several files, and provide functions to open, read, and write the data as if it were a single file.
Sounds crazy, but you could store the file data in reverse order, if it is possible to change the function that reads data from the file. In that case you can append data (in reverse order) at the end of the file. It is just a general idea, so I can't recommend anything in particular.
Code for reversing the contents of the current file could look like this:
// requires <algorithm>, <fstream>, <iterator>, <string>
std::string records;                 // contents to be written out in reverse
std::ofstream out("reversed.dat");   // hypothetical output file name
std::copy(records.rbegin(), records.rend(),
          std::ostream_iterator<char>(out));
It depends on what you mean by "filesystem tricks". If you're willing to get down-and-dirty with the filesystem's on-disk format, and the size of the header you want to add is a multiple of the filesystem block size, then you could write a program to directly manipulate the filesystem's on-disk structures (with the filesystem unmounted).
This enterprise is about as hairy as it sounds though - it'd likely only be worth it if you had hundreds of these giant files to process.
I would just use the standard Linux tools to do it.
Writing another application to do it seems like it would be sub-optimal.
cat headerFile oldFile > tmpFile && mv tmpFile oldFile
I know this is an old question, but I hope this helps someone in the future. Similar to simulating a filesystem, you could simply use a named pipe:
mkfifo /path/to/file_to_be_read
{ echo "HEADER"; cat /path/to/source_file; } > /path/to/file_to_be_read
Then, you run your legacy program against /path/to/file_to_be_read, and the input would be:
HEADER
contents of /path/to/source_file
...
This will work as long as the program reads the file sequentially and doesn't do mmap() or rewind() past the buffer.

MATLAB - Delete elements of binary files without loading entire file

This may be a stupid question, but Google and MATLAB documentation have failed me. I have a rather large binary file (>10 GB) that I need to open and delete the last forty million bytes or so. Is there a way to do this without reading the entire file to memory in chunks and printing it out to a new file? It took 6 hours to generate the file, so I'm cringing at the thought of re-reading the whole thing.
EDIT:
The file is 14,440,000,000 bytes in size. I need to chop it to 14,400,000,000.
There is no ftruncate() in Matlab, but you've got access to the full Java standard library in the JVM embedded in Matlab, and can use java.io.RandomAccessFile or the Java NIO classes to truncate a file.
Here's a Matlab function that calls to Java to lop the last n bytes off a file. Should have minimal I/O cost.
function remove_last_n_bytes_from_file(file, n)
    jFile = java.io.RandomAccessFile(file, 'rw');
    currentLength = jFile.length();
    wantLength = currentLength - n;
    fprintf('Truncating file %s: Resizing to %d to remove %d bytes\n', file, wantLength, n);
    jFile.setLength(wantLength);
    jFile.close();
end
You could also do it as a one-liner, where n is the desired final length of the file in bytes:
java.io.RandomAccessFile('/path/to/my/file.bin', 'rw').setLength(n);
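For reference, the NIO route mentioned above could look something like this in plain Java (a hedged sketch; the path is illustrative and the target length is just the figure from this question):
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TruncateFile {
    public static void main(String[] args) throws IOException {
        long targetLength = 14_400_000_000L;  // desired final size in bytes
        try (FileChannel ch = FileChannel.open(
                Paths.get("/path/to/my/file.bin"),   // illustrative path
                StandardOpenOption.WRITE)) {
            ch.truncate(targetLength);
        }
    }
}
Like the RandomAccessFile version, this only updates the file's length rather than copying data.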
I found Perl is much quicker to do this than MATLAB.
Here are two examples from Perl Cookbook:
truncate(HANDLE, $length)
    or die "Couldn't truncate: $!\n";

truncate("/tmp/$$.pid", $length)
    or die "Couldn't truncate: $!\n";
You can run a Perl script from MATLAB with the perl function.
Since you don't want to read the file into MATLAB (understandably), you are dealing with system-level commands. MATLAB has a facility to call system commands using the system function, so your problem is reduced to finding the shell command in your OS that will do it for you. Or you can write a small program that uses truncate() (Unix, as KennyTM notes) or SetEndOfFile (Windows).
I don't know if MATLAB supports this, but see ftruncate() and truncate().
