I'm parsing large PCAP files in Java using Kaitai-Struct. Whenever the file size exceeds Integer.MAX_VALUE bytes I face an IllegalArgumentException caused by the size limit of the underlying ByteBuffer.
I haven't found references to this issue elsewhere, which leads me to believe that this is not a library limitation but a mistake in the way I'm using it.
Since the problem is caused by mapping the whole file into the ByteBuffer, I'd think the solution would be to map only the first region of the file and, as the data is consumed, map again, skipping the data already parsed.
As this is done within the Kaitai Struct runtime library, it would mean writing my own class extending from KaitaiStream and overriding the auto-generated fromFile(...) method, and this doesn't really seem like the right approach.
The auto-generated method to parse from file for the PCAP class is:
public static Pcap fromFile(String fileName) throws IOException {
return new Pcap(new ByteBufferKaitaiStream(fileName));
}
And the ByteBufferKaitaiStream provided by the Kaitai Struct Runtime library is backed by a ByteBuffer.
private final FileChannel fc;
private final ByteBuffer bb;
public ByteBufferKaitaiStream(String fileName) throws IOException {
fc = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ);
bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
}
Which in turn is limited by the ByteBuffer maximum size (Integer.MAX_VALUE bytes).
Am I missing some obvious workaround? Is it really a limitation of the Java implementation of Kaitai Struct?
There are two separate issues here:
1. Running Pcap.fromFile() for large files is generally not a very efficient method, as you'll eventually get the whole file parsed into memory at once. An example of how to avoid that is given in kaitai_struct/issues/255. The basic idea is that you'd want to have control over how you read every packet, and then dispose of every packet after you've parsed / accounted for it somehow.
2. The 2GB limit on Java's memory-mapped files. To mitigate that, you can use the alternative RandomAccessFile-based KaitaiStream implementation, RandomAccessFileKaitaiStream; it might be slower, but it should avoid that 2GB problem (see the sketch below).
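For reference, a minimal sketch of item 2, assuming your runtime version ships RandomAccessFileKaitaiStream and your generated top-level class is Pcap (the file name is a placeholder):
import io.kaitai.struct.RandomAccessFileKaitaiStream;

// Bypass the generated fromFile() and hand the parser a RandomAccessFile-backed stream
Pcap pcap = new Pcap(new RandomAccessFileKaitaiStream("huge-capture.pcap"));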
This library provides a ByteBuffer implementation which uses a long offset. I haven't tried this approach, but it looks promising. See the section "Mapping Files Bigger than 2 GB":
http://www.kdgregory.com/index.php?page=java.byteBuffer
public int getInt(long index)
{
return buffer(index).getInt();
}
private ByteBuffer buffer(long index)
{
ByteBuffer buf = _buffers[(int)(index / _segmentSize)];
buf.position((int)(index % _segmentSize));
return buf;
}
public MappedFileBuffer(File file, int segmentSize, boolean readWrite)
throws IOException
{
if (segmentSize > MAX_SEGMENT_SIZE)
throw new IllegalArgumentException(
"segment size too large (max " + MAX_SEGMENT_SIZE + "): " + segmentSize);
_segmentSize = segmentSize;
_fileSize = file.length();
RandomAccessFile mappedFile = null;
try
{
String mode = readWrite ? "rw" : "r";
MapMode mapMode = readWrite ? MapMode.READ_WRITE : MapMode.READ_ONLY;
mappedFile = new RandomAccessFile(file, mode);
FileChannel channel = mappedFile.getChannel();
_buffers = new MappedByteBuffer[(int)(_fileSize / segmentSize) + 1];
int bufIdx = 0;
for (long offset = 0 ; offset < _fileSize ; offset += segmentSize)
{
long remainingFileSize = _fileSize - offset;
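// Map up to 2 * segmentSize per segment so that a read which straddles
// a segment boundary can still be served from a single mapped buffer.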
long thisSegmentSize = Math.min(2L * segmentSize, remainingFileSize);
_buffers[bufIdx++] = channel.map(mapMode, offset, thisSegmentSize);
}
}
finally
{
// close quietly
if (mappedFile != null)
{
try
{
mappedFile.close();
}
catch (IOException ignored) { /* */ }
}
}
}
Related
We have a data file for which we need to generate a CRC. (As a placeholder, I'm using CRC32 while the others figure out what CRC polynomial they actually want.) This code seems like it ought to work:
broken:
Path in = ......;
try (SeekableByteChannel reading =
Files.newByteChannel (in, StandardOpenOption.READ))
{
System.err.println("byte channel is a " + reading.getClass().getName() +
" from " + in + " of size " + reading.size() + " and isopen=" + reading.isOpen());
java.util.zip.CRC32 placeholder = new java.util.zip.CRC32();
ByteBuffer buffer = ByteBuffer.allocate (reasonable_buffer_size);
int bytesread = 0;
int loops = 0;
while ((bytesread = reading.read(buffer)) > 0) {
byte[] raw = buffer.array();
System.err.println("Claims to have read " + bytesread + " bytes, have buffer of size " + raw.length + ", updating CRC");
placeholder.update(raw);
loops++;
buffer.clear();
}
// do stuff with placeholder.getValue()
}
catch (all the things that go wrong with opening files) {
and handle them;
}
The System.err and loops stuff is just for debugging; we don't actually care how many times it takes. The output is:
byte channel is a sun.nio.ch.FileChannelImpl from C:\working\tmp\ls2kst83543216xuxxy8136.tmp of size 7196 and isopen=true
finished after 0 time(s) through the loop
There's no way to run the real code inside a debugger to step through it, but from looking at the source of sun.nio.ch.FileChannelImpl.read(), it looks like 0 is returned if the file magically becomes closed while internal data structures are being prepared; the code below is copied from the Java 7 reference implementation, with my comments added:
// sun.nio.ch.FileChannelImpl.java
public int read(ByteBuffer dst) throws IOException {
ensureOpen(); // this throws if file is closed...
if (!readable)
throw new NonReadableChannelException();
synchronized (positionLock) {
int n = 0;
int ti = -1;
Object traceContext = IoTrace.fileReadBegin(path);
try {
begin();
ti = threads.add();
if (!isOpen())
return 0; // ...argh
do {
n = IOUtil.read(fd, dst, -1, nd);
} while (......)
.......
But the debugging code tests isOpen() and gets true. So I don't know what's going wrong.
As the current test data files are tiny, I dropped this in place just to have something working:
works for now:
try {
byte[] scratch = Files.readAllBytes(in);
java.util.zip.CRC32 placeholder = new java.util.zip.CRC32();
placeholder.update(scratch);
// do stuff with placeholder.getValue()
}
I don't want to slurp the entire file into memory for the Real Code, because some of those files can be large. I do note that readAllBytes uses an InputStream in its reference implementation, which has no trouble reading the same file that SeekableByteChannel failed to. So I'll probably rewrite the code to just use input streams instead of byte channels. I'd still like to figure out what's gone wrong in case a future scenario comes up where we need to use byte channels. What am I missing with SeekableByteChannel?
Check that 'reasonable_buffer_size' isn't zero. If the buffer has no remaining space, read() returns 0 straight away, so the loop body never runs, which would explain the "finished after 0 time(s) through the loop" output.
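For what it's worth, a minimal sketch of the loop with a nonzero buffer (size picked arbitrarily), reusing the reading channel from the question and feeding the CRC only the bytes actually read on each pass:
java.util.zip.CRC32 crc = new java.util.zip.CRC32();
ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);   // must be non-zero
int bytesRead;
while ((bytesRead = reading.read(buffer)) > 0) {
    // only the first bytesRead bytes of the backing array are valid on this pass
    crc.update(buffer.array(), 0, bytesRead);
    buffer.clear();
}
// use crc.getValue() here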
Is there a Java way to pre-allocate drive space for exclusive usage in the application?
There is no requirement for this space to be a separate filesystem or a part of existing filesystem (so could easily be a database), but it should allow for reserving the specified amount of space and allow for random reads/writes with high enough throughput.
Here's a stripped down version of my JNA-based fallocate solution. The main trick is obtaining the native file descriptor. I've only tested it on Linux so far, but it should work on all modern POSIX/non-Windows systems. It's not necessary on Windows, as Windows does not create sparse files by default (only with StandardOpenOption.SPARSE), so RandomAccessFile.setLength(size) or FileChannel.write(ByteBuffer.allocate(1), size - 1) are adequate there.
/**
* Provides access to operating system-specific {@code fallocate} and
* {@code posix_fallocate} functions.
*/
public final class Fallocate {
private static final boolean IS_LINUX = Platform.isLinux();
private static final boolean IS_POSIX = !Platform.isWindows();
private static final int FALLOC_FL_KEEP_SIZE = 0x01;
private final int fd;
private int mode;
private long offset;
private final long length;
private Fallocate(int fd, long length) {
if (!isSupported()) {
throwUnsupported("fallocate");
}
this.fd = fd;
this.length = length;
}
public static boolean isSupported() {
return IS_POSIX;
}
public static Fallocate forChannel(FileChannel channel, long length) {
return new Fallocate(getDescriptor(channel), length);
}
public static Fallocate forDescriptor(FileDescriptor descriptor, long length) {
return new Fallocate(getDescriptor(descriptor), length);
}
public Fallocate fromOffset(long offset) {
this.offset = offset;
return this;
}
public Fallocate keepSize() {
requireLinux("fallocate keep size");
mode |= FALLOC_FL_KEEP_SIZE;
return this;
}
private void requireLinux(String feature) {
if (!IS_LINUX) {
throwUnsupported(feature);
}
}
private void throwUnsupported(String feature) {
throw new UnsupportedOperationException(feature +
" is not supported on this operating system");
}
public void execute() throws IOException {
final int errno;
if (IS_LINUX) {
final int result = FallocateHolder.fallocate(fd, mode, offset, length);
errno = result == 0 ? 0 : Native.getLastError();
} else {
errno = PosixFallocateHolder.posix_fallocate(fd, offset, length);
}
if (errno != 0) {
throw new IOException("fallocate returned " + errno);
}
}
private static class FallocateHolder {
static {
Native.register(Platform.C_LIBRARY_NAME);
}
private static native int fallocate(int fd, int mode, long offset, long length);
}
private static class PosixFallocateHolder {
static {
Native.register(Platform.C_LIBRARY_NAME);
}
private static native int posix_fallocate(int fd, long offset, long length);
}
private static int getDescriptor(FileChannel channel) {
try {
// sun.nio.ch.FileChannelImpl declares private final java.io.FileDescriptor fd
final Field field = channel.getClass().getDeclaredField("fd");
field.setAccessible(true);
return getDescriptor((FileDescriptor) field.get(channel));
} catch (final Exception e) {
throw new UnsupportedOperationException("unsupported FileChannel implementation", e);
}
}
private static int getDescriptor(FileDescriptor descriptor) {
try {
// Oracle java.io.FileDescriptor declares private int fd
final Field field = descriptor.getClass().getDeclaredField("fd");
field.setAccessible(true);
return (int) field.get(descriptor);
} catch (final Exception e) {
throw new UnsupportedOperationException("unsupported FileDescriptor implementation", e);
}
}
}
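A hedged usage sketch for the class above (path and size are made up); it assumes the channel comes from FileChannel.open, whose implementation the reflective getDescriptor() can handle:
try (FileChannel channel = FileChannel.open(Paths.get("reserved.dat"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
    Fallocate.forChannel(channel, 10L * 1024 * 1024 * 1024)  // ask for ~10 GiB
             .execute();
}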
You could try using a RandomAccessFile object and its setLength() method.
Example:
File file = ... //Create a temporary file on the filesystem you're trying to reserve space on.
long bytes = ... //number of bytes you want to reserve.
RandomAccessFile rf = null;
try{
rf = new RandomAccessFile(file, "rw"); //rw stands for open in read/write mode.
rf.setLength(bytes); //This will cause Java to "reserve" space for your application by inflating/truncating the file to the specified size.
//Do whatever you want with the space here...
}catch(IOException ex){
//Handle this...
}finally{
if(rf != null){
try{
rf.close(); //Let's be nice and tidy here.
}catch(IOException ioex){
//Handle this if you want...
}
}
}
Note: when opened in "rw" mode, RandomAccessFile will create the file if it does not already exist.
The RandomAccessFile object can then be used to read/write to the file. Make sure the target filesystem has enough free space. The space may not be "exclusive" per se, but you can always use file locks to do that.
P.S.: If you end up realizing hard drives are slow and useless (or you meant to use RAM from the start), you can use the ByteBuffer object from java.nio. The allocate() and allocateDirect() methods should be more than enough. The byte buffer will be allocated in RAM (and possibly swap) and will be exclusive to this Java program. Random access can be done by changing the position of the buffer. Since these buffers use signed integers to reference positions, maximum sizes are limited to 2^31 - 1 bytes.
Read more on RandomAccessFile, FileLock, and ByteBuffer in their respective Javadoc.
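For the in-memory route from the P.S., a small sketch (the size is arbitrary); absolute get/put calls give you random access without moving the buffer's position:
// Reserve 512 MiB outside the Java heap, exclusive to this process
ByteBuffer scratch = ByteBuffer.allocateDirect(512 * 1024 * 1024);

scratch.putLong(0, 42L);              // write at offset 0
long value = scratch.getLong(0);      // read it back
scratch.put(1024, (byte) 0x7F);       // single byte at offset 1024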
On Linux systems you can use the fallocate() system call. It's extremely fast. Just run this Bash command:
fallocate -l 10G 10Gigfile
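If you need to invoke it from Java rather than from a shell, one possible sketch (file name and size are placeholders):
Process p = new ProcessBuilder("fallocate", "-l", "10G", "10Gigfile")
        .inheritIO()
        .start();
int exit = p.waitFor();   // 0 means the space was allocated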
You can pre-allocate space by writing a large file, but to be honest I wouldn't bother. Performance will be pretty good, probably better than you need.
If you really needed performance, you'd be writing C++/C# and doing RAW I/O.
But that's typically only done when writing an RDBMS engine, high-volume media capture or similar.
Using Java NIO you can copy files faster. I found mainly two kinds of methods on the internet to do this job.
public static void copyFile(File sourceFile, File destinationFile) throws IOException {
if (!destinationFile.exists()) {
destinationFile.createNewFile();
}
FileChannel source = null;
FileChannel destination = null;
try {
source = new FileInputStream(sourceFile).getChannel();
destination = new FileOutputStream(destinationFile).getChannel();
destination.transferFrom(source, 0, source.size());
} finally {
if (source != null) {
source.close();
}
if (destination != null) {
destination.close();
}
}
}
In "20 very useful Java code snippets for Java Developers" I found a different comment and trick:
public static void fileCopy(File in, File out) throws IOException {
FileChannel inChannel = new FileInputStream(in).getChannel();
FileChannel outChannel = new FileOutputStream(out).getChannel();
try {
// inChannel.transferTo(0, inChannel.size(), outChannel); // original -- apparently has trouble copying large files on Windows
// magic number for Windows, (64Mb - 32Kb)
int maxCount = (64 * 1024 * 1024) - (32 * 1024);
long size = inChannel.size();
long position = 0;
while (position < size) {
position += inChannel.transferTo(position, maxCount, outChannel);
}
} finally {
if (inChannel != null) {
inChannel.close();
}
if (outChannel != null) {
outChannel.close();
}
}
}
But I didn't find or couldn't understand the meaning of
"magic number for Windows, (64Mb - 32Kb)".
It says that inChannel.transferTo(0, inChannel.size(), outChannel) has a problem on Windows; is 67076096 bytes (= 64 * 1024 * 1024 - 32 * 1024) the optimum chunk size for this method?
Windows has a hard limit on the maximum transfer size, and if you exceed it you get a runtime exception. So you need to tune. The second version you give is superior because it doesn't assume the file was transferred completely with one transferTo() call, which agrees with the Javadoc.
Setting the transfer size more than about 1MB is pretty pointless anyway.
EDIT Your second version has a flaw. You should decrement size by the amount transferred each time. It should be more like:
while (size > 0) { // we still have bytes to transfer
long count = inChannel.transferTo(position, size, outChannel);
if (count > 0)
{
position += count; // seeking position to last byte transferred
size -= count; // {count} bytes have been transferred, remaining {size}
}
}
I have read that it is for compatibility with the Windows 2000 operating system.
Source: http://www.rgagnon.com/javadetails/java-0064.html
Quote: In Win2000, transferTo() does not transfer files larger than 2^31-1 bytes; it throws "java.io.IOException: Insufficient system resources exist to complete the requested service". The workaround is to copy in a loop, 64Mb each time, until there is no more data.
There appears to be anecdotal evidence that attempts to transfer more than 64MB at a time on certain Windows versions results in a slow copy. Hence the check: this appears to be the result of some detail of the underlying native code that implements the transferTo operation on Windows.
I am working on a software product with an integrated log file viewer. The problem is, it's slow and unstable for really large files because it reads the whole file into memory when you view a log file. I want to write a new log file viewer that addresses this problem.
What are the best practices for writing viewers for large text files? How do editors like Notepad++ and Vim accomplish this? I was thinking of using a buffered, bi-directional text stream reader together with Java's TableModel. Am I thinking along the right lines, and are such stream implementations available for Java?
Edit: Will it be worthwhile to run through the file once to index the positions of the start of each line of text, so that one knows where to seek to? I will probably need the number of lines, so I'll probably have to scan through the file at least once anyway?
Edit 2: I've added my implementation in an answer below. Please comment on it or edit it to help me/us arrive at a best-practice implementation, or otherwise provide your own.
I'm not sure that NotePad++ actually implements random access, but I think that's the way to go, especially with a log file viewer, which implies that it will be read only.
Since your log viewer will be read only, you can use a read only random access memory mapped file "stream". In Java, this is the FileChannel.
Then just jump around in the file as needed and render to the screen just a scrolling window of the data.
One of the advantages of the FileChannel is that concurrent threads can have the file open, and reading doesn't affect the current file pointer. So, if you're appending to the log file in another thread, it won't be affected.
Another advantage is that you can call the FileChannel's size method to get the file size at any moment.
The problem with mapping memory directly to a random access file, which some text editors allow (such as HxD and UltraEdit), is that any changes directly affect the file. Changes are immediate (except for write caching), whereas users typically don't want their changes applied until they click Save. However, since this is just a viewer, you don't have the same concerns.
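As a rough sketch of that idea (path, window size, offset, and charset are all assumptions), you can read just the window you need with a positional read, which doesn't disturb the channel's file pointer:
FileChannel log = FileChannel.open(Paths.get("app.log"), StandardOpenOption.READ);

ByteBuffer window = ByteBuffer.allocate(64 * 1024);    // one screenful plus some slack
long offset = 1_000_000L;                              // wherever the user has scrolled to
int read = log.read(window, offset);                   // positional read
window.flip();
String text = StandardCharsets.UTF_8.decode(window).toString();

long currentSize = log.size();                         // cheap to re-check as the log grows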
A typical approach is to use a seekable file reader, make one pass through the log recording an index of line offsets and then present only a window onto a portion of the file as requested.
This both reduces the data you need in quick recall and avoids loading up a widget where 99% of its contents aren't currently visible.
I post my test implementation here (after following the advice of Marcus Adams and msw) for your convenience and also for further comments and criticism. It's quite fast.
I've not bothered with Unicode encoding safety. I guess that will be my next question. Any hints on that are very welcome.
class LogFileTableModel implements TableModel {
private final File f;
private final int lineCount;
private final String errMsg;
private final Long[] index;
private final ByteBuffer linebuf = ByteBuffer.allocate(1024);
private FileChannel chan;
public LogFileTableModel(String filename) {
f = new File(filename);
String m;
int l = 1;
Long[] idx = new Long[] {};
try {
FileInputStream in = new FileInputStream(f);
chan = in.getChannel();
m = null;
idx = buildLineIndex();
l = idx.length;
} catch (IOException e) {
m = e.getMessage();
}
errMsg = m;
lineCount = l;
index = idx;
}
private Long[] buildLineIndex() throws IOException {
List<Long> idx = new LinkedList<Long>();
idx.add(0L);
ByteBuffer buf = ByteBuffer.allocate(8 * 1024);
long offset = 0;
while (chan.read(buf) != -1) {
int len = buf.position();
buf.rewind();
int pos = 0;
byte[] bufA = buf.array();
while (pos < len) {
byte c = bufA[pos++];
if (c == '\n')
idx.add(offset + pos);
}
offset = chan.position();
}
System.out.println("Done Building index");
return idx.toArray(new Long[] {});
}
@Override
public int getColumnCount() {
return 2;
}
@Override
public int getRowCount() {
return lineCount;
}
@Override
public String getColumnName(int columnIndex) {
switch (columnIndex) {
case 0:
return "#";
case 1:
return "Name";
}
return "";
}
@Override
public Object getValueAt(int rowIndex, int columnIndex) {
switch (columnIndex) {
case 0:
return String.format("%3d", rowIndex);
case 1:
if (errMsg != null)
return errMsg;
try {
Long pos = index[rowIndex];
chan.position(pos);
linebuf.clear(); // reset the buffer before each read
int n = chan.read(linebuf);
if (rowIndex == lineCount - 1)
return new String(linebuf.array(), 0, Math.max(n, 0)); // only the bytes actually read
else
return new String(linebuf.array(), 0, (int)(long)(index[rowIndex+1]-pos));
} catch (Exception e) {
return "Error: " + e.getMessage();
}
}
return "a";
}
@Override
public Class<?> getColumnClass(int columnIndex) {
return String.class;
}
// ... other methods to make interface complete
}
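For completeness, a hypothetical way to wire the model into a Swing table (the path is a placeholder); JTable only calls getValueAt() for the cells it renders, so rows are pulled in lazily:
JTable table = new JTable(new LogFileTableModel("/var/log/app.log"));
JScrollPane scrollPane = new JScrollPane(table);
// add scrollPane to your frame or panel as usual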
I want to save a file in an Oracle instance. I am using Hibernate for data objects. How can I insert files into an Oracle BLOB? Is there any sample code for this?
First you must annotate your entity field with the @javax.persistence.Lob annotation.
Like this:
public class InfoMessage {
private byte[] body;
@Lob
public byte[] getBody() {
return body;
}
public void setBody(byte[] body) {
this.body = body;
}
}
and set it with a byte array. It depends on which File class you use. This is the first Google result for java.io.File; I guess there's a better solution for this operation.
public static byte[] getBytesFromFile(File file) throws IOException {
InputStream is = new FileInputStream(file);
// Get the size of the file
long length = file.length();
if (length > Integer.MAX_VALUE) {
// File is too large
}
// Create the byte array to hold the data
byte[] bytes = new byte[(int)length];
// Read in the bytes
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
// Ensure all the bytes have been read in
if (offset < bytes.length) {
throw new IOException("Could not completely read file "+file.getName());
}
// Close the input stream and return bytes
is.close();
return bytes;
}
The @Lob annotation is not Hibernate's own; it's from javax.persistence, and you can use it in an entity bean with any Hibernate mapping.
Yes, a big file is an obvious problem for such an example, but I can't find any workaround for this case. FileInputStream uses an int value to represent the offset.
I googled this similar problem: http://www.coderanch.com/t/449055/Streams/java/JAVA-HEAP-SIZE-files-byte. You can use a solution with PreparedStatement if you use Java 1.6. Otherwise you can try to use a BufferedReader, somehow convert the results to a byte[], and face another problem: you'll need as much memory as the file size.
EDITED: Another thing: Oracle can append data to its CLOB/BLOB fields using the dbms_lob.writeappend() procedure, so you can avoid having the whole file in memory, but will GC clean up as fast as the BufferedReader reads from the file? And it seems this isn't a job for Hibernate... JDBC and PreparedStatements again.
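For reference, a rough sketch of the streaming JDBC route hinted at above (the table and column names are made up); setBinaryStream lets the driver stream the file into the BLOB without holding a byte[] of the whole file in memory:
void storeBlob(Connection conn, long id, File file) throws SQLException, IOException {
    try (PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO info_message (id, body) VALUES (?, ?)");
         InputStream in = new FileInputStream(file)) {
        ps.setLong(1, id);
        ps.setBinaryStream(2, in, file.length());   // available since JDBC 4.0 / Java 6
        ps.executeUpdate();
    }
}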