Is the offset of a file pointer robust/reliable across programming languages? - java

I have a question about reading files, for instance in Java or in C/C++. You can usually get an offset value for the current position in a file.
How robust is this offset? Assuming the file is not changed, of course, will I read the same line via Java as I would via C/C++ if I position the stream at this offset?
I would guess yes, but I was wondering if I am missing something. What I want to do is build some kind of index that returns this offset value for a specific file. Can that work, or is this offset bound to a certain API or even to a particular (32- or 64-bit) architecture?
Regards,

The offset of a given byte in a given file is going to be 100% reliable for (at least) any system with a POSIX / POSIX-like model of files. It follows that the same offset will give you the same byte in Java and C++. However, this does depend on you using the respective languages' I/O APIs correctly; i.e. understanding them.
One thing that can get a bit tricky is when you use some "binary I/O" scheme in C++ that involves treating objects (or structs) as arrays of bytes and reading / writing those bytes. If you do that, you have the problem that the byte-level representations of C / C++ objects are platform dependent. For instance, you can run into the big-endian vs little-endian problem. This doesn't alter offsets ... but it can mean that "stuff" gets mangled due to representation mismatches.
The best thing to do is to use a file representation that is not dependent on the platform where the file is read or written; i.e. don't do it that way.
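As a minimal sketch of using such an offset from Java, RandomAccessFile lets you seek to an absolute byte position; the file name and offset below are hypothetical, and the offset could just as well have been recorded by a C program via ftell():

import java.io.IOException;
import java.io.RandomAccessFile;

public class OffsetRead {
    public static void main(String[] args) throws IOException {
        long offset = 1024L; // hypothetical offset, e.g. recorded earlier by ftell()
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r")) {
            raf.seek(offset);   // position the file pointer at the absolute byte offset
            int b = raf.read(); // read the byte at that offset (-1 at end of file)
            System.out.println("Byte at offset " + offset + ": " + b);
        }
    }
}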

Related

Understanding the importance of serialisation in Java

I was just introduced to the concept of serialisation in Java, and while I 'get' the fundamentals, I can't help but feel like it's a bit of overkill. My logic is that I have pointers to the objects and I know how many bytes each one takes up in memory. Why can't I just, theoretically, write these bytes to some txt file, along with some extra bytes to indicate the type? With this, can't I just read these bytes back and restore my original object?
The amount of detail my book goes into on serialisation gives me a good indication that I'm not really understanding its importance, and that there is probably something more subtle than just writing out all the bytes exactly as they are. Any help is greatly appreciated! (I have some background in C++ if that helps.)
Why can't I just theoretically write these bytes to some txt file, along with some extra bytes to indicate the type? With this, can't I just read these bytes back and restore my original object?
How could anyone ever read them back in? Say I'm writing code that's supposed to read in your file. Please tell me what the third byte means so that I can decode it properly.
What if the internal representation of the object contains pointers to other objects that might be in different memory locations the next time the program runs? For example, it is quite common to manage identical strings by having internal references to the same internal string object. How will writing that reference to a file be sensible given that the internal string object may not exist in the next run?
To write data to a file, you need to write it out in some specific format that actually contains all the information you need to be able to read back in. What happens to work internally for this program at this time just won't do as there's no guarantee another program at another time can make sense of it.
What you suggest works provided:
- the order and type of the fields don't change (note this is not fixed at compile time);
- the byte order doesn't change;
- you don't have any references, e.g. no String, enum, List or Map;
- the name and package of the type don't change.
We at Chronicle use a form of serialization which supports this, as it's much faster, but it's very limiting. You have to be very aware of those limitations and have a problem which suits them. We also have a form of serialization which has none of these constraints, but it is slower.
The purpose of Java Serialization is to support arbitrary object graphs even if data is exchanged between systems which might arrange the data differently.
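To make the contrast concrete, here is a minimal sketch of what built-in Java Serialization handles for you; the Point class and file name are invented for illustration:

import java.io.*;
import java.util.List;

// A serializable class; the JVM encodes types and references for us.
class Point implements Serializable {
    private static final long serialVersionUID = 1L;
    int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class SerializeDemo {
    public static void main(String[] args) throws Exception {
        // Write the whole object graph, including the list's internal references.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("points.ser"))) {
            out.writeObject(List.of(new Point(1, 2), new Point(3, 4)));
        }
        // Read it back on any JVM, regardless of memory layout or endianness.
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("points.ser"))) {
            @SuppressWarnings("unchecked")
            List<Point> points = (List<Point>) in.readObject();
            System.out.println(points.size() + " points restored");
        }
    }
}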

Sizeof in C, porting to Java

I have code in C like this:

skip = (unsigned long) (st_row - 1) * tot_numcols;
fseek(infile, sizeof(cnum) * skip, SEEK_SET); /* the original passed 0, which is SEEK_SET */
Now I have to port it to Java. How can I do that? "cnum" is a struct in C, so I created a class in Java. But what about that fseek: how can I point to the exact position in a file in Java?
Your C design is broken, and you can't do what you apparently want in Java.
It appears that you're storing information out of C structs by blindly dumping their raw bytes to disk. In addition to being difficult to debug, this is prone to break completely with any change that makes the compiler decide to pack the struct differently, including, in particular, compiling identical code for 32-bit and 64-bit or little- and big-endian targets. Instead, you should always explicitly serialize structured data. Human-readable formats are best unless there's a very large amount of data.
Java simply doesn't permit this kind of attempt. The Java memory model explicitly hides information about runtime memory packing, and the JVM has wide latitude to organize memory management as it sees fit.
Instead, define a clear format for saving your data, including endianness, and use that from both languages.
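As a hedged sketch of that advice, suppose both sides agree that each cnum record is exactly two 8-byte big-endian doubles (16 bytes per record); the record layout, variable values, and file name here are all assumptions, not the asker's actual format:

import java.io.IOException;
import java.io.RandomAccessFile;

public class RecordRead {
    static final int RECORD_SIZE = 16; // 2 doubles, fixed by the agreed file format

    public static void main(String[] args) throws IOException {
        long stRow = 5, totNumcols = 100;     // hypothetical values
        long skip = (stRow - 1) * totNumcols; // same arithmetic as the C code
        try (RandomAccessFile raf = new RandomAccessFile("cnum.dat", "r")) {
            raf.seek(skip * RECORD_SIZE);     // the equivalent of fseek(..., SEEK_SET)
            double re = raf.readDouble();     // DataInput reads are always big-endian
            double im = raf.readDouble();
            System.out.println(re + " + " + im + "i");
        }
    }
}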

How do I read a file without any buffering in Java?

I'm working through the problems in Programming Pearls, 2nd edition, Column 1. One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file. Since Java is the language I'm the most familiar with, I've decided to use it even though the author seems to have had C and C++ in mind.
Since I'm pretending memory is limited for the purpose of the problem I'm working on, I'd like to make sure the process of reading the file has no buffering at all.
I thought InputStreamReader would be a good solution, until I read this in the Java documentation:
To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation.
Ideally, only the bytes that are necessary would be read from the stream -- in other words, I don't want any buffering.
One of the problems involves writing a program that uses only around 1 megabyte of memory to store the contents of a file as a bit array with each bit representing whether or not a 7 digit number is present in the file.
This implies that you need to read the file as bytes (not characters).
Assuming that you do have a genuine requirement to read from a file without buffering, then you should use the FileInputStream class. It does no buffering. It reads (or attempts to read) precisely the number of bytes that you asked for.
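A minimal sketch of that, with a hypothetical file name (the operating system may still cache pages, but Java itself adds no buffering here):

import java.io.FileInputStream;
import java.io.IOException;

public class UnbufferedRead {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("numbers.txt")) {
            int b;
            while ((b = in.read()) != -1) { // each read() requests exactly one byte
                // process the byte, e.g. accumulate digits of a 7-digit number
            }
        }
    }
}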
If you then need to convert those bytes to characters, you could do this by applying the appropriate String constructor to a byte or byte[]. Note that for multibyte character encodings such as UTF-8, you would need to read sufficient bytes to complete each character. Doing that without the possibility of read-ahead is a bit tricky ... and entails knowledge of the character encoding you are reading.
(You could avoid that knowledge by using a CharsetDecoder directly. But then you'd need to use the decode method that operates on Buffer objects, and that is a bit complicated too.)
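For what that might look like, here is a rough sketch of feeding bytes to a CharsetDecoder by hand; in real code the bytes would arrive incrementally from the unbuffered stream:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeSketch {
    public static void main(String[] args) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Two bytes forming one UTF-8 character ("é").
        ByteBuffer bytes = ByteBuffer.wrap(new byte[] {(byte) 0xC3, (byte) 0xA9});
        CharBuffer chars = CharBuffer.allocate(16);
        decoder.decode(bytes, chars, true); // true = no more input follows
        decoder.flush(chars);
        chars.flip();
        System.out.println(chars); // prints: é
    }
}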
For what it is worth, Java makes a clear distinction between stream-of-byte and stream-of-character I/O. The former is supported by InputStream and OutputStream, the latter by Reader and Writer. The InputStreamReader class is a Reader that adapts an InputStream. You should not be considering it for an application that wants to read stuff byte-wise.

The Efficiency of Hard-Coding vs. File Input

I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of a large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a .java file, you might consider creating a compact binary format for your data.
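For instance, a hedged sketch of such a format for a 2-D array of doubles; the element type, dimension layout, and class name are assumptions about the model:

import java.io.*;

public class ModelIo {
    static void write(double[][] model, File f) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(model.length);    // row count header
            out.writeInt(model[0].length); // column count header
            for (double[] row : model)
                for (double v : row)
                    out.writeDouble(v);
        }
    }

    static double[][] read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            // Dimension expressions are evaluated left to right: rows, then columns.
            double[][] model = new double[in.readInt()][in.readInt()];
            for (double[] row : model)
                for (int j = 0; j < row.length; j++)
                    row[j] = in.readDouble();
            return model;
        }
    }
}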
Will storing and compiling the model in Java be significantly faster than reading it from the file?
That depends on the way you fashion your custom data structure to contain your model.
The question, IMHO, is whether the reading of the file takes long because of I/O or because of computing time (CPU). If the latter is the case, then tough luck. If your I/O (e.g. the hard disk) is the cause, then you can compress the file and decompress it after/while reading. There is (of course) ZIP support in Java (even for streams).
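A small sketch of that idea, layering the binary reads and writes over GZIP streams (the file name is hypothetical):

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedModelIo {
    public static void main(String[] args) throws IOException {
        // Write compressed: trades CPU time for disk I/O.
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(new FileOutputStream("model.bin.gz")))) {
            out.writeDouble(3.14);
        }
        // Read it back through a decompressing stream.
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new FileInputStream("model.bin.gz")))) {
            System.out.println(in.readDouble());
        }
    }
}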
I agree with the answer given above: use a binary input format. Let's try optimising that first. Can you provide some information? Or have you googled working with binary data? Buffering it? etc.?
Writing a .java file and compiling it will be quite interesting... but it is bound to give you issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful about premature optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Rather, get everything working first and then use a profiler to optimise the really slow sections of the application.

Huffman coding in Java

I want to encode files using Huffman coding.
I have found the length in bits for each symbol (its Huffman code).
Is it possible to write such an encoding to a file in Java: are there any existing classes that read and write to a file bit by bit, rather than with a char as the minimum unit?
You could create a BitSet to store your encoding as you are creating it and simply write the String representation to a file when you are done.
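As a toy illustration of that suggestion (note that BitSet.toString() produces text like "{0, 2, 3}", i.e. the indices of the set bits, not a packed binary encoding):

import java.io.FileWriter;
import java.io.IOException;
import java.util.BitSet;

public class BitSetSketch {
    public static void main(String[] args) throws IOException {
        BitSet bits = new BitSet();
        bits.set(0); // append the bits of each symbol's code as you encode
        bits.set(2);
        bits.set(3);
        try (FileWriter w = new FileWriter("encoded.txt")) {
            w.write(bits.toString()); // writes "{0, 2, 3}"
        }
    }
}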
You really don't want to write single bits to a file, believe me. Usually we define a byte buffer, build the "file" in memory and, after all work is done, write the complete buffer. Otherwise it would take forever (nearly).
If you need a fast bit vector, then have a look at the colt library. That's pretty convenient if you want to write single bits and don't do all this bit shifting operations on your own.
I'm sure there are Huffman classes out there, but I'm not immediately aware of where they are. If you want to roll your own, two ways to do this spring to mind immediately.
The first is to assemble the bit strings in memory by using mask and shift operators, accumulating the bits into larger data objects (i.e. ints or longs), and then write those out to a file with standard streaming.
The second, more ambitious and self-contained idea would be to write an implementation of OutputStream that has a method for writing a single bit and then this OutputStream class would do the aforementioned buffering/shifting/accumulating itself and probably pass the results down to a second, wrapped OutputStream.
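A minimal sketch of that second idea follows; the class name, MSB-first bit order, and zero-padding of the final byte are choices made here for illustration:

import java.io.IOException;
import java.io.OutputStream;

// Buffers bits into a byte and passes completed bytes to a wrapped stream.
public class BitOutputStream extends OutputStream {
    private final OutputStream out;
    private int current = 0; // bits accumulated so far
    private int nBits = 0;   // number of bits held in 'current'

    public BitOutputStream(OutputStream out) { this.out = out; }

    public void writeBit(int bit) throws IOException {
        current = (current << 1) | (bit & 1);
        if (++nBits == 8) { out.write(current); current = 0; nBits = 0; }
    }

    @Override
    public void write(int b) throws IOException {
        for (int i = 7; i >= 0; i--) writeBit((b >> i) & 1); // MSB first
    }

    @Override
    public void close() throws IOException {
        while (nBits != 0) writeBit(0); // pad the last byte with zero bits
        out.close();
    }
}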
Try writing a bit vector in java to do the bit representation: it should allow you to set/reset the individual bits in a bit stream.
The bit stream can thus hold your Huffman encoding. This is the best approach, and lightning fast too.
Huffman sample analysis here
You can find a working (and fast) implementation here: http://code.google.com/p/kanzi/source/browse/src/kanzi/entropy/HuffmanTree.java
