My Java program saves its data to a binary file, and (very) occasionally the file becomes corrupt due to a hardware fault. Usually only a few bytes are affected in a file that is several megabytes in size. To cope with this problem, I could write the data twice, but this seems overkill - I would prefer to limit the file size increase to about 20%.
This seems to me to be similar to the problem of sending information over a 'noisy' data stream. Is there a Java library or algorithm that can write redundant information to an output stream so the receiver can recover when noise is introduced?
What you want is Error Correction Codes. Check this code out: http://freshmeat.net/projects/javafec/
As well, the wikipedia article might give you more information:
http://en.wikipedia.org/wiki/Forward_error_correction
Your two possiblities are Forward Error Correction, where you send redundant data, or a system for error detection, where you check a hash value and re-request any data that has become corrupted. If corruption is an expected thing, error correction is the approach to take.
Without knowing the nature of your environment, giving more specific advice isn't really possible, but this should get you started on knowing how to approach this problem.
Error correcting codes. If I recall correctly the number of additional bits goes as log n for the block size, so the larger blocks the fewer correction bits.
You should choose a mechanism that interleaves the checkbits (probably most convenient as extra characters) in between the normal text. This allows for having repairable holes in your data stream while still being readable.
The problem of noisy communications has already has a great solution: Send a hash/CRC of the data (with the data) which is (re)evaluated by the receiver and re-requested if there was corruption en route.
In other words: use a hash algorithm to check for corruption and retransmit when necessary instead of sending data redundantly.
CRCs and ECCs are the stand answer to detecting and (for ECCs) recovering from data corruption due to noise. However, any scheme can only cope with a certain level of noise. Beyond that level you will get undetected and/or uncorrectable errors. The second problem is that these schemes will only work if you can add the ECCs / CRCs before the noise is injected.
But I'm a bit suspicious that you may be trying to address the wrong problem:
If the corruption is occurring when you transmit the file over a comms line, then you should be using comms hardware with built-in ECC etc support.
If the corruption is occurring when you write the file to disc, then you should replace the disc.
You should also consider the possibility that it is your application that is corrupting the data; e.g. due to some synchronization bug in your code.
Sounds antiquated, but funny, I just had a similar conversation with someone who wrote "mobile" apps (not PDA/phone but Oil&Gas drilling rig-style field applications). Due to the environment they actually wrote to disk in a modified XMODEM CRC transfer. I think it is easy to say however nothing special there other than to:
Use a RandomAccessFile in "rw" write a block of data (512-4096 bytes), re-read for CRC check, re-write if non-match, or iterate to next block. With OS file caching I'm curious how effective this is?
Related
I have a 10gb file and I need to parse it in Java, whereas the following error arises when I attempt to do this.
java.lang.NegativeArraySizeException
at java.util.Arrays.copyOf(Arrays.java:2894)
at org.antlr.v4.runtime.ANTLRInputStream.load(ANTLRInputStream.java:123)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:86)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:82)
at org.antlr.v4.runtime.ANTLRInputStream.<init>(ANTLRInputStream.java:90)
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
It looks like ANTLR v4 has a pervasive hard-wired limitation that input stream size is less that 2^31 characters. Removing this limitation would not be a small task.
Take a look at the source code for the ANTLRInputStream class - here.
As you can see, it attempts to hold the entire stream contents in a single char[]. That ain't going to work ... for huge input files. But simply fixing that by buffering the data in a larger data structure isn't going to be the answer either. If you look further down the file, there are a number of other methods that use int as the type for indexing the stream. They would need to be changed to use long ... and the changes will ripple out.
How can I solve this problem properly? How can I adjust such an input stream to handle this error?
Two approaches spring to mind:
Create your own version of ANTLR that supports large input files. This is a non-trivial project. I expect that the 32 bit assumption reaches into the code that ANTLR generates, etc.
Split your input files into smaller files before you attempt to parse them. Whether this is viable depends on the input syntax.
My recommendation would be the 2nd alternative. The problem with "supporting" huge input files (by in-memory buffering) is that it is going to be inefficient and memory wasteful ... and it ultimately doesn't scale.
You could also create an issue here, or ask on antlr-discussion.
i never stumbled upon this error, but i guess your array gets too big and it's index overflows (e.g., the integer wraps around and becomes negative). use another data structure, and most importantly, don't load all of the file at once (use lazy loading instead, that means, load only those parts that are being accessed)
I hope this will help http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
You might want to have some kind of buffer to read big files.
There is region in file(possible small) that I want to overwrite.
Assume I calling fseek, fwrite, fsync. Is there any way to ensure atomicity of such region-rewriting operation, e.g. i need to be sure, that in any case of failure the region will contains only old(before modification) data, or only new(modified) data, but not a mix of this.
There are two thing i want to highlight.
First: It's ok if there is no way to atomically write ANY size region - we can handle it by appending data to the file, fsync'ing, and then rewriting 'pointer' area in file, then fsyncing again. However, if 'pointer' writing is not atomic, we still can have corrupted file with illegal pointers.
Second: I am pretty sure, writing 1-byte regions is atomic: i will not see in file any bytes I never put there. So we can use some tricks with allocating two regions for addresses and use 1-byte switch, so rewriting of region became - append new data, syncing, rewrite one of two(unused) pointer slots, syncing again, and then rewrite 'switch byte' and again syncing. So the overwrite region operation now contains at least 3 fsync invocation.
All of this would be much easer, if I will have atomic writing for longs, but do i really have it?
Is there any way to handle this situation without using method, mentioned in point 2?
Another question is - is there any ordering guarantee between writing and syncing?
For example, if i call fseek, fwrite [1], fseek, fwrite [2], fsync, can i have writing at [2] commited, and writing at [1] - not commited?
This question is applicable to linux and windows operation system, any particular answer(e.g. in ubuntu version a.b.c ....) is also wanted.
It's usually safe to assume that writing a 512 bytes chunks are done in one write by the HDDs.
However, i would not assume that. Instead, i'd go with your second solution, while adding a checksum to your write and verifying it before changing the pointer in the file.
Generally, it's a good practice to add checksum to everything written to disk.
To answer about "sync" guarantee - you can assume that. While sync is FS and disk dependent, let's say we are talking about 'reasonable' implementation.
After the 1st sync the data is guaranteed to be flushed to the disk (the disk might have it
in it's cache still) and if the data you are expected to get whatever you wrote.
If after the second sync the data of both syncs is in the disk cache, the situation you described can happen, but IMHO the probability of that is very low.
Anyway, there's no other mechanism which will promise you data is on disk. That's why you must have checksums.
Some more info: Ensure fsync did its job
I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a java file, you might consider creating a compact binary format for you data.
Will storing and compiling the model in Java be significantly faster
than reading it from the file ?
That depends on the way you fashion your custom datastructure to contain your model.
The question IMHO is if the reading of the file takes long because of IO or because of computing time (=> CPU). If the later is the case then tough luck. If your IO (e.g. hard disc) is the cause then you can compress the file and extract it after/while reading. There is (of course) ZIP-support in Java (even for Streams).
I agree with the answer given above to use a binary input format. Let's try optimising that first. Can you provide some information? ...or have you googled working with binary data? ...buffering it? etc.?
Writing a .java file and compiling it will be quiet interesting... but it is bound to give your issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful for early optimisation. Usually, "highly-configurable" and "blinding fast" is mutual exclusive. Rather, get everything to work first and then use a profiler to optimise the really slow sections of the application.
I did some quick searching on the site and couldn't seem to find the answer I was looking for so that being said, what are some best practices for passing large xml files across a network. My thoughts on the matter are to stream chunks across the network in manageable segments, however I am looking for other approaches and best practices for this. I realize that large is a relative term so I will let you choose an arbitrary value to be considered large.
In case there is any confusion the question is "What are some best practices for sending large xml files across networks?"
Edit:
I am seeing a lot of compression being talked about, any particular compression algorithm that could be utilized and in terms of decompressing said files? I do not have much desire to roll my own when I am aware there are proofed algorithms out there. Also I appreciate the responses so far.
Compressing and reducing XML size has been an issue for more than a decade now, especially in mobile communications where both bandwidth and client computation power are scarce resources. The final solution used in wireless communications, which is what I prefer to use if I have enough control on both the client and server sides, is WBXML (WAP Binary XML Spec).
This spec defines how to convert the XML into a binary format which is not only compact, but also easy-to-parse. This is in contrast to general-purpose compression methods, such as gzip, that require high computational power and memory on the receiver side to decompress and then parse the XML content. The only downside to this spec is that an application token table should exist on both sides which is a statically-defined code table to hold binary values for all possible tags and attributes in an application-specific XML content. Today, this format is widely used in mobile communications for transmitting configuration and data in most of the applications, such as OTA configuration and Contact/Note/Calendar/Email synchronization.
For transmitting large XML content using this format, you can use a chunking mechanism similar to the one proposed in SyncML protocol. You can find a design document here, describing this mechanism in section "2.6. Large Objects Handling". As a brief intro:
This feature provides a means to synchronize an object whose size exceeds that which can be transmitted within one message (e.g. the maximum message size – declared in MaxMsgSize
element – that the target device can receive). This is achieved by splitting the object into chunks that will each fit within one message and by sending them contiguously. The first chunk of data is sent with the overall size of the object and a MoreData tag signaling that more chunks will be sent. Every subsequent chunk is sent with a MoreData tag, except from the last one.
Depending on how large it is, you might want to considering compressing it first. This, of course, depends on how often the same data is sent and how often it's changed.
To be honest, the vast majority of the time, the simplest solution works fine. I'd recommend transmitting it the easiest way first (which is probably all at once), and if that turns out to be problematic, keep on segmenting it until you find a size that's rarely disrupted.
Compression is an obvious approach. This XML bugger will shrink like there is no tomorrow.
If you can keep a local copy and two copies at the server, you could use diffxml to reduce what you have to transmit down to only the changes, and then bzip2 the diffs. That would reduce the bandwidth requirement a lot, at the expense of some storage.
Are you reading the XML with a proper XML parser, or are you reading it with expectations of a specific layout?
For XML data feeds, waiting for the entire file to download can be a real waste of memory and processing time. You could write a custom parser, perhaps using a regular expression search, that looks at the XML line-by-line if you can guarantee that the XML will not have any linefeeds within tags.
If you have code that can digest the XML a node-at-a-time, then spit it out a node-at-a-time, using something like Transfer-Encoding: chunked. You write the length of the chunk (in hex) followed by the chunk, then another chunk, or "0\n" at the end. To save bandwidth, gzip each chunk.
So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, is opening what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called Mapped Byte Buffers are are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All to frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results. .
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine as the MBB implementation is defendant on the addressable memory space by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
If you are thinking about performance, then know Kirk Pepperdine (a somewhat famous Java performance guru.) He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have 1+ something copies of the data, if you need to keep the original around.
It may be practical to build some kind of index on top of your original ascii data, so that if you need to go through the data again you can do it faster in subsequent times.
To answer your questions in order:
Should I load everything into memory all at once?
Not if don't have to. for some files, you may be able to, but if you're just processing sequentially, just do some kind of buffered read through the things one by one, storing whatever you need along the way.
If not, is opening what's a good way of loading the data partially?
BufferedReaders/etc is simplest, although you could look deeper into FileChannel/etc to use memorymapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entiretly in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need to do random access consider to store them in a quadtree.
I recommend strongly leveraging Regular Expressions and looking into the "new" IO nio package for faster input. Then it should go as quickly as you can realistically expect Gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.