There is a region in a file (possibly small) that I want to overwrite.
Assume I am calling fseek, fwrite, fsync. Is there any way to ensure atomicity of such a region-rewriting operation? That is, I need to be sure that after any failure the region will contain only the old (pre-modification) data or only the new (modified) data, never a mix of the two.
There are two things I want to highlight.
First: it's OK if there is no way to atomically write a region of ANY size. We can handle that by appending the data to the file, fsync'ing, then rewriting a 'pointer' area in the file, then fsync'ing again. However, if the 'pointer' write is not atomic, we can still end up with a corrupted file containing illegal pointers.
Second: I am pretty sure that writing a 1-byte region is atomic: I will never see bytes in the file that I did not put there. So we can use a trick: allocate two slots for addresses plus a 1-byte switch. Rewriting a region then becomes: append the new data, sync, rewrite the unused one of the two pointer slots, sync again, then rewrite the 'switch byte' and sync once more. So the overwrite operation now involves at least 3 fsync invocations.
All of this would be much easier if I had atomic writes for longs, but do I really have that?
Is there any way to handle this situation without using the method mentioned in the second point?
Another question: is there any ordering guarantee between writing and syncing?
For example, if I call fseek, fwrite [1], fseek, fwrite [2], fsync, can the write at [2] be committed while the write at [1] is not?
This question applies to Linux and Windows; answers for a particular system (e.g. Ubuntu version a.b.c ...) are also welcome.
It's usually safe to assume that an HDD writes a 512-byte chunk in a single operation.
However, I would not rely on that. Instead, I'd go with your second solution, adding a checksum to your write and verifying it before changing the pointer in the file.
Generally, it's good practice to add a checksum to everything written to disk.
To answer the question about the "sync" guarantee: you can generally rely on it. While sync is FS and disk dependent, let's say we are talking about a 'reasonable' implementation.
After the first sync, the data is guaranteed to have been flushed to the disk (the disk might still hold it in its cache), and if you read the data back you should get whatever you wrote.
If, after the second sync, the data of both syncs is still in the disk cache, the situation you described can happen, but IMHO the probability of that is very low.
Anyway, there's no other mechanism that will promise you the data is on disk. That's why you must have checksums.
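A minimal Java sketch of the two-slot scheme with a checksum, in case it helps to see the steps together. The class name TwoSlotFile and the file layout (switch byte at offset 0, two 20-byte slots holding offset, length and CRC32, data appended after them) are assumptions made for illustration, not something from the question. FileChannel.force(true) plays the role of fsync.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Hypothetical layout: [switch byte][slot 0][slot 1][appended records ...]
final class TwoSlotFile implements AutoCloseable {
    private static final int SLOT_SIZE = 8 + 4 + 8;            // offset, length, CRC32
    private static final long SLOT0 = 1;                        // byte 0 is the switch
    private static final long DATA_START = SLOT0 + 2 * SLOT_SIZE;

    private final FileChannel ch;

    TwoSlotFile(Path path) throws IOException {
        ch = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    void update(byte[] data) throws IOException {
        // 1. Append the new data and sync.
        long offset = Math.max(ch.size(), DATA_START);
        writeFully(ByteBuffer.wrap(data), offset);
        ch.force(true);
        // 2. Fill the slot that is currently unused and sync.
        int spare = 1 - activeSlot();
        ByteBuffer slot = ByteBuffer.allocate(SLOT_SIZE);
        slot.putLong(offset).putInt(data.length).putLong(crc(data));
        slot.flip();
        writeFully(slot, SLOT0 + (long) spare * SLOT_SIZE);
        ch.force(true);
        // 3. Flip the one-byte switch and sync again.
        writeFully(ByteBuffer.wrap(new byte[] {(byte) spare}), 0);
        ch.force(true);
    }

    byte[] read() throws IOException {
        if (ch.size() < DATA_START) return new byte[0];         // nothing committed yet
        ByteBuffer slot = ByteBuffer.allocate(SLOT_SIZE);
        readFully(slot, SLOT0 + (long) activeSlot() * SLOT_SIZE);
        slot.flip();
        long offset = slot.getLong();
        int length = slot.getInt();
        long expected = slot.getLong();
        if (length < 0 || offset < 0 || offset + length > ch.size())
            throw new IOException("corrupt pointer slot");
        ByteBuffer data = ByteBuffer.allocate(length);
        readFully(data, offset);
        if (crc(data.array()) != expected) throw new IOException("checksum mismatch");
        return data.array();
    }

    private int activeSlot() throws IOException {
        if (ch.size() == 0) return 0;
        ByteBuffer b = ByteBuffer.allocate(1);
        readFully(b, 0);
        return b.get(0) == 0 ? 0 : 1;
    }

    private void writeFully(ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) pos += ch.write(buf, pos);
    }

    private void readFully(ByteBuffer buf, long pos) throws IOException {
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos);
            if (n < 0) throw new EOFException();
            pos += n;
        }
    }

    private static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    @Override public void close() throws IOException { ch.close(); }
}
```

The checksum stored in the slot lets a reader reject a slot or record that was only partly written, which is the case the switch byte alone cannot detect. Whether force() really reaches the platters still depends on the file system and the disk, as discussed above.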
Some more info: Ensure fsync did its job
I am currently designing around a big in-memory index structure (several gigabytes). The index is actually an RTree whose leaves are BTrees (don't ask). It supports a special query and pushes it to the logical limit.
Since those nodes are solely search nodes, I ask myself how best to make it parallel.
I know of six solutions so far:
1. Block reads when a write is scheduled. The tree is blocked completely until the last read has finished; then the write is performed, and after the write the tree can again be used for multiple reads. (Reads need no locking.)
2. Clone the nodes to be changed and reuse the existing nodes (including leaves), then switch between the two versions by again simply stopping reads, switching, and resuming. Since leaf pointers must be altered as well, the leaf pointers might become their own collection, which makes it possible to switch modifications atomically; changes can be redone on a second version to avoid copying the pointers on each insert.
3. Use independent copies of the index, like double buffering. Update one copy of the index and switch to it. Once no one reads the old index, alter that index in the same way. This way the change can be made without blocking existing reads. If another insert hits the tree within a reasonable amount of time, those changes can be applied as well.
4. Use a serial shared-nothing architecture so that each search thread has its own copy. Since a thread can only alter its tree after a single read has been performed, this would also be lock-free and simple. Because reads are spread evenly across the worker threads (each bound to a certain core), throughput would not be harmed.
5. Use read/write locks for each node about to be written and block only a subtree during a write. This would involve additional operations against the tree, since splitting and merging propagate upwards and therefore require a second pass of the insert (expanding locks upwards, towards the parent, would introduce the chance of a deadlock). Since splits and merges are not that frequent with a larger page size, this would also be a good way. My current BTree implementation actually uses a similar mechanism: it splits a node and reinserts the value until no further split is needed (which is not optimal but simpler).
6. Use a double buffer for each node, like the shadow cache in databases, where each page is switched between two versions. Every time a node is modified, a copy is modified, and a read uses either the old version or the new one. Each node gets a version number, and the version closest to the active version (latest change) is chosen. To switch between two versions, one only needs an atomic change to the root information. This way the tree can be altered while it is in use. The switch can be done at any time, but it must be ensured that no read is still using the old version when it is overwritten with the new one. This method need not interfere with cache locality in order to link leaves and the like. It requires twice the amount of memory, since a back buffer must be present, but it saves allocation time and might be good for a high frequency of changes.
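To make the "atomic change on the root information" concrete, here is a small Java sketch. It is only an illustration under loud assumptions: the class name SnapshotIndex is made up, a TreeMap stands in for the RTree/BTree, and the whole map is copied on each write, whereas a real index of several gigabytes would copy only the nodes on the modified path.

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the index: readers never lock, writers publish a new root.
final class SnapshotIndex<K extends Comparable<K>, V> {
    // The published version is never mutated after it has been set.
    private final AtomicReference<NavigableMap<K, V>> root =
            new AtomicReference<NavigableMap<K, V>>(new TreeMap<K, V>());

    // Readers grab the current root once and keep using that version.
    V get(K key) {
        return root.get().get(key);
    }

    // Writers are serialized; each one modifies a private copy and then
    // switches the root with a single atomic store.
    synchronized void put(K key, V value) {
        TreeMap<K, V> copy = new TreeMap<K, V>(root.get());
        copy.put(key, value);
        root.set(copy);
    }
}
```

A read that started before the switch keeps working on the old version, and in Java the garbage collector reclaims that version once the last reader drops it. With a manually managed back buffer, as in solution 6, that reclamation has to be tracked explicitly before the old version is overwritten.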
With all those thoughts, what is best? I know it depends, but what is done in the wild? With 10 read threads (or even more) being blocked by a single write operation, I guess this is nothing I really want.
Also, how about L3, L2 and L1 cache, and scenarios with multiple CPUs? Any issues there? The beauty of double buffering is the chance that reads hitting the old version still work with the correct cached version.
The variant that creates a fresh copy of a node is not very appealing. So what is found in the wild in today's database landscape?
[update]
Rereading the post, I wonder whether, instead of using write locks for split and merge, it would be better to create replacement nodes. For a split or a merge I need to copy roughly half of the elements around anyway, and those operations are very rare, so copying a node completely and replacing it in the parent node, which is a simple and fast operation, would do the trick. This way the actual blocking of reads would be very limited, and since we create copies anyway, blocking only happens while the new nodes are swapped in. That leaves may not be altered during those accesses is unimportant, since the information density has not changed. But again, every access to a node then needs an increment and decrement of a read lock and a check for intended write locks. All of this is overhead, and all of it blocks further reads.
[Update2]
Solution 7. (currently favored)
Currently we favor a double buffer for the internal (non-leaf) nodes and use something similar to row locking.
Decomposing our logical tables with those index structures (which is all an index does) results in applying set algebra to that information. I noticed that this set algebra is linear (O(m+n) for intersection and union) and gives us the chance to lock each entry that takes part in such an operation.
By double buffering the internal nodes (which is neither hard to implement nor costly, at under about 1% memory overhead) we can live problem-free on that issue without blocking too many read operations.
Since we batch modifications in a certain way, a given column is very rarely updated, but once it is, it takes more time, since the modifications to this single entry might go into the thousands.
So the goal is to alter the set algebra so that columns currently being modified are simply intersected later on. Since only one column is modified at a time, such an operation would block only once. And the write operation has to wait for everyone currently reading the column. And guess what: once a write operation waits, it usually lets another write operation on another column that is not busy take place. We calculate the probability of such a block to be very, very low, so we don't need to care.
The locking mechanism works as follows: check for write, check for write intention, add read, check for write again, and proceed with the read. So there is no explicit object locking. We access fixed areas of bytes, and once the structure is clear, everything critical is planned to move into a C++ version to make it somewhat faster (2x, we guess; this only takes one person one or two weeks, especially if you use a Java-to-C++ translator).
The only other effect that might now be important is caching, since a modification invalidates L1 caches and maybe L2 too. So we plan to collect all modifications to such a table/index and schedule them to run within a time slice of one or more minutes, evenly distributed so that the system has no performance hiccups.
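The read protocol described above (check for write, check for write intention, add read, check for write again) might look roughly like this in Java. This is a sketch under assumptions, not our actual code: the class ColumnGate and its method names are invented for illustration, and it uses java.util.concurrent atomics instead of fixed byte areas.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-column gate; callers retry (or work on another column) when a try fails.
final class ColumnGate {
    private final AtomicInteger readers = new AtomicInteger();
    private final AtomicBoolean writing = new AtomicBoolean();
    private final AtomicBoolean writeIntended = new AtomicBoolean();

    boolean tryBeginRead() {
        // check for write, check for write intention
        if (writing.get() || writeIntended.get()) return false;
        // add read
        readers.incrementAndGet();
        // check for write again: a writer may have slipped in between
        if (writing.get()) {
            readers.decrementAndGet();
            return false;
        }
        return true;                      // proceed with the read
    }

    void endRead() {
        readers.decrementAndGet();
    }

    boolean tryBeginWrite() {
        writeIntended.set(true);          // keeps new readers out from now on
        if (!writing.compareAndSet(false, true)) return false;  // another writer is active
        if (readers.get() != 0) {         // readers still inside: back off, keep the intention
            writing.set(false);
            return false;
        }
        return true;
    }

    void endWrite() {
        writeIntended.set(false);
        writing.set(false);
    }
}
```

The reader registers itself before the second write check and the writer raises its flag before counting readers, so at least one side always notices the other. A writer that backs off leaves its intention set, which drains the readers and lets the next attempt succeed.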
If you know of anything that helps us please go ahead.
As no one replied, I would like to summarize what we (I) finally did. The structure is now separated. We have an RTree whose leaves are actually tables. Those tables can even be remote, so we have a way of distributing that is mostly transparent thanks to RMI and proxies.
The rest was simply easy. The RTree can advise a table to split, and the result of the split is again a table. The split is done on a single machine and transferred to another if it has to be remote. Merge is almost the same.
This remoteness also applies to threads bound to different CPUs, to avoid cache issues.
As for modification in memory, it is as I already suggested: we duplicated the internal nodes, turned the table by 90°, and adapted the set-algebra algorithms to handle locked columns efficiently. The test for a single table is simple and, compared to the thousands of entries per column, not a performance issue after all. Deadlocks are also impossible, since one column is used at a time, so there is only one lock per thread. We are experimenting with processing columns in parallel, which would improve the response time. We are also thinking about binding columns to a given virtual core, so that again no locking is needed, since the column is isolated and modifications can be serialized.
This way one can utilize 20 cores and more per CPU and also avoid cache misses.
I have a java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, where I then seek back and write some information at the start of the file.
I would like to somehow put the file into a zip archive. I guess I could just do this at the end, but this would involve copying all the data that has been written so far. Since these files can potentially grow very large, I would prefer a way that somehow did not involve copying the data.
Is there some way to get something like a "ZipRandomAccessFile", à la the ZipOutputStream that is available in the JDK?
It doesn't have to be JDK-only; I don't mind pulling in third-party libraries to get the job done.
Any ideas or suggestions..?
Maybe you need to change the file format so it can be written sequentially.
In fact, since it is a Zip and Zip can contain multiple entries, you could write the sequential data to one ZipEntry and the data known 'only at completion' to a separate ZipEntry - which gives the best of both worlds.
It is easy to write, not having to go back to the beginning of the large sequential chunk. It is easy to read - if the consumer needs to know the 'header' data before reading the larger resource, they can read the data in that zip entry before proceeding.
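A rough sketch of that two-entry layout with the JDK's own classes. The entry names "body" and "header" and the class TwoEntryZip are placeholders chosen for this example, and the "header" here is just the total byte count, standing in for whatever is only known at completion.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class TwoEntryZip {
    // Streams the large data first, then writes what is only known at the end.
    static void write(OutputStream out, Iterable<byte[]> bodyChunks) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            long total = 0;
            zip.putNextEntry(new ZipEntry("body"));
            for (byte[] chunk : bodyChunks) {
                zip.write(chunk);
                total += chunk.length;
            }
            zip.closeEntry();

            zip.putNextEntry(new ZipEntry("header"));
            zip.write(Long.toString(total).getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }

    // ZipFile looks entries up by name, so the consumer can read "header"
    // first even though it was written last.
    static byte[] readEntry(Path zipPath, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zf.getEntry(name);
            if (entry == null) throw new FileNotFoundException(name);
            try (InputStream in = zf.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] tmp = new byte[8192];
                int n;
                while ((n = in.read(tmp)) != -1) buf.write(tmp, 0, n);
                return buf.toByteArray();
            }
        }
    }
}
```

Nothing is copied or rewritten: the body is compressed as it is produced, and no seek back to the start is needed.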
The way the DEFLATE format is specified, it only makes sense if you read it from the start. So each time you seek back and forth, the underlying zip implementation would have to start reading the file from the start. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, and then the whole thing compressed again.
To sum it up, ZIP/DEFLATE isn't the format for this. However, breaking your data up into smaller, fixed size files that are compressed individually might be feasible.
The point of compression is to recognize redundancy in data (like some characters occurring more often or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random access writing. In particular:
You never know in advance how well a piece of data can be compressed. So if you change some block of data, its compressed version will be most likely either longer or shorter.
As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (like discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from this change to the end.
So the only reasonable solution is to manipulate the data and compress it all at once at the end.
I was assigned to parallelize GZip in Java 7, and I am not sure what is possible.
The assignment is:
Parallelize gzip using a given number of threads
Each thread takes a 1024 KiB block, using the last 32 KiB of the previous block as a dictionary. There is an option to use no dictionary
Read from stdin and write to stdout
What I have tried:
I have tried using GZIPOutputStream, but there doesn't seem to be a way to isolate and parallelize the deflate(), nor can I access the deflater to alter the dictionary. I tried extending GZIPOutputStream, but it didn't behave as I wanted, since I still couldn't isolate the compress/deflate step.
I tried using Deflater with wrap enabled and a FilterOutputStream to output the compressed bytes, but I wasn't able to get it to compress properly in GZip format. I gave each thread a compressor that writes to a byte array, which is then written to the OutputStream.
I am not sure whether I implemented my approaches wrongly or chose the wrong approaches altogether. Can anyone point me in the right direction as to which classes to use for this project?
Yep, zipping a file with a dictionary can't be parallelized, as everything depends on everything. Maybe your teacher asked you to parallelize the individual gzipping of multiple files in a folder? That would be a great example of parallelized work.
To make a process concurrent, you need to have portions of code which can run concurrently and independently. Most compression algorithms are designed to be run sequentially, where every byte depends on every byte that has come before.
The only way to do compression concurrently is to change the algorithm (making it incompatible with existing approaches).
I think you can do it by inserting appropriate resets into the compression stream. The underlying compression engine used in gzip allows the deflater to be reset, with the aim of making it easier to recover from stream corruption, though at the cost of a worse compression ratio. After a reset the deflater is in a known state, which is independent of the content being compressed. You could therefore start from that state in multiple threads (and from many locations in the input data, of course), produce a compressed chunk in each, and include the data produced by the following reset so that the deflater ends up back in the known state. Then you just have to reassemble the compressed pieces into the overall compressed stream. “Simple!” (Hah!)
I don't know if this will work, and I suspect that the complexity of the whole thing will make it not a viable choice except when you're compressing single very large files. (If you had many files, it would be much easier to just compress each of those in parallel.) Still, that's what I'd try first.
(Also note that the gzip format is just a deflated stream with extra metadata.)
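Here is one way this might be sketched with java.util.zip.Deflater, offered as an untuned illustration rather than a reference solution. Instead of a literal reset it uses a related technique: each block is compressed by its own raw Deflater (nowrap = true) and ended with SYNC_FLUSH so that it stops on a byte boundary, and only the last block is finished. The pieces are then concatenated between a hand-written gzip header and trailer. The class name ParallelGzip is made up, the input is taken as a byte array rather than stdin for brevity, and lambdas assume Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

final class ParallelGzip {
    static final int BLOCK = 1024 * 1024;   // 1024 KiB per task
    static final int DICT = 32 * 1024;      // tail of the previous block used as dictionary

    static void compress(byte[] input, OutputStream out, int threads, boolean useDictionary)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> parts = new ArrayList<Future<byte[]>>();
            int start = 0;
            do {
                final int from = start;
                final int to = Math.min(input.length, from + BLOCK);
                final boolean last = (to == input.length);
                final byte[] dict = (useDictionary && from > 0)
                        ? Arrays.copyOfRange(input, Math.max(0, from - DICT), from) : null;
                Callable<byte[]> task = () -> deflateBlock(input, from, to - from, dict, last);
                parts.add(pool.submit(task));
                start = to;
            } while (start < input.length);

            // Minimal gzip header: magic, CM = deflate, no flags, no mtime, OS = unknown.
            out.write(new byte[] {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff});
            for (Future<byte[]> part : parts) out.write(part.get());   // keeps input order
            CRC32 crc = new CRC32();
            crc.update(input, 0, input.length);
            writeIntLE(out, (int) crc.getValue());   // trailer: CRC32 of the uncompressed data
            writeIntLE(out, input.length);           // trailer: uncompressed size mod 2^32
        } finally {
            pool.shutdown();
        }
    }

    static byte[] deflateBlock(byte[] in, int off, int len, byte[] dict, boolean last) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true);  // raw deflate
        try {
            if (dict != null) d.setDictionary(dict);
            d.setInput(in, off, len);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[64 * 1024];
            if (last) {
                d.finish();                          // only the last block ends the stream
                while (!d.finished()) out.write(buf, 0, d.deflate(buf, 0, buf.length));
            } else {
                int n;
                do {                                 // flush to a byte boundary, stream stays open
                    n = d.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
                    out.write(buf, 0, n);
                } while (n == buf.length);
            }
            return out.toByteArray();
        } finally {
            d.end();
        }
    }

    private static void writeIntLE(OutputStream out, int v) throws IOException {
        out.write(v);
        out.write(v >>> 8);
        out.write(v >>> 16);
        out.write(v >>> 24);
    }
}
```

The dictionary does not break the independence of the tasks, because it is taken from the input, which is known up front, and not from the previous block's compressed output. On the decompression side nothing special is needed: the decompressor's sliding window already holds the last 32 KiB it produced.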
I have a 2GB file containing student records. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What is the most efficient and fastest way of doing this using the Java IO API and threads without running into memory issues? The max heap size for the JVM is set to 512MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
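The steps above could be sketched like this, assuming the records are text with one record per line (the question does not say) and using a BufferedReader as the line-oriented counterpart of BufferedInputStream. The class name RecordFilter and the predicate parameter are placeholders for whatever attribute check is needed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

final class RecordFilter {
    // Copies every line accepted by 'wanted' from in to out, preserving order.
    // Only one line (plus the stream buffers) is held in memory at a time.
    static long filter(Path in, Path out, Predicate<String> wanted) throws IOException {
        long kept = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (wanted.test(line)) {
                    writer.write(line);
                    writer.newLine();
                    kept++;
                }
            }
        }
        return kept;
    }
}
```

A call might look like RecordFilter.filter(input, output, line -> line.contains(",math,")), where the predicate is whatever parsing and matching the record format requires. Because there is a single sequential pass, the output order matches the input order without extra work.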
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2-3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean database).
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. This will let you map the bigger file into a smaller amount of memory. It acts like virtual memory, and as far as performance is concerned, mapped files are faster than stream reads/writes.
My Java program saves its data to a binary file, and (very) occasionally the file becomes corrupt due to a hardware fault. Usually only a few bytes are affected in a file that is several megabytes in size. To cope with this problem, I could write the data twice, but this seems overkill - I would prefer to limit the file size increase to about 20%.
This seems to me to be similar to the problem of sending information over a 'noisy' data stream. Is there a Java library or algorithm that can write redundant information to an output stream so the receiver can recover when noise is introduced?
What you want is Error Correction Codes. Check this code out: http://freshmeat.net/projects/javafec/
As well, the wikipedia article might give you more information:
http://en.wikipedia.org/wiki/Forward_error_correction
Your two possibilities are Forward Error Correction, where you send redundant data, or a system for error detection, where you check a hash value and re-request any data that has become corrupted. If corruption is an expected thing, error correction is the approach to take.
Without knowing the nature of your environment, giving more specific advice isn't really possible, but this should get you started on knowing how to approach this problem.
Error correcting codes. If I recall correctly, the number of additional bits grows as log n with the block size n, so the larger the blocks, the smaller the share of correction bits.
You should choose a mechanism that interleaves the checkbits (probably most convenient as extra characters) in between the normal text. This allows for having repairable holes in your data stream while still being readable.
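As a toy illustration of such a code, here is Hamming(7,4) in Java: 3 check bits protect 4 data bits and any single flipped bit in the 7-bit codeword can be repaired. It is a textbook code chosen for brevity, not a recommendation; the class name Hamming74 is made up. At this block size the overhead is 75%, far above a 20% budget, but the same construction with longer blocks needs proportionally fewer check bits (Hamming(255,247) uses 8 check bits for 247 data bits).

```java
// Bit i (1..7) of the int codeword is codeword position i; bit 0 is unused.
// Positions 1, 2 and 4 hold check bits, positions 3, 5, 6 and 7 hold data bits.
final class Hamming74 {
    private static final int[] DATA_POS = {3, 5, 6, 7};

    // XOR of the positions of all set bits; 0 for a valid codeword,
    // otherwise the position of the single flipped bit.
    static int syndrome(int code) {
        int s = 0;
        for (int pos = 1; pos <= 7; pos++) {
            if (((code >> pos) & 1) != 0) s ^= pos;
        }
        return s;
    }

    static int encode(int nibble) {
        int code = 0;
        for (int i = 0; i < 4; i++) {
            if (((nibble >> i) & 1) != 0) code |= 1 << DATA_POS[i];
        }
        // Set the check bits so that the overall syndrome becomes 0.
        int s = syndrome(code);
        for (int p = 1; p <= 4; p <<= 1) {
            if ((s & p) != 0) code |= 1 << p;
        }
        return code;
    }

    // Repairs at most one flipped bit, then extracts the 4 data bits.
    static int decode(int code) {
        int s = syndrome(code);
        if (s != 0) code ^= 1 << s;
        int nibble = 0;
        for (int i = 0; i < 4; i++) {
            if (((code >> DATA_POS[i]) & 1) != 0) nibble |= 1 << i;
        }
        return nibble;
    }
}
```

A code like this repairs one bit per block only. Since the damage in the question spans a few adjacent bytes, the blocks would have to be interleaved so that a burst touches each block at most once, or a byte-oriented code such as Reed-Solomon used instead.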
The problem of noisy communications already has a great solution: Send a hash/CRC of the data (with the data) which is (re)evaluated by the receiver and re-requested if there was corruption en route.
In other words: use a hash algorithm to check for corruption and retransmit when necessary instead of sending data redundantly.
CRCs and ECCs are the standard answer to detecting and (for ECCs) recovering from data corruption due to noise. However, any scheme can only cope with a certain level of noise. Beyond that level you will get undetected and/or uncorrectable errors. The second problem is that these schemes will only work if you can add the ECCs / CRCs before the noise is injected.
But I'm a bit suspicious that you may be trying to address the wrong problem:
If the corruption is occurring when you transmit the file over a comms line, then you should be using comms hardware with built-in ECC etc support.
If the corruption is occurring when you write the file to disc, then you should replace the disc.
You should also consider the possibility that it is your application that is corrupting the data; e.g. due to some synchronization bug in your code.
Sounds antiquated, but funnily enough I just had a similar conversation with someone who wrote "mobile" apps (not PDA/phone but Oil&Gas drilling rig-style field applications). Due to the environment they actually wrote to disk using a modified XMODEM CRC transfer. I think it is fair to say there is nothing special there, other than to:
Use a RandomAccessFile in "rw" mode: write a block of data (512-4096 bytes), re-read it for a CRC check, re-write it on a mismatch, or move on to the next block. With OS file caching, I'm curious how effective this is.
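For what it's worth, that write, re-read and compare loop might look like the following in Java. The class VerifiedWriter and the maxAttempts parameter are invented for this sketch; I have not seen the original implementation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

final class VerifiedWriter {
    // Writes one block at pos, reads it back and compares CRC32 values,
    // rewriting up to maxAttempts times before giving up.
    static void writeBlock(RandomAccessFile raf, long pos, byte[] block, int maxAttempts)
            throws IOException {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        long expected = crc.getValue();

        byte[] check = new byte[block.length];
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            raf.seek(pos);
            raf.write(block);
            raf.getFD().sync();           // ask the OS to push the data to the device

            raf.seek(pos);
            raf.readFully(check);
            crc.reset();
            crc.update(check, 0, check.length);
            if (crc.getValue() == expected) return;
        }
        throw new IOException("block at " + pos + " failed verification after "
                + maxAttempts + " attempts");
    }
}
```

The caching doubt applies here too: after the sync the re-read is most likely served from the OS page cache, so a match mainly shows that the data survived the trip into the kernel, not that the sectors on the device are intact.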