I have a problem here and no idea what to do. Basically I'm creating a .txt file which serves as an index for a random access file. In it I have the number of bytes required to seek to each entry in the file.
The file has 1484 records. This is where my problem comes in: because the byte offsets are so large, I end up writing pretty long numbers into the file, and ultimately the .txt file ends up being too big. When I open it with an appropriate piece of software (such as Notepad), the file is simply cut off at a certain point.
I tried to minimize it as much as possible, but it's just too big.
What can I do here? I'm clueless.
Thanks.
I am not really sure that this is actually the problem... only 1484 records?
You can write a binary file instead, in which every four or eight bytes correspond to one record position. That way, all positions have the same length on disk, no matter how many digits they hold. If you need to browse or modify the file, you can easily write small utility programs that decode it for inspection and re-encode your modifications.
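For illustration, here is a minimal sketch of that idea, assuming the index is written as fixed 8-byte offsets to a file called index.bin (the file name and sample offsets are made up). With 1484 records the whole index stays under 12 KB.

import java.io.*;

public class BinaryIndexDemo {
    public static void main(String[] args) throws IOException {
        long[] offsets = {0L, 128L, 4096L, 1_000_000L}; // sample seek positions

        // Write each offset as a fixed 8-byte value, so record i always lives at position i * 8
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("index.bin")))) {
            for (long offset : offsets) {
                out.writeLong(offset);
            }
        }

        // Look up the offset of record i by seeking straight to i * 8
        int i = 2;
        try (RandomAccessFile index = new RandomAccessFile("index.bin", "r")) {
            index.seek((long) i * 8);
            System.out.println("Offset of record " + i + ": " + index.readLong());
        }
    }
}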
Another solution would be to compress the file. You can use the zip capabilities of Java, and unzip the file before using it, and zip it again after that.
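A rough sketch of that approach, using GZIP (simpler than a full ZIP archive) with placeholder paths; adjust the file names to your own:

import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

public class GzipIndexDemo {
    // Compress the index before storing it
    static void compress(Path plain, Path gzipped) throws IOException {
        try (InputStream in = Files.newInputStream(plain);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzipped))) {
            in.transferTo(out); // Java 9+; copy with a byte[] buffer on older versions
        }
    }

    // Decompress it again before use
    static void decompress(Path gzipped, Path plain) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzipped));
             OutputStream out = Files.newOutputStream(plain)) {
            in.transferTo(out);
        }
    }
}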
It is probably because you are not writing newlines to terminate each entry. Text editors have a limit on the maximum line length they can handle safely.
Storing your indices in a binary file and reading them into some kind of Collection (depending on your needs) would probably be much faster and lighter.
Related
I have a large file that I need to delete a particular row from. I have the exact row number I want to be deleted and I cannot find a solution to a way to go directly to this line and delete it.
Most answers on Stack Overflow simply suggest iterating through the entire file, essentially copying it over to a temp file, skipping the targeted line when it is found, then swapping the file names and deleting the original.
This does not seem like a very efficient solution, especially for a large file, and it bugs me to have to use it. Any other ideas that don't take this approach?
What the other Stack Overflow answers suggest is not merely an efficient solution. It is a very efficient solution.
The best you can do is to allocate a buffer as large as your memory allows, read everything before the line and everything after the line, and combine them in a new file.
When a file is created, the operating system does not just record the file name; it also allocates the space the file occupies. Removing a line from a file, as you suggest, would effectively mean shrinking that allocated space. How do you think that would be achieved? By allocating a new chunk of space and copying the remaining data into it.
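A minimal sketch of that copy-and-skip approach, assuming a 1-based line number and placeholder file names:

import java.io.*;
import java.nio.file.*;

public class DeleteLineDemo {
    public static void main(String[] args) throws IOException {
        int lineToDelete = 42;                    // hypothetical target line (1-based)
        Path source = Paths.get("input.txt");     // placeholder file names
        Path temp = Paths.get("input.txt.tmp");

        try (BufferedReader reader = Files.newBufferedReader(source);
             BufferedWriter writer = Files.newBufferedWriter(temp)) {
            String line;
            int current = 0;
            while ((line = reader.readLine()) != null) {
                if (++current == lineToDelete) {
                    continue; // drop the unwanted line
                }
                writer.write(line);
                writer.newLine();
            }
        }
        // Swap the temp file into place
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }
}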
I am doing some unusual data manipulation. I have 36,000 input files, more than can be loaded into memory at once. I want to take the first byte of every file and put it in one output file, then do the same for the second byte, and so on. It does not need to be done in any specific order. Because the input files are compressed, loading them takes a bit longer and they can't be read 1 byte at a time, so I end up with a byte array for each input file.
The input files are about ~1-6MB uncompressed and ~0.3-1MB compressed (lossy compression). Each output file ends up being as many bytes as there are input files: ~36KB in my example.
I know ulimit can be raised on Linux and the equivalent can be done on Windows, but even so I don't think any OS will be happy with millions of files being written to concurrently.
My current solution is to open 3000 or so BufferedWriter streams, load each input file in turn, write 1 byte to each of the 3000 output files, then close the input and load the next one. With this system each input file needs to be opened about 500 times.
The whole operation takes 8 days to complete and is only a test case for a more practical application that would end up with larger input files, more of them, and more output files.
Caching all the compressed files in memory and decompressing them as needed does not sound practical, and it would not scale to larger input files.
I think the solution would be to buffer what I can from the input files (memory constraints will not allow buffering it all), write to the output files sequentially, and then do it all over again.
However, I do not know if there is a better solution using something I am not read up on.
EDIT
I am grateful for the fast response. I know I was being vague about the application of what I am doing, and I will try to correct that. I basically have a three-dimensional array [images][X][Y]. I want to iterate over every image and save the color of a specific pixel from each image, and do this for every pixel position. The problem is memory constraints.
byte[] pixels = ((DataBufferByte) ImageIO.read( fileList.get(k) ).getRaster().getDataBuffer()).getData();
This is what I am using to load images, because it takes care of decompression and skipping the header.
I am not treating it as a video because I would have to grab a frame, turn it into an image (a costly color space conversion), and then convert it to a byte[] just to get the pixel data in an RGB color space.
I could load each image, split it into ~500 parts (the size of Y), and write each part to a separate file that I keep open and append to for every image. The outputs would easily stay under a gig. Each resultant file could then be loaded completely into memory and turned into an array for sequential file writing.
The intermediate step does mean I could split the load up across a network, but I am trying to get it done on a low-end laptop with 4GB of RAM, no GPU, and a low-end i7.
I had not considered saving anything to file as an intermediate step before reading davidbak's response. Size is the only thing making this problem not trivial and I now see the size can be divided into smaller more manageable chunks.
Three phase operation:
Phase one: read all input files, one at a time, and write them into a single output file. The output file will be record oriented: say, 8-byte records made of a 4-byte "character offset" and a 4-byte "character codepoint". As you're reading a file the character offset starts at 0, of course, so if the input file is "ABCD" you're writing (0, A) (1, B) (2, C) (3, D). Each input file is opened once, read sequentially and closed. The output file is opened once, written to sequentially throughout, then closed (a sketch of this phase follows below).
Phase two: Use an external sort to sort the 8-byte records of the intermediate file on the 4-byte character offset field.
Phase three: Open the sorted intermediate file and make one pass through it. Open a new output file every time the character index field changes and write to that output file all the characters that belong to that index. Input file is opened once and read sequentially. Each output file is opened, written to sequentially, then closed.
Voilà! You need space for the intermediate file, and a good external sort (and space for its work files).
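A rough sketch of phase one, assuming each input file can be turned into a byte[] of decompressed data (the decode method below is a placeholder for whatever does that, e.g. the ImageIO call from the question), with a placeholder intermediate file:

import java.io.*;
import java.util.List;

public class PhaseOneDemo {
    // Phase one: read each input file once and emit one fixed-size (offset, value)
    // record per byte into a single intermediate file.
    static void writeIntermediate(List<File> inputs, File intermediate) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(intermediate)))) {
            for (File input : inputs) {
                byte[] data = decode(input);           // one input file in memory at a time
                for (int offset = 0; offset < data.length; offset++) {
                    out.writeInt(offset);              // 4-byte offset within the file
                    out.writeInt(data[offset] & 0xFF); // 4-byte value found at that offset
                }
            }
        }
    }

    static byte[] decode(File input) throws IOException {
        return java.nio.file.Files.readAllBytes(input); // placeholder; real code would decompress the image
    }
}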
As @Jorge suggests, both phase 1 and phase 2 can be parallelized, and in fact this sort of job as outlined (phases 1 to 3) is exactly in MapReduce/Hadoop's sweet spot.
You are being very vague there, but maybe a look at MapReduce could help. It seems like the kind of job that could be distributed.
With the additional info you provided, I really don't see how to execute that task on common hardware like the 4GB i7 laptop you mentioned. Your problem looks like an image stacking algorithm for getting a decent image out of a lot of not-so-good ones, a typical problem in astronomical image processing, and I'm sure it applies to other areas as well. A good look into astronomical image processing may be a good use of your time; there is a piece of software called RegiStax (not sure if it still exists) that does something like that, but with video files.
Doing some napkin math: if you take 1 second to open a file, you get 10 hours' worth of file opening alone.
One approach would be to get a FAST disk (SSD). I'd decompress all the files into some raw format and store them on disk; from there you can use file pointers to read directly from the files without loading them into memory, and write the output directly to a file on disk.
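As a hedged illustration of that "file pointer" idea, here is a sketch that pulls the byte at one offset out of each pre-decompressed raw file; the method, file names and offsets are all placeholders, not part of the original answer:

import java.io.*;

public class RawPixelReadDemo {
    // Append the byte found at pixelOffset in each raw file to the output for that offset
    static void collectPixel(File[] rawFiles, long pixelOffset, File output) throws IOException {
        try (BufferedOutputStream out = new BufferedOutputStream(
                new FileOutputStream(output, true))) {           // append mode
            for (File raw : rawFiles) {
                try (RandomAccessFile in = new RandomAccessFile(raw, "r")) {
                    in.seek(pixelOffset);   // jump straight to the byte, no full read
                    out.write(in.read());
                }
            }
        }
    }
}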
I am writing my own image compression program in Java. I have entropy-encoded data stored in multiple arrays which I need to write to a file. I am aware of different ways to write to a file, but I would like to know what needs to be taken into account when trying to use the least possible amount of storage. For example: what character set should I use (I just need to write positive and negative numbers)? Would I be able to write less than 1 byte to a file? Should I be using Scanners/BufferedWriters, etc.? Thanks in advance; I can provide more information if needed.
Read the Java tutorial about IO.
You should
not use Writers and character sets, since you want to write binary data
use a buffered stream to avoid too many native calls and make the write fast
not use Scanners, as they're used to read data, and not write data
And no, you won't be able to write less than a byte in a file. The byte is the smallest amount of information that can be stored in a file.
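Putting those three points together, a minimal sketch might look like this (the file name and sample values are made up):

import java.io.*;

public class BinaryWriteDemo {
    public static void main(String[] args) throws IOException {
        int[] symbols = {-3, 7, 120, -128};   // sample entropy-coded values

        // Binary output (no Writer/charset), buffered to keep the number of native calls down
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("encoded.bin")))) {
            for (int s : symbols) {
                out.writeByte(s);   // one byte each; use writeShort/writeInt for wider ranges
            }
        }
    }
}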
Compression is almost always more expensive than file IO. You shouldn't worry about the speed of your writes unless you know it's a bottleneck.
I am writing my own image compression program in Java, I have entropy encoded data stored in multiple arrays which I need to write to file. I am aware of different ways to write to file but I would like to know what needs to be taken into account when trying to use the least possible amount of storage.
Write the data in a binary format and it will be the smallest. This is why almost all image formats use binary.
For example, what character set should I use (I just need to write positive and negative numbers),
Character encodings are for encoding characters, i.e. text. You generally don't use them in binary formats (unless the format contains some text, which yours is unlikely to do initially).
would I be able to write less than 1 byte to a file,
Technically you can write less than the block size on disk, e.g. 512 bytes or 4 KB. You can write any amount smaller than that, but it doesn't use less space, nor would it really matter if it did, because the amount of disk involved is too small to worry about.
should I be using Scanners/BufferedWriters etc.
No, these are for text.
Instead use DataOutputStream and DataInputStream, as these are for binary data.
what character set should I use
You would need to write your data as bytes, not chars, so forget about character set.
would I be able to write less than 1 byte to a file
No, this would not be possible. But to produce the bit stream the decoder expects, you might need to construct a byte from something like a 5-bit and a 3-bit value before writing that byte to the file.
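A small sketch of that packing step, with made-up 5-bit and 3-bit values and a placeholder file name:

import java.io.*;

public class BitPackDemo {
    public static void main(String[] args) throws IOException {
        int fiveBitValue = 0b10110;   // fits in 5 bits
        int threeBitValue = 0b011;    // fits in 3 bits

        // Put the 5-bit field in the high bits and the 3-bit field in the low bits
        int packed = ((fiveBitValue & 0x1F) << 3) | (threeBitValue & 0x07);

        try (DataOutputStream out = new DataOutputStream(new FileOutputStream("packed.bin"))) {
            out.writeByte(packed);    // both values stored in a single byte
        }

        // The decoder recovers the two fields by reversing the shifts and masks
        int restoredFive = (packed >>> 3) & 0x1F;
        int restoredThree = packed & 0x07;
        System.out.println(restoredFive + " " + restoredThree);
    }
}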
I have a java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, where I then seek back and write some information at the start of the file.
I would like to somehow put the file into a zip archive. I guess I could just do this at the end, but this would involve copying all the data that has been written so far. Since these files can potentially grow very large, I would prefer a way that somehow did not involve copying the data.
Is there some way to get something like a "ZipRandomAccessFile", a la the ZipOutputStream which is available in the jdk?
It doesn't have to be jdk only, I don't mind taking in third party libraries to get the job done.
Any ideas or suggestions..?
Maybe you need to change the file format so it can be written sequentially.
In fact, since it is a Zip and Zip can contain multiple entries, you could write the sequential data to one ZipEntry and the data known 'only at completion' to a separate ZipEntry - which gives the best of both worlds.
It is easy to write, not having to go back to the beginning of the large sequential chunk. It is easy to read - if the consumer needs to know the 'header' data before reading the larger resource, they can read the data in that zip entry before proceeding.
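A minimal sketch of the two-entry layout, with placeholder entry names and data:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class TwoEntryZipDemo {
    public static void main(String[] args) throws IOException {
        byte[] body = {1, 2, 3, 4};                                  // the large sequential data (placeholder)
        byte[] header = "recordCount=4".getBytes(StandardCharsets.UTF_8); // values only known at completion (placeholder)

        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("archive.zip"))) {
            // Entry 1: streamed out as the data is produced, never revisited
            zip.putNextEntry(new ZipEntry("body.dat"));
            zip.write(body);
            zip.closeEntry();

            // Entry 2: written last, once the "header" values are known
            zip.putNextEntry(new ZipEntry("header.dat"));
            zip.write(header);
            zip.closeEntry();
        }
    }
}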
The way the DEFLATE format is specified, it only makes sense if you read it from the start. So each time you seek back and forth, the underlying zip implementation would have to start reading the file from the start. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, and then the whole thing compressed again.
To sum it up, ZIP/DEFLATE isn't the format for this. However, breaking your data up into smaller, fixed size files that are compressed individually might be feasible.
The point of compression is to recognize redundancy in data (like some characters occurring more often or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random access writing. In particular:
You never know in advance how well a piece of data can be compressed. So if you change some block of data, its compressed version will be most likely either longer or shorter.
As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (like discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from that change to the end.
So the only reasonable solution is to manipulate the data first and compress it all at once at the end.
Can anyone recommend a fast way to sort the contents of a text file, based on the first X amount of characters of each line?
For example, if I have the following text in the file:
Adrian Graham some more text here
John Adams some more text here
Then another record needs to be inserted, e.g.:
Bob Something some more text here
I need to keep the file sorted, but it is a rather big file and I'd rather not load it entirely into memory at once.
By big I mean about 500,000 lines, so perhaps not terribly huge.
I've had a search around and found http://www.codeodor.com/index.cfm/2007/5/14/Re-Sorting-really-BIG-files---the-Java-source-code/1208
and I wanted to know if anyone could suggest any other ways, for the sake of having a second opinion?
My initial idea, before I read the article linked above, was:
Read the file
Split it into several files, for eg A to Z
If a line begins with "a" then it is written to the file called A.txt
Each of the files then has its contents sorted (no clear idea how just yet, apart from alphabetical order)
Then when it comes to reading data, I know that if I want to find a line which starts with A, I open A.txt
When inserting a new line the same thing applies: I just append to the end of the appropriate file. Later, after the insert, when there is time, I can invoke my sorting program to reorder the files that have had stuff appended to them.
I realise that there are a few flaws in this; for example, there won't be an even number of lines starting with each letter, so some files may be bigger than others, etc.
Which again is why I need a second opinion and suggestions on how to approach this.
The current program is in Java, but any programming language could be used for an example that would achieve this... I'll port what I need to.
(If anyone's wondering, I'm not deliberately trying to give myself a headache by storing info this way; I inherited a painful little program which stores data in files instead of using some kind of database.)
Thanks in advance
You may also want to simply call the DOS "sort" command to sort the file. It is quick and will require next to no programming on your part.
In a DOS box, type help sort|more for the sort syntax and options.
500,000 lines shouldn't really be that much to sort. Read the whole thing into memory, and then sort it using the standard built-in functions. If you really find that these are too slow, then move on to something more complicated. 500,000 lines at about 60 bytes per line still only ends up being around 30 megs.
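A hedged sketch of that in-memory approach, using a made-up prefix length of 20 characters and a placeholder file name:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class SortByPrefixDemo {
    public static void main(String[] args) throws IOException {
        int prefixLength = 20;                         // "first X characters" (placeholder value)
        Path file = Paths.get("records.txt");          // placeholder file name

        List<String> lines = Files.readAllLines(file); // ~500,000 lines fits easily in memory
        lines.sort(Comparator.comparing(
                line -> line.substring(0, Math.min(prefixLength, line.length()))));
        Files.write(file, lines);                      // write the sorted lines back out
    }
}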
Another option might be to read the file and put it in a lightweight db (for example hsqldb in file mode)
Then get the data sorted and write it back to a file. (Or simply migrate the program so it uses a db.)
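A rough sketch of the HSQLDB route, assuming the hsqldb jar is on the classpath; the database name, table layout and 20-character prefix are all placeholders:

import java.sql.*;

public class HsqldbSortDemo {
    public static void main(String[] args) throws SQLException {
        // File-mode HSQLDB database stored next to the program (placeholder name)
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:file:recordsdb", "SA", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS records (line VARCHAR(1024))");
            st.execute("INSERT INTO records VALUES ('John Adams some more text here')");
            st.execute("INSERT INTO records VALUES ('Adrian Graham some more text here')");

            // Let the database do the sorting on the first 20 characters of each line
            try (ResultSet rs = st.executeQuery(
                    "SELECT line FROM records ORDER BY SUBSTR(line, 1, 20)")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}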