I have a large file from which I need to delete a particular row. I know the exact row number, but I cannot find a way to go directly to that line and delete it.
Most answers on Stack Overflow suggest iterating through the entire file and copying it to a temp file, skipping the targeted line when it is found, then swapping the file names and deleting the original.
This does not seem like a very efficient solution, especially for a large file, and it irks me to have to use it. Any other ideas that don't take this approach?
What the other Stack Overflow answers suggest is not merely an efficient solution. It is a very efficient solution.
The best you can do is allocate a buffer as large as your memory allows, read everything before the line and everything after the line, and combine the two parts in a new file.
When a file is created, the operating system does not just record the file name. It also allocates the space the file occupies. Suppose we do as you say: removing a line from a file would effectively mean shrinking the allocated space. How do you think this is achieved? By allocating a new chunk of space and writing the remaining content into it.
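The copy-and-skip approach described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name `LineRemover` and the method `removeLine` are made up for this example, the file is assumed to be UTF-8 text, and line numbers are assumed to be 1-based.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: names are illustrative, not from the original posts.
public class LineRemover {

    /** Copies every line except the 1-based lineToDrop to a temp file, then replaces the original. */
    public static void removeLine(Path file, long lineToDrop) throws IOException {
        // Create the temp file next to the original so the final move stays on one file system.
        Path dir = file.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "remove-line", ".tmp");

        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            long current = 0;
            while ((line = in.readLine()) != null) {
                current++;
                if (current == lineToDrop) {
                    continue; // skip the targeted line
                }
                out.write(line);
                out.newLine(); // note: writes the platform line separator
            }
        }
        // Swap the temp file in place of the original.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Only one line is held in memory at a time, so the cost is a single sequential pass over the file regardless of its size.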
I need to check whether the password that a user entered is contained in a 10k-line .txt file stored locally on my computer. I've been asked to do this for a college project, and they've been very emphatic about achieving this in an efficient manner, without taking too long to find the match.
The thing is that, reading the file line by line with a BufferedReader, the match is found almost instantly.
I've tested it on two computers, one with an SSD and the other with an HDD, and I cannot tell the difference.
Am I missing something? Is there another, more efficient way to do it? For example, I could load the file or chunks of the file into memory, but is it worth it?
10k passwords isn't all that much and should easily fit in RAM. You can read the file into memory when your application starts and then only access the in-memory structure. The data could even be parsed into a structure that provides more efficient lookup (e.g. a HashMap or HashSet), or sorted in memory for a one-time cost of O(n × log n) to enable binary search over the list (10k items can be searched in at most 14 steps). Or you could use even fancier data structures such as a Bloom filter.
Just keep in mind: when you write "it is almost instant", then it probably already is efficient enough. (Again, 10k passwords isn't all that much, probably the file is only ~100kB in size)
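As a rough sketch of the in-memory idea, the file can be loaded once into a HashSet and then queried. The class name `PasswordList` is made up for this example, and the file is assumed to be UTF-8 with one password per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper: the name and file layout are assumptions for illustration.
public class PasswordList {
    private final Set<String> passwords;

    /** Reads the whole file once; each line becomes one entry in the set. */
    public PasswordList(Path file) throws IOException {
        this.passwords = new HashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    /** Average-case constant-time membership check. */
    public boolean contains(String candidate) {
        return passwords.contains(candidate);
    }
}
```

The file is read only at startup, so each later check is a single hash lookup rather than a scan of the file.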
I have a file from which I am populating a HashMap<String, ArrayList<Objects>>. The HashMap will have exactly 25 keys, but the list for each key will be huge, say a million records.
What I do now is retrieve the list of records for each key and process the lists in parallel using threads. Things went well until I got a larger file, and now I am facing "java.lang.OutOfMemoryError: Java heap space".
What is the best alternative to populating the HashMap with the lists of objects? What I am thinking of is to get the 25 offsets into the file and, instead of putting the lines I read from the file into the ArrayList, store the offsets and give each thread an iterator that runs from its start offset to its end offset. I still have to try this idea. But before I do, I would like to know of any better ways to optimize memory usage.
I will populate the HashMap<String, ArrayList<Objects>>
After populating the HashMap what do you need to do with it? I believe that just populating the Map is not your task. Whatever the scenario, you don't need to read the whole file in memory.
Increasing the heap size may not be a good solution as someday you may get a file even bigger than your heap size.
Read the file in chunks using a BufferedReader or BufferedInputStream depending on your needs and do your task as you read. The two APIs only read a part of the file in memory at a time.
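To illustrate "do your task as you read", here is a minimal sketch that aggregates while streaming instead of storing every record. The original post does not say what the task or the record format is, so both are assumptions here: the key is taken to be the text before the first comma, and counting records per key stands in for the real processing. The class name `StreamingAggregator` is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: record format and task are assumed, not taken from the question.
public class StreamingAggregator {

    /** Counts records per key while holding only one line in memory at a time. */
    public static Map<String, Long> countPerKey(Path file) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed format: "key,rest of record"
                int comma = line.indexOf(',');
                String key = comma < 0 ? line : line.substring(0, comma);
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

The map holds one small entry per key (25 in the question's case), so memory use no longer grows with the size of the file.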
I read from file into the arrayList, put the offset of the file and give each thread an iterator to iterate from its start offset to end offset. I still have to try this thought.
Using multiple threads will not prevent java.lang.OutOfMemoryError because all the threads run in the same JVM. Furthermore, no matter whether you read the file into one list or multiple lists, all the data from the file will be read into the same heap memory.
If you mention what you actually want to do with the data from file, this answer can be more specific.
Ditto what ares said. We need more information. What do you plan on doing with the map? Is it an operation that requires the whole file to be loaded into memory, or can it be done in parts?
Also, have you considered splitting the file into parts once its size surpasses a threshold?
Like Pshemo's answer here: How to break a file into pieces using Java?
Also, if you want to process in parallel, you could build a map that covers only a part of the file, process that map in parallel, and store the results in a queue of some sort, provided the queue holds only a subset of the data you are processing (to avoid OutOfMemory errors).
I have a problem here and no idea what to do. Basically I'm creating a .txt file which serves as an index for a random access file. In it I have the number of bytes required to seek to each entry in the file.
The file has 1484 records. This is where I have my problem: with the large amount of bytes the record has, I end up writing pretty long numbers into the file, and ultimately the .txt file ends up being too big. When I open it with an appropriate piece of software (such as notepad) the file is simply cut off at a certain point.
I tried to minimize it as much as possible, but it's just too big.
What can I do here? I'm clueless.
Thanks.
I am not really sure that the problem is that one... only 1484 records?
You can write a binary file instead, in which each four or eight bytes correspond to a record position. This way, all positions have the same length on disk, no matter how many digits they hold. If you need to browse or modify the file, you can easily write utility programs that decode it so you can inspect it, and that encode your modifications back into it.
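A fixed-width binary index along these lines might look like the sketch below. The class name `OffsetIndex` and its methods are made up for this example, and it assumes eight bytes (one `long`) per record position, with records numbered from 0.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: names and the 8-byte entry width are assumptions for illustration.
public class OffsetIndex {

    /** Writes each offset as a fixed 8-byte big-endian value. */
    public static void write(Path indexFile, long[] offsets) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(indexFile)))) {
            for (long offset : offsets) {
                out.writeLong(offset);
            }
        }
    }

    /** Looks up the offset of the given 0-based record by seeking directly to its slot. */
    public static long read(Path indexFile, int recordNumber) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(indexFile.toFile(), "r")) {
            raf.seek((long) recordNumber * Long.BYTES);
            return raf.readLong();
        }
    }
}
```

With this layout, an index for 1484 records takes 1484 × 8 = 11,872 bytes, and a lookup never needs to parse text.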
Another solution would be to compress the file. You can use the zip capabilities of Java, and unzip the file before using it, and zip it again after that.
It is probably because you are not feeding new lines to terminate each line. There is a limit set to the maximum line length that text editors can handle safely.
Storing your indices in a binary file, inside some kind of Collection (depending on your needs) would probably be much faster and lighter.
Can anyone recommend a fast way to sort the contents of a text file based on the first X characters of each line?
For example, if I have the following text in the file
Adrian Graham some more text here
John Adams some more text here
Then another record needs to be inserted, e.g.
Bob Something some more text here
I need to keep the file sorted, but this is a rather big file and I'd rather not load it entirely into memory at once.
By big I mean about 500,000 lines, so perhaps not terribly huge.
I've had a search around and found http://www.codeodor.com/index.cfm/2007/5/14/Re-Sorting-really-BIG-files---the-Java-source-code/1208
and I wanted to know if anyone could suggest any other ways, for the sake of having second opinions.
My initial idea, before I read the article linked above, was:
Read the file
Split it into several files, e.g. A to Z
If a line begins with "a" then it is written to the file called A.txt
Each of the files then has its contents sorted (no clear idea how just yet, apart from alphabetical order)
Then when it comes to reading data, I know that if I want to find a line which starts with A then I open A.txt
When inserting a new line the same thing applies, and I just append to the end of the file. Later, after the insert, when there is time, I can invoke my sorting program to reorder the files that have had lines appended to them.
I realise that there are a few flaws in this, e.g. there won't be an even number of lines starting with each letter, so some files may be bigger than others.
Which again is why I need a second opinion on how to approach this.
The current program is in Java, but any programming language could be used for an example that achieves this. I'll port what I need to.
(If anyone's wondering, I'm not deliberately trying to give myself a headache by storing info this way. I inherited a painful little program which stores data in files instead of using some kind of database.)
Thanks in advance
You may also want to simply call the DOS "sort" command to sort the file. It is quick and will require next to no programming on your part.
In a DOS box, type help sort|more for the sort syntax and options.
500,000 shouldn't really be that much to sort. Read the whole thing into memory, and then sort it using standard built-in functions. If you really find that these are too slow, then move on to something more complicated. 500,000 lines × about 60 bytes per line still only comes to 30 MB.
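A minimal sketch of the read-everything-and-sort approach, comparing only the first X characters of each line as the question asks. The class name `FileSorter` and the parameter `prefixLength` are made up for this example, and the file is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: names are illustrative, not from the original posts.
public class FileSorter {

    /** Loads all lines, sorts them by their first prefixLength characters, and rewrites the file. */
    public static void sortByPrefix(Path file, int prefixLength) throws IOException {
        // Copy into an ArrayList because readAllLines does not promise a mutable list.
        List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));

        // Lines shorter than prefixLength are compared on their full length.
        lines.sort(Comparator.comparing(
                (String l) -> l.substring(0, Math.min(prefixLength, l.length()))));

        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

`List.sort` is stable, so lines with identical prefixes keep their original relative order.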
Another option might be to read the file and put it in a lightweight db (for example hsqldb in file mode)
Then get the data sorted, and write it back to a file. (Or simply migrate the program so that it uses a DB.)
I have a 2GB file which has student records in it. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What is the most efficient and fastest way of doing this using the Java IO API and threads without running into memory issues? The max heap size for the JVM is set to 512MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
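The steps above can be sketched as follows, assuming a text file with one record per line (the question does not confirm the format). The class name `RecordFilter` is made up, and the `keep` predicate is a placeholder for whatever attribute check is actually needed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Predicate;

// Hypothetical helper: one record per line is an assumption for illustration.
public class RecordFilter {

    /** Streams input to output, writing only the records accepted by keep. Returns the number written. */
    public static long filter(Path input, Path output, Predicate<String> keep) throws IOException {
        long written = 0;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String record;
            while ((record = in.readLine()) != null) {
                if (keep.test(record)) {
                    out.write(record);
                    out.newLine();
                    written++;
                }
            }
        }
        return written;
    }
}
```

Because records are written in the order they are read, the output preserves the original ordering, and memory use stays at roughly one record plus the stream buffers.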
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to reduce the 2/3 minutes or just go for a wee while you're waiting for it to run is a decision that you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean database).
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. They let you map a large file into a smaller amount of memory, one region at a time. This works like virtual memory, and as far as performance is concerned, mapped files are often faster than stream reads and writes.
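A minimal sketch of reading a large file through memory-mapped windows, here simply counting newline bytes. The class name `MappedLineCounter` and the 64 MB window size are arbitrary choices for this example. A single mapping in Java is limited to `Integer.MAX_VALUE` bytes, which is why the file is mapped in pieces.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example: names and window size are assumptions for illustration.
public class MappedLineCounter {
    private static final long WINDOW = 64L * 1024 * 1024; // bytes mapped at a time

    /** Counts '\n' bytes by mapping the file one read-only window at a time. */
    public static long countNewlines(Path file) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }
}
```

Mapped pages live outside the Java heap, so this stays within a 512MB heap limit. Whether it actually beats a buffered stream depends on the workload and is worth measuring.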