File processing in Java

I have a 2 GB file that contains student records. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What is the most efficient and fastest way of doing this using the Java I/O API and threads without running into memory issues? The max heap size for the JVM is set to 512 MB.

What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: read the file line by line, parse the line, check your filter criterion, and if it matches, output a result line, then go on to the next line until the file is done. This is very memory efficient, as you only ever have the current line (plus a smallish buffer) loaded at any one time, and the process reads through the whole file just once.
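A minimal sketch of that grep-style pass, assuming one text record per line; the file names and the matches() predicate are placeholders for your real attribute check:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class StudentFilter {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("students.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("filtered.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {   // one record in memory at a time
                if (matches(line)) {                   // your filter criterion goes here
                    out.write(line);
                    out.newLine();                     // output keeps the original order
                }
            }
        }
    }

    // placeholder predicate; replace with the real attribute test
    private static boolean matches(String line) {
        return line.contains("2023");
    }
}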
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.

I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for the remaining records (a minimal sketch of these steps follows below)
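As a rough illustration of those steps, assuming one record per line of text, here is a java.nio.file variant; the file names and the keep() predicate are made up:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SimpleFilter {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("students.txt");    // assumed input file
        Path out = Paths.get("filtered.txt");   // assumed output file
        try (Stream<String> lines = Files.lines(in);           // lines are streamed lazily, not loaded whole
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            lines.filter(SimpleFilter::keep)                    // "do I need this record?"
                 .forEachOrdered(line -> {                      // preserves the original order
                     try {
                         writer.write(line);
                         writer.newLine();
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }

    // made-up attribute check
    private static boolean keep(String line) {
        return line.contains("ACTIVE");
    }
}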
On one of my test systems with extremely modest hardware, BufferedInputStream around a FileInputStream out of the box read about 500 MB in 25 seconds, i.e. probably under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state of the art hardware it's quite possible the time would be halved.
Whether you need to go to a lot of effort to shave down those 2 to 3 minutes, or just take a short break while you're waiting for it to run, is a decision you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to that which don't automatically mean a database).

A 2 GB file is huge; you SHOULD go for a DB.
If you really want to use the Java I/O API, then try these: Handling large data files efficiently with Java and Tuning Java I/O Performance

I think you should use memory-mapped files. This lets you handle a file that is larger than the memory you can give the JVM: the OS pages regions of the file in and out like virtual memory, and as far as performance is concerned, mapped files are generally faster than stream reads/writes.
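For completeness, a rough sketch of what that looks like with FileChannel.map(); note that a single MappedByteBuffer can cover at most Integer.MAX_VALUE bytes, so a 2 GB file needs more than one mapping (the file name is just an example):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedScan {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("students.txt"), StandardOpenOption.READ)) {
            long len = Math.min(ch.size(), Integer.MAX_VALUE);             // one mapping tops out at ~2 GB
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
            while (buf.hasRemaining()) {
                byte b = buf.get();   // the OS pages the file in lazily as you touch it
                // ... scan bytes / assemble records here ...
            }
        }
    }
}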

Related

Find a String in a 10k-line file in Java efficiently

I need to check whether the password a user entered is contained in a 10k-line .txt file that is stored locally on my computer. I've been asked to do this for a college project and they've been very emphatic about achieving it in an efficient manner, not taking too long to find the match.
The thing is that, reading the file line by line with a BufferedReader, the match is found almost instantly.
I've tested it on two computers, one with an SSD and the other with an HDD, and I cannot tell the difference.
Am I missing something? Is there another, more efficient way to do it? For example, I could load the file or chunks of the file into memory, but is it worth it?
10k passwords isn't all that much and should easily fit in RAM. You can read the file into memory when your application starts and then only access the in-memory structure. That structure can even be chosen to provide more efficient lookup (e.g. a HashMap or HashSet), or you can sort the list in memory for a one-time cost of O(n log n) to enable binary search (10k items can be searched in at most 14 steps). Or you could use even fancier data structures such as a Bloom filter.
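A minimal sketch of the load-once-into-a-HashSet idea (the file name is just an example):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PasswordCheck {
    private final Set<String> passwords = new HashSet<>();

    public PasswordCheck(String file) throws IOException {
        // ~10k short lines easily fit in memory; load them once at startup
        passwords.addAll(Files.readAllLines(Paths.get(file)));
    }

    public boolean contains(String candidate) {
        return passwords.contains(candidate);   // constant-time lookup, no file scan per check
    }
}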
Just keep in mind: when you write that it is "almost instant", it probably already is efficient enough. (Again, 10k passwords isn't all that much; the file is probably only ~100 kB in size.)

How to optimize the memory usage for large file processing

I have a file, and from that file I am populating a HashMap<String, ArrayList<Objects>>. The HashMap will definitely have 25 keys, but the list for each key will be huge, say a million records per key.
What I do now is retrieve the list of records for each key and process them in parallel using threads. Things went well until I hit a larger file, and now I am getting "java.lang.OutOfMemoryError: Java heap space".
I would like to ask what the best approach is instead of populating the HashMap with lists of objects. What I am thinking is: get the 25 offsets into the file, and instead of putting the lines I read from the file into the ArrayLists, store the offsets and give each thread an iterator that iterates from its start offset to its end offset. I still have to try this idea. But before I do, I would like to know of any better ways to optimize memory usage.
I will populate the HashMap<String, ArrayList<Objects>>
After populating the HashMap, what do you need to do with it? I believe that just populating the map is not your actual task. Whatever the scenario, you don't need to read the whole file into memory.
Increasing the heap size may not be a good solution, as someday you may get a file even bigger than your heap.
Read the file in chunks using a BufferedReader or BufferedInputStream, depending on your needs, and do your work as you read. Both APIs only hold a small part of the file in memory at a time.
instead of putting the lines I read from the file into the ArrayLists, store the offsets and give each thread an iterator that iterates from its start offset to its end offset. I still have to try this idea.
Using multiple threads will not prevent java.lang.OutOfMemoryError, because all the threads live in the same JVM. Furthermore, no matter whether you read the file into one list or several lists, all the data from the file ends up in the same heap.
If you mention what you actually want to do with the data from the file, this answer can be more specific.
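As a rough sketch of "do your work as you read", assuming the records are text lines and a key can be parsed out of each line; the file name, the key parsing, and processRecord() are all placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingByKey {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("big-input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');                              // assumed: key is the first CSV field
                String key = comma >= 0 ? line.substring(0, comma) : line;
                processRecord(key, line);                                   // do the work now, keep nothing around
            }
        }
    }

    // placeholder for the per-record work; it should update small per-key aggregates
    // (counts, sums, ...) rather than accumulate the records themselves
    private static void processRecord(String key, String line) {
    }
}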
Ditto what ares said. We need more information. What do you plan on doing with the map? Is it an operation that requires the whole file to be loaded into memory, or can it be done in parts?
Also, have you considered splitting the file into parts once its size surpasses a threshold?
Like Pshemo's answer here: How to break a file into pieces using Java?
Also, if you want to process in parallel, you could consider processing a map which covers only a part of the file. Process that map in parallel and store the results in a queue of some sort, provided the queue only ever contains a subset of the data you are processing (to avoid OutOfMemoryError).
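One way that might look, as a sketch only: a small fixed thread pool plus a bounded result queue so memory use stays capped; readChunks() and process() are stand-ins for your own code:

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelChunks {
    private static final String POISON = "<done>";   // tells the consumer to stop

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);          // a few worker threads
        BlockingQueue<String> results = new ArrayBlockingQueue<>(1000);  // bounded: back-pressure instead of OOM

        Thread writer = new Thread(() -> {                               // single consumer drains the queue
            try {
                String r;
                while (!(r = results.take()).equals(POISON)) {
                    // append r to the output file / DB here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        for (List<String> chunk : readChunks()) {                        // placeholder chunk source
            pool.submit(() -> {
                for (String line : chunk) {
                    try {
                        results.put(process(line));                      // blocks while the queue is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);                        // wait for the workers
        results.put(POISON);                                             // then stop the consumer
        writer.join();
    }

    private static Iterable<List<String>> readChunks() { return List.of(); }  // stub: supply one chunk at a time
    private static String process(String line) { return line; }              // stub: per-line work
}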

Write files into randomly non-contiguous hard disk positions

I need to do some performance tests for my program, and I need to evaluate the worst-case I/O access time to some data stored in files. I think the best way to evaluate this is to store the data at random, non-contiguous positions on the hard disk, to avoid contiguous data access and caching improvements. I think the only way to do this is to use some low-level OS commands, like dd on UNIX, where you can specify the sector you write the data to, but if I'm not mistaken, this is an insecure method. Does anyone know a good alternative?
PS: A solution for any OS will work; the only requirement is that I have to run the tests over different data sizes, accessing the data through a Java program.
I think that the only way to do this is using some low-level OS commands
No... RandomAccessFile has .seek():
final RandomAccessFile f = new RandomAccessFile(someFile, "rw");
f.seek(someRandomLong);   // jump to an arbitrary byte offset in the file
f.write(...);             // write your data at that position
Now, it is of course up to you to ensure that writes don't collide with one another.
Another solution is to map the file in memory and set the buffer's position to some random position before writing.
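A sketch of that memory-mapping variant; the file name, file size and block size are just examples, and note that writing at random offsets within one file does not strictly guarantee non-contiguous placement on disk:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

public class RandomMappedWrite {
    public static void main(String[] args) throws IOException {
        long size = 1024L * 1024 * 1024;   // example: a 1 GB test file
        try (FileChannel ch = FileChannel.open(Paths.get("testdata.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            byte[] block = new byte[4096];
            for (int i = 0; i < 1000; i++) {
                int pos = ThreadLocalRandom.current().nextInt((int) (size - block.length));
                buf.position(pos);       // jump to a random position in the mapping
                buf.put(block);          // write a block there
            }
            buf.force();                 // flush the mapped region to disk
        }
    }
}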

Java - Read file into byte array

I want to read a file of 1.5 GB into an array. As this takes a long time, I want to switch to some other option. Can anybody help me?
If I preprocess the byte file into some database (or maybe in some other way), can I make it faster?
Is there any other way to make it faster?
Actually, I have to process more than 50 such 1.5 GB files, so the operation is quite expensive for me.
It depends on what you want to do.
If you only wanted to access a few random bytes, then reading into an array isn't good - a MappedByteBuffer would be better.
If you want to read all the data and process it sequentially, a small portion at a time, then you could stream it (see the sketch below).
If you need to do computations that do random access of the whole dataset, particularly if you need to repeatedly read elements, then loading into an array might be sensible (but a ByteBuffer is still a candidate).
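For the streaming case, a minimal fixed-buffer loop (the buffer size and file name are arbitrary):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedRead {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[64 * 1024];   // 64 KB chunks; tune to taste
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                // process buffer[0..n) here; only one chunk is ever held in memory
            }
        }
    }
}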
Can you show some example code or explain further?
How fast is your disk subsystem?
If you can read 40 MB per second, reading 1500 MB should take about 40 seconds. If you want to go faster than that, you need a faster disk subsystem. If you are reading from a local drive and it's taking minutes, you have a tuning problem, and there is not much you can do in Java to fix it because Java is not the problem.
You can use a memory mapped file instead, but this will only speed up the access if you don't need all the data. If you need it all, you are limited by the speed of your hardware.
Using a BufferedInputStream or InputStream is probably as fast as you can get (faster than RandomAccessFile). The largest int value is 2,147,483,647, so your 1,610,612,736-byte file is getting fairly close to the maximum size of an array.
I'd recommend you just access the file using a BufferedInputStream for best speed, using skip() and read() to get the data you want. Maybe have a class that wraps those, is aware of its position, and takes care of the seeking for you when you send it an offset to read from; I believe you have to close and reopen the input stream to put it back at the beginning. A rough sketch of such a wrapper is below.
And... you may not want to save the bytes in an array at all, and just access them from the file on demand. That might help if loading time is your killer.
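Something along these lines, as a sketch only (the class name is made up); it reopens the stream whenever it is asked to go backwards:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class SeekableReader implements AutoCloseable {
    private final String file;
    private BufferedInputStream in;
    private long position;

    public SeekableReader(String file) throws IOException {
        this.file = file;
        this.in = new BufferedInputStream(new FileInputStream(file));
    }

    // read 'length' bytes starting at absolute offset 'offset'
    public byte[] read(long offset, int length) throws IOException {
        if (offset < position) {                       // can't skip backwards: reopen from the start
            in.close();
            in = new BufferedInputStream(new FileInputStream(file));
            position = 0;
        }
        while (position < offset) {
            long skipped = in.skip(offset - position); // skip() may skip fewer bytes than asked
            if (skipped <= 0) throw new IOException("offset past end of file");
            position += skipped;
        }
        byte[] data = new byte[length];
        int off = 0;
        while (off < length) {
            int n = in.read(data, off, length - off);
            if (n == -1) break;                        // hit EOF; data is only partially filled
            off += n;
        }
        position += off;
        return data;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}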

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of memory-mapped I/O, aka direct byte buffers. In Java they are called mapped byte buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual-memory paging system to 'map' your files and present them programmatically as byte buffers. The OS manages moving the bytes to/from disk and memory auto-magically and very quickly.)
I suggest this approach because a) it works for me, and b) it lets you focus on your algorithm and lets the JVM, OS and hardware deal with the performance optimization. All too frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
BTW: how much data are you dealing with, in GB? If it is more than 3-4 GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the addressable memory space of the platform architecture. A 64-bit machine and OS will take you to 1 TB or 128 TB of mappable data.
If you are thinking about performance, then get to know Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details (NIO Performance Tips) and other Java performance-related material.
You might want to have a look at the entries in the Wide Finder Project (do a Google search for "wide finder" java).
The Wide Finder task involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert the data to binary, but then you have at least one extra copy of it if you need to keep the original around.
It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again you can do it faster on subsequent passes.
To answer your questions in order:
Should I load everything into memory all at once?
Not if you don't have to. For some files you may be able to, but if you're just processing sequentially, do some kind of buffered read through them one by one, storing whatever you need along the way.
If not, what's a good way of loading the data partially?
BufferedReader and friends are simplest, although you could look deeper into FileChannel and memory-mapped I/O to go through windows of the data at a time.
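A sketch of that windowed memory-mapped approach; the window size is arbitrary, and records that straddle a window boundary need extra handling:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class WindowedMap {
    public static void main(String[] args) throws IOException {
        long window = 64L * 1024 * 1024;   // map 64 MB at a time
        try (FileChannel ch = FileChannel.open(Paths.get("huge-data.txt"), StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += window) {
                long len = Math.min(window, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    // ... process this window's bytes; watch for records crossing the boundary ...
                }
            }
        }
    }
}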
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
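One way that index idea might look in code, as a sketch only: record the byte offset where each interesting record starts during a pre-processing pass, then seek straight to it on the main run (what counts as "interesting" here is a placeholder):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class OffsetIndex {
    // pre-processing pass: remember where the interesting lines start
    // (RandomAccessFile.readLine() is unbuffered and slow, which is fine for a one-off pass over ASCII data)
    public static List<Long> buildIndex(String file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith("#SECTION")) {     // placeholder "interesting" test
                    offsets.add(offset);
                }
                offset = raf.getFilePointer();
            }
        }
        return offsets;
    }

    // main run: jump straight to a recorded offset instead of scanning
    public static String readAt(String file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            return raf.readLine();
        }
    }
}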
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entirety in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need random access, consider storing it in a quadtree.
I strongly recommend leveraging regular expressions and looking into the "new" I/O (NIO) package for faster input. Then it should go about as quickly as you can realistically expect gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.
