I am reading a huge file (almost 5 million lines). Each line contains a date and a request, and I must parse the requests between two concrete **Date**s. I use a BufferedReader to read the file until the start date and then start parsing lines. Can I use threads for parsing the lines, since it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5-million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If that is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally, you should store the data in a database or in memory instead of processing a flat text file on every request; then, on each request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
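A rough sketch of that incremental re-check, assuming the records are newline-terminated; parseRecord(...) is a hypothetical hook that updates your database or in-memory structure:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    long pollNewRecords(String path, long lastOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (raf.length() <= lastOffset) return lastOffset; // nothing new yet
            raf.seek(lastOffset);                 // skip records parsed earlier
            for (String line; (line = raf.readLine()) != null; ) {
                parseRecord(line);                // hypothetical: update DB / index
            }
            return raf.getFilePointer();          // remember where this pass stopped
        }
    }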
Firstly, 5 million lines of 1,000 characters each is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered map keyed by date: every date is a key in the map and points to a list of the line numbers that contain the requests for that date. You can then go directly to the relevant line numbers.
Something of the form
TreeMap<Date, List<Long>>
would do nicely. That should have a memory usage on the order of 5,000,000 × 32 / 8 bytes = 20 MB, which should be fine.
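A sketch of building that index in a single pass, assuming (purely for illustration) that each line starts with a yyyy-MM-dd date:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.TreeMap;

    TreeMap<Date, List<Long>> buildIndex(String path) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd"); // assumed line format
        TreeMap<Date, List<Long>> index = new TreeMap<>();
        long lineNo = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            for (String line; (line = in.readLine()) != null; lineNo++) {
                Date date = fmt.parse(line.substring(0, 10));  // date prefix assumed
                index.computeIfAbsent(date, d -> new ArrayList<>()).add(lineNo);
            }
        }
        return index;  // index.subMap(from, to) then yields the lines for a date range
    }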
You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. This also allows memory mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
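For example, a minimal memory-mapping sketch (the file name, region size, and byte offset are assumptions):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    void mapFile() throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("requests.log", "r");
             FileChannel ch = raf.getChannel()) {
            // A single MappedByteBuffer is limited to Integer.MAX_VALUE bytes,
            // so a real reader would map regions of the file as needed.
            long size = Math.min(ch.size(), 1L << 30);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int offset = 0;      // hypothetical: byte offset looked up from the index
            buf.position(offset);
            // ... read the relevant lines out of buf ...
        }
    }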
A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run it. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future is queued for background processing. To avoid creating and destroying too many threads, the executor creates only as many threads as you specified and executes the futures one after another.
To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started), this method blocks until it has. Other futures are executed in the background while you wait.
Remember to call executor.shutdown() when you don't need the executor anymore, to make sure it terminates its background threads, which it would otherwise keep around until the keep-alive time has expired or it is garbage-collected.
tl;dr pseudocode:
    create executor
    for each line in file
        create new FutureTask which parses that line
        pass future task to executor
        add future task to a list
    for each entry in task list
        call entry.get() to retrieve result
    executor.shutdown()
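For completeness, a minimal runnable sketch of that pseudocode; Request and parseLine(...) are hypothetical placeholders for your own parsing code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.FutureTask;

    List<Request> parseAll(List<String> lines) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(cores);
        List<FutureTask<Request>> tasks = new ArrayList<>();
        for (String line : lines) {
            FutureTask<Request> task = new FutureTask<>(() -> parseLine(line)); // parseLine: hypothetical
            executor.execute(task);      // queued; runs on one of the pool threads
            tasks.add(task);
        }
        List<Request> results = new ArrayList<>();
        for (FutureTask<Request> task : tasks) {
            results.add(task.get());     // blocks until that line has been parsed
        }
        executor.shutdown();             // let the worker threads terminate
        return results;
    }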
Related
I am new to Java, and I'm developing an application that needs to do the following: read a high volume of files on multiple threads and perform a series of tasks on each file. I cannot process duplicate files, and I need to know when I've attempted a duplicate so that I can rename/move the file from our watch folder and send notification of the attempt. Additionally, any of the tasks to be performed could result in the file being aborted, so to speak, in which case we can make another attempt after changes to the file have been made. This also needs to be recoverable in the event of a crash so that no file is missed or mishandled. The tasks to be performed could take anywhere from a few milliseconds to multiple seconds, or possibly even minutes, to complete.
My application currently uses multiple ExecutorServices and PriorityBlockingQueues for each task to be performed, with an in-memory HashSet to keep track of which files we've already seen and to prevent duplicates. At present, with each new file, I check whether HashSet.add() returns true or false; if false, I rename the duplicate. Each file that is added to the HashSet goes into the first task's queue, where multiple threads churn away on that queue and then either abort the file or pass it to the next task's queue, where multiple threads churn away, and so on.
My thought was to continue down this road but with the following changes: 1) the HashSet will be loaded from disk at startup, and the workflow will then proceed as I've outlined above; 2) once all tasks are completed, I'll write any file that isn't aborted to the HashSet on disk, and it will remain in the in-memory HashSet as well.
There will be multiple threads that add to the HashSet on disk and multiple threads that add to the HashSet in memory. As I said, at times we'll want to reattempt a file that was previously aborted, and for that step it's possible we'll require a change to one of the file's properties that our contains check evaluates, so we don't necessarily need the ability to remove aborted files from the HashSet; we may instead require that such files be presented to us as what we perceive to be new files.
All that said, I'm not looking for anyone to write or review my code so I won't post it here. Can I have multiple threads call HashSet.add() with no repercussions?
I have a situation where, whenever my ConcurrentHashMap is updated, I need to clear an existing file and write the entire data set into the file again. Clearing the file and rewriting all the data on every update causes high latency, so I am thinking that whenever my map is idle (i.e. no update operation is going on), I will write the entire data set into the file; otherwise I will wait until the map is idle.
Basically, I will be deleting Strings from the map continuously, and writing to the file after every single deletion is a very costly operation. So, is there a way to know that no deletion operation is going on in the ConcurrentHashMap?
So is there a way to know that no deletion operation is going on the ConcurrentHashMap?
Short answer: no, there isn't.
But even if there were, you would still run into problems. For example, suppose new updates arrived immediately after you started clearing / writing.
I think the solution is to use two maps and a queue.
When an update request happens:
perform the update on the concurrent hashmap
add the request to the queue
In a background thread:
pull requests from the queue, and perform updates on the second (shadow) hashmap
periodically or based on some other criteria, cease pulling requests and flush the shadow hashmap to the file.
The primary hashmap is always updated quickly, and is always up to date. Operations updating and using the primary hashmap do not get (significantly) blocked.
The queue provides request buffering while the shadow hashmap is being written.
The second hashmap is only accessed by one thread, so it doesn't need to be concurrent. Therefore it will be faster.
The state of the file will typically be a little behind the primary hashmap. But that is inevitable. The only way to avoid that is to block updates to the primary map ... which is what you are trying to avoid.
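A rough sketch of this design, assuming String keys and values; the flush criterion and flushToFile(...) are placeholders you would adapt:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    class ShadowedMap {
        final ConcurrentHashMap<String, String> primary = new ConcurrentHashMap<>();
        final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>();
        final Map<String, String> shadow = new HashMap<>(); // one thread only, no locking

        void put(String key, String value) {
            primary.put(key, value);                // fast; never blocked by file I/O
            queue.add(new String[] { key, value });
        }

        void remove(String key) {
            primary.remove(key);
            queue.add(new String[] { key, null });  // null value means "remove"
        }

        // Run this loop in a background thread.
        void drainLoop() throws InterruptedException {
            while (true) {
                String[] op = queue.take();         // wait for the next update
                if (op[1] == null) shadow.remove(op[0]); else shadow.put(op[0], op[1]);
                if (queue.isEmpty()) flushToFile(shadow); // flush criterion: an assumption
            }
        }

        void flushToFile(Map<String, String> snapshot) { /* write snapshot to the file */ }
    }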
Another way to approach this would be to make writing to the file faster. I suspect that the reason it is slow is because your current design requires you to clear and rewrite the file each time. Another approach would be to write only the changes to the file. This means you may have more work to do on restart ... assuming the purpose of the file is to record the map state so that you can restart.
It sounds like you need to make use of encapsulation by wrapping the ConcurrentHashMap in a class and exposing add/remove methods backed by a Queue. Look at the java.util.concurrent package for other options.
The idea would be to use a queue: every access to the map would go through the wrapper's add/remove methods, which would also record the operation in the queue. An infinite loop in a separate thread would then consume the queue, and while doing so it can check whether the queue is empty and, if so, persist the map to the file.
So let me give you an idea of what I'm trying to do:
I've got a program that records statistics, lots and lots of them, but it records them one at a time as they happen and puts them into an ArrayList, for example:
Please note this is an example; I'm not recording these exact stats, I'm just simplifying a bit.
User clicks -> Add user_click to array
User clicks -> Add user_click to array
Key press -> Add key_press to array
After each event(clicks, key presses, etc) it checks the size of the ArrayList, if it is > 150 the following happens:
A new thread is created
That thread is given a copy of the ArrayList
The original ArrayList is .clear()'ed
The new thread combines similar items so user_click would now be one item with a quantity of 2, instead of 2 items with a quantity of 1 each
The thread processes the data to a MySQL db
I would love to find a better approach to this, although this works just fine. The issue with thread pools and processing immediately is that there would be literally thousands of MySQL queries per day without combining them first.
Is there a better way to accomplish this? Is my method okay?
The other thing to keep in mind is that the thread where events are fired and recorded can't be slowed down, so I don't really want to combine items on the main thread.
If you've got code examples that would be great; if not, just an idea of a good way to do this would be awesome as well!
For anyone interested, this project is hosted on GitHub, the main thread is here, the queue processor is here, and please forgive my poor naming conventions and general code cleanliness. I'm still (always) learning!
The logic described seems pretty good, with two adjustments:
Don't copy the list and clear the original. Send the original and create a new list for future events. This eliminates the O(n) processing time of copying the entries.
Don't create a new thread each time. Events are delayed anyway, since you're collecting them, so the timeliness of writing to the database is not your major concern. Two choices:
Start a single thread up front, then use a BlockingQueue to send list from thread 1 to thread 2. If thread 2 is falling behind, the lists will simply accumulate in the queue until thread 2 can catch up, without delaying thread 1, and without overloading the system with too many threads.
Submit the job to a thread pool, e.g. using an Executor. This would allow multiple (but limited number of) threads to process the lists, in case processing is slower than event generation. Disadvantage is that events may be written out of order.
For the sake of separation of concerns and reusability, you should encapsulate the logic of collecting events and sending them to the other thread in blocks in a separate class, rather than embedding that logic in the event-generation code.
That way you can easily add extra features, e.g. a timeout for flushing pending events before reaching normal threshold (150), so events don't sit there too long if event generation slows down.
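A minimal sketch of such a class, assuming String events and the 150-entry threshold mentioned above; the duplicate-combining and MySQL write are left as comments:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class EventBatcher {
        private final BlockingQueue<List<String>> handoff = new LinkedBlockingQueue<>();
        private List<String> current = new ArrayList<>();
        private static final int FLUSH_THRESHOLD = 150;

        // Called from the event thread; never touches the database.
        synchronized void record(String event) {
            current.add(event);
            if (current.size() >= FLUSH_THRESHOLD) {
                handoff.add(current);         // hand over the original list, no copying
                current = new ArrayList<>();  // fresh list for future events
            }
        }

        // Run in a single background thread.
        void writerLoop() throws InterruptedException {
            while (true) {
                List<String> batch = handoff.take(); // blocks until a batch arrives
                // combine duplicate entries here, then write one aggregated MySQL batch
            }
        }
    }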
I am developing an application in JSF 2.0 and I would like to have a multiline textbox which displays output data which is being read (line by line) from a file in real time.
So the goal is to have a page with a button on it that triggers the backend to start reading from the file and then displaying the results as it's reading in the textbox.
I had thought about doing this in the following way:
Have the local page keep track of what lines it has retrieved/displayed in the textbox so far.
Periodically the local page will poll the backend using AJAX and request any new data that has been read (tell it what lines the page has so far and only retrieve the new lines since then).
This will continue until the entire file has been completely retrieved.
The issue is that the bean method that reads from the file runs a while loop that blocks. So reading from the data structure it is writing to, at the same time, will require additional threads, correct? I hear that spawning new threads in a web application is a potentially dangerous move and that thread pools should be used, etc.
Can anyone shed some insight on this?
Update: I tried a couple of different things with no luck, but I did manage to get it working by spawning a separate thread to run my blocking loop while the main thread reads from it whenever an AJAX request is processed. Is there a good library I could use to do something similar that still gives JSF some lifecycle control over this thread?
Have you considered using the Future interface (included in the Java 5+ concurrency API)? Basically, as you read the file, you could split it into sections and simply create a new Future object for each section. Each object then returns once its computation has completed.
This way you avoid accessing the structure while it is still being manipulated by the loop, and you also split the operation into smaller computations, reducing the amount of time locking occurs (the total lock time might be greater, but you get faster responses in other areas). If you maintain the order in which your Future objects were created, then you don't need to track line numbers. Note that calling Future.get() does block until the object is 'ready'.
The rest of your approach would be similar: make the AJAX call to get the content of all 'ready' Future objects from a FIFO queue.
I think I understand what you're trying to accomplish... maybe a bit more info would help.
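A sketch of that FIFO-of-Futures approach; sectionCount and readSection(...) are hypothetical stand-ins for your file-splitting logic:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class SectionFeed {
        final ExecutorService pool = Executors.newSingleThreadExecutor();
        final Queue<Future<String>> sections = new ConcurrentLinkedQueue<>();

        void startReading(int sectionCount) {           // sectionCount: hypothetical
            for (int i = 0; i < sectionCount; i++) {
                final int section = i;
                sections.add(pool.submit(() -> readSection(section)));
            }
        }

        List<String> drainReady() throws Exception {    // call this per AJAX poll
            List<String> ready = new ArrayList<>();
            while (!sections.isEmpty() && sections.peek().isDone()) {
                ready.add(sections.poll().get());       // won't block: isDone() was true
            }
            return ready;
        }

        String readSection(int n) { /* hypothetical: read one file section */ return ""; }
    }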
I can understand why network apps would use multiplexing (to not create too many threads), and why programs would use async calls for pipelining (more efficient). But I don't understand the efficiency purpose of AsynchronousFileChannel.
Any ideas?
It's a channel that you can use to read files asynchronously, i.e. the I/O operations are done on a separate thread, so that the thread you're calling it from can do other things while the I/O operations are happening.
For example: The read() methods of the class return a Future object to get the result of reading data from the file. So, what you can do is call read(), which will return immediately with a Future object. In the background, another thread will read the actual data from the file. Your own thread can continue doing things, and when it needs the read data, you call get() on the Future object. That will then return the data (if the background thread hasn't completed reading the data, it will make your thread block until the data is ready). The advantage of this is that your thread doesn't have to wait the whole length of the read operation; it can do some other things until it really needs the data.
See the documentation.
Note that AsynchronousFileChannel will be a new class in Java SE 7, which is not released yet.
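A minimal sketch of the Future-returning read() described above (the file name is a placeholder):

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.Future;

    void readAsync() throws Exception {
        try (AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                Paths.get("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            Future<Integer> pending = ch.read(buf, 0);  // returns immediately
            // ... do other useful work here while the read proceeds ...
            int bytesRead = pending.get();  // blocks only if the read is still in flight
        }
    }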
I've just come across another, somewhat unexpected reason for using AsynchronousFileChannel. When performing random record-oriented writes across large files (exceeding physical memory so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel.
My best guess is that because the asynchronous I/O boils down to overlapped I/O in Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.
I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half the performance of AsynchronousFileChannel).
Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic, and another order of magnitude faster on smaller files that fit in memory).
Will be interesting to see if the same ratios hold on Linux.
The main reason I can think of to use asynchronous I/O is to better utilize the processor. Imagine you have an application that does some sort of processing on a file, and let's also assume you can process the data contained in the file in chunks. If you don't make use of asynchronous I/O, your application will probably behave something like this:
Read a block of data. There is no processor utilization at this point, as you're blocked waiting for the data to be read.
Process the data you just read. At this point your application starts consuming CPU cycles as it processes the data.
If there is more data to read, go to #1.
The processor utilization goes up, then drops to zero, then goes up again, and so on. Ideally you never want the processor to be idle if you want your application to be efficient and to process the data as fast as possible. A better approach would be:
Issue async read
When read completes issue next async read and then process data
The first step is the bootstrapping. You have no data yet so you have to issue a read. From then on, when you get notified a read has completed you issue another async read and then process the data. The benefit here is that by the time you finish processing the chunk of data the next read has probably finished, so you always have data available to process and thus you're more efficiently using the processor. If your processing finishes before the read has finished you might need to issue multiple asynchronous reads so that you have more data to process.
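A sketch of that read-ahead loop using two buffers, so the next read overlaps with processing of the current chunk; process(...) is a hypothetical placeholder:

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.Future;

    void pipelinedRead() throws Exception {
        try (AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                Paths.get("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer[] bufs = { ByteBuffer.allocate(64 * 1024), ByteBuffer.allocate(64 * 1024) };
            long pos = 0;
            int cur = 0;
            Future<Integer> pending = ch.read(bufs[cur], pos);  // bootstrap read
            while (true) {
                int n = pending.get();               // wait for the in-flight read
                if (n < 0) break;                    // end of file
                pos += n;
                int next = 1 - cur;
                pending = ch.read(bufs[next], pos);  // issue the next read first...
                bufs[cur].flip();
                process(bufs[cur]);                  // ...then process while it runs
                bufs[cur].clear();
                cur = next;
            }
        }
    }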
Nick
Here's something no one has mentioned:
A plain FileChannel implements InterruptibleChannel, so it, as well as anything that uses it (such as the OutputStream returned by Files.newOutputStream()), has the unfortunate[1][2] behaviour that performing any blocking operation on it (e.g. read() or write()) from a thread in the interrupted state causes the channel itself to close with java.nio.channels.ClosedByInterruptException.
If this is a problem, using AsynchronousFileChannel instead is a possible alternative.
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6608965
[2] https://bugs.openjdk.java.net/browse/JDK-4469683
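A small demonstration of this behaviour (the file name is a placeholder):

    import java.nio.ByteBuffer;
    import java.nio.channels.ClosedByInterruptException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class InterruptClosesChannel {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("out.bin"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                Thread.currentThread().interrupt();   // set the interrupted state
                try {
                    ch.write(ByteBuffer.wrap("data".getBytes()));
                } catch (ClosedByInterruptException e) {
                    System.out.println("open after interrupt? " + ch.isOpen()); // false
                }
            }
        }
    }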