Concurrent text file reads by numerous VMs


I have a java app I am porting as a proof-of-concept to a cloud architecture. I want to process a very large text file by running the same processing program on chunks of the file on separate VMs.
Worker nodes = n
Head node running Master and one Worker, with n-1 Worker nodes
I have two ideas in mind:
Master reads file line-by-line, sends first line to first worker node, second to second worker node and so on, repeating every n lines.
Master reads number of lines in file. Worker nodes then instructed to read no_of_lines/n concurrently from the file.
I am considering using an RMI or sockets based approach for transfer of data. Could anyone tell me which of the above methods would be most efficient? If this question cannot be answered without specifying which java constructs I would be using, I would appreciate suggestions on those.
Also, would locking be an issue with concurrent file access if each node knows which lines it is supposed to read?
Thanks for any suggestions
Ian

To take the second question first, there is never any problem in many programs reading one file IFF no program is writing the file: each program has its own file-position pointer. Even if some program is writing to the file, there might not be any problem if that program is always writing at the end of the file which, in any sane system, is always the case.
As for the first question, IFF all of the lines in the file are of constant length, then the issue is as always one of efficiency: it's more efficient to read several lines than it is to read one line.
If I were doing the project, the master would ask the workers to read (n_lines_in_file/n_workers) lines. There seems to me little point in the master's reading lines and passing them out to workers. That's assuming, though, that each line takes the same amount of worker processing as any other.
If that's not true, or if there are other variables you haven't told us about, my strategy would no doubt change.
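The even split suggested above can be sketched as follows. This is only an illustration under stated assumptions: the class `LineChunks` and its methods are hypothetical names, the master is assumed to have counted the lines already, and every worker is assumed to be able to open the same file. Because lines may vary in length, the sketch scans from the top of the file to reach its first line; with constant-length lines a worker could seek straight to its offset instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: class and method names are illustrative only.
class LineChunks {

    // Returns {firstLine, endLineExclusive} for one worker. Any remainder is
    // spread over the first (totalLines % workers) workers, one line each.
    static long[] rangeFor(int workerIndex, int workers, long totalLines) {
        long base = totalLines / workers;
        long extra = totalLines % workers;
        long start = workerIndex * base + Math.min(workerIndex, extra);
        long end = start + base + (workerIndex < extra ? 1 : 0);
        return new long[] {start, end};
    }

    // Reads lines [start, end) by scanning from the top of the file.
    // Lines before `start` are read and discarded.
    static List<String> readLines(Path file, long start, long end) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            for (long n = 0; n < end && (line = reader.readLine()) != null; n++) {
                if (n >= start) {
                    out.add(line);
                }
            }
        }
        return out;
    }
}
```

Each worker would call `rangeFor` with its own index and then `readLines` with the result, so the master only has to send two numbers per worker rather than the data itself.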

When you break up a program, you should ensure that you are not creating more overhead than you are looking to save. For example, reading a few lines of text is relatively cheap compared with doing an RMI call. Copying the data to many hosts may be more expensive than the processing you intend to do.
How long does the processing take? This will guide you as to how large each piece of work needs to be to be efficient. You may find that the optimal number of threads is one. ;)

Related

Should I fill stream into a string/container for manipulation?

I want to make this general question. If we have a program which reads data from outside of the program, should we first put the data in a container and then manipulate the data or should we work directly on the stream, if the stream api of a language is powerful enough?
For example, I am writing a program which reads from a text file. Should I first put the data in a string and then manipulate it, instead of working directly on the stream? I am using Java, and let's say it has powerful enough (for my needs) stream classes.

Stream processing is generally preferable to accumulating data in memory.
Why? One obvious reason is that the file you are reading might not even fit into memory. You might not even know the size of the data before you've read it completely (imagine, that you are reading from a socket or a pipe rather than a file).
It is also more efficient, especially, when the size isn't known ahead of time - allocating large chunks of memory and moving data around between them can be taxing. Things like processing and concatenating large strings aren't free either.
If the io is slow (ever tried reading from a tape?) or if the data is being produced in real time by a peer process (socket/pipe), your processing of the data read can, at least in part, happen in parallel with reading, which will speed things up.
Stream processing is inherently easier to scale and parallelize if necessary: because your logic is forced to depend only on the element currently being processed, you are free from state. If the amount of data becomes too large to process sequentially, you can scale your app by adding more readers and splitting the stream between them.
You might argue that in your case none of this matters, because the file you are reading is only 300 bytes. Indeed, for small amounts of data this is not crucial (you may also bubble sort it while you are at it), but adopting good patterns and practices makes you a better programmer, and will help when it does matter. There is no disadvantage to it. No, it does not make your code more complicated. It might seem so to you at first, but that's simply because you are not used to stream processing. Once you get into the right mindset and it becomes natural to you, you'll see that, if anything, code that deals with one small piece of data at a time, and does not care about indexes, pointers and positions, is simpler than the alternative.
All of the above applies to sequential processing though. You read the stream once, processing the data immediately, as it comes in, and discarding it (or, perhaps, writing out to the next stream in pipeline).
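A minimal sketch of that sequential style, assuming a line-oriented text source. The class `StreamingCount` and the "count lines containing a word" task are made up for illustration; the point is only that one line is held in memory at a time, whatever the size of the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical example class; the matching task is a stand-in for real processing.
class StreamingCount {

    // Counts the lines containing `needle`, holding only the current line in memory.
    static long countMatches(Reader source, String needle) throws IOException {
        long matches = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a file, socket, or pipe.
        Reader source = new StringReader("error: a\nok\nerror: b\n");
        System.out.println(countMatches(source, "error"));
    }
}
```

The same method works unchanged with a `FileReader` or an `InputStreamReader` wrapped around a socket, which is the practical benefit of writing against the stream rather than against a fully loaded string.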
You mentioned RandomAccessFile ... that's a completely different beast. If you need random access, and the data fits in memory, put it in memory. Seeking the file back and forth is the same thing conceptually, only much slower. There is no benefit to it other than saving memory.

You should certainly process it as you receive it. The other way adds latency and doesn't scale.

How to write java thread pool programme to read content of file?

I want to define a thread pool with 10 threads and read the content of a file. But different threads must not read the same content (like dividing the content into 10 pieces and reading each piece with one thread).

Well what you would do would be roughly this:
1. get the length of the file,
2. divide by N,
3. create N threads,
4. have each one skip to (file_size / N) * thread_no and read (file_size / N) bytes into a buffer,
5. wait for all threads to complete,
6. stitch the buffers together.
(If you were slightly clever about it, you could avoid the last step ...)
HOWEVER, it is doubtful that you would get much speed-up by doing this. Indeed, I wouldn't be surprised if you got a slow down in many cases. With a typical OS, I would expect that you would get as good, if not better performance by reading the file using one big read(...) call from one thread.
The OS can fetch the data faster from the disc if you read it sequentially. Indeed, a lot of OSes optimize for this use-case, and use read-ahead and in-memory buffering (using OS-level buffers) to give high effective file read rates.
Reading a file with multiple threads means that each thread will typically be reading from a different position in the file. Naively, that would entail the OS seeking the disk heads backwards and forwards between the different positions, which will slow down I/O considerably. In practice, the OS will do various things to mitigate that, but even so, simultaneously reading data from different positions on a disk is still bad for I/O throughput.
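For completeness, here is a sketch of the N-thread steps listed above, with the performance caveat still applying. It assumes the file fits in a single byte array. The class name `SlicedRead` is hypothetical. Each task uses a positional `FileChannel.read` to fill its own region of one shared array, which is one way to avoid the final stitching step.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class: one shared array, one slice per task.
class SlicedRead {

    static byte[] readAll(Path file, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE - 8) {
                throw new IOException("file too large for a single array");
            }
            byte[] result = new byte[(int) size];
            long sliceLen = (size + threads - 1) / threads; // ceiling division
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                int start = (int) Math.min(t * sliceLen, size);
                int end = (int) Math.min(start + sliceLen, size);
                pending.add(pool.submit(() -> {
                    // The buffer is a view onto this task's region of the shared array.
                    ByteBuffer slice = ByteBuffer.wrap(result, start, end - start);
                    long pos = start;
                    while (slice.hasRemaining()) {
                        // Positional read: does not move the channel's own position.
                        int n = channel.read(slice, pos);
                        if (n < 0) {
                            break; // file became shorter while we were reading
                        }
                        pos += n;
                    }
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every slice; rethrows a task failure
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Timing this against a plain `Files.readAllBytes(file)` on the target machine is the only reliable way to tell whether the extra threads help at all.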

Channel for sharing data between threads

I have a requirement where I need to read text file then transform it and write it to some other file. I wish to do this in parallel fashion like one thread for read, one for transform and another for write.
Now to share data between threads I need some channel, I was thinking to use BlockingQueue for this but would like to explore some other (better) alternatives if available.
Guava has an EventBus, but I am not sure whether it is a good fit for this requirement. What other alternatives are available, and which one is best from a performance point of view?

Unless your transform step is really intensive, this is probably a waste of time.
Think of it this way. What are you asking for?
You're asking for something that
1. Takes an incoming stream of data
2. Copies it to another thread
3. Presents it to that thread as an incoming stream of data
What data structure best represents an incoming stream of data for step 3? (Hint: it's the InputStream you started with!)
What value do the first two steps add? The "transform" thread can read from disk just as fast as it could read from disk through another thread. Adding the thread in between does not speed up the disk read.
You would start to consider adding another thread when
- Your problem can be usefully divided into independent pieces of work (say, each thread works on a chunk of text)
- The cost of splitting the problem into those pieces of work is significantly smaller than the overhead of adding an additional thread and coordinating between them (which is small, but not free!)
- The problem requires more resources than a single CPU can provide (a thread gives you access to more CPU resources, but doesn't provide much value in terms of I/O throughput)
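If the transform does turn out to be heavy enough to justify a hand-off, the BlockingQueue mentioned in the question is a reasonable channel. The following is a rough sketch only: the class `QueuePipeline`, the queue capacity of 1024, and the upper-casing "transform" are all assumptions for illustration, and an in-memory list stands in for the input file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical example class: a reader thread hands lines to the transforming thread.
class QueuePipeline {

    // End-of-input marker. It is compared by identity, so it cannot clash with real data.
    private static final String END = new String("<end>");

    static List<String> run(List<String> input) throws InterruptedException {
        // A bounded queue keeps a fast reader from running far ahead of the transform.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try {
                for (String line : input) {
                    queue.put(line); // blocks while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        List<String> output = new ArrayList<>();
        for (String line = queue.take(); line != END; line = queue.take()) {
            output.add(line.toUpperCase(Locale.ROOT)); // stand-in for the real transform
        }
        reader.join();
        return output;
    }
}
```

A write stage would be attached the same way, with a second queue and its own end marker. Whether any of this beats a single read-transform-write loop depends entirely on how expensive the transform is.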

Is a single Java thread better than multiple threading in my scenario?

Our company is running a Java application (on a single CPU Windows server) to read data from a TCP/IP socket and check for specific criteria (using regular expressions) and if a match is found, then store the data in a MySQL database. The data is huge and is read at a rate of 800 records/second and about 70% of the records will be matching records, so there is a lot of database writes involved. The program is using a LinkedBlockingQueue to handle the data. The producer class just reads the record and puts it into the queue, and a consumer class removes from the queue and does the processing.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using single CPU)? I am looking for suggestions on how to speed up (without changing hardware).
Any suggestions would be really appreciated. Thanks

Simple: Try it and see.
This is one of those questions where you can argue several points on either side of the argument. But it sounds like you already have most of the infrastructure set up. Just create another consumer thread and see if that helps.
But the first question you need to ask yourself:
What is better?
How do you measure better?
Answer those two questions then try it.

Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multithreaded then look into the Executor concept in the concurrent utilities (There are plenty to choose from in the Executors helper class), as this will handle all the tedious details with threading that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here indexing, and RAM helps a lot, but that is a different question.

It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.

First of all:
It is wise to create your application using the Java 5 concurrency API (java.util.concurrent).
If your application is created around the ExecutorService it is fairly easy to change the number of threads used. For example: you could create a threadpool where the number of threads is specified by configuration. So if ever you want to change the number of threads, you only have to change some properties.
About your question:
- About the reading of your socket: as far as I know, it is not useful (if possible at all) to have two threads read data from one socket. Just use one thread that reads the socket, but make the actions in that thread as few as possible (for example read socket - put data in queue - read socket - etc).
- About the consuming of the queue: It is wise to construct this part as pointed out above, that way it is easy to change number of consuming threads.
- Note: you cannot really predict what is better; there might be another part that is the bottleneck, etcetera. Only monitoring / profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different numbers of threads.
So in short:
- Producer part: one thread that only reads from socket and puts in queue
- Consumer part: created around the ExecutorService so it is easy to adapt the number of consuming threads
Then use profiling to identify the bottlenecks, and use A-B testing to determine the optimal number of consuming threads for your system.
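The layout summarised above might look roughly like this. It is a sketch under assumptions, not the poster's code: the class `ConfigurableConsumers` and the system property name `consumer.threads` are hypothetical, the calling thread plays the single socket reader, and counting matches stands in for the database write.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical example class: one producer, a configurable number of consumers.
class ConfigurableConsumers {

    // Stop marker, compared by identity so it cannot clash with a real record.
    private static final String STOP = new String("<stop>");

    static int process(Iterable<String> records, Pattern pattern, int consumerThreads)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicInteger matched = new AtomicInteger();
        ExecutorService consumers = Executors.newFixedThreadPool(consumerThreads);
        for (int i = 0; i < consumerThreads; i++) {
            consumers.execute(() -> {
                try {
                    for (String r = queue.take(); r != STOP; r = queue.take()) {
                        if (pattern.matcher(r).find()) {
                            matched.incrementAndGet(); // the database write would go here
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Producer part: only reads a record and puts it in the queue.
        for (String r : records) {
            queue.put(r);
        }
        for (int i = 0; i < consumerThreads; i++) {
            queue.put(STOP); // one stop marker per consumer
        }
        consumers.shutdown();
        consumers.awaitTermination(1, TimeUnit.MINUTES);
        return matched.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Thread count comes from configuration; the property name is an assumption.
        int threads = Integer.getInteger("consumer.threads", 2);
        System.out.println(process(Arrays.asList("a1", "b", "c2"), Pattern.compile("\\d"), threads));
    }
}
```

Running it with `-Dconsumer.threads=1`, `2`, `4` and so on against realistic input is the kind of A-B comparison described above, with no code change between runs.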

As an update on my earlier question:
We ran some comparison tests between a single consumer thread and multiple threads (5, 10, 15 and so on), monitoring the queue size of yet-to-be-processed records. The difference was minimal, and what's more, the queue size grew slightly once the number of threads passed 25 (compared to running 5 threads). This leads me to the conclusion that the overhead of maintaining the threads outweighed the processing benefit. Maybe this is particular to our scenario, but I am just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using the multiple-insert statement in MySQL instead of single inserts. If we did not have that to start with, we could not have handled this load.
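A multiple-row insert of that kind puts several value groups into one INSERT statement, so a whole batch costs one round trip to the database. The sketch below only builds the parameterised SQL text. The class `MultiRowInsert` and the table and column names in the usage note are hypothetical, since the original schema is not shown, and the values themselves would be bound through a `PreparedStatement`.

```java
import java.util.Collections;

// Hypothetical helper: builds the SQL text for a multi-row INSERT with placeholders.
class MultiRowInsert {

    // Produces "INSERT INTO t (c1, c2) VALUES (?, ?), (?, ?), ..." with `rows` groups.
    // Table and column names must be trusted constants, never user input.
    static String build(String table, String[] columns, int rows) {
        if (rows < 1 || columns.length < 1) {
            throw new IllegalArgumentException("need at least one row and one column");
        }
        String group = "(" + String.join(", ", Collections.nCopies(columns.length, "?")) + ")";
        return "INSERT INTO " + table
                + " (" + String.join(", ", columns) + ") VALUES "
                + String.join(", ", Collections.nCopies(rows, group));
    }
}
```

For example, `MultiRowInsert.build("records", new String[] {"source", "payload"}, 2)` gives `INSERT INTO records (source, payload) VALUES (?, ?), (?, ?)`. The consumer would drain up to some batch size from the queue, build the statement for that many rows, bind the values in order, and execute once.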
End result: I am still not convinced on how multi-threading will give benefit on processing time. Maybe it has other benefits... but I am looking only from a processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again thanks for all your input.

In your scenario where a) the processing is minimal, b) there is only one CPU, and c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front-end and back-end threads are I/O bound, with minimal processing in the middle. That's why you don't see much improvement.
What you can do is try to have three stages: the 1st is a single thread pulling data from the socket, the 2nd is a thread pool that does the processing, and the 3rd is a single thread that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporary growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.

Converse of java FileDescriptor .sync() for reading files

Reading the javadoc on FileDescriptor's .sync() method, it is apparent that sync() is primarily concerned with committing any modified buffers back to the underlying storage. I.e., making sure that anything that your program has output will actually make it to the disk (or socket or what-have-you, but my question pertains mainly to disks).
But what about the other direction, what about INPUT? Suppose my program has some parts of a java.io.RandomAccessFile buffered in memory, and I want to READ those parts of the file, but perhaps some other process has modified those parts of the file since the last time my program read those blocks?
This is akin to marking a variable as 'volatile' in a C program; something else may have changed the 'real version' of something you merely have a convenient copy of.
I.e., how can you be certain that what your java program reads is at least reasonably up-to-date?
(Clearly the definition of 'up to date' matters. Purely as an example, suppose that the other process, the one that writes to the file, does so on the order of maybe once per second, and suppose that the reading process reads maybe once per minute. In a situation like this, performance isn't a big deal, it's just a matter of making sure that what the reader reads is consistent with what the writer writes, to within say, a second.)

Before re-reading your file, it is usually a good idea to check the last modified timestamp of the file with File.lastModified(). If this timestamp is not newer than the last time you read the file, you don't need to bother with more disk I/O to re-read the blocks you are interested in. One thing to keep in mind, though, is that the last modified timestamp may not always be updated immediately when the contents are updated if you are using a network filesystem. If you are dealing with a local process updating the file and another local process running your code reading the file, you most likely won't run into this issue.
One method I've had success with in the past was to have a separate thread poll the file for the last modified timestamp on certain intervals, say 5 seconds. If the file changed, re-process the file and send an event to registered listeners. In my case, 5 seconds was more than soon enough to get updates.
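A sketch of that polling approach, with assumptions stated: the class `ModificationWatcher` is a hypothetical name, the listener is reduced to a single `Runnable`, and any change in the timestamp (not only a newer one) is treated as a modification so that a file replaced by an older copy is also noticed.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical example class: remembers the last timestamp seen for one file.
class ModificationWatcher {
    private final File file;
    private long lastSeen;

    ModificationWatcher(File file) {
        this.file = file;
        this.lastSeen = file.lastModified();
    }

    // True if the timestamp differs from the one seen at the previous check.
    // File.lastModified() returns 0L when the file does not exist.
    synchronized boolean changedSinceLastCheck() {
        long current = file.lastModified();
        if (current != lastSeen) {
            lastSeen = current;
            return true;
        }
        return false;
    }

    // Checks every `seconds` seconds and runs `onChange` when the file changed.
    // The caller should call shutdown() on the returned scheduler when done.
    // Note: if `onChange` throws, the scheduler cancels all further checks.
    static ScheduledExecutorService poll(File file, long seconds, Runnable onChange) {
        ModificationWatcher watcher = new ModificationWatcher(file);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            if (watcher.changedSinceLastCheck()) {
                onChange.run();
            }
        }, seconds, seconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Something like `ModificationWatcher.poll(new File("data.txt"), 5, () -> reload())` would match the 5-second interval described above, where `data.txt` and `reload()` are placeholders for the real file and re-processing step.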

At the moment the file is read into the internal buffer, the buffer contents are up to date with the contents on the disk.
If you want to be sure to have the latest contents on your next access, you have to go to the disk again, skipping all internal buffers and caches. If you really want to be sure that all such layers are skipped, you'll have to reopen the file from scratch and seek to the position you want to access.
Of course, your performance will go down the tubes if you go to the disk on every access of the data. Don't think of 3-5 fold or so, but orders of magnitude.
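A minimal sketch of the reopen-and-seek approach, assuming the block of interest is identified by a byte offset and a length. The class `FreshRead` is a hypothetical name. Opening a new `RandomAccessFile` for every read means no buffer inside the Java program survives between reads; the operating system may still serve the bytes from its own cache.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical example class: open, seek, read, close on every call.
class FreshRead {

    // Reads exactly `length` bytes starting at `offset`.
    // readFully throws EOFException if the file is shorter than offset + length.
    static byte[] readAt(String path, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            byte[] block = new byte[length];
            raf.readFully(block);
            return block;
        }
    }
}
```

At one read per minute, as in the question's example, the cost of the extra open and close is unlikely to be noticeable.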

If another program you control is the only one writing to the file, then it's probably best to have 2 threads in the same Java process coordinate. The easiest solution is to create a java.util.concurrent.atomic.AtomicBoolean. The writer thread calls set(true) on the AtomicBoolean and the reader calls getAndSet(false). If getAndSet() returns true, then you know the reader needs to re-read the data. If it's an issue, you could synchronize on some object to prevent the writer from writing while the reader is reading.
You said "process" in the question, so maybe you are concerned about any other process on the system changing the data. In this case, I think your best bet is to just reopen and reread the data. The performance impact of this should be negligible if you really are only reading once per minute.
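The flag handshake described above is small enough to show in full. The wrapper class `DirtyFlag` and its method names are hypothetical; only the `set(true)` and `getAndSet(false)` calls come from the description.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper around the AtomicBoolean shared by the writer and reader threads.
class DirtyFlag {
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    // Writer thread: call after each write to the file.
    void markWritten() {
        dirty.set(true);
    }

    // Reader thread: returns true if a write happened since the last call,
    // and clears the flag in the same atomic step.
    boolean needsReread() {
        return dirty.getAndSet(false);
    }
}
```

Because `getAndSet` reads and clears in one atomic operation, a write that lands between two reader checks is never lost: the next `needsReread()` call reports it.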
