java many threads writing random bytes to file simultaneously [just need advice] - java

I am writing a simple benchmark of sorts in Java to test parallelization. The program generates 1000 random bytes in total and writes them to a binary file. It uses different numbers of threads to parallelize the byte generation and the writing to disk, and measures the execution time of the overall process for each thread count.
The program splits the entirety of the execution among a specified number of threads: both the generation of the byte arrays and the writing of these bytes to a file.
My problem is that I need to have a single binary file at the end. I need advice on how best to make each thread write its trash bytes to the same file. Keep in mind that I do not care at all what order they end up in. I have three ideas so far:
1) Should I have each thread create an instance of RandomAccessFile each referencing the same empty file on the disk, and have each thread write to the file starting from a different location in a way that they do not overlap? This seems like the best way to truly parallelize the disk writing.
2) Can I pass each thread a reference to some kind of buffered stream object and have each thread send its byte array into this stream? Is there a way to create an object which will just listen for bytes and immediately write them to a file in whatever order it receives them? I am worried that having a single object collect all of the bytes would not truly represent parallelized disk writing.
3) Should I have each thread write its bytes to its own file, and then "merge" its file into the main file?
Thanks for your time! I don't need detailed code examples, just want to get some advice as I work on this to point me in the right direction.

Create a FileOutputStream, get the corresponding FileChannel, and write the data using ByteBuffers.
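A minimal sketch of that suggestion, combined with idea 1 from the question (each thread writes to its own non-overlapping region). The class and method names are made up for illustration, and the chunking scheme is an assumption, not something the answer specifies. It relies on FileChannel's positional write(ByteBuffer, long), which does not move the channel's own position, so several threads can share one channel.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: names and chunking scheme are illustrative only.
class ParallelChannelWrite {

    // Writes totalBytes random bytes to path; each thread fills its own disjoint region.
    static void writeRandomBytes(String path, int totalBytes, int threadCount) throws Exception {
        try (FileOutputStream out = new FileOutputStream(path);
             FileChannel channel = out.getChannel()) {
            int chunk = (totalBytes + threadCount - 1) / threadCount; // ceiling division
            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < threadCount; i++) {
                final long offset = (long) i * chunk;
                final int length = (int) Math.max(0, Math.min(chunk, totalBytes - offset));
                Thread t = new Thread(() -> {
                    byte[] data = new byte[length];
                    ThreadLocalRandom.current().nextBytes(data);
                    ByteBuffer buffer = ByteBuffer.wrap(data);
                    long position = offset;
                    try {
                        // A single write call may be partial, so loop until the buffer is drained.
                        while (buffer.hasRemaining()) {
                            position += channel.write(buffer, position);
                        }
                    } catch (IOException e) {
                        // Sketch only: a real benchmark should report this to the caller.
                        throw new UncheckedIOException(e);
                    }
                });
                threads.add(t);
                t.start();
            }
            for (Thread t : threads) {
                t.join(); // the channel must stay open until every writer is done
            }
        }
    }
}
```

Calling ParallelChannelWrite.writeRandomBytes("out.bin", 1000, 4) would produce a single 1000-byte file. Whether this is actually faster than one thread depends on the disk and the OS, which is what the benchmark is meant to find out.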

Related

How do you write to a blocking device file using Java?

I have a character special device file on a Linux system (eg /dev/foobardcma6) that I wish to constantly write data to. What is the preferred way to do this in Java?
I tried using AsynchronousFileChannel and was able to write some bytes to it, but eventually it blocks/hangs when writing. I don't know if this is the right approach or not, however.
I could use a FileChannel. The write method returns how many bytes were actually written. I assume that if it returns fewer bytes than were requested, the write buffer is full and you should wait before writing again. However, I don't see any mechanism to be notified that the file is ready for writing.
Update:
I tried using a FileChannel, and it also blocks after a certain number of bytes. The suspicious thing is that both the FileChannel and AsynchronousFileChannel implementations block after writing exactly the same number of bytes. In both cases the last call to write never returns.
I have a test utility written in C++ that can successfully write data to the device without issue, so it's not a hardware problem. I assume I'm doing something wrong with the FileChannels.

HttpClient - write from one InputStream to multiple POST requests

I have a cluster of servers (potentially remote from each other) which all run Tomcat and communicate over HTTP using Apache HttpClient. A large number of these servers are data stores, and one of the servers is a front-facing webserver that serves as an intermediary between the client and the stores. A user should be able to upload a file to the webserver and the webserver will pass that file to a given number of stores.
So, the question: is it possible to take the file part of the upload from the client as an InputStream and write to multiple POST requests to the stores at the same time? If I were simply writing to local files, the obvious solution would simply be to read chunks of the InputStream into a byte array buffer and write from the buffer to each of the outputs in turn, but I'm at a loss as to how to convince HttpClient to "share" a stream like this.
And yes, I could simply read the entire InputStream into an object on the webserver and write it out to each store sequentially, but since I could potentially be accepting very large files I'd have to write the data to disk and then read it back for each store server, and the number of disk operations could swiftly become prohibitive. This is an implementation I'd prefer to avoid.
If the stores do not have the network bandwidth to keep up, how would it "share" the stream?
You can split up the incoming file and pass it on to the stores without writing it to disk, but if just one of the stores cannot keep up, you'll have to keep that file data in memory until it can accept it. If it's a big file, or many users, it can potentially take all your memory.
More technically what I mean is that you can create 5 threads that will send data as fast as they can to the stores and keep the file data in a shared FIFO structure. When the last thread has accessed a portion and sent that portion, that data can be removed from the data structure, but not before. If one is slow, the data structure can grow huge.
The data has to be somewhere, if not memory and not hard drive, then where?
So, keep the incoming data in memory until (if?) you're running out of memory (never?), then flush it to the hard drive. Keep trying to empty the data structure with the data by getting it sent to the stores and then removing.
You can rather easily code an ExecutorService to handle the re-transmit of data and cleaning up the data structure, but it won't solve the problem magically. :)
I haven't provided source code, because you don't seem to want this solution. You might need help implementing it later if you accept that you can't magically pass the data on without there being some chance of having to buffer it on the hard drive (a worse solution would be to throttle the user uploads to MinimumBandwidth(store1, store2, store3, store4, store5)).
Edit/changing:
I'm not sure you really want an ExecutorService even though I said that. I would actually create my own custom threads to handle this. I would create a collection from the concurrent package, probably a LinkedBlockingQueue that holds byte arrays (not bytes, arrays of bytes). Then I would create a map from Thread to Integer that holds each thread's current index, i.e. how far it has progressed in passing on the data. When all progress numbers are above, say, 10 (meaning all threads have sent the first 10 chunks), I remove the first 10 byte arrays and subtract 10 from every thread's progress to reset it.
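A rough sketch of the shared-buffer idea described above, with some liberties taken: it uses a plain ArrayList guarded by synchronized instead of a LinkedBlockingQueue (which has no indexed access), an int array instead of a Thread-to-Integer map, and it trims as soon as every consumer has passed a chunk rather than in batches of 10. The class name and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structure: one producer appends chunks, N consumers each read every chunk.
class SharedChunkBuffer {
    private final List<byte[]> chunks = new ArrayList<>();
    private final int[] progress; // per consumer: index of the next chunk to read
    private boolean closed;

    SharedChunkBuffer(int consumers) {
        progress = new int[consumers];
    }

    // Producer side: append a chunk and wake up waiting consumers.
    synchronized void put(byte[] chunk) {
        chunks.add(chunk);
        notifyAll();
    }

    // Producer side: signal that no more chunks will arrive.
    synchronized void close() {
        closed = true;
        notifyAll();
    }

    // Consumer side: next chunk for this consumer, or null once closed and fully read.
    synchronized byte[] take(int consumer) throws InterruptedException {
        while (progress[consumer] >= chunks.size()) {
            if (closed) {
                return null;
            }
            wait();
        }
        byte[] chunk = chunks.get(progress[consumer]++);
        trim();
        return chunk;
    }

    // Drop chunks that every consumer has already read and shift the indices back.
    private void trim() {
        int min = Integer.MAX_VALUE;
        for (int p : progress) {
            min = Math.min(min, p);
        }
        if (min > 0) {
            chunks.subList(0, min).clear();
            for (int i = 0; i < progress.length; i++) {
                progress[i] -= min;
            }
        }
    }

    // Number of chunks still held in memory.
    synchronized int buffered() {
        return chunks.size();
    }
}
```

Note that nothing here bounds memory: as the answer says, if one consumer is slow the list keeps growing, so a real implementation would need a cap or a spill-to-disk policy.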
Create your own output stream. Attach as many HTTP POST clients to this stream as you need. Whenever your output stream receives data, send it to each of the connected POST clients.
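One way to read this suggestion is a fan-out stream that copies every write to a list of target streams. The sketch below is only that part; the class name is made up, and how each target stream gets connected to a POST request body in HttpClient is deliberately left out, since the answer does not say.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical fan-out stream: every byte written here is forwarded to all targets in turn.
class FanOutOutputStream extends OutputStream {
    private final List<OutputStream> targets;

    FanOutOutputStream(List<OutputStream> targets) {
        this.targets = new ArrayList<>(targets); // defensive copy
    }

    @Override
    public void write(int b) throws IOException {
        for (OutputStream target : targets) {
            target.write(b);
        }
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Forward whole chunks rather than falling back to byte-by-byte writes.
        for (OutputStream target : targets) {
            target.write(buf, off, len);
        }
    }

    @Override
    public void flush() throws IOException {
        for (OutputStream target : targets) {
            target.flush();
        }
    }

    @Override
    public void close() throws IOException {
        for (OutputStream target : targets) {
            target.close();
        }
    }
}
```

Because the targets are written sequentially, the slowest store throttles the whole upload, which is the bandwidth trade-off the other answer describes.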

What is the fastest way to write a large amount of data from memory to a file?

I have a program that generates a lot of data and puts it in a queue to write but the problem is its generating data faster than I'm currently writing(causing it to max memory and start to slow down). Order does not matter as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process(but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
// solutions is the shared queue that the background threads fill
FileWriter writer = new FileWriter("foo.txt", true);
BufferedWriter bufferedWriter = new BufferedWriter(writer);
while (!solutions.isEmpty()) {
    String data = solutions.poll().data;
    bufferedWriter.newLine();
    bufferedWriter.write(data);
}
bufferedWriter.close();
I'm pretty new to programming, so I may be assessing this wrong (it may be a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file, or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? I'm not sure of the best approach, and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately it's bytes that go to the streams. A writer does character-to-byte encoding under the hood, and it does it in the same thread that handles the writing. That may mean that time spent encoding is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
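A minimal sketch of that change, under a few assumptions that are not in the question: the class and method names are invented, UTF-8 is picked as the encoding, and an empty array compared by identity is used as an end-of-data marker so the writer knows when to stop.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;

// Hypothetical helper: producers encode, a single writer thread only moves bytes.
class ByteQueueWriter {
    // Sentinel marking the end of the data; compared by identity, never written.
    static final byte[] END = new byte[0];

    // Producer side: the encoding cost is paid here, in the producing thread.
    static void enqueue(BlockingQueue<byte[]> queue, String line) throws InterruptedException {
        queue.put((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    // Writer side: take already-encoded chunks and write them until END arrives.
    static void drain(BlockingQueue<byte[]> queue, OutputStream rawOut)
            throws IOException, InterruptedException {
        try (OutputStream out = new BufferedOutputStream(rawOut)) {
            for (byte[] chunk = queue.take(); chunk != END; chunk = queue.take()) {
                out.write(chunk);
            }
        }
    }
}
```

If the queue is created with a capacity, for example new LinkedBlockingQueue<>(1024), put blocks when the writer falls behind, which also keeps the producers from filling memory.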
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For Latin text and UTF-8 encoding, this will usually be true.
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster, either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend: SQS vs EBS vs local disk, etc.), or by ganging several IO subsystems together in parallel somehow.
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt", true);
// GZIPOutputStream compresses everything written through it before it reaches the file
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while (!solutions.isEmpty()) {
    String data = solutions.poll().data;
    bufferWriter.newLine();
    bufferWriter.write(data);
}
// closing the outermost writer flushes the buffer and finishes the gzip stream
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data in multiple files (not in the same file -> due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another thing to check:
Do you actually empty your queue? Does solutions.poll() remove elements from your solutions queue?
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
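For reference, a small sketch of setting the buffer size through the two-argument BufferedWriter constructor. The class and constant names are made up; note that the size is given in characters, not bytes, so this value is only roughly the "10 MB" mentioned above.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical helper showing the BufferedWriter(Writer, int) constructor.
class LargeBufferExample {
    // Buffer size in chars (about 10 million characters).
    static final int BUFFER_CHARS = 10 * 1024 * 1024;

    // Opens the file in append mode with a large write buffer.
    static BufferedWriter open(String path) throws IOException {
        return new BufferedWriter(new FileWriter(path, true), BUFFER_CHARS);
    }
}
```

Whether a larger buffer helps is something to measure; if the disk itself is the bottleneck, it may make little difference.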

Channel for sharing data between threads

I have a requirement where I need to read text file then transform it and write it to some other file. I wish to do this in parallel fashion like one thread for read, one for transform and another for write.
Now to share data between threads I need some channel, I was thinking to use BlockingQueue for this but would like to explore some other (better) alternatives if available.
Guava has a EventBus but not sure whether this is a good fit for the requirement. What other alternatives are available and which one is best from performance point of view.
Unless your transform step is really intensive, this is probably a waste of time.
Think of it this way. What are you asking for?
You're asking for something that
1) Takes an incoming stream of data
2) Copies it to another thread
3) Presents it to that thread as an incoming stream of data
What data structure best represents an incoming stream of data for step 3? (Hint: it's the InputStream you started with!)
What value do the first two steps add? The "transform" thread can read from disk just as fast as it could read from disk through another thread. Adding the thread in between does not speed up the disk read.
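To make the point concrete, here is a sketch of the single-threaded version: read a line, transform it, write it, with no hand-off between threads. The class name and the use of a UnaryOperator for the transform step are assumptions for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.function.UnaryOperator;

// Hypothetical single-threaded pipeline: the transform reads straight from the input.
class StreamTransform {

    // Reads lines from in, applies step to each, and writes the results to out.
    static void transform(Reader in, Writer out, UnaryOperator<String> step) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(step.apply(line));
            writer.newLine(); // uses the platform line separator
        }
        writer.flush(); // the caller owns and closes the underlying streams
    }
}
```

With a FileReader and FileWriter as arguments, this does the whole read, transform, write job in one thread; extra threads only become interesting once the transform step itself is expensive.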
You would start to consider adding another thread when
Your problem can be usefully divided into independent pieces of work (say, each thread works on a chunk of text)
The work in each piece significantly outweighs the cost of splitting the problem up, adding an additional thread, and coordinating between the threads (which is small, but not free!)
The problem requires more resources than a single CPU can provide (a thread gives you access to more CPU resources, but doesn't provide much value in terms of I/O throughput)

In Java, what is the difference between using a BufferedWriter or writing straight to file? [duplicate]

The site that I am looking at says
"The BufferedWriter class is used to write text to a character-output stream, buffering characters so as to provide for the efficient writing of single characters, arrays, and strings."
What makes it more efficient and why?
BufferedWriter is more efficient because it uses a buffer rather than writing character by character, so it reduces the number of disk I/O operations. Data is collected in a buffer and written to the file when the buffer is full.
This is why sometimes no data is written to the file if you didn't call the flush method: the data is collected in the buffer, but the program exits before it is written to the file. Calling flush causes the data to be written to the file even if the buffer is not completely full.
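The buffering and flush behaviour described above can be observed with a small sketch. It uses a StringWriter as a stand-in for the file so the effect is visible in memory; the class and method names are made up.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

// Hypothetical demo: shows that data sits in the buffer until flush is called.
class FlushDemo {

    // Returns what the underlying writer has received before and after flush().
    static String[] run() throws IOException {
        StringWriter sink = new StringWriter(); // stand-in for a file
        BufferedWriter writer = new BufferedWriter(sink);
        writer.write("hello");
        String beforeFlush = sink.toString(); // still empty: "hello" is only in the buffer
        writer.flush();
        String afterFlush = sink.toString();  // now "hello"
        return new String[] { beforeFlush, afterFlush };
    }
}
```

Closing the writer also flushes it, which is why using try-with-resources (or calling close in a finally block) avoids the "empty file" surprise.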
Writing becomes expensive when you write character by character to the file. Buffers are provided to reduce that cost. If you write to a buffer, it waits until some limit is reached and then writes the whole lot to the disk.
A BufferedWriter waits until the buffer (8192 characters by default) is full and writes the whole buffer in one disk operation. Unbuffered, each single write would result in a disk I/O, which is obviously more expensive.
Hard disks have a minimum unit of information storage, so, for example, if you are writing a single byte, the operating system asks the disk to store a whole unit of storage (I think the minimum is 512 bytes). So you ask to write one byte and the operating system writes much more. If you ask to store 512 bytes with 512 calls, you end up doing a lot more I/O (512 disk operations) than buffering 512 bytes and issuing only one call (1 disk operation).
As the name suggests, BufferedWriter uses a buffer to reduce the cost of writes. If you are writing to a file, you might know that writing 1 byte or writing 4 KB costs roughly the same. The time required to perform such a write is dominated by the access time (~8 ms), which is the time the disk needs to rotate and seek to the right sector.
Additionally, aggregating small writes into a bigger one reduces the overhead on the operating system, giving better performance.
Most operating systems do have an internal buffer to cache writes. However, these caches try to figure out what the application is doing by analyzing its write patterns. If the application itself does that caching and performs a write only when the data is ready, the result (in terms of performance) is better.
