Questions related to writing your own file downloader using multiple threads in Java

In my current company, I am doing a PoC on how we can write a file downloader utility. We have to use socket programming (TCP/IP) for downloading the files. One of the client's requirements is that a file (which will be large in size) should be transferred in chunks; for example, if we have a file of 5 MB size then we can have 5 threads which transfer 1 MB each. I have written a small application which downloads a file. You can download the Eclipse project
from http://www.fileflyer.com/view/QM1JSC0
A brief explanation of my classes
FileSender.java: This class provides the bytes of the file. It has a method called
sendBytesOfFile(long start, long end, long sequenceNo) which returns the requested range of bytes.
import java.io.File;
import java.io.IOException;
import java.util.zip.CRC32;
import org.apache.commons.io.FileUtils;
public class FileSender {
private static final String FILE_NAME = "C:\\shared\\test.pdf";
public ByteArrayWrapper sendBytesOfFile(long start,long end, long sequenceNo){
try {
File file = new File(FILE_NAME);
byte[] fileBytes = FileUtils.readFileToByteArray(file);
System.out.println("Size of file is " +fileBytes.length);
System.out.println();
System.out.println("Start "+start +" end "+end);
byte[] bytes = getByteArray(fileBytes, start, end);
ByteArrayWrapper wrapper = new ByteArrayWrapper(bytes, sequenceNo);
return wrapper;
} catch (IOException e) {
throw new RuntimeException(e);
}
}
private byte[] getByteArray(byte[] bytes, long start, long end){
long arrayLength = end-start;
System.out.println("Start : "+start +" end : "+end + " Arraylength : "+arrayLength +" length of source array : "+bytes.length);
byte[] arr = new byte[(int)arrayLength];
for(int i = (int)start, j =0; i < end;i++,j++){
arr[j] = bytes[i];
}
return arr;
}
public static long fileSize(){
File file = new File(FILE_NAME);
return file.length();
}
}
FileReceiver.java - This class receives the file.
A short explanation of what this class does:
This class finds the size of the file to be fetched from the sender.
Depending on the size of the file, it computes the start and end positions between which the bytes need to be read.
It starts n threads, giving each thread a start, an end, a sequence number and a list which all the threads share.
Each thread reads its range of bytes and creates a ByteArrayWrapper.
The ByteArrayWrapper objects are added to the shared list.
Then a while loop makes sure that all threads have done their work.
Finally it sorts the list based on the sequence number.
Then the bytes are joined, and a complete byte array is formed which is written to a file.
Code of File Receiver
package com.filedownloader;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;
import org.apache.commons.io.FileUtils;
public class FileReceiver {
public static void main(String[] args) {
FileReceiver receiver = new FileReceiver();
receiver.receiveFile();
}
public void receiveFile(){
long startTime = System.currentTimeMillis();
long numberOfThreads = 10;
long filesize = FileSender.fileSize();
System.out.println("File size received "+filesize);
long start = filesize/numberOfThreads;
List<ByteArrayWrapper> list = new ArrayList<ByteArrayWrapper>();
for(long threadCount =0; threadCount<numberOfThreads ;threadCount++){
FileDownloaderTask task = new FileDownloaderTask(threadCount*start,(threadCount+1)*start,threadCount,list);
new Thread(task).start();
}
while(list.size() != numberOfThreads){
// this is done so that all the threads should complete their work before processing further.
//System.out.println("Waiting for threads to complete. List size "+list.size());
}
if(list.size() == numberOfThreads){
System.out.println("All bytes received "+list);
Collections.sort(list, new Comparator<ByteArrayWrapper>() {
@Override
public int compare(ByteArrayWrapper o1, ByteArrayWrapper o2) {
long sequence1 = o1.getSequence();
long sequence2 = o2.getSequence();
if(sequence1 < sequence2){
return -1;
}else if(sequence1 > sequence2){
return 1;
}
else{
return 0;
}
}
});
byte[] totalBytes = list.get(0).getBytes();
byte[] firstArr = null;
byte[] secondArr = null;
for(int i = 1;i<list.size();i++){
firstArr = totalBytes;
secondArr = list.get(i).getBytes();
totalBytes = concat(firstArr, secondArr);
}
System.out.println(totalBytes.length);
convertToFile(totalBytes,"c:\\tmp\\test.pdf");
long endTime = System.currentTimeMillis();
System.out.println("Total time taken with "+numberOfThreads +" threads is "+(endTime-startTime)+" ms" );
}
}
private byte[] concat(byte[] A, byte[] B) {
byte[] C= new byte[A.length+B.length];
System.arraycopy(A, 0, C, 0, A.length);
System.arraycopy(B, 0, C, A.length, B.length);
return C;
}
private void convertToFile(byte[] totalBytes,String name) {
try {
FileUtils.writeByteArrayToFile(new File(name), totalBytes);
} catch (IOException e) {
throw new RuntimeException(e);
}
}
}
Code of ByteArrayWrapper
package com.filedownloader;
import java.io.Serializable;
public class ByteArrayWrapper implements Serializable{
private static final long serialVersionUID = 3499562855188457886L;
private byte[] bytes;
private long sequence;
public ByteArrayWrapper(byte[] bytes, long sequenceNo) {
this.bytes = bytes;
this.sequence = sequenceNo;
}
public byte[] getBytes() {
return bytes;
}
public long getSequence() {
return sequence;
}
}
Code of FileDownloaderTask
import java.util.List;
public class FileDownloaderTask implements Runnable {
private List<ByteArrayWrapper> list;
private long start;
private long end;
private long sequenceNo;
public FileDownloaderTask(long start,long end,long sequenceNo,List<ByteArrayWrapper> list) {
this.list = list;
this.start = start;
this.end = end;
this.sequenceNo = sequenceNo;
}
@Override
public void run() {
ByteArrayWrapper wrapper = new FileSender().sendBytesOfFile(start, end, sequenceNo);
list.add(wrapper);
}
}
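As an aside on the coordination above: the busy-wait loop and the shared list could be replaced with an ExecutorService and Futures. A rough sketch of that idea, reusing the class names from the code above (pool size and error handling are illustrative only):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class FileReceiverWithExecutor {
    public void receiveFile() throws InterruptedException, ExecutionException {
        int numberOfThreads = 10;
        long filesize = FileSender.fileSize();
        long chunk = filesize / numberOfThreads;
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
        List<Future<ByteArrayWrapper>> futures = new ArrayList<>();
        for (int i = 0; i < numberOfThreads; i++) {
            final long start = i * chunk;
            // the last chunk runs to the end of the file so no trailing bytes are dropped
            final long end = (i == numberOfThreads - 1) ? filesize : (i + 1) * chunk;
            final long sequence = i;
            futures.add(executor.submit(new Callable<ByteArrayWrapper>() {
                @Override
                public ByteArrayWrapper call() {
                    return new FileSender().sendBytesOfFile(start, end, sequence);
                }
            }));
        }
        executor.shutdown();
        // Future.get() blocks until each task is done, and results come back in submission
        // order, so no busy-wait and no sorting by sequence number is needed.
        for (Future<ByteArrayWrapper> future : futures) {
            ByteArrayWrapper wrapper = future.get();
            // ... append wrapper.getBytes() to the output ...
        }
    }
}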
Questions related to this code
Does file downloading become faster when multiple threads are used? In this code I am not able to see the benefit.
How should I decide how many threads to create?
Are there any open-source libraries which do this?
The file which the receiver receives is valid and not corrupted, but the checksum (I used FileUtils from commons-io) does not match. What's the problem?
This code gives an out-of-memory error when used with a large file (above 100 MB) because of the byte array which is created. How can I avoid this?
I know this is very bad code, but I had to write it in one day :-). Please suggest any other good way to do this.

There's a bunch of questions here to answer. I'm not going to go through all of the code, but I can give you some tips.
First off, what some download accelerators do is indeed to use the HTTP Range header to download parts of a file in parallel. Why does this work? TCP tries to allocate bandwidth fairly per connection. So if you're downloading a file from a server whose bandwidth is swamped, you can receive a bigger share of the bandwidth by adding more connections. The same principle applies to servers that restrict outgoing bandwidth, which is usually also applied per connection (sometimes taking the IP into consideration).
Obviously if everybody was doing this, we'd be left with a whole lot of TCP connections and their overhead, and not a lot of bandwidth to do the actual downloading, which is why even these download accelerators will only use 2-4 connections. Moreover, if you are the one writing the server, you really don't need to worry about this, as you will only be slowing yourself down (by adding more overhead).
Going out of memory: don't use a byte array; use a (buffered) InputStream (or, if you have some time, learn how to use java.nio and byte buffers) and read chunks as you are sending the file. The Java tutorials cover all the basics.
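For illustration, a rough sketch of what the sender side could look like when it streams a byte range in small chunks instead of loading the whole file (class name, path handling and chunk size are just placeholders):
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
public class ChunkStreamer {
    /** Streams bytes [start, end) of the file to the given output stream in small chunks. */
    public static void sendRange(String path, long start, long end, OutputStream out) throws IOException {
        byte[] buffer = new byte[8 * 1024]; // small, fixed-size buffer instead of the whole file
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(start);
            long remaining = end - start;
            while (remaining > 0) {
                int n = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) break; // file shorter than expected
                out.write(buffer, 0, n);
                remaining -= n;
            }
        }
        out.flush();
    }
}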

1) Another reason why multiple connections may be faster is related to TCP window size.
throughput <= window size / roundtrip time
See http://en.wikipedia.org/wiki/TCP_tuning#Window_size for details.
You won't see that much difference if you run tests on a local network, because the roundtrip time is small enough.
2) The only way to know for sure is to try. And the right number of threads will depend on the environment. If you need to download really big files, it might be worth it to first run a small calibration program that tries downloading with different numbers of threads.
3) I haven't looked there for a long time, but Azureus (now called Vuze) has a pretty complete API to download anything from torrent files to FTP... And they probably have a quite efficient implementation.
Good luck !
Edit (clarification on window size) :
What you are trying to do is maximize throughput (download files faster). There is not much you can do about roundtrip time; it depends on the network. What you can do is increase the window size. The window size is automatically adjusted (there is plenty of documentation on this, but I'm too lazy to google it) to best fit the current state of the network. Basically, a larger window means better throughput as long as there isn't congestion or packet loss.
In the best case, you will get a window size of 64 KB; at that point, unless you use some tricks (jumbo frames / window scaling) which are not supported by all routers on the internet, you get stuck at a maximum throughput of:
throughput <= 64 KB / roundtrip time
As you can't get a bigger window, you have to open multiple connections (each with its own window) to get around this limitation.
Notes :
As aioobe said, UDP isn't subject to the same limitations; this is one of the reasons why it is more efficient.
A very efficient and scalable protocol for distributing large files is BitTorrent. As long as you don't need authentication/authorization of the downloads, it might work for you. And if you do need authorization, you can always encrypt the files...

1. Does file downloading become faster when multiple threads are used? In this code I am not able to see the benefit.
No. I would be very surprised if that were the case. The CPU would never have a problem keeping up with feeding the network buffer.
2. How should I decide how many threads to create?
In my opinion, 0 extra threads.
4. The file which the receiver receives is valid and not corrupted, but the checksum (I used FileUtils from commons-io) does not match. What's the problem?
Make sure you don't accidentally rely on strings and specific encodings.
5. This code gives an out-of-memory error when used with a large file (above 100 MB), because of the byte array which is created. How can I avoid this?
The obvious solution would be to read smaller chunks of the file. Have a look at the read method of DataInputStream
http://java.sun.com/j2se/1.4.2/docs/api/java/io/DataInputStream.html#read%28byte[],%20int,%20int%29
And, finally, some general pointers in the matter: Instead of using multiple threads for this kind of thing, I strongly encourage you to have a look at the java.nio package, specifically java.nio.channels and the Selector class.
EDIT:
If you're really keen on getting it super-efficient, and have very large files, you could benefit from using UDP and handling packet order and acknowledgements yourself. TCP, for instance, guarantees that packets are received in the same order as they were sent. This is not something you need to rely on heavily here (since you could easily encode the byte offset of each datagram yourself) and thus don't need to "pay" for.

Don't read huge file chunks into memory. No wonder you're running out. Just seek to the required position in the file and start copying via a sensibly sized buffer:
int count;
byte[] buffer = new byte[8192];
// or whatever takes your fancy, but sizes > the socket send buffer size are pointless
while ((count = in.read(buffer)) > 0)
out.write(buffer, 0, count);
out.close();
in.close();
Same logic can be used at both ends - when writing the file at the receiver, use a RandomAccessFile and seek to the appropriate offset before starting this loop.
However as other respondents have noted, the client's requirement is really pretty pointless. It doesn't buy anything much except expense and risk. I would just stream the file via a single connection.
What you should do is set large socket send and receive buffers at both ends, e.g. 60 KB. The default is 8 KB on Windows, which is uselessly low.
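Putting those two points together, a rough sketch of a receiver-side chunk writer that seeks to its own offset and uses a larger socket receive buffer (class and method names are placeholders):
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.net.Socket;
public class ChunkReceiver {
    /** Reads one chunk from the socket and writes it into the output file at the chunk's offset. */
    public static void receiveRange(Socket socket, String outputPath, long offset, long length) throws IOException {
        socket.setReceiveBufferSize(60 * 1024); // larger socket buffer, as suggested above
        byte[] buffer = new byte[8192];
        try (InputStream in = socket.getInputStream();
             RandomAccessFile out = new RandomAccessFile(outputPath, "rw")) {
            out.seek(offset); // each thread writes at its own offset; no in-memory joining needed
            long remaining = length;
            int count;
            while (remaining > 0
                    && (count = in.read(buffer, 0, (int) Math.min(buffer.length, remaining))) > 0) {
                out.write(buffer, 0, count);
                remaining -= count;
            }
        }
    }
}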

Related

How to improve performance of deserializing objects from HttpsURLConnection.getInputStream()?

I have a client-server application where the server sends some binary data to the client and the client has to deserialize objects from that byte stream according to a custom binary format. The data is sent via an HTTPS connection and the client uses HttpsURLConnection.getInputStream() to read it.
I implemented a DataDeserializer that takes an InputStream and deserializes it completely. It works in a way that it performs multiple inputStream.read(buffer) calls with small buffers (usually less than 100 bytes). On my way of achieving better overall performance I also tried different implementations here. One change did improve this class' performance significantly (I'm using a ByteBuffer now to read primitive types rather than doing it manually with byte shifting), but in combination with the network stream no differences show up. See the section below for more details.
Quick summary of my issue
Deserializing from the network stream takes way too long even though I proved that the network and the deserializer themselves are fast. Are there any common performance tricks that I could try? I am already wrapping the network stream with a BufferedInputStream. Also, I tried double buffering with some success (see code below). Any solution to achieve better performance is welcome.
The performance test scenario
In my test scenario server and client are located on the same machine and the server sends ~174 MB of data. The code snippets can be found at the end of this post. All numbers you see here are averages of 5 test runs.
First I wanted to know, how fast that InputStream of the HttpsURLConnection can be read. Wrapped into a BufferedInputStream, it took 26.250s to write the entire data into a ByteArrayOutputStream.1
Then I tested the performance of my deserializer passing it all that 174 MB as a ByteArrayInputStream. Before I improved the deserializer's implementation, it took 38.151s. After the improvement it took only 23.466s.2
So this is going to be it, I thought... but no.
What I actually want to do, somehow, is passing connection.getInputStream() to the deserializer. And here comes the strange thing: Before the deserializer improvement deserializing took 61.413s and after improving it was 60.100s!3
How can that happen? Almost no improvement here despite the deserializer improved significantly. Also, unrelated to that improvement, I was surprised that this takes longer than the separate performances summed up (60.100 > 26.250 + 23.466). Why? Don't get me wrong, I didn't expect this to be the best solution, but I didn't expect it to be that bad either.
So, three things to notice:
The overall speed is bound by the network which takes at least 26.250s. Maybe there are some http-settings that I could tweak or I could further optimize the server, but for now this is likely not what I should focus on.
My deserializer implementation is very likely still not perfect, but on its own it is faster than the network, so I don't think there is need to further improve it.
Based on 1. and 2. I'm assuming that it should be somehow possible to do the entire job in a combined way (reading from the network + deserializing) which should take not much more than 26.250s. Any suggestions on how to achieve this are welcome.
I was looking for some kind of double buffer allowing two threads to read from it and write to it in parallel.
Is there something like that in standard Java? Preferably some class inheriting from InputStream that allows writing to it in parallel? If there is something similar but not inheriting from InputStream, I may be able to change my DataDeserializer to consume from that one as well.
As I haven't found any such DoubleBufferInputStream, I implemented it myself.
The code is quite long and likely not perfect and I don't want to bother you to do a code review for me. It has two 16kB buffers. Using it I was able to improve the overall performance to 39.885s.4
That is much better than 60.100s but still much worse than 26.250s. Choosing different buffer sizes didn't change much. So, I hope someone can lead me to some good double buffer implementation.
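For what it's worth, the closest thing to such a double buffer in the standard library is probably the PipedInputStream/PipedOutputStream pair, where one thread writes into the pipe while another reads from it. A rough sketch under that assumption (buffer sizes and the stream source are illustrative):
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
public class PipedExample {
    public static InputStream pump(InputStream network) throws IOException {
        PipedOutputStream sink = new PipedOutputStream();
        PipedInputStream source = new PipedInputStream(sink, 256 * 1024); // internal ring buffer
        Thread reader = new Thread(() -> {
            byte[] buffer = new byte[16 * 1024];
            try (InputStream in = new BufferedInputStream(network)) {
                int count;
                while ((count = in.read(buffer)) >= 0) {
                    sink.write(buffer, 0, count); // blocks when the pipe's buffer is full
                }
            } catch (IOException ignored) {
            } finally {
                try { sink.close(); } catch (IOException ignored) { } // signals end-of-stream to the consumer
            }
        });
        reader.start();
        return source; // hand this to the deserializer
    }
}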
The test code
1 (26.250s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
int count = 0;
long start = System.nanoTime();
while ((count = inputStream.read(buffer)) >= 0) {
outputStream.write(buffer, 0, count);
}
long end = System.nanoTime();
2 (23.466s)
InputStream inputStream = new ByteArrayInputStream(entire174MBbuffer);
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
3 (60.100s)
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
DataDeserializer deserializer = new DataDeserializer(inputStream);
long start = System.nanoTime();
deserializer.Deserialize();
long end = System.nanoTime();
4 (39.885s)
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
@Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
byte[] buffer = new byte[16 * 1024];
int count = 0;
while ((count = inputStream.read(buffer)) >= 0) {
doubleBufferInputStream.write(buffer, 0, count);
}
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting(); // read() may return -1 now
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
long start = System.nanoTime();
deserializer.deserialize();
long end = System.nanoTime();
Update
As requested, here is the core of my deserializer. I think the most important method is prepareForRead() which performs the actual reading of the stream.
class DataDeserializer {
private InputStream _stream;
private ByteBuffer _buffer;
public DataDeserializer(InputStream stream) {
_stream = stream;
_buffer = ByteBuffer.allocate(256 * 1024);
_buffer.order(ByteOrder.LITTLE_ENDIAN);
_buffer.flip();
}
private int readInt() throws IOException {
prepareForRead(4);
return _buffer.getInt();
}
private long readLong() throws IOException {
prepareForRead(8);
return _buffer.getLong();
}
private CustomObject readCustomObject() throws IOException {
prepareForRead(/*size of CustomObject*/);
int customMember1 = _buffer.getInt();
long customMember2 = _buffer.getLong();
// ...
return new CustomObject(customMember1, customMember2, ...);
}
// several other built-in and custom object read methods
private void prepareForRead(int count) throws IOException {
while (_buffer.remaining() < count) {
if (_buffer.capacity() - _buffer.limit() < count) {
_buffer.compact();
_buffer.flip();
}
int read = _stream.read(_buffer.array(), _buffer.limit(), _buffer.capacity() - _buffer.limit());
if (read < 0)
throw new EOFException("Unexpected end of stream.");
_buffer.limit(_buffer.limit() + read);
}
}
public HugeCustomObject Deserialize() throws IOException {
while (...) {
// call several of the above methods
}
return new HugeCustomObject(/* deserialized members */);
}
}
Update 2
I modified my code snippet #1 a little bit to see more precisely where time is being spent:
InputStream inputStream = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = inputStream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
outputStream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
This tells me that reading from the network stream takes 25.756s while writing to the ByteArrayOutputStream only takes 0.817s. This makes sense as these two numbers almost perfectly sum up to the previously measured 26.250s (plus some additional measuring overhead).
In the very same way I modified code snippet #4:
MyDoubleBufferInputStream doubleBufferInputStream = new MyDoubleBufferInputStream();
new Thread(new Runnable() {
@Override
public void run() {
try (InputStream inputStream = new BufferedInputStream(httpChannelOutputStream.getConnection().getInputStream(), 256 * 1024)) {
byte[] buffer = new byte[16 * 1024];
long read = 0;
long write = 0;
while (true) {
long t1 = System.nanoTime();
int count = inputStream.read(buffer);
long t2 = System.nanoTime();
read += t2 - t1;
if (count < 0)
break;
t1 = System.nanoTime();
doubleBufferInputStream.write(buffer, 0, count);
t2 = System.nanoTime();
write += t2 - t1;
}
System.out.println(read + " " + write);
} catch (IOException e) {
} finally {
doubleBufferInputStream.closeWriting();
}
}
}).start();
DataDeserializer deserializer = new DataDeserializer(doubleBufferInputStream);
deserializer.deserialize();
Now I would expect that the measured reading time is exactly the same as in the previous example. But instead, the read variable holds a value of 39.294s (How is that possible?? It's the exact same code being measured as in the previous example with 25.756s!)* while writing to my double buffer only takes 0.096s. Again, these numbers almost perfectly sum up to the measured time of code snippet #4.
Additionally, I profiled this very same code using Java VisualVM. That tells me that 40s were spent in this thread's run() method, and 100% of these 40s are CPU time. On the other hand, it also spends 40s inside the deserializer, but here only 26s are CPU time and 14s are spent waiting. This perfectly matches the time of reading from the network into the ByteArrayOutputStream. So I guess I have to improve my double buffer's "buffer-switching algorithm".
*) Is there any explanation for this strange observation? I could only imagine that this way of measuring is very inaccurate. However, the read- and write-times of the latest measurements perfectly sum up to the original measurement, so it cannot be that inaccurate... Could someone please shed some light on this?
I was not able to find these read and write performances in the profiler... I will try to find some settings that allow me to observe the profiling results for these two methods.
Apparently, my "mistake" was to use a 32-bit JVM (jre1.8.0_172, to be precise).
Running the very same code snippets on a 64-bit JVM, and tadaa... it is fast and everything makes sense there.
In particular see these new numbers for the corresponding code snippets:
snippet #1: 4.667s (vs. 26.250s)
snippet #2: 11.568s (vs. 23.466s)
snippet #3: 17.185s (vs. 60.100s)
snippet #4: 12.336s (vs. 39.885s)
So apparently, the answers given to Does Java 64 bit perform better than the 32-bit version? are simply not true anymore. Or, there is a serious bug in this particular 32-bit JRE version. I didn't test any others yet.
As you can see, #4 is only slightly slower than #2 which perfectly matches my original assumption that
Based on 1. and 2. I'm assuming that it should be somehow possible to
do the entire job in a combined way (reading from the network +
deserializing) which should take not much more than 26.250s.
Also the very weird results of my profiling approach described in Update 2 of my question do not occur anymore. I didn't repeat every single test in 64 bit yet, but all profiling results that I did do are plausible now, i.e. the same code takes the same time no matter in which code snippet. So maybe it's really a bug, or does anybody have a reasonable explanation?
The most certain way to improve any of these is to change
connection.getInputStream()
to
new BufferedInputStream(connection.getInputStream())
If that doesn't help, the input stream isn't your problem.

Read first N bytes of a url in scala? [duplicate]

For the life of me, I haven't been able to find a question that matches what I'm trying to do, so I'll explain what my use-case is here. If you know of a topic that already covers the answer to this, please feel free to direct me to that one. :)
I have a piece of code that uploads a file to Amazon S3 periodically (every 20 seconds). The file is a log file being written by another process, so this function is effectively a means of tailing the log so that someone can read its contents in semi-real-time without having to have direct access to the machine that the log resides on.
Up until recently, I've simply been using the S3 PutObject method (using a File as input) to do this upload. But in AWS SDK 1.9, this no longer works because the S3 client rejects the request if the content size actually uploaded is greater than the content-length that was promised at the start of the upload. This method reads the size of the file before it starts streaming the data, so given the nature of this application, the file is very likely to have increased in size between that point and the end of the stream. This means that I need to now ensure I only send N bytes of data regardless of how big the file is.
I don't have any need to interpret the bytes in the file in any way, so I'm not concerned about encoding. I can transfer it byte-for-byte. Basically, what I want is a simple method where I can read the file up to the Nth byte, then have it terminate the read even if there's more data in the file past that point. (In other words, insert EOF into the stream at a specific point.)
For example, if my file is 10000 bytes long when I start the upload, but grows to 12000 bytes during the upload, I want to stop uploading at 10000 bytes regardless of that size change. (On a subsequent upload, I would then upload the 12000 bytes or more.)
I haven't found a pre-made way to do this - the best I've found so far appears to be IOUtils.copyLarge(InputStream, OutputStream, offset, length), which can be told to copy a maximum of "length" bytes to the provided OutputStream. However, copyLarge is a blocking method, as is PutObject (which presumably calls a form of read() on its InputStream), so it seems that I couldn't get that to work at all.
I haven't found any methods or pre-built streams that can do this, so it's making me think I'd need to write my own implementation that directly monitors how many bytes have been read. That would probably then work like a BufferedInputStream where the number of bytes read per batch is the lesser of the buffer size or the remaining bytes to be read. (eg. with a buffer size of 3000 bytes, I'd do three batches at 3000 bytes each, followed by a batch with 1000 bytes + EOF.)
Does anyone know a better way to do this? Thanks.
EDIT Just to clarify, I'm already aware of a couple alternatives, neither of which are ideal:
(1) I could lock the file while uploading it. Doing this would cause loss of data or operational problems in the process that's writing the file.
(2) I could create a local copy of the file before uploading it. This could be very inefficient and take up a lot of unnecessary disk space (this file can grow into the several-gigabyte range, and the machine it's running on may be that short of disk space).
EDIT 2: My final solution, based on a suggestion from a coworker, looks like this:
private void uploadLogFile(final File logFile) {
if (logFile.exists()) {
long byteLength = logFile.length();
try (
FileInputStream fileStream = new FileInputStream(logFile);
InputStream limitStream = ByteStreams.limit(fileStream, byteLength);
) {
ObjectMetadata md = new ObjectMetadata();
md.setContentLength(byteLength);
// Set other metadata as appropriate.
PutObjectRequest req = new PutObjectRequest(bucket, key, limitStream, md);
s3Client.putObject(req);
} // plus exception handling
}
}
LimitInputStream was what my coworker suggested, apparently not aware that it had been deprecated. ByteStreams.limit is the current Guava replacement, and it does what I want. Thanks, everyone.
Complete answer rip & replace:
It is relatively straightforward to wrap an InputStream so as to cap the number of bytes it will deliver before signaling end-of-data. FilterInputStream is targeted at this general kind of job, but since you have to override pretty much every method for this particular job anyway, it just gets in the way.
Here's a rough cut at a solution:
import java.io.IOException;
import java.io.InputStream;
/**
 * An {@code InputStream} wrapper that provides up to a maximum number of
 * bytes from the underlying stream. Does not support mark/reset, even
 * when the wrapped stream does, and does not perform any buffering.
 */
public class BoundedInputStream extends InputStream {
    /** This stream's underlying {@code InputStream} */
    private final InputStream data;
    /** The maximum number of bytes still available from this stream */
    private long bytesRemaining;
    /**
     * Initializes a new {@code BoundedInputStream} with the specified
     * underlying stream and byte limit
     * @param data the {@code InputStream} serving as the source of this
     *        one's data
     * @param maxBytes the maximum number of bytes this stream will deliver
     *        before signaling end-of-data
     */
    public BoundedInputStream(InputStream data, long maxBytes) {
        this.data = data;
        bytesRemaining = Math.max(maxBytes, 0);
    }
    @Override
    public int available() throws IOException {
        return (int) Math.min(data.available(), bytesRemaining);
    }
    @Override
    public void close() throws IOException {
        data.close();
    }
    @Override
    public synchronized void mark(int limit) {
        // does nothing
    }
    @Override
    public boolean markSupported() {
        return false;
    }
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (bytesRemaining > 0) {
            int nRead = data.read(
                    buf, off, (int) Math.min(len, bytesRemaining));
            if (nRead > 0) {
                // only count bytes actually read; -1 signals end of the underlying stream
                bytesRemaining -= nRead;
            }
            return nRead;
        } else {
            return -1;
        }
    }
    @Override
    public int read(byte[] buf) throws IOException {
        return this.read(buf, 0, buf.length);
    }
    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("reset() not supported");
    }
    @Override
    public long skip(long n) throws IOException {
        long skipped = data.skip(Math.min(n, bytesRemaining));
        bytesRemaining -= skipped;
        return skipped;
    }
    @Override
    public int read() throws IOException {
        if (bytesRemaining > 0) {
            int c = data.read();
            if (c >= 0) {
                bytesRemaining -= 1;
            }
            return c;
        } else {
            return -1;
        }
    }
}
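For completeness, a small usage sketch of the class above (file name and byte limit are illustrative); the wrapped stream is what you would hand to PutObjectRequest instead of the raw FileInputStream:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
public class BoundedInputStreamDemo {
    public static void main(String[] args) throws IOException {
        long byteLength = 10000; // snapshot of the file length taken before the upload starts
        try (InputStream limited = new BoundedInputStream(new FileInputStream("app.log"), byteLength)) {
            // 'limited' reports end-of-data after byteLength bytes even if the file keeps growing
            byte[] buffer = new byte[4096];
            int n;
            while ((n = limited.read(buffer)) != -1) {
                // ... stream the bytes to the upload ...
            }
        }
    }
}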

Limit the content available from a Java NIO Channel (File or Socket)

I'm pretty new to NIO and wanted to implement some feature with it, instead of typical Streams (which can do all sort of things).
What I'm not sure I can get is reading from a file into a buffer and limiting the content that I will transfer. Let's say from position 100 to 200 (even if file length is 1000). It also would be nice to do on network sockets.
I know that NIO keeps things basic to leverage OS capabilities that's why I'm not sure it can be done.
I was thinking that a tricky way to do it would be a 'LimitedReadChannel' that, when it should return less than the available buffer size, uses another byte buffer and then transfers into the original one (1). But that seems more tricky than necessary. I also don't want to use anything related to streams because it would defeat the purpose of using NIO.
(1) So far....
LimitedChannel.read(buffer) {
if (buffer.available?? > contentLeft) {
delegateChannel.read(smallerBuffer);
// transfer from smallerBuffer to buffer
} else {
delegateChannel.read(buffer);
}
}
I've found that buffers allow asking for the current limit and setting a new one. So that wrapper channel (the one that limits the effective number of bytes read) could modify the buffer limit to avoid reading more...
Something like:
// LimitedChannel.java
// private int bytesLeft; // remaining amount of bytes to read
public int read(ByteBuffer buffer) throws IOException {
    if (bytesLeft <= 0) {
        return -1;
    }
    int oldLimit = buffer.limit();
    if (bytesLeft < buffer.remaining()) {
        // ensure I'm not reading more than allowed
        buffer.limit(buffer.position() + bytesLeft);
    }
    int bytesRead = delegateChannel.read(buffer);
    if (bytesRead > 0) {
        // only count bytes actually read; -1 means the delegate hit end-of-stream
        bytesLeft -= bytesRead;
    }
    buffer.limit(oldLimit);
    return bytesRead;
}
Anyway not sure if this already exists somewhere. It's difficult to find documentation about this use case...
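To illustrate the intended use, a small sketch that wraps a FileChannel positioned at offset 100 and caps the read at 100 bytes; note that a LimitedChannel(delegate, maxBytes) constructor is assumed here, since the snippet above only shows the read method:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
public class LimitedChannelDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel fc = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            fc.position(100);                                    // start of the region
            LimitedChannel limited = new LimitedChannel(fc, 100); // assumed constructor: delegate + byte limit
            ByteBuffer buffer = ByteBuffer.allocate(64);
            while (limited.read(buffer) != -1) {
                buffer.flip();
                // ... consume buffer contents (positions 100..199 of the file) ...
                buffer.clear();
            }
        }
    }
}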

how to divide huge text file into small in java [duplicate]

I need advice from someone who knows Java very well and knows about memory issues.
I have a large file (something like 1.5 GB) and I need to cut this file into many smaller files (100 small files, for example).
I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding memory, or tips on how to do it faster.
My file contains text, it is not binary, and I have about 20 characters per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest. It will at most cost a few % in performance. The same applies to using NIO. It will only improve scalability, not memory use. It will only become interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to the other file as you read it in, count the lines, and when the count reaches 100, switch to the next file, et cetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
int count = 0;
for (String line; (line = reader.readLine()) != null;) {
if (count++ % maxlines == 0) {
close(writer);
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
}
writer.write(line);
writer.newLine();
}
} finally {
close(writer);
close(reader);
}
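The close(...) calls above refer to a helper that isn't shown in the snippet; presumably a small null-safe method along these lines (requires java.io.Closeable and java.io.IOException):
// Presumably implied by the example above, not part of the original answer:
private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException e) {
            // log or ignore; nothing sensible to do at this point
        }
    }
}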
First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
You can consider using memory-mapped files, via FileChannels.
Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
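As an illustration of the FileChannel route, a rough sketch that copies a byte range of a big file into a smaller file with transferTo; note that this splits on byte boundaries, not line boundaries, and the paths and sizes are placeholders:
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
public class ChannelSplit {
    /** Copies count bytes starting at offset from the source file into a new destination file. */
    public static void copyRange(String src, String dst, long offset, long count) throws IOException {
        try (FileChannel in = FileChannel.open(Paths.get(src), StandardOpenOption.READ);
             FileChannel out = FileChannel.open(Paths.get(dst),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long position = offset;
            long remaining = count;
            while (remaining > 0) {
                // transferTo may copy fewer bytes than requested, so loop until done
                long transferred = in.transferTo(position, remaining, out);
                if (transferred <= 0) break; // reached end of source
                position += transferred;
                remaining -= transferred;
            }
        }
    }
}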
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command via your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
Yes.
I also think that using read() with arguments like read(char[], int offset, int length) is a better way to read such a large file
(e.g. read(buffer, 0, buffer.length)).
I also experienced the problem of missing values when using a BufferedReader instead of a BufferedInputStream for a binary data input stream, so using a BufferedInputStream is much better in a case like this.
package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
* @author Naresh Bhabat
*
Following implementation helps to deal with extra large files in java.
This program is tested for dealing with 2GB input file.
There are some points where extra logic can be added in future.
Please note: if we want to deal with a binary input file, then instead of reading lines we need to read bytes from the file object.
It uses random access file,which is almost like streaming API.
* ****************************************
Notes regarding executor framework and its readings.
Please note :ExecutorService executor = Executors.newFixedThreadPool(10);
* for 10 threads:Total time required for reading and writing the text in
* :seconds 349.317
*
* For 100:Total time required for reading the text and writing : seconds 464.042
*
* For 1000 : Total time required for reading and writing text :466.538
* For 10000 Total time required for reading and writing in seconds 479.701
*
*
*/
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
* @throws IOException
* Something asynchronously serious
*/
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
@Override
public void run() {
// do hard processing here in this thread,i have consumed
// some time and ignore some exception in write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
* @param filePath
* @param data
* @param position
*/
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceeed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}
Don't use read() without arguments; it's very slow.
Better to read into a buffer and move it to the file quickly.
Use BufferedInputStream because it supports binary reading.
And that's all.
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines and writing it to 100 different files, one line in each, and make the triggering mechanism work on the number of lines written to the current file. That program will be easily scalable to your situation.

Writing partial region of downloaded file

I am downloading a file from the internet in separate parts, say 3 regions. Let's say I have to download a file of size 1024 kB and I have set the regions as 0-340 kB, 341-680 kB and 681-1024 kB. I have a separate thread for each section. But the problem I have now is writing the downloaded content into a single file.
Since we have 3 threads, each will download its section, and the sections need to be written into the file sequentially.
How can I achieve this? I thought of having 3 temporary files and writing into them. Once all the files are written, I would have to read them file by file and write them into a single file. That feels like unnecessary overhead. Is there a better way?
Thanks in advance.
To be clear, I am not convinced that this approach will actually improve the download speed. It may give more consistent download speeds if you are downloading the same file from multiple mirrors, though.
First off, if your file isn't too large, you can buffer all of it before you write it out. So allocate a buffer that all your threads can access:
byte[] buf = new byte[fileSize];
Now you create a suitable Thread type:
public class WriterThread extends Thread
{
byte[] buf;
int write_pos, write_remaining;
public WriterThread(byte[] buf, int start, int len)
{
this.buf = buf;
this.write_pos = start;
this.write_remaining = len;
}
@Override
public void run()
{
// yourMethodForSettingUpTheSocketConnection() and error(...) are placeholders for the
// answer's connection setup and error handling
try (Socket s = yourMethodForSettingUpTheSocketConnection();
InputStream istream = s.getInputStream()) {
while (this.write_remaining > 0) {
int read = istream.read(this.buf, this.write_pos, this.write_remaining);
if (read == -1) error("Not enough data received");
this.write_remaining -= read;
this.write_pos += read;
}
// otherwise you are done reading your chunk!
} catch (IOException e) {
// run() cannot throw checked exceptions, so handle the IOException here
error("I/O error while reading chunk: " + e.getMessage());
}
}
}
Now you can start as many of these WriterThread objects with suitable starts and lengths. For example, for a file that is 6000 bytes in size:
byte[] buf = new byte[6000];
WriterThread t0 = new WriterThread(buf, 0, 3000);
WriterThread t1 = new WriterThread(buf, 3000, 3000);
t0.start();
t1.start();
t0.join();
t1.join();
// check for errors
Note the important bit here: each of the WriterThreads has a reference to exactly the same buffer, just a different offset that it starts writing at. Of course you have to make sure that yourMethodForSettingUpTheSocketConnection requests data starting at offset this.write_pos; how you do that depends on the networking protocol that you use and is beyond what you asked about.
If your file is too big to fit into memory, this approach won't work. Instead, you'll have to use the (slower) method of first creating a large file and then writing to that. While I haven't tried that, you should be able to use java.nio.file.Files.newByteChannel() to set up a suitable SeekableByteChannel as your output file. If you create such a SeekableByteChannel sbc, you should then be able to do
sbc.position(fileSize - 1); // move to the specified position in the file
sbc.write(java.nio.ByteBuffer.allocate(1)); // grow the file to the expected final size
and then use one distinct SeekableByteChannel object per thread, pointing to the same file on disk, and setting the write start location using the SeekableByteChannel.position(long) method. You'll need a temporary byte[] around which you can wrap a ByteBuffer (via ByteBuffer.wrap()), but otherwise the strategy is analogous to the above:
thread_local_sbc.position(this.write_pos);
and then every thread_local_sbc.write() will write to the file starting at this.write_pos.
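Putting that together, a rough sketch of what each thread's region writer could look like with Files.newByteChannel (path and option choices are illustrative):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
public class RegionWriter {
    /** Writes one downloaded region into the shared output file at that region's offset. */
    public static void writeRegion(String path, long offset, byte[] regionData) throws IOException {
        try (SeekableByteChannel sbc = Files.newByteChannel(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            sbc.position(offset);             // each thread seeks to its own region start
            ByteBuffer buffer = ByteBuffer.wrap(regionData);
            while (buffer.hasRemaining()) {
                sbc.write(buffer);            // write() may need several calls to drain the buffer
            }
        }
    }
}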
