I am very new to the Hadoop file system.
I have two different Hadoop file systems (one is the client and the other is the server); they are in different domains and have no direct access to one another.
I want to copy files (several GB) from the server to the client.
Since I don't have direct access to the server from the client, I used the following approach to copy the files.
I wrote a server-side Java program which reads the file using the server configuration and writes the bytes to stdout.
System.out.write(buf.array(), 0, length);
System.out.flush();
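In full, the server-side read loop looks roughly like this (a minimal sketch; the path and variable names are placeholders and error handling is omitted):
// Sketch of the server-side reader: open the file on the server HDFS and copy its
// bytes to stdout (uses org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.*,
// and java.nio.ByteBuffer).
Configuration conf = new Configuration();              // server-side Hadoop configuration
FileSystem fs = FileSystem.get(conf);
FSDataInputStream in = fs.open(new Path(server_file)); // server_file is the requested path
ByteBuffer buf = ByteBuffer.allocate(1024 * 1024);
int length;
while ((length = in.read(buf.array())) > 0) {
    System.out.write(buf.array(), 0, length);
    System.out.flush();
}
in.close();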
Then I wrote a CGI script which calls this server jar.
Then I wrote a client Java program which calls the above CGI script to read the data:
FSDataOutputStream dataOut = fs.create(client_file, true, bufSize, replica, blockSize);
URL url = new URL("http://xxx.company.com/cgi/my_cgi_script?" + "file=" + server_file);
InputStream is = url.openStream();
byte[] byteChunk = new byte[1024 * 1024];
int n = 0;
while ((n = is.read(byteChunk)) > 0) {
    dataOut.write(byteChunk, 0, n);
    received += n;
}
dataOut.close();
Copying the file now completes without any issue, and I see the same file size on the server and the client.
However, when I get the FileChecksum for the same file on the client and server file systems, I get different values:
MD5-of-262144MD5-of-512CRC32C:86094f4043b9592a49ec7f6ef157e0fe
MD5-of-262144MD5-of-512CRC32C:a83a0b3f182db066da7520b36c79e696
Can you please help me to fix this issue?
Note: I am using the same blockSize on the client and server file systems.
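For reference, the checksum on each side is obtained via getFileChecksum, roughly like this (a sketch; fs and filePath stand for the respective file system and file on each cluster):
// Run once against each cluster with its own FileSystem instance and path.
FileChecksum checksum = fs.getFileChecksum(new Path(filePath));
System.out.println(checksum);  // prints e.g. MD5-of-262144MD5-of-512CRC32C:...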
We have an environment where each user may get different HTML/JS/CSS resources. I'm using the following code to compress and transfer a JavaScript resource:
public static byte[] compress(String str) throws IOException {
    if (str == null || str.length() == 0) {
        return null;
    }
    ByteArrayOutputStream obj = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(obj);
    gzip.write(str.getBytes("UTF-8"));
    gzip.close();
    return obj.toByteArray();
}
...
HttpServletResponse raw = response.raw();
raw.setBufferSize(file.length().intValue());
ServletOutputStream servletOutputStream = raw.getOutputStream();
servletOutputStream.write(compress(FileUtils.readFileToString(file)));
servletOutputStream.flush();
servletOutputStream.close();
...
Inspecting the problem in the Chrome Network tab, the download time is 2 seconds for 300 KB of compressed data, which seems unreasonable.
The problem is not the bandwidth or Jetty itself, because static resources transfer quickly.
Don't know if that is the source of your bottleneck, but I wouldn't do:
raw.setBufferSize(file.length().intValue());
If the gzipped payload is around 300 KB, then file.length() (the uncompressed size) could give you a response buffer an order of magnitude larger than that. And you don't need a big response buffer at all when you are streaming static content.
From the servlet javadoc:
Sets the preferred buffer size for the body of the response. The
servlet container will use a buffer at least as large as the size
requested. The actual buffer size used can be found using
getBufferSize.
A larger buffer allows more content to be written before anything is
actually sent, thus providing the servlet with more time to set
appropriate status codes and headers. A smaller buffer decreases
server memory load and allows the client to start receiving data more
quickly.
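So I would drop the setBufferSize() call and simply stream the compressed bytes, something like this (a sketch based on your snippet; the Content-Encoding header may already be set elsewhere in your code, it is shown here only for completeness):
HttpServletResponse raw = response.raw();
byte[] gzipped = compress(FileUtils.readFileToString(file));
raw.setHeader("Content-Encoding", "gzip");  // client must know the body is gzip-compressed
raw.setContentLength(gzipped.length);       // compressed size, not file.length()
ServletOutputStream servletOutputStream = raw.getOutputStream();
servletOutputStream.write(gzipped);         // no setBufferSize() call
servletOutputStream.flush();
servletOutputStream.close();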
I'm using Apache Commons Net 3.3 to handle FTP transfers in a Java application.
Downloads seem to work fine, but for uploads I'm getting speeds a lot slower than my internet connection is capable of.
The code that writes the file data to the stream looks like this:
BufferedOutputStream out = new BufferedOutputStream(ftp.getOutputStream(prt));
BufferedInputStream in = new BufferedInputStream(prov.getInputStream(s));
byte[] buff = new byte[BUFF_SIZE];
int len;
while ((len = in.read(buff)) >= 0 && !prog.isCanceled()) {
    out.write(buff, 0, len);
    total += len;
    prog.setProgress((int) (Math.round((total / combo) * 100)));
}
in.close();
out.close();
BUFF_SIZE is 16 KB.
I have also set the FTPClient buffer size to 16 KB via setBufferSize().
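For reference, the client is configured roughly like this before the transfer starts (host, credentials and variable names are placeholders):
// Apache Commons Net FTPClient setup as described above.
FTPClient ftp = new FTPClient();
ftp.setBufferSize(16 * 1024);            // 16 KB, matching BUFF_SIZE
ftp.connect("ftp.example.com");          // placeholder host
ftp.login("user", "password");           // placeholder credentials
ftp.setFileType(FTP.BINARY_FILE_TYPE);   // binary mode for file data
ftp.enterLocalPassiveMode();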
The issue isn't with the server or my internet connection, because the upload proceeds at a much more reasonable speed when using FileZilla as the FTP client.
The issue also seems to occur with Java 6 and 7 JVMs.
Does anyone have any ideas as to why this is happening? Is there a problem with Commons Net or Java? Or is there something I haven't configured correctly?
Same problem here - using SDK 1.6 resolved it, but I'm also trying to find a better way.
Update: Solved (see comments)
I have established the following code, which seems to be working well:
void pipe(InputStream is, OutputStream os) throws IOException {
    try {
        try {
            byte[] buf = new byte[1024 * 16];
            int len, available = is.available();
            while ((len = is.read(buf, 0, available > 0 ? available : 1)) != -1) {
                os.write(buf, 0, len);
                available = is.available();
                if (available <= 0)
                    os.flush();
            }
        } finally {
            try {
                os.flush();
            } finally {
                os.close();
            }
        }
    } finally {
        is.close();
    }
}
In the past, I found that if I call is.read(buf), then, even if data was available, it would block waiting for more data until the buffer was full. This was an echo server for TCP data, so my requirement was for there to be an immediate flush as soon as new data arrived.
My first solution was the inefficient one-at-a-time is.read(). Later, when that was not good enough, I looked at the available methods and found is.available(). The API states:
A single read or skip of this many bytes will not block.
So I have a pretty good solution now, but the one thing that looks bad to me is how I am handling cases where is.available() == 0. In this case, I simply read a single byte as a way to wait until new data is available.
What would be the recommended way to transfer data from an InputStream to an OutputStream, with immediate flush as data arrives? Is the above code really the right way, or should I change it, or use some brand new code? Perhaps I should be using some of the newer async routines, or maybe there is a built-in Java method for this?
In the past, I found that if I call is.read(buf), then, even if data was available, it would block waiting for more data until the buffer was full.
No you didn't. TCP doesn't work that way; sockets don't work that way; and Java sockets don't work that way. You are mistaken.
It's a lot simpler than you think:
while ((count = in.read(buffer)) > 0)
{
    out.write(buffer, 0, count);
}
out.close();
in.close();
There is no buffering on socket input/output streams so this will write everything that's read as soon as it has been read.
Calling available() in this circumstance, or indeed almost any circumstance, is a complete waste of time.
This is also the way to copy any kind of input stream to any kind of output stream in Java.
If you want non-blocking code, use NIO. But I don't see that you really do.
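For completeness: on Java 9 and later the same copy is available as a built-in method, so the loop can be written as a one-liner:
// InputStream.transferTo (Java 9+) copies everything from in to out.
in.transferTo(out);
out.close();
in.close();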
I am using IBM WebSphere Application Server v6 and Java 1.4, and am trying to write large CSV files to the ServletOutputStream for a user to download. Files range from 50 to 750 MB at the moment.
The smaller files aren't causing too much of a problem, but with the larger files it appears that the data is being written into the heap, which then causes an OutOfMemory error and brings down the entire server.
These files can only be served out to authenticated users over HTTPS which is why I am serving them through a Servlet instead of just sticking them in Apache.
The code I am using is (some fluff removed around this):
resp.setHeader("Content-length", "" + fileLength);
resp.setContentType("application/vnd.ms-excel");
resp.setHeader("Content-Disposition","attachment; filename=\"export.csv\"");
FileInputStream inputStream = null;
try
{
inputStream = new FileInputStream(path);
byte[] buffer = new byte[1024];
int bytesRead = 0;
do
{
bytesRead = inputStream.read(buffer, offset, buffer.length);
resp.getOutputStream().write(buffer, 0, bytesRead);
}
while (bytesRead == buffer.length);
resp.getOutputStream().flush();
}
finally
{
if(inputStream != null)
inputStream.close();
}
The FileInputStream doesn't seem to be causing the problem: if I write to another file or just remove the write completely, memory usage doesn't appear to be an issue.
What I am thinking is that the data passed to resp.getOutputStream().write is being stored in memory until it can be sent through to the client. So the entire file might be read and buffered in resp.getOutputStream(), causing my memory issues and the crash!
I have tried buffering these streams and also tried using Channels from java.nio, neither of which seems to make any difference to my memory issues. I have also flushed the OutputStream once per iteration of the loop and after the loop, which didn't help.
The average decent servlet container itself flushes the stream by default every ~2 KB. You should really not need to explicitly call flush() on the OutputStream of the HttpServletResponse at intervals when sequentially streaming data from one and the same source. In, for example, Tomcat (and WebSphere!) this is configurable as the bufferSize attribute of the HTTP connector.
The average decent servlet container also just streams the data in chunks if the content length is unknown beforehand (as per the Servlet API specification!) and if the client supports HTTP 1.1.
The problem symptoms at least indicate that the servlet container is buffering the entire stream in memory before flushing. This can mean that the content length header is not set and/or the servlet container does not support chunked encoding and/or the client side does not support chunked encoding (i.e. it is using HTTP 1.0).
To fix the one or the other, just set the content length beforehand:
response.setContentLengthLong(new File(path).length());
Or when you're not on Servlet 3.1 yet:
response.setHeader("Content-Length", String.valueOf(new File(path).length()));
Does flush() work on the output stream?
Really, I wanted to comment that you should use the three-arg form of write, as the buffer is not necessarily filled completely by each read (particularly at the end of the file!). Also, a try/finally would be in order unless you want your server to die unexpectedly.
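Something along these lines (a sketch, reusing the names from the question):
FileInputStream inputStream = new FileInputStream(path);
try {
    byte[] buffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = inputStream.read(buffer)) != -1) {
        // three-arg write: only the bytes actually read are written
        resp.getOutputStream().write(buffer, 0, bytesRead);
    }
    resp.getOutputStream().flush();
} finally {
    inputStream.close();
}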
I have used a class that wraps the OutputStream to make it reusable in other contexts. It has worked well for me in getting data to the browser faster, but I haven't looked at the memory implications. (Please pardon my antiquated m_ variable naming.)
import java.io.IOException;
import java.io.OutputStream;
public class AutoFlushOutputStream extends OutputStream {

    protected long m_count = 0;
    protected long m_limit = 4096;
    protected OutputStream m_out;

    public AutoFlushOutputStream(OutputStream out) {
        m_out = out;
    }

    public AutoFlushOutputStream(OutputStream out, long limit) {
        m_out = out;
        m_limit = limit;
    }

    // Flush the wrapped stream every m_limit bytes written.
    public void write(int b) throws IOException {
        if (m_out != null) {
            m_out.write(b);
            m_count++;
            if (m_limit > 0 && m_count >= m_limit) {
                m_out.flush();
                m_count = 0;
            }
        }
    }
}
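A usage sketch in the context of the question (stream and variable names taken from the question's snippet):
// Wrap the servlet output stream so it is flushed every 4096 bytes written.
OutputStream out = new AutoFlushOutputStream(resp.getOutputStream());
byte[] buffer = new byte[1024];
int bytesRead;
while ((bytesRead = inputStream.read(buffer)) != -1) {
    out.write(buffer, 0, bytesRead);
}
out.flush();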
I'm also not sure if flush() on ServletOutputStream works in this case, but ServletResponse.flushBuffer() should send the response to the client (at least per 2.3 servlet spec).
ServletResponse.setBufferSize() sounds promising, too.
So, following your scenario, shouldn't you be flush()ing inside that while loop (on every iteration), instead of outside of it? I would try that, with a somewhat larger buffer though.
Kevin's class should close the m_out field, if it's not null, in a close() method; we don't want to leak things, do we?
As well as the ServletOutputStream.flush() method, the HttpServletResponse.flushBuffer() operation may also flush the buffers. However, it appears to be an implementation-specific detail as to whether or not these operations have any effect, or whether HTTP content length support is interfering. Remember, specifying content-length is optional in HTTP 1.0, so things should just stream out if you flush. But I don't see that happening here.
The while condition does not work; you need to check for -1 before using the result. And please use a local variable for the output stream; it's nicer to read and it saves calling getOutputStream() repeatedly.
OutputStream outStream = resp.getOutputStream();
while (true) {
    int bytesRead = inputStream.read(buffer);
    if (bytesRead < 0)
        break;
    outStream.write(buffer, 0, bytesRead);
}
inputStream.close();
outStream.close();
Unrelated to your memory problems, the while loop should be:
while(bytesRead > 0);
Your code has an infinite loop:
do
{
    bytesRead = inputStream.read(buffer, offset, buffer.length);
    resp.getOutputStream().write(buffer, 0, bytesRead);
}
while (bytesRead == buffer.length);
offset has the same value throughout the loop, so if offset = 0 initially, it will remain 0 in every iteration, which will cause an infinite loop and lead to the OOM error.
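A corrected loop reads into the buffer from offset 0 and stops at end of stream, for example:
int bytesRead;
while ((bytesRead = inputStream.read(buffer, 0, buffer.length)) != -1) {
    resp.getOutputStream().write(buffer, 0, bytesRead);
}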
IBM WebSphere Application Server uses asynchronous data transfer for servlets by default. That means it buffers the response. If you have problems with large data and OutOfMemory exceptions, try changing the WAS settings to use synchronous mode.
Setting the WebSphere Application Server WebContainer to synchronous mode
You must also take care to load the file in chunks and flush each chunk.
Sample for streaming a large file:
ServletOutputStream os = response.getOutputStream();
FileInputStream fis = new FileInputStream(file);
try {
    int buffSize = 1024;
    byte[] buffer = new byte[buffSize];
    int len;
    while ((len = fis.read(buffer)) != -1) {
        os.write(buffer, 0, len);
        os.flush();
        response.flushBuffer();
    }
} finally {
    fis.close();
    os.close();
}