read a huge amount of data

read a huge amount of data - java

In a java program (Eclipse ISE on a PC) I want to read a huge amount of data (around 1640188 bytes) from a web site. With Wireshark I can see that these datas come in many blocks of 1460 bytes.
When I use the following code I read only the first block seen at high level (size around 18000 bytes). How could I do to have the other blocks?
URLConnection con = url.openConnection();
InputStream input = con.getInputStream();
while(input.available()>0)
{
System.out.println(input.available());
int n = input.available();
byte[] mydataTab = new byte[n];
input.read(mydataTab, 0, n);
String str = new String(mydataTab);
memoData += str;
}

First:
Do not
int n = input.available();
byte[] mydataTab = new byte[n];
because:
Note that while some implementations of InputStream will return the
total number of bytes in the stream, many will not. It is never
correct to use the return value of this method to allocate a buffer
intended to hold all data in this stream.
Java InputStream Documentation
Second:
Try to use some predefined chunck size for your reading, so you can do:
int chuncksize = 1024;
int sizeRead = input.read(mydataTab, 0, n);
where the sizeRead is the amount of bytes that you read.
And keep reading the chunks until the end of the streaming.

Related

Java Reading large files into byte array chunk by chunk

So I've been trying to make a small program that inputs a file into a byte array, then it will turn that byte array into hex, then binary. It will then play with the binary values (I haven't thought of what to do when I get to this stage) and then save it as a custom file.
I studied a lot of internet code and I can turn a file into a byte array and into hex, but the problem is I can't turn huge files into byte arrays (out of memory).
This is the code that is not a complete failure
public void rundis(Path pp) {
byte bb[] = null;
try {
bb = Files.readAllBytes(pp); //Files.toByteArray(pathhold);
System.out.println("byte array made");
} catch (Exception e) {
e.printStackTrace();
}
if (bb.length != 0 || bb != null) {
System.out.println("byte array filled");
//send to method to turn into hex
} else {
System.out.println("byte array NOT filled");
}
}
I know how the process should go, but I don't know how to code that properly.
The process if you are interested:
Input file using File
Read the chunk by chunk of the file into a byte array. Ex. each byte array record hold 600 bytes
Send that chunk to be turned into a Hex value --> Integer.tohexstring
Send that hex value chunk to be made into a binary value --> Integer.toBinarystring
Mess around with the Binary value
Save to custom file line by line
Problem:: I don't know how to turn a huge file into a byte array chunk by chunk to be processed.
Any and all help will be appreciated, thank you for reading :)

To chunk your input use a FileInputStream:
Path pp = FileSystems.getDefault().getPath("logs", "access.log");
final int BUFFER_SIZE = 1024*1024; //this is actually bytes
FileInputStream fis = new FileInputStream(pp.toFile());
byte[] buffer = new byte[BUFFER_SIZE];
int read = 0;
while( ( read = fis.read( buffer ) ) > 0 ){
// call your other methodes here...
}
fis.close();

To stream a file, you need to step away from Files.readAllBytes(). It's a nice utility for small files, but as you noticed not so much for large files.
In pseudocode it would look something like this:
while there are more bytes available
read some bytes
process those bytes
(write the result back to a file, if needed)
In Java, you can use a FileInputStream to read a file byte by byte or chunk by chunk. Lets say we want to write back our processed bytes. First we open the files:
FileInputStream is = new FileInputStream(new File("input.txt"));
FileOutputStream os = new FileOutputStream(new File("output.txt"));
We need the FileOutputStream to write back our results - we don't want to just drop our precious processed data, right? Next we need a buffer which holds a chunk of bytes:
byte[] buf = new byte[4096];
How many bytes is up to you, I kinda like chunks of 4096 bytes. Then we need to actually read some bytes
int read = is.read(buf);
this will read up to buf.length bytes and store them in buf. It will return the total bytes read. Then we process the bytes:
//Assuming the processing function looks like this:
//byte[] process(byte[] data, int bytes);
byte[] ret = process(buf, read);
process() in above example is your processing method. It takes in a byte-array, the number of bytes it should process and returns the result as byte-array.
Last, we write the result back to a file:
os.write(ret);
We have to execute this in a loop until there are no bytes left in the file, so lets write a loop for it:
int read = 0;
while((read = is.read(buf)) > 0) {
byte[] ret = process(buf, read);
os.write(ret);
}
and finally close the streams
is.close();
os.close();
And thats it. We processed the file in 4096-byte chunks and wrote the result back to a file. It's up to you what to do with the result, you could also send it over TCP or even drop it if it's not needed, or even read from TCP instead of a file, the basic logic is the same.
This still needs some proper error-handling to work around missing files or wrong permissions but that's up to you to implement that.
A example implementation for the process method:
//returns the hex-representation of the bytes
public static byte[] process(byte[] bytes, int length) {
final char[] hexchars = "0123456789ABCDEF".toCharArray();
char[] ret = new char[length * 2];
for ( int i = 0; i < length; ++i) {
int b = bytes[i] & 0xFF;
ret[i * 2] = hexchars[b >>> 4];
ret[i * 2 + 1] = hexchars[b & 0x0F];
}
return ret;
}

java: read large binary file

I need to read out a given large file that contains 500000001 binaries. Afterwards I have to translate them into ASCII.
My Problem occurs while trying to store the binaries in a large array. I get the warning at the definition of the array ioBuf:
"The literal 16000000032 of type int is out of range."
I have no clue how to save these numbers to work with them! Has somebody an idea?
Here is my code:
public byte[] read(){
try{
BufferedInputStream in = new BufferedInputStream(new FileInputStream("data.dat"));
ByteArrayOutputStream bs = new ByteArrayOutputStream();
BufferedOutputStream out = new BufferedOutputStream(bs);
byte[] ioBuf = new byte[16000000032];
int bytesRead;
while ((bytesRead = in.read(ioBuf)) != -1){
out.write(ioBuf, 0, bytesRead);
}
out.close();
in.close();
return bs.toByteArray();
}

The maximum Index of an Array is Integer.MAX_VALUE and 16000000032 is greater than Integer.MAX_VALUE
Integer.MAX_VALUE = 2^31-1 = 2147483647
2147483647 < 16000000032
You could overcome this by checking if the Array is full and create another and continue reading.
But i'm not quite sure if your approach is the best way to perform this. byte[Integer_MAX_VALUE] is huge ;)
Maybe you can split the input file in smaller chunks process them.
EDIT: This is how you could read a single int of your file. You can resize the buffer's size to the amount of data you want to read. But you tried to read the whole file at once.
//Allocate buffer with 4byte = 32bit = Integer.SIZE
byte[] ioBuf = new byte[4];
int bytesRead;
while ((bytesRead = in.read(ioBuf)) != -1){
//if bytesRead == 4 you read 1 int
//do your stuff
}

If you need to declare a large constant, append an 'L' to it which indicates to the compiler that is a long constant. However, as mentioned in another answer you can't declare arrays that large.
I suspect the purpose of the exercise is to learn how to use the java.nio.Buffer family of classes.

I made some progress by starting from scratch! But I still have a problem.
My idea is to read up the first 32 bytes, convert them to a int number. Then the next 32 bytes etc. Unfortunately I just get the first and don't know how to proceed.
I discovered following method for converting these numbers to int:
public static int byteArrayToInt(byte[] b){
final ByteBuffer bb = ByteBuffer.wrap(b);
bb.order(ByteOrder.LITTLE_ENDIAN);
return bb.getInt();
}
so now I have:
BufferedInputStream in=null;
byte[] buf = new byte[32];
try {
in = new BufferedInputStream(new FileInputStream("ndata.dat"));
in.read(buf);
System.out.println(byteArrayToInt(buf));
in.close();
} catch (IOException e) {
System.out.println("error while reading ndata.dat file");
}

difference between input.read and input.read(array, offset, length)

I'm trying to understand how inputstreams work. The following block of code is one of the many ways to read data from a text file:-
File file = new File("./src/test.txt");
InputStream input = new BufferedInputStream (new FileInputStream(file));
int data = 0;
while (data != -1) (-1 means we reached the end of the file)
{
data = input.read(); //if a character was read, it'll be turned to a bite and we get the integer representation of it so a is 97 b is 98
System.out.println(data + (char)data); //this will print the numbers followed by space then the character
}
input.close();
Now to use input.read(byte, offset, length) i have this code. I got it from here
File file = new File("./src/test.txt");
InputStream input = new BufferedInputStream (new FileInputStream(file));
int totalBytesRead = 0, bytesRemaining, bytesRead;
byte[] result = new byte[ ( int ) file.length()];
while ( totalBytesRead < result.length )
{
bytesRemaining = result.length - totalBytesRead;
bytesRead = input.read ( result, totalBytesRead, bytesRemaining );
if ( bytesRead > 0 )
totalBytesRead = totalBytesRead + bytesRead;
//printing integer version of bytes read
for (int i = 0; i < bytesRead; i++)
System.out.print(result[i] + " ");
System.out.println();
//printing character version of bytes read
for (int i = 0; i < bytesRead; i++)
System.out.print((char)result[i]);
}
input.close();
I'm assuming that based on the name BYTESREAD, this read method is returning the number of bytes read. In the documentation, it says that the function will try to read as many as possible. So there might be a reason why it wouldn't.
My first question is: What are these reasons?
I could replace that entire while loop with one line of code: input.read(result, 0, result.length)
I'm sure the creator of the article thought about this. It's not about the output because I get the same output in both cases. So there has to be a reason. At least one. What is it?

The documentation of read(byte[],int,int says that it:
Reads up to len bytes of data.
An attempt is made to read as many as len bytes
A smaller number may be read.
Since we are working with files that are right there in our hard disk, it seems reasonable to expect that the attempt will read the whole file, but input.read(result, 0, result.length) is not guaranteed to read the whole file (it's not said anywhere in the documentation). Relying in undocumented behaviors is a source for bugs when the undocumented behavior change.
For instance, the file stream may be implemented differently in other JVMs, some OS may impose a limit on the number of bytes that you may read at once, the file may be located in the network, or you may later use that piece of code with another implementation of stream, which doesn't behave in that way.
Alternatively, if you are reading the whole file in an array, perhaps you could use DataInputStream.readFully
About the loop with read(), it reads a single byte each time. That reduces performance if you are reading a big chunk of data, since each call to read() will perform several tests (has the stream ended? etc) and may ask the OS for one byte. Since you already know that you want file.length() bytes, there is no reason for not using the other more efficient forms.

Imagine you are reading from a network socket, not from a file. In this case you don't have any information about the total amount of bytes in the stream. You would allocate a buffer of fixed size and read from the stream in a loop. During one iteration of the loop you can't expect there are BUFFERSIZE bytes available in the stream. So you would fill the buffer as much as possible and iterate again, until the buffer is full. This can be useful, if you have data blocks of fixed size, for example serialized object.
ArrayList<MyObject> list = new ArrayList<MyObject>();
try {
InputStream input = socket.getInputStream();
byte[] buffer = new byte[1024];
int bytesRead;
int off = 0;
int len = 1024;
while(true) {
bytesRead = input.read(buffer, off, len);
if(bytesRead == len) {
list.add(createMyObject(buffer));
// reset variables
off = 0;
len = 1024;
continue;
}
if(bytesRead == -1) break;
// buffer is not full, adjust size
off += bytesRead;
len -= bytesRead;
}
} catch(IOException io) {
// stream was closed
}
ps. Code is not tested and should only point out, how this function can be useful.

You specify the amount of bytes to read because you might not want to read the entire file at once or maybe you couldn't or might not want to create a buffer as large as the file.

Byte [] used as a BufferInputStream ok but

Ok I know a buffer is actually an array of byte, however I have never seen the following declaration (taken from here)
URLConnection con = new URL("http://maps...").openConnection();
InputStream is = con.getInputStream();
byte bytes[] = new byte[con.getContentLength()];
is.read(bytes);
Is it the right way to avoid using a BufferInputStream object? Here we have an unbuffered stream reading from a byte []? should not be the other way around?
thanks in advance.

No, it is not the right way. Method read() reads up to N bytes where N is the length of your array. It can read less bytes (even 0) if no more byte are available. Number of bytes that have been read is returned by method read(). When end of stream is reached the method returns -1.
Therefore the right way is to read bytes in loop:
byte[] buf = new buf[MAX];
int n = 0;
while ((n = stream.read(buf)) >= 0) {
// deal with n first bytes from buf
}

or use Apache commons-io
InputStream is;
byte[] bytes = IOUtils.toByteArray(is);

Java - size of compression output-byteArray

When using the deflate-method of java.util.zip.Deflater, a byte[] has to be supplied as the argument, how big should that byte[] be initialized to? I've read there's no guarantee the compressed data will even be smaller that the uncompressed data. Is there a certain % of the input I should go with?
Currently I make it twice as big as the input

After calling deflate, call finished to see if it still has more to output. eg:
byte[] buffer = new byte[BUFFER_SIZE];
while (!deflater.finished()) {
int n = deflater.deflate(buffer);
// deal with the n bytes in out here
}
If you just want to collect all of the bytes in-memory you can use a ByteArrayOutputStream. eg:
byte[] buffer = new byte[BUFFER_SIZE];
ByteArrayOutputStream baos = new ByteArrayOutputStream();
while (!deflater.finished()) {
int n = deflater.deflate(buffer);
baos.write(buffer, 0, n);
}
return baos.toByteArray();

Why does Java misspell the class as "deflater"? The word is "deflator". Jeez! Sorry, had to get that off my chest.
As noted, the expected use is to keep calling deflate until you get all of the output from the compression. However, if you really want to do it in a single call, then there is a bound on the amount by which deflate can expand the data. There is a function in zlib that Java unfortunately does not make available called deflateBound() which provides that upper bound. You can just use the conservative bound from that function, with the relevant line copied here:
complen = sourceLen +
((sourceLen + 7) >> 3) + ((sourceLen + 63) >> 6) + 5;

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

read a huge amount of data - java

Related

Java Reading large files into byte array chunk by chunk

java: read large binary file

difference between input.read and input.read(array, offset, length)

Byte [] used as a BufferInputStream ok but

Java - size of compression output-byteArray

Categories

Resources