I am trying to read a large compressed (gzip) object from AWS S3. I don't want to read the whole object at once; I want to read it in parts so that I can process the uncompressed data in parallel.
I am reading it with a GetObjectRequest with the "Range" header, where I set the byte range.
However, when I request a byte range in the middle of the object (e.g. 100-200), it fails with "Not in GZIP format".
The reason for the failure is that the AWS request returns a stream, but parsing it with GZIPInputStream fails because GZIPInputStream expects the first two bytes (GZIP_MAGIC = 0x8b1f) to confirm the data is gzip, and they are not present in the middle of the stream.
// Request only bytes 100-200 of the object
GetObjectRequest rangeObjectRequest = new GetObjectRequest(<<Bucket>>, <<Key>>).withRange(100, 200);
S3Object object = s3Client.getObject(rangeObjectRequest);
S3ObjectInputStream rawData = object.getObjectContent();
InputStream data = new GZIPInputStream(rawData); // fails here with "Not in GZIP format"
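For what it's worth, peeking at the first two bytes of the ranged response confirms the diagnosis; a small illustrative check (it consumes the stream, so it is for debugging only):
// A gzip stream must begin with the magic bytes 0x1f 0x8b; a range taken
// from the middle of the object starts with arbitrary compressed data.
int b1 = rawData.read();
int b2 = rawData.read();
boolean looksLikeGzip = (b1 == 0x1f && b2 == 0x8b); // false for a mid-object range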
Can anyone suggest the right approach?
GZIP is a compression format in which each byte in the file depends on all of the bytes that precede it, which means that you can't pick an arbitrary byte range out of the file and make sense of it.
If you need to read byte ranges, you'll need to store it uncompressed.
You could also create your own file storage format that stores chunks of the file as separately-compressed blocks. You could do this using the ZIP format, where each file in the archive represents a specific block size. But you'd need to implement your own ZIP directory reader to make that work.
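A minimal sketch of the writing side of that idea, assuming fixed-size blocks and illustrative entry names (block-0000, block-0001, ...):
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Each fixed-size block of the source becomes its own zip entry, so any
// block can later be located via the zip directory and decompressed on
// its own, independently of the blocks before it.
public static void writeChunked(InputStream in, OutputStream out, int blockSize) throws IOException {
    try (ZipOutputStream zip = new ZipOutputStream(out)) {
        int index = 0;
        byte[] block = in.readNBytes(blockSize);   // Java 9+
        while (block.length > 0) {
            zip.putNextEntry(new ZipEntry(String.format("block-%04d", index++)));
            zip.write(block);
            zip.closeEntry();
            block = in.readNBytes(blockSize);
        }
    }
}
Reading a specific byte range then means opening only the entries that cover it, rather than decompressing everything from the start of the file.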
Related
I am trying to unzip a file that arrives in the response of an HTTP request.
My problem is that after receiving the response I can neither unzip it nor turn it into a blob to parse afterward.
The zip always contains an XML file, and the idea is to transform that XML into JSON once the file is unzipped.
Here is the code I tried:
val client = HttpClient.newBuilder().build();
val request = HttpRequest.newBuilder()
.uri(URI.create("https://donnees.roulez-eco.fr/opendata/instantane"))
.build();
val response = client.send(request, HttpResponse.BodyHandlers.ofString());
Then response.body() is just unreadable, and I did not find a proper way to turn it into a blob.
The other code I used for unzipping directly is this one:
val url = URL("https://donnees.roulez-eco.fr/opendata/instantane")
val con = url.openConnection() as HttpURLConnection
con.setRequestProperty("Accept-Encoding", "gzip")
println("Length : " + con.contentLength)
var reader: Reader? = null
reader = InputStreamReader(GZIPInputStream(con.inputStream))
while (true) {
    val ch: Int = reader.read()
    if (ch == -1) {
        break
    }
    print(ch.toChar())
}
But in this case, GZIPInputStream won't accept the data as gzip.
Any idea?
It looks like you're confusing zip (an archive format that supports compression) with gzip (a simple compressed format).
Downloading https://donnees.roulez-eco.fr/opendata/instantane (e.g. with curl) and checking the result shows that it's a zip archive (containing a single file, PrixCarburants_instantane.xml).
But you're trying to decode it as a gzip stream (with GZIPInputStream), which it's not — hence your issue.
Reading a zip file is slightly more involved than reading a gzip file, because it can hold multiple compressed files. But ZipInputStream makes it fairly easy: you can read the first zip entry (which has metadata including its uncompressed size), and then go on to read the actual data in that entry.
A further complication is that this particular compressed file seems to use ISO 8859-1 encoding, not the usual UTF-8. So you need to take that into account when converting the byte stream into text.
Here's some example code:
val zipStream = ZipInputStream(con.inputStream)
val entry = zipStream.nextEntry
val reader = InputStreamReader(zipStream, Charset.forName("ISO-8859-1"))
for (i in 1..entry.size)
    print(reader.read().toChar())
Obviously, reading and printing the entire 11MB file one character at a time is not very efficient! And if there's any possibility that the zip archive could have multiple entries, you'd have to read through them all, stopping when you get to the one with the right name. But I hope this is a good illustration.
I'm currently getting a ResponseInputStream<GetObjectResponse> from the S3Client (SDK 2), reading it into a byte array, and opening two ByteArrayInputStreams to pass to Apache Tika and ImageIO.read.
Tika detects the mimeType, and BufferedImage is used to get the height and width. Neither operation needs to read the whole file (at least not for all image types), but reading into a byte array requires consuming the whole file.
Now how could I open two streams and just discard them when I'm done? Is the only way to perform two getObject calls to S3? Mark and reset aren't supported by the SDK.
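For reference, here is a minimal sketch of the current approach described above (bucket and key are placeholders; getObjectAsBytes buffers the entire object in memory, which is exactly the cost in question):
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import javax.imageio.ImageIO;
import org.apache.tika.Tika;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

S3Client s3 = S3Client.create();
byte[] bytes = s3.getObjectAsBytes(
        GetObjectRequest.builder().bucket("my-bucket").key("my-key").build()
).asByteArray();

// Two independent streams over the same buffered bytes.
String mimeType = new Tika().detect(new ByteArrayInputStream(bytes));
BufferedImage image = ImageIO.read(new ByteArrayInputStream(bytes));
int width = image.getWidth(), height = image.getHeight();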
One possible way: if you add the metadata in the request while uploading the image, then later you only need to call the GetObjectMetadata method, and you can get the information you need without retrieving the whole object again.
// Upload a file as a new object, with ContentType and a "title" user-metadata entry.
PutObjectRequest request = new PutObjectRequest(bucketName, fileObjKeyName, new File(fileName));
ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentType("text/plain");
metadata.addUserMetadata("title", "someTitle");
request.setMetadata(metadata);
s3Client.putObject(request);
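Fetching it back later is then just a HEAD request; a short sketch with the same v1 client (getObjectMetadata and getUserMetaDataOf are the standard SDK v1 methods):
// Returns only the object's metadata; the body is not transferred.
ObjectMetadata meta = s3Client.getObjectMetadata(bucketName, fileObjKeyName);
String contentType = meta.getContentType();       // "text/plain"
String title = meta.getUserMetaDataOf("title");   // "someTitle"
Note that this only works for information you stored at upload time, so image dimensions would have to be computed and attached as user metadata when the object is put.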
I am attempting to transfer a gzipped file using IOUtils.copyLarge. When I transfer from a GZIPInputStream to a non-compressed output, it works fine, but when I transfer the original InputStream (attempting to leave it compressed), the end result is 0 bytes.
I have verified the input file is correct. Here is an example of what works:
IOUtils.copyLarge(new GZIPInputStream(inputStream), out)
This of course results in an uncompressed file being written out. I would like to keep the file compressed as it is in the original input.
When I try val countBytes = IOUtils.copyLarge(inputStream, out) the result is 0, and the resulting file is empty. The desired result is simply copying the already compressed gzip file to a new destination maintaining compression.
From reading the API documentation, I believe I am using it correctly. Any ideas on what is preventing it from working?
I'm planning to use SheetJS with Rhino. SheetJS takes a binary object (a Blob, if I'm correct) as its input. So I need to read a file from the system using standard Java I/O methods and store it in a blob before passing it to SheetJS, e.g.:
var XLDataWorkBook = XLSX.read(blobInput, {type : "binary"});
So how can I create a Blob (or an appropriate type) from a binary file in Java in order to pass it in?
I guess I can't pass streams, because XLSX needs a completely created object to process.
I found the answer to this myself; I was able to get it done this way.
Read the file with an InputStream and then write it to a ByteArrayOutputStream, like below.
InputStream in = new FileInputStream(inputFile); // inputFile: the binary file to read (name is illustrative)
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
byte[] bytes = new byte[4096];
int len;
while ((len = in.read(bytes)) != -1) {
    buffer.write(bytes, 0, len);
}
Then create a byte array from it.
byte[] byteArray = buffer.toByteArray();
Finally, I converted it to a Base64 string (which is also applicable in my case) using the Base64.encodeBase64String() method from the Apache commons-codec binary package, so I can pass the Base64 string as a method parameter.
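The conversion step, for reference (assuming commons-codec is on the classpath):
import org.apache.commons.codec.binary.Base64;

// byteArray is the array produced in the previous step
String base64 = Base64.encodeBase64String(byteArray);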
If you need to go further, there are plenty of libraries (third-party and built-in) available for Base64-to-Blob conversion as well.
I have connected to an FTP location using:
URL url = new URL("ftp://user:password@mydomain.com/" + file_name + ";type=i");
I read the content into a byte array as shown below:
InputStream fis = url.openConnection().getInputStream(); // stream setup; not shown in the original snippet
byte[] buffer = new byte[1024];
int count = 0;
while ((count = fis.read(buffer)) > 0)
{
    // check if bytes in buffer is a file
}
I want to be able to check whether the bytes in buffer represent a file, without explicitly passing a specific file to write them to, like this:
File xfile = new File("dir1/");
FileOutputStream fos = new FileOutputStream(xfile);
fos.write(bytes);
if(xfile.isFile())
{
}
In an ideal world, something like this:
File xfile = new File(buffer);//Note: you cannot do this in java
if(xfile.isFile())
{
}
Here, isFile() is meant to check whether the bytes read from the FTP location are a file. I don't want to pass an explicit file name, as I do not know the name of the file on the FTP location.
Any solutions available?
What is a file?
A computer file is a block of arbitrary information [...] which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished.
Your bytes that are stored in the byte array will be a part of a file if you write them on some kind of durable storage.
Sure, we often say that we read a file or write a file, but basically we read bytes from a file and write bytes to a file.
So we can't test whether a byte array's content is a file or not, simply because every byte array can be used to create a file (even an empty one).
BTW - the FTP server does not send a file: it (1) reads bytes and (2) a filename, then (3) sends the bytes and (4) the filename, so that a client can (5) read the bytes and (6) the filename and use both to (7) create a file. The FTP server doesn't even have to access a file; it can take bytes and names from a database, or create both in memory...
I guess you cannot check whether a byte[] array is a file or not. Why don't you just use an already-written and tested library, for example Apache Commons Net: http://commons.apache.org/net/
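With commons-net, the client can list the remote names before downloading, which sidesteps not knowing the file name; a minimal sketch (host and credentials are placeholders):
import java.io.FileOutputStream;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

FTPClient ftp = new FTPClient();
ftp.connect("mydomain.com");
ftp.login("user", "password");
ftp.setFileType(FTPClient.BINARY_FILE_TYPE);   // same as ";type=i" in the URL

for (FTPFile remote : ftp.listFiles()) {
    if (remote.isFile()) {                     // the server says whether each entry is a file
        try (FileOutputStream out = new FileOutputStream(remote.getName())) {
            ftp.retrieveFile(remote.getName(), out);
        }
    }
}
ftp.logout();
ftp.disconnect();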
There is no way to do that easily.
A file is a byte array on a disk and a byte array will be a file if you write it to disk. There is no reliable way of telling what is in the data you just received, without parsing the data and checking if you can find a valid file header in it.
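For the narrow case of recognizing well-known formats, sniffing magic numbers is about the best you can do; an illustrative check (it covers only a few formats, by way of example):
// Illustrative only: recognize a few well-known magic numbers at the
// start of the buffer. Absence of a match proves nothing.
static String guessFormat(byte[] b) {
    if (b.length >= 2 && (b[0] & 0xFF) == 0x1F && (b[1] & 0xFF) == 0x8B) return "gzip";
    if (b.length >= 4 && b[0] == 'P' && b[1] == 'K' && b[2] == 3 && b[3] == 4) return "zip";
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xD8) return "jpeg";
    return "unknown";
}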
As for "isFile() means the content fetched from the FTP stream is a file":
The answer to that is simple. You can't do it because it doesn't make any sense.
What you have read from the stream IS a sequence of bytes stored in memory.
A file is a sequence of bytes stored on a disk (typically).
These are not the same thing. (Or, if you want to get all theoretical / philosophical, you have to answer the question "when is a sequence of bytes a file, and when is it not a file?".)
Now a more sensible question to ask might be:
How do I know if the stuff I fetched by FTP is the contents of a file on the FTP server.
(... as distinct from a rendering of a directory or something).
The answer is that you can't be sure when you've fetched it by opening a URLConnection to the FTP server, as you have done. It is like asking "is '(123) 555-5555' a phone number?". It could be a phone number, or it could just be a sequence of characters that looks like a phone number.