I am trying to unzip a file that arrives in the response of an HTTP request.
The problem is that after receiving the response I cannot unzip it, nor turn it into a blob in order to parse it afterward.
The zip always contains an XML file, and the idea, once the file is unzipped, is to transform the XML to JSON.
Here is the code I tried:
val client = HttpClient.newBuilder().build()
val request = HttpRequest.newBuilder()
    .uri(URI.create("https://donnees.roulez-eco.fr/opendata/instantane"))
    .build()
val response = client.send(request, HttpResponse.BodyHandlers.ofString())
Then response.body() is just unreadable, and I did not find a proper way to turn it into a blob.
The other code I used for unzipping directly is this one:
val url = URL("https://donnees.roulez-eco.fr/opendata/instantane")
val con = url.openConnection() as HttpURLConnection
con.setRequestProperty("Accept-Encoding", "gzip")
println("Length: " + con.contentLength)
val reader = InputStreamReader(GZIPInputStream(con.inputStream))
while (true) {
    val ch: Int = reader.read()
    if (ch == -1) {
        break
    }
    print(ch.toChar())
}
But in this case, GZIPInputStream won't accept the stream as gzip.
Any idea?
It looks like you're confusing zip (an archive format that supports compression) with gzip (a simple compressed format).
Downloading https://donnees.roulez-eco.fr/opendata/instantane (e.g. with curl) and checking the result shows that it's a zip archive (containing a single file, PrixCarburants_instantane.xml).
But you're trying to decode it as a gzip stream (with GZIPInputStream), which it's not — hence your issue.
Reading a zip file is slightly more involved than reading a gzip file, because it can hold multiple compressed files. But ZipInputStream makes it fairly easy: you can read the first zip entry (which has metadata including its uncompressed size), and then go on to read the actual data in that entry.
A further complication is that this particular compressed file seems to use ISO 8859-1 encoding, not the usual UTF-8. So you need to take that into account when converting the byte stream into text.
Here's some example code:
val zipStream = ZipInputStream(con.inputStream)
val entry = zipStream.nextEntry
val reader = InputStreamReader(zipStream, Charset.forName("ISO-8859-1"))
for (i in 1..entry.size)
    print(reader.read().toChar())
Obviously, reading and printing the entire 11MB file one character at a time is not very efficient! And if there's any possibility that the zip archive could have multiple entries, you'd have to read through them all, stopping when you get to the one with the right name. But I hope this is a good illustration.
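For completeness, here's a sketch of the buffered, multi-entry approach, written in Java against the same java.util.zip API (the entry name PrixCarburants_instantane.xml is the one observed in this particular archive):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipDownload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://donnees.roulez-eco.fr/opendata/instantane");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try (ZipInputStream zip = new ZipInputStream(con.getInputStream())) {
            ZipEntry entry;
            // Walk the entries, stopping at the file we want
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("PrixCarburants_instantane.xml")) {
                    continue;
                }
                // Buffered, line-by-line reading instead of one char at a time
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.ISO_8859_1));
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
                break;
            }
        }
    }
}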
I am reading parts of a large file via a Java FileInputStream and would like to stream its content back to the client (in the form of an akka HttpResponse). I am wondering if this is possible, and how I would do this.
From my research, EntityStreamingSupport can be used, but it only supports JSON or CSV data. I will be streaming raw data from the file, which will not be in the form of JSON or CSV.
Assuming you use akka-http and Scala, you may use getFromFile to stream the entire binary file from a path into the HttpResponse like this:
path("download") {
get {
entity(as[FileHandle]) { fileHandle: FileHandle =>
println(s"Server received download request for: ${fileHandle.fileName}")
getFromFile(new File(fileHandle.absolutePath), MediaTypes.`application/octet-stream`)
}
}
}
Taken from this file upload/download roundtrip akka-http example:
https://github.com/pbernet/akka_streams_tutorial/blob/f246bc061a8f5a1ed9f79cce3f4c52c3c9e1b57a/src/main/scala/akkahttp/HttpFileEcho.scala#L52
Streaming the entire file eliminates the need for "manual chunking", thus the example above will run with limited heap size.
However, if needed, manual chunking could be done like this:
val fileInputStream = new FileInputStream(fileHandle.absolutePath)
val chunked: Source[ByteString, Future[IOResult]] = akka.stream.scaladsl.StreamConverters
  .fromInputStream(() => fileInputStream, chunkSize = 10 * 1024)
chunked.map(each => println(each)).runWith(Sink.ignore)
I am using AWS Lambda to push a file to S3 through Java code.
While sending the file from Postman or from Angular, I am trying to print the content of the file in my Java functions. When doing so, multipart headers are included in the file content automatically, like:
"----------------------------965855468995803568737630
Content-Disposition: form-data; name="test"; filename="test.pdf"
Content-Type: application/pdf"
How do I get the file content without headers from APIGatewayProxyRequestEvent?
This is the code I am using to print the file content:
context.getLogger().log("Input File: " + apiGatewayProxyRequestEvent.getBody());
This is a tricky one to solve. The method getBody() gives you the actual request body sent through the APIGatewayProxyRequest, so it returns exactly what was sent: the file encoded as form-data with a Content-Type and a filename. The responsibility lies on you to convert the form-data back into an understandable object format if you want to print the content.
If you have a look at this tutorial on Medium you can see an approach to this. It boils down to processing the data and working with the format boundary:
// Get the uploaded file and decode it from Base64
byte[] bI = Base64.decodeBase64(event.getBody().getBytes());
// Get the content-type header and extract the boundary
String contentType = null;
Map<String, String> hps = event.getHeaders();
if (hps != null) {
    contentType = hps.get("content-type");
}
String[] boundaryArray = contentType.split("=");
// Transform the boundary to a byte array
byte[] boundary = boundaryArray[1].getBytes();
// Log the extraction for verification purposes
logger.log(new String(bI, "UTF-8") + "\n");
That last line will print the body content; obviously, if it's a binary format, that might not be very useful. I'd recommend giving that tutorial a full read, as it shows how to iterate through the data stream and create the object.
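As a rough illustration (not the tutorial's exact code), here is a minimal sketch of pulling the raw file bytes out of the decoded body by scanning for the blank line that ends the part headers and for the boundary; the indexOf and extractFileContent helpers are made up for this example:

import java.util.Arrays;

public class MultipartExtractor {

    // Find the first occurrence of pattern in data at or after from; -1 if absent
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = from; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Return the bytes between the part's blank header line and the next boundary
    static byte[] extractFileContent(byte[] body, byte[] boundary) {
        byte[] headerEnd = "\r\n\r\n".getBytes();
        int start = indexOf(body, headerEnd, 0);
        if (start < 0) {
            throw new IllegalArgumentException("No multipart headers found");
        }
        start += headerEnd.length;
        int end = indexOf(body, boundary, start);
        if (end < 0) {
            end = body.length;
        }
        // Drop the CRLF that precedes the boundary, if present
        if (end >= start + 2 && body[end - 2] == '\r' && body[end - 1] == '\n') {
            end -= 2;
        }
        return Arrays.copyOfRange(body, start, end);
    }
}

In practice, a library such as Apache Commons FileUpload's MultipartStream does this parsing more robustly.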
I am trying to read a big compressed AWS S3 object (gz). I don't want to read the whole object; I want to read it in parts, so that I can process the uncompressed data in parallel.
I am reading it with GetObjectRequest and a "Range" header, where I set the byte range.
However, when I give a byte range somewhere in the middle, such as (100, 200), it fails with "Not in GZIP format".
The reason for the failure is that the AWS request returns a stream, but when I wrap it in a GZIPInputStream it fails, because GZIPInputStream expects the stream to start with the gzip magic number (GZIP_MAGIC = 0x8b1f), which is not present in the middle of the stream.
GetObjectRequest rangeObjectRequest = new GetObjectRequest(<<Bucket>>, <<Key>>).withRange(100, 200);
S3Object object = s3Client.getObject(rangeObjectRequest);
S3ObjectInputStream rawData = object.getObjectContent();
InputStream data = new GZIPInputStream(rawData);
Can anyone guide the right approach?
GZIP is a compression format in which each byte in the file depends on all of the bytes that precede it. Which means that you can't pick an arbitrary byte range out of the file and make sense of it.
If you need to read byte ranges, you'll need to store it uncompressed.
You could also create your own file storage format that stores chunks of the file as separately-compressed blocks. You could do this using the ZIP format, where each file in the archive represents a specific block size. But you'd need to implement your own ZIP directory reader to make that work.
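As a sketch of that idea (an assumed custom format, not an AWS or ZIP API): compress fixed-size blocks independently and record each block's compressed offset in an index, so a reader can later fetch a single block by byte range and decompress it on its own:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BlockCompressor {
    static final int BLOCK_SIZE = 1024 * 1024; // 1 MiB of uncompressed data per block

    // Returns the offset of each compressed block; store this index alongside the object
    public static List<Long> compressInBlocks(InputStream in, OutputStream out) throws Exception {
        List<Long> offsets = new ArrayList<>();
        byte[] buf = new byte[BLOCK_SIZE];
        long written = 0;
        int n;
        while ((n = readFully(in, buf)) > 0) {
            offsets.add(written);
            ByteArrayOutputStream block = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(block)) {
                gz.write(buf, 0, n); // each block is a complete, standalone gzip stream
            }
            out.write(block.toByteArray());
            written += block.size();
        }
        return offsets;
    }

    // Fill buf as far as possible; returns the number of bytes read (0 at EOF)
    static int readFully(InputStream in, byte[] buf) throws Exception {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}

A reader can then issue a ranged GET starting at offsets.get(i) and wrap just that block in a GZIPInputStream, since each block is a self-contained gzip stream.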
I have a web service capable of returning PDF files in two ways:
RAW: The file is simply included in the response body. For example:
HTTP/1.1 200 OK
Content-Type: application/pdf
<file_contents>
JSON: The file is encoded (Base 64) and served as a JSON with the following structure:
HTTP/1.1 200 OK
Content-Type: application/json
{
"base64": <file_contents_base64>
}
I want to be able to consume both services on Android / Java by using the following architecture:
// Get response body input stream (OUT OF THE SCOPE OF THIS QUESTION)
InputStream bodyStream = getResponseBodyInputStream();
// Get PDF file contents input stream from body stream
InputStream fileStream = getPDFFileContentsInputStream(bodyStream);
// Write stream to a local file (OUT OF THE SCOPE OF THIS QUESTION)
saveToFile(fileStream);
For the first case (RAW response), the response body will be the file itself. This means that the getPDFFileContentsInputStream(InputStream) method implementation is trivial:
@NonNull InputStream getPDFFileContentsInputStream(@NonNull InputStream bodyStream) {
    // Return the input
    return bodyStream;
}
The question is: how to implement the getPDFFileContentsInputStream(InputStream) method for the second case (JSON response)?
You can use any JSON parser (like Jackson or Gson), and then use Base64InputStream from Apache Commons Codec.
EDIT: You can obtain an input stream from string using ByteArrayInputStream, i.e.
InputStream stream = new ByteArrayInputStream(exampleString.getBytes(StandardCharsets.UTF_8));
as stated here.
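Put together, a straightforward two-pass version might look like this (a sketch assuming Gson and Apache Commons Codec on the classpath, and the exact {"base64": ...} structure from the question):

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.commons.codec.binary.Base64InputStream;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class JsonPdfBody {
    static InputStream getPDFFileContentsInputStream(InputStream bodyStream) {
        // First pass: parse the whole JSON body and pull out the Base64 string
        JsonObject json = JsonParser
                .parseReader(new InputStreamReader(bodyStream, StandardCharsets.UTF_8))
                .getAsJsonObject();
        String base64 = json.get("base64").getAsString();
        // Second pass: wrap the Base64 text in a decoding stream
        return new Base64InputStream(
                new ByteArrayInputStream(base64.getBytes(StandardCharsets.UTF_8)));
    }
}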
EDIT 2: This will cause two passes over the data, and if the file is big, you might have memory problems. To solve that, you can use Jackson and parse the content yourself, as in this example, instead of obtaining the whole object through reflection. Alternatively, you can wrap the original input stream in another one, say ExtractingInputStream, which skips the data in the underlying stream until the encoded part begins. Then you can wrap this ExtractingInputStream instance in a Base64InputStream. A simple algorithm to skip the unnecessary parts would be: in the constructor of ExtractingInputStream, skip until you have read three quotation marks; in the read method, return what the underlying stream returns, except return -1 if the underlying stream returns a quotation mark, which corresponds to the end of the Base64-encoded data.
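A minimal sketch of that wrapper, assuming the value contains no escaped quotation marks:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Skips everything before the Base64 value, and reports EOF at its closing quote
class ExtractingInputStream extends FilterInputStream {

    ExtractingInputStream(InputStream in) throws IOException {
        super(in);
        // Skip until the third quotation mark: {"base64": "<data>...
        int quotes = 0;
        while (quotes < 3) {
            int b = in.read();
            if (b == -1) throw new IOException("Unexpected end of JSON body");
            if (b == '"') quotes++;
        }
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        // The closing quote of the value marks the end of the encoded data
        return b == '"' ? -1 : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        // Delegate to the single-byte read so the quote check is always applied
        int i = 0;
        for (; i < len; i++) {
            int b = read();
            if (b == -1) break;
            buf[off + i] = (byte) b;
        }
        return i == 0 ? -1 : i;
    }
}

Then new Base64InputStream(new ExtractingInputStream(bodyStream)) yields the decoded PDF bytes in a single pass.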
I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader inputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = inputFile.readLine();
while (currentRecord != null) {
    currentRecord = inputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading it. So my question is: how can I check for sure that the file is a CSV before reading it?
Checking the file's extension is kind of lame, since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it does involve reading all the bytes in it and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
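As a rough, hand-rolled sketch of that heuristic (not MimeUtility's actual code, which also treats most control characters as non-ASCII):

class EncodingHeuristic {
    static String guessEncoding(byte[] data, boolean isText) {
        int nonAscii = 0;
        for (byte b : data) {
            if ((b & 0xff) > 0x7f) nonAscii++; // high-bit byte: not US-ASCII
        }
        if (nonAscii == 0) return "7bit";
        if (!isText) return "base64";
        // For text: base64 if more than half the bytes are non-ASCII,
        // quoted-printable otherwise
        return nonAscii > data.length / 2 ? "base64" : "quoted-printable";
    }
}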
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it does involve reading all the content!
For example, using the jmimemagic library:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
// Read the entire file into memory
while ((readByte = inputStream.read()) != -1) {
    byteArrayStream.write(readByte);
}
byte[] bytes = byteArrayStream.toByteArray();
// Match the content against jmimemagic's magic-number database
MagicMatch m = Magic.getMagicMatch(bytes);
String mimetype = m.getMimeType();
So... since you are reading the whole content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse mime-types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain. e.g. determining the number of comma-separated values per line and rejecting if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
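For instance, a minimal sketch of such a domain-specific check (the expected column count and the comma delimiter are assumptions you'd adapt):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class CsvSanityChecker {
    // Returns true if every line splits into the expected number of fields
    static boolean looksLikeCsv(String path, int expectedColumns) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A control character strongly suggests binary content
                for (char c : line.toCharArray()) {
                    if (c < 0x20 && c != '\t') return false;
                }
                if (line.split(",", -1).length != expectedColumns) return false;
            }
        }
        return true;
    }
}

You could then reject the upload as soon as looksLikeCsv returns false, before doing any real parsing.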