Change in Google Cloud Dataflow's processing of compressed files - Java

Has anything changed recently in the way Google Cloud Dataflow reads compressed files from Google Cloud Storage? I am working on a project that reads compressed CSV log files from GCS and uses these files as the source for a Dataflow pipeline. Until recently this worked perfectly, both with and without specifying the compression type of the file.
Currently the processElement method in my DoFn is only called once (for the CSV header row) although the file has many rows. If I use the same source file uncompressed, then everything works as expected (the processElement method is called for every row). As suggested here https://stackoverflow.com/a/27775968/6142412, setting the Content-Encoding to gzip does work, but I did not have to do this previously.
I am experiencing this issue when using DirectPipelineRunner or DataflowPipelineRunner. I am using version 1.5.0 of the Cloud Dataflow SDK.

We identified a problem (BEAM-167) reading from concatenated gzip files. It has been fixed in the Apache Beam GitHub repository by PR 114 and in the Dataflow SDK GitHub repository by PR 180. It will be part of the next release.
Until then, a workaround is to use the SDK built from GitHub, or to compress the entire file as a single part.
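The "single part" workaround amounts to writing every row through one gzip stream, so the object contains a single gzip member instead of several concatenated ones. A minimal sketch with the standard java.util.zip package (the SinglePartGzip class name is mine, not part of any SDK):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SinglePartGzip {

    // Compress the whole CSV through ONE GZIPOutputStream, producing a
    // single gzip member -- the layout the affected SDK versions read fully.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress back to text, to verify the round trip.
    public static String gunzip(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

On the pipeline side, it may also help to make the compression explicit rather than relying on filename detection, e.g. with TextIO's GZIP compression-type option in the 1.x SDK.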

Related

How can I download a single file from a large remote zip file in Java?

I'm trying to download a small file (about 0.3 KB) from a given zip file that's around 3-5 GB in size.
I have been using the native library libfragmentzip via JNA, which is very fast but has issues of its own that come with using native libraries (like not being cross-platform).
I have tried this solution, but it is much slower, taking minutes compared to the seconds libfragmentzip needs.
This is a URL to a test zip file (the extension is .ipsw but it is really a zip). The file I am trying to download is BuildManifest.plist, in the root of the zip.
Is there a fast way to download a single file from a remote zip file without using a native library?
You can insert BuildManifest.plist at the end of the URL.
For example:
http://updates-http.cdn-apple.com/2021SpringFCS/fullrestores/071-34317/E63B034D-2116-42D0-9FBD-97A3D9060F68/BuildManifest.plist
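The trick above amounts to replacing the last path segment of the .ipsw URL with the wanted file name. A minimal sketch (the siblingUrl helper name is mine):

```java
public class ZipSibling {

    // Replace the last path segment of a URL with the given file name,
    // e.g. .../Restore.ipsw -> .../BuildManifest.plist
    public static String siblingUrl(String url, String fileName) {
        int slash = url.lastIndexOf('/');
        return url.substring(0, slash + 1) + fileName;
    }
}
```

Note that this only works because this particular server happens to expose the file alongside the archive; for an arbitrary remote zip you would instead need HTTP range requests against the zip's central directory.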

Problem loading ISO-8859-1 into BigQuery using DataFlow (Apache Beam)

I'm trying to load an ISO-8859-1 file into BigQuery using Dataflow. I've built a template with Apache Beam Java. Everything works well, but when I check the content of the BigQuery table I see that some characters like 'ñ' or accented ones ('á', 'é', etc.) haven't been stored properly; they have been stored as �.
I've tried changing to several charsets before writing into BigQuery. I've also created a special ISOCoder and passed it to the pipeline using the setCoder() method, but nothing works.
Does anyone know if it is possible to load this kind of file into BigQuery using Apache Beam, or is only UTF-8 supported?
Thanks in advance for your help.
This feature is currently not available in the Java SDK of Beam. In Python it seems to be possible by using additional_bq_parameters with WriteToBigQuery, see: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L177
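A workaround on the Java side is to read the file as raw bytes and transcode it yourself before handing rows to the BigQuery writer: once the text is a proper Java String, it should be serialized as UTF-8 downstream. The helper below is a sketch of that idea (the Latin1 class name is mine), not a Beam API:

```java
import java.nio.charset.StandardCharsets;

public class Latin1 {

    // Decode ISO-8859-1 bytes into a Java String, so 'ñ', 'á', 'é', etc.
    // survive instead of being replaced by the � replacement character.
    public static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

You would call this inside a DoFn that receives the raw file bytes, before building the TableRow.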

Is there a solution to read RAR files of version 5 using the Junrar library?

I'm writing the part of a Java application responsible for reading archive files of different formats and preparing a preview mode.
The Junrar library appeared to be the most reliable for working with the RAR format, but it doesn't support the latest version 5 of RAR; only earlier versions are supported.
Junrar dev team confirms this fact here:
https://github.com/junrar/junrar/issues/23
WinRAR by default creates RAR files of version 5, but the 'RAR4' checkbox in its options creates a version 4 file, which is perfect to work with. However, you have to tick it every time you archive a file, which is not a good approach (and earlier versions of WinRAR can't be downloaded from the official website).
In my case the file is stored as a byte array. I don't need to unrar the file, I just read it: I need the name and size of every file inside it, so that I can prepare preview data of the archive content as a small HTML table.
Do you know any other good library to work with Rar format? I can't find any.
Or maybe you can imagine some good workarounds?
There is no available solution to unrar RAR5 archives except WinRAR itself. You may call it as an external program if that is possible on your OS.
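Calling an external tool from Java can be sketched with ProcessBuilder. The helper below is generic (the ExternalTool name is mine); whether an `unrar` binary exists on the PATH, and its exact flags, depend on the installation:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ExternalTool {

    // Run a command and capture its combined stdout/stderr as a String.
    public static String run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return out;
    }
}
```

For example, `run(List.of("unrar", "l", "archive.rar"))` would list the entries of an archive, assuming the unrar command-line tool is installed.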
I also handle archives of different formats, and I use junrar for extracting RAR; RAR5 archives remain unhandled.
Junrar was initially developed by Edmund Wagner, and some time ago support was taken over by Beothorn. But, much to my regret, he is not planning to implement RAR5 support for some reason.
I have also checked Raroscope; it does not support RAR5 either.
By the way, another archive format not supported by any open-source Java library is ARJ.
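Since the file is already in memory as a byte array, one practical step is to detect RAR5 archives up front and skip junrar for them. RAR 4.x archives start with the signature bytes 52 61 72 21 1A 07 00 ("Rar!\x1A\x07\x00"), while RAR5 archives start with 52 61 72 21 1A 07 01 00. A sketch based on those magic numbers (the RarVersion class name is mine):

```java
public class RarVersion {

    private static final byte[] RAR4 = { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00 };
    private static final byte[] RAR5 = { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00 };

    public static boolean isRar5(byte[] data) { return startsWith(data, RAR5); }
    public static boolean isRar4(byte[] data) { return startsWith(data, RAR4); }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

This lets the preview code show a "format not supported" message for RAR5 instead of letting junrar fail on it.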

Upload Apache POI XLSX generated files to CodeIgniter website

I have to upload XLSX files to a web application. This web application uses CodeIgniter to check and import those files.
Furthermore, I have to generate these XLSX files from another one. In order to read and write XLSX files, I use Apache POI. This part is pretty easy and works well.
But here is my problem: when uploading an auto-generated file, CodeIgniter declines the file, saying that this file type is not allowed. It's probably a missing property that isn't created by the Apache POI library, but I haven't managed to find which one.
Another 'fun' fact: when opening an Apache POI auto-generated file with Microsoft Excel and then saving it without any modification, the file gains about 3 KB of data and becomes valid for CodeIgniter. It doesn't work with LibreOffice Calc, which apparently adds some data, but not the same data as Microsoft Excel does.
Do you have any idea which property or data could be missing? Or any method to resolve my problem?
Edit: after some more investigation, and according to PHP's finfo_file function (used by CodeIgniter), my bad file has the MIME type application/octet-stream while a legit file has the MIME type application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. So I think Apache POI has some bug when generating XLSX.
Edit 2: finally, there are two XLSX signature types (see enclosed screenshot). Only the second one is recognized as application/vnd.openxmlformats-officedocument.spreadsheetml.sheet by finfo_file. Unfortunately, Apache POI generates a signature of the first type, so the file isn't recognized as XLSX.
This strange behaviour finally comes from the ambiguity of the XLSX file format: it covers two different signatures (as you can see in the enclosed picture from https://www.filesignatures.net/). Only the second one is recognized as application/vnd.openxmlformats-officedocument.spreadsheetml.sheet by finfo_file, whereas Apache POI generates files of the first type. Thus the upload fails, saying the file type is not allowed.
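You can reproduce what a signature-based sniffer sees by inspecting the bytes yourself. Every XLSX file is a ZIP, so it starts with the local-file-header magic PK\x03\x04, and OOXML-aware sniffers typically look further, e.g. at the name of the first ZIP entry ([Content_Types].xml in a spreadsheet package). A sketch for debugging uploads (the XlsxSniff helper names are mine):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class XlsxSniff {

    // True if the data begins with the ZIP local-file-header magic PK\x03\x04.
    public static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 0x50 && data[1] == 0x4B
                && data[2] == 0x03 && data[3] == 0x04;
    }

    // Name of the first entry in the ZIP stream, or null if there is none.
    public static String firstEntryName(byte[] data) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry e = zin.getNextEntry();
            return e == null ? null : e.getName();
        }
    }
}
```

Comparing these values between a POI-generated file and an Excel-resaved one shows exactly where the two "XLSX types" diverge for the sniffer.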

Reduce the big size of my Android application's resources

I am currently developing an Android application that uses the LIBSVM library for data classification.
To use LIBSVM I have to provide a text file describing the data; the size of my data is 1.3 GB.
I have placed all my files in the assets folder, copied them to the SD card, and then run the classification.
The problem now is that my application takes a long time to install on my device!
Is it possible to compress those files and then decompress them while running my classification? And how can I do this in Android?
Move your resources to your desktop and delete them from your app; this would be helpful for testing your app.
You could ship the files zipped and have your app unzip them the first time they are used.
You can zip/unzip using the standard java.util.zip package.
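A minimal sketch of the unzip-on-first-run idea with java.util.zip (the AssetUnzipper name and the zip-slip guard are my additions; on Android the input stream would typically come from context.getAssets().open(...)):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class AssetUnzipper {

    // Extract every entry of the zip stream into destDir.
    public static void unzip(InputStream in, File destDir) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                File out = new File(destDir, entry.getName());
                // Guard against "zip slip" path-traversal entries like ../../x.
                if (!out.getCanonicalPath().startsWith(destDir.getCanonicalPath())) {
                    throw new IOException("Bad zip entry: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (FileOutputStream fos = new FileOutputStream(out)) {
                    zin.transferTo(fos);
                }
            }
        }
    }
}
```

Run it once on first launch (e.g. guarded by a SharedPreferences flag) so later runs read the already-extracted files.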
