Gzipping a file on GCS (Google Cloud Storage) via the Java API

I have log files being dropped into a GCS bucket regularly (e.g. gs://my-bucket/log.json). I want to set up a Java process to process the files, gzip them, and move them to a separate bucket where I archive files (i.e. move each one to gs://archived-logs/my-bucket/log.json.gz).
gsutil cp -z seems to be the only option I can find at the moment. Has anybody implemented this in a feasible manner using the Java API?

Ok, I think I solved it. The standard-streams solution is below.
GcsOutputChannel gcsOutputChannel = gcsService.createOrReplace(
        new GcsFilename("my-bucket", "log.json.gz"),
        new GcsFileOptions.Builder().build());
GZIPOutputStream outputStream = new GZIPOutputStream(Channels.newOutputStream(gcsOutputChannel));
GcsInputChannel inputChannel = gcsService
        .openReadChannel(new GcsFilename("my-bucket", "log.json"), 0); // second argument is the start position, not a buffer size
InputStream inStream = Channels.newInputStream(inputChannel);

byte[] buffer = new byte[10000];
int bytesRead;
while ((bytesRead = inStream.read(buffer)) > 0) {
    // write only the bytes actually read, not the whole buffer
    outputStream.write(buffer, 0, bytesRead);
}
inStream.close();
outputStream.close(); // finishes the gzip trailer and completes the GCS write

For the newer streaming writer you can follow the code below. Note that GCP will automatically decompress the object while serving it to web clients; if you want to receive the gzipped bytes as stored, send the "Accept-Encoding": "gzip" request header.
Credentials credentials = GoogleCredentials.fromStream(
        Files.newInputStream(Paths.get(environmentConfig.getServiceAccount())));
Storage storage = StorageOptions.newBuilder()
        .setProjectId(environmentConfig.getProjectID())
        .setCredentials(credentials)
        .build()
        .getService();

// blobInfo (built elsewhere) describes the destination bucket and object name
try (WriteChannel writer = storage.writer(blobInfo);
     GZIPOutputStream gzipOutputStream = new GZIPOutputStream(Channels.newOutputStream(writer))) {
    gzipOutputStream.write(objectData); // no need to wrap the array in a ByteBuffer and unwrap it again
} catch (IOException ex) {
    log.error("Upload Error: {}", ex.getMessage());
}
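Building on the second answer, here is a minimal sketch of the full flow the question describes (read the source object, gzip it on the fly, write it to the archive bucket, then delete the original) using the newer google-cloud-storage client. The bucket and object names come from the question; the GcsGzipArchiver class name and the 64 KB buffer size are arbitrary choices, and error handling is kept minimal for brevity.
import com.google.cloud.ReadChannel;
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.util.zip.GZIPOutputStream;

public class GcsGzipArchiver {
    public static void main(String[] args) throws Exception {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        BlobId source = BlobId.of("my-bucket", "log.json");
        BlobInfo target = BlobInfo.newBuilder("archived-logs", "my-bucket/log.json.gz")
                .setContentType("application/json")
                .setContentEncoding("gzip") // lets GCS transparently decompress on download
                .build();

        try (ReadChannel reader = storage.reader(source);
             WriteChannel writer = storage.writer(target);
             InputStream in = Channels.newInputStream(reader);
             OutputStream out = new GZIPOutputStream(Channels.newOutputStream(writer))) {
            byte[] buffer = new byte[64 * 1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead); // compresses as it streams; the file is never held fully in memory
            }
        }

        // remove the original only after the compressed copy has been written
        storage.delete(source);
    }
}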

Related

File not found when I try to download a .csv file from remote server to local browser

I have a remote Linux server (Debian 9.2) with Tomcat 9. I deployed a web app on the server that generates a .csv file, which the user can connect to the server and download.
When the program runs on localhost it works well; however, when it runs on the remote server, the browser says: file not found.
Here is my code:
private void writeFile(String nomeFile, String content, HttpServletResponse response) throws IOException {
    response.setContentType("text/csv");
    response.setHeader("Content-disposition", "attachment; filename=" + nomeFile);
    String filename = nomeFile;
    try {
        File file = new File(filename);
        FileWriter fw = new FileWriter(file);
        BufferedWriter bw = new BufferedWriter(fw);
        bw.write(content);
        bw.flush();
        bw.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // This should send the file to the browser
    ServletOutputStream out = response.getOutputStream();
    FileInputStream in = new FileInputStream(filename);
    byte[] buffer = new byte[4096];
    int length;
    while ((length = in.read(buffer)) > 0) {
        out.write(buffer, 0, length);
    }
    in.close();
    out.flush();
}
I'm trying to debug this, but I do not know where the error could be. The servlet that implements the code runs fine on localhost. Why does it fail on the remote server?
If you have the same code on the local and the remote machine and it works on one but not the other, it usually means one of two things:
1) the Tomcat configuration on localhost differs from the remote host, which can cause Tomcat to resolve the file to a "different" folder, OR
2) the file exists only on localhost and is not present on the remote host.
In your browser's DevTools, open the Network tab and check the exact request string sent to the remote server to get the file.
Also check your context path on the remote server; it may differ from localhost when you deploy the project.
Please post where you put the CSV file and its path.
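One way to sidestep path differences between environments entirely is to stream the CSV content straight to the response instead of writing it to a file on the server first. A minimal sketch, reusing the method signature and variable names from the question:
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServletResponse;

private void writeFile(String nomeFile, String content, HttpServletResponse response) throws IOException {
    response.setContentType("text/csv");
    response.setHeader("Content-Disposition", "attachment; filename=\"" + nomeFile + "\"");
    // write the CSV content directly to the client; no file is created on the server's filesystem
    try (PrintWriter writer = response.getWriter()) {
        writer.write(content);
    }
}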

Firebase: difference between putBytes and putFile, which uploads faster?

I am using Cloud Storage for Firebase. I am a little confused about the fastest way to upload an image file: using a byte array or using a file.
try {
    Uri uri = Uri.parse(UriList.get(Imagecount_update));
    bmp = MediaStore.Images.Media.getBitmap(getContentResolver(), uri);
} catch (IOException e) {
    Log.d("PrintIOExeception", "***** " + e.toString());
    e.printStackTrace();
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
bmp.compress(Bitmap.CompressFormat.JPEG, 25, baos);
byte[] data = baos.toByteArray();
mStorageReference = FirebaseStorage.getInstance().getReference();
StorageReference riversRef = mStorageReference.child("images/" + String.valueOf(System.currentTimeMillis()));
UploadTask uploadTask;
uploadTask = riversRef.putBytes(data);
(or)
uploadTask = riversRef.putFile(data);
Which one is the faster way to upload images: uploadTask = riversRef.putBytes(data); or uploadTask = riversRef.putFile(data);?
It's not a matter of which one is faster; they do different things. putFile uploads from a Uri (typically the path to a file on the client's local system, prefixed with "file://"), meaning the SDK reads the file from that location and sends its contents to the server. putBytes accepts a byte[] that you have already produced from the file and hands those bytes to the server yourself.
See here; the API shows the difference.
There is also putStream, which can accept things like an in-memory stream and COULD make processing the file faster on the client side, but the speed of the actual upload depends entirely on the connection speeds of the client and the server; no one method uploads or downloads any faster than another.
In conclusion, to answer your question, I would personally just use putFile() for images, since putFile() most likely handles the byte[] logic for you on the back end.
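For reference, a minimal sketch of both call shapes, reusing the reference names from the question (baos and UriList are assumed to exist as in the question's code); note that putFile expects a Uri, while putBytes expects the raw byte[]:
import android.net.Uri;
import com.google.firebase.storage.FirebaseStorage;
import com.google.firebase.storage.StorageReference;
import com.google.firebase.storage.UploadTask;

StorageReference riversRef = FirebaseStorage.getInstance().getReference()
        .child("images/" + System.currentTimeMillis());

// Option 1: upload compressed bytes already held in memory
byte[] data = baos.toByteArray();
UploadTask bytesTask = riversRef.putBytes(data);

// Option 2: hand the SDK a Uri and let it read the file itself
Uri fileUri = Uri.parse(UriList.get(Imagecount_update));
UploadTask fileTask = riversRef.putFile(fileUri);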

"Insufficient system resources exist to complete the requested service" Crawling with Apache Tika on a network folder

I am using a custom crawler that calls Apache Tika to extract text from files of different formats, up to 200 MB in size, on a Windows shared resource (accessed via its UNC path). At some point, after successfully crawling nearly 400k files, I start seeing this error over and over in the logs:
FileNotFoundException: Insufficient system resources exist to complete
the requested service
My code:
public String parseToString(String path, InputStream stream, int maxLength) throws IOException {
    WriteOutContentHandler handler = new WriteOutContentHandler(maxLength);
    try {
        ParseContext context = new ParseContext();
        context.set(Parser.class, super.getParser());
        super.getParser().parse(
                stream, new BodyContentHandler(handler), new Metadata(), context);
    } catch (SAXException | TikaException e) {
        TextExtractor.logger.warning(
                String.format("File processing Error in %s trying to get the text anyway: %s",
                        path, Utils.exceptionStacktraceToString(e.getCause())));
        return handler.toString();
    } finally {
        stream.close();
    }
    return handler.toString();
}
Note: running the crawler on the machine that stores the files locally successfully processes all of them, small and large alike.
Using Apache Tika 1.13 and Java 8.

Resumable upload with the Client Library for Google Cloud Storage

Earlier, I asked a question https://stackoverflow.com/questions/35581090/can-i-use-resumable-upload-for-gae-blobstore-api
about resumable uploading with the Blobstore API.
For myself, I decided that it is impossible to implement resumable uploading with the Blobstore API.
So now I am trying to use Google Cloud Storage with the Java client library. At the moment I can upload my video file to a bucket and serve the video. My servlet looks like the one in the Google example:
@Override
public void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    GcsOutputChannel outputChannel =
            gcsService.createOrReplace(getFileName(req), GcsFileOptions.getDefaultInstance());
    copy(req.getInputStream(), Channels.newOutputStream(outputChannel));
}

private GcsFilename getFileName(HttpServletRequest req) {
    String[] splits = req.getRequestURI().split("/", 4);
    if (!splits[0].equals("") || !splits[1].equals("gcs")) {
        throw new IllegalArgumentException("The URL is not formed as expected. " +
                "Expecting /gcs/<bucket>/<object>");
    }
    return new GcsFilename(splits[2], splits[3]);
}

private void copy(InputStream input, OutputStream output) throws IOException {
    try {
        byte[] buffer = new byte[BUFFER_SIZE];
        int bytesRead = input.read(buffer);
        while (bytesRead != -1) {
            output.write(buffer, 0, bytesRead);
            bytesRead = input.read(buffer);
        }
    } finally {
        input.close();
        output.close();
    }
}
Now I need to implement:
resumable upload (because of poor internet connections on mobile devices)
uploading in chunks (because of the 32 MB limit on the size of a single request)
I realize that the server side of a resumable upload has to be organized manually, and my backend should be able to report the range of the chunks already uploaded and allow writing to continue into the OutputChannel.
The documentation for the GcsOutputChannel says:
This class is serializable, this allows for writing part of a file,
serializing the GcsOutputChannel deserializing it, and continuing to
write to the same file. The time for which a serialized instance is
valid is limited and determined by the Google Cloud Storage service
I do not have much experience, so the question may be naive: can somebody please tell me how to serialize my GcsOutputChannel? I do not understand where I can save the serialized object.
By the way, does anyone know how long the Google Cloud Storage service keeps such a serialized instance valid?
You can serialize your GcsOutputChannel using any Java serialization means (typically an ObjectOutputStream). If you run on App Engine, you probably want to save the serialized bytes in the Datastore (as a Datastore Blob). See this link for how to convert a serialized object to and from a byte array.
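A minimal sketch of that round trip, assuming the App Engine Datastore API; the UploadStateStore class name, the "UploadState" entity kind, the "channel" property, and the uploadId key are illustrative choices, not part of any library:
import com.google.appengine.api.datastore.Blob;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.tools.cloudstorage.GcsOutputChannel;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class UploadStateStore {

    private static final String KIND = "UploadState"; // illustrative entity kind

    // serialize the channel and store it in the Datastore under the given upload id
    public static void save(String uploadId, GcsOutputChannel channel) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(channel);
        }
        Entity entity = new Entity(KIND, uploadId);
        entity.setUnindexedProperty("channel", new Blob(bytes.toByteArray()));
        DatastoreServiceFactory.getDatastoreService().put(entity);
    }

    // load and deserialize the channel so writing can continue where it left off
    public static GcsOutputChannel load(String uploadId)
            throws IOException, ClassNotFoundException, EntityNotFoundException {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Entity entity = datastore.get(KeyFactory.createKey(KIND, uploadId));
        Blob blob = (Blob) entity.getProperty("channel");
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob.getBytes()))) {
            return (GcsOutputChannel) in.readObject();
        }
    }
}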

Apache FTPClient - incomplete file retrieval on Linux, works on Windows

I have a Java application on WebSphere that is using Apache Commons FTPClient to retrieve files from a Windows server via FTP. When I deploy the application to WebSphere running in a Windows environment, I am able to retrieve all of the files cleanly. However, when I deploy the same application to WebSphere on Linux, there are cases where I get incomplete or corrupt files. These cases are consistent, though, such that the same files fail every time and give back the same number of bytes (usually just a few bytes less than what I should be getting). I would say that I can read approximately 95% of the files successfully on Linux.
Here's the relevant code...
ftpc = new FTPClient();
ftpc.enterLocalPassiveMode();
// set the timeouts to 30 seconds
ftpc.setDefaultTimeout(30000);
ftpc.setDataTimeout(30000);
try
{
    String ftpServer = CoreApplication.getProperty("ftp.server");
    String ftpUserID = CoreApplication.getProperty("ftp.userid");
    String ftpPassword = CoreApplication.getProperty("ftp.password");

    log.debug("attempting to connect to ftp server = " + ftpServer);
    log.debug("credentials = " + ftpUserID + "/" + ftpPassword);

    ftpc.connect(ftpServer);
    boolean login = ftpc.login(ftpUserID, ftpPassword);
    if (login)
    {
        log.debug("Login success...");
    }
    else
    {
        log.error("Login failed - connecting to FTP server = " + ftpServer + ", with credentials " + ftpUserID + "/" + ftpPassword);
        throw new Exception("Login failed - connecting to FTP server = " + ftpServer + ", with credentials " + ftpUserID + "/" + ftpPassword);
    }

    is = ftpc.retrieveFileStream(fileName);

    ByteArrayOutputStream out = null;
    try {
        out = new ByteArrayOutputStream();
        IOUtils.copy(is, out);
    } finally {
        IOUtils.closeQuietly(is);
        IOUtils.closeQuietly(out);
    }
    byte[] bytes = out.toByteArray();
    log.info("got bytes from input stream - byte[] size is " + bytes.length);
Any assistance with this would be greatly appreciated.
Thanks.
I have a suspicion that the FTP transfer might be using ASCII rather than binary transfer mode, and mapping what it thinks are Windows end-of-line sequences in the files to Unix end-of-lines. For files that really are text, this will work. For files that are really binary, the result will be corruption and a slightly shorter file if the file contains certain sequences of bytes.
See FTPClient.setFileType(...).
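For example, forcing binary mode after a successful login and before retrieving the stream is a one-line change; a sketch using the commons-net constants and the variable names from the question:
import org.apache.commons.net.ftp.FTP;

// switch to binary mode so line-ending bytes are never rewritten by the transfer
ftpc.setFileType(FTP.BINARY_FILE_TYPE);
is = ftpc.retrieveFileStream(fileName);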
FOLLOWUP
... so why this would work on Windows and not Linux remains a mystery for another day.
The mystery is easy to explain. You were FTP'ing files from a Windows machine to a Windows machine, so there was no need to change the end-of-line markers.
