Spring Data MongoDB - Append binary data to existing GridFS file - java

I'm implementing the TUS protocol to upload large files, using GridFS as the persistence layer for the binary data. The idea is that the server will receive the data in chunks and append every new chunk to an existing resource. All the chunks will have the same size except for the last one.
I found this workaround here showing the idea of how to implement it myself, but I'm wondering if there is a way to append new chunks of binary data to an existing file using GridFsTemplate or another abstraction in the Spring Data MongoDB project.

GridFS is a MongoDB-specific implementation. It could make sense to have appendable chunks in MongoDB's GridFS, so the folks over at MongoDB are the right people to talk to in the first place.
Spring Data MongoDB can only implement such a functionality if the driver provides it.
Although it's possible to work with MongoDB's file chunks directly, doing so would pull implementation details into Spring Data MongoDB and bind the library to a particular implementation of GridFS. Spring Data isn't maintained by MongoDB but by the Spring team, which isn't involved in any change process happening within the scope of MongoDB. So if GridFS undergoes any changes in the future, this could break Spring Data MongoDB's support for appendable chunks.
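If you do decide to implement the workaround yourself, it boils down to writing to the fs.chunks and fs.files collections directly. Below is a minimal sketch of that idea, assuming the default bucket name, default GridFS field names, and a caller that tracks the next chunk index; it is not an official API, it depends on GridFS storage internals, and the helper class is made up for illustration.

import org.bson.Document;
import org.bson.types.Binary;
import org.bson.types.ObjectId;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.springframework.data.mongodb.core.MongoTemplate;

public class GridFsChunkAppender {

    private final MongoTemplate mongoTemplate;

    public GridFsChunkAppender(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Appends one chunk of binary data to an existing GridFS file by writing
    // directly to fs.chunks and fs.files. This bypasses GridFsTemplate and
    // relies on GridFS storage internals (hypothetical helper, not Spring API).
    public void appendChunk(ObjectId fileId, int chunkIndex, byte[] data) {
        MongoCollection<Document> chunks = mongoTemplate.getCollection("fs.chunks");
        MongoCollection<Document> files = mongoTemplate.getCollection("fs.files");

        // Insert the new chunk with the next sequence number "n".
        chunks.insertOne(new Document("files_id", fileId)
                .append("n", chunkIndex)
                .append("data", new Binary(data)));

        // Keep the declared file length in sync with the appended bytes.
        files.updateOne(Filters.eq("_id", fileId), Updates.inc("length", data.length));
    }
}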

Related

Design a Spring Batch application to read data from different resources (flat files)

I am developing a batch application (Spring Boot, Java, and Spring Batch) for which I need to read data from different locations. Below is my use case:
There are multiple paths such as C://Temp//M1 and C://Temp//M2, and both locations can contain identical files with the same data, e.g. C://Temp//M1//File1.txt and C://Temp//M2//File1.txt, or C://Temp//M1//File2.txt and C://Temp//M2//File2.txt.
Before the batch starts, if an identical file exists at both locations I need to merge them in memory, remove duplicates, and pass the merged in-memory data to the reader.
I have designed the batch job using MultiResourceItemReader, which reads the flat files and processes them, but I am not able to achieve the in-memory merging and duplicate removal across multiple files.
Could you please have a look and suggest how I can achieve this?
In my experience, the BeanIO library is invaluable when it comes to dealing with flat files, and it also integrates with Spring Batch.
http://beanio.org/
With regards to reading from two locations, you can:
Implement your reader as a composite that reads the first line from file 1 and then from file 2
Read file 1 through the reader and enrich it with data from file 2 inside the processor
Pre-merge the files (see the sketch below)
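A rough sketch of the pre-merge option in plain Java; the class and method names are my own, and the duplicate check assumes whole lines can be compared for equality:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class FileMerger {

    // Merges two flat files with the same name from two locations, drops duplicate
    // lines, and writes the result to a temp file that a FlatFileItemReader (or
    // MultiResourceItemReader) can consume once the job starts.
    public static Path mergeAndDeduplicate(Path first, Path second) throws IOException {
        // LinkedHashSet keeps insertion order while removing exact duplicates.
        Set<String> merged = new LinkedHashSet<>();
        if (Files.exists(first)) {
            merged.addAll(Files.readAllLines(first));
        }
        if (Files.exists(second)) {
            merged.addAll(Files.readAllLines(second));
        }
        Path target = Files.createTempFile("merged-", ".txt");
        return Files.write(target, merged);
    }

    public static void main(String[] args) throws IOException {
        Path merged = mergeAndDeduplicate(
                Paths.get("C:/Temp/M1/File1.txt"),
                Paths.get("C:/Temp/M2/File1.txt"));
        System.out.println("Merged file written to " + merged);
    }
}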
If you are familiar with Kafka, try the Kafka Connect framework. The Confluent platform makes it easy to use their connectors.
Then consume from Kafka in your Spring application.
https://www.confluent.io/hub
If you are interested in Kafka, I'll explain in more detail.

JSON file to REST API in Spring Boot

I have a JSON file with an array of objects, and I want to read it and create a REST API with a couple of GET methods. What are the best practices for doing so? Should I create an in-memory database (H2), save the JSON objects there, and then build the rest? I am looking for the most efficient solution.
If the data is static and you're just doing GET requests, in your data layer you can simply read the contents of the file into POJOs. Then, if you need to get more sophisticated, you can always change the implementation detail to H2 or some other DB.
If your JSON file is small and does not change frequently, you do not need to put it in H2 or another database. Just read the JSON file from disk once and use it in your REST API endpoints.
Jackson is a good library for processing JSON data in Spring Boot. It offers multiple options to read and consume the JSON data.
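As a small sketch of that approach, assuming a hypothetical Product structure in the JSON file; the file is read once with Jackson and kept in memory for the GET endpoints to use:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class ProductCatalog {

    private final List<Product> products;

    public ProductCatalog(File jsonFile) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        // Read the whole array once at startup and keep it in memory.
        this.products = mapper.readValue(jsonFile, new TypeReference<List<Product>>() {});
    }

    public List<Product> findAll() {
        return products;
    }

    // Minimal POJO matching the assumed JSON structure.
    public static class Product {
        public String id;
        public String name;
    }
}

A @RestController can then simply delegate its GET methods to findAll().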

Spring JPA and Streaming - Is the data fetched incrementally?

I am looking at streaming query results section of the Spring documentation. Does this functionality fetch all the data at once but provide it as a stream? Or does it fetch data incrementally so that it will be more memory efficient?
If it doesn't fetch data incrementally, is there any other way to achieve this with spring data jpa?
It depends on your platform.
Instead of simply wrapping the query results in a Stream, data-store-specific methods are used to perform the streaming.
With MySQL, for example, the streaming is performed in a truly streaming fashion; but of course, if the underlying data store (or the driver being used) doesn't support such a mechanism (yet), it won't make a difference.
MySQL is, IIRC, currently the only driver that provides streaming in this fashion without additional configuration, whereas other databases/drivers go with the standard fetch-size setting as described by the venerable Vlad Mihalcea here: https://vladmihalcea.com/whats-new-in-jpa-2-2-stream-the-result-of-a-query-execution/ (note the trade-off between performance and memory use). Other databases are most likely going to need a reactive database client in order to perform true streaming.
Whatever the underlying streaming method, what matters most is how you process the stream. Using Spring's StreamingResponseBody, for example, would allow you to stream large amounts of data directly from the database to the client with minimal memory use. Still, it's a very specific use case, so don't start streaming everything just yet unless you're sure it's worth it.
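For reference, a streaming repository method roughly looks like the sketch below; the Customer entity, the fetch-size value, and the exporter class are made up for illustration, and on Spring Boot 2 the persistence imports would be javax.persistence instead of jakarta.persistence:

import java.util.stream.Stream;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.QueryHint;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Entity
class Customer {
    @Id
    Long id;
    String name;
}

interface CustomerRepository extends JpaRepository<Customer, Long> {

    // Hint the JDBC fetch size so drivers that honour it fetch rows incrementally
    // instead of materialising the whole result set in memory.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "50"))
    @Query("select c from Customer c")
    Stream<Customer> streamAll();
}

@Service
class CustomerExporter {

    private final CustomerRepository repository;

    CustomerExporter(CustomerRepository repository) {
        this.repository = repository;
    }

    // The stream must be consumed inside an open transaction and closed when done.
    @Transactional(readOnly = true)
    public void export() {
        try (Stream<Customer> customers = repository.streamAll()) {
            customers.forEach(c -> {
                // process one entity at a time
            });
        }
    }
}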

Using spring hibernate, read from read-only data source and write to read/write data source

I have two data sources with exactly the same schema, but one is read-only and the other is read/write. The read-only data source gets updated by an external project. I am planning to use Spring Data with Hibernate to create the entity model classes, read data from the read-only data source, and write to the read/write data source.
Is it doable? Are there any best practices/design patterns for it?
Take a look at: http://spring.io/blog/2007/01/23/dynamic-datasource-routing/
Spring has an AbstractRoutingDataSource that allows you to define multiple data sources for your application and to control which one is read from and which one is written to.
I could go into more depth, but the link will take you to a good discussion about it.
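A bare-bones sketch of that routing idea; the enum, the thread-local switch, and the class name are my own, and in a real application you would typically flip the route in an aspect or transaction hook rather than by hand:

import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Routes JDBC calls to the read-only or read/write DataSource based on a thread-local flag.
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    public enum Route { READ_ONLY, READ_WRITE }

    private static final ThreadLocal<Route> CURRENT =
            ThreadLocal.withInitial(() -> Route.READ_WRITE);

    public static void use(Route route) {
        CURRENT.set(route);
    }

    @Override
    protected Object determineCurrentLookupKey() {
        return CURRENT.get();
    }

    // Wires the two underlying DataSources into the router.
    public static DataSource build(DataSource readOnly, DataSource readWrite) {
        ReadWriteRoutingDataSource routing = new ReadWriteRoutingDataSource();
        Map<Object, Object> targets = new HashMap<>();
        targets.put(Route.READ_ONLY, readOnly);
        targets.put(Route.READ_WRITE, readWrite);
        routing.setTargetDataSources(targets);
        routing.setDefaultTargetDataSource(readWrite);
        routing.afterPropertiesSet();
        return routing;
    }
}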

Ideal place to store Binary data that can be rendered by calling a url

I am looking for an ideal (performant and maintainable) place to store binary data; in my case these are images. I have to do some image processing, scale the images, and store them in a suitable place which can be accessed via a RESTful service.
From my research so far I have a few options, like:
A NoSQL solution like MongoDB/GridFS
Storing as files in a file system in a directory hierarchy and then using a web server to access the images by url
Apache Jackrabbit Document repository
Store in a cache, something like Memcached or a Squid proxy
Any thoughts on which one you would pick and why would be useful, or is there a better way to do it?
Just started using GridFS to do exactly what you described.
From my experience thus far, the main advantage of GridFS is that it obviates the need for a separate file storage system. Our entire persistence layer is already in Mongo, so the next logical step was to store our filesystem there as well. The flat namespacing just rocks and gives you a rich query language to fetch your files based off whatever metadata you want to attach to them. In our app we used an 'appdata' object that embedded all the ownership information.
Another thing to consider with NoSQL file storage, and especially GridFS, is that it will shard and expand along with your other data. If you've got your entire DB key-value store inside the mongo server, then eventually if you ever have to expand your server cluster with more machines, your filesystem will grow along with it.
It can feel a little 'black box' since the binary data itself is split into chunks, a prospect that frightens those used to a classic directory based filesystem. This is alleviated with the help of admin programs like RockMongo.
All in all, storing images in GridFS is as easy as inserting the docs themselves; most of the drivers for all the major languages handle everything for you. In our environment we took image uploads at one endpoint and used PIL to perform resizing. The images were then fetched from Mongo at another endpoint that just output the data and mimetyped it as a JPEG.
Best of luck!
EDIT:
To give you an example of a trivial file upload with GridFS, here's the simplest approach in PyMongo, the Python driver.
from pymongo import MongoClient
import gridfs

binary_data = b'Hello, world!'

db = MongoClient().test_db
fs = gridfs.GridFS(db)

# The filename kwarg sets the filename in the mongo doc, but you can pass anything in
# and make custom key-values too.
file_id = fs.put(binary_data, filename='helloworld.txt', anykey='foo')

output = fs.get(file_id).read()
print(output)
# b'Hello, world!'
You can also query against your custom values if you like, which can be REALLY useful if you want your queries to be based off custom information relative to your application.
try:
    grid_file = fs.get_last_version(anykey='foo')
    return grid_file.read()
except gridfs.errors.NoFile:
    return None
These are just some simple examples, and the drivers for a lot of the other languages (PHP, Ruby, etc.) all have equivalents.
I would go for Jackrabbit in combination with its REST framework Sling: http://sling.apache.org
Sling allows you to upload/download files via REST calls or WebDAV, while the underlying Jackrabbit repository gives you performant storage with the possibility to store your files in a tree structure (or flat, if you like).
Both Jackrabbit and Sling support an event mechanism where you can asynchronously process the image after upload, e.g. to create thumbnails.
The manual at http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html describes how to manipulate data using the REST interface provided by sling.
Storing the images as BLOBs in an RDBMS is another option; you immediately get some guarantees about integrity, security, etc. (if the database is set up properly), and you can store extra metadata and manage the collection with SQL.
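A rough sketch of the RDBMS option with JPA; the entity and field names are hypothetical, and on older stacks the imports would be javax.persistence instead of jakarta.persistence:

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.Lob;

// Hypothetical entity storing an image as a BLOB column alongside its metadata.
@Entity
public class StoredImage {

    @Id
    @GeneratedValue
    private Long id;

    private String contentType;

    @Lob
    private byte[] data;

    // getters and setters omitted for brevity
}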
