Can MongoDB be used as a data source for Apache Flink to process streaming data?
Does Apache Flink have a native implementation for using a NoSQL database as a data source?
Currently, Flink does not have a dedicated connector to read from MongoDB. What you can do is the following:
Use StreamExecutionEnvironment.createInput and provide a Hadoop input format for MongoDB using Flink's wrapper input format
Implement your own MongoDB source by implementing SourceFunction/ParallelSourceFunction
The former should give you at-least-once processing guarantees since the MongoDB collection is completely re-read in case of a recovery. Depending on the functionality of the MongoDB client, you might be able to implement exactly-once processing guarantees with the latter approach.
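A minimal sketch of the second approach (a custom SourceFunction), assuming the standard MongoDB Java driver; the connection URI, database/collection names, and the Document-to-JSON mapping are placeholders, and checkpointing for stronger processing guarantees is deliberately left out:

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

// Hypothetical example: emits every document of a collection as a JSON string.
public class MongoDbSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Connection details are placeholders.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("events");
            for (Document doc : collection.find()) {
                if (!running) {
                    break;
                }
                // Emit under the checkpoint lock so state updates could be added later.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(doc.toJson());
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

Such a source would then be attached with env.addSource(new MongoDbSource()) and consumed like any other DataStream.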
When introducing parallel processing to an application where multiple save-entity calls are being made, I see the prior dev chose to do it via Spring Integration using split().channel(MessageChannels.executor(Executors.newFixedThreadPool(10))).handle("saveClass","saveMethod").aggregate().get() - where this method is mapped to a requestChannel using the @Gateway annotation. My question is: this task seems simpler to do using parallelStream() and forEach(). Does IntegrationFlow provide any benefit in this case?
If you really are doing plain in-memory data processing where Java's Stream API is enough, then indeed you don't need a whole messaging solution like Spring Integration. But if you have distributed requirements to process data from different systems, like from HTTP to Apache Kafka or a database, then it is better to use a tool that lets you smoothly connect everything together. Also: nothing stops you from using the Stream API inside a Spring Integration application; in the end all your code is Java anyway. Please learn more about what EIP is and why we need a special framework to implement these messaging-based solutions: https://www.enterpriseintegrationpatterns.com/
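For comparison, a minimal in-memory version of that fan-out using only the Stream API; the "saveClass"/"saveMethod" bean from the question is represented here by a hypothetical Consumer:

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the "saveClass"/"saveMethod" bean from the question.
public class BulkSaver<T> {

    private final Consumer<T> saveMethod;

    public BulkSaver(Consumer<T> saveMethod) {
        this.saveMethod = saveMethod;
    }

    public void saveAll(List<T> entities) {
        // parallelStream() runs on the common ForkJoinPool rather than a dedicated
        // Executors.newFixedThreadPool(10); fine for plain in-memory fan-out, but you
        // give up the channels, error handling and monitoring an IntegrationFlow provides.
        entities.parallelStream().forEach(saveMethod);
    }
}
```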
I am looking at streaming query results section of the Spring documentation. Does this functionality fetch all the data at once but provide it as a stream? Or does it fetch data incrementally so that it will be more memory efficient?
If it doesn't fetch data incrementally, is there any other way to achieve this with Spring Data JPA?
It depends on your platform.
Instead of simply wrapping the query results in a Stream, data-store-specific methods are used to perform the streaming.
With MySQL, for example, the streaming is performed in a truly streaming fashion; of course, if the underlying datastore (or the driver being used) doesn't support such a mechanism (yet), it won't make a difference.
MySQL is, IIRC, currently the only driver that can provide streaming in this fashion without additional configuration, whereas other databases/drivers go with the standard fetch-size setting described by the venerable Vlad Mihalcea here: https://vladmihalcea.com/whats-new-in-jpa-2-2-stream-the-result-of-a-query-execution/ (note the trade-off between performance and memory use). Other databases will most likely need a reactive database client to perform true streaming at all.
Whatever the underlying streaming method, what matters most is how you process the stream. Using Spring's StreamingResponseBody, for example, would allow you to stream large amounts of data directly from the database to the client with minimal memory use. Still, it's a very specific use case, so don't start streaming everything just yet unless you're sure it's worth it.
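A typical sketch of such a streaming repository method, assuming Hibernate as the JPA provider; the Customer entity, the query, and the fetch size of 50 are placeholders:

```java
import java.util.stream.Stream;

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.QueryHint;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;

// Minimal placeholder entity for the example.
@Entity
class Customer {
    @Id
    Long id;
    String name;
}

public interface CustomerRepository extends JpaRepository<Customer, Long> {

    // The fetch size is a memory-vs-round-trips trade-off; with MySQL, true streaming
    // additionally needs the driver-specific Integer.MIN_VALUE fetch size.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "50"))
    @Query("select c from Customer c")
    Stream<Customer> streamAll();
}
```

The returned Stream has to be consumed inside a transaction and closed when done, e.g. in a try-with-resources block within a @Transactional(readOnly = true) service method.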
I'm implementing the TUS protocol to upload large files, using GridFS as the persistence layer for the binary data. The idea is that the server will receive the data in chunks and append every new chunk to an existing resource. All the chunks will have the same size except for the last one.
I found this workaround here showing the idea of how to implement it myself, but I'm wondering if there is a way to append new chunks of binary data to an existing file using GridFsTemplate or another abstraction in the Spring Data MongoDB project.
GridFS is a MongoDB-specific implementation. While it could make sense to have appendable chunks in MongoDB's GridFS, the folks over at MongoDB are the right people to talk to in the first place.
Spring Data MongoDB can only implement such functionality if the driver provides it.
Although it's possible to work with MongoDB's file chunks directly, doing so would pull implementation details into Spring Data MongoDB and bind the library to a particular implementation of GridFS. Spring Data isn't maintained by MongoDB but by the Spring team, which isn't involved in any change process happening within the scope of MongoDB. So if GridFS undergoes any changes in the future, this could break Spring Data MongoDB's support for appendable chunks.
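For completeness, a rough sketch of the kind of application-level workaround the question links to: store each incoming chunk as its own GridFS file, tagged with an upload id and sequence number, and stitch the chunks back together on read. GridFsTemplate is real Spring Data MongoDB API; the naming/metadata scheme is entirely hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.bson.Document;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.gridfs.GridFsTemplate;

import com.mongodb.client.gridfs.model.GridFSFile;

public class ChunkedUploadStore {

    private final GridFsTemplate gridFs;

    public ChunkedUploadStore(GridFsTemplate gridFs) {
        this.gridFs = gridFs;
    }

    // Store one TUS chunk as a separate GridFS file, keyed by upload id + sequence number.
    public void appendChunk(String uploadId, int sequence, InputStream chunk) {
        Document metadata = new Document("uploadId", uploadId).append("sequence", sequence);
        gridFs.store(chunk, uploadId + "-" + sequence, metadata);
    }

    // Re-assemble the upload by streaming the chunk files back in sequence order.
    public InputStream readUpload(String uploadId) throws IOException {
        List<GridFSFile> chunkFiles = new ArrayList<>();
        gridFs.find(Query.query(Criteria.where("metadata.uploadId").is(uploadId)))
              .into(chunkFiles);
        chunkFiles.sort(Comparator.comparingInt(
                (GridFSFile f) -> f.getMetadata().getInteger("sequence")));

        List<InputStream> streams = new ArrayList<>();
        for (GridFSFile file : chunkFiles) {
            streams.add(gridFs.getResource(file.getFilename()).getInputStream());
        }
        return new SequenceInputStream(Collections.enumeration(streams));
    }
}
```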
We are building an Apache Flink based data stream processing application in Java 8. We need to maintain a stateful list of objects whose characteristics are updated every ten seconds via a source stream.
Per the specs we must, if possible, use no distributed storage. So my question is about Flink's memory management: in a cluster configuration, does it replicate the memory used by a task manager? Or is there any way to use a distributed in-memory solution with Flink?
Have a look at Flink state. This way you can keep the objects in Flink's own state, which is integrated with internal mechanisms like checkpointing/savepointing.
If you need to query it externally from other services, queryable state can be a good addition.
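A minimal sketch of keeping such per-object state in Flink's keyed state; the tuple layout (object id, characteristic value) and the emit-on-change logic are just assumptions for the example:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical: input is (objectId, characteristic value); the latest value per object
// is kept in keyed state, which Flink checkpoints/savepoints for you.
public class LatestCharacteristic
        extends RichFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {

    private transient ValueState<Double> latest;

    @Override
    public void open(Configuration parameters) {
        latest = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-characteristic", Double.class));
    }

    @Override
    public void flatMap(Tuple2<String, Double> update, Collector<Tuple2<String, Double>> out)
            throws Exception {
        Double previous = latest.value();
        if (previous == null || !previous.equals(update.f1)) {
            latest.update(update.f1);   // persisted and restored by Flink on recovery
            out.collect(update);        // emit only when the characteristic changed
        }
    }
}
```

Applied after a keyBy on the object id, this state is partitioned across the task managers (and checkpointed), rather than replicated in each task manager's memory.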
I have a use case in which I have 3M records in my MongoDB.
I want to aggregate data based on some condition.
I found two ways to accomplish it:
Using MongoDB's map-reduce query
Using Apache Spark's map-reduce by connecting MongoDB to Spark.
I successfully executed my use case using both methods and found their performance to be similar.
My question is:
Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce using Spark or native MongoDB map-reduce) is more efficient?
In the broad sense of the map-reduce algorithm, yes, although implementation-wise they are different (i.e. JavaScript vs. Java JARs).
If your question is more about finding out which of the two suits your use case, you should consider other aspects, especially since you've found the two to be similar in performance. Let's explore below:
Assuming you have the resources (time, money, servers) and the expertise to maintain an Apache Spark cluster alongside a MongoDB cluster, then having a separate processing framework (Spark) and data storage (MongoDB) is ideal: the MongoDB servers' CPU/RAM is reserved for database querying, while the Spark nodes' CPU/RAM is reserved for intensive ETL. Afterwards, write the result of the processing back into MongoDB.
If you are using the MongoDB Connector for Apache Spark, you can take advantage of the Aggregation Pipeline and (secondary) indexes so that Spark only pulls the range of data it needs, as opposed to pulling unnecessary data into the Spark nodes, which means more processing overhead, higher hardware requirements, and more network latency. See the sketch after the resource list below.
You may find the following resources useful:
MongoDB Connector for Spark: Getting started - contains an example of aggregation.
MongoDB Spark Connector Java API
M233: Getting started with Spark and MongoDB - free online course
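A rough sketch of the pipeline pushdown described above, using the connector's Java API (connector 2.x style); the URIs, database/collection names, and the $match condition are placeholders:

```java
import java.util.Collections;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.Document;

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

public class MongoAggregationJob {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("mongo-aggregation")
                // Placeholder URIs pointing at the source and target collections.
                .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.events")
                .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/mydb.results");

        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // The pipeline is pushed down to MongoDB, so only matching documents
            // (ideally served by a secondary index on "status") reach the Spark nodes.
            JavaMongoRDD<Document> rdd = MongoSpark.load(jsc)
                    .withPipeline(Collections.singletonList(
                            Document.parse("{ $match: { status: \"ACTIVE\" } }")));

            // ... Spark-side processing on rdd would go here ...

            // Write the result (here just the filtered documents) back to MongoDB.
            MongoSpark.save(rdd);
        }
    }
}
```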
If you don't have the resources and expertise to maintain a Spark cluster, then keep it in MongoDB. It's worth mentioning that for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface than MongoDB's map-reduce. If you can convert your map-reduce into an aggregation pipeline, I would recommend doing so. Also see Aggregation Pipeline Optimisation for extra optimisation tips.
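For instance, a map-reduce that emits a value per key and sums the emitted values usually maps onto a $match plus $group pipeline; a hedged sketch with the MongoDB Java driver, where the database, collection, and field names are placeholders:

```java
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.eq;

import java.util.Arrays;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class AggregationPipelineExample {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    client.getDatabase("mydb").getCollection("events");

            // Equivalent of a map-reduce that emits (category, amount) and sums the values:
            // filter first so an index on "status" can be used, then group and sum.
            for (Document result : events.aggregate(Arrays.asList(
                    match(eq("status", "ACTIVE")),
                    group("$category", sum("total", "$amount"))))) {
                System.out.println(result.toJson());
            }
        }
    }
}
```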
If your use case doesn't require real-time processing, you can configure a delayed or hidden node of a MongoDB Replica Set, which will serve as a dedicated server/instance for your aggregation/map-reduce processing, separating the processing node(s) from the data-storage node(s). See also Replica Set Architectures.