Mongodb map reduce vs Apache Spark map reduce - java

I have a use case in which I have 3M records in my MongoDB.
I want to aggregate data based on some condition.
I found two ways to accomplish it:
Using MongoDB's map-reduce query
Using Apache Spark's map-reduce by connecting MongoDB to Spark.
I successfully executed my use case using both of the above methods and found their performance to be similar.
My question is:
Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce using Spark or MongoDB's native map-reduce) is more efficient?

Do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce using Spark or MongoDB's native map-reduce) is more efficient?
In the broad sense of the map-reduce algorithm, yes, although the implementations differ (MongoDB's map-reduce runs JavaScript functions, while a Spark job runs compiled JVM code).
If your question is more about finding out which of the two is more suitable for your use case, you should consider other aspects, especially since you've found both to perform similarly for your use case. Let's explore below:
Assuming that you have the resources (time, money, servers) and the expertise to maintain an Apache Spark cluster alongside your MongoDB cluster, then having a separate processing framework (Spark) and data storage (MongoDB) is ideal: the MongoDB servers' CPU/RAM are dedicated to database querying, while the Spark nodes' CPU/RAM are dedicated to intensive ETL. Afterwards, write the result of the processing back into MongoDB.
If you are using the MongoDB Connector for Apache Spark, you can take advantage of the Aggregation Pipeline and (secondary) indexes so that only the range of data Spark needs is pulled out of MongoDB, as opposed to pulling unnecessary data over to the Spark nodes, which means more processing overhead, higher hardware requirements, and added network latency.
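For illustration, here is a minimal sketch of that pushdown with the MongoDB Spark Connector 2.x Java API; the connection URI, database/collection name, and the status field are assumptions, not something from the original question:

import java.util.Collections;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

public class MongoSparkPushdown {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mongo-spark-pushdown")
                // assumed URI: database "mydb", collection "records"
                .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.records")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Load the collection and push a $match stage down to MongoDB, so only
        // matching documents (ideally served by an index) reach the Spark nodes.
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
        JavaMongoRDD<Document> filtered = rdd.withPipeline(
                Collections.singletonList(
                        Document.parse("{ $match: { status: \"ACTIVE\" } }")));

        System.out.println("Matching documents: " + filtered.count());
        jsc.close();
    }
}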
You may find the following resources useful:
MongoDB Connector for Spark: Getting started - contains an example for aggregation.
MongoDB Spark Connector Java API
M233: Getting started with Spark and MongoDB - free online course
If you don't have the resources and expertise to maintain a Spark cluster, then keep the processing in MongoDB. It's worth mentioning that for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface than MongoDB's map-reduce. If you can convert your map-reduce into an aggregation pipeline, I would recommend doing so. Also see Aggregation Pipeline Optimisation for extra optimisation tips.
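As a rough sketch of such a conversion using the MongoDB Java driver's aggregation API (the collection, field names, and the $match condition are placeholders, standing in for a map-reduce that emits (customerId, amount) pairs and sums them):

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.eq;

public class AggregationInsteadOfMapReduce {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1")) {
            MongoCollection<Document> orders =
                    client.getDatabase("mydb").getCollection("orders");

            // $match filters first (and can use an index), then $group sums the
            // amounts per customer - the same result a map/emit + reduce would give.
            orders.aggregate(Arrays.asList(
                    match(eq("status", "ACTIVE")),
                    group("$customerId", sum("total", "$amount"))
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}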
If your use case doesn't require real-time processing, you can configure a delayed or hidden node of a MongoDB Replica Set, which will serve as a dedicated server/instance for your aggregation/map-reduce processing, separating the processing node(s) from the data-storage node(s). See also Replica Set Architectures.

Related

Spring JPA and Streaming - Is the data fetched incrementally?

I am looking at streaming query results section of the Spring documentation. Does this functionality fetch all the data at once but provide it as a stream? Or does it fetch data incrementally so that it will be more memory efficient?
If it doesn't fetch data incrementally, is there any other way to achieve this with Spring Data JPA?
It depends on your platform.
Instead of simply wrapping the query results in a Stream, data-store-specific methods are used to perform the streaming.
With MySQL, for example, the streaming is performed in a truly streaming fashion; but of course, if the underlying datastore (or the driver being used) doesn't support such a mechanism (yet), it won't make a difference.
MySQL is, IIRC, currently the only driver that can provide streaming in this fashion without additional configuration, whereas other databases/drivers go with the standard fetch size setting, as described by the venerable Vlad Mihalcea here: https://vladmihalcea.com/whats-new-in-jpa-2-2-stream-the-result-of-a-query-execution/ (note the trade-off between performance and memory use). Other databases are most likely going to need a reactive database client to perform true streaming at all.
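For illustration, a minimal sketch of applying that fetch size setting to a streaming Spring Data JPA repository method; the Order entity and the value 50 are assumptions (with MySQL, true row-by-row streaming requires a fetch size of Integer.MIN_VALUE on the JDBC driver):

import java.util.stream.Stream;
import javax.persistence.QueryHint;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.jpa.repository.QueryHints;

// Order is an assumed JPA entity with a Long id.
public interface OrderRepository extends JpaRepository<Order, Long> {

    // "org.hibernate.fetchSize" is Hibernate's JDBC fetch-size hint.
    @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "50"))
    @Query("select o from Order o")
    Stream<Order> streamAll();
}

The returned Stream has to be consumed inside a transaction and closed afterwards (for example with try-with-resources in a @Transactional service method), otherwise the underlying ResultSet and connection are kept open.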
Whatever the underlying streaming method, what matters most is how you process the stream. Using Spring's StreamingResponseBody, for example, would allow you to stream large amounts of data directly from the database to the client with minimal memory use. Still, it's a very specific use case, so don't start streaming everything just yet unless you're sure it's worth it.

DataSet javaRDD() performance

I'm retrieving data from Cassandra in a Spark application using Spark SQL. The data is retrieved as a Dataset. However, I need to convert this Dataset to a JavaRDD using the javaRDD() function. It works, however it takes about 2 hours. Are there any parameters I can adjust to improve this time?
Dataset APIs are built on top of the Spark SQL engine, which uses Catalyst to generate an optimized logical and physical query plan. Across the R, Java, Scala, and Python DataFrame/Dataset APIs, all relational queries go through the same code optimizer, providing space and speed efficiency. Whereas the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias of DataFrame) is even faster and suitable for interactive analysis.
For more details, see Spark RDD vs Dataset performance:
Resilient Distributed Dataset (RDD) is the main abstraction of the Spark framework, while Spark SQL (a Spark module for structured data processing) gives Spark more information about the structure of both the data and the computation being performed, and uses this extra information to perform extra optimizations.
Up until Spark 1.6, RDDs used to perform better than their Spark SQL counterpart, DataFrame (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html); however, the Spark 2.1 upgrades have made Spark SQL considerably more efficient.
Performance comparison chart: https://i.stack.imgur.com/TmhXf.png
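To illustrate where the cost usually comes from, here is a sketch using the Spark Cassandra Connector; the host, keyspace, table, and column names are assumptions:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraDatasetVsRdd {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-dataset-vs-rdd")
                .config("spark.cassandra.connection.host", "127.0.0.1") // assumed host
                .getOrCreate();

        // Read the table through the Spark Cassandra Connector as a Dataset.
        Dataset<Row> ds = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_keyspace")
                .option("table", "my_table")
                .load();

        // Staying in the Dataset/DataFrame API keeps Catalyst optimizations and
        // Tungsten's compact binary encoding.
        Dataset<Row> filtered = ds.filter(ds.col("status").equalTo("ACTIVE"));

        // Converting to a JavaRDD drops those optimizations and forces every row
        // to be deserialized into JVM objects, which is often where the time goes.
        JavaRDD<Row> rdd = filtered.javaRDD();
        System.out.println(rdd.count());
    }
}

If possible, do the filtering and aggregation while still in the Dataset API and only fall back to javaRDD() for logic that genuinely cannot be expressed with Spark SQL.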

MongoDB as datasource to Flink

Can MongoDB be used as a datasource to Apache Flink for processing the Streaming Data?
What is the native way in Apache Flink to use a NoSQL database as a data source?
Currently, Flink does not have a dedicated connector to read from MongoDB. What you can do is the following:
Use StreamExecutionEnvironment.createInput and provide a Hadoop input format for MongoDB using Flink's wrapper input format
Implement your own MongoDB source by implementing SourceFunction/ParallelSourceFunction
The former should give you at-least-once processing guarantees since the MongoDB collection is completely re-read in case of a recovery. Depending on the functionality of the MongoDB client, you might be able to implement exactly-once processing guarantees with the latter approach.
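A minimal sketch of the second approach, a non-parallel SourceFunction reading one collection; the connection string, database, and collection names are assumptions, and checkpointed offsets for fault tolerance are left out:

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;

public class MongoSource implements SourceFunction<Document> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Document> ctx) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("events");
            try (MongoCursor<Document> cursor = coll.find().iterator()) {
                while (running && cursor.hasNext()) {
                    // Emit each document downstream; hold the checkpoint lock here
                    // if you later track offsets for stronger guarantees.
                    ctx.collect(cursor.next());
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

It would then be attached to a job with env.addSource(new MongoSource()).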

How is memory managed in distributed runtime in Apache Flink?

We are building an Apache Flink based data stream processing application in Java 8. We need to maintain a stateful list of objects whose characteristics are updated every ten seconds via a source stream.
Per the specs we must, if possible, use no distributed storage. So, my question is about Flink's memory manager: in a cluster configuration, does it replicate the memory used by a task manager? Or is there any way to use a distributed in-memory solution with Flink?
Have a look at Flink state. This way you can store the list in Flink's managed state, which is integrated with internal mechanisms like checkpointing and savepointing.
If you need to query it externally from other services, queryable state can be a good addition.
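For example, here is a minimal sketch of keyed state using a ValueState that keeps the latest value per key; the key/value types and field names are assumptions:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class LatestValuePerKey
        extends RichFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>> {

    private transient ValueState<Double> latest;

    @Override
    public void open(Configuration parameters) {
        // Registered state is checkpointed/savepointed by Flink automatically.
        latest = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-value", Double.class));
    }

    @Override
    public void flatMap(Tuple2<String, Double> in, Collector<Tuple2<String, Double>> out)
            throws Exception {
        latest.update(in.f1);                          // update the per-key state
        out.collect(Tuple2.of(in.f0, latest.value())); // emit the current state
    }
}

It would be applied on a keyed stream, e.g. stream.keyBy(t -> t.f0).flatMap(new LatestValuePerKey()).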

Selective replication in mongodb

I have two MongoDB instances running on two different servers connected via LAN. I want to replicate records from a few collections on server 1 to collections on server 2. Is there any way to do it? Below is a pictorial representation of what I want to achieve.
Following are the methods I'm considering:
MongoDB replication - but it replicates all collections. Is selective replication possible in MongoDB?
Oplog watcher APIs - please suggest some reliable Java APIs.
Is there any other way to do this, and what is the best way of doing it?
MongoDB does not yet support selective replication, and it sounds as though you are not actually looking for selective replication but rather for selective copying, since replication imposes certain rules on how that server can be used.
I am not sure what you mean by an oplog watcher API, but it is easy enough to read the oplog over time by just querying it:
> use local
> db.oplog.rs.find()
( http://docs.mongodb.org/manual/reference/local-database/ )
and then storing, within a script you write, the latest timestamp of the record you have copied.
You can also use a tailable cursor on the oplog here to effectively listen (pub/sub) to changes and copy them over to your other server.
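As a rough sketch of that tailable-cursor approach with the MongoDB Java driver (the server addresses, the "mydb.orders" namespace, and the insert-only filter are assumptions; persisting the last processed timestamp for resuming is left out):

import org.bson.BsonTimestamp;
import org.bson.Document;
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gt;

public class OplogWatcher {
    public static void main(String[] args) {
        try (MongoClient source = MongoClients.create("mongodb://server1:27017");
             MongoClient target = MongoClients.create("mongodb://server2:27017")) {

            MongoCollection<Document> oplog =
                    source.getDatabase("local").getCollection("oplog.rs");
            MongoCollection<Document> copy =
                    target.getDatabase("mydb").getCollection("orders");

            // Start from "now"; a real script would persist this between runs.
            BsonTimestamp lastTs =
                    new BsonTimestamp((int) (System.currentTimeMillis() / 1000), 0);

            // Tailable cursor on the capped oplog collection: blocks and waits
            // for new entries matching the namespace and operation type.
            try (MongoCursor<Document> cursor = oplog
                    .find(and(eq("ns", "mydb.orders"), eq("op", "i"), gt("ts", lastTs)))
                    .cursorType(CursorType.TailableAwait)
                    .iterator()) {
                while (cursor.hasNext()) {
                    Document entry = cursor.next();
                    copy.insertOne(entry.get("o", Document.class)); // "o" holds the inserted document
                }
            }
        }
    }
}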
