I am using the Java MongoDB connector to run a Hadoop MapReduce job against MongoDB.
I am setting the input and output URIs with MongoConfigUtil:
MongoConfigUtil.setInputURI( conf, "mongodb://host/db.collection" );
MongoConfigUtil.setOutputURI( conf, "mongodb://host/db.collectionOut" );
And the job correctly fetches all the documents in the specified collection.
Is there a way to limit the number of fetched documents?
I want to achieve this query (Mongo shell style):
db.collection.find().limit(1000)
I know MongoConfigUtil has a setQuery method, but how can I set a limit on the query? Any hints?
I tried to add
MongoConfigUtil.setLimit(conf, 1000)
But I still get all the documents in the collection.
setSplitSize defaults to 8 MB, and this property has higher priority than setLimit (mongo.input.limit).
Example: mongoConfig.setSplitSize(5); // in MB, 8 MB is the default
In the example above I set the value to 5 MB.
setLimit applies the stated limit (for example 1000) to each chunk (split) fetched by each Mapper; in other words, it is a per-split query limit.
I think you want to limit the query for the entire MapReduce process.
setQuery is the query inside find() and it must be written in JSON format, as in MongoDB. As far as I know, you can't put a limit inside the Mongo query (find()).
You can filter the query another way instead, for example { fieldName: { $lt: 20 } }, depending on your case. Alternatively, you could create a separate collection containing only your limited set (using projection) and then run MapReduce there.
Finally, setQuery is used to filter the collection.
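For example, here is a minimal sketch of filtering instead of limiting, assuming your MongoConfigUtil.setQuery overload accepts a DBObject (check the exact signature in your mongo-hadoop version):

// BasicDBObject comes from the MongoDB Java driver (com.mongodb.BasicDBObject).
MongoConfigUtil.setInputURI(conf, "mongodb://host/db.collection");
// Keep only documents whose fieldName is below 20, instead of limiting the count.
MongoConfigUtil.setQuery(conf, new BasicDBObject("fieldName", new BasicDBObject("$lt", 20)));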
I found the solution: use the setLimit method of the class MongoInputSplit, passing the number of documents that you want to fetch.
MongoInputSplit myMongoInputSplitObj = new MongoInputSplit(/* params */);
myMongoInputSplitObj.setLimit(100);
See also MongoConfigUtil setLimit: "Allow users to set the limit on MongoInputSplits" (HADOOP-267).
Related
I'm using the v2 AWS DynamoDB Java SDK, and I want to limit the number of results returned when querying by the partition key (code snippet below), but the code below returns the full set of items.
The Javadocs say "Note: The limit does not refer to the number of items to return, but how many items the database should evaluate while executing the query. Use limit together with Page.lastEvaluatedKey() and exclusiveStartKey in subsequent query calls to evaluate limit items per call.", which seems to match the behavior I'm seeing.
However, How to set limit of matching items returned by DynamoDB using Java? has a solution using the .withMaxResultSize method in an earlier version of the SDK.
Does the DynamoDB v2 Java SDK have something similar, or will I have to limit the result set manually?
Code looks like:
QueryConditional conditional = QueryConditional.keyEqualTo(
        Key.builder()
                .partitionValue(jobId)
                .build()
);
QueryEnhancedRequest request = QueryEnhancedRequest.builder()
        .queryConditional(conditional)
        .limit(1)
        .scanIndexForward(false)
        .build();
Please read https://github.com/aws/aws-sdk-java-v2/issues/1951
You need to limit the number of pages returned from the iterable. Example:
PageIterable<MyMovie> myMovie = moviesTable.query(queryEnhancedRequest);
myMovie.items()
        .stream()
        .limit(2)
        .forEach(content -> System.out.printf("Movie: %s (%s)%n", content.title, content.year));
This will fetch 2 pages, and each page will have 1 item if you set limit(1) in the queryEnhancedRequest.
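If you only need the items from the first page (at most the limit you set on the request), a hedged alternative using the same moviesTable and queryEnhancedRequest is:

// Page comes from software.amazon.awssdk.enhanced.dynamodb.model.
// The page iterable streams pages; take just the first one.
Page<MyMovie> firstPage = moviesTable.query(queryEnhancedRequest)
        .stream()
        .findFirst()
        .orElse(null);
List<MyMovie> items = firstPage == null ? Collections.emptyList() : firstPage.items();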
I want to use Apache Spark on my cluster, which is made up of 5 low-end machines. First I set up Cassandra 3.11.3 on my nodes, and all of the nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and all is OK too.
Now I want to execute a simple query like the following:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes have weak hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), where I think Spark fetches all the data and then applies the limit function, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I could use Spark Core (C) too. I also know that a method called perPartitionLimit was implemented in Cassandra 3.6 and later (D).
As you know, since my nodes are weak, I don't want to fetch all the records from the Cassandra table and then apply a limit or something like that. I want to fetch only a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
Update:
I tried the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.set("spark.cassandra.connection.host","192.168.107.100");
long limit=20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
.cassandraTable("myKeySpace", "myTbl")
.select("id").perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); //output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work, and all the records are fetched!
In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "key_space_name");
                put("table", "table_name");
                put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // should depend on the bucket_timestamp column instead
            }
        }).mode(SaveMode.Overwrite).save();
One possible way I thought about is, for each possible bucket_timestamp, to filter the data by timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I found some solutions for doing this with RDDs in Scala, but not for Java and DataFrames.
Thanks!
From Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
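A minimal sketch of computing that column in Java, assuming bucket_timestamp holds epoch seconds and CONST_TTL is expressed in seconds (both names come from the question; the "ttl" column name is just a placeholder):

import static org.apache.spark.sql.functions.*;

// ttl = CONST_TTL - (now - bucket_timestamp), all in seconds
Dataset<Row> withTtl = df.withColumn(
        "ttl",
        lit(CONST_TTL).minus(unix_timestamp().minus(col("bucket_timestamp"))));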
For use case you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
For the DataFrame API there is no support for such functionality yet... There is a JIRA for it (https://datastax-oss.atlassian.net/browse/SPARKC-416); you can watch it to get notified when it's implemented...
So the only choice you have is to use the RDD API, as described in @bartosz25's answer...
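For reference, a rough sketch of what that RDD-based write could look like in Java, using only the option name from the list above (withPerRowTTL); the bean class, the "ttl" field, and the exact builder signature are assumptions to verify against your connector version:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// rowsWithTtl is assumed to be a JavaRDD<MyRowBean> whose ttl field was
// computed with the rule from the question (in seconds).
javaFunctions(rowsWithTtl)
        .writerBuilder("key_space_name", "table_name", mapToRow(MyRowBean.class))
        .withPerRowTTL("ttl") // column-based TTL option listed above; verify the signature
        .saveToCassandra();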
I'm performing a test with Couchbase 4.0 and Java SDK 2.2. I'm inserting 10 documents whose keys always start with "190".
After inserting these 10 documents I query them with:
cb.restore("190", cache);
Thread.sleep(100);
cb.restore("190", cache);
The query within the 'restore' method is:
Statement st = Select.select("meta(c).id, c.*").from(this.bucketName + " c").where(Expression.x("meta(c).id").like(Expression.s(callId + "_%")));
N1qlQueryResult result = bucket.query(st);
The first call to restore returns 0 documents:
Query 'SELECT meta(c).id, c.* FROM cache c WHERE meta(c).id LIKE "190_%"' --> Size = 0
The second call (100ms later) returns the 10 documents:
Query 'SELECT meta(c).id, c.* FROM cache c WHERE meta(c).id LIKE "190_%"' --> Size = 10
I tried adding PersistTo.MASTER to the 'insert' statement, but that doesn't work either.
It seems that the 'insert' is not persisted immediately.
Any help would be really appreciated.
Joan.
You're using N1QL to query the data, and N1QL is only eventually consistent (by default), so the documents only show up after the indexes have been updated. This isn't related to whether or not the data is persisted (meaning: written from RAM to disk).
You can try changing the scan_consistency level from its default, NOT_BOUNDED, to get consistent results, but the query will take longer to return.
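For example, a minimal sketch with the 2.x Java SDK, using REQUEST_PLUS scan consistency (st and bucket are the ones from the question):

// N1qlParams, N1qlQuery and ScanConsistency come from the Couchbase Java SDK 2.x query API.
N1qlParams params = N1qlParams.build().consistency(ScanConsistency.REQUEST_PLUS);
N1qlQueryResult result = bucket.query(N1qlQuery.simple(st, params));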
Read more here:
Java scan_consistency options
I have a table with approximately 62,000,000 rows, and I need to select data from it and export it to .txt or .csv.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it eats all the memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for the DAO, but I can change to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(param); // program crashes here
initFile();
for (Map map : list) {
    util.append(map); // this writes the row to the file
}
closeFile();
How do you suggest I write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP); to get a Map instead of an entity.
You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine / driver supports them. For this to work, be sure you set the following properties:
query.setReadOnly(true);
query.setCacheable(false);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
SomeEntity entity = results.get()[0];
}
results.close();
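A fuller sketch of the export loop, assuming a Hibernate Session named session and reusing the initFile/util.append/closeFile helpers from the question (the HQL and entity name are placeholders); the periodic session.clear() keeps the first-level cache from growing:

Query query = session.createQuery("from MyEntity"); // placeholder HQL
query.setReadOnly(true);
query.setCacheable(false);
query.setFetchSize(1000);                           // hint for the JDBC driver
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
initFile();
int count = 0;
while (results.next()) {
    Object row = results.get()[0];
    util.append(row);                               // write the current row to the file
    if (++count % 1000 == 0) {
        session.clear();                            // evict processed objects from the session
    }
}
results.close();
closeFile();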
Lock the table and then perform subset selections and exports, appending to the results file. Ensure you unconditionally unlock when done.
Not nice, but the task will run to completion even on servers or clients with limited resources.
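A rough sketch of that approach over plain JDBC, assuming a MySQL-style dialect; the table name, key column, lock syntax, and the appendToFile helper are all placeholders to adapt:

// Keyset pagination under a table lock; adjust the SQL to your database.
try (Connection con = dataSource.getConnection();
     Statement lock = con.createStatement()) {
    lock.execute("LOCK TABLES my_table READ");
    try (PreparedStatement ps = con.prepareStatement(
            "SELECT * FROM my_table WHERE id > ? ORDER BY id LIMIT 10000")) {
        long lastId = 0;
        while (true) {
            ps.setLong(1, lastId);
            int rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastId = rs.getLong("id");
                    appendToFile(rs); // placeholder: write the current row to the export file
                    rows++;
                }
            }
            if (rows == 0) break;     // nothing left to export
        }
    } finally {
        lock.execute("UNLOCK TABLES"); // unconditionally release the lock
    }
}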