In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "key_space_name");
                put("table", "table_name");
                put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // Should depend on the bucket_timestamp column
            }
        }).mode(SaveMode.Overwrite).save();
One possible way I thought of is: for each possible bucket_timestamp, filter the data by that timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL will differ for each row?
The solution should work with Java and Dataset<Row>: I encountered some solutions for doing this with RDDs in Scala, but didn't find one for Java and dataframes.
Thanks!
From the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
- a constant value (withConstantTTL)
- an automatically resolved value (withAutoTTL)
- a column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
For a usage example, see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
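A minimal Java sketch of that approach, under these assumptions: df is the Dataset<Row> from the question, CONST_TTL is the configured constant (in seconds), and BucketRow is a serializable JavaBean (not shown) mirroring the table's columns plus an int ttl property. The exact placeholder wiring of withPerRowTTL is best double-checked against the linked test.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.api.java.JavaRDD;

// ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), all in seconds.
JavaRDD<BucketRow> rdd = df.javaRDD().map(row -> {
    long bucketTs = row.getTimestamp(row.fieldIndex("bucket_timestamp")).getTime() / 1000L;
    long now = System.currentTimeMillis() / 1000L;
    int ttl = (int) (CONST_TTL - (now - bucketTs));
    return new BucketRow(row, ttl); // assumed constructor copying the row's columns
});

javaFunctions(rdd)
    .writerBuilder("key_space_name", "table_name", mapToRow(BucketRow.class))
    .withPerRowTTL("ttl") // names the bean property carrying each row's TTL
    .saveToCassandra();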
For the DataFrame API there is no support for such functionality yet... There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416 - you can watch it to get notified when it's implemented...
So the only choice you have is to use the RDD API as described in @bartosz25's answer...
Related
I am using the following code segment to do the insertion using jOOQ's UpdatableRecord.
public void acknowledgeDisclaimer(AcknowledgeDisclaimerReq acknowledgeDisclaimerReq) {
DisclaimerRecord disclaimerRecord = dslContext.newRecord(Disclaimer.DISCLAIMER);
disclaimerRecord.setDisclaimerForId(acknowledgeDisclaimerReq.getDealListingId());
disclaimerRecord.setDisclaimerForType("DEAL");
disclaimerRecord.setAcceptedAt(LocalDateTime.now());
disclaimerRecord.setAcceptedByOwnerId(acknowledgeDisclaimerReq.getLoggedInOwnerId());
int count = disclaimerRecord.store();
log.info("Inserted entry for disclaimer for deal: {}, owner: {}, id {}, insertCount: {}", disclaimerRecord.getDisclaimerForId(), disclaimerRecord.getAcceptedByOwnerId(), disclaimerRecord.getId(), count);
}
When setting the AcceptedAt value, I want to use the database's current timestamp instead of passing the JVM timestamp. Is there any way to do that in jOOQ?
UpdatableRecord.store() can only set Field<T> => T key/values, not Field<T> => Field<T>, so you cannot set an expression in your record. You can obviously run an explicit INSERT / UPDATE / MERGE statement instead.
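For illustration, a hedged sketch of that explicit-statement route. The generated field names below are assumptions derived from the record setters in the question; DSL.currentLocalDateTime() makes the database, not the JVM, produce the timestamp:

import static org.jooq.impl.DSL.currentLocalDateTime;

dslContext.insertInto(Disclaimer.DISCLAIMER)
          .set(Disclaimer.DISCLAIMER.DISCLAIMER_FOR_ID, acknowledgeDisclaimerReq.getDealListingId())
          .set(Disclaimer.DISCLAIMER.DISCLAIMER_FOR_TYPE, "DEAL")
          .set(Disclaimer.DISCLAIMER.ACCEPTED_AT, currentLocalDateTime())
          .set(Disclaimer.DISCLAIMER.ACCEPTED_BY_OWNER_ID, acknowledgeDisclaimerReq.getLoggedInOwnerId())
          .execute();

This trades the UpdatableRecord convenience for the ability to pass an expression for ACCEPTED_AT.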
Using triggers
The best way to ensure such a timestamp is set to the database timestamp whenever you run some specific DML on the table is to use a database trigger (you could make the trigger watch for changes in the ACCEPTED_BY_OWNER_ID value).
If you can't do this on the server side (which is the most reliable, because it will behave correctly for all database clients, not just the JDBC/jOOQ based ones), you might have a few client side options in jOOQ:
Using jOOQ 3.17 client side computed columns
jOOQ 3.17 has added support for stored (or virtual) client side computed columns, a special case of which are audit columns (which is almost what you're doing).
Using this, you can specify, for example:
<forcedType>
  <generator><![CDATA[
    ctx -> org.jooq.impl.DSL.currentTimestamp()
  ]]></generator>
  <includeExpression>(?i:ACCEPTED_AT)</includeExpression>
</forcedType>
The above acts like a trigger that sets the ACCEPTED_AT date to the current timestamp every time you write to the table. In your case, it'll be more like:
<forcedType>
  <generator><![CDATA[
    ctx -> org.jooq.impl.DSL
      .when(ACCEPTED_BY_OWNER_ID.isNotNull(), org.jooq.impl.DSL.currentTimestamp())
      .else_(ctx.table().ACCEPTED_AT)
  ]]></generator>
  <includeExpression>(?i:ACCEPTED_AT)</includeExpression>
</forcedType>
See a current limitation of the above here:
https://github.com/jOOQ/jOOQ/issues/13809
See the relevant manual sections here:
Client side computed columns
Audit columns
Should be something like disclaimerRecord.setAcceptedAt(DSL.now());
I'm trying to save a dataset to a Cassandra DB using Java Spark.
I'm able to read data into a dataset successfully using the code below:
Dataset<Row> readdf = sparkSession.read().format("org.apache.spark.sql.cassandra")
.option("keyspace","dbname")
.option("table","tablename")
.load();
But when I try to write the dataset, I'm getting an IOException: Could not load or find table, found similar tables in keyspace
readdf.write().format("org.apache.spark.sql.cassandra")
        .option("keyspace", "dbname")
        .option("table", "tablename")
        .save();
I'm setting the host and port on the SparkSession.
The thing is, I'm able to write in overwrite and append modes, but not able to create the table.
The versions I'm using are:
Spark (Java) 2.0
Spark Cassandra Connector 2.3
I tried different jar versions but nothing worked.
I have also gone through various Stack Overflow and GitHub links.
Any help is greatly appreciated.
The write operation in Spark doesn't have a mode that will automatically create a table for you - there are multiple reasons for that. One of them is that you need to define a primary key for your table; otherwise you may just overwrite data if you set an incorrect primary key. Because of this, the Spark Cassandra Connector provides a separate method to create a table based on your dataframe's structure, but you need to provide lists of partition and clustering key columns. In Java it will look as follows (full code is here):
import java.util.Arrays;
import scala.Option;
import scala.Some;
import scala.collection.JavaConversions;
import scala.collection.Seq;
import com.datastax.spark.connector.DataFrameFunctions;
import com.datastax.spark.connector.cql.CassandraConnector;
import com.datastax.spark.connector.cql.CassandraConnectorConf;

DataFrameFunctions dfFunctions = new DataFrameFunctions(dataset);
Option<Seq<String>> partitionSeqlist = new Some<>(JavaConversions.asScalaBuffer(
    Arrays.asList("part")).seq());
Option<Seq<String>> clusteringSeqlist = new Some<>(JavaConversions.asScalaBuffer(
    Arrays.asList("clust", "col2")).seq());
CassandraConnector connector = new CassandraConnector(
    CassandraConnectorConf.apply(spark.sparkContext().getConf()));
dfFunctions.createCassandraTable("test", "widerows6",
    partitionSeqlist, clusteringSeqlist, connector);
and then you can write data as usual:
dataset.write()
.format("org.apache.spark.sql.cassandra")
.options(ImmutableMap.of("table", "widerows6", "keyspace", "test"))
.save();
I just wrote a toy class to test Spark dataframes (actually Dataset, since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are 2 actions, "insertInto" and "count".
I debugged the code step by step; when running "insertInto", I see several lines of:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are 2 actions on the same dataframe as above, if I don't call ds.cache or ds.persist explicitly, will the 2nd action always cause the re-execution of the SQL query?
2) If I understand the log correctly, both actions trigger HDFS file reading. Does that mean ds.cache() actually doesn't work here? If so, why doesn't it work here?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See e.g. this JIRA (https://issues.apache.org/jira/browse/SPARK-24596):
"When invalidating a cache, we invalidate other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed."
Try to run the ds.count before inserting into the table.
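A hedged sketch of that reordering, reusing the toy code from the question (the count now materializes the cache before the source table is mutated):

Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
System.out.println(ds.count()); // action runs first: reads HDFS once and fills the cache
ds.write().mode(SaveMode.Append).insertInto("test2.dummy"); // mutate the source afterwards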
I found that the other answer doesn't work. What I had to do was break the lineage, so that the df I was writing did not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the df (in PySpark) using
copy_of_df = sql_context.createDataFrame(df.rdd)
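For the Java Dataset<Row> case this thread is about, a hedged equivalent would be (assuming a SparkSession named spark):

// Rebuild the Dataset from its underlying RDD and schema to break the lineage.
Dataset<Row> copyOfDf = spark.createDataFrame(df.javaRDD(), df.schema());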
I wonder if there is any way to disable WAL (write-ahead log) operations when inserting new data into an HBase table with the Java API?
Thank you for you help :)
In HBase 2.0.0
To skip the WAL at the individual update level (for a single Put or Delete):
Put p = new Put(ROW_ID).addColumn(FAMILY, NAME, VALUE).setDurability(Durability.SKIP_WAL);
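In context, a minimal hedged sketch (the open Connection named conn, the byte[] constants ROW_ID, FAMILY, NAME, VALUE, and the TABLE_ID name are assumptions):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

try (Table table = conn.getTable(TableName.valueOf(TABLE_ID))) {
    Put p = new Put(ROW_ID)
            .addColumn(FAMILY, NAME, VALUE)
            .setDurability(Durability.SKIP_WAL); // this single Put bypasses the WAL
    table.put(p);
}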
To set this for the entire table (so you don't have to do it each time for each update):
TableDescriptorBuilder tBuilder = TableDescriptorBuilder.newBuilder(TableName.valueOf(TABLE_ID));
tBuilder.setDurability(Durability.SKIP_WAL);
// ... continue building the table descriptor
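For completeness, a hedged sketch of finishing the build and creating the table (the "cf" column family and the Admin handle are illustrative assumptions):

import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;

TableDescriptor descriptor = tBuilder
        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf")) // hypothetical family
        .build();
try (Admin admin = conn.getAdmin()) {
    admin.createTable(descriptor); // writes to this table now skip the WAL by default
}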
Hope this helps
I am using the Java MongoDB Connector to run a Hadoop MapReduce job against MongoDB.
I am setting the input and output URIs with MongoConfigUtil:
MongoConfigUtil.setInputURI( conf, "mongodb://host/db.collection" );
MongoConfigUtil.setOutputURI( conf, "mongodb://host/db.collectionOut" );
And the job is correctly fetching all the documents in the specified collection.
Is there a way to limit the number of fetched documents?
I wish to achieve this query (Mongo style):
db.collection.find().limit(1000)
I know MongoConfigUtil has a setQuery method, but how can I set a limit? Any hints?
I tried to add
MongoConfigUtil.setLimit(conf, 1000)
But I still get all the documents in the collection.
setSplitSize (8 MB is the default) has a higher priority than setLimit (mongo.input.limit).
Example: mongoConfig.setSplitSize(5); // MB - 8 MB default
In the example above I set the value to 5 MB.
setLimit is a per-chunk (split) query limit: the stated limit size (for example 1000) applies to each chunk fetched by each Mapper, not to the job as a whole.
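To make these knobs concrete, a hedged sketch of setting both with mongo-hadoop's MongoConfigUtil (treat the exact signatures as assumptions to verify against your connector version):

import org.apache.hadoop.conf.Configuration;
import com.mongodb.hadoop.util.MongoConfigUtil;

Configuration conf = new Configuration();
MongoConfigUtil.setInputURI(conf, "mongodb://host/db.collection");
MongoConfigUtil.setSplitSize(conf, 5);  // split size in MB (default 8)
MongoConfigUtil.setLimit(conf, 1000);   // applies per split, not to the whole job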
I think you want to limit the query for the entire MapReduce process.
setQuery is the query inside find(), and it must be expressed in JSON format, as in MongoDB. As far as I know, you can't put a limit inside a Mongo query (find()).
You can instead filter the query, e.g. { fieldName: { $lt: 20 } }, depending on your case. Besides, you may create a separate collection based on your limit using projection, and then apply MapReduce there.
In short, setQuery is used to filter the collection, not to limit it.
I found the solution: use the setLimit method of the class MongoInputSplit, passing the number of documents you want to fetch.
myMongoInputSplitObj = new MongoInputSplit(*param*)
myMongoInputSplitObj.setLimit(100)
MongoConfigUtil setLimit
Allow users to set the limit on MongoInputSplits (HADOOP-267).