Apache Spark - how to get unmatched rows from two RDDs - java

I have two different RDDs that share some common fields. Based on those fields, I want to get the unmatched records from RDD1 or RDD2: records available in RDD1 but not in RDD2, and records available in RDD2 but not in RDD1.
It seems we could use subtract or subtractByKey.
Sample Input:
**File 1:**
sam,23,cricket
alex,34,football
ann,21,football
**File 2:**
ruby,25,football,usa
alex,34,cricket,usa
ann,21,cricket,usa
**expected output:**
sam,23,cricket
Update:
Currently I am using Spark SQL to get the unmatched records from the RDDs (writing a query that returns the unmatched records).
What I am looking for is whether this can be done with Spark Core itself instead of Spark SQL. I am not looking for code; is there an operation available in Spark Core for this?
Please advise on this.
Regards,
Shankar.

You could bring both RDDs to the same shape and use subtract to remove the common elements.
Given rdd1 from file1 and rdd2 from file2 as presented above, you could do something like:
val userScore2 = rdd2.map{case (name, score, sport, country) => (name, score, sport)}
val in1andNotin2 = rdd1 subtract userScore2
val in2andNotIn1 = userScore2 subtract rdd1
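For a Java version of the same idea, a minimal sketch could look like the following (assuming the two files are loaded as comma-separated text; the file paths and variable names are just placeholders):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("UnmatchedRecords");
JavaSparkContext jsc = new JavaSparkContext(conf);

// name,age,sport
JavaRDD<String> rdd1 = jsc.textFile("file1.txt");
// name,age,sport,country -> drop the country column to match rdd1's shape
JavaRDD<String> rdd2 = jsc.textFile("file2.txt").map(line -> {
    String[] f = line.split(",");
    return f[0] + "," + f[1] + "," + f[2];
});

JavaRDD<String> in1NotIn2 = rdd1.subtract(rdd2); // sam,23,cricket
JavaRDD<String> in2NotIn1 = rdd2.subtract(rdd1);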

Related

Find elements in one RDD but not in the other RDD

I have two JavaRDDs, A and B. I want to keep only the longs that are in A but not in B. How should I do that? Thanks!
I am posting a solution in Scala; it should be almost the same in Java.
Do a leftOuterJoin, which gives all the records in the first RDD along with the matching records from the second RDD, e.g. WrappedArray((168,(def,None)), (192,(abc,Some(abc)))). To keep only the records present solely in the first RDD, we filter for the entries whose second value is None.
val data = spark.sparkContext.parallelize(Seq((192, "abc"),(168, "def")))
val data2 = spark.sparkContext.parallelize(Seq((192, "abc")))
val result = data
.leftOuterJoin(data2)
.filter(record => record._2._2 == None)
println(result.collect.toSeq)
Output> WrappedArray((168,(def,None)))
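Since the question asks about JavaRDD, here is a rough Java equivalent of the same approach (a sketch assuming a Spark 2.x JavaSparkContext named jsc):
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.Optional;
import scala.Tuple2;

JavaPairRDD<Integer, String> data = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>(192, "abc"), new Tuple2<>(168, "def")));
JavaPairRDD<Integer, String> data2 = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>(192, "abc")));

// keep only the keys of `data` that found no match in `data2`
JavaPairRDD<Integer, Tuple2<String, Optional<String>>> result =
        data.leftOuterJoin(data2)
            .filter(record -> !record._2()._2().isPresent());

System.out.println(result.collect()); // only (168,def) remains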
If you use the DataFrame API (the RDD API is older and misses a lot of the Tungsten engine optimisations), you can use an anti join (something similar may exist on the RDD API too, but let's use the better one ;-) )
val dataA = Seq((192, "abc"), (168, "def")).toDF("MyLong", "MyString")
val dataB = Seq((192, "abc")).toDF("MyLong", "MyString")
dataA.join(dataB, Seq("MyLong"), "leftanti").show(false)
+------+--------+
|MyLong|MyString|
+------+--------+
|168 |def |
+------+--------+
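In Java the same anti join would look roughly like this (a sketch assuming dataA and dataB are Dataset<Row> with the columns above):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// rows of dataA whose MyLong has no match in dataB
Dataset<Row> onlyInA = dataA.join(dataB,
        dataA.col("MyLong").equalTo(dataB.col("MyLong")), "leftanti");
onlyInA.show(false);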

Using Apache Spark on low-end systems with Cassandra and Java

I want to use Apache Spark on my cluster, which is made of 5 low-end machines. First I set up Cassandra 3.11.3 on my nodes, and all of my nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and all is OK too.
Now I want to execute a simple query like the following:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes have weak hardware, I want to fetch just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), where I think Spark fetches all the data and then applies the limit function, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I can use Spark Core (C) too. I also know that a method called perPartitionLimit was implemented in Cassandra 3.6 and later (D).
Since my nodes are weak, I don't want to fetch all records from the Cassandra table and then apply a limit function or something like that. I want to fetch only a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
Update:
I have applied the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
        .setAppName("SparkTest")
        .set("spark.cassandra.connection.host", "192.168.107.100");
long limit = 20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
        .cassandraTable("myKeySpace", "myTbl")
        .select("id")
        .perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); // output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work, and all records are fetched!
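As an aside (this is my reading of the CQL semantics, not a verified answer): PER PARTITION LIMIT caps the number of rows returned per Cassandra partition, not the overall total, so with many small partitions a count of 100000 is still the expected result. If only a handful of rows are needed on the driver, something like take() stops reading once enough rows have been collected:
// hypothetical sketch: fetch at most 20 rows instead of scanning/counting the whole table
java.util.List<CassandraRow> first20 = javaFunctions(jsc)
        .cassandraTable("myKeySpace", "myTbl")
        .select("id")
        .take(20);
System.out.println("Rows fetched: " + first20.size());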

Spark writing to Cassandra with varying TTL

In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.
I want to write the dataframe to a Cassandra DB. The data must be written with a TTL, and the TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.
Currently I am writing to Cassandra with Spark using a constant TTL, with the following code:
df.write().format("org.apache.spark.sql.cassandra")
    .options(new HashMap<String, String>() {
        {
            put("keyspace", "key_space_name");
            put("table", "table_name");
            put("spark.cassandra.output.ttl", Long.toString(CONST_TTL)); // should depend on the bucket_timestamp column
        }
    }).mode(SaveMode.Overwrite).save();
One possible way I thought about is, for each possible bucket_timestamp, to filter the data by timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?
The solution should work with Java and Dataset<Row>: I found some solutions for doing this with RDDs in Scala, but none for Java and DataFrames.
Thanks!
From the Spark-Cassandra connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:
constant value (withConstantTTL)
automatically resolved value (withAutoTTL)
column-based value (withPerRowTTL)
In your case you could try the last option and compute the TTL as a new column of the starting Dataset with the rule you provided in the question.
For a usage example, you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612
For the DataFrame API there is no support for such functionality yet. There is a JIRA for it - https://datastax-oss.atlassian.net/browse/SPARKC-416 - which you can watch to get notified when it's implemented.
So the only choice you have is to use the RDD API, as described in @bartosz25's answer.
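A minimal Java sketch of that RDD path could look like the following (the BucketRow bean, its fields, and the column names are hypothetical; df and CONST_TTL are the dataframe and constant from the question, and the exact mapping of the ttl column may need adjusting to your schema):
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import org.apache.spark.api.java.JavaRDD;

// compute ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp) into a "ttl" field
JavaRDD<BucketRow> rows = df.toJavaRDD().map(r -> {
    BucketRow b = new BucketRow();
    b.setBucketTimestamp(r.getLong(r.fieldIndex("bucket_timestamp")));
    b.setValue(r.getString(r.fieldIndex("value"))); // map the remaining columns as needed
    b.setTtl((int) (CONST_TTL - (System.currentTimeMillis() / 1000 - b.getBucketTimestamp())));
    return b;
});

javaFunctions(rows)
    .writerBuilder("key_space_name", "table_name", mapToRow(BucketRow.class))
    .withPerRowTTL("ttl") // TTL is read from the "ttl" property of each row
    .saveToCassandra();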

Inserting filename as rowkey using HBase MapReduce

Using the Java API, I'm trying to Put() the content of some files into HBase 1.1.x. To do so, I have created a WholeFileInput class (ref: Using WholeFileInputFormat with Hadoop MapReduce still results in Mapper processing 1 line at a time) so that MapReduce reads the entire file instead of one line at a time. Unfortunately, I cannot figure out how to form my rowkey from the given filename.
Example:
Input:
file-123.txt
file-524.txt
file-9577.txt
...
file-"anotherNumber".txt
Result on my HBase table:
Row-----------------Value
123-----------------"content of 1st file"
524-----------------"content of 2nd file"
...etc
If anyone has already faced this situation, please help me with it.
Thanks in advance.
Your rowkey can be like this:
rowkey = prefix + (filename part or full file name) + MurmurHash(fileContent)
where the prefix can be any value between whatever pre-splits you created at table creation time.
For example:
create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'},
{SPLITS => ['0','1','2','3','4','5','6','7']}
The prefix can be any random id generated within the range of the pre-splits.
This kind of row key also avoids hot-spotting as the data grows, and the data will be spread across the region servers.
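As an illustration only (the exact key/value types depend on how your WholeFileInput class is written, and the column qualifier name is made up), a mapper that derives the rowkey from the filename could look like this:
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileToHBaseMapper
        extends Mapper<NullWritable, BytesWritable, ImmutableBytesWritable, Put> {

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // "file-123.txt" -> "123"
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String number = fileName.replaceAll("\\D+", "");

        // prepend a salt/prefix matching your pre-splits here to avoid hot-spotting
        byte[] rowKey = Bytes.toBytes(number);
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("colFam"), Bytes.toBytes("content"), value.copyBytes());

        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}
The job is then wired to the target table in the usual way for HBase MapReduce writes, e.g. with TableMapReduceUtil.initTableReducerJob.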

Apache Spark DataFrame no RDD partitioning

According to the new Spark docs, using Spark's DataFrame should be preferred over JdbcRDD.
The first touch was pretty enjoyable, until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, and I did.
Everything was fine, I wrote my code using this approach, and then noticed that this code:
JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();
produces 1. All the code after such a transformation to JavaRDD is absolutely inefficient: forcing a repartition of the RDD takes a good amount of time and adds more overhead than the code that works with a single partition.
How to deal with it?
When using JdbcRDD we wrote specific "pager" SQL like WHERE id >= ? AND id <= ? that was used to create partitions. How can I do something like this with a DataFrame?
val connectionString = "jdbc:oracle:thin:username/password#111.11.1.11:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
  Map("url" -> connectionString,
      "dbtable" -> "(select * from CUSTOMER_ORDERS)",
      "partitionColumn" -> "ORDER_ID",
      "lowerBound" -> "1000",
      "upperBound" -> "40000",
      "numPartitions" -> "10"))
