Compare and highlight the differences of two dataframes using Spark and Java

I am using Spark and Java to try to compare two data frames.
Once I convert my CSV files into data frames, I want to highlight exactly what changed between the two dataframes.
They both have the same columns.
As you can see, the only thing that differs between the data frames below is emp_id 4 in the second one (df2).
Dataset<Row> df1 = spark.read().csv("/Users/dataframeOne.csv");
Dataset<Row> df2 = spark.read().csv("/Users/dataframeTwo.csv");
df1.unionAll(df2).except(df1.intersect(df2)).show(true);
Df1
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Df2
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Difference
+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
| 4| sanjose| romino|9848022331| 45123|SanRamon|
+------+--------+--------+----------+-------+--------+
How can I highlight the incorrect field, 'romino', in yellow using Java and Spark?

Highlighting something in Spark depends on your GUI, so as a first step I would suggest detecting the differing values and adding the information about the differences as an additional column to the dataframe.
Step 1: Add a suffix to all columns of the two dataframes and join them on the primary key (emp_id):
import static org.apache.spark.sql.functions.*;
private static Dataset<Row> withSuffix(Dataset<Row> df, String suffix) {
    for (String col : df.columns()) df = df.withColumnRenamed(col, col + suffix);
    return df;
}
[...]
Dataset<Row> df1 = spark.read().option("header", "true").csv(...);
Dataset<Row> df2 = spark.read().option("header", "true").csv(...);
String[] columns = df1.columns();
Dataset<Row> joined = withSuffix(df1, "_1").join(withSuffix(df2, "_2"),
    col("emp_id_1").eqNullSafe(col("emp_id_2")), "full_outer");
Step 2: Create a list of column objects that check whether the value from one table differs from the value in the other. This list will later be used as the input for the map function.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
List<Column> diffs = new ArrayList<>();
for (String column : columns) {
    diffs.add(lit(column));
    diffs.add(when(col(column + "_1").eqNullSafe(col(column + "_2")), null)
        .otherwise(concat_ws("/", col(column + "_1"), col(column + "_2"))));
}
Step 3: Create a new column containing a map with all the differences (map_filter requires Spark 3.0 or later):
joined.withColumn("differences", map(diffs.toArray(new Column[]{})))
    .withColumn("differences", map_filter(col("differences"), (k, v) -> not(v.isNull())))
    .select("emp_id_1", "differences")
    .filter(size(col("differences")).gt(0))
    .show(false);
Output:
+--------+--------------------------+
|emp_id_1|differences |
+--------+--------------------------+
|4 |{emp_name -> romin/romino}|
+--------+--------------------------+

Related

How to merge two parquet files having different schema in spark (java)

I have 2 parquet files with different numbers of columns and I am trying to merge them with the following code snippet:
Dataset<Row> dataSetParquet1 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataSetParquet2 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\EFG\\efg.parquet");
dataSetParquet1.unionByName(dataSetParquet2);
// dataSetParquet1.union(dataSetParquet2);
For unionByName() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Cannot resolve column name
For union() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 7 columns and the second table has 6 columns;;
How do I merge these files using spark in java?
UPDATE: Example
dataset 1:
epochMillis | one | two | three| four
--------------------------------------
1630670242000 | 1 | 2 | 3 | 4
1630670244000 | 1 | 2 | 3 | 4
1630670246000 | 1 | 2 | 3 | 4
dataset2 :
epochMillis | one | two | three|five
---------------------------------------
1630670242000 | 11 | 22 | 33 | 55
1630670244000 | 11 | 22 | 33 | 55
1630670248000 | 11 | 22 | 33 | 55
Final dataset after merging:
epochMillis | one | two | three|four |five
--------------------------------------------
1630670242000 | 11 | 22 | 33 |4 |55
1630670244000 | 11 | 22 | 33 |4 |55
1630670246000 | 1 | 2 | 3 |4 |null
1630670248000 | 11 | 22 | 33 |null |55
How can I obtain this result when merging the two Datasets?
You can use the mergeSchema option and pass all the paths of the parquet files you want to merge to the parquet method, as follows:
Dataset<Row> finalDataset = testSparkSession.read()
.option("mergeSchema", true)
.parquet("D:\\ABC\\abc.parquet", "D:\\EFG\\efg.parquet");
All columns present in the first dataset but not in the second will be filled with null for the rows coming from the second dataset, and vice versa.
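A small check of the merged result, building on the snippet above. Note that this read simply stacks the rows of both files; it does not combine rows sharing the same epochMillis (the join-based approach below does that):
// The merged schema contains the union of the columns: epochMillis, one, two, three, four, five.
finalDataset.printSchema();
// Rows from each file are stacked; columns missing in a file show up as null.
finalDataset.orderBy("epochMillis").show(false);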
To merge two rows that come from two different dataframes, you first join the two dataframes, then select the right columns according to how you want to merge them.
So for your case, it means:
Read the two dataframes separately from their parquet locations
Join the two dataframes on their epochMillis column, using a full_outer join, since you want to keep rows that are present in one dataframe but not in the other
From the new dataframe, in which the columns of the two dataframes are all present side by side, select the merged columns using a function columnMerges (implementation below)
[Optional] Reorder the final dataframe by epochMillis
Translated into code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> dataframe1 = testSparkSession.read().parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataframe2 = testSparkSession.read().parquet("D:\\EFG\\efg.parquet");
Dataset<Row> merged = dataframe1
    .join(dataframe2, dataframe1.col("epochMillis").equalTo(dataframe2.col("epochMillis")), "full_outer")
    .select(Selector.columnMerges(dataframe2, dataframe1))
    .orderBy("epochMillis");
Note: when reading the parquet files there is no need for the mergeSchema option, since each dataframe is read from a single parquet file and therefore has a single schema.
For the merge function Selector.columnMerges, what we want to do for each row is:
if the column is present in both dataframes, take the value from dataframe2 if it is not null, otherwise take the value from dataframe1
if the column is only present in dataframe2, take the value from dataframe2
if the column is only present in dataframe1, take the value from dataframe1
So we first build the set of columns of dataframe1, the set of columns of dataframe2, and the deduplicated list of columns from both dataframes. Then we iterate over this list of columns, applying the previous rules to each one:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.when;
public class Selector {
    public static Column[] columnMerges(Dataset<Row> main, Dataset<Row> second) {
        List<Column> columns = new ArrayList<>();
        Set<String> columnsFromMain = new HashSet<>(Arrays.asList(main.columns()));
        Set<String> columnsFromSecond = new HashSet<>(Arrays.asList(second.columns()));
        // All column names: main's columns first, then those only present in second (deduplicated).
        List<String> columnNames = new ArrayList<>(Arrays.asList(main.columns()));
        for (String column : second.columns()) {
            if (!columnsFromMain.contains(column)) {
                columnNames.add(column);
            }
        }
        for (String column : columnNames) {
            if (columnsFromMain.contains(column) && columnsFromSecond.contains(column)) {
                // Column in both: prefer main's value, fall back to second's when main is null.
                columns.add(when(main.col(column).isNull(), second.col(column)).otherwise(main.col(column)).as(column));
            } else if (columnsFromMain.contains(column)) {
                columns.add(main.col(column).as(column));
            } else {
                columns.add(second.col(column).as(column));
            }
        }
        return columns.toArray(new Column[0]);
    }
}

How to save nested or JSON object in spark Dataset with converting to RDD?

I am working on Spark code where I have to save multiple column values in an object format and write the result to MongoDB.
Given Dataset
|---|-----|------|----------|
|A |A_SRC|Past_A|Past_A_SRC|
|---|-----|------|----------|
|a1 | s1 | a2 | s2 |
What I have tried:
val ds1 = Seq(("1", "2", "3", "4")).toDF("a", "src", "p_a", "p_src")
val recordCol = functions.to_json(struct("a", "src", "p_a", "p_src")) as "A"
ds1.select(recordCol).show(truncate = false)
which gives me a result like:
+-----------------------------------------+
|A |
+-----------------------------------------+
|{"a":"1","src":"2","p_a":"3","p_src":"4"}|
+-----------------------------------------+
I am expecting something like:
+-------------------------------------------------------+
|A                                                      |
+-------------------------------------------------------+
|{"source":"1","value":"2","p_source":"4","p_value":"3"}|
+-------------------------------------------------------+
How can I change the keys in the object to something other than the column names, using maps in Java?
You can use as on each column inside the struct, so that each field is saved under the name you pass.
Dataset<Row> tstDS = spark.read().format("csv").option("header", "true").load("/home/exa9/Documents/SparkLogs/y.csv");
tstDS.show();
/****
+---+-----+------+----------+
| A|A_SRC|Past_A|Past_A_SRC|
+---+-----+------+----------+
| a1| s1| a2| s2|
+---+-----+------+----------+
****/
tstDS.withColumn("A",
functions.to_json(
functions.struct(
functions.col("A").as("source"),
functions.col("A_SRC").as("value"),
functions.col("Past_A").as("p_source"),
functions.col("Past_A_SRC").as("p_value")
))
)
.select("A")
.show(false);
/****
+-----------------------------------------------------------+
|A |
+-----------------------------------------------------------+
|{"source":"a1","value":"s1","p_source":"a2","p_value":"s2"}|
+-----------------------------------------------------------+
****/
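Since the question also mentions using maps in Java: if the column-to-key mapping comes from a java.util.Map, the same aliases can be built dynamically. A hedged sketch reusing tstDS from above (renameMap and its entries are only illustrative):
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.spark.sql.Column;
// Maps each source column name to the JSON key it should get (illustrative values).
Map<String, String> renameMap = new LinkedHashMap<>();
renameMap.put("A", "source");
renameMap.put("A_SRC", "value");
renameMap.put("Past_A", "p_source");
renameMap.put("Past_A_SRC", "p_value");
Column[] fields = renameMap.entrySet().stream()
    .map(e -> functions.col(e.getKey()).as(e.getValue()))
    .toArray(Column[]::new);
tstDS.withColumn("A", functions.to_json(functions.struct(fields)))
    .select("A")
    .show(false);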

Parsing key value pairs as Hive Dataset rows using java spark

I have an HDFS file with the following data:
key1=value1 key2=value2 key3=value3...
key1=value11 key2=value12 key3=value13..
We use an internal framework that gives the Dataset as an input to a Java method, which should transform it as below and put it into a Hive table:
Keys should be the Hive column names
Rows are formed by splitting each pair on the = delimiter and picking the value to the right
Expected Output:
key1 | key 2 | key3
----------+-------------+----------
value1 | value2 | value3
value11 | value12 | value13
The HDFS file has roughly 60 key-value pairs, so it is impractical to manually call withColumn() on the Dataset for each one. Any help is appreciated.
Edit 1:
This is what I could write so far. Dataset.withColumn() doesn't seem to work in the loop except for the 1st iteration:
String[] columnNames = new String[dataset.columns().length];
String unescapedColumn;
Row firstRow = (Row) dataset.first();
String[] rowData = firstRow.mkString(",").split(",");
for (int i = 0; i < rowData.length; i++) {
    unescapedColumn = rowData[i].split("=")[0];
    if (unescapedColumn.contains(">")) {
        columnNames[i] = unescapedColumn.substring(unescapedColumn.indexOf(">") + 1).trim();
    } else {
        columnNames[i] = unescapedColumn.trim();
    }
}
Dataset<Row> namedDataset = dataset.toDF(columnNames);
for (String column : namedDataset.columns()) {
    System.out.println("Column name :" + column);
    namedDataset = namedDataset.withColumn(column, functions.substring_index(namedDataset.col(column), "=", -1));
}
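If looping with withColumn is the problem, one alternative (a sketch, not from the original thread) is to build all the cleaned-up columns first and apply them in a single select; substring_index with -1 keeps only the part after the =:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
// Assumes namedDataset from the snippet above (columns already renamed to the key names).
List<Column> cleaned = new ArrayList<>();
for (String column : namedDataset.columns()) {
    // Keep only the text to the right of the '=' and preserve the column name.
    cleaned.add(functions.substring_index(namedDataset.col(column), "=", -1).as(column));
}
Dataset<Row> result = namedDataset.select(cleaned.toArray(new Column[0]));
result.show(false);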

Split Java Spring

I have a DB table with 2 columns, key and value. Record:
------------------------------------
| key | value |
------------------------------------
| A | 1,desc 1;2,desc 2;3,desc 3 |
------------------------------------
I want to split the value column into JSON format:
[{"key":"1","value":"desc 1"},{"key":"2","value":"desc 2"},{"key":"3", "value":"desc 3"}]
Where should I put the split function? In the service? It seems difficult because two splits are needed. How do I solve this problem?
Thanks,
Bobby
That depends on how your application usually works with this value. If the usual case is using some specific data from this column, I would parse it already at the repository level:
import java.util.stream.Stream;
import org.json.JSONArray;
import org.json.JSONObject;
public static void main(String[] args) {
    // You would actually get this value from the DB
    String value = "1,desc 1;2,desc 2;3,desc 3";
    JSONArray j = new JSONArray();
    Stream.of(value.split(";")).forEach(pair -> {
        String[] keyValue = pair.split(",");
        JSONObject o = new JSONObject();
        o.put("key", keyValue[0]);
        o.put("value", keyValue[1]);
        j.put(o);
    });
    System.out.println(j);
}

Data manipulation on all columns in Dataset with Java API

After reading a CSV file into a Dataset, I want to remove spaces from String-type data using the Java API.
Apache Spark 2.0.0
Dataset<Row> dataset = sparkSession.read().format("csv").option("header", "true").load("/pathToCsv/data.csv");
Dataset<String> dataset2 = dataset.map(new MapFunction<Row, String>() {
    @Override
    public String call(Row value) throws Exception {
        // But this only removes spaces from the first column
        return value.getString(0).replace(" ", "");
    }
}, Encoders.STRING());
Using MapFunction, I am not able to remove spaces from all columns.
But in Scala, using the following approach in spark-shell, I am able to perform the desired operation:
val ds = spark.read.format("csv").option("header", "true").load("/pathToCsv/data.csv")
val opds = ds.select(ds.columns.map(c => regexp_replace(col(c), " ", "").alias(c)): _*)
Dataset opds has the data without spaces. I want to achieve the same in Java, but in the Java API the columns method returns String[] and I am not able to use the same functional style on the Dataset.
Input Data
+----------------+----------+-----+---+---+
| x| y| z| a| b|
+----------------+----------+-----+---+---+
| Hello World|John Smith|There| 1|2.3|
|Welcome to world| Bob Alice|Where| 5|3.6|
+----------------+----------+-----+---+---+
Expected Output Data
+--------------+---------+-----+---+---+
| x| y| z| a| b|
+--------------+---------+-----+---+---+
| HelloWorld|JohnSmith|There| 1|2.3|
|Welcometoworld| BobAlice|Where| 5|3.6|
+--------------+---------+-----+---+---+
Try:
import static org.apache.spark.sql.functions.regexp_replace;
for (String col : dataset.columns()) {
    dataset = dataset.withColumn(col, regexp_replace(dataset.col(col), " ", ""));
}
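If you prefer to mirror the Scala one-liner from the question with a single select instead of repeated withColumn calls, a sketch like the following should also work (it assumes the same dataset variable as above):
import java.util.Arrays;
import org.apache.spark.sql.Column;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;
// Build one regexp_replace expression per column and select them all at once.
Column[] cleaned = Arrays.stream(dataset.columns())
    .map(c -> regexp_replace(col(c), " ", "").alias(c))
    .toArray(Column[]::new);
Dataset<Row> withoutSpaces = dataset.select(cleaned);
withoutSpaces.show(false);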
You can try the following regex to remove whitespace within the strings:
value.getString(0).replaceAll("\\s+", "");
About \s+: it matches any whitespace character between one and unlimited times, as many times as possible.
Instead of replace, use the replaceAll function.
More about the replace and replaceAll functions: Difference between String replace() and replaceAll()
