How to merge two parquet files having different schema in spark (java) - java

I have two parquet files with different numbers of columns and I am trying to merge them with the following code snippet:
Dataset<Row> dataSetParquet1 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataSetParquet2 = testSparkSession.read().option("mergeSchema",true).parquet("D:\\EFG\\efg.parquet");
dataSetParquet1.unionByName(dataSetParquet2);
// dataSetParquet1.union(dataSetParquet2);
for unionByName() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Cannot resolve column name
for union() I get the error:
Caused by: org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 7 columns and the second table has 6 columns;;
How do I merge these files using spark in java?
UPDATE: Example
dataset 1:
epochMillis   | one | two | three | four
-----------------------------------------
1630670242000 |  1  |  2  |   3   |  4
1630670244000 |  1  |  2  |   3   |  4
1630670246000 |  1  |  2  |   3   |  4
dataset 2:
epochMillis   | one | two | three | five
-----------------------------------------
1630670242000 | 11  | 22  |  33   | 55
1630670244000 | 11  | 22  |  33   | 55
1630670248000 | 11  | 22  |  33   | 55
Final dataset after merging:
epochMillis   | one | two | three | four | five
------------------------------------------------
1630670242000 | 11  | 22  |  33   |  4   | 55
1630670244000 | 11  | 22  |  33   |  4   | 55
1630670246000 |  1  |  2  |   3   |  4   | null
1630670248000 | 11  | 22  |  33   | null | 55
How can I obtain this result when merging the two Datasets?

You can use the mergeSchema option and pass all the paths of the parquet files you want to merge to the parquet method, as follows:
Dataset<Row> finalDataset = testSparkSession.read()
.option("mergeSchema", true)
.parquet("D:\\ABC\\abc.parquet", "D:\\EFG\\efg.parquet");
Columns that are present in one dataset but not in the other will be filled with null for the rows coming from the dataset that lacks them.
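A quick way to verify the result (a minimal sketch, reusing the finalDataset variable from the snippet above; column order in the output may differ):
finalDataset.printSchema(); // lists the union of the columns of both files
finalDataset.show();        // rows from each file, with null in the columns that file lacks
Note that this reads both files under one merged schema and keeps all of their rows as-is; rows that share the same epochMillis are not combined into one.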

To merge rows that come from two different dataframes, you first join the two dataframes, then select the right columns according to how you want to merge them.
So for your case, it means:
1. Read the two dataframes separately from their parquet locations
2. Join the two dataframes on their epochMillis column, using a full_outer join, since you want to keep rows that are present in one dataframe but not in the other
3. From the joined dataframe, which contains the columns of both dataframes (shared columns appearing twice, once per side), select the merged columns using a function columnMerges (implementation below)
4. [Optional] Reorder the final dataframe by epochMillis
Translated into code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> dataframe1 = testSparkSession.read().parquet("D:\\ABC\\abc.parquet");
Dataset<Row> dataframe2 = testSparkSession.read().parquet("D:\\EFG\\efg.parquet");

Dataset<Row> merged = dataframe1
    .join(dataframe2, dataframe1.col("epochMillis").equalTo(dataframe2.col("epochMillis")), "full_outer")
    .select(Selector.columnMerges(dataframe2, dataframe1))
    .orderBy("epochMillis");
Note: when reading the parquet files there is no need for the mergeSchema option, since each dataframe is read from a single parquet file and therefore has a single schema.
For the merge function Selector.columnMerges, for each row, what we want to do is:
if the column is present in both dataframes, take the value from dataframe2 if it is not null, otherwise take the value from dataframe1
if the column is only present in dataframe2, take the value from dataframe2
if the column is only present in dataframe1, take the value from dataframe1
So we first build the set of columns of dataframe1, the set of columns of dataframe2, and the deduplicated list of column names from both dataframes. Then we iterate over this list of columns, applying the previous rules to each one:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.when;

public class Selector {

    public static Column[] columnMerges(Dataset<Row> main, Dataset<Row> second) {
        List<Column> columns = new ArrayList<>();
        Set<String> columnsFromMain = new HashSet<>(Arrays.asList(main.columns()));
        Set<String> columnsFromSecond = new HashSet<>(Arrays.asList(second.columns()));

        // All column names from both dataframes, deduplicated, keeping main's order first
        List<String> columnNames = new ArrayList<>(Arrays.asList(main.columns()));
        for (String column : second.columns()) {
            if (!columnsFromMain.contains(column)) {
                columnNames.add(column);
            }
        }

        for (String column : columnNames) {
            if (columnsFromMain.contains(column) && columnsFromSecond.contains(column)) {
                // Shared column: take main's value, fall back to second's when main is null
                columns.add(when(main.col(column).isNull(), second.col(column)).otherwise(main.col(column)).as(column));
            } else if (columnsFromMain.contains(column)) {
                columns.add(main.col(column).as(column));
            } else {
                columns.add(second.col(column).as(column));
            }
        }
        return columns.toArray(new Column[0]);
    }
}
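As a side note, the when/otherwise expression used for shared columns is equivalent to Spark's coalesce function, which returns the first non-null value; if you prefer, that branch can be written as the following one-liner (a minor variation, not required):
columns.add(coalesce(main.col(column), second.col(column)).as(column)); // import static org.apache.spark.sql.functions.coalesce;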

Related

Need to set values in columns of dataset based on value of 1 column

I have a Dataset<Row> in Java. I need to read the value of one column, which is a JSON string, parse it, and set the values of a few other columns based on the parsed JSON.
My dataset looks like this:
| json                    | name | age  |
=========================================
| "{'a':'john', 'b': 23}" | null | null |
| "{'a':'joe', 'b': 25}"  | null | null |
| "{'a':'zack'}"          | null | null |
And I need to make it like this:
| json                    | name   | age  |
===========================================
| "{'a':'john', 'b': 23}" | 'john' | 23   |
| "{'a':'joe', 'b': 25}"  | 'joe'  | 25   |
| "{'a':'zack'}"          | 'zack' | null |
I am unable to figure out a way to do it. Please help with the code.
There is a get_json_object function in Spark. Assuming you have a dataframe named df, you can use it to solve your problem like this:
df.selectExpr("get_json_object(json, '$.a') as name", "get_json_object(json, '$.b') as age" )
But first and foremost, be sure that your json attribute has double quotes instead of single ones.
Note: there is a full list of Spark SQL functions; I use it heavily. Consider bookmarking it and referring to it from time to time.
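If you also want to keep the original columns and get age as an integer, a minimal Java sketch (assuming your Dataset<Row> variable is named df):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.get_json_object;

Dataset<Row> parsed = df
    .withColumn("name", get_json_object(df.col("json"), "$.a"))
    .withColumn("age", get_json_object(df.col("json"), "$.b").cast("int"));
parsed.show();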
You could use UDFs:
import org.apache.spark.sql.functions.udf

def parseName(json: String): String = ??? // parse json
val parseNameUDF = udf[String, String](parseName)

def parseAge(json: String): Int = ??? // parse json
val parseAgeUDF = udf[Int, String](parseAge)

dataFrame
  .withColumn("name", parseNameUDF(dataFrame("json")))
  .withColumn("age", parseAgeUDF(dataFrame("json")))

Spark Dataset - NullPointerException while doing a filter on dataset

I have two datasets, shown below. I'm trying to find out how many products are associated with each game, i.e. to keep a count of the number of associated products.
scala> df1.show()
gameid | games      | users          | cnt_assoc_prod
-------------------------------------------------------
1      | cricket    | [111, 121]     |
2      | basketball | [211]          |
3      | skating    | [101, 100, 98] |
scala> df2.show()
user | products
----------------------
98 | "shampoo"
100 | "soap"
101 | "shampoo"
111 | "shoes"
121 | "honey"
211 | "shoes"
I'm trying to iterate over each of df1's users arrays and find the corresponding row in df2 by applying a filter on the matching user column.
df1.map { x => {
  var assoc_products = Set[String]()
  x.users.foreach(y => assoc_products += df2.filter(z => z.user == y).first().products)
  x.cnt_assoc_prod = assoc_products.size
}}
While applying the filter I get the following exception:
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.logicalPlan(Dataset.scala:784)
at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:344)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:307)
I'm using spark version 1.6.1.
You can explode the users column in df1, join with df2 on the user column, then do the groupBy count:
(df1.withColumn("user", explode(col("users")))
.join(df2, Seq("user"))
.groupBy("gameid", "games")
.agg(count($"products").alias("cnt_assoc_prod"))
).show
+------+----------+--------------+
|gameid| games|cnt_assoc_prod|
+------+----------+--------------+
| 3| skating| 3|
| 2|basketball| 1|
| 1| cricket| 2|
+------+----------+--------------+

With Apache Spark flatten the first 2 rows of each group with Java

Given the following input table:
+----+------------+----------+
| id | shop | purchases|
+----+------------+----------+
| 1 | 01 | 20 |
| 1 | 02 | 31 |
| 2 | 03 | 5 |
| 1 | 03 | 3 |
+----+------------+----------+
I would like, grouping by id and ordering by purchases, to obtain the top 2 shops, as follows:
+----+-------+------+
| id | top_1 | top_2|
+----+-------+------+
| 1 | 02 | 01 |
| 2 | 03 | |
+----+-------+------+
I'm using Apache Spark 2.0.1 and the first table is the result of other queries and joins on a Dataset. I could perhaps do this by iterating over the Dataset with traditional Java, but I hope there is another way using the Dataset functionality.
My first attempt was the following:
//dataset is already ordered by id, purchases desc
...
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2) {
                // save it into another Dataset
            }
            counter++;
        }
    }
});
But then I was lost on how to save it into another Dataset. My goal is, in the end, to save the result into a MySQL table.
You can use window functions and pivot. First define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
add row_number and filter top two rows:
val dataset = Seq(
(1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")
val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
+---+---+----+
| id| 1| 2|
+---+---+----+
| 1| 02| 01|
| 2| 03|null|
+---+---+----+
I'll leave converting syntax to Java as an exercise for the poster (excluding import static for functions the rest should be close to identical).
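For reference, a possible Java translation (a sketch only; it assumes the input Dataset<Row> is called dataset with columns id, shop, purchases, and renames the pivoted columns to match the desired output):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.first;
import static org.apache.spark.sql.functions.row_number;

// Window over each id, highest purchases first
WindowSpec w = Window.partitionBy(col("id")).orderBy(col("purchases").desc());

// Keep only the top two shops per id
Dataset<Row> topTwo = dataset
        .withColumn("top", row_number().over(w))
        .where(col("top").leq(2));

// Pivot the rank into columns and rename them to match the desired output
Dataset<Row> result = topTwo
        .groupBy(col("id"))
        .pivot("top", java.util.Arrays.<Object>asList(1, 2))
        .agg(first("shop"))
        .withColumnRenamed("1", "top_1")
        .withColumnRenamed("2", "top_2");

result.show();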

JTable getting values of specific column

How can I get the values under a specific column in a JTable?
Example:
| Column 1 | Column 2 |
------------------------
|    1     |    a     |
|    2     |    b     |
|    3     |    c     |
How can I get the values under Column 1, that is [1, 2, 3], in the form of some data structure (preferably an array)?
You can do something like this:
ArrayList<Object> list = new ArrayList<>();
for (int i = 0; i < table.getModel().getRowCount(); i++) {
    list.add(table.getModel().getValueAt(i, 0)); // the value of each row at column index 0
}
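Since you mentioned preferring an array, you can then convert the list (a small follow-up to the snippet above):
Object[] columnValues = list.toArray(); // the values of column 0 as an array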

read pdf from itext

I made a table in a PDF using iText in a Java web application.
The generated PDF is:
Gender | Column 1 | Column 2 | Column 3
Male | 1845 | 645 | 254
Female | 214 | 457 | 142
To read the PDF I used the following code:
ArrayList<PdfPRow> allrows = firstable.getRows();
for (PdfPRow currentrow : allrows) {
    PdfPCell[] allcells = currentrow.getCells();
    System.out.println("CurrentRow -> " + currentrow.getCells());
    for (PdfPCell currentcell : allcells) {
        ArrayList<Element> element = (ArrayList<Element>) currentcell.getCompositeElements();
        System.out.println("Element -> " + element.toString());
    }
}
How do I read the text from the PDF columns and pass it to int variables?
Why don't you generate the columns of the PDF as form fields, so that reading will be much easier?
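For example, if each value were written into the PDF as an AcroForm text field, reading it back is just a lookup by field name. A sketch using the iText 5 AcroFields API (the file name and field name below are hypothetical):
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;

PdfReader reader = new PdfReader("report.pdf");                  // hypothetical file name
AcroFields form = reader.getAcroFields();
int maleColumn1 = Integer.parseInt(form.getField("male_col1"));  // hypothetical field name
reader.close();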
