I have created a Delta table and now I'm trying to insert data into it using foreachBatch(). I've followed this example. The only difference is that I'm using Java, and not in a notebook, but I suppose that should not make any difference?
My code looks as follows:
spark.sql("CREATE TABLE IF NOT EXISTS example_src_table(id int, load_date timestamp) USING DELTA LOCATION '/mnt/delta/events/example_src_table'");
Dataset<Row> exampleDF = spark.sql("SELECT e.id as id, e.load_date as load_date FROM example e");
try {
    exampleDF
        .writeStream()
        .format("delta")
        .foreachBatch((dataset, batchId) -> {
            dataset.persist();
            // Set the dataframe to view name
            dataset.createOrReplaceTempView("updates");
            // Use the view name to apply MERGE
            // NOTE: You have to use the SparkSession that has been used to define the `updates` dataframe
            dataset.sparkSession().sql("MERGE INTO example_src_table e" +
                " USING updates u" +
                " ON e.id = u.id" +
                " WHEN NOT MATCHED THEN INSERT (e.id, e.load_date) VALUES (u.id, u.load_date)");
        })
        .outputMode("update")
        .option("checkpointLocation", "/mnt/delta/events/_checkpoints/example_src_table")
        .start();
} catch (TimeoutException e) {
    e.printStackTrace();
}
This code runs without any problems, but no data is written to the Delta table at '/mnt/delta/events/example_src_table'. Does anyone know what I'm doing wrong?
I'm using Spark 3.0 and Java 8.
EDIT
Tested on a Databricks Notebook using Scala, and then it worked just fine.
Try to follow a syntax like the following one in case you want to update existing rows with the new data:
WHEN MATCHED THEN
UPDATE SET e.load_date = u.load_date, e.id = u.id
If you only want to insert the new data, you need something like this:
WHEN NOT MATCHED THEN INSERT *
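Applied to the Java code from the question, the write might then look roughly like the sketch below. This is only a sketch: it drops the .format("delta") call, which foreachBatch does not need, and adds awaitTermination(), since in a standalone Java main() (unlike a notebook) the driver can exit before any micro-batch runs if you don't block on the query. That last part is an assumption about your setup, not something the question confirms.

import java.util.concurrent.TimeoutException;
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

try {
    StreamingQuery query = exampleDF
        .writeStream()
        // no .format("delta") here; foreachBatch is the sink
        .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (dataset, batchId) -> {
            dataset.createOrReplaceTempView("updates");
            dataset.sparkSession().sql(
                "MERGE INTO example_src_table e USING updates u " +
                "ON e.id = u.id " +
                "WHEN NOT MATCHED THEN INSERT *");
        })
        .outputMode("update")
        .option("checkpointLocation", "/mnt/delta/events/_checkpoints/example_src_table")
        .start();
    // Block the driver so the streaming query actually gets to process batches.
    query.awaitTermination();
} catch (TimeoutException | StreamingQueryException e) {
    e.printStackTrace();
}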
I have a dataframe with the following schema, which I extract from a Hive table using the SQL below:
Id  Group_name  Sub_group_number  Year_Month
1   Active      1                 202110
2   Active      3                 202110
3   Inactive    4                 202110
4   Active      1                 202110
The SQL to extract the information is:
SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = 202110
AND id IN (SELECT Id FROM table WHERE Year_Month = 202109 AND Sub_group_number = 1)
After extracting this information, I want to group by Sub_group_number to get the count of Ids, as below:
df = (df.withColumn('FROM', F.lit(1))
.groupBy('Year_Month', 'FROM', 'Sub_group_number')
.count())
The result is a table as below:
Year_Month  From  Sub_group_number  Quantity
202110      1     1                 2
202110      1     3                 1
202110      1     4                 1
Up to this point there is no issue with my code and I'm able to run and execute action commands with Spark. The issue happens when I try to make year_month and sub_group parameters of my SQL in order to build a complete table. I'm using the following code:
sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]

for month in year_months:
    for group in sub_groups:
        query = f"""SELECT Id, Group_Name, Sub_group_number, Year_Month
                    FROM table
                    WHERE Year_Month = {month + 1}
                    AND id IN (SELECT Id FROM table WHERE Year_Month = {month} AND Sub_group_number = {group})"""
        df_temp = (spark.sql(query)
                   .withColumn('FROM', F.lit(group))
                   .groupBy('Year_Month', 'FROM', 'Sub_group_number')
                   .count())
        df = df.union(df_temp).dropDuplicates()
When I execute df.show() or try to write the result as a table, I get the error:
An error occurred while calling o8522.showString
Any ideas of what is causing this error?
You're attempting string interpolation. If you're using Python, maybe try this:
query = "SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = {0}
AND id IN (SELECT Id FROM table WHERE Year_Month = {1}
AND Sub_group_number = {2})".format(month + 1, month, group)
The error says it is a StackOverflowError, which can happen when the DAG plan grows too large. Because of Spark's lazy evaluation, this can easily happen with for-loops, especially nested ones. If you are curious, try df.explain() where you did df.show(); you should see a very long physical plan that Spark cannot actually execute.
To solve this, avoid the for-loops as much as possible, and in your case it seems you don't need them.
sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]
# Modify this to use the datetime lib for more robustness (e.g. handle 202112 -> 202201).
month_plus = [x + 1 for x in year_months]

def _to_str_elms(li):
    return str(li)[1:-1]

spark.sql(f"""
    SELECT Id, Group_Name, Sub_group_number, Year_Month
    FROM table
    WHERE Year_Month IN ({_to_str_elms(month_plus)})
    AND id IN (SELECT Id FROM table
               WHERE Year_Month IN ({_to_str_elms(year_months)})
               AND Sub_group_number IN ({_to_str_elms(sub_groups)}))
""")
UPDATE:
I think I now understand why you are looping: you need the "parent" group alongside the Sub_group_number of each record, and you are injecting it with lit() and the looped value. One way to rethink this problem is to first run a single query that fetches all records in [202101, 202102, 202103, 202104], and then use window functions to derive the parent group. I cannot yet see exactly what that looks like, so if you can share some sample records and the logic for how you derive the "group", I can update this answer.
I am trying to implement a SQL parser using gsqlparser from here. The source of the jar is in Java, but I am implementing the same thing in Scala.
Below is my query which contains a join condition.
SELECT e.last_name AS name, e.commission_pct comm, e.salary * 12 "Annual Salary" FROM scott.employees AS e right join scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn WHERE e.salary > 1000 ORDER BY e.first_name, e.last_name
I was able to parse the query to read the names and aliases of columns, the WHERE conditions, and the table names (checking the table names directly inside the query), as below.
val sqlParser = new TGSqlParser(EDbVendor.dbvsnowflake)
sqlParser.sqltext = "SELECT e.last_name AS name, e.commission_pct comm, e.salary * 12 \"Annual Salary\" FROM scott.employees AS e right join scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn WHERE e.salary > 1000 ORDER BY e.first_name, e.last_name"
// parse the statement before accessing it
sqlParser.parse()
val selectStmnt = sqlParser.sqlstatements.get(0).asInstanceOf[TSelectSqlStatement]
println("Columns List:")
for (i <- 0 until selectStmnt.getResultColumnList.size()) {
  val resCol = selectStmnt.getResultColumnList.getResultColumn(i)
  println("Column: " + resCol.getExpr.toString + " Alias: " + resCol.getAliasClause().toString)
}
Output:
Columns List:
Column: e.last_name Alias: name
Column: e.commission_pct Alias: comm
Column: e.salary * 12 Alias: "Annual Salary"
I am trying to parse the join condition and get the details inside it:
for (j <- 0 until selectStmnt.getJoins.size()) {
  println(selectStmnt.getJoins.getJoin(j).getTable)
}
The problem here is there is only one join condition in the query, so the size returned is 1.
Hence the output is scott.employees.
If I do it a bit differently, as below, using getJoinItems:
println("Parsing Join items")
for (j <- 0 until selectStmnt.getJoins.size()) {
  println(selectStmnt.getJoins.getJoin(j).getJoinItems)
}
I get the output by cutting off the first table from the join condition as below:
scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn
The method getJoinItems() returns a list, TJoinItemList, which I thought of traversing. But even its size is 1.
println(selectStmnt.getJoins.getJoin(j).getJoinItems.size()) -> 1
I am out of ideas now. Could anyone let me know how I can parse the query's join condition and get the table names inside the join?
I don't have access to the Snowflake dialect in GSP, but I mimicked this scenario with the Teradata dialect using the following query and created a SQL parser.
SELECT e.last_name as name
FROM department d
RIGHT JOIN
trimmed_employee e
ON d.dept_id = e.emp_id
WHERE e.salary > 1000
ORDER BY e.first_name
Here is the Groovy code for getting both tables, department and trimmed_employee. It boils down to iterating over each join and, while doing so, collecting the current join's items (joinItems) via curJoin.joinItems, only if it is not null.
stmt.joins.asList().collect { curJoin ->
[curJoin.table] + (curJoin?.joinItems?.asList()?.collect { joinItems -> joinItems.table } ?: [])
}.flatten()
Result:
department
trimmed_employee
For the simple SQL that you mentioned, in my case the following code also works.
stmt.tables.asList()
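If you prefer to stay closer to the Java API (the jar is Java anyway), a rough equivalent of the same traversal is sketched below. Note this is only a sketch: TJoinItemList.getJoinItem(int) is assumed by analogy with the other gsqlparser list types, so verify it against the version you are using.

import gudusoft.gsqlparser.EDbVendor;
import gudusoft.gsqlparser.TGSqlParser;
import gudusoft.gsqlparser.stmt.TSelectSqlStatement;

TGSqlParser parser = new TGSqlParser(EDbVendor.dbvteradata);
parser.sqltext = "SELECT e.last_name AS name FROM department d " +
        "RIGHT JOIN trimmed_employee e ON d.dept_id = e.emp_id " +
        "WHERE e.salary > 1000 ORDER BY e.first_name";
if (parser.parse() == 0) {
    TSelectSqlStatement select = (TSelectSqlStatement) parser.sqlstatements.get(0);
    for (int j = 0; j < select.getJoins().size(); j++) {
        // the left-hand table of the join
        System.out.println(select.getJoins().getJoin(j).getTable());
        // each joined item (the right-hand tables)
        for (int k = 0; k < select.getJoins().getJoin(j).getJoinItems().size(); k++) {
            System.out.println(select.getJoins().getJoin(j)
                    .getJoinItems().getJoinItem(k).getTable());
        }
    }
}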
I am trying to use the update query with the LIMIT clause using sqlite-JDBC.
Let's say there are 100 "bob" records in the table but I only want to update one of them.
Sample code:
String name1 = "bob";
String name2 = "alice";
String updateSql = "update mytable set user = :name1 " +
"where user is :name2 " +
"limit 1";
try (Connection con = sql2o.open()) {
con.createQuery(updateSql)
.addParameter("bob", name1)
.addParameter("alice", name2)
.executeUpdate();
} catch(Exception e) {
e.printStackTrace();
}
I get an error:
org.sql2o.Sql2oException: Error preparing statement - [SQLITE_ERROR] SQL error or missing database (near "limit": syntax error)
Using
sqlite-jdbc 3.31
sql2o 1.6 (easy database query library)
The flag:
SQLITE_ENABLE_UPDATE_DELETE_LIMIT
needs to be set to get the limit clause to work with the update query.
I know the SELECT method works with the LIMIT clause but I would need 2 queries to do this task; SELECT then UPDATE.
If there is no way to get LIMIT to work with UPDATE then I will just use the slightly more messy method of having a query and sub query to get things to work.
Maybe there is a way to get sqlite-JDBC to use an external sqlite engine outside of the integrated one, which has been compiled with the flag set.
Any help appreciated.
You can try this query instead:
UPDATE mytable SET user = :name1
WHERE ROWID = (SELECT MIN(ROWID)
FROM mytable
WHERE user = :name2);
ROWID is a special column available in all tables (unless you use WITHOUT ROWID)
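Wired into the sql2o snippet from the question, that could look roughly like this (a sketch that reuses the asker's table, column, and variable names):

String updateSql =
    "UPDATE mytable SET user = :name1 " +
    "WHERE ROWID = (SELECT MIN(ROWID) FROM mytable WHERE user = :name2)";

try (Connection con = sql2o.open()) {
    con.createQuery(updateSql)
       .addParameter("name1", name1)   // new value to set
       .addParameter("name2", name2)   // value identifying the row to update
       .executeUpdate();
} catch (Exception e) {
    e.printStackTrace();
}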
I am trying to read a Cloud SQL table in Java Beam using JdbcIO.Read. I want to convert each row of the ResultSet into a GenericData.Record using the .withRowMapper(ResultSet resultSet) method. Is there a way I can pass a JSON schema string as input to the .withRowMapper method, the way a ParDo accepts side inputs as a PCollectionView?
I have tried doing both read operations (reading from information_schema.columns and my table in the same JdbcIO.Read transform). However, I would like to have the schema PCollection generated first and then read the table using JdbcIO.Read.
I am generating the Avro schema of the table on the fly like this:
PCollection<String> avroSchema = pipeline.apply(JdbcIO.<String>read()
    .withDataSourceConfiguration(config)
    .withCoder(StringUtf8Coder.of())
    .withQuery("SELECT DISTINCT column_name, data_type \n" +
        "FROM information_schema.columns\n" +
        "WHERE table_name = " + "'" + tableName + "'")
    .withRowMapper((JdbcIO.RowMapper<String>) resultSet -> {
        // code here to generate avro schema string
        // this works fine for me
    }));
Then I create a PCollectionView which will hold my JSON schema for each table.
PCollectionView<String> s = avroSchema.apply(View.<String>asSingleton());
// I want to access this view as a side input in the next JdbcIO.Read operation,
// something like this:
pipeline.apply(JdbcIO.<String>read()
    .withDataSourceConfiguration(config)
    .withCoder(StringUtf8Coder.of())
    .withQuery(queryString)
    .withRowMapper(new JdbcIO.RowMapper<String>() {
        @Override
        public String mapRow(ResultSet resultSet) throws Exception {
            // access schema here and use it to parse and create
            // GenericData.Record from ResultSet fields as per schema
            return null;
        }
    }))
    .withSideInputs(My PCollectionView here); // this option is not there right now.
Is there any better way to approach this problem?
At this point the IO APIs do not accept side inputs.
It should be feasible to add a ParDo right after the read and do the mapping there. That ParDo can accept side inputs.
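For example, a minimal sketch of that shape might look like the following. It assumes rows is the PCollection<String> produced by the JdbcIO.Read above (one serialized row per element) and avroSchema is the schema PCollection from the question; the c.output(...) line is only a placeholder for where you would build the GenericData.Record.

import org.apache.avro.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

PCollectionView<String> schemaView = avroSchema.apply(View.<String>asSingleton());

PCollection<String> mapped = rows.apply("ApplySchema",
    ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // The schema string is available here because it was passed as a side input.
            Schema schema = new Schema.Parser().parse(c.sideInput(schemaView));
            // Build a GenericData.Record from the row fields using 'schema' here;
            // emitting the schema name is just a placeholder.
            c.output(schema.getFullName() + ": " + c.element());
        }
    }).withSideInputs(schemaView));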
I wonder if there is a good way to build a JPQL query based on a filter (my query is too "expressive" and I cannot use Criteria).
Something like:
query = "Select from Ent"
if(parameter!=null){
query += "WHERE field=:parameter"
}
if(parameter2!=null) {
query += "WHERE field2=:parameter2"
}
But I would write WHERE twice!! And the number of cases explodes as the number of parameters increases, because anywhere from none to all of them could be null.
Any hint on how to build these filter-based queries in a proper way?
select e from Ent e
where (e.field1 = :parameter1 or :parameter1 is null)
and (e.field2 = :parameter2 or :parameter2 is null)
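Bound from Java, that pattern could look roughly like this (a sketch; the Ent entity, field names, and entityManager are placeholders taken from the question, and binding null parameters this way can be provider/database dependent):

import java.util.List;
import javax.persistence.TypedQuery;

TypedQuery<Ent> q = entityManager.createQuery(
    "select e from Ent e " +
    "where (e.field1 = :parameter1 or :parameter1 is null) " +
    "and (e.field2 = :parameter2 or :parameter2 is null)", Ent.class);
q.setParameter("parameter1", parameter1); // may be null
q.setParameter("parameter2", parameter2); // may be null
List<Ent> result = q.getResultList();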
Why can't you use a Criteria query, like this?
Other options (less good, IMHO):
Create two named queries, one for each condition, then call the respective one.
Or build up a string and use a native query.
Oh, do you just mean the string formation? For example:
query = "Select from Ent where 1=1 "
if(parameter!=null){
query += " and field=:parameter"
}
if(parameter2!=null) {
query += " and field2=:parameter2"
}
(I think that string formation is ugly, but it seemed to be what was asked for)
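If you'd rather avoid both the repeated WHERE and the 1=1 trick, one sketch (using the same hypothetical Ent entity and entityManager as above) is to collect the conditions first and join them once at the end:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.persistence.TypedQuery;

// Collect conditions and their parameter values, then build the WHERE clause once.
List<String> conditions = new ArrayList<>();
Map<String, Object> params = new HashMap<>();

if (parameter != null) {
    conditions.add("e.field = :parameter");
    params.put("parameter", parameter);
}
if (parameter2 != null) {
    conditions.add("e.field2 = :parameter2");
    params.put("parameter2", parameter2);
}

String jpql = "SELECT e FROM Ent e";
if (!conditions.isEmpty()) {
    jpql += " WHERE " + String.join(" AND ", conditions);
}

TypedQuery<Ent> q = entityManager.createQuery(jpql, Ent.class);
params.forEach(q::setParameter); // only binds parameters that were actually added
List<Ent> result = q.getResultList();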