I am trying to implement a SQL parser and am using gsqlparser from here. The source of the jar is in Java, but I am implementing the same in Scala.
Below is my query which contains a join condition.
SELECT e.last_name AS name, e.commission_pct comm, e.salary * 12 "Annual Salary" FROM scott.employees AS e right join scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn WHERE e.salary > 1000 ORDER BY e.first_name, e.last_name
I was able to parse the query to read the names and aliases of columns, the where conditions, and the table names (checking the table names directly inside the query), as below.
val sqlParser = new TGSqlParser(EDbVendor.dbvsnowflake)
sqlParser.sqltext = "SELECT e.last_name AS name, e.commission_pct comm, e.salary * 12 \"Annual Salary\" FROM scott.employees AS e right join scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn WHERE e.salary > 1000 ORDER BY e.first_name, e.last_name"
sqlParser.parse() // parse first; returns 0 on success
val selectStmnt = sqlParser.sqlstatements.get(0).asInstanceOf[TSelectSqlStatement]
println("Columns List:")
for (i <- 0 until selectStmnt.getResultColumnList.size()) {
  val resCol = selectStmnt.getResultColumnList.getResultColumn(i)
  println("Column: " + resCol.getExpr.toString + " Alias: " + resCol.getAliasClause().toString)
}
Output:
Columns List:
Column: e.last_name Alias: name
Column: e.commission_pct Alias: comm
Column: e.salary * 12 Alias: "Annual Salary"
I am trying to parse the join condition and get the details inside it
for (j <- 0 until selectStmnt.getJoins.size()) {
  println(selectStmnt.getJoins.getJoin(j).getTable)
}
The problem here is there is only one join condition in the query, so the size returned is 1.
Hence the output is scott.employees.
If I do it a bit differently, as below, using getJoinItems:
println("Parsing Join items")
for (j <- 0 until selectStmnt.getJoins.size()) {
  println(selectStmnt.getJoins.getJoin(j).getJoinItems)
}
I get the output with the first table cut off from the join condition, as below:
scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn
The method getJoinItems() returns a TJoinItemList, which I thought of traversing, but even its size is 1:
println(selectStmnt.getJoins.getJoin(j).getJoinItems.size()) -> 1
I am out of ideas now. Could anyone let me know how I can parse the query's join condition and get the table names inside the join?
I don't have access to the Snowflake dialect in GSP, but I mimicked this scenario with the Teradata dialect using the following query and created a sql parser.
SELECT e.last_name as name
FROM department d
RIGHT JOIN
trimmed_employee e
ON d.dept_id = e.emp_id
WHERE e.salary > 1000
ORDER BY e.first_name
Here is the Groovy code that gets both tables, department and trimmed_employee. It boils down to iterating over each join and, while doing so, collecting the current join's items (joinItems) via curJoin.joinItems, but only when they are not null.
stmt.joins.asList().collect { curJoin ->
    [curJoin.table] + (curJoin?.joinItems?.asList()?.collect { joinItem -> joinItem.table } ?: [])
}.flatten()
Result:
department
trimmed_employee
For the simple sql you mentioned, in my case the following code also works.
stmt.tables.asList()
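Since the question is written in Scala, here is a rough Scala equivalent of the same traversal. This is an untested sketch; it assumes TJoinItemList exposes getJoinItem(i), following the naming convention of the other GSP list types:

val joins = selectStmnt.getJoins
val tables = (0 until joins.size()).flatMap { j =>
  val join = joins.getJoin(j)
  // The table on the left-hand side of the join...
  val left = Seq(join.getTable)
  // ...plus the table of every join item, guarding against a null list.
  val rights = Option(join.getJoinItems)
    .map(items => (0 until items.size()).map(k => items.getJoinItem(k).getTable))
    .getOrElse(Seq.empty)
  left ++ rights
}
tables.foreach(println) // department, trimmed_employee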
I have a dataframe with the following schema, which I extract from a Hive table using the SQL below:
Id | Group_name | Sub_group_number | Year_Month
1  | Active     | 1                | 202110
2  | Active     | 3                | 202110
3  | Inactive   | 4                | 202110
4  | Active     | 1                | 202110
The T-SQL to extract the information is:
SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = 202110
AND id IN (SELECT Id FROM table WHERE Year_Month = 202109 AND Sub_group_number = 1)
After extracting this information, I want to group by Sub_group_number to get the quantity of Ids, as below:
df = (df.withColumn('FROM', F.lit(1))
        .groupBy('Year_Month', 'FROM', 'Sub_group_number')
        .count())
The result is a table as below:
Year_Month | From | Sub_group_number | Quantity
202110     | 1    | 1                | 2
202110     | 1    | 3                | 1
202110     | 1    | 4                | 1
Up to this point there is no issue in my code and I'm able to run and execute action commands with Spark. The issue happens when I try to make year_month and sub_group parameters of my query in order to build the complete table. I'm using the following code:
sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]

for month in year_months:
    for group in sub_groups:
        query = f"""SELECT Id, Group_Name, Sub_group_number, Year_Month
                    FROM table
                    WHERE Year_Month = {month + 1}
                    AND id IN (SELECT Id FROM table WHERE Year_Month = {month} AND Sub_group_number = {group})"""
        df_temp = (spark.sql(query)
                   .withColumn('FROM', F.lit(group))
                   .groupBy('Year_Month', 'FROM', 'Sub_group_number')
                   .count())
        df = df.union(df_temp).dropDuplicates()
When I execute df.show() or try to write the result as a table, I get this error:
An error occurred while calling o8522.showString
Any ideas of what is causing this error?
You're attempting string interpolation.
If using Python, maybe try this:
query = "SELECT Id, Group_Name, Sub_group_number, Year_Month
FROM table
WHERE Year_Month = {0}
AND id IN (SELECT Id FROM table WHERE Year_Month = {1}
AND Sub_group_number = {2})".format(month + 1, month, group)
The error is a StackOverflowError, which can happen when the DAG plan grows too large. Because of Spark's lazy evaluation, this can easily happen with for-loops, especially nested ones. If you are curious, try df.explain() where you did df.show(); you should see a physical plan so long that Spark cannot actually run it.
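As a standalone illustration of how the plan balloons (a toy sketch of my own, not the poster's code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10)
for i in range(50):
    # Each iteration stacks another Union + Deduplicate step onto the lineage.
    df = df.union(spark.range(10)).dropDuplicates()
df.explain()  # prints a physical plan that grows with every iteration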
To solve this, you want to avoid the for-loop as much as possible, and in your case it seems you don't need it.
sub_groups = [i for i in range(22)]
year_months = [202101, 202102, 202103]
# Modify this to use a datetime lib for more robustness (ex: handle 202112 -> 202201).
month_plus = [x + 1 for x in year_months]

def _to_str_elms(li):
    # [202101, 202102] -> "202101, 202102"
    return str(li)[1:-1]

spark.sql(f"""
    SELECT Id, Group_Name, Sub_group_number, Year_Month
    FROM table
    WHERE Year_Month IN ({_to_str_elms(month_plus)})
    AND id IN (SELECT Id FROM table WHERE Year_Month IN ({_to_str_elms(year_months)}) AND Sub_group_number IN ({_to_str_elms(sub_groups)}))
""")
UPDATE:
I think I understand why you are looping. You need the "parent" group alongside the Sub_group_number of each record, and you are producing it with lit and the looped value. I think one way is to rethink the problem: first run a query to fetch all records that are in [202101, 202102, 202103, 202104], then use some window functions to figure out the parent group. I cannot yet foresee what that looks like, so if you can give us some sample records and the logic for how you want to derive the "group", I can perhaps provide updates.
I'm trying to write an SQL query using CriteriaQuery, but I'm having a hard time doing so. This query basically gets a list of shipments and sorts them by their authorization date. This authorization date is represented as the date attribute of the first record in the status transition messages table with an initial status of 3 and a final status of 4. This is my query:
SELECT s.id
FROM shipment s
ORDER BY (SELECT min(stm.date)
FROM status_transition_message stm
WHERE stm.initial_status = 1 AND stm.final_status = 3 AND stm.shipment_id = s.id) desc;
I've tried multiple different solutions, but none have worked so far.
My current iteration is as follows:
private void sortByAuthDate(Root<ShipmentTbl> root, CriteriaQuery<?> query, CriteriaBuilder builder, ListSort sort) {
    Subquery<Timestamp> authDateQuery = query.subquery(Timestamp.class);
    Root<StatusTransitionMessageTbl> stmRoot = authDateQuery.from(StatusTransitionMessageTbl.class);

    Predicate shipmentId = builder.equal(stmRoot.<ShipmentTbl>get("shipment").<String>get("id"), root.<String>get("id"));
    Predicate initialStatus = builder.equal(stmRoot.<Integer>get("initialStatus"), 3);
    Predicate finalStatus = builder.equal(stmRoot.<Integer>get("finalStatus"), 4);

    // returns the authorization date for each queried shipment
    authDateQuery.select(builder.least(stmRoot.<Timestamp>get("date")))
                 .where(builder.and(shipmentId, initialStatus, finalStatus));

    Expression<Timestamp> authDate = authDateQuery.getSelection();
    Order o = sort.getSortDirection() == ListSort.SortDirection.ASC ? builder.asc(authDate) : builder.desc(authDate);
    query.multiselect(authDate).orderBy(o);
}
The problem with this solution is that the SQL query generated by the CriteriaQuery does not support subqueries in the ORDER BY clause, causing a parsing exception.
My CriteriaQuery-fu is not good enough to help you with that part, but you could rewrite your SQL query to this:
SELECT s.id
FROM shipment s
LEFT JOIN status_transition_message stm
ON stm.initial_status = 1 AND stm.final_status = 3 AND stm.shipment_id = s.id
GROUP BY s.id
ORDER BY min(stm.date) DESC;
To me, this quite likely seems to be a faster solution anyway than running a correlated subquery in the ORDER BY clause, especially on an RDBMS with a less sophisticated optimiser.
So I attempted to follow Lukas Eder's solution and reached this:
private void sortByAuthDate(Root<ShipmentTbl> root, CriteriaQuery<?> query, CriteriaBuilder builder, ShipmentListSort sort) {
    Join<ShipmentTbl, StatusTransitionMessageTbl> shipmentStatuses = root.join("shipmentStatus", JoinType.LEFT);

    Predicate initialStatus = builder.equal(shipmentStatuses.<Integer>get("initialStatus"), 1);
    Predicate finalStatus = builder.equal(shipmentStatuses.<Integer>get("finalStatus"), 3);

    Expression<Timestamp> authDate = builder.least(shipmentStatuses.<Timestamp>get("date"));
    Order o = sort.getSortDirection() == ShipmentListSort.SortDirection.ASC ? builder.asc(authDate) : builder.desc(authDate);

    shipmentStatuses.on(builder.and(initialStatus, finalStatus));
    query.multiselect(authDate).groupBy(root.<String>get("id")).orderBy(o);
}
But now it's throwing this exception:
ERROR o.h.e.jdbc.spi.SqlExceptionHelper - ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This happens because the query only selects distinct shipments later on, and the database requires the sorting column to also appear in the select list. The problem is that I don't know how to force CriteriaQuery to keep that column in the SELECT statement; it automatically puts it only in the ORDER BY.
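For reference, the shape the database insists on looks roughly like this (illustrative SQL of my own, not what Hibernate generates):

SELECT DISTINCT s.id, MIN(stm.date) AS auth_date  -- the ORDER BY expression must appear here
FROM shipment s
LEFT JOIN status_transition_message stm
  ON stm.initial_status = 1 AND stm.final_status = 3 AND stm.shipment_id = s.id
GROUP BY s.id
ORDER BY auth_date DESC;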
Here's the JPQL query it's executing in my test:
select
distinct generatedAlias0
from
ShipmentTbl as generatedAlias0
left join
generatedAlias0.shipmentStatus as generatedAlias1 with ( generatedAlias1.initialStatus=:param0 )
and (
generatedAlias1.finalStatus=:param1
)
where
lower(generatedAlias0.shipmentName) like :param2
group by
generatedAlias0.id
order by
min(generatedAlias1.date) desc
and the generated SQL query:
select
distinct shipmenttb0_.id as id1_13_,
shipmenttb0_.archived_date as archived2_13_,
shipmenttb0_.auth_code as auth_cod3_13_,
shipmenttb0_.authorization_date as authoriz4_13_,
shipmenttb0_.booked_in_by_user as booked_i5_13_,
shipmenttb0_.business_channel as business6_13_,
shipmenttb0_.courier as courier7_13_,
shipmenttb0_.courier_amount as courier_8_13_,
shipmenttb0_.courier_currency as courier_9_13_,
shipmenttb0_.ship_to as ship_to39_13_,
shipmenttb0_.estimated_shipment_date as estimat10_13_,
shipmenttb0_.last_updated_date as last_up11_13_,
shipmenttb0_.measurement_unit as measure12_13_,
shipmenttb0_.original_submitted_date as origina13_13_,
shipmenttb0_.packaging_type as packagi14_13_,
shipmenttb0_.placeholder_message as placeho15_13_,
shipmenttb0_.scheduled_period_of_day as schedul16_13_,
shipmenttb0_.scheduled_shipment_date as schedul17_13_,
shipmenttb0_.ship_from as ship_fr40_13_,
shipmenttb0_.ship_origin as ship_or41_13_,
shipmenttb0_.shipment_name as shipmen18_13_,
shipmenttb0_.status as status19_13_,
shipmenttb0_.submitted_date as submitt20_13_,
shipmenttb0_.supplier_contact_email as supplie21_13_,
shipmenttb0_.supplier_contact_name as supplie22_13_,
shipmenttb0_.supplier_contact_phone_number as supplie23_13_,
shipmenttb0_.supplier_email as supplie24_13_,
shipmenttb0_.supplier_secondary_contact_email as supplie25_13_,
shipmenttb0_.supplier_secondary_contact_name as supplie26_13_,
shipmenttb0_.supplier_secondary_contact_phone_number as supplie27_13_,
shipmenttb0_.tenant as tenant28_13_,
shipmenttb0_.total_received_boxes as total_r29_13_,
shipmenttb0_.total_units as total_u30_13_,
shipmenttb0_.total_value as total_v31_13_,
shipmenttb0_.total_volume as total_v32_13_,
shipmenttb0_.total_weight as total_w33_13_,
shipmenttb0_.tracking_number as trackin34_13_,
shipmenttb0_.tt_note as tt_note35_13_,
shipmenttb0_.tt_priority as tt_prio36_13_,
shipmenttb0_.updated_by_user as updated37_13_,
shipmenttb0_.weight_unit as weight_38_13_
from
shipment shipmenttb0_
left outer join
status_transition_message shipmentst1_
on shipmenttb0_.id=shipmentst1_.shipment_id
and (
shipmentst1_.initial_status=?
and shipmentst1_.final_status=?
)
where
lower(shipmenttb0_.shipment_name) like ?
group by
shipmenttb0_.id
order by
min(shipmentst1_.date) desc limit ?
I'm working with a database and servlets, and I ran into a problem. I need to fetch data from the database six items per page, so I wrote this query:
SELECT *, COUNT(*) AS 'count'
FROM product
INNER JOIN product_category
on product.product_category_id = product_category.id
INNER JOIN company_manufacturer_product
on product.company_manufacturer_product_id =
company_manufacturer_product.id
GROUP BY 1 LIMIT 6 OFFSET 0;
where 6 is the maximum number of items per page and 0 is the page number multiplied by the maximum quantity of goods. But with this implementation the second page shows duplicate products. How can I improve it?
The part of the code where I form the request:
StringBuilder startResponse = new StringBuilder("SELECT *, COUNT(*) AS 'count' FROM product " +
        "INNER JOIN product_category on product.product_category_id = product_category.id " +
        "INNER JOIN company_manufacturer_product on product.company_manufacturer_product_id=company_manufacturer_product.id");
if (nonNull(form.getProductMax()) && nonNull(form.getPage())) {
    startResponse.append(" LIMIT ").append(form.getProductMax())
                 .append(" OFFSET ").append(form.getPage() * form.getProductMax());
}
My database response without LIMIT and OFFSET returns the full result set (screenshot omitted). The query described above, with LIMIT 6 OFFSET 0, is what is sent when I open the first page of goods (screenshot omitted). When I turn to the second page of goods, I send this query:
SELECT * , COUNT(*) AS 'count'
FROM product
INNER JOIN product_category
on product.product_category_id = product_category.id
INNER JOIN company_manufacturer_product
on product.company_manufacturer_product_id =
company_manufacturer_product.id
GROUP BY 1 LIMIT 6 OFFSET 6;
and I get a response that repeats products from the first page (screenshot omitted). I cannot understand what the problem is. I have to use queries with COUNT! How can I fix this?
Not arguing against the accepted solution: as described above, adding ORDER BY to the original SQL does solve the problem.
But I think there is a better practice for pagination: using parameters like has_more, last_product_id and limit_num to connect the client with the server.
has_more indicates whether more data is left on the server;
last_product_id is the id of the last row in the previous response;
limit_num is the number of rows per page.
So the client can use has_more to decide whether to send another request; if it does, it sends last_product_id and limit_num to the server, and the server-side SQL can be:
select * from table where id < $last_product_id order by id desc limit $limit_num + 1;  -- => $datas
Then use count($datas) and $limit_num to calculate has_more and last_product_id:
$has_more = 0;
$data_num = count($datas);
if ($data_num > $limit_num) {   // we fetched limit_num + 1 rows, so more data exists
    $has_more = 1;
    array_pop($datas);          // drop the extra row
    $data_num--;
}
$last_product_id = end($datas)['id'] ?? 0;
Without an ORDER BY, the database may return rows in a different order on each request, so consecutive LIMIT/OFFSET pages can overlap. Grouping and ordering by product.id makes the paging deterministic:
SELECT *, COUNT(product.id) AS 'count'
FROM product
INNER JOIN product_category on product.product_category_id = product_category.id
INNER JOIN company_manufacturer_product on product.company_manufacturer_product_id = company_manufacturer_product.id
GROUP BY product.id
ORDER BY product.id
LIMIT 6 OFFSET 0;
I need to scrub an SQL Server table on a regular basis, but my solution is taking ridiculously long (about 12 minutes for 73,000 records).
My table has 4 fields:
id1
id2
val1
val2
For every group of records with the same "id1", I need to keep the first (lowest id2) and last (highest id2) and delete everything in between UNLESS val1 or val2 has changed from the previous (next lowest "id2") record.
If you're following me so far, what would a more efficient algorithm be? Here is my java code:
boolean bDEL = false;

qps = conn.prepareStatement("SELECT id1, id2, val1, val2 from STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();

// KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 or val2
while (qrs.next()) {
    thisID1 = qrs.getInt("id1");
    thisID2 = qrs.getInt("id2");
    thisVAL1 = qrs.getInt("val1");
    thisVAL2 = qrs.getDouble("val2");
    if (thisID1 == lastID1) {
        if (bDEL) { // Ensures this is not the last record
            qps2 = conn2.prepareStatement("DELETE FROM STATUS_DATA where id1=" + lastID1 + " and id2=" + lastID2);
            qps2.executeUpdate();
            qps2.close();
            bDEL = false;
        }
        if (thisVAL1 == lastVAL1 && thisVAL2 == lastVAL2) {
            bDEL = true;
        }
    } else if (bDEL) bDEL = false;
    lastID1 = thisID1;
    lastID2 = thisID2;
    lastVAL1 = thisVAL1;
    lastVAL2 = thisVAL2;
}
UPDATE 4/20/2015 @ 11:10 AM
OK, so here is my final solution. For every record, the Java code writes an XML element into a string, which is written to a file every 10,000 records; Java then calls a stored procedure on SQL Server, passing the file name to read. The stored procedure can only use the file name as a variable if dynamic SQL is used to execute the OPENROWSET. I will play around with the interval of procedure execution, but so far my performance results are as follows:
BEFORE (1 record delete at a time):
73,000 records processed, 101 records per second
AFTER (bulk XML import):
1.4 Million records processed, 5800 records per second
JAVA SNIPPET:
String ts, sXML = "<DataRecords>\n";
boolean bDEL = false;

qps = conn.prepareStatement("SELECT id1, id2, val1, val2 from STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();

// KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 or val2
while (qrs.next()) {
    thisID1 = qrs.getInt("id1");
    thisID2 = qrs.getInt("id2");
    thisVAL1 = qrs.getInt("val1");
    thisVAL2 = qrs.getDouble("val2");
    if (bDEL && thisID1 == lastID1) { // Ensures this is not the first or last record
        sXML += "<nxtrec id1=\"" + lastID1 + "\" id2=\"" + lastID2 + "\"/>\n";
        if ((i + 1) % 10000 == 0) {                                // Execute every 10000 records
            sXML += "</DataRecords>\n";                            // Close off parent tag
            ts = String.valueOf((new java.util.Date()).getTime()); // Each XML file uniquely named
            writeFile(sDir, "ds" + ts + ".xml", sXML);             // Write XML to file
            conn2 = dataSource.getConnection();
            cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
            cs.setString(1, sdir + "ds" + ts + ".xml");
            cs.executeUpdate();                                    // Execute stored procedure
            cs.close(); conn2.close();
            deleteFile(SHMdirdata, "ds" + ts + ".xml");            // Delete file
            sXML = "<DataRecords>\n";
        }
        bDEL = false;
    }
    if (thisID1 == lastID1 && thisVAL1 == lastVAL1 && thisVAL2 == lastVAL2) {
        bDEL = true;
    } else if (bDEL) bDEL = false;
    lastID1 = thisID1;
    lastID2 = thisID2;
    lastVAL1 = thisVAL1;
    lastVAL2 = thisVAL2;
    i++;
}
qrs.close(); qps.close(); conn.close();

sXML += "</DataRecords>\n";
ts = String.valueOf((new java.util.Date()).getTime());
writeFile(sdir, "ds" + ts + ".xml", sXML);
conn2 = dataSource.getConnection();
cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
cs.setString(1, sdir + "ds" + ts + ".xml");
cs.executeUpdate();
cs.close(); conn2.close();
deleteFile(SHMdirdata, "ds" + ts + ".xml");
XML FILE OUTPUT:
<DataRecords>
<nxtrec id1="100" id2="1112"/>
<nxtrec id1="100" id2="1113"/>
<nxtrec id1="100" id2="1117"/>
<nxtrec id1="102" id2="1114"/>
...
<nxtrec id1="838" id2="1112"/>
</DataRecords>
SQL SERVER STORED PROCEDURE:
CREATE PROCEDURE [dbo].[SCRUB_DATA] @floc varchar(100) -- File Location (dir + filename) as only parameter
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @sql as varchar(max);
    SET @sql = '
        DECLARE @XmlFile XML

        SELECT @XmlFile = BulkColumn
        FROM OPENROWSET(BULK ''' + @floc + ''', SINGLE_BLOB) x;

        CREATE TABLE #TEMP_TABLE (id1 INT, id2 INT);

        INSERT INTO #TEMP_TABLE (id1, id2)
        SELECT
            id1 = DataTab.value(''@id1'', ''int''),
            id2 = DataTab.value(''@id2'', ''int'')
        FROM
            @XmlFile.nodes(''/DataRecords/nxtrec'') AS XTbl(DataTab);

        delete from D
        from STATUS_DATA D
        inner join #TEMP_TABLE T on ( (T.id1 = D.id1) and (T.id2 = D.id2) );
    ';
    EXEC (@sql);
END
It is almost certain that your performance issue is not in your algorithm, but in the implementation. Say your cleanup step has to remove 10,000 records; that means 10,000 round trips to your database server.
Instead of doing that, write each of the id pairs to be deleted to an XML file, and send that XML file to a SQL Server stored proc that shreds the XML into a corresponding temp table or table variable. Then use a single delete (or equivalent) to remove all 10K rows.
If you don't know how to shred XML in T-SQL, it is well worth the time to learn. Take a look at a simple example to get you started, or just check out a couple of search results for "tsql shred xml" to get going.
ADDED
Pulling 10K records to the client should take < 1 second, and likewise for your Java code. If you don't have the time to learn to use XML as suggested, you could write a quick and dirty stored proc that accepts 10 (20, 50?) pairs of ids and deletes the corresponding records from within the stored proc. I use the XML approach regularly to "batch" stuff from the client. If your batches are "large", you might take a look at the BULK INSERT command on SQL Server; but the XML is easy and a bit more flexible, as it can contain nested data structures, e.g. master/detail relationships.
ADDED
I just did this locally
create table #tmp
(
id int not null
primary key(id)
)
GO
insert #tmp (id)
select 4
union
select 5
GO
-- #tmp now has two rows
delete from L
from TaskList L
inner join #tmp T on (T.id = L.taskID)
(2 row(s) affected)
-- and they are no longer in TaskList
i.e., this should not be a problem unless you are doing it wrong somehow. Are you creating the temp table and then attempting to use it on a different database connection/session? If the sessions are different, the temp table won't be visible in the second session.
It is hard to think of another way for this to go wrong off the top of my head.
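To illustrate the session pitfall, here is a hedged JDBC sketch (my own; it assumes a dataSource like the one in the question, and the TaskList table from the example above). The temp table must be created and used on the same Connection:

import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

void deleteBatch(DataSource dataSource) throws Exception {
    try (Connection conn = dataSource.getConnection();
         Statement st = conn.createStatement()) {
        st.execute("CREATE TABLE #tmp (id INT NOT NULL PRIMARY KEY)"); // #tmp lives only in this session
        st.execute("INSERT INTO #tmp (id) SELECT 4 UNION SELECT 5");
        // Works: the same session still sees #tmp
        int deleted = st.executeUpdate(
            "DELETE FROM L FROM TaskList L INNER JOIN #tmp T ON (T.id = L.taskID)");
        System.out.println(deleted + " row(s) deleted");
    } // a second connection would never see #tmp; it is dropped when this session ends
}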
Have you considered pushing more of the calculation into SQL instead of Java?
This is ugly and doesn't take your "value changing" part into account, but it could be a lot faster:
(This deletes everything except the highest and lowest id2 for each id1.)
select * into #temp
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2) AS RowNo,
             *
      from myTable) x

delete i
from myTable i
left outer join
    (select t.*
     from #temp t
     left outer join (select id1, max(RowNo) RowNo from #temp group by id1) x
         on x.id1 = t.id1 and x.RowNo = t.RowNo
     where t.RowNo != 1 and x.RowNo is null) z
    on z.id2 = i.id2 and z.id1 = i.id1
where z.id1 is not null
Never underestimate the power of SQL =)
Although I understand this seems more 'straightforward' to implement in a row-by-row fashion, doing it 'set-based' will make it fly.
Some code to create test-data:
SET NOCOUNT ON
IF OBJECT_ID('mySTATUS_DATA') IS NOT NULL DROP TABLE mySTATUS_DATA
GO
CREATE TABLE mySTATUS_DATA (id1 int NOT NULL,
                            id2 int NOT NULL,
                            val1 varchar(100) NOT NULL,
                            val2 varchar(100) NOT NULL,
                            PRIMARY KEY (id1, id2))
GO
DECLARE @counter int,
        @id1 int,
        @id2 int,
        @val1 varchar(100),
        @val2 varchar(100)

SELECT @counter = 100000,
       @id1 = 1,
       @id2 = 1,
       @val1 = 'abc',
       @val2 = '123456'

BEGIN TRANSACTION
WHILE @counter > 0
BEGIN
    INSERT mySTATUS_DATA (id1, id2, val1, val2)
    VALUES (@id1, @id2, @val1, @val2)

    SELECT @counter = @counter - 1
    SELECT @id2 = @id2 + 1
    SELECT @id1 = @id1 + 1, @id2 = 1 WHERE Rand() > 0.8
    SELECT @val1 = SubString(convert(varchar(100), NewID()), 0, 9) WHERE Rand() > 0.90
    SELECT @val2 = SubString(convert(varchar(100), NewID()), 0, 9) WHERE Rand() > 0.90

    IF @counter % 1000 = 0
    BEGIN
        COMMIT TRANSACTION
        BEGIN TRANSACTION
    END
END
COMMIT TRANSACTION
SELECT top 1000 * FROM mySTATUS_DATA
SELECT COUNT(*) FROM mySTATUS_DATA
And here is the code that does the actual scrubbing. Mind that the why column is there merely for educational purposes; if you're going to put this in production, I'd advise commenting it out, as it only slows down the operations. Also, you could combine the checks on val1 and val2 into one single UPDATE; in fact, with a bit of effort you could probably combine everything into one single DELETE statement. However, I very much doubt it would make things much faster, and it would surely make things a lot less readable.
Anyway, when I run this on my laptop for 100k records it takes only 5 seconds, so I doubt performance is going to be an issue.
IF OBJECT_ID('tempdb..#working') IS NOT NULL DROP TABLE #working
GO
-- create copy of table
SELECT id1, id2, id2_seqnr = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
val1, val2,
keep_this_record = Convert(bit, 0),
why = Convert(varchar(500), NULL)
INTO #working
FROM STATUS_DATA
WHERE 1 = 2
-- load records
INSERT #working (id1, id2, id2_seqnr, val1, val2, keep_this_record, why)
SELECT id1, id2, id2_seqnr = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
val1, val2,
keep_this_record = Convert(bit, 0),
why = ''
FROM STATUS_DATA
-- index
CREATE UNIQUE CLUSTERED INDEX uq0 ON #working (id1, id2_seqnr)
-- make sure we keep the first record of each id1
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'first id2 for id1 = ' + Convert(varchar, id1) + ','
FROM #working upd
WHERE id2_seqnr = 1 -- first in sequence
-- make sure we keep the last record of each id1
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'last id2 for id1 = ' + Convert(varchar, upd.id1) + ','
FROM #working upd
JOIN (SELECT id1, max_seqnr = MAX(id2_seqnr)
FROM #working
GROUP BY id1) mx
ON upd.id1 = mx.id1
AND upd.id2_seqnr = mx.max_seqnr
-- check if val1 has changed versus the previous record
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'val1 for ' + Convert(varchar, upd.id1) + '/' + Convert(varchar, upd.id2) + ' differs from val1 for ' + Convert(varchar, prev.id1) + '/' + Convert(varchar, prev.id2) + ','
FROM #working upd
JOIN #working prev
ON prev.id1 = upd.id1
AND prev.id2_seqnr = upd.id2_seqnr - 1
AND prev.val1 <> upd.val1
-- check if val2 has changed versus the previous record
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'val2 for ' + Convert(varchar, upd.id1) + '/' + Convert(varchar, upd.id2) + ' differs from val2 for ' + Convert(varchar, prev.id1) + '/' + Convert(varchar, prev.id2) + ','
FROM #working upd
JOIN #working prev
ON prev.id1 = upd.id1
AND prev.id2_seqnr = upd.id2_seqnr - 1
AND prev.val2 <> upd.val2
-- delete those records we do not want to keep
DELETE del
FROM STATUS_DATA del
JOIN #working w
ON w.id1 = del.id1
AND w.id2 = del.id2
AND w.keep_this_record = 0
-- some info
SELECT TOP 500 * FROM #working ORDER BY id1, id2
SELECT TOP 500 * FROM STATUS_DATA ORDER BY id1, id2
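As for combining everything into one single DELETE, here is a hedged sketch of what that could look like (untested; it assumes SQL Server 2012 or later for LAG):

WITH w AS (
    SELECT val1, val2,
           ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2) AS seq,
           COUNT(*)     OVER (PARTITION BY id1)              AS cnt,
           LAG(val1)    OVER (PARTITION BY id1 ORDER BY id2) AS prev_val1,
           LAG(val2)    OVER (PARTITION BY id1 ORDER BY id2) AS prev_val2
    FROM STATUS_DATA
)
DELETE FROM w
WHERE seq > 1 AND seq < cnt   -- neither the first nor the last id2 of its id1
  AND val1 = prev_val1        -- and val1 did not change versus the previous row
  AND val2 = prev_val2;       -- and val2 did not change versus the previous row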
Edit: cleaned up by removing details not relevant to the problem.
The problem: the JPA query returns no results.
String qstr = "select o from MyStats o where o.queue_name = :queue";
String queue = "3";
em.createQuery(qstr).setParameter("queue", queue);
I thought the problem was either incorrect syntax in the JPA query or incorrect annotation of the EmbeddedId. Hence I posted the definitions of the classes involved, but said nothing about the database table apart from it being Oracle.
My test code: read from the DB, take the first value, and re-use that value in a subsequent select query, meaning that the record exists. It should be there, it was just read, right?
Test
String queue = "";
String qstr1 = "select o from MyStats o";
String qstr2 = "select o from MyStats o where o.queue_name = :queue";
logger.debug("SQL query: " + qstr1);
List<MyStats> list = em.createQuery(qstr1).getResultList();
logger.debug("111 Returning results: " + list.size());
for (MyStats s : list) {
queue = s.getQueue_name();
logger.debug("Picking queue name: " + queue);
break;
}
logger.debug("SQL query: " + qstr2);
list = em.createQuery(qstr2).setParameter("queue", queue).getResultList();
logger.debug("222 Returning results: " + list.size());
Output:
SQL query: select o from MyStats o
111 Returning results: 166
Picking queue name: 3
SQL query: select o from MyStats o where o.queue_name = :queue
222 Returning results: 0
Class definition
@Entity
public class MyStats {
    private String queue_name;
    private long stats_id;
    ... // getters and setters
}
A query without a WHERE clause works correctly, as does a query filtering on a numeric member of the MyStats class:
em.createQuery("select o from MyStats o where o.stats_id = :sid").setParameter("sid", 179046583493L);
I am using Oracle 10 database, Java EE 5 SDK, Glassfish 2.1.
The problem appeared to be the mapping of the Java String type to the database column CHAR type.
The queue_name column is defined as CHAR(20), while the Java type is String; Oracle stores CHAR values blank-padded to the full 20 characters, so the equality comparison with the unpadded parameter value fails.
There are a few options to fix it:
Replace the database column CHAR type with VARCHAR
Pad the query parameter value with spaces for every request
Use a LIKE condition instead of equals = and add % to the end of the parameter value
Speculative: use CAST
(1) Acceptable if you have control over the database table.
(2) Works for the given select statement, but possibly breaks for JOINs.
(3) May fail to do the trick: LIKE 'a%' returns not only 'a ' but also 'aa ', 'abc ', and so on.
(4) This is not completely clear to me; I am not sure whether it is possible to adopt:
em.createNativeQuery("select cast(queue_name as CHAR(20)) from ...");