I am trying to implement a SQL parser using gsqlparser (General SQL Parser). The library itself is written in Java, but I am calling it from Scala.
Below is my query which contains a join condition.
SELECT e.last_name AS name, e.commission_pct comm, e.salary * 12 "Annual Salary" FROM scott.employees AS e right join scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn WHERE e.salary > 1000 ORDER BY e.first_name, e.last_name
I was able to parse the query to read names & aliases of columns, where conditions, table names (checking the table names directly inside the query) as below.
val sqlParser = new TGSqlParser(EDbVendor.dbvsnowflake)
sqlParser.sqltext = "SELECT e.last_name AS name, e.commission_pct comm, e.salary * 12 \"Annual Salary\" FROM scott.employees AS e right join scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn WHERE e.salary > 1000 ORDER BY e.first_name, e.last_name"
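// parse() returns 0 on success; the parsed statements are then exposed via sqlstatements
require(sqlParser.parse() == 0, sqlParser.getErrormessage)
val selectStmnt = sqlParser.sqlstatements.get(0).asInstanceOf[TSelectSqlStatement]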
println("Columns List:")
for(i <- 0 until selectStmnt.getResultColumnList.size()) {
val resCol = selectStmnt.getResultColumnList.getResultColumn(i)
println("Column: " + resCol.getExpr.toString + " Alias: " + resCol
.getAliasClause().toString)
}
Output:
Columns List:
Column: e.last_name Alias: name
Column: e.commission_pct Alias: comm
Column: e.salary * 12 Alias: "Annual Salary"
Next, I tried to parse the join condition and get the details inside it:
for(j <- 0 until selectStmnt.getJoins.size()) {
println(selectStmnt.getJoins.getJoin(j).getTable)
}
The problem is that the query contains only one join, so the size returned is 1 and the only output is scott.employees.
If I do it slightly differently, using getJoinItems:
println("Parsing Join items")
for(j <- 0 until selectStmnt.getJoins.size()) {
println(selectStmnt.getJoins.getJoin(j).getJoinItems)
}
the output is the join with its first table cut off:
scott.companies as c on c.orgid = e.orgid and c.orgname = e.orgn
The method getJoinItems() returns a TJoinItemList, which I thought of traversing, but its size is also 1.
println(selectStmnt.getJoins.getJoin(j).getJoinItems.size()) -> 1
I am out of ideas now. Could anyone tell me how I can parse the query's join condition and get the table names inside the join?
I don't have access to the Snowflake dialect in GSP, but I reproduced this scenario with the Teradata dialect, using the following query and creating a SQL parser for it.
SELECT e.last_name as name
FROM department d
RIGHT JOIN
trimmed_employee e
ON d.dept_id = e.emp_id
WHERE e.salary > 1000
ORDER BY e.first_name
Here is Groovy code that returns both tables, department and trimmed_employee. It iterates over each join and collects the join's own table plus the table of each of its join items (curJoin.joinItems), skipping the join items when they are null.
stmt.joins.asList().collect { curJoin ->
[curJoin.table] + (curJoin?.joinItems?.asList()?.collect { joinItems -> joinItems.table } ?: [])
}.flatten()
Result:
department
trimmed_employee
For a simple query like the one you mentioned, the following also works in my case:
stmt.tables.asList()
I'm using jasync with Ktor to connect to MySQL. All my queries run fine except this one. It reports a syntax error even after removing parts of the query.
MySQL is running and processes queries just fine. I ran the same query in HeidiSQL and it returned without errors.
USE ${city.getString("DbName")};
SELECT
m.id AS meter_id,
m.meter_num AS meter_number,
IF(
l.split = 0,
m.route_id,
CONCAT(
CAST(m.route_id AS CHAR),
CAST(l.split AS CHAR)
)
) AS route_id,
m.sequence AS sequence_number,
m.address,
m.location,
m.location_2,
m.location_3,
m.previous_read,
m.account_num AS account_number,
m.low_limit,
m.high_limit,
m.low_usage,
m.high_usage,
m.calc_usage AS calculated_usage,
m.num_dials,
CONCAT(h.Date, h.Time, LPAD(h.time_seconds, 2, '0')) as date_time,
c.utility,
c.acct_owner AS account_owner,
c.MeterCode AS meter_code,
m.msg,
c.misc1,
c.misc2,
c.misc3,
c.active,
c.backward,
mcd.always_require_photo AS require_photo
FROM
datazeo.meters AS m
INNER JOIN datazeo.routes AS rt ON rt.id = m.route_id
INNER JOIN info.loads AS l ON l.RouteID = rt.route
INNER JOIN custfile AS c ON c.IDnum = m.meter_num
LEFT JOIN acsreads AS r ON r.IDnum = c.IDnum
LEFT JOIN history AS h ON
h.IDnum = c.IDnum
AND h.Month = $month
LEFT JOIN datazeo.meter_customer_data AS mcd ON mcd.meter_id = m.id
WHERE
m.city_id = $cityId
AND rt.city_id = $cityId
AND l.CityID = "$paddedCityId"
AND l.ReaderID = ${requestContext.user.id}
AND r.IDnum IS NULL
AND l.Recheck = 0
AND (
l.split = 0
OR (
m.sequence >= l.seq_start
AND m.sequence <= l.seq_end
)
)
The query returns 27 columns and 153 rows when using HeidiSQL.
But I get this error when running it in Kotlin with jasync and JDBC:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT
m.id AS meter_id,
m.meter_num AS meter_number,
IF(
l.' at line 3"
Found the answer. For whatever reason, JDBC doesn't allow you to send a USE statement in the same request as your SELECT (apparently only one statement per request is accepted).
So I replaced my code to look like this:
SELECT
m.id AS meter_id,
m.meter_num AS meter_number,
IF(
l.split = 0,
m.route_id,
CONCAT(
CAST(m.route_id AS CHAR),
CAST(l.split AS CHAR)
)
) AS route_id,
m.sequence AS sequence_number,
m.address,
m.location,
m.location_2,
m.location_3,
m.previous_read,
m.account_num AS account_number,
m.low_limit,
m.high_limit,
m.low_usage,
m.high_usage,
m.calc_usage AS calculated_usage,
m.num_dials,
CONCAT(h.Date, h.Time, LPAD(h.time_seconds, 2, '0')) as date_time,
c.utility,
c.acct_owner AS account_owner,
c.MeterCode AS meter_code,
m.msg,
c.misc1,
c.misc2,
c.misc3,
c.active,
c.backward,
mcd.always_require_photo AS require_photo
FROM
datazeo.meters AS m
INNER JOIN datazeo.routes AS rt ON rt.id = m.route_id
INNER JOIN info.loads AS l ON l.RouteID = rt.route
INNER JOIN ${city.getString("DbName")}.custfile AS c ON c.IDnum = m.meter_num
LEFT JOIN ${city.getString("DbName")}.acsreads AS r ON r.IDnum = c.IDnum
LEFT JOIN ${city.getString("DbName")}.history AS h ON
h.IDnum = c.IDnum
AND h.Month = $month
LEFT JOIN datazeo.meter_customer_data AS mcd ON mcd.meter_id = m.id
WHERE
m.city_id = $cityId
AND rt.city_id = $cityId
AND l.CityID = "$paddedCityId"
AND l.ReaderID = ${requestContext.user.id}
AND r.IDnum IS NULL
AND l.Recheck = 0
AND (
l.split = 0
OR (
m.sequence >= l.seq_start
AND m.sequence <= l.seq_end
)
)
Just specify the database before the table name, and it works fine now.
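The fix above interpolates the database name straight into the SQL text. Below is a minimal Java sketch of one way to do that a little more defensively: check the name against a conservative pattern and wrap it in backticks before building the qualified table name. The names QualifiedNames, quote and qualify, the sample database name, and the allowed-character rule are all assumptions made for illustration; none of this is part of jasync or JDBC.

```java
import java.util.regex.Pattern;

final class QualifiedNames {
    // Assumption: database and table names use only letters, digits, '_' and '$'.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9_$]+");

    /** Wraps an identifier in backticks after rejecting anything outside the safe pattern. */
    static String quote(String identifier) {
        if (identifier == null || !SAFE.matcher(identifier).matches()) {
            throw new IllegalArgumentException("Unsafe identifier: " + identifier);
        }
        return "`" + identifier + "`";
    }

    /** Builds a database-qualified table name such as `city_db`.`custfile`. */
    static String qualify(String database, String table) {
        return quote(database) + "." + quote(table);
    }

    public static void main(String[] args) {
        String db = "city_db"; // hypothetical value standing in for city.getString("DbName")
        String sql = "SELECT c.IDnum FROM " + qualify(db, "custfile") + " AS c";
        System.out.println(sql);
    }
}
```

Values such as the city id and month are better passed as query parameters than concatenated; only identifiers like the database name need this kind of handling.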
How to convert all varchar columns in schema to nvarchar effectively, without dropping constraints
I am using this procedure:
CREATE PROCEDURE VarCharToNvarChar
AS
BEGIN
DECLARE curChangeTypes CURSOR FOR
SELECT column_name,
table_name,
character_maximum_length
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE='VARCHAR'
AND table_name IN (SELECT DISTINCT table_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE='VARCHAR');
OPEN curChangeTypes;
DECLARE @cn VARCHAR(50),
@tn VARCHAR(50),
@cml INT;
DECLARE @str VARCHAR(8000);
FETCH NEXT FROM curChangeTypes INTO @cn,@tn,@cml;
WHILE(@@FETCH_STATUS = 0)
BEGIN
IF(@cml = -1)
SET @str = 'ALTER TABLE ' + @tn + ' ALTER COLUMN ' + @cn + ' NVARCHAR(MAX)';
ELSE
SET @str = 'ALTER TABLE ' + @tn + ' ALTER COLUMN ' + @cn +
' NVARCHAR('+CAST(@cml AS VARCHAR)+')';
EXEC(@str);
FETCH NEXT FROM curChangeTypes INTO @cn,@tn,@cml;
END
CLOSE curChangeTypes;
DEALLOCATE curChangeTypes;
END
It fails because of constraints on the columns. If we drop the constraints, the procedure executes successfully, but the constraints need to stay intact.
Please suggest the best way to run the ALTER statements (varchar --> nvarchar) without dropping constraints.
There is a table 'EXAMPLE_TABLE' with two columns. The first column, 'ID', stores a value such as '5555', and the second, 'IS_EXIST', stores a single-byte char such as '0'. How can I create a procedure that does an 'INSERT INTO' if the 'ID' doesn't exist, an 'UPDATE' if the 'ID' matches the one in the query and 'IS_EXIST' == 0, or throws an exception (to be handled in Java) if the 'ID' matches and 'IS_EXIST' != 0? I have considered a merge and an insert-first approach to this problem.
It has to look approximately like this:
if(ID doesn't exist)
insert into
if(ID exist and IS_EXIST equals 0)
update
else
throw Exception
But how would this look in a procedure?
This is a simple way to do it if you want to throw or raise some exception using procedure without merging:
procedure PC_INSERT_OR_UPDATE(P_ID number) as
  V_IS_EXIST MY_TABLE.IS_EXIST%type;
begin
  -- Look up the current flag; NO_DATA_FOUND means the ID is not in the table yet.
  begin
    select M.IS_EXIST
      into V_IS_EXIST
      from MY_TABLE M
     where M.ID = P_ID;
  exception
    when NO_DATA_FOUND then
      insert into MY_TABLE
        (ID,
         IS_EXIST)
      values
        (P_ID,
         1);
      return;
  end;
  if V_IS_EXIST = 0 then
    update MY_TABLE M
       set M.IS_EXIST = 1
     where M.ID = P_ID;
  else
    -- Not caught here, so the error propagates to the caller (e.g. the Java code).
    RAISE_APPLICATION_ERROR(-20001, 'My exception was raised');
  end if;
end;
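On the Java side, the error raised by RAISE_APPLICATION_ERROR arrives as a java.sql.SQLException. Below is a minimal sketch of how the caller might tell that case apart from other failures, assuming the error actually propagates out of PC_INSERT_OR_UPDATE (that is, it is not swallowed by a WHEN OTHERS handler) and that the driver reports the code 20001 through getErrorCode(). The class and method names are hypothetical.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

final class InsertOrUpdateCaller {
    // Assumption: matches the code passed to RAISE_APPLICATION_ERROR(-20001, ...).
    static final int REJECTED_CODE = 20001;

    /** True if the exception carries the application error code; abs() tolerates either sign. */
    static boolean isRejected(SQLException e) {
        return Math.abs(e.getErrorCode()) == REJECTED_CODE;
    }

    /** Returns true if the row was inserted or updated, false if the procedure rejected it. */
    static boolean insertOrUpdate(Connection conn, long id) throws SQLException {
        try (CallableStatement cs = conn.prepareCall("{call PC_INSERT_OR_UPDATE(?)}")) {
            cs.setLong(1, id);
            cs.execute();
            return true;
        } catch (SQLException e) {
            if (isRejected(e)) {
                return false; // ID exists and IS_EXIST != 0
            }
            throw e; // anything else is a genuine failure
        }
    }
}
```

Checking the vendor code rather than the message text keeps the Java code independent of the wording used in the procedure.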
I need to scrub an SQL Server table on a regular basis, but my solution is taking ridiculously long (about 12 minutes for 73,000 records).
My table has 4 fields:
id1
id2
val1
val2
For every group of records with the same "id1", I need to keep the first (lowest id2) and last (highest id2) and delete everything in between UNLESS val1 or val2 has changed from the previous (next lowest "id2") record.
If you're following me so far, what would a more efficient algorithm be? Here is my java code:
boolean bDEL=false;
qps = conn.prepareStatement("SELECT id1, id2, val1, val2 from STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();
//KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 or val2
while (qrs.next()) {
thisID1 = qrs.getInt("id1");
thisID2 = qrs.getInt("id2");
thisVAL1= qrs.getInt("val1");
thisVAL2= qrs.getDouble("val2");
if (thisID1==lastID1) {
if (bDEL) { //Ensures this is not the last record
qps2 = conn2.prepareStatement("DELETE FROM STATUS_DATA where id1="+lastID1+" and id2="+lastID2);
qps2.executeUpdate();
qps2.close();
bDEL = false;
}
if (thisVAL1==lastVAL1 && thisVAL2==lastVAL2) {
bDEL = true;
}
} else if (bDEL) bDEL=false;
lastID1 = thisID1;
lastID2 = thisID2;
lastVAL1= thisVAL1;
lastVAL2= thisVAL2;
}
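The rule stated above (keep the first and last id2 for each id1, plus any row whose val1 or val2 differs from the previous row) can be separated from the database calls. Below is a minimal Java sketch of that selection step, assuming the rows have already been read in id1, id2 order. Row and ScrubPlanner are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ScrubPlanner {
    /** One STATUS_DATA row (hypothetical holder class). */
    static final class Row {
        final int id1, id2, val1;
        final double val2;

        Row(int id1, int id2, int val1, double val2) {
            this.id1 = id1;
            this.id2 = id2;
            this.val1 = val1;
            this.val2 = val2;
        }
    }

    /** Returns the rows that can be deleted; the input must be sorted by id1, id2. */
    static List<Row> rowsToDelete(List<Row> rows) {
        List<Row> doomed = new ArrayList<>();
        // The very first and very last rows are always kept, so the loop skips them.
        for (int i = 1; i < rows.size() - 1; i++) {
            Row prev = rows.get(i - 1);
            Row cur = rows.get(i);
            Row next = rows.get(i + 1);
            boolean firstInGroup = prev.id1 != cur.id1;
            boolean lastInGroup = next.id1 != cur.id1;
            boolean changed = prev.val1 != cur.val1 || prev.val2 != cur.val2;
            if (!firstInGroup && !lastInGroup && !changed) {
                doomed.add(cur);
            }
        }
        return doomed;
    }
}
```

The returned list can then be sent to the server in one batch instead of issuing one DELETE per row.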
UPDATE 4/20/2015 @ 11:10 AM
OK, so here is my final solution. For every record to delete, the Java code appends an XML record to a string, which is written to a file every 10,000 records; Java then calls a stored procedure on SQL Server and passes it the file name to read. The stored procedure can only use the file name as a variable if dynamic SQL is used to execute the OPENROWSET. I will play around with the interval of procedure execution, but so far my performance results are as follows:
BEFORE (1 record delete at a time):
73,000 records processed, 101 records per second
AFTER (bulk XML import):
1.4 Million records processed, 5800 records per second
JAVA SNIPPET:
String ts, sXML = "<DataRecords>\n";
boolean bDEL=false;
qps = conn.prepareStatement("SELECT id1, id2, val1, val2 from STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();
//KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 or val2
while (qrs.next()) {
thisID1 = qrs.getInt("id1");
thisID2 = qrs.getInt("id2");
thisVAL1= qrs.getInt("val1");
thisVAL2= qrs.getDouble("val2");
if (bDEL && thisID1==lastID1) { //Ensures this is not the first or last record
sXML += "<nxtrec id1=\""+lastID1+"\" id2=\""+lastID2+"\"/>\n";
if ((i + 1) % 10000 == 0) { //Execute every 10000 records
sXML += "</DataRecords>\n"; //Close off Parent Tag
ts = String.valueOf((new java.util.Date()).getTime()); //Each XML File Uniquely Named
writeFile(sDir, "ds"+ts+".xml", sXML); //Write XML to file
conn2=dataSource.getConnection();
cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
cs.setString(1, sdir + "ds"+ts+".xml");
cs.executeUpdate(); //Execute Stored Procedure
cs.close(); conn2.close();
deleteFile(SHMdirdata, "ds"+ts+".xml"); //Delete File
sXML = "<DataRecords>\n";
}
bDEL = false;
}
if (thisID1==lastID1 && thisVAL1==lastVAL1 && thisVAL2==lastVAL2) {
bDEL = true;
} else if (bDEL) bDEL=false;
lastID1 = thisID1;
lastID2 = thisID2;
lastVAL1= thisVAL1;
lastVAL2= thisVAL2;
i++;
}
qrs.close(); qps.close(); conn.close();
sXML += "</DataRecords>\n";
ts = String.valueOf((new java.util.Date()).getTime());
writeFile(sdir, "ds"+ts+".xml", sXML);
conn2=dataSource.getConnection();
cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
cs.setString(1, sdir + "ds"+ts+".xml");
cs.executeUpdate();
cs.close(); conn2.close();
deleteFile(SHMdirdata, "ds"+ts+".xml");
XML FILE OUTPUT:
<DataRecords>
<nxtrec id1="100" id2="1112"/>
<nxtrec id1="100" id2="1113"/>
<nxtrec id1="100" id2="1117"/>
<nxtrec id1="102" id2="1114"/>
...
<nxtrec id1="838" id2="1112"/>
</DataRecords>
SQL SERVER STORED PROCEDURE:
CREATE PROCEDURE [dbo].[SCRUB_DATA] @floc varchar(100) -- File Location (dir + filename) as only parameter
AS
BEGIN
SET NOCOUNT ON;
DECLARE @sql as varchar(max);
SET @sql = '
DECLARE @XmlFile XML
SELECT @XmlFile = BulkColumn
FROM OPENROWSET(BULK ''' + @floc + ''', SINGLE_BLOB) x;
CREATE TABLE #TEMP_TABLE (id1 INT, id2 INT);
INSERT INTO #TEMP_TABLE (id1, id2)
SELECT
id1 = DataTab.value(''@id1'', ''int''),
id2 = DataTab.value(''@id2'', ''int'')
FROM
@XmlFile.nodes(''/DataRecords/nxtrec'') AS XTbl(DataTab);
delete from D
from STATUS_DATA D
inner join #TEMP_TABLE T on ( (T.id1 = D.id1) and (T.id2 = D.id2) );
';
EXEC (@sql);
END
It is almost certain that your performance issues are not in your algorithm but in the implementation. Say, for example, your cleanup step has to remove 10,000 records: that means 10,000 round trips to your database server.
Instead of doing that, write each of the id pairs to be deleted to an XML file, and send that XML file to a SQL Server stored proc that shreds the XML into a corresponding temp table or table variable. Then use a single delete from (or equivalent) to delete all 10K rows.
If you don't know how to shred XML in T-SQL, it is well worth the time to learn. Take a look at a simple example to get you started, or just check out a couple of search results for "tsql shred xml" to get going.
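The XML payload described above can be assembled on the Java side in one pass. Below is a minimal sketch, assuming the element and attribute names from the question's update (DataRecords, nxtrec, id1, id2). The class and method names are hypothetical, and a StringBuilder is used so that a large batch does not repeatedly copy the whole string.

```java
import java.util.List;

final class DeleteBatchXml {
    /** Builds the DataRecords document; each int[] holds {id1, id2} of a row to delete. */
    static String toXml(List<int[]> idPairs) {
        StringBuilder sb = new StringBuilder("<DataRecords>\n");
        for (int[] pair : idPairs) {
            sb.append("<nxtrec id1=\"").append(pair[0])
              .append("\" id2=\"").append(pair[1])
              .append("\"/>\n");
        }
        return sb.append("</DataRecords>\n").toString();
    }
}
```

Because the ids are integers, no XML escaping is needed here; string values would have to be escaped before being written into attributes.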
ADDED
Pulling 10K records to the client should take < 1 second. Your Java code likewise. If you don't have the time to learn to use XML as suggested, you could write a quick and dirty stored proc that accepts 10 (20, 50?) pairs of ids and deletes the corresponding records from within the stored proc. I use the XML approach regularly to "batch" stuff from the client. If your batches are "large", you might take a look at using the BULK INSERT command on SQL Server, but the XML is easy and a bit more flexible as it can contain nested data structures, e.g., master/detail relationships.
ADDED
I just did this locally
create table #tmp
(
id int not null
primary key(id)
)
GO
insert #tmp (id)
select 4
union
select 5
GO
-- now has two rows #tmp
delete from L
from TaskList L
inner join #tmp T on (T.id = L.taskID)
(2 row(s) affected)
-- and they are no longer in TaskList
i.e., this should not be a problem unless you are doing it wrong somehow. Are you creating the temp table and then attempting to use it in a different database connection/session? If the sessions are different, the temp table won't be visible in the second session.
Hard to think of another way for this to be wrong off the top of my head.
Have you considered doing something that pushes more of the calculating to SQL instead of java?
This is ugly and doesn't take into account your "value changing" part, but it could be a lot faster:
(This deletes everything except the highest and lowest id2 for each id1)
select * into #temp
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2) AS 'RowNo',
* from myTable)x
delete i from myTable i
left outer join
(select t.* from #temp t
left outer join (select id1, max(rowNo) rowNo from #temp group by id1) x
on x.id1 = t.id1 and x.rowNo = t.RowNo
where t.RowNo != 1 and x.rowNo is null)z
on z.id2 = i.id2 and z.id1 = i.id1
where z.id1 is not null
Never underestimate the power of SQL =)
Although I understand this seems more 'straightforward' to implement in a row-by-row fashion, doing it 'set-based' will make it fly.
Some code to create test-data:
SET NOCOUNT ON
IF OBJECT_ID('mySTATUS_DATA') IS NOT NULL DROP TABLE mySTATUS_DATA
GO
CREATE TABLE mySTATUS_DATA (id1 int NOT NULL,
id2 int NOT NULL PRIMARY KEY (id1, id2),
val1 varchar(100) NOT NULL,
val2 varchar(100) NOT NULL)
GO
DECLARE @counter int,
@id1 int,
@id2 int,
@val1 varchar(100),
@val2 varchar(100)
SELECT @counter = 100000,
@id1 = 1,
@id2 = 1,
@val1 = 'abc',
@val2 = '123456'
BEGIN TRANSACTION
WHILE @counter > 0
BEGIN
INSERT mySTATUS_DATA (id1, id2, val1, val2)
VALUES (@id1, @id2, @val1, @val2)
SELECT @counter = @counter - 1
SELECT @id2 = @id2 + 1
SELECT @id1 = @id1 + 1, @id2 = 1 WHERE Rand() > 0.8
SELECT @val1 = SubString(convert(varchar(100), NewID()), 0, 9) WHERE Rand() > 0.90
SELECT @val2 = SubString(convert(varchar(100), NewID()), 0, 9) WHERE Rand() > 0.90
if @counter % 1000 = 0
BEGIN
COMMIT TRANSACTION
BEGIN TRANSACTION
END
END
COMMIT TRANSACTION
SELECT top 1000 * FROM mySTATUS_DATA
SELECT COUNT(*) FROM mySTATUS_DATA
And here is the code to do the actual scrubbing. Mind that the why column is there merely for educational purposes. If you're going to put this in production, I'd advise commenting it out, as it only slows down the operations. Also, you could combine the checks on val1 and val2 into a single update... in fact, with a bit of effort you could probably combine everything into a single DELETE statement. However, I very much doubt it would make things much faster... but it surely would make things a lot less readable.
Anyway, when I run this on my laptop for 100k records it takes only 5 seconds, so I doubt performance is going to be an issue.
IF OBJECT_ID('tempdb..#working') IS NOT NULL DROP TABLE #working
GO
-- create copy of table
SELECT id1, id2, id2_seqnr = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
val1, val2,
keep_this_record = Convert(bit, 0),
why = Convert(varchar(500), NULL)
INTO #working
FROM STATUS_DATA
WHERE 1 = 2
-- load records
INSERT #working (id1, id2, id2_seqnr, val1, val2, keep_this_record, why)
SELECT id1, id2, id2_seqnr = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
val1, val2,
keep_this_record = Convert(bit, 0),
why = ''
FROM STATUS_DATA
-- index
CREATE UNIQUE CLUSTERED INDEX uq0 ON #working (id1, id2_seqnr)
-- make sure we keep the first record of each id1
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'first id2 for id1 = ' + Convert(varchar, id1) + ','
FROM #working upd
WHERE id2_seqnr = 1 -- first in sequence
-- make sure we keep the last record of each id1
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'last id2 for id1 = ' + Convert(varchar, upd.id1) + ','
FROM #working upd
JOIN (SELECT id1, max_seqnr = MAX(id2_seqnr)
FROM #working
GROUP BY id1) mx
ON upd.id1 = mx.id1
AND upd.id2_seqnr = mx.max_seqnr
-- check if val1 has changed versus the previous record
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'val1 for ' + Convert(varchar, upd.id1) + '/' + Convert(varchar, upd.id2) + ' differs from val1 for ' + Convert(varchar, prev.id1) + '/' + Convert(varchar, prev.id2) + ','
FROM #working upd
JOIN #working prev
ON prev.id1 = upd.id1
AND prev.id2_seqnr = upd.id2_seqnr - 1
AND prev.val1 <> upd.val1
-- check if val2 has changed versus the previous record
UPDATE upd
SET keep_this_record = 1,
why = upd.why + 'val2 for ' + Convert(varchar, upd.id1) + '/' + Convert(varchar, upd.id2) + ' differs from val2 for ' + Convert(varchar, prev.id1) + '/' + Convert(varchar, prev.id2) + ','
FROM #working upd
JOIN #working prev
ON prev.id1 = upd.id1
AND prev.id2_seqnr = upd.id2_seqnr - 1
AND prev.val2 <> upd.val2
-- delete those records we do not want to keep
DELETE del
FROM STATUS_DATA del
JOIN #working w
ON w.id1 = del.id1
AND w.id2 = del.id2
AND w.keep_this_record = 0
-- some info
SELECT TOP 500 * FROM #working ORDER BY id1, id2
SELECT TOP 500 * FROM STATUS_DATA ORDER BY id1, id2