Hi. Any ideas why these two queries return a different number of rows?
/* RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
create table QueryData as
select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length=31,
999.1 as Rule_Order,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
EXCEPT
select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length=31,
999.1 as Rule_Order,
imb_dlvry_zip_5
from ods_iv_recon_selected_mp;
quit;
vs
/* RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
create table QueryData as
select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length=31,
999.1 as Rule_Order,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
where imb_code not in(select imb_code
from ods_iv_recon_selected_mp);
quit;
The first query using EXCEPT only excludes rows where ALL columns are identical.
The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).
The first query using EXCEPT only excludes rows where ALL columns are identical.
The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).
If using EXCEPT you want also to pay attention to the effect of keyword ALL
Which syntax to use really depends on what result you need and if it's about performance then also what data volumes you're dealing with (eg. the number of rows of the table used within the IN operator).
You're dealing with SAS files - right?
400 million rows and 60+ variables: That's for table ods_bi_recon_selected?
What's the size of ods_iv_recon_selected?
And what you need are all distinct rows in ods_bi_recon_selected with no match over the key column to ods_iv_recon_selected? Right?
And: Do you have an idea how many distinct values for imb_code you're having in ods_iv_recon_selected?
....and last but not least: How many rows do you expect to get in result table QueryData?
I'm thinking about using a hash approach but it will depend on volumes if that's possible.
If the volumes in both tables are that huge and there is nothing you can do to reduce them AND your key column has potentially also more distinct values than we could fit into memory, then I guess your 2nd query is more or less what you can do.
Does your result set really need to be "distinct"? That's very costly.
The following paper provides a lot of insight of how the SQL optimizer works which is all you need to decide in concrete cases how you could eventually tweak code: http://www2.sas.com/proceedings/sugi30/101-30.pdf
If imp_code is 8 bytes, the table will 3GB, which probably can fit into memory (?), thus making a hash table lookup feasible.
I'm using DISTINCT because about 2% of IMB_CODEs are duplicates and the duplicates break my program. So for now I'm choosing to lose 2% of my data just to make head way...not good I know.
The official requirement is that IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value. Is there anyway I can remove the UNIQUE option and still handle a requirement like this? If there is I'd appreciate the help for sure.
/* RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
create table QueryData as
SELECT DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length=31,
999.1 as Rule_Order,
imb_dlvry_zip_5
FROM ods_bi_recon_selected_mp
WHERE imb_code not in(select imb_code from ods_iv_recon_selected_mp)
AND (requirement = IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value?)
;
quit;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.