topic Re: Two queries different answers in SAS Programming

Two queries different answers

buechler66 — Mon, 03 Oct 2016 22:15:40 GMT

Hi. Any ideas why these two queries return a different number of rows?

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
        EXCEPT
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_iv_recon_selected_mp;
	quit;

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
	    where imb_code not in(select imb_code 
		                        from ods_iv_recon_selected_mp);
	quit;

Re: Two queries different answers

Patrick — Tue, 04 Oct 2016 00:51:41 GMT

The first query using EXCEPT only excludes rows where ALL columns are identical.

The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).

Re: Two queries different answers

buechler66 — Tue, 04 Oct 2016 01:05:22 GMT

Wow, great insight. Thanks so much.

I always thought Nested Selects were inefficient, which is the reason I was trying the Except concept.

Is there a more efficient way to restructure the nested select version of the query and get the same results where ? Or is the Nested Select a good option?

Re: Two queries different answers

Patrick — Tue, 04 Oct 2016 01:23:43 GMT

If using EXCEPT you want also to pay attention to the effect of keyword ALL

http://support.sas.com/documentation/cdl/en/sqlproc/69049/HTML/default/viewer.htm#n0vo2lglyrnexwn14emi8m0jqvrj.htm

Which syntax to use really depends on what result you need and if it's about performance then also what data volumes you're dealing with (eg. the number of rows of the table used within the IN operator).

Re: Two queries different answers

buechler66 — Tue, 04 Oct 2016 01:48:57 GMT

Thanks for this info! Yah, it's 400 million rows and 60+ variables. Huge to me, but what are ur thoughts on that size of a ds using the Nested Select?

Re: Two queries different answers

Patrick — Tue, 04 Oct 2016 02:13:27 GMT

You're dealing with SAS files - right?

400 million rows and 60+ variables: That's for table ods_bi_recon_selected?

What's the size of ods_iv_recon_selected?

And what you need are all distinct rows in ods_bi_recon_selected with no match over the key column to ods_iv_recon_selected? Right?

And: Do you have an idea how many distinct values for imb_code you're having in ods_iv_recon_selected?

....and last but not least: How many rows do you expect to get in result table QueryData?

I'm thinking about using a hash approach but it will depend on volumes if that's possible.

Re: Two queries different answers

buechler66 — Tue, 04 Oct 2016 02:29:58 GMT

Sorry, 400 million for the BI and 400 million for the IV dataset each. Yes, both are already SAS datasets.

Yes, distinct only, BUT approx. 98% of records are already DISTINCT.

I expect anywhere from 0 to 400 million records in QueryData depending on the day the code is run.

Thanks for your input.

Re: Two queries different answers

Patrick — Tue, 04 Oct 2016 02:55:20 GMT

If the volumes in both tables are that huge and there is nothing you can do to reduce them AND your key column has potentially also more distinct values than we could fit into memory, then I guess your 2nd query is more or less what you can do.

Does your result set really need to be "distinct"? That's very costly.

The following paper provides a lot of insight of how the SQL optimizer works which is all you need to decide in concrete cases how you could eventually tweak code: http://www2.sas.com/proceedings/sugi30/101-30.pdf

Re: Two queries different answers

LinusH — Tue, 04 Oct 2016 10:47:00 GMT

If imp_code is 8 bytes, the table will 3GB, which probably can fit into memory (?), thus making a hash table lookup feasible.

Re: Two queries different answers

buechler66 — Tue, 04 Oct 2016 13:19:49 GMT

I'm using DISTINCT because about 2% of IMB_CODEs are duplicates and the duplicates break my program. So for now I'm choosing to lose 2% of my data just to make head way...not good I know.

The official requirement is that IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value. Is there anyway I can remove the UNIQUE option and still handle a requirement like this? If there is I'd appreciate the help for sure.

/*	 RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
	create table QueryData as
	SELECT DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
	       	actual_dlvry_date, 
		subpad(imb_code,1,31) as IMB_CODE length=31, 
		999.1 as Rule_Order,
		imb_dlvry_zip_5
	FROM ods_bi_recon_selected_mp
        WHERE imb_code not in(select imb_code from ods_iv_recon_selected_mp)
          AND (requirement = IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value?)
        ;
quit;