DATA Step, Macro, Functions and more

Two queries different answers

Accepted Solution Solved
Reply
Regular Contributor
Posts: 212
Accepted Solution

Two queries different answers

Hi.  Any ideas why these two queries return a different number of rows?

 

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
        EXCEPT
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_iv_recon_selected_mp;
	quit;

vs

 

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
	    where imb_code not in(select imb_code 
		                        from ods_iv_recon_selected_mp);
	quit;

Accepted Solutions
Solution
‎10-04-2016 08:54 AM
Respected Advisor
Posts: 3,887

Re: Two queries different answers

The first query using EXCEPT only excludes rows where ALL columns are identical.

The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).

View solution in original post


All Replies
Solution
‎10-04-2016 08:54 AM
Respected Advisor
Posts: 3,887

Re: Two queries different answers

The first query using EXCEPT only excludes rows where ALL columns are identical.

The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).

Regular Contributor
Posts: 212

Re: Two queries different answers

Wow, great insight. Thanks so much.

I always thought Nested Selects were inefficient, which is the reason I was trying the Except concept.

Is there a more efficient way to restructure the nested select version of the query and get the same results where ? Or is the Nested Select a good option?
Respected Advisor
Posts: 3,887

Re: Two queries different answers

If using EXCEPT you want also to pay attention to the effect of keyword ALL

http://support.sas.com/documentation/cdl/en/sqlproc/69049/HTML/default/viewer.htm#n0vo2lglyrnexwn14e...

 

Which syntax to use really depends on what result you need and if it's about performance then also what data volumes you're dealing with (eg. the number of rows of the table used within the IN operator).

Regular Contributor
Posts: 212

Re: Two queries different answers

Thanks for this info! Yah, it's 400 million rows and 60+ variables. Huge to me, but what are ur thoughts on that size of a ds using the Nested Select?
Respected Advisor
Posts: 3,887

Re: Two queries different answers

[ Edited ]

You're dealing with SAS files - right?

 

400 million rows and 60+ variables: That's for table ods_bi_recon_selected?

 

What's the size of ods_iv_recon_selected?

 

And what you need are all distinct rows in ods_bi_recon_selected with no match over the key column to ods_iv_recon_selected? Right?

 

And: Do you have an idea how many distinct values for imb_code you're having in ods_iv_recon_selected?

 

 

....and last but not least: How many rows do you expect to get in result table QueryData?

 

I'm thinking about using a hash approach but it will depend on volumes if that's possible.

 

Regular Contributor
Posts: 212

Re: Two queries different answers

Sorry, 400 million for the BI and 400 million for the IV dataset each. Yes, both are already SAS datasets.

Yes, distinct only, BUT approx. 98% of records are already DISTINCT.

I expect anywhere from 0 to 400 million records in QueryData depending on the day the code is run.

Thanks for your input.
Respected Advisor
Posts: 3,887

Re: Two queries different answers

If the volumes in both tables are that huge and there is nothing you can do to reduce them AND your key column has potentially also more distinct values than we could fit into memory, then I guess your 2nd query is more or less what you can do.

 

Does your result set really need to be "distinct"? That's very costly.

 

The following paper provides a lot of insight of how the SQL optimizer works which is all you need to decide in concrete cases how you could eventually tweak code: http://www2.sas.com/proceedings/sugi30/101-30.pdf

 

Super User
Posts: 5,255

Re: Two queries different answers

If imp_code is 8 bytes, the table will 3GB, which probably can fit into memory (?), thus making a hash table lookup feasible.

Data never sleeps
Regular Contributor
Posts: 212

Re: Two queries different answers

[ Edited ]

I'm using DISTINCT because about 2% of IMB_CODEs are duplicates and the duplicates break my program.  So for now I'm choosing to lose 2% of my data just to make head way...not good I know.

 

The official requirement is that IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value.  Is there anyway I can remove the UNIQUE option and still handle a requirement like this?  If there is I'd appreciate the help for sure.

 

/*	 RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
	create table QueryData as
	SELECT DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
	       	actual_dlvry_date, 
		subpad(imb_code,1,31) as IMB_CODE length=31, 
		999.1 as Rule_Order,
		imb_dlvry_zip_5
	FROM ods_bi_recon_selected_mp
        WHERE imb_code not in(select imb_code from ods_iv_recon_selected_mp)
AND (requirement = IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value?)
; quit;

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 439 views
  • 2 likes
  • 3 in conversation