BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
buechler66
Barite | Level 11

Hi.  Any ideas why these two queries return a different number of rows?

 

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
        EXCEPT
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_iv_recon_selected_mp;
	quit;

vs

 

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
	    where imb_code not in(select imb_code 
		                        from ods_iv_recon_selected_mp);
	quit;
1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

The first query using EXCEPT only excludes rows where ALL columns are identical.

The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).

View solution in original post

9 REPLIES 9
Patrick
Opal | Level 21

The first query using EXCEPT only excludes rows where ALL columns are identical.

The second query excludes all rows where there is a matching IMB_CODE (but other columns can still have different values).

buechler66
Barite | Level 11
Wow, great insight. Thanks so much.

I always thought Nested Selects were inefficient, which is the reason I was trying the Except concept.

Is there a more efficient way to restructure the nested select version of the query and get the same results where ? Or is the Nested Select a good option?
Patrick
Opal | Level 21

If using EXCEPT you want also to pay attention to the effect of keyword ALL

http://support.sas.com/documentation/cdl/en/sqlproc/69049/HTML/default/viewer.htm#n0vo2lglyrnexwn14e...

 

Which syntax to use really depends on what result you need and if it's about performance then also what data volumes you're dealing with (eg. the number of rows of the table used within the IN operator).

buechler66
Barite | Level 11
Thanks for this info! Yah, it's 400 million rows and 60+ variables. Huge to me, but what are ur thoughts on that size of a ds using the Nested Select?
Patrick
Opal | Level 21

You're dealing with SAS files - right?

 

400 million rows and 60+ variables: That's for table ods_bi_recon_selected?

 

What's the size of ods_iv_recon_selected?

 

And what you need are all distinct rows in ods_bi_recon_selected with no match over the key column to ods_iv_recon_selected? Right?

 

And: Do you have an idea how many distinct values for imb_code you're having in ods_iv_recon_selected?

 

 

....and last but not least: How many rows do you expect to get in result table QueryData?

 

I'm thinking about using a hash approach but it will depend on volumes if that's possible.

 

buechler66
Barite | Level 11
Sorry, 400 million for the BI and 400 million for the IV dataset each. Yes, both are already SAS datasets.

Yes, distinct only, BUT approx. 98% of records are already DISTINCT.

I expect anywhere from 0 to 400 million records in QueryData depending on the day the code is run.

Thanks for your input.
Patrick
Opal | Level 21

If the volumes in both tables are that huge and there is nothing you can do to reduce them AND your key column has potentially also more distinct values than we could fit into memory, then I guess your 2nd query is more or less what you can do.

 

Does your result set really need to be "distinct"? That's very costly.

 

The following paper provides a lot of insight of how the SQL optimizer works which is all you need to decide in concrete cases how you could eventually tweak code: http://www2.sas.com/proceedings/sugi30/101-30.pdf

 

LinusH
Tourmaline | Level 20

If imp_code is 8 bytes, the table will 3GB, which probably can fit into memory (?), thus making a hash table lookup feasible.

Data never sleeps
buechler66
Barite | Level 11

I'm using DISTINCT because about 2% of IMB_CODEs are duplicates and the duplicates break my program.  So for now I'm choosing to lose 2% of my data just to make head way...not good I know.

 

The official requirement is that IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value.  Is there anyway I can remove the UNIQUE option and still handle a requirement like this?  If there is I'd appreciate the help for sure.

 

/*	 RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
	create table QueryData as
	SELECT DISTINCT subpad('PIECES MISSING IN IV',1,58) as RULE_NM, 
	       	actual_dlvry_date, 
		subpad(imb_code,1,31) as IMB_CODE length=31, 
		999.1 as Rule_Order,
		imb_dlvry_zip_5
	FROM ods_bi_recon_selected_mp
        WHERE imb_code not in(select imb_code from ods_iv_recon_selected_mp)
AND (requirement = IMB_CODEs are only unique for 30 days after their ACTUAL_DLVRY_DATE variable value?)
; quit;

 

sas-innovate-2024.png

Today is the last day to save with the early bird rate! Register today for just $695 - $100 off the standard rate.

 

Plus, pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1255 views
  • 2 likes
  • 3 in conversation