DATA Step, Macro, Functions and more

Efficiency Advice

Reply
Regular Contributor
Posts: 212

Efficiency Advice

[ Edited ]

Hi. I have the following series of queries that run as part of a macro loop, once for each ZIP Code. And there are 68 ZIP Codes! The data is huge so each step can take a couple of hours per ZIP Code.

 

Due to the huge size of the data the datasets ods_bi_recon_selected_mp and ods_iv_recon_selected_mp are created with records from only 1 ZIP Code at a time. Then each in the series of the queries is executed against these datasets. Then the process repeats for each of the remaining ZIPs.

 

I'm curious if anyone sees opportunities for improving efficiency of these individual queries or in executing all the queries as a whole?  Any suggestions would be greatly appreciated.

 

 

/*Create IV dataset*/
%PUT CHECK: CREATE IV DS FOR &ZIP5;
proc sql;
connect to oracle as db (user=&orauser password=&orapass path="ivasprd");
create table ods_iv_recon_selected_mp as
select * from connection to db
( select *
from ivas.ods_iv_recon_selected_mp
where imb_dlvry_zip_5=&ZIP5_QUOTED
);
disconnect from db;
quit;

/*Create BIDS dataset*/
%PUT CHECK: CREATE BIDS DS FOR &ZIP5;
proc sql;
connect to oracle as db (user="myuser" password="mypw" path="ibscrprd");
create table ods_bi_recon_selected_mp as
select * from connection to db
( select *
from imapsscr.ods_bi_recon_selected_mp
where imb_dlvry_zip_5=&ZIP5_QUOTED
);
disconnect from db;
quit;

 

The series of queries are:

 

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select subpad('PIECES MISSING IN IV',1,58) as RULE_NM length=58, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
	    where imb_code not in(select imb_code 
		                        from ods_iv_recon_selected_mp);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: PIECES MISSING IN BIDS QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN BIDS QUERY 999.6;
	proc sql;
		create table QueryData as
		select subpad('PIECES MISSING IN BIDS',1,58) as RULE_NM length=58, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.6 as RULE_ORDER,
				imb_dlvry_zip_5
		from ods_iv_recon_selected_mp
		where imb_code not in(select imb_code 
		                      from ods_bi_recon_selected_mp);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: VOLUME MATCHING */
/*	Volume Matching assumes the records match exactly and have not violated any rules */
	%PUT CHECK: RULE: VOLUME MATCHING 999.2;
	proc sql;
		create table QueryData as
		select subpad('VOLUME MATCHING',1,58) as RULE_NM length=58, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length = 31, 
				999.2 as RULE_ORDER,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
		EXCEPT
	    select 'VOLUME MATCHING' as RULE_NM length=58, 
	           	actual_dlvry_date, 
				imb_code length = 31, 
				999.2 as RULE_ORDER,
				imb_dlvry_zip_5
	    from QueryData&ZIP5
	    where rule_order < 997.1
		  and rule_order is not null
		EXCEPT
	    select  'VOLUME MATCHING' as RULE_NM length=58, 
	        	actual_dlvry_date, 
				imb_code length = 31, 
				999.2 as RULE_ORDER,
				imb_dlvry_zip_5
	    from ods_bi_recon_selected_mp
		where imb_code not in(select imb_code
	                          from ods_iv_recon_selected_mp)
		;    
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: TOTAL BIDS VOLUME SAMPLED */
	%PUT CHECK: RULE: TOTAL BIDS VOLUME SAMPLED 999.3;
	data QueryData;
	  length rule_nm $58;
	  length imb_code $31;
	  set ods_bi_recon_selected_mp;
	  RULE_NM='TOTAL BIDS VOLUME SAMPLED'; 
	  RULE_ORDER=999.3;
	  keep rule_nm actual_dlvry_date imb_code rule_order imb_dlvry_zip_5;
	run;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: EXCLUDED IN IV INCLUDED IN BIDS */
	%PUT CHECK: RULE: EXCLUDED IN IV INCLUDED IN BIDS 999.4;
	proc sql;
		create table QueryData as
		select subpad('EXCLUDED IN IV INCLUDED IN BIDS',1,58) as RULE_NM length=58,
				actual_dlvry_date, 
				imb_code length=31, 
				999.4 as RULE_ORDER,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
		where excl_sts_code is null
		  and imb_code in (select imb_code 
		                   from ods_iv_recon_selected_mp
		                   where excl_sts_code is not null);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: INCLUDED IN IV EXCLUDED IN BIDS */
	%PUT CHECK: RULE: INCLUDED IN IV EXCLUDED IN BIDS 999.5;
	proc sql;
		create table QueryData as
		select subpad('INCLUDED IN IV EXCLUDED IN BIDS',1,58) as RULE_NM length=58, 
	           actual_dlvry_date, 
			   imb_code length=31, 
	           999.5 as RULE_ORDER,
			   imb_dlvry_zip_5
	    from ods_bi_recon_selected_mp
	    where excl_sts_code is not null
	      and imb_code in (select imb_code 
	                       from ods_iv_recon_selected_mp
	                       where excl_sts_code is null);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

 

Super User
Posts: 5,260

Re: Efficiency Advice

It's quite a long code, and i don't have the time to figure out what it does and why.

Ther's probably a reason why you process one zip at the time...?

If there's no logical problems, run all zipcodes in one go instead.

Data never sleeps
Regular Contributor
Posts: 212

Re: Efficiency Advice

Yah, I can't run all ZIPs in one go due to the size of the data. The 2 ds combine for over 8 million records. It ends up gobbling up all our SAS temp space. Smiley Sad
Respected Advisor
Posts: 3,777

Re: Efficiency Advice

Why don't you ask for advice on how to do what you're efficiently instead of how do I fix my SQL code.  Include sample data all relevant details.

 

 

Super User
Super User
Posts: 7,424

Re: Efficiency Advice

How are you planning on working with the data in this, and other processes - considering the fact that you can't where clause the data for a subset?

At a guess, and this has many connotations, I would say create a table on the database (using pass through, and into a temp area) which contains the subset of data you want to work with, then bring that data across into SAS.  

Regular Contributor
Posts: 212

Re: Efficiency Advice

A good suggestion, but not possible In my case because the tables reside on two different databases/servers and a Database Link is not allowed by one of the database/server owners.

I'm not clear on what you mean by 'considering the fact that you can't where clause the data for a subset?' Each query does have a Where clause.
Trusted Advisor
Posts: 1,401

Re: Efficiency Advice

It is difficult to follow such a long and complex code. Anyhow I have some hits:

 

1) As the datasets are huge and you want to deal one zip code at a time,

    add index to the datasets and compute IMB_CODE while importing data from oracle:

           

create table ods_iv_recon_selected_mp(index=imb_dlvry_zip_5  imb_code) as 

 

create table ods_bi_recon_selected_mp(index=imb_dlvry_zip_5  imb_code) as 

   adding index will need some more time and disk space but will save time laiter.

 

2)  If you computed imb_code already you don't need to compute it on next SQL steps.

 

3) Try to convert the sql step 

create table QueryData as

    into data step with merge, something like:

    

   data QueryData;

    merge ods_bi_recon_selected_mp (in=inA  keep=... where=(excl_sts_code is null))

               ods_iv_recon_selected_mp (in=inB where=(excl_sts_code is not null));

    by imb_code;

        if inA and inB;

   run;

 

   - check the logic, is the code fits your needs

   - data step may be more efficient and faster, execpt if merge is many to many (check log messages).

      If you got message of "... more than one ..." - you can't use data step.

 

4) Finally you may try using SPDE engine, but this may be a long story to explain and it may need more disk space;

 

Regular Contributor
Posts: 212

Re: Efficiency Advice

Thanks for the tips. Exactly what is meant by 'compute imb_code'?
Trusted Advisor
Posts: 1,401

Re: Efficiency Advice

Here is your code, repeated several times, computing IMB_CODE

subpad(imb_code,1,31) as IMB_CODE length=31, 

 You can do it once per oracle table, just add format $char31.

Regular Contributor
Posts: 212

Re: Efficiency Advice

oh, and why do you recommend a SAS Merge over Proc SQL? Is it more efficient. I'd have guessed Proc SQL to be more efficient as it doesn't request the data be sorted first with a Proc Sort.
Trusted Advisor
Posts: 1,401

Re: Efficiency Advice

In some cases merging with data step can be more efficient than sql with nested select;

 

In case, more than one tablee have same combination of BY group, you cannot use merge,

then I use SQL with JOIN.

Super User
Posts: 5,096

Re: Efficiency Advice

I'm not the best SQL guy, but there are a couple of elephants in the room.

 

First, why is the pass-through logic using select * ?  It seems likely that you need only a handful of fields (but it's impossible to tell by looking at the program how many fields are actually being retrieved).

 

Second, if you want help from the more dyed-in-the-wool SQL experts, it would help to break down how much time each of the steps takes. 

 

Good luck.

Trusted Advisor
Posts: 1,401

Re: Efficiency Advice

Relating to @Astounding note:

 

If you want more efficient program: less memory and less disk space will run the program in less time.

If you select from DB only those variables you need, you save memory, disk and time.

Ask a Question
Discussion stats
  • 12 replies
  • 321 views
  • 6 likes
  • 6 in conversation