Re: Efficiency Advice

buechler66 · Posted 11-15-2016 09:30 AM

Hi. I have the following series of queries that run as part of a macro loop, once for each ZIP Code. And there are 68 ZIP Codes! The data is huge so each step can take a couple of hours per ZIP Code.

Due to the huge size of the data the datasets ods_bi_recon_selected_mp and ods_iv_recon_selected_mp are created with records from only 1 ZIP Code at a time. Then each in the series of the queries is executed against these datasets. Then the process repeats for each of the remaining ZIPs.

I'm curious if anyone sees opportunities for improving efficiency of these individual queries or in executing all the queries as a whole? Any suggestions would be greatly appreciated.

/*Create IV dataset*/
 %PUT CHECK: CREATE IV DS FOR &ZIP5;
 proc sql;
 connect to oracle as db (user=&orauser password=&orapass path="ivasprd");
 create table ods_iv_recon_selected_mp as 
 select * from connection to db
 ( select *
 from ivas.ods_iv_recon_selected_mp
 where imb_dlvry_zip_5=&ZIP5_QUOTED
 ); 
 disconnect from db;
 quit;

/*Create BIDS dataset*/
 %PUT CHECK: CREATE BIDS DS FOR &ZIP5;
 proc sql;
 connect to oracle as db (user="myuser" password="mypw" path="ibscrprd");
 create table ods_bi_recon_selected_mp as 
 select * from connection to db
 ( select *
 from imapsscr.ods_bi_recon_selected_mp
 where imb_dlvry_zip_5=&ZIP5_QUOTED
 ); 
 disconnect from db;
 quit;

The series of queries are:

/*	 RULE: PIECES MISSING IN IV QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
	proc sql;
		create table QueryData as
		select subpad('PIECES MISSING IN IV',1,58) as RULE_NM length=58, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.1 as Rule_Order,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
	    where imb_code not in(select imb_code 
		                        from ods_iv_recon_selected_mp);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: PIECES MISSING IN BIDS QUERY */
	%PUT CHECK: RULE: PIECES MISSING IN BIDS QUERY 999.6;
	proc sql;
		create table QueryData as
		select subpad('PIECES MISSING IN BIDS',1,58) as RULE_NM length=58, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length=31, 
				999.6 as RULE_ORDER,
				imb_dlvry_zip_5
		from ods_iv_recon_selected_mp
		where imb_code not in(select imb_code 
		                      from ods_bi_recon_selected_mp);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: VOLUME MATCHING */
/*	Volume Matching assumes the records match exactly and have not violated any rules */
	%PUT CHECK: RULE: VOLUME MATCHING 999.2;
	proc sql;
		create table QueryData as
		select subpad('VOLUME MATCHING',1,58) as RULE_NM length=58, 
		       	actual_dlvry_date, 
				subpad(imb_code,1,31) as IMB_CODE length = 31, 
				999.2 as RULE_ORDER,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
		EXCEPT
	    select 'VOLUME MATCHING' as RULE_NM length=58, 
	           	actual_dlvry_date, 
				imb_code length = 31, 
				999.2 as RULE_ORDER,
				imb_dlvry_zip_5
	    from QueryData&ZIP5
	    where rule_order < 997.1
		  and rule_order is not null
		EXCEPT
	    select  'VOLUME MATCHING' as RULE_NM length=58, 
	        	actual_dlvry_date, 
				imb_code length = 31, 
				999.2 as RULE_ORDER,
				imb_dlvry_zip_5
	    from ods_bi_recon_selected_mp
		where imb_code not in(select imb_code
	                          from ods_iv_recon_selected_mp)
		;    
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: TOTAL BIDS VOLUME SAMPLED */
	%PUT CHECK: RULE: TOTAL BIDS VOLUME SAMPLED 999.3;
	data QueryData;
	  length rule_nm $58;
	  length imb_code $31;
	  set ods_bi_recon_selected_mp;
	  RULE_NM='TOTAL BIDS VOLUME SAMPLED'; 
	  RULE_ORDER=999.3;
	  keep rule_nm actual_dlvry_date imb_code rule_order imb_dlvry_zip_5;
	run;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: EXCLUDED IN IV INCLUDED IN BIDS */
	%PUT CHECK: RULE: EXCLUDED IN IV INCLUDED IN BIDS 999.4;
	proc sql;
		create table QueryData as
		select subpad('EXCLUDED IN IV INCLUDED IN BIDS',1,58) as RULE_NM length=58,
				actual_dlvry_date, 
				imb_code length=31, 
				999.4 as RULE_ORDER,
				imb_dlvry_zip_5
		from ods_bi_recon_selected_mp
		where excl_sts_code is null
		  and imb_code in (select imb_code 
		                   from ods_iv_recon_selected_mp
		                   where excl_sts_code is not null);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

/*	RULE: INCLUDED IN IV EXCLUDED IN BIDS */
	%PUT CHECK: RULE: INCLUDED IN IV EXCLUDED IN BIDS 999.5;
	proc sql;
		create table QueryData as
		select subpad('INCLUDED IN IV EXCLUDED IN BIDS',1,58) as RULE_NM length=58, 
	           actual_dlvry_date, 
			   imb_code length=31, 
	           999.5 as RULE_ORDER,
			   imb_dlvry_zip_5
	    from ods_bi_recon_selected_mp
	    where excl_sts_code is not null
	      and imb_code in (select imb_code 
	                       from ods_iv_recon_selected_mp
	                       where excl_sts_code is null);
	quit;

/*	Append datasets to final dataset */
	proc append base=QueryData&ZIP5 data=QueryData force;
	run;

LinusH · Posted 11-15-2016 09:47 AM

It's quite a long code, and i don't have the time to figure out what it does and why.

Ther's probably a reason why you process one zip at the time...?

If there's no logical problems, run all zipcodes in one go instead.

Data never sleeps

buechler66 · Posted 11-15-2016 09:53 AM

Yah, I can't run all ZIPs in one go due to the size of the data. The 2 ds combine for over 8 million records. It ends up gobbling up all our SAS temp space. 😞

data_null__ · Posted 11-15-2016 10:23 AM

Why don't you ask for advice on how to do what you're efficiently instead of how do I fix my SQL code. Include sample data all relevant details.

RW9 · Posted 11-15-2016 10:25 AM

How are you planning on working with the data in this, and other processes - considering the fact that you can't where clause the data for a subset?

At a guess, and this has many connotations, I would say create a table on the database (using pass through, and into a temp area) which contains the subset of data you want to work with, then bring that data across into SAS.

buechler66 · Posted 11-15-2016 11:48 AM

A good suggestion, but not possible In my case because the tables reside on two different databases/servers and a Database Link is not allowed by one of the database/server owners.

I'm not clear on what you mean by 'considering the fact that you can't where clause the data for a subset?' Each query does have a Where clause.

Shmuel · Posted 11-15-2016 10:56 AM

It is difficult to follow such a long and complex code. Anyhow I have some hits:

1) As the datasets are huge and you want to deal one zip code at a time,

add index to the datasets and compute IMB_CODE while importing data from oracle:

create table ods_iv_recon_selected_mp(index=imb_dlvry_zip_5  imb_code) as

create table ods_bi_recon_selected_mp(index=imb_dlvry_zip_5  imb_code) as

adding index will need some more time and disk space but will save time laiter.

2) If you computed imb_code already you don't need to compute it on next SQL steps.

3) Try to convert the sql step

create table QueryData as

into data step with merge, something like:

data QueryData;

merge ods_bi_recon_selected_mp (in=inA keep=... where=(excl_sts_code is null))

ods_iv_recon_selected_mp (in=inB where=(excl_sts_code is not null));

by imb_code;

if inA and inB;

run;

- check the logic, is the code fits your needs

- data step may be more efficient and faster, execpt if merge is many to many (check log messages).

If you got message of "... more than one ..." - you can't use data step.

4) Finally you may try using SPDE engine, but this may be a long story to explain and it may need more disk space;

buechler66 · Posted 11-15-2016 11:42 AM

Thanks for the tips. Exactly what is meant by 'compute imb_code'?

Shmuel · Posted 11-15-2016 02:37 PM

Here is your code, repeated several times, computing IMB_CODE

subpad(imb_code,1,31) as IMB_CODE length=31,

You can do it once per oracle table, just add format $char31.

buechler66 · Posted 11-15-2016 11:44 AM

oh, and why do you recommend a SAS Merge over Proc SQL? Is it more efficient. I'd have guessed Proc SQL to be more efficient as it doesn't request the data be sorted first with a Proc Sort.

Shmuel · Posted 11-15-2016 02:43 PM

In some cases merging with data step can be more efficient than sql with nested select;

In case, more than one tablee have same combination of BY group, you cannot use merge,

then I use SQL with JOIN.

Astounding · Posted 11-15-2016 10:25 AM

I'm not the best SQL guy, but there are a couple of elephants in the room.

First, why is the pass-through logic using select * ? It seems likely that you need only a handful of fields (but it's impossible to tell by looking at the program how many fields are actually being retrieved).

Second, if you want help from the more dyed-in-the-wool SQL experts, it would help to break down how much time each of the steps takes.

Good luck.

Shmuel · Posted 11-15-2016 02:49 PM

Relating to @Astounding note:

If you want more efficient program: less memory and less disk space will run the program in less time.

If you select from DB only those variables you need, you save memory, disk and time.

SAS Innovate 2025: Call for Content

Classroom Training Available!