Hi. I have the following series of queries that run as part of a macro loop, once for each ZIP Code. And there are 68 ZIP Codes! The data is huge so each step can take a couple of hours per ZIP Code.
Due to the huge size of the data the datasets ods_bi_recon_selected_mp and ods_iv_recon_selected_mp are created with records from only 1 ZIP Code at a time. Then each in the series of the queries is executed against these datasets. Then the process repeats for each of the remaining ZIPs.
I'm curious if anyone sees opportunities for improving efficiency of these individual queries or in executing all the queries as a whole? Any suggestions would be greatly appreciated.
/*Create IV dataset*/
%PUT CHECK: CREATE IV DS FOR &ZIP5;
proc sql;
connect to oracle as db (user=&orauser password=&orapass path="ivasprd");
create table ods_iv_recon_selected_mp as
select * from connection to db
( select *
from ivas.ods_iv_recon_selected_mp
where imb_dlvry_zip_5=&ZIP5_QUOTED
);
disconnect from db;
quit;
/*Create BIDS dataset*/
%PUT CHECK: CREATE BIDS DS FOR &ZIP5;
proc sql;
connect to oracle as db (user="myuser" password="mypw" path="ibscrprd");
create table ods_bi_recon_selected_mp as
select * from connection to db
( select *
from imapsscr.ods_bi_recon_selected_mp
where imb_dlvry_zip_5=&ZIP5_QUOTED
);
disconnect from db;
quit;
The series of queries are:
/* RULE: PIECES MISSING IN IV QUERY */
%PUT CHECK: RULE: PIECES MISSING IN IV QUERY 999.1;
proc sql;
create table QueryData as
select subpad('PIECES MISSING IN IV',1,58) as RULE_NM length=58,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length=31,
999.1 as Rule_Order,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
where imb_code not in(select imb_code
from ods_iv_recon_selected_mp);
quit;
/* Append datasets to final dataset */
proc append base=QueryData&ZIP5 data=QueryData force;
run;
/* RULE: PIECES MISSING IN BIDS QUERY */
%PUT CHECK: RULE: PIECES MISSING IN BIDS QUERY 999.6;
proc sql;
create table QueryData as
select subpad('PIECES MISSING IN BIDS',1,58) as RULE_NM length=58,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length=31,
999.6 as RULE_ORDER,
imb_dlvry_zip_5
from ods_iv_recon_selected_mp
where imb_code not in(select imb_code
from ods_bi_recon_selected_mp);
quit;
/* Append datasets to final dataset */
proc append base=QueryData&ZIP5 data=QueryData force;
run;
/* RULE: VOLUME MATCHING */
/* Volume Matching assumes the records match exactly and have not violated any rules */
%PUT CHECK: RULE: VOLUME MATCHING 999.2;
proc sql;
create table QueryData as
select subpad('VOLUME MATCHING',1,58) as RULE_NM length=58,
actual_dlvry_date,
subpad(imb_code,1,31) as IMB_CODE length = 31,
999.2 as RULE_ORDER,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
EXCEPT
select 'VOLUME MATCHING' as RULE_NM length=58,
actual_dlvry_date,
imb_code length = 31,
999.2 as RULE_ORDER,
imb_dlvry_zip_5
from QueryData&ZIP5
where rule_order < 997.1
and rule_order is not null
EXCEPT
select 'VOLUME MATCHING' as RULE_NM length=58,
actual_dlvry_date,
imb_code length = 31,
999.2 as RULE_ORDER,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
where imb_code not in(select imb_code
from ods_iv_recon_selected_mp)
;
quit;
/* Append datasets to final dataset */
proc append base=QueryData&ZIP5 data=QueryData force;
run;
/* RULE: TOTAL BIDS VOLUME SAMPLED */
%PUT CHECK: RULE: TOTAL BIDS VOLUME SAMPLED 999.3;
data QueryData;
length rule_nm $58;
length imb_code $31;
set ods_bi_recon_selected_mp;
RULE_NM='TOTAL BIDS VOLUME SAMPLED';
RULE_ORDER=999.3;
keep rule_nm actual_dlvry_date imb_code rule_order imb_dlvry_zip_5;
run;
/* Append datasets to final dataset */
proc append base=QueryData&ZIP5 data=QueryData force;
run;
/* RULE: EXCLUDED IN IV INCLUDED IN BIDS */
%PUT CHECK: RULE: EXCLUDED IN IV INCLUDED IN BIDS 999.4;
proc sql;
create table QueryData as
select subpad('EXCLUDED IN IV INCLUDED IN BIDS',1,58) as RULE_NM length=58,
actual_dlvry_date,
imb_code length=31,
999.4 as RULE_ORDER,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
where excl_sts_code is null
and imb_code in (select imb_code
from ods_iv_recon_selected_mp
where excl_sts_code is not null);
quit;
/* Append datasets to final dataset */
proc append base=QueryData&ZIP5 data=QueryData force;
run;
/* RULE: INCLUDED IN IV EXCLUDED IN BIDS */
%PUT CHECK: RULE: INCLUDED IN IV EXCLUDED IN BIDS 999.5;
proc sql;
create table QueryData as
select subpad('INCLUDED IN IV EXCLUDED IN BIDS',1,58) as RULE_NM length=58,
actual_dlvry_date,
imb_code length=31,
999.5 as RULE_ORDER,
imb_dlvry_zip_5
from ods_bi_recon_selected_mp
where excl_sts_code is not null
and imb_code in (select imb_code
from ods_iv_recon_selected_mp
where excl_sts_code is null);
quit;
/* Append datasets to final dataset */
proc append base=QueryData&ZIP5 data=QueryData force;
run;
It's quite a long code, and i don't have the time to figure out what it does and why.
Ther's probably a reason why you process one zip at the time...?
If there's no logical problems, run all zipcodes in one go instead.
Why don't you ask for advice on how to do what you're efficiently instead of how do I fix my SQL code. Include sample data all relevant details.
How are you planning on working with the data in this, and other processes - considering the fact that you can't where clause the data for a subset?
At a guess, and this has many connotations, I would say create a table on the database (using pass through, and into a temp area) which contains the subset of data you want to work with, then bring that data across into SAS.
It is difficult to follow such a long and complex code. Anyhow I have some hits:
1) As the datasets are huge and you want to deal one zip code at a time,
add index to the datasets and compute IMB_CODE while importing data from oracle:
create table ods_iv_recon_selected_mp(index=imb_dlvry_zip_5 imb_code) as
create table ods_bi_recon_selected_mp(index=imb_dlvry_zip_5 imb_code) as
adding index will need some more time and disk space but will save time laiter.
2) If you computed imb_code already you don't need to compute it on next SQL steps.
3) Try to convert the sql step
create table QueryData as
into data step with merge, something like:
data QueryData;
merge ods_bi_recon_selected_mp (in=inA keep=... where=(excl_sts_code is null))
ods_iv_recon_selected_mp (in=inB where=(excl_sts_code is not null));
by imb_code;
if inA and inB;
run;
- check the logic, is the code fits your needs
- data step may be more efficient and faster, execpt if merge is many to many (check log messages).
If you got message of "... more than one ..." - you can't use data step.
4) Finally you may try using SPDE engine, but this may be a long story to explain and it may need more disk space;
Here is your code, repeated several times, computing IMB_CODE
subpad(imb_code,1,31) as IMB_CODE length=31,
You can do it once per oracle table, just add format $char31.
In some cases merging with data step can be more efficient than sql with nested select;
In case, more than one tablee have same combination of BY group, you cannot use merge,
then I use SQL with JOIN.
I'm not the best SQL guy, but there are a couple of elephants in the room.
First, why is the pass-through logic using select * ? It seems likely that you need only a handful of fields (but it's impossible to tell by looking at the program how many fields are actually being retrieved).
Second, if you want help from the more dyed-in-the-wool SQL experts, it would help to break down how much time each of the steps takes.
Good luck.
Relating to @Astounding note:
If you want more efficient program: less memory and less disk space will run the program in less time.
If you select from DB only those variables you need, you save memory, disk and time.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.