I am looking for a different way to write the code below. It runs, but very slowly, as the files are large.
It does run with the WHERE statement in a separate step, but before the WHERE the file is 774,000,000 records and the step runs for 2 hours. Any tips would be appreciated. Thanks.
FROM WORK.SELECRCS1 t1
   LEFT JOIN WORK.APPEND_TABLE_0003 t2
      ON  (t1.LEVEL3 = t2.LEVEL3) AND (t1.LEVEL2 = t2.LEVEL2)
      AND (t1.LEVEL1 = t2.LEVEL1) AND (t1.PROMO_CD = t2.PROMO_CODE)
WHERE t2.LEVEL3 IS NOT MISSING;
QUIT;
I did not sort either data set before the SQL step. One file is around 152 million records and the other is around 12 million.
I could do a sort beforehand if that would help. It already runs so long that I did not think sorting would help, but I can give that a try.
PROC SQL will probably sort behind the scenes, so sorting beforehand might save you time, especially if you need that sort order more than once when using these tables.
And if that is the case, adding an ORDER BY might slow down PROC SQL a bit, but might make the next step faster.
Also explore the TAGSORT option if memory is an issue.
Another way to speed up a sort is to put the utility location and the WORK location on separate drives. These are set when SAS starts.
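For example, a pre-sort could look like this (a minimal sketch; the BY variables are taken from the join above, and TAGSORT is only needed if work space is tight):

proc sort data=WORK.SELECRCS1 tagsort;   /* TAGSORT reduces the temporary work space the sort needs, at some speed cost */
   by LEVEL1 LEVEL2 LEVEL3 PROMO_CD;
run;
proc sort data=WORK.APPEND_TABLE_0003;
   by LEVEL1 LEVEL2 LEVEL3 PROMO_CODE;
run;
/* The utility file location is set at SAS start-up, e.g. -utilloc "D:\sasutil" in the config or command line (the path here is an assumption). */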
Another option is to create an index on the larger table. There is no need for one on the small table since you keep all its records.
This is only worthwhile if you keep a small portion of the indexed table (typically less than 5 to 15%).
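A minimal sketch of a composite index on the join keys, assuming WORK.SELECRCS1 is the larger of the two tables (the index name is arbitrary):

proc sql;
   create index joinkeys on WORK.SELECRCS1 (LEVEL1, LEVEL2, LEVEL3, PROMO_CD);
quit;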
Another way is to read the unsorted large table sequentially in a data step, and load the small table into a hash table to perform the match.
Another way is to store the source data using the SPDE engine instead of the base engine: SPDE is very good at sorting on the fly.
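A minimal sketch of moving the large table to an SPDE library (the library path is an assumption):

libname speedy spde "/sasdata/spde";   /* SPDE engine library; path is an assumption        */
data speedy.selecrcs1;                 /* store the large table with the SPDE engine        */
   set work.selecrcs1;
run;
/* SPDE can then sort implicitly when a BY statement is used on the unsorted table. */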
Sorting might be critical, but not for a SQL approach.
Here is what a different approach could look like:
data want;
   set t1 (rename=(promo_cd = promo_code) in=found_in_t1)
       t2 (in=fromt2);
   by level1 level2 level3 promo_code;
   /* within each BY group, t1 records come first, so the first record is from t1 whenever the group exists in t1 */
   if first.promo_code then keepme = found_in_t1;
   if keepme;      /* drop BY groups that do not exist in t1      */
   retain keepme;
   if fromt2;      /* keep only the records contributed by t2     */
run;
It's untested code, but its correctness relies on two points. First, both data sets have to be sorted. Second, the T2 data set has to supply the identifiers only, not additional variables. If additional variables have to come from T2, the task can still be done, but it gets a little longer.
I believe the speed here will be much faster than anything else, since there is only one SET statement. But you need to confirm the conditions we are working under: does T2 need to supply additional variables, and can we sort the data sets?
On sorted data sets, data steps are not necessarily faster than SQL.
data T1(sortedby=I);
   do I=1 to 100e6;
      output;
   end;
run;

data T2(sortedby=I);
   do I=1 to 100e6 by 10;
      output;
   end;
run;

proc sql _method;
   create table T as select T1.I from T1, T2 where T1.I=T2.I;
quit;

data T;
   merge T1 T2(in=B);
   by I;
   if B;
run;
26 proc sql _method;
27 create table T as select T1.I from T1, T2 where T1.I=T2.I;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxjm
sqxsrc( WORK.T2 )
sqxsrc( WORK.T1 )
NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: Table WORK.T created, with 10000000 rows and 1 columns.
28 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 12.88 seconds
cpu time 12.89 seconds
29 data T;
30 merge T1 T2(in=B);
31 by I;
32 if B;
33 run;
NOTE: Compression was disabled for data set WORK.T because compression overhead would increase the size of the data set.
NOTE: There were 100000000 observations read from the data set WORK.T1.
NOTE: There were 10000000 observations read from the data set WORK.T2.
NOTE: The data set WORK.T has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 14.67 seconds
cpu time 14.67 seconds
@ChrisNZ ,
Obviously you are correct about a MERGE statement:
data T;
   merge T1 T2(in=B);
   by I;
   if B;
run;
But try it with a SET statement instead:
data T;
   set T1 (in=A) T2(in=B);
   by I;
   if first.i then keepme=A;
   retain keepme;
   if B and keepme;
run;
I would expect the time to be cut in half (roughly).
In addition, one file has unique combinations of the joining fields and the other does not, which is why the output file ends up being so large. We want those multiple matches to split credit, so that is actually OK in this scenario.
The performance of SAS WORK table joins will be largely limited by the IO performance of the SAS application server's WORK drive. It would be helpful if you could post the full SAS log of your query. If the query's CPU usage is considerably less than the elapsed time, then you can be pretty sure it is IO-constrained. If you are processing a lot of character columns, then setting OPTIONS COMPRESS=YES; may speed things up: it compresses your tables so they take up less space, which improves IO.
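A minimal sketch of those settings; FULLSTIMER is not mentioned above, but it is a standard option that adds fuller resource statistics to the log for exactly this kind of real-time versus CPU-time comparison:

options compress=yes;   /* compress data sets; mainly helps when there are many character columns */
options fullstimer;     /* write detailed real time, CPU time and memory statistics to the log    */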
Try this:
data want;
   set selecrcs1;
   if _n_ = 1 then do;
      /* load the small table into a hash object keyed on the join variables */
      declare hash t2 (dataset:"append_table_0003 (rename=(promo_code=promo_cd))");
      t2.definekey("level1","level2","level3","promo_cd");
      t2.definedone();
   end;
   if t2.check() = 0;   /* keep only the large-table records that have a match */
run;
It might fail with a memory overflow because of the size of the second table, but it's worth a try.
That's the best option if the small table is only used for subsetting the large one, or if the amount of data retrieved is small (up to the memory available) as none of the tables need sorting.
Note that if you only subset, you should specify a data portion for the hash table of 8 bytes maximum in order to reduce memory footprint.
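For example, the declaration above could name a single 8-byte numeric variable as the data portion (a sketch, assuming LEVEL1 is numeric; without DEFINEDATA, the key variables also become the data portion):

declare hash t2 (dataset:"append_table_0003 (rename=(promo_code=promo_cd))");
t2.definekey("level1","level2","level3","promo_cd");
t2.definedata("level1");   /* small 8-byte data portion; assumes LEVEL1 is numeric */
t2.definedone();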
@ChrisNZ wrote:
Note that if you only subset, you should specify a data portion for the hash table of 8 bytes maximum in order to reduce memory footprint.
That is an important tip. I will test that today on the AIX platform, just to see if that effect spans across operating systems.
Did the test; there is a noticeable reduction in memory footprint when I define one of the key variables that has a length of 8 as data.
Yes, memory footprint goes up in increments of 16 bytes per item, so one byte too many and that's 16 bytes added.