There seems to be a bug, introduced with SAS 9.4M7, that sharply increases memory consumption when a dataset is read with a WHERE= dataset option (here, while loading a hash object).
This code reproduces it:
options fullstimer;

/* Large dataset: only the first 7825 of 16195392 rows have CHAR1="FGHIJ" */
data TEST_BIGDS (drop=i);
  length CHAR1 $5. CHAR2 $10. KEY $16. DATE 5.;
  format KEY $HEX32. DATE DDMMYYP10.;
  do i=1 to 16195392;
    if i le 7825 then CHAR1="FGHIJ";
    else CHAR1="ABCDE";
    CHAR2="KLMNO";
    KEY=put(i,8.);
    DATE=today();
    output;
  end;
run;

/* Small dataset that drives the lookup */
data TEST_SMALLDS;
  length CHAR3 $5. KEY $16.;
  format KEY $HEX32.;
  CHAR3="PQRST";
  do i=1 to 20000;
    KEY=put(i,8.);
    output;
  end;
run;

/* Load the hash from the big dataset, filtered with a WHERE= dataset option */
data HASH_LOOKUP (drop=rc);
  length CHAR2 $10. DATE 5.;
  format DATE DDMMYYP10.;
  set TEST_SMALLDS;
  if _n_ eq 1 then do;
    declare hash H (dataset: "TEST_BIGDS (where=(CHAR1 eq 'FGHIJ'))");
    H.defineKey("KEY");
    H.defineData("CHAR2","DATE");
    H.defineDone();
    call missing(CHAR2,DATE);
  end;
  rc=H.find();
run;
Code like this suddenly crashed after upgrading from M6 to M7 (on AIX).
Since we had a backup server still on M6, we could make a comparative test.
This is the log from M6:
26   data HASH_LOOKUP (drop=rc);
27   length CHAR2 $10. DATE 5.;
28   format DATE DDMMYYP10.;
29   set TEST_SMALLDS;
30   if _n_ eq 1 then do;
31   declare hash H (dataset: "TEST_BIGDS (where=(CHAR1 eq 'FGHIJ'))");
32   H.defineKey("KEY");
33   H.defineData("CHAR2","DATE");
34   H.defineDone();
35   call missing(CHAR2,DATE);
36   end;
37   rc=H.find();
38   run;

NOTE: There were 7825 observations read from the data set WORK.TEST_BIGDS.
      WHERE CHAR1='FGHIJ';
NOTE: There were 20000 observations read from the data set WORK.TEST_SMALLDS.
NOTE: The data set WORK.HASH_LOOKUP has 20000 observations and 5 variables.
NOTE: Verwendet wurde: DATA statement - (Gesamtverarbeitungszeit):
      real time                       4.50 seconds
      user cpu time                   0.28 seconds
      system cpu time                 0.31 seconds
      memory                          1834.62k
      OS Memory                       12276.00k
      Timestamp                       04.03.2021 05:38:31 nachm.
      Step Count                      3  Switch Count  46
      Page Faults                     1064
      Page Reclaims                   1192
      Page Swaps                      0
      Voluntary Context Switches      125
      Involuntary Context Switches    340
      Block Input Operations          0
      Block Output Operations         0
and this from M7:
254  data HASH_LOOKUP (drop=rc);
255  length CHAR2 $10. DATE 5.;
256  format DATE DDMMYYP10.;
257  set TEST_SMALLDS;
258  if _n_ eq 1 then do;
259  declare hash H (dataset: "TEST_BIGDS (where=(CHAR1 eq 'FGHIJ'))");
260  H.defineKey("KEY");
261  H.defineData("CHAR2","DATE");
262  H.defineDone();
263  call missing(CHAR2,DATE);
264  end;
265  rc=H.find();
266  run;

NOTE: There were 7825 observations read from the data set WORK.TEST_BIGDS.
      WHERE CHAR1='FGHIJ';
NOTE: There were 20000 observations read from the data set WORK.TEST_SMALLDS.
NOTE: The data set WORK.HASH_LOOKUP has 20000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time                       1.82 seconds
      user cpu time                   0.67 seconds
      system cpu time                 0.11 seconds
      memory                          539046.65k
      OS Memory                       553580.00k
      Timestamp                       03/04/2021 05:34:02 PM
      Step Count                      4  Switch Count  0
      Page Faults                     0
      Page Reclaims                   131388
      Page Swaps                      0
      Voluntary Context Switches      2
      Involuntary Context Switches    73
      Block Input Operations          0
      Block Output Operations         0
This can cause (batch) jobs to fail when the MEMSIZE is not sufficient to deal with this sudden increase.
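To judge whether a job is at risk, it can help to compare the "memory" figure reported by FULLSTIMER with the session's MEMSIZE setting. A minimal sketch for printing that setting to the log, using PROC OPTIONS and the GETOPTION function (the wording of the %PUT message is just an illustration):

```sas
/* Print the MEMSIZE setting of the current session to the log */
proc options option=memsize value;
run;

/* The same value as macro text, e.g. for a pre-flight check in a batch job */
%put NOTE: MEMSIZE is %sysfunc(getoption(memsize));
```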
Creating an intermediate dataset with the WHERE condition is a suitable workaround for the moment.
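A sketch of that workaround, applied to the test above. The dataset name TEST_BIGDS_SUBSET is made up for this example; the rest follows the original step, except that the hash is loaded from the pre-filtered dataset without a WHERE= option:

```sas
/* Step 1: materialize the WHERE subset once.
   TEST_BIGDS_SUBSET is a hypothetical name chosen for this example. */
data TEST_BIGDS_SUBSET;
  set TEST_BIGDS (where=(CHAR1 eq 'FGHIJ'));
run;

/* Step 2: load the hash from the subset, with no WHERE= dataset option */
data HASH_LOOKUP (drop=rc);
  length CHAR2 $10 DATE 5;
  format DATE DDMMYYP10.;
  set TEST_SMALLDS;
  if _n_ eq 1 then do;
    declare hash H (dataset: "TEST_BIGDS_SUBSET");
    H.defineKey("KEY");
    H.defineData("CHAR2","DATE");
    H.defineDone();
    call missing(CHAR2,DATE);
  end;
  rc=H.find();
run;
```

The extra step costs one more pass over TEST_BIGDS and some WORK space for the subset.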
Thanks to @ccaero who found this and created the test.
SAS confirmed the problem and issued SAS Note 67620, "A hash object in SAS® 9.4M7 (TS1M7) might consume significantly more memory than it did in previous releases".
From the note:
This problem occurs when the HASHEXP method is not specified when defining the hash object. The amount of memory that is allocated can vary, depending on the data within the defined hash.
The workaround is to define the hash object with the HASHEXP method, as illustrated by the syntax fragment below:
data HASH_LOOKUP;
  if _n_ eq 1 then do;
    declare hash H (dataset: "TEST_DSET (where=(CHAR1 eq 'ABCDE'))",HASHEXP:8);
    ...more code...
If the HASHEXP method is not specified in the declaration, a default value of 8 is used. However, specifying HASHEXP:8 (the default) in the DECLARE statement will dramatically reduce the step memory footprint than not coding the method. The value for the HASHEXP method depends on usage. It is recommended that you test different values to find the optimal value for each case.
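One way to follow the note's advice to test different values is to rerun the lookup from the test above with several HASHEXP settings and compare the "memory" lines that FULLSTIMER writes to the log. This is only a sketch: the macro name try_hashexp and the values 8, 12 and 16 are arbitrary choices for illustration, not recommendations from the note.

```sas
options fullstimer;

/* Hypothetical helper macro: runs the hash lookup once for a given HASHEXP value */
%macro try_hashexp(exp);
  data _null_;
    length CHAR2 $10 DATE 5;
    set TEST_SMALLDS;
    if _n_ eq 1 then do;
      /* &exp sits outside the quoted string, so it resolves to the macro argument */
      declare hash H (dataset: "TEST_BIGDS (where=(CHAR1 eq 'FGHIJ'))", hashexp: &exp);
      H.defineKey("KEY");
      H.defineData("CHAR2","DATE");
      H.defineDone();
      call missing(CHAR2,DATE);
    end;
    rc=H.find();
  run;
%mend try_hashexp;

/* Example values only; compare the FULLSTIMER output of each step in the log */
%try_hashexp(8)
%try_hashexp(12)
%try_hashexp(16)
```

The steps write to _NULL_ because only the resource statistics matter here, not the lookup result.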
Hi @Kurt_Bremser, has this been reported to SAS Tech Support? If not, I'll be happy to open a track for it.
Hi @ChrisHemedinger ,
I opened a support track on this topic today. The ticket number is 7613295889.
What's the outcome of the track?
It would be nice if others who have already upgraded to M7 could check this on other platforms.