code optimization with hashes

mariopellegrini · Posted 02-28-2023 09:44 AM

Good morning everyone. I'm trying to optimize some time-consuming code and would like suggestions. The example below shows a typical case: a PROC SORT followed by a data step with a BY statement, used to pick out the last record of each BY group. Could hash techniques reduce the processing time?

data ds_1;
input cod1 cod2;
datalines;
1 4
1 12
1 7
2 7
2 6
2 9
3 12
3 4
;
proc sort data=ds_1;
by cod1 cod2;
run;

data ds_2;
set ds_1;
by cod1 cod2;
if last.cod1;
run;

Kurt_Bremser · Posted 02-28-2023 11:06 AM

How large is your real dataset (number of observations, observation size)?

How long do your steps take in real life?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mariopellegrini · Posted 02-28-2023 12:50 PM

35,800,340 observations and 11 variables in total (8 of them in the BY statement).
The original data step takes:
real time 5:38.72
cpu time 5:10.59

Kurt_Bremser · Posted 02-28-2023 01:50 PM

If your initial dataset is already sorted by cod1, you could avoid the PROC SORT by using a DOW loop:

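data want;
/* first pass over the BY group: find the largest cod2 */
do until (last.cod1);
  set have;
  by cod1;
  _cod2 = max(_cod2,cod2);
end;
/* second pass over the same group: output the row(s) holding that maximum */
do until (last.cod1);
  set have;
  by cod1;
  if cod2 = _cod2 then output;
end;
drop _cod2;
run;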
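
Since the question was about hashes, here is a minimal sketch of how a hash object could pick the largest cod2 per cod1 in a single pass, with no sort at all. It is an illustration under assumptions, not a tested replacement for the real job: it reuses the have/want names from the DOW loop above, it covers only the two variables of the example (every variable you want to keep would have to be listed in defineData), and it needs enough memory to hold one entry per distinct cod1. Whether it beats the sort on 35.8 million rows is something only a test run can show.

```sas
/* Sketch only: one hash entry per cod1, holding the largest cod2 seen so far. */
/* Assumes input dataset have and output dataset want (names from the example). */
data _null_;
  declare hash h(ordered: 'a');        /* 'a' = write the output in ascending key order */
  h.defineKey('cod1');
  h.defineData('cod1', 'cod2');
  h.defineDone();

  do until (eof);
    set have end=eof;
    _cod2 = cod2;                      /* save the incoming value; find() overwrites cod2 */
    if h.find() ne 0 then h.add();     /* cod1 not stored yet: add the current row */
    else if _cod2 > cod2 then do;      /* incoming value beats the stored one */
      cod2 = _cod2;
      h.replace();
    end;
  end;

  h.output(dataset: 'want');           /* writes only the defineData variables */
  stop;
run;
```

Unlike the DOW loop, this keeps exactly one row per cod1 even when the maximum cod2 occurs more than once.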
