Hi,
I am trying to calculate how many times different suppliers have provided a name and address.
What I need is for the value of the supplied variable for last.UID to be applied to all observations of that UID so that when I later class by each supplier I get an accurate count of the total time that record has been supplied.
Current code:
data new;
set supplied1;
by uid;
if first.uid then do;
supplied=0;
end;
supplied=supplied+1;
retain supplied;
if last.uid;
run;
I can do this in a number of steps but I have a feeling there is a much simpler way of doing it!
Thanks
I think you will have to make two passes through the data, even if only one datastep is used. One solution:
data new;
do until (last.uid);
set supplied1;
by uid;
if first.uid then supplied=1;
else supplied=supplied+1;
end;
do until (last.uid);
set supplied1;
by uid;
output;
end;
run;
There are merits in this approach. Although there appears to be 2 passes through the data because there are two set statemets, the data flow switches from one stream to the other on change of by group. If the buffering and caching of the data holds more than one by group there will be only one I/O stream into the buffers for this data step.
It has been presented before in forums and sas-L - using two references to to same dataset in a single SET statement like :
Data final ;
SET supplied1( in= A ) supplied1 ;
By uid ;
If first.uid then supplied=0;
If a then supplied +1 ;
else output;
Run;
: Don't know how I missed that but, in the tests I just ran, both approaches took approximately the same amounts of CPU and real times, thus I'm not convinced that they don't both take two passes through the data. However, that said, I like the approach you suggested as it does the same thing with less code. The only thing I'd suggest changing is the if then else combinations. i.e.:
data new;
set supplied1( in= a ) supplied1 ;
by uid ;
if first.uid then supplied=1;
else if a then supplied +1 ;
else output;
run;
shorter:
data have;
input id@@;
cards;
1 2 2 3 3 3 9 9 9 9 18 18 18 18 18
;
data want;
set have(in=a)have;
by id ;
if a then t+1-first.id*t;
else output;
run;
proc print;run;
: Actually, on average, your suggestion runs slightly slower, but the two approaches do run in very similar times. I prefer the if-then-else combinations as it is easier (I think) for most to follow the logic. The fact that it run slightly faster (and I mean really, really only slightly), I think, is just an added benefit.
Hi Art,
You are right. Is there any leftover turkey to share?
@Linlin: Only if you get here quickly and now is being served in the form of Sheppard's Pie
You won't measure the difference in a one-variable data set, but it would be good practice to modify the SET statement:
set supplied1 (in=a keep=uid) supplied1;
: Interestingly, the efficiency doesn't appear to apply in this particular situation. The test code I ran was:
data supplied1;
set sashelp.class (rename=(age=uid));
array junk(999) (999*1);
do _n_=1 to 10000;
output;
end;
run;
proc sort data=supplied1;
by uid;
run;
data new;
set supplied1( in= a ) supplied1 ;
by uid ;
if first.uid then supplied=0;
if a then supplied +1 ;
else output;
run;
data new;
set supplied1( in= a keep=uid) supplied1 ;
by uid ;
if first.uid then supplied=0;
if a then supplied +1 ;
else output;
run;
Art,
Hmmmm ... there are mysteries that can be difficult to explain. By any chance did you measure the difference with your original DOW approach?
: of course I looked at that as well. Even with the double DOW, the pdv is populated with the variables from both steps thus using a keep option doesn't improve the processing times.
If any of you are wondering why is surprised, take a look at: Increase Your Productivity by Doing Less - sasCommunity
Art's paper reminds me that one of the co-authors, Ksharp, has dropped out of the grid for a long time. Let's wish him the best!
Haikuo
For the sake of code clarity,
set SUPPLIED1(in=a keep=uid) SUPPLIED1;
is indeed better imho, regardless of performance.
This way we have more information about what each table is used for.
We could even write:
set SUPPLIED1(in=groupingstream keep=uid) SUPPLIED1;
or
set SUPPLIED1(in=groupingstream keep=uid) SUPPLIED1(in=outputstream);
if we were really petty about descriptive names, which we are not.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.