BookmarkSubscribeRSS Feed
LukeD
Calcite | Level 5

Hi,

I am trying to calculate how many times different suppliers have provided a name and address.

What I need is for the value of the supplied variable for last.UID to be applied to all observations of that UID so that when I later class by each supplier I get an accurate count of the total time that record has been supplied.

Current code:

data new;

     set supplied1;

     by uid;

          if first.uid then do;

               supplied=0;

          end;

     supplied=supplied+1;

retain supplied;

if last.uid;

run;

I can do this in a number of steps but I have a feeling there is a much simpler way of doing it!

Thanks

18 REPLIES 18
art297
Opal | Level 21

I think you will have to make two passes through the data, even if only one datastep is used.  One solution:

data new;

  do until (last.uid);

    set supplied1;

    by uid;

    if first.uid then supplied=1;

    else supplied=supplied+1;

  end;

  do until (last.uid);

    set supplied1;

    by uid;

    output;

  end;

run;

Peter_C
Rhodochrosite | Level 12

There are merits in this approach. Although there appears to be 2 passes through the data because there are two set statemets, the data flow switches from one stream to the other on change of by group. If the buffering and caching of the data holds more than one by group there will be only one I/O stream into the buffers for this data step.

It has been presented before in forums and sas-L - using two references to to same dataset in a single SET statement like :

Data final ;

SET supplied1( in= A ) supplied1 ;

By uid ;

If first.uid then supplied=0;

If a then  supplied +1 ;

else output;

Run;

art297
Opal | Level 21

: Don't know how I missed that but, in the tests I just ran, both approaches took approximately the same amounts of CPU and real times, thus I'm not convinced that they don't both take two passes through the data. However, that said, I like the approach you suggested as it does the same thing with less code.  The only thing I'd suggest changing is the if then else combinations. i.e.:

data new;

  set supplied1( in= a ) supplied1 ;

  by uid ;

  if first.uid then supplied=1;

  else if a then supplied +1 ;

  else output;

run;

Linlin
Lapis Lazuli | Level 10

shorter:

data have;

input id@@;

cards;

1 2 2 3 3 3 9 9 9 9 18 18 18 18 18

;

data want;

set have(in=a)have;

by id ;

if a then t+1-first.id*t;

else output;

run;

proc print;run;

art297
Opal | Level 21

: Actually, on average, your suggestion runs slightly slower, but the two approaches do run in very similar times.  I prefer the if-then-else combinations as it is easier (I think) for most to follow the logic.  The fact that it run slightly faster (and I mean really, really only slightly), I think, is just an added benefit.

Linlin
Lapis Lazuli | Level 10

Hi Art,

You are right. Is there any leftover turkey to share?

art297
Opal | Level 21

@Linlin: Only if you get here quickly and now is being served in the form of Sheppard's Pie

Astounding
PROC Star

You won't measure the difference in a one-variable data set, but it would be good practice to modify the SET statement:

set supplied1 (in=a keep=uid) supplied1;


art297
Opal | Level 21

: Interestingly, the efficiency doesn't appear to apply in this particular situation.  The test code I ran was:

data supplied1;

  set sashelp.class (rename=(age=uid));

  array junk(999) (999*1);

  do _n_=1 to 10000;

    output;

  end;

run;

proc sort data=supplied1;

  by uid;

run;

data new;

  set supplied1( in= a ) supplied1 ;

  by uid ;

  if first.uid then supplied=0;

  if a then  supplied +1 ;

  else output;

run;

data new;

  set supplied1( in= a keep=uid) supplied1 ;

  by uid ;

  if first.uid then supplied=0;

  if a then  supplied +1 ;

  else output;

run;

Astounding
PROC Star

Art,

Hmmmm ... there are mysteries that can be difficult to explain.  By any chance did you measure the difference with your original DOW approach?

art297
Opal | Level 21

: of course I looked at that as well.  Even with the double DOW, the pdv is populated with the variables from both steps thus using a keep option doesn't improve the processing times.

If any of you are wondering why is surprised, take a look at: Increase Your Productivity by Doing Less - sasCommunity

Haikuo
Onyx | Level 15

Art's paper reminds me that one of the co-authors, Ksharp, has dropped out of the grid for a long time. Let's wish him the best!

Haikuo

art297
Opal | Level 21

: KSharp is still alive and well and, every now and then, sends me an idea for a SAS-related paper that he'd like me to investigate.

ChrisNZ
Tourmaline | Level 20

For the sake of code clarity,

set SUPPLIED1(in=a keep=uid) SUPPLIED1;

is indeed better imho, regardless of performance.

This way we have more information about what each table is used for.

We could even write:

set SUPPLIED1(in=groupingstream keep=uid) SUPPLIED1;

or

set SUPPLIED1(in=groupingstream keep=uid) SUPPLIED1(in=outputstream);

if we were really petty about descriptive names, which we are not.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 1129 views
  • 7 likes
  • 8 in conversation