Solved: Re: Increment on out of order but distinct values?

Krueger1 · Posted 06-22-2021 11:03 AM

I have to sort by NAME and then DATE. Once this is complete I'm trying to determine the UNIQUE frequency of each ID occurring between any ID's -4.

Currently if ID = 1 then 2 then 1 it'll count as 3 when I'm needing that to count as 2. Below is examples.

data have;
input NAME $ Flag DATE ID RESULT;
format DATE yymmdd10.;
cards;
Joe 0 20190101 1 500
Joe 0 20190102 1 500
Joe 0 20190103 1 500
Joe 0 20190104 1  -4
Joe 1 20190121 1 500
Joe 0 20190122 1 500
Tom 0 20190104 1 500
Tom 0 20190101 1  -4
Tom 0 20190102 1 500
Tom 0 20190103 2 500
Tom 0 20190104 3 500
Tom 0 20190105 1 500
Tom 0 20190106 2 500
Tom 0 20190107 7  -4
;
run;

data want;
input NAME $ DATE ID RESULT want;
format DATE yymmdd10.;
cards;
Joe 0 20190101 1 500 1
Joe 0 20190102 1 500 1
Joe 0 20190103 1 500 1
Joe 0 20190104 1 -4 1
Joe 1 20190121 1 500 1
Joe 0 20190122 1 500 1
Tom 0 20190104 1 500 1
Tom 0 20190101 1 -4 1
Tom 0 20190102 1 500 1
Tom 0 20190103 2 500 2
Tom 0 20190104 3 500 3
Tom 0 20190105 1 500 3
Tom 0 20190106 2 500 3
Tom 0 20190107 7 -4  4
;
run;

*existing non-working code;
data want;
set have;
by name id notsorted;
retain have;
if first.name or lag(id) = -4 then card_Count = 1;
else if first.id then card_Count + 1; 
run;

FreelanceReinh · Posted 06-22-2021 12:31 PM

Hello @Krueger1,

You need to create some sort of lookup table to store the information about the distinct IDs you've already seen in the current group of observations.

I would use a hash object for this purpose:

data want;
if _n_=1 then do;
  dcl hash h();
  h.definekey('id');
  h.definedone();
end;
set have;
by name date;
if first.name or lag(result) = -4 then h.clear();
h.ref();
card_Count=h.num_items;
run;

(This assumes that dataset HAVE has been sorted by NAME DATE, as you mentioned.)

If you're not familiar with hash objects, you may want to use a temporary array instead. Depending on the ID values, you can use

an array indexed by ID values (assuming that ID values are integers between 1 and the array dimension, e.g., 999)

data want;
set have;
by name date;
array c[999] _temporary_;
if first.name or lag(result) = -4 then call missing(of c[*]);
c[id]=1;
card_Count=sum(of c[*]);
run;

or an array containing (up to, e.g., 999) distinct ID values:

data want;
set have;
by name date;
array c[999] _temporary_;
if first.name or lag(result) = -4 then do;
  card_Count=0;
  call missing(of c[*]);
end;
if id ~in c then do;
  card_Count+1;
  c[card_Count]=id;
end;
run;

View solution in original post

FreelanceReinh · Posted 06-22-2021 12:31 PM

Hello @Krueger1,

You need to create some sort of lookup table to store the information about the distinct IDs you've already seen in the current group of observations.

I would use a hash object for this purpose:

data want;
if _n_=1 then do;
  dcl hash h();
  h.definekey('id');
  h.definedone();
end;
set have;
by name date;
if first.name or lag(result) = -4 then h.clear();
h.ref();
card_Count=h.num_items;
run;

(This assumes that dataset HAVE has been sorted by NAME DATE, as you mentioned.)

If you're not familiar with hash objects, you may want to use a temporary array instead. Depending on the ID values, you can use

an array indexed by ID values (assuming that ID values are integers between 1 and the array dimension, e.g., 999)

data want;
set have;
by name date;
array c[999] _temporary_;
if first.name or lag(result) = -4 then call missing(of c[*]);
c[id]=1;
card_Count=sum(of c[*]);
run;

or an array containing (up to, e.g., 999) distinct ID values:

data want;
set have;
by name date;
array c[999] _temporary_;
if first.name or lag(result) = -4 then do;
  card_Count=0;
  call missing(of c[*]);
end;
if id ~in c then do;
  card_Count+1;
  c[card_Count]=id;
end;
run;

Krueger1 · Posted 06-22-2021 12:43 PM

This is great and makes a lot of sense! My ID's in actual dataset are closer to GUID's then simple ID's but this is a great base and I'll definitely do some more research into HASH. Thank you!

FreelanceReinh · Posted 06-22-2021 01:06 PM

@Krueger1 wrote:

My ID's in actual dataset are closer to GUID's then simple ID's ...

Unlike the first array solution, the second can be modified easily to work with character IDs. Just insert a suitable length specification into the array definition, e.g.,

array c[999] $16 _temporary_;

if the IDs have a length <=16.

The hash object solution works with character IDs as well because it uses the type and length of the ID variable in dataset HAVE.

Increment on out of order but distinct values?

Re: Increment on out of order but distinct values?

Re: Increment on out of order but distinct values?

Re: Increment on out of order but distinct values?

Re: Increment on out of order but distinct values?

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!