Solved: Generate variable according to the order of same-id observations

Ginny_Han · Posted 09-17-2020 04:50 AM

Dear all,

I have a question regarding how to generate a variable that reflect the order of the same-id observations. I don't want to delete any duplicated records.

I proposed a sample data here. Suppose id means patient and date is admission date and each observation is an admission record. I want to generate a "order" variable to reflect (after sorting by id and date) whether the observation is the 1st, 2nd... admission. I also want a "times" variable to reflect how many admissions (observations) of the patient there are in the dataset.

data test;
input id $ date $ order $ times;
datalines;
1 2010/09/10 1 1
2 2010/0912  1  2
2  2010/10/10 2 2
3 2010/04/03 1 3
3 2010/05/03 2 3
3 2010/05/29 3 3
;
run;

So far, I get how to sort the variables by id and then date.

proc sort data=test out=test;
by id date;
run;

Please help me on generating the "order" and "times" variables.

(I used to use STATA and I wonder if SAS has things like "_n" and "_N" to denote the order and number of observations in a by-sort group (e.g. group id-date). Like in STATA there is "bysort id date" and "gen order=_n" or "gen x=x1(_n-1)". If there is similar concept is SAS, it would be great to know what it is called.)

Thanks in advance!

Ginny Han

yabwon · Posted 09-17-2020 05:04 AM

Hi,

double DoW-loop looks like good candidate to help.

data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;

proc sort data=test out=test;
by id date;
run;

data want;

do until (last.id);
  set test;
  by id date;
  times + 1; 
end;

order = 0;
do until (last.id);
  set test;
  by id date;
  order + 1; 
  output;
end;

times = 0;
run;

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

yabwon · Posted 09-17-2020 05:04 AM

Hi,

double DoW-loop looks like good candidate to help.

data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;

proc sort data=test out=test;
by id date;
run;

data want;

do until (last.id);
  set test;
  by id date;
  times + 1; 
end;

order = 0;
do until (last.id);
  set test;
  by id date;
  order + 1; 
  output;
end;

times = 0;
run;

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ginny_Han · Posted 09-17-2020 09:11 PM

Hi Bart @yabwon ,

Thanks for the solution! It worked fine.

Can I ask a follow-up question? How does the command "-output-" work in the do loop? Because with it, the loop would produce an order of the observations. And without it, the times variables would be generated with every observation in the same by-sort group having the same value?

order = 0;
do until (last.id);
  set test;
  by id date;
  order + 1; 
  output;
end;

Thanks again!

Ginny

yabwon · Posted 09-18-2020 02:44 AM

Hi Ginny, @Ginny_Han,

it is not exactly as you writing, I'll try to explain.

First you have to remember thai _each_ `set` statement in your code (and we have two here) is a separate reading process. Here is a bit modified version of the original code with some additional `put` statements printing state of the PDV vector to the log. I also added two `curobs=` options to show which observation was read during which `set` statement execution (look at it in the log below).

data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;

proc sort data=test out=test;
by id date;
run;

options ls=max ps=max;
data want;

put "Before: " @20 _ALL_;

do until (last.id);
  set test curobs = curobs1;
  by id date;
  times + 1; 
  put "First loop: " @20 _ALL_;
end; /* here the TIMES variable is only acumulated */

put "After first loop: "  @20 _ALL_;

order = 0;
do until (last.id);
  set test curobs = curobs2;
  by id date;
  order + 1; 
  put "Second loop: " @20 _ALL_;
  output; /* here data are outputed to WANT dataset */
end;

put "After second loop: "  @20 _ALL_;
put ;
times = 0;
run;

Here is the log:

1    data test;
2    input id $ date yymmdd10.;
3    format date yymmdd10.;
4    datalines;

NOTE: The data set WORK.TEST has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


11   ;
12   run;
13
14   proc sort data=test out=test;
15   by id date;
16   run;

NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST has 6 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


17
18   options ls=max ps=max;
19   data want;
20
21   put "Before: " @20 _ALL_;
22
23   do until (last.id);
24     set test curobs = curobs1;
25     by id date;
26     times + 1;
27     put "First loop: " @20 _ALL_;
28   end; /* here the TIMES variable is only acumulated */
29
30   put "After first loop: "  @20 _ALL_;
31
32   order = 0;
33   do until (last.id);
34     set test curobs = curobs2;
35     by id date;
36     order + 1;
37     put "Second loop: " @20 _ALL_;
38     output; /* here data are outputed to WANT dataset */
39   end;
40
41   put "After second loop: "  @20 _ALL_;
42   put ;
43   times = 0;
44   run;

Before:            last.id=1 curobs1=. id=  date=. FIRST.id=1 FIRST.date=1 LAST.date=1 times=0 order=0 curobs2=. _ERROR_=0 _N_=1
First loop:        last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=0 curobs2=. _ERROR_=0 _N_=1
After first loop:  last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=0 curobs2=. _ERROR_=0 _N_=1
Second loop:       last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=1 curobs2=1 _ERROR_=0 _N_=1
After second loop: last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=1 curobs2=1 _ERROR_=0 _N_=1

Before:            last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=0 order=1 curobs2=1 _ERROR_=0 _N_=2
First loop:        last.id=0 curobs1=2 id=2 date=2010-09-12 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=1 curobs2=1 _ERROR_=0 _N_=2
First loop:        last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=1 curobs2=1 _ERROR_=0 _N_=2
After first loop:  last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=1 curobs2=1 _ERROR_=0 _N_=2
Second loop:       last.id=0 curobs1=3 id=2 date=2010-09-12 FIRST.id=1 FIRST.date=1 LAST.date=1 times=2 order=1 curobs2=2 _ERROR_=0 _N_=2
Second loop:       last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=2 curobs2=3 _ERROR_=0 _N_=2
After second loop: last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=2 curobs2=3 _ERROR_=0 _N_=2

Before:            last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=0 order=2 curobs2=3 _ERROR_=0 _N_=3
First loop:        last.id=0 curobs1=4 id=3 date=2010-04-03 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=2 curobs2=3 _ERROR_=0 _N_=3
First loop:        last.id=0 curobs1=5 id=3 date=2010-05-03 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=2 curobs2=3 _ERROR_=0 _N_=3
First loop:        last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=2 curobs2=3 _ERROR_=0 _N_=3
After first loop:  last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=2 curobs2=3 _ERROR_=0 _N_=3
Second loop:       last.id=0 curobs1=6 id=3 date=2010-04-03 FIRST.id=1 FIRST.date=1 LAST.date=1 times=3 order=1 curobs2=4 _ERROR_=0 _N_=3
Second loop:       last.id=0 curobs1=6 id=3 date=2010-05-03 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=2 curobs2=5 _ERROR_=0 _N_=3
Second loop:       last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=3 curobs2=6 _ERROR_=0 _N_=3
After second loop: last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=3 curobs2=6 _ERROR_=0 _N_=3

Before:            last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=0 order=3 curobs2=6 _ERROR_=0 _N_=4
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.WANT has 6 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

So basically first loop reads observation 1 from TEST [curobs1=1] (in this case it is also the last observation in the group) and increases `times` by 1 (with the `times + 1;` sum statement). when the first loop ends the PDV contains "accumulated" value of `times`. The second loop (since it contains separate `set` statement) it reads observation 1 from the TEST [curobs2=1] and increases order by 1, and executes `output` so the current state od PDV is moved as new observation into WANT. First iteration of data step ends but since there was the `set` statement in the code SAS runs next iteration of data step'a main loop.

In the second iteration of the data step's main loop the first until-loop reads observations 2 and 3 from TEST [curobs1=2, 3] and increases `times` by 1 two times so when the first loop ends the PDV contains "accumulated" value of `times` equal to 2. The second loop (since it contains separate `set` statement) it reads observation 2 from the TEST [curobs2=2] and increases order by 1, and executes `output` so the current state od PDV is moved as the second observation into WANT, and then it reads observation 3 from the TEST [curobs2=3] and increases order by 1 (so the value is 2), and executes `output` so the current state od PDV is moved as third observation into WANT.

Then the third main loop iteration starts, and it does the same what two before but for observations 4,5,and 6 to cumulate `times` and the again 4,5, and 6.

And eventually SAS starts fourth iteration of the main loop (_N_ = 4 in the log) , first until-loop starts and the `set` statement in it tries to read next observation (7th) but since there is nothing more to read it stops.

So that is more-or-less how it works. If you need some more details there is great papers by Paul Dorfman (@hashman) titled: "The Magnificent DO" and "The DoW-Loop Unrolled",

links are here:

https://support.sas.com/resources/papers/proceedings13/126-2013.pdf

https://www.lexjansen.com/nesug/nesug08/hw/hw02.pdf

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ksharp · Posted 09-17-2020 07:24 AM

data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;
data want;
 set test;
 by id;
 if first.id then order=0;
 order+1;
 times+first.id;
run;

yabwon · Posted 09-18-2020 02:50 AM

@Ksharp

I like "short-ness" of the approach but I think it won't work for the following data where `times` for the last group should be 5,

data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
3 2010/05/30
3 2010/05/31
;
run;

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ksharp · Posted 09-19-2020 07:04 AM

OK. I misunderstand the OP's question .

ballardw · Posted 09-17-2020 10:52 AM

@Ginny_Han wrote:

(I used to use STATA and I wonder if SAS has things like "_n" and "_N" to denote the order and number of observations in a by-sort group (e.g. group id-date). Like in STATA there is "bysort id date" and "gen order=_n" or "gen x=x1(_n-1)". If there is similar concept is SAS, it would be great to know what it is.

The SAS data step has an automatic varaible _n_ that counts the iteration of the data step. SAS automatic variables are not output to the data set but you can create a new variable with the value.

data test;
   input id $ date $ order $ times;
   iteration = _n_;
datalines;
1 2010/09/10 1 1
2 2010/0912  1  2
2  2010/10/10 2 2
3 2010/04/03 1 3
3 2010/05/03 2 3
3 2010/05/29 3 3
;
run;

With simple code such as this it returns a row number but with some of the more complex approaches possible the results can be bit unexpected.

Ginny_Han · Posted 09-17-2020 09:25 PM

It's good to know anyhow. Thanks.

Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

Re: Generate variable according to the order of same-id observations

SAS Innovate 2025: Call for Content