Dear all,
I have a question regarding how to generate a variable that reflect the order of the same-id observations. I don't want to delete any duplicated records.
I proposed a sample data here. Suppose id means patient and date is admission date and each observation is an admission record. I want to generate a "order" variable to reflect (after sorting by id and date) whether the observation is the 1st, 2nd... admission. I also want a "times" variable to reflect how many admissions (observations) of the patient there are in the dataset.
data test;
input id $ date $ order $ times;
datalines;
1 2010/09/10 1 1
2 2010/0912 1 2
2 2010/10/10 2 2
3 2010/04/03 1 3
3 2010/05/03 2 3
3 2010/05/29 3 3
;
run;
So far, I get how to sort the variables by id and then date.
proc sort data=test out=test;
by id date;
run;
Please help me on generating the "order" and "times" variables.
(I used to use STATA and I wonder if SAS has things like "_n" and "_N" to denote the order and number of observations in a by-sort group (e.g. group id-date). Like in STATA there is "bysort id date" and "gen order=_n" or "gen x=x1(_n-1)". If there is similar concept is SAS, it would be great to know what it is called.)
Thanks in advance!
Ginny Han
Hi,
double DoW-loop looks like good candidate to help.
data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;
proc sort data=test out=test;
by id date;
run;
data want;
do until (last.id);
set test;
by id date;
times + 1;
end;
order = 0;
do until (last.id);
set test;
by id date;
order + 1;
output;
end;
times = 0;
run;
All the best
Bart
Hi,
double DoW-loop looks like good candidate to help.
data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;
proc sort data=test out=test;
by id date;
run;
data want;
do until (last.id);
set test;
by id date;
times + 1;
end;
order = 0;
do until (last.id);
set test;
by id date;
order + 1;
output;
end;
times = 0;
run;
All the best
Bart
Hi Bart @yabwon ,
Thanks for the solution! It worked fine.
Can I ask a follow-up question? How does the command "-output-" work in the do loop? Because with it, the loop would produce an order of the observations. And without it, the times variables would be generated with every observation in the same by-sort group having the same value?
order = 0;
do until (last.id);
set test;
by id date;
order + 1;
output;
end;
Thanks again!
Ginny
Hi Ginny, @Ginny_Han,
it is not exactly as you writing, I'll try to explain.
First you have to remember thai _each_ `set` statement in your code (and we have two here) is a separate reading process. Here is a bit modified version of the original code with some additional `put` statements printing state of the PDV vector to the log. I also added two `curobs=` options to show which observation was read during which `set` statement execution (look at it in the log below).
data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;
proc sort data=test out=test;
by id date;
run;
options ls=max ps=max;
data want;
put "Before: " @20 _ALL_;
do until (last.id);
set test curobs = curobs1;
by id date;
times + 1;
put "First loop: " @20 _ALL_;
end; /* here the TIMES variable is only acumulated */
put "After first loop: " @20 _ALL_;
order = 0;
do until (last.id);
set test curobs = curobs2;
by id date;
order + 1;
put "Second loop: " @20 _ALL_;
output; /* here data are outputed to WANT dataset */
end;
put "After second loop: " @20 _ALL_;
put ;
times = 0;
run;
Here is the log:
1 data test; 2 input id $ date yymmdd10.; 3 format date yymmdd10.; 4 datalines; NOTE: The data set WORK.TEST has 6 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 11 ; 12 run; 13 14 proc sort data=test out=test; 15 by id date; 16 run; NOTE: There were 6 observations read from the data set WORK.TEST. NOTE: The data set WORK.TEST has 6 observations and 2 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 17 18 options ls=max ps=max; 19 data want; 20 21 put "Before: " @20 _ALL_; 22 23 do until (last.id); 24 set test curobs = curobs1; 25 by id date; 26 times + 1; 27 put "First loop: " @20 _ALL_; 28 end; /* here the TIMES variable is only acumulated */ 29 30 put "After first loop: " @20 _ALL_; 31 32 order = 0; 33 do until (last.id); 34 set test curobs = curobs2; 35 by id date; 36 order + 1; 37 put "Second loop: " @20 _ALL_; 38 output; /* here data are outputed to WANT dataset */ 39 end; 40 41 put "After second loop: " @20 _ALL_; 42 put ; 43 times = 0; 44 run; Before: last.id=1 curobs1=. id= date=. FIRST.id=1 FIRST.date=1 LAST.date=1 times=0 order=0 curobs2=. _ERROR_=0 _N_=1 First loop: last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=0 curobs2=. _ERROR_=0 _N_=1 After first loop: last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=0 curobs2=. _ERROR_=0 _N_=1 Second loop: last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=1 curobs2=1 _ERROR_=0 _N_=1 After second loop: last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=1 curobs2=1 _ERROR_=0 _N_=1 Before: last.id=1 curobs1=1 id=1 date=2010-09-10 FIRST.id=1 FIRST.date=1 LAST.date=1 times=0 order=1 curobs2=1 _ERROR_=0 _N_=2 First loop: last.id=0 curobs1=2 id=2 date=2010-09-12 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=1 curobs2=1 _ERROR_=0 _N_=2 First loop: last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=1 curobs2=1 _ERROR_=0 _N_=2 After first loop: last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=1 curobs2=1 _ERROR_=0 _N_=2 Second loop: last.id=0 curobs1=3 id=2 date=2010-09-12 FIRST.id=1 FIRST.date=1 LAST.date=1 times=2 order=1 curobs2=2 _ERROR_=0 _N_=2 Second loop: last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=2 curobs2=3 _ERROR_=0 _N_=2 After second loop: last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=2 curobs2=3 _ERROR_=0 _N_=2 Before: last.id=1 curobs1=3 id=2 date=2010-10-10 FIRST.id=0 FIRST.date=1 LAST.date=1 times=0 order=2 curobs2=3 _ERROR_=0 _N_=3 First loop: last.id=0 curobs1=4 id=3 date=2010-04-03 FIRST.id=1 FIRST.date=1 LAST.date=1 times=1 order=2 curobs2=3 _ERROR_=0 _N_=3 First loop: last.id=0 curobs1=5 id=3 date=2010-05-03 FIRST.id=0 FIRST.date=1 LAST.date=1 times=2 order=2 curobs2=3 _ERROR_=0 _N_=3 First loop: last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=2 curobs2=3 _ERROR_=0 _N_=3 After first loop: last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=2 curobs2=3 _ERROR_=0 _N_=3 Second loop: last.id=0 curobs1=6 id=3 date=2010-04-03 FIRST.id=1 FIRST.date=1 LAST.date=1 times=3 order=1 curobs2=4 _ERROR_=0 _N_=3 Second loop: last.id=0 curobs1=6 id=3 date=2010-05-03 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=2 curobs2=5 _ERROR_=0 _N_=3 Second loop: last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=3 curobs2=6 _ERROR_=0 _N_=3 After second loop: last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=3 order=3 curobs2=6 _ERROR_=0 _N_=3 Before: last.id=1 curobs1=6 id=3 date=2010-05-29 FIRST.id=0 FIRST.date=1 LAST.date=1 times=0 order=3 curobs2=6 _ERROR_=0 _N_=4 NOTE: There were 6 observations read from the data set WORK.TEST. NOTE: There were 6 observations read from the data set WORK.TEST. NOTE: The data set WORK.WANT has 6 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
So basically first loop reads observation 1 from TEST [curobs1=1] (in this case it is also the last observation in the group) and increases `times` by 1 (with the `times + 1;` sum statement). when the first loop ends the PDV contains "accumulated" value of `times`. The second loop (since it contains separate `set` statement) it reads observation 1 from the TEST [curobs2=1] and increases order by 1, and executes `output` so the current state od PDV is moved as new observation into WANT. First iteration of data step ends but since there was the `set` statement in the code SAS runs next iteration of data step'a main loop.
In the second iteration of the data step's main loop the first until-loop reads observations 2 and 3 from TEST [curobs1=2, 3] and increases `times` by 1 two times so when the first loop ends the PDV contains "accumulated" value of `times` equal to 2. The second loop (since it contains separate `set` statement) it reads observation 2 from the TEST [curobs2=2] and increases order by 1, and executes `output` so the current state od PDV is moved as the second observation into WANT, and then it reads observation 3 from the TEST [curobs2=3] and increases order by 1 (so the value is 2), and executes `output` so the current state od PDV is moved as third observation into WANT.
Then the third main loop iteration starts, and it does the same what two before but for observations 4,5,and 6 to cumulate `times` and the again 4,5, and 6.
And eventually SAS starts fourth iteration of the main loop (_N_ = 4 in the log) , first until-loop starts and the `set` statement in it tries to read next observation (7th) but since there is nothing more to read it stops.
So that is more-or-less how it works. If you need some more details there is great papers by Paul Dorfman (@hashman) titled: "The Magnificent DO" and "The DoW-Loop Unrolled",
links are here:
https://support.sas.com/resources/papers/proceedings13/126-2013.pdf
https://www.lexjansen.com/nesug/nesug08/hw/hw02.pdf
All the best
Bart
data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
;
run;
data want;
set test;
by id;
if first.id then order=0;
order+1;
times+first.id;
run;
I like "short-ness" of the approach but I think it won't work for the following data where `times` for the last group should be 5,
data test;
input id $ date yymmdd10.;
format date yymmdd10.;
datalines;
1 2010/09/10
2 2010/09/12
2 2010/10/10
3 2010/04/03
3 2010/05/03
3 2010/05/29
3 2010/05/30
3 2010/05/31
;
run;
All the best
Bart
OK. I misunderstand the OP's question .
@Ginny_Han wrote:
(I used to use STATA and I wonder if SAS has things like "_n" and "_N" to denote the order and number of observations in a by-sort group (e.g. group id-date). Like in STATA there is "bysort id date" and "gen order=_n" or "gen x=x1(_n-1)". If there is similar concept is SAS, it would be great to know what it is.
The SAS data step has an automatic varaible _n_ that counts the iteration of the data step. SAS automatic variables are not output to the data set but you can create a new variable with the value.
data test; input id $ date $ order $ times; iteration = _n_; datalines; 1 2010/09/10 1 1 2 2010/0912 1 2 2 2010/10/10 2 2 3 2010/04/03 1 3 3 2010/05/03 2 3 3 2010/05/29 3 3 ; run;
With simple code such as this it returns a row number but with some of the more complex approaches possible the results can be bit unexpected.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.