I want to create sequential numbers for people who have multiple assessments within each admission; a person can have multiple admissions within a year. I sorted the data first by ID and then by admission date. I saw SAS examples for a single group, and they work fine if I just want to assign a sequential number for each patient:
data want; set original;
by pid;
if first.pid then seq_id=1;
else seq_id+1;
run;
Then the data would look like following:
PID Date Seq_ID
1 1/1/2011 1
1 1/1/2011 2
1 3/4/2011 3
2 5/1/2010 1
2 6/3/2010 2
But when I tried to expand the code to assign sequential numbers for each patient within the same admission, the results don't look right (they seem to be the same as the above):
data want; set original;
by pid date;
if first.pid and first.date then seq_id=1;
else seq_id+1;
run;
I want the data to look like the following:
PID Date Seq_ID
1 1/1/2011 1
1 1/1/2011 2
1 3/4/2011 1
2 5/1/2010 1
2 6/3/2010 1
It should be a simple question and I did google first before asking. Thanks in advance for any help.
Hi Solph,
As @Tom says, you want to reset the counter when FIRST.DATE is true. You do not care about the status of the FIRST. variables for earlier variables in the BY statement.
You can try the code below.
data test;
input pid $ date $;
cards;
1 1/1/2011
1 1/1/2011
1 3/4/2011
2 5/1/2010
2 6/3/2010
;
run;
proc sort data=test;
by pid date;
run;
data want;
set test ;
by pid date;
if first.date then seq_id=0;
seq_id+1;
if last.date then output; /* keeps only the last row of each pid/date group; remove this line to keep every row */
run;
Thanks.
You want to reset the counter when FIRST.DATE is true. You do not care about the status of the FIRST. variables for earlier variables in the BY statement.
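Applied to the step from the question, that change might look like the sketch below. It assumes the input data set is named original and is already sorted by pid and date, as described in the question, and that every row should be kept.

```sas
data want;
  set original;
  by pid date;
  /* FIRST.DATE is 1 on the first row of every pid/date combination,   */
  /* including the first row of a new pid, so no FIRST.PID test needed */
  if first.date then seq_id = 1;
  else seq_id + 1;   /* sum statement: seq_id is retained across rows */
run;
```

Testing "first.pid and first.date" resets the counter only on the first row of each patient, which is why the earlier attempt gave the same numbers as the single-group version.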
It indeed worked now. Thanks much, though I still don't fully understand the logic of first and last. Any good book chapters or websites on the use of first, last, retain, accumulating, summing? (For example, why do you only need to specify the second variable in the BY list?) Thanks.
Try here:
http://www.ats.ucla.edu/stat/sas/
or
www.lexjansen.com and search the papers for relevant topics.
The first site has a ton of info on it though.
Thanks so much. (I liked the actual code in reply). I'm also confused now. For example, I've seen examples end with "if last.var then output" (like you did here) while many don't. When do I need it? Thanks.
Solph,
Normally, the variable FIRST.id is 1 on the first observation for each id and 0 otherwise;
the variable LAST.id is 1 on the last observation for each id and 0 otherwise.
Thanks
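On when "if last.var then output" is needed, here is a rough side-by-side sketch. It assumes the test data set from the earlier reply, sorted by pid and date. The output data set names (all_rows, one_per_group) and the variable n_rows are made up for this illustration.

```sas
/* No explicit OUTPUT: every input row is written, */
/* and seq_id is a running count within pid/date.  */
data all_rows;
  set test;
  by pid date;
  if first.date then seq_id = 0;
  seq_id + 1;
run;

/* Explicit OUTPUT on LAST.DATE: only the last row of each      */
/* pid/date group is written, so n_rows is the group's row count. */
data one_per_group;
  set test;
  by pid date;
  if first.date then n_rows = 0;
  n_rows + 1;
  if last.date then output;
run;
```

So the subsetting OUTPUT is only wanted when you need one summary row per group. To number every record, leave it out.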
Solph,
See the example below. It's easy to understand.
data students;
input gender score;
cards;
1 48
1 48
1 45
2 50
2 42
1 41
2 51
1 52
1 43
2 52
2 52
;
run;
proc sort data = students;
by gender score;
run;
data students1;
set students;
count + 1;
by gender score;
if first.score then count = 1;
if last.score then output;
run;
Thanks
It helps to look at some example data.
data test;
do a=1 to 3; do b=1 to 3; do c=1 to 3; output; end; end; end;
run;
data _null_;
set test;
by a b ;
put (a first.a last.a b first.b last.b c) (=);
run;
a=1 FIRST.a=1 LAST.a=0 b=1 FIRST.b=1 LAST.b=0 c=1
a=1 FIRST.a=0 LAST.a=0 b=1 FIRST.b=0 LAST.b=0 c=2
a=1 FIRST.a=0 LAST.a=0 b=1 FIRST.b=0 LAST.b=1 c=3
a=1 FIRST.a=0 LAST.a=0 b=2 FIRST.b=1 LAST.b=0 c=1
a=1 FIRST.a=0 LAST.a=0 b=2 FIRST.b=0 LAST.b=0 c=2
a=1 FIRST.a=0 LAST.a=0 b=2 FIRST.b=0 LAST.b=1 c=3
a=1 FIRST.a=0 LAST.a=0 b=3 FIRST.b=1 LAST.b=0 c=1
a=1 FIRST.a=0 LAST.a=0 b=3 FIRST.b=0 LAST.b=0 c=2
a=1 FIRST.a=0 LAST.a=1 b=3 FIRST.b=0 LAST.b=1 c=3
a=2 FIRST.a=1 LAST.a=0 b=1 FIRST.b=1 LAST.b=0 c=1
a=2 FIRST.a=0 LAST.a=0 b=1 FIRST.b=0 LAST.b=0 c=2
a=2 FIRST.a=0 LAST.a=0 b=1 FIRST.b=0 LAST.b=1 c=3
a=2 FIRST.a=0 LAST.a=0 b=2 FIRST.b=1 LAST.b=0 c=1
a=2 FIRST.a=0 LAST.a=0 b=2 FIRST.b=0 LAST.b=0 c=2
a=2 FIRST.a=0 LAST.a=0 b=2 FIRST.b=0 LAST.b=1 c=3
a=2 FIRST.a=0 LAST.a=0 b=3 FIRST.b=1 LAST.b=0 c=1
a=2 FIRST.a=0 LAST.a=0 b=3 FIRST.b=0 LAST.b=0 c=2
a=2 FIRST.a=0 LAST.a=1 b=3 FIRST.b=0 LAST.b=1 c=3
a=3 FIRST.a=1 LAST.a=0 b=1 FIRST.b=1 LAST.b=0 c=1
a=3 FIRST.a=0 LAST.a=0 b=1 FIRST.b=0 LAST.b=0 c=2
a=3 FIRST.a=0 LAST.a=0 b=1 FIRST.b=0 LAST.b=1 c=3
a=3 FIRST.a=0 LAST.a=0 b=2 FIRST.b=1 LAST.b=0 c=1
a=3 FIRST.a=0 LAST.a=0 b=2 FIRST.b=0 LAST.b=0 c=2
a=3 FIRST.a=0 LAST.a=0 b=2 FIRST.b=0 LAST.b=1 c=3
a=3 FIRST.a=0 LAST.a=0 b=3 FIRST.b=1 LAST.b=0 c=1
a=3 FIRST.a=0 LAST.a=0 b=3 FIRST.b=0 LAST.b=0 c=2
a=3 FIRST.a=0 LAST.a=1 b=3 FIRST.b=0 LAST.b=1 c=3
Thanks for the great advice from everyone.
I've a maybe not so related question and I thought I would just ask here instead of creating a new subject entry.
I want to compare data between records by ID, admission date and a group variable. If Group is the same between two consecutive lines (within the same admission), then combine the records and output the appropriate dates so that I can calculate the time elapsed between dates.
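data test;
/* read both dates as SAS date values so they can be subtracted later */
input pid $ admission_date1 : mmddyy10. assessment_date : mmddyy10. group $;
format admission_date1 assessment_date mmddyy10.;
cards;
1 1/1/2011 1/3/2011 A
1 1/1/2011 1/9/2011 B
1 3/4/2011 4/2/2011 B
1 3/4/2011 4/8/2011 B
2 6/1/2010 6/4/2011 C
2 6/5/2010 7/4/2011 C
3 7/1/2011 7/9/2011 D
;
run;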
*Desired output:
PID  Adm date  Date1     Date2     Group
1    1/1/2011  1/1/2011  1/3/2011  A  *admission date repeats on each record; on the first record, Date1 = admission date
1    1/1/2011  1/3/2011  1/9/2011  B  *on a subsequent record, Date1 = date from the previous record
1    3/4/2011  3/4/2011  4/8/2011  B  *Group B is the same between records within the same admission, so output a single record with the appropriate dates
2    6/1/2010  6/1/2010  6/4/2011  C
2    6/5/2010  6/5/2010  7/4/2011  C
3    7/1/2011  7/1/2011  7/9/2011  D
Hope the above makes sense. (The real data is more complicated.) And thanks for whatever help you can give.
Use LAG to get previous value. When it is the first record for the admit date then overwrite that with the admit date.
data have;
informat admdate testdate mmddyy10.;
format admdate testdate mmddyy10. ;
input pid $ admdate testdate group $ @@; /* list input uses the informats assigned above */
cards;
1 1/1/2011 1/3/2011 A 1 1/1/2011 1/9/2011 B
1 3/4/2011 4/2/2011 B 1 3/4/2011 4/8/2011 B
2 6/1/2010 6/4/2011 C 2 6/5/2010 7/4/2011 C
3 7/1/2011 7/9/2011 D
run;
proc sort; by pid admdate testdate ; run;
data want ;
set have ;
by pid admdate testdate ;
prevdate = lag(testdate) ;
format prevdate mmddyy10.;
if first.admdate then prevdate=admdate;
diff = testdate - prevdate ;
put pid admdate prevdate testdate diff group ;
run;
1 01/01/2011 01/01/2011 01/03/2011 2 A
1 01/01/2011 01/03/2011 01/09/2011 6 B
1 03/04/2011 03/04/2011 04/02/2011 29 B
1 03/04/2011 04/02/2011 04/08/2011 6 B
2 06/01/2010 06/01/2010 06/04/2011 368 C
2 06/05/2010 06/05/2010 07/04/2011 394 C
3 07/01/2011 07/01/2011 07/09/2011 8 D
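The step above writes one line per test record, so the two consecutive group B records in the 3/4/2011 admission stay separate. If they should be collapsed into a single record, as in the desired output, one possible sketch is a BY statement with the NOTSORTED option, which treats each consecutive run of the same group as its own BY group. The names want2, date1, date2 and prev_end are made up here, and have is assumed to be sorted by pid, admdate and testdate as above.

```sas
data want2;
  set have;
  by pid admdate group notsorted;
  retain date1 prev_end;
  format date1 date2 mmddyy10.;
  /* the first run in an admission starts at the admission date */
  if first.admdate then prev_end = admdate;
  /* each run starts where the previous run ended */
  if first.group then date1 = prev_end;
  if last.group then do;
    date2 = testdate;
    diff = date2 - date1;
    output;
    prev_end = testdate;
  end;
  drop testdate prev_end;
run;
```

With the sample data this should give one record for the 3/4/2011 admission, running from 03/04/2011 to 04/08/2011.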
Thanks so much. You guys are really clever and I really suck at processing rectangular data. Can I ask another related question?
For a similar data set, each ID has multiple admission dates (admdate), and within each admission there are test dates (testdate), a group variable (GROUP) and a Score variable for each test. I'm hoping to get the following:
Within each person and the same admission, if consecutive records have the same group value (for example, the first assessment's RUG value is the same as the next record's value), I want to sum the score values and drop one of the records. So I don't care about the test date, only about ID, admdate, group and score. I know the lag function will allow me to compare values, but how do I delete the records?
I was only able to come up with the following, but how do I delete the previous record (within the same ID and admission date) that has the same group? Would using retain work? How? Sorry, I'm really bad at reading rectangular data, the lag function and retain. So thanks for being patient with me. The data is below if it helps. Thanks very much.
proc sort data=mydata.testdata out=test;
by id admdate testdate;
run;
data want; set test;
by ID admdate testdate;
prev_group= lag(group);
prev_score=lag(score);
if first.testdate and prev_group = group then score2=sum(score,prev_score);
if first.testdate and prev_group ne group then score2=score;
run;
ID Admdate Testdate Group Score
1 30/04/2001 01-Apr-06 SE2 49
1 30/04/2001 14-May-06 RLB 17
2 16/12/2004 01-Apr-06 SSB 80
2 16/12/2004 30-Jun-06 SSB 82
2 16/12/2004 30-Sep-06 SE2 104
2 16/12/2004 30-Dec-06 SSB 80
2 16/12/2004 30-Mar-07 CB2 2
3 24/08/2005 01-Apr-06 RLB 59
4 08/07/1989 01-Apr-06 CC2 88
4 08/07/1989 01-Jul-06 CC2 71
5 15/04/2000 01-Apr-06 CB1 8
5 15/04/2000 12-Apr-06 CB1 19
6 10/04/2001 01-Apr-06 SSB 69
6 10/04/2001 17-Jun-06 SSB 27
7 11/04/2005 01-Apr-06 CB1 17
7 11/04/2005 24-Apr-06 SSA 78
7 11/04/2005 24-Jul-06 SSA 14
8 07/04/2005 01-Apr-06 PB1 8
8 07/04/2005 20-Apr-06 CA1 45
Why don't you just use proc summary?
That's very true, and I was overthinking it with those complicated approaches. It works well if I don't care whether the records occur in consecutive order (such as 2nd record vs. 1st, and 3rd record vs. 2nd). But what if I do care (in some cases I need to)?
Then use proc summary with a class statement and the nway option.
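For reference, a rough sketch of both variants, assuming the sorted test data set from the earlier post (variables id, admdate, group, score). The output names score_sums, run_sums and score_sum are made up for illustration. Note that the CLASS/NWAY version pools every record with the same id, admdate and group whether or not the records are adjacent. The data step is a different technique (a BY statement with NOTSORTED) that sums only consecutive runs.

```sas
/* pools all records sharing id/admdate/group, consecutive or not */
proc summary data=test nway;
  class id admdate group;
  var score;
  output out=score_sums (drop=_type_ _freq_) sum=score_sum;
run;

/* sums only consecutive runs of the same group within an admission */
data run_sums;
  set test;
  by id admdate group notsorted;
  if first.group then score_sum = 0;
  score_sum + score;
  if last.group then output;
  keep id admdate group score_sum;
run;
```

For ID 2 in the sample data, the first version would give a single SSB row with 242, while the second would give SSB 162, SE2 104, SSB 80 and CB2 2.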