Greetings,
I have a data set of patients with scores corresponding to a specific visit (from 1 to 3). Here is a snippet of the data:
As you can see, id 1 has three separate visits. However, id 3 has only 1 visit recorded. What I would like to do is keep the original format of the data set (long) but insert "visit = 2, score = ." or "visit = 2, score = ." and "visit = 3, score = ." for subjects who have either one or two subsequent missing visits, respectively. I can do this by converting the data set to long format, with three separate scores (score1, score2, score3), but I would like to keep the original format. Also, is there a way to do this in a single data step?
Here's my code (converting from long to wide currently):
data center_a;
input id gender age visit score @@;
datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
CENTER = 1;
run;
proc sort data = center_a;
by id visit;
run;
data center_a_revised;
set center_a;
by id visit;
array subject{3} score1 score2 score3;
retain score1 score2 score3;
if (first.id) then do i = 1 to 3;
subject{i} = .;
end;
subject{visit} = score;
if (lsat.id) then output;
keep id gender age visit score CENTER;
run;
Hi @smithb By any chance are you asking for this?
data center_a;
input id gender age visit score ;
datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
%let n=3;/*number of visits*/
data want;
do until(last.id);
set center_a;
by id;
output;
end;
score=.;
do visit=visit+1 to &n;
output;
end;
run;
Hello @novinosrin , thanks for all your help. I can't tell you how much I appreciate the time you've taken to help me in this. I ran the code, and yes, this is what I was looking for. Thank you so much!
Something like this. I'm sure you can do it in one step, but this is how I think about it using proc transpose.
data center_a;
input id gender age visit score @@;
datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
run;
proc sort data=center_a; by id visit; run;
proc transpose data=center_a out=center_trx(drop=_name_) prefix=vst_;
by id gender age;
id visit;
var score;
run;
proc transpose data=center_trx out=center_trx_back(rename=col1=score);
by id gender age;
run;
data want(drop=_name_);
set center_trx_back;
visit = 1*scan(_name_,2,"_");
run;
Are the missing visits always at the end?
data center_a_revised;
set center_a;
by id visit;
output;
if last.id then do visit=visit+1 to 3 ;
score=.;
output;
end;
run;
Or is it possible that some visits are missing in the middle. So someone has visits 1 and 3 and is missing visit 2?
data center_a_all_visits;
set center_a(keep=id visit);
by id;
if first.id then do visit=1 to 3;
output;
end;
run;
data center_a_revised;
merge center_a center_a_all_visits;
by id visit;
run;
Hello @Tom , the data includes subjects whom have an initial first visit, but some of them did not appear for any further subsequent visits. So no, none of the observations are missing visit two but have visits 1 and 3 accounted for. Thank you for this interesting take on the data. And thank you for taking the time to work on this code, I really appreciate it. I ran your top code, and that is the output I am looking for. Thanks again
data center_a;
input id gender age visit score ;
datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
run;
data want;
set center_a;
by id;
output;
if last.id then do;
do while(visit < 3);
visit=visit +1;
score=.;
output;
end;
end;
run;
I strongly dislike hard-coding, so instead of setting the max number of visits to 3, I'd rather first determine from the data itself what that number really is and then use it downstream in the step:
data have ;
input id gender age visit score ;
cards ;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
run ;
data want ;
if _n_ = 1 then do until (z) ;
set have (keep = visit) end = z ;
last.visit = max (last.visit, visit) ;
end ;
set have ;
by id ;
output ;
call missing (score) ;
if last.id and visit < last.visit then do visit = visit + 1 to last.visit ;
output ;
end ;
run ;
An interesting question is whether or not the same can be done in a single step if the input data set is completely disordered. The answer is that with a modicum of memorizing stuff in a hash table, it is affirmative:
data have ;
input id gender age visit score ;
cards ;
4 0 70 3 4.88889
1 1 49 3 0.55556
2 1 51 1 2.66667
1 1 49 2 4.55556
2 1 51 3 4.22222
4 0 70 1 3.44444
2 1 51 2 4.88889
3 0 52 1 3.44444
4 0 70 2 0.44444
5 0 49 1 3.33333
1 1 49 1 1.33333
;
run ;
data want ;
if _n_ = 1 then do ;
dcl hash h (ordered:"a") ;
h.definekey ("id") ;
h.definedata ("id", "_iorc_", "_n_") ;
h.definedone () ;
do until (z) ;
set have (keep = id visit) curobs = q end = z ;
last.visit = max (last.visit, visit) ;
if h.find() ne 0 then call missing (_iorc_, _n_) ;
if _iorc_ < visit then do ;
_iorc_ = visit ;
_n_ = q ;
end ;
h.replace() ;
end ;
end ;
set have curobs = q ;
output ;
h.find() ;
call missing (score) ;
if _n_ = q then do visit = _iorc_ + 1 to last.visit ;
output ;
end ;
run ;
Though it might look like a codeful, the scheme is actually pretty simple:
Note that the variable LAST.VISIT has nothing to do with BY processing. It's just an arbitrary temporary variable, whose convenience lies in the fact that it is set to 0 at compile, auto-retained, and auto-dropped.
Kind regards
Paul D.
Nice one Guru @hashman after paying attention to your post, my leaning was little bit towards utilizing SPARSE
tables id*visit/out=temp(keep=id visit) sparse;
in proc freq to get all max combinations in an ordered form and perform a direct dynamic look up.
However when i tested on some large data, the performance of SPARSE was worse than awful. Yours was lightning fast. Of course as expected.
Okay, my post isn't about just words of praise that you anyway are used to receiving from SAS world, but many procedure options are not really performing at least at citizens SAS installation. Is it only at citizens resources or just expected normal behavior that you've seen from your enormous experiences? I guess you or Tom are likely to be best judge of it.
PS Thank you for the text. SAS certification seems far easier than passing road test. lol. We'll see i i make it in achieving my drivers. 🙂
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.