Re: Adding missing values to long data set

smithb · Posted 11-16-2019 12:18 PM

Greetings,

I have a data set of patients with scores corresponding to a specific visit (from 1 to 3). Here is a snippet of the data:

As you can see, id 1 has three separate visits. However, id 3 has only 1 visit recorded. What I would like to do is keep the original format of the data set (long) but insert "visit = 2, score = ." or "visit = 2, score = ." and "visit = 3, score = ." for subjects who have either one or two subsequent missing visits, respectively. I can do this by converting the data set to long format, with three separate scores (score1, score2, score3), but I would like to keep the original format. Also, is there a way to do this in a single data step?

Here's my code (converting from long to wide currently):

data center_a;
	input id gender age visit score @@;
	datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
	CENTER = 1;
run;
proc sort data = center_a;
	by id visit;
run;
data center_a_revised;
	set center_a;
	by id visit;
	array subject{3} score1 score2 score3;
	retain score1 score2 score3;
	if (first.id) then do i = 1 to 3;
		subject{i} = .;
	end;
	subject{visit} = score;
	if (lsat.id) then output;
	keep id gender age visit score CENTER;
run;

novinosrin · Posted 11-16-2019 12:25 PM

Hi @smithb By any chance are you asking for this?


 data center_a;
	input id gender age visit score ;
	datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
%let n=3;/*number of visits*/
data want;
 do until(last.id);
  set center_a;
  by id;
  output;
 end;
 score=.;
 do visit=visit+1 to &n;
  output;
 end;
run;

smithb · Posted 11-16-2019 02:29 PM

Hello @novinosrin , thanks for all your help. I can't tell you how much I appreciate the time you've taken to help me in this. I ran the code, and yes, this is what I was looking for. Thank you so much!

unison · Posted 11-16-2019 12:34 PM

Something like this. I'm sure you can do it in one step, but this is how I think about it using proc transpose.

data center_a;
	input id gender age visit score @@;
	datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
run;
proc sort data=center_a; by id visit; run;

proc transpose data=center_a out=center_trx(drop=_name_) prefix=vst_;
by id gender age;
id visit;
var score;
run;

proc transpose data=center_trx out=center_trx_back(rename=col1=score);
by id gender age;
run; 

data want(drop=_name_);
set center_trx_back;
visit = 1*scan(_name_,2,"_");
run;

-unison

Tom · Posted 11-16-2019 12:36 PM

Are the missing visits always at the end?

data center_a_revised;
  set center_a;
  by id visit;
  output;
  if last.id then do visit=visit+1 to 3 ;
    score=.;
    output;
  end;
run;

Or is it possible that some visits are missing in the middle. So someone has visits 1 and 3 and is missing visit 2?

data center_a_all_visits;
  set center_a(keep=id visit);
  by id;
  if first.id then do visit=1 to 3;
    output;
  end;
run;

data center_a_revised;
  merge center_a center_a_all_visits;
  by id visit;
run;

smithb · Posted 11-16-2019 02:33 PM

Hello @Tom , the data includes subjects whom have an initial first visit, but some of them did not appear for any further subsequent visits. So no, none of the observations are missing visit two but have visits 1 and 3 accounted for. Thank you for this interesting take on the data. And thank you for taking the time to work on this code, I really appreciate it. I ran your top code, and that is the output I am looking for. Thanks again

r_behata · Posted 11-16-2019 07:34 PM

 data center_a;
	input id gender age visit score ;
	datalines;
1 1 49 1 1.33333
1 1 49 2 4.55556
1 1 49 3 0.55556
2 1 51 1 2.66667
2 1 51 2 4.88889
2 1 51 3 4.22222
3 0 52 1 3.44444
4 0 70 1 3.44444
4 0 70 2 0.44444
4 0 70 3 4.88889
5 0 49 1 3.33333
;
run;

data want;
	set center_a;
	
	by id;
	
	output;
	
	if last.id then do;
	do while(visit < 3);
		visit=visit +1;
		score=.;
		output;
	end;
	end;
run;

hashman · Posted 11-16-2019 11:58 PM

@smithb:

I strongly dislike hard-coding, so instead of setting the max number of visits to 3, I'd rather first determine from the data itself what that number really is and then use it downstream in the step:

data have ;                                                                                                                             
  input id gender age visit score ;                                                                                                     
  cards ;                                                                                                                               
1 1 49 1 1.33333                                                                                                                        
1 1 49 2 4.55556                                                                                                                        
1 1 49 3 0.55556                                                                                                                        
2 1 51 1 2.66667                                                                                                                        
2 1 51 2 4.88889                                                                                                                        
2 1 51 3 4.22222                                                                                                                        
3 0 52 1 3.44444                                                                                                                        
4 0 70 1 3.44444                                                                                                                        
4 0 70 2 0.44444                                                                                                                        
4 0 70 3 4.88889                                                                                                                        
5 0 49 1 3.33333                                                                                                                        
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  if _n_ = 1 then do until (z) ;                                                                                                        
    set have (keep = visit) end = z ;                                                                                                   
    last.visit = max (last.visit, visit) ;                                                                                              
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  by id ;                                                                                                                               
  output ;                                                                                                                              
  call missing (score) ;                                                                                                                
  if last.id and visit < last.visit then do visit = visit + 1 to last.visit ;                                                           
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;

An interesting question is whether or not the same can be done in a single step if the input data set is completely disordered. The answer is that with a modicum of memorizing stuff in a hash table, it is affirmative:

data have ;                                                                                                                             
  input id gender age visit score ;                                                                                                     
  cards ;                                                                                                                               
4 0 70 3 4.88889                                                                                                                        
1 1 49 3 0.55556                                                                                                                        
2 1 51 1 2.66667                                                                                                                        
1 1 49 2 4.55556                                                                                                                        
2 1 51 3 4.22222                                                                                                                        
4 0 70 1 3.44444                                                                                                                        
2 1 51 2 4.88889                                                                                                                        
3 0 52 1 3.44444                                                                                                                        
4 0 70 2 0.44444                                                                                                                        
5 0 49 1 3.33333                                                                                                                        
1 1 49 1 1.33333                                                                                                                        
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (ordered:"a") ;                                                                                                          
    h.definekey ("id") ;                                                                                                                
    h.definedata ("id", "_iorc_", "_n_") ;                                                                                              
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set have (keep = id visit) curobs = q end = z ;                                                                                   
      last.visit = max (last.visit, visit) ;                                                                                            
      if h.find() ne 0 then call missing (_iorc_, _n_) ;                                                                                
      if _iorc_ < visit then do ;                                                                                                       
        _iorc_ = visit ;                                                                                                                
        _n_ = q ;                                                                                                                       
      end ;                                                                                                                             
      h.replace() ;                                                                                                                     
    end ;                                                                                                                               
  end ;                                                                                                                                 
  set have curobs = q ;                                                                                                                 
  output ;                                                                                                                              
  h.find() ;                                                                                                                            
  call missing (score) ;                                                                                                                
  if _n_ = q then do visit = _iorc_ + 1 to last.visit ;                                                                                 
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;

Though it might look like a codeful, the scheme is actually pretty simple:

Read the input file, find the highest visit across all records, and store it in the variable LAST.VISIT.
For each ID, memorize its highest visit and the record number on which it occurs.
Read the file again and output each record. If for the current record's ID, the stored highest visit is less than LAST.VISIT, augment the output with visit numbers from the stored value up by 1 to last.visit.

Note that the variable LAST.VISIT has nothing to do with BY processing. It's just an arbitrary temporary variable, whose convenience lies in the fact that it is set to 0 at compile, auto-retained, and auto-dropped.

Kind regards

Paul D.

novinosrin · Posted 11-17-2019 10:56 AM

Nice one Guru @hashman after paying attention to your post, my leaning was little bit towards utilizing SPARSE

tables id*visit/out=temp(keep=id visit) sparse;

in proc freq to get all max combinations in an ordered form and perform a direct dynamic look up.

However when i tested on some large data, the performance of SPARSE was worse than awful. Yours was lightning fast. Of course as expected.

Okay, my post isn't about just words of praise that you anyway are used to receiving from SAS world, but many procedure options are not really performing at least at citizens SAS installation. Is it only at citizens resources or just expected normal behavior that you've seen from your enormous experiences? I guess you or Tom are likely to be best judge of it.

PS Thank you for the text. SAS certification seems far easier than passing road test. lol. We'll see i i make it in achieving my drivers. 🙂

Registration is open

SAS Training: Just a Click Away