Hello SAS users!
Though this may be an easy task to those who have expertise with programming, I am struggling with this.
So I got data 'have' below.
data have;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
And I want to convert 'have' to 'want' like below.
data want;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 0 1 1 3 2
1 3 5 9 0 3 9 6
1 4 5 8 0 7 3 2
1 5 5 8 2 0 6
1 6 5 0 9 6 3 9
2 1 3 0 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 0 5 . .
2 4 3 13 7 9 . .
;
run;
So when you look at data set 'have',
id=1 have 5 not-all-missing numeric variables x1, x2, x3, x4, x5
and id=2 have only 3 not-all-missing numeric variables x1, x2, x3.
That means id=1 have 5 variables and id=2 have 3 variables.
For id=1, I want to convert missing values in x1, x2, x3, x4, x5 to zero
and For id=2, I want to convert missing values in x1, x2, x3 to zero but remains x4, x5 all missing.
I got thousands of id, so the array of variable x could be more than 100 (x1~x105)
How do I perform this EFFICIENTLY?
Any help will be appreciated.
There is definitely a more efficient way, but here is a transpose/transpose back approach
data have;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
proc transpose data=have out=temp1;
   by id permno N;
run;
proc sql;
   create table temp2 as
   select *
         ,case when count(col1)=0 then 0 else 1 end as flag
   from temp1
   group by id, _NAME_;
quit;
data temp3;
   set temp2;
   if col1=. & flag=1 then col1=0;
   drop flag;
run;
proc sort data=temp3;
   by id permno;
run;
proc transpose data=temp3 out=want(drop=_NAME_);
   by id permno;
   id _NAME_;
   var col1;
run;Something like:
data have;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
data want;
set have;
array val x:;
do over val;
if val=. then val=0;
end;
run;
Would actually change the value. However you might not need to. If you just want missings to be displayed as zero then you could use:
options missing="0";
Note this afffects the whole session.
You might want to be careful doing that anyways, 0 is not missing, therefore you are imputing data - this can affect calculations such as proc means.
Dear RW9
What I want to do is to remain missing values of id=2's x4, x5
not to change all missing values in the dataset to zero
Ah, ok then a couple of if statements are need:
data have;
 input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
data want (drop=flag i);
 set have;
 array x(5);
 flag=0;
 do i=1 to 5;
   if flag=0 and x{i}=. then do;
     x{i}=0;
     flag=1;
   end;
   else if flag=1 and x{i}=. then x{i}=0;
   else if flag=1 and x{i} ne . then leave;
 end;
run; 
					
				
			
			
				
			
			
			
			
			
			
			
		Ok, sorry, didn't see that. Just work from the back to start:
data have;
 input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
data want (drop=flag i);
 set have;
 array x(5);
 flag=0;
 do i=5 to 1 by -1;
   if x{i} ne . then flag=1;
   if flag=1 and x{i}=. then x{i}=0;
 end;
run; 
					
				
			
			
				
			
			
			
			
			
			
			
		@RW9, I politely disagree that this would work in the general case. What if the data looks like this?
data have;
 input id permno N x1 x2 x3 x4 x5 x6;
cards;
1 1 5 4 3 2 2 1 1
1 2 5 . 1 1 3 2 2
1 3 5 9 . 3 9 6 3
1 4 5 8 . 7 3 2 4
1 5 5 8 2 . 6 5
1 6 5 . 9 6 3 9 6
2 1 3 . 13 8 . . 1
2 2 3 23 5 8 . . 2
2 3 3 11 . 5 . . 3
2 4 3 13 7 9 . . 4
;
run;Then your code would convert the missing values for ID=2 in the variables x4 and x5 even though they are all missing in the by group and should remain so.
That is of course, if I understand your needs correctly @jkim197? 🙂
I don't disagree that taking different data, or different logic then my code would not work. TBH its so hard to tell what the OP is trying to do...
First step would be to ascertain why there is a transposed format in the first place, and why there are missings in a given set of data, which would most likely be simpler to fix earlier in the process. However we have only what is posted here.
There is definitely a more efficient way, but here is a transpose/transpose back approach
data have;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
proc transpose data=have out=temp1;
   by id permno N;
run;
proc sql;
   create table temp2 as
   select *
         ,case when count(col1)=0 then 0 else 1 end as flag
   from temp1
   group by id, _NAME_;
quit;
data temp3;
   set temp2;
   if col1=. & flag=1 then col1=0;
   drop flag;
run;
proc sort data=temp3;
   by id permno;
run;
proc transpose data=temp3 out=want(drop=_NAME_);
   by id permno;
   id _NAME_;
   var col1;
run;If you have many variable to impute , Make a macro .
data have;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6 .
1 6 5 . 9 6 3 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . .
;
run;
proc sql;
create table temp as
 select *,n(x1)=0 as _x1,n(x2)=0 as _x2,n(x3)=0 as _x3,n(x4)=0 as _x4,n(x5)=0 as _x5
  from have
   group by id
    order by id,permno;
quit;
data want;
 set temp;
 array x{*} x1-x5;
 array _x{*} _x1-_x5;
 do i=1 to dim(x);
  if missing(x{i}) and _x{i}=0 then x{i}=0;
 end;
 drop i _:;
run;HI @jkim197 just out of curiosity and if you have time, does the variable N in your HAVE denote the number of variables in starting/ascending sequence in the array group X1-Xn series to check and impute? as N is 5 and 3 for ID1 and ID2 respectively?
If that is the case the following should do? nevertheless ,I am wary when somebody brilliant like Ksharp didn't offer this,that makes me think I am the one whose understanding is wrong. Can you clarify plz?
data want;
set have;
by id;
array t(*) x:;
do _n_=1 to n;
if missing(t(_n_)) then t(_n_)=0;
end;
run;
Yes you are right.
your code works great and fast.. Thank you
@jkim197 That made me chuckle. Like @hashman pointed out, the description of the question explaining the contents of the dataset matters most in the first place for which a robust solution is required. I am glad, my peek into your dataset and the guesses or in other words intuition turned out to be correct.
Last but not the least, irony in lack of clarity gave us a gift to receive the privilege of yet another beautiful APP class of PD that I copied to my notes 🙂
First, please do not include irrelevant variables (such as Permno and N, in this case) in your sample input/output: They muddy up the picture and distract from the task at hand. Your sample input had better look like this:
data have ;      
  input id x1-x5 
  cards ;        
1   4  3  2  2  1
1   .  1  1  3  2
1   9  .  3  9  6
1   8  .  7  3  2
1   8  2  .  6   
1   .  9  6  3  9
2   . 13  8  .  .
2  23  5  8  .  .
2  11  .  5  .  .
2  13  7  9  .  .
;                
run ;            Second, you want an efficient solution. That means, to begin with, that you want to minimize the number of passes through the input data. In this case, you need at minimum 2 passes since you cannot do what you need to any BY group without pre-processing it beforehand. But any task involving the logic (1) read a BY group to discover something and (2) read it again to process it based on the discovery is a typical work for what is known as the double DoW-loop, whose extra advantage lies in processing the whole shebang in a single step.
Hence, below: (1) every BY group is read first to determine the maximum number of non-missing values in any row _NX, then (2) this very BY group is scanned again to set the nulls to zeros in each row only for the first _NX leftmost X-variables.
data want (drop = _:) ; 
  do _ng = 1 by 1 until (last.id) ; 
    set have ; 
    by ID ; 
    _nx = _nx <> N (of x:) ; 
  end ; 
  do _ng = 1 to _ng ; 
    set have ; 
    array x [*] x: ; 
    do _i = 1 to _nx ; 
      if nmiss (x[_i]) then x[_i] = 0 ;
    end ; 
    output ; 
  end ; 
run ; 
                            Even if you have 105 X-variables, scanning through the array is a pretty quick affair. However, if you want to avoid that in favor of using a potentially faster string search/replace function, you can do as below. Note that it is not to suggest that you code this way: APP tricks like this are reserved to extreme circumstances where performance gains are more than well worth the extra complexity; I just want to indicate that the possibility is there (plus perhaps it may pique someone's curiosity to learn more about the APP functions).
data want (drop = _:) ;                                             
  do _ng = 1 by 1 until (last.id) ;                                 
    set have ;                                                      
    by ID ;                                                         
    _nx = _nx <> N (of x:) ;                                        
  end ;                                                             
  retain _a _m _z ;                                                 
  if _n_ = 1 then do ;                                              
    _a  = addrlong (x1) ;                                           
    _m  = put (. , rb8.) ;                                          
    _z  = put (0 , rb8.) ;                                          
  end ;                                                             
  _nx = 8 * _nx ;                                                   
  do _ng = 1 to _ng ;                                               
    set have ;                                                      
    call pokelong (tranwrd (peekclong (_a, _nx), _m, _z), _a, _nx) ;
    output ;                                                        
  end ;                                                             
run ;                                                               FWIW
Paul D.
I propose this solution:
data have;
input id permno N x1 x2 x3 x4 x5;
cards;
1 1 5 4 3 2 2 1
1 2 5 . 1 1 3 2
1 3 5 9 . 3 9 6
1 4 5 8 . 7 3 2
1 5 5 8 2 . 6
1 6 5 . 9 6 . 9
2 1 3 . 13 8 . .
2 2 3 23 5 8 . .
2 3 3 11 . 5 . .
2 4 3 13 7 9 . 10
;
run;
data want;
set have;
array X(*) x:;
do i=1 to dim(x);
   if x(i)=. then do;
   rc=i;
   do j=rc to dim(x);
   if x(j) ne . then do;
   x(i)=0;
   leave;
 
   end;  
  end;
 end;
end;
run; It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
