BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
paolamejia
Calcite | Level 5

Hi,

 

I'm new to SAS and this is my first post.

I have a small dataset with body measurements. I have three different measurements for the same variable, and I want to average only two of these values, the closest ones. How do I do this? 

My data looks something like this:

 

obs weight1 weight2 weight3

1     54 .         54.7        53

2      48 .         47.5 .       .  

3     67.4 .       68 .         69

4 .    48 .          48           .

 

I'm using version 9.4

 

I thought I could write a code to compare the sums of weight 1+2, 2+3 and 1+3, and average the smallest quantity, but I don't know how to ask for an absolute value. Thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Like this?

data WANT; 
  set HAVE;
  DIFF12=abs(WEIGHT1-WEIGHT2);
  DIFF13=abs(WEIGHT1-WEIGHT3);
  DIFF23=abs(WEIGHT2-WEIGHT3);
  if DIFF12=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT2);
  if DIFF13=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT3);
  if DIFF23=min(of DIFF:) then AVG=mean(WEIGHT2,WEIGHT3);
run;

 

View solution in original post

21 REPLIES 21
novinosrin
Tourmaline | Level 20

Not able to understand your desired output for the sample. Can you post an output sample for the input sample explaining your requirement plz

paolamejia
Calcite | Level 5

Thanks for replying!

I need to create a new variable that is the average of two of the values, the two closest ones. So where there are only two measurements, the new variable is the average of those two. But if there are three measurements (the third is taken when the discrepancy between the first two is larger than one unit), I want the average to be of only two values, the two closest values.

 

paolamejia
Calcite | Level 5

Trying to clarify my question: 

I have these data

 

obs weight1 weight2 weight3

1       53          54.7        53

2       48          47.5            

3     67.4          68.5         69

4      48            48           .

 

The output I want is:

 

obs  weight1   weight2   weight3  finalweight

1        53              54.7        53          53

2        48               47.5                     47.8

3        67.4            68.5        69         68.8

4        48              48                         48

 

Where weight3 is a missing value the new variable is the mean of weight1 and weight2. But when there are three different weights, I only want to use two of them to calculate the mean. Maybe calculating the absolute value of every possible sum of pairs, and then asking SAS to compute the mean of the closest values?

novinosrin
Tourmaline | Level 20
data have;
input obs weight1 weight2 weight3 ;
cards;
1     54          54.7        53
2      48          47.5       .  
3     67.4        68          69
4    48          48           .
;

data w;
set have;
n=_n_;
array  t(*) weight:;
k=n(of t(*));
if k=3 then 
do i=1 to 2;
do j=i+1 to 3;
k1=t(i);
k2=t(j);
dif=abs(t(j)-t(i));
output;
end;
end;
else do;avg=mean(weight1,weight2,weight3); output; end;
run;

proc sort data=w out=_w ;
by n  dif;
run;

data want;
set _w;
by n;
if first.n;
if first.n and not last.n then 
avg=mean(k1,k2);
keep n weight: avg;
run;
paolamejia
Calcite | Level 5

Wow, thanks so much! It seems to work great for the observations that have three weight measurements, but what happens with the observations that only have two measurements? They don't seem to come out in the final output, data want, which has only two observations.

 

I'm going to try it with my own small dataset

ChrisNZ
Tourmaline | Level 20

Like this?

data WANT; 
  set HAVE;
  DIFF12=abs(WEIGHT1-WEIGHT2);
  DIFF13=abs(WEIGHT1-WEIGHT3);
  DIFF23=abs(WEIGHT2-WEIGHT3);
  if DIFF12=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT2);
  if DIFF13=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT3);
  if DIFF23=min(of DIFF:) then AVG=mean(WEIGHT2,WEIGHT3);
run;

 

paolamejia
Calcite | Level 5

Yes! Works like a charm. Thank you so much!

hashman
Ammonite | Level 13

@ChrisNZ:

 

Tis' nice, especially for handling the ignoring of the nulls.

And terse ... for the number of weights N=3, which certainly fits the scope of the OP's question.

Now, how would you apply/extend this code to N=10? N=100?

 

Best

Paul D.

ChrisNZ
Tourmaline | Level 20

Haha! Golf challenge?

Can't test atm, but something along these lines I suppose (pseudo code made by modifying your code):

data WANT (keep = W: MEAN ) ;
  set HAVE;                         
  array W[*] W: ; 
  call symputx('____dim',dim(W));
  array Z[&____dim.] _temporary_;
  do I = 1 to &____dim.  ;   
    Z[I]=W[I]
  end;
  call sortn(of Z[*]);                  
  do I = 1 to &____dim. - 1 ;          
    DIFF = abs (Z[I] - Z[I+1]) ;    
    if . < DIFF < MINDIFF then do ;
      MINDIFF = DIFF ;                 
      MEAN=mean(Z[I], Z[I+1]) ;                       
    end;                            
  end;                              
run ;  

Would this work?

 

hashman
Ammonite | Level 13

Ha-ha! Why not. I like your idea, and it should work if further developed. SYMPUTX won't do what you want because it's a run-time instruction, while arrays are sized at compile time, so the value produced by SYMPUTX cannot size an array in the same DATA step. But methinks it will sure work in the context of the following if you adjust the code somewhat to handle the generation of missing values on the line DIFF=.

data have ;                                
  call streaminit (7) ;                    
  array w weight1-weight5 ;                
  do _n_ = 1 to 5 ;                        
    do over w ;                            
      w = rand ("integer", 23) ;           
      if rand ("uniform") < .3 then w = . ;
    end ;                                  
    output ;                               
  end ;                                    
run ;      
                                
%let dim = 5 ;                  
data WANT (keep = W: MEAN ) ;      
  set HAVE;                        
  array W[*] W: ;                  
  array Z[&dim] _temporary_;       
  do I = 1 to &dim  ;              
    Z[I]=W[I] ;                    
  end;                             
  call sortn(of Z[*]);             
  do I = 1 to &dim - 1 ;           
    DIFF = abs (Z[I] - Z[I+1]) ;   
    if . < DIFF < MINDIFF then do ;
      MINDIFF = DIFF ;             
      MEAN=mean(Z[I], Z[I+1]) ;    
    end;                           
  end;                             
run ;  

Have fun!

 

Best

Paul D.

 

   

ChrisNZ
Tourmaline | Level 20

Yes, I thought about using symput being a silly idea right after I typed it.

This is a bit better than my previous version (still untested) if one wants to only keep a single step.

Why adjust the code somewhat to handle the generation of missing values on the line DIFF=. ?

data WANT (keep = W: MEAN ) ;
  set HAVE;                         
  array W[*] W: ; 
  array Z[9999] _temporary_;
  do I = 1 to dim(W) ;   
    Z[I]=W[I]
  end;
  call sortn(of Z[*]);                  
  do I = 9999 to 1 by -1 while (^missing(Z[I-1])) ;          
    DIFF = abs (Z[I] - Z[I-1]) ;    
    if DIFF < MINDIFF then do;
      MINDIFF = DIFF ;                 
      MEAN=mean(Z[I], Z[I-1]) ;                       
    end;                            
  end;                              
run ;  

 

hashman
Ammonite | Level 13

@ChrisNZ:

 

As I've said, I like your idea - because sorting turns a O(n**2) problem (pairwise comparison) into O(n*log(n)) problem (heap sort behind the scenes). Also, the ordering makes it unnecessary to call the ABS function since you always subtract something smaller or equal. Note, however, that you're not scaling the code you first posted but present an altogether different approach instead. Your new untested code won't work just yet ; but it will with a couple of adjustments:

 

1. Since it is a minimum that is being calculated, the initial value should be set to something bigger than any possible MINDIFF. In your code, it is auto-set to missing, which is less than anything but _. and which is why MEAN comes out missing. 

2. Since temp array elements are retained, it should be cleaned up for each new record, otherwise the leftovers from the previous records mess up the values relied upon during the calculation.

 

Thus, working (and tested) code:

data WANT (keep = W: MEAN ) ;                           
  set HAVE ;                                            
  array W [*] W: ;                                      
  array Z [9999] _temporary_;                           
  do I = 1 to dim(W) ;                                  
    Z[I] = W[I] ;                                       
  end;                                                  
  call sortn (of Z[*]) ;                                
  MINDIFF = Z [dim(Z)] ;                                
  do I = dim (Z) to 1 by -1 while (^ missing (Z[I-1])) ;
    DIFF = Z[I] - Z[I-1] ; * Note: no ABS ;                              
    if DIFF < MINDIFF then do ;                          
      MINDIFF = DIFF ;                                  
      MEAN = mean (Z[I], Z[I-1]) ;                      
    end ;                                               
  end ;                                                 
  call missing (of Z{*]) ;                              
run ;                                                   

Of course, performance-wise, sizing the temp array as "big enough" has its impact on sorting. That can be alleviated by using an ordered hash table instead of the temp array, as the former grows and shrinks dynamically. If I find time, I'll post this kind of variant on the theme. 

 

Best

Paul D.

 

p.s. I said "adjust the code somewhat to handle the generation of missing values" because in my testing, this line generated missing values. Above, it no longer does.  

ChrisNZ
Tourmaline | Level 20

@hashman Good points.

 

1. In order to get a MEAN even if there is a single value we need to alter the loop slightly so it iterates at least once.

Taking care of this can also take care of the missing initial value for MINDIFF.

 

2. I thought I didn't need to reset to zero as I refill the array with every observation. But of course and all values go to the end when I sort it, so this was wrong. We just need to fill the back of the array to avoid that then.

 

So something this may work better (untested):

 

 

data WANT (keep = W: MEAN ) ;                           
  set HAVE ;                                            
  array W [*] W: ;                                      
  array Z [9999] _temporary_;                           
  do I = 1 to dim(W) ;                                  
    Z[10000-I] = W[I] ;                                       
  end;                                                  
  call sortn (of Z[*]) ;                                
  do I = 9999 to 2 by -1 until (missing (Z[I-1])) ;
DIFF = Z[I] - Z[I-1] ; if . < DIFF < MINDIFF or MEAN = . then do ; MINDIFF = DIFF ; MEAN = mean (Z[I], Z[I-1]) ; end ; end ; run ;

 

Yes hash tables require a lot more code, Furthermore:

3. You needn't have multidata in the hash table.

4. I wouldn't initialise

_dm = max (of w:) ;

because the logic then fails if the values are negative.

 

hashman
Ammonite | Level 13

@ChrisNZ

 

1. Yes.

 

2. Takes care of business. Missing values will still be generated at the DIFF= line, but it's just a matter of log notes. Plus, WHILE (^) instead of UNTIL () can fix that.  

 

3. Yes, I do - if we want the same assumptions. To wit, your code treats ties as "the closest" and computes the mean accordingly. Mine, too, except that I'm getting the lowest ties (since I ascend) and you - the highest (since you descend). This can be aligned on my side if I use ORDERED:"D" (but not on yours, as SORTN always sorts ascending and you run descending for obvious reasons).  If I omit MULTIDATA:"Y", there will be no ties; and the result will change accordingly.

 

4. Agreed;  a good catch. Better to set it to constant("big"), as I did initially, to cover all bases. 

 

Best

Paul D.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 3237 views
  • 4 likes
  • 4 in conversation