Hi,
I'm new to SAS and this is my first post.
I have a small dataset with body measurements. I have three different measurements for the same variable, and I want to average only two of these values, the closest ones. How do I do this?
My data looks something like this:
obs weight1 weight2 weight3
1 54 . 54.7 53
2 48 . 47.5 . .
3 67.4 . 68 . 69
4 . 48 . 48 .
I'm using version 9.4
I thought I could write a code to compare the sums of weight 1+2, 2+3 and 1+3, and average the smallest quantity, but I don't know how to ask for an absolute value. Thanks!
Like this?
data WANT;
set HAVE;
DIFF12=abs(WEIGHT1-WEIGHT2);
DIFF13=abs(WEIGHT1-WEIGHT3);
DIFF23=abs(WEIGHT2-WEIGHT3);
if DIFF12=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT2);
if DIFF13=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT3);
if DIFF23=min(of DIFF:) then AVG=mean(WEIGHT2,WEIGHT3);
run;
Not able to understand your desired output for the sample. Can you post an output sample for the input sample explaining your requirement plz
Thanks for replying!
I need to create a new variable that is the average of two of the values, the two closest ones. So where there are only two measurements, the new variable is the average of those two. But if there are three measurements (the third is taken when the discrepancy between the first two is larger than one unit), I want the average to be of only two values, the two closest values.
Trying to clarify my question:
I have these data
obs weight1 weight2 weight3
1 53 54.7 53
2 48 47.5
3 67.4 68.5 69
4 48 48 .
The output I want is:
obs weight1 weight2 weight3 finalweight
1 53 54.7 53 53
2 48 47.5 47.8
3 67.4 68.5 69 68.8
4 48 48 48
Where weight3 is a missing value the new variable is the mean of weight1 and weight2. But when there are three different weights, I only want to use two of them to calculate the mean. Maybe calculating the absolute value of every possible sum of pairs, and then asking SAS to compute the mean of the closest values?
data have;
input obs weight1 weight2 weight3 ;
cards;
1 54 54.7 53
2 48 47.5 .
3 67.4 68 69
4 48 48 .
;
data w;
set have;
n=_n_;
array t(*) weight:;
k=n(of t(*));
if k=3 then
do i=1 to 2;
do j=i+1 to 3;
k1=t(i);
k2=t(j);
dif=abs(t(j)-t(i));
output;
end;
end;
else do;avg=mean(weight1,weight2,weight3); output; end;
run;
proc sort data=w out=_w ;
by n dif;
run;
data want;
set _w;
by n;
if first.n;
if first.n and not last.n then
avg=mean(k1,k2);
keep n weight: avg;
run;
Wow, thanks so much! It seems to work great for the observations that have three weight measurements, but what happens with the observations that only have two measurements? They don't seem to come out in the final output, data want, which has only two observations.
I'm going to try it with my own small dataset
Like this?
data WANT;
set HAVE;
DIFF12=abs(WEIGHT1-WEIGHT2);
DIFF13=abs(WEIGHT1-WEIGHT3);
DIFF23=abs(WEIGHT2-WEIGHT3);
if DIFF12=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT2);
if DIFF13=min(of DIFF:) then AVG=mean(WEIGHT1,WEIGHT3);
if DIFF23=min(of DIFF:) then AVG=mean(WEIGHT2,WEIGHT3);
run;
Yes! Works like a charm. Thank you so much!
Tis' nice, especially for handling the ignoring of the nulls.
And terse ... for the number of weights N=3, which certainly fits the scope of the OP's question.
Now, how would you apply/extend this code to N=10? N=100?
Best
Paul D.
Haha! Golf challenge?
Can't test atm, but something along these lines I suppose (pseudo code made by modifying your code):
data WANT (keep = W: MEAN ) ;
set HAVE;
array W[*] W: ;
call symputx('____dim',dim(W));
array Z[&____dim.] _temporary_;
do I = 1 to &____dim. ;
Z[I]=W[I]
end;
call sortn(of Z[*]);
do I = 1 to &____dim. - 1 ;
DIFF = abs (Z[I] - Z[I+1]) ;
if . < DIFF < MINDIFF then do ;
MINDIFF = DIFF ;
MEAN=mean(Z[I], Z[I+1]) ;
end;
end;
run ;
Would this work?
Ha-ha! Why not. I like your idea, and it should work if further developed. SYMPUTX won't do what you want because it's a run-time instruction, while arrays are sized at compile time, so the value produced by SYMPUTX cannot size an array in the same DATA step. But methinks it will sure work in the context of the following if you adjust the code somewhat to handle the generation of missing values on the line DIFF=.
data have ;
call streaminit (7) ;
array w weight1-weight5 ;
do _n_ = 1 to 5 ;
do over w ;
w = rand ("integer", 23) ;
if rand ("uniform") < .3 then w = . ;
end ;
output ;
end ;
run ;
%let dim = 5 ;
data WANT (keep = W: MEAN ) ;
set HAVE;
array W[*] W: ;
array Z[&dim] _temporary_;
do I = 1 to &dim ;
Z[I]=W[I] ;
end;
call sortn(of Z[*]);
do I = 1 to &dim - 1 ;
DIFF = abs (Z[I] - Z[I+1]) ;
if . < DIFF < MINDIFF then do ;
MINDIFF = DIFF ;
MEAN=mean(Z[I], Z[I+1]) ;
end;
end;
run ;
Have fun!
Best
Paul D.
Yes, I thought about using symput being a silly idea right after I typed it.
This is a bit better than my previous version (still untested) if one wants to only keep a single step.
Why adjust the code somewhat to handle the generation of missing values on the line DIFF=. ?
data WANT (keep = W: MEAN ) ;
set HAVE;
array W[*] W: ;
array Z[9999] _temporary_;
do I = 1 to dim(W) ;
Z[I]=W[I]
end;
call sortn(of Z[*]);
do I = 9999 to 1 by -1 while (^missing(Z[I-1])) ;
DIFF = abs (Z[I] - Z[I-1]) ;
if DIFF < MINDIFF then do;
MINDIFF = DIFF ;
MEAN=mean(Z[I], Z[I-1]) ;
end;
end;
run ;
As I've said, I like your idea - because sorting turns a O(n**2) problem (pairwise comparison) into O(n*log(n)) problem (heap sort behind the scenes). Also, the ordering makes it unnecessary to call the ABS function since you always subtract something smaller or equal. Note, however, that you're not scaling the code you first posted but present an altogether different approach instead. Your new untested code won't work just yet ; but it will with a couple of adjustments:
1. Since it is a minimum that is being calculated, the initial value should be set to something bigger than any possible MINDIFF. In your code, it is auto-set to missing, which is less than anything but _. and which is why MEAN comes out missing.
2. Since temp array elements are retained, it should be cleaned up for each new record, otherwise the leftovers from the previous records mess up the values relied upon during the calculation.
Thus, working (and tested) code:
data WANT (keep = W: MEAN ) ;
set HAVE ;
array W [*] W: ;
array Z [9999] _temporary_;
do I = 1 to dim(W) ;
Z[I] = W[I] ;
end;
call sortn (of Z[*]) ;
MINDIFF = Z [dim(Z)] ;
do I = dim (Z) to 1 by -1 while (^ missing (Z[I-1])) ;
DIFF = Z[I] - Z[I-1] ; * Note: no ABS ;
if DIFF < MINDIFF then do ;
MINDIFF = DIFF ;
MEAN = mean (Z[I], Z[I-1]) ;
end ;
end ;
call missing (of Z{*]) ;
run ;
Of course, performance-wise, sizing the temp array as "big enough" has its impact on sorting. That can be alleviated by using an ordered hash table instead of the temp array, as the former grows and shrinks dynamically. If I find time, I'll post this kind of variant on the theme.
Best
Paul D.
p.s. I said "adjust the code somewhat to handle the generation of missing values" because in my testing, this line generated missing values. Above, it no longer does.
@hashman Good points.
1. In order to get a MEAN even if there is a single value we need to alter the loop slightly so it iterates at least once.
Taking care of this can also take care of the missing initial value for MINDIFF.
2. I thought I didn't need to reset to zero as I refill the array with every observation. But of course and all values go to the end when I sort it, so this was wrong. We just need to fill the back of the array to avoid that then.
So something this may work better (untested):
data WANT (keep = W: MEAN ) ;
set HAVE ;
array W [*] W: ;
array Z [9999] _temporary_;
do I = 1 to dim(W) ;
Z[10000-I] = W[I] ;
end;
call sortn (of Z[*]) ;
do I = 9999 to 2 by -1 until (missing (Z[I-1])) ;
DIFF = Z[I] - Z[I-1] ;
if . < DIFF < MINDIFF or MEAN = . then do ;
MINDIFF = DIFF ;
MEAN = mean (Z[I], Z[I-1]) ;
end ;
end ;
run ;
Yes hash tables require a lot more code, Furthermore:
3. You needn't have multidata in the hash table.
4. I wouldn't initialise
_dm = max (of w:) ;
because the logic then fails if the values are negative.
1. Yes.
2. Takes care of business. Missing values will still be generated at the DIFF= line, but it's just a matter of log notes. Plus, WHILE (^) instead of UNTIL () can fix that.
3. Yes, I do - if we want the same assumptions. To wit, your code treats ties as "the closest" and computes the mean accordingly. Mine, too, except that I'm getting the lowest ties (since I ascend) and you - the highest (since you descend). This can be aligned on my side if I use ORDERED:"D" (but not on yours, as SORTN always sorts ascending and you run descending for obvious reasons). If I omit MULTIDATA:"Y", there will be no ties; and the result will change accordingly.
4. Agreed; a good catch. Better to set it to constant("big"), as I did initially, to cover all bases.
Best
Paul D.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.