Hi,
I have a problem and Im thankful for help.
Given is the data set "final". Today, for example, it contains 3 variables, but it could also have 5 or 10 variables the next time it is created. The names of the variables are always freely selectable and have no logic. But they all have one thing in common: The first variable is always "Eink" with the same characteristics and the goal is calculating a median (based on the groups in "Eink").
For the current dataset I have written a syntax for calculating the median. I attached the current syntax below. Based on my current dataset, Bg and He were used there. My problem is: If the record "final" appears tomorrow with other variables, the syntax should adapt individually to it. If tomorrow "Bg" and "He" no longer exist, but for example XX, XY and XZ, then the syntax should automatically be applied to these three variables with their specific names. So I probably need a loop, but am unable to write it. Could anyone help me with this big concern? Would be great!
dataset "final" given
Eink | Bg | He |
01 | 55.487739 | 2.722371 |
02 | 127.485645 | 18.670678 |
03 | 182.95309 | 53.529988 |
04 | 354.69612 | 122.703093 |
05 | 450.777081 | 137.596942 |
06 | 461.617933 | 157.457164 |
07 | 429.38197 | 192.283645 |
08 | 370.644235 | 173.688658 |
09 | 281.256479 | 180.848677 |
10 | 170.426973 | 137.045263 |
11 | 118.22022 | 128.45571 |
12 | 93.656461 | 104.596442 |
13 | 68.561622 | 112.710123 |
14 | 42.443895 | 77.742469 |
15 | 50.933162 | 136.718895 |
16 | 35.285987 | 104.219131 |
17 | 23.123863 | 78.424041 |
18 | 16.724659 | 89.48717 |
19 | 9.271308 | 46.396113 |
20 | 4.870804 | 21.574738 |
21 | 4.83902 | 19.753305 |
22 | 3.487616 | 9.23105 |
23 | 0.860551 | 2.136266 |
24 | 1.202043 | 1.861731 |
code
/*1*/
Data Eink_2;
set final;
if Eink = "01" then do; Xmin = 0 ; Xmax = 250 ; end;
if Eink = "02" then do; Xmin = 250 ; Xmax = 500 ; end;
if Eink = "03" then do; Xmin = 500 ; Xmax = 750 ; end;
if Eink = "04" then do; Xmin = 750 ; Xmax = 1000 ; end;
if Eink = "05" then do; Xmin = 1000 ; Xmax = 1250 ; end;
if Eink = "06" then do; Xmin = 1250 ; Xmax = 1500 ; end;
if Eink = "07" then do; Xmin = 1500 ; Xmax = 1750 ; end;
if Eink = "08" then do; Xmin = 1750 ; Xmax = 2000 ; end;
if Eink = "09" then do; Xmin = 2000 ; Xmax = 2250 ; end;
if Eink = "10" then do; Xmin = 2250 ; Xmax = 2500 ; end;
if Eink = "11" then do; Xmin = 2500 ; Xmax = 2750 ; end;
if Eink = "12" then do; Xmin = 2750 ; Xmax = 3000 ; end;
if Eink = "13" then do; Xmin = 3000 ; Xmax = 3250 ; end;
if Eink = "14" then do; Xmin = 3250 ; Xmax = 3500 ; end;
if Eink = "15" then do; Xmin = 3500 ; Xmax = 4000 ; end;
if Eink = "16" then do; Xmin = 4000 ; Xmax = 4500 ; end;
if Eink = "17" then do; Xmin = 4500 ; Xmax = 5000 ; end;
if Eink = "18" then do; Xmin = 5000 ; Xmax = 6000 ; end;
if Eink = "19" then do; Xmin = 6000 ; Xmax = 7000 ; end;
if Eink = "20" then do; Xmin = 7000 ; Xmax = 8000 ; end;
if Eink = "21" then do; Xmin = 8000 ; Xmax = 10000 ; end;
if Eink = "22" then do; Xmin = 10000 ; Xmax = 15000 ; end;
if Eink = "23" then do; Xmin = 15000 ; Xmax = 25000 ; end;
if Eink = "24" then do; Xmin = 25000 ; Xmax = . ; end;
run;
/*2*/
Data Eink_3;
set Eink_2;
cum_Bg+Bg;
cum_He+He;
run;
/*3*/
proc sql noprint;
select sum(Bg)/2 into: n2_B from Eink_3;
select sum(He)/2 into: n2_H from Eink_3;
run;
/*4*/
data Eink_4;
set Eink_3;
Fx_prä_B = &n2_B. - cum_Bg;
Fx_prä_H = &n2_H. - cum_He;
run;
/*5*/
proc rank data=Eink_4 out=fx_1_Rang_B; var Fx_prä_B; ranks rang ; where Fx_prä_B >0;
proc rank data=Eink_4 out=fx_1_Rang_H; var Fx_prä_H; ranks rang ; where Fx_prä_H >0;
/*6*/
proc sql noprint; select put(cum_Bg, decimal.7) into:Fx_1_B from fx_1_Rang_B where rang =1; quit;
proc sql noprint; select put(cum_He, decimal.7) into:Fx_1_H from fx_1_Rang_H where rang =1; quit;
/*7*/
proc rank data=Eink_4 descending out=fx_B; var Fx_prä_B; ranks Rang; where Fx_prä_B <0;
proc rank data=Eink_4 descending out=fx_H; var Fx_prä_H; ranks Rang; where Fx_prä_H <0;
/*8*/
proc sql noprint;
select put(Bg, decimal.7) into:fx_B from fx_B where rang=1;
select put(He, decimal.7) into:fx_H from fx_H where rang=1;
run;
/*9*/
proc sql noprint;
select Xmin into: Xmin_B from fx_B where Rang = 1;
select Xmin into: Xmin_H from fx_H where Rang = 1;
quit;
/*10*/
proc sql noprint;
select Xmax-Xmin into: b_B from fx_B where Rang = 1;
select Xmax-Xmin into: b_H from fx_H where Rang = 1;
quit;
/*11 - into the formula */
Data Median_Final;
retain Eink Bg He;
Format
Bg 15.12
He 15.12;
Eink = "Median";
Bg = &Xmin_B. + &b_B.*((&n2_B.-&Fx_1_B.)/&fx_B.);
He = &Xmin_H. + &b_H.*((&n2_H.-&Fx_1_H.)/&fx_H.);
run;
Transpose to long, so you can use the _NAME_ variable as additional CLASS in PROC MEANS or ACROSS in PROC REPORT. If eink is the only character variable, you can use _NUMERIC_ in the VAR statement of TRANSPOSE.
Hmm, wouldnt be the best idea, because I have a total of 11 steps, which each create their own temporary datasets. As an example, I'll supplement my code snippet from above with the subsequent step of my syntax. So they always refer to each other.
I would like to apply these 11 steps as a loop. The code snippet was just one of these 11 on which the macro could be exemplified. I'll try to include the rest then.
Yes, please show us, because the suggestion from @Kurt_Bremser ought to work on the 11 different steps.
I have fleshed out my initial post above and added my syntax that works based on the current data set (but I have reformulated/anonymized the syntax a bit). I hope this doesn't make too much work. Thanks!
It looks like you are computing some sort of a median, using sums, max, min for each "half" of the data, and then plugging these into a formula. I would think there are better/faster ways to do this that don't involve macros or macro variables, but I don't fully understand the process or the end result. Could you please explain what you are trying to do?
The goal is to calculate a weigthed median, based on a formular from the institution Im working for. With the 11 steps I calculate each component of the formular:
- Xmin_B and Xmin_H
- b_B and b_H
- n2_B and n2_H
- Fx_1_B and Fx_1_H
- fx_B and fx_H
At the end, I put these components into the formula and calculate the median (step 11). That works pretty well. But with new variables I always have to manually adjust the syntax. And my hope is, that there is a solution like a loop...
With such a complex process which can't use the available SAS procedures, you have to feed a list of variables to a macro and use %DO loops to create the repeating code wherever needed.
You can create multiple macro variables in a single SELECT INTO, so use that to speed up your process.
Similarly, avoid to run multiple steps for each variable; in most cases, you can process multiple variables in a single step.
@Konkordanz wrote:
Hmm, wouldnt be the best idea, because I have a total of 11 steps, which each create their own temporary datasets. As an example, I'll supplement my code snippet from above with the subsequent step of my syntax. So they always refer to each other.
I would like to apply these 11 steps as a loop. The code snippet was just one of these 11 on which the macro could be exemplified. I'll try to include the rest then.
If you have a data set with a grouping variable and want to create summary statistics for numeric variables then proc means/summary is likely where you belong. Splitting data into separate data sets just to get a median of a value is just "wrong" in a SAS approach.
Here is a small example with a data set you should have available to examine.
Proc summary data=sashelp.class nway; class sex; var _numeric_; output out=want (drop=_type_ _freq_) median = ; run;
The option NWAY is to just show the result for the highest _type_ value of the result. _Type_ indicates which combination(s) of Class variables are involved in an output observation. The default output would also include an overall summary for all observations and if more variables are involved on the class statement then there multiple combinations possible.
Class indicates your grouping variable.
Var is list of numeric variables to use. In this case _numeric_ uses all of the numeric variables in the data set (there are only 3).
The output statement names the output dataset and requests which statistic(s) to create. The drop removes two automatic variables that would appear by default. The Median= requests the median statistic. With a single statistic that means the output statistic has the same name as the input variable. You can either name the resulting statistic your self or use /autoname option to create names with the statistic appended.
Looks to me like your first 4 steps are just these two steps. Plus I generated SIGN variables that can probably be used to reduce the PROC RANK steps to one instead of two.
data final ;
input Eink :$2. Bg He ;
cards;
01 55.487739 2.722371
02 127.485645 18.670678
03 182.95309 53.529988
04 354.69612 122.703093
05 450.777081 137.596942
06 461.617933 157.457164
07 429.38197 192.283645
08 370.644235 173.688658
09 281.256479 180.848677
10 170.426973 137.045263
11 118.22022 128.45571
12 93.656461 104.596442
13 68.561622 112.710123
14 42.443895 77.742469
15 50.933162 136.718895
16 35.285987 104.219131
17 23.123863 78.424041
18 16.724659 89.48717
19 9.271308 46.396113
20 4.870804 21.574738
21 4.83902 19.753305
22 3.487616 9.23105
23 0.860551 2.136266
24 1.202043 1.861731
;
data min_max ;
length Eink $2 Xmin Xmax 8;
do xmax = 250 to 3500 by 250,4000,4500,5000,6000,7000,8000,10000,15000,25000,. ;
i+1;
eink=put(i,z2.);
xmin=sum(0,lag(xmax));
output;
end;
drop i;
run;
proc summary data=FINAL ;
var Bg He ;
output out=sums(drop=_freq_ _type_) sum=sum_Bg sum_He ;
run;
data STEP4;
merge final min_max ;
by eink;
if _n_=1 then set sums;
cum_Bg + Bg;
Fx_Bg = sum_bg/2 - cum_Bg;
Sign_Bg = sign(fx_bg);
cum_He + He;
Fx_He = sum_he/2 - cum_He;
Sign_He = sign(fx_he);
run;
So if Bg and He represent the variables you need to process then you might use macro code to generate those blocks that use them.
...
sum=sum_Bg sum_He ;
...
cum_Bg + Bg;
Fx_Bg = sum_bg/2 - cum_Bg;
Sign_Bg = sign(fx_bg);
cum_He + He;
Fx_He = sum_he/2 - cum_He;
Sign_He = sign(fx_he);
...
But you could also transpose the data first. Then no need to generate code (other than the list of variables in the VAR statement of the TRANSPOSE step).
proc transpose data=FINAL out=TALL(rename=(col1=VALUE)) name=VARNAME ;
by Eink;
var Bg He ;
run;
proc sort data=TALL;
by varname eink;
run;
proc summary data=TALL nway;
by varname ;
var value;
output out=sums(drop=_freq_ _type_) sum=sum_value ;
run;
data step4_tall;
merge tall sums ;
by varname ;
if _n_=1 then do;
set min_max;
declare hash h(dataset:'min_max');
h.definekey('eink');
h.definedata('Xmin','Xmax');
h.definedone();
end;
if first.varnum then cum_value=.;
h.find();
cum_value + value ;
fx_value = sum_value/2 - cum_value ;
Sign_value = sign(fx_value);
run;
What is STEP5 to STEP11 doing?
Why is it using RANK instead of just taking the MIN and MAX?
proc summary data=step4_tall nway ;
by varname sum_value ;
class sign_value ;
var cum_value fx_value ;
output out=stats min= max= / autoname ;
run;
proc print;
run;
cum_ cum_ sum_ Sign_ value_ fx_value_ value_ fx_value_ Obs VARNAME value value _TYPE_ _FREQ_ Min Min Max Max 1 Bg 3358.21 -1 1 18 2062.40 -1679.10 3358.21 -383.30 2 Bg 3358.21 1 1 6 55.49 46.09 1633.02 1623.62 3 He 2109.85 -1 1 15 1176.55 -1054.93 2109.85 -121.62 4 He 2109.85 1 1 9 2.72 15.43 1039.50 1052.20
For your example data what values did the macro variables used in that last step end up having?
Bg = &Xmin_B. + &b_B.*((&n2_B.-&Fx_1_B.)/&fx_B.);
He = &Xmin_H. + &b_H.*((&n2_H.-&Fx_1_H.)/&fx_H.);
If you don't know re-run it with the SYMBOLGEN option turned on and the SAS log will show you.
Are they basically just the values from these rows where the value on the FX variable flips from positive to negative?
sum_ cum_ Sign_ Obs Eink VARNAME VALUE value Xmin Xmax value fx_value value 6 06 Bg 461.618 3358.21 1250 1500 1633.02 46.087 1 7 07 Bg 429.382 3358.21 1500 1750 2062.40 -383.295 -1 33 09 He 180.849 2109.85 2000 2250 1039.50 15.426 1 34 10 He 137.045 2109.85 2250 2500 1176.55 -121.620 -1
At this point I think I am going call "Foul" on use of the word "median" at all to describe this process.
When I use that example data and run the code the result I get is
Obs Eink Bg He 1 Median 1526.8329804346 2278.1399438082
Every definition I have run into for median relates to "being in the middle" of values with some slop as to exactly how that middle is defined. Or perhaps "Formally, a median of a population is any value such that at least half of the population is less than or equal to the proposed median and at least half is greater than or equal to the proposed median."
Since the largest value in your Final example data for Bg is 461.6179330 I see no way that a value of 1526 or greater should be considered any sort of "median" for Bg. Similar with your He variable as the max in the set is 192.283645.
This may be a statistic of some sort but a median it ain't.
So that last formula appears to interpolating between XMIN and XMAX based on the something that looks like perhaps the range at the point where the cum sum crosses the 50% point of the total sum.
Bg = &Xmin_B. + &b_B.*((&n2_B.-&Fx_1_B.)/&fx_B.);
He = &Xmin_H. + &b_H.*((&n2_H.-&Fx_1_H.)/&fx_H.);
Since you appear to have gotten it to run what are the values of those 5 macro variables for each analysis variable that ended up being used to make your result?
Since they appear from the convoluted steps 5 to 10 be just pulling values from the dataset we should be able to look at the numbers and figure out which values the formula is using (much easier than figuring out what that code is trying to pull from looking at it).
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.