BookmarkSubscribeRSS Feed
Konkordanz
Pyrite | Level 9

Hi,

 

I have a problem and Im thankful for help.

Given is the data set "final". Today, for example, it contains 3 variables, but it could also have 5 or 10 variables the next time it is created. The names of the variables are always freely selectable and have no logic. But they all have one thing in common: The first variable is always "Eink" with the same characteristics and the goal is calculating a median (based on the groups in "Eink").

 

For the current dataset I have written a syntax for calculating the median. I attached the current syntax below. Based on my current dataset, Bg and He were used there.  My problem is: If the record "final" appears tomorrow with other variables, the syntax should adapt individually to it. If tomorrow "Bg" and "He" no longer exist, but for example XX, XY and XZ, then the syntax should automatically be applied to these three variables with their specific names. So I probably need a loop, but am unable to write it. Could anyone help me with this big concern? Would be great!

 

dataset "final" given

Eink Bg He
01 55.487739 2.722371
02 127.485645 18.670678
03 182.95309 53.529988
04 354.69612 122.703093
05 450.777081 137.596942
06 461.617933 157.457164
07 429.38197 192.283645
08 370.644235 173.688658
09 281.256479 180.848677
10 170.426973 137.045263
11 118.22022 128.45571
12 93.656461 104.596442
13 68.561622 112.710123
14 42.443895 77.742469
15 50.933162 136.718895
16 35.285987 104.219131
17 23.123863 78.424041
18 16.724659 89.48717
19 9.271308 46.396113
20 4.870804 21.574738
21 4.83902 19.753305
22 3.487616 9.23105
23 0.860551 2.136266
24 1.202043 1.861731

 

code


/*1*/

Data Eink_2;
set final;
if Eink = "01" then do; 	Xmin = 	0		; Xmax = 	250		; end;
if Eink = "02" then do; 	Xmin = 	250		; Xmax = 	500		; end;
if Eink = "03" then do; 	Xmin = 	500		; Xmax = 	750		; end;
if Eink = "04" then do; 	Xmin = 	750		; Xmax = 	1000	; end;
if Eink = "05" then do; 	Xmin = 	1000	; Xmax = 	1250	; end;
if Eink = "06" then do; 	Xmin = 	1250	; Xmax = 	1500	; end;
if Eink = "07" then do; 	Xmin = 	1500	; Xmax = 	1750	; end;
if Eink = "08" then do; 	Xmin = 	1750	; Xmax = 	2000	; end;
if Eink = "09" then do; 	Xmin = 	2000	; Xmax = 	2250	; end;
if Eink = "10" then do; 	Xmin = 	2250	; Xmax = 	2500	; end;
if Eink = "11" then do; 	Xmin = 	2500	; Xmax = 	2750	; end;
if Eink = "12" then do; 	Xmin = 	2750	; Xmax = 	3000	; end;
if Eink = "13" then do; 	Xmin = 	3000	; Xmax = 	3250	; end;
if Eink = "14" then do; 	Xmin = 	3250	; Xmax = 	3500	; end;
if Eink = "15" then do; 	Xmin = 	3500	; Xmax = 	4000	; end;
if Eink = "16" then do; 	Xmin = 	4000	; Xmax = 	4500	; end;
if Eink = "17" then do; 	Xmin = 	4500	; Xmax = 	5000	; end;
if Eink = "18" then do; 	Xmin = 	5000	; Xmax = 	6000	; end;
if Eink = "19" then do; 	Xmin = 	6000	; Xmax = 	7000	; end;
if Eink = "20" then do; 	Xmin = 	7000	; Xmax = 	8000	; end;
if Eink = "21" then do; 	Xmin = 	8000	; Xmax = 	10000	; end;
if Eink = "22" then do; 	Xmin = 	10000	; Xmax = 	15000	; end;
if Eink = "23" then do; 	Xmin = 	15000	; Xmax = 	25000	; end;
if Eink = "24" then do; 	Xmin = 	25000	; Xmax = 	.		; end;
run;

/*2*/

Data Eink_3;
set Eink_2;
cum_Bg+Bg;
cum_He+He;
run;


/*3*/

proc sql noprint;
select sum(Bg)/2 		into: n2_B	from Eink_3;
select sum(He)/2 		into: n2_H	from Eink_3;
run;



/*4*/

data Eink_4;
set Eink_3;
Fx_prä_B = 	&n2_B. - cum_Bg;
Fx_prä_H = 	&n2_H. - cum_He;
run;


/*5*/

	proc rank data=Eink_4 out=fx_1_Rang_B; 	var Fx_prä_B; 	ranks rang ; where Fx_prä_B >0;
	proc rank data=Eink_4 out=fx_1_Rang_H; 	var Fx_prä_H; 	ranks rang ; where Fx_prä_H >0;


/*6*/

	proc sql noprint; select put(cum_Bg, decimal.7) 	into:Fx_1_B from fx_1_Rang_B 	where rang =1; quit;
	proc sql noprint; select put(cum_He, decimal.7) 	into:Fx_1_H from fx_1_Rang_H 	where rang =1; quit;



/*7*/

	proc rank data=Eink_4 descending out=fx_B; 	var Fx_prä_B; 	ranks Rang; where Fx_prä_B <0;
	proc rank data=Eink_4 descending out=fx_H; 	var Fx_prä_H; 	ranks Rang; where Fx_prä_H <0;


/*8*/

	proc sql noprint;
	select put(Bg, decimal.7) 	into:fx_B 	from fx_B where rang=1;
	select put(He, decimal.7) 	into:fx_H 	from fx_H where rang=1;
	run;


/*9*/

proc sql noprint;
select Xmin into: Xmin_B from fx_B where Rang = 1;
select Xmin into: Xmin_H from fx_H where Rang = 1;
quit;


/*10*/

proc sql noprint;
select Xmax-Xmin into: b_B 	from fx_B 	where Rang = 1;
select Xmax-Xmin into: b_H 	from fx_H 	where Rang = 1;
quit;



/*11 - into the formula */

Data Median_Final;
retain Eink Bg He;
Format 
Bg 15.12
He 15.12;
Eink 	= "Median";
Bg =	&Xmin_B.  	+ &b_B.*((&n2_B.-&Fx_1_B.)/&fx_B.);
He 	=	&Xmin_H.  	+ &b_H.*((&n2_H.-&Fx_1_H.)/&fx_H.);
run;

 

17 REPLIES 17
Kurt_Bremser
Super User

Transpose to long, so you can use the _NAME_ variable as additional CLASS in PROC MEANS or ACROSS in PROC REPORT. If eink is the only character variable, you can use _NUMERIC_ in the VAR statement of TRANSPOSE.

Konkordanz
Pyrite | Level 9

Hmm, wouldnt be the best idea, because I have a total of 11 steps, which each create their own temporary datasets. As an example, I'll supplement my code snippet from above with the subsequent step of my syntax. So they always refer to each other.

I would like to apply these 11 steps as a loop. The code snippet was just one of these 11 on which the macro could be exemplified. I'll try to include the rest then.

PaigeMiller
Diamond | Level 26

Yes, please show us, because the suggestion from @Kurt_Bremser ought to work on the 11 different steps.

--
Paige Miller
Konkordanz
Pyrite | Level 9

I have fleshed out my initial post above and added my syntax that works based on the current data set (but I have reformulated/anonymized the syntax a bit). I hope this doesn't make too much work. Thanks!

PaigeMiller
Diamond | Level 26

It looks like you are computing some sort of a median, using sums, max, min for each "half" of the data, and then plugging these into a formula. I would think there are better/faster ways to do this that don't involve macros or macro variables, but I don't fully understand the process or the end result. Could you please explain what you are trying to do?

--
Paige Miller
Konkordanz
Pyrite | Level 9

The goal is to calculate a weigthed median, based on a formular from the institution Im working for.  With the 11 steps I calculate each component of the formular:

- Xmin_B and Xmin_H

- b_B and b_H

- n2_B and n2_H

- Fx_1_B and Fx_1_H

- fx_B and fx_H

 

At the end, I put these components into the formula and calculate the median (step 11). That works pretty well. But with new variables I always have to manually adjust the syntax. And my hope is, that there is a solution like a loop...

Kurt_Bremser
Super User

With such a complex process which can't use the available SAS procedures, you have to feed a list of variables to a macro and use %DO loops to create the repeating code wherever needed.

You can create multiple macro variables in a single SELECT INTO, so use that to speed up your process.

Similarly, avoid to run multiple steps for each variable; in most cases, you can process multiple variables in a single step.

ballardw
Super User

@Konkordanz wrote:

Hmm, wouldnt be the best idea, because I have a total of 11 steps, which each create their own temporary datasets. As an example, I'll supplement my code snippet from above with the subsequent step of my syntax. So they always refer to each other.

I would like to apply these 11 steps as a loop. The code snippet was just one of these 11 on which the macro could be exemplified. I'll try to include the rest then.


If you have a data set with a grouping variable and want to create summary statistics for numeric variables then proc means/summary is likely where you belong. Splitting data into separate data sets just to get a median of a value is just "wrong" in a SAS approach.

 

Here is a small example with a data set you should have available to examine.

Proc summary data=sashelp.class nway;
   class sex;
   var _numeric_;
   output out=want (drop=_type_ _freq_)
        median = ;
run;

The option NWAY is to just show the result for the highest _type_ value of the result. _Type_ indicates which combination(s) of Class variables are involved in an output observation. The default output would also include an overall summary for all observations and if more variables are involved on the class statement then there multiple combinations possible.

Class indicates your grouping variable.

Var is list of numeric variables to use. In this case _numeric_ uses all of the numeric variables in the data set (there are only 3).

The output statement names the output dataset and requests which statistic(s) to create. The drop removes two automatic variables that would appear by default. The Median= requests the median statistic. With a single statistic that means the output statistic has the same name as the input variable. You can either name the resulting statistic your self or use /autoname option to create names with the statistic appended.

 

Konkordanz
Pyrite | Level 9
Thx for answering. I wouldnt say "wrong", because I dont calculate an default median. I have to calculate a median based on a formular of the institution Im working for. I have fleshed out my initial post above.
Tom
Super User Tom
Super User

Looks to me like your first 4 steps are just these two steps.  Plus I generated SIGN variables that can probably be used to reduce the PROC RANK steps to one instead of two.

data final ;
  input Eink :$2. Bg He ;
cards;
01 55.487739 2.722371
02 127.485645 18.670678
03 182.95309 53.529988
04 354.69612 122.703093
05 450.777081 137.596942
06 461.617933 157.457164
07 429.38197 192.283645
08 370.644235 173.688658
09 281.256479 180.848677
10 170.426973 137.045263
11 118.22022 128.45571
12 93.656461 104.596442
13 68.561622 112.710123
14 42.443895 77.742469
15 50.933162 136.718895
16 35.285987 104.219131
17 23.123863 78.424041
18 16.724659 89.48717
19 9.271308 46.396113
20 4.870804 21.574738
21 4.83902 19.753305
22 3.487616 9.23105
23 0.860551 2.136266
24 1.202043 1.861731
;

data min_max ;
  length Eink $2  Xmin Xmax 8;
  do xmax = 250 to 3500 by 250,4000,4500,5000,6000,7000,8000,10000,15000,25000,. ;
     i+1;
     eink=put(i,z2.);
     xmin=sum(0,lag(xmax));
     output;
  end;
  drop i;
run;

proc summary data=FINAL ;
  var Bg He ;
  output out=sums(drop=_freq_ _type_) sum=sum_Bg sum_He ;
run;

data STEP4;
  merge final min_max ;
  by eink;
  if _n_=1 then set sums;
  cum_Bg + Bg;
  Fx_Bg = 	sum_bg/2 - cum_Bg;
  Sign_Bg = sign(fx_bg);
  cum_He + He;
  Fx_He = sum_he/2 - cum_He;
  Sign_He = sign(fx_he);
run;

So if Bg and He represent the variables you need to process then you might use macro code to generate those blocks that use them.

... 
sum=sum_Bg sum_He  ;
...
  cum_Bg + Bg;
  Fx_Bg = 	sum_bg/2 - cum_Bg;
  Sign_Bg = sign(fx_bg);
  cum_He + He;
  Fx_He = sum_he/2 - cum_He;
  Sign_He = sign(fx_he);
...

But you could also transpose the data first. Then no need to generate code (other than the list of variables in the VAR statement of the TRANSPOSE step).

proc transpose data=FINAL out=TALL(rename=(col1=VALUE)) name=VARNAME ;
  by Eink;
  var Bg He ;
run;

proc sort data=TALL;
  by varname eink;
run;

proc summary data=TALL nway;
  by varname ;
  var value;
  output out=sums(drop=_freq_ _type_) sum=sum_value ;
run;

data step4_tall;
  merge tall sums ;
  by varname ;
  if _n_=1 then do;
    set min_max;
    declare hash h(dataset:'min_max');
    h.definekey('eink');
    h.definedata('Xmin','Xmax');
    h.definedone();
  end;
  if first.varnum then cum_value=.;
  h.find();
  cum_value + value ;
  fx_value = sum_value/2 - cum_value ;
  Sign_value = sign(fx_value);
run;
Tom
Super User Tom
Super User

What is STEP5 to STEP11 doing?

Why is it using RANK instead of just taking the MIN and MAX?

proc summary data=step4_tall nway ;
  by varname sum_value ;
  class sign_value ;
  var cum_value fx_value ;
  output out=stats min= max= / autoname ;
run;
proc print;
run;
                                                             cum_                   cum_
                    sum_     Sign_                         value_    fx_value_     value_    fx_value_
Obs    VARNAME     value     value    _TYPE_    _FREQ_        Min       Min         Max         Max

 1       Bg       3358.21      -1        1        18      2062.40     -1679.10    3358.21      -383.30
 2       Bg       3358.21       1        1         6        55.49        46.09    1633.02      1623.62
 3       He       2109.85      -1        1        15      1176.55     -1054.93    2109.85      -121.62
 4       He       2109.85       1        1         9         2.72        15.43    1039.50      1052.20
Tom
Super User Tom
Super User

For your example data what values did the macro variables used in that last step end up having?

Bg = &Xmin_B.   + &b_B.*((&n2_B.-&Fx_1_B.)/&fx_B.);
He  = &Xmin_H.   + &b_H.*((&n2_H.-&Fx_1_H.)/&fx_H.);

If you don't know re-run it with the SYMBOLGEN option turned on and the SAS log will show you.

 

Are they basically just the values from these rows where the value on the FX variable flips from positive to negative?

                                       sum_                       cum_                 Sign_
Obs    Eink    VARNAME     VALUE      value     Xmin    Xmax     value     fx_value    value

  6     06       Bg       461.618    3358.21    1250    1500    1633.02      46.087       1
  7     07       Bg       429.382    3358.21    1500    1750    2062.40    -383.295      -1

 33     09       He       180.849    2109.85    2000    2250    1039.50      15.426       1
 34     10       He       137.045    2109.85    2250    2500    1176.55    -121.620      -1

 

ballardw
Super User

At this point I think I am going call "Foul" on use of the word "median" at all to describe this process.

When I use that example data and run the code the result I get is

 

Obs     Eink                  Bg                 He

 1     Median    1526.8329804346    2278.1399438082


Every definition I have run into for median relates to "being in the middle" of values with some slop as to exactly how that middle is defined. Or perhaps "Formally, a median of a population is any value such that at least half of the population is less than or equal to the proposed median and at least half is greater than or equal to the proposed median."

 

 

Since the largest value in your Final example data for Bg is 461.6179330 I see no way that a value of 1526 or greater should be considered any sort of "median" for Bg. Similar with your He variable as the max in the set is 192.283645.

This may be a statistic of some sort but a median it ain't.

Tom
Super User Tom
Super User

So that last formula appears to interpolating between XMIN and XMAX based on the something that looks like perhaps the range at the point where the cum sum crosses the 50% point of the total sum.

Bg =	&Xmin_B.  	+ &b_B.*((&n2_B.-&Fx_1_B.)/&fx_B.);
He 	=	&Xmin_H.  	+ &b_H.*((&n2_H.-&Fx_1_H.)/&fx_H.);

Since you appear to have gotten it to run what are the values of those 5 macro variables for each analysis variable that ended up being used to make your result? 

 

Since they appear from the convoluted steps 5 to 10 be just pulling values from the dataset we should be able to look at the numbers and figure out which values the formula is using (much easier than figuring out what that code is trying to pull from looking at it).

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 2292 views
  • 2 likes
  • 6 in conversation