BookmarkSubscribeRSS Feed
cwhitman412
Calcite | Level 5

Hello all!  

 

I'm wondering if any lovely people here could help me find an easier way to code out my problem. In my research lab, one of the goals I have is to create quantile variables (either in a a 0/1 binary variable i.e. Quart1 Quart 2 etc or a VarQuart variable with a 1-4 range) based off only the control participants in the study. Because of that, proc rank doesn't work as well, because I don't want the whole sample to be part of the ranking system, even though I want all observations to be coded. I've been doing a proc means to get the specific end points for the quantiles and then entering it in manually, but I'm trying to minimize the copy/paste and potentially macro it up to use for any different cut. Does anyone have any recommendations?

 

Current method (pared down, just for one variable, numbers changed)

 

proc means data=matextreme p10 p20 p30 p70 p80 p90;
where Case=0;
var AA ;
run;

 

data matdec;
set pufadata.PUFAanalysis_042618;
if AA < d1 cut number then AAd1 = 1; else AAd1 = 0;
if AA >= d1 cut number and  AA < d2 cut number   then AAd2 = 1; else AAd2 = 0;
....

10 REPLIES 10
Rick_SAS
SAS Super FREQ

1) In the code you provided, does Case=0 represent the control group?

2) In your PROC MEANS call, you use  p10 p20 p30 p70 p80 p90. That will generate 6 numbers.

I assume that you want all observations (not just those where Case=0) to be assigned a value 1-8 where

1 means AA <= P10

2 means P10 < AA <= P20

...

7 means P80 < AA <= P90.

8 means AA > P90.

 

 

Is that correct? If not, please clarify.

cwhitman412
Calcite | Level 5

Yes, Case=0 is the control group. We have about 500 in each, so there should be about 50 in each decile for the control group and a reasonably close amount for the Case=1 sample.

 

There are only a limited variables because I copied from the code in which we are using D4-D7 as a referent group, 

 

The code I was working on is making 8 different 0/1 indicator variables, of which it shows if they are included in the Decile group or not. For example, the objects in the first decile would have a 1 for the AAd1 variable, and a 0 for all others. This is easily interchangeable with a single decile variable of 1-10, and I would prefer to be able to make the 1-10 variable first, so then I can subgroup after however I want.

Reeza
Super User
Depending on which proc you're using in your next step it may support a CLASS statement which means it will create the dummy variables for you.
cwhitman412
Calcite | Level 5

The end goal is to do a proc logistic using some parts of the distribution as a referent, but there are also things such as quartile/decile/etc medians I need for other things as well.

Rick_SAS
SAS Super FREQ

You say that the goal is to use PROC LOGISTIC. Most statisticians argue that you should not use binning to split continuous explanatory variables. Your models will be more interpretable and applicable if you use the continuous variables in the model, rather than a binned CLASS variable.

ballardw
Super User

@cwhitman412 wrote:

Yes, Case=0 is the control group. We have about 500 in each, so there should be about 50 in each decile for the control group and a reasonably close amount for the Case=1 sample.

 

There are only a limited variables because I copied from the code in which we are using D4-D7 as a referent group, 

 

The code I was working on is making 8 different 0/1 indicator variables, of which it shows if they are included in the Decile group or not. For example, the objects in the first decile would have a 1 for the AAd1 variable, and a 0 for all others. This is easily interchangeable with a single decile variable of 1-10, and I would prefer to be able to make the 1-10 variable first, so then I can subgroup after however I want.


I am not quite clear what this means. Does this mean that you would actually prefer to have a single variable that indicated which interval such as 1 when AA <p10, 2 when p10 le AA < p20 and so on?

 

Following is an example that created a format from the quantile boundaries and would display a value of 1 for the first group such as your AA1 and so on:

proc summary data=sashelp.baseball;
   var natbat;
   output out=work.summary p10= p20= p30= p70= p80= p90= /autoname;
run;

data work.cntlin;
   set work.summary;
   fmtname='natbatquantile';
   type='N';
   sexcl='N';
   eexcl='Y';
   array p natbat: ;
   do i= 1 to dim(p);
      if i=1 then do;
         label=i;
         start=.;
         HLO='L';
         end=p[i];
         output;
         label=i+1;
         HLO='';
         start=p[i];
         end  =p[i+1];
      end;
      else if i=dim(p) then do;
         label=i+1;
         end=.;
         eexcl='N';
         HLO='H';
         start=p[i];
      end;
      else do;
         start=p[i];
         end  =p[i+1];
         label=i+1;
      end;

   output;
   end;
run;


proc format library=work cntlin=work.cntlin;
run;
quit;

proc freq data=sashelp.baseball;
   tables natbat;
   format natbat natbatquantile.;
run;

You could create a variable with the values of 1 to 7 using

quantilegroup = put(natbat, natbatquantile.);

 

replace "natbat" with "aa"  in my code and use your data set that you use to build the quantiles.

 

If you ever want to change for low<p10, p10 le aa < p20 you will need to investigate the Proc Format EEXCL SEXCL and HLO variables used in a control data set.

Reeza
Super User

How many variables do you need to apply this to? Is the data large, will speed be an issue?

 


@cwhitman412 wrote:

Hello all!  

 

I'm wondering if any lovely people here could help me find an easier way to code out my problem. In my research lab, one of the goals I have is to create quantile variables (either in a a 0/1 binary variable i.e. Quart1 Quart 2 etc or a VarQuart variable with a 1-4 range) based off only the control participants in the study. Because of that, proc rank doesn't work as well, because I don't want the whole sample to be part of the ranking system, even though I want all observations to be coded. I've been doing a proc means to get the specific end points for the quantiles and then entering it in manually, but I'm trying to minimize the copy/paste and potentially macro it up to use for any different cut. Does anyone have any recommendations?

 

Current method (pared down, just for one variable, numbers changed)

 

proc means data=matextreme p10 p20 p30 p70 p80 p90;
where Case=0;
var AA ;
run;

 

data matdec;
set pufadata.PUFAanalysis_042618;
if AA < d1 cut number then AAd1 = 1; else AAd1 = 0;
if AA >= d1 cut number and  AA < d2 cut number   then AAd2 = 1; else AAd2 = 0;
....


 

cwhitman412
Calcite | Level 5

Not too large, as I said, about 500 per Case/Control (1k in total) and 10-15 variables. It takes me a while to manually do it and I'd rather cut down on potential copy/paste typos.

Rick_SAS
SAS Super FREQ

This is essentially binning variables according to the deciles of the control group. Do you want to bin 15 variables or only the AA variable?

cwhitman412
Calcite | Level 5

Each individual biomarker is run separately, we are comparing the biomarker in each of its quartiles against a referent. I know the analysis is correct, I just need to figure out an easier way to do the coding as opposed to simply manually typing in the cutoffs.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1883 views
  • 2 likes
  • 4 in conversation