Hello all!
I'm wondering if any lovely people here could help me find an easier way to code out my problem. In my research lab, one of the goals I have is to create quantile variables (either in a a 0/1 binary variable i.e. Quart1 Quart 2 etc or a VarQuart variable with a 1-4 range) based off only the control participants in the study. Because of that, proc rank doesn't work as well, because I don't want the whole sample to be part of the ranking system, even though I want all observations to be coded. I've been doing a proc means to get the specific end points for the quantiles and then entering it in manually, but I'm trying to minimize the copy/paste and potentially macro it up to use for any different cut. Does anyone have any recommendations?
Current method (pared down, just for one variable, numbers changed)
proc means data=matextreme p10 p20 p30 p70 p80 p90;
where Case=0;
var AA ;
run;
data matdec;
set pufadata.PUFAanalysis_042618;
if AA < d1 cut number then AAd1 = 1; else AAd1 = 0;
if AA >= d1 cut number and AA < d2 cut number then AAd2 = 1; else AAd2 = 0;
....
1) In the code you provided, does Case=0 represent the control group?
2) In your PROC MEANS call, you use p10 p20 p30 p70 p80 p90. That will generate 6 numbers.
I assume that you want all observations (not just those where Case=0) to be assigned a value 1-8 where
1 means AA <= P10
2 means P10 < AA <= P20
...
7 means P80 < AA <= P90.
8 means AA > P90.
Is that correct? If not, please clarify.
Yes, Case=0 is the control group. We have about 500 in each, so there should be about 50 in each decile for the control group and a reasonably close amount for the Case=1 sample.
There are only a limited variables because I copied from the code in which we are using D4-D7 as a referent group,
The code I was working on is making 8 different 0/1 indicator variables, of which it shows if they are included in the Decile group or not. For example, the objects in the first decile would have a 1 for the AAd1 variable, and a 0 for all others. This is easily interchangeable with a single decile variable of 1-10, and I would prefer to be able to make the 1-10 variable first, so then I can subgroup after however I want.
The end goal is to do a proc logistic using some parts of the distribution as a referent, but there are also things such as quartile/decile/etc medians I need for other things as well.
You say that the goal is to use PROC LOGISTIC. Most statisticians argue that you should not use binning to split continuous explanatory variables. Your models will be more interpretable and applicable if you use the continuous variables in the model, rather than a binned CLASS variable.
@cwhitman412 wrote:
Yes, Case=0 is the control group. We have about 500 in each, so there should be about 50 in each decile for the control group and a reasonably close amount for the Case=1 sample.
There are only a limited variables because I copied from the code in which we are using D4-D7 as a referent group,
The code I was working on is making 8 different 0/1 indicator variables, of which it shows if they are included in the Decile group or not. For example, the objects in the first decile would have a 1 for the AAd1 variable, and a 0 for all others. This is easily interchangeable with a single decile variable of 1-10, and I would prefer to be able to make the 1-10 variable first, so then I can subgroup after however I want.
I am not quite clear what this means. Does this mean that you would actually prefer to have a single variable that indicated which interval such as 1 when AA <p10, 2 when p10 le AA < p20 and so on?
Following is an example that created a format from the quantile boundaries and would display a value of 1 for the first group such as your AA1 and so on:
proc summary data=sashelp.baseball; var natbat; output out=work.summary p10= p20= p30= p70= p80= p90= /autoname; run; data work.cntlin; set work.summary; fmtname='natbatquantile'; type='N'; sexcl='N'; eexcl='Y'; array p natbat: ; do i= 1 to dim(p); if i=1 then do; label=i; start=.; HLO='L'; end=p[i]; output; label=i+1; HLO=''; start=p[i]; end =p[i+1]; end; else if i=dim(p) then do; label=i+1; end=.; eexcl='N'; HLO='H'; start=p[i]; end; else do; start=p[i]; end =p[i+1]; label=i+1; end; output; end; run; proc format library=work cntlin=work.cntlin; run; quit; proc freq data=sashelp.baseball; tables natbat; format natbat natbatquantile.; run;
You could create a variable with the values of 1 to 7 using
quantilegroup = put(natbat, natbatquantile.);
replace "natbat" with "aa" in my code and use your data set that you use to build the quantiles.
If you ever want to change for low<p10, p10 le aa < p20 you will need to investigate the Proc Format EEXCL SEXCL and HLO variables used in a control data set.
How many variables do you need to apply this to? Is the data large, will speed be an issue?
@cwhitman412 wrote:
Hello all!
I'm wondering if any lovely people here could help me find an easier way to code out my problem. In my research lab, one of the goals I have is to create quantile variables (either in a a 0/1 binary variable i.e. Quart1 Quart 2 etc or a VarQuart variable with a 1-4 range) based off only the control participants in the study. Because of that, proc rank doesn't work as well, because I don't want the whole sample to be part of the ranking system, even though I want all observations to be coded. I've been doing a proc means to get the specific end points for the quantiles and then entering it in manually, but I'm trying to minimize the copy/paste and potentially macro it up to use for any different cut. Does anyone have any recommendations?
Current method (pared down, just for one variable, numbers changed)
proc means data=matextreme p10 p20 p30 p70 p80 p90;
where Case=0;
var AA ;
run;
data matdec;
set pufadata.PUFAanalysis_042618;
if AA < d1 cut number then AAd1 = 1; else AAd1 = 0;
if AA >= d1 cut number and AA < d2 cut number then AAd2 = 1; else AAd2 = 0;
....
Not too large, as I said, about 500 per Case/Control (1k in total) and 10-15 variables. It takes me a while to manually do it and I'd rather cut down on potential copy/paste typos.
This is essentially binning variables according to the deciles of the control group. Do you want to bin 15 variables or only the AA variable?
Each individual biomarker is run separately, we are comparing the biomarker in each of its quartiles against a referent. I know the analysis is correct, I just need to figure out an easier way to do the coding as opposed to simply manually typing in the cutoffs.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.