Solved: Re: Creating tertiles based on a control group

gejoachim99 · Posted 02-14-2021 03:39 PM

I want to create tertiles to organize my data into categories. However, I want the tertiles to be made based off of the control group. I tried this code, but then when I run the proc freq the table only shows data when casecont_path=1. How do I create tertiles using one group, then apply it to both the cases and the controls?

proc rank data=work.adenovar groups=3;
where casecont_path=1;
var jointyears;
ranks jointyears_tert;
run;

/*categorical joint years*/
data work.adeno1;
set work.data1;
if hash=2 then jointyears_cat=0;
if jointyears_tert=0 then jointyears_cat=1;
if jointyears_tert=1 then jointyears_cat=2;
if jointyears_tert=2 then jointyears_cat=3;
run;

proc freq data=work.adeno1;
tables jointyears_cat*casecont_path/norow nopercent;
run;

mkeintz · Posted 02-15-2021 02:53 PM

Then run the PROC UNIVARITATE I showed, and do a PROC PRINT of the NEED dataset produced by it. If this were other, more standard percentiles, then you could drop the "NOPRINT" option and see the percentile reported by PROC UNIVARIATE.

Or, in the DATA step assigning tertiles, just add a

  if _n_=1 then put (pctl:) (=);

statement after the "set need" statement.

============================== editted comments below ================================

This above was marked as a "solution", but it's really only the answer to a derivative question, and doesn't really connect with the subject line. So I've included below my original response, which will give context to the above:

proc univariate data=work.adenovar noprint;;
  where casecont_path=1;
  var jointyears;
  output out=need pctlpts=33.33 66.67 pctlpre=pctl ;
run;

/*categorical joint years*/
data work.adeno1 (drop=pctl33_33 pctl_66_67);
  set work.data1;
  if _n_=1 then set need; 
  if _n_=1 then put (pctl:) (=);   /* Added in response to follow-up*/
  if hash=2 then jointyears_cat=0; else
  jointyears=1 + (jointyears>pctl33_33) + (jointyears>pctl66_67);
run;

Dataset need will have one observation with two variables PCTL33_33, and PCTL66_67.
The "if _n_=1 then set need;" statement reads the single NEED observation during only the first iteration of the data step. Otherwise it would attempt to read beyond the end of NEED, and the data step would stop prematurely. And all the variables read by that conditional SET NEED statement will be retained for use in all subsequent obs as you progress through work.adeno1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PaigeMiller · Posted 02-14-2021 04:10 PM

where casecont_path=1;

so when you get to PROC FREQ, it never finds any other value for variable CASECONT_PATH.

A useful debugging method is for you to actually look at the data sets with your own eyes and see what is in there.

--
Paige Miller

gejoachim99 · Posted 02-14-2021 04:16 PM

I understand that the where statement is why the proc freq isn't what I need, but I'm more asking if there is a way to create tertiles based on only that group, then apply that categorization to both my cases and controls. My data set is very large.

mkeintz · Posted 02-14-2021 04:38 PM

Why not use proc univariate to make two cut-points, at the 33.33%ile and 66.67%ile values? This is a good way to avoid holes from the proc rank on a subset. (edit: failed to paste this additional text).

proc univariate data=work.adenovar noprint;;
  where casecont_path=1;
  var jointyears;
  output out=need pctlpts=33.33 66.67 pctlpre=pctl ;
run;

/*categorical joint years*/
data work.adeno1 (drop=pctl33_33 pctl_66_67);
  set work.data1;
  if _n_=1 then set need; 
  if hash=2 then jointyears_cat=0; else
  jointyears=1 + (jointyears>pctl33_33) + (jointyears>pctl66_67);
run;

Dataset need will have one observation with two variables PCTL33_33, and PCTL66_67.
The "if _n_=1 then set need;" statement reads the single NEED observation during only the first iteration of the data step. Otherwise it would attempt to read beyond the end of NEED, and the data step would stop prematurely. And all the variables read by that conditional SET NEED statement will be retained for use in all subsequent obs as you progress through work.adeno1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ballardw · Posted 02-15-2021 02:05 AM

This code

data work.adeno1;
set work.data1;
if hash=2 then jointyears_cat=0;
if jointyears_tert=0 then jointyears_cat=1;
if jointyears_tert=1 then jointyears_cat=2;
if jointyears_tert=2 then jointyears_cat=3;
run;

is very likely overwriting the value of jointyears_cat set when hash=2 unless "joint_years_tert" is missing.

You might be wanting

data work.adeno1;
   set work.data1;
   if hash=2 then jointyears_cat=0;
   Else Jointyears_cat = jointyears_tert +1;
run;

Be prepared to provide example data and expected result for that example.

The "application" to other variables can be extremely data dependendent

gejoachim99 · Posted 02-15-2021 02:35 PM

That part of the code isn't what I had a question on, I know it is correct. I was just including it for some context.

gejoachim99 · Posted 02-15-2021 02:37 PM

After some more thought, I think I'm asking the wrong question. I know how to divide up my data into tertiles, and I now understand how to do it for just my control group. What I really need to know now is how to see what each tertile actually is, like it's numerical value, so I can use it in making categories.

mkeintz · Posted 02-15-2021 02:53 PM

Then run the PROC UNIVARITATE I showed, and do a PROC PRINT of the NEED dataset produced by it. If this were other, more standard percentiles, then you could drop the "NOPRINT" option and see the percentile reported by PROC UNIVARIATE.

Or, in the DATA step assigning tertiles, just add a

  if _n_=1 then put (pctl:) (=);

statement after the "set need" statement.

============================== editted comments below ================================

This above was marked as a "solution", but it's really only the answer to a derivative question, and doesn't really connect with the subject line. So I've included below my original response, which will give context to the above:

proc univariate data=work.adenovar noprint;;
  where casecont_path=1;
  var jointyears;
  output out=need pctlpts=33.33 66.67 pctlpre=pctl ;
run;

/*categorical joint years*/
data work.adeno1 (drop=pctl33_33 pctl_66_67);
  set work.data1;
  if _n_=1 then set need; 
  if _n_=1 then put (pctl:) (=);   /* Added in response to follow-up*/
  if hash=2 then jointyears_cat=0; else
  jointyears=1 + (jointyears>pctl33_33) + (jointyears>pctl66_67);
run;

Dataset need will have one observation with two variables PCTL33_33, and PCTL66_67.
The "if _n_=1 then set need;" statement reads the single NEED observation during only the first iteration of the data step. Otherwise it would attempt to read beyond the end of NEED, and the data step would stop prematurely. And all the variables read by that conditional SET NEED statement will be retained for use in all subsequent obs as you progress through work.adeno1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

gejoachim99 · Posted 02-15-2021 03:00 PM

Thank you so much! That worked! 🙂

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away