- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I want to create tertiles to organize my data into categories. However, I want the tertiles to be made based off of the control group. I tried this code, but then when I run the proc freq the table only shows data when casecont_path=1. How do I create tertiles using one group, then apply it to both the cases and the controls?
proc rank data=work.adenovar groups=3;
where casecont_path=1;
var jointyears;
ranks jointyears_tert;
run;
/*categorical joint years*/
data work.adeno1;
set work.data1;
if hash=2 then jointyears_cat=0;
if jointyears_tert=0 then jointyears_cat=1;
if jointyears_tert=1 then jointyears_cat=2;
if jointyears_tert=2 then jointyears_cat=3;
run;
proc freq data=work.adeno1;
tables jointyears_cat*casecont_path/norow nopercent;
run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Then run the PROC UNIVARITATE I showed, and do a PROC PRINT of the NEED dataset produced by it. If this were other, more standard percentiles, then you could drop the "NOPRINT" option and see the percentile reported by PROC UNIVARIATE.
Or, in the DATA step assigning tertiles, just add a
if _n_=1 then put (pctl:) (=);
statement after the "set need" statement.
============================== editted comments below ================================
This above was marked as a "solution", but it's really only the answer to a derivative question, and doesn't really connect with the subject line. So I've included below my original response, which will give context to the above:
proc univariate data=work.adenovar noprint;;
where casecont_path=1;
var jointyears;
output out=need pctlpts=33.33 66.67 pctlpre=pctl ;
run;
/*categorical joint years*/
data work.adeno1 (drop=pctl33_33 pctl_66_67);
set work.data1;
if _n_=1 then set need;
if _n_=1 then put (pctl:) (=); /* Added in response to follow-up*/
if hash=2 then jointyears_cat=0; else
jointyears=1 + (jointyears>pctl33_33) + (jointyears>pctl66_67);
run;
- Dataset need will have one observation with two variables PCTL33_33, and PCTL66_67.
- The "if _n_=1 then set need;" statement reads the single NEED observation during only the first iteration of the data step. Otherwise it would attempt to read beyond the end of NEED, and the data step would stop prematurely. And all the variables read by that conditional SET NEED statement will be retained for use in all subsequent obs as you progress through work.adeno1.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
where casecont_path=1;
so when you get to PROC FREQ, it never finds any other value for variable CASECONT_PATH.
A useful debugging method is for you to actually look at the data sets with your own eyes and see what is in there.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why not use proc univariate to make two cut-points, at the 33.33%ile and 66.67%ile values? This is a good way to avoid holes from the proc rank on a subset. (edit: failed to paste this additional text).
proc univariate data=work.adenovar noprint;;
where casecont_path=1;
var jointyears;
output out=need pctlpts=33.33 66.67 pctlpre=pctl ;
run;
/*categorical joint years*/
data work.adeno1 (drop=pctl33_33 pctl_66_67);
set work.data1;
if _n_=1 then set need;
if hash=2 then jointyears_cat=0; else
jointyears=1 + (jointyears>pctl33_33) + (jointyears>pctl66_67);
run;
- Dataset need will have one observation with two variables PCTL33_33, and PCTL66_67.
- The "if _n_=1 then set need;" statement reads the single NEED observation during only the first iteration of the data step. Otherwise it would attempt to read beyond the end of NEED, and the data step would stop prematurely. And all the variables read by that conditional SET NEED statement will be retained for use in all subsequent obs as you progress through work.adeno1.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This code
data work.adeno1; set work.data1; if hash=2 then jointyears_cat=0; if jointyears_tert=0 then jointyears_cat=1; if jointyears_tert=1 then jointyears_cat=2; if jointyears_tert=2 then jointyears_cat=3; run;
is very likely overwriting the value of jointyears_cat set when hash=2 unless "joint_years_tert" is missing.
You might be wanting
data work.adeno1; set work.data1; if hash=2 then jointyears_cat=0; Else Jointyears_cat = jointyears_tert +1; run;
Be prepared to provide example data and expected result for that example.
The "application" to other variables can be extremely data dependendent
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
After some more thought, I think I'm asking the wrong question. I know how to divide up my data into tertiles, and I now understand how to do it for just my control group. What I really need to know now is how to see what each tertile actually is, like it's numerical value, so I can use it in making categories.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Then run the PROC UNIVARITATE I showed, and do a PROC PRINT of the NEED dataset produced by it. If this were other, more standard percentiles, then you could drop the "NOPRINT" option and see the percentile reported by PROC UNIVARIATE.
Or, in the DATA step assigning tertiles, just add a
if _n_=1 then put (pctl:) (=);
statement after the "set need" statement.
============================== editted comments below ================================
This above was marked as a "solution", but it's really only the answer to a derivative question, and doesn't really connect with the subject line. So I've included below my original response, which will give context to the above:
proc univariate data=work.adenovar noprint;;
where casecont_path=1;
var jointyears;
output out=need pctlpts=33.33 66.67 pctlpre=pctl ;
run;
/*categorical joint years*/
data work.adeno1 (drop=pctl33_33 pctl_66_67);
set work.data1;
if _n_=1 then set need;
if _n_=1 then put (pctl:) (=); /* Added in response to follow-up*/
if hash=2 then jointyears_cat=0; else
jointyears=1 + (jointyears>pctl33_33) + (jointyears>pctl66_67);
run;
- Dataset need will have one observation with two variables PCTL33_33, and PCTL66_67.
- The "if _n_=1 then set need;" statement reads the single NEED observation during only the first iteration of the data step. Otherwise it would attempt to read beyond the end of NEED, and the data step would stop prematurely. And all the variables read by that conditional SET NEED statement will be retained for use in all subsequent obs as you progress through work.adeno1.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content