Hello Everyone:
I have a big dataset, and I want to create subsets from this dataset and combine all the subsets into a separate dataset. Here are more details:
corvar1 through corvar12: 12 different dummy variables. Each variable takes the value of 0 or 1.
allmdummy: a dummy variable that takes the value of 0 or 1.
I want to create 108 different subsets. For example, for corvar1, I would like to create the following nine subsets:
corvar1 le 1 + allmdummy le 1 (include all values of corvar1 and all values of allmdummy)
corvar1 le 1 + allmdummy eq 1
corvar1 le 1 + allmdummy lt 1
corvar1 eq 1 + allmdummy le 1
corvar1 eq 1 + allmdummy eq 1
corvar1 eq 1 + allmdummy lt 1
corvar1 lt 1 + allmdummy le 1
corvar1 lt 1 + allmdummy eq 1
corvar1 lt 1 + allmdummy lt 1 (include only the values where corvar1=0 and allmdummy=0).
Below is a rough sketch of what I would like to accomplish. I would love to have your help with improving this code.
Thank you so much!
%macro groupa;
%do cvar=1 %to 12; /*names for the variables*/
%do m=1 %to 3; /*for math operations*/
%do n=1 %to 3; /*for math operations*/
data set&cvar&m&n; set regsample;
if &m=1 then opa="le";
if &m=2 then opa="eq";
if &m=3 then opa="lt";
if &n=1 then opb="le";
if &n=2 then opb="eq";
if &n=3 then opb="lt";
if corvar&cvar opa 1 and allmdummy opb 1;
id1=&cvar;
id2=&m;
id3=&n;
sampleid=&cvar&m&n; /*create a sample ID*/
proc append base=wanted data=set&cvar&m&n; /*I want to combine all the datasets into a single one. Not sure if this syntax is right*/
%end;
%end;
%end;
%mend groupa;
%groupa
Partial code, without macros, no pulling data set apart and then later combining data sets
data want;
set regsample;
length sampleid $ 4;
array corvar(12) corvar1-corvar12;
do i=1 to 12;
if corvar(i) le 1 and alldummy le 1 then do;
sampleid=cats(i,1,1);
output;
end;
if corvar(i) eq 1 and alldummy le 1 then do;
sampleid=cats(i,2,1);
output;
end;
/* You write the rest, there are only 9 possibilities here, as m=1,2,3 and n=1,2,3 */
end;
run;
proc sort data=want;
by sampleid obsid;
run;
proc reg data=want noprint outest=wanted2 noprint adjrsq;
by sampleid obsid; /*obsid is another ID already stored in the dataset*/
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
proc means data=want;
by sampleid;
var intercept ind1 ind2 ind3 ind4 _rsq_ _adjrsq_;
output out=wanted3 mean = std= t= probt= / autoname;
run;
This code passes through data set REGSAMPLE once. Your original code (pull data set apart, combine them all together later) passes through REGSAMPLE 108 times.
@finans_sas wrote:
I have a big dataset, and I want to create subsets from this dataset and combine all the subsets into a separate dataset.
Why do this? Why not leave the data set as one big data set, then no splitting and re-combining is needed?
In general, splitting large data sets into smaller data sets is a sub-optimal approach in most situations.
Thank you so much, @PaigeMiller for your prompt reply. You made an excellent point. It is a nice segway into my end goal. I actually want to run the following for each sampleid (assuming that wanted is the output of the prior macro). If I could find a way to incorporate the prior macro into the next one, that will indeed save me a lot of time. I will really appreciate any insight on this one.
proc reg data=wanted noprint outest=wanted2 noprint adjrsq;
by sample id obsid; /*obsid is another ID already stored in the dataset*/
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
proc means data=wanted2;
by sampleid;
var intercept ind1 ind2 ind3 ind4 _rsq_ _adjrsq_;
output out=wanted3 mean = std= t= probt= / autoname;
run;
It's still not clear to me why you would not simply take the big data set, assign appropriate values to variable SampleID (without splitting up the big data set), and then use this big data set which now has values for SampleID in PROC MEANS and PROC REG.
Thank you so much once again for your quickly reply, @PaigeMiller . Here is my challenge: the subsets will be overlapping. For example, sampleid 111 will be a combination of 112 and 113. I could not figure out how to create sample IDs without overwriting the existing ones.
@finans_sas wrote:
Thank you so much once again for your quickly reply, @PaigeMiller . Here is my challenge: the subsets will be overlapping. For example, sampleid 111 will be a combination of 112 and 113. I could not figure out how to create sample IDs without overwriting the existing ones.
One could, without writing macros to split up large data sets, write a data step which created (potentially) multiple observations from one original observation. So if an original observation winds up in 112 and 111, the data step creates two observations, with appropriate sample IDs. Then you sort this data by SampleID and then run your PROC MEANS and PROC REG. And then no splitting of the data set and recombining is needed, and no macros are needed.
For PROC MEANS (but not PROC REG as far as I know) you could also use MULTILABEL formats for this (example).
I do not have time to help you write such code at this time, but I'm sure there are plenty of people here who can help you with this.
This does bring up the interesting problem, if you have 108 regressions, how do you use these in some logical way? What is the point of generating 108 regressions in the first place?
Partial code, without macros, no pulling data set apart and then later combining data sets
data want;
set regsample;
length sampleid $ 4;
array corvar(12) corvar1-corvar12;
do i=1 to 12;
if corvar(i) le 1 and alldummy le 1 then do;
sampleid=cats(i,1,1);
output;
end;
if corvar(i) eq 1 and alldummy le 1 then do;
sampleid=cats(i,2,1);
output;
end;
/* You write the rest, there are only 9 possibilities here, as m=1,2,3 and n=1,2,3 */
end;
run;
proc sort data=want;
by sampleid obsid;
run;
proc reg data=want noprint outest=wanted2 noprint adjrsq;
by sampleid obsid; /*obsid is another ID already stored in the dataset*/
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
proc means data=want;
by sampleid;
var intercept ind1 ind2 ind3 ind4 _rsq_ _adjrsq_;
output out=wanted3 mean = std= t= probt= / autoname;
run;
This code passes through data set REGSAMPLE once. Your original code (pull data set apart, combine them all together later) passes through REGSAMPLE 108 times.
As @PaigeMiller often asks (wisely), what comes next after you have replicated your data in this way? It looks like you are creating subsets (non-exclusive) and then stacking them back together. There are situations where this is useful (e.g. for simulations), but this may not be one of those situations.
If you want to analyze data for different subsets, you can leave the data as is, and then use a WHERE statement to identify subsets of interest. e.g.:
proc whatever;
where corvar1 le 1 and allmdummy le 1 ;
run;
If you really want 108 different subsets, you could generate it with code like:
data big;
set have;
if corvar1 le 1 and allmdummy le 1 then do;
subset=1;
output;
end;
if corvar1 le 1 and allmdummy eq 1 then do;
subset=2;
output;
end;
*...;
run;
And you might use a macro or some other code generation technique to create the list of IF statements.
Your current IF statement:
if corvar&cvar opa 1 and allmdummy opb 1;
Can't work because the data step compiler won't see the operators when the step tries to compile.
Thank you so much for your suggestions, @Quentin . If I could find a way to do the proc reg without splitting the dataset, that will actually be awesome since the splitting and re-combining will take a massive amount of hard disk space. The challenge I have at this point is that some observations will fall under more than one sample ID due to overlapping subsets.
@finans_sas wrote:
Thank you so much for your suggestions, @Quentin . If I could find a way to do the proc reg without splitting the dataset, that will actually be awesome since the splitting and re-combining will take a massive amount of hard disk space. The challenge I have at this point is that some observations will fall under more than one sample ID due to overlapping subsets.
The DATA step I posted will work fine with overlapping subsets. Each record is output once for each subset it is in. And you could use a macro or whatever to generate the 108 if-then blocks. Then you could run one PROC REG on the dataset, BY SAMPLEID.
But yes, it will take lots of disk space.
Take a look at Tom's suggestion of putting your control data (WHERE expressions) in a dataset, and then using that to drive macro calls to do 108 PROC REG steps.
I'd still like an answer to this: "This does bring up the interesting problem, if you have 108 regressions, how do you use these in some logical way? What is the point of generating 108 regressions in the first place?"
Great question, @PaigeMiller ! This is related to the need to show that the results are robust to different ways of measuring the variable of interest and to the different subsamples. Most of them will be a footnote anyways but we need to perform them.
@finans_sas wrote:
Thank you so much for your suggestions, @Quentin . If I could find a way to do the proc reg without splitting the dataset, that will actually be awesome since the splitting and re-combining will take a massive amount of hard disk space. The challenge I have at this point is that some observations will fall under more than one sample ID due to overlapping subsets.
I have not tried it by it looks like PROC REG already has features for what you want to do. The REWEIGHT and REFIT statements. https://documentation.sas.com/doc/en/statug/15.2/statug_reg_syntax14.htm
Another feature you might be able to use is the FREQ statement. If you run your source data through PROC SUMMARY to produce a count of the number of observations for all existing combinations of subsetting flags and the variables included in your MODEL and BY statements then you might be able to reduce the size of the input data from 50M observation to something much smaller. Then just use the FREQ statement to tell PROC REG to use that count when doing the fitting.
Again I have not tried this, so experiment with some small data and models first to make sure you know it works. But from what I read you want to do something like this:
proc summary nway data=HAVE ;
by obsid ;
class corvar1-corvar12 alldummmy depvar1 ind1-ind4 / missing;
output out=collapsed(drop=_type_ rename=(_freq_=FREQ));
run;
proc reg data=collapsed noprint outest=wanted2 noprint adjrsq;
by obsid;
freq FREQ ;
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
* Exclude obs with CORVAR=1;
reweight corvar1=1 ;
refit;
run;
* Exclude obs with CORVAR1=0;
reweight;
reweight corvar1=0 ;
refit;
run;
... repeat for all of your combinations of EXCLUDED observations ...
quit;
Thank you so much for this code, @Tom ! I will explore it further.
Put 108 WHERE clauses into a character variable in a dataset.
Either just type them out
data samples;
sampleid+1;
input where $200.;
cards;
corvar1 le 1 and allmdummy le 1
corvar1 le 1 and allmdummy eq 1
corvar1 le 1 and allmdummy lt 1
corvar1 eq 1 and allmdummy le 1
corvar1 eq 1 and allmdummy eq 1
corvar1 eq 1 and allmdummy lt 1
corvar1 lt 1 and allmdummy le 1
corvar1 lt 1 and allmdummy eq 1
corvar1 lt 1 and allmdummy lt 1
;
or generate them if there is pattern.
Then create a macro for the analysis that takes SAMPLEID and WHERE as inputs. The macro could use &WHERE to generate a WHERE statement.
proc reg data=have;
where &where ;
....
So say the macro is named ANAL you could then call it 108 times with code like this:
data _null_;
set samples;
call execute(cats('%nrstr(%anal)(sampleid=',sampleid,',where=',where,')'));
run;
The macro could include PROC APPEND step(s) to aggregate the results into a single dataset.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.