Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- SAS Macro & Nested Do Loop to subset data

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

☑ This topic is **solved**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 09-20-2023 10:02 AM
(610 views)

Hello Everyone:

I have a big dataset, and I want to create subsets from this dataset and combine all the subsets into a separate dataset. Here are more details:

corvar1 through corvar12: 12 different dummy variables. Each variable takes the value of 0 or 1.

allmdummy: a dummy variable that takes the value of 0 or 1.

I want to create 108 different subsets. For example, for corvar1, I would like to create the following nine subsets:

corvar1 le 1 + allmdummy le 1 (include all values of corvar1 and all values of allmdummy)

corvar1 le 1 + allmdummy eq 1

corvar1 le 1 + allmdummy lt 1

corvar1 eq 1 + allmdummy le 1

corvar1 eq 1 + allmdummy eq 1

corvar1 eq 1 + allmdummy lt 1

corvar1 lt 1 + allmdummy le 1

corvar1 lt 1 + allmdummy eq 1

corvar1 lt 1 + allmdummy lt 1 (include only the values where corvar1=0 and allmdummy=0).

Below is a rough sketch of what I would like to accomplish. I would love to have your help with improving this code.

Thank you so much!

%macro groupa;

%do cvar=1 %to 12; /*names for the variables*/

%do m=1 %to 3; /*for math operations*/

%do n=1 %to 3; /*for math operations*/

data set&cvar&m&n; set regsample;

if &m=1 then opa="le";

if &m=2 then opa="eq";

if &m=3 then opa="lt";

if &n=1 then opb="le";

if &n=2 then opb="eq";

if &n=3 then opb="lt";

if corvar&cvar opa 1 and allmdummy opb 1;

id1=&cvar;

id2=&m;

id3=&n;

sampleid=&cvar&m&n; /*create a sample ID*/

proc append base=wanted data=set&cvar&m&n; /*I want to combine all the datasets into a single one. Not sure if this syntax is right*/

%end;

%end;

%end;

%mend groupa;

%groupa

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Partial code, without macros, no pulling data set apart and then later combining data sets

```
data want;
set regsample;
length sampleid $ 4;
array corvar(12) corvar1-corvar12;
do i=1 to 12;
if corvar(i) le 1 and alldummy le 1 then do;
sampleid=cats(i,1,1);
output;
end;
if corvar(i) eq 1 and alldummy le 1 then do;
sampleid=cats(i,2,1);
output;
end;
/* You write the rest, there are only 9 possibilities here, as m=1,2,3 and n=1,2,3 */
end;
run;
proc sort data=want;
by sampleid obsid;
run;
proc reg data=want noprint outest=wanted2 noprint adjrsq;
by sampleid obsid; /*obsid is another ID already stored in the dataset*/
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
proc means data=want;
by sampleid;
var intercept ind1 ind2 ind3 ind4 _rsq_ _adjrsq_;
output out=wanted3 mean = std= t= probt= / autoname;
run;
```

This code passes through data set REGSAMPLE once. Your original code (pull data set apart, combine them all together later) passes through REGSAMPLE 108 times.

--

Paige Miller

Paige Miller

25 REPLIES 25

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@finans_sas wrote:

I have a big dataset, and I want to create subsets from this dataset and combine all the subsets into a separate dataset.

Why do this? Why not leave the data set as one big data set, then no splitting and re-combining is needed?

In general, splitting large data sets into smaller data sets is a sub-optimal approach in most situations.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you so much, @PaigeMiller for your prompt reply. You made an excellent point. It is a nice segway into my end goal. I actually want to run the following for each sampleid (assuming that wanted is the output of the prior macro). If I could find a way to incorporate the prior macro into the next one, that will indeed save me a lot of time. I will really appreciate any insight on this one.

proc reg data=wanted noprint outest=wanted2 noprint adjrsq;

by sample id obsid; /*obsid is another ID already stored in the dataset*/

model depvar1 = ind1 ind2 ind3 ind4 / hcc;

run;

proc means data=wanted2;

by sampleid;

var intercept ind1 ind2 ind3 ind4 _rsq_ _adjrsq_;

output out=wanted3 mean = std= t= probt= / autoname;

run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

It's still not clear to me why you would not simply take the big data set, assign appropriate values to variable SampleID (without splitting up the big data set), and then use this big data set which now has values for SampleID in PROC MEANS and PROC REG.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@finans_sas wrote:

One could, without writing macros to split up large data sets, write a data step which created (potentially) multiple observations from one original observation. So if an original observation winds up in 112 and 111, the data step creates two observations, with appropriate sample IDs. Then you sort this data by SampleID and then run your PROC MEANS and PROC REG. And then no splitting of the data set and recombining is needed, and no macros are needed.

For PROC MEANS (but not PROC REG as far as I know) you could also use MULTILABEL formats for this (example).

I do not have time to help you write such code at this time, but I'm sure there are plenty of people here who can help you with this.

This does bring up the interesting problem, if you have 108 regressions, how do you use these in some logical way? What is the point of generating 108 regressions in the first place?

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Partial code, without macros, no pulling data set apart and then later combining data sets

```
data want;
set regsample;
length sampleid $ 4;
array corvar(12) corvar1-corvar12;
do i=1 to 12;
if corvar(i) le 1 and alldummy le 1 then do;
sampleid=cats(i,1,1);
output;
end;
if corvar(i) eq 1 and alldummy le 1 then do;
sampleid=cats(i,2,1);
output;
end;
/* You write the rest, there are only 9 possibilities here, as m=1,2,3 and n=1,2,3 */
end;
run;
proc sort data=want;
by sampleid obsid;
run;
proc reg data=want noprint outest=wanted2 noprint adjrsq;
by sampleid obsid; /*obsid is another ID already stored in the dataset*/
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
proc means data=want;
by sampleid;
var intercept ind1 ind2 ind3 ind4 _rsq_ _adjrsq_;
output out=wanted3 mean = std= t= probt= / autoname;
run;
```

This code passes through data set REGSAMPLE once. Your original code (pull data set apart, combine them all together later) passes through REGSAMPLE 108 times.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

As @PaigeMiller often asks (wisely), what comes next after you have replicated your data in this way? It looks like you are creating subsets (non-exclusive) and then stacking them back together. There are situations where this is useful (e.g. for simulations), but this may not be one of those situations.

If you want to analyze data for different subsets, you can leave the data as is, and then use a WHERE statement to identify subsets of interest. e.g.:

```
proc whatever;
where corvar1 le 1 and allmdummy le 1 ;
run;
```

If you really want 108 different subsets, you could generate it with code like:

```
data big;
set have;
if corvar1 le 1 and allmdummy le 1 then do;
subset=1;
output;
end;
if corvar1 le 1 and allmdummy eq 1 then do;
subset=2;
output;
end;
*...;
run;
```

And you might use a macro or some other code generation technique to create the list of IF statements.

Your current IF statement:

`if corvar&cvar opa 1 and allmdummy opb 1;`

Can't work because the data step compiler won't see the operators when the step tries to compile.

Check out the Boston Area SAS Users Group (BASUG) video archives: https://www.basug.org/videos.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@finans_sas wrote:

The DATA step I posted will work fine with overlapping subsets. Each record is output once for each subset it is in. And you could use a macro or whatever to generate the 108 if-then blocks. Then you could run one PROC REG on the dataset, BY SAMPLEID.

But yes, it will take lots of disk space.

Take a look at Tom's suggestion of putting your control data (WHERE expressions) in a dataset, and then using that to drive macro calls to do 108 PROC REG steps.

Check out the Boston Area SAS Users Group (BASUG) video archives: https://www.basug.org/videos.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I'd still like an answer to this: "This does bring up the interesting problem, if you have 108 regressions, how do you use these in some logical way? What is the point of generating 108 regressions in the first place?"

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@finans_sas wrote:

I have not tried it by it looks like PROC REG already has features for what you want to do. The REWEIGHT and REFIT statements. https://documentation.sas.com/doc/en/statug/15.2/statug_reg_syntax14.htm

Another feature you might be able to use is the FREQ statement. If you run your source data through PROC SUMMARY to produce a count of the number of observations for all existing combinations of subsetting flags and the variables included in your MODEL and BY statements then you might be able to reduce the size of the input data from 50M observation to something much smaller. Then just use the FREQ statement to tell PROC REG to use that count when doing the fitting.

Again I have not tried this, so experiment with some small data and models first to make sure you know it works. But from what I read you want to do something like this:

```
proc summary nway data=HAVE ;
by obsid ;
class corvar1-corvar12 alldummmy depvar1 ind1-ind4 / missing;
output out=collapsed(drop=_type_ rename=(_freq_=FREQ));
run;
proc reg data=collapsed noprint outest=wanted2 noprint adjrsq;
by obsid;
freq FREQ ;
model depvar1 = ind1 ind2 ind3 ind4 / hcc;
run;
* Exclude obs with CORVAR=1;
reweight corvar1=1 ;
refit;
run;
* Exclude obs with CORVAR1=0;
reweight;
reweight corvar1=0 ;
refit;
run;
... repeat for all of your combinations of EXCLUDED observations ...
quit;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you so much for this code, @Tom ! I will explore it further.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Put 108 WHERE clauses into a character variable in a dataset.

Either just type them out

```
data samples;
sampleid+1;
input where $200.;
cards;
corvar1 le 1 and allmdummy le 1
corvar1 le 1 and allmdummy eq 1
corvar1 le 1 and allmdummy lt 1
corvar1 eq 1 and allmdummy le 1
corvar1 eq 1 and allmdummy eq 1
corvar1 eq 1 and allmdummy lt 1
corvar1 lt 1 and allmdummy le 1
corvar1 lt 1 and allmdummy eq 1
corvar1 lt 1 and allmdummy lt 1
;
```

or generate them if there is pattern.

Then create a macro for the analysis that takes SAMPLEID and WHERE as inputs. The macro could use &WHERE to generate a WHERE statement.

```
proc reg data=have;
where &where ;
....
```

So say the macro is named ANAL you could then call it 108 times with code like this:

```
data _null_;
set samples;
call execute(cats('%nrstr(%anal)(sampleid=',sampleid,',where=',where,')'));
run;
```

The macro could include PROC APPEND step(s) to aggregate the results into a single dataset.

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.