Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- selecting variables based on categorical values

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 12-07-2018 02:49 PM
(1169 views)

Hello Members,

I get stuck with a simple SAS problem.

I have a 60 variables. The first 30 numerical variables (resp_1-resp_30, with 2 categories: 0, 1) indicating students' score on 30 items, with 0 being incorrect and 1 being correct. The next 30 are categorical variables (cat_1-cat_30, with 3 categories: 1,2,3) indicating the content domains each of the 30 item measuring. Each item may measure 1, or 2, or 3 content domains.

I want to calculate domain total score for domain 1, 2, and 3. How should I code?

Thanks so much!

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Aha! So each resp variable is in a single domain throughout the data set.

So imagine you could have

data want; set have; cat1_total=sum(of resp_1); cat2_total=sum(of resp_2 resp_3); cat3_total=sum(of resp_30); run;

So all you need is a way to use the data (from just one observation) to generate lists of variable names inside the SUM functions. This does that in the first data step, which makes 3 macrovars holding those 3 lists.

```
/* Make 3 macrovars: listing RESP vars for cat1, cat2, cat3 */
data _null_ ;
set have (keep=cat:);
array list {3} $240;
array ct {*} cat: ;
do i=1 to dim(ct);
list{ct{i}} = catx(' ',list{ct{i}},cats('resp_',i));
end;
call symput('resp_cat1',trim(list{1}));
call symput('resp_cat2',trim(list{2}));
call symput('resp_cat3',trim(list{3}));
stop;
run;
/* Show the macrovars */
%put &=resp_cat1;
%put &=resp_cat2;
%put &=resp_cat3;
/* Use the macrovars */
data want;
set have ;
cat1_total=sum(of &resp_cat1);
cat2_total=sum(of &resp_cat2);
cat3_total=sum(of &resp_cat3);
run;
```

For large datasets, I suspect this could save some time ordinarily used by looping through the 30 variables. And it would probably save more time for even longer lists of variables. Also it's easy to generalize to more categories. And would be easy to generalize to overlapping categories.

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

13 REPLIES 13

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Transpose your data so that you have a long structure of variables instead:

Student Question Response Category

PROC TRANSPOSE will work but it's likely easier to use a data step.

```
data want;
set have;
array grs(*) grad1-grad30;
array cat(*) category1-category30;
do question = 1 to 30;
Question_Category = cat(question);
Question_Correct = grad(question);
output;
end;
keep id question_:;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Provide a worked example with starting data and the final desired result (is the result a data set or report for people to read).

You likely only need to provide 4 or 5 each of the RESP and CAT variables to demonstrate the basic concept. Show a few different scenarios with some of the ones that you are having difficulty with.

I suspect that you may want to add 3 domain variables indicating which domain is concerned which could perhaps be another 1/0 variable for each record but we likely need to see whether your cat variables are character or numeric and exactly what they contain.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Sorry, my mistake: each item should measure only one of the three domains. The data set looks like below. Resp_1 to Resp_30 are the students' responses (0 or 1) on 30 items. cat_1 to cat30 are domain indicators indicating which domain each item is measuring (1 or 2, or 3).

I want create three datasets for each of the three domains with response variables and a total score for that domain.

Thank you!

obs | resp_1 | resp_2 | resp_2 | …… | resp_30 | cat_1 | cat_2 | cat_3 | …… | cat_30 |

1 | 1 | 1 | 0 | 1 | 1 | 2 | 2 | 3 | ||

2 | 0 | 1 | 1 | 0 | 1 | 2 | 2 | 3 | ||

3 | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 3 | ||

4 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 3 | ||

………………………………………………………. | ||||||||||

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

If I understand the problem perhaps:

data have; input obs resp_1-resp_3 resp_30 cat_1-cat_3 cat_30; datalines; 1 1 1 0 1 1 2 2 3 2 0 1 1 0 1 2 2 3 3 0 0 0 0 1 2 2 3 4 1 1 1 1 1 2 2 3 ; run; data want; set have; array r resp_: ; array c cat_: ; array domain {3}; do i= 1 to dim(c); domain[c[i]]= sum(domain[c[i]],r[i]); end; run; proc means data=want sum; var domain: ; run;

can get you started. You haven't shown what a final result would look like for three datasets for each of the three domains with response variables and a total score for that domain.

If the purpose is to make report that people will read then you likely do not separate data set but kind of need to know what the appearance would actually be.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The desired look for the three datasets is that each data site has student item responses resp_1, resp_3,....,resp_30 (only items measuring that domain) for that domain and the computed domain total. If not too cumbersome, keep the domain indicator of that domain (the values should be the same) as well, cat_1, cat_3,..., cat30.

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Please show an example of your expected output.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

each dataset looks like below. Cat1--cat28 is not necessary but just for checking purpose. Thanks.

obs | resp_1 | resp_5 | resp_10 | …… | resp_28 | cat_1 | cat_5 | cat_10 | …… | cat_28 | cat1_tot |

1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 9 | |

2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 9 | |

3 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 8 | |

4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 |

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Rather than have several back and forth conversations regarding logic, can you please take the sample data from BallardW post and show the EXACT output expected, including the calculations. I only see cat1_total here for example, but I assume you would want cat3_total and are those across ID or just for a single row at a time?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I am sorry for the confusion.

Hope this time is clear:

There are 34 items in total in this test. the 34 items measures 3 content domains.

5 items measures domain 1: item1, item5, item10, item12, item28;

15 items measures domain 2: item2, item3, item4,..., item26;

14 items measure domain 3: item11, item13,..., item34;

I want to create one dataset for each domain. Each dataset should have binary item responses, item domain indicators (values should be the same for all indicator for the same domain) for checking purpose, and total score (sum of item response).

For example, for domain 1, the dataset should look like below. resp_1 to resp_28 are the item responses. cat1 to cat28 are domain indicator. domain1_tot is the sum of resp_1, resp_2, resp_10, resp_12, resp28.

create the same datasets for domain2 and domain3.

obs | resp_1 | resp_5 | resp_10 | resp_12 | resp_28 | cat_1 | cat_5 | cat_10 | cat_12 | cat_28 | domain1_tot |

1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 4 |

2 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 4 |

3 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 3 |

4 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 4 |

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

That clarifies a lot.

```
data domain1;
set have (keep = obs resp_1 resp_5 resp_10 pres_12 resp_28
cat_1 cat_5 cat_10 cat_12 cat_28);
domain_total = sum(of resp_:);
run;
```

Change the variable names for each of the data sets you want and repeat this three times. This could be a little bit more dynamic but this works fine.

@TX_STAR wrote:

I am sorry for the confusion.

Hope this time is clear:

There are 34 items in total in this test. the 34 items measures 3 content domains.

5 items measures domain 1: item1, item5, item10, item12, item28;

15 items measures domain 2: item2, item3, item4,..., item26;

14 items measure domain 3: item11, item13,..., item34;

I want to create one dataset for each domain. Each dataset should have binary item responses, item domain indicators (values should be the same for all indicator for the same domain) for checking purpose, and total score (sum of item response).

For example, for domain 1, the dataset should look like below. resp_1 to resp_28 are the item responses. cat1 to cat28 are domain indicator. domain1_tot is the sum of resp_1, resp_2, resp_10, resp_12, resp28.

create the same datasets for domain2 and domain3.

obs resp_1 resp_5 resp_10 resp_12 resp_28 cat_1 cat_5 cat_10 cat_12 cat_28 domain1_tot 1 1 1 1 1 0 1 1 1 1 1 4 2 1 1 1 1 0 1 1 1 1 1 4 3 1 1 0 1 0 1 1 1 1 1 3 4 1 1 1 0 1 1 1 1 1 1 4

Thanks!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Aha! So each resp variable is in a single domain throughout the data set.

So imagine you could have

data want; set have; cat1_total=sum(of resp_1); cat2_total=sum(of resp_2 resp_3); cat3_total=sum(of resp_30); run;

So all you need is a way to use the data (from just one observation) to generate lists of variable names inside the SUM functions. This does that in the first data step, which makes 3 macrovars holding those 3 lists.

```
/* Make 3 macrovars: listing RESP vars for cat1, cat2, cat3 */
data _null_ ;
set have (keep=cat:);
array list {3} $240;
array ct {*} cat: ;
do i=1 to dim(ct);
list{ct{i}} = catx(' ',list{ct{i}},cats('resp_',i));
end;
call symput('resp_cat1',trim(list{1}));
call symput('resp_cat2',trim(list{2}));
call symput('resp_cat3',trim(list{3}));
stop;
run;
/* Show the macrovars */
%put &=resp_cat1;
%put &=resp_cat2;
%put &=resp_cat3;
/* Use the macrovars */
data want;
set have ;
cat1_total=sum(of &resp_cat1);
cat2_total=sum(of &resp_cat2);
cat3_total=sum(of &resp_cat3);
run;
```

For large datasets, I suspect this could save some time ordinarily used by looping through the 30 variables. And it would probably save more time for even longer lists of variables. Also it's easy to generalize to more categories. And would be easy to generalize to overlapping categories.

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

With some modification, I was able to get the datasets I that need for different subjects with different number of subdomains (with different items) with the sample SAS code.

Thank you all for the help.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@TX_STAR wrote:

Hello Members,

I get stuck with a simple SAS problem.

I have a 60 variables. The first 30 numerical variables (resp_1-resp_30, with 2 categories: 0, 1) indicating students' score on 30 items, with 0 being incorrect and 1 being correct. The next 30 are categorical variables (cat_1-cat_30, with 3 categories: 1,2,3) indicating the content domains each of the 30 item measuring. Each item may measure 1, or 2, or 3 content domains.

I want to calculate domain total score for domain 1, 2, and 3. How should I code?

Thanks so much!

data have; input obs resp_1-resp_3 cat_1-cat_3; datalines; 1 1 1 0 1 2 3 2 0 1 1 1 2 2 3 0 0 0 1 3 2 4 1 1 1 1 2 3 ; run; data want; set have; array grs(*) resp_1-resp_3; array ct(*) cat_1-cat_3; do question = 1 to dim(grs); Question_Category = ct(question); Question_Correct = grs(question); output; end; keep id question_:; run; proc means data=want N Mean SUM maxdec=2; class question_category; var question_correct; run;

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.