Obsidian | Level 7

## selecting variables based on categorical values

Hello Members,

I get stuck with a simple SAS problem.

I have a 60 variables. The first 30 numerical variables (resp_1-resp_30, with 2 categories: 0, 1) indicating students' score on 30 items, with 0 being incorrect and 1 being correct. The next 30 are categorical variables (cat_1-cat_30, with 3 categories: 1,2,3) indicating the content domains each of the 30 item measuring. Each item may measure 1, or 2, or 3 content domains.

I want to calculate domain total score for domain 1, 2, and 3. How should I code?

Thanks so much!

1 ACCEPTED SOLUTION

Accepted Solutions

## Re: selecting variables based on categorical values

Aha!  So each resp variable is in a single domain throughout the data set.

So imagine you could have

```data want;
set have;
cat1_total=sum(of resp_1);
cat2_total=sum(of resp_2 resp_3);
cat3_total=sum(of resp_30);
run;```

So all you need is a way to use the data (from just one observation) to generate lists of variable names inside the SUM functions.  This does that in the first data step, which makes 3 macrovars holding those 3 lists.

``````/* Make 3 macrovars: listing RESP vars for cat1, cat2, cat3 */
data _null_ ;
set have (keep=cat:);
array list {3} \$240;

array ct {*} cat: ;
do i=1 to dim(ct);
list{ct{i}} = catx(' ',list{ct{i}},cats('resp_',i));
end;
call symput('resp_cat1',trim(list{1}));
call symput('resp_cat2',trim(list{2}));
call symput('resp_cat3',trim(list{3}));
stop;
run;

/* Show the macrovars */
%put &=resp_cat1;
%put &=resp_cat2;
%put &=resp_cat3;

/* Use the macrovars */
data want;
set have ;

cat1_total=sum(of &resp_cat1);
cat2_total=sum(of &resp_cat2);
cat3_total=sum(of &resp_cat3);
run;``````

For large datasets, I suspect this could save some time ordinarily used by looping through the 30 variables.  And it would probably save more time for even longer lists of variables.  Also it's easy to generalize to more categories.  And would be easy to generalize to overlapping categories.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
13 REPLIES 13
Super User

## Re: selecting variables based on categorical values

Transpose your data so that you have a long structure of variables instead:

Student Question Response Category

PROC TRANSPOSE will work but it's likely easier to use a data step.

``````data want;
set have;

array cat(*) category1-category30;

do question = 1 to 30;
Question_Category = cat(question);
output;
end;

keep id question_:;
run;``````
Super User

## Re: selecting variables based on categorical values

Provide a worked example with starting data and the final desired result (is the result a data set or report for people to read).

You likely only need to provide 4 or 5 each of the RESP and CAT variables to demonstrate the basic concept. Show a few different scenarios with some of the ones that you are having difficulty with.

I suspect that you may want to add 3 domain variables indicating which domain is concerned which could perhaps be another 1/0 variable for each record but we likely need to see whether your cat variables are character or numeric and exactly what they contain.

Obsidian | Level 7

## Re: selecting variables based on categorical values

Sorry, my mistake: each item should measure only one of the three domains. The data set looks like below. Resp_1 to Resp_30 are the students' responses (0 or 1) on 30 items. cat_1 to cat30 are domain indicators indicating which domain each item is measuring (1 or 2, or 3).

I want create three datasets for each of the three domains with response variables and a total score for that domain.

Thank you!

 obs resp_1 resp_2 resp_2 …… resp_30 cat_1 cat_2 cat_3 …… cat_30 1 1 1 0 1 1 2 2 3 2 0 1 1 0 1 2 2 3 3 0 0 0 0 1 2 2 3 4 1 1 1 1 1 2 2 3 ……………………………………………………….
Super User

## Re: selecting variables based on categorical values

If I understand the problem perhaps:

```data have;
input obs resp_1-resp_3 resp_30 cat_1-cat_3 cat_30;
datalines;
1 1 1 0 1 1 2 2 3
2 0 1 1 0 1 2 2 3
3 0 0 0 0 1 2 2 3
4 1 1 1 1 1 2 2 3
;
run;

data want;
set have;
array r resp_: ;
array c cat_:  ;
array domain {3};
do i= 1 to dim(c);
domain[c[i]]= sum(domain[c[i]],r[i]);
end;
run;

proc means data=want sum;
var domain: ;
run;```

can get you started. You haven't shown what a final result would look like for three datasets for each of the three domains with response variables and a total score for that domain.

If the purpose is to make report that people will read then you likely do not separate data set but kind of need to know what the appearance would actually be.

Obsidian | Level 7

## Re: selecting variables based on categorical values

The desired look for the three datasets is that each data site has student item responses resp_1, resp_3,....,resp_30 (only items measuring that domain) for that domain and the computed domain total. If not too cumbersome, keep the domain indicator of that domain (the values should be the same) as well, cat_1, cat_3,..., cat30.

Thanks!

Super User

## Re: selecting variables based on categorical values

Obsidian | Level 7

## Re: selecting variables based on categorical values

each dataset looks like below. Cat1--cat28 is not necessary but just for checking purpose. Thanks.

 obs resp_1 resp_5 resp_10 …… resp_28 cat_1 cat_5 cat_10 …… cat_28 cat1_tot 1 1 1 1 0 1 1 1 1 1 9 2 1 1 1 0 1 1 1 1 1 9 3 1 1 0 0 1 1 1 1 1 8 4 1 1 1 1 1 1 1 1 1 10
Super User

## Re: selecting variables based on categorical values

Rather than have several back and forth conversations regarding logic, can you please take the sample data from BallardW post and show the EXACT output expected, including the calculations. I only see cat1_total here for example, but I assume you would want cat3_total and are those across ID or just for a single row at a time?
Obsidian | Level 7

## Re: selecting variables based on categorical values

I am sorry for the confusion.

Hope this time is clear:

There are 34 items in total in this test. the 34 items measures 3 content domains.

5 items measures domain 1: item1, item5, item10, item12, item28;

15 items measures domain 2: item2, item3, item4,..., item26;

14 items measure domain 3: item11, item13,..., item34;

I want to create one dataset for each domain. Each dataset should have binary item responses, item domain indicators (values should be the same for all indicator for the same domain) for checking purpose, and total score (sum of item response).

For example, for domain 1, the dataset should look like below. resp_1 to resp_28 are the item responses. cat1 to cat28 are domain indicator. domain1_tot is the sum of resp_1, resp_2, resp_10, resp_12, resp28.

create the same datasets for domain2 and domain3.

 obs resp_1 resp_5 resp_10 resp_12 resp_28 cat_1 cat_5 cat_10 cat_12 cat_28 domain1_tot 1 1 1 1 1 0 1 1 1 1 1 4 2 1 1 1 1 0 1 1 1 1 1 4 3 1 1 0 1 0 1 1 1 1 1 3 4 1 1 1 0 1 1 1 1 1 1 4

Thanks!

Super User

## Re: selecting variables based on categorical values

That clarifies a lot.

``````data domain1;
set have (keep =  obs resp_1 resp_5 resp_10 pres_12 resp_28
cat_1 cat_5 cat_10 cat_12 cat_28);

domain_total = sum(of resp_:);

run;``````

Change the variable names for each of the data sets you want and repeat this three times. This could be a little bit more dynamic but this works fine.

@TX_STAR wrote:

I am sorry for the confusion.

Hope this time is clear:

There are 34 items in total in this test. the 34 items measures 3 content domains.

5 items measures domain 1: item1, item5, item10, item12, item28;

15 items measures domain 2: item2, item3, item4,..., item26;

14 items measure domain 3: item11, item13,..., item34;

I want to create one dataset for each domain. Each dataset should have binary item responses, item domain indicators (values should be the same for all indicator for the same domain) for checking purpose, and total score (sum of item response).

For example, for domain 1, the dataset should look like below. resp_1 to resp_28 are the item responses. cat1 to cat28 are domain indicator. domain1_tot is the sum of resp_1, resp_2, resp_10, resp_12, resp28.

create the same datasets for domain2 and domain3.

 obs resp_1 resp_5 resp_10 resp_12 resp_28 cat_1 cat_5 cat_10 cat_12 cat_28 domain1_tot 1 1 1 1 1 0 1 1 1 1 1 4 2 1 1 1 1 0 1 1 1 1 1 4 3 1 1 0 1 0 1 1 1 1 1 3 4 1 1 1 0 1 1 1 1 1 1 4

Thanks!

## Re: selecting variables based on categorical values

Aha!  So each resp variable is in a single domain throughout the data set.

So imagine you could have

```data want;
set have;
cat1_total=sum(of resp_1);
cat2_total=sum(of resp_2 resp_3);
cat3_total=sum(of resp_30);
run;```

So all you need is a way to use the data (from just one observation) to generate lists of variable names inside the SUM functions.  This does that in the first data step, which makes 3 macrovars holding those 3 lists.

``````/* Make 3 macrovars: listing RESP vars for cat1, cat2, cat3 */
data _null_ ;
set have (keep=cat:);
array list {3} \$240;

array ct {*} cat: ;
do i=1 to dim(ct);
list{ct{i}} = catx(' ',list{ct{i}},cats('resp_',i));
end;
call symput('resp_cat1',trim(list{1}));
call symput('resp_cat2',trim(list{2}));
call symput('resp_cat3',trim(list{3}));
stop;
run;

/* Show the macrovars */
%put &=resp_cat1;
%put &=resp_cat2;
%put &=resp_cat3;

/* Use the macrovars */
data want;
set have ;

cat1_total=sum(of &resp_cat1);
cat2_total=sum(of &resp_cat2);
cat3_total=sum(of &resp_cat3);
run;``````

For large datasets, I suspect this could save some time ordinarily used by looping through the 30 variables.  And it would probably save more time for even longer lists of variables.  Also it's easy to generalize to more categories.  And would be easy to generalize to overlapping categories.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Obsidian | Level 7

## Re: selecting variables based on categorical values

With some modification, I was able to get the datasets I that need for different subjects with different number of subdomains (with different items) with the sample SAS code.

Thank you all for the help.

Super User

## Re: selecting variables based on categorical values

@TX_STAR wrote:

Hello Members,

I get stuck with a simple SAS problem.

I have a 60 variables. The first 30 numerical variables (resp_1-resp_30, with 2 categories: 0, 1) indicating students' score on 30 items, with 0 being incorrect and 1 being correct. The next 30 are categorical variables (cat_1-cat_30, with 3 categories: 1,2,3) indicating the content domains each of the 30 item measuring. Each item may measure 1, or 2, or 3 content domains.

I want to calculate domain total score for domain 1, 2, and 3. How should I code?

Thanks so much!

```data have;
input obs resp_1-resp_3  cat_1-cat_3;
datalines;
1 1 1 0  1 2 3
2 0 1 1  1 2 2
3 0 0 0  1 3 2
4 1 1 1  1 2 3
;
run;

data want;
set have;

array grs(*) resp_1-resp_3;
array ct(*) cat_1-cat_3;

do question = 1 to dim(grs);
Question_Category = ct(question);
Question_Correct = grs(question);
output;
end;

keep id question_:;
run;

proc means data=want N Mean SUM maxdec=2;
class question_category;
var question_correct;
run;
```
Discussion stats
• 13 replies
• 1170 views
• 0 likes
• 4 in conversation