Solved: Count distinct values by subjects

Joshua017 · Posted 06-22-2018 07:05 PM

I want to create a variable that can count how many "Yes" for each person/record. Is there a better way to do it? My data set actually have more than 3000 records......

Thank you so much!

	Name	Q1	Q2	Q3	Q4	Q5	Yes_count
1	Danny	Yes	Yes	No	No	No
2	Mary	Yes	No	Yes	Yes	Yes
3	John	Yes	Yes	Yes	Yes	No
4	Lisa	No	Yes	Yes	Yes	Yes
5	Mike	No	Yes	No	No	Yes
6	Philip	Yes	Yes	Yes	Yes	No
7	Sui	Yes	Yes	No	No	Yes
8	Zack	No	Yes	No	Yes	No
9	Pam	Yes	Yes	No	No	Yes
10	William	No	No	Yes	No	Yes

FreelanceReinh · Posted 06-29-2018 05:21 AM

Okay, it seems that the Yes/No variables might be too sparse to select them conveniently using name range lists. But perhaps they can be characterized as the "Char 3" variables in the dataset? If so, the following approach should work:

proc sql;
select name into :ynvars separated by ' '
from dictionary.columns
where libname='WORK' & memname='HAVE' & type='char' & length=3;
quit;

data want;
set have;
array qs &ynvars;
yes_count=count(catx('01'x, of qs[*]), 'Yes');
run;

The PROC SQL step selects the names of the "Char 3" variables from metadata and stores them as a blank-separated list in macro variable YNVARS, which is then used in the array definition of the subsequent data step. (Of course, you'll have to replace "WORK" and "HAVE" by your libname and memname.) Obviously, other character variables with length 3, but different contents (e.g. STATE with values 'TX', 'NC' etc.) wouldn't be a problem for the 'Yes' count. I've made the count a bit more robust by using CATX with a "rare" delimiter (in place of CATS, which could "create" a 'Yes' by concatenating something like 'NY' and 'est').

EDIT: To exclude unwanted "Char 3" variables whose names don't start with "Q", you can extend the WHERE condition:

where libname='WORK' & memname='HAVE' & type='char' & length=3 & name eqt 'Q';

View solution in original post

novinosrin · Posted 06-22-2018 07:14 PM

data have;
input obs (Name 	Q1	Q2	Q3	Q4	Q5) ($)	;
cards;
1	Danny	Yes	Yes	No	No	No	 
2	Mary	Yes	No	Yes	Yes	Yes	 
3	John	Yes	Yes	Yes	Yes	No	 
4	Lisa	No	Yes	Yes	Yes	Yes	 
5	Mike	No	Yes	No	No	Yes	 
6	Philip	Yes	Yes	Yes	Yes	No	 
7	Sui	Yes	Yes	No	No	Yes	 
8	Zack	No	Yes	No	Yes	No	 
9	Pam	Yes	Yes	No	No	Yes	 
10	William	No	No	Yes	No	Yes	 
;
data want;
set have;
yes_count=count(cats(of q:),'Yes');
run;

ballardw · Posted 06-22-2018 07:18 PM

@Joshua017 wrote:

I want to create a variable that can count how many "Yes" for each person/record. Is there a better way to do it? My data set actually have more than 3000 records......

Thank you so much!

Name Q1 Q2 Q3 Q4 Q5 Yes_count

1 Danny Yes Yes No No No

2 Mary Yes No Yes Yes Yes

3 John Yes Yes Yes Yes No

4 Lisa No Yes Yes Yes Yes

5 Mike No Yes No No Yes

6 Philip Yes Yes Yes Yes No

7 Sui Yes Yes No No Yes

8 Zack No Yes No Yes No

9 Pam Yes Yes No No Yes

10 William No No Yes No Yes

One thing is that coding with character values for Yes/No is ugly at best and cumbersome and confusing at worst.

data want;
   set have;
   array qs q: ; /* assumes all the variables you want that are coded Yes/no
                 are named starting with Q and that the variables that start with 
                 Q are coded yes/no if not list them out*/
   do i= 1 dim(qs);
      yes_count= sum(yes_count, (qs[i]='Yes'));
   end;
run;

The (qs[I]='Yes') takes advantage of the SAS behavior that a true comparison will return a numeric 1 and false a 0.

If the variables had been coded 1/0 for yes/no to begin with:

yes_count= sum(of q1-q5); (or however many are involved)

Joshua017 · Posted 06-22-2018 07:37 PM

Thank you so much for your reply! Actually I have about 150 variables and about 50 of them have Yes/No and rest of them with Name, date, and demographic information. Is it anyway that I don't have to list all 50 variables out? Thanks again!!

novinosrin · Posted 06-22-2018 07:41 PM

@Joshua017 You can use variable lists like i showed in my code.

or group the 50 in an array q:

array qs q1-q50 ;

yes_count=count(cats(of qs(*)),'Yes');

Are you able to follow both of these solutions?

FreelanceReinh · Posted 06-23-2018 09:25 AM

@Joshua017 wrote:

I have about 150 variables and about 50 of them have Yes/No and rest of them with Name, date, and demographic information. Is it anyway that I don't have to list all 50 variables out? Thanks again!!

Chances are very good that you can abbreviate the list by using one or another type (or even combining different types) of SAS variable lists, as novinosrin mentioned already. If your Yes/No variables do not have a common name prefix, name range lists can be particularly useful. Just apply PROC CONTENTS with option VARNUM to your dataset and see if the (I assume: character) variables in question form one or a few blocks of consecutive rows in the output (possibly interrupted by numeric variables).

Example (focusing on numeric variables, though):

proc contents data=sashelp.heart varnum;
run;

Output (highlighting added):

Result: The eight numeric variables highlighted above can be referred to as

AgeCHDdiag-numeric-Weight MRW--Cholesterol

where appropriate.

Similarly, Chol_Status--Weight_Status (3 variables) or DeathCause-character-BP_Status (4 variables) are valid lists of character variables -- at least as long as the structure (variable order) of the dataset is not changed.

Joshua017 · Posted 06-28-2018 04:22 PM

That's part of the dataset. For variables start with Q, some of them are Yes/No. I want to count how many Yes for each record. How could I put them in a array or I can just count all variables with "Yes"?

FreelanceReinh · Posted 06-29-2018 05:21 AM

Okay, it seems that the Yes/No variables might be too sparse to select them conveniently using name range lists. But perhaps they can be characterized as the "Char 3" variables in the dataset? If so, the following approach should work:

proc sql;
select name into :ynvars separated by ' '
from dictionary.columns
where libname='WORK' & memname='HAVE' & type='char' & length=3;
quit;

data want;
set have;
array qs &ynvars;
yes_count=count(catx('01'x, of qs[*]), 'Yes');
run;

The PROC SQL step selects the names of the "Char 3" variables from metadata and stores them as a blank-separated list in macro variable YNVARS, which is then used in the array definition of the subsequent data step. (Of course, you'll have to replace "WORK" and "HAVE" by your libname and memname.) Obviously, other character variables with length 3, but different contents (e.g. STATE with values 'TX', 'NC' etc.) wouldn't be a problem for the 'Yes' count. I've made the count a bit more robust by using CATX with a "rare" delimiter (in place of CATS, which could "create" a 'Yes' by concatenating something like 'NY' and 'est').

EDIT: To exclude unwanted "Char 3" variables whose names don't start with "Q", you can extend the WHERE condition:

where libname='WORK' & memname='HAVE' & type='char' & length=3 & name eqt 'Q';

Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Re: Count distinct values by subjects

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!