Solved: Re: Counting "Yes" in all columns in a dataset

maggi_beetle · Posted 06-09-2021 06:56 PM

Hello,

I have a huge dataset, with almost 30 variables and a patientid, and variable names of columns are long too(as it's real world data). The columns contain Yes, no or unknwon values, or missing. I need to find number of Yes in all columns per patient. How can I acheive this in just one data step. I know there are other multiple ways , using multiple datasteps to acheive this, but I am curious how to do it in one datastep. I tried this, but it doesn't work:

nyes = countw(MYOCARDIALINFARCTION--AIDS, "Yes");

where first columns name is MYOCARDIALINFARCTION and last column name is AIDS

Thank you

ballardw · Posted 06-09-2021 07:24 PM

Example data in the form of a data step would be a good start so we have something to work with.

Countw counts "words", read the documentation for a word, not specific values of the words.

This is a very good example of why character Yes/No values are a bad idea. If you use 1/0 coded for yes/no then this is trivial:

count = sum(var1,var2, ... , varn);

As it is you will have to parse the values.

data want; 
    set have;
    array yn (*) <list the names of all those yes/no variables here>;
   do i=1 to dim(yn);
     count= sum(count,yn[i] = 'Yes');
  end;
run;

This will count per observation. If you need to total this for multiple observations per patient id then you need to tell us so.

The comparison value needs to be as in the data. I'm not sure if you have mixed case or mixes of Yes and Y or what. The sum works because SAS will return 1/0 for logical comparisons.

Note: you can fix such poor data when you read it with a data step and a custom informat.

proc format;
invalue ynu (upcase)
'YES','Y' = 1
'NO','N'  = 0
'U',"UNK","UNKNOWN" = .U
other= .
;
run;

data example;
   input y :ynu.;
datalines;
Yes
yes
Y
n
No
nO
U
u
Unk
Unknown
.
fred 
NNN
;

Proc freq data=example;
   table y;
   table y/missing;
run;

The custom informat created with the Invalue statement uses the option UPCASE to convert all text to upper case before comparing to the value lists. That means that Yes, yes, yEs and other capitalization problems get fixed. The .U is a special missing. SAS has 27 of these, . plus a letter and ._ . For most purposes such as numeric calculation they are treated as missing BUT as shown with the second table statement in the Proc freq if you want to know about the missing values they appear and can be counted.

If your data had been read with similar then likely you could use

nyes = sum(of MYOCARDIALINFARCTION--AIDS);

If you had read the values as single characters so they were all Y, N, U then you could use the COUNTC function:

nyes = countc(cats(of MYOCARDIALINFARCTION--AIDS),'Y');

Countc does count occurences of specific single characters. The Cats function would combine all the varaibles into a single string that could be counted. This could work with Yes values depending on what actually appears in the "unknown" or "missing" values because it counts Y. But likely would not work for N because "UNKNOWN" has 3 N's that would get counted each time that appears as a value.

View solution in original post

ballardw · Posted 06-09-2021 07:24 PM

Example data in the form of a data step would be a good start so we have something to work with.

Countw counts "words", read the documentation for a word, not specific values of the words.

This is a very good example of why character Yes/No values are a bad idea. If you use 1/0 coded for yes/no then this is trivial:

count = sum(var1,var2, ... , varn);

As it is you will have to parse the values.

data want; 
    set have;
    array yn (*) <list the names of all those yes/no variables here>;
   do i=1 to dim(yn);
     count= sum(count,yn[i] = 'Yes');
  end;
run;

This will count per observation. If you need to total this for multiple observations per patient id then you need to tell us so.

The comparison value needs to be as in the data. I'm not sure if you have mixed case or mixes of Yes and Y or what. The sum works because SAS will return 1/0 for logical comparisons.

Note: you can fix such poor data when you read it with a data step and a custom informat.

proc format;
invalue ynu (upcase)
'YES','Y' = 1
'NO','N'  = 0
'U',"UNK","UNKNOWN" = .U
other= .
;
run;

data example;
   input y :ynu.;
datalines;
Yes
yes
Y
n
No
nO
U
u
Unk
Unknown
.
fred 
NNN
;

Proc freq data=example;
   table y;
   table y/missing;
run;

The custom informat created with the Invalue statement uses the option UPCASE to convert all text to upper case before comparing to the value lists. That means that Yes, yes, yEs and other capitalization problems get fixed. The .U is a special missing. SAS has 27 of these, . plus a letter and ._ . For most purposes such as numeric calculation they are treated as missing BUT as shown with the second table statement in the Proc freq if you want to know about the missing values they appear and can be counted.

If your data had been read with similar then likely you could use

nyes = sum(of MYOCARDIALINFARCTION--AIDS);

If you had read the values as single characters so they were all Y, N, U then you could use the COUNTC function:

nyes = countc(cats(of MYOCARDIALINFARCTION--AIDS),'Y');

Countc does count occurences of specific single characters. The Cats function would combine all the varaibles into a single string that could be counted. This could work with Yes values depending on what actually appears in the "unknown" or "missing" values because it counts Y. But likely would not work for N because "UNKNOWN" has 3 N's that would get counted each time that appears as a value.

Tom · Posted 06-09-2021 11:46 PM

COUNTW() counts the number of words without regard to their values.

COUNT() can count the number of times a substring appears.

If the only possible values of yes,no,unknown and blank then simple concatenation should work.

nyes = count(cats( of MYOCARDIALINFARCTION--AIDS), "Yes",'i');

Counting "Yes" in all columns in a dataset

Re: Counting "Yes" in all columns in a dataset

Re: Counting "Yes" in all columns in a dataset

Re: Counting "Yes" in all columns in a dataset

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away