Hello,
I have a huge dataset, with almost 30 variables and a patientid, and variable names of columns are long too(as it's real world data). The columns contain Yes, no or unknwon values, or missing. I need to find number of Yes in all columns per patient. How can I acheive this in just one data step. I know there are other multiple ways , using multiple datasteps to acheive this, but I am curious how to do it in one datastep. I tried this, but it doesn't work:
nyes = countw(MYOCARDIALINFARCTION--AIDS, "Yes");
where first columns name is MYOCARDIALINFARCTION and last column name is AIDS
Thank you
Example data in the form of a data step would be a good start so we have something to work with.
Countw counts "words", read the documentation for a word, not specific values of the words.
This is a very good example of why character Yes/No values are a bad idea. If you use 1/0 coded for yes/no then this is trivial:
count = sum(var1,var2, ... , varn);
As it is you will have to parse the values.
data want; set have; array yn (*) <list the names of all those yes/no variables here>; do i=1 to dim(yn); count= sum(count,yn[i] = 'Yes'); end; run;
This will count per observation. If you need to total this for multiple observations per patient id then you need to tell us so.
The comparison value needs to be as in the data. I'm not sure if you have mixed case or mixes of Yes and Y or what. The sum works because SAS will return 1/0 for logical comparisons.
Note: you can fix such poor data when you read it with a data step and a custom informat.
proc format; invalue ynu (upcase) 'YES','Y' = 1 'NO','N' = 0 'U',"UNK","UNKNOWN" = .U other= . ; run; data example; input y :ynu.; datalines; Yes yes Y n No nO U u Unk Unknown . fred NNN ; Proc freq data=example; table y; table y/missing; run;
The custom informat created with the Invalue statement uses the option UPCASE to convert all text to upper case before comparing to the value lists. That means that Yes, yes, yEs and other capitalization problems get fixed. The .U is a special missing. SAS has 27 of these, . plus a letter and ._ . For most purposes such as numeric calculation they are treated as missing BUT as shown with the second table statement in the Proc freq if you want to know about the missing values they appear and can be counted.
If your data had been read with similar then likely you could use
nyes = sum(of MYOCARDIALINFARCTION--AIDS);
If you had read the values as single characters so they were all Y, N, U then you could use the COUNTC function:
nyes = countc(cats(of MYOCARDIALINFARCTION--AIDS),'Y');
Countc does count occurences of specific single characters. The Cats function would combine all the varaibles into a single string that could be counted. This could work with Yes values depending on what actually appears in the "unknown" or "missing" values because it counts Y. But likely would not work for N because "UNKNOWN" has 3 N's that would get counted each time that appears as a value.
Example data in the form of a data step would be a good start so we have something to work with.
Countw counts "words", read the documentation for a word, not specific values of the words.
This is a very good example of why character Yes/No values are a bad idea. If you use 1/0 coded for yes/no then this is trivial:
count = sum(var1,var2, ... , varn);
As it is you will have to parse the values.
data want; set have; array yn (*) <list the names of all those yes/no variables here>; do i=1 to dim(yn); count= sum(count,yn[i] = 'Yes'); end; run;
This will count per observation. If you need to total this for multiple observations per patient id then you need to tell us so.
The comparison value needs to be as in the data. I'm not sure if you have mixed case or mixes of Yes and Y or what. The sum works because SAS will return 1/0 for logical comparisons.
Note: you can fix such poor data when you read it with a data step and a custom informat.
proc format; invalue ynu (upcase) 'YES','Y' = 1 'NO','N' = 0 'U',"UNK","UNKNOWN" = .U other= . ; run; data example; input y :ynu.; datalines; Yes yes Y n No nO U u Unk Unknown . fred NNN ; Proc freq data=example; table y; table y/missing; run;
The custom informat created with the Invalue statement uses the option UPCASE to convert all text to upper case before comparing to the value lists. That means that Yes, yes, yEs and other capitalization problems get fixed. The .U is a special missing. SAS has 27 of these, . plus a letter and ._ . For most purposes such as numeric calculation they are treated as missing BUT as shown with the second table statement in the Proc freq if you want to know about the missing values they appear and can be counted.
If your data had been read with similar then likely you could use
nyes = sum(of MYOCARDIALINFARCTION--AIDS);
If you had read the values as single characters so they were all Y, N, U then you could use the COUNTC function:
nyes = countc(cats(of MYOCARDIALINFARCTION--AIDS),'Y');
Countc does count occurences of specific single characters. The Cats function would combine all the varaibles into a single string that could be counted. This could work with Yes values depending on what actually appears in the "unknown" or "missing" values because it counts Y. But likely would not work for N because "UNKNOWN" has 3 N's that would get counted each time that appears as a value.
COUNTW() counts the number of words without regard to their values.
COUNT() can count the number of times a substring appears.
If the only possible values of yes,no,unknown and blank then simple concatenation should work.
nyes = count(cats( of MYOCARDIALINFARCTION--AIDS), "Yes",'i');
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.