I want to calculate for each observation
1- number of numeric missing values (From numeric columns only)
2- number of char missing values (From char columns only)
3-number of columns
4-number of non missing values from all columns (numeric+chars)
This code below produce wrong calculation.
data have; input team $ pos $ X t; cards; A G 10 8 B F 4 . . F 7 10 D C . 14 E F . 10 F G 12 7 G C . 11 A A . . ; run; proc contents data=have out=cols noprint; run; data _null_; set cols nobs=total; call symputx('nr_Vars', total); run; Data want; SET have; nr_Vars=&nr_Vars; nr_missing_numeric_Vars=nmiss(of _numeric_); nr_missing_charc_Vars=nmiss(of _character_); nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_charc_Vars; Run;
Thanks for supplying code in the manner you have.
I ran your code and checked the log and it showed there were issues with your code.
For starters, please note that nmiss() is for numeric variables only, please check where you can use cmiss() and if you have any more problems, then check the log and get back to us as required.
Thanks & kind regards,
Some optimization and correction:
proc sql noprint; select count(*) into :nr_vars from dictionary.columns where libname = "WORK" and memname = "HAVE"; quit; data want; set have; nr_vars = &nr_vars.; nr_missing_numeric_vars = nmiss(of _numeric_); nr_missing_charc_vars = cmiss(of _character_); nr_non_missing_all_vars = nr_vars - nr_missing_numeric_vars - nr_missing_charc_vars ; run;
Note how the use of whitespace (blanks and line breaks) improves the readability and maintainability of the code.
I run the code based on your reply but as I see nr_missing_numeric_Vars get always value that is not correct (It is getting 1 more than should get), does it mean that always need to subscribe 1 when I calculate nr missing numeric vars?
Why did it happen?
data have; input team $ pos $ X t; cards; A G 10 8 B F 4 . . F 7 10 D C . 14 E F . 10 F G 12 7 G C . 11 A A . . . . 10 . ; run; proc contents data=have out=cols noprint; run; data _null_; set cols nobs=total; call symputx('nr_Vars', total); run; %put &nr_Vars; /**Another way to calculate nr vars****/ /*proc sql noprint;*/ /*select count(*) into :nrr_vars*/ /*from dictionary.columns*/ /*where libname = "WORK" and memname = "HAVE";*/ /*quit;*/ /*%put &nr_Vars;*/ Data want; SET have; nr_Vars=&nr_Vars.; nr_missing_numeric_Vars=nmiss(of _numeric_);/***nmiss() is for numeric variables only***/ nr_missing_char_vars = cmiss(of _character_);/***cmiss() is for char variables only***/ nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_char_Vars; Run;
I see, we need to make sure that the new variables do not influence the calculation:
data have; input team $ pos $ X t; cards; A G 10 8 B F 4 . . F 7 10 D C . 14 E F . 10 F G 12 7 G C . 11 A A . . . . 10 . ; proc sql noprint; select count(*) into :nr_vars from dictionary.columns where libname = "WORK" and memname = "HAVE"; quit; data want; set have; retain nr_vars &nr_vars. nr_missing_numeric_vars 0 nr_missing_charc_vars 0 nr_non_missing_all_vars 0 ; nr_missing_numeric_vars = nmiss(of _numeric_); nr_missing_charc_vars = cmiss(of _character_); nr_non_missing_all_vars = nr_vars - nr_missing_numeric_vars - nr_missing_charc_vars ; run;
Code is now tested on SAS ODA.
Without it, all variables which will be added will be set to missing at the start of a data step iteration.
I had two options to avoid influencing the NMISS call:
I chose the second option, as I find it more "elegant".
If you use arrays you can do it all in one (1) data step:
data have; input team $ pos $ X t; cards; A G 10 8 B F 4 . . F 7 10 D C . 14 E F . 10 F G 12 7 G C . 11 A A . . ; run; proc print; run; Data want; fake_num=42; SET have; fake_char="fake"; drop fake_:; array N[*] _numeric_; array C[*] _character_; nr_missing_numeric_Vars=nmiss(of N[*]); nr_missing_charc_Vars =cmiss(of C[*]); nr_Vars=dim(N)+dim(C)-2; /* substract 2 because of "fake" */ nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_charc_Vars; Run; proc print; run;
I added those two "fake" variables to assure there is at least one numeric and one character variable on the list.
In this case there is none. I wrote it:
fake_num=42; SET have; fake_char="fake";
because I usually later wrote:
array N[*] fake_num-_numeric_-fake_char; array C[*] fake_num-_character_-fake_char;
data have; input team $ pos $ X t; cards; A G 10 8 B F 4 . . F 7 10 D C . 14 E F . 10 F G 12 7 G C . 11 A A . . . . 10 . run; data want; set have; nmiss_num=nmiss(of _numeric_)-1; /*Minus 1 because nmiss_num starts out missing, yet is included in _numeric_ */ nmiss_chr=cmiss(of _character_); /*So does nmiss_chr but it is not in _character_, so no minus 1 */ run;
So you have gotten a lot of good answers that address many of the issues in trying to do this type of program than can work for ANY input dataset. There are two main issues that such a requirement to work for any input causes. One is variable name conflicts. The variable name you want to use for one or more your four variables might already exist in the input dataset. And it might not have the right variable type. The other is that some datasets might not have any numeric variable or any characters variables (or perhaps no variables at all). That can cause trouble if you are trying to use an ARRAY to help simplify the coding issue because an ARRAY cannot be defined that contains zero variables.
So some of the methods of dealing with the variable naming problem is to use strange names for the variables to reduce the possibility of name conflicts. There are not really any names you can use that will 100% eliminate the risk of name conflict. But using something like a name with many leading underscores and reduce the risk. You could then later use a RENAME statement (or RENAME= dataset option) to convert these improbably names back to something more user friendly. Something like:
data want; .... ___vars= ...; ___nmiss=....; ___cmiss=....; ___nonmiss=....; keep ___vars -- ___nonmiss; rename ___vars = nr_vars ___nmiss=nr_missing_numeric_vars .... ; run;
To handle the idea of there not being any numeric variables there are a copy of methods. One is to add some new variable that you know is numeric and another that is character to force there to be at least one such variable. For the numeric variable it could be on the 4 variables you were already planning to create to store the numbers you want. But for the character variable it will need to be fifth unique unlikely variable name.
Once you have added this extra variable so that the array statement can be generated you will need to take that into account when you count the number of missing or number of variables.
So now let's see if we can make a complete data step;
data want; set have; ___nvars= 0 ; ___char='x'; array ___n _numeric_; array ___c _character_; ___nvars=dim(___n)+dim(___c)-2 ; ___nmiss=nmiss(of ___n[*]); ___cmiss=cmiss(of ___c[*]); ___nonmiss=___nvars - ___nmiss - ___cmiss; keep ___n: ____cmiss; rename ___nvars = nr_vars ___nmiss = nr_missing_numeric_vars ___cmiss = nr_missing_char_vars ___nonmiss = nr_nonmissing_vars ; run;
Another way to handle the possibility of there not being any numeric of character variables is to analyze the metadata ("data about data", in this case the list of variable names and their types) of the dataset first and then generate different code based on that.
You can use PROC CONTENTS as in your example or query DICTIONARY.COLUMNS to find the metadata. For DICTIONARY.COLUMNS you need to know the libref and the memname of the input dataset. The values of those fields are always uppercase in the DICTIONARY.COLUMNS view so you could use code like this to generate some macro variables.
proc sql noprint; select count(*) , case when (max(type='num')) then 'nmiss(of _numeric_)-1' else '0' end , case when (max(type='char')) then 'cmiss(of _character_)' else '0' end into :nvars trimmed , :num trimmed , :char trimmed from dictionary.columns where libname="WORK' and memname="HAVE" ; quit;
You can now use these three macro variables to generate the code to do the counting of missing on the actual data.
data want; set have; nr_vars = &nvars ; nr_missing_numeric_vars = &num ; nr_missing_char_vars = &char ; nr_nonmissing_vars = nr_vars - nr_missing_numeric_vars - nr_missing_char_vars ; keep nr_vars -- nr_nonmissing_vars ; run;
So when there are no numeric variables the macro variable NUM will have the string 0 the second assignment statement will be
nr_missing_numeric_vars = 0 ;
And when there are numeric variables then it will be
nr_missing_numeric_vars = nmiss(of _numeric_)-1 ;
Notice that for the numeric variables the generated code subtracts one. That is because the new variable to store the number of missing numeric variables will exist when the statement is compiled, so it is included in the _NUMERIC_ variable list) but it will not yet have a value when the actual query runs.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.