number of missing values per observation

Ronein · Posted 09-08-2023 03:52 AM

Hello

I want to calculate for each observation

1- number of numeric missing values (From numeric columns only)

2- number of char missing values (From char columns only)

3-number of columns

4-number of non missing values from all columns (numeric+chars)

This code below produce wrong calculation.

Why?

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
;
run;
proc contents data=have out=cols noprint;
run;
data _null_;
set cols nobs=total;
call symputx('nr_Vars', total);
run;
Data want;
SET have;
nr_Vars=&nr_Vars;
nr_missing_numeric_Vars=nmiss(of _numeric_);
nr_missing_charc_Vars=nmiss(of _character_);
nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_charc_Vars;
Run;

Amir · Posted 09-08-2023 04:06 AM

Hi,

Thanks for supplying code in the manner you have.

I ran your code and checked the log and it showed there were issues with your code.

For starters, please note that nmiss() is for numeric variables only, please check where you can use cmiss() and if you have any more problems, then check the log and get back to us as required.

Thanks & kind regards,

Amir.

Edit: Typo.

Kurt_Bremser · Posted 09-08-2023 05:00 AM

Some optimization and correction:

proc sql noprint;
select count(*) into :nr_vars
from dictionary.columns
where libname = "WORK" and memname = "HAVE";
quit;

data want;
set have;
nr_vars = &nr_vars.;
nr_missing_numeric_vars = nmiss(of _numeric_);
nr_missing_charc_vars = cmiss(of _character_);
nr_non_missing_all_vars =
  nr_vars - nr_missing_numeric_vars - nr_missing_charc_vars
;
run;

Note how the use of whitespace (blanks and line breaks) improves the readability and maintainability of the code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ronein · Posted 09-09-2023 02:09 AM

Thanks,

I run the code based on your reply but as I see nr_missing_numeric_Vars get always value that is not correct (It is getting 1 more than should get), does it mean that always need to subscribe 1 when I calculate nr missing numeric vars?

Why did it happen?

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
. . 10 .
;
run;


proc contents data=have out=cols noprint;
run;
data _null_;
set cols nobs=total;
call symputx('nr_Vars', total);
run;
%put &nr_Vars;

/**Another way to calculate nr vars****/
/*proc sql noprint;*/
/*select count(*) into :nrr_vars*/
/*from dictionary.columns*/
/*where libname = "WORK" and memname = "HAVE";*/
/*quit;*/
/*%put &nr_Vars;*/


Data want;
SET have;
nr_Vars=&nr_Vars.;
nr_missing_numeric_Vars=nmiss(of _numeric_);/***nmiss() is for numeric variables only***/
nr_missing_char_vars = cmiss(of _character_);/***cmiss() is for char variables only***/
nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_char_Vars;
Run;

Kurt_Bremser · Posted 09-09-2023 02:54 AM

I see, we need to make sure that the new variables do not influence the calculation:

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
. . 10 .
;

proc sql noprint;
select count(*) into :nr_vars
from dictionary.columns
where libname = "WORK" and memname = "HAVE";
quit;

data want;
set have;
retain
  nr_vars &nr_vars.
  nr_missing_numeric_vars 0
  nr_missing_charc_vars 0
  nr_non_missing_all_vars 0
;
nr_missing_numeric_vars = nmiss(of _numeric_);
nr_missing_charc_vars = cmiss(of _character_);
nr_non_missing_all_vars =
  nr_vars - nr_missing_numeric_vars - nr_missing_charc_vars
;
run;

Code is now tested on SAS ODA.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ronein · Posted 09-09-2023 03:20 AM

Thanks,

May you please explain why adding the retain statement to initiate var values helped to solve the problem?

Kurt_Bremser · Posted 09-09-2023 03:39 AM

Without it, all variables which will be added will be set to missing at the start of a data step iteration.

I had two options to avoid influencing the NMISS call:

set all new variables to 0 before doing the NMISS
use a RETAIN to initialize once and carry over the values from the previous iteration

I chose the second option, as I find it more "elegant".

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yabwon · Posted 09-08-2023 06:01 AM

If you use arrays you can do it all in one (1) data step:

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
;
run;
proc print;
run;

Data want;
  fake_num=42;
  SET have;
  fake_char="fake";

  drop fake_:;

  array N[*] _numeric_;
  array C[*] _character_;

  nr_missing_numeric_Vars=nmiss(of N[*]);
  nr_missing_charc_Vars  =cmiss(of C[*]);

  nr_Vars=dim(N)+dim(C)-2; /* substract 2 because of "fake" */

  nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_charc_Vars;
Run;

proc print;
run;

I added those two "fake" variables to assure there is at least one numeric and one character variable on the list.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Ronein · Posted 09-09-2023 03:16 AM

Is there a reason that you wrote fake_num=42 before SET statement?

What is the difference in the background if you write fake_num=42 before SET or after SET?

yabwon · Posted 09-10-2023 02:02 PM

In this case there is none. I wrote it:

  fake_num=42;
  SET have;
  fake_char="fake";

because I usually later wrote:

  array N[*] fake_num-_numeric_-fake_char;
  array C[*] fake_num-_character_-fake_char;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

mkeintz · Posted 09-10-2023 10:49 PM

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
. . 10 .
run;
data want;
  set have;
  nmiss_num=nmiss(of _numeric_)-1;  /*Minus 1 because nmiss_num starts out missing, yet is included in _numeric_ */
  nmiss_chr=cmiss(of _character_);  /*So does nmiss_chr but it is not in _character_, so no minus 1 */
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 09-11-2023 07:25 PM

So you have gotten a lot of good answers that address many of the issues in trying to do this type of program than can work for ANY input dataset. There are two main issues that such a requirement to work for any input causes. One is variable name conflicts. The variable name you want to use for one or more your four variables might already exist in the input dataset. And it might not have the right variable type. The other is that some datasets might not have any numeric variable or any characters variables (or perhaps no variables at all). That can cause trouble if you are trying to use an ARRAY to help simplify the coding issue because an ARRAY cannot be defined that contains zero variables.

So some of the methods of dealing with the variable naming problem is to use strange names for the variables to reduce the possibility of name conflicts. There are not really any names you can use that will 100% eliminate the risk of name conflict. But using something like a name with many leading underscores and reduce the risk. You could then later use a RENAME statement (or RENAME= dataset option) to convert these improbably names back to something more user friendly. Something like:

data want;
 ....
 ___vars= ...;
 ___nmiss=....;
 ___cmiss=....;
 ___nonmiss=....;
 keep ___vars -- ___nonmiss;
 rename ___vars = nr_vars ___nmiss=nr_missing_numeric_vars .... ;
run;

To handle the idea of there not being any numeric variables there are a copy of methods. One is to add some new variable that you know is numeric and another that is character to force there to be at least one such variable. For the numeric variable it could be on the 4 variables you were already planning to create to store the numbers you want. But for the character variable it will need to be fifth unique unlikely variable name.

Once you have added this extra variable so that the array statement can be generated you will need to take that into account when you count the number of missing or number of variables.

So now let's see if we can make a complete data step;

data want;
  set have;
  ___nvars= 0 ;
  ___char='x';
  array ___n _numeric_;
  array ___c _character_;
  ___nvars=dim(___n)+dim(___c)-2 ;
  ___nmiss=nmiss(of ___n[*]);
  ___cmiss=cmiss(of ___c[*]);
  ___nonmiss=___nvars - ___nmiss - ___cmiss;
  keep ___n: ____cmiss;
  rename 
    ___nvars = nr_vars
    ___nmiss = nr_missing_numeric_vars
    ___cmiss = nr_missing_char_vars
    ___nonmiss = nr_nonmissing_vars
  ;
run;

Another way to handle the possibility of there not being any numeric of character variables is to analyze the metadata ("data about data", in this case the list of variable names and their types) of the dataset first and then generate different code based on that.

You can use PROC CONTENTS as in your example or query DICTIONARY.COLUMNS to find the metadata. For DICTIONARY.COLUMNS you need to know the libref and the memname of the input dataset. The values of those fields are always uppercase in the DICTIONARY.COLUMNS view so you could use code like this to generate some macro variables.

proc sql noprint;
  select count(*) 
       , case when (max(type='num')) then 'nmiss(of _numeric_)-1' else '0' end
       , case when (max(type='char')) then 'cmiss(of _character_)' else '0' end
   into :nvars trimmed
      , :num trimmed
      , :char trimmed
  from dictionary.columns
  where libname="WORK'
    and memname="HAVE"
;
quit;

You can now use these three macro variables to generate the code to do the counting of missing on the actual data.

data want;
   set have;
    nr_vars = &nvars ;
    nr_missing_numeric_vars = &num ;
    nr_missing_char_vars = &char ;
    nr_nonmissing_vars = nr_vars - nr_missing_numeric_vars - nr_missing_char_vars ;
   keep nr_vars -- nr_nonmissing_vars ;
run;

So when there are no numeric variables the macro variable NUM will have the string 0 the second assignment statement will be

    nr_missing_numeric_vars = 0 ;

And when there are numeric variables then it will be

    nr_missing_numeric_vars = nmiss(of _numeric_)-1 ;

Notice that for the numeric variables the generated code subtracts one. That is because the new variable to store the number of missing numeric variables will exist when the statement is compiled, so it is included in the _NUMERIC_ variable list) but it will not yet have a value when the actual query runs.

number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Re: number of missing values per observation

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!