BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

I want to calculate for each observation

1- number of numeric missing values (From numeric columns only)

2- number of char missing values (From char columns only)

3-number of columns

4-number of non missing values from all columns (numeric+chars)

This code below produce wrong calculation.

Why?

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
;
run;
proc contents data=have out=cols noprint;
run;
data _null_;
set cols nobs=total;
call symputx('nr_Vars', total);
run;
Data want;
SET have;
nr_Vars=&nr_Vars;
nr_missing_numeric_Vars=nmiss(of _numeric_);
nr_missing_charc_Vars=nmiss(of _character_);
nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_charc_Vars;
Run;

 

11 REPLIES 11
Amir
PROC Star

Hi,

 

Thanks for supplying code in the manner you have.

 

I ran your code and checked the log and it showed there were issues with your code.

 

For starters, please note that nmiss() is for numeric variables only, please check where you can use cmiss() and if you have any more problems, then check the log and get back to us as required.

 

 

Thanks & kind regards,

Amir.

 

Edit: Typo.

Kurt_Bremser
Super User

Some optimization and correction:

proc sql noprint;
select count(*) into :nr_vars
from dictionary.columns
where libname = "WORK" and memname = "HAVE";
quit;

data want;
set have;
nr_vars = &nr_vars.;
nr_missing_numeric_vars = nmiss(of _numeric_);
nr_missing_charc_vars = cmiss(of _character_);
nr_non_missing_all_vars =
  nr_vars - nr_missing_numeric_vars - nr_missing_charc_vars
;
run;

Note how the use of whitespace (blanks and line breaks) improves the readability and maintainability of the code.

Ronein
Meteorite | Level 14

Thanks,

I run the code based on your reply but as I see nr_missing_numeric_Vars get always value that is not correct (It is getting 1 more than should get), does it mean that always  need to subscribe 1  when I calculate nr missing numeric vars?

Why did it happen?

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
. . 10 .
;
run;


proc contents data=have out=cols noprint;
run;
data _null_;
set cols nobs=total;
call symputx('nr_Vars', total);
run;
%put &nr_Vars;

/**Another way to calculate nr vars****/
/*proc sql noprint;*/
/*select count(*) into :nrr_vars*/
/*from dictionary.columns*/
/*where libname = "WORK" and memname = "HAVE";*/
/*quit;*/
/*%put &nr_Vars;*/


Data want;
SET have;
nr_Vars=&nr_Vars.;
nr_missing_numeric_Vars=nmiss(of _numeric_);/***nmiss() is for numeric variables only***/
nr_missing_char_vars = cmiss(of _character_);/***cmiss() is for char variables only***/
nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_char_Vars;
Run;

 

 

Kurt_Bremser
Super User

I see, we need to make sure that the new variables do not influence the calculation:

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
. . 10 .
;

proc sql noprint;
select count(*) into :nr_vars
from dictionary.columns
where libname = "WORK" and memname = "HAVE";
quit;

data want;
set have;
retain
  nr_vars &nr_vars.
  nr_missing_numeric_vars 0
  nr_missing_charc_vars 0
  nr_non_missing_all_vars 0
;
nr_missing_numeric_vars = nmiss(of _numeric_);
nr_missing_charc_vars = cmiss(of _character_);
nr_non_missing_all_vars =
  nr_vars - nr_missing_numeric_vars - nr_missing_charc_vars
;
run;

Code is now tested on SAS ODA.

Ronein
Meteorite | Level 14

Thanks,

May you please explain why adding the retain statement to initiate var values helped to solve the problem?

Kurt_Bremser
Super User

Without it, all variables which will be added will be set to missing at the start of a data step iteration.

I had two options to avoid influencing the NMISS call:

  • set all new variables to 0 before doing the NMISS
  • use a RETAIN to initialize once and carry over the values from the previous iteration

I chose the second option, as I find it more "elegant".

yabwon
Onyx | Level 15

If you use arrays you can do it all in one (1) data step:

data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
;
run;
proc print;
run;

Data want;
  fake_num=42;
  SET have;
  fake_char="fake";

  drop fake_:;

  array N[*] _numeric_;
  array C[*] _character_;

  nr_missing_numeric_Vars=nmiss(of N[*]);
  nr_missing_charc_Vars  =cmiss(of C[*]);

  nr_Vars=dim(N)+dim(C)-2; /* substract 2 because of "fake" */

  nr_non_Missing_All_Vars=nr_Vars-nr_missing_numeric_Vars-nr_missing_charc_Vars;
Run;

proc print;
run;

I added those two "fake" variables to assure there is at least one numeric and one character variable on the list.

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Ronein
Meteorite | Level 14

Is there a reason that you wrote fake_num=42  before SET statement?

What is the difference in the background if you write fake_num=42  before SET or after SET?

yabwon
Onyx | Level 15

In this case there is none. I wrote it:

  fake_num=42;
  SET have;
  fake_char="fake";

because I usually later wrote:

  array N[*] fake_num-_numeric_-fake_char;
  array C[*] fake_num-_character_-fake_char;

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



mkeintz
PROC Star
data have;
input team $ pos $ X t;
cards;
A G 10 8
B F 4 .
. F 7 10
D C . 14
E F . 10
F G 12 7
G C . 11
A A . .
. . 10 .
run;
data want;
  set have;
  nmiss_num=nmiss(of _numeric_)-1;  /*Minus 1 because nmiss_num starts out missing, yet is included in _numeric_ */
  nmiss_chr=cmiss(of _character_);  /*So does nmiss_chr but it is not in _character_, so no minus 1 */
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

So you have gotten a lot of good answers that address many of the issues in trying to do this type of program than can work for ANY input dataset.  There are two main issues that such a requirement to work for any input causes.  One is variable name conflicts.  The variable name you want to use for one or more your four variables might already exist in the input dataset.  And it might not have the right variable type.  The other is that some datasets might not have any numeric variable or any characters variables (or perhaps no variables at all).  That can cause trouble if you are trying to use an ARRAY to help simplify the coding issue because an ARRAY cannot be defined that contains zero variables.

 

So some of the methods of dealing with the variable naming problem is to use strange names for the variables to reduce the possibility of name conflicts.  There are not really any names you can use that will 100% eliminate the risk of name conflict.  But using something like a name with many leading underscores and reduce the risk.  You could then later use a RENAME statement (or RENAME= dataset option) to convert these improbably names back to something more user friendly.  Something like:

data want;
 ....
 ___vars= ...;
 ___nmiss=....;
 ___cmiss=....;
 ___nonmiss=....;
 keep ___vars -- ___nonmiss;
 rename ___vars = nr_vars ___nmiss=nr_missing_numeric_vars .... ;
run;

To handle the idea of there not being any numeric variables there are a copy of methods.  One is to add some new variable that you know is numeric and another that is character to force there to be at least one such variable.  For the numeric variable it could be on the 4 variables you were already planning to create to store the numbers you want.  But for the character variable it will need to be fifth unique unlikely variable name.

 

Once you have added this extra variable so that the array statement can be generated you will need to take that into account when you count the number of missing or number of variables.

 

So now let's see if we can make a complete data step;

data want;
  set have;
  ___nvars= 0 ;
  ___char='x';
  array ___n _numeric_;
  array ___c _character_;
  ___nvars=dim(___n)+dim(___c)-2 ;
  ___nmiss=nmiss(of ___n[*]);
  ___cmiss=cmiss(of ___c[*]);
  ___nonmiss=___nvars - ___nmiss - ___cmiss;
  keep ___n: ____cmiss;
  rename 
    ___nvars = nr_vars
    ___nmiss = nr_missing_numeric_vars
    ___cmiss = nr_missing_char_vars
    ___nonmiss = nr_nonmissing_vars
  ;
run;

Another way to handle the possibility of there not being any numeric of character variables is to analyze the metadata ("data about data", in this case the list of variable names and their types) of the dataset first and then generate different code based on that.

 

You can use PROC CONTENTS as in your example or query DICTIONARY.COLUMNS to find the metadata.  For DICTIONARY.COLUMNS you need to know the libref and the memname of the input dataset.   The values of those fields are always uppercase in the DICTIONARY.COLUMNS view so you could use code like this to generate some macro variables.

 

proc sql noprint;
  select count(*) 
       , case when (max(type='num')) then 'nmiss(of _numeric_)-1' else '0' end
       , case when (max(type='char')) then 'cmiss(of _character_)' else '0' end
   into :nvars trimmed
      , :num trimmed
      , :char trimmed
  from dictionary.columns
  where libname="WORK'
    and memname="HAVE"
;
quit;

You can now use these three macro variables to generate the code to do the counting of missing on the actual data.  

data want;
   set have;
    nr_vars = &nvars ;
    nr_missing_numeric_vars = &num ;
    nr_missing_char_vars = &char ;
    nr_nonmissing_vars = nr_vars - nr_missing_numeric_vars - nr_missing_char_vars ;
   keep nr_vars -- nr_nonmissing_vars ;
run;

So when there are no numeric variables the macro variable NUM will have the string 0 the second assignment statement will be 

    nr_missing_numeric_vars = 0 ;

And when there are numeric variables then it will be 

    nr_missing_numeric_vars = nmiss(of _numeric_)-1 ;

Notice that for the numeric variables the generated code subtracts one.  That is because the new variable to store the number of missing numeric variables will exist when the statement is compiled, so it is included in the _NUMERIC_ variable list) but it will not yet have a value when the actual query runs.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1120 views
  • 4 likes
  • 6 in conversation