SAS Programming

wlierman · Posted 05-12-2021 06:12 PM

I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.)

What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes) . I don't want the No or Yes counted just if there was a response or if the obs is blank / missing. So hypothetically the table could be

Variable name # non-missing # missing Total percent_missing

DEYEdi 45,000 18,000 63,000 28.5

DEARdi 50,000 13,000 63,000 20.6

I would alos like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing maybe

for zipcodes, counties, cities etc.

Thank you for your help. (I will open another question for counting missing/nonmissingnumeric vars in the dataset.)

wlierman

Tom · Posted 05-12-2021 08:30 PM

Try this macro (click on spoiler link to reveal the macro definition code).

Spoiler

%macro percent_missing(input_dsn,output_dsn);
%if not %length(&input_dsn) %then %let input_dsn=&syslast;

* create format for missing;

proc format;
  value $missfmt ' ' = "Missing"  other = "Not_Missing";
  value nmissfmt low-high ="Not_Missing" other="Missing";
run;

* turns off the output so the results do not get too messy;
ods select none;

ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN.;
  table _all_ / missing missprint;
  format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;

data &OUTPUT_DSN ;
  length __v__ $32. __m__ __n__ __t__ __p__ 8 ;
  set temp;
  by table notsorted ;
  if first.table then call missing(of __:);
  __v__ = substr(table,7);
  if vvaluex(__v__)='Missing' then __m__=frequency;
  else __n__=frequency;
  retain __m__ __n__;
  if last.table then do;
     __m__=sum(0,__m__);
     __n__=sum(0,__n__);
     __t__=sum(__m__,__n__);
     __p__ = 100*divide(__m__,__t__);
     output;
  end;
  keep __v__ __m__ __n__ __t__ __p__ ;
  rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total;
  format __m__ __n__ __t__ comma11. __p__ 6.2;
run;
%mend;

%macro percent_missing(input_dsn,output_dsn); %if not %length(&input_dsn) %then %let input_dsn=&syslast; * create format for missing; proc format; value $missfmt ' ' = "Missing" other = "Not_Missing"; value nmissfmt low-high ="Not_Missing" other="Missing"; run; * turns off the output so the results do not get too messy; ods select none; ods table onewayfreqs=temp; proc freq data=&INPUT_DSN.; table _all_ / missing missprint; format _numeric_ nmissfmt. _character_ $missfmt.; run; ods select all; data &OUTPUT_DSN ; length __v__ $32. __m__ __n__ __t__ __p__ 8 ; set temp; by table notsorted ; if first.table then call missing(of __:); __v__ = substr(table,7); if vvaluex(__v__)='Missing' then __m__=frequency; else __n__=frequency; retain __m__ __n__; if last.table then do; __m__=sum(0,__m__); __n__=sum(0,__n__); __t__=sum(__m__,__n__); __p__ = 100*divide(__m__,__t__); output; end; keep __v__ __m__ __n__ __t__ __p__ ; rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total; format __m__ __n__ __t__ comma11. __p__ 6.2; run; %mend;

Examples:

%percent_missing(sashelp.heart,example1);
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          5,209          5,209      0.00
  2    DeathCause              3,218          1,991          5,209     61.78
  3    AgeCHDdiag              3,760          1,449          5,209     72.18
  4    Sex                         0          5,209          5,209      0.00
  5    AgeAtStart                  0          5,209          5,209      0.00
  6    Height                      6          5,203          5,209      0.12
  7    Weight                      6          5,203          5,209      0.12
  8    Diastolic                   0          5,209          5,209      0.00
  9    Systolic                    0          5,209          5,209      0.00
 10    MRW                         6          5,203          5,209      0.12
 11    Smoking                    36          5,173          5,209      0.69
 12    AgeAtDeath              3,218          1,991          5,209     61.78
 13    Cholesterol               152          5,057          5,209      2.92
 14    Chol_Status               152          5,057          5,209      2.92
 15    BP_Status                   0          5,209          5,209      0.00
 16    Weight_Status               6          5,203          5,209      0.12
 17    Smoking_Status             36          5,173          5,209      0.69

With a KEEP= dataset option:

%percent_missing(sashelp.heart(keep=chol:),cholesterol);
proc print; run;

Results:

Obs       name            missing        present          total    percent

 1     Cholesterol            152          5,057          5,209      2.92
 2     Chol_Status            152          5,057          5,209      2.92

With WHERE= dataset option:

%percent_missing(sashelp.heart(where=(not missing(DeathCause))),deaths );
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          1,991          1,991      0.00
  2    DeathCause                  0          1,991          1,991      0.00
  3    AgeCHDdiag              1,097            894          1,991     55.10
  4    Sex                         0          1,991          1,991      0.00
  5    AgeAtStart                  0          1,991          1,991      0.00
  6    Height                      5          1,986          1,991      0.25
  7    Weight                      3          1,988          1,991      0.15
  8    Diastolic                   0          1,991          1,991      0.00
  9    Systolic                    0          1,991          1,991      0.00
 10    MRW                         3          1,988          1,991      0.15
 11    Smoking                    20          1,971          1,991      1.00
 12    AgeAtDeath                  0          1,991          1,991      0.00
 13    Cholesterol                69          1,922          1,991      3.47
 14    Chol_Status                69          1,922          1,991      3.47
 15    BP_Status                   0          1,991          1,991      0.00
 16    Weight_Status               3          1,988          1,991      0.15
 17    Smoking_Status             20          1,971          1,991      1.00

View solution in original post

PaigeMiller · Posted 05-12-2021 06:15 PM

@wlierman wrote:

What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes) . I don't want the No or Yes counted just if there was a response or if the obs is blank / missing. So hypothetically the table could be

    Variable name                   # non-missing                # missing                Total        percent_missing

      DEYEdi                               45,000                           18,000               63,000                28.5

      DEARdi                               50,000                           13,000               63,000               20.6

I'm afraid I don't see a relationship between your sentences that I quoted above, and the data. Please explain.

Also, please provide a representative portion of your data that illustrates the problem following these instructions (and not via any other method), PLUS the desired output.

--
Paige Miller

wlierman · Posted 05-12-2021 06:18 PM

I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.)

What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of numeric observations that have some type of response (age at when a disability begins) . So hypothetically the table could be

Variable_name # missing # non-missing total Percent_missing

DEYEage 61,000 2,000 63,000 96.8

DEARage 59,900 3,100 63,000 95.0

I would also like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing maybe

for annual income, population etc

Thank you for your help.

wlierman

Tom · Posted 05-12-2021 07:47 PM

Your report looks like a simple report of number missing that the trick with PROC FREQ special formats and stacked ODS output produces. As already answered here. https://communities.sas.com/t5/SAS-Programming/Counting-missing-and-non-missing-obs/m-p/740272#M2312...

What do you mean by "chunks"?

Do you mean a subset of variables? Try adding dataset option KEEP= or DROP=.

proc freq data=have(keep= de:);
  tables _all_ ;
  format _numeric_ nmissfmt.;
  format _character_ $missfmt.;
run;

A subset of observations? Try add an WHERE statement.

Do you mean you want that wrapped into a macro where you can pass in both the input dataset and the input list of variables to check?

Or are you asking to test something different. Like how many observations are missing for ALL of this subset of the variables in the dataset? For example how many observations did not have any answers on a multiple question section of a survey.

Provide some test data to demonstrate what you are looking to calculate. To get real answers provide real example of data in the form of a data step. Make the example small enough that you can just type it into a data step. You really only need two or three numeric and two or three character variables and probably about 10 observations to provide an example. Once you have the example manual count up what you expect the output to be for that example.

Tom · Posted 05-12-2021 08:30 PM

Try this macro (click on spoiler link to reveal the macro definition code).

Spoiler

%macro percent_missing(input_dsn,output_dsn);
%if not %length(&input_dsn) %then %let input_dsn=&syslast;

* create format for missing;

proc format;
  value $missfmt ' ' = "Missing"  other = "Not_Missing";
  value nmissfmt low-high ="Not_Missing" other="Missing";
run;

* turns off the output so the results do not get too messy;
ods select none;

ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN.;
  table _all_ / missing missprint;
  format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;

data &OUTPUT_DSN ;
  length __v__ $32. __m__ __n__ __t__ __p__ 8 ;
  set temp;
  by table notsorted ;
  if first.table then call missing(of __:);
  __v__ = substr(table,7);
  if vvaluex(__v__)='Missing' then __m__=frequency;
  else __n__=frequency;
  retain __m__ __n__;
  if last.table then do;
     __m__=sum(0,__m__);
     __n__=sum(0,__n__);
     __t__=sum(__m__,__n__);
     __p__ = 100*divide(__m__,__t__);
     output;
  end;
  keep __v__ __m__ __n__ __t__ __p__ ;
  rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total;
  format __m__ __n__ __t__ comma11. __p__ 6.2;
run;
%mend;

%macro percent_missing(input_dsn,output_dsn); %if not %length(&input_dsn) %then %let input_dsn=&syslast; * create format for missing; proc format; value $missfmt ' ' = "Missing" other = "Not_Missing"; value nmissfmt low-high ="Not_Missing" other="Missing"; run; * turns off the output so the results do not get too messy; ods select none; ods table onewayfreqs=temp; proc freq data=&INPUT_DSN.; table _all_ / missing missprint; format _numeric_ nmissfmt. _character_ $missfmt.; run; ods select all; data &OUTPUT_DSN ; length __v__ $32. __m__ __n__ __t__ __p__ 8 ; set temp; by table notsorted ; if first.table then call missing(of __:); __v__ = substr(table,7); if vvaluex(__v__)='Missing' then __m__=frequency; else __n__=frequency; retain __m__ __n__; if last.table then do; __m__=sum(0,__m__); __n__=sum(0,__n__); __t__=sum(__m__,__n__); __p__ = 100*divide(__m__,__t__); output; end; keep __v__ __m__ __n__ __t__ __p__ ; rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total; format __m__ __n__ __t__ comma11. __p__ 6.2; run; %mend;

Examples:

%percent_missing(sashelp.heart,example1);
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          5,209          5,209      0.00
  2    DeathCause              3,218          1,991          5,209     61.78
  3    AgeCHDdiag              3,760          1,449          5,209     72.18
  4    Sex                         0          5,209          5,209      0.00
  5    AgeAtStart                  0          5,209          5,209      0.00
  6    Height                      6          5,203          5,209      0.12
  7    Weight                      6          5,203          5,209      0.12
  8    Diastolic                   0          5,209          5,209      0.00
  9    Systolic                    0          5,209          5,209      0.00
 10    MRW                         6          5,203          5,209      0.12
 11    Smoking                    36          5,173          5,209      0.69
 12    AgeAtDeath              3,218          1,991          5,209     61.78
 13    Cholesterol               152          5,057          5,209      2.92
 14    Chol_Status               152          5,057          5,209      2.92
 15    BP_Status                   0          5,209          5,209      0.00
 16    Weight_Status               6          5,203          5,209      0.12
 17    Smoking_Status             36          5,173          5,209      0.69

With a KEEP= dataset option:

%percent_missing(sashelp.heart(keep=chol:),cholesterol);
proc print; run;

Results:

Obs       name            missing        present          total    percent

 1     Cholesterol            152          5,057          5,209      2.92
 2     Chol_Status            152          5,057          5,209      2.92

With WHERE= dataset option:

%percent_missing(sashelp.heart(where=(not missing(DeathCause))),deaths );
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          1,991          1,991      0.00
  2    DeathCause                  0          1,991          1,991      0.00
  3    AgeCHDdiag              1,097            894          1,991     55.10
  4    Sex                         0          1,991          1,991      0.00
  5    AgeAtStart                  0          1,991          1,991      0.00
  6    Height                      5          1,986          1,991      0.25
  7    Weight                      3          1,988          1,991      0.15
  8    Diastolic                   0          1,991          1,991      0.00
  9    Systolic                    0          1,991          1,991      0.00
 10    MRW                         3          1,988          1,991      0.15
 11    Smoking                    20          1,971          1,991      1.00
 12    AgeAtDeath                  0          1,991          1,991      0.00
 13    Cholesterol                69          1,922          1,991      3.47
 14    Chol_Status                69          1,922          1,991      3.47
 15    BP_Status                   0          1,991          1,991      0.00
 16    Weight_Status               3          1,988          1,991      0.15
 17    Smoking_Status             20          1,971          1,991      1.00

wlierman · Posted 05-13-2021 04:58 PM

Hello,
By chunks I meant different portions of the dataset - demographics for the contact; race and ethn icity break outs; language; and the disability section that you have helped me on. While all the REALD data is sparsely p0opulated, the disability section is the most sparse.
Your earlier coding helped to a solution. So I will label it as a solution.
Thanks again.
wlierman

Reeza · Posted 05-12-2021 06:28 PM

Sample data.

ballardw · Posted 05-12-2021 06:43 PM

Please try this:

data junk;
   input a $ b $ C $;
datalines;
123 . 456
dds lkjhl .
abc 78jk  .
.   .     .
;
data need;
   set junk;
   array _c (*) _character_;
   do i= 1 to dim(_c);
      varname = vname(_c[i]);
      x   = missing(_c[i]);
      x2 =x;
      output;
   end;
   keep varname x x2;
run; 

proc format;
value ismiss
1='Missing'
0='Not Missing'
;
run;

proc tabulate data=need;
   class varname;
   class x;
   format x ismiss.;
   var x2  ;

   table varname, 
     (x=' ' all='Total')*n=' '  x2="Percent missing"*mean=''*f=percent8.1
   ;
run;

The array _C in the NEED could use any list of like variables, either a list of character or a list of numeric variables but not a mixture, basic array rule.

If you wanted to do both character and numeric create a different array, such as _n for the numeric variables and have a second loop that uses _n instead of _c.

The Missing function returns values of 1(true or missing) and 0 (false or not missing) for either character or numeric variables. So we create the same reporting variables regardless of starting variable type and the report procedure doesn't need any changes as long as the data set looks like "need".

The variables will most likely come out in alphabetical order but remember upper case 'Z' is before 'a'. So if your variables are initially defined with mixed case not the order you expect. If that is problem you can UPCASE or LOWCASE the Varname variable in the Need data step.

wlierman · Posted 05-13-2021 06:35 PM

Your code approach also produces the output that I am needing.

I would also post your email as a solution along with the solution from Tom.

But I don't know how to submit a second solution.

Thanks for your help.

wlierman

SAS Programming

Method to count the number of non-missing and missing character values in

Re: How to count missing and non-missing numeric variables in a data set

Re: Method to count the number of non-missing and missing character values in

How to count missing and non-missing numeric variables in a data set

Re: How to count missing and non-missing numeric variables in a data set

Re: How to count missing and non-missing numeric variables in a data set

Re: How to count missing and non-missing numeric variables in a data set

Re: Method to count the number of non-missing and missing character values in

Re: Method to count the number of non-missing and missing character values in

Re: Method to count the number of non-missing and missing character values in

Selection of control for cases by date (missing and non missing)

Count number of missing dates within group

fill missing values with known non-missing values

Counting non missing with proc freq; remove zeros from 'not missing'

Find first 5 non missing character values

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...