BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wlierman
Lapis Lazuli | Level 10

I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.)

 

What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes) .  I don't want the No or Yes counted just if there was a response or if the obs is blank / missing. So hypothetically the table could be

 

    Variable name                   # non-missing                # missing                Total        percent_missing

      DEYEdi                               45,000                           18,000               63,000                28.5

      DEARdi                               50,000                           13,000               63,000               20.6

 

I would alos like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing maybe

for zipcodes, counties,  cities etc.

 

Thank you for your help. (I will open another question for counting missing/nonmissingnumeric vars in the dataset.)

 

wlierman

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Try this macro (click on spoiler link to reveal the macro definition code).

Spoiler
%macro percent_missing(input_dsn,output_dsn);
%if not %length(&input_dsn) %then %let input_dsn=&syslast;

* create format for missing;

proc format;
  value $missfmt ' ' = "Missing"  other = "Not_Missing";
  value nmissfmt low-high ="Not_Missing" other="Missing";
run;

* turns off the output so the results do not get too messy;
ods select none;

ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN.;
  table _all_ / missing missprint;
  format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;

data &OUTPUT_DSN ;
  length __v__ $32. __m__ __n__ __t__ __p__ 8 ;
  set temp;
  by table notsorted ;
  if first.table then call missing(of __:);
  __v__ = substr(table,7);
  if vvaluex(__v__)='Missing' then __m__=frequency;
  else __n__=frequency;
  retain __m__ __n__;
  if last.table then do;
     __m__=sum(0,__m__);
     __n__=sum(0,__n__);
     __t__=sum(__m__,__n__);
     __p__ = 100*divide(__m__,__t__);
     output;
  end;
  keep __v__ __m__ __n__ __t__ __p__ ;
  rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total;
  format __m__ __n__ __t__ comma11. __p__ 6.2;
run;
%mend;

Examples:

%percent_missing(sashelp.heart,example1);
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          5,209          5,209      0.00
  2    DeathCause              3,218          1,991          5,209     61.78
  3    AgeCHDdiag              3,760          1,449          5,209     72.18
  4    Sex                         0          5,209          5,209      0.00
  5    AgeAtStart                  0          5,209          5,209      0.00
  6    Height                      6          5,203          5,209      0.12
  7    Weight                      6          5,203          5,209      0.12
  8    Diastolic                   0          5,209          5,209      0.00
  9    Systolic                    0          5,209          5,209      0.00
 10    MRW                         6          5,203          5,209      0.12
 11    Smoking                    36          5,173          5,209      0.69
 12    AgeAtDeath              3,218          1,991          5,209     61.78
 13    Cholesterol               152          5,057          5,209      2.92
 14    Chol_Status               152          5,057          5,209      2.92
 15    BP_Status                   0          5,209          5,209      0.00
 16    Weight_Status               6          5,203          5,209      0.12
 17    Smoking_Status             36          5,173          5,209      0.69

With a KEEP= dataset option:

%percent_missing(sashelp.heart(keep=chol:),cholesterol);
proc print; run;

Results:

Obs       name            missing        present          total    percent

 1     Cholesterol            152          5,057          5,209      2.92
 2     Chol_Status            152          5,057          5,209      2.92

With WHERE= dataset option:

%percent_missing(sashelp.heart(where=(not missing(DeathCause))),deaths );
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          1,991          1,991      0.00
  2    DeathCause                  0          1,991          1,991      0.00
  3    AgeCHDdiag              1,097            894          1,991     55.10
  4    Sex                         0          1,991          1,991      0.00
  5    AgeAtStart                  0          1,991          1,991      0.00
  6    Height                      5          1,986          1,991      0.25
  7    Weight                      3          1,988          1,991      0.15
  8    Diastolic                   0          1,991          1,991      0.00
  9    Systolic                    0          1,991          1,991      0.00
 10    MRW                         3          1,988          1,991      0.15
 11    Smoking                    20          1,971          1,991      1.00
 12    AgeAtDeath                  0          1,991          1,991      0.00
 13    Cholesterol                69          1,922          1,991      3.47
 14    Chol_Status                69          1,922          1,991      3.47
 15    BP_Status                   0          1,991          1,991      0.00
 16    Weight_Status               3          1,988          1,991      0.15
 17    Smoking_Status             20          1,971          1,991      1.00

 

 

 

View solution in original post

8 REPLIES 8
PaigeMiller
Diamond | Level 26

@wlierman wrote:

 

What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes) .  I don't want the No or Yes counted just if there was a response or if the obs is blank / missing. So hypothetically the table could be

 

    Variable name                   # non-missing                # missing                Total        percent_missing

      DEYEdi                               45,000                           18,000               63,000                28.5

      DEARdi                               50,000                           13,000               63,000               20.6

 


I'm afraid I don't see a relationship between your sentences that I quoted above, and the data. Please explain.

 

Also, please provide a representative portion of your data that illustrates the problem following these instructions (and not via any other method), PLUS the desired output.

--
Paige Miller
wlierman
Lapis Lazuli | Level 10

I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.)

 

What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of numeric observations that have some type of response (age at when a disability begins) .  So hypothetically the table could be

 

   Variable_name              # missing           # non-missing        total           Percent_missing

     DEYEage                      61,000                    2,000             63,000               96.8

    DEARage                       59,900                    3,100             63,000               95.0

 

I would also like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing maybe

for annual income, population etc

 

Thank you for your help.

 

wlierman

Tom
Super User Tom
Super User

Your report looks like a simple report of number missing that the trick with PROC FREQ special formats and stacked ODS output produces.   As already answered here.  https://communities.sas.com/t5/SAS-Programming/Counting-missing-and-non-missing-obs/m-p/740272#M2312...

 

What do you mean by "chunks"? 

  Do you mean a subset of variables? Try adding dataset option KEEP= or DROP=.

proc freq data=have(keep= de:);
  tables _all_ ;
  format _numeric_ nmissfmt.;
  format _character_ $missfmt.;
run;

  A subset of observations? Try add an WHERE statement.

 

Do you mean you want that wrapped into a macro where you can pass in both the input dataset and the input list of variables to check?

 

Or are you asking to test something different.  Like how many observations are missing for ALL of this subset of the variables in the dataset?  For example how many observations did not have any answers on a multiple question section of a survey.

 

Provide some test data to demonstrate what you are looking to calculate. To get real answers  provide real example of data in the form of a data step.  Make the example small enough that you can just type it into a data step.  You really only need two or three numeric and two or three character variables and probably about 10 observations to provide an example.   Once you have the example manual count up what you expect the output to be for that example.  

Tom
Super User Tom
Super User

Try this macro (click on spoiler link to reveal the macro definition code).

Spoiler
%macro percent_missing(input_dsn,output_dsn);
%if not %length(&input_dsn) %then %let input_dsn=&syslast;

* create format for missing;

proc format;
  value $missfmt ' ' = "Missing"  other = "Not_Missing";
  value nmissfmt low-high ="Not_Missing" other="Missing";
run;

* turns off the output so the results do not get too messy;
ods select none;

ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN.;
  table _all_ / missing missprint;
  format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;

data &OUTPUT_DSN ;
  length __v__ $32. __m__ __n__ __t__ __p__ 8 ;
  set temp;
  by table notsorted ;
  if first.table then call missing(of __:);
  __v__ = substr(table,7);
  if vvaluex(__v__)='Missing' then __m__=frequency;
  else __n__=frequency;
  retain __m__ __n__;
  if last.table then do;
     __m__=sum(0,__m__);
     __n__=sum(0,__n__);
     __t__=sum(__m__,__n__);
     __p__ = 100*divide(__m__,__t__);
     output;
  end;
  keep __v__ __m__ __n__ __t__ __p__ ;
  rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total;
  format __m__ __n__ __t__ comma11. __p__ 6.2;
run;
%mend;

Examples:

%percent_missing(sashelp.heart,example1);
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          5,209          5,209      0.00
  2    DeathCause              3,218          1,991          5,209     61.78
  3    AgeCHDdiag              3,760          1,449          5,209     72.18
  4    Sex                         0          5,209          5,209      0.00
  5    AgeAtStart                  0          5,209          5,209      0.00
  6    Height                      6          5,203          5,209      0.12
  7    Weight                      6          5,203          5,209      0.12
  8    Diastolic                   0          5,209          5,209      0.00
  9    Systolic                    0          5,209          5,209      0.00
 10    MRW                         6          5,203          5,209      0.12
 11    Smoking                    36          5,173          5,209      0.69
 12    AgeAtDeath              3,218          1,991          5,209     61.78
 13    Cholesterol               152          5,057          5,209      2.92
 14    Chol_Status               152          5,057          5,209      2.92
 15    BP_Status                   0          5,209          5,209      0.00
 16    Weight_Status               6          5,203          5,209      0.12
 17    Smoking_Status             36          5,173          5,209      0.69

With a KEEP= dataset option:

%percent_missing(sashelp.heart(keep=chol:),cholesterol);
proc print; run;

Results:

Obs       name            missing        present          total    percent

 1     Cholesterol            152          5,057          5,209      2.92
 2     Chol_Status            152          5,057          5,209      2.92

With WHERE= dataset option:

%percent_missing(sashelp.heart(where=(not missing(DeathCause))),deaths );
proc print; run;

Results:

Obs    name                  missing        present          total    percent

  1    Status                      0          1,991          1,991      0.00
  2    DeathCause                  0          1,991          1,991      0.00
  3    AgeCHDdiag              1,097            894          1,991     55.10
  4    Sex                         0          1,991          1,991      0.00
  5    AgeAtStart                  0          1,991          1,991      0.00
  6    Height                      5          1,986          1,991      0.25
  7    Weight                      3          1,988          1,991      0.15
  8    Diastolic                   0          1,991          1,991      0.00
  9    Systolic                    0          1,991          1,991      0.00
 10    MRW                         3          1,988          1,991      0.15
 11    Smoking                    20          1,971          1,991      1.00
 12    AgeAtDeath                  0          1,991          1,991      0.00
 13    Cholesterol                69          1,922          1,991      3.47
 14    Chol_Status                69          1,922          1,991      3.47
 15    BP_Status                   0          1,991          1,991      0.00
 16    Weight_Status               3          1,988          1,991      0.15
 17    Smoking_Status             20          1,971          1,991      1.00

 

 

 

wlierman
Lapis Lazuli | Level 10
Hello,
By chunks I meant different portions of the dataset -  demographics for the contact; race and ethn icity break outs; language; and the disability section that you have helped me on.  While all the REALD data is sparsely p0opulated, the disability section is the most sparse.
Your earlier coding helped to a solution.  So I will label it as a solution.
Thanks again.
wlierman

ballardw
Super User

Please try this:

data junk;
   input a $ b $ C $;
datalines;
123 . 456
dds lkjhl .
abc 78jk  .
.   .     .
;
data need;
   set junk;
   array _c (*) _character_;
   do i= 1 to dim(_c);
      varname = vname(_c[i]);
      x   = missing(_c[i]);
      x2 =x;
      output;
   end;
   keep varname x x2;
run; 

proc format;
value ismiss
1='Missing'
0='Not Missing'
;
run;

proc tabulate data=need;
   class varname;
   class x;
   format x ismiss.;
   var x2  ;

   table varname, 
     (x=' ' all='Total')*n=' '  x2="Percent missing"*mean=''*f=percent8.1
   ;
run;

The array _C in the NEED could use any list of like variables, either a list of character or a list of numeric variables but not a mixture, basic array rule.

 

If you wanted to do both character and numeric create a different array, such as _n for the numeric variables and have a second loop that uses _n instead of _c.

 

The Missing function returns values of 1(true or missing) and 0 (false or not missing) for either character or numeric variables. So we create the same reporting variables regardless of starting variable type and the report procedure doesn't need any changes as long as the data set looks like "need".

The variables will most likely come out in alphabetical order but remember upper case 'Z' is before 'a'. So if your variables are initially defined with mixed case not the order you expect. If that is problem you can UPCASE or LOWCASE the Varname variable in the Need data step.

wlierman
Lapis Lazuli | Level 10

Your code approach also produces the output that I am needing.

 

I would also post your email as a solution along with the solution from Tom.

 

But I don't know how to submit a second solution.

 

Thanks for your help.

 

wlierman

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 3325 views
  • 3 likes
  • 5 in conversation