- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.)
What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes) . I don't want the No or Yes counted just if there was a response or if the obs is blank / missing. So hypothetically the table could be
Variable name # non-missing # missing Total percent_missing
DEYEdi 45,000 18,000 63,000 28.5
DEARdi 50,000 13,000 63,000 20.6
I would alos like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing maybe
for zipcodes, counties, cities etc.
Thank you for your help. (I will open another question for counting missing/nonmissingnumeric vars in the dataset.)
wlierman
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Try this macro (click on spoiler link to reveal the macro definition code).
%macro percent_missing(input_dsn,output_dsn);
%if not %length(&input_dsn) %then %let input_dsn=&syslast;
* create format for missing;
proc format;
value $missfmt ' ' = "Missing" other = "Not_Missing";
value nmissfmt low-high ="Not_Missing" other="Missing";
run;
* turns off the output so the results do not get too messy;
ods select none;
ods table onewayfreqs=temp;
proc freq data=&INPUT_DSN.;
table _all_ / missing missprint;
format _numeric_ nmissfmt. _character_ $missfmt.;
run;
ods select all;
data &OUTPUT_DSN ;
length __v__ $32. __m__ __n__ __t__ __p__ 8 ;
set temp;
by table notsorted ;
if first.table then call missing(of __:);
__v__ = substr(table,7);
if vvaluex(__v__)='Missing' then __m__=frequency;
else __n__=frequency;
retain __m__ __n__;
if last.table then do;
__m__=sum(0,__m__);
__n__=sum(0,__n__);
__t__=sum(__m__,__n__);
__p__ = 100*divide(__m__,__t__);
output;
end;
keep __v__ __m__ __n__ __t__ __p__ ;
rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total;
format __m__ __n__ __t__ comma11. __p__ 6.2;
run;
%mend;
Examples:
%percent_missing(sashelp.heart,example1);
proc print; run;
Results:
Obs name missing present total percent 1 Status 0 5,209 5,209 0.00 2 DeathCause 3,218 1,991 5,209 61.78 3 AgeCHDdiag 3,760 1,449 5,209 72.18 4 Sex 0 5,209 5,209 0.00 5 AgeAtStart 0 5,209 5,209 0.00 6 Height 6 5,203 5,209 0.12 7 Weight 6 5,203 5,209 0.12 8 Diastolic 0 5,209 5,209 0.00 9 Systolic 0 5,209 5,209 0.00 10 MRW 6 5,203 5,209 0.12 11 Smoking 36 5,173 5,209 0.69 12 AgeAtDeath 3,218 1,991 5,209 61.78 13 Cholesterol 152 5,057 5,209 2.92 14 Chol_Status 152 5,057 5,209 2.92 15 BP_Status 0 5,209 5,209 0.00 16 Weight_Status 6 5,203 5,209 0.12 17 Smoking_Status 36 5,173 5,209 0.69
With a KEEP= dataset option:
%percent_missing(sashelp.heart(keep=chol:),cholesterol);
proc print; run;
Results:
Obs name missing present total percent 1 Cholesterol 152 5,057 5,209 2.92 2 Chol_Status 152 5,057 5,209 2.92
With WHERE= dataset option:
%percent_missing(sashelp.heart(where=(not missing(DeathCause))),deaths );
proc print; run;
Results:
Obs name missing present total percent 1 Status 0 1,991 1,991 0.00 2 DeathCause 0 1,991 1,991 0.00 3 AgeCHDdiag 1,097 894 1,991 55.10 4 Sex 0 1,991 1,991 0.00 5 AgeAtStart 0 1,991 1,991 0.00 6 Height 5 1,986 1,991 0.25 7 Weight 3 1,988 1,991 0.15 8 Diastolic 0 1,991 1,991 0.00 9 Systolic 0 1,991 1,991 0.00 10 MRW 3 1,988 1,991 0.15 11 Smoking 20 1,971 1,991 1.00 12 AgeAtDeath 0 1,991 1,991 0.00 13 Cholesterol 69 1,922 1,991 3.47 14 Chol_Status 69 1,922 1,991 3.47 15 BP_Status 0 1,991 1,991 0.00 16 Weight_Status 3 1,988 1,991 0.15 17 Smoking_Status 20 1,971 1,991 1.00
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@wlierman wrote:
What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of observations that have some type of response (could be No or Yes) . I don't want the No or Yes counted just if there was a response or if the obs is blank / missing. So hypothetically the table could be
Variable name # non-missing # missing Total percent_missing
DEYEdi 45,000 18,000 63,000 28.5
DEARdi 50,000 13,000 63,000 20.6
I'm afraid I don't see a relationship between your sentences that I quoted above, and the data. Please explain.
Also, please provide a representative portion of your data that illustrates the problem following these instructions (and not via any other method), PLUS the desired output.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset that has over 181,000 rows and upwards of 80-90 fields. It is not a huge, huge dataset but it is very sparse. (It is based on epidemiologist survey of covid-19 cases - so not everyone supplied some type of response.)
What I need is code that will allow me to check various chunks (sections of columns) to get a count of the number of numeric observations that have some type of response (age at when a disability begins) . So hypothetically the table could be
Variable_name # missing # non-missing total Percent_missing
DEYEage 61,000 2,000 63,000 96.8
DEARage 59,900 3,100 63,000 95.0
I would also like to be able to "reuse" the code to test other parts of the dataset for missing/nonmissing maybe
for annual income, population etc
Thank you for your help.
wlierman
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your report looks like a simple report of number missing that the trick with PROC FREQ special formats and stacked ODS output produces. As already answered here. https://communities.sas.com/t5/SAS-Programming/Counting-missing-and-non-missing-obs/m-p/740272#M2312...
What do you mean by "chunks"?
Do you mean a subset of variables? Try adding dataset option KEEP= or DROP=.
proc freq data=have(keep= de:);
tables _all_ ;
format _numeric_ nmissfmt.;
format _character_ $missfmt.;
run;
A subset of observations? Try add an WHERE statement.
Do you mean you want that wrapped into a macro where you can pass in both the input dataset and the input list of variables to check?
Or are you asking to test something different. Like how many observations are missing for ALL of this subset of the variables in the dataset? For example how many observations did not have any answers on a multiple question section of a survey.
Provide some test data to demonstrate what you are looking to calculate. To get real answers provide real example of data in the form of a data step. Make the example small enough that you can just type it into a data step. You really only need two or three numeric and two or three character variables and probably about 10 observations to provide an example. Once you have the example manual count up what you expect the output to be for that example.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Try this macro (click on spoiler link to reveal the macro definition code).
%macro percent_missing(input_dsn,output_dsn);
%if not %length(&input_dsn) %then %let input_dsn=&syslast;
* create format for missing;
proc format;
value $missfmt ' ' = "Missing" other = "Not_Missing";
value nmissfmt low-high ="Not_Missing" other="Missing";
run;
* turns off the output so the results do not get too messy;
ods select none;
ods table onewayfreqs=temp;
proc freq data=&INPUT_DSN.;
table _all_ / missing missprint;
format _numeric_ nmissfmt. _character_ $missfmt.;
run;
ods select all;
data &OUTPUT_DSN ;
length __v__ $32. __m__ __n__ __t__ __p__ 8 ;
set temp;
by table notsorted ;
if first.table then call missing(of __:);
__v__ = substr(table,7);
if vvaluex(__v__)='Missing' then __m__=frequency;
else __n__=frequency;
retain __m__ __n__;
if last.table then do;
__m__=sum(0,__m__);
__n__=sum(0,__n__);
__t__=sum(__m__,__n__);
__p__ = 100*divide(__m__,__t__);
output;
end;
keep __v__ __m__ __n__ __t__ __p__ ;
rename __v__=name __n__=present __m__=missing __p__=percent __t__ = total;
format __m__ __n__ __t__ comma11. __p__ 6.2;
run;
%mend;
Examples:
%percent_missing(sashelp.heart,example1);
proc print; run;
Results:
Obs name missing present total percent 1 Status 0 5,209 5,209 0.00 2 DeathCause 3,218 1,991 5,209 61.78 3 AgeCHDdiag 3,760 1,449 5,209 72.18 4 Sex 0 5,209 5,209 0.00 5 AgeAtStart 0 5,209 5,209 0.00 6 Height 6 5,203 5,209 0.12 7 Weight 6 5,203 5,209 0.12 8 Diastolic 0 5,209 5,209 0.00 9 Systolic 0 5,209 5,209 0.00 10 MRW 6 5,203 5,209 0.12 11 Smoking 36 5,173 5,209 0.69 12 AgeAtDeath 3,218 1,991 5,209 61.78 13 Cholesterol 152 5,057 5,209 2.92 14 Chol_Status 152 5,057 5,209 2.92 15 BP_Status 0 5,209 5,209 0.00 16 Weight_Status 6 5,203 5,209 0.12 17 Smoking_Status 36 5,173 5,209 0.69
With a KEEP= dataset option:
%percent_missing(sashelp.heart(keep=chol:),cholesterol);
proc print; run;
Results:
Obs name missing present total percent 1 Cholesterol 152 5,057 5,209 2.92 2 Chol_Status 152 5,057 5,209 2.92
With WHERE= dataset option:
%percent_missing(sashelp.heart(where=(not missing(DeathCause))),deaths );
proc print; run;
Results:
Obs name missing present total percent 1 Status 0 1,991 1,991 0.00 2 DeathCause 0 1,991 1,991 0.00 3 AgeCHDdiag 1,097 894 1,991 55.10 4 Sex 0 1,991 1,991 0.00 5 AgeAtStart 0 1,991 1,991 0.00 6 Height 5 1,986 1,991 0.25 7 Weight 3 1,988 1,991 0.15 8 Diastolic 0 1,991 1,991 0.00 9 Systolic 0 1,991 1,991 0.00 10 MRW 3 1,988 1,991 0.15 11 Smoking 20 1,971 1,991 1.00 12 AgeAtDeath 0 1,991 1,991 0.00 13 Cholesterol 69 1,922 1,991 3.47 14 Chol_Status 69 1,922 1,991 3.47 15 BP_Status 0 1,991 1,991 0.00 16 Weight_Status 3 1,988 1,991 0.15 17 Smoking_Status 20 1,971 1,991 1.00
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
By chunks I meant different portions of the dataset - demographics for the contact; race and ethn icity break outs; language; and the disability section that you have helped me on. While all the REALD data is sparsely p0opulated, the disability section is the most sparse.
Your earlier coding helped to a solution. So I will label it as a solution.
Thanks again.
wlierman
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please try this:
data junk; input a $ b $ C $; datalines; 123 . 456 dds lkjhl . abc 78jk . . . . ; data need; set junk; array _c (*) _character_; do i= 1 to dim(_c); varname = vname(_c[i]); x = missing(_c[i]); x2 =x; output; end; keep varname x x2; run; proc format; value ismiss 1='Missing' 0='Not Missing' ; run; proc tabulate data=need; class varname; class x; format x ismiss.; var x2 ; table varname, (x=' ' all='Total')*n=' ' x2="Percent missing"*mean=''*f=percent8.1 ; run;
The array _C in the NEED could use any list of like variables, either a list of character or a list of numeric variables but not a mixture, basic array rule.
If you wanted to do both character and numeric create a different array, such as _n for the numeric variables and have a second loop that uses _n instead of _c.
The Missing function returns values of 1(true or missing) and 0 (false or not missing) for either character or numeric variables. So we create the same reporting variables regardless of starting variable type and the report procedure doesn't need any changes as long as the data set looks like "need".
The variables will most likely come out in alphabetical order but remember upper case 'Z' is before 'a'. So if your variables are initially defined with mixed case not the order you expect. If that is problem you can UPCASE or LOWCASE the Varname variable in the Need data step.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your code approach also produces the output that I am needing.
I would also post your email as a solution along with the solution from Tom.
But I don't know how to submit a second solution.
Thanks for your help.
wlierman