- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This may be a simple question by does anyone know how to:
- Count the number of non-missing observations
- Count the number of missing observations
for both character and numeric values using proc means? If I use n and nmiss it will only count the numeric values and not the character values. Any help would be great.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please post a sample of data to illustrate your query.
code you have tried and point what issues you had.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Data CARS;
infile 'C:\Users\rsbggb\Downloads\cars.txt' firstobs=4 pad;
input Size $ 1-10 Manufact $ 11-21 Model $ 22-37 Mileage 38-49 Reliable 50;
if Size='' then delete;
run;
proc means data=CARS n;
Var Size Manufact Model Mileage Reliable;
run;
This is my code so far. The error I am getting is:
12 proc means data=CARS n;
13 Var Size Manufact Model Mileage Reliable;
ERROR: Variable Size in list does not match type prescribed for this list.
ERROR: Variable Manufact in list does not match type prescribed for this list.
ERROR: Variable Model in list does not match type prescribed for this list.
14 run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@HALOBEAST999 wrote:
Data CARS;
infile 'C:\Users\rsbggb\Downloads\cars.txt' firstobs=4 pad;
input Size $ 1-10 Manufact $ 11-21 Model $ 22-37 Mileage 38-49 Reliable 50;
if Size='' then delete;
run;
proc means data=CARS n;
Var Size Manufact Model Mileage Reliable;
run;
This is my code so far. The error I am getting is:
12 proc means data=CARS n;
13 Var Size Manufact Model Mileage Reliable;
ERROR: Variable Size in list does not match type prescribed for this list.
ERROR: Variable Manufact in list does not match type prescribed for this list.
ERROR: Variable Model in list does not match type prescribed for this list.
14 run;
Size is not a numeric variable. PROC MEANS can't work on it.
Manufact is not a numeric variable. ditto
Model is not a numeric variable. ditto
You have read these variables in from the .txt file improperly, and so they are not numeric, they are character. Remove the $ from the INPUT statement, and see if that works.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You're better off using PROC FREQ unless you want other statistics at the same time.
This is a bit complicated because I try and make a pretty table for reports, so I can dump it straight to Word/PDF/HTML for reports.
You can run the code and examine the data sets at each point, it should work in any version of SAS.
/*This program creates a report with the number and percent of
missing data for each variable in the data set.
The ony change should be to the macro variable, INPUT_DSN.
Author: F. Khurshed
Date: 2019-01-04
*/
*create sample data to work with;
data class;
set sashelp.class;
if age=14 then
call missing(height, weight, sex);
if name='Alfred' then
call missing(sex, age, height);
label age="Fancy Age Label";
run;
*set input data set name;
%let INPUT_DSN = class;
%let OUTPUT_DSN = want;
*create format for missing;
proc format;
value $ missfmt ' '="Missing" other="Not Missing";
value nmissfmt .="Missing" other="Not Missing";
run;
*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;
proc freq data=&INPUT_DSN.;
table _all_ / missing;
format _numeric_ nmissfmt. _character_ $missfmt.;
run;
ods select all;
*Format output;
data long;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);
Variable_Value=strip(trim(vvaluex(variable)));
presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
keep variable variable_value frequency percent cum: presentation;
label variable='Variable' variable_value='Variable Value';
run;
proc sort data=long;
by variable;
run;
*make it a wide data set for presentation, with values as N (Percent);
proc transpose data=long out=wide_presentation (drop=_name_);
by variable;
id variable_value;
var presentation;
run;
*transpose only N;
proc transpose data=long out=wide_N prefix=N_;
by variable;
id variable_value;
var frequency;
run;
*transpose only percents;
proc transpose data=long out=wide_PCT prefix=PCT_;
by variable;
id variable_value;
var percent;
run;
*final output file;
data &Output_DSN.;
merge wide_N wide_PCT wide_presentation;
by variable;
drop _name_;
label N_Missing='# Missing' N_Not_Missing='# Not Missing'
PCT_Missing='% Missing' N_Not_Missing='% Not Missing' Missing='Missing'
Not_missing='Not Missing';
run;
title "Missing Report of &INPUT_DSN.";
proc print data=&output_dsn. noobs label;
run;
@HALOBEAST999 wrote:
This may be a simple question by does anyone know how to:
- Count the number of non-missing observations
- Count the number of missing observations
for both character and numeric values using proc means? If I use n and nmiss it will only count the numeric values and not the character values. Any help would be great.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Reeza,
This could be simple as.
data class;
set sashelp.class;
if age=14 then
call missing(height, weight, sex);
if name='Alfred' then
call missing(sex, age, height);
label age="Fancy Age Label";
run;
proc transpose data=class(obs=0) out=temp;
var _all_;
run;
proc sql noprint;
select catx(' ','nmiss(',_name_,') as ',_name_)
into : n_miss separated by ','
from temp;
select catx(' ','nmiss(',_name_,')/count(*) as ',_name_)
into : n_miss_per separated by ','
from temp;
create table temp1 as
select 'n_miss ' as label,&n_miss from class
union
select 'n_miss_per' as label,&n_miss_per from class;
quit;
proc transpose data=temp1 out=want;
id label;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Ksharp yeah, I wrote it in 2011 actually but have just been putting them up these days in a central location so I can load them automatically. If they work, I'm usually too lazy to make them more efficient these days 😛
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @HALOBEAST999,
I agree with @Reeza that PROC FREQ is more appropriate for this task. But if you really want to use PROC MEANS, then try this:
data have;
input (cvar1-cvar3) ($) nvar1 nvar2;
cards;
a e x . 6
. f . 1 .
b g . 2 7
. h y 3 .
c . . 4 8
. i z . 9
d j . 5 .
;
proc format;
value $miss
' ' = 'missing'
other = 'nonmissing';
run;
proc means data=have noprint;
format _char_ $miss.;
class _char_ / missing;
ways 0,1;
output out=cnt n= nmiss= / autoname;
run;
proc print data=cnt noobs;
run;
Result:
nvar1_ nvar2_ cvar1 cvar2 cvar3 _TYPE_ _FREQ_ nvar1_N nvar2_N NMiss NMiss missing missing missing 0 7 5 4 2 3 missing missing missing 1 4 4 2 0 2 missing missing nonmissing 1 3 1 2 2 1 missing missing missing 2 1 1 1 0 0 missing nonmissing missing 2 6 4 3 2 3 missing missing missing 4 3 2 1 1 2 nonmissing missing missing 4 4 3 3 1 1
So, you get the desired numbers for the numeric variables in the first row and for the character variables (in reverse order) in column _FREQ_ in the other rows.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
NOTE: I've just edited the first (array) solution since @FreelanceReinh has found an actual error in the code - it wasn't populating the array for numeric variables' frequencies NQ and suggested a 2-dimensional array instead to keep a single LINK block.
That's right, MEANS doesn't work with character analytical variables. But in 10-20 minutes (depending on your SAS exposure) you can write 17 lines of DATA step code to do what you want for all variables in the input data set regardless of their data type. For example (corrected code thanks to @FreelanceReinh😞
data have ;
input n1 c1:$1. n2 c2:$2. n3 c3:$3. ;
cards ;
0 . 00 AA 000 .
. B 11 . 111 BBB
2 C . CC . CCC
3 . . DD 333 DDD
. E 44 EE 444 .
5 F . FF 555 FFF
6 . 66 GG 666 GGG
7 H . HH . HHH
. I . . 888 .
9 . 99 JJ 999 .
;
run ;
data freq (keep = vname N_:) ;
set have nobs = nobs end = eof ;
array nn _numeric_ ;
array cc _char_ ;
array q [2, 1000000] _temporary_ ;
do over nn ; q [1,_i_] ++ cmiss (nn) ; end ;
do over cc ; q [2,_i_] ++ cmiss (cc) ; end ;
if eof ;
do over nn ; Vname = put (vname (nn), $32.) ; d = 1 ; link compout ; end ;
do over cc ; Vname = put (vname (cc), $32.) ; d = 2 ; link compout ; end ;
return ;
compout: N_null = q [d,_i_] ;
N_notnull = nobs - N_null ;
output ;
return ;
run ;
The hard-coded 100000 is supposed to exceed the number of either numeric or character variables in your input file. If you have more than that 😉 , you can bump it up to 1000000 or even 10000000 (at the expense of at most 150 MB of RAM).
If you're still averse to setting hard-coded limits of this kind, the SAS hash object is your friend:
data _null_ ;
if _n_ = 1 then do ;
dcl hash f (ordered:"D") ;
f.definekey ("vname") ;
f.definedata ("vname", "n_null", "n_notnull") ;
f.definedone () ;
end ;
set have end = eof ;
array nn _numeric_ ;
array cc _char_ ;
do over nn ;
Vname = put (vname (nn), $32.) ;
if f.find() ne 0 then call missing (N_null, N_notnull) ;
if cmiss (nn) then n_null ++ 1 ;
else n_notnull ++ 1 ;
f.replace() ;
end ;
do over cc ;
Vname = put (vname (cc), $32.) ;
if f.find() ne 0 then call missing (N_null, N_notnull) ;
if cmiss (cc) then n_null ++ 1 ;
else n_notnull ++ 1 ;
f.replace() ;
end ;
if eof then f.output (dataset:"freq") ;
run ;
Sometimes it's easier to write a simple DATA step program than to try fitting what you need in the Procrustean bed of an existing proc.
Note that the two programs above will work for any input data set. The only caveat is that if it has no numeric or no character variables, you'll get a harmless warning in the log saying that a zero dimension array is being defined.
Also note that for very large data sets, the array approach will work noticeably faster. This is because arrays items can be overwritten directly in place and therefore (unlike the hash object) do not involve moving data to the PDV and back.
Kind regards
Paul D.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@hashman: Powerful and beautiful code, as always! I would suggest one correction to the first solution: Replace the two arrays nq and cq by a single, two-dimensional array, say, q[2, 100000], and use this in the compout block (q[t,_i_]) with the appropriate value t of the first index, e.g.
if eof ; t=1;
do over nn ; Vname = put (vname (nn), $32.) ; link compout ; end ; t=2;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks! Frankly, it did occur to me ... but decided not to for the sake of symmetry (and perhaps better readability).
Kind regards
Paul D.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@hashman wrote:
Thanks! Frankly, it did occur to me ... but decided not to for the sake of symmetry (and perhaps better readability).
Sure, but then you would need a separate "compout" block for the numeric variables, or?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hm ... methinks not. Just the first dimension index would have to be assigned before calling the link.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@FreelanceReinh wrote:
Sure, but then you would need a separate "compout" block for the numeric variables, or?
What I meant was: Currently, variable N_null is populated with values from array cq only (leading to incorrect results for the numeric variables, in general). The potential second "compout" block could do the same for array nq.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Aaaaaha! You're absolutely right! Thanks for reading the code so attentively and finding an actual error.
But of course: It's either 2 arrays and 2 link blocks or a single 2-dim array and a single link block.
I'll edit the original post accordingly.
Thanks again.
Kind regards
Paul D.