Hi,
I want to find out the missing values in each feature of the dataset and remove the features which have greater than 10% missing values.
Any help would be appreciated.
Thanks.
Hi and welcome to the SAS Community! 🙂
What is a 'feature' in a data set? Are you able to give a small example of what your data looks like and what you want the desired result to look like? Makes it much easier to provide a usable code answer.
The dataset has different numerical features and categorical features.
For example:
X Y Z
High Yes 2.56
‘ ‘ No 3.67
Medium ‘ ‘ .
Low ‘ ‘ 2
‘ ‘ ‘ ‘ .
and so on....
X,Y,Z might be 3 features of this dataset, where X and Y are Categorical features. I want to find missing values for X and for Y and then remove X or Y if the missing values are greater than 10% in either of them.
I used this..
/*create format for missing*/
proc format;
value $missfmt ' '='Missing' other='Not Missing';
*value missfmt . ='Missing' other='Not Missing';
run;
/*find out the missing char values*/
proc freq data = trainingdata;
format _char_ $missfmt.;
tables _char_ / missing nocum nopercent out=tempstr;
run;
I am able to see missing and not missing count for each feature, but I am not able to fetch the missing count from this from this, so that I can divide from the total number of values and get the percentage.
I am performing the numerical and categorical missing values search separately as proc means didn't work for tables=_char_
If you need the percent values then what about removing keyword NOPERCENT from your code?
I have removed the nopercent from the code.
The output pattern is as follows:
The SAS System
The FREQ Procedure
X
X Frequency Percent
Not missing 1000 equivalent percentage value
Missing 900 equivalent percentage value
Y
Y Frequency Percent
Not missing 800 equivalent percentage value
Missing 1100 equivalent percentage value
and so on.....
I want to fetch the percentage of missing values for X and Y and then remove the features which has greater than 10% of missing values.
I tried to transpose the output dataset, to get a new dataset with an assumption to see a output like this,
Missing Not missing Percent
X 900 1000 ....
Y 1100 800 ....
But it doesn't work.
I want to fetch missing percent for X and missing percent for Y from the output and then check if it is greater than 10%.
So it seems you have coded your "features" into SAS variables. And you already know how to count the number of missing values. To see if that number is more or less than 20% you just need to count how many observations there are and divide. Since you have both the number of missing and non-missing the total number is just the sum of those two numbers.
Use the CLASS statement in PROC SUMMARY. Should work as long as the number of variables is not too many.
So define your two formats:
proc format;
value $missfmt ' '='Missing' other='Not Missing';
value missfmt low-high='Not Missing' other ='Missing' ;
run;
Pick your dataset.
%let dsn=sashelp.cars;
Get the list of variables:
proc transpose data=&dsn(obs=0) out=step1;
var _all_;
run;
Summarize
proc summary data = &dsn missing chartype ;
class _all_ ;
ways 1;
output out=step2 ;
format _char_ $missfmt. _numeric_ missfmt.;
run;
Then collapse to one observation per variable.
data want;
length varnum 8 _name_ $32. _label_ $256 missing non_missing percent_missing 8;
keep varnum -- percent_missing;
do until(last._type_);
set step2;
by _type_;
varnum=indexc(_type_,'1');
p=varnum;
set step1 point=p;
if vvaluex(_name_)='Missing' then missing=_freq_;
else non_missing=_freq_;
end;
missing=sum(0,missing);
non_missing=sum(0,non_missing);
percent_missing=missing/(sum(missing,non_missing));
run;
Note variable names will come out in reverse order that they exist in the dataset.
proc sort data=want; by varnum; run;
non_ percent_ Obs varnum _name_ _label_ missing missing missing 1 1 Make 0 428 .000000000 2 2 Model 0 428 .000000000 3 3 Type 0 428 .000000000 4 4 Origin 0 428 .000000000 5 5 DriveTrain 0 428 .000000000 6 6 MSRP 0 428 .000000000 7 7 Invoice 0 428 .000000000 8 8 EngineSize Engine Size (L) 0 428 .000000000 9 9 Cylinders 2 426 .004672897 10 10 Horsepower 0 428 .000000000 11 11 MPG_City MPG (City) 0 428 .000000000 12 12 MPG_Highway MPG (Highway) 0 428 .000000000 13 13 Weight Weight (LBS) 0 428 .000000000 14 14 Wheelbase Wheelbase (IN) 0 428 .000000000 15 15 Length Length (IN) 0 428 .000000000
Hi @Bond007 ,
do you mean something like this:
data have;
infile cards dlm = "|";
input X : $ 10. Y : $ 10. Z;
cards;
High |Yes|2.56
|No |3.67
Medium|Yes|.
Low |Yes|2
|Yes|.
High |Yes|2.56
|No |3.67
Medium|Yes|.
Low |Yes|2
| |.
High |Yes|2.56
|No |3.67
Medium|Yes|.
Low |Yes|2
|Yes|.
;
run;
data _null_;
sentinel1 = .;
if 0 then set have;
sentinel2=.;
array _C_{*} sentinel1-character-sentinel2;
call symputX("_DIM_",dim(_C_),"G"); /* collect the metadata */
stop;
run;
options symbolgen;
data _null_;
sentinel1 = .;
if 0 then set have;
sentinel2=.;
array _C_ sentinel1-character-sentinel2;
array CNT[&_DIM_.] _temporary_;
d=dim(_C_);
put d=;
do until (eof);
set have end=eof nobs=nobs;
do over _C_;
if MISSING (_C_) then CNT[_I_] + 1;
end;
end;
call execute('data want; set have; drop _N_');
do over _C_;
if CNT[_I_]/NOBS > 0.10 then call execute(vname(_C_)); /*Y stays, X drops */
end;
call execute('; run;');
stop;
run;
All the best
Bart
Barteku, as a matter of dynamic hash programming curiosity, the entire feat can be pulled in a single step without crossing a step boundary once:
data have ;
input (X Y) ($) Z ;
cards ;
High Yes 2.56
. No 3.67
Medium Yes .
Low Yes 2
. Yes .
High Yes 2.56
. No 3.67
Medium Yes .
Low Yes 2
. . .
High Yes 2.56
. No 3.67
Medium Yes .
Low Yes 2
. Yes .
;
run ;
%let pct = 10 ;
data _null_ ;
dcl hash v () ;
v.definekey ("vn") ;
v.definedata ("vn", "nmiss") ;
v.definedone () ;
dcl hiter iv ("v") ;
do until (lr) ;
set have end = lr nobs = n ;
array nn _numeric_ ;
array cc _char_ ;
do over nn ;
if not cmiss (nn) then continue ;
vn = put (vname (nn), $32.) ;
link count ;
end ;
do over cc ;
if not cmiss (cc) then continue ;
vn = put (vname (cc), $32.) ;
link count ;
end ;
end ;
dcl hash h (dataset:"have", multidata:"y") ;
do while (iv.next() = 0) ;
if divide (nmiss, n) * 100 < &pct then h.definekey (vn) ;
end ;
h.definedone() ;
h.output (dataset: "want") ;
stop ;
count: if v.find() then nmiss = 1 ;
else nmiss + 1 ;
v.replace() ;
run ;
Kind regards
Paul D.
@hashman :
Paul, 100% agree, hash table give us elegant "one step" solution. The only reason I've made it 2 datasteps was to get the
&_DIM_.
macrovariable. In fact since all loops are "over _C_" I could make it a static value and it would work well in one datastep to, like:
array CNT[1000000] _temporary_; /* ~8MB of RAM, for quite "wide" dataset ;) */
but I thought it wouldn't be "nice/elegant".
But, as you remember our discussion from SAS-L, the one titled "Array search concepts (was: Re: Help?)", I'm finishing/polishing this dynamic array concept, which would allow us to write something like:
...
array _C_ sentinel1-character-sentinel2;
call arrayCNT('Allocate', lbound(_C_), hbound(_C_));
...
do over _C_;
if MISSING (_C_) then call arrayCNT('Add', _I_, 1);
end;
...
All the best
Bart
Bart, as far as elegance is concerned, I have the same kind of aesthetic aversion to "big enough" allocations. I do remember our -L exchange and am looking forward to seeing the finished product whenever you deem it to have been sufficiently Polished ;). I've learned quite a few FCMP tricks of trade from you and strongly suspect will learn more from this one.
Aesthetics aside, "big enough" can be extremely useful, as you well know, and quite productive speed-wise. This is because a one fell swoop allocation at compile time works much faster than allocating memory at run time for each extra item one at a time. With a really big hash object table, most of its time is spent not so much on the search and retrieve operations as on looking for available memory before doing an insert, especially when its memory footprint approaches the system limits. As a result, code can be run for a long time, only to fail on the next insertion when no more memory is available; whereas if an open-addressed array-based hash table is allocated as a big array at compile time, it either fails at once or else will work to the end without a glitch.
As you've seen from my -L writings, with a single integer key an open-addressed array-based hash table wins hands down both in insert/search speed (~1:2) and memory usage (~1:3 even with a half-sparse table). It also excels at aggregation because it can be done directly in the corresponding array cells (rather than via the FIND-aggregate-REPLACE hash object cycle). This is because the hash function in this case is the simple mod(key,table_size). The hash object readily takes over in terms of search speed for character keys and composite keys of mixed types since against such keys, its internal hash function is much faster than what can be mustered via:
mod (input (md5 (catx (k1-kN), pib6.), table_size)
externally - this combo just takes too long to compute, and a slow hash function defeats the purpose of using hashing in the first place. If SAS surfaced a really tight and fast function that could be used against an arbitrary key, array-based hashing would be much more attractive.
Kind regards
Paul D.
p.s. If you want to see how a hash object table fares against an open-addressed array table with a single integer key, try the program in the attached file.
data have;
infile cards dlm = "|";
input X : $ 10. Y : $ 10. Z;
cards;
High |Yes|2.56
|No |3.67
Medium|Yes|.
Low |Yes|2
|Yes|.
High |Yes|2.56
|No |3.67
Medium|Yes|.
Low |Yes|2
| |.
High |Yes|2.56
|No |3.67
Medium|Yes|.
Low |Yes|2
|Yes|.
;
run;
proc transpose data=have(obs=0) out=temp;
var _all_;
run;
proc sql;
select cats('nmiss(',_name_ ,')/count(*) as p_',_name_) into : list separated by ','
from temp;
create table percent_missing as
select &list
from have ;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.