topic Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu in SAS Programming

Find char missing values in a dataset and remove the features with more than 10% of missing values

Bond007 — Tue, 01 Oct 2019 09:06:32 GMT

Hi,

I want to find out the missing values in each feature of the dataset and remove the features which have greater than 10% missing values.

Any help would be appreciated.

Thanks.

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

PeterClemmensen — Tue, 01 Oct 2019 09:13:09 GMT

Hi and welcome to the SAS Community! 🙂

What is a 'feature' in a data set? Are you able to give a small example of what your data looks like and what you want the desired result to look like? Makes it much easier to provide a usable code answer.

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Bond007 — Tue, 01 Oct 2019 10:22:05 GMT

The dataset has different numerical features and categorical features.

For example:

X Y Z

High Yes 2.56

‘ ‘ No 3.67

Medium ‘ ‘ .

Low ‘ ‘ 2

‘ ‘ ‘ ‘ .

and so on....

X,Y,Z might be 3 features of this dataset, where X and Y are Categorical features. I want to find missing values for X and for Y and then remove X or Y if the missing values are greater than 10% in either of them.

I used this..

/*create format for missing*/
proc format;
value $missfmt ' '='Missing' other='Not Missing';
*value missfmt . ='Missing' other='Not Missing';
run;

/*find out the missing char values*/
proc freq data = trainingdata;
format _char_ $missfmt.;
tables _char_ / missing nocum nopercent out=tempstr;
run;

I am able to see missing and not missing count for each feature, but I am not able to fetch the missing count from this from this, so that I can divide from the total number of values and get the percentage.

I am performing the numerical and categorical missing values search separately as proc means didn't work for tables=_char_

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Patrick — Tue, 01 Oct 2019 10:51:29 GMT

@Bond007

If you need the percent values then what about removing keyword NOPERCENT from your code?

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Bond007 — Tue, 01 Oct 2019 11:24:44 GMT

I have removed the nopercent from the code.

The output pattern is as follows:

The SAS System

The FREQ Procedure

X Frequency Percent

Not missing 1000 equivalent percentage value

Missing 900 equivalent percentage value

Y Frequency Percent

Not missing 800 equivalent percentage value

Missing 1100 equivalent percentage value

and so on.....

I want to fetch the percentage of missing values for X and Y and then remove the features which has greater than 10% of missing values.

I tried to transpose the output dataset, to get a new dataset with an assumption to see a output like this,

Missing Not missing Percent

X 900 1000 ....

Y 1100 800 ....

But it doesn't work.

I want to fetch missing percent for X and missing percent for Y from the output and then check if it is greater than 10%.

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

yabwon — Tue, 01 Oct 2019 12:48:25 GMT

Hi @Bond007 ,

do you mean something like this:

data have;
infile cards dlm = "|";
input X : $ 10. Y : $ 10. Z;
cards;
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |   |.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
;
run;


data _null_;
  sentinel1 = .;
  if 0 then set have;
  sentinel2=.;
  array _C_{*} sentinel1-character-sentinel2;
  call symputX("_DIM_",dim(_C_),"G");  /* collect the metadata */
  stop;
run;

options symbolgen;
data _null_;

  sentinel1 = .;
  if 0 then set have;
  sentinel2=.;

  array _C_ sentinel1-character-sentinel2;
  array CNT[&_DIM_.] _temporary_;

  d=dim(_C_);
  put d=;
  do until (eof);
    set have end=eof nobs=nobs;
    do over _C_;
      if MISSING (_C_) then CNT[_I_] + 1;
    end;
  end;

  call execute('data want; set have; drop _N_');
  do over _C_;
    if CNT[_I_]/NOBS > 0.10 then call execute(vname(_C_)); /*Y stays, X drops */
  end;
  call execute('; run;');
  stop;
run;

All the best

Bart

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Tom — Tue, 01 Oct 2019 14:03:27 GMT

So it seems you have coded your "features" into SAS variables. And you already know how to count the number of missing values. To see if that number is more or less than 20% you just need to count how many observations there are and divide. Since you have both the number of missing and non-missing the total number is just the sum of those two numbers.

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Ksharp — Thu, 03 Oct 2019 15:45:54 GMT

data have;
infile cards dlm = "|";
input X : $ 10. Y : $ 10. Z;
cards;
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |   |.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
;
run;


proc transpose data=have(obs=0) out=temp;
var _all_;
run;

proc sql;
select cats('nmiss(',_name_ ,')/count(*) as p_',_name_) into : list separated by ','
 from temp;

create table percent_missing as
select &list
 from have ;
quit;

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Tom — Thu, 03 Oct 2019 17:36:02 GMT

Use the CLASS statement in PROC SUMMARY. Should work as long as the number of variables is not too many.

So define your two formats:

proc format;
  value $missfmt ' '='Missing' other='Not Missing';
  value missfmt low-high='Not Missing' other ='Missing' ;
run;

Pick your dataset.

%let dsn=sashelp.cars;

Get the list of variables:

proc transpose data=&dsn(obs=0) out=step1;
  var _all_;
run;

Summarize

proc summary data = &dsn missing chartype ;
  class _all_ ;
  ways 1;
  output out=step2 ;
  format _char_ $missfmt. _numeric_ missfmt.;
run;

Then collapse to one observation per variable.

data want;
  length varnum 8 _name_ $32. _label_ $256 missing non_missing percent_missing 8;
  keep varnum -- percent_missing;
  do until(last._type_);
    set step2;
    by _type_;
    varnum=indexc(_type_,'1');
    p=varnum;
    set step1 point=p;
    if vvaluex(_name_)='Missing' then missing=_freq_;
    else non_missing=_freq_;
  end;
  missing=sum(0,missing);
  non_missing=sum(0,non_missing);
  percent_missing=missing/(sum(missing,non_missing));
run;

Note variable names will come out in reverse order that they exist in the dataset.

proc sort data=want; by varnum; run;

                                                                non_       percent_
Obs    varnum    _name_         _label_            missing    missing       missing

  1       1      Make                                 0         428      .000000000
  2       2      Model                                0         428      .000000000
  3       3      Type                                 0         428      .000000000
  4       4      Origin                               0         428      .000000000
  5       5      DriveTrain                           0         428      .000000000
  6       6      MSRP                                 0         428      .000000000
  7       7      Invoice                              0         428      .000000000
  8       8      EngineSize     Engine Size (L)       0         428      .000000000
  9       9      Cylinders                            2         426      .004672897
 10      10      Horsepower                           0         428      .000000000
 11      11      MPG_City       MPG (City)            0         428      .000000000
 12      12      MPG_Highway    MPG (Highway)         0         428      .000000000
 13      13      Weight         Weight (LBS)          0         428      .000000000
 14      14      Wheelbase      Wheelbase (IN)        0         428      .000000000
 15      15      Length         Length (IN)           0         428      .000000000

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

hashman — Thu, 03 Oct 2019 19:18:49 GMT

@yabwon:

Barteku, as a matter of dynamic hash programming curiosity, the entire feat can be pulled in a single step without crossing a step boundary once:

data have ;                                                                                                                                                                                                                                                     
  input (X Y) ($) Z ;                                                                                                                                                                                                                                           
  cards ;                                                                                                                                                                                                                                                       
High   Yes 2.56                                                                                                                                                                                                                                                 
.      No  3.67                                                                                                                                                                                                                                                 
Medium Yes  .                                                                                                                                                                                                                                                   
Low    Yes 2                                                                                                                                                                                                                                                    
.      Yes  .                                                                                                                                                                                                                                                   
High   Yes 2.56                                                                                                                                                                                                                                                 
.      No  3.67                                                                                                                                                                                                                                                 
Medium Yes  .                                                                                                                                                                                                                                                   
Low    Yes 2                                                                                                                                                                                                                                                    
.      .    .                                                                                                                                                                                                                                                   
High   Yes 2.56                                                                                                                                                                                                                                                 
.      No  3.67                                                                                                                                                                                                                                                 
Medium Yes  .                                                                                                                                                                                                                                                   
Low    Yes 2                                                                                                                                                                                                                                                    
.      Yes  .                                                                                                                                                                                                                                                   
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
%let pct = 10 ;                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                
data _null_ ;                                                                                                                                                                                                                                                   
  dcl hash v () ;                                                                                                                                                                                                                                               
  v.definekey ("vn") ;                                                                                                                                                                                                                                          
  v.definedata ("vn", "nmiss") ;                                                                                                                                                                                                                                
  v.definedone () ;                                                                                                                                                                                                                                             
  dcl hiter iv ("v") ;                                                                                                                                                                                                                                           
  do until (lr) ;                                                                                                                                                                                                                                               
    set have end = lr nobs = n ;                                                                                                                                                                                                                                
    array nn _numeric_ ;                                                                                                                                                                                                                                        
    array cc _char_ ;                                                                                                                                                                                                                                           
    do over nn ;                                                                                                                                                                                                                                                
      if not cmiss (nn) then continue ;                                                                                                                                                                                                                         
      vn = put (vname (nn), $32.) ;                                                                                                                                                                                                                             
      link count ;                                                                                                                                                                                                                                              
    end ;                                                                                                                                                                                                                                                       
    do over cc ;                                                                                                                                                                                                                                                
      if not cmiss (cc) then continue ;                                                                                                                                                                                                                         
      vn = put (vname (cc), $32.) ;                                                                                                                                                                                                                             
      link count ;                                                                                                                                                                                                                                              
    end ;                                                                                                                                                                                                                                                       
  end ;                                                                                                                                                                                                                                                         
  dcl hash h (dataset:"have", multidata:"y") ;                                                                                                                                                                                                                  
  do while (iv.next() = 0) ;                                                                                                                                                                                                                                     
    if divide (nmiss, n) * 100 < &pct then h.definekey (vn) ;                                                                                                                                                                                                   
  end ;                                                                                                                                                                                                                                                         
  h.definedone() ;                                                                                                                                                                                                                                              
  h.output (dataset: "want") ;                                                                                                                                                                                                                                  
  stop ;                                                                                                                                                                                                                                                        
  count: if v.find() then nmiss = 1 ;                                                                                                                                                                                                                      
         else             nmiss + 1 ;                                                                                                                                                                                                                      
         v.replace() ;                                                                                                                                                                                                                                          
run ;

Kind regards

Paul D.

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

yabwon — Thu, 03 Oct 2019 19:45:42 GMT

@hashman :

Paul, 100% agree, hash table give us elegant "one step" solution. The only reason I've made it 2 datasteps was to get the

&_DIM_.

macrovariable. In fact since all loops are "over _C_" I could make it a static value and it would work well in one datastep to, like:

array CNT[1000000] _temporary_; /* ~8MB of RAM, for quite "wide" dataset ;) */

but I thought it wouldn't be "nice/elegant".

But, as you remember our discussion from SAS-L, the one titled "Array search concepts (was: Re: Help?)", I'm finishing/polishing this dynamic array concept, which would allow us to write something like:

...
 array _C_ sentinel1-character-sentinel2;
 call arrayCNT('Allocate', lbound(_C_), hbound(_C_));
...
 do over _C_;
  if MISSING (_C_) then call arrayCNT('Add', _I_, 1);
 end;
...

All the best

Bart

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

hashman — Thu, 03 Oct 2019 22:36:18 GMT

@yabwon:

Bart, as far as elegance is concerned, I have the same kind of aesthetic aversion to "big enough" allocations. I do remember our -L exchange and am looking forward to seeing the finished product whenever you deem it to have been sufficiently Polished ;). I've learned quite a few FCMP tricks of trade from you and strongly suspect will learn more from this one.

Aesthetics aside, "big enough" can be extremely useful, as you well know, and quite productive speed-wise. This is because a one fell swoop allocation at compile time works much faster than allocating memory at run time for each extra item one at a time. With a really big hash object table, most of its time is spent not so much on the search and retrieve operations as on looking for available memory before doing an insert, especially when its memory footprint approaches the system limits. As a result, code can be run for a long time, only to fail on the next insertion when no more memory is available; whereas if an open-addressed array-based hash table is allocated as a big array at compile time, it either fails at once or else will work to the end without a glitch.

As you've seen from my -L writings, with a single integer key an open-addressed array-based hash table wins hands down both in insert/search speed (~1:2) and memory usage (~1:3 even with a half-sparse table). It also excels at aggregation because it can be done directly in the corresponding array cells (rather than via the FIND-aggregate-REPLACE hash object cycle). This is because the hash function in this case is the simple mod(key,table_size). The hash object readily takes over in terms of search speed for character keys and composite keys of mixed types since against such keys, its internal hash function is much faster than what can be mustered via:

mod (input (md5 (catx (k1-kN), pib6.), table_size)

externally - this combo just takes too long to compute, and a slow hash function defeats the purpose of using hashing in the first place. If SAS surfaced a really tight and fast function that could be used against an arbitrary key, array-based hashing would be much more attractive.

Kind regards

Paul D.

p.s. If you want to see how a hash object table fares against an open-addressed array table with a single integer key, try the program in the attached file.