Re: Find char missing values in a dataset and remove the features with...

Bond007 · Posted 10-01-2019 05:06 AM

Hi,

I want to find out the missing values in each feature of the dataset and remove the features which have greater than 10% missing values.

Any help would be appreciated.

Thanks.

PeterClemmensen · Posted 10-01-2019 05:13 AM

Hi and welcome to the SAS Community! 🙂

What is a 'feature' in a data set? Are you able to give a small example of what your data looks like and what you want the desired result to look like? Makes it much easier to provide a usable code answer.

The DATA to DATA Step Macro
Blog: SASnrd

Bond007 · Posted 10-01-2019 06:22 AM

The dataset has different numerical features and categorical features.

For example:

X Y Z

High Yes 2.56

‘ ‘ No 3.67

Medium ‘ ‘ .

Low ‘ ‘ 2

‘ ‘ ‘ ‘ .

and so on....

X,Y,Z might be 3 features of this dataset, where X and Y are Categorical features. I want to find missing values for X and for Y and then remove X or Y if the missing values are greater than 10% in either of them.

I used this..

/*create format for missing*/
proc format;
value $missfmt ' '='Missing' other='Not Missing';
*value missfmt . ='Missing' other='Not Missing';
run;

/*find out the missing char values*/
proc freq data = trainingdata;
format _char_ $missfmt.;
tables _char_ / missing nocum nopercent out=tempstr;
run;

I am able to see missing and not missing count for each feature, but I am not able to fetch the missing count from this from this, so that I can divide from the total number of values and get the percentage.

I am performing the numerical and categorical missing values search separately as proc means didn't work for tables=_char_

Patrick · Posted 10-01-2019 06:51 AM

@Bond007

If you need the percent values then what about removing keyword NOPERCENT from your code?

Bond007 · Posted 10-01-2019 07:16 AM

I have removed the nopercent from the code.

The output pattern is as follows:

The SAS System

The FREQ Procedure

X

X Frequency Percent

Not missing 1000 equivalent percentage value

Missing 900 equivalent percentage value

Y

Y Frequency Percent

Not missing 800 equivalent percentage value

Missing 1100 equivalent percentage value

and so on.....

I want to fetch the percentage of missing values for X and Y and then remove the features which has greater than 10% of missing values.

I tried to transpose the output dataset, to get a new dataset with an assumption to see a output like this,

Missing Not missing Percent

X 900 1000 ....

Y 1100 800 ....

But it doesn't work.

I want to fetch missing percent for X and missing percent for Y from the output and then check if it is greater than 10%.

Tom · Posted 10-01-2019 10:03 AM

So it seems you have coded your "features" into SAS variables. And you already know how to count the number of missing values. To see if that number is more or less than 20% you just need to count how many observations there are and divide. Since you have both the number of missing and non-missing the total number is just the sum of those two numbers.

Tom · Posted 10-03-2019 01:36 PM

Use the CLASS statement in PROC SUMMARY. Should work as long as the number of variables is not too many.

So define your two formats:

proc format;
  value $missfmt ' '='Missing' other='Not Missing';
  value missfmt low-high='Not Missing' other ='Missing' ;
run;

Pick your dataset.

%let dsn=sashelp.cars;

Get the list of variables:

proc transpose data=&dsn(obs=0) out=step1;
  var _all_;
run;

Summarize

proc summary data = &dsn missing chartype ;
  class _all_ ;
  ways 1;
  output out=step2 ;
  format _char_ $missfmt. _numeric_ missfmt.;
run;

Then collapse to one observation per variable.

data want;
  length varnum 8 _name_ $32. _label_ $256 missing non_missing percent_missing 8;
  keep varnum -- percent_missing;
  do until(last._type_);
    set step2;
    by _type_;
    varnum=indexc(_type_,'1');
    p=varnum;
    set step1 point=p;
    if vvaluex(_name_)='Missing' then missing=_freq_;
    else non_missing=_freq_;
  end;
  missing=sum(0,missing);
  non_missing=sum(0,non_missing);
  percent_missing=missing/(sum(missing,non_missing));
run;

Note variable names will come out in reverse order that they exist in the dataset.

proc sort data=want; by varnum; run;

                                                                non_       percent_
Obs    varnum    _name_         _label_            missing    missing       missing

  1       1      Make                                 0         428      .000000000
  2       2      Model                                0         428      .000000000
  3       3      Type                                 0         428      .000000000
  4       4      Origin                               0         428      .000000000
  5       5      DriveTrain                           0         428      .000000000
  6       6      MSRP                                 0         428      .000000000
  7       7      Invoice                              0         428      .000000000
  8       8      EngineSize     Engine Size (L)       0         428      .000000000
  9       9      Cylinders                            2         426      .004672897
 10      10      Horsepower                           0         428      .000000000
 11      11      MPG_City       MPG (City)            0         428      .000000000
 12      12      MPG_Highway    MPG (Highway)         0         428      .000000000
 13      13      Weight         Weight (LBS)          0         428      .000000000
 14      14      Wheelbase      Wheelbase (IN)        0         428      .000000000
 15      15      Length         Length (IN)           0         428      .000000000

yabwon · Posted 10-01-2019 08:48 AM

Hi @Bond007 ,

do you mean something like this:

data have;
infile cards dlm = "|";
input X : $ 10. Y : $ 10. Z;
cards;
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |   |.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
;
run;


data _null_;
  sentinel1 = .;
  if 0 then set have;
  sentinel2=.;
  array _C_{*} sentinel1-character-sentinel2;
  call symputX("_DIM_",dim(_C_),"G");  /* collect the metadata */
  stop;
run;

options symbolgen;
data _null_;

  sentinel1 = .;
  if 0 then set have;
  sentinel2=.;

  array _C_ sentinel1-character-sentinel2;
  array CNT[&_DIM_.] _temporary_;

  d=dim(_C_);
  put d=;
  do until (eof);
    set have end=eof nobs=nobs;
    do over _C_;
      if MISSING (_C_) then CNT[_I_] + 1;
    end;
  end;

  call execute('data want; set have; drop _N_');
  do over _C_;
    if CNT[_I_]/NOBS > 0.10 then call execute(vname(_C_)); /*Y stays, X drops */
  end;
  call execute('; run;');
  stop;
run;

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

hashman · Posted 10-03-2019 03:18 PM

@yabwon:

Barteku, as a matter of dynamic hash programming curiosity, the entire feat can be pulled in a single step without crossing a step boundary once:

data have ;                                                                                                                                                                                                                                                     
  input (X Y) ($) Z ;                                                                                                                                                                                                                                           
  cards ;                                                                                                                                                                                                                                                       
High   Yes 2.56                                                                                                                                                                                                                                                 
.      No  3.67                                                                                                                                                                                                                                                 
Medium Yes  .                                                                                                                                                                                                                                                   
Low    Yes 2                                                                                                                                                                                                                                                    
.      Yes  .                                                                                                                                                                                                                                                   
High   Yes 2.56                                                                                                                                                                                                                                                 
.      No  3.67                                                                                                                                                                                                                                                 
Medium Yes  .                                                                                                                                                                                                                                                   
Low    Yes 2                                                                                                                                                                                                                                                    
.      .    .                                                                                                                                                                                                                                                   
High   Yes 2.56                                                                                                                                                                                                                                                 
.      No  3.67                                                                                                                                                                                                                                                 
Medium Yes  .                                                                                                                                                                                                                                                   
Low    Yes 2                                                                                                                                                                                                                                                    
.      Yes  .                                                                                                                                                                                                                                                   
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
%let pct = 10 ;                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                
data _null_ ;                                                                                                                                                                                                                                                   
  dcl hash v () ;                                                                                                                                                                                                                                               
  v.definekey ("vn") ;                                                                                                                                                                                                                                          
  v.definedata ("vn", "nmiss") ;                                                                                                                                                                                                                                
  v.definedone () ;                                                                                                                                                                                                                                             
  dcl hiter iv ("v") ;                                                                                                                                                                                                                                           
  do until (lr) ;                                                                                                                                                                                                                                               
    set have end = lr nobs = n ;                                                                                                                                                                                                                                
    array nn _numeric_ ;                                                                                                                                                                                                                                        
    array cc _char_ ;                                                                                                                                                                                                                                           
    do over nn ;                                                                                                                                                                                                                                                
      if not cmiss (nn) then continue ;                                                                                                                                                                                                                         
      vn = put (vname (nn), $32.) ;                                                                                                                                                                                                                             
      link count ;                                                                                                                                                                                                                                              
    end ;                                                                                                                                                                                                                                                       
    do over cc ;                                                                                                                                                                                                                                                
      if not cmiss (cc) then continue ;                                                                                                                                                                                                                         
      vn = put (vname (cc), $32.) ;                                                                                                                                                                                                                             
      link count ;                                                                                                                                                                                                                                              
    end ;                                                                                                                                                                                                                                                       
  end ;                                                                                                                                                                                                                                                         
  dcl hash h (dataset:"have", multidata:"y") ;                                                                                                                                                                                                                  
  do while (iv.next() = 0) ;                                                                                                                                                                                                                                     
    if divide (nmiss, n) * 100 < &pct then h.definekey (vn) ;                                                                                                                                                                                                   
  end ;                                                                                                                                                                                                                                                         
  h.definedone() ;                                                                                                                                                                                                                                              
  h.output (dataset: "want") ;                                                                                                                                                                                                                                  
  stop ;                                                                                                                                                                                                                                                        
  count: if v.find() then nmiss = 1 ;                                                                                                                                                                                                                      
         else             nmiss + 1 ;                                                                                                                                                                                                                      
         v.replace() ;                                                                                                                                                                                                                                          
run ;

Kind regards

Paul D.

yabwon · Posted 10-03-2019 03:42 PM

@hashman :

Paul, 100% agree, hash table give us elegant "one step" solution. The only reason I've made it 2 datasteps was to get the

&_DIM_.

macrovariable. In fact since all loops are "over _C_" I could make it a static value and it would work well in one datastep to, like:

array CNT[1000000] _temporary_; /* ~8MB of RAM, for quite "wide" dataset ;) */

but I thought it wouldn't be "nice/elegant".

But, as you remember our discussion from SAS-L, the one titled "Array search concepts (was: Re: Help?)", I'm finishing/polishing this dynamic array concept, which would allow us to write something like:

...
 array _C_ sentinel1-character-sentinel2;
 call arrayCNT('Allocate', lbound(_C_), hbound(_C_));
...
 do over _C_;
  if MISSING (_C_) then call arrayCNT('Add', _I_, 1);
 end;
...

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

hashman · Posted 10-03-2019 06:36 PM

@yabwon:

Bart, as far as elegance is concerned, I have the same kind of aesthetic aversion to "big enough" allocations. I do remember our -L exchange and am looking forward to seeing the finished product whenever you deem it to have been sufficiently Polished ;). I've learned quite a few FCMP tricks of trade from you and strongly suspect will learn more from this one.

Aesthetics aside, "big enough" can be extremely useful, as you well know, and quite productive speed-wise. This is because a one fell swoop allocation at compile time works much faster than allocating memory at run time for each extra item one at a time. With a really big hash object table, most of its time is spent not so much on the search and retrieve operations as on looking for available memory before doing an insert, especially when its memory footprint approaches the system limits. As a result, code can be run for a long time, only to fail on the next insertion when no more memory is available; whereas if an open-addressed array-based hash table is allocated as a big array at compile time, it either fails at once or else will work to the end without a glitch.

As you've seen from my -L writings, with a single integer key an open-addressed array-based hash table wins hands down both in insert/search speed (~1:2) and memory usage (~1:3 even with a half-sparse table). It also excels at aggregation because it can be done directly in the corresponding array cells (rather than via the FIND-aggregate-REPLACE hash object cycle). This is because the hash function in this case is the simple mod(key,table_size). The hash object readily takes over in terms of search speed for character keys and composite keys of mixed types since against such keys, its internal hash function is much faster than what can be mustered via:

mod (input (md5 (catx (k1-kN), pib6.), table_size)

externally - this combo just takes too long to compute, and a slow hash function defeats the purpose of using hashing in the first place. If SAS surfaced a really tight and fast function that could be used against an arbitrary key, array-based hashing would be much more attractive.

Kind regards

Paul D.

p.s. If you want to see how a hash object table fares against an open-addressed array table with a single integer key, try the program in the attached file.

Ksharp · Posted 10-03-2019 11:45 AM

data have;
infile cards dlm = "|";
input X : $ 10. Y : $ 10. Z;
cards;
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |   |.
High  |Yes|2.56
      |No |3.67
Medium|Yes|.
Low   |Yes|2
      |Yes|.
;
run;


proc transpose data=have(obs=0) out=temp;
var _all_;
run;

proc sql;
select cats('nmiss(',_name_ ,')/count(*) as p_',_name_) into : list separated by ','
 from temp;

create table percent_missing as
select &list
 from have ;
quit;

Find char missing values in a dataset and remove the features with more than 10% of missing values

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Re: Find char missing values in a dataset and remove the features with more than 10% of missing valu

Registration is open

SAS Training: Just a Click Away