dropping variables based on conditions

Accepted Solution Solved
Reply
Contributor
Posts: 40
Accepted Solution

dropping variables based on conditions

Hi,

I am trying to count zeroes in the variables of my dataset and eliminate those variables where count of zero is greater than 80%. For e.g. If dataset has 100 observations for variable V1, if count of 0s in V1 = 80, please remove V1 from dataset.

Also for variables where count of zeroes is 60% - 80%, put them in a separate varlist.

Coming from R and being used to passing variables in functionopera, I am finding it difficult to do this in SAS . Below is the R code to accomplish the above tasks:

> todrop <- (colSums(have==0)/nrow(have)*100)

> want <- have[ , todrop > 80, drop = FALSE]

> want2 <- have[ , todrop > = 60  & todrop < =  80, drop = FALSE]

thanks,

Nikhil


Accepted Solutions
Solution
‎01-22-2015 08:41 PM
Super User
Super User
Posts: 6,373

Re: dropping variables based on conditions

It is a little easier to handle these types of things if you can convert your data into tall/skinny format.


data have ;

  input id $ var1-var5 ;

  put id var1-var5;

cards;

Ob1 0 123 0 42356 0

Ob2 234234 0 0 986 56

Ob3 0 0 0 673467 0

Ob4 1212 0 0 0 0

Ob5 0 0 245346 65473467 57

Ob6 0 0 3564367 0 0

Ob7 0 0 0 1341234 245245

Ob8 0 0 0 89 0

Ob9 0 0 0 9079 6356

Ob10 4667 673 0 235 0

;;;;

proc transpose data=have out=tall ;

  by id notsorted ;

  var var1-var5 ;

run;

proc sql noprint;

%let varlist=;

  select _name_

  into :varlist separated by ' '

  from

(select _name_

      , count(*) as nobs

      , sum(col1=0) as nzero

from tall

group by _name_

)

  where round(nzero/nobs,0.01) >= 0.80

  ;

quit;

data want ;

  set have ;

  drop &varlist ;

run;

View solution in original post


All Replies
Valued Guide
Posts: 3,206

Re: dropping variables based on conditions

It is more easy to describe what you already having and want to achieve.

Please tell also what SAS version and modules you are having.

The basic SAS language (datastep) is more like No-SQL and there is a Proc SQL.

Having SAS 9.4 there is some matrix calculation as package with Proc DS2.

The complete matrix language is possible by Proc IML

Al lost of stat functions have their own Proc's, see a Proc as same R package fulfilling some job to be done. 

---->-- ja karman --<-----
Contributor
Posts: 40

Re: dropping variables based on conditions

Hi Jaap, Here you go :

HAVE:

VAR1

VAR2

VAR3

VAR4

VAR5

Ob1

0

123

0

42356

0

Ob2

234234

0

0

986

56

Ob3

0

0

0

673467

0

Ob4

1212

0

0

0

0

Ob5

0

0

245346

65473467

57

Ob6

0

0

3564367

0

0

Ob7

0

0

0

1341234

245245

Ob8

0

0

0

89

0

Ob9

0

0

0

9079

6356

Ob10

4667

673

0

235

0

WANT: (VAR2 and VAR3 removed as > =  80% 0 values.)

VAR1

VAR4

VAR5

Ob1

0

42356

0

Ob2

234234

986

56

Ob3

0

673467

0

Ob4

1212

0

0

Ob5

0

65473467

57

Ob6

0

0

0

Ob7

0

1341234

245245

Ob8

0

89

0

Ob9

0

9079

6356

Ob10

4667

235

0

ALSO WANT (list of Vars with > = 60% to < 80% zeroes ) ,

%put &varlist

VAR1 VAR5

Respected Advisor
Posts: 4,998

Re: dropping variables based on conditions

Would you account for missing values in the calculations?  For example, out of 100 observations, 75 are zero and 25 have a missing value.  Does that variable get removed by the 80% rule?

Contributor
Posts: 40

Re: dropping variables based on conditions

I would remove any observation with missing values before hand. but if it were to slip in, yes - it would count in the calculation. if 40% are zeroes and 40% are missing value, that variable won't contribute much to my analysis .

Respected Advisor
Posts: 4,998

Re: dropping variables based on conditions

I think I would use a DATA step to count:

data _null_;

   set have;

   array nums {*} _numeric_;

   call symputx('n_vars', dim(nums));

   stop;

run;

data zero;

   set have nobs=_totalobs_ end=done;

   array countme {&n_vars} _numeric_;

   keep _numeric_;

   array useless  {&n_vars} useless1 - useless&n_vars;

   retain useless1 - useless&n_vars 0;

   do _n_=1 to &n_vars;

      if countme{_n_} in (., 0) then useless{_n_} + 1;

   end;
   if done;

   do _n_ = 1 to &n_vars;

        countme{_n_} = useless{_n_} / _totalobs_;

   end;

run;

That gives you a data set with a single observation.  The original variable names are kept (for numeric variables only), and their value is the percent with a missing value.  There are various ways to use the information from that point.

N/A
Posts: 1

Re: dropping variables based on conditions

Hi,

You can use macro code to accomplish this; I'm not sure there is another method - maybe someone else has a better idea. I would approach the macro code as follows:

1) Write a proc means/summary routine to get the number (or percentage) of zeroes in particular variables and output to a dataset.

2) Read that output dataset and build macro variables that contain lists of dataset variables to keep in the original dataset.

3) Read original dataset in again, and output to various other datasets using the keep= dataset option and the macro variables just created.

There may well be a simpler way of doing this, but I haven't found it yet.

Valued Guide
Posts: 3,206

Re: dropping variables based on conditions

I see you are trying to learn SAS with a R background.

The most easy approach would be SAS/IML (also matrices), let us focus on the datastep.

It is different to most 4GL's a of the automatic increasing by observation something like RPG AS/400. SQL is the one most a like this.  (get a record fixed type)

The concept with missing is much richer with SAS datastep than with R R: 'Not Available' / Missing Values or SQL.  (is NULL).

Possible approach 1:

Define the 0's as missing and use proc means summary / tabulate to get the totals on a row into a dataset Base SAS(R) 9.4 Procedures Guide, Third Edition (means).

The total of obeservations in have can be retrieved  with SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition (attrn) so you can construct you percentages in the outputvector table

Creating some global SAS macro-s for varnames (there is a function when using arrays) to keep /drop (SYMPUT function) you can used those ones in the  set statement datasetoptions for creating want and want2.  SAS(R) 9.4 Data Set Options: Reference, Second Edition (keep=)

This is flowing your R-code example

Possible approach 2:

Do a transpose of the dataset as dropping columns in a RDBMS approach is difficult but dropping rows is easy. Base SAS(R) 9.4 Procedures Guide, Third Edition (transpose)

The nmis function SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition on all the vars (variable list) is

Then transpose back again.

It is a different way of translating your needs to some code.

There will be many more solutions for this

---->-- ja karman --<-----
Contributor
Posts: 52

Re: dropping variables based on conditions

Your question is related to the often-asked question at this forum about how best to determine the variables with an upper limit (%-wise or row-wize) of null/zero values.

The following is a proposed DATA STEP to find such variables.

First a similated dataset with 4000 variables and 100000 rows.

/***********************************************/
/**** simulated dataset with 4000 variables ****/
/**** the values are either 1 or 0.         ****/
/***********************************************/
data t_have(keep=aaSmiley Happy;
   array a1[4000] 3.;
   array aa[4000] 3.;
   retain _all_;

   do i = 1 to 4000;
      b = int(1000*ranuni(3));
      a1 = b;
   end;

   do j = 1 to 100000;
   do i = 1 to 4000;
      b = int(1000*ranuni(4));
      aa = (b > a1);
   end;
      output;
   end;
run;


A proposed solution (amongst others):

/**************************************/
/**** proposed DATA STEP solution  ****/
/**** table t_list has 2 variables ****/
/**** 1: vname (variable name)     ****/
/**** 2: y_cnt (# of non-zero)     ****/
/**** zLimit > y_cnt for all vname ****/
/**** zLimit is given by user      ****/
/**************************************/
data t_list;
   keep vname y_cnt;
   length vname $ 32  i_ix y_cnt 8;
   array aa[4000];
   array x[4000] _temporary_ (1 : 4000);
   array y[4000] _temporary_ (4000*0);
   zLimit = 5000;

   set t_have nobs=n_last;
   x_sort=0;
   do i=1 to dim(aa) while (x<(dim(aa)+1));
      y[x]+aa[x];
   end;

   if ((mod(_N_,1000)=0) or (_N_=n_last)) then do;
      x_sort=0;
      do i=1 to dim(aa) while (x<(dim(aa)+1));
         if y[x] > zLimit then do; x=dim(aa)+1; x_sort+1; end;
      end;
      if (x_sort>0) then do; call sortn(of x

  • ); end;
       end;
  •    if _N_ = n_last then do;
          do ii=1 to dim(aa) while (x[ii]<dim(aa)+1);
             i_ix = x(ii);
             vname=vname(aa(i_ix));
             y_cnt = y[x[ii]];
             output;
          end;
       end;
    run;


    Hope this helps.

    New Contributor
    Posts: 3

    Re: dropping variables based on conditions

    %let uplimit=1;

    %let lowlimit=0.8;

    data _null_;

         set have end=done;

         length varlst $2000;

         array var{*} _numeric_;

         array varname{200} $ _temporary_;

         array cnt{200} _temporary_;

         array chk{200} _temporary_;

         _n+1;

         do i=1 to dim(var);

               if var{i}=0 then cnt{i}+1;

               if done then do;

                    varname{i}=vname(var{i});

                    chk{i}=cnt{i}/_n;

                    if &uplimit>=chk{i}>=&lowlimit then varlst=strip(catx(' ', varname{i}, varlst));

               end;

         end;

         if done then

               rc=dosubl('data want; set have; drop ' ||varlst||'; run;');

    run;

    Also for variables where count of zeroes is 60% - 80%

    you could modify the above code as the following

    %let uplimit=0.8;

    %let lowlimit=0.6;

                if &lowlimit>chk{i} or chk{i}>&uplimit then varlst=strip(catx(' ', varname{i}, varlst));
    Solution
    ‎01-22-2015 08:41 PM
    Super User
    Super User
    Posts: 6,373

    Re: dropping variables based on conditions

    It is a little easier to handle these types of things if you can convert your data into tall/skinny format.


    data have ;

      input id $ var1-var5 ;

      put id var1-var5;

    cards;

    Ob1 0 123 0 42356 0

    Ob2 234234 0 0 986 56

    Ob3 0 0 0 673467 0

    Ob4 1212 0 0 0 0

    Ob5 0 0 245346 65473467 57

    Ob6 0 0 3564367 0 0

    Ob7 0 0 0 1341234 245245

    Ob8 0 0 0 89 0

    Ob9 0 0 0 9079 6356

    Ob10 4667 673 0 235 0

    ;;;;

    proc transpose data=have out=tall ;

      by id notsorted ;

      var var1-var5 ;

    run;

    proc sql noprint;

    %let varlist=;

      select _name_

      into :varlist separated by ' '

      from

    (select _name_

          , count(*) as nobs

          , sum(col1=0) as nzero

    from tall

    group by _name_

    )

      where round(nzero/nobs,0.01) >= 0.80

      ;

    quit;

    data want ;

      set have ;

      drop &varlist ;

    run;

    Grand Advisor
    Posts: 9,594

    Re: dropping variables based on conditions

    SAS also can do it easily, and be able to process a very big table.

    data have ;
      input id $ var1-var5 ;
      put id var1-var5;
    cards;
    Ob1 0 123 0 42356 0
    Ob2 234234 0 0 986 56
    Ob3 0 0 0 673467 0
    Ob4 1212 0 0 0 0
    Ob5 0 0 245346 65473467 57
    Ob6 0 0 3564367 0 0
    Ob7 0 0 0 1341234 245245
    Ob8 0 0 0 89 0
    Ob9 0 0 0 9079 6356
    Ob10 4667 673 0 235 0
    ;;;;
    run;
    proc sql noprint;
    select cat('sum(',strip(name),'=0)/(select count(*) from have) as ',strip(name)) into : list separated by ','
     from dictionary.columns
      where libname='WORK' and memname='HAVE' and name like 'var%';
    create table temp as
     select &list from have;
    quit;
    proc transpose data=temp out=x;run;
    data _null_;
     set x end=last;
     if _n_ eq 1 then call execute('data want; set have;drop ');
     if col1 ge 0.8 then call execute(_NAME_);
     if last then call execute(';run;');
    run;
     
    

    Xia Keshan

    Respected Advisor
    Posts: 3,775

    Re: dropping variables based on conditions

    I would let PROC FREQ handle the calculation of percent zero using a VALUE format to group the analysis variables into two categories, zeros and the rest.  Then it is just a matter of asking for the names based on your criteria  GE 80, between 60 and 80 etc.

    I borrowed this data step from Tom

    data have ;
       input id $ var1-var5;
       cards;
    Ob1 0 123 0 42356 0
    Ob2 234234 0 0 986 56
    Ob3 0 0 0 673467 0
    Ob4 1212 0 0 0 0
    Ob5 0 0 245346 65473467 57
    Ob6 0 0 3564367 0 0
    Ob7 0 0 0 1341234 245245
    Ob8 0 0 0 89 0
    Ob9 0 0 0 9079 6356
    Ob10 4667 673 0 235 0
    ;;;;
       run;
    proc format;
      
    value zother 0='0' other='1' ._-.z=' ';
      
    run;
    proc freq data=have;
       tables var:;
       format var: zother.;
      
    ods output onewayfreqs=zother;
       run;
    data z80(keep=vname percent);
       set zother;
       length vname $32 value 8;
       vname = vnamex(scan(table,-
    1,' '));
       value = coalesce(of var1-var5);
      
    if value eq 0;
      
    run;
    proc sql noprint;
      
    select vname into :z80 separated by ' ' from z80 where percent ge 80;
      
    select vname into :z60to80 separated by ' ' from z80 where percent between 60 and 80;
      
    run;
    %put NOTE: &=z80 &=z60to80;


    1-23-2015 5-55-54 AM.png

    Message was edited by: data _null_

    ☑ This topic is SOLVED.

    Need further help from the community? Please ask a new question.

    Discussion stats
    • 12 replies
    • 1195 views
    • 9 likes
    • 9 in conversation