Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- General Programming
- /
- dropping variables based on conditions

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-22-2015 12:04 PM

Hi,

I am trying to count zeroes in the variables of my dataset and eliminate those variables where count of zero is greater than 80%. For e.g. If dataset has 100 observations for variable V1, if count of 0s in V1 = 80, please remove V1 from dataset.

Also for variables where count of zeroes is 60% - 80%, put them in a separate varlist.

Coming from R and being used to passing variables in functionopera, I am finding it difficult to do this in SAS . Below is the R code to accomplish the above tasks:

> todrop <- (colSums(have==0)/nrow(have)*100)

> want <- have[ , todrop > 80, drop = FALSE]

> want2 <- have[ , todrop > = 60 & todrop < = 80, drop = FALSE]

thanks,

Nikhil

Accepted Solutions

Solution

01-22-2015
08:41 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 08:41 PM

It is a little easier to handle these types of things if you can convert your data into tall/skinny format.

**data** have ;

input id $ var1-var5 ;

put id var1-var5;

cards;

Ob1 0 123 0 42356 0

Ob2 234234 0 0 986 56

Ob3 0 0 0 673467 0

Ob4 1212 0 0 0 0

Ob5 0 0 245346 65473467 57

Ob6 0 0 3564367 0 0

Ob7 0 0 0 1341234 245245

Ob8 0 0 0 89 0

Ob9 0 0 0 9079 6356

Ob10 4667 673 0 235 0

;;;;

**proc** **transpose** data=have out=tall ;

by id notsorted ;

var var1-var5 ;

**run**;

**proc** **sql** noprint;

%let varlist=;

select _name_

into :varlist separated by ' '

from

(select _name_

, count(*) as nobs

, sum(col1=**0**) as nzero

from tall

group by _name_

)

where round(nzero/nobs,**0.01**) >= **0.80**

;

**quit**;

**data** want ;

set have ;

drop &varlist ;

**run**;

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 01:16 PM

It is more easy to describe what you already having and want to achieve.

Please tell also what SAS version and modules you are having.

The basic SAS language (datastep) is more like No-SQL and there is a Proc SQL.

Having SAS 9.4 there is some matrix calculation as package with Proc DS2.

The complete matrix language is possible by Proc IML

Al lost of stat functions have their own Proc's, see a Proc as same R package fulfilling some job to be done.

---->-- ja karman --<-----

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to jakarman

01-22-2015 01:38 PM

**HAVE:**

VAR1 | VAR2 | VAR3 | VAR4 | VAR5 | |

Ob1 | 0 | 123 | 0 | 42356 | 0 |

Ob2 | 234234 | 0 | 0 | 986 | 56 |

Ob3 | 0 | 0 | 0 | 673467 | 0 |

Ob4 | 1212 | 0 | 0 | 0 | 0 |

Ob5 | 0 | 0 | 245346 | 65473467 | 57 |

Ob6 | 0 | 0 | 3564367 | 0 | 0 |

Ob7 | 0 | 0 | 0 | 1341234 | 245245 |

Ob8 | 0 | 0 | 0 | 89 | 0 |

Ob9 | 0 | 0 | 0 | 9079 | 6356 |

Ob10 | 4667 | 673 | 0 | 235 | 0 |

**WANT: (VAR2 and VAR3 removed as > = 80% 0 values.)**

VAR1 | VAR4 | VAR5 | |

Ob1 | 0 | 42356 | 0 |

Ob2 | 234234 | 986 | 56 |

Ob3 | 0 | 673467 | 0 |

Ob4 | 1212 | 0 | 0 |

Ob5 | 0 | 65473467 | 57 |

Ob6 | 0 | 0 | 0 |

Ob7 | 0 | 1341234 | 245245 |

Ob8 | 0 | 89 | 0 |

Ob9 | 0 | 9079 | 6356 |

Ob10 | 4667 | 235 | 0 |

**ALSO WANT (list of Vars with > = 60% to < 80% zeroes ) , **

%put &varlist

VAR1 VAR5

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 01:57 PM

Would you account for missing values in the calculations? For example, out of 100 observations, 75 are zero and 25 have a missing value. Does that variable get removed by the 80% rule?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

01-22-2015 02:01 PM

I would remove any observation with missing values before hand. but if it were to slip in, yes - it would count in the calculation. if 40% are zeroes and 40% are missing value, that variable won't contribute much to my analysis .

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 03:11 PM

I think I would use a DATA step to count:

data _null_;

set have;

array nums {*} _numeric_;

call symputx('n_vars', dim(nums));

stop;

run;

data zero;

set have nobs=_totalobs_ end=done;

array countme {&n_vars} _numeric_;

keep _numeric_;

array useless {&n_vars} useless1 - useless&n_vars;

retain useless1 - useless&n_vars 0;

do _n_=1 to &n_vars;

if countme{_n_} in (., 0) then useless{_n_} + 1;

end;

if done;

do _n_ = 1 to &n_vars;

countme{_n_} = useless{_n_} / _totalobs_;

end;

run;

That gives you a data set with a single observation. The original variable names are kept (for numeric variables only), and their value is the percent with a missing value. There are various ways to use the information from that point.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 01:47 PM

Hi,

You can use macro code to accomplish this; I'm not sure there is another method - maybe someone else has a better idea. I would approach the macro code as follows:

1) Write a proc means/summary routine to get the number (or percentage) of zeroes in particular variables and output to a dataset.

2) Read that output dataset and build macro variables that contain lists of dataset variables to keep in the original dataset.

3) Read original dataset in again, and output to various other datasets using the keep= dataset option and the macro variables just created.

There may well be a simpler way of doing this, but I haven't found it yet.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 03:00 PM

I see you are trying to learn SAS with a R background.

The most easy approach would be SAS/IML (also matrices), let us focus on the datastep.

It is different to most 4GL's a of the automatic increasing by observation something like RPG AS/400. SQL is the one most a like this. (get a record fixed type)

The concept with missing is much richer with SAS datastep than with R R: 'Not Available' / Missing Values or SQL. (is NULL).

Possible approach 1:

Define the 0's as missing and use proc means summary / tabulate to get the totals on a row into a dataset Base SAS(R) 9.4 Procedures Guide, Third Edition (means).

The total of obeservations in have can be retrieved with SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition (attrn) so you can construct you percentages in the outputvector table

Creating some global SAS macro-s for varnames (there is a function when using arrays) to keep /drop (SYMPUT function) you can used those ones in the set statement datasetoptions for creating want and want2. SAS(R) 9.4 Data Set Options: Reference, Second Edition (keep=)

This is flowing your R-code example

Possible approach 2:

Do a transpose of the dataset as dropping columns in a RDBMS approach is difficult but dropping rows is easy. Base SAS(R) 9.4 Procedures Guide, Third Edition (transpose)

The nmis function SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition on all the vars (variable list) is

Then transpose back again.

It is a different way of translating your needs to some code.

There will be many more solutions for this

---->-- ja karman --<-----

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 03:36 PM

Your question is related to the often-asked question at this forum about how best to determine the variables with an upper limit (%-wise or row-wize) of null/zero values.

The following is a proposed DATA STEP to find such variables.

First a similated dataset with 4000 variables and 100000 rows.

/***********************************************/

/**** simulated dataset with 4000 variables ****/

/**** the values are either 1 or 0. ****/

/***********************************************/

data t_have(keep=aa;

array a1[4000] 3.;

array aa[4000] 3.;

retain _all_;

do i = 1 to 4000;

b = int(1000*ranuni(3));

a1* = b; end;*

do j = 1 to 100000;

do i = 1 to 4000;

b = int(1000*ranuni(4));

aa* = (b > a1 );*

end;

output;

end;

run;

A proposed solution (amongst others):

/**************************************/

/**** proposed DATA STEP solution ****/

/**** table t_list has 2 variables ****/

/**** 1: vname (variable name) ****/

/**** 2: y_cnt (# of non-zero) ****/

/**** zLimit > y_cnt for all vname ****/

/**** zLimit is given by user ****/

/**************************************/

data t_list;

keep vname y_cnt;

length vname $ 32 i_ix y_cnt 8;

array aa[4000];

array x[4000] _temporary_ (1 : 4000);

array y[4000] _temporary_ (4000*0);

zLimit = 5000;

set t_have nobs=n_last;

x_sort=0;

do i=1 to dim(aa) while (x*<(dim(aa)+1)); y[x ]+aa[x]; end;*

if ((mod(_N_,1000)=0) or (_N_=n_last)) then do;

x_sort=0;

do i=1 to dim(aa) while (x*<(dim(aa)+1)); if y[x ] > zLimit then do; x=dim(aa)+1; x_sort+1; end; end; if (x_sort>0) then do; call sortn(of x*

end;

if _N_ = n_last then do;

do ii=1 to dim(aa) while (x[ii]<dim(aa)+1);

i_ix = x(ii);

vname=vname(aa(i_ix));

y_cnt = y[x[ii]];

output;

end;

end;

run;

Hope this helps.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 07:21 PM

%let uplimit=1;

%let lowlimit=0.8;

data _null_;

set have end=done;

length varlst $2000;

array var{*} _numeric_;

array varname{200} $ _temporary_;

array cnt{200} _temporary_;

array chk{200} _temporary_;

_n+1;

do i=1 to dim(var);

if var{i}=0 then cnt{i}+1;

if done then do;

varname{i}=vname(var{i});

chk{i}=cnt{i}/_n;

if &uplimit>=chk{i}>=&lowlimit then varlst=strip(catx(' ', varname{i}, varlst));

end;

end;

if done then

rc=dosubl('data want; set have; drop ' ||varlst||'; run;');

run;

Also for variables where count of zeroes is 60% - 80%

you could modify the above code as the following

%let uplimit=0.8;

%let lowlimit=0.6;

if &lowlimit>chk{i} or chk{i}>&uplimit then varlst=strip(catx(' ', varname{i}, varlst)); |

Solution

01-22-2015
08:41 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-22-2015 08:41 PM

It is a little easier to handle these types of things if you can convert your data into tall/skinny format.

**data** have ;

input id $ var1-var5 ;

put id var1-var5;

cards;

Ob1 0 123 0 42356 0

Ob2 234234 0 0 986 56

Ob3 0 0 0 673467 0

Ob4 1212 0 0 0 0

Ob5 0 0 245346 65473467 57

Ob6 0 0 3564367 0 0

Ob7 0 0 0 1341234 245245

Ob8 0 0 0 89 0

Ob9 0 0 0 9079 6356

Ob10 4667 673 0 235 0

;;;;

**proc** **transpose** data=have out=tall ;

by id notsorted ;

var var1-var5 ;

**run**;

**proc** **sql** noprint;

%let varlist=;

select _name_

into :varlist separated by ' '

from

(select _name_

, count(*) as nobs

, sum(col1=**0**) as nzero

from tall

group by _name_

)

where round(nzero/nobs,**0.01**) >= **0.80**

;

**quit**;

**data** want ;

set have ;

drop &varlist ;

**run**;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-23-2015 02:59 AM

SAS also can do it easily, and be able to process a very big table.

data have ; input id $ var1-var5 ; put id var1-var5; cards; Ob1 0 123 0 42356 0 Ob2 234234 0 0 986 56 Ob3 0 0 0 673467 0 Ob4 1212 0 0 0 0 Ob5 0 0 245346 65473467 57 Ob6 0 0 3564367 0 0 Ob7 0 0 0 1341234 245245 Ob8 0 0 0 89 0 Ob9 0 0 0 9079 6356 Ob10 4667 673 0 235 0 ;;;; run; proc sql noprint; select cat('sum(',strip(name),'=0)/(select count(*) from have) as ',strip(name)) into : list separated by ',' from dictionary.columns where libname='WORK' and memname='HAVE' and name like 'var%'; create table temp as select &list from have; quit; proc transpose data=temp out=x;run; data _null_; set x end=last; if _n_ eq 1 then call execute('data want; set have;drop '); if col1 ge 0.8 then call execute(_NAME_); if last then call execute(';run;'); run;

Xia Keshan

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to ngnikhilgoyal

01-23-2015 06:55 AM

I would let PROC FREQ handle the calculation of percent zero using a VALUE format to group the analysis variables into two categories, zeros and the rest. Then it is just a matter of asking for the names based on your criteria GE 80, between 60 and 80 etc.

I borrowed this data step from Tom

input id $ var1-var5;

cards;

Ob1 0 123 0 42356 0

Ob2 234234 0 0 986 56

Ob3 0 0 0 673467 0

Ob4 1212 0 0 0 0

Ob5 0 0 245346 65473467 57

Ob6 0 0 3564367 0 0

Ob7 0 0 0 1341234 245245

Ob8 0 0 0 89 0

Ob9 0 0 0 9079 6356

Ob10 4667 673 0 235 0

;;;;

value zother

tables var:;

format var: zother.;

ods output onewayfreqs=zother;

set zother;

length vname $

vname = vnamex(scan(table,-

value = coalesce(of var1-var5);

if value eq

select vname into :z80 separated by ' ' from z80 where percent ge

select vname into :z60to80 separated by ' ' from z80 where percent between

%put NOTE: &=z80 &=z60to80;

Message was edited by: data _null_