Solved: Re: Drop variables contain specific string or missing values

saivenkat · Posted 09-26-2018 12:31 PM

Hello Everyone,

I see many posts around this concept, but I have unique case as I have 6000 variables approximately in a table and it has 2 million records sizing up to 100 GB. I wanted to drop the columns, if a column has either missing or a single value like '?'

I have been using the following program, which isn't quite optimal.

I have been trying to prevent accumulate these columns as an alternative. Can you suggest the better options, if you think of any?

1000+ columns exist with either missing or ? out of 6000 columns

data _null_;
set have end=last;
           if _n_ = 1 then do;
               declare hash cnt();
                     rc = cnt.definekey('nmc');
                     rc = cnt.definedata('nmc');
                rc = cnt.definedone();
          end; 
array _a{*} $ _character_;
  do i=1 to dim(_a);
     if _a{i} ne '?' and _a{i} ne '' then do;
           nmc=vname(_a{i});
           rc = cnt.find();
           if rc then do;
           cnt.add();
           end;
           if last then
           rc = cnt.output(dataset: "nmc_bo"); 
     end;
end;
run;
proc sql noprint;
     select name into :mv_mc separated by ',' from dictionary.columns where libname='WORK' and memname='HAVE' and type='char' and name not in
           (select nmc from nmc_bo);
quit;
%put Note: Columns that will be discarded are &mv_mc;
proc sql noprint;
alter table have drop &mv_mc.;
quit;
proc datasets lib=work nolist;
delete nmc_bo;
quit;

saivenkat · Posted 09-27-2018 04:06 AM

data _null_;
   set have end=last;  
   array _a{*} $ _character_;
   length name $12;
   
   if _n_ = 1 then do;
      call missing(idx, name);
      declare hash miss (ordered:"a");
      
      miss.definekey("idx");
      miss.definedata('idx','name');
      miss.definedone();

      declare hiter iter('miss');
      do idx=1 to dim(_a);
         miss.add(key:idx,data:idx,data:vname(_a[idx]));
         end;
      end; 
   rc=iter.first();
   do while (rc = 0);
      if _a[idx] ne '?' and _a[idx] ne '' then do; 
         _x=idx;
         rc=iter.next();
         miss.remove(key:_x);
         end;
      else rc=iter.next();
      end; 

   if last then miss.output(dataset:'drop');
   run;

I wanted to drop the columns _a1.._a3 from the sample dataset "have".. since the columns _a1.._a3 has got the values as "x" across the table..

changing the line as if _a[idx] ne 'x' then do; worked fine.. and processing time saved more than 50%.

Master piece... Thank you very much!

View solution in original post

Kurt_Bremser · Posted 09-26-2018 12:42 PM

@saivenkat wrote:

...

as I have 6000 variables approximately in a table

FAIL.

Usually a sign of bad data modeling that needs to be corrected with proc transpose.

See Maxim 49.

After the transpose, you simply remove observations that contain the unwanted values.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

novinosrin · Posted 09-26-2018 12:52 PM

Can't agree more with @Kurt_Bremser . DIY data model vs rule based. Ted Codd should have lived longer.

Plus though hash looks ok here as loading the names once found with conditions to the hash table(but not at all needed) and then running through the table converting to macro and then calling the macro certainly not only optimal rather not desirable in the first place.

You could simply concatenate vars aka vnames once found and use call symputx

Astounding · Posted 09-26-2018 01:34 PM

Just to clarify ...

If just one value is missing and all the rest are valid, get rid of the variable?

If just one value is "?" and all the rest are valid, get rid of the variable?

saivenkat · Posted 09-26-2018 02:27 PM

should get rid of the variable if it has only missing values or specific string like '?' across the table

himalayan · Posted 09-26-2018 01:56 PM

Currently you are looping 6000*nrows, but if you keep a track of the columns already disqualified there is no need to consider them again. Here is using a hash iterator to do it. Plus it makes the drop step a bit easier.

Should be n^2 complexity vs n * (n+1)/2 average case complexity now.

I've created some dummy data and moved the data output step outside of the loop. I got 30 seconds vs 0.18.

data have;
   array c[6000] $1 _a1-_a6000;
   do _n_=1 to 10000;
      do i=1 to dim(c);
         c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
         end;
      output;
      end;
   run;
data _null_;
   set have;
   run;
data _null_;
   set have end=last;  
   array _a{*} $ _character_;
   length name $12;
   
   if _n_ = 1 then do;
      call missing(idx, name);
      declare hash miss (ordered:"a");
      
      miss.definekey("idx");
      miss.definedata('idx','name');
      miss.definedone();

      declare hiter iter('miss');
      do idx=1 to dim(_a);
         miss.add(key:idx,data:idx,data:vname(_a[idx]));
         end;
      end; 
   rc=iter.first();
   do while (rc = 0);
      if _a[idx] eq '?' then do; 
         _x=idx;
         rc=iter.next();
         miss.remove(key:_x);
         end;
      else rc=iter.next();
      end; 

   if last then miss.output(dataset:'keep');
   run;

proc sql noprint;
     select name into :mv_mc separated by ',' from keep;
quit;
%put Note: Columns that will be discarded are &mv_mc;
proc sql noprint;
alter table have drop &mv_mc.;
quit;

saivenkat · Posted 09-26-2018 02:19 PM

didn't work.. can you revisit the steps? processing time reduced from 16 mins to 4 mins but output not listed the columns that has only string ?

novinosrin · Posted 09-26-2018 02:25 PM

The solution you got checks only for

'?'

and not both your conditions. As far as the hashing goes, yours and the responders are both linear. However the iterator time makes the difference in responders case as opposed to yours

himalayan · Posted 09-26-2018 02:27 PM

Which part specifically doesn't work. Could you run the example okay?

  if _a[idx] eq '?' then do;

You can change the line above to whatever your drop criteria is

saivenkat · Posted 09-27-2018 04:06 AM

data _null_;
   set have end=last;  
   array _a{*} $ _character_;
   length name $12;
   
   if _n_ = 1 then do;
      call missing(idx, name);
      declare hash miss (ordered:"a");
      
      miss.definekey("idx");
      miss.definedata('idx','name');
      miss.definedone();

      declare hiter iter('miss');
      do idx=1 to dim(_a);
         miss.add(key:idx,data:idx,data:vname(_a[idx]));
         end;
      end; 
   rc=iter.first();
   do while (rc = 0);
      if _a[idx] ne '?' and _a[idx] ne '' then do; 
         _x=idx;
         rc=iter.next();
         miss.remove(key:_x);
         end;
      else rc=iter.next();
      end; 

   if last then miss.output(dataset:'drop');
   run;

I wanted to drop the columns _a1.._a3 from the sample dataset "have".. since the columns _a1.._a3 has got the values as "x" across the table..

changing the line as if _a[idx] ne 'x' then do; worked fine.. and processing time saved more than 50%.

Master piece... Thank you very much!

novinosrin · Posted 09-26-2018 03:33 PM

Here is my stab at it, the same that i said earlier


/*Test data, took it from himalayan as I am lazy*/
data have;
   array c[6000] $1 _a1-_a6000;
   do _n_=1 to 10000;
      do i=1 to dim(c);
         c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
         end;
      output;
      end;
   run;

data _null_;
if 0 then set have;
array _a{*} $ _character_;
call symputx('n',dim(_a));
stop;
run;

/* The process */
data _null_;
set have end=lr;
array _a{*} $ _character_;
array __a(&n) $32 _temporary_;
retain  __a;
if _n_=1 then
do _iorc_=1 to dim( _a);
__a(_iorc_)=vname(_a(_iorc_));
end;
do until(k = 0);
k=whichc('?', of _a(*));
if k=0 then leave;
call missing(__a(k),_a(k));
end;
if lr then call symputx('keep',catx(' ',of __a(*)));
run;
/*Here is your drop list*/
%put keep &keep;

novinosrin · Posted 09-26-2018 04:00 PM

/* The process with linear search seems to be not bad after all and very easy*/

/*Test data, took it from himalayan as I am lazy*/
data have;
   array c[6000] $1 _a1-_a6000;
   do _n_=1 to 10000;
      do i=1 to dim(c);
         c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
         end;
      output;
      end;
   run;

data _null_;
if 0 then set have;
array _a{*} $ _character_;
call symputx('n',dim(_a));
stop;
run;


/* The process with linear search seems to be not bad after all*/
data _null_;
set have end=lr;
array _a{*} $ _character_;
array __a(&n) $32 _temporary_;
retain  __a;
if _n_=1 then
do _iorc_=1 to dim( _a);
__a(_iorc_)=vname(_a(_iorc_));
end;
do _iorc_=1 to dim( _a);
if _a(_iorc_)='?' then call missing(__a(_iorc_));
end;
if lr then call symputx('keep',catx(' ',of __a(*)));
run;

/*Here is your drop list*/
%put keep &keep;

Registration is open

SAS Training: Just a Click Away