BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
saivenkat
Obsidian | Level 7

Hello Everyone,

I see many posts around this concept, but I have unique case as I have 6000 variables approximately in a table and it has 2 million records sizing up to 100 GB. I wanted to drop the columns, if a column has either missing or a single value like '?'

I have been using the following program, which isn't quite optimal.

I have been trying to prevent accumulate these columns as an alternative. Can you suggest the better options, if you think of any?

1000+ columns exist with either missing or ? out of 6000 columns

 

data _null_;
set have end=last;
           if _n_ = 1 then do;
               declare hash cnt();
                     rc = cnt.definekey('nmc');
                     rc = cnt.definedata('nmc');
                rc = cnt.definedone();
          end; 
array _a{*} $ _character_;
  do i=1 to dim(_a);
     if _a{i} ne '?' and _a{i} ne '' then do;
           nmc=vname(_a{i});
           rc = cnt.find();
           if rc then do;
           cnt.add();
           end;
           if last then
           rc = cnt.output(dataset: "nmc_bo"); 
     end;
end;
run;
proc sql noprint;
     select name into :mv_mc separated by ',' from dictionary.columns where libname='WORK' and memname='HAVE' and type='char' and name not in
           (select nmc from nmc_bo);
quit;
%put Note: Columns that will be discarded are &mv_mc;
proc sql noprint;
alter table have drop &mv_mc.;
quit;
proc datasets lib=work nolist;
delete nmc_bo;
quit;

 

1 ACCEPTED SOLUTION

Accepted Solutions
saivenkat
Obsidian | Level 7
data _null_;
   set have end=last;  
   array _a{*} $ _character_;
   length name $12;
   
   if _n_ = 1 then do;
      call missing(idx, name);
      declare hash miss (ordered:"a");
      
      miss.definekey("idx");
      miss.definedata('idx','name');
      miss.definedone();

      declare hiter iter('miss');
      do idx=1 to dim(_a);
         miss.add(key:idx,data:idx,data:vname(_a[idx]));
         end;
      end; 
   rc=iter.first();
   do while (rc = 0);
      if _a[idx] ne '?' and _a[idx] ne '' then do; 
         _x=idx;
         rc=iter.next();
         miss.remove(key:_x);
         end;
      else rc=iter.next();
      end; 

   if last then miss.output(dataset:'drop');
   run;

I wanted to drop the columns _a1.._a3 from the sample dataset "have".. since the columns _a1.._a3 has got the values as "x" across the table..

 

changing the line as if _a[idx] ne 'x' then do; worked fine.. and processing time saved more than 50%.

 

Master piece... Thank you very much!

View solution in original post

11 REPLIES 11
Kurt_Bremser
Super User

@saivenkat wrote:

...

as I have 6000 variables approximately in a table  


FAIL.

Usually a sign of bad data modeling that needs to be corrected with proc transpose.

See Maxim 49.

After the transpose, you simply remove observations that contain the unwanted values.

novinosrin
Tourmaline | Level 20

Can't agree more with @Kurt_Bremser . DIY data model vs rule based. Ted Codd should have lived longer. 

 

Plus though hash looks ok here as loading the names once found with conditions to the hash table(but not at all needed) and then running through the table converting to macro and then calling the macro certainly not only optimal rather not desirable in the first place.

 

You could simply concatenate vars aka vnames once found  and use call symputx

Astounding
PROC Star

Just to clarify ...

 

If just one value is missing and all the rest are valid, get rid of the variable?

 

If just one value is "?" and all the rest are valid, get rid of the variable?

saivenkat
Obsidian | Level 7

should get rid of the variable if it has only missing values or specific string like '?' across the table

himalayan
Fluorite | Level 6

Currently you are looping 6000*nrows, but if you keep a track of the columns already disqualified there is no need to consider them again. Here is using a hash iterator to do it. Plus it makes the drop step a bit easier.

 

Should be n^2 complexity vs n * (n+1)/2 average case complexity now.

 

I've created some dummy data and moved the data output step outside of the loop. I got 30 seconds vs 0.18.

 

 

data have;
   array c[6000] $1 _a1-_a6000;
   do _n_=1 to 10000;
      do i=1 to dim(c);
         c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
         end;
      output;
      end;
   run;
data _null_;
   set have;
   run;
data _null_;
   set have end=last;  
   array _a{*} $ _character_;
   length name $12;
   
   if _n_ = 1 then do;
      call missing(idx, name);
      declare hash miss (ordered:"a");
      
      miss.definekey("idx");
      miss.definedata('idx','name');
      miss.definedone();

      declare hiter iter('miss');
      do idx=1 to dim(_a);
         miss.add(key:idx,data:idx,data:vname(_a[idx]));
         end;
      end; 
   rc=iter.first();
   do while (rc = 0);
      if _a[idx] eq '?' then do; 
         _x=idx;
         rc=iter.next();
         miss.remove(key:_x);
         end;
      else rc=iter.next();
      end; 

   if last then miss.output(dataset:'keep');
   run;

proc sql noprint;
     select name into :mv_mc separated by ',' from keep;
quit;
%put Note: Columns that will be discarded are &mv_mc;
proc sql noprint;
alter table have drop &mv_mc.;
quit;
saivenkat
Obsidian | Level 7

didn't work.. can you revisit the steps? processing time reduced from 16 mins to 4 mins but output not listed the columns that has only string ?

novinosrin
Tourmaline | Level 20

The solution you got checks only for 

'?'

and not both your conditions. As far as the hashing goes, yours and the responders are both linear. However the iterator time makes the difference in responders case as opposed to yours 

himalayan
Fluorite | Level 6

Which part specifically doesn't work. Could you run the example okay?

 

  if _a[idx] eq '?' then do;

You can change the line above to whatever your drop criteria is 

saivenkat
Obsidian | Level 7
data _null_;
   set have end=last;  
   array _a{*} $ _character_;
   length name $12;
   
   if _n_ = 1 then do;
      call missing(idx, name);
      declare hash miss (ordered:"a");
      
      miss.definekey("idx");
      miss.definedata('idx','name');
      miss.definedone();

      declare hiter iter('miss');
      do idx=1 to dim(_a);
         miss.add(key:idx,data:idx,data:vname(_a[idx]));
         end;
      end; 
   rc=iter.first();
   do while (rc = 0);
      if _a[idx] ne '?' and _a[idx] ne '' then do; 
         _x=idx;
         rc=iter.next();
         miss.remove(key:_x);
         end;
      else rc=iter.next();
      end; 

   if last then miss.output(dataset:'drop');
   run;

I wanted to drop the columns _a1.._a3 from the sample dataset "have".. since the columns _a1.._a3 has got the values as "x" across the table..

 

changing the line as if _a[idx] ne 'x' then do; worked fine.. and processing time saved more than 50%.

 

Master piece... Thank you very much!

novinosrin
Tourmaline | Level 20

Here is my stab at it, the same that i said earlier

 


/*Test data, took it from himalayan as I am lazy*/
data have;
   array c[6000] $1 _a1-_a6000;
   do _n_=1 to 10000;
      do i=1 to dim(c);
         c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
         end;
      output;
      end;
   run;

data _null_;
if 0 then set have;
array _a{*} $ _character_;
call symputx('n',dim(_a));
stop;
run;

/* The process */
data _null_;
set have end=lr;
array _a{*} $ _character_;
array __a(&n) $32 _temporary_;
retain  __a;
if _n_=1 then
do _iorc_=1 to dim( _a);
__a(_iorc_)=vname(_a(_iorc_));
end;
do until(k = 0);
k=whichc('?', of _a(*));
if k=0 then leave;
call missing(__a(k),_a(k));
end;
if lr then call symputx('keep',catx(' ',of __a(*)));
run;
/*Here is your drop list*/
%put keep &keep;
novinosrin
Tourmaline | Level 20

 

/* The process with linear search seems to be not bad after all and very easy*/

 

/*Test data, took it from himalayan as I am lazy*/
data have;
   array c[6000] $1 _a1-_a6000;
   do _n_=1 to 10000;
      do i=1 to dim(c);
         c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
         end;
      output;
      end;
   run;

data _null_;
if 0 then set have;
array _a{*} $ _character_;
call symputx('n',dim(_a));
stop;
run;


/* The process with linear search seems to be not bad after all*/
data _null_;
set have end=lr;
array _a{*} $ _character_;
array __a(&n) $32 _temporary_;
retain  __a;
if _n_=1 then
do _iorc_=1 to dim( _a);
__a(_iorc_)=vname(_a(_iorc_));
end;
do _iorc_=1 to dim( _a);
if _a(_iorc_)='?' then call missing(__a(_iorc_));
end;
if lr then call symputx('keep',catx(' ',of __a(*)));
run;

/*Here is your drop list*/
%put keep &keep;

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 3200 views
  • 6 likes
  • 5 in conversation