Hello Everyone,
I see many posts around this concept, but I have unique case as I have 6000 variables approximately in a table and it has 2 million records sizing up to 100 GB. I wanted to drop the columns, if a column has either missing or a single value like '?'
I have been using the following program, which isn't quite optimal.
I have been trying to prevent accumulate these columns as an alternative. Can you suggest the better options, if you think of any?
1000+ columns exist with either missing or ? out of 6000 columns
data _null_;
set have end=last;
if _n_ = 1 then do;
declare hash cnt();
rc = cnt.definekey('nmc');
rc = cnt.definedata('nmc');
rc = cnt.definedone();
end;
array _a{*} $ _character_;
do i=1 to dim(_a);
if _a{i} ne '?' and _a{i} ne '' then do;
nmc=vname(_a{i});
rc = cnt.find();
if rc then do;
cnt.add();
end;
if last then
rc = cnt.output(dataset: "nmc_bo");
end;
end;
run;
proc sql noprint;
select name into :mv_mc separated by ',' from dictionary.columns where libname='WORK' and memname='HAVE' and type='char' and name not in
(select nmc from nmc_bo);
quit;
%put Note: Columns that will be discarded are &mv_mc;
proc sql noprint;
alter table have drop &mv_mc.;
quit;
proc datasets lib=work nolist;
delete nmc_bo;
quit;
data _null_;
set have end=last;
array _a{*} $ _character_;
length name $12;
if _n_ = 1 then do;
call missing(idx, name);
declare hash miss (ordered:"a");
miss.definekey("idx");
miss.definedata('idx','name');
miss.definedone();
declare hiter iter('miss');
do idx=1 to dim(_a);
miss.add(key:idx,data:idx,data:vname(_a[idx]));
end;
end;
rc=iter.first();
do while (rc = 0);
if _a[idx] ne '?' and _a[idx] ne '' then do;
_x=idx;
rc=iter.next();
miss.remove(key:_x);
end;
else rc=iter.next();
end;
if last then miss.output(dataset:'drop');
run;
I wanted to drop the columns _a1.._a3 from the sample dataset "have".. since the columns _a1.._a3 has got the values as "x" across the table..
changing the line as if _a[idx] ne 'x' then do; worked fine.. and processing time saved more than 50%.
Master piece... Thank you very much!
@saivenkat wrote:
...
as I have 6000 variables approximately in a table
FAIL.
Usually a sign of bad data modeling that needs to be corrected with proc transpose.
See Maxim 49.
After the transpose, you simply remove observations that contain the unwanted values.
Can't agree more with @Kurt_Bremser . DIY data model vs rule based. Ted Codd should have lived longer.
Plus though hash looks ok here as loading the names once found with conditions to the hash table(but not at all needed) and then running through the table converting to macro and then calling the macro certainly not only optimal rather not desirable in the first place.
You could simply concatenate vars aka vnames once found and use call symputx
Just to clarify ...
If just one value is missing and all the rest are valid, get rid of the variable?
If just one value is "?" and all the rest are valid, get rid of the variable?
should get rid of the variable if it has only missing values or specific string like '?' across the table
Currently you are looping 6000*nrows, but if you keep a track of the columns already disqualified there is no need to consider them again. Here is using a hash iterator to do it. Plus it makes the drop step a bit easier.
Should be n^2 complexity vs n * (n+1)/2 average case complexity now.
I've created some dummy data and moved the data output step outside of the loop. I got 30 seconds vs 0.18.
data have;
array c[6000] $1 _a1-_a6000;
do _n_=1 to 10000;
do i=1 to dim(c);
c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
end;
output;
end;
run;
data _null_;
set have;
run;
data _null_;
set have end=last;
array _a{*} $ _character_;
length name $12;
if _n_ = 1 then do;
call missing(idx, name);
declare hash miss (ordered:"a");
miss.definekey("idx");
miss.definedata('idx','name');
miss.definedone();
declare hiter iter('miss');
do idx=1 to dim(_a);
miss.add(key:idx,data:idx,data:vname(_a[idx]));
end;
end;
rc=iter.first();
do while (rc = 0);
if _a[idx] eq '?' then do;
_x=idx;
rc=iter.next();
miss.remove(key:_x);
end;
else rc=iter.next();
end;
if last then miss.output(dataset:'keep');
run;
proc sql noprint;
select name into :mv_mc separated by ',' from keep;
quit;
%put Note: Columns that will be discarded are &mv_mc;
proc sql noprint;
alter table have drop &mv_mc.;
quit;
didn't work.. can you revisit the steps? processing time reduced from 16 mins to 4 mins but output not listed the columns that has only string ?
The solution you got checks only for
'?'
and not both your conditions. As far as the hashing goes, yours and the responders are both linear. However the iterator time makes the difference in responders case as opposed to yours
Which part specifically doesn't work. Could you run the example okay?
if _a[idx] eq '?' then do;
You can change the line above to whatever your drop criteria is
data _null_;
set have end=last;
array _a{*} $ _character_;
length name $12;
if _n_ = 1 then do;
call missing(idx, name);
declare hash miss (ordered:"a");
miss.definekey("idx");
miss.definedata('idx','name');
miss.definedone();
declare hiter iter('miss');
do idx=1 to dim(_a);
miss.add(key:idx,data:idx,data:vname(_a[idx]));
end;
end;
rc=iter.first();
do while (rc = 0);
if _a[idx] ne '?' and _a[idx] ne '' then do;
_x=idx;
rc=iter.next();
miss.remove(key:_x);
end;
else rc=iter.next();
end;
if last then miss.output(dataset:'drop');
run;
I wanted to drop the columns _a1.._a3 from the sample dataset "have".. since the columns _a1.._a3 has got the values as "x" across the table..
changing the line as if _a[idx] ne 'x' then do; worked fine.. and processing time saved more than 50%.
Master piece... Thank you very much!
Here is my stab at it, the same that i said earlier
/*Test data, took it from himalayan as I am lazy*/
data have;
array c[6000] $1 _a1-_a6000;
do _n_=1 to 10000;
do i=1 to dim(c);
c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
end;
output;
end;
run;
data _null_;
if 0 then set have;
array _a{*} $ _character_;
call symputx('n',dim(_a));
stop;
run;
/* The process */
data _null_;
set have end=lr;
array _a{*} $ _character_;
array __a(&n) $32 _temporary_;
retain __a;
if _n_=1 then
do _iorc_=1 to dim( _a);
__a(_iorc_)=vname(_a(_iorc_));
end;
do until(k = 0);
k=whichc('?', of _a(*));
if k=0 then leave;
call missing(__a(k),_a(k));
end;
if lr then call symputx('keep',catx(' ',of __a(*)));
run;
/*Here is your drop list*/
%put keep &keep;
/* The process with linear search seems to be not bad after all and very easy*/
/*Test data, took it from himalayan as I am lazy*/
data have;
array c[6000] $1 _a1-_a6000;
do _n_=1 to 10000;
do i=1 to dim(c);
c[i]=ifC(ranuni(123)>.99 and i>3,'?','x');;
end;
output;
end;
run;
data _null_;
if 0 then set have;
array _a{*} $ _character_;
call symputx('n',dim(_a));
stop;
run;
/* The process with linear search seems to be not bad after all*/
data _null_;
set have end=lr;
array _a{*} $ _character_;
array __a(&n) $32 _temporary_;
retain __a;
if _n_=1 then
do _iorc_=1 to dim( _a);
__a(_iorc_)=vname(_a(_iorc_));
end;
do _iorc_=1 to dim( _a);
if _a(_iorc_)='?' then call missing(__a(_iorc_));
end;
if lr then call symputx('keep',catx(' ',of __a(*)));
run;
/*Here is your drop list*/
%put keep &keep;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.