When @Quentin speaks, people should listen. With 10,000,000 variables to read in, the speed of the string search you are trying to implement will be your last worry, regardless of whether you use the IN operator or the hash object. In fact, I doubt that with so many variables you will be able to process such a data set in any meaningful way.
That said, out of plain curiosity I spent a few minutes emulating your case and then testing the IN operator against the hash:
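data data (keep=var:) strings (keep=str) ;
  array var $16 VAR1-VAR50 ;
  array nn [30] _temporary_ ;
  /* build 30 random search keys; collect them, quoted, in STRINGS */
  do _i_ = 1 to dim (nn) ;
    nn [_i_] + ceil (ranuni(1) * 1e6 * 2 ** _i_) ;
    str = put (nn[_i_], 16.-L) ;
    length STRINGS $ 1024 ;
    strings = catx (" ", strings, quote (trim (str))) ;
    output strings ;
  end ;
  /* 1e6 test rows; about 0.5% of the values are left as a real key, */
  /* the rest are doubled so that they almost never match            */
  do _n_ = 1 to 1e6 ;
    do over var ;
      nvar = nn [ceil (ranuni(2) * dim (nn))] ;
      if ranuni (3) > 0.005 then nvar = nvar * 2 ;
      var = put (nvar, 16.-L) ;
    end ;
    output data ;
  end ;
  /* pass the quoted key list to the IN operator via a macro variable */
  call symputx ("strings", strings) ;
  stop ;
run ;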
%put &=strings ;
data _null_ /* flag_inop */ ;
  set data ;
  array var var: ;
  do over var ;
    if var in (&strings) then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;
run ;
data _null_ /* flag_hash */ ;
  if _n_ = 1 then do ;
    if 0 then set strings ;
    dcl hash h (dataset:"strings") ;
    h.definekey ("str") ;
    h.definedone () ;
  end ;
  set data ;
  array var var: ;
  do over var ;
    if h.check (key:var) = 0 then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;
run ;
In my environment (a 64-bit Linux SAS server), the IN operator ran in 2.39 seconds and the hash in 3.6 seconds. Both methods ultimately use a binary search to find or reject the key, which is why the results don't differ much. With them being so close, it does not really matter which one you opt for.
Note that I did not include any extra variables in DATA. If you really are going to process a data set with 10 million variables, I would do either of the above FLAG computations by first keeping only VAR1-VAR50 on the SET input and keeping only FLAG in the output. Then you can create a view that merges the latter back with DATA (a MERGE without BY, which pairs the observations one to one) and thereafter use that view as input. For example, if opting for the IN operator:
data flag_inop (keep=flag) ;
  set data (keep=var1-var50) ;
  array var var: ;
  do over var ;
    if var in (&strings) then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;
run ;
data data_flag / view = data_flag ;
  merge data flag_inop ;
run ;
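If you prefer the hash object, the same keep-only-what-you-need pattern should carry over directly. The following is only a sketch that combines the two steps already shown above; the output name FLAG_HASH is my own choice for illustration, and I have not timed this variant separately:

```sas
/* Sketch: hash-object version of the FLAG step, reading only VAR1-VAR50 */
data flag_hash (keep=flag) ;            /* FLAG_HASH is an illustrative name */
  if _n_ = 1 then do ;
    if 0 then set strings ;             /* put STR into the PDV for the hash */
    dcl hash h (dataset:"strings") ;
    h.definekey ("str") ;
    h.definedone () ;
  end ;
  set data (keep=var1-var50) ;
  array var var: ;
  do over var ;
    if h.check (key:var) = 0 then leave ;
  end ;
  FLAG = _i_ <= dim (var) ;             /* 1 if the loop was left early */
run ;

/* Pair FLAG back with the full data, one to one, without copying it */
data data_flag / view = data_flag ;
  merge data flag_hash ;
run ;
```

Either way, only the 50 search variables are read while computing FLAG, and the wide data set is touched again only when the view is actually used.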
HTH
Paul Dorfman