I'm trying, relatively successfully, to troll through a very large dataset (with 900-1,000 variables) finding variables which are never populated: every observation is missing.
At an observation level this is easy:
array char_var _char_;
do over char_var;
if missing(char_var) then...
end;
But I want to do it as efficiently as possible over the whole dataset. So I've populated a hash with the variable names via vname(char_var), and that works well. Putting the variable's name into the hash is fine. And removing it when I find a non-missing value is fine too. But I'd like to extract the variable's name from the hash and check whether that variable's value is missing - similar to &&varname in macro code. Iterating over all the elements in the array is time-consuming, when I could just do it over the remnants of the hash.
Is there some obvious solution that I'm missing?
Laurie
I would look for the maximums. If the maximum is missing then all values are missing.
Maybe something like
data _null_;
set SASHELP.HEART(obs=1);
length VARNAME $32;
call execute('proc sql; create table MAX as select ');
do I=1 to 1e9 ;
call vnext(VARNAME);
if VARNAME eq 'VARNAME' then leave;
if I > 1 then call execute(',');
call execute(cats('max(',VARNAME,') as '||VARNAME));
end;
call execute(' from SASHELP.HEART;quit;');
run;
Once the heavyweight-lifting of scanning the table is done in one efficient pass, you only have this one-row table to parse.
Check function VVALUEX()
Brilliant! I was getting close with vvalue, but the extra letter on the function made all the difference.
Thank you.
Laurie
I would look for the maximums. If the maximum is missing then all values are missing.
Maybe something like
data _null_;
set SASHELP.HEART(obs=1);
length VARNAME $32;
call execute('proc sql; create table MAX as select ');
do I=1 to 1e9 ;
call vnext(VARNAME);
if VARNAME eq 'VARNAME' then leave;
if I > 1 then call execute(',');
call execute(cats('max(',VARNAME,') as '||VARNAME));
end;
call execute(' from SASHELP.HEART;quit;');
run;
Once the heavyweight-lifting of scanning the table is done in one efficient pass, you only have this one-row table to parse.
NLEVELS option in proc freq?
966 variables, observation length of over 7k, 20.3 million rows - that's never going to fly. But thanks anyway.
It's best to treat it one row at a time. I think reducing the number of variables the require checking (down to around 400-odd, I think) each time is the most efficient way. In a couple of hash tables using up a maximum of 30k each, it keeps IO down to a minimum.
Multi-threaded DS2 would be even better probably, if I had the time to become proficient in it.
data &_output;
set &_input(&_INPUT1_options) end=eof;
array char_var _char_;
array num_var _numeric_;
if _n_ = 1 then do;
length varname varname_remove $ 32;
dcl hash variables(ordered: 'n');
variables.defineKey('varname');
variables.defineDone();
dcl hiter variables_iter('variables');
dcl hash remove_variables(ordered: 'n');
remove_variables.defineKey('varname_remove');
remove_variables.defineDone();
dcl hiter remove_iter('remove_variables');
do over char_var;
varname = vname(char_var);
variables.add();
end;
do over num_var;
varname = vname(num_var);
variables.add();
end;
end;
rc = variables_iter.first();
rc = remove_variables.clear();
do while(rc = 0);
if strip(vvaluex(varname)) notin(' ', '.') then do;
varname_remove = varname;
remove_variables.add();
end;
rc = variables_iter.next();
end;
if remove_variables.num_items then do;
rc = remove_iter.first();
do while(rc = 0);
varname = varname_remove;
variables.check();
variables.remove();
rc = remove_iter.next();
end;
end;
variable_count = variables.num_items;
if variable_count ne lag(variable_count) then
output;
if eof;
variable_count = variables.num_items;
put variable_count 'variables left in hash.';
rc = variables_iter.first();
do while(not rc);
put varname @;
rc = variables_iter.next();
end;
run;
Never gonna fly? Why?
Done and dusted in 5 min on my 4-core windows box.
data HAVE(compress=no);
array A [999];
do i=1 to 20e6;
output;
end;
run;
data _null_;
set HAVE(obs=1);
length VARNAME $32;
call execute('proc sql; create table MAX as select ');
do I=1 to 1e9 until(VARNAME='VARNAME');
call vnext(VARNAME);
if VARNAME eq 'VARNAME' then leave;
if I > 1 then call execute(',');
call execute(cats('max(',VARNAME,') as '||VARNAME));
end;
call execute(' from HAVE;quit;');
run;
NOTE: PROCEDURE SQL used (Total process time):
real time 5:02.24
user cpu time 3:13.84
system cpu time 1:46.11
memory 411383.68k
OS Memory 445396.00k
Timestamp 17/05/2016 04:51:30 PM
Step Count 69 Switch Count 4642
No, I meant using NLEVELS in proc freq.
Fair enough.
Single-pass bulk processing is normally the preferred option for speed. How does the proc sql compare to the data step with your data?
I was quite pleased with my solution, but it was too heavy on resources. I stopped it after 30 minutes, and it still had a long way to go. Yours took 15 minutes. Nicely done.
I put this at the end of it to produce the analysis:
data &_output(compress=no);
set &_output;
array char_var _char_;
array num_var _numeric_;
length varname $ 32;
do over char_var;
if missing(char_var) then do;
varname = vname(char_var);
output;
end;
end;
do over num_var;
if missing(num_var) then do;
varname = vname(num_var);
output;
end;
end;
run;
Glad it worked.
Your code, as sophisticated and clever as it is, spends its time fetching and looking up and testing and processing each individual value. Not too good when you have 20 billion of them....
Plus the SQL is easier to maintain. Complex logic is required sometimes, but not here. 🙂
More performance tips in
Another way to do the post processing:
data &_output(keep=VARNAME compress=no);
set &_output;
length VARNAME TYPE $32;
do i=1 to 1e6 until(VARNAME='VARNAME');
call vnext(VARNAME,TYPE);
if vvaluex(VARNAME)=' ' | (TYPE='N' & left(vvaluex(VARNAME))=:'.') then output;
end;
run;
Also note that function vvaluex() uses formatted values. So to test for actual missing values, we may need to strip the formats if you have any.
proc datasets lib=WORK nolist; modify HAVE; format _ALL_; run;
Quite so. However this is a fact table in a data warehouse, and thus all values (theoretically at least) are actual unformatted values.
From the "more than one way of skinning a cat" department: another way of generating the SQL:
proc sql noprint;
select cat('max(', strip(name), ') as ', strip(name))
into :varstring separated by ', '
from dictionary.columns
where libname = "%upcase(%scan(&_input, 1, .))"
and memname = "%upcase(%scan(&_input, 2, .))";
create table &_output as
select &varstring
from &_input(&_input_options);
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.