BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LaurieF
Barite | Level 11

I'm trying, relatively successfully, to troll through a very large dataset (with 900-1,000 variables) finding variables which are never populated: every observation is missing.

 

At an observation level this is easy:

array char_var _char_;

do over char_var;

     if missing(char_var) then...

     end;

 

But I want to do it as efficiently as possible over the whole dataset. So I've populated a hash with the variable names via vname(char_var), and that works well. Putting the variable's name into the hash is fine. And removing it when I find a non-missing value is fine too. But I'd like to extract the variable's name from the hash and check whether that variable's value is missing - similar to &&varname in macro code. Iterating over all the elements in the array is time-consuming, when I could just do it over the remnants of the hash.

 

Is there some obvious solution that I'm missing?

 

Laurie

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

I would look for the maximums.  If the maximum is missing then all values are missing.

Maybe something like

data _null_;
  set SASHELP.HEART(obs=1);
  length VARNAME $32;
  call execute('proc sql; create table MAX as select ');
  do I=1 to 1e9 ;
     call vnext(VARNAME);
     if VARNAME eq 'VARNAME' then leave;
     if I > 1 then call execute(',');
     call execute(cats('max(',VARNAME,') as '||VARNAME));
   end;
   call execute(' from SASHELP.HEART;quit;');
run;

 

Once the heavyweight-lifting of scanning the table is done in one efficient pass, you only have this one-row table to parse.

View solution in original post

17 REPLIES 17
LaurieF
Barite | Level 11

Brilliant! I was getting close with vvalue, but the extra letter on the function made all the difference.

 

Thank you.

 

Laurie

ChrisNZ
Tourmaline | Level 20

I would look for the maximums.  If the maximum is missing then all values are missing.

Maybe something like

data _null_;
  set SASHELP.HEART(obs=1);
  length VARNAME $32;
  call execute('proc sql; create table MAX as select ');
  do I=1 to 1e9 ;
     call vnext(VARNAME);
     if VARNAME eq 'VARNAME' then leave;
     if I > 1 then call execute(',');
     call execute(cats('max(',VARNAME,') as '||VARNAME));
   end;
   call execute(' from SASHELP.HEART;quit;');
run;

 

Once the heavyweight-lifting of scanning the table is done in one efficient pass, you only have this one-row table to parse.

LaurieF
Barite | Level 11

966 variables, observation length of over 7k, 20.3 million rows - that's never going to fly. But thanks anyway.

 

It's best to treat it one row at a time. I think reducing the number of variables the require checking (down to around 400-odd, I think) each time is the most efficient way. In a couple of hash tables using up a maximum of 30k each, it keeps IO down to a minimum.

 

Multi-threaded DS2 would be even better probably, if I had the time to become proficient in it.

 

data &_output;
set &_input(&_INPUT1_options) end=eof;
array char_var _char_;
array num_var _numeric_;
if _n_ = 1 then do;
   length varname varname_remove $ 32;
   dcl hash variables(ordered: 'n');
   variables.defineKey('varname');
   variables.defineDone();
   dcl hiter variables_iter('variables');
   dcl hash remove_variables(ordered: 'n');
   remove_variables.defineKey('varname_remove');
   remove_variables.defineDone();
   dcl hiter remove_iter('remove_variables');
   do over char_var;
      varname = vname(char_var);
      variables.add();
      end;
   do over num_var;
      varname = vname(num_var);
      variables.add();
      end;
   end;
rc = variables_iter.first();
rc = remove_variables.clear();
do while(rc = 0);
   if strip(vvaluex(varname)) notin(' ', '.') then do;
      varname_remove = varname;
      remove_variables.add();
   end;
   rc = variables_iter.next();
   end;
if remove_variables.num_items then do;
   rc = remove_iter.first();
   do while(rc = 0);
      varname = varname_remove;
      variables.check();
      variables.remove();
      rc = remove_iter.next();
      end;
   end;
variable_count = variables.num_items;
if variable_count ne lag(variable_count) then
   output;
if eof;
variable_count = variables.num_items;
put variable_count 'variables left in hash.';
rc = variables_iter.first();
do while(not rc);
   put varname @;
   rc = variables_iter.next();
   end;
run;

ChrisNZ
Tourmaline | Level 20

Never gonna fly? Why?

Done and dusted in 5 min on my 4-core windows box.

 

data HAVE(compress=no);
 array A [999];
 do i=1 to 20e6;
   output; 
 end;
run;

data _null_;
  set HAVE(obs=1);
  length VARNAME $32;
  call execute('proc sql; create table MAX as select ');
  do I=1 to 1e9 until(VARNAME='VARNAME');
     call vnext(VARNAME);
     if VARNAME eq 'VARNAME' then leave;
     if I > 1 then call execute(',');
     call execute(cats('max(',VARNAME,') as '||VARNAME));
   end;
   call execute(' from HAVE;quit;');
run;

NOTE: PROCEDURE SQL used (Total process time):

real time 5:02.24

user cpu time 3:13.84

system cpu time 1:46.11

 

memory 411383.68k

OS Memory 445396.00k

Timestamp 17/05/2016 04:51:30 PM

Step Count 69 Switch Count 4642

 

 

 

LaurieF
Barite | Level 11

No, I meant using NLEVELS in proc freq.

ChrisNZ
Tourmaline | Level 20

Fair enough.
Single-pass bulk processing is normally the preferred option for speed. How does the proc sql compare to the data step with your data?

LaurieF
Barite | Level 11

I was quite pleased with my solution, but it was too heavy on resources. I stopped it after 30 minutes, and it still had a long way to go. Yours took 15 minutes. Nicely done.

 

I put this at the end of it to produce the analysis:


data &_output(compress=no);
set &_output;
array char_var _char_;
array num_var _numeric_;
length varname $ 32;
do over char_var;
   if missing(char_var) then do;
      varname = vname(char_var);
   output;
   end;
   end;
do over num_var;
   if missing(num_var) then do;
      varname = vname(num_var);
   output;
   end;
   end;
run;

ChrisNZ
Tourmaline | Level 20

Glad it worked.


Your code, as sophisticated and clever as it is, spends its time fetching and looking up and testing and processing each individual value. Not too good when you have 20 billion of them....

Plus the SQL is easier to maintain. Complex logic is required sometimes, but not here. 🙂

 

More performance tips in

LaurieF
Barite | Level 11
To be fair, as it gets to the end my code gets faster and faster, checking only 50-odd variables.

But yes, you're right. Yours using the max function in sql without a 'group by' precludes sorting of all the variables, as is thus very efficient. I like it a lot.

L
ChrisNZ
Tourmaline | Level 20

Another way to do the post processing:

 

data &_output(keep=VARNAME compress=no);
  set &_output;
  length VARNAME TYPE $32;
   do i=1 to 1e6 until(VARNAME='VARNAME');
     call vnext(VARNAME,TYPE);       
     if vvaluex(VARNAME)=' ' | (TYPE='N' & left(vvaluex(VARNAME))=:'.') then output;
   end;
run;      

 

ChrisNZ
Tourmaline | Level 20

Also note that function vvaluex() uses formatted values. So to test for actual missing values, we may need to strip the formats if you have any.

 

proc datasets lib=WORK nolist; modify HAVE; format _ALL_; run;

LaurieF
Barite | Level 11

Quite so. However this is a fact table in a data warehouse, and thus all values (theoretically at least) are actual unformatted values.

 

From the "more than one way of skinning a cat" department: another way of generating the SQL:

proc sql noprint;
select cat('max(', strip(name), ') as ', strip(name))
  into :varstring separated by ', '
  from dictionary.columns
 where libname = "%upcase(%scan(&_input, 1, .))"
   and memname = "%upcase(%scan(&_input, 2, .))";
create table &_output as
   select &varstring
     from &_input(&_input_options);
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 2124 views
  • 5 likes
  • 4 in conversation