Hi All, I thought this was an interesting problem, and a chance for me to "play" with some techniques in SAS I don't normally use. Apologies in advance for the length of this post.

General Concepts:

Use of sashelp.vcolumn:

In my environment, we pre-allocate about 30 SAS libraries with hundreds of tables and tens of thousands of variables. Some of those libraries point to Oracle databases. After the incredibly painful process (~10 minutes) of retrieving 4 variables, the code generation finally began! However, even when I cleared all libnames (libname _all_ clear;), sashelp.vcolumn was still really slow. I recommend using proc contents instead of sashelp.vcolumn. Compare the three approaches below. I also switched to other datasets (sashelp), because we also want to compare performance across different approaches, and the example data is just too small to evaluate performance.

* using dictionary.columns ;
proc sql;
   create table vars as
      select libname, memname, name
      from dictionary.columns
      where catx(".",libname,memname) in ("SASHELP.CLASS","SASHELP.CARS","WORK.ZIPCODE")
      order by libname, memname, varnum
   ;
quit;

* using sashelp.vcolumn ;
data vars;
   set sashelp.vcolumn (keep=libname memname name);
   where catx(".",libname,memname) in ("SASHELP.CLASS","SASHELP.CARS","WORK.ZIPCODE");
run;

* using proc contents ;
%macro get_variables(data);
   proc contents data=&data out=temp (keep=libname memname name) noprint;
   run;
   proc append base=vars data=temp;
   run;
%mend;

proc datasets lib=work nowarn nolist;
   delete vars;
quit;

%get_variables(sashelp.class);
%get_variables(sashelp.cars);
%get_variables(work.zipcode);

The actual performance will vary based on your environment, but in general I find the best performance is 1) proc contents, 2) dictionary.columns, and 3) sashelp.vcolumn, in that order, especially if you just need the columns from a single dataset. Try each approach in your environment and see which works best for you.

call execute vs.
dynamic code generation:

There is nothing intrinsically wrong with call execute. However, I generally prefer writing dynamically generated code to a temporary file, then %including that temporary file. This way, I can easily debug the generated code using "fslist". This assumes code generation via DMS; otherwise use a data _null_ step to echo the code to the log. I also get more control over the code formatting. This has nothing to do with code execution, but can be helpful with debugging if your generated code block is large. Here is an example:

* dynamically create code ;
filename code temp;

data _null_;
   set vars end=eof;
   file code;
   if _n_ eq 1 then do;
      call symputx("firstvar",name,"G");
      put @1 "proc sql;";
      put @4 "create table temp as";
      put @7 "select";
   end;
   put @10 memname $quote. @80 "as Table,";
   put @10 name $quote. @80 "as Name,";
   put @10 "nmiss(" name +(-1) ")" @80 "as Missing,";
   put @10 "sum(case when cats(" name +(-1) ")='N/A' then 1 else 0 end)" @80 "as Not_Applicable,";
   put @10 "sum(case when cats(" name +(-1) ") not in ('N/A', ' ') then 1 else 0 end)" @80 "as Not_Missing,";
   if eof then do;
      put @10 "0" @80 "as dummy";
      put @7 "from";
      put @10 libname +(-1) "." memname;
      put @4 ";";
      put @1 "quit;";
   end;
run;

* check out the generated code ;
dm "fslist code";

* execute the code ;
%include code;

Note: You don't need the "firstvar" macro variable for the proc transpose (see later code below), but the above shows a way to set it to the first variable in the source data.

Missing option:

The character representation of a missing numeric value is controlled by the missing option, which by default is ".". If you want to make absolutely sure that your code works correctly when checking for missing numeric values, explicitly set the missing option. You can also simplify the code slightly if you explicitly set it to blank:

* reset character used for missing numeric data ;
%let missing=%sysfunc(getoption(missing));
options missing=" ";

/* Your code.
Both character and numeric missing values are now " " */

* restore character used for missing numeric data ;
options missing="&missing";

Increasing disk I/O performance:

This is more of an aside. I did test this approach, but it did not perform better than using views. When I need high-performance disk I/O, I've been using this approach lately:

libname spdework spde "%sysfunc(pathname(work))" temp=yes;

Then use a two-level name, spdework.<your work dataset>, instead of a one-level work name. This can yield really good performance, especially for a large work dataset that is used repeatedly in downstream code. You can also set the user= option so that single-level names write to spdework:

options user=spdework;
data foo; x=1; run;

Use views when appropriate:

Use a data step or SQL view when you can to reduce disk I/O. Determining "when appropriate" is beyond the scope of this post, but see the code using a data step view below.

Problem Analysis:

If we review the problem, what we want to do is group the data into "buckets" (missing, not applicable, not missing) for each variable, then get frequency counts for those buckets. Further analysis of the previous answers shows that we're generating our results one column at a time, over the entire dataset – each proc sql/select/union all code block is processing the entire dataset, one variable at a time. The original poster indicated that his/her "real" data is 2M+ records. It may also have many more variables than the 4 in the sample datasets. So the total data processed is # of records * # of variables – for example, 2M records * 10 variables = processing 20M records. In the "real" problem, performance does matter.

I also considered: "Is there a way to do this without pre-processing the data (sashelp.vcolumn or proc contents) and dynamic code generation?" And: "Is there a way to process the data in one pass?"
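As an aside on the missing option point above: with options missing=" ", the formatted value of a missing numeric is blank, so a single missing(vvaluex(...)) test covers both character and numeric variables. Here is a minimal sketch of that behavior (my own illustration, not part of the solutions below; the variable names are made up):

* illustration: missing numerics format as blank under options missing=" " ;
%let missing=%sysfunc(getoption(missing));
options missing=" ";

data _null_;
   x=.;      * missing numeric ;
   c=" ";    * missing character ;
   put "formatted missing numeric: [" x "]";   * prints blank, not "." ;
   if missing(vvaluex("x")) and missing(vvaluex("c")) then
      put "both variables test as missing via vvaluex";
run;

options missing="&missing";

With the default missing=".", vvaluex("x") would return "." and the missing() test on the formatted value would fail for numerics, which is why the solutions below set the option explicitly.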
I created a number of code versions – I'll only include a few of the approaches (I won't post the disasters lol). Some comments:

A data step view performed better than creating an SPDE work dataset.

Using call vnext and vvaluex, I did not have to pre-process the data.

I found it easier to process the datasets one at a time and use proc append, especially since I didn't use SQL.

In the proc summary approach, I still process # of records * # of variables, so the data does "bloat". Luckily, the logic checks are mutually exclusive – a value can ONLY be one of missing, not applicable, or not missing. Otherwise the dataset (data step view) would bloat even more.

In the first hash object approach, I only process the data once. But it performed the same as the "bloated" data step view. I conjecture that the overhead of looking up and replacing the counts for every data value was enough that it performed no better than the proc summary approach. But it was fun coding this approach :-).

I also found a documentation hit that was really interesting: http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a002585310.htm (search on "Maintaining Key Summaries"). It took me a while to understand this doc – here is my understanding: when you declare a suminc variable, the hash object maintains an increment counter on all its keys. When a key is found, the value of the suminc variable is added to the current internal value maintained by the hash object for that key. For the first "hit" on a hash object key, you need to use add() instead of find(). At the end of the processing, use the sum() method to retrieve the increment counter for a given key. Since we're just interested in frequency counts, the value of the suminc variable is always 1. See an example below.

Here are my code examples. Hopefully they are somewhat self-explanatory; otherwise post a follow-up question.
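To make the suminc mechanics concrete before the full solutions, here is a stripped-down sketch of just that technique (my own illustration; the hash, key, and dataset names are made up and are not part of the solutions that follow):

* minimal suminc illustration: count rows per value of Sex in sashelp.class ;
data _null_;
   set sashelp.class end=eof;
   length key $8;
   if _n_=1 then do;
      dcl hash h(suminc: "one");   * "one" is the increment variable ;
      h.defineKey("key");
      h.defineDone();
   end;
   key=sex;
   one=1;
   /* first hit on a key must use add(); subsequent hits use find(), */
   /* which silently adds "one" to the key's internal counter        */
   if (h.find() ne 0) then h.add();
   if eof then do;
      * retrieve the internal counter for one key with the sum() method ;
      key="F";
      rc=h.sum(sum: count);
      put "count for " key= count=;
   end;
run;

Note that the counter lives inside the hash object, not in a data portion variable – that is why sum() is needed to get it back out.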
Code Prologue (create example data used for all programs):

options mprint nomlogic;

data table1;
   input Name $ Age;
   cards;
John 45
N/A 30
. 15
Carl 25
;
run;

data table2;
   input Color $ Height;
   cards;
Blue 110
N/A 120
. 100
Red .
;
run;

* create a large dataset to test performance ;
data work.zipcode;
   set sashelp.zipcode sashelp.zipcode sashelp.zipcode sashelp.zipcode sashelp.zipcode;
run;

%bench macro (macro I use when I'm benchmarking performance):

Note: comment out the call to %parmv.

/*=====================================================================
Program Name          : bench.sas
Purpose               : Measures elapsed time between successive
                        invocations.
SAS Version           : SAS 8.2
Input Data            : N/A
Output Data           : N/A
Macros Called         : parmv
Originally Written by : Scott Bass
Date                  : 24APR2006
Program Version #     : 1.0
=======================================================================
Modification History  : Original version
=====================================================================*/

/*---------------------------------------------------------------------
Usage:

* Start benchmarking. ;
* Both invocations are identical as long as start ;
* has not been previously invoked ;
%bench;
%bench(start);

data _null_; rc=sleep(3); run;

* Get elapsed time, should be approx. 3 seconds elapsed, 3 seconds total ;
%bench(elapsed);

data _null_; rc=sleep(7); run;

* Get another elapsed time, should be approx. 7 seconds elapsed, 10 seconds total ;
%bench;  * elapsed parm not required since start was already called ;

data _null_; rc=sleep(2); run;

* End benchmarking, should be approx. 2 seconds elapsed, 12 seconds total ;
* Must be called after start. Resets benchmarking. ;
%bench(end);
-----------------------------------------------------------------------
Notes:

If %bench has never been invoked, calling %bench without parameters
starts benchmarking. You may also explicitly specify the start
parameter.
Explicitly specifying the start parameter resets benchmarking,
although normally the end parameter would be used.

If %bench has been previously invoked with the start parameter,
calling %bench without parameters prints the elapsed time. You may
also explicitly specify the elapsed parameter.

To end benchmarking and reset the start time, specify the end
parameter.

Only the elapsed or end parameters (or equivalent processing) print
time measurements to the log. The start parameter does not print
anything to the log.

The only parameter that needs to be explicitly specified is end.
Otherwise the macro should do the right thing, either starting
benchmarking or printing elapsed times.

Benchmarking a time period greater than 24 hours is "unpredictable".
---------------------------------------------------------------------*/

%macro bench
/*---------------------------------------------------------------------
Measures elapsed time between successive invocations.
---------------------------------------------------------------------*/
(PARM          /* Benchmarking parameter (Opt).                      */
               /* If not specified:                                  */
               /*    If first invocation, start benchmarking.        */
               /*    If subsequent invocation, print elapsed time.   */
               /* Valid values are START ELAPSED END.
               */
);

%local macro parmerr time_elapsed time_total time_elapsed_str time_total_str h m s;
%let macro = &sysmacroname;

%* check input parameters ;
%parmv(PARM, _req=0,_words=0,_case=U,_val=START ELAPSED END)
%if (&parmerr) %then %goto quit;

%* nested macro for printing ;
%macro print(_parm);
   %let time_elapsed = %sysevalf(%sysfunc(datetime()) - &_elapsed);
   %let time_total   = %sysevalf(%sysfunc(datetime()) - &_start);
   %let h = %sysfunc(hour(&time_elapsed),z2.);
   %let m = %sysfunc(minute(&time_elapsed),z2.);
   %let s = %sysfunc(second(&time_elapsed),z2.);
   %let time_elapsed_str = &h hours, &m minutes, &s seconds;
   %let h = %sysfunc(hour(&time_total),z2.);
   %let m = %sysfunc(minute(&time_total),z2.);
   %let s = %sysfunc(second(&time_total),z2.);
   %let time_total_str = &h hours, &m minutes, &s seconds;
   %put;
   %put Benchmark &_parm:;
   %put;
   %put Elapsed seconds = &time_elapsed_str &time_elapsed;
   %put Total seconds   = &time_total_str &time_total;
   %put;
%mend;

%* declare global variables ;
%global _start _elapsed;

%if (&parm eq START) %then %do;
   %let _start   = %sysfunc(datetime());
   %let _elapsed = &_start;
%end;
%else %if (&parm eq ELAPSED) %then %do;
   %if (&_start eq ) %then %do;
      %put ERROR: Benchmarking must be started before elapsed time can be printed.;
      %goto quit;
   %end;
   %else %do;
      %print(ELAPSED)
      %let _elapsed = %sysfunc(datetime());
   %end;
%end;
%else %if (&parm eq END) %then %do;
   %if (&_start eq ) %then %do;
      %put ERROR: Benchmarking must be started before elapsed time can be printed.;
      %goto quit;
   %end;
   %else %do;
      %print(END)
      %* reset benchmarking ;
      %symdel _start _elapsed / nowarn;
   %end;
%end;
%else %if (&parm eq ) %then %do;
   %* derive proper parm then recursively call this macro ;
   %if (&_start eq ) %then %do;
      %bench(start)
   %end;
   %else %do;
      %bench(elapsed)
   %end;
%end;

%quit:
%* if (&parmerr) %then %abort;
%mend;

/******* END OF FILE *******/

Data Step View and PROC SUMMARY approach:

%macro get_counts(data);
   * reset character used for missing numeric data ;
   %let
missing=%sysfunc(getoption(missing));
   options missing=" ";

   * create additional grouping variables ;
   data vgrouped / view=vgrouped;
      set &data indsname=dsn;

      * set a dummy variable as an end of variable list marker ;
      retain dummy "";
      drop dummy;

      * define additional variables ;
      * varname must be long enough to contain memname_varname ;
      length libname $8 memname $32 varname $65 cvalue $200 measure $15;

      * we only need to get the libname and memname once ;
      if (_n_=1) then do;
         libname=scan(dsn,1,".");
         memname=scan(dsn,2,".");
         retain libname memname;
      end;

      * spin through all the variables in the dataset, building grouping variables ;
      * since our checks are mutually exclusive, this will not cause bloating of the dataset ;
      do while (1);
         call vnext(varname);
         if (varname in ("dsn","eof")) then continue;
         if (varname="dummy") then leave;

         * get the variable value (character, formatted value) ;
         cvalue=vvaluex(varname);

         * build the grouping variable ;
         select;
            when (missing(cvalue))     measure="Missing";
            when (strip(cvalue)="N/A") measure="Not_Applicable";
            when (not missing(cvalue)) measure="Not_Missing";
            /* this covers all possibilities, so I purposely left out an otherwise statement */
         end;

         * build the new variable name ;
         varname=catx("_",memname,varname);

         * output the observation ;
         output;
      end;
   run;

   * now summarize over each measure to get the frequency counts ;
   proc summary data=vgrouped nway;
      class libname memname varname measure;
      output out=summary (drop=_type_);
   run;

   * transpose data ;
   proc transpose data=summary out=transposed (drop=_name_);
      by libname memname varname;
      id measure;
      var _freq_;
   run;

   * set desired PDV order, ensure all variables are present, and replace missing values with zero ;
   data missing2zero;
      format libname memname varname;
      length Missing Not_Applicable Not_Missing 8;
      set transposed;
      array miss{*} Missing -- Not_Missing;
      do i=1 to dim(miss);
         if miss{i}=.
         then miss{i}=0;
      end;
      drop i;
   run;

   * append data ;
   proc append base=final1 data=missing2zero;
   run;

   * restore character used for missing numeric data ;
   options missing="&missing";
%mend;

proc datasets lib=work nolist nowarn;
   delete final1:;
quit;

%bench(start)
%get_counts(work.table1)
%get_counts(work.table2)
%get_counts(sashelp.class)
%get_counts(sashelp.cars)
%get_counts(work.zipcode)
%bench(elapsed)

* this sort isn't required, but makes it easier to compare the proc print outputs ;
proc sort data=final1;
   by libname memname varname;
run;

* one last transpose. use whichever dataset you prefer, final1 or final1a ;
proc transpose data=final1 out=final1a (rename=(_name_=Num_Obs));
   id varname;
run;

title;
proc print data=final1;  run;
proc print data=final1a; run;
%bench(end)

Hash Object, setting and retrieving frequency counts:

This approach only processes the data once, but has the overhead of finding and setting the frequency counts for every data value in the source data. The hash object never gets that big: up to three rows (Missing, Not_Applicable, Not_Missing) per variable.
%macro get_counts(data);
   * reset character used for missing numeric data ;
   %let missing=%sysfunc(getoption(missing));
   options missing=" ";

   data _null_;
      set &data indsname=dsn end=eof;

      * set a dummy variable as an end of variable list marker ;
      retain dummy "";

      * define additional variables ;
      * varname must be long enough to contain memname_varname ;
      length libname $8 memname $32 varname $65 cvalue $200 measure $15 count 8;

      * use a hash object to store summary data ;
      if (_n_=1) then do;
         dcl hash sums(hashexp: 16);
         sums.defineKey( "libname","memname","varname","measure");
         sums.defineData("libname","memname","varname","measure","count");
         sums.defineDone();

         * get libname and memname ;
         libname=scan(dsn,1,".");
         memname=scan(dsn,2,".");
         retain libname memname;
      end;

      * spin through all the variables in the dataset, building summarization variables ;
      do while (1);
         call vnext(varname);
         if (varname in ("dsn","eof")) then continue;
         if (varname in ("dummy")) then leave;

         * get the variable value (character, formatted value) ;
         cvalue=vvaluex(varname);

         * set the hash object keys (libname and memname are already set) ;
         * derived varname ;
         varname=catx("_",memname,varname);

         * measure ;
         select;
            when (missing(cvalue))     measure="Missing";
            when (strip(cvalue)="N/A") measure="Not_Applicable";
            when (not missing(cvalue)) measure="Not_Missing";
         end;

         * initialize counter back to 0 ;
         count=0;

         * retrieve the current key and increment it ;
         * if find fails (first time through) count is still 0 ;
         rc=sums.find();

         * increment the count ;
         count=count+1;

         * save the incremented count ;
         sums.replace();
      end;

      * now output the hash object as a dataset ;
      if eof then sums.output(dataset: "counts");
   run;

   * append data ;
   proc append base=final2 data=counts;
   run;

   * restore character used for missing numeric data ;
   options missing="&missing";
%mend;

proc datasets lib=work nolist nowarn;
   delete final2:;
quit;

%bench(start)
%get_counts(work.table1)
%get_counts(work.table2)
%get_counts(sashelp.class)
%get_counts(sashelp.cars)
%get_counts(work.zipcode)
%bench(elapsed)

* this sort isn't required, but makes it easier to compare the proc print outputs ;
proc sort data=final2;
   by libname memname varname;
run;

* one last transpose. use whichever dataset you prefer, final2 or final2a ;
proc transpose data=final2 out=temp (drop=_name_);
   by libname memname varname notsorted;
   id measure;
run;

* set PDV order, which sets final observation order in final2 ;
* could use a view here but it is such a tiny dataset ;
data temp;
   format libname memname varname Missing Not_Applicable Not_Missing;
   set temp;
run;

proc transpose data=temp out=temp2 (rename=(_name_=Num_Obs));
   id varname;
run;

* replace missing values with zero ;
data final2a;
   set temp2;
   array miss{*} _numeric_;
   do i=1 to dim(miss);
      if miss{i}=. then miss{i}=0;
   end;
   drop i;
run;

title;
proc print data=final2;  run;
proc print data=final2a; run;
%bench(end)

Hash Object, using suminc increment variable:

This approach also only processes the data once.
%macro get_counts(data);
   data _null_;
      set &data indsname=dsn end=eof;
      retain dummy "";
      length libname $8 memname vname $32 varname $65 measure $20 cvalue $200;
      if (_n_=1) then do;
         dcl hash sums(suminc:"count");
         sums.defineKey("libname","memname","varname","measure");
         sums.defineData("libname","memname","varname","measure","count");
         sums.defineDone();
         dcl hiter iter("sums");
         libname=scan(dsn,1,".");
         memname=scan(dsn,2,".");
         retain libname memname;
      end;
      do while (1);
         call vnext(vname);
         if (vname in ("dsn","eof")) then continue;
         if (vname in ("dummy")) then leave;
         cvalue=vvaluex(vname);
         select;
            when (missing(cvalue))     measure="Missing";
            when (strip(cvalue)="N/A") measure="Not_Applicable";
            when (not missing(cvalue)) measure="Not_Missing";
         end;
         varname=catx("_",memname,vname);
         count=1;
         if (sums.find() ne 0) then sums.add();
      end;
      if eof then do;
         rc=iter.first();
         do while (rc=0);
            sums.sum(sum: count);
            sums.replace();
            rc=iter.next();
         end;
         sums.output(dataset: "counts");
      end;
   run;

   proc append base=final3 data=counts;
   run;
%mend;

proc datasets lib=work nolist nowarn;
   delete final3:;
quit;

%bench(start)
%get_counts(work.table1)
%get_counts(work.table2)
%get_counts(sashelp.class)
%get_counts(sashelp.cars)
%get_counts(work.zipcode)
%bench(elapsed)

* this sort IS required, to group the varnames in the next transpose ;
proc sort data=final3;
   by libname memname varname measure;
run;

* one last transpose. use whichever dataset you prefer, final3 or final3a ;
proc transpose data=final3 out=temp (drop=_name_);
   by libname memname varname;
   id measure;
run;

* set PDV order, which sets final observation order in final3 ;
* could use a view here but it is such a tiny dataset ;
data temp;
   format libname memname varname Missing Not_Applicable Not_Missing;
   set temp;
run;

proc transpose data=temp out=temp2 (rename=(_name_=Num_Obs));
   id varname;
run;

* replace missing values with zero ;
data final3a;
   set temp2;
   array miss{*} _numeric_;
   do i=1 to dim(miss);
      if miss{i}=.
         then miss{i}=0;
   end;
   drop i;
run;

title;
proc print data=final3;  run;
proc print data=final3a; run;
%bench(end)

After running all three approaches, you can compare the "A" datasets. The "not A" datasets are not normalized the same way, so they don't compare.

proc compare base=final1a compare=final2a;
run;
proc compare base=final1a compare=final3a;
run;

To print totals, there are a number of ways to do that. I'll use PROC REPORT:

options nocenter;

proc report data=final1 nowd;
   columns libname memname varname missing not_applicable not_missing total;
   define libname / order;
   define memname / order width=12;
   define varname / order width=25;
   compute total;
      total=sum(missing.sum,not_applicable.sum,not_missing.sum);
   endcomp;
   break after memname / ol summarize skip;
quit;

proc report data=final1a nowd;
   rbreak after / ol summarize skip;
   label num_obs=" ";
quit;

Finally, after running all three approaches, there wasn't a big difference in performance between any of them. Based on this, I'd recommend the proc summary approach, since it's the simplest of the three I listed.

Also, I did not compare the performance with the original sql/union all approach, since 1) the previously posted code wasn't generic enough to support datasets in multiple libraries and I didn't feel like fiddling with the code, 2) with enough datasets and variables you'd likely run into the limits of sql, and 3) the original poster said he'd like to split the output into multiple output datasets, which would be easier with the macro/proc append approach. If someone wants to compare the performance with the original sql approach, please post the results.

Hope this helps, and sorry again for the length...

Scott