nicnad
Fluorite | Level 6

Thank you for the reply.

I am using SAS 9.1, which is why the transpose portion of the code was not working.

First of all, thank you very much for this code. Two questions:

  • I tried to run your code on 10 tables and I got the following error: "ERROR: Query is too complex to be processed. It references more than 16 tables". Is there a way to bypass this error? Would deleting the temp table when a new one is created solve this problem, or am I not understanding the error message properly?
  • How do I add a column to the table that would give me the sum of the three already existing columns? (See example below.)

tableName        Missing    Not_Applicable    Not_Missing    Total
TABLE1_Name         1             1                2           4
TABLE1_Age          0             0                4           4
TABLE2_Color        1             1                2           4
TABLE2_Height       1             0                3           4

Thank you for your help and time.

MikeZdeb
Rhodochrosite | Level 12

hi ... not sure why you get that message (Ksharp's code is pretty neat and normally very reliable)

here's an alternative that I think is a bit faster and generic enough that all you have to do is list the data sets with a '/' separator ...


data table1;
input name $ age @@ ;
cards;
John 45 N/A 30 . 15 Carl 25
;

data table2;
input color $ height @@ ;  * @@ added so all four pairs on this line are read ;
cards;
Blue 110 N/A 120 . 100 Red .
;

proc format;
value nn low-high='1' other='3';
value $c 'N/A'='2' ' '='3' other='1';
run;

%macro tables(stuff);
proc datasets lib=work nolist;
delete tables;
quit;
%let j=1;
%do %while(%scan(&stuff,&j,/) ne);
%let dset=%scan(&stuff,&j,/);
ods output onewayfreqs=temp (keep=table f_: freq:);
proc freq data=&dset;
tables _all_ / missing;
format _numeric_ nn. _character_ $c.;
run;
ods output close;
data temp (keep=table name not_: missing total);
length table name $32;
do until(last.table);
   set temp;
   by table notsorted;
   array t(*) f_:;
   select(t(_n_));
      when('1') not_missing = frequency;
      when('2') not_applicable = frequency;
      when('3') missing = frequency;
   end;
end;
total = sum(of not_:, missing);
name = scan(table,2);
table = "&dset";
run;
proc append base=tables data=temp;
run;
%let j=%eval(&j+1);
%end;
proc datasets lib=work nolist;
delete temp;
quit;
%mend;

ods listing close;
%tables(table1/table2/sashelp.class/sashelp.heart);
ods listing;

partial output of data set TABLES ...

table            name           not_missing    not_applicable    missing    total

table1           name                  2              1              1          4
table1           age                   4              .              .          4
z.table2         color                 1              .              .          1
z.table2         height                1              .              .          1
sashelp.class    Name                 19              .              .         19
sashelp.class    Sex                  19              .              .         19
sashelp.class    Age                  19              .              .         19
sashelp.class    Height               19              .              .         19
sashelp.class    Weight               19              .              .         19
sashelp.heart    Status             5209              .              .       5209
sashelp.heart    DeathCause         1991              .           3218       5209
sashelp.heart    AgeCHDdiag         1449              .           3760       5209

nicnad
Fluorite | Level 6

Thank you both for your reply. Really appreciated.

@Ksharp: Your code works great. Thank you for your help and time.

@MikeZdeb: I don't get the same output as you. I would really like to get the same output as the one you provided in your post.

If we take my two example tables, here is the result I get :

The FREQ Procedure

                                 Cumulative    Cumulative
name      Frequency    Percent    Frequency      Percent
3             1         25.00         1           25.00
1             2         50.00         3           75.00
2             1         25.00         4          100.00

                                 Cumulative    Cumulative
age       Frequency    Percent    Frequency      Percent
1             4        100.00         4          100.00

The FREQ Procedure

                                 Cumulative    Cumulative
color     Frequency    Percent    Frequency      Percent
3             1         25.00         1           25.00
1             2         50.00         3           75.00
2             1         25.00         4          100.00

                                 Cumulative    Cumulative
height    Frequency    Percent    Frequency      Percent
3             1         25.00         1           25.00
1             3         75.00         4          100.00

I am running SAS Enterprise Guide 4.1 with SAS 9.1

Can you please help me get the same output as you?

Thank you for your help

MikeZdeb
Rhodochrosite | Level 12

Hi ... when the macro finished, did you check to see if there was also a data set named TABLES?

proc print data=tables;
run;

nicnad
Fluorite | Level 6

My bad, sorry.

I see the table now.

One last thing I would like to have with your code.

Would you be able to add a feature to your code where I could, say, create a summary table named "tables" from (work.table1/work.table2) and another summary table named "summary_table" from (work.table3/work.table4)?

Basically, when writing the macro call I would have the choice to:

1. Enter a specific output table name

2. Create multiple output tables, each with a specific name

I would really like to be able to write this myself, but I am a real beginner. I am learning a lot using this forum and I hope I will be able to at least modify existing SAS code soon.

Thank you for your help and time!
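One hedged sketch of this change, based on MikeZdeb's macro above: add an `out=` parameter (the name `out=` is my own choice, not from the thread) and use it everywhere the hard-coded `tables` dataset appeared. The body is otherwise unchanged.

```sas
%macro tables(stuff, out=tables);
* delete any previous copy of the requested output table ;
proc datasets lib=work nolist nowarn;
delete &out;
quit;
%let j=1;
%do %while(%scan(&stuff,&j,/) ne);
%let dset=%scan(&stuff,&j,/);
ods output onewayfreqs=temp (keep=table f_: freq:);
proc freq data=&dset;
tables _all_ / missing;
format _numeric_ nn. _character_ $c.;
run;
ods output close;
data temp (keep=table name not_: missing total);
length table name $32;
do until(last.table);
   set temp;
   by table notsorted;
   array t(*) f_:;
   select(t(_n_));
      when('1') not_missing = frequency;
      when('2') not_applicable = frequency;
      when('3') missing = frequency;
   end;
end;
total = sum(of not_:, missing);
name = scan(table,2);
table = "&dset";
run;
* append to the caller-specified table instead of a fixed name ;
proc append base=&out data=temp;
run;
%let j=%eval(&j+1);
%end;
proc datasets lib=work nolist;
delete temp;
quit;
%mend;

* two calls, two separately named summary tables ;
%tables(work.table1/work.table2, out=tables);
%tables(work.table3/work.table4, out=summary_table);
```

Because `out=` has a default of `tables`, an existing call like `%tables(table1/table2)` still produces WORK.TABLES.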

Ksharp
Super User

Oh, I understand that ERROR message now.

SQL's UNION can only support 250 SELECT clauses, so you may simply have too many tables and too many variables to process in one query.

data table1;
input Name $     Age ;
cards;
John     45
N/A     30
.     15
Carl     25
;
run;
data table2;
input Color $     Height ;
cards;
Blue     110
N/A     120
.     100
Red     .
;
run;

options missing=' ';
data _null_;
 set sashelp.vcolumn(keep=name memname libname where=(libname='WORK')) end=last;
 if _n_ eq 1 then  call execute("proc sql;");
 call execute("create table temp"||strip(_n_)||" as select '"||strip(memname)||"' length=40 as table,'"||strip(name)||"' length=40 as Name,nmiss("||strip(name)||") as missing,sum(case when cats("||strip(name)||")='N/A' then 1 else 0 end) as Not_Applicable,sum(case when cat("||strip(name)||") not in ('N/A' ' ') then 1 else 0 end) as Not_Missing,count(*) as Total from "||strip(memname)||";" );
 if last then call execute(";quit;");
run;
data want;
 set temp:;
run;



Ksharp

ScottBass
Rhodochrosite | Level 12

Hi All,

I thought this was an interesting problem, and a chance for me to "play" with some techniques in SAS I don't normally use. Apologies in advance for the length of this post.

General Concepts:

Use of sashelp.vcolumn

In my environment, we pre-allocate about 30 SAS libraries with hundreds of tables and tens of thousands of variables.  Some of those libraries point to Oracle databases.  After the incredibly painful process (~10 minutes) of retrieving 4 variables, the code generation finally began! However, even after I cleared all libnames (libname _all_ clear), sashelp.vcolumn was still really slow.

I recommend using proc contents instead of sashelp.vcolumn. Compare the three approaches below.  I also switched to other datasets (sashelp), because we also want to compare performance across different approaches, and the example data is just too small to evaluate performance.

* using dictionary.columns ;
proc sql;
  create table vars as
    select libname, memname, name
    from dictionary.columns
    where catx(".",libname,memname) in ("SASHELP.CLASS","SASHELP.CARS","WORK.ZIPCODE")
    order by libname, memname, varnum
  ;
quit;

* using sashelp.vcolumn ;
data vars;
  set sashelp.vcolumn (keep=libname memname name);
  where catx(".",libname,memname) in ("SASHELP.CLASS","SASHELP.CARS","WORK.ZIPCODE");
run;

* using proc contents ;
%macro get_variables(data);
  proc contents data=&data out=temp (keep=libname memname name) noprint;
  run;
  proc append base=vars data=temp;
  run;
%mend;

proc datasets lib=work nowarn nolist;
  delete vars;
quit;

%get_variables(sashelp.class);
%get_variables(sashelp.cars);
%get_variables(work.zipcode);

The actual performance will vary based on your environment, but in general I find the best performance is 1) proc contents, 2) dictionary.columns, and 3) sashelp.vcolumn, in that order, especially if you just need the columns from a single dataset.  Try each approach in your environment and see which works best for you.

call execute vs. dynamic code generation:

There is nothing intrinsically wrong with call execute. However, I generally prefer generating code dynamically to a temporary file, then %including that temporary file.  That way I can easily debug the generated code using "fslist".  This assumes code generation via DMS; otherwise use a data _null_ step to echo the code to the log.  I also get more control over the code formatting.  This has nothing to do with code execution, but it can be helpful for debugging when your generated code block is large.

Here is an example:

* dynamically create code ;
filename code temp;

data _null_;
  set vars end=eof;
  file code;
  if _n_ eq 1 then do;
    call symputx("firstvar",name,"G");
    put @1 "proc sql;";
    put @4 "create table temp as";
    put @7 "select";
  end;
  put @10 memname $quote. @80 "as Table,";
  put @10 name    $quote. @80 "as Name,";
  put @10 "nmiss(" name +(-1) ")" @80 "as Missing,";
  put @10 "sum(case when cats(" name +(-1) ")='N/A' then 1 else 0 end)" @80 "as Not_Applicable,";
  put @10 "sum(case when cat("  name +(-1) ") not in ('N/A', ' ') then 1 else 0 end)" @80 "as Not_Missing,";
  if eof then do;
    put @10 "0" @80 "as dummy";
    put @7  "from";
    put @10 libname +(-1) "." memname;
    put @4  ";";
    put @1  "quit;";
  end;
run;

* check out the generated code ;
dm "fslist code";

* execute the code ;
%include code;

Note:  You don't need the "firstvar" macro variable for the proc transpose (see later code below), but the above shows a way to set it to the first variable in the source data.

Missing option:

The character representation of a missing numeric value is controlled by the missing option, whose default is ".".  If you want to be absolutely sure that your code works correctly when checking for missing numeric values, set the missing option explicitly.  You can also simplify the code slightly if you set it to blank:

* reset character used for missing numeric data ;
%let missing=%sysfunc(getoption(missing));
options missing=" ";

/* Your code.  Both character and numeric missing are now " " */

* restore character used for missing numeric data ;
options missing="&missing";

Increasing disk I/O performance:

This is more of an aside.  I did test this approach, but it did not perform better than using views.

When I need high performance disk I/O, I've been using this approach lately:

libname spdework spde "%sysfunc(pathname(work))" temp=yes;

Then use a two-level name, spdework.<your work dataset>, instead of a one-level WORK name.  This can yield really good performance, especially for a large work dataset that is used repeatedly in downstream code.

You can also set the user= option so that single-level names write to spdework:

options user=spdework;
data foo; x=1; run;

Use views when appropriate:

Use a data step or SQL view when appropriate to reduce disk I/O.  Determining "when appropriate" is beyond the scope of this post, but see the code using a data step view below.
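For readers who have not used views, here is the smallest possible data step view (dataset and variable names here are illustrative, not from the thread). The view stores only the program; the subsetting work happens when a downstream step reads the view, so no intermediate dataset is written to disk:

```sas
* a view: no rows are written to WORK when this step runs ;
data tall / view=tall;
  set sashelp.class;
  where height > 60;
run;

* the view's program executes here, feeding proc means directly ;
proc means data=tall;
run;
```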

Problem Analysis:

If we review the problem, what we want to do is group the data into "buckets" (missing, not applicable, not missing) for each variable, then get frequency counts for those "buckets". Further analysis of the previous answers shows that we're generating our results one column at a time, over the entire dataset: each proc sql/select/union all code block processes the entire dataset, one variable at a time. The original poster indicated that his/her "real" data is 2M+ records.  It may also have many more variables than the 4 in the sample datasets.  So the total data processed is (# of records) * (# of variables); for example, 2M records * 10 variables = 20M records processed. In the "real" problem, performance does matter.

I also considered: is there a way to do this without pre-processing the data (sashelp.vcolumn or proc contents) and dynamic code generation?  And is there a way to process the data in one pass?

I created a number of code versions - I’ll only include a few of the approaches (I won’t post the disasters lol).  Some comments:

  • A data step view performed better than creating a SPDE work dataset.
  • Using call vnext and vvaluex, I did not have to pre-process the data.
  • I found it easier to process the data sets one at a time and use proc append, esp. since I didn’t use SQL.
  • In the proc summary approach, I still process # of records * # of variables, so the data does “bloat”. Luckily, the logic checks are mutually exclusive – a column can ONLY be one of missing, not applicable, or not missing.  Otherwise the dataset (data step view) would bloat even more.
  • In the first hash object approach, I only process the data once.  But, it performed the same as the “bloated” data step view.  I conjecture that the overhead of looking up and replacing the counts for every data value created enough overhead that it performed no better than the proc summary approach.  But it was fun coding this approach :-).
  • I also found a documentation hit that was really interesting: 
    http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a002585310.htm (search on "Maintaining Key Summaries")

    It took me a while to understand this doc.  Here is my understanding: when you declare a suminc variable, the hash object maintains an increment counter for each of its keys.  When a key is found, the value of the suminc variable is added to the counter the hash object maintains internally for that key.  For the first "hit" on a hash object key, you need to use add() instead of find().  At the end of the processing, use the sum() method to retrieve the increment counter for a given key.  Since we're just interested in frequency counts, the value of the suminc variable is always 1.  See an example below
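The mechanics above can be sketched in a few lines; a minimal example against sashelp.class, with my own variable names (the hash object, not the data step, maintains the counts):

```sas
data _null_;
  set sashelp.class end=eof;
  if _n_=1 then do;
    dcl hash h(suminc:"one");       * each found/added key accumulates ONE ;
    h.defineKey("sex");
    h.defineDone();
  end;
  one=1;                            * suminc variable: always 1 for frequency counts ;
  if h.find() ne 0 then h.add();    * first hit on a key must use add() ;
  if eof then do;
    rc=h.sum(key:"F", sum:f);       * retrieve the internal counters ;
    rc=h.sum(key:"M", sum:m);
    put f= m=;                      * sashelp.class has 9 F and 10 M ;
  end;
run;
```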

Here are my code examples.  Hopefully they are somewhat self explanatory, otherwise post a follow up question.

Code Prologue (create example data used for all programs):

options mprint nomlogic;

data table1;
input Name $ Age;
cards;
John     45
N/A      30
.        15
Carl     25
;
run;

data table2;
input Color $ Height;
cards;
Blue     110
N/A      120
.        100
Red      .
;
run;

* create a large dataset to test performance ;
data work.zipcode;
  set
    sashelp.zipcode
    sashelp.zipcode
    sashelp.zipcode
    sashelp.zipcode
    sashelp.zipcode
  ;
run;

%bench macro (macro I use when I’m benchmarking performance):

Note:  comment out call to %parmv

/*=====================================================================
Program Name            : bench.sas
Purpose                 : Measures elapsed time between successive
                          invocations.
SAS Version             : SAS 8.2
Input Data              : N/A
Output Data             : N/A
Macros Called           : parmv
Originally Written by   : Scott Bass
Date                    : 24APR2006
Program Version #       : 1.0
=======================================================================
Modification History    : Original version
=====================================================================*/

/*---------------------------------------------------------------------
Usage:

* Start benchmarking. ;
* Both invocations are identical as long as start ;
* has not been previously invoked ;
%bench;
%bench(start);

data _null_;
   rc=sleep(3);
run;

* Get elapsed time, should be approx. 3 seconds elapsed, 3 seconds total ;
%bench(elapsed);

data _null_;
   rc=sleep(7);
run;

* Get another elapsed time, should be approx. 7 seconds elapsed, 10 seconds total ;
%bench;  * elapsed parm not required since start was already called ;

data _null_;
   rc=sleep(2);
run;

* End benchmarking, should be approx. 2 seconds elapsed, 12 seconds total ;
* Must be called after start.  Resets benchmarking. ;
%bench(end);

-----------------------------------------------------------------------
Notes:

If %bench has never been invoked, calling %bench without parameters
starts benchmarking.  You may also explicitly specify the start
parameter.  Explicitly specifying the start parameter resets
benchmarking, although normally the end parameter would be used.

If %bench has been previously invoked with the start parameter, calling
%bench without parameters prints the elapsed time.  You may also
explicitly specify the elapsed parameter.

To end benchmarking and reset the start time, specify the end
parameter.

Only the elapsed or end parameters (or equivalent processing) print
time measurements to the log.  The start parameter does not print
anything to the log.

The only parameter that needs to be explicitly specified is end.
Otherwise the macro should do the right thing, either starting
benchmarking or printing elapsed times.

Benchmarking a time period greater than 24 hours is "unpredictable".
---------------------------------------------------------------------*/

%macro bench
/*---------------------------------------------------------------------
Measures elapsed time between successive invocations.
---------------------------------------------------------------------*/
(PARM          /* Benchmarking parameter (Opt).                      */
               /* If not specified:                                  */
               /*    If first invocation, start benchmarking.        */
               /*    If subsequent invocation, print elapsed time.   */
               /* Valid values are START ELAPSED END.                */
);

%local macro parmerr time_elapsed time_total time_elapsed_str time_total_str h m s;
%let macro = &sysmacroname;

%* check input parameters ;
%parmv(PARM,         _req=0,_words=0,_case=U,_val=START ELAPSED END)
%if (&parmerr) %then %goto quit;

%* nested macro for printing ;
%macro print(_parm);
   %let time_elapsed       = %sysevalf(%sysfunc(datetime()) - &_elapsed);
   %let time_total         = %sysevalf(%sysfunc(datetime()) - &_start);
   %let h                  = %sysfunc(hour(&time_elapsed),z2.);
   %let m                  = %sysfunc(minute(&time_elapsed),z2.);
   %let s                  = %sysfunc(second(&time_elapsed),z2.);
   %let time_elapsed_str   = &h hours, &m minutes, &s seconds;
   %let h                  = %sysfunc(hour(&time_total),z2.);
   %let m                  = %sysfunc(minute(&time_total),z2.);
   %let s                  = %sysfunc(second(&time_total),z2.);
   %let time_total_str     = &h hours, &m minutes, &s seconds;
   %put;
   %put Benchmark &_parm:;
   %put;
   %put Elapsed seconds = &time_elapsed_str &time_elapsed;
   %put Total   seconds = &time_total_str &time_total;
   %put;
%mend;

%* declare global variables ;
%global _start _elapsed;

%if (&parm eq START) %then %do;
   %let _start    = %sysfunc(datetime());
   %let _elapsed  = &_start;
%end;
%else
%if (&parm eq ELAPSED) %then %do;
   %if (&_start eq ) %then %do;
      %put ERROR:  Benchmarking must be started before elapsed time can be printed.;
      %goto quit;
   %end;
   %else %do;
      %print(ELAPSED)
      %let _elapsed  = %sysfunc(datetime());
   %end;
%end;
%else
%if (&parm eq END) %then %do;
   %if (&_start eq ) %then %do;
      %put ERROR:  Benchmarking must be started before elapsed time can be printed.;
      %goto quit;
   %end;
   %else %do;
      %print(END)
      %* reset benchmarking ;
      %symdel _start _elapsed / nowarn;
   %end;
%end;
%else
%if (&parm eq ) %then %do;
   %* derive proper parm then recursively call this macro ;
   %if (&_start eq ) %then %do;
      %bench(start)
   %end;
   %else %do;
      %bench(elapsed)
   %end;
%end;

%quit:
%* if (&parmerr) %then %abort;
%mend;

/******* END OF FILE *******/

Data Step View and PROC SUMMARY approach:
%macro get_counts(data);

  * reset character used for missing numeric data ;
  %let missing=%sysfunc(getoption(missing));
  options missing=" ";

  * create additional grouping variables ;
  data vgrouped / view=vgrouped;
    set &data indsname=dsn;

    * set a dummy variable as an end of variable list marker ;
    retain dummy "";
    drop dummy;

    * define additional variables ;
    * varname must be long enough to contain memname_varname ;
    length libname $8 memname $32 varname $65 cvalue $200 measure $15;

    * we only need to get the libname and memname once ;
    if (_n_=1) then do;
      libname=scan(dsn,1,".");
      memname=scan(dsn,2,".");
      retain libname memname;
    end;

    * spin through all the variables in the dataset, building grouping variables ;
    * since our checks are mutually exclusive, this will not cause bloating of the dataset ;
    do while (1);
      call vnext(varname);
      if (varname in ("dsn","eof")) then continue;
      if (varname="dummy") then leave;

      * get the variable value (character, formatted value) ;
      cvalue=vvaluex(varname);

      * build the grouping variable ;
      select;
        when (missing(cvalue))
          measure="Missing";
        when (strip(cvalue)="N/A")
          measure="Not_Applicable";
        when (not missing(cvalue))
          measure="Not_Missing";
        * this covers all possibilities, so I purposely left out an otherwise statement ;
      end;

      * build the new variable name ;
      varname=catx("_",memname,varname);

      * output the observation ;
      output;
    end;
  run;

  * now summarize over each measure to get the frequency counts ;
  proc summary data=vgrouped nway;
    class libname memname varname measure;
    output out=summary (drop=_type_);
  run;

  * transpose data ;
  proc transpose data=summary out=transposed (drop=_name_);
    by libname memname varname;
    id measure;
    var _freq_;
  run;

  * set desired PDV order, ensure all variables are present, and replace missing values with zero ;
  data missing2zero;
    format libname memname varname;
    length Missing Not_Applicable Not_Missing 8;
    set transposed;
    array miss{*} Missing -- Not_Missing;
    do i=1 to dim(miss);
      if miss{i}=. then miss{i}=0;
    end;
    drop i;
  run;

  * append data ;
  proc append base=final1 data=missing2zero;
  run;

  * restore character used for missing numeric data ;
  options missing="&missing";

%mend;

proc datasets lib=work nolist nowarn;
  delete final1:;
quit;

%bench(start)
%get_counts(work.table1)
%get_counts(work.table2)
%get_counts(sashelp.class)
%get_counts(sashelp.cars)
%get_counts(work.zipcode)
%bench(elapsed)

* this sort isn't required, but makes it easier to compare the proc print outputs ;
proc sort data=final1;
  by libname memname varname;
run;

* one last transpose.  use whichever dataset you prefer, final1 or final1a ;
proc transpose data=final1 out=final1a (rename=(_name_=Num_Obs));
  id varname;
run;

title;
proc print data=final1;
run;
proc print data=final1a;
run;

%bench(end)


Hash Object, setting and retrieving frequency counts:

This approach only processes the data once, but has the overhead of finding and setting the frequency counts for every data value in the source data.

The hash object will never get that big:  up to three rows (Missing, Not Applicable, Not Missing) per variable.

%macro get_counts(data);

  * reset character used for missing numeric data ;
  %let missing=%sysfunc(getoption(missing));
  options missing=" ";

  data _null_;
    set &data indsname=dsn end=eof;

    * set a dummy variable as an end of variable list marker ;
    retain dummy "";

    * define additional variables ;
    * varname must be long enough to contain memname_varname ;
    length libname $8 memname $32 varname $65 cvalue $200 measure $15 count 8;

    * use a hash object to store summary data ;
    if (_n_=1) then do;
      dcl hash sums(hashexp: 16);
      sums.defineKey( "libname","memname","varname","measure");
      sums.defineData("libname","memname","varname","measure","count");
      sums.defineDone();

      * get libname and memname ;
      libname=scan(dsn,1,".");
      memname=scan(dsn,2,".");
      retain libname memname;
    end;

    * spin through all the variables in the dataset, building summarization variables ;
    do while (1);
      call vnext(varname);
      if (varname in ("dsn","eof")) then continue;
      if (varname in ("dummy")) then leave;

      * get the variable value (character, formatted value) ;
      cvalue=vvaluex(varname);

      * set the hash object keys (libname and memname are already set) ;
      * derived varname ;
      varname=catx("_",memname,varname);

      * measure ;
      select;
        when(missing(cvalue))
          measure="Missing";
        when(strip(cvalue)="N/A")
          measure="Not_Applicable";
        when(not missing(cvalue))
          measure="Not_Missing";
      end;

      * initialize counter back to 0 ;
      count=0;

      * retrieve the current key and increment it ;
      * if find fails (first time through) count is still 0 ;
      rc=sums.find();

      * increment the count ;
      count=count+1;

      * save the incremented count ;
      sums.replace();
    end;

    * now output the hash object as a dataset ;
    if eof then sums.output(dataset: "counts");
  run;

  * append data ;
  proc append base=final2 data=counts;
  run;

  * restore character used for missing numeric data ;
  options missing="&missing";

%mend;

proc datasets lib=work nolist nowarn;
  delete final2:;
quit;

%bench(start)
%get_counts(work.table1)
%get_counts(work.table2)
%get_counts(sashelp.class)
%get_counts(sashelp.cars)
%get_counts(work.zipcode)
%bench(elapsed)

* this sort isn't required, but makes it easier to compare the proc print outputs ;
proc sort data=final2;
  by libname memname varname;
run;

* one last transpose.  use whichever dataset you prefer, final2 or final2a ;
proc transpose data=final2 out=temp (drop=_name_);
  by libname memname varname notsorted;
  id measure;
run;

* set PDV order, which sets final observation order in final2 ;
* could use a view here but it is such a tiny dataset ;
data temp;
  format libname memname varname Missing Not_Applicable Not_Missing;
  set temp;
run;

proc transpose data=temp out=temp2 (rename=(_name_=Num_Obs));
  id varname;
run;

* replace missing values with zero ;
data final2a;
  set temp2;
  array miss{*} _numeric_;
  do i=1 to dim(miss);
    if miss{i}=. then miss{i}=0;
  end;
  drop i;
run;

title;
proc print data=final2;
run;
proc print data=final2a;
run;

%bench(end)


Hash Object, using suminc incrementation variable:

This approach also only processes the data once.

%macro get_counts(data);

  data _null_;
    set &data indsname=dsn end=eof;
    retain dummy "";
    length libname $8 memname vname $32 varname $65 measure $20 cvalue $200;

    if (_n_=1) then do;
      dcl hash sums(suminc:"count");
      sums.defineKey("libname","memname","varname","measure");
      sums.defineData("libname","memname","varname","measure","count");
      sums.defineDone();
      dcl hiter iter("sums");

      libname=scan(dsn,1,".");
      memname=scan(dsn,2,".");
      retain libname memname;
    end;

    do while (1);
      call vnext(vname);
      if (vname in ("dsn","eof")) then continue;
      if (vname in ("dummy"))     then leave;

      cvalue=vvaluex(vname);

      select;
        when(missing(cvalue))
          measure="Missing";
        when(strip(cvalue)="N/A")
          measure="Not_Applicable";
        when(not missing(cvalue))
          measure="Not_Missing";
      end;

      varname=catx("_",memname,vname);
      count=1;
      if (sums.find() ne 0) then sums.add();
    end;

    if eof then do;
      rc=iter.first();
      do while (rc=0);
        sums.sum(sum: count);
        sums.replace();
        rc=iter.next();
      end;
      sums.output(dataset: "counts");
    end;
  run;

  proc append base=final3 data=counts;
  run;

%mend;

proc datasets lib=work nolist nowarn;
  delete final3:;
quit;

%bench(start)
%get_counts(work.table1)
%get_counts(work.table2)
%get_counts(sashelp.class)
%get_counts(sashelp.cars)
%get_counts(work.zipcode)
%bench(elapsed)

* this sort IS required, to group the varnames in the next transpose ;
proc sort data=final3;
  by libname memname varname measure;
run;

* one last transpose.  use whichever dataset you prefer, final3 or final3a ;
proc transpose data=final3 out=temp (drop=_name_);
  by libname memname varname;
  id measure;
run;

* set PDV order, which sets final observation order in final3 ;
* could use a view here but it is such a tiny dataset ;
data temp;
  format libname memname varname Missing Not_Applicable Not_Missing;
  set temp;
run;

proc transpose data=temp out=temp2 (rename=(_name_=Num_Obs));
  id varname;
run;

* replace missing values with zero ;
data final3a;
  set temp2;
  array miss{*} _numeric_;
  do i=1 to dim(miss);
    if miss{i}=. then miss{i}=0;
  end;
  drop i;
run;

title;
proc print data=final3;
run;
proc print data=final3a;
run;

%bench(end)

After running all three approaches, you can compare the "A" datasets.  The non-"A" datasets are not normalized the same way, so they don't compare.

proc compare base=final1a compare=final2a;
run;

proc compare base=final1a compare=final3a;
run;

To print totals, there are a number of ways to do that.  I'll use PROC REPORT:

options nocenter;

proc report data=final1 nowd;
  columns libname memname varname missing not_applicable not_missing total;
  define libname    / order;
  define memname    / order   width=12;
  define varname    / order   width=25;
  compute total;
    total=sum(missing.sum,not_applicable.sum,not_missing.sum);
  endcomp;
  break after memname / ol summarize skip;
quit;

proc report data=final1a nowd;
  rbreak after / ol summarize skip;
  label num_obs=" ";
quit;

Finally, after running all three approaches, there wasn't a big difference in performance between any of them. Based on this, I'd recommend the proc summary approach, since it's the simplest of the three I listed. Also, I did not compare performance with the original sql/union all approach, since 1) the previously posted code wasn't generic enough to support datasets in multiple libraries and I didn't feel like fiddling with it, 2) with enough datasets and variables you'd likely run into the limits of sql, and 3) the original poster said he'd like to split the output into multiple output datasets, which is easier with the macro/proc append approach.

If someone wants to compare the performance with the original sql approach, please post the results.

Hope this helps and sorry again for the length...

Scott


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Tom
Super User

Scott - If you use the CATX function in your where clause then SAS cannot use the indexes (or whatever the equivalent is for dictionary tables) in the metadata table dictionary.columns.  That is why it is hitting your Oracle tables and slowing down the process.

103    create table vars as
104      select libname, memname, name
105      from dictionary.columns
106      where libname = 'SASHELP' and memname in ("CLASS","CARS","ZIPCODE")
107      order by libname, memname, varnum
108    ;
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause.
NOTE: Table WORK.VARS created, with 38 rows and 3 columns.
NOTE: SQL Statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.03 seconds

111    create table vars as
112      select libname, memname, name
113      from dictionary.columns
114      where catx(".",libname,memname) in ("SASHELP.CLASS","SASHELP.CARS","WORK.ZIPCODE")
115      order by libname, memname, varnum
116    ;
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause.
NOTE: Table WORK.VARS created, with 20 rows and 3 columns.
NOTE: SQL Statement used (Total process time):
      real time           0.46 seconds
      cpu time            0.32 seconds
