Solved: Determine SET statement data set list expansion

RichardAD

I have a situation that requires knowing what a data set list expansion will be, regardless of common variable types being mixed, or of differing lengths.

Consider the complicated scenario that has a combination of single data sets, name prefix lists and numbered range lists.

data x_a ; x=1 ; a='abc' ; do i=1 to 4 ; output ; end ;
data x_b ; x=1 ; a='defg' ; b='xyzzy' ; do i=1 to 4 ; output ; end ;
data x_d ; y=1 ; b='xyz'  ; a='pqrstuv' ; i=1; stop ;
data x_f ; x='aaa' ; f='pqr' ; do i=1 to 4 ; output ; end ;
data y1 y2 y3 y4 y9 y10 y11; qqq='rrr' ; do i=1 to 4 ; output ; end ;

The following code captures the data set names, except for the one that has zero rows.

* no multiple length warning, because drop=_all_ ==> no variables;
* one record per input data set (that contains at least one row) ;
* output does not contain dataset='X_D' ;
options obs=1;
data datasets;
  set x_:(drop=_all_) y1-y4(drop=_all_) y9(drop=_all_) indsname=libmem ;
  dataset = libmem ;
run ;
options obs=max;

NOTE: There were 1 observations read from the data set WORK.X_A.
NOTE: There were 1 observations read from the data set WORK.X_B.
NOTE: There were 0 observations read from the data set WORK.X_D.
NOTE: There were 1 observations read from the data set WORK.X_F.
NOTE: There were 1 observations read from the data set WORK.Y1.
NOTE: There were 1 observations read from the data set WORK.Y2.
NOTE: There were 1 observations read from the data set WORK.Y3.
NOTE: There were 1 observations read from the data set WORK.Y4.
NOTE: There were 1 observations read from the data set WORK.Y9.
NOTE: The data set WORK.DATASETS has 8 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Obs    dataset

 1     WORK.X_A
 2     WORK.X_B
 3     WORK.X_F
 4     WORK.Y1
 5     WORK.Y2
 6     WORK.Y3
 7     WORK.Y4
 8     WORK.Y9

Short of writing my own name list expansion routine, is there a way to capture the no rows table?

mkeintz

Instead of using the INDSNAME parameter, you could parse the log notes, which include the datasets with zero observations. All those notes have the text 'observations read from the data set', which can be used as the parameter of the INPUT @'character string' technique:

data x_a ; x=1 ; a='abc' ; do i=1 to 4 ; output ; end ;
data x_b ; x=1 ; a='defg' ; b='xyzzy' ; do i=1 to 4 ; output ; end ;
data x_d ; y=1 ; b='xyz'  ; a='pqrstuv' ; i=1; stop ;
data x_f ; x='aaa' ; f='pqr' ; do i=1 to 4 ; output ; end ;
data y1 y2 y3 y4 y9 y10 y11; qqq='rrr' ; do i=1 to 4 ; output ; end ;
data x_1 x_empty  x_3;
  x=1; output x_1;
  x=3; output x_3; output x_3;
run;

options obs=1;
filename tmplog  temp ;
proc printto log=tmplog new;
run;
data _null_;
  set x_:(drop=_all_) y1-y4(drop=_all_) y9(drop=_all_) ;
run ;

proc printto; run;
options obs=max;
data datasets;
  infile tmplog dlm=' .';
  input  @'observations read from the data set' dslib :$8. dsname :$32. ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

A_Kh

Do you want to list only datasets with some observations after knowing what datasets have no observations?
If so, I would use DICTIONARY.TABLES to identify datasets with/without observations.

Eg:

data x_a ; x=1 ; a='abc' ; do i=1 to 4 ; output ; end ;
data x_b ; x=1 ; a='defg' ; b='xyzzy' ; do i=1 to 4 ; output ; end ;
data x_d ; y=1 ; b='xyz'  ; a='pqrstuv' ; i=1; stop ;
data x_f ; x='aaa' ; f='pqr' ; do i=1 to 4 ; output ; end ;
data y1 y2 y3 y4 y9 y10 y11; qqq='rrr' ; do i=1 to 4 ; output ; end ;

proc sql noprint;
	select distinct memname 
		into: dsn separated by ' ' 
			from dictionary.tables
				where libname= 'WORK' and nobs eq 0 /*or NE 0 to capture datasets with some obs*/;
quit; 
%put &=dsn;

RichardAD

Unfortnately no. I have to deal with data sets specified by a macro argument, such as

%macro zzz(datasets=) ;
...
%mend zzz;

%zzz(datasets=x_: y1-y4 y9)

It looks like I'll end up using Proc PRINTTO to capture a log snippet and then parse that.

A_Kh

It is still possible to use &DSN macro variable as your macro parameter.
It passes all desired datasets (with observations) into macro parameter 'datasets='.

data x_a ; x=1 ; a='abc' ; do i=1 to 4 ; output ; end ;
data x_b ; x=1 ; a='defg' ; b='xyzzy' ; do i=1 to 4 ; output ; end ;
data x_d ; y=1 ; b='xyz'  ; a='pqrstuv' ; i=1; stop ;
data x_f ; x='aaa' ; f='pqr' ; do i=1 to 4 ; output ; end ;
data y1 y2 y3 y4 y9 y10 y11; qqq='rrr' ; do i=1 to 4 ; output ; end ;

proc sql noprint;
	select distinct memname 
		into: dsn separated by ' ' 
			from dictionary.tables
				where libname= 'WORK' and nobs NE 0;
quit; 
%put &=dsn;

%macro zzz(datasets=) ;
...
%mend zzz;

%zzz(datasets=&dsn);

yabwon

Does something like this do the job?


data x_a ; x=1 ; a='abc' ; do i=1 to 4 ; output ; end ;
data x_b ; x=1 ; a='defg' ; b='xyzzy' ; do i=1 to 4 ; output ; end ;
data x_d ; y=1 ; b='xyz'  ; a='pqrstuv' ; i=1; stop ;
data x_f ; x='aaa' ; f='pqr' ; do i=1 to 4 ; output ; end ;
data y1 y2 y3 y4 y9 y10 y11; qqq='rrr' ; do i=1 to 4 ; output ; end ;
run;


%macro expandDataSetsList(lib=work, datasets=) ;
%LOCAL rc DataSetsList prefix;
%let prefix=_%sysfunc(datetime(),b8601dt15.)_;
%let rc = %sysfunc(doSubL(%str(
  options nonotes nosource nostimer nofullstimer nomprint nomlogic nosymbolgen;
  proc sql noprint;
    create table work.&prefix.tmp as
    select memname
    from dictionary.tables
    where libname=%upcase("&lib.")
    ;
  quit;
  proc transpose data=work.&prefix.tmp out=work.&prefix.tmp2;
    var memname;
    id memname;
  run;
  proc transpose data=work.&prefix.tmp2(obs=0) out=work.&prefix.tmp3(keep=_name_);
    var &datasets.;
  run;
  proc sql noprint;
    select _name_
    into :DataSetsList separated by " "
    from &prefix.tmp3
    ;
    drop table work.&prefix.tmp;
    drop table work.&prefix.tmp2;
    drop table work.&prefix.tmp3;
  quit;
)));
&DataSetsList.
%mend expandDataSetsList;

%PUT %expandDataSetsList(lib=work,datasets=x_: y1-y4 y9);

I assumed all datasets are in one library.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

RichardAD

Hi Bart:
Very interesting perspective shift, switching from data set list to variable list. I'll chew on this a bit thinking about the datasets= specifying tables in more than one library.

Quentin

@RichardAD wrote:
Hi Bart:
Very interesting perspective shift, switching from data set list to variable list. I'll chew on this a bit thinking about the datasets= specifying tables in more than one library.

I think @yabwon nailed it (which only makes it even more likely that I'll ping him the next time I see an interesting macro problem that I'm not creative enough to solve. : )

Even if you want to extend it to allow:

%PUT %expandDataSetsList(datasets=foo.x_: bar.y1-bar.y4 foo.y9);

I think parsing the librefs out of the list should be straight forward, then you could essentially run the logic in chunks, once for each libref.

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Quentin

@RichardAD wrote:

Unfortnately no. I have to deal with data sets specified by a macro argument, such as
%macro zzz(datasets=) ;
...
%mend zzz;

%zzz(datasets=x_: y1-y4 y9)
It looks like I'll end up using Proc PRINTTO to capture a log snippet and then parse that.

Hi, that seems feasible, but would be an annoying way to have to do it. It certainly feels like there should be a better way for SAS to do it for you.

I'm trying to think, what are the different places that data set lists are supported?

PROC DATASETS allows them and I tried:

proc datasets library=work memtype=data ;
  select x_: y1-y4 y9  ;
quit ;

But it doesn't work, because SELECT can only be used with the COPY statement and a few other statements.

Hmm, glancing at the docs, it looks like dataset lists are only used for SET statement MERGE statement and PROC DATASETS... https://documentation.sas.com/doc/en/lrcon/9.4/n0oru3akf51pi5n1wco2onzr39oz.htm

I still have hope that someone will come up with a neat way to expand a dataset list. Maybe @yabwon has a thought?

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Tom

I do NOT understand what you want.

I THINK you are asking how to convert something like:

x_: y1-y4 y9

into the list of dataset names that a SET statement would find.

I do NOT understand why that would be useful.

But it does not seem too hard. For example a query like this:

select memname 
  from dictionary.members 
  where libname='WORK'
    and (memname like 'X^_%' escape '^'
       or (memname like 'Y%' and input(substr(memname,2),?32.) between 1 and 4)
       or memname = 'Y9'
        )
;

It should not be hard to parse the initial list into pieces that then translate into the WHERE condition.

mkeintz

Instead of using the INDSNAME parameter, you could parse the log notes, which include the datasets with zero observations. All those notes have the text 'observations read from the data set', which can be used as the parameter of the INPUT @'character string' technique:

data x_a ; x=1 ; a='abc' ; do i=1 to 4 ; output ; end ;
data x_b ; x=1 ; a='defg' ; b='xyzzy' ; do i=1 to 4 ; output ; end ;
data x_d ; y=1 ; b='xyz'  ; a='pqrstuv' ; i=1; stop ;
data x_f ; x='aaa' ; f='pqr' ; do i=1 to 4 ; output ; end ;
data y1 y2 y3 y4 y9 y10 y11; qqq='rrr' ; do i=1 to 4 ; output ; end ;
data x_1 x_empty  x_3;
  x=1; output x_1;
  x=3; output x_3; output x_3;
run;

options obs=1;
filename tmplog  temp ;
proc printto log=tmplog new;
run;
data _null_;
  set x_:(drop=_all_) y1-y4(drop=_all_) y9(drop=_all_) ;
run ;

proc printto; run;
options obs=max;
data datasets;
  infile tmplog dlm=' .';
  input  @'observations read from the data set' dslib :$8. dsname :$32. ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

RichardAD

I found mkeintz approach being the solution most tightly coupled to (i.e the actual) process done by SET or MERGE. Input @'string' is a nice touch.

Fleshed out to a robust macro I ended up with

%macro memnames_of(datasets=, out=_datasets, outsym_qlibnamelist=libnames) ;
  %* Note: Does NOT handle data set specifiers that are direct filenames such as '<path>/xyzzy.sas7bdat' ;

  %if not %symexist (&outsym_qlibnamelist) %then %do ;
    %put ERROR: Symbol [&outsym_qlibnamelist] does not exist in the callers scope ;
    %return ;
  %end ;

  %let datasets = %sysfunc(prxchange(s/(\S+)/$1(drop=_all_)/,-1,&datasets));

  %* capture the expansion of &datasets ;

  filename tmplog temp ;
  proc printto log=tmplog ;
    options obs=1 ;
    data _null_ ;
      set &datasets ;
    run ;
    options obs=max ;
  proc printto log=log ;

  %* parse the captured log to get a list of data sets ;

  %let &outsym_qlibnamelist = '' ;
  data &out (keep=dataset memnum) ;
    infile tmplog end=end;
    input ;

    rx = prxparse ('m/^NOTE: There were \d+ observations read from the data set (.*)\./') ;

    length memnum 8 dataset $41 libnames $32767 libname $8 ;
    retain libnames ;

    if prxmatch (rx, _infile_) then do ;
      dataset = prxposn(rx, 1, _infile_) ;
      memnum + 1 ;
      output ;
      libname = scan(dataset,1,'.') ;
      if indexw(libnames, quote(trim(libname),"'")) = 0 then libnames = catx(' ', libnames, quote(trim(libname),"'")) ;
    end ;
    else if _infile_ =: 'ERROR:' then
      put _infile_ ;
    
    %* libnames when used in subsequent join will reduce dictionary search space;
    if end then call symputx("&outsym_qlibnamelist", libnames) ; 
  run ;
  filename tmplog ;
%mend memnames_of ;

and test code

%let workpath = %sysfunc(pathname(WORK)) ;

options dlcreatedir ;

libname one "&workpath/one" ;
libname two "&workpath/two" ;

options nodlcreatedir ;

data a b c d e ; set sashelp.class (obs=1); output a b d ; run ;
options user=one ;
data x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13; set sashelp.class (obs=1); run ;
options user=two ;
data y1 y2 y3 y4 y17 y18 y19 y20 ; set sashelp.class (obs=1); output y1 y2 y3 y4 ; run ;
options user=work ;

/* NOTE:
   Numbered data set lists with reverse order numbering (such as y20-y17)
   only works when endpoints have same number of digits and no leading zeroes
*/

%global libname_coverage ;
%memnames_of ( datasets=a b c d e one.x: two.y1-two.y4 two.y20-two.y17, outsym_qlibnamelist=libname_coverage ) 

title  "Data set list expanded from" ;
title2 "[a b c d e one.x: two.y1-two.y4 two.y20-two.y17]" ;
title3 "libname coverage qlist: " "&libname_coverage" ;
proc print data=_datasets noobs;
run ;

producing output

Data set list expanded from
[a b c d e one.x: two.y1-two.y4 two.y20-two.y17]
libname coverage qlist: 'WORK' 'ONE' 'TWO'

memnum    dataset

   1      WORK.A
   2      WORK.B
   3      WORK.C
   4      WORK.D
   5      WORK.E
   6      ONE.X1
   7      ONE.X10
   8      ONE.X11
   9      ONE.X12
  10      ONE.X13
  11      ONE.X2
  12      ONE.X3
  13      ONE.X4
  14      ONE.X5
  15      ONE.X6
  16      ONE.X7
  17      ONE.X8
  18      ONE.X9
  19      TWO.Y1
  20      TWO.Y2
  21      TWO.Y3
  22      TWO.Y4
  23      TWO.Y20
  24      TWO.Y19
  25      TWO.Y18
  26      TWO.Y17

Quentin

I like scanning the log for errors/warning/notes, but in general scanning the log for data scares me a bit. So I'd probably opt for @yabwon 's approach. That said, thanks for posting your finished macro.

Just in case SAS ever fixes the grammatical errors in their note:

NOTE: There *were* 1 observation*s* read from the data set WORK.X_F.

You might want to update the regex slightly.

I did get burned by my log checker once when a note changed from "data set" to "dataset" (or the converse) during a version update. SAS is pretty fanatical about backwards compatibility (in a good way), but they are free to change the text in log messages.

BASUG is hosting free webinars Next up: Mike Raithel presenting on validating data files on Wednesday July 17. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Re: Determine SET statement data set list expansion

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!