Solved: Re: Query multiple datasets in a library for values in another dataset

AJ_Brien · Posted 11-13-2019 03:39 PM

Hello,

I have created a dataset that holds the names of all the datasets which are stored in a unix library. the number of datasets in that location will increase daily.

This is the code that I have to create that master dataset:

ods output Members=Members;
proc datasets library=abc memtype=data;
run;
quit;

I also have a sample dataset:

63735723627

774648684

45648558585

85658676876

5865867568

What I need to do is check the library to see if any of the library datasets has any of the sample dataset values in there. But I'm clueless on how to even start querying the library datasets. How can I query all teh library datasets to look for sample dataset values?

Appreciate everyone's help and time and helping me figure this out.

Thank you.

mkeintz · Posted 11-13-2019 04:18 PM

You want to inspect ALL the dataset in libname ABC to see if any of the datasets have variable ACC with a value you have in your work.sample dataset.

Now if your are lucky enough such that all the datasets in ABC have names beginning with D or T, you can do this:

data mysample;
  input acc;
datalines;
63735723627
774648684
45648558585
85658676876
5865867568
run;

data acc_found;
  if _n_=1 then do;
    declare hash h (dataset:'mysample');
      h.definekey('acc');
      h.definedone();
  end;
  set abc.d: abc.t:  indsname=indsn;
  if h.find()=0;
  dsname=indsn;
run;

which will generate a dataset with two variables: ACC and DSNAME. It will have all the found ACC values, and the data set name(s) in which it was found.

The SET statement:
1. the "abc:d:" and "abc.t:" ask for all the datasets in library abc whose names begin with D or T. So the single SET statement reads all the datasets.
2. The "indsname=indsn" make a temporary variables (INDSN) with the name of the dataset for each incoming observation.
The "if h.find()=0" is a sub-setting IF (it has no THEN clause), and allows only observations in which the FIND method finds a matching ACC value in the hash object named h. (The find method returns a zero when successful).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 11-13-2019 03:49 PM

Do you at least have a variable name that you expect the value(s) to be in?

Do you know if the values are character or numeric?

Do any of the data sets have the same structure? Perhaps they should be appended so that the search involves fewer sets.

Do you expect this to finish in any given time frame?

Show what you expect the output for this exercise to look like.

And SAS may not be the best tool for this.

AJ_Brien · Posted 11-13-2019 03:52 PM

Thank you for your reply.
I do know the name of that variable, it's of num type. All the datasets have the same structure too. They are pretty huge in size and number that's the reason it was adviced not to append them.

AJ_Brien · Posted 11-13-2019 03:54 PM

For eg, the sample dataset column name is 'acc' and of type num. All the datasets in that library also have a column name 'acc' of type num. Hope this helps.

ballardw · Posted 11-13-2019 04:23 PM

@AJ_Brien wrote:
Thank you for your reply.
I do know the name of that variable, it's of num type. All the datasets have the same structure too. They are pretty huge in size and number that's the reason it was adviced not to append them.

Just how big is "huge"? People have quite different standards and one person's "huge" is another's "miniscule".

If the sets are truly largish, such as terabyte sized then it is going to take a lot of time to search regardless.

If there is only one or two variables that you are going to look for them perhaps you might want to investigate indexing the data sets on those variables. That can often improve some sorts of data retrieval.

If the need is to search among arbitrary variables then be prepared to wait. A lot.

And you haven't indicated what you expect the output to look like. What you actually need can seriously effect the approach. Do you need a list of data sets that have any of the values, a list of data sets by value, a count of the number of times each value appears in each set, a list of all data sets with the names of all of the variables that may have any of the values ...

mkeintz · Posted 11-13-2019 04:18 PM

You want to inspect ALL the dataset in libname ABC to see if any of the datasets have variable ACC with a value you have in your work.sample dataset.

Now if your are lucky enough such that all the datasets in ABC have names beginning with D or T, you can do this:

data mysample;
  input acc;
datalines;
63735723627
774648684
45648558585
85658676876
5865867568
run;

data acc_found;
  if _n_=1 then do;
    declare hash h (dataset:'mysample');
      h.definekey('acc');
      h.definedone();
  end;
  set abc.d: abc.t:  indsname=indsn;
  if h.find()=0;
  dsname=indsn;
run;

which will generate a dataset with two variables: ACC and DSNAME. It will have all the found ACC values, and the data set name(s) in which it was found.

The SET statement:
1. the "abc:d:" and "abc.t:" ask for all the datasets in library abc whose names begin with D or T. So the single SET statement reads all the datasets.
2. The "indsname=indsn" make a temporary variables (INDSN) with the name of the dataset for each incoming observation.
The "if h.find()=0" is a sub-setting IF (it has no THEN clause), and allows only observations in which the FIND method finds a matching ACC value in the hash object named h. (The find method returns a zero when successful).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AJ_Brien · Posted 11-13-2019 04:34 PM

thank you this makes so much sense.

One question though, I tested by adding some more columns to my sample dataset, that seems to have no impact as expected since the key is defined as 'acc' and only those columns will be checked. Is that the right understanding?

AJ_Brien · Posted 11-13-2019 04:40 PM

another question, sorry trying to understand all that I can do with this feature. If I would like to pull another column, 'date' column for eg. from those library datasets in addition to the dataset names for all the observations that are found, is there a way to do that too?

SuryaKiran · Posted 11-13-2019 04:22 PM

Hello,

Did you try dictionary tables, you can find lot metadata information. I would query dictionary.tables if I were you.

proc sql;
select * from dictionary.tables
where libname="SASHELP";
quit;

Thanks,
Suryakiran

SAS Innovate 2025: Call for Content

Classroom Training Available!