BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AJ_Brien
Quartz | Level 8

Hello,

 

I have created a dataset that holds the names of all the datasets which are stored in a unix library. the number of datasets in that location will increase daily. 

 

This is the code that I have to create that master dataset:

 

ods output Members=Members;
proc datasets library=abc memtype=data;
run;
quit;

 

I also have a sample dataset:

63735723627
774648684
45648558585
85658676876
5865867568

 

What I need to do is check the library to see if any of the library datasets has any of the sample dataset values in there. But I'm clueless on how to even start querying the library datasets. How can I query all teh library datasets to look for sample dataset values?

 

Appreciate everyone's help and time and helping me figure this out.

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You want to inspect ALL the dataset in libname ABC to see if any of the datasets have variable ACC with a value you have in your work.sample dataset.

 

Now if your are lucky enough such that all the datasets in ABC have names beginning with   D  or T, you can do this:

data mysample;
  input acc;
datalines;
63735723627
774648684
45648558585
85658676876
5865867568
run;

data acc_found;
  if _n_=1 then do;
    declare hash h (dataset:'mysample');
      h.definekey('acc');
      h.definedone();
  end;
  set abc.d: abc.t:  indsname=indsn;
  if h.find()=0;
  dsname=indsn;
run;

which will generate a dataset with two variables: ACC and DSNAME.  It will have all the found ACC values, and the data set name(s) in which it was found.

 

  1.   The SET statement:
    1. the "abc:d:"  and "abc.t:" ask for all the datasets in library abc whose names begin with D or T.  So the single SET statement reads all the datasets.
    2. The "indsname=indsn" make a temporary variables (INDSN) with the name of the dataset for each incoming observation.
  2. The "if h.find()=0" is a sub-setting IF  (it has no THEN clause), and allows only observations in which the FIND method finds a matching ACC value in the hash object named h.  (The find method returns a zero when successful).

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

8 REPLIES 8
ballardw
Super User

Do you at least have a variable name that you expect the value(s) to be in?

Do you know if the values are character or numeric?

Do any of the data sets have the same structure? Perhaps they should be appended so that the search involves fewer sets.

 

Do you expect this to finish in any given time frame?

 

Show what you expect the output for this exercise to look like.

 

 

And SAS may not be the best tool for this.

 

AJ_Brien
Quartz | Level 8
Thank you for your reply.
I do know the name of that variable, it's of num type. All the datasets have the same structure too. They are pretty huge in size and number that's the reason it was adviced not to append them.
AJ_Brien
Quartz | Level 8
For eg, the sample dataset column name is 'acc' and of type num. All the datasets in that library also have a column name 'acc' of type num. Hope this helps.
ballardw
Super User

@AJ_Brien wrote:
Thank you for your reply.
I do know the name of that variable, it's of num type. All the datasets have the same structure too. They are pretty huge in size and number that's the reason it was adviced not to append them.

Just how big is "huge"? People have quite different standards and one person's "huge" is another's "miniscule".

 

If the sets are truly largish, such as terabyte sized then it is going to take a lot of time to search regardless.

If there is only one or two variables that you are going to look for them perhaps you might want to investigate indexing the data sets on those variables. That can often improve some sorts of data retrieval.

 

If the need is to search among arbitrary variables then be prepared to wait. A lot.

 

And you haven't indicated what you expect the output to look like. What you actually need can seriously effect the approach. Do you need  a list of data sets that have any of the values, a list of data sets by value, a count of the number of times each value appears in each set, a list of all data sets with the names of all of the variables that may have any of the values ...

 

mkeintz
PROC Star

You want to inspect ALL the dataset in libname ABC to see if any of the datasets have variable ACC with a value you have in your work.sample dataset.

 

Now if your are lucky enough such that all the datasets in ABC have names beginning with   D  or T, you can do this:

data mysample;
  input acc;
datalines;
63735723627
774648684
45648558585
85658676876
5865867568
run;

data acc_found;
  if _n_=1 then do;
    declare hash h (dataset:'mysample');
      h.definekey('acc');
      h.definedone();
  end;
  set abc.d: abc.t:  indsname=indsn;
  if h.find()=0;
  dsname=indsn;
run;

which will generate a dataset with two variables: ACC and DSNAME.  It will have all the found ACC values, and the data set name(s) in which it was found.

 

  1.   The SET statement:
    1. the "abc:d:"  and "abc.t:" ask for all the datasets in library abc whose names begin with D or T.  So the single SET statement reads all the datasets.
    2. The "indsname=indsn" make a temporary variables (INDSN) with the name of the dataset for each incoming observation.
  2. The "if h.find()=0" is a sub-setting IF  (it has no THEN clause), and allows only observations in which the FIND method finds a matching ACC value in the hash object named h.  (The find method returns a zero when successful).

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
AJ_Brien
Quartz | Level 8
thank you this makes so much sense.

One question though, I tested by adding some more columns to my sample dataset, that seems to have no impact as expected since the key is defined as 'acc' and only those columns will be checked. Is that the right understanding?
AJ_Brien
Quartz | Level 8
another question, sorry trying to understand all that I can do with this feature. If I would like to pull another column, 'date' column for eg. from those library datasets in addition to the dataset names for all the observations that are found, is there a way to do that too?
SuryaKiran
Meteorite | Level 14

Hello,

 

Did you try dictionary tables, you can find lot metadata information. I would query dictionary.tables if I were you.

 

proc sql;
select * from dictionary.tables
where libname="SASHELP";
quit;
Thanks,
Suryakiran

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 2183 views
  • 1 like
  • 4 in conversation