- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I have created a dataset that holds the names of all the datasets which are stored in a unix library. the number of datasets in that location will increase daily.
This is the code that I have to create that master dataset:
ods output Members=Members;
proc datasets library=abc memtype=data;
run;
quit;
I also have a sample dataset:
63735723627 |
774648684 |
45648558585 |
85658676876 |
5865867568 |
What I need to do is check the library to see if any of the library datasets has any of the sample dataset values in there. But I'm clueless on how to even start querying the library datasets. How can I query all teh library datasets to look for sample dataset values?
Appreciate everyone's help and time and helping me figure this out.
Thank you.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You want to inspect ALL the dataset in libname ABC to see if any of the datasets have variable ACC with a value you have in your work.sample dataset.
Now if your are lucky enough such that all the datasets in ABC have names beginning with D or T, you can do this:
data mysample;
input acc;
datalines;
63735723627
774648684
45648558585
85658676876
5865867568
run;
data acc_found;
if _n_=1 then do;
declare hash h (dataset:'mysample');
h.definekey('acc');
h.definedone();
end;
set abc.d: abc.t: indsname=indsn;
if h.find()=0;
dsname=indsn;
run;
which will generate a dataset with two variables: ACC and DSNAME. It will have all the found ACC values, and the data set name(s) in which it was found.
- The SET statement:
- the "abc:d:" and "abc.t:" ask for all the datasets in library abc whose names begin with D or T. So the single SET statement reads all the datasets.
- The "indsname=indsn" make a temporary variables (INDSN) with the name of the dataset for each incoming observation.
- The "if h.find()=0" is a sub-setting IF (it has no THEN clause), and allows only observations in which the FIND method finds a matching ACC value in the hash object named h. (The find method returns a zero when successful).
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Do you at least have a variable name that you expect the value(s) to be in?
Do you know if the values are character or numeric?
Do any of the data sets have the same structure? Perhaps they should be appended so that the search involves fewer sets.
Do you expect this to finish in any given time frame?
Show what you expect the output for this exercise to look like.
And SAS may not be the best tool for this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I do know the name of that variable, it's of num type. All the datasets have the same structure too. They are pretty huge in size and number that's the reason it was adviced not to append them.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@AJ_Brien wrote:
Thank you for your reply.
I do know the name of that variable, it's of num type. All the datasets have the same structure too. They are pretty huge in size and number that's the reason it was adviced not to append them.
Just how big is "huge"? People have quite different standards and one person's "huge" is another's "miniscule".
If the sets are truly largish, such as terabyte sized then it is going to take a lot of time to search regardless.
If there is only one or two variables that you are going to look for them perhaps you might want to investigate indexing the data sets on those variables. That can often improve some sorts of data retrieval.
If the need is to search among arbitrary variables then be prepared to wait. A lot.
And you haven't indicated what you expect the output to look like. What you actually need can seriously effect the approach. Do you need a list of data sets that have any of the values, a list of data sets by value, a count of the number of times each value appears in each set, a list of all data sets with the names of all of the variables that may have any of the values ...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You want to inspect ALL the dataset in libname ABC to see if any of the datasets have variable ACC with a value you have in your work.sample dataset.
Now if your are lucky enough such that all the datasets in ABC have names beginning with D or T, you can do this:
data mysample;
input acc;
datalines;
63735723627
774648684
45648558585
85658676876
5865867568
run;
data acc_found;
if _n_=1 then do;
declare hash h (dataset:'mysample');
h.definekey('acc');
h.definedone();
end;
set abc.d: abc.t: indsname=indsn;
if h.find()=0;
dsname=indsn;
run;
which will generate a dataset with two variables: ACC and DSNAME. It will have all the found ACC values, and the data set name(s) in which it was found.
- The SET statement:
- the "abc:d:" and "abc.t:" ask for all the datasets in library abc whose names begin with D or T. So the single SET statement reads all the datasets.
- The "indsname=indsn" make a temporary variables (INDSN) with the name of the dataset for each incoming observation.
- The "if h.find()=0" is a sub-setting IF (it has no THEN clause), and allows only observations in which the FIND method finds a matching ACC value in the hash object named h. (The find method returns a zero when successful).
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
One question though, I tested by adding some more columns to my sample dataset, that seems to have no impact as expected since the key is defined as 'acc' and only those columns will be checked. Is that the right understanding?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
Did you try dictionary tables, you can find lot metadata information. I would query dictionary.tables if I were you.
proc sql;
select * from dictionary.tables
where libname="SASHELP";
quit;
Suryakiran