Re: How to loop through multiple folders and create tables

stormblaster98 · Posted 06-28-2022 03:31 PM

I have a folder called area which contains many other folders of certain dates. Within the dates folder is another folder called Prod which contains a bunch of datasets (area -> (12312020 -> Prod -> cost.sas7bdat), (07312017 -> Prod -> cost.sas7bdat), (etc.)).

I have a program that creates tables using the datasets within prod. Right now my program uses a libname to manually choose the date folder:

%let folder = 12312020;
libname test "home/area/&folder/prod";

create table table1 as
 select a.ID;
 from test.cost a
 where ....
;

I want to be able to create table1 for every date in the area folder without having to manually re-enter the date. Is there a way I can automate this?

Tom · Posted 06-28-2022 03:39 PM

Get the list into a dataset.

Much easier it you can use operating system commands.

data files ;
  infile "ls /home/area/*/prod/cost.sas7bdat" pipe truncover;
  input filename $256.;
run;

But there are other ways.

Use the data to generate code.

filename code temp;
data _null_;
  set files end=eof;
  file code ;
  if _n_=1 then put 'set';
  put ' ' filename :$quote. ;
  if eof then put 'indsname=dsname;' ;
run;

Then use the generated code to create your desired dataset.

data want;
%include code / source2;
   date=input(scan(dsname,-3,'/'),mmddyy10.);
   format date yymmdd10.;
run;

PS: If you are going to use date strings in file names then always use them in YMD order so the filenames will sort properly. Also using either MDY or DMY order will confuse 50% of your audience.

stormblaster98 · Posted 06-28-2022 04:40 PM

I would like to use multiple files not just cost.sas7bdat. Here is an example of a table I want to create:

	create table Table1 as
		select a.ENTITY_ID, a.date, a.group
		from test.cost a
		where cats(a.ENTITY_ID, a.date, a.group) not in
		(select cats(ENTITY_ID, date, group) from test.cost_detl)
		;

As you can see there are multiple files within prod I want to access (cost and cost_detl).

I need a way to access every file within the folders and then be able to create tables based on each folder. So Table1 will be created for date 12312024, 12312023, 12312022, and on for every date folder in area.

Reeza · Posted 06-28-2022 05:47 PM

Do the dates follow a pattern? If so, what is that pattern. If it can be automated the process can be automated.

The other method is to scan all folders for that naming structure and implement the process (not provided at this point) for each folder with the correct naming structure.

stormblaster98 · Posted 06-28-2022 06:00 PM

Here is an example of my folder structure. Each of the date folders have inside a prod folder which carries all the datasets I want to access when creating a table for each date.

Reeza · Posted 06-28-2022 06:07 PM

That doesn't answer my question. Is there a pattern to the dates? Every month end? From your screenshot, it seems like random dates which then cannot be automated.

You could then go the second approach and scan each file and then do what you need.

Reeza · Posted 06-28-2022 06:13 PM

Which is closer to what you need to do:

Option 1

Create list of files
For each folder, run specific task
Save to some location

Option 2

Create list of files
Import all files under prod folder
Create single data set
OR create multiple data sets for each table
Process data as needed

Any steps that may be missing?

stormblaster98 · Posted 06-28-2022 06:33 PM

Sorry Im new to SAS programming. Option 1 seems closer.
If I could describe the process it would be

- As you saw I have area -> dates -> prod -> datasets. Each of these dates have datasets in them.

- In my original post I showed how I am using a libname to access dates folder "12312020" and then creating a table using the datasets in that folder.

- These tables I create are sent to WORK library.

- I want to automate it so the tables I create dont need a libname to specify where to look. I want the tables to be created for every dates folder datasets.

I hope that helps. Doing my best trying to explain. I think my original post explained it best.

Reeza · Posted 06-28-2022 06:37 PM

@stormblaster98 wrote:

- I want to automate it so the tables I create dont need a libname to specify where to look. I want the tables to be created for every dates folder datasets.

I hope that helps. Doing my best trying to explain. I think my original post explained it best.

Do all data sets in the different prod folders have unique names?

If not, then this is not possible.

SASKiwi · Posted 06-28-2022 07:14 PM

You really should consider reorganizing the way you manage your SAS datasets as you are creating a whole lot of extra work for yourself by having different folders for each dataset date. Just use ONE folder and datestamp your dataset names using YYYYMMDD as dataset suffixes. That way they will sort in date order automatically when viewed in Windows Explorer:

entity_20220601

entity_20220602

entity_scenario_20220601

entity_scenario_20220602

ONE LIBNAME statement then allows you to read all of your datasets without having to traverse folders.

If this approach creates too many datasets then consider appending all of your entity and entity_scenario datasets adding a column to identify which date the data came from.

Reeza · Posted 06-28-2022 07:21 PM

From the PM'd screenshot, these are views.

Talk to your DB team about getting direct access to the main table instead of the view access.

Otherwise, you need to basically re-process this data either into multiple datasets in a single library or a single data set in a library.

How large are each files? Since someone has bothered to create indexes I'm guessing fairly large.

Do you have control over the folder structure at all? If so, getting them populated into a single folder with unique names such as:

AL_Table_MMDDYY

Is a better option.

If you have access to that space as well, you could copy them all over to one location using FCOPY() but like I said, depending on the size this could be problematic.

So it also depends on what you're trying to do. I'd probably also consider designing my process to access the folders only as needed.

The gist of it is - you have partitioned files that you're trying to do something with. Without knowing all the information we can't offer much more advice. Is there someone at your company who could assist with the best approach?

Tom · Posted 06-28-2022 08:04 PM

Is there a pattern to the list of "date" folders? Those random looking digits strings are dates? Is that supposed to be in MMDDYYYY pattern? Why is there one for 31JUN2019? There are only 30 days in June.

It does not sound like you actually know what you want to do or at least are having a hard describing it clearly. Being able to say what you want is the first step to being able to program a computer to do what you want.

Do the datasets with the same names, like ENTITY or CODE, have the exact same structure in each folder? That is do they have all of the variables? Are the variables defined exactly the same, same type, same storage length? Or have there been changes to the structure over time. In particular is the same variable sometimes numeric and other times character? Do the storage lengths of character variables change? That is critical in being able to generate a single program that can work smoothly with data from any of the folders and especially for code that wants to combine data from multiple folders.

Do all of the folders contain the same list of datasets? Are there some datasets that are only in some of the folders and not in others?

Which datasets to you actually need to combine? First you said you wanted to combine the COST datasets. Then you said you also needed to look at COST_DETL datasets. Now you show a picture of files that are for ENTITY and ENTITY_SCENARIO datasets. How many datasets are their in each folder? Which of them do you need to combine?

Do you need to combine the data from multiple different folders into a single dataset or not? Or do you just want a easy way look a some given version of the dataset? If you want to combine them do you need to combine them across all of the "dates"?

Do you just want to individual datasets across the folders? Or do you actually need to join the data from the same folder together like your second post where you are trying to use SQL to join them. And if you need to join them do you want to join them first and then combine the results from many "dates"? Or would it be better to combine the individual datasets together and then join them? If you need to join them then what are the key variables? Do the key variables uniquely identify the observations?

So assuming you only needed to combine data from two folders you could just do something like:

libname x1 '/home/area/06312019/prod';
libname x2 '/home/area/12312022/prod';

data entity;
  set x1.entity x2.entity;
run;

You can see how that pattern could easily be extended to 3,4,5, etc folders. If you had a dataset with the list of folders you could easily generate that code from that list.

So show what code it is you are actually trying to run. Explain which parts of it need to change.

It would help a lot if you gave an overview of what the whole project is about. Why are there so many datasets in so many different folders? Why do you now need to combine them?

stormblaster98 · Posted 06-28-2022 09:06 PM

I'm sorry I am trying my best to explain. I'll try once more. Forget what I had said before.

I have an area folder which contains lots of dates folders (yes they are dates in the MMDDYYYY format), in the dates folder is a prod folder which contains that dates datasets. (As I shown in multiple screenshots). There are around 20 datasets in each folder all with the same name. The dates represent when the datasets data where captured. The dataset names in each prod folder are the same. So there is an entity.sas7bdat in every date->prod folder, as well as allocation.sas7bdat, and risk.sas7bdat, and entity_scenario.sas7bdat etc.

%let folder = 12312020; /* Specifying the date I want */
libname test "home/area/&folder/prod";

create table Table1 as
		select a.id, a.date, a.group
		from test.allocation a
		where cats(a.id, a.date, a.group) not in
		(select cats(id, date, group) from test.risk)
		;

This is an example of one of the tests I create. Notice how I have to specify which date folder I want to use. I want to make it so I dont need to specify the date, and so this table is created for all the date folders.

The project is to create tests and check data quality. So I use sql to create tables to find possible problems in the datasets from the prod folder and review them in the WORK library which is where the create tables are outputted to.

Right now, I have to manually change the date (as you saw in my let statement in the original post) in order to access multiple date folders and test the datasets in them. I want a way to automate this so I dont have to manually change the date, instead running the program would give me the data for all the dates. There are a lot of dates thats why automating it would be better.

I do not need to combine datasets or anything like that. Just need to test them using my create table sql and reviewing them in the excel I export them too.

Im hesitant to share any of the actual code due to work privacy. All the code examples Ive shown is a mock with different names etc. but the foundation of it is the same.

I hope this made it a more clear. Im still learning SAS so I'm trying to explain it in the best way I can

Tom · Posted 06-28-2022 10:41 PM

Thanks. That is clearer and I think you can start by just using simple methods.

In that situation I would use the same libref to point at the folder and just change which of the many directories it is pointing at.

Let's use the name PROD since you have that /prod/ terminal folder.

So if today you want to work on data from that phantom June 31st folder you would use.

libname prod "/home/area/06312014/prod" access=readonly;

Now you can create as many programs as you want to check the data that use PROD as the libref that points to the datasets. Like the snippet you showed. The sample program you showed is not really doing any testing, so you probably will want to spend some time figuring out what code you want to write just using the one of the sets of datasets.

Once you have some working SAS code then if you want to run those tests on a different folder just change the LIBNAME statement and leave the rest of the program the same.

Now if you want to automate running a series of tests on multiple folders you might want to store those tests into a program. Do NOT include the LIBNAME statement in the program file.

Say you did that and called the program qc_check.sas then you can then use a statement like:

%INCLUDE 'qc_check.sas' / source2;

to run that code.

So if you have series of dates you want to check just interleave those %INCLUDE statements with statements that change where PROD is pointing.

libname prod "/home/area/06312014/prod" access=readonly;
%INCLUDE 'qc_check.sas' / source2;
libname prod "/home/area/12312019/prod" access=readonly;
%INCLUDE 'qc_check.sas' / source2;

If you just want to check say 10 folders the simplest thing is just copy those lines 10 times.

If you need to do it for 1000 folders then you might want to make away to automate it.

So if you put the list of folders into a dataset you can use the dataset to generate that series of programs.

And once you are more comfortable with how SAS works and writing SAS code you can also look into learning how to use the SAS macro language to generate SAS code. So you might take your qc_check.sas program file and convert it into a macro that takes as one input parameter the "date" to use to create the libref.

%macro qc_check(date);
libname prod "/home/area/&date./prod" access=readonly;

* check entity dataset ;
* check code dataset; 
....
%mend qc_check;

Then to run it for multiple dates you just need to call it multiple times.

%qc_check(06312014);
%qc_check(12312019);

But you should wait until you know what SAS code you want to generate before trying how to write a program that writes SAS programs for you (aka a SAS macro).

Reeza · Posted 06-29-2022 11:10 AM

Ok, with this update, you're looking at Option 1 for sure.

Something like this is what I would suggest. You should be able to run this on your system and get a data set character_summary generated.


*create empty data set for list of folders;
data file_list;
length dir $300. name $100.;
run;

*macro to scan library for the list of folders;
%macro list_files(dir,ext);
  %local filrf rc did memcnt name i;
  %let rc=%sysfunc(filename(filrf,&dir));
  %let did=%sysfunc(dopen(&filrf));      

   %if &did eq 0 %then %do; 
    %put Directory &dir cannot be open or does not exist;
    %return;
  %end;

   %do i = 1 %to %sysfunc(dnum(&did));   

   %let name=%qsysfunc(dread(&did,&i));

     %if %qscan(&name,2,.) = %then %do; 
        proc sql;
        insert into file_list
        values("&dir", "&name");
        quit;
        %list_files(&dir\&name)
      %end;
   %end;

   %let rc=%sysfunc(dclose(&did));
   %let rc=%sysfunc(filename(filrf));     

%mend list_files;

*run macro for folder;
%list_files(/home/area);

*clean files macro (demo example);
%macro clean_files(name=);

*sets library to folder;
libname test "home/area/&folder/prod";

*runs a frequency on all character variables;
ods select none;
ods output onewayfreqs=_char_summary;
proc freq data=test.allocation;
table _character_ ;
run;
ods select all;

*Format output;
data _char_summary;
length variable $32. variable_value $100.;
set _char_summary;

source = "&folder.";

Variable=scan(table, 2);

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency percent cum:;
label variable='Variable' 
	variable_value='Variable Value';
run;


*appends to master file;
proc append base = character_summary data=_char_summary;
run;

*clean up temporary files;
proc datasets lib=work nodetails nolist;
delete _:;
run;quit;

*clear library reference;
libname test;

%mend;

*calls macro for each folder;
data _null_;
set file_list;
str = catt('%clean_files(name=', name, ');');
call execute(str);
run;

*check output;
proc print data=character_summary(obs=100);
run;

If this doesn't work, add this line to the top, re-run and post your code:

options mprint symbolgen;
run;

EDIT: modified the macro to have proc freq store all output. But basically an example.

Ready to join fellow brilliant minds for the SAS Hackathon?