I'd like to export a large SAS dataset to multiple SAS files using 3 BY groups in the filename. It needs to be flexible since the BY group variables can change from set to set.
In experimenting, I found this outstanding thread (https://communities.sas.com/t5/SAS-Programming/need-help-exporting-each-line-of-the-dataset-into-a-d...) that seems to do what I want, at least in in terms of creating filenames using the by group variable names. However, that code saves the files as csv files. I want to create SAS files (with filenames using the by group variables). So, while I started to adapt it for my files; ultimately I may need to start from scratch to reach my goal.
*name of the data set with the original data;
%let lib_name = work;
%let dsn_name = dsn1;
*Variable to split on;
%let var_split1 = YEAR;
%let var_split2 = MONTH;
%let var_split3 = Direction;
*path to folder to save text files;
%let path_folder= E:\Import Statistics\Warehouse\;
*if you are exporting each line this is not required
but should not cause any issues unless your data set is large. In that case rename your data set to _temp and skip this step;
PROC SORT DATA=&lib_name..&dsn_name OUT=_temp;
BY &var_split1 &var_split2 &var_split3;
RUN;
*make variable lists;
*for header row;
proc sql noprint;
select name into :var_list_csv separated by ", " from sashelp.vcolumn where libname = upper("&lib_name") and memname = upper("&dsn_name");
select name into :var_list separated by " " from sashelp.vcolumn where libname = upper("&lib_name") and memname = upper("&dsn_name");
quit;
DATA _NULL_;
SET _temp; *Dataset to be exported;
BY &var_split1 &var_split2. &var_split3.; *Variable that file is to be split on;
*Create path to file that is to be exported;
if first.&var_split3. then out_file=cats("&path_folder.", trim(&var_split1.), "_", trim(&var_split2.), "_", trim(&var_split3.), ".csv");
file temp filevar=out_file dlm=',' dsd;
*If first value of make then output column names;
if first.&var_split3. then
put "&var_list_csv.";
*Output variables;
put &var_list.;
run;
The file location can be any length. By "location" I mean the directory to which should assign a libname (which itself is just (up to) an 8-character symbol).
But the name of the sas dataset (i.e. the name of the file located in that library) has to be 32 characters or less
In other words you need something like:
libname mylib "c:\this\is\a\long\directory\name";
Then, instead of
dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));
h.output(dataset:dsname);
you could have
dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));
h.output(dataset:cats("mylib.",dsname));
Of course, this assumes that all the sas datasets will go to the same directory.
Describing a problem as "modifying X's program to do Y instead of Z" is not an informative topic description. Person X may have written 100's of programs. And once your own topic is marked as solved, future users of this community using problem-specific search terms will have no way to find that solution.
Please tell us what your objective is. I think it's something like split a SAS dataset into multiple SAS datasets based on the value of one or multiple variables. If so, then the solution you propose to tweak might not even be the most suitable.
Thanks. I have changed both the thread title and the goal.
The entire underlying logic seems wrong to me.
I'd like to export a large SAS dataset to multiple SAS files using 3 BY groups in the filename. It needs to be flexible since the BY group variables can change from set to set.
Typically, you don't "pre-create" filtered data sets for later use. You filter the original data set when you need it, so if at some point in time, you need the data from a specific year, specific month and specific direction, at this point in time you would go ahead and do one extract from the larger file.
data want;
set work.dsn1;
where year=2020 and month=7 and direction='East';
...
run;
This seems so much more efficient than creating all of these pre-filtered data sets via some "flexible code".
I do intentionally create discrete datasets (or pre-filtered, as you call them), regardless of whether that is typical. Any help towards my stated goal would be appreciated. Thanks
Since this is intended to create SAS data sets rather than CSV I would suggest an entirely different approach - use a hash object as a means of dynamically naming the output datasets as each by group is processed. The program below applies this approach to dataset HAVE, which is sorted by NAME/GROUP/SUBGROUP. For each such triplet, a dataset will be formed with the name based on the values of the three variables, separated by underscores. The program works in a single pass of the sorted data:
data have;
set sashelp.class;
do group='A','B';
do subgroup=1,2;
do i=1 to _n_;
output;
end;
end;
end;
run;
%let dsn=have;
%let byvars=name group subgroup;
%let last_byvar=%scan(&byvars,-1);
data _null_;
set &dsn;
by &byvars;
if _n_=1 then do;
declare hash h (dataset:"&dsn (obs=0)",multidata:'Y');
h.definekey("&last_byvar");
h.definedata(all:'Y');
h.definedone();
end;
h.add();
if last.&last_byvar;
length dsname $32;
dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));
h.output(dataset:dsname);
h.clear();
run;
The program makes a number of assumptions. Probably the most salient from an operational point of view is
That works well at creating the SAS files, but they are temporary files. How can I modify that to save all of the temporary files to a drive/folder? Almost there. Thanks
@texasmfp wrote:
That works well at creating the SAS files, but they are temporary files. How can I modify that to save all of the temporary files to a drive/folder? Almost there. Thanks
This is a good question to treat as a learning exercise. It will require very little change to generate.
As always with macros or macro variables: determine how this code would look without the macro variables, and then see what changes would be needed to write the output to a non-temporary SAS library instead of WORK. Then you would see where a macrovar for the user-specified libname would need to be placed. You can even get some guidance from Reeza's code that you originally mentioned.
Oh believe me, I am trying to learn.
Apparently my intended file location is, by itself, more than 32 characters long, so I created a shorter temporary directory (which detracts from the intended automation of this process).
That said the following modification comes close but its missing the .sas ending (Or perhaps I am completely off base)
dsname=cats("&root1.",%scan(&byvars,1),'_',%scan(&byvars,2),'_',%scan(&byvars,3));
ERROR: The value E:\TEMP\EXPORT_2022_01 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\EXPORT_2022_02 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\EXPORT_2022_03 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\EXPORT_2022_04 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_01 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_02 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_03 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_04 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
NOTE: The SAS System stopped processing this step because of errors.
If you use any macro variables without showing use how you use/ define them can't do much to debug.
Show what you are providing in the macro variables and then show what you expect to be created.
Better yet, also include the code that did not use any macros or macro variables that worked before attempting to use the macro/ macro variables.
If that DSNAME = statement is used in a data step you very likely do not want/need to use %scan, but the data step SCAN function.
Plus the errors you show are not generated by the code you use but trying to use that variable in another statement somewhere. So you need to show that statement.
The file location can be any length. By "location" I mean the directory to which should assign a libname (which itself is just (up to) an 8-character symbol).
But the name of the sas dataset (i.e. the name of the file located in that library) has to be 32 characters or less
In other words you need something like:
libname mylib "c:\this\is\a\long\directory\name";
Then, instead of
dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));
h.output(dataset:dsname);
you could have
dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));
h.output(dataset:cats("mylib.",dsname));
Of course, this assumes that all the sas datasets will go to the same directory.
Thanks, that tweak solved it.
Thanks Reeza. As you said, storage and file size can be issues. Those issues are in play here as, if held in one database, the file size would be around 1.5 terabytes, and growing by about 60-100 gig/year. A file that size comes with logistical issues related to storage/transfer/backup. In addition, simply trying to read/write such a huge file gets very "laggy" even it is just to use the powerful BY statements. Also, on a near daily basis, a small number of records are revised. By creating discrete sub-sets, it is much easier to revise the small number of affected files than it is to recompile a huge dataset. Finally, I have existing workflow (and SAS programs) that rely upon these discrete sub-set files. Much easier to change one file than to revise my entire workflow. Thanks to all that have contributed to this thread.
@texasmfp wrote:
Thanks Reeza. As you said, storage and file size can be issues. Those issues are in play here as, if held in one database, the file size would be around 1.5 terabytes, and growing by about 60-100 gig/year. A file that size comes with logistical issues related to storage/transfer/backup. In addition, simply trying to read/write such a huge file gets very "laggy" even it is just to use the powerful BY statements. Also, on a near daily basis, a small number of records are revised. By creating discrete sub-sets, it is much easier to revise the small number of affected files than it is to recompile a huge dataset. Finally, I have existing workflow (and SAS programs) that rely upon these discrete sub-set files. Much easier to change one file than to revise my entire workflow. Thanks to all that have contributed to this thread.
I have dealt with this very issue - we would receive, on a daily basis all the trades and quotes issued on the major US stock exchanges - over a billion daily records recently (these data have timestamps at the nanosecond level). Actually we do keep separate daily datasets, but they are sorted by stock ticker, but just think of it as an analog to a DATE by-variable in your case.
Now, if your dataset is sorted by date, and your periodic updates either (1) modify existing "historical" records (i.e. no deletions or additions in the middle of the dataset) , or (2) append records for new dates to the end of the dataset, you have a dataset sorted by date, in which the start record and end record for each date remains fixed throughout all subsequent updates.
If so, this is a case where you might benefit from what I call a condensed index (see "Condensed and Sparse Indexes for Sorted SAS® Datasets" from a 2012 NESUG (remember them?) presentation). If you only want a given date, you can use the firstobs and obs dataset name parameters to avoid filtering out all other dates. Those firstobs/obs values for each date (or each date/stock ticker in some of my work) are kept in a (very very) small ancillary file that is trivial to use and trivial to update.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.