Solved: Re: Exporting SAS dataset to multiple SAS files using BY groups in the...

texasmfp · Posted 05-30-2022 03:57 AM

I'd like to export a large SAS dataset to multiple SAS files using 3 BY groups in the filename. It needs to be flexible since the BY group variables can change from set to set.

In experimenting, I found this outstanding thread (https://communities.sas.com/t5/SAS-Programming/need-help-exporting-each-line-of-the-dataset-into-a-d...) that seems to do what I want, at least in in terms of creating filenames using the by group variable names. However, that code saves the files as csv files. I want to create SAS files (with filenames using the by group variables). So, while I started to adapt it for my files; ultimately I may need to start from scratch to reach my goal.

*name of the data set with the original data;
%let lib_name = work;
%let dsn_name = dsn1;
*Variable to split on;
%let var_split1 = YEAR;
%let var_split2 = MONTH;
%let var_split3 =  Direction;
*path to folder to save text files;
%let path_folder= E:\Import Statistics\Warehouse\;

*if you are exporting each line this is not required 
but should not  cause any issues unless your data set is large. In that case rename your data set to _temp and skip this step;
PROC SORT DATA=&lib_name..&dsn_name OUT=_temp;
BY &var_split1 &var_split2 &var_split3;
RUN;

*make variable lists;
*for header row;
proc sql noprint;
select name into :var_list_csv separated by ", " from sashelp.vcolumn where libname = upper("&lib_name") and memname = upper("&dsn_name");
select name into :var_list separated by " " from sashelp.vcolumn where libname = upper("&lib_name") and memname = upper("&dsn_name");
quit;



DATA _NULL_;

SET _temp; *Dataset to be exported;
BY &var_split1 &var_split2. &var_split3.; *Variable that file is to be split on;

*Create path to file that is to be exported;
if first.&var_split3. then out_file=cats("&path_folder.", trim(&var_split1.), "_", trim(&var_split2.), "_", trim(&var_split3.), ".csv");

file temp filevar=out_file dlm=',' dsd;

*If first value of make then output column names;
if first.&var_split3. then 
put "&var_list_csv.";

*Output variables;
put &var_list.;

run;

mkeintz · Posted 05-30-2022 12:05 PM

The file location can be any length. By "location" I mean the directory to which should assign a libname (which itself is just (up to) an 8-character symbol).

But the name of the sas dataset (i.e. the name of the file located in that library) has to be 32 characters or less

In other words you need something like:

libname mylib "c:\this\is\a\long\directory\name";

Then, instead of

dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));  
  h.output(dataset:dsname);

you could have

  dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));  
  h.output(dataset:cats("mylib.",dsname));

Of course, this assumes that all the sas datasets will go to the same directory.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 05-30-2022 06:00 AM

@texasmfp:

Describing a problem as "modifying X's program to do Y instead of Z" is not an informative topic description. Person X may have written 100's of programs. And once your own topic is marked as solved, future users of this community using problem-specific search terms will have no way to find that solution.

Please tell us what your objective is. I think it's something like split a SAS dataset into multiple SAS datasets based on the value of one or multiple variables. If so, then the solution you propose to tweak might not even be the most suitable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

texasmfp · Posted 05-30-2022 06:21 AM

Thanks. I have changed both the thread title and the goal.

PaigeMiller · Posted 05-30-2022 06:48 AM

The entire underlying logic seems wrong to me.

I'd like to export a large SAS dataset to multiple SAS files using 3 BY groups in the filename. It needs to be flexible since the BY group variables can change from set to set.

Typically, you don't "pre-create" filtered data sets for later use. You filter the original data set when you need it, so if at some point in time, you need the data from a specific year, specific month and specific direction, at this point in time you would go ahead and do one extract from the larger file.

data want;
    set work.dsn1;
    where year=2020 and month=7 and direction='East';
    ...
run;

This seems so much more efficient than creating all of these pre-filtered data sets via some "flexible code".

--
Paige Miller

texasmfp · Posted 05-30-2022 07:12 AM

I do intentionally create discrete datasets (or pre-filtered, as you call them), regardless of whether that is typical. Any help towards my stated goal would be appreciated. Thanks

mkeintz · Posted 05-30-2022 08:54 AM

Since this is intended to create SAS data sets rather than CSV I would suggest an entirely different approach - use a hash object as a means of dynamically naming the output datasets as each by group is processed. The program below applies this approach to dataset HAVE, which is sorted by NAME/GROUP/SUBGROUP. For each such triplet, a dataset will be formed with the name based on the values of the three variables, separated by underscores. The program works in a single pass of the sorted data:

data have;
  set sashelp.class;
  do group='A','B';
    do subgroup=1,2;
      do i=1 to _n_;
        output;
      end;
    end;
  end;
run;

%let dsn=have;
%let byvars=name group subgroup;
%let last_byvar=%scan(&byvars,-1);


data _null_;  
  set &dsn;
  by &byvars;
  if _n_=1 then do;
    declare hash h (dataset:"&dsn (obs=0)",multidata:'Y');
      h.definekey("&last_byvar");
      h.definedata(all:'Y');
      h.definedone();
  end;
  h.add();
  if last.&last_byvar;
  length dsname $32;
  dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));  
  h.output(dataset:dsname);
  h.clear();
run;

The program makes a number of assumptions. Probably the most salient from an operational point of view is

It assumes the first by var has values that always begin with a letter or underscore. Dataset names (unless you use "name literals", must begin with a letter or underscore.
It assumes the total concatenated length of the values of the 3 by variables is no more than 32, including the two "_" separators put in by the catx function. The SAS dataset name can't exceed 32 characters, unlike filenames for CSV files.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

texasmfp · Posted 05-30-2022 09:26 AM

That works well at creating the SAS files, but they are temporary files. How can I modify that to save all of the temporary files to a drive/folder? Almost there. Thanks

mkeintz · Posted 05-30-2022 09:56 AM

@texasmfp wrote:

That works well at creating the SAS files, but they are temporary files. How can I modify that to save all of the temporary files to a drive/folder? Almost there. Thanks

This is a good question to treat as a learning exercise. It will require very little change to generate.

As always with macros or macro variables: determine how this code would look without the macro variables, and then see what changes would be needed to write the output to a non-temporary SAS library instead of WORK. Then you would see where a macrovar for the user-specified libname would need to be placed. You can even get some guidance from Reeza's code that you originally mentioned.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

texasmfp · Posted 05-30-2022 10:48 AM

Oh believe me, I am trying to learn.

Apparently my intended file location is, by itself, more than 32 characters long, so I created a shorter temporary directory (which detracts from the intended automation of this process).

That said the following modification comes close but its missing the .sas ending (Or perhaps I am completely off base)

 dsname=cats("&root1.",%scan(&byvars,1),'_',%scan(&byvars,2),'_',%scan(&byvars,3));

ERROR: The value E:\TEMP\EXPORT_2022_01 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\EXPORT_2022_02 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\EXPORT_2022_03 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\EXPORT_2022_04 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_01 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_02 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_03 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
ERROR: The value E:\TEMP\IMPORT_2022_04 is not a valid SAS name.
ERROR: An error has occurred during instance method OM_OUTPUT(505) of "DATASTEP.HASH".
NOTE: The SAS System stopped processing this step because of errors.

ballardw · Posted 05-30-2022 12:00 PM

If you use any macro variables without showing use how you use/ define them can't do much to debug.

Show what you are providing in the macro variables and then show what you expect to be created.

Better yet, also include the code that did not use any macros or macro variables that worked before attempting to use the macro/ macro variables.

If that DSNAME = statement is used in a data step you very likely do not want/need to use %scan, but the data step SCAN function.

Plus the errors you show are not generated by the code you use but trying to use that variable in another statement somewhere. So you need to show that statement.

mkeintz · Posted 05-30-2022 12:05 PM

The file location can be any length. By "location" I mean the directory to which should assign a libname (which itself is just (up to) an 8-character symbol).

But the name of the sas dataset (i.e. the name of the file located in that library) has to be 32 characters or less

In other words you need something like:

libname mylib "c:\this\is\a\long\directory\name";

Then, instead of

dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));  
  h.output(dataset:dsname);

you could have

  dsname=catx('_',%scan(&byvars,1),%scan(&byvars,2),%scan(&byvars,3));  
  h.output(dataset:cats("mylib.",dsname));

Of course, this assumes that all the sas datasets will go to the same directory.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

texasmfp · Posted 05-30-2022 12:20 PM

Thanks, that tweak solved it.

Reeza · Posted 05-30-2022 12:42 PM

The methodology for creating a text file doesn't extend to creating SAS data sets and to be honest, it usually isn't recommended to split your SAS data sets unless it's for storage reasons for really big data sets.

The method I use to split a data set is here, which isn't really dynamic for multiple by variables as easily as the code above and doesn't uniquely name the file though those are doable, but slightly harder as there's the 32 character limit on the data set.
https://gist.github.com/statgeek/4bfb7574713bedf4e011

Instead of asking to modify the code, can you please explain what you're trying to accomplish? What's your input, what's the expected output?
And it may be a good idea to explain why. Like others have said, BY group processing is very powerful and there's rarely a need to split your data sets in this fashion.

texasmfp · Posted 05-30-2022 11:53 PM

Thanks Reeza. As you said, storage and file size can be issues. Those issues are in play here as, if held in one database, the file size would be around 1.5 terabytes, and growing by about 60-100 gig/year. A file that size comes with logistical issues related to storage/transfer/backup. In addition, simply trying to read/write such a huge file gets very "laggy" even it is just to use the powerful BY statements. Also, on a near daily basis, a small number of records are revised. By creating discrete sub-sets, it is much easier to revise the small number of affected files than it is to recompile a huge dataset. Finally, I have existing workflow (and SAS programs) that rely upon these discrete sub-set files. Much easier to change one file than to revise my entire workflow. Thanks to all that have contributed to this thread.

mkeintz · Posted 05-31-2022 11:13 AM

@texasmfp wrote:

Thanks Reeza. As you said, storage and file size can be issues. Those issues are in play here as, if held in one database, the file size would be around 1.5 terabytes, and growing by about 60-100 gig/year. A file that size comes with logistical issues related to storage/transfer/backup. In addition, simply trying to read/write such a huge file gets very "laggy" even it is just to use the powerful BY statements. Also, on a near daily basis, a small number of records are revised. By creating discrete sub-sets, it is much easier to revise the small number of affected files than it is to recompile a huge dataset. Finally, I have existing workflow (and SAS programs) that rely upon these discrete sub-set files. Much easier to change one file than to revise my entire workflow. Thanks to all that have contributed to this thread.

I have dealt with this very issue - we would receive, on a daily basis all the trades and quotes issued on the major US stock exchanges - over a billion daily records recently (these data have timestamps at the nanosecond level). Actually we do keep separate daily datasets, but they are sorted by stock ticker, but just think of it as an analog to a DATE by-variable in your case.

Now, if your dataset is sorted by date, and your periodic updates either (1) modify existing "historical" records (i.e. no deletions or additions in the middle of the dataset) , or (2) append records for new dates to the end of the dataset, you have a dataset sorted by date, in which the start record and end record for each date remains fixed throughout all subsequent updates.

If so, this is a case where you might benefit from what I call a condensed index (see "Condensed and Sparse Indexes for Sorted SAS® Datasets" from a 2012 NESUG (remember them?) presentation). If you only want a given date, you can use the firstobs and obs dataset name parameters to avoid filtering out all other dates. Those firstobs/obs values for each date (or each date/stock ticker in some of my work) are kept in a (very very) small ancillary file that is trivial to use and trivial to update.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Modifying Reeza's program to export as SAS files rather than csv files

Re: Modifying Reeza's program to export as SAS files rather than csv files

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Re: Exporting SAS dataset to multiple SAS files using BY groups in the filename

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away