Hello.
I have multiple datasets whose name has the following format: data_yyyymm. For example, the names of the datasets are data_202211, data_202212, data_202301, and so on. They all have same variables with year and month indicators in each dataset. I want to append these data so that I would have one master dataset that have all these datasets. I was thinking of using the following commands to work:
data masterdataset; set data_202211; do i = 202212 to 202307; some kinds of commands for append end; run;
I was not sure how I could make this code work so that I can have one master dataset.
Could anyone give me some suggestion? Thank you in advance for your time and help!
What about shortcut lists instead?
data master;
set data_202211-data_202212 data_2023: ;
run;
What about shortcut lists instead?
data master;
set data_202211-data_202212 data_2023: ;
run;
That is an amazing alternative and it worked perfectly!
Thank you very much for your help!
@Reeza's reply is compact and effective:
What about shortcut lists instead?
data master; set data_202211-data_202212 data_2023: ; run;
But if this compact SET statement implies reading of a large number of datasets, just be aware that computer memory is allocated to build a buffer for each dataset.
I have dealt with this situation when working with daily TAQ (trade and/or quote) datasets from the us stock market, with names constructed with the pattern CTyyyymmdd. If one were reading all the data for, say, June and July (i.e. SET CT202306: CT202307:), it would require 42 such buffers - one for each trading day. And you can be sure many researchers may not want to stop at analyzing two months of data.
The upshot is that this can become costly to performance.
In certain cases, there is an escape:
If all those datasets had the same variables, then use of the statement
SET CT202306: CT202307: open=defer;
helps a lot. The "open=defer" option tells SAS to defer opening the second dataset until the first is fully read. and apply the same principle to each of the datasets. That is, it tells SAS to re-use the same buffer for each of the 42 datasets.
Even if they have different variables, there is a trivial workaround: create an empty dummy dataset with all the variables (but no observations), and then list it first in the SET statement:
data dummy;
set ct202306: ct202307: ;
stop;
run;
data new;
set dummy ct202306: ct202307: open=defer;
....
run;
BUT (you knew another "but" was coming, right?) it still doesn't allow OPEN=DEFER in the presence of a BY statement. Consider this code (assume each dataset has one DATE value, but is sorted by STOCK_TICKER),
set CT202306: CT202307: open=defer;
by stock_ticker date;
This can't reduce the number of buffers, because SAS has to have all 42 datasets open in order to produce the interleaved sequencing demanded by the BY statement. In fact, it won't run at all, but will generate
ERROR: OPEN=DEFER is incompatible with the BY statement.
NOTE: The SAS System stopped processing this step because of errors.
in the log.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.