I have multiple datasets whose name has the following format: data_yyyymm. For example, the names of the datasets are data_202211, data_202212, data_202301, and so on. They all have same variables with year and month indicators in each dataset. I want to append these data so that I would have one master dataset that have all these datasets. I was thinking of using the following commands to work:
data masterdataset; set data_202211; do i = 202212 to 202307; some kinds of commands for append end; run;
I was not sure how I could make this code work so that I can have one master dataset.
Could anyone give me some suggestion? Thank you in advance for your time and help!
@Reeza's reply is compact and effective:
What about shortcut lists instead?
data master; set data_202211-data_202212 data_2023: ; run;
But if this compact SET statement implies reading of a large number of datasets, just be aware that computer memory is allocated to build a buffer for each dataset.
I have dealt with this situation when working with daily TAQ (trade and/or quote) datasets from the us stock market, with names constructed with the pattern CTyyyymmdd. If one were reading all the data for, say, June and July (i.e. SET CT202306: CT202307:), it would require 42 such buffers - one for each trading day. And you can be sure many researchers may not want to stop at analyzing two months of data.
The upshot is that this can become costly to performance.
In certain cases, there is an escape:
If all those datasets had the same variables, then use of the statement
SET CT202306: CT202307: open=defer;
helps a lot. The "open=defer" option tells SAS to defer opening the second dataset until the first is fully read. and apply the same principle to each of the datasets. That is, it tells SAS to re-use the same buffer for each of the 42 datasets.
Even if they have different variables, there is a trivial workaround: create an empty dummy dataset with all the variables (but no observations), and then list it first in the SET statement:
data dummy; set ct202306: ct202307: ; stop; run; data new; set dummy ct202306: ct202307: open=defer; .... run;
BUT (you knew another "but" was coming, right?) it still doesn't allow OPEN=DEFER in the presence of a BY statement. Consider this code (assume each dataset has one DATE value, but is sorted by STOCK_TICKER),
set CT202306: CT202307: open=defer; by stock_ticker date;
This can't reduce the number of buffers, because SAS has to have all 42 datasets open in order to produce the interleaved sequencing demanded by the BY statement. In fact, it won't run at all, but will generate
ERROR: OPEN=DEFER is incompatible with the BY statement.
NOTE: The SAS System stopped processing this step because of errors.
in the log.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.