@habeebao wrote: A join through proc SQL allows you to join on a variable. I've found it's more efficient when I'm creating a table with shared observations; whereas a set will stack my tables, not necessarily joining them. It's helpful if my information is already formatted for existing variables.
SET does a lot more than stack.
Consider two datasets QTR (4 per year) and YEAR, each sorted by ID/DATE. And you want to create a new quarterly dataset with the current QTR record and the most recent YEAR record. The problem is that QTR has 4 distinct dates, but YEAR has only one date, so joining on the variable DATE is not so simple.
You might use
proc sql noprint;
create table yrqtr as
select *
from qtr
left join
year (rename=(date=yr_date))
on year.id=qtr.id
and yr_date<=date<intnx('year',yr_date,1,'s')
order by id,date;
quit;
which is not too complex. But perhaps you also want to join a MONTH data set (12 dates per year). And maybe a YEAR record is missing, so you want to carry it forward until the next YEAR record appears.
With conditional SET statements, and sorted (by ID/DATE) datasets, this is trivial in a DATA step with SET statements:
data YQM ;
merge YEAR (in=iny keep=id date)
QTR (in=inq keep=id date)
MONTH (in=inm keep=id date);
by id date;
if iny then set YEAR (rename=(date=YR_date));
if inq then set QTR (rename=(date=QTR_date));
if inm then set MONTH (rename=(date=MON_date));
if inm then output;
if last.id then call missing(of _all_);
run;
This simple code produces one record per month, containing the current month, most recent quarter, and most recent year. It is not just a stacking of the three datasets. It is a (complex) join based on ID and DATE.
I suspect this is even more beneficial when you don't have intervals nested within each other (month within qtr within year). Say you have three event-driven data sets (admission/discharge, services, and tests), each sorted by ID/DATE. Consider the SQL code you would need to generate, for each event, no matter what type, a record with the most recent data from each of the three data sets.
data event_carried_forward;
merge admdis (in=inA keep=id date)
srvcs (in=inS keep=id date)
tests (in=inT keep=id date);
by id date;
if inA then set admdis (rename=(date=date_A));
if inS then set srvcs (rename=(date=date_S));
if inT then set tests (rename=(date=date_T));
output; /* Output a record for every event. No IF test needed*/
if last.id then call missing (of _all_);
run;
... View more