Relatively new to SAS so I'll need specifics. I am using SAS University, in case that makes any difference.
I have a lot of large data sets I am working with. I would like to do Survival analysis with them. To my understanding, I can use the following to get unique ID's with the latest date:
PROC SORT LIBRARY.DATA;
BY ID DATE;
RUN;
DATA LIBRARY.DATA_DATES (keep = ID DATE);
SET LIBRARY.DATA;
BY ID;
ENDDATE = MAX(DATE);
STARTDATE = MIN(DATE); *only for 2006 data;
RUN;
Please let me know if the code is wrong. My dataset is extremely large, so I won't be able to manually check so easily.
Now I want to merge the datasets (Years 2006-2013). When merging, I still want only the newest date, but I only want to keep IDs that are present in 2006 (start period). If I left merge, will the 2006 dates override the 2007-2013 dates? My other thought is to label the MAX DATE for each year (i.e. in own column) and then a separate column from that with the new MAX DATE. However, this doesn't sound the most efficient. Is there a better way to code for this?
The end goal with the dates is to use them for follow-up time. That is,
FOLLOWUP = MAX DATE - MIN 2006 DATE
I appreciate any help or feedback.
proc sort data=library.data; /* PROC SORT statement needs a DATA= option */
by id date;
run;
data library.data_dates;
set library.data (keep = id date);
by id;
format startdate enddate yymmdd10.;
retain startdate;
if first.id then startdate = date;
if last.id; /* will output only the last observation per id group */
enddate = date;
drop date;
run;
No need to shout at SAS, it will work when being talked to in normal voice.
In PROC SQL, it looks like this:
proc sql;
create table library.data_dates as
select
id,
min(date) as startdate format=yymmdd10.
max(date) as enddate format = yymmdd10.
from library.dates
group by id
;
quit;
Assuming your data are already sorted by id/date, then:
data want;
merge library.data_dates (where=(Year(date)=2006) in=in2006)
library.data_dates ;
by id;
if in2006=1 and last.id=1;
run;
Let's say there are 5 obs from 2006 and 20 obs overall for a given id. The first 4 obs from 2006 will be paired with the first 4 obs overall. The last obs from 2006 will be paired with ALL the remaining overall obs. So for the last obs, the IN2006 dummy will still be on, and the data will be output.
Also, when I say "paired" please note that the data values from 2006 will always be overwritten by the equivalent data variables from the overall - because the overall data set is to the right of the 2006 data subset in the merge statement. Only if there were certain variables dropped in the second data set reference would that variable value from the first reference prevail in the output.
proc sort data=library.data; /* PROC SORT statement needs a DATA= option */
by id date;
run;
data library.data_dates;
set library.data (keep = id date);
by id;
format startdate enddate yymmdd10.;
retain startdate;
if first.id then startdate = date;
if last.id; /* will output only the last observation per id group */
enddate = date;
drop date;
run;
No need to shout at SAS, it will work when being talked to in normal voice.
In PROC SQL, it looks like this:
proc sql;
create table library.data_dates as
select
id,
min(date) as startdate format=yymmdd10.
max(date) as enddate format = yymmdd10.
from library.dates
group by id
;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.