I have a large dataset with rows of IDs and dates as below
ID testdate
1 11/1/16
1 6/4/17
1 8/1/17
2 6/4/16
2 7/5/16
2 10/1/16
In the actual dataset, there are a lot more rows per person than I've shown above. I would like to pick the most recent two test dates for each person, and then create a new variable with the difference in days between those two dates.
I would really appreciate anyone's help!
Thank you very much!
KM
like this?
data have;
input ID$ testdate:ddmmyy10.;
format testdate ddmmyy10.;
datalines;
1 11/01/16
1 06/04/17
1 08/01/17
2 06/04/16
2 07/05/16
2 10/01/16
;
proc sort data=have;
by ID testdate;
run;
data want;
set have;
by ID;
days_between=intck('day',lag1(testdate), testdate);
if last.id;
run;
like this?
data have;
input ID$ testdate:ddmmyy10.;
format testdate ddmmyy10.;
datalines;
1 11/01/16
1 06/04/17
1 08/01/17
2 06/04/16
2 07/05/16
2 10/01/16
;
proc sort data=have;
by ID testdate;
run;
data want;
set have;
by ID;
days_between=intck('day',lag1(testdate), testdate);
if last.id;
run;
Or as a variant to @PeterClemmensens code
data want;
set have;
by ID;
days_between=dif(testdate);
if last.id;
run;
Thank you very much for your response!
I ran the code. It seems to be working for ID 2 but not for person ID 1.
The difference between 8/1/17 and 6/04/17 is 58 days but the code seems to be giving 88 days.
I was wondering if you are getting the same thing and if so how would one correct it.
Also, on a related topic, I was wondering if you could advise me on how to first subset the dataset so that the new dataset contains only the two most recent test dates per person.
Thank you very much!
"The difference between 8/1/17 and 6/04/17 is 58 days but the code seems to be giving 88 days."
The code does the right thing. The explanation is that the dates are read into SAS using DDMMYYY and NOT US style MMDDYY. Change the informat in the sample code for reading the raw data and you'll get the expected result.
"so that the new dataset contains only the two most recent test dates per person."
Please provide a sample how the desired result should look like especially in regards of the values for column days_between.
I see. Thank you! I would like the final dataset to look as shown below. If can first subset the dataset so that it only includes the two most recent test dates, then the next step of calculating days_between might use less memory, as my dataset is large
ID testdate days_between
1 6/4/17 88
1 8/1/17 88
2 7/5/16 31
2 10/1/16 31
I was wondering if you could advise me on an additional data step that would allow me to subset the dataset as I have described.
Thank you very much for your help!
Assuming I understood what you mean. data have; input ID$ testdate:mmddyy10.; format testdate mmddyy10.; datalines; 1 11/01/16 1 06/04/17 1 08/01/17 2 06/04/16 2 07/05/16 2 10/01/16 ; proc sort data=have; by ID descending testdate ; run; data want; set have; by id; if first.id then n=0; n+1; if n<3 then output; drop n; run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.