Hello,
I have long, repeated data for my participants (sample data on first two participants shown below):
StudyID ScoreDate R_Score M_Score S_Score P_Score count
0001 01/11/2007 1 3 3 3 1
0001 02/01/2007 1 3 3 3 2
0001 05/17/2007 1 3 3 3 3
0001 10/26/2007 1 4 4 3 4
0001 11/06/2007 1 5 5 3 5
0001 11/07/2007 1 3 3 3 6
0001 11/21/2007 1 3 3 3 7
0001 12/31/2007 1 3 3 3 8
0001 03/18/2008 1 3 3 3 9
0002 06/04/2012 2 3 3 2 1
0002 08/08/2012 2 3 3 2 2
0002 09/27/2012 2 3 3 2 3
0002 11/26/2012 2 3 3 2 4
0002 01/29/2013 2 3 3 2 5
0002 03/21/2013 2 3 3 2 6
0002 04/05/2013 2 3 3 2 7
I also have a second database of time periods of interest, specified by a start and end date, as follows (sample data for the first two participants):
StudyID StartDate EndDate
0001 2007-01-01 2007-01-30
0001 2007-01-31 2008-04-30
0001 2008-05-01 2015-01-01
0002 2007-01-01 2012-05-31
0002 2012-06-01 2013-05-23
0002 2013-05-24 2015-01-01
I need to combine these two databases by placing the score data into the correct time periods based on whether the ScoreDate (e.g. 05/17/2007) fits into the time period (e.g. 2007-01-31 to 2008-04-30). If the ScoreDate fits into a certain period, I need to put the ScoreDate and all three corresponding scores into that period. If there are multiple ScoreDates that belong in that time period, then I need all ScoreDates and corresponding scores to be entered (in wide format) into that period (preferably with some type of indicator, e.g. ScoreDate1 R_Score1 M_Score1 S_Score1 P_Score1 and ScoreDate2 R_Score2 M_Score1 S_Score2 P_Score2, so that I know which scores were measured on which date. I also need to be able to later compare the ScoreDate to the time period StartDate and EndDate, so I think ScoreDate needs to stay as a variable). If there are no ScoreDates for a certain period, then the period remains unchanged (with no additional variables).
I think this could be somewhat straightforward using PROC SQL, but I am still learning how to use it and I could really use some guidance. Thank you very much.
I'm not sure if I exactly understand your question, i can see it two ways and if it's one of the ways, this code may work:
PROC SQL;
SELECT tbl_2.StudyID
,tbl_2.StartDate
,tbl_2.EndDate
,tbl_1.ScoreDate
,tbl_1.R_Score
,tbl_1.M_Score
,tbl_1.S_Score
,tbl_1.P_Score
,tbl_1.count
FROM tbl_2
INNER JOIN tbl_1
ON tbl_2.StudyID = tbl_1.StudyID
AND tbl_2.StartDate <= tbl_1.ScoreDate
AND tbl_2.EndDate >= tbl_1.ScoreDate;
;QUIT;
But if by wide you mean you want differing number of columns in the output dataset per row (or more accurately the same arbitrarily large number of columns, many of which will be empty) and only one row in the output dataset per row in tbl_2, then that is more difficult although still possible, in DBs like Oracle there are some easy ways to do this, but not in SAS via a single step that I'm aware of.
The code you provided definitely got me closer to where I want to be. See below:
StudyID StartDate EndDate ScoreDate R_Score M_Score S_Score P_Score count
01 2007-01-01 2007-01-30 01/11/2007 1 3 3 3 1
01 2007-01-31 2008-04-30 02/01/2007 1 3 3 3 2
01 2007-01-31 2008-04-30 05/17/2007 1 3 3 3 3
01 2007-01-31 2008-04-30 10/26/2007 1 4 4 3 4
01 2007-01-31 2008-04-30 11/06/2007 1 5 5 3 5
01 2007-01-31 2008-04-30 11/07/2007 1 3 3 3 6
01 2007-01-31 2008-04-30 11/21/2007 1 3 3 3 7
01 2007-01-31 2008-04-30 12/31/2007 1 3 3 3 8
01 2007-01-31 2008-04-30 03/18/2008 1 3 3 3 9
02 2012-06-01 2013-05-23 06/04/2012 2 3 3 2 1
02 2012-06-01 2013-05-23 08/08/2012 2 3 3 2 2
02 2012-06-01 2013-05-23 09/27/2012 2 3 3 2 3
02 2012-06-01 2013-05-23 11/26/2012 2 3 3 2 4
02 2012-06-01 2013-05-23 01/29/2013 2 3 3 2 5
02 2012-06-01 2013-05-23 03/21/2013 2 3 3 2 6
02 2012-06-01 2013-05-23 04/05/2013 2 3 3 2 7
However, I think I was ultimately thinking of Option #2 (the same arbitrarily large number of columns, many of which will be empty). Do you have recommendations for next steps on how to get there from here?
Thank you!
I don't have any code handy that would do that, but my first thoughts would be to take the output of the code I sent earlier and process it through a data step where you output a separate dataset for each unique value from table 1 (studyID, startDate, EndDate). Each file would only have one variable and you would sequentially output R_Score M_Score S_Score P_Score count to the file. At that point you'll have X files (where X is the number of unique study periods in tbl 1). You could then do a PROC TRANSPOSE on the file to get a single row instead of a huge number of rows. I'm not sure about the best way to go about putting the X rows into 1 dataset at that point though. It may be easiest to export to CSV at that point, combine the files there and then reimport into a dataset (assuming you're not ok with just a CSV as the result).
proc transpose is a good idea. I think you don't even need to create X data sets, just run it directly on the output Sven111 provided.
proc transpose data= out=want ;
by StudyID StartDate EndDate;
id ScoreDate;
var R_Score M_Score S_Score P_Score count;
run;
...or some variant of the above. Look into the doc of proc transpose.
Edit: The above is not a perfect solution, as it will create 5 rows for each by group: one for each variable (R_Score M_Score S_Score P_Score count). If you one to transpose the dataset only because of displaying purposes, you could first put all those 5 variables into a single character variable: vars_char=catx(' ',R_Score,M_Score,S_Score,P_Score,count) and do the traspose after that.
If you need additional processing on the results... In that case I would not do the transose at all 🙂
You did not post the output yet .
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.