Hi all -- I am a clinical researcher (not a trained analyst/statistician), so I'll do my best to be very clear about the problem I am currently having. I will use generic names for datasets and variables for illustrative purposes. I have 2 data sets (data1 and data2). Both are sorted on study_id. Data1 - Has multiple records per individual (i.e. study_id #1 has 6 records, representing 6 completed questionnaires -- therefore, "study_id" variable 1 is listed 6x, and "qnum" variable is 1-6). Period of observation: 01-01-07 to 31-12-13 Data2 - Also has multiple records per individual (i.e. study_id #1 has 72 records, representing 72 medications he/she has been prescribed). The "quest" variable is NOT included in Data2. Period of observation: 01-01-85 to 31-12-13. However, if a participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has only 1 questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1 when I attempted to merge the data below). Goal: A merged data set that has one record per "questionnaire" (similar to Data1), and a new variable ("newvar") based on the prescribing of a drug reported in data2. To be more clear on "newvar": In Data1, there is a questionnaire interview date ("interviewdt") and in Data2 there is a prescription date ("medstartdate"). This "newvar" should be binary, such that IF a participant has started a particular prescription (represented by a particular code)(Data2) on or before the interview date (Data1), then "newvar"=1 (i.e. yes), ELSE, "newvar"=0 (i.e. no). Here is what I've done : data Data3 ; merge Data1 (keep=study_id qnum othervariables in=in_data1 where=(interviewdt^=.)) Data2 (in=in_data2 where (medstartdate^=. and (prescriptioncode="A" or prescriptioncode="B"))) ; by study_id ; if in_data1 and in_dadta2 ; dtdiff=interviewdt - medstartdate ; /*ONLY want "interviewdt" where qnum=1 (i.e. first questionnaire/baseline) - I think this is part of my problem; how to only select qnum=1?*/ MedStartDtDiff = abs(dtdiff) ; if dtdiff^=. then MedStartDtDiff=(-10220<=(interviewdt - medstartdate)<=10220) ; /*10220 represents the maximum number of days from the last day of observation, 31-12-13, to the first day, 01-01-85)*/ format MedStartDtDiff yesno. ; run ; proc sort data=Data3 (where =(MedStartDtDiff^=.)) ; by study_id MedStartDtDiff ; where qnum=1 ; run ; proc print=Data3 (obs=30) ; var study_id MedStartDtDiff qnum ; where qnum=1 ; run; SO this is what I get as an output --> study_id MedStartDtDiff MedStartDate qnum 1 YES 01NOV1996 1 --> This is correct; study_id1 has 6 questionnaires (with qnum=1 when interviewdt=14Feb2008 AND started med BEFORE interview date; therefore, MedStartDt is YES) 2 YES 01JUN1999 1 2 YES 01JUN1999 1 2 YES 07NOV2006 1 --> This is NOT correct; study_id2 should only have a single entry. So this is my major issue! Why am I getting multiple study_id for each individuals? To further elaborate on study_id2 -- this participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has basline/only questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1). I really only want one line per study_id. What I'd like to see: study_id MedStartDtDiff MedStartDate qnum InterviewDt 1 YES 01NOV1996 1 14FEB2008 2 YES 01JUN1999 1 6JUN2010 3 YES 15JUN1989 1 20AUG2008 4 NO 07NOV2010 1 11NOV2011 etc etc.... I am happy to elaborate further or clarify if this isn't 100% clear.
... View more