Hi all -- I am a clinical researcher (not a trained analyst/statistician), so I'll do my best to be very clear about the problem I am currently having. I will use generic names for datasets and variables for illustrative purposes.
I have 2 data sets (data1 and data2). Both are sorted on study_id.
Data1 - Has multiple records per individual (i.e. study_id #1 has 6 records, representing 6 completed questionnaires -- therefore, "study_id" variable 1 is listed 6x, and "qnum" variable is 1-6). Period of observation: 01-01-07 to 31-12-13
Data2 - Also has multiple records per individual (i.e. study_id #1 has 72 records, representing 72 medications he/she has been prescribed). The "quest" variable is NOT included in Data2. Period of observation: 01-01-85 to 31-12-13. However, if a participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has only 1 questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1 when I attempted to merge the data below).
Goal: A merged data set that has one record per "questionnaire" (similar to Data1), and a new variable ("newvar") based on the prescribing of a drug reported in data2.
To be more clear on "newvar": In Data1, there is a questionnaire interview date ("interviewdt") and in Data2 there is a prescription date ("medstartdate"). This "newvar" should be binary, such that IF a participant has started a particular prescription (represented by a particular code)(Data2) on or before the interview date (Data1), then "newvar"=1 (i.e. yes), ELSE, "newvar"=0 (i.e. no).
Here is what I've done :
data Data3 ;
merge Data1 (keep=study_id qnum othervariables in=in_data1 where=(interviewdt^=.))
Data2 (in=in_data2 where (medstartdate^=. and (prescriptioncode="A" or prescriptioncode="B"))) ;
by study_id ;
if in_data1 and in_dadta2 ;
dtdiff=interviewdt - medstartdate ; /*ONLY want "interviewdt" where qnum=1 (i.e. first questionnaire/baseline) - I think this is part of my problem; how to only select qnum=1?*/
MedStartDtDiff = abs(dtdiff) ;
if dtdiff^=. then MedStartDtDiff=(-10220<=(interviewdt - medstartdate)<=10220) ; /*10220 represents the maximum number of days from the last day of observation, 31-12-13, to the first day, 01-01-85)*/
format MedStartDtDiff yesno. ;
run ;
proc sort data=Data3 (where =(MedStartDtDiff^=.)) ;
by study_id MedStartDtDiff ;
where qnum=1 ;
run ;
proc print=Data3 (obs=30) ; var study_id MedStartDtDiff qnum ; where qnum=1 ; run;
SO this is what I get as an output -->
study_id MedStartDtDiff MedStartDate qnum
1 YES 01NOV1996 1 --> This is correct; study_id1 has 6 questionnaires (with qnum=1 when interviewdt=14Feb2008 AND started med BEFORE interview date; therefore, MedStartDt is YES)
2 YES 01JUN1999 1
2 YES 01JUN1999 1
2 YES 07NOV2006 1 --> This is NOT correct; study_id2 should only have a single entry. So this is my major issue! Why am I getting multiple study_id for each individuals?
To further elaborate on study_id2 -- this participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has basline/only questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1). I really only want one line per study_id.
What I'd like to see:
study_id MedStartDtDiff MedStartDate qnum InterviewDt
1 YES 01NOV1996 1 14FEB2008
2 YES 01JUN1999 1 6JUN2010
3 YES 15JUN1989 1 20AUG2008
4 NO 07NOV2010 1 11NOV2011 etc etc....
I am happy to elaborate further or clarify if this isn't 100% clear.
Hi,
Sorry, couldn't really follow that, perhaps post a datastep for each of your inputs with some test data, and an example of what you want out. Looking at the last bit I would estimate going something like - (assuming latest date in one):
proc sql;
create table WANT as
select A.STUDY_ID,
... as MEDSTARTDTDIFF,
A.MEDSTARTDATE,
... as QNUM,
MAX(B.INTERVIEWDT) as INTERVIEWDT
from (select distinct STUDY_ID,... from DATA1) A
left join DATA2 B
on A.STUDY_ID=B.STUDY_ID
and B.INTERVIEWDT < A.MEDSTARTDATE
group by STUDYID;
quit;
Not tested the above obviously, as I don't have any test data.
Unfortunately, I don't have any input datasteps; I was provided with two separate data files and need to merge them together (as aforementioned). I'll give your code a try and let you know what happens.
Yep, just make something up that looks like your data so I have something to work with.
Here is an example:
(1) Data1
data Data1 ;
input study_id 1-2 qnum 1-2 interviewdt date9 ; /*where study_id=participant, qnum=questionnaire, interviewdt=date of interview*/
datalines ;
1 1 14FEB2008
1 2 12FEB2009
1 3 01MAR2010
2 1 04SEPT2012
3 1 19MAR2008
3 2 12OCT2010
3 3 17NOV2011
;
run ;
(2) Data2
data Data2 ;
input study_id 1-2 medstartdate date9 atc_code $10 drug_class $6 ; /*where medstartdate=date that prescription was started, atc_code=medication code, drug_class=type of drug*/
datalines ;
1 01MAY1996 ABC /*Note: this value for drug_class is missing because it is a non-relevant drug for my study*/
1 01NOV1996 CBA A
1 01NOV1996 CBB B
1 01NOV1996 CBC A
2 31MAR1999 ABC
2 01JUN1999 CBD C
2 01JUN1999 CBE A
2 01MAY2003 ABC
3 17FEB1999 CBA A
3 17FEB1999 CBA A
3 17FEB1999 CBB B
3 01MAR2000 ABC
;
run ;
--> What I would like: As above, a table that only shows one record per participant (study_id) at baseline questionnaire (where qnum=1). I am ONLY interested in those participants who have
an outcome for drug_class (i.e. those with non-relevant atc_codes should not be included in the calculation of a new variable; as I am only interested in those with drug_class reported).
What I am TRYING to do, is estimate the proportion of the population who is medication naive .vs. medication experienced at his/her first questionnaire (qnum=1).
Well, could try something along these lines:
data Data1 ;
infile datalines dlm=",";
informat interviewdt date9.;
format interviewdt date9.;
input study_id qnum interviewdt;
datalines ;
1,1,14FEB2008
1,2,12FEB2009
1,3,01MAR2010
2,1,04SEP2012
3,1,19MAR2008
3,2,12OCT2010
3,3,17NOV2011
;
run;
data Data2 ;
infile datalines dlm=",";
informat medstartdate date9.;
format medstartdate date9.;
input study_id medstartdate atc_code $ drug_class $;
datalines ;
1,01MAY1996,ABC,
1,01NOV1996,CBA,A
1,01NOV1996,CBB,B
1,01NOV1996,CBC,A
2,31MAR1999,ABC,
2,01JUN1999,CBD,C
2,01JUN1999,CBE,A
2,01MAY2003,ABC,
3,17FEB1999,CBA,A
3,17FEB1999,CBA,A
3,17FEB1999,CBB,B
3,01MAR2000,ABC,
;
run ;
proc sql;
create table WANT as
select distinct
A.STUDY_ID,
A.INTERVIEWDT
from (select * from DATA1 where QNUM=1) A
where exists(select distinct STUDY_ID from DATA2 where STUDY_ID=A.STUDY_ID and DRUG_CLASS is not null);
quit;
You didn't post the output you need according to your sample data yet ?
If I understood what you want.
data Data1 ; infile datalines dlm=","; informat interviewdt date9.; format interviewdt date9.; input study_id qnum interviewdt; datalines ; 1,1,14FEB2008 1,2,12FEB2009 1,3,01MAR2010 2,1,04SEP2012 3,1,19MAR2008 3,2,12OCT2010 3,3,17NOV2011 ; run; data Data2 ; infile datalines dlm=","; informat medstartdate date9.; format medstartdate date9.; input study_id medstartdate atc_code $ drug_class $; datalines ; 1,01MAY1996,ABC, 1,01NOV1996,CBA,A 1,01NOV1996,CBB,B 1,01NOV1996,CBC,A 2,31MAR1999,ABC, 2,01JUN1999,CBD,C 2,01JUN1999,CBE,A 2,01MAY2003,ABC, 3,17FEB1999,CBA,A 3,17FEB1999,CBA,A 3,17FEB1999,CBB,B 3,01MAR2000,ABC, ; run ; proc sql; create table want as select a.*,case when exists(select * from data2 where study_id=a.study_id and drug_class is not missing and medstartdate lt a.interviewdt) then 'YES' else 'NO ' end as MedStartDtDiff from data1 as a where qnum=1; quit;
Xia Keshan
Message was edited by: xia keshan Fix a problem .
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.