Hi all -- I am a clinical researcher (not a trained analyst/statistician), so I'll do my best to be very clear about the problem I am currently having. I will use generic names for datasets and variables for illustrative purposes.
I have 2 data sets (data1 and data2). Both are sorted on study_id.
Data1 - Has multiple records per individual (i.e. study_id #1 has 6 records, representing 6 completed questionnaires -- therefore, "study_id" variable 1 is listed 6x, and "qnum" variable is 1-6). Period of observation: 01-01-07 to 31-12-13
Data2 - Also has multiple records per individual (i.e. study_id #1 has 72 records, representing 72 medications he/she has been prescribed). The "quest" variable is NOT included in Data2. Period of observation: 01-01-85 to 31-12-13. However, if a participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has only 1 questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1 when I attempted to merge the data below).
Goal: A merged data set that has one record per "questionnaire" (similar to Data1), and a new variable ("newvar") based on the prescribing of a drug reported in data2.
To be more clear on "newvar": In Data1, there is a questionnaire interview date ("interviewdt") and in Data2 there is a prescription date ("medstartdate"). This "newvar" should be binary, such that IF a participant has started a particular prescription (represented by a particular code)(Data2) on or before the interview date (Data1), then "newvar"=1 (i.e. yes), ELSE, "newvar"=0 (i.e. no).
Here is what I've done :
data Data3 ;
merge Data1 (keep=study_id qnum othervariables in=in_data1 where=(interviewdt^=.))
Data2 (in=in_data2 where (medstartdate^=. and (prescriptioncode="A" or prescriptioncode="B"))) ;
by study_id ;
if in_data1 and in_dadta2 ;
dtdiff=interviewdt - medstartdate ; /*ONLY want "interviewdt" where qnum=1 (i.e. first questionnaire/baseline) - I think this is part of my problem; how to only select qnum=1?*/
MedStartDtDiff = abs(dtdiff) ;
if dtdiff^=. then MedStartDtDiff=(-10220<=(interviewdt - medstartdate)<=10220) ; /*10220 represents the maximum number of days from the last day of observation, 31-12-13, to the first day, 01-01-85)*/
format MedStartDtDiff yesno. ;
run ;
proc sort data=Data3 (where =(MedStartDtDiff^=.)) ;
by study_id MedStartDtDiff ;
where qnum=1 ;
run ;
proc print=Data3 (obs=30) ; var study_id MedStartDtDiff qnum ; where qnum=1 ; run;
SO this is what I get as an output -->
study_id MedStartDtDiff MedStartDate qnum
1 YES 01NOV1996 1 --> This is correct; study_id1 has 6 questionnaires (with qnum=1 when interviewdt=14Feb2008 AND started med BEFORE interview date; therefore, MedStartDt is YES)
2 YES 01JUN1999 1
2 YES 01JUN1999 1
2 YES 07NOV2006 1 --> This is NOT correct; study_id2 should only have a single entry. So this is my major issue! Why am I getting multiple study_id for each individuals?
To further elaborate on study_id2 -- this participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has basline/only questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1). I really only want one line per study_id.
What I'd like to see:
study_id MedStartDtDiff MedStartDate qnum InterviewDt
1 YES 01NOV1996 1 14FEB2008
2 YES 01JUN1999 1 6JUN2010
3 YES 15JUN1989 1 20AUG2008
4 NO 07NOV2010 1 11NOV2011 etc etc....
I am happy to elaborate further or clarify if this isn't 100% clear.
Hi,
Sorry, couldn't really follow that, perhaps post a datastep for each of your inputs with some test data, and an example of what you want out. Looking at the last bit I would estimate going something like - (assuming latest date in one):
proc sql;
create table WANT as
select A.STUDY_ID,
... as MEDSTARTDTDIFF,
A.MEDSTARTDATE,
... as QNUM,
MAX(B.INTERVIEWDT) as INTERVIEWDT
from (select distinct STUDY_ID,... from DATA1) A
left join DATA2 B
on A.STUDY_ID=B.STUDY_ID
and B.INTERVIEWDT < A.MEDSTARTDATE
group by STUDYID;
quit;
Not tested the above obviously, as I don't have any test data.
Unfortunately, I don't have any input datasteps; I was provided with two separate data files and need to merge them together (as aforementioned). I'll give your code a try and let you know what happens.
Yep, just make something up that looks like your data so I have something to work with.
Here is an example:
(1) Data1
data Data1 ;
input study_id 1-2 qnum 1-2 interviewdt date9 ; /*where study_id=participant, qnum=questionnaire, interviewdt=date of interview*/
datalines ;
1 1 14FEB2008
1 2 12FEB2009
1 3 01MAR2010
2 1 04SEPT2012
3 1 19MAR2008
3 2 12OCT2010
3 3 17NOV2011
;
run ;
(2) Data2
data Data2 ;
input study_id 1-2 medstartdate date9 atc_code $10 drug_class $6 ; /*where medstartdate=date that prescription was started, atc_code=medication code, drug_class=type of drug*/
datalines ;
1 01MAY1996 ABC /*Note: this value for drug_class is missing because it is a non-relevant drug for my study*/
1 01NOV1996 CBA A
1 01NOV1996 CBB B
1 01NOV1996 CBC A
2 31MAR1999 ABC
2 01JUN1999 CBD C
2 01JUN1999 CBE A
2 01MAY2003 ABC
3 17FEB1999 CBA A
3 17FEB1999 CBA A
3 17FEB1999 CBB B
3 01MAR2000 ABC
;
run ;
--> What I would like: As above, a table that only shows one record per participant (study_id) at baseline questionnaire (where qnum=1). I am ONLY interested in those participants who have
an outcome for drug_class (i.e. those with non-relevant atc_codes should not be included in the calculation of a new variable; as I am only interested in those with drug_class reported).
What I am TRYING to do, is estimate the proportion of the population who is medication naive .vs. medication experienced at his/her first questionnaire (qnum=1).
Well, could try something along these lines:
data Data1 ;
infile datalines dlm=",";
informat interviewdt date9.;
format interviewdt date9.;
input study_id qnum interviewdt;
datalines ;
1,1,14FEB2008
1,2,12FEB2009
1,3,01MAR2010
2,1,04SEP2012
3,1,19MAR2008
3,2,12OCT2010
3,3,17NOV2011
;
run;
data Data2 ;
infile datalines dlm=",";
informat medstartdate date9.;
format medstartdate date9.;
input study_id medstartdate atc_code $ drug_class $;
datalines ;
1,01MAY1996,ABC,
1,01NOV1996,CBA,A
1,01NOV1996,CBB,B
1,01NOV1996,CBC,A
2,31MAR1999,ABC,
2,01JUN1999,CBD,C
2,01JUN1999,CBE,A
2,01MAY2003,ABC,
3,17FEB1999,CBA,A
3,17FEB1999,CBA,A
3,17FEB1999,CBB,B
3,01MAR2000,ABC,
;
run ;
proc sql;
create table WANT as
select distinct
A.STUDY_ID,
A.INTERVIEWDT
from (select * from DATA1 where QNUM=1) A
where exists(select distinct STUDY_ID from DATA2 where STUDY_ID=A.STUDY_ID and DRUG_CLASS is not null);
quit;
You didn't post the output you need according to your sample data yet ?
If I understood what you want.
data Data1 ; infile datalines dlm=","; informat interviewdt date9.; format interviewdt date9.; input study_id qnum interviewdt; datalines ; 1,1,14FEB2008 1,2,12FEB2009 1,3,01MAR2010 2,1,04SEP2012 3,1,19MAR2008 3,2,12OCT2010 3,3,17NOV2011 ; run; data Data2 ; infile datalines dlm=","; informat medstartdate date9.; format medstartdate date9.; input study_id medstartdate atc_code $ drug_class $; datalines ; 1,01MAY1996,ABC, 1,01NOV1996,CBA,A 1,01NOV1996,CBB,B 1,01NOV1996,CBC,A 2,31MAR1999,ABC, 2,01JUN1999,CBD,C 2,01JUN1999,CBE,A 2,01MAY2003,ABC, 3,17FEB1999,CBA,A 3,17FEB1999,CBA,A 3,17FEB1999,CBB,B 3,01MAR2000,ABC, ; run ; proc sql; create table want as select a.*,case when exists(select * from data2 where study_id=a.study_id and drug_class is not missing and medstartdate lt a.interviewdt) then 'YES' else 'NO ' end as MedStartDtDiff from data1 as a where qnum=1; quit;
Xia Keshan
Message was edited by: xia keshan Fix a problem .
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.