Merge 2 data sets results in multiple observations per participant

toneill · Posted 03-10-2015 11:53 AM

Hi all -- I am a clinical researcher (not a trained analyst/statistician), so I'll do my best to be very clear about the problem I am currently having. I will use generic names for datasets and variables for illustrative purposes.

I have 2 data sets (data1 and data2). Both are sorted on study_id.

Data1 - Has multiple records per individual (i.e. study_id #1 has 6 records, representing 6 completed questionnaires -- therefore, "study_id" variable 1 is listed 6x, and "qnum" variable is 1-6). Period of observation: 01-01-07 to 31-12-13

Data2 - Also has multiple records per individual (i.e. study_id #1 has 72 records, representing 72 medications he/she has been prescribed). The "quest" variable is NOT included in Data2. Period of observation: 01-01-85 to 31-12-13. However, if a participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has only 1 questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1 when I attempted to merge the data below).

Goal: A merged data set that has one record per "questionnaire" (similar to Data1), and a new variable ("newvar") based on the prescribing of a drug reported in data2.

To be more clear on "newvar": In Data1, there is a questionnaire interview date ("interviewdt") and in Data2 there is a prescription date ("medstartdate"). This "newvar" should be binary, such that IF a participant has started a particular prescription (represented by a particular code)(Data2) on or before the interview date (Data1), then "newvar"=1 (i.e. yes), ELSE, "newvar"=0 (i.e. no).

Here is what I've done :

data Data3 ;

merge Data1 (keep=study_id qnum othervariables in=in_data1 where=(interviewdt^=.))

Data2 (in=in_data2 where (medstartdate^=. and (prescriptioncode="A" or prescriptioncode="B"))) ;

by study_id ;

if in_data1 and in_dadta2 ;

dtdiff=interviewdt - medstartdate ; /*ONLY want "interviewdt" where qnum=1 (i.e. first questionnaire/baseline) - I think this is part of my problem; how to only select qnum=1?*/

MedStartDtDiff = abs(dtdiff) ;

if dtdiff^=. then MedStartDtDiff=(-10220<=(interviewdt - medstartdate)<=10220) ; /*10220 represents the maximum number of days from the last day of observation, 31-12-13, to the first day, 01-01-85)*/

format MedStartDtDiff yesno. ;

run ;

proc sort data=Data3 (where =(MedStartDtDiff^=.)) ;

by study_id MedStartDtDiff ;

where qnum=1 ;

run ;

proc print=Data3 (obs=30) ; var study_id MedStartDtDiff qnum ; where qnum=1 ; run;

SO this is what I get as an output -->

study_id MedStartDtDiff MedStartDate qnum

1 YES 01NOV1996 1 --> This is correct; study_id1 has 6 questionnaires (with qnum=1 when interviewdt=14Feb2008 AND started med BEFORE interview date; therefore, MedStartDt is YES)

2 YES 01JUN1999 1

2 YES 07NOV2006 1 --> This is NOT correct; study_id2 should only have a single entry. So this is my major issue! Why am I getting multiple study_id for each individuals?

To further elaborate on study_id2 -- this participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has basline/only questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1). I really only want one line per study_id.

What I'd like to see:

study_id MedStartDtDiff MedStartDate qnum InterviewDt

1 YES 01NOV1996 1 14FEB2008

2 YES 01JUN1999 1 6JUN2010

3 YES 15JUN1989 1 20AUG2008

4 NO 07NOV2010 1 11NOV2011 etc etc....

I am happy to elaborate further or clarify if this isn't 100% clear.

RW9 · Posted 03-10-2015 12:16 PM

Hi,

Sorry, couldn't really follow that, perhaps post a datastep for each of your inputs with some test data, and an example of what you want out. Looking at the last bit I would estimate going something like - (assuming latest date in one):

proc sql;

create table WANT as

select A.STUDY_ID,

... as MEDSTARTDTDIFF,

A.MEDSTARTDATE,

... as QNUM,

MAX(B.INTERVIEWDT) as INTERVIEWDT

from (select distinct STUDY_ID,... from DATA1) A

left join DATA2 B

on A.STUDY_ID=B.STUDY_ID

and B.INTERVIEWDT < A.MEDSTARTDATE

group by STUDYID;

quit;

Not tested the above obviously, as I don't have any test data.

toneill · Posted 03-10-2015 01:22 PM

Unfortunately, I don't have any input datasteps; I was provided with two separate data files and need to merge them together (as aforementioned). I'll give your code a try and let you know what happens.

RW9 · Posted 03-10-2015 01:55 PM

Yep, just make something up that looks like your data so I have something to work with.

toneill · Posted 03-10-2015 02:27 PM

Here is an example:

(1) Data1

data Data1 ;

input study_id 1-2 qnum 1-2 interviewdt date9 ; /*where study_id=participant, qnum=questionnaire, interviewdt=date of interview*/

datalines ;

1 1 14FEB2008

1 2 12FEB2009

1 3 01MAR2010

2 1 04SEPT2012

3 1 19MAR2008

3 2 12OCT2010

3 3 17NOV2011

;

run ;

(2) Data2

data Data2 ;

input study_id 1-2 medstartdate date9 atc_code $10 drug_class $6 ; /*where medstartdate=date that prescription was started, atc_code=medication code, drug_class=type of drug*/

datalines ;

1 01MAY1996 ABC /*Note: this value for drug_class is missing because it is a non-relevant drug for my study*/

1 01NOV1996 CBA A

1 01NOV1996 CBB B

1 01NOV1996 CBC A

2 31MAR1999 ABC

2 01JUN1999 CBD C

2 01JUN1999 CBE A

2 01MAY2003 ABC

3 17FEB1999 CBA A

3 17FEB1999 CBB B

3 01MAR2000 ABC

;

run ;

--> What I would like: As above, a table that only shows one record per participant (study_id) at baseline questionnaire (where qnum=1). I am ONLY interested in those participants who have

an outcome for drug_class (i.e. those with non-relevant atc_codes should not be included in the calculation of a new variable; as I am only interested in those with drug_class reported).

What I am TRYING to do, is estimate the proportion of the population who is medication naive .vs. medication experienced at his/her first questionnaire (qnum=1).

RW9 · Posted 03-11-2015 05:48 AM

Well, could try something along these lines:

data Data1 ;

infile datalines dlm=",";

informat interviewdt date9.;

format interviewdt date9.;

input study_id qnum interviewdt;

datalines ;

1,1,14FEB2008

1,2,12FEB2009

1,3,01MAR2010

2,1,04SEP2012

3,1,19MAR2008

3,2,12OCT2010

3,3,17NOV2011

;

run;

data Data2 ;

infile datalines dlm=",";

informat medstartdate date9.;

format medstartdate date9.;

input study_id medstartdate atc_code $ drug_class $;

datalines ;

1,01MAY1996,ABC,

1,01NOV1996,CBA,A

1,01NOV1996,CBB,B

1,01NOV1996,CBC,A

2,31MAR1999,ABC,

2,01JUN1999,CBD,C

2,01JUN1999,CBE,A

2,01MAY2003,ABC,

3,17FEB1999,CBA,A

3,17FEB1999,CBB,B

3,01MAR2000,ABC,

;

run ;

proc sql;

create table WANT as

select distinct

A.STUDY_ID,

A.INTERVIEWDT

from (select * from DATA1 where QNUM=1) A

where exists(select distinct STUDY_ID from DATA2 where STUDY_ID=A.STUDY_ID and DRUG_CLASS is not null);

quit;

Ksharp · Posted 03-11-2015 06:44 AM

You didn't post the output you need according to your sample data yet ?

If I understood what you want.

data Data1 ;
  infile datalines dlm=",";
  informat interviewdt date9.;
  format interviewdt date9.;
  input study_id qnum interviewdt;
datalines ;
1,1,14FEB2008
1,2,12FEB2009
1,3,01MAR2010
2,1,04SEP2012
3,1,19MAR2008
3,2,12OCT2010
3,3,17NOV2011
;
run;
data Data2 ;
  infile datalines dlm=",";
  informat medstartdate date9.;
  format medstartdate date9.;
  input study_id medstartdate atc_code $ drug_class $;
datalines ;
1,01MAY1996,ABC,
1,01NOV1996,CBA,A
1,01NOV1996,CBB,B
1,01NOV1996,CBC,A
2,31MAR1999,ABC,
2,01JUN1999,CBD,C
2,01JUN1999,CBE,A
2,01MAY2003,ABC,
3,17FEB1999,CBA,A
3,17FEB1999,CBA,A
3,17FEB1999,CBB,B
3,01MAR2000,ABC,
;
run ;
 
proc sql;
 create table want as
  select a.*,case 
             when exists(select * from data2 where study_id=a.study_id and drug_class is not missing and medstartdate lt a.interviewdt) then 'YES'
                else 'NO ' 
                end as MedStartDtDiff    
   from data1 as a 
    where qnum=1;
quit;

Xia Keshan

Message was edited by: xia keshan Fix a problem .

Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Catch up on SAS Innovate 2026

Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Re: Merge 2 data sets results in multiple observations per participant

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away