DATA Step, Macro, Functions and more

Merge 2 data sets results in multiple observations per participant

Reply
Occasional Contributor
Posts: 14

Merge 2 data sets results in multiple observations per participant

Hi all -- I am a clinical researcher (not a trained analyst/statistician), so I'll do my best to be very clear about the problem I am currently having. I will use generic names for datasets and variables for illustrative purposes.

I have 2 data sets (data1 and data2). Both are sorted on study_id.

Data1 - Has multiple records per individual (i.e. study_id #1 has 6 records, representing 6 completed questionnaires -- therefore, "study_id" variable 1 is listed 6x, and "qnum" variable is 1-6). Period of observation: 01-01-07 to 31-12-13

Data2 - Also has multiple records per individual (i.e. study_id #1 has 72 records, representing 72 medications he/she has been prescribed). The "quest" variable is NOT included in Data2. Period of observation: 01-01-85 to 31-12-13. However, if a participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has only 1 questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1 when I attempted to merge the data below). 

Goal: A merged data set that has one record per "questionnaire" (similar to Data1), and a new variable ("newvar") based on the prescribing of a drug reported in data2.

To be more clear on "newvar": In Data1, there is a questionnaire interview date ("interviewdt") and in Data2 there is a prescription date ("medstartdate"). This "newvar" should be binary, such that IF a participant has started a particular prescription (represented by a particular code)(Data2) on or before the interview date (Data1), then "newvar"=1 (i.e. yes), ELSE, "newvar"=0 (i.e. no).

Here is what I've  done :

data Data3 ;

     merge Data1 (keep=study_id qnum othervariables in=in_data1 where=(interviewdt^=.))

          Data2 (in=in_data2 where (medstartdate^=. and (prescriptioncode="A" or prescriptioncode="B"))) ;

     by study_id ;

     if in_data1 and in_dadta2 ;

     dtdiff=interviewdt - medstartdate ; /*ONLY want "interviewdt" where qnum=1 (i.e. first questionnaire/baseline) - I think this is part of my problem; how to only select qnum=1?*/

     MedStartDtDiff = abs(dtdiff) ;

          if dtdiff^=. then MedStartDtDiff=(-10220<=(interviewdt - medstartdate)<=10220) ; /*10220 represents the maximum number of days from the last day of observation, 31-12-13, to the first day, 01-01-85)*/

     format MedStartDtDiff yesno. ;

run ;

proc sort data=Data3 (where =(MedStartDtDiff^=.)) ;

     by study_id MedStartDtDiff ;

          where qnum=1 ;

run ;

proc print=Data3 (obs=30) ; var study_id MedStartDtDiff qnum ; where qnum=1 ; run;

SO this is what I get as an output -->

study_id     MedStartDtDiff     MedStartDate     qnum

     1               YES                    01NOV1996     1          --> This is correct; study_id1 has 6 questionnaires (with qnum=1 when interviewdt=14Feb2008 AND started med BEFORE interview date; therefore, MedStartDt is YES)

     2               YES                    01JUN1999      1

     2               YES                    01JUN1999      1

     2               YES                    07NOV2006     1          --> This is NOT correct; study_id2 should only have a single entry. So this is my major issue! Why am I getting multiple study_id for each individuals?

To further elaborate on study_id2 -- this participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has basline/only questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1). I really only want one line per study_id.


What I'd like to see:

study_id     MedStartDtDiff     MedStartDate     qnum      InterviewDt

     1               YES                    01NOV1996     1              14FEB2008

     2               YES                    01JUN1999      1              6JUN2010

     3               YES                    15JUN1989      1              20AUG2008

     4               NO                      07NOV2010     1              11NOV2011 etc etc....

I am happy to elaborate further or clarify if this isn't 100% clear.

Super User
Super User
Posts: 7,977

Re: Merge 2 data sets results in multiple observations per participant

Hi,

Sorry, couldn't really follow that, perhaps post a datastep for each of your inputs with some test data, and an example of what you want out.  Looking at the last bit I would estimate going something like - (assuming latest date in one):

proc sql;

     create table WANT as

     select     A.STUDY_ID,

                  ... as MEDSTARTDTDIFF,

                  A.MEDSTARTDATE,

                  ... as QNUM,

                   MAX(B.INTERVIEWDT) as INTERVIEWDT

     from      (select distinct STUDY_ID,... from DATA1) A

     left join  DATA2 B

     on         A.STUDY_ID=B.STUDY_ID

     and        B.INTERVIEWDT < A.MEDSTARTDATE

     group by STUDYID;

quit;

Not tested the above obviously, as I don't have any test data.

Occasional Contributor
Posts: 14

Re: Merge 2 data sets results in multiple observations per participant

Unfortunately, I don't have any input datasteps; I was provided with two separate data files and need to merge them together (as aforementioned). I'll give your code a try and let you know what happens.

Super User
Super User
Posts: 7,977

Re: Merge 2 data sets results in multiple observations per participant

Yep, just make something up that looks like your data so I have something to work with. 

Occasional Contributor
Posts: 14

Re: Merge 2 data sets results in multiple observations per participant

Here is an example:

(1) Data1

data Data1 ;

     input study_id 1-2 qnum 1-2 interviewdt date9 ; /*where study_id=participant, qnum=questionnaire, interviewdt=date of interview*/ 

    datalines ;

     1     1     14FEB2008

     1     2     12FEB2009

     1     3      01MAR2010

     2     1      04SEPT2012

     3     1      19MAR2008

     3     2      12OCT2010

     3     3      17NOV2011

;

run ;

(2) Data2

data Data2 ;

     input study_id 1-2 medstartdate date9 atc_code $10 drug_class $6 ; /*where medstartdate=date that prescription was started, atc_code=medication code, drug_class=type of drug*/

     datalines ;

     1     01MAY1996     ABC          /*Note: this value for drug_class is missing because it is a non-relevant drug for my study*/

     1     01NOV1996     CBA     A

     1     01NOV1996     CBB     B

     1     01NOV1996     CBC     A

     2     31MAR1999     ABC    

     2     01JUN1999      CBD     C

     2     01JUN1999      CBE     A

     2     01MAY2003     ABC    

     3     17FEB1999      CBA     A

     3     17FEB1999      CBA     A

     3     17FEB1999      CBB     B

     3     01MAR2000     ABC    

;

run ;

--> What I would like: As above, a table that only shows one record per participant (study_id) at baseline questionnaire (where qnum=1). I am ONLY interested in those participants who have

an outcome for drug_class (i.e. those with non-relevant atc_codes should not be included in the calculation of a new variable; as I am only interested in those with drug_class reported).

What I am TRYING to do, is estimate the proportion of the population who is medication naive .vs. medication experienced at his/her first questionnaire (qnum=1).

Super User
Super User
Posts: 7,977

Re: Merge 2 data sets results in multiple observations per participant

Well, could try something along these lines:

data Data1 ;

  infile datalines dlm=",";

  informat interviewdt date9.;

  format interviewdt date9.;

  input study_id qnum interviewdt;

datalines ;

1,1,14FEB2008

1,2,12FEB2009

1,3,01MAR2010

2,1,04SEP2012

3,1,19MAR2008

3,2,12OCT2010

3,3,17NOV2011

;

run;

data Data2 ;

  infile datalines dlm=",";

  informat medstartdate date9.;

  format medstartdate date9.;

  input study_id medstartdate atc_code $ drug_class $;

datalines ;

1,01MAY1996,ABC,

1,01NOV1996,CBA,A

1,01NOV1996,CBB,B

1,01NOV1996,CBC,A

2,31MAR1999,ABC,

2,01JUN1999,CBD,C

2,01JUN1999,CBE,A

2,01MAY2003,ABC,

3,17FEB1999,CBA,A

3,17FEB1999,CBA,A

3,17FEB1999,CBB,B

3,01MAR2000,ABC,

;

run ;

proc sql;

  create table WANT as

  select  distinct

          A.STUDY_ID,

          A.INTERVIEWDT

  from    (select * from DATA1 where QNUM=1) A

  where   exists(select distinct STUDY_ID from DATA2 where STUDY_ID=A.STUDY_ID and DRUG_CLASS is not null);

quit;

Super User
Posts: 10,041

Re: Merge 2 data sets results in multiple observations per participant

You didn't post the output you need according to your sample data yet ?

If I understood what you want.

data Data1 ;
  infile datalines dlm=",";
  informat interviewdt date9.;
  format interviewdt date9.;
  input study_id qnum interviewdt;
datalines ;
1,1,14FEB2008
1,2,12FEB2009
1,3,01MAR2010
2,1,04SEP2012
3,1,19MAR2008
3,2,12OCT2010
3,3,17NOV2011
;
run;
data Data2 ;
  infile datalines dlm=",";
  informat medstartdate date9.;
  format medstartdate date9.;
  input study_id medstartdate atc_code $ drug_class $;
datalines ;
1,01MAY1996,ABC,
1,01NOV1996,CBA,A
1,01NOV1996,CBB,B
1,01NOV1996,CBC,A
2,31MAR1999,ABC,
2,01JUN1999,CBD,C
2,01JUN1999,CBE,A
2,01MAY2003,ABC,
3,17FEB1999,CBA,A
3,17FEB1999,CBA,A
3,17FEB1999,CBB,B
3,01MAR2000,ABC,
;
run ;
 
proc sql;
 create table want as
  select a.*,case 
             when exists(select * from data2 where study_id=a.study_id and drug_class is not missing and medstartdate lt a.interviewdt) then 'YES'
                else 'NO ' 
                end as MedStartDtDiff    
   from data1 as a 
    where qnum=1;
quit;

Xia Keshan

Message was edited by: xia keshan Fix a problem .

Ask a Question
Discussion stats
  • 6 replies
  • 256 views
  • 1 like
  • 3 in conversation