BookmarkSubscribeRSS Feed
toneill
Calcite | Level 5

Hi all -- I am a clinical researcher (not a trained analyst/statistician), so I'll do my best to be very clear about the problem I am currently having. I will use generic names for datasets and variables for illustrative purposes.

I have 2 data sets (data1 and data2). Both are sorted on study_id.

Data1 - Has multiple records per individual (i.e. study_id #1 has 6 records, representing 6 completed questionnaires -- therefore, "study_id" variable 1 is listed 6x, and "qnum" variable is 1-6). Period of observation: 01-01-07 to 31-12-13

Data2 - Also has multiple records per individual (i.e. study_id #1 has 72 records, representing 72 medications he/she has been prescribed). The "quest" variable is NOT included in Data2. Period of observation: 01-01-85 to 31-12-13. However, if a participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has only 1 questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1 when I attempted to merge the data below). 

Goal: A merged data set that has one record per "questionnaire" (similar to Data1), and a new variable ("newvar") based on the prescribing of a drug reported in data2.

To be more clear on "newvar": In Data1, there is a questionnaire interview date ("interviewdt") and in Data2 there is a prescription date ("medstartdate"). This "newvar" should be binary, such that IF a participant has started a particular prescription (represented by a particular code)(Data2) on or before the interview date (Data1), then "newvar"=1 (i.e. yes), ELSE, "newvar"=0 (i.e. no).

Here is what I've  done :

data Data3 ;

     merge Data1 (keep=study_id qnum othervariables in=in_data1 where=(interviewdt^=.))

          Data2 (in=in_data2 where (medstartdate^=. and (prescriptioncode="A" or prescriptioncode="B"))) ;

     by study_id ;

     if in_data1 and in_dadta2 ;

     dtdiff=interviewdt - medstartdate ; /*ONLY want "interviewdt" where qnum=1 (i.e. first questionnaire/baseline) - I think this is part of my problem; how to only select qnum=1?*/

     MedStartDtDiff = abs(dtdiff) ;

          if dtdiff^=. then MedStartDtDiff=(-10220<=(interviewdt - medstartdate)<=10220) ; /*10220 represents the maximum number of days from the last day of observation, 31-12-13, to the first day, 01-01-85)*/

     format MedStartDtDiff yesno. ;

run ;

proc sort data=Data3 (where =(MedStartDtDiff^=.)) ;

     by study_id MedStartDtDiff ;

          where qnum=1 ;

run ;

proc print=Data3 (obs=30) ; var study_id MedStartDtDiff qnum ; where qnum=1 ; run;

SO this is what I get as an output -->

study_id     MedStartDtDiff     MedStartDate     qnum

     1               YES                    01NOV1996     1          --> This is correct; study_id1 has 6 questionnaires (with qnum=1 when interviewdt=14Feb2008 AND started med BEFORE interview date; therefore, MedStartDt is YES)

     2               YES                    01JUN1999      1

     2               YES                    01JUN1999      1

     2               YES                    07NOV2006     1          --> This is NOT correct; study_id2 should only have a single entry. So this is my major issue! Why am I getting multiple study_id for each individuals?

To further elaborate on study_id2 -- this participant only completed a single questionnaire (Data1), he/she still has ALL previous prescriptions "linked" to that questionnaire (i.e. study_id=2 has basline/only questionnaire in 2010, but he/she has prescription records back to 1999 liked to qnum=1). I really only want one line per study_id.


What I'd like to see:

study_id     MedStartDtDiff     MedStartDate     qnum      InterviewDt

     1               YES                    01NOV1996     1              14FEB2008

     2               YES                    01JUN1999      1              6JUN2010

     3               YES                    15JUN1989      1              20AUG2008

     4               NO                      07NOV2010     1              11NOV2011 etc etc....

I am happy to elaborate further or clarify if this isn't 100% clear.

6 REPLIES 6
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

Sorry, couldn't really follow that, perhaps post a datastep for each of your inputs with some test data, and an example of what you want out.  Looking at the last bit I would estimate going something like - (assuming latest date in one):

proc sql;

     create table WANT as

     select     A.STUDY_ID,

                  ... as MEDSTARTDTDIFF,

                  A.MEDSTARTDATE,

                  ... as QNUM,

                   MAX(B.INTERVIEWDT) as INTERVIEWDT

     from      (select distinct STUDY_ID,... from DATA1) A

     left join  DATA2 B

     on         A.STUDY_ID=B.STUDY_ID

     and        B.INTERVIEWDT < A.MEDSTARTDATE

     group by STUDYID;

quit;

Not tested the above obviously, as I don't have any test data.

toneill
Calcite | Level 5

Unfortunately, I don't have any input datasteps; I was provided with two separate data files and need to merge them together (as aforementioned). I'll give your code a try and let you know what happens.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Yep, just make something up that looks like your data so I have something to work with. 

toneill
Calcite | Level 5

Here is an example:

(1) Data1

data Data1 ;

     input study_id 1-2 qnum 1-2 interviewdt date9 ; /*where study_id=participant, qnum=questionnaire, interviewdt=date of interview*/ 

    datalines ;

     1     1     14FEB2008

     1     2     12FEB2009

     1     3      01MAR2010

     2     1      04SEPT2012

     3     1      19MAR2008

     3     2      12OCT2010

     3     3      17NOV2011

;

run ;

(2) Data2

data Data2 ;

     input study_id 1-2 medstartdate date9 atc_code $10 drug_class $6 ; /*where medstartdate=date that prescription was started, atc_code=medication code, drug_class=type of drug*/

     datalines ;

     1     01MAY1996     ABC          /*Note: this value for drug_class is missing because it is a non-relevant drug for my study*/

     1     01NOV1996     CBA     A

     1     01NOV1996     CBB     B

     1     01NOV1996     CBC     A

     2     31MAR1999     ABC    

     2     01JUN1999      CBD     C

     2     01JUN1999      CBE     A

     2     01MAY2003     ABC    

     3     17FEB1999      CBA     A

     3     17FEB1999      CBA     A

     3     17FEB1999      CBB     B

     3     01MAR2000     ABC    

;

run ;

--> What I would like: As above, a table that only shows one record per participant (study_id) at baseline questionnaire (where qnum=1). I am ONLY interested in those participants who have

an outcome for drug_class (i.e. those with non-relevant atc_codes should not be included in the calculation of a new variable; as I am only interested in those with drug_class reported).

What I am TRYING to do, is estimate the proportion of the population who is medication naive .vs. medication experienced at his/her first questionnaire (qnum=1).

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, could try something along these lines:

data Data1 ;

  infile datalines dlm=",";

  informat interviewdt date9.;

  format interviewdt date9.;

  input study_id qnum interviewdt;

datalines ;

1,1,14FEB2008

1,2,12FEB2009

1,3,01MAR2010

2,1,04SEP2012

3,1,19MAR2008

3,2,12OCT2010

3,3,17NOV2011

;

run;

data Data2 ;

  infile datalines dlm=",";

  informat medstartdate date9.;

  format medstartdate date9.;

  input study_id medstartdate atc_code $ drug_class $;

datalines ;

1,01MAY1996,ABC,

1,01NOV1996,CBA,A

1,01NOV1996,CBB,B

1,01NOV1996,CBC,A

2,31MAR1999,ABC,

2,01JUN1999,CBD,C

2,01JUN1999,CBE,A

2,01MAY2003,ABC,

3,17FEB1999,CBA,A

3,17FEB1999,CBA,A

3,17FEB1999,CBB,B

3,01MAR2000,ABC,

;

run ;

proc sql;

  create table WANT as

  select  distinct

          A.STUDY_ID,

          A.INTERVIEWDT

  from    (select * from DATA1 where QNUM=1) A

  where   exists(select distinct STUDY_ID from DATA2 where STUDY_ID=A.STUDY_ID and DRUG_CLASS is not null);

quit;

Ksharp
Super User

You didn't post the output you need according to your sample data yet ?

If I understood what you want.

data Data1 ;
  infile datalines dlm=",";
  informat interviewdt date9.;
  format interviewdt date9.;
  input study_id qnum interviewdt;
datalines ;
1,1,14FEB2008
1,2,12FEB2009
1,3,01MAR2010
2,1,04SEP2012
3,1,19MAR2008
3,2,12OCT2010
3,3,17NOV2011
;
run;
data Data2 ;
  infile datalines dlm=",";
  informat medstartdate date9.;
  format medstartdate date9.;
  input study_id medstartdate atc_code $ drug_class $;
datalines ;
1,01MAY1996,ABC,
1,01NOV1996,CBA,A
1,01NOV1996,CBB,B
1,01NOV1996,CBC,A
2,31MAR1999,ABC,
2,01JUN1999,CBD,C
2,01JUN1999,CBE,A
2,01MAY2003,ABC,
3,17FEB1999,CBA,A
3,17FEB1999,CBA,A
3,17FEB1999,CBB,B
3,01MAR2000,ABC,
;
run ;
 
proc sql;
 create table want as
  select a.*,case 
             when exists(select * from data2 where study_id=a.study_id and drug_class is not missing and medstartdate lt a.interviewdt) then 'YES'
                else 'NO ' 
                end as MedStartDtDiff    
   from data1 as a 
    where qnum=1;
quit;

Xia Keshan

Message was edited by: xia keshan Fix a problem .

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 922 views
  • 1 like
  • 3 in conversation