Solved: Re: Converting a Data Step to an SQL Procedure

LEINAARE · Posted 07-11-2019 03:36 PM

Hello,

I have a dataset that contains multiple observations for individuals with a date variable that increments by month. It has variables for contraceptive method and birth outcomes. There may be up to 60 observations for each individual (January 2012 - December 2016). A simplified version of my data is structured as below.

UniqueID Date Method Birth

1 01JAN2012 0 0

1 01FEB2012 0 0

1 01MAR2012 0 0

1 01APR2012 0 0

1 01MAY2012 0 1

1 01JUN2012 0 0

1 01JUL2012 0 0

1 01AUG2012 0 0

1 01SEP2012 0 0

1 01OCT2012 0 0

1 01NOV2012 0 0

1 01DEC2012 0 0

1 01JAN2013 0 0

For each date in the original dataset, I want to know if a birth (live or non-live) occurred within 6-10 months later. I created a copy of the original dataset with UniqueID, date, and birth flag to merge back with the original dataset for corresponding months. The original dataset was modified to create five new date variables as shown in the code below.

data test (keep=UniqueID Date: Birth);
   set original (obs=500);
   array newdate Date_6-Date_10;
   do i = 1 to dim(newdate);
       k = i + 5;
       newdate{i} = intnx('month',date,k);
   end;
   format date_: date9.;
run;

I was able to accomplish this using a data step with sort procedures within a macro that executed an iterative do loop on a sample dataset with 500 observations. It worked fine, but my real dataset has over 61,000,000 observations. I would like to use PROC SQL to avoid using SORT procedures to reduce processing time, but I am not savvy enough with SQL to accomplish this. Could someone please help? Below is a copy of my code using data steps.

%macro lookforward;
   %do i = 6 %to 10;
      proc sort data=test;
          by orsid date_&i;
      run;
      data test;
         merge test (in=a)
            copy (rename=(date=date_&i birth=event_&i));
         by UniqueID date_&i;
         if a = 1;
         event_flag = sum(of event_:);
      run;
   %end;
%mend;
%lookforward;

The modified dataset would look something like this.

Any assistance in converting this code to SQL is much appreciated.

Thanks,

Ted

ballardw · Posted 07-11-2019 06:32 PM

@LEINAARE wrote:

Hi @ballardw,

above, with the edition of the event flags shown below (starting from Date_9 to save space).

..... Date_9 Date_10    Event_6    Event_7    Event_8 Event_9 Event_10    Event_Flag

   01OCT2012 01NOV2012    0 0    0 0 0    0

   01NOV2012 01DEC2012    0 0    0 0 0    0

   01DEC2012   01JAN2013 0 0    0 1 0    1

I am trying to use the SQL procedure to produce a table structured similarly to this, but without the redundant sort procedures and data steps. I apologize for not including better information. Thank you for the code you provided.

Ted

Why do you want a table in that format? What is magical about that specific layout.

The code I provided before can get the month difference of the "event" if that is supposed be the month of the pregnancy.

proc sql;
   create table temp as
   select a.uniqueid, a.date, a.method,a.birth,
          b.date as datewithbirth, intck('month',a.date,b.date) as monthdiff
   from have as a left join
        (select * from have where birth=1) as b
        on a.uniqueid=b.uniqueid
   where 6 le intck('month',a.date,b.date) le 10
   order by a.uniqueid, a.date
   ;
quit;

Summaries on the reduced set from above can be merged back with the original. If the monthdiff is missing then "no event".

The above can be transposed to create an across with the "event" value. But since your "events are basically a diagonal of 0/ 1 values with at max a single 1 per date then you really aren't getting any additional information. If the "event_flag" is what you really want:

proc sql;
   create table want as
   select a.*, (b.monthdiff>0) as event_flag
   from have as a
        left join
        temp as b
        on a.uniqueid=b.uniqueid
        and a.date=b.date
   ;
quit;

View solution in original post

Reeza · Posted 07-11-2019 03:50 PM

99% sure there's an easier way to do that overall calculation. Do you want that output for another reason, or solely to calculate how many have another birth within X months. That's actually a very straightforward SQL query itself.

LEINAARE · Posted 07-11-2019 04:28 PM

Hi @Reeza,

Thanks for responding. I didn't include part of the code, because I wanted to keep it simple, but I use an array to reset values GE one back to one. I am using it as a dichotomous variable for regression analysis. The main thing I am looking for is using SQL to create a table that matches flag values from the copied dataset with the incremented date variables for each observation. I am looking for efficiency, but do not have much experience using SQL.

Thanks,

Ted

Reeza · Posted 07-11-2019 05:04 PM

The problem with SQL is lack of arrays, if you want this type of functionality you're better off staying with data steps. Long formatted data, your original is more optimal for SQL analysis.

ballardw · Posted 07-11-2019 05:04 PM

Since you didn't even present a birth within that 6 to 10 month interval or what your intermediate steps might have been I'm not really sure what you need.

This is one way to find records where the birth occurs within that time frame though. Note that I edited your example to show a birth on the 01SEP2012 record so that it would be within the time frame.

data have;
input UniqueID      Date  :date9.          Method        Birth ;
format date date9.;
datalines;
      1         01JAN2012             0                0
      1         01FEB2012             0                0
      1         01MAR2012            0                0
      1         01APR2012             0                0
      1         01MAY2012            0                0
      1         01JUN2012             0                0
      1         01JUL2012             0                0
      1         01AUG2012            0                0
      1         01SEP2012            0               1                
      1         01OCT2012            0                0
      1         01NOV2012            0                0
      1         01DEC2012            0                0
      1         01JAN2013             0                0
;

proc sql;
   create table temp as
   select a.uniqueid, a.date, a.method,a.birth,
          b.date as datewithbirth
   from have as a left join
        (select * from have where birth=1) as b
        on a.uniqueid=b.uniqueid
   where 6 le intck('month',a.date,b.date) le 10
   order by a.uniqueid, a.date
   ;
quit;

From the way you present the "newdates" those might be obtainable by transposing the result from above but I'm not quite sure what your process is following. But the above does find all of the months for the same individual where the condition is met as I understand it.

LEINAARE · Posted 07-11-2019 05:58 PM

Hi @ballardw,

Thank you for your response. In the data step where I created the dataset named "test", I used the array to create five new date variables (Date_6 - Date_10) as shown below.

UniqueID Date Method Birth Date_6 Date_7 Date_8 Date_9 Date_10

1 01JAN2012 0 0 01JUL2012 01AUG2012 01SEP2012 01OCT2012 01NOV2012

1 01FEB2012 0 0 01AUG2012 01SEP2012 01OCT2012 01NOV2012 01DEC2012

1 01MAR2012 0 0 01SEP2012 01OCT2012 01NOV2012 01DEC2012 01JAN2013

.

1 01DEC2012 0 0 01JUN2013 01JUL2013 01AUG2013 01SEP2013 01OCT2013

The macro program containing a SORT procedure with a data step that merged the 'test' and 'copy' datasets created five new flag variables (event_6 - event_10) that were summed to create one variable indicating whether or not a birth occurred on at least one of the future dates specified for each observation. The dataset produced by the macro program included all of the variables shown above, with the edition of the event flags shown below (starting from Date_9 to save space).

..... Date_9 Date_10 Event_6 Event_7 Event_8 Event_9 Event_10 Event_Flag

01OCT2012 01NOV2012 0 0 0 0 0 0

01NOV2012 01DEC2012 0 0 0 0 0 0

01DEC2012 01JAN2013 0 0 0 1 0 1

I am trying to use the SQL procedure to produce a table structured similarly to this, but without the redundant sort procedures and data steps. I apologize for not including better information. Thank you for the code you provided.

Ted

ballardw · Posted 07-11-2019 06:32 PM

@LEINAARE wrote:

Hi @ballardw,

above, with the edition of the event flags shown below (starting from Date_9 to save space).

..... Date_9 Date_10    Event_6    Event_7    Event_8 Event_9 Event_10    Event_Flag

   01OCT2012 01NOV2012    0 0    0 0 0    0

   01NOV2012 01DEC2012    0 0    0 0 0    0

   01DEC2012   01JAN2013 0 0    0 1 0    1

I am trying to use the SQL procedure to produce a table structured similarly to this, but without the redundant sort procedures and data steps. I apologize for not including better information. Thank you for the code you provided.

Ted

Why do you want a table in that format? What is magical about that specific layout.

The code I provided before can get the month difference of the "event" if that is supposed be the month of the pregnancy.

proc sql;
   create table temp as
   select a.uniqueid, a.date, a.method,a.birth,
          b.date as datewithbirth, intck('month',a.date,b.date) as monthdiff
   from have as a left join
        (select * from have where birth=1) as b
        on a.uniqueid=b.uniqueid
   where 6 le intck('month',a.date,b.date) le 10
   order by a.uniqueid, a.date
   ;
quit;

Summaries on the reduced set from above can be merged back with the original. If the monthdiff is missing then "no event".

The above can be transposed to create an across with the "event" value. But since your "events are basically a diagonal of 0/ 1 values with at max a single 1 per date then you really aren't getting any additional information. If the "event_flag" is what you really want:

proc sql;
   create table want as
   select a.*, (b.monthdiff>0) as event_flag
   from have as a
        left join
        temp as b
        on a.uniqueid=b.uniqueid
        and a.date=b.date
   ;
quit;

LEINAARE · Posted 07-12-2019 07:47 AM

Thank you @ballardw. I was trying to reproduce a method that a senior researcher suggested, but I will follow your and @Reeza advice and reconsider my conceptual approach.

Thank you,

Ted

Reeza · Posted 07-11-2019 06:39 PM

Ted, that's how you would do it in Excel or manually, but generally not how you'd do it in a programming language. Creating all those dummy variables doesn't make a lot of sense, especially via SQL since it' cannot do arrays.

The 2025 SAS Hackathon has begun!