BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LEINAARE
Obsidian | Level 7

Hello,

 

I have a dataset that contains multiple observations for individuals with a date variable that increments by month.  It has variables for contraceptive method and birth outcomes.  There may be up to 60 observations for each individual (January 2012 - December 2016).  A simplified version of my data is structured as below.

 

UniqueID      Date            Method        Birth

 

      1         01JAN2012             0                0

      1         01FEB2012             0                0

      1         01MAR2012            0                0

      1         01APR2012             0                0

      1         01MAY2012            0                1

      1         01JUN2012             0                0

      1         01JUL2012             0                0

      1         01AUG2012            0                0

      1         01SEP2012            0               0                

      1         01OCT2012            0                0

      1         01NOV2012            0                0

      1         01DEC2012            0                0

      1         01JAN2013             0                0

 

For each date in the original dataset, I want to know if a birth (live or non-live) occurred within 6-10 months later.  I created a copy of the original dataset with UniqueID, date, and birth flag to merge back with the original dataset for corresponding months.  The original dataset was modified to create five new date variables as shown in the code below. 

data test (keep=UniqueID Date: Birth);
   set original (obs=500);
   array newdate Date_6-Date_10;
   do i = 1 to dim(newdate);
       k = i + 5;
       newdate{i} = intnx('month',date,k);
   end;
   format date_: date9.;
run;

I was able to accomplish this using a data step with sort procedures within a macro that executed an iterative do loop on a sample dataset with 500 observations.  It worked fine, but my real dataset has over 61,000,000 observations.  I would like to use PROC SQL to avoid using SORT procedures to reduce processing time, but I am not savvy enough with SQL to accomplish this.  Could someone please help?  Below is a copy of my code using data steps.

 

 

%macro lookforward;
   %do i = 6 %to 10;
      proc sort data=test;
          by orsid date_&i;
      run;
      data test;
         merge test (in=a)
            copy (rename=(date=date_&i birth=event_&i));
         by UniqueID date_&i;
         if a = 1;
         event_flag = sum(of event_:);
      run;
   %end;
%mend;
%lookforward;

 

 

The modified dataset would look something like this.

Capture.JPG

 

Any assistance in converting this code to SQL is much appreciated.

 

Thanks,

 

Ted

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@LEINAARE wrote:

Hi @ballardw,

 

above, with the edition of the event flags shown below (starting from Date_9 to save space).

 

..... Date_9             Date_10            Event_6      Event_7      Event_8     Event_9     Event_10      Event_Flag

 

   01OCT2012       01NOV2012            0                 0                0                 0                 0                      0

   01NOV2012       01DEC2012            0                 0                0                 0                 0                      0                 

   01DEC2012       01JAN2013             0                 0                0                 1                 0                      1

 

I am trying to use the SQL procedure to produce a table structured similarly to this, but without the redundant sort procedures and data steps.  I apologize for not including better information.  Thank you for the code you provided.

 

Ted


Why do you want a table in that format? What is magical about that specific layout.

 

The code I provided before can get the month difference of the "event" if that is supposed be the month of the pregnancy.

proc sql;
   create table temp as
   select a.uniqueid, a.date, a.method,a.birth,
          b.date as datewithbirth, intck('month',a.date,b.date) as monthdiff
   from have as a left join
        (select * from have where birth=1) as b
        on a.uniqueid=b.uniqueid
   where 6 le intck('month',a.date,b.date) le 10
   order by a.uniqueid, a.date
   ;
quit;

Summaries on the reduced set from above can be merged back with the original. If the monthdiff is missing then "no event".

The above can be transposed to create an across with the "event" value. But since your "events are basically a diagonal of 0/ 1 values with at max a single 1 per date then you really aren't getting any additional information. If the "event_flag" is what you really want:

proc sql;
   create table want as
   select a.*, (b.monthdiff>0) as event_flag
   from have as a
        left join
        temp as b
        on a.uniqueid=b.uniqueid
        and a.date=b.date
   ;
quit;

 

View solution in original post

8 REPLIES 8
Reeza
Super User
99% sure there's an easier way to do that overall calculation. Do you want that output for another reason, or solely to calculate how many have another birth within X months. That's actually a very straightforward SQL query itself.
LEINAARE
Obsidian | Level 7

Hi @Reeza,

 

Thanks for responding.  I didn't include part of the code, because I wanted to keep it simple, but I use an array to reset values GE one back to one.  I am using it as a dichotomous variable for regression analysis.  The main thing I am looking for is using SQL to create a table that matches flag values from the copied dataset with the incremented date variables for each observation.  I am looking for efficiency, but do not have much experience using SQL.

 

Thanks,

 

Ted

 

Reeza
Super User
The problem with SQL is lack of arrays, if you want this type of functionality you're better off staying with data steps. Long formatted data, your original is more optimal for SQL analysis.
ballardw
Super User

Since you didn't even present a birth within that 6 to 10 month interval or what your intermediate steps might have been I'm not really sure what you need.

This is one way to find records where the birth occurs within that time frame though. Note that I edited your example to show a birth on the 01SEP2012 record so that it would be within the time frame.

data have;
input UniqueID      Date  :date9.          Method        Birth ;
format date date9.;
datalines;
      1         01JAN2012             0                0
      1         01FEB2012             0                0
      1         01MAR2012            0                0
      1         01APR2012             0                0
      1         01MAY2012            0                0
      1         01JUN2012             0                0
      1         01JUL2012             0                0
      1         01AUG2012            0                0
      1         01SEP2012            0               1                
      1         01OCT2012            0                0
      1         01NOV2012            0                0
      1         01DEC2012            0                0
      1         01JAN2013             0                0
;

proc sql;
   create table temp as
   select a.uniqueid, a.date, a.method,a.birth,
          b.date as datewithbirth
   from have as a left join
        (select * from have where birth=1) as b
        on a.uniqueid=b.uniqueid
   where 6 le intck('month',a.date,b.date) le 10
   order by a.uniqueid, a.date
   ;
quit;

From the way you present the "newdates" those might be obtainable by transposing the result from above but I'm not quite sure what your process is following. But the above does find all of the months for the same individual where the condition is met as I understand it.

LEINAARE
Obsidian | Level 7

Hi @ballardw,

 

Thank you for your response.  In the data step where I created the dataset named "test", I used the array to create five new date variables (Date_6 - Date_10) as shown below.

 

UniqueID        Date           Method     Birth        Date_6             Date_7              Date_8              Date_9             Date_10        

 

      1         01JAN2012          0              0       01JUL2012      01AUG2012      01SEP2012      01OCT2012       01NOV2012

      1         01FEB2012          0              0       01AUG2012     01SEP2012      01OCT2012      01NOV2012       01DEC2012

      1         01MAR2012         0              0       01SEP2012     01OCT2012      01NOV2012      01DEC2012       01JAN2013

.

.

.

.

      1        01DEC2012         0              0       01JUN2013      01JUL2013       01AUG2013      01SEP2013      01OCT2013

 

The  macro program containing a SORT procedure with a data step that merged the 'test' and 'copy' datasets created five new flag variables (event_6 - event_10) that were summed to create one variable indicating whether or not a birth occurred on at least one of the future dates specified for each observation.  The dataset produced by the macro program included all of the variables shown above, with the edition of the event flags shown below (starting from Date_9 to save space).

 

..... Date_9             Date_10            Event_6      Event_7      Event_8     Event_9     Event_10      Event_Flag

 

   01OCT2012       01NOV2012            0                 0                0                 0                 0                      0

   01NOV2012       01DEC2012            0                 0                0                 0                 0                      0                 

   01DEC2012       01JAN2013             0                 0                0                 1                 0                      1

 

I am trying to use the SQL procedure to produce a table structured similarly to this, but without the redundant sort procedures and data steps.  I apologize for not including better information.  Thank you for the code you provided.

 

Ted

ballardw
Super User

@LEINAARE wrote:

Hi @ballardw,

 

above, with the edition of the event flags shown below (starting from Date_9 to save space).

 

..... Date_9             Date_10            Event_6      Event_7      Event_8     Event_9     Event_10      Event_Flag

 

   01OCT2012       01NOV2012            0                 0                0                 0                 0                      0

   01NOV2012       01DEC2012            0                 0                0                 0                 0                      0                 

   01DEC2012       01JAN2013             0                 0                0                 1                 0                      1

 

I am trying to use the SQL procedure to produce a table structured similarly to this, but without the redundant sort procedures and data steps.  I apologize for not including better information.  Thank you for the code you provided.

 

Ted


Why do you want a table in that format? What is magical about that specific layout.

 

The code I provided before can get the month difference of the "event" if that is supposed be the month of the pregnancy.

proc sql;
   create table temp as
   select a.uniqueid, a.date, a.method,a.birth,
          b.date as datewithbirth, intck('month',a.date,b.date) as monthdiff
   from have as a left join
        (select * from have where birth=1) as b
        on a.uniqueid=b.uniqueid
   where 6 le intck('month',a.date,b.date) le 10
   order by a.uniqueid, a.date
   ;
quit;

Summaries on the reduced set from above can be merged back with the original. If the monthdiff is missing then "no event".

The above can be transposed to create an across with the "event" value. But since your "events are basically a diagonal of 0/ 1 values with at max a single 1 per date then you really aren't getting any additional information. If the "event_flag" is what you really want:

proc sql;
   create table want as
   select a.*, (b.monthdiff>0) as event_flag
   from have as a
        left join
        temp as b
        on a.uniqueid=b.uniqueid
        and a.date=b.date
   ;
quit;

 

LEINAARE
Obsidian | Level 7

Thank you @ballardw.  I was trying to reproduce a method that a senior researcher suggested, but I will follow your and @Reeza advice and reconsider my conceptual approach.

 

Thank you,

 

Ted

Reeza
Super User
Ted, that's how you would do it in Excel or manually, but generally not how you'd do it in a programming language. Creating all those dummy variables doesn't make a lot of sense, especially via SQL since it' cannot do arrays.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 2534 views
  • 3 likes
  • 3 in conversation