BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
PamG
Quartz | Level 8

I have a file with cases and controls (combined file is 15m).  I would like to select 2 controls without replacement for each case such that the date for cases is less than the dates for controls.  Some controls have missing dates but they should be included for control selection.  A sample dataset is attached here.

 

data have;
input id case DateInt :yymmdd. @@;
format dateInt date11.;
datalines;
1  0 .  2 0 20120103  3 1 20120101
4  0 20120103  5 0 20120101  6 1 20120103
7  0 20120101  8 0 .  9 0 20120103
10 1 20120105 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120106 15 0 20120107
;

That is the control selection should be as follows-

(1) For case id 3, the 2 selections without replacement should be from ids 1,2,4,8,9,11,12,13,14,15.

(2) For case id 6, the 2 selections without replacement should be from ids 1,8,10,14,15 and excluding selections in (1) above.

(3) For case id 10, the 2 selections without replacement should be from ids 1,8,14,15 and excluding selections in (1) and (2) above.

 

Thanks for any tips.  I am struggling to write a good code.

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

So you are saying that a missing DATE should be regarded as later than all known dates, correct?

 

Now, simple random selection could end up with assignment of the latest control dates to early case dates, leaving no control candidates for assignment to the latest case dates.  If so, you could restart as many times as needed until you ended up with a completely valid collection of case-control assignments.  A possibly very long process depending on the DATE distribution of CASEs and CONTROLs (assuming at least one valid collection of assignments exists).

 

A simple approach to this issue could be to start with the latest CASE, then randomly select from qualifying CONTROLS.  Repeat for each CASE in reverse chronological order.  This would guarantee a valid set of assignments, assuming one exists.

 

This code does that by sorting the data by descending date (assigning missing date to 31dec2099 for sorting purposes.  As a result for each CASE, you can randomly select from CONTROLS that occur earlier in the sorted dataset:

 

data have;
  input id case DateInt :yymmdd. @@;
  format dateInt date11.;
datalines;
1  0 .  2 0 20120103  3 1 20120101
4  0 20120103  5 0 20120101  6 1 20120103
7  0 20120101  8 0 .  9 0 20120103
10 1 20120105 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120106 15 0 20120107
run;

data vneed/ view=vneed;
  set have;
  if dateint=. then dateint='31dec2099'd;
run;

proc sort data=vneed out=need;
  by descending dateint  case;  /*For tied dates make CONTROLs precede CASEs */
run;

data want (drop=_: i);  /* Two obs per CASE, with a new variable CONTROL_ID */
  array cntrl_ids {1164000} _temporary_;
  set need; 

  if case=0 then do;    /* Update list of eligible controls    */
    _n+1;               /* Count of eligible controls          */
    cntrl_ids{_n}=id;
  end;

  if case=1 then do i=1 to 2 while (_n>0);  /* Find up to 2 controls */
    _ptr=ceil(_n*ranuni(150195860));
    control_id=cntrl_ids{_ptr};
    output;

    /*Remove the assigned control_id from cntrl_ids array*/
    control_id=.;
    /* Slide the remaining id's to the left */
    do _p=_ptr to _n;
      cntrl_ids{_p}=cntrl_ids{_p+1};
    end;
    _n+(-1);           /* Decrement count of eligible controls*/
  end;

run;

Just make sure that the array CNTRL_IDS is large enough to hold the rolling count of control observations.

 

Edited Note: I added the "while (_n>0)" to the DO statement to ensure selection of controls only when there are qualifying controls available.

 

Second edited note.  I have changed the proc sort to force CONTROLs to precede CASEs when there are tied dates, thereby making them available for matching.  

 

Also I have added a new reply to perform this task separately by SEX, in a single use of the program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

8 REPLIES 8
mkeintz
PROC Star

So you are saying that a missing DATE should be regarded as later than all known dates, correct?

 

Now, simple random selection could end up with assignment of the latest control dates to early case dates, leaving no control candidates for assignment to the latest case dates.  If so, you could restart as many times as needed until you ended up with a completely valid collection of case-control assignments.  A possibly very long process depending on the DATE distribution of CASEs and CONTROLs (assuming at least one valid collection of assignments exists).

 

A simple approach to this issue could be to start with the latest CASE, then randomly select from qualifying CONTROLS.  Repeat for each CASE in reverse chronological order.  This would guarantee a valid set of assignments, assuming one exists.

 

This code does that by sorting the data by descending date (assigning missing date to 31dec2099 for sorting purposes.  As a result for each CASE, you can randomly select from CONTROLS that occur earlier in the sorted dataset:

 

data have;
  input id case DateInt :yymmdd. @@;
  format dateInt date11.;
datalines;
1  0 .  2 0 20120103  3 1 20120101
4  0 20120103  5 0 20120101  6 1 20120103
7  0 20120101  8 0 .  9 0 20120103
10 1 20120105 11 0 20120103 12 0 20120103
13 0 20120103 14 0 20120106 15 0 20120107
run;

data vneed/ view=vneed;
  set have;
  if dateint=. then dateint='31dec2099'd;
run;

proc sort data=vneed out=need;
  by descending dateint  case;  /*For tied dates make CONTROLs precede CASEs */
run;

data want (drop=_: i);  /* Two obs per CASE, with a new variable CONTROL_ID */
  array cntrl_ids {1164000} _temporary_;
  set need; 

  if case=0 then do;    /* Update list of eligible controls    */
    _n+1;               /* Count of eligible controls          */
    cntrl_ids{_n}=id;
  end;

  if case=1 then do i=1 to 2 while (_n>0);  /* Find up to 2 controls */
    _ptr=ceil(_n*ranuni(150195860));
    control_id=cntrl_ids{_ptr};
    output;

    /*Remove the assigned control_id from cntrl_ids array*/
    control_id=.;
    /* Slide the remaining id's to the left */
    do _p=_ptr to _n;
      cntrl_ids{_p}=cntrl_ids{_p+1};
    end;
    _n+(-1);           /* Decrement count of eligible controls*/
  end;

run;

Just make sure that the array CNTRL_IDS is large enough to hold the rolling count of control observations.

 

Edited Note: I added the "while (_n>0)" to the DO statement to ensure selection of controls only when there are qualifying controls available.

 

Second edited note.  I have changed the proc sort to force CONTROLs to precede CASEs when there are tied dates, thereby making them available for matching.  

 

Also I have added a new reply to perform this task separately by SEX, in a single use of the program.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PamG
Quartz | Level 8

Wow!  This is brilliant!  Thank you!!!!

 

 I am now told that in addition to the dates condition, the cases have to be selected based on sex as well i.e. controls have to be of the same sex as the cases.  How would the above code change? A sample data is given below.

data have;
  input id case DateInt :yymmdd. sex $ @@;
  format dateInt date11.;
datalines;
1  0 .  M 2 0 20120103 M 3 1 20120101 F
4  0 20120103 F 5 0 20120101 M 6 1 20120103 F
7  0 20120101 M 8 0 . F 9 0 20120103 M
10 1 20120105 M 11 0 20120103 F 12 0 20120103 F
13 0 20120103 M 14 0 20120106 M 15 0 20120107 M
run;
mkeintz
PROC Star

One could make some changes to this code to do what you are asking in one submission.

 

But I think it's simpler just to run the program twice, once for each sex.

 

Just change

data vneed/ view=vneed;
  set have;
  if dateint=. then dateint='31dec2099'd;
run;

to

 

data vneed/ view=vneed;
  set have;
  where sex='M';
  if dateint=. then dateint='31dec2099'd;
run;

for males, and make

data wantm;

Then use WHERE SEX='F';    and DATA WANTF;  for females.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

In your sample data, for ID 6  (case=1, sex='F', date='03jan2012), there is only ONE control with a later date (ID 8  case=0, sex='F', date=.).  Is that your expectation?   That is how the code currently works.

 

Or do you also consider matches valid if the control has the same date as the case?  In your sample data, that would add two other ID's (11 and 12) that would be valid matches for ID 6.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PamG
Quartz | Level 8

Thanks so much!

 

Yes.  The matches are considered valid if they have the same dates .  Tried editing the code but the matches are not correct.

mkeintz
PROC Star

If same day matches are acceptable, then change

 

 

proc sort data=vneed out=need;
  by descending dateint descending case;  /*For tied dates make CASEs precede CONTROLs*/
run;

to

 

proc sort data=vneed out=need;
  by sex descending dateint case;  /*For tied dates make CONTROLSs precede CASEs */
run;

So for tied dates, the CONTROLs are put in the array prior to the CASES - and therefore are available for random selection in matching cases.   I will also modify the answer you marked as a solution.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

This is a second reply to your request for including matching by sex.  It processes each sex in a single use of the program:

 

data have;
  input id case DateInt :yymmdd. sex $ @@;
  format dateInt date11.;
datalines;
1  0 .  M 2 0 20120103 M 3 1 20120101 F
4  0 20120103 F 5 0 20120101 M 6 1 20120103 F
7  0 20120101 M 8 0 . F 9 0 20120103 M
10 1 20120105 M 11 0 20120103 F 12 0 20120103 F
13 0 20120103 M 14 0 20120106 M 15 0 20120107 M
run;

data vneed/ view=vneed;
  set have;
  if dateint=. then dateint='31dec2099'd;
run;

proc sort data=vneed out=need;
  by sex descending dateint case;  /*For tied dates make CONTROLSs precede CASEs */
run;

data want (drop=_: i);  /* Two obs per CASE, with a new variable CONTROL_ID */
  array cntrl_ids {1164000} _temporary_;
  set need; 

  by sex;  /** This starts the new part */
  retain _n 0;

  if first.sex then do _n=_n to 0 by -1 ;
    if _n=0 then leave;
    cntrl_ids{_n}=.;
  end;     /** This ends the new part */

  if case=0 then do;    /* Update list of eligible controls    */
    _n+1;               /* Count of eligible controls          */
    cntrl_ids{_n}=id;
  end;

  if case=1 then do i=1 to 2 while (_n>0);  /* Find up to 2 controls */
    _ptr=ceil(_n*ranuni(150195860));
    control_id=cntrl_ids{_ptr};
    output;

    /*Remove the assigned control_id from cntrl_ids array*/
    control_id=.;
    /* Slide the remaining id's to the left */
    do _p=_ptr to _n;
      cntrl_ids{_p}=cntrl_ids{_p+1};
    end;
    _n+(-1);           /* Decrement count of eligible controls*/
  end;

run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PamG
Quartz | Level 8

@mkeintz This is awesome!  Thank you so, so much for taking the time and trouble for writing the code and helping me with me with this!!!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1048 views
  • 3 likes
  • 2 in conversation