BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lvb1
Calcite | Level 5

I have a dataset with potential case-control pairs. I would like to select for every case four controls, without replacement. 

 

Have:

casecontrol
11
12
13
14
15
16
22
24
28
210
212
214
31
34
39
311
313
314
316
320
322


Want:

casecontrol
11
12
13
14
28
210
212
214
39
311
313
316

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
s_lassen
Meteorite | Level 14

Great to hear - please mark it as a solution, then 😊

View solution in original post

12 REPLIES 12
lvb1
Calcite | Level 5

Yes, Smantha, I did:

 

proc surveyselect data=cc out=RandOrderCC noprint
method=SRS /* sample w/o replacement seed=123 */
samprate=1 /* proportion of observations in sample */
OUTRANDOM; 
run;

 

That didn't give four controls for every case. I should have N cases * 4 controls. Depending on the seed, I get a different number of controls.

 

(Also: if I have for case 1 controls 1 2 3 4 and for case 2 I also have control 2, I would like to avoid that the selection procedure uses control 2 for case 2, since I only have 4 controls for case 1. I would like to have as many controls as possible.)

VENKATAMAHESH
Calcite | Level 5
data given;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
;
run;

proc sort nodupkey;
by control;
run;

proc sort;
by case control;
run;

%macro function(dsn,value);
data &dsn;
set given;
if case = &value then output &dsn;
run;

data &dsn;
set &dsn;
if _n_ < 5 then output;
run;
%mend function;

%function(one,1);
%function(two,2);
%function(three,3);

data final;
set one two three;
run;
lvb1
Calcite | Level 5

Dear Venkatamahesh,

 

your code generates indeed the want table I gave in my original request. However, by sorting on control, you also delete records which are still valid. For example if you use:

 

datalines;

1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
4 1
4 2
4 3
4 4
4 5
4 6

 

The fourth case is completely deleted, but case 4 should be matched to control 5 and 6.  

smantha
Lapis Lazuli | Level 10
data cc;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
4 1
4 2
4 3
4 4
4 5
4 6
;
run;

proc sort; by case control; 
run;

data _null_;
set cc nobs=n;
call symputx('nrow',n);
stop;
run;
options symbolgen;
data want;
set cc ;
array test{&nrow.};
retain test1-test&nrow. count;
by case;
if first.case then count=1;

if count <5 then do;
 put _all_;
 if control not in test then do;
   count=count+1;
   output;
   test[_n_]= control;
 end;
end;


drop test1-test&nrow. count;
run;
lvb1
Calcite | Level 5

Hi Smanta, the codes works for a small set of data, but I have 400.000 potential case control matches. The code ran for three days and the computer blocked. The log file was about 177GB. I will need something else I'm afraid. 

 

s_lassen
Meteorite | Level 14

You can do it in a single datastep, quite simple:

data have;
input case control;
datalines;
1 1
1 2
1 3
1 4
1 5
1 6
2 2
2 4
2 8
2 10
2 12
2 14
3 1
3 4
3 9
3 11
3 13
3 14
3 16
3 20
3 22
;
run;

data want;
  do _N_=1 by 1 until(last.case);
    set have;
    by case;
    end;
  out=4;
  do _N_=_N_ to 1 by -1;
    set have;
    if rand('uniform')<out/_N_ then do;
      output;
      out=out-1;
      end;
    end;
  drop out;  /* Yes, that's a valid SAS statement... */
run;
lvb1
Calcite | Level 5

Sorry, that didn't work. As said, once I have four cases, the code doesn't generate the correct output.

s_lassen
Meteorite | Level 14

Sorry, I misunderstood your requirements. I thought you needed a random selection of cases, but it seems it does not have to be. On the other hand you need CONTROL to be unique. Is that correct?

 

If I have understood that correctly, here is a way that may work:

data want;
  set have;
  stop;
run;

proc sql;
  create unique index control on want(control);
quit;

data want;
  if 0 then modify want;
  out=4;
  do until(last.case);
    set have;
    by case;
    if out=0 then continue;
    _iorc_=0;
    output;
    if _iorc_ then continue; /* unique key restraint violated */
    out=out-1;
    end;
  _error_=0;
run;
lvb1
Calcite | Level 5

This works, thanks.

s_lassen
Meteorite | Level 14

Great to hear - please mark it as a solution, then 😊

kjmathes03
Fluorite | Level 6

Hello,

 

I have the same problem and when I run the first step of your code -

 

data want;

set have;

stop;

run;

 

I don;t under stand this step as it just reduces my data to zero records?

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1450 views
  • 2 likes
  • 5 in conversation