Hi there, I'm trying to find a more efficient way to produce a smallish matched control group from a very large data set. What I want is for SAS to read one record from the first data set, find one match in the second, and then move on to the next record. Say the two data sets look like
data set1;
informat gender $1. ethnicity $1.;
input serial gender ethnicity wage;
datalines;
232 M W 230
233 F B 240
;
run;
data set2;
informat gender $1. ethnicity $1.;
input id gender ethnicity wage;
datalines;
1 M W 254
2 M W 270
3 M B 255
4 M W 230
5 F B 211
6 F B 222
;
run;
So the output would be simply
1 M W 254
5 F B 240
I thought about matching them first and then trimming the result down (removing duplicate a.serial), but given the size of the data set that takes an awfully long time.
proc sql;
create table control as
select a.serial, b.*
from set1 a, set2 b
where a.gender = b.gender And a.ethnicity = b.ethnicity AND a.wage*0.9<=b.wage<=a.wage*1.2
order by id;
quit;
Any ideas? Thanks
You can try next code:
proc sql;
create table control as
select a.serial, b.*
from set1 a
left join set2 b
on a.gender = b.gender And
a.ethnicity = b.ethnicity AND
a.wage*0.9<=b.wage<=a.wage*1.2
order by id;
quit;
@torestin wrote:
Mmm, that still returns two matches for the first line (M W) in a:
M W 254
M W 270
F B 234
whereas I want it to do
M W 254
F B 234
Which of the suggested codes do you mean? My result is included in my post and has only two results.
What about
data set1;
informat gender $1. ethnicity $1.;
input serial gender ethnicity wage;
datalines;
232 M W 230
233 F B 240
;
run;
data set2;
informat gender $1. ethnicity $1.;
input id gender ethnicity wage;
cards;
1 M W 254
2 M W 270
3 M B 255
4 M W 230
5 F B 211
6 F B 222
;
run;
proc sort data=set1;
by gender ethnicity;
run;
proc sort data=set2;
by gender ethnicity;
run;
data want;
merge
set1 (in=a keep=gender ethnicity wage rename=(wage=wage_a))
set2 (in=b)
;
by gender ethnicity;
retain flag;
if first.ethnicity then flag = 1;
if flag
then do;
if 0.9 * wage_a <= wage <= 1.2 * wage_a
then do;
flag = 0;
output;
end;
end;
drop flag wage_a;
run;
proc print data=want noobs;
run;
The result is
gender  ethnicity  id  wage
F       B           6   222
M       W           1   254
(your expected result value of 240 for wage seems to be a typo, as that value only appears in set1)
Hey, thanks for that. Sorry, I should have been clearer. What I actually want is for the output to have the same number of observations as set1, with each record in set1 matched only once (even if it has multiple matches in set2 on the same criteria). So I changed that to first.id. The first.ethnicity approach didn't work if I have three lines in set1 like
232 M W 230
233 F B 240
234 M W 254
etc., as it seems to discard the last record and still returns 2 matches. I tried using first.id (from set2); that still doesn't work. Ideas?
@torestin Would code like below do the job for you?
data sample_set;
informat gender $1. ethnicity $1.;
input serial gender ethnicity wage;
datalines;
232 M W 230
233 F B 240
;
run;
data huge_set;
informat gender $1. ethnicity $1.;
input id gender ethnicity wage;
cards;
1 M W 254
2 M W 270
3 M B 255
4 M W 230
5 F B 211
6 F B 222
;
run;
data want(drop=sample_wage);
set huge_set;
if _n_=1 then
do;
if 0 then set sample_set(rename=(wage=sample_wage));
dcl hash h1(dataset:'sample_set(rename=(wage=sample_wage))');
h1.defineKey('gender','ethnicity');
h1.defineData('serial','sample_wage');
h1.defineDone();
end;
if h1.find()=0 then
do;
if sample_wage*0.9<=wage<=sample_wage*1.2 then
do;
output;
h1.remove();
if h1.num_items=0 then stop;
end;
end;
run;
Code amended as suggested by @art297
Folks:
I like @Patrick's solution, but it depends on the OP's small data set having only 1 obs per gender/ethnicity combination. Otherwise, the hash object, and its processing, will have to be modified.
Also, if the large data set is not randomly ordered with regard to (1) wage and (2) any other variables pertinent to the subsequent analysis, then none of the responses offered so far produce both (1) a single matching large obs per small obs and (2) unbiased sampling from the large data set. So my question to @torestin is whether randomized selection from the large data set is needed.
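If randomized selection does turn out to matter, one hedged sketch (using the huge_set name from the examples above; the seed 12345 and the _r variable name are arbitrary choices) is to shuffle the large table once up front, so that whatever first-match logic runs downstream picks an effectively random candidate:

```sas
/* Assign a uniform random key to each large record, then sort by it.
   Any "take the first qualifying match" approach applied afterwards
   then draws a random candidate rather than the first in file order. */
data huge_shuffled;
    set huge_set;
    _r = ranuni(12345);   /* arbitrary seed; 0 would use the clock */
run;

proc sort data=huge_shuffled out=huge_shuffled(drop=_r);
    by _r;
run;
```

This costs one extra sort of the large table, which may or may not be acceptable at the OP's data volumes.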
@Patrick: Shouldn't that be
if sample_wage*0.9<=wage<=sample_wage*1.2 then
rather than
if wage*0.9<=sample_wage<=wage*1.2 then
Art, CEO, AnalystFinder.com
Patrick's code can also handle multiple sample records per key by using the hash object's MULTIDATA option:
data sample_set;
informat gender $1. ethnicity $1.;
input serial gender ethnicity wage;
datalines;
232 M W 230
232 M W 270
233 F B 240
;
run;
data huge_set;
informat gender $1. ethnicity $1.;
input id gender ethnicity wage;
cards;
1 M W 254
2 M W 270
3 M B 255
4 M W 230
5 F B 211
6 F B 222
;
run;
data want(drop=sample_wage);
set huge_set;
if _n_=1 then
do;
if 0 then set sample_set(rename=(wage=sample_wage));
dcl hash h1(dataset:'sample_set(rename=(wage=sample_wage))',multidata:'y');
h1.defineKey('gender','ethnicity');
h1.defineData('serial','sample_wage');
h1.defineDone();
end;
rc=h1.find();
do while(rc=0);
if sample_wage*0.9<=wage<=sample_wage*1.2 then
do;
output;
h1.removedup();
if h1.num_items=0 then stop;
leave;
end;
rc=h1.find_next();
end;
run;
I believe your LEAVE statement only exits the inner then-DO group, but not the outer DO WHILE(rc=0) loop. If that's true, then it seems possible that a single large observation could be matched with multiple small observations having the same gender/ethnicity and similar wage values.
Mark,
No, that would not happen. If it did, the OP would have to have replicated records in the large table, and that is another story.
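For reference, LEAVE terminates the innermost enclosing iterative loop (DO WHILE, DO UNTIL, or an indexed DO); a plain DO-END group does not count as a loop for this purpose. A minimal standalone sketch illustrating this:

```sas
/* LEAVE sits inside a plain DO group, yet it still terminates the
   enclosing DO WHILE loop, so the loop stops at i=3. */
data _null_;
    i = 0;
    do while (i < 10);
        i + 1;
        if i = 3 then do;
            leave;   /* exits the DO WHILE, not just this DO group */
        end;
    end;
    put i=;   /* writes i=3 to the log */
run;
```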
If the huge_set is in fact randomly ordered, and contains large numbers of candidates for each record to be matched from the sample set, couldn't you accomplish the task with code like:
data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set huge_set;
      if s_gender eq gender and s_ethnicity eq ethnicity and
         s_wage*0.9<=wage<=s_wage*1.2 then do;
        found=1;
        output;
      end;
    end;
  end;
run;
Art, CEO, AnalystFinder.com