Re: selecting only one match per record

mkeintz · Posted 02-25-2017 04:57 PM

Art asks:

If the huge_set is in fact randomly ordered, and contains large numbers of candidates for each record to be matched from the sample set file, couldn't you accomplish the task by using code like:
data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set huge_set;
      if s_gender eq gender and
         s_ethnicity eq ethnicity and
         s_wage*0.9<=wage<=s_wage*1.2
         then do;
        found=1;
        output;
      end;
     end;
  end;
run;
Art, CEO, AnalystFinder.com

My answer: "yes". Although I don't think you need the outermost loop.

But if one is really worried about a perverse situation in which HUGE_SET is exhausted before all the SAMPLE_SET is matched (say a rare ETHNICITY/GENDER/WAGE occurs early in HUGE_SET, but late in SAMPLE_SET), one could change

SET HUGE_SET
to
SET HUGE_SET HUGE_SET HUGE_SET open=defer.

This just runs through HUGE_SET up to three times, if necessary. If the matching is satisfied during the first pass through huge_set (as per @art297's reasonable expectation), then this code will not add any time to completion (since it is the end of sample_set that terminates the data step). This modification would just be an insurance policy against an unexpected situation.
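For reference, here is how that change would look dropped into Art's step. This is an untested sketch: only the inner SET statement differs from the code quoted above, and it is worth confirming in your own release that OPEN=DEFER behaves as expected alongside the other SET statement.

```sas
data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      /* huge_set is listed three times, so it can be read end to end up to three times; */
      /* open=defer opens each later copy only when the previous one is exhausted        */
      set huge_set huge_set huge_set open=defer;
      if s_gender eq gender and
         s_ethnicity eq ethnicity and
         s_wage*0.9<=wage<=s_wage*1.2
         then do;
        found=1;
        output;
      end;
    end;
  end;
run;
```

Note that a record already used in an earlier pass can be picked again in a later pass, so this only protects against running out of candidates, not against duplicates.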

torestin · Posted 02-25-2017 06:10 PM

I thought about that. My question is: doesn't the inner loop start from the beginning of the huge set every time? So if the top two records in the sample set are both M W and the first record of the huge set matches, it will just return that record twice, instead of searching further down the huge set for another match for the second sample record.

I suppose I could delete the match from the huge set every time. That might work?

art297 · Posted 02-25-2017 06:27 PM

Not sure which post you're replying to. With the code I suggested, the inner loop goes through the file sequentially. Records are only read once from the huge_set, and only until all records from the sample_set are matched.

The same occurs with the hash solutions.
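A small illustration of the sequential reading, using made-up toy data (the data set contents and the id variable are hypothetical, not from the thread). It relies on the implicit DATA step loop instead of the outer DO loop, which should behave the same way here:

```sas
/* Hypothetical toy data */
data sample_set;
  input s_gender $ @@;
  datalines;
M M F
;

data huge_set;
  input id gender $ @@;
  datalines;
1 M 2 F 3 M 4 M 5 F 6 F
;

data want;
  set sample_set;        /* one sample record per DATA step iteration     */
  found=0;
  do until (found);
    set huge_set;        /* continues from the record after the last read */
    if s_gender eq gender then do;
      found=1;
      output;
    end;
  end;
run;

proc print data=want; run;
```

The two M sample records are matched to id 1 and id 3, and the F record to id 5: the second SET statement never goes back to the top of huge_set, so no record is returned twice.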

Art, CEO, AnalystFinder.com

torestin · Posted 03-01-2017 02:51 AM

Hi all. Thank you for all your help. I finally ran it today and something rather strange happened. For some reason it seems to work with only one condition. It works fine when I match only on wage lt 1.2*sample wage: that returns 6000 matches against 1m records. But adding any other condition returns 5 observations. It doesn't matter whether it's a gender match, an ethnicity match, or an upper limit on income. Any guess where I went wrong? I struggle to see why adding "and wage lt 10*sample wage", or an ethnicity match, returns only 5 records.

Kurt_Bremser · Posted 03-01-2017 02:57 AM

Post the where condition before and after the change.

torestin · Posted 03-01-2017 05:15 AM

It's a bit like the simulated set below: 16 obs for this one, and the sample set is read 17 times. My own dataset with only the lower bound, 0.8*sample_wage<=wage, nets the same number of matches; adding the upper bound, 0.8*sample_wage<=wage<=10*sample_wage, nets 5 obs, but I can count from the previous matches that there are definitely more than 5 matches... Any ideas?

data Simulated;
	do i=1 to 100000000;
		x=ranuni(1);
		if x < .5 then
			gender='M';
		else
			gender='F';
		y = ranuni (1);
		if y<.5 then ethnicity = 'B';
		if y>.5 then ethnicity = 'W';
		
		if gender='M' and ethnicity = 'W' then
			wage = 50000+ rand("Normal")*30000;
		if gender='M' and ethnicity = 'B' then
		   wage = 40000+ rand ("normal")*21000;
		if gender='F' and ethnicity = 'W' then
			wage=45000 + rand("Normal")*20000;
		if gender='F' and ethnicity = 'B' then
			wage=39000 + rand("Normal")*15000;		
		output;
	end;
run;

proc surveyselect data=simulated
      method=srs n=500 out=SampleSRS (rename=(wage=sample_wage gender=sample_gender ethnicity=sample_ethnicity i=sample_i));
   run;
   
 data matched;
  
  do i=1 to nobs;
    set samplesrs nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set simulated;
      if 
         sample_wage*0.9<=wage<=sample_wage*10
         then do;
        found=1;
        output;
      end;
     end;
  end;
run;
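
One thing that may be worth checking, as a guess and not a confirmed diagnosis: rand("Normal") can produce negative wages in the simulated data, and for a negative sample_wage the range sample_wage*0.9<=wage<=sample_wage*10 is empty, because the lower bound is then larger than the upper bound. A missing or zero sample_wage is nearly as bad. In that case the inner loop would read simulated to the end without a match and the data step would stop, keeping only the matches found so far. A lower bound on its own does not have this problem. The sketch below counts such sample records; n_bad and first_bad are names made up for illustration.

```sas
/* Count sample records whose two-sided wage range cannot be met by a positive wage */
data _null_;
  set samplesrs end=last;
  retain first_bad;
  /* missing values also satisfy <= 0 in SAS, so they are counted here too */
  if sample_wage <= 0 then do;
    n_bad + 1;                                  /* sum statement: starts at 0, retained */
    if first_bad = . then first_bad = _n_;      /* position of the first such record    */
  end;
  if last then put n_bad= first_bad=;
run;
```

If first_bad comes out close to the number of matches plus one, that would be consistent with the step stopping at that sample record.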