mkeintz
PROC Star

Art asks:

 


If the huge_set is in fact randomly ordered, and contains large numbers of candidates for each record to be matched from the sample set file, couldn't you accomplish the task by using code like:

 

data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set huge_set;
      if s_gender eq gender and
         s_ethnicity eq ethnicity and
         s_wage*0.9<=wage<=s_wage*1.2
      then do;
        found=1;
        output;
      end;
    end;
  end;
run;

Art, CEO, AnalystFinder.com

 


 

My answer: "yes".  Although I don't think you need the outermost loop.
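
A minimal sketch of what dropping the outermost loop could look like, assuming the same data set and variable names as in Art's code. This is one reading of the remark, not code posted in the thread: the implicit DATA step loop reads one sample_set record per iteration, and the inner SET keeps its place in huge_set between iterations.

```sas
data want;
  /* The implicit DATA step loop reads one sample record per iteration */
  set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage));

  /* Keep reading huge_set from wherever the previous iteration stopped */
  do until (found);
    set huge_set;
    found = (s_gender eq gender and
             s_ethnicity eq ethnicity and
             s_wage*0.9 <= wage <= s_wage*1.2);
  end;

  /* The implicit OUTPUT at the end of the step writes the matched pair */
  drop found;
run;
```

The step stops as soon as either SET statement tries to read past the end of its data set, so no record is written for a sample that never finds a match.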

 

But if one is really worried about a perverse situation in which HUGE_SET is exhausted before all the SAMPLE_SET is matched (say a rare ETHNICITY/GENDER/WAGE occurs early in HUGE_SET, but late in SAMPLE_SET), one could change

   SET HUGE_SET;
to
   SET HUGE_SET HUGE_SET HUGE_SET open=defer;

 

This just runs through HUGE_SET up to three times, if necessary. If the matching is satisfied during the first pass through huge_set (as per @art297's reasonable expectation), then this change will not add any time to completion, since it is the end of sample_set that terminates the data step. The modification is just an insurance policy against an unexpected situation.
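
As a sketch, here is Art's step with only the inner SET statement changed, assuming the same data set and variable names as above:

```sas
data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      /* With OPEN=DEFER the next copy of huge_set is opened only   */
      /* after the previous one has been read to the end            */
      set huge_set huge_set huge_set open=defer;
      if s_gender eq gender and
         s_ethnicity eq ethnicity and
         s_wage*0.9<=wage<=s_wage*1.2
      then do;
        found=1;
        output;
      end;
    end;
  end;
run;
```

One caveat: nothing here tracks which huge_set records have already been used, so a record matched during one pass could be selected again by a different sample record on a later pass.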

 

 

 

torestin
Calcite | Level 5
I thought about that. My question was: doesn't the inner loop start from the beginning of the huge set every time? So if the top two records in the sample set are both M W and the first record of the huge set matches, it will just return that record twice, instead of searching further down the huge set for another match for the second sample record.

I suppose I could delete the match from the huge set every time. That might work?


art297
Opal | Level 21

Not sure which post you're replying to. With the code I suggested, the inner loop goes through the file sequentially. Records are only read once from the huge_set, and only until all records from the sample_set are matched.
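
A small self-contained check of that behaviour, using made-up toy data (the `id` variable and all of the values are purely illustrative): both sample records are M W, and the first huge_set record matches.

```sas
/* Toy sample: two identical M W records */
data sample_set;
  input gender $ ethnicity $ wage;
  datalines;
M W 50000
M W 50000
;

/* Toy huge_set: id is a hypothetical record number added for tracing */
data huge_set;
  input id gender $ ethnicity $ wage;
  datalines;
1 M W 50000
2 F B 40000
3 M W 52000
4 M W 49000
;

data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set huge_set;  /* resumes after the last record read, never restarts */
      if s_gender eq gender and
         s_ethnicity eq ethnicity and
         s_wage*0.9<=wage<=s_wage*1.2
      then do;
        found=1;
        output;
      end;
    end;
  end;
run;

proc print data=want; run;
```

Because the SET statement keeps its position in huge_set, the first sample record should pair with id 1 and the second with id 3, rather than id 1 being returned twice.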

 

The same occurs with the hash solutions.

 

Art, CEO, AnalystFinder.com

torestin
Calcite | Level 5
Hi all. Thank you for all your help. I finally ran it today and something rather strange happened. For some reason it seems to work with only one condition. It works fine when I match on wage lt 1.2*sample wage alone: that returns 6000 matches against 1m records. But adding any other condition returns 5 observations. It doesn't matter whether it's a gender match, an ethnicity match, or an upper limit on income. Any guess where I went wrong? I struggle to see why adding "and wage lt 10*sample wage", or an ethnicity match, returns only 5 records.


torestin
Calcite | Level 5

It's a bit like the simulated set below: 16 obs for this one, and the sample set is read 17 times. With my own dataset, using only the front half of the condition, 0.8*sample_wage<=wage, nets the same number of matches as before. Using 0.8*sample_wage<=wage<=10*sample_wage nets 5 obs, but I can tell from the previous matches that there are definitely more than 5 matches. Any ideas?

 

 

data Simulated;
	do i=1 to 100000000;
		x=ranuni(1);
		if x < .5 then
			gender='M';
		else
			gender='F';
		y = ranuni (1);
		if y<.5 then ethnicity = 'B';
		if y>.5 then ethnicity = 'W';
		
		if gender='M' and ethnicity = 'W' then
			wage = 50000+ rand("Normal")*30000;
		if gender='M' and ethnicity = 'B' then
		   wage = 40000+ rand ("normal")*21000;
		if gender='F' and ethnicity = 'W' then
			wage=45000 + rand("Normal")*20000;
		if gender='F' and ethnicity = 'B' then
			wage=39000 + rand("Normal")*15000;		
		output;
	end;
run;

proc surveyselect data=simulated
      method=srs n=500 out=SampleSRS (rename=(wage=sample_wage gender=sample_gender ethnicity=sample_ethnicity i=sample_i));
   run;
   
 data matched;
  
  do i=1 to nobs;
    set samplesrs nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set simulated;
      if 
         sample_wage*0.9<=wage<=sample_wage*10
         then do;
        found=1;
        output;
      end;
     end;
  end;
run;
