Art asks:
If the huge_set is in fact randomly ordered, and contains large numbers of candidates for each record to be matched from the sample set file, couldn't you accomplish the task by using code like:
data want;
  do i=1 to nobs;
    set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage)) nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set huge_set;
      if s_gender eq gender and s_ethnicity eq ethnicity and s_wage*0.9<=wage<=s_wage*1.2 then do;
        found=1;
        output;
      end;
    end;
  end;
run;
Art, CEO, AnalystFinder.com
My answer: "yes", although I don't think you need the outermost loop, since the DATA step's implicit loop already reads SAMPLE_SET one record at a time.
But if one is really worried about a perverse situation in which HUGE_SET is exhausted before all the SAMPLE_SET is matched (say a rare ETHNICITY/GENDER/WAGE occurs early in HUGE_SET, but late in SAMPLE_SET), one could change
SET HUGE_SET
to
SET HUGE_SET HUGE_SET HUGE_SET open=defer.
This just runs through HUGE_SET three times, if necessary. If the matching is satisfied during the first pass through huge_set (as per @art297's reasonable expectation), then this code will not add any time to completion, since it is the end of sample_set that terminates the data step. This modification would just be an insurance policy against an unexpected situation.
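Putting the two suggestions together, a minimal sketch might look like the following. It reuses the data set and variable names from Art's code (SAMPLE_SET, HUGE_SET, gender, ethnicity, wage), drops the outermost loop, and has not been run against real data, so treat it as an illustration rather than a tested solution.

```sas
data want;
  /* The implicit DATA step loop reads one SAMPLE_SET record per iteration;
     the step ends when SAMPLE_SET is exhausted. */
  set sample_set (rename=(gender=s_gender ethnicity=s_ethnicity wage=s_wage));
  found=0;
  do j=1 by 1 until (found);
    /* Listing HUGE_SET three times allows up to three sequential passes.
       OPEN=DEFER opens the later copies only if they are actually reached. */
    set huge_set huge_set huge_set open=defer;
    if s_gender eq gender and s_ethnicity eq ethnicity
       and s_wage*0.9<=wage<=s_wage*1.2 then do;
      found=1;
      output;
    end;
  end;
run;
```

If a sample record still has no match after the third pass, the step stops at that point, just as the single-pass version does at the end of the first pass.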
Not sure which post you're replying to. With the code I suggested, the inner loop goes through the file sequentially. Records are only read once from the huge_set, and only until all records from the sample_set are matched.
The same occurs with the hash solutions.
Art, CEO, AnalystFinder.com
Post the where condition before and after the change.
It's a bit like the simulated set below: this one gives 16 obs, and the sample set is read 17 times. With my own dataset, using only the front half of the condition, 0.8*sample_wage<=wage, nets the same number of matches. Adding the upper bound, 0.8*sample_wage<=wage<=10*sample_wage, nets 5 obs, but I can count from the previous matches that there are definitely more than 5 matches. Any ideas?
data Simulated;
  do i=1 to 100000000;
    x=ranuni(1);
    if x < .5 then
      gender='M';
    else
      gender='F';
    y = ranuni (1);
    if y<.5 then ethnicity = 'B';
    if y>.5 then ethnicity = 'W';
    if gender='M' and ethnicity = 'W' then
      wage = 50000+ rand("Normal")*30000;
    if gender='M' and ethnicity = 'B' then
      wage = 40000+ rand ("normal")*21000;
    if gender='F' and ethnicity = 'W' then
      wage=45000 + rand("Normal")*20000;
    if gender='F' and ethnicity = 'B' then
      wage=39000 + rand("Normal")*15000;
    output;
  end;
run;
proc surveyselect data=simulated
  method=srs n=500 out=SampleSRS (rename=(wage=sample_wage gender=sample_gender ethnicity=sample_ethnicity i=sample_i));
run;
data matched;
  do i=1 to nobs;
    set samplesrs nobs=nobs;
    found=0;
    do j=1 by 1 until (found);
      set simulated;
      if
        sample_wage*0.9<=wage<=sample_wage*10
      then do;
        found=1;
        output;
      end;
    end;
  end;
run;
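One thing that may be worth checking with the simulated data: wage is drawn from a normal distribution, so some values are negative. For a negative sample_wage, sample_wage*0.9 is greater than sample_wage*10, so the range condition can never be true. The inner loop then reads to the end of SIMULATED and the data step stops, leaving the remaining sample records unmatched. This is only one possible cause. A quick check, assuming the SampleSRS data set created above, is to count such records:

```sas
/* Count sample records whose wage is zero or negative. For these, the band
   sample_wage*0.9 <= wage <= sample_wage*10 is empty (or the single point 0). */
proc sql;
  select count(*) as n_nonpositive
  from SampleSRS
  where sample_wage <= 0;
quit;
```

If the count is above zero, the first such record in SampleSRS is where the matching step would be expected to stop.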