Hi, I have mailing file which I have profiled on age, gender, ethnicity, occupation.
From the main database I want to select as big a sample as possible that has the same characteristics as the mailing file based on age, gender, ethnicity and occupation.
I can do this in a very long winded way by matching % of the database that are same in the mailing file but it would be better to if I could find a sas procedure/methodology to do it as I need an element of statistical rigour.
I am using SAS EG v7.
Thanks
Andrew
I think you need two columns: nMail and nPop with the mail sample sizes and the population sizes respectively. From those, you could do:
data counts;
input Age $ Sex $ Ethnicity $ Occupation $ nMail nPop;
datalines;
20-30 Male Black Plumber 350 2000
30-40 Male White Painter 290 5000
40-50 Male White Lawyer 170 1200
20-30 Female White Teacher 210 5000
30-40 Male Asian Doctor 290 500
40-50 Male White Plumber 170 3500
;
proc sql;
create table sampsize as
select
*,
min(nPop/nMail) * nMail as SampleSize
from counts;
select * from sampsize;
quit;
proc surveyselect data=pop out=control sampsize=sampsize;
strata age sex ethnicity occupation;
run;
Note: The population and sample size datasets should be sorted the same way.
could use proc sort nodupkey to reduce the mailing file to all the possible combinations of age/sex/ethnicity/occupation. And then merge this onto the master file.
proc sort data=mailingfile out=template (keep=age sex ethnicity occupation) nodupkey;
by age sex ethnicity occupation;
run;
proc sort masterfile
by age sex ethnicity occupation;;
run;
data new;
merge masterfile template (in=a);
by age sex ethnicity occupation;
if a then output;
run;
it deponds a lot on exactly what you're doing ie how they should be matched. You could do something sophisticated ie proc psmatch https://support.sas.com/documentation/onlinedoc/stat/142/psmatch.pdf
Mmm! Not sure that will work, I probably haven't explained the guts of this too well.
If in the mail file I have the following profile of age sex ethnicity occupation...
Age | Sex | Ethnicity | Occupation | % of total |
20-30 | Male | Black | Plumber | 3.5% |
30-40 | Male | White | Painter | 2.9% |
40-50 | Male | White | Lawyer | 1.7% |
20-30 | Female | White | Teacher | 2.1% |
30-40 | Male | Asian | Doctor | 2.9% |
40-50 | Male | White | Plumber | 1.7% |
What I want in the control I select from the main database to have the same profile by age, sex, ethnicity and occupation in terms of the %.
Hope this makes more sense now, and thanks for your help so far.
Andrew
I think you need two columns: nMail and nPop with the mail sample sizes and the population sizes respectively. From those, you could do:
data counts;
input Age $ Sex $ Ethnicity $ Occupation $ nMail nPop;
datalines;
20-30 Male Black Plumber 350 2000
30-40 Male White Painter 290 5000
40-50 Male White Lawyer 170 1200
20-30 Female White Teacher 210 5000
30-40 Male Asian Doctor 290 500
40-50 Male White Plumber 170 3500
;
proc sql;
create table sampsize as
select
*,
min(nPop/nMail) * nMail as SampleSize
from counts;
select * from sampsize;
quit;
proc surveyselect data=pop out=control sampsize=sampsize;
strata age sex ethnicity occupation;
run;
Note: The population and sample size datasets should be sorted the same way.
That looks like it might be what I want. I'll have a bash at it in the morning.
Many thanks!
Andrew
And if you don't have the latest version of SAS to use PSMATCH, see the Mayo Clinic's macro for greedy match algorithms.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.