Fluorite | Level 6

## Subsetting a dataset with specific frequencies

I am trying to do frequency matching for a case-control study. I have the frequency distributions of three variables from Dataset A (Gender: Male 30%, Female 70%; Age: >=65 40%, <65 60%; Region: West 10%, Northeast 20%, Southwest 40%, Midwest 25%, Southeast 5%). Please note that I do not have access to Dataset A itself, only the frequency distributions of those three variables (Gender, Age, Region).

I have Dataset B, and I need to create a subset of it that has the same frequency distribution for those three variables as Dataset A. That is, when I run PROC FREQ on age, gender, and region in the subset, it should give the same results as shown above for Dataset A.

Could you please suggest the best way to do that?

Thanks.

5 REPLIES
Super User

## Re: Subsetting a dataset with specific frequencies

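Use PROC SURVEYSELECT with stratified sampling and specify the stratum sample sizes from the percentages above.
You only have the marginal frequencies, not the interaction (joint) frequencies, so I'd assume each variable's distribution is the same within every level of the others (a dangerous assumption), calculate the count for each Gender x Age x Region combination from that, and pass those counts to the procedure.

See example 3 here; the only difference is that you have one more variable: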
https://stats.oarc.ucla.edu/sas/faq/how-can-i-take-a-stratified-random-sample-of-my-data/
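
A minimal sketch of that approach, not a tested solution. It assumes Dataset B is a SAS data set named `B` with character variables `gender`, `agegrp` and `region` coded exactly as in the lookup tables below, and a target subset size of 1000. All of these names, codings and the size are hypothetical, so adjust them to your data. The per-stratum count is the product of the three marginal proportions times the total, which is only correct if the variables are independent.

```sas
%let total = 1000;   /* hypothetical target size of the subset */

/* Marginal proportions reported for Dataset A (value codings are assumed) */
data p_gender;
  length gender $6;
  input gender $ p_g;
  datalines;
Male 0.30
Female 0.70
;
data p_age;
  length agegrp $4;
  input agegrp $ p_a;
  datalines;
>=65 0.40
<65 0.60
;
data p_region;
  length region $9;
  input region $ p_r;
  datalines;
West 0.10
Northeast 0.20
Southwest 0.40
Midwest 0.25
Southeast 0.05
;

/* Cross the margins: 2 x 2 x 5 = 20 strata.
   Independence assumption: stratum share = product of marginal shares. */
proc sql;
  create table alloc as
  select g.gender, a.agegrp, r.region,
         round(&total. * g.p_g * a.p_a * r.p_r) as _NSIZE_
  from p_gender as g, p_age as a, p_region as r;
quit;

/* Both data sets must be sorted by the STRATA variables */
proc sort data=alloc;              by gender agegrp region; run;
proc sort data=B out=B_sorted;     by gender agegrp region; run;

/* Simple random sample of _NSIZE_ rows within each stratum */
proc surveyselect data=B_sorted out=B_subset
     method=srs sampsize=alloc seed=27371;
  strata gender agegrp region;
run;
```

With these percentages and a total of 1000, every stratum count is a whole number (for example Male, >=65, West: 1000 x 0.30 x 0.40 x 0.10 = 12) and the counts sum to 1000. The strata variables in `alloc` need the same type and length as in `B`, and each stratum in `B` must contain at least as many rows as requested, otherwise the procedure cannot draw the sample.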
Fluorite | Level 6

## Re: Subsetting a dataset with specific frequencies

Thank you! I will try this out.
Super User

## Re: Subsetting a dataset with specific frequencies

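``````
%let sample_size=1000;

/* For each of the sample_size ordered blocks, draw a random permutation of 1..10.
   Each output data set therefore has 10 x sample_size rows. */
proc plan seed=27371;
  factors n=&sample_size. ordered sex=10 / noprint;
  output out=sex;

  factors n=&sample_size. ordered age=10 / noprint;
  output out=age;
run;
quit;

/* Levels 1-3 of 10 become Male (30%), the rest Female (70%) */
data sex;
  set sex;
  length char_sex $6;
  char_sex=ifc(sex in (1:3),'Male','Female');
  keep char_sex;
run;

/* Levels 1-4 of 10 become Age>=65 (40%), the rest Age <65 (60%) */
data age;
  set age;
  length char_age $7;
  char_age=ifc(age in (1:4),'Age>=65','Age <65');
  keep char_age;
run;

/* One-to-one merge (no BY statement) pairs the two shuffled columns row by row */
data want;
  merge sex age;
run;
``````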
Fluorite | Level 6

## Re: Subsetting a dataset with specific frequencies

Thank you! I will check this out.
Super User

## Re: Subsetting a dataset with specific frequencies

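``````
%let sample_size=1000;

/* Three separate random permutations of 1..sample_size, one per variable */
proc plan seed=27371;
  factors sex=&sample_size. / noprint;
  output out=sex;

  factors age=&sample_size. / noprint;
  output out=age;

  factors Region=&sample_size. / noprint;
  output out=Region;
quit;

/* One-to-one merge (no BY statement): row i gets the i-th value of each permutation */
data temp;
  merge sex age region;
run;

/* Bin each permutation into 100 groups (ranks 0-99);
   with 1000 rows each group holds 10 rows, i.e. 1% */
proc rank data=temp out=temp2 groups=100;
  var sex age region;
  ranks r_sex r_age r_region;
run;

/* Map rank ranges to categories so that each margin hits its target percentage */
data want;
  set temp2;
  length char_sex $6 char_age $7 char_Region $9;
  char_sex=ifc(r_sex in (0:29),'Male','Female');      /* 30% / 70% */
  char_age=ifc(r_age in (0:39),'Age>=65','Age <65');  /* 40% / 60% */

  select;
    when(r_region in (0:9))   char_Region='West';       /* 10% */
    when(r_region in (10:29)) char_Region='Northeast';  /* 20% */
    when(r_region in (30:69)) char_Region='Southwest';  /* 40% */
    when(r_region in (70:94)) char_Region='Midwest';    /* 25% */
    when(r_region in (95:99)) char_Region='Southeast';  /*  5% */
    otherwise;
  end;
  keep char_:;
run;

/* Check the resulting marginal frequencies */
proc freq data=want;
  tables char_:;
run;
``````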
