BookmarkSubscribeRSS Feed
SasStatistics
Pyrite | Level 9

Assume I in principe have observations in a SAS data set of the form (with over 10 million rows so this is just "a piece of the data"): 

Customer_ID  Country Male_Flag below_40_years over_40_years
1 USA 1 1 0
2 UK 0 0 1
3 GER 0 0 1
4 USA 1 1 0
5 UK 1 0 1

 

If the below_40_years_flag = 1 that means you are less than 40 years old. 

Now I would like to take a couple of random sample containing 100 customers each. 

For the first random sample: You should be from USA, be a Male (Male_Flag = 1) and be below 40 years old. 
For the second random sample: You should be from UK, a woman (Male_Flag = 0) and be above 40 years old. 


In my real example, I have around 5 criterias (Country, Age, Gender, ...) and want to take around 15 random samples. So the example above just illustrates the point and it could be solved "mechanically" but in my real example it would get extremly messy. 

Any ideas? 


 

2 REPLIES 2
FreelanceReinh
Jade | Level 19

Hello @SasStatistics,

 

Try this:

/* Create example data */

data have;
input Customer_ID Country $ Male_Flag below_40_years over_40_years;
cards;
1 USA 1 1 0
2 UK 0 0 1
3 GER 0 0 1
4 USA 1 1 0
5 UK 1 0 1
;

/* Create a dataset defining the subpopulations (s=1, 2, ..., 15) of interest */

data subpopdef;
input s Country $ Male_Flag below_40_years /* crit4 crit5 */;
cards;
1 USA 1 1
2 UK  0 0
;

/* Select all customers from the subpopulations of interest */

proc sql;
create table sel as
select d.s, a.*
from have a natural join subpopdef d
order by s, Customer_ID;
quit;

/* Draw a random sample of 100 customers (if possible) from each subpopulation */

proc surveyselect data=sel
method=srs n=100 selectall
seed=2718 out=want(drop=SelectionProb SamplingWeight);
strata s;
run;
ballardw
Super User

Similar though slightly different approach than @FreelanceReinh 

 

Just one example:

proc surveyselect data=have out=want sampsize=100
   selectall noprint outsize stats
   ;
   where country='USA' and Male_flag and below_40_years;
run;

You would very likely want to name the out= data set differently for each set.

What this does: Sampsize sets the number of records to select. Selectall option means if there are fewer then the requested size all the records are selected. Noprint suppresses several tables that appear in the result window. Run it without and see if you want/need that output. OUTSIZE provides additional variables in the output to indicate how many possible records were available and the number selected. The STATS provides probability of selection and weighting values. The counts and probability are based on the number of records that meet the WHERE restriction.

Since your flags are numeric 1/0 (at least by appearance) the 1 is considered true for selection and 0 is false. To select a 0 flag value you could use "and not Male_flag" for example. Another option if you feel more comfortable would be to use the Variable= 1 or 0 as preferred.

 

If you had a single AGE variable that had categories of interest and sorted the data by County Male_flag Age_category

 

You could use

proc surveyselect data=have out=want sampsize=100
   selectall noprint outsize stats
   ;
   strata country male_flag age_category;
run;

The order of variables on the STRATA statement would match the order on the Proc Sort By statement.

This will do samples of all the combinations of country gender and age, have a variable on each record with the number of records in the stratum (the combination of values), the number selected, and the probability of selection and and weight.

 

Actually if you sort by both of your age variables you could use them on the Strata statement as well. I just personally find that multiple dichotomous for age is awkward to work with. If you have other age variables you didn't show then the data isn't going to be nice for this single call to Surveyselect and details would be needed.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 423 views
  • 1 like
  • 3 in conversation