Hi,
I would like to know the most efficient SAS procedure to select at least one item per category randomly, with a minimum of 50% of the whole population.
To illustrate my problem. Lets say I have the following table :
Category | Info1 |
---|---|
Flower | aaa |
Flower | bbb |
Flower | ccc |
Fruit | ddd |
Car | ddd |
Car | eee |
Car | fff |
Car | iii |
Toy | fff |
Toy | ggg |
The result would be like the following table (with random values and at least 1 item per category) :
Category | Info1 |
---|---|
Flower | aaa |
Fruit | ddd |
Car | fff |
Toy | ggg |
Flower | ccc |
Thank you for your help and time.
Surveyselect with samprate and strata will do exactly what you want:
data have;
input category $ info $;
datalines;
Flower aaa
Flower bbb
Flower ccc
Fruit ddd
Car ddd
Car eee
Car fff
Car iii
Toy fff
Toy ggg
;
proc surveyselect data=have out=want samprate=0.5 seed=114876001;
strata category notsorted;
run;
proc print data=want noobs; run;
PG
Surveyselect with samprate and strata will do exactly what you want:
data have;
input category $ info $;
datalines;
Flower aaa
Flower bbb
Flower ccc
Fruit ddd
Car ddd
Car eee
Car fff
Car iii
Toy fff
Toy ggg
;
proc surveyselect data=have out=want samprate=0.5 seed=114876001;
strata category notsorted;
run;
proc print data=want noobs; run;
PG
Thank you very much to all of you for your help.
For the proc surveyselect method, when I do not sort my table by the specific variable and use 'notsorted' it doesn't return the appropriate result.
Therefore, when I use a proc sort on my variable, the I use proc surveyselect with mentioning 'notsorted', it works.
As for the modified proc sql method. When I use it in my data, it select one additional record that should not be there...
I would like to know if my manipulation with the proc surveyselect are Ok and what would be the correct method to do it with a proc SQL.
Thanks
NOTSORTED means that the data from each category occurs together consecutively but not necessarily in sorted order (e.g. aaaa cccc dd bbb).
Post your code.
PG
Thank you for the quick reply.
I won't post my actual data because it is sensitive, but this example triggers the same problem. If I understand properly, the number of strata should be 4, but it returns "Number of strata = 7".
data have;
input preinfo $ category $ info $;
datalines;
asdfasdfa Flower aaa
asfdasdfasdf Flower bbb
awdfadfasf Flower ccc
fasdfsdafds Fruit ddd
asdsdafsdf Car ddd
asdfasdfs Car eee
asdfasfsdff Car fff
asdfasdfds Car iii
asdfasdf Toy fff
asdfsdaerqwerwefad Toy ggg
asdfasdfa Flower aaa
asfdasdfasdf Flower bbb
awdfadfasf Flower ccc
fasdfsdafds Fruit ddd
asdsdafsdf Car ddd
asdfasdfs Car eee
asdfasfsdff Car fff
asdfasdfds Car iii
;
proc surveyselect data=have out=want samprate=0.1 seed=114876001;
strata category notsorted;
run;
Yes, since your category data is not consecutive, you need to sort BY category before calling surveyselect.
proc sort data=have; by category; run;
proc surveyselect data=have out=want samprate=0.1 seed=114876001;
strata category;
run;
PG
Excellent.
Thank you all for your help!
proc sql;
select * , rand('UNIFORM') as r
from have
group by category
having r=min(r) or (count(category)>=1 and count(category)<=2);
quit;
your solution will not work for a general population with larger categories. It will choose a single observation per category, which may be far less than 50%. A simple change to your SQL approach can however yield good samples (without the perfect assurance of a >50% sample size):
data have;
input category $ info $;
datalines;
Flower aaa
Flower bbb
Flower ccc
Fruit ddd
Fruit eee
Fruit fff
Fruit ggg
Fruit hhh
Fruit iii
Fruit jjj
;
/* Modified code */
proc sql;
select * , rand('UNIFORM') as r
from have
group by category
having r=min(r) or r<0.5;
quit;
PG
Hi PG,
Yes, that is right. Thanks for the suggestion.
Regards,
Naeem
Along the same lines, if you want to guarantee a >50% sample, you can use proc rank to create fractional ranks (hint: fractional rank=1 when there is a single case in a category) :
data temp;
call streaminit(987256);
set have;
r = RAND("UNIFORM");
run;
proc rank data=temp out=want(where=(r>0.5)) fraction;
by category notsorted;
var r;
run;
proc print data=want(drop=r) noobs; run;
PG
RAND function suppose to generate a random number from a distribution where as it producing all the records from the table (have) when I ran your code. Please advise. I assumed 'have' dataset as below.
data have;
input category $ info $;
datalines;
Flower aaa
Flower bbb
Flower ccc
Fruit ddd
Fruit eee
Fruit fff
Fruit ggg
Fruit hhh
Fruit iii
Fruit jjj
;
Also 'r' value differs when I ran it one variable 'category'.
I don't understand your question. When I run this (exactly this):
data have;
input category $ info $;
datalines;
Flower aaa
Flower bbb
Flower ccc
Fruit ddd
Fruit eee
Fruit fff
Fruit ggg
Fruit hhh
Fruit iii
Fruit jjj
;
data temp;
call streaminit(987256);
set have;
r = RAND("UNIFORM");
run;
proc rank data=temp out=want(where=(r>0.5)) fraction;
by category notsorted;
var r;
run;
proc print data=want(drop=r) noobs; run;
I get this output:
The SAS System 1
11:15 Thursday, September 25, 2014
category info
Flower bbb
Flower ccc
Fruit eee
Fruit hhh
Fruit iii
Fruit jjj
The proc rank step replaces the random numbers (r in dataset temp) by their fractional ranks (r in dataset want) and keeps only the fractional ranks greater than 0.5. This way, a random selection of half the observations is kept in each category.
PG
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.