Help using Base SAS procedures

Select at least one item per category radomly, with a minimum % of the whole population

Accepted Solution Solved
Reply
Regular Contributor
Posts: 186
Accepted Solution

Select at least one item per category radomly, with a minimum % of the whole population

Hi,

I would like to know the most efficient SAS procedure to select at least one item per category randomly, with a minimum of 50% of the whole population.

To illustrate my problem. Lets say I have the following table :

CategoryInfo1
Floweraaa
Flowerbbb
Flowerccc
Fruitddd
Carddd
Careee
Carfff
Cariii
Toyfff
Toyggg

The result would be like the following table (with random values and at least 1 item per category) :

CategoryInfo1
Floweraaa
Fruitddd
Carfff
Toyggg
Flowerccc

Thank you for your help and time.


Accepted Solutions
Solution
‎09-24-2014 08:56 PM
Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

Surveyselect with samprate and strata will do exactly what you want:

data have;

input category $ info $;

datalines;

Flower aaa

Flower bbb

Flower ccc

Fruit ddd

Car ddd

Car eee

Car fff

Car iii

Toy fff

Toy ggg

;

proc surveyselect data=have out=want samprate=0.5 seed=114876001;

strata category notsorted;

run;

proc print data=want noobs; run;

PG

PG

View solution in original post


All Replies
Solution
‎09-24-2014 08:56 PM
Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

Surveyselect with samprate and strata will do exactly what you want:

data have;

input category $ info $;

datalines;

Flower aaa

Flower bbb

Flower ccc

Fruit ddd

Car ddd

Car eee

Car fff

Car iii

Toy fff

Toy ggg

;

proc surveyselect data=have out=want samprate=0.5 seed=114876001;

strata category notsorted;

run;

proc print data=want noobs; run;

PG

PG
Regular Contributor
Posts: 186

Re: Select at least one item per category radomly, with a minimum % of the whole population

Thank you very much to all of you for your help.

For the proc surveyselect method, when I do not sort my table by the specific variable and use 'notsorted' it doesn't return the appropriate result.

Therefore, when I use a proc sort on my variable, the I use proc surveyselect with mentioning 'notsorted', it works.

As for the modified proc sql method. When I use it in my data, it select one additional record that should not be there...

I would like to know if my manipulation with the proc surveyselect are Ok and what would be the correct method to do it with a proc SQL.

Thanks

Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

NOTSORTED means that the data from each category occurs together consecutively but not necessarily in sorted order (e.g. aaaa cccc dd bbb).

Post your code.

PG

PG
Regular Contributor
Posts: 186

Re: Select at least one item per category radomly, with a minimum % of the whole population

Thank you for the quick reply.

I won't post my actual data because it is sensitive, but this example triggers the same problem. If I understand properly, the number of strata should be 4, but it returns "Number of strata = 7".

data have;

input preinfo $ category $ info $;

datalines;

asdfasdfa Flower aaa

asfdasdfasdf Flower bbb

awdfadfasf Flower ccc

fasdfsdafds Fruit ddd

asdsdafsdf Car ddd

asdfasdfs Car eee

asdfasfsdff Car fff

asdfasdfds Car iii

asdfasdf Toy fff

asdfsdaerqwerwefad Toy ggg

asdfasdfa Flower aaa

asfdasdfasdf Flower bbb

awdfadfasf Flower ccc

fasdfsdafds Fruit ddd

asdsdafsdf Car ddd

asdfasdfs Car eee

asdfasfsdff Car fff

asdfasdfds Car iii

;

proc surveyselect data=have out=want samprate=0.1 seed=114876001;

strata category notsorted;

run;

Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

Yes, since your category data is not consecutive, you need to sort BY category before calling surveyselect.

proc sort data=have; by category; run;

proc surveyselect data=have out=want samprate=0.1 seed=114876001;

strata category;

run;

PG

PG
Regular Contributor
Posts: 186

Re: Select at least one item per category radomly, with a minimum % of the whole population

Excellent.

Thank you all for your help!

Trusted Advisor
Posts: 1,228

Re: Select at least one item per category radomly, with a minimum % of the whole population

proc sql;

select * , rand('UNIFORM') as r

from have

group by category

having r=min(r) or (count(category)>=1 and count(category)<=2);

quit;

Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

your solution will not work for a general population with larger categories. It will choose a single observation per category, which may be far less than 50%. A simple change to your SQL approach can however yield good samples (without the perfect assurance of a >50% sample size):

data have;

input category $ info $;

datalines;

Flower aaa

Flower bbb

Flower ccc

Fruit ddd

Fruit eee

Fruit fff

Fruit ggg

Fruit hhh

Fruit iii

Fruit jjj

;

/* Modified code */

proc sql;

select * , rand('UNIFORM') as r

from have

group by category

having r=min(r) or r<0.5;

quit;

PG

PG
Trusted Advisor
Posts: 1,228

Re: Select at least one item per category radomly, with a minimum % of the whole population

Hi PG,

Yes, that is right. Thanks for the suggestion.

Regards,

Naeem

Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

Along the same lines, if you want to guarantee a >50% sample, you can use proc rank to create fractional ranks (hint: fractional rank=1 when there is a single case in a category) :

data temp;

call streaminit(987256);

set have;

r = RAND("UNIFORM");

run;

proc rank data=temp out=want(where=(r>0.5)) fraction;

by category notsorted;

var r;

run;

proc print data=want(drop=r) noobs; run;

PG

PG
Regular Contributor
Posts: 168

Re: Select at least one item per category radomly, with a minimum % of the whole population

RAND function suppose to generate a random number from a distribution where as it producing all the records from the table (have) when I ran your code. Please advise. I assumed 'have' dataset as below.

data have;

input category $ info $;

datalines;

Flower aaa

Flower bbb

Flower ccc

Fruit ddd

Fruit eee

Fruit fff

Fruit ggg

Fruit hhh

Fruit iii

Fruit jjj

;


Also 'r' value differs when I ran it one variable 'category'.

Respected Advisor
Posts: 4,927

Re: Select at least one item per category radomly, with a minimum % of the whole population

I don't understand your question. When I run this (exactly this):

data have;

input category $ info $;

datalines;

Flower aaa

Flower bbb

Flower ccc

Fruit ddd

Fruit eee

Fruit fff

Fruit ggg

Fruit hhh

Fruit iii

Fruit jjj

;

data temp;

call streaminit(987256);

set have;

r = RAND("UNIFORM");

run;

proc rank data=temp out=want(where=(r>0.5)) fraction;

by category notsorted;

var r;

run;

proc print data=want(drop=r) noobs; run;

I get this output:

                                   The SAS System                                  1

                                                  11:15 Thursday, September 25, 2014

                                  category    info

                                   Flower     bbb

                                   Flower     ccc

                                   Fruit      eee

                                   Fruit      hhh

                                   Fruit      iii

                                   Fruit      jjj

The proc rank step replaces the random numbers (r in dataset temp) by their fractional ranks (r in dataset want) and keeps only the fractional ranks greater than 0.5. This way, a random selection of half the observations is kept in each category.

PG

PG
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 457 views
  • 1 like
  • 4 in conversation