turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Select at least one item per category radomly, wit...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

09-24-2014 04:58 PM

Hi,

I would like to know the most efficient SAS procedure to select at least one item per category randomly, with a minimum of 50% of the whole population.

To illustrate my problem. Lets say I have the following table :

Category | Info1 |
---|---|

Flower | aaa |

Flower | bbb |

Flower | ccc |

Fruit | ddd |

Car | ddd |

Car | eee |

Car | fff |

Car | iii |

Toy | fff |

Toy | ggg |

The result would be like the following table (with random values and at least 1 item per category) :

Category | Info1 |
---|---|

Flower | aaa |

Fruit | ddd |

Car | fff |

Toy | ggg |

Flower | ccc |

Thank you for your help and time.

Accepted Solutions

Solution

09-24-2014
08:56 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to nicnad

09-24-2014 08:56 PM

**Surveyselect** with **samprate** and **strata** will do exactly what you want:

**data have;**

**input category $ info $;**

**datalines;**

**Flower aaa**

**Flower bbb**

**Flower ccc**

**Fruit ddd**

**Car ddd**

**Car eee**

**Car fff**

**Car iii**

**Toy fff**

**Toy ggg**

**;**

**proc surveyselect data=have out=want samprate=0.5 seed=114876001;**

**strata category notsorted;**

**run;**

**proc print data=want noobs; run;**

PG

PG

All Replies

Solution

09-24-2014
08:56 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to nicnad

09-24-2014 08:56 PM

**Surveyselect** with **samprate** and **strata** will do exactly what you want:

**data have;**

**input category $ info $;**

**datalines;**

**Flower aaa**

**Flower bbb**

**Flower ccc**

**Fruit ddd**

**Car ddd**

**Car eee**

**Car fff**

**Car iii**

**Toy fff**

**Toy ggg**

**;**

**proc surveyselect data=have out=want samprate=0.5 seed=114876001;**

**strata category notsorted;**

**run;**

**proc print data=want noobs; run;**

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

09-25-2014 01:19 PM

Thank you very much to all of you for your help.

For the proc surveyselect method, when I do not sort my table by the specific variable and use 'notsorted' it doesn't return the appropriate result.

Therefore, when I use a proc sort on my variable, the I use proc surveyselect with mentioning 'notsorted', it works.

As for the modified proc sql method. When I use it in my data, it select one additional record that should not be there...

I would like to know if my manipulation with the proc surveyselect are Ok and what would be the correct method to do it with a proc SQL.

Thanks

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to nicnad

09-25-2014 02:28 PM

NOTSORTED means that the data from each category occurs together consecutively but not necessarily in sorted order (e.g. aaaa cccc dd bbb).

Post your code.

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

09-25-2014 03:14 PM

Thank you for the quick reply.

I won't post my actual data because it is sensitive, but this example triggers the same problem. If I understand properly, the number of strata should be 4, but it returns "Number of strata = 7".

data have;

input preinfo $ category $ info $;

datalines;

asdfasdfa Flower aaa

asfdasdfasdf Flower bbb

awdfadfasf Flower ccc

fasdfsdafds Fruit ddd

asdsdafsdf Car ddd

asdfasdfs Car eee

asdfasfsdff Car fff

asdfasdfds Car iii

asdfasdf Toy fff

asdfsdaerqwerwefad Toy ggg

asdfasdfa Flower aaa

asfdasdfasdf Flower bbb

awdfadfasf Flower ccc

fasdfsdafds Fruit ddd

asdsdafsdf Car ddd

asdfasdfs Car eee

asdfasfsdff Car fff

asdfasdfds Car iii

;

proc surveyselect data=have out=want samprate=0.1 seed=114876001;

strata category notsorted;

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to nicnad

09-25-2014 03:39 PM

Yes, since your category data is not consecutive, you need to sort BY category before calling surveyselect.

**proc sort data=have; by category; run;**

**proc surveyselect data=have out=want samprate=0.1 seed=114876001;**

**strata category;**

**run;**

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

09-25-2014 03:40 PM

Excellent.

Thank you all for your help!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to nicnad

09-24-2014 09:05 PM

proc sql;

select * , rand('UNIFORM') as r

from have

group by category

having r=min(r) or (count(category)>=1 and count(category)<=2);

quit;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to stat_sas

09-24-2014 09:29 PM

your solution will not work for a general population with larger categories. It will choose a single observation per category, which may be far less than 50%. A simple change to your SQL approach can however yield good samples (without the perfect assurance of a >50% sample size):

**data have;**

**input category $ info $;**

**datalines;**

**Flower aaa**

**Flower bbb**

**Flower ccc**

**Fruit ddd**

**Fruit eee**

**Fruit fff**

**Fruit ggg**

**Fruit hhh**

**Fruit iii**

**Fruit jjj**

**;**

**/* Modified code */**

**proc sql;**

**select * , rand('UNIFORM') as r**

**from have**

**group by category**

**having r=min(r) or r<0.5;**

**quit;**

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

09-24-2014 10:22 PM

Hi PG,

Yes, that is right. Thanks for the suggestion.

Regards,

Naeem

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to stat_sas

09-24-2014 10:45 PM

Along the same lines, if you want to guarantee a >50% sample, you can use **proc rank** to create fractional ranks (hint: fractional rank=1 when there is a single case in a category) :

**data temp;**

**call streaminit(987256);**

**set have;**

**r = RAND("UNIFORM");**

**run;**

**proc rank data=temp out=want(where=(r>0.5)) fraction;**

**by category notsorted;**

**var r;**

**run;**

**proc print data=want(drop=r) noobs; run;**

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to PGStats

09-25-2014 06:21 AM

RAND function suppose to generate a random number from a distribution where as it producing all the records from the table (have) when I ran your code. Please advise. I assumed 'have' dataset as below.

**data have;**

**input category $ info $;**

**datalines;**

**Flower aaa**

**Flower bbb**

**Flower ccc**

**Fruit ddd**

**Fruit eee**

**Fruit fff**

**Fruit ggg**

**Fruit hhh**

**Fruit iii**

**Fruit jjj**

**;**

**Also 'r' value differs when I ran it one variable 'category'.**

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to RamKumar

09-25-2014 11:30 AM

I don't understand your question. When I run this (exactly this):

**data have;**

**input category $ info $;**

**datalines;**

**Flower aaa**

**Flower bbb**

**Flower ccc**

**Fruit ddd**

**Fruit eee**

**Fruit fff**

**Fruit ggg**

**Fruit hhh**

**Fruit iii**

**Fruit jjj**

**;**

**data temp;**

**call streaminit(987256);**

**set have;**

**r = RAND("UNIFORM");**

**run;**

**proc rank data=temp out=want(where=(r>0.5)) fraction;**

**by category notsorted;**

**var r;**

**run;**

**proc print data=want(drop=r) noobs; run;**

I get this output:

** The SAS System 1**

** 11:15 Thursday, September 25, 2014**

** category info**

** Flower bbb**

** Flower ccc**

** Fruit eee**

** Fruit hhh**

** Fruit iii**

** Fruit jjj**

The **proc rank** step replaces the random numbers (**r** in dataset **temp**) by their fractional ranks (**r** in dataset **want**) and keeps only the fractional ranks greater than 0.5. This way, a random selection of half the observations is kept in each category.

PG

PG