I have a dataset with 10,000 rows and 1000 unique account ids. I need to select 300 random account ids and return all rows for those 300 ids (without specifying the ids in a where clause). How do I do it? Please see the below example:
Input dataset
account_id Amount
ABC 100
ABC 150
ABC 200
DEF 90
DEF 80
Output dataset
e.g., If I select 1 random account id from the above data, I expect the following result:
account_id Amount
DEF 90
DEF 80
data Sample;
set Sashelp.Cars(keep=Type MPG_City);
run;
%let n=1 ;
proc surveyselect data=Sample out=want noprint seed=12345 sampsize=&n.;
cluster type;
run;
@sasmaverick wrote:
I have a dataset with 10,000 rows and 1000 unique account ids. I need to select 300 random account ids and return all rows for those 300 ids (without specifying the ids in a where clause). How do I do it? Please see the below example:
What do you mean without specifying the ids in a where clause? A join is ok?
I mean is there a way to do it in a single step (like proc surveyselect). I can surely think of doing it in multiple steps.
1. Select unique IDS from dataset A into dataset B
2. Pick random sample (x%) from dataset B into dataset C
3. Inner join dataset C to dataset A
I can think of one way to cover all of it in a single data step. But it would include a hash table for the unique id's (=still two passes through the data) and code would be rather complicated. What I have in mind would be an extension of method 3 from Sample 24722: Simple random sample without replacement.
But.... code would become so more complicated to understand as compared to doing this via multiple steps so really not worth it. You've got only 100000 source rows so also performance can't be a reason to go for a single data step.
Update: Just for fun below a single data step approach.
data have;
do i=1 to 10000;
id=rand('integer',1,1000);
output;
end;
run;
%let n_distinct_id=300;
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'have', multidata:'n');
h1.defineKey('id');
h1.defineDone();
/* Initialize _K to the number of sample obs needed and _N to the */
/* total number of obs in the data set. */
retain _k &n_distinct_id _n;
_n=h1.num_items;
end;
set have;
if h1.check()=0 then
do;
/* To randomly select the first observation for the sample, use the */
/* fact that each obs in the data set has an equal chance of being */
/* selected: _k/_n. If a random number between 0 and 1 is less than */
/* or equal to _k/_n, we select that the first obs for our sample */
/* and also adjust _k and the number of obs needed to complete the */
/* sample. */
if ranuni(1230498) <= _k/_n then
do;
_rc=h1.remove();
_k=_k-1;
end;
/* At every iteration, adjust _N, the number of obs left to */
/* sample from. */
_n=_n-1;
end;
if h1.check() ne 0 then output;
run;
proc sql;
select count(distinct id) as cnt_dist_id, count(*) as n_obs from want;
quit;
Here is a solution in a single data step. It assumes that the number of unique IDs is not known when you start, and that the input data is sorted by ID:
data want;
retain _noofIDs 0 _IDsToExtract 300;
/* count the IDs */
if _N_=1 then do until(done);
set have(keep=id) end=done;
by id;
_noofIDs+first.id;
end;
/* are we going to extract this ID? */
_Extract=(_IDsToExtract)/(_noofIDs-_N_+1)>rand('UNIFORM');
do until(last.id);
set have;
by id;
if _Extract then output;
end;
_IDsToExtract+(-_Extract);
if _IDsToExtract=0 then stop;
drop _:;
run;
data Sample;
set Sashelp.Cars(keep=Type MPG_City);
run;
%let n=1 ;
proc surveyselect data=Sample out=want noprint seed=12345 sampsize=&n.;
cluster type;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.