Solved: Re: Select Subset of a Group and Return All Rows

sasmaverick · Posted 01-17-2023 06:59 PM

I have a dataset with 10,000 rows and 1000 unique account ids. I need to select 300 random account ids and return all rows for those 300 ids (without specifying the ids in a where clause). How do I do it? Please see the below example:

Input dataset

account_id Amount

ABC 100

ABC 150

ABC 200

DEF 90

DEF 80

Output dataset

e.g., If I select 1 random account id from the above data, I expect the following result:

account_id Amount

DEF 90

DEF 80

@Reeza @Tom @Ksharp Thanks a bunch for the help!

Ksharp · Posted 01-18-2023 06:45 AM

data Sample;    
set Sashelp.Cars(keep=Type MPG_City);
run;


%let n=1 ;

proc surveyselect data=Sample out=want noprint seed=12345  sampsize=&n.;   
cluster type; 
run;

View solution in original post

Reeza · Posted 01-17-2023 07:01 PM

@sasmaverick wrote:

I have a dataset with 10,000 rows and 1000 unique account ids. I need to select 300 random account ids and return all rows for those 300 ids (without specifying the ids in a where clause). How do I do it? Please see the below example:

What do you mean without specifying the ids in a where clause? A join is ok?

sasmaverick · Posted 01-17-2023 07:24 PM

I mean is there a way to do it in a single step (like proc surveyselect). I can surely think of doing it in multiple steps.

1. Select unique IDS from dataset A into dataset B

2. Pick random sample (x%) from dataset B into dataset C

3. Inner join dataset C to dataset A

Patrick · Posted 01-17-2023 10:03 PM

I can think of one way to cover all of it in a single data step. But it would include a hash table for the unique id's (=still two passes through the data) and code would be rather complicated. What I have in mind would be an extension of method 3 from Sample 24722: Simple random sample without replacement.

But.... code would become so more complicated to understand as compared to doing this via multiple steps so really not worth it. You've got only 100000 source rows so also performance can't be a reason to go for a single data step.

Update: Just for fun below a single data step approach.

data have;
  do i=1 to 10000;
    id=rand('integer',1,1000);
    output;
  end;
run;

%let n_distinct_id=300;
data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have', multidata:'n');
      h1.defineKey('id');
      h1.defineDone();

      /* Initialize _K to the number of sample obs needed and _N to the */
      /*  total number of obs in the data set.                        */
      retain _k &n_distinct_id _n;
      _n=h1.num_items;
    end;

    set have; 

    if h1.check()=0 then
      do;

        /* To randomly select the first observation for the sample, use the */
        /* fact that each obs in the data set has an equal chance of being  */
        /* selected: _k/_n. If a random number between 0 and 1 is less than   */
        /* or equal to _k/_n, we select that the first obs for our sample     */
        /* and also adjust _k and the number of obs needed to complete the   */
        /* sample.                                                          */

         if ranuni(1230498) <= _k/_n then
          do;
            _rc=h1.remove();
            _k=_k-1;
          end;

        /* At every iteration, adjust _N, the number of obs left to */
        /* sample from.                                            */
        _n=_n-1;

      end;

    if h1.check() ne 0 then output;

run;

proc sql;
  select count(distinct id) as cnt_dist_id, count(*) as n_obs from want;
quit;

s_lassen · Posted 01-18-2023 02:59 AM

Here is a solution in a single data step. It assumes that the number of unique IDs is not known when you start, and that the input data is sorted by ID:

data want;
  retain _noofIDs 0 _IDsToExtract 300;
  /* count the IDs */
  if _N_=1 then do until(done);
    set have(keep=id) end=done;
    by id;
    _noofIDs+first.id;
    end;
  /* are we going to extract this ID? */
  _Extract=(_IDsToExtract)/(_noofIDs-_N_+1)>rand('UNIFORM');
  do until(last.id);
    set have;
    by id;
    if _Extract then output;
    end;
  _IDsToExtract+(-_Extract);
  if _IDsToExtract=0 then stop;
  drop _:;
run;

Ksharp · Posted 01-18-2023 06:45 AM

data Sample;    
set Sashelp.Cars(keep=Type MPG_City);
run;


%let n=1 ;

proc surveyselect data=Sample out=want noprint seed=12345  sampsize=&n.;   
cluster type; 
run;

sasmaverick · Posted 01-18-2023 08:56 PM

@Ksharp Thanks a lot. This is exactly what I was looking for!!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away