BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
sasmaverick
Obsidian | Level 7

I have a dataset with 10,000 rows and 1000 unique account ids. I need to select 300 random account ids and return all rows for those 300 ids (without specifying the ids in a where clause). How do I do it? Please see the below example:

 

Input dataset

account_id Amount

ABC            100

ABC            150

ABC            200

DEF            90

DEF            80

 

Output dataset

e.g., If I select 1 random account id from the above data, I expect the following result:

account_id Amount

DEF               90

DEF               80

 

@Reeza @Tom @Ksharp Thanks a bunch for the help!

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data Sample;    
set Sashelp.Cars(keep=Type MPG_City);
run;


%let n=1 ;

proc surveyselect data=Sample out=want noprint seed=12345  sampsize=&n.;   
cluster type; 
run;

View solution in original post

6 REPLIES 6
Reeza
Super User

@sasmaverick wrote:

I have a dataset with 10,000 rows and 1000 unique account ids. I need to select 300 random account ids and return all rows for those 300 ids (without specifying the ids in a where clause). How do I do it? Please see the below example:

What do you mean without specifying the ids in a where clause? A join is ok?

sasmaverick
Obsidian | Level 7

I mean is there a way to do it in a single step (like proc surveyselect). I can surely think of doing it in multiple steps.

1. Select unique IDS from dataset A into dataset B

2. Pick random sample (x%) from dataset B into dataset C

3. Inner join dataset C to dataset A

Patrick
Opal | Level 21

I can think of one way to cover all of it in a single data step. But it would include a hash table for the unique id's (=still two passes through the data) and code would be rather complicated. What I have in mind would be an extension of method 3 from Sample 24722: Simple random sample without replacement.

 

But.... code would become so more complicated to understand as compared to doing this via multiple steps so really not worth it. You've got only 100000 source rows so also performance can't be a reason to go for a single data step.

 

Update: Just for fun below a single data step approach.

data have;
  do i=1 to 10000;
    id=rand('integer',1,1000);
    output;
  end;
run;

%let n_distinct_id=300;
data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have', multidata:'n');
      h1.defineKey('id');
      h1.defineDone();

      /* Initialize _K to the number of sample obs needed and _N to the */
      /*  total number of obs in the data set.                        */
      retain _k &n_distinct_id _n;
      _n=h1.num_items;
    end;

    set have; 

    if h1.check()=0 then
      do;

        /* To randomly select the first observation for the sample, use the */
        /* fact that each obs in the data set has an equal chance of being  */
        /* selected: _k/_n. If a random number between 0 and 1 is less than   */
        /* or equal to _k/_n, we select that the first obs for our sample     */
        /* and also adjust _k and the number of obs needed to complete the   */
        /* sample.                                                          */

         if ranuni(1230498) <= _k/_n then
          do;
            _rc=h1.remove();
            _k=_k-1;
          end;

        /* At every iteration, adjust _N, the number of obs left to */
        /* sample from.                                            */
        _n=_n-1;

      end;

    if h1.check() ne 0 then output;

run;

proc sql;
  select count(distinct id) as cnt_dist_id, count(*) as n_obs from want;
quit;

Patrick_0-1674016063351.png

 

s_lassen
Meteorite | Level 14

Here is a solution in a single data step. It assumes that the number of unique IDs is not known when you start, and that the input data is sorted by ID:

data want;
  retain _noofIDs 0 _IDsToExtract 300;
  /* count the IDs */
  if _N_=1 then do until(done);
    set have(keep=id) end=done;
    by id;
    _noofIDs+first.id;
    end;
  /* are we going to extract this ID? */
  _Extract=(_IDsToExtract)/(_noofIDs-_N_+1)>rand('UNIFORM');
  do until(last.id);
    set have;
    by id;
    if _Extract then output;
    end;
  _IDsToExtract+(-_Extract);
  if _IDsToExtract=0 then stop;
  drop _:;
run;
Ksharp
Super User
data Sample;    
set Sashelp.Cars(keep=Type MPG_City);
run;


%let n=1 ;

proc surveyselect data=Sample out=want noprint seed=12345  sampsize=&n.;   
cluster type; 
run;
sasmaverick
Obsidian | Level 7
@Ksharp Thanks a lot. This is exactly what I was looking for!!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 774 views
  • 3 likes
  • 5 in conversation