Re: Randomly select 3 rows with condition

Ronein · Posted 02-17-2024 03:11 AM

Hello

I want to select randomly 3 rows from data set have with the following condition-

Not select a department (dept_no) more than once

For example:

1 10
2 10
3 20

Is not meeting the criteria because dept_no 10 appear more than once

Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;

Here is the way to do it with teradata code but I want to ask how to do it with sas code

select *
from have
qualify rank() over(partition by dept_no ordr by dept_no)=1
sample 3

Ksharp · Posted 02-17-2024 06:12 AM

Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;

proc sort data=have;by dept_no;run;
proc surveyselect data=have out=temp seed=123 sampsize=1 noprint outrandom;
strata dept_no;
run;
proc surveyselect data=temp out=want seed=123 sampsize=3 noprint outrandom;
run;

Ronein · Posted 02-24-2024 01:38 AM

Some questions please-

1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?

Should we always write seed=123 or it depends on out task?

2- You wrote sampzise=1. Is it equivalent to write n=1?

3-Yuo didnt write method=srs. IS it because the default is srs?

4-What is the meaning of outrandom? Why did you add this word to code?

Ksharp · Posted 02-25-2024 03:28 AM

You asked too many questions.
1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?
/*******
seed=123 is a random seed, you could select any number, like 321 or 1 ...
Yes. If you changed this random seed, you will got different result.
Set a random seed by seed= is to make sure the result could be reproduced.
********/

Should we always write seed=123 or it depends on out task?
/***
No. You could write any number you want.
******/
2- You wrote sampzise=1. Is it equivalent to write n=1?
/**
Yes.
**/

3-Yuo didnt write method=srs. IS it because the default is srs?
/**
Yes. You are correct.
**/

4-What is the meaning of outrandom? Why did you add this word to code?
/****
That means randomize the result.
If your result was 2,5,1,7,9 (the order is the same with dataset), using OUTRANDOM would randomly order it,like 9,1,5,7,2
****/

Ronein · Posted 02-25-2024 09:30 PM

Is it essential to do proc sort?

Ksharp · Posted 02-25-2024 11:39 PM

Yes.
Because I used STRATA statement .

PGStats · Posted 02-17-2024 05:26 PM

Or equivalently,

proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;

Note: May be a bit faster if you have more distinct depts than emp by dept.

Note: Sorting is implied in the first selection step.

PG

mkeintz · Posted 02-17-2024 09:47 PM

@PGStats wrote:

proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;

I've never used proc surveyselect, and haven't thought about sampling for a long time, so I am going to venture beyond my usual competencies.

In the first proc surveyselect above, are all dept_no's equally likely to be selected, regardless of size? I suspect that is the case - I appended another 990 observations with "dept_no=30" (department 30 was not originally selected), used the same seed, and got the same result, i.e. department 30 did not get selected, even though it was 99% of my contrived sample.

In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments). Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.

Could this be the case?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 02-21-2024 05:52 PM

@mkeintz wrote:

In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments). Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.

This is a good point as it addresses the question what probability distribution should govern the "random" selection. There are 68 (out of comb(10,3)=120) different selections of three employee numbers (EMP_NO) satisfying the constraint. Virtually any generic probability distribution on this finite set could be regarded as a solution. While the OP's Teradata code determines a distribution, it is not clear whether this choice was based on statistical considerations or just code availability.

A different distribution would assign probability 1/68 to each eligible "triple" of EMP_NOs. PG's code, however, creates a distribution with equal probability (of 1/10) for each eligible triple of DEPT_NOs (and in particular equal probabilities [of 0.6] for each department to occur in the selection). There are many distributions with this characteristic. Yet another class of distributions would feature equal probabilities (of 0.3) for each EMP_NO. One of these could be obtained by adding the PPS option to the SAMPLINGUNIT statement in PG's code (thus implementing your idea of selecting "departments weighted proportionate to their frequency"). I think in general it depends on the input data whether equal probabilities for each EMP_NO can be achieved.

mkeintz · Posted 02-24-2024 08:35 PM

Here is a program that randomly selects 3 obs from among (up to) 10 departments, such that each obs comes from a different department. ~~The probability of selecting from a given department is proportional to the department's size.~~

Editted Note: At each sample selection, the probability of selecting from a given department is proportional to the department's size (relative to the total of available departments). But because of the constraint, at each obs there is a reduced number of possible departments - thereby changing the probability of selection among remaining departments. So the probability of being among the 3 obs is NOT proportional to the size of the department vs all original departments.

After a given employee is selected, that department's probability of selection is set to zero.

Edited note: I've moved the assignment of a _seed value from inside to outside of the "do obs=1 to &sampsize" loop, as per @FreelanceReinh's comment.

And I've added a test dataset have, containing dept_no 1 through dept_no 10, with 4,8,12,...,40 employees respectively.

data have;
  do dept_no=1 to 10;
    do i=1 to 4*dept_no;
      emp_id=dept_no+i/100;
      format dept_no z2.  emp_id z5.2;
      drop i;
      output;
    end;
  end;
run;

%let n_dep=10;    /*Provide for up to this many departments, may be larger than actual department count*/
%let sampsize=3;  /*N of employees to select, each from a different department */

data want (drop=_: d);
  set have end=end_of_have;

  length dept_size 8;
  if _n_=1 then do;
    declare hash h();    /*Department Size, plus One random obs per department */
      h.definekey('dept_no');
      h.definedata('dept_no','dept_size','_emp_no_samp','_rnum_samp');
      h.definedone();
    declare hiter hi ('h');
  end;

  _rnum_emp=ranuni(109581508);     /*Randum num for this employee */

  if h.find()^=0 then call missing(dept_size,_rnum_samp);
  dept_size = sum(dept_size,1);

  if _rnum_emp > _rnum_samp then do;  /* Keep the employee with the highest random number */
    _rnum_samp   = _rnum_emp;
    _emp_no_samp = emp_no;
  end;
  h.replace();

  _n_avail+1;

  if end_of_have;

  array dept_share {&n_dep} _temporary_ (&n_dep*0) ;
  array dept       {&n_dep} _temporary_  ;

  do obs=1 to &sampsize while (_n_avail>0);
    do d=1 by 1 while (hi.next()=0);
      dept_share{d}=dept_size/_n_avail;
      dept{d}=dept_no;
    end;

    _seed=508333952;
    call rantbl(_seed,of dept_share{*},d);  /* Randomly choose department, proportional to size */
    h.find(key:dept{d});
    emp_no=_emp_no_samp;
    output;

    _n_avail = _n_avail - dept_size;        /* Remove this department from furthr sampling */
    dept_size=0;
    h.replace();
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

FreelanceReinh · Posted 02-25-2024 03:50 PM

Hi @mkeintz,

Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).

Interestingly, the algorithm (after the correction) does not lead to equal selection probabilities for each employee: If my calculation is correct, the ten probabilities are

19/72   ≈0.2639 for each emp_no in (1 2 7 3 5 9)
821/2520≈0.3258 for each emp_no in (4 10)
241/630 ≈0.3825 for each emp_no in (6 8)

So the PPS sampling method mentioned in my previous post seems to be using yet a different algorithm internally.

mkeintz · Posted 02-25-2024 06:38 PM

@FreelanceReinh wrote:

Hi @mkeintz,

Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).

@FreelanceReinh thanks for the note. Yes, I did intend to allow the _seed to be updated over DO loop iterations. I'll correct it in my original response.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ronein · Posted 02-24-2024 04:13 AM

In your code you first choose 3 departments and then one observarion of each of them.

What is the difference between yor code and following code?

You use "samplingunit" and here I use "cluster"

proc surveyselect data=Haveout=temp n=3 seed=123;
cluster dept_no;
Run;

proc surveyselect data=temp out=want n=1 seed=123;
strata dept_no;
Run;

PGStats · Posted 02-24-2024 03:46 PM

"samplingunit" and "cluster" are synonyms, see the documentation.

PG

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away