BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Hello

I want to select randomly 3 rows from data set have  with the following condition-

Not select a department (dept_no) more than once

For example:

1 10
2 10
3 20

Is not meeting the criteria because dept_no 10 appear more than once

 

Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;

Here is the way to do it with teradata code but I want to ask how to do it with sas code

select *
from have
qualify rank() over(partition by dept_no ordr by dept_no)=1
sample 3
13 REPLIES 13
Ksharp
Super User
Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;

proc sort data=have;by dept_no;run;
proc surveyselect data=have out=temp seed=123 sampsize=1 noprint outrandom;
strata dept_no;
run;
proc surveyselect data=temp out=want seed=123 sampsize=3 noprint outrandom;
run;
Ronein
Meteorite | Level 14

Some questions please-

1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?

Should we always write seed=123 or it depends on out task?

2- You wrote  sampzise=1. Is it equivalent to write n=1?

3-Yuo didnt write method=srs. IS it because the default is srs?

4-What is the meaning of outrandom? Why did you add this word to code?

 

Ksharp
Super User
You asked too many questions.
1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?
/*******
seed=123 is a random seed, you could select any number, like 321 or 1 ...
Yes. If you changed this random seed, you will got different result.
Set a random seed by seed= is to make sure the result could be reproduced.
********/

Should we always write seed=123 or it depends on out task?
/***
No. You could write any number you want.
******/
2- You wrote sampzise=1. Is it equivalent to write n=1?
/**
Yes.
**/

3-Yuo didnt write method=srs. IS it because the default is srs?
/**
Yes. You are correct.
**/

4-What is the meaning of outrandom? Why did you add this word to code?
/****
That means randomize the result.
If your result was 2,5,1,7,9 (the order is the same with dataset), using OUTRANDOM would randomly order it,like 9,1,5,7,2
****/
Ronein
Meteorite | Level 14
Is it essential to do proc sort?
Ksharp
Super User
Yes.
Because I used STRATA statement .
PGStats
Opal | Level 21

Or equivalently,

 

proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;

Note: May be a bit faster if you have more distinct depts than emp by dept.

Note: Sorting is implied in the first selection step.

PG
mkeintz
PROC Star

@PGStats wrote:

 

proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;

I've never used proc surveyselect, and haven't thought about sampling for a long time, so I am going to venture beyond my usual competencies.

 

In the first proc surveyselect above, are all dept_no's equally likely to be selected, regardless of size?    I suspect that is the case - I appended another 990 observations with "dept_no=30"  (department 30 was not originally selected), used the same seed, and got the same result, i.e. department 30 did not get selected, even though it was 99% of my contrived sample.

 

In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments).  Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.

 

Could this be the case?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

@mkeintz wrote:

In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments).  Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.


This is a good point as it addresses the question what probability distribution should govern the "random" selection. There are 68 (out of comb(10,3)=120) different selections of three employee numbers (EMP_NO) satisfying the constraint. Virtually any generic probability distribution on this finite set could be regarded as a solution. While the OP's Teradata code determines a distribution, it is not clear whether this choice was based on statistical considerations or just code availability.

 

A different distribution would assign probability 1/68 to each eligible "triple" of EMP_NOs. PG's code, however, creates a distribution with equal probability (of 1/10) for each eligible triple of DEPT_NOs (and in particular equal probabilities [of 0.6] for each department to occur in the selection). There are many distributions with this characteristic. Yet another class of distributions would feature equal probabilities (of 0.3) for each EMP_NO. One of these could be obtained by adding the PPS option to the SAMPLINGUNIT statement in PG's code (thus implementing your idea of selecting "departments weighted proportionate to their frequency"). I think in general it depends on the input data whether equal probabilities for each EMP_NO can be achieved.

mkeintz
PROC Star

Here is a program that randomly selects 3 obs from among (up to) 10 departments, such that each obs comes from a different department.  The probability of selecting from a given department is proportional to the department's size. 

 

Editted Note:  At each sample selection, the probability of selecting from a given department is proportional to the department's size (relative to the total of available departments).  But because of the constraint, at each obs there is a reduced number of possible departments - thereby changing the probability of selection among remaining departments.  So the probability of being among the 3 obs is NOT proportional to the size of the department vs all original departments.

 

After a given employee is selected, that department's probability of selection is set to zero.

 

Edited note:  I've moved the assignment of a _seed value from inside to outside of the "do obs=1 to &sampsize" loop, as per @FreelanceReinh's comment.

 

And I've added a test dataset have, containing dept_no 1 through dept_no 10, with 4,8,12,...,40 employees respectively.

 

data have;
  do dept_no=1 to 10;
    do i=1 to 4*dept_no;
      emp_id=dept_no+i/100;
      format dept_no z2.  emp_id z5.2;
      drop i;
      output;
    end;
  end;
run;

 

 

%let n_dep=10;    /*Provide for up to this many departments, may be larger than actual department count*/
%let sampsize=3;  /*N of employees to select, each from a different department */

data want (drop=_: d);
  set have end=end_of_have;

  length dept_size 8;
  if _n_=1 then do;
    declare hash h();    /*Department Size, plus One random obs per department */
      h.definekey('dept_no');
      h.definedata('dept_no','dept_size','_emp_no_samp','_rnum_samp');
      h.definedone();
    declare hiter hi ('h');
  end;

  _rnum_emp=ranuni(109581508);     /*Randum num for this employee */

  if h.find()^=0 then call missing(dept_size,_rnum_samp);
  dept_size = sum(dept_size,1);

  if _rnum_emp > _rnum_samp then do;  /* Keep the employee with the highest random number */
    _rnum_samp   = _rnum_emp;
    _emp_no_samp = emp_no;
  end;
  h.replace();

  _n_avail+1;

  if end_of_have;

  array dept_share {&n_dep} _temporary_ (&n_dep*0) ;
  array dept       {&n_dep} _temporary_  ;

  do obs=1 to &sampsize while (_n_avail>0);
    do d=1 by 1 while (hi.next()=0);
      dept_share{d}=dept_size/_n_avail;
      dept{d}=dept_no;
    end;

    _seed=508333952;
    call rantbl(_seed,of dept_share{*},d);  /* Randomly choose department, proportional to size */
    h.find(key:dept{d});
    emp_no=_emp_no_samp;
    output;

    _n_avail = _n_avail - dept_size;        /* Remove this department from furthr sampling */
    dept_size=0;
    h.replace();
  end;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

Hi @mkeintz,

 

Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).

 

Interestingly, the algorithm (after the correction) does not lead to equal selection probabilities for each employee: If my calculation is correct, the ten probabilities are

19/72   ≈0.2639 for each emp_no in (1 2 7 3 5 9)
821/2520≈0.3258 for each emp_no in (4 10)
241/630 ≈0.3825 for each emp_no in (6 8)

 

So the PPS sampling method mentioned in my previous post seems to be using yet a different algorithm internally.

mkeintz
PROC Star

@FreelanceReinh wrote:

Hi @mkeintz,

 

Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).

@FreelanceReinh thanks for the note.  Yes, I did intend to allow the _seed to be updated over DO loop iterations.  I'll correct it in my original response.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ronein
Meteorite | Level 14

In your code you first choose 3 departments and then one observarion of each of them.

What is the difference between yor code and following code?

You use  "samplingunit"  and here I use "cluster"

 

proc surveyselect data=Haveout=temp n=3 seed=123;
cluster dept_no;
Run;

proc surveyselect data=temp out=want n=1 seed=123;
strata dept_no;
Run;

 

 

PGStats
Opal | Level 21

"samplingunit"  and "cluster" are synonyms, see the documentation.

PG

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2620 views
  • 11 likes
  • 5 in conversation