BookmarkSubscribeRSS Feed
Ronein
Onyx | Level 15

Hello

I want to select randomly 3 rows from data set have  with the following condition-

Not select a department (dept_no) more than once

For example:

1 10
2 10
3 20

Is not meeting the criteria because dept_no 10 appear more than once

 

Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;

Here is the way to do it with teradata code but I want to ask how to do it with sas code

select *
from have
qualify rank() over(partition by dept_no ordr by dept_no)=1
sample 3
13 REPLIES 13
Ksharp
Super User
Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;

proc sort data=have;by dept_no;run;
proc surveyselect data=have out=temp seed=123 sampsize=1 noprint outrandom;
strata dept_no;
run;
proc surveyselect data=temp out=want seed=123 sampsize=3 noprint outrandom;
run;
Ronein
Onyx | Level 15

Some questions please-

1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?

Should we always write seed=123 or it depends on out task?

2- You wrote  sampzise=1. Is it equivalent to write n=1?

3-Yuo didnt write method=srs. IS it because the default is srs?

4-What is the meaning of outrandom? Why did you add this word to code?

 

Ksharp
Super User
You asked too many questions.
1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?
/*******
seed=123 is a random seed, you could select any number, like 321 or 1 ...
Yes. If you changed this random seed, you will got different result.
Set a random seed by seed= is to make sure the result could be reproduced.
********/

Should we always write seed=123 or it depends on out task?
/***
No. You could write any number you want.
******/
2- You wrote sampzise=1. Is it equivalent to write n=1?
/**
Yes.
**/

3-Yuo didnt write method=srs. IS it because the default is srs?
/**
Yes. You are correct.
**/

4-What is the meaning of outrandom? Why did you add this word to code?
/****
That means randomize the result.
If your result was 2,5,1,7,9 (the order is the same with dataset), using OUTRANDOM would randomly order it,like 9,1,5,7,2
****/
Ronein
Onyx | Level 15
Is it essential to do proc sort?
Ksharp
Super User
Yes.
Because I used STRATA statement .
PGStats
Opal | Level 21

Or equivalently,

 

proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;

Note: May be a bit faster if you have more distinct depts than emp by dept.

Note: Sorting is implied in the first selection step.

PG
mkeintz
PROC Star

@PGStats wrote:

 

proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;

I've never used proc surveyselect, and haven't thought about sampling for a long time, so I am going to venture beyond my usual competencies.

 

In the first proc surveyselect above, are all dept_no's equally likely to be selected, regardless of size?    I suspect that is the case - I appended another 990 observations with "dept_no=30"  (department 30 was not originally selected), used the same seed, and got the same result, i.e. department 30 did not get selected, even though it was 99% of my contrived sample.

 

In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments).  Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.

 

Could this be the case?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

@mkeintz wrote:

In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments).  Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.


This is a good point as it addresses the question what probability distribution should govern the "random" selection. There are 68 (out of comb(10,3)=120) different selections of three employee numbers (EMP_NO) satisfying the constraint. Virtually any generic probability distribution on this finite set could be regarded as a solution. While the OP's Teradata code determines a distribution, it is not clear whether this choice was based on statistical considerations or just code availability.

 

A different distribution would assign probability 1/68 to each eligible "triple" of EMP_NOs. PG's code, however, creates a distribution with equal probability (of 1/10) for each eligible triple of DEPT_NOs (and in particular equal probabilities [of 0.6] for each department to occur in the selection). There are many distributions with this characteristic. Yet another class of distributions would feature equal probabilities (of 0.3) for each EMP_NO. One of these could be obtained by adding the PPS option to the SAMPLINGUNIT statement in PG's code (thus implementing your idea of selecting "departments weighted proportionate to their frequency"). I think in general it depends on the input data whether equal probabilities for each EMP_NO can be achieved.

mkeintz
PROC Star

Here is a program that randomly selects 3 obs from among (up to) 10 departments, such that each obs comes from a different department.  The probability of selecting from a given department is proportional to the department's size. 

 

Editted Note:  At each sample selection, the probability of selecting from a given department is proportional to the department's size (relative to the total of available departments).  But because of the constraint, at each obs there is a reduced number of possible departments - thereby changing the probability of selection among remaining departments.  So the probability of being among the 3 obs is NOT proportional to the size of the department vs all original departments.

 

After a given employee is selected, that department's probability of selection is set to zero.

 

Edited note:  I've moved the assignment of a _seed value from inside to outside of the "do obs=1 to &sampsize" loop, as per @FreelanceReinh's comment.

 

And I've added a test dataset have, containing dept_no 1 through dept_no 10, with 4,8,12,...,40 employees respectively.

 

data have;
  do dept_no=1 to 10;
    do i=1 to 4*dept_no;
      emp_id=dept_no+i/100;
      format dept_no z2.  emp_id z5.2;
      drop i;
      output;
    end;
  end;
run;

 

 

%let n_dep=10;    /*Provide for up to this many departments, may be larger than actual department count*/
%let sampsize=3;  /*N of employees to select, each from a different department */

data want (drop=_: d);
  set have end=end_of_have;

  length dept_size 8;
  if _n_=1 then do;
    declare hash h();    /*Department Size, plus One random obs per department */
      h.definekey('dept_no');
      h.definedata('dept_no','dept_size','_emp_no_samp','_rnum_samp');
      h.definedone();
    declare hiter hi ('h');
  end;

  _rnum_emp=ranuni(109581508);     /*Randum num for this employee */

  if h.find()^=0 then call missing(dept_size,_rnum_samp);
  dept_size = sum(dept_size,1);

  if _rnum_emp > _rnum_samp then do;  /* Keep the employee with the highest random number */
    _rnum_samp   = _rnum_emp;
    _emp_no_samp = emp_no;
  end;
  h.replace();

  _n_avail+1;

  if end_of_have;

  array dept_share {&n_dep} _temporary_ (&n_dep*0) ;
  array dept       {&n_dep} _temporary_  ;

  do obs=1 to &sampsize while (_n_avail>0);
    do d=1 by 1 while (hi.next()=0);
      dept_share{d}=dept_size/_n_avail;
      dept{d}=dept_no;
    end;

    _seed=508333952;
    call rantbl(_seed,of dept_share{*},d);  /* Randomly choose department, proportional to size */
    h.find(key:dept{d});
    emp_no=_emp_no_samp;
    output;

    _n_avail = _n_avail - dept_size;        /* Remove this department from furthr sampling */
    dept_size=0;
    h.replace();
  end;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FreelanceReinh
Jade | Level 19

Hi @mkeintz,

 

Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).

 

Interestingly, the algorithm (after the correction) does not lead to equal selection probabilities for each employee: If my calculation is correct, the ten probabilities are

19/72   ≈0.2639 for each emp_no in (1 2 7 3 5 9)
821/2520≈0.3258 for each emp_no in (4 10)
241/630 ≈0.3825 for each emp_no in (6 8)

 

So the PPS sampling method mentioned in my previous post seems to be using yet a different algorithm internally.

mkeintz
PROC Star

@FreelanceReinh wrote:

Hi @mkeintz,

 

Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).

@FreelanceReinh thanks for the note.  Yes, I did intend to allow the _seed to be updated over DO loop iterations.  I'll correct it in my original response.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ronein
Onyx | Level 15

In your code you first choose 3 departments and then one observarion of each of them.

What is the difference between yor code and following code?

You use  "samplingunit"  and here I use "cluster"

 

proc surveyselect data=Haveout=temp n=3 seed=123;
cluster dept_no;
Run;

proc surveyselect data=temp out=want n=1 seed=123;
strata dept_no;
Run;

 

 

PGStats
Opal | Level 21

"samplingunit"  and "cluster" are synonyms, see the documentation.

PG

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 4799 views
  • 11 likes
  • 5 in conversation