Hello
I want to select randomly 3 rows from data set have with the following condition-
Not select a department (dept_no) more than once
For example:
1 10
2 10
3 20
Is not meeting the criteria because dept_no 10 appear more than once
Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;
Here is the way to do it with teradata code but I want to ask how to do it with sas code
select *
from have
qualify rank() over(partition by dept_no ordr by dept_no)=1
sample 3
Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;
proc sort data=have;by dept_no;run;
proc surveyselect data=have out=temp seed=123 sampsize=1 noprint outrandom;
strata dept_no;
run;
proc surveyselect data=temp out=want seed=123 sampsize=3 noprint outrandom;
run;
Some questions please-
1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?
Should we always write seed=123 or it depends on out task?
2- You wrote sampzise=1. Is it equivalent to write n=1?
3-Yuo didnt write method=srs. IS it because the default is srs?
4-What is the meaning of outrandom? Why did you add this word to code?
Or equivalently,
proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;
Note: May be a bit faster if you have more distinct depts than emp by dept.
Note: Sorting is implied in the first selection step.
@PGStats wrote:
proc surveyselect data=have out=temp seed=123 sampsize=3 noprint; samplingunit dept_no; proc surveyselect data=temp out=want seed=123 sampsize=1 noprint; strata dept_no;
I've never used proc surveyselect, and haven't thought about sampling for a long time, so I am going to venture beyond my usual competencies.
In the first proc surveyselect above, are all dept_no's equally likely to be selected, regardless of size? I suspect that is the case - I appended another 990 observations with "dept_no=30" (department 30 was not originally selected), used the same seed, and got the same result, i.e. department 30 did not get selected, even though it was 99% of my contrived sample.
In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments). Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.
Could this be the case?
@mkeintz wrote:
In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments). Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.
This is a good point as it addresses the question what probability distribution should govern the "random" selection. There are 68 (out of comb(10,3)=120) different selections of three employee numbers (EMP_NO) satisfying the constraint. Virtually any generic probability distribution on this finite set could be regarded as a solution. While the OP's Teradata code determines a distribution, it is not clear whether this choice was based on statistical considerations or just code availability.
A different distribution would assign probability 1/68 to each eligible "triple" of EMP_NOs. PG's code, however, creates a distribution with equal probability (of 1/10) for each eligible triple of DEPT_NOs (and in particular equal probabilities [of 0.6] for each department to occur in the selection). There are many distributions with this characteristic. Yet another class of distributions would feature equal probabilities (of 0.3) for each EMP_NO. One of these could be obtained by adding the PPS option to the SAMPLINGUNIT statement in PG's code (thus implementing your idea of selecting "departments weighted proportionate to their frequency"). I think in general it depends on the input data whether equal probabilities for each EMP_NO can be achieved.
Here is a program that randomly selects 3 obs from among (up to) 10 departments, such that each obs comes from a different department. The probability of selecting from a given department is proportional to the department's size.
Editted Note: At each sample selection, the probability of selecting from a given department is proportional to the department's size (relative to the total of available departments). But because of the constraint, at each obs there is a reduced number of possible departments - thereby changing the probability of selection among remaining departments. So the probability of being among the 3 obs is NOT proportional to the size of the department vs all original departments.
After a given employee is selected, that department's probability of selection is set to zero.
Edited note: I've moved the assignment of a _seed value from inside to outside of the "do obs=1 to &sampsize" loop, as per @FreelanceReinh's comment.
And I've added a test dataset have, containing dept_no 1 through dept_no 10, with 4,8,12,...,40 employees respectively.
data have;
do dept_no=1 to 10;
do i=1 to 4*dept_no;
emp_id=dept_no+i/100;
format dept_no z2. emp_id z5.2;
drop i;
output;
end;
end;
run;
%let n_dep=10; /*Provide for up to this many departments, may be larger than actual department count*/
%let sampsize=3; /*N of employees to select, each from a different department */
data want (drop=_: d);
set have end=end_of_have;
length dept_size 8;
if _n_=1 then do;
declare hash h(); /*Department Size, plus One random obs per department */
h.definekey('dept_no');
h.definedata('dept_no','dept_size','_emp_no_samp','_rnum_samp');
h.definedone();
declare hiter hi ('h');
end;
_rnum_emp=ranuni(109581508); /*Randum num for this employee */
if h.find()^=0 then call missing(dept_size,_rnum_samp);
dept_size = sum(dept_size,1);
if _rnum_emp > _rnum_samp then do; /* Keep the employee with the highest random number */
_rnum_samp = _rnum_emp;
_emp_no_samp = emp_no;
end;
h.replace();
_n_avail+1;
if end_of_have;
array dept_share {&n_dep} _temporary_ (&n_dep*0) ;
array dept {&n_dep} _temporary_ ;
do obs=1 to &sampsize while (_n_avail>0);
do d=1 by 1 while (hi.next()=0);
dept_share{d}=dept_size/_n_avail;
dept{d}=dept_no;
end;
_seed=508333952;
call rantbl(_seed,of dept_share{*},d); /* Randomly choose department, proportional to size */
h.find(key:dept{d});
emp_no=_emp_no_samp;
output;
_n_avail = _n_avail - dept_size; /* Remove this department from furthr sampling */
dept_size=0;
h.replace();
end;
run;
Hi @mkeintz,
Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).
Interestingly, the algorithm (after the correction) does not lead to equal selection probabilities for each employee: If my calculation is correct, the ten probabilities are
19/72 ≈0.2639 for each emp_no in (1 2 7 3 5 9) 821/2520≈0.3258 for each emp_no in (4 10) 241/630 ≈0.3825 for each emp_no in (6 8)
So the PPS sampling method mentioned in my previous post seems to be using yet a different algorithm internally.
@FreelanceReinh wrote:
Hi @mkeintz,
Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).
@FreelanceReinh thanks for the note. Yes, I did intend to allow the _seed to be updated over DO loop iterations. I'll correct it in my original response.
In your code you first choose 3 departments and then one observarion of each of them.
What is the difference between yor code and following code?
You use "samplingunit" and here I use "cluster"
proc surveyselect data=Haveout=temp n=3 seed=123;
cluster dept_no;
Run;
proc surveyselect data=temp out=want n=1 seed=123;
strata dept_no;
Run;
"samplingunit" and "cluster"
are synonyms, see the documentation.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.