Hello
I want to select randomly 3 rows from data set have with the following condition-
Not select a department (dept_no) more than once
For example:
1 10
2 10
3 20
Is not meeting the criteria because dept_no 10 appear more than once
Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;
Here is the way to do it with teradata code but I want to ask how to do it with sas code
select *
from have
qualify rank() over(partition by dept_no ordr by dept_no)=1
sample 3
Data have;
input emp_no dept_no;
cards;
1 10
2 10
3 20
4 30
5 20
6 40
7 10
8 90
9 20
10 30
;
Run;
proc sort data=have;by dept_no;run;
proc surveyselect data=have out=temp seed=123 sampsize=1 noprint outrandom;
strata dept_no;
run;
proc surveyselect data=temp out=want seed=123 sampsize=3 noprint outrandom;
run;
Some questions please-
1-What is the meaning of seed=123.Why did you choose value 123 and not other value? will result be different if you change seed value?
Should we always write seed=123 or it depends on out task?
2- You wrote sampzise=1. Is it equivalent to write n=1?
3-Yuo didnt write method=srs. IS it because the default is srs?
4-What is the meaning of outrandom? Why did you add this word to code?
Or equivalently,
proc surveyselect data=have out=temp seed=123 sampsize=3 noprint;
samplingunit dept_no;
proc surveyselect data=temp out=want seed=123 sampsize=1 noprint;
strata dept_no;
Note: May be a bit faster if you have more distinct depts than emp by dept.
Note: Sorting is implied in the first selection step.
@PGStats wrote:
proc surveyselect data=have out=temp seed=123 sampsize=3 noprint; samplingunit dept_no; proc surveyselect data=temp out=want seed=123 sampsize=1 noprint; strata dept_no;
I've never used proc surveyselect, and haven't thought about sampling for a long time, so I am going to venture beyond my usual competencies.
In the first proc surveyselect above, are all dept_no's equally likely to be selected, regardless of size? I suspect that is the case - I appended another 990 observations with "dept_no=30" (department 30 was not originally selected), used the same seed, and got the same result, i.e. department 30 did not get selected, even though it was 99% of my contrived sample.
In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments). Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.
Could this be the case?
@mkeintz wrote:
In the OP's request, I would have assumed that all obs should be equally likely, subject to the constraint (i.e. they should come from 3 distinct departments). Which in turn makes me think that the first surveyselect above should rendomly select departments weighted proportionate to their frequency.
This is a good point as it addresses the question what probability distribution should govern the "random" selection. There are 68 (out of comb(10,3)=120) different selections of three employee numbers (EMP_NO) satisfying the constraint. Virtually any generic probability distribution on this finite set could be regarded as a solution. While the OP's Teradata code determines a distribution, it is not clear whether this choice was based on statistical considerations or just code availability.
A different distribution would assign probability 1/68 to each eligible "triple" of EMP_NOs. PG's code, however, creates a distribution with equal probability (of 1/10) for each eligible triple of DEPT_NOs (and in particular equal probabilities [of 0.6] for each department to occur in the selection). There are many distributions with this characteristic. Yet another class of distributions would feature equal probabilities (of 0.3) for each EMP_NO. One of these could be obtained by adding the PPS option to the SAMPLINGUNIT statement in PG's code (thus implementing your idea of selecting "departments weighted proportionate to their frequency"). I think in general it depends on the input data whether equal probabilities for each EMP_NO can be achieved.
Here is a program that randomly selects 3 obs from among (up to) 10 departments, such that each obs comes from a different department. The probability of selecting from a given department is proportional to the department's size.
Editted Note: At each sample selection, the probability of selecting from a given department is proportional to the department's size (relative to the total of available departments). But because of the constraint, at each obs there is a reduced number of possible departments - thereby changing the probability of selection among remaining departments. So the probability of being among the 3 obs is NOT proportional to the size of the department vs all original departments.
After a given employee is selected, that department's probability of selection is set to zero.
Edited note: I've moved the assignment of a _seed value from inside to outside of the "do obs=1 to &sampsize" loop, as per @FreelanceReinh's comment.
And I've added a test dataset have, containing dept_no 1 through dept_no 10, with 4,8,12,...,40 employees respectively.
data have;
do dept_no=1 to 10;
do i=1 to 4*dept_no;
emp_id=dept_no+i/100;
format dept_no z2. emp_id z5.2;
drop i;
output;
end;
end;
run;
%let n_dep=10; /*Provide for up to this many departments, may be larger than actual department count*/
%let sampsize=3; /*N of employees to select, each from a different department */
data want (drop=_: d);
set have end=end_of_have;
length dept_size 8;
if _n_=1 then do;
declare hash h(); /*Department Size, plus One random obs per department */
h.definekey('dept_no');
h.definedata('dept_no','dept_size','_emp_no_samp','_rnum_samp');
h.definedone();
declare hiter hi ('h');
end;
_rnum_emp=ranuni(109581508); /*Randum num for this employee */
if h.find()^=0 then call missing(dept_size,_rnum_samp);
dept_size = sum(dept_size,1);
if _rnum_emp > _rnum_samp then do; /* Keep the employee with the highest random number */
_rnum_samp = _rnum_emp;
_emp_no_samp = emp_no;
end;
h.replace();
_n_avail+1;
if end_of_have;
array dept_share {&n_dep} _temporary_ (&n_dep*0) ;
array dept {&n_dep} _temporary_ ;
do obs=1 to &sampsize while (_n_avail>0);
do d=1 by 1 while (hi.next()=0);
dept_share{d}=dept_size/_n_avail;
dept{d}=dept_no;
end;
_seed=508333952;
call rantbl(_seed,of dept_share{*},d); /* Randomly choose department, proportional to size */
h.find(key:dept{d});
emp_no=_emp_no_samp;
output;
_n_avail = _n_avail - dept_size; /* Remove this department from furthr sampling */
dept_size=0;
h.replace();
end;
run;
Hi @mkeintz,
Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).
Interestingly, the algorithm (after the correction) does not lead to equal selection probabilities for each employee: If my calculation is correct, the ten probabilities are
19/72 ≈0.2639 for each emp_no in (1 2 7 3 5 9) 821/2520≈0.3258 for each emp_no in (4 10) 241/630 ≈0.3825 for each emp_no in (6 8)
So the PPS sampling method mentioned in my previous post seems to be using yet a different algorithm internally.
@FreelanceReinh wrote:
Hi @mkeintz,
Thank you. This is a nice algorithm. I assume that you intended to set the second random seed (_seed=508333952) before the do obs=1 ... loop. If it is set within that loop, the CALL RANTBL routine (unlike the RANTBL function or the RAND('TABLE',...) function) will use the same seed in all iterations. As a consequence, the selections will not be independent, which has a massive impact on the probability distributions (which I saw in two simulations of 5000 samples each, for the two placements of the _seed= assignment statement, with varying seeds of course).
@FreelanceReinh thanks for the note. Yes, I did intend to allow the _seed to be updated over DO loop iterations. I'll correct it in my original response.
In your code you first choose 3 departments and then one observarion of each of them.
What is the difference between yor code and following code?
You use "samplingunit" and here I use "cluster"
proc surveyselect data=Haveout=temp n=3 seed=123;
cluster dept_no;
Run;
proc surveyselect data=temp out=want n=1 seed=123;
strata dept_no;
Run;
"samplingunit" and "cluster"
are synonyms, see the documentation.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.